
JOURNAL OF PERSONALITY ASSESSMENT, 80(1), 99-103
Copyright © 2003, Lawrence Erlbaum Associates, Inc.

STATISTICAL DEVELOPMENTS AND APPLICATIONS

Starting at the Beginning: An Introduction
to Coefficient Alpha and Internal Consistency

David L. Streiner
Baycrest Centre for Geriatric Care
Department of Psychiatry
University of Toronto

Cronbach's α is the most widely used index of the reliability of a scale. However, its use and interpretation can be subject to a number of errors. This article discusses the historical development of α from other indexes of internal consistency (split-half reliability and Kuder-Richardson 20) and discusses four myths associated with α: (a) that it is a fixed property of the scale, (b) that it measures only the internal consistency of the scale, (c) that higher values are always preferred over lower ones, and (d) that it is restricted to the range of 0 to 1. It provides some recommendations for acceptable values of α in different situations.

Perhaps the most widely used measure of the reliability of a scale is Cronbach's α (1951). One reason for this is obvious; it is the only reliability index that does not require two administrations of the scale, or two or more raters, and so can be determined with much less effort than test-retest or interrater reliability. Unfortunately, the ubiquity of its use is matched only by the degree of misunderstanding regarding what α does and does not measure. This article is intended to be a basic primer about α. It will approach these issues from a conceptual and a statistical perspective and illustrate both the strengths and weaknesses of the index.

I begin by discussing what is meant by reliability in general and how α and other indexes of "internal consistency" determine this. In classical test theory, a person's total score (i.e., the score a person receives on a test or scale, which is sometimes referred to as the observed score) is composed of two parts: the true score plus some error associated with the measurement. That is:

\text{Score}_{\text{Total}} = \text{Score}_{\text{True}} + \text{Score}_{\text{Error}} \qquad (1)

The result of this is that a person's total score will vary around the true score to some degree. One way of thinking about reliability, then, is that it is the ratio of the variance of the true scores (σ²_True) to the total scores (σ²_Total):

\text{Reliability} = \frac{\sigma^2_{\text{True}}}{\sigma^2_{\text{Total}}} \qquad (2)

However, at any given time, a person's true score will be the same from one testing to another, so that an individual's σ²_True will always be zero. Thus, Equation 2 pertains only to a group of people who differ with respect to the characteristic being measured.
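To make Equations 1 and 2 concrete, here is a minimal simulation sketch (not from the article; the population values are illustrative assumptions) in which true scores and random error are generated directly, so the variance ratio can be checked against its theoretical value:

    import numpy as np

    rng = np.random.default_rng(0)
    n_people = 10_000

    # True scores differ across people; measurement error is random noise.
    true = rng.normal(loc=50, scale=10, size=n_people)    # sigma^2_True  = 100
    error = rng.normal(loc=0, scale=5, size=n_people)     # sigma^2_Error = 25
    observed = true + error                               # Equation 1

    # Equation 2: reliability as the ratio of true to total score variance.
    print(round(true.var() / observed.var(), 2))          # about 100/125 = 0.80

Outside a simulation, of course, the true scores are unobservable, which is exactly why indexes such as α are needed.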
Before continuing with issues of measuring reliability, however, it would be worthwhile to digress for a moment and expand on what is meant by the "true" score. In many respects, it's a poor choice of words and a potentially misleading term (although one we're stuck with), because "true," in …


… 80. However, if the test is given in English, which the person learned only 2 years ago, the "true" score of 80 is likely not an accurate reflection of her intelligence. Similarly, a person undergoing an assessment for purposes of child access and custody may deliberately understate the degree to which he uses corporal punishment. Repeat evaluations may yield similar scores on the test and the mean will be a good approximation of the true score (because of low random error), but the defensive response style, which produces a bias, means that the true score will not be an accurate one. Finally, a depressed person may have a T score around 75 on numerous administrations of a personality test. However, if she responds well to therapy, then both her depression and her true score should move closer to the average range.

The different effects of random and systematic error are captured in Judd, Smith, and Kidder's (1991) expansion of Equation 1:

\text{Score}_{\text{Total}} = \text{Score}_{\text{CI}} + \text{Score}_{\text{SE}} + \text{Score}_{\text{RE}} \qquad (3)

where CI is the construct of interest, SE the systematic error, and RE is the random error. In this formulation, Score_CI + Score_SE is the same as Score_True in Equation 1. Two advantages of expressing the true score as the sum of the construct and the systematic error is that it illustrates the relationship between reliability and validity, and shows how the different types of error affect each of them:

\text{Reliability} = \frac{\sigma^2_{\text{CI}} + \sigma^2_{\text{SE}}}{\sigma^2_{\text{CI}} + \sigma^2_{\text{SE}} + \sigma^2_{\text{RE}}} \qquad (4)

whereas

\text{Validity} = \frac{\sigma^2_{\text{CI}}}{\sigma^2_{\text{CI}} + \sigma^2_{\text{SE}} + \sigma^2_{\text{RE}}} \qquad (5)

These last two equations show that random error affects both reliability and validity (because the larger it is, the smaller the ratio between the numerators and denominators), whereas systematic error affects only validity.
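A brief simulation sketch (hypothetical values, mirroring Equation 3) makes the asymmetry visible: a stable person-specific bias adds reproducible variance, so the reliability ratio of Equation 4 stays high while the validity ratio of Equation 5 drops.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 10_000

    construct = rng.normal(0, 1.0, n)   # Score_CI, sigma^2_CI = 1.00
    sys_err = rng.normal(0, 1.0, n)     # Score_SE, sigma^2_SE = 1.00 (stable bias)
    rand_err = rng.normal(0, 0.5, n)    # Score_RE, sigma^2_RE = 0.25
    total = construct + sys_err + rand_err                       # Equation 3

    denom = total.var()
    print(round((construct.var() + sys_err.var()) / denom, 2))  # Eq. 4: ~0.89
    print(round(construct.var() / denom, 2))                    # Eq. 5: ~0.44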
Returning to reliability, it can be defined on a conceptual level as the degree to which "measurements of individuals on … In addition to these two sources of error (time and observer), we can add a third source: that associated with the homogeneity of the items that comprise the scale.¹

If a scale taps a single construct or domain, such as anxiety or mathematical ability, then to ensure content validity, we want the scale to (a) consist of items that sample the entire domain and (b) not include items that tap other abilities or constructs. For example, a test of mathematics should sample everything a child is expected to know at a given grade level, but not consist of long, written passages that may reflect the child's reading ability as much as his or her math skills. Similarly, an anxiety inventory should tap all of the components of anxiety (e.g., cognitive, behavioral, affective) but not include items from other realms, such as ego strength or social desirability. Because classical test theory assumes that the items on a scale are a random sample from the universe of all possible items drawn from the domain, then they should be correlated highly with one another. However, this may not always be true. For example, Person A may endorse two items on an anxiety inventory (e.g., "I feel tense most of the time"; "I am afraid to leave the house on my own"), whereas Person B may say True to the first but No to the second. This difference in the pattern of responding would affect the correlations among the items, and hence the internal consistency of the scale. A high degree of internal consistency is desirable, because it "speaks directly to the ability of the clinician or the researcher to interpret the composite score as a reflection of the test's items" (Henson, 2001, p. 178).

The original method of measuring internal consistency is called "split half" reliability. As the name implies, it is calculated by splitting the test in half (e.g., all of the odd numbered items in one half and the even numbered ones in the other) and correlating the two parts. If the scale as a whole is internally consistent, then any two randomly derived halves should contain similar items and thus yield comparable scores. A modification of this was proposed by Rulon (1939), which relies on calculating the variance of the difference score between the two half-tests (σ²_d) and the variance of the total score (σ²_Total) across people:

\text{Reliability} = 1 - \frac{\sigma^2_d}{\sigma^2_{\text{Total}}} \qquad (6)

The rightmost part of the equation (σ²_d / σ²_Total) is the proportion of error variance in the scores, which can be thought of …


The first difficulty with the split-half approach is that the reliability of a scale is proportional to its length. Splitting a scale in half reduces its length by 50%, and hence underestimates the reliability. This difficulty can be solved relatively easily, though, by using the Spearman-Brown "prophecy" formula that compensates for the reduction in length. The second issue is that there are many ways to split a scale in half: in fact, a 12-item scale can be divided 462 ways and each one will result in a somewhat different estimate of the reliability.² This problem was dealt with for the case of dichotomous items by Kuder and Richardson (1937). Their famous equation, which is referred to as KR-20 because it was the 20th one in their article, reads:

\text{KR-20} = \frac{k}{k-1}\left[1 - \frac{\sum p_k q_k}{\sigma^2_{\text{Total}}}\right] \qquad (7)

where k is the number of items, p_k the proportion of people who answered positively to item k, q_k is the proportion of people who answered negatively (i.e., q_k = 1 - p_k), and σ²_Total is the variance of the total scores. KR-20 can be thought of as the mean of all possible split-half reliabilities.

The limitation of handling only dichotomous items was solved by Cronbach (1951), in his generalization of KR-20 into coefficient α, which can be written as:

\alpha = \frac{k}{k-1}\left[1 - \frac{\sum \sigma^2_k}{\sigma^2_{\text{Total}}}\right] \qquad (8)

where Σσ²_k is the sum of the variances of all of the items. Coefficient α has the same property as KR-20, in terms of being the average of all possible splits.³
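Both formulas translate directly into code. The sketch below (hypothetical data; the item matrix is fabricated) computes each on the same set of dichotomous responses; because α generalizes KR-20, the two agree for 0/1 items.

    import numpy as np

    def cronbach_alpha(items):
        """Coefficient alpha (Equation 8) for a people-by-items matrix."""
        k = items.shape[1]
        return (k / (k - 1)) * (1 - items.var(axis=0).sum()
                                / items.sum(axis=1).var())

    def kr20(items):
        """KR-20 (Equation 7) for a matrix of 0/1 item responses."""
        k = items.shape[1]
        p = items.mean(axis=0)              # proportion answering positively
        return (k / (k - 1)) * (1 - (p * (1 - p)).sum()
                                / items.sum(axis=1).var())

    rng = np.random.default_rng(3)
    ability = rng.normal(0, 1, (1000, 1))
    # Fabricated dichotomous responses: higher ability, more "1" answers.
    responses = (ability + rng.normal(0, 1, (1000, 10)) > 0).astype(int)

    # Alpha generalizes KR-20, so the two agree on dichotomous items.
    print(round(kr20(responses), 3), round(cronbach_alpha(responses), 3))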
That pretty much describes what α is and can do. In the next section, I look at the other side of the equation and discuss what α is not and cannot, or does not, do.

MYTHS ABOUT ALPHA

Myth 1: Alpha Is a Fixed Property of a Scale

The primary myth surrounding α (and all other indexes of reliability, for that matter) is that once it is determined in one study, then you know the reliability of the scale under all circumstances. As a number of authors have pointed out, however, reliability is a characteristic of the test scores, not of the test itself (e.g., Caruso, 2000; Pedhazur & Schmelkin, 1991; Yin & Fan, 2000). That is, reliability depends as much on the sample being tested as on the test. This has been reinforced in the recent guidelines for publishing the results of studies (Wilkinson & The Task Force on Statistical Inference, 1999), which stated that, "It is important to remember that a test is not reliable or unreliable. Reliability is a property of the scores on a test for a particular population of examinees" (p. 596; emphasis added). The reasons for this flow out of Equations 1, 2, and 8. Equation 2 tells us that reliability is the ratio of the true and total score variances. However, Equation 1 shows that you can never obtain the true score. Consequently, any measured value of the reliability is an estimate and, as with all estimates of parameters, subject to some degree of error. Finally, Equation 8 reflects the fact that the reliability depends on the total score variance, and this is going to differ from one sample of people to another. The more heterogeneous the sample, then the larger the variance of the total scores and the higher the reliability. Caruso (2000) did a meta-analysis of reliability studies done with the NEO and found, for example, that the mean reliability of the Agreeableness subscale was .79 when it was used in studies with the general population, but only .62 in clinical samples. Similarly, Henson, Kogan, and Vacha-Haase's (2001) meta-analysis of teacher efficacy scales found that the reliability estimates for the Internal Failure scale ranged from .51 to .82, and from .55 to .82 for the General Teaching Efficacy scale. The reliabilities were affected by a number of attributes of the samples including, not surprisingly, the heterogeneity of the teachers. Consequently, a scale that may have excellent reliability with one group may have only marginal reliability in another. One implication of this is that it is not sufficient to rely on published reports of reliability if the scale is to be used with another group of people; it may be necessary to determine it anew if the group is sufficiently different, especially with regard to its homogeneity.
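The sample dependence is easy to see in a simulation sketch (a hypothetical 10-item scale with invented parameters): scoring the same items in the full group and in a subgroup with a restricted trait range changes α substantially, even though the scale itself never changes.

    import numpy as np

    def cronbach_alpha(items):
        k = items.shape[1]
        return (k / (k - 1)) * (1 - items.var(axis=0).sum()
                                / items.sum(axis=1).var())

    rng = np.random.default_rng(4)
    trait = rng.normal(0, 1, (5000, 1))
    items = trait + rng.normal(0, 1, (5000, 10))   # one 10-item scale

    broad = cronbach_alpha(items)                            # everyone
    narrow = cronbach_alpha(items[np.abs(trait[:, 0]) < 1])  # restricted range
    print(round(broad, 2), round(narrow, 2))                 # roughly 0.91 vs. 0.74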
Myth 2: Alpha Measures Only the Internal Consistency of the Scale

It is true that the higher the correlations among the items of a scale, the higher will be the value of α. But, the converse of this, that a high value of α implies a high degree of internal consistency, is not always true. The reason is that α is also …

… three from each subscale), .65 with 12 items, and .75 with 18 items. A scale composed of three orthogonal (i.e., uncorrelated) subscales had an α of .64 with 18 items. Cortina (1993) concluded that

    if a scale has more than 14 items, then it will have an α of .70 or better even if it consists of two orthogonal dimensions with modest (i.e., .30) item intercorrelations. If the dimensions are correlated with each other, as they usually are, then α is even greater. (p. 102)

In other words, even though a scale may consist of two or more independent constructs, α could be substantial as long as the scale contains enough items. The bottom line is that a high value of α is a prerequisite for internal consistency, but does not guarantee it; long, multidimensional scales will also have high values of α.
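That conclusion can be reproduced in a few lines (a hypothetical simulation: 18 items, two uncorrelated factors, within-factor item intercorrelations near .30; all parameters are invented):

    import numpy as np

    def cronbach_alpha(items):
        k = items.shape[1]
        return (k / (k - 1)) * (1 - items.var(axis=0).sum()
                                / items.sum(axis=1).var())

    rng = np.random.default_rng(5)
    n = 5000
    f1 = rng.normal(0, 1, (n, 1))               # first construct
    f2 = rng.normal(0, 1, (n, 1))               # second, uncorrelated construct

    half_a = f1 + rng.normal(0, 1.5, (n, 9))    # nine items loading on f1 only
    half_b = f2 + rng.normal(0, 1.5, (n, 9))    # nine items loading on f2 only
    scale = np.hstack([half_a, half_b])         # one 18-item "scale"

    # Item intercorrelations are ~.30 within a factor and ~0 across factors,
    # yet alpha for the combined scale still comes out around .75.
    print(round(cronbach_alpha(scale), 2))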

Myth 3: Bigger Is Always Better

For most indexes of reliability, the higher the value the better. We would like high levels of agreement between independent raters and good stability of scores over time in the absence of change. This is true, too, about α, but only up to a point. As I just noted, α measures not only the homogeneity of the items, but also the homogeneity of what is being assessed. In many cases, even seemingly unidimensional constructs can be conceptualized as having a number of different aspects. Lang (1971), for example, stated that anxiety can be broken down into three components (cognitive, physiological, and behavioral), whereas Koksal and Power (1990) added a fourth, affective, dimension. Moreover, these do not always respond in concert and the correlations among them may be quite modest (Antony, 2001). Consequently, any scale that is designed to measure anxiety as a whole must by necessity have some degree of heterogeneity among the items. If the anxiety scale has three or four subscales, they should each be more homogeneous than the scale as a whole, but even here, α should not be too high (over .90 or so). Higher values may reflect unnecessary duplication of content across items and point more to redundancy than to homogeneity; or, as McClelland (1980) put it, "asking the same question many different ways" (p. 30). In the final section, I will expand on this a bit more.

Myth 4: Alpha Ranges Between 0 and 1

… trait for some items, and strongly disagree for other items) to minimize yea-saying bias (e.g., Streiner & Norman, 1995). Needless to say, the scoring for the reversed items should also be reversed. If this isn't done, the items will be negatively correlated, leading to a value of α that is below zero. Of course, if the items are scored correctly and some correlations are still negative, then it points to serious problems in the original construction of the scale.
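The rescoring step itself is mechanical; here is a minimal sketch, assuming a 1-to-5 Likert format (the data and the choice of which item is reversed are made up):

    import numpy as np

    # Hypothetical 1-to-5 Likert responses (people x items).
    responses = np.array([[5, 1, 4],
                          [4, 2, 5],
                          [2, 4, 1]])

    reversed_items = [1]          # pretend item 2 is worded in reverse

    scored = responses.copy()
    # On a 1-to-5 scale, reverse-scoring maps x to (1 + 5) - x.
    scored[:, reversed_items] = 6 - scored[:, reversed_items]
    print(scored.sum(axis=1))     # totals now accumulate in a single direction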
est (Antony, 2001). Consequently, any scale that is desígned tween items will depend on how many items were finíshed,
to measure anxiety as a whole must by necessity have sorne and not the pattern of responding.
degree of heterogeneity among the items. If the anxiety scale Closely related to this are many of the other subtests of the
has three or four subscales, they should each be more homo- Wechsler scales and similar types of indexes, where the items
geneous than the sea! e as a whole, but e ven here, ashould not are presented in order of difficulty. Again, the expected pat-
be too high (over .90 or so). Higher values may reflect unnec- tern of answers is that they will all be correct until the difficulty
essary duplication of content across items and point more to leve! exceeds the person' s ability and the remaining items
redundancy than to homogeneity; or, as McClelland (1980) would be wrong; or there should be a number of two-point re-
put it, "asking the same question many different ways" (p. sponses, followed by sorne one-point answers, and then zeros.
30). In the final section, I will expand on this a bit more. If a is computed for these types of tests, it will result in a very
high value, one that is only marginally below l.O.
Myth 4: Alpha Ranges Between O and 1 Third, a is inappropriate if the answer to one ítem depends
on the response to a previous one, or when more than one

… than 20 or so items, α can be quite respectable, giving the misleading impression that the scale is homogeneous.

So, how high should α be? In the first version of his book, Nunnally (1967) recommended .50 to .60 for the early stages of research, .80 for basic research tools, and .90 as the "minimally tolerable estimate" for clinical purposes, with an ideal of .95. He increased the starting level to .70 in later versions of his book (Nunnally, 1978; Nunnally & Bernstein, 1994). In my opinion (and note that this is an opinion, as are all other values suggested by various authors), he got it right for research tools, but went too far for clinical scales. As outlined in Myth 3, except for extremely narrowly defined traits (and I can't think of any), αs over .90 most likely indicate unnecessary redundancy rather than a desirable level of internal consistency.

CONCLUSIONS

Internal consistency is necessary in scales that measure various aspects of personality (a subsequent article will examine situations where it is not important). However, Cronbach's α must be used and interpreted with some degree of caution.

1. You cannot trust that published estimates of α apply in all situations. If the group for which the scale will be used is more or less homogeneous than the one in the published report, then α will most likely be different (higher in the first case, lower in the second).

2. Because α is affected by the length of the scale, high values do not guarantee internal consistency or unidimensionality. Scales over 20 items or so will have acceptable values of α, even though they may consist of two or three orthogonal dimensions. It is necessary to also examine the matrix of correlations of the individual items and to look at the item-total correlations, as the sketch after this list illustrates. In this vein, Clark and Watson (1995) recommended a mean interitem correlation within the range of .15 to .20 for scales that measure broad characteristics and between .40 to .50 for those tapping narrower ones.

3. Values of α can be too high, and point to redundancy among the items. I recommend a maximum value of .90.
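These checks take only a few lines. A sketch on fabricated data (the 8-item scale and its parameters are invented for illustration; the thresholds in the comment are Clark and Watson's, 1995):

    import numpy as np

    rng = np.random.default_rng(6)
    trait = rng.normal(0, 1, (1000, 1))
    items = trait + rng.normal(0, 1.5, (1000, 8))   # fabricated 8-item scale

    # Mean interitem correlation (average of the off-diagonal entries).
    r = np.corrcoef(items, rowvar=False)
    mean_r = r[np.triu_indices_from(r, k=1)].mean()

    # Corrected item-total correlations: each item vs. the sum of the others.
    total = items.sum(axis=1)
    item_total = [np.corrcoef(items[:, j], total - items[:, j])[0, 1]
                  for j in range(items.shape[1])]

    print(round(mean_r, 2))            # compare with the .15 to .50 guidelines
    print(np.round(item_total, 2))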
REFERENCES

Caruso, J. C. (2000). Reliability generalization of the NEO personality scales. Educational and Psychological Measurement, 60, 236-254.
Clark, L. A., & Watson, D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7, 309-319.
Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78, 98-104.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16, 297-334.
Hamilton, M. A. (1967). Development of a rating scale for primary depressive illness. British Journal of Social and Clinical Psychology, 6, 278-296.
Henson, R. K. (2001). Understanding internal consistency reliability estimates: A conceptual primer on coefficient alpha. Measurement and Evaluation in Counseling and Development, 34, 177-189.
Henson, R. K., Kogan, L. R., & Vacha-Haase, T. (2001). A reliability generalization study of the Teacher Efficacy Scale and related instruments. Educational and Psychological Measurement, 61, 404-420.
Judd, C. M., Smith, E. R., & Kidder, L. H. (1991). Research methods in social relations (6th ed.). New York: Harcourt Brace Jovanovich.
Koksal, F., & Power, K. G. (1990). Four Systems Anxiety Questionnaire (FSAQ): A self-report measure of somatic, cognitive, behavioral, and feeling components. Journal of Personality Assessment, 54, 534-545.
Kuder, G. F., & Richardson, M. W. (1937). The theory of the estimation of test reliability. Psychometrika, 2, 151-160.
Lang, P. J. (1971). The application of psychophysiological methods. In S. Garfield & A. Bergin (Eds.), Handbook of psychotherapy and behavior change (pp. 75-125). New York: Wiley.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
McClelland, D. C. (1980). Motive dispositions: The merits of operant and respondent measures. In L. Wheeler (Ed.), Review of personality and social psychology (Vol. 1, pp. 10-41). Beverly Hills, CA: Sage.
Nunnally, J. C. (1967). Psychometric theory. New York: McGraw-Hill.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Lawrence Erlbaum Associates, Inc.
Rulon, P. J. (1939). A simplified procedure for determining the reliability of a test of split-halves. Harvard Educational Review, 9, 99-103.
Streiner, D. L., & Norman, G. R. (1995). Health measurement scales: A practical guide to their development and use (2nd ed.). Oxford, England: Oxford University Press.
Wechsler, D. (1997). WAIS-III administration and scoring manual (3rd ed.). San Antonio, TX: Psychological Corporation.
Wilkinson, L., & The Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.
Yin, P., & Fan, X. (2000). Assessing the reliability of Beck Depression Inventory scores: Reliability generalization across studies. Educational and Psychological Measurement, 60, 201-223.
David L. Streiner
Kunin-Lunenfeld Applied Research Unit
