CHAPTER 5: RELIABILITY

RELIABILITY
Consistency in measurement; the total variance in an observed distribution of test scores equals the sum of the true variance plus the error variance.

RELIABILITY COEFFICIENT
An index of reliability; a proportion that indicates the ratio between the true score variance on a test and the total variance.

CONCEPT OF RELIABILITY
X = T + E
X = Observed score
T = True score
E = Error

TRUE SCORE MODEL
It is also true that the magnitude of the presence of a certain psychological trait as measured by a test of that trait will be due to the true amount of that trait as well as to other factors.

VARIANCE
A statistic useful in describing sources of test-score variability; useful because it can be broken down into components.

TRUE VARIANCE
Variance from true differences.

ERROR VARIANCE
Variance from irrelevant, random sources.

RELIABILITY OF A TEST
The greater the proportion of the total variance attributed to true variance, the more reliable the test.
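To make the variance decomposition concrete: a minimal simulation sketch in Python (assuming NumPy; all numbers hypothetical) that generates true scores and random error, then recovers reliability as the ratio of true variance to total variance.

import numpy as np

rng = np.random.default_rng(0)
true_scores = rng.normal(100, 15, 10_000)   # T: true scores (SD = 15)
error = rng.normal(0, 5, 10_000)            # E: random measurement error (SD = 5)
observed = true_scores + error              # X = T + E

reliability = true_scores.var() / observed.var()   # true variance / total variance
print(round(reliability, 2))   # about 15**2 / (15**2 + 5**2) = 0.90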
SOURCES OF ERROR VARIANCE
Test Construction
Administration
Scoring
Interpretation

ITEM/CONTENT SAMPLING
Terms that refer to variation among items within a test as well as to variation among items between tests.

CHALLENGE IN TEST DEVELOPMENT
To maximize the proportion of the total variance that is true variance and to minimize the proportion of the total variance that is error variance.

FACTORS RELATED TO THE TEST ENVIRONMENT
Room temperature.
Level of lighting.
Amount of ventilation and noise.
Instrument used to enter responses and even the writing surface on which responses are written.

FACTORS RELATED TO TESTTAKER VARIABLES
Pressing emotional problems
Physical discomfort
Lack of sleep
Effects of drugs or medication
FACTORS RELATED TO EXAMINER-RELATED VARIABLES
Examiner's physical appearance and demeanor; presence or absence of an examiner.

SCORING AND SCORING SYSTEMS
Technical glitches may contaminate data.

TEST-RETEST METHOD
Using the same instrument to measure the same thing at two points in time.

TEST-RETEST RELIABILITY
Result of a reliability evaluation; an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test.
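In practice, this estimate is simply a Pearson r between the two sets of scores. A minimal sketch, with hypothetical data and assuming NumPy:

import numpy as np

# scores of the same five people on two administrations of the same test
time_1 = np.array([12, 18, 25, 31, 40])
time_2 = np.array([14, 17, 27, 30, 38])

r_test_retest = np.corrcoef(time_1, time_2)[0, 1]   # Pearson r
print(round(r_test_retest, 3))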
TEST-RETEST MEASURE
Appropriate when evaluating the reliability of a test that purports to measure something that is relatively stable over time.

COEFFICIENT OF STABILITY
An estimate of test-retest reliability when the interval between testings is greater than six months.

COEFFICIENT OF EQUIVALENCE
The alternate-forms or parallel-forms coefficient of reliability.

PARALLEL FORMS
Exist when, for each form of the test, the means and the variances of observed test scores are equal; means of scores obtained on parallel forms correlate equally with the true score; scores obtained on parallel tests correlate equally with other measures.

ALTERNATE FORMS
Different versions of a test that have been constructed so as to be parallel; designed to be equivalent with respect to variables such as content and level of difficulty.

SIMILARITY BETWEEN OBTAINING ESTIMATES OF ALTERNATE-FORMS OR PARALLEL-FORMS RELIABILITY AND OBTAINING AN ESTIMATE OF TEST-RETEST RELIABILITY
Two test administrations with the same group are required.
Test scores may be affected by factors such as motivation, fatigue, or intervening events such as practice, learning, or therapy.

ITEM SAMPLING
Inherent in the computation of an alternate- or parallel-forms reliability coefficient; testtakers may do better or worse on a specific form of the test not as a function of their true ability but simply because of the particular items that were selected for inclusion in the test.

INTERNAL CONSISTENCY ESTIMATE OF RELIABILITY / ESTIMATE OF INTER-ITEM CONSISTENCY
Obtaining an estimate of the reliability of a test without developing an alternate form of the test and without having to administer the test twice to the same people.

SPLIT-HALF RELIABILITY
Obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once; a useful measure of reliability when it is impractical or undesirable to assess reliability with two tests or to administer a test twice.
STEPS TO COMPUTE A COEFFICIENT OF SPLIT-HALF RELIABILITY
Divide the test into equivalent halves.
Calculate a Pearson r between scores on the two halves of the test.
Adjust the half-test reliability using the Spearman-Brown formula.
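A minimal sketch of the three steps above in Python (hypothetical 0/1 item data, assuming NumPy; the odd-even split is used here):

import numpy as np

# rows = testtakers, columns = dichotomously scored items
item_scores = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 1, 1, 1, 0],
])

# Step 1: divide the test into equivalent halves (odd-even split)
odd_half = item_scores[:, 0::2].sum(axis=1)
even_half = item_scores[:, 1::2].sum(axis=1)

# Step 2: Pearson r between scores on the two halves
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# Step 3: adjust the half-test r with the Spearman-Brown formula
r_split_half = (2 * r_half) / (1 + r_half)
print(round(r_split_half, 3))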
TO SPLIT A TEST
Randomly assign items to one or the other half of the test, or assign odd-numbered items to one half of the test and even-numbered items to the other half.

ODD-EVEN RELIABILITY
Assign odd-numbered items to one half of the test and even-numbered items to the other half.

MINI PARALLEL FORMS
Each half equal to the other in format, stylistic, statistical, and related aspects.

SPEARMAN-BROWN FORMULA
Allows a test developer or user to estimate internal consistency reliability from a correlation of two halves of a test; a specific application is to estimate the reliability of a test that is lengthened or shortened by any number of items; used to determine the number of items needed to attain a desired level of reliability.
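The general form of the formula is r_SB = n * r / (1 + (n - 1) * r), where n is the factor by which the length of the test changes. A minimal sketch (helper names mine, numbers hypothetical):

def spearman_brown(r: float, n: float) -> float:
    """Estimated reliability when a test is lengthened (n > 1) or shortened (n < 1)."""
    return n * r / (1 + (n - 1) * r)

def length_factor_needed(r: float, r_target: float) -> float:
    """Factor by which a test must be lengthened to reach a target reliability."""
    return r_target * (1 - r) / (r * (1 - r_target))

print(round(spearman_brown(0.70, 2), 2))           # doubling a test with r = .70 gives about .82
print(round(length_factor_needed(0.70, 0.90), 1))  # about 3.9 times as many items for r = .90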
IN ADDING ITEMS TO INCREASE TEST RELIABILITY TO A DESIRED LEVEL
The rule is that new items must be equivalent in content and difficulty so that the longer test still measures what the original test measured.

WHEN INTERNAL CONSISTENCY ESTIMATES OF RELIABILITY ARE INAPPROPRIATE
When measuring the reliability of a heterogeneous test or a speed test.

INTER-ITEM CONSISTENCY
Refers to the degree of correlation among all the items on a scale; calculated from a single administration of a single form of a test; useful in assessing the homogeneity of a test.

HOMOGENEITY
The degree to which a test measures a single factor; the extent to which items in a scale are unifactorial.

HETEROGENEITY
The degree to which a test measures different factors; a heterogeneous test is composed of items that measure more than one trait.

NATURE OF A HOMOGENEOUS TEST
The more homogeneous a test is, the more inter-item consistency it can be expected to have; desirable because it allows relatively straightforward test-score interpretation.

TESTTAKERS WITH THE SAME SCORE ON A HOMOGENEOUS TEST
Have similar abilities in the area tested.

TESTTAKERS WITH THE SAME SCORE ON A HETEROGENEOUS TEST
May have different abilities.
HOMOGENEOUS TEST
An insufficient tool for measuring multifaceted psychological variables such as intelligence or personality.

G. FREDERIC KUDER & M.W. RICHARDSON
Developed their own measures for estimating reliability, including the Kuder-Richardson Formula 20 (KR-20).

KUDER-RICHARDSON FORMULA 20 (KR-20)
The most popular of the Kuder-Richardson formulas.

r_KR20
The symbol for the Kuder-Richardson Formula 20 reliability coefficient.
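A minimal sketch of the KR-20 computation (function name mine; hypothetical data, assuming NumPy). The formula used is r = (k / (k - 1)) * (1 - sum(pq) / var_total), where k is the number of items, p the proportion of testtakers passing each item, q = 1 - p, and var_total the variance of total test scores:

import numpy as np

def kr20(items: np.ndarray) -> float:
    """KR-20 for dichotomously scored (0/1) items; rows = testtakers, columns = items."""
    k = items.shape[1]                         # number of items
    p = items.mean(axis=0)                     # proportion passing each item
    pq_sum = (p * (1 - p)).sum()
    total_var = items.sum(axis=1).var(ddof=1)  # sample variance of total scores
    return (k / (k - 1)) * (1 - pq_sum / total_var)

# five testtakers, four items (hypothetical data)
data = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])
print(round(kr20(data), 3))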
WHERE TEST ITEMS ARE HIGHLY HOMOGENEOUS
KR-20 and split-half reliability estimates will be similar.

WHERE TEST ITEMS ARE HIGHLY HETEROGENEOUS
KR-20 will yield lower reliability estimates than the split-half method.

DICHOTOMOUS ITEMS
Items that can be scored right or wrong, such as multiple-choice items.

KR-21
Used if there is reason to assume that all the test items have approximately the same degree of difficulty; outdated in an era of calculators and computers.

TEST BATTERY
A selected assortment of tests and assessment procedures used in the process of evaluation; typically composed of tests designed to measure different variables.

COEFFICIENT ALPHA
A variant of the KR-20 that has received the most acceptance and is in widest use today; the mean of all possible split-half correlations, corrected by the Spearman-Brown formula; appropriate for use on tests containing non-dichotomous items; the preferred statistic for obtaining an estimate of internal consistency reliability; the formula yields an estimate of the mean of all possible test-retest, split-half coefficients; widely used as a measure of reliability, in part because it requires only one administration of the test; gives information about the test scores and not about the test itself.
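A minimal sketch of coefficient alpha (function name mine; hypothetical data, assuming NumPy), which replaces the sum(pq) term of KR-20 with the sum of item variances so that non-dichotomous items can be handled:

import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha; rows = testtakers, columns = items (any scoring, not only 0/1)."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# five testtakers, four Likert-type items (hypothetical data)
data = np.array([
    [2, 3, 3, 4],
    [1, 1, 2, 2],
    [4, 4, 5, 5],
    [2, 2, 2, 3],
    [3, 4, 4, 4],
])
print(round(cronbach_alpha(data), 3))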
COEFFICIENT ALPHA RESULT
Coefficient alpha is calculated to help answer questions about how similar sets of data are. It ranges in value from 0 to 1; a negative value of alpha is not meaningful, and if a negative value is obtained, alpha is usually reported as zero.

SCALE OF COEFFICIENT ALPHA
0 = Absolutely no similarity
1 = Perfectly identical
HOMOGENEITY OF TEST ITEMS
A test is homogeneous in items if it is functionally uniform throughout.

HETEROGENEITY OF TEST ITEMS
An estimate of internal consistency might be low relative to a more appropriate estimate of test-retest reliability.

INTER-SCORER RELIABILITY
The degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure.

COEFFICIENT OF INTER-SCORER RELIABILITY
A way to determine the degree of consistency among scorers.
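The notes do not fix a single statistic here; for continuous ratings, a common choice is a correlation between the two scorers' ratings (for categorical judgments, an agreement index such as kappa is often used instead). A minimal sketch with hypothetical ratings, assuming NumPy:

import numpy as np

# ratings assigned by two scorers to the same ten essays (hypothetical data)
scorer_a = np.array([4, 3, 5, 2, 4, 3, 5, 1, 2, 4])
scorer_b = np.array([4, 2, 5, 2, 3, 3, 5, 1, 2, 5])

r_interscorer = np.corrcoef(scorer_a, scorer_b)[0, 1]
print(round(r_interscorer, 3))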
APPROACHES TO THE ESTIMATION OF RELIABILITY
Test-Retest
Alternate or Parallel Forms
Internal or Inter-Item Consistency

HOW HIGH A COEFFICIENT OF RELIABILITY SHOULD BE
On a continuum relative to the purpose and importance of the decisions to be made on the basis of scores on the test.

CONSIDERATIONS OF THE NATURE OF THE TEST ITSELF
Test items are homogeneous or heterogeneous in nature.
The characteristic, ability, or trait being measured is presumed to be dynamic or static.
The range of test scores is or is not restricted.
The test is a speed or a power test.
The test is or is not criterion-referenced.

DYNAMIC CHARACTERISTIC
A trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences.

STATIC CHARACTERISTIC
A trait, state, or ability presumed to be relatively unchanging; an obtained measurement would not be expected to vary significantly as a function of time, so either the test-retest or the alternate-forms method would be appropriate.

RESTRICTION OF RANGE/VARIANCE
If the variance of either variable in a correlational analysis is restricted by the sampling procedure used, then the resulting correlation coefficient tends to be lower; if the variance of either variable is inflated by the sampling procedure, then the resulting correlation coefficient tends to be higher.
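A quick simulation illustrates the restriction-of-range effect (hypothetical parameters, assuming NumPy): sampling only the high scorers on one variable shrinks the observed correlation.

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 10_000)
y = 0.7 * x + rng.normal(0, 0.7, 10_000)    # two correlated variables (r about .71)

r_full = np.corrcoef(x, y)[0, 1]            # correlation over the full range of x
keep = x > 1.0                              # restrict the range: keep only high scorers
r_restricted = np.corrcoef(x[keep], y[keep])[0, 1]
print(round(r_full, 2), round(r_restricted, 2))   # the restricted r is noticeably lower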
SOURCES OF VARIANCE IN A HYPOTHETICAL TEST
True Variance 67%
Error due to Test Construction 18%
Administration Error 5%
Scorer Error 5%
Unidentified Error 5%

POWER TEST
A test in which the time limit is long enough to allow testtakers to attempt all items, and some items are so difficult that no testtaker is able to obtain a perfect score.

SPEED TEST
Generally contains items of a uniform level of difficulty so that, when given generous time limits, all testtakers should be able to complete all the test items correctly; based on performance speed; the time limit is established so that few, if any, testtakers will be able to complete the entire test.

RELIABILITY ESTIMATE OF A SPEED TEST
Based on performance from two independent testing periods using one of the following:
Test-Retest Reliability
Alternate-Forms Reliability
Split-Half Reliability from two separately timed half tests
IF THE SPLIT-HALF PROCEDURE IS USED FOR A SPEED TEST
The obtained reliability coefficient is for a half test and should be adjusted using the Spearman-Brown formula.

SPEED TEST ADMINISTERED ONCE & MEASURE OF INTERNAL CONSISTENCY IS CALCULATED
The result will be a spuriously high reliability coefficient. Given two people, one who completes 82 items of a speed test and another who completes 61 items of the same speed test, the correlation of the two will be close to 1 but will say nothing about response consistency.
CRITERION-REFERENCED TEST
Designed to provide an indication of where a testtaker stands with respect to some variable or criterion, such as an educational or a vocational objective; tends to contain material that has been mastered in hierarchical fashion; tends to be interpreted in pass-fail terms, and any scrutiny of performance on individual items tends to be for diagnostic and remedial purposes.

TEST-RETEST RELIABILITY ESTIMATE
Based on the correlation between the total scores on two administrations of the same test.

ALTERNATE-FORMS RELIABILITY ESTIMATE
Based on the correlation between the total scores on the two forms.

SPLIT-HALF RELIABILITY ESTIMATE
Based on the correlation between scores on two halves of the test, then adjusted using the Spearman-Brown formula to obtain a reliability estimate of the whole test.

GENERALIZABILITY THEORY / DOMAIN SAMPLING THEORY
Seeks to estimate the extent to which specific sources of variation under defined conditions are contributing to the test score; a test's reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample.

DOMAIN OF BEHAVIOR
The universe of items that could conceivably measure a particular behavior; a hypothetical construct, one that shares certain characteristics with (and is measured by) the sample of items that make up the test.

GENERALIZABILITY THEORY
May be viewed as an extension of true score theory wherein the concept of a universe score replaces that of a true score; developed by Lee J. Cronbach; given the same conditions of all the facets in the universe, the exact same test score should be obtained.
LEE J. CRONBACH
Encouraged test developers and researchers to describe the details of the particular test situation (universe) leading to a specific test score.

UNIVERSE
Described in terms of its facets.

FACETS
Include things like the number of items in the test, the amount of training the test scorers have had, and the purpose of the test administration.
UNIVERSE SCORE
The test score; analogous to a true score in the true score model.

GENERALIZABILITY STUDY
Examines how generalizable scores from a particular test are if the test is administered in different situations; examines how much of an impact different facets of the universe have on the test score.

COEFFICIENTS OF GENERALIZABILITY
Express the influence of particular facets on the test score; similar to reliability coefficients in the true score model.

DECISION STUDY
Developers examine the usefulness of test scores in helping the test user make decisions; designed to tell the test user how test scores should be used and how dependable those scores are as a basis for decisions, depending on the context of their use.

ITEM RESPONSE THEORY
Provides a way to model the probability that a person with X ability will be able to perform at a level of Y; stated in terms of personality assessment, it models the probability that a person with X amount of a particular personality trait will exhibit Y amount of that trait on a personality test designed to measure it; not a term used to refer to a single theory or method.

LATENT
Physically unobservable.

LATENT-TRAIT THEORY
A synonym for IRT; proposes models that describe how the latent trait influences performance on each test item; the latent trait theoretically can take on values from -infinity to +infinity.

CHARACTERISTICS OF ITEMS WITHIN AN IRT FRAMEWORK
Difficulty level of an item
Item's level of discrimination

DIFFICULTY
Refers to the attribute of not being easily accomplished, solved, or comprehended; may also refer to physical difficulty.
PHYSICAL DIFFICULTY
How hard or easy it is for a person to engage in a particular activity.

DISCRIMINATION
Signifies the degree to which an item differentiates among people with higher or lower levels of the trait, ability, or whatever it is that is being measured.

DICHOTOMOUS TEST ITEMS
Test items or questions that can be answered with only one of two alternative responses, such as true-false, yes-no, or correct-incorrect questions.

POLYTOMOUS TEST ITEMS
Test items or questions with three or more alternative responses, where only one is scored correct or scored as being consistent with a targeted trait or other construct.

GEORG RASCH
Developed a group of IRT models in which each item on the test is assumed to have an equivalent relationship with the construct being measured by the test.
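In the simplest Rasch-type model, the probability of a correct response depends only on the difference between the testtaker's ability and the item's difficulty, both on the latent scale. A sketch of the standard logistic form (not taken from these notes; function name mine):

import math

def rasch_p_correct(theta: float, b: float) -> float:
    """Probability of a correct response given ability (theta) and item difficulty (b)."""
    return 1 / (1 + math.exp(-(theta - b)))

print(round(rasch_p_correct(0.0, 0.0), 2))   # ability equals difficulty: 0.50
print(round(rasch_p_correct(1.5, 0.0), 2))   # ability well above difficulty: about 0.82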
RELIABILITY COEFFICIENT
Helps the test developer build an adequate measuring instrument.
Helps the test user select a suitable test.
Its usefulness does not end with test construction and selection.
STANDARD ERROR OF MEASUREMENT (SEM)
Provides a measure of the precision of an observed test score; provides an estimate of the amount of error inherent in an observed score or measurement; there is an inverse relationship between the SEM and the reliability of a test: the higher the reliability of a test (or individual subtest within a test), the lower the SEM; a tool used to estimate or infer the extent to which an observed score deviates from a true score; the standard deviation of a theoretically normal distribution of test scores obtained by one person on equivalent tests.

STANDARD ERROR OF A SCORE
Another term for the standard error of measurement; an index of the extent to which an individual's scores vary over tests presumed to be parallel.

CONFIDENCE INTERVAL
A range or band of test scores that is likely to contain the true score.
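Given a test's standard deviation and reliability coefficient, the SEM is commonly computed as SEM = SD * sqrt(1 - r), and a confidence interval around an observed score follows directly. A minimal sketch with hypothetical values (function name mine):

import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: SEM = SD * sqrt(1 - r)."""
    return sd * math.sqrt(1 - reliability)

# 95% confidence interval around an observed score of 100
# on a test with SD = 15 and reliability = .89 (hypothetical values)
s = sem(15, 0.89)                              # about 4.98
low, high = 100 - 1.96 * s, 100 + 1.96 * s
print(round(s, 2), round(low, 1), round(high, 1))   # roughly 90.2 to 109.8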
STANDARD ERROR OF THE DIFFERENCE
A statistical measure that can aid a test user in determining how large a difference between two scores should be before it is considered statistically significant (a worked sketch follows the questions below).

QUESTIONS THAT THE STANDARD ERROR OF THE DIFFERENCE BETWEEN TWO SCORES CAN ANSWER
How did this individual's performance on test 1 compare with his or her performance on test 2?
How did this individual's performance on test 1 compare with someone else's performance on test 1?
How did this individual's performance on test 1 compare with someone else's performance on test 2?
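A common formula, assuming the two tests are on the same scale with the same standard deviation, is SED = SD * sqrt(2 - r1 - r2). A minimal sketch with hypothetical values (function name mine):

import math

def sed(sd: float, r1: float, r2: float) -> float:
    """Standard error of the difference between scores on two tests with a common SD."""
    return sd * math.sqrt(2 - r1 - r2)

# how large must a difference be to be significant at the .05 level? (hypothetical values)
d = sed(15, 0.90, 0.84)
print(round(1.96 * d, 1))   # the observed difference must exceed about 15 points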