RELIABILITY
So,
XT = X∞ + Xe
Where,
XT = the obtained score
X∞ = the true score
Xe = the error score
The true score is the score which is free from errors occurring due to the chance factor as well as other kinds of errors. It is indicated by the mean of a large number of scores made by the same person on the same test:
X∞ = (X1 + X2 + X3 + . . . + Xn) / n
Usually, a person’s true score remains the same, but his obtained score may vary from trial to trial because the error score contributes to the obtained score in each trial.
The error score may be the result of two kinds of errors: random (or chance) errors and systematic (or constant) errors.
•Random errors: these chance errors work randomly in both the positive and negative directions and therefore sometimes inflate and sometimes depress the obtained score. In the long run these errors tend to cancel each other out, so the mean of all of these errors of measurement would be zero.
E.g., a malfunctioning electronic thermometer.
•Systematic errors: they work constantly in one direction and, therefore, they would
either tend to inflate or depress the score. The mean of such errors of measurement
would not be zero.
E.g., a height board with an incorrect baseline.
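The distinction can be illustrated with a small simulation. The sketch below is illustrative only; the true score, error sizes and number of trials are arbitrary assumptions rather than values from any real test.

```python
import numpy as np

rng = np.random.default_rng(0)

true_score = 50.0                         # hypothetical true score X∞
random_errors = rng.normal(0, 3, 1000)    # chance errors, both positive and negative
systematic_error = 2.5                    # constant bias, e.g. a misaligned baseline

obtained_random = true_score + random_errors
obtained_biased = true_score + random_errors + systematic_error

print(round(random_errors.mean(), 2))    # ~0: random errors cancel out in the long run
print(round(obtained_random.mean(), 2))  # ~50: the mean of many trials recovers X∞
print(round(obtained_biased.mean(), 2))  # ~52.5: a systematic error does not cancel out
```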
Reliability is inversely related to the size of the error score: the smaller the error score, the more reliable the test or measuring instrument.
Reliability is defined, so to speak, through error: the more the error, the less the reliability; the less the error, the greater the reliability. Practically speaking, this means that if we can estimate the error variance of a measure, we can also estimate the measure's reliability.
VT = V∞ + Ve
R = V∞ / VT
Where, V = variance
• Reliability is the proportion of error variance to the total obtained variance yielded by a measuring instrument subtracted from 1.00, the index 1.00 indicating perfect reliability:
R = 1 - (Ve / VT)
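As a worked illustration with invented variance values, suppose the true-score variance is 32 and the error variance is 8; both forms of the formula then give the same reliability.

```python
# Illustrative numbers only: VT = V∞ + Ve
v_true, v_error = 32.0, 8.0
v_total = v_true + v_error               # VT = 40

r_from_true = v_true / v_total           # R = V∞ / VT
r_from_error = 1 - (v_error / v_total)   # R = 1 - (Ve / VT)
print(r_from_true, r_from_error)         # both equal 0.8
```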
RELIABILITY COEFFICIENT
Since all types of reliability are concerned with the degree of consistency or agreement between two
independently derived sets of scores, they can also be expressed in terms of a correlation coefficient.
“In psychometrics, a correlation coefficient or other numerical index of the reliability of
a test or measure.”
-Oxford dictionary of Psychology.
It varies from 0 to +1.00 (it cannot be negative).
How high the coefficient should be is judged by the purpose for which the test is given to the examinees. If the purpose is to segregate superiors and inferiors, i.e. to make individual diagnoses (e.g., intelligence, aptitude and achievement tests), a reliability coefficient of 0.90 or higher is regarded as the best. Likewise, where the purpose is to compare the means of two groups of narrow range, a reliability coefficient of 0.50-0.60 should suffice.
Theoretically, the correlation between the obtained scores and the true scores
should be perfect but this is rarely ever the case.
R1∞ = √R
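For example, if a test's reliability coefficient is R = 0.81, the estimated correlation between the obtained scores and the true scores is R1∞ = √0.81 = 0.90.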
TYPES OF RELIABILITY:
1. Test-retest reliability
2. Internal consistency reliability
• Split-half reliability
• Rulon formula
• Flanagan formula
• Kuder-Richardson formula
• Cronbach's alpha
3. Alternate forms reliability
4. Scorer reliability
5. ANOVA
TEST-RETEST RELIABILITY
“ A measure of a test’s reliability or more specifically its stability, based on the
correlation between scores of a group of respondents on two separate occasions.”
- Oxford dictionary of Psychology.
In test-retest reliability, a single form of the test is administered twice to the same sample with a reasonable time gap. In this way, two administrations of the same test yield two independent sets of scores. The two sets, when correlated, give the value of the reliability coefficient, which is also known as the temporal stability coefficient and indicates to what extent the examinees retain their relative position, as measured in terms of the test score, over a given period of time.
Criteria            Administration 1    Administration 2
High reliability    high                high
                    low                 low
Low reliability     high                low
                    low                 high
So, what is a ‘reasonable’ time gap between two administrations of the test?
When the time gap is too short, it is likely to inflate the reliability coefficient due to carry-over and practice effects; on the other hand, if the time gap is too long, it is likely to lower the reliability coefficient.
The most appropriate and convenient time gap between the two administrations is a fortnight, which is considered neither too long nor too short. There is evidence to support that this time interval yields a comparatively higher reliability coefficient.
● The test-retest method involves
● (1) administering a test to a group of individuals
● (2) readministering that same test to the same group at some later time, and
● (3) correlating the first set of scores with the second.
The correlation between scores on the first test and scores on the retest is used to
estimate the reliability of the test.
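A minimal sketch of this computation is given below; the scores are invented for illustration, and any routine for the Pearson product-moment correlation would serve equally well.

```python
import numpy as np

# Hypothetical scores of the same six examinees on two occasions
administration_1 = np.array([23, 31, 27, 35, 19, 29])
administration_2 = np.array([25, 30, 28, 36, 18, 27])

# Pearson correlation between the two administrations = temporal stability coefficient
r_test_retest = np.corrcoef(administration_1, administration_2)[0, 1]
print(round(r_test_retest, 3))
```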
● Since the same test is administered twice and every test is parallel with itself, differences between scores on the test and scores on the retest should be due solely to measurement error.
● This argument is often inappropriate for psychological measurement, since it is often
impossible to consider the second administration of a test a parallel measure to the first.
● Thus, it may be inaccurate to treat a test-retest correlation as a measure of reliability.
● The second administration of a psychological test might yield systematically different
scores than the first administration for several reasons.
● First, the characteristic or attribute that is being measured may change between the first
test and the retest.
● Second, the experience of taking the test itself can change a person’s true score; this is
referred to as reactivity.
● Third, one must be concerned with carryover effects, particularly if the interval between
test and retest is short.
● When retested, people may remember their original answers, which could affect their
answers the second time around.
Advantages of test retest reliability
⮚It is the most appropriate method of estimating reliability of both the speed test and the
power test.
⮚In case of heterogeneous tests, the test retest method has proved to be most suitable.
⮚The test-retest method is most useful when one is interested in the long-term stability of a
measure. For example, research on the accuracy of personnel selection tests is concerned
with the ability of the test to predict long-term job performance.
Disadvantages of test retest reliability
Contributing to error variance:
⮚Highly time consuming
⮚It assumes that the examinee and examiner’s physical and mental set up remains unchanged over
time.
⮚It does not account for uncontrollable environmental changes that may take place during either
administration
⮚Maturational effects.
Therefore the source of error variance in this method is time sampling.
GULLIKSEN (1950) has defined parallel tests as tests having equal means, equal variances and equal inter-item correlations.
FREEMAN (1962) has listed the following criteria for judging whether or not
the two forms of the test are parallel:
1. The number of items in each should be the same.
2. Items in both should have uniformity regarding the content, the range of
difficulty and the adequacy of sampling.
3. Distribution of the indices of difficulty of items in both should be similar.
4. Items in both should have an equal degree of homogeneity, which can be shown either by inter-item correlation or by correlating each item with subtest scores or total test scores.
5. Means and standard deviations of both the forms should be equal or nearly
so.
6. Mode of administration and scoring of both should be uniform.
SPLIT HALF METHOD
In this, a test is given and divided into two halves that are scored separately. The
results of one half of the test are then compared with the results of the other. The two halves of the test can be created in a variety of ways, the most common of which is the odd-even method, whereby one set of scores is obtained for the odd-numbered items in the test and another set for the even-numbered items.
Split-half methods of estimating reliability provide a simple solution to the two
practical problems that plague the alternate forms method:
(1) the difficulty in developing alternate forms and
(2) the need for two separate test administrations.
● The simplest way to create two alternate forms of a test is to split the existing test
in half and use the two halves as alternate forms.
● The split-half method of estimating reliability thus involves
(1) administering a test to a group of individuals,
(2) splitting the test in half, and
(3) correlating scores on one half of the test with scores on the other half. The
correlation between these two split halves is used in estimating the reliability of the
test.
ESTIMATING RELIABILITY BY THE SPLIT-HALF METHOD
Two sets of scores (odd and even) → product-moment correlation → reliability of the half test → reliability of the whole test (Spearman-Brown prophecy formula).
Spearman-Brown Prophecy Formula:
Reliability of whole test = (2 × reliability of half test) / (1 + reliability of half test)
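A minimal sketch of the whole odd-even procedure, including the Spearman-Brown correction; the 0/1 item responses below are invented for illustration.

```python
import numpy as np

# Hypothetical 0/1 item responses: rows = examinees, columns = items
items = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 0, 0],
])

odd_half = items[:, 0::2].sum(axis=1)    # scores on items 1, 3, 5, 7
even_half = items[:, 1::2].sum(axis=1)   # scores on items 2, 4, 6, 8

r_half = np.corrcoef(odd_half, even_half)[0, 1]   # reliability of the half test
r_whole = (2 * r_half) / (1 + r_half)             # Spearman-Brown prophecy formula
print(round(r_half, 3), round(r_whole, 3))
```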
Rulon formula
From a given test four types of scores are generated: the score on the even items, the score on the odd items, a difference score (the score on odd items minus the score on even items) and the total score (odd plus even). The variance of the difference scores and the variance of the total scores are then computed and put into the Rulon formula, R = 1 - (Vd / VT). Note that if the scores on the two halves were perfectly consistent, there would be no difference between the odd-item score and the even-item score, so the variance of the difference scores would be zero and the estimated R would equal 1. The ratio of the two variances in fact reflects the proportion of error variance which, when subtracted from 1, leaves the proportion of “true” variance, i.e. the reliability.
Flanagan formula
Advantages:
•No need to calculate reliability coefficients of the two halves
•Used to compute the reliability of alternate forms of the test
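Both formulas estimate the whole-test reliability directly from variances, which is why no half-test correlation is needed. Assuming the standard textbook forms, Rulon's formula is R = 1 - (Vd / VT), where Vd is the variance of the odd-minus-even difference scores, and Flanagan's is R = 2[1 - (Va + Vb) / VT], where Va and Vb are the variances of the two halves. A sketch with hypothetical half-test scores:

```python
import numpy as np

# Hypothetical odd-item and even-item totals for six examinees
odd_half = np.array([4, 1, 4, 1, 4, 1])
even_half = np.array([3, 2, 4, 0, 3, 1])

total = odd_half + even_half
diff = odd_half - even_half

v_total = total.var(ddof=1)
v_diff = diff.var(ddof=1)
v_odd, v_even = odd_half.var(ddof=1), even_half.var(ddof=1)

r_rulon = 1 - (v_diff / v_total)                    # Rulon: R = 1 - Vd/VT
r_flanagan = 2 * (1 - (v_odd + v_even) / v_total)   # Flanagan: R = 2[1 - (Va + Vb)/VT]
print(round(r_rulon, 3), round(r_flanagan, 3))      # the two estimates coincide
```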
INTERNAL CONSISTENCY RELIABILITY
“In psychometrics, an aspect of reliability associated with the degree to which the items of a test measure
the same construct or attribute.”-Oxford dictionary of Psychology.
So, it indicates the homogeneity of the test. If all the items of the test measure the same
function or trait, the test is said to be a homogeneous one and its internal consistency
would be high. From a single administration of a single form of the test it is possible to
arrive at a measure of reliability by various procedures.
Thus, the internal consistency method involves
● (1) administering a test to a group of individuals,
● (2) computing the correlations among all items and computing the average of those
intercorrelations, and
● (3) using Formula 7 or an equivalent formula to estimate reliability.
● This formula gives a standardized estimate; raw score formulas that take into account the
variance of different test items may provide slightly different estimates of internal
consistency reliability.
● There are both mathematical and conceptual ways of demonstrating the links between
internal consistency methods and the methods of estimating reliability discussed so far.
● First, internal consistency methods are mathematically linked to the split-half method. In
particular, coefficient alpha, which represents the most widely used and most general form
of internal consistency estimate, represents the mean reliability coefficient one would
obtain from all possible split halves.
● In particular, Cortina (1993) notes that alpha is equal to the mean of the split halves defined by formulas from Rulon (1939) and J. C. Flanagan (1937). In other words, if every possible split-half reliability coefficient for a 30-item test were computed, the average of those coefficients would equal coefficient alpha.
The difference between the split-half method and the internal
consistency method is, for the most part, a difference in unit of analysis.
Split-half methods compare one half-test with another; internal
consistency estimates compare each item with every other item.
In understanding the link between internal consistency and the general
concept of reliability, it is useful to note that internal consistency
methods suggest a fairly simple answer to the question, “Why is a test
reliable?” Remember that internal consistency estimates are a function
of (1) the number of test items and
● (2) the average intercorrelation among these test items.
If we think of each test item as an observation of behavior, internal
consistency estimates suggest that reliability is a function of
● (1) the number of observations that one makes and
● (2) the extent to which each item represents an observation of the
same thing observed by other test items. For example, if you wanted
to determine how good a bowler someone was, you would obtain
more reliable information by observing the person bowl many
frames than you would by watching the person roll the ball once.
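This relationship can be made explicit with the standardized form of coefficient alpha, alpha = k * r_bar / (1 + (k - 1) * r_bar), where k is the number of items and r_bar is the average inter-item correlation. The sketch below assumes an average inter-item correlation of 0.20, an arbitrary illustrative value.

```python
def standardized_alpha(n_items: int, avg_r: float) -> float:
    """Standardized coefficient alpha from the number of items
    and the average inter-item correlation."""
    return (n_items * avg_r) / (1 + (n_items - 1) * avg_r)

# Assumed average inter-item correlation of 0.20
for k in (1, 5, 10, 20, 40):
    print(k, round(standardized_alpha(k, 0.20), 2))
# Reliability rises from 0.20 with one item to about 0.91 with forty items
```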
Kuder-Richardson formulae
KUDER and RICHARDSON (1937) carried out a series of studies to remove some of the difficulties of the split-half method. They devised their own formulae for estimating the internal consistency of a test:
KR20 is the basic formula for computing the reliability coefficient and KR21 is a modified form of KR20.
These techniques are based on an examination of performance on each item instead of two half scores.
Mathematically, the KR reliability coefficient is actually the mean of all split-half coefficients resulting from the different splittings of a test (Cronbach, 1951).
The KR formulae are applicable to tests whose items are scored as either 0 or +1 (wrong or right) or according to some other all-or-none system. Some tests, however, may have multiple-scored items, e.g. a personality inventory.
For calculating the reliability of such tests a generalized formula, i.e. Cronbach's alpha, is used. The sources of error variance here are content sampling and content heterogeneity.
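A minimal sketch of both computations, assuming the usual forms KR20 = [k/(k-1)] * [1 - Σpq / VT] and alpha = [k/(k-1)] * [1 - ΣVi / VT]. The response matrix is invented for illustration; for items scored 0/1 the two formulas give the same value.

```python
import numpy as np

def kr20(items: np.ndarray) -> float:
    """Kuder-Richardson formula 20 for a matrix of 0/1 item scores
    (rows = examinees, columns = items)."""
    k = items.shape[1]
    p = items.mean(axis=0)                 # proportion passing each item
    q = 1 - p
    var_total = items.sum(axis=1).var()    # variance of total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / var_total)

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha: the generalization of KR20 to items
    that are not scored in an all-or-none fashion."""
    k = items.shape[1]
    item_vars = items.var(axis=0).sum()
    var_total = items.sum(axis=1).var()
    return (k / (k - 1)) * (1 - item_vars / var_total)

# Hypothetical dichotomous data: the two estimates coincide
data = np.array([[1, 1, 0, 1], [0, 1, 0, 0], [1, 1, 1, 1],
                 [0, 0, 0, 1], [1, 0, 1, 1], [0, 1, 0, 0]])
print(round(kr20(data), 3), round(cronbach_alpha(data), 3))
```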
SCORER RELIABILITY
It is the reliability which can be estimated by having a sample of tests independently scored by two or more examiners or scorers. The two sets of scores obtained from the two examiners are then correlated in the usual way, and the resulting correlation coefficient is known as scorer reliability. This method is most appropriate for tests where the judgment of the scorer is required, such as tests of creativity and projective tests. The source of error variance in scorer reliability is interscorer differences.
● Both the split-half and the internal consistency methods define measurement error
strictly in terms of consistency or inconsistency in the content of a test.
● Test-retest and alternate forms methods, both of which require two test
administrations, define measurement error in terms of three general factors:
● (1) the consistency or inconsistency of test content (in the test-retest method,
content is always consistent);
● (2) changes in examinees over time; and
● (3) the effects of the first test on responses to the second test. Thus, although each
method is concerned with reliability, each defines true score and error in a
somewhat different fashion.
● Schmidt, Le, and Ilies (2003) note that all of these sources of error can operate
simultaneously, and that their combined effects can be substantially larger than
reliability estimates that take only one or two effects into account.
● The principal advantage of internal consistency methods is their practicality.
● Since only one test administration is required, it is possible to estimate internal
consistency reliability every time the test is given.
● Although split-half methods can be computationally simpler, the widespread
availability of computers makes it easy to compute coefficient alpha, regardless of
the test length or the number of examinees. It therefore is possible to compute
coefficient alpha whenever a test is used in a new situation or population.
THE GENERALIZABILITY OF TEST SCORES
Reliability theory tends to classify all the factors that may affect test scores into two
components, true scores and random errors of measurement.
Although this sort of broad classification may be useful for studying physical measurements, it
is not necessarily the most useful way of thinking about psychological measurement
(Lumsden, 1976).
We typically think of the reliability coefficient as a ratio of true score to true score plus error,
but if the makeup of the true score and error parts of a measure change when we change our
estimation procedures, something is seriously wrong.
In reliability theory, the central question is how much random error there is in our measures.
In generalizability theory, the focus is on our ability to generalize from one set of measures
(e.g., a score on an essay test) to a set of other plausible measures (e.g., the same essay graded
by other teachers).
The central question in generalizability theory concerns the conditions over which one can generalize, or under what sorts of conditions we would expect results that are either similar to or different from those obtained here. Generalizability theory attacks this question by estimating how much of the variance in test scores is attributable to each of several plausible sources; this approach is also referred to as G theory (Cronbach, Gleser, Nanda, & Rajaratnam, 1972; Shavelson & Webb, 1991). Generalizability theory is an extension of classical
test theory that uses analysis of variance (ANOVA) methods to
evaluate the combined effects of multiple sources of error variance on
test scores simultaneously. A distinct advantage that G theory has—
compared to the method for combining reliability estimates —is that
it also allows for the evaluation of the interaction effects from
different types of error sources. Thus, it is a more thorough procedure
for identifying the error variance component that may enter into
scores. On the other hand, in order to apply the experimental designs
that G theory requires, it is necessary to obtain multiple observations
for the same group of individuals on all the independent variables that
might contribute to error variance on a given test (e.g., scores across
occasions, across scorers, across alternate forms, etc.).
On the whole, however, when this is feasible, the results provide a
better estimate of score reliability than the approaches described
earlier.
ANOVA
Analysis of variance as a technique for estimating reliability has been used by HOYT (1941), JACKSON (1939) and ALEXANDER (1947). Four assumptions given by HOYT need to be kept in mind when using the ANOVA technique:
1. The total score of an examinee on a test can be divided into 4
independent components:
• A component which is common to all examinees and to
all items on the test
• A component associated with items only
• A component associated with examinees only
• The error component independent of the first three factors
2. The variance of the error component is equal for every item.
3. The error component for each item is symmetrical and normally distributed.
4. The error components of any two distinct items are independent.
Disadvantage: it cannot be used with a test where speed is an important factor.
Advantage: the ANOVA approach can be applied to data obtained from alternate forms and test-retest administrations.
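A minimal sketch of the persons-by-items ANOVA decomposition that Hoyt's method uses; the score matrix is invented for illustration. The reliability estimate is (MSpersons - MSresidual) / MSpersons, which for this design is algebraically equivalent to coefficient alpha.

```python
import numpy as np

def hoyt_reliability(scores: np.ndarray) -> float:
    """Hoyt's ANOVA estimate of reliability for a persons-by-items
    score matrix (rows = examinees, columns = items)."""
    n, k = scores.shape
    grand = scores.mean()

    ss_persons = k * ((scores.mean(axis=1) - grand) ** 2).sum()
    ss_items = n * ((scores.mean(axis=0) - grand) ** 2).sum()
    ss_total = ((scores - grand) ** 2).sum()
    ss_residual = ss_total - ss_persons - ss_items

    ms_persons = ss_persons / (n - 1)
    ms_residual = ss_residual / ((n - 1) * (k - 1))
    return (ms_persons - ms_residual) / ms_persons

# Hypothetical item scores for six examinees on four items
data = np.array([[3, 4, 3, 5], [2, 2, 1, 3], [4, 5, 4, 5],
                 [1, 2, 1, 2], [3, 3, 4, 4], [2, 1, 2, 2]])
print(round(hoyt_reliability(data), 3))
```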
The central assumption of reliability theory is that measurement
errors are essentially random.
This does not mean that errors arise from random or mysterious processes; rather, across a large number of individuals, the causes of measurement error are assumed to be so varied and complex that measurement errors act as random variables.
Thus, a theory that assumes that measurement errors are essentially
random may provide a pretty good description of their effects.
If errors have the essential characteristics of random variables, then it
is reasonable to assume that errors are equally likely to be positive or
negative and that they are not correlated with true scores or with
errors on other tests.
That is, it is assumed that
1. The mean error of measurement = 0
2. True scores and errors are uncorrelated
3. Errors on different measures are uncorrelated. On the basis of these
three assumptions, an extensive theory of test reliability has been
developed (Gulliksen, 1950; F. M. Lord & Novick, 1968).
FACTORS INFLUENCING RELIABILITY
EXTRINSIC FACTORS
Factors that lie outside the test itself and tend to make the test reliable or unreliable:
a) Group variability: when the group of examinees being tested is homogeneous in ability, the reliability of the test scores is likely to be lower. Only when there is some variability in the group are correlation and reliability possible.
b) Environmental conditions: the testing environment should be uniform. Arrangements should
be such that light, sound, and other comforts are equal and uniform to all examinees
otherwise it will tend to lower the reliability coefficient.
c) Momentary fluctuations in the examinee: E.g. a broken pencil, changes in anxiety level,
motivation, distraction.
d) Guessing by the examinees: this has two important effects upon the total test scores:
- it tends to raise the total score, making the reliability coefficient very high;
- it contributes to the measurement error, since examinees differ in their luck in guessing the correct answer.
E.g., true-false and multiple-choice items.
INTRINSIC FACTORS
Factors which lie within the test itself and influence the reliability of the test:
a) Range of the total scores: if the obtained total scores on the test are very close to each other
the reliability of the test is lowered.
b) Length of the test: longer tests tend to yield a higher reliability coefficient than a shorter test.
It has been demonstrated that averaging the test scores of several applications essentially
gives the same result as increasing the length of the test. When lengthening, care has to be
taken to see that added items should have the same variance and the same inter-item
correlations as the items of the original test. With the Spearman-Brown formula it is possible to estimate the length of the test required to achieve a given level of the reliability coefficient (a short sketch of this calculation follows this list).
The use of this formula makes two assumptions:
-new items added to the original test must have the same statistical properties as the original
test items, i.e. same average difficulty value and same inter item correlation.
-added items should not influence the examinee’s response.
Ebel (1972) has shown that doubling the length of a test quadruples true variance while only
doubling the error variance.
c) Homogeneity of items: this includes two things: item reliability (or inter-item correlation) and the homogeneity of the function or trait measured from one item to another. When the items measure different functions and the intercorrelations of the items are zero or near zero (a heterogeneous test), the reliability is zero or very low.
d) Difficulty value of items: items having indices of difficulty at 0.5 or close to it yield higher reliability than items with extreme indices of difficulty.
e) Discrimination value: when the test is composed of discriminating items, the inter-item correlation is likely to be high, and then the reliability is also likely to be high.
f) Scorer reliability, also known as reader reliability, means how closely two or more scorers agree in scoring or rating the same set of responses. If they do not agree, the reliability is likely to be lowered.
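A minimal sketch of the lengthening calculation mentioned under point (b) above, using the general Spearman-Brown formula; the current reliability of 0.60 and the target of 0.90 are assumed values for illustration.

```python
def spearman_brown(r: float, n: float) -> float:
    """Reliability of a test lengthened by a factor of n,
    given the reliability r of the current test."""
    return (n * r) / (1 + (n - 1) * r)

def lengthening_factor(r_current: float, r_desired: float) -> float:
    """How many times longer the test must be to reach r_desired."""
    return (r_desired * (1 - r_current)) / (r_current * (1 - r_desired))

n = lengthening_factor(0.60, 0.90)
print(round(n, 1))                        # 6.0: the test must be about six times longer
print(round(spearman_brown(0.60, n), 2))  # check: the lengthened test reaches 0.90
```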
IMPROVING RELIABILITY
The following suggestions are useful for controlling the factors that adversely affect the
reliability of the test:
1. The group of examinees should be heterogeneous, that is, they should vary widely in the ability or trait being measured.
2. Items should be homogeneous.
3. The test should preferably be a long one.
4. Items, as far as possible, should be of moderate difficulty values (0.4 to 0.6).
5. Items should be discriminatory ones.
Apart from these general suggestions there are two common approaches to improving
the reliability of the test:
•The first approach emphasizes increasing the length of the test and assumes that if new items similar to the original set of items are added, the reliability of the test will tend to increase.
•The second approach to improving reliability is to discard the items that pull down the reliability. It assumes that for increasing the reliability, it must be ensured that all items measure the same thing. Under this approach two techniques are commonly applied: factor analysis and item analysis.