RELIABILITY
Reliability Defined
Reliability – refers to the consistency of scores obtained by the same person when re-examined with the same test on different occasions, with different sets of equivalent items, or under other variable examining conditions.
This mainly refers to the attribute of
consistency in measurement.
Remember…
Measurement error is common in all fields of science.
Tests that are relatively free of measurement error are considered reliable, while tests that contain relatively large measurement error are considered unreliable.
Classical Test Score Theory
Classical Test Score Theory – assumes that each person has a true score that would be obtained if there were no errors in measurement.
Because measuring instruments are imperfect, the observed score for each person almost always differs from the person’s true ability or characteristic.
Classical Test Score Theory
Measurement error – the difference between the observed score and the true score.
E = X - T
where E = error, X = observed score, T = true score
Standard error of measurement – the standard
deviation of the distribution of errors for each
repeated application of the same test on an
individual.
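A minimal Python sketch of this idea, using made-up numbers: one person’s observed scores are simulated as X = T + E over many hypothetical administrations, and the standard error of measurement emerges as the standard deviation of the simulated errors.

```python
import numpy as np

rng = np.random.default_rng(0)

true_score = 100                           # T: the person's (unknowable) true score
errors = rng.normal(0.0, 5.0, size=1000)   # E: random errors; an SD of 5 is assumed
observed = true_score + errors             # X = T + E for each repeated administration

# The standard error of measurement is the standard deviation
# of the distribution of errors across repeated applications.
sem = (observed - true_score).std(ddof=1)
print(round(sem, 2))  # close to the assumed error SD of 5
```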
Remember…
Error (E) can either be positive or negative. If E is positive, the Obtained Score (X) will be higher than the True Score (T); if E is negative, then X will be lower than T.
Although it is impossible to eliminate all
measurement error, test developers do strive to
minimize psychometric nuisance through careful
attention to the sources of measurement error.
It is important to stress that true score is never
known.
Classical Test Score Theory
Factors that contribute to consistency:
These consist entirely of the stable attributes of the individual that the examiner is trying to measure.
Factors that contribute to inconsistency:
These include characteristics of the individual,
test, or situation, which have nothing to do with
the attribute being measured, but which
nonetheless affect test scores.
Classical Test Score Theory
Domain Sampling Model
There is a problem in using a limited number of items to represent a larger and more complicated construct.
A sample of items is used in place of the infinite pool of items that defines the construct.
The greater the number of items, the higher the
reliability.
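A brief Python sketch of that claim using the Spearman-Brown prophecy formula, the standard classical-test-theory projection of reliability when a test is lengthened; the starting reliability of .60 is an assumed value for illustration.

```python
def spearman_brown(r, n):
    """Projected reliability when a test is lengthened by a factor of n
    with comparable items (Spearman-Brown prophecy formula)."""
    return (n * r) / (1 + (n - 1) * r)

r_original = 0.60   # assumed reliability of the original item sample
for n in (1, 2, 3, 4):
    print(n, round(spearman_brown(r_original, n), 2))
# Doubling, tripling, and quadrupling the number of comparable items
# raises the projected reliability: 0.60 -> 0.75 -> 0.82 -> 0.86
```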
Sources of Measurement Error
A. Item selection
One source of measurement error is the instrument itself.
A test developer must settle upon a finite number of items from a potentially infinite pool of test questions.
Which items should be included? How should they be
worded?
Although psychometricians strive to obtain
representative test items, the particular set of questions
chosen for a test might not be equally fair to all persons.
Sources of Measurement Error
B. Test Administration
General environmental conditions, such as uncomfortable room temperature, dim lighting, and excessive noise, may exert an untoward influence on the accuracy of measurement.
Momentary fluctuations in the anxiety, motivation, attention, and fatigue level of the test taker may also introduce sources of measurement error.
The examiner may also contribute to the measurement
error in the process of test administration.
Sources of Measurement Error
C. Test Scoring
Whenever a psychological test uses a format other than machine-scored multiple-choice items, some degree of judgment is required to assign points to answers.
Most tests have well-defined criteria for answers to each
question. These guidelines help minimize the impact of
subjective judgment in scoring.
Consistent and reliable scores have reliability coefficients near 1.0; conversely, tests that reflect a large amount of measurement error produce inconsistent and unreliable scores, and their reliability coefficients are close to 0.
Item Response Theory (IRT)
With the help of a computer, the item difficulty
is calibrated to the mental ability of the test
taker.
If you get several easy items correct, the computer then moves to more difficult items.
If you get several difficult items wrong, the computer moves back to average items.
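A toy Python sketch of that adaptive logic, not any particular IRT implementation; the difficulty scale, step size, and run-length rule are all made up for illustration.

```python
def next_difficulty(current, recent_answers, step=1.0):
    """Toy adaptive rule: move up after a run of correct answers,
    move back toward average after a run of incorrect answers."""
    if all(recent_answers):               # several items in a row correct
        return current + step             # present more difficult items
    if not any(recent_answers):           # several items in a row wrong
        return max(current - step, 0.0)   # move back toward average items
    return current                        # mixed performance: stay at this level

print(next_difficulty(2.0, [True, True, True]))     # 3.0 -> harder items
print(next_difficulty(3.0, [False, False, False]))  # 2.0 -> back toward average
```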
The Correlation Coefficient
A correlation coefficient (r) expresses the degree and direction of the linear relationship between two sets of scores obtained from the same persons.
It can take on values ranging from -1.00 to +1.00.
The Correlation Coefficient
When two measures have a positive (+) correlation, high scores on Y are associated with high scores on X, and low scores on Y with low scores on X.
When two measures have a negative (-) correlation, high scores on Y are associated with low scores on X, and vice versa.
Correlations of +1.00 are extremely rare in
psychological research and usually signify a trivial
finding.
Computation for the
Correlation Coefficient
Pearson Product-Moment Correlation
A statistic developed by Karl Pearson in which, for each subject, the deviation from the moment (mean) on the first variable is multiplied by the deviation from the mean on the second variable, yielding a product.
The Pearson r correlation coefficient takes into
account not only each subject’s ranking on the
first and second variable, but also the amount of
his or her deviation above or below the mean.
Computation for the
Correlation Coefficient
There are numerous mathematically equivalent
formulas for computing Pearson r, but the following
equation will be used because of its conceptual
simplicity:
rxy = ∑xy / [(N)(sdx)(sdy)]
Where: rxy = Pearson r
∑xy = summation of the products of the deviation scores on x and y
N = number of paired scores
sdx, sdy = standard deviations of x and y
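A small Python sketch of that deviation-score formula with made-up scores; population standard deviations (ddof=0) are used so the result matches the N in the denominator, and it should agree with np.corrcoef.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson r via the deviation-score formula: sum(xy) / (N * sd_x * sd_y)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dev_x, dev_y = x - x.mean(), y - y.mean()   # deviation scores
    n = len(x)
    return (dev_x * dev_y).sum() / (n * x.std() * y.std())  # population SDs (ddof=0)

# Hypothetical scores of five examinees on two measures
x = [10, 12, 14, 16, 18]
y = [ 3,  5,  4,  8,  9]
print(round(pearson_r(x, y), 3))   # same value as np.corrcoef(x, y)[0, 1]
```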
Correlation Interpretation Guide
+1.00            Perfect positive correlation
+0.75 to +1.00   Very high positive correlation
+0.50 to +0.75   High positive correlation
+0.25 to +0.50   Moderately small positive correlation
 0.00 to +0.25   Very small positive correlation
 0.00            No correlation
 0.00 to -0.25   Very small negative correlation
-0.25 to -0.50   Moderately small negative correlation
-0.50 to -0.75   High negative correlation
-0.75 to -1.00   Very high negative correlation
-1.00            Perfect negative correlation
FORMS OF
RELIABILITY
A. Test-Retest Reliability
It is established by comparing the scores obtained from
two successive measurements of the same individuals
and calculating a correlation between the two sets of
scores.
It is also known as time sampling reliability since it
measures the error associated with administering a
test at two different times.
This is appropriate only for traits or characteristics that do not change over time (e.g., IQ).
A. Test-Retest Reliability
Example: You took an IQ test today and you will take it
again after exactly a year. If your scores are almost the
same (e.g. 105 and 107), then the measure has a good
test-retest reliability.
Error variance – corresponds to the random
fluctuations of performance from one test session to
the other.
Clearly, this type of reliability is only applicable to
stable traits.
Limitations of Test-Retest Reliability
Carryover effect – occurs when the first
testing session influences the results of the
second session and this can affect the test-
retest reliability of a psychological measure.
Practice effect – a type of carryover effect
wherein the scores on the second test
administration are higher than they were on
the first.
Remember…
If the results of the first and second administrations have a low correlation, it might mean that:
The test has poor reliability
A major change has occurred in the subjects between the first and second administrations
A combination of low reliability and major change has occurred
Sometimes a poor test-retest correlation does not mean that the test is unreliable. It might mean that the variable under study has changed.
B. Parallel Forms Reliability
It is established when at least two different versions of
the test yield almost the same scores.
It is also known as item sampling reliability or alternate
forms reliability since it compares two equivalent forms
of a test that measure the same attribute to make sure
that the items indeed assess a specific characteristic.
The correlation between the scores obtained on the
two forms represents the reliability coefficient of the
test.
B. Parallel Forms Reliability
Examples:
▪ The Purdue Non-Language Test (PNLT) has Forms A and B, and both yield nearly identical scores for the test taker.
▪ The SRA Verbal Form has parallel forms A and B, and both yield almost identical scores for the test taker.
B. Parallel Forms Reliability
The error variance in this case represents fluctuations in performance from one set of items to another, but not fluctuations over time.
Both forms should contain the same number of items, and the items should be expressed in the same form and should cover the same type of content. The range and level of difficulty of the items should also be equal.
Instructions, time limits, illustrative examples, format
and all other aspects of the test must likewise be
checked for equivalence.
Limitations of Parallel Forms Reliability
This is one of the most rigorous and burdensome assessments of reliability, since test developers have to create two forms of the same test.
Practical constraints make it difficult to retest
the same group of individuals.
C. Inter-rater Reliability
It is the degree of agreement between two observers who simultaneously record measurements of the same behavior.
Examples:
▪ Two psychologists observe the aggressive behavior of
elementary school children. If their individual records of the
construct are almost the same, then the measure has a good
inter-rater reliability.
▪ Two parents evaluate the ADHD symptoms of their child. If they both yield identical ratings, then the measure has good inter-rater reliability.
C. Inter-rater Reliability
This uses the kappa statistic in order to assess
the level of agreement among several raters
using nominal scales.
Kappa Coefficient     Qualitative Interpretation
> 0.75                Excellent agreement
0.40 – 0.75           Satisfactory agreement
< 0.40                Poor agreement
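A minimal Python sketch of Cohen's kappa for two raters on a nominal scale; the ratings below are made up, and for more than two raters an extension such as Fleiss' kappa would typically be used instead.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Kappa = (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - chance) / (1 - chance)

# Hypothetical ratings of ten children as aggressive (1) or not aggressive (0)
rater_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
rater_b = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]
print(round(cohens_kappa(rater_a, rater_b), 2))  # 0.8 -> excellent agreement (> 0.75)
```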
D. Split-Half Reliability
It is obtained by splitting the items on a
questionnaire or test in half, computing a separate
score for each half, and then calculating the
degree of consistency between the two scores for
a group of participants.
The test can be divided according to the odd- and even-numbered items (odd-even system).
D. Split-Half Reliability
This model of reliability measures the internal
consistency of the test which is the degree to which
each test item measures the same construct. It is
simply the intercorrelations among the items.
If all items on a test measure the same construct, then
it has a good internal consistency.
Spearman-Brown, Kuder-Richardson, and Cronbach’s
alpha are the formulae used to measure the internal
consistency of a test.
D. Split-Half Reliability
Spearman-Brown Formula
A statistic that allows a test developer to estimate what the correlation between the two halves would have been if each half had been the length of the whole test, assuming the halves have equal variances.
rSB = 2rhh / (1 + rhh)
Where: rSB = estimated reliability of the full-length test
rhh = correlation between the two halves of the test
Note: You have to compute the correlation between the two halves (rhh) first before getting rSB.
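A minimal Python sketch of the odd-even split and the Spearman-Brown correction; the 5-examinee by 6-item score matrix is made up for illustration.

```python
import numpy as np

# Hypothetical item scores: 5 examinees (rows) x 6 items (columns)
items = np.array([
    [1, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
])

# Odd-even system: score each half separately for every examinee
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

r_hh = np.corrcoef(odd_half, even_half)[0, 1]   # correlation between the two halves
r_sb = (2 * r_hh) / (1 + r_hh)                  # Spearman-Brown corrected reliability
print(round(r_hh, 2), round(r_sb, 2))
```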
D. Split-Half Reliability
Cronbach’s coefficient alpha
A statistic that allows the test developer to confirm that a test has substantial reliability even when the two halves of the test have unequal variances.
α = [k / (k - 1)] [1 - (∑σ²i / σ²x)]
Where: k = number of test items
σ²x = variance of the observed total score
σ²i = variance of component i for the current sample of persons
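A minimal Python sketch of coefficient alpha following that formula; the 5-examinee by 4-item Likert-type score matrix is made up.

```python
import numpy as np

def cronbach_alpha(items):
    """alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores)."""
    items = np.asarray(items, float)                 # rows = examinees, columns = items
    k = items.shape[1]                               # number of test items
    item_variances = items.var(axis=0, ddof=1)       # variance of each component
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of observed total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical scores: 5 examinees x 4 Likert-type items
scores = [[4, 5, 4, 4],
          [2, 3, 2, 3],
          [5, 5, 4, 5],
          [3, 3, 3, 2],
          [1, 2, 2, 1]]
print(round(cronbach_alpha(scores), 2))
```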
D. Split-Half Reliability
Kuder-Richardson 20 (KR20) Formula
The statistic used for calculating the reliability of a test in which the items are dichotomous, i.e., scored as 0 or 1.
KR20 = [K / (K - 1)] [1 - (∑pq / σ²x)]
Where: K = number of test items
σ²x = variance of the total test score
p = proportion of people getting each item correct
q = proportion of people getting each item incorrect
∑pq = sum of the products of p times q for each item
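A minimal Python sketch of KR20 for dichotomously scored items; the right/wrong response matrix is made up, and sample variance (ddof=1) is used to stay consistent with the alpha sketch above.

```python
import numpy as np

def kr20(items):
    """KR20 = (K / (K - 1)) * (1 - sum(p * q) / variance of total scores)."""
    items = np.asarray(items, float)                 # rows = examinees, columns = 0/1 items
    k = items.shape[1]                               # number of test items
    p = items.mean(axis=0)                           # proportion correct for each item
    q = 1 - p                                        # proportion incorrect for each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of total test scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_variance)

# Hypothetical right (1) / wrong (0) responses: 5 examinees x 6 items
responses = [[1, 1, 0, 1, 1, 0],
             [1, 1, 1, 1, 0, 1],
             [0, 1, 0, 0, 1, 0],
             [1, 1, 1, 1, 1, 1],
             [0, 0, 1, 0, 0, 0]]
print(round(kr20(responses), 2))
```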
Brief Synopsis of Methods for
Estimating Reliability
Method                                 No. of Forms   No. of Sessions   Sources of Error Variance
Test-Retest                            1              2                 Changes over time
Alternate Forms (Immediate)            2              1                 Item sampling
Alternate Forms (Delayed)              2              2                 Item sampling, changes over time
Split-Half (Spearman-Brown)            1              1                 Item sampling, nature of split
Coefficient Alpha & Kuder-Richardson   1              1                 Item sampling, test heterogeneity
Inter-Rater                            1              1                 Scorer differences
Which Type of Reliability is
Appropriate?
For tests that have two forms, use parallel forms reliability.
For tests that are designed to be
administered to an individual more than
once, use test-retest reliability.
For tests with factorial purity, use Cronbach’s
coefficient alpha.
Which Type of Reliability is
Appropriate?
For tests with items carefully ordered
according to difficulty, use split-half
reliability.
For tests which involve some degree of
subjective scoring, use inter-rater reliability.
For tests which involve dichotomous items or
forced choice items, use KR20.
What to Do About Low Reliability
Increase the number of items.
Use factor analysis and item analysis.
Use the correction for attenuation formula.
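A short Python sketch of the standard correction for attenuation, which estimates what the correlation between two measures would be if both were perfectly reliable; the observed correlation and reliabilities below are made-up values.

```python
from math import sqrt

def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimated true-score correlation, given the observed correlation (r_xy)
    and the reliabilities of the two measures (r_xx, r_yy)."""
    return r_xy / sqrt(r_xx * r_yy)

# Made-up example: observed r = .40, with reliabilities of .70 and .65
print(round(correct_for_attenuation(0.40, 0.70, 0.65), 2))  # about 0.59
```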
ANY QUESTIONS?
The task we must set
for ourselves is not to
feel secure, but to be
able to tolerate
insecurity.
- Erich Fromm
THANK YOU!