
RELIABILITY

Reliability Defined

 Reliability – refers to the consistency of scores obtained by the same person when re-examined with the same test on different occasions, with different sets of equivalent items, or under other variable examining conditions.

 This mainly refers to the attribute of consistency in measurement.
Remember…
 Measurement error is common in all fields of science.

 Tests that are relatively free of measurement error are considered reliable, while tests that contain relatively large measurement error are considered unreliable.
Classical Test Score Theory

 Classical Test Score Theory – assumes that each person has a true score that would be obtained if there were no errors in measurement.

 Measuring instruments are imperfect; therefore, the observed score for each person almost always differs from the person's true ability or characteristic.
Classical Test Score Theory

 Measurement error – the difference between the observed score and the true score.

E = X - T
where E = error, X = observed score, T = true score

 Standard error of measurement – the standard deviation of the distribution of errors over repeated applications of the same test to an individual.
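To make E = X - T and the standard error of measurement concrete, here is a minimal Python sketch (the true score of 100 and error SD of 5 are invented for illustration) that simulates repeated administrations to one person and recovers the SEM as the standard deviation of the errors:

```python
import numpy as np

rng = np.random.default_rng(0)

true_score = 100                       # T: the person's fixed (unknowable) true score
errors = rng.normal(0, 5, size=1000)   # E: random measurement error, SD = 5
observed = true_score + errors         # X = T + E, so E = X - T

print("Mean observed score:", round(observed.mean(), 1))         # close to T
print("Estimated SEM:", round((observed - true_score).std(), 2))  # close to 5
```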
Remember…
 Error (E) can be either positive or negative. If E is positive, the obtained score (X) will be higher than the true score (T); if E is negative, X will be lower than T.

 Although it is impossible to eliminate all measurement error, test developers strive to minimize this psychometric nuisance through careful attention to the sources of measurement error.

 It is important to stress that the true score is never known.
Classical Test Score Theory

 Factors that contribute to consistency:
 These consist entirely of the stable attributes of the individual which the examiner is trying to measure.

 Factors that contribute to inconsistency:
 These include characteristics of the individual, test, or situation which have nothing to do with the attribute being measured, but which nonetheless affect test scores.
Classical Test Score Theory

 Domain Sampling Model
 There is a problem in using a limited number of items to represent a larger and more complicated construct.

 A sample of items is utilized instead of the infinite pool of items for the construct.

 The greater the number of items, the higher the reliability.
Sources of Measurement Error
A. Item Selection
 One source of measurement error is the instrument itself. A test developer must settle upon a finite number of items from a potentially infinite pool of test questions.

 Which items should be included? How should they be worded?

 Although psychometricians strive to obtain representative test items, the particular set of questions chosen for a test might not be equally fair to all persons.
Sources of Measurement Error
B. Test Administration
 General environmental conditions may exert an untoward influence on the accuracy of measurement, such as uncomfortable room temperature, dim lighting, and excessive noise.

 Momentary fluctuations in the anxiety, motivation, attention, and fatigue level of the test taker may also introduce measurement error.

 The examiner may also contribute to measurement error in the process of test administration.
Sources of Measurement Error
C. Test Scoring
 Whenever a psychological test uses a format other than machine-scored multiple-choice items, some degree of judgment is required to assign points to answers.

 Most tests have well-defined criteria for answers to each question. These guidelines help minimize the impact of subjective judgment in scoring.

 Tests that yield consistent, reliable scores have reliability coefficients near 1.0; conversely, tests that contain a large amount of measurement error produce inconsistent, unreliable scores and have reliability coefficients close to 0.
Item Response Theory (IRT)

 With the help of a computer, the item difficulty is calibrated to the mental ability of the test taker.

 If you get several easy items correct, the computer will then move on to more difficult items.

 If you get several difficult items wrong, the computer moves back to average items.
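A toy Python sketch of this adaptive branching (the three difficulty levels and the one-step move rule are simplifications for illustration, not the calibration a real IRT engine performs):

```python
# Toy computerized-adaptive loop: difficulty moves up after correct
# answers and down after incorrect ones (illustration only).
DIFFICULTIES = ["easy", "average", "difficult"]

def next_difficulty(current: str, answered_correctly: bool) -> str:
    i = DIFFICULTIES.index(current)
    if answered_correctly:
        i = min(i + 1, len(DIFFICULTIES) - 1)  # correct -> harder items
    else:
        i = max(i - 1, 0)                      # incorrect -> easier items
    return DIFFICULTIES[i]

level = "average"
for correct in [True, True, False, False]:     # one test taker's responses
    level = next_difficulty(level, correct)
    print(level)  # difficult, difficult, average, easy
```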
The Correlation Coefficient
 A correlation coefficient (r) expresses the direction and magnitude of the linear relationship between two sets of scores obtained from the same persons.

 It can take on values ranging from -1.00 to +1.00.


The Correlation Coefficient
 When two measures have a positive (+) correlation, high scores on Y are associated with high scores on X, and low scores on Y with low scores on X.

 When two measures have a negative (-) correlation, high scores on Y are associated with low scores on X, and vice versa.

 Correlations of +1.00 are extremely rare in psychological research and usually signify a trivial finding.
Computation for the Correlation Coefficient
 Pearson Product-Moment Correlation
 A statistic developed by Karl Pearson in which, for each subject, the deviation from the moment (mean) on the first variable is multiplied by the deviation from the mean on the second variable, yielding a product.

 The Pearson r correlation coefficient takes into account not only each subject's ranking on the first and second variables, but also the amount of his or her deviation above or below the mean.
Computation for the Correlation Coefficient
 There are numerous mathematically equivalent formulas for computing Pearson r, but the following equation is used here because of its conceptual simplicity:

rxy = Σxy / (N · sdx · sdy)

where: rxy = Pearson r
       Σxy = sum of the products of deviation scores x and y
       N = number of paired scores
       sdx, sdy = standard deviations of X and Y
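A short Python sketch of this deviation-score formula, using made-up scores; note that np.std defaults to the population standard deviation, which matches the N in the denominator:

```python
import numpy as np

x = np.array([2, 4, 5, 7, 9], dtype=float)   # hypothetical scores, variable X
y = np.array([1, 3, 6, 6, 8], dtype=float)   # hypothetical scores, variable Y

dx, dy = x - x.mean(), y - y.mean()          # deviation scores
r = (dx * dy).sum() / (len(x) * x.std() * y.std())

print(round(r, 4))                           # ~0.947
print(round(np.corrcoef(x, y)[0, 1], 4))     # same value from NumPy
```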
Correlation Interpretation Guide

+1.00           Perfect positive correlation
+0.75 to +1.00  Very high positive correlation
+0.50 to +0.75  High positive correlation
+0.25 to +0.50  Moderately small positive correlation
 0.00 to +0.25  Very small positive correlation
 0.00           No correlation
 0.00 to -0.25  Very small negative correlation
-0.25 to -0.50  Moderately small negative correlation
-0.50 to -0.75  High negative correlation
-0.75 to -1.00  Very high negative correlation
-1.00           Perfect negative correlation
FORMS OF RELIABILITY
A. Test-Retest Reliability
 It is established by comparing the scores obtained from two successive measurements of the same individuals and calculating a correlation between the two sets of scores.

 It is also known as time sampling reliability since it measures the error associated with administering a test at two different times.

 It is used only for traits or characteristics that do not change over time (e.g., IQ).
A. Test-Retest Reliability
 Example: You take an IQ test today and take it again after exactly a year. If your scores are almost the same (e.g., 105 and 107), then the measure has good test-retest reliability.

 Error variance – corresponds to the random fluctuations of performance from one test session to the other.

 Clearly, this type of reliability is applicable only to stable traits.
Limitations of Test-Retest Reliability

 Carryover effect – occurs when the first testing session influences the results of the second session; this can affect the test-retest reliability of a psychological measure.

 Practice effect – a type of carryover effect wherein the scores on the second test administration are higher than they were on the first.
Remember…
 If the results of the first and second administrations have a low correlation, it might mean that:
 The test has poor reliability.
 A major change has occurred in the subjects between the first and second administrations.
 A combination of low reliability and major change has occurred.

 Sometimes a poor test-retest correlation does not mean that the test is unreliable. It might mean that the variable under study has changed.
B. Parallel Forms Reliability
 It is established when at least two different versions of the test yield almost the same scores.

 It is also known as item sampling reliability or alternate forms reliability, since it compares two equivalent forms of a test that measure the same attribute to make sure that the items indeed assess a specific characteristic.

 The correlation between the scores obtained on the two forms represents the reliability coefficient of the test.
B. Parallel Forms Reliability
Examples:
▪ The Purdue Non-Language Test (PNLT) has Forms A and B, and both yield nearly identical scores for the test taker.

▪ The SRA Verbal Form has parallel Forms A and B, and both yield almost identical scores for the test taker.

B. Parallel Forms Reliability
 The error variance in this case represents fluctuations in performance from one set of items to another, but not fluctuations over time.

 The two forms should contain the same number of items, and the items should be expressed in the same form and should cover the same type of content. The range and level of difficulty of the items should also be equal. Instructions, time limits, illustrative examples, format, and all other aspects of the test must likewise be checked for equivalence.
Limitations of Parallel Forms Reliability

 It is one of the most rigorous and burdensome assessments of reliability, since test developers have to create two forms of the same test.

 Practical constraints also make it difficult to retest the same group of individuals.
C. Inter-rater Reliability
 It is the degree of agreement between two observers who simultaneously record measurements of the same behavior.

 Examples:
▪ Two psychologists observe the aggressive behavior of elementary school children. If their individual records of the construct are almost the same, then the measure has good inter-rater reliability.

▪ Two parents evaluate the ADHD symptoms of their child. If they both yield identical ratings, then the measure has good inter-rater reliability.
C. Inter-rater Reliability
 This uses the kappa statistic in order to assess the level of agreement among several raters using nominal scales.

Kappa Coefficient   Qualitative Interpretation
> 0.75              Excellent agreement
0.40 – 0.75         Satisfactory agreement
< 0.40              Poor agreement
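The table above can be applied to a value computed from raw ratings. Below is a small Python sketch of Cohen's kappa for two raters (the ratings are invented; kappa corrects observed agreement for the agreement expected by chance):

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters using nominal categories."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[k] * c2[k] for k in c1) / n**2   # chance agreement
    return (observed - expected) / (1 - expected)

# Hypothetical nominal ratings of ten children by two observers
r1 = ["aggressive", "calm", "calm", "aggressive", "calm",
      "calm", "aggressive", "calm", "calm", "calm"]
r2 = ["aggressive", "calm", "calm", "aggressive", "aggressive",
      "calm", "aggressive", "calm", "calm", "calm"]

print(round(cohens_kappa(r1, r2), 2))  # 0.78 -> excellent agreement
```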
D. Split-Half Reliability

 It is obtained by splitting the items on a questionnaire or test in half, computing a separate score for each half, and then calculating the degree of consistency between the two scores for a group of participants.

 The test can be divided according to the odd- and even-numbered items (odd-even system).
D. Split-Half Reliability
 This model of reliability measures the internal consistency of the test, which is the degree to which each test item measures the same construct. It is simply the intercorrelation among the items.

 If all items on a test measure the same construct, then the test has good internal consistency.

 The Spearman-Brown formula, the Kuder-Richardson formulas, and Cronbach's alpha are used to measure the internal consistency of a test.
D. Split-Half Reliability

 Spearman-Brown Formula
 A statistic which allows a test developer to estimate what the correlation between the two halves would have been if each half had been the length of the whole test, assuming the halves have equal variances.

rSB = 2rhh / (1 + rhh)

where: rSB = Spearman-Brown corrected reliability
       rhh = correlation between the two halves of the test

Note: You have to compute the correlation coefficient (rhh) first before getting rSB.
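A brief Python sketch of the odd-even split followed by the Spearman-Brown correction (the 0/1 item matrix is fabricated for illustration):

```python
import numpy as np

# Hypothetical item responses: rows = 6 examinees, columns = 8 items (1 = correct)
items = np.array([
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0],
])

odd_half = items[:, 0::2].sum(axis=1)    # scores on items 1, 3, 5, 7
even_half = items[:, 1::2].sum(axis=1)   # scores on items 2, 4, 6, 8

r_hh = np.corrcoef(odd_half, even_half)[0, 1]  # half-test correlation
r_sb = 2 * r_hh / (1 + r_hh)                   # Spearman-Brown correction

print(round(r_hh, 2), round(r_sb, 2))          # 0.79 -> 0.88
```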
D. Split-Half Reliability

 Cronbach's Coefficient Alpha
 A statistic which allows the test developer to confirm that a test has substantial reliability even when the two halves of a test have unequal variances.

α = [k / (k - 1)] × [1 - (Σσi² / σx²)]

where: k = number of test items
       σx² = variance of the observed total scores
       σi² = variance of component i for the current sample of persons
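A compact Python sketch of coefficient alpha under these definitions (the 5-point ratings are made up for illustration):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Coefficient alpha: rows are persons, columns are items."""
    k = items.shape[1]                          # k: number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

scores = np.array([                             # hypothetical 5-point ratings
    [4, 5, 4, 5],
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [1, 2, 2, 1],
])
print(round(cronbach_alpha(scores), 2))         # ~0.96
```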
D. Split-Half Reliability

 Kuder-Richardson 20 (KR20) Formula
 The statistic used for calculating the reliability of a test in which the items are dichotomous, scored 0 or 1.

KR20 = [K / (K - 1)] × [1 - (Σpq / σx²)]

where: K = number of test items
       σx² = variance of the total test scores
       p = proportion of people getting each item correct
       q = proportion of people getting each item incorrect
       Σpq = sum of the products of p times q for each item
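A matching Python sketch of KR20 for dichotomous items (again with an invented 0/1 response matrix):

```python
import numpy as np

def kr20(items: np.ndarray) -> float:
    """KR20 for dichotomously scored (0/1) items; rows are persons."""
    k = items.shape[1]                         # K: number of items
    p = items.mean(axis=0)                     # proportion correct per item
    q = 1 - p                                  # proportion incorrect per item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

responses = np.array([                         # hypothetical 0/1 responses
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
])
print(round(kr20(responses), 2))               # ~0.84
```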
Brief Synopsis of Methods for Estimating Reliability

Method                                 No. of Forms   No. of Sessions   Sources of Error Variance
Test-Retest                            1              2                 Changes over time
Alternate Forms (Immediate)            2              1                 Item sampling
Alternate Forms (Delayed)              2              2                 Item sampling, changes over time
Split-Half (Spearman-Brown)            1              1                 Item sampling, nature of split
Coefficient Alpha & Kuder-Richardson   1              1                 Item sampling, test heterogeneity
Inter-Rater                            1              1                 Scorer differences


Which Type of Reliability is Appropriate?
 For tests that have two forms, use parallel forms reliability.

 For tests that are designed to be administered to an individual more than once, use test-retest reliability.

 For tests with factorial purity, use Cronbach's coefficient alpha.
Which Type of Reliability is Appropriate?
 For tests with items carefully ordered according to difficulty, use split-half reliability.

 For tests which involve some degree of subjective scoring, use inter-rater reliability.

 For tests which involve dichotomous items or forced-choice items, use KR20.
What to Do About Low Reliability

 Increase the number of items (see the sketch below).

 Use factor analysis and item analysis.

 Use the correction for attenuation formula.
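To see why adding items helps, the general form of the Spearman-Brown prophecy formula (a standard psychometric result, not shown on the slides) predicts reliability after lengthening a test by a factor n; a quick Python sketch with a made-up starting reliability:

```python
def prophecy(r: float, n: float) -> float:
    """Spearman-Brown prophecy: reliability after lengthening a test n times."""
    return n * r / (1 + (n - 1) * r)

# A test with reliability .60, doubled and tripled in length (hypothetical)
print(round(prophecy(0.60, 2), 2))   # 0.75
print(round(prophecy(0.60, 3), 2))   # 0.82
```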


ANY QUESTIONS?

"The task we must set for ourselves is not to feel secure, but to be able to tolerate insecurity."

- Erich Fromm

THANK YOU!
