RELIABILITY
Title: "Halley's Comet"
In a small rural town, excitement buzzed as news spread about the impending arrival of Halley's
Comet. The townspeople were eager to witness this once-in-a-lifetime event. Among them was
an elderly couple, Mr. and Mrs. Thompson.
Mr. Thompson, a retired astronomer, had spent weeks preparing his telescope for the comet's
arrival. He wanted to share this special moment with his wife, who had no interest in astronomy
but adored her husband. The couple's communication was usually spot-on, but this time,
miscommunication loomed.
One evening, Mr. Thompson excitedly called out to his wife, "Honey, the comet will be visible
tonight! Come outside and take a look through the telescope with me!"
Mrs. Thompson, engrossed in her book, replied without looking up, "Sure, dear, I'll come out
right after I finish this chapter."
Hours passed, and Mr. Thompson set up his telescope, eagerly anticipating his wife's arrival.
Meanwhile, Mrs. Thompson, unaware of the comet's significance, thought her husband had been
referring to a regular star. She finally closed her book and stepped outside, expecting a routine
stargazing session.
As she approached the telescope, Mr. Thompson excitedly pointed to the night sky. "There it is,
Halley's Comet! Isn't it magnificent?"
Mrs. Thompson squinted through the telescope and saw what appeared to be an ordinary star.
She nodded politely, not wanting to disappoint her husband, and said, "Yes, dear, it's lovely."
Unbeknownst to each other, their definitions of "lovely" differed drastically. Mr. Thompson saw
the breathtaking beauty of the comet, while Mrs. Thompson simply saw a distant twinkle in the
night sky.
As the comet moved across the heavens, Mr. Thompson continued to share fascinating facts
about its history and significance. Mrs. Thompson listened attentively, trying her best to engage
in the conversation even though she didn't fully understand or share her husband's enthusiasm.
In the end, they both walked back to their house with smiles on their faces, each believing they
had shared a special moment. Though they had seen the same celestial event, their perspectives
and the depth of their understanding remained worlds apart.
“Consistency in measurement.”
▪ Reliability Coefficient
▪ An index of reliability
▪ Best statistic to use? VARIANCE
▪ it describes sources of test score variability
▪ it can be broken into components of:
▪ true variance – variance from true differences
▪ error variance – variance from irrelevant sources
▪ Reliability: the proportion of the total score variance attributed to true variance
▪ Reliability Coefficient
▪ An index of reliability
▪ A proportion that indicates the ratio between the true score variance on a test and the total variance
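In formula form (a standard classical test theory expression, not written out in the original notes; σ² denotes variance and rxx the reliability coefficient, matching the symbols used in the SEM section below):
rxx = σ²(true) / σ²(total) = σ²(true) / [σ²(true) + σ²(error)]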
A. Sources of Error Variance
1. Test Construction: differences in test items; item sampling (variation among items within a test, e.g., wording); content sampling (variation among items between tests)
2. Test Administration: attention or motivation, test environment (room temperature,
level of lighting, amount of ventilation and noise), test taker variables (emotional
problems, physical discomfort, lack of sleep, medication), examiner-related variables
(physical appearance and demeanor, presence or absence)
3. Test Scoring: scorers (subjectivity) and scoring system (technical glitches if computer-scored; procedural errors if hand-scored)
4. Other Sources: sampling error (the sample does not represent the population, e.g., a voter sample); methodological error (e.g., untrained interviewers, ambiguous wording in a questionnaire); systematic error (internal agreement, underreporting, overreporting); nonsystematic error (forgetting, failing to recognize behavior as abusive, misunderstanding of instructions)
B. Reliability Estimates
1. Test-Retest Reliability
▪ it is an estimate of reliability obtained by correlating pairs of scores from the same
people on two different administrations of the test
▪ it is appropriate when evaluating the reliability of a test that purports to measure
something that is relatively stable over time
▪ Example: personality trait
▪ Purpose: to evaluate the stability of a measure
▪ Typical Use: when assessing the stability of various traits
▪ Method/Procedures: 1. Administer the psychological test; 2. Get results; 3. Allow an interval (time gap); 4. Re-administer; 5. Get results; 6. Correlate
▪ Statistical Procedures: Pearson r or Spearman rho (Spearman rank); called a coefficient of stability when the interval is more than 6 months
▪ Error Variance: passage of time and test administration
▪ Disadvantages:
▪ Checking of answers
▪ Practice Effect
▪ Passage of Time
▪ Memory
▪ Fatigue
▪ Motivation
Example:
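The worked example is not reproduced in these notes; as a minimal sketch with invented scores (not real data), a test-retest correlation could be computed like this:

```python
# Hypothetical scores for the same 5 examinees tested twice, a few weeks apart.
time1 = [12, 18, 25, 30, 22]
time2 = [14, 17, 27, 29, 21]

def pearson_r(x, y):
    """Pearson product-moment correlation between two lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# A high positive coefficient suggests the trait is stable over the interval.
print(round(pearson_r(time1, time2), 2))
```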
2. Parallel-Form and Alternate-Form Reliability
▪ Coefficient of Equivalence: the degree of relationship between various forms of a test, obtained by means of alternate or parallel forms
▪ Parallel Forms: exist when for each form of the test, the means and the variances of
observed test scores are equal.
▪ Alternate Forms: different versions of a test that have been constructed so as to be
parallel; designed to be equivalent with respect to content and level of difficulty
▪ How to make alternate form?
▪ Same Number of Items: test 1: 100; test 2: 100
▪ Same Format: Likert Scale, Dichotomous Items
▪ Same Type: Achievement, Intelligence
▪ Same Language: English, Filipino
▪ Same Content: Has items that measure the same variable
▪ Same Level of Difficulty: both forms contain easy and difficult items in comparable proportions
▪ How to construct an alternate form?
1. Construct/Administer Original Test
2. Compute Item difficulty for each item
3. Construct/Administer the Alternate Form (Clone) to the same population
4. Compute item difficulty of alternate form items
5. Match items according to difficulty
6. Find new population
7. Administer the original test and alternate form
8. Correlate scores of the original and alternate tests
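As a rough sketch of steps 2, 4, and 8 (item difficulty taken as the proportion of testtakers answering an item correctly, then total scores on the two forms correlated; all data are hypothetical):

```python
from statistics import correlation  # Python 3.10+

# Rows = testtakers, columns = items; 1 = right, 0 = wrong (hypothetical data).
original  = [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1], [0, 1, 0, 1]]
alternate = [[1, 1, 0, 1], [1, 1, 0, 0], [1, 1, 1, 1], [0, 1, 1, 1]]

def item_difficulty(responses):
    """Proportion of testtakers who got each item right (steps 2 and 4)."""
    n = len(responses)
    return [sum(row[i] for row in responses) / n for i in range(len(responses[0]))]

print(item_difficulty(original))    # use these p-values to match items (step 5)
print(item_difficulty(alternate))

# Step 8: correlate total scores on the original and alternate forms.
print(correlation([sum(row) for row in original], [sum(row) for row in alternate]))
```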
▪ Purpose: to evaluate the relationship between different forms of a measure
▪ Typical Use: when there is a need for different forms of a test and to avoid practice effects
▪ Method/Procedures: 1. Administer the first test; 2. Administer the alternate test; 3. Score both tests; 4. Correlate
▪ Statistical Procedures: Pearson r or Spearman rho
▪ Error Variance: test construction or test administration
▪ Advantages:
▪ minimizes the effect of memory for the content of a previously administered form of the test
▪ Disadvantages:
▪ hard to construct, time consuming, expensive
3. Inter-Scorer Reliability
▪ also known as scorer reliability, judge reliability, observer reliability, and inter-rater
reliability
▪ it is the degree of agreement or consistency between two or more scorers with regard
to a particular measure
▪ Coefficient of inter-scorer reliability is the degree of consistency among scorers in the
scoring of a test
▪ Kappa statistics
▪ the best method for assessing the level of agreement among several observers.
▪ κ = (po − pe) / (1 − pe)
▪ where po is the relative observed agreement among raters, and pe is the
hypothetical probability of chance agreement, using the observed data to calculate
the probabilities of each observer randomly saying each category. If the raters are
in complete agreement then κ = 1. If there is no agreement among the raters other
than what would be expected by chance (as given by pe), κ ≤ 0.
▪ Example
              Reader B
              Yes    No
Reader A Yes   20     5
         No    10    15
▪ Note that there were 20 proposals that were granted by both reader A and reader
B, and 15 proposals that were rejected by both readers. Thus, the observed
proportionate agreement is po = (20 + 15) / 50 = 0.70
▪ To calculate pe (the probability of random agreement) we note that:
▪ Reader A said "Yes" to 25 applicants and "No" to 25 applicants. Thus reader A
said "Yes" 50% of the time.
▪ Reader B said "Yes" to 30 applicants and "No" to 20 applicants. Thus reader B
said "Yes" 60% of the time.
▪ Therefore the probability that both of them would say "Yes" randomly is 0.50 · 0.60 = 0.30, and the probability that both of them would say "No" is 0.50 · 0.40 = 0.20. Thus the overall probability of chance agreement is pe = 0.3 + 0.2 = 0.5, which gives κ = (0.70 − 0.50) / (1 − 0.50) = 0.40.
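A minimal sketch of the same computation in Python, using the counts from the 2 × 2 table above:

```python
# Agreement counts from the example: rows = Reader A, columns = Reader B.
yes_yes, yes_no = 20, 5    # Reader A said Yes
no_yes,  no_no  = 10, 15   # Reader A said No
n = yes_yes + yes_no + no_yes + no_no            # 50 proposals

p_o = (yes_yes + no_no) / n                      # observed agreement = 0.70
a_yes = (yes_yes + yes_no) / n                   # Reader A says Yes 50% of the time
b_yes = (yes_yes + no_yes) / n                   # Reader B says Yes 60% of the time
p_e = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)  # chance agreement = 0.50

kappa = (p_o - p_e) / (1 - p_e)
print(round(kappa, 2))                           # 0.4
```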
▪ Purpose: to evaluate the level of agreement between raters on a measure
▪ Typical Use: when researchers need to show that there is consensus in the way different raters view a particular behavior pattern
▪ Method/Procedures: 1. Get at least 2 raters; 2. Teach them the scoring system; 3. Administer the test to sample testtakers; 4. Let the 2 raters rate the test; 5. Correlate
▪ Statistical Procedures: Cohen’s kappa, Pearson r, or Spearman rho
▪ Error Variance: scoring and interpretation
4. Internal Consistency
▪ also known as inter-item consistency
▪ estimates the reliability of a test without developing an alternate form and without having to administer the test twice to the same people
▪ it is the consistency or homogeneity of the items of a test
▪ Purpose: to evaluate the extent to which items on a scale relate to one another
▪ Typical Use: when evaluating the homogeneity of a measure
▪ Method/Procedures: depends on the statistical procedure used
▪ Statistical Procedures: Pearson r between equivalent test halves with the Spearman-Brown correction, Kuder-Richardson for dichotomous items, coefficient alpha for multipoint items, or APD
▪ Error Variance: test construction
4. Internal Consistency (SPLIT-HALF RELIABILITY)
▪ mini parallel forms
▪ applicable only to assessment of homogeneity of the test
▪ obtained by correlating two pairs of scores obtained from equivalent halves of a single
test administered once
▪ Method:
1. Randomly group the test items into two halves
2. Administer the test once to the subjects
3. Total each half separately
4. Correlate the two half-scores!
Assignment of Items: Odd-even (odd-even reliability) or Random assignment or by content
and/or difficulty
▪ Advantages:
▪ Time efficient
▪ Addresses issues about two forms or two administrations
▪ Disadvantages:
▪ Reliability is based on 50% of the test
▪ Solution → Spearman Brown Formula
▪ estimates the reliability of the half test if it were lengthened to the WHOLE test
▪ METHOD: ×2 (the whole test is treated as twice the length of each half)
▪ Not for heterogeneous tests (tests that measure different factors)
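A sketch of the split-half procedure with the Spearman-Brown correction for doubled test length, rSB = 2·rhh / (1 + rhh); the half-test totals below are hypothetical:

```python
from statistics import correlation  # Python 3.10+

# Hypothetical odd-item and even-item totals for 6 examinees on one administration.
odd_half  = [10, 14, 9, 16, 12, 11]
even_half = [11, 13, 10, 15, 13, 10]

r_half = correlation(odd_half, even_half)   # reliability based on 50% of the test
r_whole = (2 * r_half) / (1 + r_half)       # Spearman-Brown estimate for the WHOLE test
print(round(r_half, 2), round(r_whole, 2))
```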
4. Internal Consistency (KUDER-RICHARDSON KR20)
▪ For homogeneous tests as well, but primarily those with dichotomous items that can be scored right or wrong
▪ Method:
1. Administer the test
2. Score the test
3. Tally, for each item, how many testtakers got it right versus wrong (a RATIO, e.g., 2 right : 10 wrong)
4. Then compute
Disadvantages:
• More difficult computation
• Broader range of difficulty
Solution → KR21
• For items of equal difficulty
• For easier computation
Disadvantage of Both:
• does not work for non-objective tests
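A sketch of the KR20 computation using the usual formula KR20 = (k / (k − 1)) · (1 − Σ pq / σ²total), where p is the proportion answering an item correctly, q = 1 − p, and σ²total is the variance of total scores; the responses are hypothetical:

```python
from statistics import pvariance

# Rows = testtakers, columns = dichotomous items (1 = right, 0 = wrong); hypothetical data.
responses = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
]

k = len(responses[0])                                          # number of items
n = len(responses)                                             # number of testtakers
p = [sum(row[i] for row in responses) / n for i in range(k)]   # proportion right per item
pq_sum = sum(pi * (1 - pi) for pi in p)                        # sum of item variances (p * q)
total_var = pvariance([sum(row) for row in responses])         # variance of total scores

kr20 = (k / (k - 1)) * (1 - pq_sum / total_var)
print(round(kr20, 2))
```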
4. Internal Consistency (CRONBACH’S ALPHA)
▪ mean of all possible split-half correlations
▪ unlike KR20, it is appropriately used on tests containing nondichotomous items
▪ for objective and non-objective tests, e.g., Likert-type items
▪ the preferred statistic for obtaining an estimate of internal consistency reliability
▪ typically ranges from 0 (absolutely no similarity) to 1 (perfect similarity) to help answer questions about how similar sets of data are
▪ Method:
1. Administer the test to the subjects
2. Score every item and compute the variance of each item and the variance of the total scores
3. Apply the alpha formula
▪ Disadvantage:
▪ Does not measure degree of difference
▪ Using and Interpreting a Coefficient of Reliability
Cronbach's alpha Internal consistency
α ≥ 0.9 Excellent
0.9 > α ≥ 0.8 Good
0.8 > α ≥ 0.7 Acceptable
0.7 > α ≥ 0.6 Questionable
0.6 > α ≥ 0.5 Poor
0.5 > α Unacceptable
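A sketch of coefficient alpha for multipoint (e.g., Likert-type) items, using the common formula α = (k / (k − 1)) · (1 − Σ σ²item / σ²total); the ratings are hypothetical:

```python
from statistics import pvariance

# Rows = respondents, columns = Likert-type items scored 1-5 (hypothetical ratings).
ratings = [
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 4, 5],
    [2, 3, 2, 2],
    [4, 4, 5, 4],
]

k = len(ratings[0])                                              # number of items
item_vars = [pvariance([row[i] for row in ratings]) for i in range(k)]
total_var = pvariance([sum(row) for row in ratings])             # variance of total scores

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 2))   # interpret against the table above (e.g., >= 0.9 is excellent)
```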
4. Internal Consistency (AVERAGE PROPORTIONAL DISTANCE METHOD)
▪ rather than focusing on similarity between scores on items of a test, it focuses on the
degree of difference that exists between item scores
▪ Method:
1. Calculate the absolute difference between scores for all of the items
2. Average the difference between scores
3. Obtain the APD by dividing the average difference between scores by the number of
response options on the test, minus one.
.2 or lower is excellent internal consistency
.25 to .3 is the acceptable range
unlike Cronbach’s alpha, it is not connected to the number of items in a measure
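Under one reading of the three steps (pairwise absolute differences between item scores for each respondent, averaged, then divided by the number of response options minus one), a hypothetical APD computation might look like this:

```python
from itertools import combinations

# Rows = respondents, columns = items on a 5-point scale (hypothetical ratings).
ratings = [
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 4, 5],
]
n_options = 5

def apd(rows, options):
    """Average proportional distance across respondents."""
    person_means = []
    for row in rows:
        # Step 1: absolute difference between scores for every pair of items
        diffs = [abs(a - b) for a, b in combinations(row, 2)]
        # Step 2: average those differences
        person_means.append(sum(diffs) / len(diffs))
    # Step 3: divide by the number of response options minus one
    return (sum(person_means) / len(person_means)) / (options - 1)

print(round(apd(ratings, n_options), 2))   # .2 or lower = excellent; .25-.3 = acceptable
```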
What will make an item unreliable?
▪ Items are TOO LONG!
▪ Items are vaguely/unclearly written
How to increase reliability?
▪ Eliminate items that are unclear
▪ Standardize the conditions under which the test is taken
▪ Moderate the degree of difficulty of the tests
▪ Minimize the effects of external events
▪ Standardize instructions
▪ Maintain consistent scoring procedures
SUMMARY of Reliability Estimates
▪ Test-retest: 2 testing sessions; 1 test form; error variance from administration; Pearson r or Spearman rho
▪ Alternate-forms: 1 or 2 testing sessions; 2 test forms; error variance from test construction or administration; Pearson r or Spearman rho
▪ Internal consistency: 1 testing session; 1 test form; error variance from test construction; Pearson r or Spearman rho, Spearman Brown correction, KR21, α
▪ Inter-scorer: 1 testing session; 1 test form; error variance from scoring and interpretation; Pearson r or Spearman rho
C. Using and Interpreting a Coefficient of Reliability
▪ Purpose:
▪ If a specific test is designed for use at various times over the course of a period of time, it would be reasonable to expect the test to demonstrate reliability across time (test-retest).
▪ For a test designed for a single administration only, an estimate of internal
consistency would be the reliability of choice.
▪ If the purpose is to break down the error variance into its parts, then a number of reliability coefficients would have to be calculated.
▪ Nature
▪ Homogeneity vs Heterogeneity
▪ Homogeneity: uniform functionality/one factor (e.g. internal consistency)
▪ Heterogeneity: more than one factor (e.g, test-retest)
▪ Dynamic Characteristics vs Static Characteristics
▪ Dynamic Characteristics: a trait, state, or ability presumed to be ever-changing as a function of situational and cognitive experiences
▪ Static Characteristics: relatively unchanging
▪ Restriction of Range vs Inflation of Range
▪ Restriction of Range: variance is restricted by the sampling procedure (restricted range → correlation coefficient goes down)
▪ Inflation of Range: variance is inflated by the sampling procedure (inflated range → correlation coefficient goes up)
▪ Speed Test vs Power Test
▪ Speed Test: time pressured; consistency of response speed
▪ Power Test: performance; right or wrong
▪ Criterion-Referenced Test
▪ provide an indication of where a testtaker stands with respect to some
variable or criterion
D. Alternative to the True Score Model
▪ True Score Theory – estimate the portion of a test score that is attributable to
error
▪ Domain Sampling Theory – estimate the extent to which specific sources of
variation under defined conditions are contributing to test score
▪ Generalizability Theory – given the exact same conditions of all the facets in
the universe, the exact same test score should be obtained
▪ Facets – number of items, amount of training, purpose of test
administration
▪ Universe – particular test situations
▪ Item Response Theory – the probability that a person with X amount of ability will be able to perform at a level of Y
▪ also known as latent-trait theory (“latent” = unobservable)
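As an illustration of the idea (the notes do not give a formula; the two-parameter logistic model below is a standard IRT form, with a as item discrimination and b as item difficulty):

```python
import math

def irt_2pl(theta, a, b):
    """Probability that a person with ability theta answers the item correctly (2PL model)."""
    return 1 / (1 + math.exp(-a * (theta - b)))

# When ability equals the item's difficulty, the probability of success is 0.5.
print(irt_2pl(theta=0.0, a=1.2, b=0.0))   # 0.5
print(irt_2pl(theta=1.0, a=1.2, b=0.0))   # higher ability -> higher probability
```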
E. Standard Error of Measurement
▪ Also known as Standard Error of Scores
▪ provides a measure of the precision of an observed test score
▪ or, provides an estimate of the amount of error inherent in an observed score of
measurement
▪ The higher the SEM, the lower the reliability, and vice versa
▪ Formula:
▪ σmeas = σ√(1 − rxx)
▪ σ is the standard deviation of test scores
▪ rxx is equal to the reliability coefficient of the test
▪ Confidence Intervals
▪ 68% → ±1σmeas
▪ 95% → ±2σmeas
▪ 99% → ±3σmeas
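A sketch of the SEM formula and the ±1/2/3 SEM confidence bands around an observed score (the standard deviation, reliability, and observed score below are hypothetical):

```python
import math

sd_scores   = 15      # standard deviation of test scores (hypothetical)
reliability = 0.91    # reliability coefficient rxx (hypothetical)
observed    = 106     # one examinee's observed score (hypothetical)

sem = sd_scores * math.sqrt(1 - reliability)   # sigma-meas = 15 * sqrt(0.09) = 4.5
print(round(sem, 2))

# Approximate confidence intervals following the +/-1, +/-2, +/-3 SEM rule above.
for z, level in [(1, "68%"), (2, "95%"), (3, "99%")]:
    print(level, observed - z * sem, "to", observed + z * sem)
```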
Connecting Sources of Error with Reliability Assessment Method
▪ Time sampling: Example – same test given at two points in time; Method – test-retest; How assessed – correlation (Pearson r or Spearman’s rho)
▪ Item sampling: Example – different items used to assess the same attribute; Method – alternate forms or parallel forms; How assessed – correlation (Pearson r or Spearman’s rho)
▪ Internal consistency: Example – consistency of items within the same test; Methods – 1. Split-half, 2. KR20, 3. Alpha; How assessed – 1. Ordinal/Composite, 2. Kuder-Richardson, 3. Cronbach’s Alpha
▪ Observer differences: Example – different observers recording; Method – kappa statistic; How assessed – kappa coefficient / percentage agreement