CLASS PRESENTATION - Test Reliability
TEST RELIABILITY
Introduction
Today we are going to talk about test reliability. As future educators, we will be creating
tests and exams for our students, and it is important that we understand the concept of test
reliability to ensure that the assessments we create are valid and accurate.
For example, suppose you have two children, and each time you send them to get a piece of
information for you, one of them always returns with information you can trust, while the other
returns with information you cannot depend on. Which of the two children would you say is
reliable?
The same can be said about tests, exams, and research findings. One important aspect of our
research findings is the dependability or trustworthiness of the results obtained from them. To
establish a high level of trustworthiness in our tests or research findings, the measuring or
research instrument must itself possess the property of dependability or trustworthiness. This
property of dependability or trustworthiness is simply called reliability.
So, what exactly is test reliability? Test reliability refers to the consistency of results from
a test. In other words, if the same test is given to the same group of people at different times, the
results should be similar. In simple terms, reliability is consistency.
Test scores are reliable to the extent that they are consistent over repeated administrations of the test.
Why is test reliability important? Test reliability is important because it ensures that the
assessments we create are accurate and valid. If a test is not reliable, then the results may not be
an accurate representation of what the student actually knows. It also ensures that the results
obtained from a test are dependable, consistent, and free from errors.
Assumptions for Test Reliability
1. Stability Assumption:
Assumes that the trait being measured is stable over time and does not change
significantly between test administrations.
2. Homogeneity Assumption:
Assumes that the construct being measured is consistent across all items or tasks
within the test.
3. Consistency Assumption:
Assumes that the test-takers respond consistently to the test items, regardless of
external factors or variations in test administration.
The choice of a type of reliability may depend on the number of times the instrument will
be administered (which may be either once or twice), or on whether we are evaluating agreement
among persons (raters) and/or agreement among items.
Stability (Test-Retest) Reliability
This is the extent to which the scores on the same test are consistent over time. It involves
giving one group the same instrument or test at two different times, and then correlating the two
sets of scores. This type of reliability is, therefore, an evaluation of the stability of individuals'
scores on the same instrument across the two administrations.
There is no generally accepted rule of thumb for the time interval between the first and second
administrations, but a period of 2 to 6 weeks is commonly used.
The reliability statistic employed here is the bivariate correlation (e.g. Pearson’s Product
Moment Correlation, PPMC).
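As a quick illustration, here is a minimal Python sketch of the test-retest computation. The scores are hypothetical (ten students, two administrations a few weeks apart), and scipy.stats.pearsonr is used for the PPMC; the same correlation step would also apply to the equivalent-forms scores discussed next.

```python
# Minimal sketch: test-retest (stability) reliability via Pearson's r.
# The scores below are hypothetical, for illustration only.
from scipy.stats import pearsonr

# Scores of the same 10 students on the first and second administrations
first_admin = [12, 15, 9, 18, 14, 11, 16, 13, 17, 10]
second_admin = [13, 14, 10, 17, 15, 11, 15, 12, 18, 9]

# Pearson's Product Moment Correlation (PPMC) between the two sets of scores
r, p_value = pearsonr(first_admin, second_admin)
print(f"Test-retest reliability coefficient: r = {r:.2f}")
```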
Equivalence (Alternate Forms) Reliability
This is the degree to which two similar or equivalent forms of an instrument are consistent.
It could also be referred to as Parallel Forms Reliability. Hence, it establishes the relationship
between two versions of a test or research instrument (about the same construct) intended to be
equivalent. Alternate-forms reliability answers the question, “To what extent do test takers
who perform well on one edition of the test also perform well on another edition?”
This method involves administering the two versions of the same instrument once to a
single group at the same time or almost the same time. The two sets of scores obtained are
statistically correlated using bivariate correlation (e.g., PPMC).
Example: In a psychology class, one group of students takes Test A, and the same group
takes Test B. Both tests cover the same content but have different questions. Equivalent-forms
reliability helps determine whether the two tests yield similar results.
Equivalence and Stability
As the name implies, it somewhat combines the equivalent (alternate form) and stability
(test-retest) forms. It is aimed at establishing the relationship between equivalent versions of an
instrument administered to a single group at two different times; such that one version is
administered at a time, while the other version is administered at a later time.
Example: In a psychology class, one group of students takes Test A, and the same group
takes Test B several weeks later. Both tests cover the same content but have different questions.
Equivalence and stability reliability helps determine whether the two versions yield similar results over time.
Internal Consistency
It is the degree to which the items of an instrument are consistent among themselves and
with the test as a whole. It measures the extent to which the items are similar to one another in
content. It involves administering the instrument once to a single group, and then applying any of
these approaches:
Split-half reliability — after administering the instrument, (a) split the scores into two halves,
usually the scores on the odd- and even-numbered items, (b) correlate the two sets of half-test
scores using PPMC, (c) then apply the Spearman-Brown prophecy formula to estimate the
reliability of the full-length test (see the sketch below).
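Below is a minimal Python sketch of the split-half procedure, assuming hypothetical 0/1 item responses for six students on a ten-item test; numpy is used for the correlation, and the Spearman-Brown step-up is written out directly.

```python
# Minimal sketch: split-half reliability with the Spearman-Brown correction.
import numpy as np

# Hypothetical 0/1 responses: rows = students, columns = items 1..10
items = np.array([
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
    [1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
    [0, 0, 1, 0, 0, 1, 0, 1, 0, 0],
    [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
])

# (a) Split into halves: odd-numbered items vs. even-numbered items
odd_half = items[:, 0::2].sum(axis=1)    # items 1, 3, 5, 7, 9
even_half = items[:, 1::2].sum(axis=1)   # items 2, 4, 6, 8, 10

# (b) Correlate the two sets of half-test scores (PPMC)
r_half = np.corrcoef(odd_half, even_half)[0, 1]

# (c) Spearman-Brown prophecy formula: estimate full-length-test reliability
r_full = (2 * r_half) / (1 + r_half)
print(f"Half-test r = {r_half:.2f}, full-test reliability = {r_full:.2f}")
```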
Cronbach’s alpha (Kuder-Richardson) reliability — after administering the instrument once, the
reliability is estimated from the item statistics. For dichotomously scored items (the
Kuder-Richardson KR-20 form), the coefficient is

alpha = (k / (k − 1)) × (1 − Σ pi qi / σ²X),

where k is the number of items; pi, referred to as the item difficulty, is the proportion of
examinees who answered item i correctly; qi = 1 − pi; and σ²X is the variance of the total test
scores.

Example: To illustrate, suppose that a five-item multiple-choice exam was administered with the
following percentages of correct response: p1 = .40, p2 = .50, p3 = .60, p4 = .75, p5 = .85, and
σ²X = 1.84. Cronbach’s alpha would be calculated as follows:

Σ pi qi = (.40)(.60) + (.50)(.50) + (.60)(.40) + (.75)(.25) + (.85)(.15) = 1.045
alpha = (5 / 4) × (1 − 1.045 / 1.84) ≈ .54
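As a quick check of the worked example, here is a short Python sketch that reproduces the calculation from the item difficulties and total-score variance given above.

```python
# Check of the worked example: KR-20 form of Cronbach's alpha.
k = 5                                   # number of items
p = [0.40, 0.50, 0.60, 0.75, 0.85]      # item difficulties p_i from the example
var_total = 1.84                        # variance of the total scores, sigma^2_X

# For a dichotomous item, the item variance is p_i * (1 - p_i) = p_i * q_i
sum_item_var = sum(pi * (1 - pi) for pi in p)

alpha = (k / (k - 1)) * (1 - sum_item_var / var_total)
print(f"Cronbach's alpha = {alpha:.2f}")   # about 0.54
```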
Cronbach’s alpha ranges from 0 to 1.00, with values close to 1.00 indicating high
consistency. Professionally developed high-stakes standardized tests should have internal
consistency coefficients of at least .90. Lower-stakes standardized tests should have internal
consistencies of at least .80 or .85. For a classroom exam, it is desirable to have a reliability
coefficient of .70 or higher.
McDonald’s omega reliability - this is closely related to Cronbach’s alpha, but the McDonald’s
omega formula is applied instead.
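The sketch below shows how omega can be computed once a one-factor model has been fitted to the items; the standardized loadings used here are hypothetical values chosen only for illustration, not estimates from real data.

```python
# Minimal sketch: McDonald's omega for a unidimensional (one-factor) model.
# Hypothetical standardized factor loadings for five items
loadings = [0.70, 0.65, 0.80, 0.60, 0.75]

# Unique (error) variance of each item under standardized loadings
error_var = [1 - l ** 2 for l in loadings]

# Omega = (sum of loadings)^2 / [(sum of loadings)^2 + sum of error variances]
sum_loadings_sq = sum(loadings) ** 2
omega = sum_loadings_sq / (sum_loadings_sq + sum(error_var))
print(f"McDonald's omega = {omega:.2f}")
```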
Inter-Rater and Intra-Rater Reliability
Inter-rater (between raters) reliability evaluates the degree of agreement between two or more
raters who independently score the same measure, behaviour or test. On the other hand, intra-rater
(within rater) reliability evaluates how consistently the same rater assigns a score to a measure,
behaviour or test at two or more different times. The statistics often applied are Spearman’s rho,
Cohen’s kappa, Krippendorff’s alpha, or the Intra-class Correlation Coefficient (ICC).
Example: ICC can be used to assess the agreement among the scores assigned by different
judges in a figure skating competition (an inter-rater example).
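Below is a minimal Python sketch of a rater-agreement computation, assuming hypothetical pass/fail ratings from two judges on the same ten performances and using scikit-learn's cohen_kappa_score.

```python
# Minimal sketch: inter-rater agreement with Cohen's kappa.
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings by two judges of the same 10 performances
rater_1 = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
rater_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "fail", "pass", "fail", "fail"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa = {kappa:.2f}")
```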
References
Chukwuedo, S. O. (2021). Conceptualizing the forms of reliability and its types in quantitative
behavioral, education and social research. Research-Statistics Mind.
https://youtu.be/0qcYNJa1a7l