
CLASS PRESENTATION

TEST RELIABILITY

Introduction

Today we are going to talk about test reliability. As future educators, we will be creating
tests and exams for our students, and it is important that we understand the concept of test
reliability to ensure that the assessments we create are valid and accurate.

For example, suppose you have two children, and each time you send them to get a piece of
information for you, one always returns with information you can trust, while the other returns
with information you cannot depend on. Which of the children would you say is reliable?

The same can be said about tests, exams and research findings. One important aspect of
research findings is the dependability or trustworthiness of the results obtained. To establish a
high level of trustworthiness in our tests or research findings, the measuring or research
instrument must itself possess the property of dependability or trustworthiness. This property of
dependability or trustworthiness is simply called reliability.

So, what exactly is test reliability? Test reliability refers to the consistency of results from
a test. In other words, if the same test is given to the same group of people at different times, the
results should be similar. In simple terms, reliability is consistency.

Test scores are reliable to the extent that they are consistent over ...

• different occasions of testing,
• different editions of the test, containing different questions or problems designed to
measure the same general skills or types of knowledge, and
• different scorings of the test takers’ responses, by different raters.

Why is test reliability important? Test reliability is important because it is a precondition for
the accuracy and validity of the assessments we create. If a test is not reliable, the results may
not accurately represent what the student actually knows. Reliability also ensures that the results
obtained from a test are dependable, consistent, and relatively free from measurement error.
Assumptions for Test Reliability

To effectively utilize test reliability, we need to make some assumptions:

1. Stability Assumption: assumes that the trait being measured is stable over time and does
not change significantly between test administrations.

2. Homogeneity Assumption: assumes that the construct being measured is consistent across
all items or tasks within the test.

3. Consistency Assumption: assumes that test takers respond consistently to the test items,
regardless of external factors or variations in test administration.

Types, forms or methods of Reliability

The choice of a type of reliability may depend on the number of times the instrument will
be administered (once or twice) or on whether agreement between persons and/or between
items is being evaluated.

There are five types or forms of reliability:

• Stability (also called test-retest reliability)
• Equivalence (also called alternate-form reliability)
• Equivalence and Stability
• Internal Consistency
• Scorer or Rater reliability

Stability or Test-retest reliability

This is the extent to which scores on the same test are consistent over time. It involves
giving one group the same instrument or test at two different times and then correlating the two
sets of scores. This type of reliability therefore evaluates the agreement between the two sets of
scores the same individuals obtain on the instrument.

There is no generally accepted rule of thumb for the time interval between the first and
second administrations, but a period of 2 to 6 weeks is commonly used.

The reliability statistic employed here is the bivariate correlation (e.g. Pearson’s Product
Moment Correlation, PPMC).

Example: Imagine we want to measure the test-retest reliability of a mathematics quiz. We
administer the quiz to a group of students and then re-administer it after two weeks. By
comparing the scores, we can determine the consistency of the test.
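As a rough sketch of how this coefficient might be computed (the scores below are hypothetical and the SciPy library is assumed to be available), note that the same bivariate correlation also serves the equivalence and equivalence-and-stability methods described next, with the two score sets coming from Form A and Form B rather than two administrations of one test:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical scores for the same ten students on the two administrations
# of the mathematics quiz (week 0 and week 2).
first_administration = np.array([12, 15, 9, 18, 14, 11, 16, 13, 10, 17])
second_administration = np.array([13, 14, 10, 17, 15, 10, 16, 12, 11, 18])

# Pearson's Product Moment Correlation (PPMC) between the two score sets
# is the test-retest (stability) reliability estimate.
r, _ = pearsonr(first_administration, second_administration)
print(f"Test-retest reliability (PPMC): r = {r:.2f}")
```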

Equivalence or Alternate Form Reliability

It is the degree to which two similar or equivalent forms of an instrument are consistent.
This could also be referred to as Parallel Forms Reliability. Hence, it establishes the relationship
between two versions of a test or research instrument (about the same construct) intended to be
equivalent. Alternate-forms reliability answers the question, “To what extent does the test takers
who perform well on one edition of the test also perform well on another edition?”

The two versions of the instrument are similar in that they:

a) measure the same construct, and
b) have the same number of items, level of difficulty, scoring and interpretation.

However, they differ in the wording of the specific items.

This method involves administering the two versions of the same instrument once to a
single group at the same time or almost the same time. The two sets of scores obtained are
statistically correlated using bivariate correlation (e.g., PPMC).

Example: In a psychology class, one group of students takes Test A, and the same group
takes Test B. Both tests cover the same content but have different questions. Equivalence
(alternate-form) reliability helps determine if the two tests yield similar results.
Equivalence and Stability

As the name implies, it somewhat combines the equivalent (alternate form) and stability
(test-retest) forms. It is aimed at establishing the relationship between equivalent versions of an
instrument administered to a single group at two different times; such that one version is
administered at a time, while the other version is administered at a later time.

Just as with the equivalent form, the two versions of the instrument:

(a) measure the same construct, and
(b) have the same number of items, level of difficulty, scoring and interpretation; but differ
in the wording of the specific items.

Bivariate correlation is often employed as the reliability statistic.

Example: In a psychology class, a single group of students takes Test A at the beginning of
the term and then takes Test B several weeks later. Both tests cover the same content but have
different questions. Equivalence-and-stability reliability helps determine if the two
administrations yield similar results.

Internal Consistency

It is the degree to which the items of an instrument are consistent among themselves and
with the test as a whole. It measures the extent to which the items are similar to one another in
content. It involves administering the instrument once to a single group, and then applying any
of these approaches:

(a) Split-half reliability,
(b) Kuder-Richardson reliability (KR-20 or KR-21),
(c) Cronbach’s alpha (α) reliability,
(d) McDonald’s omega (ω) reliability, and many others (e.g., Revelle’s beta, greatest lower
bound, GLB)
Split-half, KR, Alpha & Omega Reliability

Split-half reliability — after administering the instrument, (a) split the scores into two halves,
usually scores on the odd- and even-numbered items, (b) correlate the two sets of scores using
PPMC, and (c) apply the Spearman-Brown prophecy formula to estimate the reliability of the
full-length test.
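A rough sketch of these three steps, assuming a hypothetical matrix of 0/1 item scores (rows are examinees, columns are items) and using NumPy and SciPy:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical item scores: 8 examinees x 6 items, scored 0/1.
items = np.array([
    [1, 1, 0, 1, 1, 0],
    [1, 0, 1, 1, 0, 1],
    [0, 1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 0, 1],
])

# (a) Split the items into two halves: odd-numbered and even-numbered items.
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)

# (b) Correlate the two half-test scores with PPMC.
r_half, _ = pearsonr(odd_half, even_half)

# (c) Step the half-test correlation up to full-test length with the
# Spearman-Brown prophecy formula.
split_half_reliability = (2 * r_half) / (1 + r_half)
print(f"Half-test r = {r_half:.2f}, split-half reliability = {split_half_reliability:.2f}")
```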

Kuder-Richardson reliability — applies only to dichotomous (two) response options. After
administering the instrument, apply the KR-20 or KR-21 formula.
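A minimal sketch of KR-20, assuming a hypothetical matrix of dichotomously scored (0/1) items; KR-21 is a simplification that assumes all items are of equal difficulty:

```python
import numpy as np

def kr20(items: np.ndarray) -> float:
    """KR-20 for a matrix of 0/1 item scores (rows = examinees, columns = items)."""
    k = items.shape[1]                               # number of items
    p = items.mean(axis=0)                           # item difficulties (proportion correct)
    total_variance = items.sum(axis=1).var(ddof=1)   # sample variance of total scores
    return (k / (k - 1)) * (1 - np.sum(p * (1 - p)) / total_variance)

# Hypothetical responses: 6 examinees x 5 items.
scores = np.array([
    [1, 0, 1, 1, 1],
    [0, 1, 0, 1, 1],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 1],
])
print(f"KR-20 = {kr20(scores):.2f}")
```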

Cronbach’s alpha reliability (sometimes called tau-equivalent reliability) — Cronbach’s alpha
provides a measure of the extent to which the items on a test, each of which could be thought of
as a mini-test, provide consistent information with regard to students’ mastery of the domain. It
applies to more than two (polychotomous) response options. After administering the instrument,
apply the Cronbach’s alpha formula.

For an exam whose items are scored right or wrong, the formula can be written as:

α = [k / (k − 1)] × [1 − Σ pi(1 − pi) / σ²X]

where k is the number of items on the exam; pi, referred to as the item difficulty, is the
proportion of examinees who answered item i correctly; and σ²X is the sample variance of the
total scores.

Example: To illustrate, suppose that a five-item multiple-choice exam was administered with the
following proportions of correct response: p1 = .4, p2 = .5, p3 = .6, p4 = .75, p5 = .85, and
σ²X = 1.84. Then Σ pi(1 − pi) = (.4)(.6) + (.5)(.5) + (.6)(.4) + (.75)(.25) + (.85)(.15) = 1.045,
and Cronbach’s alpha is α = (5/4) × (1 − 1.045/1.84) ≈ .54.
Cronbach’s alpha ranges from 0 to 1.00, with values close to 1.00 indicating high
consistency. Professionally developed high-stakes standardized tests should have internal
consistency coefficients of at least .90. Lower-stakes standardized tests should have internal
consistencies of at least .80 or .85. For a classroom exam, it is desirable to have a reliability
coefficient of .70 or higher.
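When items are scored on more than two levels (for example Likert items), alpha is usually computed from the item variances and the total-score variance. A minimal sketch with hypothetical data:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a matrix of item scores (rows = examinees, columns = items)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical 1-5 Likert responses: 6 respondents x 4 items.
responses = np.array([
    [4, 5, 4, 4],
    [3, 3, 2, 3],
    [5, 5, 5, 4],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
    [3, 2, 3, 3],
])
print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")
```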

McDonald’s omega reliability — is closely related to Cronbach’s alpha, but the McDonald’s
omega formula is applied instead.

Scorer or Rater Reliability

Rater or scorer reliability is the degree of consistency of subjective scores (from a
subjectively scored instrument) obtained from either (a) two or more raters/scorers/judges at one
time (inter-rater), or (b) one rater/scorer/judge at two or more different times (intra-rater). Hence,
inter-rater (between-rater) reliability evaluates how consistently two or more independent raters
assign the same score to a measure, behaviour or test at a particular time.

On the other hand, intra-rater (within-rater) reliability evaluates how consistently the same
rater assigns a score to a measure, behaviour or test at two or more different times. Statistics
often applied include Spearman’s rho, Cohen’s kappa, Krippendorff’s alpha, and the intraclass
correlation coefficient (ICC).
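As a minimal sketch of one of these statistics, here is Cohen’s kappa for two raters assigning categorical scores; the grades below are hypothetical:

```python
import numpy as np

def cohens_kappa(rater_a, rater_b) -> float:
    """Cohen's kappa: chance-corrected agreement between two raters on categorical scores."""
    rater_a, rater_b = np.asarray(rater_a), np.asarray(rater_b)
    categories = np.union1d(rater_a, rater_b)
    observed_agreement = np.mean(rater_a == rater_b)
    # Expected chance agreement, from each rater's marginal category proportions.
    expected_agreement = sum(
        np.mean(rater_a == c) * np.mean(rater_b == c) for c in categories
    )
    return (observed_agreement - expected_agreement) / (1 - expected_agreement)

# Hypothetical essay grades (A-D) assigned independently by two raters.
rater_1 = ["A", "B", "B", "C", "A", "D", "C", "B", "A", "C"]
rater_2 = ["A", "B", "C", "C", "A", "D", "B", "B", "A", "C"]
print(f"Cohen's kappa = {cohens_kappa(rater_1, rater_2):.2f}")
```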

Other Reliability Coefficients

Intraclass Correlation Coefficient (ICC): This determines the agreement or consistency
between multiple raters or observers.

Example: ICC can be used to assess the reliability of scores assigned by different judges
in a figure skating competition.
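There are several ICC forms; as a minimal sketch, the one-way random-effects ICC(1) is shown below with hypothetical judges’ scores (rows are skaters, columns are judges):

```python
import numpy as np

def icc_one_way(ratings: np.ndarray) -> float:
    """One-way random-effects ICC(1) for a ratings matrix (rows = subjects, columns = raters)."""
    n, k = ratings.shape
    subject_means = ratings.mean(axis=1)
    grand_mean = ratings.mean()
    # Between-subjects and within-subjects mean squares from a one-way ANOVA.
    ms_between = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((ratings - subject_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical figure-skating scores: 5 skaters rated by 3 judges.
scores = np.array([
    [8.5, 8.0, 8.5],
    [7.0, 7.5, 7.0],
    [9.0, 9.0, 8.5],
    [6.5, 6.0, 6.5],
    [8.0, 8.5, 8.0],
])
print(f"ICC(1) = {icc_one_way(scores):.2f}")
```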
