
REPORTERS

Eusebio, Neshema Faith
Matillano, Abegail Joy
Danieles, Allysa Mae
Amoguez, Apple Jane
Bienes, Jessica
Unasin, Francis Mae
Chapter 6
ITEM ANALYSIS AND VALIDATION
LEARNING OUTCOMES

• Explain the meaning of item analysis, item validity, reliability, item difficulty, and discrimination index
• Determine the validity and reliability of given test items
• Determine the quality of a test item by its difficulty index, discrimination index, and plausibility of options (for a selected-response test)
ITEM ANALYSIS
• It is the act of analyzing student responses to
individual exam questions with the intention of
evaluating exam quality.
• It is an important tool to uphold test
effectiveness and fairness.
• It will provide information that will allow the
teacher to decide whether to revise or replace
an item.
Two important characteristics of an item

a) Difficulty Index
b) Discrimination Index
Difficulty Index

The difficulty index is defined as the number of students who are able to answer the item correctly divided by the total number of students. Thus:

Item difficulty = number of students with correct answer / total number of students

The item difficulty is usually expressed as a percentage.


Encoded Pilot Testing Result

Item difficulty = number of students with correct answer / total number of students
= 14/20
= 0.7 (70%)
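
A minimal Python sketch of the same computation, classifying the result with the ranges from the table on the next slide (the variable names are illustrative, not from the source):

```python
# Difficulty index for the pilot-test example above:
# 14 of 20 students answered the item correctly.
correct = 14
total = 20

difficulty = correct / total
print(f"Item difficulty: {difficulty:.2f} ({difficulty:.0%})")  # 0.70 (70%)

# Classify using the ranges from the table on the next slide.
if difficulty <= 0.25:
    print("Difficult -- revise or discard")
elif difficulty <= 0.75:
    print("Right difficulty -- retain")
else:
    print("Easy -- revise or discard")
```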
Encoded Pilot Testing Result
Range of Difficulty Index | Interpretation   | Action
0 - 0.25                  | Difficult        | Revise or discard
0.26 - 0.75               | Right difficulty | Retain
0.76 and above            | Easy             | Revise or discard

Encoded Pilot Testing Result
Discrimination Index

The discrimination index refers to the power of an item to discriminate between students who scored high and those who scored low on the overall test.
Encoded Pilot Testing Result
Index of discrimination = DU − DL (U = upper group; L = lower group)

Take the upper 25% of the class (highest scorers) and the lower 25% of the class (lowest scorers):

25% × total no. of students
= 0.25 × 20
= 5, i.e., 5 students in the upper 25% and 5 students in the lower 25%
Encoded Pilot Testing Result
Discrimination Index = difficulty index of upper 25% − difficulty index of lower 25%
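
A minimal Python sketch of this computation, assuming hypothetical counts of correct answers in each group and classifying the result with the index-range table on the next slide:

```python
# 25% x 20 students = 5 students in the upper group, 5 in the lower group.
group_size = int(0.25 * 20)           # = 5

# Assumed counts of correct answers per group for one item (illustrative).
upper_correct = 4
lower_correct = 1

p_upper = upper_correct / group_size  # difficulty index of upper 25%
p_lower = lower_correct / group_size  # difficulty index of lower 25%

discrimination = p_upper - p_lower    # DU - DL
print(f"Discrimination index: {discrimination:.2f}")  # 0.60

# Classify using the index-range table on the next slide.
if discrimination >= 0.46:
    print("Discriminating item -- include")
elif discrimination > -0.50:
    print("Non-discriminating -- revise")
else:
    print("Can discriminate but item is questionable -- discard")
```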
Encoded Pilot Testing Result
Index Range   | Interpretation                            | Action
-1.0 - -0.50  | Can discriminate but item is questionable | Discard
-0.55 - 0.45  | Non-discriminating                        | Revise
0.46 - 1.0    | Discriminating item                       | Include
Encoded Pilot Testing Result
More Sophisticated Discrimination Index

Item discrimination is the ability of an item to differentiate among students based on their understanding of the test material. Traditional hand-calculation methods compare item responses to total test scores, but computerized analyses offer more accurate assessment by considering all student responses.
More Sophisticated Discrimination Index
The item discrimination index by ScorePak® is a
Pearson Product Moment correlation between student
responses to a specific item and total scores on all other items
on the test. It estimates the degree to which an individual item
is measuring the same thing as the rest of the items. The
discrimination index reflects the degree to which an item and
the test as a whole measure a unitary ability or attribute.
Values of the coefficient tend to be lower for tests measuring a
wide range of content areas than for more homogeneous tests.
Item discrimination indices should be interpreted in the
context of the test type, with items with low discrimination
indices often being ambiguously worded.
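
ScorePak® itself is proprietary, so the sketch below only mirrors the statistic described: a Pearson correlation between each item's scored responses and the total score on all other items. The data set is hypothetical, and scipy is assumed to be available:

```python
import numpy as np
from scipy.stats import pearsonr

# Scored responses (1 = correct, 0 = wrong): rows = students, columns = items.
responses = np.array([
    [1, 1, 1, 0, 1],
    [1, 0, 1, 1, 1],
    [1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0],
])

for item in range(responses.shape[1]):
    # Total score on all OTHER items, so the item is not correlated with itself.
    rest_score = responses.sum(axis=1) - responses[:, item]
    r, _ = pearsonr(responses[:, item], rest_score)
    # ScorePak's published bands: good > .30, fair .10-.30, poor < .10.
    label = "good" if r > 0.30 else ("fair" if r >= 0.10 else "poor")
    print(f"Item {item + 1}: discrimination = {r:+.2f} ({label})")
```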
More Sophisticated Discrimination Index

Tests with high internal consistency consist of items with mostly positive relationships with total test score. In practice, values of the discrimination index will seldom exceed .50 because of the differing shapes of item and total score distributions. ScorePak® classifies item discrimination as "good" if the index is above .30; "fair" if it is between .10 and .30; and "poor" if it is below .10.

A good item is one that has good discriminating ability and a sufficient level of difficulty (neither too difficult nor too easy).
More Sophisticated Discrimination Index

At the end of the item analysis report, test items are listed according to their degrees of difficulty (easy, medium, hard) and discrimination (good, fair, poor). These distributions provide a quick overview of the test and can be used to identify items which are not performing well and which can perhaps be improved or discarded.
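
A sketch of how such a report could be assembled, bucketing each item by the difficulty and discrimination bands used in this deck (the helper below is illustrative, not ScorePak's actual code):

```python
import numpy as np

def item_analysis_report(responses):
    """Bucket each item by difficulty (easy/medium/hard) and discrimination
    (good/fair/poor); responses: rows = students, columns = items (0/1).
    Assumes every item column has some variation in responses."""
    totals = responses.sum(axis=1)
    report = []
    for j in range(responses.shape[1]):
        p = responses[:, j].mean()                    # difficulty index
        rest = totals - responses[:, j]               # score on all other items
        r = np.corrcoef(responses[:, j], rest)[0, 1]  # discrimination index
        difficulty = "hard" if p <= 0.25 else ("easy" if p > 0.75 else "medium")
        discrim = "good" if r > 0.30 else ("fair" if r >= 0.10 else "poor")
        report.append((j + 1, round(p, 2), difficulty, round(r, 2), discrim))
    return report
```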
VALIDATION AND VALIDITY

Validation is the process of collecting and analyzing evidence to support the meaningfulness and usefulness of the test.

Validity is the extent to which a test measures what it purports to measure, or the appropriateness, correctness, meaningfulness, and usefulness of the specific decisions a teacher makes based on the test results.
There are essentially three main types
of evidence that may be collected: content-
related evidence of validity, criterion-
related evidence of validity and
construct-related evidence of validity.

Content-related evidence of
validity refers to the content and format of
the instrument.
Criterion-related evidence of
validity refers to the relationship between
scores obtained using the instrument and
scores obtained using one or more other
tests (often called the criterion).

Construct-related evidence of
validity refers to the nature of the
psychological construct or characteristic
being measured by the test.
The usual procedure for determining content validity may be described as follows:

• The teacher writes out the objectives of the test based on the Table of Specifications and then gives these, together with the test, to at least two (2) experts along with a description of the intended test takers.
• The experts look at the objectives, read over the items in the test, and place a check mark in front of each question or item that they feel does not measure one or more objectives.
• They also place a check mark in front of each objective not assessed by any item in the test.
• The teacher then rewrites any item checked and resubmits it to the experts, and/or writes new items to cover those objectives not covered by the existing test.
• This continues until the experts approve of all items and agree that all of the objectives are sufficiently covered by the test.
In order to obtain evidence of criterion-related validity, the teacher usually compares scores on the test in question with scores on some other independent criterion test which presumably already has high validity. This type of criterion-related validity is called concurrent validity. Another type of criterion-related validity is called predictive validity, wherein the test scores in the instrument are correlated with scores on a later performance (criterion measure) of the students.
In summary, content validity refers to how well the test items reflect the knowledge actually required for a given topic area (e.g., math).

Criterion-related validity is also known as concrete validity because criterion validity refers to a test's correlation with a concrete outcome. In the case of a pre-employment test, the two variables that are compared are test scores and employee performance.

There are two main types of criterion validity: concurrent validity and predictive validity.

● Concurrent validity refers to a comparison between the measure in question and an outcome assessed at the same time.
● In predictive validity, we ask this question: Do the scores in the NAT Math exam predict the Math grade in Grade 12?
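
A minimal sketch of how this predictive-validity question could be checked numerically, correlating NAT Math scores with later Grade 12 Math grades (the numbers below are purely illustrative):

```python
from scipy.stats import pearsonr

# Hypothetical paired data for ten students (illustrative only).
nat_math = [82, 75, 91, 68, 88, 79, 95, 72, 85, 77]
g12_math = [85, 78, 92, 70, 86, 80, 94, 75, 83, 81]

r, p_value = pearsonr(nat_math, g12_math)
print(f"Predictive validity coefficient: r = {r:.2f}")  # high r -> good predictor
```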
RELIABILITY

Reliability refers to the consistency of the scores obtained: how consistent they are for each individual from one administration of an instrument to another and from one set of items to another. The reliability of a test can be computed in several ways; for internal consistency, for instance, we could use the split-half method or the Kuder-Richardson formulae (KR-20 or KR-21).
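
The deck cites these formulas without showing them; below is a minimal Python sketch of both KR-20 and the split-half method (stepped up with the Spearman-Brown correction), assuming dichotomously scored (0/1) items:

```python
import numpy as np

def kr20(responses):
    """Kuder-Richardson formula 20; responses is a 2-D 0/1 array,
    rows = students, columns = items."""
    k = responses.shape[1]                   # number of items
    p = responses.mean(axis=0)               # proportion correct per item
    q = 1 - p                                # proportion incorrect per item
    total_var = responses.sum(axis=1).var()  # variance of total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

def split_half(responses):
    """Split-half reliability: correlate odd- and even-numbered half
    scores, then apply the Spearman-Brown correction for full length."""
    odd = responses[:, 0::2].sum(axis=1)
    even = responses[:, 1::2].sum(axis=1)
    r_half = np.corrcoef(odd, even)[0, 1]
    return 2 * r_half / (1 + r_half)
```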
Reliability and validity are related concepts. If an instrument is unreliable, it cannot yield valid outcomes. As reliability improves, validity may improve (or it may not). However, if an instrument is shown scientifically to be valid, then it is almost certain that it is also reliable.

Predictive validity compares the question with an outcome assessed at a later time. An example of predictive validity is a comparison of scores in the National Achievement Test (NAT) with first-semester grade point average (GPA) in college: do NAT scores predict college performance? Construct validity refers to the ability of a test to measure what it is supposed to measure. If, as a researcher, you intend to measure depression but actually measure anxiety, your research is compromised.
The following table is a standard followed almost universally in educational test and measurement.

Reliability   | Interpretation
.90 and above | Excellent reliability; at the level of the best standardized tests.
.80 - .90     | Very good for a classroom test.
.70 - .80     | Good for a classroom test; in the range of most. There are probably a few items which could be improved.
.60 - .70     | Somewhat low. This test needs to be supplemented by other measures (e.g., more tests) to determine grades. There are probably some items which could be improved.
.50 - .60     | Suggests need for revision of test, unless it is quite short (ten or fewer items). The test definitely needs to be supplemented by other measures (e.g., more tests) for grading.
.50 or below  | Questionable reliability. This test should not contribute heavily to the course grade, and it needs revision.
THANK YOU
FOR
LISTENING!
