Item Analysis and Evaluation Statistical Analysis of Assessment Data

An item analysis examines student responses to test items to assess item and test quality. It provides useful information for improving student learning and developing better future tests. The item difficulty index measures the proportion of students answering correctly. The discrimination index measures how well items differentiate between higher- and lower-scoring students. Together, these indices help evaluate item quality and identify items needing revision or removal to improve the test's validity in measuring the intended construct.

Uploaded by

Robert Terrado

ITEM ANALYSIS

After you create your assessment items and give your test, how can you be sure that the items are appropriate -- not too difficult and not too easy? How will you know if the test effectively differentiates between students who do well on the overall test and those who do not? An item analysis is a valuable procedure that teachers can use to answer both of these questions.
What is item analysis?
- It is a process that examines student responses to individual test items in order to assess the quality of those items and of the test as a whole.
Benefits derived from Item Analysis
1. It provides useful information for class
discussion of the test.
2. It provides data which helps students
improve their learning.
3. It provides insights and skills that lead
to the preparation of better tests in the
future.
2 main characteristics of an
item:
-Item Difficulty index
-Item Discrimination index
Item Difficulty Index
- a common and very useful analytical tool for statistical analysis, especially when it comes to determining the validity of test questions in an educational setting.
- It is often called the p-value because it is a measure of proportion.

Difficulty index (p) = number of students with the correct answer ÷ total number of students

The item difficulty is usually expressed as a percentage.

The higher the difficulty index, the easier the item is understood to be (Wood, 1960).
Example:
Out of the 20 students who
answered question five, only four
answered correctly.
4 ÷ 20 = 0.2
Because the resulting p-value is
closer to 0.0, we know that this is a
difficult question.
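The calculation above can be sketched in Python (a minimal illustration; the function name is my own):

```python
def difficulty_index(num_correct, num_students):
    """Proportion of students who answered the item correctly (the p-value)."""
    return num_correct / num_students

# Question five from the example: 4 of 20 students answered correctly.
p = difficulty_index(4, 20)  # 0.2 -- close to 0.0, so the item is difficult
```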
For example, let's say you gave a multiple choice quiz and
there were four answer choices (A, B, C, and D). The
following table illustrates how many students selected each
answer choice for Question #1 and #2.
Question   A     B     C     D
#1         0     3     24*   3
#2         12*   13    3     2

* Denotes correct answer
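From the answer-choice counts we can read off each question's difficulty index (each row sums to 30 students; a quick sketch with illustrative variable names):

```python
# Answer counts per choice: Question #1 -> C correct (24 of 30),
# Question #2 -> A correct (12 of 30).
total_students = 30
p_q1 = 24 / total_students  # 0.8 -- a fairly easy item
p_q2 = 12 / total_students  # 0.4 -- a noticeably harder item
```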


One problem with this type of difficulty
index is that it may not actually indicate
that the item is difficult or easy. A student
who does not know the subject matter will
naturally be unable to answer the item
correctly even if the question is easy. How
do we decide on the basis of this index
whether the item is too difficult or too easy?
Difficult items tend to discriminate between those who know the answer and those who do not.
Easy items cannot discriminate between these two groups of students.
We are therefore interested in deriving a
measure that will tell us whether an item
can discriminate between these two groups
of students. Such a measure is called an
index of discrimination.
Item Discrimination Index
- a measure of how well the item discriminates between examinees who are knowledgeable in the content area and those who are not.
There are several different formulas that calculate
item discrimination, but the one that is most
commonly used is called the point-biserial
correlation, which compares a test taker’s score on
an individual item with their score on the test
overall. For highly discriminating questions,
students who answer correctly are those who
have done well on the rest of the test.
The opposite is also true. Students who answer
highly discriminating questions incorrectly
tend to do poorly on the rest of the test as well.
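As a sketch of the standard point-biserial formula (the function name and sample data here are illustrative, not from the text):

```python
import math

def point_biserial(item, totals):
    """Point-biserial correlation between a 0/1 item column and total scores:
    r = (M1 - M0) / s * sqrt(p * q), where M1 and M0 are the mean totals of
    students who got the item right and wrong, s is the population standard
    deviation of the totals, p is the proportion correct, and q = 1 - p."""
    n = len(totals)
    right = [t for i, t in zip(item, totals) if i == 1]
    wrong = [t for i, t in zip(item, totals) if i == 0]
    m1, m0 = sum(right) / len(right), sum(wrong) / len(wrong)
    mean = sum(totals) / n
    s = math.sqrt(sum((t - mean) ** 2 for t in totals) / n)
    p = len(right) / n
    return (m1 - m0) / s * math.sqrt(p * (1 - p))

# Students who answered this item correctly also scored highest overall,
# so the item discriminates well (r close to +1).
r = point_biserial([1, 1, 1, 0, 0, 0], [10, 9, 8, 3, 2, 1])
```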
Item discrimination is measured on a scale from -1.0 to 1.0. Negative discrimination
indicates that students who are scoring highly
on the rest of the test are answering that
question wrong. This could mean that there is
a problem with the question, such as bias or
even a typo in the answer key. Test writers
should reevaluate questions that result in
negative discrimination because they do not
help to show mastery.
How to Find the Item Discrimination Index

Item discrimination index = (hc − lc) ÷ t

hc = number of students in the higher-scoring group who answered correctly
lc = number of students in the lower-scoring group who answered correctly
t = half of the class
1. Arrange your students from highest scorers to lowest scorers, with the highest scorers at the top.

2. Divide the table in half between high and low scorers, with an equal number of students on each side of the dividing line.

3. Subtract the number of students in the lower-scoring group who answered the question correctly (lc) from the number of students in the higher-scoring group who answered the question correctly (hc).

4. Divide the resulting number by the number of students on each side of your dividing line, which should be half of the class (t).
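The four steps above can be sketched as a short function (a minimal illustration assuming the 0/1 responses are already sorted from highest to lowest total score; the function name is my own):

```python
def discrimination_index(responses):
    """D = (hc - lc) / t for 0/1 responses sorted highest scorers first,
    with an even number of students."""
    t = len(responses) // 2      # half of the class
    hc = sum(responses[:t])      # correct answers in the upper half
    lc = sum(responses[t:])      # correct answers in the lower half
    return (hc - lc) / t

# Question 1 in the worked example: 4 correct in each half of 10 students.
d = discrimination_index([1, 1, 0, 1, 1, 1, 1, 1, 1, 0])  # (4 - 4) / 5 = 0.0
```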
STUDENT   TOTAL SCORE   Q1   Q2   Q3
A         90            1    0    1
B         90            1    0    1
C         80            0    0    1
D         80            1    0    1
E         70            1    0    1
F         60            1    0    0
G         60            1    0    1
H         50            1    1    1
I         50            1    1    1
J         40            0    1    1

"1" indicates the answer was correct; "0" indicates it was incorrect.
For Question #1, that means you would subtract 4 from 4 and divide by 5, which results in a Discrimination Index of 0.
Item         # Correct       # Correct       Difficulty   Discrimination
             (Upper group)   (Lower group)   (p)          (D)
Question 1   4               4               .80          0
Question 2   0               3               .30          -0.6
Question 3   5               1               .60          0.8


Item Analysis Worksheet
Ten students have taken an objective assessment. The quiz contained 10 questions. In the table below, the students' scores have been listed from high to low (A, B, C, D, E are in the upper half). There are five students in the upper half and five students in the lower half. The number "1" indicates a correct answer on the question; a "0" indicates an incorrect answer.

Student   Total Score (%)   Q1  Q2  Q3  Q4  Q5  Q6  Q7  Q8  Q9  Q10
A         100               1   1   1   1   1   1   1   1   1   1
B         90                1   1   1   1   1   1   1   1   0   1
C         80                1   1   0   1   1   1   1   1   0   0
D         70                0   1   1   1   1   1   0   1   0   1
E         70                1   1   1   0   1   1   1   0   0   1
F         60                1   1   1   0   1   1   0   1   0   0
G         60                0   1   1   0   1   1   0   1   0   1
H         50                0   1   1   1   0   0   1   0   1   0
I         40                1   1   1   0   1   0   0   0   0   1
J         30                0   1   0   0   0   1   0   0   1   0
Calculate the Difficulty Index (p) and the Discrimination
Index (D) for each question.
Item          # Correct       # Correct       Difficulty   Discrimination   Action
              (Upper group)   (Lower group)   (p)          (D)
Question 1    4               2               0.6          0.4              Revise
Question 2    5               5               1            0                Discard
Question 3    4               4               0.8          0                Discard
Question 4    4               1               0.5          0.6              Include
Question 5
Question 6
Question 7
Question 8
Question 9
Question 10
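The worksheet computations can be automated with a sketch that derives p and D for every column of a 0/1 response matrix (assuming rows are already sorted from highest to lowest total score; the function and the small sample matrix are illustrative, not the worksheet data):

```python
def item_stats(matrix):
    """Return (p, D) for each item of a response matrix whose rows are
    students sorted highest scorers first, with an even number of rows."""
    n = len(matrix)
    t = n // 2
    stats = []
    for col in range(len(matrix[0])):
        flags = [row[col] for row in matrix]
        hc, lc = sum(flags[:t]), sum(flags[t:])
        stats.append((sum(flags) / n, (hc - lc) / t))
    return stats

# Four students, two items: item 1 splits the halves perfectly (D = 1.0),
# item 2 is answered correctly by everyone (p = 1.0, D = 0.0).
stats = item_stats([[1, 1], [1, 1], [0, 1], [0, 1]])
```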
VALIDATION
-Validation is the process of collecting and
analyzing evidence to support the
meaningfulness and usefulness of the test.
What is Validity?
The concept of validity was formulated by
Kelly (1927, p. 14) who stated that a test is
valid if it measures what it claims to
measure.
For example a test of intelligence should
measure intelligence and not something else
(such as memory).
Several Ways to Estimate the Validity of a Test
Internal Validity

Internal validity refers to the extent to which changes in the dependent variable can be attributed to the experimenter's manipulation rather than to other factors.
External Validity

• External validity refers to the extent to which the findings of a study can be generalized.
2 Types
• Ecological Validity
>refers to the extent to which the results and conclusions can be generalized to real life.
• Population Validity
>refers to the extent to which the sample can be generalized to similar and wider populations.
Temporal Validity

• Temporal validity refers to the extent to which the findings and conclusions of a study remain valid when we consider the differences and progressions that come with time.
Test Validity

• Test validity refers to the extent to which the results of a study or test can be said to have meaning.
CONTENT VALIDITY- refers to the content and format
of the instrument.
-How appropriate the items seem to a panel of
reviewers who have knowledge of the subject
matter.
-Does the instrument include everything it should
and nothing it should not?
Example:
Constructing a vocabulary test using a sample of
all vocabulary words studied in a semester.
A comprehensive math achievement test would lack content validity if good scores depended primarily on knowledge of English, or if it only had questions about one aspect of math (e.g., algebra).
CONSTRUCT VALIDITY
-refers to the agreement of test results with certain characteristics which the test aims to portray.
For example, a test of intelligence nowadays
must include measures of multiple
intelligences, rather than just logical-
mathematical and linguistic ability measures.
CRITERION-RELATED VALIDITY
Also referred to as instrumental validity, it states that the criteria should be clearly defined by the teacher in advance. It has to take into account other teachers' criteria to be standardized, and it also needs to demonstrate the accuracy of a measure or procedure compared to another measure or procedure which has already been demonstrated to be valid.
CONCURRENT VALIDITY
Concurrent validity is a statistical method using
correlation, rather than a logical method.
Examinees who are known to be either masters or non
masters on the content measured by the test are identified
before the test is administered. Once the tests have been
scored, the relationship between the examinees’ status as
either masters or non-masters and their performance (i.e.,
pass or fail) is estimated based on the test. This type of
validity provides evidence that the test is classifying
examinees correctly. The stronger the correlation is, the
greater the concurrent validity of the test is.
PREDICTIVE VALIDITY
This is another statistical approach to validity that
estimates the relationship of test scores to an examinee's
future performance as a master or nonmaster.
Predictive validity considers the question,
"How well does the test predict examinees' future
status as masters or non-masters?" For this type of
validity, the correlation that is computed is based on
the test results and the examinee’s later performance.
This type of validity is especially useful
for test purposes such as selection or admissions.
FACE VALIDITY
Like content validity, face validity is determined by a
review of the items and not through the use of
statistical analyses. Unlike content validity, face
validity is not investigated through formal procedures.
Instead, anyone who looks over the test, including
examinees, may develop an informal opinion as to
whether or not the test is measuring what it is
supposed to measure. While it is clearly of some value to
have the test appear to be valid, face validity alone is
insufficient for establishing that the test is
measuring what it claims to measure.
Statistical Conclusion Validity

• It refers to the extent to which we can conclude that the results are statistically significant, that is, that we can establish cause and effect above chance.
Instrumental Validity

• It refers to the extent to which the instruments used to measure the dependent variables are correct for that measurement.
Diagnostic Validity

• It refers to the extent to which a diagnosis made about a condition is accurate.
• It is most commonly used in clinical settings.
STATISTICAL ANALYSIS OF ASSESSMENT DATA
Once we have collected quantitative data, we will have a lot of numbers. It's now time to carry out some statistical analysis to make sense of and draw some inferences from our data.
The first thing to do with any data is to
summarize it, which means to present it in a
way that best tells the story.
One of the most common techniques is using
graphs…( line graph, pie, bar..etc.)

Drawing a graph gives an immediate "picture" of the data. It is always worth drawing a graph before we start any further analysis.
2 Methods of Statistical Analysis
DESCRIPTIVE STATISTICS- Descriptive
statistics try to describe the relationship
between variables in a sample or population.
-It provides a summary of data in the form of
mean, median and mode.
INFERENTIAL STATISTICS - use a random
sample of data taken from a population to
describe and make inferences about the whole
population.
DESCRIPTIVE STATISTICS - the first level of analysis. Commonly used descriptive statistics are the measures of central tendency (mean, mode, median) and the measures of variability (interquartile range, variance, standard deviation, coefficient of variation).
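The measures of central tendency and variability named above are available in Python's standard library (the scores are illustrative):

```python
import statistics

scores = [70, 80, 80, 90, 100]      # illustrative test scores
mean = statistics.mean(scores)      # 84
median = statistics.median(scores)  # 80
mode = statistics.mode(scores)      # 80
stdev = statistics.pstdev(scores)   # population standard deviation
```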
INFERENTIAL STATISTICS- show the
relationship between several different variables.
A few types of inferential analysis are:
Correlation - this describes the relationship between two variables. If a correlation is found, it means that there is a relationship between the variables. For example, taller people tend to have a higher weight; hence, height and weight are correlated with each other.
Regression - this shows the relationship between two variables. For example, regression can help us predict someone's weight based on their height.
Analysis of variance: statistical procedure used
to test the degree to which two or more groups
vary or differ in an experiment.
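The correlation and regression ideas above can be sketched in plain Python (the height/weight numbers are made up for illustration):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two variables."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def linear_regression(xs, ys):
    """Least-squares slope and intercept for predicting y from x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

heights = [150, 160, 170, 180, 190]  # cm (illustrative)
weights = [52, 58, 66, 74, 80]       # kg (illustrative)
r = pearson_r(heights, weights)      # close to +1: strongly correlated
slope, intercept = linear_regression(heights, weights)
predicted = slope * 175 + intercept  # predicted weight for a 175 cm person
```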
INFERENTIAL STATISTICS - significant relationships are determined by rejecting the null hypothesis and accepting the alternative hypothesis.
The null hypothesis is rejected if:
the computed test statistic is greater than the critical value.
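The decision rule can be written as a one-line check (illustrative numbers; real critical values come from the relevant distribution table):

```python
def reject_null(test_statistic, critical_value):
    """Reject H0 when the computed statistic exceeds the critical value."""
    return test_statistic > critical_value

decision = reject_null(2.5, 1.96)  # True: reject the null hypothesis
```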
