EDUR 8331 14a Item Analysis
Van Blerkom Chapter 11: Using Item Analysis to Improve Test Items
Item analysis for tests is similar to the item analysis we learned for scales, but it differs in some respects because of
the types of test items available. Recall that item analysis refers to procedures used to analyze responses to test items.
Through these analyses, one may revise items to produce more reliable and more valid tests. A complete item analysis
has several components, each of which is described below.
1. Test Analysis
When examining a test, one may first note how students performed on the test as a whole. If most students
perform very well on a test, that could mean either
(a) the test content was taught well, students learned the content, and they scored high on the test, which is the
ideal situation; or
(b) the test was too easy and could not adequately assess students' understanding of the content.
For items scored dichotomously (1 = correct, 0 = incorrect), the item difficulty is simply the number of students who
answered the item correctly divided by the number of students who completed the test. For example, if 15 students
answered an item, but only 8 students answered it correctly, the item difficulty would be 8/15 = .53 or 53%. Note that
a higher value indicates an easier item.
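The difficulty calculation above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original handout; the function name `item_difficulty` is hypothetical, and the response list is constructed only to reproduce the 8-out-of-15 example.

```python
def item_difficulty(responses):
    """Proportion of students who answered a 1/0-scored item correctly."""
    if not responses:
        raise ValueError("responses must not be empty")
    return sum(responses) / len(responses)


# 15 students answered the item; 8 answered correctly (1) and 7 incorrectly (0).
responses = [1] * 8 + [0] * 7
print(round(item_difficulty(responses), 2))  # 0.53
```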
Ideally, the two groups are defined such that the top-performing 25% or 1/3 (33%) of the class on the test represents the
above-average group and the bottom 25% or 1/3 (33%) represents the below-average group. In most cases for classroom
tests, the number of students will be small, usually fewer than 30, so it is best to use the top 1/3 and bottom 1/3 for
calculating item discrimination.
item discrimination = (item difficulty for top performing group) - (item difficulty for low performing group)
If the top 1/3 of students all answered the chosen item correctly, their item difficulty would be 1.00, and if 40% of the
bottom 1/3 answered the item correctly, their item difficulty would be .40. The item discrimination would be
1.00 - .40 = .60.
The larger the discrimination index, the better the item distinguishes between more knowledgeable and less
knowledgeable students for the content domain and skill level sampled by the item. For teacher-constructed classroom
tests, discrimination indexes of .20 and greater are good when using the upper 1/4 vs. lower 1/4, and one should expect
lower levels of discrimination when larger groups form the upper and lower groups, such as upper 1/3 vs. lower 1/3, or
upper 1/2 vs. lower 1/2. Some measurement specialists recommend using only the top 10 and bottom 10 performers,
even if a hundred (or more) students took the test. As these recommendations reveal, there is no uniform agreement on
how to define the top and bottom groups, so in your own practice use whichever definition works best for your
situation. The key is to have a sufficiently large sample in both groups, i.e., 10 or more students in each group.
The discrimination index ranges from -1.00 to 1.00. Negative values indicate that less knowledgeable students are
answering the item correctly more often than more knowledgeable students. This signals that the item is deficient. A
common problem with such items is ambiguity, i.e., the item has more than one correct response or no correct
response.
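As a rough sketch of the upper-group vs. lower-group calculation, the Python below sorts students by total test score, takes the top and bottom fraction, and subtracts the two item difficulties. The function name `discrimination_index` and all student data are hypothetical; the data are constructed so that the upper 1/3 has a difficulty of 1.00 and the lower 1/3 a difficulty of .40, matching the worked example above.

```python
def discrimination_index(records, fraction=1 / 3):
    """Upper-group difficulty minus lower-group difficulty for one item.

    records: list of (total_test_score, item_score) pairs, item_score 1 or 0.
    fraction: share of students placed in each of the two groups.
    """
    group_size = max(1, round(len(records) * fraction))
    if 2 * group_size > len(records):
        raise ValueError("upper and lower groups would overlap")
    # Sort from highest to lowest total test score.
    ranked = sorted(records, key=lambda rec: rec[0], reverse=True)
    upper = [item for _, item in ranked[:group_size]]
    lower = [item for _, item in ranked[-group_size:]]
    return sum(upper) / group_size - sum(lower) / group_size


# Hypothetical class of 15 students: (total test score, item score).
records = [
    (98, 1), (95, 1), (93, 1), (90, 1), (88, 1),   # upper 1/3: 5 of 5 correct
    (85, 1), (82, 0), (80, 1), (78, 1), (75, 0),   # middle 1/3 (not used)
    (70, 1), (66, 0), (61, 0), (55, 1), (50, 0),   # lower 1/3: 2 of 5 correct
]
print(round(discrimination_index(records), 2))  # 0.6
```

A negative result from this function would correspond to the problematic case described above, where the lower group outperforms the upper group on the item.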
Another method for calculating discrimination is to calculate the correlation between the item score and the total test
score across the sample of students. This correlation is called the item-total correlation and is the same item-total
correlation learned for item analysis with scales. This method is especially useful when one wishes to calculate
discrimination for an item that is scored with multiple points or partial credit, like essay or short-answer items.
Normally, one will use all students, not just the top and bottom groups, to calculate the item-total correlation. As with
the discrimination index, the item-total correlation ranges from -1.00 to 1.00. The better the item discriminates, the
larger the positive item-total correlation will be. Like negative discrimination indices, negative correlations indicate that
the item is not discriminating properly and signal a problematic item.
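The item-total correlation described above is an ordinary Pearson r between item scores and total test scores. The sketch below computes it from the definition; the function name `pearson_r` and the ten students' scores are hypothetical and serve only to show the mechanics.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    if sum_xx == 0 or sum_yy == 0:
        raise ValueError("correlation is undefined when a variable does not vary")
    return sum_xy / math.sqrt(sum_xx * sum_yy)


# Hypothetical data for 10 students, all of whom are used in the calculation.
item_scores = [1, 1, 1, 0, 1, 1, 0, 0, 1, 0]             # 1 = correct, 0 = incorrect
total_scores = [96, 91, 88, 85, 80, 74, 70, 66, 61, 55]  # total test score
item_total_r = pearson_r(item_scores, total_scores)
print(round(item_total_r, 2))
```

The same function works unchanged for partial-credit items, since it only requires numeric item scores.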
To illustrate both methods for calculating discrimination, listed below are the total test scores and the item
performances for two groups of students. Note that all information is sorted by Total Test Score, so it is easy to identify
top and bottom performers.
Table 1
Example Discriminations (Note that one should sort the table or spreadsheet by Total Test Score)
Columns: Student Identification; Group; Item 1, Multiple Choice (1 = correct, 0 = incorrect); Item 20, Essay (partial
score out of a possible total of 5 points); Item 20, Essay (proportion correct out of 5 points); Total Test Score (% correct)
The item-total correlation for all 12 students uses the 1,0 scoring for Item 1 and the Total Test Score. The Pearson r =
.36, which, while lower, is consistent with the above discrimination index. Both the discrimination index of .50 and the
correlation of .36 tell us that students with higher total scores also performed better on Item 1, which is what we hope
to find with a well-performing item.
For Item 20, the essay item, discrimination can be found by calculating the difference in mean proportion correct
between the two groups. For the top 1/3 group, the mean item difficulty is (.90+.80+.90+.70)/4 = .825. For the bottom
1/3 group, the mean item difficulty is (.60+.70+.50+.40)/4 = .55. The discrimination index is therefore .825 - .55 = .275.
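The Item 20 arithmetic above can be checked with a short Python sketch. The proportions are the ones given in the text for the top and bottom 1/3 groups; the variable and function names are illustrative only.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Proportion of the 5 essay points earned by each student (from the text above).
top_third = [0.90, 0.80, 0.90, 0.70]
bottom_third = [0.60, 0.70, 0.50, 0.40]

top_difficulty = mean(top_third)
bottom_difficulty = mean(bottom_third)
discrimination = top_difficulty - bottom_difficulty

print(round(top_difficulty, 3), round(bottom_difficulty, 3), round(discrimination, 3))
# 0.825 0.55 0.275
```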
The item-total correlation for Item 20 is found by correlating the Item 20 score, either raw score or proportion correct,
with the Total Test Score for all students. Since Pearson r is invariant to positive linear transformations, such as
dividing every raw score by the 5 possible points, it does not matter whether the raw score or the proportion score is
used. For Item 20, the item-total correlation is .853. Both the item discrimination and the item-total correlation are
positive, which indicates that the essay item functions as it should by discriminating between more and less
knowledgeable students.
Finally, there is a direct relationship between item difficulty and item discrimination. The more extreme the item
difficulty, whether very easy or very hard, the less the item can discriminate. When item difficulty approaches the
half-way point (.50), item discrimination will most likely be maximized.
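One way to see this relationship is to compute the largest discrimination index an item could possibly have at a given difficulty. The sketch below makes a simplifying assumption that is not in the original text: the upper and lower groups are the upper 1/2 and lower 1/2 of the class, so overall difficulty is the average of the two group difficulties. Under that assumption the ceiling is 2 × min(p, 1 - p). The function name is hypothetical.

```python
def max_discrimination(difficulty):
    """Largest possible upper-half minus lower-half discrimination index.

    Assumes two equal halves, so difficulty = (upper + lower) / 2.
    The ceiling is reached when the upper half gets as many of the
    correct answers as it can.
    """
    if not 0 <= difficulty <= 1:
        raise ValueError("difficulty must be between 0 and 1")
    return 2 * min(difficulty, 1 - difficulty)


for p in (0.10, 0.30, 0.50, 0.70, 0.90):
    print(f"difficulty {p:.2f} -> maximum discrimination {max_discrimination(p):.2f}")
# difficulty 0.10 -> maximum discrimination 0.20
# difficulty 0.30 -> maximum discrimination 0.60
# difficulty 0.50 -> maximum discrimination 1.00
# difficulty 0.70 -> maximum discrimination 0.60
# difficulty 0.90 -> maximum discrimination 0.20
```

This is only an upper bound; an item of moderate difficulty can still discriminate poorly.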
Table 2
Distractor Analysis of Item 1 (Numbers are percentages)
Item X A B* C Omit
Upper 1/4 25 75 0 0
Lower 1/4 25 25 50 0
All Students 25 60 15 0
Note. The Omit category represents students who did not answer the item.
*Correct response.
Note that Item 1, illustrated in Table 2, behaves accordingly. That is, the upper group chose B more often than the lower
group, and the lower group chose the distractors (options A and C, taken together) more often than the upper group.
Should the pattern illustrated in Table 2 not occur, then the item will probably need some revision. For example, if the
lower group chooses the correct response more often than the upper group, then the item is most likely ambiguous. Or,
if the upper group chooses one of the distractors more often than the lower group (but a larger percentage of the upper
group still chooses the correct response), then that distractor needs revision. The last type of problem that may be
observed is the case in which more students (both upper and lower) choose a distractor rather than the correct option.
When this occurs, students view the distractor as more correct than the correct option, and the item should be carefully
reviewed.
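The three warning signs just described can be expressed as simple checks on the percentages in a distractor table. The sketch below is one possible way to do this, using the Table 2 percentages for Item 1 (omits excluded); the function name `review_distractors` and the wording of the warnings are hypothetical.

```python
def review_distractors(upper, lower, overall, correct):
    """Return a list of warnings for one multiple-choice item.

    upper, lower, overall: dicts mapping option label -> percentage choosing it.
    correct: label of the correct option.
    """
    warnings = []
    # Lower group chooses the correct response more often than the upper group.
    if lower[correct] > upper[correct]:
        warnings.append(f"{correct}: lower group outperforms upper group; item may be ambiguous")
    for option in upper:
        if option == correct:
            continue
        # Upper group is drawn to a distractor more than the lower group.
        if upper[option] > lower[option]:
            warnings.append(f"{option}: chosen more often by upper group; revise this distractor")
        # A distractor is more popular than the correct option overall.
        if overall[option] > overall[correct]:
            warnings.append(f"{option}: chosen more often than the correct option; review the item")
    return warnings


# Percentages from Table 2 (B is the correct response).
upper = {"A": 25, "B": 75, "C": 0}
lower = {"A": 25, "B": 25, "C": 50}
overall = {"A": 25, "B": 60, "C": 15}
print(review_distractors(upper, lower, overall, correct="B"))  # []
```

An empty list means none of the three patterns was found, which agrees with the reading of Table 2 given above.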
Distractor analysis is beneficial in learning which skills and content students are having the greatest and least success
with. Oftentimes items will contain distractors that represent common mistakes, and when students select such
distractors with great frequency, this is a clear indication that further instruction is necessary. Thus, proper
interpretation of distractors may lead to significant alterations to instruction.
Note: The three components of item analysis (difficulty, discrimination, and distractor analysis) should be viewed
cautiously when one has a small sample of students. Ideally, large groups of 50 or more students are needed for a
reliable analysis. When the number of students is small, the analysis should be viewed as preliminary, and one should
collect more data from additional classes over time.
When an item has low discrimination and moderate difficulty, then the item is most likely ambiguous and should be
revised, or perhaps instruction was less than adequate and should be corrected. When examining an item, one should
consider all aspects: stem, correct option, and distractors. Moreover, one should also ensure that the item reflects the
desired capability and performance objective.
One method for calculating difficulty is to determine a minimally acceptable response level, in terms of points
awarded, and calculate the percentage of students who scored at or above this level. For example, an essay item
may be worth a total of 10 points, yet one may decide that a minimum of 6 points is needed to be considered acceptable
performance on the essay. One then calculates the percentage of students who scored 6 or more points, and this
percentage represents the item difficulty. Another approach is the one illustrated above in section 3: calculate the mean
proportion of points earned for all students who completed the essay. For example, if the essay is worth 10 points, and
the mean points earned across all students is 7.3, then the item difficulty is 7.3/10 = .73.
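Both approaches to difficulty for a multi-point item can be sketched as follows. The ten essay scores are hypothetical, as are the function names; the 10-point maximum and the 6-point minimum acceptable score follow the example in the text.

```python
def difficulty_by_cutoff(scores, cutoff):
    """Proportion of students scoring at or above the minimally acceptable level."""
    return sum(1 for score in scores if score >= cutoff) / len(scores)


def difficulty_by_mean(scores, max_points):
    """Mean proportion of the available points earned across all students."""
    return sum(scores) / (len(scores) * max_points)


# Hypothetical essay scores for 10 students on a 10-point essay.
essay_scores = [9, 8, 8, 7, 7, 6, 6, 5, 4, 3]

print(difficulty_by_cutoff(essay_scores, cutoff=6))     # 0.7
print(difficulty_by_mean(essay_scores, max_points=10))  # 0.63
```

The two methods generally give different values for the same item, so it helps to report which one was used.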
Item discrimination may be calculated using the same minimum-score technique just described. After determining the
top and bottom performing groups for the test as a whole, one may then calculate, for each group, the percentage of
students awarded 6 or more points on the essay item. Thus, for example, the top performing group may have had 85%
receive 6 or more points for the essay, and the bottom group may have had 53% receive 6 or more, so the item
discrimination would be .85 - .53 = .32. Another alternative is to use the approach illustrated above in section 3,
Table 1. Find the proportion correct for both upper and lower groups and then find the difference. For Item 20 in
Table 1, the upper 1/3 of students had a mean difficulty of (.90+.80+.90+.70)/4 = .825 and the bottom 1/3 group had a
mean difficulty of (.60+.70+.50+.40)/4 = .55, so the discrimination index was .825 - .55 = .275.
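The minimum-score version of discrimination can be sketched the same way: compute the proportion at or above the cutoff in each group, then subtract. The group scores below are hypothetical and do not reproduce the 85% and 53% figures in the text; the function names are illustrative.

```python
def proportion_at_or_above(scores, cutoff):
    """Proportion of a group scoring at or above the cutoff."""
    return sum(1 for score in scores if score >= cutoff) / len(scores)


def cutoff_discrimination(upper_scores, lower_scores, cutoff):
    """Upper-group proportion minus lower-group proportion at the cutoff."""
    return (proportion_at_or_above(upper_scores, cutoff)
            - proportion_at_or_above(lower_scores, cutoff))


# Hypothetical essay scores (out of 10) for the top and bottom groups,
# where the groups were formed from total test scores.
upper_scores = [9, 8, 7, 6, 5]   # 4 of 5 reach the 6-point minimum
lower_scores = [7, 5, 4, 6, 3]   # 2 of 5 reach the 6-point minimum

print(round(cutoff_discrimination(upper_scores, lower_scores, cutoff=6), 2))  # 0.4
```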
A better approach, however, is to use the item-total correlation. That is, calculate the correlation between the points
each student received for the essay item and the total score each student received on the test. The higher the
positive correlation, the better the item discriminates.
Table 3
An item analysis for two items.
Item 1 A B *C D E Omit
Upper 1/3 4 6 74 5 11 0
Lower 1/3 21 12 38 25 4 0
All Students 12 10 54 15 9 0
Item 2 A B C D E Omit
Upper 1/3 28 42 15 0 15 0
Lower 1/3 13 51 10 0 26 0
All Students 20 46 13 0 21 0
Items 7 through 14 are interpretations of the information in Table 3. Indicate (yes or no) whether each
interpretation is correct.
Items 15 through 19 suggest some benefits of using student input to help interpret an item analysis. Indicate
(yes or no) whether each of these statements represents a benefit.
15. Test scores can be more readily corrected for student guessing.
16. Ambiguity within a test item can be identified.
17. Ambiguity within instruction given students can be identified.
18. The reliability of the test can be estimated.
19. Misconceptions learned by students can be addressed.