1. Introduction
Item analysis is a post-examination evaluation that provides
information about the quality of a test. It is a statistical analysis of
students’ responses to test items. Collecting and summarizing
students’ responses yields objective, quantitative information that is
useful for judging the quality of test items and increasing the
assessment’s efficiency [3, 4]. Item analysis also “investigates the
performance of items considered individually either in relation to
some external criterion or the remaining items on the test” [5].
2. Importance of psychometric analysis
Any educational test should measure students’ achievement in the
content material. It also contributes to an overall assessment of
students’ development and to decisions about their academic status
[6, 7]. The importance of item analysis is determined by the
objective of the assessment [8]. In summative assessment, the
results must be reliable and valid, because incorrect decisions about
academic status have negative consequences [9]. In formative
assessment, where the target is student learning, item analysis is
less important as feedback on item construction to test composers.
Many reasons have been reported in the literature for conducting
item analysis, all centered on examining whether an item functions
as intended: Did it assess the required concepts (content)? Did it
discriminate between those who mastered the content material and
those who did not? Was it within an acceptable level of difficulty?
Are the distractors functioning or not? [10, 11].
3. Factors affecting item analysis
Many factors can affect item analysis and hence its interpretation
[8]. Difficulty and discrimination indices change with each
administration and are influenced by the ability and number of
examinees, the number of items, and the quality of instruction [8,
12]. Whatever the exam blueprinting (item selection) method, exam
items remain a sample of the required content material. The number
of items (item sampling) is of great importance because one cannot
ask about all of the content; with too few items, the results may not
be sufficient to reflect true student ability [8]. Technical item flaws
are divided into two major types: test-wiseness flaws and irrelevant
difficulty. Test-wiseness flaws can make items easier, whereas flaws
related to irrelevant difficulty can make items more challenging in
ways unrelated to the content under assessment. It was reported
that item analysis of an exam with 200 examinees is stable, while
results from fewer than 100 examinees (item difficulty or item
discrimination index) should be interpreted with caution. Downing
and Yudkowsky, however, noted that even with a small number of
examinees (e.g., 30), item analysis can still provide helpful
information for item improvement [13, 14].
4. Parameters of item analysis
The item or psychometric analysis parameters include the difficulty
index, reliability, discrimination index, and distractor efficiency [2].
The descriptive statistics of the exam are also important and can
provide helpful general information [2]. They include the score
frequencies, mean, mode, median, and standard deviation.
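As a quick illustration, the sketch below computes these descriptive statistics in Python from a hypothetical list of exam scores; none of the numbers are taken from this chapter’s examples.

```python
# Hypothetical exam scores; the statistics mirror those listed above.
from statistics import mean, median, mode, stdev

scores = [38, 35, 33, 31, 30, 30, 29, 27, 24, 14]

print("mean:", mean(scores))            # arithmetic average
print("median:", median(scores))        # middle score
print("mode:", mode(scores))            # most frequent score
print("sd:", round(stdev(scores), 2))   # sample standard deviation
```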
5. Cronbach’s alpha (Index of reliability)
Cronbach’s alpha (KR-20) is a widely accepted and used estimate of
test reliability (internal consistency) and is reported to be superior
to the split-half estimate [15, 16]. Although validity and reliability
are closely associated, the reliability of an assessment does not
depend on its validity [16, 17]. Coefficient alpha is known to equal
KR-20 when each item has a single correct answer scored as binary,
as in type A MCQs [18–21]. Coefficient alpha reflects the degree to
which item response scores correlate with total test scores [15]. It
also describes the degree to which items in the exam measure the
same concept or construct [22]; it is therefore connected to the
inter-relatedness and dimensionality of the items within the exam
[16, 20]. Cronbach’s alpha is affected by exam time; the number and
inter-relation (dimensionality) of the items; easy, hard, poorly
written, or confusing items; variations in examinee responses;
curriculum content not reflected in the test; testing conditions; and
errors in recording or scoring [22–24]. The value of alpha decreases
in exams with fewer items and increases when items assess the
same concept (uni-dimensionality of the exam) [16]. Other factors
reported to impact the alpha value include item difficulty, the
number of examinees, and student performance at exam time. It
has been argued that very high alpha values can indicate lengthy
exams, parallel items, or narrow coverage of the content material
[22]. The alpha value of an exam can be increased by increasing the
number of items with a high p-value (difficulty index). Items of
moderate difficulty were reported to maximize the alpha value,
while those with a difficulty of zero or 100 minimize it [15]. Likewise,
deletion of faulty items can increase the alpha value. It should be
kept in mind that repeating items in the same exam, or using items
assessing the same concept, can inflate the alpha value.
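To make the computation concrete, the sketch below implements KR-20 for dichotomously scored (0/1) items on a hypothetical response matrix; note that texts differ slightly on whether the total-score variance uses n or n − 1.

```python
# KR-20 = k/(k-1) * (1 - sum(p*q) / var(total)), where k is the number of
# items, p is each item's proportion correct, and q = 1 - p.
def kr20(responses):
    k = len(responses[0])                     # number of items
    n = len(responses)                        # number of examinees
    totals = [sum(row) for row in responses]  # total score per examinee
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / (n - 1)
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_t)

# Hypothetical matrix: rows = examinees, columns = items (1 = correct).
responses = [
    [1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 1],
    [0, 0, 0, 1], [1, 1, 1, 0], [0, 0, 1, 1],
]
print(round(kr20(responses), 3))
```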
6. Interpretation of Cronbach’s alpha
Reliability can be interpreted as the correlation of the test with
itself. As the estimate of reliability increases, the portion of a test
score attributable to error decreases. Wise interpretation of alpha
requires an understanding of the inter-relatedness of the items and
of whether they measure a single latent trait or construct. Exams
covering different content materials, such as those of integrated
courses, illustrate this: a musculoskeletal system course, although
dominated by anatomy, contains other basic medical and clinical
science subjects with different constructs. Interpretation of such a
course exam therefore needs to look beyond the alpha figure. It was
reported that a KR-20 of 0.7 is acceptable for a short test (fewer
than 50 items) and 0.8 for an extended test (more than 50 items)
[25]. Moreover, it was documented that a multidimensional exam
does not necessarily have a lower alpha value than a
unidimensional one (Table 1) [26]. A low alpha value can be due to a
small number of items, low inter-relatedness between items, or
heterogeneous constructs [22]. A high alpha value can suggest that
the exam is reliable, but also that some items are non-functional
because they test the same content in a different guise or are
repeated [16, 22]. A high value can likewise indicate highly inter-
related items, and hence limited coverage of the content materials
[22].
7. Improving Cronbach’s alpha
Adding new items with an acceptable difficulty index, high
discrimination power, and good distractor efficiency can increase
test reliability [22, 27, 28]. In addition, deleting faulty items, or
those with a very low or very high p-value, can improve Cronbach’s
alpha. Items with poor or no correlation with the total score should
be revised or discarded from the exam.
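As a rough numeric illustration of why adding comparable items raises reliability, the sketch below uses the Spearman-Brown prophecy formula; this standard formula is not discussed in the chapter, and the starting value is hypothetical.

```python
# Spearman-Brown prophecy: if a test with reliability r is lengthened by a
# factor k with items of comparable quality, the predicted reliability is
# k*r / (1 + (k - 1)*r).
def spearman_brown(r, k):
    return k * r / (1 + (k - 1) * r)

r = 0.70                      # hypothetical current alpha
for k in (1.5, 2.0):          # lengthen the test by 50% and by 100%
    print(f"x{k}: predicted reliability = {spearman_brown(r, k):.2f}")
```

Conversely, shortening a test (k < 1) is predicted to lower reliability, which is why deletion helps only when the removed items are genuinely faulty.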
8. Distractor analysis
Type A MCQs are commonly formed of a stem, with or without a
leading question, and four or five alternatives. Among an item’s
alternatives, only one is the key answer; the others are called
distractors [4]. Distractors should convey a misconception about the
key answer and appear plausible. They should be similar to the key
answer in wording, grammatical form, style, and length [19].
Distractor efficiency (DE) is the ability of the incorrect answers to
distract students [12]. A functional distractor (FD) is one selected by
5% or more of examinees [4, 33], while those chosen by fewer than
5% of examinees are considered non-functional (NFD) [4]. In
comparison, other authors reported 1% of examinees as the
demarcation of functional distractors [34, 35]. Items are commonly
categorized according to the number of NFDs they contain (Table 2)
[12, 29, 36, 37]. The occurrence of NFDs makes an item easier and
reduces its discrimination power, whereas functional distractors
make it more difficult [36, 38]. Non-functional distractors were
reported to correlate negatively with reliability [38]. The presence of
non-functional distractors can be related to two main causes. The
first is the training and item-writing ability of the item writer or
composer. The second is a mismatch between the target content
and the number of plausible distractors that can be created. Thus,
training and greater effort in item writing and construction can
decrease NFDs [36]. Other reported causes of NFDs include a low
cognitive level of the item, an irrelevant or limited number of
plausible distractors, and the presence of logic cues [39]. Another
possibility is that students have mastered the content material of
the item and can identify the distractor as wrong. If no other cause
for NFDs is found, they should be removed or replaced with a more
plausible option, because they contribute nothing to the
measurement of the test [12]. If a distractor is selected more
frequently than the key answer by higher-scoring examinees, this
may indicate poor construction, a misleading question, or a mis-
keyed or double-keyed item [12, 40]. In this regard, the use of three
options is more practical than four: it does not affect reliability and
does not significantly affect the discrimination index [29, 35–37].
Furthermore, it was reported that there is no psychometric reason
for all items in an exam to have the same number of distractors [29,
41]. The required number of options in an item should be
determined by the content material from which plausible distractors
can be developed [33, 40, 42]. Reducing the number of
options/distractors brings other important benefits: it reduces the
answering time of the test (and the saved time can be used to cover
more content material), reduces the burden on item composers,
and yields items with more acceptable parameters [43, 44].
Puthiaparampil et al. reported non-significant negative and positive
correlations between the number of functional distractors and the
difficulty and discrimination indices, respectively [34], while a
significant positive correlation was reported between the difficulty
index and the number of NFDs [45]. Many authors concluded that
there is no predictable relationship between DE and the difficulty
and discrimination indices [27, 29, 40, 46, 47]. In addition, Licona-
Chávez et al. did not find parallel performance between DE and
other item analysis parameters, including Cronbach’s alpha [46]. In
contrast, some authors claimed that low DE decreases the difficulty
index [47, 48].
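The sketch below applies the 5% rule described above to a single hypothetical item; the option counts and the cut-off parameter are illustrative only.

```python
# Flag each distractor as functional (chosen by >= cutoff of examinees)
# or non-functional (NFD), using the 5% demarcation cited above.
def distractor_report(counts, key, n_examinees, cutoff=0.05):
    nfd = 0
    for option, count in counts.items():
        if option == key:
            continue  # skip the key answer; only distractors are rated
        share = count / n_examinees
        functional = share >= cutoff
        print(f"option {option}: {share:.0%} -> "
              f"{'functional' if functional else 'non-functional'}")
        nfd += not functional
    print("NFDs in this item:", nfd)

# Hypothetical item with key A, answered by 25 examinees.
distractor_report({"A": 15, "B": 6, "C": 3, "D": 1, "E": 0},
                  key="A", n_examinees=25)
```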
9. Improving distractor analysis
Restoring the optimal DE of an item can be achieved by identifying
the flaws responsible for the NFDs and correcting them, or by
removing the NFDs from the item [39].
10. Difficulty index
The item difficulty (easiness, facility index, p-value) is the
percentage of students who answered an item correctly [6, 40]. The
difficulty index ranges from 0 to 100; higher values indicate easier
questions and lower values harder ones. The ideal (optimal)
difficulty level for type A MCQs varies according to the number of
options (Table 3) [49, 50]. The range of item difficulty can be
categorized into difficult, moderate, and easy. Easy and difficult
items were reported to have very little discrimination power [48].
Item difficulty relates both to the item and to the examinees who
took the test at a given time [24]; reuse of an item based on its
difficulty index should therefore be controlled. Some authors found
that the difficulty indices of items assessing the higher cognitive
levels of Bloom’s taxonomy, such as evaluation, explanation,
analysis, and synthesis, are lower than those of items assessing
remembering, understanding, and applying [51, 52]. During item or
exam construction, the constructor should aim for an acceptable
level of difficulty [6]. Sugianto reported that items within an exam
can be distributed by difficulty as follows: moderate level (40%),
easy and challenging levels (20% each), and easier and more
challenging levels (10% each) [6]. Other authors reported that most
items should be of moderate difficulty, with 5% in the difficult range
[50, 53].
difficult range [50, 53]. Some authors found that difficulty indices of
items assessing high cognitive levels in Bloom’s taxonomy such as
evaluation, explanation, analysis, and synthesis are lower than
those assessing remembering, understanding, and applying [51,
52]. Regarding the general arrangement of test or examination,
easy items start first then are followed by difficult ones. At the same
time, in the case of diagnostic assessment, the sequence of the
learning material is more important [6, 7]. Easy and difficult items
affect the item’s ability to discriminate between students and show
low discrimination power. Some reports described a negative
correlation between exam reliability and difficult and easy items
[38]. Oermann et al. reported that educationalists must be careful in
deleting items with poor DIF because the number of items has more
effect on test validity [54]. It is recommended that difficult items
should be reviewed for the possible technical and content causes
[50]. Possible causes of low difficulty index include uncovered
(taught) content material, challenging items, missed key or no
correct answer among the item options [55]. Easy items (high P-
value) can be due to technical causes, or the concerned learning
objective (s) were achieved or revisited in coverage that is more
superficial [55].
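A minimal sketch of this definition, on hypothetical marks:

```python
# Difficulty (facility) index: percentage of examinees answering correctly.
def difficulty_index(item_scores):
    return 100 * sum(item_scores) / len(item_scores)

item = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]          # 7 of 10 examinees correct
print(f"P = {difficulty_index(item):.1f}%")     # P = 70.0% (moderate)
```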
11. Interpretation of difficulty index
In the literature, including medical education, many ranges of
difficulty indices have been reported (Table 4).
12. Discrimination index (Power)
Item discrimination (DI) is the ability of an item to discriminate
between higher-achieving and lower-achieving students. It has been
defined as “a statistic that indicates the degree to which an item
separates the students who performed well from those who did
poorly on the test as a whole” [6]. The discrimination power of an
item is calculated by ranking examinees by total test score and
categorizing them into upper and lower groups of 27% each. The
difference in correct answers between the upper and lower groups
is divided by the number of examinees in the upper (or larger)
group, by half the total number of examinees, or even by the total
number [4, 6, 58, 59]. Obon and Rey calculated the discrimination
index as the difference in difficulty index between the upper and
lower groups [12]. In the literature, both 25% and 27% were
reported as possible cut-offs for examinee categorization [60, 61];
27% is commonly used because it maximizes differences in normal
distributions while keeping enough examinees in each group. The
discrimination index ranges from −1.0 to 1.0. A positive
discrimination index indicates that high achievers answer the item
correctly more often than low achievers, which is desirable. A
negative discrimination index reflects that lower-achieving
examinees answer the item correctly more often, while zero
discrimination indicates that equal numbers of students in the
upper and lower groups answered it correctly [36, 37]. Negative
discrimination is thought to be due to item flaws such as inefficient
distractors, mis-keys, ambiguous wording, gray areas of opinion,
and areas of controversy [12, 62]. Nevid and McClelland [52]
reported that items assessing evaluation and explanation domains
could discriminate between high and low performers, while Kim et
al. [51] commented that items assessing remembering and
understanding levels have low discrimination power [52, 54]. It was
reported that discrimination indices are positively associated with
the difficulty index and distractor efficiency [39, 63]. The
discrimination power of an item is reduced by an increased number
of non-functional distractors [36]. A test with poor discriminating
power will not provide a reliable interpretation of the examinees’
actual ability [6, 64]. In addition, discrimination power does not
indicate item validity, and deletion of items with poor discrimination
power negatively impacts validity by decreasing the number of
items [65].
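The sketch below illustrates this calculation on hypothetical data, dividing by the upper-group size; as noted above, other denominators are also in use.

```python
# Discrimination index: rank examinees by total score, take the top and
# bottom 27%, and compare their counts of correct answers on the item.
def discrimination_index(item_scores, total_scores, fraction=0.27):
    n = len(total_scores)
    g = max(1, round(n * fraction))  # size of each group
    order = sorted(range(n), key=lambda i: total_scores[i], reverse=True)
    upper = sum(item_scores[i] for i in order[:g])   # correct in upper group
    lower = sum(item_scores[i] for i in order[-g:])  # correct in lower group
    return (upper - lower) / g

# Hypothetical data: item marks (0/1) and total test scores for 10 examinees.
item   = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]
totals = [38, 35, 33, 31, 30, 29, 27, 24, 20, 14]
print(round(discrimination_index(item, totals), 2))   # 0.67
```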
13. Discrimination coefficients
Item discrimination can also be evaluated with discrimination
coefficients, which include the point-biserial correlation, the biserial
correlation, and the phi coefficient. Although the point-biserial
correlation is used interchangeably with the discrimination index,
discrimination coefficients are considered superior to the
discrimination index [24]. Their superiority comes from the fact that
they are calculated from the responses of all examinees, rather
than only 54% of them (the upper and lower 27% groups), as in the
discrimination index. The difference between the point-biserial
correlation (rpb) and the discrimination index is that rpb is the
correlation between an item score and the overall student score
[2, 66]. For highly discriminating items, the examinees who
responded to the item correctly also did well on the test, while the
examinees who responded incorrectly tended to perform poorly on
the overall test. It was suggested that the point-biserial correlation
expresses predictive validity better than the biserial correlation
coefficient [61, 67].
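The sketch below computes rpb as an ordinary Pearson correlation between a dichotomous item score and the total scores, following the definition above; the data are hypothetical.

```python
# Point-biserial correlation: Pearson correlation of a 0/1 item score with
# the total test score, computed over all examinees.
def point_biserial(item_scores, total_scores):
    n = len(item_scores)
    mx = sum(item_scores) / n
    my = sum(total_scores) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(item_scores, total_scores))
    sx = sum((x - mx) ** 2 for x in item_scores) ** 0.5
    sy = sum((y - my) ** 2 for y in total_scores) ** 0.5
    return cov / (sx * sy)

item   = [1, 1, 1, 1, 0, 1, 0, 0, 1, 0]   # hypothetical 0/1 item marks
totals = [38, 35, 33, 31, 30, 29, 27, 24, 20, 14]
print(round(point_biserial(item, totals), 2))
```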
14. Interpretation of discrimination index
A discrimination power above 0.15 was reported as evidence of item
validity [50, 53], while any item with a discrimination power below
0.15, or a negative one, should be reviewed [50] (Table 5). When
interpreting an item’s discrimination power, special consideration
should be given to its difficulty. Items with a high difficulty index
(most examinees answer them correctly) and those with a low
difficulty index (most examinees answer them incorrectly) commonly
have low discrimination power [35, 63]; in both cases, such items
will not discriminate between examinees because the majority fall
on one side. Items with a moderate difficulty index are therefore
more likely to have good discrimination power. Common causes of
poor discrimination power include technical or writing flaws,
untaught or poorly covered content material, ambiguous wording,
gray areas of opinion and controversy, and wrong keys [12, 50, 62,
66]. In general, the statistical data obtained from item analysis can
help item constructors and exam composers detect defective items.
The decision to revise an item or its distractors must be based on
the difficulty index, discrimination index, and distractor efficiency
together. Revision of items can also lead to modification of the
teaching method or the content material [68].
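As one way of operationalizing such a decision, the sketch below combines the three parameters into a simple screening rule; the cut-offs (moderate difficulty 30-70%, discrimination below 0.15, any NFD) are illustrative choices drawn from ranges mentioned in this chapter, not a prescribed standard.

```python
# Flag an item for review based on difficulty, discrimination, and NFD count.
def flag_item(p_value, disc_index, n_nfd):
    flags = []
    if not 30 <= p_value <= 70:
        flags.append("difficulty outside the moderate range")
    if disc_index < 0.15:
        flags.append("poor discrimination")
    if n_nfd > 0:
        flags.append(f"{n_nfd} non-functional distractor(s)")
    return flags or ["no revision suggested"]

print(flag_item(p_value=85.7, disc_index=0.60, n_nfd=2))  # easy, 2 NFDs
print(flag_item(p_value=66.7, disc_index=0.60, n_nfd=0))  # acceptable
```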
15. Item analysis application
Example 1 (Figure 1)
The number of examinees was 21. The number of test items (total
possible) is 40.

Figure 1. Standard item analysis of a mid-course examination. The total number of items is 40
and the total number of examinees is 21. The KR-20 is 0.82. Pbiserial: point-biserial
correlation; Disc Index: discrimination index; Correct: number and percentage of correct
answers (difficulty index); Pct. Incorrect: percentage of incorrect answers.
The highest and lowest scores were 38 and 14, respectively.
The class average (mean) (30.3) is greater than the class median
(30), which indicates a positively skewed distribution of examinee
scores; despite this, the scores may still show a roughly normal
bell-shaped distribution. If the median is larger than the mean, the
examinee scores are negatively skewed; when the mean equals the
median, the scores are symmetric (zero skew) and normally
distributed with a bell shape.
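A short check of this mean-versus-median rule of thumb, on hypothetical scores:

```python
# Compare mean and median to read off the direction of skew.
from statistics import mean, median

scores = [24, 26, 27, 28, 29, 30, 31, 33, 36, 40]  # hypothetical
m, md = mean(scores), median(scores)
skew = "positive (right)" if m > md else "negative (left)" if m < md else "zero"
print(f"mean={m:.1f}, median={md}, skew: {skew}")   # mean 30.4 > median 29.5
```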
The KR-20 (Cronbach’s alpha) is 0.82, which is an acceptable value
for most authors. Such a value of internal consistency allows
pass/fail decisions; lower values would put the exam in questionable
status.
• Item 1: the difficulty index is 85.7% (easy). Although it has high
discrimination power (DI = 0.6, Pbiserial = 0.58), two distractors are
non-functional (B, C).
Comment: the item needs re-editing. Distractors B and C should be
revised or replaced with more plausible ones before the item is
re-used.
• Item 2: the difficulty index is 100% (easy). It has no discrimination
power (DI = 0.0, Pbiserial = 0.0), and all distractors are
non-functional.
Comment: the item needs major revision or rewriting. It is absolutely
easy, with no discriminating value. Such items should be removed
from the question bank, and their removal from the exam is
considered valid.
• Item 6: the difficulty index is 66.7% (moderate). It has high
discrimination power (DI = 0.6, Pbiserial = 0.43), and all the
distractors are functional.
Comment: the item has acceptable indices. Such items can be saved
in the question bank for further use. The distractors can still be
updated for more efficiency.
• Item 7: the difficulty index is 28.6% (difficult). It has high
discrimination power (DI = 0.67, Pbiserial = 0.39), and all the
distractors are functional.
Comment: the item has acceptable indices. Such items can be saved
in the question bank for further use. The distractors can still be
updated for more efficiency.
• Item 8: the difficulty index is 76.2% (moderate). This item has a
negative discrimination index (−0.33) and a poor Pbiserial (0.04).
Only one distractor is functional (C). The negative discrimination
index arises because more students in the lower group (27%) than
in the upper group (27%) answered the item correctly.
Comment: although the item has a moderate difficulty index, it
discriminates poorly. Such an item needs major revision.
Example 2 (Figure 2)
The number of examinees was 25. The number of test items is 40.

Figure 2. Standard item analysis of a mid-course examination. The total number of items is 40
and the total number of examinees is 25. The KR-20 is 0.74. Pbiserial: point-biserial
correlation; Disc Index: discrimination index; Correct: number and percentage of correct
answers (difficulty index); Pct. Incorrect: percentage of incorrect answers.
The highest and lowest scores were 33 and 13, respectively.
The class average (mean) (24.6) is less than the class median (25),
so the distribution of examinee scores is skewed to the left
(negatively skewed); despite this, the scores may still show a
roughly normal bell-shaped distribution.
The KR-20 (Cronbach’s alpha) is 0.74, which is an acceptable value
for most authors. Such a value of internal consistency is suitable for
class tests.
• Item 8: the difficulty index is 4.0% (difficult). It has negative
discrimination power (DI = −0.17, Pbiserial = −0.06), and one
distractor is non-functional (C).
Comment: the correct answer is (A), while most of the examinees
chose (B). According to the distractor analysis, this item is mis-keyed
rather than suffering from an implausible distractor.
• Item 9: the difficulty index is 20% (difficult). It has low
discrimination power (DI = 0.17, Pbiserial = 0.09), and all distractors
are functional.
Comment: the distractor analysis shows that options (A) and (B) are
chosen more often than the key. This points to a problem of option
plausibility, which can affect the item’s difficulty index. The
distractors of this item should be revised or replaced with more
plausible ones.
• Item 11: the difficulty index is 44.0% (moderate). It has low
discrimination power (DI = 0.0, Pbiserial = 0.01), and only one of the
distractors is non-functional.
Comment: the item has an acceptable difficulty index. Distractor (D)
is chosen by upper-group examinees about as often as the key
answer; such a situation suggests a missed key or implausible
distractors. The distractors need to be updated for more efficiency.