Measurement Reading Material

Contents
AN OVERVIEW OF MEASUREMENT AND EVALUATION
   Unit objectives
BLOOM’S TAXONOMY OF EDUCATIONAL OBJECTIVES
   Taxonomy of Educational Objectives
   Steps for Stating Instructional Objectives
CLASSROOM ACHIEVEMENT TESTS AND ASSESSMENTS
   Unit Objectives
TEST DEVELOPMENT – PLANNING THE CLASSROOM TEST
   Test Development – Planning the Classroom Test
ASSEMBLING, REPRODUCING, ADMINISTERING, AND SCORING OF CLASSROOM TESTS
   Unit objectives
   Arranging the test items
   Reproducing test items
SUMMARIZING AND INTERPRETING TEST SCORES
   Descriptive statistics
RELIABILITY AND VALIDITY OF A TEST
   Test Reliability
   Validity of a Test
JUDGING THE QUALITY OF A CLASSROOM TEST
   Judging the Quality of a Classroom Test
   Item Analysis and Criterion Referenced Mastery Tests
   Building a Test Item File (Item Bank)
UNIT ONE
AN OVERVIEW OF MEASUREMENT AND EVALUATION
Unit objectives
After reading through this unit and completion of the tasks and activities, you will be able to:
Test
A test is a measuring tool or instrument in education. More specifically, a test is a kind or class of measurement device typically used to find out something about a person. Most of the time, when you finish a lesson or lessons in a week, your teacher gives you a test. This test is an instrument given to you by the teacher in order to obtain data on which you are judged. It is a common type of educational device which an individual completes, the intent being to determine changes or gains resulting from instruction; related measurement instruments include the inventory, the questionnaire and the rating scale.
Testing
Testing, on the other hand, is the process of administering the test to the pupils. In other words, the process
of making you or letting you take the test in order to obtain a quantitative representation of the cognitive
or non-cognitive traits you possess is called testing. So, the instrument or tool is the test and the process
of administering the test is testing.
Assessment
Assessment is a systematic basis for making inferences about the learning and development of students… the process of defining, selecting, designing, collecting, analysing, interpreting and using information to increase students' learning and development.
Measurement
In simple terms, measurement refers to giving or assigning a number value to a certain attribute or
behaviour. It is a systematic process of obtaining the quantified degree to which a trait or an attribute is
present in an individual or object. In other words, it is a systematic assignment of numerical values or
figures to a trait or an attribute in a person or object. Measurement conveys a broader meaning.
Measurement uses a variety of ways to obtain information in a quantitative form. Measurement can use
paper and pencil test, rating scales, and observations to assign a number value to a given trait or
behaviour. Measurement can also refer to both the score obtained by the measuring device and the process used to obtain the score.
Evaluation
Evaluation is formative when conducted over small bodies of content to provide feedback in directing
further instruction and student learning. Formative evaluation then refers to an ongoing process which is
done before instruction, during instruction, and at the end of a term or unit. Summative evaluation, on the other hand, is an evaluation conducted over the larger outcomes of an extended instructional sequence, such as an entire course or a large part of it.
Summative evaluation may serve for reporting a student’s overall achievement, licensing and certifying,
predicting success in related courses, assigning marks, and reporting overall achievement of a class.
Evaluation in the classroom context is directed to the improvement of student learning by supporting the
instructional process as in the following.
Types of evaluation
The different types of evaluation are: placement, formative, diagnostic and summative evaluations.
Placement Evaluation
This is a type of evaluation carried out in order to place students in the appropriate group or class. In some schools, for instance, students are assigned to classes according to their subject combinations, such as Science, Technical, Arts, Commercial, etc. Before this is done, an examination is carried out, in the form of a pretest or aptitude test. It can also be a type of evaluation made by the teacher to find out the entry behaviour of his students before he starts teaching. This may help the teacher to adjust his lesson plan. Tests like readiness tests, ability tests, aptitude tests and achievement tests can be used.
Formative Evaluation
This is a type of evaluation designed to help both the student and the teacher to pinpoint areas where the student has failed to learn, so that the failure may be rectified. It provides feedback to the teacher and the student and thus estimates teaching success, e.g., through weekly tests, terminal examinations, etc.
Diagnostic Evaluation
This type of evaluation is carried out most of the time as a follow-up to formative evaluation. As a teacher, you may have used formative evaluation to identify some weaknesses in your students and applied some corrective measures which have not been successful. What you will now do is design a diagnostic test, which is applied during instruction to find out the underlying causes of students' persistent learning difficulties. These diagnostic tests can take the form of achievement tests, performance tests, self-ratings, interviews, observations, etc.
Summative evaluation:
This is the type of evaluation carried out at the end of the course of instruction to determine the extent to
which the objectives have been achieved. It is called a summarizing evaluation because it looks at the
entire course of instruction or program and can pass judgment on the teacher and students, the curriculum
and the entire system. It is used for certification. Think of the educational certificates you have acquired
from examination bodies. These were awarded to you after you had gone through some types of
examination. This is an example of summative evaluation.
Norm-referenced tests
These are tests used to compare the performance of an individual with those of other individuals of
comparable background. In other words, the score of an individual in a norm-referenced testing has
meaning only when it is viewed in relation to the scores of other individuals on the test. The success or
failure of an individual on this kind of test is, therefore, determined on the basis of how he/she performs
in relation to his/her colleagues’ performance on the test.
Criterion-referenced tests
In contrast to norm-referenced tests, criterion-referenced tests are tests in which the score of an individual on a given test is related to a specific performance standard for interpretation purposes. Such tests are labelled criterion-referenced, the criterion in this respect being the specific performance standard.
If a given score of an examinee is equal to or greater than a specified standard (i.e., the criterion) the
examinee is said to have passed; otherwise, she/he is deemed to have failed the test or examination.
Therefore, the success or failure of an individual on a criterion-referenced test (CRT) depends on what he/she scores on the test in relation to the set standard, and this may depend, to a large extent, on the test content itself or on the relative strictness of the marker if the test is not an objective one. On any criterion-referenced measure, therefore, it is possible for all the members of a class to pass or to fail the test, depending on a number of obvious reasons such as the easiness or difficulty of the test items.
SELF CHECK
1. Explain the difference between measurement and evaluation, assessment and testing.
2. What are the types of evaluation?
3. What is the major difference between test and testing?
4. In your own words define Assessment.
5. Give an example of a test.
6. What are the major differences and similarities between formative evaluation and diagnostic
evaluation?
7. List 5 purposes of measurement and evaluation.
8. List the instruments which you can use to measure the following: weight, height, length,
achievement in Mathematics, performance of students in technical drawing, attitude of workers
towards delay in the payment of salaries
UNIT TWO
BLOOM’S TAXONOMY OF EDUCATIONAL OBJECTIVES
Unit objectives
At the end of this unit, you will be able to:
Stating general instructional objectives
- Begin each general objective with a verb, like knows, applies, interprets.
- State each general objective to include only one general learning outcome; that is, objectives should be unitary (not "knows and understands").
- State each general objective at the proper level of generality; it should encompass a readily definable domain of responses.
Stating specific learning outcomes
- List beneath each general instructional objective a representative sample of specific learning outcomes that describe the terminal performance students are expected to demonstrate.
- Begin each specific learning outcome with an action verb that specifies observable performance, like identifies, describes.
- Make sure that each specific learning outcome is relevant to the general objective it describes.
Self-Check Exercises
Instruction I: Choose the best answer from the alternatives given for each item
1. Which one of the following action verbs can be used in synthesis level of cognitive domain?
A. Write B. Transfer C. Distinguish D. Interpret
2. Which one of the following action verbs does not indicate learning outcome at the
evaluation level?
A. Decide B. Conclude C. Validate D. Discriminate
3. At which one of the following levels of the affective domain is the learner expected to develop a
consistent philosophy of life?
A. Organization B. Characterization C. Valuing D. Responding
UNIT THREE
CLASSROOM ACHIEVEMENT TESTS AND ASSESSMENTS
Unit Objectives
After going through this unit, the learner should be able to:
List the different types of items used in classroom tests.
Describe the different types of objective questions.
Describe the different types of essay questions.
Compare the characteristics of objective and essay tests.
Explain the meaning of objective test.
TYPES OF TESTS USED IN THE CLASSROOM
There are different types of test formats used in the classroom: the essay test, the objective test, the norm-referenced test and the criterion-referenced test. But we are going to concentrate on the essay test and the objective test. These are the most common tests which you can easily construct for your purposes in the class.
Types of objective tests
The objective test can be classified into those that require the examinee to supply the answer to the test items (free-response type) and those that require the examinee to select the answer from a given number of alternatives (fixed-response type). The free-response type consists of short-answer and completion items, while the fixed-response type is commonly further divided into true-false (alternative-response) items, matching items and multiple-choice items.
Selection type items
This is the type where possible alternatives are provided for the testee to choose the most appropriate or the correct option. Can you mention them? Let us take them one by one.
True/False or two option items
The true-false type of test is representative of a somewhat larger group called alternative-response items, such as yes-no, correct-incorrect, agree-disagree, right-wrong, etc. This group consists of any question in which the student is confronted with two possible answers. Since most of the points discussed here are equally applicable to all alternative-response items, and since teachers are most familiar with the true-false type, the following discussion will concentrate on true-false items.
Advantages of true/false items
It is commonly used to measure the ability to identify the correctness of statements of fact, definitions of terms, statements of principles and other relatively simple learning outcomes for which a declarative statement might be used with any of the several methods of responding.
It is also used to measure an examinee's ability to distinguish fact from opinion, or superstition from scientific belief.
It is used to measure the ability to recognize cause-and-effect relationships.
It is best used in situations in which there are only two possible alternatives, such as right or wrong, more or less, and so on.
It is easy to construct alternative-response items, although the validity and reliability of such items depend on the skill of the item constructor; constructing unambiguous alternative-response items that measure significant learning outcomes requires much skill.
A large number of alternative-response items covering a wide area of sampled course material can be used, and examinees can respond to them in a short period of time.
Disadvantages of true/false items
It requires course material that can be phrased so that the statement is true or false without qualification or exception, which is often difficult in areas such as the Social Sciences.
It is limited to learning outcomes in the knowledge area, except for distinguishing between fact and opinion or identifying cause-and-effect relationships.
It is susceptible to guessing, with a fifty-fifty chance of the examinee selecting the correct answer by chance alone. The chance selection of correct answers has the following effects:
i. It reduces the reliability of each item, thereby making it necessary to include many items in order to obtain a reliable measure of achievement.
ii. The diagnostic value of answers to guessed test items is practically nil, because analysis based on such responses is meaningless.
iii. The validity of examinees' responses is also questionable because of response sets.
Guidelines for preparing true/false items
Avoid using negative statements, especially double negatives. Under the demands of the testing situation, students may fail to see the negative qualifier.
- Poor: None of the steps in the planning stage of test construction are not important. T F
- Better: All of the steps in the planning stage of test construction are important. T F
Test important ideas, knowledge, or understanding (rather than trivia, general knowledge, or common sense). Look at the following examples:
o Artists live longer than farmers. T F
o The coefficient of correlation shows the cause-and-effect relationship between two paired variables. T F
Avoid copying statements directly from the textbook and other written materials.
Keep the word length of true statements about the same as that of false statements.
Make sure that the statements used are entirely true or entirely false. (Partially or marginally true or false statements cause unnecessary ambiguity.)
Matching items
The matching test item usually consists of two parallel columns. One column contains a list of words, numbers, symbols or other stimuli (premises) to be matched to a word, sentence, phrase or other possible answer from the other column (the responses). The examinee is directed to match the responses to the appropriate premises. Usually, the two lists have some sort of relationship; although the basis for matching responses to premises is sometimes self-evident, more often it must be explained in the directions.
Advantages of matching items
It is used whenever learning outcomes emphasize the ability to identify the relationship
between things and a sufficient number of homogenous premises and responses can be
obtained.
It is essentially used to relate two things that have some logical basis for association.
It is adequate for measuring factual knowledge like testing the knowledge of terms,
definitions, dates, events, references to maps and diagrams.
The major advantage of matching exercise is that one matching item consists of many
problems. This compact form makes it possible to measure a large amount of related factual
material in a relatively short time.
It enables the sampling of larger content, which results in relatively higher content validity.
The guess factor can be controlled by skilfully constructing the items such that the correct
response for each premise must also serve as a plausible response for the other premises.
The scoring is simple and objective and can be done by machine.
Disadvantages of matching items
It is restricted to the measurement of factual information based on rote learning, because the material tested must lend itself to the listing of a number of important and related concepts.
Many topics are unique and cannot be conveniently grouped in homogenous matching clusters, and it is sometimes difficult to get homogenous clusters of premises and responses that match sufficiently, even for content that is adaptable to clustering.
It requires extreme care during construction in order to avoid encouraging serial
memorization rather than association and to avoid irrelevant clues to the correct answer.
Guidelines for preparing matching items
Use only homogeneous material in a set of matching items (i.e., dates and places should not
be in the same set).
Use the more involved expressions in the stem and keep the responses short and simple.
Supply directions that clearly state the basis for the matching, indicating whether or not a
response can be used more than once, and stating where the answer should be placed.
Make sure that there are never multiple correct responses for one stem (although a response
may be used as the correct answer for more than one stem).
Avoid giving inadvertent grammatical clues to the correct response (e.g., using a/an, singular/
plural verb forms).
Arrange items in the response column in some logical order (alphabetical, numerical, and
chronological) so that students can find them easily.
Avoid breaking a set of items (stems and responses) over two pages.
Use no more than 15 items in one set.
Provide more responses than stems to make process-of-elimination guessing less effective.
Number each stem for ease in later discussions.
Use capital letters for the response signs rather than lower-case letters.
Multiple-choice items
A multiple-choice item presents a problem (the stem) together with a list of suggested answers (the alternatives or options). The incorrect alternatives are those other than the correct or best option and are therefore called distracters, foils or decoys. These incorrect alternatives receive their name from their intended function – to distract the examinees who are in doubt about the correct answer.
Advantages of the MCQs
The multiple-choice item is the most widely used of the types of tests available. It can be
used to measure a variety of learning outcomes from simple to complex.
It is adaptable to any subject matter content and educational objective at the knowledge
and understanding levels.
It can be used to measure knowledge outcomes concerned with vocabulary, facts,
principles, method and procedures and also aspects of understanding relating to the
application and interpretation of facts, principles and methods.
Most commercially developed and standardized achievement and aptitude tests make use
of multiple-choice items.
The main advantage of multiple-choice test is its wide applicability in the measurement
of various phases of achievement.
It is the most desirable of all the test formats, being free of many of the disadvantages of other forms of objective items. For instance, it presents a more well-defined problem than the short-answer item, avoids the need for the homogenous material necessary for the matching item, reduces the clues and susceptibility to guessing characteristic of the true-false item, and is relatively free from response sets.
It is useful in diagnosis, and it enables fine discrimination among examinees on the basis of how much of the trait being measured they possess.
It can be scored with a machine.
Disadvantages/limitations of the MCQs
It measures problem-solving behaviour at the verbal level only.
It is inappropriate for measuring learning outcomes requiring the ability to recall, organize or present ideas, because it requires selection of the correct answer rather than construction of a response.
It is very difficult and time consuming to construct.
It requires more response time than any other type of objective item and may favour test-wise examinees if not adequately and skilfully constructed.
Measuring evaluation and synthesis can be difficult.
It is inappropriate for measuring outcomes that require skilled performance.
Guidelines for preparing multiple-choice items
Use the stem to present the problem or question as clearly as possible; eliminate excessive
wordiness and irrelevant information.
Use direct questions rather than incomplete statements for the stem.
Include as much of the item as possible in the stem so that alternatives can be kept brief.
Include in the stem words that would otherwise be repeated in each option.
In testing for definitions, include the term in the stem rather than as one of the alternatives.
List alternatives on separate lines rather than including them as part of the stem so that they
can be clearly distinguished.
Keep all alternatives in a similar format (e. g. All phrases, all sentences, etc.).
Make sure that all options are plausible responses to the stem. (Poor alternatives should not
be included just for the sake of having more options.)
Check to see that all choices are grammatically consistent with the stem.
Try to make alternatives for an item approximately the same length. (Making the correct
response consistently longer is a common error.)
Use misconceptions which students have indicated in class or errors commonly made by
students in the class as the basis for incorrect alternatives.
Use “all of the above” and “none of the above” sparingly since these alternatives are often chosen on the basis of incomplete knowledge. Words such as “all,” “always,” and “never” are likely to signal incorrect options.
Use capital letters (A, B, C, D, and E) on tests as response labels rather than lower-case letters (“a” gets confused with “d” and “c” with “e” if the type or duplication is poor). Instruct students to use capital letters when answering (for the same reason), or have them circle the letter or the whole correct answer, or use scannable answer sheets.
Try to write items with equal numbers of alternatives in order to avoid asking students to
continually adjust to a new pattern caused by different numbers.
When an incomplete statement is used as the stem, put the incomplete part at the end rather than at the beginning of the statement.
Use negatively stated items sparingly. (When they are used, it helps to underline or otherwise visually emphasize the negative word.)
Make sure that there is only one best or correct response to the stem. If there are multiple
correct responses, instruct students to “choose the best response.”
Limit the number of alternatives to five or less. (The more alternatives used, the lower the
probability of getting the correct answer by guessing. Beyond five alternatives, however,
confusion and poor alternatives are likely.)
Supply type items
This is the type of test item which requires the testee to give very brief answers to the questions. These answers may be a word, a phrase, a number, a symbol or symbols, etc. Supply test items can take the form of short-answer or completion items. Both are supply-type test items consisting of direct questions which require a short answer (short-answer type) or an incomplete statement or question to which a response must be supplied by the examinee (completion type). The answers to such questions could be a word, phrase, number or symbol. Such items are easy to develop and, if well developed, the answers are definite and specific and can be scored quickly and accurately.
Advantages of supply type items
They measure the ability to interpret diagrams, charts, graphs and pictorial data.
They are most effective for measuring specific learning outcomes, such as computational learning outcomes in mathematics and the sciences.
Guidelines for preparing supply type items
Word questions carefully so that all students understand the specific nature of the question asked and the answer required.
- Better: In what battle fought in 1853 did Tewodros II defeat Ras Ali?
Word completion or fill-in-the-blank questions so that the missing information is at, or near, the end of the sentence; this makes reading and responding easier.
- Better: If a room measures 7 meters by 4 meters, the perimeter is ______ meters (or m).
Do not use too many blanks in completion items. The emphasis should be on knowledge and comprehension, not mind reading.
- Consider: In the year __________, Prime Minister ___________ signed the _________, which led to a __________ which was ____________.
Word each item in specific terms with clear meanings so that the intended answer is the only one possible, and so that the answer is a single word, brief phrase, or number.
In supply items, present most of the statement and blank out the key word.
- Better: Proper nouns are words that refer to particular _________, __________ or ________.
- Best: Words that refer to particular persons, objects or things are ____________.
Types of essay test items
Essay items may be extended (unrestricted/open-ended/free response) or restricted (closed-ended).
Extended-response items:
- place no restrictions on the response;
- place no restrictions on the number of pages;
- require originality;
- are applicable in measuring higher-level learning outcomes of the cognitive domain, such as the analysis, synthesis and evaluation levels.
Examples:
- Describe the processes of producing or cutting screw threads in the school technical workshop.
- Why should the classroom teacher state his instructional objectives to cover the three domains of educational objectives?
- "Open and Distance Learning is a viable option for the eradication of illiteracy in Ethiopia." Discuss.
Restricted-response items are directional questions aimed at the desired responses, and measure outcomes at the knowledge, comprehension and analysis levels.
Examples:
Difficulty levels and discrimination powers
How to make essay questions less subjective
Avoid open-ended questions.
Let all the students answer the same questions; avoid optional questions/choices.
Use students' numbers instead of their names, to conceal their identity.
Score all the answers to each question for all students at one time.
Do not allow the score on one question to influence you while marking the next. Always rearrange the papers before you mark.
Do not allow your feelings or emotions to influence your marking.
Decide on a policy for handling irrelevant or incorrect responses.
SELF CHECK
1. Briefly explain the meaning of objective test.
2. What are the two major advantages of objective test items over the Essay test item?
3. What is the major feature of objective test that distinguishes it from essay test?
4. Which of the types of objective test items would you recommend for school wide test and why?
5. How would you make essay questions less subjective?
6. What are the two sub-divisions of the supply test items?
7. (a) Subjectivity in scoring is a major limitation of the essay test.
True / False
(b) Essay questions cover the content of a course and the objectives as comprehensively as possible.
True / False
(c) Grading of essay questions is time consuming.
True / False
(d) Multiple choice questions should have only two options.
True / False
8. Construct five multiple choice questions in any course of your choice.
9. Give 5 examples of free response or extended response questions
10. Briefly identify the two most outstanding weaknesses of essay test as a measuring instrument.
UNIT FOUR
TEST DEVELOPMENT – PLANNING THE CLASSROOM TEST
Unit objectives
By the time you finish this unit, you will be able to:
Identify the sequence of planning a classroom test,
Prepare a table of specifications for a classroom test in a given subject,
Recognize some common problems of teacher-made tests, and
Carry out a content survey in the development of a table of specifications.
Test Development – Planning the Classroom Test
The development of good questions or items for a classroom test cannot be taken for granted. An inexperienced teacher may write good items by chance, but this is not always possible. Development of good questions or items must follow a number of principles, without which no one can guarantee that the responses given to the tests will be relevant and consistent. In this unit, we shall examine the various aspects of the teacher's own test.
Some Pitfalls in Teacher-Made Tests
The following observations have been made about teacher-made tests. They are listed below so that you can avoid these pitfalls when you construct questions for your class tests.
Most teacher-made tests are not appropriate to the different levels of learning outcomes. Teachers specify their instructional objectives covering the whole range from simple recall to evaluation, yet their items fall within the recall of specific facts only.
Many of the test exercises fail to measure what they are supposed to measure. In other words, most teacher-made tests are not valid. You may wonder what validity is. It is a very important quality of a good test: a test is valid if it measures what it is supposed to measure. You will read about it in detail later in this course.
Some classroom tests do not comprehensively cover the topics taught. One of the qualities of a good test is that it should represent the entire content taught, but these tests cannot be said to be a representative sample of the whole topic taught.
Most tests prepared by teachers lack clarity in their wording. The questions are ambiguous, imprecise and often carelessly worded, and most of them are general or global questions.
Most teacher-made tests fail item analysis: they fail to discriminate properly and are not designed according to difficulty levels.
These are not the only pitfalls, but you should try to avoid both the ones mentioned here and others like them. Now let us look at how to develop test items.
Considerations in Planning a Classroom Test
To plan a classroom test that will be both practical and effective in providing evidence of
mastery of the instructional objectives and content covered requires relevant considerations.
Hence the following serves as guide in planning a classroom test.
Determine the purpose of the test;
Describe the instructional objectives and content to be measured;
Determine the relative emphasis to be given to each learning outcome;
Select the most appropriate item formats (essay or objective);
Develop the test blueprint to guide the test construction;
Prepare test items that are relevant to the learning outcomes specified in the test plan;
Decide on the pattern of scoring and the interpretation of results;
Decide on the length and duration of the test; and
Assemble the items into a test, prepare directions and administer the test.
Analysis of the Instructional Objectives
The instructional objectives of the course are critically considered while developing the test
items. This is because the instructional objectives are the intended behavioral changes or
intended learning outcomes of instructional programs which students are expected to possess at
the end of the course or program of study. The instructional objectives usually stated for the
assessment of behavior in the cognitive domain of educational objectives are classified by Bloom
(1956) in his taxonomy of educational objectives into knowledge, comprehension, application,
analysis, synthesis and evaluation. The objectives are also given relative weights according to the level of importance and emphasis placed on them. Educational objectives and the content of a course form the nucleus around which test development revolves.
Content Survey
This is an outline of the content (subject matter or topics) of a course or program to be covered in the test. The test developer assigns relative weights to the outlined content – the topics and subtopics to be covered in the test. This weighting depends on the importance and emphasis given to each content area. A content survey is necessary since the content is the means by which the objectives are to be achieved and the level of mastery is determined.
Planning the table of specifications/test blue print
The table of specifications is a two-dimensional table that specifies the levels of the objectives in relation to the content of the course. A well-planned table of specifications enhances the content validity of the test for which it is planned. The two dimensions (content and objectives) are put together by listing the objectives across the top of the table (horizontally) and the content down the table (vertically), to provide the complete framework for the development of the test items. The table of specifications is planned to take care of the coverage of content and objectives in the right proportion, according to the degree of relevance and emphasis (weight) attached to them in the teaching-learning process. A table may alternatively specify the type of items in relation to the content of the course; the two forms are called a table of specifications by objective and a table of specifications by test type, respectively. A hypothetical table of specifications by objective is illustrated in Table 4.1 below:
Table 4.1 A Hypothetical Test Blue Print/Table of Specifications by Objective

Content | Weight | Knowledge | Comprehension | Application | Analysis | Synthesis | Evaluation | Total
Set A   | 15%    | -         | 1             | -           | 2        | -         | -          | 3
Set B   | 15%    | -         | 1             | -           | 2        | -         | -          | 3
Set C   | 25%    | 1         | -             | 1           | 1        | 1         | 1          | 5
Set D   | 25%    | 1         | -             | 1           | 1        | 1         | 1          | 5
Set E   | 20%    | -         | 1             | 1           | -        | -         | 2          | 4
Total   | 100%   | 2         | 3             | 3           | 6        | 2         | 4          | 20

(The objective columns follow Bloom's levels as listed above: knowledge, comprehension, application, analysis, synthesis and evaluation; the cell entries are numbers of items.)
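To see how such a blueprint is used, the weights can be turned into item counts mechanically. The short Python sketch below does this for the hypothetical 20-item test in Table 4.1; it is an illustration, not part of the module.

```python
# Allocate items to content areas by weight (weights from Table 4.1).
weights = {"Set A": 0.15, "Set B": 0.15, "Set C": 0.25,
           "Set D": 0.25, "Set E": 0.20}
total_items = 20

for content, w in weights.items():
    # Set A/B -> 3 items, Set C/D -> 5 items, Set E -> 4 items,
    # matching the Total column of the table.
    print(content, round(w * total_items))
```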
Self-Check Exercises
Dear Trainees! You have now completed your study of chapter four. Therefore, you are expected to
answer the following self-test questions.
UNIT FIVE
ASSEMBLING, REPRODUCING, ADMINISTERING, AND
SCORING OF CLASSROOM TESTS
Unit objectives
By the time you finish this unit you will be able to:
Explain the meaning of test administration
State the steps involved in test administration
Identify the need for civility and credibility in test administration
State the factors to be considered for credible and civil test administration
Introduction
In most cases, tests cannot be administered orally or easily written on the chalkboard, so they must be reproduced. In assembling the test for reproduction, items of the same format should be grouped together, for the following reasons:
Younger children may not realize that the first set of directions is applicable to all items of a particular format and may become confused if formats are mixed;
It makes it easier for the examinee to maintain a particular mental set rather than having to change from one format to another;
It makes it easier for the teacher to score the test, mainly when hand scoring is done.
In arranging item formats due emphasis should be given to the complexity of mental activity they demand
in answering them. In this way we have to arrange item formats so that they progress from the simple to
the complex. For instance, items that measure simple recall should precede those that measure
understanding and application.
According to Gronlund (1985), item formats can be arranged in the following way, which roughly approximates the complexity of the instructional objectives measured. Hence:
Teachers should be aware of the significance of providing clear and concise directions. The directions should transmit clear information concerning what to do, how to do it and where to record answers. In other words, directions should tell students:
7. In the elementary grades, if workspace is needed to solve numerical problems, provide this space
in the test booklet rather than having examinees use scratch paper. This would minimize
recording errors that might occur when students transfer questions from test booklet to scratch
paper for computation.
8. All illustrative material used should be clear, legible, and accurate.
9. Proofread the test carefully before it is reproduced. If you can, it is better to have a teacher who teaches the same subject check for errors so that they can be corrected early. If errors are found after the test has been reproduced, they should be called to the students' attention before the actual test is begun.
10. Even for essay tests, every student should have a copy of the test; teachers should not write the questions on the blackboard.
Self-assessment exercise
1. What is test administration?
UNIT SIX
SUMMARIZING AND INTERPRETING TEST SCORES
Unit objectives
Dear learner, by the time you finish this unit, you will be able to:
Interpret classroom test scores in criterion-referenced or norm-referenced terms
Calculate the average result of a given class of test scores
Decide if there is a relationship between two factors
Convert raw scores to z-scores and T-scores
Compare one’s score with the score of a group
Descriptive statistics
One of the major purposes of statistics in test use is to allow us to describe and summarize data-
for example, test scores- in efficient and useful ways.
Statistics enable us to:
1. Describe
2. Interpret
3. Pass judgment
Measures of Central Location
What is the average / most popular / mean / typical / "middle" / most common data value? A measure of central location is a value (i.e., a single number) used to represent where the majority of the data values lie for a given random variable.
Three commonly used central location measures are:
1. Mean
2. Median
3. Mode
The Mean ($\bar{X}$)
The mean is the arithmetic average of the observed scores. It is the most popular, useful and commonly used measure of central location.
1. For ungrouped scores: $\bar{X} = \frac{\sum X}{N}$
2. For a frequency distribution: $\bar{X} = \frac{\sum fX}{N}$
where $\sum$ = summation, X = raw score, f = frequency, N = total number of students, and $\bar{X}$ = mean.
The Median
The median of a random variable is the value which divides ranked data into two equal
halves, i.e., it is the middle number of an ordered set of data.
The median is valid for quantitative random variables only.
It is the score that divides a score distribution into two equal parts.
It is most appropriate when dealing with small number of students.
It is easy to understand.
It is not affected by outliers
One of its disadvantages is that it can only be calculated for quantitative variables.
- Arrange the scores in decreasing/increasing order.
- With an even number of scores, take the two middle scores and divide their sum by two.
Example 1: Here are the scores of six people: 14, 11, 8, 6, 7, 9. Calculate the median.
Ordered: 6, 7, 8, 9, 11, 14
$Mdn = \frac{3^{rd}\ term + 4^{th}\ term}{2} = \frac{9 + 8}{2} = \frac{17}{2} = 8.5$
Example 2: Calculate the median of the following scores: 14, 11, 9, 6, 8, 7, 5.
Ordered: 5, 6, 7, 8, 9, 11, 14; here N = 7, so the median is the 4th term, which is 8.
Remark: $Mdn = \left(\frac{N+1}{2}\right)^{th}$ term of the ordered scores.
The Mode
It is the most frequent value; mode means most “popular”. The score or scores with the highest frequency form the mode of the distribution. The mode is valid for both quantitative and qualitative random variables. A set of data may have one mode, or two or more modes.
Advantages:
1. It is easy to calculate
2. It is valid for all data types.
3. It is not affected by outliers
4. Most appropriate when numerical values in a data set are labels for categories
(nominal)
Disadvantages:
1. There could be more than one mode, which could lead to confusion since no single value is then representative of the data
2. It could be a random event and not truly representative of the data, especially in a
relatively small data set.
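The three central-location measures above can be checked quickly in code. The following is a minimal Python sketch (not part of the module) using the standard statistics library; the second data set is hypothetical, chosen to show that a distribution may be bimodal. Note that multimode requires Python 3.8 or later.

```python
from statistics import mean, median, multimode

scores = [14, 11, 8, 6, 7, 9]          # scores from Example 1 above

print(mean(scores))                     # 55 / 6 = 9.166...
print(median(scores))                   # middle pair of the ordered list: (8 + 9) / 2 = 8.5
print(multimode([3, 5, 5, 7, 9, 9]))    # two or more modes are possible -> [5, 9]
```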
Measures of Variability
Consider the following distributions of scores:
Distribution I: 37 37 37 37 37
Distribution II: 33 36 37 38 41
Note: these distributions have the same arithmetic mean (37), but there is a marked difference between them in how the scores spread.
The topic of dispersion of scores is therefore concerned with measures which show the amount of variability among data. The common measures are:
1. The Range
2. The Variance
3. The Standard Deviation
The range
The range is the difference between the highest and the lowest scores in a distribution. The
higher the value of the range, the greater the difference between the students in academic
achievement. However, it is a crude measure of variability.
The variance
Variance is the arithmetic mean of the squared deviations of individual scores from the mean. It is expressed in squared units. It shows the spread or dispersion of scores, i.e., a tendency for any set of scores to depart from a central point or any other point.
Definitional formulae:
$\sigma^2 = \frac{\sum (X - \bar{X})^2}{N}$ (ungrouped scores); $\sigma^2 = \frac{\sum f(X - \bar{X})^2}{N}$ (frequency distribution)
where $\sigma^2$ = variance, X = raw score, $\bar{X}$ = mean of the distribution, f = frequency, and N = total number of students.
The standard deviation
The standard deviation is a measure of how much a set of scores varies on the average around the mean of the scores. In other words, it reveals how closely scores tend to vary from the mean. Standard deviation is the positive square root of the variance; it measures the extent to which scores tend to deviate from the mean. Its formulae are:
$\sigma = \sqrt{\frac{\sum (X - \bar{X})^2}{N}}$; $\sigma = \sqrt{\frac{\sum f(X - \bar{X})^2}{N}}$
- The larger the σ, the greater the difference in academic achievement; the smaller the standard deviation, the less the scores tend to vary from the mean.
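As a worked check on these formulae, here is a small Python sketch (not from the module) computing the range, variance and standard deviation of Distribution II above.

```python
from math import sqrt

scores = [33, 36, 37, 38, 41]    # Distribution II
n = len(scores)
mean = sum(scores) / n           # 185 / 5 = 37

score_range = max(scores) - min(scores)              # 41 - 33 = 8
variance = sum((x - mean) ** 2 for x in scores) / n  # 34 / 5 = 6.8
sd = sqrt(variance)                                  # about 2.61

print(score_range, variance, sd)
```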
Measures of Position
Percentiles and Percentile Ranks
Percentiles and percentile ranks are frequently used as indicators of performance in both the
academic and corporate worlds. Percentiles and percentile ranks provide information about how
a person or thing relates to a larger group. Relative measures of this type are often extremely
valuable to researchers employing statistical techniques.
Percentiles
A percentile is the point in a distribution at or below which a given percentage of scores is found; that is, the value below which P% of the values fall is called the Pth percentile. For example, the 5th percentile is denoted by P5, the 10th by P10 and the 95th by P95.
Percentile Rank
A percentile rank is used to determine where a particular score or value fits within a broader
distribution. For example: A student receives a score of 75 out of 100 on an exam and wishes to
determine how her score compares to the rest of the class. She calculates a percentile rank for a
score of 75 based on the reported scores of the entire class. Her percentile rank in this example
would be 80, meaning that 80 percent of scores on the exam were at or below 75.
Notes:
I. A Percentile is a value in the data set.
II. The percentile rank of a given value is a percent that indicates the percentage of the data at or below that value.
III. Percentiles are not the same as percentages.
Calculation of Percentiles and Percentile Ranks
A. In the case of ranked raw data:
The (approximate) value of the kth percentile, $P_k$, is calculated by the formula
$P_k = \text{value of the } \left(\frac{kn}{100}\right)^{th} \text{ term}$
where k is the percentile one wishes to calculate and n is the total number of values in the distribution.
The percentile rank (PR) of a given value, $x_i$, indicates the percentage of values in the distribution at or below $x_i$.
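The module's percentile-rank formula did not survive reproduction here, so the sketch below simply follows the worked example above (a score of 75 with 80 percent of scores at or below it has a percentile rank of 80). The data set and both helper functions are illustrative assumptions, not the module's official procedure.

```python
def percentile(ranked, k):
    """Approximate kth percentile: the (kn/100)th term of the ranked data."""
    position = max(1, round(k * len(ranked) / 100))  # 1-based position
    return ranked[position - 1]

def percentile_rank(ranked, value):
    """Percentage of data values at or below the given value."""
    at_or_below = sum(1 for x in ranked if x <= value)
    return 100 * at_or_below / len(ranked)

ranked = sorted([55, 60, 62, 68, 70, 71, 73, 75, 90, 95])
print(percentile(ranked, 50))        # (50 * 10) / 100 = 5th term -> 70
print(percentile_rank(ranked, 75))   # 8 of 10 scores at or below 75 -> 80.0
```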
The Z-Scores
The Z-score is the simple standard score which expresses test performance directly as the number of standard deviation units a raw score falls above or below the mean. The Z-score is computed by the formula
$Z = \frac{X - \bar{X}}{SD}$
where X = any raw score, $\bar{X}$ = arithmetic mean of the raw scores, and SD = standard deviation.
When the raw score is smaller than the mean, the Z-score is negative, which can cause serious problems if not well noted in test interpretation. Hence Z-scores are transformed into a standard score system that utilizes only positive values, such as the T-score.
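A short sketch of the conversion follows. The Z-score line implements the formula above; the T-score line uses the common convention T = 50 + 10Z, which the module implies but does not state, so treat that pair of constants as an assumption.

```python
from math import sqrt

scores = [33, 36, 37, 38, 41]    # reusing Distribution II from the earlier example
mean = sum(scores) / len(scores)
sd = sqrt(sum((x - mean) ** 2 for x in scores) / len(scores))

for x in scores:
    z = (x - mean) / sd    # negative when the raw score is below the mean
    t = 50 + 10 * z        # assumed convention; rescales Z to positive values
    print(x, round(z, 2), round(t, 1))
```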
Measures of Relationship
Measures of association provide a means of summarizing the size of the association between two
variables. Most measures of association are scaled so that they reach a maximum numerical
value of 1 when the two variables have a perfect relationship with each other. They are also
scaled so that they have a value of 0 when there is no relationship between two variables. While
there are exceptions to these rules, most measures of association are of this sort. Some measures
of association are constructed to have a range of only 0 to 1; other measures have a range from -1
to +1. The latter provide a means of determining whether the two variables have a positive or
negative association with each other.
Correlation
Chi-square
A. Correlation
A correlation coefficient is used to measure the strength of the relationship between numeric
variables (e.g., weight and height)
If the coefficient is between 0 and 1, as one variable increases, the other also increases. This
is called a positive correlation. For example, height and weight are positively correlated
because taller people usually weigh more.
If the correlation coefficient is between -1 and 0, as one variable increases the other
decreases. This is called a negative correlation. For example, age and hours slept per night
are negatively correlated because older people usually sleep fewer hours per night.
There are two common methods of computing correlation coefficient. These are:
Pearson Product-Moment Correlation.
Spearman Rank-Difference Correlation
Pearson Product-Moment Correlation:
This is the most widely used method, and the coefficient is denoted by the symbol r. This method is favoured when the number of scores is large, and it is also easier to apply to large groups. The computation is easier with ungrouped test scores and is illustrated here. The computation with grouped data appears more complicated and can be obtained from standard statistics textbooks.
The following steps will serve as a guide for computing a product-moment correlation coefficient (r) from ungrouped data.
Step 1 - Begin by writing the pairs of score to be studied in two columns. Make certain that the
pair of scores for each examinee is in the same row. Call one Column X and the other Y
Step 2 - Square each of the entries in the X column and enter the result in the X2 column
Step 3 - Square each of the entries in the Y column and enter the result in the Y2 column
Step 4 - In each row, multiply the entry in the X column by the entry in the Y column, and enter the result in the XY column
Step 5 - Add the entries in each column to find the sum (∑) of each column
Step 6 - Apply the following formula:
$r = \frac{\frac{\sum XY}{N} - \left(\frac{\sum X}{N}\right)\left(\frac{\sum Y}{N}\right)}{\sqrt{\frac{\sum X^2}{N} - \left(\frac{\sum X}{N}\right)^2} \cdot \sqrt{\frac{\sum Y^2}{N} - \left(\frac{\sum Y}{N}\right)^2}}$
OR
$r = \frac{\frac{\sum XY}{N} - (M_X)(M_Y)}{SD_X \cdot SD_Y}$
where
MX = mean of scores in X column
MY= mean of scores in Y column
SDX= standard deviation of scores in X column
SDY = standard deviation of scores in Y column
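Here is a minimal sketch of the second form of the formula, using hypothetical paired scores; it follows the six steps above for ungrouped data.

```python
from math import sqrt

X = [2, 4, 6, 8, 10]    # hypothetical scores, Column X
Y = [1, 3, 5, 7, 9]     # hypothetical scores, Column Y
N = len(X)

mx, my = sum(X) / N, sum(Y) / N
sd_x = sqrt(sum((x - mx) ** 2 for x in X) / N)
sd_y = sqrt(sum((y - my) ** 2 for y in Y) / N)

# r = (sum(XY)/N - Mx*My) / (SDx * SDy)
r = (sum(x * y for x, y in zip(X, Y)) / N - mx * my) / (sd_x * sd_y)
print(r)    # 1.0 -> a perfect positive relationship
```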
Spearman Rank-Difference Correlation:
This method is satisfactory when the number of scores to be correlated is small (less than 30). It is easier to compute with a small number of cases than the Pearson Product-Moment Correlation, and it is a simple, practical technique for most classroom purposes. To use the Spearman Rank-Difference Method, the following steps should be taken.
Computing Procedure for the Spearman Rank-Difference Correlation
Step 1 - Arrange the pairs of scores for each examinee in columns (Columns 1 and 2)
Step 2 - Rank examinees from 1 to N (number in group) for each set of scores
Step 3 - Find the difference (D) in ranks by subtracting the rank in the right-hand column from the rank in the left-hand column
Step 4 - Square each difference in rank to obtain the difference squared (D²)
Step 5 - Apply the formula
$\rho = 1 - \frac{6\sum D^2}{N(N^2 - 1)}$
where D = difference in rank and N = total number of students
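Here is a sketch of the rank-difference procedure with hypothetical marks on two tests; the simple ranking used assumes no tied scores, the case the steps above describe.

```python
X = [88, 72, 95, 60, 81]    # hypothetical marks, test 1
Y = [84, 70, 90, 65, 80]    # hypothetical marks, test 2
N = len(X)

def ranks(values):
    # rank 1 = highest score; assumes no ties
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

d_squared = [(a - b) ** 2 for a, b in zip(ranks(X), ranks(Y))]
rho = 1 - (6 * sum(d_squared)) / (N * (N ** 2 - 1))
print(rho)    # 1.0 here: both tests rank the five examinees identically
```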
Self-Check Exercises
Dear Trainees! You have now completed your study of chapter six. Therefore, you are expected
to answer the following self-test questions.
UNIT SEVEN
RELIABILITY AND VALIDITY OF A TEST
Unit objectives
By the time you finish this unit you will be able to:
Define reliability of a test
State the various forms of reliability
Explain the factors that influence reliability measures
Compare and contrast the different forms of estimating reliability
Define validity as well as content, criterion and construct validity
Test Reliability
Reliability of a test may be defined as the degree to which a test is consistent, stable, dependable or trustworthy in measuring what it is measuring. This definition implies that the reliability of a test tries to answer questions like: How far can we rely on the results from the test? How dependable are scores from the test? How consistent are the items in the test in measuring whatever it is measuring? In general, reliability asks: if the abilities of a set of testees are determined by testing them at two different times using the same test, or by using two parallel forms of the same test, or by using scores on the same test marked by two different examiners, will the relative standing of the testees on each pair of scores remain the same?
Method           | Type of Reliability Measure     | Procedure
Test-retest      | Measure of stability            | Give the same test twice to the same group, with any time interval between tests
Equivalent-forms | Measure of equivalence          | Give two forms of the test to the same group in close succession
Split-half       | Measure of internal consistency | Give the test once; score two equivalent halves (say, odd- and even-numbered items); correct the reliability coefficient to fit the whole test by the Spearman-Brown formula
Kuder-Richardson | Measure of internal consistency | Give the test once; score the total test and apply a Kuder-Richardson formula
Factors Influencing Reliability Measures
The reliability of classroom tests is affected by several factors. These factors can be controlled through adequate care during test construction. Knowledge of these factors is therefore necessary for classroom teachers, to enable them to control the factors and thereby build more reliability into norm-referenced classroom tests.
Length of Test
The reliability of a test is affected by its length: the longer a test is, the higher its reliability will be. This is because a longer test provides a more adequate sample of the behaviour being measured, and the scores are apt to be less distorted by chance factors such as guessing. If the quality of the test items and the nature of the testees can be assumed to remain the same, then the relationship of reliability to length can be expressed by the simple formula stated as follows:
$r_{nn} = \frac{n \cdot r_{ii}}{1 + (n - 1)r_{ii}}$
where:
$r_{nn}$ = the reliability of a test n times as long as the original test
$r_{ii}$ = the reliability of the original test
n = the factor by which the length of the test is increased
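A small sketch of the formula, framed around its most common use named in the table above: in the split-half method, the two half-tests are correlated and the result is corrected to full length with n = 2.

```python
def spearman_brown(r_ii, n):
    """Reliability of a test n times as long as the original (formula above)."""
    return (n * r_ii) / (1 + (n - 1) * r_ii)

print(spearman_brown(0.60, 2))   # half-test r = .60 -> full test: 1.2 / 1.6 = 0.75
print(spearman_brown(0.75, 3))   # tripling a test with r = .75 -> 2.25 / 2.5 = 0.90
```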
Increasing the length of a test makes the scores depend more closely on the characteristics of the person being measured, so a more accurate appraisal of the person is obtained. However, lengthening a test is limited by a number of practical considerations: the amount of time available for testing, factors of fatigue and boredom on the part of the testees, and the inability of classroom teachers to construct more equally good test items. Nevertheless, within these limits, reliability can be increased as needed by lengthening the test.
Spread of Scores
The reliability coefficient of a test is directly influenced by the spread of scores in the group tested. The larger the spread of scores, the higher the estimate of reliability will be, if all other factors are kept constant. Larger reliability coefficients result when individuals tend to stay in the same relative position in the group from one testing to another. It therefore follows that anything that reduces the possibility of shifting positions in the group also contributes to a larger reliability coefficient: greater differences between the scores of individuals reduce the possibility of shifting positions. Hence, errors of measurement have less influence on the relative position of individuals when the differences among group members are large, that is, when there is a wide spread of scores.
Difficulty of Test
When a norm-referenced test is too easy or too difficult for the group taking it, it tends to produce scores of low reliability. This is so because both easy and difficult tests result in a restricted spread of scores. For an easy test, the scores are close together at the top of the scale, while for a difficult test, the scores are grouped together at the bottom end of the scale. Thus, for both easy and difficult tests, the differences among individuals are small and tend to be unreliable. Therefore, a norm-referenced test of ideal difficulty is desired, to enable the scores to spread out over the full range of the scale. This implies that classroom achievement tests should be designed to measure differences among testees. This can be achieved by constructing test items with an average score of about 50 percent and with scores ranging from zero to near perfect. Constructing tests that match this level of difficulty permits the full range of possible scores to be used in measuring differences among individuals, and the bigger the spread of scores, the greater the likelihood that the measured differences will be reliable.
Objectivity
This refers to the degree to which equally competent scorers obtain the same results in scoring a test. Objective tests easily lend themselves to objectivity because they are usually constructed so that they can be accurately scored by trained individuals and by the use of machines. For tests constructed using highly objective procedures, the reliability of the test results is not affected by the scoring procedures. The teacher-made classroom test likewise calls for objectivity, which is necessary for obtaining reliable measures of achievement. This is most obvious in essay testing and various observational procedures, where the results depend to a large extent on the person doing the scoring. Sometimes even the same scorer may get different results at different times. This inconsistency in scoring has an adverse effect on the reliability of the measures obtained: the resulting test scores reflect the opinions and biases of the scorer rather than the differences among testees in the characteristic being measured.
Validity of a Test
Validity is the most important quality you have to consider when constructing or selecting a test. It refers to the meaningfulness or appropriateness of the interpretations to be made from test scores and other evaluation results. Validity is the degree to which a test measures what it is intended to measure. It is always concerned with the specific use of the results and the soundness of our proposed interpretations. Hence, to the extent that a test score is determined by factors or abilities other than those the test was designed or used to measure, its validity is impaired.
Types of Validity
Validity has three basic concerns:
Determining the extent to which performance on a test represents the level of knowledge of the subject-matter content the test was designed to measure – the content validity of the test.
Determining the extent to which performance on a test represents the amount of the attribute being measured that the examinee possesses – the construct validity of the test.
Determining the extent to which performance on a test represents an examinee’s probable performance on some criterion task – the criterion-related (concurrent and predictive) validity of the test.
Factors Influencing Validity
Many factors tend to influence the validity of test interpretations, including factors in the test itself. The following factors in the test itself can prevent the test items from functioning as intended and thereby lower the validity of the interpretations made from the test scores:
Unclear directions on how examinees should respond to the test items
Reading vocabulary and sentence structure that are too difficult
Inappropriate level of difficulty of the test items
Poorly structured test items
Ambiguity leading to misinterpretation of test items
Test items inappropriate for the outcomes being measured
A test too short to provide an adequate sample
Improper arrangement of items in the test; and
Identifiable patterns of answers that encourage guessing.
SELF ASSESSMENT EXERCISE
PART I. READ THE FOLLOWING QUESTIONS AND GIVE SHORT AND PRECISE
ANSWERS.
1. Define the reliability of a test.
2. Mention methods of estimating reliability and the type of reliability measure associated
with each of them.
3. What are the factors that influence reliability measures?
4. Define the following terms:
i. Content Validity
ii. Criterion related Validity
iii. Construct Validity
5. What are the three main concerns of validity of a test?
6. What are the factors that affect validity?
7. Explain the relationship between reliability and validity.
PART II. READ EACH OF THE FOLLOWING QUESTIONS AND CHOOSE THE MOST
APPROPRIATE ANSWER FROM THE GIVEN ALTERNATIVES.
1. A form of correlation that shows equal-magnitude increases in both variables is called ______?
A. Perfect positive relation
B. Perfect negative relation
C. No relation at all
D. Very high positive relation
2. Which of the following is not true about correlation?
A. It measures association
B. It is scaled between −1 and +1
C. It shows the relation between two variables
D. It estimates the average value of two variables
3. Which one is not true about negative correlation?
A. As one variable increases, the other will decrease
B. The coefficient lies between −1 and 0
C. Both variables show increases in the same direction
D. Each variable may increase or decrease in the opposite direction
4. What is the meaning of correlation coefficient r=0.00?
A. Perfect positive relation
B. There is no relation at all
C. There is negligible relation
D. Perfect negative relation
5. Which reliability measure administers a single exam twice to the same group?
A. Equivalence
B. Stability
C. Internal consistency
D. Split half
6. In the split-half method of estimating reliability, the correlation between the even- and odd-numbered items is 0.64. What will be the reliability of the full test?
A. 0.74 B. 0.87 C. 0.64 D. 0.78
7. Which one of the following does not influence the reliability measures?
C. The variability of score distribution
D. The nature of the subject matter
9. Which type of validity is guaranteed if the test represents the entire content of the course?
A. Content validity
B. Criterion validity
C. Construct validity
D. None of the above
10. Which one of the following does not influence the reliability measures?
11. The method of computing the degree of relationship between two sets of scores is called __________?
12. The type of correlation in which, as one variable increases, the other decreases is called __________?
D. Comprehensiveness with which it measures the construct
17. A measure has high internal consistency reliability when:
A. Multiple observers obtain the same score every time they use the measure
B. Multiple observers make the same ratings using the measure
C. Participants score at the high end of the scale every time they complete the measure
D. Each of the items correlates with other items on the measure
UNIT EIGHT
JUDGING THE QUALITY OF A CLASSROOM TEST
Unit objectives
By the end of this unit, you will be able to:
Distinguish clearly among item difficulty, item discrimination and the distraction
power of an option
Recognize the need for item analysis, its place and importance in test development
Conduct item analysis of a classroom test
Calculate the value of each item parameter for different types of items
Appraise an item based on the results of item analysis
Judging the Quality of A Classroom Test
Item Analysis
Item analysis is the process of examining each test item to ascertain whether it is
functioning properly in measuring what the entire test measures. As already mentioned, item
analysis begins after the test has been administered and scored. It involves a detailed and
systematic examination of the testees’ responses to each item to determine the difficulty level
and discriminating power of the item.
Purpose and Uses of Item Analysis
Item analysis is usually designed to help determine whether an item functions as intended with respect to discriminating between high and low achievers in a norm-referenced test, and measuring the effects of instruction in a criterion-referenced test. It is also a means of identifying items that have the desirable qualities of a measuring instrument, items that need revision for future use, and even deficiencies in the teaching-learning process.
The Process of Item Analysis for Norm Referenced Classroom Test
To illustrate the method of item analysis, consider a class of 40 learners taking a 10-item test that has been administered and scored, with upper and lower groups of 25% each. The item analysis procedure might follow these basic steps (a short code sketch of the grouping steps follows the list).
Step 1. Arrange the 40 test papers by ranking them in order from the highest to the lowest score.
Step 2. Select the 10 papers with the highest total scores (upper 25% of the 40 testees) and the 10 papers with the lowest total scores (lower 25% of the 40 testees).
Step 3. Drop the middle 20 papers (the remaining 50% of the 40 testees) because they will no
longer be needed in the analysis.
Step 4. Draw a table as shown in table 8.1 in readiness for the tallying of responses for item
analysis.
Step 5. For each of the 10 test items, tabulate the number of testees in the upper and lower
groups who got the answer right or who selected each alternative (for multiple choice
items).
Step 6. Compute the difficulty of each item (percentage of testees who got the item right).
Step 7. Compute the discriminating power of each item (difference between the number of
testees in the upper and lower groups who got the item right).
Step 8. Evaluate the effectiveness of the distracters in each item (attractiveness of the incorrect
alternatives) for multiple choice test items.
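Steps 1 to 3 can be made concrete with a short sketch. The following Python fragment is illustrative only; the testee names and scores are hypothetical, not taken from the text.

```python
# A minimal sketch of Steps 1-3: rank the papers and form the upper and
# lower 25% groups. All names and scores here are hypothetical.
scores = {
    f"testee_{i:02d}": s
    for i, s in enumerate([9, 7, 4, 8, 6, 5, 10, 3, 7, 6] * 4)
}  # 40 testees on a 10-item test

ranked = sorted(scores, key=scores.get, reverse=True)  # Step 1: highest to lowest
k = int(len(ranked) * 0.25)                            # 25% of 40 = 10 papers
upper_group = ranked[:k]   # Step 2: the 10 papers with the highest scores
lower_group = ranked[-k:]  # Step 2: the 10 papers with the lowest scores
# Step 3: the middle 20 papers are dropped from the analysis.
```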
Computing Item Discriminating Power
The discriminating power of an item refers to the degree to which it discriminates between testees with high and low achievement. It is obtained from this formula:
\[ D = \frac{H - L}{n} \]
Where:
D = Item Discrimination Power
H = Number of high scorers who got the item right
L = Number of low scorers who got the item right
n = Number of testees in the upper or lower group
Hence, for item 1 in Table 8.1, the item discriminating power D is obtained thus:
\[ D = \frac{H - L}{n} = \frac{10 - 4}{10} = \frac{6}{10} = 0.60 \]
Item discrimination values range from −1.00 to +1.00. The higher the discrimination index, the better the item differentiates between high and low achievers. The item discriminating power takes a:
Positive value when a larger proportion of those in the high-scoring group get the item right compared to those in the low-scoring group;
Negative value when more testees in the lower group than in the upper group get the item right;
Zero (0) value when an equal number of testees in both groups get the item right; and
Value of one (1.00) when all testees in the upper group get the item right and all the testees in the lower group get the item wrong.
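As a computational sketch, the fragment below applies the formula to several items at once. The counts of correct answers (H and L) are hypothetical except for item 1, which reproduces the worked example above.

```python
# Item discrimination power D = (H - L) / n, where H and L are the numbers
# of correct answers in the upper and lower groups and n is the size of
# one group. Counts for items 2-5 are hypothetical.
upper_correct = [10, 8, 6, 9, 5]  # H for items 1-5
lower_correct = [4, 7, 6, 2, 8]   # L for items 1-5
n = 10                            # testees per group

for item, (h, lo) in enumerate(zip(upper_correct, lower_correct), start=1):
    d = (h - lo) / n
    print(f"Item {item}: D = {d:+.2f}")
# Item 1 gives D = (10 - 4) / 10 = +0.60, matching the worked example.
# The negative value for item 5 flags an item answered better by low scorers.
```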
Evaluating the Effectiveness of Distracters
The distraction power of a distracter is its ability to differentiate between those who do not know
and those who know what the item is measuring. That is, a good distracter attracts more testees
from the lower group than the upper group. The distraction power or the effectiveness of each
distracter (incorrect option) for each item could be obtained using the formula:
\[ D_o = \frac{L - H}{n} \]
Where:
$D_o$ = Distraction power of the option
H = Number of high scorers who selected the option
L = Number of low scorers who selected the option
n = Total number of examinees in the upper or lower group
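The same tabulation extends naturally to the distracters. In the sketch below, the option counts for a single four-option item are hypothetical, and option B is assumed to be the correct answer.

```python
# Distraction power Do = (L - H) / n for each incorrect option of one item.
upper_choices = {"A": 1, "B": 7, "C": 2, "D": 0}  # selections in the upper group
lower_choices = {"A": 4, "B": 2, "C": 3, "D": 1}  # selections in the lower group
key, n = "B", 10  # correct option and group size (both hypothetical)

for option in upper_choices:
    if option == key:
        continue  # the correct option is judged by D, not by Do
    do = (lower_choices[option] - upper_choices[option]) / n
    print(f"Option {option}: Do = {do:+.2f}")
# A positive Do means the distracter attracts more low scorers than high
# scorers, which is what an effective distracter should do.
```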
Item Analysis and Criterion Referenced Mastery Tests
The item analysis procedures we used earlier for norm-referenced tests are not directly applicable to criterion-referenced mastery tests. In this case, indexes of item difficulty and item discriminating power are less meaningful, because criterion-referenced tests are designed to describe learners in terms of the types of learning tasks they can perform, unlike norm-referenced tests, where a reliable ranking of testees is desired.
Item Difficulty
In criterion-referenced mastery tests, the desired level of difficulty of each test item is determined by the learning outcome it is designed to measure and not, as stated earlier for norm-referenced tests, by the item’s ability to discriminate between high and low achievers. The standard formula for determining item difficulty can still be applied here, but the results are not usually used to select test items or to manipulate item difficulty. Rather, the result is used for diagnostic purposes. Also, when instruction is effective, most items will have a large difficulty index, with a large percentage of the testees passing the test.
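For reference, the difficulty index here is the one defined in Step 6 of the norm-referenced procedure: the percentage of testees who got the item right. In symbols (the letters R and T are labels adopted here for illustration, not notation from the original),

\[ P = \frac{R}{T} \times 100\% \]

where R is the number of testees who answered the item correctly and T is the total number who attempted it. After effective instruction on a mastery objective, values of P near 100 percent are expected rather than avoided.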
Item Discriminating Power
As you know, the ability of test items to discriminate between high and low achievers is not crucial to evaluating the effectiveness of criterion-referenced tests; some of the best items might have low or zero indexes of discrimination. This usually occurs when all testees answer a test item correctly at the end of the teaching-learning process, implying that both the teaching-learning process and the item are effective. Such items provide useful information concerning the testees’ mastery, unlike in the norm-referenced test, where they would be eliminated for failing to discriminate between the high and the low achievers. Therefore, the traditional indexes of discriminating power are of little value for judging the quality of criterion-referenced test items, since the purpose and emphasis of a criterion-referenced test is to describe what learners can do rather than to discriminate among them.
Analysis of Criterion Referenced Mastery Items
Ideally, a criterion-referenced mastery test is analyzed to determine the extent to which the test items measure the effects of instruction. In order to provide such evidence, the same test items are given before instruction (pretest) and after instruction (posttest), and the results of the two administrations are compared. The analysis is done by the use of an item response chart. The chart is prepared by listing the item numbers across the top, listing the testees’ names or identification numbers down the side, and recording correct (+) and incorrect (−) responses for each testee on the pretest (B) and the posttest (A). This is illustrated in Table 8.2 for an arbitrary group of 10 testees.
An index of item effectiveness for each item is obtained by using the formula for a measure of
Sensitivity to Instructional Effects (S) given by:
\[ S = \frac{R_A - R_B}{T} \]
Where:
$R_A$ = Number of testees who got the item right after the teaching-learning process.
$R_B$ = Number of testees who got the item right before the teaching-learning process.
T = Total number of testees who tried the item both times.
Usually, for a criterion-referenced mastery test, the index of sensitivity to instructional effects is interpreted as follows:
An ideal item yields a value of 1.00;
Effective items fall between 0.00 and 1.00 – the higher the positive value, the more sensitive the item to instructional effects; and
Items with zero or negative values do not reflect the intended effects of instruction.
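As a worked illustration with hypothetical counts: if 2 of the 10 testees who tried an item both times got it right on the pretest and 9 got it right on the posttest, then

\[ S = \frac{R_A - R_B}{T} = \frac{9 - 2}{10} = 0.70 \]

a high positive value, indicating that the item is quite sensitive to the effects of instruction.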
Building A Test Item File (Item Bank)
This entails the gradual collection and compilation, over time, of items that have been administered, analyzed, and selected on the basis of their effectiveness and the psychometric characteristics identified through item analysis. A file of effective items can be built and maintained easily by recording each item on an item card, adding the item analysis information, and indicating both the objective and the content area the item measures, so that the file can be organized by both content and objective categories. This makes it possible to select items in accordance with any table of specifications in the particular area covered by the file.
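One simple way of keeping such records is sketched below. The fields follow the information this section says an item card should carry; the class name, field names, and example values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ItemRecord:
    """One item-card entry in the test item file (item bank)."""
    stem: str              # the question text
    objective: str         # instructional objective the item measures
    content_area: str      # content category for the table of specifications
    difficulty: float      # difficulty index from item analysis
    discrimination: float  # item discrimination power D

# A hypothetical entry; in practice a card is added after each item analysis.
bank = [
    ItemRecord("Define the reliability of a test.", "Recall of definitions",
               "Reliability", 0.55, 0.40),
]
# Items can later be filtered by objective and content area to match
# any table of specifications.
reliability_items = [item for item in bank if item.content_area == "Reliability"]
```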
Building an item file is a gradual process that progresses over time. At first it seems to be additional work without immediate usefulness, but with time its usefulness becomes obvious, once it is possible to start using some of the items in the file and supplementing them with newly constructed ones. As the file grows into an item bank, most of the items can be selected from the bank without frequent repetition. Some of the advantages of an item bank are that:
Parallel tests can be generated from the bank, allowing learners who were ill or otherwise unavoidably absent for a test to take it later;
They are cost effective since new questions do not have to be generated at the same rate
from year to year;
The quality of items gradually improves with modification of the existing ones with time;
and
The burden of test preparation is considerably lightened when enough high quality items
have been assembled in the item bank.
SELF ASSESSMENT EXERCISE
PART I. READ THE FOLLOWING QUESTIONS AND GIVE SHORT AND PRECISE
ANSWERS.
1. Explain the meaning of item analysis.
2. List and explain the purposes of item analysis.
3. Describe the norm-referenced item analysis procedure.
4. When do teachers use the norm-referenced and the criterion-referenced item analysis procedures?
5. How can you compute item difficulty and item discrimination?
6. How do you evaluate the effectiveness of a distracter in item analysis?
7. Explain the purposes of building an item bank.
8. Explain the relationship between reliability and validity.
9. What is the difference between index of discriminating power (D) and index of
sensitivity to instructional effects (S)?
10. Do you think that item analysis could help teachers to improve their skill of classroom
test preparation? Why?