An NCME Instructional Module on

Setting Performance Standards:


Contemporary Methods
Gregory J. Cizek, University of North Carolina at Chapel Hill
Michael B. Bunch, Measurement Incorporated
Heather Koons, North Carolina Department of Public Instruction, Raleigh

This module describes some common standard-setting procedures used to derive performance levels for achievement tests in education, licensure, and certification. Upon completing the module, readers will be able to: describe what standard setting is; understand why standard setting is necessary; recognize some of the purposes of standard setting; calculate cut scores using various methods; and identify elements to be considered when evaluating standard-setting procedures. A self-test and annotated bibliography are provided at the end of the module. Teaching aids to accompany the module are available through NCME.

Keywords: cut scores, performance standards, standard setting

Author note: Gregory J. Cizek is Professor of Educational Measurement and Evaluation, School of Education, CB 3500, University of North Carolina, Chapel Hill, NC 27599-3500, cizek@[Link]. His areas of specialization are standard setting and testing policy. Michael B. Bunch is Vice-President, Measurement Incorporated, Durham, NC. His areas of specialization are large-scale assessment design and standard setting. Heather Koons is Project Manager, North Carolina Department of Public Instruction. Her areas of specialization are reading and science test development.

Fewer than 10 years have elapsed since the publication of the first Instructional Topics in Educational Measurement Series (ITEMS) module on standard setting in Educational Measurement: Issues and Practice (Cizek, 1996a). Nevertheless, since that time, a great deal of research, reconceptualization, and refinement of the methods of standard setting has transpired. In the earlier module, common standard-setting procedures, primarily applicable to selected-response format testing, were described, including the Contrasting Groups and Borderline Groups methods (Livingston & Zieky, 1982), and the Angoff (1971), Ebel (1972), and Nedelsky (1954) methods. So-called "compromise" methods by Beuk (1984) and Hofstee (1983) were also described.

While many of the aforementioned methods remain defensible routes for setting performance standards, other methods have been introduced. These contemporary methods have provided viable options for addressing evolving standard-setting controversies and challenges. For example, one goal of some new methods has been to reduce the cognitive burden placed on participants to form and consistently apply conceptualizations of a hypothetical minimally qualified examinee in making judgments about probable success on individual test items. Another goal of emerging methods has been to provide a satisfactory way to establish standards on performance tests, that is, on tests that do not consist of dichotomously scored items but contain polytomously scored samples of examinee work. As the consequences and costs of standard setting have escalated, research in the area of standard setting has attempted to derive methods that are more intuitive to participants and stakeholders and which can be implemented efficiently.

In addition to these changes, the standard-setting landscape has changed in other fundamental ways. A few examples of these profound changes are described.

Standards-Referenced Testing

Traditional ways of thinking about tests as yielding either norm- or criterion-referenced interpretations became outdated with the introduction of standards-referenced testing. Traditional standard-setting methods were developed largely for contexts in which only two categories (e.g., pass/fail) were required. The introduction of standards-referenced testing was accompanied by increased interest in defining more than two categories or performance levels. A prominent national testing program, the National Assessment of Educational Progress (NAEP), was one of the first, highly visible testing programs to express performance according to a graded series of performance levels: Basic, Proficient, and Advanced.

New Standards for Educational and Psychological Testing

In 1999, the three sponsoring entities for the Standards (the American Psychological Association, the American Educational Research Association, and the National Council on Measurement in Education) issued revised standards for sound testing practice. This edition of the Standards highlights the importance of setting performance standards.

For testing in general, the Standards note that:

    A critical step in the development and use of some tests is to establish one or more cut points dividing the score range to partition the distribution of scores into categories. ... [C]ut scores embody the rules according to which tests are used or interpreted. Thus, in some situations the validity of test interpretations may hinge on the cut scores. (p. 53)

And, in the specific case of licensure and certification tests,

    The validity of the inferences drawn from the test depends on whether the standard for passing makes a valid distinction between adequate and inadequate performance. (p. 157)

The 1999 version also includes new guidelines for standard setting. Among the guidance in the new Standards are:

    Standard 1.7: When a validation rests in part on the opinions or decisions of expert judges, observers, or raters, procedures for selecting such experts and for eliciting judgments or ratings should be fully described. The qualifications, and experience, of the judges should be presented. The description of procedures should include any training and instructions provided, should indicate whether participants reached their decisions independently, and should report the level of agreement reached. If participants interacted with one another or exchanged information, the procedures through which they may have influenced one another should be set forth. (p. 19)

    Standard 2.14: Where cut scores are specified for selection or classification, the standard errors of measurement should be reported in the vicinity of each cut score. (p. 35)

    Standard 2.15: When a test or combination of measures is used to make categorical decisions, estimates should be provided of the percentage of examinees who would be classified in the same way on two applications of the procedure, using the same or alternate forms of the instrument. (p. 35)

    Standard 4.19: When proposed interpretations involve one or more cut scores, the rationale and procedures used for establishing cut scores should be clearly documented. (p. 59)

    Standard 4.20: When feasible, cut scores defining categories with distinct substantive interpretations should be established on the basis of sound empirical data concerning the relation of test performance to relevant criteria. (p. 60)

    Standard 4.21: When cut scores defining pass-fail or proficiency categories are based on direct judgments about the adequacy of item or test performances or performance levels, the judgmental process should be designed so that judges can bring their knowledge and experience to bear in a reasonable way. (p. 60)

    Standard 6.5: When relevant for test interpretation, test documents should include item level information, cut scores, ... (p. 69)

    Standard 14.17: The level of performance required for passing a credentialing test should depend on the knowledge and skills necessary for acceptable performance in the occupation or profession and should not be adjusted to regulate the number or proportion of persons passing the test. (p. 162)

Federal Legislation

At the national level, at least two wide-ranging laws have affected the practice of standard setting. The Individuals with Disabilities Education Act (1997) requires greatly expanded participation of students with special needs in large-scale assessment programs. Among other regulations, the Act requires states to: (1) include children with disabilities in general state and district-level assessment programs; (2) develop and conduct alternate assessments for students who cannot participate in the general programs; and (3) provide public reports on the performance of special needs students with the same frequency and detail as reports on the assessment of nondisabled children. Developing new approaches to establishing performance standards for the required alternate assessments, which often comprise novel or nontraditional formats, has proven to be a significant standard-setting challenge.

A second piece of far-reaching legislation was enacted in 2001. The No Child Left Behind (NCLB) Act (2001) requires states to: (1) develop challenging content standards in reading, mathematics, and science; (2) develop and administer assessments aligned to those standards; and (3) (of particular relevance to standard setting) establish three levels of high achievement (Basic, Proficient, and Advanced) to describe varying levels of mastery of the content standards.

These three phenomena taken together (the rise of standards-referenced testing, the publication of the new Standards for Educational and Psychological Testing, and recent federal legislation) have necessitated greater attention to standard setting than perhaps ever before. Much has been demanded of the technology of standard setting. New methods have been developed to meet new contexts and challenges, and substantially greater scrutiny and awareness of standard setting by policymakers, educators, and the public have resulted. This module is an attempt to catch up on these fast-paced changes.

In the following sections, we provide an update on concepts and methods of setting performance standards. First, we describe what is meant by standard setting and we provide a rationale for the need for setting standards. Next, we list some general considerations that warrant attention in any standard-setting procedure. Then, we describe three specific methods, introduced since the publication of the earlier module, which have found fairly wide usage in achievement testing contexts. These methods are presented in how-to format which, it is hoped, will provide sufficient detail to actually enable readers to use the method to obtain cut scores in a relevant situation. The final section of the module presents guidelines for evaluating standard setting. An annotated bibliography and self-test appear at the end of this module.

Definition of Standard Setting

It might seem obvious that what is called standard setting is the process by which a standard or cut score is established. In reality, however, standard setting is not so straightforward. For example, participants in a standard-setting process rarely set standards; rather, a standard-setting panel usually makes a recommendation to a body with the actual authority to implement, adjust, or reject the standard-setting panel's recommendation (e.g., state board of education, medical board, licensing agency). It is now a widely accepted tenet of measurement theory that the work of standard-setting panels is not to search for a knowable boundary between categories that exists. Instead, standard-setting procedures enable participants to bring to bear their judgments in such a way as to translate policy decisions (often, as operationalized in performance level descriptors) into locations on a score scale; it is these translations that create the effective performance categories. This translation and creation are seldom, if ever, purely statistical, impartial, apolitical, or ideologically neutral activities.



As noted in the Standards for Educational and Psychological Testing, standard setting "embod[ies] value judgments as well as technical and empirical considerations" (AERA/APA/NCME, 1999, p. 54). From this perspective, it is clear that what psychometrics as a social science can contribute to the practice of standard setting is as much social as it is science. As Cizek (2001b) has observed: "Standard setting is perhaps the branch of psychometrics that blends more artistic, political, and cultural ingredients into the mix of its products than any other" (p. 5). Nonetheless, psychometricians have developed and continue to refine methods for negotiating these currents, and for aiding participants in bringing their judgments to bear in ways that are reproducible, informed by relevant sources of evidence, and fundamentally fair to those affected by the process.

One definition of standard setting, suggested by Cizek (1993), highlights the procedural aspect of standard setting and draws on both legal theory of due process and traditional definitions of measurement. According to Cizek, standard setting is "the proper following of a prescribed, rational system of rules or procedures resulting in the assignment of a number to differentiate between two or more states or degrees of performance" (p. 100).

Kane (1994) has provided a definition of standard setting that highlights the conceptual nature of the endeavor. According to Kane:

    It is useful to draw a distinction between the passing score, defined as a point on the score scale, and the performance standard, defined as the minimally adequate level of performance for some purpose. ... The performance standard is the conceptual version of the desired level of competence, and the passing score is the operational version. (p. 426, emphasis in original)

Finally, two additional observations are warranted. Despite Kane's (1994) attempted clarification, the term performance standard is frequently used as a synonym for the terms cut score, achievement level, or passing score. It is equally important to recognize that important decisions rest on two different kinds of standards that combine to make interpretation of test results meaningful; these are often referred to as content standards and performance standards. Content standards is the term used to refer to statements that describe specific knowledge or skills over which examinees are expected to have mastery for a given age, grade level, or field of study. Whereas content standards delineate the referent (i.e., the "what") of testing, performance standards define "how much" or "how well" examinees are expected to perform in order to be described as falling in a given category.

Need for Standard Setting

A fundamental issue in standard setting is the purpose for setting standards in the first place. From one perspective, the general need for standard setting is clear: Decisions must be made. As stated elsewhere:

    There is simply no way to escape making decisions. ... These decisions, by definition, create categories. If, for example, some students graduate from high school and others do not, a categorical decision has been made, even if a graduation test was not used. (The decisions were, presumably, made on some basis.) High school music teachers make decisions such as who should be first chair for the clarinets. College faculties make decisions to tenure (or not) their colleagues. We embrace decision making regarding who should be licensed to practice medicine. All of these kinds of decisions are unavoidable; each should be based on sound information; and the information should be combined in some deliberate, considered, defensible manner. (Cizek, 2001a, p. 21; see also Mehrens & Cizek, 2001, pp. 478-479)

Certainly, decisions can be made on information other than, or in addition to, that yielded by tests. Indeed, the Standards for Educational and Psychological Testing state that "a decision or characterization that will have major impact on a student should not be made on the basis of a single test score" (AERA/APA/NCME, 1999, p. 146). In one sense, of course, this recommendation is always heeded. For example, a single measure such as the SAT for college admissions should be used with other criteria (e.g., high school graduation, grade point average, and so on). On the other hand, the information yielded by tests routinely figures prominently into decisions such as placement in a remedial or gifted program, selection of employees, awarding of scholarships, licensure to practice in a profession, and others. This is perhaps the case because the information yielded by tests is of knowable quality, and often of higher quality than other sources of information. According to the Standards: "The proper use of tests can result in wiser decisions about individuals and programs than would be the case without their use and also can provide a route to broader and more equitable access to education and employment" (AERA/APA/NCME, 1999, p. 1). Because cut scores are the mechanism that results in category formation on tests, the importance of deriving defensible cut scores and their relevance to sound decision making are obvious. Again, according to the Standards: "Verifying the appropriateness of the cut score or scores ... is a critical element of the validity of test results" (p. 157).

Cross-Cutting Issues and General Considerations in Standard Setting

Several issues must be considered when setting performance standards regardless of which method is selected. Five such issues are described in the following paragraphs. A first consideration is the purpose of establishing standards in the first place. A common practice in all standard setting is to begin the session with an orientation for participants to the purpose of the task at hand. This orientation is a pivotal point in the process and provides the frame participants are expected to apply in the conduct of their work. Linn (1994) has suggested that standard setting can focus on one of four purposes: (1) exhortation, (2) exemplification, (3) accountability for educators, and (4) certification of student achievement. Depending on the purpose, the orientation to participants can differ substantially. For example, standard setting might involve exhortation. Using the policy rhetoric of higher standards, if the purpose were to "ratchet up expectations to world-class levels" for high school students in a state, the orientation provided to standard-setting participants might focus on describing the low level of challenge of previous content standards, the low bar set on previous state examinations, the evolving needs of the work force, and so on. An orientation like this, typically delivered by a person of relatively high status, would exhort participants to establish relatively high standards. By contrast, standard setting with an orientation of exemplification would focus more on providing concrete examples to educators of the competencies embedded in the content standards.

A second cross-cutting aspect of standard setting is the creation and use of performance level labels (PLLs). PLLs refer to the (usually) single-word terms used to identify performance categories; Basic, Proficient, and Advanced would be examples of such labels. Many such categorical labeling systems exist; a few examples are shown in Table 1. Though PLLs may have little technical underpinning, they clearly carry rhetorical value as related to the purpose of the standard setting. Such labels have the potential to convey a great deal in a succinct manner vis-a-vis the meaning of classifications that result from the application of cut scores. It is obvious from a measurement perspective that PLLs should be carefully chosen to relate to the purpose of the assessment, to the construct assessed, and to the intended, supportable inferences arising from the classifications.

A third issue, actually an extension of the concern with PLLs, is evident when performance level descriptors (PLDs) are created. PLD refers to the (usually) several sentences or paragraphs that provide fuller, more complete illustration of what performance within a particular category comprises. PLDs vary in their level of specificity, but have in common the verbal elaboration of the knowledge, skills, or attributes of test takers within a performance level. It is highly desirable for PLDs to be developed in advance of standard setting by a separate committee for approval by the appropriate policymaking body. Standard-setting participants then use these PLDs as a critical referent for their judgments. Sometimes, elaborations of the PLDs are developed by participants during a standard-setting procedure as a first step (i.e., prior to making any item or task judgments) toward operationalizing and internalizing the performance levels intended by the policy body. Sample PLDs, in this case those used for the NAEP Grade 4 reading assessment, are shown in Table 2.

There is an inherent tension in the creation of PLDs. Descriptions that provide too little specificity do not help illustrate or operationalize the performance categories. As such, they do not assist in communication to external audiences about the meaning of categorization at a given performance level. Descriptions that provide too much specificity by providing a detailed list of the knowledge and skills that a student at a given level possesses may be destined to pose validation problems. For example, suppose that very detailed descriptions are generated describing the specific knowledge and skills possessed by examinees in a category. Suppose further that actual categorical classifications will be based on examinees' total test scores. Under such a scenario, there will almost always be many instances in which a test taker demonstrates mastery of knowledge or skills outside the category to which he or she is assigned, and fails to demonstrate mastery of knowledge or skills for some elements within the performance category. This contradiction between the statement of knowledge and skills that examinees in a category are supposed to possess (as indicated in the PLDs) and the knowledge and skills that they actually possess (as indicated by observed test performance) makes validation of the PLDs problematic. Some researchers have attempted to solve this dilemma by crafting standard-setting procedures in which items are matched to performance level descriptions (see, e.g., Ferrara, Perie, & Johnson, 2002). Despite these efforts, the vexing issue of ensuring fidelity of PLDs with actual examinee performance is an area that remains one for which much additional work is needed.

Fourth, it has long been known that the participants in the standard-setting process are critical to the success of the endeavor and are a source of variability of standard-setting results. The Standards for Educational and Psychological Testing (AERA/APA/NCME, 1999) provide guidance on representation, selection, and training of participants. For example, the Standards indicate that "a sufficiently large and representative group of judges should be involved to provide reasonable assurance that results would not vary greatly if the process were replicated" (p. 54). The Standards also recommend that "the qualifications of any judges involved in standard setting and the process by which they are selected" (p. 54) should be fully described and included as part of the documentation for the standard-setting process. The Standards also address training:

    Care must be taken to assure that judges understand what they are to do. The process must be such that well-qualified judges can apply their knowledge and experience to reach meaningful and relevant judgments that accurately reflect their understandings and intentions. (p. 54)

As with the development of PLDs, there is a tension present in the selection of standard-setting participants. While it is often recommended that participants have special expertise in the area for which standards will be set, in practice this can mean that standard-setting panels consist of participants whose perspectives are not representative of all practitioners in a field, all teachers at a grade level, and so on. Such a bias might be desirable if the purpose of standard setting is exhortation, though less so if the purpose of standard setting is to certify competence of students for awarding a high school diploma.

In addition, once standard-setting participants have been selected and trained and the procedure has begun, there is the matter of providing feedback to participants. Many standard-setting approaches comprise "rounds" or iterations of judgments. At each round, participants are provided various kinds of information to summarize their own variability, correspondence with the group's ratings, or likely impact on the examinee population.

Table 1. Sample Performance Level Labels

Labels                                                        Source
Basic, Proficient, Advanced                                   National Assessment of Educational Progress
Starting Out, Progressing, Nearing Proficiency,               TerraNova, 2nd ed. (CTB/McGraw-Hill)
  Proficient, Advanced
Limited, Basic, Proficient, Accelerated, Advanced             State of Ohio Achievement Tests
Far Below Basic, Below Basic, Basic, Proficient, Advanced     State of California, California Standards Tests
Did Not Meet Standard, Met Standard,                          State of Texas, Texas Assessment of Knowledge
  Commended Performance                                         and Skills



Table 2. NAEP Performance Level Descriptors for Grade 4 Reading Tests
Performance Level Label    Performance Level Descriptor
Advanced Fourth-grade students performing at the Advanced level should be able to generalize about
topics in the reading selection and demonstrate an awareness of how authors compose and
use literary devices. When reading text appropriate to fourth grade, they should be able to
judge texts critically and, in general, give thorough answers that indicate careful thought.
For example, when reading literary text, Advanced-level students should be able to make
generalizations about the point of the story and extend its meaning by integrating personal
experiences and other readings with ideas suggested by the text. They should be able to
identify literary devices such as figurative language.
When reading informational text, Advanced-level fourth-graders should be able to explain the
author's intent by using supporting material from the text. They should be able to make critical
judgments of the form and content of the text and explain their judgments clearly.

Proficient Fourth-grade students performing at the Proficient level should be able to demonstrate an
overall understanding of the text, providing inferential as well as literal information. When
reading text appropriate to fourth grade, they should be able to extend the ideas in the text by
making inferences, drawing conclusions, and making connections to their own experiences.
The connections between the text and what the student infers should be clear.
For example, when reading literary text, Proficient-level fourth graders should be able to
summarize the story, draw conclusions about the characters or plot, and recognize relationships
such as cause and effect.
When reading informational text, Proficient-level students should be able to summarize the
information and identify the author's intent or purpose. They should be able to draw reasonable
conclusions from the text, recognize relationships such as cause and effect or similarities and
differences, and identify the meaning of the selection's key concepts.

Basic Fourth-grade students performing at the Basic level should demonstrate an understanding of
the overall meaning of what they read. When reading text appropriate for fourth graders, they
should be able to make relatively obvious connections between the text and their own
experiences, and extend the ideas in the text by making simple inferences.
For example, when reading literary text, they should be able to tell what the story is generally
about-providing details to support their understanding-and be able to connect aspects of
the stories to their own experiences.
When reading informational text, Basic-level fourth graders should be able to tell what the
selection is generally about or identify the purpose for reading it, provide details to support
their understanding, and connect ideas from the text to their background knowledge
and experiences.

A complete treatment of selecting, training, and providing feedback to participants in standard setting is beyond the scope of this module. Readers are referred to the work of Raymond and Reid (2001) for further information on the selection, training, and evaluation of standard-setting participants, and to Reckase (2001) for more information on providing feedback to participants.

Finally, a fifth common issue is the necessity for standard-setting participants to form and rely on a conceptualization related to the examinee group to whom the standard(s) will apply. The need for such conceptualizations may have origins in the Nedelsky (1954) method, in which standard setters are required to consider options that a hypothetical F/D student would recognize as incorrect. (According to Nedelsky, the F/D student is one who was on the borderline between passing and failing a course; hence, the notion of a point differentiating between a failing grade of "F" and a passing grade of "D.") Participants using an Angoff (1971) or derivative methodology form a conceptualization of the minimally competent examinee.

In contemporary standard setting, these often-hypothetical conceptualizations remain important, regardless of whether a particular method is considered to be "examinee centered" or "test centered" (Jaeger, 1989). For example, to use the Bookmark method (Mitzel, Lewis, Patz, & Green, 2001), participants must consider at what point students in a certain performance category (e.g., Basic) or on the borderline between categories will have a specified probability of responding correctly. While standard-setting participants are often selected for their subject area expertise and knowledge of examinees to whom the test will be given, the abstract notion of an examinee within or between particular categories is still required for standard setting to proceed.

Standard-Setting Methods

According to the Standards for Educational and Psychological Testing, "There can be no single method for determining cut scores for all tests or for all purposes, nor can there be any single set of procedures for establishing their defensibility" (AERA/APA/NCME, 1999, p. 53). Recent advances in standard setting have added new approaches to the inventory of available methods. The new methods described in the following sections have some common advantages: they are generally more holistic (they require standard-setting participants to make holistic judgments about items or examinee test performance); they are intended to reduce the cognitive burden on participants; and they can be applied to a wide variety of item and task formats.

Before turning to a description of three such methods, we note two purposeful omissions in the following subsections. First, we do not review methods that would be highly appropriate for situations involving a mix of item formats and multiple cut scores (e.g., contrasting groups, borderline groups), but which have been described in previous modules (see Cizek, 1996a). Second, the following descriptions of each method generally focus on the procedures used to actually obtain one or more cut scores. Of course, much more is required of a defensible standard-setting process, including identification and training of appropriately qualified participants, effective facilitation, monitoring, and feedback to participants, and well conceived data collection to support whatever validity claims are made. A generic framework of steps required for standard setting has been put forth by Hambleton (1998) and is presented here as Table 3. However, each step warrants deeper attention in its own right, and readers interested in additional details on these topics are referred to other sources (e.g., Kane, 2001; Raymond & Reid, 2001; Reckase, 2001).

Bookmark Method

The Bookmark method is one of several item-mapping procedures developed in an attempt to simplify the cognitive task of standard setters who are required to consider performance-level descriptions, maintain appropriate conceptualizations of examinees within or between performance levels, and make probability estimates. First introduced by Lewis, Mitzel, and Green in 1996, the procedure has rapidly become widely used in K-12 education settings. Among the advantages of the Bookmark method are the comparative ease with which it can be applied by standard-setting participants, the fact that it can be applied to tests comprising both selected-response (SR, e.g., multiple-choice) and constructed-response (CR) items, and the fact that it can be used to set multiple cut scores on a single test.

The Ordered Item Booklet. The Bookmark procedure is so named because standard-setting participants identify cut scores by placing markers in a specially prepared test booklet. The distinguishing characteristic of the special test booklet is that it is prepared in advance with test items ordered by difficulty: easiest items first and hardest items last. This has come to be referred to as an ordered item booklet (OIB). The preparation of an OIB may seem simple enough in concept yet, until Lewis et al. (1996) introduced the idea, it had not been incorporated into a formal standard-setting method. The idea, however, instantly transformed standard setting into a classical psychophysics experiment in which a stimulus of gradually changing strength or form is presented to subjects who are given the task of noting the point at which a just-noticeable difference (JND) occurs. In the Bookmark procedure, participants begin with the knowledge that each succeeding item will be harder than (or at least as hard as) the one before; they are charged with noting one or more JNDs in the course of several test items in the OIB.

The ordering of MC items in an OIB is rather straightforward, particularly if a one-parameter logistic (1-PL) item response model (e.g., Rasch model) was used to obtain estimates of item difficulty. Whether a 1-PL, 2-PL, or 3-PL model is used, items are simply arranged in ascending b-value (i.e., item difficulty) order. When a test contains both SR and CR items, each CR item appears several times in the booklet, once for each of its score points. For a given CR item, the item prompt, the rubric, and sample examinee responses illustrating the score point(s) are also provided to standard setters. The OIB is formatted with only one item (or CR score point) per page.

Table 3. Generic Steps in Setting Performance Standards

Step  Description
1     Select a large and representative panel.
2     Choose a standard-setting method; prepare training materials and standard-setting meeting agenda.
3     Prepare descriptions of the performance categories (i.e., PLDs).
4     Train participants to use the standard-setting method.
5     Compile item ratings or other judgments from participants and produce descriptive/summary information or other
      feedback for participants.
6     Facilitate discussion among participants of initial descriptive/summary information.
7     Provide an opportunity for participants to generate another round of ratings; compile information and
      facilitate discussion as in Steps 5 and 6.
8     Provide a final opportunity for participants to review information and arrive at final recommended
      performance standards.
9     Conduct an evaluation of the standard-setting process, including gathering participants' confidence in the
      process and resulting performance standard(s).
10    Assemble documentation of the standard-setting process and other evidence, as appropriate, bearing on the
      validity of resulting performance standards.
Source: Adapted from Hambleton (1998).
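Steps 5 through 7 in Table 3 involve compiling participants' judgments and returning summary feedback between rounds. The following is a minimal sketch of that bookkeeping under simple assumptions; the data structures, participant and item identifiers, and the particular summaries shown (item means and participant means) are illustrative, not a prescribed feedback format.

```python
# Minimal sketch of compiling one round of ratings into feedback summaries
# (cf. Steps 5-7 of Table 3). The ratings dictionary below is hypothetical.
from statistics import mean

def summarize_round(ratings):
    """ratings: {participant_id: {item_id: rating}} for one round of judgments."""
    participants = sorted(ratings)
    items = sorted(next(iter(ratings.values())))
    item_means = {i: mean(ratings[p][i] for p in participants) for i in items}
    participant_means = {p: mean(ratings[p][i] for i in items) for p in participants}
    return item_means, participant_means

round1 = {
    "P1": {"item01": 1, "item02": 0, "item03": 1},
    "P2": {"item01": 1, "item02": 1, "item03": 1},
    "P3": {"item01": 0, "item02": 0, "item03": 1},
}
item_means, participant_means = summarize_round(round1)
print(item_means)          # feedback on how the panel rated each item
print(participant_means)   # each participant's overall severity relative to the group
```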



The OIB can be composed of any collection of items that is representative of the range of content, item types, and summary statistical characteristics of a typical test form. An OIB need not consist only of items that appear in an actual test; it can have more or fewer items than an operational test booklet. However, it is important that the OIB fully represent the breadth and depth of content to which examinees will be exposed in order for standard-setting participants to understand more clearly the precise ability level needed to achieve a particular standard. Thus, it is most common for the OIB to comprise an intact test form. One advantage of using an operational form is that participants evaluate the test on which the standard will be set, as opposed to reviewing some items to which examinees may never actually be exposed.

An example of a page from an OIB is shown in Figure 1. The example is taken from a high-stakes reading test administered to high school students in a large midwestern state. Detailed PLDs, based on the state's content standards, were developed in advance and used by standard-setting participants (n = 20) to identify three cut scores separating four performance levels: Advanced, Proficient, Basic, and Below Basic.

The boldfaced number in the upper-right corner of the page is simply pagination; the item in this example appeared on page 35 of the OIB. The next information provided is the item's position in the intact test form (it was item number 22) and the item response theory (IRT) ability level required to have a .67 probability of answering the item correctly, in this case 1.725. Information preceding the item indicates that it is one of a set of items associated with a passage titled "Yellowstone." (A collection of all passages used in the test would be supplied to participants as a separate booklet for their use during standard setting.) An asterisk by option C indicates the correct response. Had this been a CR item, the prompt would have been followed by a sample response at a particular score point; in the full OIB, the prompt and an associated sample response would appear once for each of its non-zero score points, distributed throughout the OIB in order of the difficulty of obtaining the particular score point (or higher).

Probability Judgments in the Bookmark Approach. In using the Bookmark method, participants must make a probability judgment. In essence, they must concern themselves with a question such as, "Is it likely that an examinee on the borderline between categories X and Y will answer this MC item correctly (or earn this CR item point)?" Obviously, to actually implement the Bookmark method the task becomes one of defining "likely." In practice, most applications of the Bookmark method employ a 67% likelihood of the correct response (for SR items), or of obtaining at least a particular score point (for CR items). Standard-setting participants are instructed to place a marker in their OIB on the page (i.e., item) immediately after the page at which, in their opinion, the likelihood criterion applies, that is, to place their bookmarks at the first point in the booklet at which they believe examinees' probability of making the desired response drops below .67. It is important to note that this point is not the cut score in the sense that the point at which the marker is placed cannot be translated into a raw cut score by counting the number of items preceding it. Rather, as will be shown in the next section, the cut score will be determined by obtaining the scale value (often an IRT ability estimate) corresponding to a .67 probability of answering the item correctly.

Item 22                                                                   35
Ability level required for a .67 chance of answering correctly: 1.725

Passage = Yellowstone

Which of these subheadings most accurately reflects the information in paragraphs 1 and 2?

   A. Effects of the Yellowstone Fire
   B. Tourism Since the Yellowstone Fire
 * C. News Media Dramatically Reports Fire
   D. Biodiversity in Yellowstone Since the Fire

FIGURE 1. Sample page from ordered item booklet.
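The assembly of an OIB page sequence like the one illustrated in Figure 1 can be sketched in a few lines of code. This is a minimal illustration, assuming item difficulty estimates (and, for CR items, a difficulty-like location for each score point) are already available; the data structures, names, and values are hypothetical rather than taken from the operational program described above.

```python
# Minimal sketch of assembling an ordered item booklet (OIB): sort entries by
# difficulty, with each CR item contributing one entry per non-zero score point.
from dataclasses import dataclass

@dataclass
class OIBPage:
    item_id: str        # e.g., "22" for an SR item, "37.2" for a CR score point
    difficulty: float   # IRT difficulty (b) for SR items, or a location for a CR score point

def build_oib(sr_items, cr_items):
    """sr_items: list of (item_id, b); cr_items: list of (item_id, [score-point locations])."""
    pages = [OIBPage(item_id, b) for item_id, b in sr_items]
    for item_id, steps in cr_items:
        for k, step_b in enumerate(steps, start=1):
            pages.append(OIBPage(f"{item_id}.{k}", step_b))
    # Easiest first, hardest last; one entry per page.
    return sorted(pages, key=lambda p: p.difficulty)

if __name__ == "__main__":
    sr = [("19", -3.395), ("13", -2.770), ("1", -2.757)]
    cr = [("37", [-0.120, 1.099, 2.384])]          # a 3-point CR item, one entry per point
    for page_num, page in enumerate(build_oib(sr, cr), start=1):
        print(page_num, page.item_id, page.difficulty)
```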

The particular likelihood used, in this case .67, is referred to as the response probability (RP). According to Mitzel et al. (2001), an RP of .67 can be interpreted in the following way: "For a given cut score, a student with a test score at that point will have a .67 probability of answering an item also at that cut score correctly" (p. 260). However, the use of other RPs has been investigated. Huynh (2000) suggested that the RP which maximized the information function of the test would produce the optimum decision rule. For a two-parameter IRT model, Huynh found that an RP of .67 maximized this function. Wang (2003) concluded that an RP of .50 is preferable when the Rasch (i.e., 1-PL) scaling model is used. The choice of .50 in the Rasch model context has certain mathematical advantages over .67 in that the likelihood of a correct response is exactly .50 when the examinee ability is equal to the item difficulty.

Issues related to selection of the most appropriate RP remain, however. Whether standard-setting participants can use any particular RP value more effectively than another, and whether they can understand and apply the concept of RP more consistently and accurately than they can generate probability estimates using, for example, a modified-Angoff approach, remain topics for future research.

Psychometric Foundations of the Bookmark Approach. As originally described, the Bookmark method employs a three-parameter logistic (3-PL) model for SR items and a two-parameter partial-credit (2PPC) model for CR items. However, an alternative approach using a 1-PL (i.e., Rasch) model for both SR and CR items is also frequently used in practice. Both approaches are described in this section, beginning with a brief explication of the Bookmark method as originally proposed.

As indicated previously, standard-setting participants express their judgments by placing a marker in the OIB on the page after the last item that they believe an examinee who is just barely qualified for a particular classification (e.g., Proficient) has a .67 probability of answering correctly. These judgments are translated into cut scores by noting the examinee ability associated with a .67 probability of a correct response and then translating that ability into a raw score. As originally described by Mitzel et al. (2001), the probability of a correct response (P_j) for an SR item is a function of examinee ability (theta), item difficulty (b_j), item discrimination (a_j), and a threshold or chance variable (c_j) in accordance with the fundamental equation of the 3-PL model:

    P_j(theta) = c_j + (1 - c_j) / {1 + exp[-1.7 a_j (theta - b_j)]},    (1)

where exp represents the base of the natural logarithm, e (2.71828 ...), raised to the power of the expression within the brackets. However, Mitzel et al. (2001) set the threshold or chance parameter (c_j) equal to zero, reducing Equation 1 to

    P_j(theta) = 1 / {1 + exp[-1.7 a_j (theta - b_j)]},    (2)

or essentially a 2-PL model.

For dichotomously scored (i.e., SR) items, the basic standard-setting question is whether or not an examinee just barely categorized into a given performance level would have a .67 chance of answering a given SR item correctly. Thus, starting with a probability of .67 and solving Equation 2 for the ability (theta) needed to answer an item correctly, we obtain the following:

    theta = b_j + .708 / (1.7 a_j).    (3)

For CR items, the situation becomes somewhat more complicated. Mitzel et al. (2001) used the two-parameter generalized partial-credit model (Muraki, 1992). This model, shown in Equation 4, presents the probability of obtaining a given score point (c), given some ability level (theta), as a function of the difficulty of the various score points (b_i1 to b_im) and the item discrimination (a_i):

    P_ic(theta) = exp[ sum_{v=0}^{c} a_i (theta - b_iv) ] / sum_{h=0}^{m_i} exp[ sum_{v=0}^{h} a_i (theta - b_iv) ].    (4)

Mitzel et al. (2001) note that the Bookmark procedure can also be implemented under other IRT models, such as the 1-PL (Rasch) model. This particular application of the Bookmark procedure begins with a basic expression of the Rasch model for dichotomous items (cf. Wright & Stone, 1979, Equation 1.4.1):

    P(X = 1 | theta_n, delta_i) = exp(theta_n - delta_i) / [1 + exp(theta_n - delta_i)],    (5)

where

    theta_n = ability (theta estimate) of an examinee;
    delta_i = difficulty of item i; and
    exp     = the base of the natural logarithm raised to the power inside the parentheses.

Allowing the expression on the right of Equation 5 to equal .67 and solving for theta_n, we obtain the following:

    theta_n = delta_i + .708,    (6)

which is very similar to Equation 3 except for the omission of the a parameter, which is the distinguishing characteristic of the 2-PL model. Thus, the Rasch ability level required for an examinee to have a .67 probability of answering a given SR item correctly would be .708 logits greater than the difficulty of the item.

When a test comprises CR items, the derivation of the ability level necessary to obtain a given score point is somewhat more complex than for SR items. Indeed, it is necessary to calculate a system of probabilities for each CR item (i.e., a probability for each score point). To accomplish this, a partial-credit model is commonly used. According to this model, the likelihood (pi_nix) of a person (n) with a given ability (theta_n) obtaining a given score (x) on an item (i) with a specified number of steps (j) is shown in Equation 7 (taken from Wright & Masters, 1982, Equation 3.1.6):

    pi_nix = exp[ sum_{j=0}^{x} (theta_n - delta_ij) ] / sum_{k=0}^{m_i} exp[ sum_{j=0}^{k} (theta_n - delta_ij) ],    (7)

where x is the value of the score point (0, 1, 2, 3, etc.) in question, and m_i is the final step. The numerator in Equation 7 refers only to the steps completed for the score point x, while the denominator includes the sum of all m_i + 1 possible numerators.
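The ability levels implied by Equations 3, 6, and 7 are straightforward to compute. The following is a minimal sketch under the assumption that item parameters have already been estimated; the parameter values are illustrative only (the first call happens to reproduce the theta at RP = .67 reported for item 2 in Table 4).

```python
import math

# Sketch of Equations 3, 6, and 7: the ability implied by a response probability
# (RP) for SR items, and partial-credit category probabilities for CR items.

RP = 0.67
LOGIT_RP = math.log(RP / (1 - RP))   # ln(.67/.33), approximately 0.708

def theta_at_rp_2pl(b, a):
    """Equation 3: ability needed for probability RP under the 2-PL (3-PL with c = 0)."""
    return b + LOGIT_RP / (1.7 * a)

def theta_at_rp_rasch(delta):
    """Equation 6: under the Rasch model, the required ability is delta + ~0.708 logits."""
    return delta + LOGIT_RP

def partial_credit_probs(theta, step_difficulties):
    """Equation 7: P(X = 0), ..., P(X = m) for one CR item under the partial-credit model."""
    cumulative, numerators = 0.0, [1.0]          # the x = 0 term is exp(0) = 1
    for delta in step_difficulties:
        cumulative += theta - delta              # sum over the completed steps
        numerators.append(math.exp(cumulative))
    denom = sum(numerators)
    return [n / denom for n in numerators]

print(round(theta_at_rp_2pl(b=-2.203, a=0.607), 3))       # about -1.517 (cf. item 2 in Table 4)
print(round(theta_at_rp_rasch(delta=0.50), 3))            # about 1.208
print([round(p, 3) for p in partial_credit_probs(0.0, [-0.5, 0.8])])
```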


Table 4. Ordered Booklet Item Parameters and Associated Theta Values
Difficulty Discrim. Theta@ Difficulty Discrim. Theta@
Page Item (b) (a) RP = .67 Page Item (b) (a) RP = .67
1 19 -3.395 0.493 -2.550 26 32 -0.341 0.869 0.138
2 13 -2.770 0.997 -2.352 27 29.1 -0.333 0.667 0.160
3 1 -2.757 1.441 -2.468 28 11 -0.133 0.494 0.710
4 22 -2.409 0.461 -1.505 29 37.1 -0.120 0.515 0.120
5 4 -2.282 0.527 -1.492 30 10 -0.063 0.402 0.973
6 2 -2.203 0.607 -1.517 31 31.2 -0.052 0.817 0.940
7 12 -2.141 0.503 -1.313 32 16 0.107 0.316 1.425
8 3 -1.781 0.520 -0.980 33 6 0.247 0.866 0.728
9 14 -1.737 0.931 -1.290 34 36 0.312 0.421 1.301
10 31.1 -1.710 0.817 -1.240 35 24 0.396 0.489 1.248
11 23 -1.454 0.778 -0.919 36 35.1 0.469 0.586 1.060
12 21 -1.444 0.845 -0.951 37 26.2 0.558 0.563 1.280
13 7 -1.122 0.953 -0.685 38 30.2 0.806 0.600 2.220
14 20.1 -1.044 0.743 -0.830 39 17 0.931 0.724 1.506
15 28 -0.973 0.770 -0.432 40 37.2 1.099 0.515 1.920
16 30.1 -0.942 0.600 -0.420 41 18 1.390 0.572 2.118
17 34.1 -0.935 0.657 -0.270 42 29.2 1.513 0.667 2.190
18 15 -0.873 0.567 -0.138 43 26.3 1.519 0.563 3.180
19 9 -0.833 0.863 -0.350 44 34.2 1.541 0.657 2.750
20 8 -0.724 0.901 -0.262 45 27.1 2.062 0.292 2.450
21 25.1 -0.703 0.750 0.010 46 25.2 2.293 0.750 3.310
22 5 -0.500 0.595 0.200 47 37.3 2.384 0.515 4.160
23 26.1 -0.424 0.563 -0.270 48 35.2 2.479 0.586 3.900
24 20.2 -0.422 0.743 0.840 49 29.3 3.149 0.667 4.420
25 33 -0.379 0.828 0.124 50 27.2 3.174 0.292 6.440
Note: CR items have multiple entries. For example, Item 37 has three score points, shown as score point 37.1 (OIB page 29), 37.2
(OIB page 40), and 37.3 (OIB page 47).
Source: Adapted from Schagen and Bradshaw (2003).

Determining a Cut Score Using the Bookmark Method. The following example illustrates the application of the Bookmark procedure. The items shown in Table 4 are drawn from a report by Schagen and Bradshaw (2003) regarding a national reading test given to 11-year-olds in Great Britain. The test consisted of 27 SR items and 10 CR items. Of the 10 CR items, seven were worth 2 points each and three were worth 3 points each, for a total of 50 points for the entire test. Twelve participants evaluated the OIB represented in Table 4 and rendered their bookmark placements for a minimal student (Level 3). Those judgments are shown in Table 5.

The cut score is based on the mean theta at the associated response probability (theta @ RP = .67). In this instance, the mean theta value of -1.594 corresponds to a raw score of 15.25. Because fractional raw scores are not possible, the operational cut score would need to be rounded to a possible score point, such as 15 or 16, depending on the rounding rules in place, though it should be noted that a student who had earned a raw score of 15 would have an ability less than the target value of -1.594.

It should also be noted that participants selected items on the second, fifth, and sixth pages of the OIB (Items 13, 4, and 2, respectively). If none of the participants went farther than page 6 in the booklet, it might seem reasonable that the cut score for the minimal level should be no more than 6 points. However, the Bookmark procedure focuses on the student ability level associated with the 67% likelihood of answering Item 2, 4, or 13 (the ones identified by the participants as marking the boundary between minimal and the next lower level). It is on those ability levels, not the page numbers or cumulative number of items, that the cut score is set. The student who has a 67% likelihood of answering Item 2 correctly also has a slight chance of answering subsequent items correctly or obtaining scores of 2 or 3 on moderately difficult CR items. The expected score for the student at the just barely minimal level is the aggregate of expected scores on all 37 items in the test. For this particular test, based on the average of these participants' estimates, that expected raw score is somewhere between 15 and 16.

To summarize this application of the Bookmark method, 12 standard-setting participants made judgments about the location of the minimal achievement level by placing bookmarks in their OIBs. These judgments are shown in the column labeled "Item Number" in Table 5. The relationships for each item between page number and ability required to reach that level (with a 67% likelihood) are shown in Table 4. The page numbers supplied by the participants were translated into ability estimates using the data in Table 4. These ability estimates were then averaged to determine the mean ability estimate of a student just barely at the minimal level. That ability level was then converted to a raw score using standard, commercially available 3-PL model software.
Table 5. Summary of Participants' Bookmark Placements for Level 3 (Minimal)

Participant    Item Number    Page Number in OIB    Theta @ RP = .67
A              2              6                     -1.517
B              4              5                     -1.492
C              4              5                     -1.492
D              2              6                     -1.517
E              2              6                     -1.517
F              2              6                     -1.517
G              13             2                     -2.352
H              4              5                     -1.492
I              2              6                     -1.517
J              13             2                     -2.352
K              2              6                     -1.517
L              2              6                     -1.517
Mean                                                -1.594
Source: Adapted from Schagen and Bradshaw (2003).
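A minimal sketch of the computation summarized above follows. It assumes 2-PL parameters for the SR items (the 3-PL c parameter set to zero) and uses only a few illustrative values rather than the full set of Table 4 parameters; a complete conversion of the mean theta to an operational raw cut score would also require the CR items' partial-credit parameters, as the commercial software mentioned above provides.

```python
import math

# Sketch of turning bookmark placements into a cut score: average the abilities
# implied by the bookmarked items (Eq. 3), then locate that ability on the test
# characteristic curve. Item parameters below are illustrative, not the full test.

LOGIT_RP = math.log(0.67 / 0.33)     # response probability of .67

def theta_at_rp(b, a):
    return b + LOGIT_RP / (1.7 * a)

def expected_sr_score(theta, sr_items):
    """Expected number-correct score over dichotomous items under the 2-PL (Eq. 2)."""
    return sum(1.0 / (1.0 + math.exp(-1.7 * a * (theta - b))) for b, a in sr_items)

# (b, a) of the item each participant bookmarked; three participants shown here.
bookmarked = [(-2.203, 0.607), (-2.282, 0.527), (-2.770, 0.997)]
mean_theta = sum(theta_at_rp(b, a) for b, a in bookmarked) / len(bookmarked)

# Partial list of SR item parameters standing in for the whole form.
sr_items = [(-3.395, 0.493), (-2.770, 0.997), (-2.757, 1.441), (-2.409, 0.461)]
print(round(mean_theta, 3), round(expected_sr_score(mean_theta, sr_items), 2))
```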

Angoff Variations

Originally proposed by Angoff (1971) and described elsewhere (see Cizek, 1996a), the Angoff approach has produced many variations, which have adapted this most thoroughly researched and still widely used method to evolving assessment contexts and challenges. Just as the previously described Bookmark approach was developed in an attempt to reduce the complexity of the cognitive task facing standard-setting participants, so too was a derivative of the Angoff procedure that Impara and Plake (1997) refer to as the Yes/No method. The essential question that must be addressed by standard-setting participants can be answered "Yes" or "No." According to Impara and Plake, participants are directed to
    read each item [in the test] and make a judgment about whether the borderline student you have in mind will be able to answer each question correctly. If you think so, then under Rating 1 on the sheet you have in front of you, write in a Y. If you think the student will not be able to answer correctly, then write in an N. (pp. 364-365)

In essence then, the Yes/No method is highly similar to the first Angoff (1971) approach. In his oft-cited chapter on scaling, norming, and equating, Angoff described two variations of a standard-setting method. While his second suggestion came to be known as the widely used Angoff method, Angoff first suggested that standard setters simply judge whether or not a hypothetical minimally acceptable person would answer an item correctly. According to Angoff,

    a systematic procedure for deciding on the minimum raw scores for passing and honors might be developed as follows: keeping the hypothetical "minimally acceptable person" in mind, one could go through the test item by item and decide whether such a person could answer correctly each item under consideration. If a score of one is given for each item answered correctly by the hypothetical person and a score of zero is given for each item answered incorrectly by that person, the sum of the item scores will equal the raw score earned by the "minimally acceptable person." (pp. 514-515)

Implementing the Yes/No Method. The basic procedures for implementing the Yes/No method follow those for most common standard-setting approaches. To begin, qualified participants are selected and are oriented to the standard-setting task. They are often grounded in the content standards upon which the test was built; they may be required to take the test themselves; and they discuss the relevant competencies and characteristics of the target population of examinees for whom the performance levels are to be set. After discussion of the borderline examinees, participants are asked to make performance estimates for a group of examinees in an iterative process over two or more "rounds" of ratings.

Typically, in a first round of performance estimation, participants using the Yes/No method rate a set of operational items, often comprising an intact test form. At the end of Round 1, each participant would be provided with feedback on their ratings in the form of information about how their ratings compared to actual examinee performance or to other participants' ratings. A second round of yes/no judgments on each item follows as participants re-review each item in the test. If not provided to them previously, at the end of the second round of judgments, participants would receive additional information regarding how many examinees would be predicted to pass or fail based on the participants' judgments (i.e., impact data). Regardless of how many rounds of ratings occur, calculation of the final recommended passing score would be based on data obtained in the final round.

Extended Angoff Method. Although an extension of the Yes/No method to contexts with polytomously scored items or a mix of SR and CR formats has not been attempted, another variation of Angoff's (1971) basic approach has been created to address tests that include CR items. Hambleton and Plake (1995) describe what they have labeled an extended Angoff procedure. In addition to providing traditional probability estimates of borderline examinee performance for each SR item, participants also estimate the number of scale points that they believe borderline examinees will obtain on each CR task in the assessment. Cut scores for the extended Angoff approach are calculated in the same way as with traditional Angoff methods, although, as Hambleton (1998) notes, more complex weighting schemes can also be used for combining components in a mixed-format assessment.

Calculation of Yes/No and Extended Angoff Cut Scores. Table 6 presents hypothetical data for the ratings of 20 items by six participants in two rounds of ratings using the Yes/No and extended Angoff methods. The table has been prepared to illustrate calculation of the cut scores that would result from use of the Yes/No method alone for a set of dichotomously scored SR items (i.e., the first 12 items listed in the table), the extended Angoff method alone for a set of polytomously scored CR items (the last eight items in the table), or a combination of Yes/No and extended Angoff (for the full 20-item set). For this set of items the CR items were scored on a 1-4 scale.

The means for each participant and each item are also presented for each round. Using the Round 2 ratings shown in Table 6, the recommended Yes/No passing score for the SR item test would be approximately 58% of the total raw score points (.58 x 12 items), or approximately 7 out of 12 points possible. The recommended passing score on the CR item test would be 21 out of a total of 32 possible score points (2.69 x 8 items). A recommended passing score for the 20-item test comprising a mix of SR and CR items would be approximately 28 of the 44 total possible raw score points [(.58 x 12) + (2.69 x 8)]. (See Hambleton & Plake, 1995, and Talente, Haist, & Wilson, 2003, for additional information on setting standards for complex performance assessments.)

Research on the Yes/No Method. One of the appealing features of the Yes/No method is its simplicity.



Table 6. Hypothetical Data and Examples of Yes/No and Extended Angoff Standard-Setting Methods

                              Participant ID Number
Item No.    Round      1      2      3      4      5      6     Means
   1          1        1      0      0      1      0            0.50
              2        1      1      0      0      0            0.50
   2          1        0      0      0      0      0      0     0.00
              2        0      0      0      1      0      0     0.17
   3          1        0      1      1                          0.83
              2        0      1      1                          0.83
   4          1        1      1      1      1      1      1     1.00
              2        1      1      1      1      1      1     1.00
   5          1        0      0      0      0      0      0     0.00
              2        0      0      0      0      0      0     0.00
   6          1        0      0      0      0      0      0     0.00
              2        0      0      0      0      0      0     0.00
   7          1        1      1      1      1      1      1     1.00
              2        1      1      1      1      1      1     1.00
   8          1        1      1      1      1      1      1     1.00
              2        1      1      1      1      1      1     1.00
   9          1        1      1      1      1      1      1     1.00
              2        1      1      1      1      1      1     1.00
  10          1        1      0                                  0.83
              2        1      0                                  0.83
  11          1        0      0      0      0      0      0     0.00
              2        0      0      0      0      0      0     0.00
  12          1        0      0      0      0      0            0.17
              2        1      0      1      1      0            0.67
SR Means      1       .58    .50    .50    .50    .50    .58     .53
              2       .67    .58    .50    .58    .58    .58     .58
  13          1        2      3      2      2      3      1     2.17
              2        3      3      3      3      3      2     2.83
  14          1        1      2      1      2      2      1     1.50
              2        2      2      2      2      3      2     2.17
  15          1        2      2      2      2      2      2     2.00
              2        3      3      3      3      3      2     2.83
  16          1        3      3      2      2      3      2     2.50
              2        3      3      3      3      3      3     3.00
  17          1        1      1      2      1      2      1     1.33
              2        2      2      2      2      2      1     1.83
  18          1        2      3      3      2      3      2     2.50
              2        3      3      3      3      3      2     2.83
  19          1        3      2      2      2      3      2     2.33
              2        3      3      3      3      3      3     3.00
  20          1        2      3      3      2      3      2     2.50
              2        3      3      3      3      3      3     3.00
CR Means      1       2.00   2.38   2.13   1.88   2.63   1.63   2.10
              2       2.75   2.75   2.75   2.75   2.88   2.25   2.69
Note: For each item, the upper and lower rows show participants' first- and second-round ratings, respectively; the SR Means and CR Means rows show the Round 1 and Round 2 participant and overall means for the SR items (1-12) and the CR items (13-20).
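The arithmetic described above can be sketched in a few lines of code. The snippet below simply re-applies the Round 2 means reported in Table 6; it is an illustration of the calculation, not software used in an operational standard setting.

    # Minimal sketch of the Yes/No and extended Angoff cut score arithmetic,
    # using the Round 2 item means reported in Table 6.

    # Round 2 means for the 12 dichotomously scored SR items (Yes/No ratings).
    sr_round2_means = [0.50, 0.17, 0.83, 1.00, 0.00, 0.00,
                       1.00, 1.00, 1.00, 0.83, 0.00, 0.67]

    # Round 2 means for the 8 polytomously scored CR items (1-4 scale).
    cr_round2_means = [2.83, 2.17, 2.83, 3.00, 1.83, 2.83, 3.00, 3.00]

    # Yes/No cut for the SR portion: mean proportion "yes" times the number of items.
    p_yes = sum(sr_round2_means) / len(sr_round2_means)   # approximately .58
    sr_cut = p_yes * len(sr_round2_means)                  # approximately 7 of 12 points

    # Extended Angoff cut for the CR portion: sum of the expected CR scale points.
    cr_cut = sum(cr_round2_means)                          # approximately 21 of 32 points

    # Mixed-format cut: simple sum of the two components (Hambleton, 1998,
    # notes that more complex weighting schemes are possible).
    total_cut = sr_cut + cr_cut                            # approximately 28 of 44 points

    print(round(sr_cut), round(cr_cut), round(total_cut))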

Research on the Yes/No Method. One of the appealing features of the Yes/No method is its simplicity. In typical implementations of modified Angoff procedures, participants must maintain a concept of a group of hypothetical examinees and must estimate the proportion of that group which will answer an item correctly. Clearly, this is an important, though difficult, task. Impara and Plake (1998) found that the Yes/No method ameliorated some of the difficulty of the task. They reported that:

We believe that the yes/no method shows substantial promise. Not only do panelists find this method clearer and easier to use than the more traditional Angoff probability estimation procedures, its results show less sensitivity to performance data and lower within-panelist variability. Further, panelists report that the conceptualization of a typical borderline examinee is easier for them than the task of imagining a group of hypothetical target candidates. Therefore, the performance standard derived from the yes/no method may be more valid than that derived from the traditional Angoff method. (p. 336)

As Impara and Plake (1998) have demonstrated, even teachers who were familiar with an assessment and with the examinees taking the assessment were not highly accurate when asked to predict the proportion of a group of borderline students who would answer an item correctly. The Yes/No method simplifies the judgment task by reducing the probability estimation required to a dichotomous outcome.⁵

There are two alternative ways in which the Yes/No method can be applied. One variation requires participants to form the traditional conceptualization of a hypothetical borderline examinee; the other requires participants to reference their judgments with respect to an actual examinee on the borderline between classifications (e.g., between Basic and Proficient). In a comparative trial of the Yes/No method with a modified Angoff approach, Impara and Plake (1997) asked participants using the Yes/No method to think of one actual borderline examinee with whom the participant was familiar instead of conceptualizing a group of hypothetical examinees. Keeping this actual person in mind, participants were then asked to determine whether the examinee would answer each item correctly. The results showed that although the final standard was similar for participants using the Angoff method and the Yes/No method, the variance of the ratings with the Yes/No method was smaller and the participants' scores were more stable from Round 1 to Round 2. Participants reported that thinking of an actual examinee when rating the items was easier than thinking of a group of hypothetical examinees.

The relative cognitive simplicity of the Yes/No method identified by Impara and Plake was also reported by Chinn and Hertz (2002). They report that participants found the yes/no decisions easy to make because "they were forced to decide between a yes or a no rather than estimate performance from a range of estimates," whereas participants using a modified Angoff method "commented that determining the proportion of candidates who would answer each item correctly was difficult and subjective" (p. 7). However, in contrast to the attractive stability of the participants' ratings observed by Impara and Plake (1998), Chinn and Hertz found that there was greater variance in ratings using the Yes/No method. They hypothesize that this may be due to design limitations and several departures from the methodology used by Impara and Plake, including their selection of participants, instructions, and level of discussion about the process.

To date the Yes/No method has only been applied in contexts where the outcome is dichotomous (i.e., with multiple-choice or other SR-format items which will be scored as correct or incorrect).

Holistic Methods

Increasingly, large-scale assessments have incorporated a mix of item formats in order to tap more fully the constructs that are measured by those tests and to avoid one common validity threat known as construct underrepresentation. While tests comprising SR-format items exclusively may have been more common in the past, newer tests often comprise short-response items, essays, show-your-work, written reflections, grid-in response format, and other test construction features for which standard-setting methods designed for SR tests are not amenable.

Assessment specialists have responded by proposing a variety of methods for setting performance standards on tests comprising exclusively CR items (e.g., a writing test) or a mix of SR and CR formats (e.g., a mathematics test). Several of these methods can be termed "holistic," in that they require participants to focus judgment on a sample or collection of examinee work greater than a single item or task at a time. Though a number of methods satisfy this characteristic, we are aware, too, that differences between these methods can defy common classification. With that caveat, we note several examples of more holistic methods, then we provide greater detail on a single implementation of one such procedure.

Examples of Some Holistic Methods. One such method that would be considered more holistic has been proposed by Plake and Hambleton (2001) (although the developers described their method as "analytic judgment"). The method was developed for tests that include polytomously scored performance tasks and other formats, resulting in a total test comprising different components. To implement the method, panelists review a carefully selected set of materials for each component, representing the range of actual examinee performance on each of the questions comprising the assessment (although examinees' scores are not revealed to the panelists). Panelists then classify the work samples according to whatever performance levels are required (e.g., Basic, Proficient, and Advanced). Plake and Hambleton used even narrower categories within these performance levels, which they called low, middle, and high (e.g., low-Basic, middle-Basic, high-Basic). Although Plake and Hambleton suggested alternative methods for calculating the eventual cut scores, a simple averaging approach appeared to work as well as the others. The averaging approach consisted of taking all papers classified by participants into what were called borderline categories. For example, the cut score distinguishing Basic from Proficient was obtained by averaging the scores of papers classified into the high-Basic and low-Proficient borderline categories.

Loomis and Bourque (2001) have described a similar approach to that of Plake and Hambleton (2001) in what they call a paper selection method. They also describe another similar approach, which they term the booklet classification method.



The booklet classification method differs in essence from the component-based methods in that it requires participants to engage in the sorting/classification task at the level of an entire test booklet. What can be termed "holistic" methods have also been proposed by Jaeger (1995) in a judgmental policy capturing approach and by Putnam, Pence, and Jaeger (1995) in the dominant profile method. For additional information on any of these methods, readers should consult the corresponding original sources listed. In the following paragraphs, we provide detail on one holistic method as an example of the characteristics of such an approach.

The Body of Work Method. One fairly well known holistic method is the Body of Work (BoW) method, proposed by Kingston, Kahl, Sweeney, and Bay (2001). The BoW method differs somewhat from other holistic methods in its calculation of cut scores. Rather than taking simple means of borderline groups (which may be skewed if not moderated to account for different numbers of examinees in the two groups), Kingston et al. (2001) employ a logistic regression to derive cut scores. As with many standard-setting methods, a number of variations of Kingston et al.'s basic suggestion have been implemented, and comprise what we refer to generally as a holistic work sample method. Information in the following paragraphs is relevant for obtaining cut scores using this genre of standard-setting methods, regardless of the label applied.

Holistic approaches typically present large numbers of intact student work samples to participants. Typically, these student work samples have been scored prior to standard setting, but the individual scores are not provided to participants during the judgment process. Instead, participants rate each work sample holistically and classify it into one of the required categories (e.g., Below Basic, Basic, Proficient, or Advanced). In preparation for the standard-setting meeting, as many as 1,000 scored student work samples may be reviewed by standard-setting facilitators; from that number, 40 to 50 samples to represent the range of total scores may be selected.

Consider, for example, a language arts test consisting of two essays, a revise-and-edit task, and two reading passages with both SR and CR items, with a total score of 50 points. Selecting 40 student work samples would entail some decisions about which score points to leave in and which to leave out, since 40 work samples can represent, at most, 40 different score points. These 40 or so work samples are presented to participants who sort them into the categories such as the four performance levels named previously. Participants may then discuss their decisions in small groups or in a large group, and may modify some of their decisions before submitting their judgments to the facilitators. Where these within-round discussions occur, the rounds are sometimes subdivided into Round 1.1, 1.2, 2.1, 2.2, and so on.

Following Round 1 and data analysis, preliminary cut scores are identified. In this case, there would be three cut scores, one to separate Below Basic from Basic, one to separate Basic from Proficient, and one to separate Proficient from Advanced. At this point, it will become evident that some score points are beyond consideration as possible cut scores. In the current example, if no participant identified a work sample with a total score below 17 as belonging to the Basic category, then work samples with scores of 16 and below would be eliminated from further consideration, and additional work samples would be brought into the mix in Round 2 to augment the likely regions of the cut scores. For this reason, Round 1 is sometimes referred to as range finding, and Round 2 is referred to as pinpointing (see Kingston et al., 2001, pp. 226-230).

In Round 2, participants may reexamine some of the Round 1 work samples plus additional work samples that fill in any gaps in the ranges of the preliminary cut scores, or they may review all new work samples, selected on the basis of Round 1 results. Similarly, by Round 3, the range of scores represented in student work samples may be further curtailed.

To illustrate a holistic standard-setting approach, consider the 50-point language arts test described above. Twenty participants have rated 40 student work samples with scores ranging from 13 to 50. Participants do not know the scores of any of the work samples. The facilitators have purposely eliminated work samples with scores below 13, based on preliminary research. During Round 1, the 20 participants entered a total of 360 ratings, an average of 18 ratings per participant, though the rate varies considerably. Similarly, some work samples have been rated more times than others. Figure 2 shows the results of Round 1.

Each category (Below Basic, Basic, Proficient, Advanced) is represented by a score distribution. These distributions overlap to a considerable degree. Indeed, not only do some ratings for Basic overlap Proficient but also Advanced. This degree of overlap is not uncommon in Round 1 of a holistic rating procedure, and it frequently occurs in later rounds in holistic rating with certain kinds of assessments (e.g., those for alternate assessments for students with special needs).

There are three vertical lines in Figure 2, each representing a likely cut score: C1, C2, and C3. C1, for example, is placed where the Below Basic distribution crosses the Basic distribution. In a BoW application, this point would also correspond to the value yielded by logistic regression, which searches for the point at which the likelihood of being classified as Basic reaches 50%. This is at about 20 raw score points. Below 20 points, the work sample is more likely to be classified as Below Basic. At or above 20 points, the work sample is more likely to be classified as Basic. A similar shift occurs at about 29 points (Basic to Proficient) and again at about 39 points (Proficient to Advanced).

The computed cut scores will depend on the analytical method that accompanies the particular holistic method used. As noted above, the BoW method uses logistic regression to determine the point at which the likelihood of a particular classification reaches or first exceeds 50%. The analytic judgment method (Plake & Hambleton, 2001) would have subdivided the groups into high-Basic, low-Proficient, and so on, determined the mean scores for each of these borderline groups, and then produced a cut score equal to the midpoint between two adjacent borderline group means. Similarly, one might simply calculate the mean (or median) for each category and then calculate the midpoint between two adjacent category means to derive a cut score.
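The two calculation approaches just described can be sketched briefly. The snippet below is a minimal illustration, not the procedure used by Kingston et al. (2001) or Plake and Hambleton (2001): the ratings, group memberships, and score values are invented (they are not the data underlying Figure 2), and scikit-learn's off-the-shelf logistic regression, which applies mild regularization by default, stands in for whatever estimation routine an operational study would use.

    # Minimal sketch of a BoW-style logistic regression cut and a
    # midpoint-of-borderline-means cut, with invented data.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Each record: (raw score of a rated work sample, 1 if a participant
    # classified the sample as Basic or higher, 0 if Below Basic).
    scores = np.array([14, 15, 16, 17, 18, 19, 20, 20, 21, 22, 23, 24, 25, 26])
    at_or_above_basic = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1])

    # BoW-style approach: regress the classification on raw score; the cut is
    # the score at which the predicted probability reaches .50.
    model = LogisticRegression().fit(scores.reshape(-1, 1), at_or_above_basic)
    bow_cut = -model.intercept_[0] / model.coef_[0][0]
    print(f"Logistic-regression Below Basic/Basic cut: {bow_cut:.1f}")

    # Analytic-judgment-style alternative: midpoint between the mean scores of
    # two adjacent (hypothetical) borderline groups.
    high_below_basic = [16, 17, 18]
    low_basic = [20, 21, 22]
    aj_cut = (np.mean(high_below_basic) + np.mean(low_basic)) / 2
    print(f"Midpoint-of-borderline-means cut: {aj_cut:.1f}")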

[Figure 2 is a line graph showing, for each performance level (Below Basic, Basic, Proficient, and Advanced), the frequency distribution of participants' Round 1 ratings across raw scores from 13 to 50, with three vertical lines (C1, C2, and C3) marking the likely cut scores.]

FIGURE 2. Results of holistic rating Round 1.

When using holistic approaches, decisions about whether and when to share student work sample scores, overall distributions of scores (i.e., impact data), item difficulty, and other data are made prior to the standard-setting activity. Typically, item and score data are shared after Round 1, and impact data are shared after Round 2. However, in some cases, impact data are also shared after Round 1.

Evaluating the Standard-Setting Process

Although not strictly a method itself, it is important that any standard-setting process gather evidence bearing on the manner in which any particular approach was implemented and the extent to which participants in the process were able to understand, apply, and have confidence in the eventual performance standards (Cizek, 1996b). Thus, evaluation of the standard-setting process can be thought of as an aspect of each method described previously in this module. Equal attention must be devoted to planning the standard-setting evaluation, a priori, as is given to carrying out the standard-setting procedure itself.

The evaluation of standard setting is a multifaceted endeavor. It can be thought of as beginning with a critical appraisal of the degree of alignment between the standard-setting method selected and the purpose and design of the test, the goals of the standard-setting agency, and the characteristics of the standard setters. This match should be evaluated by an independent body (such as a technical advisory committee) acting on behalf of the standard-setting agency. Evaluation continues with a close examination of the application of the standard-setting procedure: To what extent did it adhere faithfully to the published principles of the procedure? Did it deviate in unexpected, undocumented ways? If there are deviations, are they reasonable adaptations, specified and approved in advance, and consistent with the overall goals of the activity? A measure of the degree to which individual standard-setting participants converge from one round to the next is yet another part of the evaluation.

These aforementioned evaluations are external in nature. However, on-site evaluations of the process of standard setting, by the participants themselves, serve as an important internal check on the validity and success of the process. Typically, two evaluations are conducted during the course of a standard-setting meeting. A first evaluation normally occurs after initial orientation of participants to the process, training in the method, and (when appropriate) administration to participants of an actual test form. This first evaluation serves as a check on the extent to which participants have been adequately trained, understand key conceptualizations and the task before them, and have confidence that they will be able to apply the selected method. A second evaluation is ordinarily conducted at the conclusion of the standard-setting meeting. Commonly, both evaluations consist of a series of survey questions. A sample end-of-meeting survey is shown in Figure 3.



Directions: Please check "Agree" or "Disagree" for each of the following statements and add any additional feedback on the process at the bottom of this page.

Statement                                                          Agree   Disagree

 1. The orientation provided me with a clear understanding of the purpose of the meeting.
 2. The workshop leaders clearly explained the task.
 3. The training and practice exercises helped me understand how to perform the task.
 4. Taking the test helped me to understand the assessment.
 5. The performance level descriptions were clear and useful.
 6. The large and small group discussions aided my understanding of the process.
 7. The time provided for discussions was adequate.
 8. There was an equal opportunity for everyone in my group to contribute his/her ideas and opinions.
 9. I was able to follow the instructions and complete the rating sheets accurately.
10. The discussions after the first round of ratings were helpful to me.
11. The discussions after the second round of ratings were helpful to me.
12. The information showing the distribution of student scores was helpful to me.
13. I am confident about the defensibility and appropriateness of the final recommended cut scores.
14. The facilities and food service helped create a productive and efficient working environment.
15. Comments: _______________________________________________

FIGURE 3. Sample evaluation form for standard-setting participants.

It should be noted that the format of the items in the survey shown in Figure 3 requires only an "Agree" or "Disagree" check mark from respondents. Because standard-setting meetings can be long and arduous activities, it is considered desirable to conduct the final evaluation in such a way as to make the task relatively easy for participants to complete and to lessen the proportion of nonresponse. Consequently, open-ended survey items requiring lengthy responses are generally avoided. One simple modification of the evaluation form shown in Figure 3 would be to replace the Agree/Disagree options with a Likert-type scale that gives participants greater response options (e.g., 1 = Strongly Disagree to 4 = Strongly Agree). Such a modification would permit finer grained reporting of participants' perceptions, calculations of means and standard deviations for each question on the survey, and so on.

These activities all focus on an evaluation of the process. What of the product(s) of standard setting? Commonly employed criteria here include reasonableness and replicability.
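If the Likert-type modification is adopted, the finer grained summaries mentioned above are straightforward to produce. The sketch below is illustrative only; the responses and question labels are invented.

    # Per-question means and standard deviations across participants
    # (1 = Strongly Disagree ... 4 = Strongly Agree; data are hypothetical).
    from statistics import mean, stdev

    responses = {
        "Q1 orientation was clear":          [4, 4, 3, 4, 4, 3, 4],
        "Q13 confident in final cut scores": [3, 4, 2, 4, 3, 3, 4],
    }

    for question, ratings in responses.items():
        print(f"{question}: mean = {mean(ratings):.2f}, SD = {stdev(ratings):.2f}")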

Table 7. Criteria for Evaluating Standard-Setting Procedures
Evaluation Criterion Description
Procedural
Explicitness The degree to which the standard-setting purposes and processes were clearly and
explicitly articulated a priori
Practicability The ease of implementation of the procedures and data analysis; the degree to
which procedures are credible and interpretable to relevant audiences
Implementation The degree to which the following procedures were reasonable, and systematically
and rigorously conducted: selection and training of participants, definition of the
performance standard, and data collection
Feedback The extent to which participants have confidence in the process and in resulting
cut score(s)
Documentation The extent to which features of the study are reviewed and documented for evalu-
ation and communication purposes
Internal
Consistency within method The precision of the estimate of the cut score(s)
Intrapanelist consistency The degree to which a participant is able to provide ratings that are consistent with
the empirical item difficulties, and the degree to which ratings change across
rounds
Interpanelist consistency The consistency of item ratings and cut scores across participants
Decision consistency The extent to which repeated application of the identified cut score(s) would yield
consistent classifications of examinees
Other measures The consistency of cut scores across item types, content areas, and cognitive
processes
External
Comparisons to other The consistency of cut scores across replications using other standard-setting
standard-setting methods methods
Comparisons to other The relationship between decisions made using the test to other relevant criteria
sources of information (e.g., grades, performance on tests measuring similar constructs, etc.)
Reasonableness of The extent to which cut score recommendations are feasible or realistic (including
cut scores pass/fail rates and differential impact on relevant subgroups)
Source: Adapted from Pitoniak (2003).
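As a small illustration of the internal criteria in Table 7, the sketch below computes the precision of a cut score estimate (its standard error across panelists) and a crude index of interpanelist spread; the panelist-level cut scores are hypothetical and the indices shown are only two of many possible choices.

    # Hypothetical panelist-level cut score recommendations.
    from statistics import mean, stdev

    panelist_cuts = [27, 29, 28, 31, 26, 28, 30, 27]

    se_cut = stdev(panelist_cuts) / len(panelist_cuts) ** 0.5   # consistency within method
    spread = max(panelist_cuts) - min(panelist_cuts)            # crude interpanelist range

    print(f"Mean cut = {mean(panelist_cuts):.1f}, SE = {se_cut:.2f}, range = {spread}")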

A first potential aspect of product evaluation is the usefulness of the PLLs and PLDs. For a given subject and grade level, they should accurately reflect the content standards or credentialing objectives and be reasonably consistent with statements developed by others with similar goals.

Reasonableness can be assessed by the degree to which cut scores derived from the standard-setting process being evaluated classify examinees into groups in a manner consistent with other information about the examinees. For example, suppose it could be assumed that a state's eighth-grade reading test and the NAEP were based on common content standards (or similar content standards that had roughly equal instructional emphasis). In such a case, a standard-setting procedure for the state test resulting in 72% of the state's eighth graders being classified as Proficient, while NAEP results for the same grade showed that only 39% were Proficient, would cause concern that one or the other set of standards was inappropriate. Local information can also provide criteria by which to judge reasonableness. Do students who typically do well in class and on assignments mostly meet the top standard set for the test, while students who struggle fall into the lower categories? In the end, regardless of how reasonable a set of performance standards seems to assessment professionals or those who participated in the actual standard-setting activity, those standards will need to be locally reproducible, at least in an informal sense, in order to be widely accepted and recognized.

Replicability is another possible avenue for evaluating standard setting. For example, in some contexts where great resources are available, it is possible to conduct independent applications of a standard-setting process to assess the degree to which independent replications yield similar results. Evaluation might also involve comparisons between results obtained using one method and an independent application of one or more different methods. Interpretation of the results of these comparisons, however, is far from clear. For example, Jaeger (1989) has noted that different methods will yield different results, and there is no way to determine that one method or the other produced the wrong results. Zieky (2001) noted that there is still no consensus as to which standard-setting method is most defensible in a given situation. Again, differences in results from two different procedures would not be an indication that one was right and the other wrong; even if two methods did produce the same or similar cut scores, we could only be sure of precision, not accuracy.

The aspects of standard-setting evaluation listed here do not cover all of the critical elements of standard setting that can yield evidence about the soundness of a particular application. The preceding paragraphs have only attempted to highlight the depth and complexity of that important task. Table 7 provides a more inclusive list and description of evaluation criteria that can be used as sources of evidence bearing on the quality of the standard-setting process.

Conclusion

Setting performance standards has been called "the most controversial problem in educational assessment today" (Hambleton, 1998, p. 103).



As long as important decisions must be made, and as long as test performance plays a part in those decisions, it is likely that controversy will remain. At least to some degree, however, any controversy can be minimized by crafting well conceived methods for setting performance standards, implementing those methods faithfully, and gathering sound evidence regarding the validity of the process and the result.

Notes

1. Some sources refer to participants in standard setting procedures as judges.

2. It should be noted that this definition addresses only one aspect of the legal theory known as due process. According to the legal theory, governmental actions concerning a person's life, liberty, or property must involve due process, that is, a systematic, open process, stated in advance, and applied uniformly. The theory further divides the concept of due process into procedural due process and substantive due process. Whereas procedural due process provides guidance regarding what elements of a procedure are necessary, substantive due process characterizes the result of the procedure. The notion of substantive due process demands that the procedure lead to a decision that is fundamentally fair. Whereas Cizek's definition clearly sets forth a procedural conception of standard setting, it fails to address the result of standard setting. This aspect of fundamental fairness is similar to what has been called the "consequential basis of test use" (Messick, 1989, p. 84).

3. Though describing all procedures is beyond the scope of this module, it should be noted that participants were prepared and facilitation of this standard setting followed standard practice as regards advance materials provided to participants, orientation and training, monitoring of the process, and so on.

4. We note one difference between the preceding formulation and that presented in Wright and Stone (1979). While Wright and Stone use beta to represent examinee ability, we have used theta here and in the rest of this discussion for the sake of consistency with Equations 1-4.

5. As an anonymous reviewer of this manuscript pointed out, the simplicity of judgment comes at a cost, which is the potential for either positive or negative bias depending on the characteristics of the test. The potential for bias arises because the method is based on an implicit judgment of whether the probability of correct response at the cut score is greater than .5. To illustrate, suppose that a test were composed of identical items that all had a probability of correct response at the cut score of .7. A participant should judge that the borderline examinee will answer all items correctly, and the resulting performance standard would be a perfect score.

References

American Educational Research Association, American Psychological Association, National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Psychological Association.

Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (pp. 508-600). Washington, DC: American Council on Education.

Beuk, C. H. (1984). A method for reaching a compromise between absolute and relative standards in examinations. Journal of Educational Measurement, 21, 147-152.

Chinn, R. N., & Hertz, N. R. (2002). Alternative approaches to standard-setting for licensing and certification examinations. Applied Measurement in Education, 15, 1-14.

Cizek, G. J. (1993). Reconsidering standards and criteria. Journal of Educational Measurement, 30(2), 93-106.

Cizek, G. J. (1996a). Setting passing scores. Educational Measurement: Issues and Practice, 15(2), 20-31.

Cizek, G. J. (1996b). Standard setting guidelines. Educational Measurement: Issues and Practice, 15(1), 12, 13-21.

Cizek, G. J. (2001a). More unintended consequences of high-stakes testing. Educational Measurement: Issues and Practice, 20(4), 19-27.

Cizek, G. J. (2001b). Conjectures on the rise and call of standard setting: An introduction to context and practice. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 3-17). Mahwah, NJ: Erlbaum.

Ebel, R. L. (1972). Essentials of educational measurement. Englewood Cliffs, NJ: Prentice-Hall.

Ferrara, S., Perie, M., & Johnson, E. (2002, April). Setting performance standards: The item descriptor (ID) matching procedure. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.

Hambleton, R. K. (1998). Setting performance standards on achievement tests: Meeting the requirements of Title I. In L. N. Hansche (Ed.), Handbook for the development of performance standards (pp. 87-114). Washington, DC: Council of Chief State School Officers.

Hambleton, R. K., & Plake, B. S. (1995). Using an extended Angoff procedure to set standards on complex performance assessments. Applied Measurement in Education, 8, 41-56.

Hofstee, W. K. B. (1983). The case for compromise in educational selection and grading. In S. B. Anderson & J. S. Helmick (Eds.), On educational testing (pp. 109-127). San Francisco, CA: Jossey-Bass.

Huynh, H. (2000, April). On item mappings and statistical rules for selecting binary items for criterion-referenced interpretation and Bookmark standard setting. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.

Impara, J. C., & Plake, B. S. (1997). Standard setting: An alternative approach. Journal of Educational Measurement, 34, 353-366.

Impara, J. C., & Plake, B. S. (1998). Teachers' ability to estimate item difficulty: A test of the assumptions in the Angoff standard setting method. Journal of Educational Measurement, 35(1), 69-81.

Individuals with Disabilities Education Act. (1997). Public Law 105-17 (20 U.S.C. 1412a, 15-17).

Jaeger, R. M. (1989). Certification of student competence. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 485-514). New York: Macmillan.

Jaeger, R. M. (1995). Setting performance standards through two-stage judgmental policy capturing. Applied Measurement in Education, 8, 15-40.

Kane, M. (1994). Validating the performance standards associated with passing scores. Review of Educational Research, 64(3), 425-461.

Kane, M. (2001). So much remains the same: Conception and status of validation in setting standards. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 53-88). Mahwah, NJ: Erlbaum.

Kingston, N. M., Kahl, S. R., Sweeney, K., & Bay, L. (2001). Setting performance standards using the body of work method. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 219-248). Mahwah, NJ: Erlbaum.

Lewis, D. M., Mitzel, H. C., & Green, D. R. (1996, June). Standard setting: A bookmark approach. In D. R. Green (Chair), IRT-based standard setting procedures utilizing behavioral anchoring. Symposium conducted at the Council of Chief State School Officers National Conference on Large-Scale Assessment, Phoenix, AZ.

Linn, R. L. (1994, October). The likely impact of performance standards as a function of uses: From rhetoric to sanctions. Paper presented at the Joint Conference on Standard Setting for Large-Scale Assessments, Washington, DC.

Livingston, S. A., & Zieky, M. J. (1982). Passing scores. Princeton, NJ: Educational Testing Service.

Loomis, S. C., & Bourque, M. L. (2001). From tradition to innovation: Standard setting on the National Assessment of Educational Progress. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 175-218). Mahwah, NJ: Erlbaum.

Mehrens, W. A., & Cizek, G. J. (2001). Standard setting and the public good: Benefits accrued and anticipated. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 477-485). Mahwah, NJ: Erlbaum.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-104). New York: Macmillan.

Mitzel, H. C., Lewis, D. M., Patz, R. J., & Green, D. R. (2001). The Bookmark procedure: Psychological perspectives. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 249-281). Mahwah, NJ: Erlbaum.

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159-176.

Nedelsky, L. (1954). Absolute grading standards for objective tests. Educational and Psychological Measurement, 14, 3-19.

No Child Left Behind Act. (2001). Public Law 107-110 (20 U.S.C. 6311).

Pitoniak, M. J. (2003). Standard setting methods for complex licensure examinations. Unpublished doctoral dissertation, University of Massachusetts, Amherst.

Plake, B. S., & Hambleton, R. K. (2001). The analytic judgment method for setting standards on complex performance assessments. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 283-312). Mahwah, NJ: Erlbaum.

Putnam, S. E., Pence, P., & Jaeger, R. M. (1995). A multi-stage dominant profile method for setting standards on complex performance assessments. Applied Measurement in Education, 8, 57-83.

Raymond, M. R., & Reid, J. B. (2001). Who made thee a judge? Selecting and training participants for standard setting. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 119-157). Mahwah, NJ: Erlbaum.

Reckase, M. D. (2001). Innovative methods for helping standard-setting participants to perform their task: The role of feedback regarding consistency, accuracy, and impact. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 159-174). Mahwah, NJ: Erlbaum.

Schagen, I., & Bradshaw, J. (2003, September). Modeling item difficulty for Bookmark standard setting. Paper presented at the annual meeting of the British Educational Research Association, Edinburgh.

Talente, G., Haist, S., & Wilson, J. (2003). A model for setting performance standards for standardized patient examinations. Evaluation and the Health Professions, 26(4), 427-446.

Wang, N. (2003). Use of the Rasch IRT model in standard setting: An item mapping method. Journal of Educational Measurement, 40, 231-253.

Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA.

Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA.

Zieky, M. J. (2001). So much has changed: How the setting of cut scores has evolved since the 1980s. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 19-52). Mahwah, NJ: Erlbaum.

Annotated Bibliography

Cizek, G. J. (1996). Setting passing scores. Educational Measurement: Issues and Practice, 15(2), 20-31.

This ITEMS module is the precursor to the current module. It describes standard setting for achievement measures, with a focus on methods applied in the context of selected-response item formats. In addition to description of specific methods, it provides background and context for standard setting and describes issues surrounding standard setting.

Cizek, G. J. (Ed.). (2001). Setting performance standards: Concepts, methods, and perspectives. Mahwah, NJ: Erlbaum.

This fairly recent volume contains chapters written by some of the most authoritative and experienced persons working in the field of standard setting. The book covers all aspects of standard setting, including theoretical foundations, methodologies, and current perspectives on legal issues, validation, social significance, and applications for special populations and computer-adaptive testing.

Hansche, L. N. (Ed.). (1998). Handbook for the development of performance standards. Washington, DC: Council of Chief State School Officers.

This handbook focuses on methods for developing performance standards in the aligned system of standards and assessments required by IASA/Title I. Sections 1 and 2 provide definitions of performance standards in the context of an aligned educational system, advice for those developing systems of performance standards, and information about experiences of several states regarding standards-based assessment systems. Section 3 contains reports about research on developing performance standards and setting cut scores on complex performance assessments.

Jaeger, R. M. (1989). Certification of student competence. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 485-514). New York: Macmillan.

This chapter appears in the foundational reference text for the field of educational measurement. One of 18 chapters, this chapter provides an overview of standard-setting methods, issues, and concerns for the future. A revision of this chapter, focusing exclusively on standard setting and to be written by R. K. Hambleton and M. J. Pitoniak, will be included in the forthcoming 4th edition of Educational Measurement.

Self-Test

Multiple-Choice Items

1. The Standards for Educational and Psychological Testing (1999) require all of the following related to standard setting except:
A. estimates of classification/decision consistency.
B. description of the qualifications and experience of participants.
C. scientifically based (i.e., experimental) standard-setting study designs.
D. estimates of standard errors of measurement for scores in the region(s) of recommended cut scores.

2. The typical role of the standard-setting panel is to
A. determine one or more cut scores for a particular test.
B. recommend one or more cut scores to authorized decision makers.
C. determine the most appropriate method to use for the standard-setting task.
D. develop performance level descriptors that best match the target examinees.

3. Performance standard is to passing score as
A. practical is to ideal.
B. decision is to process.
C. objective is to subjective.
D. conceptual is to operational.

4. Performance level label (PLL) is to performance level descriptor (PLD) as title is to
A. index.
B. summary.
C. main idea.
D. first draft.

5. Which of the following is an example of a performance standard?
A. Students should be able to apply enabling strategies and skills to learn to read and write including inferring word meanings from taught roots, prefixes, and suffixes to decode words in text to assist in comprehension.


B. To be prepared for the jobs of the future, students must demonstrate an understanding of the overall meaning of what they read. When reading appropriate grade-level text, they should be able to make relatively obvious connections between the text and their own experiences and extend the ideas in the text by making simple inferences.
C. To be considered "Accelerated" students must obtain at least 35 points on the set of seven constructed-response items designed to assess grade-level reading comprehension.
D. Students performing at the "Accelerated" level consistently demonstrate mastery of grade-level subject matter and skills and are well prepared for the next grade level.

6. Which of the following is true regarding the composition of a standard-setting panel?
A. It should consist of at least 10 members for each construct measured by a multidimensional test.
B. It should include only participants with previous standard-setting experience.
C. It should be diverse enough to represent all likely examinee demographics.
D. It should be large and representative enough to produce reliable results.

7. A primary benefit of the Yes/No method is that it
A. simplifies the decision-making task for participants.
B. increases the sensitivity of participants to impact data.
C. reduces the need for participants to be familiar with typical examinee ability.
D. increases the likelihood that the true cut score will result from the standard-setting process.

8. Suppose that a decision was made to require examinees to obtain a score of at least 2 on each of the CR items shown in Table 6 (and that the ratings and decision rules for the SR items remained the same). What would be the result on the cut score for the total test?
A. The cut score would remain the same.
B. The cut score would increase by about 5 raw score points.
C. The cut score would decrease by about 5 raw score points.
D. Cannot determine; the result would depend on examinee performance on the CR items.

9. Suppose that a sixth-grade reading test consisted of four reading passages, each of which was followed by eight multiple-choice items and one constructed-response item. Using the Bookmark method, which would be the most appropriate way to construct an ordered item booklet for this test?
A. First arrange the passages in increasing order of readability, then arrange the items for each passage in order of increasing difficulty.
B. Arrange all items in difficulty order, printing the appropriate portion of the passage on the individual item pages.
C. Arrange all items in difficulty order, with reference to the appropriate passage on each page, printing all passages in a separate booklet.
D. Arrange the test booklet to be identical to the one students used, printing at the top of each page the difficulty index of the item and its rank order.

10. Suppose that facilitators for a standard-setting study using a Bookmark method trained participants to use an "RP50" decision rule to set a cut score for Proficient. In this situation, RP50 refers to the probability that
A. 50% of examinees who answer this item correctly will be considered Proficient.
B. 50% of borderline-Proficient examinees will answer this item correctly.
C. 50% of all Proficient examinees will answer this item correctly.
D. 50% of all examinees will answer this item correctly.

11. Suppose that standard-setting participants have completed their Round 1 ratings for Basic using the Bookmark method. Which of the following pieces of data would be used to calculate the Round 1 cut score for Basic?
A. Page number only
B. Page number and item difficulty
C. Student ability (theta) estimate
D. Standard deviation of the Round 1 ratings

12. Which of the following scenarios would most likely be classified as a "holistic" standard-setting procedure?
A. Standard setters review standardized math portfolios produced by 35 different students.
B. Standard setters review sample performances by 200 students on a single writing prompt.
C. Standard setters estimate the likelihood of a minimally Proficient student answering each of 60 multiple-choice items correctly.
D. Standard setters compare the performances of a group of known experts in a field with the performances of a group of known novices.

13. Which information would most likely be withheld from standard-setting participants during the second round of a holistic standard-setting activity?
A. performance level descriptors
B. individual student scores on the tests
C. distributions of student scores on the tests
D. cut scores from the earlier round of judgments

To answer Item 14, refer to Figure 2 in the Module.

14. In Figure 2, what is the rationale for setting a cut score at 39 points?
A. A score of 39 points treats misclassifications of Proficient and Advanced as equally serious.
B. Fifty percent of the examinees in the Advanced group had raw scores of 39 or higher.

C. The midpoint between the means of work samples rated as Proficient and Advanced is 39.
D. The mean score for work samples rated as Advanced in Round 1 is 39.

Constructed-Response Item

15. Develop one additional survey item that would be appropriate for inclusion in the list of evaluation items shown in Figure 3.

Answer Key to Self-Test

1. C
2. B
3. D
4. B
5. C
6. D
7. A
8. C
9. C
10. B
11. C
12. A
13. B
14. A
15. Answers will vary, but may include items such as:
"The members of my group brought diverse perspectives to the discussions."
"I felt qualified to make the judgments we were asked to make."
"The data we received showing probable effects of our ratings on pass/fail rates was a helpful piece of information."
"Reviewing the content standards that were sent prior to the meeting helped me understand the purpose of the test."
