Modern Methods in Standard Setting
For testing in general, the Standards note that:

A critical step in the development and use of some tests is to establish one or more cut points dividing the score range to partition the distribution of scores into categories. ... [C]ut scores embody the rules according to which tests are used or interpreted. Thus, in some situations the validity of test interpretations may hinge on the cut scores. (p. 53)

And, in the specific case of licensure and certification tests,

The validity of the inferences drawn from the test depends on whether the standard for passing makes a valid distinction between adequate and inadequate performance. (p. 157)

The 1999 version also includes new guidelines for standard setting. Among the guidance in the new Standards are:

Standard 1.7: When a validation rests in part on the opinions or decisions of expert judges, observers, or raters, procedures for selecting such experts and for eliciting judgments or ratings should be fully described. The qualifications and experience of the judges should be presented. The description of procedures should include any training and instructions provided, should indicate whether participants reached their decisions independently, and should report the level of agreement reached. If participants interacted with one another or exchanged information, the procedures through which they may have influenced one another should be set forth. (p. 19)

Standard 2.14: Where cut scores are specified for selection or classification, the standard errors of measurement should be reported in the vicinity of each cut score. (p. 35)

Standard 2.15: When a test or combination of measures is used to make categorical decisions, estimates should be provided of the percentage of examinees who would be classified in the same way on two applications of the procedure, using the same or alternate forms of the instrument. (p. 35)

Standard 4.19: When proposed interpretations involve one or more cut scores, the rationale and procedures used for establishing cut scores should be clearly documented. (p. 59)

Standard 4.20: When feasible, cut scores defining categories with distinct substantive interpretations should be established on the basis of sound empirical data concerning the relation of test performance to relevant criteria. (p. 60)

Standard 4.21: When cut scores defining pass-fail or proficiency categories are based on direct judgments about the adequacy of item or test performances or performance levels, the judgmental process should be designed so that judges can bring their knowledge and experience to bear in a reasonable way. (p. 60)

Standard 6.5: When relevant for test interpretation, test documents should include item level information, cut scores, ... (p. 69)

Standard 14.17: The level of performance required for passing a credentialing test should depend on the knowledge and skills necessary for acceptable performance in the occupation or profession and should not be adjusted to regulate the number or proportion of persons passing the test. (p. 162)

Federal Legislation

At the national level, at least two wide-ranging laws have affected the practice of standard setting. The Individuals with Disabilities Education Act (1997) requires greatly expanded participation of students with special needs in large-scale assessment programs. Among other regulations, the Act requires states to: (1) include children with disabilities in general state and district-level assessment programs; (2) develop and conduct alternate assessments for students who cannot participate in the general programs; and (3) provide public reports on the performance of special needs students with the same frequency and detail as reports on the assessment of nondisabled children. Developing new approaches to establishing performance standards for the required alternate assessments, which often comprise novel or nontraditional formats, has proven to be a significant standard-setting challenge.

A second piece of far-reaching legislation was enacted in 2001. The No Child Left Behind (NCLB) Act (2001) requires states to: (1) develop challenging content standards in reading, mathematics, and science; (2) develop and administer assessments aligned to those standards; and (3) (of particular relevance to standard setting) establish three levels of high achievement (Basic, Proficient, and Advanced) to describe varying levels of mastery of the content standards.

These three phenomena taken together -- the rise of standards-referenced testing, the publication of the new Standards for Educational and Psychological Testing, and recent federal legislation -- have necessitated greater attention to standard setting than perhaps ever before. Much has been demanded of the technology of standard setting. New methods have been developed to meet new contexts and challenges, and substantially greater scrutiny and awareness of standard setting by policymakers, educators, and the public have resulted. This module is an attempt to catch up on these fast-paced changes.

In the following sections, we provide an update on concepts and methods of setting performance standards. First, we describe what is meant by standard setting and we provide a rationale for the need for setting standards. Next, we list some general considerations that warrant attention in any standard-setting procedure. Then, we describe three specific methods, introduced since the publication of the earlier module, which have found fairly wide usage in achievement testing contexts. These methods are presented in how-to format which, it is hoped, will provide sufficient detail to actually enable readers to use the methods to obtain cut scores in a relevant situation. The final section of the module presents guidelines for evaluating standard setting. An annotated bibliography and self-test appear at the end of this module.

Definition of Standard Setting

It might seem obvious that what is called standard setting is the process by which a standard or cut score is established. In reality, however, standard setting is not so straightforward. For example, participants in a standard-setting process rarely set standards; rather, a standard-setting panel usually makes a recommendation to a body with the actual authority to implement, adjust, or reject the standard-setting panel's recommendation (e.g., state board of education, medical board, licensing agency). It is now a widely accepted tenet of measurement theory that the work of standard-setting panels is not to search for a knowable boundary between categories that exist. Instead, standard-setting procedures enable participants to bring to bear their judgments in such a way as to translate policy decisions (often, as operationalized in performance level descriptors) into locations on a score scale; it is these translations that create the effective performance categories. This translation and creation are seldom, if ever, purely statistical, im-
A second cross-cutting aspect of standard setting is the creation and use of performance level labels (PLLs). PLLs refer to the (usually) single-word terms used to identify performance categories; Basic, Proficient, and Advanced would be examples of such labels. Many such categorical labeling systems exist; a few examples are shown in Table 1. Though PLLs may have little technical underpinning, they clearly carry rhetorical value as related to the purpose of the standard setting. Such labels have the potential to convey a great deal in a succinct manner vis-a-vis the meaning of classifications that result from the application of cut scores. It is obvious from a measurement perspective that PLLs should be carefully chosen to relate to the purpose of the assessment, to the construct assessed, and to the intended, supportable inferences arising from the classifications.

A third issue -- actually an extension of the concern with PLLs -- is evident when performance level descriptors (PLDs) are created. PLD refers to the (usually) several sentences or paragraphs that provide fuller, more complete illustration of what performance within a particular category comprises. PLDs vary in their level of specificity, but have in common the verbal elaboration of the knowledge, skills, or attributes of test takers within a performance level. It is highly desirable for PLDs to be developed in advance of standard setting by a separate committee for approval by the appropriate policymaking body. Standard-setting participants then use these PLDs as a critical referent for their judgments. Sometimes, elaborations of the PLDs are developed by participants during a standard-setting procedure as a first step (i.e., prior to making any item or task judgments) toward operationalizing and internalizing the performance levels intended by the policy body. Sample PLDs, in this case those used for the NAEP Grade 4 reading assessment, are shown in Table 2.

Table 2. Sample Performance Level Descriptors for the NAEP Grade 4 Reading Assessment

Proficient: Fourth-grade students performing at the Proficient level should be able to demonstrate an overall understanding of the text, providing inferential as well as literal information. When reading text appropriate to fourth grade, they should be able to extend the ideas in the text by making inferences, drawing conclusions, and making connections to their own experiences. The connections between the text and what the student infers should be clear.
For example, when reading literary text, Proficient-level fourth graders should be able to summarize the story, draw conclusions about the characters or plot, and recognize relationships such as cause and effect.
When reading informational text, Proficient-level students should be able to summarize the information and identify the author's intent or purpose. They should be able to draw reasonable conclusions from the text, recognize relationships such as cause and effect or similarities and differences, and identify the meaning of the selection's key concepts.

Basic: Fourth-grade students performing at the Basic level should demonstrate an understanding of the overall meaning of what they read. When reading text appropriate for fourth graders, they should be able to make relatively obvious connections between the text and their own experiences, and extend the ideas in the text by making simple inferences.
For example, when reading literary text, they should be able to tell what the story is generally about, providing details to support their understanding, and be able to connect aspects of the stories to their own experiences.
When reading informational text, Basic-level fourth graders should be able to tell what the selection is generally about or identify the purpose for reading it, provide details to support their understanding, and connect ideas from the text to their background knowledge and experiences.

There is an inherent tension in the creation of PLDs. Descriptions that provide too little specificity do not help illustrate or operationalize the performance categories. As such, they do not assist in communication to external audiences about the meaning of categorization at a given performance level. Descriptions that provide too much specificity by providing a detailed list of the knowledge and skills that a student at a given level possesses may be destined to pose validation problems. For example, suppose that very detailed descriptions are generated describing the specific knowledge and skills possessed by examinees in a category. Suppose further that actual categorical classifications will be based on examinees' total test scores. Under such a scenario, there will almost always be many instances in which a test taker demonstrates mastery of knowledge or skills outside the category to which he or she is assigned, and fails to demonstrate mastery of knowledge or skills for some elements within the performance category. This contradiction between the statement of knowledge and skills that examinees in a category are supposed to possess (as indicated in the PLDs) and the knowledge and skills that they actually possess (as indicated by observed test performance) makes validation of the PLDs problematic. Some researchers have attempted to solve this dilemma by crafting standard-setting procedures in which items are matched to performance level descriptions (see, e.g., Ferrara, Perie, & Johnson, 2002). Despite these efforts, the vexing issue of ensuring fidelity of PLDs with actual examinee performance is an area that remains one for which much additional work is needed.

Fourth, it has long been known that the participants in the standard-setting process are critical to the success of the endeavor and are a source of variability of standard-setting results. The Standards for Educational and Psychological Testing (AERA/APA/NCME, 1999) provide guidance on representation, selection, and training of participants. For example, the Standards indicate that "a sufficiently large and representative group of judges should be involved to provide reasonable assurance that results would not vary greatly if the process were replicated" (p. 54). The Standards also recommend that "the qualifications of any judges involved in standard setting and the process by which they are selected" (p. 54) should be fully described and included as part of the documentation for the standard-setting process. The Standards also address training:

Care must be taken to assure that judges understand what they are to do. The process must be such that well-qualified judges can apply their knowledge and experience to reach meaningful and relevant judgments that accurately reflect their understandings and intentions. (p. 54)

As with the development of PLDs, there is a tension present in the selection of standard-setting participants. While it is often recommended that participants have special expertise in the area for which standards will be set, in practice this can mean that standard-setting panels consist of participants whose perspectives are not representative of all practitioners in a field, all teachers at a grade level, and so on. Such a bias might be desirable if the purpose of standard setting is exhortation, though less so if the purpose of standard setting is to certify competence of students for awarding a high school diploma.

In addition, once standard-setting participants have been selected and trained and the procedure has begun, there is the matter of providing feedback to participants. Many standard-setting approaches comprise "rounds" or iterations of judgments. At each round, participants are provided various kinds of information to summarize their own variability, correspondence with the group's ratings, or likely impact on the examinee population.

A complete treatment of selecting, training, and providing feedback to participants in standard setting is beyond the scope of this module. Readers are referred to the work of Raymond and Reid (2001) for further information on the selection, training, and evaluation of standard-setting participants, and to Reckase (2001) for more information on providing feedback to participants.

Finally, a fifth common issue is the necessity for standard-setting participants to form and rely on a conceptualization related to the examinee group to whom the standard(s) will apply. The need for such conceptualizations may have origins in the Nedelsky (1954) method, in which standard setters are required to consider options that a hypothetical F-D student would recognize as incorrect. (According to Nedelsky, the F-D student is one who was on the borderline between passing and failing a course; hence, the notion of a point differentiating between a failing grade of "F" and a passing grade of "D.") Participants using an Angoff (1971) or derivative methodology form a conceptualization of the minimally competent examinee.

In contemporary standard setting, these often-hypothetical conceptualizations remain important, regardless of whether a particular method is considered to be "examinee centered" or "test centered" (Jaeger, 1989). For example, to use the Bookmark method (Mitzel, Lewis, Patz, & Green, 2001), participants must consider at what point students in a certain performance category (e.g., Basic) or on the borderline between categories will have a specified probability of responding correctly. While standard-setting participants are often selected for their subject area expertise and knowledge of examinees to whom the test will be given, the abstract notion of an examinee within or between particular categories is still required for standard setting to proceed.
Standard-Setting Methods

According to the Standards for Educational and Psychological Testing, "There can be no single method for determining cut scores for all tests or for all purposes, nor can there be any single set of procedures for establishing their defensibility" (AERA/APA/NCME, 1999, p. 53). Recent advances in standard setting have added new approaches to the inventory of available methods. The new methods described in the following sections have some common advantages: they are generally more holistic (they require standard-setting participants to make holistic judgments about items or examinee test performance); they are intended to reduce the cognitive burden on participants; and they can be applied to a wide variety of item and task formats.

Before turning to a description of three such methods, we note two purposeful omissions in the following subsections. First, we do not review methods that would be highly appropriate for situations involving a mix of item formats and multiple cut scores (e.g., contrasting groups, borderline groups), but which have been described in previous modules (see Cizek, 1996a). Second, the following descriptions of each method generally focus on the procedures used to actually obtain one or more cut scores. Of course, much more is required of a defensible standard-setting process, including identification and training of appropriately qualified participants, effective facilitation, monitoring, and feedback to participants, and well-conceived data collection to support whatever validity claims are made. A generic framework of steps required for standard setting has been put forth by Hambleton (1998) and is presented here as Table 3. However, each step warrants deeper attention in its own right, and readers interested in additional details on these topics are referred to other sources (e.g., Kane, 2001; Raymond & Reid, 2001; Reckase, 2001).

Bookmark Method

The Bookmark method is one of several item-mapping procedures developed in an attempt to simplify the cognitive task of standard setters who are required to consider performance-level descriptions, maintain appropriate conceptualizations of examinees within or between performance levels, and make probability estimates. First introduced by Lewis, Mitzel, and Green in 1996, the procedure has rapidly become widely used in K-12 education settings. Among the advantages of the Bookmark method are the comparative ease with which it can be applied by standard-setting participants, the fact that it can be applied to tests comprising both selected-response (SR, e.g., multiple-choice) and constructed-response (CR) items, and the fact that it can be used to set multiple cut scores on a single test.

The Ordered Item Booklet. The Bookmark procedure is so named because standard-setting participants identify cut scores by placing markers in a specially prepared test booklet. The distinguishing characteristic of the special test booklet is that it is prepared in advance with test items ordered by difficulty: easiest items first and hardest items last. This has come to be referred to as an ordered item booklet (OIB). The preparation of an OIB may seem simple enough in concept yet, until Lewis et al. (1996) introduced the idea, it had not been incorporated into a formal standard-setting method. The idea, however, instantly transformed standard setting into a classical psychophysics experiment in which a stimulus of gradually changing strength or form is presented to subjects who are given the task of noting the point at which a just-noticeable difference (JND) occurs. In the Bookmark procedure, participants begin with the knowledge that each succeeding item will be harder than (or at least as hard as) the one before; they are charged with noting one or more JNDs in the course of several test items in the OIB.

The ordering of MC items in an OIB is rather straightforward, particularly if a one-parameter logistic (1-PL) item response model (e.g., Rasch model) was used to obtain estimates of item difficulty. Whether a 1-PL, 2-PL, or 3-PL model is used, items are simply arranged in ascending b-value (i.e., item difficulty) order. When a test contains both SR and CR items, each CR item appears several times in the booklet, once for each of its score points. For a given CR item, the item prompt, the rubric, and sample examinee responses illustrating the score point(s) are also provided to standard setters. The OIB is formatted with only one item (or CR score point) per page.

The OIB can be composed of any collection of items that is representative of

[Table 4 (excerpt): for each item in the ordered item booklet, the table lists the source passage (e.g., Yellowstone), the item number (e.g., Item 22), and the ability level required for a .67 chance of answering correctly (e.g., 1.725).]
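Assembling an OIB from calibrated item parameters is mechanical once the scaling work is done. The sketch below is a minimal illustration of the ordering just described, assuming a hypothetical list of calibrated entries (item identifiers, formats, and difficulty or step-location estimates that are not taken from the module): SR items are placed by their b-values, and each CR score point gets its own page at the location associated with reaching that score point.

```python
# Sketch: assembling an ordered item booklet (OIB) from calibrated items.
# The item records below are hypothetical, not values from the module.

def build_oib(items):
    """Return OIB pages ordered from easiest to hardest location.

    Each SR item contributes one page (its b-value); each CR item
    contributes one page per score point.
    """
    pages = []
    for item in items:
        if item["format"] == "SR":
            pages.append({"item": item["id"], "score_point": 1,
                          "location": item["b"]})
        else:  # CR: one page for each score point's location
            for point, loc in enumerate(item["step_locations"], start=1):
                pages.append({"item": item["id"], "score_point": point,
                              "location": loc})
    # Easiest (lowest location) first, hardest last.
    return sorted(pages, key=lambda page: page["location"])

items = [
    {"id": "SR-01", "format": "SR", "b": -1.20},
    {"id": "SR-02", "format": "SR", "b": 0.35},
    {"id": "CR-01", "format": "CR", "step_locations": [-0.50, 0.80, 1.60]},
]

for number, page in enumerate(build_oib(items), start=1):
    print(number, page["item"], "score point", page["score_point"],
          "location", round(page["location"], 2))
```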
The particular likelihood used, in this case .67, is referred to as the response probability (RP). According to Mitzel et al. (2001), an RP of .67 can be interpreted in the following way: "For a given cut score, a student with a test score at that point will have a .67 probability of answering an item also at that cut score correctly" (p. 260). However, the use of other RPs has been investigated. Huynh (2000) suggested that the RP which maximized the information function of the test would produce the optimum decision rule. For a two-parameter IRT model, Huynh found that an RP of .67 maximized this function. Wang (2003) concluded that an RP of .50 is preferable when the Rasch (i.e., 1-PL) scaling model is used. The choice of .50 in the Rasch model context has certain mathematical advantages over .67 in that the likelihood of a correct response is exactly .50 when the examinee ability is equal to the item difficulty.

Issues related to selection of the most appropriate RP remain, however. Whether standard-setting participants can use any particular RP value more effectively than another, and whether they can understand and apply the concept of RP more consistently and accurately than they can generate probability estimates using, for example, a modified-Angoff approach, remain topics for future research.

Psychometric Foundations of the Bookmark Approach. As originally described, the Bookmark method employs a three-parameter logistic (3-PL) model for SR items and a two-parameter partial-credit (2PPC) model for CR items. However, an alternative approach using a 1-PL (i.e., Rasch) model for both SR and CR items is also frequently used in practice. Both approaches are described in this section, beginning with a brief explication of the Bookmark method as originally proposed.

As indicated previously, standard-setting participants express their judgments by placing a marker in the OIB on the page after the last item that they believe an examinee who is just barely qualified for a particular classification (e.g., Proficient) has a .67 probability of answering correctly. These judgments are translated into cut scores by noting the examinee ability associated with a .67 probability of a correct response and then translating that ability into a raw score. As originally described by Mitzel et al. (2001), the probability of a correct response (P_i) for an SR item is a function of examinee ability (θ), item difficulty (b_i), item discrimination (a_i), and a threshold or chance variable (c_i) in accordance with the fundamental equation of the 3-PL model:

P_i(θ) = c_i + (1 - c_i) / {1 + exp[-1.7 a_i (θ - b_i)]},   (1)

where exp represents the base of the natural logarithm, e (2.71828 ...), raised to the power of the expression to its right. However, Mitzel et al. (2001) set the threshold or chance parameter (c_i) equal to zero, reducing Equation 1 to

P_i(θ) = 1 / {1 + exp[-1.7 a_i (θ - b_i)]},   (2)

or essentially a 2-PL model.

For dichotomously scored (i.e., SR) items, the basic standard-setting question is whether or not an examinee just barely categorized into a given performance level would have a .67 chance of answering a given SR item correctly. Thus, starting with a probability of .67 and solving Equation 2 for the ability (θ) needed to answer an item correctly, we obtain the following:

θ = b_i + .708 / (1.7 a_i).   (3)

For CR items, the situation becomes somewhat more complicated. Mitzel et al. (2001) used the two-parameter generalized partial-credit model (Muraki, 1992). This model, shown in Equation 4, presents the probability of obtaining a given score point (c), given some ability level (θ), as a function of the difficulty of the various score points (b_i0 to b_im) and the item discrimination (a_i):

P_ic(θ) = exp[Σ(v=0 to c) a_i (θ - b_iv)] / Σ(h=0 to m_i) exp[Σ(v=0 to h) a_i (θ - b_iv)].   (4)

Mitzel et al. (2001) note that the Bookmark procedure can also be implemented under other IRT models, such as the 1-PL (Rasch) model. This particular application of the Bookmark procedure begins with a basic expression of the Rasch model for dichotomous items (cf. Wright & Stone, 1979, Equation 1.4.1):

P(X = 1 | θ_n, δ_i) = exp(θ_n - δ_i) / [1 + exp(θ_n - δ_i)],   (5)

where
θ_n = ability (theta estimate) of an examinee;
δ_i = difficulty of item i; and
exp = the base of the natural logarithm, e, raised to the power inside the parentheses.

Allowing the expression on the right of Equation 5 to equal .67 and solving for θ_n, we obtain the following:

θ_n = δ_i + .708,   (6)

which is very similar to Equation 3 except for the omission of the a parameter, which is the distinguishing characteristic of the 2-PL model. Thus, the Rasch ability level required for an examinee to have a .67 probability of answering a given SR item correctly would be .708 logits greater than the difficulty of the item.

When a test comprises CR items, the derivation of the ability level necessary to obtain a given score point is somewhat more complex than for SR items. Indeed, it is necessary to calculate a system of probabilities for each CR item (i.e., a probability for each score point). To accomplish this, a partial-credit model is commonly used. According to this model, the likelihood (π_nix) of a person (n) with a given ability (θ_n) obtaining a given score (x) on an item (i) with a specified number of steps (j) is shown in Equation 7 (taken from Wright & Masters, 1982, Equation 3.1.6):

π_nix = exp[Σ(j=0 to x) (θ_n - δ_ij)] / Σ(k=0 to m_i) exp[Σ(j=0 to k) (θ_n - δ_ij)],   (7)

where x is the value of the score point (0, 1, 2, 3, etc.) in question, and m_i is the final step. The numerator in Equation 7 refers only to the steps completed for the score point x, while the denominator includes the sum of all m_i + 1 possible numerators.
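The derivations above reduce to a few lines of arithmetic. The sketch below, a minimal illustration using hypothetical item parameters rather than values from the module, computes the ability required for a given response probability under the reduced 3-PL model of Equation 3 and under the Rasch model of Equation 6, and then converts an ability estimate to an expected raw score by summing Rasch item probabilities over dichotomous items (CR items would require the partial-credit expected score instead).

```python
import math

D = 1.7  # scaling constant used in Equations 1-3

def theta_at_rp_2pl(b, a, rp=0.67):
    """Ability giving probability `rp` under Equation 2 (3-PL with c = 0)."""
    return b + math.log(rp / (1 - rp)) / (D * a)

def theta_at_rp_rasch(b, rp=0.67):
    """Ability giving probability `rp` under the Rasch model (Equation 6)."""
    return b + math.log(rp / (1 - rp))  # .708 logits above b when rp = .67

def expected_raw_score(theta, b_values):
    """Expected number-correct score on dichotomous items under the Rasch model."""
    return sum(math.exp(theta - b) / (1 + math.exp(theta - b)) for b in b_values)

# Hypothetical Rasch difficulties for a short all-SR test (not the module's data).
b_values = [-2.1, -1.4, -0.9, -0.3, 0.2, 0.8, 1.5]

# Bookmark logic: each panelist's placement maps to the theta at RP = .67
# for the bookmarked item; the panel cut is the mean of those thetas.
bookmarked_b = [-1.4, -1.4, -0.9]  # hypothetical placements
cut_theta = sum(theta_at_rp_rasch(b) for b in bookmarked_b) / len(bookmarked_b)

print(round(cut_theta, 3), round(expected_raw_score(cut_theta, b_values), 2))
```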
Determining a Cut Score Using the Bookmark Method. The following example illustrates the application of the Bookmark procedure. The items shown in Table 4 are drawn from a report by Schagen and Bradshaw (2003) regarding a national reading test given to 11-year-olds in Great Britain. The test consisted of 27 SR items and 10 CR items. Of the 10 CR items, seven were worth 2 points each, and three were worth 3 points each, for a total of 50 points for the entire test. Twelve participants evaluated the OIB represented in Table 4 and rendered their bookmark placements for a minimal student (Level 3). Those judgments are shown in Table 5.

Table 5. Summary of Participants' Bookmark Placements for Level 3 (Minimal)

Participant   Item Number   Page Number in OIB   Theta @ RP = .67
A             2             6                    -1.517
B             4             5                    -1.492
C             4             5                    -1.492
D             2             6                    -1.517
E             2             6                    -1.517
F             2             6                    -1.517
G             13            2                    -2.352
H             4             5                    -1.492
I             2             6                    -1.517
J             13            2                    -2.352
K             2             6                    -1.517
L             2             6                    -1.517
Mean                                             -1.594
Source: Adapted from Schagen and Bradshaw (2003).

The cut score is based on the mean theta at the associated response probability (theta @ RP = .67). In this instance, the mean theta value of -1.594 corresponds to a raw score of 15.25. Because fractional raw scores are not possible, the operational cut score would need to be rounded to a possible score point, such as 15 or 16, depending on the rounding rules in place, though it should be noted that a student who had earned a raw score of 15 would have an ability less than the target value of -1.594.

It should also be noted that participants selected items on the second, fifth, and sixth pages of the OIB (Items 13, 4, and 2, respectively). If none of the participants went farther than page 6 in the booklet, it might seem reasonable that the cut score for the minimal level should be no more than 6 points. However, the Bookmark procedure focuses on the student ability level associated with the 67% likelihood of answering Item 2, 4, or 13 (the ones identified by the participants as marking the boundary between minimal and the next lower level). It is on those ability levels, not the page numbers or cumulative number of items, that the cut score is set. The student who has a 67% likelihood of answering Item 2 correctly also has a slight chance of answering subsequent items correctly or obtaining scores of 2 or 3 on moderately difficult CR items. The expected score for the student at the just barely minimal level is the aggregate of expected scores on all 37 items in the test. For this particular test, based on the average of these participants' estimates, that expected raw score is somewhere between 15 and 16.

To summarize this application of the Bookmark method, 12 standard-setting participants made judgments about the location of the minimal achievement level by placing bookmarks in their OIBs. These judgments are shown in the column labeled "Item Number" in Table 5. The relationships for each item between page number and ability required to reach that level (with a 67% likelihood) are shown in Table 4. The page numbers supplied by the participants were translated into ability estimates using the data in Table 4. These ability estimates were then averaged to determine the mean ability estimate of a student just barely at the minimal level. That ability level was then converted to a raw score using standard, commercially available 3-PL model software.

Angoff Variations

Originally proposed by Angoff (1971) and described elsewhere (see Cizek, 1996a), the Angoff approach has produced many variations which have adapted this most thoroughly researched and still widely used method to evolving assessment contexts and challenges. Just as the previously described Bookmark approach was developed in an attempt to reduce the complexity of the cognitive task facing standard-setting participants, so too was a derivative of the Angoff procedure referred to as the Yes/No method by Impara and Plake (1997). The essential question that must be addressed by standard-setting participants can be answered "Yes" or "No." According to Impara and Plake, participants are directed to

read each item [in the test] and make a judgment about whether the borderline student you have in mind will be able to answer each question correctly. If you think so, then under Rating 1 on the sheet you have in front of you, write in a Y. If you think the student will not be able to answer correctly, then write in an N. (pp. 364-365)

In essence then, the Yes/No method is highly similar to the first Angoff (1971) approach. In his oft-cited chapter on scaling, norming, and equating, Angoff described two variations of a standard-setting method. While his second suggestion came to be known as the widely used Angoff method, Angoff first suggested that standard setters simply judge whether or not a hypothetical minimally acceptable person would answer an item correctly. According to Angoff,

a systematic procedure for deciding on the minimum raw scores for passing and honors might be developed as follows: keeping the hypothetical "minimally acceptable person" in mind, one could go through the test item by item and decide whether such a person could answer correctly each item under consideration. If a score of one is given for each item answered correctly by the hypothetical person and a score of zero is given for each item answered incorrectly by that person, the sum of the item scores will equal the raw score earned by the "minimally acceptable person." (pp. 514-515)

Implementing the Yes/No Method. The basic procedures for implementing the Yes/No method follow those for most common standard-setting approaches. To begin, qualified participants are selected and are oriented to the standard-setting task. They are often grounded in the content standards upon which the test was built; they may be required to take the test themselves, and they discuss the relevant competencies and characteristics of the target population of examinees for whom the performance levels are to be set. After discussion of the borderline examinees, participants are asked to make performance estimates for a group of examinees in an iterative process over two or more "rounds" of ratings.

Typically, in a first round of performance estimation, participants using the Yes/No method rate a set of operational items often comprising an intact test form. At the end of Round 1, each participant would be provided with feedback on their ratings in the form of information about how their ratings compared to actual examinee performance or to other participants' ratings. A second round of yes/no judgments on each item follows as participants review each item in the test. If not provided to them previously, at the end of the second round of judgments, participants would receive additional information regarding how many examinees would be predicted to pass or fail based on the participants' judgments (i.e., impact data). Regardless of how many rounds of ratings occur, calculation of the final recommended passing score would be based on data obtained in the final round.

Extended Angoff Method. Although an extension of the Yes/No method to contexts with polytomously scored items or a mix of SR and CR formats has not been attempted, another variation of Angoff's (1971) basic approach has been created to address tests that include CR items. Hambleton and Plake (1995) describe what they have labeled an extended Angoff procedure. In addition to providing traditional probability estimates of borderline examinee performance for each SR item, participants also estimate the number of scale points that they believe borderline examinees will obtain on each CR task in the assessment. Cut scores for the extended Angoff approach are calculated in the same way as with traditional Angoff methods, although, as Hambleton (1998) notes, more complex weighting schemes can also be used for combining components in a mixed-format assessment.

Calculation of Yes/No and Extended Angoff Cut Scores. Table 6 presents hypothetical data for the ratings of 20 items by six participants in two rounds of ratings using the Yes/No and extended Angoff methods. The table has been prepared to illustrate calculation of cut scores that would result from use of the Yes/No method alone for a set of dichotomously scored SR items (i.e., the first 12 items listed in the table), the extended Angoff method alone for a set of polytomously scored CR items (the last eight items in the table), or a combination of Yes/No and extended Angoff (for the full 20-item set). For this set of items the CR items were scored on a 1-4 scale.

The means for each participant and each item are also presented for each round. Using the Round 2 ratings shown in Table 6, the recommended Yes/No passing score for the SR item test would be approximately 58% of the total raw score points (.58 x 12 items), or approximately 7 out of 12 points possible. The recommended passing score on the CR item test would be 21 out of a total of 32 possible score points (2.69 x 8 items). A recommended passing score for the 20-item test comprising a mix of SR and CR items would be approximately 28 of the 44 total possible raw score points [(.58 x 12) + (2.69 x 8)]. (See Hambleton & Plake, 1995, and Talente, Haist, & Wilson, 2003, for additional information on setting standards for complex performance assessments.)
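The arithmetic behind the cut scores just described can be sketched in a few lines. The rating matrices below are hypothetical stand-ins for Table 6, which is not reproduced here: each participant's recommended raw cut is the sum of his or her item judgments (1/0 for Yes/No on SR items, estimated score points for CR items), and the panel recommendation is the mean of those sums from the final round.

```python
# Hypothetical final-round ratings for 3 participants (Table 6 uses 6).
# Yes/No judgments on 4 SR items (1 = Yes, 0 = No).
yes_no = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [1, 1, 1, 0],
]
# Extended Angoff estimates of the score points a borderline examinee
# would earn on 2 CR items scored 1-4.
cr_estimates = [
    [3, 2],
    [2, 2],
    [3, 3],
]

def panel_cut(ratings):
    """Mean over participants of each participant's summed item judgments."""
    participant_sums = [sum(row) for row in ratings]
    return sum(participant_sums) / len(participant_sums)

sr_cut = panel_cut(yes_no)        # Yes/No cut on the SR portion
cr_cut = panel_cut(cr_estimates)  # extended Angoff cut on the CR portion
combined = sr_cut + cr_cut        # mixed-format recommendation

print(round(sr_cut, 2), round(cr_cut, 2), round(combined, 2))
```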
Research on the Yes/No Method. One of the appealing features of the Yes/No method is its simplicity. In typical implementations of modified Angoff procedures, participants must maintain a concept of a group of hypothetical examinees and must estimate the proportion of that group which will answer an item correctly. Clearly, this is an important, though difficult, task. Impara and Plake (1998) found that the Yes/No method ameliorated some of the difficulty of the task. They reported that:

We believe that the yes/no method shows substantial promise. Not only do panelists find this method clearer and easier to use than the more traditional Angoff probability estimation procedures, its results show less sensitivity to performance data and lower within-panelist variability. Further, panelists report that the conceptualization of a typical borderline examinee is easier for them than the task of imagining a group of hypothetical target candidates. Therefore, the performance standard derived from the yes/no method may be more valid than that derived from the traditional Angoff method. (p. 336)

As Impara and Plake (1998) have demonstrated, even teachers who were familiar with an assessment and with the examinees taking the assessment were not highly accurate when asked to predict the proportion of a group of borderline students who would answer an item correctly. The Yes/No method simplifies the judgment task by reducing the probability estimation required to a dichotomous outcome.

There are two alternative ways in which the Yes/No method can be applied. One variation requires participants to form the traditional conceptualization of a hypothetical borderline examinee; the other requires participants to reference their judgments with respect to an actual examinee on the borderline between classifications (e.g., between Basic and Proficient). In a comparative trial of the Yes/No method with a modified Angoff approach, Impara and Plake (1997) asked participants using the Yes/No method to think of one actual borderline examinee with whom the participant was familiar instead of conceptualizing a group of hypothetical examinees. Keeping this actual person in mind, participants were then asked to determine whether the examinee would answer each item correctly. The results showed that although the final standard was similar for participants using the Angoff method and the Yes/No method, the variance of the ratings with the Yes/No method was smaller and the participants' scores were more stable from Round 1 to Round 2. Participants reported that thinking of an actual examinee when rating the items was easier than thinking of a group of hypothetical examinees.

The relative cognitive simplicity of the Yes/No method identified by Impara and Plake was also reported by Chinn and Hertz (2002). They report that participants found the yes/no decisions easy to make because "they were forced to decide between a yes or a no rather than estimate performance from a range of estimates," whereas participants using a modified Angoff method "commented that determining the proportion of candidates who would answer each item correctly was difficult and subjective" (p. 7). However, in contrast to the attractive stability of the participants' ratings observed by Impara and Plake (1998), Chinn and Hertz found that there was greater variance in ratings using the Yes/No method. They hypothesize that this may be due to design limitations and several departures from the methodology used by Impara and Plake, including their selection of participants, instructions, and level of discussion about the process.

To date the Yes/No method has only been applied in contexts where the outcome is dichotomous (i.e., with multiple-choice or other SR-format items which will be scored as correct or incorrect).

Holistic Methods

Increasingly, large-scale assessments have incorporated a mix of item formats in order to tap more fully the constructs that are measured by those tests and to avoid one common validity threat known as construct underrepresentation. While tests comprising SR-format items exclusively may have been more common in the past, newer tests often comprise short-response items, essays, show-your-work, written reflections, grid-in response formats, and other test construction features to which standard-setting methods designed for SR tests are not well suited.

Assessment specialists have responded by proposing a variety of methods for setting performance standards on tests comprising exclusively CR items (e.g., a writing test) or a mix of SR and CR formats (e.g., a mathematics test). Several of these methods can be termed "holistic," in that they require participants to focus judgment on a sample or collection of examinee work greater than a single item or task at a time. Though a number of methods satisfy this characteristic, we are aware, too, that differences between these methods can defy common classification. With that caveat, we note several examples of more holistic methods, then we provide greater detail on a single implementation of one such procedure.

Examples of Some Holistic Methods. One such method that would be considered more holistic has been proposed by Plake and Hambleton (2001) (although the developers described their method as "analytic judgment"). The method was developed for tests that include polytomously scored performance tasks and other formats, resulting in a total test comprising different components. To implement the method, panelists review a carefully selected set of materials for each component, representing the range of actual examinee performance on each of the questions comprising the assessment (although examinees' scores are not revealed to the panelists). Panelists then classify the work samples according to whatever performance levels are required (e.g., Basic, Proficient, and Advanced). Plake and Hambleton used even narrower categories within these performance levels, which they called low, middle, and high (e.g., low-Basic, middle-Basic, high-Basic). Although Plake and Hambleton suggested alternative methods for calculating the eventual cut scores, a simple averaging approach appeared to work as well as the others. The averaging approach consisted of taking all papers classified by participants into what were called borderline categories. For example, the cut score distinguishing Basic from Proficient was obtained by averaging the scores of papers classified into the high-Basic and low-Proficient borderline categories.

Loomis and Bourque (2001) have described a similar approach to that of Plake and Hambleton (2001) in what they call a paper selection method. They also describe another similar approach, which they term the booklet classification
[Figure: frequency distributions of raw scores for examinees classified as Below Basic, Basic, Proficient, and Advanced (vertical axis: frequency count; horizontal axis: raw score).]
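A minimal sketch of the simple averaging approach described above for the analytic judgment method follows; the work-sample scores and category assignments are hypothetical, not data from the module. The Basic/Proficient cut is taken as the mean score of all papers that participants placed in the two adjacent borderline categories.

```python
# Hypothetical total scores of work samples, keyed by the category
# participants assigned them (not data from the module).
classified = {
    "high-Basic":     [31, 33, 34, 32],
    "low-Proficient": [36, 35, 37],
}

# Pool the two borderline categories and average their scores.
borderline = classified["high-Basic"] + classified["low-Proficient"]
basic_proficient_cut = sum(borderline) / len(borderline)

print(round(basic_proficient_cut, 1))
```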
When using holistic approaches, decisions about whether and when to share student work sample scores, overall distributions of scores (i.e., impact data), item difficulty, and other data are made prior to the standard-setting activity. Typically, item and score data are shared after Round 1, and impact data are shared after Round 2. However, in some cases, impact data are also shared after Round 1.

Evaluating the Standard-Setting Process

Although not strictly a method itself, it is important that any standard-setting process gather evidence bearing on the manner in which any particular approach was implemented and the extent to which participants in the process were able to understand, apply, and have confidence in the eventual performance standards (Cizek, 1996b). Thus, evaluation of the standard-setting process can be thought of as an aspect of each method described previously in this module. Equal attention must be devoted to planning the standard-setting evaluation, a priori, as is given to carrying out the standard-setting procedure itself.

The evaluation of standard setting can be thought of as beginning with a critical appraisal of the degree of alignment between the standard-setting method selected and the purpose and design of the test, the goals of the standard-setting agency, and the characteristics of the standard setters. This match should be evaluated by an independent body (such as a technical advisory committee) acting on behalf of the standard-setting agency. Evaluation continues with a close examination of the application of the standard-setting procedure: To what extent did it adhere faithfully to the published principles of the procedure? Did it deviate in unexpected, undocumented ways? If there are deviations, are they reasonable adaptations, specified and approved in advance, and consistent with the overall goals of the activity? A measure of the degree to which individual standard-setting participants converge from one round to the next is yet another part of the evaluation.

These aforementioned evaluations are external in nature. However, on-site evaluations of the process of standard setting, by the participants themselves, serve as an important internal check on the validity and success of the process. Typically, two evaluations are conducted during the course of a standard-setting meeting. A first evaluation normally occurs after initial orientation of participants to the process, training in the method, and (when appropriate) administration to participants of an actual test form. This first evaluation serves as a check on the extent to which participants have been adequately trained, understand key conceptualizations and the task before them, and have confidence that they will be able to apply the selected method. A second evaluation is ordinarily conducted at the conclusion of the standard-setting meeting. Commonly, both evaluations consist of a series of survey questions. A sample end-of-meeting survey is shown in Figure 3.

[Figure 3 (excerpt): a sample end-of-meeting evaluation survey with Agree/Disagree items such as "I was able to follow the instructions and complete the rating sheets accurately" and "The discussions after the first round of ratings were helpful to me," followed by an open-ended Comments item.]

It should be noted that the format of the items in the survey shown in Figure 3 requires only an "Agree" or "Disagree" check mark from respondents. Because standard-setting meetings can be long and arduous activities, it is considered desirable to conduct the final evaluation in such a way as to make the task relatively easy for participants to complete and to lessen the proportion of nonresponse. Consequently, open-ended survey items requiring lengthy responses are generally avoided. One simple modification of the evaluation form shown in Figure 3 would be to replace the Agree/Disagree options with a Likert-type scale that gives participants greater response options (e.g., 1 = Strongly Disagree to 4 = Strongly Agree). Such a modification would permit finer grained reporting of participants' perceptions, calculations of means and standard deviations for each question on the survey, and so on.

These activities all focus on an evaluation of the process. What of the product(s) of standard setting? Commonly employed criteria here include reasonableness and replicability. A first potential aspect of product evaluation is the usefulness of the PLLs and PLDs.
Table 7. Criteria for Evaluating Standard-Setting Procedures

Procedural
Explicitness: The degree to which the standard-setting purposes and processes were clearly and explicitly articulated a priori.
Practicability: The ease of implementation of the procedures and data analysis; the degree to which procedures are credible and interpretable to relevant audiences.
Implementation: The degree to which the following procedures were reasonable, and systematically and rigorously conducted: selection and training of participants, definition of the performance standard, and data collection.
Feedback: The extent to which participants have confidence in the process and in the resulting cut score(s).
Documentation: The extent to which features of the study are reviewed and documented for evaluation and communication purposes.

Internal
Consistency within method: The precision of the estimate of the cut score(s).
Intrapanelist consistency: The degree to which a participant is able to provide ratings that are consistent with the empirical item difficulties, and the degree to which ratings change across rounds.
Interpanelist consistency: The consistency of item ratings and cut scores across participants.
Decision consistency: The extent to which repeated application of the identified cut score(s) would yield consistent classifications of examinees.
Other measures: The consistency of cut scores across item types, content areas, and cognitive processes.

External
Comparisons to other standard-setting methods: The consistency of cut scores across replications using other standard-setting methods.
Comparisons to other sources of information: The relationship between decisions made using the test to other relevant criteria (e.g., grades, performance on tests measuring similar constructs, etc.).
Reasonableness of cut scores: The extent to which cut score recommendations are feasible or realistic (including pass/fail rates and differential impact on relevant subgroups).
Source: Adapted from Pitoniak (2003).
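Several of the internal criteria in Table 7 can be summarized with simple descriptive statistics. As one illustration (a sketch with hypothetical panelist cut scores, not a prescribed index), the precision of the panel's cut score can be expressed as the standard error of the mean across panelists, and interpanelist consistency as the spread of individual recommendations.

```python
import statistics

# Hypothetical cut scores recommended by individual panelists in the
# final round (raw-score metric).
panelist_cuts = [14, 16, 15, 15, 17, 14, 16, 15]

mean_cut = statistics.mean(panelist_cuts)
spread = statistics.stdev(panelist_cuts)             # interpanelist variability
standard_error = spread / len(panelist_cuts) ** 0.5  # precision of the mean cut

print(round(mean_cut, 2), round(spread, 2), round(standard_error, 2))
```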
For a given subject and grade level, they should accurately reflect the content standards or credentialing objectives and be reasonably consistent with statements developed by others with similar goals.

Reasonableness can be assessed by the degree to which cut scores derived from the standard-setting process being evaluated classify examinees into groups in a manner consistent with other information about the examinees. For example, suppose it could be assumed that a state's eighth-grade reading test and the NAEP were based on common content standards (or similar content standards that had roughly equal instructional emphasis). In such a case, a standard-setting procedure for the state test resulting in 72% of the state's eighth graders being classified as Proficient, while NAEP results for the same grade showed that only 39% were Proficient, would cause concern that one or the other set of standards was inappropriate. Local information can also provide criteria by which to judge reasonableness. Do students who typically do well in class and on assignments mostly meet the top standard set for the test, while students who struggle fall into the lower categories? In the end, regardless of how reasonable a set of performance standards seems to assessment professionals or those who participated in the actual standard-setting activity, those standards will need to be locally reproducible, at least in an informal sense, in order to be widely accepted and recognized.

Replicability is another possible avenue for evaluating standard setting. For example, in some contexts where great resources are available, it is possible to conduct independent applications of a standard-setting process to assess the degree to which independent replications yield similar results. Evaluation might also involve comparisons between results obtained using one method and an independent application of one or more different methods. Interpretation of the results of these comparisons, however, is far from clear. For example, Jaeger (1989) has noted that different methods will yield different results, and there is no way to determine that one method or the other produced the wrong results. Zieky (2001) noted that there is still no consensus as to which standard-setting method is most defensible in a given situation. Again, differences in results from two different procedures would not be an indication that one was right and the other wrong; even if two methods did produce the same or similar cut scores, we could only be sure of precision, not accuracy.

The aspects of standard-setting evaluation listed here do not cover all of the critical elements of standard setting that can yield evidence about the soundness of a particular application. The preceding paragraphs have only attempted to highlight the depth and complexity of that important task. Table 7 provides a more inclusive list and description of evaluation criteria that can be used as sources of evidence bearing on the quality of the standard-setting process.

Conclusion

Setting performance standards has been called "the most controversial problem in educational assessment today" (Hambleton, 1998, p. 103). As long as important decisions must be
...perspectives (pp. 477-485). Mahwah, NJ: Erlbaum.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 13-104). New York: Macmillan.
Mitzel, H. C., Lewis, D. M., Patz, R. J., & Green, D. R. (2001). The Bookmark procedure: Psychological perspectives. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 249-281). Mahwah, NJ: Erlbaum.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159-176.
Nedelsky, L. (1954). Absolute grading standards for objective tests. Educational and Psychological Measurement, 14, 3-19.
No Child Left Behind Act. (2001). Public Law 107-110 (20 U.S.C. 6311).
Pitoniak, M. J. (2003). Standard setting methods for complex licensure examinations. Unpublished doctoral dissertation, University of Massachusetts, Amherst.
Plake, B. S., & Hambleton, R. K. (2001). The analytic judgment method for setting standards on complex performance assessments. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 283-312). Mahwah, NJ: Erlbaum.
Putnam, S. E., Pence, P., & Jaeger, R. M. (1995). A multi-stage dominant profile method for setting standards on complex performance assessments. Applied Measurement in Education, 8, 57-83.
Raymond, M. R., & Reid, J. B. (2001). Who made thee a judge? Selecting and training participants for standard setting. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 119-157). Mahwah, NJ: Erlbaum.
Reckase, M. D. (2001). Innovative methods for helping standard-setting participants to perform their task: The role of feedback regarding consistency, accuracy, and impact. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 159-174). Mahwah, NJ: Erlbaum.
Schagen, I., & Bradshaw, J. (2003, September). Modeling item difficulty for Bookmark standard setting. Paper presented at the annual meeting of the British Educational Research Association, Edinburgh.
Talente, G., Haist, S., & Wilson, J. (2003). A model for setting performance standards for standardized patient examinations. Evaluation and the Health Professions, 26(4), 427-446.
Wang, N. (2003). Use of the Rasch IRT model in standard setting: An item mapping method. Journal of Educational Measurement, 40, 231-253.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA.
Zieky, M. J. (2001). So much has changed: How the setting of cut scores has evolved since the 1980s. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 19-52). Mahwah, NJ: Erlbaum.

Annotated Bibliography

Cizek, G. J. (1996). Setting passing scores. Educational Measurement: Issues and Practice, 15(2), 20-31.
This ITEMS module is the precursor to the current module. It describes standard setting for achievement measures, with a focus on methods applied in the context of selected-response item formats. In addition to description of specific methods, it provides background and context for standard setting and describes issues surrounding standard setting.

Cizek, G. J. (Ed.). (2001). Setting performance standards: Concepts, methods, and perspectives. Mahwah, NJ: Erlbaum.
This fairly recent volume contains chapters written by some of the most authoritative and experienced persons working in the field of standard setting. The book covers all aspects of standard setting, including theoretical foundations, methodologies, and current perspectives on legal issues, validation, social significance, and applications for special populations and computer-adaptive testing.

Hansche, L. N. (Ed.). (1998). Handbook for the development of performance standards. Washington, DC: Council of Chief State School Officers.
This handbook focuses on methods for developing performance standards in the aligned system of standards and assessments required by IASA/Title I. Sections 1 and 2 provide definitions of performance standards in the context of an aligned educational system, advice for those developing systems of performance standards, and information about experiences of several states regarding standards-based assessment systems. Section 3 contains reports about research on developing performance standards and setting cut scores on complex performance assessments.

Jaeger, R. M. (1989). Certification of student competence. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 485-514). New York: Macmillan.
This chapter appears in the foundational reference text for the field of educational measurement. One of 18 chapters, this chapter provides an overview of standard-setting methods, issues, and concerns for the future. A revision of this chapter, focusing exclusively on standard setting and to be written by R. K. Hambleton and M. J. Pitoniak, will be included in the forthcoming 4th edition of Educational Measurement.

Self-Test

Multiple-Choice Items

1. The Standards for Educational and Psychological Testing (1999) require all of the following related to standard setting except:
A. estimates of classification/decision consistency.
B. description of the qualifications and experience of participants.
C. scientifically based (i.e., experimental) standard-setting study designs.
D. estimates of standard errors of measurement for scores in the region(s) of recommended cut scores.
2. The typical role of the standard-setting panel is to
A. determine one or more cut scores for a particular test.
B. recommend one or more cut scores to authorized decision makers.
C. determine the most appropriate method to use for the standard-setting task.
D. develop performance level descriptors that best match the target examinees.
3. Performance standard is to passing score as
A. practical is to ideal.
B. decision is to process.
C. objective is to subjective.
D. conceptual is to operational.
4. Performance level label (PLL) is to performance level descriptor (PLD) as title is to
A. index.
B. summary.
C. main idea.
D. first draft.
5. Which of the following is an example of a performance standard?
A. Students should be able to apply enabling strategies and skills to learn to read and write including inferring word meanings from taught roots, prefixes, and suffixes to de-
C. The midpoint between the means of work samples rated as Proficient and Advanced is 39.
D. The mean score for work samples rated as Advanced in Round 1 is 39.

Constructed-Response Item

15. Develop one additional survey item that would be appropriate for inclusion in the list of evaluation items shown in Figure 3.

Answer Key to Self-Test

1. C
2. B
3. D
4. B
5. C
6. D
7. A
8. C
9. C
10. B
11. C
12. A
13. B
14. A
15. Answers will vary, but may include items such as:
"The members of my group brought diverse perspectives to the discussions."
"I felt qualified to make the judgments we were asked to make."
"The data we received showing probable effects of our ratings on pass/fail rates was a helpful piece of information."
"Reviewing the content standards that were sent prior to the meeting helped me understand the purpose of the test."