BILOG-MG MULTILOG
PARSCALE TESTFACT
General notice: Other product names mentioned herein are used for identification
purposes only and may be trademarks of their respective companies.
Cover by Clint Smith of Esse Group. Based on a design by Louis Sullivan for
elevator grillwork in the Chicago Stock Exchange (1893).
Published by:
ISBN: 0-89498-053-X
Preface
Software for item analysis and test scoring has long been an important subset of the products
published by SSI. In this new volume, four of the IRT programs that have previously been
published separately have been brought together for the first time. The four programs—BILOG-
MG, MULTILOG, PARSCALE, and TESTFACT—have been ported to the Windows platform.
In the case of BILOG-MG and MULTILOG, analyses can be set up and performed interactively
via dialog boxes from within the program. Interfaces for TESTFACT and PARSCALE do not
presently include dialog boxes to build syntax interactively. All programs offer extensive on-line
help, and BILOG-MG, MULTILOG, and PARSCALE also include an IRT graphing program,
capable of producing quality graphics.
The programs
BILOG-MG, an extension of the BILOG program for the analysis of dichotomous data, was
written by Michele Zimowski (National Opinion Research Center, Chicago), Eiji Muraki
(Tohoku University, Japan), Robert Mislevy (Educational Testing Service), and Darrell Bock
(University of Chicago). This program can also perform multiple group analysis, allowing the
user to study both DIF and DRIFT. The documentation for the program, which has been
incorporated into Chapters 2, 7, 8, and 10 of this volume, was written by Darrell Bock and
Michele Zimowski, while Eiji Muraki and Robert Mislevy made major contributions in terms of
programming.
MULTILOG, written by David Thissen (University of North Carolina, Chapel Hill), is designed
to facilitate the analysis and scoring of items with multiple alternatives and makes use of logistic
response models, such as Samejima’s (1969) model for graded responses, Bock’s (1972) model
for nominal (non-ordered) responses, and Steinberg’s (1984) model for multiple-choice items.
Documentation by David Thissen has been included in Chapters 4, 7, 8, and 12 of this volume.
Eiji Muraki and Darrell Bock wrote PARSCALE, a program for the analysis and scoring of
rating-scale data. The program, which has proven to be a very flexible tool over the years, can
also perform multiple-group and DIF analysis. Documentation for PARSCALE, provided by Eiji
Muraki, is included in Chapters 3, 7, and 8.
The fourth program, TESTFACT, was written by Robert Wood (Pearn Kandola Downs, Oxford,
England). Other contributors to the program are Darrell Bock, Robert Gibbons (University of
Illinois, Chicago), Steven Schilling (University of Michigan), Eiji Muraki, and Douglas Wilson
(London, England). TESTFACT performs classical test scoring, item analysis, and item factor
analysis. Documentation provided by Robert Wood has been included in Chapters 5, 7, and 8.
About this book
This volume can be divided into two sections: a setup and reference guide, and an applications
guide.
The first section contains a description of data preparation and reference guides for each of the
four programs. It also provides descriptions of the user interfaces (where applicable) and the
IRT graphing program.
Chapter 1, dealing with the preparation of data for use in the programs, was written by Leo Stam,
SSI’s president and IRT consultant. Chapters 2, 3, 4, and 5 provide reference guides to both
syntax and interface for BILOG-MG, PARSCALE, MULTILOG, and TESTFACT, respectively.
Chapter 6 deals with a new feature common to BILOG-MG, MULTILOG, and PARSCALE: the
new graphics module. Item characteristic curves, item and test information curves, and a matrix
plot of all item characteristic curves simultaneously can all be plotted with this module. An
option to obtain a histogram of the estimated abilities has also been included.
The final two chapters in the first section of this volume provide information on the various
models that may be fitted in each program (see Chapter 7), while Chapter 8 discusses the
methods of estimation and the implementation of these in each of the applications.
The applications guide, covering chapters 9 to 13, starts with an overview of item response
theory and current applications thereof given by Professor Darrell Bock, the cofounder and
former president of SSI and one of the main authors of the IRT software. Chapters 10-13
provide annotated examples for the four programs. These chapters are meant as an aid both in
setting up command files and in interpreting the results obtained from IRT analyses. In each
example, a description of the research problem is given, along with the program keywords used in
the syntax file for the analysis. I have also revised and, in a number of cases, added to the
annotation of key sections of the output files produced by each program.
Appendix A contains a paper by Darrell Bock, A brief history of item response theory. This
paper, which first appeared in Educational Measurement: Issues and Practice, has been reprinted
here with the kind permission of the journal editors and provides a fascinating overview of the
development of IRT to date.
Using the CD / Installing the programs
The software CD contains four IRT programs. Each one can be installed separately, and in each
case complete on-line help is provided. SSI provides technical support for all registered users of
its software and it is recommended that the registration card, included in each shipment, be
returned to SSI for this purpose.
If the installation process does not begin automatically, locate and run setup.exe from the root
directory of your computer’s CD drive. Each of the IRT programs has a unique serial number
that appears on the CD jacket and/or shipment invoice; these should be retained for your records.
Although provision is made for a custom installation, the typical installation is recommended.
This installation includes the program files, the online help and a subfolder with all the examples
discussed in the help file and in this volume. The default installation folder can be changed to
suit the user’s needs. The readme.txt and/or readme.wri files contain instructions on how to
create a desktop icon and shortcut for each program.
In addition to the IRT programs, the CD contains the most recent student editions of the LISREL
and HLM programs that are also published by SSI. Other extra resources include this volume (in
.PDF format) and a copy of Adobe Systems’ Acrobat® Reader®.
Acknowledgements
Invaluable contributions from Darrell Bock and David Thissen made this project possible. The
daunting task of porting the IRT programs to Windows and designing the new dialog boxes was
undertaken by Shyan Lam. All data sets and examples were carefully revised by Leo Stam.
Debugging the programs and writing the graphics module were the responsibilities of Stephen du
Toit, whose untiring work and support went a long way toward making this volume a reality.
Bola King and Gerhard Mels spent weeks patiently working through all the documentation,
proofreading and offering suggestions on how this volume could be made more consistent in
style and more useful to the user of IRT programs. Without the assistance of all of these people,
this volume would never have been anything more than a good idea. Lastly, I must mention that
a venture of this magnitude is bound to be imperfect; I accept responsibility for any errors or
omissions in this volume and look forward to constructive criticism that will make the next
version even better.
– Mathilda du Toit
Table of Contents
1 DATA PREPARATION ..........................................................................................................16
2 BILOG-MG...............................................................................................................................24
2.4.2 A second model: DIF model for spelling data ..................................................................................... 100
2.5 SYNTAX..............................................................................................................................108
2.5.1 Data structures: ITEMS, TEST, GROUP and FORM commands ........................................................ 108
2.6.1 Overview of syntax................................................................................................................................. 113
3 PARSCALE ............................................................................................................................257
3.1.5 Font option.............................................................................................................................................. 260
3.4 OUTPUT FILES.....................................................................................................................337
4 MULTILOG ...........................................................................................................................345
4.3.2 Three-parameter (and guessing) model for the LSAT6 data.............................................................. 364
4.3.3 Generating syntax for a fixed-θ model............................................................... 370
5 TESTFACT.............................................................................................................................410
5.3.2 Overview of syntax................................................................................................................................. 415
5.3.28 TITLE command ................................................................................................................................... 502
6 IRT GRAPHICS.....................................................................................................................505
6.6 TEST INFORMATION CURVES ...............................................................................................526
7.4.4 The multiple response model ............................................................................................................... 568
8 ESTIMATION ........................................................................................................................592
9.7 APPROACHES TO ANALYSIS OF ITEM RESPONSE DATA .........................................................620
10.12 EAP SCORING OF THE NAEP FORMS AND STATE MAIN AND VARIANT TESTS..................686
11.1 ITEM CALIBRATION AND EXAMINEE BAYES SCORING WITH THE RATING-SCALE GRADED
MODEL ......................................................................................................................................692
11.3 CALIBRATION AND SCORING WITH THE GENERALIZED PARTIAL CREDIT RATING-SCALE
MODEL: COLLAPSING OF CATEGORIES.......................................................................................709
11.4 TWO-GROUP DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS WITH THE PARTIAL CREDIT
MODEL ......................................................................................................................................710
11.5 A TEST WITH 26 MULTIPLE-CHOICE ITEMS AND ONE 4-CATEGORY ITEM: THREE-PARAMETER
LOGISTIC AND GENERALIZED PARTIAL CREDIT MODEL ..............................................................720
11.6 ANALYSIS OF THREE TESTS CONTAINING ITEMS WITH TWO AND THREE CATEGORIES:
CALCULATION OF COMBINED SCORES .......................................................................................722
11.8 RATER-EFFECT MODEL: ONE-RECORD INPUT FORMAT WITH SAME NUMBER OF RATERS PER
EXAMINEE ................................................................................................................................727
12.3 THREE-PARAMETER (AND GUESSING) MODEL FOR THE FIVE-ITEM TEST ............................733
12.5 THREE-CATEGORY PARTIAL CREDIT MODEL FOR THE TWO-ITEM QUESTIONNAIRE ............738
12.7 A GRADED MODEL ANALYSIS OF ITEM-WORDING EFFECT ON RESPONSES TO AN OPINION
SURVEY ....................................................................................................................................741
12.15 A MIXED NOMINAL AND GRADED MODEL FOR SELF-REPORT INVENTORY ITEMS .............765
12.16 A MIXED THREE-PARAMETER LOGISTIC AND PARTIAL CREDIT MODEL FOR A 26-ITEM TEST
.................................................................................................................................................767
12.18 DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSIS OF EIGHT ITEMS FROM THE 100-ITEM
SPELLING TEST .........................................................................................................................770
12.19 INDIVIDUAL SCORES FOR A SKELETAL MATURITY SCALE BASED ON GRADED RATINGS OF
OSSIFICATION SITES IN THE KNEE ..............................................................................................772
13 TESTFACT EXAMPLES....................................................................................................775
13.1 CLASSICAL ITEM ANALYSIS AND SCORING ON A GEOGRAPHY TEST WITH AN EXTERNAL
CRITERION ................................................................................................................................775
13.3 ONE-FACTOR NON-ADAPTIVE FULL INFORMATION ITEM FACTOR ANALYSIS OF THE FIVE-
ITEM TEST .................................................................................................................................780
13.4 A THREE-FACTOR ADAPTIVE ITEM FACTOR ANALYSIS WITH BAYES (EAP) ESTIMATION OF
FACTOR SCORES: 32 ITEMS FROM AN ACTIVITY SURVEY ...........................................................780
13.5 ADAPTIVE ITEM FACTOR ANALYSIS AND BAYES MODAL (MAP) FACTOR SCORE ESTIMATION
FOR THE ACTIVITY SURVEY .......................................................................................................802
13.6 SIX-FACTOR ANALYSIS OF THE ACTIVITY SURVEY BY MONTE CARLO FULL INFORMATION
ANALYSIS .................................................................................................................................803
13.14 THREE-FACTOR ANALYSIS WITH PROMAX ROTATION: 32 ITEMS FROM THE SCIENCE
ASSESSMENT TEST ....................................................................................................................823
13.17 ADAPTIVE ITEM FACTOR ANALYSIS OF 25 SPELLING ITEMS FROM THE 100-ITEM SPELLING
TEST .........................................................................................................................................827
14 APPENDIX A: A BRIEF HISTORY OF ITEM RESPONSE THEORY .......................830
15 REFERENCES .....................................................................................................................848
1 Data preparation [1]
The only type of data that the IRT programs currently can handle is fixed format with one or
more lines per record (case) and one-character response codes. Fixed format means that the vari-
ables occupy the same column positions throughout the data file. The only acceptable values in
such a data file are the upper- and lowercase characters a through z, the digits 0 through 9, and
any of the special characters like +-.*&. Tab characters (^t) and other control characters that are
usually embedded in files from word processing (e.g., doc), database (e.g., dbf), spreadsheet
(e.g., xls), and statistical applications (e.g., sav) are not acceptable and data files with such extra-
neous characters will produce unexpected program behavior that may be difficult to trace. Sec-
tion 1.5 illustrates the conversion of an Excel [2] file to a fixed format file.
In its simplest form the data file contains individual response data. Such a flat file usually has
one line per record, starting with a subject ID (identification field) and followed by a number of
one-character response codes for the items in the test. Spaces in between fields and/or items are
permitted, as long as those blanks maintain the column positions of the item responses through-
out the file.
Example:

John     abbac aaacc
Mary-Ann bcabb bbcaa

Mary-Ann selected response category a for items 3, 9, and 10, while John answered b, c, and c,
respectively.
The item response codes may represent right/wrong answers, selected response categories,
nominal category codes, ordinal variable values, ratings, etc. The maximum number of different
codes per item is dependent on the program used for analysis. BILOG-MG and TESTFACT ana-
lyze binary (dichotomous) responses only. The data may be multiple-category (1,2,3,4 or
a,b,c,d,e, etc.), but the program reduces that to right/wrong data with the correct response code
key that the user provides. MULTILOG and PARSCALE can handle both binary and multiple-
category items or mixtures of those types.
[1] This section was contributed by Leo Stam.
[2] Excel 2000 was used in the examples.
Besides a subject ID with up to 30 characters and the single-character item response codes, other
fields that may be present in the records are a form indicator, a group indicator, and a case weight.
The specific requirements for these fields can be found in the Command Reference section for
the different programs. For example, the group identifier in BILOG-MG has to be a single digit
(integer), starting with 1, while in TESTFACT it can be any single character (M, F, etc.), and in
PARSCALE it can be a name of up to eight characters.
Including the single-subject data described above, the programs allow the following data types:
individual response data, response-pattern data with the pattern frequencies serving as case
weights, and aggregate-level data (numbers of attempts and numbers of correct responses per item).
The IRT programs are command-driven and are run in batch mode. That is to say that the user
prepares a command file (either directly in an editor or through a dialog-box user interface, if
present) and submits this command file to the program for execution (Run).
While it is true that command-driven programs were the standard before the “point-and-click”
user interfaces (“GUI”) entered the computing scene, maintaining this standard for the current
programs was done deliberately. The dialog-box interfaces that have been added are merely a so-
called front-end for the convenience of the user in building such a command file. Despite the
progress that has been made with the graphical user interfaces, in our experience users who use a
program routinely still prefer the ease of use of the command file. Moreover, such a file stores
the particulars of an analysis in a very succinct way, such that making small changes to an analy-
sis, retrieving an old analysis, or sharing the analysis with other users of the program (also: tech-
nical support) is a straightforward task. It is like giving somebody a map of how to get from A to
B instead of having to describe the route with “take the first street to the right, then a left at the
third traffic light”, etc. Granted, learning and remembering the commands, keywords, and op-
tions used in a program requires a considerable effort (like learning how to read maps), while the
point-and-click interface can lay claim to being intuitive for the user. The dialog-box user inter-
face is especially helpful in that learning process or as a means to refresh the memory of the oc-
casional user of the programs.
Besides the particular analysis specifications, the command file informs the program where the
data file can be found and how it should be read. The location of the data file to be analyzed is
simply a matter of specifying that location with a keyword.
For example:
>FILES … DFNAME=’C:\PARSCALE\DATA\EXAMPL01.DAT’;
or
>GLOBAL … DFNAME=’F:\BILOGMG\EXAMPLES\EXAMPL06.RAW’;
or
>INPUT … FILE=’D:\TESTFACT\DATAFILES\TEST01.DAT’;
or
>PROBLEM … DATA=’G:\MULTILOG\DATA\TEST04.RAW’;
This shows that each program has its own flavor of command file syntax but also that those
specifications are essentially the same and that it is fairly easy to tell a program where it can find
the data input. Note that the name of the data file must be enclosed in single quotes. The drive
and directory path should be included if the data file is not in the same folder as the command
file. It is also good practice to keep all the files, including the command file, for a particular
analysis together in a separate folder. In that case, all that is needed is the filename.
Now that the program knows where to find the data, it needs to be told how to read those data.
What part of a record has the subject ID, in which column is the response code for the first item
to be read, where is the group code, if any, etc. To that end, the user includes a format statement
in the command file.
Format statements are enclosed in parentheses. They are entered on a separate line in the com-
mand file and usually one line is all that is needed. However, if more lines are needed, the user
can indicate that with a keyword (e.g., NFMT=2 tells the program that the format statement occu-
pies two lines).
The format statement for the simple example above is: (8A1,1X,5A1,1X,5A1).
Here is the file again, with a column counter added above for convenience:
12345678901234567890
John     abbac aaacc
Mary-Ann bcabb bbcaa
As can be seen, the total length of each record in the file is 20 columns. The first eight columns
contain the ID field. This is specified in the format statement with “8A1.” That stands for “eight
alphanumeric characters of length one.” The “A” is a format code and stands for alphanumeric.
The 1 indicates the width and the 8 is a repeat count. Other possible format codes are “F” (for
floating point, used to read real numbers) and “I” (for integer).
The next element in the format statement is an example of an operator, in this case “X”. The “X”
is used to tell the program to skip one or more columns. The example specifies “1X” or skip one
column. Next follows a block of five item responses to be read as “5A1”. Then, we instruct the
program to skip another column and to read a second set of five alphanumeric characters: items 6
through 10. Thus, the complete format statement, (8A1,1X,5A1,1X,5A1), describes how to read
each of the twenty columns in a record. Because the format statement describes one data record
and that description is applied to the whole data file, all the records in the data file should look
identical: the essence of a fixed format.
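For readers who want to check their understanding, here is a minimal Python sketch (illustrative
only, not part of the IRT programs) that reads the example file with column slices mirroring the
format statement (8A1,1X,5A1,1X,5A1); the filename is hypothetical:

def read_record(line):
    subject_id = line[0:8].rstrip()   # 8A1 -> columns 1-8
    block1 = list(line[9:14])         # 1X skips column 9; 5A1 -> columns 10-14
    block2 = list(line[15:20])        # 1X skips column 15; 5A1 -> columns 16-20
    return subject_id, block1 + block2

with open("example.dat") as f:
    for line in f:
        print(read_record(line.rstrip("\n")))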
Instead of the “X” operator, the “T” (tab) operator can be used with the same result. The tab op-
erator specifies the column position to tab to. Thus, the format statement (8A1,1X,5A1,1X,5A1)
becomes (8A1,T10,5A1,T16,5A1) when using the tab operator. Tabbing backward is also pos-
sible. That is often used when the examinee records have the examinee ID at the end of each line,
while the program wants it as the first thing being read. Here is our example in that format. The
first line is a column counter added for your convenience. It is not part of the actual data file.
12345678901234567890
abbac aaacc John
bcabb bbcaa Mary-Ann
With the format statement (T13,8A1,T1,5A1,1X,5A1) we instruct the program to read the eight-
character ID starting at column 13, then go back to column 1 and read two blocks of five items,
skipping a blank column in the middle. This example also illustrates that the “X” and “T” opera-
tors can be used within the same format statement. Obviously, the “T” operator can also be used
to read the items in an order that is different from the order in the data file. For example, with
(T13,8A1,T7,5A1,T1,5A1) we read the second block of 5 items before the first block of 5
items.
The final operator that the user of our IRT programs should know about is the “/” (slash) opera-
tor. It instructs the program to go to the next line of the data file. Oftentimes, users have data
where the record for each examinee spans more than one line in the data file. A simple example
is as follows (again, with the column counter added for convenience).
1234567890123456
John     1 abbac
John     2 aaacc
Mary-Ann 1 bcabb
Mary-Ann 2 bbcaa
Here, each block of five items is given on a separate line. This could easily result from two dif-
ferent data files (each with an examinee ID and five items) that were concatenated into one file,
then sorted on examinee ID. To keep the order of the item blocks the same for each examinee, a
block number was added to the original data files.
The format statement (8A1,T12,5A1,/,11X,5A1) will read the examinee ID from the first line
of the record (8A1), tab to column 12 and read the first five items (T12,5A1), then go to the next
line of the record (/), skip the first 11 columns and read columns 12-16 as the responses to the
second set of five items. Note that the examinee ID in the second line of each record is not
needed.
A special use of the forward slash operator is to read every first, second, third, etc. record of a
large data file. For example, (8A1,1X,20A1,/) reads every odd record of a data file, starting
with the first one, while (/,8A1,1X,20A1) reads every even record of a data file, starting with
the second one.
The examples that come with the programs use a variety of format statements and it is a good
idea to look for an example that resembles your data when in doubt about the right format state-
ment. The chapters in this book that describe the examples also offer further details on the use of
the format statement.
When you are analyzing multiple-choice items that are either answered correctly or incorrectly,
the program needs to know the item response code for each item that represents a correct answer.
The user provides that information with a response key.
MULTILOG and TESTFACT require the response key in the command file as a string of item
codes for correct responses, while users of BILOG-MG should specify in the command file
where the response key can be found (unless the data are already coded as 1 for a right and 0 for
a wrong answer). Because it is slightly more complicated, let us look at a BILOG-MG example.
The response key is a record with the exact same format as the data records. It can be in its own
file, or it can be part of the data file. The latter option makes it easier to check that the format is
indeed identical.
The file has the response key as the first record. The word key is used in the ID field for conven-
ience. It is not needed and will not be read by the program. BILOG-MG will apply the response
key to the data records and it will convert John’s responses to 1001001100 and Mary-Ann’s re-
sponses to 0110110001.
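To make the conversion concrete, here is a minimal Python sketch of key scoring (illustrative
only, not the program’s own code). The key below reproduces the conversions just quoted; note
that the text does not fully determine the correct answer for item 9, so the 'b' in that position
is an assumption:

KEY = "acaabbaaba"  # item 9 ('b') is assumed; the other items follow from the text

def apply_key(responses, key=KEY):
    return "".join("1" if r == k else "0" for r, k in zip(responses, key))

print(apply_key("abbacaaacc"))  # John     -> 1001001100
print(apply_key("bcabbbbcaa"))  # Mary-Ann -> 0110110001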
In educational assessment, the reason for an item response in a data file to be coded as missing is
generally limited to two possibilities. The specific item was not presented to the examinee or the
examinee did not respond to the specific item. The former occurs when examinees answer differ-
ent forms (selection of items) of the same test and all the items of the test are included in the data
file. The importance of the differentiation in missing codes lies in the fact that omitted items can
be treated as a wrong response, a fractionally correct response, or the same as a not-presented
item, i.e., excluded from the calculations.
Using the simple example again, the data file with not-presented items could look like:
John took a different form of the test than Mary-Ann. They both responded to the five items in
their form and all ten items of the two forms are included in the data file. Although the example
uses the same not-presented code for all items, note that with BILOG-MG and PARSCALE the
not-presented (or omitted) item code may vary among items.
The four programs approach handling missing codes differently; details can be found in the
chapters describing the programs. BILOG-MG and PARSCALE are similar and accommodate
both omitted and not-presented codes. TESTFACT allows only one value for all items to repre-
sent an omitted item and another value for not-presented items. TESTFACT is the only program
that allows omitted items to be differentiated into skipped items and not-reached items. The latter
are defined as all the omitted items after the last item the examinee responded to. This situation
occurs when tests are administered under a time restriction (speed tests) and such tests are not
considered appropriate ability measurements under the assumptions underlying the power test
models used in the other programs. MULTILOG does not distinguish between omitted and not-
presented items; the user can assign only one missing code per item.
The format of not-presented and omitted keys is as described in Section 1.3. Note that, if more
than one key is used as part of the data file, the keys should follow the order as described in the
Command Reference sections for the respective programs.
The IRT programs from SSI expect plain text (ascii) data files with a fixed format. Because the
programs do not include an import facility to handle various file formats, the user with data in
such a format faces the task of converting the dataset to the plain text, fixed format. Spreadsheet,
database, and statistical applications generally offer the user some form of data export (or Save
As) that includes the plain text format. In this section we will illustrate such a conversion with an
Excel dataset as starting point. We selected Excel, because it has a format that other applications
include in their export formats, and it is a widely used program. This way, users who are unclear
about how to convert a specific data format to plain text format may convert to Excel, then fol-
low one of the two methods described below.
The user is advised always to use copies of the original dataset. With Excel, for example, the
Save As operation uses a format that can only save the active worksheet, so some of your work
may get lost.
The first method, saving the worksheet as formatted text (a .prn file), can only be used with files
of up to 240 columns after conversion. In other words, if your Excel worksheet has more than 240
(minus the maximum ID length, minus possible form and/or group indicators) items, this method
will not work.
In Excel, highlight all the columns with the item response codes and set the column width of the
highlighted columns to 1. This assumes that your response codes are already one-character
codes. If not, you should use the recode capabilities of Excel. For example, if a twelve-category
item is coded as 1 through 12, recode it as 1,2,3,4,5,6,7,8,9,A,B,C or as A through L. The col-
umn with the ID field should be set to the maximum length of the values appearing in that col-
umn. Form or group indicators are best coded as numbers, starting with one.
Now, save the data file as a “*.prn” file. Excel calls that a Formatted text (Space-delimited)
file. If you want your filename to have the extension dat (instead of the automatic prn exten-
sion), use double quotation marks (") around the name of the file you want to save it to. Answer
Yes to the question about losing special formatting features.
The resulting file should look as shown below, where the first 8 columns are the ID field, fol-
lowed by 17 item responses. Note that the leading blanks in the first ID field are automatically
included because the column width in Excel was set to 8 and the ID itself has only 4 characters.
The alignment of the item responses is preserved.
    John0101010101010101
Mary-Ann1010101010101010
....
Another option in Excel is to Save As txt format, which produces a tab-delimited file. This
method has no limitations on the maximum record length. However, the IRT programs cannot
handle tab characters, so the tabs have to be removed. You can do that in MS Word, for example,
by reading in the file as a plain text file, then doing a global replacement of the “^t” character
with either a blank or nothing. Then save the file. This works well if all values in your ID field
have the same number of characters. Otherwise, you can move the ID column to the end of the
worksheet before you do the Save As operation.
A second problem occurs when your worksheet has cells with no entries at all (missing re-
sponse). When exporting (Save As) this as a tab-delimited text file, a global replacement of the
tab character with a blank will throw off the column alignment. In that case, you should replace
all instances of tab-tab with tab-space-tab.
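If you prefer to script this step, here is a minimal Python sketch of the idea (the filenames and
field widths are assumptions based on the earlier example; this is not the NOTAB utility described
below):

def detab(line, widths):
    # Pad each tab-separated field to its fixed width; empty cells become
    # blanks, so the column alignment is preserved.
    fields = line.rstrip("\n").split("\t")
    return "".join(f.ljust(w)[:w] for f, w in zip(fields, widths))

widths = [8] + [1] * 17              # 8-column ID field, 17 one-column items
with open("export.txt") as src, open("fixed.dat", "w") as dst:
    for line in src:
        dst.write(detab(line, widths) + "\n")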
To accommodate the user, SSI has included a NOTAB utility on the program CD that can filter
out unwanted tab characters correctly. This utility as well as a worked example can be found in
the dataprep folder on the “IRT from SSI” CD.
Going the other way, from a plain text, formatted data file to an Excel file has a number of uses.
Foremost is data editing. The first attempt at analysis may reveal several difficulties in the data:
values that are out of range, negative item-test correlations, group codes that are coded with
characters instead of numbers, etc. Importing the plain text data file into Excel or a similar
application provides the user with powerful tools for data editing and data cleaning.
From within Excel, select Get External Data from the Data menu, then Import Text File. Se-
lect the data file to import. The Text Import Wizard opens with a preview of the data file. Se-
lect Fixed width as the type that best describes the data, then click the Next button. In the Data
Preview box, use the mouse to set break lines separating the data into columns. Once satisfied,
click Next. The last step allows you to skip columns, if needed. Click Finish.
2 BILOG-MG
BILOG-MG is an extension of the BILOG program that is designed for the efficient analysis of
binary items, including multiple-choice or short-answer items scored right, wrong, omitted, or
not-presented. BILOG-MG is capable of large-scale production applications with unlimited
numbers of items or respondents. It can perform item analysis and scoring of any number of sub-
tests or subscales in a single program run. All the program output may be directed to text files for
purposes of selecting items or preparing reports of test scores.
The BILOG-MG program implements an extension of Item Response Theory (IRT) to multiple
groups of respondents. It has many applications in test development and maintenance. Applica-
tions of multiple-group item response theory in educational assessment and other large-scale test-
ing programs include:
Nonequivalent groups equating for maintaining the comparability of scale scores as new
forms of the test are developed.
Vertical equating of test forms across school grades or age groups.
Analysis of Differential Item Functioning (DIF) associated with demographic or other
group differences.
Detecting and correcting for item parameter trends over time (DRIFT).
Calibrating and scoring tests in two-stage testing procedures designed to reduce total test-
ing time.
Estimating latent ability or proficiency distributions of students in schools, communities,
or other aggregations.
In addition, the BILOG-MG program provides for “variant items” that are inserted in tests for
the purpose of estimating item statistics, but that are not included in the scores of the examinees.
The most important change is that BILOG-MG is now a Windows application. Syntax can be
generated or adapted using menus and dialog boxes or, as before, with command files in text
format. The interface has menu options in the order the user would most generally use: model
specification is followed by data specification and technical specifications, etc. Each of the menu
options provides access to a number of dialog boxes in which the user can make specifications.
For an overview of the required and optional commands in BILOG-MG syntax, please see Sec-
tion 2.6.1. For more information on which dialog box to use to specify a specific keyword or op-
tion, please see the location of keywords in the interface discussed in Section 2.3.13.
Filename length: All filenames with path may now extend to 128 characters. The file-
name must be enclosed in single quotes. Note that each line of the command file has a
maximum length of 80 characters. If the filename does not fit on one line of 80 characters,
the remaining characters should be placed on the next line, starting at column 1 (see the
example following this list).
Factor loadings: The item dispersion (reciprocal of the item slope) previously listed
among the parameter estimates has been replaced by the one-factor item factor loading,
given by the expression $\text{Slope}/\sqrt{1 + \text{Slope}^2}$.
Average measurement error and empirical reliability for each subtest: The mean-
square error and root-mean-square error for the sample cases are listed for each test. In
addition, the empirical reliability computed from the IRT scale score variance and the
mean-square error is listed.
Note that for EAP and MAP estimated ability the formula for this reliability differs from
the formula for ML estimated ability (to account for the regression effect in EAP and
MAP estimation). If there are multiple test forms, these test statistics are averages over the
forms. If there are multiple groups, the statistics are listed for both the combined groups
and the separate groups.
Reliabilities in connection with information plots: The reliabilities given by the pro-
gram in connection with the information plots of Phase 3 differ from empirical reliabilities
in that they assume a normal distribution of ability in the population. They depend only on
the parameters of the items and not on the estimated abilities in the sample. The program
now computes and lists these theoretical reliabilities for both combined and separated test
forms and sample groups. (For a discussion of empirical and theoretical reliability see
Bock & Zimowski (1999).)
Information curves and reliabilities for putative test forms: It may be useful in test
development to preview the information and theoretical reliability of test forms that might
be constructed from items drawn from a calibrated item bank. (For a discussion of this
procedure, see Section 2.2.)
GLOBAL command—PRNAME keyword: This keyword instructs the program to read
the provisional values of parameters of selected items in the test forms from the specified
file.
SAVE command—PDISTRIB keyword: This keyword allows the user to save the
points and weights of the posterior latent distribution at the end of Phase 2. These quanti-
ties can be included as prior values following the SCORE command for later EAP estima-
tion of ability from previously estimated item parameters.
TEST command—FIX keyword: This keyword allows the user to keep selected item pa-
rameters fixed at their starting values. Starting values may be entered on the SLOPE,
THRESHLD, and GUESSING keywords on the same command or read from an existing item
parameter file.
CALIB command—NOADJUST option: BILOG-MG routinely rescales the origin and
scale of the latent distribution, even in the one-group case. This option may be used to
suppress this adjustment.
CALIB command—CHI keyword: This keyword determines the number of items re-
quired and the number of intervals used for χ² computations.
CALIB command—FIXED option: If this option is present, the prior distributions of
ability in the population of respondents are kept fixed at the values specified in the IDIST
keyword and/or the QUAD commands. It suppresses the updating of the means and standard
deviations of the prior distribution at each EM cycle in the multiple-group case.
CALIB command—GROUP-PLOTS option: By default, the program item plots show
observed proportions of correct responses in the data combined for all groups. The GROUP-PLOTS
option provides plots for each separate group, along with the combined plot.
CALIB command—RASCH option: If this option is specified, the parameter estimates
will be rescaled according to Rasch model conventions: that is, all the slopes will be re-
scaled so that their geometric mean equals 1.0, and the thresholds will be rescaled so that
their arithmetic mean equals 0.0. If the 1-parameter model has been specified, all slope pa-
rameters will therefore equal 1.0.
PRIORS command—SMU and SSIGMA keywords: Prior values for slope parameter
means and sigma are now entered in arithmetic units rather than natural log units. The
means for both forms are printed in the Phase 2 output, however. The default for SMU is
1.0 (log SMU = 0.0) and for SSIGMA the default is 1.64872127 (log SSIGMA = 0.5).
SCORE command—MOMENTS option: Inserting the MOMENTS option in the SCORE
command causes the program to compute and list the coefficients of skewness and kurto-
sis of the ability estimates and of the latent distribution.
SCORE command—DOMAIN keyword: BILOG-MG now includes a procedure for
converting the Phase 3 estimates of ability into domain scores if the user supplies a file
containing the item parameters for a sample of previously calibrated items from the do-
main. Weights can be applied to the items to improve the representation of the domain
specifications.
SCORE command—FILE keyword: This keyword is used to supply the external file
used to calculate the domain scores (see above).
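As an example of the filename-continuation rule mentioned in the first item of this list, a
command line longer than 80 characters is continued by placing the remaining characters on the
next line, starting in column 1. The path below is hypothetical:

>GLOBAL DFNAME='C:\PROJECTS\STATEWIDE\ASSESSMENT\2003\GRADE08\MATHEMATICS\FORMA\
DATA\EXAMPL06.RAW';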
Phase 1: INPUT
The input routine reads formatted data records. Data for each observation consist of subject iden-
tification, optional form number, optional group number, optional case weight, and item response
data. Item responses of individual examinees comprise one character for each of n items. The
answer key, not-presented, and omit codes are read in exactly the same format as the observa-
tions. For aggregate-level data, the “responses” consist of number of attempts and number cor-
rect for each item. If data are for the aggregate-level model, vectors of numbers of attempts and
correct responses to the items are read in decimal format.
Omits may be scored “wrong”, treated as fractionally correct, or omitted from calculations.
The INPUT routine accepts a list of numbers and corresponding names for all items to be read
from the data records. The order in which the items appear in the data records is specified in a
form key. If the data are collected with a multiple-form test, the program accepts a key for
each form. Each respondent’s data record is identified by its form number.
Multiple groups
When multiple-group IRT analysis is requested, the INPUT routine accepts a list of item num-
bers or names identifying the items administered to each group. Each respondent’s data record is
identified by its group number. The Phase 1 program computes classical item statistics separately
for each group.
Subtests
The INPUT routine also accepts lists of item numbers or names, not necessarily mutually exclu-
sive, describing the subtests. It scores each subtest and creates a file containing the item scores,
item attempts, subtest scores, and other input information for each respondent. Each subtest is
calibrated separately in Phase 2. Each respondent is scored on all subtests in Phase 3.
Case weights
If there are case weights for respondents (because they were drawn in an allocation sample), the
item responses and item attempts are multiplied by the weight. If the data consist of response
patterns, the case weights are the frequencies of the patterns.
Samples
If there are a large number of respondents or aggregate-level records, the INPUT routine can be
instructed to select a random sample of a specified size to be passed to CALIBRATE (Phase 2).
The complete master file of cases will nevertheless be passed to Phase 3 for scoring.
While preparing the item-score file, the INPUT routine also accumulates, subtest by subtest, cer-
tain item and test statistics (accumulated from the sample file when the number of cases exceeds
the user-specified sampling level).
These quantities are listed and passed to the Phase 2 and Phase 3 routines to provide starting val-
ues for item parameter and respondent scale-score estimation.
Phase 2: CALIBRATE
The CALIBRATE routine fits a logistic item-response function to each item of each subscale.
There are many options available to the user in this section of the program.
Item-response model
The response model may be the 1-, 2- or 3-parameter logistic response function. The scaling fac-
tor D = 1.7, employed to scale estimates in the normal metric, may be included or omitted at the
user’s option. Information that assists the user in model selection is provided in the marginal log
likelihood and goodness of fit indices and statistics for individual items. The user may request
plots of the observed and expected item-response curves.
Item parameters may be estimated from either binary (right/wrong/omit) data or aggregate-level
frequency data (number of correct responses, number of attempts) input from Phase 1. If aggre-
gate-level data are used, it is assumed that each respondent in each group responds to only one
item per subscale, as required in matrix-sampling applications (see Mislevy, 1983). The aggre-
gate-level option can also be applied to individual data if weights are used and the binary re-
sponses take on fractional values. In this use of the aggregate-level option, each respondent re-
sponds to more than one item.
The MML solution employs two methods of solving the marginal likelihood equations: the so-
called EM method and Newton-Gauss (Fisher scoring) iterations. The default number of cycles
for the EM algorithm is 10; the default for Newton steps is 2. Convergence in the EM steps is
hastened by the accelerator described in Ramsay (1975). Results of each cycle are displayed so
that the extent of convergence can be judged. The information matrix for all item parameters is
approximated during each Newton step and then used at convergence to provide large-sample
standard errors of estimation for the item parameter estimates.
Phase 2 provides the item parameters in the form of the lower asymptote, the item intercept
(equal to minus the product of the slope and threshold), the so-called “slope” or “discrimination”
parameter, the item threshold (location), and the loading (the one-factor item factor loading,
$\text{Slope}/\sqrt{1 + \text{Slope}^2}$).
In the one-parameter solution, all slopes are equal. In both the one- and two-parameter solutions,
all lower asymptotes are zero. In the three-parameter solution with a common lower asymptote,
all lower asymptote parameters are equal. Otherwise, they are estimated separately for each item.
When an analysis of differential item functioning (DIF) is requested, the program provides esti-
mates of the unadjusted and adjusted threshold parameters for each group along with their stan-
dard errors. Estimates of group differences in the adjusted threshold parameters are also pro-
vided. When an item parameter drift (DRIFT) analysis is selected, the program provides esti-
mates of the coefficients of the linear or polynomial function.
In Phase 2, when there is a single group, the unit and origin of the scale on which the parameters
are expressed are based on the assumption that the latent ability distribution has zero mean and
unit variance. This is referred to as the “0, 1” metric. When there are multiple groups, the pro-
gram provides the option of setting the mean and standard deviation of the combined estimated
distributions of the group to zero and one.
The parameter estimates in Phase 3 can be rescaled according to scale conventions selected by
the user. If the one-parameter model has been selected, the item slope estimates are uniformly
1.0. In other cases, the scores can be scaled to a specified mean and standard deviation in the
sample. In both Phase 2 and Phase 3, the item parameter estimates can be saved before and after
rescaling, respectively, in formatted external files.
When some items are extremely easy or extremely difficult, there may be insufficient informa-
tion in the sample to estimate their parameters accurately. This will be especially true if the
number of respondents is only moderate (250 or fewer). As an alternative to deleting these items,
prior distributions can be placed on the item parameters. The user may specify normal priors for
item thresholds, log-normal priors for slopes, and beta priors for lower asymptotes. Each item
may have a different specification for its prior.
Default specifications are for prior distributions on slopes under the two-parameter models, and
on slopes and lower asymptotes under the three-parameter model. By specifying tight priors on
selected item parameters, the user may hold these values essentially fixed while estimating other
item parameters. This feature is useful in linking studies, where new test items are to be cali-
brated into an existing scale without changing parameter values for old items.
Approximate χ² indices of fit are computed for each item following the final estimation cycle.
For the purpose of computing these χ² statistics, the scale score continuum is divided into a
number of
successive intervals convenient for displaying the response proportions (maximum of 20). Each
respondent is assigned to the interval that includes the EAP estimate (based on the type of prior
specified by the user) of his or her score. For the item in question, the expected response prob-
abilities corresponding to the average EAP estimate of ability of cases that fall in the interval are
used as the expected proportion for the interval.
A likelihood ratio χ² is then computed after combining extreme intervals so that the expected
frequency exceeds five. Degrees of freedom are equal to the number of combined intervals.
There is no reduction in degrees of freedom due to estimating the item parameters because the
marginal maximum likelihood method does not place linear constraints on the residuals.
At the user’s request, observed and expected item-response curves are plotted for each item.
When the expected frequencies of the individual response patterns are too small to justify the
likelihood ratio test of goodness-of-fit, the change in likelihood ratio χ² between the 1- and 2-
parameter models, or between the 2- and 3-parameter models, is a valid large-sample test of the
hypothesis that the added parameters are null. The degrees of freedom of each of these change-χ²
statistics are equal to the number of items.
If the sample size is large and the number of items is small, the overall fit of the response func-
tions of all items can be tested by comparing the observed frequencies of the patterns with the
expected marginal frequencies computed from the fitted functions. The data must be in the form
of response patterns and frequencies. The likelihood ratio χ² statistic for the test of fit is

$$G^2 = 2 \sum_{i=1}^{2^n} r_i \log_e \frac{r_i}{N \hat{P}_i}$$

where $2^n$ is the number of possible patterns of the n binary item scores, $r_i$ is the observed
frequency of pattern i, N is the number of respondents, and $\hat{P}_i$ is the estimated marginal
probability of pattern i.
The number of degrees of freedom is $2^n - kn - 1$, where k is the number of parameters in the
response model. For example, a five-item test under the two-parameter model (k = 2) gives
$2^5 - 2(5) - 1 = 21$ degrees of freedom.
This test should be used only when the number of respondents is large relative to the number of
patterns. If a few patterns have zero observed frequency, ½ should be substituted as the fre-
quency for those patterns and corresponding ½s subtracted from the frequency of the most fre-
quent pattern (or 1 could be used for this purpose).
Phase 3: SCORE
The SCORE routine makes use of the master response file from Phase 1 and the item parameter
estimate files from Phase 2 to compute estimated scale scores for respondents. The user may se-
lect one of the three methods described below for estimating scale scores.
In each of these methods the user has the option of biweight robustification to protect the esti-
mates from spurious responses due to guessing or inattention. Because effects of guessing are
suppressed by the robustification, the lower asymptote is not incorporated in the response model
in Phase 3 when the biweight option is selected. Scores and standard errors for all subscales are
calculated simultaneously for each respondent. Results may be printed and/or saved on an exter-
nal file.
Estimates for respondents with all correct or all incorrect responses are attributed by the half-
item rule. That is, respondents who score all incorrect are assigned one-half a correct response to
the easiest item; respondents who score all correct are assigned one-half a correct response to the
hardest item. The estimate is then computed from this modified response pattern.
Standard errors are computed as the square root of the negative reciprocal of the expected second
derivative of the log likelihood at the estimate, i.e., the square root of the reciprocal Fisher in-
formation.
EAP estimates with or without robustification are computed by quadrature using a discrete dis-
tribution on a finite number of points as the prior. The user may select the number of points and
has the choice of a normal, locally uniform, or empirical prior. For the latter, the user may supply
the values of the points and the corresponding empirical weights or may use the empirical
weights generated in Phase 2.
The EAP estimate is the mean of the posterior distribution and the standard error is the standard
deviation of the posterior distribution.
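As a sketch of the quadrature computation (this is the standard formulation of EAP estimation;
the notation is chosen here for exposition and is not the program’s own):

$$\hat{\theta}_{\text{EAP}} = \frac{\sum_{k=1}^{q} X_k\, L(X_k)\, A(X_k)}{\sum_{k=1}^{q} L(X_k)\, A(X_k)}$$

where the $X_k$ are the q quadrature points, the $A(X_k)$ are the prior weights at those points,
and $L(X_k)$ is the likelihood of the examinee’s response pattern evaluated at $X_k$. The
posterior standard deviation reported as the standard error is computed analogously from the
second moment about $\hat{\theta}_{\text{EAP}}$.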
MAP estimates with or without robustification are also computed by the Newton-Gauss method.
This procedure always converges and gives estimates for all possible response patterns. A nor-
mal prior distribution with user-specified mean and variance is assumed [the default is N(0, 1)].
The estimate corresponds to the maximum of the posterior density function (mode); the standard
error is the square root of the negative reciprocal of the curvature of the density function at the
mode.
When EAP estimation is selected, the SCORE routine obtains an estimate of the population dis-
tribution of ability in the form of a discrete distribution on a finite number of points. This distri-
bution is obtained by accumulating the posterior densities over the subjects at each quadrature
point. These sums are then normalized to obtain the estimated probabilities at the points. The
program also computes the mean and standard deviation for the estimated latent distribution.
Sheppard’s correction for coarse grouping is used in calculating the standard deviation.
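Assuming equally spaced quadrature points a distance h apart, Sheppard’s correction amounts to
(the notation here is for exposition):

$$\sigma_{\text{corrected}}^2 = \sigma_{\text{grouped}}^2 - \frac{h^2}{12}$$

that is, the reported standard deviation is the square root of the grouped-data variance less
$h^2/12$.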
Rescaling
The ability estimates are calculated initially in the scale of the item parameter estimates from
Phase 2. In addition, however, rescaled estimates may be obtained by one of the following op-
tions:
the mean and standard deviation of the sample distribution of score estimates are set to
arbitrary values specified by the user (default = 0, 1);
a linear transformation of scale is provided by the user;
if EAP estimation has been selected, the mean and standard deviation of the latent score
distribution may be set to arbitrary values by the user (default = 0, 1).

Any of these options may be applied to all subtests in the same computer run, or different
rescaling parameters may be used for each subtest. Parameter estimates and standard errors for
items from Phase 2 are rescaled for each subtest according to the selected option.
When EAP estimation is selected, the marginal probability of each response pattern in the sam-
ple is calculated and printed along with the corresponding number-right score and scale score.
BILOG-MG provides at the user’s request a number of indices and plots concerning item and
test information:
Plots of test information and standard error curves for each subtest.
Tables of item information indices, including the point and value of maximum informa-
tion.
Classical reliability
The classical definition of reliability is simply the ratio of the true score variance to the observed
score variance, which is the sum of the true score variance and the error variance. In an IRT
context, the true scores are the unobservable theta values that are estimated with a specified
standard error from item response patterns, as for example in Phase 3 of the BILOG-MG pro-
gram.
Classical reliability is implemented in BILOG-MG in two different ways according to how the
true score and error variances are estimated. To distinguish the two results, we refer to one as
“theoretical” reliability and the other as “empirical” reliability. The result for theoretical reliabil-
ity appears in connection with the test information plots in the Phase 3 output; the result for
“empirical” reliability appears following the display of the means, standard deviations, and aver-
age standard error of the scores earlier in the Phase 3 output. The computation of these two quan-
tities is carried out as follows.
Theoretical reliability
The theoretical reliability value applies to IRT scores estimated by the maximum likelihood
method (METHOD=1 of the SCORE command). It is based only on the item parameters passed from
Phase 2 and does not depend in any way on the ability scores computed in Phase 3. Instead, it
assumes that the true ability scores are distributed normally with mean zero and variance one in
the population of examinees. The test information function is integrated numerically with respect
to this assumed distribution to obtain the average information expected when the test is adminis-
tered in the population. The formulas for evaluating test information for any given value of abil-
ity, assuming a one, two, or three parameter logistic item response model, are as follows:
1PL

S.E._{(1)}(\hat{\theta}) = \left\{ 1 \Big/ D^2 a^2 \sum_{j=1}^{n} P_{(1)j}(\hat{\theta}) \left[ 1 - P_{(1)j}(\hat{\theta}) \right] \right\}^{1/2}

2PL

S.E._{(2)}(\hat{\theta}) = \left\{ 1 \Big/ D^2 \sum_{j=1}^{n} a_j^2 \, P_{(2)j}(\hat{\theta}) \left[ 1 - P_{(2)j}(\hat{\theta}) \right] \right\}^{1/2}

3PL

S.E._{(3)}(\hat{\theta}) = \left\{ 1 \Big/ D^2 \sum_{j=1}^{n} a_j^2 \, \frac{1 - P_{(3)j}(\hat{\theta})}{P_{(3)j}(\hat{\theta})} \left[ \frac{P_{(3)j}(\hat{\theta}) - g_j}{1 - g_j} \right]^2 \right\}^{1/2}
Although the formulas are expressed in terms of standard errors, the information values can be
obtained by taking the reciprocal of the squared standard error. Conversely, the reciprocal of the
average information with respect to the ability distribution is the harmonic mean of the error
variance. Since by assumption the variance of the true score (i.e., ability) distribution is equal to
one when expressed in the scale of the Phase 2 item parameter calibration, the theoretical reli-
ability is one divided by the quantity one plus the error variance.
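To make the recipe concrete, the following sketch computes the theoretical reliability of a 2PL test by numerical integration of the test information over an assumed N(0, 1) ability distribution. The function name, the equally spaced quadrature grid, and the D = 1.7 default are illustrative assumptions, not program internals:

import numpy as np

def theoretical_reliability_2pl(a, b, D=1.7, n_points=61):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Quadrature points spanning the ability range, with N(0,1) weights
    theta, step = np.linspace(-4.0, 4.0, n_points, retstep=True)
    w = np.exp(-0.5 * theta**2) * step / np.sqrt(2.0 * np.pi)
    w = w / w.sum()  # normalize the discrete weights to sum to 1
    # 2PL response probabilities: rows are items, columns are quadrature points
    P = 1.0 / (1.0 + np.exp(-D * a[:, None] * (theta[None, :] - b[:, None])))
    # Test information I(theta), then its average over the assumed population
    info = D**2 * (a[:, None]**2 * P * (1.0 - P)).sum(axis=0)
    avg_info = (info * w).sum()
    error_var = 1.0 / avg_info      # harmonic-mean error variance
    return 1.0 / (1.0 + error_var)  # reliability = 1 / (1 + error variance)

For example, theoretical_reliability_2pl([1.2, 0.8, 1.5], [-0.5, 0.0, 1.0]) returns the reliability expected when this hypothetical three-item test is administered in an N(0, 1) population.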
In the program the theoretical reliability is computed for each form of the test when there are
multiple forms. Whether the analysis pertains to one group or multiple groups of examinees is
not relevant; because the theoretical reliability is a function only of the item parameters, the
presence of multiple groups has no effect on the results.
This version of BILOG-MG has provisions for computing information curves and reliability for
any set of item parameters supplied in Phase 1 as starting values for item parameter estimation. If
alternative forms are to be constructed from the item set, the user can insert FORM commands
following the SCORE command to indicate the item composition of the forms. See the documenta-
tion of these score-forms commands (the REFERENCE, READF and NFORMS keywords on the SCORE
command discussed in Section 2.6.16) for instructions on how to set up these calculations.
Empirical reliability
The formulas for estimating the error and true score variances for calculating empirical reliability
differ depending on how the ability scores of the examinees in the sample (or in the samples in
the case of a multiple-group analysis) are estimated:
For maximum likelihood scores (METHOD=1 on the SCORE command), the estimated error variance
is the reciprocal of the mean of the test information evaluated at the ability estimates of all examinees in
the sample or samples. The score variance is just the variance of the maximum likelihood
scores in the sample or samples. The true score variance can therefore be estimated simply
by subtracting the error variance from the score variance. The empirical reliability in each
sample is then given by that value for the true score variance divided by the score vari-
ance.
For Bayes EAP scores (METHOD=2 on the SCORE command), the estimate of the error vari-
ance is the mean of the variances of the posterior distributions of ability for all examinees in
the sample or samples. Because the ability scores are regressed estimates in Bayes estima-
tion, the true score variance is estimated directly by the variance of the means of the pos-
terior distributions (i.e., the EAP scores) in the sample or samples. The empirical reliability
is therefore the true score variance divided by the sum of the true score variance and the
error variance. The formulas for computing, by numerical integration, the means and vari-
ances of the examinee posterior distributions of ability are as follows.
The Bayes estimate is the mean of the posterior distribution of θ , given the observed response
pattern x i (Bock & Mislevy, 1982). It can be approximated as accurately as required by the
Gaussian quadrature,
\bar{\theta}_i \cong \frac{\sum_{k=1}^{q} X_k \, P(\mathbf{x}_i \mid X_k) \, A(X_k)}{\sum_{k=1}^{q} P(\mathbf{x}_i \mid X_k) \, A(X_k)}.
This function of the response pattern x i has also been called the expected a posteriori (EAP) es-
timator. A measure of its precision is the posterior standard deviation (PSD), approximated by
PSD(\bar{\theta}_i) \cong \left[ \frac{\sum_{k=1}^{q} \left( X_k - \bar{\theta}_i \right)^2 P(\mathbf{x}_i \mid X_k) \, A(X_k)}{\sum_{k=1}^{q} P(\mathbf{x}_i \mid X_k) \, A(X_k)} \right]^{1/2}.
The EAP estimator exists for any answer pattern and has a smaller average error in the popula-
tion than any other estimator, including the ML estimator. It is in general biased toward the
population mean, but the bias is small within ±3σ of the mean when the PSD is small (e.g., less
than 0.2σ , see Bock & Mislevy, 1982).
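The quadrature formulas above translate directly into code. The following sketch scores a single 2PL response pattern of 0/1 entries; the function name and the equally spaced grid standing in for the points X_k and weights A(X_k) are illustrative assumptions:

import numpy as np

def eap_score(x, a, b, D=1.7, n_points=41):
    x = np.asarray(x, dtype=int)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Quadrature points X_k with N(0,1) prior weights A(X_k)
    X, step = np.linspace(-4.0, 4.0, n_points, retstep=True)
    A = np.exp(-0.5 * X**2) * step / np.sqrt(2.0 * np.pi)
    # 2PL probability of a correct response: rows are items, columns are points
    P = 1.0 / (1.0 + np.exp(-D * a[:, None] * (X[None, :] - b[:, None])))
    # Likelihood P(x_i | X_k) of the observed pattern at each point
    L = np.prod(np.where(x[:, None] == 1, P, 1.0 - P), axis=0)
    posterior = L * A
    theta_eap = (X * posterior).sum() / posterior.sum()
    psd = np.sqrt(((X - theta_eap)**2 * posterior).sum() / posterior.sum())
    return theta_eap, psd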
Although the sample mean of the EAP estimates is an unbiased estimator of the mean of the la-
tent population, the sample standard deviation is in general smaller than that of the latent popula-
tion. In most applications, this effect is not apparent because the sample standard deviation is
adjusted arbitrarily when the scale scores are standardized. Thus, the bias is not a serious prob-
lem unless respondents are compared using alternative test forms that have much different
PSDs. The same problem occurs, of course, when number-right scores from alternative forms
with differing reliabilities are used to compare respondents. Users of tests should avoid making
comparisons between respondents who have taken alternative forms that differ appreciably in
their reliability or precision. A further implication is that, if EAP estimates are used in computer-
ized adaptive testing, the trials should not terminate after a fixed number of items, but should
continue until a prespecified PSD is reached.
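Combining the EAP scores and their PSDs gives the empirical reliability described above. A minimal sketch, assuming eap and psd are arrays over the examinees in a sample:

import numpy as np

def empirical_reliability_eap(eap, psd):
    true_var = np.var(np.asarray(eap))         # variance of the EAP scores
    error_var = np.mean(np.asarray(psd) ** 2)  # mean posterior variance
    return true_var / (true_var + error_var)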
For Bayes MAP scores, the estimated error variance is the mean of the reciprocal of the test in-
formation at the modes of the posterior distributions of all examinees in the sample or samples.
Similarly, the true score variance is estimated by the variance of the modes of the posterior dis-
tributions. As in the case of Bayes EAP scores, the empirical reliability for the MAP
scores is equal to the true score variance divided by the sum of the true score variance and the
error variance. The formulas for computing the posterior mode and test information at the mode
are as follows.
Similar to the Bayes estimator but with a somewhat larger average error is the Bayes modal, or
so-called maximum a posteriori (MAP) estimator. It is the value of θ that maximizes
\log_e P(\theta \mid \mathbf{x}_i) = \sum_{j=1}^{n} \left\{ x_{ij} \log_e P_j(\theta) + (1 - x_{ij}) \log_e \left[ 1 - P_j(\theta) \right] \right\} + \log_e g(\theta).
Analogous to the maximum likelihood estimate, the MAP estimate is calculated by Fisher scor-
ing, employing the posterior information,
J(\theta) = I(\theta) - \partial^2 \log_e g(\theta) / \partial\theta^2,
where the right-most term is the second derivative of the population log density of θ .
In the case of the 2PL model and a normal distribution of θ with variance σ², the posterior in-
formation is

J(\theta) = \sum_{j=1}^{n} a_j^2 \, P_j(\theta) \left[ 1 - P_j(\theta) \right] + \frac{1}{\sigma^2},

and the precision of the MAP estimate is measured by

PSD(\hat{\theta}) = 1 \Big/ \sqrt{J(\hat{\theta})}.
Like the EAP estimator, the MAP estimator exists for all response patterns but is generally bi-
ased toward the population mean.
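As an illustration of the Fisher scoring step, the following sketch computes a MAP estimate and its PSD for one 2PL response pattern under a N(0, σ²) prior; as in the posterior information formula above, the scaling constant D is omitted. The function name and the iteration count are illustrative assumptions:

import numpy as np

def map_score(x, a, b, sigma2=1.0, n_iter=25):
    x = np.asarray(x, dtype=int)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    theta = 0.0
    for _ in range(n_iter):
        P = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        # Gradient of the log posterior and the posterior information J(theta)
        gradient = np.sum(a * (x - P)) - theta / sigma2
        J = np.sum(a**2 * P * (1.0 - P)) + 1.0 / sigma2
        theta += gradient / J        # Fisher scoring update
    return theta, 1.0 / np.sqrt(J)   # MAP estimate and its PSD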
Because empirical reliabilities are estimated from the results of test score estimation, they are
reported separately for each group of examinees in a multiple-group analysis. Note, however,
that the test forms are not distinguished in these computations. If there are multiple forms of the
test, the empirical reliabilities are aggregations over the test forms.
It may be useful in test development to preview the information and theoretical reliability of test
forms that might be constructed from items drawn from a calibrated item bank. This can now be
done using the FIX keyword on the TEST commands. Starting values for the item parameters are
supplied to the program (see definition of the FIX keyword in Section 2.6.3 for details) or the
parameters may be read from an IFNAME file. Then all of the items are designated as fixed using
the FIX keyword. If the INFO keyword appears in the SCORE command, the required information
and reliability analysis will be performed in Phase 3.
In order for this procedure to work, however, the program must have data to process in Phases 1
and 2 for at least a few cases. Some artificial response data can be used for this purpose. The
only calculations that will be performed in Phase 2 are preparations for the information analy-
sis in Phase 3. The number of EM cycles in the CALIB command can therefore be set to 1 and the
number of Newton cycles to 0. The NOADJUST option must also be invoked.
Output files
Phase 1 results appear in the *.ph1 file. They include test and item identification and classical
item statistics.
Phase 2 results appear in the *.ph2 file. They include assumed prior distributions, estimated
item parameters, standard errors and goodness-of-fit statistics, DRIFT parameters, estimates of
differential item functioning, posterior distributions for the groups, group means, and standard
deviations, and estimates of their standard errors.
Phase 3 results appear in the *.ph3 file. They include assumed prior distributions of the scale
scores for MAP and EAP estimation, correlations among the subtest scores, rescaling constants,
rescaled item parameters, scale scores for the subjects, test information plots, and parameters of
the rescaled latent distribution.
When the BILOG-MG program is opened for the first time, a blank window is displayed with only
three active options: File, View and Help. By default, however, BILOG-MG will open with the
last active syntax file displayed. In this case, or when a command file is opened, the main menu
bar shown below is displayed.
There are 12 menu titles available on the main menu bar. The main purpose of each is summa-
rized in Table 2.1.
Table 2.1: Main menu options

File         Creating or opening files, printing files and exiting the program
Edit         Selecting, cutting, copying and pasting file contents; finding and replacing text
Setup        Providing general information on the analysis, allocating items, and specifying test scoring
Data         Entering data or providing information on the data file and the item keys
Technical    Specifying starting values and priors for calibration and/or scoring
Save         Saving various types of output to external files
Run          Generating syntax and running one or all phases of the program; accessing the graphics procedure
Output       Viewing the output files for the completed phases of the analysis
View         Adding or removing the status bar and toolbar
Options      Accessing the Settings dialog box
Window       Arranging multiple windows and switching between open files
Help         Access to the online help, build number and contact information for SSI
The File menu provides the user with options to open an existing syntax or text file, to create a
new file, to save or to print files.
When the New or Open options are selected from the File menu, the user is prompted for the
name of a command file. This can be either a new file, in which case a new name is entered in
the File name field, or an existing file, in which case one can browse and select the previously
created command file to be used as the basis for the current analysis.
The Close option is used to close any file currently open in the main BILOG-MG window, while
the Save option is used to save any changes made to the file since it was opened. With the Save
As option a file may be saved under the same or a different name in a folder of the user’s choos-
ing.
The Print and Print Setup options represent the usual Windows printing options, while selec-
tion of the Print Preview option opens a new window, in which the current file is displayed in
print preview mode. Options to move between pages and to zoom in and out are provided. The
printing options are followed by the names of the last files opened, providing easy access to re-
cently used files. The Exit option is used to exit the program and return to Windows.
The Edit menu has the standard Windows options to select, cut, copy and paste contents of files.
In addition, the user can search for text strings and/or replace them with new text using the Find
and Replace options.
The Setup menu is used to provide general information to be used in the analysis. The three op-
tions on the Setup menu are:
General: used for entering general information on the type of analysis required.
Item Analysis: used to specify the allocation of items to forms, subtests, and/or groups
and to control the item parameter estimation procedure.
Test Scoring: used to request the scoring of individual examinees or of response patterns,
item and test information and rescaling of scores.
The menu options are used to activate dialog boxes. The function of each dialog box is described
below.
The General dialog box has four tabs on which the job description, model, type of response and
test, group and item labels may be specified. The Job Description tab is shown below.
The top half of the Job Description tab on the General dialog box is used to provide a title and
additional comments for the analysis. Below these fields, the number of items, subtests, groups
and/or forms (if any), and the reference group in the case of a multiple-group analysis are en-
tered. On the images shown here, links between the fields and the corresponding keywords are
provided.
The second tab, Model, is used to select a 1-, 2-, or 3PL model and to specify the response func-
tion metric to be used. If variant items are to be included in the analysis, or a DIF or DRIFT mul-
tiple-group analysis is required, this is indicated in the Special Models group box.
Note that the selection of some models is dependent on the presence of other keywords in the
syntax. For example, in order to request Variant Item Analysis the NVTEST keyword on the
GLOBAL command should have a value larger than the default of 0, or the NVARIANT keyword on
the LENGTH command should have a non-zero entry.
Related topics
GLOBAL command: LOGISTIC option, NPARM and NVTEST keywords (see Section 2.6.7)
INPUT command: DIF and DRIFT options (see Section 2.6.9)
LENGTH command: NVARIANT keyword (see Section 2.6.11)
The Response tab allows specification of the number of response alternatives, and codes for the
responses, not-presented and/or omitted items. In the case of a 3-PL model, the user may also
request that omitted responses are scored fractionally correct. If the NPARM keyword on the
GLOBAL command is not set to 3 to indicate a 3-PL model (see the previous tab), any instructions
in the Omits will be scored group box will not be used.
Finally, the Labels tab provides the default item labels and group/test names. The user may enter
names in the respective fields, or import item labels from an external file by using the Browse
button next to the Item Label File Name field. After entering or selecting the file containing the
item labels, click the Import button. Alternatively, after completion of the Item Labels and Test
or Group fields, the user may save the labels to file using the Save button.
The Item Analysis dialog box has 5 tabs and is used to assign items to subtests, forms, and/or
groups. In addition, subtests to be calibrated are selected here. Calibration specifications control-
ling the iterative procedure are also entered on this dialog box.
On the Subtests tab shown below, labels for the subtests are entered in the first fields. The next
two fields are used to indicate the number of items per test. Note that variant items should also
be indicated here. The final column is used to select the subtests for which item parameter esti-
mation is required.
On the images below, links between the fields and the corresponding keywords are provided.
The Subtest Items tab allows the user to assign specific items to the main and variant tests. Note
that, if fewer items are selected here than were indicated on the Subtests tab, the information on
the Subtest Items tab will be adjusted accordingly (see table above for specific information).
The Select and Unselect buttons may be used to include or exclude single items or sets of
items (selected by holding down the mouse button and dragging over a selection of items).
Double-clicking a single item also reverses the state of the item.
To reverse the state of a block of items, highlight the items and click the Toggle button.
A variant item can only be selected when its corresponding subtest item is selected.
Note that the table only supports rectangular blocks of items. There are two ways to high-
light a rectangular block of items:
Click and drag: Left-click on any one corner of the block you want to highlight, hold the mouse
button down and drag the mouse to the opposite corner of the block before releasing the mouse
button. All items bounded by the opposite corners used will be highlighted.
Click-Shift-Click: Left-click on any corner of the block you want to highlight. Press and hold
down the Shift key, move the mouse pointer to the opposite corner of the block and left-click.
All items bounded by the opposite corners used will be highlighted.
The next two tabs, Form Items and Group Items, are only available when a multiple-group or
multiple-form analysis was indicated on the Job Description tab of the General dialog box.
Both dialog boxes have the same form and mode of operation as the Subtest Items tab previ-
ously discussed, the only difference being that information entered here is recorded on the
FORM and GROUP commands respectively. Both are used to indicate the length of and assign-
ment of items to forms/groups.
Related topics
FORM command: INAMES, INUMBERS, and LENGTH keywords (see Section 2.6.6)
GROUP command: INAMES, INUMBERS, and LENGTH keywords (see Section 2.6.8)
The final tab of the Item Analysis dialog box is the Advanced tab that controls the estimation of
item parameters. Most of the information pertains to the CALIB command. The number of itera-
tions and convergence criterion are set at the top of the dialog box, while the number of items
and ability intervals for calculation of χ 2 item fit statistics are specified in the Chi-square Item
Fit Statistics group box. At the bottom of the dialog box, prior item constraints may be re-
quested and the estimation of the means of the prior distributions on the item parameters speci-
fied to be kept at a fixed value or to be estimated along with the parameters.
If a 3PL model is selected, all the prior check boxes in the Prior Item Constraints group box
will be enabled. In the case of a 2PL model, the Prior on Guessing check box is disabled, while
both the Prior on Guessing and Prior on Slope check boxes are disabled when a 1PL model is
fitted to the data.
Related topics
CALIB command: CHI, CRIT, CYCLES, NEWTON, NQPT, FLOAT, EMPIRICAL, GPRIOR,
SPRIOR, and TPRIOR keywords (see Section 2.6.3)
QUAD command: POINTS and WEIGHTS keywords (see Section 2.6.13)
Information entered on the Test Scoring dialog box controls the type of scoring performed in
Phase 3 of the analysis. The General tab of this dialog box is used to select the method of scor-
ing and to import item parameters for scoring from previously saved files. In the latter case, the
Browse button at the bottom of the tab can be used to locate the file containing the item parame-
ters to be used for scoring.
Group-level fit statistics, the suppression of printing of scores to the output file when scores are
saved to an external file using the SCORE keyword on the SAVE command, and biweighted esti-
mates robust to isolated deviant responses are requested using the Group Level Fit Statistics,
List Scores, and Biweight Items radio buttons. On the images below, links between the fields
and the corresponding keywords are provided.
The Rescaling tab is associated with the RSCTYPE, LOCATION and SCALE keywords on the SCORE
command and is used to request the scaling of the ability scores according to user-specified val-
ues. Provision is made for different scaling options for different subtests.
Related topics
SCORE command: LOCATION, RSCTYPE, and SCALE keywords (see Section 2.6.16)
The Data menu is used to enter data or to provide information on the data file; type and number
of records in the data file; and answer, omit and not-presented keys if applicable (Item Keys op-
tion). A distinction is made between single-subject and group-level data (Examinee Data and
Group-level Data tabs respectively).
The Examinee Data dialog box deals with single-subject data. On the General tab of this dialog
box, the type and number of data records to be used in the analysis are specified. All of the en-
tries on this dialog box correspond to keyword values on the INPUT command, as indicated on
the image below. Note that when the check box labeled External Ability Criterion and Stan-
dard Error is checked, the External Ability and Ability S.E. data fields on the Data File tab
are enabled.
Related topics
INPUT command: EXTERNAL, NIDCHAR, SAMPLE, TAKE, and TYPE keywords (see Section
2.6.9)
INPUT command: PERSONAL option
The name of the raw data file and the format of the data are specified on the Data File tab. An
external data file may be selected using the Browse button at the top of the tab. Data may be dis-
played in the window below the Data File Name field by clicking the Show Data button.
Data can be read in free- or fixed column format. For fixed-format data, a format string is re-
quired to tell the program where in the file each data element is located. To ensure the accuracy
of the format information, the column locations of the various data elements can be determined
directly using the spreadsheet in which the data are displayed: clicking directly in the display
places a cursor whose exact position is shown by the Line: and Col: indicators.
Data may be entered interactively on the third tab of the Examinee Data dialog box. The Ap-
pend button is used to add a new case to the end of the data file. The Insert button is used to in-
sert a new case at the current cursor location, while the Delete button is used to delete lines of
entered data. For example, if case 10 is highlighted in the table, pressing the Insert button will
insert a new case at case 10, and all cases starting from 10 will move one row down in the table.
In the Read as Fixed-Column Records group box, the user can indicate the number of data re-
cords per case and then fill in the information on the positions of the case ID, the form and group
numbers (if applicable), and the responses. The Set Format button is then clicked to create
automatically a format statement in the Format String data field. Alternatively, the format
statement may be entered directly in the Format String data field. Clicking the Set Fields button
will then automatically fill in the fields in the Read as Fixed-Column Records group box. Note
that with either method the response string must be continuous, that is, there can be no spaces in
the response string. Any attempt to specify non-continuous response data will result in incor-
rect format and/or response information, and the data will not be read correctly. For example, if
the format string specifies the responses as “10A,1X,10A,1X,15A” (a non-continuous response
field), clicking the Set Fields button does not correctly set the fields.
The sole purpose of the Item Keys option on the Data menu is to provide the option to use an-
swer, not-presented or omit keys. The three tabs on the Item Keys dialog box are similar. The
possible key codes are taken from the Response Codes edit box on the Response tab of the
General dialog box on the Setup menu.
On the first tab, Answer Key, an answer key may be read from an external file using the Open
button and browsing for the file containing the answer key, or entered interactively in the win-
dow towards the top of the tab.
In the case of multiple forms, a separate answer key for each form should be provided. The for-
mat of the keys should be the same as that used for the raw response data. If a key is entered in-
teractively, the Save button may be used to save the entered information to an external file. The
file used as answer key is referenced by the KFNAME keyword on the INPUT command.
The second tab is used for the not-presented key (if any) and information entered here is echoed
to the NFNAME keyword on the INPUT command.
The Omit Key tab is used for the omit key, if any. This tab corresponds to the OFNAME entry on
the INPUT command in the completed command file.
The Group-Level Data dialog box is similar in purpose to the Examinee Data dialog box where
single-subject data may be entered. On this dialog box, however, information on the structure of
group-level data to be used in analysis is provided.
The General tab is used to provide information on the number of groups, group ID, and number
of data records and weights, if any, to use in analysis. All entries correspond to keywords on the
INPUT command.
Related topics
INPUT command: EXTERNAL, NIDCHAR, SAMPLE, TAKE, and TYPE keywords (see Section
2.6.9)
The Browse button on the Data File tab allows the user to browse for the file containing the
group-level data. After clicking the Show Data button the contents of the selected file are dis-
played in the window below these buttons. The Format String field should be completed ac-
cording to the contents of the file. In contrast to item responses in the case of single-subject data,
which are read in “A” format, the frequencies in group-level data files are read in “F” format as
shown below.
The first set of options on the Technical menu is used to assign starting values, prior constraints,
and information on prior latent distributions for both calibration and scoring during the analysis.
The last three options on this menu provide the user with the option to exercise even more con-
trol over the sampling and EM and Newton cycles (Data Options); request a Rasch model, plots
per group or to prevent the adjustment of the latent distribution to a mean of 0 and S.D. of 1
(Calibration Options), and finally to calculate domain scores based on a user-supplied file con-
taining information on previously calibrated items (Score Options).
The Assign Item Parameter Starting Values option on the Technical menu may be used to
import starting values for item parameters from a saved item parameter or user-supplied file or,
alternatively, to enter starting values interactively.
The first tab on the Item Parameter Starting Values dialog box is used to select a previously
created file. To use an item parameter file created during a previous BILOG-MG analysis, check
the radio button next to the Import Saved Item Parameter File option. If starting values are
provided through a user-supplied file, check the radio button next to the Import User Supplied
File … option. The Browse button is used to locate the file.
Enter starting values for the item parameters on the Enter Values tab to set values for the corre-
sponding keywords on the TEST command. A subset of slope, threshold or asymptote parameters
may be selected by holding the mouse button down and dragging until the selection is complete.
Clicking the right mouse button will display a pop-up menu that can be used to assign values to
the parameters.
All selected parameters may be set to a specific value or to the default value. In addition, the user
may select one parameter and assign a value to this and all other parameters below it by selecting
the appropriate option from the pop-up menu. Alternatively, the Default Value or Set Value but-
tons may be used to assign values to the selected parameters.
There are two ways to select cells for parameter values: a rectangular block of cells may be
highlighted with the mouse, or entire columns may be selected by clicking the column headers.
The latter method works when selecting a continuous block of columns. To select a disjoint
block of columns, press and hold the Ctrl key down when clicking the header.
Note that when selecting a block of cells, the Shift key is used. When selecting a block of col-
umns through clicking column headers, the Ctrl key is used.
Clicking the column header changes the selection state of the entire column. It toggles the items
in the column from the “selected” state to the “unselected” state and vice versa. The Save as
User Data option may be used to provide a name for the external file to which input is saved
with the file extension *.prm.
Related topics
TEST command: DISPERSN, GUESS, INTERCPT, SLOPE, and THRESHLD keywords (see
Section 2.6.17)
This dialog box is associated with the FIX keyword on the TEST command, which is used to indi-
cate which items of a subtest are free to be estimated, and which are to be held fixed at their
starting values.
As with the Enter Values tab on the Item Parameter Starting Values dialog box discussed
above, cells may be selected in rectangular blocks or by columns. The same conventions for the
use of the Shift and Control keys apply. Additionally, double-clicking on any one cell under the
Fixed column also toggles the cell state: fixed to free or free to fixed.
The Item Parameter Prior Constraints dialog box is associated with the PRIORS command.
The number of tabs on the dialog box depends on the number of subtests – priors may be entered
for each subtest separately.
The user can set values by selecting an item or group of items and clicking the Set Value button.
A subset of cells in the displayed table may be selected by holding the mouse button down and
dragging until the selection is complete. Clicking the right mouse button will display a pop-up
menu, which can be used to assign values to the cells. All selected cells may be set to a specific
value or to the default value.
In addition, the user may select one parameter and assign a value to this and all other parameters
below it by selecting the appropriate option from the pop-up menu. A dialog box appears,
prompting the user to enter the value to be assigned.
Alternatively, the Default Value or Set Value buttons may be used to assign values to the se-
lected parameters. To set the priors of a selection of items to their default value, the Default
Value button may be used. Links between the fields on this dialog box and the corresponding
keywords on the PRIOR command are shown on the image below.
The Assign Calibration Prior Latent Distribution option provides the opportunity to assign
prior latent distributions, by subtest, to be used during item parameter estimation. This dialog
box is associated with the QUAD command(s). This option is only enabled when the IDIST key-
word on the CALIB command is set to 1 or 2. There is no interface option for the IDIST keyword
itself; it must be set manually in the command file.
For assigning prior latent distributions to be used during scoring, see the Assign Scoring Prior La-
tent Distribution dialog box.
The first image below shows the dialog box for a single group analysis. Quadrature points and
weights may be provided separately for each subtest. On the second image, the Calibration
Prior Latent Distribution dialog box for a multiple-group analysis is shown. Note that quadra-
ture points and weights may be entered per group and subtest, as a tab for each subtest is pro-
vided in this case, and that the set of positive fractions entered as Weights should sum to 1.0.
The format of the table on the Calibration Prior Latent Distribution dialog box depends on the
values of the NTEST, NGROUP and IDIST keywords. Examples are shown below.
>GLOBAL NTEST=1, …
>INPUT NGROUP=1, …
>CALIB IDIST=1 or 2, …
>GLOBAL NTEST>1, …
>INPUT NGROUP=1, …
>CALIB IDIST=1, …
>GLOBAL NTEST>1, …
>INPUT NGROUP=1, …
>CALIB IDIST=2, …
>GLOBAL NTEST=1, …
>INPUT NGROUP>1, …
>CALIB IDIST=1 or 2, …
>GLOBAL NTEST>1, …
>INPUT NGROUP>1, …
>CALIB IDIST=1, …
>GLOBAL NTEST>1, …
>INPUT NGROUP>1, …
>CALIB IDIST=2, …
The Assign Scoring Prior Latent Distribution dialog box provides the opportunity to assign
prior latent distributions, by subtest, to be used during scoring. This dialog box is associated with
keywords on the SCORE and QUAD commands.
For assigning prior latent distributions to be used during the item parameter estimation phase, see
the Assign Calibration Prior Latent Distribution dialog box.
On the Normal tab of this dialog box, the type of prior distribution to be used for the scale scores
is the first information required. This tab is used when separate arbitrary discrete priors for each
group, or for each group-and-subtest combination, are to be read from QUAD commands. These
options are only available when the Expected A Posteriori (EAP) method of scale score estimation
is used. When maximum likelihood (ML) or Maximum A Posteriori (MAP) estimation is selected,
these options are disabled and the PMN and PSD keywords may be used to specify real-numbered
values for the means and standard deviations of the normal prior distributions. The default values
of these keywords for each group for each subtest, 0 and 1 respectively, are displayed.
To provide alternative values for the PMN and PSD keywords, click in the fields and enter the new
values.
Information in the table below corresponds to numbers on the image shown overleaf.
Related topics
SCORE command: IDIST, PMN and PSD keywords (see Section 2.6.16)
The User Supplied tab allows the user to change the number of quadrature points to be used by
subtest. Different quadrature points and weights may be supplied for each group per subtest, as
shown in the image below where two subtests were administered to two groups of examinees.
To set the values for the random number generator seed used with the SAMPLE keyword on the
INPUT command, or to change the value of the acceleration constant used during the E-steps in
item calibration, the Data Options dialog box may be used. To use default values, the Set to De-
fault Value buttons may be clicked after which the program defaults will be displayed in the cor-
responding fields.
Note that:
The Item Analysis and/or Scoring from Saved Master File Name section is the same as
the Master Data edit box in the Save Output to File dialog box.
The dialog box does not read any data from the specified master file. The filename is sim-
ply copied to the MASTER keyword on the SAVE command.
The Calibration Options dialog box is associated with keywords on the CALIB command.
To request a Rasch model, the One Parameter Logistic Model option should be checked. Sepa-
rate item plots for each group may be requested using the Separate Plot for Each Group check
box while adjustment of the latent distribution to a mean of 0 and S.D. of 1 may be suppressed
using the first check box. To keep the prior distributions of ability in the population of respon-
dents fixed at the value specified in the IDIST keyword and/or the QUAD commands, the Fixed
Prior Distribution of Ability check box should be checked. This corresponds to the FIXED op-
tion on the CALIB command.
Related topics
CALIB command: FIXED, GROUP-PLOTS, NOADJUST and RASCH options (see Section 2.6.3)
This dialog box allows the user to request the calculation of domain scores based on a user-
supplied file containing the item parameters for a sample of previously calibrated items for a
domain and to request the computation and listing of the coefficients of skewness and kurtosis of
the ability estimates and of the latent distribution.
The Save Output to File dialog box is accessed through the Save menu. Various types of data
may be saved to external files using the SAVE command. On the image below, links are provided
between the fields of this dialog box and the corresponding keywords on the SAVE command.
The Run menu provides the necessary options to generate syntax from the input provided on the
dialog boxes accessed through the Setup, Data, and Technical menus (Build Syntax option) or
to run separate or all phases of the analysis. This menu is also used to access the graphics proce-
dure described in Chapter 6 via the Plot option. Note that this option is only enabled after com-
pletion of the three phases of analysis.
Select the Build Syntax option to generate a syntax or command file based on the contents of the
previous dialog boxes and menus. When the Initialize option is selected, changes made to an
existing command file in the syntax window are transferred to the dialog boxes and menus.
Run only the first phase of the analysis to obtain the classical statistics by selecting the Classical
Statistics Only option. The item parameter estimation may be performed next by selecting the
Calibration Only option, and scoring after that using the Scoring Only option. These options
are provided to allow the user to run and verify information in the output for each phase of the
analysis before continuing to the next step. When running the analysis phase by phase, the option
to run the next phase will only be enabled after successful completion of the previous phase.
Alternatively, the user can request to run all three phases in succession by selecting the Stats,
Calibration and Scoring option. A message indicating the normal or abnormal termination of
each phase will appear in the main window between phases to alert the user to possible problems
in a particular phase of the analysis. This message may be suppressed using the Options menu.
To view the output obtained during any of the three phases of analysis, the options on the Out-
put menu may be used. Options will be enabled or disabled depending on the number of com-
pleted phases of the analysis. When any of these options is selected, the relevant output file will
be displayed. After inspection, the user may close this file to return to the main BILOG-MG
window, where the command file on which the analysis was based will be displayed.
The View menu allows the user to add or remove the status bar displayed at the bottom of the
main BILOG-MG window. The toolbar, allowing the standard Windows editing functions, is
displayed by default.
The Options menu provides access to the Settings dialog box. This dialog box has three tabs:
General, Editor, and Server.
On the General tab as shown below, the size of the application window and document window
can be set. The user may opt to always open the last active document when opening BILOG-MG
(default) or to start with a blank screen instead by unchecking the Open last active document
on start check box.
To change the font in which the contents of the editor window are displayed, or to use tabs, the
Editor tab of the Settings dialog box may be used. Reminders of file changes and automatic re-
loading of externally modified documents may also be requested.
The Server tab of the Settings dialog box may be used to show or hide the windows in which
details of the analysis are displayed during the run. To open multiple command files, which can
then be accessed using the Windows menu, check the box next to the Allow multiple command
file… option on this tab.
The Window menu allows the user to arrange multiple windows or to switch between open files.
To open multiple command files simultaneously that may be accessed through this menu, use the
Server tab on the Settings dialog box accessed through the Options menu.
The Help menu provides access to the BILOG-MG help file (Help Topics option) and to the
About BILOG-MG for Windows dialog box in which the version and build number of the ap-
plication are displayed. This box may also be used to directly e-mail SSI for technical support or
product information or to link to the SSI website.
The following table lists command keywords and the menu and dialog box through which each
may be set. A dash indicates that the keyword has no interface equivalent and can only be
entered directly in the command file.

Keyword                       Menu          Dialog box

GLOBAL command:
CFNAME                        -             -
NWGHT                         -             -
NVTEST                        Setup         General
SAVE                          Save          -

SAVE command:
MASTER                        Save          -
CALIB                         Save          -
PARM                          Save          -
SCORE                         Save          -
COVARIANCE                    Save          -
TSTAT                         Save          -
POST                          Save          -
EXPECTED                      Save          -
ISTAT                         Save          -
DIF                           Save          -
DRIFT                         Save          -
PDISTRIB                      Save          -

LENGTH command:

INPUT command:
DIAGNOSE
INUMBERS                      -             -

TEST command:

FORM command:
MAXPOWER                      -             -
MIDPOINT                      -             -

Variable format statement     Data          Examinee Data / Group-Level Data, Data File tab

CALIB command:
PRINT                         -             -
IDIST                         -             -
PLOT                          -             -
DIAGNOSIS                     -             -
RIDGE                         -             -
NSD                           -             -
COMMON                        -             -
NORMAL                        -             -

QUAD command:
POINTS                        Technical     Assign Calibration Prior Latent Distributions
WEIGHTS                       Technical     Assign Calibration Prior Latent Distributions

PRIORS command:
TMU                           Technical     Assign Item Parameter Prior Constraints
TSIGMA                        Technical     Assign Item Parameter Prior Constraints
SMU                           Technical     Assign Item Parameter Prior Constraints
SSIGMA                        Technical     Assign Item Parameter Prior Constraints
ALPHA                         Technical     Assign Item Parameter Prior Constraints
BETA                          Technical     Assign Item Parameter Prior Constraints

SCORE command:
INFO                          -             -
YCOMMON                       -             -
POP                           -             -
REFERENCE                     -             -
READF                         -             -
NFORMS                        -             -
To illustrate the use of the interface in creating syntax files, the data file exampl01.dat in the
examples subfolder of the BILOG-MG installation folder is used. This problem is based on an
example in Thissen, Steinberg & Wainer (1993). Other examples based on the same data (see
complete description below) can be found in Chapter 10.
In the late 1980s, R. Darrell Bock created a “College Level Spelling Test” comprising a sample
of 100 words drawn from a large source list by simple random sampling. Data collected using
that test are the basis for the empirical example in the paper “IRT Estimation of Domain Scores”
(R.D. Bock, M.F. Zimowski, & D. Thissen, Journal of Educational Measurement, 1997, 34, 197-
211). Parameter estimates for the 2PL IRT model for the 100-item test are tabulated in that pa-
per. Bock created the script for conventional oral presentation of the test, and recorded the origi-
nal reading of the script (by Monica Marie Bock) on reel-to-reel magnetic tape. Subsequent cop-
ies onto cassette tape were used by Jo Ann Mooney in the collection of data from around 1000
University of Kansas undergraduates. We are using the file with 100 words (items) and 1000
records (examinees).
The words for the test were randomly selected from a popular wordbook for secretaries. Students
were asked to write the words as used in a sentence on the tape recording. Responses were
scored 1 if spelled correctly and 0 if spelled incorrectly. Because the items are scored 1,0, ac-
cording to the defaults assumed by the program, an answer key is not required.
The purpose of this section is to give the new user a quick overview of the interface and the ab-
solute minimum input needed to run the program. In Chapter 11, the syntax and keywords of
each example are discussed in detail. A few records from the data file are shown below:
11 0000
21 0001
31 1000
41 1001
…
162 1111
The first three characters in each line represent the examinee identification field. This is followed
by the responses to the four items in the test.
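This fixed-column layout corresponds to the format statement (3A1,1X,4A1) used later in this example: three identification characters, one skipped column, and four one-character responses. A hypothetical sketch of that mapping, purely for illustration:

def parse_record(line):
    case_id = line[:3]                       # three ID characters (3A1)
    responses = [int(c) for c in line[4:8]]  # skip one column (1X), read 4A1
    return case_id, responses

parse_record("162 1111")  # -> ("162", [1, 1, 1, 1])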
As a first example, we wish to set up a simple 2-PL model for this data. To construct a command
file, begin by selecting the New option from the File menu. The Open dialog box is now acti-
vated.
Assign a name, with the *.blm file extension, to the command file. In this case, the command file
first.blm is created in the examples folder as shown below. Click Open when done to return to
the main BILOG-MG window.
Note that a number of options have been added to the main menu bar of the BILOG-MG win-
dow. Of interest for this example are the Setup, Data, Run and Output options. The Setup
menu is used to describe the model to be fitted to the data. As a first step, select the General op-
tion from this menu to access the General dialog box.
The General dialog box has four tabs, on which both required and optional keywords may be
set. On the Job Description tab below, the number of items in the test is indicated as 4. The type
of model is selected on the Model tab. As the default model fitted by BILOG-MG is a 2PL
model, this tab is not used now. Click OK to return to the main window.
The next step in specifying the analysis is to assign the items to be calibrated to the test. To do
this, select the Item Analysis option on the Setup menu to access the Item Analysis dialog box.
Change the default value of 1 under Subtest Length to 4 by clicking in this field and typing in
“4”. By default, all items will be analyzed, as indicated under the Analyze this run header. Click
OK when done.
This completes the model specification. All that remains to be done is to provide information on
the data. To do so, the Data menu is used. In this case, we have examinee data and thus the Ex-
aminee Data option is selected from the Data menu.
On the Examinee Data dialog box, enter the number of characters representing the examinee
identification (in this case 3) in the Number of Case ID Characters field. By default, all data
are used as shown below.
To provide information on the name and format of the data file, click the Data File tab.
Use the Browse button at the top of the tab to locate the data file exampl01.dat. Then complete
the fields in the Read as Fixed-Column Records group box and click the Set Format button to
write the format statement (3A1,1X,4A1) to the Format String field. Click OK to return to the
main BILOG-MG window.
Having completed the specification in terms of model and data, the command file is created by
selecting the Build Syntax option from the Run menu. The syntax created by the program is
now displayed in the main window, as shown below. Note that no options are given on the ITEMS
and SCORE commands in this file, indicating that all program defaults will be used.
Save the completed syntax to file by selecting the Save option on the File menu.
The analysis is now performed by using some of the other options on the Run menu. Although
the analysis can be done phase by phase (using the Classical Statistics Only, Calibration Only,
and Scoring Only options) all three phases can be run in sequence by selecting the Stats, Cali-
bration and Scoring option from this menu.
After successful completion of all three phases of the analysis, a message to this effect is dis-
played on the screen. If a problem was encountered during analysis, this message box will indi-
cate that all phases were not completed successfully. Access the output from the analysis through
the Output menu. Classical statistics are given in the *.ph1 file.
In the first.ph2 file, the results of the item calibrations are given. The item parameter estimates
for the four items in the test are shown below.
Scoring results are given in the first.ph3 file. The complete list of scores is printed to this file by
default. A section of this output, showing summary statistics for the score estimates, is shown
below.
The data analyzed in the previous example actually came from two groups of respondents. The
groups in this example are the two sexes. The same four items are presented to both groups on a
single test form. The group indicator is found in column 3 of the data records.
11 0000
21 0001
31 1000
41 1001
…
162 1111
The third column of the data contains either a 1 or a 2, indicating whether an examinee belonged
to group 1 (male) or group 2 (female).
The previous single-group analysis for this group, contained in the command file first.blm, is
modified to perform a DIF analysis for the two groups. As a first step, the General option on the
Setup menu is used to indicate the presence of multiple groups.
On the Job Description tab of the General dialog box, change the Number of Examinee
Groups from the default value of 1 to 2, as shown below.
In the case of a DIF model, a 1PL model is required. To change the model from the default 2PL
model previously used, click the Model tab and check the 1-Parameter Logistic (1PL) radio
button in the Response Model group box.
To request a DIF model, click the Differential Item Functioning (DIF) radio button in the Spe-
cial Models group box. By default, the first group will be used as reference group as indicated in
the Reference Group field.
Once this is done, all necessary changes to the General dialog box have been made. Click OK to
return to the main BILOG-MG window.
The allocation of items to be calibrated for each of the two groups is specified using the Item
Analysis option of the Setup menu. Once this option is selected, the Item Analysis dialog box is
displayed.
Leaving the Subtests tab as previously completed, click the Group Items tab. By default, all
items will be selected for the first group. This is indicated by the display of the item names in a
bold font in the first column of the table. To select all four items for the second group, click on
ITEM0001 in the second column. While holding the Shift button on the keyboard down, click on
ITEM0004. All four items are now highlighted. Click the Select button at the bottom left of the
dialog box to select all items.
This completes the model specification. Click OK to return to the main window.
The only remaining task is to revise the reading of the data file so that the group identification
field can be recognized and processed by the program. To do so, select the Examinee Data op-
tion from the Data menu.
On the General tab of the Examinee Data dialog box, the number of case identification charac-
ters is now decreased to 2, as shown below. (Recall that previously this field was set to 3: in ef-
fect, a combination of actual examinee ID and group ID was used to identify the cases in the
previous example.)
The format statement is now adjusted accordingly by changing the entries in the Read as Fixed-
Column Records group box.
This completes the syntax specification. Return to the main window by clicking the OK button.
The revised command file is generated by selecting the Build Syntax option from the Run
menu.
After generating the syntax, it is saved to file using the Save As option on the File menu. The
revised syntax is saved in the file second.blm in the examples folder. Click the Save button after
specifying a name for and path to the new command file.
When the syntax displayed in the main BILOG-MG window is compared to the first example,
we note the addition of two GROUP commands and the NGROUP and DIF keywords on the INPUT
command. The revised format statement is also included. The NPARM keyword on the GLOBAL
command (not shown here) indicates that a 1-PL model is requested.
The three phases of the analysis can be run separately using the Classical Statistics Only, Cali-
bration Only, and Scoring Only options on the Run menu. To run the phases sequentially, se-
lect the Stats, Calibration, and Scoring option from this menu.
Output for the analysis is accessed as before using the Output menu from the main menu bar.
In the partial output from the second.ph1 file for this DIF analysis, classical item statistics are
provided by group. Similar statistics are also given for the combined group (not shown below).
The Phase 2 output in the second.ph2 file provides item parameter estimates, and DIF specific
output as shown below.
Although the second.ph3 file is created as shown below, no scoring is performed in the case of a
DIF analysis.
2.5 Syntax
2.5.1 Data structures: ITEMS, TEST, GROUP and FORM commands
In addition to conventional IRT analysis of one test administered to one group of examinees,
BILOG-MG is capable of analyzing data from test development and scoring applications in
which multiple alternative test forms, each consisting of multiple subtests or scales, are adminis-
tered to persons in one or more groups. BILOG-MG relies on a system of four commands,
ITEMS, TEST, FORM, and GROUP, describing the assignment of items to subtests, forms, and
groups. The syntax of these commands is discussed in detail in the syntax section. Here a de-
scription is given of how the commands work together to accommodate a wide range of applica-
tions.
The ITEMS command attaches names and numbers to items of the test instrument. In the TEST,
FORM, and GROUP commands, the user can select items either by name or number. As a conven-
ience to the user, the program can automatically create sequences of eight-character item names
(ITEM0001, ITEM0002, and so on).
The TEST commands describe the subtests (or scales) that will be scored in the test. There is a
separate TEST command for each subtest. A subtest may consist of a combination of items in the
instrument including items that appear on different test forms making up the instrument. The
TEST commands identify the items belonging to each subtest.
In addition, when the LENGTH command indicates variant items are present in a particular subtest
(items that are included in the test to obtain item statistics for a subsequent form of the test but
are not used in computing test scores), the user identifies these items with the corresponding sub-
test by means of an additional TEST command that immediately follows the TEST command of
the subtest.
If the entire instrument is analyzed in a single subtest without variant items, the problem setup
requires a single TEST command that lists all the items in the test instrument. Chapter 10 illus-
trates this type of application.
If multiple subtests of items are selected for analysis, a separate TEST command is required for
each subscale. The example in Section 10.6 illustrates the problem setup for an analysis with
multiple subtests within a single test form. The example discussed in Section 10.8 shows the
setup for analysis with multiple subtests for an instrument and with multiple test forms. Section
10.7 illustrates the special TEST command setup for an instrument with variant items.
Related topics
LENGTH command
Setup menu: General dialog box
Setup menu: Item Analysis dialog box
Technical menu: Assign Fixed Items dialog box
Technical menu: Item Parameter Starting Values dialog box
The FORM command controls the input of the response record. It lists the items in the order in
which they appear in the data records. Most applications of BILOG-MG require at least one FORM
command.
There are two arrangements in which multiple forms data can be supplied to the program. We
refer to them as the expanded format and the compressed format (see also the file structure speci-
fications):
Expanded format
The response record of each examinee spans the entire set of items appearing in the test instru-
ment. Each item of the test instrument has a unique location (column) in the input records. A not-
presented code appears in the locations of the items belonging to forms not presented to a given
examinee. Expanded format is convenient for users who store data in two-dimensional (row by
column) arrays typical of many database systems. This format requires only a single FORM com-
mand, even though the data arise from multiple forms. Note that the order of the items in the in-
put records, and thus the order of their listing on the FORM command, does not have to be the
same as that in the list of names and numbers in the ITEMS command (although ordinarily it
would be). Note also that a code to identify the form administered to a particular examinee is not
read by the program from an expanded format record.
Compressed format
The data record for each examinee contains responses only to the items presented to that person,
and the responses appear in the same column field of each record (the number of columns is
equal to the number of items in the longest test form). Data entry in the compressed format is
easier than in the expanded format and results in smaller data files.
With compressed-format data, the locations of the items in the input records are not unique. An
item in one record may occupy the same column as a different item in another record. A separate
FORM command is therefore required for each test form in the instrument. In addition, each re-
sponse record must contain a number identifying the FORM command that applies to that record.
The number (1, 2, 3, etc.) refers to the order of the FORM command in the command file. The item
list of the corresponding FORM command gives the names or numbers of the items in the order
that they appear in the response field of the data records (see Section 2.6.18 for details). Inter-
nally, the program works by expanding the compressed records and inserting not-presented
codes in locations corresponding to the forms not administered to the examinee.
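As an illustration, a compressed-format setup for two five-item forms might be sketched as
follows (item numbers and column layout hypothetical); the I1 field reads the form indicator:
>FORM1 LENGTH=5,INUMBERS=(1,2,3,4,5);
>FORM2 LENGTH=5,INUMBERS=(6,7,8,9,10);
(5A1,1X,I1,1X,5A1)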
Related topics
GROUP commands are required whenever a multiple-group analysis is specified. The number of
commands is equal to the number of groups in the analysis. GROUP commands serve two pur-
poses. First, they identify groups of respondents for multiple-group analysis. Second, they iden-
tify the set of items administered to each group. Note that whenever a multiple-group analysis is
requested, each response record must contain a number identifying the GROUP command that ap-
plies to that record.
Related topics
The FORM and GROUP commands control the input of the individual response records. How they
work together depends on whether the test instrument consists of one form or multiple forms,
and on whether one group or multiple groups of respondents are analyzed. The sections below
describe how these factors determine the structure of the FORM and GROUP commands.
Single test form
When an instrument consists of a single test form, a single FORM command is assumed, and it ap-
plies to all records. The program reads the entire response records according to the specifications
on the command. If a FORM command is not included in the problem setup, the program reads the
response records according to the order of items in the ITEMS command list. As in all applica-
tions with a single FORM command, the response records do not contain a form indicator.
Single-group analysis
The examples in Sections 10.1 and 10.3 illustrate the simple case of a single-group analysis of a
single test form. The program reads all response records according to the specifications on the
FORM or ITEMS commands. GROUP commands are not required for the analysis.
Multiple-group analysis
In multiple-group analysis of a single test form, the groups may represent naturally occurring
subgroups within a population of respondents, or groups of respondents drawn from different
populations. In either case, the structure of the FORM and GROUP commands is the same. A single
FORM command applies to all response records and a separate GROUP command is required for
each group of respondents in the analysis. Because all respondents receive the same test form,
and thus respond to the same set of items, the lists of items in the GROUP commands are the same
for all groups in the analysis. The lists include all of the items specified in the FORM or ITEMS
commands.
The primary function of GROUP commands in applications of this type is to identify the groups of
respondents for multiple-group analysis. The example in Section 10.2 shows how this command
structure applies to examinations of differential item functioning in subgroups of a population.
Group differences in the latent distributions of ability may also be examined in this way.
Multiple test forms
When an instrument consists of multiple test forms, the structure of the FORM and GROUP com-
mands depends in part on whether the forms are administered to equivalent or nonequivalent
groups of respondents. If the forms of the instrument are randomly assigned to respondents
drawn from a single population, the groups are equivalent, and the data may be analyzed with a
single-group IRT model. GROUP commands are not required in this case, but may be added to ex-
amine subgroup differences in item functioning. When test forms are administered to nonequiva-
lent groups of respondents, the forms must contain common “linking” items, and a multiple-
group analysis is necessary to place the items from the forms on the same scale. GROUP com-
mands are required in this case.
The number of GROUP commands corresponds to the number of groups in the analysis. In multi-
ple-form applications the response records may follow either of the two formats. The sections
below show how the structure of the FORM and GROUP commands depends on these formats.
Single-group analysis
When there are multiple forms in the test instrument but only one group of examinees, multiple
FORM commands are required if the compressed data format is used, but a GROUP command or a
group indicator on the response records is not required. Section 10.4 illustrates an application of
this type.
Multiple-group analysis
In the case of multiple forms and multiple groups, all such applications can be handled by ex-
panded format. Only one FORM command is then required and the data records will not contain a
forms indicator. Similarly, if the assignment of items to groups is performed in expanded format,
including the codes for items presented to a given examinee in a given group, the GROUP com-
mands require only the group names, not the item identifications. Specification of the items as-
signed to each group will, however, shorten the run time. The example in Section 10.5 illustrates
this type of data structure. The expanded style of data entry is mandatory in applications where
the test forms contain more than one subtest and the examinee is assigned to different groups for
different subtests. This can occur in complex two-stage testing designs.
In more typical applications, however, whole forms rather than subtests are assigned to groups.
In this case, the compressed style of data entry is suitable and may be more convenient. The
GROUP commands must then contain, in addition to the group name, a list of all items on all forms
assigned to the corresponding groups. The data records must include both a forms identifier and
a group identifier. The advantage of this method is that response records need not contain codes
for not-presented items. Examples illustrating this type of data input are discussed in Sections
10.4 and 10.8 respectively.
Related topics
A greater-than sign (>) must be entered in column 1 of the first line of a command and
followed without a space by the command name.
All command names, keywords, options, and keyword values may be entered in upper
and/or lower case.
Command names, keywords, options, and keyword values may be entered in full or ab-
breviated to the first three characters.
At least one space must separate the command name from any keywords or options.
Commas must separate all keywords and options.
The equals sign is used to set a keyword equal to a value, which may be integer, real or
character. A real value must contain a decimal point. A character value must be enclosed
in single quotes if it:
o Contains more than eight characters
o Begins with a numeral
o Contains embedded blanks, commas, slashes, or semi-colons
A keyword may be vector valued, i.e., set equal to a list of integer, real or character con-
stants, separated with commas or spaces, and enclosed in left and right parentheses (as
KEYWORD2 above).
If the list is an arithmetic progression of integer or decimal numbers, the short form, first
(increment) last, may be used. Thus, a selection of items 1, 3, 7, 8, 9, 10, 15 may be en-
tered as 1,3,7(1)10,15. Real values may be entered in a similar way.
If the values in the list are equal, the form, value (0) number of values, may be used. Thus
1.0, 1.0, 1.0, 1.0, 1.0 may be entered as 1.0(0)5.
The italic elements in the format description are variables that the user needs to replace.
Command lines may not exceed 80 columns. Continuation on one or more lines is permit-
ted.
Each command terminates with a semi-colon (;). The semi-colon signals the end of the
command and the beginning of a new command.
The table below lists all available BILOG-MG commands in their necessary order. Commands
marked as “Required” must appear in each problem in the order shown. All other commands are
optional. Note that, in the rest of this chapter, the descriptions of the commands follow
alphabetical order. The order given here is the one followed by the example command files in
this chapter:
>COMMENT (optional)
>GLOBAL (required)
>SAVE (optional)
>LENGTH (required)
>INPUT (required)
>ITEMS (required)
>TEST (required; one or two per subtest)
>FORM (optional)
>GROUP (optional)
>DRIFT (optional)
(variable format statement) * (required)
>CALIB (required)
>PRIORS (optional)
>QUAD (optional)
>SCORE (optional)
* The data layout must be described in a variable format statement. This statement is entered
within parentheses.
Note that if there are no variant items in the subtest, there is one TEST command for each sub-
test. If a subtest contains variant test items, there must be exactly two TEST commands for that
subtest. The first identifies the main test items while the second identifies the variant test items.
Related topics
CALIB COMMAND
(Required)
Purpose
To control the item parameter estimation procedure and the specification of prior distribu-
tions on the item parameters.
Format
Examples
This example uses simulated responses to illustrate nonequivalent groups equating of two
forms of a 25-item multiple-choice examination administered to different populations. Sepa-
rate latent distributions are estimated for each population (EMPIRICAL option). The indeter-
minacy in location and scale of the distributions is resolved by setting the mean and standard
deviation of Group 1 to 0 and 1, respectively, with REF=1 on the CALIB command.
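The CALIB command used for this analysis might take the following form (compare the fuller
version of the same command shown under the IDIST keyword below):
>CALIB EMPIRICAL,NQPT=16,CYCLE=25,NEWTON=5,CRIT=0.01,REF=1;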
In the following example of vertical equating of test forms over three grade levels, students
at each of three grade levels were given grade-appropriate versions of an arithmetic examina-
tion. The distributions of ability are assumed to be normal at each grade level (NORMAL op-
tion). The second group serves as the reference group in the calibration of the items. A prior
is placed on the item thresholds by the addition of the TPRIOR option.
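The CALIB command for this analysis, repeated in several of the keyword examples below, is:
>CALIB NQPT=20,NORMAL,CYCLE=30,TPRIOR,NEWTON=2,CRIT=0.01,REFERENCE=2;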
In the following example of a 3-PL model, the PLOT keyword has been set to 0.99 so that all
item response functions will be plotted. The FLOAT option is added to request MML estimation
(under normal distribution assumptions) of the means of the prior distributions on the item pa-
rameters along with the parameters. This option should not be invoked when the data set is
small and the items few. The acceleration constant (ACCEL keyword) is set to 0.5 instead of
the default value of 1.0 for a single group analysis.
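A sketch of such a command (any keywords not quoted above are omitted):
>CALIB PLOT=0.99,FLOAT,ACCEL=0.5;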
The next example, again of a 3-PL model, illustrates the command’s usage in the presence
of aggregate-level, multiple-matrix sampling data. In this case, the data come from eight
forms of a rather difficult, multiple-choice instrument. Since aggregate-level data are always
more informative than individual-level item responses, it is worthwhile to increase the num-
ber of quadrature points (NQPT), to set a stricter convergence criterion (CRIT), and to in-
crease the CYCLES limit. A prior on the thresholds (TPRIOR) and a ridge constant of 0.8
(RIDGE) are required for convergence with the exceptionally difficult second subtest.
Aggregate-level data typically have smaller slopes in the 0,1 metric than do person-level
data. Thus, the mean of the prior for the log slopes is set to -0.5 with the READPRIOR option and
the succeeding PRIOR commands as shown.
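A sketch of such a command (the values shown, and the third RIDGE argument in particular,
are illustrative); the PRIOR commands setting the log-slope means would follow:
>CALIB NQPT=30,CYCLES=50,NEWTON=4,CRIT=0.005,TPRIOR,RIDGE=(2,0.8,0.05),READPRIOR;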
Related topics
ACCEL keyword
(optional)
Purpose
To set the acceleration constant used to speed convergence of the EM cycles.
Format
ACCEL=n
Default
1.0.
Related topics
CHI keyword
(optional)
Purpose
To specify the number of items required and the number of intervals used for χ2 computa-
tions.
Format
CHI=(a,b)
where a is the number of items for computation of the χ2 fit statistics, and b is the number
of intervals into which the score continuum will be divided for purposes of computation of
the χ2 item fit statistics.
Default
CHI=(20,9).
Example
In the CALIB command shown below, the CHI keyword is used to request the calculation of
the χ2 item fit statistics on 18 items and 7 intervals.
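A command of this form would make the request (other keywords omitted):
>CALIB CHI=(18,7);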
Related topics
COMMON option
(optional)
Purpose
To estimate a common value for the lower asymptote of all items in the 3PL model.
Format
COMMON
Default
Separate lower asymptotes are estimated for each item.
Example
When a CALIB command that includes the COMMON option is used for a 3PL model, output as
shown below is obtained. Note that the asymptote parameter is estimated at a common value
of 0.031 for all items.
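A minimal sketch of such a command:
>CALIB COMMON;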
Related topics
CRIT keyword
(optional)
Purpose
To set the convergence criterion for the EM cycles and Newton iterations.
Format
CRIT=n
Default
0.01.
Example
Here, the convergence criterion has been set to the more restrictive value of 0.0050 in order
to deal with a more informative aggregate-level data set.
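For example (other keywords omitted):
>CALIB CRIT=0.005;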
Related topics
CYCLES keyword
(optional)
Purpose
To set the maximum number of EM cycles. If CYCLES=0 and NEWTON=0, item parameter es-
timates will be calculated from the classical item statistics from Phase 1 or from the starting
values of the TEST command. The former will be corrected for guessing if the 3-parameter
model is selected.
Format
CYCLES=n
Default
Examples
In this example of vertical equating of test forms over three grade levels, a maximum of 30
EM cycles and 2 Newton-Gauss iterations are requested.
>CALIB NQPT=20,NORMAL,CYCLE=30,TPRIOR,NEWTON=2,CRIT=0.01,REFERENCE=2;
Here, the CYCLES limit is increased in order to deal with a more informative aggregate-level
data set.
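For example (the value is illustrative):
>CALIB CYCLES=50;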
Related topics
DIAGNOSIS keyword
(optional)
Purpose
To specify a level of diagnostic printout for Phase 2. Larger values of n give increasing
diagnostic output.
Format
DIAGNOSIS=n
Default
0.
Example
When DIAGNOSIS is set to 1, for example, item parameter estimates are printed to the Phase
2 output file at each iteration.
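For example:
>CALIB DIAGNOSIS=1;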
EMPIRICAL option
(optional)
Purpose
To estimate the score distribution in the respondent population in the form of a discrete dis-
tribution on NQPT points. This empirical distribution is used in place of the prior in the MML
estimation of the item parameters.
If NGROUP >1, separate score distributions are estimated for each group.
Format
EMPIRICAL
Default
Example
For this example, which comes from a simulation of non-equivalent groups equating, the
EMPIRICAL option is used to estimate separate latent distributions for each population.
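The relevant part of the command (other keywords omitted):
>CALIB EMPIRICAL,NQPT=16,REF=1;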
Related topics
FIXED option
(optional)
Purpose
To keep the prior distributions of ability in the population of respondents fixed at the values
specified in the IDIST keyword and/or the QUAD commands.
Format
FIXED
Default
Related topics
FLOAT option
(optional)
Purpose
To estimate the means of the prior distributions on the item parameters by marginal maxi-
mum likelihood (under normal distribution assumptions), along with the parameters. To
keep the means of the prior distributions on the item parameters fixed at their specified val-
ues during estimation, the NOFLOAT option should be used.
Standard deviations of the priors are fixed in either case. The FLOAT option should not be in-
voked when the data set is small and the items few. The means of the item parameters may
drift indefinitely during the estimation cycles under these conditions.
Format
FLOAT
Default
NOFLOAT.
Example
In this example of a 3-PL model, the FLOAT option is added to request MML estimation (under
normal distribution assumptions) of the means of the prior distributions on the item parame-
ters along with the parameters. This option should not be invoked when the data set is small
and the items few. The acceleration constant (ACCEL keyword) is set to 0.5 instead of the de-
fault value of 1.0 for a single group analysis.
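As in the corresponding example under the CALIB command above, a sketch of such a command:
>CALIB PLOT=0.99,FLOAT,ACCEL=0.5;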
Related topics
GPRIOR/NOGPRIOR options
(optional)
Purpose
To select or suppress, respectively, prior distributions on the lower asymptote (guessing)
parameters. Priors on the lower asymptotes may be needed to keep the estimates plausible for
easy items, which carry little or no information about guessing. (Priors on the slope
parameters, by contrast, are sometimes required to prevent Heywood cases.)
Format
GPRIOR/NOGPRIOR
Default
GPRIOR for the 3PL model; NOGPRIOR for the 1PL and 2PL models.
Examples
For a 3PL model, priors on slopes and asymptote parameters are assumed. To remove these
priors, the CALIB command
may be used.
To remove the default prior distribution on the asymptote parameters and use a prior distri-
bution on the thresholds instead, use
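Sketches of the two commands described (other keywords omitted):
>CALIB NOSPRIOR,NOGPRIOR;
>CALIB NOGPRIOR,TPRIOR;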
Related topics
GROUP-PLOTS option
(optional)
Purpose
To provide plots showing the proportions of correct responses for each separate group in a
multiple-group analysis. These plots may provide more information than the combined plot
provided by the PLOT keyword.
Format
GROUP-PLOTS
Default
Example
In the CALIB command from a two-group analysis below, the PLOT keyword has been set to
0.99 so that all item response functions will be plotted. In order to obtain plots by group, the
GROUP-PLOTS option has been added.
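A sketch of such a command:
>CALIB PLOT=0.99,GROUP-PLOTS;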
Related topics
IDIST keyword
(optional)
Purpose
To specify the source of the prior distribution(s) of ability used in item calibration.
Format
IDIST=n
n=1 separate, arbitrary discrete priors for each group for each subtest, read from the
QUAD commands
n=2 separate, arbitrary discrete priors for each group, read from the QUAD commands
Default
0.
Example
This example illustrates how user-supplied priors for the latent distributions are specified
with IDIST=1 on the CALIB command. The points and weights for these distributions are
supplied in the QUAD commands. Note that with IDIST=1, there are separate QUAD commands
for each group for each subtest. Within each subtest, the points are the same for each group.
This is a requirement of the program. But as the example shows, the points for the groups
may differ by subtest.
>CALIB IDIST=1,READPR,EMPIRICAL,NQPT=16,CYCLE=25,TPRIOR,NEWTON=5,
CRIT=0.01,REFERENCE=1,NOFLOAT;
>QUAD1 POINTS=(-0.4598E+01 -0.3560E+01 -0.2522E+01 -0.1484E+01
-0.4453E+00 0.5930E+00 0.1631E+01 0.2670E+01 0.3708E+01
0.4746E+01),
WEIGHTS=(0.2464E-05 0.4435E-03 0.1724E-01 0.1682E+00
0.3229E+00 0.3679E+00 0.1059E+00 0.1685E-01 0.6475E-03
0.8673E-05);
>QUAD2 POINTS=(-0.4598E+01 -0.3560E+01 -0.2522E+01 -0.1484E+01
-0.4453E+00 0.5930E+00 0.1631E+01 0.2670E+01 0.3708E+01
0.4746E+01),
WEIGHTS=(0.2996E-04 0.1300E-02 0.1474E-01 0.1127E+00
Related topics
NEWTON keyword
(optional)
Purpose
To set the maximum number of Newton-Gauss iterations following the EM cycles.
If CYCLES=0 and NEWTON=0, item parameter estimates will be calculated from the classical
item statistics from Phase 1 or from the starting values of the TEST command. The former
will be corrected for guessing if the 3-parameter model is selected.
Format
NEWTON=n
Default
2.
Example
In this example, the value of NEWTON is increased to 4 in order to deal with a more informa-
tive aggregate-level data set.
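For example (other keywords omitted):
>CALIB NEWTON=4;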
Related topics
NFULL keyword
(optional)
Purpose
To specify that the Fisher-scoring steps for estimating item parameters use the full informa-
tion matrix (if the number of items n is less than p) or the block-diagonal approximation to
the information matrix (if n is greater than or equal to p).
Format
NFULL=p
Default
p=20.
Example
The NFULL keyword is used on the CALIB command to request the use of the full information
matrix in the Newton steps for this data set, in which only 4 items were presented to subjects.
Note that, because NITEMS=4 is below the default threshold of p=20, the full information
matrix would be used in any case; setting NFULL=4 makes the threshold explicit.
>CALIB TPRIOR,SPRIOR,NFULL=4;
Related topics
NOADJUST option
(optional)
Purpose
In multiple-group applications, each group has its own latent distribution. To resolve the
indeterminacy of origin and scale of measurement in the IRT analysis, the user can choose
to set the mean and standard deviation to 0.0 and 1.0 in a reference group specified by the
REF keyword of the CALIB command; alternatively, the user can choose to assign these values to
the combined distributions weighted by their sample sizes.
BILOG-MG routinely rescales the origin and scale of the latent distribution (i.e., linearly
transforms the quadrature points) exactly to these values even in the case of one group. The
item slopes and thresholds are then linearly transformed to match the adjusted scale.
This results in small differences between the values estimated in BILOG and BILOG-MG
because the posterior latent distribution has mean and standard deviation equal to only ap-
proximately zero and one. To obtain the BILOG values (when all other conditions of estima-
tion are identical), the user may include the option NOADJUST in the CALIB command, as in
the example below.
Format
NOADJUST
Default
Example
In the syntax below, a single subtest is analyzed in a single group analysis. The NOADJUST
option is used on the CALIB command to suppress the adjustment of the rescaling of the la-
tent distribution.
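A minimal sketch of such a command:
>CALIB NOADJUST;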
Related topics
NORMAL option
(optional)
Purpose
To specify the estimation of the means and standard deviations of the prior distributions of
ability in the population of respondents by marginal maximum likelihood (under normal dis-
tribution assumptions) along with the item parameters. If NGROUP>1, separate means and
standard deviations are estimated for each group.
Format
NORMAL
Default
Example
In this example of vertical equating of test forms over three grade levels, the distributions of
ability are assumed to be normal at each grade level (NORMAL on the CALIB command).
>CALIB NQPT=20,NORMAL,CYCLE=30,TPRIOR,NEWTON=2,CRIT=0.01,REFERENCE=2;
Related topics
NQPT keyword
(optional)
Purpose
To specify the number of quadrature points in MML estimation for each group.
Format
NQPT=n
Default
Examples
Here, the value of NQPT is increased to 30 in order to deal with a more informative aggre-
gate-level data set.
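For example (other keywords omitted):
>CALIB NQPT=30;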
Related topics
NSD keyword
(optional)
Purpose
To specify the range of the prior distribution(s) for the population(s) in standard deviation
units.
Format
NSD=n
Default
PLOT keyword
(optional)
Purpose
To specify the significance level for the goodness-of-fit of the item-response functions to be
plotted. All items for which the significance level is below the real-number value (decimal
fraction) provided will be plotted.
Format
PLOT=n
Default
0.0.
Examples
Plots of the item-response functions of all items for which the goodness-of-fit statistic is less
than 0.05 are requested.
In this example of a 3-PL model, the PLOT keyword has been set to 1.0 so that all item re-
sponse functions will be plotted.
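Sketches of the two commands described (other keywords omitted):
>CALIB PLOT=0.05;
>CALIB PLOT=1.0;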
PRINT keyword
(optional)
Purpose
To print provisional item parameter estimates at each iteration during the calibration phase.
If PRINT=1, provisional item parameter estimates are printed; if PRINT=0 printing is sup-
pressed.
Format
PRINT=n
Default
0.
Example
If the following CALIB command is used for a 2-group DIF analysis, only the information
shown below is printed concerning the iterative process:
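A sketch of such a command, with PRINT left at its default of 0 (the other values are
hypothetical):
>CALIB NQPT=20,CYCLE=30,CRIT=0.01;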
[E-M CYCLES]
-2 LOG LIKELIHOOD = 3152.375
CYCLE 1; LARGEST CHANGE= 0.17572
-2 LOG LIKELIHOOD = 3128.806
CYCLE 2; LARGEST CHANGE= 0.15440
-2 LOG LIKELIHOOD = 3117.237
…
If PRINT=1 is added, the output provided in the Phase 2 output file is expanded and parameter
estimates are given
for each group after each cycle. The output obtained for both groups after the third EM cycle
is given below as an example.
RASCH option
(optional)
Purpose
To rescale the parameter estimates according to Rasch-model conventions. That is, all the
slopes will be rescaled so that their geometric mean equals 1.0, and the thresholds will be re-
scaled so that their arithmetic mean equals 0.0. If the 1-parameter model has been specified,
all slope parameters will therefore equal 1.0.
Because the threshold parameters are constrained in other ways in DIF and DRIFT analysis,
the RASCH option cannot be used with these models. The posterior latent distribution dis-
played in Phase 2 is not rescaled in the Rasch convention.
Format
RASCH
Default
No Rasch rescaling.
Example
In the syntax for a single-group analysis shown below, a 1-parameter model is fitted to the
data (NPARM=1 on GLOBAL command). Rasch rescaling is requested on the CALIB command
through inclusion of the RASCH keyword, and all slope parameters will therefore equal 1.0.
>GLOBAL DFNAME='EXAMPL04.DAT',NIDCH=5,NPARM=1;
…
>CALIB CYCLE=10,TPRIOR,NEWTON=2,CRIT=0.01,RASCH;
Related topics
READPRI option
(optional)
Purpose
To specify that the prior distributions for selected parameters will be read from the ensuing
PRIORS command(s). Otherwise, default priors will be used for these parameters.
Format
READPRI
Default
Example
In this example, the mean of the prior for the log slopes has been set to 0.5 by use of the
READPRI option of the CALIB command and the following PRIORS commands.
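A sketch of such a setup; the SMU keyword of the PRIORS command, which would set the prior
means for the log slopes, and the item count are assumptions here:
>CALIB READPRI;
>PRIORS SMU=(0.5(0)45);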
Related topics
REFERENCE keyword
(optional)
Purpose
To resolve the indeterminacy of the location and scale of the latent variable when NGROUP>1.
When the groups originally came from one population as, for example, in two-stage testing,
REFERENCE should be set to 0. When the groups represent separate populations, REFERENCE
should be set to the value of one of the group indicators. It specifies the reference group for
the DIF model and the reference cohort for the DRIFT model.
Format
REFERENCE=n
n=0 The mean and standard deviation of the combined estimated distribu-
tions of the groups weighted by their sample sizes are set to 0 and 1,
respectively.
n>0 The mean and standard deviation of group n are set to 0 and 1, respec-
tively.
Default
1.
Examples
Here, the second group serves as the reference group in the calibration of the items.
>CALIB NQPT=20,NORMAL,CYCLE=30,TPRIOR,NEWTON=2,CRIT=0.01,REFERENCE=2;
Related topics
RIDGE keyword
(optional)
Purpose
To add a ridge constant (if a = 2) to the diagonal elements of the information matrix to be
inverted during the EM cycles and Newton iterations. The ridge constant starts at the value 0
and is increased by b if the ratio of a pivot and the corresponding diagonal elements of the
matrix is less than c.
The old ridge option can be invoked with the RIDGE=1 specification. It is provided so users
may duplicate old results from BILOG. The present default is an improvement of the old
method.
Format
RIDGE=(a, b, c)
Default
Example
This example emanates from an analysis of aggregate-level data that includes some fairly
difficult items. A ridge constant of 0.8 is required for convergence as one of the subtests is
exceptionally difficult.
Aggregate-level data typically have smaller slopes in the 0,1 metric than do person-level
data. For this reason, the mean of the prior for the log slopes has been set to -0.5 by use of
the READPRI option of the CALIB command and the following PRIOR commands.
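A sketch of such a setup (the third RIDGE argument, the SMU keyword of the PRIORS command,
and the item count are assumptions):
>CALIB TPRIOR,READPRI,RIDGE=(2,0.8,0.05);
>PRIORS SMU=(-0.5(0)45);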
Related topics
SELECT keyword
(optional)
Purpose
To select, with a vector of ones and zeros, subtests for which item-parameter calibration is
desired.
Format
SELECT=(n1,n2,...,nNTEST)
where nj = 1 selects subtest j for calibration and nj = 0 excludes it.
Default
All subtests are selected.
Example
In this example with three subtests, only the second subtest is to be calibrated.
>TEST1 INUMBERS=(1(1)10);
>TEST2 INUMBERS=(11(1)30);
>TEST3 INUMBERS=(31(1)45);
(5A1,45A1)
>CALIB NQPT=10, CYCLES=25, NEWTON=5, SELECT=(0,1,0);
Related topics
SPRIOR/NOSPRIOR options
(optional)
Purpose
The presence of these options selects or suppresses, respectively, prior distributions on the
slope parameters.
Priors on the slope parameters are sometimes required to prevent Heywood cases.
Format
SPRIOR/NOSPRIOR
Default
SPRIOR for the 2PL and 3PL models; NOSPRIOR for the 1PL model.
Examples
In the case of a 1PL model, no priors are used by default; a CALIB command without options
and one with NOSPRIOR added are therefore equivalent.
In order to assume a prior distribution on the slopes in the 1PL case, the SPRIOR option may
be added to the CALIB command.
In a 2PL model, a prior is placed on the slopes by default; a CALIB command without options
and one with SPRIOR added are therefore equivalent.
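In sketch form (all other keywords omitted), the 1PL equivalences are
>CALIB;
>CALIB NOSPRIOR;
while the slope prior is assumed with
>CALIB SPRIOR;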
Related topics
TPRIOR/NOTPRIOR options
(optional)
Purpose
To select or suppress, respectively, prior distributions on the threshold parameters. Although
extreme threshold values do not affect the estimation of ability adversely, a diffuse prior
distribution on the thresholds will keep their estimates within a reasonable range during the
estimation cycles.
Format
TPRIOR/NOTPRIOR
Default
NOTPRIOR.
Examples
In this example of vertical equating of test forms, a prior is placed on the item thresholds by
the addition of the TPRIOR option to the CALIB command.
>CALIB NQPT=20,NORMAL,CYCLE=30,TPRIOR,NEWTON=2,CRIT=0.01,REFERENCE=2;
This example emanates from an analysis of aggregate-level data that includes some fairly
difficult items. A prior on the thresholds is required for convergence as one of the subtests is
exceptionally difficult.
In the case of a 1PL model, no priors are used by default, so a CALIB command without options
and one with NOTPRIOR added are equivalent. In order to assume a prior distribution on the
slopes in the 1PL case, the SPRIOR option may be added to the CALIB command.
In a 2PL model, a prior is placed on the slopes by default; adding the TPRIOR option indicates
that an additional prior distribution should be assumed for the threshold parameters.
For a 3PL model, priors on slopes and asymptote parameters are assumed. To remove these
priors, the NOSPRIOR and NOGPRIOR options may be added to the CALIB command.
In a 3PL model, to remove the default prior distribution on the asymptote parameters and
use a prior distribution on the thresholds instead, use NOGPRIOR together with TPRIOR.
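In sketch form, the 2PL command with an added threshold prior, and the 3PL command replacing
the asymptote prior with a threshold prior (other keywords omitted):
>CALIB TPRIOR;
>CALIB NOGPRIOR,TPRIOR;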
Related topics
COMMENT COMMAND
(Optional)
Purpose
To enter one or more lines of explanatory remarks into the program output stream. This line
and all subsequent lines preceding the GLOBAL command will be printed in the initial output
stream. The maximum length of each line is 80 characters. A semicolon to signal the end of
the command is not needed.
Format
>COMMENT
…text…
…text…
Example
EXAMPLE 4
SIMULATED RESPONSES TO TWO 20-ITEM PARALLEL TEST FORMS
>COMMENT
This example illustrates the equating of equivalent groups with the BILOG-
MG program. Two parallel test forms of 20 multiple-choice items were ad-
ministered to two equivalent samples of 200 examinees drawn from the same
population. There are no common items between the forms.
>GLOBAL DFNAME='EXAMPL04.DAT',NIDCH=5,NPARM=2;
Default
No comments.
Related topics
DRIFT COMMAND
(Optional)
Purpose
To provide the maximum degree of the polynomial item parameter drift model and a vector
of time points n1, n2, ..., nn.
Format
>DRIFT MAXPOWER=n, MIDPOINT=(n1,n2,...,nn);
Default
No DRIFT command.
Example
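A minimal sketch (the value is hypothetical):
>DRIFT MAXPOWER=2;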
Related topics
MAXPOWER keyword
(optional)
Purpose
To specify the maximum degree of the drift polynomial included in the model. The maxi-
mum degree must be less than the number of groups.
Format
MAXPOWER=n
Default
NGROUP-1.
Related topics
MIDPOINT keyword
(optional)
Purpose
To supply the vector of time points associated with the groups in the item parameter drift
model.
Format
MIDPOINT=(n1,n2,...,nn)
Default
(1, 2, …, NGROUP-1)
Related topics
FORM COMMAND
Purpose
To supply the order of the item responses in the data records. Each FORM command gives the
number of items in the form and lists the items in the order in which the item responses ap-
pear on the data records for that form. The items may be listed by name or number, but not
by both.
When NFORMS > 1, the FORM command requires a form number in the data record. The form
numbers must range in value from 1 to the number of forms. The form indicator field fol-
lows the case ID field and is INTEGER in the variable format statement. Because the same
format statement is used to read the data records for all forms, the item responses, the case
ID and weight, and the form and group indicators must occupy the same columns on all re-
cords. If the forms are of unequal length, the size of the item-response field on the format
statement should equal the number of items in the longest form.
The order of the several FORM commands corresponds to the number of the respective form.
Format
>FORMj LENGTH=n, INAMES=(list of item names) or INUMBERS=(list of item numbers);
Default
None.
Example
Form 1 consists of items 1, 2, 3, 4, and 6, and form 2 consists of items 1, 6, 7, 8, 9, and 10.
The data records are as follows:
SUBJECT001 1 21321
SUBJECT002 2 513122
…
SUBJECT999 1 21422
Responses to item 1 appear in column 14 of the data records for form 1 and at the end of the
data records for form 2. The FORM commands and format statement are as follows:
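A plausible reconstruction of these commands and of the format statement, based on the column
positions described above:
>FORM1 LENGTH=5,INUMBERS=(1,2,3,4,6);
>FORM2 LENGTH=6,INUMBERS=(6,7,8,9,10,1);
(10A1,T12,I1,T14,6A1)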
Related topics
INAMES keyword
Purpose
To specify the list of item names, as specified in the ITEMS command, in the order in which
the item responses appear on the data records for FORMj.
Format
INAMES=(list of item names)
Default
When NFORM = 1, the sequence of items specified on the ITEMS command. When NFORM > 1,
no sequence is specified.
Example
Suppose the command
>ITEMS INAMES=(I1(1)I10);
appears earlier in the command file to give the name Ix to item x. Then the FORM1 statement
could be replaced with an equivalent statement using item names.
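A sketch of such a statement, consistent with form 1 as described under the FORM command
above:
>FORM1 LENGTH=5,INAMES=(I1(1)I4,I6);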
Note that if the item names are in a sequence, they can be specified using the variable list
format “first (increment) last”, as “I1(1)I4” is used here to specify items 1 through 4.
Related topics
ITEMS command
INPUT command: NFORM keyword
Setup menu: General dialog box
INUMBERS keyword
Purpose
To provide the list of item numbers, as specified in the ITEMS command, in the order in
which the item responses appear on the data records for FORMj.
Format
INUMBERS=(list of item numbers)
Default
When NFORM = 1, the sequence of items specified on the ITEMS command. When NFORM > 1,
none.
Related topics
LENGTH keyword
Purpose
To specify the number of items in FORMj.
Format
LENGTH=n
Default
Related topics
GLOBAL COMMAND
(Required)
Purpose
To supply input filenames and other information used in the three phases of the program.
The GLOBAL keywords DFNAME, MFNAME, CFNAME, and IFNAME enable the user to assign spe-
cific names to the program’s input files. A filename must be no more than 128 characters
long and may include a drive prefix, a path name, and an extension. The filename must be
enclosed in single quotes. Note that each line of the command file has a maximum length of
80 characters. If the filename does not fit on one line of 80 characters, the remaining charac-
ters should be placed on the next line, starting at column 1.
Format
Example
Related topics
CFNAME keyword
Purpose
To supply the name of the previously created calibration file (if any) to be read in. If data are
read from a previously generated calibration file, DFNAME must not appear, and TYPE=0 must
appear in the INPUT command.
The PARM keyword of the SAVE command must be specified to save updated parameter esti-
mates to an external file.
Format
CFNAME=<'filename'>
Example
In a previous run, a calibration file was created as shown below. The calibration file was
saved to exampl03.cal using the CALIB keyword on the SAVE command. Note that a calibra-
tion file will be created only if the SAMPLE keyword is also specified on the INPUT command,
with a number less than the total number of examinees.
EXAMPLE:
CREATING A CALIBRATION FILE
>COMMENT
>GLOBAL DFNAME='EXAMPL03.DAT',NPARM=2, SAVE;
>SAVE CALIB='EXAMPL03.CAL';
>LENGTH NITEMS=(45);
>INPUT NTOTAL=45,SAMPLE=2000,NGROUP=2,KFNAME='EXAMPL03.DAT',NIDCHAR=5,
NALT=5,NFORM=2,TYPE=1;
The previously created calibration file is now used as data source through the use of the
CFNAME keyword on the GLOBAL command. Note that the TYPE keyword on the INPUT com-
mand is now set to 0, compared to 1 previously. The updated item parameter estimates are
saved to the file latest.prm using the PARM keyword on the SAVE command.
EXAMPLE:
USING A CALIBRATION FILE AS INPUT
>COMMENT
>GLOBAL CFNAME='EXAMPL03.CAL',NPARM=2, SAVE;
>SAVE PARM='LATEST.PRM';
>LENGTH NITEMS=(45);
>INPUT NTOTAL=45,SAMPLE=2000,NGROUP=2,NIDCHAR=5,
NALT=5,NFORM=2,TYPE=0;
Related topics
DFNAME keyword
Purpose
To supply the name of the raw data file that contains the original data. The format for this
file is described in the section on input and output files.
Format
DFNAME=<'filename'>
Notes
The path to and filename of this file may be longer than 80 characters. However, as the
maximum length of any line in the command file is 80 characters, multiple lines may be
used. It is important to continue up to and including the 80th column when specifying a long
path and filename.
The correct way to enter this information in the command file is to enclose the name and
path in single quotes, and continue until column 80 is reached. Then proceed in column 1 of
the next line as shown below:
If the data are stored in the same folder as the command file, it is sufficient to type
DFNAME='EXAMPL06.DAT'
Examples
This example shows the use of the external data file exampl03.dat.
>GLOBAL DFNAME='EXAMPL03.DAT';
>LENGTH NITEMS=(45);
>INPUT NTOTAL=45,KFNAME='EXAMPL03.DAT',NIDCHAR=5,
Note that this file is referenced on both the GLOBAL command (DFNAME keyword) and on the
INPUT command (KFNAME keyword). This indicates that the answer key for correct responses
is given at the top of the data file, as shown below:
Related topics
IFNAME keyword
Purpose
To supply the name of the previously created item parameter file (if any) to be used as input.
The PARM keyword of the SAVE command must be specified to save updated parameter esti-
mates to an external file.
Format
IFNAME=<'filename'>
Example
The previously created parameter file exampl03.par is used as data source through the use
of the IFNAME keyword on the GLOBAL command. The updated item parameter estimates are
saved to the file latest.par using the PARM keyword on the SAVE command.
EXAMPLE:
USING AN ITEM PARAMETER FILE AS INPUT
>COMMENT
>GLOBAL IFNAME='EXAMPL03.PAR',NPARM=2, SAVE;
>SAVE PARM='LATEST.PAR';
>LENGTH NITEMS=(45);
>INPUT NTOTAL=45,SAMPLE=2000,NGROUP=2,NIDCHAR=5,
NALT=5,NFORM=2;
Related topics
LOGISTIC option
Purpose
To assume the natural metric of the logistic response function in all calculations. Otherwise,
the logit is multiplied by D = 1.7 to obtain the metric of the normal ogive model.
Format
LOGISTIC
Default
The normal ogive metric: the logit is multiplied by D = 1.7.
Examples
For the 2-parameter model requested in this first GLOBAL command, the natural metric of the
logistic response function is assumed:
while a similar normal ogive model can be obtained by using the command:
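Sketches of the two commands (the file name is hypothetical):
>GLOBAL DFNAME='MYTEST.DAT',NPARM=2,LOGISTIC;
>GLOBAL DFNAME='MYTEST.DAT',NPARM=2;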
Related topics
MFNAME keyword
Purpose
To supply the name of a previously created master file to be read in. If data are read from a
previously prepared master file, DFNAME must not appear, and TYPE=0 must appear in the
INPUT command. The PARM keyword of the SAVE command may be specified to save up-
dated parameter estimates to an external file.
Format
MFNAME='filename'
Example
The previously created master file exampl03.mas is used as data source through the use of
the MFNAME keyword on the GLOBAL command. Note that the TYPE keyword on the INPUT
command is now set to 0.
EXAMPLE:
USING A MASTER FILE AS INPUT
>GLOBAL MFNAME='EXAMPL03.MAS',NPARM=2;
>LENGTH NITEMS=(45);
>INPUT NTOTAL=45,SAMPLE=2000,NGROUP=2,NIDCHAR=5,
NALT=5,NFORM=2,TYPE=0;
Related topics
NPARM keyword
Purpose
To specify the number of parameters of the item response model: NPARM=1 for the 1PL,
NPARM=2 for the 2PL, and NPARM=3 for the 3PL model.
Format
NPARM=n
Default
NPARM=2.
Examples
The following GLOBAL commands are used to request a 1PL, 2PL and 3PL model respec-
tively.
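For example (the file name is hypothetical):
>GLOBAL DFNAME='MYTEST.DAT',NPARM=1;
>GLOBAL DFNAME='MYTEST.DAT',NPARM=2;
>GLOBAL DFNAME='MYTEST.DAT',NPARM=3;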
Related topics
NTEST keyword
Purpose
To specify the number of (main) subtests to be analyzed.
Format
NTEST=n
Default
NTEST=1.
Examples
In the GLOBAL command below, the NTEST keyword is used to indicate that two subtests are
used. Note the two TEST commands in the syntax. The LENGTH command is used to indicate
the length of the subtests.
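A sketch of such a setup (the file name and item counts are hypothetical):
>GLOBAL DFNAME='MYTEST.DAT',NPARM=2,NTEST=2;
>LENGTH NITEMS=(20,25);
>TEST1 TNAME=SUB1,INUMBERS=(1(1)20);
>TEST2 TNAME=SUB2,INUMBERS=(21(1)45);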
Related topics
NVTEST keyword
Purpose
To specify the number of variant subtests.
Format
NVTEST=n
Default
NVTEST=0.
Example
In the example below, both a main and variant test are used. In this case, NTEST is set to 1 to
indicate the main test, and the NVTEST keyword is used to indicate the presence of a variant
test. The first test command is that for the main test, while items for the variant test are se-
lected by name in the next TEST command (here named TESTV purely for convenience).
There are 20 main test items and 4 variant test items, selected from a total of 50 items in the
data file. The LENGTH command is used to indicate the length of the subtests.
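The GLOBAL command for such a setup might read (file name hypothetical; the TEST commands
follow the pattern shown under the NTOTAL keyword of the INPUT command):
>GLOBAL DFNAME='MYTEST.DAT',NPARM=2,NTEST=1,NVTEST=1;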
Related topics
NWGHT keyword
Purpose
To specify the weighting of response records. A value larger than 0 is required when the
data are input in the form of response patterns and frequencies, or when the sampling proce-
dure requires the use of case weights. The data file (TYPE) in the INPUT command must also
be set appropriately. See the information on format statements for the data format with
weights in Section 2.6.18.
Format
NWGHT=n
0: none
1: for classical item statistics only
2: for IRT item calibration only
3: for both statistics and calibrations.
Default
NWGHT=0.
Example
In this example, the data are accumulated into answer patterns. TYPE=2 and NWGHT=3 are in-
cluded in the problem setup.
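A sketch (file name hypothetical; keywords not relevant to weighting omitted):
>GLOBAL DFNAME='MYTEST.DAT',NPARM=2,NWGHT=3;
>INPUT NTOTAL=45,TYPE=2;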
Related topics
OMITS option
Purpose
To specify that omits are treated as fractionally correct when the 3-parameter model is em-
ployed. The fraction is the reciprocal of the number of the alternatives in the multiple-choice
items (see the NALT keyword on the INPUT command, Section 2.6.9). Also see Section 2.6.20
for more information on the specification of an omit key using the OFNAME keyword on the
INPUT command.
Format
OMITS
Default
Examples
For the following 3-parameter model, an omitted response will be scored fractionally correct
with the fraction equal to 1/5 (NALT=5). The omit response key can be found in the data file.
In this example the omitted response will be scored fractionally correct with fraction 1/4.
The key for omitted responses can be found in a separate, external file.
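Sketches of the two setups described (file names hypothetical):
>GLOBAL DFNAME='MYTEST.DAT',NPARM=3,OMITS;
>INPUT NTOTAL=40,NALT=5,OFNAME='MYTEST.DAT';
and, with the omit key in a separate external file,
>INPUT NTOTAL=40,NALT=4,OFNAME='MYTEST.OMT';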
Related topics
PRNAME keyword
Purpose
To specify the name of the file from which the provisional (i.e., starting) values of parame-
ters of selected items will be obtained. The values are read in space-delimited, free-format
form.
Format
PRNAME=<’filename’>
Contents:
Line 1:
The number of items in each subtest to which provisional values will be assigned (one value
per subtest).
Remaining lines:
The serial position of each item selected from the corresponding subtest, followed by the
slope, threshold, and chance success (guessing) probability of the item. If a two-parameter
model is assumed, the latter should be entered as 0.
Default
None.
Example
5 5
5 1.0 0.0 0.333
10 1.0 0.0 0.333
15 1.0 0.0 0.333
25 1.0 0.0 0.333
30 1.0 0.0 0.333
5 1.1 0.5 0.233
10 1.1 0.5 0.233
15 1.1 0.5 0.233
25 1.1 0.5 0.233
30 1.1 0.5 0.233
Provisional values will be assigned to five items in each of two subtests. In each subtest, the
5-th, 10-th, 15-th, 25-th, and 30-th item will be assigned the values in the corresponding
line.
The following is an example of a command file that will input these values. Note that PRINT
has been set to 1 on the CALIB command to print the item parameters at cycle zero and show
the assigned values.
EXAMPLE 15:
ASSIGNED STARTING VALUES FOR TWO SUBTESTS
>GLOBAL DFNAME='EXAMPL03.DAT',PRNAME='EXAMPL15.PRM',NPARM=2,
NTEST=2,SAVE;
>SAVE PDISTRIB='EXAMPL15.PST',SCORE='EXAMPL15.SCO';
>LENGTH NITEMS=(35,35);
>INPUT NTOTAL=45,SAMPLE=2000,NGROUP=2,KFNAME='EXAMPL03.DAT',
NALT=5,NFORMS=2,NIDCHAR=5;
>ITEMS INUMBERS=(1(1)45),
INAME=(C01(1)C45);
>TEST1 TNAME=SUBTEST1, INAME=(C01(1)C15,C21(1)C40);
>TEST2 TNAME=SUBTEST2, INAME=(C06(1)C25,C31(1)C45);
>FORM1 LENGTH=25,INUMBERS=(1(1)25);
>FORM2 LENGTH=25,INUMBERS=(21(1)45);
>GROUP1 GNAME=POP1,LENGTH=25,INUMBERS=(1(1)25);
>GROUP2 GNAME=POP2,LENGTH=25,INUMBERS=(21(1)45);
(T28,5A1,T25,I1,T25,I1/45A1)
>CALIB IDIST=1,EMPIRICAL,NQPT=11,CYCLE=10,TPRIOR,NEWTON=1,
CRIT=0.01,REF=1,NOFLOAT,PRINT=1;
>SCORE IDIST=3,RSCTYPE=3,INFO=1,YCOMMON,POP,NOPRINT,MOMENTS;
Related topics
SAVE option
Purpose
To indicate that a SAVE command, specifying the files to which output should be written,
follows the GLOBAL command.
Format
SAVE
Default
Example
In the syntax below, the item parameters and scale scores are saved to file through the use of
the SCORE and PARM keywords on the SAVE command. Note that, in order to use the SAVE
command, the SAVE option is added to the GLOBAL command.
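A sketch of such a setup (file names hypothetical):
>GLOBAL DFNAME='MYTEST.DAT',NPARM=2,SAVE;
>SAVE PARM='MYTEST.PAR',SCORE='MYTEST.SCO';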
Related topics
GROUP COMMAND
Purpose
To specify information about the items in each particular group. When the NGROUP keyword
on the INPUT command is greater than one, that same number of GROUP commands must fol-
low the FORM commands. Each GROUP command specifies the group’s name, the length of
the group’s form, and the items included in that form. Items may be identified by name or
number, but not by both.
The GROUP command requires a group number in the data record. The group numbers must
range in value from 1 to the number of groups. If NFORM > 1, the group indicator field fol-
lows the form indicator field. If NFORM = 1, the group indicator field follows the case ID
field. The group indicator field is INTEGER in the variable format statement. If the subtest
is personalized (the option PERSONAL is present in the INPUT command) there are NTEST
group indicators for each subject.
The order of the several GROUP commands corresponds to the number of the respective
group. If the same items are administered to all groups, the INUMBERS and INAMES lists are
the same as those in the ITEMS command.
Format
>GROUPk GNAME=name, LENGTH=n, INAMES=(list of item names) or INUMBERS=(list of item
numbers);
Default
No groups assumed.
Example
If the form(s) for group 1 consists of items 1, 2, 4, and 5, and the form(s) for group 2 con-
sists of items 3 through 8, then the corresponding group commands are as follows:
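A plausible form for these commands (the group names are hypothetical):
>GROUP1 GNAME=GROUP1,LENGTH=4,INUMBERS=(1,2,4,5);
>GROUP2 GNAME=GROUP2,LENGTH=6,INUMBERS=(3(1)8);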
Related topics
GNAME keyword
Purpose
To provide a name (label) for GROUPk.
Format
GNAME=character string
Default
Blanks.
Related topics
INAMES keyword
Purpose
To specify the list of item names, as specified in the ITEMS command, for all items in all
forms administered to GROUPk.
Format
INAMES=(list of item names)
Default
Example
Suppose the command
>ITEMS INAMES=(I1(1)I8);
appears earlier in the command file to give the name Ix to item x. Then the two GROUP
statements could be replaced with equivalent statements using item names.
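A sketch, consistent with the item assignments shown under the GROUP command above:
>GROUP1 GNAME=GROUP1,LENGTH=4,INAMES=(I1,I2,I4,I5);
>GROUP2 GNAME=GROUP2,LENGTH=6,INAMES=(I3(1)I8);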
Note the use of the list notation in the GROUP2 statement to specify items I3 through I8.
Related topics
INUMBERS keyword
Purpose
To provide a list of item numbers, as specified in the ITEMS command, for all items in all
forms administered to GROUPk.
Format
INUMBERS=(list of item numbers)
Default
Example
In the following example, the INUMBERS keywords specify the item list for each group. Note,
again, the use of the “sequence” notation in the second statement to specify items 3 through
8.
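A sketch of these statements (group names hypothetical):
>GROUP1 GNAME=GROUP1,LENGTH=4,INUMBERS=(1,2,4,5);
>GROUP2 GNAME=GROUP2,LENGTH=6,INUMBERS=(3(1)8);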
Related topics
LENGTH keyword
Purpose
To specify the number of items in the form(s) administered to GROUPk.
Format
LENGTH=n
Default
NTOTAL.
Example
In the following example, the LENGTH keyword in each GROUP statement specifies the num-
ber of items for each group.
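The same sketch as under the INUMBERS keyword applies; here the LENGTH values 4 and 6 give
the number of items assigned to each group:
>GROUP1 GNAME=GROUP1,LENGTH=4,INUMBERS=(1,2,4,5);
>GROUP2 GNAME=GROUP2,LENGTH=6,INUMBERS=(3(1)8);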
Related topics
INPUT COMMAND
(Required)
Purpose
To provide the information which describes the raw data file. One or more variable format
statements describing the layout of the data must follow the FORM, GROUP, or DRIFT com-
mand, if present.
The keywords KFNAME, NFNAME, and OFNAME enable the user to assign specific names to the
program’s input files. A filename must be no more than 128 characters long and may in-
clude a drive prefix, a path name, and an extension. The filename must be enclosed in single
quotes. Note that each line of the command file has a maximum length of 80 characters. If
the filename does not fit on one line of 80 characters, the remaining characters should be
placed on the next line, starting at column 1.
Format
Examples
In the following example, responses from two groups are analyzed. There are two forms of a
25-item multiple-choice examination, with 5 items in common. In total, the responses of a
sample of 2000 respondents to the 45 items are considered.
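The INPUT command for this analysis, repeated under the NALT keyword below, is:
>INPUT NTOTAL=45,SAMPLE=2000,NGROUP=2,KFNAME='EXAMPL03.DAT',NIDCHAR=5,
NALT=5,NFORM=2,TYPE=1;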
The INPUT command below is used to request a DIF analysis on 4 items administered to two
groups.
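That command, repeated under the DIF option below, is:
>INPUT NTOTAL=4,NGROUPS=2,DIF,NIDC=2;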
Related topics
DIAGNOSE keyword
Purpose
To specify a level of diagnostic printout for Phase 1. Larger values of n give increasing
diagnostic output.
Format
DIAGNOSE=n
Default
No diagnostic printout.
Related topics
DIF option
Purpose
To specify a differential item functioning (DIF) analysis for multiple groups, which as-
sumes common slopes and guessing parameters for all groups.
Format
DIF
Default
No DIF analysis.
Example
In the syntax below, a 1-parameter DIF model is fitted to data from two groups of exami-
nees. DIF parameters are saved to the file exampl01.dif through use of the SAVE option on
the GLOBAL command and the DIF option on the SAVE command.
>GLOBAL NPARM=1,LOGISTIC,SAVE;
>SAVE PARM='EXAMPL01.PAR',DIF='EXAMPL01.DIF';
>LENGTH NITEMS=4;
>INPUT NTOTAL=4,NGROUPS=2,DIF,NIDC=2;
Related topics
DRIFT option
GROUP command (see Section 2.6.8)
INPUT command: NGROUP keyword
SAVE command: DIF keyword (see Section 2.6.15)
Setup menu: General dialog box (see Section 2.3.3)
DRIFT option
Purpose
To specify an item parameter drift model for multiple groups. A DRIFT command must also
appear after the GROUP commands.
Format
DRIFT
Default
No DRIFT model.
Example
In the syntax below, a 2-parameter DRIFT model is fitted to data from two groups of exami-
nees. DRIFT parameters are saved to the file exampl01.drf by using the SAVE option on the
GLOBAL command and the DRIFT option on the SAVE command.
>GLOBAL NPARM=1,LOGISTIC,SAVE;
>SAVE PARM='EXAMPL01.PAR',DRIFT='EXAMPL01.DRF';
>LENGTH NITEMS=4;
>INPUT NTOTAL=4,NGROUPS=2,DRIFT,NIDC=2;
Related topics
EXTERNAL option
Purpose
To specify the computation of the item parameters with respect to an external variable, the
values of which are supplied in the data records, rather than to a latent variable inferred from
the item responses. When item parameters are estimated in this way and used to score test
data of any other groups of examinees, the resulting scores are the best predictors of the abil-
ity measured by the external variable.
In each record of the calibration data, each test in the analysis must be represented by a
value of the external variable and its corresponding standard error. These two quantities for
each test in the data record must precede the item responses in the same order as the tests
appear in their successive command lines. The columns of the data records devoted to these
pairs of scores and standard errors must be identified in the input variable format statement.
Format
EXTERNAL
Default
Calibration with respect to a latent variable inferred from the item responses.
Example
Suppose a group of students took an end-of-term reading test and math test routinely
administered to all students in a metropolitan school district. Suppose these students were
also part of the sample for a state assessment of reading and math achievement. If scores and
standard errors on the assessment tests for these students were available to the district, the
district test could be calibrated to best predict the state reading and math scores
of all students in the district. For this purpose, the state test results would serve as the
external variables for calibrating items of the local tests to predict the state assessment’s
scores.
For the sake of generality, suppose also that there are three random parallel forms of the dis-
trict tests and that these forms are assigned at random to students in two successive school
grades. Then there will be two groups of students in the analysis and the record layout of the
data might be the following:
(4A1,1X,I1,1X,I1,2(1X,F4.1,1X,F4.1),1X,60A1)
and the item parameter file from the calibration could be saved for use in scoring other
students.
Related topics
Data, Examinee Data /Data, Group-Level Data dialog boxes (see Section 2.3.4)
Variable format statement (see Section 2.6.18)
ISEED keyword
Purpose
To specify the seed for the random number generator used for sampling subjects.
By default, the same seed will always be used for sampling subjects when the SAMPLE key-
word on the INPUT command is used. ISEED may be used to change the seed, thus producing
a different random sample of subjects.
Format
ISEED=n
Default
ISEED=1.
Related topics
KFNAME keyword
Purpose
To specify the name of the file which contains the answer key. This key consists of the cor-
rect response alternative for each item, in the same format as the corresponding response re-
cords. Any single ASCII character can be used as a response alternative. If the answer key is
in the same file as the item response data, the key must precede the first response record. If
KFNAME does not appear on the INPUT command, then the data are assumed to be scored 1
for correct and 0 for incorrect.
When NFORM > 1, separate answer, not-presented, and omit keys must be specified for each
form in the order of the forms to which they apply. Again, if they are in the same file as the
response data, all keys must precede the first response record.
Format
KFNAME=<’filename’>
Default
No answer key.
Notes
The path to and filename of this file may be longer than 80 characters. As the maximum
length of any line in the command file is 80 characters, multiple lines should be used. It is
important to continue up to and including the 80th column when specifying a long path and
filename.
The correct way to enter this information in the command file is to enclose the name and
path in apostrophes, and continue until column 80 is reached. Then proceed in column 1 of
the next line as shown below:
If the data are stored in the same folder as the command file, it is sufficient to type
KFNAME='EXAMPL06.DAT'
Example
In the analysis of single subject data from the file exampl04.dat, the answer key appears at
the top of the file as indicated by the use of the KFNAME keyword.
>INPUT NTOTAL=40,NFORM=2,KFNAME='EXAMPL04.DAT',NALT=5;
As two forms are used, answer keys are given by form before the actual data, and in the
same format as the data records. The first few lines of exampl04.dat are as follows:
Related topics
NALT keyword
Purpose
To specify the maximum number of response alternatives in the raw data. 1/NALT is used as
the automatic starting value for estimating lower asymptotes (guessing parameters) of the 3-
parameter model.
Format
NALT=n
Default
5 for the 3PL model; 1000 for the 1PL and 2PL models.
Examples
In the case of the following 2-parameter model, 5 responses to each item are given in the
data file.
The correct response to each item is noted in the answer key, which appears at the top of the
data file (indicated by the KFNAME keyword on the INPUT command).
>GLOBAL DFNAME='EXAMPL03.DAT',NPARM=2;
>LENGTH NITEMS=(45);
>INPUT NTOTAL=45,SAMPLE=2000,NGROUP=2,KFNAME='EXAMPL03.DAT',NIDCHAR=5,
NALT=5,NFORM=2,TYPE=1;
When a 3-parameter model is fitted to the same data, 1/5 will be used as starting value for
the lower asymptote (guessing parameter) of each item.
>GLOBAL DFNAME='EXAMPL03.DAT',NPARM=3;
>LENGTH NITEMS=(45);
>INPUT NTOTAL=45,SAMPLE=2000,NGROUP=2,KFNAME='EXAMPL03.DAT',NIDCHAR=5,
NALT=5,NFORM=2,TYPE=1;
In the following example, a 2-parameter model is fitted to the data. No answer key is given,
and it is assumed that the 2 response alternatives (NALT=2) are coded 1 for correct responses
and 0 for incorrect responses. If more than 2 response alternatives are present and no code is
given, all responses other than 1 will be assumed incorrect.
>GLOBAL DFNAME='EXAMPL04.DAT',NPARM=2;
>LENGTH NITEMS=(40);
>INPUT NTOTAL=40,NALT=2;
Related topics
NFMT keyword
Purpose
To specify the number of format records for reading the respondent data records.
Format
NFMT=n
Default
1.
Examples
In the format statement below, item responses are read from two lines: the first 25 responses
are read on the first line of data for each examinee and the second 25 on the second line of
data. Although responses are read over two lines, the format statement fits comfortably on
one line in the command file, and thus NFMT=1.
(11A1,T39,25A1/T13,25A1)
If, however, a large data file is used as input, and it becomes necessary to write the format
statement over multiple lines in the command file, the value assigned to NFMT should be ad-
justed to reflect this. For example, NFMT=2 for the following format statement in which 15
items are selected and columns between items are passed over using the “X” operator:
(11A1,1X,A1,2X,A1,1X,A1,3X,A1,1X,A1,2X,A1,1X,A1,3X,A1,1X,A1,2X,A1,1X,A1,
3X,A1,1X,A1,2X,A1,1X,A1)
Related topics
NFNAME keyword
Purpose
To specify the name of the file which contains the not-presented key. This key must be
given in the same format as the corresponding response records. Any single ASCII character
can be used to represent a not-presented item. If the not-presented key is in the same file as
the item response data, the key must precede the first response record. If this key appears in
the same file as the answer key, it must appear in the file after the answer key. If NFNAME
does not appear on the INPUT command, then all items are assumed presented.
When NFORM > 1, separate answer, not-presented, and omit keys must be provided for each
form in the order of the forms to which they apply. Again, if they are in the same file as the
response data, all keys must precede the first response record.
Format
NFNAME=<’filename’>
Default
No not-presented key.
Examples
In the analysis of single subject data from the file exampl04.dat, the not-presented key ap-
pears at the top of the file as indicated below, using the NFNAME keyword.
>INPUT NTOTAL=40,NFORM=2,NFNAME='EXAMPL04.DAT',NALT=5;
>ITEMS INUMBERS=(1(1)40),INAME=(T01(1)T40);
>TEST TNAME=SIM;
>FORM1 LENGTH=20,INUMBERS=(1(1)20);
>FORM2 LENGTH=20,INUMBERS=(21(1)40);
(T28,5A1,T25,I1/40A1)
As two forms are used, the not-presented keys are given by form before the actual data, and
in the same format as the data records. The first few lines of exampl04.dat are as follows:
Alternatively, the not-presented keys can be saved to a not-presented key file exampl04.nfn,
and referenced as such in a revised INPUT command:
>INPUT NTOTAL=40,NFORM=2,NFNAME='EXAMPL04.NFN',NALT=5;
If both a not-presented key and an omit key are used for the two forms, the following lines
should appear at the top of the data file when the data file is referenced by the NFNAME and
OFNAME keywords in the INPUT command:
>INPUT NTOTAL=40,NFORM=2,NFNAME='EXAMPL04.DAT',
OFNAME='EXAMPL04.DAT',NALT=5;
Related topics
NFORM keyword
Purpose
To specify the number of test forms. If NFORM > 1, the response records must contain an in-
dicator specifying the form to which the examinee responded. This keyword is used in com-
bination with the FORM command and the variable format statement.
The NFORM keyword is required when multiple-form data is supplied to the program in com-
pressed form (see input file format discussed in Section 2.6.20 for more details). If the in-
strument consists of a single test form, or multiple-form data is supplied to the program in
expanded format, the NFORM keyword, with NFORM=1, is required by the program if the order
of items on the response records does not correspond to the order of items in the ITEMS
command list.
Format
NFORM=n
Default
No FORM commands will be read and the order of items in the response records is assumed to
be the same as that in the ITEMS command.
Example
In the following example, two forms were administered to two groups of examinees. As
both the NFORM and NGROUP keywords are used on the INPUT command, both FORM and
GROUP commands are given.
>INPUT NTOTAL=45,NGROUP=2,NIDCHAR=5,NALT=5,NFORM=2;
>ITEMS INUMBERS=(1(1)45), INAME=(C01(1)C45);
>TEST TNAME=CHEMISTRY;
>FORM1 LENGTH=25,INUMBERS=(1(1)25);
>FORM2 LENGTH=25,INUMBERS=(21(1)45);
>GROUP1 GNAME=POP1,LENGTH=25,INUMBERS=(1(1)25);
>GROUP2 GNAME=POP2,LENGTH=25,INUMBERS=(21(1)45);
Note that the format statement contains both a form and a group indicator.
(5A1,T25,I1,T25,I1,25A1)
Related topics
Purpose
To specify the number of groups or cohorts of respondents. If NGROUP > 1, the response re-
cords must contain an indicator specifying the group or cohort to which the respondent be-
longs. This keyword is used in combination with the GROUP command and the variable for-
mat statement, where a group indicator is added.
Format
NGROUP=n
Default
1.
Related topics
Purpose
To specify the number of characters in the respondent’s identification field. Valid values are
1 to 30.
Format
NIDCHAR=n
Default
30.
Example
Data from two groups, found on two forms are analyzed in this example. The NIDCHAR
keyword is set to 5, indicating that the subject ID field is 5 columns in length. This corre-
sponds with the format statement, where the first entry, for the subject ID, is 5A1.
>INPUT NTOTAL=45,NGROUP=2,NIDCHAR=5,NALT=5,NFORM=2,TYPE=1;
(5A1,T25,I1,T25,I1/25A1)
Related topics
Purpose
To specify the total number of unique items in the respondent data records. The number in-
cludes all main and variant test items on all forms.
Format
NTOTAL=n
Default
0.
Examples
In this example, responses from two groups are analyzed. There are two forms of a 25-item
multiple-choice examination, with 5 items in common. In total, the responses of a sample of
2000 respondents to the 45 items are considered.
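A sketch of such an INPUT command, modeled on the NFORM example earlier in this chapter:
>INPUT NTOTAL=45,NGROUP=2,NIDCHAR=5,NALT=5,NFORM=2;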
The INPUT command below is used to request a DIF analysis on 4 items administered to two
groups.
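A sketch, copied from the spelling DIF example later in this chapter:
>INPUT NTOTAL=4,NGROUPS=2,DIF,NIDCHAR=2,TYPE=2;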
In the following example, responses to 50 items are read from the data file. From the 50, 20
are selected as Main Test items and 4 as Variant Test items. Items for the main test are se-
lected by name in the TESTM command; items for the variant test are selected by name in the
TESTV command.
>ITEMS INUMBERS=(1(1)50),INAME=(I26(1)I75);
>TESTM TNAME=MAINTEST,
INAMES=(I26,I27,I28,I29,I31,I33,I34,
I35,I36,I38,I39,I47,I48,I49,I50,I54,I60,I64,I68,I72);
>TESTV TNAME=VARIANT,INAMES=(I53,I59,I69,I73);
Related topics
Purpose
To specify the name of the file which contains the omit key. This key must be specified in
the same format as the response records. Any single ASCII character can be used to repre-
sent an omitted item. If the omit key is in the same file as the item response data, the key
must precede the first response record. If this key appears in the same file as the answer
and/or not-presented keys, it must appear in the file after both keys.
If OFNAME does not appear on the INPUT command, omits will not be distinguished from in-
correct responses. When NFORM > 1, separate answer, not-presented, and omit keys must be
provided for each form in the order of the forms to which they apply. Again, if they are in
the same file as the response data, all keys must precede the first response record.
Format
OFNAME=<’filename’>
Default
No omit key.
Examples
In the analysis of single subject data from the file exampl04.dat, the omit key appears at the
top of the file as indicated by the use of the OFNAME keyword.
>INPUT NTOTAL=40,NFORM=2,OFNAME='EXAMPL04.DAT',NALT=5;
>ITEMS INUMBERS=(1(1)40),INAME=(T01(1)T40);
>TEST TNAME=SIM;
>FORM1 LENGTH=20,INUMBERS=(1(1)20);
>FORM2 LENGTH=20,INUMBERS=(21(1)40);
(T28,5A1,T25,I1/40A1)
As two forms are used, the omit keys are given by form before the actual data, and in the
same format as the data records. Alternatively, the omit keys can be saved to a separate
omit key file, exampl04.ofn, and referenced as such in a revised INPUT command:
>INPUT NTOTAL=40,NFORM=2,OFNAME='EXAMPL04.OFN',NALT=5;
If both a not-presented key and an omit key are used for the two forms, the following lines
should appear at the top of the data file when the data file is referenced by the NFNAME and
OFNAME keywords in the INPUT command:
>INPUT NTOTAL=40,NFORM=2,NFNAME='EXAMPL04.DAT',
OFNAME='EXAMPL04.DAT',NALT=5;
Related topics
Purpose
To specify the assumption that the group or cohort assignment of an examinee is personal-
ized by subtest. The response records must contain NTEST indicators, one for each subtest,
specifying the groups or group cohorts to which the respondent belongs. The NTEST group
indicators must be specified in the variable format statement in the same order as the sub-
tests.
The PERSONAL option is especially useful for two-stage tests that measure ability in more
than one area. Assignment to the second-stage booklets may differ among areas.
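For example, with NTEST=2 the variable format statement would contain two group indicator
fields, one per subtest, read in subtest order. A sketch with hypothetical column positions:
(5A1,T25,I1,T26,I1/40A1)
Here the first I1 field supplies the group indicator for subtest 1 and the second for subtest 2.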
Format
PERSONAL
Default
None.
Related topics
Purpose
To specify the number of respondents to be randomly sampled from the raw data file.
Format
SAMPLE=n
Default
1000.
Example
Here data are read from the file exampl03.dat, which also contains the answer key (DFNAME
and KFNAME keywords). Although the data file contains only 400 records, a sample of 2000
is requested.
>GLOBAL DFNAME='EXAMPL03.DAT',NPARM=2;
>LENGTH NITEMS=(45);
>INPUT NTOTAL=45,SAMPLE=2000,NGROUP=2,KFNAME='EXAMPL03.DAT',NIDCHAR=5,
NALT=5,NFORM=2,TYPE=1;
If the first few records of the data file are to be used, the TAKE keyword should be used in-
stead.
Related topics
Purpose
To specify an analysis using only the first n respondents in the data file. This option is useful
for testing the problem setup on a smaller number of respondents when the sample size is
large. Note that the maximum value for this keyword is the actual number of respondents in
the data file. To obtain a random sample of the respondents, the SAMPLE keyword should be
used. TAKE and SAMPLE are mutually exclusive keywords.
Format
TAKE=n
Default
None (all respondents in the data file are used).
Examples
In the following example, data are read from the file exampl03.dat, which also contains the
answer key (DFNAME and KFNAME keywords). Although the data file contains only 400 re-
cords, a sample of 2000 is requested.
>GLOBAL DFNAME='EXAMPL03.DAT',NPARM=2;
>LENGTH NITEMS=(45);
>INPUT NTOTAL=45,SAMPLE=2000,NIDCHAR=5,NALT=5,TYPE=1;
If, however, only the first 100 records are to be used in the analysis, the modified INPUT
command
>INPUT NTOTAL=45,TAKE=100,NIDCH=5,NALT=5,TYPE=1;
should be used.
Related topics
Purpose
To specify the type of data in the respondent data file. In the examples in this chapter,
TYPE=1 indicates single-subject response data and TYPE=3 indicates aggregate-level
(multiple-matrix sampling) data.
Format
TYPE=n
Default
1.
Examples
In a preliminary run, an item parameter file was created as shown below. The item parame-
ter file was saved to exampl03.par using the PARM keyword on the SAVE command. As sin-
gle-subject data were used in this run, TYPE was set to 1 in the INPUT command.
EXAMPLE:
CREATING AN ITEM PARAMETER FILE
>COMMENT
>GLOBAL DFNAME='EXAMPL03.DAT',NPARM=2, SAVE;
>SAVE PARM='EXAMPL03.PAR';
>LENGTH NITEMS=(45);
>INPUT NTOTAL=45,SAMPLE=2000,NGROUP=2,KFNAME='EXAMPL03.DAT',NIDCH=5,
NALT=5,NFORM=2,TYPE=1;
The previously created item parameter file is now used as input through the use of the
IFNAME keyword on the GLOBAL command. Note that the TYPE keyword on the INPUT command is
now set to 0, compared to 1 previously. The updated item parameter estimates are saved to
the file latest.par using the PARM keyword on the SAVE command.
EXAMPLE:
USING AN ITEM PARAMETER FILE AS INPUT
>COMMENT
>GLOBAL IFNAME='EXAMPL03.PAR',NPARM=2, SAVE;
>SAVE PARM='LATEST.PAR';
>LENGTH NITEMS=(45);
>INPUT NTOTAL=45,SAMPLE=2000,NGROUP=2,NIDCHAR=5,
NALT=5,NFORM=2,TYPE=0;
Related topics
ITEMS COMMAND
(Required)
Purpose
To specify the names and corresponding numbers for all items in the data records. The items
may be listed in any order, but the order in which the names appear must correspond with
the order of the numbers. The names and numbers specified in the ITEMS command are used
to refer to the items in the TEST, FORM, and GROUP commands.
Strings of consecutive numbers may be abbreviated as m(1)n, where m is the number of the
first item and n is the number of the last item. Strings of up to 8 character names including
consecutive numbers may be abbreviated as Xm(1)Xn, where X is a string of up to 4 letters of
the alphabet, m is the up-to-4 character integer number of the first item and n is the up-to-4
character integer number of the last item.
Format
Default
None.
Examples
In the first example, 15 items are assigned the names MATH01 through MATH15.
>ITEMS INAME=(MATH01(1)MATH15);
In the syntax that follows, 16 items belonging to 2 subtests are identified. From the LENGTH
command, we see that each subtest has 8 items. The ITEMS command is used to first number
these items, and then to assign the names N1 through N8 to items belonging to the first sub-
test. Items belonging to the second subtest are named A1 through A8. On the TEST com-
mands, items are referenced by number. Referencing by the names assigned in the ITEMS
command is another option.
>LENGTH NITEMS=(8,8);
>INPUT NTOTAL=16,NALT=5,NIDCHAR=9,TYPE=3;
>ITEMS INUMBERS=(1(1)16),INAMES=(N1(1)N8,A1(1)A8);
>TEST1 TNAME=NUMCON,INUMBERS=(1(1)8);
>TEST2 TNAME=ALGCON,INUMBERS=(9(1)16);
Related topics
Purpose
To specify a list of NTOTAL unique names (up to eight characters each). Item names that do
not begin with letters must be enclosed in single quotes.
Format
INAMES=(name1,name2,...,nameNTOTAL)
Default
1, 2, …, NTOTAL.
Related topics
Purpose
To specify the list of NTOTAL unique numbers. Strings of consecutive numbers may be ab-
breviated as m(1)n, where m is the number of the first item and n is the number of the last
item.
Format
INUMBERS=( n1 , n2 ,..., nNTOTAL )
Default
1, 2, …, NTOTAL.
Example
In the syntax that follows, 16 items belonging to 2 subtests are identified. From the LENGTH
command we see that each subtest has 8 items. The ITEMS command is used to first number
these items, and then to assign the names N1 through N8 to items belonging to the first sub-
test. Items belonging to the second subtest are named A1 through A8. On the TEST com-
mands, items are referenced by number. Referencing by the names assigned in the ITEMS
command is another option.
>LENGTH NITEMS=(8,8);
>INPUT NTOTAL=16,NALT=5,NIDCHAR=9,TYPE=3;
>ITEMS INUMBERS=(1(1)16),INAMES=(N1(1)N8,A1(1)A8);
>TEST1 TNAME=NUMCON,INUMBERS=(1(1)8);
>TEST2 TNAME=ALGCON,INUMBERS=(9(1)16);
Related topics
LENGTH COMMAND
(Required)
Purpose
To supply the number of items in subtests and the number of variant items in the subtests.
Format
The keywords NITEMS and, optionally, NVARIANT, each followed by a list of values (see
below).
Example
Consider two subtests. Subtest 1 has a total of 20 items; subtest 2 has a total of 15 items.
Five of the items in subtest 1 are variant items. None of the items in subtest 2 are variant
items.
Note that the number of variant tests has to be specified using the NVTEST keyword on the
GLOBAL command. The corresponding number of TEST commands must also be included in
the syntax.
>GLOBAL DFNAME='EXAMPL04.DAT',NTEST=2,NVTEST=1;
…
>LENGTH NITEMS=(20,15), NVARIANT=(5,0);
Related topics
Purpose
To provide a list of the number of items in the successive subtests to be analyzed. If a sub-
test contains variant items, they are included in this count of items.
Format
NITEMS=( n1 , n2 ,..., nNTEST )
Default
None.
Example
In the example below, 20 of the 24 items are selected as main test items and 4 as variant test
items. The number of variant tests is specified using the NVTEST keyword on the GLOBAL
command. The TEST command for the main test is followed by a TEST command in which
the variant items are specified by item number.
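A sketch of such a setup (data file name and item numbers hypothetical):
>GLOBAL DFNAME='MYDATA.DAT',NPARM=2,NTEST=1,NVTEST=1;
>LENGTH NITEMS=(24),NVARIANT=(4);
>ITEMS INUMBERS=(1(1)24);
>TESTM TNAME=MAIN,INUMBERS=(1(1)20);
>TESTV TNAME=VARIANT,INUMBERS=(21(1)24);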
Related topics
Purpose
To specify the number of variant items, if any, in the successive subtests to be analyzed. Al-
though parameter estimates for these items will be obtained, these items are not used in scor-
ing of tests/forms.
Format
NVARIANT=( n1 , n2 ,..., nNTEST )
Default
0.
Related topics
PRIORS COMMAND
(Optional)
Purpose
To specify prior distributions for constrained estimation of the item parameters of the main
test and for the variant items, if any. This command is required when the READPR keyword
appears in the CALIB command.
There is one PRIOR command for each subtest. Values are read in order of the items in the
subtest beginning with the main test items and ending with the variant test items. If
NGROUP>1, more than one set of prior means and standard deviations for the item thresholds
may be required when the DIF or DRIFT models are specified. See the TMU and TSIGMA
keywords below.
Format
One or more of the keywords TMU, TSIGMA, SMU, SSIGMA, ALPHA, and BETA, described below,
each followed by a list of values.
Notes
If the same value applies to all items of the subtest, you may use the “repeat” form: “value
(0) number of values” (see Section 2.6.2).
For a mean of p with a weight of n observations for the beta prior distribution, set
ALPHA=np+1
BETA=n(1–p)+1
To set an item parameter to a fixed value, set the mean of the prior to the parameter value
and set the corresponding standard deviation to a very small value. Suitable values for
TSIGMA are 0.005, for SSIGMA, 0.001 and for ALPHA and BETA, n = 1000. The priors for free
parameters should be set to the default values above. The PRIORS command for each test
should appear immediately after the QUAD commands for that test.
Examples
The example below illustrates how user-supplied priors for the latent distributions are speci-
fied with IDIST=1 on the CALIB command. The points and weights for these distributions
are supplied in the corresponding QUAD commands. Note that with IDIST=1, there are sepa-
rate QUAD commands for each group for each subtest. Within each subtest the points are the
same for each group. This is a requirement of the program. But as the example shows, the
points for the groups may differ by subtest. The PRIOR command for each subtest is placed
after the QUAD commands for that subtest. In this example, only the prior for the standard de-
viations of the thresholds is supplied on the PRIOR command. Default values are used for the
other prior distributions. The means of the distributions are kept fixed at their specified val-
ues by using the NOFLOAT option on the CALIB command.
>GLOBAL DFNAME='EXAMPL03.DAT',NPARM=2,NTEST=2;
>LENGTH NITEMS=(35,35);
>INPUT NTOT=45,SAMPLE=2000,NGROUP=2,KFNAME='EXAMPL03.DAT',NALT=5,
NFORMS=2,NIDCHAR=5;
>ITEMS INUMBERS=(1(1)45), INAME=(C01(1)C45);
>TEST1 TNAME=SUBTEST1,INAME=(C01(1)C15,C21(1)C40);
>TEST2 TNAME=SUBTEST2,INAME=(C06(1)C25,C31(1)C45);
>FORM1 LENGTH=25,INUMBERS=(1(1)25);
>FORM2 LENGTH=25,INUMBERS=(21(1)45);
>GROUP1 GNAME=POP1,LENGTH=25,INUMBERS=(1(1)25);
>GROUP2 GNAME=POP2,LENGTH=25,INUMBERS=(21(1)45);
(T28,5A1,T25,I1,T25,I1/45A1)
>CALIB IDIST=1,READPR,EMPIRICAL,NQPT=16,CYCLE=25,TPRIOR,NEWTON=5,
CRIT=0.01,REFERENCE=1,NOFLOAT;
>QUAD1 POINTS=(-0.4598E+01,-0.3560E+01,-0.2522E+01,-0.1484E+01,
-0.4453E+00,0.5930E+00,0.1631E+01,0.2670E+01,0.3708E+01,
0.4746E+01),
WEIGHTS=(0.2464E-05,0.4435E-03,0.1724E-01,0.1682E+00,
0.3229E+00,0.3679E+00,0.1059E+00,0.1685E-01,0.6475E-03,
0.8673E-05);
>QUAD2 POINTS=(-0.4598E+01,-0.3560E+01,-0.2522E+01,-0.1484E+01,
-0.4453E+00,0.5930E+00,0.1631E+01,0.2670E+01,0.3708E+01,
0.4746E+01),
WEIGHTS=(0.2996E-04,0.1300E-02,0.1474E-01,0.1127E+00,
0.3251E+00,0.3417E+00,0.1816E+00,0.2149E-01,0.1307E-02,
0.3154E-04);
>PRIOR TSIGMA=(1.5(0)35);
>QUAD1 POINTS=(-0.4000E+01,-0.3111E+01,-0.2222E+01,-0.1333E+01,
-0.4444E+00,0.4444E+00,0.1333E+01,0.2222E+01,0.3111E+01,
0.4000E+01),
WEIGHTS=(0.1190E-03,0.2805E-02,0.3002E-01,0.1458E+00,
0.3213E+00,0.3213E+00,0.1458E+00,0.3002E-01,0.2805E-02,
0.1190E-03);
>QUAD2 POINTS=(-0.4000E+01,-0.3111E+01,-0.2222E+01,-0.1333E+01,
-0.4444E+00,0.4444E+00,0.1333E+01,0.2222E+01,0.3111E+01,
0.4000E+01),
WEIGHTS=(0.1190E-03,0.2805E-02,0.3002E-01,0.1458E+00,
0.3213E+00,0.3213E+00,0.1458E+00,0.3002E-01,0.2805E-02,
0.1190E-03);
>PRIOR TSIGMA=(1.5(0)35);
Suppose IDIST=1, NGROUP=2, and NTEST=2. The QUAD and PRIOR commands are then given by
subtest: for each subtest, one QUAD command per group, followed by the PRIOR command for
that subtest.
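A schematic sketch of this ordering (keyword values omitted):
>CALIB IDIST=1,READPR,... ;
>QUAD1 POINTS=(...),WEIGHTS=(...);   (subtest 1, group 1)
>QUAD2 POINTS=(...),WEIGHTS=(...);   (subtest 1, group 2)
>PRIOR ... ;                         (subtest 1)
>QUAD1 POINTS=(...),WEIGHTS=(...);   (subtest 2, group 1)
>QUAD2 POINTS=(...),WEIGHTS=(...);   (subtest 2, group 2)
>PRIOR ... ;                         (subtest 2)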
Related topics
Purpose
To specify the real-valued “alpha” parameters for the beta prior distribution of lower asymp-
tote (guessing) parameters.
Format
ALPHA=( n1 , n2 ,..., nN )
Default
20p+1.
Related topics
Purpose
To specify the real-valued “beta” parameters for the beta prior distribution of lower asymp-
tote (guessing) parameters.
Format
BETA=( n1 , n2 ,..., nN )
Default
20(1–p)+1.
Related topics
Purpose
To specify real-valued prior means for the item slopes.
Format
SMU=( n1 , n2 ,..., nN )
Default
1.0.
Example
In the following example, SMU is used to specify prior means for the item slopes.
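A sketch using the repeat form described in the Notes above, assuming a 35-item subtest
and a common prior mean of 1.1 (values hypothetical):
>PRIOR SMU=(1.1(0)35);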
Related topics
Purpose
To specify real-valued prior standard deviations for the item slopes.
Format
SSIGMA=( n1 , n2 ,..., nN )
Default
1.64872127.
Example
In the calibration of a single subtest with 35 items, the following PRIOR command is used to
provide a real-valued prior standard deviation of 1.75 for the item slopes.
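Using the repeat form, the command would plausibly read:
>PRIOR SSIGMA=(1.75(0)35);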
Related topics
Purpose
To specify real-valued prior means for the item thresholds (DIF) or polynomial coefficients
(DRIFT) including intercept.
Format
TMU=( n1 , n2 ,..., nN )
Default
0.0.
Example
In the example below, PRIOR commands are used to specify prior distributions for the con-
strained estimation of the thresholds in the calibration of two subtests with 8 items each.
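A sketch of such commands, one PRIOR command per subtest (prior values hypothetical):
>PRIOR TMU=(0.0(0)8),TSIGMA=(1.0(0)8);
>PRIOR TMU=(0.0(0)8),TSIGMA=(1.0(0)8);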
Related topics
Purpose
To specify real-valued prior standard deviations for the item thresholds (DIF) or polynomial
coefficients (DRIFT). L sets of values are read. If neither the DIF nor the DRIFT model is
selected, L = 1. If the DIF model is selected, L = NGROUP. If the DRIFT model is selected,
L = MAXPOWER.
Format
TSIGMA=( n1 , n2 ,..., nN )
Default
2.0.
Related topics
QUAD COMMAND
Purpose
To read in user-supplied quadrature points and weights, or points and ordinates of the dis-
crete finite representations of the prior distribution for the groups. This command follows di-
rectly after the CALIB command.
If IDIST=1 on the CALIB command, one QUAD command must be supplied for each group for
each subtest. If IDIST=2, one QUAD command is required for each group.
Format
>QUADj POINTS=(...),WEIGHTS=(...); (j = group number)
Example
This example illustrates how user-supplied priors for the latent distributions are specified with
IDIST=1 on the CALIB command. The points and weights for these distributions are supplied
in the QUAD commands. Note that with IDIST=1, there are separate QUAD commands for each
group for each subtest.
Within each subtest the points are the same for each group. This is a requirement of the pro-
gram. But as the example shows, the points for the groups may differ by subtest. The PRIOR
command for each subtest is placed after the QUAD commands for that subtest.
>GLOBAL DFNAME='EXAMPL03.DAT',NPARM=2,NTEST=2;
>LENGTH NITEMS=(35,35);
>INPUT NTOT=45,SAMPLE=2000,NGROUP=2,KFNAME='EXAMPL03.DAT',NALT=5,
NFORMS=2,NIDCHAR=5;
>ITEMS INUMBERS=(1(1)45), INAME=(C01(1)C45);
>TEST1 TNAME=SUBTEST1,INAME=(C01(1)C15,C21(1)C40);
>TEST2 TNAME=SUBTEST2,INAME=(C06(1)C25,C31(1)C45);
>FORM1 LENGTH=25,INUMBERS=(1(1)25);
>FORM2 LENGTH=25,INUMBERS=(21(1)45);
>GROUP1 GNAME=POP1,LENGTH=25,INUMBERS=(1(1)25);
>GROUP2 GNAME=POP2,LENGTH=25,INUMBERS=(21(1)45);
(T28,5A1,T25,I1,T25,I1/45A1)
>CALIB IDIST=1,READPR,EMPIRICAL,NQPT=16,CYCLE=25,TPRIOR,NEWTON=5,
CRIT=0.01,REFERENCE=1,NOFLOAT;
>QUAD1 POINTS=(-0.4598E+01,-0.3560E+01,-0.2522E+01,-0.1484E+01,
-0.4453E+00,0.5930E+00,0.1631E+01,0.2670E+01,0.3708E+01,
0.4746E+01),
WEIGHTS=(0.2464E-05,0.4435E-03,0.1724E-01,0.1682E+00,
0.3229E+00,0.3679E+00,0.1059E+00,0.1685E-01,0.6475E-03,
0.8673E-05);
>QUAD2 POINTS=(-0.4598E+01,-0.3560E+01,-0.2522E+01,-0.1484E+01,
-0.4453E+00,0.5930E+00,0.1631E+01,0.2670E+01,0.3708E+01,
0.4746E+01),
WEIGHTS=(0.2996E-04,0.1300E-02,0.1474E-01,0.1127E+00,
0.3251E+00,0.3417E+00,0.1816E+00,0.2149E-01,0.1307E-02,
0.3154E-04);
>PRIOR TSIGMA=(1.5(0)35);
>QUAD1 POINTS=(-0.4000E+01,-0.3111E+01,-0.2222E+01,-0.1333E+01,
-0.4444E+00,0.4444E+00,0.1333E+01,0.2222E+01,0.3111E+01,
0.4000E+01),
WEIGHTS=(0.1190E-03,0.2805E-02,0.3002E-01,0.1458E+00,
0.3213E+00,0.3213E+00,0.1458E+00,0.3002E-01,0.2805E-02,
0.1190E-03);
>QUAD2 POINTS=(-0.4000E+01,-0.3111E+01,-0.2222E+01,-0.1333E+01,
-0.4444E+00,0.4444E+00,0.1333E+01,0.2222E+01,0.3111E+01,
0.4000E+01),
WEIGHTS=(0.1190E-03,0.2805E-02,0.3002E-01,0.1458E+00,
0.3213E+00,0.3213E+00,0.1458E+00,0.3002E-01,0.2805E-02,
0.1190E-03);
>PRIOR TSIGMA=(1.5(0)35);
Related topics
Purpose
If IDIST = 1 on the CALIB command, a set of NQPT real-numbered values (with decimal
points) of the quadrature points must be supplied for each group for each subtest. If
IDIST = 2, one set of points is required for each group.
Format
POINTS=( n1 , n2 ,..., nNQPT )
Default
Supplied by program.
Example
Related topics
Purpose
If IDIST = 1 on the CALIB command, a set of NQPT positive fractions (with decimal points
and summing to 1.0) must be supplied as weights for the quadrature points for each group
for each subtest. If IDIST = 2, one set of weights is required for each group; this set of
weights applies to all subtests.
Format
WEIGHTS=( n1 , n2 ,..., nNQPT )
Default
Supplied by program.
Related topics
QUADS COMMAND
Purpose
To supply arbitrary prior distributions of scale scores for the respondents when EAP estima-
tion is selected. This command follows directly after the SCORE command.
If IDIST = 2 on the SCORE command, a separate QUADS command must be supplied for each
group.
If there are multiple groups (NGROUPS > 1) and IDIST = 1 or 2, the POINTS must have the
same values for all groups. The WEIGHTS may differ by group, and the POINTS may differ by
subtest.
Format
>QUADSj POINTS=(...),WEIGHTS=(...); (j = group number)
Example
In the 2-group example below, an illustration is given of the use of user-supplied priors for
the scale scores (IDIST=2) for the respondents when EAP estimation is selected (METHOD=2).
The points and weights for these distributions are supplied in the QUADS commands. Note
that with IDIST=2, there are separate QUADS commands for each group.
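A sketch of such a setup for two groups (points and weights hypothetical):
>SCORE METHOD=2,IDIST=2;
>QUADS1 POINTS=(...),WEIGHTS=(...);
>QUADS2 POINTS=(...),WEIGHTS=(...);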
Related topics
Purpose
To specify real-numbered, non-negative values (with decimal points) for the NQPT points of
the arbitrary discrete prior distribution.
Format
POINTS=( n1 , n2 ,..., nNQPT )
Default
Supplied by program.
Example
Related topics
Purpose
To specify real-numbered, non-negative values (with decimal points) for the NQPT weights
of the arbitrary discrete prior distribution. The sum of the weights must equal unity.
Format
WEIGHTS=( n1 , n2 ,..., nNQPT )
Default
Supplied by program.
Example
Related topics
SAVE COMMAND
Purpose
This command is used to supply output filenames. The filenames must be less than 128
characters long and may contain a drive prefix, a path name, and an extension. The filename
must be enclosed in single quotes. Note that each line of the command file has a maximum
length of 80 characters. If the filename does not fit on one line of 80 characters, the remain-
ing characters should be placed on the next line, starting at column 1. All output files other
than the MASTER and CALIB files are saved in a formatted form. See Section 2.6.20 on output
files for more information. Note that, in order to use the SAVE command, the SAVE option
must be included in the GLOBAL command.
Format
One or more of the keywords CALIB, COVARIANCE, DIF, DRIFT, EXPECTED, ISTAT, MASTER,
PARM, PDISTRIB, POST, SCORE, and TSTAT, described below, each assigned a filename.
Example
In the syntax below, the item parameters and scale scores are saved to file through use of the
SCORE and PARM keywords on the SAVE command. Note that, in order to use the SAVE com-
mand, the SAVE keyword is added to the GLOBAL command.
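A minimal sketch consistent with this description (data file name hypothetical):
>GLOBAL DFNAME='MYDATA.DAT',NPARM=2,SAVE;
>SAVE PARM='MYDATA.PAR',SCORE='MYDATA.SCO';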
Related topics
Purpose
To specify a filename for the calibration data file that is to be saved. The original response
data are sampled and calibrated, then saved as a temporary binary file. If no sampling oc-
curs, this temporary file cannot be created. Upon normal termination of the program this
temporary file is deleted automatically. By assigning a specific name to the calibration data
file, the user can save and reuse it as a master data file in subsequent analyses.
Format
CALIB=<’filename’>
Default
Do not save.
Example
The calibration file is saved to exampl03.cal using the CALIB keyword on the SAVE com-
mand.
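Based on this description, the SAVE command would read (other keywords omitted):
>SAVE CALIB='EXAMPL03.CAL';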
Related topics
Purpose
To specify a filename for the external file to which the covariances of item parameter esti-
mates for each item are written. This file is written automatically in the calibration phase
(Phase 2) as a temporary file, which passes necessary information to the scoring phase
(Phase 3). Normally, it is deleted at the termination of the program, but by assigning a spe-
cific name to this file the user can save it as a permanent file.
Format
COVARIANCE=<’filename’>
Default
Do not save.
Example
A covariance file from a previous calibration can be used to compute test information by
specifying the name of the file with the COVARIANCE keyword on the SAVE command. During
the scoring phase, the item information indices will be added to this file if requested. This
feature is intended for use when scoring is based on a previously created item parameter file.
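A sketch of such a command (filename hypothetical):
>SAVE COVARIANCE='EXAMPL03.COV';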
Related topics
Purpose
To specify a filename for saving the DIF parameters if requested and computed during the
calibration phase (Phase 2) to an external file.
Format
DIF=<’filename’>
Default
Do not save.
Example
The DIF parameters are saved to the file exampl03.dif using the DIF keyword on the SAVE
command.
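Based on this description (other keywords omitted):
>SAVE DIF='EXAMPL03.DIF';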
Related topics
Purpose
To specify a filename for saving the DRIFT parameters computed during the calibration
phase (Phase 2) to an external file.
Format
DRIFT=<’filename’>
Default
Do not save.
Example
In the following example, the DRIFT parameters are saved to the file exampl03.dri using
the DRIFT keyword on the SAVE command.
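Based on this description (other keywords omitted):
>SAVE DRIFT='EXAMPL03.DRI';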
Related topics
Purpose
To specify the filename to which the expected frequencies of correct responses, attempts,
and proportions of correct responses for each item at each quadrature point by group will be
saved. This file will also contain standardized posterior residuals and model proportions of
correct responses.
Format
EXPECTED=<’filename’>
Default
Do not save.
Example
In the following example, the expected frequencies are saved to exampl03.frq using the
EXPECTED keyword on the SAVE command.
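Based on this description (other keywords omitted):
>SAVE EXPECTED='EXAMPL03.FRQ';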
Related topics
Purpose
To specify a filename for saving the classical item statistics computed in Phase 1 of the pro-
gram to an external file.
Format
ISTAT=<’filename’>
Default
Do not save.
Example
The classical item statistics are saved to the file exampl03.sta using the ISTAT keyword on
the SAVE command.
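Based on this description (other keywords omitted):
>SAVE ISTAT='EXAMPL03.STA';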
Related topics
Purpose
To specify a filename for the master data file. The original response data are scored and
stored as a temporary binary file. Upon normal termination of the program this temporary
file is deleted automatically. By assigning a specific name to this master data file, the user
can save and reuse it as an input file in subsequent analyses.
Format
MASTER=<’filename’>
Default
Do not save.
Example
The master file is saved to exampl03.mas using the MASTER keyword on the SAVE command.
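Based on this description (other keywords omitted):
>SAVE MASTER='EXAMPL03.MAS';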
Related topics
Purpose
Item parameter estimates are saved in a formatted form as an external output file. This file
can be used as initial estimates of item parameters for further iterations or as final estimates
of the item parameters for scoring new data.
In either case, the user must then specify the name of the previously created item parameter
file in the later run, using the IFNAME keyword of the GLOBAL command.
Format
PARM=<’filename’>
Default
Do not save.
Example
In the syntax sketched below, the item parameters are saved to file through use of the SCORE
and PARM keywords on the SAVE command. Note that, in order to use the SAVE command, the
SAVE option is added to the GLOBAL command. The use of this file as initial estimates for
further iterations is also illustrated.
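A sketch of the two runs, modeled on the TYPE example earlier in this chapter (file names
hypothetical):
>GLOBAL DFNAME='MYDATA.DAT',NPARM=2,SAVE;
>SAVE PARM='MYDATA.PAR',SCORE='MYDATA.SCO';
and, in a subsequent run:
>GLOBAL DFNAME='MYDATA.DAT',IFNAME='MYDATA.PAR',NPARM=2;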
Related topics
Purpose
To save the points and weights of the posterior latent distribution at the end of Phase 2 to an
external file. These quantities can be included as prior values following the SCORE command
for later EAP estimation of ability from previously estimated item parameters.
Format
PDISTRIB=<’filename’>
Default
Do not save.
Related topics
Purpose
To save the case weight and marginal probability for each observation to an external output
file.
Format
POST=<’filename’>
Default
Do not save.
Example
The case weights and marginal probabilities are saved to the file exampl03.pos using the
POST keyword on the SAVE command.
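Based on this description (other keywords omitted):
>SAVE POST='EXAMPL03.POS';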
Related topics
Purpose
To specify a filename for saving the estimated scale scores computed in the scoring phase
(Phase 3) to an external file.
Format
SCORE=<’filename’>
Default
Do not save.
Example
In the following example, the score file is saved to exampl03.sco using the SCORE keyword
on the SAVE command.
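Based on this description (other keywords omitted):
>SAVE SCORE='EXAMPL03.SCO';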
Related topics
Purpose
To specify a filename when the tables of test information statistics are to be saved.
Format
TSTAT=<’filename’>
Default
Do not save.
Example
The test information statistics file is saved to exampl03.tsa using the TSTAT keyword on the
SAVE command.
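Based on this description (other keywords omitted):
>SAVE TSTAT='EXAMPL03.TSA';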
Related topics
SCORE COMMAND
(Optional)
Purpose
To initiate the scoring of individual examinees or of response patterns; to compute item and
test information and plot information curves; to rescale scores to a specified mean and stan-
dard deviation in either the sample or the latent distribution.
Format
One or more of the keywords BIWEIGHT, DOMAIN, FILE, FIT, IDIST, INFO, LOCATION, METHOD,
MOMENTS, NFORMS, NOPRINT, NQPT, PMN, POP, PSD, READF, REFERENCE, RSCTYPE, SCALE, and
YCOMMON, described below.
Examples
The aggregate scores for the following analysis of school-level data are estimated by the
EAP method using the empirical distributions from Phase 2. The number of quadrature
points is set to 12 per subtest.
The scores are rescaled to a mean of 250 and a standard deviation of 50 in the latent distri-
bution of schools (IDIST=3, LOCATION=250, SCALE=50). The fit of the data to the group-
level model is tested for each school (FIT).
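A sketch consistent with this description (RSCTYPE=4 inferred from the rescaling "in the
latent distribution"; the NQPT list assumes two subtests):
>SCORE METHOD=2,IDIST=3,NQPT=(12,12),RSCTYPE=4,LOCATION=250,SCALE=50,FIT;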
The next SCORE command gives the specifications for a scoring phase that includes an in-
formation analysis (INFO=2) with expected information indices for a normal population
(POP). Rescaling of the scores and item parameters to mean 0 and standard deviation 1 in the
estimated latent distribution has been requested (RSC=3). Printing of the students' scores on
the screen is suppressed (NOPRINT).
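A sketch of such a command:
>SCORE INFO=2,POP,RSCTYPE=3,NOPRINT;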
In the following SCORE command, the EAP scale scores of Phase 3 are computed from the
responses to items in the main test as specified by setting METHOD to 2. Printing of scores is
suppressed (NOPRINT).
>SCORE METHOD=2,NOPRINT;
In this score command, Maximum Likelihood estimates of ability (METHOD=1) are rescaled to
a mean of 250 and standard deviation of 50 in Phase 3 (RSCTYPE=3, LOCATION=250,
SCALE=50).
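A sketch of such a command:
>SCORE METHOD=1,RSCTYPE=3,LOCATION=250,SCALE=50;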
Related topics
Purpose
To request the calculation of biweighted estimates robust to isolated deviant responses. (See
also Mislevy & Bock, 1982.)
Format
BIWEIGHT
Related topics
Purpose
To convert the Phase 3 estimates into domain scores if the user supplies a file containing the
item parameters for a sample of previously calibrated items. The FILE keyword on the
SCORE command is used to specify this parameter file. Weights can be applied to the items to
improve the representation of the domain specifications. This conversion may be useful as
an aid to the interpretation of test results (see Bock, Thissen, & Zimowski, 1997).
Note that the formula for the domain scores that appears in the paper cited here contains ty-
pographical errors. The computation of the domain scores in the program uses the corrected
formula. The domain scores will appear in the score listing following the test percent correct
score for each case in the Phase 3 output file.
Format
DOMAIN=n
Default
No domain scores.
Related topics
Purpose
To specify the name of the file containing the item parameters to be used for the domain
score conversions.
The first line of the file referenced by the FILE keyword must contain a variable format
statement (in parentheses) describing the column layout of the weights and parameters in the
following lines of the file. The values must be read in order—item weight, slope, threshold,
and guessing parameter. The weights will be automatically scaled to sum to 1.0 by the pro-
gram. The domain score will appear in the score listing following the test percent correct
score for each case. Note that the parameter file produced by the SAVE command does not
have the layout described above.
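A sketch of such a file (format and values hypothetical), giving the weight, slope, threshold,
and guessing parameter for each item:
(4F10.5)
   0.05000   1.20000  -0.50000   0.18000
   0.03000   0.80000   0.25000   0.20000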
Format
FILE=<'filename'>
Default
None.
Related topics
Purpose
To request the computation of a likelihood ratio χ² goodness-of-fit statistic for each re-
sponse pattern. This statistic is intended only for use with aggregate-level data.
Format
FIT
Default
No fit statistic.
Example
The aggregate scores for this analysis of school-level data are estimated by the EAP method
using the empirical distributions from Phase 2. The fit of the data to the group-level model is
tested for each school (FIT).
Related topics
Purpose
To designate the type of prior distribution of scale scores. IDIST = 0 applies to both MAP
and EAP estimation. IDIST = 1, 2, 3, or 4 applies only to EAP estimation.
Format
IDIST=n
Default
0.
Examples
In the following aggregate-level example, IDIST=3 is used to estimate scores by the EAP
method by using the empirical distributions from Phase 2.
In the next example, EAP estimates of ability are calculated (METHOD=2) using the informa-
tion in the posterior distributions from Phase 2 (IDIST=3). The ability estimates are rescaled
to a mean of 0 and standard deviation of 1 by specifying RSCTYPE=3 on the SCORE command.
Related topics
Purpose
To request an item and test information analysis in the scoring phase.
Format
INFO=n
n=0 none
n=1 test information curves
n=2 test information curves and table of information statistics
Default
0.
Examples
The following SCORE command gives the specifications for a scoring phase that includes an
information analysis (INFO=2) with expected information indices for a normal population
(POP).
In the following SCORE command, Maximum Likelihood estimates of ability (METHOD=1) are
rescaled to a mean of 250 and standard deviation of 50 in Phase 3.
Related topics
Purpose
To specify the real-valued location constant used in rescaling the estimated scores (see the
RSCTYPE keyword).
Format
LOCATION=n
Default
0.0.
Examples
The scores are rescaled to a mean of 250 and a standard deviation of 50 in the latent distri-
bution of schools (IDIST=3, LOCATION=250, SCALE=50). The fit of the data to the group-
level model is tested for each school (FIT).
In the next SCORE command, Maximum Likelihood estimates of ability (METHOD=1) are re-
scaled to a mean of 250 and standard deviation of 50 in Phase 3.
Related topics
Purpose
To specify the method of estimating scale scores. If ML is selected, it is advisable to use the
PMN keyword to set bounds on the estimated scores. If EAP or MAP is selected, the PMN and
PSD keywords may be used to specify the means and standard deviations of the prior distri-
butions.
Format
METHOD=n
n=1 Maximum Likelihood (ML)
n=2 Expected A Posteriori (EAP)
n=3 Maximum A Posteriori (MAP)
Default
2.
Examples
In this score command, Maximum Likelihood estimates of ability (METHOD=1) are rescaled to
a mean of 250 and standard deviation of 50 in Phase 3 (RSCTYPE=3, LOCATION=250,
SCALE=50).
Related topics
Purpose
To request the computation and listing of the coefficients of skewness and kurtosis of the
ability estimates and of the latent distribution.
Format
MOMENTS
Default
No computation or listing.
Examples
The MOMENTS keyword on the SCORE commands below is used to obtain the coefficients of
skewness and kurtosis for the rescaled ability.
>SCORE NQPT=11,RSCTYPE=3,LOCATION=250,SCALE=50,NOPRINT,INFO=1,
POP,MOMENT;
>SCORE IDIST=3,RSCTYPE=3,INFO=1,YCOMMON,POP,NOPRINT,MOMENTS;
Related topics
Purpose
To indicate the number of additional FORM commands after the SCORE command. It is used
when scoring is to be performed using these additional form specifications. The reference
form for scoring is set using the REFERENCE keyword on the SCORE command.
Format
NFORMS=n
Default
None.
Example
In the example below, two additional FORM commands follow the SCORE command. The first
is for the reference form (as set by the REFERENCE keyword), while the READF keyword in-
structs the program to read and process the additional FORM commands.
>SCORE IDIST=3,RSCTYPE=3,INFO=1,YCOMMON,POP,NOPRINT,REF=1,NFORMS=2,READF;
>FORM1 LENGTH=25,INUM=(1(1)25);
>FORM2 LENGTH=25,INUM=(21(1)45);
Related topics
Purpose
To suppress the display of the scores on screen and in the printed output of Phase 3.
To shorten the run time for scoring a large subject response file, it is advisable to specify an
external file using the SCORE keyword in the SAVE command, and the NOPRINT option. In this
way, scores for all subjects are computed but are stored only in the external file.
Format
NOPRINT
Default
Scores will appear both on screen and in the Phase 3 output file.
Examples
The EAP scale scores of Phase 3 are computed from the responses to items in the main test
as specified by setting METHOD to 2. Printing of scores is suppressed (NOPRINT).
Related topics
Purpose
To set the number of quadrature points for each subtest when EAP estimation is selected by
the METHOD keyword.
To reduce computing time when there are not-presented items, use 2 × the square root of the
maximum number of items per respondent as the number of quadrature points.
Format
NQPT=( n1 , n2 ,..., nNTEST )
Default
Example
The aggregate scores for this analysis of school-level data are estimated by the EAP method
using the empirical distributions from Phase 2. The number of quadrature points is set to 12
per subtest.
Related topics
Purpose
To specify real-numbered means (with decimal points) of the normal prior distributions for
each group for each subtest.
Format
PMN=( n1,1 , n1,2 ,..., n1, NGROUP , n2,1 , n2,2 ,..., n2, NGROUP ,..., nNTEST ,1 , nNTEST ,2 ,..., nNTEST , NGROUP )
Default
0.0.
Example
In the following two-group analysis for one subtest, the PMN and PSD keywords are used on
the SCORE command to provide the means and standard deviations of the normal prior distri-
butions for each subtest.
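A sketch with hypothetical values (two groups, one subtest):
>SCORE METHOD=2,PMN=(0.0,0.3),PSD=(1.0,1.2);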
Related topics
Purpose
To request the calculation of the expected information for the population when INFO > 0.
This includes an estimate of the classical reliability coefficient for each subtest. The score
metric after rescaling is used in these calculations.
Format
POP
Default
The expected information for the population is not computed.
Example
This SCORE command gives the specifications for a scoring phase that includes an informa-
tion analysis (INFO=2) with expected information indices for a normal population (POP). Re-
scaling of the scores and item parameters to mean 0 and standard deviation 1 in the esti-
mated latent distribution has been requested (RSC=3). Printing of the students' scores on the
screen is suppressed (NOPRINT).
Related topics
Purpose
To specify real-numbered standard deviations (with decimal points) of the normal prior dis-
tributions for each group for each subtest.
Format
PSD=( n1,1 , n1,2 ,..., n1, NGROUP , n2,1 , n2,2 ,..., n2, NGROUP ,..., nNTEST ,1 , nNTEST ,2 ,..., nNTEST , NGROUP )
Default
1.0.
Example
In the following two-group analysis for one subtest, the PMN and PSD keywords are used on
the SCORE command to provide the means and standard deviations of the normal prior distri-
butions for each subtest.
Related topics
Purpose
To indicate the presence of multiple FORM commands after the SCORE command. It is used to
indicate that scoring is to be performed using this form specification. The reference form for
scoring is set using the REFERENCE keyword on the SCORE command.
Format
READF
Default
No FORM commands are read after the SCORE command.
Example
In the example below, two additional FORM commands follow the SCORE command. The first
is for the reference form (as set by the REFERENCE keyword), while the READF keyword in-
structs the program to read and process the additional FORM commands.
>SCORE IDIST=3,RSCTYPE=3,INFO=1,YCOMMON,POP,NOPRINT,REF=1,NFORMS=2,READF;
>FORM1 LENGTH=25,INUM=(1(1)25);
>FORM2 LENGTH=25,INUM=(21(1)45);
Related topics
Purpose
To set the reference form for scoring when scoring is performed by forms, as specified with
the READF and NFORMS keywords on the same command. Note that, if this keyword is omitted
while the READF and NFORMS keywords are present, the reference form specified in the CALIB
command will be used.
Format
REFERENCE=n
Default
The reference form specified in the CALIB command.
Example
In the example below, two additional FORM commands follow the SCORE command. The first
is for the reference form (as set by the REFERENCE keyword), while the READF keyword in-
structs the program to read and process the additional FORM commands.
>SCORE IDIST=3,RSCTYPE=3,INFO=1,YCOMMON,POP,NOPRINT,REF=1,NFORMS=2,READF;
>FORM1 LENGTH=25,INUM=(1(1)25);
>FORM2 LENGTH=25,INUM=(21(1)45);
Related topics
Purpose
To specify the type of rescaling applied to the estimated scale scores, using the LOCATION
and SCALE constants specified with the keywords described below. Note that there is no
option 2.
0: no rescaling
1: linear transformation of scores: new score = SCALE x old score + LOCATION
3: rescale to SCALE and LOCATION in the sample of scale score estimates
4: only if EAP estimation has been selected: set the mean of the latent population distri-
bution equal to LOCATION and set the standard deviation equal to SCALE.
Format
RSCTYPE=n
Default
0.
Examples
The aggregate scores for this analysis of school-level data are estimated by the EAP method
using the empirical distributions from Phase 2. The number of quadrature points is set to 12
per subtest. The scores are rescaled to a mean of 250 and a standard deviation of 50 in the
latent distribution of schools (IDIST=3, LOCATION=250, SCALE=50). The fit of the data to the
group-level model is tested for each school (FIT).
Related topics
Purpose
To specify the real-valued scale constant used in rescaling the estimated scores (see the
RSCTYPE keyword).
Format
SCALE=n
Default
1.0.
Examples
In the following example, Maximum Likelihood estimates of ability (METHOD=1) are rescaled
to a mean of 250 and standard deviation of 50 in Phase 3 (RSCTYPE=3, LOCATION=250,
SCALE=50).
Related topics
Purpose
To specify that the test information curves for subtests should be expressed in comparable
units when INFO > 0. If YCOMMON is omitted, the curves for subtests will be adjusted sepa-
rately to make each plot fill the available space.
Format
YCOMMON
Default
Curves are scaled separately for each subtest.
Example
The following SCORE command specifies a scoring phase that includes an information analy-
sis (INFO=2) with expected information indices for a normal population (POP).
Test information curves for subtests will be expressed in comparable units and printed to the
Phase 3 output file.
Related topics
TEST COMMAND
(Required)
Purpose
To identify the main test items and the variant test items (if any) in each of the NTEST sub-
tests. If the subtest contains only main test items, there is only one TEST command for that
subtest. If there are variant items in the subtest, two TEST commands are required for that
subtest. The first describes the main test items, while the second describes the variant test
items. There are as many TEST commands as there are main and variant subtests specified in
the NTEST and NVTEST keywords of the GLOBAL command.
Items may be identified by name or number, but not by both. The names or numbers must
correspond to those listed in the ITEMS command. If numbers are supplied, the program will
refer to the names supplied in the ITEMS command only for printing of item information.
Starting values for estimating the item parameters may also be supplied in the TEST com-
mand. Note that parameter estimation for variant items is non-iterative and does not require
starting values.
Format
One or more of the keywords TNAME, INAME, INUMBER, SLOPE, THRESHLD, INTERCPT,
DISPERSN, GUESS, and FIX, described below.
Default
See the INAME and INUMBER keywords.
Examples
In the example below, two subtests are used, each with 8 items. The NTEST keyword on the
GLOBAL command indicates that two subtests are to be used, and two TEST commands follow
the ITEMS command. The TEST commands are assigned names through the TNAME keyword
and items are referenced by number.
>GLOBAL NPARM=3,NTEST=2,DFNAME='EXAMPL08.DAT';
>LENGTH NITEMS=(8,8);
>INPUT NTOTAL=16,NALT=5,NIDCHAR=9,TYPE=3;
>ITEMS INUMBERS=(1(1)16),INAMES=(N1(1)N8,A1(1)A8);
>TEST1 TNAME=NUMCON,INUMBERS=(1(1)8);
>TEST2 TNAME=ALGCON,INUMBERS=(9(1)16);
In the next example, the ITEMS command lists the four items in the order that they will be
read from the data records. The INAMES and INUMBERS keywords assign each item a name
and a corresponding number. Because there is only one form, the NFORM keyword is not re-
quired in the INPUT command and a FORM command is not required. Because examinees in
both groups are presented all the items listed in the ITEMS command, the TEST command
need contain only the test name.
>GLOBAL NPARM=1,NWGHT=3,LOGISTIC;
>LENGTH NITEMS=4;
>INPUT NTOTAL=4,NGROUPS=2,DIF,NIDCHAR=2,TYPE=2;
>ITEMS INAMES=(SP1(1)SP4),INUMBERS=(1(1)4);
>TEST TNAME=SPELL;
>GROUP1 GNAME=MALES;
>GROUP2 GNAME=FEMALES;
Related topics
Purpose
To specify positive real-numbered starting values for dispersion (2- and 3-parameter models
only).
Starting values may be specified for slopes or for dispersions, but not for both.
Format
DISPERSN=( n1 , n2 ,..., nn ( i ) )
Default
1/slope.
Example
In the syntax below, starting values for the dispersion and intercepts of the four items con-
sidered in this 3-parameter model are provided on the TEST command.
EXAMPLE:
USING STARTING VALUES
>GLOBAL NPARM=3,LOGISTIC;
>LENGTH NITEMS=4;
>INPUT NTOTAL=4,NIDCHAR=2;
>ITEMS INAME=(SP01,SP02,SP03,SP04),INUMBERS=(1(1)4);
>TEST TNAME=SPELL,
INTERCPT = (1.284,0.287,-1.912,-0.309),
DISPERSN=(0.957,0.623,0.545,0.620);
Related topics
Technical menu: Item Parameter Starting Values dialog box (see Section 2.3.5)
TEST command: SLOPE keyword
Purpose
To specify whether the parameters of specific items are free to be estimated or are to be held
fixed at their starting values. This keyword appears in the j-th TEST command as
FIX=( n1 , n2 ,..., nLENGTH ( j ) ), where each entry indicates whether the parameters of the
corresponding item are free or fixed.
The starting values may be entered by the SLOPE, THRESHLD, and GUESSING keywords of the
j-th TEST command, or read from an existing item parameter file (IFNAME) designated by
IFNAME=<’filename’> on the GLOBAL command and saved in a previous job by the
PARM=<’filename’> keyword on the SAVE command; or, alternatively, read from a file of
provisional item parameters, designated by the PRNAME=<’filename’> keyword on the
GLOBAL command. When only a few items are to be fixed, this method is the most conven-
ient. If all items are designated as fixed, and the INFO keyword appears on the SCORE com-
mand, the required information and reliability analysis will be performed in Phase 3.
In order for this procedure to work, however, the program must have data to process in
Phases 1 and 2 for at least a few cases. Some artificial response data can be used for this
purpose. The only calculations that will be performed in Phase 2 are preparations for the in-
formation analysis in Phase 3. The number of EM cycles in the CALIB command can there-
fore be set to 2 and the number of NEWTON cycles to 1. The NOADJUST option must also be in-
voked.
Format
FIX=( n1 , n2 ,..., nLENGTH ( j ) )
Default
Do not fix.
Example
The following example shows the fixing of five items (items 6 through 10) by specifying
their values in a PRNAME file. The first line of the file gives the number of items; each sub-
sequent line lists an item number followed by its slope, threshold, and guessing values.
5
6 1.27168 0.10504 0.14011
7 1.79009 0.10221 0.07543
8 0.81238 0.24523 0.22179
9 1.33017 -0.22387 0.15453
10 1.06557 0.58430 0.08921
DIAGNOS has been set equal to 1 to produce more detailed output, which shows that these val-
ues do not change during the Phase 2 estimation cycles. They will, of course, be rescaled
along with those of the estimated items in Phase 3.
Related topics
Purpose
To specify starting values for the lower asymptote (guessing) parameters (3-parameter
model only). These values should be positive fractional numbers with decimal points.
Format
GUESS=( n1 , n2 ,..., nn ( i ) )
Default
0.0.
Example
In the syntax below, starting values for the slopes and guessing parameters of the four items
considered in this 3-parameter model are provided on the TEST command.
EXAMPLE:
USING STARTING VALUES
>GLOBAL NPARM=3,LOGISTIC;
>LENGTH NITEMS=4;
>INPUT NTOTAL=4,NIDCHAR=2;
>ITEMS INAME=(SP01,SP02,SP03,SP04),INUMBERS=(1(1)4);
>TEST TNAME=SPELL,
SLOPE=(1.045,1.604,1.836,1.613),
GUESS=(0.189,0.168,0.101,0.152);
Related topics
Purpose
To provide a list of names, as specified in the ITEMS command for items in TEST. Item
names that do not begin with letters must be enclosed in single quotes.
Format
INAME=( n1 , n2 ,..., nn ( i ) )
Default
If NTEST =1, and NVTEST = 0, all NTOTAL items are as specified in the INPUT command.
There is no default if NTEST > 1 or NVTEST ≠ 0.
Example
In the following example, responses to 50 items are read from those of 100 items in the data
file. From the 50, 20 are selected as Main Test items and 4 as Variant Test items. Items for
the main test are selected by name in the TESTM command; items for the variant test are se-
lected by name in the TESTV command. The item names correspond to the sequence numbers
in the original set of 100 items.
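The corresponding syntax, as shown earlier under the NTOTAL keyword of the INPUT command:
>ITEMS INUMBERS=(1(1)50),INAME=(I26(1)I75);
>TESTM TNAME=MAINTEST,
INAMES=(I26,I27,I28,I29,I31,I33,I34,
I35,I36,I38,I39,I47,I48,I49,I50,I54,I60,I64,I68,I72);
>TESTV TNAME=VARIANT,INAMES=(I53,I59,I69,I73);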
Related topics
Purpose
To specify real-numbered starting values (with decimal points) for estimating the item inter-
cept. Starting values may be specified for intercepts or for thresholds, but not for both.
Format
INTERCPT=( n1 , n2 ,..., nn ( i ) )
Default
0.0.
Example
In the syntax below, starting values for the intercepts of the four items considered in this 3-
parameter model are provided on the TEST command.
EXAMPLE:
USING STARTING VALUES
>GLOBAL NPARM=3,LOGISTIC;
>LENGTH NITEMS=4;
>INPUT NTOTAL=4,NIDCHAR=2;
>ITEMS INAME=(SP01,SP02,SP03,SP04),INUMBERS=(1(1)4);
>TEST TNAME=SPELL,INTERCPT = (1.284,0.287,-1.912,-0.309);
Related topics
Purpose
To provide a list of numbers, as specified in the ITEMS command for items in TEST. If TEST
refers to main test items, n(i) is the number of main test items. If TEST refers to variant test
items, n(i) is the number of variant test items.
The notation “first (increment) last” in these lists may be used when the item numbers form
an arithmetic progression.
Format
INUMBER=( n1 , n2 ,..., nn ( i ) )
Default
If NTEST=1, and NVTEST=0, all NTOTAL items as specified in the INPUT command. There is no
default if NTEST>1 or NVTEST ≠ 0.
Examples
For the case where NTEST=1 and NVTEST=1 in the GLOBAL command, NITEMS=10 and
NVARIANT=4 in the LENGTH command, and NTOT=10 in the INPUT command, the main test
items of subtest i might be specified in the first TEST command with
INUMBERS=(1,2,3,6,8,10). The variant test items of subtest i might be specified in the second
TEST command with INUMBERS=(4,5,7,9).
In the example below, two subtests are used, each with 8 items. The NTEST keyword on the
GLOBAL command indicates that two subtests are to be used, and two TEST commands follow
the ITEMS command. The subtests are assigned names through the TNAME keyword and items
are referenced by number.
>GLOBAL NPARM=3,NTEST=2,DFNAME='EXAMPL08.DAT';
>LENGTH NITEMS=(8,8);
>INPUT NTOTAL=16,NALT=5,NIDCHAR=9,TYPE=3;
>ITEMS INUMBERS=(1(1)16),INAMES=(N1(1)N8,A1(1)A8);
>TEST1 TNAME=NUMCON,INUMBERS=(1(1)8);
>TEST2 TNAME=ALGCON,INUMBERS=(9(1)16);
Related topics
Purpose
To provide starting values for slopes (2- and 3-parameter models only). These starting val-
ues should be positive, real numbers with decimal points. Starting values may be specified
for slopes or for dispersions, but not for both.
Format
SLOPE=( n1 , n2 ,..., nn ( i ) )
Default
1.0.
Example
In the syntax below, starting values for the intercepts and slopes of the four items considered
in this 3-parameter model are provided on the TEST command.
EXAMPLE:
USING STARTING VALUES
>GLOBAL NPARM=3,LOGISTIC;
>LENGTH NITEMS=4;
>INPUT NTOTAL=4,NIDCHAR=2;
>ITEMS INAME=(SP01,SP02,SP03,SP04),INUMBERS=(1(1)4);
>TEST TNAME=SPELL,
INTERCPT = (1.284,0.287,-1.912,-0.309),
SLOPE=(1.045,1.604,1.836,1.613);
Related topics
Technical menu: Item Parameter Starting Values dialog box (see Section 2.3.5)
TEST command: DISPERSN keyword
Purpose
To specify real-numbered starting values (with decimal points) for estimating the item
thresholds. Starting values may be specified for intercepts or for thresholds, but not for both.
Format
THRESHLD=( n1 , n2 ,..., nn ( i ) )
Default
0.0.
Example
In the syntax below, starting values for the slopes and thresholds of the four items consid-
ered in this 3-parameter model are provided on the TEST command.
EXAMPLE:
USING STARTING VALUES
>GLOBAL NPARM=3,LOGISTIC;
>LENGTH NITEMS=4;
>INPUT NTOTAL=4,NIDCHAR=2;
>ITEMS INAME=(SP01,SP02,SP03,SP04),INUMBERS=(1(1)4);
>TEST TNAME=SPELL, SLOPE=(1.045,1.604,1.836,1.613),
THRESHLD=(-1.229,-0.179,1.041,0.192);
Related topics
Technical menu: Item Parameter Starting Values dialog box (see Section 2.3.5)
TEST command: INTERCPT keyword
Purpose
To supply a name for subtest i (up to eight characters) if there are no variant test items in
subtest i, or a name for the main test items in subtest i if there are variant test items in sub-
test i.
Format
TNAME=character string
Default
None.
Examples
In the example below, two subtests are used, each with 8 items. The NTEST keyword on the
GLOBAL command indicates that two subtests are to be used, and two TEST commands follow
the ITEMS command. The TEST commands are assigned names through the TNAME keyword
and items are referenced by number.
>GLOBAL NPARM=3,NTEST=2,DFNAME='EXAMPL08.DAT';
>LENGTH NITEMS=(8,8);
>INPUT NTOTAL=16,NALT=5,NIDCHAR=9,TYPE=3;
>ITEMS INUMBERS=(1(1)16),INAMES=(N1(1)N8,A1(1)A8);
>TEST1 TNAME=NUMCON,INUMBERS=(1(1)8);
>TEST2 TNAME=ALGCON,INUMBERS=(9(1)16);
In the next example, the ITEMS command lists the four items in the order that they will be
read from the data records. Because examinees in both groups are presented all the items
listed in the ITEMS command, the TEST command need contain only the test name.
>GLOBAL NPARM=1,NWGHT=3,LOGISTIC;
>LENGTH NITEMS=4;
>INPUT NTOTAL=4,NGROUPS=2,DIF,NIDCHAR=2,TYPE=2;
>ITEMS INAMES=(SP1(1)SP4),INUMBERS=(1(1)4);
>TEST TNAME=SPELL;
>GROUP1 GNAME=MALES;
>GROUP2 GNAME=FEMALES;
Related topics
TITLE
(Required)
Purpose
To provide a label that will be used throughout the output to identify the problem run. The
first two lines of the command file are always the title lines. If the title fits on one line, a
second, blank line must be entered before the next command starts.
The maximum length of each line is 80 characters. The text will be printed verbatim at the
top of each output section, as well as at the start of some output files. The two title lines are
required at the start of the command file. No special delimiters (> or ;) are required.
Format
…text…
…text…
Example
EXAMPLE 4
SIMULATED RESPONSES TO TWO 20-ITEM PARALLEL TEST FORMS
Related topics
VARIABLE FORMAT STATEMENT
(Required)
Purpose
To supply variable format statements describing the column assignments of fields in the data
records.
Format
(aA1,nX,Ib,Ic,Fd.e,Tw,fA1)
where:
aA1 reads the a characters of the case ID field (a = NIDCHAR);
nX skips n columns;
Ib and Ic read the integer form and group indicators, when required;
Fd.e reads a real-valued case weight, when required;
Tw tabs to column w;
fA1 reads the f item responses.
Notes
Columns skipped between fields are indicated by nX, where n is the number of columns to
be passed over.
If the fields in the data records are not in the above order, the format tab designator (Tw) may
be inserted in front of any of the fields (w is the position of the first column of the field,
counting from column one). Check the input data carefully when left tabs are used.
A forward slash (/) means “skip to the next line”. For example,
(5A1,5X,15A1/10X,15A1)
would read the case ID and 15 item responses from line 1; then, skip ten columns and read
15 item responses from line 2.
The variable format statement for aggregate-level data has the general form:
(aA1,Ib,Ic,Fd.e,f(Fg.h,Fi.j))
where the leading fields are as described above, and the f repetitions of (Fg.h,Fi.j) read the
number-tried and number-correct values for the f items.
Examples
The following example uses simulated responses to illustrate nonequivalent groups equating
of two forms of a 25-item multiple-choice examination administered to different popula-
tions. The two forms have five items in common: C21, C22, C23, C24, and C25. The items
for each group are specified in the GROUP1 and GROUP2 commands. Note that the item lists
on the GROUP commands are the same as those on the FORMS command. This is because
Group 1 took Form 1 of the examination and Group 2 took Form 2 of the examination.
As an answer key is provided in the raw data file (KFNAME=EXAMPL03.DAT on the INPUT
command), the answer key appears first. Note that, when multiple forms are used, an answer
key for each form should be provided. The answer key is in the same format as the data. For
each examinee, two lines of data are provided. The first line contains identifying information
and the second the item responses.
The first information read from the data file is the examinee’s ID, which is in column 35
(5A1). For the first examinee the ID is 0001, and for the last 0200. Using the T operator to
move to column 25, the form indicator is read next (I1). Because the values for form and
group are the same for any given subject, a single form/group indicator appears on each data
record. The indicator is read twice, first for forms and then for groups. The “/” operator is
used to move to the first column of the second line. The 25 item responses are then read as
(25A1).
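A format statement consistent with this description (reconstructed; compare the similar
statement shown earlier in this chapter):
(T35,5A1,T25,I1,T25,I1/25A1)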
The following example illustrates the equating of equivalent groups with the BILOG-MG
program. Two parallel test forms of 20 multiple-choice items were administered to two
equivalent samples of 200 examinees drawn from the same population. There are no com-
mon items between the forms. Because the samples were drawn from the same population,
GROUP commands are not required. The FORM1 command lists the order of the items in Form
1 and the FORM2 command lists the order of the items in Form 2.
As in the previous example, two lines of data are provided for each examinee. The first line
contains identifying information and the second the item responses. The first information
read from the data file is the examinee’s ID, which is in column 35 (5A1). For the first ex-
aminee the ID is 0001, and for the last 0200. Using the “T” operator to move to column 25,
the form indicator is read next (I1). The “/” operator is used to move to the first column of
the second line. The 20 item responses per form are then read in.
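A command file for this analysis might look as follows (a sketch; the data file name, item
names, test name, and NPARM value are illustrative):
>GLOBAL DFNAME='EXAMPL04.DAT',NPARM=2;
>LENGTH NITEMS=(40);
>INPUT NTOTAL=40,NFORMS=2,NIDCHAR=5;
>ITEMS INUMBERS=(1(1)40),INAME=(P01(1)P40);
>TEST TNAME=PARALLEL;
>FORM1 LENGTH=20,INUMBERS=(1(1)20);
>FORM2 LENGTH=20,INUMBERS=(21(1)40);
(T35,5A1,T25,I1/20A1)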
Two hundred students at each of three grade levels, grades four, six, and eight, were given
grade-appropriate versions of a 20-item arithmetic examination. Items 19 and 20 appear in
the grade 4 and 6 forms; items 37 and 38 appear in the grade 6 and 8 forms. Because each
item is assigned a unique column in the data records, a FORM command is not required. Both
an answer key and a not-presented key are given at the top of the raw data file
(KFNAME=EXAMPL05.DAT, NFNAME=EXAMPL05.DAT on the INPUT command). In the case of
the answer key, a "1" represents a correct response. A not-presented item is indicated by a
blank, " ".
As no FORM command is required, only a group indicator has to be read in. The case ID,
given in column 35, is read first (5A1), followed by the group indicator in column 25 (I1).
The 56 item responses are read from the second line of data (56A1) after using the
"/" operator to move to the start of this line.
>GLOBAL DFNAME='EXAMPL05.DAT',NPARM=2;
>LENGTH NITEMS=(56);
>INPUT NTOTAL=56,SAMPLE=2000,NGROUPS=3,KFNAME='EXAMPL05.DAT',
NFNAME='EXAMPL05.DAT',NIDCHAR=5;
>ITEMS INUMBERS=(1(1)56),INAME=(M01(1)M56);
>TEST TNAME=MATH;
>GROUP1 GNAME='GRADE 4',LENGTH=20,INUMBERS=(1(1)20);
>GROUP2 GNAME='GRADE 6',LENGTH=20,INUMBERS=(19(1)38);
>GROUP3 GNAME='GRADE 8',LENGTH=20,INUMBERS=(37(1)56);
(T35,5A1,T25,I1/56A1)
ANSWER KEY
11111111111111111111111111111111111111111111111111111111
NOT-PRESENTED KEY
The following example illustrates the use of the TYPE=3 specification on the INPUT com-
mand to analyze aggregate-level, multiple-matrix sampling data. The data in exampl08.dat
are numbers tried and numbers correct for items from eight forms of a matrix sampled as-
sessment instrument. The groups are selected 8th-grade students from 32 public schools.
The first record for each school contains the data for the items of a Number Concepts scale,
NUMCON, and the second record contains the data for items of an Algebra Concepts scale,
ALGCON. An answer key is not relevant for aggregate-level data in number-tried, number-
right summary form. Note the format statement for reading the two sets of eight number-
tried, number-right observations from the two data lines. Again, the “/” operator is used to
move to the start of the second line of data for each school.
SCHOOL 1 NUM 1 0 3 2 2 1 4 4 3 2 2 1 4 3 4 1
SCHOOL 1 ALG 1 0 3 1 2 0 3 2 3 2 2 1 4 1 4 0
SCHOOL 2 NUM 5 3 4 4 3 2 3 3 2 2 4 3 4 3 5 3
SCHOOL 2 ALG 5 2 4 2 3 2 3 2 2 2 4 2 4 2 5 3
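A variable format statement for these records might take the following shape (a sketch; the
field widths and column positions are illustrative rather than taken from the actual example
setup):
(12A1,8(F2.0,F2.0)/T13,8(F2.0,F2.0))
This reads the school identification and the eight number-tried, number-right pairs for the
NUMCON scale from the first record, then tabs past the identification field of the second
record and reads the eight pairs for the ALGCON scale.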
The next example illustrates the use of BILOG-MG with multiple groups and multiple sub-
tests. Based on previous test performance, examinees are assigned to two groups for adap-
tive testing. Out of a set of 45 items, group 1 is assigned items 1 through 25, and group 2 is
assigned items 21 through 45; thus, there are 5 items linking the test forms administered to
the groups.
Twenty of the 25 items presented to group 1 belong to subtest 1 (items 1-15 and 21-25).
Twenty items also belong to subtest 2 (items 6-25). Of the 25 items presented to group 2, 20
belong to subtest 1 (items 21-40) and 20 to subtest 2 (items 21-25 and 31-45).
In all, there are 35 items from the set of 45 assigned to each subtest. (This extent of item
overlap between subtests is not realistic, but it illustrates that more than one subtest can be
scored adaptively provided they each contain link items between the test forms.)
Note that, in this case, the item responses on the second line of data for each examinee rep-
resent responses to different items. When we previously considered these data, the response
in the first column of the second line represented the response to item 1, regardless of group
membership. Here, that response would be the response to item 1 for a member of group 1,
but the response to item 21 for an examinee from group 2.
>GLOBAL DFNAME='EXAMPL03.DAT',NPARM=2,NTEST=2,SAVE;
>SAVE SCORE='EXAMPL09.SCO';
>LENGTH NITEMS=(35,35);
>INPUT NTOTAL=45, SAMPLE=2000, NGROUP=2, KFNAME='EXAMPL03.DAT', NALT=5,
NFORMS=2,NIDCHAR=5;
>ITEMS INUMBERS=(1(1)45), INAME=(C01(1)C45);
>TEST1 TNAME=SUBTEST1, INAME=(C01(1)C15,C21(1)C40);
>TEST2 TNAME=SUBTEST2, INAME=(C06(1)C25,C31(1)C45);
>FORM1 LENGTH=25,INUMBERS=(1(1)25);
>FORM2 LENGTH=25,INUMBERS=(21(1)45);
>GROUP1 GNAME=POP1,LENGTH=25,INUMBERS=(1(1)25);
>GROUP2 GNAME=POP2,LENGTH=25,INUMBERS=(21(1)45);
(T35,5A1,T25,I1,T25,I1/45A1)
Default
None.
Related topics
INPUT AND OUTPUT FILES
Input files
The following data files contain problem information that must be supplied by the user as
needed. Any text editor that writes an ASCII file may be used to prepare these files.
File Keyword
Note:
The assignment of specific names to these files in the INPUT command causes the program
to read external files.
These files may be combined into one file, using the order above. To do so, construct an
arbitrarily named file consisting of the answer key (if any), the not-presented key (if any),
the omit key (if any), and the item-response data, and assign the name of that combined file
to each of the corresponding keywords. Section 10.5 illustrates the combination of an
answer key and a not-presented key within the data file.
The keys and the data records must have the same fixed-column formats.
The fields of the data records are read in the following order:
The respondent identification field (up to 30 columns of characters as specified by the
NIDCHAR keyword on the INPUT command).
The form number (only if NFORMS>1).
The group number or numbers (integer) (only if specified by a value larger than 1 for the
NGROUP keyword of the INPUT command).
A real-valued (with decimal point) case weight for the respondent or frequency for a re-
sponse pattern (only if specified by the NWGHT keyword of the GLOBAL command).
The individual item-response records or patterns.
The type of entries in the item-response field is determined by the TYPE keyword of the
INPUT command and by the presence or absence of the KFNAME keyword of the INPUT
command:
if KFNAME is not present, the item responses are scored 1 = correct and 0 = not correct.
if KFNAME is present, the item responses are arbitrary single ASCII characters, the correct
alternatives of which appear in the same columns of the answer key.
In either of the above types of data, not-presented items may be coded by an arbitrary
character defined in the corresponding column of the not-presented key. (See the
NFNAME keyword of the INPUT command in Section 2.6.9.)
Similarly, omitted items may be coded by another character defined in the corresponding
column of the omit key. (See the OFNAME keyword of the INPUT command.)
The path to and filename of any of these files may be longer than 80 characters. Because
the maximum length of any line in the command file is 80 characters, multiple lines must
then be used. Enclose the name and path in apostrophes, continue up to and including
column 80, and then proceed in column 1 of the next line, as shown below:
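For example (the path shown here is illustrative; in an actual command file the first line
must be filled through column 80 exactly):
DFNAME='C:\IRT PROJECTS\SPRING ASSESSMENT\BILOG ANALYSES\SIMULATION STUDY\EXAMPL
06.DAT'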
If the data are stored in the same folder as the command file, it is sufficient to type
DFNAME='EXAMPL06.DAT'
Related topics
Output files
Through use of the keywords on the SAVE command, the following output files may be created.
Related topics
Keyword: SCORE
This file is created during Phase 3 of the program if SCORE is specified in the SAVE command. It
consists of the title records and two records per subtest for each respondent.
Records Description
1&2 In 20A4/20A4 format, the title records of the BILOG-MG run that created the
ability score file.
3+ Two records per subtest for each respondent, containing the following infor-
mation
First record
Second record
8 – 15 A8 subtest name
Related topics
Keyword: ISTAT
This file contains all classical item statistics computed and printed by Phase 1 of the program.
The following items are written to this external file in the same format as used in the result out-
put from Phase 1, *.ph1:
Related topics
Keyword: DIF
Records Description
1&2 In 20A4/20A4 format, the title records of the BILOG-MG run that created the
DIF parameter file
3+ Three sets of item records for each subtest. The first set contains the unadjusted
item threshold parameters and s.e.s for each group.
The second set contains adjusted threshold parameters and s.e.s for each group.
The last set contains estimates of group differences in adjusted threshold pa-
rameters.
First record
8 – 10 2X blank filler
11 – 18 A8 item name
19 – 20 2X blank filler
Second record
First record
8 – 10 2X blank filler
11 – 18 A8 item name
19 – 20 2X blank filler
Second record
First record
8 – 10 2X blank filler
11 – 18 A8 item name
19 – 20 2X blank filler
Second record
Related topics
Keyword: DRIFT
This file is saved during Phase 2 if DRIFT is specified on the SAVE command. It consists of
title records and two records for each item. The format is as follows:
Records Description
1&2 In 20A4/20A4 format, the title records of the BILOG-MG run that created the
DRIFT parameter file
First record
9 – 10 2X blank filler
11 – 21 F11.5 Intercept
Second record
Related topics
Keyword: COVARIANCE
This file is created by Phase 2 of the program and passed to Phase 3, where item information
indices are added if requested. It contains title records and the item parameter estimates at
the conclusion of Phase 2 and the added item information indices at the conclusion of Phase
3. The format is as follows:
Records Description
1&2 In 20A4/20A4 format, the title records of the BILOG-MG run that created the
covariance file.
First record
9 – 16 A8 subtest name
17 – 21 I5 group indicator
Second record
Related topics
Keyword: EXPECTED
This file is created by Phase 2 of the program. It contains expected sample sizes, expected
numbers of correct responses, expected proportions of correct responses, standardized
posterior residuals, and model proportions of correct responses. These values are evaluated
at each quadrature point and item. The format for each item at each quadrature point is as
follows:
Records Description
1&2 In 20A4/20A4 format, the title records of the BILOG-MG run that created the
expected file
First record
9 – 10 2X blank filler
11 – 15 I5 group indicator
16 – 17 2X blank filler
Second record
9 – 10 2X blank filler
11 – 15 I5 group indicator
16 – 17 2X blank filler
Third record
9 – 10 2X blank filler
11 – 15 I5 group indicator
16 – 17 2X blank filler
Fourth record
9 – 10 2X blank filler
11 – 15 I5 group indicator
16 – 17 2X blank filler
Fifth record
9 – 10 2X blank filler
11 – 15 I5 group indicator
16 – 17 2X blank filler
Sixth record
9 – 10 2X blank filler
11 – 15 I5 group indicator
16 – 17 2X blank filler
Seventh record
9 – 10 2X blank filler
11 – 15 I5 group indicator
16 – 17 2X blank filler
Remark:
If more than five quadrature points are used, each record is duplicated with the same format.
If there is more than one group, the item information is presented for each group. Sets of re-
cords within an item are separated by single-dashed lines. Sets of records between items are
separated by double-dashed lines.
Related topics
Keyword: PARM
This file is saved during Phase 2 of the program if PARM is specified in the SAVE command.
The file contains the item parameter estimates and other information. The format is as fol-
lows:
Records Description
1&2 In 20A4/20A4 format, the title records of the BILOG-MG run that created
the item parameter file
3 In 2I4 format, the number of subtests and the total number of items
appearing in this file
4 In 20I4 format, the numbers of items in the main and variant subtest on as
many records as necessary.
5+ One record for each item in the main and variant subtests (if any), contain-
ing the following information
9 – 16 A8 subtest name
27 – 36 F10.5 intercept s. e.
47 – 56 F10.5 slope s. e.
67 – 76 F10.5 threshold s. e.
87 – 96 F10.5 dispersion s. e.
Related topics
Keyword: POST
This file is created by Phase 2 of the program. It contains title records and, for each
respondent, the identification and group numbers, the case weight, and the marginal
posterior probability of the response pattern. The format of each respondent's record is as
follows:
Records Description
1&2 In 20A4/20A4 format, the title records of the BILOG-MG run that created the
posterior file
3+ Two records for each response pattern, containing the following information
First record
6 – 10 5X blank filler
Second record
9 – 10 2X blank filler
21 – 25 5X blank filler
Related topics
Keyword: TSTAT
This file contains all summary item and test information computed and printed by Phase 3 of the
program. The following items are written to this external file in the same format as used in the
result output from Phase 3, *.ph3:
The following items are written only if the appropriate INFO keyword on the SCORE command
has been specified:
Related topics
3 PARSCALE
The PARSCALE program, developed in the early 1990s by Eiji Muraki (then of Educational
Testing Service) and R. Darrell Bock (University of Chicago), implements a powerful extension
of Item Response Theory (IRT) measurement methods, ranging from binary-item analysis to
multiple-category and rating-scale items.
PARSCALE was originally developed with large-scale social surveys and educational assess-
ments in mind. More recently, however, the program has become a popular tool for a wider
variety of applications, seeing use by governmental statistical agencies, marketing researchers,
policy and management consultants, and investigators in many of the classical assessment
fields (psychological, sociological, educational, medical). Its flexibility and the wealth of
information it can provide have kept it in regular use by researchers around the world.
The program can handle a great diversity of data types. The simple survey is probably the most
common of these. In such a case, items are rated in a common set of categories (known to behav-
ioral scientists as a “Likert”-type scale). Whereas the original Likert approach assigned arbitrary,
successive integer values to the categories, the IRT procedures implemented in PARSCALE es-
timate optimal, empirical values for the boundaries between categories. These boundaries, as
well as item locations and respondent scores, can all be represented as points along the latent di-
mension of measurement. Tests that utilize this type of data might be behavioral surveys in
which the answers are “always,” “sometimes,” “often,” or “never”; expressions of opinion such
as “agree,” “undecided,” or “disagree”; or ratings of status, as perhaps a physician using
“critical,” “stable,” “improved,” or “symptom-free” as levels of evaluation.
For instruments of assessment, PARSCALE can also be used to analyze rating-scale items (such
as open-ended essay questions) and multiple-choice items. With multiple-choice, simple “right-
wrong” scoring and analysis is achieved by treating items as if only two categories are available
(collapsing all wrong choices into a single category). However, if more information is desired,
the choices can remain separated within each item so that the identity of the chosen alternative is
retained during the analysis. In this way, information on wrong responses can be recovered for
detailed analysis. The effects of guessing can also be included in the analysis.
Often an instrument will consist of a mixture of item types, some having common categories and
some with unique categories. PARSCALE handles this kind of diversity by allowing items to be
assigned to “blocks” within which the item categories are common. Any item that has unique
category definitions will be assigned to its own block. An educational test, for example, may
contain open-ended exercises rated in five categories in one block and multiple-choice items in
another block.
PARSCALE’s multiple-group capability adds the options of Differential Item Functioning (DIF)
analysis for trends between groups or over time, and Rater’s-Effect analysis in order to allow for
rater bias or differences in rater severity. PARSCALE for Windows allows for both easier ma-
nipulation of the command (syntax) file and more efficient review of the output files.
This section describes those elements in the user’s interface that may not be immediately clear to
the user or that behave in a somewhat nonstandard way.
At the center of the interface is the main menu bar, which adapts to the currently active function.
For example, when you start the program, the menu bar shows only the menu choices File, View,
and Help.
However, as soon as you open a PARSCALE output file or any other text file (by using the File
menu), the Windows and Edit options show up on the menu bar. At the same time, the File
menu choices are expanded with selections like Save and Save As. In addition, the View menu
now includes a Font option following the Status Bar and Toolbar options.
The opening of an existing PARSCALE command (*.psl) file, or starting a new one, adds addi-
tional choices to the main menu bar: the Output, Run, and Workspace menus.
Note that you can open only one command file at a time. If you want to paste some part from an
existing command file in your current one, opening the old file will automatically close the cur-
rent one. After you copy the selection you want to the clipboard, you have to reopen the *.psl file
for pasting.
Note also that, by choosing “All Files (*.*)” in the Open File dialog box, score files, parameter
files, or other files created during the run can be reviewed.
3.1.2 Workspace
The Workspace option on the main menu bar provides access to a dialog box that shows the cur-
rent values that are reserved for the numeric and the character workspace.
The defaults are 50 Kbytes for the character workspace and 200 Kbytes for the numeric workspace.
Most problems will run with these settings. If there is insufficient workspace for an analysis to
finish, the program will alert you with a message box and you will find a message at the end of
the output file. For example:
When you encounter such a message, increase the workspace and run the problem again. Re-
member that the changes remain in effect until you change the settings again. Allocating too
much workspace may slow down your analysis, or other programs that are running simultane-
ously, so increase the workspace in reasonable steps. If a run is successful, the program reports at
the end of the output file how much memory it actually used. The values are reported in bytes
and you should divide them by 1024 to arrive at the values for the numbers used in the Work-
space dialog box.
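For example (the figure here is illustrative), if the end of the output file reports that
3,276,800 bytes of numeric workspace were used, the corresponding value for the
Workspace dialog box is 3,276,800 / 1024 = 3200 Kbytes.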
The Run menu gives you the option to run All Phases of the program or to run them one at
a time. If you opt for the latter, remember that the different program phases build on each
other: you need calibration (Phase 2) before you can do the scoring (Phase 3). For this
reason, the program interface disallows running the phases out of order.
If you have a new (or changed) command file, initially only Phase 0, Phase 1 and All Phases are
enabled on the Run menu.
When you run an analysis by clicking on one of the options under the Run menu, the current
command file will first be saved, if you made any changes. You can easily tell if a command file
has changed by looking at the filename above the menu bar. An asterisk after the filename shows
that the current file has changed but has not been saved yet.
Once all phases have been completed, the Plot option, providing access to the graphics proce-
dure described in Chapter 6, is enabled.
By using the Output menu, you can open the output files for the four different program phases,
named with the file extensions ph0, ph1, ph2, and ph3, respectively. Always check the end of
each output file for the message NORMAL END. If it is absent, something went wrong, and the
output file should contain some information about the cause.
The Font option on the View menu displays the Font dialog box with the fonts that are available
on your system. You may use different fonts for command and output files. At installation, they
are both set to a special Arial Monospace font that ships with the program. To keep the tables in
the output aligned, you should always select a monospace or fixed pitch font where all the char-
acters in the font have the same width. Once you select a new font, that font becomes the default
font. This gives you the option to select a font (as well as font size and font style) for your com-
mand (*.psl) files that is different from the one for your output (*.ph*) files as a quick visual
reminder of the type of file.
The Window menu is only available when you have at least one file open. You can use the Ctrl-
Tab key combination to switch between open files, or use the Window menu to arrange the open
files (cascade, tile). If you have several or all output (*.ph*) files open for a particular analysis,
you could use the Window menu to arrange them for convenient switching.
PARSCALE uses the command conventions of other IRT programs published by SSI. Com-
mands employ the general syntax:
>NAME KEYWORD1=n, KEYWORD2=(list), …, OPTION1….
A greater-than sign (>) must be entered in column 1 of the first line of a command and
followed without a space by the command name.
All command names, keywords, options, and keyword values must be entered in UPPER
CASE.
Command names, keywords, and options may be entered in full or abbreviated to the first
three characters.
At least one space must separate the command name from any keywords or options.
All keywords and options must be separated by commas.
The equals sign is used to set a keyword equal to a value, which may be integer, real, or
character. A real value must contain a decimal point. A character string must be enclosed
in single quotes if it does not begin with a letter, if it contains blanks or special
(non-alphanumeric) symbols, or if it is a filename.
Example:
DFNAME='EXAMPL01.DAT', TNAME='20-ITEMS'.
A keyword may be vector valued; i.e., set equal to a list of integer, real, or character constants,
separated by commas or spaces, and enclosed in parentheses.
If the list is an arithmetic progression of integer or decimal numbers, the short form,
first(increment)last, may be used. Thus, a selection of items 1,3,7,8,9,10,15 may be entered as
1,3,7(1)10,15. Real values may be used in a similar way.
If the values in the list are equal, the form value(0)number-of-values may be used. Thus,
1.0,1.0,1.0,1.0,1.0 may be entered as 1.0(0)5.
The italic elements in the command format description are variables that the user needs to
replace.
Command lines may not exceed 128 columns. Continuation on one or more lines is per-
mitted. See Section 3.2.6 for more information.
Filenames, including the directory path, may not exceed 128 characters.
Each command terminates with a semicolon (;). The semicolon functions as the command
delimiter: it signals the end of one command and the beginning of the next.
Related topics
For information on the order of commands and keywords associated with each command, please
see Section 3.2.1.
The table below lists all available PARSCALE commands in their necessary order. This order is
also used in the remainder of this section of the user’s guide. Commands marked as “required”
must appear in the command file for each problem setup. All other commands are optional. In
other words, at a minimum the command file should start with two TITLE lines, followed by the
FILES, INPUT, TEST (or SCALE), BLOCK, CALIB, and SCORE command lines. Note that INPUT
and the variable format statement may be followed by data. The variable format statement is also
required in the command file when raw data are read in from an external file.
Note that, in the remainder of this chapter, the commands are discussed in alphabetical order, and
not in the required order as shown below.
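As an illustration, a minimal command file might look as follows (a sketch only; the data
file name, test name, and keyword values are illustrative):
EXAMPLE RUN: TEN GRADED ITEMS
(SECOND TITLE LINE)
>FILES DFNAME='MYDATA.DAT';
>INPUT NTOTAL=10,NIDCHAR=5;
(5A1,10A1)
>TEST TNAME=SCALE1,NBLOCK=1,ITEMS=(1(1)10);
>BLOCK BNAME=BLOCK1,NITEMS=10,NCAT=4;
>CALIB GRADED,LOGISTIC,NQPT=30,CYCLES=(100,1,1,1,1,1);
>SCORE EAP,NQPT=30;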
(variable format statement)

TEST     TNAME=n, NBLOCK=n, ITEMS=(list), INAME=(list),
         INTERCEPT=(list), THRESHOLD=(list), SLOPE=(list)

BLOCK    BNAME=(list), NITEMS=n, NCAT=n, ORIGINAL=(list),
         MODIFIED=(list), CNAME=(list), CADJUST=n,
         CATEGORY=(list), GPARM=(list), GUESSING=(list),
         SCORING=(list), REPEAT=n, SKIP=(list), RATER=(list),
         CSLOPE, NOCADJUST

MGROUP   GNAME=(list), GCODE=(list), DIF=(list), REFERENCE=n,
         COMMON=(list)

MRATER   RNAME=(list), RCODE=(list), RATER=(list)
Notes
A series of commands from TEST to QUADS should be repeated for the number of subtests,
specified by the NTEST keyword in the INPUT command.
The BLOCK command should be repeated for the number of blocks, specified by the
NBLOCK keyword on the TEST (or SCALE) command. Repetition of the BLOCK commands
can be shortened by utilizing the REPEAT keyword on the BLOCK command.
The COMBINE command is optional and must be placed at the end of the PARSCALE com-
mand file.
Related topics
BLOCK COMMAND
(Required)
Purpose
To provide a block name, and to identify the items that belong to block j in subtest or sub-
scale i.
Format
Notes
There should be as many BLOCK commands as the total number of blocks specified with
the NBLOCK keyword on each TEST (or SCALE) command. These BLOCK commands are re-
quired commands.
Each of the BLOCK commands provides a block name (BNAME), the number of items in the
block (NITEMS), the number of categorical responses that those items share (NCAT), and the
identification of those items. Categorical responses of the raw data are assumed to be
coded as consecutive integers, such as 1, 2, 3, and so forth. (Notice that the first categori-
cal response is coded 1 instead of 0.) Use the ORIGINAL keyword to describe categorical
responses that are coded differently in the input file.
The ORIGINAL and MODIFIED keywords may be used to re-order or concatenate the origi-
nal categorical responses in the block. See the examples in Chapter 11.
The user may supply the initial values of the parameters for the estimation phase with the
CATEGORY keyword.
Block names or category names that
do not begin with a letter, or
contain blanks and/or special (non-alphanumeric) symbols, or
consist of more than 8 characters
are not permitted.
Related topics
BNAME keyword
Purpose
To provide the block name, which may be up to eight characters in length. If the REPEAT
keyword is used, all values of the keywords including the block name are replicated for sub-
sequent blocks. A user can supply unique block names for those replicated blocks by using
the BNAME keyword.
Format
Default
Supplied by program
Related topics
CADJUST keyword
Purpose
To control the location adjustment: n sets the mean of the category parameters.
Format
CADJUST=n
Default
0.0.
Related topics
CATEGORY keyword
Purpose
To provide initial category parameter values for the estimation process. If the CATEGORY
keyword is supplied, but no values are specified, the constant values from “scores for
ordinal or ranked data” (Statistical Tables for Biological, Agricultural, and Medical
Research, R. A. Fisher & F. Yates, p. 66) serve as the default initial values of the category
parameters.
Format
Default
Supplied by program.
Related topics
CNAME keyword
Purpose
To provide names for the response categories in the block.
Format
CNAME=(list)
Default
Blanks.
Related topics
CSLOPE option
Purpose
To request the estimation of a single common slope parameter for all items in the block.
Format
CSLOPE
Related topics
GPARM keyword
Purpose
To provide guessing parameters, which are used only for the correction of dichotomous
item response probabilities unless GUESSING is specified. If GUESSING is specified, these
guessing parameters are used as the initial parameter values.
Format
GPARM=(list)
Default
0.0.
Related topics
GUESSING keyword
Purpose
To request the use of the item-response model with a lower asymptote (guessing) parameter
g: P* = g + (1 - g)P for the k-th response category, and P* = (1 - g)P for the others. The
lower asymptote (guessing) parameters are estimated if ESTIMATE is specified; otherwise,
the probabilities of categorical responses are only corrected by fixed parameter values,
supplied by the item-parameter file or by the GPARM keyword on the BLOCK command.
Format
GUESSING=(n,FIX/ESTIMATE)
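For example (an illustrative value), GUESSING=(0.2,ESTIMATE) supplies 0.2 as the starting
value for the lower asymptote and requests its estimation, while GUESSING=(0.2,FIX)
applies the correction with the value held fixed.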
Default
Related topics
MODIFIED keyword
Purpose
To provide a list of integers corresponding to the original response codes. The first category
should correspond to n = 1, not n = 0. The number of arguments should be equal to NCAT.
The program automatically computes the number of response categories after the
modification specified by the MODIFIED keyword. If some categories are collapsed, making
the modified number less than NCAT, the modified number is used to read the CNAME and
CATEGORY keywords.
Format
MODIFIED=(list)
Default
1 through NCAT
Related topics
NCAT keyword
Purpose
To specify the number of response categories shared by the items in the block.
Format
NCAT=n
Default
2.
Related topics
NITEMS keyword
Purpose
To specify the number of items in the block.
Format
NITEMS=n
Default
Related topics
NOCADJUST option
Purpose
To omit the adjustment provided by the CADJUST keyword during the calibration.
Format
NOCADJUST
Related topics
ORIGINAL keyword
Purpose
To provide a list of the original categorical response codes (up to four characters each). The
number of arguments should be equal to NCAT.
Format
ORIGINAL=(list)
Default
1 through NCAT
Related topics
RATER keyword
Purpose
To provide the ratio of the rater variance to the error variance for each item. This ratio is
used for the correction of the information function per item.
If n1 is specified but no other values are given, the unspecified values default to n1 (the
first value).
Format
RATER=(list)
Default
n1 = 0
Related topics
REPEAT keyword
Purpose
To request the repetition of a BLOCK command. The ij-th BLOCK command will be
automatically repeated n times. This keyword may be used to estimate different category
values for each item (Samejima's model).
Format
REPEAT=n
Default
0.
Related topics
SCORING keyword
Purpose
To specify the scoring function of the partial credit models using scoring function values.
Values can be fractional.
Format
SCORING=(list)
Default
Related topics
SKIP keyword
Purpose
To skip the parameter estimation for this particular block and use the parameter values sup-
plied by a user or the program.
n1 : If the estimation of the slope parameters needs to be skipped, set this value to one,
otherwise 0.
n2 : If the estimation of the threshold parameters needs to be skipped, set this value to one,
otherwise 0.
n3 : If the estimation of the category parameters needs to be skipped, set this value to one,
otherwise 0.
n4 : If the estimation of the lower asymptote parameters needs to be skipped, set this value
to one, otherwise 0.
If the keyword SKIP appears without arguments, all of the parameter estimations are
skipped, that is, SKIP=(1,1,1,1). If no SKIP keyword appears, none of the parameter esti-
mations is skipped, that is, SKIP=(0,0,0,0).
Format
SKIP=(n1,n2,n3,n4)
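For example, SKIP=(0,0,1,0) skips only the estimation of the category parameters; the
slope, threshold, and lower asymptote parameters are still estimated.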
Related topics
The four categorical responses are coded as A, B, C, and D and the user wants to concate-
nate the categories A and B as the first category. Note that NCAT specifies the number of
categories before the modification.
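A BLOCK command along these lines would accomplish this (the block name and the item
count are illustrative):
>BLOCK BNAME=BLOCK1,NITEMS=5,NCAT=4,ORIGINAL=(A,B,C,D),MODIFIED=(1,1,2,3);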
The four categorical responses are coded as 1, 2, 3, and 4 and the user wants to reverse the
order of the categories. The ORIGINAL keyword is not really needed in this case, because it
specifies the default. Note the single quotes around the specified block name, due to the
presence of the hyphen.
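An illustrative command for this case might be:
>BLOCK BNAME='BL-REV',NITEMS=5,NCAT=4,ORIGINAL=(1,2,3,4),MODIFIED=(4,3,2,1);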
CALIB COMMAND
(Required)
Purpose
To control the item and category parameter estimation and to specify prior distributions on
the parameters for subtest or sub-scale i.
Format
Notes
Related topics
ACCEL/NOACCEL option
Purpose
To specify whether or not the acceleration routine should be used after each cycle of the EM
iterations. ACCEL specifies that it will be used, while NOACCEL specifies that it will not be
used.
Format
ACCEL/NOACCEL
Default
NOACCEL
CRIT keyword
Purpose
To specify the convergence criteria for the iterative estimation.
Format
CRIT=(j,k,l,m,n,o)
CSLOPE option
Purpose
To request the estimation of a single common slope parameter for all items in the subtest.
Format
CSLOPE
CYCLES keyword
Purpose
To specify the maximum and minimum numbers of EM cycles and inner iterations:
d The maximum number of EM cycles. (Default 10, if LENGTH < 50 (see INPUT
command, Section 3.2.7); 5, otherwise)
e The maximum number of inner EM iterations of item and category parameter
estimation. (Default 1)
f The maximum number of inner EM iterations of category parameter estimation.
(Default 1)
g The maximum number of inner EM iterations of item parameter estimation.
(Default 1)
h The maximum number of inner EM iterations of the multiple rater parameter es-
timation. (Default 1)
i The minimum number of the inner EM iterations of item and category parame-
ter estimation. (Default 1)
Format
CYCLES=(d,e,f,g,h,i)
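For example (illustrative values), CYCLES=(50,1,1,1,1,1) allows up to 50 EM cycles, with
a single inner iteration of each kind per cycle.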
Related topics
DIAGNOSIS keyword
Purpose
To request diagnostic output and to specify its level, from zero (no diagnostic output)
through 6 (maximum diagnostic output).
Diagnostic output of higher numbers includes the printout of the lower ones. n = 3 or higher
is not recommended for normal use.
Format
DIAGNOSIS=n
Default
0, no diagnostic output
DIST keyword
Purpose
To designate the type of prior distribution specified for the ability distribution in the popula-
tion of respondents.
n = 1: Uniform distribution
n = 2: Normal on equally spaced points
n = 3: Normal on Gauss-Hermite points
n = 4: User supplied
Format
DIST=n
Default
ESTORDER option
Purpose
To reverse the estimation order of the EM cycles. This implies that the item parameters will
be estimated before the category parameters, rather than the other way around.
Format
ESTORDER
FLOAT option
Purpose
To specify that the means of the prior distributions on the item parameters are estimated by
marginal maximum likelihood, along with the parameters. If this option does not appear, the
means are kept fixed at their specified values during estimation.
Format
FLOAT
Remark:
Standard deviations of the priors are fixed in either case. This option should not be invoked
when the data set is small and the items few. The means of the item parameters may drift in-
definitely during the estimation cycles under these conditions.
FREE keyword
Purpose
If the DIF model is chosen, a prior latent trait distribution is normally used for each sub-
group. If the FREE keyword is specified, the posterior distribution is substituted for the
prior distribution.
If this keyword is specified with numerical values of t and u, the multiple posterior distri-
butions are rescaled to mean t and standard deviation u. If NOADJUST is specified for
either argument, no rescaling will be done with respect to mean or standard deviation or
both. The defaults are rescaling with t = 0.0 and u = 1.0.
If the third argument is COMBINED, the multiple posterior distributions are combined and a
total distribution is rescaled. Otherwise, only the reference group is rescaled to mean t
and standard deviation u and other groups are adjusted accordingly.
If the fourth argument is specified, the MLE scores are computed and used for the poste-
rior distributions. The MLE option is not generally recommended.
Format
Default
GPRIOR option
Purpose
To request the use of prior distributions on the lower asymptote (guessing) parameters.
Format
GPRIOR
GRADED/PARTIAL option
Purpose
To specify the response model to be used: GRADED specifies the graded response model, and
PARTIAL specifies the partial credit model.
Format
GRADED/PARTIAL
ITEMFIT keyword
Purpose
To specify the number of frequency score groups to be used for the computation of item-fit
statistics. If the ITEMFIT value specified is greater than NQPT, the NQPT value specified will
replace the ITEMFIT value.
Format
ITEMFIT=n
Default
None
Related topics
LOGISTIC/NORMAL option
Purpose
To specify the response function metric to be used: LOGISTIC specifies that the natural met-
ric of the logistic response function is used in all calculations, while NORMAL specifies the
use of the metric of the normal response function (normal ogive model). This choice is ef-
fective only if the graded response model is used. For the partial credit model, only the lo-
gistic response function is available.
Format
LOGISTIC/NORMAL
NEWTON keyword
Purpose
To specify the maximum number of Newton-Gauss (Fisher scoring) iterations following the
EM cycles.
Format
NEWTON=n
Default
0.
NOCALIB option
Purpose
To request that the estimation of both the item and the category parameters be skipped.
This option permits tests to be scored from previously estimated parameters (see the FILES
and INPUT commands in Sections 3.2.6 and 3.2.7).
Format
NOCALIB
Related topics
NQPT keyword
Purpose
To specify the number of quadrature points to be used in the EM and Newton estimation.
Format
NQPT=n
Default
30.
NRATER option
Purpose
To specify that the correction for the information function, specified with the RATER key-
word on the BLOCK command, is not to be used for calibration.
Format
NRATER
Related topics
POSTERIOR option
Purpose
To specify that the posterior distribution is recomputed after the M-step of each EM cycle,
in addition to the usual computation after the E-step. This allows the expected proportions
computed in each succeeding E-step to be based on an updated posterior distribution.
Format
POSTERIOR
PRIORREAD option
Purpose
To specify the use of the slope, threshold, and category parameter priors specified by the
user in the PRIORS command.
Format
PRIORREAD
Related topics
QPREAD option
Purpose
To specify that quadrature points and weights are to be read from the following QUADP
command. Otherwise, the program supplies the quadrature points and weights (and no
QUADP command follows).
Format
QPREAD
Related topics
QRANGE keyword
Purpose
To specify the lower (q) and upper (r) limits of the range of the quadrature points.
Format
QRANGE=(q,r)
Default
(-4.0, +4.0)
Note:
This keyword is effective only if DIST = 1 or 2 (see SCORE command, Section 3.2.14).
Related topics
RIDGE keyword
Purpose
To specify that a ridge constant is to be added to the diagonal elements of the information
matrix to be inverted during the EM cycles and the Newton iterations.
The ridge constant starts at the value 0.0 and is increased by v whenever the ratio of a pivot
to the corresponding diagonal element of the matrix is less than w.
Format
RIDGE=(v,w)
Default
No ridge.
SCALE keyword
Purpose
To specify the scale constant of the item response function.
Format
SCALE=n
Default
1.0 for the normal ogive item response model; 1.7 for the logistic item response model.
SKIPC option
Purpose
To request that the estimation of the category parameters be skipped.
Format
SKIPC
SPRIOR option
Purpose
To request the use of prior distributions on the slope parameters.
Format
SPRIOR
THRESHOLD option
Purpose
To specify that the item location parameter for a dichotomous item is to be estimated di-
rectly as a threshold. Otherwise, an intercept parameter is estimated and converted to a
threshold. It is only effective for dichotomously scored items.
Format
THRESHOLD
TPRIOR option
Purpose
To request the use of prior distributions on the threshold parameters.
Format
TPRIOR
COMBINE COMMAND
(Optional)
Purpose
To compute a weighted combination of the subscale scores.
Format
Notes
The keyword COMBINE on the INPUT command establishes the number of COMBINE com-
mands that should be inserted here, if any. Each of these COMBINE commands gives the name
for the combined score and the weights corresponding to the subscale scores. The number of
weight constants is the same as the total number of subscales (the total number of SCORE
commands). Specific subscores may be excluded from the combined score by entering a
zero for that subscore.
Related topics
NAME keyword
Purpose
To provide a name for the combined score.
Format
NAME=character string
Default
Blank.
WEIGHTS keyword
Purpose
To specify the weights for combining of the subscores. The subscores are combined linearly.
For sums and means: a set of positive fractions with decimal points, summing to 1.0, for
weights of subscale scores.
For DIF: a set of fractions with decimal points, summing to 0.0, for weights of subscale
scores.
Format
WEIGHTS=(n1,n2,...,nn);
Default
None.
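For example (the name and weights are illustrative), with three subscales the command
>COMBINE NAME=TOTAL,WEIGHTS=(0.5,0.5,0.0);
forms a combined score from the first two subscale scores and excludes the third.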
COMMENT COMMAND
(Optional)
Purpose
To enter one or more lines of explanatory remarks into the program output stream.
Format
>COMMENT ...text...
...text...
Notes
This line and all subsequent lines preceding the FILES command will be printed verbatim in
the initial output stream. The maximum length of each line is 80 characters. A semicolon to
signal the end of the command is not needed. Comments are optional.
Example:
>COMMENT
Data for this example are from the study described by Doran, et al., in
the April, 1992, Science Teacher. The ratings of the student’s laboratory
reports with different numbers of graded categories are assigned to dif-
ferent blocks. Categories 1 and 2 of the item in block 4, which had low
frequency of use, were collapsed in the modified category assignments. Be-
cause of the limited … are estimated and saved.
>FILES DFNAME='EXAMPL04.DAT',SAVE;
Default
No comments.
Related topics
FILES COMMAND
(Required)
Purpose
To specify the input files for the analysis.
Format
Notes
The master and calibration files are binary files and created by the program. They can be
saved for reuse by specifying the MASTER and CALIB keywords on the SAVE command, re-
spectively. Otherwise, they are automatically deleted at the end of the analysis.
Other files are ASCII (plain text) files and their specifications are described in Section
3.3.1 of the manual.
FILES is a required command.
If filenames are supplied, the files must already exist.
Names must be enclosed in single quotes.
The maximum length of filenames is 128 characters, including the directory path, if
needed. Note that each line of the command file has a maximum length of 80 characters.
If the filename does not fit on one line of 80 characters, the remaining characters should
be placed on the next line, starting at column 1.
The original response data are recoded into a binary form and saved in the master file
(MFNAME). If the SAMPLE keyword on the INPUT command is specified, the additional bi-
nary file, the calibration file (CFNAME), is created and the responses of the randomly sam-
pled respondents are saved in this calibration file. The calibration file is used for the item
parameter estimation. For the scoring of respondents, however, the master file is used and
all respondents’ scores are computed. This option shortens the calibration stage, but still
computes all respondents’ scores. If only the sampled respondents need to be scored, the
user must specify the SAMPLE keyword on the SCORE command. If no SAMPLE keyword on
the INPUT command is specified, only the master file is created, and it is used for both the
calibration and scoring phases.
To read data from a previously prepared master file, specify the MFNAME keyword instead
of the DFNAME keyword. If an existing item-parameter file is specified by the IFNAME key-
word, and the NOCALIB option is invoked in the CALIB command for the test, scores for the
test will be computed from the previously estimated parameters in the IFNAME file.
Example
Related topics
CFNAME keyword
Purpose
To specify the name of the binary calibration file (see the Notes above).
Format
CFNAME=<'filename'>
Default
Supplied by program.
DFNAME keyword
Purpose
To specify the name of the raw data file. This file contains the original data.
Format
DFNAME=<'filename'>
Default
Command file contains the raw data after the format code(s).
IFNAME keyword
Purpose
To specify the name of an existing item-parameter file.
Format
IFNAME=<'filename'>
Default
Supplied by program.
MFNAME keyword
Purpose
To specify the name of the binary master file (see the Notes above).
Format
MFNAME=<'filename'>
Default
Supplied by program.
NFNAME keyword
Purpose
To specify the name of the file containing the not-presented key.
Format
NFNAME=<'filename'>
Default
Blank.
OFNAME keyword
Purpose
To specify the name of the file containing the omit key.
Format
OFNAME=<'filename'>
Default
Blank.
SAVE option
Purpose
To indicate that additional output files are requested. If this option is present, then the SAVE
command must follow the FILES command. Otherwise, the next command is the INPUT
command. In other words, this option has to be specified if you want to save any or all of the
intermediate output files; the specific output files are selected with the following SAVE
command.
Format
SAVE
Related topics
INPUT COMMAND
(Required)
Purpose
To describe the original data file and to supply other information used in all three phases of
the program.
Format
Notes
Related topics
COMBINE keyword
Purpose
To specify the number of COMBINE commands that will be used to compute weighted score
combinations (see Section 3.2.4) in the case of multiple subtests or subscores.
Format
COMBINE=n
Default
No combined scores.
Related topics
GROUPLEVEL option
Purpose
To indicate that group-level frequency data will be used as input instead of the default single
respondent data (see Section 3.3.1). Note that this option is not available for the Raters-
effect model.
Format
GROUPLEVEL
Related topics
INOPT keyword
Purpose
To specify the nature of group-level input records (note that this applies only if the
GROUPLEVEL option has been specified). The possible values for INOPT are:
1: Categorical responses
2: Not-presented categorical responses plus frequencies
3: Omit categorical responses plus frequencies
4: Not-presented plus Omit categorical responses plus frequencies
5: A series of categorical response code plus its frequency
Format
INOPT=n
Default
1.
Related topics
LENGTH keyword
Purpose
To specify the number of items in each subtest or subscale. If there is only one subtest (the
default), the format LENGTH=n may be used.
Format
LENGTH=(n1,n2,...,na)
Default
NTOTAL.
Related topics
MGROUP/MRATER keyword
Purpose
To specify the number of subgroups. The keyword MGROUP should be specified if the DIF
model is used. MGROUP is the number of subgroups. In this case, an MGROUP command should
also be present in the command file, after the BLOCK command(s) and before the CALIB
command.
Format
MGROUP/MRATER=n
Default
Notes
Note that either MGROUP or MRATER can be specified, but not both.
The keyword MRATER should be used if the Raters-effect model is used, in which case
MRATER specifies the number of raters. If MRATER is specified, an MRATER command must
be present after the BLOCK command(s) and before the CALIB command.
Related topics
NFMT keyword
Purpose
To indicate the number of lines used for the format statement(s) that specify how to read the
original data records.
Format
NFMT=n
Default
1.
Related topics
NIDCHAR keyword
Purpose
To specify the number of characters in the respondent’s identification field, at least 1 and at
most 30 characters long.
Format
NIDCHAR=n
Default
30.
Related topics
NRATER keyword
Purpose
To specify the number of times each of k items is rated by each rater. Note that this keyword
can only be used when multiple raters rate examinees.
Note
When rater data are analyzed, data are read in a different format. See Section 3.2.17 for ex-
amples of variable format statements for such data.
Format
NRATER=(n1,n2,...,nk)
Default
1.
Related topics
NTEST keyword
Purpose
To specify the number of subtests or subscales.
Format
NTEST=n
Default
1.
Related topics
NTOTAL keyword
Purpose
To specify the total number of items in the original data records. The items for particular
subtests or subscales are selected from these items using the TESTi (or SCALEi) commands.
Format
NTOTAL=n
Default
0.
Related topics
R-INOPT keyword
Purpose
This keyword is exclusively used when examinees are rated by multiple raters. By default, it
is assumed that all the data for an examinee are given on the same line. If multiple lines are
used, n should be set to the number of lines containing information for an examinee.
Note
When rater data are analyzed, data are read in a different format. See Section 3.2.17 below
for examples of variable format statements for such data.
Format
R-INOPT=n
Default
R-INOPT=1.
Related topics
SAMPLE keyword
Purpose
To request a percentage (0-100) of respondents to be randomly sampled from the raw data
file.
Format
SAMPLE=n
Default
SAMPLE=100.
Related topics
TAKE keyword
Purpose
To request the analysis of only the first n respondents in the raw data file.
Format
TAKE=n
Default
WEIGHT option
Purpose
To indicate the presence of case weights. If this option is present, each input record has a
case weight. In each data record, the weight follows the case ID and precedes the item re-
sponses.
Format
WEIGHT
Related topics
The following INPUT command specifies a 160-item test divided into 16 subtests of 10
items each. The first fifteen characters of each record are for identification purposes, and
one format statement will follow describing each record.
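Such a command might read (a sketch based on the description above):
>INPUT NTOTAL=160,NTEST=16,LENGTH=(10(0)16),NIDCHAR=15,NFMT=1;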
The next example is a variation on the first, in that the data are now weighted. The WEIGHT
option specifies that each record will have a case weight; the weight follows immediately
after the case ID and, as the following format statement describes, has a field width of five
columns.
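An illustrative sketch of this setup, including the format statement, might be:
>INPUT NTOTAL=160,NTEST=16,LENGTH=(10(0)16),NIDCHAR=15,WEIGHT;
(15A1,F5.0,160A1)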
MGROUP COMMAND
(Optional)
Purpose
To provide names, codes, and related specifications for the multiple subgroups.
Format
Notes
This command is required if the MGROUP keyword is specified in the INPUT command.
Group names and group codes must be enclosed in single quotes if they do not begin with
a letter or if they contain blanks or special (non-alphanumeric) symbols. Note that group
codes in the data records do not need quotes, regardless of what characters are used.
Related topics
COMMON keyword
Purpose
To specify the positions of the common blocks for each subtest. Note that this can be used
only with the DIF model. A common block contains items for which the model parameters
are the same among the multiple groups in spite of the DIF model.
Format
COMMON= ( n1 , n2 ,...)
Default
None.
DIF keyword
Purpose
To specify the DIF model. If the value of an argument is 1, separate item parameters for the
multiple subgroups are estimated; if the value is 0, a common item parameter for the
multiple subgroups is obtained. Each position of the DIF arguments corresponds to a
particular item parameter: n1 the slope, n2 the threshold, n3 the category, and n4 the lower
asymptote (guessing) parameters.
Format
DIF=(n1,n2,n3,n4)
Default
DIF=(0,1,0,0).
GCODE keyword
Purpose
To specify the subgroup identification code, which appears in the data field of the original
response file (DFNAME) in the same order as the group names, up to four characters.
Format
GCODE=(list)
Default
Related topics
GNAME keyword
Purpose
To provide names for the subgroups.
Format
GNAME=(list)
Default
Related topics
REFERENCE keyword
Purpose
To specify the position of the reference subgroup in the GCODE list (i.e., the subscript of
the reference group in the list n1, n2, ..., nMGROUP). The parameter values for
other subgroups are adjusted to this reference subgroup. If REFERENCE=0, no reference sub-
group is set and no adjustment is performed. This keyword is used only for the DIF model.
Format
REFERENCE=n
Default
n=1.
MRATER COMMAND
(Optional)
Purpose
To provide names, codes, and weights for the raters.
Format
Notes
This command is required if the MRATER keyword is specified on the INPUT command.
Rater names and rater codes must be enclosed in single quotes if they do not begin with a
letter or if they contain blanks or special (non-alphanumeric) symbols. Note that rater
codes in the data records do not need quotes, regardless of what characters are used.
Related topics
RATER keyword
Purpose
To specify the raters’ weights. For the Raters-effect model, the ability score for each re-
spondent is computed for each subtest (or subscale) and each rater separately. A total score
of each respondent for each subtest (or subscale) is computed by summing those scores over
items within each subtest and all raters who have rated the respondent. The rater weights of
this keyword are used to compute the weighted subtest or subscale score for each re-
spondent.
Since the number of raters who rated each respondent’s responses may vary, the weights are
normalized (divided by their sum) for each respondent.
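For example, if a respondent was rated by three raters with weights 2.0, 1.0, and 1.0, the
normalized weights are 0.5, 0.25, and 0.25.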
Format
RATER=(list)
Default
n = 1.0.
Related topics
RCODE keyword
Purpose
To specify the rater identification code, which appears in the data field of the original re-
sponse file (DFNAME) in the same order as the rater names, up to four characters.
Format
RCODE=(list)
Default
Related topics
RNAME keyword
Purpose
To provide names for the raters.
Format
RNAME=(list)
Default
RATER001, RATER002, ….
Related topics
PRIORS COMMAND
(Optional)
Purpose
To supply the prior distributions on the slope, threshold, and lower asymptote (guessing)
parameters used when the PRIORREAD option is specified on the CALIB command.
Format
Notes
If the PRIORREAD option has been specified on the CALIB command, the PRIORS command
is required. Of course, since there should be as many CALIB commands as there are sub-
tests, the number and order of the PRIORS commands should mimic the CALIB commands.
The program assumes a normal prior distribution for the thresholds and a lognormal prior
distribution for the slopes.
Related topics
GMU keyword
Purpose
To specify the real-valued “alpha” parameters for the Beta prior distribution of the lower as-
ymptote (guessing) parameter.
Format
GMU=(n1,n2,...,nn)
Default
GSIGMA keyword
Purpose
To specify the real-valued “beta” prior parameters for the Beta prior distribution of the
lower asymptote (guessing) parameter.
Format
GSIGMA=(n1,n2,...,nn)
Default
SMU keyword
Purpose
To specify the means of the prior distributions on the slope parameters.
Format
SMU=(n1,n2,...,nn)
Default
SOPTION option
Purpose
To indicate that the means and the standard deviations for the slope priors are already in
the natural-log (base e) metric.
Format
SOPTION
Default
SSIGMA keyword
Purpose
To specify the standard deviations of the prior distributions on the slope parameters.
Format
SSIGMA=(n1,n2,...,nn)
Default
TMU keyword
Purpose
To specify the means of the prior distributions on the threshold parameters.
Format
TMU=(n1,n2,...,nn)
Default
TSIGMA keyword
Purpose
To specify the standard deviations of the prior distributions on the threshold parameters.
Format
TSIGMA=(n1,n2,...,nn)
Default
QUADP COMMAND
(Optional)
Purpose
To specify that user-supplied quadrature points and weights, or points and ordinates of the
discrete finite representation of the prior ability for subtest or subscale i are provided.
Format
Notes
If the QPREAD option has been specified on the CALIB command, the QUADP command is re-
quired. Of course, since there should be as many CALIB commands as there are subtests, the
number and order of the QUADP commands should mimic the CALIB commands.
Related topics
POINTS keyword
Purpose
To provide a set of NQPT (on CALIB command) real-numbered values (with decimal points)
of the quadrature points of the discrete distribution.
Format
Default
Related topics
WEIGHTS keyword
Purpose
To supply a set of NQPT (on CALIB command) positive fractions (with decimal points and
summing to 1.0) for weights of probabilities of points in the discrete distribution.
Format
Default
Related topics
QUADS COMMAND
(Optional)
Purpose
To specify that user-supplied quadrature points and weights, or points and ordinates of the
discrete step-function representation of the scale scores for the respondents on subtest or
subscale i are provided.
Format
Notes
If the QPREAD option has been specified on the SCORE command, the QUADS command is re-
quired. Of course, since there should be as many SCORE commands as there are subtests, the
number and order of the QUADS commands should mimic the SCORE commands.
Related topics
POINTS keyword
Purpose
To specify a set of NQPT (on SCORE command) real-numbered values (with decimal points)
of the quadrature points of the discrete distribution.
Format
Default
Related topics
WEIGHTS keyword
Purpose
To specify a set of NQPT (on SCORE command) positive fractions (with decimal points and
summing to 1.0) for weights of probabilities of points in the discrete distribution.
Format
Default
Related topics
SAVE COMMAND
(Optional)
Purpose
To specify which output files are saved after the analysis.
Format
Notes
The master and calibration data files are saved in binary form. Other files are saved as ASCII (plain text) files; their formats are described in Section 3.4.1.
The SAVE command is required if the SAVE option on the FILES command has been entered.
There are no default filenames for this command.
If a specific name is supplied with a keyword, that particular output file will be saved after the analysis is completed.
If the same filename is used in both the FILES and the SAVE command, the existing file will be overwritten after it has been read. Thus, different filenames should be supplied for the IFNAME keyword on the FILES command and the PARM keyword on the SAVE command, to avoid replacing old item-parameter values with new ones.
Names must be enclosed in single quotes.
The maximum length of filenames is 128 characters, including the path, if needed. See Section 3.2.6 for more details.
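For example, a sketch that saves the item parameters, the scale scores, and the fit statistics (the file names are hypothetical):

>SAVE PARM='MYTEST.PAR', SCORE='MYTEST.SCO', FIT='MYTEST.FIT';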
Related topics
CALIB keyword
Purpose
To provide a filename for saving the calibration data file.
Format
CALIB=<'filename'>
Default
None.
Related topics
COMBINE keyword
Purpose
To provide a filename for saving the combined score file.
Format
COMBINE=<'filename'>
Default
None.
Related topics
FIT keyword
Purpose
To provide a filename for saving the fit statistics file.
Format
FIT=<'filename'>
Default
None.
Related topics
INFORMATION keyword
Purpose
To provide a filename for saving the item information file.
Format
INFORMATION=<'filename'>
Default
None.
Related topics
MASTER keyword
Purpose
To provide a filename for saving the master data file.
Format
MASTER=<'filename'>
Default
None.
Related topics
PARM keyword
Purpose
To provide a filename for saving the item parameter file.
Format
PARM=<'filename'>
Default
None.
Related topics
SCORE keyword
Purpose
To provide a filename for saving the score file.
Format
SCORE=<'filename'>
Default
None.
Related topics
SCORE COMMAND
(Required)
Purpose
Format
Notes
There should be as many SCORE commands as there are subtests, in the same order as the
TEST commands.
If a score file has been specified by the SCORE keyword on the SAVE command, all subject scores will be printed to the output file, whether or not the PRINT option on the SCORE command has been selected.
If the RESCALE option is present, the keywords SMEAN and SSD are rescaling constants. Let the rescaled score be θ* and the original score θ. Then θ* = sθ + t, where s is the scaling constant (SSD) and t is the location constant (SMEAN).
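For example, with SMEAN=50.0 and SSD=10.0 under the RESCALE option, an original score of θ = 1.2 is reported as θ* = 10.0 × 1.2 + 50.0 = 62.0.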
Related topics
DIST keyword
Purpose
To specify the type of prior distribution. This keyword is to be used when EAP scoring is se-
lected.
n = 1: Uniform distribution
n = 2: Normal on equally spaced points
n = 3: Normal on Gauss-Hermite points
Format
DIST=n
Default
2.
Related topics
EAP/MLE/WML option
Purpose
To select the method used to estimate the scale scores: expected a posteriori (EAP), maximum likelihood (MLE), or weighted maximum likelihood (WML) estimation.
Format
EAP/MLE/WML
Default
EAP
Related topics
FIT option
Purpose
To request the printing of fit statistics for score estimates for the group-level data. This option is not effective for individual response data.
Format
FIT
Related topics
ITERATION keyword
Purpose
In maximum likelihood scoring, stops the iterative solution when the changes are less than
i, or the number of iterations is greater than j.
Format
ITERATION=(i,j)
Default
(0.01, 20).
Related topics
NAME keyword
Purpose
Format
NAME=character string
Default
Test name
Related topics
NOADJUST option
Purpose
To suppress the calibration adjustment of the category parameter mean during scoring.
Format
NOADJUST
Related topics
NOSCORE option
Purpose
To suppress the scoring of the response records.
Format
NOSCORE
Related topics
NQPT keyword
Purpose
To set the number of quadrature points if EAP scoring has been selected.
Format
NQPT=n
Default
30.
Related topics
NRATER option
Purpose
To stop the correction for the information function (specified with the RATER keyword on the
BLOCK command) from being used for scoring.
Format
NRATER
Related topics
PFQ keyword
Purpose
To specify the response percentage to be moved to the immediately adjacent category to en-
able the computation of ML scale scores if the input data are group-level frequency data (see
the INPUT command, Section 3.2.7) and all item responses are in the lowest or highest
categories. The edited response records are printed out if DIAG=2 or higher on the
CALIBRATION command.
Format
PFQ=n
Default
None (permissible values of n are 1 to 99).
Related topics
PRINT option
Purpose
Format
QPREAD option
Purpose
To indicate that quadrature points and weights will be read from the QUADS command. Oth-
erwise, the program supplies the quadrature points and weights (and no QUADS command fol-
lows).
Format
QPREAD
Related topics
QRANGE keyword
Purpose
To specify the lower and upper limits, c and d, of the range over which the quadrature points are placed.
Format
QRANGE=(c,d)
Default
(-4.0, +4.0).
Related topics
RESCALE option
Purpose
To use the values specified for the keywords SMEAN and SSD as rescaling constants instead of
a mean and a standard deviation, respectively, of the sample distribution.
Format
RESCALE
Related topics
SAMPLE option
Purpose
To request that only the sampled subjects are scored (see the SAMPLE keyword on the INPUT
command, Section 3.2.7).
Format
SAMPLE
Related topics
SCORING keyword
Purpose
To specify the scoring function to be used for scoring. STANDARD specifies that the standard
scoring function (1.0, 2.0,…) is to be used, even if a different function is used for calibra-
tion. CALIBRATION specifies that the calibration function specified in the BLOCK commands
is to be used for scoring.
Format
SCORING=STANDARD/CALIBRATION
Default
STANDARD.
Related topics
SMEAN keyword
Purpose
To request that the original scale scores be rescaled such that the mean equals n.
Format
SMEAN=n
Default
No rescale.
Related topics
SSD keyword
Purpose
To request that the original scale scores be rescaled such that the standard deviation equals n.
Format
SSD=n
Default
No rescale.
Related topics
This example shows how an existing item-parameter file is used for scoring observations. Calibration is not needed; therefore, the NOCALIB option of CALIB has been invoked. Scoring will be done with maximum likelihood estimation, and the score distribution will be adjusted to the mean and the standard deviation specified with the SMEAN and SSD keywords, respectively.
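A sketch of the relevant commands (the file names are hypothetical, and other keywords are elided):

>FILES ..., IFNAME='OLD.PAR';
>CALIB ..., NOCALIB;
>SCORE MLE, SMEAN=50.0, SSD=10.0;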
Related topics
TEST/SCALE COMMAND
(Required)
Purpose
To identify the test or scale, or subtest or subscale i. The keyword NTEST on the INPUT command supplies the number of subtests or subscales. The same number of TEST (or SCALE) commands is expected. The order of these TEST (or SCALE) commands is the same as the order in which the subtest lengths are specified on the INPUT command. If there is only one test or scale, there is only one TEST command.
The locations of the items, the names of the items, and starting values for estimating the item parameters can also be supplied with the TEST (or SCALE) command.
Format
Notes
One TEST command is required for each subtest as specified by the NTEST keyword on the INPUT command. If there are no subtests (NTEST=1), only one TEST command is needed.
The order of the TEST commands is the same as the order used in the specification of the
length of each subtest on the INPUT command.
If the keywords INTERCEPT, THRESHOLD, or SLOPE are given without any arguments, the
values 0.0, 0.0, and 1.0 are used for the initial intercept, threshold, and slope parameters,
respectively. In this case, no initial values are computed by the program.
Test or item names that
o do not begin with a letter, or
o contain blanks and/or special (non-alphanumerical) symbols, or
o consist of more than 8 characters
are not permitted.
Related topics
INAMES keyword
Purpose
To specify a list of names (up to four characters each) for the items in this (sub)test or
(sub)scale.
Format
Default
Related topics
INTERCEPT keyword
Purpose
To provide real-numbered starting values (with decimal points) for estimating the item in-
tercepts. Starting values may be specified by INTERCEPT or THRESHOLD, but not by both.
Format
Default
Related topics
ITEMS keyword
Purpose
To supply a list of the serial position numbers of the items in the total response record.
Format
Default
1 through LENGTH.
Related topics
NBLOCK keyword
Purpose
To indicate the number of blocks of items that share common categorical parameters. When items are rated on a single Likert scale, for example, the number and meaning of their categories are the same, and all may be assigned to the same block. The items must be selected or rearranged on the TEST (or SCALE) command so that all items in block 1 precede those in block 2, which precede those in block 3, etc. (see the BLOCK command, discussed in Section 3.2.2).
Format
NBLOCK=n
Default
1.
Related topics
SLOPE keyword
Purpose
To specify real-numbered starting values (with decimal points) for estimating the item
slopes.
Format
Default
Related topics
THRESHOLD keyword
Purpose
To specify real-numbered starting values (with decimal points) for estimating the item
thresholds. Starting values may be specified by INTERCEPT or THRESHOLD, but not by both.
Format
Default
Related topics
TNAME keyword
Purpose
To provide a name for the test or scale, subtest or subscale i, up to eight characters.
Format
TNAME=character string
Default
Related topics
The first TEST command describes a subtest with the name “AUTORHET” consisting of one
block of items with the serial positions 1, 3, 5, 7, 9, 11, 13, 15, 17, and 19, and with names
like “A20R.”
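A sketch of that command (the INAMES keyword, which would supply the item names, is omitted here):

>TEST TNAME=AUTORHET, NBLOCK=1, ITEMS=(1,3,5,7,9,11,13,15,17,19);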
The command file will have sixteen TEST commands, as specified with NTEST on the INPUT
command. Note the order of the commands.
Related topics
TITLE COMMAND
(Required)
Purpose
To provide a label that will be used throughout the output to identify the problem run.
Format
...text...
...text...
Notes
The first two lines of the command file are title lines. If the title fits on one line, a second,
blank line should be entered before the next command starts. The text will be printed verba-
tim at the top of each output section, as well as at the start of some output files. The two title
lines are required at the start of the command file. No special delimiters (> or ;) are required.
Example:
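A sketch (the two lines are arbitrary descriptive text):

GRADED MODEL ANALYSIS OF A TEN-ITEM RATING SCALE
PILOT SAMPLE, FORM A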
VARIABLE FORMAT STATEMENT
The data layout must be described in a variable format statement. This statement is entered
within parentheses and immediately follows the INPUT command.
When data (labels, raw data, summary statistics) are used in fixed format, a format statement is
needed to instruct the program how to read the data.
(rCw) or (rCw.d),

where:

r  an optional repetition count,
C  a format code (for example, I for integer values, F for real values, A for character values, X to skip columns, and T to tab to a column),
w  the width of the field in columns, and
d  the number of decimal places.
The format statement should be enclosed in parentheses. Blanks within the statement are ig-
nored: (r C w. d) is acceptable. Anything after the right parenthesis and on the same line
is also ignored by the program, thus comments may be placed after the format statement.
The following example shows three ways to read five integers, with the same result:

(5I1)
12345

(5I2)
 1 2 3 4 5

(I1,I2,3I3)
1 2  3  4  5
The F-format requires the number of decimal places in the field description. If there are none (and eight columns), specify (F8.0); (F8) is not allowed. However, if a data value contains a decimal point, it overrides the location of the decimal point as specified by the general field description. If the general field description is given by (F8.5), then 12345678 would result in the real number +123.45678, but the decimal point in -1234.56 would not be changed. A field of only blanks results in the value zero. The plus sign is optional.
The “X” operator can be used to skip spaces or unused variables in the data file. For example, (F7.4,8X,2F3.2) informs the program that the data file has 21 columns per record. The first value can be found in the first seven columns (and there are four decimal places); then eight columns are skipped, and the second and third values are in columns 16 through 21, each occupying three columns (with two decimal places). Note that the ITEMS keyword on the TEST (or SCALE) command also allows selection and reordering of variables.
Another option is the use of the tabulator format descriptor T, followed by a column number n. For example, (F8.5,T61,2F5.1) describes three data fields: one in columns 1 through 8, with five decimal digits, and the next two in columns 61 through 65 and 66 through 70, both with one decimal digit. If the number n is smaller than the current column position, left-tabbing results. Left tabs can be unreliable on PC systems and should be used cautiously. A forward slash (/) in a format statement means “skip the rest of this line and continue on the next line.” Thus, (F10.3/5F10.3) or (F10.3,/,5F10.3) instructs the program to read the first variable on the first line, then to skip the remaining variables on that line and to read five variables on the next line.
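Putting these elements together, a sketch for a record with a four-character ID in columns 1-4, ten one-column item responses starting in column 11, and a weight in columns 22-26 (the layout is hypothetical):

(4A1,T11,10A1,T22,F5.2)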
Related topics
INPUT AND OUTPUT FILES
The following types of data can be used as input for a PARSCALE analysis:

o individual item response data, and
o group-level frequencies of categorical responses.

In addition to these, item parameter files from previous analyses may be used as input. The use of an omitted key file and a not-presented key file is also permitted. Each of these data types will now be discussed in turn.

Individual response data
Each record is read by a variable format statement supplied by the user. The following fields are
contained in each record.
Subgroup identification (format A<b>)
For a single-group model, this field should be omitted. For DIF multiple-group models, the subgroup code is read as characters. For the Rater-Effect model, the rater's code is read as characters. The length of the characters (b) must be less than eight, and the codes should be specified by the GCODE keyword on the MGROUP command. The maximum number of subgroups should be specified by the MGROUP keyword on the INPUT command.

Weight (format Fw.d)
If the WEIGHT option appears on the INPUT command, this weight field must be read, in floating-point format.
Notes
For a single-group model and a DIF multiple-group model, each respondent's responses are represented by a single response vector. For the DIF model, response vectors need not be sorted by subgroup; response vectors of all subgroups can be mixed. If the identification field is blank or the end of file is reached, the program terminates the input procedure.
Related topics
Group-level data
If the GROUP keyword appears in the INPUT command, the input data are assumed to be group-
level frequencies of categorical responses. Each record is read by a format statement supplied by
the user. The following fields are contained in each record.
INOPT=2:
INOPT=3:
INOPT=4:
The distinctions between input data streams for the single-group and multiple-group (DIF and
Rater-effect) models are the same as those in the individual response data discussed earlier.
Related topics
Item parameter file
See the format specification for the PARM file in the SAVE command.

Omitted key file
This file should contain a single record in the same format as the individual response data. The fields of identification, subgroup identification, and weight are not processed and do not need to be occupied. The current version of PARSCALE treats omitted responses as not-presented.
Not-presented file
This file should contain a single record in the same format as the individual response data. The
fields of identification, subgroup identification, and weight are not processed and do not need to
be occupied. For multiple-group models (DIF and Raters’ Effect), this file is particularly impor-
tant because for those situations, not all items are presented to all subgroups of respondents or
not all items are rated by all raters. If a not-presented code is present in the original data file and
this file is not specified, response records containing the code will be rejected.
Related topics
Apart from the four standard list output files produced (*.ph0, *.ph1, *.ph2, and *.ph3), the user can instruct the program to create additional output files, using the keywords on the SAVE command (CALIB, COMBINE, FIT, INFORMATION, MASTER, PARM, and SCORE).
In the combined score file, the first eight records form the file’s title lines.
The following specifications are repeated for all respondents or sampled respondents.
Format Description
Finally, the next specifications are repeated for each subtest (from 1 through NTEST) within each respondent (NTEST is the number of subtests specified by the NTEST keyword on the INPUT command).
Format Description
Related topics
The first four records of the fit statistics file describe the run as follows:
The following information is repeated for each group (from 1 through MGROUP):
Format Description
The information shown below, together with its format description, is written to the fit statistics
file for each block (from 1 through NBLOCK) within each group, and for each item (from 1
through NITEMS) within each block.
Format Description
Finally, the following information is repeated for each response category (1 through NCAT) within
each block (NCAT is the number of response categories of the current block):
Format Description
Related topics
Records 1 and 2 are the TITLE lines from the command file.
Format: (20A4,/,20A4)
Format Description
Notes:
1: Normal ogive graded response model with item and category parameters separated
2: Normal ogive graded response model with item-category parameters
3: Logistic graded response model with item and category parameters separated
4: Logistic graded response model with item-category parameters
5: Normal ogive partial credit model with item and category parameters separated
6: Normal ogive partial credit model with item-category parameters (not implemented)
7: Logistic partial credit model with item and category parameters separated
8: Logistic partial credit model with item-category parameters
Format: (30I5)
The rest of the data show the parameters grouped by block within each group. For each group
(from 1 through MGROUP), the subgroup name is listed first formatted as (A8). Note that for a sin-
gle-group or Rater's-Effect model there will be only one group name.
Format Description
Lastly, in the case of a Rater's-Effect model, the following rater information is provided for each
rater:
Format Description
Related topics
The information file begins with the TITLE lines from the command file in records 1 and 2.
Format: (20A4,/,20A4)
In the remainder of the file, item information is listed as follows: the results are grouped by
quadrature points (1 through NQPT), within items (1 through NITEMS), within blocks (1 through
NBLOCK), within groups (1 through MGROUP, or just 1 for a single-group or Rater's-Effect model):
Format Description
Related topics
Records: 1
Format Description
Format Description
Notes
The length of identification specified by the NIDCHAR keyword in the INPUT command is auto-
matically supplied for <a>A1.
Repeat for each rater (from 1 through the number of response vectors, NVEC) within each respondent. For a single-group model or a DIF model, this repeats only once (NVEC=1); for the Rater-Effect model, it repeats once for each rater who rated this particular respondent, so NVEC varies from respondent to respondent.
Format Description
Note that if the DIF model is used, the rater identification is the subgroup identification.
Format Description
(T22,'|',2X,
Note that this record is saved only if the original response data are frequency (group-level) data and the FIT option on the SCORE command has been given.
Related topics
4 MULTILOG
MULTILOG, written by David Thissen, is a computer program designed to facilitate item analy-
sis and scoring of psychological tests within the framework of Item Response Theory (IRT). As
the name implies, MULTILOG is for items with MULTIple alternatives and makes use of LOG-
istic response models, such as Samejima’s (1969) model for graded responses, Bock’s (1972)
model for nominal (non-ordered) responses, and Thissen & Steinberg’s (1984) model for multi-
ple-choice items. The commonly used logistic models for binary item response data are also in-
cluded, because they are special cases of the multiple category models. MULTILOG provides
Marginal Maximum Likelihood (MML) item parameter estimates for data in which the latent
variable of IRT is random, as well as Maximum Likelihood (ML) estimates for the fixed-effects
case. χ² indices of the goodness-of-fit of the model are provided. In IRT, the item parameter estimates are the focus of item analysis. MULTILOG also provides scale score estimates of the latent variable for each examinee or response pattern.
MULTILOG is best suited to the analysis of multiple-alternative items, such as those on multi-
ple-choice tests or Likert-type attitude questionnaires. It is the only widely available program
capable of fitting a wide variety of models to these kinds of data using optimal (MML) methods.
MULTILOG also facilitates refined model fitting and hypothesis testing through general provi-
sions for imposing equality constraints among the item parameters and for fixing item parame-
ters at a particular value. MULTILOG may also be used to test hypotheses about differential item
functioning (DIF; sometimes called “item bias”) with either multiple response or binary data,
through the use of its facilities to handle data from several populations simultaneously and test
hypotheses about the equality of item parameters across groups.
Although MULTILOG syntax can still be created and submitted in batch mode as was done with
previous versions, MULTILOG version 7.0 has new features designed to make the program
more user-friendly.
The user no longer has to create syntax using INFORLOG. The functionality of
INFORLOG and MULTILOG has been combined into a single executable file.
In addition, the MULTILOG syntax wizard described in Section 4.2 can be used to create
a skeleton command file that can then be edited according to the user’s needs.
This document describes those elements in the user’s interface that may not be immediately clear
to the user or that behave in a somewhat nonstandard way. Each element will be discussed in
turn in the following sections.
At the center of the interface is the menu bar, which adapts to the currently active function. For
example, when you start the program, the menu bar shows only the menu choices File, View,
and Help.
However, as soon as you open a MULTILOG output file (through the File menu), the Window and Edit menu choices show up on the menu bar. At the same time, the File menu choices expand with selections like Save and Save As, and the View menu now has a Font option after the Status bar and Toolbar choices.
Opening an existing MULTILOG command (*.mlg) file, or starting a new one, adds further
choices to the main menu bar: the Output and Run menus.
Note that you can open only one command file at a time. If you want to paste some part of an existing command file into your current one, opening the old file will automatically close the current one. After you copy the part you want to the clipboard, you have to reopen the *.mlg file for pasting.
The Run menu gives you the option to run the command file displayed in the main window.
When you run an analysis by clicking Run, the current command file will first be saved, if you
made any changes. You can easily tell if a command file has changed by looking at the filename
above the menu bar. An asterisk after the filename shows that the current file has changed, but
has not been saved yet. Once the analysis has been completed, the Plot option, providing access
to the graphics procedure, is enabled. For a description of the plots that can be produced, see
Chapter 6.
Through the Output menu you can open the list output, named with the file extension *.out. Always check the end of each output file to see if it reports NORMAL END. If it does not, something went wrong, and the output file should have some information on that.
The Window menu is only available when you have at least one file open. You can use the Ctrl-
Tab key combination to switch between open files, or use the Window menu to arrange the open
files (Cascade, Tile). If you have the output (*.out) file open for a particular analysis, you could
use the Window menu to arrange this file and the command file for convenient switching.
Clicking on the Font option on the View pull-down menu displays a dialog box with the fonts
that are available on your system.
You may use different fonts for command and output files. At installation, they are both set to a
special Arial Monospace font that ships with the program. To keep the tables in the output
aligned, you should always select a monospace or fixed pitch font where all the characters in the
font have the same width. Once you select a new font, that font becomes the default font. This
gives you the option to select a font (as well as font size and font style) for your command
(*.mlg) files that is different from the one for your list output (*.out) files as a quick visual re-
minder of the type of file.
The MULTILOG syntax wizard, used to create new MULTILOG command files, uses succes-
sive dialog boxes to generate the syntax. The boxes displayed during the process depend on the
user’s choices in previous boxes.
The New Analysis dialog box is used to select the type of problem and/or to create a new
MULTILOG command file. This dialog box is activated when the File, New option is selected
from the main menu bar.
The type of problem is specified by selecting one of the three mutually exclusive options in the
Select type of problem group box:
Enter the folder location and the name for the MULTILOG command file in the Folder location and File name edit boxes, respectively.
If the Fixed-theta Item Parameter Estimation option or the MLE or MAP Computation option is chosen, the Fixed Theta dialog box, in which you are asked about the reading of a fixed value of θ with the data, is activated once the OK button is clicked, followed by the Input Data dialog box. Selecting MML Item Parameter Estimation will activate the Input Data dialog box directly when OK is clicked.
If the Blank MULTILOG Command File option is selected, the Folder location and File
name for the new file should be provided in the appropriate fields of this dialog box. Clicking
OK in this case will open an editor window in which you can enter syntax manually.
Related topics
RANDOM, FIXED and SCORES options on the PROBLEM command (Section 4.4.7)
Fixed Theta dialog box (Section 4.2.2)
Input Data dialog box (Section 4.2.3)
The Fixed Theta dialog box is activated when you select the fixed-θ item parameter estimation or the MLE or MAP computation option in the New Analysis dialog box. It is used to indicate
whether a fixed value of θ should be read with the data. If the Yes radio button is clicked, the
position of the fixed value to be read must be indicated using the Data Format field in the Input
Data dialog box. Clicking the Back button will return the user to the New Analysis dialog box
while clicking the Next button will activate the Input Data dialog box.
Related topics
The Input Data dialog box is used to specify the type and location of the data to be analyzed.
You can enter the name of the data file in the Data file name field provided, or use the Browse button to browse for the file. The program automatically enters the name of the command file (specified in the New Analysis dialog box) with the file extension *.dat as the default name.
MULTILOG can handle three types of data, each associated with one of the mutually exclusive
options in the Type of data group box:
In all cases, the format statement describing the data must be entered in the Data Format field.
Depending on the option selected, different versions of the Input Parameters dialog box, re-
flecting the selection made here, will be displayed when the Next button is clicked.
Related topics
PATTERNS, INDIVIDUAL and TABLE options on the PROBLEM command (Section 4.4.7)
DATA keyword on the PROBLEM command
Input Parameters dialog box (Section 4.2.3)
Variable format statement (Section 4.4.14)
The Input Parameters dialog box is used to describe the contents of the data file to be analyzed.
The version of this dialog box displayed depends on the type of data specified in the Input Data
dialog box. In each case, the type of data previously selected is noted at the top of the Input Pa-
rameters dialog box. In general, this dialog box is used to indicate the number of items, groups,
tests, patterns, examinees and the number of characters in the ID field. You can use the Back
button to return to any of the previously completed dialog boxes. Clicking Next activates the
Test Model dialog box. All fields in these dialog boxes are associated with keywords on the
PROBLEM command, with the exception of the Number of tests field. This field is used by the
program to determine the number of tabs in the Test Model dialog box, displayed later in the
setup process.
When counts of response patterns are analyzed, the Input Parameters dialog box shown below
is used to provide the following information:
o The number of items. Previous limits on the number of items that can be analyzed have been removed in the current version of MULTILOG (NITEMS keyword on the PROBLEM command).
o The number of groups. Previously, a maximum of 10 groups could be used. This limit has also been removed (NGROUPS keyword on the PROBLEM command).
o The number of patterns. This field is only displayed when response pattern data are analyzed (NPATTERNS keyword on the PROBLEM command).
o The number of characters in the ID field. By default, it is assumed to be zero (NCHARS keyword on the PROBLEM command).
In the case of analysis of individual item response vectors, the same options as described above
are available, with one exception: the Number of patterns field is replaced by the Number of
examinees field. The number of examinees for which response vectors are available should be
entered in this field.
Related topics
NITEMS, NGROUPS, NEXAMINEES, NPATTERNS and NCHARS keywords on the PROBLEM com-
mand (Section 4.4.7)
Input Data dialog box (Section 4.2.3)
Test Model dialog box (Section 4.2.5)
For the analysis of a fixed-effects table of counts, only three fields need to be completed: the
number of items, the number of groups and the number of tests. The Input Parameters dialog
box for this type of analysis is shown below.
The Test Model dialog box is used to specify details for a subtest. The number of tabs in the
Test Model dialog box depends on the value entered in the Number of tests field in the Input
Parameters dialog box.
The model to be fitted to the data is specified in the Test Model group box. One of six mutually
exclusive options may be selected:
In the Test Items group box, the items to be analyzed are described. By default, no items are se-
lected. Clicking the All check box under the Use header will invoke the ALL option on the TEST
command, and all items will be included in the analysis. Also, unchecking any of the items will
uncheck the All check box. Clicking the check box next to an item will select or deselect an item.
Such a selection corresponds to use of the ITEMS keyword on the TEST command. In the image
above, all items have been included in the analysis.
For each item, the number of response categories must be specified under the Category heading.
By default, it is assumed that each item has two response categories. The admissible number of
categories is between 2 and 10. This column is only available for the graded model, nominal
model, and multiple-choice model and corresponds to the NC keyword on the TEST command.
Finally, the number of the highest category has to be indicated in the case of a nominal model.
Select either A for Ascending or D for Descending under the Order header to indicate the order
of categories in each case. The Order column is only available for the nominal and multiple-
choice models and corresponds to the HIGH keyword on the TEST command.
Click the Back button to return to the Input Parameters dialog box and the Next button to pro-
ceed to the Response Codes dialog box.
Related topics
L1, L2, L3, GR, NO and BS options on the TEST command (Section 4.4.11)
NC, HIGH and ITEMS keywords and ALL option on the TEST command
Input Parameters dialog box (Section 4.2.4)
Response Codes (Binary Data) dialog box (Section 4.2.6)
Response Codes (Non-Binary Data) dialog box (Section 4.2.7)
The Response Codes dialog box for binary data is used to provide information on the response
and missing codes and the answer key for the data to be used in the analysis.
The Response Codes field is used to list all possible codes occurring in the data. The Correct response codes field is used to provide an answer key for the total number of items to be analyzed. The Missing Code check box is checked when a value indicating “missing” for population membership other than the default 9.0 assumed by the program is used. Use the drop-down list box on the right to select the appropriate missing value code for the data.
After completing the dialog box, click Next to display a summary of the information entered.
Click Finish in this dialog box to generate the command file.
Related topics
The Response Codes dialog box for non-binary data is used to provide information on the re-
sponse codes and the answer key for the data to be used in analysis.
The Response Codes field is used to list all possible codes occurring in the data. The Correct
response code fields are used to provide an answer key for each of the items to be analyzed.
After completing the dialog box, click Next to display a summary of the information entered.
Click Finish in this dialog box to generate the command file.
Related topics
In this example, the generation of syntax that includes the reading of an external criterion is illus-
trated. For a complete discussion of the problem, please see Section 12.19.
The first step in creating a new command file using the syntax wizard is to select the New option
from the File menu to activate the New Analysis dialog box.
The type of problem and the name and location of the new command file are defined using the
New Analysis dialog box. As we wish to score (estimate θ ) in this run, the MLE or MAP
Computation option is selected in the Select type of problem group box. This selection corre-
sponds to the SCORES option on the PROBLEM command.
The location in which the new command file is to be stored is specified next. By default, the
folder in which MULTILOG has been installed will be displayed. This can be changed by either
typing an alternative path in the Folder Location field or by using the Browse button to the right
of this field. Finally, the name of the command file is entered in the File name field. In this case,
we want to create the command file knee.mlg in the (default) mlgwin folder. Click OK to con-
tinue with the syntax specification.
The Fixed Theta dialog box is now displayed, allowing you to include the reading of a fixed
value with the data. Click the radio button next to the Yes option to add the CRITERION option to
the PROBLEM command. Then click Next to go to the Input Data dialog box.
The Input Data dialog box is used to provide information on the position and contents of the
raw data file. By default, the Data file name will be assumed to be in the same folder and to
have the same filename as the new command file. This may be changed by either correcting the
entry in this field or by using the Browse button to the right of this field.
The variable format statement describing the contents of this file must be entered in the Data
Format field. Recall that the data are in the format:
40 1 0.5 2112111111112111111111111111111111
33 1 1.0 3113211111112122111111111111111111
33 1 2.0 4333211111113122111111111111111111
29 1 3.0 4543211111113122111111011111111111
As the data file contains individual data identification in the first 10 columns, an identification
field is required as the first entry in the variable format statement. The format statement shown
below reflects the position of the examinee identification field (10A1), the 34 item responses
(34A1) and the criterion (F4.0). MULTILOG will read the chronological age of each individual
and use that as a starting value for the iterative modal estimation procedure. The “T” format is
used to tab to the correct positions of the respective fields. The first ten characters on each re-
cord, which are read as an identification field, are also used to assign a value to the NCHARS key-
word on the PROBLEM command through the Input Parameters dialog box (see later in the ex-
ample). Note that if the NCHARS keyword is set to 0, no ID field needs to be included in the for-
mat statement.
Select the Individual item response vectors option from the Type of data group box. The se-
lection made in this case will add the INDIVIDUAL option of the PROBLEM command, while the
entry in the Data file name field will be used in conjunction with the DATA keyword on the same
command. The variable format entered in the Data format field will be echoed to the generated
command file. For more on format specification rules please see Section 4.4.14.
The parameters for the 34 indicators are in a file called knee.par. This file was produced by
MULTILOG in a (previous) calibration run. As the parameters for the SAVE and other optional
commands cannot be set using the syntax wizard, these commands will be added after generating
the command file. Instructions concerning this can be found at the end of this example. Having
completed the Input Data dialog box, click Next to go to the Input Parameters dialog box.
The problem uses 34 items, and this is indicated by setting the value in the Number of items field to 34. Data from 13 examinees are available, and this is specified using the Number of examinees field. Finally, the NCHARS keyword on the PROBLEM command is set to 10, as previously indicated in the variable format statement description, using the Number of characters in ID field.
Entries in this dialog box correspond to the NITEMS, NEXAMINEES, and NCHARS keywords on the PROBLEM command.
A graded model is used here, and is specified by clicking the radio button next to the Graded
model option in the Test model group box. This corresponds to the GR option on the TEST com-
mand.
The “test” has varying numbers of response categories for the 34 indicators, which are entered in
the NC list on the TEST command. As all items are used, the All check box in the Use column of
the Test Items group box is clicked, and the number of categories is set by item in the Category
column as shown below. Once the number of categories for each item has been indicated, click
OK to go to the Response Code (Non-Binary data) dialog box.
All possible responses in the data are entered in the Response Code field. The corresponding
correct response codes are entered in the Correct response codes group box. On each line, the
number of entries permitted corresponds to the number of items specified in the Input Parame-
ters dialog box.
Once the response codes (123450) are entered in the Response Code string field, these codes
appear as the first column of the Correct response code group box. For each response code and
each item, a category number is entered. Permissible values are 1, 2, …, NCAT, where NCAT de-
notes the total number of categories for a given item. In any row, a “0” indicates that the re-
sponse code value is excluded from the analysis. A valid entry for item 1, for example, is Code 1
= 5, Code 2 = 2, Code 3 = 3, Code 4 = 4, and Code 5 = 1. This entry specifies that a data value of 1 is assigned to the fifth category of item 1, while a data value of 5 is assigned to the first category.
Start by entering the correct responses for the first code, and press the Enter key on your keyboard when done to proceed to the next line of the window. Note that, if an attempt is made to specify response codes not in agreement with previous selections, no value will appear in this box. Only when valid codes are entered will the results be displayed. Once all codes have been entered, click OK to go to the Project Settings dialog box. Entries in the Response Code dialog box will appear after the END command in the generated command file.
The Project Settings dialog box displays a summary of all selections made up to this point. To
go back to any of the previous dialog boxes, the Back button may be used. To generate the syn-
tax, click Finish. Syntax generated using the wizard is now displayed in the main MULTILOG
window. Before running this problem, the following (optional) commands are added to the syn-
tax in this window by using standard Windows editing functions:
The START command is used to override the default starting values for all the item parameters
and enter others, in this case from the file knee.par.
Click the Run option on the main menu bar to start the analysis. Once the analysis has been
completed, the output generated may be viewed using the Output option on the same menu bar.
The output file will then be displayed in the main window, and the Window option may be used
to switch between syntax and output files.
For a description of the problem for which syntax is generated here, please see Sections 12.1 to
12.3.
Select the New option from the File menu to activate the New Analysis dialog box. In the New
Analysis dialog box, the type of problem and the name and location of the new command file are
defined. For the LSAT data, we wish to perform MML item parameter estimation. Note that this
corresponds to the RANDOM option on the PROBLEM command. Click on the MML Item Parame-
ter Estimation option.
Next, the location in which the new command file is to be stored is specified. By default, the
folder in which MULTILOG has been installed will be displayed. This can be changed by either
typing an alternative path in the Folder Location field or by using the Browse button to the right
of this field. Finally, the name of the command file is entered in the File name field. In this case,
we want to create the command file lsat6_2.mlg in the (default) mlgwin folder. Click OK to go
to the Input Data dialog box.
The Input Data dialog box is used to provide information on the position and contents of the
raw data file. By default, the Data file name will be assumed to be in the same folder and to
have the same filename as the new command file. This may be changed by either correcting the
entry in this field or by using the Browse button to the right of this field.
The variable format statement describing the contents of this file must be entered in the Data
Format field. For this example, recall that the data are of the form
1 00000 3
2 00001 6
3 00010 2
4 00011 11
As the data file contains patterns and frequencies, no identification field is required. The format
statement entered reflects the position of the pattern (5A1) and frequency (F4.0) only. In each
row, the first 4 columns are skipped, and this is indicated by the value “4” in combination with
the “X” operator. Select the Counts of response patterns option from the Type of data group
box as shown below. The selection made in this case will add the PATTERN option of the PROBLEM
command, while the entry in the Data file name field will be used with the DATA keyword on the
same command. The variable format entered in the Data format field will be echoed to the gen-
erated command file. For more on format specification rules please see Section 4.4.14. After
completing the Input Data dialog box, click Next to go to the Input Parameters dialog box.
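For reference, the format statement described above reads, as a sketch consistent with that description:

(4X,5A1,F4.0)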
The number of items, groups, tests, patterns/examinees, and characters in the identification field
are specified using the Input Parameters dialog box. When the dialog box is first displayed, all
entries are set to 1, assumed to be the default. For this problem, we need only indicate the num-
ber of items (5) and the number of patterns (32) in the Number of items and Number of pat-
terns fields respectively. Note that the buttons to the right of each of these fields may be used to
increase or decrease the value displayed in a particular field.
Entries on this dialog box correspond to the following MULTILOG keywords on the PROBLEM
command:
Number of items                   NITEMS
Number of groups                  NGROUPS
Number of patterns                NPATTERNS
Number of tests                   None; used to set the number of tabs on the Test Model dialog box.
Number of characters in ID field  NCHARS
The Test Model dialog box is used to describe the items assigned to the test(s) and the model to
be fitted. All entries on this dialog box correspond to keywords/options on the TEST command:
The Test Model group box corresponds to the choice of one of the following options: L1/L2/L3/GR/NO/BS.
Entries in the Use column of the Test items group box correspond to the ALL option and
ITEMS keyword on the TEST command.
The Categories and Order columns of the Test items group box (not used in this exam-
ple) correspond to the NC and HIGH keywords respectively.
The number of tabs displayed at the top of this dialog box depends on the entry in the Number
of tests field in the Input Parameters dialog box. This problem requires the use of all 5 items in
the data on a single test, so the All check box at the top of the Use column is checked to select all
items simultaneously. To select single items, the check boxes next to the items selected for inclu-
sion should be clicked individually.
At this point, the data file and the model specification are complete. All that remains to be done
is to indicate the response codes. To do this, click Next to go to the Response Codes (Binary
Data) dialog box.
The response patterns in the data file consist of combinations of “0” and “1” values. These two
values are entered in the Response Codes field, which should reflect all possible response codes
present in the data. No missing code is used here, so the Missing code check box is left un-
checked.
The response to each of the items that indicates the correct response is entered in the Correct
response codes field. Note that the number of entries allowed in this field is equal to the number
of items specified in the Input Parameters dialog box.
All entries in this dialog box are echoed to the command file and can be found directly after the
END command that is automatically added to the command file, but before the variable format
statement that is also written to the command file.
Problem specification is now complete. When the Next button is clicked on the Response Codes
dialog box, a list of the options specified is displayed in the Project Settings dialog box. To go
back to any of the previous dialog boxes, click the drop-down list button next to the Back button
and select from the list that will be displayed. To generate the command file, click Finish.
Once the Finish button has been clicked in the Project Settings dialog box, you are returned to
the main MULTILOG window, where the generated syntax is displayed. In this example, no
changes are needed but, if additional optional commands are to be used, you can insert such
commands in this window using standard Windows editing functions. To run the generated
command file, click the Run option on the main menu bar.
This example illustrates user input for a fixed-θ analysis. For a discussion of these data, see the previous section. To start the process, select the New option from the File menu. The New Analysis dialog box shown below will be displayed.
In the New Analysis dialog box, the type of problem and the name and location of the new command file are defined. As we wish to perform a fixed-θ analysis for the mouse data, the Fixed-theta Item Parameter Estimation option is selected by clicking on it. Note that this corresponds to the FIXED option on the PROBLEM command.
Next, the location in which the new command file is to be stored is specified. By default, the
folder in which MULTILOG has been installed will be displayed. This can be changed by either
typing an alternative path in the Folder Location field or by using the Browse button to the right
of this field. Finally, the name of the command file is entered in the File name field. In this case,
we want to create the command file mouse.mlg in the (default) mlgwin folder. Click OK to pro-
ceed with the specification.
As the Fixed-theta Item Parameter Estimation option was selected in the New Analysis dialog box, the Fixed Theta dialog box is displayed next. It is used to indicate whether a fixed value of θ should be read with the data. Clicking the Back button will return you to the New Analysis dialog box, while clicking the Next button will activate the Input Data dialog box.
Leaving the default entry (No) as it is displayed in this dialog box, click Next.
Recall that the data are in a file called mouse.dat, which contains the following four lines:
1 7 0 2 11
0 6 0 6 10
0 2 0 5 11
3 10 2 0 2
Each of the four lines of data represents one of four groups of mice; each group of mice repre-
sents a cell of a 2 x 2 experimental design. The response variable (measured on an ordinal scale)
is the severity of audiogenic seizures. The column categories are “crouching”, “wild running”,
“clonic seizures”, “tonic seizures”, and “death”.
The variable format statement describing the contents of this file must be entered in the Data
Format field. As the data file contains frequencies for each cell of the table, no identification
field is required and the format statement entered reflects the frequency in each cell (5F3.0)
only.
Select the Fixed-effect table of counts option from the Type of data group box as shown below
to indicate that cell frequencies from a table are used as input. The selection made in this case
will add the TABLE option of the PROBLEM command, while the entry in the Data file name field
will be used in conjunction with the DATA keyword on the same command. The variable format entered in the Data format field will be echoed to the end of the generated command file. For more on format specification rules, please see Section 4.4.14.
Having completed the Input Data dialog box, click Next to go to the Input Parameters dialog
box.
The Input Parameters dialog box reflects the selections made in previous dialog boxes. Only
the numbers of items, groups, and tests need to be specified for this type of problem. When the
dialog box is first displayed, all entries are set to 1, assumed to be the default. For this problem,
you need only indicate the number of items (1) and the number of groups (4) in the Number of
items and Number of groups fields respectively. Note that the buttons to the right of each of
these fields may be used to increase or decrease the value displayed in a particular field.
Entries in this dialog box correspond to the NITEMS and NGROUPS keywords on the PROBLEM command.
In the Test Model dialog box, only one tab is displayed. In addition, only one item is available
for inclusion on the test. This corresponds to the number of items and tests entered on the Input
Parameters dialog box.
As a graded model (corresponding to the GR option on the TEST command) is required, click the
Graded model radio button.
The item can be selected by either clicking the check box next to All or the check box next to “1”
in this case. The entries in the Use column correspond to the ALL option and ITEMS keyword re-
spectively.
In the case of a graded model, the number of categories must be specified. The presence of 5
categories is indicated using the buttons to the right of this field. This sets the value for the NC
keyword on the TEST command.
This completes the model specification, and clicking the Next button on the Test Model dialog
box now generates the syntax.
The generated syntax is displayed in the main MULTILOG window. To add additional optional commands to the syntax, use standard Windows editing functions. When done, click the Run option on the
main menu bar to start the analysis. The output generated during the analysis may be accessed
using the Output option after completion of the analysis.
OVERVIEW OF SYNTAX

In the table below, the MULTILOG commands are listed in the order in which they should appear in the command file. MULTILOG command files should have a *.mlg suffix. In the rest of this section, these commands are listed and discussed in alphabetical order.
Command    Status     Keywords and options

TITLE      Optional
TITLE      Required
PROBLEM    Required   RANDOM/FIXED/SCORE, NITEMS=n, NGROUP=n,
                      PATTERNS/INDIVIDUAL/TABLE,
                      NPATTERNS=n/NEXAMINEES=n, NCHARS=n,
                      CRITERION, NOPOP, DATA=filename;
TEST       Required   ALL/ITEMS=(list), L1/L2/L3/GRADED/NOMINAL/BS,
                      NC=(list), HIGH=(list);
EQUAL      Optional   ALL/ITEMS=(list)/GROUPS=(list), WITH=(list),
                      AJ/BJ/CJ/BK=(list)/AK=(list)/CK=(list)/DK=(list)/MU/SD;
ESTIMATE   Optional   NCYCLES=n, ITERATIONS=n, ICRIT=n, CCRIT=n,
                      ACCMAX=n, VAIM=n;
END        Required
FIX        Optional   ALL/ITEMS=(list)/GROUPS=(list), VALUE=n,
                      AJ/BJ/CJ/BK=(list)/AK=(list)/CK=(list)/DK=(list)/MU/SD;
PRIORS     Optional   ALL/ITEMS=(list)/GROUPS=(list), PARAMS=(n,n),
                      AJ/BJ/CJ/BK=(list)/AK=(list)/CK=(list)/DK=(list)/MU/SD;
SAVE       Optional
START      Optional   ALL/ITEMS=(list), PARAM='filename', FORMAT,
                      PARAM=file;
TMATRIX    Optional   ALL/ITEMS=(list), AK/CK/DK,
                      DEVIATION/POLYNOMIAL/TRIANGLE;
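As orientation, a minimal sketch of a complete command file for the MML pattern-count run of the earlier LSAT6 example (the title lines and file name are hypothetical; the response-code lines echoed from the Response Codes dialog box, which belong between >END; and the format statement, are omitted from this sketch):

LSAT6: 2PL CALIBRATION FROM PATTERN COUNTS
SECOND TITLE LINE
>PROBLEM RANDOM, PATTERNS, NITEMS=5, NGROUPS=1, NPATTERNS=32, DATA='LSAT6.DAT';
>TEST ALL, L2;
>END;
(4X,5A1,F4.0)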
A basic command file may be created using the MULTILOG interface. Values can be assigned
to the following keywords in this way:
Command   Keyword/option                Dialog box

PROBLEM   RANDOM/FIXED/SCORE            New Analysis, Fixed Theta
PROBLEM   PATTERNS/INDIVIDUAL/TABLE     Input Data
PROBLEM   NITEMS=n                      Input Parameters
PROBLEM   NGROUP=n                      Input Parameters
PROBLEM   NPATTERNS=n/NEXAMINEES=n      Input Parameters
PROBLEM   NCHARS=n                      Input Parameters
PROBLEM   DATA=filename                 Input Data
TEST      ALL/ITEMS=(list)              Test Model
TEST      L1/L2/L3/GRADED/NOMINAL/BS    Test Model
TEST      NC=(list)                     Test Model
TEST      HIGH=(list)                   Test Model
ESTIMATE  VAIM=n                        Response Codes (Binary Data)
-         Variable format statement     Input Data
END COMMAND
(Required)
Purpose
To signal the end of the MULTILOG command set; entries such as the response codes and the variable format statement follow this command.
Format
>END;
EQUAL COMMAND
(Optional)
Purpose
Format
AJ/BJ/CJ/BK/AK/CK/DK/MU/SD keyword
Purpose
The set of parameters is specified by one of the following mutually exclusive keywords: AJ,
BJ, CJ, BK=(list), AK=(list), CK=(list), DK=(list), MU, or SD.
Format
AJ/BJ/CJ/BK=(list)/AK=(list)/CK=(list)/DK=(list)/MU/SD
Related topics
ALL/ITEMS/GROUPS keyword
Purpose
The set of items is specified with one of the following: ALL, ITEMS=(list), or
GROUPS=(list).
Format
ALL/ITEMS=(list)/GROUPS=(list)
WITH keyword
Purpose
Format
WITH=(list)
Example 1
If the item parameters on the EQUAL command are to be set equal for all items in a set, the set
of items may be given as ALL if the equality constraint applies to all items on the test, or
ITEMS=(list) if the equality constraint is to be imposed for a subset of the items given in
the list. For example, the following sequence (a sketch) specifies the 1PL model:
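>TEST ALL, L2;
>EQUAL ALL, AJ;

Constraining all slope (AJ) parameters of the two-parameter logistic model to be equal in this way yields the 1PL model.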
Example 2
There are cases in which it is desirable to impose equality constraints between (a number of)
pairs of items; this is done by using WITH=(list) in conjunction with ITEMS=(list). A
one-to-one relationship between the items in the ITEMS list and the WITH list is required; the
parameters are made equal within the implied pairs. For example, a command of the following form (a sketch):
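>EQUAL ITEMS=(2,4), WITH=(1,3), AJ;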
has the effect of setting a_item1 = a_item2 and a_item3 = a_item4. When WITH is used in this way, the WITH list must refer to the lower-numbered item of each pair, as in the sketch above.
Example 3
For the parameter of BS items, if the WITH list is identical to the ITEMS list, equality con-
straints are imposed on the specified contrasts among the parameters within each item.
For example, a command of the following form (a sketch):
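>EQUAL ITEMS=(1,2,3,4), WITH=(1,2,3,4), DK=(1,2,3);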
sets the first three contrasts among the d_k's equal within each item for items 1 through 4. For four-alternative multiple-choice items, such as those considered in Section 12.6, this would have the effect of setting d2 = d3 = d4; the identifiability constraint that the sum of the d's must be one would then give d1 = 1 − d2 − d3 − d4. Similar forms may be used to impose constraints on the a_k's and c_k's. See Section 12.10 for further discussion of the use of the EQUAL command with the multiple-choice model.
The parameters of Gaussian population distributions may also be constrained if there are
several groups. The default arrangement fixes µ = 0 for the last group, as well as σ = 1 for
all groups. If there are three groups,
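a constraint such as the following sketch sets the means of the first two groups equal (the third remains fixed at 0 by default; the syntax is an illustration):

>EQUAL GROUPS=(1,2), MU;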
Related topics
ESTIMATE COMMAND
(Optional)
Purpose
To control the number of estimation cycles and the convergence criteria for the iterative solution.
Format
ACCMAX keyword
Purpose
Specifies the maximum value for the acceleration parameter; more negative is more acceleration.
Format
ACCMAX=n
Default
0.0.
CCRIT keyword
Purpose
Specifies the convergence criterion for the EM cycles.
Format
CCRIT=n
Default
0.001.
ICRIT keyword
Purpose
Specifies the convergence criterion for the M-step. It should always be smaller than CCRIT.
Format
ICRIT=n
Default
0.0001.
Related topics
ITERATIONS keyword
Purpose
A control parameter for the number of iterations in the M-step; the actual number of iterations is ITERATIONS × NP, where NP is the number of parameters being jointly estimated, which usually means the number of parameters for a particular item. For very large problems, it may be faster to set ITERATIONS to 2.
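For instance, a sketch for a large problem (values are illustrative):

>ESTIMATE NCYCLES=100, ITERATIONS=2;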
Format
ITERATIONS=n
Default
4.
NCYCLES keyword
Purpose
Specifies the maximum number of estimation cycles.
Format
NCYCLES=n
Default
25.
VAIM keyword
Purpose
Format
VAIM=n
Default
9.0.
Related topics
This keyword may be set through the Response Codes (Binary Data) dialog box (Section
4.2.6)
FIX COMMAND
(Optional)
Purpose
To fix the specified item or population parameters at a given value during estimation.
Format
The set of items is specified with one of the following: ALL, ITEMS=(list), or
GROUPS=(list).
The set of parameters is specified by one of the following mutually exclusive keywords: AJ, BJ, CJ, BK=(list), AK=(list), CK=(list), DK=(list), MU, or SD.
AJ/BJ/CJ/BK/AK/CK/DK/MU/SD keyword
Purpose
The set of parameters is specified by one of the following mutually exclusive keywords: AJ, BJ, CJ, BK=(list), AK=(list), CK=(list), DK=(list), MU, or SD.
Format
AJ/BJ/CJ/BK=(list)/AK=(list)/CK=(list)/DK=(list)/MU/SD
ALL/ITEMS/GROUPS keyword
Purpose
The set of items is specified with one of the following: ALL, ITEMS=(list), or
GROUPS=(list).
Format
ALL/ITEMS=(list)/GROUPS=(list)
Related topics
VALUE keyword
Purpose
This real constant is used to specify the value at which the parameter is to be fixed.
Format
VALUE=n.
For the 3PL model, the values are specified in “traditional 3PL, normal metric” form; for the other models, the actual values of the parameters or contrasts must be used. The parameters of Gaussian population distributions may also be fixed if there are several groups. The default arrangement fixes µ = 0 for the last group, as well as σ = 1 for all groups. If there are three groups,
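a sketch such as the following fixes the mean of the first group at zero as well (illustrative syntax):

>FIX GROUPS=(1), MU, VALUE=0.0;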
LABELS COMMAND
(Optional)
Purpose
To provide labels for the items.
Format
The set of items is specified with either the keyword ALL or ITEMS=(list).
ALL/ITEMS option
Purpose
The set of items for which labels are provided is specified with either the keyword ALL or ITEMS=(list).
Format
ALL/ITEMS=(list)
Related topics
NAMES keyword
Purpose
These labels are entered as a list; each label must have 4 or fewer characters.
Format
NAMES=(‘lab1’,’lab2’,…).
Related topics
PROBLEM COMMAND
(Required)
Purpose
To set up the problem and to specify the type of data MULTILOG is to expect.
Format
The class of the problem is specified by selecting one of the mutually exclusive options:
RANDOM/FIXED/SCORE. The type of input data is specified by selecting one of the mutually
exclusive options: PATTERNS/INDIVIDUAL/TABLE.
Related topics
The RANDOM, FIXED and SCORE options in the New Analysis dialog box may be used to
select the type of problem (Section 4.2.1)
The PATTERNS, INDIVIDUAL and TABLE options may be set using the Input Data dialog
box (Section 4.2.3)
The NITEMS, NGROUP, NPATTERNS, NEXAMINEES, NCHARS and DATA keywords may be accessed via the Input Parameters dialog box (Section 4.2.4)
The CRITERION option is added to the PROBLEM command by clicking “Yes” in the Fixed
Theta dialog box (Section 4.2.2)
CRITERION option
Purpose
If FIXED or SCORE is entered, you may specify, by including this keyword, that a fixed value of θ is to be read with the data as the criterion for fixed-θ item parameter calibration or as a starting value for computation of MLE[θ].
Format
CRITERION
Related topics
DATA keyword
Purpose
This keyword is used to enter the name and location of the raw data file. The name may be up to 128 characters in length and must be enclosed in single quotes. Note that each line of the command file has a maximum length of 80 characters. If the filename does not fit on one line of 80 characters, the remaining characters should be placed on the next line, starting at column 1.
Format
DATA=<‘filename’>
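For example, a PROBLEM command reading individual response data from an external file might look like this (a sketch; the values and filename are illustrative):

>PROBLEM RANDOM, INDIVIDUAL, NITEMS=20, NEXAMINEES=500, NCHARS=5,
DATA=‘MYDATA.DAT’;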
Related topics
The DATA keyword may be accessed via the Input Parameters dialog box (Section 4.2.4)
RANDOM/SCORE/FIXED option
Purpose
The class of problem is specified by selecting one of three mutually exclusive options.
Format
FIXED/RANDOM/SCORE
Related topics
The RANDOM, FIXED and SCORE options in the New Analysis dialog box may be used to
select the type of problem (Section 4.2.1)
PATTERNS/INDIVIDUAL/TABLE option
Purpose
The type of input data is specified by selecting one of three mutually exclusive options.
The PATTERNS option is used for pattern data (see also the NPATTERNS keyword on the PROBLEM command). The number of examinees (NEXAMINEES) and the number of patterns (NPATTERNS) are mutually exclusive options.
INDIVIDUAL is used for individual item response vectors (also see NEXAMINEES keyword
on PROBLEM command).
TABLE is used for a fixed-effect table of counts.
Format
PATTERNS/INDIVIDUAL/TABLE
Related topics
The RANDOM, FIXED and SCORE options in the New Analysis dialog box may be used to
select the type of problem (Section 4.2.1)
The type of data—pattern, individual or table—is specified on the Input Data dialog box
(Section 4.2.3)
PROBLEM command: NPATTERNS/NEXAMINEES keywords
NCHARS keyword
Purpose
To specify the number of characters in the ID field for individual response or pattern count
data (see INDIVIDUAL/PATTERN/TABLE option on PROBLEM command).
Format
NCHARS=n
Related topics
The NCHARS keyword may be accessed via the Input Parameters dialog box (Section
4.2.4).
PROBLEM command: INDIVIDUAL/PATTERN/TABLE option
NEXAMINEES/NPATTERNS keyword
Purpose
Used to indicate the number of patterns or examinees for which responses are present in the
data.
NPATTERNS gives the number of response patterns tabulated for pattern data (see also the PATTERNS option on the PROBLEM command).
NEXAMINEES gives the number of examinees for individual data; it is used with the INDIVIDUAL option on the PROBLEM command.
The number of examinees (NEXAMINEES) and the number of patterns (NPATTERNS) are
mutually exclusive options.
Format
NPATTERNS=n/NEXAMINEES=n
Related topics
The NITEMS, NGROUP, NPATTERNS, NEXAMINEES, NCHARS and DATA keywords may be accessed via the Input Parameters dialog box (Section 4.2.4)
PROBLEM command: INDIVIDUAL/PATTERN/TABLE options
PROBLEM command: NPATTERNS/NEXAMINEES keywords
NGROUP keyword
Purpose
Specifies the number of groups in the data.
Format
NGROUP=n
Related topics
The NGROUP keyword may be accessed via the Input Parameters dialog box (Section
4.2.4)
NITEMS keyword
Purpose
Specifies the number of items in the data.
Format
NITEMS=n
Related topics
The NITEMS keyword may be accessed via the Input Parameters dialog box (Section
4.2.4)
NOPOP option
Purpose
If SCORE is specified, the default is MAP estimation including the population distribution. If
no population distribution is desired, enter NOPOP. If NOPOP is entered, some MLE[θ]s may not be finite and the program may stop.
Format
NOPOP
Related topics
PRIORS COMMAND
(Optional)
Purpose
To impose normal prior distributions on the specified item or population parameters.
Format
The set of items is specified with one of the following: ALL, ITEMS=(list), or GROUPS=(list). The set of parameters is specified by one of the following mutually exclusive keywords: AJ, BJ, CJ, BK=(list), AK=(list), CK=(list), DK=(list), MU, or SD.
The parameters of the Gaussian prior distribution are entered using the PARAMS keyword. In
the special case of CJ or DK=1, indicating the asymptote for the 3PL model, the prior must be
specified for the logit of the asymptote, which is the parameter in MULTILOG. A standard
deviation of 0.5 works well for the asymptote.
AJ/BJ/CJ/BK/AK/CK/DK/MU/SD keyword
Purpose
The set of parameters is specified by one of the following mutually exclusive keywords: AJ, BJ, CJ, BK=(list), AK=(list), CK=(list), DK=(list), MU, or SD.
Format
AJ/BJ/CJ/BK=(list)/AK=(list)/CK=(list)/DK=(list)/MU/SD
ALL/ITEMS/GROUPS option
Purpose
The set of items is specified with one of the mutually exclusive options ALL, ITEMS=(list) or GROUPS=(list).
Format
ALL/ITEMS=(list)/GROUPS=(list)
PARAMS keyword
Purpose
Specify the mean and standard deviation of the normal prior to be imposed on the item pa-
rameter(s) as (mean, standard deviation).
Format
PARAMS=(n,n)
SAVE COMMAND
(Optional)
Purpose
To save the estimated item parameters, or the examinee or pattern scores, in an external file.
Format
>SAVE FORMAT;
The saved parameters may be used to restart the program or to score examinees, with the parameters read after a START command. When item calibration is performed, the parameters are saved to <jobname>.par. Scores obtained from scoring problems are saved to <jobname>.sco.
FORMAT option
Purpose
The present form of the parameter file differs from that of previous versions of the program. Users who wish to have MULTILOG save the estimated parameters in the previous style may insert the FORMAT option on the SAVE command. The program will then write a parameter file in the previous style, but the format of the parameter values will be 5F12.5 rather than 8F10.3; the new format must be used in formatted reading of the saved file.
If this option is not present, the file will be saved in free format.
Related topics
START COMMAND
(Optional)
Purpose
To override the default starting values for the item parameters and enter others.
Format
The set of item-related options is specified with either the ALL or ITEMS=(list) keyword.
ALL/ITEMS option
Purpose
The set of items is specified with one of the mutually exclusive options ALL or
ITEMS=(list).
Format
ALL/ITEMS=(list)
FORMAT option
Purpose
The present form of the parameter files differs from that of previous versions of the program. The FORMAT option is used in the processing of parameter files created with previous versions. When this option is present on the START command, the next line of the command file must contain the format statement for reading the parameter file. The statement (8F10.3) is the required format for previous-style parameter files. The filename is specified using the PARAM keyword.
Format
FORMAT
Related topics
PARAM keyword
Purpose
This keyword is used to give the name and location of an external file containing parameter values that should be used as starting values in the current analysis or for computing examinee or pattern scores. The filename can be up to 128 characters in length and should be enclosed in single quotes. Note that each line of the command file has a maximum length of 80 characters. If the filename does not fit on one line of 80 characters, the remaining characters should be placed on the next line, starting at column 1. This keyword is used in combination with the FORMAT option. If it does not appear, the parameters are assumed to be in the command file immediately following the START command.
Format
PARAM=<‘filename’>
Related topics
TEST COMMAND
(Required)
Purpose
To define the IRT model for a set of items. Note that ALL and ITEMS=(list) are mutually exclusive options, as are L1, L2, L3, GRADED, NOMINAL and BS.
Format
Related topics
The L1, L2, L3, GRADED, NOMINAL and BS options in the Input Parameters dialog box may
be used to select the type of model to be fitted to the data (Section 4.2.4).
The ALL option and ITEMS, NC and HIGH keywords may be set by using the Test Model
dialog box (Section 4.2.5).
ALL/ITEMS option
Purpose
The ALL option will select all of the items in the data as indicated in the variable format
statement.
The ITEMS keyword is used to select a subset of the items for inclusion in the subtest(s).
Format
ALL/ITEMS=(list)
Related topics
The ALL option or ITEMS keyword may be set using the Test Model dialog box.
L1/L2/L3/GRADED/NOMINAL/BS option
Purpose
Used to select the type of model to be fitted to the data. Note that L1, L2, L3, GRADED,
NOMINAL and BS are mutually exclusive options.
Format
L1/L2/L3/GRADED/NOMINAL/BS
Related topics
The L1, L2, L3, GRADED, NOMINAL and BS options in the Input Parameters dialog box may
be used to select the type of model to be fitted to the data (Section 4.2.4).
TEST command: NC and HIGH keywords
HIGH keyword
Purpose
This keyword is used to enter the number of the highest category for each item for the nominal models; this is usually the correct response on an ability test.
Format
HIGH=(list)
Related topics
The NC and HIGH keywords may be set by using the Test Model dialog box (Section 4.2.5)
TEST command: L1/L2/L3/GRADED/NOMINAL/BS options
TEST command: NC keyword
NC keyword
Purpose
This keyword is used to enter the number of response categories for each item. Note that the
nominal model cannot be used for binary (NC=2) data; use L2.
Format
NC=(list)
Default
(2(0)NITEMS).
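The default uses the list repetition notation of the command language, in which n(0)k repeats the value n k times; for example, NC=(4(0)10) would specify four response categories for each of ten items (an illustrative value).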
Maximum
Related topics
The NC and HIGH keywords may be set using the Test Model dialog box (Section 4.2.5)
TEST command: L1/L2/L3/GRADED/NOMINAL/BS options
PROBLEM command: NITEMS keyword (Section 4.4.7)
TGROUPS COMMAND
(Optional)
Purpose
To specify grouping on the θ-dimension, for quadrature in MML estimation or for fixed groups in the fixed-effects model.
Format
MIDDLE keyword
Purpose
Specifies the NUMBER of fixed groups at the values of θ as given in the list. It is used in the
context of fixed-effects estimation.
Format
MIDDLE=(list)
Related topics
NUMBER keyword
Purpose
Specifies the number of quadrature points for MML estimation, or the number of θ-groups for fixed-effects estimation.
Format
NUMBER=n
Maximum
150.
QP keyword
Purpose
Specifies the NUMBER of quadrature points, placed at values of θ as given in the list. It is
used in the context of MML random-effects estimation.
Format
QP=(list)
Related topics
TMATRIX COMMAND
(Optional)
Purpose
To specify the T-matrices that define the contrasts among the category parameters of the nominal or multiple-choice models.
Format
The set of item-related options is specified with either the ALL or ITEMS=(list) keyword. One of the three vectors of parameters of the nominal or multiple-choice model is specified with one of the following options: AK, CK or DK. One of the following three T-matrix options specifies the matrix: DE, PO or TR.
For example, the version of Masters’ (1982) model in which the slopes are not constrained to be equal for all of the items is given by a sequence that identifies the c-contrasts as the crossover points and the slope contrasts as polynomial, and fixes the quadratic and higher terms to zero. If, in addition, you enter a constraint that the slopes are equal across items, it becomes the MML version of Masters’ (1982) model. A sketch of such a setup follows.
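(A reconstruction consistent with the description above; the FIX list assumes four response categories per item:)

>TMATRIX ALL, CK, TRIANGLE;
>TMATRIX ALL, AK, POLYNOMIAL;
>FIX ALL, AK=(2,3), VALUE=0.0;

and, for the equal-slopes MML version,

>EQUAL ALL, AK=(1);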
Related topics
ALL/ITEMS option
Purpose
The set of items is specified with one of the mutually exclusive options ALL or
ITEMS=(list).
Format
ALL/ITEMS=(list)
AK/CK/DK option
Purpose
One of the three vectors of parameters of the nominal or multiple-choice model is specified with one of the following mutually exclusive options: AK, CK or DK.
AK refers to a_k
CK refers to c_k
DK refers to d_k.
Format
AK/CK/DK
DEVIATION/POLYNOMIAL/TRIANGLE option
Purpose
One of the following three T-matrix options specifies the matrix: DEVIATION, POLYNOMIAL
or TRIANGLE.
Format
DEVIATION/POLYNOMIAL/TRIANGLE
VARIABLE FORMAT STATEMENT
There are two formats for the key information: one is used if the items are all binary, and the
other is used if any items on the test have more than two response categories. The two types of
key entry will now be discussed in turn.
1. The first line after the END command must contain a single integer that is the number of response codes in the data file. In this context “code” means a single alphanumeric character that appears in the data file to indicate a response; common codes are 0 and 1, or T and F, or Y and N.
2. The next line contains, beginning in column 1, in one-column fields, the codes themselves, for instance 01, or TF, or YN.
3. The next line (or lines) contains the “key”: a list of the correct- or positive-response codes, beginning in column 1, in one-column fields. The key codes for up to 79 items go on a single line. If there are more than 79 items, the key codes for items 80 and higher go on the next line, up to the code for item 158. If there are more than 158 items, the key codes for items 159 and higher go on the following line, and so on (79 codes per line).
4. The next line of the binary-item key block contains N in column 1 if there are no explicitly specified codes that indicate missing-at-random (or “not-reached”) or Y if such a code exists. Usually this line contains N, because any code in the data file that is not listed among the codes on the line described in (2), above, is also treated as missing-at-random, so it is generally easier to simply omit the missing data code(s) from the list of codes. However, if it is desirable to specify the missing data code among the codes, you can put Y on this line.
5. If the entry on the missing-data code line (4, above) is Y, then the final line of the key sequence has the missing data code in column 1.
Related topics
END command
Response Codes (Binary Data) dialog box
For tests that include one or more items with more than two response categories:
1. The first line after the END command must contain a single integer that is the number of response codes in the data file. In this context “code” means a single alphanumeric character that appears in the data file to indicate a response; common codes are 0, 1, 2, 3, … or A, B, C, D, ….
2. The next line contains, beginning in column 1, in one-column fields, the codes themselves, for instance 01234, or ABCDE. Note that the codes must be single characters, regardless of the number of item response categories. For example, after using the digits from 0-9 for ten response categories, the letters A, B, C, … are often used for the eleventh and subsequent categories.
For multiple-category data, the code-line in (2) is followed by one line (or set of lines) for each
code, in the order the codes are typed in (2), above. The category numbers are in 1-column fields
if all of the items on the test have fewer than ten (10) response categories. Each line indicates,
beginning in column 1 for item 1, the number of the response category into which that data code
is to be placed. The lowest response category for MULTILOG models is numbered 1; the next
lowest is 2, and so on. Response category number 0 is reserved for missing data in MULTILOG.
For tests with no items with 10 or more categories, each item’s category-number for a given data
code occupies a single column. 79 items' categories fit on a single line; the next 79 go on the
next line, and so on.
For example: In the mixture.mlg file (derived from example 15), there are five (5) response
codes in the data: 0, 1, 2, 3, and 9. The first 26 items are binary; for those items, 0 is incorrect
and 1 is correct. The 27th item has three categories, coded in the data file 1, 2, and 3. 9 indicates
missing data.
5
01239
111111111111111111111111110
222222222222222222222222221
000000000000000000000000002
000000000000000000000000003
000000000000000000000000000
For items 1-26, this key sequence maps the response code 0 into category 1, and the response
code 1 into category 2. The response codes 1, 2, and 3 are placed in model categories 1, 2, and 3
for item 27. Unacceptable values, and 9s, are made missing by placement in category 0 (zero).
If any item on the test has ten or more response categories, the category number lines in the key
are all entered in two-column fields, right-justified. In this case, 40 category numbers are entered
on each line; for more than 40 items, additional lines are used for each data code.
Related topics
Following the key sequence, a format command is entered describing the layout of the data file.
This is described below:
The format begins (in column 1) with “(” and ends with “)”.
For pattern data, the format consists of:
NCHARS A1 for the ID field (optional; include only if NCHARS>0; NCHARS is the number of ID characters entered on the PROBLEM command).
I1 (or I2, if there are 10 or more groups) to read the group number, from 1 to NGROUP (optional; include only if NGROUP>1 on the PROBLEM command).
NITEMS A1 for the item responses.
Fn.0 for the frequency corresponding to that response pattern, where n is the number of columns devoted to the frequency in the data file.
For individual data, the format consists of:
NCHARS A1 for the ID field (optional; include only if NCHARS>0).
I1 (or I2, if there are 10 or more groups) to read the group number, from 1 to NGROUP (optional; include only if NGROUP>1 on the PROBLEM command).
NITEMS A1 for the item responses.
Fn.d for a criterion if there is one (optional), where n is the number of columns in the data file devoted to the criterion, and d is the number of places after the decimal point.
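For example, for individual data with a five-character ID field, a single group, 20 items, and no criterion, a format along these lines could be used (illustrative):

(5A1,20A1)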
For table data, the format is for one row of the θ-group × item-response table, giving the frequencies in each θ-group responding in each category, in F-format.
Examples
The *.mlg file for the COURTAB2 example is shown below, illustrating keying for multiple
categories.
If the problem requires an item parameter file as starting values or for computing examinee or
pattern scores, the file is named in a similar way by the keyword PARM=‘filename’ of the START
command. If this keyword does not appear, the parameters are assumed to be in the command
file immediately following the START command. The command file courtab2.mlg for version
7.0 provides an example of this type of problem setup:
Note the addition of DATA=‘COURTAB.DAT’ on the PROBLEM command, and the FORMAT after
>START ALL; To simplify parameter input, the program default has been changed to format-free
read. The values of the parameters must be space- or comma-delimited; no format statement is
required (or permitted). Free-format parameter values may also be read using the keyword
PARM=‘filename’ to read the values from a file.
By default, when the estimated parameters are saved in the jobname.par file, the parameter values in this file are suitable for format-free reading. (See the comments about the SAVE command below.)
To use parameter files in the format used by previous versions of MULTILOG, the keyword
FORMAT has been added to the START command. This keyword is used only in conjunction with
the keyword PARM=‘filename’, to read output files produced by MULTILOG with the keyword
FORMAT on the SAVE command line. Partial syntax for the Knee problem using this new setup is
shown below:
If the SAVE command appears in an item calibration problem, the parameters are saved in
a file named jobname.par.
If you wish to have the program save the estimated parameters in the previous style, you
may insert the keyword FORMAT in the SAVE command. This keyword causes an item
header line to be written ahead of the parameter values as previously. This item header
line contains essential information if the nominal or multiple-choice models are used;
free-format output is not useful for most purposes with those models.
If FORMAT does not appear in the SAVE command, the item header line is omitted and the
parameters are written in format-free, space delimited form.
If the SAVE command appears in an examinee or pattern scoring problem, the scores are saved in a file named jobname.sco. If the SAVE command does not appear, the scores are listed in the jobname.out file.
Related topics
5 TESTFACT
5.1 Introduction
The TESTFACT program implements all the main procedures of classical item analysis, test
scoring, and factor analysis of inter-item tetrachoric correlations. In addition, it performs modern
methods of factor analysis based on item response theory (IRT). The program also includes a fa-
cility for simulating responses to test items having difficulties and factor loadings specified by
the user.
New features in TESTFACT are all part of full information item factor analysis (FIFA). The
commands and procedures of classical item statistics and classical factor analysis of tetrachoric
correlation coefficients remain unchanged.
The changes to full information item factor analysis consist of a new and improved algorithm for
estimating the factor loadings and scores – specifically, new methods of numerical integration
used in the EM solution of the marginal maximum likelihood equations. Three different methods
of multidimensional numerical integration for the E-step of the EM algorithm are provided:
adaptive quadrature, non-adaptive quadrature, and Monte Carlo integration.
In exploratory item factor analysis, these methods make possible the analysis of up to fifteen factors and improve the accuracy of estimation when the number of items is large. The previous non-adaptive method has been retained in the program as a user-selected option (NOADAPT), but the adaptive method is the default. The maximum number of factors with adaptive quadrature is 10; with non-adaptive quadrature, 5; with Monte Carlo integration, 15. Bayes estimates of scores for all factors can be obtained by either the adaptive or the non-adaptive method. Estimation of the classical reliability of the factor scores is also included.
TESTFACT includes yet another full information method that provides an important form of confirmatory item factor analysis, namely “bifactor” analysis. The factor pattern in bifactor analysis consists of a general factor on which all items have some loading, plus any number of so-called “group factors” to which non-overlapping subsets of items, assigned by the user, are assumed to belong. The subsets typically represent small numbers of items that pertain to a common stem such as a reading passage or problem-solving exercise. The bifactor solution provides Bayes estimation of scores for the general factor, accompanied by standard errors that properly account for association among responses attributable to the group factors.
BIFACTOR invokes and controls the bifactor solution. The FACTOR and FULL commands
may not be used with BIFACTOR.
TECHNICAL combines keywords and options of item factor analysis that would otherwise
have to be duplicated in the BIFACTOR, FACTOR, FULL, and SCORE commands.
SIMULATE is now a separate command instead of a keyword of the SCORE command. It has additional options for input of item parameters to specify the simulation. The parameters may be entered either as item intercepts and factor slopes, or as standard difficulties (i.e., normal deviates corresponding to the items’ percents correct) and factor loadings. The command now also allows the user to specify mean values for the factor scores. The default values of the means are zero, as in the previous version of the program.
This document describes those elements in the user’s interface that may not be immediately clear
to the user or that behave in a somewhat nonstandard way.
Main menu
Run menu
Output menu
Font option
Window menu
At the center of the interface is the main menu bar, which adapts to the currently active function.
For example, when you start the program, the menu bar shows only the menu choices File, View,
and Help.
However, as soon as you open a TESTFACT output file (through the File menu), the Window and Edit menu choices show up on the menu bar. At the same time, the File menu choices are expanded with selections like Save and Save As, and the View menu now has a Font option next to the Status bar and Toolbar choices.
Opening an existing TESTFACT command (*.tsf) file, or starting a new one, adds further
choices to the main bar: the Output and Run menus.
Note that you can open only one command file at a time. If you want to paste some part of an existing command file into your current one, opening the old file will automatically close the current one. After you copy the part you want to the clipboard, you have to reopen the *.tsf file for pasting.
The Run menu gives you the option to run the command file displayed in the main window.
If you have made any changes, the current command file is first saved when you run an analysis by clicking Run. You can easily tell whether a command file has been changed by looking at the filename above the menu bar. An asterisk after the filename shows that the current file has been changed but has not been saved yet.
Through the Output menu, you can open the list output, named with the file extension *.out. Always check the end of each output file to see if it reports NORMAL END. If it does not, something went wrong, and the output file should contain some information about the problem.
The Window menu is only available when you have at least one file open. You can use the Ctrl-Tab key combination to switch between open files, or use the Window menu to arrange the open files (cascade, tile).
Clicking on the Font option on the View menu displays a dialog box with the fonts that are
available on your system.
You may use different fonts for command and output files. At installation, both are set to a special Arial Monospace font that ships with the program. To keep the tables in the output aligned, you should always select a monospace (fixed-pitch) font, in which all the characters have the same width. Once you select a new font, that font becomes the default. This gives you the option to select a font (as well as a font size and style) for your command (*.tsf) files that differs from the one for your list output (*.out) files, as a quick visual reminder of the type of file.
OVERVIEW OF SYNTAX
TESTFACT uses the command conventions of other IRT programs published by SSI. Command lines follow these general rules:
Command lines may not exceed 80 columns. Continuation on one or more lines is permitted.
Each command must be terminated by a semicolon (;). The semicolon functions as the command delimiter: it signals the end of the command and the beginning of a new command.
A greater-than sign (>) must be entered in column 1 of the first line of a command and
followed without a space by the command name.
Command names, keywords, and options may be entered in full or abbreviated to the first
three characters. Exceptions are the following keyword values and options, which must be
entered in full:
VARIMAX in the FACTOR command (Section 5.3.9)
PROMAX in the FACTOR command
PATTERN in the INPUT command (Section 5.3.12)
CASE in the INPUT command
RECODE in the BIFACTOR and FULL commands (Sections 5.3.3 and 5.3.11)
MISS in the BIFACTOR and FULL commands
LORD in BIFACTOR and FULL commands
Related topics
All available TESTFACT commands are given in their necessary order below. Required commands (indicated with a “*”) must appear in the command file for each problem setup. All other commands are optional. In the sections that follow, commands are arranged in alphabetical order.
TITLE (*)
PROBLEM (*)
COMMENT
NAMES
RESPONSE (*)
KEY (*)
SELECT
SUBTEST
CLASS
FRACTILE
EXTERNAL
CRITERION
RELIABILITY
PLOT
TETRACHORIC
BIFACTOR
FACTOR
FULL
PRIOR
SCORE
TECHNICAL
SAVE
SIMULATE
INPUT (*)
(Variable format statement) (*)
CONTINUE (*)
STOP (*)
Note:
INPUT and CONTINUE (or INPUT and STOP) must be the last two commands. The variable format
is required in the command file when raw data are read in from an external file.
TITLE
PROBLEM NITEM=n, SELECT=n, NOTPRES
RESPONSE=n, SUBTEST=n,
CLASS=n, FRACTILES=n,
EXTERNAL=n, SKIP=n
COMMENT
NAMES
RESPONSE
KEY
SELECT
SUBTEST BOUNDARY=(list),
NAMES=(list)
CLASS IDENTITY=(list),
NAMES=(list)
FRACTILES BOUNDARY=(list) SCORE/PERCENTIL
EXTERNAL
CRITERION NAME=n, WEIGHTS=(list) EXTERNAL/SUBTESTS/CRITMARK
RELIABILITY KR2/ALPHA
PLOT BISERIAL/PBISERIAL,
NOCRITERION/CRITERION,
FACILITY/DELTA
TETRACHORIC NDEC=n RECODE/PAIRWISE/ COMPLETE,
TIME, LIST, CROSS
BIFACTOR NIGROUPS=n, TIME, SMOOTH, RESIDUAL,
IGROUPS=(list), LIST=n, NOLIST
CPARMS=(list), NDEC=n,
OMIT=n, CYCLES=n, QUAD=n
FACTOR NFAC=n, NROOT=n, NIT=n, RESIDUAL, SMOOTH
ROTATE=(list), NDEC=n
FULL OMIT=n, FREQ=n, CYCLES=n, TIME
CPARMS=(list), QUAD=n
PRIOR SLOPE=n, INTER=(list)
SCORE NFAC=n, FILE=<name>, MISSING, TIME, CHANCE,
LIST=n, METHOD=n, PARAM=n, LOADINGS
SPRECISION=n
TECHNICAL ITER=(list), QUAD=n, NOADAPT, FRACTION, NOSORT
SQUAD=n, PRV=n, FREQ=n,
NITER=n, QSCALE=n,
QWEIGHT=n, IQUAD=n,
ITLIMIT=n, PRECISION=n,
NSAMPLE=n, ACCEL=n,
MCEMSEED=n
SAVE SCORE, MAIN, SUBTESTS,
CRITERION, CMAIN, CSUB,
CCRIT, CORRELAT, SMOOTH,
ROTATE, UNROTATE, FSCORES,
TRIAL, SORTED, EXPECTED, PARM
SIMULATE NFAC=n, NCASES=n, LOADINGS/SLOPES, CHANCE
SCORESEED=n, ERRORSEED=n,
GUESSSEED=n, FILE=<name>,
MEAN=(list), FORM=n,
GROUP=n, PARM=n
INPUT NIDCHAR=n, NFMT=n, SCORES/CORRELAT/FACTOR,
TRIAL=<name>, FORMAT/UNFORMAT, LIST, REWIND
WEIGHT=(list), FILE=<name>
CONTINUE
STOP
BIFACTOR COMMAND
(Optional)
Purpose
To invoke and control the bifactor solution.
Note
FACTOR and BIFACTOR are mutually exclusive commands. If RESIDUAL is not selected, it is
not necessary to compute the tetrachoric matrix.
Format
Example
Default
No bifactor analysis.
Related topics
CPARMS keyword
Purpose
To specify the probability of chance success on each item. If items have been specified in
the SELECT command, the corresponding probabilities will be selected from this list.
Format
Default
Related topics
CYCLES keyword
Purpose
To specify the maximum number of EM cycles for the bifactor solution.
Format
CYCLES=n
Default
20.
IGROUPS keyword
Purpose
To assign the items to the item groups, numbered from 1 to n. Assign 0 to any item that
loads only on the general factor. If items have been specified in the SELECT command, the
corresponding IGROUPS numbers will be selected from this list.
Format
IGROUPS=(n1,n2,...,nn)
For purposes of comparing the results of a bifactor analysis with a one-factor analysis of the same data, the user may assign all items to the general factor (i.e., all values of the IGROUPS keyword are zero; in that case, NIGROUPS must also be set to zero).
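For example, the following sketch assigns items 1-5, 6-10, and 11-15 to three item groups (illustrative; the n(0)k notation repeats the value n k times):

>BIFACTOR NIGROUPS=3, IGROUPS=(1(0)5,2(0)5,3(0)5);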
Default
None.
Related topics
LIST keyword
Purpose
To control the listing of the estimated factor loadings:
n = 0: no printout
n = 1: loadings will be listed in item order
n = 2: loadings will be listed in item group order
n = 3: loadings will be listed in both orders
If unrotated factor loadings are selected in the SAVE command, the loadings will be saved in
item order in the format of a conventional two-factor solution. The group assignments will
be included.
Format
LIST=n
Default
0.
Related topics
NDEC keyword
Purpose
To specify the number of decimal places in the listing of a selected smoothed or residual
correlation computed from the bifactor solution.
Format
NDEC=n
Default
3.
NIGROUPS keyword
Purpose
To specify the number of item groups.
Format
NIGROUPS=n
Default
None.
NOLIST option
Purpose
To suppress the listing of the smoothed or residual matrix in the program output. These matrices may be saved in an external file in either case (see the SAVE command, discussed in Section 5.3.20).
Format
NOLIST
Related topics
OMIT keyword
Purpose
To specify the treatment of omitted items. Note that the option selected should be given in
full.
Format
OMIT=RECODE/MISSING/LORD
Default
RECODE.
QUAD keyword
Purpose
To control the number of quadrature points for the EM estimation of the parameters.
Format
QUAD=n
Default
9.
RESIDUAL option
Purpose
To compute the difference between the tetrachoric correlation matrix and the smoothed expected matrix. If RESIDUAL is not selected, it is not necessary to compute the tetrachoric matrix.
Format
RESIDUAL
SMOOTH option
Purpose
To reproduce the expected correlation matrix from the bifactor solution; otherwise, the matrix will not be computed.
Format
SMOOTH
TIME option
Purpose
To specify that omitted items following the last non-omitted item should be treated as not-presented. Otherwise, they will be scored incorrect if the OMIT keyword is set to RECODE.
Format
TIME
Related topics
CLASS COMMAND
(Optional)
Purpose
To assign class codes if item statistics are to be estimated separately for each class (group)
of respondents in the sample.
Format
Examples
>CLASS IDENTITY=(‘1000’,‘2000’,‘3000’);
>CLASS IDENTITY=(N,S,E,W,C);
Default
No classes.
IDENTITY keyword
Purpose
To provide the class identification codes as they appear in the data records.
Format
IDENTITY=(n1,n2,...,nq)
Default
None.
NAMES keyword
Purpose
To provide names for the classes.
Format
NAMES=(n1,n2,...,nq)
Default
Blank names.
COMMENT COMMAND
(Optional)
Purpose
To insert comment lines into the command file.
Format
>COMMENT
…text…
…
…
…text…
Note
The COMMENT command is given on a line by itself and followed by as many lines as desired,
of 80 characters maximum, containing comments. A semicolon to end this command is not
needed.
Example
>COMMENT
20 ITEM TEST, THE TOTAL SCORE AS THE CRITERION SCORE.
THE ITEMS ARE TESTING THE FOLLOWING TOPICS.
STRUCTURE AND LANDFORMS.
EROSION, TRANSPORT AND DEPOSITION.
CLIMATE AND VEGETATION.
MINERAL RESOURCES.
AGRICULTURE AND INDUSTRY.
POPULATION AND TRANSPORT.
MISCELLANEOUS.
PERSONS SITTING TEST CLASSIFIED BY SEX, G=GIRL, B=BOY;
THE DATA CARDS ARE LAYED OUT AS BELOW.
COLUMN 1 TO 12 INCLUSIVE – IDENTITY
COLUMN 13 – SEX
COLUMNS 14 TO 33 – ITEM RESPONSES
COLUMNS 36 TO 37 INCLUSIVE – CRITERION SCORE
>NAMES…
Default
No comments.
CONTINUE COMMAND
(Optional)
Purpose
To end the setup of the current problem when another problem setup follows in a stacked command file; the last problem ends with the STOP command instead.
Format
>CONTINUE
Note
CRITERION COMMAND
(Optional)
Purpose
To define a criterion score to supplement the main test score of each respondent.
Format
Examples
Default
No criterion score.
EXTERNAL/SUBTESTS/CRITMARK option
Format
EXTERNAL/SUBTESTS/CRITMARK
EXTERNAL: A linear combination of external variables (see the EXTERNAL command, discussed in Section 5.3.8). In this case, weights are supplied by the user (see the WEIGHTS keyword, with w = t).
SUBTESTS: A linear combination of subtest scores (see the SUBTEST command, Section 5.3.25). In this case, the calculation is based on weights supplied by the user (see the WEIGHTS keyword, with w = p).
CRITMARK: A score input with item responses. No weights required.
Default
EXTERNAL.
Related topics
NAME keyword
Purpose
To provide a name of 1 to 8 characters for the resulting criterion. The rules for naming items
(see NAMES command, Section 5.3.14) apply to the criterion name.
Format
NAME=character string
Default
Blank name.
Related topics
WEIGHTS keyword
Purpose
To enter weights when the criterion score must be calculated as a linear combination of other variables (see the EXTERNAL option).
Format
WEIGHTS=(n1,n2,...,nw)
Default
1.0.
Related topics
EXTERNAL COMMAND
(Optional)
Purpose
To provide names for the external variates.
Format
>EXTERNAL n1,n2,...,nt;
Example
>EXTERNAL ARITH,ALGEBRA,TRIG,GEOMETRY;
Default
Related topics
FACTOR COMMAND
(Optional)
Purpose
To request a principal (MINRES) factor analysis of the smoothed tetrachoric correlation matrix.
Format
Note
Example
Default
No factor analysis.
NDEC keyword
Purpose
To specify the number of decimal places in the listing of the smoothed or residual correlation matrix.
Format
NDEC=n
Default
3.
NFAC keyword
Purpose
To specify the number of factors to be extracted.
Format
NFAC=n
Default
NITEM/2.
Related topics
NIT keyword
Purpose
To specify the number of iterations for the MINRES factor solution of the smoothed correlation matrix.
Format
NIT=n
Default
NROOT keyword
Purpose
To specify the number of latent roots to be extracted. NROOT must be greater than or equal to NFAC.
Format
NROOT=n
Default
NFAC.
Related topics
RESIDUAL option
Purpose
To request the computation of the residual correlation matrix. This matrix is computed as the initial correlation matrix minus the final correlation matrix. The residual variance for each item appears in the diagonal of this matrix.
Format
RESIDUAL
Default
ROTATE keyword
Purpose
To request rotation of the factors. VARIMAX or PROMAX must be entered (in full) if rotation is required; there is no default. d is the number of leading factors to be rotated, and must be less than or equal to NFAC. e is the constant for PROMAX rotation and must be between 2 and 4, inclusive.
Format
ROTATE=([VARIMAX/PROMAX],d,e)
Default
d=NFAC, e=3.
Related topics
SMOOTH option
Purpose
To request the computation of an f-factor positive definite estimate of the latent response
process correlation matrix.
Format
SMOOTH
Note
The SMOOTH option affects only the output of the final smoothed correlation matrix. Initial smoothing of the correlation matrix will take place whether the SMOOTH option is entered or not. The off-diagonal elements 1.0, -1.0, 9.0 or -9.0 in the initial tetrachoric correlation matrix (caused by too small cell or marginal frequencies in a contingency table) will be automatically replaced by a new correlation coefficient estimated by the centroid method. The positive-definite tetrachoric correlation matrix is then produced before the principal factor analysis.
FRACTILES COMMAND
(Optional)
Purpose
To group scores into fractiles by score boundaries or percentiles. The number of fractiles
must be set in the PROBLEM command.
Format
Examples
Score bands defining five fractiles:
through 15
16 through 27
28 through 33
34 through 40
41 through 60
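A command consistent with these bands would be (a sketch; the boundaries are the cumulative upper scores of the bands):

>FRACTILES BOUNDARY=(15,27,33,40,60), SCORE;

with FRACTILES=5 entered on the PROBLEM command.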
Default
No fractiles.
Related topics
BOUNDARY keyword
Purpose
If the SCORE option is selected, the boundaries consist of cumulative upper scores on the test bands. The scores are expressed in integers from 1 to NITEM.
If the PERCENTIL option is selected, the boundaries consist of the cumulative upper percentages of the score distribution. The percentages are expressed in integers from 1 to 100.
Format
BOUNDARY=(n1,n2,...,ns)
Related topics
SCORE/PERCENTIL option
Purpose
If SCORE is selected, each fractile corresponds to a band of scores on the main test. If the number of items is small, it is better to use score bands rather than percentiles to define fractiles.
If PERCENTIL is selected, each fractile corresponds to a percentile band of scores on the main test.
Format
SCORE/PERCENTIL
Default
SCORE.
FULL COMMAND
(Optional)
Purpose
To request full information item factor analysis, starting from the principal factor solution, and the computation of the likelihood ratio χ2 and the change in χ2.
Format
Note
RECODE, MISS and LORD may not be abbreviated in the FULL command.
Example
Default
No full information item factor analysis.
CPARMS keyword
Purpose
To specify the probability of chance success on each item.
Format
CPARMS=(n1,n2,...,nn)
Examples
CPARMS=(0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,
0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,
0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1);
CPARMS=(0.1(0)32);
CPARMS=(0.05(0)16,0.1,0.1,0.15(0)14);
Default
0.0.
CYCLES keyword
Purpose
To specify the maximum number of EM cycles.
Format
CYCLES=n
Default
15.
FREQ keyword
Purpose
To list observed and expected response pattern frequencies and their differences.
Format
FREQ=0 or FREQ=1
Default
FREQ=0 (observed and expected response frequency table not written to output file.)
OMIT keyword
Purpose
To specify the treatment of omitted items. Under the LORD option, omitted responses are treated as fractionally correct. The fraction is the chance success parameter for the items. It is set to the reciprocal of the number of alternatives in the item. Only factor scores are affected.
Format
OMIT=RECODE/MISSING/LORD
Default
RECODE.
QUAD keyword
Purpose
To control the number of quadrature points for the EM estimation of the parameters.
Format
QUAD=n
Default
When the NOADAPT option is specified on the TECHNICAL command, the default values for QUAD are as follows:
Factors   Default QUAD
1         21
2         15
3         5
>3        3
Otherwise:
Factors   Default QUAD
1         9
2         5
3         4
>3        3
Related topics
TIME option
Purpose
To specify that omitted items following the last non-omitted item should be treated as not-presented.
Format
TIME
Default
Related topics
INPUT COMMAND
(Required)
Purpose
To specify the file, type, and layout of the data to be read.
Format
Note
Examples
Example 1:
Item response data, unweighted and unformatted, are in the file with the name mydata.dat.
There are two groups, coded M and F, one form, and two external criteria. Because these are
raw data, the variable format record must be in the command file. End of command file:
>INPUT NIDCHAR=4,FILE=‘MYDATA.DAT’;
(4A1,3X,A1,2X,10A1,2F5.1)
>CONTINUE
Example 3:
Data are read as one long string of numbers with decimal points, space-delimited and format-free. They can appear, for example, in the following form:
1.0000
.6231 1.0000
.5574 .5395 1.0000
.3746 .3952 .4871 1.0000
.3210 .3456 .3863 .5894 1.0000
>CONTINUE
Example 4:
A factor solution for VARIMAX and PROMAX rotation; the factor loadings are unformatted output from the SAVE command and are in the file factor.dat. The first line of the external file with the factor loadings contains the variable format statement.
Example 5:
Item response data for full information factor analysis; trial values from a previous principal
factor analysis are input as starting values.
The trial values and the variable format statement describing the layout are in the file
pfact.tri. The raw data are in the external file survey.dat. The corresponding variable for-
mat statement is required in the command file.
Command file:
(I3,2X,3F10.5)
01 0.36512 -0.62143 0.01684
…
15
0001M24 101101110010100
0250F20 111101011010111
Example 6:
Item response data with case weights normalized to 1000; the data are formatted in the file survey.dat. Two variable format records are in the command file.
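The corresponding INPUT command might look like this (a sketch consistent with the description; the keyword values are illustrative):

>INPUT NIDCHAR=5, NFMT=2, WEIGHT=(CASE,1000.0), FILE=‘SURVEY.DAT’;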
00001ABDAECAEDBBCAEDEBACDEABC 0.632
…
01651ABDBEACEDCBBCDABEBDDCEAA 0.467
Related topics
NIDCHAR keyword
Purpose
To specify the number (between 1 and 50) of characters in the subject identification field.
Format
NIDCHAR=n
Default
1.
FILE keyword
Purpose
To provide the filename of the data records (of response records, correlations, or factor load-
ings). It may contain up to 128 characters including path and extension and should be en-
closed in single quotes. The drive and directory path should be included if the data file is not
in the same folder as the command file. For more information, see Section 3.2.6.
Format
FILE=<‘filename’>
Default
FORMAT/UNFORMAT option
Purpose
If the data file is formatted as described in a variable format statement, the FORMAT option
should be used.
UNFORMAT is used when the data file is unformatted (binary).
To create the records of the unformatted file, the following WRITE statements may be used:
SCORES:
WRITE(FILE) (ID(I),I=1,NIDCHAR),C,(IR(I),I=1,NITEM),WT,(EXTV(I),I=1,EXT)
where C, WT and EXTV(I) are optional. The variable IR is integer; the others are real, single precision. ID is a case identifier, C a class indicator, IR an item response pattern, and WT a weight (for either CASE or PATTERN).
CORRELAT:
WRITE(FILE) (CORR(I),I=1,NTRI)
FACTOR:
WRITE(FILE) (A(I),I=1,NFA)
Format
FORMAT/UNFORMAT
Default
FORMAT.
Related topics
LIST option
Purpose
To list, for each case, the following information:
identification
main test score
subtest scores (if any).
Format
LIST
NFMT keyword
Purpose
To specify the number of variable format records (80 characters) describing the data records.
The format records must appear in the command file immediately following the INPUT
command.
Format
NFMT=n
Default
1.
REWIND option
Purpose
This is a program instruction to read the data file from the beginning for a subsequent prob-
lem in a stacked command file. The term rewind dates from mainframe days, when data files
were commonly read from a tape that needed to be rewound to the start of the file for a fresh
reading of the data.
Format
REWIND
SCORES/CORRELAT/FACTORS option
Purpose
To indicate the type of data being read from the specified file. Use the SCORES option to read subject response records, CORRELAT to read a correlation matrix, and FACTORS to read factor loadings.
Format
SCORES/CORRELAT/FACTORS
Default
SCORES.
Related topics
TRIAL keyword
Purpose
To specify the filename for input of trial values for the full information factor analysis. It may contain up to 128 characters including path and extension and must be enclosed in single quotes.
Each line of the trial values file must contain the item number followed by the intercept and slope for each factor. The variable format record must appear in the first line of this file. See Example 5 of the INPUT command. Trial values are saved in this form, with the format statement, by the TRIAL option of the SAVE command.
If the trial values are in the command file, they must appear immediately after the data format records.
Format
TRIAL=<‘filename’>
Default
No trial values.
Related topics
WEIGHT keyword
Purpose
To specify the type of weight for a weighted analysis. The two options below may be used
with this keyword.
(CASE,n): Each record includes a case weight (real, i.e., with decimal points in the data records, or read in F-format). The real normalizing constant n must be specified if the CASE option is chosen.
PATTERN: Each data record consists of an answer pattern with a frequency (integer, i.e., without a decimal point and read in I-format).
CASE and PATTERN may not be abbreviated in the INPUT command.
Format
WEIGHT=(CASE,n)/PATTERN
Default
No weights.
KEY COMMAND
(Required)
Purpose
To specify correct-response codes for all the items on the main test, in their original, preselected order.
Format
>KEY ccccccccccccccccccccccc;
Notes
Example
>KEY AABCAEDCEDEACBD125342;
Default
NAMES COMMAND
(Optional)
Purpose
To provide brief names for all of the items on the test in their original order.
Format
>NAMES n1,n2,...,nn;
Notes
If items are selected and/or reordered using the SELECT command, their item names and the
answer key will be selected and/or reordered at the same time.
Examples
>NAMES KNOW1, KNOW2, UNDER1, ANAL1, KNOW3, UNDER2, ANAL2, COMP, ANAL3;
>NAMES ‘100’,’A100’,’100B’,’C-8’,’D-9’,’E/10’,’F/20’;
Default
PLOT COMMAND
(Optional)
Purpose
To request a plot of item discriminating power against item difficulty.
Format
Examples
>PLOT BISERIAL,CRITERION,DELTA;
>PLOT BISERIAL,CRITERION,FACILITY;
>PLOT PBISERIAL,NOCRITERION,DELTA;
Default
No plot.
BISERIAL/PBISERIAL option
Purpose
To indicate the choice of discrimination index: either the biserial (BISERIAL) or the point-biserial (PBISERIAL) correlation.
Format
BISERIAL/PBISERIAL
Default
BISERIAL.
FACILITY/DELTA option
Purpose
To indicate the frame of reference for the item difficulty. Item difficulty may be plotted in terms of either the facility (FACILITY) or the delta (DELTA) metric.
Format
FACILITY/DELTA
Default
FACILITY.
NOCRITERION/CRITERION option
Purpose
To define the discriminating power. Discriminating power may be with respect to either of
these:
Format
NOCRITERION/CRITERION
Default
NOCRITERION
PRIOR COMMAND
(Optional)
Purpose
To constrain the maximum likelihood estimation of slope and intercept parameters using a
beta prior distribution on the uniquenesses and a normal prior distribution on the intercepts.
Format
Example
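A sketch (the values are illustrative; the defaults are SLOPE=1.2 and INTER=(0,2)):

>PRIOR SLOPE=1.5, INTER=(0.0,2.0);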
Default
None. If the PRIOR command does not appear, the ML estimation will not be constrained.
INTER keyword
Purpose
To define the mean (m) and standard deviation (s) of the normal distribution for the intercept parameters, such that c_j ~ N(m, s).
Format
INTER=(m,s)
Default
m=0, s=2.
SLOPE keyword
Purpose
To define the parameter of the beta distribution for the uniquenesses, such that u_j ~ Beta(n, 1). Larger values of n correspond to stronger priors.
Format
SLOPE=n
Default
1.2.
PROBLEM COMMAND
(Required)
Purpose
To specify the numbers of items, response codes, classes, subtests, fractiles, and external variates for the problem.
Format
Examples
>PROBLEM NIT=150;
Default
CLASS keyword
Purpose
To specify the number of classes (n = 1 to 10) into which respondents will be divided. This
corresponds to the number of classes identified and named in the CLASS command.
Format
CLASS=n
Default
0.
Related topics
EXTERNAL keyword
Purpose
To specify the number of external variates (n = 0 to 5). This should equal the number of external variates named in the EXTERNAL command.
Format
EXTERNAL=n
Default
0.
Related topics
FRACTILES keyword
Purpose
To specify the number of fractiles (n = 1 to 10) into which scores will be divided. Boundaries of the fractiles are defined in the FRACTILES command.
Format
FRACTILES=n
Default
1.
Related topics
NITEMS keyword
Purpose
To specify the total number of test items. This should equal the number of item names specified in the NAMES command.
Format
NITEMS=n
Default
Related topics
NOTPRES option
Purpose
To indicate that one of the response codes identifies “not-presented” items. See the
RESPONSE command, discussed in Section 5.3.19.
Format
NOTPRES
Default
Related topics
RESPONSE keyword
Purpose
To specify the number of response codes (n = 2 to 15). This should equal the number of
codes specified in the RESPONSE command.
Format
RESPONSE=n
Default
2.
Related topics
SELECT keyword
Purpose
To specify the number of items selected for this run (n = 0 to NITEM); this should equal the number of items specified in the SELECT command.
Format
SELECT=n
Default
Related topics
SKIP keyword
Purpose
To control which steps of the analysis are performed.
Format
SKIP=n
n=0: Do not skip; perform item analysis and all subsequent steps.
Default
SUBTEST keyword
Purpose
To indicate the number of boundaries and subtest names as specified in the SUBTEST command.
Format
SUBTEST=n
Default
0.
Related topics
RELIABILITY COMMAND
(Optional)
Purpose
To specify a measure of internal consistency for the main test (or all subtests).
Format
>RELIABILITY KR20/ALPHA;
KR20: The default if the RELIABILITY command is used. The Kuder-Richardson formula
20 is calculated for each subtest (or for the main test when there are no subtests). Omits
are not allowed in computing KR20.
ALPHA: Coefficient alpha is calculated for each subtest (or for the main test when there are
no subtests). Omits are permissible. The computer time required to calculate alpha may be
excessive if the number of items and respondents is large.
Example
>RELIABILITY KR20;
Default
No reliability measure.
RESPONSE COMMAND
(Required)
Purpose
To specify the response codes common to all items on the main test.
Format
Notes
Examples
>RESPONSE ‘0’,A,B,C,D;
In this example, there are 5 response codes on the main test (m = 5).
>RESPONSE ‘ ’,‘1’,‘2’,‘3’,‘4’,‘-’;
In this example, there are 6 response codes on the main test (m = 6). Omit is blank, items
not-presented to respondents are coded “minus”.
Default
Related topics
SAVE COMMAND
(Optional)
Purpose
To save scores and/or item parameters in output files specified by the user.
Format
Notes
The saved file for data simulation is described in the SIMULATE command.
All results are saved in fixed-column text files; the first record of each file contains the
format statement describing the column layout.
The saved files will have the jobname as default filename.
Example
>SAVE SCORE,MAIN,SUBTESTS,CRITERION,SMOOTH,FSCORES;
Default
Not saved.
Related topics
CCRIT option
Purpose
To save the class item statistics based upon criterion score in the file <jobname>.ccr:
Format      Description
F6.3        Facility
F6.2        Difficulty
Format
CCRIT
Default
Do not save.
CMAIN option
Purpose
To save separate estimates for each class based on the main test score in the file <job-
name>.cma:
Format Description
‘MAIN’,5X -
I4 Item number
F6.3 Facility
F5.2 Difficulty
Format
CMAIN
Default
Do not save.
CORRELAT option
Purpose
To save the tetrachoric correlation matrix in the file <jobname>.cor. This matrix may not
be positive-definite (diagonal and lower triangle only, NITEMxNITEM).
Output format
Output is 80-column, space-delimited format-free, in lower triangular form with line wrap.
Format
CORRELAT
Default
Do not save.
Related topics
CRITERION option
Purpose
To save the item statistics based upon criterion score in the file <jobname>.cri:
Format      Description
F5.3        Facility
F6.2        Difficulty
Format
CRITERION
Default
Do not save.
CSUB option
Purpose
To save the item statistics of each class based upon subtest scores in the file <jobname>.csu:
Format      Description
F6.3        Facility
F5.2        Difficulty
Format
CSUB
Default
Do not save.
EXPECTED option
Purpose
To save the results of the final E-step of the full information item factor analysis in the file
<jobname>.exp:
Format
EXPECTED
Note
This option applies only to the non-adaptive solution (NOADAPT option of the TECHNICAL command).
Output format
(1X,fI2,n(/,1X,7F10.5))
f: the number of factors, as specified by the NFAC keyword in the FACTOR command.
Default
Do not save.
Related topics
FSCORES option
Purpose
To save the factor scores and their posterior standard deviations with subject identification in
the file <jobname>.fsc. Output format is given in the first line of the factor score file.
Format
FSCORES
Default
Do not save.
MAIN option
Purpose
To save the item statistics based on the main test score:
Format      Description
‘MAIN’,5X   -
F5.3        Facility
F6.2        Difficulty
Format
MAIN
Default
Do not save.
PARM option
Purpose
To save the item numbers, intercepts, factor slopes, and guessing parameters, in a form suitable for computing factor scores at a later time, in the file <jobname>.par. If VARIMAX or PROMAX is selected, these parameters will be saved after the VARIMAX rotation; otherwise, they will be saved from the principal factor solution. If BIFACTOR is selected, the item numbers, intercepts, and general and specific factor slopes will be saved. For scoring purposes, set the FILE keyword of the SCORE command equal to <jobname>.par.
Note that PARM and TRIAL cannot be used in the same BIFACTOR analysis.
Output format
Output format is (I3,2X,F8.5,fF8.5), where f is the keyword value for NFAC in the
FACTOR command, or f = 2 for the BIFACTOR command. This format is given in the first line
of the PARM or TRIAL values file.
Format
PARM
Default
Do not save.
Related topics
ROTATE option
Purpose
To save the rotated factor loadings.
Output format
Format
ROTATE
Default
Do not save.
Related topics
SCORES option
Purpose
To save the following case score information according to the status of WEIGHT in the INPUT
command in the file <jobname>.sco:
case identification
test form number
case weight
main test score
subtest score
criterion score
a: the number specified with the NIDCHAR keyword in the INPUT command.
p: the number of subtests.
If p = 1, the main test and subtest score will be identical. If there is no CRITERION com-
mand, the criterion field will be null.
Format
SCORE
Default
Do not save.
Related topics
SMOOTH option
Purpose
To save the “smoothed” NFAC common factor approximation to the correlation matrix in the
file <jobname>.smo.
Output format
Only the diagonal and lower triangle of the NITEM×NITEM matrix are saved; this matrix will
be positive-definite. The output format is (10F8.5).
Format
SMOOTH
Default
Do not save.
Related topics
SORTED option
Purpose
To save the sorted file of identity, item responses, and weight in the file <jobname>.sor.
This applies only to the non-adaptive solution (NOADAPT option on TECHNICAL command).
Format
SORTED
Default
Do not save.
Related topics
SUBTESTS option
Purpose
To save the item subtest parameter estimates as follows in the file <jobname>.sub:
Format      Description
F5.3        Facility
F6.2        Difficulty
Format
SUBTESTS
Default
Do not save.
TRIAL option
Purpose
To save the item numbers, intercepts, factor slopes, and guessing parameters in the file <job-
name>.tri in a form suitable for performing additional EM parameter estimation cycles at a
later time. The trial values are saved at the end of the EM cycles and before re-
orthogonalization or rotation. In BIFACTOR analysis, TRIAL and PARM are identical. To use
the saved trial values as the starting point for continued EM cycles, set the TRIAL
keyword of the INPUT command equal to <jobname>.tri.
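As an illustrative sketch (the jobname MYTEST is an assumption, and the ellipsis stands for
the other INPUT keywords that would normally be present):
>SAVE TRIAL;
(first run: trial values are written to MYTEST.TRI)
>INPUT TRIAL='MYTEST.TRI',...;
(continuation run: EM cycles resume from the saved trial values)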
Output format
The output format is (I3,2X,F8.5,fF8.5), where f is the (keyword) value for NFAC in the
FACTOR command, or f = 2 for the BIFACTOR command. This format is given in the first line
of the PARM or TRIAL values file.
Note that PARM and TRIAL cannot be used in the same BIFACTOR analysis.
Format
TRIAL
Default
Do not save.
Related topics
UNROTATE option
Purpose
To save the unrotated (principal) factor loadings (NITEMxNFAC) in the file <jobname>.unr.
Use UNROTATE to save BIFACTOR loadings.
Output format
Format
UNROTATE
Default
Do not save.
Related topics
SCORE COMMAND
(Optional)
Purpose
To obtain factor score estimates (EAP or MAP) and their standard error estimates for each case
from estimated or supplied item parameters. For the bifactor model, the command obtains the
EAP score of the general factor, together with estimates of the standard error of the general
factor score that allow for the conditional dependence introduced by the group factors.
Format
Examples
>SCORE LIST=20;
>SCORE LIST=10,NFAC=6,FILE='NEWTEST.PAR',MISSING,TIME,CHANCE;
Default
CHANCE option
Purpose
To specify the use of the guessing model in computing factor scores. When used in conjunc-
tion with the SIMULATE command, the item parameter file must include the chance parame-
ters. This option has the same effect as CPARMS in the FULL command.
Format
CHANCE
Related topics
FILE keyword
Purpose
To specify the name (enclosed in single quotes) of the file containing item parameters for
scoring. The name may include a path and a filename extension, but the total length may not
exceed 128 characters. The drive and directory path should be included if the data file is not
in the same folder as the command file. For more information, see Section 3.2.6.
This file has the same format as the trial values file produced by the TRIAL option of the SAVE
command, i.e., chance value, intercept, and slopes.
First record:
A variable format statement (in parentheses) describing the item parameter column assign-
ments.
Following records:
Without chance parameters: intercept and factor slopes, or standard difficulty and loadings.
With chance parameters: the chance value, followed by the intercept and factor slopes, or by
the standard difficulty and loadings.
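A sketch of such a parameter file for two items and two factors (the format statement and all
values here are assumptions, purely for illustration):
(I3,2X,F8.5,2F8.5)
  1   0.52341 0.81234 0.33412
  2  -0.11230 0.65231 0.41102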
Format
FILE='filename'
Default
None.
Related topics
LIST keyword
Purpose
To specify the number of leading cases for which factor scores will be listed in the program
output. If FSCORES appears in the SAVE command, factor scores for all cases will be saved in
the file with the same name as the command file, but with the extension *.fsc.
Format
LIST=n
Default
Related topics
LOADINGS keyword
Purpose
To specify that the parameter file contains item standard difficulties and factor loadings.
Format
LOADINGS
Default
METHOD keyword
Purpose
To select the method used to estimate the factor scores (EAP or MAP).
Format
METHOD=n
MISSING option
Purpose
The OMIT keyword on the BIFACTOR command will be automatically set to MISSING if the
TIME option has been selected in the TETRACHORIC command.
Format
MISSING
Related topics
NFAC keyword
Purpose
To specify the number of factors when estimating factor scores from a user-supplied file of
parameter values. If the keyword NFAC appears, a parameter file must be designated by the
FILE keyword of the SCORE command and available for reading.
Format
NFAC=n
Default
Factor scores will be computed from parameters in the current specified analysis.
Related topics
PARAM keyword
Purpose
To specify the number of parameter values (intercept and factor slopes) supplied by the user
for estimating factor scores, where n = f + 1, with f being the number of factor loadings (for
example, n = 4 when there are three factor loadings).
Format
PARAM=n
Note
Required if scale score estimates for each subject are desired, without factor analysis.
PARAM must not be used if the FACTOR command or the FULL command is included.
If the PARAM keyword is invoked, the parameter file must be designated with the FILE key-
word in the SCORE command and available for reading.
Related topics
SPRECISION keyword
Purpose
To control the EAP and MAP precision in the calculation of factor scores.
Format
SPRECISION=n
Default
0.0001.
QUAD keyword
Purpose
To specify the number of quadrature points per dimension used in computing the factor scores.
Format
QUAD=n
Default
1 factor: 10
2 factors: 5
3 factors: 3
…
TIME option
Purpose
To specify that omitted items after the last non-omitted item should be treated as not-
presented. Tetrachoric correlation coefficients will be computed with the TIME option, even
if TIME has not been specified in the TETRACHORIC command.
Format
TIME
Related topics
SELECT COMMAND
(Optional)
Purpose
To specify items to be selected and/or reordered for each problem. Requires the SELECT
keyword of the PROBLEM command to be set to n′, the number of items selected.
Format
Notes
If the SELECT command is used in a given problem, all commands following it in the same
problem will pertain only to the selected and/or reordered set of items. Rules for selecting
and reordering of items:
Selection is made by listing the original order-numbers of the desired items. For example,
from the items 1, 2, 3, 4, 5, 6, 7, 8, 9, the items 1, 4, 5, 7, 8, 9 might be selected.
The selected items can be in any order. For example, the items could have been selected in
the order 7, 5, 9, 1, 8, 4.
If all the items (n) are to be reordered, n′ = n, and the selection list will contain the origi-
nal n item numbers in the new order.
Contiguous items may be entered with a "(1)" between the first and last item numbers. For
example, 10(1)34 would select all items numbered 10 through 34.
To select every b-th item from a to c, write a(b)c. For example, 1(2)99 will select every
odd-numbered item from a 100-item test.
Each item’s name and the answer key will be selected and/or reordered at the same time
as the item.
Example
>SELECT 10,9,11,2(1)5,15,14,13,12;
From an original list of 20 items, 11 items are to be selected (n = 20; n′ = 11): items 10, 9, 11, 2, 3, 4, 5, 15, 14, 13, 12, in that order.
Default
Related topics
SIMULATE COMMAND
(Optional)
Purpose
To simulate item response records of cases drawn from a multivariate latent distribution of
factor scores with user-specified vector mean and fixed correlation matrix. The user must
supply standard item difficulties and NFAC factor loadings (or intercepts and factor slopes)
for each item. If a model with chance correct responses is specified, the probabilities of cor-
rect responses must also be supplied. The factor loadings must be orthogonal, e.g., principal
factors. If desired, the means of the factors can be set to arbitrary values to simulate group
effects. The default mean value is 0.0.
Format
Notes
The simulated item responses will be saved in the file with the name <jobname>.sim.
The communalities of the factor loadings must be less than 1.0.
For simulation, only the TITLE, PROBLEM, SIMULATE, and CONTINUE or STOP commands
are required; the NAMES command is optional.
There must be no SAVE or INPUT command.
Response codes in the simulated data are 1 for correct and 0 for incorrect.
Examples
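A possible specification (the file name and all keyword values are assumptions, purely for
illustration):
>SIMULATE NCASES=500,NFAC=2,FILE='SIMPAR.PAR',SLOPES,SCORESEED=12345;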
Default
No simulation.
Related topics
CHANCE option
Purpose
To indicate that the model allowing for correct responses by chance is assumed. The chance
parameters must be present in the parameter file.
Format
CHANCE
Default
Non-chance model.
ERRORSEED keyword
Purpose
To provide the seed of the random number generator used to generate the independent univariate
normal uniqueness distributions of the items.
The random number generator seed may be any number greater than 1 and less than
2147483647.
Format
ERRORSEED=n
Default
453612.
FILE keyword
Purpose
To specify the name (enclosed in single quotes) of the file containing item parameters of the
simulation model. This name may include a path and filename extension, but the total length
may not exceed 128 characters.
The simulation parameter file must have the following layout when the CHANCE option is not
present:
First record: variable format statement describing the fixed-column layout of the file.
NITEMS following records: standard difficulty followed by the NFAC factor loadings, or
standard difficulty followed by the NFAC factor slopes (see the LOADINGS/SLOPES option).
If the CHANCE option is present, the chance probabilities should precede standard difficulties.
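For instance, a two-item, two-factor file without chance parameters might look as follows (the
format statement and all values are assumptions, purely for illustration):
(3F6.3)
 0.500 0.600 0.300
-0.250 0.450 0.350
Each record gives the standard difficulty followed by the NFAC factor loadings (or slopes).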
Format
FILE='filename'
Default
None.
Related topics
FORM keyword
Purpose
To provide, solely for convenience, a test form identification following the case number in
the simulation records; n may be set to any natural number.
Format
FORM=n
Default
1.
GROUP keyword
Purpose
To provide, solely for convenience, a group identification following the case number in the
simulation records; n may be set to any natural number.
Format
GROUP=n
Default
1.
GUESSSEED keyword
Purpose
To specify the seed of the random number generator used to generate the independent probability
of chance success on an item response.
The random number generator seed may be any natural number greater than 1 and less than
2147483647.
Format
GUESSSEED=n
Default
543612.
LOADINGS/SLOPES option
Purpose
Select LOADINGS if the item parameters are supplied in the form of standard item difficulty
(i.e., the standard normal deviate corresponding to the population percent CORRECT),
followed by the NFAC factor loadings.
Select SLOPES if the item parameters are in the form of standard item difficulty, followed
by NFAC factor slopes.
Format
LOADINGS/SLOPES
Default
LOADINGS.
Related topics
MEAN keyword
Purpose
To provide the population means of the factor scores from which the responses are gener-
ated. These means will be added to the random standard normal deviates representing the
ability of each case on the corresponding factors. The maximum number of factors allowed
is 15.
Format
MEAN=(n1,n2,...,nm)
Default
None.
NCASES keyword
Purpose
To specify the number of case records to be simulated.
Format
NCASES=n
Default
1.
NFAC keyword
Purpose
To specify the number of factors in the simulation model.
Format
NFAC=n
Default
1.
PARM keyword
Purpose
To specify the number of parameter values (intercept and factor loadings) supplied by the
user. n = f + 1, where f is the number of factor loadings.
Format
PARM=n
SCORESEED keyword
Purpose
To provide the seed of the random number for generating the multivariate normal ability dis-
tribution. The mean and standard deviation of each variate are assumed to be zero and one, respec-
tively. The random number generator seed may be any natural number greater than 1 and
less than 2147483647.
Format
SCORESEED=n
Default
345261.
STOP COMMAND
(Required)
Purpose
To terminate the TESTFACT run.
Format
>STOP
Note
SUBTEST COMMAND
(Optional)
Purpose
To specify the partition of the main test into subtests and to assign names to the subtests.
Format
Examples
>SUBTEST BOUNDARY=(10,20,30), NAMES=(BASIC,AVERAGE,ADVANCED);
A test with 30 items will be partitioned into 3 named subtests of 10 items each:
BASIC 1 through 10
AVERAGE 11 through 20
ADVANCED 21 through 30
Default
No subtests.
BOUNDARY keyword
Purpose
To specify the order number of the last item in each subtest.
If the SELECT command is used to reorder items before subtests are partitioned, boundaries
are specified by the new order numbers.
Format
BOUNDARY=(n1,n2,...,np)
Default
None.
Related topics
NAMES keyword
Purpose
To specify a name of no more than 8 characters for each subtest. Note that the rules for
naming items also apply to naming subtests (see the NAMES command, Section 5.3.14).
Format
NAMES=(n1,n2,...,np)
Default
No names.
Related topics
TECHNICAL COMMAND
(Optional)
Purpose
To change the value of the default constants in the item factor analysis.
Format
Example
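A possible specification (the keyword values are assumptions, purely for illustration):
>TECHNICAL NOADAPT,ITER=(20,8,0.001);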
Default
ACCEL keyword
Purpose
To set the acceleration constant used to speed convergence of the EM cycles.
Format
ACCEL=n
Default
1.0.
FRACTION option
Purpose
To invoke a three-point quadrature with an 81-point fractional factorial design. This option
is only applicable in the case of adaptive quadrature with five factors. Otherwise, the full
243-point design is used.
Format
FRACTION
FREQ keyword
Purpose
Format
FREQ=n
Default
0.
IQUAD keyword
Purpose
n = 1: Gauss-Hermite quadrature.
n = 2: Gauss-Hermite quadrature; the quadrature points as well as the weights are printed.
n = 3: Quadrature using ordinates.
n = 4: Quadrature using ordinates; the quadrature points as well as the weights are printed.
Format
IQUAD=n
Default
n = 4.
ITER keyword
Purpose
To control the EM cycles: c is the maximum number of EM cycles, d the number of M-step
iterations per EM cycle, and e the convergence criterion.
Format
ITER=(c,d,e)
Note
d and e are used only in non-adaptive quadrature; there is only one M-step per EM-cycle in
adaptive quadrature.
Default
(c,d,e)=(15,5,0.005).
ITLIMIT keyword
Purpose
To specify the number of EM cycles prior to the fixing of conditional distributions. In adap-
tive quadrature, the means and covariances of the conditional distribution of factor variables
for each case are computed only in the first ITLIMIT EM cycles. Thereafter, the conditional
distributions are held fixed for each case. Change of the log likelihood between EM cycles is
computed and displayed only after fixing has occurred. In Monte Carlo EM, the sampled
points are fixed at their values for each case in the ITLIMIT cycle.
Format
ITLIMIT=n
Default
Adaptive: n = 10
MCEMSEED keyword
Purpose
To specify the generation of random multivariate normal variables for Monte Carlo integra-
tion in the full information EM solution. n is the number of points sampled for Monte Carlo
EM solution (min = 2; max = 2147483646). If this keyword appears, the quadratures in the
E-step of the EM cycles will be performed by Monte Carlo integration; otherwise, fixed-
point quadrature is used.
Format
MCEMSEED=n
Default
Fixed-point quadrature.
NITER keyword
Purpose
Format
NITER=(h,i)
where
Default
(h,i)=(3, 0.001).
NOADAPT option
Purpose
To specify that non-adaptive quadrature be performed in the full information solution. Note
that this option can only be invoked if there are 5 or fewer factors; with more than 5 factors,
this option, if present, will be ignored and adaptive fractional quadrature will be performed
(with 3 points per dimension). If the NOADAPT option is not invoked, all quadrature is adap-
tive.
Format
NOADAPT
NOSORT option
Purpose
To suppress the sorting of response patterns with respect to their number correct scores. In
non-adaptive quadrature, such sorting can be used to speed computation. As it has no advan-
tage in adaptive quadrature or Monte Carlo, or in the BIFACTOR solution, NOSORT is always
used in these solutions.
Format
NOSORT
Related topics
BIFACTOR command
NSAMPLE keyword
Purpose
To specify the number of points sampled in the latent variable space when numerical inte-
gration used in the marginal maximum likelihood procedure is based on a fractional factorial
design. For example, if the number of factors equals 4, then a fractional factorial design re-
quires 3^4 = 81 points. Likewise, when five factors are specified the number of points is
3^5 = 243.
Format
NSAMPLE=n
Default
PRECISION keyword
Purpose
Format
PRECISION=n
Default
One-third of the maximum number of EM cycles (see ITER keyword on TECHNICAL com-
mand).
Related topics
PRV keyword
Purpose
Format
PRV=n
n = 1 Provisional estimates of slope and intercept parameters are printed after each E-
step.
n = 2 Provisional estimates of slope and intercept parameters are printed after each M-
step iteration.
n = 3 Provisional estimates of slope and intercept parameters are printed after each E-
step and M-step iteration.
n = 4 Provisional estimates of slope and intercept parameters and their corrections are
printed as for n = 3.
Default
0.
QSCALE keyword
Purpose
To set the value of the extreme points in adaptive quadrature when QUAD or SQUAD equals 3.
Format
QSCALE=n
Default
1.2.
Related topics
QUAD keyword
Purpose
To specify the number of quadrature points, 1 to 10, per dimension in the full information
solution.
Format
QUAD=n
Default
QWEIGHT keyword
Purpose
m: The weights (m, 1 − 2m, m) are assigned to the points; m must be fractional.
Format
QWEIGHT=m
Default
SQUAD keyword
Purpose
To specify the number of quadrature points for EAP estimation of factor scores.
Format
SQUAD=n
Default
TETRACHORIC COMMAND
(Optional)
Purpose
To specify how to form the count matrix that is used in calculating tetrachoric correlations.
Format
Examples
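A possible specification (the keyword choices are assumptions, purely for illustration):
>TETRACHORIC LIST,NDEC=3,COMPLETE;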
Default
No correlations computed.
CROSS option
Purpose
To ensure that the joint frequencies for each pair of items appear in the output.
Format
CROSS
Default
LIST option
Purpose
To ensure that the matrix of tetrachoric correlations (and possibly warning messages) will
appear in the printed output. This correlation matrix may be saved even when it is not listed
(see the SAVE command).
Format
LIST
Default
Related topics
NDEC keyword
Purpose
To specify the number of decimal places used when the tetrachoric correlations are printed.
Format
NDEC=n
Default
3.
RECODE/PAIRWISE/COMPLETE option
Purpose
To specify the treatment of observations that include omits. One of the following options
may be selected:
RECODE: omitted responses are recoded as incorrect.
PAIRWISE: each correlation is computed from all cases that responded to both items of the pair.
COMPLETE: only cases without omitted responses are used.
Format
RECODE/PAIRWISE/COMPLETE
Default
RECODE.
TIME option
Purpose
To specify that omitted items following the last non-omitted item be treated as not-
presented. All omitted items prior to the last non-omitted item will be recoded as “wrong” if
the guessing mode is not selected. If the guessing mode is selected, these items will be
scored “correct” with probability g_j and “incorrect” with probability 1 − g_j.
The TIME option does not affect RECODE, but if TIME is combined with COMPLETE or
PAIRWISE, different tetrachoric correlation coefficients will result.
Format
TIME
Related topics
TITLE COMMAND
(Required)
Purpose
To provide a label that will be used throughout the output to identify the problem run.
Format
>TITLE
…text…
…text…
Notes
The TITLE command consists of three lines. The first line contains the TITLE command, and
is followed by two lines of 80 characters maximum containing the title text. Using only one
title line will cause an error condition. If the title does not require two lines, leave the sec-
ond line blank.
Example
>TITLE
ENGLISH LANGUAGE COMPREHENSION TEST
ITEM AND TEST STATISTICS
>PROBLEM…
Default

VARIABLE FORMAT STATEMENT
The data layout must be described in a variable format statement. This statement is entered
within parentheses and follows immediately after the INPUT command.
When data (labels, raw data, summary statistics) are used in fixed format, a format statement is
needed to instruct the program how to read the data. The general form of such a statement is
(rCw) or (rCw.d),
where:
r   Repeat count (may be omitted; it defaults to 1)
C   Format code
w   Field width, or number of columns
d   Number of decimal places (for F-format).
The following codes are used to indicate the type of value to be read:
A   Alphanumeric (character) field
I   Integer field
F   Real (floating-point) field
X   Skip the indicated number of columns
T   Tab to the indicated column
/   Skip to the next line
The format statement must be enclosed in parentheses. Blanks within the statement are ignored:
(rCw.d) is acceptable. The program also ignores anything after the right parenthesis and on the
same line. Thus, comments may be placed after the format statement.
The labels HEIGHT, WEIGHT, AGE, and IQ could be read in fixed format as
(A6,A6,A3,A2)
HEIGHTWEIGHTAGEIQ
(4A6)
HEIGHTWEIGHT AGE IQ
Note that the first method lets the repeat count default to 1, and that it describes several different
fields, separated by commas, with one statement.
The following example shows three ways to read five integers, with the same result:
(5I1)
12345
(5I2)
1 2 3 4 5
(I1,I2,3I3)
1 2 3 4 5
The F-format requires that the number of decimal places be specified in the field description, so
if there are none (and eight columns) specify (F8.0); (F8) is not allowed. However, if a data
value contains a decimal point, then this overrides the location of the decimal point as specified
by the general field description. If the general field description is given by (F8.5), then
12345678 would result in the real number +123.45678, but the decimal point in –1234.56
would not change. Just a decimal point, or only blanks, will result in the value zero. The plus
sign is optional.
It is possible to use the format statement to skip over variables in the data file when they are not
needed in the analysis. For example, (F7.4,8X,2F3.2) informs the program that the data file
has 21 columns per record. The first value can be found in the first seven columns (and there are
four decimal places), then eight columns should be skipped, and a second and third value are in
columns 16 – 21, both occupying three columns (with two decimal places). Note that the SELECT
command allows selection and reordering of variables.
Another possibility is the use of the tabulator format descriptor T, followed by a column number
n. For example, (1F8.5,T60,2F5.1) describes three data fields: in columns 1 – 8, with five
decimal digits, next in columns 61 – 65 and 66 – 70, both with one decimal digit. If the number n
is smaller than the current column position, left tabbing results. A forward slash (/) in an F-for-
mat means “skip the rest of this line and continue on the next line”. Thus, (F10.3/5F10.3) in-
structs the program to read the first variable on the first line, then to skip the remaining variables
on that line and to read five variables on the next line.
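Combining these descriptors, a statement such as the following (a sketch; the record layout is
assumed purely for illustration) would read a 5-character case label from columns 1 – 5 and
twenty one-column item responses beginning in column 11:
(A5,T11,20I1)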
Related topics
6 IRT graphics
6.1 Introduction
A new feature included with the IRT programs is the IRT GRAPHICS procedure. Item
characteristic curves, item and test information curves, and a histogram of the estimated abilities
may be plotted. A matrix plot showing all item characteristic curves simultaneously can also be
obtained. This feature is accessed via the Run menu on the main menu bar and becomes
available once the analysis has been completed. The plots are based on the contents of the
parameter files produced by the respective programs. In this chapter, an overview of the interface
and options of this feature is given.
The Main window of the IRT GRAPHICS program is used to access the following graphics: item
characteristic curves, item information curves, combined item characteristic and information
curves, test information and measurement error curves, matrix plots of item characteristic
curves, histograms of ability scores, and bivariate plots of ability against percentage correct.
The graphs displayed may be selected, changed, saved to file, or printed using various options
and dialog boxes described in Section 6.3. To exit the program, click the Exit option on the
Main menu.
This option provides access to item characteristic curves for all the items in the test. In the image
below, the ICC for item 2 is displayed.
As a nominal model was fitted in this case, the high category is displayed in red and a
message to this effect is displayed in the Category Legends box at the bottom of the win-
dow. This field contains the legend for all categories plotted.
The Next button provides access to following items, while the Prev button allows the user
to go back to previously viewed Item Characteristic Curves (ICCs).
Use the Main Menu button at the bottom left of the window to return to the main menu.
The graph can be selected, edited, saved, or printed using the File, Edit, Graphs, and Op-
tions menus on the main menu bar. For more on the options available, see Section 6.3.
Related topics
This option provides access to item information curves for all the items in the test. In the
image below, the item information curve for the second item is displayed.
The Scaling Information box at the bottom of the window contains information on the
scaling of the information axis. The item with the most information is indicated here for
all items in a test.
The Next button provides access to following items, while the Prev button allows the user
to go back to previously viewed item information curves.
Use the Main Menu button at the bottom left of the window to return to the main menu.
The graph can be selected, edited, saved, or printed using the File, Edit, Graphs, and Op-
tions options on the main menu bar. For more on the options available, see Section 6.3.
Related topics
When this option is selected from the Main menu, the ICC and item information curve for an
item are displayed simultaneously.
As a nominal model was fitted in this case, the high category is displayed in red and a
message to this effect is displayed in the Category Legends box at the bottom of the win-
dow. This field also contains information on the legend for all other categories plotted.
The Next button provides access to following items, while the Prev button allows the user
to go back to previously viewed item curves.
Use the Main Menu button at the bottom left of the window to return to the main menu.
The graph can be selected, edited, saved, or printed using the File, Edit, Graphs, and Op-
tions menus on the main menu bar. For more on the options available, see Section 6.3.
Related topics
This option is used to access the test information and standard error curves.
The total test information for a given scale score is read from the axis on the left of the
graph and is plotted in blue.
The axis to the right of the graph is used for reading the standard error estimate for a given
scale score. The measurement error is shown in red.
Use the Main Menu button at the bottom left of the window to return to the main menu.
The Next and Prev buttons may be used to access similar plots for multiple groups (if
any).
The graph can be selected, edited, saved, or printed using the File, Edit, Graphs, and Op-
tions menus on the main menu bar. For more on the options available, see Section 6.3.
Related topics
This option provides an organized way of simultaneously looking at the item characteristic
curves of up to 100 items.
In the graph below, the ICCs of 35 items are plotted. As can be seen from the graph, models fit-
ted to the items range from the 1PL model to the nominal, graded and multiple response models.
Item 1 is shown in the top left corner of the combined graph, as indicated by the item numbers
given to the right of the plots. The gray lines dividing each plot into four quadrants are drawn at
a probability of 0.5 (on the y-axis) and ability of 0 (on the x-axis).
To take a closer look at item 20, to which a nominal response model was fitted, click and drag
the right mouse button to select the area for zooming as shown below.
Releasing the mouse button produces a closer look at the graph for item 20 as shown below.
Note that any part of the matrix of plots can be selected for zooming, and that the zoom option is
also available for already enlarged areas of the matrix such as that shown below.
Note that the high category is shown in red. To reset the image, double-click the right mouse but-
ton.
Up to 100 items can be simultaneously displayed. If the test contains more than 100 items,
return to the Main Menu and click the Matrix Plot button again for the next set of items.
The graphs can be selected, edited, saved, or printed using the File, Edit, Graphs, and
Options menus on the main menu bar. For more on the options available, see Section 6.3.
Related topics
The Histogram option provides a histogram of the ability scores. This option is only available if
scoring has been requested and the scores have been saved to an external file.
As indicated in the legend box at the bottom of the window, abilities are rescaled to a mean of 0
and standard deviation of 1. The area under the bell-shaped curve equals the total area of the his-
togram.
Use the Main Menu button at the bottom left of the window to return to the main menu.
The Next and Prev buttons may be used to access similar plots for multiple groups (if
any).
The graph can be selected, edited, saved, or printed using the File, Edit, Graphs, and Op-
tions options on the main menu bar.
Related topics
The Bivariate Plot option provides a regression of ability on the percentage correct. This option
is only available if scoring has been requested and the scores have been saved to an external file.
As with the matrix plots, segments of the plot may be inspected by zooming in. This is
done by clicking and dragging the mouse to select the area of interest.
A 95% prediction interval for a new examinee is also shown on the plot.
Use the Main Menu button at the bottom left of the window to return to the main menu.
The graph can be selected, edited, saved, or printed using the File, Edit, Graphs, and Op-
tions menus on the main menu bar. For more on the options available, see Section 6.3.
If information is available for multiple groups, bivariate plots are available by group and
the Next and Prev buttons may be used to access the plots for following groups.
Displayed graphs can be modified, saved and printed by using menus available on the main
menu bar of the graph window.
The Save as Metafile option is used to save the selected page or graph as a *.wmf (Win-
dows Metafile) for later use in other applications.
Note that an entire page, including legend boxes, may be printed using the Print current
page option.
Alternatively, the Show Selectors option on the Options menu may be used to select a
graph, after which the Print selected graph option of the File menu may be used to print
only the selected graph.
The Printer Setup and Printing Options options provide access to the standard Windows
printing controls.
Related topics
Options menu
The Edit menu is used for copying of graphs or entire pages to the Windows clipboard. To select
a graph, the Show Selectors option on the Options menu may be used.
Related topics
Options menu
The Options menu is used to enable graph selectors and to highlight a selected graph. When
the Show Selectors option is activated, the selectors for the three areas of the graph below
(the ICC, the item information curve and
the Category legends box) are displayed at the right of the window. The second graph has been
selected, and this entire section of the window is highlighted in dark red. This selected graph
may now be saved or printed using options on the File menu.
The Graphs menu provides access to the Parameters and Fill Page options.
The Fill Page option is used to resize the graph to fill the entire window. The Parameters option
is used to change attributes of the graph displayed and is associated with the Graph Parameters
dialog box. This dialog box is used to change the position, size, and color of the currently se-
lected graph and its plotting area.
The Left, Top, Width, and Height edit controls allow the user to specify a new position
and size of the graph (relative to the page window) and of the plotting area (relative to the
graph window).
The Color drop-down list boxes are used to specify the graph window color and the color
of the graph’s plotting area.
If the Border check box is checked, the graph will have a border around it.
If the Border check box is checked, the Border Attributes button leads to another stan-
dard dialog box (the Line Parameters dialog box) that allows specification of the thickness,
color, and style of the border line.
In addition to the Graphs Parameters dialog box, a number of other dialog boxes may be used
to change attributes of graphs. The dialog boxes accessible depend on the type of graph dis-
played. The dialog boxes include those for editing axis labels, bars, legends, lines, plotted
curves, and text strings; each is described below.
The user may access any of these dialog boxes by double-clicking in the corresponding section
of the graph. For example, double-clicking in the legend area of the graph will activate the Leg-
end Parameters dialog box. Double-clicking on the title of the graph, on the other hand, will
provide access to the Text Parameters dialog box.
This dialog box is used for editing axis labels and is activated by double clicking on the axis of a
displayed graph.
The Labels Position group box controls the position of the labels relative to the axis or
plotting area.
The Last Label group box allows manipulation of the last label drawing options. If On is
selected, the last label is displayed like the others. If Off is selected, it is not displayed. If
Text is selected, the text string entered in the edit box below will be displayed instead of
the last numerical label.
The format of the numerical labels can be specified using the radio buttons in the Format
group box.
The Date Parameters group box becomes active once the Date radio button is checked.
The Date Format box selects the date format to use for labels, while the Date Time Base
box selects the time base (minute, hour, day, week, month, year) for the date calculations.
The Starting Date drop-down list boxes specify the starting date that corresponds to the
axis value of 0. All dates are calculated relative to this value.
If the Set Precision check box is not checked, the labels’ precision is determined auto-
matically. If it is checked, the number entered into the #Places field specifies the number
of digits after the decimal point.
The Text Parameters button provides access to the Text Parameters dialog box (see
Section 6.3.10) that controls the font, size, and color of labels.
Related topics
This dialog box is used for editing the parameters of all bars in a regular bar graph, or a selected
group member of grouped bar graphs. It is displayed when a bar in the histogram (Histogram
option on the Main menu) is double-clicked.
It operates as follows:
If the Border check box is checked, the bars have a border around them. In this case, the
Border Attributes button leads to the Line Parameters dialog box that controls border
thickness, color, and style.
The Data button leads to the spreadsheet-style window for editing plotted data points
(shown below).
The Hatch Style drop-down list box allows the user to choose the hatch style for bars.
The Bar Color scrolling bars control the bar RGB color.
The Position radio buttons control the bar position relative to the independent variable
values.
The Width string field allows the user to enter the bar width in units of the independent
variable.
This dialog box allows the editing of legends. It opens when the mouse button is double-clicked
while the cursor is anywhere inside the legend box, except over a symbol representing a plotting
object.
The Left, Top, Width, and Height edit controls allow the user to specify a new position
and size of the legend-bounding rectangle relative to the graph window.
The Color drop-down menu specifies the legend rectangle background color.
If the Border check box is checked, the rectangle will have a border. In this case, the
Border Attributes button leads to the Line Parameters dialog box that controls border
thickness, color, and style of the border line.
The multi-line text box in the lower left corner lists and allows editing of each of the leg-
end text strings.
The Text Parameters button leads to the Text Parameters dialog box discussed earlier.
Related topics
This dialog box is used for editing lines in the graph. It is accessed via the Plot Parameters dia-
log box, which is activated when a curve in a graph is double-clicked.
The Style drop-down list box, visible when activated, allows selection of a line style.
The Width control specifies the line width, in window pixels.
Related topics
The Plot Parameters dialog box itself operates as follows:
The type of line to be displayed may be changed using the Type drop-down list box.
To fill the area under the curve, the Fill Area check box may be used.
The type of curve fitted (spline or not) is controlled by the Spline check box.
The Data button provides direct access to the data used to plot the curve.
The Line Attributes button provides access to the Line Parameters dialog box (shown to
the right of the Plot Parameters dialog box below). The Line Parameters dialog box is
discussed elsewhere in this section.
Related topics
This dialog box is used for editing text strings, labels, titles, etc. It can be called from some of
the other dialog boxes controlling graphic features. It may be activated by double clicking on any
text in a displayed graph.
The Text edit control allows the user to edit the text string.
The Font drop-down list box allows control of the typeface.
The text color can be selected from the Color drop-down menu.
The size of the fonts (in points) is controlled by the Size drop-down menu.
The Bold, Italic and Underline check boxes control the text style.
The item characteristic curve is a nonlinear function that portrays the regression of the item score
on the trait or ability measured in a test. It shows the relationship between the probability of suc-
cess on the item and the ability measured by the item set or test containing the item.
In the case of binary data, a single curve is used to portray this relationship, and the difficulty,
discrimination and guessing parameters (where applicable) are indicated on the graph. In poly-
tomous models such as the graded response model and nominal response model, a number of
item option curves are plotted. Each curve shows the selection probability of a category of the
item as a function of the ability.
For a description of the models for which item characteristic curves or item option curves may be
obtained, see
Binary data:
Polytomous data:
Item information functions are dependent on ability and provide valuable insight into the differ-
ences in the precision of measurement at different ability levels. They are of particular interest in
test construction, where these curves can be used to ensure the inclusion of different items that
maximize the precision of measurement at different levels of θ in the test.
In the case of a 1PL model, the item information function is given by (Hambleton & Swa-
minathan, 1985, Table 6-1)

$$I_i(\theta) = D^2 P_i(\theta) Q_i(\theta),$$

where $Q_i(\theta) = 1 - P_i(\theta)$. The maximum value of the information is constant for the
one-parameter model and occurs at the point $\theta = b_i$.
For a 2PL model, the item information function is given by (Hambleton & Swaminathan, 1985,
Table 6-1)

$$I_i(\theta) = D^2 a_i^2 P_i(\theta) Q_i(\theta),$$

with the maximum value directly proportional to the square of the item discrimination parameter,
a. A larger value of a is associated with greater information. The maximum information is ob-
tained at $\theta = b_i$.
For the three-parameter model, the information function is (Hambleton & Swaminathan, 1985,
Table 6-1)

$$I_i(\theta) = D^2 a_i^2 \frac{Q_i(\theta)}{P_i(\theta)} \left[ \frac{P_i(\theta) - c_i}{1 - c_i} \right]^2,$$

with the maximum information obtained at the point

$$\theta = b_i + \frac{1}{D a_i} \ln \left\{ \frac{1}{2} + \frac{1}{2} \sqrt{1 + 8 c_i} \right\}.$$
The slope of the item response function and the conditional variance at each ability level θ play
an important role in terms of the information provided by an item. An increase in the slope, to-
gether with a decrease in the variance, leads to more information being obtained. This in turn
provides a smaller standard error of measurement. By assessing these curves, items with large
standard errors of measurement may be identified and discarded.
The contributions of both item and test information curves are summarized by Hambleton &
Swaminathan (1985) as follows:
“The item and test information functions provide viable alternatives to the classical concepts
of reliability and standard error. The information functions are defined independently of any
specific group of examinees and, moreover, represent the standard error of measurement at
any chosen ability level. Thus, the precision of measurement can be determined at any level
of ability that is of interest. Furthermore, through the information function, the test construc-
tor can precisely assess the contribution of each item to the precision of the total test and
hence choose items in a manner that is not contradictory with other aspects of test construc-
tion.”
The item and item information curves for two items to which a 3PL model has been fitted are
shown below. The discrimination parameter for item 24 is approximately twice that of item 25,
and the effect of this can be seen in the corresponding item information curves. Both item infor-
mation functions were plotted on the same scale. The item in the test with the most information
determines the scale.
Related topics
The test information function summarizes the information function for a set of items or test. The
contribution of each item in the test to the total information is additive, as can be seen from the
definition of the test information function
$$I(\theta) = \sum_{i=1}^{n} \frac{\left[ P_i'(\theta) \right]^2}{P_i(\theta) Q_i(\theta)},$$

where $P_i(\theta)$ denotes the probability of an examinee responding correctly to item i given an
ability of $\theta$, and $Q_i(\theta) = 1 - P_i(\theta)$.
The function provides information for a set of items at each point on the ability scale and the
amount of information is influenced by the quality and number of test items. As was the case for
the item information function, the item slope and item variance play an important role. An in-
crease in the slope and a decrease in the item variance both lead to more information being ob-
tained. This in turn provides a smaller standard error of measurement. Also note that the contri-
bution of each test item is independent of the other items in the test.
The amount of information provided by a set of test items at an ability level is inversely related
to the error associated with ability estimates at the ability level. The standard error of the ability
estimate is

$$SE(\theta) = \frac{1}{\sqrt{I(\theta)}}.$$
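As a numerical illustration (the value is assumed): if a set of items yields $I(\theta) = 16$ at a
given ability level, the standard error of the ability estimate at that level is
$SE(\theta) = 1/\sqrt{16} = 0.25$.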
An example of test information and measurement error curves is shown below. Note that the ver-
tical axis to the left is used to find the information at a given ability while the vertical axis to the
right serves a similar purpose for the standard error curve.
Related topics
7 Overview of models
In this section, a brief overview of each of the four IRT programs published by Scientific Software
International, Inc., is given. In subsequent sections, the models available and statistics that are
produced by each program are discussed in detail.
7.1.1 BILOG-MG
BILOG-MG (Zimowski, Muraki, Mislevy & Bock, 1996) is an extension of the BILOG (Mislevy
& Bock, 1990) program that is designed for the efficient analysis of binary items, including mul-
tiple-choice or short-answer items scored right, wrong, omitted, or not-presented. The program
performs the same analyses as BILOG in the single-group case. The BILOG-MG program im-
plements an extension of Item Response Theory (IRT) to multiple groups of respondents.
The program provides 1, 2, and 3 parameter logistic models for binary scored responses and ac-
commodates both nonequivalent groups equating for maintaining the comparability of scale
scores as new forms of the test are developed, and vertical equating of test forms across school
grades or age groups.
Analysis of differential item functioning (DIF) with respect to item difficulty associated with
demographic or other group differences may be performed with BILOG-MG, and provision is
made for the detection and correction for item parameter trends over time (DRIFT). In addition,
the BILOG-MG program provides for “variant items” that are inserted in tests for purposes of
estimating item statistics, but are not included in the scores of the examinees.
The present version of BILOG-MG includes a fully developed Windows graphical interface.
Syntax can be generated or adapted using menus and dialog boxes or, as before, in the format of
command files in text format. The interface presents its menu options in the order in which the
user would generally use them: model specification is followed by data specification and technical
specifications, etc. Each of the menu options provides access to a number of dialog boxes on
which specifications are entered by the user.
7.1.2 PARSCALE
A versatile IRT rating-scale program is PARSCALE, written by Muraki & Bock (1996).
PARSCALE is capable of large-scale production applications with unlimited numbers of items
or respondents. The program can perform item analysis for both dichotomous and polytomous
data, and scoring of any number of subtests or subscales in a single program run. Up to 15 cate-
gories can be accommodated by PARSCALE. The user has the option to use the normal ogive or
the logistic response function.
This program includes options to make adjustments for differences in rater severity, and to ana-
lyze DIF of rating-scale items. PARSCALE has the ability to mix rating-scale and multiple-
choice items, with or without guessing, and to handle multiple subtests and weighted combinations
of subtest scores. The program also provides the option to use Samejima’s graded response
model generalized for rating-scales or Masters’ partial credit model with or without discriminat-
ing power coefficients.
Program output may be directed to text files for purposes of selecting items or preparing reports
of test scores.
7.1.3 MULTILOG
The most versatile of the SSI IRT programs is MULTILOG (Thissen, 1991). It applies to both
binary and multiple category item scores and makes use of logistic response models, such as
Samejima’s (1969) model for graded responses, Bock’s (1972) model for nominal (non-ordered)
responses, and Thissen & Steinberg’s (1984) model for multiple-choice items. The commonly
used logistic models for binary item response data are also included, because they are special
cases of the multiple category models. MULTILOG provides Marginal Maximum Likelihood
(MML) item parameter estimates for data in which the latent variable of IRT is random, as well
as Maximum Likelihood (ML) estimates for the fixed-effects case. χ 2 indices of the goodness-
of-fit of the model are provided. In IRT, the item parameter estimates are the focus of item
analysis. MULTILOG also provides scaled scores: ML and Bayes modal (MAP) estimates of the
latent variable for each examinee or response pattern.
MULTILOG is best suited to the analysis of multiple-alternative items, such as those on multi-
ple-choice tests or Likert-type attitude questionnaires. It is the only widely available program
capable of fitting a wide variety of models to these kinds of data using optimal (MML) methods.
MULTILOG also facilitates refined model fitting and hypothesis testing through general provi-
sions for imposing equality constraints among the item parameters and for fixing item parame-
ters at a particular value. MULTILOG may also be used to test hypotheses about Differential
Item Functioning with either multiple response or binary data, through the use of its facilities to
handle data from several populations simultaneously and test hypotheses about the equality of
item parameters across groups. It is the only IRT program that handles all the major models: 1, 2,
and 3 parameter logistic models, multiple nominal categories, graded rating-scale model, partial
credit model, multiple-choice model, and constrained parameter models. In contrast to previous
versions, it now analyzes models of any size up to the limit of available memory.
7.1.4 TESTFACT
TESTFACT is a factor analysis program for binary scored items. This program, by Bock, Gib-
bons, Schilling, Muraki, Wilson and Wood, implements all the main procedures of classical item
analysis, test scoring, and factor analysis of inter-item tetrachoric correlations, and also modern
methods of factor analysis based on IRT. It handles item selection, multiple subtests, multiple
groups of examinees and correlation without an external criterion. The user can also compute
tetrachoric correlations with or without omitted or not-presented items, perform MINRES princi-
pal factor analysis and full information item factor analysis with likelihood ratio test of the num-
ber of factors, compute Bayes estimates of factor scores from the multidimensional IRT model,
and simulate item response data for the multidimensional model.
New features in TESTFACT are all part of Full information Item Factor Analysis (FIFA). The
commands and procedures of classical item statistics and classical factor analysis of tetrachoric
correlation coefficients remain unchanged. The changes to full information item factor analysis
consist of a new and improved algorithm for estimating the factor loadings and scores—
specifically, new methods of numerical integration are used in the EM solution of the marginal
maximum likelihood equations. Four different methods of multidimensional numerical integra-
tion for the E-step of the EM algorithm are provided: adaptive quadrature, fractional adaptive
quadrature, non-adaptive quadrature, and Monte Carlo integration.
In exploratory item factor analysis, these methods make possible the analysis of up to fifteen fac-
tors and improve the accuracy of estimation, especially when the number of items is large. The
previous non-adaptive method has been retained in the program as a user-selected option
(NOADAPT), but the adaptive method is the default. The maximum number of factors with adap-
tive quadrature is 10; with non-adaptive quadrature, 5; with Monte Carlo integration, 15. Bayes
estimates of scores for all factors can be estimated either by the adaptive or non-adaptive
method. Estimation of the classical reliability of the factor scores is also included.
TESTFACT includes yet another full information method that provides an important form of
confirmatory item factor analysis called “bifactor” analysis. The factor pattern in bifactor analy-
sis consists of a general factor on which all items have some loading, plus any number of so-
called “group factors” to which non-overlapping subsets of items, assigned by the user, are as-
sumed to belong. The subsets typically represent small numbers of items that pertain to a com-
mon stem such as a reading passage or problem-solving exercise. The bifactor solution provides
Bayes estimation of scores for the general factor, accompanied by estimated standard errors that
properly account for association among responses attributable to the group factors.
The central concept of item response theory is that of the item response model. These models are
mathematical expressions describing the probability of a correct response to a test item as a func-
tion of the ability (or proficiency) of the respondent. For binary data, the response functions most
often encountered in IRT applications are the normal ogive and the logistic models. These are
discussed in Section 7.2.3. Multiple-group applications are considered in Section 7.2.2.
In the multiple-group case, it is assumed that the response function of any given item is the same
for all groups of subjects. In the DIF and DRIFT applications, however, we allow the relative
difficulties of the items to differ from one group to another or one occasion to another. In that
case, the b_j parameters will differ between groups, and we will have to detect and estimate the
differences. Even in the presence of DIF and DRIFT, however, it is assumed that the item dis-
criminating powers are the same from one group to another. In the other applications, such as
nonequivalent groups equating or two-stage testing, we assume that both the locations and the
slopes of items common to more than one group are equal. To satisfy this assumption, we would
perform a preliminary DIF analysis and not use, in equating, items showing appreciable DIF.
The main difference between the single-group and multiple-group case is in the assumption
about the latent distribution. In most equating situations, it is reasonable to assume that the re-
spondents in the sample groups are drawn from populations that are normal, but have different
means and standard deviations (see Figure 7.1).
In that case, the item response data can be described completely by estimating the means and
standard deviations of the groups along with the item parameters. One must, however, again con-
tend with the arbitrary origin and unit of the latent continuum, and may resolve this indeter-
minacy either by setting the mean and standard deviation of one of the groups to any arbitrary
values, or by setting the overall mean and variance of the combined distributions to arbitrary val-
ues. Both options are provided in BILOG-MG. The procedure for simultaneous estimation of
item parameters and latent distributions in more than one group are described in Bock & Zi-
mowski (1995) and in Mislevy (1987).
In two-stage testing applications, the situation is different. The groups correspond to examinees
who have been selected on the basis of a first-stage test to receive second-stage test forms tai-
lored to the provisional estimate of ability based on the first-stage test. Typically, the second-
stage groups are determined by cutting points on the θ -scale of the pretest. Because the pretest
score is a fallible criterion, the θ distributions of the second-stage groups may overlap to a con-
siderable extent, but they cannot be expected to be normal even when the population from which
the examinees originate is normal. More likely in these applications the latent distributions
would appear as in Figure 7.2.
(Footnote: This section was contributed by Michele Zimowski.)
To accommodate such arbitrary shapes of distributions, one must make use of the empirical es-
timation procedure (see the section on estimation in the next chapter). As in the single-group
case, these empirical distributions can be estimated along with the item parameters by marginal
maximum likelihood. Again, the indeterminacy of location and scale must be resolved, either by
setting the mean and standard deviation of one of the groups to convenient values, such as zero
and one, or setting the overall mean and standard deviation of the combined distributions to simi-
lar values. In DIF analysis of ethnic effects, for example, the usual approach is to assign the
mean and standard deviation arbitrarily in the reference group, which is usually the majority
demographic group.
In two-stage testing applications, where the groups represent an arbitrary partition of the original
sample, assigning the overall mean and standard deviation is more reasonable. In vertical equat-
ing and DRIFT analysis, on the other hand, the groups correspond to distinct populations, so the
best solution would be to choose a reference group, perhaps the youngest-age group or the first-
year group, and assign the mean and standard deviation arbitrarily in that group. Comparing the
estimated means and standard deviations of the remaining groups with the reference group would
then show the trends in the mean and variability of test performance in successive age groups or
year groups.
Equivalent groups equating refers to the equating of parallel test forms by assigning them ran-
domly to examinees drawn from the same population. In educational applications, this type of
assignment is easily accomplished by packaging the forms in rotation and distributing them
across whatever seating arrangement exists in the classroom. Provided there are fewer forms than
students per classroom, it is justifiable to assume that the abilities of the examinees who receive
the various forms are similarly distributed in the population. This is the assumption on which the
classical equi-percentile method of equating is based, and it applies also to IRT equating.
Indeed, the procedure is even simpler in IRT because the latent distribution of ability is invariant with respect to the distribution of item difficulties in the forms. This is not true of the number-right score of classical test theory, where the test score distribution in the population of respondents is an artifact of the distribution of item difficulties (see Lord & Novick, 1968, pp. 387-392). The
IRT scale scores computed from the various forms are therefore equated whenever their location
and scale are set in the same way for all forms. There is no necessity for common items between
forms, any more than there is for equi-percentile equating, but neither will they interfere with the
equivalent groups equating if present.
The method of carrying out equivalent groups equating is somewhat different, however, accord-
ing to whether common items between forms are or are not present. In both cases, the collection
of forms may be treated as if it were one test with length equal to the number of distinct items
over all forms. The data records are then subjected to a single-group IRT analysis and scoring.
When common items are not present, each form may also be analyzed as an independent test,
with the mean and standard deviation of the scale scores of all forms set to the same values dur-
ing the scoring phase.
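To make the mechanics concrete, the following minimal Python sketch (not part of the BILOG-MG documentation; the score arrays are hypothetical) rescales independently calibrated scale scores from two forms to a common mean and standard deviation, which is all that equivalent groups equating of IRT scale scores requires:

import numpy as np

def rescale_scores(scores, target_mean=0.0, target_sd=1.0):
    """Linearly rescale one form's scale scores to a target mean and SD."""
    scores = np.asarray(scores, dtype=float)
    return target_mean + target_sd * (scores - scores.mean()) / scores.std()

# Hypothetical scale scores from two forms given to equivalent groups:
form_a = np.random.default_rng(1).normal(0.2, 1.1, size=500)
form_b = np.random.default_rng(2).normal(-0.1, 0.9, size=500)

# Setting both to mean 0, SD 1 places the scores on a common reporting scale.
equated_a = rescale_scores(form_a)
equated_b = rescale_scores(form_b)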
Nonequivalent groups equating is possible only by IRT procedures and has no counterpart in
classical test theory. It makes stronger assumptions than equivalent groups equating, but it re-
mains attractive because of the economy it brings to the updating of test forms in long-term test-
ing programs. Either to satisfy item disclosure regulations or to protect the test from compro-
mise, testing programs must regularly retire and replace some or all of the items with others from
the same content and process domains. They then face the problem of equating the reporting
scales of the new and old forms so that the scores remain comparable.
Although equivalent groups equating will accomplish this, it requires a separate study in which
the new and old forms are administered randomly to examinees from the same population. A
more economical approach is to provide for a subset of items that are common to the old and
new forms, and to employ nonequivalent groups equating to place their scores on the same scale.
These common or “link” items are chosen from the old form on the basis of item analysis results.
Link items should have relatively high discriminating power, middle range difficulty, and should
be free of any appreciable DIF effect. With suitable common items included, the old and new
forms can be equated in data from the operational administration of the tests without an addi-
tional equating study. Only the BILOG-MG program can perform this type of equating.
Although the case records from the current administration of the new form and the earlier ad-
ministration of the old form are subjected to a single IRT item analysis in nonequivalent equat-
ing, the test form is identified on each case record and separate latent distributions are estimated
for examinees taking different forms. For typical applications of the procedure to unrestricted
samples of examinees, the latent distributions may reasonably be considered normal. In that case,
the estimation of the mean and standard deviation of each distribution jointly with the item pa-
rameters allows for the nonequivalence of the two equating groups. The common items provide
the link between the two samples of data so that we may fix the arbitrary origin and unit of a sin-
gle reporting scale. Simulation studies have shown that if the sample sizes for the two groups are
large enough to ensure highly precise estimation of the item parameters, as few as four anchor
items can accurately equate the reporting scales for the test forms (see Lord, 1980).
In the BILOG-MG procedure, this method of equating can be extended to nonequivalent groups
equating of any number of such forms, provided there are common items linking the forms to-
gether in an unbroken chain. An example of a plan for common item linking of a series of test
forms is shown in Figure 7.3.
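A plan of this kind can be checked mechanically. The following Python sketch (illustrative only; the form-to-item plan is hypothetical) verifies that a set of forms is connected in an unbroken chain through shared link items, using a simple union-find over forms:

def forms_fully_linked(form_items):
    """Check that all forms are connected through shared (link) items.

    form_items: dict mapping form name -> set of item identifiers.
    Two forms are linked when they share at least one item; the whole
    design supports chained nonequivalent groups equating only if every
    form can be reached from every other through such links.
    """
    forms = list(form_items)
    parent = {f: f for f in forms}

    def find(f):
        while parent[f] != f:
            parent[f] = parent[parent[f]]  # path compression
            f = parent[f]
        return f

    item_to_form = {}
    for f in forms:
        for item in form_items[f]:
            if item in item_to_form:
                parent[find(f)] = find(item_to_form[item])  # union linked forms
            else:
                item_to_form[item] = f

    return len({find(f) for f in forms}) == 1

# Hypothetical plan: each successive form shares two items with the previous one.
plan = {"F1": {1, 2, 3, 4}, "F2": {3, 4, 5, 6}, "F3": {5, 6, 7, 8}}
print(forms_fully_linked(plan))  # True: F1-F2-F3 form an unbroken chain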
Variant items
If total disclosure of the item content of an educational test is required, a slightly different strat-
egy is followed. Special items, called “variant” items, are included in each test form but not used
in scoring the form in the current year. It is not necessary that all test booklets contain the same
variant items; subsets of variant items may be assigned in a linked design to different test book-
lets in order to evaluate a large number of them without unduly increasing the length of a given
test booklet. These variant items provide the common items that appear among the operational
items of the new form, which itself includes other variant items in anticipation of equating to a
later form. The item calibration of the old and new form then includes, in total, the response data
in the case records for the operational items of the old form, for the linking variant items that ap-
peared on the old form, and for all operational items from the new form. In this way, all of the
items in the current test form can be released as soon as testing is complete.
Vertical equating
Vertical equating refers to the creation of a single reporting scale extending over a number of
school grades or age groups. Because the general level of difficulty of the items in tests intended for such groups must increase with the grade or age, the forms cannot be identical. There
is little difficulty in finding items that are suitable for neighboring grades or age groups, how-
ever, and these provide the common items that can be used to link the forms together on a com-
mon scale. Inasmuch as these types of groups necessarily have different latent distributions, non-
equivalent groups equating is required. BILOG-MG offers two methods for inputting the re-
sponse records. In the first method, each case record spans the entire set of items appearing in all
the forms, but the columns for the items not appearing in the test booklet of a given respondent
are ignored when the data are read by the program. All of the items thus have unique locations in
the input records and are selected from each record according to the group code on the record. In
the second method, the location of the items in the input records is not unique. An item in one
form may occupy the same column as a different item in another form. In this case, the items are
selected from the record according to the form and the group codes on the record. These methods
of inputting the response records apply in all applications of BILOG-MG. See Chapter 10 for ex-
amples of both types of data input.
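As an illustration of the second input method, the following Python sketch (the record layout is hypothetical: columns 1-5 hold the case ID, column 6 the form code, and columns 7-16 the responses) shows how the form code on each record determines which items the response columns represent:

# Hypothetical mapping from form code to the items occupying the response columns.
FORM_ITEM_MAP = {
    "1": ["I01", "I02", "I03", "I04", "I05", "I06", "I07", "I08", "I09", "I10"],
    "2": ["I06", "I07", "I08", "I09", "I10", "I11", "I12", "I13", "I14", "I15"],
}

def read_record(line):
    case_id = line[0:5].strip()
    form = line[5]
    responses = line[6:16]
    # Map each response column to the item it represents on this form.
    return case_id, {item: resp for item, resp in zip(FORM_ITEM_MAP[form], responses)}

print(read_record("00001" + "1" + "1011100110"))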
The purpose of differential item functioning analysis is to detect and estimate interactions be-
tween item difficulties and various subgroups within the population of respondents (see Thissen,
Steinberg, & Wainer, 1993). It is most often applied to interactions with respect to demographic
or ethnic groups and to gender, but any classification of the respondents could be investigated in
a similar manner. Specifically, it is the interactions of the item location parameters, $b_j$, reflecting the item difficulties, that are in question. DIF includes only the relative differences in difficulties between the groups. Any reduction of the item percents correct due to the average level of ability in the group, as indicated by the mean of the corresponding latent distribution, we attribute to the "adverse impact" of the test and do not regard it as DIF. Moreover, we assume that the differential item functioning does not extend to the item discriminating powers: the $b_j$ parameters for the separate groups are estimated on the assumption that the slope parameters, $a_j$, are homogeneous across groups. (For an alternative form of DIF analysis that includes differential item discriminating power, see Bock, 1993.)
DIF analysis is similar to nonequivalent groups equating in the sense that different latent distri-
butions are assumed for the groups in question, but it differs because the same form of the test is
administered in all of the groups. It also provides large-sample standard error estimates of the effect estimators. In addition, the program provides an overall marginal likelihood ratio test of the presence of differential item functioning in the data. To perform this test, first analyze the data in a single group as if they came from the same population and note the value labeled –2 LOG LIKELIHOOD in the final iteration. Then, analyze the data in separate groups using the DIF model and again note the final value. Under the null hypothesis of no DIF effects on item locations, the difference of these values is distributed in large samples as $\chi^2$ with (n − 1)(m − 1) degrees of freedom, where n is the number of items and m is the number of groups. When this $\chi^2$ is significant, there is evidence that differential item effects are present. Their interpretation usually becomes clear when the item content is examined in relation to the direction of the estimated contrasts in the $b_j$ parameters; because these contrasts are interactions, they must sum to zero (some are positive and others negative).
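As a sketch of this computation (the likelihood values are hypothetical; the tail probability is computed with scipy), the test statistic and its p-value could be obtained as follows:

from scipy.stats import chi2

# Hypothetical -2 log likelihood values from two BILOG-MG runs:
neg2ll_single_group = 41250.8   # all groups pooled (null model)
neg2ll_dif_model    = 41172.3   # separate b_j per group (alternative)

n_items, n_groups = 40, 2
df = (n_items - 1) * (n_groups - 1)

lr_chi2 = neg2ll_single_group - neg2ll_dif_model
p_value = chi2.sf(lr_chi2, df)
print(f"chi2({df}) = {lr_chi2:.1f}, p = {p_value:.4f}")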
As defined by Bock, Muraki & Pfeiffenberger (1988), DRIFT is a form of DIF in which item difficulty interacts with the time of testing. It can be expected to occur in educational tests when the same items appear in forms over a number of years and changes in the curriculum or instructional emphasis interact differentially with the item content (see Goldstein, 1983). Bock, Muraki & Pfeiffenberger found numerous examples of DRIFT among the items of a form of the College Board's Advanced Placement Test in Physics that had been administered annually over a ten-year period (see Figure 7.4). DRIFT is similar to DIF in admitting only the item interaction: changes in the means of the latent distributions of successive cohorts are attributed to changes in the levels of proficiency of the corresponding population cohorts.
Figure 7.4: Drift of the location parameters of two items from a College Board Advanced
Placement Examination in Physics
DRIFT differs from DIF in that the interaction of item location with time is assumed to be a continuous process that can be modeled by a straight line or a low-degree polynomial regression. Thus, in place of estimating contrasts between groups, we estimate the coefficients of the linear or polynomial function of time that describes the DRIFT in the $b_j$ parameters. The significance of the trends can be judged from the size of the estimated regression coefficients relative to their large-sample standard error estimates. The overall presence of DRIFT can be tested in a marginal likelihood ratio test similar to that for DIF.
As implemented in BILOG-MG, DRIFT analysis does not require all items to be included in
each test form. The DRIFT regression functions are estimated for whatever time points are avail-
able for each item. In most DRIFT applications, it is satisfactory to assume that the latent distri-
butions of the yearly cohorts are normal. The corresponding means and standard deviations esti-
mated in the DRIFT analysis describe differences in the proficiencies of the cohorts.
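The following Python sketch (hypothetical yearly difficulty estimates; ordinary least squares rather than the program's internal method) illustrates fitting a straight-line DRIFT regression and judging the slope against its large-sample standard error:

import numpy as np

# Hypothetical yearly difficulty estimates for one item over eight administrations.
years = np.arange(8, dtype=float)
b_hat = np.array([-0.52, -0.48, -0.41, -0.35, -0.33, -0.24, -0.20, -0.15])

# Ordinary least-squares fit of a straight-line DRIFT regression b(t) = g0 + g1*t.
X = np.column_stack([np.ones_like(years), years])
coef, residuals, *_ = np.linalg.lstsq(X, b_hat, rcond=None)
resid = b_hat - X @ coef
s2 = resid @ resid / (len(years) - 2)      # residual variance
cov = s2 * np.linalg.inv(X.T @ X)          # coefficient covariance matrix
slope, slope_se = coef[1], np.sqrt(cov[1, 1])
print(f"DRIFT slope = {slope:.3f}, SE = {slope_se:.3f}, z = {slope/slope_se:.2f}")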
Two-stage testing
Two-stage testing is a type of adaptive item presentation suitable for group administration. By
tailoring the difficulties of the test forms to the abilities of selected groups of examinees, it permits test length to be reduced by a third to a half without loss of measurement precision. The procedure employs some preliminary estimate of the examinees' abilities, possibly from a short first-stage test or other evidence of achievement, to classify the examinees into three
or four levels of ability. Second-stage test forms in which the item difficulties are optimally cho-
sen are administered to each level. Forms at adjacent levels are linked by common items so that
they can be calibrated on a scale extending from the lowest to the highest levels of ability. Simu-
lation studies have shown that two-stage testing with well placed second-stage tests is nearly as
efficient as fully adaptive computerized testing when the second-stage test has four levels (see
Lord, 1980).
The IRT calibration of the second-stage forms is essentially the same as the nonequivalent forms
equating described above, except that the latent distributions in the second-stage groups cannot
be considered normal. This application therefore requires estimation of the location, spread, and
shape of the empirical latent distribution for each group jointly with the estimation of item pa-
rameters. During the scoring phase of the analysis, these estimated latent distributions provide
for Bayes estimation of ability combining the information from the examinee’s first-stage classi-
fication with the information from the second-stage test. Alternatively, the examinees can be
scored by the maximum likelihood method, which does not make use of the first-stage informa-
tion. The BILOG-MG program is capable of performing these analyses for the test as a whole, or
separately for each second-stage subtest and its corresponding first-stage test. For an example of
an application of two-stage testing in mathematics assessment see Bock & Zimowski (1989).
An innovative application of the BILOG-MG program is the estimation, from matrix sampled
assessment data, of the latent distributions for schools or other groups of students. Certain matrix
sampling designs, such as those employed by the National Assessment of Educational Progress,
include in each booklet a number of short scales, consisting of eight or nine items, in several sub-
ject-matter areas. These scales have too few items to permit reliable estimation of the profi-
ciencies of individual examinees in each subject matter, but they do allow estimation of the latent
distribution of each proficiency at the group level if the number of respondents is sufficiently
large. There is a tradeoff between the number of items for each scale in each test booklet and the
number of respondents: the more items, the fewer respondents are needed for accurate estimation
of the group latent distribution.
If each booklet contains perhaps 48 items, the latent distributions for six content areas could be
estimated simultaneously. The results of the assessment could then be reported to the public in
terms of the means and standard deviations of the achievement levels of the schools or groups.
Alternatively, if achievement standards have been set in terms of IRT scale score levels, the per-
cent of students attaining or exceeding each level can be computed from the latent distribution
and reported. The latter form of reporting is often more easily understood than scale-dependent
statistics such as the mean and standard deviation. Because the BILOG-MG program allows
unlimited numbers of groups as well as unlimited numbers of items and respondents, it is well
suited to the estimation of latent distributions for this form of reporting. The shape of the latent
distributions may either be assumed normal or estimated empirically.
A response to a binary test item j is indicated in these expressions by the item score $x_j$, which takes the value 1 for a correct response and 0 for an incorrect response. Let θ denote the ability of the person, and let the probability of a correct response to item j be represented by
$$P(x_j = 1 \mid \theta) = P_j(\theta); \qquad P(x_j = 0 \mid \theta) = 1 - P_j(\theta).$$
In general, the response function also depends upon one or more parameters characteristic of the item, the values of which must be estimated. Under the normal ogive model,
$$P_j(\theta) = \frac{1}{\sqrt{2\pi}} \int_{-(\theta - b_j)/\sigma_j}^{\infty} e^{-t^2/2}\,dt,$$
where $\sigma_j = 1/a_j$ is called the item dispersion, $a_j$ is the item discriminating power, and $b_j$ is an item location parameter. The normal ogive model is conventionally represented as $\Phi_j(\theta)$.
At present, the response models most widely used in applied work are the logistic models for bi-
nary scored items. The most important of these models are:
the one-parameter logistic (1PL) model,
$$P_{(1)j}(\theta) = \frac{1}{1 + \exp[-a(\theta - b_j)]},$$
where $\exp(k) = e^k$, and the two-parameter logistic (2PL) model,
$$P_{(2)j}(\theta) = \frac{1}{1 + \exp[-a_j(\theta - b_j)]},$$
where $a_j$ is the item discriminating power and $b_j$ is an item location parameter as in the 1PL model. The quantity
$$z_j = a_j(\theta - b_j)$$
is referred to as a logistic deviate, or logit. The logit can also be written as $z_j = a_j\theta + c_j$, where $c_j = -a_j b_j$. In this form, $a_j$ is referred to as the item slope and $c_j$ as the item intercept (see Figure 7.5). In terms of the logit, the response function is
$$\Psi_j(\theta) = \frac{1}{1 + e^{-z_j}}.$$
If all $a_j$ are equal, the model reduces to the one-parameter logistic, or Rasch, model.
In the case of multiple-choice items, an examinee who does not know the correct alternative may succeed in responding correctly by randomly guessing. If the examinee's ability is θ, the probability that the examinee will not know the answer but will guess correctly (with probability $g_j$) is $g_j[1 - \Psi_j(\theta)]$. The probability that the examinee will respond correctly either by knowledge or by random guessing is therefore
$$P_{(3)j}(\theta) = g_j[1 - \Psi_j(\theta)] + \Psi_j(\theta) = g_j + (1 - g_j)\Psi_j(\theta).$$
In general, the value of $g_j$ will be greater than 1/A, where A is the number of response alternatives, by some amount that must be determined empirically along with the $a_j$ and $b_j$ or $c_j$ parameters. The parameter $g_j$ corresponds to the lower asymptote of the item response function, $P_{(3)j}(\theta)$. This interpretation of $g_j$, as well as that of the other item parameters, is shown in Figure 7.6.
The logistic item response models are closely related to the normal ogive model. To bring the logistic models into close agreement with the normal ogive model, the logistic deviate is multiplied by the factor D = 1.7. When D = 1.7 is used, the discrepancy between the normal response function and its logistic approximation is never greater than 0.01. When the logit incorporates this factor, as in $z_j = Da_j(\theta - b_j)$, the models are said to be in the normal metric.
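The close agreement claimed for D = 1.7 is easy to verify numerically; the following Python sketch compares the logistic function in the normal metric with the normal ogive over a grid of deviates:

import numpy as np
from scipy.stats import norm

z = np.linspace(-6, 6, 2001)
logistic = 1.0 / (1.0 + np.exp(-1.7 * z))   # logistic with D = 1.7
ogive = norm.cdf(z)                          # normal ogive

print(np.abs(logistic - ogive).max())        # about 0.0095, i.e. below 0.01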
BILOG-MG computes and uses classical item statistics as starting values for the iterative esti-
mation of the IRT parameters.
On the assumption that θ is distributed with zero mean and unit standard deviation in the popu-
lation of respondents, the normal ogive item parameters are related to the classical item statistics
as follows (see Lord & Novick, 1968, Sections 16.9 and 16.10).
If one assumes a bivariate normal distribution of the population over the item and criterion variables, Richardson (1936) and Tucker (1946) have shown that
$$\rho_j = \frac{a_j}{\sqrt{1 + a_j^2}}, \qquad 0 \le \rho_j \le 1,$$
where $\rho_j$ is the biserial correlation between ability and item j. In classical item analysis, $\rho_j$ is estimated by the item-test correlation (the correlation between the response to the item, scored 1 or 0, and the number-right score for the test).
We see from the equation above that an item with slope 1 (in the normal metric) has a reliability index equal to $1/\sqrt{2} = 0.707$. Items with slopes greater than 1 are more reliable (more discriminating measures of the trait represented by the test); those with slopes less than 1 but greater than zero are less reliable. Items with a negative slope are keyed in a direction opposite to that of the other items. The same relationships hold to a good approximation for the logistic parameters expressed in the normal metric.
Tucker (1946) expressed the classical item difficulty $P_j$ as a function of the item parameters $a_j$ and $b_j$:
$$P_j = \Phi\!\left(\frac{-a_j b_j}{\sqrt{1 + a_j^2}}\right);$$
that is, $P_j$ is the value of the standard normal distribution function at the point
$$\frac{-a_j b_j}{\sqrt{1 + a_j^2}} = -b_j\rho_j,$$
i.e., the area to the left of that point under the normal curve.
Inverting these relations gives
$$a_j = \frac{\rho_j}{\sqrt{1 - \rho_j^2}}$$
and
$$b_j = \frac{z_j}{\rho_j},$$
where $z_j$ is the normal deviate corresponding to the classical item difficulty.
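A minimal Python sketch of these starting-value relations (illustrative only; sign conventions depend on how the normal deviate is defined) is:

from math import sqrt
from scipy.stats import norm

def normal_ogive_start(p_value, biserial):
    """Starting values a_j, b_j from classical item statistics.

    Uses a_j = rho / sqrt(1 - rho^2) and b_j = z_j / rho, with z_j the
    normal deviate corresponding to the item difficulty, as in the
    relations above (sign conventions vary with the definition of z_j).
    """
    a = biserial / sqrt(1.0 - biserial**2)
    z = norm.ppf(p_value)
    b = z / biserial
    return a, b

print(normal_ogive_start(p_value=0.70, biserial=0.55))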
Because BILOG-MG employs maximum likelihood estimation when fitting the IRT model, large-sample statistical tests of alternative models are available, provided one model is nested within the other. Two models are called "nested" when the larger model is formed from the
smaller by the addition of terms and parameters. For example, the one-parameter logistic model
is nested within the two-parameter model, which is in turn nested within the three-parameter
model. Similarly, the single-group model is nested within the two-group model, and so on. The
smaller of the nested models is referred to as the “null” model and the larger as the “alternative”.
The statistical test of the alternative model vs. the null model is equivalent to a test of the hy-
pothesis that the additional parameters in the alternative are all zero and that no significant im-
provement in fit is obtained by including them.
At the end of the estimation cycles in the calibration phase, BILOG-MG prints the negative of twice the maximum marginal log likelihood (the value labeled –2 LOG LIKELIHOOD). If the program is run, with the same data, once with the null model and once with the alternative model, this value will always be larger for the former than for the latter. In large samples, the positive difference of these values is distributed as $\chi^2$ under the null hypothesis. Its number of degrees of freedom is equal
to the difference in the number of parameters in the null and alternative models. A model with
more parameters should be adopted only when this test statistic is clearly significant. Otherwise,
fitting of the additional parameters will needlessly reduce precision of estimation.
BILOG-MG also provides a large-sample test of the goodness-of-fit of individual test items. If the test is sufficiently long (20 or more items), the respondents in a sample of size N can be assigned with good accuracy to intervals on the θ-continuum on the basis of their estimated values of θ (for this purpose, we use the EAP estimate with whatever prior is assumed for item calibration; see the section on test and item information to follow). Then the number of those in each interval who respond correctly to item j can be tallied from their item scores.
Finally, a likelihood ratio $\chi^2$ test statistic can be used to compare the resulting frequencies of correct and incorrect responses in the intervals with those expected from the fitted model:
$$X_j^2 = 2\sum_{h=1}^{n_g}\left[ r_{hj}\log_e \frac{r_{hj}}{N_h P_j(\bar\theta_h)} + (N_h - r_{hj})\log_e \frac{N_h - r_{hj}}{N_h\bigl(1 - P_j(\bar\theta_h)\bigr)}\right],$$
where $n_g$ is the number of intervals, $r_{hj}$ is the observed frequency of correct responses to item j in interval h, $N_h$ is the number of respondents assigned to that interval, and $P_j(\bar\theta_h)$ is the value of the fitted response function for item j at $\bar\theta_h$, the average ability of respondents in interval h.
Because neither the MML nor the MAP method of fitting the response functions actually minimizes this $\chi^2$, the residuals are not under linear constraints and there is no loss of degrees of freedom due to the fitting of the item parameters. The number of degrees of freedom is therefore equal to the number of intervals remaining after neighboring intervals are collapsed, if necessary, to avoid expected values less than 2.
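The following Python sketch (with hypothetical interval tallies, and assuming small-expectation intervals have already been collapsed) computes this item-fit statistic and its tail probability:

import numpy as np
from scipy.stats import chi2

def item_fit_chi2(r, N, P):
    """Likelihood-ratio item-fit statistic from interval tallies.

    r: correct-response counts per interval; N: respondents per interval;
    P: fitted P_j(theta_h) per interval.
    """
    r, N, P = map(np.asarray, (r, N, P))
    w = N - r
    x2 = 2.0 * np.sum(r * np.log(r / (N * P)) + w * np.log(w / (N * (1.0 - P))))
    df = len(r)   # no reduction for fitted parameters (see text)
    return x2, df, chi2.sf(x2, df)

# Hypothetical tallies for one item over six ability intervals:
r = [12, 25, 44, 61, 78, 90]
N = [50, 60, 70, 80, 90, 95]
P = [0.22, 0.40, 0.60, 0.75, 0.86, 0.93]
print(item_fit_chi2(r, N, P))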
Psychological, sociological, educational and medical data often consist of responses classified in
two or more predefined categories. The extra information contained in multiple-category re-
sponse classifications helps offset the greater cost (compared to machine scored multiple-choice
items) of ratings based on human judgments. Provided the readers are able to assign the catego-
ries consistently, multiple-category scoring is more informative than binary scoring, because it
contains multiple thresholds of difficulty corresponding to the boundaries between the catego-
ries. By discriminating among the respondents at more than one level, multiple-category scoring
of an extended response has the same advantages as adaptive testing with several binary-scored
items at different levels of difficulty.
Readers familiar with IRT in the binary case will find the generalization to the multiple-category
case quite straightforward. (This section was contributed by Eiji Muraki.) The concept of a latent dimension on which response probability functions are defined carries over from the binary case without changes, and parameters of the
response functions must still be estimated; the estimated parameters are then used to estimate
scores for the respondents. The only new element is the more general form of the response func-
tions and the greater number of parameters per item. The similarity of the two cases is apparent
in the parallel structure of the BILOG-MG and PARSCALE programs: both programs have a
data input, item calibration and test-scoring phase. Both are designed for efficient use in large-
scale testing programs based on instruments with many items and possible multiple subtests or
scales.
The current version of PARSCALE handles data in which the responses to a number of items are
classified in a common set of ordered categories. This is perhaps the most common type of data.
In the context of attitude measurement, this type of item is often treated as a so-called “Likert”
scale, where the categories are arbitrarily assigned successive integer values (Likert, 1932). In
contrast, the IRT procedures estimate optimal empirical values to be assigned to the boundaries
between the categories. Since all of the items are rated in the same categories, the number of
boundaries to be estimated equals one less than the number of categories. The boundaries, item
locations, and respondent scores are all represented as points on the latent dimension of meas-
urement.
Another common type of data is where each item has its own specific number and definition of
categories. The number of boundaries to be estimated is therefore equal to the sum of one less
than the number of categories for each item. In this case, the item locations are absorbed in the
category boundaries of the items and are not separately estimated.
Alternatively, the instrument or test may consist of a mixture of common-category and specific-
category items. PARSCALE handles this case by assigning items to “blocks”, with categories
common within blocks and different between blocks. Each specific-category item is its own
block. In the case of binary items, i.e., items with only two categories, the categories are com-
mon by definition and all belong to the same block. An educational test, for example, may con-
tain open-ended exercises scored in five or six categories in one block, and multiple-choice items
in another block. The presence of multiple-choice items introduces the additional problem of
guessing effects (which is often absent in rated items). These effects are estimated using a three-
parameter model in the binary case.
A case not handled by the current version is that of nominal categories. These categories each
represent a qualitatively distinct type of response to the stimulus and have no predefined ordinal
relationship. A common use of the nominal model is to extract all information in responses to all
alternatives of a multiple-choice item, beyond just the contrast of correct and incorrect alterna-
tives. At the present time, only MULTILOG (Thissen, 1991) handles both ordinal and nominal
category item response data. But MULTILOG does not allow for Likert-type items with a com-
mon set of response categories.
The response models in PARSCALE are derived from normal and logistic models in the binary
case. In this section Samejima’s Graded Response Model and Masters’ Partial Credit Model are
discussed. The scoring function of the Generalized Partial Credit Model, the Rater's-Effect
model, the DIF model, and the Trend model for dichotomous item response models (see Bock,
Muraki & Pfeiffenberger, 1988) are then reviewed.
If we define $P^+_{jk}(\theta)$ and $P^+_{j,k+1}(\theta)$ as the regressions of the binary item scores obtained when all response categories below k and k + 1, respectively, are scored 0 and the remaining categories are scored 1 for each item j, the operating characteristic (Samejima, 1972) of the graded item scoring for the latent trait variable θ is given by
$$P^+_{j0}(\theta) = 1, \qquad P^+_{j,m+1}(\theta) = 0,$$
and, in general,
$$P_{jk}(\theta) = P^+_{jk}(\theta) - P^+_{j,k+1}(\theta).$$
For the normal ogive model (Samejima, 1974), the formula for $P^+_{jk}(\theta)$ in the general case is given by
$$P^+_{jk}(\theta) = \int_{-\infty}^{a_j(\theta - b_{jk})} \phi(t)\,dt, \qquad b_{j1} \le b_{j2} \le \cdots \le b_{jm}.$$
An extension of Samejima's graded item response model suitable for Likert items is
$$P^+_{jk}(\theta) = \int_{-\infty}^{a_j(\theta - b_j + c_k)} \phi(t)\,dt,$$
where $b_j$ is now the item-location parameter and $c_k$ the category parameter. We refer to this extension as the "rating-scale" model. Similarly,
$$P^+_{j,k+1}(\theta) = \int_{-\infty}^{a_j(\theta - b_j + c_{k+1})} \phi(t)\,dt.$$
From these results, we obtain the response function of a graded category under the normal ogive model:
$$P_{jk}(\theta) = \int_{a_j(\theta - b_j + c_{k+1})}^{a_j(\theta - b_j + c_k)} \phi(t)\,dt.$$
The corresponding logistic model is
$$P_{jk}(\theta) = \frac{\exp[Da_j(\theta - b_j + c_k)]}{1 + \exp[Da_j(\theta - b_j + c_k)]} - \frac{\exp[Da_j(\theta - b_j + c_{k+1})]}{1 + \exp[Da_j(\theta - b_j + c_{k+1})]} = \frac{1}{1 + \exp[-Da_j(\theta - b_j + c_k)]} - \frac{1}{1 + \exp[-Da_j(\theta - b_j + c_{k+1})]},$$
where D = 1.7. Both models in these equations are response functions for items scored in successive categories. A major distinction between the two models is that in the rating-scale model, Samejima's parameter $b_{jk}$ is resolved into an item location parameter $b_j$ and a category parameter $c_k$.
If each item has its own response categories, which may differ in number $m_j$ (j = 1, 2, …, n), the graded response model is required. Figures 7.7, 7.8 and 7.9 illustrate the meaning of the parameters $a_j$, $b_j$ and $c_k$. All examples are logistic rating-scale models with four categorical responses; the model therefore contains three category parameters $c_k$.
In Figure 7.7, the parameter values are set as $a_j = 1.0$, $b_j = 0.0$, $c_1 = 2.0$, $c_2 = 0.0$, and $c_3 = -2.0$. The item category trace lines are drawn from left to right in the order $P_{j0}$, $P_{j1}$, $P_{j2}$, and $P_{j3}$. Since the distances between adjacent category thresholds are equal and the location parameter is zero, the trace lines are symmetric with respect to θ = 0.
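The following Python sketch (not PARSCALE code) evaluates the logistic rating-scale model of Figure 7.7 by differencing the cumulative curves, as in the equations above:

import numpy as np

def rating_scale_graded_probs(theta, a, b, c, D=1.7):
    """Category probabilities for the logistic rating-scale graded model.

    c holds the category parameters c_1 >= ... >= c_m; the cumulative
    probabilities use these boundaries, with P+_0 = 1 and P+_{m+1} = 0.
    """
    theta = np.atleast_1d(theta).astype(float)
    cum = [np.ones_like(theta)]                     # P+_0 = 1
    for ck in c:
        cum.append(1.0 / (1.0 + np.exp(-D * a * (theta - b + ck))))
    cum.append(np.zeros_like(theta))                # P+_{m+1} = 0
    cum = np.array(cum)
    return cum[:-1] - cum[1:]                       # P_jk = P+_k - P+_{k+1}

# Figure 7.7 parameters: a = 1.0, b = 0.0, c = (2.0, 0.0, -2.0).
probs = rating_scale_graded_probs(theta=0.0, a=1.0, b=0.0, c=(2.0, 0.0, -2.0))
print(probs.ravel(), probs.sum())   # four category probabilities summing to 1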
If the slope $a_j$ increases by 0.5 (to $a_j = 1.5$) and the location is changed from $b_j = 0$ to $b_j = 0.5$, then all four trace lines become steeper and are shifted to the right on the θ scale, as shown in Figure 7.8. These parameters behave just as they do in the dichotomous item response model.
If the distance between $c_2$ and $c_3$ becomes narrower by 0.5 ($c_2 = -0.5$), as shown in Figure 7.9, the trace lines of $P_{j1}$ and $P_{j2}$ are shifted to the right. In other words, these two categories become more difficult to endorse or attain. However, the trace lines of $P_{j0}$ and $P_{j3}$ stay the same, since these probabilities do not involve $c_2$. Because the trace lines of the extreme categories, $P_{j0}$ and $P_{j3}$, are essentially cumulative probability functions, their slopes change only if the slope parameter $a_j$ is altered. The slope of a middle category, however, is also affected by the distance between adjacent category thresholds. Therefore, the trace line of $P_{j2}(\theta)$ is not only shifted, but also becomes flatter.
In the case of dichotomous item response models, the slope parameter is synonymous with the discriminating power. For polytomous item response models, however, the discriminating power of a specific categorical response depends on the width between adjacent category thresholds as well as on the slope parameter. Because of this property, the slope parameter and all $m_j$ category parameters cannot be estimated simultaneously. If the model includes a slope parameter for each item j, the location of the category parameters must be fixed; the natural choice is to fix the mean of the category parameters $c_1, \ldots, c_m$. The program provides the keyword CADJUST on the BLOCK command to set this mean (the default is 0.0). The option NOCADJUST causes the program to omit the adjustment during the calibration or scoring phase.
The scale and location indeterminacy can be seen by rewriting the deviate as
$$a_j(\theta - b_j + c_k) = \frac{a_j}{s}\left[s\theta - (sb_j - t) + (sc_k - t)\right],$$
where s is a scaling factor and t is a location constant. This equation shows that shifting the center of the category metric results in a shift of $b_j$ in the same direction by the same units. If the intervals of the category scale are expanded by the factor s and the scale of θ is held constant, the $b_j$ will expand and the $a_j$ will contract by the same factor. Therefore, if the assumption that
more than two subsets of items measure the same ability is met and their ability distributions are
constrained to have the same mean and standard deviation, both the scale and location parame-
ters are determinate and estimable.
Masters (1982) reformulated Andrich’s polytomous rating response model by utilizing the Rasch
dichotomous model, which does not contain a discriminating power parameter. It is quite legiti-
mate, however, to formulate the general model based on the two-parameter logistic response
model, following the same operating characteristic that Masters employs. Since the essential
mechanism for constructing the general model is shared with Masters’ partial credit model and
Andrich’s rating-scale model, the models constructed in this text can simply be called the gener-
alized partial credit model.
The generalized partial credit model is formulated based on the assumption that each probability
of choosing the k-th category over the (k–1)-th category is governed by the dichotomous re-
sponse model. To develop the partial credit model, let us denote $P_{jk}$ as the specific probability of choosing the k-th category from the $m_j + 1$ possible categories of item j. In the dichotomous model ($m_j + 1 = 2$), $P_{j0}(\theta) + P_{j1}(\theta) = 1$. The conditional probability of choosing category 1, given that the response is in category 0 or 1, is then
$$P_{j1|0,1} = \frac{P_{j1}(\theta)}{P_{j0}(\theta) + P_{j1}(\theta)} = P_{j1}(\theta) = \frac{\exp[a_j(\theta - b_{j1})]}{1 + \exp[a_j(\theta - b_{j1})]}.$$
Therefore,
$$P_{j0|0,1}(\theta) = 1 - P_{j1}(\theta) = \frac{1}{1 + \exp[a_j(\theta - b_{j1})]}.$$
In the general polytomous case,
$$P_{j0}(\theta) + P_{j1}(\theta) + \cdots + P_{jm_j}(\theta) = \sum_{k=0}^{m_j} P_{jk}(\theta) = 1.$$
For each of the adjacent categories, the probability of the specific categorical response k over k − 1 is given by the above conditional probability:
$$C_{jk} = P_{jk|k-1,k}(\theta) = \frac{P_{jk}(\theta)}{P_{j,k-1}(\theta) + P_{jk}(\theta)} = \frac{\exp[a_j(\theta - b_{jk})]}{1 + \exp[a_j(\theta - b_{jk})]},$$
where k = 1, 2, …, $m_j$. Then,
$$P_{jk}(\theta) = \frac{C_{jk}}{1 - C_{jk}}\,P_{j,k-1}(\theta),$$
where
$$\frac{C_{jk}}{1 - C_{jk}} = \frac{P_{jk|k-1,k}(\theta)}{1 - P_{jk|k-1,k}(\theta)} = \frac{P_{jk|k-1,k}(\theta)}{P_{j,k-1|k-1,k}(\theta)} = \exp[a_j(\theta - b_{jk})].$$
This equation may be called the operating characteristic for the partial credit model. If we start
by determining
$$P_{j0}(\theta) = \frac{1}{G}, \qquad P_{j1}(\theta) = \frac{\exp[a_j(\theta - b_{j1})]}{G}, \qquad P_{j2}(\theta) = \frac{\exp[a_j(\theta - b_{j1}) + a_j(\theta - b_{j2})]}{G}, \qquad \ldots,$$
$$P_{jg}(\theta) = \frac{\exp\left[\sum_{v=1}^{g} a_j(\theta - b_{jv})\right]}{G}, \qquad \ldots, \qquad P_{jm_j}(\theta) = \frac{\exp\left[\sum_{v=1}^{m_j} a_j(\theta - b_{jv})\right]}{G},$$
then, since $\sum_k P_{jk}(\theta) = 1$, the normalizing constant is
$$G = 1 + \sum_{c=1}^{m_j} \exp\left[\sum_{v=1}^{c} a_j(\theta - b_{jv})\right],$$
and it follows that
$$P_{jk}(\theta) = \frac{\exp\left[\sum_{v=1}^{k} a_j(\theta - b_{jv})\right]}{1 + \sum_{c=1}^{m_j} \exp\left[\sum_{v=1}^{c} a_j(\theta - b_{jv})\right]} = \frac{\exp\left[\sum_{v=0}^{k} a_j(\theta - b_{jv})\right]}{\sum_{c=0}^{m_j} \exp\left[\sum_{v=0}^{c} a_j(\theta - b_{jv})\right]},$$
where $b_{j0} \equiv 0$. The partial credit model reduces to the dichotomous item response model when $m_j = 1$ and k = 0, 1. Note that $b_{j0}$ is arbitrarily defined as 0.0. This quantity is not a location constant and could be any value, because the term containing this parameter cancels from both numerator and denominator:
$$P_{jk}(\theta) = \frac{\exp\left[\sum_{v=0}^{k} z_{jv}(\theta)\right]}{\sum_{c=0}^{m_j}\exp\left[\sum_{v=0}^{c} z_{jv}(\theta)\right]} = \frac{\exp[z_{j0}(\theta)]\,\exp\left[\sum_{v=1}^{k} z_{jv}(\theta)\right]}{\exp[z_{j0}(\theta)] + \sum_{c=1}^{m_j}\exp\left[z_{j0}(\theta) + \sum_{v=1}^{c} z_{jv}(\theta)\right]} = \frac{\exp\left[\sum_{v=1}^{k} z_{jv}(\theta)\right]}{1 + \sum_{c=1}^{m_j}\exp\left[\sum_{v=1}^{c} z_{jv}(\theta)\right]},$$
where $z_{jk}(\theta) = a_j(\theta - b_{jk})$. Masters (1980) calls the quantity $b_{jk}$ in this equation an "item step" parameter. It is the intersection point of $P_{jk}(\theta)$ and $P_{j,k+1}(\theta)$ expressed as the operating characteristic. Thus, assuming $a_j > 0$:
if $\theta = b_{jk}$, then $P_{jk}(\theta) = P_{j,k+1}(\theta)$;
if $\theta > b_{jk}$, then $P_{jk}(\theta) < P_{j,k+1}(\theta)$;
if $\theta < b_{jk}$, then $P_{jk}(\theta) > P_{j,k+1}(\theta)$.
It should be noted that $b_{jk}$ is not sequentially ordered within item j, because it represents the relative magnitude of the adjacent probabilities $P_{jk}(\theta)$ and $P_{j,k+1}(\theta)$. Furthermore, when all probabilities $P_{jk}(\theta)$ are equal, the values of $b_{jk}$ also become identical.
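A direct Python sketch of the partial credit probabilities, computed from the formula above with $b_{j0} \equiv 0$ (illustrative only; the step values are those of Figure 7.10), is:

import numpy as np

def gpcm_probs(theta, a, b_steps):
    """(Generalized) partial credit model category probabilities.

    b_steps: item step parameters (b_j1, ..., b_jm); b_j0 is taken as 0.
    Returns probabilities for categories 0..m at the given theta.
    """
    b = np.concatenate([[0.0], np.asarray(b_steps, dtype=float)])
    z = a * (theta - b)                 # z_jv for v = 0..m (z_j0 cancels)
    cum = np.cumsum(z)                  # sums of z_jv from v = 0 to k
    num = np.exp(cum - cum.max())       # subtract max for numerical stability
    return num / num.sum()

# Figure 7.10 parameters: a = 1.0, steps (-2.0, 0.0, 2.0). At theta = 0 = b_j2,
# the probabilities of categories 1 and 2 are equal, as the text describes.
print(gpcm_probs(theta=0.0, a=1.0, b_steps=(-2.0, 0.0, 2.0)))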
While the item-category threshold parameter $b_{jk}$ in the graded response model determines the steepest point of the trace line, the $b_{jk}$ parameter in the partial credit model is the intersection point of $P_{jk}(\theta)$ and $P_{j,k+1}(\theta)$; these lines intersect only once on the θ scale. Figure 7.10 is the graph of the partial credit model with $a_j = 1.0$, $b_{j1} = -2.0$, $b_{j2} = 0.0$, and $b_{j3} = 2.0$. If $b_{j1}$ and $b_{j2}$ are brought closer together by changing $b_{j1}$ from -2.0 to -0.5, the probability of completing only the first step decreases, as illustrated in Figure 7.11. If the slope parameter is changed from 1.0 to 0.7, as shown in Figure 7.12, the intersection points of all trace lines are left unchanged and the curves become flatter.
When the second step is made easier than the first ($b_{j1} > b_{j2}$), the trace line of $P_{j1}$ drops and every person becomes less likely to complete only the first step. This is illustrated in Figure 7.13, which shows the partial credit model with three categorical responses. If we add another category ($b_{j3} = 2.0$) to this model, the trace lines become more complicated. However, the interpretation remains clear. The transition or step from $P_{j,k-1}(\theta)$ to $P_{jk}(\theta)$ is governed by the item step parameter $b_{jk}$. Since the magnitudes of the $b_{jk}$ are ordered as $b_{j3}$ (= 2.0), $b_{j1}$ (= 0.5), and $b_{j2}$ (= 0.0), the step from $P_{j2}(\theta)$ to $P_{j3}(\theta)$ is the hardest, followed by the step from $P_{j0}(\theta)$ to $P_{j1}(\theta)$. The easiest step is the transition from $P_{j1}(\theta)$ to $P_{j2}(\theta)$. Consequently, the respondent becomes more likely to complete the first category, but less likely to complete the second category. Therefore, as shown in Figure 7.14, the probability of the second categorical response, $P_{j2}(\theta)$, appears dominant. If all item step parameters have the same value, all trace lines intersect at the same point. Even though the values of the item step parameters are not sequentially ordered, the partial credit model expresses the probabilities of ordered responses: the subsequent steps can be completed only after the former ones have been successfully completed. In other words, the locations of the trace lines can never be interchanged, only their intersection points.
The Likert version of the partial credit model is a simple extension of the foregoing results, namely,
$$P_{jk}(\theta) = \frac{\exp\left[\sum_{v=0}^{k} a_j(\theta - b_j + c_v)\right]}{\sum_{c=0}^{m_j} \exp\left[\sum_{v=0}^{c} a_j(\theta - b_j + c_v)\right]},$$
where the v = 0 term is defined to be zero (that is, $a_j(\theta - b_j + c_0) \equiv 0$), and the parameter $b_{jk}$ is resolved into two parameters, $b_j$ and $c_k$ ($b_{jk} = b_j - c_k$). Andrich (1978) first introduced this separation of the item location and the category boundary parameters.
In the graded response model, the probability of responding in category k to a specific item is obtained as the difference between the probabilities of responding in category k or higher and in category k + 1 or higher. Since the probability of the categorical response is determined by the distance between the boundaries of the category, the order of the boundaries is fixed by the order of the categories.
In the partial credit model, the probability of responding in category k to a specific item is expressed by the conditional probability of responding in category k, given that the response is in category k − 1 or k. The model is constructed by recursively applying a dichotomous model to the probability of choosing category k over the adjacent category k − 1 for each pair of binary categories. Therefore, the probability of a specific categorical response is determined by the number of upper boundaries the person has passed and the combination of their unique parameter values. The values of the item-category parameters, $b_{jk}$, are not necessarily in successive order on the scale, as those of the graded response model are. Since the item-category parameters are not necessarily ordered within item j, the category parameters $c_k$ may not be sequentially ordered for k = 1, 2, …, m. The parameter $c_k$ is interpreted as the relative difficulty of step k in comparison with the other steps within an item.
The exponent of the rating-scale version of the model can be written as
$$z^+_{jk}(\theta) = Da_j\left[k(\theta - b_j) + \sum_{v=0}^{k} c_v\right]$$
or
$$z^+_{jk}(\theta) = Da_j\left[T_k(\theta - b_j) + K_k\right].$$
Andrich (1978) calls $T_k$ and $K_k$ the scoring function and the category coefficient, respectively. For the partial credit model, the scoring function $T_k$ is a linear integer scoring function, that is, $T = (1, 2, 3, \ldots, m_j + 1)$, where $m_j + 1$ is the number of categories of item j. The log-odds of responding in category k rather than k − 1 is then
$$\lambda_{j,k|k-1} = Da_j\left[(T_k - T_{k-1})(\theta - b_j) + c_k\right].$$
This shows that the log-odds is a monotonically increasing function of the latent trait only when incremental scoring is used for successive categorical responses: the higher the latent trait value of a subject, the more likely he or she is to respond in the upper categories. In other words, the partial credit model becomes a model for ordered categorical responses only when the scoring function is increasing, that is, $T_k > T_{k-1}$ for all k, and $a_j > 0$.
Figure 7.15 shows the partial credit model with four categorical responses, where $a_j = 1$, $b_j = 0$, c = (0.0, 2.0, -1.0, -2.0), and T = (1, 2, 3, 4). The trace lines of these $P_{jk}(\theta)$ do not change if we use T = (0, 1, 2, 3) or even T = (-3, -2, -1, 0), because the increment of both scoring functions is identical, that is, $T_k - T_{k-1} = 1$. However, if we multiply the scoring by 2, that is, T′ = (2, 4, 6, 8), the trace lines become steeper and their intersection points are -1.0, 0.5, and 1.0, respectively, because
$$z^+_{jk}(\theta) = Da_j\left[T'_k(\theta - b_j) + K_k\right] = 2Da_j\left[T_k(\theta - b_j) + \frac{K_k}{2}\right].$$
The scoring function provides a convenient notation for collapsing or recoding categorical responses. For example, if the number of categorical responses of an item is five, a scoring function T can be specified as T = (1, 2, 3, 4, 5). If the original response categories are collapsed by combining the first and second categories into one category, the scoring function T′ can be written as T′ = (1, 1, 2, 3, 4). If these modified response categories are recoded by treating the original fourth category as the fifth and the original fifth as the fourth, the scoring function can be further modified as T″ = (1, 1, 2, 4, 3).
The generalized partial credit model can be expressed as a form of the nominal response model (Bock, 1972):
$$P_{jk}(\theta) = \frac{\exp\left\{Da_j\left[T_k(\theta - b_j) + K_k\right]\right\}}{\sum_{c=0}^{m_j}\exp\left\{Da_j\left[T_c(\theta - b_j) + K_c\right]\right\}}.$$
The nominal response model is the model in which the scoring function is constant over the response categories, that is, $T_k = T$ for all k, and the discriminating power varies for each categorical response. Alternatively, it can be said that the nominal response model is the model whose scoring functions are unknown and treated as parameters to be estimated. If a common slope parameter is used for all categorical responses, the trace lines become horizontal straight lines, since they are then independent of θ. Therefore, varied discriminating powers among the categorical responses are an essential feature of the nominal response model. Since the estimates of the discriminating powers or slope parameters contain an indeterminacy, a constraint, such as making those parameters sum to zero, is commonly used (see Thissen & Steinberg, 1986).
We have observed that the scoring function determines the orderliness of the categorical responses. Thus, if we assign identical scorings to two response categories, we can construct a partial credit model (PCM) with partially unordered categorical responses. This model can be called a partially unordered partial credit model (PUPCM). The basic difference between the PCM for collapsed categories and the PUPCM is that in the PUPCM the item-category parameters are estimated for each of the original categories.
If the scorings $T_k$ and $T_{k'}$ are identical, the log-odds of these categorical responses are independent of the latent trait θ. The log-odds become a function of the difference of the category coefficients, that is,
$$\lambda_{j,k|k'} = Da_j\left[K_k - K_{k'}\right].$$
These odds are constant along the θ scale, and the trace lines never intersect.
Figure 7.16 shows the partial credit model with four categorical responses. The parameter values are the same as in the previous example, but the scoring T = (1, 2, 2, 3) is used instead. In other words, we impose the assumption that the second and third categories have no inherent ordering. Since $K_2 = 2.0$ is larger than $K_3 = 1.0$, $P_{j2}(\theta)$ is always higher than $P_{j3}(\theta)$. The positions of these two trace lines would be reversed if $K_2 < K_3$. Notice again that imposing the same scoring on the categories does not mean collapsing those categories; we merely drop the assumption of an ordering among them, in other words, we "nominalize" the categories.
Muraki (1993) proposed several variations of polytomous item response models for multiple-group settings: the Rater's-Effect (RE) model, the DIF model, and the Trend model. The DIF and Trend models for dichotomous item response models were also discussed by Bock, Muraki & Pfeiffenberger (1988).
The model for differential item functioning (DIF) contains the following deviate $Z_{gjk}(\theta)$:
$$Z_{gjk}(\theta) = Da_j(\theta - b_j - d_{gj} + c_{jk}),$$
where $d_{gj}$ is a DIF (or item location deviate) parameter for group g and item j.
In a similar manner, the deviate for the Rater's-Effect (RE) model is expressed as
$$Z_{gjk}(\theta) = Da_j(\theta - b_j - d_g + c_{jk}),$$
where $d_g$ is a rater effect (or group) parameter for rater or rater group g.
Notice that the group parameter $d_{gj}$ for the DIF model is nested within each item. For the DIF model, it is assumed that only the item location parameters differ among groups, while the slope and category parameters are common to the groups (this restriction can be relaxed in the program). The subgroup identification may be gender, year, or some other covariate. In the RE model, on the other hand, the group effect $d_g$ is crossed with the item effect. This model is generally referred to as a multifacet model (Linacre & Wright, 1993). The basic difference between the DIF model and the RE model is whether the group parameter is nested within, or crossed with, the item difficulty facet.
For the DIF model, a separate prior distribution is used for each group, and the prior distribution is updated after each estimation cycle based on the posterior distribution from the previous cycle. For the RE model, a single prior distribution is used for the responses rated by multiple groups of raters. The prior distribution may or may not be updated after each estimation cycle.
For the DIF model, it is assumed that different groups have different distributions, with mean $\mu_g$ and standard deviation $\sigma_g$. The distributions are not necessarily normal. These empirical posterior distributions are estimated simultaneously with the item parameters. To obtain these estimates, we impose the following constraint for the DIF model:
$$\sum_{j=1}^{J} d_{Rj} = \sum_{j=1}^{J} d_{Fj}.$$
This constraint implies that the overall difficulty levels of a test, or of a set of common items, are the same for the reference group and the focal group (indicated by subscripts R and F, respectively). Therefore, the item difficulty parameters for the focal groups are adjusted, and any overall difference in test difficulty is attributed to a difference in the ability levels of the subgroups. The ability-level differences among the groups can then be estimated from the posterior distributions.
The constraint imposed on the group parameters for the RE model is
$$\sum_{g=1}^{G} d_g = 0.$$
The weight coefficient $w_g$ for group g is used only for the RE model. For the case of multiple raters, it may be reasonable to assume that not all raters are equally reliable. If a reliability index for each rater is computed by some method, the index can be used for the coefficients.
The goodness-of-fit of the polytomous item response model can be tested item by item. Summa-
tion of the item fit can also be used for the goodness-of-fit for the test as a whole. If a test is suf-
ficiently long, the method used in BILOG-MG (Mislevy and Bock, 1990) can be used with slight
modifications.
In the method of Mislevy & Bock, the respondents in a sample of size N are assigned to H intervals of the θ-continuum. The expected a posteriori (EAP) estimator is used for each respondent's proficiency score. The EAP estimate is the mean of the posterior distribution of θ, given the observed response pattern $x_l$ (Bock & Mislevy, 1982). The EAP score of the response pattern $x_l$ is approximated by the quadrature points $X_f$ and the weights $A(X_f)$, that is,
$$\bar\theta_l = \frac{\sum_{f=1}^{F} X_f L_l(X_f) A(X_f)}{\sum_{f=1}^{F} L_l(X_f) A(X_f)},$$
where $L_l(X_f)$ is the probability of observing the particular response pattern $x_l$. The posterior standard deviation (PSD) of the EAP score is approximated by
$$\mathrm{PSD}(\bar\theta_l) = \sqrt{\frac{\sum_{f=1}^{F} (X_f - \bar\theta_l)^2 L_l(X_f) A(X_f)}{\sum_{f=1}^{F} L_l(X_f) A(X_f)}}.$$
After all respondents' EAP scores are assigned to one of the predetermined H intervals on the θ-continuum, the observed frequency of the k-th categorical response to item j in interval h, $r_{hjk}$, and the number of respondents assigned to item j in the h-th interval, $N_{hj}$, are computed. The estimated θs are rescaled so that the variance of the sample distribution equals that of the latent distribution on which the MML estimation of the item parameters is based, usually N(0,1) by default. Thus, we obtain an H by $m_j + 1$ contingency table for each item j. For each interval, we compute the interval mean, $\bar\theta_h$, and the value of the fitted response function, $P_{jk}(\bar\theta_h)$. Finally, a likelihood-ratio $\chi^2$ statistic for each item is computed by
$$G_j^2 = 2\sum_{h=1}^{H_j}\sum_{k=0}^{m_j} r_{hjk}\,\ln\frac{r_{hjk}}{N_{hj}\,P_{jk}(\bar\theta_h)},$$
where $H_j$ is the number of intervals left after neighboring intervals are merged, if necessary, to avoid expected values, $N_{hj}P_{jk}(\bar\theta_h)$, less than 5. The degrees of freedom are equal to the number of intervals, $H_j$, multiplied by $m_j$. The likelihood-ratio $\chi^2$ test statistic for the test as a whole is simply the sum of the separate $\chi^2$ statistics, and its number of degrees of freedom is the sum of the degrees of freedom for the individual items. These fit statistics are useful in evaluating the fit of models to the same response data when the models are nested in their parameters.
The point biserial correlation $r_{PB,j}$ for item j is a computationally simplified Pearson's r between the dichotomously scored item j and the total score x. It is computed as
$$r_{PB,j} = \frac{\mu_j - \mu_x}{\sigma_x}\sqrt{\frac{p_j}{q_j}},$$
where $\mu_j$ is the mean total score among examinees who have responded correctly to item j, $\mu_x$ is the mean total score for all examinees, $p_j$ is the item difficulty index for item j, $q_j = 1 - p_j$, and $\sigma_x$ is the standard deviation of the total score for all examinees.
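A minimal Python sketch of this computation for a 0/1 response matrix (the data are simulated for illustration) is:

import numpy as np

def point_biserial(responses, j):
    """Point biserial r_PB,j for item j of a 0/1 response matrix."""
    responses = np.asarray(responses)
    total = responses.sum(axis=1)               # number-right total scores
    correct = responses[:, j] == 1
    p = correct.mean()                          # item difficulty index p_j
    mu_j = total[correct].mean()                # mean total among correct
    mu_x, sigma_x = total.mean(), total.std()
    return (mu_j - mu_x) / sigma_x * np.sqrt(p / (1.0 - p))

rng = np.random.default_rng(0)
data = (rng.random((200, 10)) < 0.6).astype(int)   # hypothetical responses
print(point_biserial(data, j=0))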
The biserial correlation coefficient estimates the relationship between the total score and a hypothetical normally distributed score on a continuous scale underlying the dichotomous item. The biserial correlation between an item and the total score can be estimated from the p-value and the point biserial correlation of the item:
$$r_{B,j} = r_{PB,j}\,\frac{\sqrt{p_j q_j}}{h(z_j)},$$
where $z_j$ is the z score that cuts off $p_j$ proportion of the cases for item j in the standard normal distribution, and $h(z_j)$ is the ordinate of the normal distribution at the point $z_j$.
Lord & Novick (1968) show that the slope and threshold parameters of the normal ogive model for the item are functions of the biserial correlation coefficient:
$$a_j = \frac{r_{B,j}}{\sqrt{1 - r_{B,j}^2}}$$
and
$$b_j = -\frac{z_j}{r_{B,j}} = \frac{-a_j z_j}{\sqrt{1 + a_j^2}}.$$
The point polyserial correlation, $r_{PP}$, is simply the Pearson correlation between equally spaced integers assigned to the successive categories (e.g., Likert scores). The relation between the point polyserial correlation and the polyserial correlation, $r_P$, is
$$r_{PP,j} = \frac{r_{P,j}}{\sigma_j}\sum_{k=0}^{m_j-1} h(z_{jk})\,(T_{j,k+1} - T_{jk}),$$
where $T_{jk}$ is the scoring function for item j and category k, $\sigma_j$ is the standard deviation of the item scores y for item j, and $z_{jk}$ is the z score corresponding to the cumulative proportion, $p_{jk}$, of the k-th response category of item j. If consecutive integers are used for scoring (that is, $T_{jk} = 0, 1, \ldots, m_j$), this relation becomes
$$r_{PP,j} = \frac{r_{P,j}\sum_{k=0}^{m_j-1} h(z_{jk})}{\sigma_j}$$
or
$$r_{P,j} = \frac{r_{PP,j}\,\sigma_j}{\sum_{k=0}^{m_j-1} h(z_{jk})}.$$
The polyserial correlation becomes the biserial correlation if the number of response categories
is two.
Olsson, Drasgow, & Dorans (1982) presented three estimation methods for the polyserial correlation coefficient: the maximum likelihood estimator, the two-step estimator, and the ad hoc estimator. The last is obtained by substituting sample statistics into the preceding equation: the sample product-moment correlation of the total score and the polytomous item score is the point-polyserial correlation $r_{PP,j}$, and $h(z_{jk})$ is the normal ordinate corresponding to the proportion $p_{jk}$ of examinees with item scores less than or equal to $T_{jk}$.
From the results of a simulation study, Olsson et al. (1982) concluded that the ad hoc estimator is sufficiently unbiased and accurate for applied research. Thus, we compute initial slope values using the ad hoc estimator
$$a_j = \frac{r_{P,j}}{\sqrt{1 - r_{P,j}^2}}.$$
This value applies to both the graded model and the generalized partial credit model.
To obtain the $m_j - 1$ initial category threshold parameters of the graded model, we compute the item category cumulative proportions from the numbers of examinees $n_{jk}$ responding in the successive categories of item j:
$$p_{jk} = \frac{\sum_{v=0}^{k} n_{jv}}{\sum_{v=0}^{m_j} n_{jv}}.$$
The corresponding deviates are obtained from the inverse normal distribution function, $z_{jk} = \Phi^{-1}(p_{jk})$, and the initial thresholds are
$$b_{jk} = \frac{-z_{jk}}{r_{P,j}} = \frac{-a_j z_{jk}}{\sqrt{1 + a_j^2}}.$$
For the partial credit model, the corresponding parameters are computed from the proportions of examinees in the higher of the two successive categories,
$$p_{jk} = \frac{n_{jk}}{n_{j,k-1} + n_{jk}}.$$
The quantities $z_{jk}$ and $b_{jk}$ are computed as above, but the latter must be adjusted to reflect the average ability of the examinees who respond in these two categories relative to all examinees in the sample. The adjusted value is
$$b'_{jk} = b_{jk} + \frac{m_{jk} - m_{jT}}{s_{jT}},$$
where $m_{jk}$ is the mean test score (computed from item scores 0, 1, …, m − 1) of examinees responding in categories k or k − 1, and $m_{jT}$ and $s_{jT}$ are the mean and standard deviation of the test scores of all examinees responding to item j.
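The following Python sketch (illustrative only; hypothetical category counts and ad hoc polyserial value) computes the initial slope and the graded-model thresholds from the equations above:

import numpy as np
from scipy.stats import norm

def graded_initial_thresholds(counts, r_polyserial):
    """Initial a_j and b_jk for the graded model from category counts n_j0..n_jm.

    Follows the text: cumulative proportions p_jk, deviates z_jk from the
    inverse normal, the ad hoc slope estimator, and b_jk = -z_jk / r.
    """
    counts = np.asarray(counts, dtype=float)
    a = r_polyserial / np.sqrt(1.0 - r_polyserial**2)   # ad hoc slope
    p = np.cumsum(counts)[:-1] / counts.sum()           # m_j - 1 cumulative proportions
    z = norm.ppf(p)                                     # inverse normal deviates
    b = -z / r_polyserial
    return a, b

# Hypothetical counts in four ordered categories and an ad hoc polyserial r:
print(graded_initial_thresholds([50, 120, 180, 150], r_polyserial=0.6))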
These initial values are printed for each item in the Phase 1 program output (see Chapter 11 for an example). Note, however, that in the results the category parameters of both models are expressed as deviations about their mean value and therefore sum to zero. The mean value itself is referred to as the item "location" and appears along with the item slope in the listing of items within blocks. For two-category (binary) items, the category parameters are equal in absolute value but opposite in sign, and the location parameter is just the threshold parameter of the normal ogive or 2PL model.
Multiple items that appear within the same block (and thus have the same number of categories) comprise a rating scale. Each item has a slope and a location parameter, the initial values of which are computed as above; but all items share the same category parameters, which are therefore a property of the block. The corresponding initial values are computed by first accumulating the category response frequencies over the n items in the block to obtain
$$p_k = \frac{\sum_{j=1}^{n}\sum_{v=0}^{k} n_{jv}}{\sum_{j=1}^{n}\sum_{v=0}^{m} n_{jv}}$$
or
$$p_k = \frac{\sum_{j=1}^{n} n_{jk}}{\sum_{j=1}^{n}\left(n_{j,k-1} + n_{jk}\right)}$$
for the graded and partial credit models, respectively.
Then
zk = Φ −1 ( pk ),
− a j zk
bk = ,
1 + a j
2
where a j is the geometric mean of the slopes of the items within the bock (i.e., the n-th root of
the product of the n slopes). The use of the geometric mean is justified by the assumption that the
slopes of the items in the rating-scale domain are log-normally distributed.
For the partial credit rating-scale model, the quantities used in computing the category adjust-
ment formula shown above are accumulated over all items in the block; that is,
mk − mT
bk' = bk + ,
sT
for all items, and mT and sT are the mean and standard deviation of the test scores of all exami-
nees responding to all items in the block.
In this section we describe the item response (trace line) models available in MULTILOG. There
are two general models that can be fitted using MULTILOG: Samejima’s (1969) “graded” model
and Thissen and Steinberg’s (1984) multiple response model. Many other (seemingly) different
models are available, as constrained subsets of one or the other of these two. For an extended
discussion of the relationship among IRT models, see Thissen & Steinberg (1986).
In Samejima's (1969) graded model, the probability of a response in category k is

P(x = k) = \frac{1}{1 + \exp[-a(\theta - b_{k-1})]} - \frac{1}{1 + \exp[-a(\theta - b_k)]} = P^*(k) - P^*(k+1),
where a is the slope and bk is threshold(k). P* (k ) is the trace line describing the probability that
a response is in category k or higher, for each value of θ . For completeness of the model defini-
tion, we note that P*(1) = 1 and P*(m + 1) = 0. The value of b_{k-1} is the point on the θ-axis at
which the probability that the response is in category k or higher passes 50%. The properties of
the model are extensively described by Samejima (1969). In the MULTILOG output, the pa-
rameter a is labelled A, and bk is labelled B(k). This model is obtained by using the GR option on
the TEST command in MULTILOG.
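To make the trace lines concrete, the following Python sketch (an illustration under the
definitions above, not MULTILOG's internal code) evaluates the graded-model category
probabilities at a trait value:

    import numpy as np

    def graded_probs(theta, a, b):
        # Samejima's graded model: P(x = k) = P*(k) - P*(k+1), with
        # P*(1) = 1 and P*(m+1) = 0; b holds the ordered thresholds.
        pstar = [1.0]
        for bk in b:
            pstar.append(1.0 / (1.0 + np.exp(-a * (theta - bk))))
        pstar.append(0.0)
        return [pstar[k] - pstar[k + 1] for k in range(len(b) + 1)]

    # e.g., a 3-category item with slope 1.5 and thresholds -0.5 and 1.0:
    print(graded_probs(0.0, 1.5, [-0.5, 1.0]))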
When there are only two possible responses to each item, as for binary items on a test of profi-
ciency (correct/incorrect) or forced-choice items on a personality measure, the graded model is
equivalent to the 2PL model, which is usually written in the following simpler form:
P(x = 2) = \frac{1}{1 + \exp[-a(\theta - b)]}.
For compatibility with the graded model, the key is used to recode the “higher” of the two re-
sponses (correct, positive) to have the internal value “2” in MULTILOG. The lower (incorrect,
negative) response has the value “1.” The 2PL model has two parameters for each item, a (la-
belled A in the output), and b (labelled B(1)).
5 This section was contributed by David Thissen.
If the additional constraint that a j = a for all items j is imposed, the 2PL model is equivalent to
the 1PL model, sometimes referred to as the “Rasch model” (after Rasch, 1960). At times the
term “Rasch model” refers simply to the constraint that the slopes are equal for all items in the
test, in which case the model fitted here is the “Rasch model,” estimated with the marginal
maximum likelihood algorithm described by Thissen (1982); at other times, the term “Rasch
model” refers to both the
constraint and another method of parameter estimation (Conditional Maximum Likelihood),
which is not implemented in MULTILOG. The output is identical to that for the 2PL model, ex-
cept that the value of a (labelled A) is the same for all items.
Note that we do not include the scale factor 1.7 in the definition of either the 2PL or the 1PL
models. This model is obtained by using the L2 option on the TEST command in MULTILOG.
A modified version of Samejima's (1979) modification of Bock's (1972) nominal model, for re-
sponses x = 1, 2, …, m (or m + 1), is Thissen and Steinberg's (1984) multiple response model:

P(x = k) = \frac{h^* \exp[a_k \theta + c_k] + h\, d_k \exp[a_1 \theta + c_1]}{\sum_{i=1}^{m+1} \exp[a_i \theta + c_i]}.
The parameters a_k and c_k are not identified with respect to location; either TRIANGLE con-
trasts define the parameters that are estimated, in which case a_1 = c_1 = 0, or DEVIATION or
POLYNOMIAL contrasts among these parameters are estimated, in which case
\sum a_k = \sum c_k = 0.

The parameters represented by d_k are proportions, representing the proportion of those
who “don't know” who respond in each category on a multiple-choice item (see Thissen
& Steinberg, 1984). Therefore, the constraint that \sum d_k = 1 (where the sum is from k = 1 to
2 for binary data and from 2 to m + 1 for m > 2) is required. This is enforced by estimating
d_k^* such that

d_k = \frac{\exp d_k^*}{\sum_k \exp d_k^*}.
The parameters h* and h are used to provide several different models, and are calculated by
MULTILOG. The value of h* is always 1 for items with m > 2.
When m > 2 and h = 1, the Thissen and Steinberg model described in Section 7.4.4 becomes

P(x = k) = \frac{\exp[a_k \theta + c_k] + d_k \exp[a_1 \theta + c_1]}{\sum_{i=1}^{m+1} \exp[a_i \theta + c_i]},
which is the “Multiple-choice” model, as described by Thissen and Steinberg (1984). The data
should be keyed into categories 2, 3, … m + 1, because category 1 in the program is the “0” or
“Don’t Know” latent category (see Section 12.10). MULTILOG prints the values of the parame-
ters ak , ck , and d k , labelled A(K), C(K), and D(K), respectively. The values of the d k contrasts
may be fixed at zero using the MULTILOG command language. This produces Samejima’s
(1979) version of the model, in which the “guessing proportions” are equal to 1/m:

P(x = k) = \frac{\exp[a_k \theta + c_k] + \frac{1}{m} \exp[a_1 \theta + c_1]}{\sum_{i=1}^{m+1} \exp[a_i \theta + c_i]}.
For m = 2, with h* = 0 for category 1 (incorrect) and h* = 1 for category 2 (correct), this gives a
parameterization of the conventional 3PL model (Lord, 1980; see Thissen & Steinberg, 1986, for
a description of this conception of the 3PL), in which

P(x = 2) = \frac{\exp[a_2 \theta + c_2] + d_2 \exp[a_1 \theta + c_1]}{\sum_{i=1}^{2} \exp[a_i \theta + c_i]}
         = \frac{\exp[a_2 \theta + c_2] + d_2 \exp[a_1 \theta + c_1]}{\exp[a_1 \theta + c_1] + \exp[a_2 \theta + c_2]}
         = \frac{d_2 \exp[a_1 \theta + c_1]}{\exp[a_1 \theta + c_1] + \exp[a_2 \theta + c_2]} + \frac{\exp[a_2 \theta + c_2]}{\exp[a_1 \theta + c_1] + \exp[a_2 \theta + c_2]}.
The constraints described above require that a_1 = −a_2 and c_1 = −c_2. Thus, the model is:

P(x = 2) = \frac{d_2 \exp[-(a_2 \theta + c_2)]}{\exp[-(a_2 \theta + c_2)] + \exp[a_2 \theta + c_2]} + \frac{\exp[a_2 \theta + c_2]}{\exp[-(a_2 \theta + c_2)] + \exp[a_2 \theta + c_2]}
         = d_2 \left(1 - \frac{1}{1 + \exp[-2(a_2 \theta + c_2)]}\right) + \frac{1}{1 + \exp[-2(a_2 \theta + c_2)]}
         = d_2 + (1 - d_2) \frac{1}{1 + \exp[-2(a_2 \theta + c_2)]},
which is a fairly conventional form of the 3PL model. MULTILOG actually estimates the logit
of d 2 , and the contrasts between a1 and a2 , and c1 and c2 . These are printed in the output, as
well as the “traditional 3PL, normal metric” form of the parameters, labelled A, B, and C from the
model when written
P(x = 2) = C + (1 - C) \frac{1}{1 + \exp[-1.7 A(\theta - B)]}.
This model is obtained by using the L3 option on the TEST command in MULTILOG.
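Matching exponents in the last two displays gives a simple conversion from the contrast-metric
parameters to the traditional form: C = d_2, 1.7A = 2a_2, and B = −c_2/a_2. A minimal Python
sketch of this correspondence (the function is hypothetical, not a MULTILOG routine):

    def traditional_3pl(a2, c2, d2):
        # From -2(a2*theta + c2) = -1.7*A*(theta - B), with C = d2.
        A = 2.0 * a2 / 1.7
        B = -c2 / a2
        C = d2
        return A, B, C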
When h = 0, the model defined in Section 7.4.4 is equivalent to Bock's (1972) “nominal model”:

P(x = k) = \frac{\exp[a_k \theta + c_k]}{\sum_{i=1}^{m} \exp[a_i \theta + c_i]};
in this case, the data are keyed into categories 1, 2, …, m, and the parameters represented by d k
are not estimated. MULTILOG prints the values of the parameters ak and ck , labelled A(K) and
C(K) respectively. This model is obtained by using the NO option on the TEST command in
MULTILOG.
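The nominal model is simply a softmax over the m linear forms a_k θ + c_k. A short Python
sketch (illustrative only, not MULTILOG's implementation):

    import numpy as np

    def nominal_probs(theta, a, c):
        # Bock's nominal model: category probabilities as a softmax of
        # the linear forms a_k * theta + c_k.
        z = np.asarray(a) * theta + np.asarray(c)
        ez = np.exp(z - z.max())      # subtract the max for numerical stability
        return ez / ez.sum()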
7.4.8 Contrasts
MULTILOG estimates the contrasts between the as, cs, and d*s; the unconstrained (estimated)
parameters are the αs, γs, and δs [denoted AK, CK, and DK, respectively, in the syntax, and
CONTRAST(k) FOR A, CONTRAST(k) FOR C, and CONTRAST(k) FOR D in the MULTILOG out-
put], where, for example, a = Tα for a contrast matrix T.
The default form of the T matrices consists of deviation contrasts, as suggested by Bock (1972).
For varying numbers of response categories, those matrices are printed here, along with the al-
ternative polynomial and triangle contrasts.
DEVIATION T-matrices

2 Categories
–0.50  0.50

[Deviation T-matrices for 3 through 9 categories are tabled here in the original.]

POLYNOMIAL T-matrices

2 Categories
–0.50  0.50

[Polynomial T-matrices for 3 through 9 categories are tabled here in the original.]

TRIANGLE T-matrices
(When used for the vector c, the constraint \sum c = 0 is replaced with the constraint c_1 = 0.)

2 Categories
0.00  –1.00

[Triangle T-matrices for 3 through 9 categories are tabled here in the original.]
The TESTFACT program implements all the main procedures of classical item analysis, test
scoring, and factor analysis of inter-item tetrachoric correlations, and also modern methods of
item factor analysis based on item response theory (IRT). In addition, the program includes a fa-
cility for simulating responses to test items having difficulties and factor loadings specified by
the user. This section reviews the mathematical and statistical backgrounds of these procedures.
Classical item analysis aims at inferring the expected characteristics of responses of persons in
the population to whom the test will be administered. The data to be analyzed are assumed to
come from a sample of respondents from that population. The various sample indices—item dif-
ficulties based on item p-values, test reliability expressed as a ratio of true-score variances to
test-score variances, and test validity measured by the correlation of the test score with an exter-
nal criterion of performance—are all estimates of population statistics. Classical test theory em-
ploys such estimates to predict how the test will perform when administered to members of the
population in question.
Classical methods also give useful details about the discriminating power of the items. They pro-
vide various measures of the item-by-test score correlation, supplemented by tabulations of the
item response frequencies within fractiles of the test-score distribution.

6 This section was contributed by Robert Wood.
The TESTFACT program computes these statistics from the sample of item-response data and
displays them in tables and plots for ready interpretation. It also allows the user to specify subtests
of the main test and to analyze each subtest separately. Similarly, the user can assign respondents
to groups (by age, grade, or class, for example) and can analyze the groups independently.
In addition, TESTFACT provides a powerful data analytic tool in the form of item factor analy-
sis. As a preliminary to test construction, or in preparation for latent trait analysis with programs
such as BILOG (Mislevy & Bock, 1990), BILOG-MG (Zimowski, Muraki, Mislevy & Bock,
1996), or MULTILOG (Thissen, 1988), item factor analysis permits a more comprehensive and
detailed examination of item dimensionality than is currently available with any other procedure.
Because they are based on Thurstone’s multiple factor model, the results of the analysis—factor
loadings, orthogonal and oblique rotations, factor correlations, and factor scores—are familiar to
most users. Since the model is fitted by Bock & Aitkin’s (1981) marginal maximum likelihood
(MML) method, the analysis provides a rigorous test of the statistical significance of factors
added successively to the model.
Item factor analysis has other interesting uses beside those of test construction and exploration of
test dimensionality. The existence of more than one statistically significant factor implies a simi-
lar number of distinguishable profiles of individual differences in the population of respondents.
By calling attention to the common features of items that participate in these distinctions, item
factor analysis gives clues to the cognitive basis for the item responses. It serves as a “discovery
procedure” revealing often-unsuspected cognitive components of the test task. If the sample size
is large, factor analysis of item responses is often more productive than factor analysis of test
scores, because data for many distinct items is easier to obtain than data for a comparable num-
ber of tests.
The factor analysis procedure in TESTFACT makes exploration of any type of binary scored
characteristics dependable and informative. The potential distortions of chance successes, not-
reached items, and Heywood cases are effectively controlled. Principal factor, VARIMAX and
PROMAX patterns are provided; Bayes estimates of scores for orthogonal factors can be com-
puted for each respondent.
The main features and statistical principles of the TESTFACT program are described in the re-
mainder of this section.
Each item has a set of responses: right, wrong, omitted, or not-presented. For item j, the response
of person i can be written as x_{ij} = 1 if the response is correct and x_{ij} = 0 otherwise.

At the user's option, omitted items can be considered either wrong or not-presented.
For a test of n items, the total main test score X_i for person i would be

X_i = \sum_{j=1}^{n} x_{ij}.

If the main test is divided into K subtests of n_k items each, the subtest scores are

X_{ik} = \sum_{j=1}^{n_k} x_{ijk}, \quad k = 1, \ldots, K.
Given scores X i or X ik , the program provides estimates of means, standard deviations, and cor-
relations - whether the group of respondents is taken as a whole or split into classes. In addition,
histograms of main test and subtest scores are supplied to enable the user to check the nature of
the dispersion of each score. Product-moment correlations between the main and subtest scores
and external variates (where applicable) are also provided.
The most important item parameters for test construction are those that measure item difficulty
and discriminating power.
Difficulty
The proportion of the respondents attempting item j who answer it correctly is called the item
facility and is denoted p_j. For a standard measure of item difficulty, the delta statistic (∆) is
available. Delta is a non-linear transformation of the proportion correct arranged to have a mean
of 13 and a standard deviation of 4. Its effective range is 1 to 25. The formula is

\Delta = -4\Phi^{-1}(p) + 13,

where p is the proportion correct (or item facility) and Φ^{-1} is the inverse normal transformation.
(For details, see Henryssen, 1971, pp. 139-140.)
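For example, the transformation is a one-liner in code (a sketch, not TESTFACT output):

    from scipy.stats import norm

    def delta(p):
        # Delta difficulty: normal transform of the facility p, scaled to
        # mean 13 and SD 4 (effective range roughly 1 to 25).
        return -4.0 * norm.ppf(p) + 13.0

    print(delta(0.50))   # 13.0: a facility of .50 corresponds to delta 13
    print(delta(0.16))   # about 17: harder items receive larger deltas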
A transformation based upon proportions or percentages that fall on a non-linear scale can cause
misleading judgments about relative differences in difficulty. The difference in difficulty be-
tween items with facilities of .40 and .50 would be quite small, but the difference between items
with facilities of .10 and .20 would be quite large.
The delta scale, on the other hand, is assumed to be linear. The difference in difficulty between
items with deltas of 13 and 14 is assumed to be the same as the difference in difficulty between
items with deltas of 17 and 18. Figure 7.17 shows that a delta of 13 (i.e., Φ^{-1}(p) = 0) corresponds
to a facility of 0.50.
Discriminating power
According to Marshall & Hales (1972), more than 60 different indices for measuring item dis-
criminating power have been proposed.
TESTFACT provides two classical indices, the point biserial and the biserial correlations. Both
call for calculation of correlation between the score (1 or 0) on the item and the score on the test
as a whole. The higher the correlation between these two scores, the more effective the item is in
separating the test scores of the respondents. Naturally, this relationship is relative: a given item
could have a higher item-test correlation when included in one test than when included in a dif-
ferent test.
The point biserial correlation, r_pbis, is a product-moment correlation between two variates, when
one of the variates is binary (the item score) and the other is the continuously distributed score on
the complete test or subtest. The formula for the sample point biserial correlation can be written as

r_{pbis} = \frac{M_p - M}{S} \times \sqrt{\frac{p}{1 - p}},

where:

M_p is the mean score on the test for those subjects who get the item correct
M is the mean test score for all subjects
S is the standard deviation of the test scores
p is the proportion that gets the item right (the item facility)

Evidently, r_pbis serves as a measure of separation through the action of the term

\frac{M_p - M}{S}.
In principle, values of the point biserial lie between -1 and +1. But, as Wilmut (1975, p. 30) has
demonstrated, in item analysis it is unlikely ever to exceed 0.75 or to fall below -0.10. This
should be kept in mind when interpreting output.
Of the many classical discrimination indices, the only serious rival to the point biserial is the
biserial correlation. Unlike the point biserial, the biserial is not a product-moment correlation;
rather, it should be thought of as a measure of association between performance on the item and
performance on the test (or some other criterion). The biserial is less influenced by item diffi-
culty and tends to be invariant from one testing situation to another—advantages the point bise-
rial does not possess (see below).
Also distinguishing it from its rival is the biserial correlation’s assumption that a normally dis-
tributed latent variable underlies the right/wrong dichotomy imposed in scoring an item. This
variable may be thought of as representing the trait that determines success or failure on the item.
The formula for calculating the sample biserial correlation coefficient, r_bis, is

r_{bis} = \frac{M_p - M}{S} \times \frac{p}{h(p)}.
Except for h( p ) , the terms are as before; h( p ) stands for the ordinate or elevation of the normal
curve at the point where it cuts off a proportion p of the area under the curve. As might be ex-
pected, h( p ) enters into the formula because of the assumption of a normally distributed underly-
ing variable.
The relationship between the biserial and point biserial formulas is straightforward:

r_{pbis} = r_{bis} \times \frac{h(p)}{\sqrt{p(1 - p)}}.
The point biserial is equal to the biserial multiplied by a factor that depends only on the item dif-
ficulty; since that factor is always less than one, the point biserial will always be less than the
biserial. In theory, the biserial can take
any value between -1 and +1, but values greater than 0.75 are rare, although the biserial can even
exceed 1 in exceptional circumstances - usually resulting from some peculiarity in the test score
or criterion distribution (Glass & Stanley, 1970, p.171). In practice, negative values usually indi-
cate that the wrong answer has been keyed.
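Both indices are easy to compute from a binary item vector and the total scores; the following
Python sketch follows the two formulas above (an illustration, not TESTFACT's code):

    import numpy as np
    from scipy.stats import norm

    def item_discrimination(item, total):
        # Sample point-biserial and biserial correlations between a binary
        # item score and the total test score.
        item = np.asarray(item, dtype=float)
        total = np.asarray(total, dtype=float)
        p = item.mean()                      # item facility
        Mp = total[item == 1].mean()         # mean score of those correct
        M, S = total.mean(), total.std()     # mean and SD of all test scores
        h = norm.pdf(norm.ppf(p))            # normal ordinate at the p cut
        r_pbis = (Mp - M) / S * np.sqrt(p / (1 - p))
        r_bis = (Mp - M) / S * p / h
        return r_pbis, r_bis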
Lord & Novick (1968, p. 340) show that the point biserial can never attain a value as high as
0.80 of the biserial, and they present a table showing how the fraction varies according to item
difficulty (see also Bowers, 1972). They remark that the extent of biserial invariance is necessar-
ily a matter for empirical investigation, but present some results in support of the conclusion that
“biserial correlations tend to be more stable from group to group than point biserials”.
Bowers (1972) observes that as long as a markedly non-normal distribution of the criterion vari-
able is not anticipated, substantially the same items are selected or rejected no matter which sta-
tistic is used to evaluate discrimination. It is true that the point biserial is more dependent on the
level of item difficulty, but this is not serious, as it only leads to rejection of very easy or very
difficult items, which would be rejected anyway. Users who have not made up their minds on
this issue are advised to fasten on to one or another statistic, learn about its behavior, and stay
with it. Switching from one to the other or trying to interpret both simultaneously is likely to be
confusing. Note, however, that in the factor analysis procedure, factor loadings of the items serve
as discrimination indices.
Although point biserial and biserial correlations are useful guides to the discriminating power of
an item, they cannot describe how respondents of differing levels of achievement or ability re-
spond to specific items. By defining fractiles of the distribution of test scores, and classifying
item responses according to membership in these fractiles, the user can observe the behavior of
items across the ability range and, in particular, keep an open eye for malfunctioning distractors.
Items may:
Fail to differentiate between respondents in the lower, and sometimes in the middle, frac-
tile bands;
Function well over lower fractiles, but give little or no information about respondents in
the higher fractiles;
Discriminate in a way that fluctuates wildly over fractiles.
By way of illustration, consider Table 7.1. The item that produced the data belonged to a 50-item
external examination in chemistry taken by 319 candidates. In many cases, of course, the sample
of candidates would be much larger than this.
The correct (starred) answer was option A, chosen by 146 candidates or, as the number under-
neath indicates, by 0.46 of the sample. The facility of this item is therefore 0.46 and the difficulty
( ∆ ) is 13.42.
Of the distractors, E was most popular (endorsed by 82 candidates, or 0.26 of the sample), fol-
lowed by options C, D, and B. Only two candidates omitted the item.
Table 7.1: Response counts, by total-score fractile, for one item of a 50-item chemistry
examination (N = 319)

                  A*     B     C     D     E     O   Total
Count            146*   13    54    22    82     2    319
Proportion      0.46  0.04  0.17  0.07  0.26  0.01

Scores 0-18        9     6    18     7    21     2     63
Scores 18-22      16     5    16     8    19     0     64
Scores 22-29      30     1     7     7    19     0     64
Scores 29-35      42     1     8     0    13     0     64
Scores 35-47      49     0     5     0    10     0     64

Mean criterion  30.8  18.8  22.0  19.2  23.5  12.5  26.02
Turning to the body of the table, we see an evident pattern. Under the correct answer A, the
count increases as the score level rises. Under the distractors (excepting D, where the trend is
unclear), the gradient runs in the opposite direction. This is what we should see if the item is dis-
criminating in terms of total test score. The pattern we should not see is one in which the counts
under A are relatively equal or, worse, one in which all the counts in the table tend to equality.
As it is, the distribution of the responses tells us quite a lot. Relatively speaking, options B and C
are much more popular in the lowest score fractile, and in that fractile the correct answer was
barely more popular than B or D. In the higher score fractiles, however, B and D are almost to-
tally rejected.
In all, Table 7.1 supports the view that wrong answers are seldom, if ever, equally distributed
across the distractors, either in the sample as a whole or in fractiles. Nor is there any evidence of
blind guessing, an indication of which would be an inflated number in the cell for option A in the
0-18 score group - the one containing a 9 - which could cause the gradient to flatten out at low
score levels, or even to go in the other direction.
In Table 7.1, the five fractiles (any number can be defined, but five is enough for a first look at
an item) have been constructed so as to contain equal or nearly equal numbers of candidates.
This means that, unless the distribution of scores is rectangular, the score intervals will always
be unequal. However, there is no reason why fractiles cannot be defined in terms of equal score
intervals or according to some assumption about the underlying score distribution. If, for exam-
ple, the user believes that the underlying score distribution is normal, the fractiles might be con-
structed so as to have greater numbers in the middle fractiles and smaller numbers in the outer
fractiles. The only problem with this strategy is that, given small numbers of respondents, any
untoward behavior in the tails of the distribution would be amplified or distorted. Also, interpre-
tation of the table might be prone to error because of the varying numbers of candidates in the
fractiles.
7.5.6 Plots
TESTFACT provides line plots of item difficulties (or facilities) vs. the point biserial (or bise-
rial) correlations. It is often the case that high item difficulty corresponds to low biserial values,
and vice versa. When evaluating item statistics, plotting item difficulty (or facility) against point
biserial (or biserial) correlations should enable the user to see which items need attention (but see
Section 7.5.5, above). The user can specify either measure of difficulty and either measure of
discrimination in the PLOT command.
Following the argument of Wood (1977), the initial part of TESTFACT does not correct for
guessing in the case of omitted items. In the factor analysis part of the program, the user may
elect to proceed under the 3-parameter multidimensional normal ogive model, which will pro-
vide for the effects of guessing. TESTFACT does not estimate guessing parameters, but does al-
low the user to specify these values, either a priori or as estimated by a program such as BILOG
(see Section 7.5.11, below).
TESTFACT also provides measures of internal test consistency. It is important to understand that
internal consistency is not the same as homogeneity: a test may be internally consistent—an em-
pirical, statistical fact—even though it includes items that are patently dissimilar in content (see
Green, Lissitz & Mulik, 1977). A measure of the internal consistency is the intra-class correla-
tion coefficient of the test or subtest. The correlation is commonly known as coefficient α .
\alpha = \frac{\sigma^2}{\sigma^2 + \sigma_\varepsilon^2 / n},
where σ 2 is the variance component due to respondents, and σ ε2 is the residual or error variance.
Unlike many other programs, the calculation of α in TESTFACT allows for omits—technically
it is a variance components analysis in the unbalanced case (Harvey, 1970). Users should be
aware that the time taken to compute α is prohibitive for a large number of items or respondents.
We have, therefore, provided a simpler alternative, the Kuder-Richardson (KR20) coefficient:
KR_{20} = \frac{n}{n - 1} \cdot \frac{S^2 - \sum_{j=1}^{n} p_j (1 - p_j)}{S^2},
where n is the number of items in the test, p j is the facility of item j, and S 2 is the variance of
the test scores. Note: if large numbers of respondents omit items, this can affect the estimate of
KR20.
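In code the coefficient is a few lines; a minimal sketch for a complete (no-omits) response
matrix, following the formula above (illustrative only, not TESTFACT's implementation):

    import numpy as np

    def kr20(responses):
        # KR20 for a persons-by-items matrix of 0/1 scores with no omits.
        X = np.asarray(responses, dtype=float)
        n = X.shape[1]                      # number of items
        p = X.mean(axis=0)                  # item facilities p_j
        S2 = X.sum(axis=1).var()            # variance of the total test scores
        return (n / (n - 1)) * (S2 - np.sum(p * (1 - p))) / S2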
The response to any particular item can be thought of in terms of an item threshold on the trait
continuum being assessed. Respondents with a response process greater than or equal to the
threshold will give the correct answer; otherwise, they will give a wrong answer. By assuming
that the processes are normally distributed, and knowing the proportion of cases that respond
correctly to both items in any pair, we can estimate tetrachoric correlations for all n(n − 1)/2
distinct pairs of items. To calculate these correlations, TESTFACT uses Divgi's (1979)
method.
If all respondents get either or both of the items correct, the tetrachoric correlation becomes ±1.
Because the presence of such values causes difficulties for the MINRES factor analysis of the
correlation matrix, TESTFACT uses a one-factor version of Thurstone’s (1947) centroid method
to estimate admissible values for these correlations.
As the final phase of item analysis, the matrix of tetrachoric correlations can be subjected to
principal factor analysis with communality iterations. This is equivalent to unweighted least-
squares (ULS) or MINRES factor analysis based on Thurstone’s (1947) multiple-factor model
(see Harman, 1976). The resulting principal factor pattern can be rotated orthogonally to the
varimax criterion (Kaiser, 1958). With the varimax solution as a target, the pattern can then be
rotated obliquely by the promax method of Hendrickson & White (1964). The latter pattern is
especially appropriate for item analysis, because it tends to identify clusters of items that form
unidimensional subsets within a heterogeneous collection of items.
In general, item tetrachoric correlation matrices are not positive-definite. This means that they
often cannot be used in any of the many statistical procedures that require positive-definiteness,
such as computing partial correlations among some of the items while holding others fixed.
In TESTFACT, this inconvenience can be avoided by listing and saving a smoothed positive
definite matrix of the item correlations. The smoothed matrix is computed from all the positive
roots (renormed to sum to n) of the original tetrachoric matrix. After the number of factors has
been determined, the smoothed matrix is reproduced from the MINRES factor solution.
Classical item analysis makes use of only that information in the examinees’ responses available
in the sample correct and incorrect occurrence frequencies for each item, together with the joint
correct and incorrect occurrence frequencies for all possible pairs of items. The estimation pro-
cedures of classical item statistics, including the MINRES factor loadings, are therefore referred
to as “partial” information methods. IRT estimation procedures, on the other hand, make use of
all of the information in each examinee’s pattern of correct and incorrect responses to the test
items (which is equivalent to the information in all possible occurrence and joint occurrence fre-
quencies of all orders, i.e., of item pairs, item triples, item quadruples, etc.) The IRT procedures
are therefore called “full” information methods.
In the TESTFACT program, IRT-based full information estimation procedures are only needed
in, and only applied to, item factor analysis. Their main advantage is that they are not affected by
the occurrence of zero- or 100-percent joint occurrence frequencies, for which tetrachoric corre-
lations cannot be estimated. Items with zero or 100-percent of correct responses in the sample,
which correspond to infinitely negative or positive item thresholds, do disturb IRT procedures,
however. They should therefore be eliminated from the response patterns before factor analysis
is attempted. For this reason, it is advisable to perform a preliminary classical item analysis be-
fore proceeding with the item factor analysis.
The full information procedure in TESTFACT maximizes the likelihood of the item factor load-
ings and standardized difficulties given the observed patterns of correct and incorrect responses.
It solves the corresponding likelihood equations by integrating over the latent distribution of fac-
tor scores assumed for the population of examinees (the so-called θ distribution). Because this
type of integration is called “marginalization” in the statistical literature, the estimation method
is called “marginal maximum likelihood” or MML. The definite integrals involved in this
method are computed numerically in a procedure referred to, for historical reasons, as “quad-
rature”. This version of TESTFACT makes use of recently developed innovations in quadrature
to make MML estimation feasible for fitting item response models in high-dimensional factor
spaces. It also includes, in addition to the preceding exploratory factor analysis, a confirmatory
factor analysis based on the bifactor model.
7 This section was contributed by R. Darrell Bock and Stephen G. Schilling.
Bock & Aitkin (1981) introduced the marginal maximum likelihood (MML) or estimating item
parameters of the 1- and 2-parameter normal ogive item response models. Their iterative solution
of these likelihood equations was based on the EM algorithm of Dempster, Laird & Rubin
(1977). This method can be applied straightforwardly to the estimation problem of item parame-
ters in the item response model with a guessing term and with more than one latent dimension of
ability, θ . Details are given in Bock, Gibbons & Muraki (1988).
In the multidimensional case, the normal ogive item response model with guessing is given by

P(x_{ij} = 1 \mid \theta_i) = g_j + (1 - g_j)\,\Phi(z_j(\theta_i)),

where

z_j(\theta_i) = c_j + a_{j1}\theta_{i1} + a_{j2}\theta_{i2} + \ldots + a_{jm}\theta_{im}.
The MML estimates of the factor loadings α_{jk}, k = 1, 2, …, m, and standard difficulty, δ_j, are then
calculated from the estimates of the slope parameters, a_{jk}, and the intercept parameter, c_j, as fol-
lows:

\alpha_{jk} = \frac{a_{jk}}{d_j}, \qquad \delta_j = \frac{c_j}{d_j}, \qquad k = 1, 2, \ldots, m,

where

d_j = \sqrt{1 + a_{j1}^2 + a_{j2}^2 + \ldots + a_{jm}^2}.
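The conversion is easily expressed in code; a sketch under the formulas above (the function name
is hypothetical, not a TESTFACT routine):

    import numpy as np

    def loadings_from_slopes(a, c):
        # Factor loadings alpha_jk and standard difficulty delta_j from the
        # multidimensional slopes a_jk and intercept c_j of item j.
        a = np.asarray(a, dtype=float)
        d = np.sqrt(1.0 + np.sum(a**2))
        return a / d, c / d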
Chance or guessing parameters, g j , are treated as known constants in TESTFACT. If the chance
parameter is not included in the item-response model, g j is set to zero for all items. Otherwise,
values of these parameters must be supplied as a part of the input data. When the guessing model
is invoked, the tetrachoric correlation coefficients are computed according to Carroll's (1945)
correction for chance successes.
Guessing parameters are the ordinates of the asymptote of the response function in the direction
of low ability. As such, they do not depend upon the form of the response curve at higher abili-
ties. For this reason, guessing parameters can be satisfactorily estimated by a one-dimensional
item response model such as that used in the BILOG program of Mislevy & Bock (1990). Oth-
erwise, the a priori value equal to 1 divided by the number of response alternatives can be used.
Prior to Thurstone’s development of the multiple-factor model, Holzinger & Swineford (1937)
introduced the bifactor model to extend the Spearman (1904) one-factor model for intelligence
tests to include so-called “group” factors. By including these mutually uncorrelated factors they
were able to explain departures from one common factor when distinguishable items, such as
spatial or number series items, appeared in the tests. Their model also applies to educational
achievement tests containing more than one subject-matter content area - for example, a mathe-
matics test containing an algebra and a geometry section. Such tests are often scored for general
mathematics achievement effects, but the multiple content areas may induce group factors.
The bifactor model has special relevance for IRT, because it accounts for departures from condi-
tional independence of responses to groups of items that depend on a common stimulus such as a
reading passage or problem-solving task. This type of item has been called a “testlet” (see
Wainer, 1995). The presence of these items violates the assumption of conditional independence
and leads to under-estimation of the standard error of the test score.
Taking advantage of the fact that a common factor and uncorrelated group factors imply only
two non-zero factors per item, Gibbons & Hedeker (1992) showed that MML estimation for the
bifactor model requires quadratures in only two dimensions. This means that the conditional de-
pendence problem can be solved in a way that is computationally practical and easily extendable
to large numbers of testlets. Standard errors for scores on the common factor after integrating
over the group factor dimensions then correctly account for the presence of conditional depend-
ence within the item groups. Comparing marginal maximum likelihoods of the bifactor solution
and a one-factor exploratory solution also provides a statistical test for failure of conditional in-
dependence. Analysis based on the bifactor model is included in TESTFACT.
Item factor analysis should be applied only to power tests. If the time limits of such a test are too
short, a substantial proportion of the respondents may not reach later items in the test. In ap-
praising ability, such items might be scored as incorrect. But to do so in the item factor analysis
would introduce a spurious factor associated with item position.
To minimize these effects in the factor analysis, TESTFACT provides an option called TIME in
the SCORE, FULL, and TETRACHORIC commands. When this option is invoked (for each respon-
dent), all items omitted after the last-responded-to item are scored as “not-presented”. Omitted
items prior to the last-responded-to item are scored “incorrect”, unless the guessing mode is se-
lected (CPARMS in the FULL command or CHANCE option in the SCORE command). In that case, the
latter items would be scored “correct” with the probability of chance success g j and “incorrect”
with probability 1 − g j .
Unless otherwise constrained, maximum-likelihood factor analysis may encounter one or more
so-called Heywood cases (i.e., items for which the unique variance goes to zero). In these cases,
the iterative MML solution will not converge. When that happens, the user has the option of sup-
pressing Heywood cases by placing a stochastic constraint, in the form of a prior Beta distribu-
tion, on the uniqueness (1 – communality) of the items. The default values of the parameters of
the Beta distribution have been chosen so that the effect of the prior on the estimated factor load-
ings will be comparatively mild. The uniqueness will not become zero or negative, factor load-
ings will not go to ±1, and loadings of smaller absolute value will not be much affected.
TESTFACT also permits a normal prior with specified mean and variance to be placed on the
intercept parameter of the response function. This protects the maximum likelihood analysis
from excessively large or small item intercepts, corresponding to one hundred percent or zero
percent item facility.
If the sample size is sufficiently large that all 2^n possible response patterns have expected values
greater than one or two, the χ^2 approximation for the likelihood ratio test of fit of the model
relative to the general multinomial alternative is

G^2 = 2 \sum_{l=1}^{2^n} r_l \ln \frac{r_l}{N P_l},

where r_l is the frequency of response pattern l, P_l is the probability of the pattern under the
model, and N is the sample size. The degrees of freedom for this statistic are

2^n - 1 - n(m + 1) + m(m - 1)/2,

where m is the number of factors.
In this case, the goodness-of-fit test can be carried out after performing repeated full information
analyses, adding one factor at a time. When G 2 falls to insignificance, no further factors are re-
quired to explain association between item responses in the sample.
The degrees of freedom above do not apply to bifactor analysis. With that model, each item has
an intercept, a general-factor loading, and a single group-factor loading, and there is no rotational
indeterminacy, so the number of degrees of freedom is

2^n - 1 - 3n.
Note that the term, 2^n − 1, in the above formulas applies only to the case where all 2^n possible
patterns are represented in the data. It can only be used in situations where the patterns are en-
tered with frequency weights, some of which can be zero to account for patterns not appearing in
the data (See Section 13.2).
In situations where the data are entered as individual observations, the program replaces 2^n − 1
with N − 1, the number of cases in the sample minus 1. This is a rather arbitrary expedient, how-
ever, since the degrees of freedom are determined by the number of distinct patterns, and any
given pattern may occur more than once even when there are a large number of items. Also, the
restriction that the probabilities of the patterns must sum to one, which eliminates one degree of
freedom, does not apply unless all possible patterns are represented in the data.
When the number of possible patterns is much larger than the sample size, many patterns will
have zero frequency in the data and many of their expected frequencies will be very small. The
above χ 2 or other approximation to the probability of the likelihood ratio statistics on the null
hypothesis will then be too inaccurate to be relied on as a goodness-of-fit test. Haberman (1977)
has shown, however, that the difference in these statistics for alternative models is distributed in
large samples as χ 2 , with degrees of freedom equal to the difference of respective degrees of
freedom, even when the frequency table is sparse. Thus, the contribution of the last factor added
to the model can be judged significant if the corresponding change of χ 2 value is statistically
significant, even when there are many patterns that do not occur in the sample. Since the term
N − 1 subtracts out of the difference in degrees of freedom when two models with different
number of factors are analyzed with the same data, the degrees of freedom printed by the pro-
gram can be subtracted to obtain the degrees of freedom for the difference of the corresponding
χ 2 values.
These statistics should be interpreted with caution, however: in large-scale studies based on re-
spondents from different sites, cluster effects may inflate the χ 2 statistics. To be conservative
about the number of factors that are identifiable in such studies, it is advisable to divide the χ 2
by a design factor of 2 or 3 before assessing its probability. Factors in large-scale studies that do
not show a significant χ 2 by this criterion are usually uninterpretable.
In TESTFACT, factor scores for the respondents can be computed by the Bayes/EAP (expected a
posteriori) method suggested by Bock & Aitkin (1981) (see also Muraki & Engelhard, 1985).
The posterior standard deviation (PSD) measuring the precision of each factor score estimate is
also computed. The factor scores are computed only for orthogonal solutions (principal factor or
varimax). Transformation to oblique factor scores (by the promax transformation, for example)
could be carried out subsequently, but there is no provision for that in the present version of the
program.
Factor scores may be computed either from standard difficulties and factor loadings estimated
within the program, or from standard difficulties and loadings supplied by the user from external
sources. Alternatively, item intercepts and slopes may be supplied. If the guessing model is se-
lected, chance success parameters must also be supplied.
The factor scores in TESTFACT are Bayes estimates computed on the assumption that the corre-
sponding ability factors are normally distributed in the population from which the sample of
examinees was drawn. That is, the score for each factor is the mean of the posterior (conditional)
distribution of ability, given the item response pattern of the examinee in question. The standard
deviation of the posterior distribution is also calculated and is interpreted as the standard error of
measurement for the score. The user can request the factor scores and corresponding standard
errors to be printed in the output listing and/or saved in an ASCII (plain text) file. The name of
that file will be the command filename with the extension *.fsc.
Following estimation of factor scores, the program will list their sample mean and variances, to-
gether with the mean-square and root-mean-square of the measurement errors. From the score
variance and the mean-square measurement error, the empirical reliability of the test in the par-
ticular sample of examinees is calculated and listed in the output. The expected value of the sum
of each factor score variance and the corresponding mean-square error is unity. If this sum of the
listed sample quantities departs widely from 1.0, it may be an indication of poor convergence or of the
presence of a near-Heywood case (see Bock & Zimowski, 1999, for details).
The computations of MML item factor analysis can be time consuming when the number of
items and number of examinees are large, because the program must then evaluate the posterior
probability of each examinee’s response pattern at each point in the quadrature space. In con-
ventional multidimensional quadrature, i.e., “product” quadrature, the total number of points in
the full space is equal to the number of points in one dimension raised to the power of the num-
ber of dimensions. To avoid excessively long computing times in earlier versions of
TESTFACT, the total number of points was limited to 243. This meant that with three points per
dimension, the largest number of factors that could be accommodated in full information analysis
was five (while this is not true of classical partial information MINRES analysis, programming
restrictions allow only up to 15 factors in MINRES analysis).
With large numbers of items, perhaps 30 or more, the dispersion of the posterior distribution for
a given examinee can become small relative to that of the examinee population distribution in the
full factor space. In that situation, the quadratures tend to become inaccurate with only 243 points
in the full space, because too few of the points fall in the neighborhood of the location of the pos-
terior. Rather than increase the total number of points to avoid this problem, TESTFACT now
employs a form of quadrature that adapts the placement of the points to the region of the factor
space occupied by the posterior distribution corresponding to each pattern. With this method,
three points per dimension are quite adequate for accurate estimation of the factor loadings and
factor scores. It also makes possible a form of “fractional” quadrature based on a subset of points
in product space. The method of choosing these points is described below.
For the integrations by adaptive quadrature, the user has the option of full or fractional quadra-
ture for five factors. For one to four factors, the quadratures use all points in the product space;
for six to ten factors, the program uses successive one-third fractions of the points. Thus, the
number of points actually involved in the quadrature never exceeds 243. The points and weights
in these quadratures may be those for rectangular or for Gauss-Hermite quadrature, at the user’s
option. (Rectangular is the program default.) The program implementation of adaptive quadra-
ture is based on mathematical and statistical results of Naylor & Smith (1982), Schilling (1993),
Meng & Schilling (1996), Bock & Schilling (1997) and Schilling & Bock (1999). In addition to
increasing to 10 the program limit on the number of factors, adaptive quadrature improves accu-
racy of estimation, especially when the number of items is large. To allow for comparison be-
tween the two methods of quadrature, TESTFACT now includes a technical option, NOADAPT, to
invoke the non-adaptive procedure for up to five factors.
In fractional quadrature, a subset of the full set of points in product quadrature is selected in a
way that retains estimability of the parameters of the multiple factor model. Since factor analysis
is equivalent to determining the mean and covariance matrix of the latent factors (or an arbitrary
orthogonal transformation of the covariance matrix), any subset of points that allows means and
covariances to be estimated will be suitable for quadrature in MML item factor analysis. Designs
that have this property have been found for the formally equivalent problem of factorial experi-
ments in which main effects and two-way interactions must be estimable, but higher-way inter-
actions may be assumed null. Such designs exist for factorial experiments in 5 or more treatment
variables, each with three equally spaced levels of treatment. Because these designs reduce the
total number of points by one-third as each additional treatment variable is included, the total
number of treatment combinations remains fixed at 243 from six treatment variables onward.
With 5 treatment variables, the one-third fraction contains 81 combinations.
The employment of these fractional factorial designs for multidimensional quadrature requires
only the choice of values corresponding to the treatment levels. Simulation studies by Schilling
& Bock (1998) have shown that - on the assumption that the latent factor score distribution is
multivariate normal - near optimum values are:
These are the default values in TESTFACT, but they can be altered by the user if desired. What-
ever their values, the corresponding quadrature weights are the respective normal ordinates (den-
sities) constrained to sum to unity by dividing by their unconstrained total weight.
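As a sketch of that normalization (the three point values below are arbitrary placeholders, not
TESTFACT's defaults):

    import numpy as np

    points = np.array([-1.0, 0.0, 1.0])                  # hypothetical per-dimension points
    dens = np.exp(-points**2 / 2.0) / np.sqrt(2.0 * np.pi)
    weights = dens / dens.sum()                          # normal ordinates renormed to sum to 1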
In Monte Carlo integration with importance sampling, the adaptation is carried out in a similar
way, but the points are drawn randomly from the provisionally estimated posterior distribution
corresponding to each distinct response pattern, which is assumed multivariate normal. In both
fractional quadrature and Monte Carlo integration, the factor scores for each examinee’s re-
sponse pattern are estimated by the Bayes modal method (also called “maximum a posteriori”,
or MAP estimation). In this method, the multivariate mode of the posterior distribution serves as
the estimate of the examinee’s factor scores, and the inverse of the Fisher information at the
mode (i.e., the curvature of the posterior density at the mode) serves to estimate the variances
and covariances of modal values. In the adaptive solution, the MAP estimates are recomputed
during each of the earlier EM cycles. After a number of cycles set by the ITLIMIT keyword of
the TECHNICAL command (default equals one-third of the number of cycles set by the CYCLES
keyword of the FULL command) the posterior modes and information matrix for each examinee
remain fixed for the rest of the EM cycles.
The program now provides the option of Monte Carlo integration in the EM solution of the mar-
ginal maximum likelihood equations. The user may choose the number of random deviates to be
sampled from the assumed multivariate normal distributions corresponding to each response pat-
tern in the data, and choose also the seed for their generation. In principle, this method of solving
the likelihood equations applies to any number of factors, but programming restrictions in the
implementation limit the number to 15.
7.5.19 Applications
Interesting applications of the TESTFACT program are described in Muraki & Engelhard
(1985), Zimowski (1985), Zwick (1987), Bock, Gibbons & Muraki (1988), and Zimowski &
Bock (1987).
8 Estimation
8.1 Introduction
Item Response Theory (IRT) has only recently come into widespread application in the fields of
educational and psychological measurement. However, the theory is not really new. Its roots can
be found in the work of Thurstone (1925), Thorndike (1927), Symonds (1927), Horst (1933), and
others. Its beginnings are in the pioneering efforts of Lawley (1943) and Lord (1952). In contrast
to traditional test theory, as presented in Gulliksen’s (1950) landmark text and elsewhere, IRT
provides the following features:
Respondents may be scored on the same scale, even if they do not respond to the same set
of items.
Respondents may be comparably scored on two or more forms of the same test.
Short forms, long forms, easy forms, hard forms, parallel forms, and other alternate forms
are all treated in the same way.
Tests can be tailored to proficiency, with easy questions for those who show low profi-
ciency and difficult questions for those who exhibit higher proficiency.
The magic of IRT arises in placing all of the test scores on the same scale after all of these
machinations, even if the respondents answer different sets of questions. IRT also permits the use
of all of the information included in an examinee’s response to a question or test item, even if
that response may be in one of three or more graded categories (as on a rating scale) or in one of
several strictly nominal categories (as among the four or five choices of a conventional multiple-
choice item). Responses on attitude measures are frequently graded, and on multiple-choice pro-
ficiency tests some distractors are usually “wronger” than others. IRT permits the information in
any choice of an item response to be used to estimate the value of that respondent's trait or
proficiency.
The power of IRT is associated primarily with the phrase “estimate the value of the trait”.
Loosely speaking, we say that a test is “scored”. But strictly speaking, the test is not scored; one
does not simply count the positive responses, as is done in traditional test theory. One “estimates
the value of the trait” using the inferred relationships between the item responses and the trait
being measured. In the process, one finds that there is no longer an idea of “reliability” in many
cases; instead, there is information. An understanding of this estimation process and the idea of
information in the technical sense (after Fisher, 1925) are crucial for an appreciation of the the-
ory. Both are discussed in the following sections.
8 This section was contributed by David Thissen.
Item response theory is concerned with the probabilistic relationship between the response to
some test item and the respondent’s attribute that the test item is intended to measure. Test items
may be problems on a proficiency test, questions on an attitude scale, or behaviors on a behavior
checklist. The attribute of the person may be a cognitive proficiency, an attitude, or a personality
construct (either “trait” or “state”). The attribute being measured by the test is usually called θ
and is usually arbitrarily placed on a z-score scale, so zero is average and θ -values range, in
practice, roughly from -3 to +3. Item response theory is used to convert item responses into an
estimate of θ , as well as to examine the properties of the items in item analysis.
In its simplest form, item response theory is concerned with the relationship between binary test
items (correct/incorrect, agree/disagree, yes/no) and θ , however it may be conceived. In a useful
binary test item, the relationship between the probability of a positive response and θ must be
more or less like the function in the top panel of Figure 8.1. As is illustrated there, the probability
of a positive response (on the y-axis) is plotted against θ : it is an increasing, S-shaped function,
indicating a low probability of a positive response among persons of low θ , moderate probabil-
ity for individuals of average θ , and a high probability of positive response for persons of high
θ.
Figure 8.1: Probabilities and joint relative likelihood of sequence of binary items
Top panel: A trace line for a binary test item (referred to as item 1); that is, the probability
of a positive response plotted against the trait value ( θ ).
Center panel: The probability of a negative response to a second item (referred to as item
2).
Lower panel: The joint relative likelihood of the response sequence {positive, negative}
as a function of θ .
Computational aspects of item response theory usually require that the function in Figure 8.1
have some specified mathematical form; the normal ogive and logistic functions have frequently
been used (see Lawley, 1943; Lord, 1952; Rasch, 1960; Birnbaum, 1968). In either case, each
binary item has a curve like that in the top panel of Figure 8.1, sometimes called an Item Char-
acteristic Curve (ICC) or “trace line” (Lazarsfeld, 1950), which is defined by its “location” and
“slope”. The latter terminology will be used here.
In some of the simpler IRT models, the location parameter of the trace line is the point on the θ -
scale at which the curve crosses P = 0.5. So persons whose trait value exceeds the location pa-
rameter of the item have greater than a 50% chance of a positive response, while persons whose
θ values lie below that location have less than 50% chance of a positive response. In the context
of proficiency tests, the location of an item corresponds to its difficulty: the higher the location
parameter, the more proficiency is required before the examinee has a 50% chance of a correct
response.
The slope of a trace line reflects the rate at which the probability of a positive response changes
as θ increases. This is the classical discrimination parameter. The trace line for item 2 in Figure
8.1 (for a negative or incorrect response, since it decreases over θ ) changes more quickly as θ
changes than does the trace line for item 1. The item 2 curve drops from about 0.9 to about 0.1
between 0 and 2, while it takes the range from -2 to +2 for the trace line for item 1 to climb from
0.1 to 0.9. Item 2 is an item with a higher slope than item 1. The location of item 2 is also higher,
at θ = 1.
If the trace lines for items 1 and 2 are known, or, more precisely, if their parameters are known,
and an examinee responds positively to item 1 and negatively to item 2, that information may be
used to estimate the θ -value for that person. One way to make such an estimate uses the princi-
ple of Maximum Likelihood (ML). If the item responses are independent (conditional on θ ), then
the joint likelihood of the sequence {positive response, negative response} ({right, wrong},
{agree, disagree}, and so on) at any value of θ is the product of the item 1 and item 2 probabil-
ity values at that level of θ in Figure 8.1. That product has been computed and is labeled “Total”
at the bottom of Figure 8.1. The total likelihood is low for low values of θ , because it is unlikely
that a person there would respond positively to item 1, and it is low for high values of θ , be-
cause it is unlikely that a person there would respond negatively to item 2. The total likelihood of
the sequence {positive, negative} is highest at about θ = 0.4, so that is the Maximum Likelihood
Estimate (MLE) for θ, called MLE[θ]. So a person who responds {positive, negative} might
be assigned a trait value of 0.4 as a “test score” or measurement.
The MLE is the mode of the total likelihood in Figure 8.1. If desired, the average, called EAP[ θ ]
for Expected a Posteriori, or some other estimate may be used. The point estimate of location
provides a very limited summary of the total likelihood. In a subsequent section on test informa-
tion, we will consider the addition of indices of spread or width of that likelihood around its lo-
cation. At this point, it is sufficient to understand that “estimation” of θ uses the relative likeli-
hood of the observed response sequence as a function of θ , and consists of summarizing the dis-
tribution of the likelihood, like the “Total” in Figure 8.1, by one or more numbers - the first of
these being a point estimate of its location.
The procedure is easily extended to more items. Figure 8.2 shows the trace lines associated with
a five-item test, for the sequence {negative, negative, positive, positive, positive}. Again, the to-
tal at the bottom is a plot of the likelihood of that particular sequence over values of θ . There is
almost no likelihood above zero, and MLE[ θ ] for this sequence of responses to this very easy
test is about -1.3. So the examinees that responded with this sequence to these five items might
be assigned -1.3 as a point estimate of their trait values.
If items 1 and 2 of Figure 8.1 and items 1 to 5 of Figure 8.2 came from a pool of items which
measured the same trait ( θ ) and their item parameters were known, the ML estimates of θ in the
two cases would be on the same scale and thus directly comparable.
This is true even though the examinee represented by Figure 8.1 responded to only two items and
the examinee represented by Figure 8.2 responded to five (different) items. This feature of item
response theory allows tests made up of different sets of items (like tests with missing data, short
and long forms, alternate forms, and so on) to be used to assign comparable trait estimates to ex-
aminees. The ML estimation of θ for each person takes into account the properties of each item
in constructing the total likelihood of the observed responses. Thus, the estimate of θ which as
the value that has the highest likelihood of producing those responses, is the same regardless of
the set of items.
Figure 8.2: Trace lines and total relative likelihood for sequence of 5 binary items
Top five panels: Trace lines for five binary items, in sequence {negative, negative, posi-
tive, positive, positive}.
Lower panel: Total relative likelihood for that sequence as a function of θ .
Item responses need not be binary. Samejima (1969) (see Sections 2.3.2 and 2.4.2) has devel-
oped item response models for graded items with 3, 4, or more ordered categories of response
(like disagree, moderately disagree, moderately agree, and agree). The trace lines for the highest
and lowest of the graded responses are like those for positive and negative binary responses; for
binary items, the graded models become identical to binary models using the same functions. For
more than two graded responses, the intermediate responses have increasing, then decreasing
trace lines: intermediate responses must be more likely at some moderate value of θ . Figure 8.3
shows trace lines for a test of three graded items, with the probability of responses 2 on item 1, 3
on item 2, and 1 on item 3 plotted. Samejima’s graded model has one slope parameter for each
item (very high for item 3, moderate for item 2, and low for item 1 in Figure 8.3). There are sev-
eral location parameters, called thresholds: one less than the number of response categories.
Each location parameter specifies the point on the θ scale at which a person has a 50% chance
of responding in some higher category than the one to which the threshold belongs.
As illustrated in Figure 8.3, there is still a total likelihood over θ for any given response pattern,
even when all (or some) of the items have more than two possible responses. θ may still be esti-
mated as the maximum of the total likelihood; it is –1.2 in the example in Figure 8.3.
Figure 8.3: Trace lines and joint relative likelihood of response sequence of 3-category
items
Top panel: Trace line for a response in category 2 of a 3-category graded item.
Second panel: Trace line for a response in category 3 of a 3-category item.
Third panel: Trace line for a response of 1 on another 3-category item.
Fourth panel: The joint relative likelihood of the response sequence as a function of θ .
There are sometimes test items that permit multiple responses that have no obvious order. Multi-
ple-choice vocabulary questions are such items. While one response may be “correct”, the others
may not be obviously graded with respect to “correctness”, although they may be chosen differentially by examinees of different proficiency. Bock (1972) has proposed a logistic response
model for just such items. The parameters of that model are contrasts among the slopes and in-
tercepts of the trace lines. The model produces trace lines for each response category, which can
be combined with other trace lines and used in the ML estimation of θ . Thissen and Steinberg
(1984, see Section 7.4.4) describe an extended version of this model which is included in
MULTILOG.
Frequently, prior information is available or can be assumed about the distribution of θ in the
population of persons responding to a test. Such information, based on population membership,
is numerically equivalent to a test item to which all of the members of the population respond
identically. An N[0,1] prior – assuming that the examinees are drawn from a standard normal dis-
tribution – is equivalent to a trace line for a Gaussian distribution. The curve marked “Pop.” in
Figure 8.4 is a density for such a normal prior. Prior information of this sort may be combined
with the item responses just as though it represented an item on the test. The implicit test item is
“Do you belong in this population?” Yes = mean of population, No = missing data.
Item response theory is a flexible system for scoring tests and thus providing measurement of
individual differences. Alternative models are available for many types of item response data.
The reader is advised to examine the technical literature in this field or obtain competent advice
before applying any model (or set of models). Once an appropriate model is chosen and its parameters are estimated, item response theory provides a satisfactory solution to a wide range of problems in psychological measurement.
8.1.2 Information
In the preceding section, we have made use of point estimates of the location of total likelihoods
as estimates of the unobserved trait value θ . When the total likelihood over θ derived from the
item responses is relatively flat, as it is for the five item responses in Figure 8.2, an unmitigated
version of this procedure should generate some discomfort in the reader. In Figure 8.2, there is a
substantial likelihood for the observed response pattern {00111} for all θ -values between –3.0
and 0.0. Under these circumstances, the point estimate MLE[ θ ] = –1.3 may imply too much pre-
cision.
An estimate of the width or spread of the total likelihood may be used to specify the precision
with which the MLE[ θ ] estimates θ . Since the form of the distribution of total likelihood is
roughly Gaussian, an estimate of the standard deviation of that distribution is a useful and widely
comprehensible index of spread. For the situation in which the item parameters are taken to be
fixed and known and the only parameter to be estimated is the trait-value θ for a particular ex-
aminee, the distribution is the sampling distribution of θ and its standard deviation is the stan-
dard error of MLE[ θ ].
Figure 8.4: Trace lines and joint relative likelihood for five binary items
Top five panels: Trace lines for five binary test items {00100}.
Sixth panel: N[0,1] population density.
Lowest panel: The joint relative likelihood (posterior) as a function of θ .
It is possible to employ any variety of methods to estimate the spread of distributions such as the
total likelihood in Figure 8.2. One method, which is extremely convenient in the context of ML
estimation, makes use of the fact that the negative inverse of the expected value of the second
derivative of the log likelihood is approximately equal to the variance of the estimator (Fisher,
1925; Kendall & Stuart, 1961, p.10). While the term used to describe that number is unpleasantly
long, that value is a routine by-product of ML estimation. It is frequently used as an estimate of
the standard error to describe the spread of the total likelihood in terms that are interpretable in a
roughly Gaussian sense, i.e. a 95% confidence interval is MLE[ θ ] ± 2 standard errors.
The standard error estimated in this way for MLE[ θ ] = –1.3 from the likelihood at the bottom of
Figure 8.2 is 1.3. So the central (Gaussian) 68% confidence interval for θ would run from
-1.3 ± 1.3 = –2.6 to 0.0. Examination of Figure 8.2 reveals that, although the total likelihood is not
strictly Gaussian, the inflection points are very nearly at –2.6 and 0.0, as would be expected if
the distribution were Gaussian and 1.3 were the standard deviation.
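The curvature-based standard error just described is easy to compute numerically. The sketch below uses hypothetical parameters for an easy five-item test and the pattern {0, 0, 1, 1, 1} of Figure 8.2; the printed values depend on the invented parameters, not on the figure itself.

```python
import numpy as np

def p2pl(a, b, theta):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def loglik(theta, items, responses):
    """Log-likelihood of a binary response pattern at trait value theta."""
    ll = 0.0
    for (a, b), x in zip(items, responses):
        p = p2pl(a, b, theta)
        ll += np.log(p) if x == 1 else np.log(1.0 - p)
    return ll

# Hypothetical easy five-item test and the pattern {0,0,1,1,1}
items = [(1.0, -2.5), (1.0, -2.0), (1.0, -1.5), (1.0, -1.0), (1.0, -0.5)]
responses = [0, 0, 1, 1, 1]

grid = np.linspace(-4, 4, 8001)
ll = np.array([loglik(t, items, responses) for t in grid])
mle = grid[np.argmax(ll)]

# Approximate the second derivative of the log-likelihood at the MLE
h = 1e-3
d2 = (loglik(mle + h, items, responses) - 2 * loglik(mle, items, responses)
      + loglik(mle - h, items, responses)) / h**2
se = np.sqrt(-1.0 / d2)        # standard error of MLE[theta] from the curvature
print(f"MLE = {mle:.2f}, SE = {se:.2f}")
```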
Standard errors estimated in this way are different for different response patterns for the same
test. The likelihoods may be broad or narrow, depending on the relative locations of θ for the
individual and the item parameters. Since the standard error is the width of the total likelihood, it
varies. With some exceptions, there is a pattern to the standard errors: they are small for θ -loca-
tions near clusters of discriminating items and large far away, usually at the edges of the range of
the test. This variation is at odds with the concept of reliability ( ρ ), which is based on a model in
which all the estimates have the same error of estimate, equal to $\sqrt{1 - \rho}$ for standardized tests.
So reliability is frequently not a useful characteristic of a test scored in this way. No single num-
ber characterizes the precision of the entire set of IRT trait-estimates made from a test. Instead,
the pattern of precision over the range of the test may be plotted. A plot of the standard error
against θ would serve this purpose, but the variable conventionally plotted is Information, which
is approximately equal to 1/(standard error)². This definition, due to Fisher (1925) and therefore
sometimes called Fisherian information, uses the word “information” in an intuitively obvious
way: if the standard error reflects our lack of knowledge about the parameter, then its inverse is
information. Information is used primarily because it is additive: each test item produces a fixed
quantity of information at each level of θ . The information function for a test is simply the sum
of the item information functions. This allows easy computation of information functions for
tests of varying compositions.
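The additivity is easy to demonstrate. A small sketch with hypothetical 2PL items, using the 2PL item information a²P(1 − P) that appears as part of equation (8.16) later in this chapter:

```python
import numpy as np

def item_info_2pl(a, b, theta):
    """2PL item information a^2 P(1-P); cf. equation (8.16) below."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

theta = np.linspace(-4, 4, 401)
form_a = [(1.2, -1.0), (0.8, 0.0), (1.5, 1.0)]   # hypothetical test form
form_b = form_a + [(1.0, 0.5)]                    # same form plus one item

info_a = sum(item_info_2pl(a, b, theta) for a, b in form_a)
info_b = sum(item_info_2pl(a, b, theta) for a, b in form_b)

# Adding an item simply adds its information function to the test's.
assert np.allclose(info_b - info_a, item_info_2pl(1.0, 0.5, theta))
```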
To make use of IRT for test scoring, the parameters of the model for each item of the test must
be estimated. Estimating the item parameters and checking the fit of the models is referred to as item calibration. The calibration process requires item data from a sample of respondents who
have been administered the test under exactly the same conditions as those in which the test will
be used in practice. After a preliminary pilot study to select suitable items, the main item cali-
bration can be performed on data obtained in the first operational use of the test. Replacement of
items in subsequent administrations can then be carried out by one of the methods described in
the section on test equating.
An approach to item parameter estimation that applies to all types of item response models and is
efficient for short and long tests is the method of marginal maximum likelihood (MML). (See
Bock & Aitkin, 1981; Harwell, Baker & Zwarts, 1988). Except in special cases, the MML
method assumes the conditional independence of responses to different items by persons of the
same ability θ . Because the joint probability of independent events is the product of the prob-
abilities of the separate events, this assumption makes it possible to calculate the probability of
observing a particular pattern of item scores,
$\mathbf{x} = (x_1, x_2, \ldots, x_n),$
[This section was contributed by R. Darrell Bock.]
$$P(\mathbf{x} \mid \theta) = \prod_{j=1}^{n} [P_j(\theta)]^{x_j}\,[1 - P_j(\theta)]^{1 - x_j}, \qquad (8.1)$$
that is, as the continued product of $P_j(\theta)$ or $1 - P_j(\theta)$ according as the person responds correctly
or incorrectly to item j. This quantity is the probability of the pattern x , conditional on θ . It is
to be distinguished from the probability of observing the pattern x from a person of unknown
ability drawn at random from a population in which θ is distributed with a continuous density
g (θ ) . The latter is the unconditional probability given by the definite integral,
$$P(\mathbf{x}) = \int_{-\infty}^{\infty} P(\mathbf{x} \mid \theta)\, g(\theta)\, d\theta. \qquad (8.2)$$
This quantity is also referred to as the marginal probability of x . Because the ability, θ , has
been integrated out, this quantity is a function of the item parameters only.
In IRT applications, the integral in (8.2) cannot generally be expressed in closed form, but the
marginal probability can be evaluated as accurately as required by the Gaussian quadrature for-
mula
$$\bar{P}(\mathbf{x}) \approx \sum_{k=1}^{q} P(\mathbf{x} \mid X_k)\, A(X_k), \qquad (8.3)$$
where $X_k$ is a quadrature point and $A(X_k)$ is a positive weight corresponding to the density func-
tion, g ( X ) . Tables giving quadrature points and corresponding weights are available for various
choices of g (θ ) (see Stroud & Sechrest, 1966). We recommend 2 x the square root of the num-
ber of items as the maximum number of quadrature points.
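As an illustration of (8.3), the following sketch approximates the marginal probability of a hypothetical three-item pattern using equally spaced points with normalized normal-density weights; the items, the pattern, and the number of points are invented for the example.

```python
import numpy as np

def pattern_prob(x, items, theta):
    """P(x | theta) for a binary pattern under the 2PL (equation 8.1)."""
    p = np.array([1/(1 + np.exp(-a*(theta - b))) for a, b in items])
    return np.prod(np.where(np.array(x) == 1, p, 1 - p))

items = [(1.0, -0.5), (1.3, 0.0), (0.7, 0.8)]   # hypothetical item parameters
x = [1, 0, 1]                                    # hypothetical response pattern

# Quadrature points spanning the N(0,1) latent distribution, with weights
# proportional to the normal density and normalized to sum to one
Xk = np.linspace(-4, 4, 30)
Ak = np.exp(-0.5 * Xk**2)
Ak /= Ak.sum()

marginal = sum(pattern_prob(x, items, X) * A for X, A in zip(Xk, Ak))
print(f"Marginal probability of pattern {x}: {marginal:.4f}")   # equation (8.3)
```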
In the MML method, values for the item parameters are chosen so as to maximize the logarithm
of the marginal maximum likelihood function, defined as
$$\log L_M = \sum_{l=1}^{S} r_l \log_e \bar{P}(\mathbf{x}_l), \qquad (8.4)$$
where rl is the frequency with which the pattern x l is observed in a sample of N respondents,
and S is the number of distinct patterns observed.
A necessary condition on the maximum of (8.4) for the 3PL model of item j is given by the like-
lihood equations
$$\sum_{k=1}^{q} \frac{\bar{r}_{jk} - \bar{N}_k P_j(X_k)}{P_j(X_k)\,[1 - P_j(X_k)]} \cdot \frac{\partial P_j(X_k)}{\partial (c_j,\, a_j,\, g_j)'} = \mathbf{0}, \qquad (8.5)$$
where
$$\bar{r}_{jk} = \sum_{l=1}^{S} r_l\, x_{lj}\, P(\mathbf{x}_l \mid X_k)\, A(X_k) \big/ \bar{P}(\mathbf{x}_l)$$
and
$$\bar{N}_k = \sum_{l=1}^{S} r_l\, P(\mathbf{x}_l \mid X_k)\, A(X_k) \big/ \bar{P}(\mathbf{x}_l) \qquad (8.6)$$
are, respectively, the posterior expectation of the number-correct and of the number of attempts
at point X k . ( xlj is the 0,1 score for item j in pattern l).
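The expected counts (8.6) are the E-step of the EM algorithm described next. A minimal sketch with a hypothetical two-item data set (patterns, frequencies, and parameters are all invented):

```python
import numpy as np

def pattern_prob(x, items, theta):
    p = np.array([1/(1 + np.exp(-a*(theta - b))) for a, b in items])
    return np.prod(np.where(x == 1, p, 1 - p))

items = [(1.0, -0.5), (1.3, 0.0)]                      # hypothetical parameters
patterns = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # distinct patterns x_l
r = np.array([40, 25, 10, 25])                         # pattern frequencies r_l

Xk = np.linspace(-4, 4, 15)                            # quadrature points
Ak = np.exp(-0.5 * Xk**2); Ak /= Ak.sum()              # normal prior weights

# P(x_l | X_k) for every pattern and quadrature point
L = np.array([[pattern_prob(x, items, X) for X in Xk] for x in patterns])
Pbar = L @ Ak                                          # marginal probabilities (8.3)

# Posterior expected attempts N_k and expected correct r_jk at each point (8.6)
W = (r[:, None] * L * Ak[None, :]) / Pbar[:, None]
N_k = W.sum(axis=0)                                    # expected attempts
r_jk = patterns.T @ W                                  # one row per item j
```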
The so-called EM algorithm and Newton-Gauss (Fisher-scoring) methods are used to solve these
implicit equations. Details may be found in Bock and Aitkin (1981) and Thissen (1982). Stan-
dard errors and correlations of the parameter estimators are obtained by inverting the information
matrix in the Fisher-scoring solution.
MML estimation for the two- and three-parameter models is essentially a one-dimensional item
factor analysis. As such, it is subject to so-called Heywood cases in which a unique variance
goes to zero. The symptom of such a case is an indefinitely increasing slope during the EM and
Newton iterations of the maximization.
Because all items are fallible to some degree, zero unique variance is untenable. It is therefore
reasonable to avoid Heywood cases by placing a stochastic constraint on the item slopes to pre-
vent them from becoming indefinitely large. This may be done by adopting a Bayes procedure
called “marginal maximum a posteriori” (MMAP) estimation. In one form of this procedure, the
slopes (which must be positive) are assumed to have a log normal distribution in the domain
from which the items are drawn. Values for the item parameters are then chosen so as to maxi-
mize the logarithm of the product of the likelihood of the sample and the assumed log normal
“prior” distribution of the slopes. The parameters of this log normal distribution for slopes can be
either specified as a priori—the Bayes solution—or estimated from the data at hand—an empiri-
cal Bayes solution. This amounts to finding the maximum of the posterior distribution of the
slopes, given the data.
For the three-parameter model, a similar procedure is needed to keep the lower asymptote pa-
rameter, g j , in the open interval from 0 to 1. The beta distribution may be used for this purpose.
The intercept parameter can also be constrained to a plausible region, although this is less important than constraining the slope and asymptote. (See Mislevy, 1986, and Tsutakawa & Lin, 1986, for details.)
The scale of $\theta$ is arbitrary: in the logit $z_j = a_j(\theta - b_j)$,
any change in the origin of θ can be absorbed in b j , and any change in the unit of θ can be ab-
sorbed in a j . A widely accepted convention is to fix location by setting the mean of the latent
distribution (of θ ) to 0 and to fix scale by setting the standard deviation of the distribution to 1.
The parameters are then said to be in the “0, 1” metric. To set the mean and standard deviation to
some other values, m and s, say, it is only necessary to change b j to
$$b_j^* = s\,b_j + m \qquad (8.7)$$
and $a_j$ to
$$a_j^* = a_j / s. \qquad (8.8)$$
If the model is written in terms of the intercept, $c_j = -a_j b_j$, the corresponding transformation is
$$c_j^* = c_j - a_j m / s. \qquad (8.9)$$
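Equations (8.7)–(8.9) amount to a one-line conversion. A small sketch (the function name and example values are hypothetical):

```python
def rescale(b, a, m, s, c=None):
    """Move item parameters from the (0,1) metric to a metric with mean m
    and standard deviation s, per equations (8.7)-(8.9)."""
    b_star = s * b + m                      # equation (8.7)
    a_star = a / s                          # equation (8.8)
    c_star = None if c is None else c - a * m / s   # equation (8.9)
    return b_star, a_star, c_star

# Example: express b = 0.5, a = 1.2 on a scale with mean 250, SD 50
print(rescale(0.5, 1.2, 250, 50))           # (275.0, 0.024, None)
```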
A convenient way to characterize an arbitrary latent distribution with finite mean and variance is
to compute the probability density at a finite number of suitably chosen values of θ and to nor-
malize the densities by dividing by their total. The result is a so-called “discrete distribution on a
finite number of points” (Mislevy, 1984). These normalized densities can be used as the weights,
A( X k ) in quadrature formulas such as (8.3).
This discrete representation of the latent distribution can be readily estimated from the item re-
sponse model. The expected frequency at point X k , given the item data from a sample of N re-
spondents, is N k , the expected number of attempts defined by (8.6) above.
Normalizing these expected frequencies gives the weights
$$A^*(X_k) = \bar{N}_k \Big/ \sum_{h=1}^{q} \bar{N}_h. \qquad (8.10)$$
They are called empirical or “posterior” weights, as distinguished from the theoretical or “prior”
weights, A( X k ) , assumed before the data are in hand.
If data from a large sample of respondents is available, the fit of the model may be tested, either
for the test as a whole, or item by item. The method of examining fit depends upon the number
of items in the test.
If nearly all of the $2^n$ possible response patterns for an n-item test appear in the data, the overall goodness-of-fit of the model can be tested directly. The distinct response patterns must be counted to obtain the pattern frequencies $r_1, r_2, \ldots, r_{2^n}$.
If a few of these frequencies are zero, ½ may be substituted for each and the sum of these sub-
stitutions subtracted from the largest frequency. Then the likelihood ratio χ 2 statistic for the test
of fit is
$$G^2 = 2 \sum_{l=1}^{2^n} r_l \log_e \frac{r_l}{N\,\bar{P}(\mathbf{x}_l)}, \qquad (8.11)$$
where $\bar{P}(\mathbf{x}_l)$ is the marginal probability of pattern $\mathbf{x}_l$ given by (8.3). This $\chi^2$ has degrees of freedom $2^n - kn - 1$, where k is the number of item parameters in the model. Significantly large
values of the statistic indicate a failure of fit of one or more of the response models for the n
items.
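A direct implementation of (8.11), including the ½-substitution for zero cells described above, might look as follows (the function name and arguments are hypothetical):

```python
import numpy as np

def overall_g2(r, P_marginal, n_params_per_item, n_items):
    """Likelihood-ratio chi-square of equation (8.11).

    r           -- observed frequency of each of the 2**n response patterns
    P_marginal  -- model-implied marginal probability of each pattern (8.3)
    """
    r = np.asarray(r, dtype=float)
    N = r.sum()
    # Substitute 1/2 for zero cells, subtracting the total of the
    # substitutions from the largest frequency
    zeros = r == 0
    if zeros.any():
        r[np.argmax(r)] -= 0.5 * zeros.sum()
        r[zeros] = 0.5
    g2 = 2.0 * np.sum(r * np.log(r / (N * np.asarray(P_marginal))))
    df = 2**n_items - n_params_per_item * n_items - 1
    return g2, df
```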
For tests too long for all response patterns to be observed, no dependable, formal test of fit yet exists. But useful information about the
fit of individual items may be obtained by inspecting standardized differences between the poste-
rior probability of correct response at selected values of θ and the probabilities at those points
computed from the corresponding fitted response model. These differences are called “standard-
ized posterior residuals”.
In terms of quantities defined above, the posterior probability of a correct response to item j at the point $X_k$ is the ratio $\bar{r}_{jk} / \bar{N}_k$. The standardized posterior residual is then
$$\delta_{jk} = \frac{\sum_{l=1}^{S} W_{lk}\,[x_{lj} - P_j(X_k)]}{\left\{ \sum_{l=1}^{S} W_{lk}\,[x_{lj} - P_j(X_k)]^2 \right\}^{1/2}}, \qquad (8.12)$$
where
$$W_{lk} = \frac{r_l\, P(\mathbf{x}_l \mid X_k)}{\bar{P}(\mathbf{x}_l)}. \qquad (8.13)$$
Values of this residual greater than, say, 2.0 may be taken to indicate some failure of fit of the
model at the corresponding point. In interpreting such deviates, it is advisable to take into con-
sideration the posterior weight, A* ( X k ) , at the point, since a discrepancy in a region of θ with
very little probability in the population will have little effect on the performance of the model.
As an overall index of fit, we suggest the population root-mean-square of the posterior deviates.
Its formula is
$$\mathrm{RMS}(\delta_j) = \left[ \sum_{k=1}^{q} \bar{N}_k\, \delta_{jk}^{2} \Big/ \sum_{k=1}^{q} \bar{N}_k \right]^{1/2}.$$
Unfortunately, the posterior residuals seem to be too highly correlated to be successfully com-
bined into a χ 2 statistic for the item. Neither do they take into account the sampling variance of
Pj ( X k ) due to estimation of its item parameters, but this source of variation is presumably small.
If the test is sufficiently long, the respondents in a sample of size N can be assigned with good
accuracy to intervals on the θ -continuum on the basis of their estimated value of θ . For this pur-
pose, we use the EAP estimate with whatever prior is assumed for item calibration. The esti-
mated θ ‘s are rescaled so that the variance of the sample distribution equals that of the latent
distribution on which the MML estimation of the item parameters is based. The number of re-
spondents in each interval who respond correctly to item j can be tallied from their item scores.
Finally, a likelihood ratio χ 2 statistic may be used to compare the resulting frequencies of cor-
rect and incorrect responses in the intervals with those expected from the fitted model at the in-
terval mean, θ h :
$$G_j^2 = 2 \sum_{h=1}^{n_g} \left\{ r_{hj} \log_e \frac{r_{hj}}{N_h\, P_j(\theta_h)} + (N_h - r_{hj}) \log_e \frac{N_h - r_{hj}}{N_h\,[1 - P_j(\theta_h)]} \right\}, \qquad (8.14)$$
where ng is the number of intervals, rhj is the observed frequency of correct response to item j in
interval h, N h is the number of respondents assigned to that interval, and Pj (θ h ) is the value of
the fitted response function for item j at θ h , the average ability of respondents in interval h.
Because neither the MML nor the MMAP method of fitting the response functions actually
minimizes this χ 2 , the residuals are not under linear constraints and there is no loss of degrees of
freedom due to the fitting of the item parameters. The number of degrees of freedom is therefore
equal to the number of intervals remaining after neighboring intervals are merged if necessary to
avoid expected values less than 5.
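Given the interval tallies, (8.14) is a few lines of arithmetic. A sketch, assuming neighboring intervals have already been merged so that no cell is empty (all names and numbers are hypothetical):

```python
import numpy as np

def item_fit_g2(r_hj, N_h, P_hj):
    """Likelihood-ratio item-fit statistic of equation (8.14).

    r_hj -- correct responses to item j in each theta interval h
    N_h  -- respondents assigned to each interval
    P_hj -- fitted P_j at the interval mean theta_h
    """
    r_hj, N_h, P_hj = map(np.asarray, (r_hj, N_h, P_hj))
    w_hj = N_h - r_hj
    terms = (r_hj * np.log(r_hj / (N_h * P_hj))
             + w_hj * np.log(w_hj / (N_h * (1.0 - P_hj))))
    return 2.0 * terms.sum()

# Example with five hypothetical intervals
g2_j = item_fit_g2([12, 30, 55, 78, 90], [50, 60, 80, 90, 95],
                   [0.22, 0.48, 0.70, 0.85, 0.93])
```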
To diagnose cases of poor fit, one can inspect a plot of $r_{hj}/N_h$ compared to $P_j(\theta_h)$. Ninety-five percent tolerance intervals on these points are
$$\pm 2 \sqrt{P_j(\theta_h)\,[1 - P_j(\theta_h)] / N_h}.$$
When the number of items is small, the standardized posterior deviates should be plotted instead.
Unlike classical test theory, IRT does not in general base the estimate of the respondent’s ability
(or other attribute) on the number-correct score. The only exception is the one-parameter logistic
model, in which the estimate is a non-linear function of that score. To distinguish IRT scores
from their classical counterparts, we refer to them as “scale” scores.
Scale scores have several advantages over number-correct scores. They:

remain comparable when items are added to or deleted from the tests,
weight the individual items optimally according to their discriminating powers,
have more accurate standard errors,
provide more flexible and robust adjustments for guessing than the classical corrections, and
are on the same continuum as the item locations.

There are three types of IRT scale score estimation methods now in general use: maximum likelihood (ML), Bayes (EAP), and Bayes modal (MAP) estimation. They are discussed in the sections to follow.
Maximum likelihood estimation
The maximum likelihood (ML) estimate of the scale score of respondent i is the value of θ that
maximizes
$$\log L_i(\theta) = \sum_{j=1}^{n} \left\{ x_{ij} \log_e P_j(\theta) + (1 - x_{ij}) \log_e [1 - P_j(\theta)] \right\}, \qquad (8.15)$$
the log likelihood of the response pattern. The maximum is located by Fisher scoring, using the test information
$$I(\theta) = \sum_{j=1}^{n} a_j^2\, P_j(\theta)\,[1 - P_j(\theta)], \qquad (8.16)$$
in the case of the two-parameter logistic model. Similar formulas are available for the other
models. The iterations of the Fisher-scoring solution are
$$\theta_{t+1} = \theta_t + I^{-1}(\theta_t)\, \frac{\partial \log L_i(\theta)}{\partial \theta}\bigg|_{\theta_t}.$$
The standard error of the ML estimator is the reciprocal of the square root of the information at $\hat{\theta}$:
$$\mathrm{S.E.}(\hat{\theta}) = 1 \big/ \sqrt{I(\hat{\theta})}. \qquad (8.17)$$
Unlike the classical standard error of measurement, which is a constant, the IRT standard error
varies across the scale-score continuum. It is typically smaller towards the center of the scale
where more items are located and larger at the extremes where there are fewer items. A disad-
vantage of the ML estimate is that it is not defined for the response patterns in which all items
are correct or all items are incorrect (and occasionally for other unfavorable patterns near the
chance level when the three-parameter model is used). These problems do not arise in the other
two methods of estimation.
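A minimal Fisher-scoring sketch for the 2PL, following (8.15)–(8.17) in the logistic metric (D = 1); the response pattern and parameters are hypothetical, and the pattern must be mixed (not all correct or all incorrect) for the MLE to exist:

```python
import numpy as np

def ml_score_2pl(x, a, b, theta0=0.0, tol=1e-6, max_iter=50):
    """Fisher-scoring ML estimate of theta for the 2PL (equations 8.15-8.17).
    Assumes a mixed response pattern."""
    x, a, b = map(np.asarray, (x, a, b))
    theta = theta0
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
        grad = np.sum(a * (x - p))              # d log L / d theta
        info = np.sum(a**2 * p * (1.0 - p))     # I(theta), equation (8.16)
        step = grad / info
        theta += step
        if abs(step) < tol:
            break
    se = 1.0 / np.sqrt(info)                    # equation (8.17)
    return theta, se

theta_hat, se = ml_score_2pl([1, 1, 0, 1, 0],
                             a=[1.2, 0.8, 1.5, 1.0, 0.9],
                             b=[-1.0, -0.5, 0.0, 0.5, 1.0])
```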
Bayes estimation
The Bayes estimate is the mean of the posterior distribution of θ , given the observed response
pattern xi (Bock & Mislevy, 1982). It can be approximated as accurately as required by the
Gaussian quadrature (see the section on MML estimation):
$$\bar{\theta}_i \cong \frac{\sum_{k=1}^{q} X_k\, P(\mathbf{x}_i \mid X_k)\, A(X_k)}{\sum_{k=1}^{q} P(\mathbf{x}_i \mid X_k)\, A(X_k)}.$$
This function of the response pattern xi has also been called the expected a posteriori (EAP) es-
timator. A measure of its precision is the posterior standard deviation (PSD) approximated by
$$\mathrm{PSD}(\bar{\theta}_i) \cong \left[ \frac{\sum_{k=1}^{q} (X_k - \bar{\theta}_i)^2\, P(\mathbf{x}_i \mid X_k)\, A(X_k)}{\sum_{k=1}^{q} P(\mathbf{x}_i \mid X_k)\, A(X_k)} \right]^{1/2}.$$
The EAP estimator exists for any answer pattern and has a smaller average error in the popula-
tion than any other estimator, including the ML estimator. It is in general biased toward the
population mean, but the bias is small within ±3 σ of the mean when the PSD is small (e.g., less
than 0.2 σ ). Although the sample mean of EAP estimates is an unbiased estimator of the mean of
the latent population, the sample standard deviation is in general smaller than that of the latent
population. This is not a serious problem if all the respondents are measured within the same
PSD. But it could be a problem if respondents are compared using alternative test forms that
have much different PSDs. The same problem occurs, of course, when number-right scores from
alternative test forms with differing reliabilities are used to compare respondents. Tests adminis-
trators should avoid making comparisons between respondents who have taken alternative forms
that differed appreciably in their psychometric properties. A further implication is that, if EAP
estimates are used in computerized adaptive testing, the trials should not terminate after a fixed
number of items, but should continue until a prespecified PSD is reached.
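The EAP estimate and its PSD reduce to weighted sums over the quadrature points. A sketch for binary 2PL items under an N[0,1] prior (the items and responses are hypothetical):

```python
import numpy as np

def eap_score(x, a, b, n_points=41):
    """EAP estimate and PSD by quadrature over an N(0,1) prior (2PL sketch)."""
    x, a, b = map(np.asarray, (x, a, b))
    Xk = np.linspace(-4, 4, n_points)
    Ak = np.exp(-0.5 * Xk**2); Ak /= Ak.sum()    # normalized prior weights
    # Likelihood of the pattern at each quadrature point
    P = 1.0 / (1.0 + np.exp(-a[None, :] * (Xk[:, None] - b[None, :])))
    L = np.prod(np.where(x == 1, P, 1.0 - P), axis=1)
    post = L * Ak                                 # unnormalized posterior
    theta_eap = np.sum(Xk * post) / np.sum(post)
    psd = np.sqrt(np.sum((Xk - theta_eap)**2 * post) / np.sum(post))
    return theta_eap, psd

theta_eap, psd = eap_score([1, 0, 1, 1, 0],
                           a=[1.0, 1.2, 0.8, 1.5, 0.9],
                           b=[-1.0, -0.5, 0.0, 0.5, 1.0])
```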
Similar to the Bayes estimator, but with a somewhat larger average error, is the Bayes modal or
so-called maximum a posteriori (MAP) estimator. It is the value of θ that maximizes
$$\log P(\theta \mid \mathbf{x}_i) = \sum_{j=1}^{n} \left\{ x_{ij} \log_e P_j(\theta) + (1 - x_{ij}) \log_e [1 - P_j(\theta)] \right\} + \log_e g(\theta),$$
Analogous to the maximum likelihood estimate, the MAP estimate is calculated by Fisher scor-
ing, employing the posterior information,
$$J(\theta) = I(\theta) - \frac{\partial^2 \log_e g(\theta)}{\partial \theta^2},$$
where the right-most term is the second derivative of the log of the population density of $\theta$ (for a normal density this derivative is negative, so the prior adds information).
In the case of the 2PL model and a normal distribution of θ with variance σ 2 , the posterior in-
formation is
$$J(\theta) = \sum_{j=1}^{n} a_j^2\, P_j(\theta)\,[1 - P_j(\theta)] + \frac{1}{\sigma^2}.$$
The corresponding posterior standard deviation is
$$\mathrm{PSD}(\hat{\theta}) = 1 \big/ \sqrt{J(\hat{\theta})}.$$
Like the EAP estimator, the MAP estimator exists for all response patterns, but is generally bi-
ased toward the population mean.
According to classical theory, the standard error of measurement (SEM) is a function only of the
test reliability and the variance of the score distribution. But this is an oversimplification. Actu-
ally, the error standard deviation of a score on a test of finite length—whether the classical num-
ber-right score or an IRT scale score—also depends upon the level of the score itself.
When the maximum likelihood estimator is used to obtain an IRT scale score, the SEMs of the
three logistic models expressed in the normal metric are as follows:
1PL:
$$\mathrm{S.E.}_{(1)}(\theta) = 1 \Big/ \left\{ D^2 a^2 \sum_{j=1}^{n} P_{(1)j}(\theta)\,[1 - P_{(1)j}(\theta)] \right\}^{1/2} \qquad (8.18)$$
2PL:
$$\mathrm{S.E.}_{(2)}(\theta) = 1 \Big/ \left\{ D^2 \sum_{j=1}^{n} a_j^2\, P_{(2)j}(\theta)\,[1 - P_{(2)j}(\theta)] \right\}^{1/2} \qquad (8.19)$$
3PL:
$$\mathrm{S.E.}_{(3)}(\theta) = 1 \Big/ \left\{ D^2 \sum_{j=1}^{n} a_j^2 \cdot \frac{1 - P_{(3)j}(\theta)}{P_{(3)j}(\theta)} \cdot \left[ \frac{P_{(3)j}(\theta) - g_j}{1 - g_j} \right]^2 \right\}^{1/2} \qquad (8.20)$$
Although these formulas are more realistic than the classical standard error of measurement, they
are nevertheless approximations. Strictly speaking, they are exact only as the number of items
becomes indefinitely large. But in general, they are good approximations for tests with as few as
ten or twenty items. Although they neglect the errors due to estimating the item parameters, these
errors are inconsequential if the calibration sample is large.
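For instance, the 3PL standard error (8.20) can be evaluated at any θ as follows; the item parameters are hypothetical, and D = 1.7 is the usual normal-metric adjustment:

```python
import numpy as np

def se_3pl(theta, a, b, g, D=1.7):
    """Standard error of the ML scale score under the 3PL, equation (8.20)."""
    a, b, g = map(np.asarray, (a, b, g))
    P = g + (1.0 - g) / (1.0 + np.exp(-D * a * (theta - b)))
    info = np.sum(D**2 * a**2 * ((1.0 - P) / P) * ((P - g) / (1.0 - g))**2)
    return 1.0 / np.sqrt(info)

# Hypothetical three-item example evaluated at theta = 0
print(se_3pl(0.0, a=[1.0, 1.2, 0.8], b=[-0.5, 0.0, 0.5], g=[0.2, 0.25, 0.2]))
```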
Because the terms that are summed in the information functions (8.18), (8.19), and (8.20) can be
regarded as the information functions of the items, they show how the SEM depends upon the
item slopes, locations and lower asymptotes. By plotting the item information functions of the
items against the test information, the test constructor can see which items are contributing most
to increasing the test information in relevant regions of the scale, and thus to decreasing the
SEM. The plots show where additional items are needed to improve the precision of measure-
ment locally. Generally, the aim is to produce a test information function that is high and flat over the
range of θ in which accurate measurement is required.
It is evident in the information functions for the logistic models that, as Pj (θ ) goes to 1 or 0 (or
to g j for the 3PL model), the information goes to zero and the standard error to infinity. Thus,
the ML estimator is effective only over a finite range. As a result, it is necessary to set some
limit, perhaps ±5 standard deviations of the latent distribution, as upper and lower bounds of θ .
The posterior information for the Bayes modal (MAP) estimator has properties similar to those
of the Fisher information of the ML estimator except that, when the prior is suitably chosen (e.g.,
normal) the posterior information does not go to zero as θ becomes extreme. Rather, for a nor-
mal prior, the posterior information goes to 1/ σ 2 , and the SEM goes to the population standard
deviation, σ , which means that nothing is known about θ except that it is very large or very
small, depending on the sign of θ .
The squared inverse posterior standard deviation (PSD) of the Bayes (EAP) estimator does not
have the convenient additive property of the Fisher and posterior information. But because of the
equivalence of the EAP and MAP estimators as the number of items becomes large, ML infor-
mation analysis of items can be applied to the EAP estimation for most practical purposes of test
construction.
Guessing in response to multiple-choice items has a deleterious effect on any estimator of ability,
classic or IRT. For the three-parameter model, the average effect of guessing, and thus the size of
the asymptote parameter, g j , can be reduced by instructing the examinees to omit the item rather
than make a blind guess. But when the three-parameter model is used in scoring, it does not dis-
tinguish between those examinees who omit and those who ignore the instructions, and guess.
Two methods of improving the accuracy of scale score estimation in the presence of mixed omit-
ting and guessing have been proposed. One method is to assign to the omitted responses a prob-
ability equal to the asymptote parameter, g j , or to 1/A, where A is the number of alternatives of
the multiple-choice item (Lord, 1980, p. 229). In effect, the omitted responses are replaced by
guessed responses and scored fractionally correct.
The other method is to score omits as incorrect, but suppress the effects of guessing by giving
reduced weight to unlikely correct responses in the response pattern. A technique of robust data
analysis, called “biweighting”, has been proposed for this purpose (Mislevy & Bock, 1982).
Simulation studies have shown that such robustifying procedures improve the accuracy of esti-
mating ability in the presence of chance successes in response to multiple-choice items.
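One way to implement Lord's fractional-credit treatment of omits in a pattern likelihood is sketched below; the coding of omits as −1 and all parameter values are assumptions of the example, not conventions of the programs.

```python
import numpy as np

def loglik_with_omits(resp, a, b, g, theta, n_alternatives=4, D=1.7):
    """3PL pattern log-likelihood with Lord's treatment of omits:
    an omitted response (coded -1 here) is scored fractionally correct
    with probability 1/A, where A is the number of alternatives."""
    resp, a, b, g = map(np.asarray, (resp, a, b, g))
    P = g + (1.0 - g) / (1.0 + np.exp(-D * a * (theta - b)))
    frac = 1.0 / n_alternatives
    ll = 0.0
    for x, p in zip(resp, P):
        if x == -1:                       # omit: fractional credit
            ll += frac * np.log(p) + (1.0 - frac) * np.log(1.0 - p)
        else:                             # ordinary 0/1 scoring
            ll += x * np.log(p) + (1 - x) * np.log(1.0 - p)
    return ll
```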
In some forms of educational assessment, scores are required for groups of students (schools, for example) rather than for individual students (Mislevy, 1983). In these appli-
cations, IRT scale scores for the groups can be estimated directly from matrix sampling data if
the following conditions are met:
The assessment instrument consists of 15 or more randomly parallel forms, each of which contains exactly one item from each content element to be measured.
The forms are assigned in rotation to students in the groups being assessed and adminis-
tered under identical conditions.
Under these conditions, it may be reasonable to assume that the ability measured by each scale is
normally distributed within the groups. In that case, the proportion of students in the groups who
respond correctly to each item of a scaled element will be well approximated by a logistic model
in which the ability parameter, θ , is the mean ability of the group. Because each item of the ele-
ment appears on a different form, these responses will be experimentally independent.
An aggregate-level IRT model can therefore be used to analyze data for the groups summarized
as the number of attempted responses, N hj , and the number of correct responses, rhj , to item j in
group h. The probability of these response frequencies for the n items of the element, given the
mean ability of the group, θ h , is then
$$P(\mathbf{r}_h \mid \mathbf{N}_h, \theta_h) = \prod_{j=1}^{n} \frac{N_{hj}!}{(N_{hj} - r_{hj})!\; r_{hj}!}\; [\Psi_j(\theta_h)]^{r_{hj}}\,[1 - \Psi_j(\theta_h)]^{N_{hj} - r_{hj}}. \qquad (8.21)$$
Substituting this probability for the pattern probability (8.1), we can carry out MML estimation of item parameters for the aggregate-level IRT model in the
same manner as estimation for the individual-level model. Scale scoring of the pattern of fre-
quencies of attempts and correct is performed by a similar substitution in (8.15), (8.16) or (8.17).
All other aspects of the IRT analysis are unchanged.
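The log of (8.21) for a single group is a sum of binomial log probabilities. A minimal sketch (all counts and probabilities are invented):

```python
from math import comb, log

def aggregate_loglik(r_hj, N_hj, psi):
    """Log of the binomial pattern probability (8.21) for one group.

    r_hj -- correct responses to each of the n items of the element
    N_hj -- attempts at each item
    psi  -- model probabilities Psi_j(theta_h) at the group mean ability
    """
    ll = 0.0
    for r, N, p in zip(r_hj, N_hj, psi):
        ll += log(comb(N, r)) + r * log(p) + (N - r) * log(1.0 - p)
    return ll

# Hypothetical three-item element for one group
ll = aggregate_loglik([18, 22, 15], [30, 30, 30], [0.6, 0.7, 0.5])
```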
Unlike the individual-level analysis, the aggregate-level permits a rigorous test of fit of the re-
sponse pattern for the group. Because the response frequencies for the items of a scaled element
are binomially distributed and independent, a likelihood ratio or Pearsonian χ 2 test statistic may
be computed to test the fit of the model within each group.
The starting values computed in the input phase and used in item parameter estimation in the
calibration phase in BILOG-MG are generally too high for aggregate-level models. The user
should reduce these values by substituting other starting values in the TEST command.
PARSCALE estimates the parameters of the response models by marginal maximum likelihood
assuming either a normal or empirically estimated latent distribution with mean zero and stan-
dard deviation one (see Muraki, 1990). The EM algorithm is used in the solution of the likeli-
hood equations starting from the initial values described previously. The current version includes
the Newton cycles used in BILOG-MG to improve the EM results.
Because of the potentially wide spacing of category boundary locations on the latent dimension,
it is advisable to use a greater number of quadrature points than in BILOG-MG. Thirty points is
the default. [This section was contributed by Eiji Muraki.] Simulation studies show that with smaller numbers of points the item slopes are in-
creasingly underestimated. The effect tends to be proportional, however, and is hardly apparent
in the test scores when they are rescaled to an assigned standard deviation in the sample.
Despite the greater number of parameters in the multiple-category models as opposed to the bi-
nary, the greater information in the data allows stable estimation in similarly sized samples.
Sample sizes around 250 are marginally acceptable in research applications, but 500 or 1000
should be required in operational use. Beyond 1000, the additional precision may not justify the
additional computing time.
For a slope parameter, we assume that the natural logarithm of the parameter, ln(a j ) , is distrib-
uted as N ( µ a ,σ a2 ) :
$$f(a_j) = \frac{1}{\sigma_a (2\pi)^{1/2}} \exp\!\left[ \frac{-(\ln a_j - \mu_a)^2}{2\sigma_a^2} \right].$$
A similar form is assumed for the location parameter:
$$f(b_j) = \frac{1}{\sigma_b (2\pi)^{1/2}} \exp\!\left[ \frac{-(\ln b_j - \mu_b)^2}{2\sigma_b^2} \right].$$
For the lower asymptote parameter, a beta prior is used:
$$f(g_j) = \frac{g_j^{\alpha - 1}\,(1 - g_j)^{\beta - 1}}{B(\alpha, \beta)}.$$
The graded response model and the partial credit model contain the element
$$z_{jk}(\theta) = D a_j (\theta - b_{jk}) = D a_j (\theta - b_j + c_k).$$
Suppose the latent scale is linearly transformed to
$$\theta^* = A\theta - B,$$
where $\theta \sim N(m, s^2)$ and $\theta^* \sim N(m^*, s^{*2})$; then $A = s^*/s$ and $B = Am - m^*$. The element $z_{jk}(\theta^*)$ on the new $\theta^*$ scale is then
$$z_{jk}(\theta^*) = D a_j^* (\theta^* - b_{jk}^*) = D a_j^* A \left( \theta - \frac{b_{jk}^* + B}{A} \right),$$
where D is the adjustment for a normal metric. We then obtain the following relations:
$$a_j^* = \frac{a_j}{A}$$
and
$$b_{jk}^* = A\, b_{jk} - B.$$
The category parameters are centered so that
$$\sum_{k=0}^{m_j} c_k = \sum_{k=0}^{m_j} c_k^* = 0.$$
Therefore, the location shift, B, is absorbed by the item location parameter, b j . Consequently, we
obtain b*j = Ab j − B and ck* = Ack .
The item information function, I j (θ ) , is the information contributed by a specific item j. The
item information for the polytomous item response model as proposed by Samejima (1974) is
$$I_j(\theta) = \sum_{k=0}^{m_j} A_{jk}(\theta) = \sum_{k=0}^{m_j} \frac{\left[ \dfrac{\partial}{\partial \theta} P_{jk}(\theta) \right]^2}{P_{jk}(\theta)}.$$
For the normal ogive form of the graded response model, the basic function A jk (θ ) is written as
$$A_{jk}(\theta) = \frac{D^2 a_j^2\, [\varphi_{jk}(\theta) - \varphi_{j,k+1}(\theta)]^2}{P_{jk}(\theta)},$$
where $\varphi_{jk}(\theta)$ is the normal ordinate corresponding to $P_{jk}^{+}(\theta)$. For the logistic form of the model, the item information function becomes
$$I_j(\theta) = D^2 a_j^2 \left[ \sum_{c=0}^{m_j} T_c^2\, P_{jc}(\theta) - \left( \sum_{c=0}^{m_j} T_c\, P_{jc}(\theta) \right)^{2} \right].$$
For a binary item, this reduces to
$$I_j(\theta) = D^2 a_j^2\, (T_0 - T_1)^2\, P_{j0}(\theta)\, P_{j1}(\theta),$$
where $P_{j1}(\theta) = 1 - P_{j0}(\theta)$.
Bock (1972) proposed the information due to the response in category k of item j as the partition
of the item information, that is,
$$I_{jk}(\theta) = P_{jk}(\theta)\, I_j(\theta).$$
This result may be called the item’s response information function, according to Samejima’s
term, although she formulated it slightly differently.
The item information function may also be expressed by the summation of the response infor-
mation functions:
$$I_j(\theta) = \sum_{k=0}^{m_j} I_{jk}(\theta).$$
Finally, the test information function is defined as the summation of item information functions:
$$I(\theta) = \sum_{j=1}^{n} I_j(\theta).$$
Warm’s (1989) weighted maximum likelihood (WML) estimator is obtained by maximizing the
likelihood weighted by a square root of the test information function. The likelihood of a par-
ticular response vector ( U jk ) given θ is
$$L^*[(U_{jk}) \mid \theta] = f(\theta)\, L[(U_{jk}) \mid \theta],$$
so that
$$\ln L^*[(U_{jk}) \mid \theta] = \ln f(\theta) + \ln L[(U_{jk}) \mid \theta].$$
A weighted maximum likelihood estimator WML( θ ) is the value that maximizes the weighted
likelihood above. If f (θ ) is a positive constant, WML( θ ) is a maximum likelihood estimate of
θ . If f (θ ) is a square root of the test information function I (θ ) , it is called Warm’s weighted
maximum likelihood estimate, WML( θ ). This is not a Bayesian estimator of a latent trait
since f (θ ) is not a prior probability, but a reciprocal of the standard error of MLE( θ ).
With Warm’s weight, the weighted log likelihood is
$$\ln L^*[(U_{jk}) \mid \theta] = \frac{1}{2} \ln \sum_{j=1}^{n} \sum_{k=0}^{m_j} P_{jk}(\theta)\, I_{jk}(\theta) + \sum_{j=1}^{n} \sum_{k=0}^{m_j} U_{ijk} \ln [P_{jk}(\theta)].$$
The Newton-Raphson technique is used to obtain MLE( θ ) or WML( θ ) via an iterative procedure. The Newton-Raphson estimation equation is
$$\hat{\theta}_{q+1} = \hat{\theta}_q - \left[ \frac{\partial^2 \ln L^*[(U_{jk}) \mid \theta]}{\partial \theta^2} \right]^{-1} \frac{\partial \ln L^*[(U_{jk}) \mid \theta]}{\partial \theta}.$$
The PARSCALE program can also compute EAP( θ ) scores in addition to MLE( θ ) and
WML( θ ).
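A simple grid-search sketch of Warm's estimator for binary 2PL items, where the test information has the closed form of (8.16); a production implementation would use the Newton-Raphson iterations above, and all values here are hypothetical:

```python
import numpy as np

def wml_score_2pl(x, a, b):
    """Grid-search sketch of Warm's WML estimate for binary 2PL items:
    maximize sqrt(I(theta)) * L(theta) rather than L(theta) alone."""
    x, a, b = map(np.asarray, (x, a, b))
    grid = np.linspace(-4, 4, 2001)
    P = 1.0 / (1.0 + np.exp(-a[None, :] * (grid[:, None] - b[None, :])))
    loglik = np.sum(np.where(x == 1, np.log(P), np.log(1.0 - P)), axis=1)
    info = np.sum(a[None, :]**2 * P * (1.0 - P), axis=1)   # test information
    weighted = 0.5 * np.log(info) + loglik                  # ln f + ln L
    return grid[np.argmax(weighted)]

print(wml_score_2pl([1, 1, 0, 1], a=[1.1, 0.9, 1.4, 0.7],
                    b=[-0.8, -0.2, 0.3, 0.9]))
```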
In the discussion in Section 3.1, we have assumed that the item parameters were known. Usually,
they are not. Current practice requires estimation of the item parameters using empirical data.
Usually it is desirable to use a sample sufficiently large that the standard errors of the estimated
item parameters are small, and can be ignored in future use of the parameters. Such a sample is
called a calibration sample.
The sample size required for useful item calibration varies widely, depending on the format of
the response and the strength of the relationship between the item responses and the trait. Constraints on the item parameters, in the form of equality constraints or prior information incorpo-
rated using Bayes' rule, facilitate estimation with relatively small samples. Large numbers of un-
constrained item parameters require relatively large samples. An example of a highly constrained
model is the Rasch (1960) 1PL model. Only a few hundred examinees may serve to calibrate a
test under this model. On the other hand, a model with three unconstrained parameters per item
(as in the 3PL: slope, location and “pseudo-guessing” parameters) may require tens of thousands
of examinees to calibrate successfully (Thissen and Wainer, 1982). The relationship of the item
responses to the trait is crucial: the parameters of items that are strongly related to the trait may be estimated precisely with few observations, while weakly related items may require more. No general guidelines can be given; in each case the standard errors of the estimated parameters must be examined to determine whether the precision of the estimation is satisfactory.
Item parameter calibration problems fall into two broad categories: those in which θ is consid-
ered fixed and those in which it is considered random. The random- θ problem is the situation
most commonly encountered in psychological measurement, but the fixed- θ case is simpler, so
we will discuss that first.
If θ is assumed to have fixed values, two more alternative conditions arise: the fixed values of θ
may also be taken to be known, or the values of θ may be taken to be unknown parameters. In
the former case, item parameter calibration is simply a problem in nonlinear regression of the
item responses on θ . An example of IRT item calibration in this sort of problem is provided in
the Roche, Wainer and Thissen (1975) measurement of skeletal age. The Roche et al. system
makes use of graded indicators of skeletal maturity that are observable on radiographs. Thus,
each indicator is like a test item, and skeletal age is estimated as was θ above. To calibrate the
indicators, or items, Roche et al. defined skeletal age to be linearly related to chronological age
in the population from birth to maturity. Then the parameters of the graded response model were
estimated by nonlinear regression of the observed indicator grades on the ages of those meas-
ured. This procedure depends on the existence of an observed variable linearly related to the trait
being measured, called a criterion.
Estimation is complicated somewhat if θ is taken to have fixed but unknown values for several
pre-defined groups of examinees. Bock (1983) considered such a problem in an item calibration
context very similar to the skeletal age problem, in which the items to be calibrated were 96
questions from the Stanford-Binet and the examinees were again classified by age. Each age
group was assumed to have some fixed mean developmental age ( θ ), which was estimated si-
multaneously with the item parameters. Bock (1976) and Kolakowski and Bock (1981) discuss
an algorithm for the simultaneous estimation of the fixed values of θ and the item parameters in
general, for any item response model. This procedure depends on the existence of a division of
the examinees into homogeneous groups with respect to the trait being measured. An example,
amounting to a very simple fixed effects, unknown- θ -calibration of a single “item”, is given in
Chapter 5.
When θ is assumed to be an unobserved random variable, the only fixed parameters to be esti-
mated are the item parameters, but their estimation is numerically complex. For categorical item
response models, Bock and Lieberman (1970) provided a theoretically satisfying but impractical
algorithm; Bock and Aitkin (1981) provided the workable algorithm used in MULTILOG. Its
workings are explained there, and by Thissen (1982) for the 1PL model and Thissen & Steinberg
(1984) for the multiple-choice model.
MULTILOG reports the closest approximation to reliability available in the context of IRT, the
so-called marginal reliability. This value is (effectively) an average reliability over levels of θ ;
it is an accurate characterization of the precision of measurement only if the test information is
relatively uniform.
9 USES OF ITEM RESPONSE THEORY
9.1 Introduction
The development of item response theory (IRT) has reached a point where testing applications,
whether in educational or psychological testing programs or in research, can be carried out en-
tirely with IRT methods. These methods have significant advantages over those of classical test
theory: they improve the quality of the tests and scales produced, handle a wider range of re-
sponse modes, facilitate test equating, allow adaptive test administration to reduce testing time,
and offer important economies in labor and cost of test construction and maintenance. The more
limited methods that grew from classical theory were strongly conditioned by the rudimentary
data processing capabilities available in the formative years from 1910 to 1950. The more flexi-
ble and efficient, but computationally intensive, IRT methods could not develop and find practi-
cal use until electronic computation became widely accessible. Although classical and IRT
methods now exist side-by-side in computer implemented form, IRT uses the power of com-
puters in more varied and effective ways.
For the benefit of readers who have studied and worked with classical methods prior to or along
with IRT, this chapter contrasts these approaches to item response data in various areas of appli-
cation. In terms of present uses of tests and scales, the following five areas perhaps cover most
possibilities:
Selection testing
Qualification testing
Program evaluation and assessment testing
Clinical testing
Measurement methods and research
These five areas are discussed in Sections 9.2 to 9.6. In Section 9.7 various approaches to analy-
sis of item response data are considered.
Selection tests are administered to persons competing for a limited number of positions in some
organization or program. Examples are college entrance examinations, employment tests, civil
service examinations, military enlistment tests, etc. A test as an aid to selection is valuable if it
predicts with acceptable accuracy some criterion of the person’s later performance. First-year
college grade-point average, success in a job training program, and on-the-job productivity are
typical examples of performance criteria. By suitable choice of item content and operating char-
acteristics, tests can be constructed to maximize correlation between the test scores and some
measure of the criterion. In most applications, prediction is further improved by use of multiple
regression procedures to combine the test score with other information about the applicant. [This section was contributed by R. Darrell Bock.] A major economic role of selection tests is to reduce losses incurred when a person selected for a
position proves untrainable or unable to perform work assignments satisfactorily.
Qualification test results are used in connection with education or training as evidence that a
person has attained an acceptable level of knowledge or skills. Examples are tests required in
school promotion or graduation, licensing examinations of persons going into professions such
as law or medicine, pre-service testing of public school teachers, etc. In these applications a
clear-cut criterion of later performance rarely exists. The rational justification of the test is that it
samples the domain of competence. The percent of items on the test that the examinee responds
to satisfactorily is assumed to estimate the percent of mastery of the domain, which must be suf-
ficiently high to “pass” the test. Because qualification tests often have high stakes for persons
taking them, they must be carefully constructed to ensure that they represent fairly the domain of
competence and give consistent results from one test form to another.
Evaluation tests are administered to persons in some program of education or training, not for
individual qualification, but to evaluate whether the program, or institution conducting the pro-
gram, is achieving its instructional goals. In the evaluation of schools or school systems, this test-
ing is now referred to as assessment. The objective of assessment is to stimulate and guide
change and improvements in instruction when they are needed. An important requirement of as-
sessment is, therefore, that it include measures of outcomes in all main areas of instruction; oth-
erwise, under pressure to obtain favorable results, schools may concentrate instruction on areas
that are tested at the expense of those that are not. Because assessment programs are often car-
ried out on a very large scale at state or national levels, and are intended to measure achievement
trends over a period of years, attention to the efficiency of the test forms to deliver accurate and
stable results is of the greatest importance.
Clinical tests in fields such as counseling psychology, pediatrics, and psychiatry help in the iden-
tification of learning difficulties and behavioral disorders. The Binet and Wechsler I.Q. scales are
well-known examples of tests administered to children to determine whether they are learning
and reasoning at the level expected for their chronological age. The Minnesota Multiphasic Per-
sonality Inventory (MMPI) is the leading self-report device for obtaining information about per-
sonal adjustment problems and neuroses. Clinical tests are administered and interpreted only by
qualified professionals, usually on a one-to-one basis with the client. Ideally, they produce a pro-
file of scores exhibiting patterns that aid diagnosis of the behavioral problem. Because these tests
are limited to controlled clinical settings, they are in little danger from overexposure or compro-
mise and can remain in the same form over a period of years.
Once a construct is identified and items representing it are in hand, the questions focus on the
measurement characteristics of the resulting test or scale:
can the full range of variation in the population of potential respondents be measured with
acceptable precision?
what is the measurement error variance at various points on the scale?
are scores obtained at different sites of application, or on different occasions in time, sta-
ble and consistent?
if ratings based on human judgments are involved in scoring, are results sufficiently re-
producible between judges or between teams of judges recruited and trained at different
sites and times?
Classical test theory, especially generalizability theory, answers these questions in an average
sense, while IRT test information analysis gives a more detailed account of measurement error
throughout the range of scores. In addition, IRT test equating facilitates the construction of paral-
lel test forms measuring the construct, and these in turn can serve as “multiple indicators” for
structural equation modeling to validate the construct through its relationships with external vari-
ables.
Once the test has been administered to persons in some population of interest and the item re-
sponses are in hand, certain analysis operations must be performed to put the information in the
data in usable form. Many testing programs and research organizations still perform these opera-
tions with procedures based entirely on classical test theory, others rely on a mixture of classical
and IRT methods, and a few others use IRT methods exclusively.
To give some idea of how day-to-day work of data analysis may change in a shift from classical
to IRT procedures, this section compares the two approaches in the areas of application detailed
previously. It also serves as a guide to the references for further reading at the end of the subsec-
tions. Although no two persons would likely agree on which or how many such aspects of data
analysis deserve attention, twelve topics frequently appearing in the current literature are dis-
cussed in the following sections. They are test scoring, test generalizability, item analysis, esti-
mating the population distribution, differential item functioning, forms equating, vertical equat-
ing, construct definition, analysis and scoring of rated responses, matrix sampling, estimating
domain scores, and adaptive testing.
A given of classical test theory is that the score on a test in which the responses are marked cor-
rect or incorrect is the number correct or percent correct. Minor variations are the score on a mul-
tiple-choice test corrected for guessing (number correct minus the number incorrect divided by the number of choice alternatives less one), or arbitrary scoring formulas in which
some items count for more than others. It is an interesting fact that the number correct score was
not part of the first rationally developed standardized test—the Binet-Simon Intelligence Scale,
first published in 1909. The test consisted of an age-graded series of tasks and questions pre-
sented successively to a child by a test administrator. The score on the test, called the child’s
“mental age”, corresponded to the highest of several age-graded items that the administrator
found the child could complete successfully. The child’s “I.Q.” was defined as this mental age
divided by chronological age. There is a sense then in which measurement methodology has
come full circle, for, except in some special cases, number-correct is not a summary statistic used
by IRT in computing the score of a person taking the test. IRT uses instead the person’s total pat-
tern of correct and incorrect responses to the test items to estimate a score on the construct scale.
The result is referred to as a “full information” estimate: it makes use of all information in the
answer pattern, not just that in the number-correct count. Finding this IRT “scale score” is much
like locating the test taker’s position on the Binet-Simon scale, except that the continuum on
which it is expressed is not an external variable, such as age, but is a construct inferred from the
internal consistency of item responses within the sample data. During the early period when IRT
was oriented primarily toward selection testing, this construct was called “ability”. Later, as
qualification and program evaluation became more prominent in testing, the term “proficiency”
was introduced. In other areas of application—consumer research, for example—“preference” or
“propensity” would be apposite. “Proficiency” is used in the present writing.
The item response models have so-called threshold parameters that are related to the difficulty of the item
and determine where the item is located on the inferred scale. They also have slope parameters,
related to the discriminating power of the items, that determine how much each will influence
estimation of the proficiency scores. The score on the test is that point on the scale where, when
the person’s score is substituted in the item response models, the person’s pattern of correct and
incorrect responses is best accounted for. Scores determined in this way can be represented,
along with the item thresholds, on the proficiency scale in the same manner that mental ages and
age-graded items appear on the Binet intelligence scale.
The IRT method of extracting information about the person from the item responses, although
more intricate than the simple number-correct count, has several important advantages in testing
practice. First, the person’s scale score is little affected by adding or deleting items from the test.
The precision with which the scale point is located may change, but the meaning of the scale and
its units of measurement remain the same. This is not true of the number-correct score or even
the percent-correct score: they vary depending on the difficulties of the items added or removed;
if the average difficulties of the items change, the difficulty of the test as a whole is altered. This
does not happen with the IRT scale score because differences in item difficulty are accounted for
by the threshold parameters of the item response model.
Second, the IRT scale scores have smaller relative measurement error than number-right scores
because the influence of the items on the estimate is adjusted by the discriminating power pa-
rameters to minimize error. Finally, the IRT scale-score concept generalizes in a direct and con-
sistent way to other response modes—for example, extended responses to open-ended items
scored in graded categories by raters. Classical test theory has no comparable capability; it
merely resorts to arbitrary assignment of numerical values to the various grades and summing the
values to provide a score.
Apart from some remarks and references to examples in connection with item analysis and
analysis of rated responses, technical particulars of how item parameters and scale scores are es-
timated from item response data are beyond the scope of this section. Computer programs that
implement the IRT methods of estimation are described in the chapters to follow.
A fundamental concept of both classical test theory and IRT is that the items of a test are a sam-
ple from some larger domain of items or tasks, any of which might equally well have been pre-
sented to the test taker. A score from any such test therefore raises the essential question of sam-
pling—namely, how much error variation in the test score must be attributed to the sampling
process. In classical test theory, this question is posed in terms of the so-called true score model,
in which the observed test score is assumed to be the sum of a true score component and error
component. The two components are defined to be statistically independent, such that the vari-
ance of the test score in the population of persons to be tested equals the sum of the variances of
the components. These variances can be estimated in test data by giving correct responses a score
of 1 and incorrect responses a score of 0, and carrying out a person-by-items analysis of vari-
ance. On the assumptions underlying this model, the square root of the estimated variance of the
error component is the standard error of measurement of the test. It can be used, for example, to
place an approximate 95 percent confidence interval on the true score (i.e., the observed score
plus or minus 2 times the standard error). The variance estimates can also be used to calculate a
generalizability index for the test as the ratio of the true score variance to the sum of the true
score variance and the error variance. This index is variously referred to as coefficient α, Kuder-
Richardson reliability, or test reliability. It can be modified to predict the coefficient of gener-
alizability of a test in which the number of items sampled is increased n times merely by dividing
the error variance component in the ratio by that factor. The resulting formula is equivalent to the
Spearman-Brown prophecy formula of classical test theory.
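As a concrete illustration, the following Python sketch computes coefficient α from a persons-by-items matrix of 0/1 scores and applies the Spearman-Brown adjustment for an n-fold increase in the item sample; the function names are ours, and the textbook item-variance form is used in place of the analysis-of-variance route described above.

def coefficient_alpha(scores):
    # Coefficient alpha from a persons-by-items matrix of 0/1 scores.
    n_items = len(scores[0])
    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    item_vars = sum(variance([row[j] for row in scores]) for j in range(n_items))
    total_var = variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1.0 - item_vars / total_var)

def spearman_brown(alpha, n):
    # Predicted reliability when the item sample is increased n times;
    # equivalent to dividing the error variance component by n.
    return n * alpha / (1.0 + (n - 1) * alpha)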
A more penetrating treatment of the classical model, called strong true score theory, shows the
preceding results to be an oversimplification. The standard error of a test score is not constant
but varies with the value of the test score. IRT results take this into account by providing, not just
one error estimate, but an error function, computable from the item parameters, that yields an es-
timate of the error variance specific to every point on the scale score. This function typically
shows the test to have the highest precision in the region of the scale where the item locations are
most dense. For a test in which the greater part of the item set is in the middle range of difficulty,
the error function tends to be “U” shaped.
A related, very useful concept is that of the test information function, which is the reciprocal of
the error function. The information function shows the relative precision of the test at every point
on the scale. High values of the information correspond to high precision of the scale score and
low values to low precision. The important property of the test information function is that it is
the sum of corresponding information functions of the items. Item information functions depend
on both the item location and its discriminating power. The maximum of the function occurs for
the normal and logistic models, for example, at the location of the item threshold, and the height
of the function increases and decreases with item discriminating power. The test information thus
shows in quantitative detail how the measurement precision of a test can be adapted to a particu-
lar application by the placement of items of differing difficulty and discriminating power. A test
can be made highly informative in a narrow score range by concentrating items in that interval,
or made uniformly informative over a wide range by spacing items evenly over the range. Incor-
porating effects of both item thresholds and discriminating powers, plots of item information
functions play the same role in IRT that plots of item difficulty vs. part-whole correlation play in
classical test theory.
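Because the information functions accumulate additively, a sketch is short. The following Python fragment (names ours) computes item and test information for the two-parameter logistic model, for which the item information is a²P(1 − P) and is maximal at the item threshold.

import math

def item_information(theta, a, b):
    # Two-parameter logistic item information: a^2 * P * (1 - P),
    # maximal at theta = b and increasing with the slope a.
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def test_information(theta, slopes, thresholds):
    # The test information function is the sum of the item
    # information functions; its reciprocal estimates the error
    # variance of the scale score at theta.
    return sum(item_information(theta, a, b)
               for a, b in zip(slopes, thresholds))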
In classical test theory, the estimation of item difficulties, part-whole correlations and other char-
acteristics, such as distractor use in multiple-choice items, is referred to as item analysis. The
corresponding operations of IRT theory are called item calibration. For the normal ogive or lo-
gistic model calibration involves the estimation of the item thresholds and discriminating powers
from samples of item response data. If the test contains multiple-choice items, then a modifica-
tion of these models that accounts for chance successes may be used. These so-called three-
parameter normal and logistic models require estimation of item-specific probabilities of correct
response due to guessing in addition to the threshold and slope parameters. In most instances,
only the more difficult items of a test, with their greater frequency of wrong responses, require a
three-parameter model.
An important purpose of item analysis, both in classical and IRT methodology, is to check the
extent to which each item belonging to some larger set represents the construct that the set is in-
tended to measure. The classical item statistic that conveys this information is the part-whole
correlation (computed as the biserial or point biserial correlation between the item 0, 1 score and
the test score). Of course, this correlation succeeds in this role only if the preponderant majority
of the items in the test are validly construct-related. When this condition is satisfied and a small
minority of items depart from the construct or are in some way ambiguous, their part-whole cor-
relations will be low. The IRT statistic that functions in a similar way is the slope parameter of
the item response model—high slopes correspond to high part-whole correlations, and vice
versa. In fact, the slope statistics can be converted into a correlation index, similar to part-whole
correlation, that measures the relationship between the item response and the inferred construct.
An operational difference between classical and IRT procedures, however, is that during score
estimation the presence of a very low slope parameter will automatically nullify the influence of
the item, whereas the item must be specifically excluded from the classical number-correct score.
The other essential statistic of classical item analysis is item difficulty or, more accurately, item
facility—namely, the percent or proportion of correct responses to the item when the test is ad-
ministered to a sample of persons representing the relevant population. It is well known from
classical and IRT theory that an item is most informative about a particular person when that per-
son’s probability of responding correctly is in the neighborhood of one-half. Although this prob-
ability will differ considerably among persons in the population, it is advisable from the stand-
point of minimizing the average measurement error that test items be selected so that an appre-
ciable proportion of persons has an intermediate probability of correct response. Near zero or
100 percent chances of correct response across the population as a whole are of no help in meas-
urement. The IRT statistic that measures item difficulty is the item-threshold parameter, located
at or near the point on the scale where a person with that scale score will have probability one-
half of responding correctly. This parameter is also sometimes referred to as the item location. It
is not related in a simple way to the percent of persons in the population expected to respond cor-
rectly to the item, but if the origin or units of the IRT scale are chosen suitably, the threshold pa-
rameter conveys similar information. The appropriate scaling convention for this purpose is to
set the mean and standard deviation of the distribution of scale values in the sample data to 0
and 1, respectively. If, as is often the case, this distribution is approximately normal, the
item thresholds are on a scale in which their values correspond to normal deviates in the popula-
tion of persons.
At the core of any IRT item analysis is the algorithm for estimating parameters of the response
models from a sample—preferably a large sample—of data obtained by administering the test to
persons in some population of interest. Fitting models to such data is referred to as item calibra-
tion. The most general and robust procedures for this purpose, applicable to any well-identified,
twice-differentiable model, are based on the statistical techniques of maximum marginal likeli-
hood or Bayes estimation. These methods give a single best estimate of each parameter of each
item, and also an interval estimate indicating the effect of sampling variation. They also provide
for statistical tests of the improvement of fit to the data when additional parameters are included.
With the multiple-group IRT models discussed below, more general forms of these methods also
estimate the proficiency distributions of the populations corresponding to the groups.
As mentioned above, IRT theory gives more precision to item selection criteria by combining the
information in the item slopes and thresholds into item information functions that accumulate
additively to form the test information function. A plot of item and test information functions on
the same scale as the sample score distribution conveys clearly how the items or tests will per-
form in the population of interest. The same approach applies to a classical statistic pertaining to
multiple-choice items—namely, the percent of responses in the sample falling into each of the
alternatives of the multiple-choice item. In IRT, the nominal categories model gives the prob-
ability of the correct response and of each of the distractors as a function of the scale score. It is
easy to identify in these plots the distractors that are not functioning as desired in various regions
of the range. In addition, the analysis under this model shows the amount of construct-related in-
formation in the distractors as well as in the correct response. In many cases, plausible distractors
contain information that can improve the precision of estimated scale scores and can be recov-
ered by the IRT scoring procedure based on the model. This model is implemented in the
MULTILOG program.
For purposes of norming test results, it is necessary to estimate the distribution of test scores in
the population of interest. This presents a problem for classical test theory for two reasons. First,
the number-correct test scores contain both true score variation and measurement error variation;
since the measurement error variance is a function of test length, the variance of the score distri-
bution therefore depends on an arbitrary choice in test construction. Second, the shape of the test
score distribution depends arbitrarily upon the distribution of item difficulties in the test; tests
with severely skewed distributions of item difficulties will produce skewed distributions of test
scores in the population.
Classical test theory sidesteps these problems by expressing norms as population percentiles,
which are invariant with respect to the spread or shape of the score distribution. Further analysis
of the test scores by statistical methods that assume a normal distribution may still be affected,
however. IRT theory is more favorable in this respect in that the shape of the observed scale
score distribution is relatively little influenced by the distribution of item difficulties. If the true
score distribution is approximately normal, for example, the scale score distribution will be also.
The variance of the latter is still increased by measurement error, but as is also true of test scores,
the effect can be largely suppressed independent of test length by computing so-called “re-
gressed” or “shrunken” estimates as a function of test reliability. The Bayes (EAP) regressed and
Bayes modal (MAP) scores provided by the programs are regressed estimates.
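The classical regressed estimate is Kelley's formula; a one-function sketch, assuming the group mean and the test reliability are known. The Bayes EAP and MAP scores produced by the programs are the model-based counterparts of this shrinkage.

def regressed_estimate(observed, group_mean, reliability):
    # Classical "regressed" (shrunken) estimate: shrink the observed
    # score toward the group mean in proportion to the test's
    # unreliability (Kelley's formula).
    return group_mean + reliability * (observed - group_mean)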
IRT can handle this problem more rigorously, however, by estimating an inferred latent distribu-
tion of proficiency scores. The shape of the latent distribution is estimated directly from the pat-
terns of correct and incorrect responses to the test items and does not involve the test scores. If
there is only one sample group in the analysis, the location and dispersion of the latent distribu-
tion are indeterminate and must be set arbitrarily (e.g., to 0 and 1). If there are multiple sample
groups in the analysis, locations and dispersions of their latent distributions can be set relative to
a designated reference group or relative to arbitrarily set values in the combined groups. Multi-
ple-group analysis is implemented in the BILOG-MG and MULTILOG programs.
Almost any population of potential test takers will consist of identifiable subpopulations—
different age groups, the two sexes, urban or rural residents, education levels, ethnic and lan-
guage groups, etc. Relevant information on group membership may be available from back-
ground questionnaires administered along with the test. If so, the data will allow investigation of
whether persons in one such group experience differences in item difficulty or discriminating
power relative to those in other groups when all groups have equal mean scores on the test as a
whole. When this is the case, the test is said to exhibit differential item functioning (DIF). DIF is
essentially item by group interaction in item difficulty or discriminating power. If at the same
time, the groups show unequal mean test scores, the test is said to have adverse impact on the
groups that perform more poorly. Adverse impact can, of course, also occur in the absence of
DIF. Since DIF in effect alters the substantive meaning of the test score from one group to an-
other, it is undesirable and should be eliminated if possible. An English language vocabulary test
with words of Latin or Germanic origin, for example, will tend to show DIF with respect to Eng-
lish or Spanish as first language acquired. If only a few items of the test exhibit DIF, they usually
can be removed without impairing measurement of the intended construct.
The problem for the data analyst is how to detect DIF in tests that may also show adverse impact.
There are both classical and IRT approaches to this problem. The classical methods look for dif-
ferences in item difficulty among persons from different background groups whose test scores
are equal or fall in a narrow score interval. A summary statistic for these differences over the
scores or score intervals provides a measure of DIF; an associated statistical test establishes its
presence. A similar analysis, based on a log-linear model of item by group interaction, can be
carried out with the so-called Mantel-Haenszel statistic.
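For illustration, the Mantel-Haenszel computation can be sketched in Python; the 2×2 table layout and the function name are assumptions of this example, not features of any of the four programs, and strata with empty cells would need the usual continuity corrections, omitted here.

import math

def mantel_haenszel_delta(strata):
    # Each stratum (a score level or interval) contributes a 2x2 table
    # (a, b, c, d): reference-group right/wrong and focal-group
    # right/wrong counts.  The common odds ratio over strata is
    # reported on the ETS delta scale, on which 0 indicates no DIF.
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return -2.35 * math.log(num / den)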
The IRT treatment of DIF is an example of multiple-group analysis in which item thresholds or
discriminating power are estimated separately in each group and jointly with the group latent dis-
tributions, under the restriction that the means of the item thresholds must be equal in all groups.
The item guessing parameters, if any, are also restricted to be equal among groups. IRT estima-
tion of DIF effects includes standard errors that can be used to assess statistical significance of
effects for individual items. In addition, a test of DIF in all items jointly is provided by compari-
son of the goodness-of-fit of the response model when different thresholds or discriminating
power are assumed vs. the fit when a single set of thresholds or discriminating power is esti-
mated in the combined data. The IRT method of analyzing DIF is in general more sensitive than
its classical counterparts, especially with shorter tests, because IRT better defines the latent con-
struct measured by the test. DIF in item difficulty is implemented in BILOG-MG, PARSCALE
and MULTILOG. DIF in discriminating power is implemented in PARSCALE and
MULTILOG.
Many testing programs must update their test at regular intervals to prevent overexposure and
compromise of the item content. This creates the problem of keeping the reported scores for
successive forms comparable so that a person is neither advantaged nor disadvantaged by taking
one form rather than another. Somehow, the reported results must allow for the differences in
overall difficulty of the forms that inevitably occur when items are changed. Classical test theory
solves this problem by equivalent-groups equating. This method requires the alternative forms to
be assigned randomly to persons in some large sample. The randomization ensures that persons
taking different forms will have essentially the same true score distribution (provided that the
successor forms are of the same length as the preceding form and have similar distributions of
item difficulties and discriminating powers). If these conditions are met, the test scores for the
new forms can be expressed on the same scale as the old forms by assigning them to the corre-
sponding points of their respective observed score distributions. This is the equipercentile
method of keeping the score reports comparable to one another through successive generations
of test forms. If the item difficulties within the forms are more or less normally dis-
tributed and well centered for the population, the test score distributions will be approximately
normal. In that case, a nearly equivalent equating can be obtained merely by standardizing the
scores of the various forms—that is, by setting the mean and standard deviations of their respec-
tive distributions to any convenient fixed values. This method is called linear equating.
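Linear equating is a simple change of origin and unit; a sketch, with hypothetical argument names:

def linear_equate(score, form_mean, form_sd, target_mean, target_sd):
    # Linear equating: standardize the score within its own form,
    # then express it in the target scale's metric.
    return target_mean + target_sd * (score - form_mean) / form_sd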
Since IRT scale scores are much more likely to approximate normal distribution than number-
right scores, equipercentile equating is less needed in IRT applications. Linear equating suffices,
and it happens automatically if the origin and unit of the IRT scale are set so that the scale scores
have a specified mean and standard deviation in the sample. In addition, IRT is unique in allow-
ing the equating of forms administered to non-equivalent groups—i.e., groups with different true
score distributions. This type of equating requires, however, that the test forms share a certain
number of common “linking” items. Provided the linking items do not exhibit DIF with respect
to the groups, multiple-group IRT analysis of all forms together automatically produces a single
IRT scale on which the reported scores are comparable. The multiple-group procedure estimates
separate latent distributions of the groups jointly with the item parameters of all the forms. The
advantage of this method is that it does not require a separate administration of the forms to
some group of persons for purposes of equivalent groups equating. Forms can, for example, be
updated in the course of operational administrations of the test in which a certain proportion of
items from the previous year’s forms is carried over to the current year’s forms. A random sam-
ple of examinees from the previous year’s operational testing provides data for one of the groups,
and a similar sample from the current year provides data for the other. The resulting scale scores
are linearly equated to those of the previous year by setting the mean and standard deviation of
the latent distribution of the first group to its previous year’s values. Estimates of change in the
mean and standard deviation between years are a by-product of the equating. If desired, non-
equivalent groups equating can be carried back more than one year, provided linking items exist
between at least adjacent pairs of forms. Multiple-groups forms equating is implemented in
BILOG-MG and MULTILOG.
In school systems with a unified primary and secondary curriculum, there is often interest in
monitoring individual children’s growth in achievement from Kindergarten through eighth grade.
A number of test publishers have produced articulated series of tests covering this range for sub-
ject matter such as reading, mathematics, language skills, and, more recently, science. The tests
are scored on a single scale so that each child’s gains in these subjects can be measured. The ana-
lytical procedure for placing results from the grade-specific test forms on a common scale for
this purpose is referred to as vertical equating.
The most widely used classical method of vertical equating is the transformation of test scores
into so-called grade equivalents. In essence, the number-correct scores for each year are scaled
in such a way that the mean score for each grade group is equal to the numerical value of that grade,
zero through eight. This convention permits a child’s performance on any test in the series to be
described in language similar to that used with the Binet mental age scale. One may say of a
child whose reading score exceeds the grade mean, for example, that he or she is “reading above
grade level”.
For IRT, vertical equating is merely another application of non-equivalent groups equating in
which the children administered particular grade-specific tests correspond to the groups. As in
the equating of updated forms mentioned above, linking items between at least consecutive
forms in the series are required. They must be provided in each subject matter included in the
graded series. (Note that grade-equivalent scaling does not require linking items.)
The two methods produce quite different scales. Grade equivalents are of course linear in school
grade. They treat the average gain between first and second grade, for example, as if it were
equal to that between seventh and eighth. On this scale, the amount of variation between chil-
dren’s scores appears to increase as the cohort moves through the grades, and there is a corre-
sponding positive correlation between a child’s average score level over the years and the child’s
average gain. In other words, children who begin at a lower level appear to gain less overall than
those who begin at a higher level. This so-called “fan-spread effect” is regularly seen in all sub-
ject matters.
On an IRT vertically equated scale, average gains are generally greatest at the earlier grade levels
and decrease with increasing grade. Within grade, standard deviations are fairly uniform, and the
correlation between children’s average score levels and their gains is small, or even slightly
negative in some subject matters.
Unfortunately, there is no objective basis for deciding which of these scales better represents a
child’s true course of growth in knowledge and skills during the school years. Different IRT
models assuming other transformations of the proficiency scale could be made to fit the item re-
sponse data equally well and yet exhibit quite different relationships between grade
level and average score or average gain. Extrinsic considerations would have to be brought to
bear on the question to determine a preferred scale. For example, if one wished to compare an-
nual average gains in test performance of children in different classrooms when assignment to
classrooms is non-random, the scale that showed zero correlation between level and gain would
be most advantageous. IRT vertical equating comes much closer to this ideal than grade equiva-
lents, but might require some further transformation, possibly subject matter and site specific, to
attain complete independence of level and gain. (See Bock, Wolfe & Fisher, 1996, for a discus-
sion of this topic.)
The discussion up to this point assumes that all items in the test measure the same underlying
construct. When it is not clear that the item set is homogeneous in this sense, steps must be taken
to explore the construct dimensionality of the set. The classical approach to this problem is to
perform, in a large sample of test data, a multiple factor analysis of the matrix of tetrachoric cor-
relations between all pairs of items. The more familiar Pearson product-moment correlation of
item responses scored with different numerical values for correct and incorrect answers (the phi coefficient) is not
generally satisfactory for this purpose because variation in item difficulties introduces spurious
factors in the results. Random guessing on multiple-choice items has a similar effect that must
also be allowed for. Tetrachoric correlations with corrections for guessing are largely free of
these problems, but they have others of their own. One of these is computational instability that
appears when the correlations have large positive or negative values and the item difficulties are
very low or high; in these cases, it is often necessary to replace the correlation in the matrix with
an imputed default value. The other problem is that factor analysis of tetrachoric correlation
matrices almost always produces a certain number of small, unreal factors that are meaningless
and must be discarded.
IRT improves on this procedure by a method of full information item factor analysis that oper-
ates directly on the patterns of correct and incorrect responses without intervening computation
of correlation coefficients. In effect, this method fits a multidimensional item response model to
the patterns in the sample data. Full information item factor analysis is robust in the presence of
omitted or not-presented items and is free of the artifacts of the tetrachoric method. It also pro-
vides a statistical test of the number of factors that can be detected in the data.
The objective of both classical and IRT item factor analysis is the identification of items with
similar profiles of factor loadings—an indication that they arise from the same cognitive or af-
fective sources underlying the responses of persons taking the test. Objective methods of rotating
the factor structure, such as orthogonal varimax rotation and non-orthogonal promax rotation, are
especially effective in picking out clusters of items that identify these implicit constructs. The
presence of significant multiple factors in the data means that there are corresponding dimen-
sions of variation in the population of persons. In some cases, actual subgroups in the population
associated with particular factors can be identified by including demographic variables in the
analysis. Alternatively, they may be found by conventional multiple regression analysis of factor
scores for the persons, which are also provided by IRT full information item factor analysis. Full
information item factor analysis is implemented in the TESTFACT program.
When tests contain items or exercises that cannot be scored mechanically, the responses are often
rated on a graded scale that indicates quality or degree of correctness. For individually adminis-
tered intelligence tests the grading is done by the test administrator at the time the response is
recorded. For group administered open-ended exercises and essay questions, written responses
are graded later by trained raters. In both cases, the additional information conveyed, beyond that
provided by correct-incorrect scoring, provides better justification of the considerable cost of
graded scoring.
In addition to problems that may arise in preparing the rating protocol and training the raters for
graded scoring, analysis of the resulting data presents other difficulties not encountered with cor-
rect-incorrect scoring. How to combine the ratings into an overall score in a rational way is not at
all clear in classical test theory—especially so if the test also includes multiple-choice items. The
classical approach never goes much beyond mere assignment of arbitrary numerical values to the
scale categories and summing these values to obtain the test score. The arbitrariness of this
method, and the fact that items with different numbers of rating categories receive different
weights in the sum, have always proved troublesome.
In this respect, IRT methods are a very considerable advance. Item response models now exist
that express the probability of a response falling in a given graded category as a function of 1) the
respondent’s position on the IRT scale, 2) parameters for the spacing of the categories, and 3) the
difficulty and discriminating power of the item. Models for items with different numbers of rat-
ing categories and models for dichotomously scored responses can be mixed in any order when
analyzing items or scoring tests; arbitrary assignments of score points are not required. The IRT
test scoring based on these models makes use of the information in the pattern of ratings in a way
that is internally consistent in the data and minimizes measurement error. The IRT approach to
graded data allows tests to have more interesting and varied item formats and makes them acces-
sible to IRT methods of construction and forms equating. Provision for graded scores is included
in PARSCALE and MULTILOG.
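To fix ideas, the sketch below gives graded-category probabilities in the style of Samejima’s graded model: the probability of responding in category k or higher follows a two-parameter logistic curve at threshold b_k, and the category probabilities are differences of adjacent boundary curves. This is a schematic illustration, not the exact parameterization used in PARSCALE or MULTILOG.

import math

def graded_category_probs(theta, a, category_thresholds):
    # Thresholds must be in increasing order.  The boundary curves
    # decrease with the threshold, so the category probabilities are
    # nonnegative and sum to one.
    def boundary(b):
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))
    bounds = [1.0] + [boundary(b) for b in category_thresholds] + [0.0]
    return [bounds[k] - bounds[k + 1] for k in range(len(bounds) - 1)]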
Testing at the state and national level plays a part not only in counseling or qualification of indi-
vidual students, but also in evaluating the effectiveness of instructional programs, schools, or
school systems. The objective is to compare instructional programs and schools with respect to
their strength and weaknesses in promoting student achievement in various categories of the cur-
riculum. Testing used in this way is referred to as assessment to distinguish it from student-
oriented achievement testing. Educational assessment is typically carried out in large-scale sur-
veys, often on a sampling basis rather than a total census of schools and students. The sampling
approach consists of drawing a probability sample of schools and, within these schools, testing a
random sample of students. To minimize the burden on schools and students alike there is an at-
tempt to test as many curricular categories as possible in a limited amount of time, usually one
class period. This is accomplished by assigning randomly to the selected students one of 20 or 30
different test forms each containing only a small number of the items representing a category.
Usually the categories are main topic areas within subject matters. The total sampling design can
be laid out as a table in which the rows correspond to schools and students tested and the col-
umns correspond to items sampled for the test forms and the categories within forms. This ar-
rangement is referred to as a matrix sample.
A problem with average percent-correct reporting occurs, however, if the assessment aims at
monitoring trends in average achievement over successive years. When the time comes to update
the items of the assessment instrument, new items substituted for old inevitably introduce
changes in average scores at higher levels of aggregation—changes which may be larger than the
expected differences between years, programs, or schools. Although scores on the successive
instruments can be made comparable by equivalent groups equating, very large sample groups
are required to bring the equating errors below the size of the smallest difference that would have
policy implications.
IRT nonequivalent groups equating, which can be done in the full operational samples, is much
more cost effective in this situation. It requires only that a certain proportion of items from the
previous assessment be carried over into the update to serve as links between successive forms.
Typically, one-third of the items are retained as links. A large random sample of cases from the
two assessments is then analyzed in a multiple-group IRT calibration that estimates the latent
distributions for the two assessment samples jointly with the new set of item parameters. The
link items serve to set the origin and unit of scale equal to those of the previous assessment.
Paralleling the average percent-correct approach, IRT can also estimate scores at the group-level
without intervening score estimation for individual students. This can be done in one of two
ways. If the interest is only in comparing mean scores among schools or higher level aggregates,
these quantities can be estimated directly from counts of the number of times each item is pre-
sented to a student in the group, and of these, the proportion correct. The group means are esti-
mated on a scale that is standardized by setting to 0 the mean of the estimated group means,
weighted by the numbers of persons tested in the respective groups, and setting to 1 the standard
deviation of the estimated group means, calculated in a similarly weighted form. Standard errors with respect to the
sampling of students within schools are available for the estimated school means, and the higher-
order aggregate means.
If it is also of interest, however, to know something about the distribution of student achievement
within the aggregate groups, multiple-group IRT analysis can be used to estimate the latent dis-
tributions within the groups directly, without estimating scores for individual respondents. The
procedure is more efficient for a definite form of latent distribution, such as the normal or other
distribution that depends on a relatively small number of parameters. If a completely general
form is assumed, a nonparametric procedure, possibly involving computer simulations, may be
necessary.
Both classical test theory and item response theory have to contend with the arbitrary nature of
test scores as measurements. As mentioned above, the classical number-correct score, and even
the length-independent percent-correct score, depend arbitrarily upon the difficulties of the items
selected for the test. The IRT scale score, although relatively free of that problem, is nevertheless
expressed on a scale of arbitrary origin and unit. The earliest and still most widely used method
of removing this arbitrariness is to scale the scores relative to their distribution in some large
sample of persons taking the test. This is most commonly done by expressing the scores as per-
centiles of the distribution or as standardized scores, i.e., subtracting the mean of the distribution
from the observed score and dividing by the distribution standard deviation. This approach to
reporting test scores is called norm referencing; it assumes that comparison between persons is
the object of the testing, which it undeniably is in selection testing.
In the context of qualification testing, however, a more relevant objective is whether a person
taking the test shows evidence of having learned or mastered a satisfactory proportion of the
knowledge and skills required for qualification. Similarly, in program evaluation the objective is
whether a sufficient proportion of students in a program has reached a satisfactory level of learn-
ing or mastery. Reporting test results in these terms is referred to as domain referencing, or in a
somewhat similar usage, criterion referencing. For domain referencing to be realizable in prac-
tice, some reasonably large pool of items or exercises must exist to define the domain operation-
ally. Particular tests containing items or exercises from the pool may then be selected for pur-
poses of estimating domain scores.
The classical method of domain score estimation is to assume that items of the test are a random
sample of the pool. In that case, the test percent-correct directly estimates the domain percent-
correct, and its standard error can be computed from the test’s generalizability coefficient. IRT
can improve upon this estimate if response models for items in the pool have been calibrated in
data from a relevant population of examinees. With this information available, the items selected
for a particular test do not need to be a random sample of the pool. They need only be link items
in tests calibrated by non-equivalent groups equating. In that case, one estimates the domain
score by first estimating the person’s IRT scale score, then substituting the score in the model for
each test item to compute the person’s corresponding probability of correct response: the IRT
estimated domain score is the sum of these probabilities divided by the number of items on the
test. Domain scores estimated in this way are more accurate than classical estimates because they
take into account the varying difficulty and discriminating power of the items making up the test.
These methods of estimation can be carried out with multidimensional as well as unidimensional
response models. Domain scores are implemented in the BILOG-MG program.
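A sketch of the domain-score computation just described, using the two-parameter logistic model for illustration; theta_hat is the person’s previously estimated scale score, and the names are ours.

import math

def domain_score(theta_hat, slopes, thresholds):
    # IRT estimate of the domain (proportion-correct) score: average
    # of the model-predicted probabilities of correct response at the
    # person's estimated scale score.
    probs = [1.0 / (1.0 + math.exp(-a * (theta_hat - b)))
             for a, b in zip(slopes, thresholds)]
    return sum(probs) / len(probs)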
Adaptive testing is a method of test administration in which items are chosen that are maximally
informative for each individual examinee. Among items with acceptable discriminating power,
those selected are at a level of difficulty that affords the examinee a roughly 50 percent probabil-
ity of correct response. This corresponds to minimum a priori knowledge of the response, and
thus maximum information gain from its observation.
The two main forms of adaptive test administration are two-stage testing and sequential item
testing. In the two-stage method, which is suitable for group administration, a brief first-stage
test is administered in order to obtain a rough provisional estimate of each examinee’s profi-
ciency level. At a later time, a longer second-stage test form is administered at a level of diffi-
culty adapted to the provisional score of each examinee. In sequential adaptive testing, usually
carried out by computer, a new provisional estimate of the examinee’s proficiency is calculated
after each item presentation, and a most informative next item is chosen based on that estimate.
The presentation sequence begins with an item of median difficulty in the population from which
the examinee is drawn. Depending on whether the response to that item is correct or incorrect,
the second item chosen is harder or easier. The presentations continue in this manner until the
successive provisional estimates of proficiency narrow in on a final value with acceptably small
measurement error. Unlike two-stage testing, this method of administration requires the adaptive
process to be carried out during the testing session. For this reason computer administration is
possible only if the items are machine scorable.
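The item-selection step of sequential adaptive testing can be sketched as follows (Python, with an invented item bank of slope-threshold pairs); after each response the provisional estimate would be re-estimated and the function called again until the error of measurement is acceptably small.

import math

def next_item(theta_hat, item_bank, administered):
    # Choose the unused item with maximum information at the current
    # provisional proficiency estimate; item_bank is a list of
    # (slope, threshold) pairs, administered a set of indices.
    def info(a, b):
        p = 1.0 / (1.0 + math.exp(-a * (theta_hat - b)))
        return a * a * p * (1.0 - p)
    candidates = [j for j in range(len(item_bank)) if j not in administered]
    return max(candidates, key=lambda j: info(*item_bank[j]))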
When IRT scale scores are used to obtain the provisional estimates of proficiency in computer-
ized adaptive testing, the presented items must be calibrated beforehand in data obtained non-
adaptively. Once the system is in operation, however, items required for routine updating can be
calibrated “on line”. For this purpose, new items that are not part of the adaptive process must be
presented to examinees at random, usually in the early presentations. Responses to all items in
the sequence are then saved and assembled from all testing sites and sessions. A special type of
IRT calibration called variant item analysis is applied in which parameters are estimated for the
new “variant” items only; parameters of the old items are kept at the values used in the adaptive
testing. Because IRT calibration as well as scoring can be carried out on different arbitrary sub-
sets of items presented to respondents, the parameters of the variant items are correctly estimated
in the calibration even though the old items have been presented non-randomly in the adaptive
process. Variant item analysis is implemented in the BILOG-MG program.
With different examinees presented items of differing difficulty in adaptive testing, the number-
correct score is not appropriate for comparing proficiency levels among examinees. For this rea-
son, no treatment of adaptive testing appeared within classical test theory, and hardly any discus-
sion of the topic arose until item response theory made it possible to estimate comparable scores
from arbitrary item subsets. That development, combined with the availability of computer ter-
minals and microcomputers, has made sequential testing a practical possibility. Significant appli-
cations of computerized adaptive testing have followed, particularly in the area of selection test-
ing. Apart from its logistical and operational convenience, the primary benefit of this method of
test administration is in reducing testing time. As little as one-third of the time required for a
non-adaptive test suffices for a fully adaptive sequential test of equal precision.
10 BILOG-MG examples
This example illustrates how the BILOG-MG program can be used for traditional IRT analyses.
The data are responses to 15 multiple-choice mathematics items that were administered to a
sample of eighth-grade students. The answer key and the omitted response key are in files called
exampl01.key and exampl01.omt, respectively (defined on the INPUT command).
The data lines, of which the first few lines are shown below, contain 15 item responses. This is
the simplest form in which raw data can be read from file: there is one line of data for each ex-
aminee, and the response to item 1, for example, can always be found in column 6. All items are
used on the single subtest. Item responses start in column 6 as reflected in the format statement
(4A1,1X,15A1).
1 242311431435242
2 243323413213131
3 142212441212312
4 341211323253521
KEY 341421323441413
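As a reader’s illustration only (BILOG-MG reads the file directly according to the format statement), the fixed-width layout implied by (4A1,1X,15A1) decomposes a data line as follows in Python:

def read_examinee(line):
    # One record under the format (4A1,1X,15A1): a 4-character case
    # ID in columns 1-4, one column skipped, and 15 single-character
    # item responses in columns 6-20.
    case_id = line[0:4]
    responses = list(line[5:20])
    return case_id, responses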
With such a short test (15 items), item chi-squares are not reliable. For illustration purposes the
minimum number of items needed for chi-square computations has been reduced from the de-
fault of 20 to the number of items in this test, using the CHI keyword on the CALIB command.
With the item chi-squares computed, the PLOT=1 specification can now be used to plot all the
item response functions.
Note that the ICCs produced with the IRTPLOT program in the Windows version of BILOG-
MG display the χ2 test statistics, degrees of freedom, and probability, as well as the observed
response probabilities only for those items that have a significance level below the value speci-
fied with the PLOT keyword.
The scoring phase includes an information analysis (INFO=2) with expected information indices
for a normal population (POP). Rescaling of the scores and item parameters to mean 0 and stan-
dard deviation 1 in the estimated latent distribution has been requested (RSCTYPE=4). Printing of the
students' scores on the screen is suppressed (NOPRINT), because that information is saved in the
exampl01.sco file.
>ITEMS INAMES=(MATH01(1)MATH15);
>TEST1 TNAME='PRETEST', INUMBER=(1(1)15);
(4A1,1X,15A1)
>CALIB NQPT=31, CYCLES=25, NEWTON=10, CRIT=0.001, ACCEL=0.0, CHI=15, PLOT=1;
>SCORE NOPRINT, RSCTYPE=4, INFO=2, POP;
Phase 1 output
This is a standard 3-parameter, one-form, single-group analysis of a 15 item test. The Phase 1
classical item statistics for the first 5 items are as follows.
Phase 2 output
No new features are illustrated in the Phase 2 analysis, except that the plot criterion has been set
to include all items.
The first and last item response function plots are shown below. The first item is extremely easy
and the last extremely difficult. These plots were produced using the IRTGRAPH procedure,
which is accessed via the Plot option on the Run menu after completion of the analysis. Note
that the Phase 2 output file also contains similar line plots.
Phase 3 output
With this short, wide-range test, ten quadrature points are sufficient for scoring. The item pa-
rameters are rescaled so that the scores have mean zero and standard deviation one in the latent
distribution estimated from the full sample of 1000 examinees. Population characteristics of the
score information, including the IRT estimate of test reliability (equal to [score variance −
1/average information] / score variance), are shown with the information plot.
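The printed reliability can be reproduced from the reported quantities; a one-function sketch, with hypothetical argument names:

def empirical_reliability(score_variance, average_information):
    # IRT estimate of test reliability as printed in Phase 3: the
    # proportion of scale-score variance not attributable to
    # measurement error (error variance ~ 1 / average information).
    return (score_variance - 1.0 / average_information) / score_variance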
Using the Plot option on the Run menu to access the IRTGRAPH program, the following plot of
test information is obtained:
This example is based on an example in Thissen, Steinberg & Wainer (1993). The data are drawn
from a 100 word spelling test administered by tape recorder to psychology students at a large
university. The words for the test were randomly selected from a popular word book for secretar-
ies. Students were asked to write the words as used in a sentence on the tape recording. Re-
sponses were scored 1 if spelled correctly and 0 if spelled incorrectly. Because the items are
scored 1,0, an answer key is not required. A complete description of these data is given in Sec-
tion 2.4.1.
The groups in this example are the two sexes, and this is indicated by the NGROUP keyword on the
INPUT command. The same four items are presented to both groups on a single test form. The
format statement following the second GROUP command describes the position and order of data
in exampl01.dat. The group indicator is found in column 3 of the data records and is read in in-
teger format. A form indicator is not required in the data records because there is only one form.
The data have been sorted into answer patterns, and the frequencies are found in columns 10-11
of the data records (F2.0). These frequencies serve as case weights in the analysis. The TYPE=2 and
NWGHT=3 keywords describe this type of data. The value assigned to the keyword NWGHT requests
the use of weighting in both the statistics and calibration (by default, no weights would be ap-
plied).
A 1-parameter logistic model is requested using the NPARM keyword on the GLOBAL command.
The LOGISTIC option on the GLOBAL command indicates that the natural metric of the logistic
response function will be assumed in all calculations. If this keyword is not present, the logit is,
by default, multiplied by 1.7 to obtain the metric of the normal response function.
The SAVE option on the GLOBAL command indicates that a SAVE command will follow directly
after the GLOBAL command. On the SAVE command, the item parameter estimates are saved to an
external file exampl02.par and the DIF analysis results are written to an external file ex-
ampl02.dif.
The total number of unique items is described using the NTOTAL keyword on the INPUT command
while the NITEMS keyword on the LENGTH command is set to 4 to indicate that all 4 items are to
be used in the single subtest.
The ITEMS command lists the four items in the order that they will be read from the data records.
The INAMES and INUMBERS keywords assign each item a name and a corresponding number. Be-
cause there is only one form, the NFORM keyword is not required in the INPUT command and a
FORM command is not required. Because examinees in both groups are presented all the items
listed in the ITEMS command, the TEST and GROUP commands need contain only the test name
and the group names, respectively.
A DIF analysis is requested through the use of the DIF option on the INPUT command.
The REFERENCE=1 keyword on the CALIB command designates males as the reference group. The
convergence criterion is set to 0.005 instead of the default 0.01 using the CRIT keyword.
When NGROUP >1, 20 quadrature points will be used for each group. Setting the NQPT keyword to
10 implies that 10 points will be used for each group, as fewer points are needed when the num-
ber of items is small.
No SCORE command is included in the command file, as DIF models cannot be scored.
Phase 1 output
The title and additional comments (if the optional COMMENT command has been used) are echoed
to the output file. Immediately after that, Phase 1 commands and specifications of the analysis
are given. Under FILE ASSIGNMENT, relevant information as read in from the GLOBAL, SAVE,
LENGTH, and TEST commands is listed.
>SAVE PARM='EXAMPL02.PAR',DIF='EXAMPL02.DIF';
[OUTPUT FILES]
ITEM PARAMETERS FILE EXAMPL02.PAR
DIF PARAMETER FILE EXAMPL02.DIF
>LENGTH NITEMS=4;
>INPUT NTOTAL=4,NGROUP=2,DIF,NIDCHAR=2,TYPE=2;
Specifications of input-related keywords are echoed in the next section. The data are entered as
item-score patterns (right = 1, wrong = 0) and frequencies (case weights).
>ITEMS INAMES=(SP1(1)SP4),INUMBERS=(1(1)4);
TEST SPECIFICATIONS
===================
>TEST TNAME=SPELL;
The following lines indicate the assignment of items to the single subtest, utilizing the informa-
tion on both the TEST and ITEMS commands.
Information on the forms and groups is given next. The definition of the male and female groups,
and the use of the same four items for both groups are reflected below. It is also noted that a DIF
model is to be employed in this analysis.
FORM SPECIFICATIONS
===================
ITEMS READ ACCORDING TO SPECIFICATIONS ON THE ITEMS COMMAND
>GROUP1 GNAME=MALES;
>GROUP2 GNAME=FEMALES;
ITEM ITEM
NUMBER NAME
------------------
1 SP1
2 SP2
3 SP3
4 SP4
------------------
ITEM ITEM
NUMBER NAME
------------------
1 SP1
2 SP2
3 SP3
4 SP4
------------------
Following is the format statement used in reading the data and the answer, omit, and not-present
keys (if any). Data for this example are item scores and they are complete; keys are not required.
The case ID is read in the first 2 columns (2A1), followed by the group indicator (I1). After read-
ing the weights (F2.0), the 4 item responses are read (4A1).
The first two cases are echoed to the output file so that the user can verify the input.
SUBTEST #: 1 SPELL
GROUP #: 1 MALES
TRIED RIGHT
4.000 0.000
ITEM 1 2 3 4
TRIED 1.0 1.0 1.0 1.0
RIGHT 0.0 0.0 0.0 0.0
SUBTEST #: 1 SPELL
GROUP #: 1 MALES
TRIED RIGHT
4.000 1.000
ITEM 1 2 3 4
TRIED 1.0 1.0 1.0 1.0
RIGHT 0.0 0.0 0.0 1.0
Classical item statistics for the total sample and each group sample follow. #TRIED designates
the number of examinees responding to the item. For completeness, both the Pearson and biserial
item-test correlations are shown. The latter has smaller bias when the percent right is extreme.
The item statistics are given by group and then for the total group.
Item means, initial slope estimates, and Pearson and polyserial item-test correlations are given in
the next table.
Pearson
The point biserial correlation r_{PB,j} for item j is a computationally simplified Pearson’s r between
the dichotomously scored item j and the total score x. It is computed as

r_{PB,j} = \frac{\mu_j - \mu_x}{\sigma_x} \sqrt{\frac{p_j}{q_j}}

where µ_j is the mean total score among examinees who have responded correctly to item j, µ_x
is the mean total score for all examinees, p_j is the item difficulty index for item j, q_j = 1 − p_j,
and σ_x is the standard deviation of the total scores for all examinees.
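A direct Python transcription of this formula (names ours; the population standard deviation of the total scores is used):

import math

def point_biserial(item_scores, total_scores):
    # Point biserial correlation between a 0/1 item and the total score.
    n = len(total_scores)
    mu_x = sum(total_scores) / n
    sigma_x = math.sqrt(sum((x - mu_x) ** 2 for x in total_scores) / n)
    right = [x for u, x in zip(item_scores, total_scores) if u == 1]
    p = len(right) / n
    mu_j = sum(right) / len(right)
    return (mu_j - mu_x) / sigma_x * math.sqrt(p / (1.0 - p))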
Polyserial correlation
The polyserial correlation r_{P,j} can be expressed in terms of the point polyserial correlation as

r_{P,j} = \frac{r_{PP,j}\,\sigma_j}{\sum_{k=1}^{m-1} h(z_{jk})}

where z_{jk} is the score corresponding to the cumulative proportion p_{jk} of the k-th response
category of item j, σ_j is the standard deviation of the item scores y for item j, and r_{PP,j} is the
point-polyserial correlation.
The biserial correlation estimates the relationship between the total score and the hypothetical
score on the continuous scale underlying the (dichotomous) item. The biserial correlation also
assumes a normal distribution of the hypothetical scores. The reason for reporting these correla-
tions separately for each group is that the appearance of large discrepancies between groups for a
given item would suggest that the assumption of a common slope is untenable. Note that, if a
biserial correlation more negative than –0.15 is detected by the program during this phase of the
analysis, the item in question will be assumed miskeyed and will be omitted in the Phase 2
analysis.
Phase 2 output
During calibration, a logistic item response function is fitted to each item of each subscale. In
this example, a 1-parameter logistic response function is fitted (NPARM=1 on the GLOBAL command).
Echoing of the Phase 2 commands and specification of the analysis starts the listing of Phase 2
output.
>CALIB NQPT=10,CYCLES=15,CRIT=0.005,NEWTON=2,REFERENCE=1;
Under CALIBRATION PARAMETERS, the definitions of calibration-related keywords for this analy-
sis are given:
CALIBRATION PARAMETERS
======================
MML estimation is used when tests of three or more items are specified. The solution assumes
that the respondents are drawn randomly from a population, or populations, of abilities assumed
to be normally distributed. The empirical distribution of ability is represented as a
discrete distribution on a finite number of points. The quadrature points and weights used for
MML estimation of the item parameters for the two groups are printed next.
METHOD OF SOLUTION:
EM CYCLES (MAXIMUM OF 15)
FOLLOWED BY NEWTON-RAPHSON STEPS (MAXIMUM OF 2)
1 2 3 4 5
POINT -0.4000E+01 -0.3111E+01 -0.2222E+01 -0.1333E+01 -0.4444E+00
WEIGHT 0.1190E-03 0.2805E-02 0.3002E-01 0.1458E+00 0.3213E+00
6 7 8 9 10
POINT 0.4444E+00 0.1333E+01 0.2222E+01 0.3111E+01 0.4000E+01
WEIGHT 0.3213E+00 0.1458E+00 0.3002E-01 0.2805E-02 0.1190E-03
1 2 3 4 5
POINT -0.4000E+01 -0.3111E+01 -0.2222E+01 -0.1333E+01 -0.4444E+00
WEIGHT 0.1190E-03 0.2805E-02 0.3002E-01 0.1458E+00 0.3213E+00
6 7 8 9 10
POINT 0.4444E+00 0.1333E+01 0.2222E+01 0.3111E+01 0.4000E+01
WEIGHT 0.3213E+00 0.1458E+00 0.3002E-01 0.2805E-02 0.1190E-03
The MML solution employs both the EM method and Newton-Gauss iterations to solve the mar-
ginal likelihood equations. On the CALIB command, a maximum of 15 EM cycles and 2 Newton-
Gauss iterations were requested. Results for each iteration are displayed so that the extent of
convergence can be judged.
In the case of nested models on the same data, the –2 log likelihood values at convergence can be
used to evaluate the fit of the models. Refitting this example, for instance, as a single-group
analysis will allow the comparison of the non-DIF and DIF models for these data. In that way, it can
be determined whether differential item functioning effects are present.
[E-M CYCLES]
...
The information matrix for all item parameters is approximated during each Newton step and
then used at convergence to provide large-sample standard errors of estimation for the item pa-
rameter estimates.
In Phase 2, when there is a single group, the unit and origin of the scale on which the parameters
are expressed is based on the assumption that the latent ability distribution has zero mean and
unit variance (the so-called “0,1” metric). In the case of multiple groups, the program provides
the option of setting the mean and standard deviation of one group to 0,1 as shown here. The user
may set the mean and standard deviation of the combined estimated distribution of the groups to
0 and 1 by setting the REFERENCE keyword on the CALIB command to zero. The parameter esti-
mates can be rescaled in Phase 3 according to scale conventions selected by the user (using the
RSCTYPE, SCALE and LOCATION keywords on the SCORE command). In a DIF model, no scoring is
done, so use of the REFERENCE=0 specification is not pursued here.
Estimated item parameters for the two groups are given next. The INTERCEPT column contains
the estimates of the item intercepts, which are defined as the product of each item’s slope and
threshold. This is followed by the slope or discrimination parameters and the item threshold or
location parameters. The LOADING column represents the one-factor item factor loadings given
by the expression
slope
.
1.0 + slope 2
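For the constrained slope of 1.285 reported in this example, for instance, the loading is 1.285 / √(1.0 + 1.285²) ≈ 0.79.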
For a 1PL model, no asymptotes or guessing parameters are estimated. In a 1PL model, all slopes
are equal. In DIF analyses, the assumption is made that slopes are equal over the groups. This
implies that items will discriminate equally well in all groups. Note that, in this example, the
slopes of all items for both groups are constrained to 1.285.
The item parameter estimates for each group are followed by the averages for the group thresh-
olds. The mean threshold of the female group (Group 2) is 0.146 above that of the male or refer-
ence group. DIF is item by group interaction under the constraint that the mean thresholds of the
groups are equal. The threshold adjustment for the reference group is set to 0.000, and the
thresholds of the female group are accordingly adjusted by 0.146. The unadjusted and ad-
justed mean thresholds for the two groups form the next section of the Phase 2 output file.
THRESHOLD MEANS
GROUP ADJUSTMENT
------------------------
1 0.000
2 0.146
------------------------
The adjusted threshold values are followed by the group differences of the constrained values,
each shown with its standard error:
ITEM GROUP
2 - 1
-----------------------
SP1 | -0.455
| 0.185*
|
SP2 | -0.043
| 0.159*
|
SP3 | -0.065
| 0.141*
|
SP4 | 0.564
| 0.156*
-----------------------
*STANDARD ERROR
The estimated latent distributions of the groups are given next, with the origin and unit of scale
set so that the mean of the reference group is 0 and the standard deviation is 1.
1 2 3 4 5
POINT -0.3578E+01 -0.2788E+01 -0.1998E+01 -0.1208E+01 -0.4180E+00
POSTERIOR 0.1972E-03 0.4485E-02 0.4394E-01 0.1737E+00 0.2780E+00
6 7 8 9 10
POINT 0.3720E+00 0.1162E+01 0.1952E+01 0.2742E+01 0.3532E+01
POSTERIOR 0.2647E+00 0.1724E+00 0.5526E-01 0.7020E-02 0.3483E-03
MEAN 0.00000
S.E. 0.00000
S.D. 1.00000
S.E. 0.00000
1 2 3 4 5
POINT -0.3724E+01 -0.2934E+01 -0.2144E+01 -0.1354E+01 -0.5642E+00
POSTERIOR 0.2099E-03 0.4246E-02 0.3608E-01 0.1456E+00 0.3067E+00
6 7 8 9 10
POINT 0.2258E+00 0.1016E+01 0.1806E+01 0.2596E+01 0.3386E+01
POSTERIOR 0.3161E+00 0.1525E+00 0.3473E-01 0.3598E-02 0.1624E-03
MEAN -0.16191
S.E. 0.06907
S.D. 0.89707
S.E. 0.00845
A plot of the two estimated latent distributions is shown below. The solid line represents the
estimated distribution of the male group.
BILOG-MG is also capable of producing graphic representations of a number of item and test
characteristics. Using the PLOT keyword on the CALIB command, it is possible to obtain plots of
the item-response functions with a significance level below the value assigned to the PLOT key-
word. By default, PLOT=0 and no plots are produced. On the other hand, setting PLOT to 1.0 will
lead to the display of all item response functions in the output file. One such plot, for the fourth
item administered to the female group, is shown below.
The plot also shows 95% tolerance intervals for the observed percent correct among respondents
in corresponding EAP groups, assuming the percent-correct predicted by the model is correct.
Note that similar plots may be obtained through the IRTGRAPH program, accessed via the Plot option on the Run menu in BILOG-MG for Windows.
[Item response function plot: GROUP: 2 FEMALES, SUBTEST: SPELL]
By saving the estimated item parameters to an external file, the estimates can also be used in
external packages to produce additional plots. Below, the item response functions for both
groups are plotted by item.
The data from example 2 are analyzed here as a single group. Thus no NGROUP keyword is pro-
vided on the INPUT command and, by default, the program assumes there is only one group. No
GROUP commands follow the TEST command, and the group indicator has been removed from the
variable format statement.
The acceleration factor on the CALIB command has been set to its default value of 0.5
(ACCEL=0.5). The difference in the log likelihoods from the two-group and single-group solu-
tions can be examined to determine if differential item functioning effects are present. The item
parameter file obtained in the previous section is specified in the GLOBAL command to provide
starting values for parameter estimation in Phase 2.
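A sketch of the commands involved (the file names are hypothetical stand-ins, and NPARM=1 assumes the 1PL model of the preceding DIF analysis):

>GLOBAL DFNAME='EXAMPL02.DAT', IFNAME='EXAMPL02.PAR', NPARM=1;
>CALIB ACCEL=0.5;

IFNAME supplies the previously saved item parameter file as starting values for Phase 2, and ACCEL=0.5 sets the acceleration factor to its default value.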
Phase 1 output
The Phase 1 output for this example is the same as that obtained in Section 10.2, except that clas-
sical item statistics are computed only for the total sample.
Phase 2 output
The main interest in this example is the comparison of the log likelihoods of the DIF and non-DIF models. The difference, 3138.4122 − 3110.3990 = 28.0132, distributed as χ² on four degrees of freedom, indicates significantly better fit of the DIF model.
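Written out, with the 0.05 critical value of χ² on 4 degrees of freedom added for reference:

$$\chi^2 = 3138.4122 - 3110.3990 = 28.0132 > \chi^2_{0.95}(4) = 9.488,$$

so the hypothesis of no differential item functioning is rejected at the 0.05 level.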
This example illustrates the equating of equivalent groups with the BILOG-MG program. Two
parallel test forms of 20 multiple-choice items were administered to two equivalent samples of
200 examinees drawn from the same population. There are no common items between the forms.
Because the samples were drawn from the same population, GROUP commands are not required.
The FORM1 command lists the order of the items in Form 1 and the FORM2 command lists the or-
der of the items in Form 2. These commands follow directly after the TEST command as indi-
cated by the NFORM=2 keyword on the INPUT command. As only one test is used, the vector of
items per subtest given by the NITEMS keyword on the LENGTH command contains only one entry.
The SAVE option on the GLOBAL command is used in combination with the SAVE command to
save item parameter estimates and scores to the external files exampl04.par and exampl04.sco
respectively.
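A sketch of this save setup (the data file name follows the exampl04 naming used in the text; the PARM keyword naming follows the SAVE command usage of the later examples):

>GLOBAL DFNAME='EXAMPL04.DAT', SAVE;
>SAVE PARM='EXAMPL04.PAR', SCORE='EXAMPL04.SCO';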
In this example, responses to 40 unique items are given in the data file. The first few lines of the data file are shown below. The answer keys for the two forms always appear first, in the same format as the data. The first record shown after the keys contains an examinee's responses to items 1 through 20 of Form 1. For an examinee who responded to the second form, responses in the same positions in the data file correspond to items 21 through 40. Keep in mind that the number of items read by the format statement is the total number of items in the form when NFORM=1, and the total number of items in the longest form when NFORM>1.
1 11111111111111111111
2 11111111111111111111
1 001 11111111122212122111
1 002 11222212221222222112
1 003 12121221222222221222
1 004 11212212222222212222
…
2 198 11112211111222212211
2 199 21122222222222222122
2 200 11111111111111221111
The FLOAT option is used on the CALIB command to request the estimation of the means of the
prior distributions of item parameters along with the parameters. This option should not be used
when the data set is small and items few. Means of the item parameters may drift indefinitely
during estimation cycles under these conditions. In the CALIB command, the FIXED option is also
required to keep the prior distributions of ability fixed during the EM cycles of this example. In
multiple-group analysis, the default is “not fixed”.
ML estimates of ability are rescaled to a mean of 250 and standard deviation of 50 in Phase 3
(METHOD=1, RSCTYPE=3, LOCATION=250, SCALE=50). By setting INFO to 1 on the SCORE com-
mand, the printing of test information curves to the phase 3 output file is requested. To request
the calculation of expected information for the population, the POP option may be added to this
command. In the case of multiple subtests, the further addition of the YCOMMON option will re-
quest the expression of test information curves for the subtests in comparable units.
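Assembled from the keywords just described, a sketch of the calibration and scoring commands might read (illustrative only; the actual example file may contain additional keywords):

>CALIB FLOAT, FIXED;
>SCORE METHOD=1, RSCTYPE=3, LOCATION=250.0, SCALE=50.0, INFO=1, POP, YCOMMON;

FLOAT requests estimation of the prior means of the item parameters, FIXED keeps the ability prior fixed during the EM cycles, and the SCORE line requests rescaled ML estimates with population test information expressed in comparable units.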
Phase 1 output
Because both samples are drawn from the same population, all responses are combined in the results. Since there are no common items between forms, the number tried for each item is 200. If
there had been common items, their number tried would be 400. Results for the first 5 items are
shown below.
Phase 2 output
Item parameter estimation assumes a common latent distribution for the random equivalent
groups administered the respective test forms. Empirical prior distributions are assumed for the
slope and threshold parameters. The means of these priors are estimated concurrently with the
item parameters.
CALIBRATION PARAMETERS
======================
Final iterations of the solutions and some of the results are as follows. Indeterminacy of the ori-
gin and unit of the ability scale is resolved in Phase 2 by setting the mean and standard deviation
of the latent distribution to zero and one, respectively.
[NEWTON CYCLES]
After assigning cases to the intervals (shown below) on the basis of the EAP estimates of their
scale scores, the program computes the expected number of correct responses in the interval by
multiplying these counts by the response model probability at the indicated θ. The χ² is computed in the usual way from the differences between the observed and expected counts.
The counts are displayed so that the user can judge whether there are enough cases in each group to justify computing a χ² statistic. If not, the user should reset the number of intervals.
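The statistic has the usual Pearson form, summed over the score intervals:

$$\chi^2 = \sum_{h} \frac{(O_h - E_h)^2}{E_h},$$

with $O_h$ and $E_h$ the observed and expected numbers of correct responses in interval h.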
Phase 3 output
For purposes of reporting test scores, the ability scale is set so that the mean score distribution in
the sample of examinees is 250 and the standard deviation is 50. The item parameters are re-
scaled accordingly.
Before rescaling, the sample mean score is essentially the same as that in the Phase 2 latent dis-
tribution. The standard deviation is larger, however, because the score distribution includes
measurement error variance.
The correlation matrix of the test scores (when there is more than one test).
The mean, standard deviation and variance of the θ score estimates:
Maximum Likelihood (ML) estimate
Bayes Model (Maximum A Posteriori, MAP) estimate
Bayes (Expected, EAP) estimate
given the response pattern of the case. The standard error is the square root of the average
of these variances.
EAP – Root-Mean-Square posterior standard deviation: The error variance for each case is
the variance of the posterior distribution of theta, given the response pattern of the case.
The standard error is the square root of the average of these variances.
The empirical reliability of the test is the θ score variance divided by the sum of that variance
and the error variance.
Note:
The expected value of the sum of the θ score variance and the error variance is the variance of
the latent distribution of the group. The sum of the corresponding sample variances should tend
to that value as the sample size increases.
SIM
SIM 1.0000
TEST: SIM
MEAN: 0.0057
S.D.: 1.1426
VARIANCE: 1.3054
TEST: SIM
RMS: 0.4203
VARIANCE: 0.1767
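From the values listed above for test SIM, with score variance 1.3054 and error variance 0.1767, the empirical reliability is

$$\frac{1.3054}{1.3054 + 0.1767} \approx 0.881.$$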
RESCALING CONSTANTS
TEST SCALE LOCATION
SIM 43.762 249.749
The scaled scores are saved on an external file and their printing is suppressed in all but the first
two cases.
The magnitudes of the rescaled item parameters reflect the new origin and unit of the scale. The thresholds center around 250 and the slopes are smaller by a factor of about 50. The slopes are printed here to only three decimal places but appear with full accuracy in the saved item parameter file. If saved parameters are used to score other examinees, the results will be expressed on the scale determined in the present sample.
Results of the information analysis are depicted in the following line printer plot. Points indicated by + and * represent the information and measurement error functions, respectively. This plot applies to all 40 items and not to the separate test forms. Because the item thresholds are approximately normally distributed, with mean and standard deviation similar to those of the score distribution, the precision of the item set is greatest toward the middle of the scale.
Two hundred students at each of three grade levels, grades four, six, and eight, were given grade-
appropriate versions of a 20-item arithmetic examination. Items 19 and 20 appear in the grade 4
and 6 forms; items 37 and 38 appear in the grade 6 and 8 forms. Because each item is assigned a
unique column in the data records, a FORM command is not required.
The data file contains the answer key, the not-presented key, and the raw data. Two lines of information are given per examinee, as shown below. The answer key contains 56 entries, each equal to 1. If
an item has not been presented, it is indicated in the data by a blank character (' ').
KEY 11111111111111111111111111111111111111111111111111111111
NOT
001 1 11111112221211222212
002 1 21121211121111121212
003 1 11112112211222212212
004 1 11111112121111111211
005 1 21111112221212121222
No items are assigned to the TEST command using the INAMES or INUMBERS keywords. By default, all items are then assigned to the test. The test name (TNAME=MATH) is not enclosed in single quotes, but the group names are, as these names contain blanks.
The distributions of ability are assumed to be normal at each grade level (NORMAL on the CALIB
command). Grade 6 serves as the reference group in the calibration of the items (REFERENCE=2).
EAP estimates of ability are calculated using the information in the posterior distributions from
Phase 2. The ability estimates are rescaled to a mean of 0 and standard deviation of 1 by specify-
ing RSCTYPE=3 on the SCORE command.
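Assembled from the keywords above, a sketch of the corresponding commands (illustrative only; the actual example file may contain additional keywords):

>CALIB NORMAL, REFERENCE=2, NQPT=51;
>SCORE METHOD=2, RSCTYPE=3;

NORMAL assumes a normal latent distribution in each grade, REFERENCE=2 selects grade 6 as the reference group, NQPT=51 provides the large number of quadrature points used in this example (see the Phase 2 discussion below), and METHOD=2 with RSCTYPE=3 requests EAP scores rescaled to mean 0 and standard deviation 1.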
Phase 1 output
In this example, items assigned to the three groups of examinees are selected from the following
set. The items are selected in such a way that two items are common to groups 1 and 2 and two
other items are common to groups 2 and 3. The groups, corresponding to school grades four, six,
and eight are non-equivalent and require separate classical item statistics. The fact that classical
item statistics are not invariant with respect to sampling from different populations is illustrated
by the different results for common items in different groups.
ITEM ITEM
NUMBER NAME
------------------
1 M01
2 M02
…
20 M20
------------------
ITEM ITEM
NUMBER NAME
------------------
19 M19
20 M20
…
38 M38
------------------
ITEM ITEM
NUMBER NAME
------------------
37 M37
…
56 M56
------------------
SUBTEST 1 MATH
GROUP 1 GRADE 4 200 OBSERVATIONS
GROUP 2 GRADE 6 200 OBSERVATIONS
GROUP 3 GRADE 8 200 OBSERVATIONS
Item statistics for the first 5 items of each subtest are shown below. Similar output is produced for grades 6 and 8, and for the multiple groups combined (MATH) which, in this case, contains the statistics for all the grades.
SUBTEST 1 MATH
ITEM STATISTICS FOR GROUP: 1 GRADE 4
ITEM*TEST CORRELATION
ITEM NAME #TRIED #RIGHT PCT LOGIT/1.7 PEARSON BISERIAL
-----------------------------------------------------------------------
1 M01 200.0 138.0 0.690 -0.47 0.470 0.616
...
19 M19 200.0 95.0 0.475 0.06 0.520 0.652
20 M20 200.0 67.0 0.335 0.40 0.475 0.615
----------------------------------------------------------------------
Phase 2 output
In vertical equating over a range of age levels, the ability distributions of the groups may be
widely spaced. For that reason, it is desirable to use a large number of quadrature points – in this
case, 51.
The origin and unit of the ability distribution can be fixed in the calibration either by setting the mean and standard deviation of a reference group to zero and one, respectively, or by setting the mean and standard deviation of the combined groups. In this example, group 2 is selected as the reference group.
CALIBRATION PARAMETERS
======================
MAXIMUM NUMBER OF EM CYCLES: 30
MAXIMUM NUMBER OF NEWTON CYCLES: 2
CONVERGENCE CRITERION: 0.0100
ACCELERATION CONSTANT: 1.0000
LATENT DISTRIBUTION: NORMAL PRIOR FOR EACH GROUP
GROUP MEANS AND SDS
ESTIMATED CONCURRENTLY
WITH ITEM PARAMETERS
REFERENCE GROUP: 2
The iterative estimation procedures typically converge more slowly in nonequivalent group data
than in one group or equivalent groups data. The last few iterations are shown here along with
some of the resulting parameter estimates. The means of the prior distributions on item thresh-
olds and slopes are also listed.
[NEWTON CYCLES]
UPDATED PRIOR ON LOG SLOPES; MEAN & SD = -0.23533 0.50000
UPDATED PRIOR ON THRESHOLDS; MEAN & SD = 0.08308 2.00000
-2 LOG LIKELIHOOD: 13245.9542
The within-group latent distributions are assumed normal. Their means and standard deviations are estimated relative to the reference group. In these data, the means increase over the grades (-0.722, 0.000, 0.569), but the standard deviations are relatively constant (1.069, 1.000, 1.126).
47 48 49 50 51
POINT 0.3552E+01 0.3722E+01 0.3892E+01 0.4062E+01 0.4232E+01
POSTERIOR 0.1899E-04 0.9879E-05 0.3535E-05 0.1816E-05 0.9055E-06
MEAN -0.72298
S.E. 0.11260
S.D. 1.06880
S.E. 0.12631
47 48 49 50 51
POINT 0.3552E+01 0.3722E+01 0.3892E+01 0.4062E+01 0.4232E+01
POSTERIOR 0.1172E-03 0.6346E-04 0.3291E-04 0.1689E-04 0.8409E-05
MEAN 0.00000
S.E. 0.00000
S.D. 1.00000
S.E. 0.00000
47 48 49 50 51
POINT 0.3552E+01 0.3722E+01 0.3892E+01 0.4062E+01 0.4232E+01
POSTERIOR 0.1837E-02 0.1230E-02 0.8192E-03 0.5316E-03 0.3268E-03
MEAN 0.56861
S.E. 0.11855
S.D. 1.12577
S.E. 0.14026
Phase 3 output
With nonequivalent groups, Bayes (EAP) and Bayes Modal (MAP) estimation of test scores
should be carried out with respect to the Phase 2 latent distribution to which the examinee be-
longs. Specify IDIST=3 on the SCORE command.
QUAD
TEST NAME GROUP POINTS
---------------------------
1 MATH 1 51
1 MATH 2 51
1 MATH 3 51
---------------------------
RESCALING CONSTANTS
TEST NAME SCALE LOCATION
------------------------------------
1 MATH 1.000 0.000
------------------------------------
In this example, the scores are rescaled so that their mean and standard deviation in the total
sample are zero and one, respectively. The parameter estimates are rescaled accordingly.
RESCALING CONSTANTS
TEST SCALE LOCATION
MATH 1.066 0.003
GROUP MEAN SD
-----------------------------
1 -0.776 1.067
2 0.000 1.000
3 0.608 1.118
-----------------------------
GROUP MEAN SD
-----------------------------
1 -0.776 1.149
2 0.000 1.074
3 0.608 1.201
-----------------------------
This example illustrates the use of the TYPE=3 specification on the INPUT command to analyze
aggregate-level, multiple-matrix sampling data. The data in exampl06.dat are numbers tried and
numbers correct for items from eight forms of a matrix sampled assessment instrument. The
groups are selected 8th grade students from 32 public schools. The first record for each school
contains the data for the items of a Number Concepts scale, NUMCON, and the second record con-
tains the data for items of an Algebra Concepts scale, ALGCON. Data for the first two schools are
shown below.
SCHOOL 1 NUM 1 0 3 2 2 1 4 4 3 2 2 1 4 3 4 1
SCHOOL 1 ALG 1 0 3 1 2 0 3 2 3 2 2 1 4 1 4 0
SCHOOL 2 NUM 5 3 4 4 3 2 3 3 2 2 4 3 4 3 5 3
SCHOOL 2 ALG 5 2 4 2 3 2 3 2 2 2 4 2 4 2 5 3
An answer key is not required for aggregate-level data in number-tried, number-right summary
form. Note the format statement for reading the two sets of eight number-tried, number-right ob-
servations. For more information on how to set up the variable format statement for this type of
data, see 2.6.18.
The items are multiple-choice and fairly difficult, so the 3PL model is needed. Because aggre-
gate-level data are always more informative than individual-level item responses, it is worth-
while in the CALIB command to increase the number of quadrature points (NQPT), to set a stricter
criterion for convergence (CRIT), and to increase the CYCLES limit. A prior on the thresholds
(TPRIOR) and a ridge constant of 0.8 (RIDGE) are required for convergence with the exceptionally
difficult ALGCON subtest. Aggregate-level data typically have smaller slopes in the 0,1 metric
than do person-level data. For this reason, the mean of the prior for the log slopes has been set to
0.5 by use of the READPRIOR option of the CALIB command and the following PRIOR commands.
The aggregate scores for the schools are estimated by the EAP method using the empirical distri-
butions from Phase 2. The number of quadrature points is set the same as in Phase 2.
The scores are rescaled to a mean of 250 and a standard deviation of 50 in the latent distribution
of schools (IDIST=3, LOCATION=250, SCALE=50). The fit of the data to the group-level model is
tested for each school (FIT). The NUMCON items have fairly homogeneous slopes and might
be favorable for a one-parameter model.
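Assembled from the keywords above, a sketch of the CALIB and SCORE commands (the NQPT, CYCLES, and CRIT values shown are illustrative placeholders, not those of the actual example file):

>CALIB NQPT=31, CYCLES=50, CRIT=0.001, TPRIOR, RIDGE=0.8, READPRIOR;
>SCORE IDIST=3, LOCATION=250.0, SCALE=50.0, FIT;

TPRIOR, RIDGE=0.8, and READPRIOR (followed by the PRIOR commands) are as described in the text; IDIST=3 scores against the empirical distributions from Phase 2.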
Phase 1 output
Group-level data consist of number-tried and number-right frequencies for each item in each
group. The program reads them as values rather than characters and conversion to item scores is
not required.
SUBTEST #: 1 NUMCON
GROUP #: 1
TRIED RIGHT
23.000 14.000
ITEM 1 2 3 4 5 6 7 8
TRIED 1.0 3.0 2.0 4.0 3.0 2.0 4.0 4.0
RIGHT 0.0 2.0 1.0 4.0 2.0 1.0 3.0 1.0
SUBTEST #: 2 ALGCON
GROUP #: 1
TRIED RIGHT
22.000 7.000
ITEM 1 2 3 4 5 6 7 8
TRIED 1.0 3.0 2.0 3.0 3.0 2.0 4.0 4.0
RIGHT 0.0 1.0 0.0 2.0 2.0 1.0 1.0 0.0
SUBTEST #: 1 NUMCON
GROUP #: 2
TRIED RIGHT
30.000 23.000
ITEM 1 2 3 4 5 6 7 8
TRIED 5.0 4.0 3.0 3.0 2.0 4.0 4.0 5.0
RIGHT 3.0 4.0 2.0 3.0 2.0 3.0 3.0 3.0
SUBTEST #: 2 ALGCON
GROUP #: 2
TRIED RIGHT
30.000 17.000
ITEM 1 2 3 4 5 6 7 8
TRIED 5.0 4.0 3.0 3.0 2.0 4.0 4.0 5.0
RIGHT 2.0 2.0 2.0 2.0 2.0 2.0 2.0 3.0
Classical item statistics are computed for each subtest. Biserial correlations cannot be computed
with group-level data.
ITEM*TEST CORRELATION
ITEM NAME #TRIED #RIGHT PCT LOGIT/1.7 PEARSON BISERIAL
-------------------------------------------------------------------------
1 A1 259.0 120.0 46.3 0.09 0.636 0.000
2 A2 267.0 81.0 30.3 0.49 0.606 0.000
3 A3 241.0 94.0 39.0 0.26 0.669 0.000
4 A4 245.0 121.0 49.4 0.01 0.687 0.000
5 A5 263.0 96.0 36.5 0.33 0.669 0.000
6 A6 263.0 166.0 63.1 -0.32 0.746 0.000
7 A7 267.0 71.0 26.6 0.60 0.667 0.000
8 A8 262.0 90.0 34.4 0.38 0.683 0.000
-------------------------------------------------------------------------
Phase 2 output
The set-up for group-level item calibration differs somewhat from examinee-level analysis: more
quadrature points and more iterations for the solution are required. Prior distributions for all pa-
rameters are necessary, the means should be kept fixed (default = NOFLOAT), and the mean of the
priors for slopes should be set lower than the examinee-level default.
Group-level item parameter estimates for the first 3 items in subtest NUMCON are as follows.
Phase 3 output
Computing scores at the group level is essentially the same as at the examinee level. Note that the selection of EAP estimation based on the empirical latent distribution from Phase 2 overrides the choice here of the number of quadrature points. Because of the small number of items, the standard deviation of the estimated scores is considerably smaller than that of the latent distribution. Portions of the Phase 3 output are listed below.
The scores are rescaled so that the mean and standard deviation of the Phase 3 latent distribution
are 250 and 50, respectively. Scores for all 32 schools are computed and printed. Because the
data are binomial rather than binary, a χ² index of fit on 8 degrees of freedom can be calculated
for each school. The corresponding probabilities are shown in the output.
RESCALING CONSTANTS
TEST SCALE LOCATION
NUMCON 58.462 251.342
ALGCON 56.462 251.127
In this example, responses to 50 items are read from those of 100 items in the data file using the format statement

(10A1,T38,25A1,1X,25A1)

which reads the 10-character case ID field, tabs to column 38, reads 25 responses, skips one column, and reads the remaining 25 responses.
The first few lines of the data file are shown below. In contrast to previous examples, each posi-
tion in the item response fields for each examinee corresponds to the same item. In the earlier
examples, the association between response and item depended on the group/form membership
of an examinee.
The answer key (KFNAME keyword on the INPUT command) appears in the first two lines of the raw data file, in the same format as the item responses.
KEY 00000000000000000000000000000000000000000000000000000000000000000000000000…
0102111900 00000401020100002001101002024030005001000000000233004002014062000012000100…
0104112200 10101200210100000100010230110030013000000100103021014000002042001012001000…
0105121900 11012041110200000010002230131010122101000000013123000002001042101012001300…
From the 50, 20 are selected as Main Test items and 4 as Variant Test items. This is indicated by
setting NITEMS to 24 and NVARIANT to 4 on the LENGTH command. Items for the main test are
selected by name in the TESTM command; items for the variant test are selected by name in the
TESTV command. The item names correspond to the sequence numbers in the original set of 100
items. Here the short form of naming and numbering is used – the set of items forms an arithme-
tic progression of integer or decimal numbers allowing use of the short form (first (increment)
last). A similar abbreviation may be used for consecutive item names (INAMES keyword on the
ITEMS command).
The analysis is performed on a sample of 200 students randomly drawn from the original sample
of 660 (SAMPLE=200 on the INPUT command). The EAP scale scores of Phase 3 are computed
from the responses to items in the main test.
Phase 1 output
Phase 1 lists the test specifications and the assignment of items to the main test and the variants.
TEST SPECIFICATIONS
===================
>TESTM TNAME=MAINTEST,
INAMES=(I26,I27,I28,I29,I31,I33,I34,
I35,I36,I38,I39,I47,I48,I49,I50,I54,I60,I64,I68,I72);
>TESTV TNAME=VARIANT,
INAMES=(I53,I59,I69,I73);
Responses of 660 examinees are read from the data records, but only 200 randomly sampled
cases are included in the Phase 1 and Phase 2 analysis. The classical item statistics are shown
separately for main and variant items. The test scores for the item-test correlations are based on
the test scores from the main test items only.
Phase 2 output
Calibration of the main test items is computed as in the other examples. Without altering the item
parameter estimates of those items, parameter estimates for the variants are computed with re-
spect to the latent dimension determined by the main items.
******************************
CALIBRATION OF VARIANT ITEMS
VARIANT
******************************
Phase 3 output
In Phase 3, scores for all 660 examinees are computed from the main test item responses and
saved to an external file. Printing of the scores is suppressed, except for the first three cases. The
latent distribution estimated from all 660 cases is computed and printed. Scores are based on the
unrescaled Phase 2 parameters, which are then saved to an external file.
>SCORE METHOD=2,NOPRINT;
MAINTEST
MAINTEST 1.0000
TEST: MAINTEST
MEAN: 0.0915
S.D.: 0.8940
VARIANCE: 0.7992
TEST: MAINTEST
RMS: 0.4493
VARIANCE: 0.2019
This example illustrates the use of BILOG-MG with multiple groups and multiple subtests. It is
designed to illustrate some of the more complicated features of the program, including user-
specified priors on the latent distributions and priors on the item parameters.
Based on previous test performance, examinees are assigned to two groups for adaptive testing.
Out of a set of 45 items, group 1 is assigned items 1 through 25, and group 2 is assigned items 21
through 45. Thus, there are 5 items linking the test forms administered to the groups.
Twenty of the 25 items presented to group 1 belong to subtest 1 (items 1-15 and 21-25); twenty
items also belong to subtest 2 (items 6-25). Of the 25 items presented to group 2, 20 belong to
subtest 1 (items 21-40) and 20 to subtest 2 (items 21-25 and 31-45).
In all, there are 35 items from the set of 45 assigned to each subtest. (This extent of item overlap
between subtests is not realistic, but it illustrates that more than one subtest can be scored adap-
tively provided they each contain link items between the test forms.)
This example also illustrates how user-supplied priors for the latent distributions are specified
with IDIST=1 on the CALIB command. The points and weights for these distributions are sup-
plied in the QUAD commands. Note that with IDIST=1, there are separate QUAD commands for
each group for each subtest. Within each subtest the points are the same for each group. This is a
requirement of the program. But as the example shows, the points for the groups may differ by
subtest. If IDIST has been set to 2, sets of weights have to be supplied by group. The set of
points then applies to all subtests.
The PRIOR command for each subtest is placed after the QUAD commands for that subtest. The
presence of the PRIOR command is indicated using the READPRIOR option on the CALIB com-
mand. In this example, only the prior for the standard deviation of the thresholds is supplied on
the PRIOR command. Default values are used for the other prior distributions. The means of the
distributions are kept fixed at their specified values by using the NOFLOAT option on the CALIB
command.
The score distribution in the respondent population is estimated in the form of a discrete distribution on NQPT=16 points by adding the EMPIRICAL option to the CALIB command. This discrete distribution is used in place of the prior in MML estimation of the item parameters. When NGROUP>1, separate score distributions are estimated for the groups. The first group serves as
the reference group (REFERENCE=1). If the REFERENCE keyword is omitted, the first group will by
default be used as the reference group. When NGROUP>1, the FLOAT option is the default. By us-
ing NOFLOAT here, the means of the prior distributions on item parameters are kept fixed at the
specified values during estimation.
In the scoring phase, the empirical prior from phase 2 is used as prior distribution for the scale
scores (IDIST=3). Rescaling of scores to the scale and location in the sample of scale score esti-
mates is requested by setting RSCTYPE to 3. The presence of the INFO keyword indicates that in-
formation output is required. In this case INFO=1 and test information curves will be printed to
the phase 3 output file. In combination with the YCOMMON and POP options, the test information
curves will be expressed in comparable units and an estimate of the classical reliability coeffi-
cient, amongst other information, will be calculated for each subtest.
EXAMPL08.BLM -
GROUP-WISE ADAPTIVE TESTING WITH TWO SUBTESTS
>GLOBAL DFNAME='EXAMPL08.DAT', NPARM=2, NTEST=2, SAVE;
>SAVE SCORE='EXAMPL08.SCO';
>LENGTH NITEMS=(35,35);
>INPUT NTOT=45, SAMPLE=2000, NGROUP=2, KFNAME='EXAMPL08.DAT', NALT=5,
NFORMS=2, NIDCH=5;
>ITEMS INUM=(1(1)45), INAME=(C01(1)C45);
>TEST1 TNAME=SUBTEST1, INAME=(C01(1)C15,C21(1)C40);
>TEST2 TNAME=SUBTEST2, INAME=(C06(1)C25,C31(1)C45);
>FORM1 LENGTH=25, INUM=(1(1)25);
>FORM2 LENGTH=25, INUM=(21(1)45);
>GROUP1 GNAME=POP1, LENGTH=25, INUM=(1(1)25);
>GROUP2 GNAME=POP2, LENGTH=25, INUM=(21(1)45);
(5A1,T1,I1,T1,I1,T7,25A1)
>CALIB IDIST=1, READPRIOR, EMPIRICAL, NQPT=31, CYCLE=25, TPRIOR, NEWTON=5,
CRITERION=0.01, REFERENCE=1, NOFLOAT;
>QUAD1 POINTS=(-0.4598E+01,-0.3560E+01,-0.2522E+01,-0.1484E+01,-0.4453E+00,
0.5930E+00, 0.1631E+01, 0.2670E+01, 0.3708E+01, 0.4746E+01),
WEIGHTS=(0.2464E-05, 0.4435E-03, 0.1724E-01, 0.1682E+00, 0.3229E+00,
0.3679E+00, 0.1059E+00, 0.1685E-01, 0.6475E-03, 0.8673E-05);
>QUAD2 POINTS=(-0.4598E+01,-0.3560E+01,-0.2522E+01,-0.1484E+01,-0.4453E+00,
0.5930E+00, 0.1631E+01, 0.2670E+01, 0.3708E+01, 0.4746E+01),
WEIGHTS=(0.2996E-04, 0.1300E-02, 0.1474E-01, 0.1127E+00, 0.3251E+00,
0.3417E+00, 0.1816E+00, 0.2149E-01, 0.1307E-02, 0.3154E-04);
>PRIOR TSIGMA=(1.5(0)35);
>QUAD1 POINTS=(-0.4000E+01,-0.3111E+01,-0.2222E+01,-0.1333E+01,-0.4444E+00,
0.4444E+00, 0.1333E+01, 0.2222E+01, 0.3111E+01, 0.4000E+01),
WEIGHTS=(0.1190E-03, 0.2805E-02, 0.3002E-01, 0.1458E+00, 0.3213E+00,
0.3213E+00, 0.1458E+00, 0.3002E-01, 0.2805E-02, 0.1190E-03);
>QUAD2 POINTS=(-0.4000E+01,-0.3111E+01,-0.2222E+01,-0.1333E+01,-0.4444E+00,
0.4444E+00, 0.1333E+01, 0.2222E+01, 0.3111E+01, 0.4000E+01),
WEIGHTS=(0.1190E-03, 0.2805E-02, 0.3002E-01, 0.1458E+00, 0.3213E+00,
0.3213E+00, 0.1458E+00, 0.3002E-01, 0.2805E-02, 0.1190E-03);
>PRIOR TSIGMA=(1.5(0)35);
>SCORE IDIST=3, RSCTYPE=3, INFO=1, YCOMMON, POP, NOPRINT;
Phase 1 output
Phase 1 echoes the assignment of items to subtests, forms, and groups. Classical item statistics
are computed for each subtest in each group. Output for subtest 1 and group 1 (POP1) is given
below.
SUBTEST 1 SUBTEST1
GROUP 1 POP1 200 OBSERVATIONS
GROUP 2 POP2 200 OBSERVATIONS
SUBTEST 2 SUBTEST2
GROUP 1 POP1 200 OBSERVATIONS
GROUP 2 POP2 200 OBSERVATIONS
SUBTEST 1 SUBTEST1
ITEM STATISTICS FOR GROUP: 1 POP1
ITEM*TEST CORRELATION
ITEM NAME #TRIED #RIGHT PCT LOGIT/1.7 PEARSON BISERIAL
------------------------------------------------------------------------
1 C01 200.0 170.0 0.850 -1.02 0.408 0.625
2 C02 200.0 164.0 0.820 -0.89 0.396 0.580
3 C03 200.0 154.0 0.770 -0.71 0.451 0.626
4 C04 200.0 143.0 0.715 -0.54 0.400 0.532
5 C05 200.0 140.0 0.700 -0.50 0.586 0.772
6 C06 200.0 135.0 0.675 -0.43 0.441 0.574
...
19 C24 200.0 83.0 0.415 0.20 0.590 0.746
20 C25 200.0 76.0 0.380 0.29 0.558 0.711
------------------------------------------------------------------------
Phase 2 output
Phase 2 estimates empirical latent distributions for each group and item parameters for each sub-
test. The arbitrary mean and standard deviation of reference group 1 determine the origin and
unit of the ability scales.
Phase 3 output
The only new feature in Phase 3 is the use of the YCOMMON option to place the information plots for the subtests on the same scale. This permits visual comparison of the relative precision of the subtests according to the heights of the information curves. To illustrate, the information curve for subtest 1 is given below. The POP option also provides an IRT-based estimate of reliability for each subtest.
This example is based on a study by Bock and Zimowski (1998). The full document is available
on the Internet from the American Institutes for Research. As a small computing example, we
simulated two-stage testing in data for the “One-Hundred Word Spelling Test” previously ana-
lyzed by Bock, Thissen, and Zimowski (1997). A complete description of these data is given in Section 2.4.1.
On the basis of item parameters they report, we selected 12 first-stage items and 12 items for
each of three levels of the second-stage test.
Because of the limited number of items in the pool, we could not meet exactly the requirements
of the prototype design, but the resulting test illustrates well enough the main features of the
analysis. The item numbers in this and a later example correspond to the words presented in Bock, Thissen, and Zimowski's (1997) Table 1 in the NAEP report. All computations in the
analysis were carried out with the BILOG-MG program of Zimowski, Muraki, Mislevy and
Bock (1996). The program command files as well as the data file (with N = 660) are included in
the twostage folder of the BILOG-MG installation folder.
For assigning the cases in the data to second-stage levels under conditions that would apply in an
operational assessment, we re-estimated the parameters for the 12 first-stage items, computed
Bayes estimates of proficiency scale scores, and rescaled the scores to mean 0 and standard de-
viation 1 in the sample. The command file step0.blm, shown below, contains the necessary com-
mands.
The resulting score file was manipulated per these instructions (see the result in the STEP0.EAP file), and the assigned group membership was added to the original data file as column 12 (previously empty). The resulting split is: group 1, 236; group 2, 531; group 3, 233.
Cases with scores at or below -0.67 were assigned to group 1. Those at or above +0.67 were as-
signed to group 3, and the remainder to group 2. Of the 1000 cases in the original study, 274,
451, and 275 were assigned to groups 1, 2, and 3, respectively. With these assignment codes in-
serted in the case records, the latent distributions were estimated using the command file for the
first-stage analysis shown below (step1.blm in the twostage folder).
For the second-stage analysis, we used the latent distributions estimated in the first-stage analysis
as the prior distributions for maximum marginal likelihood analysis of the combined first- and
second-stage data. The points and weights representing the distributions are shown in the corre-
sponding BILOG-MG command file.
Inasmuch as there are no second-stage link items in this example, we use the first-stage items as
an anchor test. The six easiest of these items provide the links between levels 1 and 2; the six
most difficult provide the links between levels 2 and 3.
Since the spelling data contain responses of all cases to all items, we can compare the accuracy of the estimates based on the 24 items per case in the two-stage data with that of estimates based on 48 items per case in a conventional one-stage test. Syntax is as given in step3.blm,
shown below.
The latter estimates are also shown in Table 10.1. Despite the small number of items and rela-
tively small sample size in this computing example, the agreement between the estimates is rea-
sonably good for the majority of items. There are notable exceptions, however, among the sec-
ond-stage items: of these, items 6, 7, 77, and 84 show discrepancies in both slope and threshold;
all of these are from level 3 and have extremely high thresholds in the one-stage analysis, well
beyond the +1.5 maximum we are assuming for second-stage items. Items 12 and 17 from level 3
are discrepant only in slope, as are items 26 and 38 from level 2, and items 50 and 64 from level
1.
Table 10.1: Comparison of two-stage and one-stage item parameter estimates in the spell-
ing data (shown for first 10 items)
In all of these cases the two-stage slope is larger than the one-stage slope. This effect is balanced, however, by the tendency of the first-stage items (1, 4, 8, 10, 23, 25, 28, 29, 39, 47, 59, and 87) to show smaller slopes in the two-stage analysis. As a result, the average slope in the two-stage results is only slightly larger than the one-stage average.
The average thresholds also show only a small difference. In principle, the parameters of a two-
parameter logistic response function can be calculated from probabilities at any two distinct, fi-
nite values on the measurement continuum. Similarly, those of the three-parameter model can be
calculated from three such points. This suggests that, even with fallible data, estimation should improve in the two-stage case as sample size increases. Some preliminary simulations we have attempted suggest that with sample sizes on the order of 5 or 10 thousand, and better placement of the items, the discrepancies we see in the prototype 1 results largely disappear.
The latent distributions estimated with items from both stages are depicted in Figure 10.1. The
distributions for the three assignment groups are shown normalized to unity. The estimated popu-
lation distribution, which is the sum of the distributions for the individual groups weighted pro-
portional to sample size, is constrained to mean 0 and standard deviation 1 during estimation of
the component distribution. It is essentially normal and almost identical to the population distri-
bution estimated in the one-stage analysis.
Figure 10.1. Prototype 1: estimated latent distributions from two-stage and one-stage spell-
ing data
One may infer the measurement properties of the simulated two-stage spelling test from the in-
formation and efficiency calculations shown in Figure 10.2 and Figure 10.3, respectively. When
interpreting information curves, the following rules of thumb are helpful. An information value
of 5 corresponds to a measurement error variance of 1/5 = 0.2. In a population in which the score
variance is set to unity, the reliability of a score with this error variance is 1.0 - 0.2 = 0.8. Simi-
larly, the reliability corresponding to an information value of 10 is 0.9. In the context of low-
stakes score reporting, we are aiming for reliabilities anywhere between these figures. As is ap-
parent in Figure 10.2, this range of reliability is achieved in the two-stage results for spelling
over much of the latent distribution.
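These rules of thumb follow from the relation between reliability and information when the population score variance is set to unity:

$$\rho(\theta) = 1 - \frac{1}{I(\theta)},$$

so information values of 5 and 10 correspond to reliabilities of 0.8 and 0.9, respectively.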
Finally, the efficiency curves in Figure 10.3 for the three levels show us the saving of test length
and administration time, including both first- and second-stage testing, due specifically to the
two-stage procedure in comparison with a one-stage test of the same length and item content.
In this case we hope to see efficiencies greater than 2.0, at least away from the population mean
where conventional tests with peaked centers typically have reduced precision. The prototype 1
design and analysis meet this criterion.
In this example, commands are illustrated for estimating item parameters and computing score means, standard deviations, variances, average standard errors, error variances, and inverse-information reliabilities of maximum likelihood estimates of ability.
Note: to obtain the same results for EAP estimation, set METHOD=2 in the SCORE command; for
MAP estimation, set METHOD=3.
This example contains the syntax used for computing parallel-form correlations and between-test correlations for tests of different lengths. Set METHOD equal to 1, 2, or 3 in the SCORE command to obtain correlations for ML, EAP, and MAP estimated abilities, respectively.
>GLOBAL DFNAME='SIM01C0.SIM',NPARM=2,NTEST=12,SAVE;
>SAVE SCORE='MAPCOR1.SCO';
>LENGTH NITEMS=(4,4,8,8,16,16,32,32,64,64,128,128);
>INPUT NTOTAL=504,NIDCH=5,SAMPLE=3000;
>ITEMS INUMBERS=(1(1)504),INAME=(ITEM001(1)ITEM504);
>TEST1 TNAME=LENGTH4a,INUMBERS=(1(1)4);
>TEST2 TNAME=LENGTH4b,INUMBERS=(5(1)8);
>TEST3 TNAME=LENGTH8a,INUMBERS=(9(1)16);
>TEST4 TNAME=LENGTH8b,INUMBERS=(17(1)24);
>TEST5 TNAME=LEN16a, INUMBERS=(25(1)40);
>TEST6 TNAME=LEN16b, INUMBERS=(41(1)56);
>TEST7 TNAME=LEN32a, INUMBERS=(57(1)88);
>TEST8 TNAME=LEN32b, INUMBERS=(89(1)120);
>TEST9 TNAME=LEN64a, INUMBERS=(121(1)184);
>TEST10 TNAME=LEN64b, INUMBERS=(185(1)248);
>TEST11 TNAME=LEN128a ,INUMBERS=(249(1)376);
>TEST12 TNAME=LEN128b, INUMBERS=(377(1)504);
(11A1,1X,504A1)
>CALIB NQPT=40,CYCLE=25,NEWTON=3,CRIT=0.001,NOSPRIOR,NOADJUST;
>SCORE METHOD=1,INFO=1,YCOMMON,POP,NOPRINT;
10.12 EAP scoring of the NAEP forms and state main and variant tests
The syntax in this example was used to score NAEP forms and state main and variant tests. It is
included here as an example of a more complicated analysis and contains numerous TEST and
FORM commands.
The use of the INUMBERS keyword on the FORM commands to assign items to the various forms is
of interest, as is the naming convention used with the INAMES keyword on the ITEMS command.
Finally, note that none of the tests are calibrated (SELECT=0 for all tests on the CALIB command).
Scoring is done according to a previously generated item parameter file gr4fin.par read with the
IFNAME keyword on the GLOBAL command.
The variant items in a test are not intended to be scored as a test. They are included in the analysis to obtain preliminary information on their item characteristics with respect to the latent variable measured by the main test.
>CALIB SELECT=(0(0)6);
>SCORE METHOD=2, NOPRINT, NQPT=(25(0)6);
This is an attempt to reconstruct the domain scores demonstration application reported in “The
Domain Score Concept and IRT: Implications for Standards Setting” by Bock, Thissen & Zi-
mowski (2001). We use the dataset spell.dat as included with the TESTFACT program (see
Chapter 13). All 100 items of the 100-word spelling test seem to be there, but there are only 660
records (instead of the 1,000 that Bock et al. report). In a first run (spell1.blm), we calibrate all
100 items and save the parameters in an external file. The syntax is shown below.
The SCORE command is included to obtain the percent correct for each examinee
(= the true domain scores).
The item parameters of the first 5 items, as reported in the item parameter file spell1blm.par, are shown in Table 10.2.
The parameter values are in close agreement with Table 1 from Bock et al. (results for the first 5 items are shown in Table 10.3 below). This also confirms that we have the correct dataset, with the items in the order of that table, albeit not all of the records.
In a second run (spell2.blm), we let the program compute the expected domain scores for all 660
examinees from the saved parameter file. The DOMAIN and FILE keywords on the SCORE com-
mand are used. We skip the calibration phase with the SELECT keyword on the CALIB command.
The scores are saved to file by using the SCORE keyword on the SAVE command.
The contents of spell2.blm are shown below. All the command files and data discussed here are
available to the user in the domscore subfolder of the BILOG-MG installation folder.
The parameter file that we read in through the FILE keyword on the SCORE command had to be
created from the saved parameter file (spell1blm.par) in the spell1.blm run. First we deleted
everything before the first line with parameter estimates. Then we deleted all the columns that
were not slope, threshold, or guessing parameters, leaving just those three columns and in that
order. Then, we added a column with weights as the first column, in the same format. We used
1.0000, because we want all items weighed equally. We then added the variable format statement
(4F10.5) as the first line in the file and renamed it to spell1.par.
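For illustration, the first lines of such a file would then look as follows (the parameter values shown are hypothetical placeholders, not values from spell1blm.par):

(4F10.5)
   1.00000   1.31200  -0.84700   0.00000
   1.00000   0.92600   0.35100   0.00000

Each record carries the weight, slope, threshold, and guessing parameter for one item, read under the (4F10.5) format.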
The estimated expected domain scores from spell2.blm recover the true domain scores fairly well, as spell2.ph3 shows for the first five examinees. If the estimated expected domain scores were not close, something would probably be wrong, so this is a good check.
In a third and final step (spell3.blm), we take a random sample of 20 items, adapt the parameter file (spell3.par, prepared as described previously), and produce a new score file (spell3.sco).
As can be seen, the “population domain scores” are recovered reasonably well with the random sample of only 20 items.
11 PARSCALE examples
11.1 Item calibration and examinee Bayes scoring with the rating-scale graded
model
This example illustrates calibration and scoring of a test or scale containing 20 multiple category
items. The simulated data represent responses of 1000 examinees drawn randomly from a popu-
lation with a mean trait score of 0.0 and standard deviation of 1.0.
Data are read from the file exampl01.dat in the examples folder using the DFNAME keyword on
the FILES command. The first few lines of the data file are shown below. The generating trait
value of each examinee is the second column of information in the data file. The case ID, given
at the beginning of each line, is 4 characters long and is indicated as such using the NIDCHAR
keyword on the INPUT command. It is also reflected in the format statement as 4A1.
All 20 items are used in a single test (NTEST=1 on INPUT command, with LENGTH=20). All 20
items have common categories and are assigned to the same BLOCK (NBLOCK=1 on TEST;
NITEMS=20 on BLOCK).
All items have four categories (NCAT=4 on BLOCK command) and varying difficulties and dis-
criminating powers. The graded model is assumed (GRADED on the CALIB command), and a logistic response model (LOGISTIC on the CALIB command) is requested. The choice between a logistic or
normal response function metric is effective only if the graded response model is used. The re-
sponse function of the graded model can be either the normal ogive or its logistic approximation.
Graded is the default. If logistic is selected, the item parameters can be in the natural or the logis-
tic metric. Natural is the default. For the normal metric, set SCALE equal to 1.7. Neither
LOGISTIC nor SCALE is needed when PARTIAL is selected. Because the generalized model allows
for varying item discriminating powers, both a slope and threshold are estimated for each item.
The CADJUST keyword on the BLOCK command is used to set the mean of the category parameters
to 0 as simultaneous estimation of slope parameters and all category parameters is not obtain-
able.
The ITEMFIT keyword is used to set the number of frequency score groups for the computation
of item fit statistics to 10. Note that there is no default value for the ITEMFIT keyword.
The CYCLES keyword specifies 25 EM iterations, with maximum 2 inner EM iterations for the
item and category parameter estimation. Five Newton-Gauss iterations are requested (NEWTON=5
on CALIB). A convergence criterion of 0.005 is specified by using the CRIT keyword on CALIB.
30 quadrature points are to be used in the EM and Newton estimation instead of the default of 10 for cases where LENGTH is less than or equal to 50 on the INPUT command. The calibration procedure
depends on the evaluation of integrals using Gauss-Hermite quadrature. In general, the accuracy
of numerical integration increases with the number of quadrature points used.
The score estimation method is specified (EAP option on SCORE command). Scale scores for each
subtest are estimated by the Bayes (EAP) method, and their posterior standard deviations serve
as standard errors.
The scores, which are rescaled to zero mean and unit standard deviation in the sample (SMEAN
and SSD on SCORE), are saved in the file exampl01.sco using the SCORE keyword on the SAVE
command. The PFQ keyword is specified. This keyword is usually used to make ML scores more
computable but would also improve the EAP estimates somewhat. In addition, the estimated item
parameters are saved in the file exampl01.par (PARM keyword on the SAVE command).
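Pulling the pieces together, a sketch of the command file follows. The TEST, BLOCK, CALIB, and SCORE lines are taken from the Phase 0 echoes shown below; the FILES, SAVE, INPUT, and variable format lines are reconstructions assembled from the keywords discussed above (the SAVE option on FILES and the column skip in the format statement are assumptions, and the title lines are omitted):

>FILES DFNAME='EXAMPL01.DAT', SAVE;
>SAVE PARM='EXAMPL01.PAR', SCORE='EXAMPL01.SCO';
>INPUT NIDCHAR=4, NTEST=1, LENGTH=20;
(4A1,10X,20A1)
>TEST1 TNAME=SCALE1, ITEM=(1(1)20), NBLOCK=1;
>BLOCK1 BNAME=SBLOCK1, NITEMS=20, NCAT=4, CADJ=0.0;
>CAL GRADED, LOGISTIC, SCALE=1.7, NQPTS=30, CYCLE=(25,2,2,2,2),
NEWTON=5, CRIT=0.005, ITEMFIT=10;
>SCORE EAP, NQPTS=30, SMEAN=0.0, SSD=1.0, NAME=EAP, PFQ=5;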
Phase 0 output
At the beginning of the output for Phase 0, the command file is echoed. Information on the num-
ber of tests, items, and type of model to be fitted as interpreted by PARSCALE is also given.
>TEST1 TNAME=SCALE1,ITEM=(1(1)20),NBLOCK=1;
BLOCK CARD: 1
>BLOCK1 BNAME=SBLOCK1,NITEMS=20,NCAT=4,CADJ=0.0;
>CAL GRADED,LOGISTIC,SCALE=1.7,NQPTS=30,CYCLE=(25,2,2,2,2),
NEWTON=5,CRIT=0.005,ITEMFIT=10;
MODEL SPECIFICATIONS
======================
This section of the output file contains information on the settings to be used during the item pa-
rameter estimation in Phase 2.
CALIBRATION PARAMETERS
======================
No prior distribution was requested in the CALIB command, and consequently the default prior, a
normal distribution on equally spaced points, will be used (DIST=2 on CALIB). The number of
quadrature points to be used during item parameter estimation was set to 30 (NQPT on CALIB).
The program-generated quadrature points and weights are printed to the Phase 0 output file, as
shown below.
1 2 3 4 5
POINT -0.4000E+01 -0.3724E+01 -0.3448E+01 -0.3172E+01 -0.2897E+01
WEIGHT 0.3692E-04 0.1071E-03 0.2881E-03 0.7181E-03 0.1659E-02
6 7 8 9 10
POINT -0.2621E+01 -0.2345E+01 -0.2069E+01 -0.1793E+01 -0.1517E+01
WEIGHT 0.3550E-02 0.7042E-02 0.1294E-01 0.2205E-01 0.3481E-01
11 12 13 14 15
POINT -0.1241E+01 -0.9655E+00 -0.6897E+00 -0.4138E+00 -0.1379E+00
WEIGHT 0.5093E-01 0.6905E-01 0.8676E-01 0.1010E+00 0.1090E+00
16 17 18 19 20
POINT 0.1379E+00 0.4138E+00 0.6897E+00 0.9655E+00 0.1241E+01
WEIGHT 0.1090E+00 0.1010E+00 0.8676E-01 0.6905E-01 0.5093E-01
21 22 23 24 25
POINT 0.1517E+01 0.1793E+01 0.2069E+01 0.2345E+01 0.2621E+01
WEIGHT 0.3481E-01 0.2205E-01 0.1294E-01 0.7042E-02 0.3550E-02
26 27 28 29 30
POINT 0.2897E+01 0.3172E+01 0.3448E+01 0.3724E+01 0.4000E+01
WEIGHT 0.1659E-02 0.7181E-03 0.2881E-03 0.1071E-03 0.3692E-04
The control settings to be used during calibration are followed by settings to be used during the
scoring phase (Phase 3). The EAP method of scoring is requested (EAP option) and, as in the
calibration phase, 30 quadrature points were requested. Since no prior distribution was requested
using the DIST keyword, by default a normal distribution on equally spaced points will be used
(DIST=2 on SCORE). Note that the DIST keyword applies only when EAP scoring has been se-
lected.
>SCORE EAP,NQPTS=30,SMEAN=0.0,SSD=1.0,NAME=EAP,PFQ=5;
1 2 3 4 5
POINT -0.4000E+01 -0.3724E+01 -0.3448E+01 -0.3172E+01 -0.2897E+01
WEIGHT 0.3692E-04 0.1071E-03 0.2881E-03 0.7181E-03 0.1659E-02
6 7 8 9 10
POINT -0.2621E+01 -0.2345E+01 -0.2069E+01 -0.1793E+01 -0.1517E+01
WEIGHT 0.3550E-02 0.7042E-02 0.1294E-01 0.2205E-01 0.3481E-01
11 12 13 14 15
POINT -0.1241E+01 -0.9655E+00 -0.6897E+00 -0.4138E+00 -0.1379E+00
WEIGHT 0.5093E-01 0.6905E-01 0.8676E-01 0.1010E+00 0.1090E+00
16 17 18 19 20
POINT 0.1379E+00 0.4138E+00 0.6897E+00 0.9655E+00 0.1241E+01
WEIGHT 0.1090E+00 0.1010E+00 0.8676E-01 0.6905E-01 0.5093E-01
21 22 23 24 25
POINT 0.1517E+01 0.1793E+01 0.2069E+01 0.2345E+01 0.2621E+01
WEIGHT 0.3481E-01 0.2205E-01 0.1294E-01 0.7042E-02 0.3550E-02
26 27 28 29 30
POINT 0.2897E+01 0.3172E+01 0.3448E+01 0.3724E+01 0.4000E+01
WEIGHT 0.1659E-02 0.7181E-03 0.2881E-03 0.1071E-03 0.3692E-04
The values assigned to the rescaling constants SMEAN and SSD in the SCORE command are shown:
SET NUMBER : 1
SCORE NAME : EAP
NUMBER OF ITEMS : 20
RESCALE CONSTANT: MEAN = 0.00 S.D. = 1.00
ITEMS : 1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
0001 0002 0003 0004 0005 0006 0007 0008 0009 0010
0011 0012 0013 0014 0015 0016 0017 0018 0019 0020
Input and output files as requested with the DFNAME keyword on the FILES command and the
PARM and SCORE keywords on the SAVE command are listed:
[INPUT FILES]
[OUTPUT FILES]
[SCRATCH FILES]
To allow the user to verify that data have been read in correctly from the raw data file, the first
two records from the data file are echoed in the output. The INPUT RESPONSES fields give the
original responses while the RECODED RESPONSES reflect any recoding of the responses. Re-
coding of responses is controlled by the ORIGINAL and MODIFIED keywords on the BLOCK com-
mand.
OBSERVATION # 1
GROUP: 1
ID: 0001
INPUT RESPONSES: 4 2 4 4 4 2 3 2 2 2 3 3 4 3 4 3 3 3 3 2
RECODED RESPONSES:4 2 4 4 4 2 3 2 2 2 3 3 4 3 4 3 3 3 3 2
OBSERVATION # 2
GROUP: 1
ID: 0002
INPUT RESPONSES: 1 2 2 2 1 1 2 1 1 2 2 3 2 4 1 2 1 4 3 2
RECODED RESPONSES:1 2 2 2 1 1 2 1 1 2 2 3 2 4 1 2 1 4 3 2
Finally, the number of observations to be used in the analysis is recorded. By default, all obser-
vations will be used. The number of observations to be used can be manipulated using the
SAMPLE or TAKE keywords on the INPUT command.
Phase 1 output
The title given in the TITLE command and name assigned to the test in the TEST command in the
command file are echoed in the output file.
MAINTEST: SCALE1
The master file created during Phase 0 is used as input. Note that the master file exampl01.mfl
may be saved using the MASTER keyword on the SAVE command for use as input in a subsequent
analysis (MFNAME keyword on the FILES command). The keywords TAKE and SAMPLE on the
INPUT command control the number of records read from the raw data file. As the default value
of SAMPLE is 100%, neither keyword was used and all data were used by default.
Summary item statistics for the 20 items are given next. Since no not-represented (NFNAME on
FILES) or omit key (OFNAME on FILES) was used, no frequencies or percentages are reported un-
der the “NOT PRESENT” or “OMIT” headings. Under the “CATEGORIES” heading, frequencies and
percentages of responses for each of the 4 categories are given item-by-item. Cumulative fre-
quencies and percentages for the categories over all items are given at the end of the table.
Note that, if empty categories are encountered, the user must recode the items on which this occurs before proceeding with the analysis.
| |
0002 | |
FREQ.| 1000 0 0| 204 284 310 202
PERC.| 0.0 0.0| 20.4 28.4 31.0 20.2
…
0020 | |
FREQ.| 1000 0 0| 305 211 212 272
PERC.| 0.0 0.0| 30.5 21.1 21.2 27.2
| |
---------------------------------------------------------------
CUMMUL.| |
FREQ.| | 4844 5186 5204 4766
PERC.| | 24.2 25.9 26.0 23.8
---------------------------------------------------------------
Item means, initial slope estimates, and Pearson and polyserial item-test correlations are shown in the next table.

Pearson correlation

The Pearson correlation between the test score,

$$t_i = \sum_{j=1}^{J} s_{ij},$$

and the m-category polytomous item score, $s_{ij} = 1, 2, \ldots, m$, is the point polyserial correlation $r_{PP,j}$,

$$r_{PP,j} = \frac{\sum_{i=1}^{n} t_i s_{ij} - n\,\bar{t}\,\bar{s}_j}{n\,S_t\,S_j},$$

where n is the sample size, $\bar{t}$ is the mean test score, $\bar{s}_j$ the mean item score, and $S_t$ and $S_j$ the corresponding standard deviations. In this example n = 1000. For item 1,

$$\sum_{i} s_{i1} = (1 \times 194) + (2 \times 303) + (3 \times 313) + (4 \times 190) = 2499,$$

so that

$$\bar{s}_1 = \frac{\sum_{i} s_{i1}}{n} = \frac{2499}{1000} = 2.499.$$

Also,

$$\sum_{i} s_{i1}^2 = (1^2 \times 194) + (2^2 \times 303) + (3^2 \times 313) + (4^2 \times 190) = 7263,$$

so that

$$S_1^2 = \frac{7263}{1000} - (2.499)^2 = 1.018, \qquad S_1 = 1.009.$$
Polyserial correlation

The polyserial correlation $r_{P,j}$ can be expressed in terms of the point polyserial correlation as

$$r_{P,j} = \frac{r_{PP,j}\,\sigma_j}{\sum_{k=1}^{m-1} h(z_{jk})},$$

where

$$h(z_{jk}) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{1}{2} z_{jk}^2\right).$$

The polyserial correlation estimates the item factor loading, $\alpha_j$, say. If the arbitrary scale of the item latent variable, $y_j$, is chosen so that the variance of $y_j$ equals 1, then

$$y_j = \alpha_j(\theta - b_{jk}) + \epsilon_j,$$

where $\theta$ is the factor score with mean 0 and variance 1, and the error, $\epsilon_j$, has mean 0 and variance $1 - r_{P,j}^2$.

For purposes of MML parameter estimation in IRT, it is convenient to rescale the item latent variable so that the error variance equals 1. The factor loading then becomes the item slope,

$$a_j = \frac{r_{P,j}}{\sqrt{1 - r_{P,j}^2}}.$$
This provisional estimate of the slope is then used as the starting value in the iterative EM solu-
tions of the marginal maximum likelihood equations for estimating the parameters of the poly-
tomous item response models. The initial locations shown in the last column of the table are the
averages of the category thresholds for each item.
Item-category threshold parameters can be calculated once the polyserial coefficients have been
obtained. The expression for the threshold parameter in terms of the cumulative category propor-
tions and the biserial correlation coefficient (Lord & Novick, 1968) is
b_{jk} = \frac{z_{jk}}{r_{B,j}},

with r_{B,j} the biserial correlation for item j and z_{jk} the z score that cuts off the proportion p_{jk} of the
cases to item j in a unit-normal distribution; that is,

p_{jk} = \frac{n_{jk}}{\sum_{v=1}^{m} n_{jv}},
where n jk is the frequency of the categorical response for item j and category k. These provi-
sional thresholds of the categories serve as starting values in MML estimation of the correspond-
ing item parameters. For the rating-scale model, whether or not all items have the same thresh-
olds, the category proportions are computed from frequencies accumulated over all items; i.e.,
p_k = \frac{\sum_{j=1}^{J} n_{jk}}{\sum_{j=1}^{J} \sum_{k=1}^{m} n_{jk}}.
In Muraki’s (1990) formulation of the rating-scale model, the category threshold parameter, ck ,
is expressed as a deviation from the item threshold parameter, b j ; that is
y_j = \alpha(\theta - b_j + c_k) + \varepsilon_j,

under the constraint that \sum_{k=1}^{m-1} c_k = 0.
In the context of the rating-scale model, b j is referred to as a “location” parameter. The INITIAL
LOCATION column provides the values of the average of the category thresholds for each item.
---------------------------------------------------------------------------
BLOCK | RESPONSE TOTAL SCORE | PEARSON & | INITIAL INITIAL
ITEM | MEAN MEAN | POLYSERIAL | SLOPE LOCATION
| S.D.* S.D.* | CORRELATION |
---------------------------------------------------------------------------
SBLOCK1 | | |
1 0001 | 2.499 49.892 | 0.778 | 1.488 -0.017
| 1.009* 14.754* | 0.830 |
2 0002 | 2.510 49.892 | 0.797 | 1.628 -0.036
| 1.030* 14.754* | 0.852 |
3 0003 | 2.481 49.892 | 0.785 | 1.545 0.013
| 1.031* 14.754* | 0.839 |
4 0004 | 2.515 49.892 | 0.805 | 1.695 -0.053
| 1.037* 14.754* | 0.861 |
5 0005 | 2.511 49.892 | 0.811 | 1.739 -0.038
| 1.032* 14.754* | 0.867 |
6 0006 | 2.137 49.892 | 0.728 | 1.293 0.837
| 1.037* 14.754* | 0.791 |
7 0007 | 2.118 49.892 | 0.735 | 1.336 0.855
| 1.033* 14.754* | 0.801 |
8 0008 | 2.144 49.892 | 0.754 | 1.426 0.758
| 1.029* 14.754* | 0.819 |
9 0009 | 2.136 49.892 | 0.736 | 1.329 0.830
| 1.029* 14.754* | 0.799 |
10 0010 | 2.128 49.892 | 0.730 | 1.293 0.882
| 1.002* 14.754* | 0.791 |
11 0011 | 2.870 49.892 | 0.645 | 0.985 -1.168
| 1.041* 14.754* | 0.702 |
12 0012 | 2.874 49.892 | 0.655 | 1.029 -1.094
| 1.071* 14.754* | 0.717 |
13 0013 | 2.874 49.892 | 0.690 | 1.144 -1.017
| 1.053* 14.754* | 0.753 |
14 0014 | 2.831 49.892 | 0.673 | 1.072 -0.953
| 1.057* 14.754* | 0.731 |
15 0015 | 2.847 49.892 | 0.679 | 1.114 -0.938
| 1.094* 14.754* | 0.744 |
16 0016 | 2.492 49.892 | 0.590 | 0.839 0.010
| 1.161* 14.754* | 0.643 |
17 0017 | 2.541 49.892 | 0.548 | 0.738 -0.173
| 1.125* 14.754* | 0.594 |
18 0018 | 2.463 49.892 | 0.589 | 0.834 0.102
| 1.152* 14.754* | 0.641 |
19 0019 | 2.470 49.892 | 0.573 | 0.798 0.085
| 1.160* 14.754* | 0.624 |
20 0020 | 2.451 49.892 | 0.583 | 0.830 0.048
| 1.184* 14.754* | 0.639 |
---------------------------------------------------------------------------
CATEGORY | | MEAN | S.D. | PARAMETER
1 | | 36.116 | 10.656 | 0.927
2 | | 46.091 | 11.156 | 0.002
3 | | 54.107 | 11.165 | -0.930
4 | | 63.427 | 10.739 | 0.000
----------------------------------------------------------------------------
At the end of this table, descriptive statistics for the raw total scores of examinees who re-
sponded in each of the 4 categories are given. The highest average total score of 63.427 was for
respondents who responded in the 4th category.
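As a check on the zero-sum constraint on the category parameters introduced earlier (and enforced through the CADJUST keyword discussed under the Phase 2 output below), the three category parameters reported in the table above sum to zero within rounding:

c_1 + c_2 + c_3 = 0.927 + 0.002 + (-0.930) = -0.001 \approx 0.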
Phase 2 output
An MML approach is used for estimation, and either a normal or empirical latent distribution
with mean 0 and standard deviation 1 is assumed. The type of distribution used is controlled by
the DIST keyword on the CALIB command. By default, a normal distribution with equally spaced
points is used and, for analyses where the LENGTH keyword on the INPUT command is set to a
value less than or equal to 50, 10 quadrature points will be used.
Because of the potentially wide spacing of category boundary parameters on the latent dimen-
sion, it is advisable to use a greater number of quadrature points than in BILOG-MG. In this ex-
ample, the number of quadrature points was set to 30 (NQPT on the CALIB command).
The EM algorithm is used in the solution of the maximum likelihood equations for parameters,
starting from the initial values described in the Phase 1 output. At each iteration, the -2 ln L is
given, along with information on the parameter for which the largest change between cycles was
observed. The number of EM cycles is controlled by the CYCLE keyword on the CALIB command,
and the convergence criterion may be set using the CRIT keyword on the same command. By de-
fault, 10 EM cycles would be performed when LENGTH ≤ 50 on the INPUT command. In this ex-
ample, 25 EM cycles with a maximum of 2 inner EM iterations for the item and category pa-
rameter estimation were specified. The default convergence criterion is 0.001. For this example,
it was set to 0.005.
The EM algorithm converged after 3 cycles were completed. After reaching either the maximum
number of EM cycles or convergence, the program will perform the Newton-Gauss (Fisher scor-
ing) cycles requested through the NEWTON keyword on the CALIB command. In this example,
NEWTON was set to 5. The information matrix for all item parameters is approximated during
each Newton step and then used at convergence to provide large-sample standard errors of esti-
mation for the item parameter estimates.
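Collected into a single command, the calibration settings described above would look roughly as follows. This is a sketch rather than the actual exampl01 command file: the model-choice options are omitted, and the form of the CYCLE argument list is patterned on the exampl05 command file shown later in this chapter.

>CALIB NQPT=30, CYCLE=(25,2,2,2,2), NEWTON=5, CRIT=0.005 ;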
The Newton cycles converged after 2 iterations. As all items were assigned to the same BLOCK,
only one table is printed to the output file.
At the top of the table, the estimated category parameters are given. For each m category item,
there are m-1 category threshold parameters with
b_{j1} \le b_{j2} \le \cdots \le b_{j,m-1}.
For a polytomous item response model, the discriminating power of a specific categorical re-
sponse depends on the width of the adjacent category thresholds as well as on a slope parameter.
Because of this property, the slope parameter and all m_j category parameters cannot be estimated
simultaneously. If the model includes a slope parameter for each item j, as in this example, the
location of the category parameters must be fixed. The CADJUST keyword on the BLOCK command
was set to 0, and thus the mean of the category parameters is 0.
For each item, the slope and location parameters, along with corresponding standard errors, are
given. All guessing parameters are zero for this model.
+------+-----+---------+---------+---------+---------+---------+---------+
| ITEM |BLOCK| SLOPE | S.E. |LOCATION | S.E. |GUESSING | S.E. |
+======+=====+=========+=========+=========+=========+=========+=========+
| 0001 | 1 | 1.486 | 0.063 | 0.006 | 0.042 | 0.000 | 0.000 |
| 0002 | 1 | 1.526 | 0.067 | -0.012 | 0.040 | 0.000 | 0.000 |
| 0003 | 1 | 1.472 | 0.065 | 0.022 | 0.041 | 0.000 | 0.000 |
The average parameter estimates over all 20 items are given next. If the items are regarded as
random samples from a real or hypothetical universe, these quantities estimate the means and
standard deviations of the parameters. They could serve as item parameter priors in future item
calibrations in this universe.
+----------+---------+---------+----+
|PARAMETER | MEAN | STN DEV | N |
+==========+=========+=========+====+
|SLOPE | 1.111| 0.317| 20|
|LOG(SLOPE)| 0.065| 0.296| 20|
|THRESHOLD | 0.003| 0.370| 20|
|GUESSING | 0.000| 0.000| 0|
+----------+---------+---------+----+
The estimated latent distribution is given next. This distribution is the sum of the posterior distri-
butions of θ for all respondents in the sample. It is represented here as point masses, scaled to
sum to 1.0, at 30 equally spaced points on the θ dimension. If the population distribution is
normal and the test is sufficiently informative over the range of θ, the posterior distributions of
all respondents, and hence the estimated latent distribution, will approach normality.
1 2 3 4 5
POINT -0.4000E+01 -0.3724E+01 -0.3448E+01 -0.3172E+01 -0.2897E+01
WEIGHT 0.6912E-04 0.1967E-03 0.5110E-03 0.1201E-02 0.2420E-02
6 7 8 9 10
POINT -0.2621E+01 -0.2345E+01 -0.2069E+01 -0.1793E+01 -0.1517E+01
WEIGHT 0.4662E-02 0.7645E-02 0.1189E-01 0.2005E-01 0.3585E-01
11 12 13 14 15
POINT -0.1241E+01 -0.9655E+00 -0.6897E+00 -0.4138E+00 -0.1379E+00
WEIGHT 0.5568E-01 0.7094E-01 0.8078E-01 0.9708E-01 0.1104E+00
16 17 18 19 20
POINT 0.1379E+00 0.4138E+00 0.6897E+00 0.9655E+00 0.1241E+01
WEIGHT 0.1086E+00 0.9806E-01 0.8301E-01 0.6999E-01 0.5416E-01
21 22 23 24 25
POINT 0.1517E+01 0.1793E+01 0.2069E+01 0.2345E+01 0.2621E+01
WEIGHT 0.3797E-01 0.2403E-01 0.1328E-01 0.6619E-02 0.2962E-02
26 27 28 29 30
POINT 0.2897E+01 0.3172E+01 0.3448E+01 0.3724E+01 0.4000E+01
WEIGHT 0.1197E-02 0.4451E-03 0.1547E-03 0.5062E-04 0.1563E-04
The goodness-of-fit of the polytomous item response model can be tested item by item. Summa-
tion of the item fit statistics also provides a goodness-of-fit test for the test as a whole. The fit
statistics are useful in evaluating the fit of alternative models to the same response data when the
models are nested in their parameters.
Respondents are assigned to H intervals on the θ -continuum. The number of intervals is set us-
ing the ITEMFIT keyword on the CALIB command. The expected a posteriori (EAP) score of each
respondent is used for assigning respondents to the H intervals. The observed frequency rhjk of
the k-th category response to item j in interval h, and N hj , the number of respondents assigned to
item j in the h-th interval, are computed. The estimated θ s are rescaled so that the variance of
the sample distribution equals that of the latent distribution on which the MML estimation of the
parameters is based.
Thus an H by m j contingency table is obtained for each item j. In order to avoid expected values
less than 5, neighboring intervals and/or categories may be merged. For each interval, the inter-
val mean, θ h , and the value of the fitted response function Pjk (θ h ) , are computed.
G_j^2 = 2 \sum_{h=1}^{H_j} \sum_{k=1}^{m_j} r_{hjk} \ln \frac{r_{hjk}}{N_{hj}\,P_{jk}(\bar{\theta}_h)},
where H_j is the number of intervals left after neighboring intervals are merged. The degrees of
freedom is

\sum_{h=1}^{H_j} (m_j^* - 1),

where m_j^* is the number of categories left after merging.
The likelihood ratio χ 2 -statistic for the test as a whole is simply the summation of the separate
χ 2 -statistics. The number of degrees of freedom is also the summation of the degrees of freedom
for each item.
-----------------------------------------------
| BLOCK | ITEM | CHI-SQUARE | D.F. | PROB. |
-----------------------------------------------
| SBLOCK1 | 0001 | 25.00714 | 20. | 0.201 |
| | 0002 | 23.18082 | 20. | 0.280 |
| | 0003 | 25.66873 | 20. | 0.177 |
| | 0004 | 31.56813 | 19. | 0.035 |
| | 0005 | 19.88483 | 19. | 0.339 |
| | 0006 | 13.51922 | 22. | 0.918 |
…
| | 0019 | 12.51549 | 25. | 0.982 |
| | 0020 | 25.25502 | 25. | 0.448 |
-----------------------------------------------
| TOTAL | | 492.43930 | 442. | 0.049 |
-----------------------------------------------
The null hypothesis tested here is that there are no significant differences between the expected
and observed frequencies. A significant χ² statistic indicates that the item parameters differ across
the raw score groups and that the assumed model is not appropriate for the data. In this case, apart
from item 0004 (p = 0.035), none of the items shown has an exceedance probability below 0.05;
with 20 items tested at the 5% level, one such value is unremarkable, so there is no real evidence
of misfit.
Phase 3 output
The first information given in the output from the scoring phase concerns the scoring function used
for scaling. The default function is STANDARD, and thus the standard scoring function (1.0, 2.0, …)
is used even though a different scoring function may have been used for calibration. The scoring
function may also be set to CALIBRATION (SCORING keyword on the SCORE command) to use the
calibration scoring function specified on the BLOCK command instead. Note that the scoring func-
tion applies only to the partial credit model.
BLOCK: 1 SBLOCK1
1 1.000
2 2.000
3 3.000
4 4.000
Bayes estimates are computed for each examinee with respect to his or her group latent distribu-
tion (controlled by the EAP option on the SCORE command used here). A discrete distribution on a
finite number of points (see below) is used as prior. The user may select the number of points
and the type of prior using the NQPT and DIST keywords on the SCORE command.
1 2 3 4 5
POINT -0.4000E+01 -0.3724E+01 -0.3448E+01 -0.3172E+01 -0.2897E+01
WEIGHT 0.3692E-04 0.1071E-03 0.2881E-03 0.7181E-03 0.1659E-02
6 7 8 9 10
POINT -0.2621E+01 -0.2345E+01 -0.2069E+01 -0.1793E+01 -0.1517E+01
WEIGHT 0.3550E-02 0.7042E-02 0.1294E-01 0.2205E-01 0.3481E-01
11 12 13 14 15
POINT -0.1241E+01 -0.9655E+00 -0.6897E+00 -0.4138E+00 -0.1379E+00
WEIGHT 0.5093E-01 0.6905E-01 0.8676E-01 0.1010E+00 0.1090E+00
16 17 18 19 20
POINT 0.1379E+00 0.4138E+00 0.6897E+00 0.9655E+00 0.1241E+01
WEIGHT 0.1090E+00 0.1010E+00 0.8676E-01 0.6905E-01 0.5093E-01
21 22 23 24 25
POINT 0.1517E+01 0.1793E+01 0.2069E+01 0.2345E+01 0.2621E+01
WEIGHT 0.3481E-01 0.2205E-01 0.1294E-01 0.7042E-02 0.3550E-02
26 27 28 29 30
POINT 0.2897E+01 0.3172E+01 0.3448E+01 0.3724E+01 0.4000E+01
WEIGHT 0.1659E-02 0.7181E-03 0.2881E-03 0.1071E-03 0.3692E-04
In this example, the keywords SMEAN and SSD were set to 0 and 1 respectively on the SCORE
command. As a result, the following output reflects the rescaling constants (0.000 and 1.015)
used in this particular case.
Scores are saved to an external file (keyword SCORE on SAVE command), but the first three scores
are printed to the output file for purposes of checking. When EAP is used for scoring, the S.E.
column represents the posterior standard deviation.
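Assembled from the keywords discussed in this phase, the scoring command for this run would take roughly the following form. This is a sketch rather than the actual exampl01 command file (compare the SCORE command in the exampl05 command file later in this chapter):

>SCORE EAP, NQPT=30, SMEAN=0.0, SSD=1.0 ;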
When EAP is selected, an estimate of the population distribution of ability in the form of a dis-
crete distribution of a finite number of points is obtained by accumulating the posterior densities
over the subjects at each quadrature point. These sums are then normalized to obtain the esti-
mated probabilities at the points. Improved estimates of the latent distribution may be obtained
after one more iteration of the solution.
The program also computes the mean and standard deviation for the estimated latent distribution.
Sheppard’s correction for coarse grouping is used in the calculation of the standard deviation.
The EAP estimate is the mean of the posterior distribution, while the standard error is the stan-
dard deviation of the posterior distribution. Posterior weights are only given when EAP is used.
Note that the estimated latent distribution below is based on all cases, and not just on those cases
used in calibration.
6 7 8 9 10
POINT -0.2621E+01 -0.2345E+01 -0.2069E+01 -0.1793E+01 -0.1517E+01
WEIGHT 0.4662E-02 0.7591E-02 0.1180E-01 0.1987E-01 0.3555E-01
11 12 13 14 15
POINT -0.1241E+01 -0.9655E+00 -0.6897E+00 -0.4138E+00 -0.1379E+00
WEIGHT 0.5541E-01 0.7082E-01 0.8069E-01 0.9694E-01 0.1105E+00
16 17 18 19 20
POINT 0.1379E+00 0.4138E+00 0.6897E+00 0.9655E+00 0.1241E+01
WEIGHT 0.1088E+00 0.9832E-01 0.8323E-01 0.7015E-01 0.5431E-01
21 22 23 24 25
POINT 0.1517E+01 0.1793E+01 0.2069E+01 0.2345E+01 0.2621E+01
WEIGHT 0.3809E-01 0.2411E-01 0.1333E-01 0.6645E-02 0.2974E-02
26 27 28 29 30
POINT 0.2897E+01 0.3172E+01 0.3448E+01 0.3724E+01 0.4000E+01
WEIGHT 0.1202E-02 0.4470E-03 0.1554E-03 0.5083E-04 0.1569E-04
The mean and standard deviation of the latent posterior distribution calculated from posterior
weights at quadrature points are also given. In these calculations, the formulas for the variance of
grouped data are used, with quadrature points as class marks and posterior weights as class fre-
quencies.
11.2 Scoring with previously estimated item parameters: maximum likelihood estimation

In this example, the item parameter estimates from Section 11.1, saved in the exampl01.par
file, are used in scoring the simulated examinees by the maximum likelihood (MLE) method.
The item parameter file is used as input (IFNAME keyword on the FILES command) and calibra-
tion is suppressed with the NOCALIB option of the CALIB command.
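A minimal sketch of the commands involved follows. This is not the actual exampl02 command file: the INPUT, TEST, BLOCK, and format lines are omitted, and the MLE option name on the SCORE command is an assumption based on the scoring method named above.

>FILES DFNAME='EXAMPL01.DAT', IFNAME='EXAMPL01.PAR' ;
>CALIB NOCALIB ;
>SCORE MLE, SMEAN=0.0, SSD=1.0 ;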
Comparison of the results in files exampl01.ph3 (see Section 11.1, Phase 3 output) and ex-
ampl02.ph3 (not shown here) shows that, when the scores are scaled to match the mean and
standard deviation of the generating distribution, both the EAP and MLE estimates recover the
generating values with good accuracy.
11.3 Calibration and scoring with the generalized partial credit rating-scale
model: collapsing of categories
This example scores and calibrates the data of Section 11.1 assuming the partial credit model
with standard scoring function. The command file is shown below.
To illustrate the situation where two types of items are involved, the four categories for the sec-
ond ten items are collapsed into two categories, thus making those items effectively binary. Two
blocks are required (each with ten items), and the MODIFIED list in the BLOCK2 command speci-
fies the collapsing.
The standard scoring function assumes 4 is the highest category, so no response modification is
required in BLOCK1. In BLOCK2, the SCORING keyword is used to specify the scoring function
values (see the sketch below).
CADJUST is not used with the partial credit model, nor is SCALE in the CALIB command. Because
the data are now less informative, the number of quadrature points for calibration can be reduced
(NQPT=15 instead of the 30 previously used).
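The block structure just described might be rendered as follows. This is a sketch under stated assumptions rather than the actual exampl03 command file: in particular, the MODIFIED list shown here assumes that the four original categories are collapsed in pairs.

>BLOCK1 NITEMS=10, NCAT=4, ORIGINAL=(1,2,3,4) ;
>BLOCK2 NITEMS=10, NCAT=2, ORIGINAL=(1,2,3,4), MODIFIED=(1,1,2,2), SCORING=(1,2) ;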
Despite the different model and the partition of the items into two blocks, the estimated trait
scores in exampl03.sco agree well with the estimates from Sections 11.1 and 11.2 after rescaling
in the sample.
11.4 Two-group differential item functioning (DIF) analysis with the partial
credit model
This example illustrates differential item functioning (DIF) analysis of multiple category item
responses. The SCORE command is required and thus included in the command file. For the DIF
model, however, no scoring is done and there is no Phase 3 output.
Raw data are read from the file exampl04.dat using the DFNAME keyword on the FILES com-
mand. The data file contains responses to 6 items, as indicated on the INPUT command, where
NTOTAL is set to 6. The data file contains the examinee ID and sample group code (1,2), then the
responses on the 6 items, and finally the generating trait value for each examinee. The first few
lines of the data file are shown below.
The format statement includes information on three fields in the raw data file. The subject ID
(4A1) and group identification field (1A1) are read first, followed by the 6 item responses (6A1).
One test, 6 items in length, is considered. The MGROUP keyword on the INPUT command requests
a multiple-group analysis for two groups. Note that the MGROUP keyword is used in combination
with the MGROUP command, which must follow directly after the BLOCK command(s).
On the TEST command, a name for the test is provided using the TNAME keyword. The items on
this test are listed using the ITEMS keyword, while the INAMES keyword is used to provide names
for the items. Finally, by setting NBLOCK to 6, it is indicated that 6 BLOCK commands will follow
the TEST command.
In this example, there is one item with three categories originally coded 1, 2, and 3 in each block
as indicated by the NITEMS, NCAT and ORIGINAL keywords respectively. Because the rating-scale
model is not used here, separate category parameters are estimated for each item, and the REPEAT
keyword indicates that the BLOCK command should be repeated six times.
The second value (1) assigned to the DIF keyword of the MGROUP command requests a DIF analy-
sis of the item threshold parameters. All other values in this keyword are equal to zero, indicating
that only thresholds are allowed to differ between the groups. The GNAME and GCODE keywords
are used to assign names and codes to the two groups. By default, the first group will be used as
the reference group. To change the reference group, the REFERENCE keyword on the MGROUP
command may be used.
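In sketch form, the group specification described above might read as follows. The length of the DIF list, written here with one entry per parameter type and only the second (threshold) entry set to 1, is an assumption:

>MGROUP GNAME=(MALE,FEMALE), GCODE=(1,2), DIF=(0,1,0) ;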
A partial credit model with logistic response function is requested through the use of the
PARTIAL and LOGISTIC options on the CALIB command. The default number of quadrature points
is 30. In this case, NQPT is set to 25, because fewer points are needed when the number of items
is small. By setting the CYCLES keyword to 100, a maximum of 100 EM cycles will be per-
formed, followed by two Newton cycles (NEWTON=2). The convergence criterion is somewhat re-
laxed by setting CRIT to 0.01 instead of using the default convergence criterion of 0.001. Finally,
the POSTERIOR option is added to the CALIB command. By default, the posterior distribution is
computed as a by-product of the expected proportions during the E-step, so the expected sample
sizes and expected frequencies of categorical responses are based on the posterior distribution
from the previous EM cycle. Adding the POSTERIOR option forces the program to recompute the
posterior distribution after the M-step, so that the expected proportions in the E-step are based on
an updated posterior distribution. This option was added for consistency with the BILOG-MG
program in the two-category case.
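Collected into a single command, the calibration settings for this DIF run would look roughly as follows (a sketch; the inner-iteration entries of the CYCLE list are assumptions):

>CALIB PARTIAL, LOGISTIC, NQPT=25, CYCLE=(100,2,2,2,2), NEWTON=2, CRIT=0.01, POSTERIOR ;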
Phase 0 output
When the MGROUP keyword and MGROUP command are used or multiple TEST/BLOCK commands
are used, additional information is written to the phase 0 output file.
NUMBER OF SUBGROUPS: 2
FORMAT OF DATA INPUT IS
(4A1,1X,1A1,1X,6A1)
BLOCK CARD: 1
>BLOCK1 REPEAT=6, NIT=1, NCAT=3, ORIGINAL=(1,2,3) ;
In the next few lines, the program echoes the information on parameters allowed to be different
between groups as specified with the DIF keyword: in this case, only the thresholds are allowed
to differ between the two groups. The MALE group will be used as reference group.
1 MALE 1
2 FEMALE 2
Phase 1 output
The only difference between the Phase 1 output for a single-group analysis and for a multiple-
group analysis is that the summary item statistics are first given by subgroup and then for the to-
tal group. The output for the first item is shown below for all three cases. We see that females
were more likely to respond in category 3, and less likely to respond in category 1, than the
males. Overall, 76% of the total responses were in category 3.
1 SUBGROUP: MALE
2 SUBGROUP: FEMALE
TOTAL
Item means, initial slope estimates, and Pearson and polyserial item-test correlations are given in
the next table. For a detailed discussion of the measures shown here, refer to the discussion of
the Phase 1 output of Section 11.1.
----------------------------------------------------------------------------
BLOCK | RESPONSE TOTAL SCORE | PEARSON & | INITIAL INITIAL
ITEM | MEAN MEAN | POLYSERIAL | SLOPE LOCATION
| S.D.* S.D.* | CORRELATION |
---------------------------------------------------------------------------
BLOCK | | |
1 I001 | 2.539 13.162 | 0.714 | 1.000 0.000
| 0.831* 3.765* | 0.976 |
----------------------------------------------------------------------------
CATEGORY | SCORING | MEAN | S.D. | PARAMETER
1 | 1.000 | 8.190 | 2.235 | 0.000
2 | 2.000 | 11.263 | 2.899 | -0.155
3 | 3.000 | 14.655 | 2.735 | 1.596
----------------------------------------------------------------------------
Phase 2 output
For the DIF model, a separate prior distribution is used for each group, and the prior distribution
is updated after each estimation cycle based on the posterior distribution from the previous
cycle.
For the DIF model, it is assumed that different groups have different distributions with mean µ g
and standard deviation σ g . The distributions are not necessarily normal. These empirical poste-
rior distributions are estimated simultaneously with the estimation of the item parameters. To
obtain those parameters, the following constraint is imposed for the DIF model:
\sum_{j=1}^{J} d_{Rj} = \sum_{j=1}^{J} d_{Fj}.
This constraint implies that the overall difficulty levels of a test, or of a set of common items
given to both the reference group and the focal group (indicated by subscripts R and F, respec-
tively), are the same. The item difficulty parameters for the focal groups are therefore adjusted,
and any overall difference in test difficulty is attributed to a difference in ability level between
the subgroups. The ability-level difference among groups can then be estimated from the poste-
rior distributions.
The first difference between the output file discussed here and the Phase 2 output for Section
11.1 concerns the scoring function and step parameters for the multiple blocks. As no scoring
function was specified on the CALIB command, the default scoring function (1, 2, …) is used.
Under the partial credit model, the step parameters, also known as the item step difficulties or
category intersections, correspond to the points on the ability scale where two successive item
response category characteristic curves (IRCCC) intersect. The increasing difficulty of a step
relative to other steps within an item is associated with higher values of the step parameters. In
this example, where each item has 3 categories, 2 “steps” are needed to move from the first cate-
gory to the third category: a respondent needs to move from category 1 to category 2, and a sec-
ond step is needed to move from category 2 to category 3. The second step parameters of items
1 and 2 (see below) show that, for the male respondents, moving from category 2 to category 3
is harder in the case of item 2.
The IRCCC for items 1 and 5 are shown below. Vertical lines were added to indicate the trait
level at which the curves for step 0 and step 1 intersect. The most likely response for a male with
trait level of -2 would be to complete 0 steps in both cases. For a male with trait level of ap-
proximately 1.5, completing the step from category 2 to category 3 would be more likely in the
case of item 5. Although there is little difference between the two graphs, it would appear that
completing the first step is somewhat easier for item 1 than for item 5, while completing the sec-
ond step is easier for item 5. This is in agreement with the second step parameters for these
items: 1.769 for item 1 and 1.517 for item 5.
[GROUP: 1 MALE ]
The step parameter information is followed by the item parameter estimates for the male group.
Standard errors are computed from the empirical information matrix in the final Newton cycle.
+------+-----+---------+---------+---------+---------+---------+---------+
| ITEM |BLOCK| SLOPE | S.E. |LOCATION | S.E. |GUESSING | S.E. |
+======+=====+=========+=========+=========+=========+=========+=========+
| I001 | 1 | 0.846 | 0.054 | -0.590 | 0.070 | 0.000 | 0.000 |
+------+-----+---------+---------+---------+---------+---------+---------+
| I002 | 2 | 0.948 | 0.060 | 0.519 | 0.066 | 0.000 | 0.000 |
+------+-----+---------+---------+---------+---------+---------+---------+
| I003 | 3 | 0.628 | 0.034 | -0.542 | 0.076 | 0.000 | 0.000 |
+------+-----+---------+---------+---------+---------+---------+---------+
| I004 | 4 | 0.615 | 0.034 | 0.544 | 0.077 | 0.000 | 0.000 |
+------+-----+---------+---------+---------+---------+---------+---------+
| I005 | 5 | 0.414 | 0.025 | -0.666 | 0.098 | 0.000 | 0.000 |
+------+-----+---------+---------+---------+---------+---------+---------+
| I006 | 6 | 0.344 | 0.021 | 0.658 | 0.110 | 0.000 | 0.000 |
+------+-----+---------+---------+---------+---------+---------+---------+
Similar information for the female group is given next. Note that the slope for each item is com-
mon across the two groups. This implies that the same item discrimination is assumed over the
groups.
[GROUP: 2 FEMALE ]
+------+-----+---------+---------+---------+---------+---------+---------+
| ITEM |BLOCK| SLOPE | S.E. |LOCATION | S.E. |GUESSING | S.E. |
+======+=====+=========+=========+=========+=========+=========+=========+
| I001 | 1 | 0.846 | 0.054 | -0.615 | 0.085 | 0.000 | 0.000 |
+------+-----+---------+---------+---------+---------+---------+---------+
| I002 | 2 | 0.948 | 0.060 | 0.644 | 0.057 | 0.000 | 0.000 |
+------+-----+---------+---------+---------+---------+---------+---------+
| I003 | 3 | 0.628 | 0.034 | 0.010 | 0.075 | 0.000 | 0.000 |
+------+-----+---------+---------+---------+---------+---------+---------+
| I004 | 4 | 0.615 | 0.034 | -0.348 | 0.084 | 0.000 | 0.000 |
+------+-----+---------+---------+---------+---------+---------+---------+
| I005 | 5 | 0.414 | 0.025 | -0.645 | 0.118 | 0.000 | 0.000 |
+------+-----+---------+---------+---------+---------+---------+---------+
| I006 | 6 | 0.344 | 0.021 | 0.877 | 0.098 | 0.000 | 0.000 |
+------+-----+---------+---------+---------+---------+---------+---------+
DIF contrasts are given next. In the table below, the CONTRAST column gives the difference in
item location between the groups, together with its standard error. The STD column contains
standardized contrasts, obtained by dividing each contrast by its standard error. The probability
that a normal variate exceeds the absolute value of the standardized difference is also given.
This is a one-sided test.
χ² test statistics for the item location contrasts are given in the next section of the output file. In
this case, with only one degree of freedom, χ² = (standardized difference)². This is a two-sided
test. These χ² statistics and their exceedance probabilities are summarized in the table below.
+-------+-------+-----------+------+-------+
| ITEM  | BLOCK | CHI-SQRS  | D.F. | PROB. |
+=======+=======+===========+======+=======+
| I001  |   1   |    0.053  |  1.  | 0.803 |
| I002  |   2   |    2.052  |  1.  | 0.148 |
| I003  |   3   |   26.789  |  1.  | 0.000 |
| I004  |   4   |   60.814  |  1.  | 0.000 |
| I005  |   5   |    0.019  |  1.  | 0.861 |
| I006  |   6   |    2.231  |  1.  | 0.131 |
+-------+-------+-----------+------+-------+
| TOTAL |       |   91.958  |  6.  | 0.000 |
+-------+-------+-----------+------+-------+
When the summary statistics for the 2 groups are compared, we see that only the standard devia-
tion of the threshold differs. Recall that for this example, the DIF keyword on the MGROUP com-
mand was used to allow only threshold parameters to differ between the groups. Overall, no
large difference between groups over all items is observed.
The final output is the estimated latent distributions by group. The origin and unit of the scale are
set so that the mean and standard deviation of the reference group are 0 and 1 respectively.
A plot of the estimated latent distributions is given below. The solid line represents the distribu-
tion for the male group. If there is appreciable DIF, the latent distributions do not represent the
same latent variable and no meaningful comparison of the two distributions is possible. If there is
no DIF, significant differences between the latent distributions represent real differences between
the populations sampled.
1 2 3 4 5
POINT -0.4000E+01 -0.3667E+01 -0.3333E+01 -0.3000E+01 -0.2667E+01
WEIGHT 0.5988E-04 0.2137E-03 0.6808E-03 0.1934E-02 0.4887E-02
6 7 8 9 10
POINT -0.2333E+01 -0.2000E+01 -0.1667E+01 -0.1333E+01 -0.1000E+01
WEIGHT 0.1096E-01 0.2172E-01 0.3790E-01 0.5826E-01 0.8000E-01
11 12 13 14 15
POINT -0.6667E+00 -0.3333E+00 0.3331E-15 0.3333E+00 0.6667E+00
WEIGHT 0.1009E+00 0.1178E+00 0.1249E+00 0.1190E+00 0.1034E+00
16 17 18 19 20
POINT 0.1000E+01 0.1333E+01 0.1667E+01 0.2000E+01 0.2333E+01
WEIGHT 0.8257E-01 0.5917E-01 0.3742E-01 0.2086E-01 0.1029E-01
21 22 23 24 25
POINT 0.2667E+01 0.3000E+01 0.3333E+01 0.3667E+01 0.4000E+01
WEIGHT 0.4516E-02 0.1766E-02 0.6167E-03 0.1924E-03 0.5368E-04
1 2 3 4 5
POINT -0.4000E+01 -0.3667E+01 -0.3333E+01 -0.3000E+01 -0.2667E+01
WEIGHT 0.1485E-04 0.5381E-04 0.1748E-03 0.5093E-03 0.1331E-02
6 7 8 9 10
POINT -0.2333E+01 -0.2000E+01 -0.1667E+01 -0.1333E+01 -0.1000E+01
WEIGHT 0.3120E-02 0.6569E-02 0.1248E-01 0.2175E-01 0.3608E-01
11 12 13 14 15
POINT -0.6667E+00 -0.3333E+00 0.3331E-15 0.3333E+00 0.6667E+00
WEIGHT 0.5834E-01 0.8712E-01 0.1130E+00 0.1320E+00 0.1437E+00
16 17 18 19 20
POINT 0.1000E+01 0.1333E+01 0.1667E+01 0.2000E+01 0.2333E+01
WEIGHT 0.1360E+00 0.1059E+00 0.6927E-01 0.3922E-01 0.1955E-01
21 22 23 24 25
POINT 0.2667E+01 0.3000E+01 0.3333E+01 0.3667E+01 0.4000E+01
WEIGHT 0.8653E-02 0.3410E-02 0.1199E-02 0.3764E-03 0.1056E-03
11.5 A test with 26 multiple-choice items and one 4-category item: three-
parameter logistic and generalized partial credit model
This example illustrates a test consisting primarily of machine-scorable multiple choice items,
but also containing one open-ended item scored in three categories. The latter item appears in the
middle.
The item responses are from several test forms, and items not represented on a particular form
are assigned the not-presented code 9. The not-presented key appears in the exampl05.npc file.
The codes 1 and 0 for correct and incorrect response to the multiple-choice items must be re-
coded 1 and 2, respectively, for the PARSCALE analysis. This is accomplished through use of
the ORIGINAL and MODIFIED keywords on the BLOCK commands.
The first few lines of the file exampl05.dat are shown below. The data and command files can
be found in the examples folder of the PARSCALE installation.
1 110000000000199999999999999
2 110000000011199999999999999
3 011001000001199999999999999
4 110000100000199999999999999
5 101011010011199999999999999
The contents of exampl05.npc are shown below. This file is specified in the syntax by the
NFNAME keyword on the FILES command.
KEY 999999999999999999999999999
The first information read according to the format statement shown below is the case ID, which
is read in the format “3A1”. The NIDCHAR keyword is set to 3 to indicate that the case ID is three
characters in length. The response to the first item is in column 5: the fourth column is skipped
using the “1X” operator, and the “27A1” field that follows indicates that responses to 27 items
are read from each line.
The 3-parameter logistic model (3PL) is assumed for the multiple-choice items, and the partial
credit model is assumed for the open-ended item. Because the parameters of the 3PL model dif-
fer from one item to another, each item must be assigned to a separate block. This is facilitated
by the REPEAT keyword of the BLOCK command, which indicates the number of successive items
that have the same block specifications. In the present example, the first block specification ap-
plies to the first 12 multiple-choice items, the second applies to the open-ended item, and the
third applies to the remaining 14 multiple-choice items. Note also the assignment of separate
block names using the BNAME keyword.
The use of the SPRIOR and GPRIOR options on the CALIB command requests the use of a log-
normal prior distribution and a normal prior distribution on the slope and guessing parameters
respectively.
The Bayes estimates (EAP option on the SCORE command) of the respondents' scale scores are
estimated and saved.
EXAMPL05.PSL - A TEST WITH 26 MULTIPLE CHOICE ITEMS AND ONE 4-CATEGORY ITEM
THREE-PARAMETER LOGISTIC AND GENERALIZED PARTIAL CREDIT MODEL
>FILE DFNAME='EXAMPL05.DAT', NFNAME='EXAMPL05.NPC', SAVE ;
>SAVE PARM='EXAMPL05.PAR', SCORE='EXAMPL05.SCO' ;
>INPUT NIDCHAR=3, NTOTAL=27, NTEST=1, LENGTH=27;
(3A1,1X,27A1)
>TEST1 TNAME=SOCSCI, ITEM=(1(1)27), NBLOCK=27 ;
>BLOCKS BNAME=(MC01,MC02,MC03,MC04,MC05,MC06,MC07,MC08,MC09,MC10,MC11,MC12),
NITEMS=1, NCAT=2, ORIGINAL=(0,1), MODIFIED=(1,2),
REPEAT=12, GUESSING=(2,ESTIMATE) ;
>BLOCK BNAME=OE, NITEMS=1, NCAT=3, SCORING=(1,2,3) ;
>BLOCKS BNAME=(MC13,MC14,MC15,MC16,MC17,MC18,MC19,MC20,MC21,MC22,MC23,MC24,
MC25,MC26),
NITEMS=1, NCAT=2, ORIGINAL=(0,1), MODIFIED=(1,2),
REPEAT=14, GUESSING=(2,ESTIMATE) ;
>CALIB PARTIAL, LOGISTIC, NQPTS=15, CYCLE=(50,1,1,1,1), NEWTON=2,
CRIT=0.01, SPRIOR, GPRIOR ;
>SCORE EAP, SMEAN=0.0, SSD=1.0, NAME=SOCSCI ;
11.6 Analysis of three tests containing items with two and three categories:
calculation of combined scores
A partial credit model based on artificial data is discussed in this example. Six items, with either
2 or 3 categories each, are assigned to three subtests. In all cases, guessing parameters are esti-
mated.
The data file used is exampl06.dat in the examples subfolder of the PARSCALE installation.
The first few lines of the data file are shown below.
The case identification is given in the first four columns of each line. Responses to the six items
are recorded in columns 6 to 11. At the end of each line, the generating trait value is given. This
value is not used in the analysis. The format statement used to read these data is:
(4A1,1X,6A1)
The items are analyzed in different ways in three subtests (NTEST=3 on INPUT). The LENGTH
keyword on the INPUT command indicates the length of each of the three subtests. The COMBINE
keyword on the INPUT command indicates that 3 COMBINE commands follow the SCORE com-
mand, while the SAVE option indicates that a SAVE command will follow directly after the FILES
command. On the SAVE command, names for external files to which subject scores and combined
scores will be saved are provided.
The first subtest consists of six items analyzed in six distinct blocks. The REPEAT keyword of the
first block indicates that the first three blocks each contain one 3-category item with item-
specific step parameters. The remaining blocks contain multiple-choice items with various guess-
ing parameters. The GPARM keyword is used here to correct the dichotomous item response prob-
abilities in the presence of the GUESSING keyword. These guessing parameters are used for the
initial parameter values and have a default value of zero. The value of (2,ESTIMATE) assigned to
the GUESSING keywords indicates that the second category is the correct response and that a
guessing parameter is to be estimated.
In the second subtest, the first 3 items are analyzed separately. In the third subtest, the last 3
items are analyzed separately. The convergence criterion for the iteration procedure is somewhat
relaxed for this test calibration (0.005 ==> 0.01) to obtain convergence.
Scores for the three subtests are combined in the scoring phase. These scores are saved to the ex-
ternal file exampl06.sco as specified on the SAVE command. They are combined as specified by
the COMBINE keyword in the INPUT command and the COMBINE commands following the last
SCORE command. The WEIGHT keywords on these commands have as values sets of positive frac-
tions summing to 1. These values are used as the weights for the subscale scores. Subscores are
combined linearly. In this example, three different combinations of the scores from the subtests
are requested. These scores are saved to the external file exampl06.cmb.
11.7 Rater-effect model: parameter estimation for multiple raters

This example illustrates parameter estimation for multiple raters. The analysis is based on data
in the file exampl07.dat in the raters folder of the PARSCALE installation folder. The first
few lines of the data are shown below.
00001 12 11 32 32
00001 22 21 42 42
00002 12 11 32 32
00002 22 22 43 42
00003 12 12 31 31
00003 23 22 43 41
00004 12 12 33 32
00004 22 22 42 42
00005 11 11 31 31
00005 22 21 41 42
The data contain the rating on four items administered to each examinee by four raters. The first
5 columns of each line of data contain the examinee ID. After two blank columns, the rater ID is
given, directly followed by the rating on the first item. Similar combinations of rater ID and rat-
ing for the other three items follow. As can be seen from the data above, the first line of data is
associated with examinee 00001 and contains the ratings for raters 1 and 3. The second line of
data, associated with the same examinee, contains the ratings for raters 2 and 4.
These data are read with the format statement

(5A1,4(2X,2A1))

where “5A1” is the format of the examinee ID, and “2X,2A1” the format for reading one rater
ID/rating combination. The latter is repeated four times, using the notation “4( )”. Note that,
since the data for each examinee are given on two lines, R-INOPT=2 could have been specified on
the INPUT command and the format statement changed to

(5A1,4(2X,2A1),/T6,4(2X,2A1)).
The MRATER keyword on the INPUT command requests Rater's-Effect analysis, and indicates the
number of raters. The MRATER command provides necessary information about the four raters.
The estimated parameters and scores are saved to external output files using the SAVE option on
the FILES command and the PARM and SCORE keywords on the SAVE command.
The command file for a partial credit model based on these data is shown below.
Phase 0 output
In addition to the standard Phase 0 output discussed elsewhere, information on the raters’ names,
codes, and the weight assigned to each is echoed to the output file.
The MRATER command used here only assigns names and codes to the rater. By default, the RATER
keyword, not included in the MRATER command shown here, assumes the value (1,1,1,1). The ar-
guments of this keyword are the raters’ weights. For the Raters-effect model, the ability score
for each respondent is computed for each subtest (or subscale) and each rater separately. A total
score of each respondent for each subtest (or subscale) is computed by summing those scores
over items within each subtest and all raters who have rated the respondent. The rater weights of
this keyword are used to compute the weighted subtest or subscale score for each respondent.
Since the number of raters who rated each respondent’s responses varies, the weights are normal-
ized (divided by their sum) for each respondent.
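For instance (an illustrative case, not program output): if a respondent is rated by only two of the four raters, each carrying the default weight of 1, the normalized weights are

w_1 = w_2 = \frac{1}{1 + 1} = 0.5.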
1 RaterA 1 1.00
2 RaterB 2 1.00
3 RaterC 3 1.00
4 RaterD 4 1.00
Also included in the Phase 0 output is a listing of the first two observations, showing the input
and recoded responses. The raters responsible for each rating are also listed. This information is
provided so that the user can check that the data are read in correctly. If not, the variable format
statement (or the data) should be corrected.
OBSERVATION # 2
GROUP: 1
ID: 00001
INPUT RESPONSES: 2 1 2 2
RECODED RESPONSES: 2 1 2 2
RECODED RATERS : 2 2 4 4
The Phase 0 output also reports that 2000 lines of data were read from the data file, and indicates
that these 2000 observations are associated with 1000 examinees.
Phase 1 Output
The Phase 1 output file contains no additional information in this type of analysis. As usual, fre-
quencies and percentages for items nested within blocks are reported here. Information for the
first block/item is shown below.
Phase 2 Output
The Phase 2 output file shows the standard output for category parameters and item parameters
at convergence. This is followed by rater parameters and their associated standard errors as
shown below.
+------+-----+---------+---------+---------+---------+---------+---------+
| ITEM |BLOCK| SLOPE | S.E. |LOCATION | S.E. |GUESSING | S.E. |
+======+=====+=========+=========+=========+=========+=========+=========+
| 0001*| 1 | 0.814 | 0.041 | -0.515 | 0.039 | 0.000 | 0.000 |
+------+-----+---------+---------+---------+---------+---------+---------+
| 0002*| 2 | 0.935 | 0.047 | 0.410 | 0.037 | 0.000 | 0.000 |
+------+-----+---------+---------+---------+---------+---------+---------+
| 0003*| 3 | 0.491 | 0.027 | -0.502 | 0.051 | 0.000 | 0.000 |
+------+-----+---------+---------+---------+---------+---------+---------+
| 0004*| 4 | 0.505 | 0.028 | 0.508 | 0.050 | 0.000 | 0.000 |
+------+-----+---------+---------+---------+---------+---------+---------+
RATER’s EFFECT PARAMETER
From the output above, we see a marked difference between the raters, in particular between
RaterC and RaterD. The raters differ appreciably in severity.
11.8 Rater-effect model: one-record input format with same number of raters
per examinee
This example illustrates another option for rater data input (R-INOPT=1). The data in ex-
ampl07.dat (see Section 11.7) were reformatted so that the rated responses for each respondent
are on a single record. This input option requires the NRATER keyword on the INPUT command to
indicate the number of times each item was rated. The number of raters is indicated using the
MRATER keyword on the same command.
11.9 Rater-effect model: one-record input format with rater codes preceding each rating

This example illustrates another form of data input for multiple ratings. It is requested by setting
R-INOPT=1 on the INPUT command to indicate one line of data per examinee. The number of
items in the test is given in the LENGTH keyword.
The data in exampl09.dat (given in the raters folder) are formatted so that a rater ID code pre-
cedes each rating of the examinee’s response to an item. The INPUT command must include the
NRATER keyword to indicate the number of times each item has been rated. The MRATER keyword
is used to give the maximum number of raters for each of the items in the test. If any given item
of any particular case record has fewer than the maximum number of raters, the not-presented
code must be inserted for the rater code of each missing rater.
If an item is multiple-choice or is objectively scored, then the number of raters for the item in the
NRATER list must be set to zero. For those items, only the response code appears in the case re-
cord.
The total number of responses, NTOTAL, to all items in the data is equal to the number of mul-
tiple-choice items plus the sum of the numbers of raters in the NRATER list. The INPUT command
must also contain the MRATER keyword, giving the number of different raters in the data. The
codes that identify the raters in the data must appear in the MRATER command. Labels for the rat-
ers in the output listing may be supplied with the RNAME keyword on the MRATER command.
The following is an example of a data record in exampl09.dat. There are 5 open-ended items,
but any given examinee is presented only 2 of these items. Rater codes and ratings for the re-
maining items are assigned the not-presented code 0. There are no multiple-choice items.
14 3 2 10 3 0 0 0 0 5 3 12 2 0 0 0 0 0 0 0 0
Examinee 14 was presented items 1 and 3. The response to item 1 was scored by rater 3, who
assigned it category 2, and by rater 10, who assigned it category 3. The response to item 3 was
scored by rater 5, who assigned it category 3, and by rater 12, who assigned it category 2.
The not-presented key must have the same format as the data records. In this case:
NPKY 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
The raters are nested within items in these data; i.e., any given rater scores one, and only one,
response of any given examinee.
12 MULTILOG examples
12.1 MML parameter estimation for the 1PL model: the LSAT6 data

The so-called LSAT Section 6 data include the responses of 1000 examinees to five binary
items in a short section of the Law School Admission Test. The data have been analyzed by Bock
& Lieberman (1970), Andersen & Madsen (1977), Bock & Aitkin (1981), Thissen (1982), and
others; the 1PL model with a Gaussian population distribution fits quite well.
Contents of the data file are shown below. Note that a frequency of 0 was obtained for the 11th
and 13th patterns.
1 00000 3
2 00001 6
3 00010 2
4 00011 11
5 00100 1
6 00101 1
7 00110 3
8 00111 4
9 01000 1
10 01001 8
11 01010 0
12 01011 16
13 01100 0
...
31 11110 28
32 11111 298
The examples in Sections 12.2 and 12.3 fit these data with the 2PL and 3PL models. The
PROBLEM command identifies the problem as RANDOM θ (requiring MML estimation) using
PATTERN-count data input. The TEST command defines the test as ALL L1, specifying the 1PL
model. The data, in the file exampl01.dat, have the response patterns defined by [0-1] strings,
with 1 coded for a correct response. The format reads the 5A1 item responses, followed by F4.0
to read the frequency.
Results are saved to an output file called exampl01.out. The first few pages of the MULTILOG
output give information about the problem; those are omitted from the selected output repro-
duced here. The results are on the final three pages of the output and are included here.
The parameters correspond to those given by Thissen (1982, p. 180). The values in parentheses
adjacent to each parameter are approximate standard errors. The final page of the MULTILOG
output for PATTERN input has two parts: the left section describes goodness-of-fit, and the right
section characterizes the distribution of θ for each response pattern. In the goodness-of-fit sec-
tion, the observed and expected frequencies are printed, as well as the standardized residuals,
(observed − expected)/√expected.
The EAP (Expected A Posteriori) estimate of θ for each pattern is also printed, as well as the
posterior standard deviation. At the bottom of the table, the likelihood ratio χ 2 goodness-of-fit
statistic value is printed. The command file exampl01.mlg is shown below.
EXAMPL01.MLG -
MML PARAMETER ESTIMATION FOR THE 1PL MODEL, LSAT6 DATA
>PROBLEM RANDOM, PATTERN, NITEMS=5, NGROUP=1, NPATTERNS=32,
DATA='EXAMPL01.DAT';
>TEST ALL, L1;
>END;
2
01
11111
N
(4X,5A1,F4.0)
Selected output is shown below. Parameter estimates for item 1, with standard errors in parenthe-
ses, are given. These may be used to test if a parameter is significantly different from zero (t =
estimate/S.E.).
The next section of output provides information on the contribution of item 1 to the total infor-
mation.
@THETA: INFORMATION:
-3.0 - -1.6 1.548 1.566 1.581 1.593 1.600 1.604 1.603 1.599
-1.4 - 0.0 1.590 1.578 1.561 1.541 1.519 1.493 1.466 1.437
0.2 - 1.6 1.407 1.376 1.346 1.316 1.287 1.259 1.233 1.208
1.8 - 3.0 1.186 1.165 1.145 1.128 1.112 1.098 1.086
0.2 - 1.6 0.843 0.852 0.862 0.872 0.881 0.891 0.901 0.910
1.8 - 3.0 0.918 0.927 0.934 0.942 0.948 0.954 0.960
12.2 MML parameter estimation for the 2PL model: the LSAT6 data

The second example of MULTILOG fits the LSAT6 data with the 2PL model. The 1PL model
for these data is discussed in Section 12.1, and a 3PL model is fitted in Section 12.3.
The test is redefined as L2 (for the 2PL model) on the TEST command. The results follow, in the
same format as before. The only differences are that each item has a different estimated slope
(A) value, and the value of the likelihood ratio statistic indicates a very slight improvement in fit,
from 21.8 for the 1PL model to 21.2 for the 2PL model.
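As a rough check (assuming the usual counting of free item parameters: five difficulties plus one common slope under the 1PL, and five slopes plus five difficulties under the 2PL), the improvement can be referred to a likelihood ratio difference test:

G^2_{1PL} - G^2_{2PL} = 21.8 - 21.2 = 0.6 \quad \text{on} \quad 10 - 6 = 4 \text{ degrees of freedom},

which is far from significant.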
The command file exampl02.mlg for this analysis is shown below, followed by the output ob-
tained from this run.
EXAMPL02.MLG -
MML PARAMETER ESTIMATION FOR THE 2PL MODEL, LSAT DATA
>PROBLEM RANDOM, PATTERN, NITEMS=5, NGROUP=1, NPATTERNS=32,
DATA='EXAMPL01.DAT';
>TEST ALL, L2;
>END;
2
01
11111
N
(4X,5A1,F4.0)
ITEM SUMMARY
MML PARAMETER ESTIMATION FOR THE 2PL MODEL, LSAT DATA
12.3 MML parameter estimation for the 3PL model: the LSAT6 data

The third run of MULTILOG fits the LSAT6 data with the 3PL model. The test is redefined as
L3 (for the 3PL model) on the TEST command. The 1PL and 2PL models for these data are dis-
cussed in Sections 12.1 and 12.2. This example also illustrates the use of Bayesian priors for some of the item
parameters. Specifically, the PRIORS command indicates that for all five items
[ITEMS=(1,2,3,4,5)] the parameter DK=1, which is the logit of the lower asymptote, should be
estimated with a Gaussian prior distribution with a mean of –1.4 and a standard deviation of 1.0.
The value –1.4 is chosen for the mean because that is the logit of 0.2, and the items of LSAT6
were five-alternative multiple-choice items. The complete command file exampl03.mlg is shown
below.
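The prior mean is simply the logit of the chance-success rate for five-alternative items, as a worked line shows:

\mathrm{logit}(0.2) = \ln\frac{0.2}{1 - 0.2} = \ln(0.25) \approx -1.39 \approx -1.4.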
The presentation of the parameter estimates is different in this third (3PL) output from
MULTILOG. MULTILOG interprets the 1PL and 2PL models as binary versions of Samejima’s
(1969) graded model, giving the output form in the first two runs. The 3PL is estimated as a bi-
nary form of the multiple-choice model; so contrasts between the two slopes (correct and incor-
rect) and intercepts are estimated, as well as the logit of the lower asymptote. For convenience,
the three parameters are transformed into the more commonly used “Traditional 3PL, normal
metric” form on the first line for each item.
The results indicate that there is very little information about the lower asymptote parameters for
these very easy items; all of the estimated values of the lower asymptote are very near their prior
expected value of 0.2. The most difficult of the five items (item 3) has an estimated asymptote of
0.18. The likelihood ratio statistic indicates that this model does not fit quite as well as the 2PL
model did. That is true, although it seems odd. The Maximum Likelihood estimates (computed
with no priors) for the 3PL model for these data are identical to the 2PL estimates: all of the as-
ymptotes are estimated to be zero. The prior holds the estimates of the asymptotes near 0.2, and
does not fit quite as well. It does not fit particularly worse, either; there is very little information
available for estimating the lower asymptotes for these items.
Selected output for this run follows (only item 3 shown here), followed by the total information,
observed and expected frequencies, and value of −2 ln L :
@THETA: INFORMATION:
-3.0 - -1.6 1.275 1.297 1.317 1.337 1.355 1.373 1.391 1.409
-1.4 - 0.0 1.427 1.446 1.465 1.484 1.500 1.514 1.522 1.522
0.2 - 1.6 1.514 1.497 1.471 1.438 1.400 1.360 1.318 1.278
1.8 - 3.0 1.240 1.206 1.175 1.148 1.124 1.104 1.087
12.4 Samejima’s graded model: the “happiness” data

Clogg & Goodman (1984) analyzed a set of data for two responses (six weeks apart) to a three-
alternative graded questionnaire item about “happiness.” Some of their data are analyzed here
with ordered latent trait models. The data are in a file called exampl04.dat; there are three re-
sponse codes: 1 = very happy, 2 = pretty happy, and 3 = not too happy.
In this example, we fit these data with Samejima’s (1969) graded model. In the next section, we
estimate the parameters of a version of Masters’ (1982) partial credit model for the same data.
Another example of a graded model can be found in Section 12.7. The TEST command defines
the model as “GRADED,” with 3 categories for each of the two items. The items are labeled “PRE”
and “POST” on the LABELS command. The slope parameters are constrained to be equal. The long
form of key entry is required for multiple-category items: each response code in the data must be
assigned to a category of the model.
The graded model assumes that the response corresponding to the highest value of the trait (here,
happiness) has the highest value, so response 1 is placed in category 3 for both items, 2 in cate-
gory 2 and 3 in category 1.
The data file exampl04.dat and command file exampl04.mlg are shown below.
11 46
12 31
13 8
21 20
22 68
23 12
31 1
32 12
33 11
In the MULTILOG output, the estimated parameters are printed in a format similar to those for
the 1PL and 2PL models in Sections 12.1 and 12.2, except that there are two thresholds for each
of the three-category items. As before, the goodness-of-fit statistics and EAP[ θ ]s are printed on
the final page of the MULTILOG listing. The model fits these data satisfactorily: the likelihood
ratio χ 2 statistic is 7.4 on 3 d.f., p = 0.07. Selected output is given below.
12.5 Masters’ partial credit model via the nominal model: the “happiness” data

In this example, we estimate the parameters of a version of Masters’ (1982) partial credit model
for the same “happiness” data considered in Section 12.4, where Samejima’s (1969) graded
model was fitted. A description of the data file can also be found in that section. The model for
the test items is redefined to be NOMINAL, with 3 categories for each item, and category 3 is
“HIGH.” The command file (exampl05.mlg, shown at the end of this section) specifies that
POLYNOMIAL contrasts are to be used for the a_k parameters of the NOMINAL model, with the
linear contrasts constrained to be equal for the two items and the quadratic contrasts FIXED at
zero. It also specifies the “TRIANGLE” contrast matrix for the c_k parameters of the NOMINAL
model. Thissen & Steinberg (1986) show that this parameterization of the NOMINAL model is
equivalent to Masters’ (1982) partial credit model; the only difference between the model as
fitted here and that fitted by Masters is the inclusion here of the Gaussian population distribu-
tion. This model does not fit these data quite as well as Samejima’s (1969) graded model.
With this parameterization, the parameter values printed by MULTILOG are the slope contrast,
which is the slope of the trace lines relative to the unit standard deviation of the population dis-
tribution, and the c-contrasts, which are equivalent to Masters’ δ’s: the points at which the suc-
cessive ordered trace lines cross. A property of the partial credit model is that it is a “Rasch-
type” model; response patterns with the same total raw score have the same posterior distribution
of θ . This means, for instance, that the response patterns that total 5 (32 and 23) have the same
EAP[ θ ], 0.34, with the same standard deviation, 0.65. This property of raw-score sufficiency for
θ is not obtained with the Samejima graded model, even when the slopes are constrained to be
equal, as in the preceding run. It is only obtained with this model when, as here, the slopes are
constrained to be equal for all items.
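The algebra behind this equivalence is worth a brief sketch (the notation is local to this note, chosen to match the discussion of the nominal model later in this chapter, and is not quoted from the output). With the quadratic contrasts fixed at zero and the linear contrast equal for both items, the category slopes are $a_k = (k-1)a$, and Bock's (1972) nominal model gives

$$P(x = k \mid \theta) = \frac{\exp[a_k\theta + c_k]}{\sum_{i=1}^{3}\exp[a_i\theta + c_i]}, \qquad \frac{P_k(\theta)}{P_{k-1}(\theta)} = \exp[a(\theta - \delta_k)], \quad \delta_k = \frac{c_{k-1} - c_k}{a},$$

which is Masters' (1982) partial credit form; adjacent trace lines satisfy $P_k = P_{k-1}$ exactly at $\theta = \delta_k$, the crossing points referred to above.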
The command file exampl05.mlg is given below, followed by selected output for item 1 only.
123
33
22
11
(2A1,F4.0)
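The lines just above are the key and format section of the file. A sketch of the full exampl05.mlg, assembled from the description at the beginning of this section (the NPATTERNS value, the data file name, and the exact EQUAL and FIX forms for the polynomial contrasts are assumptions), is:

EXAMPL05.MLG -
HAPPINESS ITEMS, PARTIAL CREDIT VIA THE NOMINAL MODEL
>PROBLEM RANDOM, PATTERNS, NITEMS=2, NGROUP=1, NPATTERNS=9,
DATA='EXAMPL04.DAT';
>TEST ALL, NOMINAL, NC=(3,3), HIGH=(3,3);
>TMATRIX ALL, AK, POLYNOMIAL;
>TMATRIX ALL, CK, TRIANGLE;
>EQUAL ITEMS=(1,2), AK=1;
>FIX ALL, AK=2, VALUE=0.0;
>END;
3
123
33
22
11
(2A1,F4.0)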
Klassen & O’Connor (1989) conducted a prospective study of predictors of violence in adult
male mental health admissions. One combination of possible predictors of subsequent violence
involved data readily available in mental health center records: The number of prior (inpatient)
admissions and age at the first such admission. Both a large number of previous admissions and
a young age at first admission are considered possible predictors of subsequent violence, pre-
sumably because they both reflect more serious psychopathology.
In acquiring the interview data, Klassen & O’Connor (1989) divided both age at first admission
and number of prior admissions into four ordered categories. The two variables do not really ap-
pear to be test items. But they are related to each other, in an obvious sort of way: Those whose
first admission was at a relatively young age tend to have had more previous admissions
[ χ 2 (9) = 16.4, p = 0.05 for independence].
From the point of view of item response theory, the fact that these two “items” are not independ-
ent is explained by their common relationship to an underlying variable: the “long-term nature”
or “seriousness” of the mental health problems for which the person is being admitted. From the
point of view of the researchers attempting to predict subsequent behavior, estimates of individ-
ual values on that underlying continuum may be more useful than either of the two observed
variables alone. Thissen (1991) describes fitting these data with Samejima’s (1969) graded
model, and the consequences for estimating individual scores. This example illustrates the use of
MULTILOG for this purpose. The data are given in the file exampl06.dat. Additional graded models for the happiness data are discussed in Sections 12.4 and 12.7.
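No command file is reproduced with this example; a minimal graded-model setup consistent with the description (two items, four ordered categories each) would look like the sketch below. Every value here is an assumption patterned on the neighboring examples, including the 4 × 4 = 16 pattern count:

EXAMPL06.MLG -
PRIOR ADMISSIONS AND AGE AT FIRST ADMISSION, GRADED MODEL
>PROBLEM RANDOM, PATTERNS, NITEMS=2, NGROUP=1, NPATTERNS=16,
DATA='EXAMPL06.DAT';
>TEST ALL, GRADED, NC=(4,4);
>END;
4
1234
11
22
33
44
(2A1,F4.0)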
In this example, we illustrate the use of MULTILOG with data from an experiment conducted
during the 1974 General Social Survey. The data involve two questions. The first question (in
form A) was, “In general, do you think the courts in this area deal too harshly or not harshly
enough with criminals?”; the responses used here (with their codes) are “Courts too harsh” (1),
“About right” (2), and “Not harsh enough” (3). The second question produced a classification of
the respondents into the three categories “Liberal” (1), “Moderate” (2), and “Conservative” (3).
The first question was asked in different wordings on two forms. The first wording is given
above; the second wording (used on form B) was “In general, do you think the courts in this area
deal too harshly or not harshly enough with criminals, or don’t you have enough information
about the courts to say?” The two forms were randomly assigned to the respondents to the sur-
vey. The point of the split-ballot experiment was to determine the effect of the explicitly offered
“don’t know” alternative in form B. About 7% of the group one (form A) respondents said they
“didn’t know,” and about 29% of the group two (form B) respondents said they “didn’t know.”
Thus, as expected, explicit provision of “don’t know” as an alternative increased the probability
of that response.
Here, we consider only the data from the respondents who chose one of the three (coded) sub-
stantive alternatives listed above. Setting aside the people (differing numbers in the two groups)
who said they “didn’t know,” we consider the hypothesis that the structure of the responses to
the two questions is the same for both wordings. To do this, we hypothesize that a single under-
lying latent variable (in this case, political liberalism-conservatism) accounts for the observed
covariances between the responses to the two questions. We fit the data with Samejima’s (1969)
graded item response model, and consider the goodness-of-fit, the trace lines, and the conse-
quences of the model for inferences about the political attitudes of the respondents.
The data are in the file exampl07.dat. The command lines of exampl07.mlg indicate that the problem is one involving RANDOM
(MML) item parameter estimation, using response-PATTERN data, for 2 items, and 2 groups. The
GRADED model is used, with 3 response categories for each item.
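A sketch of such a file follows. The NPATTERNS value of 18 matches the data listing below, and ESTIMATE NC=25 matches the 25 EM cycles reported in the output; the format statement, which must read the group code, the two responses, and the frequency, is an assumption:

EXAMPL07.MLG -
GSS COURTS ITEM IN TWO WORDINGS, GRADED MODEL, TWO GROUPS
>PROBLEM RANDOM, PATTERNS, NITEMS=2, NGROUPS=2, NPATTERNS=18,
DATA='EXAMPL07.DAT';
>TEST ALL, GRADED, NC=(3,3);
>ESTIMATE NC=25;
>END;
3
123
11
22
33
(A1,1X,2A1,1X,F3.0)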
The data file is shown below. The first column contains 1 for form A, and 2 for form B. Columns
3 and 4 contain codes (1, 2, and 3) for the responses to the two items. The frequencies for each
response pattern for each group are in columns 6-8.
This example illustrates MULTILOG’s use of numbers from 1 to the number of groups (in this
case, 2) to denote group membership. When there is only one group, no group number is read in
the data.
1 11 16
1 12 16
1 13 5
1 21 24
1 22 29
1 23 13
1 31 122
1 32 224
1 33 185
2 11 21
2 12 7
2 13 3
2 21 16
2 22 11
2 23 11
2 31 112
2 32 152
2 33 126
Annotated output is given below. On the first page of the output, MULTILOG reports on the
state of its internal control codes. This information is used mostly for trouble-shooting.
ESTIMATION PARAMETERS:
THE ITEMS WILL BE CALIBRATED--
BY MARGINAL MAXIMUM LIKELIHOOD ESTIMATION
MAXIMUM NUMBER OF EM CYCLES PERMITTED: 25
NUMBER OF PARAMETER-SEGMENTS USED IS: 1
NUMBER OF FREE PARAMETERS IS: 7
MAXIMUM NUMBER OF M-STEP ITERATIONS IS 4 TIMES
THE NUMBER OF PARAMETERS IN THE SEGMENT
THE M-STEP CONVERGENCE CRITERION IS: 0.000100
THE EM-CYCLE CONVERGENCE CRITERION IS: 0.001000
THE RK CONTROL PARAMETER (FOR THE M-STEPS) IS: 0.9000
THE RM CONTROL PARAMETER (FOR THE M-STEPS) IS: 1.0000
THE MAXIMUM ACCELERATION PERMITTED IS: 0.0000
The key and format for the data, and the values for the first observation are printed to help de-
termine that the data have been read properly. The values printed next to NORML are the internal
representation of group membership: 0 means “in group 1” and 9 means “not in group 2.” The
value printed for WT/CR is the frequency (weight). Below, we note that the MML estimation algorithm has essentially converged, since the maximum change between estimation cycles for any parameter is less than 0.004.
ITEM 2: LIB, MOD, CONS; ITEM 1:COURTS HARSH-NOT HARSH; TWO FORMS
READING DATA...
KEY-
CODE CATEGORY
11
22
33
ITEMS 11
NORML 0.000 9.000
WT/CR 16.00
FINISHED CYCLE 25
MAXIMUM INTERCYCLE PARAMETER CHANGE= 0.00367 P( 6)
The Maximum Likelihood estimates of the item parameters are printed here: one value for the slope (A) and two thresholds (B) for each item.
ITEM 2: LIB, MOD, CONS; ITEM 1:COURTS HARSH-NOT HARSH; TWO FORMS
@THETA:
-2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0
I(THETA):
0.34 0.30 0.25 0.19 0.13 0.08 0.05 0.03 0.02
GROUP 1:
EXP. PROP. 0.06 0.09 0.85
GROUP 2:
EXP. PROP. 0.07 0.10 0.83
@THETA:
-2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0
I(THETA):
0.22 0.29 0.33 0.35 0.35 0.35 0.34 0.29 0.23
GROUP 1:
EXP. PROP. 0.27 0.40 0.33
GROUP 2:
EXP. PROP. 0.30 0.40 0.29
Beneath the parameter estimates for each item, MULTILOG prints the information I[ θ ] for that
item at nine values of θ from –2 to 2, and the observed and expected frequencies for each re-
sponse alternative.
@THETA:
-2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0
I(THETA):
1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
@THETA:
-2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0
I(THETA):
1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
FOR GROUP 1:
@THETA: -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0
I(THETA): 1.6 1.6 1.6 1.5 1.5 1.4 1.4 1.3 1.2
SE(THETA):0.80 0.79 0.79 0.81 0.82 0.83 0.85 0.87 0.89
FOR GROUP 2:
@THETA: -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0
I(THETA): 1.6 1.6 1.6 1.5 1.5 1.4 1.4 1.3 1.2
SE(THETA): 0.80 0.79 0.79 0.81 0.82 0.83 0.85 0.87 0.89
In this case, the population distributions of the two groups are assumed to be normal.
MULTILOG prints the estimated or fixed means (MU) and the standard deviations. It also prints
the total test information I[ θ ] for each group, its inverse square root SE[ θ ], and the marginal
reliability.
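The SE[θ] row is simply the inverse square root of the I[θ] row; for example, at θ = –2.0 for either group,

$$SE(\theta) = \frac{1}{\sqrt{I(\theta)}} = \frac{1}{\sqrt{1.6}} = 0.79,$$

as printed above.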
ITEM 2: LIB, MOD, CONS; ITEM 1:COURTS HARSH-NOT HARSH; TWO FORMS
GROUP 1
OBSERVED(EXPECTED) STD. : EAP (S.D.) : PATTERN
RES. : :
16.0( 17.3) -0.31 : -1.21 ( 0.84) : 11
16.0( 13.7) 0.62 : -0.55 ( 0.81) : 12
5.0( 6.1) -0.46 : -0.04 ( 0.88) : 13
24.0( 23.8) 0.04 : -0.97 ( 0.79) : 21
29.0( 22.8) 1.30 : -0.40 ( 0.77) : 22
13.0( 11.2) 0.54 : 0.09 ( 0.84) : 23
122.0( 131.6) -0.84 : -0.33 ( 0.84) : 31
224.0( 217.8) 0.42 : 0.22 ( 0.80) : 32
185.0( 189.6) -0.34 : 0.84 ( 0.87) : 33
ITEM 2: LIB, MOD, CONS; ITEM 1:COURTS HARSH-NOT HARSH; TWO FORMS
GROUP 2
OBSERVED(EXPECTED) STD. : EAP (S.D.) : PATTERN
RES. : :
21.0( 15.6) 1.36 : -1.32 ( 0.84) : 11
7.0( 11.1) -1.23 : -0.66 ( 0.81) : 12
3.0( 4.6) -0.74 : -0.17 ( 0.89) : 13
16.0( 20.6) -1.02 : -1.07 ( 0.78) : 21
11.0( 18.0) -1.65 : -0.50 ( 0.77) : 22
11.0( 8.2) 0.99 : -0.03 ( 0.84) : 23
112.0( 102.9) 0.90 : -0.44 ( 0.84) : 31
152.0( 155.5) -0.28 : 0.11 ( 0.80) : 32
126.0( 122.4) 0.32 : 0.72 ( 0.86) : 33
These tables summarize the goodness-of-fit of the model to the data. Observed and expected values are printed for each response pattern in each group, as well as the standardized residual, which may be taken to be approximately normally distributed with mean zero and variance one for diagnostic purposes. The χ² statistics indicate that the model fit is satisfactory (on 18 – 2 [group totals] – 7 [parameters fitted] = 9 degrees of freedom). The tables also include EAP[θ] for each response pattern, and the corresponding standard deviation.
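The standardized residuals printed in these tables are consistent with the usual Pearson form (a reconstruction checked against the tables above, not a formula quoted from the program documentation):

$$z = \frac{O - E}{\sqrt{E}}; \qquad \text{e.g., } \frac{16.0 - 17.3}{\sqrt{17.3}} = -0.31$$

for the first response pattern in group 1.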
Figure 12.1 shows the trace lines computed from the item parameters in the MULTILOG output.
We note that the “Liberal-Conservative” question divides the respondents approximately equally
into three “centered” groups, while the question on the courts has trace lines crossing each other
far on the left. Only the most liberal respondents consider the courts sufficiently harsh. As ex-
pected, the questions are strongly related. For further discussion of part of these data, see Thissen
& Steinberg (1988). A scoring run on these data is discussed in the next section.
Having concluded that the model fits the data satisfactorily, we set up a “scoring run” in which
we compute MAP[ θ ] for each response pattern, as though each line of the input data file repre-
sented an individual observation. This sequence of events represents the normal use of
MULTILOG: the item analysis and individual scoring are done in two separate runs of the com-
puter program. Frequently, several (sometimes even many) item analysis runs are performed be-
fore a satisfactory model is selected. Only after this is accomplished does it make sense to com-
pute estimates of θ for each respondent.
To set up a scoring run for the data described in the previous section, the syntax in courtab2.mlg
is used. The command lines entered here indicate that the problem is one involving the calcula-
tion of SCORE for INDIVIDUAL data, for 2 items, and 2 groups. The GRADED model is used, with 3
response categories for each item. The START command is used to enter the item parameters from
the previous run. The parameters are entered in the order that they are printed, following a user-
supplied format. Usually, these parameters are read from a file previously saved by MULTILOG.
Note that the user should provide information about the key and data format. In the data format,
the first 2A1 refers to the NCHARS=2 characters of ID information; in this case, that reads the re-
sponse pattern as the label.
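A sketch of courtab2.mlg along these lines follows. The NEXAMINEES count (one "observation" per data line), the name of the saved parameter file, and the format statement (its first 2A1 reading the response pattern as the ID, as just described) are assumptions:

COURTAB2.MLG -
SCORING RUN FOR THE COURTS DATA, MAP ESTIMATES
>PROBLEM SCORE, INDIVIDUAL, NITEMS=2, NGROUPS=2, NEXAMINEES=18, NCHARS=2,
DATA='EXAMPL07.DAT';
>TEST ALL, GRADED, NC=(3,3);
>START ALL, FORMAT, PARAM='EXAMPL07.PRM';
>END;
3
123
11
22
33
(T3,2A1,T3,2A1)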
The numbers in the column marked THETAHAT in the output file obtained for this analysis are the
values of MAP[ θ ] for each response pattern; the response patterns are used as the ID fields on
the right. The estimated standard errors are tabulated, as well as the number of iterations required
by the Newton-Raphson algorithm. Comparing these values for each response pattern to the cor-
responding EAPs, we note very little difference.
Bock (1975, pp. 512-547) describes the graded model, a model for ordered categorical data, in
detail and includes an application to a set of behavioral data. The data are in a file called ex-
ampl09.dat, which contains the following four lines:
1 7 0 2 11
0 6 0 6 10
0 2 0 5 11
3 10 2 0 2
Each of the four lines of data represents one of four groups of mice; each group of mice repre-
sents a cell of a 2 x 2 experimental design. The response variable is a classification of the mice in
each group according to the ordered severity of audiogenic seizures they exhibit; the column-
categories are “crouching,” “wild running,” “clonic seizures,” “tonic seizures,” and “death.”
Bock (1975) uses a model for responses in ordered categories, formally identical to Samejima’s
graded item response model, to relate the categorical response to effects of the experimental
conditions. This example reproduces the estimates for Bock’s “main class and interaction”
model.
The algorithm used in MULTILOG is very different from that described by Bock, and requires a different treatment of the location constraints. Bock’s system constrains the group
means (called µ there, and θ here) to total zero; he estimates three contrasts among the four
group values. That is impossible in MULTILOG, so all four group locations ( θ ) are estimated,
and one of the thresholds, called BK=4, is fixed at the value Bock obtains (0.4756). With this con-
straint, the results obtained with MULTILOG match those printed in the original source. Of
course, in real data analysis, one would not have such a value and one of the thresholds would be
fixed at some arbitrary value, like zero. If BK=4 had been fixed at zero in the current example, all
of the values of θ and the other three thresholds would have been shifted 0.4756 from the values
in the text.
The command file illustrates user input for FIXED- θ analysis, with data in the form of the table
given above. In this case, there is a single item, and the rows of the table are the groups, so
NGROUP=4. The TGROUPS command specifies four starting values (1,1,1,–1) for the four values of
θ , one for each group. These values must be entered manually. The slope is fixed at a value of 1
and BK=4 is fixed at 0.4756.
The analysis is set up in the command file exampl09.mlg. To see how to generate this command file using the syntax wizard, please see Section 4.3.3.
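The file is not reproduced here; the following sketch reflects the description above. The spelling of the FIXED specification on the PROBLEM command, the TGROUPS value list, and the FIX forms for the slope and for BK=4 are all assumptions (only the constants 1.0 and 0.4756 are given in the text), and the key and format section is omitted:

EXAMPL09.MLG -
BOCK (1975) AUDIOGENIC SEIZURE DATA, FIXED-THETA GRADED MODEL
>PROBLEM FIXED, NITEMS=1, NGROUP=4,
DATA='EXAMPL09.DAT';
>TEST ALL, GRADED, NC=5;
>TGROUPS NUMBERS=4, QP=(1.0,1.0,1.0,-1.0);
>FIX ITEMS=1, AJ, VALUE=1.0;
>FIX ITEMS=1, BK=4, VALUE=0.4756;
>END;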
In the MULTILOG output, the values of the thresholds, corresponding to those tabulated on p.
547 by Bock, are printed as the values of B(K) in the item summary; their estimated standard er-
rors differ slightly from those in the original analysis, because MULTILOG uses a somewhat
less precise algorithm for computing estimated standard errors. For each θ -group, the estimated
value of θ is printed before the word DATA, e.g. 0.32 for group 1. Those four values correspond
to the four µs on page 546 in Bock (1975). The remainder of the table gives the observed and
expected counts (proportions and probabilities in parentheses) for each cell in the 4 x 5 table, and
the values of the likelihood ratio and Pearson goodness-of-fit statistics. A selection of the output
follows.
ITEM SUMMARY
ITEM 1
TH-GROUP CATEGORY
1 2 3 4 5
0.32 DATA 1.0(0.05) 7.0(0.33) 0.0(0.00) 2.0(0.10) 11.0(0.52)
EXPECTED 0.5(0.02) 5.9(0.28) 0.7(0.03) 4.3(0.20) 9.7(0.46)
D.F.= 9 (IF TEST ALL CATEGORICAL AND THERE ARE NO EMPTY TH-GROUPS)
The multiple-choice model (Thissen & Steinberg, 1984) includes a separate trace line for each
alternative response—the key and all of the distractors—on a multiple-choice item. The model is
a development of suggestions by Bock (1972) and Samejima (1979). For this reason, it is re-
ferred to as the “BS” model. The procedures involved differ from those used when the responses
on multiple-choice items are made binary by scoring correct or incorrect before the item analy-
sis. The data are more complex: for 4 four-alternative multiple-choice items, there are 4⁴ = 256 possible response patterns; if the data are made binary, there are only 2⁴ = 16 response patterns.
The model is more complex: the multiple-choice model has eleven free parameters for each four-
alternative item, while the 3PL model has only three. The model, its estimation, and its interpre-
tation are described by Thissen & Steinberg (1984) and Thissen, Steinberg & Fitzpatrick (1989).
The interested reader is referred to those sources.
The first 12 lines of the data file exampl10.dat are shown below.
1 1111
2 1113
1 1121
1 1133
1 1134
2 1143
1 1144
2 1222
2 1232
1 1233
2 1242
5 1243
This example shows the MULTILOG output for an item parameter estimation problem. The data
are the responses of 976 examinees to 4 four-alternative multiple-choice vocabulary items. The
model is fitted with constraints described by Thissen & Steinberg (1984) as the “ABCD(C),
ABCD(D)” model. In addition to the constraints on ak giving the “ABCD(C), ABCD(D)” model, two
of the relatively ill-determined c-contrasts are fixed at zero, increasing the precision of estima-
tion of the entire model, without damaging the fit.
On the PROBLEM command, the four-choice items are defined as having five response categories
[NC=(5(0)4)], because the Multiple-choice (“BS”) model appends an additional latent response
category to each item. This category is denoted DK (for “Don’t Know”) by Thissen & Steinberg
(1984), and must be category 1 in MULTILOG; i.e. the “real” responses are keyed into catego-
ries 2, 3, 4, and 5. The correct answers for these four items are [D,C,C,D], so HIGH =
(5,4,4,5).
The EQUAL commands impose the constraint that the proportions of the “DK” curve distributed
into each of the observed response categories are the same within the pairs of items with the
same keyed correct response. The two c-contrasts are fixed at zero, because we have found since
the publication of the original paper that the model is better-conditioned with the addition of
such constraints, and there is no apparent damage to the fit. As a matter of fact, this run pro-
duces a better fit than that reported for the slightly less constrained ABCD(C), ABCD(D) model in
the original paper, because this version of MULTILOG appears to converge somewhat more
completely than the version of MULTILOG (4.0) that provided the findings reported in the pa-
per. The two parameters fixed at zero had estimated standard errors several times larger than
their absolute values when they were estimated. Because of the substantial error covariances
among the parameters of this model, their estimation induced large standard error estimates in
several of the other parameters. Fixing the two ck's produces much more stable results. The parameter estimates are printed in the following selection of the MULTILOG output, both in contrast form and as the ak's, ck's, and dk's. The goodness-of-fit statistics and EAP[θ]s for each observed pattern are printed on the final pages of the output.
The more heavily constrained model also runs faster in MULTILOG. In general, poorly identi-
fied models require much more computing time than more highly constrained models.
The parameterization
The relationship between the unconstrained (or contrast) parameters estimated by MULTILOG and the constrained parameters of the model is fairly complex; here we provide illustrations based on item 1 of the example.
The model is

$$P(x = k) = \frac{\exp[a_k\theta + c_k] + d_k \exp[a_1\theta + c_1]}{\sum_{i=1}^{m+1} \exp[a_i\theta + c_i]},$$

and the vector of slope parameters is obtained from the estimated contrasts as

$$\mathbf{a}' = \boldsymbol{\alpha}'\,\mathbf{T}_a',$$
where α contains the unconstrained parameters estimated by MULTILOG. For item 1, this is
(with the vectors transposed to fit on the page):
$$\begin{bmatrix} -2.98 & -5.01 & 3.91 & -0.66 & 4.75 \end{bmatrix} = \begin{bmatrix} -2.03 & 6.89 & 2.33 & 7.73 \end{bmatrix} \begin{bmatrix} -0.20 & 0.80 & -0.20 & -0.20 & -0.20 \\ -0.20 & -0.20 & 0.80 & -0.20 & -0.20 \\ -0.20 & -0.20 & -0.20 & 0.80 & -0.20 \\ -0.20 & -0.20 & -0.20 & -0.20 & 0.80 \end{bmatrix}$$
The estimates of the parameters ak , in the vector a , are printed in the row marked A(K) in the
MULTILOG output, and the estimates of the (unconstrained) parameters in the vector α are
printed in the column marked CONTRAST COEFFICIENTS FOR A. Using the (default) deviation
contrasts in T , there is a fairly straightforward scalar interpretation of the parameters:
$$a_1 = -\tfrac{1}{5}\sum_{j=1}^{4}\alpha_j = -2.98$$

and

$$a_k = \alpha_{k-1} - \tfrac{1}{5}\sum_{j=1}^{4}\alpha_j, \qquad k = 2, \ldots, 5,$$

where α′ = [–2.03 6.89 2.33 7.73] contains the parameters estimated by MULTILOG. This
has a direct bearing on the imposition of equality constraints using the MULTILOG command language. If, for instance, one wanted to constrain a3 and a5 to be equal, one would enter the command

>EQUAL ITEMS=1, WITH=1, AK=(2,4);

because this would set the second and fourth contrasts among the as equal (they are currently estimated as 6.89 and 7.73); the consequence of this would be that a3 and a5 would be equal. Any constraints involving a1 are different: To constrain a1 and a2 to be equal, for instance, one would enter the command

>FIX ITEMS=1, AK=1, VALUE=0.0;

which would have the effect of fixing the first contrast among the as (currently estimated to be –2.03) at a value of zero. If that is true, a1 = a2. The computation of the cs is parallel in all respects to that for the as. Note that, in the example as printed, the commands >FIX ITEMS=1, CK=3, VALUE=0.0; and >FIX ITEMS=2, CK=2, VALUE=0.0; fix c-contrasts in exactly this way.
The use of different T -matrices (Polynomial or Triangle) changes the relationship between
the unconstrained parameters estimated by MULTILOG and the as. However, MULTILOG
commands to FIX or EQUAL parameters always refer to the unconstrained contrast parameters,
and algebraic manipulation similar to that described here is necessary to obtain any desired con-
straints on the as or cs themselves.
The relationship between the ds and the unconstrained parameters estimated by MULTILOG is
somewhat more complex, because the parameters represented by d k are proportions (represent-
ing the proportion of those who “don’t know” who respond in each category on a multiple-
choice item; see Thissen & Steinberg, 1984). Therefore, the constraint that $\sum_k d_k = 1$ is required. This is enforced by estimating the dk such that

$$d_k = \frac{\exp[d_k^*]}{\sum_k \exp[d_k^*]}$$
and

$$\mathbf{d}^{*\prime} = \boldsymbol{\delta}'\,\mathbf{T}_d'.$$
The elements of the vector δ are the parameters estimated by MULTILOG, and printed in the
column marked CONTRAST COEFFICIENTS FOR D; in the case of item 1 of this example, δ′ = [0.76 –0.13 1.28]. These values are used (internally) by MULTILOG to compute the values of $d_k^*$. In this case, they are:

$$\begin{bmatrix} -0.47 & 0.28 & -0.61 & 0.80 \end{bmatrix} = \begin{bmatrix} 0.76 & -0.13 & 1.28 \end{bmatrix} \begin{bmatrix} -0.25 & 0.75 & -0.25 & -0.25 \\ -0.25 & -0.25 & 0.75 & -0.25 \\ -0.25 & -0.25 & -0.25 & 0.75 \end{bmatrix}$$
Then

$$\sum_k \exp[d_k^*] = \exp[-0.47] + \exp[0.28] + \exp[-0.61] + \exp[0.80] = 0.625 + 1.323 + 0.543 + 2.226 = 4.717.$$

So

$$d_1 = \frac{0.625}{4.717} = 0.13, \quad d_2 = \frac{1.323}{4.717} = 0.28, \quad d_3 = \frac{0.543}{4.717} = 0.12, \quad \text{and} \quad d_4 = \frac{2.226}{4.717} = 0.47.$$
The four proportions [0.13, 0.28, 0.12, 0.47] are printed as D(K) in the MULTILOG output, in
columns 2, 3, 4 and 5 because those columns represent the parameters for the observed item re-
sponses.
The example illustrates the imposition of equality constraints on the ds between items. To im-
pose equality constraints on the ds within an item, the procedure is parallel to that described pre-
viously for imposing within-item equality constraints on as and cs. For instance, to impose the
constraint that d2 and d3 should be equal, one would enter the command

>EQUAL ITEMS=1, WITH=1, DK=(1,2);

because this would set the first and second contrasts among the ds equal (they are currently estimated as 0.76 and –0.13). The consequence of this would be that d2 and d3 would be equal. Any constraints involving d1 are different: To constrain d1 and d3 to be equal, for instance, one would enter the command

>FIX ITEMS=1, DK=2, VALUE=0.0;

which would have the effect of fixing the second contrast among the ds (currently estimated to be –0.13) at a value of zero. If that is true, then d1 = d3.
The command file exampl10.mlg is shown below. Another example of the fitting of a multiple-
choice model is given in Section 12.11.
EXAMPL10.MLG -
ABCD(C) ABCD(D) WITH TWO C(K)S FIXED AT ZERO
>PROBLEM RANDOM, PATTERNS, NITEMS=4, NGROUP=1, NPATTERNS=156,
DATA=‘EXAMPL10.DAT’;
>TEST ALL, BS, NC=(5(0)4), HIGH=(5,4,4,5);
>EQUAL ITEMS=(1,4), DK=(1,2,3);
>EQUAL ITEMS=(2,3), DK=(1,2,3);
>FIX ITEMS=1, CK=3, VALUE=0.0;
>FIX ITEMS=2, CK=2, VALUE=0.0;
>SAVE;
>ESTIMATE NC=25;
>TGROUPS NUMBERS=10, QP=(-4.5(1.00)4.5);
>END;
4
1234
2222
3333
4444
5555
(10X,4A1,T3,F4.0)
@THETA: INFORMATION:
-3.0 - -1.6 2.520 2.883 3.164 3.272 3.163 2.879 2.525 2.224
-1.4 - 0.0 2.088 2.257 3.024 4.791 5.243 3.507 2.811 3.882
0.2 - 1.6 6.688 8.513 6.862 4.443 2.957 2.194 1.787 1.550
1.8 - 3.0 1.400 1.300 1.231 1.181 1.144 1.116 1.095
Thissen, Steinberg & Fitzpatrick (1989) described the use of the multiple-choice model with four
items from a nation-wide tryout of achievement test items conducted in 1987 by CTB/McGraw-
Hill. The data comprised the responses of the 959 examinees who responded to four items on a single
page of one of the tryout forms. The items are included in the report by Thissen, Steinberg &
Fitzpatrick (1989).
The data for the analysis were the observed counts of examinees giving each of the 4⁴ = 256 possible response patterns to the four items. Fitting the 256-cell contingency table with the multiple-choice model with no constraints, the likelihood ratio G² with 211 d.f. was 226.0, which indicates a satisfactory fit. However, examination of the item parameters, the trace lines, and the items themselves led us to impose a number of constraints on the model. Using MULTILOG subscripts, where category 1 = “don’t know” and the observed responses are in categories 2-5:
For items 2, 3, and 4, we constrained dk = 0.25 for all four alternatives with >FIX ITEMS=(2,3,4), DK=(1,2,3), VALUE=0.0.
For item 1, we constrained d1 = d3 = d4 with >FIX ITEMS=1, DK=(2,3), VALUE=0.0.
For items 1 and 2, we constrained a2 = a1 with >FIX ITEMS=1, AK=1, VALUE=0.0 and >FIX ITEMS=2, AK=1, VALUE=0.0.
For item 2, we constrained a3 = a5 with >EQUAL ITEMS=2, WITH=2, AK=(2,4); for item 3, we constrained a3 = a4 with >EQUAL ITEMS=3, WITH=3, AK=(2,3); and for item 4, we constrained a2 = a3 = a5 with >EQUAL ITEMS=4, WITH=4, AK=(1,2,4).
For item 3, we constrained a2 = a5 with >EQUAL ITEMS=3, WITH=3, AK=(1,4).
These constraints reduce the number of parameters (contrasts) estimated from 44 to 26. The
goodness-of-fit statistic under all of the constraints is χ 2 (229) = 236.9 , which is very close to
expectation. The overall test of significance of the 18 contrasts among the parameters eliminated
in these constraints is χ 2 (18) = 236.9 − 226.0 = 10.9.
Thus no significant differences among the trace lines have been eliminated in the imposition of
these constraints. However, the remaining parameters are much more precisely estimated and the
corresponding trace lines are smoother than those involving many parameters that are not well-
specified by the data.
On the following pages we illustrate the use of MULTILOG to compute the estimates. Note that
we increased the number of quadrature points (with the TGROUPS command) from the default 10
to 13. This increases the usefulness of the approximate standard errors. We also impose a gentle
Bayesian prior on d-contrast 1 for item 1 (the only estimated d-contrast); as with the 3PL model,
weak priors on the d-contrasts are usually helpful.
Syntax for this run, as shown below, is given in exampl11.mlg while the data file is ex-
ampl11.dat.
EXAMPL11.MLG -
"CALORIC CONSUMPTION ITEMS", TSF, JEM 89
>PROBLEM RANDOM, PATTERNS, NITEMS=4, NGROUP=1, NPATTERNS=148,
DATA=‘EXAMPL11.DAT’;
>TEST ALL, BS, NC=(5(0)4), HIGH=(3,4,5,4);
>SAVE;
>TGROUPS NUMBERS=13, QP=(-4.5(0.75)4.5);
>FIX ITEMS=1, AK=1, VA=0.0;
>FIX ITEMS=2, AK=1, VA=0.0;
>EQUAL ITEMS=2, WITH=2, AK=(2,4);
>EQUAL ITEMS=3, WITH=3, AK=(1,4);
>EQUAL ITEMS=3, WITH=3, AK=(2,3);
>EQUAL ITEMS=4, WITH=4, AK=(1,2,4);
>FIX ITEMS=1, DK=(2,3), VALUE=0.0;
>FIX ITEMS=(2,3,4), DK=(1,2,3), VALUE=0.0;
>PRIOR ITEMS=1, DK=1, PA=(0.0,1.0);
>ESTIMATE NC=100;
>END;
4
1234
2222
3333
4444
5555
(4A1,F4.0)
CATEGORY(K): 1 2 3 4 5
A(K) -1.94 -1.94 1.31 0.68 1.88
C(K) 0.05 -0.29 1.17 1.74 -2.68
D(K) 0.12 0.65 0.12 0.12
In their description of the use of latent class models for the validation of the structure of knowl-
edge domains, Bergan & Stone (1985) report a number of analyses of the data in this example.
The data were collected as the responses to four items measuring the numerical knowledge of a
sample of preschool children in the Head Start program. The first two items required the children
to identify numerals (3 and 4), and the second two items required the children to match the cor-
rect numeral (again, 3 or 4) represented by a number of blocks.
In an analysis reported in Thissen & Steinberg (1988), the items were redefined as two pseudo-
items, each of which has four response categories. The first of these pseudo-items is denoted
“Identify,” which has four categories of response: correctly identifying neither numeral, only 3,
only 4, or both correct. The second pseudo-item is called “Match,” with the same four response
categories. The pseudo-items are logically equivalent to testlets described by Wainer & Kiely
(1987): They are clusters of items between which conditional independence may reasonably be
expected.
The trace line model used here is Bock’s (1972) nominal model. Equality constraints are im-
posed among the parameters: for “Identify,” a2 = a1 ; for “Match,” a3 = a2 and c3 = c2 . Given
the use of “Triangle” T -matrices, these constraints are imposed by fixing a- and c-contrasts at
zero, because those contrasts represent the differences between successive as and cs. This exam-
ple also illustrates entry of starting values; MULTILOG’s default starting values do not perform
well in this example. The fit of the model is quite good: χ 2 = 8.4, p = 0.2.
Syntax for this model, as shown below, is contained in the file exampl12.mlg and is based on
data in exampl12.dat. Additional examples of nominal models are given in the next two sec-
tions.
EXAMPL12.MLG -
BERGAN & STONE DATA ON PRESCHOOLERS AND ‘3 AND 4’
>PROBLEM RANDOM, PATTERNS, NITEMS=2, NGROUPS=1, NPATTERNS=16,
DATA=‘EXAMPL12.DAT’;
>TEST ALL, NOMINAL, NC=(4,4), HIGH=(4,4);
>TMATRIX ALL, AK, TRIANGLE;
>TMATRIX ALL, CK, TRIANGLE;
>FIX ITEMS=1, AK=1, VALUE=0.0;
>FIX ITEMS=2, AK=2, VALUE=0.0;
>FIX ITEMS=2, CK=2, VALUE=0.0;
>START ITEMS=(1,2), PARAMS=‘EXAMPL12.PRM’;
>END;
4
N34B
11
22
33
44
(2A1,F4.0)
This example illustrates the computations involved in the analysis of the “life satisfaction” data
described by Thissen & Steinberg (1988). The data consist of the counts of respondents in a 3³ cross-classification based on the responses of 1472 respondents to the 1975 General Social Sur-
vey (Davis, 1975), to three questions concerning satisfaction with family (F), hobbies (H), and
residence (R). In the original data, there were seven responses available. In previous analyses,
Clogg (1979) re-classified the data into three categories, and Masters (1985) used the trichoto-
mized data. Better data analysis would probably be obtained with the original seven-category
data, or at least a more sensible reduction; Muraki (1984), for instance, used a different four-
category system for the same seven responses. However, the analysis illustrated here corresponds
to that described by Thissen & Steinberg (1988) and uses the trichotomized data.
In this illustration, we again use Bock’s (1972) nominal model. This model for the trace lines is
extremely flexible; however, it is frequently too flexible and some additional constraints on the
item parameters are required to give a satisfactory solution. When fitted without constraints, for
item F, the difference between a1 and a2 is nearly zero. For items H and R, the difference be-
tween a1 and a2 is small and similar; and the difference between a1 and a3 is about the same for
all three items. In this example, we impose equality constraints to make these small differences
exactly zero. Using the (default) deviation contrasts, this is done with >FIX ITEMS=1, AK=1,
VALUE=0.0 [to set a1 = a2 for item 1], >EQUAL ITEMS=(2,3), AK=1 [to set ( a1 − a2 ) equal for
items 2 and 3] and >EQUAL ITEMS=(1,2,3), AK=2 [to set (a1 − a3 ) equal for all three items].
Imposing these equality constraints gives a version of the nominal model that (barely) fits the data.
Additional examples of nominal models are given in Sections 12.10 and 12.12. The contents of
the command file exampl13.mlg are shown below.
EXAMPL13.MLG:
SATISFACTION DATA FOR THE PARAMETERS IN TABLE 7, T&S 88
>PROBLEM RANDOM, PATTERNS, NITEMS=3, NGROUP=1, NPATTERNS=27,
DATA=‘EXAMPL13.DAT’;
>TEST ALL, NOMINAL, NC=(3,3,3), HIGH=(3,3,3);
>FIX ITEMS=1, AK=1, VALUE=0.0;
>EQUAL ITEMS=(2,3), AK=1;
>EQUAL ITEMS=(1,2,3), AK=2;
>SAVE;
>END;
3
123
111
222
333
(1X,3A1,F4.0)
In this example, we consider the responses of 3866 examinees to a 4-passage, 22-item test of
reading comprehension. For a complete description of the data and the analysis, see Thissen,
Steinberg & Mooney (1989). The reading passages were of varying lengths, and they were fol-
lowed by varying numbers of questions about their content, from three to eight questions. Instead
of considering the test to be comprised of 22 binary items, we considered it to be made up of four
testlets (Wainer & Kiely, 1987). Each testlet has q questions (q = 7, 4, 3, 8), and the four testlet
responses for each examinee are the number of questions correct for each of the four passages.
Thus the seven questions following the first passage constitute a single testlet, with responses
x=0, 1, 2, …, 7.
The model we used for the number-correct for each passage was the nominal model (Bock,
1972). We reparameterized the model using centered polynomials of the associated scores to rep-
resent the category-to-category change in the ak s and ck s (with TMATRIX … POLYNOMIAL). This-
sen & Steinberg (1986) showed that the polynomial-contrast version of the nominal model is
equivalent to Masters’ (1982) “partial credit” model for ordered item responses when the con-
trasts among the as are restricted to be linear, and constant for all items. We did not expect that such a simple model would fit the data; for instance, we did not expect a priori that the testlets
would be equally related to proficiency, so we permitted the linear contrast among the as to vary
over items. Guessing may cause a score of one on a multi-question passage to reflect little more
proficiency than a score of zero, but higher scores should be more ordered. The linear-plus-quad-
ratic polynomial for the a-contrasts was intended to produce as that may be similar for scores of
zero and one, and increasing for higher scores. The polynomial parameterization for the cs is in-
tended to capture the smoothness in the distribution of response proportions for adjacent scores.
To improve the stability of estimation of the item parameters, we located the lowest-degree polynomials that provided a satisfactory fit to the data. We used the likelihood ratio statistics
to evaluate the models. For the unconstrained nominal model twice the negative log likelihood
was 1048.3 (this is not distributed as χ 2 with any clear degrees of freedom; only 652 of the 1440
cells of the 4-way contingency table are non-zero). Upon reducing the rank of the polynomials
for the as to one (linear in number-correct) for testlets 1, 3, and 4, and to two (quadratic in num-
ber-correct) for testlet 2, we obtained a value of 1082.2; the likelihood ratio test for the signifi-
cance of this reduction is χ 2 (17) = 33.9, p = 0.01. While this value is significant, it is not
highly significant given the sample size (3866). No individual term among those eliminated was
extremely significant. The significance arose from moderately large χ 2s for two or three rela-
tively high-order polynomial terms (e.g., χ 2s of about 5 for fourth- and seventh-degree terms).
Upon finding that any further reduction in the rank of the a-parameterization induces a highly
significant change in the goodness of fit, we settled on linear as for testlets 1, 3, and 4 and quad-
ratic as for testlet 2.
Using the reduced-rank as, we then reduced the rank of the polynomials for the cs to {3, 3, 2, 4}
for the four testlets; χ 2 (10) = 12.0, p = 0.3 for the ten high-order polynomial terms eliminated.
Any further reduction caused a highly significant change in the goodness-of-fit. On the following
pages, we fit the model to the 8 x 5 x 4 x 9 cross-classification of observed response-pattern fre-
quencies, with constraints imposed to give the final model.
The data are given in exampl14.dat and the syntax for this run (exampl14.mlg) is given below.
EXAMPL14.MLG -
READING COMPREHENSION AS 4 TESTLETS, FINAL MODEL, TSM 89 JEM
>PROBLEM RANDOM, PATTERNS, NITEMS=4, NGROUPS=1, NPATTERNS=652,
DATA=‘EXAMPL14.DAT’;
>TEST ALL, NOMINAL, NC=(8,5,4,9), HIGH=(8,5,4,9);
>SAVE;
>TMATRIX ALL, AK, POLYNOMIAL;
>TMATRIX ALL, CK, POLYNOMIAL;
>FIX ITEMS=1, AK=(2(1)7), VALUE=0.0;
>FIX ITEMS=2, AK=(3,4), VALUE=0.0;
>FIX ITEMS=3, AK=(2,3), VALUE=0.0;
>FIX ITEMS=4, AK=(2(1)8), VALUE=0.0;
>FIX ITEMS=1, CK=(4,5,6,7), VALUE=0.0;
>FIX ITEMS=2, CK=4, VALUE=0.0;
>FIX ITEMS=3, CK=3, VALUE=0.0;
>FIX ITEMS=4, CK=(5,6,7,8), VALUE=0.0;
>END;
9
123456789
1111
2222
3333
4444
5505
6006
7007
8008
0009
(4A1,F5.0)
12.15 A mixed nominal and graded model for self-report inventory items
In research concerned with eating disorders among college women, Irving (1987) used a ques-
tionnaire called the BULIT, a 36-item index created to identify individuals with, or at risk for developing, bulimia (Smith & Thelen, 1984). All of the items on the scale have five response al-
ternatives; most are “Likert-type” items. The questionnaire was developed to be scored by add-
ing the numbers (from 1 to 5) associated with each response; high scores imply high risk. But
the BULIT also includes items for which the responses are not so obviously ordered.
In this example, we illustrate the use of MULTILOG to fit different models to different items of
the same scale, as described by Thissen (1991). We use Bock’s (1972) nominal model for item 1
of exampl15.dat, while we use Samejima’s (1969) graded model for items 2 and 3. For the 5 x 5
x 5 table arising from the cross-classification based on the three items described by Thissen
(1991), the graded model for items 2 and 3 and the nominal model for item 1 give
χ 2 (108) = 99.9, p = 0.6.
Syntax for this run, from the file exampl15.mlg, is shown below.
EXAMPL15.MLG -
HYBRID GRADED-NOMINAL SET OF ITEMS FROM THE BULIT
>PROBLEM RANDOM, PATTERNS, NITEMS=3, NGROUP=1, NPATTERNS=69,
DATA=‘EXAMPL15.DAT’;
>TEST ITEMS=1, NOMINAL, NC=5, HIGH=5;
>TEST ITEMS=(2,3), GRADED, NC=(5,5);
>FIX ITEMS=1, AK=(1,3), VALUE=0.0;
>END;
5
12345
111
222
333
444
555
(1X,3A1,F5.0)
12.16 A mixed three-parameter logistic and partial credit model for a 26-item
test
In this example, we illustrate the use of MULTILOG for item analysis for a test comprising 26
conventional multiple-choice items (scored dichotomously: correct or incorrect), and a 27th item
with three response categories. We use the 3PL model for items 1-26, and Bock’s (1972) nomi-
nal model (with constraints making it equivalent to Masters’ partial credit model) for item 27.
Note that, as in the previous section, the specification of two distinct item response models is
done with two TEST commands.
In this example, we use Bayesian prior distributions for all three parameters of the 3PL model:
we assume that the slopes (as) are distributed normally with an average value of 1.7 (equal to a
slope of 1.0 in the usual “normal metric” of the 3PL) and a standard deviation of 1.0. We assume that the bs are distributed normally with mean zero and standard deviation 2 (this serves only to limit the bs for very easy or very difficult items); and we assume that the logit of the lower asymptote is normally distributed with an average of –1.4 and a standard deviation of 0.5 (the values on the PRIORS commands in the command file below). The
TMATRIX commands establish the partial credit parameterization for item 27.
Using MULTILOG, there is no problem combining item response models to analyze and score
items with different kinds of responses on the same test. The data file exampl16.dat is used in
this example and the command file (exampl16.mlg) is shown below.
EXAMPL16.MLG -
MIXTURE OF 26 3PL ITEMS AND ONE PARTIAL CREDIT ITEM
>PROBLEM RANDOM, INDIVIDUAL, NITEMS=27, NGROUP=1, NEXAMINEES=668,
DATA=‘EXAMPL16.DAT’;
>TEST ITEMS=(1(1)26), L3;
>TEST ITEMS=27, NOMINAL, NC=3, HIGH=3;
>PRIORS ITEMS=(1(1)26), AJ, PARAMS=(1.7,1.0);
>PRIORS ITEMS=(1(1)26), BJ, PARAMS=(0.0,2.0);
>PRIORS ITEMS=(1(1)26), CJ, PARAMS=(-1.4,0.5);
>TMATRIX ITEMS=27, AK, POLYNOMIAL;
>TMATRIX ITEMS=27, CK, TRIANGLE;
>FIX ITEMS=27, AK=2, VALUE=0.0;
>SAVE ;
>END;
5
01239
111111111111111111111111110
222222222222222222222222221
000000000000000000000000002
000000000000000000000000003
000000000000000000000000000
(12A1,2X,15A1)
In an attempt to link the study of social norms and the study of personality, Stouffer & Toby
(1951) devised three forms of a questionnaire designed to measure a personality disposition to-
ward “particularistic” (as opposed to “universalistic”) solutions to social-role conflicts. Form A
of their questionnaire consisted of four vignettes designed to invoke social role conflict and the
items elicited particularistic or universalistic responses. The four items are reproduced by This-
sen & Steinberg (1988), along with a discussion of the data analysis in this example. In Form B,
the stories were worded so that a friend of the respondent faced the role conflict and items meas-
ured expectations for particularistic or universalistic actions on the part of friends.
Here, we consider the fit of the 2PL model to these data. In the data, the item responses for Form
A are in columns 3-6 (as items 1-4), and the item responses for Form B are in columns 7-10 (as
items 5-8). The trace lines have been fitted with the constraint that the slopes are the same for a
given item on the two forms [using >EQUAL AJ, ITEMS=(5,6,7,8), WITH=(1,2,3,4)], but the
thresholds may vary between forms. The respondents were randomly assigned to the different
forms; therefore we constrained the population means of the two groups to be equal. Because the
mean for group 2 is fixed at zero as an identifiability constraint, this is done by fixing the mean
for group 1 at zero as well. The model fits the data adequately; the goodness-of-fit likelihood ra-
tio statistic is 21.9 on 18 d.f., p = 0.2.
In the output, note that when there are two (or more) groups, MULTILOG prints the observed frequencies and proportions in each response category for the entire sample, but the expected proportions are printed separately for each group (the EXP. PROP. lines in the item summary below).
The command file exampl17.mlg uses the data file exampl17.dat.
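A sketch of what that file contains, based on the description above, follows; the NPATTERNS value, the L2 model keyword (by analogy with the L3 keyword used for the 3PL in Section 12.16), and the form of the FIX command for the group-1 mean are assumptions, and the key and format section is omitted. Selected output from the run is shown after it.

EXAMPL17.MLG -
STOUFFER & TOBY ROLE-CONFLICT ITEMS, FORMS A AND B, 2PL
>PROBLEM RANDOM, PATTERNS, NITEMS=8, NGROUPS=2, NPATTERNS=32,
DATA='EXAMPL17.DAT';
>TEST ALL, L2;
>EQUAL AJ, ITEMS=(5,6,7,8), WITH=(1,2,3,4);
>FIX GROUPS=1, MU, VALUE=0.0;
>END;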
ITEM SUMMARY
FOR GROUP 1:
@THETA: INFORMATION:
-3.0 - -1.6 1.238 1.309 1.402 1.522 1.675 1.868 2.108 2.399
-1.4 - 0.0 2.745 3.148 3.602 4.095 4.590 5.019 3.148 5.316
0.2 - 1.6 5.087 4.667 4.168 3.684 3.262 2.915 2.632 2.397
1.8 - 3.0 2.196 2.020 1.865 1.728 1.608 1.503 1.414
FOR GROUP 2:
@THETA: INFORMATION:
-3.0 - -1.6 1.238 1.309 1.402 1.522 1.675 1.868 2.108 2.399
-1.4 - 0.0 2.745 3.148 3.602 4.095 4.590 5.019 3.148 5.316
0.2 - 1.6 5.087 4.667 4.168 3.684 3.262 2.915 2.632 2.397
1.8 - 3.0 2.196 2.020 1.865 1.728 1.608 1.503 1.414
Note: In this situation, MULTILOG “thinks” there are eight items when, in fact, each respon-
dent answered only four. MULTILOG computes TOTAL TEST INFORMATION and MARGINAL
RELIABILITY assuming that each respondent answered all (eight) items; as a result, these values
are not correct for the four-item tests that were actually administered. MULTILOG cannot know
the difference between real “missing data” and this kind of artificial “missing data.” In situations
like this, the TOTAL TEST INFORMATION and MARGINAL RELIABILITY values printed cannot be
used.
GROUP 1
OBSERVED(EXPECTED) STD. : EAP (S.D.) : PATTERN
RES. : :
20.0( 21.9) -0.42 : 1.27 ( 0.71) : 22220000
9.0( 8.9) 0.04 : 0.83 ( 0.66) : 21220000
6.0( 4.0) 0.99 : 0.31 ( 0.61) : 22120000
2.0( 1.7) 0.24 : 0.57 ( 0.63) : 22210000
2.0( 1.3) 0.66 : 0.21 ( 0.61) : 21210000
...
NEGATIVE TWICE THE LOGLIKELIHOOD= 11.6
(CHI-SQUARE FOR SEVERAL TIMES MORE EXAMINEES THAN CELLS)
GROUP 2
OBSERVED(EXPECTED) STD. : EAP (S.D.) : PATTERN
RES. : :
20.0( 24.8) -0.96 : 1.23 ( 0.71) : 00002222
23.0( 17.4) 1.34 : 0.79 ( 0.66) : 00002122
4.0( 4.1) -0.03 : 0.27 ( 0.61) : 00002212
3.0( 1.9) 0.77 : 0.53 ( 0.63) : 00002221
3.0( 2.5) 0.32 : 0.18 ( 0.60) : 00002121
…
NEGATIVE TWICE THE LOGLIKELIHOOD= 10.3
(CHI-SQUARE FOR SEVERAL TIMES MORE EXAMINEES THAN CELLS)
12.18 Differential item functioning (DIF) analysis of eight items from the 100-
Item Spelling Test
Thissen, Steinberg & Wainer (1993) illustrated the application of a number of likelihood-based
procedures for the detection of Differential Item Functioning (DIF) using a set of data derived
from a conventional orally-administered spelling test, with data obtained from 659 undergradu-
ates at the University of Kansas. A description of these data is given in Section 2.4.1.
The reference group included the male students (N = 285), and the focal group was made up of
the female students (N = 374). The original test had 100 words, but only four (infidelity, pano-
ramic, succumb, and girder) are used here. The words infidelity, panoramic, and succumb were
selected to comprise an “anchor” (a set of items believed to involve no DIF) with information
over a range of the θ -continuum. The word girder is the “studied” item. It was selected because
it shows substantial differential difficulty for the two groups in these data.
Thissen, Steinberg & Wainer (1993) included (in an appendix) a description of the procedures
followed to compute the estimates using MULTILOG version 5. In this section, the same analy-
sis is reproduced using version 7. The item responses for the males are read as items 1-4, and
those for the females as items 5-8.
Syntax for this analysis is in the file exampl18.mlg; it is based on the data in exampl18.dat. Selected output follows.
GROUP 1:
EXP. PROP. 0.2093 0.7907
GROUP 2:
EXP. PROP. 0.2135 0.7865
FOR GROUP 1:
@THETA: INFORMATION:
-3.0 - -1.6 1.556 1.683 1.831 1.999 2.184 2.383 2.589 2.792
-1.4 - 0.0 2.985 3.158 3.301 3.409 3.477 3.504 3.158 3.439
0.2 - 1.6 3.354 3.239 3.099 2.939 2.763 2.577 2.388 2.202
1.8 - 3.0 2.024 1.860 1.714 1.585 1.475 1.383 1.306
FOR GROUP 2:
@THETA: INFORMATION:
-3.0 - -1.6 1.556 1.683 1.831 1.999 2.184 2.383 2.589 2.792
-1.4 - 0.0 2.985 3.158 3.301 3.409 3.477 3.504 3.158 3.439
0.2 - 1.6 3.354 3.239 3.099 2.939 2.763 2.577 2.388 2.202
1.8 - 3.0 2.024 1.860 1.714 1.585 1.475 1.383 1.306
GROUP 1
OBSERVED(EXPECTED) STD. : EAP (S.D.) : PATTERN
RES. : :
GROUP 2
OBSERVED(EXPECTED) STD. : EAP (S.D.) : PATTERN
RES. : :
12.19 Individual scores for a skeletal maturity scale based on graded ratings
of ossification sites in the knee
Roche, Wainer, & Thissen (1975) calibrated 34 “indicators” (items) of skeletal maturity using Samejima’s (1969) graded model; a description of the model and methods used is in Chapter V of that volume. The parameter estimates for the males are used here to “score” (estimate θ = skeletal age) the following data in the file exampl19.dat:
40 1 0.5 2112111111112111111111111111111111
33 1 1.0 3113211111112122111111111111111111
33 1 2.0 4333211111113122111111111111111111
29 1 3.0 4543211111113122111111011111111111
8 1 5.0 5553211211114222112111111111323111
10 1 6.0 5553211211115322121111121111323011
23 1 7.0 5553211311115322111001111111323011
26 1 8.0 5553212221115322221001221111423211
35 1 9.0 5553211321115422222111211111423111
10 1 12.0 5553212321115522222111222111523021
23 1 14.0 5553210320115522022100220011523121
24 1 16.0 5553222323105522222222222221523222
46 1 18.0 5553222323025522222202222202523224
The parameters for the 34 indicators are in a file called exampl19.prm. This file was produced
by MULTILOG in a (previous) calibration run. Note that the parameters in the Roche et al.
(1975) table (in which the thresholds are called τ and the slopes are called d) are in years, instead
of the usual standard units, so the results appear in years.
The MULTILOG command file includes instructions to SCORE INDIVIDUAL data on the PROBLEM
command, as well as to use no population distribution, because RWT skeletal ages are not nor-
mally computed using a population distribution. We also use CRITERION, which instructs
MULTILOG to read the chronological age of each individual to use as a starting value for the
iterative modal estimation procedure. The first ten characters (NCHARS=10) on each record are
read as an identification field; using T-format, the age in that field is also read later as the
CRITERION. The “test” has varying numbers of response categories for the 34 indicators, which
are entered in the NC list on the TEST command. The command file exampl19.mlg is shown be-
low. To see how to generate this command file using the syntax wizard, see Section 4.3.1.
EXAMPL19.MLG -
ESTIMATION OF SKELETAL MATURITY BY THE RWT (1975) METHOD
>PROBLEM SCORE, INDIVIDUAL, CRITERION, NEXAMINEES=13, NITEMS=34, NCHARS=10,
DATA=‘EXAMPL19.DAT’;
>TEST ALL, GRADED,
NC=(5,5,5,3,2,2,2,3,2,3,3,3,5,5,2(0)12,3,3,5,2,3,2,2,4);
>START ALL, FORMAT, PARAM=‘EXAMPL19.PRM’;
>SAVE;
>END;
6
123450
1111111111111111111111111111111111
2222222222222222222222222222222222
3333333333333333333333333333333333
4444444444444444444444444444444444
5555555555555555555555555555555555
0000000000000000000000000000000000
(10A1,1X,34A1,T7,F4.0)
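Read field by field, the variable format statement at the end of the file corresponds to the description above (the column assignments follow from the field widths):

(10A1,   the NCHARS=10 identification characters, columns 1-10
 1X,     skip column 11
 34A1,   the graded ratings of the 34 indicators, columns 12-45
 T7,     tab back to column 7, inside the ID field
 F4.0)   re-read the chronological age there as the CRITERION starting value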
The RWT estimates of skeletal age are modal estimates of θ , labeled THETAHAT on the last page
of the MULTILOG output. Their estimated standard errors, the number of iterations required to
compute each, and the contents of the ID field are also printed there. When using MULTILOG,
modal estimates of θ are always computed in this way, in a subsequent run after the item pa-
rameters have been estimated. Frequently, several item analysis runs are required with a set of
item-response data before a satisfactory set of item parameters is obtained; only then is it useful
to score the individuals. Selected output for this run follows.
SCORING DATA...
13 TESTFACT examples
13.1 Classical item analysis and scoring on a geography test with an external
criterion
The geography test discussed in this example consists of 20 items. The total score on the test is used as the criterion score. The topics tested are indicated by the item names on the NAMES command shown below (miscellaneous, erosion, structure, minerals, agriculture, climate, and population).
This example illustrates the running of stacked problems. The two problems use the same data,
but with different variable format statements. The same data are found in two identical data files,
exampl01.da1 and exampl01.da2. The reason for the duplication is that, in the case of stacked
problems, the same data file cannot be opened more than once during the analysis. The first ten
lines of the data files exampl01.da1 and exampl01.da2 are shown below.
1201903390B32325251253531212145 62531
2201903400B12223111431231122312 02535
3201903410B12123432542455323111 92231
4201903420B15323121415431524135 91827
5201903430B43123221153531522151 81220
6201903440B45124321343431512313101121
7201903450B14523224514521123411 81826
8201903460B45125422444211421213 51217
9201903470B34423221541453322131122638
10201903480B44423525451431313114121628
The persons sitting for the test are classified by sex, with “G” denoting a girl, and “B” a boy. Col-
umns 1 to 3 inclusive contain the case identification, while the gender classification is given in
column 13. These fields are denoted by “3A1” and “A1” in the variable format statement. Note
that the “X” operator is used to skip from column 3 to column 13. The width of the case identifi-
cation field is also indicated by the NIDCHAR keyword on the INPUT command.
(3A1,9X,A1,20A1,F2.0)
The item responses are given in columns 14 to 33 and are represented by “20A1” in the format
statement. Finally, the criterion score is given in columns 34 and 35. Note that this score is read
as a real number with format “F2.0”.
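Putting the pieces of the first variable format statement together with the column layout just described:

(3A1,   case identification, columns 1-3
 9X,    skip columns 4-12
 A1,    gender classification ("G" or "B"), column 13
 20A1,  the 20 item responses, columns 14-33
 F2.0)  the criterion score, columns 34-35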
The variable format statement for the second problem is

(3A1,9X,1X,20A1)

and is the same as for the first analysis, except for the omission of the criterion score specification and of the gender classification.
In the first problem, an external criterion score is used. The PROBLEM command specifies that 20
items, with 6 responses each, are to be analyzed in the first problem (NITEM=20 and
RESPONSE=6). To obtain estimated item statistics for the two gender groups, the responses are
divided into two classes (CLASS=2) and the definition of the two classes is given in the CLASS
command. The INPUT command indicates that the data are in the external data file exampl01.da1
(FILE keyword) and that it consists of scores (SCORES option).
The external criterion score used is a score input with item responses (CRITMARK option on the
CRITERION command) named “TWENTY” (NAME keyword). By specifying the ALPHA option on the
RELIABILITY command, the calculation of coefficient alpha is requested. Alternatively, the
Kuder-Richardson formula 20, which is the default reliability measure, may be requested using
the KR20 option.
The PLOT command requests line plots of the point biserial coefficient (PBISERIAL option) as
discrimination index and with discriminating power with respect to the external criterion
(CRITERION option). The measure of item difficulty is plotted in terms of the item facility (per-
cent correct; default FACILITY option).
Note that the use of the CONTINUE command in the case of stacked problems is optional.
In the second part of exampl01.tsf the geography test is split into 2 subtests. This is indicated by
the use of the SUBTEST keyword on the PROBLEM command and the SUBTEST command in which
the BOUNDARY keyword is used to indicate that the 12th item is the last item in the first subtest,
and the 20th item is the last item in subtest 2. The subtests are named using the NAME keyword on this command; as the names indicate, the first subtest is composed of items testing recall, and the second of items testing analysis.
The reordering of the items is indicated by the SELECT keyword on the PROBLEM command. The
reordering is specified on the SELECT command, which lists the items in the order in which they
are to be used.
The fractile option is used to investigate the behavior of items across the ability spectrum. The
FRACTILES command is used to group scores into fractiles by score boundaries (SCORES option).
The boundaries, consisting of the cumulative upper scores on the test bands, are defined using
the BOUNDARY keyword on the FRACTILES command. The FRACTILES keyword on the PROBLEM
command indicates that 3 fractiles will be used for score divisions. The INPUT command indi-
cates that, as in the first analysis, scores are used as input. In addition, the LIST option requests
the listing, for all subjects, of the identification, main and subtest scores in the output file.
Each TESTFACT run produces output under headings labelled Phase 0 to Phase 7. The Phase 1
to Phase 4 output contains data description, plots, basic statistics, and item statistics. These are
discussed in detail in Section 13.4.1. In the present example, the Phase 1 to Phase 4 output is suppressed by setting the SKIP keyword on the PROBLEM command to 1. Phase 5 output
provides information about tetrachoric correlations, while Phase 6 and 7 output are only pro-
duced if a FACTOR or BIFACTOR command is present in the command file.
>TITLE
EXAMPL01.TSF- GEOGRAPHY TEST WITH EXTERNAL CRITERION SCORE
ITEM AND TEST STATISTICS
>PROBLEM NITEM=20,RESPONSE=6,CLASS=2;
>NAMES MISCELL1,MISCELL2,EROSION1,EROSION2,EROSION3,
STRUCTU1,MINERAL1,MINERAL2,MINERAL3,AGRICUL1,
MISCELL3,STRUCTU2,EROSION4,CLIMATE1,CLIMATE2,
MINERAL4,AGRICUL2,AGR,POPULAT1,STRUCTU3;
>RESPONSE ‘0’,’1’,’2’,’3’,’4’,’5’;
>KEY 14423321441435112111;
>CLASS IDEN=(G,B),NAME=(GIRLS,BOYS);
>CRITERION CRITMARK, NAME=’TWENTY’;
>RELIABILITY ALPHA;
>PLOT PBISERIAL,CRITERION,FACILITY;
>INPUT NIDCHAR=3,SCORES,FILE=‘EXAMPL01.DA1’;
(3A1,9X,A1,20A1,F2.0)
>TITLE
GEOGRAPHY TEST SPLIT INTO 2 SUBTESTS AND USE OF FRACTILES
ITEMS REORDERED
>PROBLEM NITEM=20,RESPONSE=6,SELECT=20,SUBTEST=2,FRACTILES=3;
>NAMES MISCELL1,MISCELL2,EROSION1,EROSION2,EROSION3,
STRUCTU1,MINERAL1,MINERAL2,MINERAL3,AGRICUL1,
MISCELL3,STRUCTU2,EROSION4,CLIMATE1,CLIMATE2,
MINERAL4,AGRICUL2,AGR,POPULAT1,STRUCTU3;
>RESPONSE '0','1','2','3','4','5';
>KEY 14423321441435112111;
>SELECT 3,4,7(1)12,16(1)19,1,2,5,6,13,14,15,20;
>SUBTEST BOUNDARY=(12,20),NAME=(RECALL,ANALYSIS);
>FRACTILE SCORE,BOUNDARY=(7,13,20);
>INPUT NIDCHAR=3,SCORES,LIST,FILE='EXAMPL01.DA2';
(3A1,9X,1X,20A1)
>STOP
Portions of the Phase 5 output are shown below. The first part of the output contains, for each
selected item, the number of cases, % correct, % omitted, % not reached and % not-presented.
The summary shows that 2.5% of the respondents omitted item number 6.
13.2 Two-factor non-adaptive full information item factor analysis of a five-item test
In this example, a non-adaptive full information item factor analysis is performed on 5 items,
with 3 responses each, from the LSAT data (Bock & Lieberman, 1970, Section 7). The number of
items and responses are indicated by the NITEM and RESPONSE keywords on the PROBLEM
command. The data are in the file exampl02.dat. Each record contains a case identification in the
first two columns, the five item responses in columns 3 to 7, and a response-pattern frequency in
columns 11 to 13. The variable format statement
(2A1,5A1,3X,I3)
lists these three fields in the same order, and the "X" operator is used to skip from column 7 to
column 11.
The INPUT command indicates that item scores are used as input (SCORES option) and that each
data record starts with an identification field 2 characters in length (NIDCHAR=2). The WEIGHT
keyword is set to PATTERN to indicate that each data record consists of an answer pattern with a
frequency. Note that the frequency is read as an integer (I3) in the variable format statement.
The three responses are listed on the RESPONSE command, while the KEY command indicates that
a “1” is the correct response to all 5 items. By default, the RECODE option will be used on the
TETRACHORIC command, and thus all omits will be recoded as wrong responses.
The TETRACHORIC command specifies details concerning the tetrachoric correlation matrix. Co-
efficients will be printed to 3 decimal places (NDEC=3) and the matrix of tetrachoric correlations
will appear in the printed output (LIST option). This matrix may also be saved to an external file
if the CORRELAT option is included on the (optional) SAVE command.
The FACTOR and FULL commands are used to specify parameters for the full information item
factor analysis. Two factors and 3 latent roots are to be extracted, as indicated by the NFAC and
NROOT keywords respectively. A PROMAX rotation is requested. Note that this keyword may not be
abbreviated in the FACTOR command. The residual correlation matrix will be computed as the
initial correlation matrix minus the final correction matrix (RESIDUAL option). An f-factor posi-
tive definite estimate of the latent response process correlation matrix will be computed (SMOOTH
option). This option affects only the output of the final smoothed correlation matrix. A maximum
of 20 EM cycles will be performed (CYCLES keyword on the FULL command).
The NOADAPT option on the TECHNICAL command specifies that non-adaptive quadrature should
be used to obtain the full information solution. Note that, if NFAC > 5, the presence of this option
will be ignored and adaptive fractional quadrature will be performed.
The smoothed correlation matrix, rotated factor loadings and item parameters are saved to exter-
nal files (exampl02.smo, exampl02.rot and exampl02.par respectively) using the SMOOTH,
ROTATE and PARM options on the SAVE command.
>TITLE
EXAMPL02.TSF- LSAT DATA NON-ADAPTIVE FULL INFORMATION ITEM FACTOR ANALYSIS
COUNTED RESPONSE PATTERNS
>PROBLEM NITEM=5,RESPONSE=3;
>NAMES ITEM1,ITEM2,ITEM3,ITEM4,ITEM5;
>RESPONSE '8','0','1';
>KEY 11111;
>TETRACHORIC NDEC=3,LIST;
>FACTOR NFAC=2,NROOT=3,ROTATE=PROMAX,RESIDUAL,SMOOTH;
>FULL CYCLES=20;
>TECHNICAL NOADAPT;
>SAVE SMOOTH,ROTATE,PARM;
>INPUT NIDCHAR=2,SCORES,WEIGHT=PATTERN,FILE='EXAMPL02.DAT';
(2A1,5A1,3X,I3)
>STOP;
13.3 One-factor non-adaptive full information item factor analysis of the five-item test
In this example, the LSAT data of Section 13.2 are analyzed assuming a one-factor model. The
purpose of the analysis is to compare the goodness-of-fit with that of the two-factor model, and
to use the change in $\chi^2$ between the models as a test of statistical significance of the second
factor. The computation of classical item statistics is skipped (SKIP=1), and the factor loadings are
not rotated or saved.
>TITLE
EXAMPL03.TSF- LSAT DATA NON-ADAPTIVE FULL INFORMATION ITEM FACTOR ANALYSIS
TEST OF FIT
>PROBLEM NITEM=5,RESPONSE=3,SKIP=1;
>NAMES ITEM1,ITEM2,ITEM3,ITEM4,ITEM5;
>RESPONSE '8','0','1';
>KEY 11111;
>TETRACHORIC NDEC=3,LIST;
>FACTOR NFAC=1,NROOT=3;
>FULL CYCLES=16;
>TECHNICAL NOADAPT;
>INPUT NIDCHAR=2,SCORES,WEIGHT=PATTERN,FILE='EXAMPL02.DAT';
(2A1,5A1,3X,I3)
>STOP;
13.4 A three-factor adaptive item factor analysis with Bayes (EAP) estimation
of factor scores: 32 items from an activity survey
This example analyzes 32 items selected from the 48-item version of the Jenkins Activity Survey
for Health Prediction, Form B (Jenkins, Rosenman, & Zyzanski, 1972). The data are responses
of 598 men from central Finland drawn from a larger survey sample. Most of the items are rated
on three-point scales representing little or no, occasional, or frequent occurrence of the activity
or behavior in question. For purposes of the present analysis, the scales have been dichotomized
near the median. Wording in the positive or negative direction varies from item to item as fol-
lows (item numbers are those of the original pool of items from which those of the present form
were selected):
-Q156,-Q157,+Q158,-Q165,-Q166,-Q167,+Q247,+Q248,-Q249,-Q250,+Q251,
+Q252,+Q253,+Q254,+Q255,+Q256,+Q257,-Q258,-Q259,+Q260,+Q261,+Q262,
+Q263,+Q264,+Q265,-Q266,+Q267,+Q268,+Q269,+Q270,+Q271,+Q272,-Q273,
-Q274,-Q275,+Q276,+Q277,+Q278,-Q279,-Q280,+Q307,+Q308,+Q309,+Q310,
+Q311,-Q312,-Q313,-Q314.
The first 7 lines of the data file exampl04.dat are shown below.
201000220122112221022212202112211101122112222000
001221211011100111111111111110111102211111211020
0010.02100222122021221222112112212.0011111222001
002020220212012120011112112221221022211111222202
201000221000211221221112012211122112211111222000
001001221022011120022222212222211101121112222101
102100111022112120021212212221121212111022200021
The first 10 columns of each record are used as case identification and are read first. Starting
again in the first column by using the “T” operator, the responses to the 48 items are read as sin-
gle fields (48A1).
(10A1,T1,48A1)
The SELECT keyword on the PROBLEM command indicates that 32 items are selected from the
original 48 items. The SELECT command provides the selected items in the order in which they
will be used. The RESPONSE command lists the 5 responses indicated on the PROBLEM command
(RESPONSE keyword) and the KEY command provides the correct responses for each of the 48
items. The NOTPRESENTED option on the PROBLEM command is required if one of the response
codes identifies not-presented items. The “.” code on the RESPONSE command identifies these
responses.
The TETRACHORIC command requests the printing of the coefficients to 3 decimal places
(NDEC=3) in the printed output file (LIST option). The tetrachoric correlation matrix, item
parameters, rotated factor loadings, and the factor scores will be saved in the files exampl04.cor,
exampl04.par, exampl04.rot, and exampl04.fsc, respectively, as specified on the SAVE
command.
The FACTOR and FULL commands are used to specify parameters for the full information item
factor analysis. Three factors and ten latent roots are to be extracted, as indicated by the NFAC
and NROOT keywords respectively. A VARIMAX rotation is requested. Note that this keyword may
not be abbreviated in the FACTOR command. A maximum of 80 EM cycles will be performed
(CYCLES keyword on the FULL command). The convergence criterion for the EM cycles is given
by the PRECISION keyword on the TECHNICAL command.
Cases will be scored by EAP (Expected A Posteriori, or Bayes) estimation with adaptive quad-
rature (METHOD=2 on the SCORE command). Posterior standard deviations will also be computed.
Results will be saved in the exampl04.fsc file (FSCORES option on the SAVE command). The fac-
tor scores for the first 20 cases will be listed in the output file (LIST=20). See Section 13.5 for
MAP (Maximum A Posteriori, or Bayes Modal) estimation for the same cases.
>TITLE
EXAMPL04.TSF-ITEMS FROM THE JENKINS ACTIVITY SURVEY
ADAPTIVE ITEM FACTOR ANALYSIS AND FACTOR SCORE ESTIMATION
>PROBLEM NITEMS=48,SELECT=32,RESPONSES=5,NOTPRESENTED;
>NAMES Q156,Q157,Q158,Q165,Q166,Q167,Q247,Q248,Q249,Q250,Q251,Q252,
Q253,Q254,Q255,Q256,Q257,Q258,Q259,Q260,Q261,Q262,Q263,Q264,
Q265,Q266,Q267,Q268,Q269,Q270,Q271,Q272,Q273,Q274,Q275,Q276,
Q277,Q278,Q279,Q280,Q307,Q308,Q309,Q310,Q311,Q312,Q313,Q314;
>RESPONSE '8','0','1','2','.';
>KEY 002000220022222220022222202222220002220022222000;
>SELECT 3,5,6,7,9,11(1)14,17(1)23,25(1)30,32,33,35,36,39(1)42,47,48;
>TETRACHORIC LIST, NDEC=3;
>FACTOR NFAC=3,NROOT=10,ROTATE=VARIMAX;
>FULL CYCLES=80;
>TECHNICAL PRECISION=0.005;
>SCORE METHOD=2,LIST=20;
>SAVE CORR,PARM,FSCORE,ROTATE;
>INPUT NIDCHAR=10,SCORES,FILE='EXAMPL04.DAT';
(10A1,T1,48A1)
>STOP
The first part of the output lists the name of the command file (exampl04.tsf) and the name of
the output file (exampl04.out). Each TESTFACT run produces output under one or more of the
headings Phase 0 to Phase 7, depending on the type of analysis.
The analysis specified in exampl04.tsf produces Phase 0, Phase 1, Phase 2, Phase 5, and Phase 7
output.
Regardless of the type of analysis, a Phase 0 output is produced, being an echo of the input com-
mands in the *.tsf file.
-Q156,-Q157,+Q158,-Q165,-Q166,-Q167,+Q247,+Q248,-Q249,-Q250,
+Q251,+Q252,+Q253,+Q254,+Q255,+Q256,+Q257,-Q258,-Q259,+Q260,+Q261,
+Q262,+Q263,+Q264,+Q265,-Q266,+Q267,+Q268,+Q269,+Q270,+Q271,+Q272,
-Q273,-Q274,-Q275,+Q276,+Q277,+Q278,-Q279,-Q280,+Q307,+Q308,+Q309,
+Q310,+Q311,-Q312,-Q313,-Q314.
>NAMES Q156,Q157,Q158,Q165,Q166,Q167,Q247,Q248,Q249,Q250,Q251,Q252,
Q253,Q254,Q255,Q256,Q257,Q258,Q259,Q260,Q261,Q262,Q263,Q264,
Q265,Q266,Q267,Q268,Q269,Q270,Q271,Q272,Q273,Q274,Q275,Q276,
Q277,Q278,Q279,Q280,Q307,Q308,Q309,Q310,Q311,Q312,Q313,Q314;
>RESPONSE '8','0','1','2','.';
>KEY 002000220022222220022222202222220002220022222000;
>SELECT 3,5,6,7,9,11(1)14,17(1)23,25(1)30,32,33,35,36,39(1)42,47,48;
>TETRACHORIC LIST, NDEC=3;
>FACTOR NFAC=3,NROOT=10,ROTATE=VARIMAX;
>FULL CYCLES=80;
>TECHNICAL PRECISION=0.005;
>SCORE METHOD=2,LIST=20;
>SAVE CORR,PARM,FSCORE,ROTATE;
>INPUT NIDCHAR=10,SCORES,FILE='EXAMPL04.DAT';
DATA FORMAT=
(10A1,T1,48A1)
Values of the response categories (8, 0, 1, 2, .), the answer key, contents of the first observation,
the sum of weights and number of records are given. This information enables you to verify that
the data values were read correctly from the data file exampl04.dat. The response categories in-
dicate a code of “8” for omitted responses (first value) and a code of “.” for not-presented items
(last value).
Thirty-two items were selected from the 48-item test. Based on the answer key values, a total
score for each of the 598 respondents is computed. Each item has a set of responses: right,
wrong, omit, or not-presented. For item $j$, $j = 1, 2, \ldots, 32$, the response of person $i$,
$i = 1, 2, \ldots, 598$, can be written as $x_{ij} = 1$ for a correct response and $x_{ij} = 0$ otherwise.
At your option, omitted items can be considered either wrong or not presented. The total test
score $X_i$ for person $i$ is

$$X_i = \sum_{j=1}^{32} x_{ij}.$$
Respondent 1, for example, has a total score of 19 correct out of a possible 32 as shown below.
Answer key:
20020222220022222022222002002200
Respondent 1:
10020221121022212021121101211200
Using this information, a frequency table of the score distribution is calculated and presented
graphically.
FREQUENCY :
|
|
| **
| ****
| *****
8.0+ *****
| *****
| *****
| ***** *
| * ***** *
| *********
| **********
| ***********
| ***********
| ***********
4.0+ ***********
| ***********
| *************
| **************
| **************
| **************
| ***************
| ****************
| *******************
| *******************
0.0+-----+----+----+----+----+----+----+----+----+----+----+----+----+--
0. 5. 10. 15. 20. 25. 30.
SCORES
The last portion of the Phase 1 output gives the mean (15.9) and standard deviation (4.0) of the
Total Scores. The overall proportion of correct responses is

$$\bar{p} = \sum_{i=1}^{598} \sum_{j=1}^{32} x_{ij} / (32 \times 598) = 0.497,$$

so that $\sqrt{\bar{p}(1-\bar{p})} = 0.5$.
For each item, eight statistics are produced. The Number, Mean and S.D. for item 2, for example,
are 590, 15.92, and 4.03 respectively. These values are obtained by “deleting” each row of the
data if a not-presented code is encountered for item 2. Since 8 rows contain not-presented codes,
the mean and standard deviation of the Total Scores is calculated for the remaining 590 cases.
Note, for example, that item 1 was presented to all 598 persons, while item 4 was presented to
592 persons.
The mean score for those subjects who get a specific item correct is denoted by RMEAN. For ex-
ample, since 385 respondents selected the correct response for item 2, RMEAN for item 2 is calcu-
lated as the mean of the corresponding 385 Total Scores and equals 17.13.
The item facility (FACILITY) is the proportion of correct responses to a specific item. For example,
385 of the 590 respondents presented with item 2 selected the correct response, and hence the
item 2 facility equals $385/590 = 0.653$. A related index of item difficulty is the delta statistic,

$$\Delta = -4\Phi^{-1}(p) + 13,$$

where $p$ is the item facility and $\Phi^{-1}$ denotes the inverse normal transformation. This statistic has
an effective range of 1 to 25, with a mean and standard deviation of 13 and 4 respectively.
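A small Python sketch of the facility-to-delta conversion described above (SciPy supplies the inverse normal transformation; the item 2 values are taken from the text):

from scipy.stats import norm

def delta(p):
    """Delta difficulty index: mean 13, SD 4, effective range 1 to 25."""
    return -4.0 * norm.ppf(p) + 13.0

p2 = 385 / 590                              # item 2 facility from the example
print(round(p2, 3), round(delta(p2), 2))    # an easy item has delta below 13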
The last two statistics are the biserial (BIS) and point biserial (P.BIS) correlations. The point
biserial correlation is the correlation between the item score and the total (or subtest) score; the
sample formula is

$$\text{P.BIS} = \frac{\bar{X}_R - \bar{X}}{s_X}\sqrt{\frac{p}{1-p}},$$

where $\bar{X}_R$ is the mean total score of the respondents answering the item correctly (RMEAN),
$\bar{X}$ and $s_X$ are the mean and standard deviation of the total scores, and $p$ is the item facility.
Theoretically $-1 \le \text{P.BIS} \le 1$, but in practice $-0.20 \le \text{P.BIS} \le 0.75$. A value of
0.467 therefore indicates a relatively strong association between item 8 and the Total Score.
The sample biserial correlation coefficient, BIS, is calculated as

$$\text{BIS} = \frac{\bar{X}_R - \bar{X}}{s_X}\cdot\frac{p}{h(p)},$$

where $h(p)$ is the ordinate of the standard normal density at $z_p$, the normal deviate
corresponding to the item facility. Consider, for example, the item 3 facility, which equals 0.790.
From the inverse normal tables, this corresponds to a $z_p$-value of 0.8062, and hence

$$h(\text{facility}) = \frac{1}{\sqrt{2\pi}}\exp\left(-\tfrac{1}{2}z_p^2\right) = 0.399 \times 0.723 = 0.29.$$

For item 3, the biserial correlation is therefore the point biserial correlation multiplied by
$\sqrt{p(1-p)}/h(p) = \sqrt{0.790 \times 0.210}/0.29 = 1.40$.
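Both coefficients can be computed directly from the raw scores. The sketch below implements the standard sample formulas stated above in Python (NumPy/SciPy); the function and variable names are illustrative only:

import numpy as np
from scipy.stats import norm

def item_total_correlations(x, total):
    """Point biserial and biserial correlations between a 0/1 item
    score x and the total score, using the formulas above."""
    x, total = np.asarray(x, float), np.asarray(total, float)
    p = x.mean()                         # item facility
    rmean = total[x == 1].mean()         # RMEAN: mean score of correct responders
    mean, sd = total.mean(), total.std(ddof=1)
    h = norm.pdf(norm.ppf(p))            # normal ordinate at z_p
    pbis = (rmean - mean) / sd * np.sqrt(p / (1 - p))
    bis = (rmean - mean) / sd * p / h
    return pbis, bis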
The first part of the output contains, for each selected item, the Number of Cases, Percent
Correct, Percent Omitted, Percent Not Reached and Percent Not Presented.
This summary indicates that there were no omitted codes in the data and that all 598 respondents
could complete the test. The percent Not Presented varies from 0.0 to a maximum of 1.3 for
item 2. For item 2, this percentage is calculated as
$$\frac{598 - 590}{598} \times 100 = 1.3\%.$$
Note that the Percent Correct is calculated here as the number of respondents who selected the
correct answer, divided by the total number of cases. For item 2
$$\text{PERCENT CORRECT} = \frac{385}{598} \times 100 = 64.38\%.$$
This value differs from the facility estimate (385/590) given under Phase 2 of the output.
The tetrachoric correlation coefficient is widely used as a measure of association between two
dichotomous items. Tetrachoric correlations are obtained by hypothesizing, for each item, the
existence of a continuous “latent” variable underlying the “right-wrong” dichotomy imposed in
scoring. It is additionally hypothesized that, for each pair of items, the corresponding two con-
tinuous “latent” variables have a bivariate normal distribution.
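TESTFACT computes these coefficients internally. As an illustration of the definition (and not of TESTFACT's algorithm), the following Python sketch recovers the tetrachoric correlation of a single 2 × 2 table by finding the bivariate normal correlation that reproduces the observed both-correct proportion:

import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import brentq

def tetrachoric(table):
    """table = [[n00, n01], [n10, n11]]: rows are item 1 wrong/right,
    columns are item 2 wrong/right."""
    t = np.asarray(table, float)
    n = t.sum()
    p1, p2 = t[1].sum() / n, t[:, 1].sum() / n        # proportions correct
    tau1, tau2 = norm.ppf(1 - p1), norm.ppf(1 - p2)   # latent thresholds
    p11 = t[1, 1] / n                                 # both items correct

    def both_correct(rho):
        # P(Z1 > tau1, Z2 > tau2) = Phi2(-tau1, -tau2; rho) by symmetry
        mvn = multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]])
        return mvn.cdf(np.array([-tau1, -tau2]))

    return brentq(lambda r: both_correct(r) - p11, -0.99, 0.99)

print(round(tetrachoric([[40, 20], [15, 45]]), 3))    # a positive association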
1 2 3 4 5 6
Q158 Q166 Q167 Q247 Q249 Q251
1 Q158 1.000
2 Q166 -0.383 1.000
3 Q167 -0.145 0.124 1.000
4 Q247 -0.535 0.368 0.054 1.000
5 Q249 0.106 -0.019 0.016 -0.161 1.000
6 Q251 -0.065 0.017 0.019 0.016 -0.126 1.000
...
[Schematic 2 × 2 tables of right (R), wrong (W), and omitted (O) responses for pairs of items appear here.]
The average tetrachoric correlation equals 0.0654. Since the output has both negative and posi-
tive correlation coefficients, the average value does not shed much light on the actual strength of
association between item pairs. Note that tetrachoric correlation matrices are not necessarily
positive definite.
By definition, a symmetric matrix is positive definite if all its characteristic roots are positive.
From the output below, it is seen that only the first 31 of the 32 roots are positive, and therefore
the 32 × 32 matrix of tetrachoric correlations is not positive definite. This problem can be cor-
rected by replacing the negative roots of the matrix by zero or a small non-zero quantity.
1 2 3 4 5 6
1 7.491350 3.442602 2.592276 1.745235 1.576302 1.442306
7 8 9 10 11 12
1 1.248438 1.118638 1.015248 0.971235 0.908476 0.835705
13 14 15 16 17 18
1 0.768426 0.719607 0.657375 0.638227 0.631485 0.555802
19 20 21 22 23 24
1 0.514488 0.461871 0.398661 0.375292 0.349726 0.312994
25 26 27 28 29 30
1 0.292964 0.243591 0.218973 0.183170 0.167582 0.117183
31
1 0.055375
Display 3: Number of items and sum of latent roots and their ratio
This section of the output shows the sum of positive roots and the ratio with which each root has
to be multiplied to obtain a sum of “corrected roots” which equals the number of items. To illus-
trate, consider a 5 × 5 correlation matrix with latent roots 3, 1, 0.8, 0.3, and –0.1. The sum of the
roots equals 5. In general, for any correlation matrix based on n items, the sum of roots equals n.
Suppose the value of –0.1 is replaced by 0.0001, then the new sum of roots equals 5.1001. How-
ever, by multiplying each root by the ratio 5/5.1001 = 0.9804, a “corrected” set of roots is ob-
tained in the sense that their sum equals 5.
From the Display 3 part of the output, the ratio required to obtain a corrected set of latent roots
equals 0.9984211. The corrected set is given under the Display 4 heading.
1 2 3 4 5 6
1 7.479522 3.437167 2.588184 1.742479 1.573814 1.440029
...
A tetrachoric correlation matrix is not necessarily positive definite and in TESTFACT it is re-
placed by a so-called smoothed inter-item correlation matrix. For the reader familiar with matrix
algebra, a short description of the smoothing procedure follows.
Write the tetrachoric correlation matrix of the $n = 32$ items as

$$R = VDV',$$

where $D$ is a diagonal matrix whose diagonal elements are the characteristic roots of $R$. As
mentioned previously, if all roots are positive, that is, all the diagonal elements of $D$ are positive,
$R$ is a positive definite matrix. When this is not the case, a "smoothed" correlation matrix $R^*$
may be obtained by replacing the elements of $D$ with the corrected roots, negative roots being
replaced by either 0 or some small positive quantity, so that

$$R^* = VD^*V',$$
where the columns of V are eigenvectors and the elements of D* the corrected latent roots. The
elements of the smoothed correlation matrix for the first 6 of the 32 items are given below.
1 2 3 4 5 6
Q158 Q166 Q167 Q247 Q249 Q251
1 Q158 1.000
2 Q166 -0.383 1.000
3 Q167 -0.145 0.124 1.000
4 Q247 -0.534 0.368 0.054 1.000
5 Q249 0.106 -0.019 0.016 -0.161 1.000
6 Q251 -0.066 0.017 0.019 0.016 -0.126 1.000
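A compact NumPy sketch of the smoothing procedure as just described: replace non-positive roots by a small quantity, rescale the roots so that their sum again equals the number of items, and rebuild the matrix.

import numpy as np

def smooth_correlation(R, floor=0.0001):
    """Eigenvalue smoothing of a (possibly non positive definite)
    correlation matrix R, following the steps in the text."""
    n = R.shape[0]
    roots, V = np.linalg.eigh(R)                  # R = V D V'
    roots = np.where(roots <= 0, floor, roots)    # replace negative roots
    roots *= n / roots.sum()                      # "corrected" roots sum to n
    return V @ np.diag(roots) @ V.T               # R* = V D* V'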
A communality is defined as the squared multiple correlation between an observed variable and
the set of factors. The output below shows the estimated communalities for iterations 1, 2, 3, and
4. Note the small changes in the estimated values going from iteration 3 to iteration 4.
At iteration 1, the squared multiple correlation of an item with all other items is calculated for
each of the 32 items. The MINRES method (see Display 7) is subsequently used to obtain post-
solution improvements to these initial multiple regression communality estimates.
1 2 3 4
1 Q158 0.413 0.373 0.371 0.371
2 Q166 0.370 0.325 0.323 0.322
3 Q167 0.156 0.116 0.115 0.115
4 Q247 0.516 0.471 0.466 0.465
5 Q249 0.142 0.088 0.087 0.087
6 Q251 0.351 0.269 0.257 0.255
…
31 Q313 0.477 0.422 0.415 0.414
32 Q314 0.458 0.396 0.387 0.386
TESTFACT uses the minimum squared residuals (MINRES) method to extract factors from the
smoothed correlation matrix $R^*$.
Let $e_{ij}$ denote the difference between a smoothed correlation coefficient $r_{ij}^*$ and the
corresponding estimated correlation coefficient $p_{ij}$. These estimated coefficients are functions
of the factor loadings and unique variances. The MINRES method minimizes the residual sum of
squares, $\sum e_{ij}^2$, using ordinary least squares. A more technical description, which may be
skipped, follows.

The MINRES method minimizes the sum of squares of the residuals in the matrix $\Delta$, where

$$\Delta_{p \times p} = R^* - (\Lambda\Lambda' + D_u),$$

$\Lambda$ is a $p \times k$ matrix of common factor loadings, and the diagonal elements $u_{ii}$ of $D_u$
are the unique variances, $i = 1, 2, \ldots, p$. If $\rho_i^2$ denotes the communality of item $i$, then
$u_{ii}$ equals $1 - \rho_i^2$.
The sum of squares of the residuals is expressed as a statistical function (see, e.g., Tucker &
MacCallum, 1997), which is minimized by the determination of the matrix of factor loadings
$\Lambda$ and the unique variances $D_u$.
In this part of the output, the NROOT largest roots of the matrix $R^* - D_u$ are reported. Note that,
since $u_{ii}$ equals $1 - \rho_i^2$, the characteristic roots are actually obtained from the smoothed
correlation matrix with the unit diagonal elements replaced by the communalities. In general, the
matrix $R^* - D_u$ will be non-positive definite and hence a subset of the roots will be negative.
If one replaces NROOT=10 in the FACTOR command with, for example, NROOT=20, the output
shows that roots with numbers 16, 17, 18 and higher are all negative. An empirical rule for the
selection of the number of factors, k, is to set k equal to the number of latent roots larger than 1.
For the present example it appears as if 3 or 4 factors are appropriate. Usually, the number of
factors is selected on the basis of some theoretical framework concerning the items included in
the analysis.
1 2 3 4 5 6
1 6.886994 2.861018 1.961481 1.149766 0.934423 0.738751
7 8 9 10
1 0.582337 0.423875 0.326571 0.270941
The estimated factor loadings at convergence of the MINRES method are given below. These
values are used to obtain starting values for the marginal maximum likelihood procedure speci-
fied in the FULL (full information) command.
Note that each communality is equal to the sum of squares of the corresponding factor loadings.
For example, for item 12 the three factor loadings are 0.406, 0.275, and 0.555. Hence the item 12
communality equals $0.406^2 + 0.275^2 + 0.555^2 = 0.548$.
1 2 3
1 Q158 -0.579 0.189 0.022
2 Q166 0.519 -0.230 -0.001
3 Q167 0.246 0.215 -0.091
4 Q247 0.535 -0.420 -0.049
5 Q249 -0.152 -0.022 -0.251
6 Q251 0.250 0.245 0.364
...
31 Q313 0.431 -0.478 -0.018
32 Q314 0.338 -0.511 0.105
The intercept and slope estimates are functions of the item facility and factor loadings. If the
ROTATE keyword is omitted in the FACTOR command, the factor loadings are the MINRES factor
loadings (see Display 8). Otherwise the initial rotated factor loadings are used (not shown in the
output).
Suppose the factor loadings for item 1 in a 3-factor solution are denoted by $f_{11}$, $f_{12}$, and
$f_{13}$, and denote the corresponding slopes by $s_{11}$, $s_{12}$, and $s_{13}$. Then

$$s_{11} = \frac{f_{11}}{c_1}, \quad s_{12} = \frac{f_{12}}{c_1}, \quad s_{13} = \frac{f_{13}}{c_1}.$$

Intercepts are computed as $z_i / c_i$, where

$$c_i = \sqrt{1 - \sum_{j=1}^{3} f_{ij}^2}$$

and $z_i$ is the z-value corresponding to an area under the N(0,1) curve equal to the item $i$ facility.
For item 1, for example, the facility equals 0.206 and the corresponding z-value is $-0.8202$. For
item 1, $c_1 = 0.791$ and therefore the item 1 intercept estimate is

$$\text{INTERCEPT} = \frac{-0.8202}{0.791} = -1.036.$$
Conversely, factor loadings are related to the slopes. Let $f_{ij}$ and $s_{ij}$ respectively denote the
$j$-th factor loading and slope of item $i$, $j = 1, 2, \ldots, \text{nfac}$. Then

$$f_{ij} = \frac{s_{ij}}{k_i},$$

where

$$k_i = \sqrt{1 + \sum_{j=1}^{3} s_{ij}^2}.$$
The initial intercept and slope values are used as initial estimates for the full information maxi-
mum likelihood procedure specified by the FULL command.
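A short Python sketch of both conversions, checked against the item 1 numbers in the text (the loading values come from Display 8 above; the function names are illustrative):

import numpy as np
from scipy.stats import norm

def loadings_to_params(F, facility):
    """Initial intercepts and slopes from MINRES loadings and facilities."""
    F = np.atleast_2d(np.asarray(F, float))
    c = np.sqrt(1.0 - (F ** 2).sum(axis=1))    # c_i = sqrt(1 - sum_j f_ij^2)
    z = norm.ppf(np.asarray(facility, float))  # z-value of the item facility
    return z / c, F / c[:, None]               # intercepts, slopes

def slopes_to_loadings(S):
    """Inverse relation: f_ij = s_ij / k_i, k_i = sqrt(1 + sum_j s_ij^2)."""
    S = np.atleast_2d(np.asarray(S, float))
    k = np.sqrt(1.0 + (S ** 2).sum(axis=1))
    return S / k[:, None]

icpt, slopes = loadings_to_params([-0.579, 0.189, 0.022], [0.206])
print(round(icpt[0], 3))     # about -1.035, as computed in the text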
This part of the output shows that parameter estimates will be based on the EM (Expectation
Maximization) method and that the number of quadrature points equals 4. Quadrature is a nu-
meric integration method that is often used in practice to calculate the value of an integral, when
no closed-form solution exists.
For the interested reader, a brief description of the quadrature method to calculate the log-
likelihood function is presented next.
For a one-factor analysis, for example, the log-likelihood function can be expressed as

$$\sum_{\alpha=1}^{N} \log \int_{-\infty}^{\infty} g_\alpha(\theta, x)\, dx,$$

and each integral is approximated by the quadrature formula

$$\int_{-\infty}^{\infty} g_\alpha(\theta, x)\, dx \approx \sum_{k=1}^{q} w_k\, g_\alpha(\theta, x_k),$$

where $q$ is the number of quadrature points.
The numeric values of the 4 quadrature points and weights are listed. Note that the weights are
always positive and that the quadrature points are symmetric.
1 -2.334414 0.045876
2 -0.741964 0.454124
3 0.741964 0.454124
4 2.334414 0.045876
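These are Gauss-Hermite points and weights rescaled for integration against the standard normal density; the rule can be reproduced in Python with NumPy:

import numpy as np
from numpy.polynomial.hermite import hermgauss

# Rescale the Gauss-Hermite rule for N(0,1): x -> sqrt(2)x, w -> w/sqrt(pi).
nodes, weights = hermgauss(4)
for x, w in zip(np.sqrt(2.0) * nodes, weights / np.sqrt(np.pi)):
    print(f"{x:10.6f} {w:9.6f}")     # matches the table above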
The next part of the output shows the progress of the iterative procedure. At each cycle, -2 x
LOG-LIKELIHOOD is reported as well as the maximum change in the intercept and slope values.
For example, the maximum change in slope 1 estimates is equal to 0.098630. In other words,
starting from the initial slope values of 0.387 (item 1), 0.467 (item 2), …, 0.277 (item 32), the
differences between these values and the revised cycle 1 slope 1 estimates are at the most
0.098630 units.
Small maximum changes in intercept and slope estimates are therefore an indication of conver-
gence.
Note that, starting from cycle 6, the difference between –2 log L of the previous cycle and the
present cycle is reported. At cycle 19, for example, this value, reported as CHANGE, is 0.0726.
The fit of the model is expressed by the likelihood-ratio statistic

$$\chi^2 = \sum_{j=1}^{N_R} W_j \log \frac{W_j}{W_T \times p_j},$$

where $N_R$ denotes the number of unique observed response patterns, $W_j$ the sum of weights for
pattern $j$, $W_T$ the total sum of weights, and $p_j$ the marginal probability (marginal likelihood)
of pattern $j$.
For this example, $N_R = 598$, nfac = 3, and $n$ (the number of items) equals 32. Hence the number
of degrees of freedom equals $598 - 1 - [32 \times 3 + 32 - 3(3-1)/2] = 472$. The resultant test
statistic is the difference between the $\chi^2$ under $H_0$ and the $\chi^2$ under $H_1$, with degrees
of freedom equal to the difference in degrees of freedom for $H_0$ and $H_1$.
If we replace the NFAC=3 keyword in the FACTOR command with NFAC=2, then $\chi_0^2 = 13498.63$
with $\text{ndf} = 502$. From the output below, $\chi_1^2 = 13155.03$ with 472 degrees of freedom. The
$\chi^2$ for the 2-factor versus the 3-factor model is $13498.63 - 13155.03 = 343.60$, with
$502 - 472 = 30$ degrees of freedom. Since this value is highly significant, we reject the 2-factor
model in favor of the 3-factor model.
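The significance of the difference is easily checked with a chi-square tail probability (Python with SciPy):

from scipy.stats import chi2

change = 13498.63 - 13155.03             # 343.60
df = 502 - 472                           # 30
print(change, df, chi2.sf(change, df))   # the p-value is vanishingly small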
The output below shows the estimated intercepts and slopes after convergence is attained or,
alternatively, after the maximum number of cycles specified has been used. The number of EM
cycles is specified by the CYCLES keyword on the FULL command (or, for a bifactor solution, on
the BIFACTOR command).
Each communality is equal to the sum of squared factor loadings for the corresponding item. For
example, for item 1 the factor loadings are $-0.553$, $-0.194$, and $0.069$, so that the communality
is equal to $(-0.553)^2 + (-0.194)^2 + (0.069)^2 = 0.348$. The standardized difficulty for item $i$ is
calculated as $-\text{intercept}/k_i$, where (see the comments for Display 9)

$$k_i = \sqrt{1 + \sum_j s_{ij}^2}$$

and $s_{ij}$ denotes the $j$-th slope for item $i$. For item 1, for example, the standardized difficulty
equals 0.846.
An item with a standardized difficulty of 0 can be regarded as an item with “average” difficulty.
Standardized difficulty scores above 0 are associated with the more difficult items and a value of
1.0, for example, indicates that examinees can be expected to find this item more difficult to an-
swer than an item with standardized difficulty of less than 1. On the other hand, items with stan-
dardized difficulty of less than 0 (for example item 31) can be expected to be much easier to an-
swer correctly.
As mentioned earlier (see Display 9), the relationship between slopes and unrotated factor
loadings is given by

$$f_{ij} = \frac{s_{ij}}{k_i},$$

where $i$ is the item number, $j$ the slope number, and $k_i$ as defined above.
The principal factor loadings given below are obtained as follows. Let $F$ be the
$n \times \text{nfac}$ matrix of factor loadings with typical element $f_{ij} = s_{ij}/k_i$, and define $S$
as the $n \times n$ symmetric matrix $FF'$ with column rank equal to the number of factors, nfac.
This implies that $S$ has at most nfac non-zero characteristic roots $c_1, c_2, \ldots, c_{\text{nfac}}$.
If we denote the corresponding eigenvectors by $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_{\text{nfac}}$,
then the principal factor loadings shown in the output below are computed as
$\mathbf{f}_1^* = \mathbf{e}_1\sqrt{c_1}$, $\mathbf{f}_2^* = \mathbf{e}_2\sqrt{c_2}$, and
$\mathbf{f}_3^* = \mathbf{e}_3\sqrt{c_3}$, where the elements of $\mathbf{f}_j^*$ are the loadings on the
$j$-th factor, $j = 1, 2, 3$. (In the table, the first two columns give the standardized difficulty and
the communality of each item; the last three columns give the principal factor loadings.)
1 2 3
1 Q158 0.846 0.348 -0.553 -0.194 0.069
2 Q166 -0.406 0.290 0.496 0.208 -0.030
3 Q167 -0.825 0.096 0.215 -0.215 0.062
4 Q247 -0.522 0.433 0.512 0.410 -0.050
5 Q249 0.083 0.078 -0.146 0.032 0.235
6 Q251 -0.080 0.261 0.246 -0.242 -0.377
...
31 Q313 -1.024 0.434 0.419 0.487 -0.145
32 Q314 -0.221 0.372 0.341 0.488 -0.131
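A minimal NumPy sketch of this computation; each column of the result is determined only up to sign:

import numpy as np

def principal_loadings(F):
    """Principal factor loadings: eigenvectors of S = F F', scaled by the
    square roots of the nfac largest characteristic roots."""
    F = np.asarray(F, float)
    nfac = F.shape[1]
    roots, vectors = np.linalg.eigh(F @ F.T)   # eigenvalues in ascending order
    idx = np.argsort(roots)[::-1][:nfac]       # keep the nfac largest roots
    return vectors[:, idx] * np.sqrt(roots[idx])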
The percentage of variance explained by the $j$-th factor is calculated as

$$\frac{c_j}{n} \times 100\%, \quad j = 1, 2, \ldots, \text{nfac},$$

where $c_j$ is the $j$-th characteristic root of $FF'$ (see Display 14) and $n$ the number of items.
From the values reported in the output, it is seen that 20.31% of the total variance is explained by
the first factor, 8.64% by the second, and 5.68% by the third factor. Since
$20.31 = (c_1/32) \times 100$, it follows that $c_1 = 6.499$.

1 2 3
1 20.31014 8.64630 5.68340
Let $\Lambda$ be an $n \times k$ matrix of factor loadings. This matrix represents the relationships
between the original $n$ items and $k$ linear combinations of these items. To illustrate, suppose the
number of items ($n$) is 4 and the number of factors ($k$) equals 2:

$$x_i = \lambda_{i1} F_1 + \lambda_{i2} F_2 + u_i, \quad i = 1, 2, 3, 4,$$

where $F_1$ and $F_2$ are uncorrelated and the variances of $F_1$ and $F_2$ are the so-called
eigenvalues. The factor loadings $\{\lambda_{ij}\}$ are unique only up to a rotation in $k$-dimensional
space. A suitable rotation of these factor loadings can result in a simplified structure between the
factors and items if the new set of factor loadings $\{\lambda_{ij}^*\}$ are either relatively large or
relatively small. Rotations may be found by maximizing the criterion (see, e.g., Lawley &
Maxwell, 1971)

$$V = \sum_{j=1}^{k} \sum_{i=1}^{n} (\lambda_{ij}^*)^4 - \frac{\gamma}{n} \sum_{j=1}^{k}
\left[ \sum_{i=1}^{n} (\lambda_{ij}^*)^2 \right]^2,$$

where the constant $\gamma$ gives a family of rotations, with $\gamma = 1$ giving VARIMAX rotations
and $\gamma = 0$ QUARTIMAX rotations.
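The criterion itself is easy to evaluate for a candidate set of rotated loadings; a minimal Python sketch:

import numpy as np

def orthomax_criterion(L, gamma=1.0):
    """Criterion V above: gamma = 1 gives VARIMAX, gamma = 0 QUARTIMAX."""
    L2 = np.asarray(L, float) ** 2                # squared loadings
    n = L2.shape[0]
    return (L2 ** 2).sum() - (gamma / n) * ((L2.sum(axis=0)) ** 2).sum()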
Note that the standardized difficulty and communality estimates are the same as those given in
Display 14. To determine which items are associated with a specific factor, one may select, for
each item, the column with the highest loading (ignoring the sign of the loading). The following
items appear to be indicators of Factor 2, for example: items 1, 2, 4, 8, 20, 24, 25, 26, 31 and 32.
The factor scores are Bayes estimates computed under the assumption that the corresponding
ability factors are normally distributed in the population from which the sample of examinees
was drawn.
Let $\theta_{ik}$ denote the $k$-th ability score, $k = 1, 2, \ldots, \text{nfac}$, for examinee $i$,
$i = 1, 2, \ldots, N$; the factor scores are then $E(\theta_{ik} \mid x_{i1}, x_{i2}, \ldots, x_{in})$, where
$x_{ij}$ is the item $j$ score of examinee $i$ (see the discussion of the output in Section 13.7 for
more details).
To obtain these conditional expectations, a 5-point quadrature formula is employed. The points
and weights are shown below.
1 -2.856970 0.011257
2 -1.355626 0.222076
3 0.000000 0.533333
4 1.355626 0.222076
5 2.856970 0.011257
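To make the idea concrete, the sketch below computes a one-factor EAP score and its posterior standard deviation with the 5-point rule above, for a normal-ogive item model with hypothetical intercepts and slopes; TESTFACT's adaptive, multidimensional computation generalizes this:

import numpy as np
from scipy.stats import norm

points = np.array([-2.856970, -1.355626, 0.0, 1.355626, 2.856970])
weights = np.array([0.011257, 0.222076, 0.533333, 0.222076, 0.011257])

def eap(x, a, b):
    """EAP ability and posterior SD for 0/1 responses x, slopes a,
    intercepts b, under a one-factor normal-ogive model."""
    P = norm.cdf(b[None, :] + a[None, :] * points[:, None])   # node-by-item
    like = np.prod(np.where(x[None, :] == 1, P, 1 - P), axis=1)
    post = weights * like                    # posterior mass at each node
    theta = (points * post).sum() / post.sum()
    psd = np.sqrt(((points - theta) ** 2 * post).sum() / post.sum())
    return theta, psd

x = np.array([1, 1, 0, 1, 0]); a = np.full(5, 1.0); b = np.zeros(5)
print(eap(x, a, b))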
The LIST=20 keyword on the SCORE command requests that the factor ability scores for the first
20 cases be listed as part of the output. The full set of factor scores is written to the file
exampl04.fsc. For each case, the case ID, number of items presented, percent correct, and
percent omitted are reported. Below these values, the ability scores for each factor, with
estimated standard errors marked with an asterisk, are given. Case 3, for example, was presented
with 30 items, of which 13 were answered correctly. Hence the percentage correct for this case is

$$\frac{13}{30} \times 100 = 43.3.$$
Case 10 answered 84.4% correctly and had factor scores of 0.898, 1.234, and 1.710, respectively.
Since the means of the 598 factor scores (see the last part of the output) are approximately 0, with
standard deviations of 0.86, 0.86, and 0.82 respectively, it can be concluded that examinee 10
attained factor scores that are at least one standard deviation above average.
Factor scores are not unique, in the sense that multiplication of any column of factor scores by
$-1$ does not affect the validity of the estimates. It may therefore happen that negative scores are
associated with above-average percent correct, and vice versa for below-average responses.
TESTFACT attempts to reverse the signs in such a way that scores above zero are usually
associated with above-average achievement.
13.5 Adaptive item factor analysis and Bayes modal (MAP) factor score estimation for the activity survey
This example analyzes 32 items selected from the 48-item version of the Jenkins Activity Survey
for Health Prediction, Form B (Jenkins, Rosenman, & Zyzanski, 1972). The data are responses
of 598 men from central Finland drawn from a larger survey sample. Most of the items are rated
on three-point scales representing little or no, occasional, or frequent occurrence of the activity
or behavior in question. For purposes of the present analysis, the scales have been dichotomized
near the median. For a complete discussion of the contents of the data file and variable format
statement used to read these data, see Section 13.4. In Section 13.4, EAP factor score estimation
was performed. This example, illustrating MAP factor score estimation, imports the ex-
ampl04.par file from the previous example (FILE keyword on the SCORE command) to score the
respondents to the survey using the VARIMAX rotated factor pattern.
The PROBLEM, RESPONSE, KEY, SELECT and INPUT commands are the same as used in Section
13.4, with the exception of the addition of the SKIP keyword on the PROBLEM command.
Classical item analysis and the item factor analysis are both skipped, as indicated by SKIP=2;
accordingly, the TETRACHORIC, FACTOR, and FULL commands have been removed. The SAVE
command is still present, but is only used to save the factor scores to the file exampl04.fsc
(FSCORES option on the SAVE command).
The SCORE command now indicates the use of MAP estimation (METHOD=3). The FILE keyword
indicates the parameter file to be used while the NFAC keyword specifies the number of factors
used when estimating the factor scores (recall that in the previous example 3 factors were ex-
tracted). Factor scores for the first 20 cases are to be written to the output file (LIST=20) and the
convergence for the MAP iterations is set by the SPRECISION keyword. Cases will be scored by
the MAP (Maximum A Posteriori, or Bayes Modal) method. Standard error estimates will be
computed from the posterior information at the estimated values.
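For a single factor, MAP estimation amounts to maximizing the posterior density. A minimal illustration in Python (hypothetical one-factor normal-ogive items with a standard normal prior; TESTFACT's multivariate iteration to the SPRECISION tolerance is more general):

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def map_score(x, a, b):
    """MAP (Bayes modal) ability for 0/1 responses x, slopes a, intercepts b."""
    def neg_log_post(theta):
        P = np.clip(norm.cdf(b + a * theta), 1e-10, 1 - 1e-10)
        loglik = np.where(x == 1, np.log(P), np.log(1 - P)).sum()
        return -(loglik - 0.5 * theta**2)        # N(0,1) prior term
    return minimize_scalar(neg_log_post, bounds=(-5, 5), method="bounded").x

print(round(map_score(np.array([1, 1, 0, 1, 0]),
                      np.full(5, 1.0), np.zeros(5)), 3))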
>TITLE
EXAMPL05.TSF-ITEMS FROM THE JENKINS ACTIVITY SURVEY
SCORING THE RESPONDENTS (MAP METHOD)
>PROBLEM NITEMS=48,SELECT=32,RESPONSES=5,NOTPRESENTED,SKIP=2;
>NAMES Q156,Q157,Q158,Q165,Q166,Q167,Q247,Q248,Q249,Q250,Q251,Q252,
Q253,Q254,Q255,Q256,Q257,Q258,Q259,Q260,Q261,Q262,Q263,Q264,
Q265,Q266,Q267,Q268,Q269,Q270,Q271,Q272,Q273,Q274,Q275,Q276,
Q277,Q278,Q279,Q280,Q307,Q308,Q309,Q310,Q311,Q312,Q313,Q314;
>RESPONSE '8','0','1','2','.';
>KEY 002000220022222220022222202222220002220022222000;
>SELECT 3,5,6,7,9,11(1)14,17(1)23,25(1)30,32,33,35,36,39(1)42,47,48;
>SAVE FSCORES;
>SCORE METHOD=3,LIST=20,NFAC=3,SPRECISION=0.0001,
FILE='EXAMPL04.PAR';
>INPUT NIDCHAR=10,SCORES,FILE='EXAMPL04.DAT';
(10A1,T1,48A1)
>STOP
13.6 Six-factor analysis of the activity survey by Monte Carlo full information
analysis
This example illustrates a six-dimensional analysis by the Monte Carlo version of adaptive EM
estimation. The same 32 items selected from the 48-item version of the Jenkins Activity Survey
for Health Prediction, Form B (Jenkins, Rosenman, and Zyzanski, 1972) as in the previous 2 ex-
amples are used. For a complete discussion of the contents of the data file and variable format
statement used to read these data, see Section 13.4.
The TETRACHORIC command requests the printing of the coefficients to 3 decimal places
(NDEC=3) in the printed output file (LIST option). The FACTOR and FULL commands are used to
specify parameters for the full information item factor analysis. Six factors and six latent roots
are to be extracted, as indicated by the NFAC and NROOT keywords respectively. A PROMAX rota-
tion is requested. Note that this keyword may not be abbreviated in the FACTOR command. A
maximum of 24 EM cycles will be performed (CYCLES keyword on the FULL command).
In place of the default method of integration by fractional quadrature of the posterior distribu-
tions, the program performs Monte Carlo integration in the corresponding number of dimensions.
Random points are drawn at each E-step from the provisional posterior distribution for each case,
which is assumed multivariate normal in the number of factors. After the specified iteration limit
is reached, the points for each case at the iteration limit are saved and used in all subsequent EM
cycles. The seed for the random number generator used in the Monte Carlo EM solution is set
with the MCEMSEED keyword on the TECHNICAL command (MCEMSEED=4593 in this example).
Monte Carlo integration is also used in computing EAP factor scores (METHOD=2 on the SCORE
command). Factor scores for the first 20 cases are to be written to the output file (LIST=20).
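The idea of the Monte Carlo E-step, replacing a six-dimensional quadrature by an average over random draws from a multivariate normal, can be sketched in a few lines of Python; the function g and the distribution parameters below are placeholders, and the seed merely echoes the role of MCEMSEED:

import numpy as np

def mc_expectation(g, mean, cov, ndraws=1000, seed=4593):
    """Approximate E[g(theta)] for theta ~ N(mean, cov) by Monte Carlo."""
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(mean, cov, size=ndraws)
    return np.mean([g(t) for t in draws])

# A six-dimensional expectation, as in the six-factor E-step:
val = mc_expectation(lambda t: np.prod(1.0 / (1.0 + np.exp(-t))),
                     mean=np.zeros(6), cov=np.eye(6))
print(round(val, 4))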
>TITLE
EXAMPL06.TSF- ITEMS FROM THE JENKINS ACTIVITY SURVEY
SIX-FACTOR ANALYSIS BY MONTE CARLO EM FULL INFORMATION ANALYSIS
>PROBLEM NITEMS=48,SELECT=32,RESPONSES=5,NOTPRESENTED;
>NAMES Q156,Q157,Q158,Q165,Q166,Q167,Q247,Q248,Q249,Q250,Q251,Q252,
Q253,Q254,Q255,Q256,Q257,Q258,Q259,Q260,Q261,Q262,Q263,Q264,
Q265,Q266,Q267,Q268,Q269,Q270,Q271,Q272,Q273,Q274,Q275,Q276,
Q277,Q278,Q279,Q280,Q307,Q308,Q309,Q310,Q311,Q312,Q313,Q314;
>RESPONSE '8','0','1','2','.';
>KEY 002000220022222220022222202222220002220022222000;
>SELECT 3,5,6,7,9,11(1)14,17(1)23,25(1)30,32,33,35,36,39(1)42,47,48;
>TETRACHORIC LIST, NDEC=3;
>FACTOR NFAC=6,NROOT=6,ROTATE=PROMAX;
>FULL CYCLES=24;
>SCORE METHOD=2,LIST=20;
>TECHNICAL MCEMSEED=4593;
>INPUT NIDCHAR=10,SCORES,FILE='EXAMPL04.DAT';
(10A1,T1,48A1)
>STOP
13.7 Item bifactor analysis of a twelfth-grade science assessment test
Data for this example are based on 32 items from a science assessment test in the subjects of bi-
ology, chemistry, and physics administered to twelfth-grade students near the end of the school
year. The items were classified by subject matter for purposes of the bifactor analysis.
The first five cases from the data file exampl07.dat are shown below. The FILE keyword on the
INPUT command denotes this file as the data source and the SCORES option indicates that it con-
tains item scores.
Case001 14523121312421534414334135131545
Case002 34283328312821524114338184145848
Case003 14543223322131554134331134134441
Case004 24423324322421524134315254134242
Case005 24523221122421544514333115131241
The case identification is given in the first 7 columns, and is listed first in the variable format
statement. The length of this field is also indicated by the NIDCHAR keyword on the INPUT com-
mand. After using the “T” operator to tab to column 11, the 32 item responses are read as single
characters (32A1).
(7A1,T11,32A1)
32 items from the science test are used as indicated by the NITEMS keyword on the PROBLEM com-
mand, and the RESPONSE keyword denotes the number of possible responses. The six responses
are listed in the RESPONSE command. Naming of the items is done using the NAMES command,
while the KEY command lists the correct response to each item.
The BIFACTOR command is used to request full information estimation of loadings on a general
factor in the presence of item-group factors. Three item-group factors are present (NIGROUP=3),
with allocation of the items to these groups as specified with the IGROUPS keyword. The CPARMS
keyword lists the probabilities of chance success on each item. By setting the LIST keyword to 3,
the bifactor loadings will be printed in both item and in item-group order in the output file. A
total of 30 EM cycles (CYCLES=30) will be performed in the bifactor solution.
The SCORE command is used to obtain, for each distinct pattern, the EAP score of the general
factor of the bifactor model and to obtain the standard error estimate of the general factor score
allowing for conditional dependence introduced by the group factors. Factor scores for the first
10 cases will be printed to the output file (LIST=10) and the guessing model will be used in the
computation of the factor scores (CHANCE option).
>TITLE
EXAMPL07.TSF- ITEM BIFACTOR ANALYSIS OF A TWELFTH-GRADE SCIENCE ASSESSMENT TEST
THE GENERAL FACTOR WILL BE SCORED
>PROBLEM NITEMS=32,RESPONSE=6;
>NAMES CHEM01,PHYS02,CHEM03,PHYS04,PHYS05,CHEM06,BIOL07,CHEM08,
BIOL09,BIOL10,BIOL11,PHYS12,BIOL13,PHYS14,BIOL15,CHEM16,
BIOL17,BIOL18,PHYS19,PHYS20,BIOL21,BIOL22,PHYS23,BIOL24,
PHYS25,PHYS26,BIOL27,PHYS29,CHEM29,PHYS30,BIOL31,CHEM32;
>RESPONSE 8,1,2,3,4,5;
>KEY 14523121312421534414334135131545;
>BIFACTOR NIGROUP=3,LIST=3,CYCLES=30,QUAD=9,
IGROUPS=(2,3,2,3,3,2,1,2,1,1,1,3,1,3,1,2,1,1,3,3,1,1,
3,1,3,3,1,3,2,3,1,2),
CPARMS=(0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,
0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,
0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1);
>SCORE LIST=10,CHANCE;
>SAVE PARM,FSCORES;
>INPUT NIDCHAR=7,SCORES,FILE='EXAMPL07.DAT';
(7A1,T11,32A1)
>STOP;
Exampl07.tsf illustrates the extension of a one-factor model to a so-called bifactor model by the
inclusion of group factors. The bifactor model is applicable when an achievement test contains
more than one subject matter content area. The data set exampl07.dat consists of the results of a
32-item science assessment test in the subjects biology, chemistry, and physics. Items are
classified according to subject matter, where 1 = biology, 2 = chemistry, and 3 = physics (see the
IGROUPS keyword in the BIFACTOR command, discussed in Section 5.3.3). Note that TESTFACT
does not estimate guessing parameters, but does allow the user to specify their values (see the
CPARMS keyword), in which case a 3-parameter model that provides for the effect of guessing is
fitted to the data.
The analysis specified in exampl07.tsf produces Phase 0, Phase 1, Phase 2, Phase 6, and Phase 7
output. The interpretation of Phases 0, 1, and 2 is omitted here, since a detailed discussion of
these parts of the output is given elsewhere in this chapter.
Display 1 lists the chance and initial intercept and slope estimates. Note that the initial intercept
estimates are set equal to zero, the initial slope estimates are set to 1.414 for the general factor
and 1.00 for the group factors. These initial values are routinely used in TESTFACT for bifactor
models.
One may optionally include the TETRACHORIC command (see exampl03.tsf) when fitting a bi-
factor model. This command is required if a printout of residuals is requested. If a TETRACHORIC
command is used, tetrachoric correlations are computed pairwise for the 32 × (32 − 1) / 2 = 496
pairs of items. There are a total of 20 item pairs that cannot be used since their corresponding
2 × 2 frequency tables contain zero or near-zero off-diagonal or marginal frequencies. In these
cases, a tetrachoric correlation of 1 is substituted in the matrix of tetrachoric correlations.
The inclusion or exclusion of the TETRACHORIC command has no effect on the estimation proce-
dure, since the starting values for the marginal maximum likelihood procedure are fixed, and do
not depend on the matrix of tetrachoric coefficients.
The bifactor procedure uses the 9 quadrature points and weights listed below. MML estimation
for the bifactor model requires quadrature in only two dimensions. For a more detailed discus-
sion, see the Phase 7 part of the output.
9 QUADRATURE POINTS
1 -4.000000 0.000134
2 -3.000000 0.004432
3 -2.000000 0.053991
4 -1.000000 0.241971
5 0.000000 0.398942
6 1.000000 0.241971
7 2.000000 0.053991
8 3.000000 0.004432
9 4.000000 0.000134
The number of cycles for the EM algorithm is set equal to 30 (CYCLES=30 on the BIFACTOR
command). At each cycle, the value of –2 log L as well as the maximum change in the intercept
and slope parameters are given. At cycle 30 the maximum change in intercept is 0.0050. The
general factor slope estimates for the 32 items changed at most by 0.0047 while the correspond-
ing value for the group factor equals 0.0095. These values indicate that, although convergence
was not reached within the specified 30 cycles, the solution after 30 cycles is probably
acceptable for all practical purposes.
The $\chi^2$-value is 11150.36 with 503 degrees of freedom. The number of degrees of freedom is
calculated as

$$df = N - 1 - 2n - n_g,$$

where $N$ is the number of distinct patterns, $n$ is the number of items, and $n_g$ is the number of
items assigned to group factors. For this example, $N = 600$, $n = 32$ and, since all the items are
assigned to group factors, $n_g = 32$.
The $\chi^2$-statistic is only correct when all possible $2^n$ patterns are observed. For the present
sample, since $N \ll 2^{32}$, the $\chi^2$-statistic is too inaccurate to be used as a goodness-of-fit
test statistic. The difference in the $\chi^2$-statistics for alternative models, however, yields a valid
test statistic for judging whether the inclusion of additional parameters results in a significant
improvement of model fit.
Example
It is hypothesized that the 12 physics items are indicators of a general factor only, while the biol-
ogy and chemistry items are indicators of a general and two uncorrelated group factors.
We wish to test
H 0 : The 32 items are indicators of a general factor, but the 13 biology and 7 chemistry
items are also indicators of two uncorrelated group factors.
H1 : The 32 items are indicators of a general as well as three uncorrelated group factors.
To obtain the $\chi^2$-statistic and degrees of freedom under $H_0$, the BIFACTOR command is
modified as follows (the CPARMS keyword is retained unchanged):

>BIFACTOR NIGROUP=2,LIST=3,CYCLES=30,QUAD=9,
IGROUPS=(2,0,2,0,0,2,1,2,1,1,1,0,1,0,1,2,1,1,0,0,1,1,
0,1,0,0,1,0,2,0,1,2);

Note that the NIGROUP keyword is set equal to 2 and that each value of 3, corresponding to the
position of a physics item in the data set, is substituted by a value of 0 in the IGROUPS keyword.
A "0" indicates that the corresponding item is not assigned to any group factor.
A graphical presentation of the H 0 model is shown below.
If we run exampl07.tsf with the changes to the BIFACTOR command discussed above, the
$\chi^2$-statistic value and degrees of freedom shown below are obtained. To test $H_0$ against
$H_1$, one computes the difference in the corresponding $\chi^2$-statistics and degrees of freedom.
Hence $\chi^2 = 11179.71 - 11150.36 = 29.35$ with $515 - 503 = 12$ degrees of freedom. Since
$P(\chi^2(12) \ge 29.35) = 0.0034$, $H_0$ is rejected and it is concluded that items from all 3
subjects should be used for the group factors.
The estimates for the intercept and slope parameters are listed below.
1 2
1CHEM01 0.100 -1.054 0.709 0.417
2PHYS02 0.100 0.126 1.019 0.548
3CHEM03 0.100 -1.360 1.265 -0.182
4PHYS04 0.100 -0.578 0.469 0.377
5PHYS05 0.100 0.263 0.635 0.337
6CHEM06 0.100 -2.729 1.706 0.308
...
31 BIOL31 0.100 1.608 1.447 -0.005
32 CHEM32 0.100 -1.522 0.190 0.066
An alternative way to present these estimated parameters is shown below for the first 10 items.
The percentage variance explained by each of the four factors is calculated as follows. Let
$s_{ij}$ denote the $j$-th slope parameter for item $i$, $i = 1, 2, \ldots, 32$. If we define

$$k_i = \sqrt{1 + \sum_j s_{ij}^2},$$

then slopes are transformed to factor loadings (see Display 9 in the discussion of the Section
13.4 output) using the relationship

$$f_{ij} = \frac{s_{ij}}{k_i}.$$
Example

For item 7, $k_7 = \sqrt{1 + 0.586^2 + 0.636^2} = 1.322$. The item 7 loadings are therefore
$0.586/1.322 = 0.443$ and $0.636/1.322 = 0.481$, respectively. Let $F$ be the $32 \times 4$ matrix of
factor loadings with elements $f_{ij}$ (see Display 7). The percentage of variance explained by the
$j$-th factor is

$$\frac{c_j}{n} \times 100\%, \quad j = 1, 2, 3, 4,$$

where $n = 32$ and $c_j$ is the $j$-th characteristic root of $FF'$ (see also the discussion of the
output, Display 15, in Section 13.4). The percentage of variance left unexplained is
correspondingly

$$\frac{n - \sum_j c_j}{n} \times 100\%.$$
The bifactor loadings are derived from the slope estimates using the formula $f_{ij} = s_{ij}/k_i$
(see Display 6 above). The standardized item $i$ difficulty equals $-\text{intercept}/k_i$.

Example

For item 7, $k_7 = 1.322$, so that the standardized difficulty is $-0.839/1.322 = -0.635$.
Communalities are equal to the sum of the squares of the factor loadings. For example, with
$k_1 = \sqrt{1 + 0.709^2 + 0.417^2} = 1.295$, the item 1 communality is equal to
$(0.709/1.295)^2 + (0.417/1.295)^2 = 0.40$.
The printout below shows the same information as for Display 7, except that the items are
re-ordered by group number. All 32 items have positive loadings on the general factor, while the
group factor loadings for BIOL31, CHEM03, and PHYS30 are negative, but relatively small.
The factor scores are so-called expected a posteriori (EAP) estimates of the general ability factor
under the assumption of normality (see the Phase 7 output). Let $\theta_i$ denote the general ability
for examinee $i$. The EAP score is the conditional expectation
$E(\theta_i \mid x_{i1}, x_{i2}, \ldots, x_{in})$, where $x_{ij}$ is the item $j$ score for examinee $i$. It
can be shown that this conditional expectation follows as the solution of a two-dimensional
integral that is approximated by a Gauss quadrature formula. A brief description is provided
below for the interested reader. The EAP estimate can be written as

$$E(\theta_i \mid x_{i1}, \ldots, x_{in}) =
\frac{\int \theta\, f(x_{i1}, \ldots, x_{in} \mid \theta)\, g(\theta)\, d\theta}
{f(x_{i1}, x_{i2}, \ldots, x_{in})},$$

where $g(\theta)$ denotes the N(0,1) population distribution of the general ability. The marginal
probability function $f(x_{i1}, x_{i2}, \ldots, x_{in})$ is obtained in the EM step using a
two-dimensional quadrature formula. Suppose $y_{i1}, y_{i2}, \ldots, y_{in}$ denotes the set of item
scores ordered by the three groups. Under the assumption of uncorrelated group factors, it
follows from Section 13.7 that

$$f(x_{i1}, \ldots, x_{in}) = \int_{\theta} \left[\, \prod_{g=1}^{3} \int_{\theta_g}
f(\mathbf{y}_{ig} \mid \theta, \theta_g)\, g(\theta_g)\, d\theta_g \right] g(\theta)\, d\theta,$$

where $\theta_i$ denotes the general ability for examinee $i$, and $\theta_{i1}$ is the group 1
(biology), $\theta_{i2}$ the group 2 (chemistry), and $\theta_{i3}$ the group 3 (physics) ability,
respectively. Note that (see Display 9) $y_{i1} = x_{i7}$, $y_{i2} = x_{i9}$, ..., $y_{i,32} = x_{i,30}$.
For the 13 biology items, for example,

$$f(y_{i1}, y_{i2}, \ldots, y_{i13}) = \int_{\theta} \int_{\theta_1}
f(y_{i1}, y_{i2}, \ldots, y_{i13} \mid \theta, \theta_1)\, g(\theta, \theta_1)\, d\theta\, d\theta_1.$$

This integral is approximated by the quadrature formula

$$\sum_k \sum_l w_k w_l\, f^*_{\theta,\theta_1}(x_k, x_l),$$

where $w_k$ and $w_l$ are the weights and $x_k$ and $x_l$ the points shown in Display 9 of the
output.
The ability scores for each case and corresponding standard error estimates are tabulated.
Examinee 5, for example, selected the correct answers to 22 of the 32 items; the percentage
correct is therefore $\frac{22}{32} \times 100 = 68.8\%$. The estimated ability score for this
candidate is 0.800 with a standard error of 0.411. Candidate 7 also had correct answers to 22 of
the 32 items, but the ability estimate is 0.691. It is evident that the estimated ability depends not
only on the number of correct items but also on which subset of items was answered correctly.
CASE HEADER:
CASE NUMBER PERCENT PERCENT CASE ID
PRESENTED CORRECT OMITTED
SCORE AND S.E.
============================================================================
1 32 100.0 0.0 Case001
2.507 0.591
2 32 53.1 0.0 Case002
-0.066 0.323
3 32 56.2 0.0 Case003
0.032 0.380
4 32 50.0 0.0 Case004
-0.559 0.505
5 32 68.8 0.0 Case005
0.800 0.411
6 32 62.5 0.0 Case006
0.342 0.491
7 32 68.8 0.0 Case007
0.691 0.469
8 32 65.6 0.0 Case008
0.286 0.463
9 32 28.1 0.0 Case009
-1.434 0.518
10 32 46.9 0.0 Case010
-0.965 0.342
The number of cases scored is equal to 600, with a mean of $-0.0258$ and standard deviation of
0.9011. Note that the ability scores are estimated under the assumption that the general factor
ability has a normal distribution with mean 0 and standard deviation 1. For large data sets, one
ideally wants the estimated ability scores to have mean 0 and standard deviation 1.
The RMS (root mean square) error value of 0.4394 is relatively large, and indicates that, in
general, 95% confidence intervals for the estimated scores will be wide. For example, a 95%
confidence interval for examinee 5 is $0.800 \pm 1.96 \times 0.411$, that is, $(-0.01;\ 1.61)$.
The empirical reliability is a measure of how close the observed scores are to the true, but unob-
served, scores. A reliability of 1, for example, implies that one can safely substitute the observed
test scores for the unknown true scores.
EMPIRICAL
RELIABILITY: 0.8114
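One common definition, not necessarily TESTFACT's exact formula, takes the ratio of the observed score variance to the score variance plus the mean squared error of measurement; with the summary values above it approximately reproduces the reported figure:

# Empirical reliability as var(scores) / (var(scores) + mean(SE^2)),
# using SD = 0.9011 and RMS error = 0.4394 from the output above.
sd, rms = 0.9011, 0.4394
print(round(sd**2 / (sd**2 + rms**2), 4))   # about 0.81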
13.8 Three-factor analysis of the twelfth-grade science assessment test

Data for this example are based on 32 items from a science assessment test in the subjects of
biology, chemistry, and physics administered to twelfth-grade students near the end of the school
year. For a description of the data file and variable format statement, see the example discussed
in Section 13.7.
Although the items are classified by subject matter for purposes of a bifactor analysis (see Sec-
tion 13.4), an item factor analysis is specified. The FULL command specifies that a maximum of
24 EM cycles (CYCLES=24) is to be performed in the full information item factor analysis in
which 3 factors and 3 latent roots are to be extracted (NFAC=3, NROOT=3 on FACTOR command).
Non-adaptive quadrature is requested through the use of the NOADAPT option on the TECHNICAL
command.
The SAVE command is used to write the unrotated factor loadings to exampl08.unr (UNROTATE
option). See Section 13.14 for more details on the use of this file.
>TITLE
EXAMPL08.TSF-THREE FACTOR ANALYSIS OF A TWELFTH-GRADE SCIENCE ASSESSMENT TEST
UNROTATED FACTOR LOADINGS ARE SAVED FOR USE IN EXAMPL14.
>PROBLEM NITEM=32,RESPONSE=6;
>NAMES CHEM01,PHYS02,CHEM03,PHYS04,PHYS05,CHEM06,BIOL07,CHEM08,BIOL09,
BIOL10,BIOL11,PHYS12,BIOL13,PHYS14,BIOL15,CHEM16,BIOL17,BIOL18,
PHYS19,PHYS20,BIOL21,BIOL22,PHYS23,BIOL24,PHYS25,PHYS26,BIOL27,
PHYS29,CHEM29,PHYS30,BIOL31,CHEM32;
>RESPONSE 8,1,2,3,4,5;
>KEY 14523121312421534414334135131545;
>TETRACHORIC NDEC=3,LIST;
>FACTOR NFAC=3,NROOT=3;
>FULL CYCLES=24;
>TECHNICAL NOADAPT;
>SAVE UNROTATE;
>INPUT NIDCHAR=7,SCORES,FILE='EXAMPL07.DAT';
(7A1,T11,32A1)
>STOP
13.9 Bifactor scoring of the science assessment test from saved parameters

This example illustrates bifactor scoring from saved parameters. Data for this example are based
on 32 items from a science assessment test in the subjects of biology, chemistry, and physics
administered to twelfth-grade students near the end of the school year. For a description of the
data file and variable format statement, see the example discussed in Section 13.7.
The assignment of items to group factors is not included in the parameter file exampl07.par, read
using the FILE keyword on the SCORE command; the assignment must therefore be supplied in
the BIFACTOR command. The BIFACTOR command is used to request full information estimation of
loadings on a general factor in the presence of item-group factors. Three item-group factors are
present (NIGROUP=3), with allocation of the items to these groups as specified with the IGROUPS
keyword. By setting the LIST keyword to 3, the bifactor loadings will be printed in both item and
in item-group order in the output file. A total of 30 EM cycles (CYCLES=30) will be performed in
the bifactor solution. The chance parameters are supplied in the file and do not need to be re-
entered in the command.
For the purpose of scoring from supplied parameters, the number of factors (NFAC) is set to 1 in
the SCORE command. Factor scores for the first 10 students will be printed to the output file
(LIST=10). The factor scores are also saved to the file exampl09.fsc (FSCORES on the SAVE com-
mand). The guessing model will be used in the computation of the factor scores (CHANCE option).
>TITLE
EXAMPL09.TSF- ITEM BIFACTOR ANALYSIS OF A TWELFTH-GRADE SCIENCE
ASSESSMENT TEST: THE GENERAL FACTOR WILL BE SCORED
>PROBLEM NITEM=32,RESPONSE=6;
>NAMES CHEM01,PHYS02,CHEM03,PHYS04,PHYS05,CHEM06,BIOL07,CHEM08,
BIOL09,BIOL10,BIOL11,PHYS12,BIOL13,PHYS14,BIOL15,CHEM16,
BIOL17,BIOL18,PHYS19,PHYS20,BIOL21,BIOL22,PHYS23,BIOL24,
PHYS25,PHYS26,BIOL27,PHYS29,CHEM29,PHYS30,BIOL31,CHEM32;
>RESPONSE 8,1,2,3,4,5;
>KEY 14523121312421534414334135131545;
>BIFACTOR NIGROUP=3,LIST=3,CYCLES=30,
IGROUPS=(2,3,2,3,3,2,1,2,1,1,1,3,1,3,1,2,1,1,3,3,1,1,
3,1,3,3,1,3,2,3,1,2);
>SCORE NFAC=1,LIST=10,CHANCE,FILE='EXAMPL07.PAR';
>SAVE FSCORES;
>INPUT NIDCHAR=7,SCORES,FILE='EXAMPL07.DAT';
(7A1,T11,32A1)
>STOP;
A graphical presentation of the general factor scores using a bifactor analysis with 3 groups is
shown below.
13.10 Adaptive scoring of the general factor from supplied bifactor parameters
The saved parameters from Section 13.7 are used in scoring the general factor by adaptive quad-
rature. Data for this example are based on 32 items from a science assessment test in the subjects
of biology, chemistry, and physics administered to twelfth-grade students near the end of the
school year. For a description of the data file and variable format statement, see Section 13.7.
The PROBLEM, KEY, RESPONSE, SAVE, and INPUT commands are also the same as those used in
Section 13.7.
Conditional dependence due to the group factors is not accounted for. A one-factor analysis is
requested by replacing the BIFACTOR command used in Section 13.7 with the FACTOR command
shown here.
EAP factor scores are requested (METHOD=2 on the SCORE command). The first ten cases are also
printed to the output file (LIST=10) and factor scores for all cases are saved to the file
exampl10.fsc (FSCORES on the SAVE command). As before, the guessing model will be used in the
computation of the factor scores (CHANCE option).
>TITLE
EXAMPL10.TSF-ONE-FACTOR ANALYSIS OF A TWELFTH-GRADE SCIENCE ASSESSMENT TEST
ADAPTIVE SCORING OF GENERAL FACTOR FROM SUPPLIED BIFACTOR PARAMETERS
>PROBLEM NITEM=32,RESPONSE=6;
>NAMES CHEM01,PHYS02,CHEM03,PHYS04,PHYS05,CHEM06,BIOL07,CHEM08,
BIOL09,BIOL10,BIOL11,PHYS12,BIOL13,PHYS14,BIOL15,CHEM16,
BIOL17,BIOL18,PHYS19,PHYS20,BIOL21,BIOL22,PHYS23,BIOL24,
PHYS25,PHYS26,BIOL27,PHYS29,CHEM29,PHYS30,BIOL31,CHEM32;
>RESPONSE 8,1,2,3,4,5;
>KEY 14523121312421534414334135131545;
>FACTOR NFAC=1;
>SCORE METHOD=2,NFAC=1,LIST=10,CHANCE,FILE='EXAMPL07.PAR';
>SAVE FSCORES;
>INPUT NIDCHAR=7,SCORES,FILE='EXAMPL07.DAT';
(7A1,T11,32A1)
>STOP;
A histogram of the factor scores obtained from the one-factor model is shown below. The distribution of scores follows a bell-shaped curve with mean –0.028 and standard deviation of 0.933. In
contrast to this, the bifactor solution (see previous section) yields scores for the general factor
which exhibit much less variation (standard deviation = 0.273) about the mean.
This example illustrates a MINRES factor analysis of a correlation matrix imported from the file
exampl04.cor saved in Section 13.4.
The import file is named using the FILE keyword on the INPUT command. The CORRELAT option
on this command indicates that the input is a correlation matrix for MINRES factor analysis (full
information factor analysis requires item response data and cannot be carried out directly on the
correlation matrix). In this instance, the matrix contains item tetrachoric correlations, but a corre-
lation matrix from any source could be analyzed.
For convenience in handling large correlation matrices, the tetrachoric correlation matrix is
saved and imported in format-free space delimited form. Note that names are supplied in the
NAMES command for the variables represented in the correlation matrix.
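As an illustration (hypothetical values; the actual matrix is the one saved in Section 13.4), the first lines of such a space-delimited correlation file might look like:

1.000
0.231  1.000
0.118  0.342  1.000

with one row per variable, each listing the correlations with the preceding variables and itself.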
The 48 items are from the Jenkins Activity Survey. Thirty-two of them are selected using the SELECT keyword on the PROBLEM command, with the SELECT command specifying the items and the order of selection. SKIP=2 on the PROBLEM command bypasses the calculation and printing of classical item statistics.
The FACTOR command specifies the extraction of 3 factors and 6 roots (NFAC=3; NROOT=6). A PROMAX rotation is requested, and the rotated factor loadings will be saved in the file exampl11.rot (ROTATE option on the SAVE command). Note that the PROMAX option may not be abbreviated on the FACTOR command.
>TITLE
EXAMPL11.TSF- ITEMS FROM THE JENKINS ACTIVITY SURVEY
ITEM FACTOR ANALYSIS OF A USER-SUPPLIED CORRELATION MATRIX
>PROBLEM NITEM=48,SELECT=32,SKIP=2;
>NAMES Q156,Q157,Q158,Q165,Q166,Q167,Q247,Q248,Q249,Q250,Q251,Q252,
Q253,Q254,Q255,Q256,Q257,Q258,Q259,Q260,Q261,Q262,Q263,Q264,
Q265,Q266,Q267,Q268,Q269,Q270,Q271,Q272,Q273,Q274,Q275,Q276,
Q277,Q278,Q279,Q280,Q307,Q308,Q309,Q310,Q311,Q312,Q313,Q314;
>SELECT 3,5,6,7,9,11(1)14,17(1)23,25(1)30,32,33,35,36,39(1)42,47,48;
>FACTOR NFAC=3,NROOT=6,ROTATE=PROMAX;
>SAVE ROTATE;
>INPUT CORRELAT,FILE='EXAMPL04.COR';
>STOP
This example illustrates the simulation of a sample of 1500 responses to 32 items. Sampling is
from a multivariate latent distribution of factor scores with user-specified vector mean and fixed
correlation matrix. A three-factor model is assumed. The user must supply standard item diffi-
culties and NFAC factor loadings (or intercepts or factor slopes) for each item.
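As a sketch of the model being simulated (our notation, not TESTFACT syntax; there is no guessing parameter in this example), the probability of a correct response to item j for a case with factor scores \(\theta_1,\theta_2,\theta_3\) can be written in normal-ogive form as

\[
P(x_j = 1 \mid \theta) \;=\; \Phi\Big(c_j + \sum_{k=1}^{3} a_{jk}\,\theta_k\Big),
\]

where \(\Phi\) is the standard normal distribution function, \(c_j\) is the item intercept, and \(a_{jk}\) are the factor slopes. In the standardized metric, loadings and slopes are related by \(\lambda_{jk} = a_{jk}\big/\sqrt{1 + \sum_{k'} a_{jk'}^2}\).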
Note that the PROBLEM command only indicates the number of items, and that the syntax contains
no INPUT command, but only the SIMULATE command. The NFAC keyword on this command indi-
cates the use of a three-factor model, and NCASES denotes the required sample size. The presence
of the SLOPES option indicates that the item parameters provided are the intercept and the NFAC
slopes. These parameter values are read in from the file exampl12.prm using the FILE keyword.
The MEAN keyword indicates the population means of the factor scores from which the responses
are generated. These means will be added to the random standard normal deviates representing
the ability of each case on the corresponding factors. If the MEAN keyword is omitted, zero means
are assumed and written to the *.sim file.
The simulated responses are written to a file with the extension *.sim, in this case exampl12.sim. The first line of each new record contains the case number, group number, form number, and factor means. The next line is the set of responses, where 0 indicates an incorrect
answer and 1 a correct answer. The GROUP keyword is set to its default value of 1. Similarly, test
form identification may be requested using the FORM keyword. By default, all records will be as-
sumed to belong to the same test form.
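As an illustration, a single record in the *.sim file might look like the following (hypothetical case and responses; the exact column layout depends on the program's output format):

    1    1    2    0.000   0.000   0.000
 10110100111010010110101101001101

The first line identifies case 1 in group 1 on form 2, followed by the three factor means; the second line contains the 32 simulated item responses.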
>TITLE
EXAMPL12.TSF- SIMULATE RESPONSES TO 32 ITEMS
THREE FACTOR MODEL; FACTOR SLOPES; SAMPLE SIZE= 1500
>PROBLEM NITEM=32;
>SIMULATE NFAC=3,NCASES=1500,FORM=2,GROUP=1,SLOPES,
MEAN=(0,0.0,0.0),FILE='EXAMPL12.PRM';
>STOP
The first lines of the parameter file exampl12.prm (an intercept and three factor slopes for each item) are:
(6X,4F8.3)
1 1.041 -0.675 0.246 -0.049
2 0.480 0.585 -0.261 0.024
3 0.868 0.240 0.230 -0.063
This example illustrates the simulation of a sample of 1500 responses to 32 items. Sampling is
from a multivariate latent distribution of factor scores with user-specified vector mean and fixed
correlation matrix. A three-factor model is assumed and simulation is with guessing and non-
zero factor means. The user must supply standard item difficulties and NFAC factor loadings (or
intercepts or factor slopes) for each item.
Note that the PROBLEM command only indicates the number of items, and that the syntax contains
no INPUT command, but only the SIMULATE command. The NFAC keyword on this command indi-
cates the use of a three-factor model, and NCASES denotes the required sample size. The presence
of the CHANCE and LOADINGS options indicates that each item has a guessing, a standardized difficulty, and three factor loading parameters. These parameter values are read in from the file ex7sim.par using the FILE keyword.
The MEAN keyword indicates the population means of the factor scores from which the responses
are generated. These means will be added to the random standard normal deviates representing
the ability of each case on the corresponding factors. If the MEAN keyword is omitted, zero means
are assumed and written to the *.sim file.
The simulated responses are written to a file with the extension *.sim, in this case exampl13.sim. The first line of each new record contains the case number, group number, form number, and factor means. The next line is the set of responses, where 0 indicates an incorrect answer and 1 a correct answer. In this example the GROUP keyword is set to 3 and the FORM keyword to 2; by default, all records are assumed to belong to a single group and a single test form.
The SCORESEED keyword specifies the random number generator seed for the simulation of mean abilities, the GUESSSEED keyword the seed for the simulation of chance parameters with population values specified in the ex7sim.par file, and the ERRORSEED keyword the seed associated with the simulation of the binary responses based on the difficulty and slope parameters.
>TITLE
EXAMPL13.TSF-SIMULATE RESPONSES TO 32 ITEMS WITH GUESSING AND NON-ZERO
FACTOR MEANS; THREE-FACTOR MODEL; FACTOR LOADINGS; N=1500
>PROBLEM NITEM=32;
>SIMULATE NFAC=3, NCASES=1500, FORM=2, GROUP=3, LOADINGS, ERRORSEED=1231,
SCORESEED=71893, GUESSSEED=3451, FILE='EX7SIM.PAR', CHANCE,
MEAN=(0.5,-0.5,1.0);
>STOP
This example illustrates how to simulate data under the assumption that there are 32 binary items
that measure three ability factors. The model considered allows for guessing and for non-zero
factor means. It is assumed that for each item, the population values for the guessing, standard-
ized difficulty and factor loadings are known. These values are stored in the file ex7sim.par.
The COMMENT command is used to show the format statement and the parameter values for the
first 5 of the 32 items.
(6X,5F8.3)
1 0.200 0.844 -0.552 -0.197 0.046
2 0.200 -0.405 0.497 0.215 -0.019
3 0.200 -0.824 0.222 -0.216 0.065
4 0.200 -0.520 0.508 0.415 -0.030
5 0.200 0.083 -0.145 0.026 0.244
The LOADINGS option on the SIMULATE command specifies that the population parameters are
standardized difficulties and factor loadings. Note that FORM=2, GROUP=3, ERRORSEED=1231,
GUESSSEED=3451, and SCORESEED=71893 are optional keywords.
The values of the chance, difficulty, and factor loading parameters for each item (the contents of ex7sim.par) are given below.
NUMBER OF ITEMS = 32
NUMBER OF CASES = 1500
NUMBER OF FACTORS = 3
CHANCE MODEL
The simulated data are written to the file exampl13.sim. The first line of the *.sim file gives the
case, form and group number, as well as the simulated abilities for the three factors (1.056, 0.605
and 0.873 for case 1). Note that, if the keywords FORM=f and GROUP=g are omitted from the
SIMULATE command, default values of one are written to the *.sim file. The values of the simu-
lated abilities will change if a different value for the keyword SCORESEED is used. The default
value is 345261. By changing both or either of the ERRORSEED and GUESSSEED values, one will
obtain a new set of simulated responses. The GUESSSEED default value is 543612 while the
ERRORSEED default value is 453612. Note that the GUESSSEED parameter only has an effect if a
chance model is simulated. It determines the sequence of the simulated values from a normal
population with a mean equal to the chance parameter (in the case of exampl13.tsf this value is
0.2 for each item).
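For instance, a second, independent sample with the same item parameters could be drawn by re-running the job with different seed values (the seeds below are arbitrary):

>SIMULATE NFAC=3,NCASES=1500,FORM=2,GROUP=3,LOADINGS,CHANCE,
          ERRORSEED=98765,SCORESEED=24680,GUESSSEED=13579,
          FILE='EX7SIM.PAR',MEAN=(0.5,-0.5,1.0);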
The values below are based on the simulated ability for each factor. For example, the mean of factor 1 is computed as the average of the 1500 simulated abilities on that factor,

\[
\bar{\theta}_1 \;=\; \frac{1}{1500}\sum_{i=1}^{1500}\theta_{i1}.
\]
Note that the means are close to the assumed population values of 0.5, -0.5 and 1.0 respectively.
The correlations between the simulated ability variables are close to zero, showing that the simulated factor abilities are, for practical purposes, uncorrelated.

          1        2        3
 1    1.000
 2   -0.005    1.000
 3    0.000    0.039    1.000
13.14 Three-factor analysis with PROMAX rotation: 32 items from the science assessment test
In this example, a PROMAX rotation is performed, using a three-factor model and 32 items from
a science assessment test in the subjects of biology, chemistry, and physics administered to
twelfth-grade students near the end of the school year. For a description of the data file and vari-
able format statement, see Section 13.7.
As input, a factor pattern from a 3-factor analysis of the data discussed in Section 13.8 is used
(FILE keyword on the INPUT command). The FACTORS option on the same command indicates
that the input is in the form of factor loadings. This option is used for rotation only. The variable format statement for the factor loading file is:
(15X,5F10.6,2(/15X,5F10.6))
The SKIP keyword on the PROBLEM command is set to 2, and TESTFACT will thus proceed di-
rectly to rotation after input of the factor pattern. The rotation is specified by the ROTATE key-
word on the FACTOR command, while the NFAC keyword confirms this to be a three-factor model.
>TITLE
EXAMPL14.TSF- PROMAX ROTATION FOR 32 ITEMS
3-FACTOR MODEL
>PROBLEM NITEM=32,SKIP=2;
>NAMES CHEM01,PHYS02,CHEM03,PHYS04,PHYS05,CHEM06,BIOL07,CHEM08,
BIOL09,BIOL10,BIOL11,PHYS12,BIOL13,PHYS14,BIOL15,CHEM16,
BIOL17,BIOL18,PHYS19,PHYS20,BIOL21,BIOL22,PHYS23,BIOL24,
PHYS25,PHYS26,BIOL27,PHYS29,CHEM29,PHYS30,BIOL31,CHEM32;
>FACTOR NFAC=3,ROTATE=PROMAX;
>INPUT FACTOR,FILE='EXAMPL08.UNR';
>STOP
Each row of factor loadings can be viewed as a point in multidimensional space, so that each fac-
tor corresponds to a coordinate axis. A factor rotation is equivalent to rotating those axes, result-
ing in a new set of factor loadings. There are various rotation methods: some (e.g., VARIMAX) leave the axes orthogonal, while others, the so-called oblique methods, change the angles be-
tween axes. The oblique method used in TESTFACT is called PROMAX. This method often
creates a simpler structure in the sense that loadings on each factor are either large or small. Note
that with oblique rotation, factors are no longer uncorrelated.
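In matrix terms (our notation, not the manual's), if \(\Lambda\) is the matrix of unrotated loadings with one row per item, a rotation replaces it by

\[
\Lambda^{*} \;=\; \Lambda\,T,
\]

where \(T\) is a nonsingular transformation matrix. Orthogonal rotations such as VARIMAX require \(T'T = I\), so the factors remain uncorrelated; oblique methods such as PROMAX drop this restriction, which is why the rotated factors acquire nonzero intercorrelations.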
1 2 3
1 CHEM01 -0.021 0.271 0.170
2 PHYS02 0.235 0.381 -0.035
3 CHEM03 0.105 0.119 0.300
4 PHYS04 -0.006 0.491 -0.162
5 PHYS05 0.229 0.352 -0.030
6 CHEM06 0.014 0.332 0.286
7 BIOL07 0.469 -0.106 -0.023
8 CHEM08 -0.019 0.138 0.056
9 BIOL09 0.419 -0.027 -0.056
10 BIOL10 0.275 0.142 0.001
11 BIOL11 0.301 0.041 0.016
12 PHYS12 -0.117 0.275 0.002
13 BIOL13 0.361 0.090 -0.013
14 PHYS14 0.127 0.577 -0.050
15 BIOL15 0.332 -0.146 0.034
16 CHEM16 -0.051 0.298 0.042
17 BIOL17 0.232 0.018 0.034
18 BIOL18 0.113 0.086 0.068
19 PHYS19 -0.022 0.123 0.073
20 PHYS20 0.208 0.309 0.000
21 BIOL21 0.485 0.016 -0.086
22 BIOL22 -0.022 -0.180 0.121
23 PHYS23 -0.177 0.378 0.049
24 BIOL24 0.239 -0.023 0.043
25 PHYS25 -0.115 0.397 0.045
26 PHYS26 0.110 0.409 0.009
27 BIOL27 0.297 0.092 0.017
From the PROMAX factor loadings we conclude that there are effectively only two factors, Biology (factor 1) and Chemistry-Physics (factor 2). Except for item 3 (CHEM03), every item has a larger loading on one of the first two factors than on the third factor.
The correlation between factors 2 and 3 equals 0.694. This relatively high correlation may ex-
plain why two factors appear to be sufficient.
32 items from the simulated data set in the file exampl15.dat are used, as indicated by the NITEMS keyword on the PROBLEM command and the FILE keyword on the INPUT command.
The input is in the form of input subject records containing scores. The RESPONSE keyword de-
notes the number of possible responses. The three responses are listed in the RESPONSE com-
mand. Naming of the items is done using the NAMES command, while the KEY command lists the
correct response to each item.
The TETRACHORIC command requests the recoding of omits to wrong responses (RECODE option),
prior to the computation of the tetrachoric correlation coefficients. The FACTOR and FULL com-
mands are used to specify parameters for the factor analysis. A two-factor model will be fitted to
the data (NFAC=2) and the first 6 characteristic roots of the smoothed correlation matrix
(NROOT=6) will be written to the output file. A maximum of 10 EM cycles will be performed
(CYCLES keyword on the FULL command). The OMIT keyword on this command indicates re-
coding of omits to wrong responses. The QUAD keyword sets the number of quadrature points for
the EM estimation of the parameters to 9, instead of the default of 15 for the 2-factor case when
the NOADAPT option is selected. Non-adaptive quadrature will be performed (NOADAPT option on
the TECHNICAL command).
Trial intercept and slope estimates after 10 cycles will be saved in exampl15.tri as indicated by
the TRIAL option on the SAVE command.
>TITLE
EXAMPL15.TSF- 2-FACTOR MODEL. SIMULATED DATA: PRINCIPAL FACTOR SOLUTION, NO
GUESSING. NON-ADAPTIVE QUADRATURE. SAVE TRIAL VALUES FOR CONTINUED EM CYCLES.
>PROBLEM NITEM=32,RESPONSE=3;
>NAMES I1,I2,I3,I4,I5,I6,I7,I8,I9,I10,
I11,I12,I13,I14,I15,I16,I17,I18,I19,I20,
I21,I22,I23,I24,I25,I26,I27,I28,I29,I30,
I31,I32;
>RESPONSE '8', '0', '1';
>KEY 11111111111111111111111111111111;
>TETRACHORIC RECODE;
>FACTOR NFAC=2,NROOT=6;
>FULL CYCLES=10,OMIT=RECODE,QUAD=9;
>TECHNICAL NOADAPT;
>SAVE TRIAL;
>INPUT NIDCHAR=3,SCORES,FILE='EXAMPL15.DAT';
(3A1,T31,32A1)
>STOP
32 items from the simulated data set in the file exampl15.dat are again used as input (see Section 13.15), as indicated by the NITEMS keyword on the PROBLEM command and the FILE keyword on the INPUT command. The input is in the form of subject records containing scores (SCORES option). Trial intercept and slope estimates will be read from the previously saved file exampl15.tri, in which item numbers are required. The first few lines of the trial values file are:
(15X,6F9.5,2(/24X,5F9.5))
1 I1 0.02547 0.68219 -0.55756
2 I2 0.01425 0.64937 -0.88348
3 I3 -0.00925 0.80883 -0.84193
The RESPONSE keyword denotes the number of possible responses. The three responses are listed
in the RESPONSE command. Naming of the items is done using the NAMES command, while the
KEY command lists the correct response to each item. The inclusion of the SKIP=1 keyword on
the PROBLEM command indicates that the classical and item analysis phase should be skipped.
The program will proceed to the calculation of tetrachoric correlations immediately after data
entry.
The FACTOR and FULL commands are used to specify parameters for the factor analysis. Two fac-
tors and six latent roots are to be printed, as indicated by the NFAC and NROOT keywords respec-
tively. The OMIT keyword on the FULL command indicates recoding of omits to wrong responses.
The QUAD keyword sets the number of quadrature points for the EM estimation of the parameters
to 9, instead of the default of 15 for the 2-factor case when the NOADAPT option is selected. Non-
adaptive quadrature will be performed (NOADAPT option on the TECHNICAL command). The parameters assigned to the ITER keyword request a maximum of 15 EM cycles, with a maximum of 5 iterations and a convergence criterion of 0.001 for the M-step. Trial values will be saved
again in case further EM cycles are necessary. The trial values and the intercepts, factor slopes,
and guessing parameters (in a form suitable for computing factor scores at a later time) are saved
in exampl16.tri and exampl16.par as indicated by the TRIAL and PARM options on the SAVE
command.
>TITLE
EXAMPL16.TSF- 2-FACTOR MODEL SIMULATION: PRINCIPAL FACTOR SOLUTION, NO
GUESSING. NON-ADAPTIVE QUADRATURE. CONTINUE WITH AN ADDITIONAL 15 CYCLES.
>PROBLEM NITEM=32,RESPONSE=3,SKIP=1;
>NAMES I1,I2,I3,I4,I5,I6,I7,I8,I9,I10,
I11,I12,I13,I14,I15,I16,I17,I18,I19,I20,
I21,I22,I23,I24,I25,I26,I27,I28,I29,I30,
I31,I32;
>RESPONSE '8', '0', '1';
>KEY 11111111111111111111111111111111;
>FACTOR NFAC=2,NROOT=6;
>FULL OMIT=RECODE,QUAD=9;
>TECHNICAL NOADAPT,ITER=(15,5,0.001);
>SAVE PARM, TRIAL;
>INPUT TRIAL='EXAMPL15.TRI',NIDCHAR=3,SCORES,FILE='EXAMPL15.DAT';
(3A1,T31,32A1)
>STOP
13.17 Adaptive item factor analysis of 25 spelling items from the 100-Item
Spelling Test
Data from a 100-word spelling test are used in this example. A complete description of these data is given in Section 2.4.1.
The data file exampl17.dat contains individual responses to all 100 items, of which 25 are used here. Data are read using the FILE keyword on the INPUT command. The SCORES option indicates that the data file contains item scores; the case identification is 11 characters in width (NIDCHAR=11).
The first 11 columns of every line of data contain the case identification, which is represented by "11A1" in the variable format statement given below. Responses to the first 25 items start in column 13, and the "X" operator is used to skip over column 12, the blank column after the case identification. The next set of 25 responses is contained in columns 39 to 63 inclusive and is read in the same format as the previous set (25A1). The third set of responses follows after 1 blank column, which is skipped using the "X" operator. The final set of 25 items is again separated from the previous set by a single blank column (1X,25A1).
(11A1,1X,25A1,1X,25A1,1X,25A1,1X,25A1)
The number of items read by the variable format statement corresponds to the number of items
indicated on the PROBLEM command (NITEMS keyword). The 12 possible responses to each item are listed on the RESPONSE command, and the RESPONSE keyword on the PROBLEM command indicates the total number. The answer key, given in the KEY command, lists the correct response to each of the 100 items.
SELECT=25 on the PROBLEM command indicates that only 25 items will be used in the analysis.
These items are listed in the order to be used on the SELECT command.
The TETRACHORIC command specifies how the count matrix to be used in the calculation of the
tetrachoric correlations is to be formed. By using the (default) RECODE option, all omitted re-
sponses will be recoded as wrong responses. The matrix of tetrachoric correlations, with ele-
ments printed up to 3 decimal places (NDEC keyword), will be printed in the output (LIST) and
saved to the file exampl17.cor through the use of the CORRELAT option on the SAVE command.
Factor scores and their posterior standard deviations are saved to exampl17.fsc with the FSCORES
option on the SAVE command.
The FACTOR command requests and controls the parameters for the item factor analysis. Two fac-
tors (NFAC=2) are to be extracted, along with 6 latent roots (NROOT=6). The ROTATE keyword is
used to request a PROMAX rotation. Note that this keyword may not be abbreviated in the FACTOR
command. By default, NFAC leading factors will be rotated and the constant for the PROMAX
rotation is equal to 3. The FULL command is used to request full information item factor analysis,
starting from the principal factor solution. The OMIT keyword is set to RECODE, and omitted re-
sponses are thus recoded as wrong responses (similar to the request on the TETRACHORIC com-
mand). Note that RECODE may not be abbreviated in the FULL command.
The SCORE command specifies that the factor scores for 100 cases are to be listed in the output.
>TITLE
EXAMPL17.TSF- ITEM FACTOR ANALYSIS OF 25 SPELLING ITEMS SELECTED FROM
THE 100 WORD SPELLING TEST. USING TETRACHORIC OPTION
>PROBLEM NITEM=100,RESPONSE=12,SELECT=25;
>NAMES S01,S02,S03,S04,S05,S06,S07,S08,S09,S10,S11,S12,S13,S14,S15,S16,
S17,S18,S19,S20,S21,S22,S23,S24,S25,S26,S27,S28,S29,S30,S31,S32,
S33,S34,S35,S36,S37,S38,S39,S40,S41,S42,S43,S44,S45,S46,S47,S48,
S49,S50,S51,S52,S53,S54,S55,S56,S57,S58,S59,S60,S61,S62,S63,S64,
S65,S66,S67,S68,S69,S70,S71,S72,S73,S74,S75,S76,S77,S78,S79,S80,
S81,S82,S83,S84,S85,S86,S87,S88,S89,S90,S91,S92,S93,S94,S95,S96,
S97,S98,S99,S100;
>RESPONSE ' ', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A';
>KEY 00000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000;
>SELECT 1,4,6,8,9,10,15,20,23(1)29,33,34,36,39,48,49,54,59,64,72;
>TETRACHORIC RECODE,NDEC=3,LIST;
>FACTOR NFAC=2,NROOT=6,NIT=(5,0.02),ROTATE=PROMAX;
>FULL ITER=(8,3,0.01),OMIT=RECODE;
>SCORE LIST=100;
>SAVE CORRELAT,FSCORES;
>INPUT NIDCHAR=11,SCORES,FILE='EXAMPL17.DAT';
(11A1,1X,25A1,1X,25A1,1X,25A1,1X,25A1)
>STOP
13.18 Classical item factor analysis of spelling data from a tetrachoric correlation matrix
The analysis in this example is based on the spelling data used in Section 13.17. For a discussion
of the data, variable format statement, and INPUT command, see the previous section.
A classical analysis is carried out on all 100 items in the data. Thus the SELECT keyword previ-
ously used is omitted from the PROBLEM command, which only indicates the total number of
items (NITEM) and the total number of possible responses (RESPONSE). All 12 responses are listed
on the RESPONSE command, and the KEY command contains the answer key for all the items. The
TETRACHORIC command specifies how the count matrix, to be used in the calculation of the tetra-
choric correlations, is to be formed. By using the (default) RECODE option, all omitted responses
will be recoded as wrong responses.
The FACTOR command requests and controls the parameters for the item factor analysis. Two fac-
tors (NFAC=2) are to be printed, along with 6 latent roots (NROOT=6). The ROTATE keyword is used
to request a PROMAX rotation. Note that this keyword may not be abbreviated in the FACTOR com-
mand. By default, NFAC leading factors will be rotated and the constant for the PROMAX rota-
tion is equal to 3. The NIT keyword specifies the number of iterations for the MINRES factor
solution and the convergence criterion. A value of 0.01, for example, implies that if the largest
change in factor loadings is less than 0.01, the iteration procedure will terminate. The default
values are 3 and 0.0001 respectively.
Matrix plots of the biserial coefficient (BISERIAL option) and item facility (percent correct;
FACILITY option) against discriminating power are requested using the PLOT command. By de-
fault, the internal test score is used as discriminating power. To use an external criterion score,
the CRITERION option should be included on the PLOT command.
>TITLE
EXAMPL18.TSF- CLASSICAL ANALYSIS OF SPELLING DATA: 100 ITEMS
USING TETRACHORIC OPTION AND PLOT
>PROBLEM NITEM=100,RESPONSE=12;
>RESPONSE ' ','0','1','2','3','4','5','6','7','8','9','A';
>KEY 00000000000000000000000000000000000000000000000000
00000000000000000000000000000000000000000000000000;
>PLOT BISERIAL,FACILITY;
>TETRACHORIC RECODE;
>FACTOR NFAC=2,NROOT=6,NIT=(5,0.01),ROTATE=PROMAX;
>INPUT NIDCHAR=11,SCORES,FILE='EXAMPL17.DAT';
(11A1,4(1X,25A1))
>CONTINUE
>STOP
References
Aitchison, J., & Silvey, S. D. (1960). Maximum-likelihood estimation procedures and associated
tests of significance. Journal of the Royal Statistical Society, Series B, 22, 154-171.
Andersen, E. B., & Madsen, M. (1977). Estimating the parameters of a latent population distribu-
tion. Psychometrika, 42, 357-374.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43,
561-573.
Baker, F. B. (1992). Item response theory: Parameter estimation techniques. New York: Marcel Dekker.
Bartholomew, D. J. (1980). Factor analysis for categorical data. Journal of the Royal Statistical
Society, Series B, 42, 293-321.
Bergan, J. R., & Stone, C. A. (1985). Latent class models for knowledge domains. Psychological
Bulletin, 98, 166-184.
Binet, A., & Simon, T. (1905). Méthodes nouvelles pour le diagnostic du niveau intellectuel des anormaux. Année Psychologique, 11, 191-244.
Birnbaum, A. (1957). Efficient design and use of tests of a mental ability for various decision
making problems. Series Report No. 58-16. Project No. 7755-23, USAF School of Aviation
Medicine, Randolph Air Force Base, Texas.
Birnbaum, A. (1958a). On the estimation of mental ability. Series Report No. 15. Project No.
7755-23, USAF School of Aviation Medicine, Randolph Air Force Base, Texas.
Birnbaum, A. (1967). Statistical theory for logistic mental test models with a prior distribution of
ability. Research Bulletin, No. 67-12. Princeton, NJ: Educational Testing Service.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In
F. M. Lord & R. M. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Ad-
dison-Wesley.
Bliss, C. I. (1935). The calculation of the dosage mortality curve (Appendix by R. A. Fisher).
Annals of Applied Biology, 22, 134-167.
Bock, R. D. (1970). Estimating multinomial response relations. In R. C. Bose, et al. (Eds.), Con-
tributions to statistics and probability. Chapel Hill, NC: University of North Carolina Press, 111-
132.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in
two or more nominal categories. Psychometrika, 37, 29-51.
Bock, R. D. (1983b). The mental growth curve re-examined. In D. Weiss (Ed.), New horizons in
testing. New York: Academic Press, 205-219.
Bock, R. D. (1983c). The discrete Bayesian. In H. Wainer & S. Messick (Eds.), Principles of
psychometrics. Hillsdale, NJ: Erlbaum, 103-115.
Bock, R. D. (1993). Different DIFs. In: P. W. Holland & H. Wainer (Eds.), Differential item
functioning. Hillsdale, NJ: Erlbaum, 115-122.
Bock, R. D. (1997). The nominal categories model. In W. J. van der Linden & R. K. Hambleton
(Eds.), Handbook of Modern Item Response Theory. New York: Springer Verlag, 33-65.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.
Bock, R. D., Gibbons, R. D., & Muraki, E. (1988). Full information item factor analysis. Applied
Psychological Measurement, 12, 261-280.
Bock, R. D., & Jones, L. V. (1968). The measurement and prediction of judgment and choice.
San Francisco: Holden-Day.
Bock, R. D., & Lieberman, M. (1970). Fitting a response model for n dichotomously scored
items. Psychometrika, 35, 179-197.
Bock, R. D., & Mislevy, R. J. (1981). An item response model for matrix-sampling data: the
California Grade Three Assessment. In D. Carlson (Ed.), Testing in the states: beyond account-
ability. San Francisco: Jossey-Bass, 65-90.
Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer
environment. Applied Psychological Measurement, 6, 431-444.
Bock, R. D., Muraki, E., & Pfiffenberger, W. (1988). Item pool maintenance in the presence of
item parameter drift. Journal of Educational Measurement, 25, 275-285.
Bock, R. D., Thissen, D., & Zimowski, M. F. (1997). IRT estimation of domain scores. Journal of Educational Measurement, 34, 197-211.
Bock, R. D., Wolfe, R., & Fisher, T. H. (1996). A review and analysis of the Tennessee Value-Added Assessment System. Nashville, TN: Office of Education Accountability, State of Tennessee, Comptroller of the Treasury.
Bock, R. D., & Zimowski, M. F. (1989). Duplex design: Giving students a stake in educational assessment. Chicago: Methodology Research Center, NORC.
Bock, R. D., & Zimowski, M. F. (1995). Multiple group IRT. In W. van der Linden & R. Ham-
bleton (Eds.), Handbook of item response theory. New York: Springer-Verlag.
Bock, R. D., & Zimowski, M. F. (1997). Multiple group IRT. In W. J. van der Linden & R. K.
Hambleton (Eds.), Handbook of Modern Item Response Theory. New York: Springer Verlag,
433-448.
Bock, R. D., & Zimowski, M. F. (1998). Feasibility Studies of Two-Stage Testing in Large-Scale
Educational Assessment: Implications for NAEP, 34-41. Commissioned by the NAEP Validity
Studies (NVS) Panel. May 1998.
Bock, R. D., & Zimowski, M. F., (1999). Application of disattenuation analysis to correlations
between matrix-sample assessment results and achievement test scores. Addendum to D. H.
McLaughlin, R . D. Bock, E. A. Arenson & M. F. Zimowski. Palo Alto, CA: American Institutes
for Research.
Bowers, J. (1972). A note on comparing r-biserial and r-point biserial. Educational and Psycho-
logical Measurement, 32, 771-775.
Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs. I. Method of
paired comparisons. Biometrika, 39, 324-345.
Browne, M. W., & du Toit, S. H. C. (1992). Automated fitting of nonstandard models. Multi-
variate Behavioral Research, 27, 269-300.
Burt, C. (1921). Mental and scholastic tests. London: P. S. King & Son.
Carroll, J. B. (1945). The effect of difficulty and chance success on correlations between items or
between tests. Psychometrika, 10, 1-19.
Clogg, C. C. (1979). Some latent structure models for the analysis of Likert-type data. Social
Science Research, 8, 287-301.
Clogg, C. C., & Goodman, L. A. (1984). Latent structure analysis of a set of multidimensional contingency tables. Journal of the American Statistical Association, 79, 762-771.
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York: Wiley.
Davis, J. A. (1975). Codebook for the Spring 1976 General Social Survey. Chicago: NORC.
De Leeuw, J., & Verhelst, N. (1986). Maximum likelihood estimation in generalized Rasch
models. Journal of Educational Statistics, 11, 193-196.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.
Finney, D. J. (1952). Probit analysis: A statistical treatment of the sigmoid response curve, 2nd
ed. London: Cambridge University Press.
Fisher, R. A., & Yates, F. (1938). Statistical tables for biological, agricultural and medical re-
search. New York: Hafner.
Follmann, D. (1988). Consistent estimation in the Rasch model based on nonparametric margins. Psychometrika, 53, 553-562.
French, J. L., & Hale, R. L. (1990). A history of the development of psychological and educa-
tional testing. In C. R. Reynolds & R. W. Kamphaus (Eds.), Handbook of psychological and
educational assessment of children. New York: Guilford Press, 3-28.
Gibbons, R. D., & Hedeker, D. R. (1992). Full information item bi-factor analysis. Psycho-
metrika, 57, 423-436.
Glass, G. V., & Stanley, J. C. (1970). Statistical Methods in Education and Psychology. Engle-
wood Cliffs, NJ: Prentice-Hall.
Goldstein, H. (1983). Measuring changes in educational attainment over time. Journal of Educa-
tional Measurement, 20, 369-377.
Green, B. F. (1951). A general solution for the latent class model of latent structure analysis.
Psychometrika, 16, 151-166.
Green, B. F. (1952). Latent structure analysis and its relation to factor analysis. Journal of the
American Statistical Association, 47, 71-76.
Green, S. B., Lissitz, R. W., & Mulaik, S. A. (1977). Limitations of coefficient alpha as an index
of test unidimensionality. Educational and Psychological Measurement, 37, 827-838.
Haberman, S. J. (1977). Log-linear models and frequency tables with small expected cell counts.
Annals of Statistics, 5, 1148-1169.
Haberman, S. J. (1979). Analysis of qualitative data, Vol. 2. New developments. New York:
Academic Press.
Hambleton, R. K., & Jurgensen, C. (1990). Criterion referenced assessment of school achieve-
ment. In C. R. Reynolds & R. W. Kamphaus (Eds.), Handbook of psychological and educational
assessment of children. New York: Guilford Press, 456-477.
Hambleton, R. K., & Swaminathan, H. (1985). Item Response Theory. Principles and applica-
tions. Boston: Kluwer.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response
theory. Newbury Park, CA: Sage.
Harman, H. H. (1976). Modern Factor Analysis. Chicago: The University of Chicago Press.
Harvey, W. R. (1970). Estimation of variance and covariance components in the mixed model.
Biometrics, 26, 485-504.
Harwell, M. R., Baker, F. B., & Zwarts, M. (1988). Item parameter estimation via marginal
maximum likelihood and an EM algorithm: A didactic. Journal of Educational Statistics, 13,
243-271.
Hendrickson, A. E., & White, P. O. (1964). Promax: A quick method for rotation to oblique simple structure. British Journal of Statistical Psychology, 17, 65-70.
Henryssen, S. (1971). Gathering, analyzing, and using data on test items. In R. L. Thorndike
(Ed.), Educational Measurement, 2nd ed., Washington, DC: American Council on Education.
Holland, P. W., & Rubin, D. B. (Eds.) (1982). Test equating. Hillsdale, NJ: Erlbaum.
Holland, P. W., & Wainer, H. (1993). Differential Item Functioning. Hillsdale, NJ: Erlbaum.
Holzinger, K. J., & Swineford, F. (1937). The bi-factor method. Psychometrika, 2, 41-54.
Horst, P. (1933). The difficulty of a multiple-choice test item. Journal of Educational Psychol-
ogy, 24, 229-232.
Irving, L. M. (1987). Mirror images: Effects of the standard of beauty on women’s self and body
esteem. Unpublished Masters Thesis, University of Kansas.
Jenkins, C.D., Rosenman, R.H. & Zyzanski, S.J. (1972). The Jenkins Activity Survey of Health
Prediction. New York: The Psychological Corporation.
Jöreskog, K. G., & Sörbom, D. (1996). LISREL 8: User’s Reference Guide. Chicago: Scientific
Software International, Inc.
Kaiser, H. F. (1958). The varimax criterion for analytic rotation in factor analysis. Psycho-
metrika, 23, 187-200.
Kendall, M., & Stuart, A. (1961), Inference and Relationship, Vol. 2 of The Advanced Theory of
Statistics, first ed. London: Charles Griffin & Company.
Kiefer, J., & Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. Annals of Mathematical Statistics, 27, 887-906.
Klassen, D., & O'Connor, W. A. (1989). Assessing the risk of violence in released mental pa-
tients: A cross-validation study. Psychological Assessment: A Journal of Consulting and Clinical
Psychology, 1, 75-81.
Lawley, D. N. (1943). On problems connected with item selection and test construction. Proceedings of the Royal Society of Edinburgh, 61A, 273-287.
Lazarsfeld, P. F. (1950). The logical and mathematical foundation of latent structure analysis. In
S. A. Stouffer, L. Guttman, E. A. Suchman, P. F. Lazarsfeld, S. A. Star & J. A. Clausen, Meas-
urement and prediction. Princeton, NJ: Princeton University Press, 362-412.
Likert, R. (1932). A technique for the measurement of attitude. Archives of Psychology, 140.
Linacre, J. M., & Wright, B. D. (1993). FACETS: Many-facet Rasch analysis with FACFORM Data Formatter. Chicago: MESA Press.
Linn, R. L., & Hambleton, R. K. (1991). Customized sets and customized norms. Applied Meas-
urement in Education, 4, 185-207.
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.
Longford, N. T. (1989). Fisher scoring algorithm for variance component analysis of data with
multilevel structure. In R. D. Bock (Ed.), Multilevel analysis of educational data. San Diego:
Academic Press, 297-310.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale,
NJ: Erlbaum.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores (with contri-
butions by A. Birnbaum). Reading, MA: Addison-Wesley.
Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm.
Journal of the Royal Statistical Society, Series B, 44, 226-233.
Mantel, N. (1966). Models for complex contingency tables and polytomous dosage response
curves. Biometrics, 22, 83-95.
Marshall, J. C., & Hales, L. W. (1972). Essentials of Testing. Reading, MA: Addison-Wesley.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
Masters, G. N. (1985). A comparison of latent trait and latent class analyses of Likert-type data.
Psychometrika, 50, 69-82.
Meng, X. L., & Schilling, S. (1996). Fitting full information factor models and an empirical in-
vestigation of bridge sampling. Journal of the American Statistical Association, 91, 1254-1267.
Mislevy, R. J. (1983). Item response models for grouped data. Journal of Educational Statistics,
8, 271-288.
Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177-195.
Mislevy, R. J. (1987). Exploiting auxiliary information about examinees in the estimation of item
parameters. Applied Psychological Measurement, 11, 81-91.
Mislevy, R. J., & Bock, R. D. (1982). Biweight estimates of latent ability. Educational and Psychological Measurement, 42, 725-737.
Mislevy, R. J., & Bock, R. D. (1983). BILOG: Analysis and scoring of binary items and one-,
two-, and three-parameter logistic models. Chicago: Scientific Software International, Inc.
Mislevy, R. J., & Bock, R. D. (1990). BILOG 3: Item Analysis and Test Scoring with Binary Lo-
gistic Models. Chicago: Scientific Software International, Inc.
Mislevy, R. J., Johnson, E. G., & Muraki, E. (1992). Scaling procedures in NAEP. Journal of
Educational Statistics, 17, 131-154.
Muraki, E. (1990). Fitting a polytomous item response model to Likert-type data. Applied Psy-
chological Measurement, 14, 59-71.
Muraki, E. (1993). Variations of polytomous item response models: Raters’ effect model, DIF
model, and trend model. Paper presented at the Annual Meeting of the American Educational
Research Association, Atlanta, GA.
Muraki, E. (1997). The generalized partial credit model. In W. J. van der Linden & R. K. Ham-
bleton (Eds.), Handbook of Modern Item Response Theory. New York: Springer Verlag, 153-
164.
Muraki, E., & Bock, R. D. (1997). PARSCALE 3: IRT based test scoring and item analysis for
graded items and rating scales. Chicago: Scientific Software International, Inc.
Muraki, E., & Engelhard, G. (1985). Full information item factor analysis: applications of EAP
scores. Applied Psychological Measurement, 9, 417-430.
Naylor, J. C., & Smith, A. F. M. (1982). Applications of a method for the efficient computation
of posterior distributions. Applied Statistics, 31, 214-225.
Neyman, J., & Scott, E. L. (1948). Consistent estimates based on partially consistent observations. Econometrica, 16, 1-32.
Olsson, U., Drasgow, F., & Dorans, N. J. (1982). The polyserial correlation coefficient. Psycho-
metrika, 47(3), 337-347.
Owen, R. J. (1969). A Bayesian approach to tailored testing. Research Bulletin No. 69-92.
Princeton, NJ: Educational Testing Service.
Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming, and equating. In R. L.
Linn (Ed.), Educational Measurement (3rd edition). New York: American Council on Educa-
tion-Macmillan, 221-262.
Rasch, G. (1960; reprinted 1980). Probabilistic models for some intelligence and attainment
tests. Chicago: University of Chicago Press.
Rasch, G. (1961). On general laws and the meaning of measurement in psychology. Proceedings
of the fourth Berkeley symposium on mathematical statistics and probability, 4, 321-324.
Richardson, M. W. (1936). The relationship between difficulty and the differential validity of a
test. Psychometrika, 1, 33-49.
Roche, A. F., Wainer, H., & Thissen, D. (1975). Skeletal Maturity: The knee joint as a biological
indicator. New York: Plenum.
Samejima, F. (1969). Estimation of latent trait ability using a response pattern of graded scores.
Psychometrika Monograph Supplement, No. 17.
Samejima, F. (1972). A general model for free-response data. Psychometrika Monograph Sup-
plement, No. 18.
Samejima, F. (1974). Normal ogive model on the continuous response level in the multidimen-
sional latent space. Psychometrika, 39, 111-121.
Samejima, F. (1979). A new family of models for the multiple-choice item. Research Report,
No. 79-4, Department of Psychology, University of Tennessee.
Schilling, S. (1993). Advances in Full Information Item Factor Analysis using the Gibbs Sam-
pler. Unpublished doctoral dissertation, University of Chicago.
Schilling, S. G., & Bock, R. D. (1999). High-dimensional maximum marginal likelihood item
factor analysis. (In press.)
Schultz, M. E., & Nicewander, W. A. (1997). Grade equivalent and IRT representations of growth. Journal of Educational Measurement, 34(4), 315-332.
Smith, M. C., & Thelen, M. H. (1984). Development and validation of a test for bulimia. Journal
of Consulting and Clinical Psychology, 52, 863-872.
Stouffer, S. A., & Toby, J. (1951). Role conflict and personality. American Journal of Sociology,
56, 395-406.
Stroud, A. H., & Secrest, D. (1966). Gaussian Quadrature Formulas. Englewood Cliffs, NJ:
Prentice-Hall.
Symonds, P. M. (1929). Choice of items for a test on the basis of difficulty. Journal of Educa-
tional Psychology, 20, 481-493.
Thissen, D. (1982). Marginal maximum likelihood estimation for the one-parameter logistic
model. Psychometrika, 47, 175-186.
Thissen, D. (1991). MULTILOG: multiple category item analysis and test scoring using item re-
sponse theory. Chicago: Scientific Software International, Inc.
Thissen, D., & Steinberg, L. (1984). A response model for multiple-choice items. Psychometrika,
49, 501-519.
Thissen, D., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51, 567-577.
Thissen, D., & Steinberg, L. (1988). Data analysis using item response theory. Psychological
Bulletin, 104, 385-395.
Thissen, D., Steinberg, L., & Fitzpatrick, A. R. (1989). Multiple-choice models: The distractors
are also part of the item. Journal of Educational Measurement, 26, 161-176.
Thissen, D., Steinberg, L., & Mooney, J. A. (1989). Trace lines for testlets: A use of multiple-categorical-response models. Journal of Educational Measurement, 26, 247-260.
Thissen, D., Steinberg, L., & Wainer, H. (1993). Detection of DIF using the parameters of item
response models. In P. W. Holland & H. Wainer (Eds.), Differential item functioning. Hillsdale,
NJ: Erlbaum, 67-113.
Thissen, D., & Wainer, H. (1982). Some standard errors in item response theory. Psychometrika,
47, 397-412.
Thorndike, E. L. (1927). The measurement of intelligence. New York: Teachers College, Columbia University.
Tsutakawa, R. K. (1992). Prior distribution for item response curves. British Journal of Mathe-
matical and Statistical Psychology, 45, 51-71.
Tsutakawa, R. K., & Lin, H. Y. (1986). Bayesian estimation of item response curves. Psycho-
metrika, 51, 251-267.
Tucker, L. R. (1946). Maximum validity of a test with equivalent items. Psychometrika, 11, 1-
13.
Van der Linden, W. J., & Hambleton, R. K. (1997). Handbook of modern item response theory.
New York: Springer-Verlag.
Verhulst, P.-F. (1844). Recherches mathématiques sur la loi d'accroissement de la population. Mémoires de l'Académie Royale de Belgique, 18.
Wainer, H. (Ed.) (1990). Computerized adaptive testing: a primer. Hillsdale, NJ: Erlbaum.
Wainer, H. (1995). Precision and differential item functioning on a testlet based test: The 1991
Law School Admissions Test as an example. Applied Measurement in Education, 8, 157-187.
Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for
testlets. Journal of Educational Measurement, 24, 185-201.
Warm, T. (1989). Weighted likelihood estimation of ability in item response theory. Psycho-
metrika, 54, 427-450.
Wilmut, J. (1975). Objective test analysis: some criteria for item selection. Research in Educa-
tion, 13, 27-56.
Wilson, D. T., Wood, R., & Gibbons, R. (1991). TESTFACT: Test scoring, item statistics, and
item factor analysis. Chicago: Scientific Software International, Inc.
Wood, R. (1977). Inhibiting blind guessing: the effect of instructions. Journal of Educational
Measurement, 13, 297-307.
Zimowski, M. F. (1985). Attributes of spatial test items that influence cognitive processing. Un-
published doctoral dissertation, University of Chicago.
Zimowski, M. F., & Bock, R. D. (1987). Full information item factor analysis of test forms from
the ASVAB CAT pool. Chicago: NORC.
Zimowski, M. F., Muraki, E., Mislevy, R. J., & Bock, R. D. (1996). BILOG-MG: Multiple-group
IRT analysis and test maintenance for binary items. Chicago: Scientific Software International,
Inc.
Zwick, R. (1987). Assessing the dimensionality of NAEP reading data. Journal of Educational
Measurement, 24, 293-308.
INDEX
1PL model, 51, 101, 123, 137-139, 355, 367, 379-380, 398-400, 510, 524, 539-540, 543, 567-568, 605, 616-617, 645, 730, 732, 843
2PL model, 36, 51, 92-93, 101, 123, 137-139, 169, 355, 367, 379, 398-400, 524, 539-540, 565, 567-568, 606, 608, 732-734, 736, 768, 842
3PL model, 268, 355, 367, 379, 398-400, 523, 538, 540, 569

A

Ability score file
   saving, 242-243
ACCEL keyword
   on CALIB command, 78, 89, 115-117, 123, 650
   on TECHNICAL command, 491
ACCEL option
   on CALIB command, 264, 274-275
Acceleration
   for full information factor analysis, 491
   using or suppressing of routine, 275
   value of constant, 77, 117, 382
ACCMAX keyword
   on ESTIMATE command, 382
Adaptive quadrature, 410, 491, 493, 495, 530, 589, 781, 815, 817, 825-826, 842
   for score estimation, 410, 530
   invoking 3-point, 491
   setting values of extreme points, 496
   setting values of extreme weights, 497
Adaptive testing, 632
Advanced tab
   on Item Analysis dialog box, 51, 88-89
Aggregate-level
   fit statistic, 210
   IRT models, 610
Aitchison, J., 844
Aitkin, M., 576, 585, 588, 599, 601, 617, 730, 842, 843
AJ keyword
   on EQUAL command, 379
   on FIX command, 385
   on PRIORS command, 393
AK keyword
   on EQUAL command, 379
   on FIX command, 385
   on PRIORS command, 393
AK option
   on TMATRIX command, 404
ALL keyword
   on EQUAL command, 381
   on FIX command, 386
   on GROUPS command, 381
   on ITEMS command, 381
ALL option
   on LABELS command, 387
   on PRIORS command, 394
   on START command, 396
   on TEST command, 355, 367, 398, 730
   on TMATRIX command, 403
ALPHA keyword
   on PRIORS command, 47, 90, 115, 187, 189-190
ALPHA option
   on RELIABILITY command, 776
Analysis
   display of details, 84
   indication of successful termination of, 82
   selecting steps of, 457
   specifying in MULTILOG, 388-389
Andersen, E.B., 730, 840
Andrich, D., 550, 556-557, 846
Answer key
   using, 26, 59, 92, 149, 167-171, 178-179, 236, 238, 241-242, 254, 355-356, 448-449, 480, 634, 652, 658, 666, 670, 783, 827, 829
Answer Key tab
   on Item Keys dialog box, 59, 87
Assessment testing, 618-619
Assign Calibration Prior Latent Distribution dialog box, 75
   vertical, 24, 116, 120, 129, 138, 528, 532, 534, 621, 627-628, 658, 661, 845
Equipercentile method, 533, 627
Equivalent groups equating, 532-533, 627, 631, 652, 768
ERRORSEED keyword
   on SIMULATE command, 483, 820, 822
E-step
   methods of integration, 410, 530
   saving results of final, 465
ESTIMATE command, 382
   ACCMAX keyword, 382
   CCRIT keyword, 383
   ICRIT keyword, 383
   ITERATIONS keyword, 383
   NCYCLES keyword, 383
   VAIM keyword, 356, 384
Estimated error variance, 32-35, 249, 271, 583, 620, 623, 625, 655-656, 684-685, 699, 838, 843
Estimates
   a-posteriori, 812
   provisional, 496
Estimating
   common value for lower asymptote, 118
   means of prior distributions, 123
   score distribution as discrete distribution, 121
Estimation
   Bayes, 31, 34-35, 214, 317, 410, 418, 476, 529-530, 537, 544, 576, 588, 590, 601, 605, 607, 609-610, 616, 624-625, 655, 664, 679, 692-693, 706, 721, 780-781, 800-802, 837-838, 841, 843-844, 847
   Bayes modal, 605, 607, 841, 843
   marginal maximum likelihood, 28, 116, 121, 123, 345, 349-350, 364, 389, 401-403, 407, 529, 544, 562, 576, 584-587, 589-590, 599-602, 604-605, 607, 611, 644, 675, 699-700, 702, 705, 730, 742-743, 806, 842-843
   maximum likelihood, 25, 28, 30-31, 33-36, 75, 122, 128, 208, 213-214, 222, 277, 317-318, 320, 323, 345, 410, 452, 495, 529-530, 532, 537, 543-544, 564, 568, 576, 584-585, 586-587, 591, 594-595, 597-600, 606-611, 615, 652, 655, 685, 693, 700, 702, 708, 734, 744, 792, 794, 806, 833-834, 836-838, 840-844, 847
   maximum marginal a posteriori, 25, 29, 31, 35-37, 75, 211, 214, 349-350, 357, 389, 392, 395, 474, 476, 529, 544, 590, 607-610, 625, 655, 664, 685, 748-749, 781-782, 802, 837-838
   Minimized squared residuals (MINRES), 791, 793
   Newton-Gauss, 28, 31, 120, 601, 644, 692, 702, 833-834
   Newton-Raphson, 31, 615, 749, 833-834
   reversing order of EM, 277
   Warm's weighted maximum likelihood, 264, 316-317, 615
ESTORDER option
   on CALIB command, 264, 274, 277
Examinee Data dialog box, 55, 95, 147, 149, 163, 167, 171, 175, 178-181, 240, 242
   Data File tab, 55
   Data File/Enter Data tab, 85-86, 88
   Enter Data tab, 58
   General tab, 55, 86-87, 103
Examinee Data option
   on Data menu, 55, 85-86, 88, 94-95, 103, 147, 149, 163, 167, 171, 175, 178-181, 240, 242
Example, 359, 362
   2PL model with BILOG-MG, 92
   3PL model, 364, 733
   DIF analysis with BILOG-MG, 100
   fixed-theta model, 370
   reading of an external criterion, 357
Expected correlations
Expected frequencies, 438
Harman, H.H., 583 on SCORE command, 222, 226, 243, 248, 250,
Harvey, W.R., 583 75, 78, 90, 115, 196- 256, 282, 342, 510, 524,
Harwell, M.R., 599 198, 208, 211, 213, 525, 527, 592, 599, 601,
Hedeker, D.R., 586 666, 675 606, 683, 685, 703, 745,
Help menu, 85 IFNAME keyword 842-843, 847
Hendrickson, E.A., 583 on FILES command, curves, 678
Henryssen, S., 577 263, 274, 288-289, expected, 652
Heywood case, 123, 137, 312, 708 maximum value of, 524
576, 587, 589, 601, 843 on GLOBAL command, Information axis
HIGH keyword 53, 85, 114, 147, 150, scaling of, 507
on TEST command, 180-181, 201, 204- Information curves, 25, 36
367, 399, 738, 752 205, 226-227, 686 Information function, 271,
Histogram on INPUT command, 524-526, 599, 609, 613-
estimated abilities, 505 650 614, 623, 625
of ability scores, 512 IGROUPS keyword correcting, 319
scores in TESTFACT, on BIFACTOR INFORMATION keyword
577 command, 419, 804- on SAVE command,
Histogram option 805, 815 263, 312, 314, 337,
Graphics procedure, Import/Enter Values tab 342
505, 512, 518 Assign Item Parameter Information option
Hively, W., 846 Starting Values dialog Graphics procedure,
Holland, P.W., 844 box, 65, 87-88 505, 507, 526
Holzinger, K.J., 586 INAMES keyword Information statistics
Horst, P., 592 on FORM command, requesting, 212
50, 88, 114, 144-145 Initial slope parameter, 564
on GROUP command, Initialize option
I 50, 88, 114, 159-160 on Run menu, 82
on ITEMS command, INOPT keyword
46, 87, 114, 159, 182, on INPUT command,
ICC and Info option 183, 638, 686 263, 292-293, 335-
Graphics procedure, on TEST command, 87, 336
505, 508, 526 114, 224, 228, 263, Input
ICC option 325-326, 710 counts of response
Graphics procedure, Indeterminacy, 531, 549, patterns, 352, 366,
505-506 703 407
ICRIT keyword INDIVIDUAL option data for item or factor
on ESTIMATE on PROBLEM analysis, 441
command, 383 command, 352, 359, file in TESTFACT, 443
IDENTITY keyword 390-391, 748, 773 fixed-effects table of
on CLASS command, INFO keyword counts, 352-353, 407
424 on SCORE command, trial values for full
IDIST keyword 36, 90, 115, 208, 212, information factor
on CALIB command, 218, 222-223, 226- analysis, 446
88, 115-116, 122, 227, 256, 634, 675 INPUT command, 55, 62,
125, 189, 193-195, Information, 25, 28, 31, 86, 114, 241-242, 263,
675 33-36, 208, 212, 218, 280, 289, 291-292, 320,
878
INDEX
329, 331, 441, 482, 504 300-304, 334, 336, OFNAME keyword, 61,
CASE option, 413-414 342, 724, 727-728 62, 87, 114, 155, 163,
COMBINE keyword, NALT keyword, 44, 86, 169, 172, 176-177,
263, 285, 292, 338, 114, 134, 155, 163, 241-242
722 169 PATTERN option, 413,
CORRELAT option, NFMT keyword, 57, 59, 730
445, 818 64, 86, 114, 163, 170, PERSONAL option, 55,
DIAGNOSE keyword, 263, 292, 295, 445 87, 114, 159, 163,
87, 114, 163-164 NFNAME keyword, 60- 177, 240
DIF option, 43, 87, 114, 61, 87, 114, 163, 169, REWIND option, 445
135, 163-164, 192, 171, 177, 241-242 R-INOPT keyword, 263,
638 NFORM keyword, 42, 292, 296-297, 724,
DRIFT option, 43, 87, 87, 114, 144-146, 727-728
114, 135, 142, 163, 159, 163, 168-169, SAMPLE keyword, 55,
165, 192, 202 171-173, 176-177, 63, 77-78, 86, 114,
EXTERNAL keyword, 235, 240-241 148, 163, 167, 178-
55, 63, 87, 114, 163, NFORMS keyword, 652 180, 263, 288, 292,
166 NGROUP keyword, 42, 298, 671, 697
FACTORS option, 445, 87, 114, 117, 121, SCORES option, 430,
823 122-123, 128-130, 444-445, 776, 779,
FILE keyword, 443, 135, 142-143, 159, 781, 804
446, 776, 804, 818, 163, 165, 173-174, TAKE keyword, 55, 63,
823, 825 187, 189, 192, 194- 87, 114, 163, 167,
FORMAT option, 444 198, 218-219, 241, 179, 263, 292, 298,
GROUPLEVEL option, 638 697
263, 292-293, 336 NIDCHAR keyword, TRIAL keyword, 446,
IFNAME keyword, 650 55, 63, 87, 114, 163, 472
INOPT keyword, 263, 174, 241, 263, 292, TYPE keyword, 55, 63,
292, 293, 335, 336 295, 333-336, 343- 86, 114, 147, 148,
ISEED keyword, 78, 87, 344, 443, 446, 469, 150-152, 154-155,
114, 163, 167 692, 721, 775, 779, 163, 180, 238, 240-
KFNAME keyword, 59, 804 241, 638, 666
87, 114, 149, 163, NRATER keyword, UNFORMAT option,
167, 169, 172, 177, 263, 292, 296, 727- 444
236, 241, 670 728 WEIGHT keyword, 441,
LENGTH keyword, NTEST keyword, 263- 444, 446-447, 470,
263, 270, 276, 292, 264, 292, 296, 325, 779
294, 327, 692, 702, 329, 338, 692, 722 WEIGHT option, 263,
722, 728 NTOTAL keyword, 42, 292, 298, 333-336
LIST option, 444, 777 86, 114, 146, 161, Input data
MGROUP keyword, 162-163, 175, 183- type of, 390
263, 292, 294, 300- 184, 229-230, 263, Input Data dialog box,
304, 334-336, 339- 292, 297, 333-334, 350-353, 358, 365, 371,
342, 710 638, 710, 728 377, 388, 390
MRATER keyword, NWEIGHT keyword, Input files
263, 292, 294, 296, 638 answer key, 167
879
INDEX
   BILOG-MG, 57, 149, 156, 173, 241
   calibration file in BILOG-MG, 147
   in PARSCALE, 288-290
   item parameter file in BILOG-MG, 150, 200, 204
   item parameters for scoring, 475
   item provisional values file in BILOG-MG, 156
   item standard difficulties, 476
   master file in BILOG-MG, 151
   not-presented key, 171
   omit key, 176
   PARSCALE, 289, 333
   raw data in BILOG-MG, 148
   specifying in BILOG-MG, 147, 163
Input Parameters dialog box, 352-355, 359, 362, 366-368, 372-374, 377, 388-392, 398-399
Instruments
   multiple test forms, 112
   single test form, 111
INTER keyword
   on PRIOR command, 452
   on SLOPE command, 452
INTERCEPT keyword
   on TEST command, 263, 325-326, 328
Intercepts, 539
   normal prior distribution, 452
   starting values for, 229, 326
INTERCPT keyword
   on TEST command, 47, 68, 87, 114, 224, 226, 229, 232
Internal consistency, 459, 582
   measure of, 459
Intervals
   assigning respondents to, 543, 604
   assigning scores to, 562
   for displaying response proportions, 29
   tolerance, 605, 649
Intra-class correlation coefficient, 582
INUMBERS keyword
   on FORM command, 50, 88, 114, 144, 146, 686
   on GROUP command, 50, 88, 114, 159, 161
   on ITEMS command, 87, 114, 159, 182-183, 638
   on TEST command, 47, 48, 87, 114, 224, 230, 233
IQUAD keyword
   on TECHNICAL command, 492
Irving, L.M., 765
ISEED keyword
   on INPUT command, 78, 87, 114, 163, 167
ISTAT keyword
   on SAVE command, 86, 114, 199, 203, 244
Item Analysis dialog box, 46, 94, 102, 108-111, 113, 117-118, 120-124, 127, 130, 137-138, 140, 145-146, 160-161, 185-186, 225, 231
   Advanced tab, 51, 88-89
   Form Items tab, 49, 88
   Group Items tab, 49, 88, 102
   Subtest Items tab, 48, 87
   Subtests tab, 46, 86, 89
Item Analysis option
   on Setup menu, 40, 46, 86-89, 94, 102, 108-111, 113, 117-118, 120-124, 127, 130, 137-138, 140, 145-146, 160-161, 185-186, 225, 231
Item characteristic curves, 505-509, 515, 523, 525-526, 594, 644, 653
   displaying, 506, 508
   displaying simultaneously, 510
   editing and saving, 506, 508, 512
Item difficulty, 482, 819-820
   as input, 476
   plot, 582
   plot against discriminating power, 450
Item dimensionality, 576
Item facility, 451, 462-465, 467, 471, 577, 579, 587, 624, 776, 786, 793, 829
Item factor analysis, 410, 432, 465, 530, 575-576, 584, 586, 589-601, 620, 629, 778-781, 802-803, 815, 818, 827-829, 840, 842
Item fit statistics, 29
Item information, 32, 40, 200, 224, 248, 253, 256, 314, 337, 342, 505, 507-508, 515, 524-526, 543, 599, 608-609, 613-615, 623, 625, 847
   suppressing correction […]
[…]
   3PL, 523, 538, 540, 569
   and relationship to normal ogive, 541
LOGISTIC option
   on CALIB command, 264, 274, 279, 692, 710
   on GLOBAL command, 43, 85, 114, 147, 150, 638
Logit, 31, 150, 393, 539, 541, 570, 602, 734, 834, 842
Longford, N.T., 842
LORD option
   on BIFACTOR command, 413-414
   on FULL command, 413-414, 437-438
Lord, F.M., 421, 438, 533, 534, 537, 541, 563, 569, 580, 592, 594, 610, 700, 835-838, 840-842, 845, 847
Louis, T.A., 830, 842
Lower asymptote
   3PL model, 393
Luce, R.D., 835

M

Madsen, M., 730, 843
MAIN option
   on SAVE command, 467
Mantel, N., 626, 835
Marginal
   reliability, 617, 746
Marginal maximum likelihood (MML), 28, 116, 121, 123, 345, 349-350, 364, 389, 401-403, 407, 529, 544, 562, 576, 584-587, 589-590, 599-602, 604-605, 607, 611, 644, 675, 699-700, 702, 705, 730, 742-743, 806, 842-843
Marginal probability, 32
   of pattern, 30, 32, 244, 600-603, 796, 812, 836
   saving to file, 206, 243, 255
Marshall, J.C., 578
Master file
   as input in BILOG-MG, 151
   naming of, 204, 290
   saving, 314
MASTER keyword
   on SAVE command, 86, 114, 152, 199, 204, 263, 288, 312, 314, 337
Masters, G.N., 403, 404, 523, 529, 545, 550, 552, 735, 738, 761, 763, 767, 842, 846
Matrix plot, 829
Matrix Plot option
   Graphics procedure, 505, 510
Matrix sampling data, 116, 187, 238, 537, 610, 630, 666
Maximum
   effectiveness point, 250
Maximum information, 32, 249, 256, 524, 632
Maximum Likelihood (ML), 25, 28, 30-36, 75, 122, 128, 208, 213-214, 222, 277, 317-318, 320, 323, 345, 410, 452, 495, 529-530, 532, 537, 543-544, 564, 568, 576, 584-587, 591, 594-595, 597-600, 606-611, 615, 652, 655, 685, 693, 700, 702, 708, 734, 744, 792, 794, 806, 833-834, 836-838, 840-844, 847
   and Warm's weighted, 264, 316-317, 615
Maximum marginal a posteriori (MAP), 25, 29, 31, 35-37, 75, 211, 214, 349-350, 357, 389, 392, 395, 474, 476, 529, 544, 590, 607-610, 625, 655, 664, 685, 748-749, 781, 782, 802, 837-838
   controlling precision for factor scores, 478
MAXPOWER keyword
   on DRIFT command, 78, 88, 114, 142, 192
MCEMSEED keyword
   on TECHNICAL command, 493, 803
Mean
   criterion score, 462, 464
   of normal distribution for intercepts, 452
MEAN keyword
   on SIMULATE command, 486, 819-820
Means
   estimates for groups, 37
   of population distribution, 379, 385, 393
   of population of factor scores, 486
   of prior distributions, 51, 123, 277, 394, 652
Mean-square
   of measurement errors, 589
Measurement
   standard error, 525-526, 589, 606, 608-609, 622
[…]
Number
   of cases generated, 486
   of categories for graded model, 375
   of classes, 454
   of COMBINE commands, 292
   of cycles of MML estimation, 383
   of cycles prior to fixing, 493
   of decimals for residuals, 431
   of decimals for tetrachorics, 500
   of EM cycles, 493, 496
   of examinees, 391
   of examinees in MULTILOG, 352
   of external variates, 455
   of factors, 431, 477, 486
   of factors to be extracted, 431
   of format records, 170
   of forms, 42
   of fractiles, 435, 455, 582
   of generated response records, 486
   of groups, 42, 62, 174, 294, 352, 360, 367, 373, 391
   of groups in MULTILOG, 352, 353
   of item-group factors, 421
   of items, 42, 183, 270, 352, 391
   of items in form, 146, 162
   of items in MULTILOG, 352, 353
   of items in test, 185, 294
   of iterations, 51, 120, 276, 383, 438, 493, 496, 702
   of iterations for MINRES, 432, 829
   of iterations in the M-step, 383, 493
   of iterative communality improvements, 494
   of latent roots, 432, 792
   of parameter values, 487
   of parameters in BILOG-MG model, 152
   of patterns, 352, 360, 366, 367, 391
   of points sampled, 495
   of quadrature points, 76, 117, 129, 208, 216-217, 221, 280, 319, 401, 422, 439, 478, 497-498, 600, 611, 666, 669, 693, 694, 702, 709, 710, 758, 794, 825, 826
   of quadrature points for EAP estimation, 498
   of records in data file, 54-55, 58, 62, 391
   of response alternatives, 43, 169
   of response categories, 399
   of response categories in MULTILOG, 355
   of response codes, 405, 456
   of response patterns, 390-391
   of response patterns in MULTILOG, 352
   of selected items, 457
   of subtests, 42, 152-153, 296
   of test forms, 173
   of tests, 24, 325, 352, 354, 360, 367, 373, 528
   of the highest category, 355, 399
   of times item is rated, 296
   of unique items, 175
   of variable format records, 445
NUMBER keyword
   on TGROUPS command, 401-402
NVARIANT keyword
   on LENGTH command, 43, 47, 86, 114, 185-186, 671
NVTEST keyword
   on GLOBAL command, 43, 85, 114, 147, 153, 159, 175, 185-186, 224-225, 229-231, 233
NWEIGHT keyword
   on INPUT command, 638
NWGHT keyword
   on GLOBAL command, 85, 114, 147, 154, 240-241

O

O’Connor, W.A., 740
Oblique rotation, 588
Observed
   frequencies of patterns, 438
OFNAME keyword
   on FILES command, 263, 288, 290, 337
   on INPUT command, 61-62, 87, 114, 155, 163, 169, 172, 176-177, 241-242
[…] 70, 90, 115, 187, 190, 264, 305-306
   SOPTION option, 264, 305, 306
   SSIGMA keyword, 26, 47, 70, 90, 115, 187, 191, 264, 305, 307
   TMU keyword, 47, 70, 90, 115, 187, 191, 264, 305, 307
   TSIGMA keyword, 47, 70, 90, 115, 187, 192, 264, 305, 307
PRNAME keyword
   on GLOBAL command, 25, 53, 86, 114, 147, 156-157, 226-227, 241
Probability
   marginal of pattern, 30, 32, 244, 600, 603, 796, 812, 836
   observed response, 634
   of chance success, 418
   posterior, 255-256, 589, 603, 655
PROBLEM command, 388, 435, 454, 482
   CLASS keyword, 446, 454, 776
   CRITERION option, 358, 388, 773
   DATA keyword, 352, 359, 389
   EXTERNAL keyword, 430, 446, 455
   FIXED option, 349, 350, 370, 389, 392
   FRACTILES keyword, 455, 776
   INDIVIDUAL option, 352, 359, 390-391, 748, 773
   NC keyword, 752
   NCHARS keyword, 359-360, 367, 390, 409, 748, 773
   NEXAMINEES keyword, 390-391
   NGROUP keyword, 360, 367, 373, 391, 409
   NITEMS keyword, 360, 367, 373, 391, 400, 409, 432, 436, 446, 455, 457, 464, 469-470, 473, 484, 776, 778, 804, 825
   NOPOP option, 392
   NOTPRES option, 456, 460
   NOT-PRESENTED option, 781
   NPATTERNS keyword, 360, 367, 390-391
   PATTERN option, 352, 390-391, 742
   RANDOM option, 349-350, 389, 392, 730, 742
   RESPONSE keyword, 456, 776, 778, 781, 804, 825
   SCORE option, 748, 773
   SCORES option, 349-350, 357, 389, 392
   SELECT keyword, 457, 480-481, 776, 781, 818
   SKIP keyword, 457, 780, 802, 818, 824
   SUBTEST keyword, 458, 776
   TABLE option, 352, 390-391
Program
   evaluation, 618-619
   information, 85
Project Settings dialog box, 362, 369
PROMAX option
   on FACTOR command, 413
PROMAX rotation, 413, 431, 433, 441-442, 468, 576, 583, 588, 629, 779, 803, 818, 823, 828-829
Provisional estimates
   controlling printing of, 496
Provisional values file
   as input in BILOG-MG, 156
PRV keyword
   on TECHNICAL command, 496
PSD keyword
   on SCORE command, 75, 90, 115, 208, 214, 217, 218-219

Q

QP keyword
   on TGROUPS command, 402
QPREAD option
   on CALIB command, 264, 274, 282, 308
   on SCORE command, 264, 310, 316, 321
QRANGE keyword
   on CALIB command, 264, 274, 282
   on SCORE command, 264, 316, 321
QSCALE keyword
   on TECHNICAL command, 496
QUAD command, 75, 90, 115, 117, 122, 126, 187-189, 193, 675
[…]
View menu, 82

W

Wainer, H., 91, 535, 586, 616, 638, 760, 763, 770, 772, 844, 847
Warm, T., 615
Warm's
   weighted ML estimation, 264, 316-317, 615
WEIGHT keyword
   on COMBINE command, 723
   on INPUT command, 441, 444, 446-447, 470, 779
WEIGHT option
   on INPUT command, 263, 292, 298, 333-336
Weights
   for calculating criterion score, 429
   for combining subscale scores, 285, 286
   for quadrature, 70, 251, 308, 310, 590, 800
   for Rater's Effect model, 561
   providing information on, 62
   specifying in BILOG-MG, 154
   type of, 447
WEIGHTS keyword
   on COMBINE command, 264, 285-286
   on CRITERION command, 428-429
   on QUAD command, 51, 75, 77, 90, 115, 193, 195
   on QUADP command, 264, 308-309
   on QUADS command, 91, 115, 196-197, 264, 310-311
White, P.O., 583
Wilmut, J., 579
Wilson, D.T., 529
Window menu, 85
WML option
   on SCORE command, 264, 316-317
Wolfowitz, J., 840, 843
Wood, R., 529, 575, 582, 592, 599, 611
Workspace
   setting in PARSCALE, 259
Wright, W.D., 560

Y

Yates, F., 267, 834
YCOMMON option
   on SCORE command, 91, 115, 208, 213, 222, 653, 675

Z

Zimowski, M.F., 25, 209, 528, 531, 537, 576, 589, 591, 679, 688, 844-846
Zwarts, M., 599
Zwick, R., 591
Zyzanski, S.J., 780, 802-803