
Educational Test Approaches: The Suitability of Computer-Based Test Types For Assessment and Evaluation in Formative and Summative Contexts

The document discusses different types of computer-based tests (CBTs) and their suitability for various educational test approaches. It describes four test approaches: formative assessment, formative evaluation, summative assessment, and summative evaluation. It also outlines six CBT types and evaluates each type based on test characteristics like purpose, length, level of interest, and reporting. The goal is to provide guidance on selecting the most appropriate CBT type based on the intended test approach.

Journal of Applied Testing Technology, Vol 21(1), 12-24, 2020

Educational Test Approaches: The Suitability of Computer-Based Test Types for Assessment and Evaluation in Formative and Summative Contexts

Maaike M. van Groen1* and Theo J. H. M. Eggen1,2
1 Cito, Amsterdamseweg 13, 6814 CM Arnhem, The Netherlands; [email protected]
2 Department of Research Methodology, Measurement and Data Analysis, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands; [email protected]

Abstract
When developing a digital test, one of the first decisions that need to be made is which type of Computer-Based Test (CBT)
to develop. Six different CBT types are considered here: linear tests, automatically generated tests, computerized adaptive
tests, adaptive learning environments, educational simulations, and educational games. The selection of a CBT type needs
to be guided by the intended purposes of the test. The test approach determines which purposes can be achieved by using a
particular test. Four different test approaches are discussed here: formative assessment, formative evaluation, summative
assessment, and summative evaluation. The suitability of each CBT type to measure performance for the different test
approaches is evaluated based on four test characteristics: test purpose, test length, level of interest for measurement
(student, class, school, system), and test report. This article aims to provide some guidance in the selection of the most
appropriate type of CBT.

Keywords: Computer-based Testing, Formative Testing and Summative Testing

1. Introduction
When developing a digital test, one of the first choices the test developers need to make is which type of Computer-Based Test (CBT) to develop. The usefulness and feasibility of different types of Computer-Based Tests (CBTs) need to be evaluated in order to choose the optimal type of test for the specific test situation (Becker & Bergstrom, 2013). The selection of a CBT type needs to be guided by the intended purposes of the test and the form of the test should follow the function of the test (Davey, 2011). The aim of this article is to provide some guidance to test developers when selecting the type of CBT.
The test approach determines which purposes can be achieved. Four different test approaches are distinguished here: formative assessment, formative evaluation, summative assessment, and summative evaluation. This article discusses six different CBT types: linear tests, automatically generated tests, computerized adaptive tests, adaptive learning environments, educational simulations,

*Author for correspondence


Maaike M. van Groen and Theo J. H. M. Eggen

and educational games. The suitability of the CBT types for the different test approaches is evaluated based on test characteristics. Four characteristics were distinguished to describe the assessment approaches and test types.
The test approaches are described in more detail in the next section. Next, the CBT types are described. The suitability of the CBT types for the approaches is explored thereafter. The paper ends with a discussion.

2. Educational Test Approaches
Although a wide range of test approaches and definitions for those approaches exist, we selected four: formative assessment, formative evaluation, summative assessment, and summative evaluation. A complication when distinguishing approaches is that one test can serve several purposes (Stobart, 2008). Before and during test development, the primary purpose determines which approach is best for a specific test.

2.1 Formative Assessment
The first approach is formative assessment, which focuses on supporting and improving the learning process to facilitate learning by making decisions at the level of the learner and the class (Van der Kleij, Vermeulen, Schildkamp, & Eggen, 2015) where individual characteristics, such as performance or knowledge, are measured. A well-designed and implemented formative test suggests how instruction should be modified and tells teachers what learners know and can do (Bennett, 2011). Feedback should be provided to enhance learning (Van der Kleij et al., 2015). Formative tests offer a powerful method to improve schooling (Wiliam, 2013).
Van der Kleij et al. (2015) distinguished three types of formative assessment: data-based decision making, assessment for learning, and diagnostic testing. Data-based decision making is “systematically analyzing existing data sources within the school; applying outcomes of analyses to innovate teaching, curricula, and school performance; and implementing (e.g., genuine improvement actions) and evaluating these innovations” (Schildkamp and Kuiper, 2010, p. 482). Using the data, teachers can then set learning goals based on the current student knowledge level (Van der Kleij et al., 2015).
Assessment for learning focuses on the quality of the learning process (Van der Kleij et al., 2015). It attempts to make testing a part of the learning process (Stobart, 2008) and focuses “on what is being learned and on the quality of classroom interactions and relationships” (p. 145). Assessment for learning differs from instruction in its focus on measurement in combination with enhancing learning, whereas instruction purely focuses on enhancing learning.
Diagnostic testing assumes that how a task is solved indicates the learner’s developmental stage (Van der Kleij et al., 2015). The test items are developed to have certain characteristics that are assumed to elicit a response behavior that indicates the learner’s developmental stage (Leighton & Gierl, 2007) and are used to identify weaknesses in prior knowledge or skills (Crisp, 2012).
Test report measures must be precise enough at the individual level so that instruction can be adapted (assessment for learning), so that teachers get a precise overview of the learner’s developmental stage (diagnostic testing), or so that the measures can be aggregated to give an overview of the current status of the curriculum and school performance (data-based decision making).

2.2 Formative Evaluation
The second approach is formative evaluation, which focuses on improving the day-to-day running of educational organizations and systems (Scheerens, Glas & Thomas, 2003) and is used to make decisions about the quality of education (Harlen, 2007). However, the focus is on day-to-day improvements in education rather than making judgments about educational organizations and systems.
Formative evaluation requires findings that are timely, concrete, and immediately useful (Rossi, Lipsey, & Freeman, 2004). Test data can (partially) provide information for formative evaluation. Valid and useful indicators include measurements of the school context and student outcomes (Oakes, 1989). Individual test results can be aggregated to obtain group or school

Vol 21(1) | 2020 | www.jattjournal.com Journal of Applied Testing Technology 13


Test Approaches and Computer-Based Testing

information (Sanders, 2013). Qualitative data can also be used as a data source.
After the needed modifications are identified and implemented, it can be investigated whether they were effective. A coding scheme and correspondence index can then be used to investigate whether the intended intervention and the implemented intervention match (Tolboom & Kuiper, 2014). New test data at the student level can be obtained to investigate whether the intended goals of the improvement were met. The test report measures for formative evaluation must be sufficiently precise at the program or school level.

2.3 Summative Assessment
The third approach is summative assessment or assessment of learning (Harlen, 2007). Summative tests are used to make inferences about individual students (Haertel, 2013) based on measurement. The focus is on what has been learned by the end of the process. This type of test is intended to measure the student’s progress toward the achievement of major concepts rather than learning of specific things (Harlen & James, 1997). Thus, summative tests are designed to measure and report individual student achievement (Harlen, 2007). The test results play a role in decision making about the mastery of a content domain by the pupil or class (Van der Kleij et al., 2015), guide decisions about student ability grouping, determine entry into and exit from education, aid in college admissions decisions (Haertel, 2013), and affect whether a student graduates. Sanders (2013) distinguished several purposes for summative assessment: selection, classification, certification, or placement.
Summative assessment takes place at specific intervals when achievement needs to be reported in relation to progress in learning against public criteria, and it should be based on evidence about the performance relevant to those criteria (Harlen & James, 1997). Summative assessment can be conducted at a point in time, or summarized over a period of time until the reporting moment (Harlen, 2007). Several smaller tests can be combined for summative assessment.

2.4 Summative Evaluation
The fourth approach, summative evaluation, focuses on the use of test data to make judgments about schools (Van der Kleij et al., 2015) or educational systems. This approach is the evaluation of a finished product (Scriven, 1967). Here, judgment is made about schools in summative evaluation, whereas day-to-day improvements in education are the focus of formative evaluation.
School accountability is a form of summative evaluation in which schools provide information on their performance and functioning to outside parties (Scheerens et al., 2003). The tests intend to show what the school’s students can do at one time and changes across years can be monitored (Harlen, 2007). Adjustments in policy and standards can then be based on the findings.
The second form of summative evaluation goes beyond the school level. Judgments can be made at the regional or (inter)national level and are typically based on large-scale assessments. The goal is to inform relevant parties about the state of the educational system (Sanders, 2013). As a result, educational systems can focus their energies and efforts on improving measured schooling outcomes (Haertel, 2013).
Tests for monitoring national standards are usually low stakes at the individual level. A representative sample of students is used for monitoring national standards (Stobart, 2008). Measurements for summative evaluation purposes must be precise at the school or system level.

3. Types of CBTs
Several types of CBTs exist. The test characteristics depend on the CBT type. Six types are discussed here: linear tests, automatically generated tests, computerized adaptive tests, adaptive learning environments, educational simulations, and educational games.

3.1 Linear Tests
The first type of CBT is linear testing in which the content, items, item order, and test length are the same for everyone. Several linear test forms can be constructed
that are as parallel as possible. Nevertheless, all test forms are assembled before test administration takes place. The items for each test are selected manually by test developers before the test is administered, often using pretest information about the items. Linear tests are also administered in a paper-and-pencil format, but digital administration adds possibilities, such as the inclusion of innovative items, the use of multimedia, and automated reporting.
This type of testing is inefficient in terms of measurement precision because the tests are not tailored to the individual student (Mellenbergh, 2011). Some students have to answer questions that are too difficult or too easy for them, which causes either frustration or boredom. These items also contribute little information to the measurement accuracy of the student’s ability (Yan, Lewis, & von Davier, 2014). For the test results, number-correct scores, percentile scores in the distribution, classifications, or ability estimates can be reported.

Table 1. Characteristics of test approaches

| Test approach | Test administration purpose | Test length | Level | Report: scope | Report: measure | Report: precision |
| --- | --- | --- | --- | --- | --- | --- |
| Formative assessment | Assessment: Enhance learning and instruction | Preferably short tests because testing is often frequent | Individual or class | One or multiple narrow domains or skills | Ability estimate, score, or indicator for each domain | Low at the individual level |
| Formative evaluation | Evaluation: Make decisions about the quality of programs or schools | Short tests because aggregated results are used | Program or school | Multiple very broad domains | Distribution information based on ability estimate, score, decision, or indicator for each domain or for the entire test | Low at the individual level because results are aggregated, high at the higher level |
| Summative assessment | Assessment: Make a decision about mastery of a domain or admission | Long test acceptable for high-stakes testing, short tests acceptable for low-stakes testing | Individual | One or multiple broad domains | Requires a mastery decision, ability estimate, score, or indicator for each domain or for the entire test | Low for low-stakes testing, high for high-stakes testing at the individual level |
| Summative evaluation | Evaluation: Make judgments about schools or educational systems | Short tests because aggregated results are used | School or one or more educational systems | Multiple very broad domains | Distribution information based on ability estimate, score, or indicator for each domain | Low at the individual level because results are aggregated, precise at the level of interest |

3.2 Automatically Generated Tests
The second type of CBT is automatically generated testing, also known as automatically assembled testing, in which
fixed-length tests are produced that satisfy a number of constraints or conditions (Parshall, Spray, Kalohn, & Davey, 2002). Constraints deal with content restrictions, psychometric properties (Parshall et al., 2002), or item exposure. Test forms are assembled automatically before administration using heuristics or linear programming (Parshall et al., 2002; Van der Linden, 2005). A major advantage of automatically generated tests is that test forms can be automatically generated from large item pools.
Van der Linden (2005) distinguished four modes of test assembly: random sampling of tests, sequential sampling of tests, optimal test assembly, and adaptive test assembly. In random sampling, items are randomly drawn from an item pool. In sequential sampling, items are sampled until a predetermined level is realized for the standard error of measurement. In optimal test assembly, elaborate heuristics are used to obtain a test that has optimal measurement characteristics and fulfills content constraints. Adaptive test assembly is discussed separately. A major disadvantage of the three discussed modes is that the resulting tests are not tailored to the individual student’s ability. As a result, items are administered that might be too easy or too difficult and testing might be inefficient. An automatically generated test has the same reporting possibilities as linear tests.

3.3 Computerized Adaptive Tests
The third type of CBT is computerized adaptive testing, in which items are selected that best fit the examinee (Mellenbergh, 2011). Computerized adaptive testing results in a more precise ability estimate than linear testing with the same test length. After each response, the examinee’s ability is estimated, and the next item is automatically selected that has optimal measurement properties at the new ability estimate (Van der Linden & Glas, 2010). Other item selection methods have been developed that, for example, select items that measure optimally at the cut-off point when making classification decisions (Van Groen, Eggen, & Veldkamp, 2014).
Computerized adaptive testing requires an item bank that is calibrated with item response theory (Hambleton, Swaminathan, & Rogers, 1991). The item bank is assembled after a pretest and contains items that have the desired characteristics, such as appropriate difficulty and content. Testing can be stopped when the desired accuracy level has been obtained for the ability estimate, when enough confidence has been reached in the classification decision, or after a fixed number of items. An ability estimate, a percentile score based on the ability estimate, or a classification decision is reported.
Mixed forms of computerized adaptive tests and linear tests are multistage tests (Yan et al., 2014) and multi-segment computerized adaptive tests (Eggen, 2018). In multistage tests, preassembled, fixed item sets are used as building blocks for the adaptive test (Zenisky, Hambleton, & Luecht, 2010) with tailoring to the student’s ability between the item sets. In multi-segment computerized adaptive tests, the test consists of several adaptive item parts, called segments, with branching rules between the segments (Eggen, 2018). This enables test developers to combine multiple subjects into one test with adaptivity within and between the subjects.

3.4 Adaptive Learning Environments
The fourth type of CBT is the item-based adaptive learning environment. These systems provide instruction that is optimized to each learner’s individual needs, preferences, and/or context (Wauters, 2012). The sequencing of learning content and feedback is adapted to the learner, which leads to more efficient and effective learning (Wauters, 2012). Adaptive techniques select the item “that maximizes learning based on the learner’s current knowledge level and the difficulty level of the problem and adapting the feedback accordingly” (Wauters, 2012, p. i).
Brusilovsky (1999) distinguished between two types of environments: adaptive hypermedia systems and intelligent tutoring systems. The former contains knowledge in the form of hypertexts (Wauters, 2012). The presentation of the information and the overall link structure are adapted based on registered user actions. The latter supports the learner during the problem-
solving process and provides help on every aspect of problem solving based on discovered knowledge gaps. The discussion is limited here to intelligent tutoring systems in which the environment selects the items and tasks for the learner. These systems can provide (immediate) feedback to the student, and an ability estimate can be reported to the student (Wauters, 2012). Only those systems are considered here that have an explicit measurement component besides the learning components.

3.5 Educational Simulations
The fifth type of CBT is the educational simulation. A simulation-based test is a task “in which the student is presented with, works with, or produces a work product that contains a simulation of a real-world scenario” (Levy, 2013). Educational simulations often contain complex tasks that require multiple steps, capture multiple features of task performance, produce products in an unconstrained manner, and relate task features to aspects of performance (Levy, 2013).
Student behavior can be observed in educational simulations that approximate relevant real-world situations where it would be impractical, cost prohibitive, or unethical to place students in the actual situation for measurement purposes (Levy, 2013). According to Mislevy (2011), a carefully developed educational design supports a student’s learning by tailoring situational features to his or her skill level(s), allowing for multiple attempts and providing immediate feedback (p. 5), and can provide opportunities to assess people’s capacities to act in the provided situations. In educational simulations, dynamic or interactive features are present, such as viewing an animation, which act in ways that prompt a change in or response from a system (Levy, 2013).
Educational simulations report multiple aspects of proficiency including a wide range of abilities and skills. Not only product data but also process data are collected (De Klerk, Van Dijk & Van den Berg, 2015). Moving from simulations for learning to simulation-based testing is not simple (Mislevy, 2011). According to Mislevy (2011), designing simulations for learning requires a focus on the simulation features that provoke the intended knowledge and skills. The focus in simulation-based testing is on reporting overall proficiency or specific aspects of knowledge and skills (Mislevy, 2011). Simulation-based testing in which learning and measurement take place should focus on simulation features that provoke learning and also support measurement.

3.6 Educational Games
The last type is educational games. These are serious games in which testing is woven directly and invisibly into the game (Shute & Ventura, 2014). While gaming, the students produce sequences of actions drawing on the skills and competencies that are assessed (Shute & Ventura, 2014).
Novak, Johnson, Tenenbaum, and Shute (2014) summarized some potential benefits of game-based learning: it facilitates hands-on student-centered learning and encourages the integration of knowledge from different areas to make decisions and to examine outcomes. Educational games facilitate learning because playing not only affects learning outcomes but also keeps the learner engaged and motivated (Novak et al., 2014). Educational games can be used to assess a broader range of skills and constructs than traditional assessments (Kato & De Klerk, 2017). Although educational games are often used for instruction, the focus here is on games that also include a strong measurement component.
Game-based testing can provide useful information to students, teachers, and the system itself (Mislevy et al., 2014). This requires “reasoning from the specific things that students do, to what they know and do more broadly, and what the system, the teacher, or the students themselves might do next” (p. 9) to develop skills and knowledge. Evidence-centered design can be used to ensure valid inferences from games (Mislevy, Steinberg & Almond, 2003; Shute & Ke, 2012). The tests can report real-time estimates of competencies across a range of knowledge and skills (Mislevy et al., 2003; Shute & Ventura, 2014).
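
Of the six CBT types, the computerized adaptive test in Section 3.3 is the one whose procedure is described step by step: after each response the ability is re-estimated, the next item is the one with optimal measurement properties at the new estimate, and testing stops at a desired accuracy or after a fixed number of items. The Python sketch below illustrates one possible reading of that loop. It is not the authors' implementation. The Rasch model, the expected a posteriori (EAP) ability estimate with a standard normal prior on a fixed grid, the maximum-information selection rule, and every name and default value (run_cat, se_target, max_items) are assumptions made for illustration only.

```python
import math

# Ability grid for the EAP estimate (assumption: abilities lie in [-4, 4]).
GRID = [-4.0 + 0.05 * k for k in range(161)]


def p_correct(theta, b):
    """Rasch model: probability of a correct response to an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_information(theta, b):
    """Fisher information of a Rasch item at ability theta, equal to p * (1 - p)."""
    p = p_correct(theta, b)
    return p * (1.0 - p)


def estimate_ability(responses):
    """EAP estimate and posterior SD from (difficulty, score) pairs, score in {0, 1}."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)  # unnormalised standard normal prior
        for b, x in responses:
            p = p_correct(theta, b)
            w *= p if x == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def select_next_item(theta, bank, administered):
    """Index of the not-yet-used item with maximum information at theta."""
    candidates = [i for i in range(len(bank)) if i not in administered]
    return max(candidates, key=lambda i: item_information(theta, bank[i]))


def run_cat(bank, answer, se_target=0.4, max_items=20):
    """Administer items adaptively; answer(i) returns the 0/1 score on item i.

    Stops when the posterior SD drops to se_target or after max_items items.
    """
    administered, responses = [], []
    theta, se = 0.0, float("inf")
    while len(administered) < min(max_items, len(bank)) and se > se_target:
        i = select_next_item(theta, bank, administered)
        administered.append(i)
        responses.append((bank[i], answer(i)))
        theta, se = estimate_ability(responses)
    return theta, se, administered


if __name__ == "__main__":
    # Hypothetical calibrated bank: 41 difficulties from -2.0 to 2.0.
    bank = [-2.0 + 0.1 * k for k in range(41)]
    # Hypothetical examinee who answers every item easier than 1.0 correctly.
    theta, se, used = run_cat(bank, lambda i: 1 if bank[i] < 1.0 else 0)
    print(f"estimate={theta:.2f}, se={se:.2f}, items used={len(used)}")
```

Under the Rasch model the most informative item is simply the unused item whose difficulty lies closest to the current ability estimate, which is why the selection rule needs no further tuning here. A classification-oriented test, as mentioned in Section 3.3, would instead evaluate the information at the cut-off point.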


4. Test Approaches and the Suitability of the Types of CBTs
The suitability of the different CBT types for each test approach is discussed based on the characteristics of the test approaches listed in Table 1 and the characteristics of the CBT types listed in Table 2. Four characteristics are included in the tables. The first characteristic is the purpose of the test. Given that the purpose for administering the test forms the basis for further test development, support of the intended testing purpose by the test approach is considered to be the most important characteristic. Different purposes can be distinguished here. The first purpose is learning and instruction, which is aimed at increasing the student’s knowledge. The second purpose is measurement, which is aimed at obtaining information about the student’s knowledge about a specific subject at a specific point in time. Measurement can be divided further into assessment and evaluation: measurement aimed at the student or class levels (i.e., assessment) versus measurement at the school or even higher levels (i.e., evaluation). The second characteristic deals with the practical constraint of the desired test length. For some approaches, long tests are not suitable. The third characteristic is the level of interest in measurement: the individual student or a higher level (class, school, or system level). The intended level has a large influence on test design. The last characteristic deals with the test report measures. The report enables stakeholders to use the outcomes of administering the test. This characteristic is divided into intended testing scope, desired report measure, and required precision of the report.

4.1 Formative Assessment: Suitability of the Types of CBTs
Formative assessment focuses on supporting and improving learning using educational measurement. Learning and instruction are primarily supported by adaptive learning environments, educational simulations, and educational games. These types of testing were developed to support and create learning and instruction, but they also include a measurement component. When the test design is specifically tailored to and supports learning and instruction, linear, automatically generated, and computerized adaptive tests can also be used.
Tests for formative assessment should be short due to the frequency of testing. Linear and automatically generated tests are often too long when multiple domains need to be measured with meaningful accuracy. These test types should be used only when the test measures a limited set of domains. The test length of the other CBT types tends to be adaptable to the specific test purpose and specified precision. Given that all types of CBTs are focused on the individual, no further restrictions are placed on using the CBT types for formative assessment.
Formative assessment is aimed at assessing one or multiple domains or skills. A report for each domain or skill is required, such as an ability estimate, score, or indicator. The report measures can be less precise than for high-stakes testing. In high-stakes testing, a wrong decision or a less precise estimate can have serious consequences for test takers. In formative testing, such errors have less impact because of the low-stakes nature of such tests and the frequency of testing. Linear tests and automatically generated tests can only be used if a small set of domains or skills is measured. Otherwise, the number of items will be too high if sufficient measurement accuracy needs to be achieved. Computerized adaptive tests can obtain many report measures due to the short tests for each domain. Adaptive learning environments, educational simulations, and educational games can be used to obtain the desired report measures with sufficient precision for formative assessment.
Based on the four characteristics, adaptive learning environments, educational simulations, and educational games are very suitable for formative assessment. These testing types have been developed to enhance learning and instruction while providing measurement. However, to arrive at meaningful test report measures, special attention should be paid to the test design. Computerized
adaptive tests focus on measurement, but the test design requires special attention to enhance learning. Linear and automatically generated tests appear to be less suitable for formative testing with sufficient precision given their long test lengths and primary aim of measurement instead of learning.

Table 2. Characteristics of computer-based test types

| CBT type | Test administration purpose | Test length | Level | Report: scope | Report: measure | Report: precision |
| --- | --- | --- | --- | --- | --- | --- |
| Linear tests | Depends on test design | Depends on testing purpose, long for precise measurement | Individual, can be aggregated to higher level | One or multiple broad or narrow domains | Ability estimate, score, indicator, or decision for each domain or for the entire test | Depends on test design |
| Automatically generated tests | Depends on test design | Depends on testing purpose, long for precise measurement | Individual, can be aggregated to higher level | One or multiple broad or narrow domains | Ability estimate, score, indicator, or decision for each domain or for the entire test | Depends on test design |
| Computerized adaptive tests | Depends on test design | Short depending on test domain | Individual, can be aggregated to higher level | One or multiple broad or narrow domains | Ability estimate or decision for each domain or for the entire test | Precision at the individual level |
| Adaptive learning environments | Learning and instruction, assessment depends on test design | Depends on test design | Individual, aggregation challenging | One or multiple narrow domains or skills | Ability estimate, score, indicator, or decision for each domain, skill, or for the entire test and feedback | Low precision at the individual level |
| Educational simulations | Learning and instruction, assessment depends on test design | Depends on test design | Individual, aggregation challenging | One or multiple narrow skills | Score or indicator for overall proficiency or specific aspects | Low precision at the individual level |
| Educational games | Learning and instruction, assessment depends on test design | Depends on test design, length less problematic due to intrinsic motivation | Individual, aggregation challenging | One or multiple narrow skills | Indicators on specific competencies | Low precision at the individual level |

4.2 Formative Evaluation: Suitability of the Types of CBTs
Formative evaluations aim to make decisions about the quality of programs or schools. Given the focus of adaptive learning environments, educational simulations, and educational games on learning and instruction, these
types of CBTs are less suitable for formative evaluation. Linear, automatically generated, and computerized adaptive tests are better suited for evaluation purposes.

As aggregated results are used for formative evaluation, the precision of the report measures at the individual level is less important. Therefore, with respect to precision, all CBT types can be used.

The level of interest here is at the program or school level. To draw conclusions at the school level, individual testing data must be aggregated. Aggregating data from adaptive learning environments, educational simulations, and educational games might be challenging unless the aggregation is taken into account when developing the test. Thus, these types of CBTs seem to be less suitable for formative evaluation. Results from other types of tests can be easily aggregated at the school level.

Formative evaluation requires aggregated information based on an ability estimate, score, classification decision, or indicator for each domain or for the entire test. All types of CBTs, except educational simulations and educational games, can be used to obtain an ability estimate. When scores or indicators are required, all types of CBTs, except computerized adaptive tests, can be used. However, when indicators are required, it depends on the test construct whether educational games can be used because games provide indicators based on competencies. When decisions are required, linear, automatically generated, and computerized adaptive tests can be used to make the decisions using classification methods.

As the goal of formative evaluation is to make decisions about the quality of the school or program, using the types of CBTs that are directed at learning and instruction makes less sense. Adaptive learning environments, educational simulations, and educational games are less likely to be used. Depending on the desired type of test report measure, linear, automatically generated, or computerized adaptive tests can be chosen. When sum scores are required, computerized adaptive tests cannot be used. However, when a large number of domains need to be tested, computerized adaptive testing is probably the only choice given the large total number of items required for linear and automatically generated tests.

4.3 Summative Assessment: Suitability of the Types of CBTs

Summative assessment focuses on what has been learned. It is used to make a decision about a student's mastery of a domain or about the admission of the student. These decisions are made after learning and instruction have been completed. Therefore, CBTs that enhance learning, such as adaptive learning environments, educational simulations, and educational games, are less suitable for summative assessment. In situations where it is not possible to measure performance in reality, educational simulations might be the only option for measurement. Depending on the test design, linear, automatically generated, and computerized adaptive tests can be used for summative assessment.

The test length for summative assessment must be carefully considered, depending on the stakes of testing. In high-stakes testing, longer tests are often acceptable or even required to obtain the specified precision. However, in low-stakes testing, long tests are less acceptable. Depending on the stakes, measurement precision requirements must be determined, test length requirements must be set, and a decision must be made whether a certain CBT type can be used.

Summative assessment is focused on the individual level. All types of CBTs are administered at the individual level. However, the scope in terms of test content of the different types is not always appropriate for summative assessment. Scope here refers to whether many topics are measured superficially or whether one topic is measured in depth. The scope of adaptive learning environments, educational simulations, and educational games is often too narrow for summative assessment. The scope of linear, automatically generated, and computerized adaptive tests can be adapted to the required scope.

Summative assessment requires a decision, ability estimate, score, or indicator for the entire test or for underlying domains. Except for educational games, all CBT types can provide report measures for the entire test and for each domain. When a mastery decision is required, linear, automatically generated, and computerized adaptive tests can provide a decision using classification methods. Adaptive learning environments, educational simulations, and educational games provide decisions only by setting standards at the ability scale, score, or indicator level. The advantage of using classification methods is that a decision can be made with fewer items than without such methods while keeping the same level of precision.

Precise test report measures are required for summative assessment when the testing stakes are high. This implies that adaptive learning environments, educational simulations, and educational games cannot be used in high-stakes situations. Depending on the test design, linear or automatically generated tests may provide sufficient precision. However, when precision is paramount, computerized adaptive tests provide the most information without resulting in very long tests. Since precision is less important in low-stakes testing, all types of CBTs can be used.

Given that summative assessment is used after learning has ended, using a type of CBT directed at enhancing learning makes less sense. Therefore, adaptive learning environments, educational simulations, and educational games seem to be less fitting. Depending on the testing stakes, more or less precision is required and test length is more or less important. For high-stakes testing that covers a large number of domains, computerized adaptive tests provide high precision with relatively short test lengths. When fewer domains are to be covered, linear and automatically generated tests can also be used. For low-stakes summative assessment, linear, automatically generated, and computerized adaptive tests can be used. However, when a decision is required, test length can be reduced for those three types by using a classification method to make the decisions.

4.4 Summative Evaluation: Suitability of the Types of CBTs

Summative evaluation is used to make judgments about schools or educational systems. This implies that the types of CBTs that enhance learning instead of focusing on efficient measurement (adaptive learning environments, educational simulations, and educational games) are not ideal. Linear, automatically generated, and computerized adaptive tests provide the type of information needed for summative evaluation.

Ideally, tests should be short for summative evaluation as interest is not at the student level, and students are less motivated to take tests when nothing is at stake for them. As the precision does not have to be high at the individual level, short tests can be used for summative evaluation. All types of CBTs provide enough precision to be used for summative evaluation without including too many items.

Summative evaluation requires an aggregation of individual student data. Therefore, using a CBT type for which individual students’ results can be easily aggregated is more convenient. This suggests the use of linear, automatically generated, or computerized adaptive tests.

Given the focus of summative evaluation on broad domains, using types of CBTs that are especially suitable for measuring narrow domains seems contradictory. Thus, adaptive learning environments, educational simulations, and educational games appear less suitable for summative evaluation. Nevertheless, all types of CBTs can provide the desired report measures.

Because adaptive learning environments, educational simulations, and educational games have been developed to enhance learning, which is not the aim of summative evaluation, linear, automatically generated, and computerized adaptive tests are most appropriate for summative evaluations.

5. Discussion

When developing an educational test, test developers must specify the purpose of the test. Based on this purpose, the developer selects the relevant test approach, such as formative assessment, formative evaluation, summative assessment, or summative evaluation. After the developer has selected the purpose and the test approach, he or she must select the appropriate test type. The current article
aimed to provide test developers with an overview of the suitability of different CBT types for these four test approaches. Six types of CBTs were discussed: linear tests, automatically generated tests, computerized adaptive tests, adaptive learning environments, educational simulations, and educational games.

Adaptive learning environments, educational simulations, and educational games are the CBT types best suited for formative assessment. However, to ensure that valid measurement inferences from the tests are possible, much attention must be paid to the measurement component of the test design. Computerized adaptive tests can also be used given their strong focus on measurement and the wide range of possible adaptations of their test design. Linear, automatically generated, and computerized adaptive tests are most appropriate for formative evaluation, summative assessment, and summative evaluation. Depending on the testing stakes in summative assessment, one of the three types is more appropriate.

The focus of this article was to determine the appropriate types of CBT for each test approach based on just four characteristics: the purpose of the test administration, the test length, the level of interest for measurement, and the report measures. The purpose of the test is the most important characteristic when developing a test. Other criteria, including the complexity and costs of the test administration software, the costs of item development, experience with CBT (types), acceptability of CBT types by relevant stakeholders, and so on, are also important in deciding which CBT type should be used for a specific test situation. These were not taken into account in this article. Test developers who are interested in such criteria are referred to Becker and Bergstrom (2013), Davey (2011), and Parshall et al. (2002).

One challenge when specifying the purpose of a test is that tests often have multiple intended purposes. We assumed that a test has one purpose that is the most important. If a test has multiple purposes, the developer should check whether the selected type of CBT is appropriate for all relevant and important purposes. A test program can also combine summative and formative tests. In such a situation, test developers should ensure that each test is appropriate for the intended testing purpose at that point in time and in the program.

6. Acknowledgment

Part of this study was financed by the Dutch Ministry of Education as part of the SLOA law.

7. References

Becker, K. A., & Bergstrom, B. A. (2013). Test administration models. Practical Assessment, Research & Evaluation, 18(14). Retrieved from http://pareonline.net/getvn.asp?v=18&n=14.
Bennett, R. E. (2011). Formative assessment: A critical review. Assessment in Education: Principles, Policy & Practice, 18, 5-25. https://doi.org/10.1080/0969594X.2010.513678.
Brusilovsky, P. (1999). Adaptive and intelligent technologies for web-based education. Künstliche Intelligenz [Artificial Intelligence], 13(4), 19-25.
Crisp, G. T. (2012). Integrative assessment: Reframing assessment practice for current and future learning. Assessment & Evaluation in Higher Education. https://doi.org/10.1080/02602938.2010.494234.
Davey, T. (2011). Practical considerations in computer-based testing. Princeton, NJ: Educational Testing Service.
De Klerk, S., Van Dijk, P., & Van den Berg, L. (2015). Voordelen en uitdagingen voor toetsing in computer simulaties [Advantages and challenges of assessment in computer simulations]. Examens, 12(1), 11-17.
Eggen, T. J. H. M. (2018). Multi-segment computerized adaptive testing for educational testing purposes. Frontiers in Education. https://doi.org/10.3389/feduc.2018.00111.
Haertel, E. (2013). How is testing supposed to improve schooling? Measurement: Interdisciplinary Research and Perspectives, 11, 1-18. https://doi.org/10.1080/15366367.2013.783752.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Harlen, W. (2007). The quality of learning: Assessment alternatives for primary education (Primary Review Research Survey 3/4). Cambridge, England: University of Cambridge.
Harlen, W., & James, M. (1997). Assessment and learning: Differences and relationships between formative and summative assessment. Assessment in Education: Principles, Policy & Practice, 4, 365-379. https://doi.org/10.1080/0969594970040304.
Kato, P. M., & De Klerk, S. (2017). Serious games for assessment: Welcome to the jungle. Journal of Applied Testing Technology, 1, 1-6.
Leighton, J. P., & Gierl, M. J. (2007). Defining and evaluating models of cognition used in educational measurement to make inferences about examinees’ thinking processes. Educational Measurement: Issues and Practice, 26, 3-16. https://doi.org/10.1111/j.1745-3992.2007.00090.x.
Levy, R. (2013). Psychometric and evidentiary advances, opportunities, and challenges for simulation-based assessment. Educational Assessment, 18, 182-207. https://doi.org/10.1080/10627197.2013.814517.
Mellenbergh, G. J. (2011). A conceptual introduction to psychometrics. Den Haag, the Netherlands: Eleven International.
Mislevy, R. J. (2011). Evidence-centered design for simulation-based assessment (CRESST Report 800). Los Angeles, CA: University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST).
Mislevy, R. J., Oranje, A., Bauer, M. I., von Davier, A. A., Hao, J., Corrigan, S., . . . John, M. (2014). Psychometric considerations in game-based assessment. Redwood City, CA: GlassLab Research, Institute of Play.
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1, 3-62. https://doi.org/10.1207/S15366359MEA0101_02.
Novak, E., Johnson, T. E., Tenenbaum, G., & Shute, V. J. (2014). Effects of an instructional gaming characteristic on learning effectiveness, efficiency, and engagement: Using a storyline for teaching basic statistical skills. Interactive Learning Environments, 24(3), 523-538. https://doi.org/10.1080/10494820.2014.881393.
Oakes, J. (1989). What educational indicators? The case for assessing the school context. Educational Evaluation and Policy Analysis, 11, 181-199. https://doi.org/10.3102/01623737011002181.
Parshall, C. G., Spray, J. A., Kalohn, J. C., & Davey, T. (2002). Practical considerations in computer-based testing. New York, NY: Springer. https://doi.org/10.1007/978-1-4613-0083-0.
Rossi, P. H., Lipsey, M. W., & Freeman, H. E. (2004). Evaluation: A systematic approach (7th ed.). Thousand Oaks, CA: Sage.
Sanders, P. (2013). Het doel van toetsen [The purpose of testing]. In P. Sanders (Ed.), Toetsen op school [Testing at schools] (pp. 15-23). Arnhem, the Netherlands: Stichting Cito Instituut voor Toetsontwikkeling.
Scheerens, J., Glas, C. A. W., & Thomas, S. M. (2003). Educational evaluation, assessment, and monitoring. London, England: Taylor & Francis.
Schildkamp, K., & Kuiper, W. (2010). Data-informed curriculum reform: Which data, what purposes, and promoting and hindering factors. Teaching and Teacher Education, 26, 482-496. https://doi.org/10.1016/j.tate.2009.06.007.
Scriven, M. (1967). The methodology of evaluation. In R. E. Stake (Ed.), Curriculum evaluation (pp. 39-83). Chicago, IL: Rand McNally.
Shepard, L. A. (2005, October). Formative assessment: Caveat emptor. Paper presented at the ETS Invitational Conference, The Future of Assessment: Shaping Teaching and Learning, New York, NY.
Shute, V. J., & Ke, F. (2012). Games, learning, and assessment. In D. Ifenthaler, D. Eseryel, & X. Ge (Eds.), Assessment in game-based learning (pp. 43-58). New York, NY: Springer Science+Business. https://doi.org/10.1007/978-1-4614-3546-4_4.
Shute, V. J., & Ventura, M. (2014). Stealth assessment: Measuring and supporting learning in video games. Cambridge, MA: Massachusetts Institute of Technology. https://doi.org/10.7551/mitpress/9589.001.0001.
Stobart, G. (2008). Testing times: The uses and abuses of testing. London, England: Routledge. https://doi.org/10.4324/9780203930502.
Tolboom, J., & Kuiper, W. (2014). Quantifying correspondence between the intended and the implemented intervention in educational design research. Studies in Educational Evaluation, 43, 160-168. https://doi.org/10.1016/j.stueduc.2014.09.001.
Van der Kleij, F. M., Vermeulen, J. A., Schildkamp, K., & Eggen, T. J. H. M. (2013). Integrating data-based decision making, assessment for learning and diagnostic testing in formative assessment. Assessment in Education: Principles, Policy & Practice. https://doi.org/10.1080/0969594X.2014.999024.
Van der Linden, W. J. (2005). Linear models for optimal test design. New York, NY: Springer. https://doi.org/10.1007/0-387-29054-0.
Van der Linden, W. J., & Glas, C. A. W. (2010). Preface. In W. J. van der Linden & C. A. W. Glas (Eds.), Elements of adaptive testing (pp. v-vii). New York, NY: Springer. https://doi.org/10.1007/978-0-387-85461-8.
Van Groen, M. M., Eggen, T. J. H. M., & Veldkamp, B. P. (2014). Item selection methods based on multiple objective approaches for classification of respondents into multiple levels. Applied Psychological Measurement. https://doi.org/10.1177/0146621613509723.
Wauters, K. (2012). Adaptive item sequencing in item-based learning environments (Unpublished doctoral dissertation). KU Leuven, Belgium.
Wiliam, D. (2013). How is testing supposed to improve schooling? Some reflections. Measurement: Interdisciplinary Research and Perspectives, 11, 55-59. https://doi.org/10.1080/15366367.2013.784165.
Yan, D., Lewis, C., & von Davier, A. (2014). Overview of computerized multistage tests. In D. Yan, A. A. von Davier, & C. Lewis (Eds.), Computerized multistage testing: Theory and applications. Boca Raton, FL: CRC Press.
Zenisky, A., Hambleton, R. K., & Luecht, R. M. (2010). Multistage testing: Issues, designs, and research. In W. J. van der Linden & C. A. W. Glas (Eds.), Elements of adaptive testing (pp. 355-372). New York, NY: Springer. https://doi.org/10.1007/978-0-387-85461-8_18.
