Framework SABER-Student Assessment

Contents

About the Series
About the Author
Acknowledgments
Abstract
Introduction
Theory and Evidence on Student Assessment
Framework for Student Assessment Systems
Fleshing out the Framework
Levels of Development
Conclusions
References
Appendix 1: Assessment Types and Their Key Differences
Appendix 2: Rubrics for Judging the Development Level of Different Assessment Types
Appendix 3: Example of Using the Rubrics to Evaluate a National Large-Scale Assessment Program

About the Series
Building strong education systems that promote learning is fundamental to development and economic
growth. Over the past few years, as developing countries have succeeded in building more classrooms,
and getting millions more children into school, the education community has begun to actively embrace
the vision of measurable learning for all children in school. However, learning depends not only on
resources invested in the school system, but also on the quality of the policies and institutions that
enable their use and on how well the policies are implemented.

In 2011, the World Bank Group launched Education Sector Strategy 2020: Learning for All, which
outlines an agenda for achieving “Learning for All” in the developing world over the next decade. To
support implementation of the strategy, the World Bank commenced a multi-year program to support
countries in systematically examining and strengthening the performance of their education systems.
This evidence-based initiative, called SABER (Systems Approach for Better Education Results), is building
a toolkit of diagnostics for examining education systems and their component policy domains against global standards and best practices, and in comparison with the policies and practices of countries around the world. By leveraging this global knowledge, SABER fills a gap in the availability of data and evidence
on what matters most to improve the quality of education and achievement of better results.

SABER-Student Assessment, one of the systems examined within the SABER program, has developed
tools to analyze and benchmark student assessment policies and systems around the world, with the
goal of promoting stronger assessment systems that contribute to improved education quality and
learning for all. To help explore the state of knowledge in the area, the SABER-Student Assessment team
invited leading academics, assessment experts, and practitioners from developing and industrialized
countries to come together to discuss assessment issues relevant for improving education quality and
learning outcomes. The papers and case studies on student assessment in this series are the result of
those conversations and the underlying research. Prior to publication, all of the papers benefited from a
rigorous review process, which included comments from World Bank staff, academics, development
practitioners, and country assessment experts.

All SABER-Student Assessment papers in this series were made possible by support from the Russia
Education Aid for Development Trust Fund (READ TF). READ TF is a collaboration between the Russian
Federation and the World Bank that supports the improvement of student learning outcomes in low-
income countries through the development of robust student assessment systems.

The SABER working paper series was produced under the general guidance of Elizabeth King, Education
Director, and Robin Horn, Education Manager in the Human Development Network of the World Bank.
The Student Assessment papers in the series were produced under the technical leadership of
Marguerite Clarke, Senior Education Specialist and SABER-Student Assessment Team Coordinator in the
Human Development Network of the World Bank. Papers in this series represent the independent views
of the authors.

About the Author
Marguerite Clarke is a Senior Education Specialist in the Human Development Network at the World
Bank. She leads the Bank’s work on learning assessment, including providing support to individual
countries to improve their assessment activities and uses of assessment information, and heading the
global work program on student assessment under the Russia Education Aid for Development (READ)
Trust Fund program. Under READ, she is responsible for developing evidence-based tools and
approaches for evaluating and strengthening the quality of student assessment systems. Prior to joining
the Bank, Marguerite was involved in research, policy, and practice in the areas of higher education
teaching and learning, higher education quality, and student assessment and testing policy at
universities in Australia (University of South Australia) and the United States (Brown University, Boston
College). She also worked as a classroom teacher in the Chinese, Irish, Japanese, and U.S. education
systems and received a national teaching award from the Irish Department of Education in 1989. A
former Fulbright Scholar, she received her PhD in Educational Research, Measurement, and Evaluation
from Boston College (2000) and is on the advisory board of the UNESCO Institute for Statistics
Observatory for Learning Outcomes.

Acknowledgments
Many people provided inputs and suggestions for this paper. Thanks in particular go to the peer
reviewers and meeting chairs: Luis Benveniste, Luis Crouch, Deon Filmer, Robin Horn, Elizabeth King,
Marlaine Lockheed, Harry Patrinos, and Alberto Rodriguez. I am also grateful to the READ Trust Fund
team, particularly Julia Liberman and María-José Ramírez, who provided valuable support in developing
a set of rubrics and questionnaires based on this framework paper, as well as Olav Christensen, Emily
Gardner, Manorama Gotur, Emine Kildirgici, Diana Manevskaya, Cassia Miranda, and Fahma Nur. Thanks
also to READ Technical Group members, past and present, including Luis Benveniste, Cedric Croft,
Amber Gove, Vincent Greaney, Anil Kanjee, Thomas Kellaghan, Marina Kuznetsova, María-José Ramírez,
and Yulia Tumeneva, as well as to the Task Team Leaders and teams in the READ countries. Others who
provided helpful insights and suggestions along the way include Patricia Arregui, Felipe Barrera, Viktor
Bolotov, Lester Flockton, Alejandro Ganimian, Juliana Guaqueta, Gabrielle Matters, Emilio Porta, Halsey
Rogers, Alan Ruby, Jee-Peng Tan, Igor Valdman, and Emiliana Vegas. Special thanks to the Russian
government for their support for this work under the READ Trust Fund program.

Abstract
The purpose of this paper is to provide an overview of what matters most for building a more effective
student assessment system. The focus is on systems for assessing student learning and achievement at
the primary and secondary levels. 1 The paper extracts principles and guidelines from countries’
experiences, professional testing standards, and the current research base. The goal is to provide
national policy makers, education ministry officials, development organization staff, and other
stakeholders with a framework and key indicators for diagnosis, discussion, and consensus-building
around how to construct a sound and sustainable student assessment system that will support improved
education quality and learning for all.

1 This paper does not discuss psychological or workplace testing; nor does it explicitly discuss assessment of
student learning and achievement at the tertiary level, although many of the issues raised also apply to that
level of schooling.

Introduction
Assessment is the process 2 of gathering and evaluating information on what students know, understand,
and can do in order to make an informed decision about next steps in the educational process. Methods
can be as simple as oral questioning and response (for example, “What is the capital of Ethiopia?”) or as
complex as computer-adaptive testing models based on multifaceted scoring algorithms and learning
progressions. 3 Decisions based on the results may vary from how to design system-wide programs to
improve teaching and learning in schools, to identifying next steps in classroom instruction, to
determining which applicants should be admitted to university.

An assessment system is a group of policies, structures, practices, and tools for generating and using
information on student learning and achievement. Effective assessment systems are those that provide
information of sufficient quality and quantity to meet stakeholder information and decision-making
needs in support of improved education quality and student learning outcomes (Ravela et al., 2009). 4
Meeting these information and decision-making needs in a way that has the support of key political and
other groups in society will contribute to the longer-term sustainability and effectiveness of the
assessment system.

Governments, international organizations, and other stakeholders are increasingly recognizing the
importance of assessment for monitoring and improving student learning and achievement levels, and
the concomitant need to develop strong systems for student assessment (IEG, 2006; McKinsey &
Company, 2007; UNESCO, 2007). This recognition is linked to growing evidence that many of the
benefits of education—cultural, economic, and social—accrue to society only when learning occurs
(OECD, 2010). For example, an increase of one standard deviation in scores on international
assessments of reading and mathematics achievement levels has been linked to an increase of about 2 percentage points in annual growth rates of GDP per capita (Hanushek and Woessmann, 2007, 2009).
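To get an intuitive sense of what such a difference could mean if sustained, the back-of-the-envelope sketch below compounds two growth rates over several decades; the baseline and boosted rates are purely illustrative assumptions, not figures from the cited studies.

```python
# Back-of-the-envelope illustration (not from the cited studies): how a
# 2-percentage-point difference in annual GDP per capita growth compounds.
base, boosted, years = 0.01, 0.03, 40   # assumed baseline 1% vs. 3% growth
ratio = (1 + boosted) ** years / (1 + base) ** years
print(f"After {years} years, GDP per capita is ~{ratio:.1f}x higher at the faster growth rate")
```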

Some people argue that assessments, particularly large-scale assessment exercises, are too expensive. In
fact, the opposite tends to be true, with testing shown to be among the least expensive innovations in
education reform, typically costing far less than increasing teachers’ salaries or reducing class size.
Hoxby (2002) found that even the most expensive state-level, test-based accountability programs in the
United States cost less than 0.25 percent of per-pupil spending. Similarly, in none of the Latin American
countries reviewed by Wolff (2007) did testing involve more than 0.3 percent of the national education
budget at the level (primary or secondary) tested. While these cost efficiencies are appealing, they
should not be allowed to obscure other important factors—for example, equity and social goals—that
need to be considered in any decision about whether or not to implement a particular assessment
program.

Over the last 20 years, many countries have started implementing assessment exercises or building on
existing assessment systems (UNESCO, 2007). In addition, there has been huge growth in the number of

2 When used as a noun, assessment may refer to a particular tool, such as a test.
3 A list of computer-adaptive testing programs can be found at http://www.psych.umn.edu/psylabs/catcentral/.
4 A student assessment system supports a variety of information needs, such as informing learning and instruction, determining progress, measuring achievement, and providing partial accountability information. All of these purposes, and the decisions based on them, should ultimately lead to improved quality and learning levels in the education system.
countries participating in international comparative assessment exercises such as the Trends in
International Mathematics and Science Study (TIMSS) and the Programme for International Student
Assessment (PISA). 5 Nongovernmental organizations also have increasingly turned to student
assessment to draw public attention to poor achievement levels and to create an impetus for change.

Despite this interest in student assessment, far too few countries have in place the policies, structures,
practices, and tools that constitute an effective assessment system. This is particularly the case for low-
income countries, which stand to benefit most from systematic efforts to measure learning outcomes.
Some of these countries have experimented with large-scale or other standardized assessments of
student learning and achievement levels, but too often these have been ad hoc experiences that are not
part of an education strategy and are not sustained over time. A key difference between one-off
assessments and a sustained assessment system is that the former only provides a snapshot of student
achievement levels while the latter allows for the possibility of monitoring trends in achievement and
learning levels over time (more like a series of photos) and a better understanding of the relative
contribution of various inputs and educational practices to changes in those trends. One-off
assessments can have shock value and create an opening for discussions about education quality, and
this can be a short-term strategy for putting learning on the agenda. 6 Ultimately, however, governments
must deal with the challenging, but necessary, task of putting in place systems that allow for regular
monitoring of, and support for, student learning and achievement. This is the only way to harness the
full power of assessment.

Theory and Evidence on Student Assessment


A basic premise of the research on student assessment is that the right kinds of assessment activities,
and the right uses of the data generated by those activities, contribute to better outcomes, whether improved learning or improved policy decisions (for example, Heubert and Hauser, 1999). 7 What
constitutes ‘right’ is largely driven by a set of theoretical and technical guidelines for test developers and
users of assessment information (AERA, APA, and NCME, 1999).

5 For example, the number of countries participating in PISA jumped from 43 in 2000 to 66 in 2007. A
comparatively small number of developing countries have participated in international assessments of student
achievement. These countries have consistently performed at the bottom of the distribution, limiting the
amount of information they can derive from the data to better understand and improve their own education
systems.
6 One of the more popular of these initiatives is known as EGRA. According to the USAID Website (https://www.eddataglobal.org/): “The Early Grade Reading Assessment (EGRA) is an oral assessment designed to measure the most basic foundation skills for literacy acquisition in the early grades …. in order to inform ministries and donors regarding system needs for improving instruction.”
7 Ravela et al. (2008) note that student assessment is a necessary, but insufficient, condition for improving education. There is some evidence that the mere existence and dissemination of assessment information has some effect on certain actors. But assessment is only one of several key elements of education policy; others include preservice and inservice teacher training, teacher working conditions, school management and supervision, curricular design, textbooks and educational materials, investment of resources proportional to the needs of different populations, and concerted action by those responsible for education to resolve any problems uncovered.

There also is a sizeable body of empirical research showing the benefits of specific types of assessment
activities, when implemented and used correctly, on student learning. For example, research
demonstrates a strong link between high-quality, formative classroom assessment activities and better
student learning outcomes as measured by student performance on standardized tests of educational
achievement. Black and Wiliam’s (1998) synthesis of over 250 empirical studies from around the world
on the impact of high-quality, formative classroom assessment activities shows student gains of a half to
a full standard deviation on standardized achievement tests, with the largest gains being realized by low
achievers. 8 Black and Wiliam (1998) conclude:

The gains in achievement appear to be quite considerable, and … amongst the largest
ever reported for educational interventions. As an illustration of just how big these gains
are, an effect size of 0.7, if it could be achieved on a nationwide scale, would be
equivalent to raising the mathematics attainment score of an “average” country like
England, New Zealand or the United States into the “top five” after the Pacific rim
countries of Singapore, Korea, Japan and Hong Kong. (p. 61)

Bennett (2011), however, notes that more work needs to be done to define and isolate the specific
characteristics of formative classroom assessment activities that lead to improved student learning
outcomes. 9
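To make these effect sizes concrete, the short sketch below converts a standardized effect size into an approximate percentile-point gain for an average student. It is an illustration assuming approximately normally distributed test scores, not a calculation from the cited studies; note that it reproduces the 0.85 effect size to roughly 30 percentile-point conversion mentioned in footnote 9.

```python
# Illustrative only: convert a standardized effect size (in SD units) into an
# approximate percentile gain, assuming normally distributed test scores.
from statistics import NormalDist

def percentile_gain(effect_size: float) -> float:
    """Percentile-point gain for an average student whose score rises by `effect_size` SDs."""
    return (NormalDist().cdf(effect_size) - 0.5) * 100

for d in (0.5, 0.7, 0.85, 1.0):
    print(f"effect size {d:.2f} -> ~{percentile_gain(d):.0f} percentile points")
```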

Correlational research on high school or upper-secondary exit examinations demonstrates a link between the presence of such examination policies and higher student performance levels on international assessments, such as PISA or TIMSS (for example, Bishop, Mane and Bishop, 2001). Other studies show a
link between specific characteristics of the tests used in these examination programs and student
learning outcomes, with curriculum- or subject-based examinations (as opposed to more general ability
or aptitude tests) viewed as most effective in promoting better student learning outcomes (Au, 2007;
Hill, 2010).

At the same time, these kinds of high-stakes examinations have been shown to have a negative impact
on students from disadvantaged groups by disproportionately limiting their opportunities to proceed to
the next level of the education system or to avail themselves of certain kinds of educational
opportunities (Greaney and Kellaghan, 1995; Madaus and Clarke, 2001). Because of these kinds of
equity issues, the uses and outcomes of examinations must be carefully monitored at the system, group,
and individual levels, and efforts should be made to reduce or mitigate any unintended negative
consequences.

Results from large-scale, system-level assessments of overall student achievement levels increasingly
provide the foundation for test-based accountability programs in many countries. Research shows an
overall weak, but positive, link between the uses of data from these assessments to hold schools and
educators accountable (through, for example, league tables, monetary rewards, or staffing decisions)
and better student learning outcomes (for example, Carnoy and Loeb, 2002). At the same time, findings

8 Rodriguez (2004) reports effects of similar size in U.S. TIMSS mathematics performance arising from the
effective management of classroom assessment (this finding is based on analysis of the responses of teachers
from TIMSS participating countries to questions on the topic of management of classroom assessment).
9 One meta-analysis of 21 controlled studies (Fuchs and Fuchs, 1986) that looked at the frequency of classroom assessment activities found that systematic use of formative classroom assessment activities—weekly or even more often—can have a strong positive effect on student achievement (for example, two assessments per week results in an effect size of 0.85, or a percentile gain of 30 points).
suggest that simply reporting information about average school scores on these assessments also can lead to increased student performance (Hanushek and Raymond, 2003), which indicates that there still is much to learn about the optimal mix of incentives for test-based accountability models that will produce the best outcomes with the fewest negative side effects. To date, research suggests that key
determinants of whether the effects of test-based accountability exercises are more positive than
negative include the technical quality of the tests themselves, the alignment between the test design
and the way test results are used, and the extent to which supports are in place to help schools or
teachers identified as underperforming (Ravela, 2005). 10

Research is increasingly focusing on the characteristics of effective assessment systems that encompass
the aforementioned types of assessment activities and uses (that is, classroom assessment,
examinations, and large-scale, system-level assessments). This research draws on principles and best
practices in the assessment literature as well as analyses of the assessment systems of high-achieving
nations. Darling-Hammond and Wentworth (2010) reviewed the practices of high-performing education
systems around the world (for example, Australia, Finland, Singapore, Sweden, and the United Kingdom)
and noted that student assessment activities in these systems:

• illustrate the importance of assessment of, for, and as student learning, rather than as a separate disjointed element of the education enterprise
• provide feedback to students, teachers and schools about what has been learned, and ‘feed forward’ information that can shape future learning as well as guide college- and career-related decision making
• closely align curriculum expectations, subject and performance criteria and desired learning outcomes
• engage teachers in assessment development and scoring as a way to improve their professional practice and their capacity to support student learning and achievement
• engage students in authentic assessments to improve their motivation and learning
• seek to advance student learning in higher-order thinking skills and problem solving by using a wider range of instructional and assessment strategies
• privilege quality over quantity of standardized testing 11
• as a large and increasing part of their examination systems, use open-ended performance tasks and school-based assessments that require students to write extensively and give them opportunities to develop ‘twenty-first century’ skills. 12

10 Ravela (2005) describes the use of large-scale national assessment results in Uruguay to help teachers
improve their teaching. The emphasis on formative uses at the classroom level helped enhance teacher
acceptance of the results; it also influenced the assessment design in terms of the need to use a census-based
approach to data collection and the use of background factors to control for non-school factors affecting
achievement.
11 That is to say, some countries have good outcomes on international assessment exercises, but don’t use a lot of standardized testing in their own education systems (for example, Finland). Other countries place a lot of emphasis on standardized testing (for example, the United States), but don’t do so well on the same international assessment exercises.
12 Results from standardized performance tasks are incorporated into students’ examination scores in systems as wide-ranging as the GCSE in the United Kingdom; the Singapore examinations system; the certification systems in Victoria and Queensland, Australia; and the International Baccalaureate, which operates in more than 100 countries around the world. Because these assessments are embedded in the curriculum, they influence the day-to-day work of teaching and learning, focusing it on the use of knowledge to solve problems.

While Darling-Hammond and Wentworth’s research provides a broad vision of what an effective
assessment system looks like, it does not tell us what it takes to get there. Other studies delve into these
planning, process, and implementation issues. For example, Ferrer (2006) provides advice on designing
sustainable and sound assessment systems based on his analysis of existing systems in Latin America.
Bray and Steward (1998) carry out a similar analysis for secondary school examinations. Others (for example,
Lockheed, 2009) evaluate the status of donor activity in the area of assessment and discuss how to improve
the effectiveness of this support to countries. Still others delve into the politics of creating sustainable and
effective assessment systems (McDermott, 2011).

This paper draws together all of the above streams of evidence, organizing the key issues and factors
into a unified framework for understanding what an effective student assessment system looks like and
how countries can begin to build such systems.

Framework for Student Assessment Systems


In order to approach the framework in a strategic way, we need to identify some key dimensions of
assessment systems. Two main dimensions are discussed here: (i) types/purposes of assessment
activities and (ii) the quality of those activities.

Dimension 1. Assessment Types/Purposes


Assessment systems tend to comprise three main kinds of assessment activities, corresponding to three
main information needs or purposes (see also appendix 1). These kinds and the concomitant
information needs are:

• classroom assessments for providing real-time information to support teaching and learning in individual classrooms
• examinations for making decisions about an individual student’s progress through the education system (for example, certification or selection decisions), including the allocation of ‘scarce’ educational opportunities
• large-scale, system-level assessments for monitoring and providing policy-maker- and practitioner-relevant information on overall performance levels in the system, changes in those levels, and related or contributing factors.

To be sure, these assessment types are not completely independent of each other; nor are they all-
encompassing (that is, there are some assessment activities that don’t quite fit under these labels). At
the same time, they represent the main kinds of assessment activities carried out in the majority of
education systems around the world.

Classroom assessments, also referred to as continuous or formative assessments, are those carried out
by teachers and students in the course of daily activity (Airasian and Russell, 2007). They encompass a
variety of standardized and nonstandardized instruments and procedures for collecting and interpreting
written, oral, and other forms of evidence on student learning or achievement. Examples of classroom
assessment activities include oral questioning and feedback, homework assignments, student

presentations, diagnostic tests, and end-of-unit quizzes. The main purpose of these assessments is to
provide ‘real time’ information to support teaching and learning.

Examinations, variously modified by the terms ‘public,’ ‘external,’ or ‘end-of-cycle,’ provide information
for high-stakes decision making about individual students—for example, whether they should be
assigned to a particular type of school or academic program, graduate from high school, or gain
admission to university (Greaney and Kellaghan, 1995; Heubert and Hauser, 1999). Whether externally
administered or (increasingly) school-based, their typically standardized nature is meant to ensure that
all students are given an equal opportunity to show what they know and can do in relation to an official
curriculum or other identified body of knowledge and skills (Madaus and Clarke, 2001). The leaving
certificate or exit examinations at the end of compulsory education in many education systems are a
good example. As discussed earlier, the high-stakes nature of most examinations means they can exert a
backwash effect on the education system in terms of what is taught and learned, having an impact, for
better or worse, on the skills and knowledge profile of graduates (West and Crighton, 1999). Such
consequences must be considered when determining whether the use of such tests is appropriate 13 and
whether or how they should be combined with other sources of information in order to ensure that the
results are used in a way that is as fair as possible to individuals, groups, and society as a whole. It is
important to emphasize that there are very specific professional and technical standards regarding the
appropriate and inappropriate uses of examinations (and tests in general) for making high-stakes
decisions about individual students (AERA, APA, and NCME, 1999).

Large-scale, system-level assessments are designed to provide information on system performance levels and related or contributing factors (Greaney and Kellaghan, 2008; Kifer, 2001), typically in relation to an agreed-upon set of standards or learning goals, in order to inform education policy and practice.
Examples include international assessments of student achievement levels, such as TIMSS, PIRLS, and
PISA; regional assessments, such as PASEC in Francophone Africa, SACMEQ in Anglophone Africa, and
LLECE in South America; national-level assessments, such as SIMCE in Chile; and subnational
assessments, such as the state- or province-level tests in the United States and Canada. 14 These assessments vary in
the grades or age levels tested, coverage of the target population (sample or census), internal or
external focus (for example, national versus international benchmarks), subjects or skill areas covered,
types of background data gathered, and the frequency with which they are administered. They also vary
in how the results are reported and used. For example, as discussed earlier, while some stop at the

13 Greaney and Kellaghan (1995) note that because of the high stakes attached to examination performance,
teachers often teach to the examination, with the result that inadequate opportunities to acquire relevant
knowledge and skills are provided for students who will leave school at an early stage. Practices associated
with examinations that may create inequities for some students include scoring practices, the requirement that
candidates pay fees, private tutoring, examination in a language with which students are not familiar, and a
variety of malpractices. The use of quota systems to deal with differences in performance associated with
location, ethnicity, or language-group membership also creates inequities for some students.
14 TIMSS—Trends in International Mathematics and Science Study; PIRLS—Progress in International Reading Literacy Study; PISA—Program for International Student Assessment; PASEC—Programme d'Analyse des Systèmes Educatifs (Program on the Analysis of Education Systems); SACMEQ—Southern and Eastern Africa Consortium for Monitoring Educational Quality; LLECE—Latin American Laboratory for Assessment of the Quality of Education; SIMCE—Sistema de Medición de Calidad de la Educación.

reporting of results to policy makers or the general public, others use the results to hold specific groups in the education system accountable (Clarke, 2007). 15

One way to differentiate among the above three types of assessment activities is that classroom
assessment is mainly about assessment as learning or for learning (and hence is primarily formative in
nature) while examinations and surveys are mainly about assessment of learning (and hence are
primarily summative in nature). These distinctions do not always hold up neatly in practice and hybrid
approaches are becoming more common. For example, Singapore has an assessment system structured
around public examinations, but has built a whole infrastructure of support for learning around it (L.
Benveniste, personal communication, March 2010). Other hybrid activities involve the adaptation of
tools designed for one type of assessment activity (for example, classroom instruments for informing
instruction) for another purpose (for example, documenting performance at the system level). One of
the best known of these initiatives is the Early Grade Reading Assessment (EGRA), an instrument
developed with the support of donor agencies and experts for use in developing countries
(https://www.eddataglobal.org/). Based on a tool originally designed for classroom use, EGRA has been
used to collect system-level data on student performance on early reading skills in order to inform
ministries and donors regarding system needs for improving instruction (Gove and Cvelich, 2011).

Education systems can have quite different profiles in terms of the emphasis placed on the different
types of assessment activities. For example, Finland’s education system emphasizes classroom
assessment as the key source of information on student learning and achievement and draws far less on
examinations or large-scale, system-level assessment. China has traditionally placed considerable
emphasis on examinations as a means to sort and select from its large student population, and relatively
less on classroom assessment or large-scale surveys (although this is changing). 16 Factors contributing to
these different assessment system profiles vary from the official vision and goals of the education
system (and the role of assessment in achieving that vision) to the economic structures and
opportunities in a country and the related information needs of key stakeholders. It is not clear that
there exists one ideal profile for an assessment system that works equally well in all contexts.

Dimension 2. Quality Drivers


Since no single ideal profile can be referenced for a student assessment system, the key consideration is the individual and combined quality of the assessment activities in terms of the adequacy of the information generated to support decision making (Messick, 1989; Shepard, 2000).

There are three main drivers of information quality in an assessment system (AERA, APA, and NCME,
1999; Darling-Hammond and Wentworth, 2010):

• enabling context
• system alignment
• assessment quality.

15 World Bank support for assessment activity over the last 20 years (Larach and Lockheed, 1992; Liberman and
Clarke, 2012) has shifted from an emphasis on examination reform to an emphasis on the implementation of
large-scale, system-level assessment exercises for monitoring achievement trends and informing policy and
practice.
16 Other contributing factors include the historical legacy of assessment in a particular education system, which can create a pull toward a particular type of assessment activity (Madaus, Clarke, and O’Leary, 2003); the capacity of various stakeholders in the system to effectively carry out different types of assessment activities (Greaney and Kellaghan, 2008); and the cost, perceived or real, of assessment activities (Wolff, 2007).

Although closely related, these dimensions are presented here separately for the purposes of discussion.

The enabling context refers to the broader context in which an assessment activity takes place and the
extent to which that context is conducive to, or supportive of, the assessment. It covers such areas as
the legislative or policy framework for assessment activities; leadership surrounding the assessment
activity (including the political will to implement an assessment in spite of the knowledge that results
might reveal serious issues or inequities in student learning); public engagement with the assessment
activity; the institutional arrangements for designing, carrying out, or using the results from the
assessment activity; 17 the availability of sufficient and stable sources of funding; and the presence of competent assessment unit staff and classroom teachers.

The enabling context is important to get right because it is a key driver of the long-term quality and
effectiveness of an assessment system and—like the soil, water, and air that a plant needs to grow—no
assessment system is sustainable in its absence (World Bank, 2010). In most instances, the onus is on
the government to at least provide the vision, leadership, and policy framework toward establishing this
enabling context (at the same time, keeping in mind that relative autonomy from political influence is
one of the hallmarks of a more mature assessment system), which may subsequently be implemented
via public-private partnerships (for example, contracting administration of an assessment program to an
outside firm). Some education systems, particularly in federal contexts, combine forces to create an
enabling context in terms of pooling resources or institutional arrangements for developing,
implementing, analyzing, or reporting on tests (for example, when states or systems come together to
design a common test item bank that each can use for their own purposes, hence reducing the cost for
individual states or systems). Regional assessment exercises, such as SACMEQ, PASEC, and LLECE,
represent another form of collaboration toward creating an enabling context. The efficiencies of scale
achieved by these collaborations make it more cost effective to develop higher-quality tests and to
incorporate technological advances into the testing process.

System alignment refers to the extent to which the assessment is aligned or coherent with other
components of the education system. This includes the connection between assessment activities and
system learning goals, standards, curriculum, and pre- and in-service teacher training opportunities
(Fuhrman and Elmore, 1994; Smith and O’Day, 1991). It is important for assessment activities to align
with the rest of the education system so that the information they provide is of use to improving the
quality of education in the system, and so that synergies can be created.

Alignment involves more than a simple match between what is tested and what is in the official
standards or intended curriculum (at the same time, it is important that most assessment activities
provide at least some information on student learning and achievement in relation to official standards
or curriculum). Hence, while the correspondence between a country’s curriculum and what is tested on
international assessments such as PISA and TIMSS may be low, the assessment might still be aligned
with (and useful for informing) the overall goals and aspirations for the education system and related

17 There is much debate over whether examination or large-scale assessment units should be located within or
outside of education ministries. In fact, the institutional location is not as important as the culture of continuity
and transparency created around the assessment (Ravela et al., 2008). Such a culture is achieved when an
assessment has a clear mandate and solid structure, which necessitates that the assessment system be
underpinned by some kind of legal statute.

reforms. Under such a scenario, assessment can actually lead quality improvements in the education
system rather than simply passively monitor them (notwithstanding that the use of data from TIMSS,
PIRLS, and PISA to monitor the impact of national reforms on performance over time has been key to
the improvement of achievement levels in countries as diverse as Brazil, Jordan, and Poland).

Assessment quality refers to the psychometric quality of the instruments, processes, and procedures
used for the assessment activity (AERA, APA, and NCME, 1999). It is important to note that assessment
quality is a concern for any kind of assessment activity— that is, classroom assessment; examinations; or
large-scale, system-level assessment. It covers such issues as the design and implementation of
assessment activities, examination questions, or survey items; the analysis and interpretation of student
responses to those assessment activities, questions, or items; and the appropriateness of how the
assessment, examination, or survey results are reported and used (Heubert and Hauser, 1999; Shepard,
2000). Depending on the assessment activity, the exact criteria used to make those judgments differ.
Assessment quality is important because if an assessment is not sound in terms of its design,
implementation, analysis, interpretation, reporting, or use, it may contribute to poor decision making with regard to student learning and system quality (Messick, 1989; Wolff, 2007). In fact, poor assessment
quality could undermine the entire assessment exercise if it causes distrust in the approach.

Two technical issues that need to be considered in any review of assessment quality are reliability and validity. Reliability refers to whether the assessment produces consistent and precise information, and is a particularly important consideration for high-stakes examinations and for monitoring trends over time. Validity
pertains to whether the test scores represent what they are supposed to represent and whether they
can be used in the intended ways. One common threat to test score validity is a difference between the
language of instruction and the language of testing, which may make it difficult for a child to show what
they know and can do. Use is a very important concept in relation to validity, and requires a careful
consideration of the consequences of test score use, including the social, economic, and other impacts
on different groups in the population.
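As an illustration of how reliability is often quantified in practice, the sketch below computes Cronbach's alpha, one common internal-consistency index, on a small set of hypothetical item responses. It is a generic psychometric example under stated assumptions, not a procedure prescribed by this framework.

```python
# Illustrative only: Cronbach's alpha, a common internal-consistency index of
# reliability, computed on hypothetical item responses (1 = correct, 0 = incorrect).
from statistics import variance

def cronbach_alpha(scores):
    """scores: list of per-student item lists; rows = students, columns = items."""
    k = len(scores[0])                                     # number of items
    items = [[row[i] for row in scores] for i in range(k)]
    item_var_sum = sum(variance(item) for item in items)   # sum of item variances
    total_var = variance([sum(row) for row in scores])     # variance of total scores
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

# Hypothetical responses for 6 students on a 4-item quiz.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]
print(f"Cronbach's alpha ~= {cronbach_alpha(responses):.2f}")  # ~0.74 for this example
```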

Crossing these quality drivers with the different assessment types/purposes, we arrive at the framework
diagramed in table 1.

Table 1. Framework for Building a More Effective Student Assessment System

Columns (assessment types/purposes): classroom assessment; examinations; large-scale, system-level assessment.
Rows (quality drivers): enabling context; system alignment; assessment quality.

Source: World Bank.

The rest of this paper fleshes out and discusses the use of this framework for building a more effective
assessment system. The framework can be applied to any country’s assessment system as a way to
begin a discussion about where the system appears strong and where more work may be needed.

Fleshing out the Framework


The framework in table 1 is a starting point for identifying indicators that can be used to review
assessment systems and plan for their improvement. Indicators can be identified based on a
combination of criteria, including:

• professional standards for assessment
• empirical research on the characteristics of effective assessment systems, including analysis of the characteristics that differentiate between the assessment systems of low- versus high-performing nations
• theory—that is, general consensus among experts that a given characteristic contributes to effective assessment.

The evidence base is stronger in some areas than in others. For example, there are many professional
standards for assessment quality (APA, AERA, and NCME, 1999), 18 but far fewer for the enabling context.
In addition, some of the empirical research is limited by its correlational nature and hence we must be
cautious about inappropriate attribution or over-interpreting the association between characteristics.
Despite such limitations, evidence from a variety of sources converges quite convincingly to make clear
what better assessment is (and what it is not).

The above criteria and considerations were used to expand the three quality drivers into the broad
indicator areas shown in table 2. These indicator areas are most relevant to examinations and large-
scale, system-level assessment activities, but also can be applied to classroom assessment.

Table 2. Framework for Building a More Effective Student Assessment System, with Broad Indicator Areas

Columns (assessment types/purposes): classroom assessment; examinations; large-scale, system-level assessment. The broad indicator areas below apply across all three columns.

Enabling context: Policies; Leadership and public engagement; Funding; Institutional arrangements; Human resources
System alignment: Learning/quality goals; Curriculum; Pre- and in-service teacher training opportunities
Assessment quality: Ensuring quality (design, administration, analysis); Ensuring effective uses

Source: World Bank.

18 There also is a sizeable research base on system alignment (for example, Fuhrman and Elmore, 1994; Hamilton, Stecher, and Klein, 2002).
Data pertaining to some of these indicator areas can be found in official documents, published reports
(for example, Ferrer, 2006), research articles (for example, Braun and Kanjee, 2006), and online
databases.19 For the most part, however, the relevant data have not been gathered in any
comprehensive or systematic fashion. 20 Those wishing to review this type of information for a particular
assessment system most likely will need to collect the data themselves. In response to this need, the
World Bank has developed a set of standardized questionnaires and rubrics for collecting and evaluating
data on the three assessment types (classroom assessments, examinations, and large-scale, system-level
assessment) and related quality drivers (enabling context, system alignment, assessment quality). The
tools, which are regularly updated on the basis of new evidence and country experiences, are available
at http://www.worldbank.org/education/saber. Countries can use these tools, which build on the
framework and broad indicator areas shown in table 2, to systematically examine and gain a better
understanding of the strengths and weaknesses of their student assessment system and to plan for
where to go next. It is important to point out that the tools primarily focus on benchmarking a country’s
policies, practices, and arrangements for classroom assessment, examinations, and large-scale, system-
level assessment activities at the system-level. Additional tools would be needed to determine actual,
on-the-ground practices by teachers and students in schools.

Levels of Development
The basic structure of the rubrics for evaluating data collected using the standardized questionnaires is
summarized in table 3. The full set of rubrics is provided in appendix 2. The goal of the rubrics is to
provide a country with some sense of the development level of its assessment activities compared to
best or recommended practice in the area.

19 Two of the more useful online databases are http://www.inca.org.uk/ and http://epdc.org/.
20 Brinkley, Guthrie, and Wyatt (1991) surveyed large-scale, system-level assessment and examination practices in OECD countries. Larach and Lockheed (1992) did a similar survey of assessments supported by the World Bank. Macintosh (1994) did a study in 10 countries (Australia, Bahrain, England and Wales, Guatemala, Israel, Malaysia, Namibia, Poland, Scotland, and Slovenia).

Table 3. Basic Structure of Rubrics for Evaluating Data Collected on a Student Assessment System

Columns (development levels): LATENT (absence of, or deviation from, attribute); EMERGING (on way to meeting minimum standard); ESTABLISHED (acceptable minimum standard); ADVANCED (best practice); plus a Justification column.

Rows (dimensions):
EC—ENABLING CONTEXT: EC1—Policies; EC2—Leadership, public engagement; EC3—Funding; EC4—Institutional arrangements; EC5—Human resources
SA—SYSTEM ALIGNMENT: SA1—Learning/quality goals; SA2—Curriculum; SA3—Pre-, in-service teacher training
AQ—ASSESSMENT QUALITY: AQ1—Ensuring quality (design, administration, analysis); AQ2—Ensuring effective uses

Source: World Bank.

For each indicator, the rubric displays four development levels—Latent, Emerging, Established, and
Advanced. 21 These levels are artificially constructed categories chosen to represent key stages on the
underlying continuum for each indicator. Each level is accompanied by a description of what
performance on the indicator looks like at that level. Latent is the lowest level of performance; it
represents absence of, or deviation from, the attribute. Emerging is the next level; it represents partial
presence of the attribute. Established represents the acceptable minimum standard on the indicator and
Advanced represents the ideal or current best practice. Not all questions from the questionnaires are represented in the rubrics; this is because not all of the questions are underpinned by an evidence base that demonstrates a relationship between increasing performance levels on the attribute/indicator and improved quality or effectiveness of assessment activities.

21 The Latent label could be applied to countries where there is no formal assessment activity or where the education system has been suspended due to war or other conflict.

It is important to recognize that many of the issues that we are trying to get at with the indicators and
associated development levels can be difficult to measure. In some instances, explicit technical
standards exist and can be drawn on to aid these measurement efforts (for example, international
standards for determining whether a country’s TIMSS results are sufficiently robust to be included in the
international report). In others, judgment calls need to be made (for example, measuring the degree of
public support for a particular assessment activity). In order to enhance the overall reliability and cross-
system comparability of the indicators and development levels, the questionnaires and rubrics rely, as
much as possible, on objective measures.
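To make the rubric structure concrete, here is a purely hypothetical sketch of how ratings might be recorded and summarized for a single assessment type. The indicator names follow table 3, but the numeric scoring and averaging rule are illustrative assumptions, not the official SABER procedure.

```python
# Hypothetical sketch: record rubric ratings per indicator for one assessment
# type and summarize them; the averaging rule is an illustrative assumption,
# not the official SABER scoring method.
from statistics import mean

LEVELS = {"Latent": 1, "Emerging": 2, "Established": 3, "Advanced": 4}

# Example ratings for a country's large-scale, system-level assessment program.
large_scale_ratings = {
    "EC1-Policies": "Established",
    "EC3-Funding": "Emerging",
    "SA2-Curriculum": "Established",
    "AQ1-Ensuring quality": "Established",
}

def overall_level(ratings):
    """Average the numeric scores and report the nearest development level."""
    avg = mean(LEVELS[level] for level in ratings.values())
    return min(LEVELS, key=lambda name: abs(LEVELS[name] - avg))

# Ratings are summarized per assessment type only; they are not added across
# classroom assessment, examinations, and large-scale assessment.
print(overall_level(large_scale_ratings))  # -> "Established"
```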

In addition to evaluating performance on individual indicators, it can be useful to qualitatively compare an assessment system’s overall characteristics against profiles of assessment systems as they might look at different levels of development. Table 4 outlines generic profiles—drawing on the information provided in table 2 and appendix 2—for assessment systems at Emerging, Established, and Advanced levels of development (Latent is omitted because it basically represents the absence of any assessment activity).

Assessment systems that are at an Emerging level can be characterized as having enabling contexts, as
well as levels of system alignment and assessment quality, that are just taking shape. These systems are
characterized by instability and uncertainty about the choice, frequency, and use of assessment
activities, indicative of an unclear vision for assessment at the system level and uncertain or insufficient
funding for assessment activities. In this context, assessment is more likely to function as an ‘add on’ to
the system, without much systematic effort to align it with standards, curricula, or teacher training
opportunities.

Table 4. Stylized Profiles of Student Assessment Systems at Different Levels of Development

Emerging
• Enabling context: no or limited policy framework or guidelines; weak leadership/public engagement; few trained staff and high turnover; unreliable/irregular funding; unclear or unstable institutional arrangements
• System alignment: assessments not fully aligned with learning/quality goals, standards, curriculum; assessments not aligned with pre- and in-service teacher training opportunities
• Assessment quality: limited awareness or application of technical or professional standards for ensuring assessment quality and effective uses

Established
• Enabling context: presence of clear policy framework or guidelines; strong leadership/public engagement; training programs/trained staff with low turnover; stable/regular funding; clear and stable institutional arrangements
• System alignment: assessments aligned with learning/quality goals, standards, curriculum; assessments aligned with pre- and in-service teacher training opportunities
• Assessment quality: awareness and application of technical or professional standards for ensuring assessment quality and effective uses

Advanced
• The same as for Established, plus a strong focus on: assessment for learning; school-based and classroom assessment; the role of teachers; innovation and research-based practices

Source: World Bank.


Note: The Latent level is omitted because it basically represents the absence of any assessment activity.

Capacity building tends to be nonsystematic and of limited effectiveness as individuals disperse to other
parts of the organization or to the private sector after they have been trained. Assessment activities
tend to be of low quality due to a lack of awareness of, or attention to, professional standards.

Assessment systems that are at an Established level can be characterized as having enabling contexts, as
well as levels of system alignment and assessment quality, that are stable, assured, or consolidated in
nature. These systems are characterized by continuity and certainty about the choice, frequency, and
use of assessment activities, as well as stable and sufficient sources of funding, indicative of a vision and
‘buy in’ for assessment at the system level. In this environment, assessment functions more as an
integral part of the system, with systematic efforts to align it with standards, curricula, or teacher
training opportunities. Capacity building tends to be focused, sustained, and effective and there is low
staff turnover. Assessment activities tend to be of good quality due to awareness of, and attention to,
professional standards. This level may be viewed as the acceptable minimum standard in order for an
assessment system to be effective.

Assessment systems that are at an Advanced level can be characterized as having enabling contexts, as
well as levels of system alignment and assessment quality that are highly developed in nature. In
addition to having the best features of Established systems, Advanced systems are characterized by high

levels of innovation and research-based practices. In this environment, assessment functions as a highly
integral part of the system. Capacity building tends to be very much focused on teachers, in addition to
‘technicians,’ a testimony to a strong emphasis on school-based and classroom assessment (and
reminiscent of the key features of high-performing systems highlighted by Darling-Hammond and
Wentworth in their work).

In reality, assessment systems are likely to be at different levels of development in different areas. For
example, a system may be Established in the area of examinations, but Emerging in the area of large-
scale, system-level assessment, and vice versa. While intuition suggests that it is probably better to be
further along in as many areas as possible, the evidence is unclear as to whether it is necessary to be
functioning at Advanced levels in all areas. Therefore, one might view the Established level as a desirable
minimum outcome to achieve in all areas (which is what we see in the assessment systems of countries
like Finland and Australia), but only aspire beyond that in those areas that most contribute to the
national vision or priorities for education. In line with these considerations, the ratings generated by the
rubrics in appendix 2 are not meant to be additive across assessment types (that is, they are not meant
to be added to create an overall rating for an assessment system; they are only meant to produce an
overall rating for each assessment type).
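
To make this concrete, the short sketch below computes a separate overall rating for each assessment type and deliberately provides no way to sum ratings across types. It is illustrative only: the 1-4 mapping of rubric levels, the equal weighting of quality drivers, and the country profile shown are simplifying assumptions, not the official SABER scoring rules.

# Illustrative sketch (not the official SABER scoring rules): each assessment type
# receives its own overall rating; ratings are never combined across assessment types.

LEVEL_SCORE = {"LATENT": 1, "EMERGING": 2, "ESTABLISHED": 3, "ADVANCED": 4}
SCORE_LEVEL = {1: "LATENT", 2: "EMERGING", 3: "ESTABLISHED", 4: "ADVANCED"}

def overall_rating(driver_levels):
    """Average the quality-driver levels for one assessment type (equal weights assumed)."""
    scores = [LEVEL_SCORE[level] for level in driver_levels.values()]
    return sum(scores) / len(scores)

def development_label(rating):
    """Map a numeric rating back to the nearest development level."""
    return SCORE_LEVEL[min(4, max(1, round(rating)))]

# Hypothetical country profile: the levels below are made up for illustration only.
profile = {
    "Examinations": {"Enabling context": "ESTABLISHED",
                     "System alignment": "ESTABLISHED",
                     "Assessment quality": "EMERGING"},
    "National large-scale assessment": {"Enabling context": "EMERGING",
                                        "System alignment": "EMERGING",
                                        "Assessment quality": "EMERGING"},
}

for assessment_type, drivers in profile.items():
    rating = overall_rating(drivers)
    print(f"{assessment_type}: {rating:.2f} ({development_label(rating)})")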

While it is useful to have an idea of what assessment systems and different assessment types look like at
different development levels, it is equally, if not more, useful to know how to progress through those
levels. Thus, we also need to understand some of the key reforms or inputs that countries have used to
develop more effective assessment systems. Unfortunately, the evidence is sparser in this area, and
further research is needed to flesh out the concrete strategies involved.

Based on the small amount of available evidence, the main factor that seems to characterize systems
that make the shift from Emerging to Established (overall or in a specific assessment area) is a concerted
focus on reforms, inputs, and practices that strengthen the enabling context for assessment (Ferrer,
2006). 22 For example, in their review of World Bank support for assessment projects in client countries,
Larach and Lockheed (1992) found that projects that first focused on improving institutional
arrangements were more likely to succeed—in terms of leading to a sustainable assessment program in
the country—than projects that first tried to improve the technical quality of existing assessment
activities. In line with this finding, in their review of assessment reform efforts in Central and Eastern
European countries, West and Crighton (1999) noted that reforms had a better chance of being
sustained when there was public consensus that change was needed, clear and consistent political
support for change, and sufficient allocation of resources.

The main factor that seems to characterize systems that make the shift from Established to Advanced is
a focus on reforms, inputs, and practices that prioritize the classroom, and teachers and students as the
key actors in assessment (Darling-Hammond and Wentworth, 2010; Shepard, 2000). This relates to the
fact that the most powerful form of assessment, when done correctly, is that carried out by teachers
and students in the course of their daily classroom activities (that is, classroom assessment). Doing this
type of assessment correctly requires a lot of capacity building and focused attention on teacher quality
issues.

22 While it may benefit a system, for a short time, to focus resources on making progress on one specific
quality driver (for example, enabling context), this is not a long-term strategy, as each quality driver is a
necessary contributor to an effective assessment system.

21
Conclusions
Assessment is key to knowing whether an education system is producing the desired outcomes for
students, the economy, and society at large. Without effective assessment, it is impossible to know
whether students are learning and whether reforms are working in the intended ways.

This paper extracted principles and guidelines from countries’ experiences and the current research
base to outline a framework for developing a more effective student assessment system. The framework
provides policy makers and others with an evidence-based structure for discussion and consensus
building around priorities and key inputs for their assessment system.

An important contribution of the framework is to help countries identify the key quality drivers that
need to be addressed in order to strengthen the quality and utility of the information produced by the
various activities in their assessment system. This is critical because the main purpose of any assessment
system is to provide valid and timely information to a set of users—the student, the teacher, the
community, and the policy maker—so that they can make better decisions in support of improved
quality and learning outcomes. Choices about the assessment system need to be consistent with serving
these users and their information and decision-making needs.

The framework also has a dynamic dimension that illustrates the trajectory of moving from one level of
development to the next in each assessment area. It is important to keep in mind that it takes time to
progress from level to level. Case studies on countries’ experiences in strengthening their student
assessment systems reveal that it often takes a decade or more for a set of reforms and inputs to really
take hold and produce tangible results. Therefore, country teams must plan from the outset to have a
long-term commitment to, and investment in, the policies, inputs, and actions that will be required to
transform their assessment system. The payoff will be an assessment system that can support better
decision making and contribute to higher levels of education quality and learning for all.

22
References
Airasian, P., and M. Russell. 2007. Classroom Assessment: Concepts and Applications (6th ed.). New York:
McGraw-Hill.
Au, W. 2007. “High-Stakes Testing and Curricular Control: A Qualitative Metasynthesis.” Educational
Researcher 36(5): 258–67.
American Educational Research Association (AERA), American Psychological Association (APA), and
National Council on Measurement in Education (NCME). 1999. Standards for Educational and
Psychological Testing. Washington, DC: AERA.
Bennett, R. E. 2011. “Formative Assessment: A Critical Review.” Assessment in Education: Principles,
Policy and Practice 18(1): 5–25.
Bishop, J., F. Mane, and M. Bishop. 2001. “Secondary Education in the United States: What Can Others
Learn from Our Mistakes?” CAHRS Working Paper Series. Cornell Center for Advanced Human
Resource Studies (CAHRS).
Black, P., and D. Wiliam. 1998. “Assessment and Classroom Learning.” Assessment in Education:
Principles, Policy and Practice 5(1): 7–73.
Braun, H., and A. Kanjee. 2006. “Using Assessment to Improve Education in Developing Nations.” In J.
Cohen, D. Bloom, and M. Malin, eds., Educating All Children: A Global Agenda. Cambridge, MA:
American Academy of Arts and Sciences.
Bray, M., and L. Steward, eds. 1998. Examination Systems in Small States: Comparative Perspectives on
Policies, Models and Operations. London: The Commonwealth Secretariat.
Brinkley, M., J. Guthrie, and T. Wyatt. 1991. A Survey of National Assessment and Examination Practices
in OECD Countries. Lugano, Switzerland: OECD.
Carnoy, M., and S. Loeb. 2002. “Does External Accountability Affect Student Outcomes? A Cross-State
Analysis.” Educational Evaluation and Policy Analysis 24(4): 305–331.
Clarke, M. 2007. “State Responses to the No Child Left Behind Act: The Uncertain Link between
Implementation and ‘Proficiency for All’.” In C. Kaestle and A. Lodewick, eds., To Educate a Nation:
Federal and National Strategies of School Reform (pp. 144–174). Lawrence: University of Kansas
Press.
Darling-Hammond, L., and L. Wentworth. 2010. Benchmarking Learning Systems: Student Performance
Assessment in International Context. Stanford, CA: Stanford University, Stanford Center for
Opportunity Policy in Education.
Ferrer, G. 2006. Educational Assessment Systems in Latin America: Current Practice and Future
Challenges. Washington, DC: Partnership for Educational Revitalization in the Americas.
Fuchs, L. S., and D. Fuchs. 1986. “Effects of Systematic Formative Evaluation on Student Achievement: A
Meta-Analysis.” Exceptional Children 53: 199–208.
Fuhrman, S., and R. Elmore, eds. 1994. The Governance of Curriculum. Alexandria, VA: ASCD.
Gove, A., and P. Cvelich. 2011. Early Reading: Igniting Education for All. A Report by the Early Grade
Learning Community of Practice. Revised Edition. Research Triangle Park, NC: Research Triangle
Institute.

23
Greaney, V., and T. Kellaghan. 2008. Assessing National Achievement Levels in Education. Washington,
DC: World Bank.
———. 1995. Equity Issues in Public Examinations in Developing Countries. Washington, DC: World Bank.
Hamilton, L., B. Stecher, and S. Klein., eds. 2002. Making Sense of Test-Based Accountability in
Education. Santa Monica, CA: RAND Corporation.
Hanushek, E., and M. Raymond. 2003. “Lessons about the Design of State Accountability Systems.” In P.
Peterson and M. West, eds., No Child Left Behind? The Politics and Practice of Accountability (pp.
127–151). Washington, DC: Brookings Institution Press.
Hanushek, E., and L. Woessmann. 2009. “Schooling, Cognitive Skills, and the Latin American Growth
Puzzle.” Working Paper 15066. Cambridge, MA: National Bureau of Economic Research.
———. 2007. Education Quality and Economic Growth. Washington, DC: World Bank.
Heubert, J., and R. Hauser. 1999. High Stakes: Testing for Tracking, Promotion, and Graduation.
Washington, DC: National Academy Press.
Hill, P. 2010. Examination Systems. Asia-Pacific Secondary Education System Review Series. Bangkok:
UNESCO.
Hoxby, C. 2002. “The Cost of Accountability.” NBER Working Paper Series No. w8855. Cambridge, MA:
National Bureau of Economic Research. Available at SSRN: http://ssrn.com/abstract=305599.
Independent Evaluation Group (IEG). 2006. From Schooling Access to Learning Outcomes: An Unfinished
Agenda. Washington, DC: World Bank.
Kifer, E. 2001. Large-Scale Assessment: Dimensions, Dilemmas, and Policy. Thousand Oaks, CA: Corwin
Press, Inc.
Larach, L., and M. Lockheed. 1992. “World Bank Lending for Educational Testing.” PHREE Background
Paper, 92/62R. Population and Human Resources Department. Washington, DC: World Bank.
Liberman, J., and M. Clarke. 2012. Review of World Bank Support for Assessment Activities in Client
Countries. Unpublished manuscript. Washington, DC: World Bank.
Lockheed, M. 2009. Review of Donor Support for Assessment Capacity Building in Developing Countries.
Unpublished manuscript. Washington, DC: World Bank.
Macintosh, H. 1994. A Comparative Study of Current Theories and Practices in Assessing Students’
Achievements at Primary and Secondary Level. IBE Document Series, Number 4. Geneva,
Switzerland: International Bureau of Education.
Madaus, G., and M. Clarke. 2001. “The Impact of High-Stakes Testing on Minority Students.” In M.
Kornhaber and G. Orfield, eds., Raising Standards or Raising Barriers: Inequality and High Stakes
Testing in Public Education (pp. 85–106). New York: Century Foundation.
Madaus, G., M. Clarke, and M. O’Leary. 2003. “A Century of Standardized Mathematics Testing.” In G.
M.A. Stanic and J. Kilpatrick, eds., A History of School Mathematics (pp. 1311–1434). Reston, VA:
NCTM.
McDermott, K. A. 2011. High-Stakes Reform: The Politics of Educational Accountability. Washington, DC:
Georgetown University Press.
McKinsey & Company. 2007. How the World’s Best Performing School Systems Come Out On Top.
London: McKinsey & Company.

24
Messick, S. 1989. “Validity.” In R. Linn, ed., Educational Measurement (3rd ed.) (pp. 13–103). New York:
American Council on Education/Macmillan.
Organisation for Economic Co-operation and Development (OECD). 2010. The High Cost of Low
Educational Performance. The Long-Run Economic Impact of Improving PISA Outcomes. Paris:
OECD.
Ravela, P. 2005. “A Formative Approach to National Assessments: The Case of Uruguay.” Prospects
35(1): 21–43.
Ravela, P., P. Arregui, G. Valverde, R. Wolfe, G. Ferrer, F. Martinez, M. Aylwin, and L. Wolff. 2008. “The
Educational Assessments that Latin America Needs.” Working Paper Series No. 40. Washington,
DC: Partnership for Educational Revitalization in the Americas (PREAL).
Ravela, P., P. Arregui, G. Valverde, R. Wolfe, G. Ferrer, F. M. Rizo, M. Aylwin, and L. Wolff. 2009. “The
Educational Assessments that Latin America Needs.” Washington, DC: PREAL.
Rodriguez, M. C. 2004. “The Role of Classroom Assessment in Student Performance on TIMSS.” Applied
Measurement in Education 17(1): 1–24.
Shepard, L. 2000. “The Role of Assessment in a Learning Culture.” Educational Researcher 29(7): 4–14.
Smith, M. S., and J. O’Day. 1991. “Systemic School Reform.” In S. H. Fuhrman and B. Malen, eds., The
Politics of Curriculum and Testing, 1990 Yearbook of the Politics of Education Association (pp. 233–
267). London and Washington, DC: Falmer Press.
United Nations Educational, Scientific, and Cultural Organization (UNESCO). 2007. Education for All
Global Monitoring Report 2008: Education for All by 2015. Will We Make It? Paris: UNESCO/Oxford
University Press.
West, R., and J. Crighton. 1999. “Examination Reform in Central and Eastern Europe: Issues and Trends.”
Assessment in Education 6(2): 271–280.
Wolff, L. 2007. The Costs of Student Assessment in Latin America. Washington, DC: PREAL.
World Bank. 2010. Russia Education Aid for Development (READ) Trust Fund Annual Report 2009.
Washington, DC: World Bank.

25
Appendix 1: Assessment Types and Their Key Differences
(National and international assessments are both forms of large-scale, system-level assessment.)

Purpose
  Classroom: To provide immediate feedback to inform classroom instruction
  National: To provide feedback on the overall health of the system at particular grade/age level(s), and to monitor trends in learning
  International: To provide feedback on the comparative performance of the education system at particular grade/age level(s)
  Examinations: To select or certify students as they move from one level of the education system to the next (or into the workforce)

Frequency
  Classroom: Daily
  National: For individual subjects offered on a regular basis (such as every 3-5 years)
  International: For individual subjects offered on a regular basis (such as every 3-5 years)
  Examinations: Annually, and more often where the system allows for repeats

Who is tested?
  Classroom: All students
  National: Sample or census of students at a particular grade or age level(s)
  International: A sample of students at a particular grade or age level(s)
  Examinations: All eligible students

Format
  Classroom: Varies from observation to questioning to paper-and-pencil tests to student performances
  National: Usually multiple choice and short answer
  International: Usually multiple choice and short answer
  Examinations: Usually essay and multiple choice

Coverage of curriculum
  Classroom: All subject areas
  National: Generally confined to a few subjects
  International: Generally confined to one or two subjects
  Examinations: Covers main subject areas

Additional information collected from students?
  Classroom: Yes, as part of the teaching process
  National: Frequently
  International: Yes
  Examinations: Seldom

Scoring
  Classroom: Usually informal and simple
  National: Varies from simple to more statistically sophisticated techniques
  International: Usually involves statistically sophisticated techniques
  Examinations: Varies from simple to more statistically sophisticated techniques
Source: World Bank.

26
Appendix 2. Rubrics for Judging the Development Level of Different Assessment Types

Each rubric distinguishes four development levels: LATENT (absence of, or deviation from, the attribute), EMERGING (on way to meeting minimum standard), ESTABLISHED (acceptable minimum standard), and ADVANCED (best practice). Each rubric also includes a blank Justification column for recording the evidence behind each rating.

Classroom Assessment

Enabling Context & System Alignment (EC & SA)
Overall policy and resource framework within which classroom assessment activity takes place in an education system, and the degree to which classroom assessment activity is coherent with other components of the education system.

EC&SA1—Setting clear guidelines for classroom assessment

  Latent (Q1): There is no system-level document that provides guidelines for classroom assessment.
  Emerging (Q1): There is an informal system-level document that provides guidelines for classroom assessment.
  Established (Q1): There is a formal system-level document that provides guidelines for classroom assessment.
  Advanced: This option does not apply to this dimension.

  Latent: This option does not apply to this dimension.
  Emerging: This option does not apply to this dimension.
  Established (Q3, Q4): The availability of the document is restricted.
  Advanced (Q3, Q4): The document is widely available.

EC&SA2—Aligning classroom assessment with system learning goals

  Latent (Q5): There are no system-wide resources for teachers for classroom assessment.
  Emerging (Q5): There are scarce system-wide resources for teachers for classroom assessment.
  Established (Q5): There are some system-wide resources for teachers for classroom assessment.
  Advanced (Q5): There are a variety of system-wide resources available for teachers for classroom assessment.

  Latent (Q6): There is no official curriculum or standards document.
  Emerging (Q6): There is an official curriculum or standards document, but it is not clear what students are expected to learn or to what level of performance.
  Established (Q6): There is an official curriculum or standards document that specifies what students are expected to learn, but the level of performance required is not clear.
  Advanced (Q6): There is an official curriculum or standards document that specifies what students are expected to learn and to what level of performance.

EC&SA3—Having effective human resources to carry out classroom assessment activities

  Latent (Q7, Q8): There are no system-level mechanisms to ensure that teachers develop skills and expertise in classroom assessment.
  Emerging: This option does not apply to this dimension.
  Established (Q7, Q8): There are some system-level mechanisms to ensure that teachers develop skills and expertise in classroom assessment.
  Advanced (Q7, Q8): There are a variety of system-level mechanisms to ensure that teachers develop skills and expertise in classroom assessment.

Assessment Quality (AQ)
Quality of classroom assessment design, administration, analysis, and use.

AQ1—Ensuring the quality of classroom assessment

  Latent (Q11): Classroom assessment practices suffer from widespread weaknesses, or there is no information available on classroom assessment practices.
  Emerging (Q11): Classroom assessment practices are known to be weak.
  Established (Q11): Classroom assessment practices are known to be of moderate quality.
  Advanced (Q11): Classroom assessment practices are known to be generally of high quality.

  Latent (Q12): There are no mechanisms to monitor the quality of classroom assessment practices.
  Emerging (Q12): There are ad hoc mechanisms to monitor the quality of classroom assessment practices.
  Established (Q12): There are limited systematic mechanisms to monitor the quality of classroom assessment practices.
  Advanced (Q12): There are varied and systematic mechanisms in place to monitor the quality of classroom assessment practices.

AQ2—Ensuring effective uses of classroom assessment

  Latent (Q14): Classroom assessment information is not required to be disseminated to key stakeholders.
  Emerging: This option does not apply to this dimension.
  Established (Q14): Classroom assessment information is required to be disseminated to some key stakeholders.
  Advanced (Q14): Classroom assessment information is required to be disseminated to all key stakeholders.

  Latent (Q15): There are no required uses of classroom assessment to support student learning.
  Emerging (Q15): There are limited required uses of classroom assessment to support student learning.
  Established (Q15): There are adequate required uses of classroom assessment to support student learning, excluding its use as an input for external examination results.
  Advanced (Q15): There are adequate required uses of classroom assessment to support student learning, including its use as an input for external examination results.

Source: World Bank.

28
Examinations

Levels: LATENT (absence of, or deviation from, the attribute); EMERGING (on way to meeting minimum standard); ESTABLISHED (acceptable minimum standard); ADVANCED (best practice). A blank Justification column is provided for recording the evidence behind each rating.

Enabling Context (EC)
Overall framework of policies, leadership, organizational structures, fiscal, and human resources in which assessment activity takes place in an education system and the extent to which that framework is conducive to, or supportive of, the assessment activity.

EC1—Setting clear policies

  Latent (Q3_III): No standardized examination has taken place.
  Emerging (Q3_III): The standardized examination has been operating on an irregular basis.
  Established (Q3_III): The examination is a stable program that has been operating regularly.
  Advanced: This option does not apply to this dimension.

  Latent (Q3): There is no policy document that authorizes the examination.
  Emerging (Q3): There is an informal or draft policy document that authorizes the examination.
  Established (Q3): There is a formal policy document that authorizes the examination.
  Advanced: This option does not apply to this dimension.

  Latent: This option does not apply to this dimension.
  Emerging (Q5): The policy document is not available to the public.
  Established (Q5): The policy document is available to the public.
  Advanced: This option does not apply to this dimension.

  Latent: This option does not apply to this dimension.
  Emerging: This option does not apply to this dimension.
  Established (Q6): The policy document addresses some key aspects of the examination.
  Advanced (Q6): The policy document addresses all key aspects of the examination.

EC2—Having strong leadership

  Latent (Q8): All stakeholder groups strongly oppose the examination.
  Emerging (Q8): Most stakeholder groups oppose the examination.
  Established (Q8): Most stakeholder groups support the examination.
  Advanced (Q8): All stakeholder groups support the examination.

  Latent (Q9): There are no attempts to improve the examination by stakeholder groups.
  Emerging: This option does not apply to this dimension.
  Established (Q9): There are independent attempts to improve the examination by stakeholder groups.
  Advanced (Q9): There are coordinated attempts to improve the examination by stakeholder groups.

  Latent (Q10): Efforts to improve the examination are not welcomed by the leadership in charge of the examination.
  Emerging: This option does not apply to this dimension.
  Established (Q10): Efforts to improve the examination are generally welcomed by the leadership in charge of the examination.
  Advanced: This option does not apply to this dimension.

EC3—Having regular funding

  Latent (Q11): There is no funding allocated for the examination.
  Emerging (Q11): There is irregular funding allocated for the examination.
  Established (Q11): There is regular funding allocated for the examination.
  Advanced: This option does not apply to this dimension.

  Latent: This option does not apply to this dimension.
  Emerging (Q12): Funding covers some core examination activities: design, administration, data processing or reporting.
  Established (Q12): Funding covers all core examination activities: design, administration, data processing, and reporting.
  Advanced: This option does not apply to this dimension.

  Latent: This option does not apply to this dimension.
  Emerging (Q12): Funding does not cover research and development.
  Established: Does not apply.
  Advanced (Q12): Funding covers research and development.

EC4—Having strong organizational structures

  Latent (Q14): The examination office does not exist or is newly established.
  Emerging (Q14): The examination office is newly established.
  Established (Q14): The examination office is a stable organization.
  Advanced: This option does not apply to this dimension.

  Latent (Q15): The examination office is not accountable to an external board or agency.
  Emerging: This option does not apply to this dimension.
  Established (Q15): The examination office is accountable to an external board or agency.
  Advanced: This option does not apply to this dimension.

  Latent (Q16): Examination results are not recognized by any certification or selection system.
  Emerging (Q16): Examination results are recognized by the certification or selection system in the country.
  Established (Q16): Examination results are recognized by one certification or selection system in another country.
  Advanced (Q16): Examination results are recognized by two or more certification or selection systems in another country.

  Latent (Q17): The examination office does not have the required facilities to carry out the examination.
  Emerging (Q17): The examination office has some of the required facilities to carry out the examination.
  Established (Q17): The examination office has all of the required facilities to carry out the examination.
  Advanced (Q17): The examination office has state-of-the-art facilities to carry out the examination.

EC5—Having effective human resources

  Latent (Q18): There is no staff to carry out the examination.
  Emerging (Q18, Q19): The examination office is inadequately staffed to effectively carry out the examination; issues are pervasive.
  Established (Q18, Q19): The examination office is adequately staffed to carry out the examination effectively, with minimal issues.
  Advanced (Q18, Q19): The examination office is adequately staffed to carry out the examination effectively, with no issues.

  Latent (Q20): The country/system does not offer opportunities that prepare for work on the examination.
  Emerging: This option does not apply to this dimension.
  Established (Q20): The country/system offers some opportunities that prepare for work on the examination.
  Advanced (Q20): The country/system offers a wide range of opportunities that prepare for work on the examination.

System Alignment (SA)
Degree to which the assessment is coherent with other components of the education system.

SA1—Aligning examinations with learning goals and opportunities to learn

  Latent (Q21): It is not clear what the examination measures.
  Emerging: This option does not apply to this dimension.
  Established (Q21): There is a clear understanding of what the examination measures.
  Advanced: This option does not apply to this dimension.

  Latent (Q22): What the examination measures is questioned by some stakeholder groups.
  Emerging: This option does not apply to this dimension.
  Established (Q22): What is measured by the examination is largely accepted by stakeholder groups.
  Advanced: This option does not apply to this dimension.

  Latent (Q23, Q24): Material to prepare for the examination is minimal and it is accessible to very few students.
  Emerging (Q23, Q24): There is some material to prepare for the examination that is accessible to some students.
  Established (Q23, Q24): There is comprehensive material to prepare for the examination that is accessible to most students.
  Advanced (Q23, Q24): There is comprehensive material to prepare for the examination that is accessible to all students.

SA2—Providing teachers with opportunities to learn about the examination

  Latent (Q25): There are no courses or workshops on examinations available to teachers.
  Emerging (Q25): There are no up-to-date courses or workshops on examinations available to teachers.
  Established (Q25): There are up-to-date voluntary courses or workshops on examinations available to teachers.
  Advanced (Q25): There are up-to-date compulsory courses or workshops on examinations for teachers.

  Latent (Q26): Teachers are excluded from all examination-related tasks.
  Emerging (Q26): Teachers are involved in very few examination-related tasks.
  Established (Q26): Teachers are involved in some examination-related tasks.
  Advanced (Q26): Teachers are involved in most examination-related tasks.

Assessment Quality (AQ)
Degree to which the assessment meets quality standards, is fair, and is used in an effective way.

AQ1—Ensuring quality

  Latent (Q27): There is no technical report or other documentation.
  Emerging (Q27): There is some documentation on the examination, but it is not in a formal report format.
  Established (Q27): There is a comprehensive technical report, but with restricted circulation.
  Advanced (Q27): There is a comprehensive, high-quality technical report available to the general public.

  Latent (Q28): There are no mechanisms in place to ensure the quality of the examination.
  Emerging: This option does not apply to this dimension.
  Established (Q28): There are limited systematic mechanisms in place to ensure the quality of the examination.
  Advanced (Q28): There are varied and systematic mechanisms in place to ensure the quality of the examination.

AQ2—Ensuring fairness

  Latent (Q29): Inappropriate behavior surrounding the examination process is high.
  Emerging (Q29): Inappropriate behavior surrounding the examination process is moderate.
  Established (Q29): Inappropriate behavior surrounding the examination process is low.
  Advanced (Q29): Inappropriate behavior surrounding the examination process is marginal.

  Latent (Q30): The examination results lack credibility for all stakeholder groups.
  Emerging (Q30): The examination results are credible for some stakeholder groups.
  Established (Q30): The examination results are credible for all stakeholder groups.
  Advanced: This option does not apply to this dimension.

  Latent (Q31, Q32): The majority of students (over 50%) may not take the examination because of language, gender, or other equivalent barriers.
  Emerging (Q31, Q32): A significant proportion of students (10%-50%) may not take the examination because of language, gender, or other equivalent barriers.
  Established (Q31, Q32): A small proportion of students (less than 10%) may not take the examination because of language, gender, or other equivalent barriers.
  Advanced (Q31): All students can take the examination; there are no language, gender, or other equivalent barriers.

AQ3—Using examination information in a fair way

  Latent (Q33): Examination results are not used in an appropriate way by all stakeholder groups.
  Emerging (Q33): Examination results are used by some stakeholder groups in an appropriate way.
  Established (Q33): Examination results are used by most stakeholder groups in an appropriate way.
  Advanced (Q33): Examination results are used by all stakeholder groups in an appropriate way.

  Latent (Q34): Student names and results are made public.
  Emerging: This option does not apply to this dimension.
  Established (Q34): Student results are confidential.
  Advanced: This option does not apply to this dimension.

AQ4—Ensuring positive consequences of the examination

  Latent (Q35): There are no options for students who do not perform well on the examination, or students must leave the education system.
  Emerging (Q35): There are very limited options for students who do not perform well on the examination.
  Established (Q35): There are some options for students who do not perform well on the examination.
  Advanced (Q35): There are a variety of options for students who do not perform well on the examination.

  Latent (Q36): There are no mechanisms in place to monitor the consequences of the examination.
  Emerging: This option does not apply to this dimension.
  Established (Q36): There are some mechanisms in place to monitor the consequences of the examination.
  Advanced (Q36): There are a variety of mechanisms in place to monitor the consequences of the examination.

Source: World Bank.

32
National Large-Scale Assessment (NLSA)

Levels: LATENT (absence of, or deviation from, the attribute); EMERGING (on way to meeting minimum standard); ESTABLISHED (acceptable minimum standard); ADVANCED (best practice). A blank Justification column is provided for recording the evidence behind each rating.

Enabling Context (EC)
Overall framework of policies, leadership, organizational structures, fiscal, and human resources in which NLSA activity takes place in an education system and the extent to which that framework is conducive to, or supportive of, the NLSA activity.

EC1—Setting clear policies for NLSA

  Latent (Q3_III): No NLSA exercise has taken place.
  Emerging (Q3_III): The NLSA has been operating on an irregular basis.
  Established (Q3_III): The NLSA is a stable program that has been operating regularly.
  Advanced: This option does not apply to this dimension.

  Latent (Q5): There is no policy document pertaining to NLSA.
  Emerging (Q5): There is an informal or draft policy document that authorizes the NLSA.
  Established (Q5): There is a formal policy document that authorizes the NLSA.
  Advanced: This option does not apply to this dimension.

  Latent: Does not apply.
  Emerging (Q7): The policy document is not available to the public.
  Established (Q7): The policy document is available to the public.
  Advanced: This option does not apply to this dimension.

  Latent (Q8): There is no plan for NLSA activity.
  Emerging: This option does not apply to this dimension.
  Established (Q8, Q9): There is a general understanding that the NLSA will take place.
  Advanced (Q8, Q9): There is a written NLSA plan for the coming years.

EC2—Having strong public engagement for NLSA

  Latent (Q11, Q12): All stakeholder groups strongly oppose the NLSA.
  Emerging (Q11, Q12): Some stakeholder groups oppose the NLSA.
  Established (Q11, Q12): Most stakeholder groups support the NLSA.
  Advanced (Q11, Q12): All stakeholder groups support the NLSA.

EC3—Having regular funding for NLSA

  Latent (Q13): There is no funding allocated to the NLSA.
  Emerging (Q13): There is irregular funding allocated to the NLSA.
  Established (Q13): There is regular funding allocated to the NLSA.
  Advanced: This option does not apply to this dimension.

  Latent: Does not apply.
  Emerging (Q14): Funding covers some core NLSA activities: design, administration, analysis or reporting.
  Established (Q14): Funding covers all core NLSA activities: design, administration, analysis, and reporting.
  Advanced: This option does not apply to this dimension.

  Latent: Does not apply.
  Emerging (Q14): Funding does not cover research and development activities.
  Established: This option does not apply to this dimension.
  Advanced (Q14): Funding covers research and development activities.

EC4—Having strong organizational structures for NLSA

  Latent (Q15): There is no NLSA office, ad hoc unit, or team.
  Emerging (Q15): The NLSA office is a temporary agency or group of people.
  Established (Q15): The NLSA office is a permanent agency, institution, or unit.
  Advanced: This option does not apply to this dimension.

  Latent: This option does not apply to this dimension.
  Emerging (Q16, Q17): Political considerations regularly hamper technical considerations.
  Established (Q16, Q17): Political considerations sometimes hamper technical considerations.
  Advanced (Q16, Q17): Political considerations never hamper technical considerations.

  Latent: This option does not apply to this dimension.
  Emerging (Q18, Q19): The NLSA office is not accountable to a clearly recognized body.
  Established (Q18, Q19): The NLSA office is accountable to a clearly recognized body.
  Advanced: This option does not apply to this dimension.

EC5—Having effective human resources for NLSA

  Latent (Q20): There is no staff allocated for running an NLSA.
  Emerging (Q20, Q21): The NLSA office is inadequately staffed to effectively carry out the assessment.
  Established (Q20, Q21): The NLSA office is adequately staffed to carry out the NLSA effectively, with minimal issues.
  Advanced (Q20, Q21): The NLSA office is adequately staffed to carry out the NLSA effectively, with no issues.

  Latent (Q22): The country/system does not offer opportunities that prepare individuals for work on NLSA.
  Emerging: This option does not apply to this dimension.
  Established (Q22): The country/system offers some opportunities to prepare individuals for work on the NLSA.
  Advanced (Q22): The country/system offers a wide range of opportunities to prepare individuals for work on the NLSA.

System Alignment (SA)
Degree to which the NLSA is coherent with other components of the education system.

SA1—Aligning the NLSA with learning goals

  Latent (Q23): It is not clear if the NLSA is based on curriculum or learning standards.
  Emerging: This option does not apply to this dimension.
  Established (Q23): The NLSA measures performance against curriculum or learning standards.
  Advanced: This option does not apply to this dimension.

  Latent (Q24): What the NLSA measures is generally questioned by stakeholder groups.
  Emerging: This option does not apply to this dimension.
  Established (Q24): What the NLSA measures is questioned by some stakeholder groups.
  Advanced (Q24): What the NLSA measures is largely accepted by stakeholder groups.

  Latent (Q25): There are no mechanisms in place to ensure that the NLSA accurately measures what it is supposed to measure.
  Emerging (Q25, Q26): There are ad hoc reviews of the NLSA to ensure that it measures what it is intended to measure.
  Established (Q25, Q26): There are regular internal reviews of the NLSA to ensure that it measures what it is intended to measure.
  Advanced: This option does not apply to this dimension.

SA2—Providing teachers with opportunities to learn about the NLSA

  Latent (Q27): There are no courses or workshops on the NLSA.
  Emerging (Q27, Q28): There are occasional courses or workshops on the NLSA.
  Established (Q27, Q28): There are some courses or workshops on the NLSA offered on a regular basis.
  Advanced (Q27, Q28): There are widely available high-quality courses or workshops on the NLSA offered on a regular basis.

Assessment Quality (AQ)
Degree to which the NLSA meets technical standards, is fair, and is used in an effective way.

AQ1—Ensuring the quality of the NLSA

  Latent (Q29): No options are offered to include all groups of students in the NLSA.
  Emerging: This option does not apply to this dimension.
  Established (Q29): At least one option is offered to include all groups of students in the NLSA.
  Advanced (Q29): Different options are offered to include all groups of students in the NLSA.

  Latent (Q30): There are no mechanisms in place to ensure the quality of the NLSA.
  Emerging: This option does not apply to this dimension.
  Established (Q30): There are some mechanisms in place to ensure the quality of the NLSA.
  Advanced (Q30): There are a variety of mechanisms in place to ensure the quality of the NLSA.

  Latent (Q31): There is no technical report or other documentation about the NLSA.
  Emerging (Q31): There is some documentation about the technical aspects of the NLSA, but it is not in a formal report format.
  Established (Q31): There is a comprehensive technical report, but with restricted circulation.
  Advanced (Q31): There is a comprehensive, high-quality technical report available to the general public.

AQ2—Ensuring effective uses of the NLSA

  Latent (Q32): NLSA results are not disseminated.
  Emerging (Q32): NLSA results are poorly disseminated.
  Established (Q32): NLSA results are disseminated in an effective way.
  Advanced: This option does not apply to this dimension.

  Latent (Q33): NLSA information is not used, or is used in ways inconsistent with the purposes or the technical characteristics of the assessment.
  Emerging: This option does not apply to this dimension.
  Established (Q33): NLSA results are used by some stakeholder groups in a way that is consistent with the purposes and technical characteristics of the assessment.
  Advanced (Q33): NLSA information is used by all stakeholder groups in a way that is consistent with the purposes and technical characteristics of the assessment.

  Latent (Q34): There are no mechanisms in place to monitor the consequences of the NLSA.
  Emerging: This option does not apply to this dimension.
  Established (Q34): There are some mechanisms in place to monitor the consequences of the NLSA.
  Advanced (Q34): There are a variety of mechanisms in place to monitor the consequences of the NLSA.

Source: World Bank.

35
International Large-Scale Assessment (ILSA)

Levels: LATENT (absence of, or deviation from, the attribute); EMERGING (on way to meeting minimum standard); ESTABLISHED (acceptable minimum standard); ADVANCED (best practice). A blank Justification column is provided for recording the evidence behind each rating.

Enabling Context (EC)
Overall framework of policies, leadership, organizational structures, fiscal and human resources in which ILSA takes place in an education system and the extent to which that framework is conducive to, or supportive of, the ILSA activity.

EC1—Setting clear policies for ILSA

  Latent (Q1, Q2): The country/system has not participated in an ILSA in the last 10 years.
  Emerging: This option does not apply to this dimension.
  Established (Q1, Q2): The country/system has participated in at least one ILSA in the last 10 years.
  Advanced (Q1, Q2): The country/system has participated in two or more ILSA in the last 10 years.

  Latent (Q3): The country/system has not taken concrete steps to participate in an ILSA in the next 5 years.
  Emerging: This option does not apply to this dimension.
  Established (Q3): The country/system has taken concrete steps to participate in at least one ILSA in the next 5 years.
  Advanced: This option does not apply to this dimension.

  Latent (Q5): There is no policy document that addresses participation in ILSA.
  Emerging (Q5): There is an informal or draft policy document that addresses participation in ILSA.
  Established (Q5): There is a formal policy document that addresses participation in ILSA.
  Advanced: This option does not apply to this dimension.

  Latent: Does not apply.
  Emerging (Q7): The policy document is not available to the public.
  Established (Q7): The policy document is available to the public.
  Advanced: This option does not apply to this dimension.

EC2—Having regular funding for ILSA

  Latent (Q8): There is no funding for participation in ILSA.
  Emerging (Q9): There is funding from loans or external donors.
  Established (Q9): There is regular funding allocated at discretion.
  Advanced (Q9): There is regular funding approved by law, decree or norm.

  Latent: This option does not apply to this dimension.
  Emerging (Q10): Funding covers some core activities of the ILSA.
  Established (Q10): Funding covers all core activities of the ILSA.
  Advanced: This option does not apply to this dimension.

  Latent (Q10): Funding does not cover research and development activities.
  Emerging: This option does not apply to this dimension.
  Established: This option does not apply to this dimension.
  Advanced (Q10): Funding covers research and development activities.

EC3—Having effective human resources for ILSA

  Latent (Q11, Q12): There is no team or national/system coordinator to carry out the ILSA activities.
  Emerging (Q11, Q12): There is a team or national/system coordinator to carry out the ILSA activities.
  Established (Q11, Q12): There is a team and national/system coordinator to carry out the ILSA activities.
  Advanced: This option does not apply to this dimension.

  Latent: This option does not apply to this dimension.
  Emerging (Q13): The national/system coordinator or other designated team member is not fluent in the official language of the ILSA exercise.
  Established (Q13): The national/system coordinator is fluent in the official language of the ILSA exercise.
  Advanced: This option does not apply to this dimension.

  Latent: This option does not apply to this dimension.
  Emerging (Q13, Q14, Q15): The ILSA office is inadequately staffed or trained to carry out the assessment effectively.
  Established (Q13, Q14, Q15): The ILSA office is adequately staffed or trained to carry out the ILSA effectively, with minimal issues.
  Advanced (Q13, Q14, Q15): The ILSA office is adequately staffed and trained to carry out the ILSA effectively, with no issues.

System Alignment (SA)
Degree to which the ILSA is coherent with other components of the education system.

SA1—Providing opportunities to learn about ILSA

  Latent (Q14): The ILSA team has not attended international workshops or meetings.
  Emerging (Q14): The ILSA team attended some international workshops or meetings.
  Established (Q14): The ILSA team attended all international workshops or meetings.
  Advanced: This option does not apply to this dimension.

  Latent (Q16): The country/system offers no opportunities to learn about ILSA.
  Emerging: This option does not apply to this dimension.
  Established (Q16, Q17): The country/system offers some opportunities to learn about ILSA.
  Advanced (Q16, Q17): The country/system offers a wide range of opportunities to learn about ILSA.

  Latent: This option does not apply to this dimension.
  Emerging: This option does not apply to this dimension.
  Established (Q18): Opportunities to learn about ILSA are available to the country's/system's ILSA team members only.
  Advanced (Q18): Opportunities to learn about ILSA are available to a wide audience, in addition to the country's/system's ILSA team members.

Assessment Quality (AQ)
Degree to which the ILSA meets technical quality standards, is fair, and is used in an effective way.

AQ1—Ensuring the quality of ILSA

  Latent (Q19): Data from the ILSA has not been published.
  Emerging (Q19): The country/system met sufficient standards to have its data presented beneath the main display of the international report or in an annex.
  Established (Q19): The country/system met all technical standards required to have its data presented in the main displays of the international report.
  Advanced: This option does not apply to this dimension.

  Latent (Q20): The country/system has not contributed new knowledge on ILSA.
  Emerging: This option does not apply to this dimension.
  Established: This option does not apply to this dimension.
  Advanced (Q20): The country/system has contributed new knowledge on ILSA.

AQ2—Ensuring effective uses of ILSA

  Latent (Q21, Q22): If any, country/system-specific results and information are not disseminated in the country/system.
  Emerging (Q21, Q22): Country/system-specific results and information are disseminated irregularly in the country/system.
  Established (Q21, Q22): Country/system-specific results and information are regularly disseminated in the country/system.
  Advanced (Q21, Q22): Country/system-specific results and information are regularly and widely disseminated in the country/system.

  Latent (Q21, Q23): Products to provide feedback to schools and educators about the ILSA results are not made available.
  Emerging: This option does not apply to this dimension.
  Established (Q21, Q23): Products to provide feedback to schools and educators about the ILSA results are sometimes made available.
  Advanced (Q21, Q23): Products to provide feedback to schools and educators about ILSA results are systematically made available.

  Latent (Q24): There is no media coverage of the ILSA results.
  Emerging (Q24): There is limited media coverage of the ILSA results.
  Established (Q24): There is some media coverage of the ILSA results.
  Advanced (Q24): There is wide media coverage of the ILSA results.

  Latent (Q25, Q26): If any, country/system-specific results and information from the ILSA are not used to inform decision making in the country/system.
  Emerging (Q26): Results from the ILSA are used in a limited way to inform decision making in the country/system.
  Established (Q26): Results from the ILSA are used in some ways to inform decision making in the country/system.
  Advanced (Q26): Results from the ILSA are used in a variety of ways to inform decision making in the country/system.

  Latent (Q27): It is not clear that decisions based on ILSA results have had a positive impact on students' achievement levels.
  Emerging: This option does not apply to this dimension.
  Established: This option does not apply to this dimension.
  Advanced (Q27): Decisions based on the ILSA results have had a positive impact on students' achievement levels.

Source: World Bank.

38
Appendix 3. Example of Using the Rubrics to Evaluate a National Large-Scale Assessment Program

COUNTRY X: National Large-Scale Assessment (NLSA)

Overall rubric score: 2.32; adjusted score (with constraints): 2.11; default weight: 1; preliminary level of development (based on adjusted score): EMERGING.

Scores follow the four levels of the NLSA rubric in appendix 2 (1 = Latent, 2 = Emerging, 3 = Established, 4 = Advanced); the full level descriptors are not repeated here. Constraint flags and default weights are reproduced as they appear in the original table.

Enabling Context (EC): rubric score 2.63; adjusted score (with constraint) 2; weight 0.33; level Emerging.
Overall framework of policies, leadership, organizational structures, fiscal, and human resources in which NLSA activity takes place in an education system and the extent to which that framework is conducive to, or supportive of, the NLSA activity.

EC1—Setting clear policies for NLSA: score 2; adjusted score 2; weight 0.2
- (Q3_III) Score 3, weight 0.25, constraint. In 2009, the NLSA program in Country X was operating on a regular basis. However, funding for the various NLSA exercises was being sourced from different donors, and the assessments were taking place roughly every 3 to 4 years.
- (Q5) Score 1, weight 0.25, constraint. In 2009, Country X did not have any kind (formal, informal, draft) of policy document on NLSA activity.
- (Q7) Score 1, weight 0.25. There was no policy document available in 2009.
- (Q8, Q9) Score 3, weight 0.25. Although there was no formal policy document underpinning the NLSA in 2009, there was a general understanding that the NLSA would take place every 3 to 4 years.

EC2—Having strong public engagement for NLSA: score 4; weight 0.2
- (Q11, Q12) Score 4, weight 1. Based on our information, there is no opposition to the NLSA.

EC3—Having regular funding for NLSA: score 2; adjusted score 2; weight 0.2
- (Q13) Score 2, weight 0.33, constraint. NLSA activity is partially funded by the MOE and partially by donors. The funding is still ad hoc.
- (Q14, core activities) Score 2, weight 0.33. Funding has tended to cover only basic aspects of NLSA activities. Sometimes, there has been insufficient funding to cover all core activities.
- (Q14, research and development) Score 2, weight 0.33. Funding has primarily focused on supporting the actual carrying out of NLSA activities and not on R&D or secondary analysis.

EC4—Having strong organizational structures for NLSA: score 2.67; adjusted score 2; weight 0.2
- (Q15) Score 2, weight 0.33, constraint. In 2009, the NLSA team comprised a small number of staff (4), some of whom had no background or training in NLSA. There was no permanent unit, and an institutional home was still being worked out.
- (Q16, Q17) Score 4, weight 0.33. There is no precedent of political considerations hampering technical considerations.
- (Q18, Q19) Score 2, weight 0.33. In 2009, the NLSA office was not accountable to a clearly recognized body. This is because it was in transition from one institutional home to another.

EC5—Having effective human resources for NLSA: score 2.5; weight 0.2
- (Q20, Q21) Score 2, weight 0.5. In 2009, the NLSA office did not have sufficient staff to effectively carry out NLSA activities.
- (Q22) Score 3, weight 0.5. There were some large-scale assessment- and measurement-related courses offered by the main university in Country X.

System Alignment (SA): rubric score 2; weight 0.33; level Emerging.
Degree to which the NLSA is coherent with other components of the education system.

SA1—Aligning the NLSA with learning goals: score 3; weight 0.5
- (Q23) Score 3, weight 0.33. The NLSA was aligned with existing curriculum and standards.
- (Q24) Score 4, weight 0.33. The MOE and other stakeholders have accepted the NLSA.
- (Q25, Q26) Score 2, weight 0.33. In 2009, there were some procedures in place for reviewing the alignment of the NLSA test with the constructs/content it was intended to measure, but these procedures were not formalized or standardized.

SA2—Providing teachers with opportunities to learn about the NLSA: score 1; weight 0.5
- (Q27, Q28) Score 1, weight 1. The only courses or workshops associated with previous NLSA exercises have been for policymakers and high-level educators, and not for classroom teachers.

Assessment Quality (AQ): rubric score 2.33; weight 0.33; level Emerging or Established.
Degree to which the NLSA meets technical standards, is fair, and is used in an effective way.

AQ1—Ensuring the quality of the NLSA: score 2.67; weight 0.5
- (Q29) Score 3, weight 0.33. The NLSA is translated into the language of instruction for each region.
- (Q30) Score 3, weight 0.33. In 2009, there were some procedures in place for reviewing the alignment of the NLSA test with the constructs/content it was intended to measure. This would allow us to say that there were 'some mechanisms in place to ensure the quality of the NLSA.'
- (Q31) Score 2, weight 0.33. In 2009, no formal technical reports were available for the NLSA.

AQ2—Ensuring effective uses of the NLSA: score 2; weight 0.5
- (Q32) Score 2, weight 0.33. In 2009, NLSA results were not being widely disseminated to key stakeholders. Few copies of the report were available.
- (Q33) Score 3, weight 0.33. In 2009, the NLSA results were, to a certain extent, used for curriculum development and teacher training.
- (Q34) Score 1, weight 0.33. In 2009, there were no mechanisms in place to monitor the consequences of the NLSA.

Source: World Bank.
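
For readers who want to trace the arithmetic, the sketch below reproduces the summary scores for Country X from the indicator scores and default weights above. It is a reconstruction rather than an official scoring tool: the weighted-average roll-up is inferred from the fact that it reproduces the reported figures, and the rule by which the constraint flags cap the Enabling Context driver at 2 (Emerging) is not spelled out in the table, so that capped value is simply taken as given.

# Reconstruction of the Country X roll-up arithmetic (indicator scores and weights copied
# from the table above; the weighted-average rule is an inference that reproduces 2.32/2.11).

def weighted_mean(pairs):
    """Weighted mean of (score, weight) pairs."""
    return sum(s * w for s, w in pairs) / sum(w for _, w in pairs)

ec = weighted_mean([
    (weighted_mean([(3, .25), (1, .25), (1, .25), (3, .25)]), .2),   # EC1 = 2.00
    (4, .2),                                                         # EC2 = 4.00
    (weighted_mean([(2, .33), (2, .33), (2, .33)]), .2),             # EC3 = 2.00
    (weighted_mean([(2, .33), (4, .33), (2, .33)]), .2),             # EC4 = 2.67
    (weighted_mean([(2, .5), (3, .5)]), .2),                         # EC5 = 2.50
])                                                                   # EC  = 2.63

sa = weighted_mean([
    (weighted_mean([(3, .33), (4, .33), (2, .33)]), .5),             # SA1 = 3.00
    (1, .5),                                                         # SA2 = 1.00
])                                                                   # SA  = 2.00

aq = weighted_mean([
    (weighted_mean([(3, .33), (3, .33), (2, .33)]), .5),             # AQ1 = 2.67
    (weighted_mean([(2, .33), (3, .33), (1, .33)]), .5),             # AQ2 = 2.00
])                                                                   # AQ  = 2.33

preliminary = weighted_mean([(ec, .33), (sa, .33), (aq, .33)])       # 2.32
# Constraint flags cap the Enabling Context driver at 2 (Emerging); the capped value
# is taken directly from the table rather than derived from an explicit rule.
adjusted = weighted_mean([(2.0, .33), (sa, .33), (aq, .33)])         # 2.11

print(f"Preliminary score: {preliminary:.2f}  Adjusted score: {adjusted:.2f}")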

43
