From Tests To Portfolios: A Guide For Educational Assessment
Abstract: This essay explores the various assessment methods and tools used in educational
settings to evaluate student learning, performance, and progress. It covers methods such as
formative and summative assessments, including exams, quizzes, and standardized tests, as
well as alternative approaches like portfolios, self-assessments, peer reviews, and project-
based assessments. The essay also discusses digital tools and technologies that enhance
assessment practices and data analytics. By analyzing the strengths and limitations of each
method, the essay highlights the importance of a balanced and comprehensive approach to
assessment in education.
The role of testing and assessment in education
Testing and assessment play a crucial role in education by measuring student learning,
guiding instruction, and evaluating overall educational progress. They help identify strengths
and weaknesses, allowing educators to tailor teaching methods to meet individual needs.
Assessment tools such as quizzes, standardized tests, and performance tasks provide feedback
to both students and teachers, fostering continuous improvement. Moreover, testing can
inform curriculum development and policy decisions, ensuring that educational programs are
effective and aligned with learning goals. Ultimately, assessments aim to support student
growth and ensure academic success.
Achievement tests
Achievement tests are designed to measure accomplishment: specifically, the degree of
learning that has taken place as a result of exposure to a
relatively defined learning experience. “Relatively defined learning experience” may mean
something as broad as what was learned from four years of college, or something much
narrower, such as how to prepare dough for use in making pizza. In most educational settings,
achievement tests are used to measure student progress toward instructional objectives,
compare an individual’s accomplishment to peers, and help determine what instructional
activities and strategies might best propel the students toward educational objectives. An
achievement test may be standardized locally, regionally, or nationally, or it may not be
standardized at all. Like other tests, achievement tests vary widely in their psychometric
soundness. A sound achievement test is one that adequately samples the targeted subject
matter and reliably measures the extent to which the examinees have learned it. Scores on
achievement tests may be used for various purposes. Achievement test data can help gauge
the quality of instruction in a particular class, school, school district, or state. Achievement
tests are sometimes used to screen for difficulties.
Measures of General Achievement
Measures of general achievement assess learning in various academic areas, often
using achievement batteries, which consist of multiple subtests. These tests may be group- or
individually administered and can cover broad subjects like reading, math, and science, or
focus on specific skills. Examples include the Wide Range Achievement Test (Wilkinson &
ASSESSMENT FOR EDUCATION 2
Robertson, 2006) and the Sequential Tests of Educational Progress (STEP). Some batteries
offer norm-referenced and criterion-referenced analysis or are used to identify gifted children.
The Wechsler Individual Achievement Test (WIAT-III) (Psychological Corporation, 2009) is
popular in schools and research settings. When selecting tests, it's crucial to ensure
psychometric soundness, minimize bias, and use up-to-date content.
Measures of Achievement in Specific Subject Areas
Achievement tests in specific subject areas are often teacher-made, though some
standardized tests exist. At the elementary level, tests focus on basic skills like reading,
writing, and arithmetic. Reading tests, for instance, assess comprehension, vocabulary, speed,
and accuracy. At the secondary level, tests like the Cooperative Achievement Test cover
subjects such as English, math, and science. For college students, there is a growing interest
in outcome assessments to ensure educational value, while placement tests like Advanced
Placement (AP) and CLEP allow students to earn college credits. For adults, tests like the
Adult Basic Learning Examination (ABLE) assess fundamental skills. Achievement tests may
assess both fact-based and conceptual knowledge, and curriculum-based assessment focuses
on what students have learned in school. Some items on achievement tests may require the
respondent to not only know and understand relevant facts but also be able to apply them.
Because respondents must draw on and apply knowledge related to a particular concept, these
types of achievement test items are referred to as conceptual.
Aptitude tests
We are all constantly acquiring information through everyday life experiences and
formal learning experiences. The key difference between achievement tests and aptitude tests
is that aptitude tests tend to focus more on informal learning or life experiences whereas
achievement tests tend to focus on the learning that has occurred as a result of relatively
structured input. Whether a test is seen as measuring aptitude or achievement is a context-
based judgment, that is, the judgment will be based, at least in part, on whether or not the test
taker is presumed to have prior exposure or formal learning related to the test’s content.
Aptitude tests are also referred to as prognostic tests and they are typically used to make
predictions.
The Preschool Level
The first five years of life—the period referred to as the preschool period—is a time
of profound change. Typically, between 18 and 24 months, the child becomes capable of
symbolic thought and develops language skills. At the preschool level, assessment is largely a
matter of determining whether a child’s cognitive, emotional, and social development is
consistent with age-related expectations and whether any problems likely to hamper learning
ability are evident. The preschool assessment tools are, with age-appropriate variations built
into them, the same types of tools used to assess school-age children and adults. These tools
include, among others, checklists and rating scales, tests, and interviews.
Checklists and rating scales are common assessment tools, especially with
preschoolers, used to evaluate the presence or absence of specific behaviors or traits.
Checklists involve marking observed behaviors, while rating scales require evaluators to
judge the degree of certain traits. These tools are often used by professionals, parents, or even
the individual being assessed. The Apgar score, used to assess newborns’ health, is an
example of a rating scale. Formal assessments like the Conners' Rating Scales-Revised and
BASC-3 are used for identifying children at risk in educational settings, guiding potential
interventions based on issues such as learning or behavioral difficulties.
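The difference between the two formats can be made concrete with a minimal Python sketch. The checklist behaviors below are hypothetical examples, and the rating-scale function assumes the conventional description of the Apgar score as five signs, each rated 0, 1, or 2 and then summed. It illustrates the scoring logic only and is not a clinical tool.

```python
# Assumption: the five Apgar signs, each rated 0, 1, or 2 by the evaluator.
APGAR_SIGNS = ("appearance", "pulse", "grimace", "activity", "respiration")


def checklist_summary(observed):
    """Checklist: each behavior is simply marked present (True) or absent (False)."""
    return sorted(behavior for behavior, present in observed.items() if present)


def apgar_total(ratings):
    """Rating scale: each sign is judged by degree (0-2) and the ratings are summed."""
    for sign in APGAR_SIGNS:
        if ratings[sign] not in (0, 1, 2):
            raise ValueError(f"{sign} must be rated 0, 1, or 2")
    return sum(ratings[sign] for sign in APGAR_SIGNS)


# Hypothetical checklist record for one preschooler.
observed = {"shares toys": True, "follows two-step directions": False, "names colors": True}
print(checklist_summary(observed))

# Hypothetical ratings for one newborn.
ratings = {"appearance": 1, "pulse": 2, "grimace": 2, "activity": 1, "respiration": 2}
print(apgar_total(ratings))
```

The checklist yields only a list of behaviors that were seen, whereas the rating scale yields a graded total, which is why rating scales call for more judgment from the evaluator.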
Psychological assessments for young children, particularly preschoolers, focus on
cognitive, emotional, and social development through observation and interviews. Traditional
verbal and performance tests used for older children are not suitable for this age group due to
short attention spans. Ideal assessments for preschoolers, such as the WPPSI-III and SB5, are
engaging, simple to administer, and allow for behavioral observations. Infant
intelligence tests, while not strong predictors of future intellectual ability, are useful in
identifying developmental strengths, weaknesses, and potential disabilities. These tests are
valuable for guiding early interventions, especially for children in need of special support.
Many other assessment techniques and instruments are available for use with
preschoolers, including case history methods, interviews, portfolio evaluation, and role-play
methods. For example, there are instruments to measure temperament, language skills, the
family environment in general, and specific aspects of parenting and caregiving. Drawings
may be analyzed for insights they can provide concerning the child’s personality.
The Elementary-School Level
Children entering the educational system come from a wide range of backgrounds and
experiences. Each child’s rate of physiological, psychological, and social development also
varies widely.
The Metropolitan Readiness Tests, Sixth Edition (MRT6; Nurss, 1994), is a test
battery that assesses the development of the reading and mathematics skills important in the
early stages of formal school learning. The test is divided into two levels: Level I
(administered individually), for use with beginning and middle kindergarteners, and Level II
(administered in groups), which spans the end of kindergarten through first grade. At each
level, there are two forms of the test. These tests are orally administered in several sessions
and are not timed, though administration time typically runs about 90 minutes. A practice test,
especially useful with young children who have had minimal or no prior test-taking
experience, may be administered several days before the actual examination to help
familiarize students with the procedures and format.
The Secondary-School Level
The SAT, formerly the Scholastic Aptitude Test, is an example of an aptitude test and
is a widely used college entrance exam that helps with college admissions, high school
guidance, and scholarship decisions. It consists of a main test measuring reading, writing, and
math, supplemented by the SAT Subject Tests. The mathematics section covers algebra,
geometry, basic statistics, and probability. The writing portion tests knowledge of grammar,
usage, and word choice through both multiple-choice items and an essay question. The SAT
Subject Tests are one-hour tests designed to measure achievement in specific subject areas
such as Mathematics, English, Science, History and
Social Studies, and Languages. The SAT is considered a strong predictor of college success
when combined with high school GPA, though it faces criticism for potential bias related to
race and environmental factors.
The College Level and Beyond
The Graduate Record Examinations (GRE) is a key test for graduate school
admissions, comprising a General Test with verbal, quantitative, and analytical writing
sections. The verbal section assesses the ability to analyze and evaluate written material, the
quantitative section tests basic math knowledge and reasoning, and the analytical writing
section evaluates critical thinking and writing skills. The GRE can be taken in paper-and-pencil
or computer format, with essays sent directly to graduate programs. Research shows that the GRE,
along with undergraduate GPA, is a valid predictor of graduate school success.
Another widely used examination is the Miller Analogies Test (MAT). The MAT is a
100-item, multiple-choice analogy test that draws not only on the examinee’s ability to perceive
relationships but also on academic learning, vocabulary, and general intelligence. The MAT
has been cited as one of the most cost-effective of all existing aptitude tests when it comes to
forecasting success in graduate school (Kuncel & Hezlett, 2007a). However, as most readers
are probably aware, the use of almost any aptitude test, even in combination with other
predictors, tends to engender controversy.
There are various other aptitude tests. Applicants may be required to take specialized
entrance examinations for training in certain professions and occupations. Numerous aptitude
tests have been developed to assess specific kinds of academic, professional, and/or
occupational aptitudes. There are also several lesser-known aptitude tests. For example, the
Seashore Measures of Musical Talents (Seashore, 1938) is a now-classic measure of musical
aptitude administered with the aid of a record or prerecorded tape. The Horn Art Aptitude
Inventory is a measure designed to gauge various aspects of the respondent’s artistic aptitude.
Diagnostic tests
Today a distinction is made between tests and test data used primarily for evaluative
purposes and tests and test data used primarily for diagnostic purposes. Diagnostic tests
identify specific learning difficulties in students, distinguishing them from evaluative tests,
which are used for making decisions like pass-fail or admissions. Diagnostic tests, such as
reading assessments with multiple subtests, help pinpoint the areas where a student struggles,
guiding targeted interventions. Diagnostic information aids educational decisions like class
placements. However, these tests do not explain the underlying causes of the learning
difficulties, which may require further psychological or medical evaluation. Diagnostic
tests are generally administered to students who have already demonstrated a problem
with a particular subject area through poor performance either in the classroom or on
some achievement test. For this reason, diagnostic tests may contain simpler items.
Reading Tests
The ability to read is integral to virtually all classroom learning, so it is not surprising
that several diagnostic tests are available to help pinpoint difficulties in acquiring this skill.
Some of the tests available to help identify reading difficulties include the Stanford
Diagnostic Reading Test, the Metropolitan Reading Instructional Tests, the Diagnostic
Reading Scales, and the Durrell Analysis of Reading Test. One such diagnostic battery
commonly used is the Woodcock Reading Mastery Tests. The Woodcock Reading Mastery
Tests, Third Edition (WRMT-III; Woodcock, 2011) is a paper-and-pencil measure of reading
readiness, reading achievement, and reading difficulties; the entire battery takes between 15
and 45 minutes to administer. It can be used with children as young as 4½, adults as old as 80,
and almost everyone in between. Users of prior editions of this popular test will recognize
many of the subtests on the WRMT-III including Letter Identification, Word Identification,
Word Attack, Word Comprehension, and Passage Comprehension. The third edition adds
three new subtests: Phonological Awareness, Listening Comprehension, and
Oral Reading Fluency. All of the subtests taken together are used to derive a picture of the
test taker’s reading-related strengths and weaknesses, as well as an actionable plan for
reading remediation when necessary. The test comes in parallel forms, useful for establishing
a baseline and then monitoring postintervention progress.
Math Tests
The Stanford Diagnostic Mathematics Test, Fourth Edition (SDMT-4) and the
KeyMath 3 Diagnostic System (KeyMath3-DA) are two of many tests that have been
developed to help diagnose difficulties with arithmetic and mathematical concepts. Items on
such tests typically cover everything from knowledge of basic concepts and operations through
applications entailing increasingly advanced problem-solving skills. The KeyMath3-DA comes
in two forms, each containing 10 subtests. The test protocols can be either computer-scored or
hand-scored. Because the KeyMath3-DA is individually administered, it is ideally administered
by a qualified examiner who is skillful in establishing and maintaining rapport with test takers
and knowledgeable in following the test’s standardized procedures. The SDMT-4 is a
standardized test that can provide useful diagnostic insights about the mathematical abilities
of test takers ranging from children just entering school to students entering college. The test,
available in different forms, is amenable to individual or group administration. It contains
both multiple-choice and free-response items. The latter items are designed to provide the
examiner with a firsthand understanding of the reasoning, strategies, and methods applied by
test takers to solve different kinds of problems. Test protocols may be hand-scored or centrally
scored by the test’s publisher. An online version of the SDMT-4 has been available since 2003.
Psychoeducational test batteries
Psychoeducational test batteries are test kits that generally contain two types of tests:
those that measure abilities related to academic success and those that measure educational
achievement in areas such as reading and arithmetic. Data derived from these batteries allow
for normative comparisons (how the student compares with other students within the same
age group), as well as an evaluation of the test taker’s strengths and weaknesses—all the
better to plan educational interventions. Examples of psychoeducational batteries include the Kaufman
Assessment Battery for Children (K-ABC), as well as the extensively revised second edition
of the test, the KABC-II.
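A normative comparison of this kind can be sketched in a few lines of Python. The sketch below is illustrative only: the norm-group scores are invented, the standard-score scale with a mean of 100 and a standard deviation of 15 is an assumed convention rather than something taken from a particular battery, and real batteries draw their norms from large standardization samples, not a handful of scores.

```python
from statistics import mean, pstdev


def standard_score(raw, norm_scores, new_mean=100, new_sd=15):
    """Re-express a raw score relative to a same-age norm group.

    Assumption: the norm group is treated as the whole reference population,
    so the population standard deviation (pstdev) is used.
    """
    z = (raw - mean(norm_scores)) / pstdev(norm_scores)
    return new_mean + new_sd * z


def percentile_rank(raw, norm_scores):
    """Percentage of the norm group scoring below raw, counting ties as half."""
    below = sum(score < raw for score in norm_scores)
    ties = sum(score == raw for score in norm_scores)
    return 100 * (below + 0.5 * ties) / len(norm_scores)


# Hypothetical raw scores from a same-age norm group on a reading subtest.
norm_group = [10, 12, 14, 16, 18]

print(standard_score(14, norm_group))   # a score at the group mean maps to 100
print(percentile_rank(18, norm_group))
```

Comparing a student's standard scores across ability and achievement subtests, rather than raw scores, is what makes strengths and weaknesses comparable within the same profile.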
The Kaufman Assessment Battery for Children (K-ABC) and the Kaufman Assessment
Battery for Children, Second Edition (KABC-II)
The Kaufman Assessment Battery for Children (K-ABC), developed for ages 2½ to
12½, measures intelligence and achievement through subtests. These subtests are divided into
two information-processing categories: simultaneous and sequential skills, following Luria's
theories (Das et al., 1975; Luria, 1966a, 1966b). Factor analysis supports the existence of
these two processing factors, but researchers have debated a third factor, often linked to
reading and verbal comprehension. While the K-ABC Achievement Scale predicts success,
the independence of sequential and simultaneous processing is questioned. Recommendations
for teaching based on a child’s processing strengths have mixed support, with research
showing limited impact on educational decisions.
The second edition of the Kaufman Assessment Battery for Children (KABC-II),
published in 2004, expanded its age range to 3–18 years, allowing for comparisons between
ability and achievement in high school. The test underwent significant structural changes,
adding 10 new subtests and removing 8 of the original ones. Conceptually, it integrated both
Luria's theory of sequential versus simultaneous processing and the Cattell-Horn-Carroll
(CHC) theory. This dual foundation allows examiners to choose between models depending
on the test taker's background, either focusing on broader cognitive abilities or excluding
verbal skills. While psychometrically sound, reviewers questioned the clarity of using two
distinct theoretical models in a single test, with some feeling that it lacked explanation and
sample reports for each model. Despite these concerns, the KABC-II aligns well with the
broad CHC abilities it aims to measure, particularly for school-age children.
The Woodcock-Johnson IV (WJ IV)
The WJ III (Woodcock et al., 2000) was a psychoeducational test package consisting
of two co-normed batteries. The Woodcock-Johnson IV (WJ IV) is a psychoeducational test
package consisting of three co-normed batteries: Tests of Achievement, Cognitive Abilities,
and Oral Language. The Oral Language battery includes 12 tests (nine in English and three in
Spanish) for assessing language proficiency and related skills. It can be used for individuals
aged 2 to 90+, evaluating areas such as listening comprehension and lexical access speed.
Based on the Cattell-Horn-Carroll (CHC) theory of cognitive abilities, the WJ IV provides
multiple measures, including general intellectual ability (GIA), fluid abilities (Gf), and
crystallized abilities (Gc). Examiners can choose between a standard or extended battery
depending on the depth of assessment required.
Other tools of assessment in educational settings
Beyond traditional achievement, aptitude, and diagnostic instruments, a wide variety of
other instruments and techniques of assessment may be used in the service of students and
society at large.
Performance, Portfolio, and Authentic Assessment
Performance assessment refers to evaluations where examinees must demonstrate
their knowledge, skills, and values by completing tasks rather than selecting from multiple-
choice options. Examples include essay questions or creating an art project. In modern
contexts, performance assessment focuses on tasks related to specific domains of study, with
experts setting evaluation standards. For instance, an architecture student might be tasked
with designing a blueprint, judged by professional architects. In educational and work
settings, performance tasks are samples of work designed to elicit domain-specific skills,
evaluated by expert-set criteria.
One of many possible types of performance assessment is portfolio assessment.
Portfolio assessment refers to evaluating a collection of an individual's work samples, often
in educational contexts. The word portfolio may also refer to a portable carrying case,
typically used to carry drawings, artwork, maps, and the like; in assessment, it denotes the
work samples themselves. Portfolio assessment offers a more performance-based method of
evaluation compared to traditional assessments. This type of assessment aligns with
"authentic assessment," focusing on applying academic teachings to real-world settings. The
portfolio represents a person's practical work, and its assessment is used to gauge how well
they apply knowledge and skills beyond the classroom.
Authentic assessment, also known as performance-based assessment (Baker et al.,
1993), evaluates students through meaningful, real-world tasks that demonstrate the transfer
of academic knowledge to practical activities. For example, assessing writing skills would
involve reviewing writing samples rather than multiple-choice tests. Authentic assessment
is thought to increase student interest and to promote the transfer of knowledge to settings
outside the classroom. However, a potential drawback is that it might measure prior
knowledge or skills that have little to do with classroom learning, such as personal
experience or perceptual-motor abilities in practical tasks like cooking.
Peer Appraisal Techniques
One of the methods of obtaining information about an individual is by asking that
individual’s peer group to make the evaluation. Techniques employed to obtain such
information are termed peer appraisal methods. A teacher, a supervisor, or some other group
leader may be interested in peer appraisals for a variety of reasons. Peer appraisals can help
call needed attention to an individual who is experiencing academic, personal, social, or
work-related difficulties that, for whatever reason, have not come to the attention of the
person in charge.
Peer appraisal techniques provide insight into group dynamics by allowing individuals
to be evaluated by their peers, who observe behaviors not always visible to supervisors.
These appraisals help identify group roles and dynamics, enhancing group efficiency. They
are useful in various settings, including schools, workplaces, and the military, especially
when individuals have been together long enough to make accurate assessments. These
methods include the "Guess Who?" technique, where peers identify group members based on
descriptive traits, and the nominating technique, where individuals select others for specific
tasks. Results are often visualized through sociograms, which graphically display group
interactions, highlighting popularity and relationships. Despite the value of peer appraisals, it
is important to regularly update them, as group dynamics frequently shift.
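The data behind a sociogram can be illustrated with a minimal Python sketch of the nominating technique. The student names and nominations are hypothetical, and the three summaries computed here (times chosen, mutual choices, and unchosen members) are one plausible way to read such data, not a standardized scoring procedure.

```python
from collections import Counter

# Hypothetical nominations: each student names the peers they would choose
# for a group task. Each entry is a directed "arrow" in the sociogram.
nominations = {
    "Ana": ["Ben", "Cal"],
    "Ben": ["Ana", "Cal"],
    "Cal": ["Ana"],
    "Dee": ["Ana", "Ben"],
}


def times_chosen(noms):
    """Count how many peers nominated each student (a rough popularity index)."""
    counts = Counter({name: 0 for name in noms})
    for chosen in noms.values():
        for peer in chosen:
            counts[peer] += 1
    return counts


def mutual_pairs(noms):
    """Find pairs of students who nominated each other."""
    pairs = set()
    for chooser, chosen in noms.items():
        for peer in chosen:
            if chooser in noms.get(peer, []):
                pairs.add(tuple(sorted((chooser, peer))))
    return sorted(pairs)


def unchosen(noms):
    """List students whom nobody nominated."""
    return sorted(name for name, n in times_chosen(noms).items() if n == 0)


print(times_chosen(nominations).most_common())
print(mutual_pairs(nominations))
print(unchosen(nominations))
```

A student who appears in the unchosen list may be the kind of individual whose difficulties have not yet come to the attention of the person in charge. Because the picture changes as the group changes, the nominations would need to be collected again periodically.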
Measuring Study Habits, Interests, and Attitudes
Academic performance is influenced by multiple factors, including ability and
motivation. Several instruments designed to look beyond ability and toward factors such as
study habits, interests, and attitudes have been published. Instruments like the ‘Study Habits
Checklist’ assess study behaviors such as note-taking and reading habits. Developed by Phi
Beta Kappa members, this checklist is used to identify effective study techniques, especially
for students struggling with course material. Studies have shown that poor study practices,
especially in note-taking and time management, are linked to learning difficulties.
Additionally, the ‘What I Like to Do Interest Inventory’ assesses students' academic, artistic,
occupational, and leisure interests, helping educators design instructional activities that align
with these areas to boost engagement.
Attitude inventories in educational settings assess students' attitudes toward various
school-related factors, with the idea that positive attitudes increase engagement and
commitment to learning (Epstein & McPartland, 1978, p. 2). Some instruments focus on
specific subjects, while others, like the ‘Survey of School Attitudes’ and ‘Quality of School
Life Scales’, are broader. The ‘Survey of Study Habits and Attitudes (SSHA)’ evaluates both
study methods and attitudes, targeting poor study skills that could impact academic
performance. Designed for students from grade 7 through college, it measures factors like
delay avoidance, work methods, and teacher approval, providing scores for study skills,
attitudes, and overall orientation.
CONCLUSION
In conclusion, the variety of assessment methods and tools available for evaluating
education reflects the complexity and diversity of learners' needs. From standardized tests
like the SAT, ACT, and GRE, which provide benchmarks for academic potential, to
diagnostic assessments that identify specific areas of difficulty for targeted intervention, each
tool serves a distinct purpose. Instruments like the K-ABC, designed to measure cognitive
abilities and achievement, offer deeper insights into students’ learning processes. Moreover,
historical approaches, such as the Chinese imperial examinations and Western psychometric
testing, have shaped modern assessment practices by emphasizing fairness, objectivity, and
the identification of individual potential. As education continues to evolve, these methods
must adapt to address the growing emphasis on creativity, inclusivity, and the diverse ways in
which students demonstrate knowledge and ability. Ultimately, a balanced approach,
combining multiple tools and methods, will provide the most comprehensive and equitable
assessment of student learning and development.
REFERENCES
Baker, E. L., O’Neil, H. F., & Linn, R. L. (1993). Policy and validity prospects for
performance-based assessment. American Psychologist, 48, 1210–1218.
Cohen, R. J., & Swerdlik, M. E. (2018). Psychological testing and assessment: An
introduction to tests and measurement (9th ed.). McGraw-Hill Education.
Das, J. P., Kirby, J., & Jarman, R. F. (1975). Simultaneous and successive synthesis: An
alternative model for cognitive abilities. Psychological Bulletin, 82, 87–103.
Epstein, J. L., & McPartland, J. M. (1978). The Quality of School Life Scale administration
and technical manual. Boston: Houghton Mifflin.
Kuncel, N. R., & Hezlett, S. A. (2007a). Standardized tests predict graduate students’ success.
Science, 315(5815), 1080–1081.
Kuncel, N. R., & Hezlett, S. A. (2007b). The utility of standardized tests: Response. Science,
316(5832), 1696–1697.
Luria, A. R. (1966a). Higher cortical functions in man. New York: Basic Books.
Luria, A. R. (1966b). Human brain and psychological processes. New York: Harper & Row.
Psychological Corporation. (2009). The Wechsler Individual Achievement Test-III. San
Antonio: Pearson Assessments.
Seashore, C. E. (1938). Psychology of music. New York: McGraw-Hill.
Woodcock, R. W., McGrew, K. S., & Mather, N. (2000).
Woodcock, R. W. (2011). The Woodcock Reading Mastery Tests, Third Edition (WRMT-III).
San Antonio: Pearson Assessments.