
Chapter 2:

Principles of Language Assessment


Practicality
A practical test …
• stays within budgetary limits
• can be completed by the test-taker within
appropriate time constraints
• has clear directions for administration
• appropriately utilizes available human resources
• does not exceed available material resources
• considers the time and effort involved to both
design and score
A test which is impractical is …
• A test that is prohibitively expensive.
• A test of language proficiency that takes a
student five hours to complete.
• A test that requires individual one-on-one
proctoring.
• A test that takes a few minutes for a student
to take and several hours for an examiner to
evaluate.
Reliability
 A reliable test is consistent and dependable.
 A reliable test …
• is consistent in its conditions across two or more
administrations
• gives clear directions for scoring/evaluation
• has uniform rubrics for scoring/evaluation
• lends itself to consistent application of those rubrics by
the scorer
• contains items/tasks that are unambiguous to the test-
taker
 Consider the following sources of unreliability: the student, the
scoring, the test administration, and the test itself.
Student-Related Reliability
Student-related unreliability is caused by temporary illness, fatigue,
a “bad day,” anxiety, and other physical or psychological
factors.
Rater Reliability
 Human error, subjectivity, and bias.
• Two types: inter-rater and intra-rater reliability.
• Inter-rater (between-rater) consistency:
• Inter-rater unreliability occurs when two or more
scorers yield inconsistent scores on the same test,
possibly because of lack of attention to scoring
criteria, inexperience, inattention, or even
preconceived biases.
There are two common ways to measure inter-
rater reliability:
1. Percent Agreement
Suppose two judges are asked to rate the difficulty of 10 items
on a test on a scale of 1 to 3. For each item, we write “1” if the
two judges agree and “0” if they do not.

In this example, the judges agreed on 7 of the 10 items, so the
percent agreement is 7/10 = 70%.
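As a rough sketch (in Python, using hypothetical ratings rather than the handout's actual data), percent agreement can be computed in a few lines; the sample ratings below agree on 7 of the 10 items:

# Percent agreement sketch. The ratings are hypothetical examples,
# chosen so that the two judges agree on 7 of the 10 items.
judge_a = [1, 2, 3, 1, 2, 2, 3, 1, 2, 3]
judge_b = [1, 2, 3, 2, 2, 1, 3, 1, 3, 3]

# Mark each item 1 if the judges agree, 0 if they do not.
agreements = [1 if a == b else 0 for a, b in zip(judge_a, judge_b)]
percent_agreement = sum(agreements) / len(agreements)
print(f"Percent agreement: {percent_agreement:.0%}")   # -> 70%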


2. Cohen’s Kappa
A more difficult (and more rigorous) way to measure inter-
rater reliability is Cohen's Kappa, which measures agreement
between raters while accounting for the agreement expected by chance.
The formula for Cohen’s Kappa is calculated as:
k = (po – pe) / (1 – pe)
where:
po: Relative observed agreement among raters
pe: Hypothetical probability of chance agreement
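A minimal Python sketch of this formula, reusing the hypothetical ratings from the percent-agreement example above, might look like the following; with scikit-learn installed, sklearn.metrics.cohen_kappa_score(judge_a, judge_b) gives the same value.

from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    # po: observed proportion of items on which the raters agree
    po = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # pe: agreement expected by chance, from each rater's marginal proportions
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    pe = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)
    return (po - pe) / (1 - pe)

judge_a = [1, 2, 3, 1, 2, 2, 3, 1, 2, 3]
judge_b = [1, 2, 3, 2, 2, 1, 3, 1, 3, 3]
print(round(cohens_kappa(judge_a, judge_b), 2))   # about 0.55 for these data

Kappa values near 0 indicate agreement no better than chance; values approaching 1 indicate near-perfect agreement.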
Example:
Grading 40 essay tests within only a week.
Problem: ???
You might become “easier” or “harder” as you grade, and the
result may be an inconsistent evaluation across all tests.
Solution: ???
Read through about half of the tests before rendering
any final scores or grades, then cycle back through
the whole set of tests to ensure even-handed
judgment.
 Use an analytic scoring instrument (an analytic rubric).
Intra-rater or within-rater consistency
Intra-rater unreliability is a common occurrence
for classroom teachers because of unclear
scoring criteria, fatigue, bias toward particular
"good“ and "bad" students, or simple
carelessness.
Test Administration Reliability
 the conditions in which the test is administered.
Sources: photocopying variations, the amount of
light in different parts of the room, variations in
temperature, and the condition of desks and
chairs.
Test Reliability
 the nature of the test itself.
• MCQ tests must be carefully designed to include a
number of characteristics: items of even difficulty, well-
designed distractors, and well-distributed items.
• Test unreliability can be caused by rater bias. (for
example, subjective tests with open-ended
responses)
• Timed tests may discriminate against students who
do not perform well on a test with a time limit.
• Test unreliability can also stem from a test that is too
long or from poorly written test items.
Validity
 The extent to which inferences made from assessment results
are appropriate, meaningful, and useful in terms of the purpose
of the assessment.
A VALID TEST...
 measures exactly what it proposes to measure
 does not measure irrelevant or “contaminating” variables
 relies as much as possible on empirical evidence
(performance)
 involves performance that samples the test’s criterion
(objective)
 offers useful, meaningful information about a test-taker's
ability
 is supported by a theoretical rationale or argument
• A valid test of reading ability actually measures
reading ability, not previous knowledge nor some
other variable of questionable relevance.
• There are five types of validity evidence for a test:
1. Content-Related Evidence,
2. Criterion-Related Evidence,
3. Construct-Related Evidence,
4. Consequential Validity,
5. Face Validity.
1. Content-Related Evidence
 A test achieves content validity if it actually samples the subject
matter about which conclusions are to be drawn and requires the
test-takers to perform the behavior that is being measured.
Example: Assessing a person's ability to speak a second
language in a conversational setting,
a/ asking the learner to answer ‘paper-and-pencil’
multiple-choice questions requiring grammatical
judgments DOES NOT achieve content validity.
b/ A test that requires the learner actually to speak within
some sort of authentic context  achieves content validity.
c/ A course has ten objectives but only two are covered in a
test  content validity suffers.
Direct and indirect testing
Direct testing involves the test-taker in
actually performing the target task.
In an indirect test, learners do not perform
the task itself but rather a task that is related in
some way.
2. Criterion-Related Evidence
 The extent to which the "criterion" of the test has actually
been reached.
• Most classroom-based assessment with teacher-designed
tests fits the concept of criterion-referenced assessment.
• Best demonstrated through a comparison of results of an
assessment with results of some other measure of the same
criterion.
• Examples: in a course unit whose objective is for students to
be able to orally produce voiced and voiceless stops in all
possible phonetic environments, the results of one teacher's
unit test might be compared with an independent
assessment (possibly a commercially produced test in a
textbook) of the same phonemic proficiency.
Criterion-related evidence usually falls into one
of two categories: (1) concurrent and (2)
predictive validity.
(1) Concurrent Validity
This tells us if it is valid to use the value of one
variable to predict the value of some other
variable measured concurrently (i.e. at the same
time).
(2) Predictive Validity
This tells us if it is valid to use the value of one
variable to predict the value of some other
variable in the future.
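As an illustration only, criterion-related evidence is commonly quantified as a correlation between the test and the criterion measure; the scores below are hypothetical.

from statistics import correlation   # available in Python 3.10+

unit_test = [78, 85, 62, 90, 70, 88, 75, 95]   # teacher-made unit test
criterion = [74, 88, 65, 93, 68, 90, 72, 97]   # independent measure of the same criterion

# Concurrent validity: the two measures are taken at roughly the same time.
# Predictive validity would instead correlate the test with a later outcome.
r = correlation(unit_test, criterion)
print(f"Correlation with the criterion measure: r = {r:.2f}")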
3. Construct-Related Evidence
 A construct is any theory, hypothesis, or model that
attempts to explain observed phenomena in our universe of
perceptions.
• ‘Proficiency’ and ‘communicative competence’ are
linguistic constructs; ‘self-esteem’ and ‘motivation’ are
psychological constructs.
• Example: a procedure for conducting an oral interview.
The scoring analysis includes several factors:
pronunciation, fluency, grammatical accuracy, vocabulary
use and socio-linguistic appropriateness.
• Construct validity is a major issue in validating large-scale
standardized tests of proficiency.
4. Consequential Validity
Consequential validity encompasses all the
consequences of a test, including such
considerations as its accuracy in measuring
intended criteria, its impact on the preparation
of test-takers, its effect on the learner, and the
social consequences of a test’s interpretation
and use (for high-stakes assessment, the effect
of test preparation courses and manuals on
performance).
5. Face Validity
• The extent to which students view the assessment
as fair, relevant, and useful for improving learning.
• Face validity refers to the degree to which a test
looks right, and appears to measure the knowledge
or abilities it claims to measure, based on the
subjective judgment of the examinees who
take it and the administrative personnel who
decide on its use.
• It is purely a factor of the "eye of the beholder" -
how the test-taker, or possibly the test giver,
intuitively perceives the instrument.
Teachers can increase a student's perception of
fair tests by using:
 formats that are expected and well-constructed with
familiar tasks
 tasks that can be accomplished within an allotted time
limit
 items that are clear and uncomplicated
 directions that are crystal clear
 tasks that have been rehearsed in their previous course
work
 tasks that relate to their course work (content validity)
 level of difficulty that presents a reasonable challenge
Authenticity
 The degree of correspondence of the characteristics of
a given language test task to the features of a target
language task.
• In a test, authenticity may be 'present' in the following
ways:
– The language in the test is as natural as possible.
– Items are contextualized rather than isolated.
– Topics are meaningful, relevant, interesting for the learner.
– Some thematic organization to items is provided, such as
through a story line or episode.
– Tasks represent, or closely approximate, real-world tasks.
Authenticity (cont’d)
• The authenticity of test tasks in recent years has
increased noticeably.
• Reading passages are selected from real-world
sources.
• Listening comprehension sections feature natural
language with hesitations, white noise, and
interruptions.
• More tests offer items that are 'episodic' in that
they are sequenced to form meaningful units,
paragraphs, or stories.
Washback
 Washback generally refers to the effects a test has
on instruction in terms of how students prepare for the
test.
• ‘Cram’ courses and ‘teaching to the test’ are examples
of washback. (exam-oriented approach of teaching and
learning).
• Students' incorrect responses can become windows of
insight into further work.
• Washback enhances a number of basic principles of
language acquisition: intrinsic motivation, autonomy,
self-confidence, language ego, interlanguage, and
strategic investment, among others.
• Formative tests provide washback in the form
of information to the learner on progress
toward goals.
• Summative tests, which provide assessment
at the end of a course or program, do not
need to offer much in the way of washback.
 This is unfortunate, because the end of every
language course or program is always the
beginning of further pursuits, more learning,
more goals, and more challenges to face.
A TEST THAT PROVIDES BENEFICIAL WASHBACK ...
 positively influences what and how teachers
teach
 positively influences what and how learners learn
 offers learners a chance to adequately prepare
 gives learners feedback that enhances their
language development
 is more formative in nature than summative
 provides conditions for peak performance by the
learner
APPLYING PRINCIPLES TO THE
EVALUATION OF CLASSROOM TESTS
Quizzes, tests, final exams, and standardized proficiency
tests can all be scrutinized through these five principles of
practicality, reliability, validity, authenticity, and washback.
There are six questions:
1. Are the test procedures practical?
2. Is the test reliable?
3. Does the procedure demonstrate content validity?
4. Is the procedure face valid and “biased for best”?
5. Are the test tasks as authentic as possible?
6. Does the test offer beneficial washback to the learner?
1. Are the test procedures practical?
To determine whether a test is practical for your needs,
you may want to use a practicality checklist.
2. Is the test reliable?
Part of achieving test reliability depends on the physical
context - making sure that
 every student has a cleanly photocopied test sheet,
 sound amplification is clearly audible to everyone in the
room,
 video input is equally visible to all,
 lighting, temperature, extraneous noise, and other
classroom conditions are equal for all students,
 objective scoring procedures leave little debate about
correctness of an answer.
2. Is the test reliable?
Intra-rater reliability for open-ended responses may be enhanced
by the following guidelines:
 Use consistent sets of criteria for a correct response.
 Give uniform attention to those sets throughout the evaluation
time.
 Read through tests at least twice to check for your consistency.
 If you have made "mid-stream" modifications of what you
consider as a correct response, go back and apply the same
standards to all.
 Avoid fatigue by reading the tests in several sittings, especially
if the time requirement is a matter of several hours.
3. Does the procedure demonstrate
content validity?
There are two steps to evaluating the content
validity of a classroom test.
3.1. Are classroom objectives identified and
appropriately framed?
3.2. Are lesson objectives represented in the
form of test specifications?
3.1. Are classroom objectives identified and appropriately framed?

Consider the following objectives for lessons, all of which appeared on
lesson plans designed by students in teacher preparation programs.

a. "Should"  ambiguous; the expected performance is not stated
b. No standards stated or implied
c. Cannot be assessed
d. Just a teacher's note
3.2. Are lesson objectives represented in the form of test specifications?

A test should have a structure that follows logically from the lesson
or unit you are testing.
Many tests have a design that
 divides them into a number of sections (corresponding to the
objectives that are being assessed)
 offers students a variety of item types,
 gives an appropriate relative weight to each section.

The content validity of an existing classroom test should be
apparent in how the objectives of the unit being tested are
represented in the form of the content of items, clusters of items,
and item types.
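As an illustration only, a test specification of this kind might be sketched as a simple data structure; the objectives, item types, counts, and weights below are hypothetical.

# Hypothetical test specification for a unit with three objectives.
# Section names, item types, item counts, and weights are illustrative only.
test_spec = {
    "Objective 1: listening for main ideas": {
        "item_type": "multiple choice", "items": 10, "weight": 0.30,
    },
    "Objective 2: using past-tense forms in context": {
        "item_type": "cloze / gap fill", "items": 8, "weight": 0.30,
    },
    "Objective 3: writing a short descriptive paragraph": {
        "item_type": "guided writing task", "items": 1, "weight": 0.40,
    },
}

# Sanity check: the relative weights of the sections should sum to 1.0.
assert abs(sum(s["weight"] for s in test_spec.values()) - 1.0) < 1e-9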
4. Is the procedure face valid and
“biased for best”?
Students will generally judge a test to be face
valid if
 directions are clear,
 the structure of the test is organized logically,
 its difficulty level is appropriately pitched,
 the test has no "surprises," and
 timing is appropriate.
4. Is the procedure face valid and
“biased for best”?
To give an assessment procedure that is
"biased for best," a teacher
 offers students appropriate review and
preparation for the test,
 suggests strategies that will be beneficial, and
 structures the test so that the best students will
be modestly challenged and the weaker
students will not be overwhelmed.
5. Are the test tasks as authentic as
possible?
Evaluate the extent to which a test is authentic by
asking the following questions:
 Is the language in the test as natural as possible?
 Are items as contextualized as possible rather than
isolated?
 Are topics and situations interesting, enjoyable, and/or
humorous?
 Is some thematic organization provided, such as
through a story line or episode?
 Do tasks represent, or closely approximate, real-world
tasks?
6. Does the test offer beneficial
washback to the learner?
A test that achieves content validity demonstrates
relevance to the curriculum in question and thereby
sets the stage for washback.
When test items represent the various objectives of
a unit, and/or when sections of a test clearly focus
on major topics of the unit, classroom tests can
serve in a diagnostic capacity.
