2 - Principles of Language Assessment
I. Practicality
● Refers to the logistical, down-to-earth, administrative issues
involved in making, giving and scoring an assessment instrument
Ex: costs, the amount of time it takes to construct and to
administer, ease of scoring, and ease of interpreting/reporting the
results
● A practical test:
○ Stays within budgetary limits
○ Can be completed by the test-taker within appropriate time
constraints (roughly 0.5 to 2 hours)
○ Has clear directions for administration
○ Appropriately utilizes available human resources
○ Does not exceed available material resources
○ Another aspect…
● Examples of impractical tests:
○ A 3-hour proficiency test
○ A test requiring one-on-one proctoring for hundreds of
test-takers
○ A test requiring a few minutes for a student to take and hours
for an examiner to evaluate
II. Reliability (Consistency of the test results)
● Is the consistency of test scores across facets of the test
● If the same test is given to the same students (or matched
students) on two different occasions, it should yield similar
results (see the sketch after the list below)
● A reliable test:
○ Has consistent conditions across two or more administrations
○ Gives clear directions for scoring and evaluation
○ Has uniform rubrics for scoring/evaluation
○ Lends itself to consistent application of rubrics by the scorer
○ Contains items/tasks that are unambiguous to the test-taker
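One way to quantify this consistency is a test-retest check: correlate the scores from two administrations of the same test. A minimal sketch in Python, with invented scores for illustration:

```python
# Test-retest reliability estimated as the Pearson correlation
# between two administrations of the same test. Scores are
# hypothetical, for illustration only.
from statistics import correlation  # Python 3.10+

# Scores for the same five students on two occasions
first_administration = [78, 85, 62, 90, 71]
second_administration = [75, 88, 60, 92, 70]

r = correlation(first_administration, second_administration)
print(f"Test-retest reliability estimate: r = {r:.2f}")
# A value close to 1.0 suggests the test yields consistent results.
```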
● What causes unreliability?
○ Student-related reliability
- Any factor affecting the learner's performance:
temporary illness, fatigue, a “bad day”, anxiety, and
other physical or psychological factors
⇒ Can the teacher minimize this?
○ Rater reliability
- Human error, subjectivity, and bias ⇒ unreliable test
scores
- Inter-rater reliability: two or more raters yield
consistent scores on the same test (unreliability often
stems from a lack of adherence to scoring criteria)
- Intra-rater reliability: consistency within the same
rater
★ An internal factor and a common occurrence for
classroom teachers (self-conflicting): unclear
scoring criteria, fatigue, bias toward particular
“good” and “bad” students, or simple
carelessness
★ Read through about half of the tests before
rendering any final scores or grades, then cycle
back through the whole set of tests to ensure
even-handed judgement
- To avoid intra-rater unreliability:
★ Raters should be in good health and in a
comfortable place; too many test papers ⇒ use
more raters; take breaks between scoring sessions
★ Answer keys, scoring guides, rubrics (analytic
scoring rubrics > holistic scoring rubrics)
★ Training for raters:
➽ Using benchmarks (samples of standard
qualities and suggested scores for the
test-takers' written responses)
➽ Comparing two sets of scores made by
the same rater (using statistical tests -
correlation)
- To avoid inter-rater unreliability:
★ 2 raters score the same paper ⇒ compare,
discuss, and agree upon the scores
★ Comparing two sets of scores made by different
raters (using statistical tests: correlation ⇒ high
correlation ⇒ high inter-rater reliability; see the
sketch after this list)
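The correlation check named above works for both cases: for intra-rater reliability, the two score sets come from two scoring passes by the same rater; for inter-rater reliability, from two different raters. A minimal sketch in Python, with invented scores:

```python
# Inter-rater (or intra-rater) reliability estimated as the
# correlation between two sets of scores on the same papers.
# All scores are hypothetical, for illustration only.
from statistics import correlation  # Python 3.10+

rater_a = [7, 5, 8, 6, 9, 4]  # essay scores from rater A
rater_b = [6, 5, 8, 7, 9, 5]  # the same essays scored by rater B

r = correlation(rater_a, rater_b)
print(f"Inter-rater reliability estimate: r = {r:.2f}")
# High correlation ⇒ the raters apply the rubric consistently;
# low correlation signals a need for benchmarks or rater training.
```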
○ Test administration reliability
- What causes test administration unreliability?
★ The conditions in which the test is
administered (background noise, photocopying
variations, the amount of light in different parts
of the room, variations in temperature, and even
the condition of desks and chairs)
★ The consistency of different facets of a test
(e.g. instructions, item types, organization) in
each test administration
★ The nature of the test itself can cause
measurement error (e.g. a test that is too long,
poorly written test items)
★ Subjective tests: open-ended responses (e.g.
essay responses) ⇒ the teacher's judgement ⇒ bias
★ Objective tests: predetermined fixed responses
⇒ increased reliability (see the sketch below)
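To illustrate why objective tests score so reliably, a minimal sketch (the item numbers and answer key are invented): with a predetermined key, every scorer, human or machine, produces the same result.

```python
# Scoring an objective (MCQ) test against a fixed answer key:
# no rater judgement is involved, so scoring is fully consistent.
ANSWER_KEY = {1: "B", 2: "D", 3: "A", 4: "C"}  # hypothetical key

def score_objective_test(responses: dict[int, str]) -> int:
    """Count responses that match the predetermined key."""
    return sum(1 for item, answer in responses.items()
               if ANSWER_KEY.get(item) == answer)

student = {1: "B", 2: "D", 3: "C", 4: "C"}
print(f"Score: {score_objective_test(student)}/{len(ANSWER_KEY)}")
# Output: Score: 3/4
```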
III. Washback
● A facet of consequential validity
● The effect of testing on teaching and learning
● Can be positive (beneficial) or negative (harmful, i.e., negatively
affecting learners; e.g., cram courses)
● A final test with a writing section → teachers teach writing
strategies, students find motivation to learn (positive)
● A final test with MCQs → teachers teach guessing, students find
ways to cheat (negative)
● A test with beneficial washback is more formative in nature than
summative
● To enhance washback:
○ Comment generously and specifically on test performance
○ A simple letter grade or numerical score alone gives no
information for improvement
○ Give praise for strengths and constructive feedback on
weaknesses → improvement
● Triangulating the information on a student before making a final
assessment of competence
IV. Validity
● A valid test:
○ Measures exactly what it proposes to measure
○ Does not measure irrelevant or “contaminating” variables
○ Relies as much as possible on empirical evidence
(performance)
○ Involves performance that samples the test’s criterion
(objective)
○ Offers useful, meaningful information about a test-taker's
ability
○ Is supported by a theoretical rationale or argument
Ex:
A valid reading test: measures reading ability, not previous
knowledge of a subject
An invalid writing test: measures the number of words students can
write in 10 minutes
● How do we know if the test is valid? (5 types of evidence)
○ Content-related validity
■ Any attempt to show that the content of the test is a
representative sample from the domain that is to be
tested
■ The validity of the content of a test in relation to its
objective (test what we teach)
■ To achieve content validity in classroom assessment:
test performance directly (Ex: to test learners'
pronunciation of the word “choir”, have them say it
aloud rather than write its phonetic transcription)
○ Criterion-related evidence
■ The extent to which the “criterion” of the test has
actually been reached
■ In classroom-based assessment (CBA): the results of a
classroom test are compared with a standardized test of
the same criterion (pronunciation, grammar, etc.) ⇒
high correlation ⇒ criterion-related validity (see the
sketch after this list)
■ 2 categories:
➢ Concurrent validity: a test's results are
supported by other concurrent performance
beyond the assessment itself (e.g. IELTS test
scores vs. real-life proficiency)
➢ Predictive validity: test scores are used to
predict some future criterion, such as academic
success (e.g. a placement test predicting
test-takers' later success)
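A minimal sketch of these correlation-based checks in Python, with invented scores: classroom scores are correlated with a standardized measure of the same criterion (criterion-related evidence), and a simple regression shows how placement scores could predict a future outcome (predictive validity).

```python
# Criterion-related evidence: correlate classroom test scores with
# an established test of the same criterion; predictive validity:
# use placement scores to predict later achievement.
# All numbers are hypothetical, for illustration only.
from statistics import correlation, linear_regression  # Python 3.10+

placement = [40, 55, 60, 72, 85]    # placement-test scores
final_grade = [52, 61, 66, 78, 88]  # the same students' course grades

r = correlation(placement, final_grade)
fit = linear_regression(placement, final_grade)
print(f"Criterion-related evidence: r = {r:.2f}")
print(f"Predicted grade for a placement score of 65: "
      f"{fit.slope * 65 + fit.intercept:.1f}")
```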
○ Construct-related validity
■ A construct is any theory, hypothesis, or model that
attempts to explain observed phenomena in our
universe of perception (e.g. linguistic constructs:
proficiency, communicative competence;
psychological constructs: self-esteem, motivation)
■ Concept vs construct:
➢ Construct has 2 properties:
★ Measurable/observable (e.g. “fluency”
→ what can we observe in speech to
make a decision? ⇒ “speed of speech,”
“lack of hesitation” (or “pauses”); see
the sketch at the end of this section)
★ Relationship with other, different
constructs (e.g. “anxiety” & “fluency”
→ hypothesis: anxiety increases →
fluency decreases; if this hypothesis is
tested and supported ⇒ a theory of speaking)
■ If there is a construct, there will certainly be
content; the reverse is not necessarily true
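A minimal sketch (an illustration, not from the source) of how the “fluency” construct could be operationalized as the observables named above, speech rate and hesitation pauses:

```python
# Operationalizing the "fluency" construct: turn a speech sample
# into measurable observables (speech rate, pause frequency).
# The function and sample data are hypothetical.
def fluency_measures(transcript: str, duration_sec: float,
                     pause_count: int) -> dict[str, float]:
    """Compute speech rate and pause rate from a speech sample."""
    minutes = duration_sec / 60
    words = len(transcript.split())
    return {
        "words_per_minute": words / minutes,
        "pauses_per_minute": pause_count / minutes,
    }

sample = "I think the city should build more parks because trees help"
print(fluency_measures(sample, duration_sec=30, pause_count=4))
# Higher speech rate and fewer pauses ⇒ evidence of greater fluency.
```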