PRINCIPLES OF LANGUAGE ASSESSMENT
A paper submitted to fulfill an assignment in the Language Testing course
Lecturer: Rekno Sari, M.Pd.
Arranged by:
1. Citra Debora (201612500547)
2. Yossy Sitompul (201612500580)
3. Nadilla Raudiya (201612500606)
4. Santi Oktaviani (201612500518)
5. Saraswati (201612500528)
ENGLISH EDUCATION STUDY PROGRAM
FACULTY OF LANGUAGE AND ART
INDRAPRASTA PGRI UNIVERSITY
JAKARTA
2019
PREFACE
Thanks to Almighty God, who has blessed the authors in finishing the
Language Testing assignment entitled “Principles of Language Assessment”.
The authors also thank all the individuals who helped in the process of
writing this report paper.
May God repay all the help given and bless you all. The authors realize that
this report paper is still imperfect in both arrangement and content, and they
hope that criticism from the readers will help them perfect the next paper.
Last but not least, hopefully this report paper can help the readers gain more
knowledge about the principles of language assessment.
Jakarta, September 26th 2019
Authors
TABLE OF CONTENTS
PREFACE
TABLE OF CONTENTS
CHAPTER I INTRODUCTION
A. Background
B. Problem Formulation
C. Goal
D. Function
CHAPTER II DISCUSSION
A. Language Assessment or Language Testing
B. Principles of Language Assessment
a. Practicality
b. Reliability
c. Validity
d. Authenticity
e. Washback
CHAPTER III CONCLUSION
A. Conclusion
BIBLIOGRAPHY
CHAPTER I
INTRODUCTION
A. Background
Language assessment or language testing is a field of study under
the umbrella of applied linguistics. Its main focus is the assessment of first,
second or other language in the school, college, or university context;
assessment of language use in the workplace; and assessment of language
in the immigration, citizenship, and asylum contexts.
Language testing is very important in the world of education, which is
why every teacher needs to know the main principles of language testing
in order to follow the development of their students' abilities.
B. Problem Formulation
Based on the background above, the writers formulate two problems:
1. What is the meaning of language assessment?
2. What are the major principles of language assessment?
C. Goal
The goal of this paper is to let the readers know the meaning of language
assessment and to discuss the five major principles of language assessment.
D. Function
The functions of writing this paper are:
1. To fulfill the assignment of the Language Testing course.
2. To give more explanation of the five major principles of language
assessment.
3. To help the readers understand this topic.
CHAPTER II
DISCUSSION
A. Language Assessment or Language Testing
Language assessment or language testing is a field of study under the
umbrella of applied linguistics. Its main focus is the assessment
of first, second or other language in the school, college, or university
context; assessment of language use in the workplace; and assessment of
language in the immigration, citizenship, and asylum contexts. The
assessment may include listening, speaking, reading, writing, an integration
of two or more of these skills, or other constructs of language ability. Equal
weight may be placed on knowledge (understanding how the language
works theoretically) and proficiency (ability to use the language practically),
or greater weight may be given to one aspect or the other.
B. Principles of Language Assessment
Once the definition of language assessment is understood, the next
thing for teachers to comprehend is the principles of language assessment.
To design a good assessment, teachers should pay attention to practicality,
reliability, validity, authenticity, and washback. Each of these is explained
further below.
a. Practicality
Practicality is a matter of the extent to which the demands of the
particular test specifications can be met within the limits of time and
existing human and material resources (Mousavi, 1999). Along the same
lines, Harris (1969) refers to economy and ease of administration and
scoring. If a standard test is used, we must take into account the cost per
copy. It should also be determined whether several administrators and/or
scorers will be needed, for the more personnel who must be involved in
giving and scoring a test, the more costly the process becomes (pp. 21-22).
An effective test is practical. This means that it:
1. is not excessively expensive,
2. stays within appropriate time constraints,
3. is relatively easy to administer, and
4. has a scoring/evaluation procedure that is specific and time-efficient.
A test that is prohibitively expensive is impractical. A test of
language proficiency that takes a student five hours to complete is
impractical: it consumes more time (and money) than necessary to
accomplish its objective. A test that requires individual one-on-one
proctoring is impractical for a group of several hundred test-takers and
only a handful of examiners. A test that takes a few minutes for a student
to take and several hours for an examiner to evaluate is impractical for
most classroom situations.
b. Reliability
A reliable test is consistent and dependable. If you give the same test
to the same student or matched students on two different occasions, the
test should yield similar results. The issue of the reliability of a test may
best be addressed by considering a number of factors that may contribute
to its unreliability. Consider the following possibilities (adapted from
Mousavi, 2002, p. 804): fluctuations in the student, in scoring, in test
administration, and in the test itself. The factors affecting reliability are
(Heaton, 1975: 155-156; Brown, 2004: 21-22):
1. student-related reliability: students' personal factors such as
motivation, illness, or anxiety can hinder their ‘real’
performance,
2. rater reliability: either intra-rater or inter-rater factors can lead
to subjectivity, error, or bias in scoring tests,
3. test administration reliability: when the same test is
administered on different occasions, it can yield different results,
4. test reliability: this deals with the duration of the test and the
test instructions. If a test takes a long time to complete, it may
affect test takers' performance through fatigue, confusion, or
exhaustion, and some test takers do not perform well on timed
tests. Test instructions must be clear to all test takers, since
unclear instructions add to the mental pressure they feel.
Several methods can be employed to establish the reliability of an
assessment (Heaton, 1975: 156; Weir, 1990: 32; Gronlund and Waugh,
2009: 59-64); a brief computational sketch follows the list. They are:
1. test-retest/re-administer: the same test is administered after a
lapse of time. The two sets of scores are then correlated.
2. parallel-form/equivalent-forms method: two equivalent
(“cloned”) tests are administered at the same time to the same
test takers. The results of the two tests are then correlated.
3. split-half method: a test is divided into two halves,
corresponding scores are obtained, and the extent to which the
halves correlate with each other governs the reliability of the
test as a whole.
4. test-retest with equivalent forms: a mixed method of test-retest
and parallel forms. Two cloned tests are administered to the
same test takers on different occasions.
5. intra-rater and inter-rater: employing one person to score the
same test at different times is called intra-rater reliability. Some
hints to minimize unreliability are employing a rubric, avoiding
fatigue, scoring the same item number across all papers before
moving on to the next, and asking students to write their names
on the back of the test paper. When two people score the same
test, it is inter-rater reliability; the test papers done by the test
takers are divided into two, and a rubric and a discussion must
be developed first so that the raters have the same perception.
The two sets of scores, whether from intra- or inter-rater scoring,
are then correlated.
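To make the correlation-based methods above concrete, the short Python
sketch below (not part of the cited sources) computes a test-retest coefficient,
a split-half coefficient adjusted with the Spearman-Brown formula, and an
inter-rater coefficient. All scores are invented for illustration.

# Illustrative sketch only: estimating reliability coefficients in plain Python.
# All scores below are hypothetical.

def pearson(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Method 1 (test-retest): same test, same students, two occasions.
first_sitting = [72, 65, 88, 54, 91, 60]
second_sitting = [70, 68, 85, 58, 93, 62]
print("test-retest:", round(pearson(first_sitting, second_sitting), 2))

# Method 3 (split-half): correlate the two halves of one test, then adjust
# with the Spearman-Brown formula to estimate full-test reliability.
odd_items = [35, 30, 44, 26, 47, 29]
even_items = [37, 35, 44, 28, 44, 31]
r_half = pearson(odd_items, even_items)
print("split-half (Spearman-Brown):", round((2 * r_half) / (1 + r_half), 2))

# Method 5 (inter-rater): two raters score the same scripts; correlate them.
rater_a = [14, 12, 18, 10, 19, 13]
rater_b = [15, 11, 17, 11, 18, 12]
print("inter-rater:", round(pearson(rater_a, rater_b), 2))

In each case, a coefficient close to 1 would indicate consistent, dependable
scores.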
c. Validity
When teachers come to assessment, they deal a great deal with the
question of how to measure students' abilities. The question word ‘how’
implies that teachers should be able to design a measurement that brings
out students' potential as they intend. This is where validity comes in.
Validity is linked to accuracy: a good test should be valid, or accurate.
Several experts have defined the term validity. Heaton (1975: 153), for
example, states that the validity of a test is the extent to which it measures
what it is supposed to measure. Bachman (1990: 236) also mentions that
in examining validity, the relationship between test performance and
other types of performance in other contexts is considered. Brown (2004:
22) defines validity as the extent to which inferences made from
assessment results are appropriate, meaningful, and useful in terms of the
purpose of the assessment. Similarly, Gronlund and Waugh (2009: 46)
state that validity is concerned with the interpretation and use of
assessment results. From these definitions, it can be inferred that a valid
test elicits the specific abilities of students that it is intended to elicit; in
other words, it measures what it is supposed to measure.
Validity is a unitary concept (Bachman, 1990: 241; Gronlund and
Waugh, 2009: 47). To draw valid inferences from test scores, a test
should be supported by several kinds of evidence. The evidence of
validity discussed here includes face validity, content-related evidence,
construct-related evidence, criterion-related evidence, formative validity,
and sampling validity. In the following section, these kinds of evidence
are explained in detail.
1. Face Validity
The concept of face validity, according to Heaton (1975: 153)
and Brown (2004: 26), is that a test item looks right to other
testers, teachers, moderators, and test-takers; that is, it appears
to measure the knowledge or abilities it claims to measure.
Heaton argues that if a test is examined by other people, its
absurdities and ambiguities can be discovered.
Face validity is important in maintaining test takers'
motivation and performance (Heaton, 1975: 153; Weir, 1990:
26). If a test does not have face validity, it may not be acceptable
to students or teachers. If students do not perceive the test as
valid, they will show adverse reactions (poor study reactions,
low motivation). In other words, they will not perform in a way
that truly reflects their abilities.
Brown (2004: 27) states that face validity will likely be high
if learners encounter:
1. a well-constructed, expected format with familiar tasks,
2. a test that is clearly doable within the allotted time limit,
3. items that are clear and uncomplicated,
4. directions that are crystal clear,
5. tasks that relate to their course work (content validity), and
6. a difficulty level that presents a reasonable challenge.
To examine face validity, no statistical analysis is needed.
Judgmental responses from experts, colleagues, or test takers
may be gathered instead. They can read through all of the items
thoroughly, or they can simply glance at them, and then relate
the items to the ability the test is intended to measure. If a
speaking test appears as a set of vocabulary items, it may not
have face validity.
2. Content-related Evidence
A test is administered after the materials have been wholly
taught. The test has content-related evidence if it represents the
whole of the material taught beforehand, so that the students can
draw conclusions from that material (Weir, 1990: 24; Brown,
2004: 22; Gronlund and Waugh, 2009: 48). In addition, the test
should also reflect the objectives of the course (Heaton, 1975:
154). If the objective of the test is to enable students to speak,
the test should make the students speak communicatively. If the
objective of the test is to enable students to read, the test should
make them read something. A speaking test which appears as a
paper-and-pencil multiple-choice test cannot be claimed to
contain content-related evidence. In relation to the curriculum, a
test which has content-related evidence represents the basic
competencies.
Direct testing and indirect testing are two ways of
understanding content validity. Direct testing involves the
test-taker in actually performing the target task, whereas in
indirect testing learners do not perform the task itself but rather
a task that is related to it in some way (Brown, 2004: 23).
Establishing content-related evidence is problematic,
especially with regard to how well a sample of items represents
the larger domain. To build an assessment that provides valid
results, the guideline below can be applied (Gronlund and
Waugh, 2009: 48-49):
1. identifying the learning outcomes to be assessed
(objective of the course),
2. preparing a plan that specifies the sample of tasks to be
used (blueprint),
3. preparing an assessment procedure that closely fits the
set of blueprint (rubric).
3. Construct-related Evidence
Construct-related evidence, also called construct validity,
concerns constructs. A construct is any theory, hypothesis, or
model that attempts to explain observed phenomena in our
universe of perceptions; constructs may or may not be directly
or empirically measured, and their verification often requires
inferential data (Brown, 2004: 25). Cronbach (as cited in Weir,
1990: 24) states that the construction of a test starts from a
theory about behavior or mental organization, derived from
prior research, that suggests the ground plan for the test. Before
an assessment is built, its creator must review theories about its
content; from these, he or she derives the concepts underlying
the items. In language assessment, test makers believe in the
existence of several characteristics related to language behavior
and learning. When test makers interpret the results of an
assessment on the basis of psychological constructs, they are
dealing with construct-related evidence (Heaton, 1975: 154;
Gronlund and Waugh, 2009: 55).
For example, a scoring analysis for an oral interview may
consider several factors: pronunciation, fluency, grammatical
accuracy, vocabulary use, and sociolinguistic appropriateness.
The justification of these factors lies in a theoretical construct
that claims them to be major components of oral proficiency.
When a teacher conducts an oral proficiency interview that
evaluates only two of the five factors, the teacher could be
justifiably suspicious about the construct validity of the test.
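As a rough sketch of how such a construct might be turned into an
analytic scoring scheme, the Python fragment below combines ratings on
the five factors named above into a single interview score. The equal
weights, the 0-5 rating scale, and the sample ratings are assumptions made
for illustration only; they are not taken from the cited sources.

# Illustrative sketch only: analytic scoring for an oral interview based on
# five hypothetical construct components. Weights and ratings are invented.

ORAL_COMPONENTS = {
    "pronunciation": 0.2,
    "fluency": 0.2,
    "grammatical accuracy": 0.2,
    "vocabulary use": 0.2,
    "sociolinguistic appropriateness": 0.2,
}

def interview_score(ratings):
    """Combine 0-5 ratings for every component into a weighted total (0-5)."""
    missing = set(ORAL_COMPONENTS) - set(ratings)
    if missing:
        # Rating only some of the factors under-represents the construct
        # that the interview claims to measure.
        raise ValueError(f"construct under-represented, missing: {missing}")
    return sum(w * ratings[c] for c, w in ORAL_COMPONENTS.items())

student = {
    "pronunciation": 4,
    "fluency": 3,
    "grammatical accuracy": 4,
    "vocabulary use": 3,
    "sociolinguistic appropriateness": 5,
}
print("weighted interview score:", round(interview_score(student), 2))  # 3.8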
This kind of validity is the broadest of those discussed here;
in other words, it covers all other kinds of evidence (face,
content-related, criterion-related, and other relevant evidence).
Although the search for construct-related evidence could go on
endlessly, test makers should start from the most relevant kinds.
Construct validity is a major issue in validating large-scale
standardized tests of proficiency. Because such tests must adhere
to the principle of practicality, and because they must sample a
limited number of domains of language, they may not be able to
contain all the content of a particular field or skill (Brown, 2004:
25).
4. Criterion-related Evidence
Criterion-related evidence refers to a comparison between
test scores and a suitable external criterion of performance
(Heaton, 1975: 254; Weir, 1990: 27; Brown, 2004: 24). For
example, the result of a teacher-made test about the past tense is
compared to the result of a test on the same topic in a textbook.
There are two types of criterion-related evidence, depending
on when the external criterion is collected: concurrent and
predictive validity. Concurrent validity concerns using the
results of a test to estimate current performance on some
criterion collected at the same time. For example, a teacher-made
test is considered to have concurrent validity when its scores
align with those of an existing valid test such as TOEFL: if
students have high scores on TOEFL and, at the same time, good
scores on the teacher-made test, the teacher-made test has
concurrent validity. Predictive validity, on the other hand,
concerns using the results of a test to predict future performance
on some other valued measure collected later. For example, a
teacher-made test is administered to some students and they get
high scores; it then turns out that by the end of the teaching and
learning process the students still achieve high scores, which
means that the teacher-made test has predictive validity. In
addition, when a test taker's result on a particular test can be
used to predict whether he or she will survive overseas, the test
also has predictive validity. This can be found in performance
tests, admissions batteries, language aptitude tests, and the like.
To examine criterion-related evidence, correlation coefficients
and expectancy tables are utilized (Gronlund and Waugh, 2009:
51-55); a small illustration follows.
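As a small illustration of an expectancy table, the Python sketch
below cross-tabulates hypothetical teacher-made test scores against later
course grades; a clear diagonal pattern (high scorers earning high grades)
would count as predictive evidence. All scores, grades, and score bands
are invented and are not taken from the cited sources.

# Illustrative sketch only: building a tiny expectancy table from
# hypothetical (teacher-made test score, end-of-course grade) pairs.
from collections import Counter

pairs = [
    (85, "A"), (78, "B"), (92, "A"), (60, "C"), (74, "B"),
    (55, "C"), (88, "A"), (67, "B"), (49, "C"), (81, "A"),
]

def band(score):
    """Group test scores into three bands for the table rows."""
    if score >= 80:
        return "high (80-100)"
    if score >= 65:
        return "middle (65-79)"
    return "low (0-64)"

# Count how many students in each score band earned each grade.
table = Counter((band(score), grade) for score, grade in pairs)

grades = ["A", "B", "C"]
print(f"{'score band':<16}" + "".join(f"{g:>3}" for g in grades))
for row in ("high (80-100)", "middle (65-79)", "low (0-64)"):
    print(f"{row:<16}" + "".join(f"{table[(row, g)]:>3}" for g in grades))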
5. Formative Validity
When applied to outcomes assessment, formative validity
concerns how well a measure is able to provide information that
helps improve the program under study. For example, when
designing a rubric for history, one could assess students'
knowledge across the discipline. If the measure can show that
students are lacking knowledge in a certain area, for instance the
Civil Rights Movement, then that assessment tool is providing
meaningful information that can be used to improve the course
or program requirements.
6. Sampling Validity
Sampling validity ensures that the measure covers the broad
range of areas within the concept under study. Not everything
can be covered, so items need to be sampled from all of the
domains. This may need to be done with a panel of experts,
which also helps to limit “expert” bias (i.e., a test reflecting only
what one individual personally feels are the most important or
relevant areas). For example, when designing an assessment of
learning in the theatre department, it would not be sufficient to
cover only issues related to acting; other areas of theatre, such as
lighting, sound, and the functions of stage managers, should all
be included. The assessment should reflect the content area in
its entirety.
d. Authenticity
A test must be authentic. Bachman and Palmer (as cited in Brown,
2004: 28) define authenticity as the degree of correspondence of the
characteristics of a given language test task to the features of a target
language task. Several things must be considered in making a test
authentic: the language used in the test should be as natural as possible,
the items should be contextualized, the topics brought into the test should
be meaningful and interesting for the learners, the items should be
organized thematically, and the tasks should be based on the real world.
e. Washback
The effects of tests on teaching and learning are called washback.
Teachers must be able to create classroom tests that serve as learning
devices through which washback is achieved. Washback enhances
intrinsic motivation, autonomy, self-confidence, language ego,
interlanguage, and strategic investment in the students. Instead of giving
letter grades and numerical scores, which give little information about
the students' performance, giving generous and specific comments is a
way to enhance washback (Brown, 2004: 29).
Heaton (1975: 161-162) refers to this as the backwash effect, which
has macro and micro aspects. At the macro level, tests impact society
and the education system, for example through the development of the
curriculum. At the micro level, tests impact individual students or
teachers, for example by improving the teaching and learning process.
Washback can be either negative or positive (Saehu, 2012: 124-127).
It is easy to find negative washback, such as narrowing language
competencies down to only those involved in tests and neglecting the
rest: although language is a tool of communication, most students and
teachers in language classes focus only on the competencies that appear
in the test. On the other hand, a test has positive washback if it
encourages better teaching and learning, although this is quite difficult
to achieve. An example of the positive washback of a test is the National
Matriculation English Test in China: after the test was administered,
students' proficiency in English for actual or authentic language-use
situations improved.
Washback can also be strong or weak (Saehu, 2012: 122-123). An
example of a test with a strong effect is a national examination, whereas
a formative test has a weak effect. Let us compare how most students
and teachers react to those two kinds of test.
CHAPTER III
CONCLUSION
Language assessment or language testing is a field of study under the
umbrella of applied linguistics. Its main focus is the assessment of first, second or
other language in the school, college, or university context; assessment of language
use in the workplace; and assessment of language in the immigration, citizenship,
and asylum contexts.
Once the definition of language assessment is understood, the next thing for
teachers to comprehend is the principles of language assessment. To design a good
assessment, teachers should pay attention to practicality, reliability, validity,
authenticity, and washback. Those are the five major principles of language
assessment.
BIBLIOGRAPHY
Bachman, L.F. 1990. Fundamental Considerations in Language Testing. Oxford:
Oxford University Press.
Brown, H.D. 2004. Language Assessment: Principles and Classroom Practices.
White Plains, NY: Pearson Education.
Gronlund, N.E. and Waugh, C.K. 2009. Assessment of Student Achievement.
Upper Saddle River, NJ: Pearson Education.
Heaton, J.B. 1975. Writing English Language Tests. London: Longman.
Saehu, A. 2012. Testing and Its Potential Washback. In Bambang Y. Cahyono.
and Rohmani N. Indah (Eds.), Second Language Research and Pedagogy
(pp. 119-132). Malang: State University of Malang Press.
Weir, C.J. 1990. Communicative Language Testing. London: Prentice Hall.