
PSYCHOMETRICS

S3

Dr. AIT ALI OUSAID

Department of Psychology

2024/2025
Course contents
Module: Psychometrics
 Psychometrics and psychological tests;
 Tests, scales, individual differences, standardization…;
 Test development;
 Psychometric properties (validity and reliability);
 Application of psychological tests.

Schedule: Thursday, 10:30 – 12:30
Teaching mode: In person
Psychometrics and
psychological tests

PSYCHOLOGICAL MEASUREMENT
Psychological measurement and measurement theory are fundamental tools that make progress in psychological research possible. Estimating relationships between constructs and testing hypotheses are essential to the advancement of psychological theories (Schmidt et al.).

Research cannot estimate relationships between constructs directly. Constructs are measured, and it is the relationships between scores on the construct measures that are directly estimated. The process of measurement always contains error, which must be taken into account when interpreting those estimated relationships.
BRIEF HISTORY OF
PSYCHOMETRICS
The birth of psychometrics is generally situated at the end of the nineteenth century, when Sir Francis Galton created an anthropometric laboratory in 1884 to determine psychological attributes experimentally (Galton, 1884).
Galton, often referred to as the father of
psychometrics, attempted to measure such
attributes by using a vast variety of tasks,
recording performance accuracy as well as
reaction times.
Another pioneer in the field of psychometrics was James McKeen Cattell, who coined the term “mental tests” and was responsible for research that led to the development of modern tests.
The German physiologist Weber tried to
demonstrate the existence of a
psychological threshold, arguing that a
minimum stimulus was necessary to
activate a sensory system. After Weber,
the German psychologist Fechner
devised the law that the strength of a
sensation grows as the logarithm of the
stimulus intensity.
During the early twentieth
century, the interest in
measuring human qualities
intensified greatly when the
US implemented programs to
select soldiers using tests that
measured a range of abilities
relevant to military
performance.
Such tests produced a great deal of data, which led to questions that inspired the birth of psychometric theory as we currently know it, regarding in particular the analysis of psychological test data, the properties of psychological tests, and the selection of the best tests.
The first property of a test concerns the notion of reliability, which is the question of whether a test produces consistent scores when applied in the same circumstances.

One of the first scientists to take an interest in this topic was the psychologist and statistician Charles Edward Spearman, who addressed it in an article published in 1904.
The origin of psychometrics is often traced to
the primary example of a model for
measurement: the common factor model,
constructed by Charles Spearman (1904), in
which a set of observed variables is
regressed on a common latent variable.
The second property of a test concerns the notion of validity, which is the question of whether a test is valid in measuring what it is intended to measure. There are three main types of validity on which the worth of psychological tests is determined: predictive validity, content validity, and construct validity.
Psychometric research has further developed and branched out into a wide array of modeling techniques, such as:
Classical test theory (CTT),
Structural equation modeling (SEM),
Item response theory (IRT), and
Multidimensional scaling (MDS) (Jones & Thissen, 2007).
Psychometrics is a scientific discipline concerned with the question of how psychological constructs (e.g., intelligence, neuroticism, or depression) can be optimally related to observables (e.g., outcomes of psychological tests or genetic profiles).
Psychometrics is a field of study
that focuses on the theory and
techniques associated primarily
with the measurement of
constructs as well
as the development,
interpretation, and evaluation of
tests and measures (APA, 2013).
▰ Psychometrics—the discipline concerned
with the measurement and prediction of
psychological traits, aptitudes, and behavior
—plays a central role in scientific
psychology, educational measurement, and
the structure of our current society
(Borsboom & Wijsen, 2017;
Lemann, 1999).
WHAT IS
PSYCHOMETRICS?
Galton (1879) defined psychometry as “the art of imposing measurement and number upon operations of the mind” (Wasserman & Bracken, 2013).

Psychometrics is a highly interdisciplinary field, with connections to statistics, data theory, econometrics, biometrics, measurement theory, and mathematical psychology.
ITEMS AND TYPES OF
TESTS
Despite their varied
attributes, all psychological
assessments share the
property of being composed
of a series of items, tasks, or
questions to which an
individual provides a
response. Simply stated,
items are the building blocks
of psychological assessments.
Responses to the items on an assessment are used to make inferences about the individual’s level of the psychological trait being measured, most commonly through the creation of a score reflecting the individual’s standing on that trait.
Good items lead to good-quality scores and bad items
lead to bad-quality scores. But how does one
determine whether a particular item is good or bad?
This question is addressed through item analysis, a process by which the properties of items are evaluated with the goal of determining (a) which items are and which items are not making an acceptable contribution to the quality of the scores generated by the assessment and (b) which items should be revised or removed from the assessment.
An item, most basically, must
include a stimulus. That
stimulus can be a simple
question. Or it can be a
question followed by several
alternative answers.
▰ A Test :
 measurement device or technique
used to quantify behavior or aid in the
understanding and prediction of
behavior.
▰ Psychological test:
 A set of items designed to measure
characteristics of human beings that
pertain to behavior.
▰ Scale
 Relates raw scores on a test to some defined theoretical or empirical distribution.
 A method of operationalizing a psychological construct using a multiple-item test (questionnaire).
TYPES OF TESTS
Tests can be categorized by field of study within psychology, for example:
 Personality tests (Big Five, MMPI-2),
 Intelligence tests (WAIS, WISC, Stanford-Binet Intelligence Scale),
 Neuropsychological tests (e.g., verbal communication tests),
 Interest inventories (Strong Interest Inventory),
 Achievement tests (exams),
 Aptitude tests (e.g., abstract reasoning tests).
▰ Types of tests
Human ability tests
 Achievement tests: evaluate what an individual has learned.
 Aptitude tests: evaluate what an individual is capable of learning, that is, the capacity or future potential.
 Intelligence tests: measure a person’s potential to solve problems, adapt to novel situations, and profit from experience.
Personality tests
 Objective personality tests: present specific stimuli and ask for specific responses (e.g., true/false questions).
 Projective personality tests: present more ambiguous stimuli and ask for less specific responses (e.g., inkblots, drawings, photographs; Rorschach, TAT).
Tests may also be identified by
their general administration
procedures, that is, as individual
tests that are administered one on
one or as group tests that are
administered to groups of
individuals.
▰ Types of tests
Individual tests vs. group tests
 Individual tests: a test administrator gives a test to a single person (e.g., WAIS-III, MMPI-2).
 Group tests: a single examiner gives a test to a group of people (e.g., the SAT and GRE: Scholastic Aptitude Test and Graduate Record Exam).
Another distinction has been made between tests of
maximum performance and typical response.
Tests of maximum performance measure how well an
individual performs under standard conditions when
exerting maximal effort and are presumed to include
measures such as intelligence tests and achievement
tests.
Tests of typical response measure an individual’s
responses in a situation and are presumed to include
measures such as personality tests and attitude scales.
Tests may be further grouped according to the general type of information gathered. Specifically, tests may be (a) based on self-report (e.g., personality test, attitude measure, opinion poll), (b) based on performance or task (e.g., intelligence test, classroom test, eye exam, driver’s test), or (c) observational (e.g., observation of play behaviors, observation in an interview).
NORM-REFERENCED, CRITERION-REFERENCED, AND IPSATIVE TESTS
If the test will report proficiency on a set of standards,
then the test will be a
criterion-referenced test (CRT).
Will a single standard (such as pass–fail) be set, or will multiple proficiency categories be reported? If only a single proficiency level needs to be reported, then most of the items on the assessment should be designed to test knowledge and skills near that proficiency level (just below, on, and just above it).
If test takers need to be classified into multiple proficiency levels, items will need to maximally discriminate near each of the levels that will be reported.
A norm-referenced test compares an individual’s performance on a test with a predefined population or normative group, whereas a criterion-referenced test evaluates performance in terms of mastery of a set of well-defined objectives, skills, or competencies.
In norm-referenced tests, items are generally selected to have average difficulty levels and high discrimination between low and high scorers on the test. In criterion-referenced tests, items are primarily selected on the basis of how well they match the learning outcomes that are deemed most important.
In norm-referenced tests, the normative value indicates how an individual scored relative to the normative group but provides relatively little information about the person’s knowledge of, performance on, or level of the construct per se.
Criterion-referenced test outcomes, however, give
detailed information about how well a person has
performed on each of the objectives, skills, or
competencies included in the test.
A third type of test, ipsative, can be contrasted with norm-referenced tests. In ipsative tests, an individual’s performance is compared with his or her own performance, either in the same domain or construct over time or relative to his or her performance on other domains or constructs. The latter case is sometimes referred to as profiling.
SCORING, SCALING, NORMING
Raw scores are functions of item scores. The summed score is an often-used raw score, and it is calculated by summing the item scores. For cognitive tests that are scored as correct or incorrect, the summed score is the number of items the examinee answered correctly.
For noncognitive tests, the summed (or scale) score is
often the number of options or items endorsed that
contribute to that particular psychological scale (e.g.,
the Extraversion scale on the revised NEO Personality
Inventory; McCrae & Costa, 2010).
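As a small illustration (hypothetical item responses, not taken from the course), the summed raw score of a dichotomously scored cognitive test is simply the count of correct answers:

item_scores = [1, 0, 1, 1, 0, 1, 1]   # hypothetical responses to 7 items (1 = correct, 0 = incorrect)
raw_score = sum(item_scores)          # summed raw score = number of items answered correctly
print(raw_score)                      # 5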
Transformation of Raw Scores to
Scale Scores
Incorporating Normative Information

Incorporating normative information begins with the administration of the test to a norm group. Statistical characteristics of the scale-score distribution are set relative to this norm group. The scale scores are meaningful to the extent that the norm group is central to score interpretation (Kolen, 2006).
The Minnesota Multiphasic Personality Inventory (Hathaway & McKinley, 1989) was administered to a national norm group of nonpatient subjects intended to be representative of adults in the United States. These data were used to establish linear T scores with a mean of 50 and a standard deviation of 10.
By knowing the mean and standard deviation of the
scale scores, test users are able to quickly ascertain, for
example, that a test taker with a T score of 60 on the
Depression scale is 1 standard deviation above the
mean.
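A minimal Python sketch (with assumed norm-group values) of the linear T-score conversion described above:

def linear_t_score(raw, norm_mean, norm_sd):
    # standardize against the norm group, then rescale to mean 50, SD 10
    z = (raw - norm_mean) / norm_sd
    return 50 + 10 * z

# hypothetical norm group with mean 20 and SD 10: a raw score of 30 maps to T = 60
print(linear_t_score(30, norm_mean=20, norm_sd=10))   # 60.0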
Nonlinear transformations are also used to develop score scales. Normalized scale scores can be used by test users to quickly ascertain the percentile rank of a test taker’s score, using facts about the normal distribution.
Scale scores for all six psychological scales of the WAIS (Wechsler, 2008) are smoothed normalized scores set to have a mean of 100 and a standard deviation of 15. Thus, a score of 115 on the Perceptual Reasoning scale, for example, is 1 standard deviation above the mean and represents approximately the 84th percentile.
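A minimal sketch (Python with SciPy, assuming a normal score distribution) of how a normalized scale score maps to a percentile rank:

from scipy.stats import norm

def percentile_rank(score, mean=100.0, sd=15.0):
    # percentile rank of a normalized scale score under the normal distribution
    return 100 * norm.cdf((score - mean) / sd)

print(round(percentile_rank(115)))   # about 84, i.e., roughly the 84th percentile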
Incorporating Content Information

Ebel (1962) stated, “To be meaningful any test scores must be related to test content as well as to the scores of other examinees”. High scores on psychological scales (such as clinical and personality inventories), then, easily give an indication of the test taker’s personality or clinical state.
For example, interpretations of three T-score categories (≤44 = low, 45–55 = moderate, ≥56 = high) are provided for each of the scales of the revised NEO Personality Inventory (McCrae & Costa, 2010) that help to identify more specific behaviors and characteristics of test takers.
SCALES FOR BATTERIES AND COMPOSITES

Test batteries consist of tests in various content areas or items contributing to various psychological scales, with separate scores provided for each content area or scale.
Consider a test taker scoring 50 on the Neuroticism
domain and 60 on the Agreeableness domain. This test
taker’s score is near the 50th percentile on the
Neuroticism domain and near the 84th percentile on the
Agreeableness domain. Relative to the norm group, the
test taker exhibits more agreeable behaviors than
neurotic behaviors.
Composites

Composite scores that reflect performance on two or more tests are often used (Kolen, 2006). They are typically a linear combination of scale scores on different tests. For example, six composite scores are reported from the WAIS (Wechsler, 2008), including the Full Scale IQ scale, which is a measure of overall cognitive ability and is derived from the other five scales. Each is based on a sum of scale scores.
VERTICAL SCALING AND DEVELOPMENTAL SCORE SCALES

Assessing the extent to which the aptitude or achievement of test takers grows from one year to the next is important for many cognitive applications.
Educational and psychological batteries
are typically constructed using multiple
test levels, in which each level is
constructed to be appropriate for test
takers at a particular grade or age. Vertical
scaling procedures are used to relate
scores on these multiple test levels to a
developmental score scale that can be
used to assess test-taker growth over a
range of levels.
Measurement

Measurements may have the properties of nominal, ordinal, or interval scales.
A nominal scale consists of mutually exclusive categories with no implicit ordering. For example, researchers in a psychiatric day hospital might use a simplified diagnostic system consisting of three categories: 1 = schizophrenia; 2 = affective disorder; 3 = other. In this case, the numbers are simply labels for the categories: there is no sense in which the categories are ordered.
An ordinal scale has the property of ordered categories; that is, it measures a variable along some continuum. For example, psychiatric day hospital patients might be rated on a scale of psychosocial impairment consisting of three categories: 1 = slightly or not at all impaired; 2 = moderately impaired; 3 = highly impaired. On this scale, someone with a score of 3 is more impaired than someone with a score of 1.
However, there is no assumption that the distance between successive ordinal scale points is the same; that is, the distance between 1 and 2 is not necessarily the same as that between 2 and 3.
An interval scale has the additional property that the distances between successive points are assumed to be equal. For example, the Beck Depression Inventory (Beck et al., 1988), a self-report measure of depression, is usually treated as an interval scale. This assumes that the increase in severity of depression from a score of 10 to a score of 15 is the same as the increase from a score of 20 to a score of 25.
The importance of distinguishing
between these types of
measurement is that different
mathematical and statistical
methods are used to analyze data
from the different scale types. A
scale needs to have interval
properties before adding and
subtracting have any meaning.
Testing and assessment
A psychological test is a
systematic procedure for
obtaining samples of behavior,
relevant to cognitive or affective
functioning, and for scoring and
evaluating those samples according
to standards.
A test is a device or procedure in which a sample of an examinee’s behavior in a specified domain is obtained and subsequently evaluated and scored using a standardized process (AERA, APA, & NCME, 2014).
The label test is sometimes reserved for instruments on which responses are evaluated for their correctness or quality, whereas the terms scale and inventory are used for measures of attitudes, interests, and dispositions.
Psychological tests are often described
as standardized for two reasons:
- Uniformity of procedure in all
important aspects of the
administration, scoring, and
interpretation of tests;
- The second meaning of standardization
concerns the use of standards for
evaluating test results.
While studying the effects of
fatigue on children’s mental
ability, the German psychologist
Hermann Ebbinghaus devised
a technique known as the
Ebbinghaus Completion Test.
This technique called for
children to fill in the blanks in
text passages from which words
or word-fragments had been
omitted.
In 1904, the French psychologist Alfred Binet
was appointed to a commission charged with
devising a method for evaluating children who,
due to mental retardation or other
developmental delays, could not profit from
regular classes in the public school system
and would require special education.
In 1905, Binet and his
collaborator, Theodore Simon,
published the first
useful instrument for the
measurement of general cognitive
abilities or global intelligence. The
1905 Binet-Simon scale was a
series of 30 tests or tasks varied
in content and difficulty,
designed mostly to assess
judgment and reasoning ability
irrespective of school learning.
It was, in fact, a small
battery of carefully
selected tests
arranged in order of
difficulty and
accompanied by
precise instructions
on how to
administer and
interpret it.
In 1911 a German psychologist
named William Stern proposed
that the mental level attained on
the Binet-Simon scale, relabeled
as a mental age score, be
divided by the chronological
age of the subject to obtain a
mental quotient that would more
accurately represent ability at
different ages.
This now-familiar score, a true ratio
IQ, was popularized through its use
in the most famous revision of the
Binet-Simon scales—the Stanford-
Binet Intelligence Scale—
published in 1916 by Lewis
Terman.
However, the MA/CA ratio simply did not work
for adolescents and adults because their
intellectual development is far less uniform—
and changes are often
imperceptible—from year to year. The fact
that the maximum chronological age used in
calculating the ratio IQ of the original S-B was
16 years, regardless of the
actual age of the person tested, created
additional problems of interpretation.
In spite of several
problems with the ratio
IQ, its use would last for
several decades, until a
better way of
integrating age into the
scoring of intelligence
tests was devised by
David Wechsler
WAIS–III incorporates four index
scores (Verbal Comprehension,
Perceptual Organization,
Working Memory, and
Processing Speed).
Mental retardation is divided into four categories: mild, moderate, severe, and profound. Severe and profound mental retardation is usually caused by genetic mutations or accidents during birth, whereas mild forms have both genetic and environmental causes.
Models of psychometrics

CLASSICAL TEST THEORY
CTT traces its origins to
the procedures
pioneered by Galton,
Pearson, Spearman,
and Thorndike, and is
usually defined by
Gulliksen’s (1950)
classic book.
CTT underlies contemporary investigations of test score reliability, validity, and fairness, as well as the widespread use of statistical techniques such as factor analysis.
In classical test theory, an
individual’s true score is
conceptualized as the average
score in a hypothetical
distribution of scores that would
be obtained if the individual
took the same test an infinite
number of times.
Classical test theory
 CTT is often called the ‘true score model’;
 It is called classical relative to IRT, which is a more modern approach;
 CTT describes a set of procedures used to assess item and scale reliability, difficulty, discrimination, etc.;
 CTT analyses are the easiest and most widely used form of analyses;
 CTT analyses are performed on the test as a whole rather than on individual items.
▰ Basics of CTT:
- Assumes that every person has a true score on an item or a scale, which we could measure directly only if there were no error;
- CTT analyses assume that a person’s test score is comprised of their ‘‘true score’’ plus some measurement error;
- This is the common true score model.
▰ Based on the expected values of each component for each person, we can see that the difference between the expected value of the observed score (Xi) and the true score (ti) should be zero.
▰ E and X are random variables; t is a constant.
▰ However, this is theoretical and is not done at the individual level.
▰ In CTT we assume that the error is:
- Normally distributed;
- Uncorrelated with the true score;
- Has a mean of zero.
▰ Domain sampling theory
- Another central component of CTT;
- Another way of thinking about populations and samples;
- Domain: the population or universe of all possible items measuring a single concept or trait (theoretically infinite);
- Test: a sample of items from that universe.
 A person’s true score would be obtained by having them respond to all items in the ‘universe’ of items.
 We only see responses to the sample of items on the test.
 So reliability is the proportion of variance in the ‘‘universe’’ explained by the test variance.
 A universe is made up of a (possibly infinitely) large number of items. So as tests get longer they represent the domain better; therefore, longer tests should have higher reliability (see the sketch below).
 Also, if we take multiple random samples from the population, we obtain a distribution of sample scores that represents the population.
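A minimal simulation sketch (entirely hypothetical data, not from the course) illustrating domain sampling: tests built from larger random samples of the item universe correlate more strongly with the universe score, which is why longer tests tend to be more reliable.

import numpy as np

rng = np.random.default_rng(0)
n_people, n_universe_items = 500, 200
trait = rng.normal(size=n_people)
# each item in the universe reflects the trait plus item-specific noise
universe = trait[:, None] + rng.normal(size=(n_people, n_universe_items))
universe_score = universe.mean(axis=1)   # stand-in for the score on the whole domain

for test_length in (5, 20, 80):
    sampled_items = rng.choice(n_universe_items, size=test_length, replace=False)
    test_score = universe[:, sampled_items].mean(axis=1)
    r = np.corrcoef(test_score, universe_score)[0, 1]
    print(test_length, round(r, 3))   # correlation with the universe score rises with test length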
▰ CTT - Reliability
- Reliability is theoretically the correlation between the test score and the true score.
- Reliability can be viewed as a measure of consistency, or how well a test holds together.
- Reliability is measured on a scale of 0 to 1; the greater the number, the higher the reliability.
Although true scores do not really
exist, it is nevertheless possible to
imagine their existence: True scores
are the hypothetical entities that
would result from error-free
measurement.
At its heart, CTT is based on the assumption that an obtained test score reflects both true score and error score. Test scores may be expressed in the familiar equation:

Observed Score = True Score + Error
The observed score is the test score that
was actually obtained. The true score is
the hypothetical amount of the designated
trait specific to the examinee, a quantity
that would be expected if the examinee
were tested an infinite number of times
without any confounding effects of such
things as practice or fatigue.
Measurement error is defined as the
difference between true score and observed
score.
With regard to a single score, the ideas presented so far can be stated succinctly by means of the following equations:

X_observed = X_true + X_error
X = T + E
E = X - T ;  T = X - E
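A minimal simulation sketch (simulated values, not data from the course) of the true score model X = T + E, showing reliability as the share of observed-score variance that is true-score variance:

import numpy as np

rng = np.random.default_rng(1)
true_scores = rng.normal(loc=50, scale=10, size=10_000)   # T
errors = rng.normal(loc=0, scale=5, size=10_000)          # E: mean zero, independent of T
observed = true_scores + errors                            # X = T + E

reliability = true_scores.var() / observed.var()           # Var(T) / Var(X)
print(round(reliability, 2))                               # close to 100 / (100 + 25) = 0.80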
As a critic, Borsboom (2006) has argued that CTT has grave limitations for theory and model building through its misplaced emphasis on observed scores and true scores rather than on the latent attributes that tests are intended to measure.
The first shortcoming of CTT is that the interpretation of respondent characteristics depends on the test used. Respondents will appear smarter if an easier test is administered, but will appear less smart if a more difficult test is answered. The second shortcoming of CTT is that test characteristics are sample-dependent. The same test administered to a group of high-ability students and to another group of low-ability students will produce items with different levels of difficulty, for example.
▰ In the first sample, item difficulty will appear lower than in the second group. These shortcomings imply that comparisons of test characteristics can only be made within the same context (sample). Because test parameters depend on persons’ latent traits and vice versa, item and test characteristics will change when other persons (samples with different levels of the latent trait) answer the test.
The third shortcoming of CTT is that the theory
assumes that errors of measurement are equal for all
persons. This is problematic because persons with
different levels of ability will show different levels of
error (guessing) in a test that evaluates intelligence or
any other construct, for example. The fourth
shortcoming of CTT is that it does not allow accurate
predictions about possible results for a respondent or
for a sample on an item, using only their ability
scores. This information would be important for a test
designer interested in developing a test for a
population with specific characteristics.
ITEM RESPONSE THEORY
Broadly speaking, the goals of IRT are (a) to
generate items that provide the
maximum amount of information
possible concerning the ability or trait
levels of examinees who respond to them in
one fashion or another, (b) to give
examinees items that are adapted to
their ability or trait levels, and thus (c) to
reduce the number of items needed to
pinpoint any given test taker’s standing on the
ability or latent trait while minimizing
measurement error.
▰ Three basic components of IRT
 Item Response Function (IRF): a mathematical function that relates the latent trait to the probability of endorsing an item.
 Item Information Function: an indication of item quality; an item’s ability to differentiate among respondents.
 Invariance: item characteristics are population-independent within a linear transformation.
A reasonable assumption is that each
examinee responding to a test item
possesses some amount of the
underlying ability. Thus, one can consider
each examinee to have a numerical value,
a score, that places him or her
somewhere on the ability scale. This
ability score will be denoted by the Greek
letter theta, θ. At each ability level, there
will be a certain probability that an
examinee with that ability will give a
correct answer to the item.
In the case of a typical test item,
this probability will be small for
examinees of low ability and large
for examinees of high ability.
The probability of correct response is
near zero at the lowest levels of ability.
It increases until at the highest levels of
ability, the probability of correct
response approaches 1. This S-shaped
curve describes the relationship between
the probability of correct response to an
item and the ability scale. In item
response theory, it is known as the item
characteristic curve. Each item in a test
will have its own item characteristic
curve.
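A minimal sketch of a two-parameter logistic item characteristic curve (this parameterization is a common one, assumed here for illustration): b is the item’s difficulty (its location on the ability scale) and a its discrimination (steepness).

import numpy as np

def icc(theta, a, b):
    # probability of a correct response for an examinee with ability theta
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

print(round(icc(theta=0.0, a=1.0, b=0.0), 2))   # 0.5 at the item's difficulty
print(round(icc(theta=2.0, a=1.0, b=0.0), 2))   # 0.88: higher ability, higher probability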
There are two technical properties of an item characteristic curve that are used to describe it. The first is the difficulty of the item. Under item response theory, the difficulty of an item describes where the item functions along the ability scale. For example, an easy item functions among the low-ability examinees and a hard item functions among the high-ability examinees; thus, difficulty is a location index.
The second technical property is discrimination, which describes how well an item can differentiate between examinees having abilities below the item location and those having abilities above the item location. This property essentially reflects the steepness of the item characteristic curve in its middle section. The steeper the curve, the better the item can discriminate. The flatter the curve, the less the item is able to discriminate, since the probability of correct response at low ability levels is then nearly the same as at high ability levels.
It should be noted that these two
properties say nothing about
whether the item really measures
some facet of the underlying
ability or not; that is a question
of validity. These two properties
simply describe the form of the
item characteristic curve.
Difficulty can have the following levels: very easy, easy, medium, hard, very hard.
Discrimination can have the following levels: none, low, moderate, high, perfect.
A third item property, the guessing parameter, is discussed with the three-parameter model below.
One of the most basic differences
between CTT and IRT stems from
the fact that in CTT, interest
centers mainly on the examinee’s
total score on a test, which
represents the sum of the item
scores; whereas in IRT—as its name
implies—the principal focus is on
the examinee’s performance on
individual items.
Rasch model

The one-parameter model, or Rasch model, is based on the assumption that both guessing and item differences in discrimination are negligible or constant.
Rasch began his work in educational
and psychological measurement in
the
late 1940’s. His work marked IRT
with its probabilistic modeling of the
interaction between an individual
item and an individual examinee.
The Rasch model is a probabilistic
unidimensional model which asserts
that (1) the easier the question the
more likely the student will respond
correctly to it, and (2) the more able
the student, the more likely he/she
will pass the question compared to a
less able student.
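In common notation (a standard presentation of the model, not taken verbatim from the slides), the Rasch model gives the probability that person i with ability θ_i answers item j with difficulty b_j correctly as

P(X_{ij} = 1 \mid \theta_i, b_j) = \frac{\exp(\theta_i - b_j)}{1 + \exp(\theta_i - b_j)}

so the two assertions above follow directly: the probability rises as θ_i increases and falls as b_j increases.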
According to Fischer (1974), the Rasch model can be derived from the following assumptions:
(1) Unidimensionality. All items are functionally dependent upon only one underlying continuum.
(2) Monotonicity. All item characteristic functions are strictly monotonic in the latent trait. The item characteristic function describes the probability of a predefined response as a function of the latent trait.
(3) Every person has a certain probability of giving a predefined response to each item, and this probability is independent of the answers given to the preceding items.
(4) Dichotomy of the items. For each item there are only two different responses, for example positive and negative.
Advantages of IRT
A benefit of item response theory is its treatment of reliability and error of measurement through item information functions, which are computed for each item (Lord, 1980). These functions provide a sound basis for choosing items in test construction. The item information function takes all item parameters into account and shows the measurement efficiency of the item at different ability levels.
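A minimal sketch (using the standard two-parameter logistic form, assumed for illustration) of the item information function, which peaks near the item’s difficulty:

import numpy as np

def item_information(theta, a, b):
    # I(theta) = a^2 * P(theta) * (1 - P(theta)) for a 2PL item
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1.0 - p)

print(round(item_information(theta=0.0, a=1.5, b=0.0), 2))   # 0.56: maximum information at theta = b
print(round(item_information(theta=3.0, a=1.5, b=0.0), 2))   # 0.02: little information far from b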
Test Validity
“The problem of validity is that of
whether a test really measures
what it purports to measure”
(Kelley, 1927).
“The essential question of test
validity is how well a test does the
job it is employed to do. The same
test may be used for several
different purposes, and its validity
may be high for one, moderate for
another and low for a third”
(Cureton, 1951).
Validity is “an integrated evaluative
judgment of the degree to which
empirical evidence and theoretical
rationales support the adequacy and
appropriateness of inferences and
actions based on test scores or other
modes of assessment” (Messick,
1989).
“A test is valid for measuring an attribute if and only if (a) the attribute exists and (b) variations in the attribute causally produce variation in the measurement outcomes” (Borsboom, Mellenbergh, & van Heerden, 2004).
“At its essence, validity
means that the
information yielded by a
test is appropriate,
meaningful, and useful
for decision making –
the purpose of mental
measurement”
(Osterlind, 2010).
Validity is not an all-or-nothing property; it is a matter of degree. Two tests can both be somewhat valid, but one can be more valid than the other.
Evolving notion of
validity: Consensus, what
consensus?
So, the current “consensus position”
(Newton & Shaw, 2013) regarding
validity, endorsed by major educational
and psychological bodies (AERA, APA, &
NCME, 2014) and numerous validity
scholars states that validity pertains to
interpretations not tests, is a single
evaluative judgement based on five
sources of evidence, is a matter of
degree, and is a continuous process.
The turn of the 21st century saw
Borsboom and colleagues (2004)
make a compelling critique of the
‘consensus position’ on both counts
and in doing so, argue for a return
to the 1920s version of validity.
Borsboom et al. (2004) put forward a very simple view of validity. Validity is not about interpretations or intended uses but is a property of tests, and a test is valid if the construct you are trying to measure exists (in a realist sense) and causes the measured behaviour/response. In other words, completing an intelligence test should require the use of intelligence, and thus differences in test responses between persons would be the result of differences in their intelligence.
The simplicity of Borsboom et al.’s view is a virtue, and it provides clear guidance for establishing validity through a single question: does the construct of interest cause the observed/measured behaviour/response?
Validity is not complex, faceted, or
dependent on
nomological networks and social
consequences of testing. It is a very
basic concept and was correctly
formulated, for instance, by Kelley
(1927, p. 14) when he stated that a
test is valid if it measures what it
purports to measure.
“It is farfetched to presume that such
a network implicitly defines the
attributes in question…. It is even
more contrived to presume that the
validity of a measurement procedure
derives, in any sense, from the
relation between the measured
attribute and other
attributes” (Borsboom et al., 2004)
Observing a positive correlation between
a measure labelled intelligence and
educational achievement actually tells
you nothing about what is actually
measured. What makes a measure of
intelligence so, is that responding to the
test requires the use of intelligence, what
makes a test of high school history
knowledge so, is that answering the
questions requires the use of knowledge
gained during high school history class.
There are five forms of validity
evidence: content, response processes
(e.g., cognitive processes during item
responding), relations with other
variables (e.g., convergent,
discriminant, concurrent, and
predictive validity evidence), internal
structure (e.g., factor structure), and
evidence based on consequences
(whether test use is fair and unbiased).
Validation is an on-going process in
which various sources of validity
evidence are accumulated to build
an argument in favor of the validity
of the intended interpretation and
use of the test.
Content validity
The first element of evidence relates to
the content of a psychometric measure
and suggests that psychometric
measures should contain exclusively
content relevant to the construct at hand
and that the content should be
representative of the whole construct
domain.
This is very similar to original validity
arguments made by Buckingham (1921) who,
working within the domain of cognitive ability,
argued that measures of intelligence should
cover general learning ability across domains
rather than learning of a narrow topic that is
overly sensitive to the recency with which the
learning had taken place. Simply, if a
psychometric is designed to measure X, it
should contain representative coverage of X.
Evidence relating
to response processes
Evidence relating to response
processes is perhaps one of the
most sophisticated elements of
validity (Borsboom et al., 2004;
Cronbach & Meehl, 1955; Embretson,
2016) and is often ignored in
practice (Borsboom et al., 2004).
Briefly, if a psychometric test is designed
to measure a construct (e.g., school
knowledge) then responding to the items
of the psychometric should require the
use of the construct (e.g., retrieval of
school knowledge). It follows that the
examination of these item responses is
vital for establishing what is measured.
Evidence regarding
relations with other
variables
Evidence regarding relations with other
variables represents the vast majority of
previous validity debate and encapsulates
criterion associations and a large proportion
of construct validity (Cronbach & Meehl,
1955). Relations between psychometric
measures and criterion variables have always
been seen as important with many early
validity discussions focusing heavily on
predictive capabilities (Buckingham, 1921;
Cureton, 1950; 1951; Guilford, 1942, 1946).
The simple suggestion is that if, for
example, intelligence drives learning
then a measure of the rate at which
learning occurs or of scholastic
achievement should constitute a
measure of intelligence
(Buckingham, 1921). Thus, a positive
correlation between a measure of
intelligence and educational
attainment would provide validity
evidence
Convergent and discriminant
evidence.
Relationships between test scores
and other measures intended to
assess the same or similar
constructs provide convergent
evidence, whereas relationships
between test scores and measures
purportedly of different constructs
provide discriminant evidence.
Evidence relating to
the internal structure
of a measure
Evidence relating to the internal
structure of a measure concerns the
relationship between items and
represents a separate branch of validity
evidence according to the Standards. If
ten items are expected to measure the
same construct then they should be
correlated and load onto a single factor
(Spearman, 1904).
If the items are hypothesised to measure
four sub-components of one higher-order
factor then all items should be
correlated, but there should be four
distinguishable factors and modelling a
higher-order factor should be possible.
Evidence concerning
consequences
Evidence concerning consequences
regards the suggestion that we must
consider the intended and unintended
effects of test use when deciding whether
or how a psychometric test should be
used (e.g., Cronbach 1988; Hubley &
Zumbo, 2011; Messick, 1989; Zumbo
2007, 2009).
In essence, consequences consider
fairness and represent the major
contribution of Messick (1989) to validity
theory. The idea that the social
consequences of psychometric use
constitutes a branch of validity is quite a
step from the scientific (or statistical)
forms of
validity evidence previously endorsed;
indeed, it is inherently political.
TEST RELIABILITY
The term reliability has been used in two
ways in the measurement literature.
First, the term has been used to refer to
the reliability coefficients of classical test
theory, defined as the correlation
between scores on two equivalent forms
of the test, presuming that taking one
form has no effect on performance on the
second form.
Second, the term has been used in a
more general sense, to refer to the
consistency of scores across replications
of a testing procedure, regardless of how
this consistency is estimated or reported
(e.g., in terms of standard errors,
reliability coefficients per se,
generalizability coefficients,
error/tolerance ratios, item response
theory (IRT) information functions, or
various indices of classification
consistency).
Evaluating the consistency of scores involves an analysis of the variation in each test taker’s scores across replications of the testing procedure. The test is administered and then, after a brief period during which the examinee’s standing on the variable being measured would not be expected to change, the test (or a distinct but equivalent form of the test) is administered a second time; it is assumed that the first administration has no influence on the second.
The impact of such measurement errors can be
summarized in a number of ways, but typically, in
educational and psychological measurement, it is
conceptualized in terms of the standard deviation
in the scores for a person over replications of the
testing procedure.
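One common way to express this spread, stated here as a standard classical-test-theory result rather than something taken from the slides, is the standard error of measurement, SEM = SD_X * sqrt(1 - reliability); a minimal sketch:

import math

def standard_error_of_measurement(sd_x, reliability):
    # spread of a person's scores over replications of the testing procedure
    return sd_x * math.sqrt(1.0 - reliability)

print(standard_error_of_measurement(sd_x=10.0, reliability=0.91))   # 3.0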
In classical test theory, the consistency of test
scores is evaluated mainly in terms of reliability
coefficients, defined in terms of the correlation
between scores derived from replications of the
testing procedure on a sample of test takers.
Three broad categories of reliability coefficients are recognized:
(a) Coefficients derived from the administration of alternate forms in independent testing sessions (alternate-form or parallel-forms coefficients);
(b) coefficients obtained by
administration
of the same form on separate
occasions (test-retest coefficients);
(c) Coefficients based on the
relationships/interactions among
scores derived from individual items
or subsets of the items within a
test, all data accruing from a single
administration (internal-consistency
coefficients).
For the split-halves method, scores
on two
more-or-less parallel halves of the
test (e.g., odd-numbered items and
even-numbered items) are
correlated, and the resulting half-
test reliability
coefficient is statistically adjusted
to estimate reliability for the full-
length test.
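The statistical adjustment referred to here is usually the Spearman-Brown formula; a minimal sketch (with an assumed half-test correlation):

def spearman_brown_full_length(r_half):
    # projects the half-test correlation to the reliability of the full-length test
    return 2 * r_half / (1 + r_half)

print(round(spearman_brown_full_length(0.70), 2))   # 0.82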
By far the most common statistical
index of internal consistency is
Cronbach’s alpha, which provides a
lower-bound estimate of test score
reliability equivalent to the average
split-half consistency coefficient for
all possible divisions of the test into
halves (Hogan et al., 2000).
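A minimal computational sketch (simulated responses to a hypothetical scale) of coefficient alpha from a persons-by-items score matrix:

import numpy as np

def cronbach_alpha(scores):
    # scores: rows = persons, columns = items
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

rng = np.random.default_rng(2)
trait = rng.normal(size=(300, 1))
responses = trait + rng.normal(size=(300, 6))   # six roughly parallel items
print(round(cronbach_alpha(responses), 2))       # alpha for this simulated scale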
[Figure: Logistic item characteristic curves for five equally discriminating items differing only in difficulty. Probability of a correct response is plotted against the latent trait (theta); the curves shift along the theta axis as item difficulty changes.]

[Figure: Item characteristic curves for five equally difficult items differing only in their discrimination parameters. Probability is plotted against theta; the curves differ in steepness.]
[Figure: Item response function for a binary item, plotting the probability of getting the item right against the measured concept (theta). Annotated parameters: difficulty, discrimination (slope), guessing, and inattention. Associated models: 1-parameter, 2-parameter, 3-parameter, 4-parameter, and unfolding models.]
[Figures: One-Parameter Logistic/Rasch Model (1PL), 7 items of varying difficulty (b); Two-Parameter Model (2PL), 5 items of varying difficulty (b) and discrimination (a); Three-Parameter Model (3PL), one item showing the guessing parameter (c); Graded Model, an example of a model with polytomous items such as Likert scales.]

Example of a polytomous item: “I experience dizziness when I first wake up in the morning”
(0) never
(1) rarely
(2) some of the time
(3) most of the time
(4) almost always

[Figure: Category response curves for this item, representing the probability of responding in a particular category conditional on trait level.]