Course 1. Psychometrics 2024, 2
PSYCHOMETRICS
S3
2024/2025 1
Course contents
Module: Psychometrics
Psychometrics and psychological tests;
Teaching mode:
In person
PSYCHOLOGICAL MEASUREMENT
Psychological measurement and
measurement theory are fundamental
tools that make progress in
psychological research possible.
Estimating relationships between
constructs and testing hypotheses are
essential to the advancement of
psychological theories (Schmidt et al.,
Research cannot estimate
relationships between
constructs directly.
Constructs are measured,
and it is relationships
between scores on the
construct measures that are
directly estimated;
Another pioneer in the field of
psychometrics was James McKeen
Cattell, who coined the term “mental
tests” and was responsible for the
research that led to the
development of modern tests.
The German physiologist Weber tried to
demonstrate the existence of a
psychological threshold, arguing that a
minimum stimulus was necessary to
activate a sensory system. After Weber,
the German psychologist Fechner
devised the law that the strength of a
sensation grows as the logarithm of the
stimulus intensity.
During the early twentieth
century, the interest in
measuring human qualities
intensified greatly when the
US implemented programs to
select soldiers using tests that
measured a range of abilities
relevant to military
performance.
Such tests produced a great
deal of data, which led to
questions that inspired the
birth of psychometric theory
as we currently know it,
regarding in particular the
analysis of psychological test
data, the properties of
psychological tests, and the
selection of the best tests.
The first property of a test
concerns the notion of reliability,
which is the question of whether a
test produces consistent scores
when applied in the same
circumstances.
Psychometric research has further developed
and branched out in a wide array of modeling
techniques such as:
Classical test theory (CTT),
Structural equation modeling (SEM),
Item response theory (IRT), and
Multidimensional scaling (MDS) (Jones &
Thissen, 2007).
Psychometrics is a scientific
discipline concerned with
the question of how
psychological constructs
(e.g., intelligence,
neuroticism, or depression)
can be optimally related to
observables (e.g., outcomes
of psychological tests,
genetic profiles,
Psychometrics is a field of study
that focuses on the theory and
techniques associated primarily
with the measurement of
constructs as well
as the development,
interpretation, and evaluation of
tests and measures (APA, 2013).
▰ Psychometrics—the discipline concerned
with the measurement and prediction of
psychological traits, aptitudes, and behavior
—plays a central role in scientific
psychology, educational measurement, and
the structure of our current society
(Borsboom & Wijsen, 2017;
Lemann, 1999).
WHAT IS PSYCHOMETRICS?
Item analysis, a process by which
the properties of items are
evaluated with the goal of
determining (a) which items
are and which items are not
making an acceptable
contribution to the quality
of the scores generated by
the assessment and (b)
which items should be
revised or removed from
An item, most basically, must
include a stimulus. That
stimulus can be a simple
question. Or it can be a
question followed by several
alternative answers.
▰ Test:
A measurement device or technique
used to quantify behavior or aid in the
understanding and prediction of
behavior.
▰ Psychological test:
A set of items designed to measure
characteristics of human beings that
pertain to behavior.
▰ Scale:
A way of relating raw scores on a test to some
defined theoretical or empirical
distribution.
A method of operationalizing a
psychological construct using a
multiple-item test (questionnaire).
TYPES OF TESTS
Tests can be categorized by field of study within psychology, for
example:
▰ Types of tests
Individual Tests vs. Group Tests
Individual tests: a test administrator
gives a test to a single person. E.g.,
WAIS-III, MMPI-2
Group tests: a single examiner gives a
test to a group of people. E.g., SAT
(Scholastic Aptitude Test), GRE
(Graduate Record Examination)
Another distinction has been made between tests of
maximum performance and typical response.
Tests of maximum performance measure how well an
individual performs under standard conditions when
exerting maximal effort and are presumed to include
measures such as intelligence tests and achievement
tests.
Tests of typical response measure an individual’s
responses in a situation and are presumed to include
measures such as personality tests and attitude scales.
Tests may be further grouped according to the general
type of information gathered.
NORM-REFERENCED, CRITERION-REFERENCED,
AND IPSATIVE TESTS
If the test will report proficiency on a set of standards,
then the test will be a
criterion-referenced test (CRT).
Will a single standard
(such as pass–fail) be set, or will
multiple categories be reported?
If only a single
proficiency level needs to be
reported, then most of the
items on the assessment should be
designed to test
knowledge and skills near the
proficiency level
(just below, on, and just above).
If test takers need
to be classified into multiple proficiency levels,
items will need to maximally discriminate near each
of the levels that will be reported.
A norm-referenced test compares an individual’s
performance on a test with a predefined population or
normative group, whereas a criterion-referenced test
evaluates performance in terms of mastery of a set of
well defined objectives, skills, or competencies.
In norm-referenced tests, items are generally selected
to have average difficulty levels and high
discrimination between low and high scorers on the
test. In criterion-referenced tests, items are primarily
selected on the basis of how well they match the
learning outcomes that are deemed most important.
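The two selection criteria for norm-referenced tests, difficulty and discrimination, can be computed from a matrix of scored responses. The sketch below is illustrative only. It assumes items scored 0/1, takes difficulty as the proportion of correct answers, and takes discrimination as the correlation between an item and the total score on the remaining items (one common index among several). The response matrix is made up and the function names are hypothetical.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (pstdev(x) * pstdev(y))


def item_statistics(responses):
    """responses: one row per test taker, each a list of 0/1 item scores."""
    n_items = len(responses[0])
    results = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        # Total score on the other items, so the item is not correlated with itself.
        rest = [sum(row) - row[j] for row in responses]
        results.append({
            "difficulty": mean(item),               # proportion correct
            "discrimination": pearson(item, rest),  # corrected item-total correlation
        })
    return results


# Made-up data: 6 test takers, 4 items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
]
stats = item_statistics(responses)
for number, s in enumerate(stats, start=1):
    print(number, round(s["difficulty"], 2), round(s["discrimination"], 2))
```

Note that with this index a higher difficulty value means an easier item, because more test takers answered it correctly.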
In norm-referenced tests, the normative value
indicates how an individual scored relative to the
normative group but provides relatively little
information about the person’s knowledge of,
performance on, or level of the construct per se.
Criterion-referenced test outcomes, however, give
detailed information about how well a person has
performed on each of the objectives, skills, or
competencies included in the test.
A third type of test, ipsative, can be contrasted with
norm-referenced tests. In ipsative tests, an
individual’s performance is compared with his or her
performance either in the same domain or construct
over time or relative to his or her performance on
other domains or constructs. The latter case is
sometimes referred to as profiling.
SCORING, SCALING, NORMING
Raw scores are functions of item
scores. The summed
score is an often-used
raw score, and it is
calculated by summing
the item scores. For
cognitive tests that
are scored as correct
or incorrect, the
summed score is the
number of items the
test taker answers correctly.
For noncognitive tests, the summed (or scale) score is
often the number of options or items endorsed that
contribute to that particular psychological scale (e.g.,
the Extraversion scale on the revised NEO Personality
Inventory; McCrae & Costa, 2010).
Transformation of Raw Scores to Scale Scores
Incorporating Normative Information:
The Minnesota Multiphasic Personality
Inventory (Hathaway &
McKinley, 1989) was
administered to a national norm
group of nonpatient subjects
intended to be representative of
adults in the United States.
These data were used to
establish linear T scores with a
mean of 50 and standard
deviation of 10.
By knowing the mean and standard deviation of the
scale scores, test users are able to quickly ascertain, for
example, that a test taker with a T score of 60 on the
Depression scale is 1 standard deviation above the
mean.
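A linear T score is the raw score's distance from the norm-group mean, expressed in norm-group standard deviations and rescaled to a mean of 50 and a standard deviation of 10. The sketch below shows that arithmetic. The norm-group mean and standard deviation are invented for illustration, and the function name is hypothetical.

```python
def linear_t_score(raw_score, norm_mean, norm_sd):
    """Linear transformation to a scale with mean 50 and SD 10."""
    z = (raw_score - norm_mean) / norm_sd  # distance from the norm mean in SD units
    return 50 + 10 * z


# Hypothetical norm group: raw-score mean 20, standard deviation 5.
print(linear_t_score(20, norm_mean=20, norm_sd=5))  # at the mean
print(linear_t_score(25, norm_mean=20, norm_sd=5))  # one SD above the mean
```

Because the transformation is linear, it changes the units of the scale but not the shape of the score distribution.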
Nonlinear
transformations are
also used to develop
score scales.
Normalized scale
scores can be used
by test users to
quickly ascertain the
percentile rank of a
test taker’s score,
using facts about the
normal distribution.
Scores for all six
psychological scales of
the WAIS (Wechsler,
2008) are smoothed
normalized scores set to
have a mean of 100 and a
standard deviation of 15.
Thus, a score of 115 on
the Perceptual Reasoning
scale, for example, is 1
standard deviation above
the mean and represents
approximately the 84th percentile.
Incorporating Content Information
For example, interpretations
of three T-score categories
(≤44 = low, 45–55 = moderate,
≥56 = high) are provided for
each of the scales of the
revised NEO Personality
Inventory (McCrae & Costa,
2010) that help to identify
more specific behaviors and
characteristics of test takers.
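The three interpretive bands quoted above amount to a simple lookup on the T score. A minimal sketch follows, assuming T scores are reported as whole numbers (the bands leave no rule for values such as 44.5). The function name is hypothetical and is not part of any published scoring software.

```python
def t_score_category(t_score):
    """Map a whole-number T score to the band given in the text."""
    if t_score <= 44:
        return "low"
    if t_score <= 55:
        return "moderate"
    return "high"  # 56 and above


for t in (40, 45, 55, 60):
    print(t, t_score_category(t))
```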
SCALES FOR BATTERIES AND COMPOSITES
Consider a test taker scoring 50 on the Neuroticism
domain and 60 on the Agreeableness domain. This test
taker’s score is near the 50th percentile on the
Neuroticism domain and near the 84th percentile on the
Agreeableness domain. Relative to the norm group, the
test taker exhibits more agreeable behaviors than
neurotic behaviors.
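The percentile figures in this example follow from the normal distribution once the scale's mean and standard deviation are known. The sketch below reproduces them, assuming scores are normally distributed in the norm group and using the T-score metric (mean 50, standard deviation 10). The function name is hypothetical.

```python
from statistics import NormalDist


def percentile_rank(score, scale_mean, scale_sd):
    """Percentage of the norm group expected to score at or below `score`,
    assuming a normal distribution of scale scores."""
    return 100 * NormalDist(mu=scale_mean, sigma=scale_sd).cdf(score)


neuroticism = percentile_rank(50, scale_mean=50, scale_sd=10)
agreeableness = percentile_rank(60, scale_mean=50, scale_sd=10)
print(round(neuroticism), round(agreeableness))
```

The same function applies to scales with a mean of 100 and a standard deviation of 15: a score of 115 is also one standard deviation above the mean and so has the same percentile rank as a T score of 60.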
Composites
VERTICAL SCALING AND DEVELOPMENTAL SCORE SCALES
A psychological test is a
systematic procedure for
obtaining samples of behavior,
relevant to cognitive or affective
functioning, and for scoring and
evaluating those samples according
to standards.
A test is a device or procedure
in which a sample of an
examinee’s behavior in a
specified domain is obtained
and subsequently evaluated
and scored using a
standardized process (2014).
The label test is
sometimes reserved for
instruments on which responses
are evaluated for their
correctness or quality, whereas the
terms scale and inventory are
used for measures of attitudes,
interests, and dispositions.
Psychological tests are often described
as standardized for two reasons:
- The first is uniformity of procedure in all
important aspects of the
administration, scoring, and
interpretation of tests;
- The second meaning of standardization
concerns the use of standards for
evaluating test results.
While studying the effects of
fatigue on children’s mental
ability, the German psychologist
Hermann Ebbinghaus devised
a technique known as the
Ebbinghaus Completion Test.
This technique called for
children to fill in the blanks in
text passages from which words
or word-fragments had been
omitted.
In 1904, the French psychologist Alfred Binet
was appointed to a commission charged with
devising a method for evaluating children who,
due to mental retardation or other
developmental delays, could not profit from
regular classes in the public school system
and would require special education.
In 1905, Binet and his
collaborator, Theodore Simon,
published the first
useful instrument for the
measurement of general cognitive
abilities or global intelligence. The
1905 Binet-Simon scale was a
series of 30 tests or tasks varied
in content and difficulty,
designed mostly to assess
judgment and reasoning ability
irrespective of school learning.
It was, in fact, a small
battery of carefully
selected tests
arranged in order of
difficulty and
accompanied by
precise instructions
on how to
administer and
interpret it.
In 1911 a German psychologist
named William Stern proposed
that the mental level attained on
the Binet-Simon scale, relabeled
as a mental age score, be
divided by the chronological
age of the subject to obtain a
mental quotient that would more
accurately represent ability at
different ages.
This now-familiar score, a true ratio
IQ, was popularized through its use
in the most famous revision of the
Binet-Simon scales—the Stanford-
Binet Intelligence Scale—
published in 1916 by Lewis
Terman.
However, the MA/CA ratio simply did not work
for adolescents and adults because their
intellectual development is far less uniform—
and changes are often
imperceptible—from year to year. The fact
that the maximum chronological age used in
calculating the ratio IQ of the original S-B was
16 years, regardless of the
actual age of the person tested, created
additional problems of interpretation.
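The ratio IQ and the age ceiling described above can be sketched as follows. The division of mental age by chronological age and the 16-year ceiling come from the text. Multiplying the quotient by 100 is the convention adopted with the Stanford-Binet and is added here for completeness. The function name is hypothetical.

```python
MAX_CHRONOLOGICAL_AGE = 16  # ceiling used in the original Stanford-Binet, per the text


def ratio_iq(mental_age, chronological_age):
    """Ratio IQ = 100 * MA / CA, with CA capped at the ceiling."""
    capped_age = min(chronological_age, MAX_CHRONOLOGICAL_AGE)
    return 100 * mental_age / capped_age


print(ratio_iq(10, 8))   # child performing above age level
print(ratio_iq(12, 12))  # child performing at age level
print(ratio_iq(16, 40))  # adult: the divisor is 16, not 40
```

The last call illustrates the interpretation problem: every adult is scored as if 16 years old, so the quotient no longer reflects the person's actual age.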
In spite of several
problems with the ratio
IQ, its use would last for
several decades, until a
better way of
integrating age into the
scoring of intelligence
tests was devised by
David Wechsler.
WAIS–III incorporates four index
scores (Verbal Comprehension,
Perceptual Organization,
Working Memory, and
Processing Speed).
Mental retardation is
divided into four categories:
mild, moderate, severe, and
profound. Severe and
profound mental retardation
is usually caused by genetic
mutations or accidents
during birth, whereas mild
forms have both genetic and
Models of psychometrics
CLASSICAL TEST THEORY
CTT traces its origins to
the procedures
pioneered by Galton,
Pearson, Spearman,
and Thorndike, and is
usually defined by
Gulliksen’s (1950)
classic book.
contemporary
investigations of test
score reliability,
validity, and
fairness as well as
the widespread use of
statistical techniques
such as factor
analysis.
In classical test theory, an
individual’s true score is
conceptualized as the average
score in a hypothetical
distribution of scores that would
be obtained if the individual
took the same test an infinite
number of times.
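The idea of a true score as a long-run average can be illustrated with a small simulation. This is a sketch under stated assumptions, not a description of real testing: the true score of 100, the error standard deviation of 5, and the normal shape of the errors are all invented, and repeated administrations are treated as independent with no practice or fatigue effects.

```python
import random
from statistics import mean


def simulate_observed_scores(true_score, error_sd, n_administrations, seed=0):
    """Observed score = true score + random error, repeated n times."""
    rng = random.Random(seed)
    return [true_score + rng.gauss(0, error_sd) for _ in range(n_administrations)]


few = simulate_observed_scores(100, 5, 10)
many = simulate_observed_scores(100, 5, 100_000)
print(round(mean(few), 2))   # can be noticeably off 100
print(round(mean(many), 2))  # very close to 100
```

As the number of hypothetical administrations grows, the average observed score settles on the true score, which is the sense in which the true score is defined.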
Classical test theory
CTT is often called the ‘true score model’;
It is called classical relative to IRT, which is a
modern approach;
CTT describes a set of procedures used to test
item and scale reliability, difficulty,
discrimination, etc.;
CTT analyses are the easiest and the most
widely used form of analysis;
CTT analyses are performed on the test as a
whole rather than on individual items.
▰ Basics of CTT:
- Assumes that every person has a true score on
an item or a scale, which we could observe if we could measure it
directly without error;
- Normally distributed;
▰ Domain sampling theory
- Another central component of CTT
A universe is made up of a (possibly infinitely) large
number of items. As tests get longer, they
represent the domain better; therefore, longer tests
should have higher reliability.
Also, if we take multiple random samples from the
population, we obtain a distribution of sample
scores that represents the population.
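The claim that longer tests should be more reliable is usually quantified with the Spearman-Brown prophecy formula, which the slides do not name. It is shown here as one standard way to make the point concrete. It assumes the added items are parallel to the existing ones (same construct, comparable quality). The starting reliability of 0.60 is invented.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by `length_factor`,
    assuming the added items are parallel to the original ones."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


original = 0.60  # hypothetical reliability of the current test
for factor in (1, 2, 3, 4):
    print(factor, round(spearman_brown(original, factor), 3))
```

The predicted reliability rises with each lengthening but with diminishing returns, since it can never exceed 1.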
▰ CTT - Reliability
X_o = X_true + X_error
X = T + E
E = X - T; T = X - E
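Under this decomposition, reliability is conventionally defined in CTT as the share of observed-score variance that is due to true scores, var(T) / var(X). That definition is standard but is not spelled out on the slide, so treat it as an added assumption. The simulation below is a sketch with invented values: true scores with a standard deviation of 10, errors with a standard deviation of 5, both normal and independent of each other.

```python
import random
from statistics import pvariance

rng = random.Random(1)
n_persons = 20_000

true_scores = [rng.gauss(50, 10) for _ in range(n_persons)]  # T
errors = [rng.gauss(0, 5) for _ in range(n_persons)]         # E, independent of T
observed = [t + e for t, e in zip(true_scores, errors)]      # X = T + E

# Share of observed-score variance attributable to true scores.
reliability = pvariance(true_scores) / pvariance(observed)
print(round(reliability, 3))  # close to 10**2 / (10**2 + 5**2) = 0.8
```

In real data T and E are never observed separately, which is why reliability has to be estimated indirectly, for example from repeated or parallel measurements.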
As a critic, Borsboom
(2006) has argued that
CTT has
grave limitations in
theory and model
building through its
misplaced emphasis on
observed scores and
true scores
rather than the latent
The first shortcoming of CTT is that the
interpretation of respondent characteristics
depends on the test used. Respondents will
appear smarter if an easier test is
administered, but will look less smart if a
more difficult test is answered. The second
shortcoming of CTT is that test characteristics
are sample-dependent. The same test
administered in a group of high-ability students
and in another group of low-ability students
will produce items with different levels of
difficulty, for example.
▰ In the first sample, item difficulty will
appear lower than in the
second group. These shortcomings imply
that test characteristics can only be interpreted in
the same context (sample). Because test
parameters depend on persons’ latent trait
and vice versa, item and test characteristics
will change when other persons (samples
with different levels of latent trait) answer
the test.
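Sample dependence can be shown by administering one simulated item to two groups. The sketch below is illustrative: responses are generated from a one-parameter logistic model purely as a convenient data generator, the group means of +1 and -1 are invented, and classical difficulty is taken as the proportion of correct answers.

```python
import math
import random


def proportion_correct(abilities, item_location, rng):
    """Classical difficulty (proportion correct) of one item in one sample."""
    n_correct = 0
    for theta in abilities:
        p = 1 / (1 + math.exp(-(theta - item_location)))  # chance of a correct answer
        if rng.random() < p:
            n_correct += 1
    return n_correct / len(abilities)


rng = random.Random(42)
high_ability = [rng.gauss(1.0, 1.0) for _ in range(5000)]
low_ability = [rng.gauss(-1.0, 1.0) for _ in range(5000)]

# The item itself is identical in both administrations.
p_high = proportion_correct(high_ability, 0.0, rng)
p_low = proportion_correct(low_ability, 0.0, rng)
print(round(p_high, 2), round(p_low, 2))
```

The same item looks easy in the high-ability group and hard in the low-ability group, even though nothing about the item has changed.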
The third shortcoming of CTT is that the theory
assumes that errors of measurement are equal for all
persons. This is problematic because persons with
different levels of ability will show different levels of
error (guessing) in a test that evaluates intelligence or
any other construct, for example. The fourth
shortcoming of CTT is that it does not allow accurate
predictions about possible results for a respondent or
for a sample on an item, using only their ability
scores. This information would be important for a test
designer interested in developing a test for a
population with specific characteristics.
ITEM RESPONSE THEORY
[Figure, two panels: probability of a correct response (vertical axis, 0.00 to 1.00) plotted against Theta (horizontal axis, -5 to 5) for five items (Item1 to Item5); the first panel is annotated with item difficulty.]
Item Response Function
Binary items
Parameters:
• Difficulty
• Discrimination (slope)
• Guessing
• Inattention
Models:
• 1 Parameter
• 2 Parameter
• 3 Parameter
• 4 Parameter
• unfolding
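The four parameters listed above combine into a single logistic item response function, and the 1-, 2-, and 3-parameter models are special cases obtained by fixing some parameters. The sketch below uses the conventional symbols a (discrimination), b (difficulty), c (guessing, the lower asymptote), and d (inattention, the upper asymptote). These letters are standard notation rather than something taken from the slides, the parameter values are invented, and the unfolding model is not covered.

```python
import math


def item_response_probability(theta, a=1.0, b=0.0, c=0.0, d=1.0):
    """Four-parameter logistic item response function.

    theta: person's latent trait level
    a: discrimination (slope), b: difficulty (location),
    c: guessing (lower asymptote), d: upper asymptote (1 - inattention rate)
    Defaults give the 1-parameter model; free a gives 2PL; free a, c gives 3PL.
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (d - c) * logistic


for theta in (-2, 0, 2):
    one_pl = item_response_probability(theta, b=0.0)
    three_pl = item_response_probability(theta, a=1.5, b=0.0, c=0.2)
    four_pl = item_response_probability(theta, a=1.5, b=0.0, c=0.2, d=0.95)
    print(theta, round(one_pl, 3), round(three_pl, 3), round(four_pl, 3))
```

With guessing, the curve never drops below c for low-ability respondents. With inattention, it never reaches 1 for high-ability respondents.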
“I experience dizziness
when I first wake up in
the morning”
(0) “never”
(1)“rarely”
(2)“some of the time”
(3)“most of the time”
(4)“almost always”
Category Response Curves for an item representing the probability of
responding in a particular category conditional on trait level
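Category response curves for a five-category item like the dizziness example can be generated from an ordered-response model. The slides do not say which model produced the curves. The sketch below assumes a graded response model, in which each threshold has its own logistic curve for responding in that category or higher, and category probabilities are differences between adjacent curves. The discrimination and threshold values are invented.

```python
import math

CATEGORIES = ["never", "rarely", "some of the time", "most of the time", "almost always"]


def category_probabilities(theta, discrimination, thresholds):
    """Graded response model: probability of each ordered category at `theta`.

    thresholds must be increasing, with one fewer threshold than categories.
    """
    # Probability of responding in category k or higher, for k = 0 .. K.
    at_or_above = [1.0]
    for b in thresholds:
        at_or_above.append(1 / (1 + math.exp(-discrimination * (theta - b))))
    at_or_above.append(0.0)
    # Probability of exactly category k is the drop between adjacent curves.
    return [at_or_above[k] - at_or_above[k + 1] for k in range(len(thresholds) + 1)]


thresholds = [-1.5, -0.5, 0.5, 1.5]  # hypothetical values
for theta in (-3.0, 0.0, 3.0):
    probs = category_probabilities(theta, 1.5, thresholds)
    most_likely = CATEGORIES[probs.index(max(probs))]
    print(theta, [round(p, 2) for p in probs], most_likely)
```

Plotting each category's probability against theta would give one curve per response option, which is what a category response curve display shows.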