OF TESTS AND TESTING

7 BASIC ASSUMPTIONS

1. Psychological States and Traits Exist.

   Traits – distinguish one person from another and are relatively enduring; they describe how people think, feel, and behave (e.g., Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism).

   States – also distinguish one person from another but are relatively less enduring and temporary (e.g., emotions, cognitive states, altered states of consciousness).

   • Construct – umbrella term for states and traits; something becomes a construct once it undergoes scientific treatment and we can describe and control the state or trait of a person.

   Overt Behavior – an observable action, or the result of an observable action; obvious behavior.

   Covert Behavior – unobservable actions, such as emotions and feelings.

2. Psychological States and Traits Can Be Quantified and Measured – once we carefully and rigorously define a construct, it can be measured.

   • Cumulative Scoring – getting the sum of all the item scores for a psychological test (see the short sketch after this section).

3. Test-Related Behavior Predicts Non-Test-Related Behavior – a test can also measure things such as motivation and emotional states aside from the main test-related behavior.

4. Tests Have Limits and Imperfections – test users must learn to accept the limits and imperfections of a psychological test; its results can still be used for further research and study.
5. Various Sources of Error Are Part of the Assessment Process.

   Error – the assumption that factors other than what a test intends to measure will influence the result of the test.

   Error Variance – unpredictable variation in test results due to uncontrolled factors that are not being measured by the study.

6. Unfair and Biased Procedures Can Be Identified and Reformed.
   • People are generally biased, including test administrators.
   • All tools can be used properly and improperly.

7. Testing and Assessment Offer Powerful Benefits to Society.

RELIABILITY

A test should produce consistent scores in measuring the same variable after being taken multiple times.

VALIDITY

The accuracy of a test: whether it measures what it intends to measure, ensuring that the results can be used to make accurate conclusions and predictions.
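As referenced under assumption 2, here is a minimal sketch of cumulative scoring (the item scores below are hypothetical):

    # Cumulative scoring: the test score is the sum of the individual item scores.
    item_scores = [1, 0, 1, 1, 0, 1]   # e.g., 1 = correct, 0 = incorrect
    total_score = sum(item_scores)
    print(total_score)                 # 4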
WHAT IS A GOOD TEST?

1. Includes clear instructions.
2. Offers economy in the time and money it takes to administer.
3. Psychometric soundness – a test should be reliable and valid.
4. Validity – a test must measure what it intends to measure.
Norms / Standards – the standard set on a standardized test that helps interpret an individual's test results by comparing them to a larger group.

Normative Sample – the group of people whose scores, after test tryouts, are then used as the norms against which test scores are compared.

Norming / Standardization – the process of creating norms by administering the test to a representative sample for the purpose of establishing norms.

User Norms – the representative average score based on the performance of a specific group of test takers who have already taken the test.

Sampling – the process of selecting a portion of a group deemed to be representative of the whole population.
SAMPLING METHODS

1. Stratified Sampling – selecting a group and getting subgroups within the population.
   • Stratified-Random Sampling – each member of the group has an equal chance of being included in the sample.
2. Cluster Sampling – selecting a group and including all of its members, commonly divided by geography.
3. Purposive Sampling – selecting a sample which we believe is representative of a certain population.
4. Incidental Sampling – selecting a sample out of convenience.

TYPES OF NORMS

• Percentile – an expression of the percentage of people in the normative sample whose scores fall below a particular raw score (see the sketch after this list).
• Age Norms – setting age standards for the test.
• Grade Norms – setting school grade/level standards for a test.
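As referenced under the percentile bullet, a small sketch (using a made-up normative sample) of how a raw score converts to a percentile rank:

    # Percentile rank: the percentage of the normative sample scoring below
    # a given raw score. The normative scores here are invented.
    norm_sample = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35]

    def percentile_rank(raw, norms):
        below = sum(1 for s in norms if s < raw)
        return 100 * below / len(norms)

    print(percentile_rank(25, norm_sample))   # 50.0: scored above half the sample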
• National Norms vs. National Anchor Norms

National Norms – the average scores on a single test from a group of people in a country, used to compare individual scores.

National Anchor Norms – the common national standard used to compare individual scores from different tests in a country.

• Subgroup Norms vs. Local Norms

Subgroup Norms – a standardized reference used to compare the test scores of people based on specific traits that were initially used as criteria for selecting subjects for the sample.

Local Norms – a standardized reference used to compare the test scores of people coming from the same geographical location.
• Fixed Reference Group Scoring – following a fixed scoring system/norms to compare an individual's test results to the standards.

• Criterion-Referenced vs. Norm-Referenced Testing and Assessment

Criterion-Referenced Testing and Assessment – derives meaning from a test by evaluating it on the basis of whether it has met a certain criterion created by experts; also referred to as content-referenced testing and assessment.

Norm-Referenced Testing and Assessment – a set of standards derived from the performance of an already established normative sample, used as a reference to see how an individual performed relative to the normative sample's performance.
RELIABILITY

Based on the consistency and precision of the results of a psychological test.

ERRORS IN MEASUREMENT

1. Error – inconsistencies in a test taker's performance due to various factors that can affect it.
   • Random Error – error beyond our control; it is uncontrollable and does not follow any pattern.
   • Systematic Error – error caused by the test itself.
2. Variance – the source of a test score's variability; how dispersed the scores are.
   • Error Variance – unwanted variation in scores due to error.
   • True Variance – the portion of the variance that comes from real differences among test takers.
3. Observed Score – an individual's actual score, with error included.
4. True Score – an individual's true ability, without error.

RELIABILITY ESTIMATES

SOURCES OF ERROR VARIANCE

1. Test Construction – when some items and content in the test are poorly worded, test takers might interpret them differently, making the test unreliable.
2. Test Administration
   • Test Environment – the environment wherein the test is administered.
   • Test Taker – the one who answers the test.
   • Test User – the one who administers the test.
3. Test Scoring and Interpretation
4. Sampling Error
RELIABILITY COEFFICIENT

A statistic that quantifies reliability, ranging from 0 (not reliable) to 1 (perfectly reliable).

HOW TO MEASURE RELIABILITY?

1. Test-Retest Reliability Estimates – administer the same test to the same group on two different occasions.

   It measures the test's coefficient of stability; it is not applicable for tests that measure unstable variables.

   The correlation between the two sets of scores can be computed with different statistics (see the sketch at the end of this item):
   • Pearson r – used for normal distributions with no outliers.
   • Spearman rho – applicable to both normal and non-normal distributions; for ordinal variables.
   • Kendall's tau – applicable to both normal and non-normal distributions; for ranked variables.

   • Parallel-Forms Reliability Estimates – one administration each of two different psychological tests that measure the same variable.
   • Alternate-Forms Reliability Estimates – administering two versions of the same psychological test that were constructed to be parallel.

   Results may contain error due to practice, fatigue, and other intervening events.

   • Coefficient of Equivalence – an estimate of the degree of relationship/correlation between the scores on the two forms.
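A hedged sketch (made-up scores) of the correlation behind a test-retest or alternate-forms estimate, using Pearson r from the standard library (Python 3.10+):

    from statistics import correlation

    time1 = [10, 12, 15, 18, 20, 22, 25, 27]   # hypothetical first administration
    time2 = [11, 12, 14, 19, 19, 23, 26, 26]   # same people, second administration

    # A coefficient near 1.0 indicates stable (reliable) scores across sittings.
    print(round(correlation(time1, time2), 3))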
2. Split-Half Reliability Estimates – divide a test into two halves to determine its internal consistency.

   • Internal Consistency – determining whether the test measures the same variable throughout, despite being divided into two parts.

3. Inter-Scorer Reliability – the degree of agreement or consistency between two scorers with regard to a particular measure.

HOW TO MEASURE INTERNAL CONSISTENCY?

1. Inter-Item Consistency – the degree of correlation among all the individual items in a scale (see the sketch after this section).
   • KR-20 – used when items are dichotomous.
   • Cronbach's Alpha – used when items are not dichotomous, such as items on a Likert scale.

2. Test of Homogeneity – ensures that all items measure a single trait or construct.
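As referenced under inter-item consistency, a minimal sketch of Cronbach's alpha on hypothetical Likert-type data; with dichotomous (0/1) items, the same formula reduces to KR-20:

    from statistics import pvariance

    items = [           # rows = test takers, columns = item scores
        [4, 5, 4, 4],
        [2, 3, 2, 3],
        [5, 5, 4, 5],
        [3, 3, 3, 2],
        [1, 2, 2, 1],
    ]

    # alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    k = len(items[0])
    item_vars = sum(pvariance([row[i] for row in items]) for i in range(k))
    total_var = pvariance([sum(row) for row in items])
    alpha = (k / (k - 1)) * (1 - item_vars / total_var)
    print(round(alpha, 2))   # close to 1: the items hang together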
PURPOSE OF RELIABILITY COEFFICIENT

Not all reliability coefficients reflect the same sources of error variance.

RESTRICTION OR INFLATION OF RANGE

1. Restriction of Range – when the range of scores in a sample is restricted, correlation and reliability coefficients decrease; conversely, inflating the range tends to increase them (see the sketch below).
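A small illustration (invented numbers) of restriction of range: the same strongly related scores correlate less once the sample is restricted to the top scorers:

    from statistics import correlation

    test = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    crit = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

    print(round(correlation(test, crit), 2))        # full range: about 0.94
    top = [(t, c) for t, c in zip(test, crit) if t >= 7]
    xs, ys = zip(*top)
    print(round(correlation(xs, ys), 2))            # restricted range: 0.60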
HOMOGENEITY VS. HETEROGENEITY OF TEST ITEMS

Homogeneity of Test Items – the items of a test are functionally uniform throughout; if the items measure one single factor, the test is homogeneous (high internal consistency).

Heterogeneity of Test Items – the items measure different types of factors, hence the test is low in internal consistency.

SPEED TESTS VS. POWER TESTS

Speed Test – a time limit is set and test takers must finish all items within that limit; item difficulty is uniformly low, so score differences are based on performance speed. The question asked: how many items did they get correct within the time limit?

Power Test – the time limit is more generous, but some items are so difficult that no test taker can obtain a perfect score. The question asked: how many items can they answer correctly, taking as long as they need?

DYNAMIC VS. STATIC CHARACTERISTICS

Dynamic Characteristics – ever-changing due to cognitive processes and other factors; unstable.

Static Characteristics – stable traits that are unchanging.
MEASUREMENT THEORIES

1. Classical Test Theory – states that the true score of a person represents his or her actual ability, without consideration of error.

   "You were born dumb, you will grow up dumb, and you will die dumb." (An informal illustration that the true score is treated as fixed.)

2. Domain Sampling Theory – instead of testing everything a person knows about a topic, we take a sample from a larger set of questions.
   • Domain – all possible questions that could measure a skill.
   • Test – a sample from the domain.

   The better the sample represents the full domain, the more accurate the measurement; a psychological test samples only a small part of a person's domain.

3. Generalizability Theory – for a test to be considered reliable, the test taker should obtain consistent results even when placed in different conditions.

4. Item Response Theory (IRT) – focuses on how the items measure the latent traits of the test taker.
   • Dichotomous Test Items – answerable by one of two options (e.g., yes or no).
   • Polytomous Test Items – questions with three or more alternative responses, where only one is scored correct.

THE STANDARD ERROR OF MEASUREMENT

It provides an estimate of the amount of error inherent in an observed score (see the sketch below).

1. Confidence Interval – a range of test scores that likely contains a person's true score.

In short, the reliability coefficient helps the test developer build an adequate measuring instrument and helps the test user select a suitable test.
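A hedged sketch of the standard error of measurement using its standard formula, SEM = SD * sqrt(1 - reliability); the numbers are invented:

    sd = 15             # hypothetical standard deviation of the test
    reliability = 0.91  # hypothetical reliability coefficient
    observed = 110      # one person's observed score

    sem = sd * (1 - reliability) ** 0.5
    # A ~95% confidence interval for the true score: observed +/- 1.96 * SEM.
    low, high = observed - 1.96 * sem, observed + 1.96 * sem
    print(round(sem, 2), (round(low, 1), round(high, 1)))   # 4.5 (101.2, 118.8)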
VALIDITY

Something is valid if it is grounded in evidence. A test is valid if it measures what it intends to measure.

VALIDATION

The process of gathering and evaluating evidence about validity.

1. Validation Studies – conducted by test developers with their own group of test takers.

2. Local Validation Studies – used when a test user plans to alter a test in terms of format, language, instructions, et cetera, in ways that make it more suitable for the cultural background of the test takers.

3. Face Validity – whether a test appears to measure what it intends to measure; judging the book by its cover.

4. Content Validity – the subsurface: whether the items of the psychological test reflect the characteristics the test wants to measure.
   • Test Blueprint – a detailed definition of the construct being studied; the foundation of a psychological test.
   • Culture and the Relativity of Content – a test may be valid for some cultures but not for others.

CRITERION-RELATED VALIDITY

Validating a psychological test by comparing it to an already well-established test. If the correlation is high, the test is valid.

1. Incremental Validity – the test uncovers new information beyond what the criterion already measures or predicts about future performance.

2. Validity Coefficient – measures the correlation between the test score and the criterion measure's score.

3. Concurrent Validity – the test and the criterion measure are taken at the same time and show the same result.
4. Predictive Validity – measures the relationship of test scores to a criterion measure obtained at a future time.
   • Base Rate – the proportion of people in the population who share the characteristic of interest.
   • Hit Rate – the group of people correctly identified as having the characteristic.
   • Miss Rate – the group of people incorrectly identified as having (or not having) the characteristic.
   • False Positive – diagnosed as having the condition despite not having it.
   • False Negative – not diagnosed despite having the condition.
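A small sketch (hypothetical screening data) showing how these terms play out when test decisions are compared with who actually has the condition:

    cases = [   # (test flagged as positive, actually has the condition)
        (True, True), (True, True), (True, False),   # 2 hits, 1 false positive
        (False, True),                                # 1 false negative (miss)
        (False, False), (False, False),               # correct rejections
    ]

    hits      = sum(1 for flag, has in cases if flag and has)
    false_pos = sum(1 for flag, has in cases if flag and not has)
    false_neg = sum(1 for flag, has in cases if not flag and has)
    base_rate = sum(1 for _, has in cases if has) / len(cases)

    print(f"base rate={base_rate:.2f}, hits={hits}, "
          f"false positives={false_pos}, false negatives={false_neg}")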
CONSTRUCT VALIDITY

Gathering evidence to prove the existence of the construct that a test intends to measure.

1. Construct – a scientific idea developed or hypothesized to explain a behavior; an unobservable trait that a test developer may invoke to describe test behavior or criterion performance.

EVIDENCE OF CONSTRUCT VALIDITY

1. Evidence of Homogeneity – how uniform a test is in measuring one single concept.

2. Evidence of Changes with Age – some variables are expected to change over time.

3. Evidence of Pretest-Posttest Changes – the difference in scores between a pretest and a posttest of a defined construct after careful manipulation.

4. Evidence from Distinct Groups – in defining a construct, it is to be understood that variables manifest differently per person or group of people.

5. Convergent Evidence – a test is compared to another test measuring the same construct in order to discover the correlation between the two (see the sketch after this list).

6. Discriminant Evidence – proof that a test does not measure things it should not, obtained by comparing its results to tests that measure unrelated concepts.

7. Factor Analysis – a mathematical procedure used to measure characteristics, personality, and traits.

8. Exploratory vs. Confirmatory Factor Analysis

Exploratory Factor Analysis – clusters all the items that relate to each other; there is no predetermined structure yet, and categorization is still in progress.

Confirmatory Factor Analysis – confirms a hypothesized structure against a set of data.
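As referenced under convergent and discriminant evidence, a hedged sketch with invented scores: a new test should correlate highly with a test of the same construct and weakly with a test of an unrelated one:

    from statistics import correlation

    new_test   = [10, 12, 14, 16, 18, 20]   # hypothetical new anxiety test
    same_trait = [11, 13, 13, 17, 17, 21]   # hypothetical established anxiety test
    unrelated  = [5, 19, 8, 14, 6, 11]      # hypothetical spelling test

    print(round(correlation(new_test, same_trait), 2))  # convergent: high (~0.96)
    print(round(correlation(new_test, unrelated), 2))   # discriminant: near zero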
VALIDITY, BIAS, AND FAIRNESS
1. Test Fairness – extent to which a test is
used in an impartial and equitable way.
2. Test Bias – a factor inherent to a test that
prevents impartial measurement.
3. Rating Error – intentional or unintentional misuse of a rating scale.
4. Severity Error – a rater's tendency to be too harsh (or, in the opposite case, too generous) in rating.
5. Central Tendency Error – when a rater's ratings all fall at the neutral midpoint of the scale.
6. Halo Effect – when a rater gives a ratee a higher rating than deserved due to subjectivity.
TEST UTILITY

UTILITY

The usefulness of some thing or process.

TEST UTILITY

The practical value of using a test to aid in decision making and improve efficiency; also referred to as the practical value of a training program or intervention.

FACTORS THAT AFFECT THE UTILITY OF A TEST

1. Psychometric Soundness – the reliability and validity of a test.

   The higher the criterion-related validity of test scores for making a particular decision, the higher the utility of the test is likely to be. However, there are exceptions, for many variables can affect a test's utility; hence, valid tests are not always useful.

2. Costs – considerations of an economic, financial, and budget-related nature must be taken into account when considering test utility.

   Costs refer to disadvantages, losses, or expenses, both economic (e.g., money, the costs of testing or of not testing with an adequate instrument) and noneconomic (e.g., human life, safety), such as when purchasing a test, a supply of test protocols, and computerized test processing.

3. Benefits – whether the benefit of testing justifies the costs of using a test, referring to profits, gains, and advantages.

UTILITY ANALYSIS

A family of techniques used for cost-benefit analysis of the usefulness and practical value of a tool of assessment; it evaluates whether the benefits outweigh the costs of using a certain psychological tool.

"Which test gives us the most bang for the buck?"
EXPECTANCY DATA

1. Expectancy Table – helps predict how likely a test taker with a given score is to perform at a certain level; it shows the chances that someone scoring in a given range will fall into criterion categories like "passing", "acceptable", or "failing".

2. Taylor-Russell Tables – help estimate how useful a test is in predicting job success. They consider three main factors:
   • Validity Coefficient – measures how well the test predicts job performance.
   • Selection Ratio – the percentage of applicants hired.
   • Base Rate – the percentage of already-hired employees who are successful without using the test, which serves as the standard reference in determining whether using the test improves the success rate among newly hired employees.

3. Naylor-Shine Tables – measure how much employee performance will increase if a test is used; they measure improvement in job performance.

4. The Brogden-Cronbach-Gleser Formula – calculates the dollar amount of a utility gain (see the sketch below).
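A hedged sketch of the Brogden-Cronbach-Gleser utility gain in one common textbook form (all figures invented): gain = N * T * r * SDy * Zm, minus the total cost of testing:

    n_hired    = 10      # number of people selected using the test
    tenure     = 2.0     # average years the hires stay on the job
    validity   = 0.40    # criterion-related validity coefficient
    sd_dollars = 8_000   # SD of job performance in dollars per year
    z_mean     = 1.0     # mean standardized test score of those hired
    cost_per_applicant, n_tested = 50, 100

    gain = (n_hired * tenure * validity * sd_dollars * z_mean
            - cost_per_applicant * n_tested)
    print(gain)   # 59000.0: estimated dollar gain from testing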
PRACTICAL CONSIDERATIONS

1. The Pool of Job Applicants – utility models assume that there are always many people applying for a job. However, some jobs require special skills or big sacrifices, so only a few people are qualified for them.

Smaller Applicant Pool – lower test utility: tests are used carefully to avoid eliminating good candidates, and employers rely more on experience and references instead of rigid tests.

Larger Applicant Pool – higher test utility: employers use strict tests to pick the best; even if a candidate fails, there are plenty more to choose from.
2. The Complexity of the Job – the same utility models are used for a variety of positions, but the complexity of the job affects how well people perform.

Simple Jobs – lower test utility: a highly detailed test might not be necessary because differences among applicants will be small; a simple screening will do.

Complex Jobs – higher test utility: the right candidate makes a big difference in performance, and a well-designed test will filter out those who lack the necessary skills.

3. The Cut Score in Use – setting a minimum score a candidate must achieve on a test to be considered qualified for the job. The cut scores in use are the following:
   • Relative Cut Score
   • Norm-Referenced Cut Score
   • Fixed Cut Score
   • Multiple Cut Score
   • Compensatory Model of Selection

METHODS OF SETTING CUT SCORES

1. The Angoff Method – relies on the judgements of experts, which are averaged to yield cut scores for the test. Problems arise if there is low agreement between experts.

2. The Known Groups Method – setting cut scores by comparing two groups: the skilled group and the group that does not possess the ability of interest.

   The cut score is set at the point that best separates the skilled group from the unskilled group (see the sketch below).
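A hedged sketch (invented scores) of a known groups cut score; as one crude approximation, the cut is placed at the midpoint between the two group means:

    skilled   = [78, 82, 85, 88, 90]   # hypothetical scores, group with the ability
    unskilled = [55, 60, 62, 66, 70]   # hypothetical scores, group without it

    def mean(xs):
        return sum(xs) / len(xs)

    cut = (mean(skilled) + mean(unskilled)) / 2
    print(cut)   # 73.6: candidates scoring above this are classified as skilled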
3. IRT-Based Method – setting cut scores based on how test takers respond to individual questions. It considers the difficulty of each question and the test taker's ability in setting a fair and accurate cut score.
IRT-BASED METHODS EXAMPLES

1. Item Mapping Method – setting cut scores through these steps:
   • Sort Questions by Difficulty – questions are organized by how hard they are.
   • Experts Review the Questions – experts trained in the subject analyze sample questions from each difficulty group.
   • Experts Decide the Cut Score – the cut score is set where minimally competent candidates start struggling with the questions.

   The questions at the cut score must be answerable correctly by a minimally competent person 50% of the time (see the sketch below).
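The 50% rule above corresponds to the difficulty parameter in IRT: in a two-parameter logistic (2PL) model, an item's difficulty b is the trait level at which the probability of a correct answer is exactly 0.5. A minimal sketch with invented parameter values:

    import math

    def p_correct(theta, a=1.5, b=0.0):
        """2PL item: a = discrimination, b = difficulty, theta = latent trait."""
        return 1 / (1 + math.exp(-a * (theta - b)))

    for theta in (-2, -1, 0, 1, 2):
        print(theta, round(p_correct(theta), 2))   # rises from ~0.05 to ~0.95
    # At theta == b (here 0.0) the probability is exactly 0.50.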
2. Bookmark Method – setting cut scores using expert judgement and question difficulty.
   • Experts learn the minimum knowledge and skills a person needs to pass the test.
   • The questions are arranged from easiest to hardest.
   • Experts then decide where to place the "bookmark".
   • This point becomes the cut score.

   The bookmark is the point at which a minimally competent test taker would stop being able to answer correctly; it separates test takers who have the minimal knowledge, skills, and abilities from those who do not.

3. The Method of Predictive Yield – setting the cut score while also considering the number of positions to be filled, the likelihood of acceptance, and applicant performance.
   • The cut score might be lowered if many positions are to be filled.
   • If only a few positions are available, the cut score might be raised to select the best candidates.
   • If scores are too low overall, the cut score might be lowered so that enough applicants qualify.
4. Discriminant Analysis – helps determine the cut score by finding patterns that separate two groups: (1) the group that is successful at the job and (2) the group that is unsuccessful at the job.

   It can also be used to determine the relationship between identified variables (e.g., which test scores best separate and distinguish the successful groups from the unsuccessful ones).