Psych Assessment

The document outlines the fundamental assumptions of psychological testing, emphasizing the measurement of psychological states and traits, the importance of reliability and validity, and the identification of biases in testing. It details various testing methodologies, types of norms, and the significance of construct validity, while also addressing the utility and limitations of psychological assessments. Additionally, it discusses the implications of errors in measurement and the necessity for fair testing practices.

OF TESTS AND TESTING

7 BASIC ASSUMPTIONS

1. Psychological States and Traits Exist.

Traits – distinguish one person from another and are relatively enduring; they describe how people think, feel, and behave (e.g., Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism).
States – also distinguish one person from another but are relatively less enduring and temporary (e.g., emotions, cognitive states, altered states of consciousness).

• Construct – an umbrella term for states and traits; something becomes a construct once it has undergone scientific treatment and we can describe and control the state or trait of a person.

Overt Behavior – an observable action, or the result of an observable action; obvious behavior.
Covert Behavior – unobservable actions, such as emotions and feelings.

2. Psychological States and Traits Can Be Quantified and Measured – once we carefully and rigorously define a construct, it can be measured.
• Cumulative Scoring – the test score is obtained by summing the scores on all items of a psychological test.

3. Test-Related Behavior Predicts Non-Test-Related Behavior – besides the main test-related behavior, a test can also tap factors such as motivation and emotional state.

4. Tests Have Limits and Imperfections – test users must learn to accept the limits and imperfections of a psychological test, since its results can still yield data useful for further research and studies.
5. Various Sources of Error Are Part of the Assessment Process.

Error – the assumption that factors other than what is intended to be measured will influence the result of the test.
Error Variance – unpredictable variation in test results due to uncontrolled factors that are not being measured by the study.

6. Unfair and Biased Testing Procedures Can Be Identified and Reformed.
• People are generally biased, including test administrators.
• All tools can be used properly and improperly.

7. Testing and Assessment Offer Powerful Benefits to Society.

RELIABILITY
A test should produce consistent scores when it measures the same variable across multiple administrations.

VALIDITY
The accuracy of a test: whether it measures what it intends to measure, which ensures that the results can be used to make accurate conclusions and predictions.

WHAT IS A GOOD TEST

1. Includes clear instructions.
2. Offers economy in the time and money it takes to administer.
3. Psychometric soundness: the test should be reliable and valid.
4. Validity: the test must measure what it intends to measure.
Norms / Normative Standards – the standard set on a standardized test that helps interpret an individual's test results by comparing them with those of a larger group.
Sample – the group of people, assembled after test tryouts, whose scores are then used as the norms for score comparison.
Norming / Standardization – the process of creating norms by administering the test to a representative sample for the purpose of establishing norms.
User Norms – the representative average score based on the performance of a specific group of test takers who have already taken the test.
Sampling – the process of selecting the portion of a group deemed to be representative of the whole population.

SAMPLING METHODS

1. Stratified Sampling – selecting a group and drawing from the subgroups (strata) within the population.
   • Stratified-Random Sampling – every member of each subgroup has an equal chance of being included in the sample.
2. Cluster Sampling – selecting groups and including all of their members, commonly divided by geography.
3. Purposive Sampling – selecting a sample that we believe is representative of a certain population.
4. Incidental Sampling – selecting a sample because of convenience.

TYPES OF NORMS

• Percentile – an expression of the percentage of people in the normative sample whose scores fall below a particular raw score.
• Age Norms – age-based standards for interpreting the test.
• Grade Norms – school grade/level standards for interpreting the test.

• National Norms vs. National Anchor Norms
National Norms – the average scores on a single test from a group of people in a country, used to compare individual scores.
National Anchor Norms – the common national standard used to compare individual scores from different tests in a country.

• Subgroup Norms vs. Local Norms
Subgroup Norms – a standardized reference used to compare the test scores of people based on specific traits that were initially used as criteria for selecting subjects for the sample.
Local Norms – a standardized reference used to compare the test scores of people coming from the same geographical location.

• Fixed Reference Group Scoring – following a fixed scoring system/norms to compare an individual's test results with the standard.

• Criterion-Referenced vs. Norm-Referenced Testing and Assessment
Criterion-Referenced Testing and Assessment – derives meaning from a test score by evaluating it against a criterion set by experts; also referred to as content-referenced testing and assessment.
Norm-Referenced Testing and Assessment – uses a set of standards derived from the performance of an already established normative sample as the reference for judging how an individual performed relative to that sample.
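As a quick illustration of the percentile norm described above, here is a minimal sketch (not from the source notes; the normative scores are hypothetical):

```python
def percentile_rank(raw_score, normative_scores):
    """Percentage of scores in the normative sample that fall below the raw score."""
    below = sum(1 for s in normative_scores if s < raw_score)
    return 100.0 * below / len(normative_scores)

# Hypothetical normative sample of 10 raw scores
norm_sample = [12, 15, 18, 20, 22, 25, 27, 30, 33, 35]
print(percentile_rank(25, norm_sample))  # 50.0 -> a raw score of 25 sits at the 50th percentile
```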
RELIABILITY

Based on the consistency and precision of the results of a psychological test.

ERRORS IN MEASUREMENT

1. Error – inconsistencies in a test taker's performance due to the various factors that can affect performance.
   • Random Error – error beyond our control; it is unpredictable and does not follow any pattern.
   • Systematic Error – the test itself is the source of the error.
2. Variance – the source of a test score's variability; how dispersed the scores are.
   • Error Variance – unwanted variation in scores due to error.
   • True Variance – the portion of the total variance that comes from real differences.
3. Observed Score – an individual's actual score, which includes error.
4. True Score – an individual's true ability, free of error.

RELIABILITY ESTIMATES

SOURCES OF ERROR VARIANCE

1. Test Construction – when some items and content in the test are poorly worded, test takers may interpret them differently, making the test unreliable.
2. Test Administration
   • Test Environment – the setting in which the test is administered.
   • Test Taker – the one who answers the test.
   • Test User – the one who administers the test.
3. Test Scoring and Interpretation
4. Sampling Error

Correlation statistics used in estimating reliability:
   • Pearson r – for normally distributed, interval-level scores with no outliers.
   • Spearman rho – applicable to both normal and non-normal distributions; used with ordinal variables.
   • Kendall's tau – applicable to both normal and non-normal distributions; used with ordinal (ranked) variables.

RELIABILITY COEFFICIENT

• Reliability Coefficient – a statistic that quantifies reliability, ranging from 0 (not reliable) to 1 (perfectly reliable).

HOW TO MEASURE RELIABILITY?

1. Test-Retest Reliability Estimates – administer the same test to the same group on two different occasions and correlate the two sets of scores. This yields the test's coefficient of stability; it is not appropriate for tests that measure unstable variables.
   • Parallel-Forms Reliability Estimates – one administration each of two different forms of the test that measure the same variable and were constructed to be equivalent.
   • Alternate-Forms Reliability Estimates – administering two versions of the same psychological test, constructed to be parallel, in a single session. Results may contain error due to practice, fatigue, and other intervening events.
   • Coefficient of Equivalence – the estimate of the degree of relationship/correlation between the scores on the two forms.
2. Split-Half Reliability – divide a test into two halves and correlate them to determine its internal consistency.
   • Internal Consistency – the degree to which the items of a test measure the same variable even when the test is divided into parts.
3. Inter-Scorer Reliability – the degree of agreement or consistency between two scorers with regard to a particular measure.

HOW TO MEASURE INTERNAL CONSISTENCY?

1. Inter-Item Consistency – the degree of correlation among all the individual items in a scale.
   • KR-20 – used when the items are dichotomous.
   • Cronbach's Alpha – used when the items are not dichotomous, e.g., Likert-scale items.
2. Test of Homogeneity – ensures that all items measure a single trait or construct.
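To make the split-half and internal-consistency estimates above concrete, here is a minimal sketch (not part of the source notes; the item scores are hypothetical) of a split-half correlation corrected with the Spearman-Brown formula, and of Cronbach's alpha. With dichotomous (0/1) items, the same alpha computation reduces to KR-20.

```python
import numpy as np

# Hypothetical data: 6 test takers x 4 items scored 0-5
scores = np.array([
    [5, 4, 5, 4],
    [3, 3, 2, 3],
    [4, 4, 5, 5],
    [1, 2, 1, 2],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
])

# Split-half reliability: correlate odd-item totals with even-item totals,
# then step the half-test correlation up to full length with Spearman-Brown.
odd_half = scores[:, 0::2].sum(axis=1)
even_half = scores[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd_half, even_half)[0, 1]
spearman_brown = 2 * r_half / (1 + r_half)

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total scores)
k = scores.shape[1]
item_variances = scores.var(axis=0, ddof=1)
total_variance = scores.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

print(f"Split-half r = {r_half:.2f}, Spearman-Brown corrected = {spearman_brown:.2f}")
print(f"Cronbach's alpha = {alpha:.2f}")
```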
RESTRICTION OR INFLATION OF RANGE

1. Restriction of Range – when the variability of the sample is restricted, correlation and reliability coefficients decrease.

PURPOSE OF THE RELIABILITY COEFFICIENT

Not all reliability coefficients reflect the same sources of error variance.

SPEED TESTS VS. POWER TESTS

Speed Test – a time limit is set and item difficulty is uniformly low; score differences reflect performance speed, i.e., how many items were answered correctly within the time limit.
Power Test – a more generous time limit, but some items are so difficult that no test taker can obtain a perfect score; the score reflects how many items were answered correctly regardless of speed.

HOMOGENEITY VS. HETEROGENEITY OF TEST ITEMS

Homogeneity of Test Items – the items of a test are functionally uniform throughout; if the items measure one single factor, the test is homogeneous (high internal consistency).
Heterogeneity of Test Items – a heterogeneous test measures several different factors and is therefore low in internal consistency.

DYNAMIC VS. STATIC CHARACTERISTICS

Dynamic Characteristics – ever-changing because of cognitive processes and other factors; relatively unstable.
Static Characteristics – stable traits that are relatively unchanging.
CRITERION-REFERENCED TEST

1. Classical Test Theory – measures the true ability of a person and how much it is influenced by error.
2. Domain Sampling Theory – instead of testing everything a person knows about a topic, we take a sample from a larger set of questions.
   • Domain – all possible questions that could measure a skill.
   • Test – a sample drawn from the domain.
   The better the sample represents the full domain, the more accurate the test is; a psychological test samples only a small portion of a person's domain.
3. Generalizability Theory – to show that a test is reliable, a test taker should obtain consistent results even when tested under different conditions.
4. Item Response Theory (IRT) – focuses on how the items measure the latent traits of the test taker.

• Dichotomous Test Items – answerable with one of two alternatives (e.g., yes or no).
• Polytomous Test Items – questions with three or more alternative responses, only one of which is scored correct.

THE STANDARD ERROR OF MEASUREMENT

Provides an estimate of the amount of error inherent in an observed score.

1. Confidence Interval – a range of test scores that likely contains the person's true score.

In short, the reliability coefficient helps the test developer build an adequate measuring instrument and helps the test user select a suitable test.
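The standard error of measurement and the confidence interval around an observed score can be computed from the test's standard deviation and reliability coefficient; a minimal sketch (the numbers are hypothetical, not from the notes):

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability coefficient)."""
    return sd * math.sqrt(1 - reliability)

def confidence_interval(observed_score, sem, z=1.96):
    """Range of scores likely to contain the true score (95% when z = 1.96)."""
    return observed_score - z * sem, observed_score + z * sem

# Hypothetical example: a scale with SD = 15 and reliability = .90
sem = standard_error_of_measurement(sd=15, reliability=0.90)
low, high = confidence_interval(observed_score=110, sem=sem)
print(f"SEM = {sem:.2f}; 95% CI for the true score: {low:.1f} to {high:.1f}")
```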

VALIDITY
Something is valid if it is grounded in evidence. A test is valid if it measures what it intends to measure.

VALIDATION

The process of gathering and evaluating evidence about validity.

1. Validation Studies – conducted by the test developers with their own group of test takers.
2. Local Validation Studies – used when the test user plans to alter a test in format, language, instructions, or other respects so that it better suits the cultural background of the test takers.
3. Face Validity – whether a test appears to measure what it intends to measure; judging the book by its cover.
4. Content Validity – looks beneath the surface at the items of the psychological test: do the items reflect the characteristics the test wants to measure?
   • Test Blueprint – a detailed definition of the construct being studied; the foundation of a psychological test.
   • Culture and the Relativity of Content – some tests may be valid for some cultures but not for others.

CRITERION-RELATED VALIDITY

Validating a psychological test by comparing it with an already well-established test or criterion; if the correlation is high, the test is considered valid.

1. Incremental Validity – the test provides new information beyond what is already known or predicted about the criterion or future performance.
2. Validity Coefficient – the correlation between the test score and the score on the criterion measure.
3. Concurrent Validity – the test and the criterion measure are taken at about the same time and show corresponding results.
4. Predictive Validity – the relationship of test scores to a criterion measure obtained at a future time.
   • Base Rate – the proportion of people in the population who have the characteristic of interest.
   • Hit Rate – the proportion of people the test correctly identifies as having (or not having) the characteristic.
   • Miss Rate – the proportion of people the test incorrectly identifies with respect to the characteristic.
   • False Positive – identified by the test as having the condition despite not having it.
   • False Negative – not identified by the test despite having the condition.
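To illustrate the decision outcomes just listed, here is a minimal sketch (the screening outcomes are hypothetical, not from the notes):

```python
# Hypothetical screening outcomes: each pair is (test says "has condition", actually has condition)
decisions = [(True, True), (True, False), (False, False), (False, True),
             (True, True), (False, False), (True, True), (False, False)]

hits = sum(1 for flagged, actual in decisions if flagged == actual)
false_positives = sum(1 for flagged, actual in decisions if flagged and not actual)
false_negatives = sum(1 for flagged, actual in decisions if not flagged and actual)

n = len(decisions)
print(f"Hit rate: {hits / n:.0%}")            # correct identifications (either way)
print(f"Miss rate: {(n - hits) / n:.0%}")     # incorrect identifications
print(f"False positives: {false_positives}, false negatives: {false_negatives}")
base_rate = sum(1 for _, actual in decisions if actual) / n
print(f"Base rate of the condition in this sample: {base_rate:.0%}")
```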
CONSTRUCT VALIDITY

Gathering evidence to establish the existence of the construct that a test intends to measure.

1. Construct – a scientific idea developed or hypothesized to explain a behavior; an unobservable trait that a test developer may invoke to describe test behavior or criterion performance.

EVIDENCE OF CONSTRUCT VALIDITY

1. Evidence of Homogeneity – how uniform a test is in measuring one single concept.
2. Evidence of Changes with Age – the measured variable is expected to change over time.
3. Evidence of Pretest-Posttest Changes – differences in scores between a pretest and a posttest of a defined construct after careful manipulation.
4. Evidence from Distinct Groups – in defining a construct, it is understood that the variable manifests differently across different people or groups of people.
5. Convergent Evidence – the test is compared with another test measuring the same construct to discover the correlation between the two.
6. Discriminant Evidence – proof that a test does not measure things it should not measure, obtained by comparing its results with tests that measure unrelated concepts.
7. Factor Analysis – a statistical procedure for identifying factors (clusters of related variables) underlying characteristics, personality, and traits.
8. Exploratory vs. Confirmatory Factor Analysis
   Exploratory Factor Analysis – clusters items that relate to one another when there is no predetermined structure; still in the process of categorization.
   Confirmatory Factor Analysis – confirms a hypothesized set of factors against the data.

VALIDITY, BIAS, AND FAIRNESS

1. Test Fairness – the extent to which a test is used in an impartial and equitable way.
2. Test Bias – a factor inherent to a test that prevents impartial measurement.
3. Rating Error – intentional or unintentional misuse of a rating scale.
4. Severity Error – the tendency to rate too harshly; its opposite, the leniency or generosity error, is the tendency to rate too generously.
5. Central Tendency Error – when a rater's ratings all fall at or near the middle (neutral point) of the scale.
6. Halo Effect – when a rater gives something a higher rating than it deserves because of a general positive impression.

TEST UTILITY

UTILITY

The usefulness of something or of a process.

TEST UTILITY

The practical value of using a test to aid decision making and improve efficiency; also refers to the practical value of a training program or intervention.

FACTORS THAT AFFECT THE UTILITY OF A TEST

1. Psychometric Soundness – the reliability and validity of a test. The higher the criterion-related validity of test scores for making a particular decision, the higher the utility of the test is likely to be. There are exceptions, however, because many other variables can affect a test's utility; hence, valid tests are not always useful tests.
2. Costs – economic, financial, and budget-related considerations must be taken into account when weighing test utility. Costs refer to disadvantages, losses, or expenses, both economic (e.g., money, the cost of testing or of not testing with an adequate instrument) and noneconomic (e.g., human life, safety), incurred in purchasing a test, a supply of test protocols, and computerized test processing.
3. Benefits – whether the benefit of testing justifies the cost of using the test, referring to profits, gains, and advantages.

UTILITY ANALYSIS

A family of techniques used for cost-benefit analysis of the usefulness and practical value of a tool of assessment; it evaluates whether the benefits outweigh the costs of using a certain psychological tool. "Which test gives us the most bang for the buck?"

EXPECTANCY DATA

1. Expectancy Table – helps predict how likely someone with a given test score is to reach a certain outcome; it shows the chances of falling in ranges such as "passing", "acceptable", or "failing".
2. Taylor-Russell Tables – help estimate how useful a test is in predicting job success. They consider three main factors:
   • Validity Coefficient – how well the test predicts job performance.
   • Selection Ratio – the percentage of applicants who are hired.
   • Base Rate – the percentage of already-hired employees who are successful without using the test; this serves as the reference for determining whether using the test improves the success rate among newly hired employees.
3. Naylor-Shine Tables – estimate how much employee performance will increase if a test is used; they measure improvement in job performance.
4. The Brogden-Cronbach-Gleser Formula – calculates the dollar amount of a utility gain.
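One common statement of the Brogden-Cronbach-Gleser utility gain is: gain = (N)(T)(r)(SDy)(mean z of those selected) minus testing costs, where N is the number of people selected, T the average tenure, r the validity coefficient, and SDy the standard deviation of job performance in dollars. A hedged sketch with made-up numbers, assuming that formulation:

```python
def bcg_utility_gain(n_selected, tenure_years, validity, sd_dollars,
                     mean_z_selected, cost_per_applicant, n_tested=None):
    """Hedged sketch of a Brogden-Cronbach-Gleser style utility gain.

    gain = N * T * r * SDy * mean_z_of_selected - testing costs
    Some presentations charge the testing cost for every applicant tested,
    others only for those selected; n_tested covers the first case.
    """
    benefit = n_selected * tenure_years * validity * sd_dollars * mean_z_selected
    cost = (n_tested if n_tested is not None else n_selected) * cost_per_applicant
    return benefit - cost

# Hypothetical numbers: 10 hires, 2-year tenure, validity .40,
# SD of performance worth $10,000, selected applicants average z = 1.0,
# $50 per applicant, 100 applicants tested.
print(f"${bcg_utility_gain(10, 2, 0.40, 10_000, 1.0, 50, n_tested=100):,.0f}")
```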
PRACTICAL CONSIDERATIONS

1. The Pool of Job Applicants – utility models assume that there are always many people who can apply for a job. However, some jobs require special skills or big sacrifices, so only a few people are qualified for them.
   Smaller Applicant Pool – lower test utility: tests are used carefully to avoid eliminating good candidates, and employers rely more on experience and references than on rigid tests.
   Larger Applicant Pool – higher test utility: employers can use strict tests to pick the best; even if a candidate fails, there are plenty of others to choose from.
2. The Complexity of the Job – the same utility models are used for a variety of positions, but the difficulty of the job affects how much performance differs across people.
   Simple Jobs – lower test utility: a highly detailed test might not be necessary because the differences between applicants will be small; a simple screening will do.
   Complex Jobs – higher test utility: the right candidate makes a big difference in performance, and a well-designed test will filter out those who lack the necessary skills.
3. The Cut Score in Use – a cut score is the minimum score a candidate must achieve on a test to be considered qualified for the job. The cut scores in use are the following:
   • Relative Cut Score
   • Norm-Referenced Cut Score
   • Fixed Cut Score
   • Multiple Cut Scores
   • Compensatory Model of Selection

METHODS OF SETTING CUT SCORES

1. The Angoff Method – relies on the judgments of experts, which are averaged to yield cut scores for the test. Problems arise if there is low agreement among the experts.
2. The Known Groups Method – sets cut scores by comparing two groups: the skilled group and the group that does not possess the ability of interest. The cut score is set at the point that best separates the skilled group from the unskilled group.
3. IRT-Based Methods – set cut scores based on how test takers respond to individual questions, taking into account the difficulty of each question and the test taker's ability so that the cut score is fair and accurate.

IRT-BASED METHODS EXAMPLES

1. Item Mapping Method – setting cut scores through these steps:
   • Sort Questions by Difficulty – questions are organized by how hard they are, using displays such as histograms.
   • Experts Review the Questions – experts trained in the subject analyze sample questions from each difficulty group.
   • Experts Decide the Cut Score – the cut score is set where minimally competent candidates start struggling with the questions; a question at that point should be answered correctly by a minimally competent person about 50% of the time.
2. Bookmark Method – setting cut scores using expert judgment and question difficulty:
   • Experts learn the minimum knowledge and skills a person needs to pass the test.
   • The questions are arranged from easiest to hardest.
   • The experts decide where to place the "bookmark"; this point becomes the cut score.
   The bookmark is the point at which a minimally competent test taker would stop being able to answer correctly; it separates test takers who have the minimal knowledge, skills, and abilities from those who do not.
3. The Method of Predictive Yield – sets the cut score while also considering the number of positions to be filled, the likelihood of offer acceptance, and applicant performance.
   • The cut score might be lowered if many positions are to be filled.
   • If only a few positions are available, the cut score might be raised to select the best candidates.
   • If scores are too low overall, the cut score might be lowered so that enough applicants qualify.
4. Discriminant Analysis – helps determine the cut score by finding patterns that separate two groups: (1) the group that is successful at the job and (2) the group that is unsuccessful at the job. It can also be used to determine the relationship between identified variables (e.g., which test scores best separate and distinguish successful groups from unsuccessful ones).
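A toy sketch of two of the approaches above: averaging expert judgments as in the Angoff method, and picking the point that best separates two known groups (all numbers are hypothetical, and the known-groups rule here is a deliberately simple stand-in for the full procedure):

```python
# Angoff-style: each expert estimates the probability that a minimally competent
# test taker answers each item correctly; summing the item averages gives a cut score.
expert_ratings = [          # rows = experts, columns = items (probabilities)
    [0.9, 0.6, 0.4, 0.7],
    [0.8, 0.5, 0.5, 0.6],
    [0.9, 0.7, 0.3, 0.8],
]
n_items = len(expert_ratings[0])
item_means = [sum(expert[i] for expert in expert_ratings) / len(expert_ratings)
              for i in range(n_items)]
angoff_cut = sum(item_means)       # expected raw score of a minimally competent taker
print(f"Angoff cut score = {angoff_cut:.1f} out of {n_items}")

# Known-groups style: choose the score that best separates skilled from unskilled takers.
skilled = [14, 15, 17, 18, 19]
unskilled = [8, 9, 11, 12, 13]
candidates = range(min(unskilled), max(skilled) + 1)
best_cut = max(candidates,
               key=lambda c: sum(s >= c for s in skilled) + sum(u < c for u in unskilled))
print(f"Known-groups cut score = {best_cut}")
```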
TEST DEVELOPMENT

It is the product of the thoughtful and sound application of established principles of test development. Tests are not just papers with carefully worded questions; a test is a scientific product involving human and economic resources, which is why psychological tools are expensive.

TEST DEVELOPMENT PROCESS

1. Test Conceptualization – includes answering the following questions:
   • What is the test designed to measure?
   • What is the test's objective?
   • Is there a need for the test?
   • Who will take this test?
   • What content does the test cover?
   • How will the test be administered?
   • What is the ideal format of the test?
   • Should more than one form of the test be developed?
   • What special training does the test user need in order to administer or interpret the test?
   • What types of responses will be required from the test takers?
   • Who benefits from the administration of this test?
   • How will meaning be attributed to scores on this test?

   1.1. Pilot Work – the preliminary research surrounding the creation of a prototype of the test; test items are pilot studied to determine whether they should be included in the instrument's final form. A test developer uses pilot work to determine how best to measure a targeted construct. The process may include literature reviews and experimentation, as well as the creation, revision, and deletion of preliminary items.

2. Test Construction
   • Scaling – the process of setting the rules for assigning numbers in measurement; the process by which a measuring device is calibrated so that numbers (scale values) are assigned to different amounts of the trait, attribute, or characteristic being measured.
   • Types of Scales
     o Nominal, Ordinal, Interval, or Ratio (NOIR)
     o Age-Based Scale – used when the test taker's performance as a function of age is of critical interest.
     o Grade-Based Scale – used when the test taker's performance as a function of grade is of critical interest.

SCALING METHODS

1. Rating Scale
   • Likert Scale – presents five alternative responses (sometimes seven), usually on an agree-disagree or approve-disapprove continuum.
   • Paired Comparisons – pairs of stimuli are presented to test takers, who are asked to compare them and select one stimulus according to a rule.

2. Writing Items
   • Item Pool – the reservoir from which items will or will not be drawn for the final version of the test. When constructing multiple-choice items, it is advised that the first draft contain about double the number of items that the final version of the test will contain.
   Writing items means answering three questions:
   1. What range of content should the items cover?
   2. Which of the many different item formats should be employed?
   3. How many items should be written in total and for each content area covered?
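A small sketch of how paired-comparison judgments like those described above can be turned into scale values by counting how often each stimulus is preferred (the judgments are hypothetical, and a simple proportion is used here rather than a formal scaling model):

```python
from collections import Counter

# Hypothetical paired-comparison judgments: each tuple is (pair shown, stimulus chosen)
judgments = [
    (("A", "B"), "A"), (("A", "C"), "C"), (("B", "C"), "C"),
    (("A", "B"), "A"), (("A", "C"), "A"), (("B", "C"), "C"),
]

wins = Counter(choice for _, choice in judgments)
appearances = Counter(stim for pair, _ in judgments for stim in pair)

# Scale value = proportion of times a stimulus was chosen when it was presented
for stim in sorted(appearances):
    print(f"{stim}: chosen {wins[stim]}/{appearances[stim]} times "
          f"-> scale value {wins[stim] / appearances[stim]:.2f}")
```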
ITEM FORMAT

Variables such as the form, plan, structure, arrangement, and layout of individual test items.

1. Selected-Response Format – requires selecting an answer from a set of alternative responses.
   • Multiple-Choice Format – has three elements: (1) a stem, (2) a correct alternative or option, and (3) several incorrect alternatives referred to as distractors or foils.
   • Matching Item – presented in two columns; the test taker determines which response is best associated with each premise.
   • True-False (Binary-Choice) Item – a statement that the test taker is asked to judge as true or false.
2. Constructed-Response Format – requires the test taker to create the correct answer rather than merely select it.
   • Completion Item – requires the test taker to provide a word or phrase that completes a sentence.
   • Short-Answer Item
   • Essay Item

3. Writing Items for Computer Administration – computer programs are designed to facilitate the construction of tests as well as their administration, scoring, and interpretation.
   • Computerized Adaptive Testing (CAT) – an interactive, computer-administered test in which the items presented are based on the test taker's performance on previous items. If you answer questions correctly, the next ones might be harder.
     o Item Bank – a relatively large and easily accessible collection of test questions.
     o Item Branching – the ability of the computer to tailor the content and order of presentation of test items on the basis of responses to previous items.
     o Floor Effect – the diminished utility of an assessment tool for distinguishing test takers at the low end of the ability, trait, or other attribute being measured.
     o Ceiling Effect – the diminished utility of an assessment tool for distinguishing test takers at the high end of the ability, trait, or other attribute being measured.
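A toy illustration of the item-branching idea behind CAT (a naive sketch with made-up items and a simplistic update rule, not an actual IRT-based CAT algorithm):

```python
# Hypothetical item bank: (item text, difficulty on a rough -2..+2 scale)
item_bank = [("very easy item", -2.0), ("easy item", -1.0),
             ("medium item", 0.0), ("hard item", 1.0), ("very hard item", 2.0)]

def next_item(ability, remaining):
    """Branch to the unused item whose difficulty is closest to the current ability estimate."""
    return min(remaining, key=lambda item: abs(item[1] - ability))

ability = 0.0                      # start at an average ability estimate
remaining = list(item_bank)
for simulated_answer in [True, True, False]:   # pretend responses: right, right, wrong
    item = next_item(ability, remaining)
    remaining.remove(item)
    ability += 0.5 if simulated_answer else -0.5   # crude step up or down
    print(f"administered {item[0]!r}; new ability estimate = {ability:+.1f}")
```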
SCORING ITEMS

1. Cumulative Model – each time a test taker answers a question in a certain way, they earn points; these points add up and show how well they understand a certain skill or possess a certain trait. The higher the test score, the more of whatever the test measures the person is presumed to have.
2. Class Scoring (Category Scoring) – answers on the test help decide which group or category the test taker belongs in; the test taker is placed in the group whose members responded similarly.
3. Ipsative Scoring – comparing a test taker's score on one part of the test with their score on another part of the same test; it compares strengths and weaknesses within the person, not against other people.

TEST TRYOUTS

After creating the pool of items from which the final version of the test will be developed, the test developer tries out the test to gather evidence on which items are good and which are bad.

1. Item-Difficulty Index – a p value calculated as the proportion of the total number of test takers who answered the item correctly. It ranges from 0 (if no one got the item right) to 1 (if everyone got it right); the larger the item-difficulty index, the easier the item.
   1.1. Item-Endorsement Index – provides a measure of the percentage of people who said yes or no to, or agreed or disagreed with, an item.
2. Item-Reliability Index – indicates the internal consistency of a test; the higher the index, the greater the test's internal consistency. Related tools include factor analysis and inter-item consistency.
3. Item-Validity Index – a statistic indicating whether a test item measures what it intends to measure; the higher the item-validity index, the greater the test's criterion-related validity.
   3.1. Item-Score Standard Deviation – used, together with the correlation between the item score and the criterion score, in computing the item-validity index.
4. Item-Discrimination Index – indicates how well an item separates or discriminates between high scorers and low scorers on the test.
5. Analysis of Item Alternatives – the quality of the alternative choices within a multiple-choice item can be assessed by comparing the performance of upper and lower scorers on each alternative.
6. Other Considerations in Item Analysis
   • Guessing
   • Item Fairness
   • Speed Tests
7. Qualitative Item Analysis – relies primarily on verbal rather than statistical procedures.
   • "Think Aloud" Test Administration – a one-on-one administration in which the test taker is asked to think (or read) aloud while answering, so the clinician can analyze how and why an item is misinterpreted.
   • Expert Panels – a sensitivity review in which a panel analyzes which items might stereotype or offend test takers.

TEST REVISION

Involves rewording, deleting, and creating new items; the term applies both to a stage in developing a new test and to revising an already existing test.

1. Test Revision as a Stage in New Test Development – removing or revising items within a test after the first tryouts.
2. Test Revision in the Life Cycle of an Existing Test.

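A minimal sketch of the item-difficulty (p) and item-discrimination (D) statistics described under Test Tryouts above (the response matrix is hypothetical, and the discrimination index here contrasts the top and bottom halves of scorers):

```python
# Hypothetical 0/1 responses of 6 test takers (rows) to 3 items (columns)
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
]

def item_difficulty(item):
    """Proportion of test takers answering the item correctly (0 = no one, 1 = everyone)."""
    col = [row[item] for row in responses]
    return sum(col) / len(col)

def item_discrimination(item):
    """D = proportion correct in the top half of total scorers minus the bottom half."""
    ranked = sorted(responses, key=sum, reverse=True)
    half = len(ranked) // 2
    upper = sum(row[item] for row in ranked[:half]) / half
    lower = sum(row[item] for row in ranked[-half:]) / half
    return upper - lower

for i in range(3):
    print(f"item {i}: p = {item_difficulty(i):.2f}, D = {item_discrimination(i):+.2f}")
```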
CROSS-VALIDATION

Revalidation of a test on a second sample of test takers; the same psychological test is administered to a different sample.

1. Validity Shrinkage – the decrease in item validity that typically occurs after cross-validation.

CO-VALIDATION

Validation of two or more tests on the same sample of test takers; when the process also involves creating norms, it is referred to as co-norming.

USE OF IRT IN BUILDING AND REVISING TESTS

Evaluating existing tests for the purpose of mapping out test revisions.
