Psychological Assessment
Seniors’ In-Service Training
Worksheet #6
Name: Bula, Tabac, Trocio
Date: Aug. 7, 2020
KEY TERM EXERCISE:
Key Term Definition
Reliability Coefficient It is an index of reliability, a proportion that
indicates the ratio between the true score
variance on a test and the total variance.
Variance A statistic useful in describing sources of test
score variability. It is the standard deviation
squared.
True Variance Variance from true differences.
Error Variance Variance from irrelevant and random sources.
Reliability It refers to the proportion of the total variance
attributed to true variance.
Measurement Error Refers to collectively all of the factors associated
with the process of measuring some variable
other than the variable being measured.
Random Error Source of error in measuring a targeted variable
caused by unpredictable fluctuations and
inconsistencies of other variables in the
measurement process
Systematic Error Refers to a source of error in measuring a variable
that is typically constant or proportionate to what
is presumed to be the true value of the variable
being measured
Item or Content Sampling Terms that refer to variation among items within a
test as well as to variation among items between
tests
Test-retest Reliability Estimate of reliability obtained by correlating pairs
of scores from the same people on two different
administrations of the same test
Coefficient of Stability The estimate of test-retest reliability obtained
when the interval between testings is greater than
six months, with the subjects being measured and
the measuring instrument remaining precisely the
same.
Coefficient of Equivalence The degree of the relationship between various
forms of a test can be evaluated by means of an
alternate-forms or parallel-forms coefficient of
reliability.
Parallel Forms It exists when, for each form of the test, the
means and the variances of observed test scores
are equal.
Parallel Forms Reliability It refers to an estimate of the extent to which item
sampling and other errors have affected test
scores on versions of the same test when, for
each form of the test, the means and variances of
observed test scores are equal.
Alternate Forms These are simply different versions of a test that
have been constructed so as to be parallel.
Alternate Forms Reliability Refers to an estimate of the extent to which these
different forms of the same test have been
affected by an item sampling error, or other error.
Split-half Reliability Obtained by correlating two pairs of scores from
equivalent halves of a single test administered
once
Odd-even Reliability Splitting a test by assigning odd-numbered items
to one half of the test and even-numbered items
to the other half
Spearman-Brown Formula Allows a test developer or user to estimate
internal consistency reliability from a correlation of
two halves of a test
Inter-item consistency Refers to the degree of correlation among all the
items on a scale
Test Homogeneity It is when the test contains items that measure a
single trait.
Test Heterogeneity It is when the test is composed of items that
measure more than one trait.
Kuder-Richardson Formula 20/KR-20 It is used for determining the inter-item
consistency of dichotomous items, primarily those
items that can be scored right or wrong (such as
multiple-choice items). If test items are more
heterogeneous, KR-20 will yield lower reliability
estimates than the split-half method.
Coefficient Alpha It may be thought of as the mean of all possible
split-half correlations, corrected by the
Spearman–Brown formula.
Average Proportional Distance Method It is the measure used to evaluate the internal
consistency of a test that focuses on the degree of
difference that exists between item scores.
Inter-scorer Reliability Degree of agreement or consistency between two
or more scorers with regard to a particular
measure
Coefficient of inter-scorer reliability The correlation coefficient used when determining
the degree of consistency among scorers in the
scoring of a test
Dynamic Characteristic Trait, state, or ability presumed to change as a
function of situational or cognitive experiences
Static Characteristic Trait, state, or ability that is relatively unchanging
Power test A test whose time limit is long enough to allow
testtakers to attempt all items, but with some
items so difficult that no testtaker is able to obtain
a perfect score
Speed Test It contains items of uniform difficulty so that,
when given generous time limits, all testtakers
should be able to complete all the test items
correctly
Criterion-Referenced Tests It is designed to provide an indication of where a
testtaker stands with respect to some variable or
criterion, such as an educational or a vocational
objective.
Classical Test Theory Also referred to as the true score (or classical)
model of measurement. Its notion that everyone
has a “true score” on a test has had, and
continues to have, great intuitive appeal.
True Score It genuinely reflects an individual’s ability (or trait)
level as measured by a particular test
Domain Sampling Theory It seeks to estimate the extent to which specific
sources of variation under defined conditions are
contributing to the test score.
Generalizability Theory Based on the idea that a person’s test scores vary
from testing to testing because of variables in the
testing situation
Universe Details of a particular test situation
Facets Includes things like number of items in the test,
amount of training of test scorers, purpose of test
administration
Universe Score Analogous to a true score in the true score model
Item Response Theory Procedures of this theory provide a way to model
the probability that a person with x ability will be
able to perform at a level of Y.
Latent-Trait Theory It refers to a family of theories and methods—and
quite a large family at that—with many other
names used to distinguish specific approaches. It
is also a general psychometric theory contending
that observed traits, such as intelligence, are
reflections of more basic unobservable traits
Discrimination It signifies the degree to which an item
differentiates among people with higher or lower
levels of the trait, ability, or whatever it is that is
being measured.
Dichotomous Test Items Test items or questions that can be answered with
only one of two alternative responses, such as
true–false, yes–no, or correct–incorrect questions.
Polytomous Test Items Test items or questions with three or more
alternative responses, where only one is scored
correct or scored as being consistent with a
targeted trait or other construct
Rasch Model It is a reference to an IRT model with very specific
assumptions about the underlying distribution.
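Several of the key terms above (true variance, error variance, reliability) can be illustrated with a short simulation. This is a hedged sketch with invented numbers: the true-score standard deviation (15) and error standard deviation (5) are arbitrary choices, not values from the worksheet.

```python
# Illustrative sketch of the classical true score model: observed score =
# true score + random error, and reliability is the ratio of true score
# variance to total observed variance.
import random
from statistics import pvariance

random.seed(0)

# Hypothetical data: 1,000 testtakers, each with a stable true score
# plus random measurement error on the day of testing.
true_scores = [random.gauss(100, 15) for _ in range(1000)]
observed = [t + random.gauss(0, 5) for t in true_scores]

true_var = pvariance(true_scores)
error_var = pvariance([o - t for o, t in zip(observed, true_scores)])
total_var = pvariance(observed)

reliability = true_var / total_var
print(f"true variance:  {true_var:.1f}")
print(f"error variance: {error_var:.1f}")
print(f"reliability:    {reliability:.2f}")  # close to 15**2/(15**2 + 5**2) = 0.90
```

Shrinking the error standard deviation toward 0 pushes the reliability estimate toward 1, which is the "proportion of total variance attributed to true variance" idea in the table above.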
DISTINGUISHING BETWEEN RANDOM AND SYSTEMATIC ERRORS
Fill out the table below with your own examples of Random and Systematic Errors.
Random Errors
- drowsiness of a testtaker
- a sudden blackout occurring within the vicinity of the test venue
- hunger of the testtaker
- a fire suddenly breaking out in the test venue
- a tornado passing by within the vicinity of the test venue

Systematic Errors
- using the same metal ruler in different climates (hot and cold)
- using a weighing scale that adds 11 kg every time you weigh yourself
- a faulty thermometer that adds 2°C to every temperature reading
- the frequent occurrence of brownouts because electrical currents are consistently low
- a cloth measuring tape that has been overused until it stretches an extra 5 cm every year
Fill out the table with characteristics of Parallel and Alternate Forms to reflect how they are similar
and how they are different in terms of definitions, descriptions, characteristics.
Parallel and Alternate Forms
- the degree of the relationship between various forms of a test can be evaluated by an alternate-forms or parallel-forms coefficient of reliability, often termed the coefficient of equivalence

Similarities
- two test administrations with the same group are required
- test scores may be affected by factors such as motivation, fatigue, or intervening events such as practice, learning, or therapy (although not as much as when the same test is administered twice)
- certain traits are presumed to be relatively stable in people over time, and we would expect tests measuring those traits, whether alternate forms, parallel forms, or otherwise, to reflect that stability
- an additional source of error variance, item sampling, is inherent in the computation of an alternate- or parallel-forms reliability coefficient
- advantageous to the test user in that this approach minimizes the effect of memory for the content of a previously administered form of the test

Differences
- parallel forms of a test exist when, for each form of the test, the means and the variances of observed test scores are equal
- scores obtained on parallel forms correlate equally with the true score and correlate equally with other measures
- parallel forms reliability refers to an estimate of the extent to which item sampling and other errors have affected test scores on versions of the same test when, for each form of the test, the means and variances of observed test scores are equal
- alternate forms are simply different versions of a test designed to be equivalent with respect to variables such as content and level of difficulty
- alternate forms reliability refers to an estimate of the extent to which these different forms of the same test have been affected by item sampling error, or other error
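As a sketch of how a coefficient of equivalence might be computed in practice, the snippet below correlates the scores one group earned on two forms of a test using a hand-rolled Pearson r. The score lists are invented for illustration.

```python
# Sketch of a coefficient of equivalence: correlate Form A and Form B
# scores for the same group of testtakers. Scores are made-up data.
from statistics import mean, pstdev

form_a = [12, 15, 11, 18, 14, 16, 10, 17]
form_b = [13, 14, 12, 17, 15, 17, 11, 16]

def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

r_equivalence = pearson_r(form_a, form_b)
print(f"coefficient of equivalence: {r_equivalence:.2f}")
```

A value near 1 here would be consistent with the two forms behaving as equivalent versions of the same test; item-sampling differences between the forms show up as a lower coefficient.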
TESTING YOUR UNDERSTANDING
Indicate whether the statement is True (T) or False (F)
1. The greater the proportion of the total variance attributed to true variance, the more reliable
the test. T
2. True differences are not presumed to yield consistent scores on repeated administrations of
the same test. F
3. Error variance can affect the reliability of a test. T
4. Systematic Errors affect score consistency. F
5. A challenge faced by a test developer is minimizing the proportion of the total variance that is
true variance. F
6. Test-Retest reliability is suitable for a test that measures a construct that is relatively stable
over time. T
7. Reliability always increases as test length increases. T
8. A measure of inter-item consistency is calculated from multiple administrations of a single
form of a test. F
9. The more homogeneous a test is, the more inter-item consistency it can be expected to have.
T
10. Where items are highly homogeneous, KR-20 and split-half reliability estimates will be
similar. T
SOURCES OF ERROR VARIANCE
Fill out the table by giving two examples for each of the possible sources of error variance that
are different from the ones provided in the book.
Test Construction
- variation arising when evaluating the homogeneity of a measure (i.e., whether all items are tapping a single construct)
- if test questions are difficult, confusing, or ambiguous, reliability is negatively affected: some people read a question to mean one thing, whereas others read the same question to mean something else

Test Administration
- variation arising when assessing the stability of various personality traits
- instructions that interfere with accurately gathering information (such as a time limit when the construct the test measures has nothing to do with speed) reduce the reliability of a test

Test Scoring
- results that are very far from the true score of the testtaker
- error due to variation in the setting of the workpiece and the instrument

Test Interpretation
- making unsupported conclusions in terms of test results
- coding of behavior
LEARNING MORE ABOUT THE SPEARMAN BROWN FORMULA
In the table below, list what you have learned about the Spearman Brown formula and
its usefulness in testing.
- enables a test developer or user to estimate internal consistency reliability from the correlation of two halves of a test
- when one wants to shorten a test, the formula can be utilized to estimate the effect of the shortening on the test’s reliability
- helps determine how many items are needed in order for the test to attain a desired level of reliability
- the formula is also used to estimate whether newly added items will increase the reliability of the test
- used to gauge how homogeneous the items in a test are
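The uses listed above all rest on one formula. A minimal sketch of the Spearman-Brown prophecy formula, r_new = n * r / (1 + (n - 1) * r), where r is the existing reliability and n is the factor by which the test length changes:

```python
# Spearman-Brown prophecy formula: predict reliability after changing
# test length by a factor of n (n = 2 doubles the test, n = 0.5 halves it).
def spearman_brown(r: float, n: float) -> float:
    """Estimated reliability of a test whose length is multiplied by n."""
    return (n * r) / (1 + (n - 1) * r)

# Correcting a split-half correlation: each half is only half a test,
# so the full-length estimate uses n = 2.
half_test_r = 0.70
print(round(spearman_brown(half_test_r, 2), 2))   # ≈ 0.82

# Estimating the effect of halving a test whose reliability is .90:
print(round(spearman_brown(0.90, 0.5), 2))        # ≈ 0.82
```

The two printed values also illustrate the shortening use case from the list: cutting a .90-reliability test in half costs noticeable reliability, which a developer can weigh against the savings in testing time.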
THE COEFFICIENT ALPHA, THE AVERAGE PROPORTIONAL DISTANCE, AND
THE KR-20
In the table below, list the definitions, similarities, and differences of the three methods
of estimating internal consistency
Coefficient Alpha
- may be thought of as the mean of all possible split-half correlations, corrected by the Spearman–Brown formula
- in contrast to KR-20, which is appropriately used only on tests with dichotomous items, coefficient alpha is appropriate for use on tests containing non-dichotomous items
- the preferred statistic for obtaining an estimate of internal consistency reliability
- widely used as a measure of reliability in part because it requires only one administration of the test
- ranges in value from 0 to 1 (0 = absolutely no similarity; 1 = perfectly identical); calculated to help answer questions about how similar sets of data are
- in contrast to coefficient alpha, a Pearson r may be thought of as dealing conceptually with both dissimilarity and similarity

Average Proportional Distance
- a measure used to evaluate the internal consistency of a test that focuses on the degree of difference that exists between item scores
- Step 1: calculate the absolute difference between scores for all of the items
- Step 2: average the differences between scores
- Step 3: obtain the APD by dividing the average difference between scores by the number of response options on the test, minus 1
- general rule of thumb for interpreting the APD: a value of .20 or lower is indicative of excellent internal consistency; a value between .20 and .25 is in the acceptable range; a value above .25 is suggestive of problems with the internal consistency of the test
- one potential advantage of the APD method over Cronbach’s alpha is that the APD index is not connected to the number of items on a measure; Cronbach’s alpha will be higher when a measure has more than 25 items

Kuder-Richardson 20
- used for determining the inter-item consistency of dichotomous items, primarily those items that can be scored right or wrong (such as multiple-choice items)
- where test items are highly homogeneous, KR-20 and split-half reliability estimates will be similar
- the most widely used adaptation of the KR-20 is a statistic called coefficient alpha (coefficient α-20)
- used for items that have varying difficulty (some items might be very easy, others more challenging); it should only be used if there is a correct answer for each question, not for questions where partial credit is possible or for scales like the Likert scale
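A hedged sketch of how the three estimates above could be computed from a person-by-item score matrix. The data matrices are invented, and the APD implementation follows the three steps listed in the table (pairwise absolute differences, averaged, then scaled by response options minus 1).

```python
# Three internal-consistency estimates from a person x item score matrix:
# coefficient alpha (any items), KR-20 (0/1 items), and the APD method.
from itertools import combinations
from statistics import pvariance

def coefficient_alpha(scores):
    """alpha = k/(k-1) * (1 - sum of item variances / total score variance)."""
    k = len(scores[0])
    items = list(zip(*scores))                 # one column per item
    totals = [sum(row) for row in scores]
    item_var = sum(pvariance(col) for col in items)
    return k / (k - 1) * (1 - item_var / pvariance(totals))

def kr20(scores):
    """KR-20 for right/wrong (0/1) items: replaces item variances with p*q."""
    k = len(scores[0])
    items = list(zip(*scores))
    totals = [sum(row) for row in scores]
    pq = sum((sum(col) / len(col)) * (1 - sum(col) / len(col)) for col in items)
    return k / (k - 1) * (1 - pq / pvariance(totals))

def apd(scores, n_options):
    """Average proportional distance: mean |difference| over all item pairs,
    divided by (number of response options - 1)."""
    diffs = [abs(row[i] - row[j])
             for row in scores for i, j in combinations(range(len(row)), 2)]
    return (sum(diffs) / len(diffs)) / (n_options - 1)

# Invented data: 5 testtakers; 4 right/wrong items, then 3 Likert-type items.
dichotomous = [[1, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1]]
likert = [[4, 5, 4], [2, 2, 3], [5, 5, 4], [1, 2, 1], [3, 3, 4]]

print(f"KR-20: {kr20(dichotomous):.2f}")
print(f"alpha: {coefficient_alpha(likert):.2f}")
print(f"APD:   {apd(likert, 5):.2f}")   # under .20 = excellent by the rule of thumb
```

Note how the division of labor matches the table: KR-20 gets the dichotomous matrix, alpha and APD get the multipoint (Likert) matrix, and the APD rule of thumb runs in the opposite direction from alpha (lower is better).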
SUMMARIZING WHAT YOU HAVE LEARNED ABOUT THE COEFFICIENTS OF
RELIABILITY
Fill out the missing information in the table below.
Test-Retest
- Purpose: to gauge how stable a measure is
- Typical uses: when assessing the stability of various personality traits
- Number of testing sessions: 2
- Sources of error variance: administration
- Statistical procedures: Pearson r or Spearman rho

Alternate-Forms
- Purpose: to assess the relationship between the different forms of a measure
- Typical uses: when there are alternate forms for a certain test
- Number of testing sessions: 1 or 2
- Sources of error variance: test construction or administration
- Statistical procedures: Pearson r or Spearman rho

Internal Consistency
- Purpose: to gauge the extent to which the items of a measure are consistent with one another
- Typical uses: when evaluating the homogeneity of a measure (i.e., whether all items are tapping a single construct)
- Number of testing sessions: 1
- Sources of error variance: test construction
- Statistical procedures: Pearson r between test halves with the Spearman-Brown correction, Kuder-Richardson formulas (dichotomous items), or coefficient alpha (multipoint items)

Inter-scorer
- Purpose: to evaluate the level of agreement between raters on a measure
- Typical uses: when behavior is being coded, to observe how different raters record a certain behavior pattern
- Number of testing sessions: 1
- Sources of error variance: scoring and interpretation
- Statistical procedures: Cohen’s kappa, Pearson r, or Spearman rho
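The Cohen's kappa statistic named in the inter-scorer row can be sketched as follows: two raters assign one of several categories to each observation, and kappa corrects their raw agreement for the agreement expected by chance. The ratings below are invented.

```python
# Cohen's kappa for two raters coding the same observations.
from collections import Counter

def cohens_kappa(rater1, rater2):
    """kappa = (p_observed - p_expected) / (1 - p_expected)."""
    n = len(rater1)
    p_obs = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    # Chance agreement: product of each rater's marginal category proportions.
    p_exp = sum(c1[cat] * c2[cat] for cat in c1) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)

rater1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "no"]
rater2 = ["yes", "no", "no", "yes", "no", "yes", "yes", "no"]
print(f"kappa: {cohens_kappa(rater1, rater2):.2f}")   # 0.50
```

Here the raters agree on 6 of 8 observations (75%), but because chance alone would produce 50% agreement with these marginals, kappa credits them with only .50.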
NATURE OF TESTS
Different kinds of reliability coefficients are used depending on the nature of the tests. In
each characteristic, please indicate the type of reliability coefficient you would use or
what we would expect to see in terms of reliability.
Characteristic — Reliability Coefficient
Homogeneous: expect a high degree of internal consistency
Heterogeneous: expect a low degree of internal consistency
Dynamic Trait: internal consistency
Static Trait: test-retest or alternate forms
Restricted Range: the correlation coefficient tends to be lower
Inflated Range: the correlation coefficient tends to be higher
Criterion Referenced: traditional ways of estimating reliability are not
always appropriate for criterion-referenced tests and may vary with the
variability of the test scores; the critical issue for the user of a
mastery test is whether or not a certain criterion score has been achieved
Norm Referenced: traditional ways of estimating reliability are
appropriate (e.g., test-retest, equivalent forms, etc.)
Power Tests: conventional procedures for estimating reliability (which
draw on statistics such as means, standard deviations, and the number of
items) are appropriate
Speed Tests: base the estimate on two independent testing periods, using
either test-retest reliability, alternate-forms reliability, or split-half
reliability from two separately timed half tests; if a speed test is
administered once and some measure of internal consistency, such as the
Kuder–Richardson or a split-half correlation, is calculated, the result
will be a spuriously high reliability coefficient
COMPARING THEORIES
Fill the table below with information that you can use to compare and contrast theories
related to testing
Classical Test Theory
- simple; gives the notion that everyone has a “true score” on a test
- its assumptions are easily met and therefore applicable to many measurement situations, which can be advantageous, especially for the test developer in search of an appropriate model of measurement for a particular application
- in psychometric parlance, CTT is considered “weak” compared to IRT, whose assumptions are difficult to meet
- compatible and easy to use with widely used statistical techniques (as well as most currently available data analysis software)

Domain Sampling Theory
- seeks to estimate the extent to which specific sources of variation under defined conditions are contributing to the test score
- reliability is conceived of as an objective measure of how precisely the test score assesses the domain from which the test draws a sample
- items in the domain are thought to have the same means and variances as those in the test that samples from the domain
- of the three types of estimates of reliability, measures of internal consistency are perhaps the most compatible with domain sampling theory

Generalizability Theory
- a “universe score” replaces the “true score” and is analogous to a true score
- given the exact same conditions of all the facets in the universe, the exact same test score should be obtained; based on the idea that a person’s test scores vary from testing to testing because of variables in the testing situation
- a test’s reliability is very much a function of the circumstances under which the test is developed, administered, and interpreted
- tests should be developed with the aid of a generalizability study, in which scores are examined for how well they generalize when the test is administered in different situations, followed by a decision study, in which developers examine the usefulness of test scores in helping the test user make decisions

Item Response Theory
- models the probability that a person with X amount of a particular trait will exhibit Y amount of that trait on a test designed to measure it
- a synonym is latent-trait theory
- assumptions are made about the frequency distribution of test scores
- refers to a family of theories and methods, and quite a large family at that, with many other names used to distinguish specific approaches
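The Rasch model named in the key terms as a specific IRT approach can be sketched as a one-parameter logistic function: the probability that a person of ability theta answers an item of difficulty b correctly is 1 / (1 + exp(-(theta - b))). The theta and b values below are illustrative, not from any real item calibration.

```python
# One-parameter (Rasch) item response model: probability of a correct
# response as a function of person ability (theta) and item difficulty (b).
import math

def rasch_p_correct(theta: float, b: float) -> float:
    """P(correct) under the Rasch model."""
    return 1 / (1 + math.exp(-(theta - b)))

# When ability exactly matches item difficulty, the probability is .50;
# ability above (below) the difficulty pushes it toward 1 (toward 0).
print(rasch_p_correct(theta=0.0, b=0.0))              # 0.5
print(round(rasch_p_correct(theta=2.0, b=0.0), 2))    # 0.88
print(round(rasch_p_correct(theta=-1.0, b=1.0), 2))   # 0.12
```

This is also where item discrimination fits in: the Rasch model fixes discrimination to be equal across items, whereas richer IRT models let each item's curve rise more or less steeply.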
References:
Gulliksen, H. (1950). The reliability of speeded tests. ETS Research Bulletin Series, 1950(1), i–16. doi:10.1002/j.2333-8504.1950.tb00876.x
Dannana, S., & Engineer, A. (2018, September 2). What are the sources of errors in measurement? Retrieved August 5, 2020, from https://extrudesign.com/sources-of-errors-in-measurement/
Mote, T. (2020, July 21). Factors affecting reliability in psychological tests. Retrieved August 5, 2020, from https://healthfully.com/factors-affecting-reliability-in-psychological-tests-4020509.html
Yrubin1. (n.d.). This is because Spearman Brown estimates are based on a test that is twice as. Course Hero. Retrieved August 5, 2020, from https://www.coursehero.com/file/p6ue4h2/This-is-because-Spearman-Brown-estimates-are-based-on-a-test-that-is-twice-as/