Validity, Norms and
Standardization
By Khushboo, Neha and Kavya
VALIDITY
● Test validity refers to the degree to which the test actually measures what it claims to measure.
● Validity provides a direct check on how well the test fulfills its function.
● The assurance of validity usually requires independent, external criteria of whatever the test is designed to
measure.
● Example: If an HR manager wants to measure employee engagement, then he must use a valid indicator that
reflects the level of engagement, such as employee satisfaction surveys, not absenteeism rates.
TYPES OF VALIDITY
The four types of validity are:
1. Face validity.
2. Content validity.
3. Criterion validity.
4. Construct validity.
FACE VALIDITY
● A test is said to have face validity if it “looks like” it is going to measure what it is supposed to
measure.
● It is a subjective and initial judgment about whether the content of the test seems appropriate and
relevant.
● Face validity does not delve into deeper statistical evaluation but is based on intuitive assessment.
● Example: A CV for a marketing-sector job is expected to list skills such as good communication, overseeing
marketing campaigns, client relationship management, SEM, etc.
CONTENT VALIDITY
● Content validity evidence involves the degree to which the content of the test matches the content domain
associated with the construct.
● The content validity ratio (CVR), based on Lawshe's table, is CVR = (Ne - N/2) / (N/2), where N is the
number of expert panelists and Ne is the number who rate an item as 'essential'.
● The formula helps identify which items are essential enough to retain.
● Example of content validity: a final examination in a course should sample items from all the topics covered
in the syllabus, not just from one chapter.
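To make the formula concrete, here is a minimal Python sketch of the CVR calculation; the panel counts used are hypothetical, not from the slides.

```python
# Minimal sketch of Lawshe's Content Validity Ratio (CVR) for one item.
# CVR = (Ne - N/2) / (N/2), where N is the number of expert panelists
# and Ne is the number who rate the item as "essential".

def content_validity_ratio(n_essential: int, n_experts: int) -> float:
    """Return the CVR for a single test item."""
    return (n_essential - n_experts / 2) / (n_experts / 2)

# Hypothetical panel: 8 of 10 experts rate the item as essential.
print(content_validity_ratio(8, 10))   # 0.6

# Items whose CVR falls below the critical value in Lawshe's table
# (which depends on panel size) are candidates for removal.
```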
CRITERION VALIDITY
● Criterion validity refers to the extent to which a measure is related to an outcome.
● Criterion validity relies on some criterion or standard to indicate the accuracy of the construct.
● Here, the validity of an indicator is verified by comparing it with another test of the same
construct that has already been validated.
● Example: CAT/MAT/SAT entrance exam scores are used for admission to universities because the scores
correlate with later academic performance, which serves as the criterion.
CRITERION VALIDITY
● Predictive validity and concurrent validity are subtypes of criterion validity.
PREDICTIVE AND CONCURRENT VALIDITY
● Predictive validity assesses how well a test can predict future criterion scores.
● Example: A career assessment test accurately predicts an individual’s performance in particular
fields on the basis of his or her interests, aptitude, etc.
● Concurrent validity, in contrast, answers the question of how test results relate to a criterion
measured at the same time.
● Example: A new mental health inventory correlates strongly with an established psychological assessment
when both are administered at the same time.
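Criterion (predictive or concurrent) validity is usually summarized as a correlation coefficient between test scores and criterion scores. The sketch below illustrates this with invented entrance-exam scores and a hypothetical later criterion (first-year grades); the data are purely for illustration.

```python
# Hypothetical illustration of a criterion (predictive) validity coefficient:
# the correlation between entrance-exam scores and a later criterion
# (first-year grades). All values are made up for the example.
from statistics import correlation  # Python 3.10+

exam_scores = [520, 610, 450, 700, 580, 630, 490, 560]
first_year_gpa = [2.8, 3.4, 2.5, 3.9, 3.1, 3.6, 2.7, 3.0]

validity_coefficient = correlation(exam_scores, first_year_gpa)
print(f"Predictive validity coefficient: {validity_coefficient:.2f}")
```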
CONSTRUCT VALIDITY
● Construct validity refers to the degree to which a test accurately assesses the underlying theoretical
construct it intends to measure.
● Example: A human biology exam should measure only constructs related to the human body.
CONSTRUCT VALIDITY
Construct validity has two subtypes, as follows:
Convergent validity
Discriminant validity
CONVERGENT AND DISCRIMINANT VALIDITY
● Convergent validity assesses whether the test relates to other measures of the same construct as expected.
● Example: If a new spirituality test correlates highly with an established spirituality test, this indicates
convergent validity.
● Discriminant validity assesses the extent to which tests of different constructs show low relatedness.
● Example: employee satisfaction vs. employee engagement.
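A rough way to check both is to compare correlations: the new test should correlate strongly with an established measure of the same construct and only weakly with a measure of a different construct. The sketch below uses invented scores purely for illustration.

```python
# Hypothetical check of convergent vs. discriminant validity.
# Convergent: new spirituality test vs. established spirituality test (should be high).
# Discriminant: new spirituality test vs. a different construct (should be low).
from statistics import correlation  # Python 3.10+

new_spirituality = [12, 18, 25, 30, 22, 15, 28, 20]
established_spirituality = [14, 20, 24, 32, 21, 13, 29, 19]   # same construct
job_satisfaction = [22, 16, 25, 19, 14, 18, 23, 23]           # different construct

print("Convergent r:  ", round(correlation(new_spirituality, established_spirituality), 2))
print("Discriminant r:", round(correlation(new_spirituality, job_satisfaction), 2))
# Evidence of construct validity: the convergent r is high, the discriminant r is low.
```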
ADVANTAGES OF VALIDITY
● Exact science: Validity plays a major role in the psychological testing procedure.
It helps make a psychological assessment at least as valid as most medical tests, which
improves the process of psychotherapy in various ways, such as monitoring
treatment over time and assisting in case formulation.
● Universally applicable: Since validity provides the theoretical and
methodological principles for the development and use of any measuring instrument, it
enables the test to be used at a global level.
● Wide application: Validity shapes the measuring tool so that every
individual who falls within the criteria of the test is able to attempt it. Thus, the
test widens its application from clinical assessment to various other fields, such as
organizational or educational assessment.
DISADVANTAGES OF VALIDITY
● Time consuming: Validity, as part of the psychological testing process,
is lengthy and time consuming; assessing and evaluating the
validity of any measuring instrument is a long process that
requires expertise and professional judgement.
● Easy manipulation: A test can be low in internal validity because the
independent variables can easily be manipulated, which can make
internal validity appear higher than it really is, thus affecting the validity of the
test as a whole.
AFFECTING FACTORS
● Individual Bias.
● Time taken for evaluation.
● Socio-cultural differences.
● Poorly constructed Test items.
● Improper arrangement of items.
● Difficulty level of the test.
● Lack of pre-testing
Test development- Norms
What are Norms?
Norms describe what persons in a representative group actually do on a psychological test.
Scores on psychological tests are most commonly interpreted by reference to norms that represent the test
performance of the standardization sample.
Any individual’s raw score is then referred to the distribution of scores obtained by the standardization
sample, to discover where he or she falls in the distribution.
Derived Score
Does the score coincide with average performance of the standardization group?
Is it slightly below average? Or does it fall near the upper end of the distribution?
In order to ascertain more precisely the individual’s exact position with reference to the standardization sample,
the raw score is converted into some relative measure; such converted scores are called derived scores.
Purpose of Derived score
First they indicate an individual’s relative standing in the normative sample and permit an evaluation of her
performance in reference to other persons.
Second they provide comparable measures that permit a direct comparison of the individual’s performance
on different tests.
For example, if a girl has a raw score of 40 on a vocabulary test and a raw score of 22 on an arithmetic test, we do not
know about her relative performance on the two tests. Since raw scores on different tests are usually expressed in
different units, a direct comparison of such scores is impossible. The difficulty level of the particular test
would also affect such a comparison between raw scores.
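As an illustration of how derived scores make such a comparison possible, the sketch below converts the two raw scores to z-scores; the normative means and standard deviations used are hypothetical, not from the slides.

```python
# Sketch of why derived scores allow comparison across tests.
# Raw scores of 40 (vocabulary) and 22 (arithmetic) are not directly comparable,
# but z-scores relative to the normative sample are.
# The normative means and SDs below are hypothetical, for illustration only.

def z_score(raw: float, norm_mean: float, norm_sd: float) -> float:
    return (raw - norm_mean) / norm_sd

vocab_z = z_score(40, norm_mean=35, norm_sd=5)    # +1.00 -> above average
arith_z = z_score(22, norm_mean=25, norm_sd=4)    # -0.75 -> below average

print(f"Vocabulary z = {vocab_z:+.2f}, Arithmetic z = {arith_z:+.2f}")
# Both scores are now in the same units (standard deviations from the mean),
# so her relative standing on the two tests can be compared directly.
```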
Features of Derived score
Derived scores can be expressed in the same units and referred to the same or closely similar normative
samples for different tests. The individual’s relative performance in many different functions can thus be
compared.
Derived scores can be expressed in two major ways:
1. Developmental level attained
2. Relative position within a specified group
Normal distribution
Properties of Normal Distribution
The mean, median and mode are exactly the same.
The distribution is symmetric about the mean—half the values fall below the mean and half above the mean.
The distribution can be described by two values: the mean and the standard deviation.
The mean determines where the peak of the curve is centered. Increasing the mean moves the curve right,
while decreasing it moves the curve left.
The standard deviation stretches or squeezes the curve. A small standard deviation results in a narrow curve
while a large standard deviation leads to a wide curve.
Properties of Normal Distribution
Around 68% of values are within 1 standard deviation from the mean.
Around 95% of values are within 2 standard deviations from the mean.
Around 99.7% of values are within 3 standard deviations from the mean.
The standard normal distribution, also called the z-distribution, is a special normal distribution where the
mean is 0 and the standard deviation is 1.
Z- Score
While individual observations from normal distributions are referred to as x, they are referred to as z in the
z-distribution. Every normal distribution can be converted to the standard normal distribution by turning the
individual values into z-scores.
Z-scores tell you how many standard deviations away from the mean each value lies.
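A minimal sketch of this conversion, using hypothetical test scores:

```python
# Sketch: converting observations x from a distribution to z-scores,
# which express each value as a number of SDs from the mean.
from statistics import mean, stdev

scores = [52, 61, 45, 70, 58, 63, 49, 56, 66, 50]   # hypothetical raw scores
m, s = mean(scores), stdev(scores)

z_scores = [(x - m) / s for x in scores]
for x, z in zip(scores, z_scores):
    print(f"x = {x:3d}  ->  z = {z:+.2f}")

# For a normal distribution, roughly 68% of values fall within 1 SD of the
# mean (|z| <= 1), about 95% within 2 SDs, and about 99.7% within 3 SDs.
```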
Types of Norms
A. Developmental Norms
Depict the normal developmental path for an individual’s progression.
Can be very useful for description but are not well suited for precise statistical purposes.
Can be classified as mental age norms, grade equivalent norms and ordinal scale norms.
Scores based on these norms don’t lend themselves well to precise statistical treatment. They have
considerable appeal for descriptive purposes, especially for the intensive clinical study of individuals and for certain
research purposes.
1. Mental Age Norms
To obtain these, we take the mean raw score obtained by all children of a given age within the standardization
sample.
Hence, the 15-year norm would be represented by the mean raw score of children aged 15
years.
E.g., the Binet-Simon scale: items passed by the majority of 7-year-olds in the standardization sample were placed at the
7-year level; similarly, items passed by the majority of 8-year-olds were placed at the 8-year level. A child’s score on the test
would then correspond to the highest year level he or she could successfully complete.
It was important to calculate the basal age, i.e., the highest age at or below which all tests were passed. Partial
credits in months were then added to this basal age for all tests passed at higher year levels.
Mental age norms have also been employed with tests that are not divided into year levels.
In such a case, the child’s raw score is first determined. Such a score may be the total number of correct items
on the whole test, or it may be based on the number of errors, or on a combination of such measures. The mean raw
scores obtained by children in each year-age group within the standardization sample constitute the age
norms of such a test.
For example, the mean raw score of 8-year-old children would represent the 8-year norm. If an individual’s raw score is
equal to the mean 8-year-old raw score, then his or her mental age on the test is 8 years. All the raw scores on
such a test can be transformed in a similar manner by reference to the age norms.
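A minimal sketch of such an age-norm lookup is shown below; the norm values and the closest-norm matching rule are hypothetical simplifications.

```python
# Sketch of mental-age norms for a test not divided into year levels.
# The age norm for each year is the mean raw score of children of that age
# in the standardization sample; here a child's mental age is taken as the
# age whose norm his or her raw score matches most closely.

age_norms = {6: 18, 7: 24, 8: 31, 9: 37, 10: 42}   # age -> mean raw score (hypothetical)

def mental_age(raw_score: float) -> int:
    """Return the age whose norm is closest to the child's raw score."""
    return min(age_norms, key=lambda age: abs(age_norms[age] - raw_score))

print(mental_age(31))   # 8 -> raw score equals the 8-year norm, so MA = 8
print(mental_age(39))   # 9 -> closest to the 9-year norm
```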
2. Grade equivalent Norms
Grade norms are found by computing the mean raw score obtained by children in each grade.
Scores on educational achievement tests are often interpreted as grade equivalents, because these tests
are employed within the school setting.
Thus, if the average number of problems solved correctly on an arithmetic test by 4th graders in the standardization
sample is 23, then a raw score of 23 corresponds to a grade equivalent of 4.
Because the school year covers 10 months, successive months can be expressed as decimals. For example, 4.0 refers to
average performance at the beginning of 4th grade, 4.5 refers to average performance at the middle of the
grade, and so forth.
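A minimal sketch of a grade-equivalent lookup, using simple linear interpolation across the school year; the norm values and the interpolation rule are assumptions for illustration.

```python
# Sketch of grade-equivalent norms with monthly decimals.
# If 4th graders in the standardization sample average 23 problems correct,
# a raw score of 23 corresponds to grade equivalent 4.0; intermediate scores
# are interpolated across the 10-month school year. Norms are hypothetical.

grade_norms = {3: 17, 4: 23, 5: 30}   # grade -> mean raw score at start of grade

def grade_equivalent(raw: float) -> float:
    """Linearly interpolate a grade equivalent (e.g., 4.5 = middle of grade 4)."""
    grades = sorted(grade_norms)
    for lower, upper in zip(grades, grades[1:]):
        lo, hi = grade_norms[lower], grade_norms[upper]
        if lo <= raw <= hi:
            return round(lower + (raw - lo) / (hi - lo), 1)
    raise ValueError("raw score outside the normed range")

print(grade_equivalent(23))    # 4.0 -> beginning of 4th grade
print(grade_equivalent(26.5))  # 4.5 -> middle of 4th grade
```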
Shortcomings of Grade Norms
The content of instruction varies from grade to grade. Grade norms are generally not applicable at the high school level,
where many subjects may be studied for only one or two years. Even with subjects taught in each grade, the
emphasis placed on different topics may vary from grade to grade, and progress may be more rapid in
one subject than in another during a particular grade.
Grade norms are also subject to misinterpretation unless the test user keeps firmly in mind the manner in
which they were derived. Children may not score well on a multiplication question bank unless the topic
has been covered with them; this has no bearing on their capacity for multiplication.
Grade norms also tend to be incorrectly regarded as performance standards. E.g., expecting all children in grade 5
to score at or around the grade 5 norm is an incorrect expectation.
3. Ordinal Scale
The scales developed within this framework are ordinal in the sense that developmental stages follow in a
constant order, each stage presupposing mastery of prerequisite behavior characteristic of earlier stages.
Ordinal scales are designed to identify the stage reached by the child in the development of specific behavior
functions. Although the scores may be reported in approximate age levels, such scores are secondary to a
qualitative description of the child’s characteristic behaviour. The ordinality of such scales refers to the
uniform progression of development through successive stages.
E.g., the developmental theories of Swiss psychologist Jean Piaget. His research focused on the development of
cognitive processes from infancy to the mid-teens. He was concerned with specific concepts rather than broad
abilities.
For example, object permanence: the child is aware of the continuing existence of objects when they are seen from different
angles or are out of sight.
Another concept is conservation: the recognition that a property of an object remains constant despite changes in perceptual
appearance, as when the same quantity of liquid is poured into differently shaped containers.
B. Group Norms
This type of norm is used to compare an individual’s performance with that of the most closely comparable
group.
E.g., comparing a child’s raw score with those of children of the same chronological age or in the same
school grade.
Group norms carry a clear and well-defined quantitative meaning that lends itself to most statistical analyses.
4. Percentiles
Percentile scores are expressed in terms of the percentage of persons in the standardization sample who fall
below a given raw score.
E.g., if 28% of persons solve fewer than 15 problems correctly on an arithmetic reasoning test, then a raw score
of 15 corresponds to the 28th percentile.
A percentile indicates an individual’s relative performance in the standardization sample. The lower the
percentile, the poorer the individual’s standing.
Percentiles are derived scores, expressed in terms of the percentage of persons; percentage scores, in contrast, are raw
scores expressed in terms of the percentage of correct items.
P50, the 50th percentile, corresponds to the median. Percentiles above 50 represent above-average performance;
those below 50 signify below-average performance. The 25th and 75th percentiles are the first and third quartile points.
Here counting begins from the bottom, so the higher the percentile, the better the rank. For example, if a
person is at the 97th percentile in a competitive exam, 97% of the participants scored below
him or her.
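A minimal sketch of computing a percentile rank from a (hypothetical) standardization sample:

```python
# Sketch of percentile ranks: the percentage of persons in the standardization
# sample who fall below a given raw score. Sample scores are hypothetical.

norm_sample = [8, 10, 11, 12, 12, 13, 14, 15, 15, 16,
               17, 18, 19, 20, 21, 22, 23, 24, 25, 26]

def percentile_rank(raw: float) -> float:
    below = sum(score < raw for score in norm_sample)
    return 100 * below / len(norm_sample)

print(percentile_rank(15))   # 35.0 -> 35% of this sample scored below 15
print(percentile_rank(22))   # 75.0 -> 75% scored below 22
```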
5. Standard scores
These express the individual’s distance from the mean in terms of the standard deviation of the distribution.
Standard scores may be obtained by linear or nonlinear transformations of the original raw scores. When found
by linear transformation, they retain the exact numerical relations of the original raw scores.
The relative magnitudes of differences between standard scores derived by such a linear transformation
correspond exactly to those between the raw scores. All properties of the original distribution of raw scores
are duplicated in the distribution of the standard scores.
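A minimal sketch of a linear standard-score transformation; the T-score scale (mean 50, SD 10) used here is one common rescaling of z, not something named in the slides, and the raw scores are hypothetical.

```python
# Sketch of linearly transformed standard scores.
# z = (raw - mean) / SD preserves the relative distances between raw scores;
# a T-score (mean 50, SD 10) is one common linear rescaling of z.
from statistics import mean, stdev

raw_scores = [34, 41, 47, 52, 58, 63, 70]   # hypothetical raw scores
m, s = mean(raw_scores), stdev(raw_scores)

for raw in raw_scores:
    z = (raw - m) / s
    t = 50 + 10 * z          # T-score: mean 50, SD 10
    print(f"raw {raw:3d}  z {z:+.2f}  T {t:5.1f}")
```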
Benefits of using z- score
Understand where a data point fits into a distribution.
Compare observations from dissimilar normal distributions.
Identify outliers: z-scores beyond ±2 (depending on the cut-off used).
Calculate probabilities and percentiles using the standard normal distribution.
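As a sketch of the last point, the standard normal cumulative distribution function turns a z-score into the proportion of cases falling below it (and hence an approximate percentile):

```python
# Sketch: using the standard normal distribution (mean 0, SD 1) to turn a
# z-score into a cumulative probability, i.e., an approximate percentile.
from math import erf, sqrt

def standard_normal_cdf(z: float) -> float:
    """P(Z <= z) for the standard normal distribution."""
    return 0.5 * (1 + erf(z / sqrt(2)))

print(f"z = +1.0 -> about {standard_normal_cdf(1.0):.1%} of cases fall below")   # ~84.1%
print(f"z = -2.0 -> about {standard_normal_cdf(-2.0):.1%} of cases fall below")  # ~2.3%
```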
6. Stanine and Sten Score
Stanine scores are derived from a national norm reference sample. A stanine is a score from 1 to 9, with a
stanine of 9 indicating a very high level of general ability relative to the whole norm reference group, and a
stanine of 1 indicating very low relative achievement.
STEN scores (or “Standard Tens”) divide a scale into ten units. In simple terms, you can think of them as
scores out of ten.
A STEN of 1 or 2 is far below average,
A STEN of 5-7 is average,
A STEN of 9 or 10 is far above average.
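A minimal sketch of converting z-scores to stanines and stens, assuming the usual scaling (stanines: mean 5, SD roughly 2; stens: mean 5.5, SD 2); the conversion rule and example z values are illustrative assumptions.

```python
# Sketch: converting z-scores to stanines (1-9) and stens (1-10).
# Stanines have mean 5 and SD ~2; stens have mean 5.5 and SD 2.
# Scores are rounded to the nearest unit and clamped to the allowed range.

def to_stanine(z: float) -> int:
    return max(1, min(9, round(5 + 2 * z)))

def to_sten(z: float) -> int:
    return max(1, min(10, round(5.5 + 2 * z)))

for z in (-2.2, -1.0, -0.3, 0.6, 1.8):
    print(f"z {z:+.1f} -> stanine {to_stanine(z)}, sten {to_sten(z)}")
```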
7. Deviation IQ
In order to convert mental age scores into a uniform index of an individual’s relative status, the intelligence
quotient (IQ) was introduced, originally computed as the ratio of mental age to chronological age multiplied by 100.
An IQ of 100 represented normal or average performance.
An IQ below 100 represented retardation; an IQ above 100 represented acceleration.
The deviation IQ replaces this ratio with a standard score that has a mean of 100 and a standard deviation
(typically 15) fixed for the test.
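A minimal sketch of a deviation IQ computed as a standard score with mean 100 and SD 15; the normative mean and SD used are hypothetical.

```python
# Sketch of a deviation IQ: a standard score with mean 100 and (typically) SD 15,
# i.e., IQ = 100 + 15 * z. The normative mean and SD below are hypothetical.

def deviation_iq(raw: float, norm_mean: float, norm_sd: float, iq_sd: float = 15) -> float:
    z = (raw - norm_mean) / norm_sd
    return 100 + iq_sd * z

print(deviation_iq(raw=62, norm_mean=50, norm_sd=10))   # z = +1.2 -> IQ 118
print(deviation_iq(raw=50, norm_mean=50, norm_sd=10))   # z =  0.0 -> IQ 100 (average)
```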
Criteria for good test norms
Test norms should be representative
Test norms should be relevant
Test norms should be up to date
Test norms should be comparable
Test norms should be properly and adequately described
Standardization
Standardization of psychological tests refers to the consistency of processes and procedures used for conducting and scoring the
test. It involves creating a set of normative data or scores derived from groups of people for whom the measure is designed.
This also includes the type of materials to be used, verbal instructions, time to be taken, the way to handle questions by test takers
and all other minute details of a testing environment.
○ For example, you take an intelligence test that's given to you by Amy, a nice woman who hands the test over, tells you that
you have an hour to take it, and then walks away. You are left to figure everything out on your own.
○ But imagine that your friend takes that same test, but this time it's given by someone named Rosa. Rosa notices when your
friend starts to struggle with a question, so she gives him a hint. When he really can't get an answer, she lets him look the
answers up online. What if you score the same as your friend?
Does that mean that you are equally adept? No, because you didn't have standardization. That is, the test you took was
harder than your friend's test, even though it had the same questions, just by virtue of the fact that you didn't have the same
help that he did
○ As you can probably tell, standardization is very important in an intelligence test and other psychological tests. Making sure
that every single person gets the test under standard conditions ensures that everyone gets a fair shot at the test
Standardization also makes it possible to establish norms, i.e., average performance.
The group on which these norms are established is called the standardization sample or norming group.
Advantages
1. Reliable Comparison: Standardized tests allow for consistent and fair evaluations across individuals or
groups
2. Quantifiable Measures: They offer quantifiable measures such as Standard Age Scores
(SAS) and other standardized scores, making it easier to interpret and compare results
objectively
3. Unbiased Evaluation: Standardized tests are designed to be unbiased, ensuring that all
test-takers are evaluated using the same criteria and standards, promoting fairness in
assessment
4. Setting a Benchmark: These tests set a benchmark or standard against which
individual performances can be measured, helping educators and policymakers assess
the effectiveness of curricula and learning objectives
The normal process of standardized administration includes:
1) A calm, quiet and disturbance-free setting,
2) Accurately understanding the written instructions, and
3) Provision of the required stimuli.
Classification
Norm-referenced Testing: It is used to measure a result or performance in relation to all other
individuals taking the same test; it can be used to compare an individual with others.
E.g., the Scholastic Aptitude Test (SAT), a standardized test used for college admissions in the
United States, and the JEE.
Criterion-referenced Testing: It is used to measure actual mastery of a certain topic against a fixed standard. E.g.,
Advanced Placement (AP) exams: these exams assess a student's understanding of college-level
material in various subjects and determine whether the student has met the criteria for advanced placement or
college credit.
Steps for constructing standardized tests
1) Plan for the test.
2) Preparation of the test.
3) Trial run of the test.
4) Checking the Reliability and Validity of the test.
5) Prepare the norms for the test.
6) Prepare the manual of the test and reproducing the test.
Computer applications in psychological testing
1. Diagnosis and Symptom Severity Rating: Computers assist in clinical practice by aiding in the diagnosis
of psychological conditions and rating the severity of symptoms, which can improve the accuracy and
efficiency of assessments.
2. Evaluation of Treatment Outcomes: Computers help in evaluating the effectiveness of treatments by tracking
changes in symptoms and behaviors over time, providing valuable data for assessing the impact of
interventions.
3. Computer-Assisted Scoring: One common application is computer-assisted scoring of
examiner-administered tests, which enhances the accuracy and speed of scoring psychological
assessments.
4. Data Processing: Computers are used for processing data resulting from psychological tests, facilitating
the analysis and interpretation of test results in clinical psychology.
5. Enhanced Psychological Laboratory Practices: Computer applications are invaluable resources in
psychological laboratories, aiding in various testing procedures and research activities to deepen
understanding in the field of psychology.
Limitations
1. Lack of acceptance by some practitioners
2. Evaluation of the software is another obstacle
3. Most systems of computer-assisted test interpretation tend to combine
clinical and statistical procedures
4. Computerized assessments may lack the ability to capture the full
nuances and context of human behavior, potentially leading to
oversimplified interpretations
5. Limited Flexibility
THANK YOU