Psychometrics

The document discusses psychometric scaling techniques, item writing guidelines, and the concepts of reliability and validity in psychological assessments. It emphasizes the importance of creating reliable tests that accurately measure psychological constructs while providing detailed guidelines for writing effective test items. Additionally, it explores the relationship between reliability and validity, as well as the significance of normality in behavioral assessments.

Psychometric studies

Long essay 2016

1. Write a detailed essay on the nature of scaling techniques.

Psychometric scaling is the creation of an instrument to measure a psychological concept through a process of analyzing responses to a set of test items or other stimuli. It involves identifying item properties, noting whether responses match theoretical formats, reducing the larger set of items to a smaller number (e.g., through exploratory factor analysis), and determining appropriate scoring methods.

Scaling at the test level really has two meanings in psychometrics. First, it refers to defining the method used to operationally score the test, establishing an underlying scale on which people are being measured. Second, it refers to score conversions used for reporting scores, especially conversions that are designed to carry specific information. The latter is typically called scaled scoring.

You have all been exposed to this type of scaling, though you might not have realized it at the
time. Most high-stakes tests like the ACT, SAT, GRE, and MCAT are reported on scales that are
selected to convey certain information, with the actual numbers selected more or less arbitrarily.
The SAT and GRE have historically had a nominal mean of 500 and a standard deviation of 100,
while the ACT has a nominal mean of 18 and standard deviation of 6. These are essentially the
same kind of scale, because each is nothing more than a converted z-score (standard score),
used simply because no examinee wants to receive a score report that says they got a score of -1. The
numbers above were arbitrarily selected, and then the score range bounds were selected based
on the fact that 99% of the population is within plus or minus three standard deviations. Hence,
the SAT and GRE range from 200 to 800 and the ACT ranges from 0 to 36. This leads to the
urban legend of receiving 200 points for writing your name correctly on the SAT; again, it feels
better for the examinee. A score of 300 might seem like a big number and 100 points above the
minimum, but it just means that someone is in the 3rd percentile.
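As a quick, hedged illustration (the code and the normal-distribution assumption are mine, not part of the original text), converting a scaled score of 300 back to a z-score on a nominal 500/100 scale shows why it sits near the bottom of the distribution:

    # Illustrative sketch: scaled score -> z-score -> approximate percentile,
    # assuming a nominal mean of 500, an SD of 100, and a normal distribution.
    from statistics import NormalDist

    nominal_mean, nominal_sd = 500, 100
    scaled_score = 300                                   # hypothetical examinee score

    z = (scaled_score - nominal_mean) / nominal_sd       # -2.0
    percentile = NormalDist().cdf(z) * 100               # roughly the 2nd-3rd percentile
    print(f"z = {z:.1f}, approximate percentile = {percentile:.1f}")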

Now, notice that I said “nominal.” I said that because the tests do not actually have those means
observed in samples, because the samples have substantial range restriction. Because these
tests are only taken by students serious about proceeding to the next level of education, the
actual sample is of higher ability than the population. The lower third or so of high school
students usually do not bother with the SAT or ACT. As a result, many states have an observed
average ACT of around 21 and a standard deviation of around 4. This is an important issue to consider in
developing any test. Consider just how restricted the population of medical school students is; it
is a very select group.

How can I select a score scale?

For various reasons, actual observed scores from tests are often not reported, and only
converted scores are reported. If there are multiple forms which are being equated, scaling will
hide the fact that the forms differ in difficulty, and in many cases, differ in cutscore. Scaled
scores can facilitate feedback. They can also help the organization avoid explanations of IRT
scoring, which can be a headache to some.

When deciding on the conversion calculations, there are several important questions to consider.

First, do we want to be able to make fine distinctions among examinees? If so, the range should
be sufficiently wide. My personal view is that the scale should be at least as wide as the number
of items; otherwise you are voluntarily giving up information. This in turn means you are giving up
variance, which makes it more difficult to correlate your scaled scores with other variables, as when
the MCAT is correlated with success in medical school. This, of course, means that you are
hampering future research, unless that research can revert back to the actual observed
scores so that all available information is used. For example, suppose a test with 100
items is reported on a 5-point grade scale of A-B-C-D-F. That scale is quite restricted, and
therefore difficult to correlate with other variables in research. But you have the option of
reporting the grades to students and still using the original scores (0 to 100) for your research.

Along the same lines, we can swing completely in the other direction. For many tests, the
purpose of the test is not to make fine distinctions, but only to broadly categorize examinees.
The most common example of this is a mastery test, where the examinee is being assessed on
their mastery of a certain subject, and the only possible scores are pass and fail. Licensure and
certification examinations are an example. An extension of this is the “proficiency categories”
used in K-12 testing, where students are classified into four groups: Below Basic, Basic,
Proficient, and Advanced. This is used in the National Assessment of Educational Progress
(http://nces.ed.gov/nationsreportcard/). Again, we see the care taken for reporting of low
scores; instead of receiving a classification like “nonmastery” or “fail,” the failures are given the
more palatable “Below Basic.”

Another issue to consider, which is very important in some settings but irrelevant in others, is
vertical scaling. This refers to the chaining of scales across various tests that are at quite
different levels. In education, this might involve linking the scales of exams in 8th grade, 10th
grade, and 12th grade (graduation), so that student progress can be accurately tracked over
time. Obviously, this is of great use in educational research, such as the medical school process.
But for a test to award a certification in a medical specialty, it is not relevant because it is really a
one-time deal.

Lastly, there are three calculation options: a pure linear conversion (ScaledScore = RawScore * Slope + Intercept), a standardized conversion (from an old mean/SD to a new mean/SD), and nonlinear approaches such as equipercentile equating.
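A minimal sketch of the first two options, with made-up slope, intercept, and mean/SD values (the equipercentile method needs the full score distributions of both forms and is omitted here):

    # Hedged sketch of linear and standardized scaled-score conversions (assumed values).
    def linear_scale(raw, slope, intercept):
        # Pure linear conversion: ScaledScore = RawScore * Slope + Intercept
        return raw * slope + intercept

    def standardized_scale(raw, old_mean, old_sd, new_mean, new_sd):
        # Map the old mean/SD onto a new mean/SD via a z-score
        z = (raw - old_mean) / old_sd
        return new_mean + z * new_sd

    # e.g., re-expressing a raw score of 62 (raw mean 50, SD 8) on a 500/100 reporting scale
    print(linear_scale(62, slope=2, intercept=100))        # 224
    print(standardized_scale(62, 50, 8, 500, 100))         # 650.0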

Perhaps the most important issue is whether the scores from the test will be criterion-referenced
or norm-referenced. Often, this choice will be made for you because it distinctly represents the
purpose of your tests. However, it is quite important and usually misunderstood, so I will discuss
this in detail.

2. Elucidate the guidelines for item writing.


Item writing refers to the process of constructing the items required for a given test or
assessment. The people who write these items are called item writers.

Item Writing Guidelines: General Guidelines

1) Do Not Use Trick Items
• Items can be tricky either intentionally or accidentally.
  o Trick items are unfair to examinees and threaten the validity of the test.
  o Trick items measure test-taking skills more than the intended construct.
  o Trick items heighten test anxiety and cause examinees to mistrust the intent of all other items.
• Causes of unintentional trickiness:
  o Trivial content
  o Discrimination between options that is too fine
  o Overlapping options
  o Irrelevant content
  o A single answer allowed, but multiple correct answers possible
  o Ambiguity in either the stem or the options

2) Measure a Single Construct
• If an examinee incorrectly answers an item that taps multiple constructs, it is impossible to know which construct has not been mastered.
• Items are generally scored dichotomously, so the only inference that can be made is that the examinee knows the entire item or none of it.
• Compound items heighten test anxiety and can lower the perceived validity of the exam.

3) Avoid Opinion-Based Items
• Never ask "What would you ... do", "... use", "... try", etc. The examinee's answer can never be wrong.
• Use caution when asking for the "best" thing, or the "best" way of doing something, unless it is clearly the best amongst the options.

• If experts' opinions differ about what the "best" is, then avoid using it.
• Qualify the standard for "best" (i.e., according to ...).

4) Avoid Absolute Modifiers such as always, never, only and none
• The use of absolute modifiers in options makes it easy to eliminate options, increasing the guessing probability.

5) Avoid Excessive Verbiage
• "Verbosity is an enemy to clarity." (Haladyna, 2004)
• Wordy items take longer to read and answer, meaning fewer items can be presented in a fixed amount of time, which reduces reliability.
• Write items as briefly as possible without compromising the construct and the cognitive demand required.
• Get to the point in the stem and present clean, clear options for the examinee to choose from.
• Avoid unnecessary background information.

6) Avoid Over-Specific or Over-General Content
• Over-specific content tends to be trivial.
• Over-general content tends to be ambiguous.

7) Use Novel Content
• Do not repeat exact wording from training materials.
• Repeated wording tends to test recall and recognition rather than learning.

10) Keep Items Independent
• Content should be independent from item to item.
• Don't give away the answer to one item in the stem of another.
• Don't make answering one item correctly dependent on knowing the correct answer to another item.

11) Write Items to a Sixth-Grade Reading Level
• Use vocabulary appropriate to the construct.
• Use necessary technical terms and content.
• For everything else, use the simplest words and sentence structure possible.

12) Do Not Teach
• The purpose of certification is to verify knowledge.
• Do not introduce new material or reinforce material the examinee should already know.

Guidelines for Writing the Stem

1) Write the stem in the form of a question.
2) Place the main idea in the stem.
3) The examinee should be able to know immediately what the focus of the item is just by reading the stem.
4) The examinee should be able to answer the question without reading the options.
5) Make the stem a complete sentence, containing all information necessary to answer the question.
6) Keep the stem as brief as possible.
7) Move repeated words from the options into the stem, when possible.
8) Avoid negative words such as "not" or "except".

Guidelines for Writing the Options

1) Make the correct answer always correct.
2) Make the distractors always wrong, but attractive (plausible) to examinees who are not minimally competent.
3) Avoid "All of the above" and "None of the above" as options.
4) Avoid inadvertent clues:
• Do not always make the correct option the longest.
• Do not repeat words in the options that are in the stem (clang associations).
• Do not use specific determiners such as always, never, all, every, etc. They are so extreme that they are seldom the correct answer.
• Keep options homogeneous (parallel) in content, length, and grammatical structure.
5) All items should have at least one correct option and one distractor.
6) At least four options are preferable, but three are sufficient.
7) Order the options in either logical or numerical order, if one exists. Otherwise, sort shortest to longest.
8) Avoid humor.

Guidelines for Writing Distractors


1) Use logical misunderstandings or misconceptions.
2) Use common errors.
3) Use familiar terms, key words, structures, or ordering.
4) Use statements that are correct or true but do not answer or address the stem (question).
5) Avoid opposing statements.
6) Use correct concepts, but "mixed up."

3. Bring out the meaning of reliability.

Psychometric reliability is the extent to which test scores are consistent and free from measurement error. A reliable test score is precise and consistent across administrations, and it can be reproduced on multiple occasions. A psychometric test is considered reliable only if it produces similar results under invariable conditions.

Reliability is an essential component of a sound psychological assessment. A test will not be considered reliable if it produces inconsistent results from one administration to the next. The reliability of test scores depends on the extent to which scores are consistent across multiple instances of testing, across different test editions, or across multiple raters grading the participant's responses.

The term reliability refers to the invariability of the outcome. For example, if a test aims to measure a trait such as introversion, then each time a subject takes the test, the assessment should produce consistent results. It may be difficult to measure reliability precisely in the real world, but it can be estimated in several ways.

A test is reliable as long as it produces similar results over time, repeated administration, or
similar circumstances.

If you were to use a professional dart player as an example, their ability to hit the same spot consistently under specified conditions, even when it is not the bull's eye, would classify them as a reliable player. Reliability alone, however, does not guarantee psychometric validity: a test can produce stable results over time without measuring what it is intended to measure.

Over the years, scholars and researchers have developed multiple ways to check psychometric reliability. Some involve testing the same participants at different points in time, or presenting the participants with varying versions of the same test to evaluate their consistency. An assessment must demonstrate good reliability before it can be considered valid.

What Are the Four Types of Reliability?

The four types of psychometric reliability are:

• Parallel Forms Reliability: Two different forms of the test use the same content but separate procedures or equipment, and yield the same result for each test-taker.
• Internal Consistency Reliability: Items within the test are examined to see whether they appear to measure what the test measures. Consistency among the test items is referred to as internal consistency.
• Inter-Rater Reliability: When two raters score the psychometric test in the same manner, inter-scorer consistency is high.
• Test-Retest Reliability: The same test is administered over time, and the test-taker displays consistency in scores across multiple administrations of the same test.
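As a rough, hand-rolled sketch (the data are invented and the formulas are the standard textbook ones, not anything prescribed by the source), two of these types can be estimated as follows:

    # Hedged sketch: estimating test-retest reliability and internal consistency
    # (Cronbach's alpha) from toy data invented purely for illustration.
    import numpy as np

    # Test-retest: correlate the same people's scores on two occasions
    time1 = np.array([12, 15, 9, 20, 17, 11])
    time2 = np.array([13, 14, 10, 19, 18, 12])
    test_retest_r = np.corrcoef(time1, time2)[0, 1]

    # Internal consistency: Cronbach's alpha from a persons-by-items score matrix
    items = np.array([[1, 0, 1, 1],      # rows = persons, columns = items
                      [1, 1, 1, 0],
                      [0, 0, 1, 0],
                      [1, 1, 1, 1],
                      [0, 1, 0, 0]])
    k = items.shape[1]
    sum_item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    alpha = (k / (k - 1)) * (1 - sum_item_vars / total_var)

    print(round(test_retest_r, 2), round(alpha, 2))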

Is Psychometric Testing Reliable?

Psychometric tests are as reliable as many other medical tests, sometimes more so. However, there can be minor discrepancies in psychometric reliability because individuals have different thoughts, feelings, or ideas at various points in time, leading to variance in scores. Several factors (both stable traits and momentary issues) can result in variation in test scores.

Stable traits include weight, height, and other such characteristics. Momentary inconsistency is attributed to factors such as the health of the test-taker, their understanding of a particular test item, and so forth.

Why Psychometric Test Reliability Counts

Reliability is essential for psychometric tests. After all, a test that yields different results each time is of little use, especially if scores can affect employee selection, retention and promotion.

Psychometricians identify two different categories of errors:

• Systematic errors: These are factors that impact test construction and are built into the test.
• Unsystematic errors: These are errors resulting from random factors, such as how the test is given or taken.

Numerous factors influence the psychometric reliability of tests. The timing between two test
sessions affects test-retest and alternate/parallel-forms reliability. The similarity of content, and subjects' expectations regarding the different testing elements, affect only the latter type of reliability, along with split-half and internal-consistency estimates.

Changes in subjects over time, such as their environment, physical state, emotional and mental
well-being, must also be considered while assessing the reliability of psychometric tests. Test-
based factors such as inadequate testing instructions, biased scoring, lacking objectivity and
guessing on the part of the test-taker also influence the psychometric reliability of tests. Tests
can generate reliable estimates sometimes and not so stable results other times (Geisinger,
2013).

4. Describe the relationship between reliability and validity.

https://www.scribbr.com/methodology/reliability-vs-validity/

5. Explain the importance and elements of normality.

Normality is a behavior that can be normal for an individual (intrapersonal normality)


when it is consistent with the most common behavior for that person. Normal is also
used to describe individual behavior that conforms to the most common behavior in
society (known as conformity). However, normal behavior is often only recognized in
contrast to abnormality. In its simplest form, normality is seen as good while
abnormality is seen as bad.[1] Someone being seen as normal or not normal can
have social ramifications, such as being included, excluded or stigmatized by wider
society.

Measuring
Many difficulties arise in measuring normal behaviors—biologists come across
parallel issues when defining normality. One complication that arises regards
whether 'normality' is used correctly in everyday language.[2] People say "this heart is
abnormal" if only a portion of it is not working correctly, yet it may be inaccurate to
include the entirety of the heart under the description of 'abnormal'. There can be a
difference between the normality of a body part's structure and its function.
Similarly, a behavioral pattern may not conform to social norms, but still be effective
and non-problematic for that individual. Where there is a dichotomy between
appearance and function of a behavior, it may be difficult to measure its normality.
This is applicable when trying to diagnose a pathology and is addressed in the
Diagnostic and Statistical Manual of Mental Disorders.

Clinical normality
Applying normality clinically depends on the field and situation a practitioner is in. In
the broadest sense, clinical normality is the idea of uniformity of physical and
psychological functioning across individuals.
Psychiatric normality, in a broad sense, holds that psychopathologies are disorders
that are deviations from normality.[22]
Normality, and abnormality, can be characterized statistically. Related to the
previous definition, statistical normality is usually defined in terms of a normal
distribution curve, with the so-called 'normal zone' commonly accounting for 95.45%
of all the data. The remaining 4.55% lies split between the two tails, outside of two
standard deviations from the mean. Thus any case that lies more than two standard
deviations from the mean would be considered abnormal. However, the critical value of such statistical
judgments may be subjectively altered to a less conservative estimate. It is in fact
normal for a population to have a proportion of abnormals. The presence of
abnormals is important because it is necessary to define what 'normal' is, as
normality is a relative concept.[23] So at a group, or macro, level of analysis,
abnormalities are normal given a demographic survey; while at an individual level,
abnormal individuals are seen as being deviant in some way that needs to be
corrected.
Statistical normality is important in determining demographic pathologies. When a
variable rate, such as virus spread within a human population, exceeds its normal
infection rate, then preventative or emergency measures can be introduced.
However, it is often impractical to apply statistical normality to diagnose individuals.
Symptom normality is the current, and assumed most effective, way to assess
patient pathology.
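A small sketch of that statistical rule, using invented scores and assuming a normal distribution (the cut-off of two standard deviations follows the description above):

    # Sketch of the statistical definition of normality: flag cases falling more than
    # two standard deviations from the mean (toy, invented data).
    from statistics import NormalDist, mean, stdev

    # Proportion of a normal distribution inside +/- 2 SD (about 95.45%)
    inside = NormalDist().cdf(2) - NormalDist().cdf(-2)
    print(f"within 2 SD: {inside:.2%}")

    scores = [98, 102, 95, 101, 100, 140, 99, 103]   # invented scores; 140 is an outlier
    m, s = mean(scores), stdev(scores)
    abnormal = [x for x in scores if abs(x - m) > 2 * s]
    print(abnormal)                                  # [140]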
DSM
Normality, as a relative concept, is intrinsically involved with contextual elements. As
a result, clinical disorder classification has particular challenges in discretely
diagnosing 'normal' constitutions from true disorders. The Diagnostic and Statistical
Manual of Mental Disorders (DSM) has been the psychiatric profession's official
classification manual of mental disorders since its first version (DSM-I) was published by
the American Psychiatric Association in 1952.
As the DSM evolved into its current version (DSM-5), published in 2013, there have been
numerous conflicts in proposed classification between mental illness and normal
mentality. In his book Saving Normal, Dr. Allen Frances, who chaired the task force
for content in the DSM-IV and DSM-IV-TR, wrote a scathing indictment of the
pressures incumbent on the definition of "normal" relative to psychological
constructs and mental illness.
Most of this difficulty stems from the DSM's ambiguity of natural contextual stressor
reactions versus individual dysfunction. There are some key progressions along the
DSM history that have attempted to integrate some aspects of normality into proper
diagnosis classification. As a diagnostic manual for the classification of abnormalities,
all DSMs have been biased towards classifying symptoms as disorders by
emphasizing symptoms in isolation. The result is widespread misdiagnosis
of symptoms that may be normal when judged in their context.[24]
DSM-II
The second edition of the DSM could not be effectively applied because of its vague
descriptive nature. Psychodynamic etiology was a strong theme in classifying
mental illnesses. The applied definitions became idiosyncratic, stressing individual
unconscious roots. This made applying the DSM unreliable across psychiatrists.[24]
No clear distinction between abnormal and normal was established.
Evidence of this classification ambiguity was underscored by the Rosenhan
experiment of 1972. This experiment demonstrated that the methodology of
psychiatric diagnosis could not effectively distinguish normal from disordered
mentalities. DSM-II labelled 'excessive' behavioral and emotional responses as an
index of mental abnormality when diagnosing certain disorders.[25]
'Excessiveness' of a reaction implied an alternative normal behavior, which would have
to include a situational factor in evaluation. As an example, a year of intense grief
from the death of a spouse may be a normal appropriate response. To have intense
grief for twenty years would be indicative of a mental disorder. Likewise, to grieve
intensely over the loss of a sock would not be considered normal
responsiveness and would indicate a mental disorder. The consideration of proportionality
to stimuli was a perceived strength in psychiatric diagnosis for the DSM-II.[24]
Another characteristic of the DSM-II systemization was that it classified
homosexuality as a mental disorder. Thus, homosexuality was psychiatrically
defined as a pathological deviation from 'normal' sexual development. The diagnosis
was later revised in the 7th printing of DSM-II and recategorized as 'sexual
orientation disturbance'. The intent was to have a label that applied only to those
homosexual individuals who were bothered by their sexual orientation.[26] In this
manner homosexuality would not be viewed as an atypical illness.[26] Only if it was
distressing would homosexuality be classified as a mental illness.[25] However,
DSM-II did not explicitly state that homosexuality was normal either. This stigma
lasted into DSM-III, until the diagnosis was removed entirely from DSM classifications in 1987.[24][25]

DSM-III
DSM-III was an attempt to re-establish psychiatry as a scientific discipline after the
opprobrium resulting from DSM-II.[22] The reduction in the psychodynamic etiologies of
DSM-II spilled over into a reduction of symptom etiology altogether. Thus, DSM-III offered
a specific set of definitions for mental illnesses, entities better suited to
diagnostic psychiatry, but it dropped response proportionality as a classification
factor. The result was that all symptoms, whether normal proportional responses or
inappropriate pathological tendencies, could be treated as potential signs of
mental illness.[22]

DSM-IV
DSM-IV explicitly distinguishes mental disorders and non-disordered conditions. A
non-disordered condition results from, and is perpetuated by, social stressors.
Included in DSM-IV's classification is that a mental disorder "must not be merely an
expectable and culturally sanctioned response to a particular event, for example, the
death of a loved one. Whatever its original cause, it must currently be considered a
manifestation of a behavioral, psychological, or biological dysfunction in the
individual" (American Psychiatric Association 2000:xxxi) This had supposedly
injected normality consideration back into the DSM, from its removal from DSM-II.
However, it has been speculated that DSM-IV still does not escape the problems
DSM-III faced, where psychiatric diagnoses still include symptoms of expectable
responses to stressful circumstances to be signs of disorders, along with symptoms
that are individual dysfunctions.[24] The example set by DSM-III, for principally
symptom-based disorder classification, has been integrated as the norm of mental
diagnostic practice.[24]
DSM-5
The DSM-5 was released in 2013. It has significant differences from DSM-IV-TR, including the removal of the multi-axial classification system and the reconfiguration of the Asperger's/autism spectrum classifications.

Psychometric 2017

1. Describe the psychophysical and psychological scaling methods.


Psychophysical scaling refers to any of the techniques used to construct scales relating physical stimulus properties to perceived magnitude. For example, a respondent in a study may have to indicate the roughness of several different materials that vary in texture. Methods are often classified as direct or indirect, based on how the observer judges magnitude.
In traditional psychophysical scaling methods, a set of standard stimuli (such as weights) that can be ordered according to some physical property is related to sensory judgments made by experimental subjects. By the method of average error, for example, subjects are given a standard stimulus and then made to adjust a variable stimulus until they believe it is equal to the standard. The mean (average) of a number of judgments is obtained. This method and many variations of it have been used to study such experiences as visual illusions, tactual intensities, and auditory pitch.

Psychological (psychometric) scaling methods are an outgrowth of the psychophysical tradition just described. Although their purpose is to locate stimuli on a linear (straight-line) scale, no quantitative physical values (e.g., loudness or weight) for stimuli are involved. The linear scale may represent an individual's attitude toward a social institution, his judgment of the quality of an artistic product, the degree to which he exhibits a personality characteristic, or his preference for different foods. Psychological scales thus are used for having a person rate his own characteristics as well as those of other individuals in terms of such attributes, for example, as leadership potential or initiative. In addition to locating individuals on a scale, psychological scaling can also be used to scale objects and various kinds of characteristics: finding where different foods fall on a group's preference scale, or determining the relative positions of various job characteristics in the view of those holding that job. Reported degrees of similarity between pairs of objects are used to identify scales or dimensions on which people perceive the objects.

2. Describe the guidelines for item writing.


Answered above (see the 2016 paper, question 2).
3. What is the utility of reliability in test construction?

4. Write a note on validity and its types.


https://www.scribbr.com/methodology/types-of-validity/

5. Describe the significance of norms in test construction


Norms refer to information regarding the performance of a particular reference group on a particular measure, against which a person's score can be compared.
Norms are standardized scores. Scores on psychological tests are most commonly interpreted by reference to norms that represent the test performance of the standardization sample. Norms represent the typical or average performance of that sample.

Basically, there are two purposes of norms:

• Norms indicate the individual's relative standing in the normative sample and thus permit evaluation of his/her performance in reference to other persons.
• Norms provide comparable measures that permit a direct comparison of the individual's performance on different tests.

Sir Francis Galton first developed the logic for norm-based testing, in the 19th century.

Statistical concepts:

• Frequency distribution: A major object of the statistical method is to organize and summarize quantitative data in order to facilitate their understanding. A list of 1,000 test scores can be an overwhelming sight; in that form, it conveys little meaning. A first step in bringing order into such a chaos of raw data is to tabulate the scores into a frequency distribution. A distribution is prepared by grouping the scores into convenient class intervals and tallying each score in the appropriate interval. When all scores have been entered, the tallies are counted to find the frequency, or number of cases, in each class interval. The sum of these frequencies will equal N, the total number of cases in the group.
• Graphical representation: The information provided by a frequency distribution can also be presented graphically in the form of a distribution curve. On the baseline, or horizontal axis, are the scores grouped into class intervals; on the vertical axis are the frequencies, or number of cases, falling within each class interval. Such a graph can be plotted in two ways. In the histogram, the height of the column erected over each class interval corresponds to the number of persons scoring in that interval. In the frequency polygon, the number of persons in each interval is indicated by a point in the center of the class interval, across from the appropriate frequency; the successive points are then joined by straight lines.
• Central tendency: A group of scores can also be described in terms of some measure of central tendency. The most familiar of these measures is the average, more technically known as the mean (M), which is found by adding all scores and dividing the sum by the number of cases (N). Another measure is the mode, or most frequent score; in a frequency distribution, the mode is the midpoint of the class interval with the highest frequency. The third measure of central tendency is the median, or middlemost score when all scores have been arranged in order of size. The median is the point that bisects the distribution, with half the cases falling above it and half below.
• Variability: Further description of a set of test scores is given by measures of variability, or the extent of individual differences around the central tendency. The most obvious and familiar way of reporting variability is the range between the highest and lowest score. The range, however, is extremely crude and unstable, for it is determined by only two scores; a single unusually high or low score would markedly affect its size. A more precise method of measuring variability is based on the difference between each individual's score and the mean of the group.

A much more serviceable measure of variability is the standard deviation (SD), in which the negative signs are legitimately eliminated by squaring each deviation. The sum of these squared deviations divided by the number of cases is known as the variance, or mean square deviation. The variance has proved extremely useful in sorting out the contributions of different factors to individual differences in test performance. The SD also provides the basis for expressing an individual's scores on different tests in terms of norms.

Developmental Norms

One way in which meaning can be attached to test scores is to indicate how far along the normal
developmental path the individual has progressed. Developmental systems utilize more highly
qualitative descriptions of behavior in specific functions, such as sensorimotor activities or
concept formation.

Mental Age: The term "mental age" was widely popularized through the various translations and adaptations of the Binet-Simon scales, although Binet himself had employed the more neutral term "mental level". In age scales such as the Binet and its revisions (prior to 1986), items were grouped into year levels. For example, those items passed by the majority of 7-year-olds in the standardization sample were placed in the 7-year level, and so forth. A child's score on the test would then correspond to the highest year level that he or she could successfully complete. In actual practice, individuals failed some tests below their mental age and passed some above it. For this reason, it was customary to compute the basal age, that is, the highest age at and below which all tests were passed. Partial credits, in months, were then added to this basal age for all tests passed at higher year levels. Mental age norms have also been employed with tests that are not divided into year levels. In such a case, the child's raw score is first determined. The mean raw scores obtained by the children in each year group within the standardization sample constitute the age norms for such a test. The mean raw score of the 8-year-old children, for example, would represent the 8-year norm; if a child's raw score equals the mean 8-year-old raw score, then her or his mental age on the test is 8 years. All raw scores on such a test can be transformed in a similar manner by reference to the age norms.

Grade Equivalents: Scores on educational achievement tests are often interpreted in terms of
grade equivalents. Grade norms are found by computing the mean raw score obtained by
children in each grade. Thus, if the average number of problems solved correctly on an arithmetic
test by the fourth graders in the standardization sample is 23, then a raw score of 23
corresponds to a grade equivalent of 4. Intermediate grade equivalents, representing fractions of
a grade, are usually found by interpolation, although they can also be obtained directly by testing
children at different times within the school years. For example, 4.0 refers to average
performance at the beginning of the fourth grade. Grade norms are also subject to
misinterpretation unless the test user keeps firmly in mind the manner in which they were
derived.
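As a small illustration (the grade norms below are invented, not taken from the source), an intermediate grade equivalent can be obtained by linear interpolation between adjacent grade norms:

    # Hedged sketch: linear interpolation of an intermediate grade equivalent.
    grade_norms = {4: 23, 5: 30}     # assumed mean raw scores at grade 4.0 and grade 5.0

    def grade_equivalent(raw):
        # Interpolate between the grade-4 and grade-5 norms
        lo, hi = grade_norms[4], grade_norms[5]
        return 4 + (raw - lo) / (hi - lo)

    print(round(grade_equivalent(26.5), 1))   # a raw score of 26.5 -> grade equivalent 4.5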
Ordinal Scales: Ordinal scales are designed to identify the stage reached by the child in the development of specific behavior functions. Although scores may be reported in terms of approximate age levels, such scores are secondary to the qualitative description of the child's characteristic behavior. The ordinality of such scales refers to the uniform progression of development through successive stages. Insofar as these scales typically provide information about what the child is actually able to do (e.g., climbs stairs without assistance; recognizes identity in quantity of liquid when poured into differently shaped containers), they share important features with domain-referenced tests.

Psychometric 2018

1. Write an account of the classification of psychological tests.

https://leverageedu.com/blog/types-of-psychological-
tests/#:~:text=Psychological%20Tests%20are%20of%20different,behaviour%2C%20research
%20purposes%2C%20etc.

2. Describe item response theory.

Item response theory (IRT), also called latent trait theory, is a psychometric theory that was
created to better understand how individuals respond to individual items on psychological and
educational tests. The underlying theory is built around a series of mathematical formulas that
have parameters that need to be estimated using complex statistical algorithms. These
parameters relate to properties of individual items and characteristics of individual respondents.
The term latent trait is used to describe IRT in that characteristics of individuals cannot be
directly observed; they must be inferred by using certain assumptions about the response
process that help estimate these parameters.

Item response theory complements and contrasts classical test theory (CTT), which is the
predominant psychometric theory taught in undergraduate and graduate programs. Classical
test theory differs from IRT in several ways that will be discussed throughout this entry. In
general, though, IRT can be thought of as analogous to an electron microscope for item analysis,
whereas CTT would be more like a traditional optical microscope. Both techniques are useful for
their own purposes. Just like the electron microscope, IRT provides powerful measurement
analysis; IRT is useful if you have a need for specific, precise analysis. On the other hand, CTT
can be just as useful as IRT when the research questions are vague and general. In medical
research, sometimes the optical microscope is preferred to the electron microscope. Likewise,
CTT may be preferred in some situations.
The Item Response Function

Item response theory relates characteristics of items and characteristics of individuals to the probability of affirming, endorsing, or correctly answering individual items. The cornerstone of IRT is the item response function (IRF), the graphical representation of a mathematical formula that relates the probability of affirming or correctly answering item i to the value of a latent trait, conventionally denoted theta.
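One common form of the IRF is the three-parameter logistic (3PL) model; the sketch below uses arbitrary example parameter values and is only an illustration of the general idea, not a formula given in the source:

    # Hedged sketch of a 3PL item response function with assumed parameters.
    import math

    def irf_3pl(theta, a, b, c):
        # a = discrimination, b = difficulty, c = pseudo-guessing lower asymptote
        return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

    # Probability that a respondent with theta = 1.0 answers an item of difficulty 0.5
    print(round(irf_3pl(theta=1.0, a=1.2, b=0.5, c=0.2), 3))   # about 0.717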

Applications of Item Response Theory

Item response theory has had a significant impact in psychology by allowing for more precise
methods of assessing properties of tests compared with classical test theory. In addition, IRT
has had a big impact on psychology by making possible several tools that would be difficult to
create without IRT. Psychometric applications, such as computerized adaptive testing, detecting
item bias, equating tests, and identifying aberrant individuals, have been greatly improved with
the development of IRT. In particular, computerized adaptive testing merits additional
discussion.

Computer adaptive tests work by choosing items that are best suited for identifying the precise level of theta for an individual respondent. Specifically, there is an IRT concept called information that is important for adaptive tests. Item-level information is related to the amount of uncertainty about a theta estimate that can be reduced by administering that item. Information differs by the level of theta: some items have high information for low levels of theta, whereas other items may have high information for high levels of theta. Imagine a mathematics test. A basic algebra item may provide high amounts of information for people who possess extremely low ability. That same item, however, would do little to differentiate between individuals of moderate and high math ability. To differentiate between those individuals, a more complex item would need to be given. Information functions can be plotted for individual items (or for tests) to see for what level of theta the item is best suited.

Computerized adaptive tests work by choosing items that have large amounts of information at
the respondent's estimated theta. Theta estimates are revised after each item response, and then a
computer algorithm selects the next item to present based on the information level of items at
the revised theta estimate. By choosing only items with large amounts of information, adaptive
tests can maintain measurement precision at the levels of conventional tests even though fewer
items are administered.
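A toy sketch of that selection logic under the two-parameter logistic (2PL) model, with an invented item bank and theta estimate (the item names and parameters are illustrative only):

    # Hedged sketch: pick the most informative item at the current theta estimate.
    import math

    def p_2pl(theta, a, b):
        # Two-parameter logistic probability of a correct response
        return 1 / (1 + math.exp(-a * (theta - b)))

    def item_information(theta, a, b):
        p = p_2pl(theta, a, b)
        return a ** 2 * p * (1 - p)            # Fisher information for a 2PL item

    item_bank = {"easy": (1.0, -1.5), "medium": (1.3, 0.0), "hard": (1.1, 1.8)}
    theta_estimate = 0.2
    best = max(item_bank, key=lambda k: item_information(theta_estimate, *item_bank[k]))
    print(best)    # "medium" is most informative for a mid-range theta estimate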

Item response theory has already had a major effect on educational testing through its impact on
computerized adaptive testing (CAT). In the 1990s, Educational Testing Service implemented a
CAT version of the Graduate Record Examination (GRE). The success of adaptive testing would
not be possible without development of IRT. Large-scale adaptive testing would not be possible
using CTT.

In the future it is likely that item response theory will yield progress, not only in improvement of
measurement technologies but also by making contributions in substantive areas, such as
decision-making theory. Graduate students, researchers, and practitioners who are interested in
psychological measurement should invest some time to learn more about IRT technology.
Computer programs, such as BILOG, MULTILOG, and PARSCALE, are available to conduct IRT
analyses.

3. Explain the factors affecting reliability in psychological tests.

Some intrinsic and some extrinsic factors have been identified to affect the reliability of test
scores.

(A) Intrinsic Factors:

The principal intrinsic factors (i.e. those factors which lie within the test itself) which affect the
reliability are:

(i) Length of the Test:

Reliability has a definite relation with the length of the test. The more items the test contains, the greater will be its reliability, and vice-versa. Logically, the larger the sample of items we take from a given area of knowledge, skill and the like, the more reliable the test will be.

However, there is a limit to how far a test can be lengthened in pursuit of reliability: the length should not give rise to fatigue effects in the testees. Within that limit, it is advisable to use longer tests rather than shorter tests, as shorter tests are less reliable.

The number of times a test should be lengthened to reach a desirable level of reliability is given by the Spearman-Brown formula:

n = [r_desired x (1 - r_obtained)] / [r_obtained x (1 - r_desired)]

where r_obtained is the current reliability and r_desired is the target reliability. When a test has a reliability of 0.80, the factor by which the test has to be lengthened to reach a reliability of 0.95 is estimated as follows:

n = [0.95 x (1 - 0.80)] / [0.80 x (1 - 0.95)] = 0.19 / 0.04 = 4.75

Hence the test is to be lengthened 4.75 times. However, while lengthening the test, one should ensure that the items added to increase its length satisfy conditions such as an equal range of difficulty, the desired discriminating power, and comparability with the other test items.
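The same Spearman-Brown calculation expressed as a small code sketch (a convenience illustration, not part of the original text):

    # Spearman-Brown lengthening factor: how many times longer the test must be
    # to move from its current reliability to a desired reliability.
    def lengthening_factor(r_current, r_desired):
        return (r_desired * (1 - r_current)) / (r_current * (1 - r_desired))

    print(round(lengthening_factor(0.80, 0.95), 2))    # 4.75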

(ii) Homogeneity of Items:

Homogeneity of items has two aspects: item reliability and the homogeneity of traits measured
from one item to another. If the items measure different functions and the inter-correlations of
items are ‘zero’ or near to it, then the reliability is ‘zero’ or very low and vice-versa.

(iii) Difficulty Value of Items:

The difficulty level and clarity of expression of a test item also affect the reliability of test scores.
If the test items are too easy or too difficult for the group members, the test will tend to produce scores of low reliability, because such a test has a restricted spread of scores.


(iv) Discriminative Value:

When items discriminate well between superior and inferior examinees, the item-total correlation is high and the reliability is also likely to be high, and vice-versa.

(v) Test instructions:

Clear and concise instructions increase reliability. Complicated and ambiguous directions give
rise to difficulties in understanding the questions and the nature of the response expected from
the testee ultimately leading to low reliability.

(vi) Item selection:

If there are too many interdependent items in a test, the reliability is found to be low.

(vii) Reliability of the scorer:

The reliability of the scorer also influences the reliability of the test. If the scorer is moody or inconsistent, the scores will vary from one situation to another. Mistakes by the scorer give rise to mistakes in the scores and thus lower reliability.

(B) Extrinsic Factors:

The important extrinsic factors (i.e. the factors which remain outside the test itself) influencing
the reliability are:

(i) Group variability:

When the group of pupils being tested is homogeneous in ability, the reliability of the test scores
is likely to be lowered and vice-versa.

(ii) Guessing and chance errors:


Guessing in a test gives rise to increased error variance and as such reduces reliability. For
example, in two-alternative response options there is a 50% chance of answering the items
correctly in terms of guessing.

(iii) Environmental conditions:

As far as practicable, testing environment should be uniform. Arrangement should be such that
light, sound, and other comforts should be equal to all testees, otherwise it will affect the
reliability of the test scores.

(iv) Momentary fluctuations:

Momentary fluctuations may raise or lower the reliability of the test scores. A broken pencil, momentary distraction by the sudden sound of a train passing outside, anxiety over unfinished homework, or a mistake in recording an answer with no way to change it are factors which may affect the reliability of test scores.

4. What is the link between reliability and validity as essential features of psychological tests?

5. Describe norms and norm development in the context of psychological tests.

Standardization and Testing Norms

As part of the development of any psychometrically sound measure, explicit methods and
procedures by which tasks should be administered are determined and clearly spelled out. This
is what is commonly known as standardization. Typical standardized administration procedures
or expectations include (1) a quiet, relatively distraction-free environment, (2) precise reading of
scripted instructions, and (3) provision of necessary tools or stimuli. All examiners use such
methods and procedures during the process of collecting the normative data, and such
procedures normally should be used in any other administration, which enables application of
normative data to the individual being evaluated (Lezak et al., 2012).

Standardized tests provide a set of normative data (i.e., norms), or scores derived from groups of
people for whom the measure is designed (i.e., the designated population) to which an
individual's performance can be compared. Norms consist of transformed scores such as
percentiles, cumulative percentiles, and standard scores (e.g., T-scores, Z-scores, stanines, IQs),
allowing for comparison of an individual's test results with the designated population. Without
standardized administration, the individual's performance may not accurately reflect his or her
ability. For example, an individual's abilities may be overestimated if the examiner provides
additional information or guidance than what is outlined in the test administration manual.
Conversely, a claimant's abilities may be underestimated if appropriate instructions, examples, or
prompts are not presented. When nonstandardized administration techniques must be used,
norms should be used with caution due to the systematic error that may be introduced into the
testing process; this topic is discussed in detail later in the chapter.

It is important to clearly understand the population for which a particular test is intended. The
standardization sample is another name for the norm group. Norms enable one to make
meaningful interpretations of obtained test scores, such as making predictions based on
evidence. Developing appropriate norms depends on size and representativeness of the sample.
In general, the more people in the norm group the closer the approximation to a population
distribution so long as they represent the group who will be taking the test.

Norms should be based upon representative samples of individuals from the intended test
population, as each person should have an equal chance of being in the standardization sample.
Stratified samples enable the test developer to identify particular demographic characteristics
represented in the population and more closely approximate these features in proportion to the
population. For example, intelligence test scores are often established based upon census-based
norming with proportional representation of demographic features including race and ethnic
group membership, parental education, socioeconomic status, and geographic region of the
country.

When tests are applied to individuals for whom the test was not intended and, hence, were not
included as part of the norm group, inaccurate scores and subsequent misinterpretations may
result. Tests administered to persons with disabilities often raise complex issues. Test users
sometimes use psychological tests that were not developed or normed for individuals with
disabilities. It is critical that tests used with such persons (including SSA disability claimants)
include attention to representative norming samples; when such norming samples are not
available, it is important for the assessor to note that the test or tests used are not based on
representative norming samples and the potential implications for interpretation (Turner et al.,
2001).

Psychometric 2019

1. Differentiate between measurement and evaluation, and describe the levels of measurement.

Measurement is a systematic process of determining the attributes of an object; it ascertains how fast, tall, dense, heavy, or broad something is. However, one can measure only those attributes that can be quantified with the help of tools; when attributes cannot be measured in this way, the need for evaluation arises. Evaluation helps in passing value judgements about the policies, performances, methods, techniques, strategies, effectiveness, etc. of teaching.

Comparison Chart

• Meaning: Measurement refers to the process of assigning a numerical index to an object in a meaningful and consistent manner. Evaluation is when a comparison is made between the score of a learner and the scores of other learners, and the results are judged.
• Observations: Measurement comprises observations which can be expressed numerically. Evaluation comprises both quantitative and qualitative observations.
• Involves: Measurement involves the assignment of numerals according to certain rules. Evaluation involves the assignment of grades, levels or symbols according to established standards.
• Concerned with: Measurement is concerned with one or more attributes or features of a person or object. Evaluation is concerned with the physical, psychological and behavioural aspects of a person.
• Answers: Measurement answers "how much"; evaluation answers "how good" or "how well".
• Logical assumption: Measurement does not convey any logical assumption about the student. Evaluation allows logical assumptions to be made about the student.
• Time and energy: Measurement requires less time and energy; evaluation requires more.
• Scope: Measurement is limited in scope; evaluation is wide.
• Orientation: Measurement is content oriented; evaluation is objective oriented.

Levels of measurement, also called scales of measurement, tell you how precisely
variables are recorded. In scientific research, a variable is anything that can take on
different values across your data set (e.g., height or test scores).

There are 4 levels of measurement:

• Nominal: the data can only be categorized.
• Ordinal: the data can be categorized and ranked.
• Interval: the data can be categorized, ranked, and evenly spaced.
• Ratio: the data can be categorized, ranked, evenly spaced, and has a natural zero.

Depending on the level of measurement of the variable, what you can do to analyze your data may be limited. There is a hierarchy in the complexity and precision of the levels of measurement, from low (nominal) to high (ratio).
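As a rough illustration (a common rule-of-thumb mapping, not an exhaustive or source-given one), the level of measurement constrains which descriptive statistics are meaningful:

    # Illustrative mapping of measurement levels to commonly permissible statistics.
    permissible_stats = {
        "nominal":  ["mode", "frequency counts"],
        "ordinal":  ["mode", "median", "percentiles"],
        "interval": ["mode", "median", "mean", "standard deviation"],
        "ratio":    ["mode", "median", "mean", "standard deviation", "ratios"],
    }

    variable_levels = {"blood group": "nominal", "class rank": "ordinal",
                       "temperature (Celsius)": "interval", "reaction time": "ratio"}
    for var, level in variable_levels.items():
        print(var, "->", permissible_stats[level])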

2. Describe the meaning and purpose of item analysis.

Within psychometrics, item analysis refers to statistical methods used for selecting items for inclusion in a psychological test. The concept goes back at least to Guilford (1936). The process of item analysis varies depending on the psychometric model; for example, classical test theory and the Rasch model call for different procedures. In all cases, however, the purpose of item analysis is to produce a relatively short list of items (that is, questions to be included in an interview or questionnaire) that constitute a pure but comprehensive test of one or a few psychological constructs.
To carry out the analysis, a large pool of candidate items, all of which show some degree of face validity, is given to a large sample of participants who are representative of the target population. Ideally, there should be at least 10 times as many candidate items as the desired final length of the test, and several times more people in the sample than there are items in the pool. Researchers apply a variety of statistical procedures to the responses to eliminate unsatisfactory items. For example, under classical test theory, researchers discard items if the answers:

• Show little variation within the sample
• Are strongly correlated with one or more other items
• Weakly correlate with the totality of the remaining items, as reflected in an increase in Cronbach's alpha when the item is eliminated from the test

In practical test construction, item analysis is an iterative process, and cannot be entirely automated. The psychometrician's judgement is required to determine whether the emerging set of items to be retained constitutes a satisfactory test of the target construct. The three criteria above do not always agree, and a balance must be struck between them in deciding whether or not to include an item.
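A hedged sketch of such a first pass under classical test theory, using an invented response matrix; the data and any implied cut-offs are illustrative only:

    # Toy item-analysis pass: per-item variance and corrected item-total correlation.
    import numpy as np

    responses = np.array([            # rows = respondents, columns = candidate items
        [1, 0, 1, 1, 0],
        [1, 1, 1, 0, 0],
        [0, 0, 1, 0, 1],
        [1, 1, 1, 1, 0],
        [0, 1, 0, 0, 1],
        [1, 1, 1, 1, 0],
    ])

    for j in range(responses.shape[1]):
        item = responses[:, j]
        rest = responses.sum(axis=1) - item          # total score excluding this item
        variance = item.var(ddof=1)
        r_item_rest = np.corrcoef(item, rest)[0, 1] if variance > 0 else float("nan")
        print(f"item {j}: variance={variance:.2f}, corrected item-total r={r_item_rest:.2f}")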

3. Explain reliability, its types, and the factors influencing reliability.

Psychometric reliability is the extent to which test scores are consistent and free from measurement error. A reliable test score is precise and consistent across administrations, and it can be reproduced on multiple occasions. A psychometric test is considered reliable only if it produces similar results under invariable conditions.
Reliability is an essential component of a sound psychological assessment. A test will not be considered reliable if it produces inconsistent results from one administration to the next. The reliability of test scores depends on the extent to which scores are consistent across multiple instances of testing, across different test editions, or across multiple raters grading the participant's responses.

The term reliability refers to the invariability of the outcome. For example, if a test aims to measure a trait such as introversion, then each time a subject takes the test, the assessment should produce consistent results. It may be difficult to measure reliability precisely in the real world, but it can be estimated in several ways.

A test is reliable as long as it produces similar results over time, repeated administration, or
similar circumstances.
If you were to use a professional dart player as an example, their ability to hit the same spot consistently under specified conditions, even when it is not the bull's eye, would classify them as a reliable player. Reliability alone, however, does not guarantee psychometric validity: a test can produce stable results over time without measuring what it is intended to measure.
Over the years, scholars and researchers have developed multiple ways to check psychometric reliability. Some involve testing the same participants at different points in time, or presenting the participants with varying versions of the same test to evaluate their consistency. An assessment must demonstrate good reliability before it can be considered valid.

What Are the Four Types of Reliability?

The four types of psychometric reliability are:

• Parallel Forms Reliability: Two different forms of the test use the same content but separate procedures or equipment, and yield the same result for each test-taker.
• Internal Consistency Reliability: Items within the test are examined to see whether they appear to measure what the test measures. Consistency among the test items is referred to as internal consistency.
• Inter-Rater Reliability: When two raters score the psychometric test in the same manner, inter-scorer consistency is high.
• Test-Retest Reliability: The same test is administered over time, and the test-taker displays consistency in scores across multiple administrations of the same test.

Errors in Reliability

Psychometricians identify two different categories of errors:

 Systematic errors: These are factors that affect test construction and are built into
the test.
 Unsystematic errors: These are errors resulting from random factors, such as how
the test is given or taken.

Numerous factors influence the psychometric reliability of tests. The time elapsed between two
test sessions affects test-retest and alternate/parallel forms reliability. The similarity of
content across forms, and subjects' familiarity with the different testing elements, mainly
affect alternate/parallel forms reliability, along with split-half and internal consistency
estimates.
Changes in subjects over time, such as their environment, physical state, and emotional and
mental well-being, must also be considered when assessing the reliability of psychometric
tests. Test-based factors such as inadequate testing instructions, biased scoring, lack of
objectivity, and guessing on the part of the test-taker also influence the psychometric
reliability of tests. A test can generate reliable estimates under some conditions and
less stable results under others (Geisinger, 2013).
The reliability of your test depends on the following factors:
Construction of Items/ Questions

Test designers construct the questions of a psychometric test to assess a mental quality (for
example, motivation). The difficulty level of the questions, or the confusion they create through
ambiguity, can negatively influence reliability. Biases in interpreting the items and errors
in question construction can only be corrected if test instructions are properly implemented
and the redesign and research process is active and ongoing.

Administration

Administration of the test is another area where systematic errors can occur. Instructions
accompanying the assessment should be precise and well-defined. Errors in the guidance
provided to the test-takers or the administrators can have several adverse effects on the
reliability of the test. Guidelines that hinder accurate interpretation can lower test reliability.

Scoring
Psychometric reliability also requires that the test have a well-defined scoring system through
which the results can be interpreted. All tests include instructions on scoring. Errors such as
conclusions drawn without basis or substantial proof can lower the reliability of the test. Test
construction should be accompanied by research that provides evidence for the conclusions drawn. If
there is a systematic error in the test design phase, this too can impair reliability.

Environmental Factors

Extremes in temperature or audio-visual distractions can influence test scores’ reliability.


Errors in administering the psychometric test can also impact the reliability of the scores
obtained. Human error is equally possible, and interpretation or scoring can be influenced by
the examiner’s attitude toward the test-taker.

Test-Taker

The person being examined may suffer from social desirability concerns and give answers
that do not reflect actual choices. Other factors that influence the test-takers include anxiety,
bias and physical factors such as illness or sleep deprivation.

4.Describe the types of validity with examples.

https://www.scribbr.com/methodology/types-of-validity/

5.Explain the steps in developing norms.


SHORT ESSAY

Psychometric studies 2016

6. a) Describe the importance of ethical issues in psychological testing.

Ethics refers to the correct rules of conduct necessary when carrying out
research. We have a moral responsibility to protect research participants from
harm.

However important the issue under investigation, psychologists need to
remember that they have a duty to respect the rights and dignity of research
participants. This means that they must abide by certain moral principles and
rules of conduct.

Informed Consent
Whenever possible investigators should obtain the consent of participants. In
practice, this means it is not sufficient to simply get potential participants to
say “Yes”.
They also need to know what it is that they are agreeing to. In other words, the
psychologist should, so far as is practicable explain what is involved in
advance and obtain the informed consent of participants.

Debrief
After the research is over the participant should be able to discuss the
procedure and the findings with the psychologist. They must be given a
general idea of what the researcher was investigating and why, and their part
in the research should be explained.
Participants must be told if they have been deceived and given reasons why.
They must be asked if they have any questions and those questions should be
answered honestly and as fully as possible.
Debriefing should take place as soon as possible and be as full as possible;
experimenters should take reasonable steps to ensure that participants
understand debriefing.

Protection of Participants
Researchers must ensure that those taking part in research will not be caused
distress. They must be protected from physical and mental harm. This means
you must not embarrass, frighten, offend or harm participants.

Confidentiality
Participants, and the data gained from them must be kept anonymous unless
they give their full consent. No names must be used in a lab report.
What do we do if we find out something which should be disclosed (e.g.
criminal act)? Researchers have no legal obligation to disclose criminal acts
and have to determine which is the most important consideration: their duty
to the participant vs. duty to the wider community.

OR

6b) Explain the process involved in the development of rating scales.

The scale development process as described by Trochim (2006) is completed in five
steps (as quoted by Dimitrov, 2012):
1) Define the trait to be measured, assuming it is unidimensional.
2) Generate a pool of potential Likert items (preferably 80-100), rated on a 5- or
7-point disagree-agree response scale.
3) Have the items rated by a panel of experts on a 1-5 scale on how favorably the
items measure the construct (from 1 = strongly unfavorable to 5 = strongly favorable).
4) Select the items to retain for the final scale.
5) Administer the scale and sum the responses to all items (the raw score of the
scale), reversing items that measure something in the opposite direction of the rest
of the scale.
Because the overall assessment with an instrument is based on the respondent’s scores
on all items, the measurement quality of the total score is of particular interest
(Dimitrov, 2012). In a similar vein, Furr (2011) also described it as a process
completed in five steps: (a) define the construct measured and the context, (b) choose
the response format, (c) assemble the initial item pool, (d) select and revise items,
and (e) evaluate the psychometric properties. Steps (d) and (e) are an iterative
process of refinement of the initial pool until the properties of the scale are
adequate. Test scores can then be standardized.
There are several models of test development. In practice, steps within the different
stages may be grouped and undertaken in different combinations and sequences, and,
crucially, many steps of the process are iterative (Irwing & Hughes, 2018).
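As a small illustration of step 5, summing responses after reversing negatively keyed items, here is a
hedged Python sketch; the column names and the 1-5 response range are assumptions made for the example:

```python
import pandas as pd

def score_likert_scale(responses: pd.DataFrame, reverse_items, scale_min=1, scale_max=5):
    """Return each respondent's raw scale score (sum of item responses).

    Items listed in reverse_items are reverse-keyed before summing; on a
    1-5 scale a response of 2 becomes (5 + 1) - 2 = 4.
    """
    scored = responses.copy()
    for item in reverse_items:
        scored[item] = (scale_max + scale_min) - scored[item]
    return scored.sum(axis=1)

# Hypothetical data: three items, the third negatively worded
data = pd.DataFrame({"q1": [5, 2, 4], "q2": [4, 1, 5], "q3_rev": [1, 5, 2]})
print(score_likert_scale(data, reverse_items=["q3_rev"]))
```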

7a) Elaborate on the nature of power and speed tests.


Power Test - Psychometric Glossary
A test designed to find out how many items a subject can answer correctly, with no
predetermined time limit. The purpose of a power test is to examine a person’s ability to solve
problems when given enough time.

Note that the term "power test" is also used in statistics for something quite different: a
power analysis, a calculation performed before a study to determine the minimum sample
size needed for the study to have enough power, i.e., the minimum number of participants
you need to have in your study. To make this more understandable, let's discuss "power".

Power is the probability that a statistically significant effect can be found when
it actually exists. Without adequate power you might commit a Type II error,
meaning that you fail to reject the null hypothesis when it is false. The general
consensus is that power should be 0.8 or greater; if it is less than 0.8, the
sample size is too small. The exact formula for a power analysis depends on what
type of analysis you are running (such as a t-test), but power formulas take
into account the desired alpha or significance level, the effect size or expected
difference you wish to detect, and known variation in the population.
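For the statistical sense of "power" described in the paragraph above, here is a minimal sketch of an
a priori sample-size calculation using statsmodels; the effect size, alpha, and power values are
illustrative inputs, not recommendations:

```python
from statsmodels.stats.power import TTestIndPower

# Sample size per group needed to detect a medium effect (Cohen's d = 0.5)
# with alpha = .05 and power = .80 in an independent-samples t-test.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n_per_group))  # roughly 64 participants per group
```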

Speed Test - Psychometric Glossary


A test that has a strict time limit, which is used to measure a subject's aptitude or ability while
working within a short timeline. The goal of the test is to answer as many items correctly as
possible within the time frame.
Speed tests are designed to assess how quickly a test taker is able to complete the items
within a set time period. The primary objective of speed tests is to measure the person's
ability to process information quickly and accurately while under time pressure.

Both speed tests and power tests are types of psychometric testing techniques that can
measure attributes like personality, aptitude (i.e., ability to do something), and intelligence.
Speed tests: these tests have a time limit in which to complete them. The questions are
usually similar in difficulty.

OR

7b) What is the meaning and purpose of item analysis?

Item analysis is the act of analyzing student responses to individual exam questions
with the intention of evaluating exam quality. It is an important tool to uphold test
effectiveness and fairness.

Item analysis is likely something educators do both consciously and unconsciously


on a regular basis. In fact, grading literally involves studying student responses and
the pattern of student errors, whether to a particular question or particular types of
questions.

But when the process is formalized, item analysis becomes a scientific method
through which tests can be improved, and academic integrity upheld.

Item analysis brings to light test quality in the following ways:

 Item Difficulty -- is the exam question (aka “item”) too easy or too
hard? When an item is one that every student either gets wrong or
correct, it decreases an exam’s reliability. If everyone gets a
particular answer correct, there’s less of a way to tell who really
understands the material with deep knowledge. Conversely, if
everyone gets a particular answer incorrect, then there’s no way to
differentiate those who’ve learned the material deeply.
 Item Discrimination -- does the exam question discriminate
between students who understand the material and those who do
not? Exam questions should suss out the varying degrees of
knowledge students have on the material, reflected by the
percentage correct on exam questions. Desirable discrimination can
be shown by comparing the correct answers to the total test scores of
students--i.e., do students who scored high overall have a higher rate
of correct answers on the item than those who scored low overall? If
you separate top scorers from bottom scorers, which group is getting
which answer correct?
 Item Distractors -- for multiple-choice exams, distractors play a
significant role. Do the incorrect options effectively draw test takers
who do not know the material away from the correct answer? For example, if
a multiple-choice question has four possible answers but two of them are
obviously incorrect, the question effectively gives a 50% chance of a
correct response by guessing. When distractors are obviously incorrect
rather than plausible, they become ineffective in assessing student
knowledge. An effective distractor will attract more test takers with a
lower overall score than those with a higher
overall score.
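A minimal Python sketch of the first two indices, item difficulty and upper-lower group discrimination,
assuming a matrix of 0/1 scored responses (the 27% grouping rule used below is a common convention, not
the only one):

```python
import numpy as np

def item_difficulty(scores: np.ndarray) -> np.ndarray:
    """Proportion of examinees answering each item correctly (the p value).

    scores: 2-D array, rows = examinees, columns = items, entries 0 or 1.
    """
    return scores.mean(axis=0)

def item_discrimination(scores: np.ndarray, group_frac: float = 0.27) -> np.ndarray:
    """Upper-lower discrimination index: proportion correct in the top-scoring
    group minus proportion correct in the bottom-scoring group."""
    totals = scores.sum(axis=1)
    order = np.argsort(totals)
    n_group = max(1, int(len(totals) * group_frac))
    low_group = scores[order[:n_group]]
    high_group = scores[order[-n_group:]]
    return high_group.mean(axis=0) - low_group.mean(axis=0)
```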

8a) Portray the process of test-retest reliability.

Test-retest reliability is a specific way to measure the reliability of
a test; it refers to the extent to which a test produces similar
results over time.

We calculate the test-retest reliability by using the Pearson


Correlation Coefficient, which takes on a value between -1 and
1 where:

 -1 indicates a perfectly negative linear correlation between


two scores
 0 indicates no linear correlation between two scores
 1 indicates a perfectly positive linear correlation between
two scores

For example, we may give an IQ test to 50 participants on


January 1st and then give the same type of IQ test of similar
difficulty to the same group of 50 participants one month later.

We could calculate the correlation of scores between the two


tests to determine if the test has good test-retest reliability.

Generally a test-retest reliability correlation of at least 0.80 or


higher indicates good reliability.

Calculating Test-Retest Reliability Coefficients


Finding a correlation coefficient for the two sets of data is one of the most common ways to find a
correlation between the two tests. Test-retest reliability coefficients (also called coefficients of
stability) vary between 0 and 1, where:
 1 : perfect reliability,
 ≥ 0.9: excellent reliability,
 ≥ 0.8 < 0.9: good reliability,
 ≥ 0.7 < 0.8: acceptable reliability,
 ≥ 0.6 < 0.7: questionable reliability,
 ≥ 0.5 < 0.6: poor reliability,
 < 0.5: unacceptable reliability,
 0: no reliability.
On this scale, a correlation of .9 would indicate a very high correlation (excellent reliability) and a
value of .1 a very low one (unacceptable reliability).

 For measuring reliability for two tests, use the Pearson Correlation Coefficient. One
disadvantage: it overestimates the true relationship for small samples (under 15).
 If you have more than two tests, use Intraclass Correlation. This can also be used for two tests,
and has the advantage it doesn’t overestimate relationships for small samples. However, it is
more challenging to calculate, compared to the simplicity of Pearson’s.
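A short sketch of the calculation with scipy, using two invented sets of scores for the same eight people:

```python
from scipy.stats import pearsonr

scores_time1 = [98, 105, 110, 121, 90, 102, 133, 95]   # hypothetical January scores
scores_time2 = [101, 103, 112, 118, 93, 99, 130, 97]   # hypothetical February scores

r, p_value = pearsonr(scores_time1, scores_time2)
print(f"test-retest r = {r:.2f}")  # values around .80 or above are usually taken as good
```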

OR

8b) Write a note on the importance of internal consistency reliability.

What is Internal Consistency Reliability?


Internal consistency reliability is a way to gauge how well a test or survey is actually measuring what
you want it to measure.

Is your test measuring what it’s supposed to?

A simple example: you want to find out how satisfied your customers are with the level of customer
service they receive at your call center. You send out a survey with three questions designed to
measure overall satisfaction. Choices for each question are: Strongly
agree/Agree/Neutral/Disagree/Strongly disagree.

1. I was satisfied with my experience.


2. I will probably recommend your company to others.
3. If I write an online review, it would be positive.
If the survey has good internal consistency, respondents should answer each question in a similar way,
e.g. three “agrees” or three “strongly disagrees.” If widely different answers are given, this may be a sign that your
questions are poorly worded and are not reliably measuring customer satisfaction. Most researchers
prefer to include at least two questions that measure the same thing (the above survey has three).
Another example: you give students a math test for number sense and logic. High internal
consistency would tell you that the test is measuring those constructs well. Low internal consistency
means that your math test is testing something else (like arithmetic skills) instead of, or in addition to,
number sense and logic.

Testing for Internal Consistency


In order to test for internal consistency, you should send out the surveys at the same time. Sending
the surveys out over different periods of time, while testing, could introduce confounding variables.
An informal way to test for internal consistency is just to compare the answers to see if they all agree
with each other. In real life, you will likely get a wide variety of answers, making it difficult to see if
internal consistency is good or not. A wide variety of statistical tests are available for internal
consistency; one of the most widely used is Cronbach’s Alpha.
 Average inter-item correlation finds the average of all correlations between pairs of questions.
 Split-Half Reliability: all items that measure the same thing are randomly split into two halves. The two
halves of the test are given to a group of people, and the correlation between the two sets of scores is
found. The split-half reliability is that correlation.
 Kuder-Richardson 20: the higher the Kuder-Richardson score (from 0 to 1), the stronger the
relationship between test items. A score of at least .70 is considered good reliability.
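A minimal sketch of Cronbach's alpha from its standard formula, alpha = (k / (k - 1)) x (1 - sum of item
variances / variance of total scores), applied to invented responses to the three customer-satisfaction
items above (coded 1-5):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array, rows = respondents, columns = items (numeric scores)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical responses to the three satisfaction items (1 = strongly disagree, 5 = strongly agree)
survey = np.array([
    [5, 5, 4],
    [2, 1, 2],
    [4, 5, 4],
    [3, 3, 3],
    [1, 2, 1],
])
print(round(cronbach_alpha(survey), 2))  # about 0.96 for this toy data
```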

9a) Bring out the consequences of construct validity.

The consequential aspect of construct validity includes evidence
for evaluating the intended and unintended consequences of score
interpretation and use. Consequences can be associated with bias
in scoring and interpretation or with unfairness in test use.
The major concern regarding negative consequences is that any
negative impact on individuals or groups should not derive from test
invalidity, such as construct underrepresentation or construct-
irrelevant variance (Messick, 1989).
A fundamental aspect of construct validity is construct representation. In
examining construct representation, through the use of cognitive-process analysis or
research on personality and motivation, one attempts to
identify the mechanisms underlying task performance. According
to Messick (1995) there are two major threats to construct validity:
construct underrepresentation (CU) and construct-irrelevant
variance (CIV). In construct underrepresentation the assessment is
too narrow and fails to include important dimensions or facets of the
construct. In contrast, with construct-irrelevant variance the assessment is
too broad, containing excess reliable variance (for example, variance associated with biased
responses) that interferes with an objective interpretation of the construct.

OR

9b) Examine the factors influencing test validity.

Factors Affecting Validity


1. Inappropriateness of the test item.
Measuring understanding, thinking skills, and other complex types of
achievement with test forms that are appropriate only for measuring factual
knowledge will invalidate the results (Asaad, 2004).

2. Directions of the test items.


Directions that are not clearly stated as to how the students respond to the
items and record their answers will tend to lessen the validity of the test items
(Asaad, 2004).

3. Reading vocabulary and sentence structure.


Vocabulary and sentence structures that do not match the level of the
students will result in the test of measuring reading comprehension or
intelligence rather than what it intends to measure (Asaad, 2004).

4. Level of difficulty of the test item.


When the test items are too easy and too difficult they cannot discriminate
between the bright and the poor students. Thus, it will lower the validity of the
test (Asaad, 2004).

5. Poorly constructed test items.


Test items that unintentionally provide clues to the answer will tend to
measure the students’ alertness in detecting clues and the important aspects
of students’ performance that the test is intended to measure will be affected
(Asaad, 2004).

6. Length of the test items.


A test should have a sufficient number of items to measure what it is
supposed to measure. If a test is too short to provide a representative sample
of the performance that is to be measured, validity will suffer accordingly
(Asaad, 2004).

7. Arrangement of the test items.


Test items should be arranged in increasing difficulty. Placing difficult items
early in the test may cause mental blocks and it may take up too much time
for the students; hence, students are prevented from reaching items they
could easily answer. Therefore, the improper arrangement may also affect
validity by having a detrimental effect on students’ motivation (Asaad, 2004).

8. Pattern of the answers.

A systematic pattern of correct answers (for example, alternating true and false)
allows students to guess, and this will again lower the validity
of the test (Asaad, 2004).

9. Ambiguity.
Ambiguous statements in test items contribute to misinterpretations and
confusion. Ambiguity sometimes confuses the bright students more than the
poor students, causing the items to discriminate in a negative direction
(Asaad, 2004).

10a) Explain the steps involved in developing norms.

OR

10b) Describe the process of selecting sample from target population.

Psychometric 2017

6a) Describe the classification of tests.

Types of Psychological Tests

Psychological tests can be classified into several major types.
Here are the nine major types of psychological tests:

1. Personality Tests
2. Achievement Tests
3. Attitude Tests
4. Aptitude Tests
5. Emotional Intelligence Tests
6. Intelligence Tests
7. Neuropsychological Tests
8. Projective Tests
9. Observation (Direct) Tests

OR

6b) What are rating scales ? Describe with suitable examples.


Rating Scale Definition
A rating scale is defined as a closed-ended survey question used to
represent respondent feedback in a comparative form for particular
features/products/services. It is one of the most established question types
for online and offline surveys, where survey respondents are expected to
rate an attribute or feature. The rating scale is a variant of the popular multiple-
choice question and is widely used to gather information that provides
relative information about a specific topic.

Examples of Rating Scale Questions


Rating scale questions are widely used in customer satisfaction as well as
employee satisfaction surveys to gather detailed information. Here are a
few examples of rating scale questions –

 Degree of Agreement: An organization has been intending to improve
the efficiency of its employees. After organizing multiple courses and
certifications for the employees, the management decides to conduct a
survey to know whether employees resonate with the ideology behind
these certifications. They can use a rating scale question, such as an even
or odd Likert scale (for example, a 5-point Likert scale), to evaluate the
degree of agreement.
 Customer Experience: It is important for organizations to gather real-time
details about product or service purchase experiences. A rating scale
question such as a semantic differential scale can help the
organization’s management to collect and analyze information about
customer experience.
 Analyze brand loyalty: Organizations thrive on customer loyalty
towards their brand. But brand loyalty is a factor which needs to be
regularly monitored. Using a rating scale question such as Net
Promoter Score can help organizations in garnering real-time details
about customer loyalty and brand shareability. A rating question: “On
a scale of 0-10, considering your purchasing experience, how likely
are you to recommend our brand to your friends and colleagues?”
can be effective in monitoring customer satisfaction and loyalty.

7a) Describe the meaning and purpose of item analysis.


Item analysis is the act of analyzing student responses to individual exam questions
with the intention of evaluating exam quality. It is an important tool to uphold test
effectiveness and fairness.

Item analysis is likely something educators do both consciously and unconsciously


on a regular basis. In fact, grading literally involves studying student responses and
the pattern of student errors, whether to a particular question or particular types of
questions.

But when the process is formalized, item analysis becomes a scientific method
through which tests can be improved, and academic integrity upheld.

Item analysis brings to light test quality in the following ways:

 Item Difficulty -- is the exam question (aka “item”) too easy or too
hard? When an item is one that every student either gets wrong or
correct, it decreases an exam’s reliability. If everyone gets a
particular answer correct, there’s less of a way to tell who really
understands the material with deep knowledge. Conversely, if
everyone gets a particular answer incorrect, then there’s no way to
differentiate those who’ve learned the material deeply.
 Item Discrimination -- does the exam question discriminate
between students who understand the material and those who do
not? Exam questions should suss out the varying degrees of
knowledge students have on the material, reflected by the
percentage correct on exam questions. Desirable discrimination can
be shown by comparing the correct answers to the total test scores of
students--i.e., do students who scored high overall have a higher rate
of correct answers on the item than those who scored low overall? If
you separate top scorers from bottom scorers, which group is getting
which answer correct?
 Item Distractors -- for multiple-choice exams, distractors play a
significant role. Do exam questions effectively distract test takers
from the correct answer? For example, if a multiple-choice question
has four possible answers, are two of the answers obviously
incorrect, thereby rendering the question with a 50/50 percent chance
of correct response? When distractors are ineffective and obviously
incorrect as opposed to being more disguised, then they become
ineffective in assessing student knowledge. An effective distractor will
attract test takers with a lower overall score than those with a higher
overall score.
Item analysis entails noting the pattern of student errors to various questions in all
the ways stated above. This analysis can provide distinct feedback on exam efficacy
and support exam design.

OR

7b) Write a note on index of discrimination.

index of discrimination

the degree to which a test or test item differentiates between individuals of different
performance levels, often given as the percentage difference between high-
performing and low-performing individuals who answer a target item correctly. Also
called discrimination index.

8a) What is index of reliability?

The index of reliability is a statistic that provides a theoretical estimate of the correlation
between actual scores of a psychometric test and the assumed true scores. The index is given
the value of the square root of r where r is the coefficient of reliability.
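For example, if the reliability coefficient of a test is r = .81, the index of reliability is the square root of .81 = .90, i.e., the estimated correlation between the obtained scores and the assumed true scores.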

OR

8b) Write a note on alternate-form reliability .

Alternate form reliability occurs when an individual participating in a research


or testing scenario is given two different versions of the same test at different
times. The scores are then compared to see if it is a reliable form of testing.
An individual is given one form of the test and after a period of time (usually a
week or so) the person is given a different version of the same test. If the
scores differ dramatically then something is wrong with the test and it is not
measuring what it is supposed to. If this is the case the test needs to be
analyzed more to determine what is wrong with the different forms.

9a) Write a note on the relationship between validity and reliability.


OR

9b) Describe face validity.

Face validity is the extent to which a test is subjectively viewed as covering the concept it
purports to measure. It refers to the transparency or relevance of a test as it appears to test
participants.[1][2] In other words, a test can be said to have face validity if it "looks like" it is going
to measure what it is supposed to measure.[3] For instance, if a test is prepared to measure
whether students can perform multiplication, and the people to whom it is shown all agree that it
looks like a good test of multiplication ability, this demonstrates face validity of the test. Face
validity is often contrasted with content validity and construct validity.
Some people use the term face validity to refer only to the validity of a test to observers who are
not expert in testing methodologies. For instance, if a test is designed to measure whether
children are good spellers, and parents are asked whether the test is a good test, this measures
the face validity of the test. If an expert is asked instead, some people would argue that this does
not measure face validity.[4] This distinction seems too careful for most applications.
Generally, face validity means that the test "looks like" it will work, as opposed to "has been
shown to work".

10a) Describe the steps in developing norms.

OR

10b) What is percentile norm

Percentile Norms

By percentile norms in a test is meant the different percentiles obtained by a large group of
students. In other words, percentile norms are those scores, the number of students obtaining
scores below than that is equal to the percentage of such students. For example, 75th percentile
norm tells that 75% students have scored below this score and only 25% students have obtained
scores above it. In calculating percentile norm, a candidate is compared with the group of which he
is a member. By percentile scores is meant the grade of a candidate in percentiles. Supposing 100
individuals are taking part in a race. One of them runs the fastest and stands first. He is better than
99 individuals, so his percentile value is 99. The individual standing second in the race is better than
98 individuals, so his percentile position is 98th. The distance between the first and second
individuals does not influence their percentile positions. No other individual follows the individual
running last, so his percentile position will be zero. In the same way, under educational situations,
when several students of the same or different schools are studied, it is quite convenient and useful
to transform their sequences into percentile ranks. In ordinary words, percentile is the point on the
scale below which a fixed percentage of the distribution falls. In order to know percentile value, a
test is administered on a large group and different percentile values are calculated based on scores
obtained by students. These percentile values are percentile norms. Because, it is possible to use
them on all individuals of the common group under all circumstances, so it can be said about them
that percentile norms provide a basis for interpreting the score of an individual in terms of his
standing in some particular group.
Uses of Percentile Norms

i. They can be analysed easily.
ii. It is not necessary to administer the test on a representative sample group, as is done for other
norms; therefore, no hypothesis has to be formulated for these norms, and so they are used widely.
iii. These norms are useful in all types of circumstances, such as the educational, industrial and
military fields.
iv. Percentile norms are easy to develop.
v. They can be used to meaningfully express scores with different units and numerical standards.
vi. They are used to interpret the findings of personality tests, IQ tests, attitude tests, aptitude
tests, etc.

Weaknesses of Percentile Norms

i. It is not possible to carry out much further statistical analysis on these norms.
ii. The percentile scores of different tests cannot be compared unless the groups on which they
were administered are comparable; for example, if in a personality test percentile norms have
been developed for adolescent girls taken from a large group, then the scores of other
adolescent girls can be compared with these.
iii. In normal situations, percentile norms tell the relative position of each individual, but they
do not bring out the difference in scores between two individuals.
iv. Percentile norms are often confused with percent scores.
v. Only the relative position of an individual is ascertained on the basis of these norms; it is not
possible to analyse the actual ability or capability of an individual objectively.
vi. The units of percentile scores are not uniform: scores that lie close together near the middle
of the distribution are spread far apart when converted into percentile values, while equally
large differences in scores at the extreme ends produce only small changes in percentile values.
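As a small illustration, a percentile rank within a norm group can be computed directly from the norm
group's raw scores; the scores below are invented:

```python
from scipy.stats import percentileofscore

# Hypothetical raw scores from a norm group
norm_group = [12, 15, 15, 18, 20, 22, 22, 25, 28, 30]

# Percentile rank of a new examinee's raw score of 22 within this group
print(percentileofscore(norm_group, 22, kind="mean"))  # about 60, i.e. roughly the 60th percentile
```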

Psychometric 2018

6a) What are the ways to reduce the effect of guessing while constructing a psychological
test?

OR

6b) Describe the characteristics of a good psychological test.

7a) What is the role of distracters in construction of test items?


OR

7b) What are ceiling and floor effects in the use of multiple choice tests?

A floor effect is when most of your subjects score near the bottom. There is very little variance
because the floor of your test is too high. In layperson terms, your questions are too hard for the group
you are testing. This is even more of a problem with multiple choice tests. With other types, if the
subject doesn’t know, they aren’t likely to guess that the answer is, say (a+b)(a-b) and so they get it
wrong. With a multiple-choice test with four choices, they will randomly get it correct 25% of the time. If
there are a bunch of questions that are too hard, you have a bunch of people randomly getting each one
right just by chance. Combine low variance with a lot of random error and your internal consistency
reliability is going to be very poor. So, let’s say you have exactly that on your pre-test. Then, when you test
again after some time, your control group, having had no training in the meantime, is still scoring low:
the problems are still too hard, and you still have random guessing and low variance, so any real change
is very hard to detect.
A ceiling effect is the opposite, all of your subjects score near the top. There is very little variance
because the ceiling of your test is too low. In layperson terms, your questions are too easy for the group
you are testing. Here you don’t have the problem of random guessing, but you do have low variance.
Think back to Statistics 101 – restriction of range attenuates correlations. Again, in layperson terms, if
you correlate height and weight of NBA players, for example, you find almost no relationship between
height and weight because they are ALL very tall and ALL very heavy. If you make the questions on your
pretest easier, that may give you better internal consistency reliability at pre-test, but since a good
percentage of your subjects knew the questions at the beginning, by the end of your training maybe
nearly all of them will, and then you run into a ceiling effect

8a) Write a note on common threats to reliability in psychological tests.

OR

8b) Explain odd-even reliability

9a) Write a note on measurement of validity.

OR

9b) Explain ecological validity .

In the behavioral sciences, ecological validity is often used to refer to the judgment of whether
a given study's variables and conclusions (often collected in lab) are sufficiently relevant to its
population (e.g. the "real world" context). Psychological studies are usually conducted in
laboratories though the goal of these studies is to understand human behavior in the real-world.
Ideally, an experiment would have generalizable results that predict behavior outside of the lab,
thus having more ecological validity. Ecological validity can be considered a commentary on the
relative strength of a study's implication(s) for policy, society, culture, etc.
This term was originally coined by Egon Brunswik[1] and held a very narrow meaning that has
since been conceptually modified. He regarded ecological validity as the utility of a perceptual
cue in predicting a property (basically how informative the cue is). For example, the movement of
leaves on trees is a perceptual cue to how windy it is outside. Therefore, trees rustling has high
ecological validity because it is highly correlated with it being windy.
Due to the evolving and broad definition of ecological validity, the term is often used problematically in
modern scientific studies because it is frequently left undefined and is interpreted differently across the
scientific community. In fact, in many cases simply being specific about what behavior/context is
being tested makes addressing ecological validity unnecessary in studies.

10a) What are the demerits of normative scores?

OR

10b) Write a note on the uses of grade equivalent scores.

Grade Equivalent Scores


Grade Equivalent scores, on the other hand, allow us to compare the total number of correct
answers the average test taker got. For example, an average 12-year old taking the 3 subtests that
make up the Broad Math portion of the Woodcock Johnson-III Test of Achievement would need
to get a total of 141 correct answers out of a total of 268 possible questions to score at the 50th
Percentile. How that test taker got those 141 correct math answers will depend upon the
individual.
For example, some people ace all their grade-level problems but then immediately get answers
wrong when they encounter concepts they haven’t learned yet. Other students make careless
mistakes on easy problems because they’re anxious to get to the more difficult – and more
interesting – problems. Still other students have very inconsistent math skills and do extremely
well in one area, such as calculations, but struggle with word problems. Just seeing a GE score
doesn’t give you any insight into how the student obtained their Raw Scores.

USES:

Grade Equivalent scores can be used to compare the number of correct answers children of
different ages or grades received on the same test. Those Raw Scores, however, will lead to
different Standard Scores based upon the test taker’s actual age or grade. Grade Equivalent scores
do not tell us that a child is actually achieving at a specific grade level.

A high GE score tells us that a child has been able to correctly answer far more questions than his
or her peers – but it tells us nothing more. At the same time, a high GE allows us to infer that the
student more than likely has the ability to handle a greater breadth or depth of material than they
are currently encountering, if they are in a typical age-grade placement.

Just how advanced the material should be is a question better judged by examining work samples
and talking directly to the child. If you are attempting to advocate for a grade skip through a
school, requesting that your child take the end-of-year assessment test for a specific grade level
subject will provide you with stronger data.

Psychometric 2019
6a) What are the steps in test construction?

“GENERAL STEPS OF TEST CONSTRUCTION”

The development of a good psychological test requires thoughtful and sound


application of established principles of test construction. Before the real work of test
construction, the test constructor takes some broad decisions about the major
objectives of the test in general terms and population for whom the test is intended
and also indicates the possible conditions under which the test can be used and its
important uses.
These preliminary decisions have far-reaching consequences. For example, a test
constructor may decide to construct an intelligence test meant for students of tenth
grade broadly aiming at diagnosing the manipulative and organizational ability of the
pupils. Having decided the above preliminary things, the test constructor goes ahead
with the following steps:
1. Planning
2. Writing items for the test.
3. Preliminary administration of the test.
4. Reliability of the final test.
5. The validity of the final test.
6. Preparation of norms for the final test.
7. Preparation of manual and reproduction of the test.
1. PLANNING:
The first step in test construction is careful planning. At this stage, the test
constructor addresses the following issues:
 DEFINITION OF THE CONSTRUCT:
Definition of the construct to be measured by the proposed test.
 OBJECTIVE OF THE TEST:
The author has to spell out the broad and specific objectives of the test in clear
terms. That is the prospective users (For example Vocational counselors, Clinical
psychologists, Educationalists) and the purpose or purposes for which they will use
the test.
 POPULATION:
What will be the appropriate age range, educational level and cultural background of
the examinees, who would find it desirable to take the test?
 CONTENT OF THE TEST:
What will be the content of the test? Is this content coverage different from that of
the existing tests developed for the same or similar purposes? Is this cultural-
specific?
 TEST FORMAT:
The author has to decide what the nature of the items would be, that is, whether the
test will be multiple-choice, true-false, free response, or in some other form.
 TYPE OF INSTRUCTIONS:
What would be the type of instructions, i.e., written or delivered orally?
 TEST ADMINISTRATION:
Will the test be administered individually or in groups? Will the test be
designed or modified for computer administration? A detailed arrangement for the
preliminary and final administrations should be considered.
 USER QUALIFICATION AND PROFESSIONAL COMPETENCE:
What special training or qualifications will be necessary for administering or
interpreting the test?
 PROBABLE LENGTH, TIME AND STATISTICAL METHODS:
The test constructor must also decide on the probable length of the test, the time allowed
for its completion, and the statistical methods to be used.
 METHOD OF SAMPLING:
What would be the method of sampling, i.e., random or selective?
 ETHICAL AND SOCIAL CONSIDERATION:
Is there any potential harm for the examinees resulting from the administration of
this test? Are there any safeguards built into the recommended testing procedure to
prevent any sort of harm to anyone involved in the use of this test.
 INTERPRETATION OF SCORES:
How will the scores be interpreted? Will the scores of an examinee be compared to
others in the criteria group or will they be used to assess mastery of a specific
content area? To answer this question, the author has to decide whether the
proposed test will be criterion-referenced or norm-referenced.
 MANUAL AND REPRODUCTION OF TEST:
Planning also includes deciding the total number of copies to be reproduced and the
preparation of a manual.

OR

6b) Write a note on ethical issues in psychological testing.

Ethics refers to the correct rules of conduct necessary when carrying out
research. We have a moral responsibility to protect research participants from
harm.

However important the issue under investigation, psychologists need to
remember that they have a duty to respect the rights and dignity of research
participants. This means that they must abide by certain moral principles and
rules of conduct.

Informed Consent
Whenever possible investigators should obtain the consent of participants. In
practice, this means it is not sufficient to simply get potential participants to
say “Yes”.
They also need to know what it is that they are agreeing to. In other words, the
psychologist should, so far as is practicable explain what is involved in
advance and obtain the informed consent of participants.

Debrief
After the research is over the participant should be able to discuss the
procedure and the findings with the psychologist. They must be given a
general idea of what the researcher was investigating and why, and their part
in the research should be explained.
Participants must be told if they have been deceived and given reasons why.
They must be asked if they have any questions and those questions should be
answered honestly and as fully as possible.
Debriefing should take place as soon as possible and be as full as possible;
experimenters should take reasonable steps to ensure that participants
understand debriefing.

Protection of Participants
Researchers must ensure that those taking part in research will not be caused
distress. They must be protected from physical and mental harm. This means
you must not embarrass, frighten, offend or harm participants.

Confidentiality
Participants, and the data gained from them must be kept anonymous unless
they give their full consent. No names must be used in a lab report.
What do we do if we find out something which should be disclosed (e.g.
criminal act)? Researchers have no legal obligation to disclose criminal acts
and have to determine which is the most important consideration: their duty
to the participant vs. duty to the wider community.
7a) Write a note on item characteristic curve.

item characteristic curve (ICC)

a plot of the probability that a test item is answered correctly against the examinee’s
underlying ability on the trait being measured. The item characteristic curve is the
basic building block of item response theory: The curve is bounded between 0 and 1,
is monotonically increasing, and is commonly assumed to take the shape of a
logistic function. Each item in a test has its own item characteristic curve.
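A brief sketch of one common functional form, the two-parameter logistic (2PL) item characteristic
curve; the discrimination (a) and difficulty (b) values below are arbitrary examples:

```python
import numpy as np

def icc_2pl(theta, a=1.2, b=0.0):
    """Probability of a correct response under a 2PL item response model.

    theta: examinee ability; a: item discrimination; b: item difficulty.
    """
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

abilities = np.linspace(-3, 3, 7)
for theta, p in zip(abilities, icc_2pl(abilities)):
    print(f"theta = {theta:+.1f}  P(correct) = {p:.2f}")
```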

OR

7b) Describe power and speed tests.

Answered

8a) Describe the test-retest reliability.

Test-Retest Reliability (sometimes called retest reliability) measures test consistency — the reliability
of a test measured over time. In other words, give the same test twice to the same people at different
times to see if the scores are the same. For example, test on a Monday, then again the following
Monday. The two scores are then correlated

Bias is a known problem with this type of reliability test, due to:
 Feedback between tests,
 Participants gaining knowledge about the purpose of the test, so they are more prepared the
second time around.
This type of reliability study can also take a long time to complete: depending upon the length
of time between the two administrations, it could be months or even years before the correlation can be calculated.

Calculating Test-Retest Reliability Coefficients


Finding a correlation coefficient for the two sets of data is one of the most common ways to find a
correlation between the two tests. Test-retest reliability coefficients (also called coefficients of
stability) vary between 0 and 1, where:
 1 : perfect reliability,
 ≥ 0.9: excellent reliability,
 ≥ 0.8 < 0.9: good reliability,
 ≥ 0.7 < 0.8: acceptable reliability,
 ≥ 0.6 < 0.7: questionable reliability,
 ≥ 0.5 < 0.6: poor reliability,
 < 0.5: unacceptable reliability,
 0: no reliability.
On this scale, a correlation of .9 would indicate a very high correlation (excellent reliability) and a
value of .1 a very low one (unacceptable reliability).

 For measuring reliability for two tests, use the Pearson Correlation Coefficient. One
disadvantage: it overestimates the true relationship for small samples (under 15).
 If you have more than two tests, use Intraclass Correlation. This can also be used for two tests,
and has the advantage it doesn’t overestimate relationships for small samples. However, it is
more challenging to calculate, compared to the simplicity of Pearson’s.

OR

8b) Write a note on scorer reliability .

Scorer reliability refers to the consistency with which different people who
score the same test agree. For a test with a definite answer key, scorer
reliability is of negligible concern. When the subject responds with his own
words, handwriting, and organization of subject matter, however, the
preconceptions of different raters produce different scores for the same test
from one rater to another; that is, the test shows scorer (or rater)
unreliability. In the absence of an objective scoring key, a scorer’s
evaluation may differ from one time to another and from those of equally
respected evaluators. Other things being equal, tests that permit objective
scoring are preferred.

OR
In statistics, inter-rater reliability (also called by various similar names, such as inter-rater
agreement, inter-rater concordance, inter-observer reliability, and so on) is the degree of
agreement among independent observers who rate, code, or assess the same phenomenon.

In contrast, intra-rater reliability is a score of the consistency in ratings given by the same person
across multiple instances. For example, the grader should not let elements like fatigue influence
their grading towards the end, or let a good paper influence the grading of the next paper. The
grader should not compare papers together, but they should grade each paper based on the
standard.
Inter-rater and intra-rater reliability are aspects of test validity. Assessments of them are useful in
refining the tools given to human judges, for example, by determining if a particular scale is
appropriate for measuring a particular variable. If various raters do not agree, either the scale is
defective or the raters need to be re-trained.
There are a number of statistics that can be used to determine inter-rater reliability. Different
statistics are appropriate for different types of measurement. Some options are joint-probability of
agreement, Cohen's kappa, Scott's pi and the related Fleiss' kappa, inter-rater correlation,
concordance correlation coefficient, intra-class correlation, and Krippendorff's alpha.
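One of the statistics listed above, Cohen's kappa, can be computed with scikit-learn; the ratings below
are made up for illustration:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical categorical ratings of the same ten responses by two raters
rater_1 = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
rater_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa = {kappa:.2f}")  # 1.0 = perfect agreement, 0 = agreement expected by chance
```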
9a) Write a note on criterion validity.

In psychometrics, criterion validity, or criterion-related validity, is the extent to which an


operationalization of a construct, such as a test, relates to, or predicts, a theoretical
representation of the construct—the criterion.[1][2] Criterion validity is often divided into concurrent
and predictive validity based on the timing of measurement for the "predictor" and outcome.[2]
Concurrent validity refers to a comparison between the measure in question and an
outcome assessed at the same time. In Standards for Educational & Psychological Tests, it
states, "concurrent validity reflects only the status quo at a particular time."[3]
on the other hand, compares the measure in question with an outcome assessed at a later time.
Although concurrent and predictive validity are similar, it is cautioned to keep the terms and
findings separated. "Concurrent validity should not be used as a substitute for predictive validity
without an appropriate supporting rationale."[3] Criterion validity is typically assessed by
comparison with a gold standard test. [4]
An example of concurrent validity is a comparison of the scores of the CLEP College Algebra
exam with course grades in college algebra to determine the degree to which scores on the
CLEP are related to performance in a college algebra class.[5] An example of predictive validity is
the IQ test, which was originally developed to predict future school performance. Another example is a
comparison of scores on the SAT with first semester grade point average (GPA) in college; this
assesses the degree to which SAT scores are predictive of college performance.

OR

9b) Describe the factors influencing validity of a test.

Answered

10a) Describe the age - equivalent norms.

OR

10b) Explain standard score norms.

https://www.illuminateed.com/blog/2017/05/understanding-test-norms/
SHORT ANSWER

Psychometric studies 2016

10.Define psychophysical scaling.

psychophysical scaling

any of the techniques used to construct scales relating physical stimulus properties to
perceived magnitude. For example, a respondent in a study may have to indicate the
roughness of several different materials that vary in texture. Methods are often classified
as direct or indirect, based on how the observer judges magnitude.

12. Briefly explain item difficulty.

Item difficulty is an estimate of the skill level needed to pass an item. It is


frequently measured by calculating the proportion of individuals passing an
item. In order to increase efficient use of both the examiner’s and the
examinee’s time, the item difficulty index values can be used to order the
items for administration so that a discontinue rule can be invoked to reduce the
administration of more difficult items to individuals who would be unlikely
to pass them.

13. What is index of reliability ?

The index of reliability is a statistic that provides a theoretical estimate of the


correlation between actual scores of a psychometric test and the assumed true
scores. The index is given the value of the square root of r where r is the coefficient of
reliability.

14. What is criterion validity?

In psychometrics, criterion validity, or criterion-related validity, is the extent to which an


operationalization of a construct, such as a test, relates to, or predicts, a theoretical
representation of the construct—the criterion.[1][2] Criterion validity is often divided into concurrent
and predictive validity based on the timing of measurement for the "predictor" and outcome

15. Define population.

A population is a distinct group of individuals, whether that group


comprises a nation or a group of people with a common characteristic.
In statistics, a population is the pool of individuals from which a statistical
sample is drawn for a study. Thus, any selection of individuals grouped
together by a common feature can be said to be a population.

Psychometric 2017

11.What is measurement?

Measurement is the assignment of scores to individuals so that the scores


represent some characteristic of the individuals. This very general
definition is consistent with the kinds of measurement that everyone is
familiar with—for example, weighing oneself by stepping onto a bathroom
scale, or checking the internal temperature of a roasting turkey by inserting
a meat thermometer

12. What is item difficulty?

Item difficulty is an estimate of the skill level needed to pass an item. It is frequently
measured by calculating the proportion of individuals passing an item

13. What is scorer reliability?

Scorer reliability refers to the consistency with which different people who
score the same test agree. For a test with a definite answer key, scorer
reliability is of negligible concern

14. Define construct validity.

Construct validity is the accumulation of evidence to support the interpretation of what a


measure reflects.[1][2][3][4] Modern validity theory defines construct validity as the overarching
concern of validity research, subsuming all other types of validity evidence[5][6] such as content
validity and criterion validity.[7][8]

15. What are standard score norms?


Psychometric 2018

11.Define psychological scaling.

psychophysical scaling

any of the techniques used to construct scales relating physical stimulus properties to
perceived magnitude. For example, a respondent in a study may have to indicate the
roughness of several different materials that vary in texture. Methods are often classified as
direct or indirect, based on how the observer judges magnitude.

12. What is the use of item discrimination index?

Item discrimination refers to the ability of an item to differentiate among students on the
basis of how well they know the material being tested. Various hand calculation
procedures have traditionally been used to compare item responses to total test scores
using high and low scoring groups of students

13 Write a note on significance of test-retest reliability.

Having good test re-test reliability signifies the internal validity of a test and ensures that the
measurements obtained in one sitting are both representative and stable over time. Often,
test re-test reliability analyses are conducted over two time-points (T1, T2) over a relatively
short period of time, to mitigate against conclusions being due to age-related changes in
performance, as opposed to poor test stability.

14. Define validity

1. the characteristic of being founded on truth, accuracy, fact, or law.

2. the degree to which empirical evidence and theoretical rationales support the
adequacy and appropriateness of conclusions drawn from some form of
assessment. Validity has multiple forms, depending on the research question and on
the particular type of inference being made

15 Write a note on percentile norms


By percentile norms in a test is meant the different percentiles obtained by a large group of
students. In other words, percentile norms are those scores, the number of students
obtaining scores below than that is equal to the percentage of such students.

Psychometric 2019

11.What are the characteristics of a psychological test?

Five main characteristics of a good psychological test are as follows:


1. Objectivity 2. Reliability 3. Validity 4. Norms 5. Practicability!

1. Objectivity:
The test should be free from subjective judgement regarding the
ability, skill, knowledge, trait or potentiality to be measured and
evaluated.

2. Reliability:

This refers to the extent to which the obtained results are
consistent or reliable. When the test is administered to the same
sample more than once with a reasonable gap of time, a reliable
test will yield the same scores. It means the test is trustworthy. There
are many methods of testing the reliability of a test.

3. Validity:
It refers to the extent to which the test measures what it intends to
measure. For example, when an intelligence test is developed to
assess the level of intelligence, it should assess the intelligence of
the person, not other factors.

Validity explains us whether the test fulfils the objective of its


development. There are many methods to assess validity of a test.

4. Norms:
Norms refer to the average performance of a representative sample
on a given test. It gives a picture of average standard of a particular
sample in a particular aspect. Norms are the standard scores,
developed by the person who develops test. The future users of the
test can compare their scores with norms to know the level of their
sample.

5. Practicability:
The test must be practicable in terms of the time required for completion, the
length, the number of items or questions, scoring, etc. The test should
not be too lengthy, and it should not be difficult to answer or to score.

12.Explain the item characteristics curve.

item characteristic curve (ICC)

a plot of the probability that a test item is answered correctly against the examinee’s
underlying ability on the trait being measured. The item characteristic curve is the
basic building block of item response theory: The curve is bounded between 0 and 1,
is monotonically increasing, and is commonly assumed to take the shape of a
logistic function. Each item in a test has its own item characteristic curve.

13. What is split-half reliability?

Split-half reliability is a statistical method used to measure the consistency of the scores of a test.
It is a form of internal consistency reliability and had been commonly used before the coefficient
α was invented. Split-half reliability is a convenient alternative to other forms of reliability,
including test–retest reliability and parallel forms reliability because it requires only one
administration of the test. As can be inferred from its name, the method involves splitting a test
into halves and correlating examinees’ scores on the two halves of the test. The resulting
correlation is then adjusted for test length using the Spearman-Brown prophecy formula.
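For example, if the correlation between scores on the two halves is .70, the Spearman-Brown corrected estimate of the full-length test's reliability is (2 x .70) / (1 + .70), which is approximately .82.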

14. What is face validity?

Face validity is the extent to which a test is subjectively viewed as covering the concept it
purports to measure. It refers to the transparency or relevance of a test as it appears to test
participants.[1][2] In other words, a test can be said to have face validity if it "looks like" it is going
to measure what it is supposed to measure.[3] For instance, if a test is prepared to measure
whether students can perform multiplication, and the people to whom it is shown all agree that it
looks like a good test of multiplication ability, this demonstrates face validity of the test. Face
validity is often contrasted with content validity and construct validity.

15. What are the types of norms?

There are four kinds of norms i.e. Age norms, Grade norms, Percentile norms and
Standard score norms.
