
Reliability & Validity
Dr. Humaira Naz
Assistant Professor
• The validity and reliability of the instrument are essential in research data collection. The quality of the research results depends on the quality of the data, and whether the data are accurate depends in turn on whether the research instrument itself is sound.
• Validity is a measure of the degree to which a research instrument is valid. An instrument is said to be valid if it is able to measure what it is intended to measure, that is, if it can reveal the data of the variables being studied.
Reliability
• Cronbach's alpha is the most common measure of internal consistency ("reliability"). It is most commonly used when there are multiple Likert questions in a survey/questionnaire that form a scale and you wish to determine if the scale is reliable.
• Cronbach's α is a measure of internal consistency, that is, of how closely related a set of items are as a group. It can also be defined as a measure of scale reliability.
• Cronbach's alpha is sometimes described as a function of the number of items in a test, the average covariance between pairs of items, and the variance of the total score.
• The dependability of a measurement refers to the extent to which it is a dependable measure of a concept, and Cronbach's α is a widely used way of quantifying that consistency.
• It is computed by correlating each item's score with the total score for each observation, and then comparing the variance of the individual items with the variance of the total score.
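In formula terms, α = (k / (k - 1)) * (1 - Σ s²_item / s²_total), where k is the number of items, s²_item is the variance of each item, and s²_total is the variance of the total score. A minimal Python sketch of this calculation, using an invented response matrix, is shown below.

```python
# Minimal sketch of Cronbach's alpha from its standard formula,
# alpha = (k / (k - 1)) * (1 - sum(item variances) / variance(total score)).
# The 5-item response matrix below is hypothetical, purely for illustration.
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array, one row per respondent, one column per item."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

responses = np.array([      # 6 respondents x 5 Likert items (invented)
    [4, 5, 4, 4, 5],
    [3, 3, 4, 3, 3],
    [5, 5, 5, 4, 5],
    [2, 2, 3, 2, 2],
    [4, 4, 4, 5, 4],
    [3, 2, 3, 3, 3],
])
print(f"Cronbach's alpha = {cronbach_alpha(responses):.3f}")
```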
• A Cronbach's α of 0.7 or higher is usually considered acceptable. However, when evaluating a scale's reliability it is worth considering other factors, such as the face and construct validity of the items in the measure. Once you have obtained the Cronbach's alpha value, the next step is to conduct additional analyses; for instance, you may want to run an exploratory factor analysis to examine the unidimensionality of the scale. There are also assumptions behind this method, including:
• Items are ordinal
• The scale is unidimensional
• Cronbach's α is an essential concept used in the assessment and evaluation of questionnaires.
• It is the most widely used index of internal consistency, or reliability.
• Cronbach's alpha is commonly applied when there are several Likert questions in a survey or questionnaire.
• Cronbach's alpha is usually reported on a scale from 0 to 1, with larger values representing greater reliability.

α       Internal consistency
> 0.9   excellent
> 0.8   good
> 0.7   acceptable
> 0.6   questionable
> 0.5   poor
≤ 0.5   unacceptable
• A researcher has devised a nine-question questionnaire to measure how safe people feel at work at an industrial complex. Each question was a 5-point Likert item from "strongly disagree" to "strongly agree". In order to understand whether the questions in this questionnaire all reliably measure the same latent variable (feeling of safety), so that a Likert scale could be constructed, Cronbach's alpha was run on a sample of 15 workers.
[SPSS output: Reliability Statistics and Item-Total Statistics tables]
• The "Cronbach's Alpha if Item Deleted" column shows that removing any question, except question 8, would result in a lower Cronbach's alpha.
• Therefore, we would not want to remove those questions. Removal of question 8 would lead to a small improvement in Cronbach's alpha, and we can also see that the "Corrected Item-Total Correlation" value was low (0.128) for this item. This might lead us to consider whether we should remove this item.
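The same item-level diagnostics can be computed outside SPSS. The sketch below is a rough NumPy illustration on randomly generated stand-in data, so its numbers will not match the questionnaire output discussed above.

```python
# Sketch of the two diagnostics discussed above: "Cronbach's Alpha if Item
# Deleted" and the "Corrected Item-Total Correlation", computed with NumPy.
# The 15 x 9 response matrix is randomly generated stand-in data.
import numpy as np

def cronbach_alpha(items):
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                            / items.sum(axis=1).var(ddof=1))

def item_total_statistics(items):
    items = np.asarray(items, dtype=float)
    for j in range(items.shape[1]):
        rest = np.delete(items, j, axis=1)      # the scale without item j
        alpha_if_deleted = cronbach_alpha(rest)
        # corrected item-total correlation: item j vs. the sum of the other items
        r_corrected = np.corrcoef(items[:, j], rest.sum(axis=1))[0, 1]
        print(f"Item {j + 1}: alpha if deleted = {alpha_if_deleted:.3f}, "
              f"corrected item-total r = {r_corrected:.3f}")

rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=(15, 9))    # fake 5-point Likert answers
item_total_statistics(responses)
```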
• Cronbach's alpha simply provides you with an overall reliability coefficient for a set of variables (e.g., questions).
• If your questions reflect different underlying personal qualities (or other dimensions), for example employee motivation and employee commitment, Cronbach's alpha will not be able to distinguish between these. In order to do this, and then check the reliability of each dimension (using Cronbach's alpha), you will first need to run a test such as a principal components analysis (PCA).
Negative Cronbach's alpha
• A negative Cronbach's α means that there is inconsistent coding, or a mixture of items that measure different dimensions.
• If you get a negative Cronbach's α, you can use a factor analysis to check the factorial structure and the correlations between the extracted factors.
• You can also draw on theory or prior results to help predict whether the results will be positive or negative.
• Note that a negative Cronbach's alpha usually indicates a violation of the reliability model's assumptions.
• If the scale items are binary, the Kuder-Richardson Formula 20 is used in place of Cronbach's alpha.
• The Kuder-Richardson Formula 20, often abbreviated KR-20, measures the internal consistency reliability of a test in which each question has only two answers: right or wrong.
• The Kuder-Richardson Formula 20 is as follows:
• KR-20 = (k / (k - 1)) * (1 - Σ p_j q_j / σ²)
• where:
• k: total number of questions
• p_j: proportion of individuals who answered question j correctly
• q_j: proportion of individuals who answered question j incorrectly
• σ²: variance of total scores for all individuals who took the test
• The KR-20 value turns out to be 0.0603. Since this value is extremely low, it indicates that the test has low reliability. This means the questions may need to be re-written or re-phrased in such a way that the reliability of the test can be increased.
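For illustration, the KR-20 formula can be coded directly; the sketch below uses an invented right/wrong matrix rather than the data behind the 0.0603 result.

```python
# Sketch of the KR-20 calculation for dichotomous (right/wrong) items,
# following the formula above. The score matrix is invented, so it does not
# reproduce the 0.0603 value reported on the slide.
import numpy as np

def kr20(binary_items):
    """binary_items: rows = test takers, columns = items scored 1 or 0."""
    x = np.asarray(binary_items, dtype=float)
    k = x.shape[1]
    p = x.mean(axis=0)                        # proportion correct per item
    q = 1 - p                                 # proportion incorrect per item
    total_var = x.sum(axis=1).var(ddof=1)     # sample variance of total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

scores = np.array([       # 6 test takers x 5 items (invented)
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0],
])
print(f"KR-20 = {kr20(scores):.4f}")
```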
Test-Retest Reliability Method
• Determines how much error in the test results is due to administration problems, e.g. a loud environment, poor lighting, or insufficient time to complete the test.
• This method uses the following process:
• Administer a test to a group of individuals.
• Wait some amount of time (days, weeks, or months) and administer the same test to the same group of individuals.
• Calculate the correlation between the scores of the two tests.
• Generally a test-retest correlation of at least 0.80 indicates good reliability.
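Since the final step is simply a correlation between the two administrations, it can be sketched in a few lines; the scores below are hypothetical.

```python
# Sketch of the final step: correlate the two administrations of the same test.
# The score lists are hypothetical.
from scipy.stats import pearsonr

time1 = [78, 85, 62, 90, 71, 66, 88, 74, 80, 69]   # first administration
time2 = [75, 88, 60, 87, 74, 70, 85, 72, 83, 65]   # same people, weeks later

r, p = pearsonr(time1, time2)
print(f"Test-retest r = {r:.3f}")   # r of at least 0.80 suggests good reliability
```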
1. Split-Half Reliability Method
• A measure of internal consistency: how well the test components contribute to the construct that is being measured. It is most commonly used for multiple-choice tests, but you can theoretically use it for any type of test, even tests with essay questions.
• Determines how much error in the test results is due to poor test construction, e.g. poorly worded questions or confusing instructions.
• This method uses the following process:
• Split a test into two halves. For example, one half may be composed of even-numbered questions while the other half is composed of odd-numbered questions.
• Administer each half to the same individuals.
• Calculate the correlation between the scores for both halves.
• A Pearson's r or Spearman's rho correlation is run between the two halves of the instrument.
• The higher the correlation between the two halves, the higher the internal consistency of the test or survey. Ideally, a high correlation between the halves indicates that all parts of the test are contributing equally to what is being measured.
• https://www.youtube.com/watch?v=GK82PJCncNk
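A rough sketch of the odd-even approach is shown below on simulated data; the Spearman-Brown correction in the last line is a common extra step that is not covered in the slides.

```python
# Sketch of an odd-even split-half analysis. The item scores are simulated
# around a latent trait so the two halves correlate; the Spearman-Brown step
# adjusts for the fact that each half is only half the test length.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
trait = rng.normal(size=(20, 1))                       # latent "true score"
items = np.clip(np.rint(3 + trait + rng.normal(scale=0.7, size=(20, 10))), 1, 5)

odd_half = items[:, 0::2].sum(axis=1)                  # items 1, 3, 5, ...
even_half = items[:, 1::2].sum(axis=1)                 # items 2, 4, 6, ...

r_half, _ = pearsonr(odd_half, even_half)              # correlation of the halves
r_full = 2 * r_half / (1 + r_half)                     # Spearman-Brown correction
print(f"Half-half r = {r_half:.3f}, Spearman-Brown corrected = {r_full:.3f}")
```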
• One option is simply to divide the measurement procedure in half; that is, take the scores from the measures/items in the first half of the measurement procedure and compare them to the scores from the measures/items in the second half of the measurement procedure.
• This can be problematic because of (a) issues of test design (e.g., easier/harder questions are in the first/second half of the measurement procedure), (b) participant fatigue/concentration/focus (i.e., scores may decrease during the second half of the measurement procedure), and (c) different items/types of content in different parts of the test.
• Another option is to compare odd- and even-numbered items/measures from the measurement procedure. The aim of this method is to try and match the measures/items that are being compared in terms of content, test design (i.e., difficulty), participant demands, and so forth.
• This helps to avoid some of the potential biases that arise from simply dividing the measurement procedure in two.
Drawbacks
• One drawback with this method is that it only works for a large set of questions (a 100-point test is recommended) which all measure the same construct/area of knowledge.
• For example, a personality inventory that measures introversion, extroversion, depression and a variety of other personality traits is not a good candidate for split-half testing.
Parallel Forms Reliability Method
• Determines how much error in the test results is due to outside effects, e.g. students getting access to questions ahead of time or students getting better scores simply by practicing more.
• This method uses the following process:
• Administer one version of a test to a group of individuals.
• Administer an alternate but equally difficult version of the test to the same group of individuals.
• Calculate the correlation between the scores of the two tests.
• Split-half reliability is similar to parallel forms reliability, which uses one set of questions divided into two equivalent sets. In parallel forms reliability the sets are given to the same students, usually within a short time frame, like one set of test questions on Monday and another set on Friday. With split-half reliability, the two halves are given to one group of students who sit the test at the same time.
• Another difference: the two tests in parallel forms reliability are equivalent and are independent of each other. This is not true with split-half reliability; the two sets do not have to be equivalent ("parallel").
Inter-rater Reliability Method
• Inter-rater reliability is a measure used to examine the agreement between two people (raters/observers) on the assignment of categories of a categorical variable. It is an important measure in determining how well an implementation of some coding or measurement system works.
• Example:
• Determines how consistently each item on a test measures the true construct being measured, e.g. are all questions clearly communicated and relevant to the construct being measured?
• This method involves having multiple qualified raters or judges rate each item on a test and then calculating the overall percent agreement between raters or judges.
• The higher the percent agreement between judges, the higher the reliability of the test.
Continuous Ratings, Two Judges
• Suppose that we have two judges rating the aggressiveness of each of a group of children on a playground. If the judges agree with one another, then there should be a high correlation between the ratings given by the one judge and those given by the other. Accordingly, one thing we can do to assess inter-rater agreement is to correlate the two judges' ratings.
• Consider the following ratings (they also happen to be ranks) of ten subjects:
  Subject  1   2   3   4   5   6   7   8   9   10
  Judge 1  10  9   8   7   6   5   4   3   2   1
  Judge 2  9   10  8   7   5   6   4   3   1   2
• Statistical test: the Pearson correlation, r = .964. If our scores are ranks, or we can justify converting them to ranks, we can compute the Spearman correlation coefficient or Kendall's tau. For these data Spearman's rho is .964 and Kendall's tau is .867.
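These coefficients can be checked with SciPy on the ratings listed above:

```python
# Sketch reproducing the three coefficients quoted above for Judges 1 and 2.
from scipy.stats import pearsonr, spearmanr, kendalltau

judge1 = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
judge2 = [9, 10, 8, 7, 5, 6, 4, 3, 1, 2]

print(f"Pearson r     = {pearsonr(judge1, judge2)[0]:.3f}")    # ~ .964
print(f"Spearman rho  = {spearmanr(judge1, judge2)[0]:.3f}")   # ~ .964
print(f"Kendall's tau = {kendalltau(judge1, judge2)[0]:.3f}")  # ~ .867
```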
• We must, however, consider the fact that two judges' scores could be highly correlated with one another but show little agreement. Consider the following data:
  Subject  1   2    3   4   5   6   7   8   9   10
  Judge 4  10  9    8   7   6   5   4   3   2   1
  Judge 5  90  100  80  70  50  60  40  30  10  20
• The correlations between Judges 4 and 5 are identical to those between Judges 1 and 2, but Judges 4 and 5 obviously do not agree with one another well. Judges 4 and 5 agree on the ordering of the children with respect to their aggressiveness, but not on the overall amount of aggressiveness shown by the children.
• One solution to this problem is to compute the intraclass correlation coefficient (ICC).
• For the example data, the intraclass correlation coefficient between Judges 1 and 2 is .9672, while that between Judges 4 and 5 is .0535.
• The main reason for all of the complexity around the ICC is that it is very flexible and can handle designs in which not every rater rates every ratee. For example, say you have a group of 10 raters who rate 20 ratees. If 9 of the raters rate 15 of the ratees and 1 rater rates all of them, or if the 10 raters rate 2 each, you can still calculate the ICC.
• In SPSS (Analyze > Scale > Reliability Analysis, then click Statistics), you are given different options for calculating the ICC.
• For inconsistent raters/ratees, use "One-Way Random."
• For consistent raters/ratees (e.g. the same 10 raters each rate 10 ratees), where the raters are a sample from a larger population, use "Two-Way Random."
• There are several different versions of the ICC that can be calculated, depending on the following three factors:
• Model: One-Way Random Effects, Two-Way Random Effects, or Two-Way Mixed Effects
• Type of Relationship: Consistency or Absolute Agreement
• Unit: Single rater or the mean of raters
• 1. One-way random effects model: This model assumes that each subject is rated by a different group of randomly chosen raters. Using this model, the raters are considered the source of random effects. This model is rarely used in practice because the same group of raters is usually used to rate each subject.
• 2. Two-way random effects model: This model assumes that a group of k raters is randomly selected from a population and then used to rate subjects. Using this model, both the raters and the subjects are considered sources of random effects. This model is often used when we'd like to generalize our findings to any raters who are similar to the raters used in the study.
• 3. Two-way mixed effects model: This model also assumes that a group of k raters is randomly selected from a population and then used to rate subjects. However, this model assumes that the group of raters we chose are the only raters of interest, which means we aren't interested in generalizing our findings to any other raters who might also share similar characteristics with the raters used in the study.
Types of relationships
• 1. Consistency: concerned with systematic differences between the ratings of judges (e.g. did the judges rate the same subjects as relatively low and high?)
• 2. Absolute agreement: concerned with absolute differences between the ratings of judges (e.g. what is the absolute difference in ratings between judge A and judge B?)
Units
• 1. Single rater: when interest is in using the ratings from a single rater as the basis for measurement.
• 2. Mean of raters: when interest is in using the mean of the ratings from all judges as the basis for measurement.
• Pearson's correlation is usually used for inter-rater reliability when the researcher has one or two meaningful pairs from one or two raters. Like most correlation coefficients, the ICC ranges from 0 to 1.
• A high intraclass correlation coefficient (ICC), close to 1, indicates high similarity between values from the same group; a low ICC, close to zero, means that values from the same group are not similar.
• Here is how to interpret the value of an intraclass correlation coefficient, according to Koo & Li:
  Less than 0.50: poor reliability
  Between 0.50 and 0.75: moderate reliability
  Between 0.75 and 0.90: good reliability
  Greater than 0.90: excellent reliability
• Suppose four different judges were asked to rate the quality of 10 different college entrance exams. The results are shown below.
[Table of the four judges' ratings not reproduced here]
• Suppose the four judges were randomly selected from a population of qualified entrance exam judges,
• that we'd like to measure the absolute agreement among judges, and
• that we're interested in using the ratings from a single-rater perspective as the basis for our measurement, with a two-way random effects model.
• The intraclass correlation coefficient (ICC) turns out to be 0.782.
• Based on the rules of thumb for interpreting the ICC, we would conclude that an ICC of 0.782 indicates that the exams can be rated with "good" reliability by different raters.
• https://www.youtube.com/watch?v=v2d6eyYoh4M
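Outside SPSS, the same model can be sketched with the pingouin package (an assumption; any ICC routine would work). The ratings below are invented, so the result will not reproduce the 0.782 above.

```python
# Sketch of the ICC for a judges-by-exams design, assuming the pingouin
# package is available (pg.intraclass_corr). The ratings are invented.
import pandas as pd
import pingouin as pg

ratings = pd.DataFrame({
    "exam":  list(range(1, 11)) * 4,                       # 10 exams, rated 4 times
    "judge": ["A"] * 10 + ["B"] * 10 + ["C"] * 10 + ["D"] * 10,
    "score": [8, 7, 9, 6, 5, 8, 7, 6, 9, 5,
              7, 7, 9, 5, 6, 8, 6, 6, 8, 4,
              8, 6, 8, 6, 5, 9, 7, 5, 9, 5,
              7, 8, 9, 6, 5, 8, 7, 6, 8, 6],
})

icc = pg.intraclass_corr(data=ratings, targets="exam",
                         raters="judge", ratings="score")
# The "ICC2" row corresponds to the two-way random effects, absolute agreement,
# single-rater case described in the example above.
print(icc)
```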
Cohen's kappa (κ)
• In research designs where you have two or more raters (also known as "judges" or "observers") who are responsible for measuring a variable on a categorical scale, it is important to determine whether such raters agree.
• Cohen's kappa (κ) is such a measure of inter-rater agreement for categorical scales when there are two raters (where κ is the lower-case Greek letter "kappa").
• There are many occasions when you need to determine the agreement between two raters. For example, the head of a local medical practice might want to determine whether two experienced doctors at the practice agree on when to send a patient to get a mole checked by a specialist.
• Both doctors look at the moles of 30 patients and decide whether to "refer" or "not refer" the patient to a specialist (i.e., where "refer" and "not refer" are two categories of a nominal variable, "referral decision").
• The level of agreement between the two doctors for each patient is analysed using Cohen's kappa.
1. Open the file KAPPA.SAV. Before performing the analysis on this summarized data, you must tell SPSS that the Count variable is a "weighted" variable. Select Data/Weight Cases... and select the "Weight cases by" option with Count as the Frequency variable.
2. Select Analyze/Descriptive Statistics/Crosstabs.
3. Select Rater A as Row, Rater B as Column.
4. Click on the Statistics button, select Kappa and Continue.
5. Click OK to display the results for the Kappa test shown here:
• The results of the inter-rater analysis are Kappa = 0.676 with p < 0.001. As a rule of thumb, values of Kappa from 0.40 to 0.59 are considered moderate, 0.60 to 0.79 substantial, and 0.80 and above outstanding (Landis & Koch, 1977).
• Most statisticians prefer Kappa values to be at least 0.6, and most often higher than 0.7, before claiming a good level of agreement.
• The 95% confidence interval on Kappa is (0.504, 0.848).
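The same statistic can also be computed outside SPSS; the sketch below assumes scikit-learn and uses invented refer/not-refer decisions, so it will not reproduce the 0.676 above.

```python
# Sketch of Cohen's kappa for two raters making refer / not-refer decisions,
# using scikit-learn's cohen_kappa_score. The 30 decisions are invented.
from sklearn.metrics import cohen_kappa_score

doctor_a = ["refer", "refer", "not", "not", "refer", "not", "not", "refer",
            "not", "refer", "not", "not", "refer", "refer", "not", "not",
            "refer", "not", "not", "not", "refer", "refer", "not", "not",
            "not", "refer", "not", "refer", "not", "not"]
doctor_b = ["refer", "not", "not", "not", "refer", "not", "refer", "refer",
            "not", "refer", "not", "not", "refer", "not", "not", "not",
            "refer", "not", "not", "not", "refer", "refer", "not", "refer",
            "not", "refer", "not", "refer", "not", "not"]

print(f"Cohen's kappa = {cohen_kappa_score(doctor_a, doctor_b):.3f}")
```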
• A more complete list of how Kappa might be interpreted (Landis & Koch, 1977) is given in the following table:
  Kappa         Interpretation
  < 0           Poor agreement
  0.0 – 0.20    Slight agreement
  0.21 – 0.40   Fair agreement
  0.41 – 0.60   Moderate agreement
  0.61 – 0.80   Substantial agreement
  0.81 – 1.00   Almost perfect agreement
• "An inter-rater reliability analysis using the Kappa statistic was performed to determine consistency among raters."
• Narrative for the results section: "The inter-rater reliability for the raters was found to be Kappa = 0.68 (p < 0.001), 95% CI (0.504, 0.848)."
• www.stattutorials.com/SPSSDATA
• Cohen's kappa is calculated in statistics to determine inter-rater reliability. On DATAtab you can calculate either Cohen's kappa or Fleiss' kappa online. If you want to calculate Cohen's kappa, simply select 2 categorical variables; if you want to calculate Fleiss' kappa, simply select three variables.
Fleiss' Kappa
• Fleiss' kappa, κ (Fleiss, 1971; Fleiss et al., 2003), is a measure of inter-rater agreement used to determine the level of agreement between two or more raters (also known as "judges" or "observers") when the method of assessment, known as the response variable, is measured on a categorical scale.
• In addition, Fleiss' kappa is used when: (a) the targets being rated (e.g., patients in a medical practice, learners taking a driving test, customers in a shopping mall/centre, burgers in a fast food chain, boxes delivered by a delivery company, chocolate bars from an assembly line) are randomly selected from the population of interest rather than being specifically chosen;
• and (b) the raters who assess these targets are non-unique and are randomly selected from a larger population of raters.
• Imagine that the head of a large medical practice wants to determine whether doctors at the practice agree on when to prescribe a patient antibiotics. Four doctors were randomly selected from the population of all doctors at the large medical practice to examine a patient complaining of an illness that might require antibiotics (i.e., the "four randomly selected doctors" are the non-unique raters and the "patients" are the targets being assessed). Each doctor had to decide whether to "prescribe antibiotics", "request the patient come in for a follow-up appointment" or "not prescribe antibiotics" (i.e., where "prescribe", "follow-up" and "not prescribe" are three categories of the nominal response variable, antibiotics prescription decision).
• This process was repeated for 10 patients, where on each occasion four doctors were randomly selected from all doctors at the large medical practice to examine one of the 10 patients.
• The 10 patients were also randomly selected from the population of patients at the large medical practice (i.e., the "population" of patients refers to all patients at the large medical practice).
• The level of agreement between the four non-unique doctors for each patient is analysed using Fleiss' kappa.
• Since the results showed a very good strength of agreement between the four non-unique doctors, the head of the large medical practice feels somewhat confident that doctors are prescribing antibiotics to patients in a similar manner.
• Furthermore, an analysis of the individual kappas can highlight any differences in the level of agreement between the four non-unique doctors for each category of the nominal response variable. For example, the individual kappas could show that the doctors were in greater agreement when the decision was to "prescribe" or "not prescribe", but in much less agreement when the decision was to "follow-up".
• It is also worth noting that even if raters strongly agree, this does not mean that their decision is correct (e.g., the doctors could be misdiagnosing the patients, perhaps prescribing antibiotics too often when it is not necessary). This is something that you have to take into account when reporting your findings, but it cannot be measured using Fleiss' kappa.
• Example 2: To assess police officers' level of agreement, the police force conducted an experiment where three police officers were randomly selected from all available police officers at the local police force of approximately 100 police officers. These three police officers were asked to view a video clip of a person in a clothing retail store (i.e., the people being viewed in the clothing retail store are the targets that are being rated). This video clip captured the movement of just one individual from the moment they entered the retail store to the moment they exited the store. At the end of the video clip, each of the three police officers was asked to record (i.e., rate) whether they considered the person's behaviour to be "normal", "unusual, but not suspicious" or "suspicious" (i.e., where these are three categories of the nominal response variable, behavioural_assessment).
[SPSS output: Fleiss' kappa results not reproduced here]
• Fleiss' kappa is .557. This is the proportion of agreement over and above chance agreement. Fleiss' kappa can range from -1 to +1. A negative value for kappa (κ) indicates that agreement between the two or more raters was less than the agreement expected by chance, with -1 indicating that there was no observed agreement (i.e., the raters did not agree on anything), and 0 (zero) indicating that agreement was no better than chance. However, negative values rarely actually occur (Agresti, 2013). Conversely, kappa values increasingly greater than 0 (zero) represent increasingly better-than-chance agreement for the two or more raters, up to a maximum value of +1, which indicates perfect agreement (i.e., the raters agreed on everything).
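As a rough illustration, Fleiss' kappa can also be computed outside SPSS; the sketch below assumes the statsmodels package and uses invented ratings, so it will not reproduce the .557 above.

```python
# Sketch of Fleiss' kappa for the three-officer example, assuming statsmodels'
# aggregate_raters and fleiss_kappa are available. Each row is one target
# (video clip); each column is one officer's category. Ratings are invented.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# 0 = "normal", 1 = "unusual, but not suspicious", 2 = "suspicious"
ratings = np.array([
    [0, 0, 0],
    [1, 1, 0],
    [2, 2, 2],
    [0, 1, 0],
    [2, 1, 2],
    [0, 0, 0],
    [1, 1, 1],
    [2, 2, 1],
    [0, 0, 1],
    [1, 2, 1],
])

counts, _ = aggregate_raters(ratings)    # targets x categories table of counts
print(f"Fleiss' kappa = {fleiss_kappa(counts, method='fleiss'):.3f}")
```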
• There are no hard rules of thumb to assess how good our kappa value of .557 is (i.e., how strong the level of agreement is between the police officers). With that being said, the following classifications have been suggested for assessing the strength of agreement based on the value of the kappa coefficient. The guidelines below are from Altman (1999), adapted from Landis and Koch (1977):
  Value of κ    Strength of agreement
  < 0.20        Poor
  0.21 – 0.40   Fair
  0.41 – 0.60   Moderate
  0.61 – 0.80   Good
  0.81 – 1.00   Very good
  Table: Classification of Cohen's kappa.
• It is also good to report a 95% confidence interval for Fleiss' kappa. To do this, you need to consult the "Lower 95% Asymptotic CI Bound" and the "Upper 95% Asymptotic CI Bound" columns of the SPSS output.
• Laerd Statistics (2019). Fleiss' kappa using SPSS Statistics. Statistical tutorials and software guides. Retrieved Month, Day, Year, from https://statistics.laerd.com/spss-tutorials/fleiss-kappa-in-spss-statistics.php
• For example, if you viewed this guide on 19th October 2019, you would use the following reference:
• Laerd Statistics (2019). Fleiss' kappa using SPSS Statistics. Statistical tutorials and software guides. Retrieved October 19, 2019, from https://statistics.laerd.com/spss-tuorials/fleiss-kappa-in-spss-statistics.php
 1
65
validity
66

 validity is a matter of degree. Validity is not


absolute since not every question can be
used in every situation, as it depends on
the context, population, and other factors in
which a question is used. Since validity is
not absolute, we do not assess the validity
of an indicator but instead the validity of
the use to which it is being put. A measure
of job satisfaction for American workers
may not be valid for workers in Asia, for
example.
Content Validity
• This is the extent to which a question, or set of questions, reflects a specific domain of content, body of knowledge, or specific set of tasks. It is used extensively in test construction by psychologists and educators, but less so by survey researchers, and is best used for a group of questions rather than one item. Thus, a scale composed of a group of items has content validity if it adequately represents the universe of potential questions that could be used to measure a specific concept.
• For example, a set of five job-satisfaction questions would have content validity if they were judged to be a representative sample that covers the potential components of job satisfaction. If, instead, an area were not represented (say, how interesting or challenging the work is), then the overall measure would have lower content validity. There is no direct method to quantify content validity, but it should always be assessed when creating multi-item scales, by thinking carefully about the concept to be measured or by consulting with experts on the topic.
• Does a measure relate to a particular behavior?
• Concurrent validity: present behavior
• Predictive validity: future behavior
• Concurrent validity is established when the scores from a new measurement procedure are directly related to the scores from a well-established measurement procedure for the same construct; that is, there is a consistent relationship between the scores from the two measurement procedures. This gives us confidence that the two measurement procedures are measuring the same thing (i.e., the same construct).
• Let's imagine that we are interested in determining test effectiveness; that is, we want to create a new measurement procedure for intellectual ability, but we are unsure whether it will be as effective as existing, well-established measurement procedures, such as the 11+ entrance exams, Mensa, ACTs (American College Tests), or SATs (Scholastic Aptitude Tests). However, we want to create a new measurement procedure that is much shorter, reducing the demands on students whilst still measuring their intellectual ability.
• The scores must differentiate individuals in the same way on both measurement procedures; that is, a student who gets a high score on the Mensa test (i.e., the well-established measurement procedure) should also get a high score on the new measurement procedure.
• This should be mirrored for students who get medium and low scores (i.e., the relationship between the scores should be consistent). If the relationship is inconsistent or weak, the new measurement procedure does not demonstrate concurrent validity.
• Assessing predictive validity involves establishing that the scores from a measurement procedure (e.g., a test or survey) make accurate predictions about the construct they represent (e.g., constructs like intelligence, achievement, burnout, depression, etc.). Such predictions must be made in accordance with theory; that is, theories should tell us how scores from a measurement procedure predict the construct in question.
• In order to be able to test for predictive validity, the criterion measure (the well-established measurement procedure) must be taken after the new measurement procedure. By after, we typically expect there to be quite some time between the two measurements (i.e., weeks, if not months or years).
• Universities often use ACT (American College Tests) or SAT (Scholastic Aptitude Tests) scores to help them with student admissions because there is strong predictive validity between these tests of intellectual ability and academic performance, where academic performance is measured in terms of freshman (i.e., first year) GPA (grade point average) scores at university (i.e., GPA scores reflect honours degree classifications, e.g., 2:2, 2:1, 1st class).
• This is important because if these pre-university tests of intellectual ability (i.e., ACT, SAT, etc.) did not predict academic performance (i.e., GPA) at university, they would be a poor measurement procedure for attracting the right students.
• However, let's imagine that we are only interested in finding the brightest students, and we feel that a test of intellectual ability designed specifically for this would be better than using ACT or SAT tests. For the purpose of this example, let's imagine that this advanced test of intellectual ability is a new measurement procedure that is the equivalent of the Mensa test, which is designed to detect the highest levels of intellectual ability. Therefore, a sample of students take the new test just before they go off to university. After one year, the GPA scores of these students are collected. The aim is to assess whether there is a strong, consistent relationship between the scores from the new measurement procedure (i.e., the intelligence test) and the scores from the well-established measurement procedure (i.e., the GPA scores).
Construct Validity
• Construct validity is the measure of how well the items selected for a construct actually measure that construct.
• For example, if Life Satisfaction is measured using 5 indicators, construct validity will help determine how well these five items measure the latent, unobserved construct of Life Satisfaction.
• Construct validity is established through two forms of validity: convergent validity and discriminant validity.

Convergent Validity
• Convergent validity refers to the degree to which multiple measures of a construct that theoretically should be related are in fact related (Gefen, Straub & Boudreau, 2000).
• Measures of constructs that theoretically should be related to each other are, in fact, observed to be related to each other (that is, you should be able to show a correspondence or convergence between similar constructs).

Discriminant Validity
• Measures of constructs that theoretically should not be related to each other are, in fact, observed to not be related to each other (that is, you should be able to discriminate between dissimilar constructs).