OPTOMETRY AND VISION SCIENCE
Copyright © 2007 American Academy of Optometry
ABSTRACT
Patient-reported outcome measurement has become accepted as an important component of comprehensive outcomes
research. Researchers wishing to use a patient-reported measure must either develop their own questionnaire (called an
instrument in the research literature) or choose from the myriad of instruments previously reported. This article
summarizes how previously developed instruments are best assessed using a systematic process and we propose a system
of quality assessment so that clinicians and researchers can determine whether there exists an appropriately developed
and validated instrument that matches their particular needs. These quality assessment criteria may also be useful to guide
new instrument development and refinement. We welcome debate over the appropriateness of these criteria as this will
lead to the evolution of better quality assessment criteria and in turn better assessment of patient-reported outcomes.
(Optom Vis Sci 2007;84:663–674)
Key Words: factor analysis, instrument, quality assessment, quality of life, questionnaire, Rasch analysis, reliability,
responsiveness, validity, visual disability
The assessment of health-related quality of life (HR-QoL) has been an important expansion of the assessment of the impact of disease and its treatment beyond the traditional areas of symptoms, signs, morbidity, and mortality. It provides a more holistic assessment of the effects of disease on the person, taking in such dimensions as a patient's physical, social, and emotional wellbeing. Most funding organizations now insist on a patient-reported outcome for a clinical trial of any disease intervention/treatment or assistive device. Because of the breadth of content of HR-QoL and its patient-reported nature, it has been measured using questionnaires (called instruments in the research literature), which are efficient tools for gathering large amounts of data quickly. Given the large number of instruments that have been developed over the years, investigators may find it difficult to decide upon an appropriate instrument or to decide whether a questionnaire needs to be specially developed for their study. Another problem is that originally (1920s to 1950s), the primary purpose of instrument development was to determine people's attitudes and how the range of attitudes was distributed in the population, rather than to produce a score on a quasi-continuous variable.1 As the use of instruments extended beyond psychology to medical fields, the format and purpose of instruments changed. Unfortunately, the change in the design and application of instruments also meant that traditional methods of scoring and validation became outdated, but this was not recognized for many of the originally developed instruments.

Throughout this article, we highlight key quality criteria (summarized in Table 1) that build upon previous contributions to the field.2,3 It is our aim to present a robust set of quality criteria to be used by researchers and practitioners in the selection of instruments, and we welcome comments and suggestions for their refinement or further development. These proposed quality assessment criteria (Table 1) provide a framework for a systematic review of instruments in the disease area under study, to determine whether any existing instruments are adequate for the intended use in the intended target population. In the absence of a sufficiently reliable, valid instrument with content appropriate to the intended use, the development of a new instrument for the intended purpose can be justified.
TABLE 1.
Quality assessment tool for evaluation of health status questionnaires. [The table body (Property, Definition, Quality criteria) is not reproduced in this extract.]
a If not reported, scored as "0"; ✓✓, positive rating; ✓, minimal acceptable rating; ✗, negative rating.
MID, minimally important difference; LOA, limits of agreement; ICC, intraclass correlation coefficient; SD, standard deviation.
TABLE 2.
Quality assessment of 4 refractive error-related quality of life instruments: the Psychosocial Impact of Assistive Devices (PIADS),4–7 the Refractive Status Vision Profile (RSVP),8–13 the National Eye Institute Refractive Quality of Life (NEI-RQL)12,14–17 and the Quality of Life Impact of Refractive Correction (QIRC).18–20 [The column headings (the quality criteria of Table 1) are not reproduced in this extract; each row lists an instrument's ratings in the original column order, including the rows that continued the table on a later page.]

PIADS:   ✓✓ ✓✓ ✓✓ ✓ ✓ ✗ ✓ ✓ 0 ✓✓ ✓✓ ✓✓ ✓✓ 0 ✓✓ ✓✓
RSVPa:   ✓✓ ✓ ✓✓ ✓✓ ✓ ✗ ✗ ✗ 0 ✓ ✓ ✓ 0 0 ✓b ✓b
NEI-RQL: ✓✓ ✓✓ ✓✓ ✓✓ ✗ ✗ ✗ ✗ 0 ✓✓ ✓✓ ✓✓ 0 0 ✓✓ ✓✓
QIRC:    ✓✓ ✓✓ ✓✓ ✓✓ ✓✓ ✓✓ ✓✓ ✓✓ 0 0 ✓✓ ✓✓ 0 ✓✓ ✓✓ ✓✓

a A Rasch-analyzed version of the RSVP (Garamendi et al., 2006) with a modified response scale and a reduced number of items has been shown to have greater responsiveness and test-retest reliability than the standard instrument. It also provides a unidimensional score, statistically justified response and scoring scales, and good Rasch separation reliability.
b Conflicting reports of normative data levels and responsiveness of the RSVP are provided by Schein et al. (2001) and Nichols et al. (2001).
…and does not claim to measure quality of life, but it has been misinterpreted as assessing QoL.30–32

Item Identification

To ensure a good breadth of relevance, at least three approaches should have been taken for item generation. These include obtaining sample statements, experiences, and opinions directly from: individuals within the target population, through focus groups or one-to-one interviews; experts working in the area (not just clinicians, but individuals who have contact with patients and may develop expertise in understanding the impact of the condition on the person); and the published literature in the field. Patient interviews are useful for gathering a range of opinions on a topic and can help to draw views from particular minority groups. Focus groups are useful for eliciting mediated responses that are likely to be common to the majority of individuals in a given population and can also be more productive than in-depth patient interviews, because of the enthusiasm and interaction created by the discussion process.33,34 Expert knowledge is a valuable resource, but should not be used as the sole procedure for generating items, because clinicians tend to focus on presenting complaints. There may also be issues that the patient does not present to a clinician, but which have an impact on their quality of life. For example, the RSVP is a clinician-developed instrument of QoL for refractive surgery and has been shown to include too many items related to symptoms and functional problems,11 whereas patients are more concerned about issues such as convenience, cost, health concerns, and wellbeing.17

Pilot Questionnaire

Item generation will typically produce a vast number of items. An item removal process is required to determine which items to retain for the final instrument. A pilot questionnaire is best used for this process (see quality criteria in Table 1). The pilot questionnaire indicates how well each item taps the underlying construct being measured, and allows poorly discriminating, unreliable, or invalid items to be removed. The respondent population for the pilot data should have been broad and representative of the target population.

Unidimensionality and Item Reduction

Item reduction is performed to maximize item quality, measurement precision, and targeting of items to persons. Unidimensionality is the demonstration that all items included in an instrument fit with a single underlying construct (e.g., VR-QoL); it is a prerequisite for appropriate summation of any set of items24,35 and an important asset if a meaningful measurement is to be obtained.35,36 A number of statistical methodologies are used to justify item reduction and give insight into dimensionality:

• Conventional descriptive statistics
• Cronbach's alpha
• Factor analysis
• Rasch analysis

Statistical methods for item reduction serve to highlight the worst performing items, which are removed. The items are removed one at a time, with the analyses performed iteratively to calculate the improvement in the instrument and to identify the next candidate item for removal. Traditionally, the following descriptive and statistical analyses have been used to determine candidate items for removal:4,5,26

• Missing data. Items that have large percentages (>50%) of missing data are likely to be ambiguous, or not applicable to many respondents.
• Normality. All items should approximate a normal distribution, as identified using histogram plots, nonsignificant results on tests of normality such as Kolmogorov-Smirnov or Shapiro-Wilk, or skewness and kurtosis values within −2.00 to +2.00, although items at the ends of the scale will likely deviate from normal.37

Unidimensionality of the whole instrument must be considered when deciding which items to remove. Traditionally, Cronbach's alpha and factor analysis were used to assess unidimensionality. Cronbach's alpha determines the correlation of every item in the instrument with every other item; the nearer Cronbach's alpha is to 1, the more internally consistent the scale. Cronbach's alpha can also be calculated with each item in turn deleted from the analysis. If alpha increases (relative to the alpha of all items included), this indicates that the removed item was not contributing to unidimensionality. Because Cronbach's alpha is essentially determined by the average of the correlation coefficients between items, exceptionally high values of Cronbach's alpha (>0.90) may be indicative of redundancy (e.g., in the RSVP, see Table 2). Although this does not contravene unidimensionality, redundancy is a problem if the process of creating the "overall score" for the instrument involves just adding all the item scores together. In such a case, the overall score overweighs the importance of the issue that is served by the redundant items. Therefore, in our quality criteria, we downgrade those instruments with Cronbach's alpha >0.90 (Table 1). Similarly, Cronbach's alpha is not independent of the number of items and may be elevated by including many items. For these reasons, Cronbach's alpha should probably be considered more of a traditional indicator than a useful one.38 Nevertheless, we retain it in our quality criteria as it is such a commonly reported metric: values should be >0.70 and <0.90.
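As a concrete illustration of the alpha-based checks just described, the short Python sketch below computes Cronbach's alpha and the alpha-if-item-deleted diagnostic on simulated Likert-type data. This is a minimal sketch, not taken from the article: the function names, simulated data, and reporting are ours, but the formula, the 0.70 to 0.90 window, and the item-deletion logic follow the text.

```python
import numpy as np

def cronbach_alpha(items):
    """alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))."""
    x = np.asarray(items, dtype=float)
    k = x.shape[1]
    return (k / (k - 1)) * (1 - x.var(axis=0, ddof=1).sum()
                            / x.sum(axis=1).var(ddof=1))

def alpha_if_deleted(items):
    """Alpha recomputed with each item removed in turn; a value above the
    full-scale alpha flags an item not contributing to unidimensionality."""
    x = np.asarray(items, dtype=float)
    return [cronbach_alpha(np.delete(x, i, axis=1)) for i in range(x.shape[1])]

# Simulated 5-category Likert responses: 200 respondents x 10 items.
rng = np.random.default_rng(0)
trait = rng.normal(size=(200, 1))
items = np.clip(np.round(3 + trait + rng.normal(0, 0.8, (200, 10))), 1, 5)

alpha = cronbach_alpha(items)
print(f"alpha = {alpha:.3f}")  # acceptable window per the criteria: >0.70 and <0.90
for i, a in enumerate(alpha_if_deleted(items)):
    if a > alpha:
        print(f"item {i}: alpha rises to {a:.3f} if deleted -> removal candidate")
```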
Factor analysis is a multivariate statistical method that is used to explain variance within a matrix of variables and reduces those variables to a number of factors. This method can be used to determine whether an instrument possesses unidimensionality.26 The proportion of the variance described by the principal (most significant) factor indicates whether the instrument tests in one or more content areas. In addition, factor analysis can be "rotated" by various techniques, such as Varimax or Oblimin, to find items which have high communality and thus form additional factors. This grouping of items into additional factors can be used to justify the creation of subscale indices, as items that load onto the same factor are likely to sample the same content area specified by the factor to which they contribute. Subscales should be proposed hypothetically and justified with confirmatory factor analysis, rather than simply being the product of exploratory factor analysis.39 Once demonstrated to exist by factor analysis, subscales themselves should also be assessed for unidimensionality. Factor analysis can guide item reduction by indicating both failure to fit (items contributing <0.40 to a particular factor) and redundancy (>0.80). Ideally, factor analysis should be performed on Rasch-scaled data, so that items do not group simply because of similar item difficulty.40
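The dominant-factor check described above can be sketched with an unrotated principal-components decomposition of the item correlation matrix. This is only an approximation of a full factor analysis (no rotation, no confirmatory modelling), and the function name and wiring are ours; the <0.40 and >0.80 loading thresholds are the ones quoted in the text.

```python
import numpy as np

def principal_factor_check(items, fail_cutoff=0.40, redundancy_cutoff=0.80):
    """Share of variance on the first (principal) factor of the item
    correlation matrix, plus items flagged by their first-factor loadings."""
    x = np.asarray(items, dtype=float)
    corr = np.corrcoef(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)        # eigh returns ascending order
    share = eigvals[-1] / eigvals.sum()            # dominant-factor variance share
    loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])
    failing = np.flatnonzero(np.abs(loadings) < fail_cutoff)       # failure to fit
    redundant = np.flatnonzero(np.abs(loadings) > redundancy_cutoff)  # redundancy
    return share, loadings, failing, redundant

# With the simulated `items` matrix from the previous sketch:
# share, loadings, failing, redundant = principal_factor_check(items)
```

A large first-factor share with no failing items is consistent with (though not proof of) unidimensionality; rotated and confirmatory analyses require a dedicated package.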
More recently developed instruments have used Rasch analysis to help guide item reduction.41–45 Rasch analysis provides a more detailed view of dimensionality through both model and item fit statistics.38 The item-trait interaction score, reported as a χ², reflects the property of invariance across the trait. A statistically nonsignificant probability value (p > 0.05) indicates no substantial deviation from the model, which implies unidimensionality.20 The infit and outfit statistics also help to identify which items contribute most to the measurement of the latent trait. Infit and outfit mean squares have an expected value of 1.00. Infit mean squares below 0.80 represent items that are too predictable (they have at least 20% less variation than expected); these overfitting items may be redundant or lack the variance to contribute new information to the measure. Mean outfit values >1.20 represent misfit (at least 20% more variance than expected) and suggest that the item measures something different to the overall scale. Acceptable values for item inclusion may be 0.80 to 1.20 for a strict definition (often used for infit) or 0.70 to 1.30, or even higher, for a lenient definition. Alternatively, fit residuals may be used, in which case values >2.5, or probability values below the Bonferroni-adjusted alpha value (i.e., 0.05/number of items), are also used to indicate misfit to the model. Rasch analysis can also indicate the effect of removing an item on overall scale performance: if removal of an item considerably decreases person separation, that item should be retained.36 Person separation is an indicator of the ability (precision) of the instrument to differentiate between different people's quality of life. Person separation is expressed as the ratio of the adjusted standard deviation to the root mean square error, and a person separation value of 2.0 or more indicates that subjects are significantly different in QoL across the measurement distribution.46
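Because infit and outfit statistics come out of a Rasch estimation rather than a closed formula, the sketch below assumes they have already been exported from Rasch software and simply applies the strict and lenient windows quoted above. The item names and values are hypothetical.

```python
# Hypothetical fit statistics exported from Rasch software:
# item -> (infit mean square, outfit mean square).
item_fit = {
    "driving at night": (1.05, 1.12),
    "reading small print": (0.62, 0.55),   # overfit: too predictable, possibly redundant
    "glare symptoms": (1.28, 1.45),        # misfit: variance off the latent trait
}

STRICT = (0.80, 1.20)    # often applied to infit
LENIENT = (0.70, 1.30)

def flag_misfitting(fit_stats, window):
    lo, hi = window
    return {item: stats for item, stats in fit_stats.items()
            if not all(lo <= s <= hi for s in stats)}

print(flag_misfitting(item_fit, STRICT))   # removal candidates, removed one at a time
```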
Targeting of Items to Persons

Rasch analysis also provides insight into the targeting of item difficulty to person ability and can therefore be used to remove items that target the population less well.47 Figs. 1 and 2 show person-item maps for a group of cataract patients responding to the Activities of Daily Vision Scale (ADVS),48 a visual activity limitation instrument. This analysis rank orders the items and the participant responses. The means of the two distributions (person and item) are denoted by 'M'. If the items were well targeted to the subjects, the means of the two distributions should be close (e.g., 0.5 logits) to each other. In Fig. 1, the original conventionally validated ADVS is shown, and it can be seen that the means are far apart. Fig. 2 shows how item reduction of the ADVS, using Rasch analysis, provides better targeting of item difficulty to patient ability, with the 'M' values now closer together. This was achieved through removal of items that were too easy for patient ability. This item reduction approach can lead to a minimum item set, which has the optimum instrument efficiency and the advantage of shortening test time and reducing user and respondent burden.

FIGURE 1.
Patient activity limitation/item difficulty map for the 22-item ADVS. On the left of the dashed line are the patients, represented by X. On the right are the cross-over points between each response scale (the level of the scale where the answer category is most probable to be rated by a patient with that activity limitation). More able patients and more difficult items are near the bottom of the map; less able patients and less difficult items are near the top. The scale is in log units (0–100). M, mean; S, 1 SD from the mean; T, 2 SD from the mean. [The corresponding map for the item-reduced ADVS (Fig. 2) is not reproduced in this extract.]

Criteria to guide item removal that incorporate all of these approaches have been proposed17,49 (a screening sketch follows the list below). The suggested infit and outfit ranges are only guides and can depend largely on sample size.50

1. Infit mean square outside 0.70 to 1.30
2. Outfit mean square outside 0.70 to 1.30
3. Item with mean furthest from the subject mean
4. High proportion of missing data (>50%)
5. Ceiling effect: a high proportion of responses in the item end-response category (>50%)
6. Items with a markedly different standard deviation of item scores to other items
7. Items that do not demonstrate a normal distribution, as identified using histogram plots, tests of normality such as Kolmogorov-Smirnov or Shapiro-Wilk, or skewness and kurtosis values outside −2.00 to +2.00.
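A minimal screening pass over criteria 4, 5, and 7 of the list above might look like the following. The function name and wiring are ours; note that scipy's kurtosis is the excess (Fisher) form, and the ±2.00 window is applied to it on the assumption that this matches the convention intended in the text.

```python
import numpy as np
from scipy import stats

def screen_items(items, max_missing=0.50, max_end_category=0.50):
    """Flag items against criteria 4, 5, and 7 above; NaN encodes a missing response."""
    x = np.asarray(items, dtype=float)
    flags = {}
    for i in range(x.shape[1]):
        col, reasons = x[:, i], []
        if np.isnan(col).mean() > max_missing:                       # criterion 4
            reasons.append("missing responses > 50%")
        answered = col[~np.isnan(col)]
        # criterion 5: ceiling (the floor check at answered.min() is analogous)
        if (answered == answered.max()).mean() > max_end_category:
            reasons.append("end-category pile-up > 50%")
        if abs(stats.skew(answered)) > 2.0 or abs(stats.kurtosis(answered)) > 2.0:
            reasons.append("skewness/kurtosis outside -2.00 to +2.00")  # criterion 7
        if reasons:
            flags[i] = reasons
    return flags
```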
Rating Scale

Unfortunately, many QoL instruments use traditional summary scoring, where an overall score is derived through summative scoring of responses. Summary scoring is based on the statistical hypotheses that all questions have equal importance and that response categories are accordingly scaled to have equal value, with uniform increments from category to category. In cases where the items in an instrument no longer have equal importance, the logic of averaging scores across all items becomes questionable. For example, in a summary scaled visual activity limitation instrument, the ADVS, "a little difficulty" scores 4, "extreme difficulty" is twice as bad and scores 2, and "unable to perform the activity due to vision" is again twice as bad, with a score of 1. The ADVS also ascribes the same response scale to a range of different items, such that "a little difficulty" "driving at night" receives the same numerical score as "a little difficulty" "driving during the day", despite the former being by far the more difficult and complex task. This rationale of "one size fits all" is flawed in this case, and Rasch analysis has been used to confirm …

FIGURE 3.
(A) Rasch model category probability curves for the faces pain scale, representing the likelihood that a subject with a particular pain severity will select a category. The x-axis represents pain. For any given point along this scale, the category most likely to be chosen by a subject is shown by the category curve with the highest probability. At no point is category 5 the most likely to be selected. This suggests there are too many categories and that these are not used in order. (B) Rasch model category probability curves for the faces scale shortened to 5 categories by combining categories 2 and 3, and 5 and 6. This model gives excellent symmetry and the thresholds are now ordered. Both figures reproduced with permission from J Pain 2005;6:630–6.
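The category-collapsing repair described in the Fig. 3 caption amounts to a simple recode of the raw responses before re-running the Rasch model. A hypothetical sketch, assuming the seven categories are coded 1 to 7:

```python
import numpy as np

# Merge categories 2 and 3, and 5 and 6, of a 7-category scale (coded 1-7),
# then re-run the Rasch model to check that the thresholds are now ordered.
collapse = {1: 1, 2: 2, 3: 2, 4: 3, 5: 4, 6: 4, 7: 5}
responses = np.array([1, 3, 5, 7, 2, 6, 4])
recoded = np.vectorize(collapse.get)(responses)
print(recoded)   # -> [1 2 4 5 2 4 3]
```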
…going process requiring the statement of hypotheses and the testing thereof: if an instrument measures a trait, then it should correlate with another measure, etc. There are specific types of validity that together contribute to construct validity, e.g., concurrent, convergent, predictive, and discriminant validity. Although it is not possible to perform all of these tests, it is important that construct validity is a hypothesis-driven process. Sometimes the hypothesis will be simple and fall easily under the heading of, e.g., convergent validity. At other times, complex hypothesis testing will not be readily subclassified but will be critical to the establishment of construct validity. With the right set of hypotheses and tests, a persuasive picture of construct validity can be developed.

Criterion validity is a traditional definition of validity where an instrument is correlated with an existing "standard" or accepted measure which measures the same thing. However, criterion validity can be further subdivided, so we use "criterion-related validity" as an umbrella term here.

Convergent validity is the classic form of criterion validity, where a new instrument is correlated with something that measures a related construct. For visual activity limitation instruments, correlation with visual acuity (VA), or with an existing validated visual activity limitation instrument (e.g., the VF-14)32 is typically used to indicate convergent validity. Suitable statistical analyses are a Pearson correlation coefficient for continuous variables or, for dichotomous data, a chi-squared analysis with a phi coefficient as a measure of the correlation. Note that for convergent validity, a very high correlation (>0.90) is not advantageous, as it suggests that the new instrument provides information so similar to a previously developed instrument or other measure that it provides no significant additional information. So, a moderate correlation may actually be better than a high one, because it indicates that the two measures are related but the instrument is also providing different information. However, a low correlation implies that two measures which are hypothesized to be related are not very well related at all. A cutoff of 0.3 is probably appropriate as the minimum correlation between two measures which should be related. Therefore, the hypothesis is critical for convergent validity.
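As a sketch of the convergent validity test, the following applies a Pearson correlation with the 0.3 and 0.9 decision points discussed above to simulated data; the variables and the wording of the verdicts are ours.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical data: new instrument score vs. visual acuity (logMAR), 80 patients.
va = rng.normal(0.3, 0.2, 80)
score = 0.5 * va + rng.normal(0, 0.15, 80)   # related, but not identical

r, p = stats.pearsonr(score, va)
if abs(r) < 0.3:
    verdict = "below 0.3: hypothesised relation not supported"
elif abs(r) > 0.9:
    verdict = "above 0.9: little new information beyond the existing measure"
else:
    verdict = "moderate: related yet distinct, as desired"
print(f"r = {r:.2f} (p = {p:.3g}); {verdict}")
```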
Discriminant validity is the degree to which an instrument diverges from other instruments that should be dissimilar. This is probably the validity test performed least often; there are no results in Table 2. For refractive error-related QoL instruments, it might be simple to show a poor correlation to an instrument designed for measuring visual activity limitation, because disability is not typically a component of the former. The statistical test required is again a simple Pearson correlation coefficient, but in this case a poor correlation, e.g., <0.3, is the desired result. More complex hypotheses of concurrent and discriminant validity could also be set. For example, a new cataract-specific visual activity limitation instrument could be hypothesized to correlate very well with an existing cataract-specific visual activity limitation instrument, less well with an ophthalmic QoL instrument, and least well with a general health QoL instrument. Such a hypothesis can avoid the 0.3 cutoff, as the correlations may well be of the order of 0.7, 0.5, and 0.3, respectively, and therefore provide good criterion-related validity evidence for both convergent and discriminant validity.

Predictive validity determines whether the instrument can make accurate predictions of future outcomes. For example, can a score on a visual activity limitation instrument be used to predict the need for cataract surgery? This may be worthwhile because people could be prioritized for examination based on instrument scores, and some people with minimal activity limitation could be spared a costly comprehensive eye examination. Again, a simple Pearson correlation coefficient (assuming a normal distribution; alternatively a Spearman rank correlation) is the appropriate test, and a correlation of >0.3 is an appropriate cutoff, although for predictive validity a very high correlation is not disadvantageous. For a dichotomous outcome, a significant χ² or odds ratio would be appropriate.

Concurrent validity illustrates an instrument's ability to distinguish between groups that it should theoretically be able to distinguish.22 Critically, both are measured at the same time, rather than one being measured at a future time. For example, an instrument designed for a particular condition should be able to discriminate between groups with and without the condition. Testing such a hypothesis is often the easiest contribution to construct validity. The instrument is administered to two groups, one with the condition, one without. For simplicity, equivocal cases are not included in the analysis, although this provides weaker evidence of validity, because it may be the equivocal cases where the instrument is most needed (assuming there is a needs-based reason for developing the new instrument). The results become more powerful when discriminating between two groups that are very similar.
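The known-groups comparison just described reduces to a two-sample test on instrument scores. A minimal sketch on hypothetical data, using an independent-samples t-test and assuming roughly normal scores (a Mann-Whitney U test would be the non-parametric alternative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical instrument scores; equivocal cases already set aside.
with_condition = rng.normal(40, 10, 60)
without_condition = rng.normal(55, 10, 60)

t, p = stats.ttest_ind(with_condition, without_condition)
print(f"t = {t:.2f}, p = {p:.2g}")  # significant separation supports the hypothesis
```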
Validity demonstrates that the instrument measures the construct that it was intended to measure and relates well to other measures of the same or similar constructs. It does not, however, show that the construct is consistently captured across respondents, time, and setting.
Reliability. Reliability is the consistency of the instrument in measuring the same construct over different administrations, but it does not indicate validity, as it makes no assumption that the correct construct was measured in the first place. Reliability generally examines the proportion of the total variance that is attributable to true differences among subjects. Total variance includes the true differences and measurement error; that measurement error is considered to result from time, rater, and content selection.26 Reliability is a very important quality of an instrument, as unreliability detracts from validity. For example, if a test has poor reliability, such that test results correlate poorly with retest results, it is unlikely that results from the test will correlate highly with gold standard measures, so its concurrent and convergent validity will also be impaired.

The reliability of an instrument can be explored using many methods, which can be classified broadly into two categories: single administration and multiple administrations. Single administration methods include split-half and internal consistency tests, for example Cronbach's alpha. These methods, however, are really examining 'internal consistency reliability', which indicates unidimensionality (as discussed above) rather than reliability. In particular, claims of very good instrument reliability based on very high Cronbach's alpha values (>0.90) can be downgraded, as such values are more indicative of redundancy in the instrument. It is important that Cronbach's alpha is not overemphasized as a measure of reliability and that the other attributes of reliability are reported. Multiple administration methods include test-retest, alternate forms (intermode), and interobserver reliability (not appropriate for self-administered instruments) and are typically calculated using the Pearson product-moment correlation coefficient (r), the intraclass correlation coefficient (ICC),26,59 Bland-Altman limits of agreement,60,61 or kappa statistics.

The ICC is defined as the ratio of the between-groups variance to the total variance. Thus it is a measure of agreement, and it is valid to use it as such when there is no intrinsic ordering of the two measures under comparison, e.g., in test-retest reliability.62 The ICC is dependent on the range of responses, so care must be taken with the population in question.63
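The ICC can be computed from a two-way ANOVA decomposition of an n-subjects-by-k-administrations score matrix. The sketch below implements the ICC(2,1) form of Shrout and Fleiss (two-way random effects, absolute agreement, single measures), a common choice for test-retest data; the article does not specify a particular ICC variant, so that choice is our assumption.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measures.
    ratings: n subjects (rows) x k administrations (columns)."""
    y = np.asarray(ratings, dtype=float)
    n, k = y.shape
    grand = y.mean()
    ms_rows = k * ((y.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # subjects
    ms_cols = n * ((y.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # occasions
    resid = y - y.mean(axis=1, keepdims=True) - y.mean(axis=0) + grand
    ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Hypothetical test and retest scores for five subjects.
scores = np.array([[10, 11], [14, 15], [18, 17], [22, 24], [30, 29]])
print(f"ICC(2,1) = {icc_2_1(scores):.3f}")
```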
The Bland-Altman limits of agreement (LoA) define the range of values over which 95% of the differences between two measures should lie.60,61 This is a simple method to perform and is applicable to many situations as long as the units of measurement (e.g., diopters for refractive error) are the same (for reliability testing the units of measurement are essentially the same). The advantage of this approach is that it is robust to large data ranges and can detect and manage bias. Interpretation of whether a limit of agreement is a good or a bad result requires clinical context. Therefore, a disadvantage of this approach lies in interpretability if the scale of the instrument is unfamiliar. For an LoA result showing that the reliability of subjective refraction is ±0.50 D, an optometrist or related clinician will readily understand the precision of the measurement, but other people would not know whether this was good or bad without an appreciation of typical values for the scale.
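Limits of agreement follow directly from the distribution of test-retest differences: the bias plus or minus 1.96 standard deviations of the differences. A sketch on hypothetical refraction data (the function name and values are ours):

```python
import numpy as np

def limits_of_agreement(m1, m2):
    """Bland-Altman: bias and the interval expected to contain 95% of
    the differences between two measures (bias +/- 1.96 SD of differences)."""
    d = np.asarray(m1, dtype=float) - np.asarray(m2, dtype=float)
    bias = d.mean()
    half_width = 1.96 * d.std(ddof=1)
    return bias, (bias - half_width, bias + half_width)

# Hypothetical repeat subjective refractions (spherical equivalent, D).
test = np.array([-2.00, -1.25, -3.50, -0.75, -2.25, -4.00])
retest = np.array([-2.25, -1.25, -3.25, -1.00, -2.00, -4.25])
bias, (lo, hi) = limits_of_agreement(test, retest)
print(f"bias = {bias:+.2f} D, 95% LoA = {lo:+.2f} to {hi:+.2f} D")
```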
Kappa statistics should be used when comparing categorical data.64 This statistic is designed to indicate the agreement between two measurers using the same nominal scale, corrected for agreement that would occur by chance. Kappa varies from −1 to 1, where 0 is equivalent to agreement occurring by chance. A kappa of 0.81 or greater represents "almost perfect agreement", and between 0.61 and 0.80 represents "substantial agreement".65 A kappa statistic ≥0.70 is desirable for reliability testing of instrument responses. A weighted kappa statistic is designed for ordinal categorical data, such as that seen with instrument response scales, where a greater penalty is given for pairs with greater disagreement over scale categories. Kappa weighted with the quadratic weighting scheme is mathematically identical to the ICC.66 Notably, kappa statistics depend upon the prevalence of the characteristic under study, so they are not directly comparable across measures.
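Both plain and weighted kappa are available in scikit-learn; the quadratic weighting shown below is the scheme noted above as mathematically identical to the ICC. The response data are hypothetical.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical repeated categorical responses from the same ten subjects.
first = [1, 2, 2, 3, 4, 4, 5, 1, 3, 2]
second = [1, 2, 3, 3, 4, 5, 5, 1, 2, 2]

plain = cohen_kappa_score(first, second)                          # nominal agreement
weighted = cohen_kappa_score(first, second, weights="quadratic")  # ordinal; ~ICC
print(f"kappa = {plain:.2f}, quadratic-weighted kappa = {weighted:.2f}")
```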
In addition to the above tests, Rasch analysis also provides person and item separation reliability indices, indicating the overall performance of an instrument. Separation reliability is the ratio of the true variance in the estimated measures to the observed variance and indicates the number of distinct person strata that can be distinguished.36 There are a number of versions of separation, including the Person Separation Index (PSI), or person separation reliability, which can range from 0 to 1, with high values indicating better reliability. A PSI value of 0.8 is the equivalent of a G value (person separation ratio) of 2.0, representing the ability to distinguish three distinct strata of person ability.58,67 A value of 0.9 is equivalent to a G value of 3, with the ability to distinguish four strata of person ability. Item separation reliability should also be reported, with 0.8 being the cutoff for both in terms of acceptability.
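The PSI-to-G correspondence quoted above can be checked in a few lines, using the standard relations PSI = G²/(1 + G²) and strata ≈ (4G + 1)/3 (the strata formula is the usual Wright formulation, supplied by us rather than the article):

```python
def strata(G):
    """Number of statistically distinct person strata for separation ratio G."""
    return (4 * G + 1) / 3

def psi_to_G(psi):
    """Convert person separation reliability (PSI, 0-1) to separation ratio G."""
    return (psi / (1 - psi)) ** 0.5

for psi in (0.8, 0.9):
    G = psi_to_G(psi)
    print(f"PSI {psi:.1f} -> G = {G:.1f} -> about {strata(G):.0f} strata")
```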
Other Important Indicators: Responsiveness and Interpretation. Responsiveness is the extent to which the instrument can detect clinically important changes over time.68,69 This can be studied in patients who are known to undergo a change in status over a time frame, e.g., before and after cataract surgery. The perspective of what constitutes a "clinically important" change is given by the minimum clinically important difference (MID). The MID indicates the smallest difference in score that can be perceived as beneficial by the subject. This is calculated relative to a difference reported by the patient. For example, one could ask cataract patients: "By how much has the operation improved your vision?" and provide the options: "made it worse", "not at all", "a little", "quite a bit", "a lot". The score change in the instrument of interest that equates to a change in status from one step to the next on this question can be used to calculate the MID with receiver operating characteristic analysis. The MID ideally should be larger than the LoA of the test-retest reliability of the instrument, as this means that the reliability of the test does not interfere with detection of the MID. Although this criterion may not always be achieved, a MID comparable to the LoA is still scored as a positive result (Table 1).
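The ROC approach to the MID described above can be sketched as follows: dichotomize the transition question at the step of interest (e.g., at least "a little" improvement), then choose the change-score cutoff that best separates improved from unimproved patients. The simulated data and the use of the Youden index to pick the cutoff are our assumptions.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(3)
# Hypothetical: instrument change scores after surgery, and whether the patient
# reported at least "a little" improvement on the transition question (1 = yes).
change = np.concatenate([rng.normal(2, 3, 70), rng.normal(9, 4, 130)])
improved = np.concatenate([np.zeros(70), np.ones(130)])

fpr, tpr, thresholds = roc_curve(improved, change)
mid = thresholds[np.argmax(tpr - fpr)]       # Youden index picks the cutoff
print(f"estimated MID = {mid:.1f} points")   # then compare against the LoA
```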
To demonstrate that an instrument is responsive to an intervention, the mean change, e.g., with cataract surgery, needs to be greater than the MID. Responsiveness can be expressed by a number of statistics: effect size, the difference between pre- and postoperative scores divided by the preoperative standard deviation; the standardized response mean (SRM), the mean of the change scores divided by the standard deviation of the change scores; and the responsiveness statistic (RS), the difference between pre- and postoperative scores divided by the standard deviation of the retest scores. For each of these measures, convention holds that effect sizes of 0.20 to 0.49 are considered small, 0.50 to 0.79 moderate, and 0.80 or above large.70
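The three responsiveness statistics have simple closed forms, so they can be computed directly. The function name, the simulated data, and the retest_sd input (the standard deviation of retest scores, supplied from a separate test-retest study) are ours:

```python
import numpy as np

def responsiveness(pre, post, retest_sd):
    """Effect size, SRM, and responsiveness statistic, as defined above."""
    pre, post = np.asarray(pre, dtype=float), np.asarray(post, dtype=float)
    change = post - pre
    effect_size = change.mean() / pre.std(ddof=1)
    srm = change.mean() / change.std(ddof=1)
    rs = change.mean() / retest_sd
    return effect_size, srm, rs

rng = np.random.default_rng(4)
pre = rng.normal(50, 12, 90)
post = pre + rng.normal(10, 8, 90)    # hypothetical post-surgery gain
es, srm, rs = responsiveness(pre, post, retest_sd=6.0)
print(f"ES = {es:.2f}, SRM = {srm:.2f}, RS = {rs:.2f}")  # >=0.80 counts as large
```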
Interpretation indicates the degree to which scores on a measure can be considered meaningful. To ensure interpretation of an instrument, the instrument should be tested on a representative target population whose demographics are fully described. Normative scores and the minimum clinically important difference (see responsiveness) should be described. The amount of interpretation information that should be described depends on the purpose of the instrument. For example, an instrument intended for cataract surgery probably need only report normative data (means and SDs) for typical populations of bilateral and second eye surgery cases, although one could perhaps argue that cataract-only and cataract-with-comorbidity populations should also be described. Contrast this with an instrument designed for use across all ophthalmic conditions: normative data would need to be provided for a great many eye diseases. Data may also need to be provided for subgroups other than disease, e.g., age, gender, socioeconomic status. Scores before and after important interventions, e.g., cataract surgery, should also be provided.
Recommendations for Instrument Selection

In this article, we have presented a range of methods and analysis techniques for developing and validating instruments and scales. These guidelines are intended to help investigators understand what determines instrument quality and to assist interpretation of articles detailing instrument development. Once the basic principles of psychometric methods are understood, we recommend that researchers wishing to include a QoL measure in a study or clinical trial, and not wishing to develop and validate their own instrument, use the following instrument selection process and the quality criteria presented in Table 1 to guide their selection of an appropriate instrument.

Instrument Selection Process.

1. Be sure that the content area of the instrument suits the purpose of your study.
2. Be aware of what the instrument was developed for and whom it was developed on; do not just assume that it will work on your sample. Be aware of cultural differences.
3. Check that appropriate item selection and reduction processes were used and that the final number of items in the instrument is not so large as to represent a burden to respondents.
4. Check the scaling for whether adding scores is justified statistically. Note that some traditionally developed instruments can be Rasch scaled to provide a more sensitive and effective (although perhaps not ideal) measurement. Score-to-measure tables that provide a cross-walk between total raw scores and Rasch measures for some traditionally developed instruments, such as the ADVS, RSVP,11 and NEI-VFQ,54 may be published or available on request from researchers who have investigated the performance of such instruments within the Rasch model.
5. Check that the validity and reliability of the instrument are adequate for your purposes.
6. Check for useful interpretation and responsiveness data that correspond to your intended purpose.

It is likely that many existing instruments will not have been tested in all the ways recommended herein. By necessity, these quality assessment criteria must be comprehensive. However, existing instruments which have not been tested on certain criteria are not necessarily flawed, just untested. Such instruments may give useful information, but should be used with caution.

CONCLUSION

The quality assessment criteria proposed herein may be useful to guide new instrument development, redevelopment of existing instruments, or assessment of existing instruments, whether for choosing an instrument for use or as part of a formal review of instruments. Questionnaire research is a dynamic field, with the importance of item response theory, particularly Rasch analysis, gaining prominence in recent ophthalmic instruments.71 We have sought to represent this progress in these quality assessment criteria while remaining inclusive of traditional methods. These quality criteria should be considered as a proposal, and we acknowledge that debate over the appropriateness of these criteria will likely occur. However, we welcome this debate, as we believe it can only lead to the evolution of better quality assessment criteria and, in turn, better assessment of patient-centered outcomes.

ACKNOWLEDGEMENTS

We thank Professor Peter Fayers, Department of Public Health, University of Aberdeen, for his initial guidance on traditional methods for the quality assessment criteria reported in this article. We also thank Dr. Trudy Mallinson for her helpful advice on this manuscript.

Received May 2, 2007; accepted June 6, 2007.

REFERENCES

1. Likert RA. A technique for the measurement of attitudes. Arch Psychol 1932;140:1–55.
2. de Boer MR, Moll AC, de Vet HC, Terwee CB, Volker-Dieben HJ, van Rens GH. Psychometric properties of vision-related quality of life questionnaires: a systematic review. Ophthal Physiol Opt 2004;24:257–73.
3. Terwee CB, Bot SD, de Boer MR, van der Windt DA, Knol DL, Dekker J, Bouter LM, de Vet HC. Quality criteria were proposed for measurement properties of health status questionnaires. J Clin Epidemiol 2007;60:34–42.