OPTOMETRY AND VISION SCIENCE
Copyright © 2007 American Academy of Optometry
ABSTRACT
Patient-reported outcome measurement has become accepted as an important component of comprehensive outcomes
research. Researchers wishing to use a patient-reported measure must either develop their own questionnaire (called an
instrument in the research literature) or choose from the myriad of instruments previously reported. This article
summarizes how previously developed instruments are best assessed using a systematic process and we propose a system
of quality assessment so that clinicians and researchers can determine whether there exists an appropriately developed
and validated instrument that matches their particular needs. These quality assessment criteria may also be useful to guide
new instrument development and refinement. We welcome debate over the appropriateness of these criteria as this will
lead to the evolution of better quality assessment criteria and in turn better assessment of patient-reported outcomes.
(Optom Vis Sci 2007;84:663–674)
Key Words: factor analysis, instrument, quality assessment, quality of life, questionnaire, Rasch analysis, reliability,
responsiveness, validity, visual disability
The assessment of health-related quality of life (HR-QoL) has been an important expansion of the assessment of the impact of disease and its treatment beyond the traditional areas of symptoms, signs, morbidity, and mortality. It provides a more holistic assessment of the effects of disease on the person, taking in such dimensions as a patient's physical, social, and emotional wellbeing. Most funding organizations now insist on a patient-reported outcome for a clinical trial of any disease intervention/treatment or assistive device. Because of the breadth of content of HR-QoL and its patient-reported nature, it has been measured using questionnaires (called instruments in the research literature), which are efficient tools for gathering large amounts of data quickly. Given the large number of instruments that have been developed over the years, investigators may find it difficult to decide upon an appropriate instrument or to decide whether a questionnaire needs to be specially developed for their study. Another problem is that originally (1920s to 1950s), the primary purpose of instrument development was to determine people's attitudes and how the range of attitudes was distributed in the population, rather than to produce a score on a quasi-continuous variable.1 As the use of instruments extended beyond psychology to medical fields, the format and purpose of instruments changed. Unfortunately, the change in the design and application of instruments also meant that traditional methods of scoring and validation became outdated, but this was not recognized for many of the originally developed instruments.

Throughout this article, we highlight key quality criteria (summarized in Table 1) that build upon previous contributions to the field.2,3 It is our aim to present a robust set of quality criteria to be used by researchers and practitioners in the selection of instruments, and we welcome comments and suggestions for their refinement or further development. These proposed quality assessment criteria (Table 1) provide a framework for a systematic review of instruments in the disease area under study, to determine whether any existing instruments are adequate for the intended use in the intended target population. In the absence of a sufficiently reliable, valid instrument with content appropriate to the intended use, the development of a new instrument for the intended purpose can be justified.
TABLE 1.
Quality assessment tool for evaluation of health status questionnaires. [The table body (Property, Definition, Quality criteria) is not reproduced in this extract.]
a If not reported, scored as "0"; ✓✓, positive rating; ✓, minimal acceptable rating; ✗, negative rating.
MID, minimally important difference; LOA, limits of agreement; ICC, intraclass correlation coefficient; SD, standard deviation.
TABLE 2.
Quality assessment of 4 refractive error-related quality of life instruments: the Psychosocial Impact of Assistive Devices (PIADS),4–7 the Refractive Status Vision Profile (RSVP),8–13 the National Eye Institute Refractive Quality of Life (NEI-RQL)12,14–17 and the Quality of Life Impact of Refractive Correction (QIRC).18–20 [The column headings (the quality criteria of Table 1) are not reproduced in this extract; each row lists an instrument's ratings in the original column order, including the rows that continued the table on a later page.]

PIADS:   ✓✓ ✓✓ ✓✓ ✓ ✓ ✗ ✓ ✓ 0 ✓✓ ✓✓ ✓✓ ✓✓ 0 ✓✓ ✓✓
RSVPa:   ✓✓ ✓ ✓✓ ✓✓ ✓ ✗ ✗ ✗ 0 ✓ ✓ ✓ 0 0 ✓b ✓b
NEI-RQL: ✓✓ ✓✓ ✓✓ ✓✓ ✗ ✗ ✗ ✗ 0 ✓✓ ✓✓ ✓✓ 0 0 ✓✓ ✓✓
QIRC:    ✓✓ ✓✓ ✓✓ ✓✓ ✓✓ ✓✓ ✓✓ ✓✓ 0 0 ✓✓ ✓✓ 0 ✓✓ ✓✓ ✓✓

a A Rasch-analyzed version of the RSVP (Garamendi et al., 2006) with a modified response scale and a reduced number of items has been shown to have greater responsiveness and test-retest reliability than the standard instrument. It also provides a unidimensional score, statistically justified response and scoring scales, and good Rasch separation reliability.
b Conflicting reports of normative data levels and responsiveness of the RSVP are provided by Schein et al. (2001) and Nichols et al. (2001).
…and does not claim to measure quality of life, but it has been misinterpreted as assessing QoL.30–32

Item Identification

To ensure a good breadth of relevance, at least three approaches should have been taken for item generation. These include obtaining sample statements, experiences, and opinions directly from: individuals within the target population, through focus groups or one-to-one interviews; experts working in the area (not just clinicians, but individuals who have contact with patients and may develop expertise in understanding the impact of the condition on the person); and the published literature in the field. Patient interviews are useful for gathering a range of opinions on a topic and can help to draw views from particular minority groups. Focus groups are useful for eliciting mediated responses that are likely to be common to the majority of individuals in a given population and can also be more productive than in-depth patient interviews, because of the enthusiasm and interaction created by the discussion process.33,34 Expert knowledge is a valuable resource, but should not be used as the sole procedure for generating items, because clinicians tend to focus on presenting complaints. There may also be issues that the patient does not present to a clinician, but which have an impact on their quality of life. For example, the RSVP is a clinician-developed instrument of QoL for refractive surgery and has been shown to include too many items related to symptoms and functional problems,11 whereas patients are more concerned about issues such as convenience, cost, health concerns, and wellbeing.17

Pilot Questionnaire

Item generation will typically produce a vast number of items. An item removal process is required to determine which items to retain for the final instrument. A pilot questionnaire is best used for this process (see quality criteria in Table 1). The pilot questionnaire indicates how well each item taps the underlying construct being measured, and allows poorly discriminating, unreliable, or invalid items to be removed. The respondent population for the pilot data should have been broad and representative of the target population.

Unidimensionality and Item Reduction

Item reduction is performed to maximize item quality, measurement precision, and targeting of items to persons. Unidimensionality is the demonstration that all items included in an instrument fit with a single underlying construct (e.g., VR-QoL); it is a prerequisite for appropriate summation of any set of items24,35 and an important asset if a meaningful measurement is to be obtained.35,36 A number of statistical methodologies are used to justify item reduction and give insight into dimensionality:

• Conventional descriptive statistics
• Cronbach's alpha
• Factor analysis
• Rasch analysis

Statistical methods for item reduction serve to highlight the worst performing items, which are removed. The items are removed one at a time, with the analyses performed iteratively to calculate the improvement in the instrument and to identify the next candidate item for removal. Traditionally, the following descriptive and statistical analyses have been used to determine candidate items for removal:4,5,26

• Missing data. Items that have large percentages (>50%) of missing data are likely to be ambiguous, or not applicable to many respondents.
• Normality. All items should approximate a normal distribution, as identified using histogram plots, nonsignificant results on tests of normality such as Kolmogorov-Smirnov or Shapiro-Wilk, or skewness and kurtosis values within −2.00 to +2.00, although items at the ends of the scale will likely deviate from normal.37

Unidimensionality of the whole instrument must be considered when deciding which items to remove. Traditionally, Cronbach's alpha and factor analysis were used to assess unidimensionality. Cronbach's alpha determines the correlation of every item in the instrument with every other item; the nearer Cronbach's alpha is to 1, the more internally consistent the scale. Cronbach's alpha can also be calculated with each item in turn deleted from the analysis. If alpha increases (relative to the alpha of all items included), this indicates that the removed item was not contributing to unidimensionality. Because Cronbach's alpha is essentially determined by the average of the correlation coefficients between items, exceptionally high values of Cronbach's alpha (>0.90) may be indicative of redundancy (e.g., in the RSVP, see Table 2). Although this does not contravene unidimensionality, redundancy is a problem if the process of creating the "overall score" for the instrument involves just adding all the item scores together. In such a case, the overall score overweighs the importance of the issue that is served by the redundant items. Therefore, in our quality criteria, we downgrade those instruments with Cronbach's alpha >0.90 (Table 1). Similarly, Cronbach's alpha is not independent of the number of items and may be elevated by including many items. For these reasons, Cronbach's alpha should probably be considered more of a traditional indicator than a useful one.38 Nevertheless, we retain it in our quality criteria as it is such a commonly reported metric: values should be >0.70 and <0.90.
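As a concrete illustration of the alpha-based checks just described, the short Python sketch below computes Cronbach's alpha and the alpha-if-item-deleted diagnostic on simulated Likert-type data. This is a minimal sketch, not taken from the article: the function names, simulated data, and reporting are ours, but the formula, the 0.70 to 0.90 window, and the item-deletion logic follow the text.

```python
import numpy as np

def cronbach_alpha(items):
    """alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))."""
    x = np.asarray(items, dtype=float)
    k = x.shape[1]
    return (k / (k - 1)) * (1 - x.var(axis=0, ddof=1).sum()
                            / x.sum(axis=1).var(ddof=1))

def alpha_if_deleted(items):
    """Alpha recomputed with each item removed in turn; a value above the
    full-scale alpha flags an item not contributing to unidimensionality."""
    x = np.asarray(items, dtype=float)
    return [cronbach_alpha(np.delete(x, i, axis=1)) for i in range(x.shape[1])]

# Simulated 5-category Likert responses: 200 respondents x 10 items.
rng = np.random.default_rng(0)
trait = rng.normal(size=(200, 1))
items = np.clip(np.round(3 + trait + rng.normal(0, 0.8, (200, 10))), 1, 5)

alpha = cronbach_alpha(items)
print(f"alpha = {alpha:.3f}")  # acceptable window per the criteria: >0.70 and <0.90
for i, a in enumerate(alpha_if_deleted(items)):
    if a > alpha:
        print(f"item {i}: alpha rises to {a:.3f} if deleted -> removal candidate")
```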
Factor analysis is a multivariate statistical method that is used to explain variance within a matrix of variables and reduces those variables to a number of factors. This method can be used to determine whether an instrument possesses unidimensionality.26 The proportion of the variance described by the principal (most significant) factor indicates whether the instrument tests in one or more content areas. In addition, factor analysis can be "rotated" by various techniques, such as Varimax or Oblimin, to find items which have high communality and thus form additional factors. This grouping of items into additional factors can be used to justify the creation of subscale indices, as items that load onto the same factor are likely to sample the same content area specified by the factor to which they contribute. Subscales should be proposed hypothetically and justified with confirmatory factor analysis, rather than simply being the product of exploratory factor analysis.39 Once demonstrated to exist by factor analysis, subscales themselves should also be assessed for unidimensionality. Factor analysis can guide item reduction by indicating both failure to fit (items contributing <0.40 to a particular factor) and redundancy (>0.80). Ideally, factor analysis should be performed on Rasch-scaled data, so that items do not group simply because of similar item difficulty.40
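The dominant-factor check described above can be sketched with an unrotated principal-components decomposition of the item correlation matrix. This is only an approximation of a full factor analysis (no rotation, no confirmatory modelling), and the function name and wiring are ours; the <0.40 and >0.80 loading thresholds are the ones quoted in the text.

```python
import numpy as np

def principal_factor_check(items, fail_cutoff=0.40, redundancy_cutoff=0.80):
    """Share of variance on the first (principal) factor of the item
    correlation matrix, plus items flagged by their first-factor loadings."""
    x = np.asarray(items, dtype=float)
    corr = np.corrcoef(x, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)        # eigh returns ascending order
    share = eigvals[-1] / eigvals.sum()            # dominant-factor variance share
    loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])
    failing = np.flatnonzero(np.abs(loadings) < fail_cutoff)       # failure to fit
    redundant = np.flatnonzero(np.abs(loadings) > redundancy_cutoff)  # redundancy
    return share, loadings, failing, redundant

# With the simulated `items` matrix from the previous sketch:
# share, loadings, failing, redundant = principal_factor_check(items)
```

A large first-factor share with no failing items is consistent with (though not proof of) unidimensionality; rotated and confirmatory analyses require a dedicated package.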
More recently developed instruments have used Rasch analysis to help guide item reduction.41–45 Rasch analysis provides a more detailed view of dimensionality through both model and item fit statistics.38 The item-trait interaction score, reported as a χ², reflects the property of invariance across the trait. A statistically nonsignificant probability value (p > 0.05) indicates no substantial deviation from the model, which implies unidimensionality.20 The infit and outfit statistics also help to identify which items contribute most to the measurement of the latent trait. Infit and outfit mean squares have an expected value of 1.00. Infit mean squares below 0.80 represent items that are too predictable (they have at least 20% less variation than expected); these overfitting items may be redundant or lack the variance to contribute new information to the measure. Mean outfit values >1.20 represent misfit (at least 20% more variance than expected) and suggest that the item measures something different to the overall scale. Acceptable values for item inclusion may be 0.80 to 1.20 for a strict definition (often used for infit) or 0.70 to 1.30, or even higher, for a lenient definition. Alternatively, fit residuals may be used, in which case values >2.5, or probability values below the Bonferroni-adjusted alpha value (i.e., 0.05/number of items), are also used to indicate misfit to the model. Rasch analysis can also indicate the effect of removing an item on overall scale performance: if removal of an item considerably decreases person separation, that item should be retained.36 Person separation is an indicator of the ability (precision) of the instrument to differentiate between different people's quality of life. Person separation is expressed as the ratio of the adjusted standard deviation to the root mean square error, and a person separation value of 2.0 or more indicates that subjects are significantly different in QoL across the measurement distribution.46
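Because infit and outfit statistics come out of a Rasch estimation rather than a closed formula, the sketch below assumes they have already been exported from Rasch software and simply applies the strict and lenient windows quoted above. The item names and values are hypothetical.

```python
# Hypothetical fit statistics exported from Rasch software:
# item -> (infit mean square, outfit mean square).
item_fit = {
    "driving at night": (1.05, 1.12),
    "reading small print": (0.62, 0.55),   # overfit: too predictable, possibly redundant
    "glare symptoms": (1.28, 1.45),        # misfit: variance off the latent trait
}

STRICT = (0.80, 1.20)    # often applied to infit
LENIENT = (0.70, 1.30)

def flag_misfitting(fit_stats, window):
    lo, hi = window
    return {item: stats for item, stats in fit_stats.items()
            if not all(lo <= s <= hi for s in stats)}

print(flag_misfitting(item_fit, STRICT))   # removal candidates, removed one at a time
```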
Targeting of Items to Persons

Rasch analysis also provides insight into the targeting of item difficulty to person ability and can therefore be used to remove items that target the population less well.47 Figs. 1 and 2 show person-item maps for a group of cataract patients responding to the Activities of Daily Vision Scale (ADVS),48 a visual activity limitation instrument. This analysis rank orders the items and the participant responses. The means of the two distributions (person and item) are denoted by 'M'. If the items were well targeted to the subjects, the means of the two distributions should be close (e.g., 0.5 logits) to each other. In Fig. 1, the original conventionally validated ADVS is shown, and it can be seen that the means are far apart. Fig. 2 shows how item reduction of the ADVS, using Rasch analysis, provides better targeting of item difficulty to patient ability, with the 'M' values now closer together. This was achieved through removal of items that were too easy for patient ability. This item reduction approach can lead to a minimum item set, which has the optimum instrument efficiency and the advantage of shortening test time and reducing user and respondent burden.

FIGURE 1.
Patient activity limitation/item difficulty map for the 22-item ADVS. On the left of the dashed line are the patients, represented by X. On the right are the cross-over points between each response scale (the level of the scale where the answer category is most probable to be rated by a patient with that activity limitation). More able patients and more difficult items are near the bottom of the map; less able patients and less difficult items are near the top. The scale is in log units (0–100). M, mean; S, 1 SD from the mean; T, 2 SD from the mean. [The corresponding map for the item-reduced ADVS (Fig. 2) is not reproduced in this extract.]

Criteria to guide item removal that incorporate all of these approaches have been proposed17,49 (a screening sketch follows the list below). The suggested infit and outfit ranges are only guides and can depend largely on sample size.50

1. Infit mean square outside 0.70 to 1.30
2. Outfit mean square outside 0.70 to 1.30
3. Item with mean furthest from the subject mean
4. High proportion of missing data (>50%)
5. Ceiling effect: a high proportion of responses in the item end-response category (>50%)
6. Items with a markedly different standard deviation of item scores to other items
7. Items that do not demonstrate a normal distribution, as identified using histogram plots, tests of normality such as Kolmogorov-Smirnov or Shapiro-Wilk, or skewness and kurtosis values outside −2.00 to +2.00.
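A minimal screening pass over criteria 4, 5, and 7 of the list above might look like the following. The function name and wiring are ours; note that scipy's kurtosis is the excess (Fisher) form, and the ±2.00 window is applied to it on the assumption that this matches the convention intended in the text.

```python
import numpy as np
from scipy import stats

def screen_items(items, max_missing=0.50, max_end_category=0.50):
    """Flag items against criteria 4, 5, and 7 above; NaN encodes a missing response."""
    x = np.asarray(items, dtype=float)
    flags = {}
    for i in range(x.shape[1]):
        col, reasons = x[:, i], []
        if np.isnan(col).mean() > max_missing:                       # criterion 4
            reasons.append("missing responses > 50%")
        answered = col[~np.isnan(col)]
        # criterion 5: ceiling (the floor check at answered.min() is analogous)
        if (answered == answered.max()).mean() > max_end_category:
            reasons.append("end-category pile-up > 50%")
        if abs(stats.skew(answered)) > 2.0 or abs(stats.kurtosis(answered)) > 2.0:
            reasons.append("skewness/kurtosis outside -2.00 to +2.00")  # criterion 7
        if reasons:
            flags[i] = reasons
    return flags
```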
Rating Scale

Unfortunately, many QoL instruments use traditional summary scoring, where an overall score is derived through summative scoring of responses. Summary scoring is based on the statistical hypotheses that all questions have equal importance and that response categories are accordingly scaled to have equal value, with uniform increments from category to category. In cases where the items in an instrument no longer have equal importance, the logic of averaging scores across all items becomes questionable. For example, in a summary scaled visual activity limitation instrument, the ADVS, "a little difficulty" scores 4, "extreme difficulty" is twice as bad and scores 2, and "unable to perform the activity due to vision" is again twice as bad, with a score of 1. The ADVS also ascribes the same response scale to a range of different items, such that "a little difficulty" "driving at night" receives the same numerical score as "a little difficulty" "driving during the day", despite the former being by far the more difficult and complex task. This rationale of "one size fits all" is flawed in this case, and Rasch analysis has been used to confirm …

FIGURE 3.
(A) Rasch model category probability curves for the faces pain scale, representing the likelihood that a subject with a particular pain severity will select a category. The x-axis represents pain. For any given point along this scale, the category most likely to be chosen by a subject is shown by the category curve with the highest probability. At no point is category 5 the most likely to be selected. This suggests there are too many categories and that these are not used in order. (B) Rasch model category probability curves for the faces scale shortened to 5 categories by combining categories 2 and 3, and 5 and 6. This model gives excellent symmetry and the thresholds are now ordered. Both figures reproduced with permission from J Pain 2005;6:630–6.
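The category-collapsing repair described in the Fig. 3 caption amounts to a simple recode of the raw responses before re-running the Rasch model. A hypothetical sketch, assuming the seven categories are coded 1 to 7:

```python
import numpy as np

# Merge categories 2 and 3, and 5 and 6, of a 7-category scale (coded 1-7),
# then re-run the Rasch model to check that the thresholds are now ordered.
collapse = {1: 1, 2: 2, 3: 2, 4: 3, 5: 4, 6: 4, 7: 5}
responses = np.array([1, 3, 5, 7, 2, 6, 4])
recoded = np.vectorize(collapse.get)(responses)
print(recoded)   # -> [1 2 4 5 2 4 3]
```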
…going process requiring the statement of hypotheses and the testing thereof: if an instrument measures a trait, then it should correlate with another measure, etc. There are specific types of validity that together contribute to construct validity, e.g., concurrent, convergent, predictive, and discriminant validity. Although it is not possible to perform all of these tests, it is important that construct validity is a hypothesis-driven process. Sometimes the hypothesis will be simple and fall easily under the heading of, e.g., convergent validity. At other times, complex hypothesis testing will not be readily subclassified but will be critical to the establishment of construct validity. With the right set of hypotheses and tests, a persuasive picture of construct validity can be developed.

Criterion validity is a traditional definition of validity where an instrument is correlated with an existing "standard" or accepted measure which measures the same thing. However, criterion validity can be further subdivided, so we use "criterion-related validity" as an umbrella term here.

Convergent validity is the classic form of criterion validity, where a new instrument is correlated with something that measures a related construct. For visual activity limitation instruments, correlation with visual acuity (VA), or with an existing validated visual activity limitation instrument (e.g., the VF-14)32 is typically used to indicate convergent validity. Suitable statistical analyses are a Pearson correlation coefficient for continuous variables or, for dichotomous data, a chi-squared analysis with a phi coefficient as a measure of the correlation. Note that for convergent validity, a very high correlation (>0.90) is not advantageous, as it suggests that the new instrument provides information so similar to a previously developed instrument or other measure that it provides no significant additional information. So, a moderate correlation may actually be better than a high one, because it indicates that the two measures are related but the instrument is also providing different information. However, a low correlation implies that two measures which are hypothesized to be related are not very well related at all. A cutoff of 0.3 is probably appropriate as the minimum correlation between two measures which should be related. Therefore, the hypothesis is critical for convergent validity.
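As a sketch of the convergent validity test, the following applies a Pearson correlation with the 0.3 and 0.9 decision points discussed above to simulated data; the variables and the wording of the verdicts are ours.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical data: new instrument score vs. visual acuity (logMAR), 80 patients.
va = rng.normal(0.3, 0.2, 80)
score = 0.5 * va + rng.normal(0, 0.15, 80)   # related, but not identical

r, p = stats.pearsonr(score, va)
if abs(r) < 0.3:
    verdict = "below 0.3: hypothesised relation not supported"
elif abs(r) > 0.9:
    verdict = "above 0.9: little new information beyond the existing measure"
else:
    verdict = "moderate: related yet distinct, as desired"
print(f"r = {r:.2f} (p = {p:.3g}); {verdict}")
```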
Discriminant validity is the degree to which an instrument diverges from other instruments that should be dissimilar. This is probably the validity test performed least often; there are no results in Table 2. For refractive error-related QoL instruments, it might be simple to show a poor correlation to an instrument designed for measuring visual activity limitation, because disability is not typically a component of the former. The statistical test required is again a simple Pearson correlation coefficient, but in this case a poor correlation, e.g., <0.3, is the desired result. More complex hypotheses of concurrent and discriminant validity could also be set. For example, a new cataract-specific visual activity limitation instrument could be hypothesized to correlate very well with an existing cataract-specific visual activity limitation instrument, less well with an ophthalmic QoL instrument, and least well with a general health QoL instrument. Such a hypothesis can avoid the 0.3 cutoff, as the correlations may well be of the order of 0.7, 0.5, and 0.3, respectively, and therefore provide good criterion-related validity evidence for both convergent and discriminant validity.

Predictive validity determines whether the instrument can make accurate predictions of future outcomes. For example, can a score on a visual activity limitation instrument be used to predict the need for cataract surgery? This may be worthwhile because people could be prioritized for examination based on instrument scores, and some people with minimal activity limitation could be spared a costly comprehensive eye examination. Again, a simple Pearson correlation coefficient (assuming a normal distribution; alternatively a Spearman rank correlation) is the appropriate test, and a correlation of >0.3 is an appropriate cutoff, although for predictive validity a very high correlation is not disadvantageous. For a dichotomous outcome, a significant χ² or odds ratio would be appropriate.

Concurrent validity illustrates an instrument's ability to distinguish between groups that it should theoretically be able to distinguish.22 Critically, both are measured at the same time, rather than one being measured at a future time. For example, an instrument designed for a particular condition should be able to discriminate between groups with and without the condition. Testing such a hypothesis is often the easiest contribution to construct validity. The instrument is administered to two groups, one with the condition, one without. For simplicity, equivocal cases are not included in the analysis, although this provides weaker evidence of validity, because it may be the equivocal cases where the instrument is most needed (assuming there is a needs-based reason for developing the new instrument). The results become more powerful when discriminating between two groups that are very similar.
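The known-groups comparison just described reduces to a two-sample test on instrument scores. A minimal sketch on hypothetical data, using an independent-samples t-test and assuming roughly normal scores (a Mann-Whitney U test would be the non-parametric alternative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Hypothetical instrument scores; equivocal cases already set aside.
with_condition = rng.normal(40, 10, 60)
without_condition = rng.normal(55, 10, 60)

t, p = stats.ttest_ind(with_condition, without_condition)
print(f"t = {t:.2f}, p = {p:.2g}")  # significant separation supports the hypothesis
```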
Validity demonstrates that the instrument measures the construct that it was intended to measure and relates well to other measures of the same or similar constructs. It does not, however, show that the construct is consistently captured across respondents, time, and setting.
Reliability. Reliability is the consistency of the instrument in measuring the same construct over different administrations, but it does not indicate validity, as it makes no assumption that the correct construct was measured in the first place. Reliability generally examines the proportion of the total variance that is attributable to true differences among subjects. Total variance includes the true differences and measurement error; that measurement error is considered to result from time, rater, and content selection.26 Reliability is a very important quality of an instrument, as unreliability detracts from validity. For example, if a test has poor reliability, such that test results correlate poorly with retest results, it is unlikely that results from the test will correlate highly with gold standard measures, so its concurrent and convergent validity will also be impaired.

The reliability of an instrument can be explored using many methods, which can be classified broadly into two categories: single administration and multiple administrations. Single administration methods include split-half and internal consistency tests, for example Cronbach's alpha. These methods, however, are really examining 'internal consistency reliability', which indicates unidimensionality (as discussed above) rather than reliability. In particular, claims of very good instrument reliability based on very high Cronbach's alpha values (>0.90) can be downgraded, as such values are more indicative of redundancy in the instrument. It is important that Cronbach's alpha is not overemphasized as a measure of reliability and that the other attributes of reliability are reported. Multiple administration methods include test-retest, alternate forms (intermode), and interobserver reliability (not appropriate for self-administered instruments) and are typically calculated using the Pearson product-moment correlation coefficient (r), the intraclass correlation coefficient (ICC),26,59 Bland-Altman limits of agreement,60,61 or kappa statistics.

The ICC is defined as the ratio of the between-groups variance to the total variance. Thus it is a measure of agreement, and it is valid to use it as such when there is no intrinsic ordering of the two measures under comparison, e.g., in test-retest reliability.62 The ICC is dependent on the range of responses, so care must be taken with the population in question.63
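The ICC can be computed from a two-way ANOVA decomposition of an n-subjects-by-k-administrations score matrix. The sketch below implements the ICC(2,1) form of Shrout and Fleiss (two-way random effects, absolute agreement, single measures), a common choice for test-retest data; the article does not specify a particular ICC variant, so that choice is our assumption.

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single measures.
    ratings: n subjects (rows) x k administrations (columns)."""
    y = np.asarray(ratings, dtype=float)
    n, k = y.shape
    grand = y.mean()
    ms_rows = k * ((y.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # subjects
    ms_cols = n * ((y.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # occasions
    resid = y - y.mean(axis=1, keepdims=True) - y.mean(axis=0) + grand
    ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

# Hypothetical test and retest scores for five subjects.
scores = np.array([[10, 11], [14, 15], [18, 17], [22, 24], [30, 29]])
print(f"ICC(2,1) = {icc_2_1(scores):.3f}")
```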
The Bland-Altman limits of agreement (LoA) define the range of values over which 95% of the differences between two measures should lie.60,61 This is a simple method to perform and is applicable to many situations as long as the units of measurement (e.g., diopters for refractive error) are the same (for reliability testing the units of measurement are essentially the same). The advantage of this approach is that it is robust to large data ranges and can detect and manage bias. Interpretation of whether a limit of agreement is a good or a bad result requires clinical context. Therefore, a disadvantage of this approach lies in interpretability if the scale of the instrument is unfamiliar. For an LoA result showing that the reliability of subjective refraction is ±0.50 D, an optometrist or related clinician will readily understand the precision of the measurement, but other people would not know whether this was good or bad without an appreciation of typical values for the scale.
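Limits of agreement follow directly from the distribution of test-retest differences: the bias plus or minus 1.96 standard deviations of the differences. A sketch on hypothetical refraction data (the function name and values are ours):

```python
import numpy as np

def limits_of_agreement(m1, m2):
    """Bland-Altman: bias and the interval expected to contain 95% of
    the differences between two measures (bias +/- 1.96 SD of differences)."""
    d = np.asarray(m1, dtype=float) - np.asarray(m2, dtype=float)
    bias = d.mean()
    half_width = 1.96 * d.std(ddof=1)
    return bias, (bias - half_width, bias + half_width)

# Hypothetical repeat subjective refractions (spherical equivalent, D).
test = np.array([-2.00, -1.25, -3.50, -0.75, -2.25, -4.00])
retest = np.array([-2.25, -1.25, -3.25, -1.00, -2.00, -4.25])
bias, (lo, hi) = limits_of_agreement(test, retest)
print(f"bias = {bias:+.2f} D, 95% LoA = {lo:+.2f} to {hi:+.2f} D")
```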
Kappa statistics should be used when comparing categorical data.64 This statistic is designed to indicate the agreement between two measurers using the same nominal scale, corrected for agreement that would occur by chance. Kappa varies from −1 to 1, where 0 is equivalent to agreement occurring by chance. A kappa of 0.81 or greater represents "almost perfect agreement", and between 0.61 and 0.80 represents "substantial agreement".65 A kappa statistic ≥0.70 is desirable for reliability testing of instrument responses. A weighted kappa statistic is designed for ordinal categorical data, such as that seen with instrument response scales, where a greater penalty is given for pairs with greater disagreement over scale categories. Kappa weighted with the quadratic weighting scheme is mathematically identical to the ICC.66 Notably, kappa statistics depend upon the prevalence of the characteristic under study, so they are not directly comparable across measures.
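Both plain and weighted kappa are available in scikit-learn; the quadratic weighting shown below is the scheme noted above as mathematically identical to the ICC. The response data are hypothetical.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical repeated categorical responses from the same ten subjects.
first = [1, 2, 2, 3, 4, 4, 5, 1, 3, 2]
second = [1, 2, 3, 3, 4, 5, 5, 1, 2, 2]

plain = cohen_kappa_score(first, second)                          # nominal agreement
weighted = cohen_kappa_score(first, second, weights="quadratic")  # ordinal; ~ICC
print(f"kappa = {plain:.2f}, quadratic-weighted kappa = {weighted:.2f}")
```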
In addition to the above tests, Rasch analysis also provides person and item separation reliability indices, indicating the overall performance of an instrument. Separation reliability is the ratio of the true variance in the estimated measures to the observed variance and indicates the number of distinct person strata that can be distinguished.36 There are a number of versions of separation, including the Person Separation Index (PSI), or person separation reliability, which can range from 0 to 1, with high values indicating better reliability. A PSI value of 0.8 is the equivalent of a G value (person separation ratio) of 2.0, representing the ability to distinguish three distinct strata of person ability.58,67 A value of 0.9 is equivalent to a G value of 3, with the ability to distinguish four strata of person ability. Item separation reliability should also be reported, with 0.8 being the cutoff for both in terms of acceptability.
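The PSI-to-G correspondence quoted above can be checked in a few lines, using the standard relations PSI = G²/(1 + G²) and strata ≈ (4G + 1)/3 (the strata formula is the usual Wright formulation, supplied by us rather than the article):

```python
def strata(G):
    """Number of statistically distinct person strata for separation ratio G."""
    return (4 * G + 1) / 3

def psi_to_G(psi):
    """Convert person separation reliability (PSI, 0-1) to separation ratio G."""
    return (psi / (1 - psi)) ** 0.5

for psi in (0.8, 0.9):
    G = psi_to_G(psi)
    print(f"PSI {psi:.1f} -> G = {G:.1f} -> about {strata(G):.0f} strata")
```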
Other Important Indicators: Responsiveness and Interpretation. Responsiveness is the extent to which the instrument can detect clinically important changes over time.68,69 This can be studied in patients who are known to undergo a change in status over a time frame, e.g., before and after cataract surgery. The perspective of what constitutes a "clinically important" change is given by the minimum clinically important difference (MID). The MID indicates the smallest difference in score that can be perceived as beneficial by the subject. This is calculated relative to a difference reported by the patient. For example, one could ask cataract patients: "By how much has the operation improved your vision?" and provide the options: "made it worse", "not at all", "a little", "quite a bit", "a lot". The score change in the instrument of interest that equates to a change in status from one step to the next on this question can be used to calculate the MID with receiver operating characteristic analysis. The MID ideally should be larger than the LoA of the test-retest reliability of the instrument, as this means that the reliability of the test does not interfere with detection of the MID. Although this criterion may not always be achieved, a MID comparable to the LoA is still scored as a positive result (Table 1).
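The ROC approach to the MID described above can be sketched as follows: dichotomize the transition question at the step of interest (e.g., at least "a little" improvement), then choose the change-score cutoff that best separates improved from unimproved patients. The simulated data and the use of the Youden index to pick the cutoff are our assumptions.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(3)
# Hypothetical: instrument change scores after surgery, and whether the patient
# reported at least "a little" improvement on the transition question (1 = yes).
change = np.concatenate([rng.normal(2, 3, 70), rng.normal(9, 4, 130)])
improved = np.concatenate([np.zeros(70), np.ones(130)])

fpr, tpr, thresholds = roc_curve(improved, change)
mid = thresholds[np.argmax(tpr - fpr)]       # Youden index picks the cutoff
print(f"estimated MID = {mid:.1f} points")   # then compare against the LoA
```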
To demonstrate that an instrument is responsive to an intervention, the mean change, e.g., with cataract surgery, needs to be greater than the MID. Responsiveness can be expressed by a number of statistics: effect size, the difference between pre- and postoperative scores divided by the preoperative standard deviation; the standardized response mean (SRM), the mean of the change scores divided by the standard deviation of the change scores; and the responsiveness statistic (RS), the difference between pre- and postoperative scores divided by the standard deviation of the retest scores. For each of these measures, convention holds that effect sizes of 0.20 to 0.49 are considered small, 0.50 to 0.79 moderate, and 0.80 or above large.70
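The three responsiveness statistics have simple closed forms, so they can be computed directly. The function name, the simulated data, and the retest_sd input (the standard deviation of retest scores, supplied from a separate test-retest study) are ours:

```python
import numpy as np

def responsiveness(pre, post, retest_sd):
    """Effect size, SRM, and responsiveness statistic, as defined above."""
    pre, post = np.asarray(pre, dtype=float), np.asarray(post, dtype=float)
    change = post - pre
    effect_size = change.mean() / pre.std(ddof=1)
    srm = change.mean() / change.std(ddof=1)
    rs = change.mean() / retest_sd
    return effect_size, srm, rs

rng = np.random.default_rng(4)
pre = rng.normal(50, 12, 90)
post = pre + rng.normal(10, 8, 90)    # hypothetical post-surgery gain
es, srm, rs = responsiveness(pre, post, retest_sd=6.0)
print(f"ES = {es:.2f}, SRM = {srm:.2f}, RS = {rs:.2f}")  # >=0.80 counts as large
```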
Interpretation indicates the degree to which scores on a measure can be considered meaningful. To ensure interpretation of an instrument, the instrument should be tested on a representative target population whose demographics are fully described. Normative scores and the minimum clinically important difference (see responsiveness) should be described. The amount of interpretation information that should be described depends on the purpose of the instrument. For example, an instrument intended for cataract surgery probably need only report normative data (means and SDs) for typical populations of bilateral and second eye surgery cases, although one could perhaps argue that cataract-only and cataract-with-comorbidity populations should also be described. Contrast this with an instrument designed for use across all ophthalmic conditions: normative data would need to be provided for a great many eye diseases. Data may also need to be provided for subgroups other than disease, e.g., age, gender, socioeconomic status. Scores before and after important interventions, e.g., cataract surgery, should also be provided.
Recommendations for Instrument Selection

In this article, we have presented a range of methods and analysis techniques for developing and validating instruments and scales. These guidelines are intended to help investigators understand what determines instrument quality and to assist interpretation of articles detailing instrument development. Once the basic principles of psychometric methods are understood, we recommend that researchers wishing to include a QoL measure in a study or clinical trial, and not wishing to develop and validate their own instrument, use the following instrument selection process and the quality criteria presented in Table 1 to guide their selection of an appropriate instrument.

Instrument Selection Process.

1. Be sure that the content area of the instrument suits the purpose of your study.
2. Be aware of what the instrument was developed for and whom it was developed on; do not just assume that it will work on your sample. Be aware of cultural differences.
3. Check that appropriate item selection and reduction processes were used and that the final number of items in the instrument is not so large as to represent a burden to respondents.
4. Check the scaling for whether adding scores is justified statistically. Note that some traditionally developed instruments can be Rasch scaled to provide a more sensitive and effective (although perhaps not ideal) measurement. Score-to-measure tables that provide a cross-walk between total raw scores and Rasch measures for some traditionally developed instruments, such as the ADVS, RSVP,11 and NEI-VFQ,54 may be published or available on request from researchers who have investigated the performance of such instruments within the Rasch model.
5. Check that the validity and reliability of the instrument are adequate for your purposes.
6. Check for useful interpretation and responsiveness data that correspond to your intended purpose.

It is likely that many existing instruments will not have been tested in all the ways recommended herein. By necessity, these quality assessment criteria must be comprehensive. However, existing instruments which have not been tested on certain criteria are not necessarily flawed, just untested. Such instruments may give useful information, but should be used with caution.

CONCLUSION

The quality assessment criteria proposed herein may be useful to guide new instrument development, redevelopment of existing instruments, or assessment of existing instruments, whether for choosing an instrument for use or as part of a formal review of instruments. Questionnaire research is a dynamic field, with the importance of item response theory, particularly Rasch analysis, gaining prominence in recent ophthalmic instruments.71 We have sought to represent this progress in these quality assessment criteria while remaining inclusive of traditional methods. These quality criteria should be considered as a proposal, and we acknowledge that debate over the appropriateness of these criteria will likely occur. However, we welcome this debate, as we believe it can only lead to the evolution of better quality assessment criteria and, in turn, better assessment of patient-centered outcomes.

ACKNOWLEDGEMENTS

We thank Professor Peter Fayers, Department of Public Health, University of Aberdeen, for his initial guidance on traditional methods for the quality assessment criteria reported in this article. We also thank Dr. Trudy Mallinson for her helpful advice on this manuscript.

Received May 2, 2007; accepted June 6, 2007.

REFERENCES

1. Likert RA. A technique for the measurement of attitudes. Arch Psychol 1932;140:1–55.
2. de Boer MR, Moll AC, de Vet HC, Terwee CB, Volker-Dieben HJ, van Rens GH. Psychometric properties of vision-related quality of life questionnaires: a systematic review. Ophthal Physiol Opt 2004;24:257–73.
3. Terwee CB, Bot SD, de Boer MR, van der Windt DA, Knol DL, Dekker J, Bouter LM, de Vet HC. Quality criteria were proposed for measurement properties of health status questionnaires. J Clin Epidemiol 2007;60:34–42.