Abstract
Background: The COSMIN checklist (COnsensus-based Standards for the selection of health status Measurement
INstruments) was developed in an international Delphi study to evaluate the methodological quality of studies on
measurement properties of health-related patient-reported outcomes (HR-PROs). In this paper, we explain our
choices for the design requirements and preferred statistical methods for which no evidence is available in the
literature or on which the Delphi panel members had substantial discussion.
Methods: The issues described in this paper are a reflection of the Delphi process in which 43 panel members
participated.
Results: The topics discussed are internal consistency (relevance for reflective and formative models, and
distinction with unidimensionality), content validity (judging relevance and comprehensiveness), hypotheses testing
as an aspect of construct validity (specificity of hypotheses), criterion validity (relevance for PROs), and
responsiveness (concept and relation to validity, and (in)appropriate measures).
Conclusions: We expect that this paper will contribute to a better understanding of the rationale behind the
items, thereby enhancing the acceptance and use of the COSMIN checklist.
© 2010 Mokkink et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.
checklist that contain general requirements for articles in which IRT methods are applied (IRT box), and general requirements for the generalisability of the results (Generalisability box), respectively. More information on how to use the COSMIN checklist can be found elsewhere [4].
The checklist can be used, for example, in a systematic review of measurement properties, in which the quality of studies on measurement properties of instruments with a similar purpose is assessed, and the results of those studies are compared with a view to selecting the best instrument. If the results of high-quality studies differ from the results of low-quality studies, this can be an indication of bias. Consequently, instrument selection should be based on the high-quality studies. The COSMIN checklist can also be used as guidance for designing or reporting a study on measurement properties. Furthermore, students can use the checklist when learning about measurement properties, and reviewers or editors of journals can use it to appraise the methodological quality of studies on measurement properties. Note that the COSMIN checklist is not a checklist for the evaluation of the quality of a HR-PRO, but for the methodological quality of studies on their measurement properties.
As a foundation for the content of the checklist, we developed a taxonomy of all included measurement properties, and reached international consensus on terminology and definitions of measurement properties [5]. The focus of the checklist is on studies on measurement properties of HR-PROs used in an evaluative application, i.e. longitudinal assessment of treatment effects or changes in health over time.
In this paper, we provide a clarification for some parts of the COSMIN checklist. We explain our choices for the included design requirements and preferred statistical methods for which no evidence is available in the literature or which generated substantial discussion among the members of the Delphi panel. The topics that are subsequently discussed in detail are internal consistency, content validity, hypotheses testing as an aspect of construct validity, criterion validity, and responsiveness.

Internal Consistency
Internal consistency was defined as the interrelatedness among the items [5]. In Figure 1 its standards are given. The discussion was about the relevance of internal consistency for reflective models and formative models, and on the distinction between internal consistency and unidimensionality.
The Delphi panel reached consensus that the internal consistency statistic only gets an interpretable meaning when (1) the interrelatedness among the items is
determined of a set of items that together form a reflective model, and (2) all items tap the same construct, i.e., they form a unidimensional (sub)scale [6,7].
A reflective model is a model in which all items are a manifestation of the same underlying construct [8,9]. These items are called effect indicators and are expected to be highly correlated and interchangeable [9]. Its counterpart is a formative model, in which the items together form a construct [8]. These items do not need to be correlated. Therefore, internal consistency is not relevant for items that form a formative model. For example, stress could be measured by asking about the occurrence of different situations and events that might lead to stress, such as job loss, death in a family, divorce etc. These events obviously do not need to be correlated, thus internal consistency is not relevant for such an instrument. Often, authors do not explicitly describe whether their HR-PRO is based on a reflective or formative model. To decide afterwards which model was used, one can do a simple “thought test”. With this test one should consider whether all item scores are expected to change when the construct changes. If yes, a reflective model is at issue. If not, the HR-PRO instrument is probably based on a formative model [8].
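The difference can also be made concrete with a small simulation. The sketch below is purely illustrative and not part of the COSMIN checklist; the data, the number of items and the noise levels are made up. It generates five items under a reflective model, in which every item reflects one latent construct, and five items under a formative model, in which the causal indicators are independent, and then compares the average inter-item correlation.

# Illustrative simulation (not part of the COSMIN checklist): item
# correlations under a reflective versus a formative measurement model.
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # hypothetical number of respondents

# Reflective model: every item is an effect indicator of one latent construct,
# so item scores move together when the construct changes.
latent = rng.normal(size=n)
reflective_items = np.column_stack(
    [latent + rng.normal(scale=0.5, size=n) for _ in range(5)]
)

# Formative model: independent causal indicators (e.g. job loss, bereavement,
# divorce) that together define the construct but need not correlate.
formative_items = rng.normal(size=(n, 5))

def mean_inter_item_r(items: np.ndarray) -> float:
    """Average off-diagonal correlation among the items (columns)."""
    r = np.corrcoef(items, rowvar=False)
    k = r.shape[0]
    return (r.sum() - k) / (k * (k - 1))

print(f"reflective: mean inter-item r = {mean_inter_item_r(reflective_items):.2f}")
print(f"formative:  mean inter-item r = {mean_inter_item_r(formative_items):.2f}")

In this simulation the reflective items correlate around 0.8 on average, whereas the formative indicators correlate around zero, which illustrates why an internal consistency statistic is informative for the former but not for the latter.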
For an internal consistency statistic to get an interpretable meaning the scale needs to be unidimensional. Unidimensionality of a scale can be investigated with e.g. a factor analysis, but not with an assessment of internal consistency [8]. Rather, unidimensionality of a scale is a prerequisite for a clear interpretation of the internal consistency statistics [6,7].
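As a point of reference (the formula itself is not a COSMIN standard), the internal consistency statistic that is most commonly reported is Cronbach's alpha [7]. For a (sub)scale of k items, with σ²_i the variance of item i and σ²_X the variance of the total score, it can be written as

\[ \alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_{i}^{2}}{\sigma_{X}^{2}}\right). \]

A high value of alpha does not by itself demonstrate that the scale is unidimensional [6]; as stated above, unidimensionality has to be established first, for example with a factor analysis, before alpha can be interpreted.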

Content validity
Content validity was defined as the degree to which the content of a HR-PRO instrument is an adequate reflection of the construct to be measured [5] (see Figure 2 for its standards). The discussion was about how to evaluate content validity.
The Delphi panel agreed that content validity should be assessed by making a judgment about the relevance and the comprehensiveness of the items. The relevance of the items should be assessed by judging whether the items are relevant for the construct to be measured (D1), for the study population (D2), and for the purpose of the HR-PRO (D3). When a new HR-PRO is developed, the focus and detail of the content of the instrument should match the target population (D2). When the instrument is subsequently used in another population than the original target population for which it was developed, it should be assessed whether all items are relevant for this new study population (D2). For example, a questionnaire measuring shoulder disability (i.e., the Shoulder Disability Questionnaire [10]) may include the item “my shoulder hurts when I bring my hand towards the back of my head”. When one decides to use this questionnaire in a population of patients with wrist problems to measure wrist disability, one could not simply change the word “shoulder” into “wrist”, because this item might not be relevant for patients with wrist problems. Moreover, an item like “Do you have difficulty with the grasping and use of small objects such as keys or pens?” [11] will probably not be included in a questionnaire for shoulder disability, while it is clearly relevant to ask patients with wrist problems.
Experts should judge the relevance of the items for the construct (D1), for the patient population (D2), and for the purpose (D3). Because the focus is on PROs, patients should be considered as experts when judging the relevance of the items for the patient population (D2). In addition, many missing observations on an item
can be an indication that the item is not relevant for the population, or that it is ambiguously formulated.
To assess the comprehensiveness of the items (D4), three aspects should be taken into account: the content coverage of the items, the description of the domains, and the theoretical foundation. The first two aspects refer to the question whether all relevant aspects of the construct are covered by the items and the domains. The theoretical foundation refers to the availability of a clear description of the construct, and the theory on which it is based. A part of this theoretical foundation could be a description of how different constructs within a concept are interrelated, for instance as described in the model of health status of Wilson and Cleary [12] or the International Classification of Functioning, Disability and Health (ICF) model [13]. An indication that the comprehensiveness of the items was assessed could be that patients or experts were asked whether they missed items. Large floor and ceiling effects can be an indication that a scale is not comprehensive.
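Floor and ceiling effects are usually quantified as the proportion of respondents obtaining the lowest or the highest possible total score. The sketch below is a minimal illustration with made-up scores on a hypothetical 0-20 scale; the function name and data are ours and no particular threshold is implied.

# Minimal sketch (not a COSMIN standard): share of respondents scoring at
# the extremes of the scale, a common way to quantify floor and ceiling effects.
import numpy as np

def floor_ceiling(total_scores: np.ndarray, min_score: int, max_score: int):
    """Return the proportions of respondents at the lowest and highest possible score."""
    floor = np.mean(total_scores == min_score)
    ceiling = np.mean(total_scores == max_score)
    return floor, ceiling

# Hypothetical total scores on a 0-20 scale
scores = np.array([0, 0, 3, 5, 12, 20, 20, 20, 18, 7])
f, c = floor_ceiling(scores, min_score=0, max_score=20)
print(f"floor: {f:.0%}, ceiling: {c:.0%}")

In this made-up sample 20% of respondents are at the floor and 30% at the ceiling, a pattern that could suggest the items do not cover the full range of the construct.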

Construct validity
Construct validity is the degree to which the scores of an HR-PRO instrument are consistent with hypotheses (for instance with regard to internal relationships, relationships to scores of other instruments, or differences between relevant groups), based on the assumption that the HR-PRO instrument validly measures the construct to be measured [5]. It contains three aspects: structural validity, which concerns the internal relationships; and hypotheses testing and cross-cultural validity, which both concern the relationships to scores of other instruments, or differences between relevant groups.

Hypotheses testing
The standards for hypotheses testing are given in Figure 3. The discussion was about how specific the hypotheses that are being formulated should be.
Hypotheses testing is an ongoing, iterative process [14]. Specific hypotheses should include an indication of the expected direction and magnitude of correlations or differences. Hypotheses testing is about whether the direction and magnitude of a correlation or difference are similar to what could be expected based on the construct(s) that are being measured. The more hypotheses are tested on whether the data correspond to a priori formulated hypotheses, the more evidence is gathered for construct validity.
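To make the notion of a specific hypothesis concrete, the sketch below tests one a priori formulated hypothesis of the form “scores on the new instrument correlate positively, and at least 0.50, with scores on an existing instrument measuring a related construct”. The instruments, the data and the 0.50 threshold are hypothetical and serve only to illustrate checking both direction and magnitude.

# Hypothetical sketch of testing one specific, a priori formulated hypothesis.
import numpy as np

rng = np.random.default_rng(1)
shared = rng.normal(size=200)                       # construct shared by both instruments
new_instrument = shared + rng.normal(scale=0.8, size=200)
comparator = shared + rng.normal(scale=0.8, size=200)

observed_r = np.corrcoef(new_instrument, comparator)[0, 1]
expected_direction, expected_magnitude = "positive", 0.50

# The hypothesis is confirmed only if both the direction and the magnitude match.
confirmed = observed_r > 0 and abs(observed_r) >= expected_magnitude
print(f"observed r = {observed_r:.2f}; hypothesis confirmed: {confirmed}")

Each such hypothesis that the data confirm (or refute) adds a piece of evidence for (or against) construct validity; a single correlation considered in isolation does not.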
change on the LVQOL/VCM1 with change on the Visual Functioning questionnaire (VF-14) is higher than the correlation with the global rating scale, change in visual acuity and change on the Euroqol thermometer’. After calculating correlations between the change scores on the different measurement instruments they concluded whether the correlations were as expected.
There are a number of parameters proposed in the literature to assess responsiveness that the Delphi panel considers inappropriate. The panel reached consensus that the use of effect sizes (mean change score/SD baseline) [21], and related measures, such as the standardised response mean (mean change score/SD change score) [22], Norman’s responsiveness coefficient (s²change / (s²change + s²error)) [23], and the relative efficacy statistic ((t-statistic1 / t-statistic2)²) [24], are inappropriate measures of responsiveness. The paired t-test was also considered to be inappropriate, because it is a measure of significant change instead of valid change, and it is dependent on the sample size of the study [18]. These measures are considered measures of the magnitude of change due to an intervention or other event, rather than measures of the quality of the measurement instrument [25,26]. Guyatt’s responsiveness ratio (MIC/SD change score of stable patients) [27] was also considered to be inappropriate, because it takes the minimal important change into account. The Delphi panel agreed that minimal important change concerns the interpretation of the change score, but not the validity of the change score.
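For reference, the parameters listed in the preceding paragraph can be written out explicitly; the notation below simply restates the parenthetical descriptions given above.

\[ \mathrm{ES} = \frac{\bar{d}}{SD_{\mathrm{baseline}}}, \qquad \mathrm{SRM} = \frac{\bar{d}}{SD_{\mathrm{change}}}, \qquad \mathrm{RC} = \frac{s^{2}_{\mathrm{change}}}{s^{2}_{\mathrm{change}} + s^{2}_{\mathrm{error}}}, \]
\[ \mathrm{RE} = \left(\frac{t_{1}}{t_{2}}\right)^{2}, \qquad \mathrm{RR} = \frac{\mathrm{MIC}}{SD_{\mathrm{change,\ stable\ patients}}} \]

Here \(\bar{d}\) is the mean change score, SD a standard deviation, s² a variance, t a t-statistic, and MIC the minimal important change; ES is the effect size [21], SRM the standardised response mean [22], RC Norman’s responsiveness coefficient [23], RE the relative efficacy statistic [24], and RR Guyatt’s responsiveness ratio [27].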

Discussion
In this article, we explained our choices for the design requirements and preferred statistical methods for which no evidence is available in the literature or which generated major discussions among the members of the Delphi study during the development of the COSMIN checklist. However, within the four rounds of the Delphi study, two issues could not be discussed extensively, due to lack of time. These issues concerned factor analyses (mentioned in Box A internal consistency and Box E structural validity) and minimal important change (mentioned in Box J interpretability).
The Delphi panel decided that the evaluation of structural validity can be done either by explorative factor analysis or by confirmatory factor analysis. However, confirmatory factor analysis is preferred over explorative factor analysis, because confirmatory factor analysis tests whether the data fit an a priori hypothesized factor structure [28], while explorative factor analysis can be used when no clear hypotheses exist about the underlying dimensions [28]. Such an explorative factor analysis is not a strong tool in hypothesis testing. In the COSMIN study we did not discuss specific requirements for factor analyses, such as the choice of the explorative factor analysis (principal component analysis or common factor analysis), the choice and justification of the rotation method (e.g. orthogonal or oblique rotation), or the decision about the number of relevant factors. Such specific requirements are described by e.g. Floyd & Widaman [28] and De Vet et al. [29].
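The distinction can be summarised with the common factor model, a textbook formulation [28] rather than a COSMIN requirement, in which a vector of item scores x is modelled as

\[ x = \Lambda \xi + \delta , \]

where Λ contains the factor loadings, ξ the common factors, and δ the unique (error) terms. In confirmatory factor analysis the pattern of Λ is specified in advance according to the hypothesized factor structure and the fit of that model to the data is tested; in explorative factor analysis all loadings are estimated freely and the number and nature of the factors are inferred from the data [28].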
In the Delphi panel it was discussed that in a study evaluating the interpretability of scores of an HR-PRO instrument the minimal important change (MIC) or minimal important difference (MID) should be determined. The MIC is the smallest change in score in the construct to be measured which patients perceive as important. The MID is the smallest difference in the construct to be measured between patients that is considered important [30]. Since we talk about patient-reported outcomes, the agreement among panel members was that the patients should be the ones to decide on what is important. In the literature there is an ongoing discussion about which methods should be used to determine the MIC or MID of a HR-PRO instrument [31]. Consequently, the opinions of the panel members differed widely, and within the COSMIN study no consensus on standards for assessing MIC could be reached.
The results of a Delphi study are dependent on the composition of the panel. The panel members do not need to be randomly selected to represent a target population. Rather, experts are chosen because of their knowledge of the topic of interest [32,33]. It has been noted that heterogeneous groups produce a higher proportion of high-quality, highly acceptable solutions than homogeneous groups [1]. Furthermore, anonymity of each of the panel members is often recommended, because it provides an equal chance for each panel member to present and react to ideas unbiased by the identities of other participants [34]. Both issues were ensured in this Delphi study. We included experts in the fields of psychology, epidemiology, statistics and clinical medicine. The panel members did not know who the other panel members were. All questionnaires were analysed and reported back anonymously. Only one of the researchers (LM) had access to this information.
The COSMIN Delphi study focussed on assessing the methodological quality of studies on measurement properties of existing HR-PROs. However, we think that the discussions described above and the COSMIN checklist itself are also relevant and applicable for researchers who are developing HR-PROs. The COSMIN checklist can be a useful tool for designing a study on measurement properties.

Conclusions
In conclusion, as there is not much empirical evidence for standards for the assessment of measurement
properties, we consider the Delphi technique the most appropriate method to develop a checklist on the methodological quality of studies on measurement properties. Within this Delphi study we have had many interesting discussions, and reached consensus on a number of important issues about the assessment of measurement properties. We expect that this paper will contribute to a better understanding of the rationale behind the items in the COSMIN checklist, thereby enhancing its acceptance and use.

Acknowledgements
We are grateful to all the panel members who have participated in the COSMIN study: Neil Aaronson, Linda Abetz, Elena Andresen, Dorcas Beaton, Martijn Berger, Giorgio Bertolotti, Monika Bullinger, David Cella, Joost Dekker, Dominique Dubois, Arne Evers, Diane Fairclough, David Feeny, Raymond Fitzpatrick, Andrew Garratt, Francis Guillemin, Dennis Hart, Graeme Hawthorne, Ron Hays, Elizabeth Juniper, Robert Kane, Donna Lamping, Marissa Lassere, Matthew Liang, Kathleen Lohr, Patrick Marquis, Chris McCarthy, Elaine McColl, Ian McDowell, Don Mellenbergh, Mauro Niero, Geoffrey Norman, Manoj Pandey, Luis Rajmil, Bryce Reeve, Dennis Revicki, Margaret Rothman, Mirjam Sprangers, David Streiner, Gerold Stucki, Giulio Vidotto, Sharon Wood-Dauphinee, Albert Wu.
This study was financially supported by the EMGO Institute for Health and Care Research, VU University Medical Center, Amsterdam, and the Anna Foundation, Leiden, the Netherlands. These funding organizations did not play any role in the study design, data collection, data analysis, data interpretation, or publication.

Author details
1 Department of Epidemiology and Biostatistics and the EMGO Institute for Health and Care Research, VU University Medical Center, Amsterdam, the Netherlands. 2 School of Rehabilitation Science and Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Canada. 3 Health Services Research Unit, Institute Municipal d'Investigació Mèdica (IMIM-Hospital del Mar), Barcelona, Spain. 4 Centro de Investigación Biomédica en Red de Epidemiología y Salud Pública (CIBERESP), Spain. 5 Department of Health Services, University of Washington, Seattle, USA. 6 Executive Board of VU University Amsterdam, Amsterdam, the Netherlands.

Authors' contributions
CT and HdV secured funding for the study. CT, HdV, LB, DK, DP, JA, and PS conceived the idea for the study. LM and CT prepared all questionnaires for the four Delphi rounds, supervised by HdV, DP, JA, PS, DK and LB. LM, CT, and HdV interpreted the data. LM coordinated the study and managed the data. CT, DP, JA, PS, DK, LB and HdV supervised the study. LM wrote the manuscript with input from all the authors. All authors read and approved the final version of the report.

Competing interests
The authors declare that they have no competing interests.

Received: 18 August 2009 Accepted: 18 March 2010
Published: 18 March 2010

References
1. Powell C: The Delphi technique: myths and realities. J Adv Nurs 2003, 41:376-382.
2. Mokkink LB, Terwee CB, Knol DL, Stratford PW, Alonso J, Patrick DL, Bouter LM, De Vet HCW: Protocol of the COSMIN study: COnsensus-based Standards for the selection of health Measurement INstruments. BMC Med Res Methodol 2006, 6:2.
3. Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, Bouter LM, De Vet HCW: The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study. Qual Life Res 2010.
4. Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, Bouter LM, De Vet HCW: The COSMIN checklist manual. [http://cosmin.nl].
5. Mokkink LB, Terwee CB, Patrick DL, Alonso J, Stratford PW, Knol DL, Bouter LM, De Vet HCW: International consensus on taxonomy, terminology, and definitions of measurement properties for health-related patient-reported outcomes: results of the COSMIN study. J Clin Epidemiol 2010.
6. Cortina JM: What is coefficient alpha? An examination of theory and applications. J Appl Psychol 1993, 78:98-104.
7. Cronbach LJ: Coefficient alpha and the internal structure of tests. Psychometrika 1951, 16:297-334.
8. Fayers PM, Hand DJ, Bjordal K, Groenvold M: Causal indicators in quality of life research. Qual Life Res 1997, 6:393-406.
9. Streiner DL: Being inconsistent about consistency: when coefficient alpha does and doesn't matter. J Pers Assess 2003, 80:217-222.
10. Van der Heijden GJ, Leffers P, Bouter LM: Shoulder disability questionnaire design and responsiveness of a functional status measure. J Clin Epidemiol 2000, 53:29-38.
11. Levine DW, Simmons BP, Koris MJ, Daltroy LH, Hohl GG, Fossel AH, Katz JN: A self-administered questionnaire for the assessment of severity of symptoms and functional status in carpal tunnel syndrome. J Bone Joint Surg Am 1993, 75:1585-1592.
12. Wilson IB, Cleary PD: Linking clinical variables with health-related quality of life. A conceptual model of patient outcomes. JAMA 1995, 273:59-65.
13. World Health Organization: ICF: International Classification of Functioning, Disability and Health. Geneva: World Health Organization 2001.
14. Strauss ME, Smith GT: Construct validity: advances in theory and methodology. Annu Rev Clin Psychol 2008.
15. Cronbach LJ, Meehl PE: Construct validity in psychological tests. Psychol Bull 1955, 52:281-302.
16. McDowell I, Newell C: Measuring health. A guide to rating scales and questionnaires. New York, NY: Oxford University Press, 2nd edition 1996.
17. Messick S: The standard problem. Meaning and values in measurement and evaluation. American Psychologist 1975, 955-966.
18. Altman DG: Practical statistics for medical research. London: Chapman & Hall/CRC 1991.
19. Terwee CB, Mokkink LB, Van Poppel MNM, Chinapaw MJM, Van Mechelen W, De Vet HCW: Qualitative attributes and measurement properties of physical activity questionnaires: a checklist. Accepted for publication in Sports Med.
20. De Boer MR, Terwee CB, De Vet HC, Moll AC, Völker-Dieben HJ, Van Rens GH: Evaluation of cross-sectional and longitudinal construct validity of two vision-related quality of life questionnaires: the LVQOL and VCM1. Qual Life Res 2006, 15:233-248.
21. Cohen J: Statistical power analysis for the behavioural sciences. Hillsdale, NJ: Lawrence Erlbaum Associates, 2nd edition 1988.
22. McHorney CA, Tarlov AR: Individual-patient monitoring in clinical practice: are available health status surveys adequate? Qual Life Res 1995, 4:293-307.
23. Norman GR: Issues in the use of change scores in randomized trials. J Clin Epidemiol 1989, 42:1097-1105.
24. Stockler MR, Osoba D, Goodwin P, Corey P, Tannock IF: Responsiveness to change in health-related quality of life in a randomized clinical trial: a comparison of the Prostate Cancer Specific Quality of Life Instrument (PROSQOLI) with analogous scales from the EORTC QLQ-C30 and a trial specific module. European Organization for Research and Treatment of Cancer. J Clin Epidemiol 1998, 51:137-145.
25. Streiner DL, Norman GR: Health measurement scales. A practical guide to their development and use. Oxford: Oxford University Press, 4th edition 2008.
26. Terwee CB, Dekker FW, Wiersinga WM, Prummel MF, Bossuyt PM: On assessing responsiveness of health-related quality of life instruments: guidelines for instrument evaluation. Qual Life Res 2003, 12:349-362.
27. Guyatt GH, Walter S, Norman GR: Measuring change over time: assessing the usefulness of evaluative instruments. J Chronic Dis 1987, 40:171-178.
28. Floyd FJ, Widaman KF: Factor analysis in the development and refinement of clinical assessment instruments. Psychological Assessment 1995, 7:286-299.
29. De Vet HC, Ader HJ, Terwee CB, Pouwer F: Are factor analytical techniques used appropriately in the validation of health status questionnaires? A systematic review on the quality of factor analysis of the SF-36. Qual Life Res 2005, 14:1203-1218.
30. De Vet H, Beckerman H, Terwee CB, Terluin B, Bouter LM, for the
clinimetrics working group: Definition of clinical differences. Letter to the
Editor. J Rheumatol 2006, 33:434.
31. Revicki DA, Hays RD, Cella DF, Sloan JA: Recommended methods for
determining responsiveness and minimally important differences for
patient-reported outcomes. J Clin Epidemiol 2008, 61:102-109.
32. Keeney S, Hasson F, McKenna HP: A critical review of the Delphi
technique as a research methodology for nursing. Int J Nurs Stud 2001,
38:195-200.
33. Hasson F, Keeney S, McKenna H: Research guidelines for the Delphi
survey technique. J Adv Nurs 2000, 32:1008-1015.
34. Goodman CM: The Delphi technique: a critique. J Adv Nurs 1987,
12:729-734.
Pre-publication history
The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1471-2288/10/22/prepub
doi:10.1186/1471-2288-10-22
Cite this article as: Mokkink et al.: The COSMIN checklist for evaluating
the methodological quality of studies on measurement properties: A
clarification of its content. BMC Medical Research Methodology 2010 10:22.