
SPECIAL ARTICLE

Diagnostic Testing and Decision-Making: Beauty Is Not Just in the Eye of the Beholder

Thomas R. Vetter, MD, MPH,* Patrick Schober, MD, PhD, MMedStat,† and Edward J. Mascha, PhD‡§

To use a diagnostic test effectively and consistently in their practice, clinicians need to know how well the test distinguishes between those patients who have the suspected acute or chronic disease and those patients who do not. Clinicians are equally interested and usually more concerned whether, based on the results of a screening test, a given patient actually (1) does or does not have the suspected disease, or (2) will or will not subsequently experience the adverse event or outcome. Medical tests that are performed to screen for a risk factor, diagnose a disease, or estimate a patient's prognosis are frequently a key component of a clinical research study. Like therapeutic interventions, medical tests require proper analysis and demonstrated efficacy before being incorporated into routine clinical practice. This basic statistical tutorial thus discusses the fundamental concepts and techniques related to diagnostic testing and medical decision-making, including sensitivity and specificity, positive predictive value and negative predictive value, positive and negative likelihood ratio, receiver operating characteristic curve, diagnostic accuracy, choosing a best cut-point for a continuous variable biomarker, comparing methods on diagnostic accuracy, and design of a diagnostic accuracy study. (Anesth Analg 2018;127:1085–91)

From the *Department of Surgery and Perioperative Care, Dell Medical School at the University of Texas at Austin, Austin, Texas; †Department of Anesthesiology, VU University Medical Center, Amsterdam, the Netherlands; and Departments of ‡Quantitative Health Sciences and §Outcomes Research, Cleveland Clinic, Cleveland, Ohio.

Accepted for publication July 3, 2018. Funding: None. The authors declare no conflicts of interest. Reprints will not be available from the authors. Address correspondence to Thomas R. Vetter, MD, MPH, Department of Surgery and Perioperative Care, Dell Medical School at the University of Texas at Austin, Health Discovery Bldg, Room 6.812, 1701 Trinity St, Austin, TX 78712. Address e-mail to [email protected].

Copyright © 2018 The Author(s). Published by Wolters Kluwer Health, Inc. on behalf of the International Anesthesia Research Society. This is an open-access article distributed under the terms of the Creative Commons Attribution-Non Commercial-No Derivatives License 4.0 (CC BY-NC-ND), where it is permissible to download and share the work provided it is properly cited. The work cannot be changed in any way or used commercially without permission from the journal. DOI: 10.1213/ANE.0000000000003698

I shall try not to use statistics as a drunken man uses lamp-posts, for support rather than for illumination.
—Andrew Lang (1844–1912), Scottish poet, novelist, and literary critic

To use a diagnostic test effectively and consistently in their practice, clinicians need to know how well the test distinguishes between those patients who have the suspected acute or chronic disease and those patients who do not.1 Clinicians are equally interested and usually more concerned whether, based on the results of a screening test, a given patient actually (1) does or does not have the suspected disease, or (2) will or will not subsequently experience the adverse event or outcome.2,3

Medical tests performed to screen for a risk factor, diagnose a disease, or estimate a patient's prognosis are frequently a key component of a clinical research study—including in anesthesiology, perioperative medicine, critical care, and pain medicine.4,5 Like therapeutic interventions, medical tests require proper analysis and demonstrated efficacy before being incorporated into routine clinical practice.6

However, studies of diagnostic tests are frequently methodologically flawed, and their results are often not well understood or applied in clinical practice.5 For example, if investigators select clinically inappropriate populations for their study of a diagnostic test, they introduce so-called "spectrum bias," and their study results can be invalid and misinform practicing clinicians.1,3,7,8 Therefore, rigor must be applied in studying whether and in whom a particular medical test should be performed.4

As part of the ongoing series in Anesthesia & Analgesia, this basic statistical tutorial thus discusses the fundamental concepts and techniques related to diagnostic testing and medical decision-making. This tutorial includes the following concepts and techniques:

• Sensitivity and specificity;
• Positive predictive value and negative predictive value;
• Likelihood ratio;
• Receiver operating characteristic (ROC) curve;
• Diagnostic accuracy;
• Choosing and reporting the cut-point for a continuous variable biomarker;
• Comparing methods on diagnostic accuracy; and
• Design of a diagnostic accuracy study.

SENSITIVITY AND SPECIFICITY
The simplest screening or diagnostic test is one where the results of a clinical investigation (eg, electrocardiogram or cardiac stress test) are used to classify patients into 2 dichotomous groups, according to the presence or absence of a sign or symptom.9 When the results of such a dichotomous (positive or negative) test are compared with a dichotomous "gold standard" test (eg, cardiac catheterization) that is often costlier and/or more invasive, the results can be summarized in a simple 2 × 2 table (Figure 1).4

The validity of such a screening or diagnostic test is its ability to distinguish between patients who have and those who do not have a disease.2 This validity of a medical test has 2 primary components: sensitivity and specificity.


Figure 1. A 2 × 2 table presenting the results (namely, the sensitivity, specificity, positive predictive value, and negative predictive value) from a study comparing a dichotomous diagnostic or screening test with a gold standard test or clinical outcome.2–4,12

Figure 2. Relationship among the chosen cutoff point (cut-point), sensitivity, and specificity, in this example using BNP for diagnosing congestive heart failure in patients presenting with acute dyspnea.3,11 BNP indicates B-type natriuretic peptide.

The sensitivity of the test is its ability to identify correctly those patients who have the disease, whereas the specificity of the test is its ability to identify correctly those patients who do not have the disease.2

Sensitivity is thus defined as the proportion of truly diseased patients who have a positive result on the screening or diagnostic test (Figure 1).3,10

Specificity is thus defined as the proportion of truly nondiseased patients who have a negative result on the screening or diagnostic test (Figure 1).3,10
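To make these two proportions concrete, the following minimal sketch (ours, not from the article) computes them from a hypothetical 2 × 2 table in Python; the cell counts TP, FP, FN, and TN are invented labels for the four cells, chosen so that the results echo the 90% sensitivity and 76% specificity of the BNP example discussed below.

```python
# Minimal sketch: sensitivity and specificity from a hypothetical 2x2 table.
TP, FP = 90, 24   # test positive: truly diseased (TP), truly nondiseased (FP)
FN, TN = 10, 76   # test negative: truly diseased (FN), truly nondiseased (TN)

sensitivity = TP / (TP + FN)  # proportion of truly diseased with a positive test
specificity = TN / (TN + FP)  # proportion of truly nondiseased with a negative test

print(f"sensitivity = {sensitivity:.2f}")  # 0.90
print(f"specificity = {specificity:.2f}")  # 0.76
```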

Diagnostic Cut-off Point
Ideally, a diagnostic test would display both high sensitivity and high specificity. However, this is often not the case, either for binary or continuous biomarkers.3 When the clinical data generated by a medical test are not binary (eg, positive or negative) but instead have a range of values (eg, fasting serum glucose for diabetes or B-type natriuretic peptide [BNP] for congestive heart failure), a so-called cutoff point (or cut-point) value is often sought as a means to separate normal (or nondiseased) from abnormal (or diseased) patients. As discussed in more detail below, the choice of this cut-point value is innately a balance between the sensitivity and the specificity for a diagnostic test.2,3

There is typically an inverse relationship between sensitivity and specificity. This is exemplified in a classic study of BNP for diagnosing congestive heart failure in patients presenting with acute dyspnea (Figure 2).3,11 These authors concluded an acceptable compromise to be a BNP cut-point value (plasma level) of 100 pg/mL, with a corresponding sensitivity of 90% and a specificity of 76%.11

POSITIVE PREDICTIVE VALUE AND NEGATIVE PREDICTIVE VALUE
Sensitivity and specificity are the most accepted ways to quantify the diagnostic accuracy and validity of a medical test. However, in clinical practice, even if the sensitivity and specificity of a test are known, all that is reported, and thus known for a particular patient, is the test result. Yet, as noted above, the clinician really wants to know how good the test is at predicting abnormality (ie, what proportion of patients with an abnormal test result is truly abnormal).9


In other words, if the test result is positive, what is the probability that this given patient has the disease? Likewise, if the test result is negative, what is the probability that this given patient does not have the disease?2 Alternatively, what is the probability that a patient with an abnormal test result will subsequently experience the adverse event or outcome of concern (eg, postoperative myocardial infarction)? Or, vice versa, what is the probability that a patient with a normal test result will not subsequently experience the adverse event or outcome of concern?

The positive predictive value is the proportion of patients with a positive test result who truly have the disease, or the probability that a positive test accurately predicts presence of disease or the occurrence of the adverse outcome (Figure 1).2–4,12

The negative predictive value is the proportion of patients with negative test results who truly do not have the disease, or the probability that a negative test accurately predicts absence of disease or nonoccurrence of the adverse outcome (Figure 1).2–4,12
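Continuing the hypothetical 2 × 2 table from the sketch above, the predictive values simply condition on the test result rather than on true disease status:

```python
# Predictive values from the same hypothetical 2x2 table as above.
PPV = TP / (TP + FP)  # P(disease | positive test)
NPV = TN / (TN + FN)  # P(no disease | negative test)
print(f"PPV = {PPV:.2f}, NPV = {NPV:.2f}")  # PPV = 0.79, NPV = 0.88
```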
Effect of Disease Prevalence on Predictive Values
The underlying prevalence of the disease being screened for or diagnosed does not affect the sensitivity or specificity of a medical test, which is why sensitivity and specificity are usually referred to as measures of the intrinsic accuracy of a test. The performance characteristics of the test itself in identifying patients with and without the disease remain the same despite changes in disease prevalence.1

However, as the underlying prevalence of the disease of interest increases, the positive predictive value of the test increases and the negative predictive value decreases. The more common the disease in the target population, the stronger the positive predictive value of the test. Similarly, as the underlying prevalence of the disease of interest decreases, the positive predictive value of the test decreases and the negative predictive value increases. The less common the disease in the target population, the stronger the negative predictive value of the test.1–3

Due to this relationship between prevalence and predictive values, it is very important to understand that predictive values reported in a study cannot simply be generalized to other settings with different disease prevalence. In particular, in studies in which the prevalence does not reflect the natural population prevalence, but in which the observed prevalence is determined by the study design (such as in a 1:1 case-control study, which artificially sets the prevalence at 50%), any reported diagnostic predictive values are of minimal practical use and must be interpreted carefully.
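This prevalence dependence follows directly from Bayes' theorem. The short sketch below (our illustration, again assuming 90% sensitivity and 76% specificity) shows the positive predictive value rising and the negative predictive value falling as prevalence increases:

```python
# Sketch: predictive values as a function of prevalence, with sensitivity
# and specificity held fixed (all numbers are illustrative).
def ppv(sens, spec, prev):
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

def npv(sens, spec, prev):
    return spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)

for prev in (0.05, 0.25, 0.50, 0.75):
    print(f"prevalence {prev:.2f}: PPV = {ppv(0.90, 0.76, prev):.2f}, "
          f"NPV = {npv(0.90, 0.76, prev):.2f}")
# prevalence 0.05: PPV = 0.16, NPV = 0.99
# prevalence 0.75: PPV = 0.92, NPV = 0.72
```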
LIKELIHOOD RATIO
The positive likelihood ratio (LR+) compares the probability of a positive test result in patients with the disease or condition of interest (sensitivity) with the probability of a positive test result in patients without the disease (1 – specificity).10,13 The LR+ describes how many times more likely a positive test result is to be a "true positive" result compared to a "false positive." Hence, an LR+ >1 indicates that the presence of the disease is more likely than the absence of the disease when the test result is positive; the greater the LR+, the stronger the ability of a positive test result to predict the presence of disease. Positive diagnostic test results with a high LR+ (>10) are considered to provide strong evidence to rule in the diagnosis.14

Similarly, the negative likelihood ratio (LR−) compares the probability of a negative test result in patients with the disease or condition of interest (1 – sensitivity) with the probability of a negative test result in patients without the disease (specificity).10,15 An LR− close to 0 (<0.1) indicates that a negative test result is most likely a true negative result, providing strong evidence to rule out the disease or condition.14

Of note, using a nomogram or a simple formula, likelihood ratios can also be used to estimate the probability (or odds) that a positive or negative test result reflects presence or absence of the disease, respectively, for a given pretest probability (or odds) of disease presence.16 This pretest probability is the assumed probability that a given tested individual actually has the condition, based on the information available before the test is performed.

Unless a particular individual is known or suspected to have a higher or lower risk of having the condition than other patients in the same population undergoing the diagnostic test, the pretest probability can be assumed to be the prevalence of the condition in this population.13 The posttest probability of presence or absence of the disease can then be estimated by the positive predictive value and negative predictive value of the test, respectively.13 The positive and negative predictive values, which once again depend on the disease prevalence, can readily be calculated for any given prevalence using the likelihood ratios.

Kruisselbrink et al17 assessed the diagnostic accuracy of point-of-care gastric ultrasound to detect a "full stomach" (defined as either solid particulate content or >1.5 mL/kg of fluid) in 40 healthy volunteers. The authors reported an LR+ of 40.0 (95% confidence interval [CI], 10.3–∞) and an LR− of 0 (95% CI, 0–0.07), indicating that gastric ultrasound is highly accurate to rule in and to rule out a full stomach. Assuming a pretest probability of 50% for having a full stomach (the prevalence in their study sample), the authors used a nomogram to show that a positive test result increases the probability of having a full stomach to 97%, whereas a negative test result decreases the probability to <0.1%.17
ROC CURVE
The previous paragraphs focused on diagnostic tests with a dichotomous outcome, in which the test result is either positive or negative. In situations in which a test result is reported on a continuous or ordinal scale, the sensitivity, specificity, and predictive values vary depending on the cut-point value or threshold that is used to classify the test result as positive or negative. Before defining an optimal threshold (as described in a subsequent section), it is useful to first assess the global diagnostic accuracy of the test across various cut-point values.

A ROC curve is a very common way to display the relationship between the sensitivity and specificity of a continuous-scaled or ordinal-scaled diagnostic test across the range of observed test values.3,18 A ROC curve plots the true-positive rate (sensitivity) on the y-axis against the false-positive rate (1 – specificity) on the x-axis for a range of different cutoff values (Figure 3).18,19 The ROC curve essentially demonstrates the tradeoff between sensitivity and specificity.


Simple visual inspection of the ROC curve provides useful information on the global diagnostic accuracy. A curve close to the left upper corner of the graph suggests good ability to discriminate patients with and without the condition, whereas a curve close to the diagonal from the bottom left to the upper right corner suggests that the test is only approximately as good as a random guess (Figure 3).15

More formally, the area under the curve (AUC) for a ROC curve can be calculated. The closer this AUC is to 1, the stronger the discriminative ability of the test. An AUC of 0.5 suggests that the test is unable to discriminate healthy from nonhealthy subjects, while an AUC <0.5 (not commonly observed in practice) suggests that a positive test result is somewhat predictive of absence of the disease.

Estimates of the AUC should be accompanied by a CI to provide an estimate of plausible values of the AUC in the population of interest.20 As described below, statistics are available to test the null hypothesis that the AUC is equal to 0.5, and to compare AUC values of different diagnostic tests.

Figure 3. ROC curves are plots of the true-positive rate (sensitivity) against the false-positive rate (1 − specificity) for a range of different cutoff values. Shown are 3 smoothed curves, visually representing high (red curve close to the left top corner), intermediate (blue curve), and low (green curve close to the dashed diagonal line) discriminative ability to distinguish patients with a condition from patients without the condition. More formally, the AUC can be calculated, where an AUC close to 1 indicates high discriminative ability and an AUC close to 0.5 (representing the area under the diagonal line) indicates that the test is no better in predicting the condition than tossing a coin. AUC indicates area under the curve; ROC, receiver operating characteristic.

Gastaminza et al21 studied whether tryptase levels during the reaction (TDR), as well as the ratio of TDR to baseline tryptase levels (TDR/BT), would be useful in discriminating immunoglobulin E (IgE)-dependent from IgE-independent hypersensitivity reactions. ROC analysis was performed to assess the overall diagnostic ability across different cut-points and to compare the AUC of the 2 approaches. The authors observed that the TDR/BT ratio had an overall better diagnostic ability to discriminate IgE-dependent from IgE-independent hypersensitivity reactions than TDR, with the TDR/BT AUC of 0.79 (95% CI, 0.70–0.88) and the TDR AUC of 0.66 (95% CI, 0.56–0.76), respectively, with the difference in AUC of 0.13 (95% CI, 0.05–0.20).21
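For readers who want to trace such an analysis by hand, here is a minimal sketch (ours, on simulated rather than clinical data) that computes the empirical ROC points and obtains the AUC through its rank-statistic (Mann–Whitney) equivalence:

```python
import numpy as np
from scipy.stats import rankdata

# Simulated biomarker: diseased values shifted up by 1 SD (illustrative only).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(1.0, 1.0, 100),   # diseased
                    rng.normal(0.0, 1.0, 100)])  # nondiseased
y = np.concatenate([np.ones(100), np.zeros(100)])

# Sensitivity (TPR) and 1 - specificity (FPR) at each observed cut-point,
# calling "x >= c" a positive test; plotting fpr vs tpr draws the ROC curve.
cuts = np.unique(x)
tpr = np.array([(x[y == 1] >= c).mean() for c in cuts])
fpr = np.array([(x[y == 0] >= c).mean() for c in cuts])

# AUC via the Mann-Whitney rank statistic, which equals the ROC area.
ranks = rankdata(x)
n1, n0 = int(y.sum()), int((1 - y).sum())
auc = (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)
print(f"empirical AUC = {auc:.2f}")  # theory for a 1-SD shift: ~0.76
```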
DIAGNOSTIC ACCURACY
Diagnostic accuracy refers to the discriminative ability of a medical test to distinguish healthy from nonhealthy subjects. The metrics of sensitivity, specificity, and predictive values are often considered measures of accuracy because they provide information on how well a dichotomous test—or a test with a continuous value that is dichotomized at a given cut-point threshold—can distinguish between diseased and nondiseased patients. However, as noted above, some metrics depend on the disease prevalence and are thus actually not measures of the intrinsic accuracy of the test itself. In contrast, sensitivity and specificity do not depend on prevalence and are thus considered measures of intrinsic diagnostic accuracy.22

The proportion of correctly classified, true-positive and true-negative patients, sometimes termed "overall diagnostic accuracy," is often reported as a global marker of accuracy.15 However, overall diagnostic accuracy depends on the prevalence of the condition.23 Therefore, the overall accuracy obtained from a study sample is not a measure of intrinsic accuracy of the test, and it usually cannot be generalized.

In contrast, the likelihood ratio is particularly useful as a global marker of accuracy because it: (1) combines information from sensitivity and specificity; (2) does not depend on disease prevalence; and (3) allows estimation of the posttest probability of having a particular disease for any assumed pretest probability.14

The AUC of the ROC curve is also independent of the prevalence. It is also not influenced by arbitrarily chosen cut-point thresholds. The AUC of the ROC curve is thus often considered the most useful global marker of the diagnostic accuracy of a medical test with continuous values. However, as a summary across all cut-point thresholds (including those that are clinically nonsensical), the AUC of the ROC curve provides very limited information on how well the test performs at a specific threshold as commonly used in clinical practice.19
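The prevalence dependence of overall accuracy is easy to see from the law of total probability, accuracy = sensitivity × prevalence + specificity × (1 − prevalence). A two-line illustration (ours, with the same assumed operating characteristics as before):

```python
# Sketch: "overall accuracy" mixes sensitivity and specificity in proportions
# set by prevalence, so it shifts with prevalence even though the test does not.
def overall_accuracy(sens, spec, prev):
    return sens * prev + spec * (1 - prev)

for prev in (0.10, 0.50, 0.90):
    print(f"prevalence {prev:.1f}: accuracy = "
          f"{overall_accuracy(0.90, 0.76, prev):.2f}")  # 0.77, 0.83, 0.89
```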

CHOOSING AND REPORTING THE CUT-POINT FOR A CONTINUOUS VARIABLE BIOMARKER
In addition to assessing the overall discriminative ability of a biomarker, it is often of interest to identify the best cut-point for that continuous or ordinal biomarker to be used in practice to classify individual patients as either having or not having the disease or outcome of interest.

First, it is not always prudent or feasible to make a recommendation of a cut-point, because there might not exist a cut-point that gives adequate sensitivity and specificity. In their study design phase, researchers should specify the minimal criteria for reporting a cut-point at all—for example, requiring a certain minimum value for each of sensitivity and specificity, such as 70% or 75%, or an AUC of ≥0.75. It is not helpful or prudent to introduce a new cut-point into practice if it does not have sufficient accuracy.


A common method for estimating an optimal cut-point is to choose a threshold that maximizes both sensitivity and specificity (not their sum). One simply calculates sensitivity and specificity for each observed value of the biomarker and identifies the cut-point (or cut-points) that generate the best combination of sensitivity and specificity. This is appropriate when sensitivity and specificity are thought to be equally important for the study at hand, implying that a false-positive or false-negative mistake would be equally costly.

Alternatively, researchers might require a minimum specificity (or sensitivity), which would influence the choice of optimal cut-point. For example, it may be that the cut-point that maximizes sensitivity and specificity yields a sensitivity of 80% and a specificity of 78%. But if ≥90% specificity were required, the optimal cut-point for this study might have a sensitivity of only 60%. The desired balance between sensitivity and specificity should be determined and justified a priori.

Finally, the cut-point that maximizes the sum of sensitivity and specificity could be chosen, as with the Youden index.24 However, this method has the notable disadvantage of not monitoring whether sensitivity and specificity are very different from each other, and it can often identify a cut-point at which they differ markedly. This tends to occur most often when the AUC of the ROC curve is low. When the AUC is very high, the Youden index tends to identify cut-points closer to those achieved when maximizing both sensitivity and specificity—the first method described above.25
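The sketch below (ours, reusing the simulated x and y from the ROC sketch above) contrasts one way to operationalize the first method—maximizing the smaller of sensitivity and specificity, so that neither is sacrificed—with the Youden index, J = sensitivity + specificity − 1:

```python
import numpy as np

# Sensitivity and specificity at a candidate cut-point c ("x >= c" is positive).
def sens_spec_at(x, y, c):
    sens = (x[y == 1] >= c).mean()
    spec = (x[y == 0] < c).mean()
    return sens, spec

cuts = np.unique(x)
pairs = np.array([sens_spec_at(x, y, c) for c in cuts])

best_balanced = cuts[np.argmax(pairs.min(axis=1))]    # maximize min(sens, spec)
best_youden = cuts[np.argmax(pairs.sum(axis=1) - 1)]  # maximize Youden J
print(f"balanced cut-point: {best_balanced:.2f}, "
      f"Youden cut-point: {best_youden:.2f}")
```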
Gomez Builes et al26 sought to find a cut-point for values of maximum lysis in trauma patients, below which a patient would be less likely to survive 48 hours. Applying the Youden index, their chosen cut-point had a sensitivity of 42% (95% CI, 27–57) and a specificity of 76% (95% CI, 51–88). They note that the discrepancy between specificity and sensitivity was acceptable in their clinical setting because it was important to reduce false positives. However, since sensitivity could just as well have been much higher than specificity, researchers in similar situations might choose the cut-point with the highest sensitivity at a predetermined specificity, or simply maximize both.25

Because a chosen cut-point is an estimate, it should be accompanied by a CI. CIs for a cut-point can be estimated using bootstrap resampling.27 The CI for the estimated cut-point can be interpreted as the estimated range of plausible values of the true optimal cut-point.
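A percentile bootstrap along the lines suggested by reference 27 might look like the following sketch (ours; it resamples the simulated data above and recomputes the Youden-index cut-point in each resample):

```python
import numpy as np

# Sketch: percentile bootstrap CI for the Youden-index cut-point,
# reusing the simulated x and y from the ROC sketch above.
rng = np.random.default_rng(1)
boot_cuts = []
n = len(x)
for _ in range(2000):
    idx = rng.integers(0, n, n)   # resample patients with replacement
    xb, yb = x[idx], y[idx]
    cb = np.unique(xb)
    j = [(xb[yb == 1] >= c).mean() + (xb[yb == 0] < c).mean() - 1 for c in cb]
    boot_cuts.append(cb[int(np.argmax(j))])

lo, hi = np.percentile(boot_cuts, [2.5, 97.5])
print(f"95% bootstrap CI for the cut-point: ({lo:.2f}, {hi:.2f})")
```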
The underlying variability in determining a best cut-point to distinguish the truly diseased from the truly nondiseased patients can be seen from a different angle using the so-called "grey zone" approach.28 In this approach, instead of a single cut-point that attempts to discriminate diseased from nondiseased in 2 regions, 2 cut-points are identified to form 3 regions: patients who are believed to be diseased, nondiseased, and indeterminate (ie, not sure, the gray zone). The gray zone is estimated using specified values of LR+ and LR− that indicate the allowable false-positive and false-negative errors. While the details are beyond the scope of this article, in practice, the estimated gray zone often corresponds closely to the region specified by the confidence limits for a best cut-point when maximizing sensitivity and specificity.

COMPARING METHODS ON DIAGNOSTIC ACCURACY
Frequently, researchers undertake a study to assess whether 1 biomarker or laboratory value has better diagnostic accuracy than another. Such situations require formally comparing the biomarkers on AUC or, if there is a specified cut-point, on sensitivity and specificity.

Choice of the appropriate test statistic depends on whether the diagnostic accuracy results for the biomarkers being compared are correlated or not. Results would be correlated if comparing 2 biomarkers measured on all included subjects. They would be independent if different patient groups were being compared (eg, when assessing diagnostic accuracy between males and females).

Comparing independent AUCs is typically done using the method of Hanley and McNeil29 for independent data. The method of DeLong et al30 or the paired (same-case) method of Hanley and McNeil31 can be used to compare dependent AUCs.

When comparing biomarkers on sensitivity or specificity, the denominator is all the diseased patients or nondiseased patients, respectively. When 2 independent groups such as males versus females are being compared on sensitivity or specificity, a simple Pearson χ2 test can be used to compare the proportion who tested positive (for sensitivity) or negative (for specificity). For dependent comparisons, the McNemar test for correlated proportions is appropriate.32 Analogous tests could be conducted for overall accuracy. Mainstream statistical packages include options for most if not all of these methods.33
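As one concrete instance, a McNemar test comparing the sensitivities of 2 tests applied to the same diseased patients needs only the 2 discordant counts. A minimal sketch (ours, with hypothetical counts; the continuity correction shown is one common convention):

```python
from scipy.stats import chi2

# McNemar test on paired diseased patients: b = positive on test A only,
# c = positive on test B only (hypothetical discordant counts).
b, c = 25, 10
stat = (abs(b - c) - 1) ** 2 / (b + c)  # continuity-corrected McNemar statistic
p_value = chi2.sf(stat, df=1)
print(f"McNemar chi-square = {stat:.2f}, p = {p_value:.3f}")  # 5.60, p = 0.018
```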
DESIGN OF A DIAGNOSTIC ACCURACY STUDY
Rigorous study design is essential for a diagnostic accuracy study. First, the study objective must be clearly stated. Because the chosen population greatly influences the diagnostic accuracy results, as well as their meaning and applicability, researchers must describe the exact patient population about which they want to make inference.

Questions to consider include the following: Which patients are targeted? Those who have already had a positive result on a certain prescreening test? Those with a certain background predisposing them to have or not have the disease or outcome of interest? Is the goal one of estimation of diagnostic accuracy, comparison between biomarkers, or comparison between populations? Is the biomarker or modality of interest new or well-established?34

Choice of the gold standard method used to define diseased versus nondiseased patients should be carefully considered. Many times there is no perfect gold standard—a clear study limitation. An imperfect gold standard raises important questions of how the data will be analyzed and how the results can be interpreted. Nevertheless, statistical methods can attempt to account for an imperfect gold standard.35 Reliability of the biomarker or medical test being evaluated should also be assessed and reported.


Calculation of the appropriate sample size depends on the goal of the study being either estimation of diagnostic accuracy or comparison of biomarkers or groups on diagnostic accuracy. When estimating diagnostic accuracy, the goal is typically to estimate the parameter of interest with a desired precision, measured by the expected width of the CI.36,37 When comparing biomarkers or groups on diagnostic accuracy, the difference to detect between the biomarkers or populations needs to be specified, and the sample size determined accordingly.

Of note, if the prevalence of the disease is expected to be low in the study sample (<50%), then estimating or detecting differences in sensitivity would drive the sample size, because the truly diseased would have a smaller overall sample size compared with the nondiseased, and sufficient power or precision for the smaller sample (the diseased) would guarantee it for the larger sample (the nondiseased). Likewise, specificity would drive the calculations if prevalence were expected to be >50%.
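As a sketch of the precision-based approach (ours, using the generic normal-approximation CI for a proportion rather than any specific formula from references 36 and 37; all inputs are illustrative):

```python
from math import ceil
from scipy.stats import norm

# Total N so the sensitivity estimate has a desired CI half-width, given that
# only a fraction (the prevalence) of enrolled patients will be truly diseased.
def n_for_sensitivity(sens, half_width, prevalence, conf=0.95):
    z = norm.ppf(1 - (1 - conf) / 2)
    n_diseased = z**2 * sens * (1 - sens) / half_width**2
    return ceil(n_diseased / prevalence)

# Anticipated sensitivity 0.90, desired half-width 0.05, prevalence 0.30:
print(n_for_sensitivity(0.90, 0.05, 0.30))  # ~461 total patients
```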
CONCLUSIONS
Knowing the diagnostic accuracy of a medical test used in clinical decision-making is of paramount importance for clinicians, given that false-positive and false-negative results, and subsequent therapeutic decisions, can have major consequences for patient physical and emotional well-being. The sensitivity, specificity, positive predictive value, negative predictive value, and likelihood ratios of a screening or diagnostic test each have unique merits yet limitations. For continuous or ordinal-scaled tests, the AUC of the ROC curve provides insight into overall diagnostic accuracy. Choice of cut-point value or threshold should be informed by the relative importance of sensitivity versus specificity of the particular diagnostic test.

Last, a diagnostic accuracy study needs to be carefully designed to obtain valid and useful estimates of diagnostic accuracy. When interpreting results of a diagnostic accuracy study, clinicians should understand that metrics that depend on the disease prevalence—namely, predictive values and the so-called overall diagnostic accuracy—cannot be readily generalized beyond the study population in which they were estimated.

DISCLOSURES
Name: Thomas R. Vetter, MD, MPH.
Contribution: This author helped write and revise the manuscript.
Name: Patrick Schober, MD, PhD, MMedStat.
Contribution: This author helped write and revise the manuscript.
Name: Edward J. Mascha, PhD.
Contribution: This author helped write and revise the manuscript.
This manuscript was handled by: Jean-Francois Pittet, MD.
REFERENCES
1. Montori VM, Wyer P, Newman TB, Keitz S, Guyatt G; Evidence-Based Medicine Teaching Tips Working Group. Tips for learners of evidence-based medicine: 5. The effect of spectrum of disease on the performance of diagnostic tests. CMAJ. 2005;173:385–390.
2. Gordis L. Assessing the validity and reliability of diagnostic and screening tests. In: Epidemiology. 5th ed. Philadelphia, PA: Elsevier Saunders; 2014:88–115.
3. Fletcher RH, Fletcher SW, Fletcher GS. Diagnosis. In: Clinical Epidemiology: The Essentials. 5th ed. Philadelphia, PA: Wolters Kluwer/Lippincott Williams & Wilkins; 2014:108–131.
4. Newman TB, Browner WS, Cummings SR, Hulley SB. Diagnostic studies of medical tests. In: Hulley SB, Cummings SR, Browner WS, Grady DG, Newman TB, eds. Designing Clinical Research. 4th ed. Philadelphia, PA: Wolters Kluwer Health/Lippincott Williams & Wilkins; 2013:171–191.
5. Scott IA, Greenberg PB, Poole PJ. Cautionary tales in the clinical interpretation of studies of diagnostic tests. Intern Med J. 2008;38:120–129.
6. Daya S. Study design for the evaluation of diagnostic tests. Semin Reprod Endocrinol. 1996;14:101–109.
7. Ransohoff DF, Feinstein AR. Problems of spectrum and bias in evaluating the efficacy of diagnostic tests. N Engl J Med. 1978;299:926–930.
8. Goehring C, Perrier A, Morabia A. Spectrum bias: a quantitative and graphical analysis of the variability of medical diagnostic test performance. Stat Med. 2004;23:125–135.
9. Altman DG, Bland JM. Diagnostic tests 1: sensitivity and specificity. BMJ. 1994;308:1552.
10. Straus SE, Glasziou P, Richardson WS, Haynes RB. Diagnosis and screening. In: Evidence-Based Medicine: How to Practice and Teach It. 4th ed. Edinburgh, United Kingdom: Elsevier Churchill Livingstone; 2015:137–167.
11. Maisel AS, Krishnaswamy P, Nowak RM, et al; Breathing Not Properly Multinational Study Investigators. Rapid measurement of B-type natriuretic peptide in the emergency diagnosis of heart failure. N Engl J Med. 2002;347:161–167.
12. Altman DG, Bland JM. Diagnostic tests 2: predictive values. BMJ. 1994;309:102.
13. Linnet K, Bossuyt PM, Moons KG, Reitsma JB. Quantifying the accuracy of a diagnostic test or marker. Clin Chem. 2012;58:1292–1301.
14. Deeks JJ, Altman DG. Diagnostic tests 4: likelihood ratios. BMJ. 2004;329:168–169.
15. Eusebi P. Diagnostic accuracy measures. Cerebrovasc Dis. 2013;36:267–272.
16. Fagan TJ. Letter: nomogram for Bayes theorem. N Engl J Med. 1975;293:257.
17. Kruisselbrink R, Gharapetian A, Chaparro LE, et al. Diagnostic accuracy of point-of-care gastric ultrasound. Anesth Analg. 2018 [Epub ahead of print].
18. Altman DG, Bland JM. Diagnostic tests 3: receiver operating characteristic plots. BMJ. 1994;309:188.
19. Mallett S, Halligan S, Thompson M, Collins GS, Altman DG. Interpreting diagnostic accuracy studies for patient care. BMJ. 2012;345:e3999.
20. Schober P, Bossers SM, Schwarte LA. Statistical significance versus clinical importance of observed effect sizes: what do P values and confidence intervals really represent? Anesth Analg. 2018;126:1068–1072.
21. Gastaminza G, Lafuente A, Goikoetxea MJ, et al. Improvement of the elevated tryptase criterion to discriminate IgE- from non-IgE-mediated allergic reactions. Anesth Analg. 2018;127:414–419.
22. Šimundić AM. Measures of diagnostic accuracy: basic definitions. EJIFCC. 2009;19:203–211.
23. Alberg AJ, Park JW, Hager BW, Brock MV, Diener-West M. The use of "overall accuracy" to evaluate the validity of screening or diagnostic tests. J Gen Intern Med. 2004;19:460–465.
24. Youden WJ. Index for rating diagnostic tests. Cancer. 1950;3:32–35.
25. Mascha EJ. Identifying the best cut-point for a biomarker, or not. Anesth Analg. 2018;127:820–822.
26. Gomez-Builes JC, Acuna SA, Nascimento B, Madotto F, Rizoli SB. Harmful or physiologic: diagnosing fibrinolysis shutdown in a trauma cohort with rotational thromboelastometry. Anesth Analg. 2018;127:840–849.
27. Efron B, Tibshirani R. An Introduction to the Bootstrap. New York, NY: Chapman & Hall; 1993.
28. Coste J, Pouchot J. A grey zone for quantitative diagnostic and screening tests. Int J Epidemiol. 2003;32:304–313.
29. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29–36.


30. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837–845.
31. Hanley JA, McNeil BJ. A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology. 1983;148:839–843.
32. McNemar Q. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika. 1947;12:153–157.
33. Robin X, Turck N, Hainard A, et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics. 2011;12:77.
34. Zhou X, Obuchowski NA, McClish DK. Design of diagnostic accuracy studies. In: Statistical Methods in Diagnostic Medicine. 2nd ed. Hoboken, NJ: Wiley and Sons; 2011:57–102.
35. Zhou X, Obuchowski NA, McClish DK. Methods for correcting imperfect gold standard bias. In: Statistical Methods in Diagnostic Medicine. 2nd ed. Hoboken, NJ: Wiley and Sons; 2011:389–434.
36. Obuchowski NA. Computing sample size for receiver operating characteristic studies. Invest Radiol. 1994;29:238–243.
37. Flahault A, Cadilhac M, Thomas G. Sample size calculation should be performed for design accuracy in diagnostic test studies. J Clin Epidemiol. 2005;58:859–862.

