Lind Plesner et al, 2023: Commercially Available Chest Radiograph AI Tools for Detecting Airspace Disease, Pneumothorax, and Pleural Effusion
Background: Commercially available artificial intelligence (AI) tools can assist radiologists in interpreting chest radiographs, but their
real-life diagnostic accuracy remains unclear.
Purpose: To evaluate the diagnostic accuracy of four commercially available AI tools for detection of airspace disease, pneumothorax,
and pleural effusion on chest radiographs.
Materials and Methods: This retrospective study included consecutive adult patients who underwent chest radiography at one of four
Danish hospitals in January 2020. Two thoracic radiologists (or three, in cases of disagreement) who had access to all previous and
future imaging labeled chest radiographs independently for the reference standard. Area under the receiver operating characteristic
curve, sensitivity, and specificity were calculated. Sensitivity and specificity were additionally stratified according to the severity of find-
ings, number of findings on chest radiographs, and radiographic projection. The χ2 and McNemar tests were used for comparisons.
Results: The data set comprised 2040 patients (median age, 72 years [IQR, 58–81 years]; 1033 female), of whom 669 (32.8%) had
target findings. The AI tools demonstrated areas under the receiver operating characteristic curve ranging 0.83–0.88 for airspace
disease, 0.89–0.97 for pneumothorax, and 0.94–0.97 for pleural effusion. Sensitivities ranged 72%–91% for airspace disease, 63%–
90% for pneumothorax, and 62%–95% for pleural effusion. Negative predictive values ranged 92%–100% for all target findings.
For airspace disease, pneumothorax, and pleural effusion, specificity was high for chest radiographs with normal or single findings
(range, 85%–96%, 99%–100%, and 95%–100%, respectively) and markedly lower for chest radiographs with four or more findings
(range, 27%–69%, 96%–99%, and 65%–92%, respectively) (P < .001). AI sensitivity was lower for vague airspace disease (range,
33%–61%) and small pneumothorax or pleural effusion (range, 9%–94%) compared with larger findings (range, 81%–100%; P
value range, > .99 to < .001).
Conclusion: Current-generation AI tools showed moderate to high sensitivity for detecting airspace disease, pneumothorax, and pleural
effusion on chest radiographs. However, they produced more false-positive findings than radiology reports, and their performance de-
creased for smaller-sized target findings and when multiple findings were present.
© RSNA, 2023
Figure 1: Flowchart shows study inclusion and exclusion. The sample was enriched by including chest radiographs
(CXRs) with pneumothorax (n = 44) to achieve a sufficient sample size. The analysis sample (n = 2040) is defined as the
sample analyzed by all artificial intelligence (AI) tools in this study. For comparison of AI performance with corresponding
clinical radiology reports (*), insufficient radiology reports (n = 29; defined as a report that did not state the presence or
absence of any chest radiograph findings but instead, eg, referenced more recent CT findings) were excluded from the
analysis. A target finding chest radiograph was defined as a chest radiograph with one or more of the following findings as
determined according to the reference standard: airspace disease, pneumothorax, and/or pleural effusion. Normal and
other abnormal chest radiographs were also determined according to the reference standard. DICOM = Digital Imaging
and Communications in Medicine.
recent CT findings instead of interpreting the chest radiograph), the chest radiograph was excluded from diagnostic accuracy assessment of the clinical radiologist’s report. Examinations that were reported as equivocal were labeled as positive.

AI Tools
Seven vendors with class IIA/IIB European conformity–marked (CE-marked) AI tools as of 2022 were invited to participate in the study. Four vendors agreed as follows: vendor A, Annalise Enterprise CXR (version 2.2; Annalise-AI); vendor B, SmartUrgences (version 1.24; Milvue); vendor C, ChestEye (version 2.6; Oxipit); and vendor D, AI-Rad Companion (version 10; Siemens Healthineers). AI tools are detailed in Table S2. De-identified frontal chest radiographs were processed by each AI tool to obtain a probability score for each target finding (score 0–1, with low values representing low probability of disease and vice versa). Binary diagnostic accuracy metrics were calculated using the manufacturer-specified probability thresholds. Three AI tools used a single threshold, while vendor B used both a high sensitivity threshold (hereafter, vendor B sens) and high specificity threshold (hereafter, vendor B spec). When not capable of processing a chest radiograph, the AI probability score was 0. Two AI tools (vendor A and vendor B) can evaluate lateral images in clinical use; however, only frontal chest radiographs were processed in this study. The vendor D tool does not classify pleural effusion on anteroposterior chest radiographs; therefore, only posteroanterior chest radiographs were included for this finding with this tool. None of the AI tools had been trained on data from any of the included hospitals.
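As a minimal illustration of how a manufacturer-specified operating point converts the continuous probability score into the binary output used for the accuracy metrics, consider the following R sketch; the threshold values, scores, and column names are hypothetical and do not correspond to any vendor in the study.

# Hypothetical per-examination probability scores (0-1); a score of 0 is also
# what was recorded when a tool could not process the radiograph.
scores <- data.frame(
  exam_id  = 1:6,
  vendor_a = c(0.08, 0.62, 0.95, 0.00, 0.41, 0.77),
  vendor_b = c(0.12, 0.55, 0.91, 0.00, 0.35, 0.68)
)

thr_a      <- 0.50  # illustrative single operating point
thr_b_sens <- 0.30  # illustrative high sensitivity threshold (vendor B sens)
thr_b_spec <- 0.70  # illustrative high specificity threshold (vendor B spec)

# Binary calls from which sensitivity, specificity, PPV, and NPV are tabulated
scores$vendor_a_pos      <- scores$vendor_a >= thr_a
scores$vendor_b_sens_pos <- scores$vendor_b >= thr_b_sens
scores$vendor_b_spec_pos <- scores$vendor_b >= thr_b_spec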
Statistical Analysis
Continuous data are presented as medians with IQRs and categorical data are presented as numbers and percentages. For the primary aim, examination-level values for sensitivity, specificity, positive predictive value, and negative predictive value with 95% CIs were calculated using the binomial exact method. Comparisons of cross-tabulated frequencies were performed using the χ2 test for independent observations or the Fisher exact test when specifically stated. Individual AI tools were not statistically compared head to head but were instead grouped to assess any differences in performance across all tools. For this purpose, the McNemar test was used to compare sensitivity and specificity, and the χ2 test was used to compare positive predictive values and negative predictive values. Areas under the receiver operating characteristic curve for detection of target findings were calculated and compared using the DeLong method. For the secondary aim, the McNemar test was used to compare false-positive and false-negative rates between AI tools and radiology reports. With a sample size of at least 75 cases, an AI sensitivity or specificity of 85% ± 15 (SD) can be detected with a power of 0.9 and significance level of .05. P < .05 was considered indicative of a statistically significant difference. Statistical analyses were carried out by one author (L.L.P.) using R software (version 3.6.1; The R Foundation [14]) with the pROC, thresholdROC, tidyverse, and gtsummary packages.
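The calculations described above can be sketched in R with base functions and the pROC package named above; the data here are simulated and the code is illustrative only, not the authors' analysis script.

library(pROC)

# Exact binomial 95% CI, eg, for a sensitivity of 45 detected out of 50 positive examinations
binom.test(45, 50)

# Simulated reference standard and probability scores for two AI tools on the same examinations
set.seed(1)
truth   <- rbinom(200, 1, 0.3)
score_a <- ifelse(truth == 1, rnorm(200, 0.7, 0.2), rnorm(200, 0.3, 0.2))
score_b <- ifelse(truth == 1, rnorm(200, 0.6, 0.25), rnorm(200, 0.35, 0.25))

roc_a <- roc(truth, score_a)
roc_b <- roc(truth, score_b)
ci.auc(roc_a)                                             # DeLong 95% CI (pROC default)
roc.test(roc_a, roc_b, method = "delong", paired = TRUE)  # DeLong comparison of AUCs

# McNemar test for paired binary results, eg, AI output versus the radiology report
ai_pos     <- score_a >= 0.5
report_pos <- score_b >= 0.5   # placeholder stand-in for report labels
mcnemar.test(table(ai_pos, report_pos))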
Results

Patient Characteristics and Examination Findings
A total of 2055 consecutive patients with chest radiographs were screened for inclusion, along with 44 patients with chest radiographs in the enrichment sample for pneumothorax (Fig 1). A total of 59 of 2099 patients (2.8%) were excluded due to insufficient lung visualization (n = 35), a missing DICOM image (n = 14), a chest radiograph from another hospital (n = 9), or duplicate inclusion (n = 1). The remaining 2040 patients were included in the analysis sample; of these, 669 (32.8%) had at least one target finding, while 1371 (67.2%) did not have any target findings. There were 461 of 2040 patients (22.6%) without any chest radiograph findings. Eight of 2040 patients (0.4%) had chest radiographs with no AI output from vendor A and two of 2040 (0.1%) had no output from vendor C.

Table 1: Characteristics of Patients with and without Target Findings on Chest Radiographs

Demographic information is presented in Table 1. The median age in the analysis sample was 72 years (IQR, 58–81 years), with 1033 female and 1007 male patients included. Prior or subsequent chest radiographs or chest CT scans were available for 1641 of 2040 (80.4%) and 1165 of 2040 (57.1%) patients, respectively. There were 1222 of 2040 patients (59.9%) with two or more findings and 646 of 2040 (31.7%) with four or more findings on chest radiographs. The radiographic projection was posteroanterior in 1564 of 2040 patients (76.7%) and anteroposterior in 476 of 2040 patients (23.3%). There were 113 of 1564 patients (7.2%) with posteroanterior chest radiographs labeled as suboptimal at reference standard assessment due to one or more quality issues that included external objects (36.3% [41 of 113]), an underexposed chest radiograph (32.7% [37 of 113]), rotation (19.5% [22 of 113]), incomplete inspiration (15.9% [18 of 113]), or other (8.8% [10 of 113]).
Figure 2: Diagnostic accuracy of four artificial intelligence (AI) tools for detection of airspace disease, pneumothorax, and pleural effusion as target findings.
Top: Receiver operating characteristic curves show performance of the AI tools for detecting the target findings on chest radiographs. Bottom: Precision-recall curves show performance for
the same target findings. Colored diamonds mark the operating point thresholds set by the manufacturer and used in this study, while white diamonds represent clinical radiol-
ogy report performance (n = 2011). Two thoracic radiologists, or three in the case of disagreement, independently labeled all chest radiographs, and the reference standard
was the consensus finding. ** = The vendor D AI tool does not detect pleural effusion on anteroposterior chest radiographs, thus the green line in these graphs represents
posteroanterior chest radiographs only (n = 1564). PPV = positive predictive value.
Among the 393 chest radiographs on which airspace disease was identified at reference standard assessment, 74 (18.8%) were classified as diffuse, 146 (37.2%) as multifocal, 112 (28.5%) as unifocal, and 61 (15.5%) as unifocal and vague. Among the 78 chest radiographs on which pneumothorax was identified, 31 (39.7%) were large, 25 (32.1%) were moderate, and 22 (28.2%) were small. Among the 365 chest radiographs on which pleural effusions were identified, 36 (9.9%) were large, 81 (22.2%) were moderate, and 248 (67.9%) were small. Furthermore, an intercostal drainage tube was present in 29.5% (23 of 78) of patients with a pneumothorax finding and 2.7% (10 of 365) of patients with a pleural effusion finding on chest radiographs. Finally, pleural effusion or airspace disease were visible on only the lateral projection for 27 and seven examinations, respectively, and were thus counted as negative.

Diagnostic Accuracy of the AI Tools
Using the expert-labeled chest radiographs as the reference standard, the four AI tools demonstrated areas under the receiver operating characteristic curve ranging 0.83–0.88 (95% CI range: 0.81–0.90) for airspace disease, 0.89–0.97 (95% CI range: 0.84–1.00) for pneumothorax, and 0.94–0.97 (95% CI range: 0.93–0.98) for pleural effusion (Fig 2; Tables 2, S3). Sensitivities of the AI tools ranged 72%–91% (95% CI range: 67–94) for airspace disease, 63%–90% (95% CI range: 51–95) for pneumothorax, and 62%–95% (95% CI range: 57–97) for pleural effusion, while specificities ranged 62%–86% (95% CI range: 60–88), 98%–100% (95% CI range: 97–100), and 83%–97% (95% CI range: 82–98), respectively, for the target findings. Negative predictive values were high across all findings, ranging 92%–100% (95% CI range: 91–100), but positive predictive values were lower, especially for airspace disease (range, 37%–55%) but also for pneumothorax (range, 60%–86%) and pleural effusion (range, 56%–84%).
Table 2: Diagnostic Accuracy of the AI Tools for Airspace Disease, Pneumothorax, and Pleural Effusion
The areas under the receiver operating characteristic curve, sensitivities, specificities, positive predictive values, and negative predictive values were different for similar target findings across the AI tools (P < .001), and a lower sensitivity corresponded directly to a higher specificity (Fig 2). No difference was observed in the mean sensitivity of all AI tools for pneumothorax detection on chest radiographs between the enrichment sample and the consecutive sample (77.9% vs 77.8%; P > .99).

Diagnostic Performance for Target Findings Based on Size, Number of Findings, and Projection
Figures 3–5 illustrate AI and clinical radiology report performance in prespecified subgroups (full data are available in Tables S4 and S5). Sensitivities of the AI tools for diffuse airspace disease ranged 92%–100% (95% CI range: 83–100), compared with 33%–61% (95% CI range: 22–73) for unifocal vague airspace disease (P < .001 for all AI tools). For pneumothorax, sensitivities for large versus small lesions were similar for vendor A at 97% (95% CI range: 81–100) versus 86% (95% CI range: 64–96) (P = .30), but lower for small lesions for the other vendors (range, 94%–100% [95% CI range: 77–100] vs 9%–59% [95% CI range: 2–79]; P < .001 for all). For pleural effusion, sensitivities for large versus small lesions were similar for vendor A at 94% (95% CI range: 80–99) versus 94% (95% CI range: 90–96) (P > .99), but lower for small lesions for the other vendors (range, 81%–100% [95% CI range: 63–100] vs 56%–76% [95% CI range: 49–82]; P < .001 for all).

The specificity for target findings on chest radiographs with 0–1 findings compared with four or more findings was higher across all AI tools (P value range, .10 to < .001), except for vendor B with the high specificity threshold (vendor B spec) for pneumothorax (P = .17) (Fig 4). This was especially evident for airspace disease, where average AI tool specificity was 90.7% for chest radiographs with 0–1 findings versus 46.8% for those with four or more findings (P < .001). The specificity for airspace disease on posteroanterior chest radiographs compared with anteroposterior chest radiographs was also higher across all AI tools (P < .001 for all), with an average AI specificity of 77.8% versus 56.2%, respectively (P < .001) (Fig 5). For pneumothorax, this pattern was also seen for vendors A, B sens, B spec, and C (P < .001 for all) but not for vendor D (P = .30). For pleural effusion, vendors A and C had a higher specificity for posteroanterior compared with anteroposterior chest radiographs (86% and 98% vs 72% and 94%, respectively; P < .001 for both), while vendor B sens and vendor B spec showed no significant difference between posteroanterior and anteroposterior projections (93% and 97% vs 90% and 96%, respectively; P = .09 and P = .29) (Table S5). Vendor D was not designed to detect pleural effusion on anteroposterior images.
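The within-tool subgroup comparisons reported above and in Figures 3–5 have the form of a simple 2 × 2 test; a sketch with invented counts, using the Fisher exact test named in the figure captions, follows.

# Hypothetical detections by one AI tool: large versus small pneumothorax
detections <- matrix(c(30, 1,    # large: detected, missed
                       13, 9),   # small: detected, missed
                     nrow = 2, byrow = TRUE,
                     dimnames = list(size = c("large", "small"),
                                     result = c("detected", "missed")))
fisher.test(detections)  # compares sensitivity between the two subgroups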
Figure 3: Sensitivity of artificial intelligence (AI) tools and clinical radiology reports stratified according to target finding. Top: Bar graphs show airspace disease findings
(n = 393), which were categorized as diffuse (n = 74), multifocal (n = 146), unifocal (n = 112), or unifocal vague (n = 61), for the AI tools and radiology reports, with the
lowest sensitivity values for unifocal vague findings (range, 33%–61%; P < .001 for all). Middle: Bar graphs show pneumothorax findings (n = 78), which were categorized
as large (n = 31), moderate (n = 25), or small (n = 22), for the AI tools and radiology reports, with a lower sensitivity for small findings (range, 9%–59%; P < .001), except
for that of vendor A. Bottom: Bar graphs show pleural effusion findings (n = 365), which were categorized as large (n = 36), moderate (n = 81), or small (n = 248), for the
AI tools and radiology reports, with a lower sensitivity for small findings (range, 56%–76%; P < .001), except for that of vendor A. Vendor B used both high sensitivity (Vendor
B Sens.) and high specificity (Vendor B Spec.) probability thresholds. Error bars represent 95% CIs on the sensitivity estimate. * = A statistically significant difference (P < .05)
is indicated with reference to the bar illustrating the highest sensitivity for the individual AI tool (not across different AI tools), as calculated using the Fisher exact test. ** = The
vendor D AI tool does not detect pleural effusion in anteroposterior chest radiographs, thus the graph illustrates results for posteroanterior only (n = 1564) and should not be
directly compared with other vendors. All data are provided in Table S4. NS = not significant.
Comparison of AI Tools with Clinical Radiology Reports for Target Findings
Clinical radiology reports were deemed insufficient and, thus, excluded from the following analysis in 29 of 2040 patients (1.4%). There were 72 different report readers from a variety of radiology subspecialities who validated one or more chest radiographs, including five radiologists in training who together validated 14 chest radiographs in total (0.7% [14 of 2011]). No evidence of a difference was observed in the rate of airspace disease false-negative findings between the AI tools and the clinical radiology reports, except for when vendor B sens was used (false-negative rate, 9% vs 21.5%; P < .001) (Table 3). All AI tools had a higher false-positive rate for airspace disease (range, 13.7%–36.9%) compared with radiology reports (11.6%; P value range, < .001 to .01). For pneumothorax, no difference in the false-negative rate between the AI tools and radiology reports was found except for when vendor B spec was used, for which a higher false-negative rate was observed (37.3% vs 16.0%; P = .01).
Figure 4: Specificity of artificial intelligence (AI) tools and clinical radiology reports stratified according to the number of concurrent findings on chest radiographs. Top:
Bar graphs show airspace disease controls grouped into 0–1 (n = 772), 2–3 (n = 454), and 4 or more (n = 421) chest radiograph findings, with the lowest specificity values
in the 4 or more category (range, 27%–69%; P < .001 for all, compared with 0–1 findings). Middle: Bar graphs show pneumothorax controls grouped into 0–1 (n = 814),
2–3 (n = 548), and 4 or more (n = 600) chest radiograph findings, with lowest values in the 4 or more category (range, 96%–99%; P = .17 for vendor B spec; P value range,
.01 to < .001 for others). Bottom: Bar graphs show pleural effusion controls grouped into 0–1 (n = 812), 2–3 (n = 510), and 4 or more (n = 353) chest radiograph findings,
with the lowest values in the 4 or more category (range, 65%–92%; P < .001 for all). Vendor B used both high sensitivity (Vendor B Sens.) and high specificity (Vendor B Spec.)
probability thresholds. Error bars represent 95% CIs on the specificity estimate. * = A statistically significant difference (P < .05) is indicated with reference to the bar illustrating
the highest specificity for the individual AI tool (not across different AI tools), as calculated using the Fisher exact test. ** = The vendor D AI tool does not detect pleural effusion
in anteroposterior chest radiographs, thus the graph illustrates results for posteroanterior only (n = 1564) and should not be directly compared with other vendors. All data are
provided in Table S5. NS = not significant.
Figure 5: Specificity of artificial intelligence (AI) tools and clinical radiology reports stratified according to radiographic projection. Top: Bar graphs show airspace
disease controls grouped into anteroposterior (AP, n = 318) and posteroanterior (PA, n = 1329), with the lowest values in the anteroposterior projection (range, 42%–73%;
P < .001 for all, compared with posteroanterior). Middle: Bar graphs show pneumothorax controls grouped into anteroposterior (n = 466) and posteroanterior (n = 1496),
with the lowest values in the anteroposterior projection (range, 93%–99%; P = .30 for vendor D, P < .001 for others). Bottom: Bar graphs show pleural effusion controls
grouped into anteroposterior (n = 340) and posteroanterior (n = 1335), with the lowest values in the anteroposterior projection for vendors A and C (P < .001 for both) and
the proportion unchanged for vendor B at both the high sensitivity (Vendor B Sens.) and high specificity (Vendor B Spec.) thresholds (P = .09 and P = .29). Error bars represent
95% CIs on the specificity estimate. * = A statistically significant difference (P < .05) is indicated with reference to the bar illustrating the highest specificity for the individual AI tool
(not across different AI tools), as calculated using the Fisher exact test. ** = The vendor D AI tool does not detect pleural effusion in anteroposterior chest radiographs, thus the
graph illustrates results for posteroanterior only (n = 1564). Data used to generate this figure are provided in Table S5. NS = not significant.
Most AI tools had a higher false-positive rate compared with the radiology reports for pneumothorax (range, 1.1%–2.4% vs 0.2%; P < .001 for all), except for vendor B spec, which was 0.4% (P = .91). For pleural effusion, vendor A had a lower false-negative rate than the radiology reports (4.7% vs 27.5%, P < .001), and vendor B spec and vendor C had higher false-negative rates than the reports (31.7% and 38.0% vs 27.5%; P = .01 and P < .001). No differences in pleural effusion false-negative rates were observed between the radiology reports and either the vendor B sens or the vendor D AI tool (P = .53 and P = .07). Three AI tools had higher false-positive rates for pleural effusion than the radiology reports (range, 7.7%–16.4% vs 4.2%; P < .001 for all), one had a lower false-positive rate (2.7% vs 4.2%, P < .001), and one showed no difference (3.2% vs 4.2%, P = .45). Examples of chest radiographs incorrectly labeled by the AI tools are shown in Figure 6 and examples of chest radiographs correctly labeled by the AI tools are shown in Figure S4.

Table 3: Performance of the AI Tools Compared with Corresponding Radiology Reports for Target Findings

Discussion
This study tested the diagnostic accuracy of current commercially available artificial intelligence (AI) tools for identifying airspace disease, pneumothorax, and pleural effusion on chest radiographs in a real-life multicenter patient sample. The AI tools achieved moderate to high sensitivities ranging 62%–95% and excellent negative predictive values greater than 92%. The positive predictive values of AI tools were lower and showed more variation, ranging 37%–86%, most often with false-positive rates higher than the clinical radiology reports. Furthermore, we found that AI sensitivity generally was lower for smaller-sized target findings and that AI specificity generally was lower for anteroposterior chest radiographs and those with concurrent findings.

Previous studies have evaluated the diagnostic accuracy of these target findings using commercially available AI tools (15–23). Three of these studies used the vendor A tool, while the majority have used the Lunit INSIGHT AI tool, which was not tested here. There are currently no published studies on the target findings tested in this study with the vendor B, vendor C, or vendor D tools. For airspace disease, reported sensitivities and specificities have ranged 81%–92% and 67%–94%, respectively (15,20,21). Corresponding numbers have ranged 39%–99% and 92%–100% for pneumothorax (15–17,19,20,22) and 78%–89% and 94%–99% for pleural effusion (15,19,20). Notably, only one of these studies included an unselected consecutive sample (20), while three other consecutive studies were performed with a narrower scope of pneumothorax detection after lung biopsy (17), pneumonia detection in younger men presenting with febrile respiratory illness at a military hospital (21), or patients admitted for acute trauma (22). The sensitivities found in our study were comparable with those in previous studies, while specificities were in the lower range, possibly due to the consecutive data sample including heterogeneous patients from a real-life setting.

Among the AI tools examined in this study, we observed a notable difference in the balance between sensitivity and specificity for the individual tools, which seems unpredictable. Therefore, when implementing an AI tool, it seems crucial to understand the disease prevalence and severity at the implementation site, and the AI tool threshold may need to be changed after implementation for the system to have the desired diagnostic ability.
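One way a site could act on this is to re-derive an operating point from a local validation set, for example to guarantee a target sensitivity before trading off specificity; a minimal sketch with the pROC package and simulated data follows (the 90% sensitivity target and all values are illustrative).

library(pROC)

set.seed(2)
truth <- rbinom(300, 1, 0.25)                 # simulated local reference standard
score <- ifelse(truth == 1, rnorm(300, 0.7, 0.2), rnorm(300, 0.3, 0.2))

roc_obj <- roc(truth, score)

# All candidate operating points on the ROC curve
pts <- coords(roc_obj, x = "all", ret = c("threshold", "sensitivity", "specificity"))

# Among thresholds that keep sensitivity at or above the 90% target,
# take the one with the highest specificity
ok      <- pts[pts$sensitivity >= 0.90, ]
new_thr <- ok$threshold[which.max(ok$specificity)]
new_thr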
Figure 6: Representative chest radiographs in six patients show (A, C, E) false-positive findings and (B, D, F) false-neg-
ative findings as identified by the artificial intelligence (AI) tools. In general, false-negative findings determined by the AI tools
were very subtle representations of disease, while false-positive findings were misinterpretations. These examples were all
correctly classified by the clinical radiology reports. (A) Posteroanterior chest radiograph in a 71-year-old male patient who
underwent examination due to progression of dyspnea shows bilateral fibrosis (arrows), which was misclassified as airspace
disease by all four AI tools. (B) Posteroanterior chest radiograph in a 31-year-old female patient referred for radiography due
to month-long coughing shows subtle airspace opacity at the right cardiac border (arrows), which was missed by all AI tools.
(C) Anteroposterior chest radiograph in a 78-year-old male patient referred after placement of a central venous
catheter shows a skin fold on the right side (arrow), which was misclassified as pneumothorax by all AI tools. (D) Pos-
teroanterior chest radiograph in a 78-year-old male patient referred to rule out pneumothorax shows very subtle
apical right-sided pneumothorax (arrows), which was missed by all AI tools except for vendor B (with the high sen-
sitivity threshold). (E) Posteroanterior chest radiograph in a 72-year-old male patient referred for radiography with-
out a specified reason shows chronic rounding of the costophrenic angle (arrow), which was mistaken for pleu-
ral effusion by all AI tools and verified according to the reference standard in a corresponding chest CT image.
(F) Anteroposterior chest radiograph in a 76-year-old female patient referred for radiography due to suspicion of congestion
and/or pneumonia shows a very subtle left-sided pleural effusion (arrow), which was missed by all three AI tools that were
capable of analyzing anteroposterior chest radiographs for pleural effusion.
Furthermore, the low sensitivity observed for several AI tools in our study suggests that, as for clinical radiologists, the performance of AI tools decreases for more subtle findings on chest radiographs. This has been observed previously in studies using a single algorithm for pneumothorax (16), lung nodules, and pneumonia, where there are overlapping structures and/or a small lesion size (7,10).

We further found that for anteroposterior chest radiographs and chest radiographs with multiple findings, the specificity of AI tools for airspace disease and pleural effusion decreased compared with posteroanterior chest radiographs and chest radiographs with a single finding. This effect was most pronounced for airspace disease, which is unsurprising as airspace disease can resemble other chest radiograph findings, but we also observed the effect for pneumothorax and pleural effusion, which have clearer imaging definitions. Ahn et al (15) reported on the performance of the Lunit INSIGHT AI tool and found, similar to our study, that the specificity for pneumonia was 85% in patients without extra findings on chest radiographs and 51% in patients with concurrent findings. Together these findings suggest that radiologists should be aware of these limitations, regarding both sensitivity and specificity, and should not overconfidently trust the systems in these difficult cases. However, it should be stated that many mistakes made by AI tools would also be difficult or even impossible for a human reader to detect without access to additional imaging and patient history. To overcome this limitation, next-generation AI tools should strive to incorporate comparisons with previous medical imaging, which is currently being explored (24).

Our study had several limitations. First, although a consecutive sample was used, this sample may lack generalizability to settings other than a hospital-based setting due to the high median age in the sample and the high prevalence of patients with multiple findings on chest radiographs. Second, the definitions of disease used for our reference standard may align differently with the definitions used for AI training, thereby possibly favoring one AI tool over another. Third, AI tools were compared with clinical radiology reports that were generated by radiologists who had access to lateral chest radiographs, clinical information, and prior imaging, whereas the AI tools did not, which gives the radiologists an “unfair advantage.” Additionally, clinical radiologist accuracy for pneumothorax is inflated due to the enrichment inclusion method for examinations with these findings, which were identified using the same radiology reports included in our analysis. Fourth, analyses for this study were performed at the examination level and, therefore, the AI tools and reference standard experts could have made decisions based on differing pixels in the chest radiographs. This gives an advantage to less specific AI tools because a false-positive finding can be counted as a true-positive and hence inflate the AI performance. However, due to the high specificity of AI tools for pneumothorax and pleural effusion, this may only be relevant for airspace disease detection. Fifth, no lateral chest radiographs were used as input to any of the AI tools, thus it is unknown whether the two AI vendors with lateral image processing capacity could have had a slightly higher performance. Finally, this was a retrospective study of the standalone performance of AI tools, although the AI tools are clinically approved for concurrent reading supporting a human reader and should ideally be prospectively evaluated in that setting; however, this is not feasible when testing multiple AI tools.

In conclusion, current-generation artificial intelligence (AI) tools showed moderate to high sensitivity for detecting airspace disease, pneumothorax, and pleural effusion on chest radiographs. However, they produced more false-positive results than radiology reports, and their performance decreased for smaller-sized target findings, chest radiographs with multiple findings, and chest radiographs with anteroposterior radiographic projection. Future studies could focus on prospective assessment of the clinical consequences of using AI for chest radiography in terms of patient-related outcomes.

Author contributions: Guarantors of integrity of entire study, L.L.P., F.C.M., M.B., M.B.A.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, L.L.P.; clinical studies, L.L.P., M.W.B., L.C.L., F.R., M.B., M.B.A.; experimental studies, L.L.P.; statistical analysis, L.L.P., F.C.M., M.B.A.; and manuscript editing, all authors

Disclosures of conflicts of interest: L.L.P. Lecture payment from Siemens Healthineers. F.C.M. Institutional research grants from Siemens Healthineers and Innovation Fund Denmark; lecture payment from Siemens Healthineers. M.W.B. No relevant relationships. L.C.L. No relevant relationships. F.R. No relevant relationships. O.W.N. Lecture payments from Roche, Orion, Pharmacosmos, and Novartis; stock options in Bavarian Nordic and Merck; currently employed by Novo Nordisk. M.B. No relevant relationships. M.B.A. Lecture payments from Philips Healthcare, Siemens Healthineers, Boehringer Ingelheim, and Roche.
References
1. Raoof S, Feigin D, Sung A, Raoof S, Irugulpati L, Rosenow EC 3rd. Interpretation of plain chest roentgenogram. Chest 2012;141(2):545–558.
2. Eng J, Mysko WK, Weller GER, et al. Interpretation of Emergency Department radiographs: a comparison of emergency medicine physicians with radiologists, residents with faculty, and film with digital display. AJR Am J Roentgenol 2000;175(5):1233–1238.
3. Gatt ME, Spectre G, Paltiel O, Hiller N, Stalnikowicz R. Chest radiographs in the emergency department: is the radiologist really necessary? Postgrad Med J 2003;79(930):214–217.
4. Çallı E, Sogancioglu E, van Ginneken B, van Leeuwen KG, Murphy K. Deep learning for chest X-ray analysis: A survey. Med Image Anal 2021;72:102125.
5. van Leeuwen KG, Schalekamp S, Rutten MJCM, van Ginneken B, de Rooij M. Artificial intelligence in radiology: 100 commercially available products and their scientific evidence. Eur Radiol 2021;31(6):3797–3804.
6. Li D, Pehrson LM, Lauridsen CA, et al. The Added Effect of Artificial Intelligence on Physicians’ Performance in Detecting Thoracic Pathologies on CT and Chest X-ray: A Systematic Review. Diagnostics (Basel) 2021;11(12):2206.
7. Kim C, Yang Z, Park SH, et al. Multicentre external validation of a commercial artificial intelligence software to analyse chest radiographs in health screening environments with low disease prevalence. Eur Radiol 2023;33(5):3501–3509.
8. Voter AF, Larson ME, Garrett JW, Yu JJ. Diagnostic Accuracy and Failure Mode Analysis of a Deep Learning Algorithm for the Detection of Cervical Spine Fractures. AJNR Am J Neuroradiol 2021;42(8):1550–1556.
9. Oakden-Rayner L, Gale W, Bonham TA, et al. Validation and algorithmic audit of a deep learning system for the detection of proximal femoral fractures in patients in the emergency department: a diagnostic accuracy study. Lancet Digit Health 2022;4(5):e351–e358.
10. Sun J, Peng L, Li T, et al. Performance of a Chest Radiograph AI Diagnostic Tool for COVID-19: A Prospective Observational Study. Radiol Artif Intell 2022;4(4):e210217.
11. Park SH. Diagnostic Case-Control versus Diagnostic Cohort Studies for Clinical Validation of Artificial Intelligence Algorithm Performance. Radiology 2019;290(1):272–273.
12. AI Central. Data Science Institute, American College of Radiology. https://aicentral.acrdsi.org/. Accessed May 1, 2023.
13. Bossuyt PM, Reitsma JB, Bruns DE, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ 2015;351:h5527.
14. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2022.
15. Ahn JS, Ebrahimian S, McDermott S, et al. Association of Artificial Intelligence-Aided Chest Radiograph Interpretation With Reader Performance and Efficiency. JAMA Netw Open 2022;5(8):e2229289.
16. Hillis JM, Bizzo BC, Mercaldo S, et al. Evaluation of an Artificial Intelligence Model for Detection of Pneumothorax and Tension Pneumothorax in Chest Radiographs. JAMA Netw Open 2022;5(12):e2247172.
17. Hong W, Hwang EJ, Lee JH, Park J, Goo JM, Park CM. Deep Learning for Detecting Pneumothorax on Chest Radiographs after Needle Biopsy: Clinical Implementation. Radiology 2022;303(2):433–441.
18. Nam JG, Kim M, Park J, et al. Development and validation of a deep learning algorithm detecting 10 common abnormalities on chest radiographs. Eur Respir J 2021;57(5):2003061.
19. Seah JCY, Tang CHM, Buchlak QD, et al. Effect of a comprehensive deep-learning model on the accuracy of chest x-ray interpretation by radiologists: a retrospective, multireader multicase study. Lancet Digit Health 2021;3(8):e496–e506.
20. van Beek EJR, Ahn JS, Kim MJ, Murchison JT. Validation study of machine-learning chest radiograph software in primary and emergency medicine. Clin Radiol 2023;78(1):1–7.
21. Kim JH, Kim JY, Kim GH, et al. Clinical Validation of a Deep Learning Algorithm for Detection of Pneumonia on Chest Radiographs in Emergency Department Patients with Acute Febrile Respiratory Illness. J Clin Med 2020;9(6):1981.
22. Gipson J, Tang V, Seah J, et al. Diagnostic accuracy of a commercially available deep-learning algorithm in supine chest radiographs following trauma. Br J Radiol 2022;95(1134):20210979.
23. Choi SY, Park S, Kim M, Park J, Choi YR, Jin KN. Evaluation of a deep learning-based computer-aided detection algorithm on chest radiographs: Case-control study. Medicine (Baltimore) 2021;100(16):e25663.
24. Bannur S, Hyland S, Liu Q, et al. Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing. arXiv 2301.04558 [preprint] https://arxiv.org/abs/2301.04558. Posted January 11, 2023. Accessed April 13, 2023.