
ORIGINAL RESEARCH • THORACIC IMAGING

Commercially Available Chest Radiograph AI Tools for Detecting Airspace Disease, Pneumothorax, and Pleural Effusion
Louis Lind Plesner, MD • Felix C. Müller, MD, PhD • Mathias W. Brejnebøl, MD • Lene C. Laustrup, MD •
Finn Rasmussen, MD, DMSc • Olav W. Nielsen, MD, PhD • Mikael Boesen, MD, PhD* •
Michael Brun Andersen, MD, PhD*
From the Department of Radiology, Herlev and Gentofte Hospital, Borgmester Ib Juuls Vej 1, Herlev, Copenhagen 2730, Denmark (L.L.P., F.C.M., M.W.B., L.C.L., M.B.A.); Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark (L.L.P., M.W.B., O.W.N., M.B., M.B.A.); Radiological Artificial Intelligence Testcenter, RAIT.dk, Capital Region of Denmark (L.L.P., F.C.M., M.W.B., M.B., M.B.A.); Departments of Radiology (M.W.B., M.B.) and Cardiology (O.W.N.), Bispebjerg and Frederiksberg Hospital, Copenhagen, Denmark; and Department of Radiology, Aarhus University Hospital, Aarhus, Denmark (F.R.). Received May 17, 2023; revision requested June 27; revision received August 1; accepted August 14. Address correspondence to L.L.P. (email: [email protected]).
This study was supported by a research grant from the Danish government (Project SmartChest, jr. nr 2020–6718). L.L.P., F.C.M., M.W.B., M.B., and M.B.A. were supported by funding from an AI Signature grant (SmartChest) from the Danish government, which included the PhD salaries connected to the study and meeting and/or travel support. F.C.M. was supported by grants from the Agency for Digitalization (Digitaliseringsstyrelsen) and Innovation Fund Denmark, Capital Region of Denmark.
* M.B. and M.B.A. are co-senior authors.

Conflicts of interest are listed at the end of this article.


See also the editorial by Yanagawa and Tomiyama in this issue.

Radiology 2023; 308(3):e231236 • https://doi.org/10.1148/radiol.231236

Background: Commercially available artificial intelligence (AI) tools can assist radiologists in interpreting chest radiographs, but their
real-life diagnostic accuracy remains unclear.

Purpose: To evaluate the diagnostic accuracy of four commercially available AI tools for detection of airspace disease, pneumothorax,
and pleural effusion on chest radiographs.

Materials and Methods: This retrospective study included consecutive adult patients who underwent chest radiography at one of four Danish hospitals in January 2020. Two thoracic radiologists (or three, in cases of disagreement) who had access to all previous and future imaging labeled chest radiographs independently for the reference standard. Area under the receiver operating characteristic curve, sensitivity, and specificity were calculated. Sensitivity and specificity were additionally stratified according to the severity of findings, number of findings on chest radiographs, and radiographic projection. The χ2 and McNemar tests were used for comparisons.

Results: The data set comprised 2040 patients (median age, 72 years [IQR, 58–81 years]; 1033 female), of whom 669 (32.8%) had target findings. The AI tools demonstrated areas under the receiver operating characteristic curve ranging 0.83–0.88 for airspace disease, 0.89–0.97 for pneumothorax, and 0.94–0.97 for pleural effusion. Sensitivities ranged 72%–91% for airspace disease, 63%–90% for pneumothorax, and 62%–95% for pleural effusion. Negative predictive values ranged 92%–100% for all target findings. In airspace disease, pneumothorax, and pleural effusion, specificity was high for chest radiographs with normal or single findings (range, 85%–96%, 99%–100%, and 95%–100%, respectively) and markedly lower for chest radiographs with four or more findings (range, 27%–69%, 96%–99%, 65%–92%, respectively) (P < .001). AI sensitivity was lower for vague airspace disease (range, 33%–61%) and small pneumothorax or pleural effusion (range, 9%–94%) compared with larger findings (range, 81%–100%; P value range, > .99 to < .001).

Conclusion: Current-generation AI tools showed moderate to high sensitivity for detecting airspace disease, pneumothorax, and pleural effusion on chest radiographs. However, they produced more false-positive findings than radiology reports, and their performance decreased for smaller-sized target findings and when multiple findings were present.
© RSNA, 2023

Supplemental material is available for this article.

Chest radiography is a common diagnostic tool, but significant training and experience is required to interpret examinations correctly (1–3). In recent years, artificial intelligence (AI) has demonstrated proficiency in image classification tasks using supervised deep learning with convolutional neural networks. Due to the widespread use of chest radiographs for decision-making in many clinical scenarios and the public availability of large training data sets, numerous studies have investigated the ability of deep learning–based AI models to carry out various tasks in the analysis of chest radiographs (4). This has led to the development of AI tools that are able to assist radiologists with diagnosis, segmentation, and worklist triage, some of which have received regulatory approval and are now commercially available (5). Retrospective observer studies in which AI assessments of chest radiographs are used as a decision support tool for a human reader have shown enhanced reader performance, especially for less experienced readers (6). However, the clinical use of deep learning–based AI tools for radiologic diagnosis is in its infancy (5) and, while case-control studies have been carried out, consecutive sample studies are lacking (7).

Abbreviation
AI = artificial intelligence

Summary
Four commercial chest radiograph artificial intelligence tools detected airspace disease, pneumothorax, and pleural effusion with moderate to high sensitivity, but had more false-positive findings than radiology reports and decreased sensitivity for smaller target findings.

Key Results
■ In this retrospective study, four commercially available artificial intelligence (AI) tools evaluated 2040 chest radiographs, achieving sensitivities ranging 72%–91%, 63%–90%, and 62%–95% for airspace disease, pneumothorax, and pleural effusion, respectively.
■ AI tool specificity was high for radiographs with normal or single findings (range for airspace disease, 85%–96%; pneumothorax, 99%–100%; pleural effusion, 95%–100%) but lower in radiographs with multiple findings (range, 27%–69%, 96%–99%, 65%–92%, respectively) (P < .001).
■ False-positive rates were higher for AI tools than for radiology reports, whereas false-negative rates were similar.

It is evident that disease prevalence, disease spectrum and severity, and the similarity of data used to train and test AI can affect the measured AI performance (7–11). While AI tools are increasingly being approved for use in radiology departments (12), there is an unmet need to further test them in real-life clinical scenarios.
The aim of this study was to evaluate the current generation of commercially available AI tools for detection of common acute findings (airspace disease, pneumothorax, pleural effusion) on chest radiographs in a consecutive multicenter hospital sample. The main objective was to assess the diagnostic performance of these algorithms by evaluating their individual sensitivity, specificity, and area under the receiver operating characteristics curve. The secondary objectives were to compare the diagnostic accuracy of the AI tools with that of clinical radiology reports and assess the performance of these AI tools when target findings were small, when multiple concurrent findings were present on chest radiographs, and when anteroposterior radiographic projections were used.

Materials and Methods
This article was prepared according to Standards for Reporting of Diagnostic Accuracy Studies, STARD, guidelines (13). This study was approved by the National Committee on Health Research Ethics (J-76643), which waived the requirement for informed consent.

Study Sample
Consecutive unique adult patients (>18 years of age) with chest radiographs from four different hospitals in the Copenhagen region (12 days in January 2020) were retrospectively identified for inclusion in this study. Only a patient's first chest radiograph during the study period was included. Chest radiographs screened for inclusion that had insufficient lung visualization according to the clinical radiology report or reference standard, were missing a Digital Imaging and Communications in Medicine (DICOM) image, were not obtained at included hospitals, or were from a duplicate patient were excluded. It was estimated that each target finding would be represented with a minimum of 75 cases (see Statistical Analysis for sample size) in a consecutive sample of 2000 chest radiographs. However, case-enrichment was performed for pneumothorax due to a low prevalence of this finding, whereby enrichment examinations were identified by a keyword search of chest radiology reports from the same year (January to December 2020) and same hospitals as the consecutive sample. All chest radiographs were identified by searching the picture archiving and communications system, or PACS (Impax 6; AGFA HealthCare). Chest radiographs and reports were combined with clinical data from electronic health records (Epic; Epic Systems) before de-identification. See Appendix S1 for further study design details.

Target Findings
Chest radiographs were assessed for the following three findings: airspace disease, pneumothorax, and pleural effusion. For reference standard expert readers, airspace disease was defined as visible opacity in the lung alveoli that was not considered a tumor and/or atelectasis (eg, pneumonia, pulmonary edema, tuberculosis, hemorrhage) and categorized as diffuse, multifocal, unifocal, or unifocal vague. Pneumothorax was categorized as small (<1-cm gap from the chest wall to lung edge in lung apex), moderate (<2-cm gap at the level of the hilum), or large (>2 cm at the level of the hilum). Pleural effusion was categorized as small (blunting of costophrenic angle on frontal chest radiograph), moderate (fluid levels below the hilum), or large (above the hilum) (for supine chest radiographs, this level was judged). Examples are provided in Figures S1–S3.
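To make the size scale concrete, the lines below sketch the pneumothorax grading rule in R, the language later used for the study's statistical analysis; the helper function, its inputs, and the handling of cases that fall outside the three verbal categories are assumptions for illustration only, since the reference readers graded findings visually rather than from stored measurements.

# Hypothetical helper mirroring the verbal pneumothorax size definition above (not the authors' code).
# gap_at_apex_cm:  gap from the chest wall to the lung edge at the lung apex, in cm
# gap_at_hilum_cm: gap from the chest wall to the lung edge at the level of the hilum, in cm (0 if none)
grade_pneumothorax <- function(gap_at_apex_cm, gap_at_hilum_cm) {
  if (gap_at_hilum_cm > 2) {
    "large"          # >2-cm gap at the level of the hilum
  } else if (gap_at_hilum_cm > 0) {
    "moderate"       # <2-cm gap at the level of the hilum
  } else if (gap_at_apex_cm < 1) {
    "small"          # <1-cm gap confined to the lung apex
  } else {
    NA_character_    # does not fit the three categories under this reading
  }
}

grade_pneumothorax(gap_at_apex_cm = 0.5, gap_at_hilum_cm = 0)    # returns "small"
grade_pneumothorax(gap_at_apex_cm = 2.0, gap_at_hilum_cm = 1.5)  # returns "moderate"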


Figure 1: Flowchart shows study inclusion and exclusion. The sample was enriched by including chest radiographs
(CXRs) with pneumothorax (n = 44) to achieve a sufficient sample size. The analysis sample (n = 2040) is defined as the
sample analyzed by all artificial intelligence (AI) tools in this study. For comparison of AI performance with corresponding
clinical radiology reports (*), insufficient radiology reports (n = 29; defined as a report that did not state the presence or
absence of any chest radiograph findings but instead, eg, referenced more recent CT findings) were excluded from the
analysis. A target finding chest radiograph was defined as a chest radiograph with one or more of the following findings as
determined according to the reference standard: airspace disease, pneumothorax, and/or pleural effusion. Normal and
other abnormal chest radiographs were also determined according to the reference standard. DICOM = Digital Imaging
and Communications in Medicine.

Reference Standard
The reference standard assessment was performed by expert thoracic radiologists (M.B.A., L.C.L., and F.R., with 8, 17, and 33 years of thoracic radiology experience, respectively) who were blinded to AI predictions. Two readers (M.B.A. and L.C.L.) labeled all chest radiographs independently followed by a consensus discussion in case of disagreement. If there were still differing opinions, the chest radiograph went to a senior arbitrating third reader (F.R.) who was blinded to previous labels. All labeled findings, both target findings and nontarget findings, including their prevalence in the data set, are presented in Table S1. Reference standard readers had access to the full medical history of patients, including prior or subsequent CT scans or chest radiographs. If any finding was present on an available CT scan or lateral projection chest radiograph, but not deemed visible on the frontal chest radiograph, the chest radiograph was classified as negative for that finding.

Clinical Radiology Reports
A physician with 1 year of clinical radiology training (L.L.P.), who was blinded to AI results, extracted labels from the unstructured prose radiology reports made by radiologists in clinical practice. When a report was deemed insufficient for extraction of labels (ie, if the report did not state the presence or absence of any chest radiograph findings but, for example, referenced more recent CT findings instead of interpreting the chest radiograph), the chest radiograph was excluded from diagnostic accuracy assessment of the clinical radiologist's report. Examinations that were reported as equivocal were labeled as positive.

AI Tools
Seven vendors with class IIA/IIB European conformity–marked (CE-marked) AI tools as of 2022 were invited to participate in the study. Four vendors agreed as follows: vendor A, Annalise Enterprise CXR (version 2.2; Annalise-AI); vendor B, SmartUrgences (version 1.24; Milvue); vendor C, ChestEye (version 2.6; Oxipit); and vendor D, AI-Rad Companion (version 10; Siemens Healthineers). AI tools are detailed in Table S2. De-identified frontal chest radiographs were processed by each AI tool to obtain a probability score for each target finding (score 0–1, with low values representing low probability of disease and vice versa). Binary diagnostic accuracy metrics were calculated using the manufacturer-specified probability thresholds. Three AI tools used a single threshold, while vendor B used both a high sensitivity threshold (hereafter, vendor B sens) and high specificity threshold (hereafter, vendor B spec). When not capable of processing a chest radiograph, the AI probability score was 0. Two AI tools (vendor A and vendor B) can evaluate lateral images in clinical use; however, only frontal chest radiographs were processed in this study. The vendor D tool does not classify pleural effusion on anteroposterior chest radiographs; therefore, only posteroanterior chest radiographs were included for this finding with this tool. None of the AI tools had been trained on data from any of the included hospitals.

Statistical Analysis
Continuous data are presented as medians with IQRs and categorical data are presented as numbers and percentages. For the primary aim, examination-level values for sensitivity, specificity, positive predictive value, and negative predictive value with 95% CIs were calculated using the binomial exact method. Comparisons of cross-tabulated frequencies were performed using the χ2 test for independent observations or the Fisher exact test when specifically stated. Individual AI tools were not statistically compared head to head but were instead grouped to assess any differences in performance across all tools. For this purpose, the McNemar test was used to compare sensitivity and specificity, and the χ2 test was used to compare positive predictive values and negative predictive values. Areas under the receiver operating characteristic curve for detection of target findings were calculated and compared using the DeLong method. For the secondary aim, the McNemar test was used to compare false-positive and false-negative rates between AI tools and radiology reports. With a sample size of at least 75 cases, an AI sensitivity or specificity of 85% ± 15 (SD) can be detected with a power of 0.9 and significance level of .05. P < .05 was considered indicative of a statistically significant difference. Statistical analyses were carried out by one author (L.L.P.) using R Software (version 3.6.1; The R Foundation [14]) with pROC, thresholdROC, tidyverse, and gtsummary packages.
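To illustrate the examination-level computations described above, the following R sketch uses the pROC package named in the text together with base R; the simulated data, object names, and the 0.5 operating threshold are placeholders (the manufacturer-specified thresholds are not reproduced here), so this is a minimal illustration under stated assumptions rather than the authors' analysis code.

library(pROC)

# Simulated stand-in data (not study data): one row per examination.
set.seed(1)
ref     <- rbinom(2000, 1, 0.2)                            # reference standard label (1 = finding present)
score_a <- plogis(qlogis(0.2) + 2.0 * ref + rnorm(2000))   # probability score from AI tool A
score_b <- plogis(qlogis(0.2) + 1.5 * ref + rnorm(2000))   # probability score from AI tool B

# Binarize tool A at an operating threshold (placeholder for a manufacturer-specified value).
thr_a  <- 0.5
pred_a <- as.integer(score_a >= thr_a)

# Sensitivity and specificity with exact binomial (Clopper-Pearson) 95% CIs.
tp <- sum(pred_a == 1 & ref == 1); fn <- sum(pred_a == 0 & ref == 1)
tn <- sum(pred_a == 0 & ref == 0); fp <- sum(pred_a == 1 & ref == 0)
sensitivity <- tp / (tp + fn); sens_ci <- binom.test(tp, tp + fn)$conf.int
specificity <- tn / (tn + fp); spec_ci <- binom.test(tn, tn + fp)$conf.int

# Areas under the receiver operating characteristic curve with DeLong CIs,
# and a paired DeLong comparison of two tools on the same examinations.
roc_a <- roc(response = ref, predictor = score_a, quiet = TRUE)
roc_b <- roc(response = ref, predictor = score_b, quiet = TRUE)
ci.auc(roc_a, method = "delong")
roc.test(roc_a, roc_b, method = "delong", paired = TRUE)

In the study itself, each tool was evaluated at its manufacturer-specified operating point, which the placeholder threshold above merely stands in for.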


Table 1: Characteristics of Patients with and without Target Findings on Chest Radiographs

Characteristic All Patients (n = 2040) Target Finding (n = 669) No Target Finding (n = 1371) P Value
Age (y)* 72 (58–81) 76 (66–84) 69 (55–79) <.001
Sex <.001
F 1033 (50.6) 300 (44.8) 733 (53.5)
M 1007 (49.4) 369 (55.2) 638 (46.5)
Radiographic projection <.001
Posteroanterior 1451 (71.1) 407 (60.8) 1044 (76.1)
Anteroposterior 476 (23.3) 228 (34.1) 248 (18.1)
Suboptimal posteroanterior 113 (5.5) 34 (5.1) 79 (5.8)
Referral site <.001
Emergency department 1056 (51.8) 356 (53.2) 700 (51.1)
Hospital ward or other 500 (24.5) 241 (36.0) 259 (18.9)
Outpatient 484 (23.7) 72 (10.8) 412 (30.1)
COPD 375 (18.4) 155 (23.2) 220 (16.0) <.001
Ischemic heart disease 331 (16.2) 119 (17.8) 212 (15.5) .24
Heart failure 250 (12.3) 129 (19.3) 121 (8.8) <.001
Current lung tumor 115 (5.6) 53 (7.9) 62 (4.5) .002
Previous lung surgery 130 (6.4) 79 (11.8) 51 (3.7) <.001
Previous heart surgery 144 (7.1) 46 (6.9) 98 (7.1) .83
Smoking history .06
Current 705 (34.6) 252 (37.7) 453 (33.0)
Unknown 496 (24.3) 141 (21.1) 355 (25.9)
Never 484 (23.7) 156 (23.3) 328 (23.9)
Former 355 (17.4) 120 (17.9) 235 (17.1)
Total no. of findings on chest radiograph <.001
None 461 (22.6) 0 (0.0) 461 (33.6)
One 357 (17.5) 56 (8.4) 301 (22.0)
Two or three 576 (28.2) 189 (28.3) 387 (28.2)
Four or more 646 (31.7) 424 (63.4) 222 (16.2)
Note.—Data are numbers of patients, with percentages in parentheses, for categorical data. Target findings on chest radiographs included
airspace disease, pneumothorax, and/or pleural effusion. Suboptimal posteroanterior images included those with any external objects,
incomplete inspiration, rotation, overexposure or underexposure, or other image quality issues. P values were calculated with the Wilcoxon
rank sum test (age) or χ2 test (others). COPD = chronic obstructive pulmonary disease.
* Data are medians, with IQRs in parentheses, for continuous data.

Results

Patient Characteristics and Examination Findings
A total of 2055 consecutive patients with chest radiographs were screened for inclusion, along with 44 patients with chest radiographs in the enrichment sample for pneumothorax (Fig 1). A total of 59 of 2099 patients (2.8%) were excluded due to insufficient lung visualization (n = 35), a missing DICOM image (n = 14), a chest radiograph from another hospital (n = 9), or duplicate inclusion (n = 1). The remaining 2040 patients were included in the analysis sample; of these, 669 (32.8%) had at least one target finding, while 1371 (67.2%) did not have any target findings. There were 461 of 2040 patients (22.6%) without any chest radiograph findings. Eight of 2040 patients (0.4%) had chest radiographs with no AI output from vendor A and two of 2040 (0.1%) had no output from vendor C.
Demographic information is presented in Table 1. The median age in the analysis sample was 72 years (IQR, 58–81 years), with 1033 female and 1007 male patients included. Prior or subsequent chest radiographs or chest CT scans were available for 1641 of 2040 (80.4%) and 1165 of 2040 (57.1%) patients, respectively. There were 1222 of 2040 patients (59.9%) with two or more findings and 646 of 2040 (31.7%) with four or more findings on chest radiographs. The radiographic projection was posteroanterior in 1564 of 2040 patients (76.7%) and anteroposterior in 476 of 2040 patients (23.3%). There were 113 of 1564 patients (7.2%) with posteroanterior chest radiographs labeled as suboptimal at reference standard assessment due to one or more quality issues that included external objects (36.3% [41 of 113]), an underexposed chest radiograph (32.7% [37 of 113]), rotation (19.5% [22 of 113]), incomplete inspiration (15.9% [18 of 113]), or other (8.8% [10 of 113]).


Figure 2: Diagnostic accuracy of four artificial intelligence (AI) tools for detection of airspace disease, pneumothorax, and pleural effusion as target findings. Top: Receiver operating characteristic curves show performance of the AI tools for detecting the target findings on chest radiographs. Bottom: Precision-recall curves show performance for the same target findings. Colored diamonds mark the operating point thresholds set by the manufacturer and used in this study, while white diamonds represent clinical radiology report performance (n = 2011). Two thoracic radiologists, or three in the case of disagreement, independently labeled all chest radiographs, and the reference standard was the consensus finding. ** = The vendor D AI tool does not detect pleural effusion on anteroposterior chest radiographs, thus the green line in these graphs represents posteroanterior chest radiographs only (n = 1564). PPV = positive predictive value.

Among the 393 chest radiographs on which airspace disease was identified at reference standard assessment, 74 (18.8%) were classified as diffuse, 146 (37.2%) as multifocal, 112 (28.5%) as unifocal, and 61 (15.5%) as unifocal and vague. Among the 78 chest radiographs on which pneumothorax was identified, 31 (39.7%) were large, 25 (32.1%) were moderate, and 22 (28.2%) were small. Among the 365 chest radiographs on which pleural effusions were identified, 36 (9.9%) were large, 81 (22.2%) were moderate, and 248 (67.9%) were small. Furthermore, an intercostal drainage tube was present in 29.5% (23 of 78) of patients with a pneumothorax finding and 2.7% (10 of 365) of patients with a pleural effusion finding on chest radiographs. Finally, pleural effusion or airspace disease were visible on only the lateral projection for 27 and seven examinations, respectively, and were thus counted as negative.

Diagnostic Accuracy of the AI Tools
Using the expert-labeled chest radiographs as the reference standard, the four AI tools demonstrated areas under the receiver operating characteristic curve ranging 0.83–0.88 (95% CI range: 0.81–0.90) for airspace disease, 0.89–0.97 (95% CI range: 0.84–1.00) for pneumothorax, and 0.94–0.97 (95% CI range: 0.93–0.98) for pleural effusion (Fig 2; Tables 2, S3). Sensitivities of the AI tools ranged 72%–91% (95% CI range: 67–94) for airspace disease, 63%–90% (95% CI range: 51–95) for pneumothorax, and 62%–95% (95% CI range: 57–97) for pleural effusion, while specificities ranged 62%–86% (95% CI range: 60–88), 98%–100% (95% CI range: 97–100), and 83%–97% (95% CI range: 82–98), respectively, for the target findings. Negative predictive values were high across all findings, ranging 92%–100% (95% CI range: 91–100), but positive predictive values were lower, especially for airspace disease (range, 37%–55%) but also for pneumothorax (range, 60%–86%) and pleural effusion (range, 56%–84%).


Table 2: Diagnostic Accuracy of the AI Tools for Airspace Disease, Pneumothorax, and Pleural Effusion

Finding and Metric Vendor A Vendor B (High Sensitivity Threshold) Vendor B (High Specificity Threshold) Vendor C Vendor D* Clinical Radiology Report† P Value‡
Airspace disease
Sensitivity (%) 72 (67, 76) 91 (88, 94) 81 (77, 85) 80 (75, 83) 79 (75, 83) 78 (74, 82) <.001
Specificity (%) 86 (84, 88) 62 (60, 65) 71 (69, 73) 76 (74, 78) 72 (70, 75) 88 (87, 90) <.001
PPV (%) 55 (51, 59) 37 (34, 40) 40 (37, 44) 45 (41, 48) 41 (37, 44) 62 (57, 66) <.001
NPV (%) 93 (91, 94) 97 (95, 98) 94 (93, 95) 94 (93, 95) 94 (92, 95) 94 (93, 96) <.001
AUC 0.88 0.85 0.85 0.86 0.83 NA <.001
(0.87, 0.90) (0.84, 0.87) (0.84, 0.87) (0.84, 0.88) (0.81, 0.85)
Pneumothorax
Sensitivity (%) 90 (80, 95) 73 (62, 82) 63 (51, 73) 78 (67, 86) 71 (59, 80) 85 (75, 92) <.001
Specificity (%) 98 (98, 99) 99 (98, 99) 100 (99, 100) 98 (97, 98) 98 (97, 99) 100 (100, 100) <.001
PPV (%) 67 (57, 76) 72 (61, 81) 86 (74, 93) 56 (46, 65) 60 (50, 70) 96 (87, 99) <.001
NPV (%) 100 (99, 100) 99 (98, 99) 99 (98, 99) 99 (99, 99) 99 (98, 99) 99 (99, 100) <.001
AUC 0.97 0.97 0.97 0.97 0.89 NA <.001
(0.94, 1) (0.96, 0.99) (0.96, 0.99) (0.94, 0.99) (0.84, 0.94)
Pleural effusion
Sensitivity (%) 95 (93, 97) 78 (73, 82) 62 (57, 67) 68 (63, 73) 80 (74, 85) 74 (70, 79) <.001
Specificity (%) 83 (82, 85) 92 (91, 94) 97 (96, 98) 97 (96, 98) 92 (90, 93) 96 (95, 97) <.001
PPV (%) 56 (52, 60) 69 (64, 73) 81 (76, 85) 84 (79, 88) 63 (57, 69) 79 (75, 84) <.001
NPV (%) 99 (98, 99) 95 (94, 96) 92 (91, 93) 93 (92, 95) 96 (95, 97) 94 (93, 95) <.001
AUC 0.96 0.94 0.94 0.97 0.94 NA <.001
(0.95, 0.98) (0.93, 0.95) (0.93, 0.95) (0.96, 0.98) (0.93, 0.96)
Note.—Data in parentheses are 95% CIs. Diagnostic accuracy measures of all AI tools and clinical radiology reports were compared with
expert-labeled chest radiographs as the reference standard in 2040 patients. Data used for calculating binary diagnostic accuracy metrics
are available in Table S3. AI = artificial intelligence, AUC = area under the receiver operating characteristic curve, PPV = positive predictive
value, NA = not applicable, NPV = negative predictive value.
* Vendor D does not classify pleural effusion on anteroposterior chest radiographs, so only posteroanterior chest radiographs were included
for evaluation of pleural effusion (n = 1564).

† Clinical reports were only included for chest radiographs in 2011 patients as 29 reports were deemed insufficient.
‡ P values are for any difference between the highest and lowest values across all AI tools (AUC, DeLong method; sensitivity and specificity, McNemar test; PPV and NPV, χ2 test).

The areas under the receiver operating characteristic curve, sensitivities, specificities, positive predictive values, and negative predictive values were different for similar target findings across the AI tools (P < .001), and a lower sensitivity corresponded directly to a higher specificity (Fig 2). No difference was observed in the mean sensitivity of all AI tools for pneumothorax detection on chest radiographs between the enrichment sample and the consecutive sample (77.9% vs 77.8%, P = > .99).

Diagnostic Performance for Target Findings Based on Size, Number of Findings, and Projection
Figures 3–5 illustrate AI and clinical radiology report performance in prespecified subgroups (full data are available in Tables S4 and S5). The range of sensitivities for AI tools for diffuse airspace disease were 92%–100% (95% CI range: 83–100) compared with 33%–61% (95% CI range: 22–73) for unifocal vague airspace disease (P < .001 for all AI tools). For pneumothorax, sensitivities for large versus small lesions were similar for vendor A at 97% (95% CI range: 81–100) versus 86% (95% CI range: 64–96) (P = .30), but lower for other vendors (range, 94%–100% [95% CI range: 77–100] vs 9%–59% [95% CI range: 2–79]; P < .001 for all). For pleural effusion, sensitivities for large versus small lesions were similar for vendor A at 94% (95% CI range: 80–99) versus 94% (95% CI range: 90–96) (P = > .99) but lower for other vendors (range, 81%–100% [95% CI range: 63–100] vs 56%–76% [95% CI range: 49–82]; P < .001 for all).
The specificity for target findings on chest radiographs with 0–1 findings compared with four or more findings was higher across all AI tools (P value range, .10 to < .001), except for vendor B with the high specificity threshold (vendor B spec) for pneumothorax (P = .17) (Fig 4). This was especially evident for airspace disease, where average AI tool specificity was 90.7% for chest radiographs with 0–1 findings versus 46.8% for those with 4 or more findings (P < .001). The specificity for airspace disease on posteroanterior chest radiographs compared with anteroposterior chest radiographs was also higher across all AI tools (P < .001 for all), with an average AI specificity of 77.8% versus 56.2%, respectively (P < .001) (Fig 5). For pneumothorax, this pattern was also seen for vendors A, B sens, B spec, and C (P < .001 for all) but not for vendor D (P = .30).


Figure 3: Sensitivity of artificial intelligence (AI) tools and clinical radiology reports stratified according to target finding. Top: Bar graphs show airspace disease findings
(n = 393), which were categorized as diffuse (n = 74), multifocal (n = 146), unifocal (n = 112), or unifocal vague (n = 61), for the AI tools and radiology reports, with the
lowest sensitivity values for unifocal vague findings (range, 33%–61%; P < .001 for all). Middle: Bar graphs show pneumothorax findings (n = 78), which were categorized
as large (n = 31), moderate (n = 25), or small (n = 22), for the AI tools and radiology reports, with a lower sensitivity for small findings (range, 9%–59%; P < .001), except
for that of vendor A. Bottom: Bar graphs show pleural effusion findings (n = 365), which were categorized as large (n = 36), moderate (n = 81), or small (n = 248), for the
AI tools and radiology reports, with a lower sensitivity for small findings (range, 56%–76%; P < .001), except for that of vendor A. Vendor B used both high sensitivity (Vendor
B Sens.) and high specificity (Vendor B Spec.) probability thresholds. Error bars represent 95% CIs on the sensitivity estimate. * = A statistically significant difference (P < .05)
is indicated with reference to the bar illustrating the highest sensitivity for the individual AI tool (not across different AI tools), as calculated using the Fisher exact test. ** = The
vendor D AI tool does not detect pleural effusion in anteroposterior chest radiographs, thus the graph illustrates results for posteroanterior only (n = 1564) and should not be
directly compared with other vendors. All data are provided in Table S4. NS = not significant.

For pleural effusion, vendors A and C had a higher specificity for posteroanterior compared with anteroposterior chest radiographs (86% and 98% vs 72% and 94%, respectively; P < .001 for both), while vendor B sens and vendor B spec showed no significant difference between posteroanterior and anteroposterior projections (93% and 97% vs 90% and 96%, respectively; P = .09 and P = .29) (Table S5). Vendor D was not designed to detect pleural effusion on anteroposterior images.

Comparison of AI Tools with Clinical Radiology Reports for Target Findings
Clinical radiology reports were deemed insufficient and, thus, excluded from the following analysis in 29 of 2040 patients (1.4%). There were 72 different report readers from a variety of radiology subspecialities who validated one or more chest radiographs, including five radiologists in training who together validated 14 chest radiographs in total (0.7% [14 of 2011]). No evidence of a difference was observed in the rate of airspace disease false-negative findings between the AI tools and the clinical radiology reports, except for when vendor B sens (false-negative rate, 9% vs 21.5%; P < .001) was used (Table 3). All AI tools had a higher false-positive rate for airspace disease (range, 13.7%–36.9%) compared with radiology reports (11.6%; P value range, < .001 to .01). For pneumothorax, no difference in the false-negative rate between the AI tools and radiology reports was found except for when vendor B spec was used, for which a higher false-negative rate was observed (37.3% vs 16.0%, P = .01).


Figure 4: Specificity of artificial intelligence (AI) tools and clinical radiology reports stratified according to the number of concurrent findings on chest radiographs. Top:
Bar graphs show airspace disease controls grouped into 0–1 (n = 772), 2–3 (n = 454), and 4 or more (n = 421) chest radiograph findings, with the lowest specificity values
in the 4 or more category (range, 27%–69%; P < .001 for all, compared with 0–1 findings). Middle: Bar graphs show pneumothorax controls grouped into 0–1 (n = 814),
2–3 (n = 548), and 4 or more (n = 600) chest radiograph findings, with lowest values in the 4 or more category (range, 96%–99%; P = .17 for vendor B spec; P value range,
.01 to < .001 for others). Bottom: Bar graphs show pleural effusion controls grouped into 0–1 (n = 812), 2–3 (n = 510), and 4 or more (n = 353) chest radiograph findings,
with the lowest values in the 4 or more category (range, 65%–92%; P < .001 for all). Vendor B used both high sensitivity (Vendor B Sens.) and high specificity (Vendor B Spec.)
probability thresholds. Error bars represent 95% CIs on the specificity estimate. * = A statistically significant difference (P < .05) is indicated with reference to the bar illustrating
the highest specificity for the individual AI tool (not across different AI tools), as calculated using the Fisher exact test. ** = The vendor D AI tool does not detect pleural effusion
in anteroposterior chest radiographs, thus the graph illustrates results for posteroanterior only (n = 1564) and should not be directly compared with other vendors. All data are
provided in Table S5. NS = not significant.


Figure 5: Specificity of artificial intelligence (AI) tools and clinical radiology reports stratified according to radiographic projection. Top: Bar graphs show airspace
disease controls grouped into anteroposterior (AP, n = 318) and posteroanterior (PA, n = 1329), with the lowest values in the anteroposterior projection (range, 42%–73%;
P < .001 for all, compared with posteroanterior). Middle: Bar graphs show pneumothorax controls grouped into anteroposterior (n = 466) and posteroanterior (n = 1496),
with the lowest values in the anteroposterior projection (range, 93%–99%; P = .30 for vendor D, P < .001 for others). Bottom: Bar graphs show pleural effusion controls
grouped into anteroposterior (n = 340) and posteroanterior (n = 1335), with the lowest values in the anteroposterior projection for vendors A and C (P < .001 for both) and
the proportion unchanged for vendor B at both the high sensitivity (Vendor B Sens.) and high specificity (Vendor B Spec.) thresholds (P = .09 and P = .29). Error bars represent
95% CIs on the specificity estimate. * = A statistically significant difference (P < .05) is indicated with reference to the bar illustrating the highest specificity for the individual AI tool
(not across different AI tools), as calculated using the Fisher exact test. ** = The vendor D AI tool does not detect pleural effusion in anteroposterior chest radiographs, thus the
graph illustrates results for posteroanterior only (n = 1564). Data used to generate this figure are provided in Table S5. NS = not significant.

Most AI tools had a higher false-positive rate compared with the radiology reports for pneumothorax (range, 1.1%–2.4% vs 0.2%; P < .001 for all), except for vendor B spec, which was 0.4% (P = .91). For pleural effusion, vendor A had a lower false-negative rate than the radiology reports (4.7% vs 27.5%, P < .001) and vendor B spec and vendor C had higher false-negative rates than the reports (31.7% and 38.0% vs 27.5%, P = .01 and P < .001).
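As a sketch of how one of these paired comparisons might be computed (an assumed workflow, not the authors' code), the R lines below cross-tabulate AI and report errors among reference-positive examinations and apply a McNemar test with a Bonferroni adjustment, matching the approach described in the note to Table 3; all object names and numbers are hypothetical placeholders.

# Hypothetical false-negative comparison for one finding and one AI tool.
# ai_miss and report_miss are paired per patient over the reference-positive examinations,
# TRUE when the AI tool or the clinical report missed the finding.
set.seed(2)
ai_miss     <- rbinom(360, 1, 0.25) == 1   # simulated stand-in values
report_miss <- rbinom(360, 1, 0.28) == 1

paired_tab <- table(AI = ai_miss, Report = report_miss)   # 2 x 2 paired contingency table
p_raw      <- mcnemar.test(paired_tab)$p.value

# One raw P value per AI output versus the report; Bonferroni correction applied per finding.
p_all_tools <- c(vendor_A = p_raw, vendor_B_spec = 0.004, vendor_B_sens = 0.08,
                 vendor_C = 0.90, vendor_D = 0.03)        # placeholder values for the other tools
p.adjust(p_all_tools, method = "bonferroni")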


Table 3: Performance of the AI Tools Compared with Corresponding Radiology Reports for Target Findings

Finding and Assessment Method False-Negative Rate False-Positive Rate P Value (False-Negative Findings) P Value (False-Positive Findings)
Airspace disease
Clinical report 84/390 (21.5) 188/1621 (11.6) Reference Reference
Vendor A 109/390 (27.9) 222/1621 (13.7) .12 .01
Vendor B spec 72/390 (18.5) 459/1621 (28.3) >.99 <.001
Vendor B sens 35/390 (9.0) 598/1621 (36.9) <.001 <.001
Vendor C 79/390 (20.3) 377/1621 (23.3) >.99 <.001
Vendor D 80/390 (20.5) 441/1621 (27.2) >.99 <.001
Pneumothorax
Clinical report 12/75 (16.0) 3/1936 (0.2) Reference Reference
Vendor A 8/75 (10.7) 33/1936 (1.7) >.99 <.001
Vendor B spec 28/75 (37.3) 8/1936 (0.4) .01 .91
Vendor B sens 21/75 (28.0) 22/1936 (1.1) .40 <.001
Vendor C 17/75 (22.7) 47/1936 (2.4) >.99 <.001
Vendor D 23/75 (30.7) 35/1936 (1.8) .19 <.001
Pleural effusion
Clinical report 100/360 (27.8) 70/1648 (4.2) Reference Reference
Vendor A 17/360 (4.7) 270/1648 (16.4) <.001 <.001
Vendor B spec 138/360 (38.3) 52/1648 (3.2) <.001 .45
Vendor B sens 82/360 (22.8) 127/1648 (7.7) .53 <.001
Vendor C 114/360 (31.7) 44/1648 (2.7) .01 <.001
Vendor D* 43/227 (18.9) 106/1315 (8.1) .07 <.001
Note.—Except where indicated, data are numbers of patients, with percentages in parentheses. Insufficient clinical reports are not included,
thus 2011 patients are included instead of 2040. Expert-labeled chest radiographs served as the reference standard for determining false-
negative and false-positive rates for both AI tools and clinical radiology reports. Vendor B used both high sensitivity (vendor B sens) and
high specificity (vendor B spec) probability thresholds. P values (McNemar test, Bonferroni corrected per finding) are provided for the
comparison of false-negative and false-positive rates between any AI tool and the clinical radiology report. AI = artificial intelligence.
* Comparison with clinical report is for posteroanterior chest radiographs only (n = 1542).

No differences were observed for pleural effusion false-negative rates between the radiology reports and either the vendor B sens or the vendor D AI tool (P = .53 and P = .07). Three AI tools had higher false-positive rates for pleural effusion than the radiology reports (range, 7.7%–16.4% vs 4.2%; P < .001 for all), one had a lower false-positive rate (2.7% vs 4.2%, P < .001), and one showed no difference (3.2% vs 4.2%, P = .45). Examples of chest radiographs incorrectly labeled by the AI tools are shown in Figure 6 and examples of chest radiographs correctly labeled by the AI tools are shown in Figure S4.

Discussion
This study tested the diagnostic accuracy of current commercially available artificial intelligence (AI) tools for identifying airspace disease, pneumothorax, and pleural effusion on chest radiographs in a real-life multicenter patient sample. The AI tools achieved moderate to high sensitivities ranging 62%–95% and excellent negative predictive values greater than 92%. The positive predictive values of AI tools were lower and showed more variation, ranging 37%–86%, most often with false-positive rates higher than the clinical radiology reports. Furthermore, we found that AI sensitivity generally was lower for smaller-sized target findings and that AI specificity generally was lower for anteroposterior chest radiographs and those with concurrent findings.
Previous studies have evaluated the diagnostic accuracy of these target findings using commercially available AI tools (15–23). Three of these studies used the vendor A tool, while the majority have used the Lunit INSIGHT AI tool, which was not tested here. There are currently no published studies on the target findings tested in this study with the vendor B, vendor C, or vendor D tools. For airspace disease, reported sensitivities and specificities have ranged 81%–92% and 67%–94%, respectively (15,20,21). Corresponding numbers have ranged 39%–99% and 92%–100% for pneumothorax (15–17,19,20,22) and 78%–89% and 94%–99% for pleural effusion (15,19,20). Notably, only one of these studies included an unselected consecutive sample (20), while three other consecutive studies were performed with a narrower scope of pneumothorax detection after lung biopsy (17), pneumonia detection in younger men presenting with febrile respiratory illness at a military hospital (21), or patients admitted for acute trauma (22). The sensitivities found in our study were comparable with those in previous studies, while specificities were in the lower range, possibly due to the consecutive data sample including heterogeneous patients from a real-life setting.
Among the AI tools examined in this study, we observed an appreciable difference in the balance between sensitivity and specificity for the individual tools, which seems unpredictable. Therefore, when implementing an AI tool, it seems crucial to understand the disease prevalence and severity at the site and to recognize that changing the AI tool threshold after implementation may be needed for the system to have the desired diagnostic ability.


Figure 6: Representative chest radiographs in six patients show (A, C, E) false-positive findings and (B, D, F) false-neg-
ative findings as identified by the artificial intelligence (AI) tools. In general, false-negative findings determined by the AI tools
were very subtle representations of disease, while false-positive findings were misinterpretations. These examples were all
correctly classified by the clinical radiology reports. (A) Posteroanterior chest radiograph in a 71-year-old male patient who
underwent examination due to progression of dyspnea shows bilateral fibrosis (arrows), which was misclassified as airspace
disease by all four AI tools. (B) Posteroanterior chest radiograph in a 31-year-old female patient referred for radiography due
to month-long coughing shows subtle airspace opacity at the right cardiac border (arrows), which was missed by all AI tools.
(C) Anteroposterior chest radiograph in a 78-year-old male patient referred after placement of a central venous
catheter shows a skin fold on the right side (arrow), which was misclassified as pneumothorax by all AI tools. (D) Pos-
teroanterior chest radiograph in a 78-year-old male patient referred to rule out pneumothorax shows very subtle
apical right-sided pneumothorax (arrows), which was missed by all AI tools except for vendor B (with the high sen-
sitivity threshold). (E) Posteroanterior chest radiograph in a 72-year-old male patient referred for radiography with-
out a specified reason shows chronic rounding of the costophrenic angle (arrow), which was mistaken for pleu-
ral effusion by all AI tools; the absence of effusion was verified according to the reference standard on a corresponding chest CT image.
(F) Anteroposterior chest radiograph in a 76-year-old female patient referred for radiography due to suspicion of congestion
and/or pneumonia shows a very subtle left-sided pleural effusion (arrow), which was missed by all three AI tools that were
capable of analyzing anteroposterior chest radiographs for pleural effusion.


Furthermore, the low sensitivity observed for several AI tools in our study suggests that, like clinical radiologists, the performance of AI tools decreases for more subtle findings on chest radiographs. This has been observed previously in studies using a single algorithm for pneumothorax (16), lung nodules, and pneumonia, where there are overlapping structures and/or a small lesion size (7,10).
We further found that for anteroposterior chest radiographs and chest radiographs with multiple findings, the specificity of AI tools for airspace disease and pleural effusion decreased compared with posteroanterior chest radiographs and chest radiographs with a single finding. This effect was most pronounced for airspace disease, which is unsurprising as airspace disease can resemble other chest radiograph findings, but we also observed the effect for pneumothorax and pleural effusion, which have clearer imaging definitions. Ahn et al (15) reported on the performance of the Lunit INSIGHT AI tool and found, similar to our study, that the specificity for pneumonia was 85% in patients without extra findings on chest radiographs and 51% in patients with concurrent findings. Together these findings suggest that radiologists should be aware of these limitations, regarding both sensitivity and specificity, and should not overconfidently trust the systems in these difficult cases. However, it should be stated that many mistakes made by AI tools would also be difficult or even impossible for a human reader to detect without access to additional imaging and patient history. To overcome this limitation, next-generation AI tools should strive to incorporate comparisons with previous medical imaging, which is currently being explored (24).
Our study had several limitations. First, although a consecutive sample was used, this sample may lack generalizability beyond a hospital-based setting due to the high median age in the sample and high prevalence of patients with multiple findings on chest radiographs. Second, the definitions of disease used for our reference standard may align differently with the definitions used for AI training, thereby possibly favoring one AI tool over another. Third, AI tools were compared with clinical radiology reports that were generated by radiologists who had access to lateral chest radiographs, clinical information, and prior imaging, whereas the AI tools did not, which gives the radiologists an "unfair advantage." Additionally, clinical radiologist accuracy for pneumothorax is inflated due to the enrichment inclusion method for examinations with these findings, which were identified using the same radiology reports included in our analysis. Fourth, analyses for this study were performed at the examination level and, therefore, the AI tools and reference standard experts could have made decisions based on differing pixels in the chest radiographs. This will give an advantage to less specific AI tools because a false-positive finding can be counted as a true-positive and hence inflate the AI performance. However, due to the high specificity of AI tools for pneumothorax and pleural effusion, this may only be relevant for airspace disease detection. Fifth, no lateral chest radiographs were used as input to any of the AI tools, thus it is unknown whether the two AI vendors with lateral image processing capacity could have had a slightly higher performance. Finally, this was a retrospective study of the standalone performance of AI tools, although the AI tools are clinically approved for concurrent reading supporting a human reader and should ideally be prospectively evaluated in that setting; however, this is not feasible when testing multiple AI tools.
In conclusion, current-generation artificial intelligence (AI) tools showed moderate to high sensitivity for detecting airspace disease, pneumothorax, and pleural effusion on chest radiographs. However, they produced more false-positive results than radiology reports and their performance decreased for smaller-sized target findings, chest radiographs with multiple findings, and chest radiographs with anteroposterior radiographic projection. Future studies could focus on prospective assessment of the clinical consequence of using AI for chest radiography on patient-related outcomes.

Author contributions: Guarantors of integrity of entire study, L.L.P., F.C.M., M.B., M.B.A.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, L.L.P.; clinical studies, L.L.P., M.W.B., L.C.L., F.R., M.B., M.B.A.; experimental studies, L.L.P.; statistical analysis, L.L.P., F.C.M., M.B.A.; and manuscript editing, all authors

Disclosures of conflicts of interest: L.L.P. Lecture payment from Siemens Healthineers. F.C.M. Institutional research grants from Siemens Healthineers and Innovation Fund Denmark; lecture payment from Siemens Healthineers. M.W.B. No relevant relationships. L.C.L. No relevant relationships. F.R. No relevant relationships. O.W.N. Lecture payments from Roche, Orion, Pharmacosmos, and Novartis; stock options in Bavarian Nordic and Merck; currently employed by Novo Nordisk. M.B. No relevant relationships. M.B.A. Lecture payments from Philips Healthcare, Siemens Healthineers, Boehringer Ingelheim, and Roche.

References
1. Raoof S, Feigin D, Sung A, Raoof S, Irugulpati L, Rosenow EC 3rd. Interpretation of plain chest roentgenogram. Chest 2012;141(2):545–558.
2. Eng J, Mysko WK, Weller GER, et al. Interpretation of Emergency Department radiographs: a comparison of emergency medicine physicians with radiologists, residents with faculty, and film with digital display. AJR Am J Roentgenol 2000;175(5):1233–1238.
3. Gatt ME, Spectre G, Paltiel O, Hiller N, Stalnikowicz R. Chest radiographs in the emergency department: is the radiologist really necessary? Postgrad Med J 2003;79(930):214–217.
4. Çallı E, Sogancioglu E, van Ginneken B, van Leeuwen KG, Murphy K. Deep learning for chest X-ray analysis: A survey. Med Image Anal 2021;72:102125.
5. van Leeuwen KG, Schalekamp S, Rutten MJCM, van Ginneken B, de Rooij M. Artificial intelligence in radiology: 100 commercially available products and their scientific evidence. Eur Radiol 2021;31(6):3797–3804.
6. Li D, Pehrson LM, Lauridsen CA, et al. The Added Effect of Artificial Intelligence on Physicians' Performance in Detecting Thoracic Pathologies on CT and Chest X-ray: A Systematic Review. Diagnostics (Basel) 2021;11(12):2206.
7. Kim C, Yang Z, Park SH, et al. Multicentre external validation of a commercial artificial intelligence software to analyse chest radiographs in health screening environments with low disease prevalence. Eur Radiol 2023;33(5):3501–3509.
8. Voter AF, Larson ME, Garrett JW, Yu JJ. Diagnostic Accuracy and Failure Mode Analysis of a Deep Learning Algorithm for the Detection of Cervical Spine Fractures. AJNR Am J Neuroradiol 2021;42(8):1550–1556.
9. Oakden-Rayner L, Gale W, Bonham TA, et al. Validation and algorithmic audit of a deep learning system for the detection of proximal femoral fractures in patients in the emergency department: a diagnostic accuracy study. Lancet Digit Health 2022;4(5):e351–e358.
10. Sun J, Peng L, Li T, et al. Performance of a Chest Radiograph AI Diagnostic Tool for COVID-19: A Prospective Observational Study. Radiol Artif Intell 2022;4(4):e210217.
11. Park SH. Diagnostic Case-Control versus Diagnostic Cohort Studies for Clinical Validation of Artificial Intelligence Algorithm Performance. Radiology 2019;290(1):272–273.
12. AI Central. Data Science Institute, American College of Radiology. https://aicentral.acrdsi.org/. Accessed May 1, 2023.


13. Bossuyt PM, Reitsma JB, Bruns DE, et al. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies. BMJ 2015;351:h5527.
14. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2022.
15. Ahn JS, Ebrahimian S, McDermott S, et al. Association of Artificial Intelligence-Aided Chest Radiograph Interpretation With Reader Performance and Efficiency. JAMA Netw Open 2022;5(8):e2229289.
16. Hillis JM, Bizzo BC, Mercaldo S, et al. Evaluation of an Artificial Intelligence Model for Detection of Pneumothorax and Tension Pneumothorax in Chest Radiographs. JAMA Netw Open 2022;5(12):e2247172.
17. Hong W, Hwang EJ, Lee JH, Park J, Goo JM, Park CM. Deep Learning for Detecting Pneumothorax on Chest Radiographs after Needle Biopsy: Clinical Implementation. Radiology 2022;303(2):433–441.
18. Nam JG, Kim M, Park J, et al. Development and validation of a deep learning algorithm detecting 10 common abnormalities on chest radiographs. Eur Respir J 2021;57(5):2003061.
19. Seah JCY, Tang CHM, Buchlak QD, et al. Effect of a comprehensive deep-learning model on the accuracy of chest x-ray interpretation by radiologists: a retrospective, multireader multicase study. Lancet Digit Health 2021;3(8):e496–e506.
20. van Beek EJR, Ahn JS, Kim MJ, Murchison JT. Validation study of machine-learning chest radiograph software in primary and emergency medicine. Clin Radiol 2023;78(1):1–7.
21. Kim JH, Kim JY, Kim GH, et al. Clinical Validation of a Deep Learning Algorithm for Detection of Pneumonia on Chest Radiographs in Emergency Department Patients with Acute Febrile Respiratory Illness. J Clin Med 2020;9(6):1981.
22. Gipson J, Tang V, Seah J, et al. Diagnostic accuracy of a commercially available deep-learning algorithm in supine chest radiographs following trauma. Br J Radiol 2022;95(1134):20210979.
23. Choi SY, Park S, Kim M, Park J, Choi YR, Jin KN. Evaluation of a deep learning-based computer-aided detection algorithm on chest radiographs: Case-control study. Medicine (Baltimore) 2021;100(16):e25663.
24. Bannur S, Hyland S, Liu Q, et al. Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing. arXiv 2301.04558 [preprint] https://arxiv.org/abs/2301.04558. Posted January 11, 2023. Accessed April 13, 2023.
