Chestx-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks On Weakly-Supervised Classification and Localization of Common Thorax Diseases

This document presents a new chest X-ray database called ChestX-ray8 containing over 100,000 X-ray images from over 30,000 patients. The images have been labeled for the presence of 8 common thoracic diseases by analyzing associated radiology reports using natural language processing. A weakly-supervised framework is proposed to classify and localize these diseases using only the image-level labels, demonstrating the potential of deep learning approaches for medical image analysis using large hospital databases of images and reports.


ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on

Weakly-Supervised Classification and Localization of Common Thorax Diseases

Xiaosong Wang¹, Yifan Peng², Le Lu¹, Zhiyong Lu², Mohammadhadi Bagheri¹, Ronald M. Summers¹

¹ Department of Radiology and Imaging Sciences, Clinical Center,
² National Center for Biotechnology Information, National Library of Medicine,
National Institutes of Health, Bethesda, MD 20892

arXiv:1705.02315v5 [cs.CV] 14 Dec 2017

{xiaosong.wang,yifan.peng,le.lu,luzh,mohammad.bagheri,rms}@nih.gov

Abstract

The chest X-ray is one of the most commonly accessible radiological examinations for screening and diagnosis of many lung diseases. A tremendous number of X-ray imaging studies accompanied by radiological reports are accumulated and stored in many modern hospitals' Picture Archiving and Communication Systems (PACS). On the other hand, it is still an open question how this type of hospital-size knowledge database containing invaluable imaging informatics (i.e., loosely labeled) can be used to facilitate the data-hungry deep learning paradigms in building truly large-scale high precision computer-aided diagnosis (CAD) systems.

Figure 1. Eight common thoracic diseases observed in chest X-rays that validate a challenging task of fully-automated diagnosis.
In this paper, we present a new chest X-ray database, namely "ChestX-ray8", which comprises 108,948 frontal-view X-ray images of 32,717 unique patients with the text-mined eight disease image labels (where each image can have multi-labels), from the associated radiological reports using natural language processing. Importantly, we demonstrate that these commonly occurring thoracic diseases can be detected and even spatially-located via a unified weakly-supervised multi-label image classification and disease localization framework, which is validated using our proposed dataset. Although the initial quantitative results are promising as reported, deep convolutional neural network based "reading chest X-rays" (i.e., recognizing and locating the common disease patterns trained with only image-level labels) remains a strenuous task for fully-automated high precision CAD systems.

1 Introduction

Rapid and tremendous progress has been evidenced in a range of computer vision problems via deep learning and large-scale annotated image datasets [26, 38, 13, 28]. Drastically improved quantitative performances in object recognition, detection and segmentation are demonstrated in comparison to previous shallow methodologies built upon hand-crafted image features. Deep neural network representations further make joint language and vision learning tasks more feasible to solve, in image captioning [49, 24, 33, 48, 23], visual question answering [2, 46, 51, 55], knowledge-guided transfer learning [4, 34], and so on. However, the intriguing and strongly observable performance gaps of the current state-of-the-art object detection and segmentation methods, evaluated between PASCAL VOC [13] and Microsoft (MS) COCO [28], demonstrate that there is still significant room for performance improvement when the underlying challenges (represented by different datasets) become greater. For example, MS COCO is composed of 80 object categories from 200k images, with 1.2M instances (350k are people), where every instance is segmented and many instances are small objects. Compared to PASCAL VOC, with only 20 classes and 11,530 images containing 27,450 objects annotated with bounding boxes (BBox), the top competing object detection approaches achieve 0.413 mean Average Precision (mAP) on MS COCO versus 0.884 on PASCAL VOC.

Deep learning yields similar rises in performance in the medical image analysis domain for object (often human anatomical or pathological structures in radiology imaging)
detection and segmentation tasks. Recent notable work includes (but is not limited to) an overview review of the future promise of deep learning [14] and a collection of important medical applications on lymph node and interstitial lung disease detection and classification [37, 43]; cerebral microbleed detection [11]; pulmonary nodule detection in CT images [40]; automated pancreas segmentation [36]; cell image segmentation and tracking [35]; predicting spinal radiological scores [21]; and extensions to multi-modal imaging segmentation [30, 16]. The main limitation is that all of the proposed methods are evaluated on small-to-middle scale problems of (at most) several hundred patients. It remains unclear how well the current deep learning techniques will scale up to tens of thousands of patient studies.

In the era of deep learning in computer vision, research efforts on building various annotated image datasets [38, 13, 28, 2, 33, 55, 23, 25] with different characteristics play indispensably important roles in the better definition of forthcoming problems, challenges and, subsequently, possible technological progress. Particularly, here we focus on the relationship and joint learning of image (chest X-rays) and text (X-ray reports). The previous representative image caption generation works [49, 24] utilize the Flickr8K, Flickr30K [53] and MS COCO [28] datasets, which hold 8,000, 31,000 and 123,000 images respectively, where every image is annotated with five sentences via Amazon Mechanical Turk (AMT). The text generally describes the annotator's attention to objects and activity occurring in an image in a straightforward manner. Region-level ImageNet pre-trained convolutional neural network (CNN) based detectors are used to parse an input image and output a list of attributes or "visually-grounded high-level concepts" (including objects, actions, scenes and so on) in [24, 51]. Visual question answering (VQA) requires more detailed parsing and complex reasoning on the image contents to answer the paired natural language questions. A new dataset containing 250k natural images, 760k questions and 10M text answers [2] is provided to address this new challenge. Additionally, databases such as "Flickr30k Entities" [33], "Visual7W" [55] and "Visual Genome" [25, 23] (as detailed as 94,000 images and 4,100,000 region-grounded captions) are introduced to construct and learn the spatially-dense and increasingly difficult semantic links between textual descriptions and image regions through object-level grounding.

Though one could argue that a high-level analogy exists between image caption generation, visual question answering and imaging based disease diagnosis [42, 41], there are three factors making truly large-scale medical image based diagnosis (e.g., involving tens of thousands of patients) tremendously more formidable. 1) Generic, open-ended image-level anatomy and pathology labels cannot be obtained through crowd-sourcing, such as AMT, which is prohibitively implausible for non-medically trained annotators. Therefore we resort to mining the per-image (possibly multiple) common thoracic pathology labels from the image-attached chest X-ray radiological reports using Natural Language Processing (NLP) techniques. Radiologists tend to write more abstract and complex logical reasoning sentences than the plain descriptive texts in [53, 28]. 2) The spatial dimensions of a chest X-ray are usually 2000×3000 pixels. Local pathological image regions can show hugely varying sizes or extents, but are often very small compared to the full image scale. Fig. 1 shows eight illustrative examples, and the actual pathological findings are often significantly smaller (thus harder to detect). Fully dense annotation of region-level bounding boxes (for grounding the pathological findings) would normally be needed in computer vision datasets [33, 55, 25], but may be completely nonviable for the time being. Consequently, we formulate and verify a weakly-supervised multi-label image classification and disease localization framework to address this difficulty. 3) So far, all image captioning and VQA techniques in computer vision strongly depend on ImageNet pre-trained deep CNN models, which already perform very well on a large number of object classes and serve as a good baseline for further model fine-tuning. However, this situation does not apply to the medical image diagnosis domain. Thus we have to learn the deep image recognition and localization models while constructing the weakly-labeled medical image database.

To tackle these issues, we propose a new chest X-ray database, namely "ChestX-ray8", which comprises 108,948 frontal-view X-ray images of 32,717 unique patients (collected from the year of 1992 to 2015), with the eight common disease labels text-mined from the radiological reports via NLP techniques. In particular, we demonstrate that these commonly occurring thoracic diseases can be detected and even spatially-located via a unified weakly-supervised multi-label image classification and disease localization formulation. Our initial quantitative results are promising. However, developing fully-automated deep learning based "reading chest X-rays" systems is still an arduous journey. Details of accessing the ChestX-ray8 dataset can be found via the website¹.

¹ https://nihcc.app.box.com/v/ChestXray-NIHCC; more details: https://www.cc.nih.gov/drd/summers.html

1.1 Related Work

There have been recent efforts on creating openly available annotated medical image databases [50, 52, 37, 36], with studied patient numbers ranging from a few hundred to two thousand. Particularly for chest X-rays, the largest public dataset is OpenI [1], which contains 3,955 radiology reports from the Indiana Network for Patient Care and 7,470 associated chest X-rays from the hospitals' picture archiving and communication system (PACS). This database is utilized in [42] as a problem of caption generation, but
no quantitative disease detection results are reported. Our newly proposed chest X-ray database is at least one order of magnitude larger than OpenI [1] (refer to Table 1). To achieve better clinical relevance, we focus on the quantitative performance of weakly-supervised multi-label image classification and disease localization of common thoracic diseases, in analogy to the intermediate step of "detecting attributes" in [51] or "visual grounding" in [33, 55, 23].
2 Construction of Hospital-scale Chest X-ray Database

In this section, we describe the approach for building a hospital-scale chest X-ray image database, namely "ChestX-ray8", mined from our institute's PACS system. First, we short-list eight common thoracic pathology keywords that are frequently observed and diagnosed, i.e., Atelectasis, Cardiomegaly, Effusion, Infiltration, Mass, Nodule, Pneumonia and Pneumothorax (Fig. 1), based on radiologists' feedback. Given those 8 text keywords, we search the PACS system to pull out all the related radiological reports (together with images) as our target corpus. A variety of Natural Language Processing (NLP) techniques are adopted for detecting the pathology keywords and removing negation and uncertainty. Each radiological report is either linked with one or more keywords or marked with "Normal" as the background category. As a result, the ChestX-ray8 database is composed of 108,948 frontal-view X-ray images (from 32,717 patients), and each image is labeled with one or multiple pathology keywords, or "Normal" otherwise. Fig. 2 illustrates the correlation of the resulting keywords. It reveals some connections between different pathologies that agree with radiologists' domain knowledge, e.g., Infiltration is often associated with Atelectasis and Effusion. To some extent, this is similar to understanding the interactions and relationships among objects or concepts in natural images [25].

Figure 2. The circular diagram shows the proportions of images with multi-labels in each of the 8 pathology classes and the labels' co-occurrence statistics.

2.1 Labeling Disease Names by Text Mining

Overall, our approach produces labels using the reports in two passes. In the first pass, we detect all disease concepts in the corpus. The main body of each chest X-ray report is generally structured into "Comparison", "Indication", "Findings", and "Impression" sections. Here, we focus on detecting disease concepts in the Findings and Impression sections. If a report contains neither of these two sections, the full-length report is considered. In the second pass, we code the reports as "Normal" if they do not contain any diseases (not limited to the 8 predefined pathologies).

Pathology Detection: We mine the radiology reports for disease concepts using two tools, DNorm [27] and MetaMap [3]. DNorm is a machine learning method for disease recognition and normalization. It maps every mention of a keyword in a report to a unique concept ID in the Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT), which is a standardized vocabulary of clinical terminology for the electronic exchange of clinical health information.

MetaMap is another prominent tool for detecting bio-concepts in biomedical text corpora. Different from DNorm, it is an ontology-based approach for the detection of Unified Medical Language System® (UMLS®) Metathesaurus concepts. In this work, we only consider the semantic types Diseases or Syndromes and Findings (namely 'dsyn' and 'fndg', respectively). To maximize the recall of our automatic disease detection, we merge the results of DNorm and MetaMap. Table 1 (in the supplementary material) shows the corresponding SNOMED-CT concepts that are relevant to the eight target diseases (these mappings are developed by searching the disease names in the UMLS® terminology service², and verified by a board-certified radiologist).

² https://uts.nlm.nih.gov/metathesaurus.html

Negation and Uncertainty: The disease detection algorithm locates every keyword mentioned in a radiology report, no matter whether it is truly present or negated. To eliminate noisy labeling, we need to rule out those negated pathological statements and, more importantly, uncertain mentions of findings and diseases, e.g., "suggesting obstructive lung disease".

Although many text processing systems (such as [6]) can handle the negation/uncertainty detection problem, most of them apply regular expressions to the text directly. One of the disadvantages of using regular expressions for negation/uncertainty detection is that they cannot capture various syntactic constructions involving multiple subjects. For example, in the phrase "clear of A and B", the regular expression can capture "A" as a negation but not "B", particularly when both "A" and "B" are long and complex noun phrases ("clear of focal airspace disease, pneumothorax, or pleural effusion" in Fig. 3).

Figure 3. The dependency graph of the text "clear of focal airspace disease, pneumothorax, or pleural effusion": 'prep_of' and 'conj_or' edges, with the 'prep_of (CCProcessed)' dependencies propagated to each conjunct.

To overcome this complication, we hand-craft a number of novel negation/uncertainty rules defined at the syntactic level in this work. More specifically, we utilize syntactic dependency information because it is close to the semantic relationships between words, and has thus become prevalent in biomedical text processing. We define our rules on the dependency graph, utilizing the dependency label and direction information between words.

As the first step of preprocessing, we split and tokenize the reports into sentences using NLTK [5]. Next, we parse each sentence with the Bllip parser [7] using David McClosky's biomedical model [29]. The syntactic dependencies are then obtained from the "CCProcessed" dependencies output by applying the Stanford dependencies converter [8] to the parse tree. The "CCProcessed" representation propagates conjunct dependencies and thus simplifies coordinations. As a result, we can use fewer rules to match more complex constructions. For the example shown in Fig. 3, we could use a single rule, "clear → prep_of → DISEASE", to detect three negations from the text: ⟨neg, focal airspace disease⟩, ⟨neg, pneumothorax⟩, and ⟨neg, pleural effusion⟩.

Furthermore, we label a radiology report as "normal" if it meets one of the following criteria:

• If there is no disease detected in the report. Note that here we consider not only the 8 diseases of interest in this paper, but all diseases detected in the reports.

• If the report contains the text-mined concepts of "normal" or "normal size" (CUIs C0205307 and C0332506 in the SNOMED-CT concepts, respectively).

2.2 Quality Control on Disease Labeling

To validate our method, we perform the following experiments. Given the fact that no gold-standard labels exist for our dataset, we resort to existing annotated corpora as an alternative. Using the OpenI API [1], we retrieve a total of 3,851 unique radiology reports where each OpenI report is assigned its key findings/disease names by human annotators [9]. Given our focus on the eight diseases, a subset of OpenI reports and their human annotations are used as the gold standard for evaluating our method. Table 1 summarizes the statistics of this subset of OpenI [1, 20] reports. Table 2 shows the results of our method on OpenI, measured in precision (P), recall (R), and F1-score. A higher precision of 0.90, higher recall of 0.91, and higher F1-score of 0.90 are achieved compared to the existing MetaMap approach (with NegEx enabled). For all diseases, our method obtains higher precision, particularly for "pneumothorax" (0.90 vs. 0.32) and "infiltration" (0.74 vs. 0.25). This indicates that the use of negation and uncertainty detection at the syntactic level successfully removes false positive cases. More importantly, the higher precision meets our expectation of generating a chest X-ray corpus with accurate semantic labels, laying a solid foundation for the later processes.

Item           OpenI #   OpenI Ov.   ChestX-ray8 #   ChestX-ray8 Ov.
Report         2,435     -           108,948         -
Annotations    2,435     -           -               -
Atelectasis    315       122         5,789           3,286
Cardiomegaly   345       100         1,010           475
Effusion       153       94          6,331           4,017
Infiltration   60        45          10,317          4,698
Mass           15        4           6,046           3,432
Nodule         106       18          1,971           1,041
Pneumonia      40        15          1,062           703
Pneumothorax   22        11          2,793           1,403
Normal         1,379     0           84,312          0

Table 1. Total number (#) and number of overlapping (Ov.) items of the corpus in both the OpenI and ChestX-ray8 datasets.

Disease        MetaMap (P / R / F)    Our Method (P / R / F)
Atelectasis    0.95 / 0.95 / 0.95     0.99 / 0.85 / 0.91
Cardiomegaly   0.99 / 0.83 / 0.90     1.00 / 0.79 / 0.88
Effusion       0.74 / 0.90 / 0.81     0.93 / 0.82 / 0.87
Infiltration   0.25 / 0.98 / 0.39     0.74 / 0.87 / 0.80
Mass           0.59 / 0.67 / 0.62     0.75 / 0.40 / 0.52
Nodule         0.95 / 0.65 / 0.77     0.96 / 0.62 / 0.75
Normal         0.93 / 0.90 / 0.91     0.87 / 0.99 / 0.93
Pneumonia      0.58 / 0.93 / 0.71     0.66 / 0.93 / 0.77
Pneumothorax   0.32 / 0.82 / 0.46     0.90 / 0.82 / 0.86
Total          0.84 / 0.88 / 0.86     0.90 / 0.91 / 0.90

Table 2. Evaluation of image labeling results on the OpenI dataset. Performance is reported using P, R, F1-score.

2.3 Processing Chest X-ray Images

Compared to the popular ImageNet classification problem, the significantly smaller spatial extents of many diseases inside the typical X-ray image dimensions of 3000 × 2000 pixels impose challenges on both the capacity of the computing hardware and the design of the deep learning paradigm. In ChestX-ray8, X-ray images are directly extracted from the DICOM files and resized to 1024 × 1024 bitmap images without significantly losing detail content, compared with the image sizes of 512 × 512 in the OpenI dataset. Their intensity ranges are rescaled using the default window settings stored in the DICOM header files.
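The window-based intensity rescaling above can be sketched as a simplified linear windowing function in plain Python. This is only an illustrative approximation (the full DICOM VOI LUT formula adds small offsets such as c − 0.5 and w − 1, and the window values below are hypothetical, not taken from the dataset):

```python
def window_rescale(pixels, center, width, out_max=255):
    """Simplified linear windowing: map raw pixel values inside the
    window [center - width/2, center + width/2] onto [0, out_max],
    clipping everything outside the window."""
    lo = center - width / 2.0
    out = []
    for p in pixels:
        v = (p - lo) / float(width) * out_max       # linear ramp inside the window
        out.append(max(0, min(out_max, round(v))))  # clip to the display range
    return out

# hypothetical lung window: center 600, width 1200 (real values come from the header)
print(window_rescale([0, 600, 1200], center=600, width=1200))  # -> [0, 128, 255]
```

A real pipeline would read the window center/width from the DICOM header before resizing the image to 1024 × 1024.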
2.4 Bounding Box for Pathologies

As part of the ChestX-ray8 database, a small number of images with pathology are provided with hand-labeled bounding boxes (B-Boxes), which can be used as the ground truth to evaluate the disease localization performance. Furthermore, they could also be adopted for a one/low-shot learning setup [15], in which only one or several samples are needed to initialize the learning, and the system then evolves by itself with more unlabeled data. We leave this as future work.

In our labeling process, we first select 200 instances for each pathology (1,600 instances in total), consisting of 983 images. Given an image and a disease keyword, a board-certified radiologist identified only the corresponding disease instance in the image and labeled it with a B-Box. Each B-Box is then output as an XML file. If one image contains multiple disease instances, each disease instance is labeled separately and stored in an individual XML file. As an application of the proposed ChestX-ray8 database and benchmarking, we demonstrate the detection and localization of thoracic diseases in the following.

3 Common Thoracic Disease Detection and Localization

Reading and diagnosing chest X-ray images may be an entry-level task for radiologists but, in fact, it is a complex reasoning problem which often requires careful observation and good knowledge of anatomical principles, physiology and pathology. Such factors increase the difficulty of developing a consistent and automated technique for reading chest X-ray images while simultaneously considering all common thoracic diseases.

As the main application of the ChestX-ray8 dataset, we present a unified weakly-supervised multi-label image classification and pathology localization framework, which can detect the presence of multiple pathologies and subsequently generate bounding boxes around the corresponding pathologies. In detail, we tailor Deep Convolutional Neural Network (DCNN) architectures for weakly-supervised object localization by considering the large image capacity, various multi-label CNN losses and different pooling strategies.

3.1 Unified DCNN Framework

Our goal is to first detect whether one or multiple pathologies are present in each X-ray image, and later to locate them using the activations and weights extracted from the network. We tackle this problem by training a multi-label DCNN classification model. Fig. 4 illustrates the DCNN architecture we adapted, which is similar to several previous weakly-supervised object localization methods [31, 54, 12, 19]. As shown in Fig. 4, we perform network surgery on the pre-trained models (using ImageNet [10, 39]), e.g., AlexNet [26], GoogLeNet [45], VGGNet-16 [44] and ResNet-50 [17], by leaving out the fully-connected layers and the final classification layers. Instead, we insert a transition layer, a global pooling layer, a prediction layer and a loss layer at the end (after the last convolutional layer). In a similar fashion as described in [54], a combination of the deep activations from the transition layer (a set of spatial image features) and the weights of the prediction inner-product layer (trained feature weighting) enables us to find the plausible spatial locations of diseases.

Multi-label Setup: There are several options for the image-label representation and the choice of multi-label classification loss function. Here, we define an 8-dimensional label vector y = [y1, ..., yc, ..., yC], yc ∈ {0, 1}, C = 8, for each image. yc indicates the presence of the corresponding pathology in the image, while an all-zero vector [0, 0, 0, 0, 0, 0, 0, 0] represents the "Normal" status (no pathology is found within the scope of any of the 8 disease categories listed). This definition transits the multi-label classification problem into a regression-like loss setting.

Transition Layer: Due to the large variety of pre-trained DCNN architectures we adopt, a transition layer is usually required to transform the activations from previous layers into a uniform output dimension of S × S × D, S ∈ {8, 16, 32}. D represents the dimension of the features at spatial location (i, j), i, j ∈ {1, ..., S}, which can vary in different model settings, e.g., D = 1024 for GoogLeNet and D = 2048 for ResNet. The transition layer helps pass down the weights from the pre-trained DCNN models in a standard form, which is critical for using this layer's activations to further generate the heatmap in the pathology localization step.

Multi-label Classification Loss Layer: We first experiment with 3 standard loss functions for the regression task, instead of using the softmax loss for a traditional multi-class classification model, i.e., Hinge Loss (HL), Euclidean Loss (EL) and Cross Entropy Loss (CEL). However, we find that the model has difficulty learning positive instances (images with pathologies), since the image labels are rather sparse, meaning there are extensively more '0's than '1's. This is due to our one-hot-like image labeling strategy and the unbalanced numbers of pathology and "Normal" classes. Therefore, we introduce the positive/negative balancing factors βP, βN to enforce the learning of positive examples. For example, the weighted CEL (W-CEL) is defined as follows,

    L_W-CEL(f(x), y) = βP · Σ_{yc=1} −ln(f(xc)) + βN · Σ_{yc=0} −ln(1 − f(xc)),   (1)

where βP is set to (|P| + |N|) / |P| while βN is set to (|P| + |N|) / |N|. |P| and |N| are the total numbers of '1's and '0's in a batch of image labels.

Figure 4. The overall flow-chart of our unified DCNN framework and disease localization process.
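As a sanity check of Eq. (1), the W-CEL can be sketched in plain Python. This is an illustrative version only (the sigmoid outputs f(xc) and the toy batch below are invented; a real implementation runs inside the loss layer on GPU tensors):

```python
import math

def w_cel(preds, labels):
    """Weighted cross-entropy loss (W-CEL, Eq. 1) over a batch of 8-dim
    multi-hot label vectors; assumes the batch holds at least one '1'
    and one '0'. preds holds sigmoid outputs f(x_c) in (0, 1)."""
    flat_p = [p for row in preds for p in row]
    flat_y = [y for row in labels for y in row]
    n_pos = sum(flat_y)             # |P|: number of '1's in the batch
    n_neg = len(flat_y) - n_pos     # |N|: number of '0's in the batch
    beta_p = (n_pos + n_neg) / n_pos   # up-weights the sparse positives
    beta_n = (n_pos + n_neg) / n_neg
    loss = 0.0
    for p, y in zip(flat_p, flat_y):
        loss += beta_p * -math.log(p) if y == 1 else beta_n * -math.log(1.0 - p)
    return loss

# toy batch: one image labeled with the third pathology only, one "Normal"
labels = [[0, 0, 1, 0, 0, 0, 0, 0],
          [0, 0, 0, 0, 0, 0, 0, 0]]
preds = [[0.1] * 8, [0.1] * 8]
preds[0][2] = 0.8
print(round(w_cel(preds, labels), 3))  # -> 5.256
```

With one positive among 16 labels, βP = 16 here, so the single positive term dominates the loss, which is exactly the intended correction for the sparse '1's.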
3.2 Weakly-Supervised Pathology Localization

Global Pooling Layer and Prediction Layer: In our multi-label image classification network, the global pooling and the prediction layer are designed not only to be part of the DCNN for classification but also to generate the likelihood map of pathologies, namely a heatmap. A location with a peak in the heatmap generally corresponds to the presence of a disease pattern with high probability. The upper part of Fig. 4 demonstrates the process of producing this heatmap. By performing global pooling after the transition layer, the weights learned in the prediction layer can function as the weights of the spatial maps from the transition layer. Therefore, we can produce weighted spatial activation maps for each disease class (with a size of S × S × C) by multiplying the activations from the transition layer (with a size of S × S × D) and the weights of the prediction layer (with a size of D × C).

The pooling layer plays an important role in choosing what information is passed down. Besides the conventional max pooling and average pooling, we also utilize the Log-Sum-Exp (LSE) pooling proposed in [32]. The LSE pooled value xp is defined as

    xp = (1/r) · log( (1/S) · Σ_{(i,j)∈S} exp(r · xij) ),   (2)

where xij is the activation value at (i, j), (i, j) is one location in the pooling region S, and S = s × s is the total number of locations in S. By controlling the hyper-parameter r, the pooled value ranges from the maximum in S (when r → ∞) to the average (r → 0). It serves as an adjustable option between max pooling and average pooling. Since the LSE function suffers from overflow/underflow problems, the following equivalent is used when implementing the LSE pooling layer in our own DCNN architecture,

    xp = x* + (1/r) · log( (1/S) · Σ_{(i,j)∈S} exp(r · (xij − x*)) ),   (3)

where x* = max{|xij|, (i, j) ∈ S}.

Bounding Box Generation: The heatmap produced from our multi-label classification framework indicates the approximate spatial location of one particular thoracic disease class at a time. Due to the simplicity of the intensity distributions in these resulting heatmaps, applying an ad-hoc thresholding based B-Box generation method for this task is found to be sufficient. The intensities in the heatmaps are first normalized to [0, 255] and then thresholded by {60, 180} individually. Finally, B-Boxes are generated to cover the isolated regions in the resulting binary maps.

4 Experiments

Data: We evaluate and validate the unified disease classification and localization framework using the proposed ChestX-ray8 database. In total, there are 108,948 frontal-view X-ray images in the database, of which 24,636 images contain one or more pathologies. The remaining 84,312 images are normal cases. For the pathology classification and localization task, we randomly shuffled the entire dataset into three subgroups for CNN fine-tuning via Stochastic Gradient Descent (SGD): training (70%), validation (10%) and testing (20%). We only report the 8 thoracic disease recognition performance on the testing set in our experiments. Furthermore, for the 983 images with 1,600 annotated B-Boxes of pathologies, these boxes are only used as the ground truth to evaluate the disease localization accuracy in testing (not for training purposes).
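The numerically stable LSE pooling of Eq. (3) and the thresholding-based B-Box generation of Sec. 3.2 can both be sketched in plain Python. This is an illustrative, single-channel version (the toy activations and heatmap are invented; the thresholds {60, 180} apply after normalizing heatmap intensities to [0, 255] as described above):

```python
import math

def lse_pool(activations, r=10.0):
    """Numerically stable Log-Sum-Exp pooling (Eq. 3) over a flat list of
    activations; r -> 0 approaches average pooling, r -> inf max pooling."""
    s = len(activations)
    x_star = max(abs(x) for x in activations)           # x* = max |x_ij|
    total = sum(math.exp(r * (x - x_star)) for x in activations)
    return x_star + (1.0 / r) * math.log(total / s)

def boxes_from_heatmap(heatmap, threshold):
    """Normalize a heatmap to [0, 255], binarize at `threshold`, and return
    one bounding box (x0, y0, x1, y1) per 4-connected foreground region."""
    lo = min(min(row) for row in heatmap)
    hi = max(max(row) for row in heatmap)
    if hi == lo:
        return []
    norm = [[(v - lo) * 255.0 / (hi - lo) for v in row] for row in heatmap]
    h, w = len(norm), len(norm[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for i in range(h):
        for j in range(w):
            if norm[i][j] >= threshold and not seen[i][j]:
                stack, box = [(i, j)], [j, i, j, i]   # grow a new region
                seen[i][j] = True
                while stack:
                    y, x = stack.pop()
                    box = [min(box[0], x), min(box[1], y),
                           max(box[2], x), max(box[3], y)]
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w and
                                norm[ny][nx] >= threshold and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                boxes.append(tuple(box))
    return boxes

# r = 10 pulls the pooled value toward the maximum activation
print(round(lse_pool([0.0, 0.0, 1.0], r=10.0), 3))  # -> 0.89

# toy 4x4 heatmap with one bright 2x2 blob in the top-left corner
heat = [[9, 9, 0, 0],
        [9, 9, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 1]]
print(boxes_from_heatmap(heat, threshold=180))  # -> [(0, 0, 1, 1)]
```

The r = 10 setting matching the best pooling result reported below keeps the pooled value close to the per-map maximum while still averaging over near-peak activations.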
CNN Setting: Our multi-label CNN architecture is implemented using the Caffe framework [22]. The ImageNet pre-trained models, i.e., AlexNet [26], GoogLeNet [45], VGGNet-16 [44] and ResNet-50 [17], are obtained from the Caffe model zoo. Our unified DCNN takes the weights from those models, and only the transition layers and prediction layers are trained from scratch.

Due to the large image size and the limit of GPU memory, it is necessary to reduce the image batch_size to load the entire model and keep activations in the GPU, while we increase the iter_size to accumulate the gradients over more iterations. The combination of both may vary across the different CNN models, but we keep batch_size × iter_size = 80 as a constant. Furthermore, the total number of training iterations is customized for the different CNN models to prevent over-fitting. More complex models like ResNet-50 actually take fewer iterations (e.g., 10,000 iterations) to reach convergence. The DCNN models are trained using a Dev-Box Linux server with 4 Titan X GPUs.

Figure 5. A comparison of multi-label classification performance with different model initializations.

Multi-label Disease Classification: Fig. 5 demonstrates the multi-label classification ROC curves on the 8 pathology classes obtained by initializing the DCNN framework with 4 different pre-trained models: AlexNet, GoogLeNet, VGG and ResNet-50. The corresponding Area-Under-Curve (AUC) values are given in Table 4. The quantitative performance varies greatly, with the model based on ResNet-50 achieving the best results. The "Cardiomegaly" (AUC=0.8141) and "Pneumothorax" (AUC=0.7891) classes are consistently well-recognized compared to the other groups, while the detection rates can be relatively lower for pathologies which contain small objects, e.g., the "Mass" (AUC=0.5609) and "Nodule" classes. Mass is difficult to detect due to its huge within-class appearance variation. The lower performance on "Pneumonia" (AUC=0.6333) is probably because of the lack of total instances in our patient population (less than 1% of X-rays are labeled as Pneumonia). This finding is consistent with the comparison of object detection performance, which degrades from PASCAL VOC [13] to MS COCO [28], where many small annotated objects appear.

Figure 6. A comparison of multi-label classification performance with different pooling strategies.

Next, we examine the influence of different pooling strategies when using ResNet-50 to initialize the DCNN framework. As discussed above, three types of pooling are compared. As shown in Fig. 6, average pooling and max pooling achieve approximately equivalent performance in this classification task. The performance of LSE pooling starts declining first when r starts increasing, and reaches the bottom when r = 5. It then reaches the overall best performance around r = 10. LSE pooling behaves like a weighted pooling method or a transition scheme between average and max pooling under different r values. Overall, LSE pooling (r = 10) reports the best performance (consistently higher than mean and max pooling).

Last, we demonstrate the performance improvement from using the positive/negative instance balanced loss function (Eq. 1). As shown in Table 4, the weighted loss (W-CEL) provides better overall performance than CEL, especially for those classes with relatively fewer positive instances,
schemes are experimented: average looping, LSE pooling e.g. AUC for “Cardiomegaly” is increased from 0.7262 to
and max pooling. The hyper-parameter r in LSE pool- 0.8141 and from 0.5164 to 0.6333 for “Pneumonia”.
ing varies in {0.1, 0.5, 1, 5, 8, 10, 12}. As illustrated in Fig. Disease Localization: Leveraging the fine-tuned DCNN
Setting Atelectasis Cardiomegaly Effusion Infiltration Mass Nodule Pneumonia Pneumothorax
Initialization with different pre-trained models
AlexNet 0.6458 0.6925 0.6642 0.6041 0.5644 0.6487 0.5493 0.7425
GoogLeNet 0.6307 0.7056 0.6876 0.6088 0.5363 0.5579 0.5990 0.7824
VGGNet-16 0.6281 0.7084 0.6502 0.5896 0.5103 0.6556 0.5100 0.7516
ResNet-50 0.7069 0.8141 0.7362 0.6128 0.5609 0.7164 0.6333 0.7891
Different multi-label loss functions
CEL 0.7064 0.7262 0.7351 0.6084 0.5530 0.6545 0.5164 0.7665
W-CEL 0.7069 0.8141 0.7362 0.6128 0.5609 0.7164 0.6333 0.7891
Table 3. AUCs of ROC curves for multi-label classification under different DCNN model settings.
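For illustration, the W-CEL weighting compared in Table 3 can be sketched as follows. This is a hedged sketch based on the description of the balanced loss (Eq. 1), assuming weights β_P = (|P|+|N|)/|P| and β_N = (|P|+|N|)/|N| computed from the positive/negative label counts; all names are illustrative, not the paper's released code.

```python
import numpy as np

def w_cel(probs, labels, eps=1e-12):
    """Sketch of a weighted cross-entropy loss (W-CEL) for multi-label
    classification. probs and labels are (batch, num_classes) arrays;
    the positive and negative terms are re-weighted by inverse label
    frequency so rare positive findings are not swamped by the many
    negative labels (assumed form of the paper's Eq. 1)."""
    pos = labels.sum()                    # |P|: positive labels in the batch
    neg = labels.size - pos               # |N|: negative labels in the batch
    beta_p = (pos + neg) / max(pos, 1.0)  # up-weights the rare positive terms
    beta_n = (pos + neg) / max(neg, 1.0)  # usually close to 1
    loss = -(beta_p * labels * np.log(probs + eps)
             + beta_n * (1.0 - labels) * np.log(1.0 - probs + eps))
    return loss.sum() / labels.size
```

With a heavily negative label matrix, beta_p grows large, which is consistent with the reported AUC gains on rare classes such as "Pneumonia".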
T(IoBB) Atelectasis Cardiomegaly Effusion Infiltration Mass Nodule Pneumonia Pneumothorax
T(IoBB) = 0.1
Acc. 0.7277 0.9931 0.7124 0.7886 0.4352 0.1645 0.7500 0.4591
AFP 0.8323 0.3506 0.7998 0.5589 0.6423 0.6047 0.9055 0.4776
T(IoBB) = 0.25 (Two times larger on both x and y axis than ground truth B-Boxes)
Acc. 0.5500 0.9794 0.5424 0.5772 0.2823 0.0506 0.5583 0.3469
AFP 0.9167 0.4553 0.8598 0.6077 0.6707 0.6158 0.9614 0.5000
T(IoBB) = 0.5
Acc. 0.2833 0.8767 0.3333 0.4227 0.1411 0.0126 0.3833 0.1836
AFP 1.0203 0.5630 0.9268 0.6585 0.6941 0.6189 1.0132 0.5285
T(IoBB) = 0.75
Acc. 0.1666 0.7260 0.2418 0.3252 0.1176 0.0126 0.2583 0.1020
AFP 1.0619 0.6616 0.9603 0.6921 0.7043 0.6199 1.0569 0.5396
T(IoBB) = 0.9
Acc. 0.1333 0.6849 0.2091 0.2520 0.0588 0.0126 0.2416 0.0816
AFP 1.0752 0.7226 0.9797 0.7124 0.7144 0.6199 1.0732 0.5437
Table 4. Pathology localization accuracy and average false positive number for 8 disease classes.
models for multi-label disease classification, we can calculate the disease heatmaps using the activations of the transition layer and the weights from the prediction layer, and even generate the B-Boxes for each pathology candidate. The computed bounding boxes are evaluated against the hand-annotated ground truth (GT) boxes (included in ChestX-ray8). Although the total number of B-Box annotations (1,600 instances) is relatively small compared to the entire dataset, it may still be sufficient to obtain a reasonable estimate of how the proposed framework performs on the weakly-supervised disease localization task. To examine the accuracy of the computerized B-Boxes versus the GT B-Boxes, two types of measurement are used, i.e., the standard Intersection over Union ratio (IoU) and the Intersection over the detected B-Box area ratio (IoBB) (similar to Area of Precision or Purity). Due to the relatively low spatial resolution of the heatmaps (32 × 32) in contrast to the original image dimensions (1024 × 1024), the computed B-Boxes are often larger than the corresponding GT B-Boxes. Therefore, we define a correct localization by requiring either IoU > T(IoU) or IoBB > T(IoBB). Refer to the supplementary material for localization performance under varying T(IoU). Table 4 illustrates the localization accuracy (Acc.) and Average False Positive (AFP) number for each disease type, with T(IoBB) ∈ {0.1, 0.25, 0.5, 0.75, 0.9}. Please refer to the supplementary material for qualitative exemplary disease localization results for each of the 8 pathology classes.

5 Conclusion

Constructing hospital-scale radiology image databases with computerized diagnostic performance benchmarks has not been addressed until this work. We attempt to build a "machine-human annotated" comprehensive chest X-ray database that presents the realistic clinical and methodological challenges of handling at least tens of thousands of patients (somewhat similar to "ImageNet" in natural images). We also conduct extensive quantitative performance benchmarking on eight common thoracic pathology classification and weakly-supervised localization using the ChestX-ray8 database. The main goal is to initiate future efforts by promoting public datasets in this important domain. Building truly large-scale, fully-automated, high-precision medical diagnosis systems remains a strenuous task. ChestX-ray8 can enable data-hungry deep neural network paradigms to create clinically meaningful applications, including common disease pattern mining, disease correlation analysis, automated radiological report generation, etc. For future work, ChestX-ray8 will be extended to cover more disease classes and integrated with other clinical information, e.g., follow-up studies across time and patient history.

Acknowledgements This work was supported by the Intramural Research Programs of the NIH Clinical Center and National Library of Medicine. We thank NVIDIA Corporation for the GPU donation.
References
[1] Open-i: An open access biomedical search engine. https://openi.nlm.nih.gov. 2, 3, 4
[2] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, and L. Zitnick. Vqa: Visual question answering. In ICCV, 2015. 1, 2
[3] A. R. Aronson and F.-M. Lang. An overview of MetaMap: historical perspective and recent advances. Journal of the American Medical Informatics Association, 17(3):229–236, May 2010. 3
[4] J. Ba, K. Swersky, S. Fidler, and R. Salakhutdinov. Predicting deep zero-shot convolutional neural networks using textual descriptions. In ICCV, 2015. 1
[5] S. Bird, E. Klein, and E. Loper. Natural language processing with Python. O'Reilly Media, Inc., 2009. 4
[6] W. W. Chapman, W. Bridewell, P. Hanbury, G. F. Cooper, and B. G. Buchanan. A simple algorithm for identifying negated findings and diseases in discharge summaries. Journal of Biomedical Informatics, 34(5):301–310, Oct 2001. 3
[7] E. Charniak and M. Johnson. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 173–180, 2005. 4
[8] M.-C. De Marneffe and C. D. Manning. Stanford typed dependencies manual. Stanford University, Apr 2015. 4
[9] D. Demner-Fushman, M. D. Kohli, M. B. Rosenman, S. E. Shooshan, L. Rodriguez, S. Antani, G. R. Thoma, and C. J. McDonald. Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association, 23(2):304–310, July 2015. 4
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009. 5
[11] Q. Dou, H. Chen, L. Yu, L. Zhao, J. Qin, D. Wang, V. Mok, L. Shi, and P. Heng. Automatic detection of cerebral microbleeds from mr images via 3d convolutional neural networks. IEEE Trans. Medical Imaging, 35(5):1182–1195, 2016. 2
[12] T. Durand, N. Thome, and M. Cord. Weldon: Weakly supervised learning of deep convolutional neural networks. In IEEE CVPR, 2016. 5
[13] M. Everingham, S. M. A. Eslami, L. J. Van Gool, C. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015. 1, 2, 7
[14] H. Greenspan, B. van Ginneken, and R. M. Summers. Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique. IEEE Trans. Medical Imaging, 35(5):1153–1159, 2016. 2
[15] B. Hariharan and R. Girshick. Low-shot visual object recognition. arXiv preprint arXiv:1606.02819, 2016. 5
[16] M. Havaei, N. Guizard, N. Chapados, and Y. Bengio. Hemis: Hetero-modal image segmentation. In MICCAI, pages 469–477. Springer, 2016. 2
[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015. 5, 7
[18] J. Hosang, R. Benenson, P. Dollár, and B. Schiele. What makes for effective detection proposals? IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(4):814–830, 2016. 11
[19] S. Hwang and H.-E. Kim. Self-transfer learning for weakly supervised lesion localization. In MICCAI, pages 239–246, 2015. 5
[20] S. Jaeger, S. Candemir, S. Antani, Y.-X. J. Wang, P.-X. Lu, and G. Thoma. Two public chest x-ray datasets for computer-aided screening of pulmonary diseases. Quantitative Imaging in Medicine and Surgery, 4(6), 2014. 4
[21] A. Jamaludin, T. Kadir, and A. Zisserman. Spinenet: Automatically pinpointing classification evidence in spinal mris. In MICCAI. Springer, 2016. 2
[22] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014. 7
[23] J. Johnson, A. Karpathy, and L. Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. In CVPR, 2016. 1, 2, 3
[24] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2015. 1, 2
[25] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, M. Bernstein, and L. Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. 2016. 2, 3
[26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012. 1, 5, 7
[27] R. Leaman, R. Khare, and Z. Lu. Challenges in clinical natural language processing for automated disorder normalization. Journal of Biomedical Informatics, 57:28–37, 2015. 3
[28] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and L. Zitnick. Microsoft coco: Common objects in context. In ECCV, pages 740–755, 2014. 1, 2, 7
[29] D. McClosky. Any domain parsing: automatic domain adaptation for natural language parsing. Thesis, Department of Computer Science, Brown University, 2009. 4
[30] P. Moeskops, J. Wolterink, B. van der Velden, K. Gilhuijs, T. Leiner, M. Viergever, and I. Isgum. Deep learning for multi-task medical image segmentation in multiple modalities. In MICCAI. Springer, 2016. 2
[31] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? Weakly-supervised learning with convolutional neural networks. In IEEE CVPR, pages 685–694, 2015. 5
[32] P. O. Pinheiro and R. Collobert. From image-level to pixel-level labeling with convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1713–1721, 2015. 6
[33] B. Plummer, L. Wang, C. Cervantes, J. Caicedo, J. Hockenmaier, and S. Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, 2015. 1, 2, 3
[34] R. Qiao, L. Liu, C. Shen, and A. van den Hengel. Less is more: zero-shot learning from online textual documents with noise suppression. In CVPR, 2016. 1
[35] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015. 2
[36] H. Roth, L. Lu, A. Farag, H.-C. Shin, J. Liu, E. B. Turkbey, and R. M. Summers. Deeporgan: Multi-level deep convolutional networks for automated pancreas segmentation. In MICCAI, pages 556–564. Springer, 2015. 2
[37] H. R. Roth, L. Lu, A. Seff, K. M. Cherry, J. Hoffman, S. Wang, J. Liu, E. Turkbey, and R. M. Summers. A new 2.5D representation for lymph node detection using random sets of deep convolutional neural network observations. In MICCAI, pages 520–527. Springer, 2014. 2
[38] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015. 1, 2
[39] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015. 5
[40] A. Setio, F. Ciompi, G. Litjens, P. Gerke, C. Jacobs, S. van Riel, M. Wille, M. Naqibullah, C. Sánchez, and B. van Ginneken. Pulmonary nodule detection in ct images: False positive reduction using multi-view convolutional networks. IEEE Trans. Medical Imaging, 35(5):1160–1169, 2016. 2
[41] H. Shin, L. Lu, L. Kim, A. Seff, J. Yao, and R. Summers. Interleaved text/image deep mining on a large-scale radiology database for automated image interpretation. Journal of Machine Learning Research, 17:1–31, 2016. 2
[42] H. Shin, K. Roberts, L. Lu, D. Demner-Fushman, J. Yao, and R. Summers. Learning to read chest x-rays: Recurrent neural cascade model for automated image annotation. In CVPR, 2016. 2, 11
[43] H. Shin, H. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. Summers. Deep convolutional neural networks for computer-aided detection: Cnn architectures, dataset characteristics and transfer learnings. IEEE Trans. Medical Imaging, 35(5):1285–1298, 2016. 2
[44] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014. 5, 7
[45] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015. 5, 7
[46] M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler. Movieqa: Understanding stories in movies through question-answering. In ICCV, 2015. 1
[47] J. R. Uijlings, K. E. van de Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013. 11
[48] I. Vendrov, R. Kiros, S. Fidler, and R. Urtasun. Order-embeddings of images and language. In ICLR, 2016. 1
[49] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, pages 3156–3164, 2015. 1, 2
[50] H.-J. Wilke, M. Kmin, and J. Urban. Genodisc dataset: The benefits of multi-disciplinary research on intervertebral disc degeneration. In European Spine Journal, 2016. 2
[51] Q. Wu, P. Wang, C. Shen, A. Dick, and A. van den Hengel. Ask me anything: free-form visual question answering based on knowledge from external sources. In CVPR, 2016. 1, 2, 3
[52] J. Yao et al. A multi-center milestone study of clinical vertebral ct segmentation. Computerized Medical Imaging and Graphics, 49(4):16–28, 2016. 2
[53] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. In TACL, 2014. 2
[54] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. arXiv preprint arXiv:1512.04150, 2015. 5
[55] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7w: Grounded question answering in images. In CVPR, 2016. 1, 2, 3
A Supplementary Materials

A.1 SNOMED-CT Concepts

In this work, we only consider the semantic types of Diseases or Syndromes and Findings (namely 'dsyn' and 'fndg' respectively). Table 5 shows the corresponding SNOMED-CT concepts that are relevant to the target diseases (these mappings are developed by searching the disease names in the UMLS® terminology service³, and verified by a board-certified radiologist).

³ https://uts.nlm.nih.gov/metathesaurus.html

A.2 Rules of Negation/Uncertainty

Although many text processing systems can handle the negation/uncertainty detection problem, most of them exploit regular expressions on the text directly. One of the disadvantages of using regular expressions for negation/uncertainty detection is that they cannot capture various syntactic constructions for multiple subjects. For example, in the phrase "clear of A and B", the regular expression can capture "A" as a negation but not "B", particularly when both "A" and "B" are long and complex noun phrases.

To overcome this complication, we hand-craft a number of novel rules of negation/uncertainty defined on the syntactic level in this work. More specifically, we utilize the syntactic dependency information because it is close to the semantic relationship between words and thus has become prevalent in biomedical text processing. We defined our rules on the dependency graph, by utilizing the dependency label and direction information between words. Table 6 shows the rules we defined for negation/uncertainty detection on the syntactic level.

A.3 More Disease Localization Results

Table 7 illustrates the localization accuracy (Acc.) and Average False Positive (AFP) number for each disease type, with IoU > T(IoU) only and T(IoU) ∈ {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7}.

Table 8 to Table 15 illustrate localization results for each of the 8 disease classes together with the associated report and mined disease keywords. The heatmaps overlaid on the original images are shown on the right. Correct bounding boxes (in green), false positives (in red) and the ground truth (in blue) are plotted over the original image on the left.

In order to quantitatively demonstrate how informative those heatmaps are, a simple two-level thresholding based bounding box generator is adopted here to catch the peaks in the heatmap, and the generated bounding boxes can later be evaluated against the ground truth. Each heatmap results in approximately 1-3 bounding boxes. We believe the localization accuracy and AFP (shown in Table 7) could be further optimized by adopting a more sophisticated bounding box generation method, e.g. Selective Search [47] or EdgeBoxes [18]. Nevertheless, we reserve that effort for future work, since our main goal is not to compute the exact spatial location of disease patterns but to obtain instructive location information for future applications, e.g. automated radiological report generation. Take the case shown in Table 8 as an example. The peak at the lower part of the left lung region indicates the presence of "atelectasis", which agrees with the statement "...stable abnormal study including left basilar infiltrate/atelectasis, ..." presented in the impression section of the associated radiological report. By combining it with other information, e.g. a lung region mask, the heatmap itself is already more informative than just the presence indication of a certain disease in an image as introduced in previous works, e.g. [42].

CUI Concept
Atelectasis
C0004144 atelectasis
C0264494 discoid atelectasis
C0264496 focal atelectasis
Cardiomegaly
C0018800 cardiomegaly
Effusion
C0013687 effusion
C0031039 pericardial effusion
C0032227 pleural effusion disorder
C0747635 bilateral pleural effusion
C0747639 loculated pleural effusion
Pneumonia
C0032285 pneumonia
C0577702 basal pneumonia
C0578576 left upper zone pneumonia
C0578577 right middle zone pneumonia
C0585104 left lower zone pneumonia
C0585105 right lower zone pneumonia
C0585106 right upper zone pneumonia
C0747651 recurrent aspiration pneumonia
C1960024 lingular pneumonia
Pneumothorax
C0032326 pneumothorax
C0264557 chronic pneumothorax
C0546333 right pneumothorax
C0546334 left pneumothorax
Table 5. Sample target diseases and their corresponding concepts and identifiers (CUIs) in SNOMED-CT.
Rule Example
Negation
no← ∗ ← DISEASE No acute pulmonary disease
∗ → prep without → DISEASE changes without focal airspace disease
clear/free/disappearance → prep of → DISEASE clear of focal airspace disease, pneumothorax, or pleural effusion
∗ → prep without → evidence → prep of → DISEASE Changes without evidence of acute infiltrate
no ← neg ← evidence → prep of → DISEASE No evidence of active disease
Uncertainty
cannot ← md ← exclude The aorta is tortuous, and cannot exclude ascending aortic aneurysm
concern → prep for → ∗ There is raises concern for pneumonia
could be/may be/... which could be due to nodule/lymph node
difficult → prep to → exclude interstitial infiltrates difficult to exclude
may ← md ← represent which may represent pleural reaction or small pulmonary nodules
suggesting/suspect/... → dobj → DISEASE Bilateral pulmonary nodules suggesting pulmonary metastases

Table 6. Rules and corresponding examples for negation and uncertainty detection.
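To make the rule format concrete, a pattern such as "no ← neg ← evidence → prep of → DISEASE" from Table 6 can be matched on a toy dependency graph as below. The triple representation and all names are illustrative stand-ins for real parser output (e.g. Stanford typed dependencies), not the system's actual implementation.

```python
# Dependency edges as (governor, label, dependent) triples; a real
# system would obtain these from a syntactic parser.
def find_negated(mentions, deps):
    """Mark a disease mention as negated when it hangs off 'evidence'
    via a prep_of edge while 'evidence' itself carries a 'no' negation,
    i.e. the rule: no <- neg <- evidence -> prep_of -> DISEASE."""
    negated = set()
    has_no = ("evidence", "neg", "no") in deps
    for gov, label, dep in deps:
        if gov == "evidence" and label == "prep_of" and dep in mentions and has_no:
            negated.add(dep)
    return negated

# "No evidence of active disease"
deps = [("evidence", "neg", "no"), ("evidence", "prep_of", "disease")]
```

Because the match is defined on graph edges rather than the surface string, both conjuncts in "clear of A and B" would be reached through their own dependency paths, which is exactly what the regular-expression approach misses.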

T(IoU) Atelectasis Cardiomegaly Effusion Infiltration Mass Nodule Pneumonia Pneumothorax


T(IoU) = 0.1
Acc. 0.6888 0.9383 0.6601 0.7073 0.4000 0.1392 0.6333 0.3775
AFP 0.8943 0.5996 0.8343 0.6250 0.6666 0.6077 1.0203 0.4949
T(IoU) = 0.2
Acc. 0.4722 0.6849 0.4509 0.4796 0.2588 0.0506 0.3500 0.2346
AFP 0.9827 0.7205 0.9096 0.6849 0.6941 0.6158 1.0793 0.5173
T(IoU) = 0.3
Acc. 0.2444 0.4589 0.3006 0.2764 0.1529 0.0379 0.1666 0.1326
AFP 1.0417 0.7815 0.9472 0.7236 0.7073 0.6168 1.1067 0.5325
T(IoU) = 0.4
Acc. 0.0944 0.2808 0.2026 0.1219 0.0705 0.0126 0.0750 0.0714
AFP 1.0783 0.8140 0.9705 0.7489 0.7164 0.6189 1.1239 0.5427
T(IoU) = 0.5
Acc. 0.0500 0.1780 0.1111 0.0650 0.0117 0.0126 0.0333 0.0306
AFP 1.0884 0.8354 0.9919 0.7571 0.7215 0.6189 1.1291 0.5478
T(IoU) = 0.6
Acc. 0.0222 0.0753 0.0457 0.0243 0.0000 0.0126 0.0166 0.0306
AFP 1.0935 0.8506 1.0051 0.7632 0.7226 0.6189 1.1321 0.5478
T(IoU) = 0.7
Acc. 0.0055 0.0273 0.0196 0.0000 0.0000 0.0000 0.0083 0.0204
AFP 1.0965 0.8577 1.009 0.7663 0.7226 0.6199 1.1331 0.5488

Table 7. Pathology localization accuracy and average false positive number for 8 disease classes, with T(IoU) ranging from 0.1 to 0.7.
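The two overlap criteria behind Table 4 and Table 7, IoU and IoBB, can be computed as below (boxes given as (x1, y1, x2, y2) corner coordinates; a sketch of the stated definitions, not the exact evaluation script):

```python
def area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def intersection(a, b):
    # overlap rectangle; area() clamps to 0 when the boxes are disjoint
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))

def iou(det, gt):
    inter = intersection(det, gt)
    return inter / (area(det) + area(gt) - inter)

def iobb(det, gt):
    # Intersection over the *detected* B-Box area ("purity"): a detected
    # box that is 2x larger than the GT on both axes while covering it
    # scores 0.25, which is how T(IoBB) = 0.25 is described in Table 4.
    return intersection(det, gt) / area(det)
```

For det = (0, 0, 10, 10) and gt = (5, 5, 15, 15), the 5 × 5 overlap gives IoU = 25/175 ≈ 0.143 and IoBB = 25/100 = 0.25.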
Radiology report: findings include: 1. left basilar atelectasis/consolidation. 2. prominent hilum (mediastinal adenopathy). 3. left pic catheter (tip in atriocaval junction). 4. stable, normal appearing cardiomediastinal silhouette. impression: small right pleural effusion otherwise stable abnormal study including left basilar infiltrate/atelectasis, prominent hilum, and position of left pic catheter (tip atriocaval junction).
Mined keywords: Effusion; Infiltration; Atelectasis
Localization result: [heatmap overlay image]

Table 8. A sample of chest x-ray radiology report, mined disease keywords and localization result from the "Atelectasis" class. Correct bounding box (in green), false positives (in red) and the ground truth (in blue) are plotted over the original image.

Radiology report: findings include: 1. cardiomegaly (ct ratio of 17/30). 2. otherwise normal lungs and mediastinal contours. 3. no evidence of focal bone lesion. dictating
Mined keywords: Cardiomegaly
Localization result: [heatmap overlay image]

Table 9. A sample of chest x-ray radiology report, mined disease keywords and localization result from the "Cardiomegaly" class. Correct bounding box (in green), false positives (in red) and the ground truth (in blue) are plotted over the original image.

Radiology report: findings: no appreciable change since XX/XX/XX. small right pleural effusion. elevation right hemidiaphragm. diffuse small nodules throughout the lungs, most numerous in the left mid and lower lung. impression: no change with bilateral small lung metastases.
Mined keywords: Effusion; Nodule
Localization result: [heatmap overlay image]

Table 10. A sample of chest x-ray radiology report, mined disease keywords and localization result from the "Effusion" class. Correct bounding box (in green), false positives (in red) and the ground truth (in blue) are plotted over the original image.
Radiology report: findings: port-a-cath reservoir remains in place on the right. chest tube remains in place, tip in the left apex. no pneumothorax. diffuse patchy infiltrates bilaterally are decreasing. impression: infiltrates and effusions decreasing.
Mined keywords: Infiltration
Localization result: [heatmap overlay image]

Table 11. A sample of chest x-ray radiology report, mined disease keywords and localization result from the "Infiltration" class. Correct bounding box (in green), false positives (in red) and the ground truth (in blue) are plotted over the original image.

Radiology report: findings: right internal jugular catheter remains in place. large metastatic lung mass in the lateral left upper lobe is again noted. no infiltrate or effusion. extensive surgical clips again noted left axilla. impression: no significant change.
Mined keywords: Mass
Localization result: [heatmap overlay image]

Table 12. A sample of chest x-ray radiology report, mined disease keywords and localization result from the "Mass" class. Correct bounding box (in green), false positives (in red) and the ground truth (in blue) are plotted over the original image.

Radiology report: findings: pa and lateral views of the chest demonstrate stable 2.2 cm nodule in left lower lung field posteriorly. the lungs are clear without infiltrate or effusion. cardiomediastinal silhouette is normal size and contour. pulmonary vascularity is normal in caliber and distribution. impression: stable left likely hamartoma.
Mined keywords: Nodule; Infiltration
Localization result: [heatmap overlay image]

Table 13. A sample of chest x-ray radiology report, mined disease keywords and localization result from the "Nodule" class. Correct bounding box (in green), false positives (in red) and the ground truth (in blue) are plotted over the original image.
Radiology report: findings: unchanged left lower lung field infiltrate/air bronchograms. unchanged right perihilar infiltrate with obscuration of the right heart border. no evidence of new infiltrate. no evidence of pneumothorax the cardiac and mediastinal contours are stable. impression: 1. no evidence pneumothorax. 2. unchanged left lower lobe and left lingular consolidation/bronchiectasis. 3. unchanged right middle lobe infiltrate
Mined keywords: Pneumonia; Infiltration
Localization result: [heatmap overlay image]

Table 14. A sample of chest x-ray radiology report, mined disease keywords and localization result from the "Pneumonia" class. Correct bounding box (in green), false positives (in red) and the ground truth (in blue) are plotted over the original image.

Radiology report: findings: frontal lateral chest x-ray performed in expiration. left apical pneumothorax visible. small pneumothorax visible along the left heart border and left hemidiaphragm. pleural thickening, mass right chest. the mediastinum cannot be evaluated in the expiration. bony structures intact. impression: left post biopsy pneumothorax.
Mined keywords: Mass; Pneumothorax
Localization result: [heatmap overlay image]

Table 15. A sample of chest x-ray radiology report, mined disease keywords and localization result from the "Pneumothorax" class. Correct bounding box (in green), false positives (in red) and the ground truth (in blue) are plotted over the original image.
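As an illustration of the two-level thresholding bounding-box generation described in A.3, the sketch below keeps connected heatmap regions that both pass a loose threshold and contain a strong peak. The threshold values and function names are hypothetical; the paper does not publish the exact parameters.

```python
import numpy as np
from scipy import ndimage

def heatmap_to_boxes(heatmap, t_peak=0.9, t_region=0.6):
    """Two-level thresholding: label regions above a loose threshold,
    then keep only regions that also contain a strong peak. Returns
    boxes as (x1, y1, x2, y2) in heatmap coordinates."""
    strong = heatmap >= t_peak * heatmap.max()
    loose = heatmap >= t_region * heatmap.max()
    labeled, n = ndimage.label(loose)
    boxes = []
    for idx in range(1, n + 1):
        region = labeled == idx
        if not (region & strong).any():
            continue  # no strong peak inside this region; discard it
        ys, xs = np.where(region)
        boxes.append((xs.min(), ys.min(), xs.max() + 1, ys.max() + 1))
    return boxes
```

On a 32 × 32 heatmap this kind of procedure typically yields the 1-3 boxes per image mentioned in A.3; the boxes would then be rescaled to the 1024 × 1024 image before comparison against the GT B-Boxes.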
B ChestX-ray14 Dataset

After the CVPR submission, we expand the disease categories to include 6 more common thorax diseases (i.e. Consolidation, Edema, Emphysema, Fibrosis, Pleural Thickening and Hernia) and update the NLP-mined labels. The statistics of the ChestX-ray14 dataset are illustrated in Table 16 and Figure 8. The bounding boxes for pathologies are unchanged at this point.

ResNet-50 ChestX-ray8 ChestX-ray14
Atelectasis 0.7069 0.7003
Cardiomegaly 0.8141 0.8100
Effusion 0.7362 0.7585
Infiltration 0.6128 0.6614
Mass 0.5609 0.6933
Nodule 0.7164 0.6687
Pneumonia 0.6333 0.6580
Pneumothorax 0.7891 0.7993
Consolidation - 0.7032
Edema - 0.8052
Emphysema - 0.8330
Fibrosis - 0.7859
PT - 0.6835
Hernia - 0.8717
Table 17. AUCs of ROC curves for multi-label classification for ChestX-ray14 using the published data split. PT: Pleural Thickening

Item  X-ray8 #  X-ray8 Ov.  X-ray14 #  X-ray14 Ov.
Report 108,948 - 112,120 -
Atelectasis 5,789 3,286 11,535 7,323
Cardiomegaly 1,010 475 2,772 1,678
Effusion 6,331 4,017 13,307 9,348
Infiltration 10,317 4,698 19,871 10,319
Mass 6,046 3,432 5,746 2,138
Nodule 1,971 1,041 6,323 3,617
Pneumonia 1,062 703 1,353 1,046
Pneumothorax 2,793 1,403 5,298 3,099
Consolidation - - 4,667 3,353
Edema - - 2,303 1,669
Emphysema - - 2,516 1,621
Fibrosis - - 1,686 959
PT - - 3,385 2,258
Hernia - - 227 117
No findings 84,312 0 60,412 0
Table 16. Total number (#) and number of overlaps (Ov.) of the corpus in the ChestX-ray8 and ChestX-ray14 datasets. PT: Pleural Thickening

B.1 Evaluation of NLP Mined Labels


To validate our method, we perform the following experiments. First, we resort to an existing annotated corpus as an alternative, i.e., the OpenI dataset. Furthermore, we annotated clinical reports suitable for evaluating finding recognition systems. We randomly selected 900 reports and asked two annotators to mark the above 14 types of findings. Each report was annotated by the two annotators independently, and agreement was then reached on conflicting cases.

Table 18 shows the results of our method on OpenI and our proposed dataset, measured in precision (P), recall (R), and F1-score. Much higher precision, recall, and F1-scores are achieved compared to the existing MetaMap approach (with NegEx enabled). This indicates that applying negation and uncertainty detection at the syntactic level successfully removes false-positive cases.

B.2 Benchmark Results

In a similar fashion to the experiment on ChestX-ray8, we evaluate and validate the unified disease classification and localization framework on the ChestX-ray14 database. In total, 112,120 frontal-view X-ray images are used, of which 51,708 images contain one or more pathologies. The remaining 60,412 images do not contain any of the listed 14 disease findings. For the pathology classification and localization task, we randomly shuffled the entire dataset into three subgroups at the patient level for CNN fine-tuning via Stochastic Gradient Descent (SGD): training (∼70%), validation (∼10%), and testing (∼20%). All images from the same patient appear in only one of the three sets.4 We report the 14 thoracic disease recognition performance on the published testing set in comparison with the counterpart based on ChestX-ray8, shown in Table 17 and Figure 7.

Figure 7. Multi-label classification performance on ChestX-ray14 with ImageNet pre-trained ResNet.

Since the annotated B-Boxes of pathologies are unchanged, we only test the localization performance on the original 8 categories. Results, measured by the Intersection over the detected B-Box area ratio (IoBB, similar to Area of Precision or Purity), are demonstrated in Table 19.

Overall, both the classification and localization performance on ChestX-ray14 are equivalent to their counterparts on ChestX-ray8.

4 Data split files can be downloaded via https://nihcc.app.box.com/v/ChestXray-NIHCC
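The patient-level split described above can be sketched as follows. This is an illustrative helper, not the released code: the function and argument names are our own, and the released split files linked in the footnote should be used to reproduce the paper's exact partition. The key property is that patients, not images, are shuffled, so every image of a patient lands in exactly one subset.

```python
import random
from collections import defaultdict

def patient_level_split(image_ids, patient_of, ratios=(0.7, 0.1, 0.2), seed=0):
    """Split images into train/val/test at the patient level.

    image_ids:  list of image identifiers
    patient_of: dict mapping image id -> patient id
    ratios:     approximate train/val/test fractions of *patients*
    """
    # Group images by patient so a patient is assigned as a unit.
    by_patient = defaultdict(list)
    for img in image_ids:
        by_patient[patient_of[img]].append(img)

    patients = sorted(by_patient)              # deterministic order before shuffling
    random.Random(seed).shuffle(patients)

    n = len(patients)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    groups = (patients[:n_train],
              patients[n_train:n_train + n_val],
              patients[n_train + n_val:])
    # Flatten each patient group back into a flat image list.
    return tuple([img for p in g for img in by_patient[p]] for g in groups)
```

Because whole patients are assigned to one subset, the realized image fractions only approximate 70/10/20 when patients have varying numbers of images.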


Disease MetaMap (Precision / Recall / F1-score) Our Method (Precision / Recall / F1-score)
OpenI
Atelectasis 87.3 / 96.5 / 91.7 88.7 / 96.5 / 92.4
Cardiomegaly 100.0 / 85.5 / 92.2 100.0 / 85.5 / 92.2
Effusion 90.3 / 87.5 / 88.9 96.6 / 87.5 / 91.8
Infiltration 68.0 / 100.0 / 81.0 81.0 / 100.0 / 89.5
Mass 100.0 / 66.7 / 80.0 100.0 / 66.7 / 80.0
Nodule 86.7 / 65.0 / 74.3 82.4 / 70.0 / 75.7
Pneumonia 40.0 / 80.0 / 53.3 44.4 / 80.0 / 57.1
Pneumothorax 80.0 / 57.1 / 66.7 80.0 / 57.1 / 66.7
Consolidation 16.3 / 87.5 / 27.5 77.8 / 87.5 / 82.4
Edema 66.7 / 90.9 / 76.9 76.9 / 90.9 / 83.3
Emphysema 94.1 / 64.0 / 76.2 94.1 / 64.0 / 76.2
Fibrosis 100.0 / 100.0 / 100.0 100.0 / 100.0 / 100.0
PT 100.0 / 75.0 / 85.7 100.0 / 75.0 / 85.7
Hernia 100.0 / 100.0 / 100.0 100.0 / 100.0 / 100.0
Total 77.2 / 84.6 / 80.7 89.8 / 85.0 / 87.3
ChestX-ray14
Atelectasis 88.6 / 98.1 / 93.1 96.6 / 97.3 / 96.9
Cardiomegaly 94.1 / 95.7 / 94.9 96.7 / 95.7 / 96.2
Effusion 87.7 / 99.6 / 93.3 94.8 / 99.2 / 97.0
Infiltration 69.7 / 90.0 / 78.6 95.9 / 85.6 / 90.4
Mass 85.1 / 92.5 / 88.7 92.5 / 92.5 / 92.5
Nodule 78.4 / 92.3 / 84.8 84.5 / 92.3 / 88.2
Pneumonia 73.8 / 87.3 / 80.0 88.9 / 87.3 / 88.1
Pneumothorax 87.4 / 100.0 / 93.3 94.3 / 98.8 / 96.5
Consolidation 72.8 / 98.3 / 83.7 95.2 / 98.3 / 96.7
Edema 72.1 / 93.9 / 81.6 76.9 / 93.9 / 95.4
Emphysema 97.6 / 93.2 / 95.3 100.0 / 90.9 / 95.2
Fibrosis 84.6 / 100.0 / 91.7 91.7 / 100.0 / 95.7
PT 85.1 / 97.6 / 90.9 97.6 / 97.6 / 97.6
Hernia 66.7 / 100.0 / 80.0 100.0 / 100.0 / 100.0
Total 82.8 / 95.5 / 88.7 94.4 / 94.4 / 94.4

Table 18. Evaluation of image labeling results on OpenI and ChestX-ray14 dataset. Performance is reported using P, R, F1-score. PT:
Pleural Thickening
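The per-finding metrics in Table 18 follow the standard precision/recall/F1 definitions, computed by comparing mined label sets against the annotators' label sets per report. A minimal sketch, with illustrative names and a hypothetical input format (not the paper's actual evaluation script):

```python
def prf1(tp, fp, fn):
    """Precision, recall and F1 from raw counts, as percentages."""
    p = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    r = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def evaluate_labels(pred, gold):
    """Compare mined labels against annotator labels.

    pred, gold: dicts mapping report id -> set of finding names.
    A mined finding is a true positive if the annotators also marked it.
    """
    tp = sum(len(pred[r] & gold[r]) for r in gold)
    fp = sum(len(pred[r] - gold[r]) for r in gold)
    fn = sum(len(gold[r] - pred[r]) for r in gold)
    return prf1(tp, fp, fn)
```

Restricting `pred` and `gold` to a single finding name yields the per-disease rows of Table 18; pooling all findings yields the "Total" row.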
T(IoBB) Atelectasis Cardiomegaly Effusion Infiltration Mass Nodule Pneumonia Pneumothorax
T(IoBB) = 0.1
Acc. 0.6222 1 0.7974 0.9106 0.5882 0.1519 0.8583 0.5204
AFP 0.8293 0.1768 0.6148 0.4919 0.3933 0.4685 0.4360 0.4543
T(IoBB) = 0.25 (Two times larger on both x and y axis than ground truth B-Boxes)
Acc. 0.3944 0.9863 0.6339 0.7967 0.4588 0.0506 0.7083 0.3367
AFP 0.9319 0.2042 0.6880 0.5447 0.4288 0.4786 0.4959 0.4857
T(IoBB) = 0.5
Acc. 0.1944 0.9452 0.4183 0.6504 0.3058 0 0.4833 0.2653
AFP 0.9979 0.2785 0.7652 0.6006 0.4604 0.4827 0.5630 0.5030
T(IoBB) = 0.75
Acc. 0.0889 0.8151 0.2287 0.4390 0.1647 0 0.2917 0.1735
AFP 1.0285 0.4045 0.8222 0.6697 0.4827 0.4827 0.6169 0.5243
T(IoBB) = 0.9
Acc. 0.0722 0.6507 0.1373 0.3577 0.0941 0 0.2333 0.1224
AFP 1.0356 0.4837 0.8445 0.7043 0.4939 0.4827 0.6331 0.5346
Table 19. Pathology localization accuracy and average false positive number for ChestX-ray14.
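The IoBB criterion used in Table 19 divides the intersection area by the area of the detected box (rather than by the union, as IoU does), so a small detection fully inside the ground-truth box scores 1.0. A sketch of the metric, assuming boxes are given as (x1, y1, x2, y2) corner coordinates (the corner representation is our assumption, not stated in the text):

```python
def iobb(det, gt):
    """Intersection over the detected B-Box area.

    det, gt: boxes as (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
    Returns a value in [0, 1]; a localization counts as correct
    when iobb(det, gt) >= T(IoBB) for the chosen threshold.
    """
    ix = max(0.0, min(det[2], gt[2]) - max(det[0], gt[0]))
    iy = max(0.0, min(det[3], gt[3]) - max(det[1], gt[1]))
    inter = ix * iy
    det_area = (det[2] - det[0]) * (det[3] - det[1])
    return inter / det_area if det_area else 0.0
```

Under this definition, the T(IoBB) = 0.25 row of Table 19 roughly corresponds to accepting detected boxes up to twice the ground-truth extent along each axis, as the table note indicates.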
Figure 8. The circular diagram shows the proportions of images with multi-labels in each of the 14 pathology classes and the labels' co-occurrence statistics.
