
2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019)

Venice, Italy, April 8-11, 2019

DEEP LEARNING FOR SKIN CANCER DIAGNOSIS WITH HIERARCHICAL ARCHITECTURES

Catarina Barata and Jorge S. Marques

Institute for Systems and Robotics, Instituto Superior Técnico, Lisboa, Portugal

ABSTRACT

Skin lesions are organized in a hierarchical way, which is taken into account by dermatologists when diagnosing them. However, automatic systems do not make use of this information, performing the diagnosis in a one-vs-all approach, where all types of lesions are considered. In this paper we propose to mimic the medical strategy and train a deep learning architecture to perform a hierarchical diagnosis. Our results highlight the benefits of addressing the classification of dermoscopy images in a structured way. Additionally, we provide an extensive evaluation of criteria that must be taken into account in the development of diagnostic systems based on deep learning.

Index Terms— Skin Cancer, Hierarchical Classification, Deep Learning, Dermoscopy

1. INTRODUCTION

Skin cancer is one of the most common types of cancer worldwide, accounting for approximately one third of all diagnoses. The overwhelming increase in its incidence rates, particularly of melanoma, which grew over 300% from 1990 to 2018 in the US alone [1], has drawn the attention of researchers. In particular, there is a focus on the development of methods for the automatic diagnosis of dermoscopy images [2].

Although dermoscopy image analysis has been an active topic of research for more than twenty years, the last couple of years have seen a significant increase in the number of published works [2]. Such interest has been mainly encouraged by the release of public dermoscopy datasets, such as PH2 [3] and the ISIC challenges [4, 5]. Moreover, the deep learning revolution [6] has also played a role, with the proposal of increasingly deeper and better convolutional neural network (CNN) architectures and the release of open-source software tools. Deep learning and small datasets, such as the dermoscopy ones, are antagonists, meaning that it is not reasonable to train CNN architectures from scratch to tackle the problem of skin cancer. However, the availability of pre-trained networks, which may be used for transfer learning either as feature extractors or as a starting point for fine-tuning to the skin cancer problem, has fostered the release of several works based on this methodology [7].

The most recent public datasets have extended the traditional melanoma/benign problem, which used only melanocytic lesions, to a multi-class one where non-melanocytic lesions have been added (e.g., ISIC 2017 [5]). Several methods have treated this problem as a one-vs-all task, where the network tries to distinguish between all of the classes in the same decision layer. However, dermatologists divide this task hierarchically: first they distinguish between melanocytic and non-melanocytic lesions, and only then do they perform the final diagnosis [8].

Thus, it is natural to wonder whether there is any benefit in mimicking the medical diagnosis and training hierarchical networks. This paper shows that it is indeed better to use hierarchical networks. Additionally, we conduct several experiments that shed light on the following points: i) the importance of color normalization and lesion segmentation; ii) the performance of transfer learning strategies; and iii) the comparison of evaluation metrics. To the best of our knowledge, this is the first work that explores the hierarchical organization of skin lesions and simultaneously investigates points i), ii), and iii).

The remainder of the paper is organized as follows. Section 2 gives an overview of CNN architectures in skin cancer diagnosis, Section 3 introduces the hierarchical architectures, and Section 4 describes the experimental setup. Section 5 presents the results and Section 6 concludes the paper.

2. CNNS IN DERMOSCOPY IMAGE ANALYSIS

In recent years, CNNs have been widely used in dermoscopy image analysis. One of the first works is that of Codella et al. [9], where the Caffe architecture was used as a feature extractor. Esteva et al. [10] trained an Inception network from scratch using a very large private dataset of both clinical and dermoscopy images, showing that it was possible to achieve a performance similar to that of a human expert. However, training a CNN from scratch to diagnose skin cancer is usually infeasible due to the reduced size of the datasets (e.g., the dataset from the 2017 challenge contained only 2000 images). Therefore, most works have either used pre-trained CNNs as feature extractors or have fine-tuned them for this problem [7].

This work was supported by the FCT project and plurianual funding [PTDC/EEIPRO/0426/2014], [UID/EEA/50009/2019]. The Titan Xp used for this research was donated by the NVIDIA Corporation.
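The distinction drawn above, using a pre-trained CNN as a fixed feature extractor versus fine-tuning it, can be illustrated with a small stand-in experiment (all names and dimensions here are illustrative, not the implementation used in any of the cited works): a frozen random projection plays the role of the pre-trained convolutional layers, and only the decision layer is trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pre-trained convolutional layers: a fixed random projection.
W_feat = 0.1 * rng.normal(size=(16, 8))

def features(x):
    """Frozen feature extractor: its weights are never updated."""
    return np.maximum(x @ W_feat, 0.0)  # ReLU features

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy two-class problem.
x = rng.normal(size=(64, 16))
y = (x[:, 0] > 0).astype(int)
y_onehot = np.eye(2)[y]

W_out = np.zeros((8, 2))  # decision layer: the only trainable part
losses = []
for _ in range(300):  # full-batch gradient descent on the decision layer
    f = features(x)
    p = softmax(f @ W_out)
    losses.append(-np.mean(np.log(p[np.arange(len(y)), y] + 1e-12)))
    grad = f.T @ (p - y_onehot) / len(y)  # dL/dW_out for cross-entropy
    W_out -= 0.5 * grad  # W_feat is deliberately left untouched
```

The same pattern scales up to a real pre-trained network: in feature-extraction mode the convolutional weights stay frozen and only the decision layer(s) are optimized; fine-tuning instead uses the pre-trained weights as an initialization for all layers.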

978-1-5386-3640-4/19/$31.00 ©2019 IEEE


Fig. 1: Classification strategies: multi-class (left), hierarchical melanocytic/non-melanocytic (hier1, mid), and hierarchical malignant/benign (hier2, right). Here, O identifies the melanocytic class and p stands for the dropout probability.
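One simple way to read the two hierarchical variants of Fig. 1 (our illustrative interpretation; the figure does not spell out how the coarse and fine decision layers are combined) is as the product of a coarse posterior and a conditional fine posterior:

```python
def hierarchical_posterior(p_coarse, p_fine_given_coarse):
    """Combine a coarse posterior with conditional fine posteriors.

    p_coarse: dict mapping coarse label -> probability (sums to 1).
    p_fine_given_coarse: dict mapping coarse label -> dict of
        fine label -> conditional probability (each sums to 1).
    Returns a dict of fine label -> marginal probability.
    """
    p = {}
    for coarse, pc in p_coarse.items():
        for fine, pf in p_fine_given_coarse[coarse].items():
            p[fine] = p.get(fine, 0.0) + pc * pf
    return p

# hier2: first malignant (M) vs benign, then benign split into K vs N.
p_coarse = {"malignant": 0.2, "benign": 0.8}
p_fine = {
    "malignant": {"M": 1.0},           # malignant branch is melanoma only
    "benign": {"K": 0.25, "N": 0.75},  # benign branch: keratosis vs nevus
}
posterior = hierarchical_posterior(p_coarse, p_fine)
```

The hier1 variant follows the same pattern with melanocytic/non-melanocytic as the coarse split.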
The use of CNNs was extensively observed in the 2017 [5] and 2018¹ ISIC challenges. While in 2017 most participants showed a preference for ResNet, Inception, and ResNeXt architectures, in 2018 the use of deeper and more complex architectures, such as DenseNet and PNASNet, was also observed. Another difference between the two challenges is the use of ensembles of CNNs in 2018, which had already been pointed out by the challenge organizers as a way to improve the results [5]. Recently, several ensemble techniques have been proposed [11, 12], with promising results.

Some authors have devoted their work to studying specific aspects of the CNN that may improve the classification results of dermoscopy images. In particular, great importance has been given to the identification of suitable data augmentation strategies that may help deal with the limited amount of available data [13, 14]. Additionally, attention has been paid to the comparison between transfer learning with and without fine-tuning [15], to performing data augmentation on the test set [14], and to other relevant criteria (e.g., image size and selected architecture) [11].

Although hierarchical classification was investigated before using hand-crafted features [16], to the best of our knowledge, the application of this idea to CNNs has been poorly investigated in the dermoscopy field. The exception is the work of Demyanov et al. [17], which uses both clinical and dermoscopy images to train a ResNet-50 with a tree-loss function. That dataset is significantly different from the one used in our work, which contains only dermoscopy images. Moreover, we propose a simpler approach to impose hierarchy in our classification procedure.

3. HIERARCHICAL CNN

Dermoscopy lesions are categorized in a hierarchical way, where the lesions are first grouped into melanocytic or non-melanocytic, according to their origin, and only then diagnosed into a finer category [8]. Although this hierarchy is well known, an evaluation of CNN architectures that perform a structured classification is still missing in the literature.

We address this problem and compare three classification strategies: one based on a multi-class formulation (see Fig. 1, left) and two based on hierarchical classification (see Fig. 1, mid and right). Our dataset contains examples of non-melanocytic lesions (seborrheic keratosis, K) and melanocytic lesions (melanoma, M, and nevus, N).

With respect to the hierarchical strategies, we aim to infer whether it is better to: i) mimic dermatologists and first discriminate between non-melanocytic (K) and melanocytic (M and N) lesions (hier1); or ii) first discriminate between malignant (M) and benign (K and N) lesions (hier2).

4. EXPERIMENTAL SETUP

This section describes the experimental evaluation of the strategies proposed in Section 3. Additionally, we also assess the role of several factors that may influence the performance of deep neural networks. In the following sections we identify the key aspects that are studied in the paper.

4.1. Dataset

For many years, the works devoted to skin cancer diagnosis used relatively small datasets, which usually comprised only examples of melanocytic lesions. Recently, the ISIC project started to release increasingly larger and more complex datasets associated with conference challenges. The challenge datasets are particularly relevant, since they allow a fair comparison between methods and their performances. Therefore, in this work we use the ISIC 2017-ISBI set [5], which is divided into training (2000 images), validation (150 images), and test (600 images) sets. The task of this challenge was to diagnose three classes of lesions: M, K, and N. Contrary to several of the challenge competitors, we do not augment the training set with external data, as we are interested in assessing how to make the most of a dataset, even if limited, to efficiently train deep learning architectures. Moreover, we want to ensure that our results are reproducible.

4.2. Pre-processing

It may be useful to perform several transformations to dermoscopy images before feeding them to a CNN.

¹ [Link]
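As a sketch of the color normalization step discussed in this section, assuming the strategy of [19] is the Shades of Gray colour constancy algorithm with Minkowski norm p = 6 (our reading of the citation; implementation details such as gamma handling may differ):

```python
import numpy as np

def shades_of_gray(img, p=6):
    """Shades of Gray colour constancy (sketch).

    img: float array of shape (H, W, 3) with values in [0, 1].
    Estimates the illuminant per channel with a Minkowski p-norm,
    then rescales each channel so the scene appears under a
    canonical (gray) illuminant.
    """
    img = img.astype(np.float64)
    # Per-channel illuminant estimate: (mean of I^p)^(1/p).
    illum = np.power(np.mean(np.power(img, p), axis=(0, 1)), 1.0 / p)
    illum = illum / np.linalg.norm(illum)  # normalize the illuminant direction
    canonical = 1.0 / np.sqrt(3.0)         # gray (achromatic) illuminant
    corrected = img * (canonical / illum)  # von Kries-style channel scaling
    return np.clip(corrected, 0.0, 1.0)

# A uniformly reddish image is mapped to a neutral gray.
reddish = np.full((4, 4, 3), [0.8, 0.4, 0.4])
out = shades_of_gray(reddish)
```

With p = 1 this reduces to the Gray World assumption; larger p values weight bright pixels more heavily.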

Fig. 2: Examples of pre-processed images: original (1st column); segmented and cropped (2nd column); normalized (3rd column).

In this work we focus on two types of transformations: lesion segmentation and color normalization.

Lesion segmentation corresponds to the separation between the lesion and the surrounding skin. In our experiments this amounts to cropping the original dermoscopy image with a tight bounding box around the lesion (see Fig. 2, 2nd column). Although the role of lesion segmentation is still an open issue in dermoscopy image analysis [7], it is important to understand how it influences the performance of CNN architectures.

Color normalization allows us to correct the colors of the dermoscopy images and reduce the variability introduced by the acquisition setup, as exemplified in Fig. 2, 3rd column. Similarly to the top-ranked method of the ISIC-2017 challenge [18] and several of the participants of the ISIC-2018 challenge, we apply the color normalization strategy proposed in [19] to correct the image colors using their statistics. We set the value of p = 6.

After applying the aforementioned transformations, all of the images were resized to 299×299.

4.3. Network Training

Due to the reduced size of the training set we use the DenseNet-161 architecture pre-trained on the ImageNet dataset [20], comparing two approaches: feature extraction vs. fine-tuning. In the feature extraction approach we freeze all the layers except the decision one(s), which are trained for our problem, while in the fine-tuning case the pre-trained weights are used as a soft initialization.

All of the models are trained using the Adam optimizer and a mini-batch approach, with a batch size of 5. The starting learning rate is η = 0.005 for transfer learning and η = 10⁻⁵ for fine-tuning, with a decay rate of 0.5 every 40 epochs. Cross-entropy is the selected loss function.

4.4. Generalization

It is crucial to train deep learning architectures that generalize well to new images. In this work we rely on two strategies. The first one is online data augmentation, which consists of randomly flipping, rotating, cropping, and altering the colors of the training images in each epoch. We picked this particular combination of transformations because they have been shown to improve the results of CNNs [13, 14]. Although online augmentation does not increase the size of the training set, it guarantees that the network "sees" a different version of the same image between epochs, which reduces the probability of the network memorizing it and improves generalization.

The other strategy is the use of dropout [21]. In particular, we apply dropout with 50% probability before the decision layer(s), as shown in Fig. 1.

4.5. Unbalanced Data

The training set used in this work is very unbalanced, with the following proportions: 18.7% M, 12.9% K, and 68.6% N. Popular approaches to deal with this issue are to artificially augment the less frequent classes, to assign different weights to the classes in the cost function, or to combine the previous two.

In this work we resort to weighting the cross-entropy losses of the training examples. In particular, we assign the class weights based on their distribution:

    w_c = #N / #N_c ,   (1)

where #N is the size of the training set and #N_c is the number of training examples from class c ∈ {M, K, N}.

4.6. Evaluation Metrics

Finding appropriate metrics to evaluate and compare the performance of classification systems is a challenging task. The metric used to rank the participants in the ISIC-2017 challenge was the average area under the curve (AUC) for the M and K diagnoses [5]. Thus, we also apply this metric to evaluate the performance of the tested model configurations.

Although AUC is a suitable metric to compare models, it is difficult to infer the performance of the model for each of the classes solely by inspecting its value. In the ISIC-2018 challenge, the ranking procedure was changed to be based on the balanced accuracy metric (BACC), which averages the recall (Re) values of all the classes:

    Re_c = #TP_c / #N_c ,   (2)

where #TP_c is the number of true positives, i.e., the number of correctly classified examples from class c.

5. RESULTS

The experimental framework described in Section 4 was implemented using TensorFlow and one Titan Xp GPU.
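Eq. (1) and its use inside a weighted cross-entropy loss can be sketched as follows (an illustrative NumPy version; in this work the weights are applied inside the TensorFlow training loop):

```python
import numpy as np

def class_weights(labels, classes=("M", "K", "N")):
    """Eq. (1): w_c = #N / #N_c, so rarer classes get larger weights."""
    labels = np.asarray(labels)
    return {c: len(labels) / np.sum(labels == c) for c in classes}

def weighted_cross_entropy(probs, labels, weights, classes=("M", "K", "N")):
    """Mean cross-entropy where each example is scaled by its class weight."""
    idx = np.array([classes.index(l) for l in labels])
    w = np.array([weights[l] for l in labels])
    ce = -np.log(probs[np.arange(len(labels)), idx] + 1e-12)
    return np.mean(w * ce)

# Toy labels with the training-set proportions reported in Section 4.5
# (18.7% M, 12.9% K, 68.6% N).
labels = ["M"] * 187 + ["K"] * 129 + ["N"] * 686
w = class_weights(labels)
# The nevus class (most frequent) receives the smallest weight.
```

Note that with these weights the average weight over the training set equals the number of classes, so the loss scale stays comparable to the unweighted case up to a constant factor.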
Fig. 3: Performance results for the test set: AUC (left) and BACC (right) scores of the feature extraction (Feat.) and fine-tuning (Fine) variants of the multi, hier1, and hier2 strategies, for the Full, Full_Norm, Cropped, and Cropped_Norm image configurations.
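The Re and BACC values reported in Fig. 3 and Table 1 follow Eq. (2); a minimal computation from predicted and true labels (illustrative only):

```python
import numpy as np

def balanced_accuracy(y_true, y_pred, classes=("M", "K", "N")):
    """Eq. (2): per-class recall Re_c = #TP_c / #N_c, averaged over classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = {}
    for c in classes:
        mask = y_true == c
        recalls[c] = np.mean(y_pred[mask] == c)  # fraction of class c recovered
    bacc = np.mean(list(recalls.values()))
    return recalls, bacc

# Small worked example.
y_true = ["M", "M", "K", "K", "N", "N", "N", "N"]
y_pred = ["M", "N", "K", "K", "N", "N", "N", "M"]
recalls, bacc = balanced_accuracy(y_true, y_pred)
# recalls: M -> 0.5, K -> 1.0, N -> 0.75; BACC = 0.75
```

Because each class contributes equally regardless of its frequency, BACC is not dominated by the majority nevus class, unlike plain accuracy.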
Overall, the experiments amounted to training and evaluating 24 different network architectures. The architectures were trained for 500 epochs using the 2000 training images, and validated every 10 epochs using the validation set. Figure 3 summarizes the AUC and BACC scores for the test set.

Table 1: Best performance scores.

Image Type    | Re_M  | Re_K  | Re_N  | AUC   | BACC
Full          | 44.4% | 70.0% | 83.4% | 87.2% | 65.9%
Full Norm.    | 46.1% | 71.1% | 85.0% | 87.6% | 67.4%
Cropped       | 50.0% | 76.7% | 83.3% | 87.4% | 70.0%
Cropped Norm. | 59.8% | 71.1% | 79.2% | 87.5% | 70.0%

These results yield relevant information. First, the use of a hierarchical classification strategy seems to lead to better overall results than a traditional multi-class approach. Such results are observed both for the feature extraction (range of blues) and fine-tuning strategies (hot color bars), using any type of image pre-processing. As expected, fine-tuning DenseNet-161 to our problem leads to better experimental results, both in terms of AUC and BACC. Interestingly, this improvement is more noticeable in the AUC scores of the architectures trained with the full image (1st and 2nd sets of bars), while for the cropped images (3rd and 4th sets of bars) fine-tuning even seems to degrade the performance of the multi-class architectures. However, when one inspects the BACC scores, it is clear that fine-tuning leads to significant improvements in all of the cases, suggesting that the evaluation of a model must take into account more than one metric.

Cropped images seem to convey more discriminative information, especially when combined with the hierarchical architectures. In particular, the use of cropped images seems to be more suitable for diagnosing melanomas, since Re_M increases, as shown in Table 1. The scores shown in this table were obtained using the hierarchical architecture hier2, i.e., first discriminating between malignant and benign lesions and then between types of benign lesions. Contrary to what was expected, since hier1 (orange bars) is the methodology used by dermatologists, hier2 (red bars) seems to be the one that leads to the best results for most of the configurations. This finding may be explained by the difficulty of diagnosing melanomas when compared with other types of skin lesions. It is a promising result that must be further investigated with a dataset that contains other types of malignant lesions, such as basal cell carcinomas [8].

Regarding the use of color normalization, it seems to lead to a marginal improvement in the AUC scores and to similar BACC for the cropped images. However, when we take a closer look at the Re values for the different classes, we observe that they are significantly different, evidencing again the importance of considering more than one metric to evaluate a classification system.

We have compared our results with those of the ISIC challenge [5]. Our scores rank in the 70th percentile regarding the AUC metric, meaning that the hierarchical approach would rank above the 7th position in the leaderboard. Regarding BACC, we have only compared our scores for the melanoma and keratosis classes, since these are the only Re values available to the public. In this case, our hierarchical formulation would rank in the 90th percentile, with a BACC = 65.5%. These are promising results, especially if one takes into account that we used simple regularization techniques (dropout and online data augmentation) and no external data to train our networks and prevent overfitting.

6. CONCLUSIONS

This paper explores the hierarchical organization of skin lesions in order to develop a deep learning system that performs a structured classification. Additionally, we performed comparative studies on the importance of lesion segmentation, color normalization, and evaluation metrics.

Our results show that a structured classification based on a distinction between malignant and benign lesions, followed by the diagnosis of the latter into different classes, leads to better results when combined with segmented lesions. Color normalization also improves the results, but plays a minor role. Finally, we have also shown that our approach compares favorably with other state-of-the-art methods.

Future work should focus on validating these results on a larger dataset that comprises more classes of non-melanocytic lesions.

7. REFERENCES

[1] R. L. Siegel, K. D. Miller, and A. Jemal, "Cancer statistics, 2018," CA: A Cancer Journal for Clinicians, vol. 68, pp. 7–30, 2018.

[2] S. Pathan, K. G. Prabhu, and P. C. S., "Techniques and algorithms for computer aided diagnosis of pigmented skin lesions - a review," Biomedical Signal Processing and Control, vol. 39, pp. 237–262, 2018.

[3] T. Mendonça, P. M. Ferreira, J. S. Marques, A. R. S. Marcal, and J. Rozeira, "PH2: A dermoscopic image database for research and benchmarking," in IEEE EMBC 2013, 2013, pp. 5437–5440.

[4] D. Gutman, N. C. F. Codella, M. E. Celebi, et al., "Skin lesion analysis toward melanoma detection: A challenge at the International Symposium on Biomedical Imaging (ISBI) 2016, hosted by the International Skin Imaging Collaboration (ISIC)," arXiv preprint arXiv:1605.01397, 2016.

[5] N. C. F. Codella, D. Gutman, M. E. Celebi, et al., "Skin lesion analysis toward melanoma detection: A challenge at the 2017 International Symposium on Biomedical Imaging (ISBI), hosted by the International Skin Imaging Collaboration (ISIC)," in Biomedical Imaging (ISBI 2018), 2018 IEEE 15th International Symposium on. IEEE, 2018, pp. 168–172.

[6] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[7] C. Barata, M. E. Celebi, and J. S. Marques, "A survey of feature extraction in dermoscopy image analysis of skin cancer," IEEE Journal of Biomedical and Health Informatics, 2018.

[8] G. Argenziano, H. P. Soyer, V. De Giorgi, et al., Interactive Atlas of Dermoscopy, EDRA Medical Publishing & New Media, 2000.

[9] N. C. F. Codella, J. Cai, M. Abedini, et al., "Deep learning, sparse coding, and SVM for melanoma recognition in dermoscopy images," in MLMI 2015, 2015, pp. 118–126.

[10] A. Esteva, B. Kuprel, R. A. Novoa, et al., "Dermatologist-level classification of skin cancer with deep neural networks," Nature, vol. 542, pp. 115–118, 2017.

[11] E. Valle, M. Fornaciali, A. Menegola, et al., "Data, depth, and design: learning reliable models for melanoma screening," arXiv preprint arXiv:1711.00441, 2017.

[12] B. Harangi, "Skin lesion classification with ensembles of deep convolutional neural networks," Journal of Biomedical Informatics, vol. 86, pp. 25–32, 2018.

[13] C. N. Vasconcelos and B. N. Vasconcelos, "Experiments using deep learning for dermoscopy image analysis," Pattern Recognition Letters, 2017.

[14] F. Perez, C. Vasconcelos, S. Avila, and E. Valle, "Data augmentation for skin lesion analysis," in OR 2.0 Context-Aware Operating Theaters, Computer Assisted Robotic Endoscopy, Clinical Image-Based Procedures, and Skin Image Analysis, pp. 303–311. Springer, 2018.

[15] A. Menegola, M. Fornaciali, R. Pires, et al., "Towards automated melanoma screening: Exploring transfer learning schemes," arXiv preprint arXiv:1609.01228, 2016.

[16] K. Shimizu, H. Iyatomi, M. E. Celebi, et al., "Four-class classification of skin lesions with task decomposition strategy," IEEE Transactions on Biomedical Engineering, vol. 62, pp. 274–283, 2015.

[17] S. Demyanov, R. Chakravorty, Z. Ge, et al., "Tree-loss function for training neural networks on weakly-labelled datasets," in ISBI 2017. IEEE, 2017, pp. 287–291.

[18] K. Matsunaga, A. Hamada, A. Minagawa, and H. Koga, "Image classification of melanoma, nevus and seborrheic keratosis by deep neural network ensemble," arXiv preprint arXiv:1703.03108, 2017.

[19] C. Barata, M. E. Celebi, and J. S. Marques, "Improving dermoscopy image classification using color constancy," IEEE Journal of Biomedical and Health Informatics, vol. 19, pp. 1146–1152, 2015.

[20] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in CVPR, 2017, vol. 1, p. 3.

[21] N. Srivastava, G. E. Hinton, A. Krizhevsky, et al., "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
