Barata2019 PDF
Institute for Systems and Robotics, Instituto Superior Técnico, Lisboa, Portugal
Fig. 1: Classification strategies: multi-class (left), hierarchical melanocytic vs. non-melanocytic (hier1 - mid), and hierarchical malignant vs. benign (hier2 - right). Here, o identifies the melanocytic class and p stands for the dropout probability.
The use of CNNs was extensively observed in the 2017 [5] and 2018¹ ISIC challenges. While in 2017 most participants showed a preference for ResNet, Inception, and ResNext architectures, in 2018 the use of deeper and more complex architectures, such as DenseNet and PNASNet, was also observed. Another difference between the two challenges is the use of ensembles of CNNs in 2018, which had already been pointed out by the challenge organizers as a way to improve the results [5]. Recently, several ensemble techniques have been proposed [11, 12], with promising results.

¹ [Link]

Some authors have devoted their work to studying specific aspects of the CNN that may improve the classification results of dermoscopy images. In particular, great importance has been given to the identification of suitable data augmentation strategies that may help deal with the limited amount of available data [13, 14]. Additionally, attention has been paid to the comparison between transfer learning with and without fine-tuning [15], to performing data augmentation on the test set [14], and to other relevant criteria (e.g., image size and selected architecture) [11].

Although hierarchical classification was investigated before using hand-crafted features [16], to the best of our knowledge, the application of this idea to CNNs has been poorly investigated in the dermoscopy field. The exception is the work of Demyanov et al. [17], which uses both clinical and dermoscopy images to train a ResNet-50 with a tree-loss function. That dataset is significantly different from the one used in our work, which contains only dermoscopy images. Moreover, we propose a simpler approach to impose a hierarchy on our classification procedure.

3. HIERARCHICAL CNN

Dermoscopy lesions are categorized in a hierarchical way, where the lesions are first grouped into melanocytic or non-melanocytic, according to their origin, and only then diagnosed into a finer category [8]. Although this hierarchy is well known in the literature, an evaluation of CNN architectures that perform a structured classification is still missing. We address this problem and compare three classification strategies: one based on a multi-class formulation (see Fig. 1, left) and two based on hierarchical classification (see Fig. 1, mid and right). Our dataset contains examples of non-melanocytic lesions (seborrheic keratosis - K) and melanocytic lesions (melanoma - M and nevi - N). With respect to the hierarchical strategies, we aim to infer whether it is better to: i) mimic dermatologists and first discriminate between non-melanocytic (K) and melanocytic (M and N) lesions - hier1; or ii) first discriminate between malignant (M) and benign (K and N) lesions - hier2.

4. EXPERIMENTAL SETUP

This section describes the experimental evaluation of the strategies proposed in Section 3. We also assess the role of several factors that may influence the performance of deep neural networks. In the following sections we identify the key aspects studied in the paper.

4.1. Dataset

For many years, the works devoted to skin cancer diagnosis used relatively small datasets, which usually comprised only examples of melanocytic lesions. Recently, the ISIC project started to release increasingly larger and more complex datasets associated with conference challenges. The challenges' datasets are particularly relevant, since they allow a fair comparison between methods and their performances. Therefore, in this work we will use the ISIC 2017-ISBI set [5], which is divided into training (2000 images), validation (150 images), and test (600 images) sets. The task of this challenge was to diagnose three classes of lesions: M, K, and N. Contrary to several of the challenge competitors, we will not augment the training set with external data, as we are interested in assessing how to make the most of a dataset, even if limited, to efficiently train deep learning architectures. Moreover, we want to ensure that our results are reproducible.

4.2. Pre-processing

It may be useful to perform several transformations to dermoscopy images before feeding them to a CNN. In this work
we will focus on two types of transformations: lesion segmentation and color normalization.

Fig. 2: Examples of pre-processed images: original (1st column); segmented and cropped (2nd column); normalized (3rd column).

Lesion segmentation corresponds to the separation between the lesion and the surrounding skin. In our experiments this amounts to cropping the original dermoscopy image with a tight bounding box around the lesion (see Fig. 2, 2nd column). Although the role of lesion segmentation is still an open issue in dermoscopy image analysis [7], it is important to understand how it influences the performance of CNN architectures.

Color normalization allows us to correct the colors of the dermoscopy images and reduce the variability introduced by the acquisition setup, as exemplified in Fig. 2, 3rd column. Similarly to the top-ranked approach of the ISIC-2017 challenge [18] and several of the participants of the ISIC-2018 challenge, we apply the color normalization strategy proposed in [19] to correct the image colors using their statistics. We set the value of p = 6.

After applying the aforementioned transformations, all of the images were resized to 299×299.

4.3. Network Training

Due to the reduced size of the training set, we will use the DenseNet-161 architecture pre-trained on the ImageNet dataset [20], comparing two approaches: feature extraction vs fine-tuning. In the feature extraction approach we will freeze all the layers except the decision one(s), which will be trained for our problem, while in the fine-tuning case the pre-trained weights will be used as a soft initialization.

All of the models will be trained using the Adam optimizer and a mini-batch approach, with a batch size of 5. The starting learning rate η will be η = 0.005 for transfer learning and η = 10^-5 for fine-tuning, with a decay rate of 0.5 every 40 epochs. Cross-entropy is the selected loss function.

It is crucial to train deep learning architectures that generalize well to new images. In this work we will rely on two strategies. The first one is online data augmentation, which consists of randomly flipping, rotating, cropping, and altering the colors of the training images in each epoch. We have picked this particular combination of transformations because they have been shown to improve the results of CNNs [13, 14]. Although online augmentation does not increase the size of the training set, it guarantees that the network "sees" a different version of the same image between epochs, which reduces the probability of the network memorizing it and improves generalization.

The other strategy is based on the use of dropout [21]. In particular, we will apply dropout with 50% probability before the decision layer(s), as shown in Fig. 1.

4.5. Unbalanced Data

The training set used in this work is very unbalanced, with the following proportions: 18.7% M, 12.9% K, and 68.6% N. Popular approaches to deal with this issue are to artificially augment the less frequent classes, to assign different weights to the classes in the cost function, or to combine the previous two.

In this work we will resort to weighting the cross-entropy losses of the training examples. In particular, we will assign the class weights based on their distribution:

w_c = #N / #N_c,    (1)

where #N is the size of the training set and #N_c is the number of training examples from class c ∈ {M, K, N}.

4.6. Evaluation Metrics

Finding the appropriate metrics to evaluate and compare the performance of classification systems is a challenging task. The metric used to rank the participants in the ISIC-2017 challenge was the average area under the curve (AUC) for the M and K diagnoses [5]. Thus, we will also apply this metric to evaluate the performance of the tested model configurations.

Although AUC is a suitable metric to compare models, it is difficult to infer the performance of the model for each of the classes solely by inspecting its value. In the ISIC-2018 challenge, the ranking procedure was changed to be based on the balanced accuracy metric (BACC), which averages the recall (Re) values of all the classes:

Re_c = #TP_c / #N_c,    (2)

where #TP_c is the number of true positives, i.e., the number of correctly classified examples from class c.

The experimental framework described in Section 4 was implemented using Tensorflow and one Titan Xp GPU. Overall,
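As a concrete illustration of the class weighting in Eq. (1), the sketch below computes w_c = #N / #N_c from hypothetical per-class counts matching the stated proportions (roughly 18.7% M, 12.9% K, and 68.6% N of the 2000-image training set). The counts and variable names are illustrative assumptions, not taken from the authors' code.

```python
# Class weighting of Eq. (1): w_c = #N / #N_c, so rarer classes get
# proportionally larger weights in the cross-entropy loss.
# Hypothetical counts, approximating 18.7% M, 12.9% K, 68.6% N.
counts = {"M": 374, "K": 258, "N": 1372}

total = sum(counts.values())                      # #N, the training-set size
weights = {c: total / n_c for c, n_c in counts.items()}

# The minority class K receives the largest weight and the majority class N
# the smallest, counteracting the class imbalance during training.
print({c: round(w, 2) for c, w in weights.items()})
```

With these counts the weights come out near 5.4 for M, 7.8 for K, and 1.5 for N, so misclassifying a keratosis example costs roughly five times as much as misclassifying a nevus.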
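The per-class recall of Eq. (2) and the resulting BACC can be sketched as follows. The toy label vectors and the helper name are hypothetical, chosen only to make the computation concrete; this is not the challenge's official scoring code.

```python
# Eq. (2): Re_c = #TP_c / #N_c; BACC averages Re_c over all classes.
def recall_per_class(y_true, y_pred, classes):
    recalls = {}
    for c in classes:
        n_c = sum(1 for t in y_true if t == c)            # examples of class c
        tp_c = sum(1 for t, p in zip(y_true, y_pred)
                   if t == c and p == c)                   # correctly classified
        recalls[c] = tp_c / n_c if n_c else 0.0
    return recalls

# Hypothetical ground-truth and predicted labels for six lesions.
y_true = ["M", "M", "K", "N", "N", "N"]
y_pred = ["M", "N", "K", "N", "N", "M"]

recalls = recall_per_class(y_true, y_pred, ["M", "K", "N"])
bacc = sum(recalls.values()) / len(recalls)
# Unlike plain accuracy, BACC is not dominated by the majority class N.
```

Here the recalls are 1/2 for M, 1 for K, and 2/3 for N, giving a BACC of about 0.72, whereas plain accuracy would be 4/6.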
[Figure: AUC (left) and BACC (right) of the Feat. and Fine variants of the multi, hier1, and hier2 models for the Full, Full_Norm, Cropped, and Cropped_Norm configurations.]
7. REFERENCES

[1] R. L. Siegel, K. D. Miller, and A. Jemal, "Cancer statistics, 2018," CA: A Cancer Journal for Clinicians, vol. 68, pp. 7–30, 2018.

[2] S. Pathan, K. G. Prabhu, and P. C. S., "Techniques and algorithms for computer aided diagnosis of pigmented skin lesions - a review," Biomedical Signal Processing and Control, vol. 39, pp. 237–262, 2018.

[3] T. Mendonça, P. M. Ferreira, J. S. Marques, A. R. S. Marcal, and J. Rozeira, "PH2: A dermoscopic image database for research and benchmarking," in IEEE EMBC 2013, 2013, pp. 5437–5440.

[4] D. Gutman, N. C. F. Codella, M. E. Celebi, et al., "Skin lesion analysis toward melanoma detection: A challenge at the international symposium on biomedical imaging (ISBI) 2016, hosted by the international skin imaging collaboration (ISIC)," arXiv preprint arXiv:1605.01397, 2016.

[5] N. C. F. Codella, D. Gutman, M. E. Celebi, et al., "Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (ISBI), hosted by the international skin imaging collaboration (ISIC)," in Biomedical Imaging (ISBI 2018), 2018 IEEE 15th International Symposium on. IEEE, 2018, pp. 168–172.

[6] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[7] C. Barata, M. E. Celebi, and J. S. Marques, "A survey of feature extraction in dermoscopy image analysis of skin cancer," IEEE Journal of Biomedical and Health Informatics, 2018.

[8] G. Argenziano, H. P. Soyer, V. De Giorgi, et al., Interactive Atlas of Dermoscopy, EDRA Medical Publishing & New Media, 2000.

[9] N. C. F. Codella, J. Cai, M. Abedini, et al., "Deep learning, sparse coding, and SVM for melanoma recognition in dermoscopy images," in MLMI 2015, 2015, pp. 118–126.

[12] B. Harangi, "Skin lesion classification with ensembles of deep convolutional neural networks," Journal of Biomedical Informatics, vol. 86, pp. 25–32, 2018.

[13] C. N. Vasconcelos and B. N. Vasconcelos, "Experiments using deep learning for dermoscopy image analysis," Pattern Recognition Letters, 2017.

[14] F. Perez, C. Vasconcelos, S. Avila, and E. Valle, "Data augmentation for skin lesion analysis," in OR 2.0 Context-Aware Operating Theaters, Computer Assisted Robotic Endoscopy, Clinical Image-Based Procedures, and Skin Image Analysis, pp. 303–311. Springer, 2018.

[15] A. Menegola, M. Fornaciali, R. Pires, et al., "Towards automated melanoma screening: Exploring transfer learning schemes," arXiv preprint arXiv:1609.01228, 2016.

[16] K. Shimizu, H. Iyatomi, M. E. Celebi, et al., "Four-class classification of skin lesions with task decomposition strategy," IEEE Transactions on Biomedical Engineering, vol. 62, pp. 274–283, 2015.

[17] S. Demyanov, R. Chakravorty, Z. Ge, et al., "Tree-loss function for training neural networks on weakly-labelled datasets," in ISBI 2017. IEEE, 2017, pp. 287–291.

[18] K. Matsunaga, A. Hamada, A. Minagawa, and H. Koga, "Image classification of melanoma, nevus and seborrheic keratosis by deep neural network ensemble," arXiv preprint arXiv:1703.03108, 2017.

[19] C. Barata, M. E. Celebi, and J. S. Marques, "Improving dermoscopy image classification using color constancy," IEEE Journal of Biomedical and Health Informatics, vol. 19, pp. 1146–1152, 2015.

[20] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in CVPR, 2017, vol. 1, p. 3.

[21] N. Srivastava, G. E. Hinton, A. Krizhevsky, et al., "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.