
Medical Image Analysis 63 (2020) 101694


Convolutional neural networks for classification of Alzheimer's disease: Overview and reproducible evaluation
Junhao Wen a,b,c,d,e,†, Elina Thibeau-Sutre a,b,c,d,e,†, Mauricio Diaz-Melo e,a,b,c,d, Jorge Samper-González e,a,b,c,d, Alexandre Routier e,a,b,c,d, Simona Bottani e,a,b,c,d, Didier Dormont e,a,b,c,d,f, Stanley Durrleman e,a,b,c,d, Ninon Burgos a,b,c,d,e, Olivier Colliot a,b,c,d,e,f,g,∗, for the Alzheimer's Disease Neuroimaging Initiative#, the Australian Imaging Biomarkers and Lifestyle flagship study of ageing##

a Institut du Cerveau et de la Moelle épinière, ICM, Paris F-75013, France
b Sorbonne Université, Paris F-75013, France
c Inserm, U 1127, Paris F-75013, France
d CNRS, UMR 7225, Paris F-75013, France
e Inria, Aramis project-team, Paris F-75013, France
f Department of Neuroradiology, AP-HP, Hôpital de la Pitié-Salpêtrière, Paris F-75013, France
g Department of Neurology, AP-HP, Hôpital de la Pitié-Salpêtrière, Paris F-75013, France

Article info

Article history:
Received 1 April 2019
Revised 23 March 2020
Accepted 27 March 2020
Available online 1 May 2020

Keywords: Convolutional neural network; Reproducibility; Alzheimer's disease classification; Magnetic resonance imaging

Abstract

Numerous machine learning (ML) approaches have been proposed for automatic classification of Alzheimer's disease (AD) from brain imaging data. In particular, over 30 papers have proposed to use convolutional neural networks (CNN) for AD classification from anatomical MRI. However, the classification performance is difficult to compare across studies due to variations in components such as participant selection, image preprocessing or validation procedure. Moreover, these studies are hardly reproducible because their frameworks are not publicly accessible and because implementation details are lacking. Lastly, some of these papers may report a biased performance due to inadequate or unclear validation or model selection procedures. In the present work, we aim to address these limitations through three main contributions. First, we performed a systematic literature review. We identified four main types of approaches: i) 2D slice-level, ii) 3D patch-level, iii) ROI-based and iv) 3D subject-level CNN. Moreover, we found that more than half of the surveyed papers may have suffered from data leakage and thus reported biased performance. Our second contribution is the extension of our open-source framework for classification of AD using CNN and T1-weighted MRI. The framework comprises previously developed tools to automatically convert ADNI, AIBL and OASIS data into the BIDS standard, and a modular set of image preprocessing procedures, classification architectures and evaluation procedures dedicated to deep learning. Finally, we used this framework to rigorously compare different CNN architectures. The data was split into training/validation/test sets at the very beginning and only the training/validation sets were used for model selection. To avoid any overfitting, the test sets were left untouched until the end of the peer-review process. Overall, the different 3D approaches (3D-subject, 3D-ROI, 3D-patch) achieved similar performances while that of the 2D slice approach was lower. Of note, the different CNN approaches did not perform better than an SVM with voxel-based features. The different approaches generalized well to similar populations but not to datasets with different inclusion criteria or demographical characteristics. All the code of the framework and the experiments is publicly available: general-purpose tools have been integrated into the Clinica software (www.clinica.run) and the paper-specific code is available at: https://github.com/aramis-lab/AD-DL.

© 2020 Elsevier B.V. All rights reserved.


https://doi.org/10.1016/j.media.2020.101694
1361-8415/© 2020 Elsevier B.V. All rights reserved.

∗ Corresponding author at: ICM – Brain and Spinal Cord Institute, ARAMIS team, Pitié-Salpêtrière Hospital, 47-83 boulevard de l'Hôpital, 75651 Paris Cedex 13, France.
E-mail address: [email protected] (O. Colliot).
† Denotes shared first authorship.
# Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI
and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf

## Data used in the preparation of this article was obtained from the Australian Imaging Biomarkers and Lifestyle flagship study of ageing (AIBL) funded by the Commonwealth Scientific and Industrial Research Organisation (CSIRO), which was made available at the ADNI database (www.loni.usc.edu/ADNI). The AIBL researchers contributed data but did not participate in analysis or writing of this report. AIBL researchers are listed at www.aibl.csiro.au.

1. Introduction

Alzheimer's disease (AD), a chronic neurodegenerative disease causing the death of nerve cells and tissue loss throughout the brain, usually starts slowly and worsens over time (McKhann et al., 1984). AD is expected to affect 1 out of 85 people in the world by the year 2050 (Brookmeyer et al., 2007). The cost of caring for AD patients is also expected to rise dramatically, hence the need for individual computer-aided systems for early and accurate AD diagnosis.

Magnetic resonance imaging (MRI) offers the possibility to study pathological brain changes associated with AD in vivo (Ewers et al., 2011). Over the past decades, neuroimaging data have been increasingly used to characterize AD by means of machine learning (ML) methods, offering promising tools for individualized diagnosis and prognosis (Falahati et al., 2014; Haller et al., 2011; Rathore et al., 2017). A large number of studies have proposed to use predefined features (including regional and voxel-based measurements) obtained from image preprocessing pipelines in combination with different types of classifiers, such as support vector machines (SVM) or random forests. Such an approach is often referred to as conventional ML (LeCun et al., 2015). More recently, deep learning (DL), as a newly emerging ML methodology, has made a big leap in the domain of medical imaging (Bernal et al., 2018; Liu et al., 2018a; Lundervold and Lundervold, 2018; Razzak et al., 2018; D. Wen et al., 2018a). As the most widely used DL architecture, the convolutional neural network (CNN) has attracted huge attention thanks to its great success in image classification (Krizhevsky et al., 2012). Contrary to conventional ML, DL allows the automatic abstraction of low-to-high level latent feature representations (e.g. lines, dots or edges for low-level features, and objects or larger shapes for high-level features). Thus, one can hypothesize that DL depends less on image preprocessing and on complex prior procedures, such as feature selection, resulting in a more objective and less bias-prone process (LeCun et al., 2015).

Very recently, numerous studies have proposed to assist the diagnosis of AD by means of CNNs (Aderghal et al., 2018, 2017a, 2017b; Bäckström et al., 2018; Basaia et al., 2019; Cheng et al., 2017; Cheng and Liu, 2017; Farooq et al., 2017; Gunawardena et al., 2017; Hon and Khan, 2017; Hosseini-Asl et al., 2018; Islam and Zhang, 2018, 2017; Korolev et al., 2017; Lian et al., 2018; Li et al., 2018, 2017; Lin et al., 2018; Liu et al., 2018a, 2018e; Qiu et al., 2018; Senanayake et al., 2018; Shmulev et al., 2018; Taqi et al., 2018; Valliani and Soni, 2017; Vu et al., 2018, 2017; Wang et al., 2019, 2017; Wang et al., 2018a; Wu et al., 2018). However, classification results among these studies are not directly comparable because they differ in terms of: i) sets of participants; ii) image preprocessing procedures; iii) cross-validation (CV) procedure; and iv) reported evaluation metrics. It is thus impossible to determine which approach performs best. The generalization ability of these approaches also remains unclear. In DL, the use of fully independent test sets is even more critical than in conventional ML because of the very high flexibility, with numerous possible choices of model architecture and training hyperparameters. Assessing generalization to other studies is also critical to ensure that the characteristics of the considered study have not been overfitted. In previous works, the generalization may be questionable due to inadequate validation procedures, the absence of an independent test set, or a test set chosen from the same study as the training and validation sets.

In our previous studies (Samper-González et al., 2018; Wen et al., 2018b), we proposed an open-source framework for the reproducible evaluation of AD classification using conventional ML methods. The framework comprises: i) tools to automatically convert three publicly available datasets into the Brain Imaging Data Structure (BIDS) format (Gorgolewski et al., 2016) and ii) a modular set of preprocessing pipelines, feature extraction and classification methods, together with an evaluation framework, that provide a baseline for benchmarking the different components. We demonstrated the use of this framework on positron emission tomography (PET), T1-weighted (T1w) MRI (Samper-González et al., 2018) and diffusion MRI data (Wen et al., 2018a).

This work presents three main contributions. We first reviewed and summarized the different studies using CNNs and anatomical MRI for AD classification. In particular, we reviewed their validation procedures and the possible presence of data leakage. We then extended our open-source framework for reproducible evaluation of AD classification to DL approaches by implementing a modular set of image preprocessing procedures, classification architectures and evaluation procedures dedicated to DL. Finally, we used this framework to rigorously assess the performance of different CNN architectures, representative of the literature. We studied the influence of key components on the classification accuracy, we compared the proposed CNNs to a conventional ML approach based on a linear SVM, and we assessed the generalization ability of the CNN models within (training and testing on ADNI) and across datasets (training on ADNI and testing on AIBL or OASIS).

All the code of the framework and the experiments is publicly available: general-purpose tools have been integrated into Clinica (http://www.clinica.run) (Routier et al., 2018), an open-source software platform that we developed to process data from neuroimaging studies, and the paper-specific code is available at: https://github.com/aramis-lab/AD-DL. The tagged version v.0.0.1 corresponds to the version of the code used to obtain the results of the paper. The trained models are available on Zenodo and their associated DOI is 10.5281/zenodo.3491003.

2. State of the art

We performed an online search of publications concerning the classification of AD using neural networks based on anatomical MRI in PubMed and Scopus, from January 1990 to the 15th of January 2019. This resulted in 406 records, which were screened according to their abstract, type and content (more details are provided in online supplementary eMethod 1) to retain only those focused on the classification of AD stages using at least anatomical MRI as input of a neural network. This resulted in 71 studies. Out of these 71, 32 studies used CNNs on image data in an end-to-end framework, which is the focus of our work.

Depending on the disease stage that is studied, different classification experiments can be performed. We present the main tasks considered in the literature in Section 2.1. We found that a substantial proportion of the studies performed a biased evaluation of results due to the presence of data leakage. These issues are discussed in Section 2.2. We then review the 32 studies that used end-to-end CNNs on image data, the main focus of this work (Section 2.3). Finally, we briefly describe other studies that were kept in our bibliography but that are out of our scope (Section 2.4).

Table 1
Summary of the studies performing classification of AD using CNNs on anatomical MRI. Studies are categorized according to the potential presence of data leakage: (A) studies without data leakage; (B) studies with potential data leakage. The number of citations was found with Google Scholar on 16th of January 2020.

(A) Studies in which no data leakage was detected

Study | AD vs CN | sMCI vs pMCI | MCI vs CN | AD vs MCI | Multi-class | Approach | Data leakage | Citations
(Aderghal et al., 2017b) | ACC=0.84 | – | ACC=0.65 | ACC=0.67† | – | ROI-based | None detected | 16
(Aderghal et al., 2018) | BA=0.90 | – | BA=0.73 | BA=0.83 | – | ROI-based | None detected | 9
(Bäckström et al., 2018)* | ACC=0.90 | – | – | – | – | 3D subject-level | None detected | 20
(Cheng et al., 2017) | ACC=0.87 | – | – | – | – | 3D patch-level | None detected | 12
(Cheng and Liu, 2017) | ACC=0.85 | – | – | – | – | 3D subject-level | None detected | 8
(Islam and Zhang, 2018) | – | – | – | – | ACC=0.93 (1)† | 2D slice-level | None detected | 23
(Korolev et al., 2017) | ACC=0.80 | – | – | – | – | 3D subject-level | None detected | 72
(Li et al., 2017) | ACC=0.88 | – | – | – | – | 3D subject-level | None detected | 12
(Li et al., 2018) | ACC=0.90 | – | ACC=0.74† | – | – | 3D patch-level | None detected | 7
(Lian et al., 2018) | ACC=0.90 | ACC=0.80† | – | – | – | 3D patch-level | None detected | 30
(Mingxia Liu et al., 2018a) | ACC=0.91 | ACC=0.78† | – | – | – | 3D patch-level | None detected | 59
(Mingxia Liu et al., 2018b) | ACC=0.91 | – | – | – | – | 3D patch-level | None detected | 26
(Qiu et al., 2018) | – | – | ACC=0.83† | – | – | 2D slice-level | None detected | 8
(Senanayake et al., 2018) | ACC=0.76 | – | ACC=0.75 | ACC=0.76 | – | 3D subject-level | None detected | 3
(Shmulev et al., 2018) | – | ACC=0.62 | – | – | – | 3D subject-level | None detected | 5
(Valliani and Soni, 2017) | ACC=0.81 | – | – | – | ACC=0.57 (2) | 2D slice-level | None detected | 8

(B) Studies with potential data leakage

Study | AD vs CN | sMCI vs pMCI | MCI vs CN | AD vs MCI | Multi-class | Approach | Data leakage (type) | Citations
(Aderghal et al., 2017a) | ACC=0.91 | – | ACC=0.66 | ACC=0.70 | – | ROI-based | Unclear (b,c) | 13
(Basaia et al., 2019) | BA=0.99 | BA=0.75 | – | – | – | 3D subject-level | Unclear (b) | 25
(Hon and Khan, 2017) | ACC=0.96 | – | – | – | – | 2D slice-level | Unclear (a,c) | 32
(Hosseini-Asl et al., 2018) | ACC=0.99 | – | ACC=0.94 | ACC=1.00 | ACC=0.95 (2) | 3D subject-level | Unclear (a) | 107
(Islam and Zhang, 2017) | – | – | – | – | ACC=0.74 (1)† | 2D slice-level | Unclear (b,c) | 23
(Lin et al., 2018) | ACC=0.89 | ACC=0.73 | – | – | – | ROI-based | Unclear (b) | 22
(Manhua Liu et al., 2018c) | ACC=0.85 | ACC=0.74 | – | – | – | 3D patch-level | Unclear (d) | 39
(Taqi et al., 2018) | ACC=1.00 | – | – | – | – | 2D slice-level | Unclear (b) | 16
(Vu et al., 2017) | ACC=0.85 | – | – | – | – | 3D subject-level | Unclear (a) | 20
(Wang et al., 2018b) | ACC=0.98 | – | – | – | – | 2D slice-level | Unclear (b) | 49
(Bäckström et al., 2018)* | ACC=0.99 | – | – | – | – | 3D subject-level | Clear (a) | 20
(Farooq et al., 2017) | – | – | – | – | ACC=0.99 (3)† | 2D slice-level | Clear (a,c) | 31
(Gunawardena et al., 2017) | – | – | – | – | ACC=0.96 (2) | 3D subject-level | Clear (a,b) | 8
(Vu et al., 2018) | ACC=0.86 | – | ACC=0.86 | ACC=0.77 | ACC=0.80 (2) | 3D subject-level | Clear (a,c) | 8
(Wang et al., 2017) | – | – | ACC=0.91 | – | – | 2D slice-level | Clear (a,c) | 11
(Wang et al., 2019) | ACC=0.99 | – | ACC=0.98 | ACC=0.94 | ACC=0.97 (2) | 3D subject-level | Clear (b) | 17
(Wu et al., 2018) | – | – | – | – | ACC=0.95 (4)† | 2D slice-level | Clear (a,b) | 7

Types of data leakage: a: wrong dataset split; b: absence of independent test set; c: late split; d: biased transfer learning (see Section 2.2).
* In (Bäckström et al., 2018), data leakage was introduced on purpose to study its influence, which explains its presence in both categories.
† Use of accuracy on a severely imbalanced dataset (one class is less than half of the other), leading to an over-optimistic estimation of performance.
(1) CN vs mild vs moderate vs severe; (2) AD vs MCI vs CN; (3) AD vs LMCI vs EMCI vs CN; (4) sMCI vs pMCI vs CN.
ACC: accuracy; BA: balanced accuracy.
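Leakage types (a) wrong dataset split and (c) late split, as listed under Table 1, are straightforward to avoid in practice. The following sketch (an illustration on toy data, not code from any of the surveyed papers) splits at the subject level so that all images of a subject fall on the same side, and applies augmentation only after the split:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy longitudinal dataset: 10 subjects, 3 images each (e.g. time points),
# with one diagnosis label per subject.
n_subjects, n_images_per_subject = 10, 3
subject_ids = np.repeat(np.arange(n_subjects), n_images_per_subject)
images = rng.random((n_subjects * n_images_per_subject, 8))  # stand-in for image data
labels = np.repeat(rng.integers(0, 2, n_subjects), n_images_per_subject)

# Leakage type a (wrong dataset split) is avoided by splitting SUBJECTS,
# not images: all images of a subject end up in the same set.
shuffled_subjects = rng.permutation(n_subjects)
held_out_subjects = set(shuffled_subjects[:3])      # ~30% of subjects held out
train_mask = np.array([s not in held_out_subjects for s in subject_ids])

train_images, test_images = images[train_mask], images[~train_mask]

# Leakage type c (late split) is avoided by augmenting only the training
# images, after the split (here a trivial additive-noise augmentation).
augmented_train = np.concatenate(
    [train_images, train_images + rng.normal(0.0, 0.01, train_images.shape)]
)

# No subject appears on both sides of the split.
assert set(subject_ids[train_mask]).isdisjoint(subject_ids[~train_mask])
```

An image-level split of the same data (shuffling the 30 rows directly) would almost certainly place images of the same subject in both sets, which is the biased split shown by (Bäckström et al., 2018) to inflate accuracy.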

Designing DL approaches for MRI-based classification of AD requires expertise in DL, MRI processing and AD. Such knowledge might be difficult to acquire for newcomers to the field. Therefore, we present a brief introduction to these topics in online supplementary eMethod 2 and 3. Readers can also refer to (Goodfellow et al., 2016) about DL and (Bankman, 2008) for MRI processing.

2.1. Main classification tasks

Even though its clinical relevance is limited, differentiating patients with AD from cognitively normal subjects (CN), i.e. AD vs CN, is the most widely addressed task: 25 of the 32 studies presenting an end-to-end CNN framework report results for this task (Table 1). Before the development of dementia, patients go through a phase called mild cognitive impairment (MCI), during which they have objective deficits that are not severe enough to result in dementia. Identifying the early stage of AD by differentiating MCI patients from CN subjects (MCI vs CN) is another task of interest, reported in nine studies. Patients with MCI may remain stable or subsequently progress to AD dementia or to another type of dementia. Distinguishing MCI subjects who will progress to AD (denoted as pMCI) from those who will remain stable (denoted as sMCI) would allow predicting the group of subjects who will likely develop the disease. This task (sMCI vs pMCI) has been performed in seven studies. Other experiments performed in the 32 studies on which we focus include differentiating AD from MCI patients (AD vs MCI) and multi-class tasks.

2.2. Main causes of data leakage

Unbiased evaluation of classification algorithms is critical to assess their potential clinical value. A major source of bias is data leakage, which refers to the use of test data in any part of the training process (Kriegeskorte et al., 2009; Rathore et al., 2017). Data leakage can be difficult to detect for DL approaches as they
can be complex and very flexible. We assessed the prevalence of data leakage among the papers described in Section 2.3 and analyzed its causes. The articles were labeled into three categories: i) Clear, when data leakage was explicitly witnessed; ii) Unclear, when no sufficient explanation was offered; and iii) None detected. The results are summarized in the last column of Table 1. They were further categorized according to the cause of data leakage. Four main causes were identified:

1. Wrong data split. Not splitting the dataset at the subject level when defining the training, validation and test sets can result in data from the same subject appearing in several sets. This problem can occur when patches or slices are extracted from a 3D image, or when images of the same subject are available at multiple time points. (Bäckström et al., 2018) showed that, using a longitudinal dataset, a biased dataset split (at the image level) can result in an accuracy increase of 8 percentage points compared to an unbiased split (at the subject level).

2. Late split. Procedures such as data augmentation, feature selection or autoencoder (AE) pre-training must never use the test set and thus must be performed after the training/validation/test split to avoid biasing the results. For example, if data augmentation is performed before isolating the test data from the training/validation data, then images generated from the same original image may be found in both sets, leading to a problem similar to the wrong data split.

3. Biased transfer learning. Transfer learning can result in data leakage when the source and target domains overlap, for example when a network pre-trained on the AD vs CN task is used to initialize a network for the MCI vs CN task and the CN subjects in the training or validation sets of the source task (AD vs CN) are also in the test set of the target task (MCI vs CN).

4. Absence of an independent test set. The test set should only be used to evaluate the final performance of the classifier, not to choose the training hyperparameters (e.g. learning rate) of the model. A separate validation set must be used beforehand for hyperparameter optimization.

Note that we did not consider data leakage occurring when designing the network architecture, possibly chosen through successive evaluations on the test set, as the large majority of studies do not make this step explicit. All these causes of data leakage may not have the same impact on performance. For instance, a wrong data split in a longitudinal dataset or at the slice level is likely more damaging than a late split for AE pre-training.

2.3. Classification of AD with end-to-end CNNs

This section focuses on CNNs applied to a Euclidean space (here a 2D or 3D image) in an end-to-end framework (from the input to the classification). A summary of these studies can be found in Table 1. The table indicates whether data leakage was potentially present, which could have biased the performance upwards. We categorized studies according to the type of input of the network: i) 2D slice-level, ii) 3D patch-level, iii) ROI-based and iv) 3D subject-level.

2.3.1. 2D slice-level CNN

Several studies used 2D CNNs with input composed of the set of 2D slices extracted from the 3D MRI volume (Farooq et al., 2017; Gunawardena et al., 2017; Hon and Khan, 2017; Islam and Zhang, 2018, 2017; Qiu et al., 2018; Taqi et al., 2018; Valliani and Soni, 2017; Wang et al., 2017; Wang et al., 2018a; Wu et al., 2018). An advantage of this approach is that existing CNNs that have had huge success for natural image classification, e.g. ResNet (He et al., 2016) and VGGNet (Simonyan and Zisserman, 2014), can be easily borrowed and used in a transfer learning fashion. Another advantage is the increased number of training samples, as many slices can be extracted from a single 3D image.

In this subsection of the bibliography, we found only one study in which neither data leakage was detected nor biased metrics were used (Valliani and Soni, 2017). They used a single axial slice per subject (taken in the middle of the 3D volume) to compare the ResNet (He et al., 2016) to an original CNN with only one convolutional layer and two fully connected (FC) layers. They studied the impact of both transfer learning, by initializing their networks with models trained on ImageNet, and data augmentation with affine transformations. They conclude that the ResNet architecture is more efficient than their baseline CNN and that pre-training and data augmentation improve the accuracy of the ResNet architecture.

In all other studies, we detected a problem in the evaluation: either data leakage was present (or at least suspected) (Farooq et al., 2017; Gunawardena et al., 2017; Hon and Khan, 2017; Islam and Zhang, 2017; Taqi et al., 2018; Wang et al., 2017; Wang et al., 2018a; Wu et al., 2018) or accuracy was computed on a severely imbalanced dataset (one class being less than half the size of the other) (Islam and Zhang, 2018; Qiu et al., 2018). These studies differ in terms of slice selection: i) one study used all slices of a given plane (except the very first and last ones, which are not informative) (Farooq et al., 2017); ii) other studies selected several slices using an automatic (Hon and Khan, 2017; Wu et al., 2018) or manual criterion (Qiu et al., 2018); iii) one study used only one slice (Wang et al., 2018a). Working with several slices requires fusing the classifications obtained at the slice level into a classification at the subject level. Only one study (Qiu et al., 2018) explained how this fusion was performed. Other studies did not implement fusion and reported the slice-level accuracy (Farooq et al., 2017; Gunawardena et al., 2017; Hon and Khan, 2017; Wang et al., 2017; Wu et al., 2018), or it is unclear whether the accuracy was computed at the slice or subject level (Islam and Zhang, 2018, 2017; Taqi et al., 2018).

The main limitation of the 2D slice-level approach is that MRI is 3-dimensional, whereas the 2D convolutional filters analyze all slices of a subject independently. Moreover, there are many ways to select the slices used as input (as all of them may not be informative), and slice-level accuracy and subject-level accuracy are often confused.

2.3.2. 3D patch-level CNN

To compensate for the absence of 3D information in the 2D slice-level approach, some studies focused on 3D patch-level classification (see Table 1). In these frameworks, the input is composed of a set of 3D patches extracted from an image. In principle, this could result, as in the 2D slice-level approach, in a larger sample size, since the number of samples would be the number of patches (and not the number of subjects). However, this potential advantage is not exploited in the surveyed papers because they trained independent CNNs for each patch. Additional advantages of patches are the lower memory usage, which may be useful when resources are limited, and the lower number of parameters to learn. However, this last advantage holds only when the same network is used for all patches.

Two studies (Cheng et al., 2017; Liu et al., 2018d) used very large patches. Specifically, they extracted 27 overlapping 3D patches of size 50 × 41 × 40 voxels covering the whole volume of the MR image (100 × 81 × 80 voxels). They individually trained 27 convolutional networks (one per patch) comprising four convolutional layers and two FC layers. Then, an ensemble CNN was trained to provide a decision at the subject level. This ensemble CNN is partly initialized with the weights of the previously trained CNNs. (Liu et al., 2018a) used exactly the same architecture as
(Cheng et al., 2017) and enriched it with a fusion of PET and MRI inputs. They also gave the results obtained using the MRI modality only, which is the result reported in Table 1.

(Li et al., 2018) used smaller patches (32 × 32 × 32). By decreasing the size of the patches, they had to take into account a possible discrepancy between patches taken at the same coordinates in different subjects. To avoid this dissimilarity between subjects without performing a non-linear registration, they clustered their patches using k-means. They then trained one CNN per cluster and assembled the features obtained at the cluster level in a similar way to (Cheng et al., 2017; Liu et al., 2018a).

The following three studies (Lian et al., 2018; Liu et al., 2018a, 2018e) used even smaller patches (19 × 19 × 19). Only a subset of patches, chosen based on anatomical landmarks, is used. These anatomical landmarks are found in a supervised manner via a group comparison between AD and CN subjects. This method requires a non-linear registration to build the correspondence between voxels of different subjects. Similarly to other studies, in (Liu et al., 2018a), one CNN is pre-trained for each patch and the outputs are fused to obtain the diagnosis of a subject. The approach of (Liu et al., 2018e) is slightly different, as the authors consider that a patch cannot be labelled with a diagnosis; hence they do not train one CNN per patch individually before ensemble learning, but train the ensemble network from scratch. Finally, (Lian et al., 2018) proposed a weakly-supervised guidance: the loss of the network is based on the final classification scores at the subject level as well as the intermediate classifications done at the patch and region levels.

There are far fewer data leakage problems in this section, with only a doubt about the validity of the transfer learning between the AD vs CN and MCI vs CN tasks in (Liu et al., 2018a) because of a lack of explanations. Nevertheless, this has no impact on the result of the AD vs CN task, for which we did not detect any problem of data leakage.

As in the 2D slice-level approaches, where a selection of slices must be made, one must here choose the size and stride of the patches. The choice of these hyperparameters will depend on the MRI preprocessing (e.g. a non-linear registration is likely needed for smaller patches). Nevertheless, note that the impact of these hyperparameters has been studied in the above-cited studies (which has not been done for the 2D slice-level approaches). The main drawback of these approaches is the complexity of the framework: one network is trained for each patch position and these networks are successively fused and retrained at different levels of representation (region level, subject level).

2.3.3. ROI-based CNN

3D patch-level methods use the whole MRI by slicing it into smaller inputs. However, most of these patches are not informative as they contain parts of the brain that are not affected by the disease. Methods based on regions of interest (ROI) overcome this issue by focusing on regions which are known to be informative. In this way, the complexity of the framework can be decreased, as fewer inputs are used to train the networks. In all the following studies, the chosen ROI was the hippocampus, which is well known to be affected early in AD (Dickerson et al., 2001; Salvatore et al., 2015; Schuff et al., 2009). Studies differ in the definition of the hippocampal ROI.

(Aderghal et al., 2018, 2017a, 2017b) performed a linear registration and defined a 3D bounding box comprising all the voxels of [...] of the CNN is made of two convolutional layers associated with max pooling, and one FC layer. In the second study (Aderghal et al., 2017a), all the views (sagittal, coronal and axial) are used to generate patches. Then, three patches are generated per subject, and three networks (one per view) are trained and then fused. The last study from the same author (Aderghal et al., 2018) focuses on transfer learning from anatomical MRI to diffusion MRI, which is out of our scope.

In (Lin et al., 2018), a non-linear registration was performed to obtain a voxel correspondence between the subjects, and the voxels belonging to the hippocampus² were identified after a segmentation implemented with MALP-EM (Ledig et al., 2015). 151 patches were extracted per image, with sampling positions fixed during the experiments. Each of them was made of the concatenation of three 2D slices along the three possible planes (sagittal, coronal and axial) originating at one voxel belonging to the hippocampus.

The main drawback of this methodology is that it studies only one (or a few) regions, while AD alterations span multiple brain areas. However, it may reduce the risk of overfitting because the inputs are smaller (∼3000 voxels in our bibliography) and fewer than in methods allowing patch combinations.

2.3.4. 3D subject-level CNN

Recently, with the boost of high-performance computing resources, more studies used a 3D subject-level approach (see Table 1). In this approach, the whole MRI is used at once and the classification is performed at the subject level. The advantage is that the spatial information is fully integrated.

Some studies adapted two classical architectures, ResNet (He et al., 2016) and VGGNet (Simonyan and Zisserman, 2014), to fit the whole MRI (Korolev et al., 2017; Shmulev et al., 2018). In both cases, the classification accuracies obtained with VGGNet and ResNet are equivalent, and their best accuracies are lower than those of other 3D subject-level approaches. Another study (Senanayake et al., 2018) used a set of complex modules from classical architectures such as ResNet and DenseNet (dilated convolutions, dense blocks and residual blocks), also without success.

Other studies defined original architectures (Bäckström et al., 2018; Basaia et al., 2019; Cheng and Liu, 2017; Hosseini-Asl et al., 2018; Li et al., 2017; Vu et al., 2018, 2017; Wang et al., 2019). We detected data leakage in all of these studies except (Bäckström et al., 2018; Cheng and Liu, 2017; Li et al., 2017). (Bäckström et al., 2018) and (Cheng and Liu, 2017) had a similar approach, training one network from scratch on augmented data. One crucial difference between these two studies is the preprocessing step: (Bäckström et al., 2018) used non-linear registration whereas (Cheng and Liu, 2017) performed no registration. (Li et al., 2017) proposed a more complex framework fusing the results of a CNN and three networks pre-trained with an AE.

For the other studies using original architectures, we suspect data leakage (Basaia et al., 2019; Hosseini-Asl et al., 2018; Vu et al., 2018, 2017; Wang et al., 2019); hence their performance cannot be fairly compared to the previous ones. However, we noted that (Hosseini-Asl et al., 2018; Vu et al., 2018, 2017) studied the impact of pre-training with an AE and concluded that it improved their results (accuracy increased by 5 to 10 percentage points).

In the 3D subject-level approach, the number of samples is small compared to the number of parameters to optimize. Indeed, there is one sample per subject, typically a few hundreds to thou-
the hippocampus according to a segmentation with the AAL atlas. sands of subjects in a dataset, thus increasing the risk of overfit-
These three studies used a “2D+ε approach” with patches made ting.
of three neighbouring 2D slices in the hippocampus. As they use
only one or three patches per patient, they do not cover the en-
tire region. The first study (Aderghal et al., 2017b) only uses the 2
In their original paper, this anatomical structure was called the “hippopotamus”
sagittal view and classifies one patch per patient. The architecture (sic).
6 J. Wen, E. Thibeau-Sutre and M. Diaz-Melo et al. / Medical Image Analysis 63 (2020) 101694
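Data leakage of the kind flagged above typically occurs when the train/test split is performed at the slice or patch level, so that data from one subject end up on both sides of the split. A minimal sketch of a leakage-free, subject-level split (names and data layout are illustrative, not taken from any of the surveyed studies):

```python
import random

# Illustrative sketch: splitting by subject prevents slices of the same
# subject from appearing in both the training and test sets, a common
# source of data leakage in slice- and patch-level approaches.

def subject_level_split(slices, test_ratio=0.2, seed=0):
    """slices: list of (subject_id, slice_data) pairs. Split by subject."""
    subjects = sorted({sid for sid, _ in slices})
    rng = random.Random(seed)
    rng.shuffle(subjects)
    n_test = max(1, int(len(subjects) * test_ratio))
    test_subjects = set(subjects[:n_test])
    train = [s for s in slices if s[0] not in test_subjects]
    test = [s for s in slices if s[0] in test_subjects]
    return train, test

# 10 hypothetical subjects with 3 slices each
slices = [(f"sub-{i:02d}", j) for i in range(10) for j in range(3)]
train, test = subject_level_split(slices)
train_ids = {sid for sid, _ in train}
test_ids = {sid for sid, _ in test}
assert not (train_ids & test_ids)  # no subject appears in both sets
```

A slice-level split (shuffling `slices` directly) would almost surely violate this last assertion, which is exactly the bias discussed in the surveyed studies.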
2.3.5. Conclusion
A high number of these 32 studies presented a biased performance because of data leakage: 10 were labeled as Unclear because of lack of explanations, and 6 as Clear (we do not count here the study of (Bäckström et al., 2018), as data leakage was done deliberately to study its impact). This means that about 50% of the surveyed studies could report biased results.
In addition to that problem, most studies are not comparable because the datasets used, the subjects selected among them and the preprocessing performed are different. Furthermore, these studies often do not motivate the choice of their architecture or hyperparameters. It might be that many of them have been tried (but not reported), thereby resulting in a biased performance on the test set. Finally, the code and key implementation details (such as hyperparameter values) are often not available, making them difficult if not impossible to reproduce.

2.4. Other deep learning approaches for AD classification
Several studies found during our literature search are out of our scope: either CNNs were not used in an end-to-end manner or not applied to images, other network architectures were implemented, or the approach required longitudinal or multimodal data.
In several studies, the CNN is used as a feature extractor only and the classification is performed using either a random forest (Chaddad et al., 2018), SVM with linear or polynomial kernels and logistic regression (Çitak-ER et al., 2017), extreme ML (Lin et al., 2018), SVM with different kernels (Shen et al., 2018), or logistic regression and XGBoost (decision trees) (Shmulev et al., 2018). Only Shmulev et al. compared the results obtained with the CNN classification with those obtained with other classifiers based on features extracted by the CNN, and concluded that the latter is more efficient. Instead of being directly applied to the image, CNNs can be applied to pre-extracted features. This is the case of (Suk et al., 2017), where the CNN is applied to the outputs of several regression models performed between MRI-based features and clinical scores with different hyperparameters. CNNs can also be applied to non-Euclidean spaces, such as graphs of patients (Parisot et al., 2018) or the cortical surface (Mostapha et al., 2018).
Other architectures have been applied to anatomical MRI. Many studies used a variant of the multilayer perceptron composed of stacked FC layers (Amoroso et al., 2018; Baskar et al., 2018; Cárdenas-Peña et al., 2017, 2016; Dolph et al., 2017; Gorji and Haddadnia, 2015; Gutiérrez-Becker and Wachinger, 2018; Jha et al., 2017; Lu et al., 2018; Mahanand et al., 2012; Maitra and Chatterjee, 2006; Ning et al., 2018; Raut and Dalal, 2017; Shams-Baboli and Ezoji, 2017; Zhang et al., 2018; Zhou et al., 2019) or of a probabilistic neural network (Duraisamy et al., 2019; Mathew et al., 2018). In other studies, high-level representations of the features are extracted using both unsupervised (deep Boltzmann machine (Suk et al., 2014) and AE (Suk et al., 2015)) and supervised structures (deep polynomial networks (Shi et al., 2018)), and an SVM is used for classification. Non-CNN architectures require extensive preprocessing as they have to be applied to imaging features, such as cortical thickness, shapes, texture, and regional features. Moreover, feature selection or embedding is also often required (Amoroso et al., 2018; Dolph et al., 2017; Jha et al., 2017; Lu et al., 2018; Mahanand et al., 2012; Mathew et al., 2018; Suk et al., 2015, 2014) to further reduce dimensionality.
DL-based classification approaches are not limited to cross-sectional anatomical MRI. Longitudinal studies exploit information extracted from several time points of the same subject. A specific structure, the recurrent neural network, has been used to study the temporal correlation between images (Bhagwat et al., 2018; Cui et al., 2018; Wang et al., 2018a). Several studies exploit multi-modal data (Aderghal et al., 2018; Cheng and Liu, 2017; Esmaeilzadeh et al., 2018; Li et al., 2015; Liu et al., 2016, 2015; Liu et al., 2018a; Lu et al., 2018; Ning et al., 2018; Ortiz et al., 2016; Qiu et al., 2018; Raut and Dalal, 2017; Senanayake et al., 2018; Shi et al., 2018; Shmulev et al., 2018; Spasov et al., 2018; Suk et al., 2014; Thung et al., 2017; Vu et al., 2018, 2017; Zhou et al., 2019, 2017), such as multiple imaging modalities (PET and diffusion tensor imaging), demographic data, genetics, clinical scores, or cerebrospinal fluid biomarkers. Note that multimodal studies that also reported results with MRI only (Aderghal et al., 2018; Cheng and Liu, 2017; Liu et al., 2018a; Qiu et al., 2018; Senanayake et al., 2018; Shmulev et al., 2018; Vu et al., 2018, 2017) are displayed in Table 1. Exploiting multiple time points and/or modalities is expected to improve the classification performance. However, these studies can be limited by the small number of subjects having all the required time points and modalities.

3. Materials
The data used in our study are from three public datasets: the Alzheimer’s Disease Neuroimaging Initiative (ADNI) study, the Australian Imaging, Biomarkers and Lifestyle (AIBL) study and the Open Access Series of Imaging Studies (OASIS). These datasets are described in supplementary eMethod 4. We used the T1w MRI available in each of these studies. For the detailed MRI protocols, one can see (Samper-González et al., 2018).
The ADNI dataset used in our experiments comprises 1455 participants for whom a T1w MR image was available at at least one visit. Five diagnosis groups were considered:

• CN: sessions of subjects who were diagnosed as CN at baseline and stayed stable during the follow-up;
• AD: sessions of subjects who were diagnosed as AD at baseline and stayed stable during the follow-up;
• MCI: sessions of subjects who were diagnosed as MCI, EMCI or LMCI at baseline, who did not encounter multiple reversions and conversions and who did not convert back to CN;
• pMCI: sessions of subjects who were diagnosed as MCI, EMCI or LMCI at baseline, and progressed to AD during the 36 months following the current visit;
• sMCI: sessions of subjects who were diagnosed as MCI, EMCI or LMCI at baseline, and did not progress to AD during the 36 months following the current visit.

AD and CN subjects whose label changed over time were excluded. This was also the case for MCI patients with two or more label changes (for instance progressing to AD and then reverting back to MCI). We made this choice because one can assume that the diagnosis of these subjects is less reliable. Naturally, all the sessions of the pMCI and sMCI groups are included in the MCI group. Note that the reverse is false, as some MCI subjects did not convert to AD but were not followed long enough to state whether they were sMCI. Moreover, for 30 sessions, the preprocessing did not pass the quality check (QC) (see Section 4.2) and these images were removed from our dataset. Two pMCI subjects were entirely removed because the preprocessing failed for all their sessions. Table 2 summarizes the demographics, and the MMSE and global CDR scores of the ADNI participants.
The AIBL dataset considered in this work is composed of 598 participants for whom a T1w MR image and an age value were available at at least one visit. The criteria used to create the diagnosis groups are identical to the ones used for ADNI. Table 3 summarizes the demographics, and the MMSE and global CDR scores of the AIBL participants. After the preprocessing pipeline, seven sessions were removed without changing the number of subjects.
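The 36-month progression rule defining the pMCI and sMCI groups can be sketched as follows. This is an illustrative reimplementation under the assumption that follow-up diagnoses are available as (months after the current visit, label) pairs; it is not the authors' actual code.

```python
# Hypothetical sketch of the sMCI/pMCI session labelling rule described above.

def label_mci_session(followup):
    """Label one MCI session as 'pMCI', 'sMCI' or None (undetermined).

    followup: list of (months_after_current_visit, diagnosis) pairs,
    with diagnosis in {'CN', 'MCI', 'AD'}.
    """
    window = [dx for months, dx in followup if 0 < months <= 36]
    if 'AD' in window:
        return 'pMCI'   # progressed to AD within 36 months of this visit
    # sMCI requires enough follow-up to rule out progression
    if followup and max(months for months, _ in followup) >= 36:
        return 'sMCI'
    return None         # not followed long enough to decide

assert label_mci_session([(12, 'MCI'), (24, 'AD')]) == 'pMCI'
assert label_mci_session([(12, 'MCI'), (36, 'MCI')]) == 'sMCI'
assert label_mci_session([(12, 'MCI')]) is None
```

The `None` case corresponds to MCI subjects that belong to the MCI group but to neither pMCI nor sMCI, as explained in the text.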
Table 2
Summary of participant demographics, mini-mental state examination (MMSE) and global clinical dementia rating (CDR) scores at baseline for ADNI.

      Subjects  Sessions  Age                       Gender         MMSE                 CDR
CN    330       1 830     74.4 ± 5.8 [59.8, 89.6]   160 M / 170 F  29.1 ± 1.1 [24, 30]  0: 330
MCI   787       3 458     73.3 ± 7.5 [54.4, 91.4]   464 M / 323 F  27.5 ± 1.8 [23, 30]  0: 2; 0.5: 785
sMCI  298       1 046     72.3 ± 7.4 [55.0, 88.4]   175 M / 123 F  28.0 ± 1.7 [23, 30]  0.5: 298
pMCI  295       865       73.8 ± 6.9 [55.1, 88.3]   176 M / 119 F  26.9 ± 1.7 [23, 30]  0.5: 293; 1: 2
AD    336       1 106     75.0 ± 7.8 [55.1, 90.9]   185 M / 151 F  23.2 ± 2.1 [18, 27]  0.5: 160; 1: 175; 2: 1

Values are presented as mean ± SD [range]. M: male, F: female.

Table 3
Summary of participant demographics, mini-mental state examination (MMSE) and global clinical dementia rating (CDR) scores at baseline for AIBL.

      N    Age                  Gender         MMSE                 CDR
CN    429  72.5 ± 6.2 [60, 92]  183 M / 246 F  28.8 ± 1.2 [25, 30]  0: 406; 0.5: 22; 1: 1
MCI   93   75.4 ± 6.9 [60, 96]  50 M / 43 F    27.0 ± 2.1 [20, 30]  0: 6; 0.5: 86; 1: 1
sMCI  13   76.7 ± 6.5 [64, 87]  8 M / 5 F      28.2 ± 1.5 [26, 30]  0.5: 13
pMCI  20   78.1 ± 6.6 [63, 91]  10 M / 10 F    26.7 ± 2.1 [22, 30]  0.5: 20
AD    76   73.9 ± 8.0 [55, 93]  33 M / 43 F    20.6 ± 5.5 [6, 29]   0.5: 31; 1: 36; 2: 7; 3: 2

Values are presented as mean ± SD [range]. M: male, F: female.

Table 4
Summary of participant demographics, mini-mental state examination (MMSE) and global clinical dementia rating (CDR) scores for OASIS.

      N    Age                  Gender         MMSE                 CDR
CN    76   76.5 ± 8.4 [62, 94]  14 M / 62 F    29.0 ± 1.2 [25, 30]  0: 76
AD    78   75.6 ± 7.0 [62, 96]  35 M / 43 F    24.4 ± 4.3 [14, 30]  0.5: 56; 1: 20; 2: 2

Values are presented as mean ± SD [range]. M: male, F: female.
The OASIS dataset considered in this work is composed of 193 participants aged 62 years or more (minimum age of the participants diagnosed with AD). Table 4 summarizes the demographics, and the MMSE and global CDR scores of the OASIS participants. After the preprocessing pipeline, 22 AD and 17 CN subjects were excluded.
Note that for the ADNI and AIBL datasets, three diagnosis labels (CN, MCI and AD) exist and are assigned by a physician after a series of clinical tests (Ellis et al., 2010, 2009; Petersen et al., 2010), while for OASIS only two diagnosis labels exist, CN and AD (the MCI subjects are labelled as AD), and the label is assigned based on the CDR only (Marcus et al., 2007). As the diagnostic criteria of these studies differ, there is no strict equivalence between the labels of ADNI and AIBL, and those of OASIS.

4. Methods
In this section, we present the main components of our framework: automatic converters of public datasets for reproducible data management (Section 4.1), preprocessing of MRI data (4.2), classification models (4.3), transfer learning approaches (4.4), classification tasks (4.5), evaluation strategy (4.6) and framework implementation details (4.7).

4.1. Converting datasets to a standardized data structure
ADNI, AIBL and OASIS, as public datasets, are extremely useful to the research community. However, they may be difficult to use because the downloaded raw data do not possess a clear and uniform organization. We thus used our previously developed converters (Samper-González et al., 2018) (available in the open source software platform Clinica) to convert the raw data into the BIDS format (Gorgolewski et al., 2016). Finally, we organized all the outputs of the experiments into a standardized structure, inspired from BIDS.

4.2. Preprocessing of T1w MRI
In principle, CNNs require only minimal preprocessing because of their ability to automatically extract low-to-high level features. However, in AD classification, where datasets are relatively small and thus deep networks may be difficult to train, it remains unclear whether they can benefit from more extensive preprocessing. Moreover, previous studies have used varied preprocessing procedures but without systematically assessing their impact. Thus, in the current study, we compared two different image preprocessing procedures: a “Minimal” and a more “Extensive” procedure. Both procedures included bias field correction and (optional) intensity rescaling. In addition, the “Minimal” processing included a linear registration while the “Extensive” included non-linear registration and skull-stripping. The essential MR image processing steps to consider in the context of AD classification are presented in online supplementary eMethod 3.
In brief, the “Minimal” preprocessing procedure performs the following operations. The N4ITK method (Tustison et al., 2010) was used for bias field correction. Next, a linear (affine) registration was performed using the SyN algorithm from ANTs (Avants et al., 2008) to register each image to the MNI space (ICBM 2009c nonlinear symmetric template) (Fonov et al., 2011, 2009). To improve the computational efficiency, the registered images were further cropped to remove the background. The final image size is 169 × 208 × 179 with 1 mm³ isotropic voxels. Intensity rescaling, which was performed based on the min and max values, denoted as MinMax, was set to be optional to study its influence on the classification results.
In the “Extensive” preprocessing procedure, bias field correction and non-linear registration were performed using the Unified Segmentation approach (Ashburner and Friston, 2005) available in SPM12³. Note that we do not use the tissue probability maps but only the nonlinearly registered, bias corrected, MR images. Subsequently, we perform skull-stripping based on a brain mask drawn in MNI space. We chose this mask-based approach over direct image-based skull-stripping procedures because the latter did not prove robust on our data. This mask-based approach is less accurate but more robust. In addition, we performed intensity rescaling as in the “Minimal” pipeline.
We performed QC on the outputs of the preprocessing procedures. For the “Minimal” procedure, we used a DL-based QC framework⁴ (Fonov et al., 2018) to automatically check the quality of the linearly registered data. This software outputs a probability indicating how accurate the registration is. We excluded the scans with a probability lower than 0.5 and visually checked the remaining scans whose probability was lower than 0.70. As a result, 30 ADNI scans, 7 AIBL scans, and 39 OASIS scans were excluded.

4.3. Classification models
We considered four different classification approaches: i) 3D subject-level CNN, ii) 3D ROI-based CNN, iii) 3D patch-level CNN and iv) 2D slice-level CNN.
In the case of DL, one challenge is to find the “optimal” model (i.e. global minima), including the architecture hyperparameters (e.g. number of layers, dropout, batch normalization) and the training hyperparameters (e.g. learning rate, weight decay). We first reviewed the architectures used in the literature among the studies in which no data leakage problem was found (Table 1a). As there was no consensus, we used the following heuristic strategy for each of the four approaches.
For the 3D subject-level approach, we began with an overfitting model that was very heavy because of the high number of FC layers (4 convolutional blocks + 5 FC layers). Then, we iteratively repeated the following operations:

- the number of FC layers was decreased until accuracy on the validation set decreased substantially;
- we added one more convolutional block.

In this way, we explored the architecture space from 4 convolutional blocks + 5 FC layers to 7 convolutional blocks + 2 FC layers. Among the best performing architectures, we chose the shallowest one: 5 convolutional blocks + 3 FC layers.
As the performance was very similar for the different architectures tested with the 3D subject-level approach, and as this search method is time costly, it was not used for the 3D patch-level approach, for which only four different architectures were tested:

- 4 convolutional blocks + 2 FC layers
- 4 convolutional blocks + 1 FC layer
- 7 convolutional blocks + 2 FC layers
- 7 convolutional blocks + 1 FC layer

The best architecture (4 convolutional blocks + 2 FC layers) was used for both the 3D patch-level and ROI-based approaches. Note that the other architectures were only slightly worse.
For these 3 approaches, other architecture hyperparameters were explored: with or without batch normalization, with or without dropout.
For the 2D slice-level approach, we chose to use a classical architecture, the ResNet-18 with FC layers added at the end of the network. We explored from 1 to 3 added FC layers and the best results were obtained with one. We then explored the number of layers to fine-tune (2 FC layers, or the last residual block + 2 FC layers) and chose to fine-tune the last block and the 2 FC layers. We always used dropout and tried different dropout rates.
For all four approaches, training hyperparameters (learning rate, weight decay) were adapted for each model depending on the evolution of the training accuracy.
The list of the chosen architecture hyperparameters is given in online supplementary eTables 1, 2 and 3. The list of the chosen training hyperparameters is given in online supplementary eTables 4 and 5.

4.3.1. 3D subject-level CNN
For the 3D subject-level approach, the proposed CNN architecture is shown in Fig. 1. The CNN consisted of 5 convolutional blocks and 3 FC layers. Each convolutional block was sequentially made of one convolutional layer, one batch normalization layer, one ReLU and one max pooling layer (more architecture details are provided in online supplementary eTable 1).

4.3.2. 3D ROI-based and 3D patch-level CNN
For the 3D ROI-based and 3D patch-level approaches, the chosen CNN architecture, shown in Fig. 2, consisted of 4 convolutional blocks (with the same structure as in the 3D subject-level) and 3 FC layers (more architecture details are provided in online supplementary eTable 2).
To extract the 3D patches, a sliding window (50 × 50 × 50 mm³) without overlap was used to convolve over the entire image, generating 36 patches for each image.
For the 3D ROI-based approach, we chose the hippocampus as a ROI, as done in previous studies. We used a cubic patch (50 × 50 × 50 mm³) enclosing the left (resp. right) hippocampus. The center of this cubic patch was manually chosen based on the MNI template image (ICBM 2009c nonlinear symmetric template). We ensured visually that this cubic patch included all the hippocampus.
For the 3D patch-level approach, two different training strategies were considered. First, all extracted patches were fitted into a single CNN (denoting this approach as 3D patch-level single-CNN). Secondly, we used one CNN for each patch, resulting in finally 36 (number of patches) CNNs (denoting this approach as 3D patch-level multi-CNN).

4.3.3. 2D slice-level CNN
For the 2D slice-level approach, the ResNet pre-trained on ImageNet was adopted and fine-tuned. The architecture is shown in Fig. 3. The architecture details of ResNet can be found in (He et al., 2016). We added one FC layer on top of the ResNet (more architecture details are provided in online supplementary eTable 3). The last five convolutional layers and the last FC layer of the ResNet, as well as the added FC layer, were fine-tuned. The weight and bias of the other layers of the CNN were frozen during fine-tuning to avoid overfitting.
For each subject, each sagittal slice was extracted and replicated into R, G and B channels respectively, in order to generate a RGB image. The first and last twenty slices were excluded due to the lack of information, which resulted in 129 RGB slices for each image.

4.3.4. Majority voting system
For 3D patch-level, 3D ROI-based and 2D slice-level CNNs, we adopted a soft voting system (Raschka, 2015) to generate the subject-level decision. The subject-level decision is generated based on the decision for each slice (resp. for each patch / for the left and right hippocampus ROI). More precisely, it was computed based on the predicted probability p obtained after softmax normalization of the outputs of all the slices/patches/ROIs/CNNs from the same patient:

ŷ = argmax_i ∑_{j=1}^{m} w_j p_{ij}

³ http://www.fil.ion.ucl.ac.uk/spm/software/spm12/.
⁴ https://github.com/vfonov/deep-qc
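As an illustration, the weighted soft-voting rule defined by this formula can be sketched as follows (hypothetical names; not the authors' implementation):

```python
import numpy as np

def soft_vote(probs, weights):
    """Subject-level decision by weighted soft voting.

    probs:   (m, n_classes) softmax outputs of the m slices/patches/ROIs/CNNs
             of one subject (p_ij in the formula).
    weights: (m,) importance weights (w_j in the formula).
    Returns the predicted class index.
    """
    weighted = weights[:, None] * probs           # w_j * p_ij
    return int(np.argmax(weighted.sum(axis=0)))   # argmax_i sum_j w_j p_ij

# Three weak classifiers voting on a binary task
probs = np.array([[0.9, 0.1],
                  [0.4, 0.6],
                  [0.2, 0.8]])
weights = np.array([0.5, 0.9, 0.8])
# class scores: [0.45 + 0.36 + 0.16, 0.05 + 0.54 + 0.64] = [0.97, 1.23]
assert soft_vote(probs, weights) == 1
```

Note how the weighting lets accurate classifiers (here the second and third) override a confident but poorly weighted one, which is the rationale given in the text for zeroing the weights of weak patch-level classifiers.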
Fig. 1. Architecture of the 3D subject-level CNNs. For each convolutional block, we only display the convolutional and max pooling layers. Filters for each convolutional layer
represent the number of filters ∗ filter size. Feature maps of each convolutional block represent the number of feature maps ∗ size of each feature map. Conv: convolutional
layer; MaxP: max pooling layer; FC: fully connected layer.
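The convolutional block repeated throughout the architecture of Fig. 1 (convolution, batch normalization, ReLU, max pooling) can be sketched in PyTorch as follows. The kernel size, padding and channel counts below are illustrative assumptions, not the values from supplementary eTable 1:

```python
import torch
import torch.nn as nn

# Sketch (assumed hyperparameters) of one convolutional block of the
# 3D subject-level CNN: Conv3d -> BatchNorm3d -> ReLU -> MaxPool3d.

def conv_block(in_channels, out_channels):
    return nn.Sequential(
        nn.Conv3d(in_channels, out_channels, kernel_size=3, padding=1),
        nn.BatchNorm3d(out_channels),
        nn.ReLU(),
        nn.MaxPool3d(kernel_size=2, stride=2),
    )

block = conv_block(1, 8)
x = torch.zeros(1, 1, 32, 32, 32)      # (batch, channel, D, H, W)
y = block(x)
assert y.shape == (1, 8, 16, 16, 16)   # max pooling halves each spatial dim
```

Stacking five such blocks followed by three FC layers yields the overall structure described in Section 4.3.1.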
Fig. 2. Architecture of the 3D ROI-based and 3D patch-level CNNs. For each convolutional block, we only display the convolutional and max pooling layers. Filters for each
convolutional layer represent the number of filters ∗ filter size. Feature maps of each convolutional block represent the number of feature maps ∗ size of each feature map.
Conv: convolutional layer; MaxP: max pooling layer; FC: fully connected layer.
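The non-overlapping 50 × 50 × 50 sliding-window extraction of Section 4.3.2 can be sketched as follows. This is an illustrative reimplementation assuming that partial patches at the image borders are dropped, which is consistent with the stated 36 patches (a 3 × 4 × 3 grid) for a 169 × 208 × 179 image:

```python
import numpy as np

def extract_patches(volume, size=50):
    """Non-overlapping sliding-window patches; border remainders are dropped."""
    patches = []
    x, y, z = volume.shape
    for i in range(0, x - size + 1, size):
        for j in range(0, y - size + 1, size):
            for k in range(0, z - size + 1, size):
                patches.append(volume[i:i+size, j:j+size, k:k+size])
    return patches

vol = np.zeros((169, 208, 179), dtype=np.float32)
patches = extract_patches(vol)
assert len(patches) == 36            # 3 x 4 x 3 grid, as stated in the text
assert patches[0].shape == (50, 50, 50)
```

In the multi-CNN strategy, each of these 36 patch positions would feed its own network; in the single-CNN strategy, all patches share one network.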
where wj is the weight assigned to the j-th patch/slice/ROI/CNN. wj reflects the importance of each slice/patch/ROI/CNN and is weighted by the normalized accuracy of the j-th slice/patch/ROI/CNN. For the evaluation on the test sets, the weights computed on the validation set were used. Note that the predicted probability p is not calibrated and should be interpreted with care as it is not reflective of the true underlying probability of the sample applied to CNNs (Guo et al., 2017; Kuhn and Johnson, 2013).
For the 3D patch-level multi-CNN approach, the 36 CNNs were trained independently. In this case, the weaker classifiers’ weight (balanced accuracy < 0.7) was set to 0 with the consideration that the labels’ probabilities of these classifiers could harm the majority voting system.

4.3.5. Comparison to a linear SVM on voxel-based features
For comparison purposes, classification was also performed with a linear SVM classifier. We chose the linear SVM as we previously showed that it obtained higher or at least comparable classification accuracy compared to other conventional models (logistic regression and random forest) (Samper-González et al., 2018). Moreover, given the very high dimensionality of the input, a non-linear SVM, e.g. with a radial basis function kernel, may not be advantageous since it would only transport the data into an even higher dimensional space. The SVM took as input the modulated gray matter density maps non-linearly registered to the MNI space using the DARTEL method (Ashburner, 2007), as in our previous study (Samper-González et al., 2018).

4.4. Transfer learning
Two different approaches were used for transfer learning: i) AE pre-training for 3D CNNs; and ii) ResNet pre-trained on ImageNet for 2D CNNs.

4.4.1. AE pre-training
The AE was constructed based on the architecture of the classification CNN. The encoder part of the AE is composed of a sequence of convolutional blocks, each block having one convolutional layer, one batch normalization layer, one ReLU and one max pooling layer, which is identical to the sequence of convolutional blocks composing the 3D subject-level network. The architecture of
Fig. 3. Architecture of the 2D slice-level CNN. An FC layer (FC2) was added on top of the ResNet. The last five convolutional layers and the last FC of ResNet (green dotted
box) and the added FC layer (purple dotted box) were fine-tuned and the other layers were frozen during training. Filters for each convolutional layer represent the number
of filters ∗ filter size. Feature maps of each convolutional block represent the number of feature maps ∗ size of each feature map. Conv: convolutional layer; FC: fully
connected layer.
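The freeze/fine-tune mechanic shown in Fig. 3 can be sketched on a toy model as follows. The actual network is a pre-trained ResNet-18; this stand-in only illustrates how `requires_grad` controls which layers are updated during fine-tuning:

```python
import torch.nn as nn

# Toy stand-in for the 2D slice-level network: the first Linear layer plays
# the role of the frozen ResNet layers, the last one the fine-tuned FC layers.
model = nn.Sequential(
    nn.Linear(8, 8),   # frozen part
    nn.ReLU(),
    nn.Linear(8, 2),   # fine-tuned part
)

for p in model.parameters():      # freeze everything...
    p.requires_grad = False
for p in model[2].parameters():   # ...then unfreeze the layers to fine-tune
    p.requires_grad = True

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
assert trainable == ['2.weight', '2.bias']
```

Only parameters with `requires_grad=True` receive gradient updates, so the frozen layers keep their ImageNet-derived weights, as described in Section 4.3.3.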
the decoder mirrored that of the encoder, except that the order of the convolution layer and the ReLU was swapped. Of note, the pre-training with AE and classification with CNNs in our experiments used the same training and validation data splits in order to avoid potential data leakage problems. Also, each AE was trained on all available data in the training sets. For instance, all MCI, AD and CN subjects in the training dataset were used to pre-train the AE for the AD vs CN classification task.

4.4.2. ImageNet pre-training
For the 2D-slice experiments, we investigated the possibility to transfer a ResNet pre-trained on ImageNet (He et al., 2016) to our specific tasks. Next, the fine-tuning procedure was performed on the chosen layers (see Fig. 3).

4.5. Classification tasks
We performed two tasks in our experiments. AD vs CN was used as the baseline task to compare the results of our different frameworks. Then the best frameworks were selected to perform the prediction task sMCI vs pMCI: the weights and biases of the model learnt on the source task (AD vs CN) were transferred to a new model fine-tuned on the target task (sMCI vs pMCI). For the SVM, the sMCI vs pMCI experiment was performed either by training directly on sMCI vs pMCI or by training on AD vs CN and applying the trained model to sMCI vs pMCI.

4.6. Evaluation strategy

4.6.1. Validation procedure
Rigorous validation is essential to objectively assess the performance of a classification framework. This is particularly critical in the case of DL as one may easily overfit the validation dataset when manually performing model selection and hyperparameter fine-tuning. An independent test set should be, at the very beginning, generated and concealed. It should not be touched until the CV, based on the training and validation datasets, is finished and the final model is chosen. This test dataset should be used only to assess the performance (i.e. generalization) of a fully specified and trained classifier (Kriegeskorte et al., 2009; Ripley, 1996; Sarle, 1997). Considering this, we chose a classical split into training/validation/test sets. Training/validation sets were used in a CV procedure for model selection while the test set was left untouched until the end of the peer-review process. Only the best performing model for each approach (3D subject-level, 3D patch-level, 3D ROI-based, 2D slice-level), as defined by the CV on training/validation sets, was tested on the test set.
The ADNI test dataset consisted of 100 randomly chosen age- and sex-matched subjects for each diagnostic class (i.e. 100 CN subjects, 100 AD patients). The rest of the ADNI data was used as training/validation set. We ensured that age and sex distributions between training/validation and test sets were not significantly different. Two other test sets were composed of all subjects of OASIS and AIBL. The ADNI test set will be used to assess model generalization within the same dataset (thereby assessing that the model has not overfitted the training/validation set). The AIBL test set will be used to assess generalization to another dataset that has similar inclusion criteria and image acquisition parameters to those of the training set. The OASIS test set will be used to assess generalization to a dataset with different inclusion criteria and image acquisition parameters. As mentioned above, it is important to note that the diagnosis labels are not based on the same criteria in OASIS on the one hand and ADNI/AIBL on the other. Thus, we do not hypothesize that the models trained on ADNI will generalize well to OASIS.
The model selection procedure, including model architecture selection and training hyperparameter fine-tuning, was performed using only the training/validation dataset. For that purpose, a 5-fold CV was performed, which resulted in a fold (20%) of the data for validation and the rest for training. Note that the 5-fold data split was performed only once for all the experiments with a fixed seed number (random_state = 2), thus guaranteeing that all the experiments used exactly the same subjects during CV. Also, no overlapping exists between the MCI subjects used for AE pre-training (using all available AD, CN and MCI) and the test dataset of sMCI
vs pMCI. Thus, the evaluation of the cross-task transfer learning (from AD vs CN to sMCI vs pMCI) is unbiased. Finally, for the linear SVM, the hyperparameter C controlling the amount of regularization was chosen using an inner loop of 10-fold CV (thereby performing a nested CV).

The validation procedure includes a series of tests. We implemented tests to check the absence of data leakage in the cross-validation procedure. We also included functional tests of the pipelines on inseparable and fully separable data as sanity checks. The inseparable data was made as follows: we selected a random subject from OASIS and generated multiple subjects by adding random noise to this subject. The images are different but inseparable, and each of the generated subjects was assigned randomly to a diagnostic class. The fully separable data was built as follows: the first (resp. second) group of subjects is made of images in which the voxel intensities of the left (resp. right) hemisphere were lowered. The scripts needed to generate the synthetic datasets are provided in the repository (see https://github.com/aramis-lab/AD-DL).

4.6.2. Metrics

We computed the following performance metrics: balanced accuracy (BA), area under the receiver operating characteristic (ROC) curve (AUC), accuracy, sensitivity and specificity. In the manuscript, for the sake of concision, we report only the BA, but all other metrics are available on Zenodo under the DOI 10.5281/zenodo.3491003.

4.7. Implementation details

The image preprocessing procedures were implemented with Nipype (Gorgolewski et al., 2011). The DL models were built using the PyTorch library (https://pytorch.org/) (Paszke et al., 2017). TensorboardX (https://github.com/lanpa/tensorboardX) was embedded into the current framework to dynamically monitor the training process. Specifically, we evaluated and reported the training and validation BA/loss after each epoch or after a certain number of iterations. Of note, instead of using only the current batch of data, the BA was evaluated on all the training/validation data. Moreover, we organized the classification outputs in a hierarchical way inspired by BIDS, including the TSV files containing the classification results, the outputs of TensorboardX for dynamic monitoring of the training, and the best-performing models selected based on the validation BA. The linear SVM was implemented using scikit-learn (Pedregosa et al., 2011; Samper-González et al., 2018).

We applied the following early stopping strategy for all the classification experiments: training stops once the validation loss has been continuously higher than the lowest validation loss for N epochs; otherwise, training continues to the end of the pre-defined number of epochs. The selected model was the one which obtained the highest validation BA during training. For the AE pre-training, the AE was trained to the end of the pre-defined number of epochs; we then visually checked the validation loss and the quality of the reconstructed images. The mean squared error loss was used for the AE pre-training, and the cross-entropy loss, which combines a log softmax normalization and the negative log-likelihood loss, was used for the CNNs.

5. Experiments and results

5.1. Results on training/validation set

The different classification experiments and results (validation BA during 5-fold CV) are detailed in Table 5. For each experiment, the training process of the best fold (i.e. with the highest balanced validation accuracy) is presented as an illustration (see supplementary eFigures 1-4 for details). Lastly, the training hyperparameters (e.g. learning rate and batch size) for each experiment are presented in supplementary eTable 4.

All the pipelines (3D subject-level, 3D ROI-based, 3D patch-level, 2D slice-level and SVM) were tested on the synthetic inseparable and fully separable datasets. The results were as expected: a balanced accuracy of 0.5 (resp. 1.00) for the inseparable (resp. fully separable) dataset.

5.1.1. 3D subject-level

Influence of intensity rescaling. We first assessed the influence of intensity rescaling. Without rescaling, the CNN did not perform better than chance (BA = 0.50) and there was an obvious generalization gap (high training but low validation BA). With intensity rescaling, the BA improved to 0.80. Based on these results, intensity rescaling was used in all subsequent experiments.

Influence of transfer learning (AE pre-training). The performance was slightly higher with AE pre-training (0.82) than without (0.80). Based on this, we decided to always use AE pre-training, even though the difference is small.

Influence of the training dataset size. We then assessed the influence of the amount of training data, comparing training using only baseline data to training with longitudinal data. The performance was moderately higher with longitudinal data (0.85) compared to baseline data only (0.82). We chose to continue exploring the influence of this choice because the four different approaches have very different numbers of learnt parameters and the sample size is intrinsically augmented in the 2D slice-level and 3D single-CNN patch-level approaches.

Influence of preprocessing. We then assessed the influence of the preprocessing, comparing the "Extensive" and "Minimal" preprocessing procedures. The performance was almost equivalent with the "Minimal" preprocessing (0.85) and with the "Extensive" preprocessing (0.86). Hence, in the following experiments we kept the "Minimal" preprocessing.

Classification of sMCI vs pMCI. The BA was the same for baseline data and for longitudinal data (0.73).

5.1.2. 3D ROI-based

For AD vs CN, the BA was 0.88 for baseline data and 0.86 for longitudinal data. This is slightly higher than that of the subject-level approach. For sMCI vs pMCI, the BA was 0.77 for baseline data and 0.78 for longitudinal data. This is substantially higher than with the 3D subject-level approach.

5.1.3. 3D patch-level

Single CNN. For AD vs CN, the BA was 0.74 for baseline data and 0.76 for longitudinal data.

Multi CNN. For AD vs CN, the BA was 0.81 for baseline data and 0.83 for longitudinal data, thereby outperforming the single-CNN approach. For sMCI vs pMCI, the BA was 0.75 for baseline data and 0.77 for longitudinal data. The performance for both tasks is slightly lower than that of the 3D ROI-based approach. Compared to the 3D subject-level approach, this method works better for sMCI vs pMCI.

5.1.4. 2D slice-level

In general, the performance of the 2D slice-level approach was lower than that of the 3D ROI-based, 3D patch-level multi-CNN and 3D subject-level (when trained with longitudinal data) approaches, but higher than that of the 3D patch-level single-CNN approach. For the 2D slice-level approach, the use of longitudinal data for training did not improve the performance (0.79 for baseline data; 0.74 for longitudinal data).
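The early stopping rule described in Section 4.7 (training stops once the validation loss has stayed above the lowest validation loss seen so far for N consecutive epochs) can be sketched as follows; the function name, the patience value and the toy loss history are illustrative, not taken from the actual implementation:

```python
def should_stop(val_losses, patience):
    """Return True if the last `patience` epochs all had a validation
    loss strictly higher than the lowest validation loss seen so far."""
    if len(val_losses) <= patience:
        return False
    best = min(val_losses)
    # Stop only if none of the last `patience` epochs reached the best loss.
    return all(loss > best for loss in val_losses[-patience:])

# Toy example: the validation loss stops improving after the third epoch.
history = [0.9, 0.7, 0.6, 0.65, 0.66, 0.67]
print(should_stop(history, patience=3))  # True: last 3 losses all exceed 0.6
```

Otherwise, as in the text, training simply runs to the pre-defined number of epochs and the model with the highest validation BA is kept.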
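Balanced accuracy, the metric reported throughout Table 5, is the mean of sensitivity and specificity; a minimal sketch with toy binary labels (1 = patient, 0 = control):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (true positive rate) and specificity
    (true negative rate) for binary labels (1 = patient, 0 = control)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    sensitivity = tp / pos
    specificity = tn / neg
    return (sensitivity + specificity) / 2

# Toy example: 3 of 4 patients and 2 of 4 controls correctly classified.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
print(balanced_accuracy(y_true, y_pred))  # 0.625 = (0.75 + 0.50) / 2
```

Unlike plain accuracy, this metric is insensitive to the class imbalance between diagnostic groups.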
Table 5
Summary of all the classification experiments and validation results in our analyses. For each model, we report the average and standard deviation of the balanced accuracy across the five folds, followed, within brackets, by the balanced accuracy of each fold. Note that this is not the standard deviation of the estimator of the balanced accuracy. MinMax: for CNNs, intensity rescaling was done based on the min and max values, so that all values lie in the range [0, 1]; SPM-based: the SPM-based gray matter maps are intrinsically rescaled; AE: autoencoder. For DL models, sMCI vs pMCI tasks were done as follows: the weights and biases of the model learnt on the source task (AD vs CN) were transferred to a new model fine-tuned on the target task (sMCI vs pMCI). For the SVM, sMCI vs pMCI was done either by training directly on sMCI vs pMCI or by training on AD vs CN and applying the trained model to sMCI vs pMCI.

Exp # | Classification architecture | Training data | Image preprocessing | Intensity rescaling | Data split | Training approach | Transfer learning | Task | Validation balanced accuracy
1 | 3D subject-level CNN | Baseline | Minimal | None | subject-level | single-CNN | None | AD vs CN | 0.50 ± 0.00 [0.50, 0.50, 0.50, 0.50, 0.50]
2 | 3D subject-level CNN | Baseline | Minimal | MinMax | subject-level | single-CNN | None | AD vs CN | 0.80 ± 0.05 [0.76, 0.86, 0.81, 0.85, 0.74]
3 | 3D subject-level CNN | Baseline | Minimal | MinMax | subject-level | single-CNN | AE pre-training | AD vs CN | 0.82 ± 0.05 [0.74, 0.90, 0.83, 0.77, 0.83]
4 | 3D subject-level CNN | Longitudinal | Minimal | MinMax | subject-level | single-CNN | AE pre-training | AD vs CN | 0.85 ± 0.04 [0.88, 0.88, 0.84, 0.85, 0.78]
5 | 3D subject-level CNN | Longitudinal | Extensive | MinMax | subject-level | single-CNN | AE pre-training | AD vs CN | 0.86 ± 0.06 [0.88, 0.94, 0.85, 0.85, 0.76]
6 | 3D subject-level CNN | Longitudinal | Minimal | MinMax | subject-level | single-CNN | AE pre-training | sMCI vs pMCI | 0.73 ± 0.03 [0.73, 0.73, 0.67, 0.76, 0.74]
7 | 3D subject-level CNN | Baseline | Minimal | MinMax | subject-level | single-CNN | AE pre-training | sMCI vs pMCI | 0.73 ± 0.05 [0.73, 0.73, 0.63, 0.77, 0.76]
8 | 3D ROI-based CNN | Baseline | Minimal | MinMax | subject-level | single-CNN | AE pre-training | AD vs CN | 0.88 ± 0.03 [0.84, 0.89, 0.90, 0.89, 0.85]
9 | 3D ROI-based CNN | Baseline | Minimal | MinMax | subject-level | single-CNN | AE pre-training | sMCI vs pMCI | 0.77 ± 0.05 [0.81, 0.81, 0.67, 0.78, 0.76]
10 | 3D ROI-based CNN | Longitudinal | Minimal | MinMax | subject-level | single-CNN | AE pre-training | AD vs CN | 0.86 ± 0.02 [0.83, 0.86, 0.86, 0.88, 0.86]
11 | 3D ROI-based CNN | Longitudinal | Minimal | MinMax | subject-level | single-CNN | AE pre-training | sMCI vs pMCI | 0.78 ± 0.07 [0.87, 0.73, 0.68, 0.82, 0.78]
12 | 3D patch-level CNN | Baseline | Minimal | MinMax | subject-level | single-CNN | AE pre-training | AD vs CN | 0.74 ± 0.08 [0.75, 0.84, 0.78, 0.75, 0.59]
13 | 3D patch-level CNN | Longitudinal | Minimal | MinMax | subject-level | single-CNN | AE pre-training | AD vs CN | 0.76 ± 0.04 [0.78, 0.77, 0.80, 0.78, 0.69]
14 | 3D patch-level CNN | Baseline | Minimal | MinMax | subject-level | multi-CNN | AE pre-training | AD vs CN | 0.81 ± 0.03 [0.82, 0.84, 0.83, 0.77, 0.79]
15 | 3D patch-level CNN | Baseline | Minimal | MinMax | subject-level | multi-CNN | AE pre-training | sMCI vs pMCI | 0.75 ± 0.04 [0.80, 0.72, 0.72, 0.79, 0.72]
16 | 3D patch-level CNN | Longitudinal | Minimal | MinMax | subject-level | multi-CNN | AE pre-training | AD vs CN | 0.83 ± 0.02 [0.83, 0.85, 0.84, 0.82, 0.79]
17 | 3D patch-level CNN | Longitudinal | Minimal | MinMax | subject-level | multi-CNN | AE pre-training | sMCI vs pMCI | 0.77 ± 0.04 [0.77, 0.75, 0.71, 0.82, 0.79]
18 | 2D slice-level CNN | Baseline | Minimal | MinMax | subject-level | single-CNN | ImageNet pre-training | AD vs CN | 0.79 ± 0.04 [0.83, 0.83, 0.72, 0.82, 0.73]
19 | 2D slice-level CNN | Longitudinal | Minimal | MinMax | subject-level | single-CNN | ImageNet pre-training | AD vs CN | 0.74 ± 0.03 [0.76, 0.80, 0.74, 0.71, 0.69]
20 | 2D slice-level CNN | Baseline | Minimal | MinMax | slice-level (data leakage) | single-CNN | ImageNet pre-training | AD vs CN | 1.00 ± 0.00 [1.00, 1.00, 1.00, 1.00, 1.00]
21 | SVM | Baseline | DartelGM | SPM-based | subject-level | None | None | AD vs CN | 0.88 ± 0.02 [0.92, 0.89, 0.85, 0.89, 0.84]
22 | SVM | Baseline | DartelGM | SPM-based | subject-level | None | None | sMCI vs pMCI (trained on sMCI vs pMCI) | 0.68 ± 0.02 [0.71, 0.68, 0.66, 0.67, 0.71]
23 | SVM | Baseline | DartelGM | SPM-based | subject-level | None | None | sMCI vs pMCI (trained on AD vs CN) | 0.70 ± 0.06 [0.66, 0.75, 0.70, 0.79, 0.63]
24 | SVM | Longitudinal | DartelGM | SPM-based | subject-level | None | None | AD vs CN | 0.87 ± 0.01 [0.86, 0.86, 0.88, 0.87, 0.85]
25 | SVM | Longitudinal | DartelGM | SPM-based | subject-level | None | None | sMCI vs pMCI (trained on sMCI vs pMCI) | 0.68 ± 0.06 [0.75, 0.77, 0.62, 0.62, 0.67]
26 | SVM | Longitudinal | DartelGM | SPM-based | subject-level | None | None | sMCI vs pMCI (trained on AD vs CN) | 0.70 ± 0.02 [0.68, 0.72, 0.67, 0.69, 0.73]
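The subject-level 5-fold split with a fixed seed (random_state = 2) described in Section 4.6.1 can be sketched with scikit-learn; the subject IDs and diagnoses below are toy values, and the framework's actual code lives in the AD-DL repository:

```python
from sklearn.model_selection import StratifiedKFold

# Toy subject list: one entry per participant (never per slice or patch),
# so that no subject can appear in both the training and validation folds.
subjects = [f"sub-{i:03d}" for i in range(10)]
labels = [0, 1] * 5  # 0 = CN, 1 = AD (illustrative diagnoses)

# A fixed random_state makes the split identical across all experiments.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)
for fold, (train_idx, valid_idx) in enumerate(skf.split(subjects, labels)):
    valid_subjects = [subjects[i] for i in valid_idx]
    print(f"fold {fold}: validation subjects = {valid_subjects}")
```

Because the seed is fixed, re-running the split yields exactly the same folds, so every experiment compares models on identical validation subjects.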
Finally, we studied the influence of data leakage using a slice-level data split strategy. As expected, the BA was 1.00.

5.1.5. Linear SVM

For the task AD vs CN, the balanced accuracies were 0.88 when trained with baseline data and 0.87 when trained with longitudinal data. For the task sMCI vs pMCI, when training from scratch, the balanced accuracies were 0.68 when trained with baseline data and 0.68 when trained with longitudinal data. When using transfer learning from the task AD vs CN to the task sMCI vs pMCI, the balanced accuracies were 0.70 (when trained with baseline data) and 0.70 (when trained with longitudinal data). The performance of the SVM on AD vs CN is thus higher than that of most DL models and comparable to the best ones, whereas for the task sMCI vs pMCI, the BA of the SVM is lower than that of the DL models.

5.2. Results on the test sets

Results on the three test sets (ADNI, OASIS and AIBL) are presented in Table 6. For each category of approach, we only applied the best models for both baseline and longitudinal data.

5.2.1. 3D subject-level

For AD vs CN, all models generalized well to the ADNI and AIBL test sets but not to the OASIS test set (losing over 0.15 points of BA).

For sMCI vs pMCI, the models generalized relatively well to the ADNI test set but not to the AIBL test set (losing over 0.20 points). Note that the generalization was better for longitudinal than for baseline data.

5.2.2. 3D ROI-based

For AD vs CN, the models generalized well to the ADNI test set, slightly worse to the AIBL test set (losing 0.04 to 0.05 points) and considerably worse to OASIS (losing from 0.13 to 0.19 points).

For sMCI vs pMCI, there was a slight decrease in BA on the ADNI test set and a severe decrease on the AIBL test set. Note that on the ADNI test set, the performance of the 3D ROI-based approach is almost the same as that of the 3D subject-level approach (when using longitudinal data), while it was better on the validation set.

5.2.3. 3D patch-level

For AD vs CN, the generalization pattern was similar to that of the other models: good for ADNI and AIBL, poor for OASIS.
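The MinMax intensity rescaling referred to in Tables 5 and 6 maps each image to the range [0, 1] using its own minimum and maximum intensity; a minimal sketch on a toy volume:

```python
import numpy as np

def minmax_rescale(image):
    """Linearly map voxel intensities to [0, 1] based on the image's
    own min and max values (the MinMax setting of Tables 5 and 6)."""
    image = image.astype(np.float32)
    return (image - image.min()) / (image.max() - image.min())

# Toy 3D volume with arbitrary intensities.
volume = np.array([[[100.0, 300.0], [500.0, 900.0]]])
rescaled = minmax_rescale(volume)
print(rescaled.min(), rescaled.max())  # 0.0 1.0
```

The SPM-based gray matter maps used by the SVM do not need this step, as they are intrinsically rescaled.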
Table 6
Summary of the results on the three test datasets in our analyses. CNNs were trained using intensity rescaling and our "Minimal" preprocessing, with a data split at the subject level and transfer learning (AE pre-training for AD vs CN tasks; cross-task transfer learning for sMCI vs pMCI tasks), unless stated otherwise in the row. For each model, we first copy the validation balanced accuracy (averaged across the five folds) reported in Table 5. We then report, for each test set (ADNI, AIBL, OASIS), the average balanced accuracy across the five folds followed, within brackets, by the balanced accuracy of each of the models trained in the five folds of the CV. MinMax: for CNNs, intensity rescaling was done based on the min and max values, so that all values lie in the range [0, 1]; SPM-based: the SPM-based gray matter maps are intrinsically rescaled; AE: autoencoder.

Architecture | Training data | Task | Validation BA | ADNI test BA | AIBL test BA | OASIS test BA
3D subject-level CNN | Baseline | AD vs CN | 0.82 ± 0.05 | 0.82 [0.79, 0.85, 0.82, 0.81, 0.85] | 0.83 [0.81, 0.85, 0.84, 0.78, 0.86] | 0.67 [0.59, 0.69, 0.72, 0.64, 0.69]
3D subject-level CNN | Longitudinal | AD vs CN | 0.85 ± 0.04 | 0.85 [0.88, 0.84, 0.84, 0.84, 0.84] | 0.86 [0.89, 0.85, 0.86, 0.85, 0.86] | 0.68 [0.65, 0.70, 0.70, 0.71, 0.65]
3D subject-level CNN | Baseline | sMCI vs pMCI | 0.73 ± 0.05 | 0.69 [0.68, 0.71, 0.64, 0.73, 0.67] | 0.52 [0.51, 0.47, 0.55, 0.54, 0.55] | –
3D subject-level CNN | Longitudinal | sMCI vs pMCI | 0.73 ± 0.03 | 0.73 [0.75, 0.72, 0.72, 0.74, 0.72] | 0.50 [0.48, 0.47, 0.54, 0.52, 0.51] | –
3D ROI-based CNN | Baseline | AD vs CN | 0.88 ± 0.03 | 0.89 [0.87, 0.88, 0.90, 0.91, 0.89] | 0.84 [0.83, 0.88, 0.84, 0.85, 0.83] | 0.69 [0.62, 0.74, 0.70, 0.69, 0.71]
3D ROI-based CNN | Baseline | sMCI vs pMCI | 0.77 ± 0.05 | 0.74 [0.75, 0.72, 0.76, 0.75, 0.75] | 0.60 [0.56, 0.56, 0.66, 0.62, 0.59] | –
3D ROI-based CNN | Longitudinal | AD vs CN | 0.86 ± 0.02 | 0.85 [0.87, 0.82, 0.87, 0.86, 0.87] | 0.81 [0.79, 0.81, 0.79, 0.82, 0.85] | 0.73 [0.71, 0.73, 0.72, 0.76, 0.71]
3D ROI-based CNN | Longitudinal | sMCI vs pMCI | 0.78 ± 0.07 | 0.74 [0.70, 0.73, 0.73, 0.75, 0.81] | 0.57 [0.56, 0.53, 0.52, 0.66, 0.56] | –
3D patch-level CNN (multi-CNN) | Baseline | AD vs CN | 0.81 ± 0.03 | 0.81 [0.82, 0.81, 0.84, 0.80, 0.79] | 0.81 [0.81, 0.75, 0.81, 0.84, 0.82] | 0.64 [0.61, 0.65, 0.60, 0.69, 0.67]
3D patch-level CNN (multi-CNN) | Baseline | sMCI vs pMCI | 0.75 ± 0.04 | 0.70 [0.71, 0.66, 0.66, 0.71, 0.75] | 0.64 [0.63, 0.52, 0.67, 0.74, 0.63] | –
3D patch-level CNN (multi-CNN) | Longitudinal | AD vs CN | 0.83 ± 0.02 | 0.86 [0.86, 0.86, 0.87, 0.85, 0.84] | 0.80 [0.82, 0.78, 0.81, 0.81, 0.79] | 0.71 [0.70, 0.70, 0.71, 0.71, 0.67]
3D patch-level CNN (multi-CNN) | Longitudinal | sMCI vs pMCI | 0.77 ± 0.04 | 0.70 [0.70, 0.71, 0.69, 0.71, 0.69] | 0.44 [0.45, 0.39, 0.55, 0.42, 0.39] | –
2D slice-level CNN (ImageNet pre-training) | Baseline | AD vs CN | 0.79 ± 0.04 | 0.76 [0.76, 0.75, 0.77, 0.75, 0.78] | 0.76 [0.74, 0.76, 0.78, 0.75, 0.75] | 0.65 [0.67, 0.62, 0.64, 0.65, 0.69]
2D slice-level CNN (ImageNet pre-training) | Longitudinal | AD vs CN | 0.74 ± 0.03 | 0.74 [0.81, 0.76, 0.70, 0.74, 0.72] | 0.73 [0.72, 0.77, 0.72, 0.66, 0.79] | 0.61 [0.62, 0.63, 0.64, 0.58, 0.60]
2D slice-level CNN, slice-level data split (data leakage) | Baseline | AD vs CN | 1.00 ± 0.00 | 0.75 [0.74, 0.76, 0.75, 0.76, 0.75] | 0.80 [0.80, 0.79, 0.82, 0.80, 0.81] | 0.68 [0.68, 0.67, 0.69, 0.70, 0.66]
SVM (DartelGM, SPM-based rescaling) | Baseline | AD vs CN | 0.88 ± 0.02 | 0.88 [0.88, 0.87, 0.90, 0.90, 0.88] | 0.88 [0.87, 0.90, 0.87, 0.89, 0.90] | 0.70 [0.71, 0.71, 0.70, 0.68, 0.72]
SVM (DartelGM, SPM-based rescaling) | Baseline | sMCI vs pMCI (trained on AD vs CN) | 0.70 ± 0.06 | 0.75 [0.75, 0.75, 0.74, 0.76, 0.76] | 0.60 [0.62, 0.54, 0.62, 0.59, 0.64] | –
SVM (DartelGM, SPM-based rescaling) | Longitudinal | AD vs CN | 0.87 ± 0.01 | 0.87 [0.85, 0.84, 0.90, 0.89, 0.87] | 0.87 [0.88, 0.86, 0.88, 0.87, 0.89] | 0.71 [0.73, 0.68, 0.72, 0.70, 0.71]
SVM (DartelGM, SPM-based rescaling) | Longitudinal | sMCI vs pMCI (trained on AD vs CN) | 0.70 ± 0.02 | 0.76 [0.74, 0.75, 0.80, 0.77, 0.76] | 0.68 [0.67, 0.66, 0.68, 0.67, 0.71] | –
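Table 6 reports, for each external test set, the balanced accuracy of each of the five models trained during CV together with their average; the aggregation can be sketched as follows (the helper is illustrative, and the scores are those of the first row of the table):

```python
from statistics import mean

def summarize_folds(fold_scores):
    """Return the average of the per-fold test balanced accuracies,
    as reported in Table 6, together with the individual values."""
    return round(mean(fold_scores), 2), list(fold_scores)

# Per-fold ADNI test BAs of the 3D subject-level CNN (baseline, AD vs CN).
scores = [0.79, 0.85, 0.82, 0.81, 0.85]
avg, per_fold = summarize_folds(scores)
print(avg, per_fold)  # 0.82 [0.79, 0.85, 0.82, 0.81, 0.85]
```

Reporting the spread across fold-trained models, rather than a single value, makes the variability of each approach visible.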
For sMCI vs pMCI, the BA on the ADNI test set was 0.05 to 0.07 points lower than on the ADNI validation set. The BA on the AIBL test set was very poor.

5.2.4. 2D slice-level

For AD vs CN, there was a slight decrease in performance on the ADNI test set (losing from 0 to 0.03 points) and the AIBL test set (losing from 0.01 to 0.03 points), and a considerable decrease on the OASIS test set (losing from 0.13 to 0.14 points). As expected, the "data-leakage" model did not generalize well.

5.2.5. Linear SVM

For AD vs CN, we observed the same pattern as for the other models: excellent generalization to ADNI and AIBL but not to OASIS.

For sMCI vs pMCI, the generalization was excellent for ADNI but not for AIBL. Of note, the BA on the ADNI test set was even higher than that of the validation set, reaching a level comparable to the best DL models.

6. Discussion

The present study contains three main contributions. First, we performed a systematic and critical literature review, which highlighted several important problems. Then, we proposed an open-source framework for the reproducible evaluation of AD classification using CNNs and T1w MRI. Finally, we applied the framework to rigorously compare different CNN approaches and to study the impact of key components on the performance. We hope that the present paper will provide a more objective assessment of the performance of CNNs for AD classification and constitute a solid baseline for future research.

This paper first proposes a survey of existing CNN methods for AD classification, which highlighted several serious problems in the existing literature. We found that data leakage was potentially present in half of the 32 surveyed studies. This problem was evident in six of them and possible (due to an inadequate description of the validation procedure) in ten others. This is a very serious issue, in particular considering that all these studies have undergone peer review, and it is likely to bias the performance upwards. We confirmed this assumption by simulating data leakage and found that it led to a biased evaluation of the BA (1.00 on the validation set instead of 0.75 on the ADNI test set and 0.80 on the AIBL test set). Similar findings were observed in (Bäckström et al., 2018). Moreover, the survey highlighted that many studies did not motivate the choice of their architecture or training hyperparameters. Only two of them (Wang et al., 2019; Wang et al., 2018a) explored and gave results obtained with different architecture hyperparameters. However, it is possible that these results were computed on the test set to help choose their final model, hence they may be contaminated by data leakage. For other studies, it is also likely that multiple combinations of architecture and training hyperparameters were tested, leading to a biased performance on the test set. We believe that these issues may be caused by a lack of expertise in medical imaging or DL. For instance, splitting at the slice level comes from a lack of knowledge of the nature of medical imaging data. We hope that the present paper will help to spread knowledge and good practices in the field.

The second contribution of our work is an open-source framework for reproducible experiments on AD classification using CNNs. Some studies in our bibliography made their code available on open-source platforms (Hon and Khan, 2017; Hosseini-Asl et al., 2016; Korolev et al., 2017; Liu et al., 2018a). Even though this practice should be encouraged, it does not guarantee reproducibility of the results. Two studies (Cheng and Liu, 2017; Liu et al., 2018a) used the online code of (Hosseini-Asl et al., 2016) for comparison with their framework, but neither of them succeeded in reproducing the results of the original study (for the AD vs CN task, they both report an accuracy of 0.82 while the original study reports an accuracy of 0.99). We extended our open-source framework for reproducible evaluation of AD classification, initially dedicated to traditional methods (Samper-González et al., 2018; Wen et al., 2018a), to DL approaches. It is composed of the previously developed tools for data management that rely on the BIDS community standard (Gorgolewski et al., 2016), a new image preprocessing pipeline performing bias field correction, affine registration to MNI space and intensity rescaling, a set of CNN models that are representative of the literature, and rigorous validation procedures dedicated to DL. We hope that this open-source framework will facilitate the reproducibility and objectivity of DL methods for AD classification, as it enables researchers to easily embed new image preprocessing pipelines or CNN architectures and study their added value. It extends the efforts initiated in both the neuroimaging (Gorgolewski and Poldrack, 2016; Poldrack et al., 2017) and ML (Sonnenburg et al., 2007; Stodden et al., 2014; Vanschoren et al., 2014) communities to improve reproducibility. In particular, frameworks and software tools have recently been proposed to facilitate and standardize ML analyses for neuroimaging data. Nilearn (https://nilearn.github.io/) is currently mostly focused on fMRI data. It provides pipelines for image processing, various techniques for decoding activity and studying functional connectivity, as well as visualization tools. Neuropredict (http://github.com/raamana/neuropredict) (Raamana, 2017) is more focused on computer-aided diagnosis and other clinical applications. In particular, it provides standardized cross-validation procedures and tools to visualize results.

Our third contribution is the rigorous assessment of the performance of different CNN architectures. The proposed framework was applied to images from three public datasets: ADNI, AIBL and OASIS. On the ADNI test dataset, the diagnostic BA of the CNNs ranged from 0.76 to 0.89 for the AD vs CN task and from 0.69 to 0.74 for the sMCI vs pMCI task. These results are in line with the state-of-the-art (studies without data leakage in Table 1a), where classification accuracy typically ranged from 0.76 to 0.91 for AD vs CN and 0.62 to 0.83 for sMCI vs pMCI. Nevertheless, the performance that we report is lower than that of the top-performing studies. This potentially comes from the fact that our test set was fully independent and was never used to choose the architectures or parameters. The proposed framework can be used to provide a baseline performance when developing new methods.

Different approaches, namely 3D subject-level, 3D ROI-based, 3D patch-level and 2D slice-level CNNs, were compared. Our study is the first one to systematically compare the performance of these four approaches. In the literature, three studies (Cheng et al., 2017; Li et al., 2018; Liu et al., 2018a) using a 3D patch-level approach compared their results with a 3D subject-level approach. In all studies, the 3D patch-level multi-CNN gave better results than the 3D subject-level CNN (3 or 4 percentage points of difference between the two approaches). However, except for (Liu et al., 2018a), where the code provided by (Hosseini-Asl et al., 2016) is reused, the methods used for the comparison are poorly described and the studies would thus be difficult, if not impossible, to reproduce. In general, in our results, three approaches (3D subject-level, 3D ROI-based, 3D patch-level) provided approximately the same level of performance (note that this discussion paragraph is based on test set results, which are the most objective performance measures). On the other hand, the 2D slice-level approach was less efficient. One can hypothesize that this is because the spatial information is not adequately modeled by this approach (no 3D consistency across slices). Only one paper (without data leakage) has explored
the 2D slice-level approach using an ImageNet pre-trained ResNet (Valliani and Soni, 2017). Their accuracy is very similar to ours (0.81 for the task AD vs CN). The results of the three 3D approaches were in general comparable, and our results do not allow a strong conclusion to be drawn on the superiority of one of these three approaches. Nevertheless, there is a trend for a slightly lower performance of the 3D patch-level approach. It could come from the fact that the spatial information is also not ideally modeled with the 3D patch (no consistency at the border of the patch). Other studies with 3D patch-level approaches in the literature (Cheng et al., 2017; Lian et al., 2018; Li et al., 2018; Liu et al., 2018a, 2018e) reported higher accuracies (from 0.87 to 0.91) than ours (from 0.81 to 0.86). We hypothesize that this may come from the increased complexity of their approach, including patch selection and fusion. Concerning the 3D ROI-based approach, two papers in the literature using a hippocampal ROI reported high accuracies for the task AD vs CN (0.84 and 0.90), comparable to ours, even though their definition of the ROI was different (Aderghal et al., 2018, 2017b). As for the 3D subject-level approach (Bäckström et al., 2018; Cheng and Liu, 2017; Korolev et al., 2017; Li et al., 2017; Senanayake et al., 2018; Shmulev et al., 2018), results in the literature varied across papers, from 0.76 to 0.90. Although we cannot prove it directly, we believe that this variability stems from the high risk of overfitting. To summarize, our results demonstrate the superiority of 3D approaches compared to 2D, but the results of the different 3D approaches were not substantially different. In light of this, one could prefer the 3D ROI-based method, which requires less memory and training time (compared to the other 3D methods) and is conceptually simpler than the 3D patch-level multi-CNN approach. However, it could be that future works, with larger training sets, will demonstrate the superiority of approaches that exploit all the information in the 3D image and not only that of the hippocampus. Indeed, even though the hippocampus is affected early and severely by AD (Braak and Braak, 1998), alterations in AD are not confined to the hippocampus and extend to other regions in the temporal, parietal and frontal lobes.

One interesting question is whether DL could perform better than conventional ML methods for AD classification. Here, we chose to compare CNNs to a linear SVM. SVMs have been used in many AD classification studies and obtained competitive balanced accuracies (Falahati et al., 2014; Haller et al., 2011; Rathore et al., 2017). In the current study, the SVM was at least as good as the best CNNs for both the AD vs CN and the sMCI vs pMCI tasks. Note that we used a standard linear SVM with standard voxel-based features. It could be that more sophisticated conventional ML methods would provide even higher performance. Similarly, we do not claim that more sophisticated DL architectures would not outperform the SVM. However, this is not the case with the architectures that we tested, which are representative of the existing literature on AD classification. Besides, it is possible that CNNs will outperform SVMs when larger public datasets become available. Overall, a major result of the present paper is that, with the sample size available in ADNI, CNNs did not provide an increase in performance compared to the SVM.

Unbiased evaluation of the performance is an essential task in ML. This is particularly critical for DL because of the extreme flexibility of the models and of the numerous architecture and training hyperparameters that can be chosen. In particular, it is crucial that such choices are not made using the test set. We chose a very strict validation strategy in that respect: the test sets were left untouched until the end of the peer-review process. This guarantees that only the final models, after all possible adjustments, are carried to the test set. Moreover, it is important to assess generalization not only to unseen subjects but also to other studies in which image acquisitions or patient inclusion criteria can vary. In the present paper, we used three test sets from the ADNI, AIBL and OASIS databases to assess different generalization aspects.

We studied generalization in three different settings: i) on a separate test set from ADNI, thus from the same study as the training set; ii) on AIBL, i.e. a different study but with similar inclusion criteria and imaging acquisitions; iii) on OASIS, i.e. a study with different inclusion criteria and imaging acquisitions. Overall, the models generalized well to ADNI (for both tasks) and to AIBL (for AD vs CN). On the other hand, we obtained a very poor generalization to sMCI vs pMCI for AIBL. We hypothesize that this could be because pMCI and sMCI participants from AIBL are substantially older than those of ADNI, which is not the case for AD and CN participants. Nevertheless, note that the sample size for sMCI vs pMCI in AIBL is quite small (33 participants). Also, the generalization to OASIS was poor. This may stem from the diagnosis criteria, which are less rigorous (in OASIS, all participants with CDR > 0 are considered AD). Overall, these results bring important information. First, good generalization to unseen, similar subjects demonstrates that the models did not overfit the subjects at hand in the training/validation set. On the other hand, poor generalization to different age ranges, protocols and inclusion criteria shows that trained models are too specific to these characteristics. Generalization across different populations thus remains an unsolved problem and will require training on more representative datasets, but maybe also new strategies to make training more robust to heterogeneity. This is critical for the future translation to clinical practice, in which conditions are much less controlled than in research datasets like ADNI.

We studied the influence of several key choices on the performance. First, we studied the influence of AE pre-training and showed that it slightly improved the average over training from scratch. Three previous papers studied the impact of AE pre-training (Hosseini-Asl et al., 2016; Vu et al., 2018, 2017) and found that it improved the results. However, they are all suspected of data leakage. We thus conclude that, to date, it is not proven that AE pre-training leads to a significant increase in BA. A difficulty in AD classification using DL is the limited amount of data samples available for training. However, training with longitudinal instead of baseline data gave only a slight increase of BA in most approaches. The absence of a major improvement may be due to several factors. First, training with longitudinal data implies training with data from more advanced disease stages, since patients are seen at a later point in the disease course. This may have an adverse effect on the performance of the model when tested on baseline data, at which point the patients are less advanced. Also, since the additional data come from the same patients, this does not provide a better coverage of inter-individual variability. We studied the impact of image preprocessing. First, as expected, we found that CNNs cannot be successfully trained without intensity rescaling. We then studied the influence of two different preprocessing procedures ("Minimal" and "Extensive"). The "Minimal" procedure is limited to an affine registration of the subject's image to a standard space, while for the "Extensive" procedure non-linear registration and skull stripping are performed. They led to comparable results. In principle, this is not surprising, as DL methods do not require extensive preprocessing. In the literature, varied types of preprocessing have been used. Some studies used non-linear registration (Bäckström et al., 2018; Basaia et al., 2019; Lian et al., 2018; Lin et al., 2018; Liu et al., 2018a, 2018e; Wang et al., 2019; Wang et al., 2018a) while others used only linear (Aderghal et al., 2018, 2017a, 2017b; Hosseini Asl et al., 2018; Li et al., 2018; Liu et al., 2018a; Shmulev et al., 2018) or no registration (Cheng and Liu, 2017). None of them compared these different preprocessings, with the exception of (Bäckström et al., 2018), which compared preprocessing using FreeSurfer to no preprocessing. They found that training the network with the raw data resulted in a lower classification performance (drop in accuracy of 38 percentage points) compared to the data preprocessed using
16 J. Wen, E. Thibeau-Sutre and M. Diaz-Melo et al. / Medical Image Analysis 63 (2020) 101694
FreeSurfer (Bäckström et al., 2018). However, FreeSurfer comprises a complex pipeline with many preprocessing steps, so it is unclear, from their results, which part drives the superior performance. We clearly demonstrated that intensity rescaling is essential for CNN training, whereas there is no improvement in using a non-linear registration over a linear one. Finally, we found that, for the 3D-patch level framework, the multi-CNN approach gave better results than the single-CNN one. However, this may be mainly because the multi-CNN approach benefits from a thresholding system which excludes the worst patches, a system that was not present in the single-CNN approach. To test this hypothesis, we performed supplementary experiments in which the multi-CNN was trained without thresholding and the single-CNN was trained using the same thresholding system as in the main experiments of the multi-CNN. Results are reported in eTables 6 and 7. We observed that the results of the multi-CNN and the single-CNN are comparable when they use the same thresholding system. For example, for the AD vs CN task, without thresholding, the BA of the multi-CNN was 0.76 using baseline data and 0.72 using longitudinal data, while those of the single-CNN were 0.74 and 0.76, respectively. A similar observation can be made when both approaches used the thresholding. These supplementary experiments suggest that, under similar conditions, the multi-CNN architecture does not always perform better than the single-CNN architecture. In light of this, it would seem preferable to choose a framework that offers a better compromise between performance and conceptual complexity, e.g. the 3D-ROI or the 3D-subject approaches.

Our study has the following limitations. First, a large number of options exist when choosing the model architecture and training hyperparameters. Even though we did our best to make meaningful choices and test a relatively large number of possibilities, we cannot exclude that other choices could have led to better results. To overcome this limitation, our framework is freely available to the community. Researchers can use it to propose and validate potentially better performing models. In particular, with our proposed framework, researchers can easily try their own models without touching the test datasets. Second, the CV procedures were performed only once. Of course, the training is not deterministic, and one would ideally want to repeat the CV to get a more robust estimate of the performance. However, we did not perform this due to limited computational resources. Finally, overfitting was always present in our experiments, even though different techniques were tried (e.g. transfer learning, dropout or weight decay). This phenomenon occurs mainly due to the limited size of the datasets available for AD classification. It is likely that training with much larger datasets would result in higher performance.

Declaration of Competing Interest

OC reports having received speaker fees from Roche (2015), Lundbeck (2012) and Guerbet (2010), having received consulting fees from AskBio (2020), having received fees for writing a lay audience short paper from Expression Santé (2019), having received speaker fees for a lay audience presentation from Palais de la Découverte (2017), and that his laboratory has received grants (paid to the institution) from EISAI (2007-2011), Air Liquide Medical Systems (2011-2016), Qynapse (2017-present) and my Brain Technologies (2016-present). His spouse is an employee at my Brain Technologies (2015-).

Acknowledgements

We thank Mr. Maxime Kermarquer for the IT support during this study. We also thank the following colleagues for useful discussions and suggestions: Alexandre Bône and Johann Faouzi. This work was granted access to the HPC resources of IDRIS under the allocation 2019-100963 made by GENCI (Grand Équipement National de Calcul Intensif) in the context of the Jean Zay "Grands Challenges" (2019). The research leading to these results has received funding from the program "Investissements d'avenir" ANR-10-IAIHU-06 (Agence Nationale de la Recherche-10-Investissements Avenir Institut Hospitalo-Universitaire-6), from ANR-19-P3IA-0001 (Agence Nationale de la Recherche - 19 - Programme Instituts Interdisciplinaires Intelligence Artificielle-0001, project PRAIRIE), from the European Union H2020 program (project EuroPOND, grant number 666992), and from the joint NSF/NIH/ANR program "Collaborative Research in Computational Neuroscience" (project HIPLAY7, grant number ANR-16-NEUC-0001-01). J.W. receives financial support from the China Scholarship Council (CSC). O.C. is supported by a "Contrat d'Interface Local" from Assistance Publique-Hôpitaux de Paris (AP-HP). Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie; Alzheimer's Association; Alzheimer's Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer's Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California. The OASIS Cross-Sectional project (Principal Investigators: D. Marcus, R. Buckner, J. Csernansky, J. Morris) was supported by the following grants: P50 AG05681, P01 AG03991, P01 AG026276, R01 AG021910, P20 MH071616, and U24 RR021382.

Supplementary materials

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.media.2020.101694.

References

Aderghal, K., Benois-Pineau, J., Afdel, K., Gwenaëlle, C., 2017a. FuseMe: Classification of sMRI images by fusion of Deep CNNs in 2D+ projections. In: 15th International Workshop on Content-Based Multimedia Indexing. ACM, p. 34. doi:10.1145/3095713.3095749.
Aderghal, K., Boissenin, M., Benois-Pineau, J., Catheline, G., Afdel, K., 2017b. Classification of sMRI for AD Diagnosis with Convolutional Neuronal Networks: A Pilot 2-D+Ɛ Study on ADNI. In: Amsaleg, L., Guðmundsson, G.Þ., Gurrin, C., Jónsson, B.Þ., Satoh, S. (Eds.), MultiMedia Modeling, Lecture Notes in Computer Science. Springer International Publishing, Cham, pp. 690–701. doi:10.1007/978-3-319-51811-4_56.
Aderghal, K., Khvostikov, A., Krylov, A., Benois-Pineau, J., Afdel, K., Catheline, G., 2018. Classification of Alzheimer Disease on Imaging Modalities with Deep CNNs Using Cross-Modal Transfer Learning. In: IEEE 31st International Symposium on Computer-Based Medical Systems (CBMS), pp. 345–350. doi:10.1109/CBMS.2018.00067.
Amoroso, N., Diacono, D., Fanizzi, A., La Rocca, M., Monaco, A., Lombardi, A., Guaragnella, C., Bellotti, R., Tangaro, S., Alzheimer's Disease Neuroimaging Initiative,
2018. Deep learning reveals Alzheimer's disease onset in MCI subjects: Results from an international challenge. J. Neurosci. Methods 302, 3–9. doi:10.1016/j.jneumeth.2017.12.011.
Ashburner, J., 2007. A fast diffeomorphic image registration algorithm. Neuroimage 38, 95–113. doi:10.1016/j.neuroimage.2007.07.007.
Ashburner, J., Friston, K.J., 2005. Unified segmentation. Neuroimage 26, 839–851. doi:10.1016/j.neuroimage.2005.02.018.
Avants, B.B., Epstein, C.L., Grossman, M., Gee, J.C., 2008. Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Med. Image Anal. 12, 26–41. doi:10.1016/j.media.2007.06.004.
Bäckström, K., Nazari, M., Gu, I.Y., Jakola, A.S., 2018. An efficient 3D deep convolutional network for Alzheimer's disease diagnosis using MR images. In: 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 149–153. doi:10.1109/ISBI.2018.8363543.
Bankman, I., 2008. Handbook of Medical Image Processing and Analysis. Elsevier.
Basaia, S., Agosta, F., Wagner, L., Canu, E., Magnani, G., Santangelo, R., Filippi, M., Alzheimer's Disease Neuroimaging Initiative, 2019. Automated classification of Alzheimer's disease and mild cognitive impairment using a single MRI and deep neural networks. Neuroimage Clin. 21, 101645. doi:10.1016/j.nicl.2018.101645.
Baskar, D., Jayanthi, V.S., Jayanthi, A.N., 2018. An efficient classification approach for detection of Alzheimer's disease from biomedical imaging modalities. Multimed. Tools Appl. 1–33. doi:10.1007/s11042-018-6287-8.
Bernal, J., Kushibar, K., Asfaw, D.S., Valverde, S., Oliver, A., Martí, R., Lladó, X., 2018. Deep convolutional neural networks for brain image analysis on magnetic resonance imaging: a review. Artif. Intell. Med. doi:10.1016/j.artmed.2018.08.008.
Bhagwat, N., Viviano, J.D., Voineskos, A.N., Chakravarty, M.M., Alzheimer's Disease Neuroimaging Initiative, 2018. Modeling and prediction of clinical symptom trajectories in Alzheimer's disease using longitudinal data. PLoS Comput. Biol. 14, e1006376. doi:10.1371/journal.pcbi.1006376.
Braak, H., Braak, E., 1998. Evolution of neuronal changes in the course of Alzheimer's disease. In: Jellinger, K., Fazekas, F., Windisch, M. (Eds.), Ageing and Dementia, J. Neural Transmission. Supplementa. Springer Vienna, Vienna, pp. 127–140. doi:10.1007/978-3-7091-6467-9_11.
Brookmeyer, R., Johnson, E., Ziegler-Graham, K., Arrighi, H.M., 2007. Forecasting the global burden of Alzheimer's disease. Alzheimers. Dement. 3, 186–191. doi:10.1016/j.jalz.2007.04.381.
Cárdenas-Peña, D., Collazos-Huertas, D., Castellanos-Dominguez, G., 2017. Enhanced Data Representation by Kernel Metric Learning for Dementia Diagnosis. Front. Neurosci. 11, 413. doi:10.3389/fnins.2017.00413.
Cárdenas-Peña, D., Collazos-Huertas, D., Castellanos-Dominguez, G., 2016. Centered Kernel Alignment Enhancing Neural Network Pretraining for MRI-Based Dementia Diagnosis. Comput. Math. Methods Med. 2016, 9523849. doi:10.1155/2016/9523849.
Chaddad, A., Desrosiers, C., Niazi, T., 2018. Deep Radiomic Analysis of MRI Related to Alzheimer's Disease. IEEE Access 6, 58213–58221. doi:10.1109/ACCESS.2018.2871977.
Cheng, D., Liu, M., 2017. CNNs based multi-modality classification for AD diagnosis. In: 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), pp. 1–5. doi:10.1109/CISP-BMEI.2017.8302281.
Cheng, D., Liu, M., Fu, J., Wang, Y., 2017. Classification of MR brain images by combination of multi-CNNs for AD diagnosis. In: Ninth International Conference on Digital Image Processing (ICDIP 2017). International Society for Optics and Photonics, p. 1042042. doi:10.1117/12.2281808.
Çitak-ER, F., Goularas, D., Ormeci, B., 2017. A novel Convolutional Neural Network Model Based on Voxel-based Morphometry of Imaging Data in Predicting the Prognosis of Patients with Mild Cognitive Impairment. J. Neurol. Sci. Turk. 34.
Cui, R., Liu, M., Li, G., 2018. Longitudinal analysis for Alzheimer's disease diagnosis using RNN. In: IEEE 15th International Symposium on Biomedical Imaging (ISBI), pp. 1398–1401. doi:10.1109/ISBI.2018.8363833.
Dickerson, B.C., Goncharova, I., Sullivan, M.P., Forchetti, C., Wilson, R.S., Bennett, D.A., Beckett, L.A., 2001. MRI-derived entorhinal and hippocampal atrophy in incipient and very mild Alzheimer's disease. Neurobiol. Aging 22, 747–754.
Dolph, C.V., Alam, M., Shboul, Z., Samad, M.D., Iftekharuddin, K.M., 2017. Deep learning of texture and structural features for multiclass Alzheimer's disease classification. In: International Joint Conference on Neural Networks (IJCNN), pp. 2259–2266. doi:10.1109/IJCNN.2017.7966129.
Duraisamy, B., Shanmugam, J.V., Annamalai, J., 2019. Alzheimer disease detection from structural MR images using FCM based weighted probabilistic neural network. Brain Imaging Behav. 13, 87–110. doi:10.1007/s11682-018-9831-2.
Ellis, K.A., Bush, A.I., Darby, D., De Fazio, D., Foster, J., Hudson, P., Lautenschlager, N.T., Lenzo, N., Martins, R.N., Maruff, P., 2009. The Australian Imaging, Biomarkers and Lifestyle (AIBL) study of aging: methodology and baseline characteristics of 1112 individuals recruited for a longitudinal study of Alzheimer's disease. Int. Psychogeriatr. 21, 672–687.
Ellis, K.A., Rowe, C.C., Villemagne, V.L., Martins, R.N., Masters, C.L., Salvado, O., Szoeke, C., Ames, D., 2010. Addressing population aging and Alzheimer's disease through the Australian Imaging Biomarkers and Lifestyle study: Collaboration with the Alzheimer's Disease Neuroimaging Initiative. Alzheimer's & Dementia. doi:10.1016/j.jalz.2010.03.009.
Esmaeilzadeh, S., Belivanis, D.I., Pohl, K.M., Adeli, E., 2018. End-To-End Alzheimer's Disease Diagnosis and Biomarker Identification. In: Shi, Y., Suk, H.-I., Liu, M. (Eds.), Machine Learning in Medical Imaging: 9th International Workshop, MLMI 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Proceedings. Lecture Notes in Computer Science. Springer International Publishing, Cham, pp. 337–345. doi:10.1007/978-3-030-00919-9_39.
Ewers, M., Sperling, R.A., Klunk, W.E., Weiner, M.W., Hampel, H., 2011. Neuroimaging markers for the prediction and early diagnosis of Alzheimer's disease dementia. Trends Neurosci. 34, 430–442. doi:10.1016/j.tins.2011.05.005.
Falahati, F., Westman, E., Simmons, A., 2014. Multivariate data analysis and machine learning in Alzheimer's disease with a focus on structural magnetic resonance imaging. J. Alzheimers. Dis. 41, 685–708. doi:10.3233/JAD-131928.
Farooq, A., Anwar, S., Awais, M., Rehman, S., 2017. A deep CNN based multi-class classification of Alzheimer's disease using MRI. In: 2017 IEEE International Conference on Imaging Systems and Techniques (IST), pp. 1–6. doi:10.1109/IST.2017.8261460.
Fonov, V., Dadar, M., The PREVENT-AD Research Group, Louis Collins, D., 2018. Deep learning of quality control for stereotaxic registration of human brain MRI. bioRxiv. doi:10.1101/303487.
Fonov, V., Evans, A.C., Botteron, K., Almli, C.R., McKinstry, R.C., Collins, D.L., Brain Development Cooperative Group, 2011. Unbiased average age-appropriate atlases for pediatric studies. Neuroimage 54, 313–327. doi:10.1016/j.neuroimage.2010.07.033.
Fonov, V.S., Evans, A.C., McKinstry, R.C., Almli, C.R., Collins, D.L., 2009. Unbiased nonlinear average age-appropriate brain templates from birth to adulthood. Neuroimage Supplement 1, S102. doi:10.1016/S1053-8119(09)70884-5.
Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning. MIT Press, Cambridge.
Gorgolewski, K., Burns, C.D., Madison, C., Clark, D., Halchenko, Y.O., Waskom, M.L., Ghosh, S.S., 2011. Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in Python. Front. Neuroinform. 5, 13. doi:10.3389/fninf.2011.00013.
Gorgolewski, K.J., Auer, T., Calhoun, V.D., Craddock, R.C., Das, S., Duff, E.P., Flandin, G., Ghosh, S.S., Glatard, T., Halchenko, Y.O., Handwerker, D.A., Hanke, M., Keator, D., Li, X., Michael, Z., Maumet, C., Nichols, B.N., Nichols, T.E., Pellman, J., Poline, J.-B., Rokem, A., Schaefer, G., Sochat, V., Triplett, W., Turner, J.A., Varoquaux, G., Poldrack, R.A., 2016. The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments. Sci. Data 3, 160044. doi:10.1038/sdata.2016.44.
Gorgolewski, K.J., Poldrack, R.A., 2016. A Practical Guide for Improving Transparency and Reproducibility in Neuroimaging Research. PLoS Biol. 14, e1002506. doi:10.1371/journal.pbio.1002506.
Gorji, H.T., Haddadnia, J., 2015. A novel method for early diagnosis of Alzheimer's disease based on pseudo Zernike moment from structural MRI. Neuroscience 305, 361–371. doi:10.1016/j.neuroscience.2015.08.013.
Gunawardena, K.A.N.N.P., Rajapakse, R.N., Kodikara, N.D., 2017. Applying convolutional neural networks for pre-detection of Alzheimer's disease from structural MRI data. In: 2017 24th International Conference on Mechatronics and Machine Vision in Practice (M2VIP), pp. 1–7. doi:10.1109/M2VIP.2017.8211486.
Guo, C., Pleiss, G., Sun, Y., Weinberger, K.Q., 2017. On Calibration of Modern Neural Networks. In: Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17. JMLR.org, Sydney, NSW, Australia, pp. 1321–1330.
Gutiérrez-Becker, B., Wachinger, C., 2018. Deep Multi-structural Shape Analysis: Application to Neuroanatomy. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (Eds.), Medical Image Computing and Computer Assisted Intervention – MICCAI 2018: 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part III. Lecture Notes in Computer Science. Springer International Publishing, Cham, pp. 523–531. doi:10.1007/978-3-030-00931-1_60.
Haller, S., Lovblad, K.O., Giannakopoulos, P., 2011. Principles of classification analyses in mild cognitive impairment (MCI) and Alzheimer disease. J. Alzheimers. Dis. 26, Suppl. 3, 389–394. doi:10.3233/JAD-2011-0014.
He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Hon, M., Khan, N.M., 2017. Towards Alzheimer's disease classification through transfer learning. In: IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 1166–1169. doi:10.1109/BIBM.2017.8217822.
Hosseini Asl, E., Ghazal, M., Mahmoud, A., Aslantas, A., Shalaby, A., Casanova, M., Barnes, G., Gimel'farb, G., Keynton, R., El Baz, A., 2018. Alzheimer's disease diagnostics by a 3D deeply supervised adaptable convolutional network. Front. Biosci. 23, 584–596. doi:10.2741/4606.
Hosseini-Asl, E., Keynton, R., El-Baz, A., 2016. Alzheimer's disease diagnostics by adaptation of 3D convolutional network. In: 2016 IEEE International Conference on Image Processing (ICIP), pp. 126–130. doi:10.1109/ICIP.2016.7532332.
Islam, J., Zhang, Y., 2018. Brain MRI analysis for Alzheimer's disease diagnosis using an ensemble system of deep convolutional neural networks. Brain Inform. 5 (2). doi:10.1186/s40708-018-0080-3.
Islam, J., Zhang, Y., 2017. A Novel Deep Learning Based Multi-class Classification Method for Alzheimer's Disease Detection Using Brain MRI Data. In: Zeng, Y., He, Y., Kotaleski, J.H., Martone, M., Xu, B., Peng, H., Luo, Q. (Eds.), Brain Informatics, Lecture Notes in Computer Science. Springer International Publishing, Cham, pp. 213–222. doi:10.1007/978-3-319-70772-3_20.
Jha, D., Kim, J.-I., Kwon, G.-R., 2017. Diagnosis of Alzheimer's Disease Using Dual-Tree Complex Wavelet Transform, PCA, and Feed-Forward Neural Network. J. Healthc. Eng. 2017, 9060124. doi:10.1155/2017/9060124.
Korolev, S., Safiullin, A., Belyaev, M., Dodonova, Y., 2017. Residual and plain convolutional neural networks for 3D brain MRI classification. In: IEEE
14th International Symposium on Biomedical Imaging (ISBI), pp. 835–838. doi:10.1109/ISBI.2017.7950647.
Kriegeskorte, N., Simmons, W.K., Bellgowan, P.S.F., Baker, C.I., 2009. Circular analysis in systems neuroscience: the dangers of double dipping. Nat. Neurosci. 12, 535–540. doi:10.1038/nn.2303.
Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. ImageNet Classification with Deep Convolutional Neural Networks. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (Eds.), Advances in Neural Information Processing Systems 25. Curran Associates, Inc., pp. 1097–1105.
Kuhn, M., Johnson, K., 2013. Applied Predictive Modeling. Springer, New York, NY. doi:10.1007/978-1-4614-6849-3.
LeCun, Y., Bengio, Y., Hinton, G., 2015. Deep learning. Nature 521, 436–444. doi:10.1038/nature14539.
Ledig, C., Heckemann, R.A., Hammers, A., Lopez, J.C., Newcombe, V.F.J., Makropoulos, A., Lötjönen, J., Menon, D.K., Rueckert, D., 2015. Robust whole-brain segmentation: application to traumatic brain injury. Med. Image Anal. 21, 40–58. doi:10.1016/j.media.2014.12.003.
Lian, C., Liu, M., Zhang, J., Shen, D., 2018. Hierarchical Fully Convolutional Network for Joint Atrophy Localization and Alzheimer's Disease Diagnosis using Structural MRI. IEEE Trans. Pattern Anal. Mach. Intell. doi:10.1109/TPAMI.2018.2889096.
Li, F., Cheng, D., Liu, M., 2017. Alzheimer's disease classification based on combination of multi-model convolutional networks. In: IEEE International Conference on Imaging Systems and Techniques (IST), pp. 1–5. doi:10.1109/IST.2017.8261566.
Li, F., Liu, M., Alzheimer's Disease Neuroimaging Initiative, 2018. Alzheimer's disease diagnosis based on multiple cluster dense convolutional networks. Comput. Med. Imaging Graph. 70, 101–110. doi:10.1016/j.compmedimag.2018.09.009.
Li, F., Tran, L., Thung, K.-H., Ji, S., Shen, D., Li, J., 2015. A Robust Deep Model for Improved Classification of AD/MCI Patients. IEEE J. Biomed. Health Inform. 19, 1610–1616. doi:10.1109/JBHI.2015.2429556.
Lin, W., Tong, T., Gao, Q., Guo, D., Du, X., Yang, Y., Guo, G., Xiao, M., Du, M., Qu, X., Alzheimer's Disease Neuroimaging Initiative, 2018. Convolutional Neural Networks-Based MRI Image Analysis for the Alzheimer's Disease Prediction From Mild Cognitive Impairment. Front. Neurosci. 12, 777. doi:10.3389/fnins.2018.00777.
Liu, J., Pan, Y., Li, M., Chen, Z., Tang, L., Lu, C., Wang, J., 2018a. Applications of deep learning to MRI images: A survey. Big Data Mining and Analytics 1, 1–18. doi:10.26599/BDMA.2018.9020001.
Liu, J., Shang, S., Zheng, K., Wen, J.-R., 2016. Multi-view ensemble learning for dementia diagnosis from neuroimaging: An artificial neural network approach. Neurocomputing 195, 112–116. doi:10.1016/j.neucom.2015.09.119.
Liu, M., Cheng, D., Wang, K., Wang, Y., Alzheimer's Disease Neuroimaging Initiative, 2018b. Multi-Modality Cascaded Convolutional Neural Networks for Alzheimer's Disease Diagnosis. Neuroinformatics 16, 295–308. doi:10.1007/s12021-018-9370-4.
Liu, M., Zhang, J., Adeli, E., Shen, D., 2018c. Landmark-based deep multi-instance learning for brain disease diagnosis. Med. Image Anal. 43, 157–168. doi:10.1016/j.media.2017.10.005.
Liu, M., Zhang, J., Adeli, E., Shen, D., 2018d. Joint Classification and Regression via Deep Multi-Task Multi-Channel Learning for Alzheimer's Disease Diagnosis. IEEE Trans. Biomed. Eng. doi:10.1109/TBME.2018.2869989.
Liu, M., Zhang, J., Nie, D., Yap, P.-T., Shen, D., 2018e. Anatomical Landmark Based Deep Feature Representation for MR Images in Brain Disease Diagnosis. IEEE J. Biomed. Health Inform. 22, 1476–1485. doi:10.1109/JBHI.2018.2791863.
Liu, S., Liu, S., Cai, W., Che, H., Pujol, S., Kikinis, R., Feng, D., Fulham, M.J., ADNI, 2015. Multimodal neuroimaging feature learning for multiclass diagnosis of Alzheimer's disease. IEEE Trans. Biomed. Eng. 62, 1132–1140. doi:10.1109/TBME.2014.2372011.
Lu, D., Popuri, K., Ding, G.W., Balachandar, R., Beg, M.F., Alzheimer's Disease Neuroimaging Initiative, 2018. Multimodal and Multiscale Deep Neural Networks for the Early Diagnosis of Alzheimer's Disease using structural MR and FDG-PET images. Sci. Rep. 8, 5697. doi:10.1038/s41598-018-22871-z.
Lundervold, A.S., Lundervold, A., 2018. An overview of deep learning in medical imaging focusing on MRI. Z. Med. Phys. doi:10.1016/j.zemedi.2018.11.002.
Mahanand, B.S., Suresh, S., Sundararajan, N., Aswatha Kumar, M., 2012. Identification of brain regions responsible for Alzheimer's disease using a Self-adaptive Resource Allocation Network. Neural Netw. 32, 313–322. doi:10.1016/j.neunet.2012.02.035.
Maitra, M., Chatterjee, A., 2006. A Slantlet transform based intelligent system for magnetic resonance brain image classification. Biomed. Signal Process. Control 1, 299–306. doi:10.1016/j.bspc.2006.12.001.
Marcus, D.S., Wang, T.H., Parker, J., Csernansky, J.G., Morris, J.C., Buckner, R.L., 2007. Open Access Series of Imaging Studies (OASIS): cross-sectional MRI data in young, middle aged, nondemented, and demented older adults. J. Cogn. Neurosci. 19, 1498–1507. doi:10.1162/jocn.2007.19.9.1498.
Mathew, N.A., Vivek, R.S., Anurenjan, P.R., 2018. Early Diagnosis of Alzheimer's Disease from MRI Images Using PNN. In: International CET Conference on Control, Communication, and Computing (IC4), pp. 161–164. doi:10.1109/CETIC4.2018.8530910.
McKhann, G., Drachman, D., Folstein, M., Katzman, R., Price, D., Stadlan, E.M., 1984. Clinical diagnosis of Alzheimer's disease: Report of the NINCDS-ADRDA Work Group under the auspices of Department of Health and Human Services Task Force on Alzheimer's Disease. Neurology 34, 939.
Mostapha, M., Kim, S., Wu, G., Zsembik, L., Pizer, S., Styner, M., 2018. Non-Euclidean, convolutional learning on cortical brain surfaces. In: IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 527–530. doi:10.1109/ISBI.2018.8363631.
Ning, K., Chen, B., Sun, F., Hobel, Z., Zhao, L., Matloff, W., Toga, A.W., Alzheimer's Disease Neuroimaging Initiative, 2018. Classifying Alzheimer's disease with brain imaging and genetic data using a neural network framework. Neurobiol. Aging 68, 151–158. doi:10.1016/j.neurobiolaging.2018.04.009.
Ortiz, A., Munilla, J., Górriz, J.M., Ramírez, J., 2016. Ensembles of Deep Learning Architectures for the Early Diagnosis of the Alzheimer's Disease. Int. J. Neural Syst. 26, 1650025. doi:10.1142/S0129065716500258.
Parisot, S., Ktena, S.I., Ferrante, E., Lee, M., Guerrero, R., Glocker, B., Rueckert, D., 2018. Disease prediction using graph convolutional networks: Application to Autism Spectrum Disorder and Alzheimer's disease. Med. Image Anal. 48, 117–130. doi:10.1016/j.media.2018.06.001.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A., 2017. Automatic differentiation in PyTorch.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, É., 2011. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
Petersen, R.C., Aisen, P.S., Beckett, L.A., Donohue, M.C., Gamst, A.C., Harvey, D.J., Jack Jr, C.R., Jagust, W.J., Shaw, L.M., Toga, A.W., Trojanowski, J.Q., Weiner, M.W., 2010. Alzheimer's Disease Neuroimaging Initiative (ADNI): clinical characterization. Neurology 74, 201–209. doi:10.1212/WNL.0b013e3181cb3e25.
Poldrack, R.A., Baker, C.I., Durnez, J., Gorgolewski, K.J., Matthews, P.M., Munafò, M.R., Nichols, T.E., Poline, J.-B., Vul, E., Yarkoni, T., 2017. Scanning the horizon: towards transparent and reproducible neuroimaging research. Nat. Rev. Neurosci. 18, 115–126. doi:10.1038/nrn.2016.167.
Qiu, S., Chang, G.H., Panagia, M., Gopal, D.M., Au, R., Kolachalama, V.B., 2018. Fusion of deep learning models of MRI scans, Mini–Mental State Examination, and logical memory test enhances diagnosis of mild cognitive impairment. Alzheimer's & Dementia: Diagnosis, Assessment & Disease Monitoring 10, 737–749. doi:10.1016/j.dadm.2018.08.013.
Raamana, P.R., 2017. neuropredict: easy machine learning and standardized predictive analysis of biomarkers. doi:10.5281/zenodo.1058993.
Raschka, S., 2015. Python Machine Learning. Packt Publishing Ltd.
Rathore, S., Habes, M., Iftikhar, M.A., Shacklett, A., Davatzikos, C., 2017. A review on neuroimaging-based classification studies and associated feature extraction methods for Alzheimer's disease and its prodromal stages. Neuroimage 155, 530–548. doi:10.1016/j.neuroimage.2017.03.057.
Raut, A., Dalal, V., 2017. A machine learning based approach for detection of Alzheimer's disease using analysis of hippocampus region from MRI scan. In: International Conference on Computing Methodologies and Communication (ICCMC), pp. 236–242. doi:10.1109/ICCMC.2017.8282683.
Razzak, M.I., Naz, S., Zaib, A., 2018. Deep Learning for Medical Image Processing: Overview, Challenges and the Future. In: Dey, N., Ashour, A.S., Borra, S. (Eds.), Classification in BioApps: Automation of Decision Making. Springer International Publishing, Cham, pp. 323–350. doi:10.1007/978-3-319-65981-7_12.
Ripley, B.D., 1996. Pattern Recognition and Neural Networks. Cambridge University Press. doi:10.1017/CBO9780511812651.
Routier, A., Guillon, J., Burgos, N., Samper-González, J., Wen, J., Fontanella, S., Bottani, S., Jacquemont, T., Marcoux, A., Gori, P., Lu, P., Moreau, T., Bacci, M., Durrleman, S., Colliot, O., 2018. Clinica: an open source software platform for reproducible clinical neuroscience studies. In: Annual Meeting of the Organization for Human Brain Mapping (OHBM).
Salvatore, C., Cerasa, A., Battista, P., Gilardi, M.C., Quattrone, A., Castiglioni, I., Alzheimer's Disease Neuroimaging Initiative, 2015. Magnetic resonance imaging biomarkers for the early diagnosis of Alzheimer's disease: a machine learning approach. Front. Neurosci. 9, 307. doi:10.3389/fnins.2015.00307.
Samper-González, J., Burgos, N., Bottani, S., Fontanella, S., Lu, P., Marcoux, A., Routier, A., Guillon, J., Bacci, M., Wen, J., Bertrand, A., Bertin, H., Habert, M.-O., Durrleman, S., Evgeniou, T., Colliot, O., Alzheimer's Disease Neuroimaging Initiative, Australian Imaging Biomarkers and Lifestyle flagship study of ageing, 2018. Reproducible evaluation of classification methods in Alzheimer's disease: Framework and application to MRI and PET data. Neuroimage. doi:10.1016/j.neuroimage.2018.08.042.
Sarle, W.S., 1997. Neural Network FAQ, part 1 of 7: Introduction. Periodic posting to the Usenet newsgroup comp.ai.neural-nets. URL: ftp://ftp.sas.com/pub/neural/FAQ.html.
Schuff, N., Woerner, N., Boreta, L., Kornfield, T., Shaw, L.M., Trojanowski, J.Q., Thompson, P.M., Jack Jr, C.R., Weiner, M.W., Alzheimer's Disease Neuroimaging Initiative, 2009. MRI of hippocampal volume loss in early Alzheimer's disease in relation to ApoE genotype and biomarkers. Brain 132, 1067–1077. doi:10.1093/brain/awp007.
Senanayake, U., Sowmya, A., Dawes, L., 2018. Deep fusion pipeline for mild cognitive impairment diagnosis. In: IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 1394–1397. doi:10.1109/ISBI.2018.8363832.
Shams-Baboli, A., Ezoji, M., 2017. A Zernike moment based method for classification of Alzheimer's disease from structural MRI. In: 3rd International Conference on Pattern Recognition and Image Analysis (IPRIA), pp. 38–43. doi:10.1109/PRIA.2017.7983061.
Shen, T., Jiang, J., Li, Y., Wu, P., Zuo, C., Yan, Z., 2018. Decision Supporting Model for One-year Conversion Probability from MCI to AD using CNN and SVM. In: 40th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 738–741. doi:10.1109/EMBC.2018.8512398.
Shi, J., Zheng, X., Li, Y., Zhang, Q., Ying, S., 2018. Multimodal Neuroimaging Feature Learning With Multimodal Stacked Deep Polynomial Networks for Diagnosis of Alzheimer's Disease. IEEE J. Biomed. Health Inform. 22, 173–183. doi:10.1109/JBHI.2017.2655720.
Shmulev, Y., Belyaev, M., 2018. Predicting Conversion of Mild Cognitive Impairments to Alzheimer's Disease and Exploring Impact of Neuroimaging. In: Stoyanov, D., Taylor, Z., Ferrante, E., Dalca, A.V., Martel, A., Maier-Hein, L., Parisot, S., Sotiras, A., Papiez, B., Sabuncu, M.R., Shen, L. (Eds.), Graphs in Biomedical Image Analysis and Integrating Medical Imaging and Non-Imaging Modalities: Second International Workshop, GRAIL 2018 and First International Workshop, Beyond MIC 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings. Lecture Notes in Computer Science. Springer International Publishing, Cham, pp. 83–91. doi:10.1007/978-3-030-00689-1_9.
Simonyan, K., Zisserman, A., 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv [cs.CV].
Sonnenburg, S., Braun, M.L., Ong, C.S., Bengio, S., Bottou, L., Holmes, G., LeCun, Y., Müller, K.-R., Pereira, F., Rasmussen, C.E., Rätsch, G., Schölkopf, B., Smola, A.,
Vu, T.D., Yang, H.-J., Nguyen, V.Q., Oh, A.-R., Kim, M.-S., 2017. Multimodal learning using Convolution Neural Network and Sparse Autoencoder. In: IEEE International Conference on Big Data and Smart Computing (BigComp), pp. 309–312. doi:10.1109/BIGCOMP.2017.7881683.
Wang, H., Shen, Y., Wang, S., Xiao, T., Deng, L., Wang, X., Zhao, X., 2019. Ensemble of 3D densely connected convolutional network for diagnosis of mild cognitive impairment and Alzheimer's disease. Neurocomputing 333, 145–156. doi:10.1016/j.neucom.2018.12.018.
Wang, S.-H., Phillips, P., Sui, Y., Liu, B., Yang, M., Cheng, H., 2018a. Classification of Alzheimer's Disease Based on Eight-Layer Convolutional Neural Network with Leaky Rectified Linear Unit and Max Pooling. J. Med. Syst. 42, 85. doi:10.1007/
Vincent, P., Weston, J., Williamson, R., 2007. The Need for Open Source Software s10916- 018- 0932- 7.
in Machine Learning. J. Mach. Learn. Res. 8, 2443–2466. Wang, S., Shen, Y., Chen, W., Xiao, T., Hu, J., 2017. Automatic Recognition
Spasov, S.E., Passamonti, L., Duggento, A., Lio, P., Toschi, N., 2018. A Multi-modal of Mild Cognitive Impairment from MRI Images Using Expedited Convolu-
Convolutional Neural Network Framework for the Prediction of Alzheimer’s Dis- tional Neural Networks. In: Artificial Neural Networks and Machine Learn-
ease. Conf. Proc. IEEE Eng. Med. Biol. Soc. 2018, 1271–1274. doi:10.1109/EMBC. ing – ICANN 2017. Springer International Publishing, pp. 373–380. doi:10.1007/
2018.8512468. 978- 3- 319- 68600- 4_43.
Stodden, V., Leisch, F., Peng, R.D., 2014. Implementing Reproducible Research. CRC Wang, X., Cai, W., Shen, D., Huang, H., b. Temporal Correlation Structure Learning
Press. for MCI Conversion Prediction: 21st International Conference, Granada, Spain,
Suk, H.-I., Lee, S.-W., Shen, D.Alzheimer’s Disease Neuroimaging Initiative, 2017. September 16-20, 2018, Proceedings, Part III. In: Frangi, A.F., Schnabel, J.A., Da-
Deep ensemble learning of sparse regression models for brain disease diagnosis. vatzikos, C., Alberola-López, C., Fichtinger, G. (Eds.), Medical Image Computing
Med. Image Anal. 37, 101–113. doi:10.1016/j.media.2017.01.008. and Computer Assisted Intervention – MICCAI 2018, Lecture Notes in Com-
Suk, H.-I., Lee, S.-W., Shen, D.Alzheimer’s Disease Neuroimaging Initiative, 2015. La- puter Science. Springer International Publishing, Cham, pp. 446–454. doi:10.
tent feature representation with stacked auto-encoder for AD/MCI diagnosis. 1007/978- 3- 030- 00931- 1_51.
Brain Struct. Funct. 220, 841–859. doi:10.10 07/s0 0429- 013- 0687- 3. Wen, D., Wei, Z., Zhou, Y., Li, G., Zhang, X., Han, W., a. Deep Learning Meth-
Suk, H.-I., Lee, S.-W., Shen, D.Alzheimer’s Disease Neuroimaging Initiative, 2014. Hi- ods to Process fMRI Data and Their Application in the Diagnosis of Cognitive
erarchical feature representation and multimodal fusion with deep learning for Impairment: A Brief Overview and Our Opinion. Front. Neuroinform. 12, 23.
AD/MCI diagnosis. Neuroimage 101, 569–582. doi:10.1016/j.neuroimage.2014.06. doi:10.3389/fninf.2018.0 0 023.
077. Wen, J., Samper-Gonzalez, J., Bottani, S., Routier, A., Burgos, N., Jacquemont, T.,
Taqi, A.M., Awad, A., Al-Azzo, F., Milanova, M., 2018. The Impact of Multi-Optimizers Fontanella, S., Durrleman, S., Epelbaum, S., Bertrand, A., Colliot, O., 2018b. Re-
and Data Augmentation on TensorFlow Convolutional Neural Network Perfor- producible evaluation of diffusion MRI features for automatic classification of
mance, in: 2018 IEEE Conference on Multimedia Information Processing and Re- patients with Alzheimers disease. arXiv [q-bio.QM].
trieval (MIPR). pp. 140–145. 10.1109/MIPR.2018.0 0 032. Wu, C., Guo, S., Hong, Y., Xiao, B., Wu, Y., Zhang, Q.Alzheimer’s Disease Neuroimag-
Thung, K.-H., Yap, P.-T., Shen, D., 2017. Multi-stage Diagnosis of Alzheimer’s Dis- ing Initiative, 2018. Discrimination and conversion prediction of mild cognitive
ease with Incomplete Multimodal Data via Multi-task Deep Learning. Deep impairment using convolutional neural networks. Quant. Imaging Med. Surg. 8,
Learn. Med. Image Anal. Multimodal Learn. Clin. Decis. Support 10553, 160–168. 992–1003. doi:10.21037/qims.2018.10.17.
doi:10.1007/978- 3- 319- 67558- 9_19. Zhang, Y., Wang, S., Sui, Y., Yang, M., Liu, B., Cheng, H., Sun, J., Jia, W., Phillips, P.,
Tustison, N.J., Avants, B.B., Cook, P.A., Zheng, Y., Egan, A., Yushkevich, P.A., Gee, J.C., Gorriz, J.M., 2018. Multivariate Approach for Alzheimer’s Disease Detection Us-
2010. N4ITK: improved N3 bias correction. IEEE Trans. Med. Imaging 29, 1310– ing Stationary Wavelet Entropy and Predator-Prey Particle Swarm Optimization.
1320. doi:10.1109/TMI.2010.2046908. J. Alzheimers. Dis. 65, 855–869. doi:10.3233/JAD-170069.
Valliani, A., Soni, A., 2017. Deep Residual Nets for Improved Alzheimer’s Diagnosis. Zhou, T., Thung, K.-H., Zhu, X., Shen, D., 2019. Effective feature learning and fusion
In: 8th ACM International Conference on Bioinformatics, Computational Biol- of multimodality data using stage-wise deep neural network for dementia diag-
ogy,and Health Informatics. ACM, p. 615. doi:10.1145/3107411.3108224. nosis. Hum. Brain Mapp. 40, 1001–1016. doi:10.1002/hbm.24428.
Vanschoren, J., van Rijn, J.N., Bischl, B., Torgo, L., 2014. OpenML: Networked Science Zhou, T., Thung, K.-H., Zhu, X., Shen, D., 2017. Feature Learning and Fusion of Multi-
in Machine Learning. SIGKDD Explor. Newsl. 15, 49–60. doi:10.1145/2641190. modality Neuroimaging and Genetic Data for Multi-status Dementia Diagnosis.
2641198. Mach. Learn. Med. Imaging 10541, 132–140. doi:10.1007/978- 3- 319- 67389- 9_16.
Vu, T.-D., Ho, N.-H., Yang, H.-J., Kim, J., Song, H.-C., 2018. Non-white matter tissue
extraction and deep convolutional neural network for Alzheimer’s disease de-
tection. Soft. Comput. 22, 6825–6833. doi:10.10 07/s0 050 0- 018- 3421- 5.
