
Cearns et al. Translational Psychiatry (2019) 9:271
https://doi.org/10.1038/s41398-019-0607-2

REVIEW ARTICLE  Open Access

Recommendations and future directions for supervised machine learning in psychiatry

Micah Cearns1, Tim Hahn2 and Bernhard T. Baune3,4,5

Abstract
Machine learning methods hold promise for personalized care in psychiatry, demonstrating the potential to tailor
treatment decisions and stratify patients into clinically meaningful taxonomies. Subsequently, the number of publications
applying machine learning methods has risen, with different data modalities, mathematically distinct models, and
samples of varying size being used to train and test models with the promise of clinical translation. Consequently, and
in part due to the preliminary nature of such works, many studies have reported largely varying degrees of accuracy,
raising concerns over systematic overestimation and methodological inconsistencies. Furthermore, a lack of
procedural evaluation guidelines for non-expert medical professionals and funding bodies leaves many in the field
with no means to systematically evaluate the claims, maturity, and clinical readiness of a project. Given the potential of
machine learning methods to transform patient care, albeit, contingent on the rigor of employed methods and their
dissemination, we deem it necessary to provide a review of current methods, recommendations, and future directions
for applied machine learning in psychiatry. In this review we will cover issues of best practice for model training and
evaluation, sources of systematic error and overestimation, model explainability vs. trust, the clinical implementation of
AI systems, and finally, future directions for our field.

Correspondence: Bernhard T. Baune ([email protected])
These authors contributed equally: Micah Cearns, Tim Hahn
Full list of author information is available at the end of the article.

© The Author(s) 2019. Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Introduction
Accurate prediction of intervention response and illness trajectories remains an elusive problem for modern psychiatry, with contemporary practitioners still relying on a 'wait and see' approach for the treatment of psychiatric disorders1. This problem has likely arisen due to an interplay of biopsychosocial factors2 and statistical modeling decisions3. From a biopsychosocial perspective, the high degree of comorbidity between psychiatric conditions4,5, the lack of diagnostic biomarkers to delineate between disorders and illness trajectories6,7, the shared genetic origins of clinically disparate traits8, and the imprecision of symptom measures9,10 have likely contributed to the complexity and lack of accuracy in clinical decision making.

Methodologically, psychiatry has commonly focused on statistical inference over prediction3,11. Inferential statistics have afforded the testing of theory-driven hypotheses, population inference, and the formulation of grounded theory and mechanism to better understand the aetiology of psychiatric traits3,11. However, both psychiatry and neuroscience have found themselves with significant translation problems. Even in the face of new discoveries and paradigm shifts in the understanding of disorders, the clinical practice of psychiatry and the discovery of interventions that outperform placebo have been slow12.

Given this complexity, clinicians commonly assume diagnostic homogeneity, where all patients who present with e.g., symptoms of low mood, lack of energy, and negative thoughts are considered to have the same broad diagnosis of major depressive disorder (MDD)1. However, studies using machine learning methodologies (ML) have begun to identify subtypes of psychiatric disorders with
differing symptomology13, illness trajectories14,15, and drug response profiles16. Adding another layer of complexity are the individual differences inherent in these disorders17. Ignoring this individuality and modeling at the group level fails to represent the heterogeneity of clinical populations18. Rather than assuming that all patients are accurately represented by measures of central tendency from case-control studies, one solution is to utilize modeling techniques that can parse patient heterogeneity19.

ML models are capable of this task: by learning individual patient characteristics, they can make successive individual (i.e., single-subject) predictions. For example, using an ML model trained on multisite data from the STAR*D consortium, Chekroud et al.16 were able to predict remission from MDD after a 12-week course of citalopram therapy with an accuracy of 64.6%. The model was then externally validated in the escitalopram and escitalopram-bupropion treatment groups of COMED, attaining accuracies of 59.6% and 59.7%, respectively. Given a report of ~49.3% accuracy for clinician prognostication on the same outcome in the STAR*D cohort16, this is a clinically meaningful increase in prognostic certainty.

This study is an exemplar of applied ML in psychiatry. The dataset had a large number of observations, allowing for the learning of unique patient characteristics. Model selection was tailored to the available data and rigorously cross-validated across sites using pipeline architecture. It was multisite, affording geographic generalizability and a large clinical scope. Finally, code for the trained model was made available on request, allowing for transparency and dissemination of the study's methods.

Prior to and since this publication, many more ML works have been published using different data modalities, models, and sample sizes, and, most interestingly, have reported largely varying degrees of accuracy20. Given this variability, concerns have been raised questioning the veracity of findings in our field and hinting at sources of systematic overestimation20,21. Adding further complication, large multisite datasets are the exception, not the rule; datasets commonly have more predictors than observations, are generally imbalanced, and have low signal-to-noise ratios. Given these circumstances, concerns have been raised that ML may face the same reproducibility crisis as that experienced by group-level analyses in recent years22–24. However, when proper methodology has been employed, ML works have been shown to prognosticate significantly better than chance on unseen data15, generalize across data collection sites14, and outperform clinician prognostication16,25.

Given the potential of ML models to transform patient care, albeit contingent on the rigor of employed methods, we deem it necessary to provide a best-practice overview for applied ML in psychiatry. In this guide we will cover issues of best practice for model training and evaluation, sources of systematic error and overestimation, model explainability vs. trust, the clinical implementation of AI systems, and finally, future directions for our field.

Model training and evaluation

Sample size and systematic overestimation
An array of model training and testing schemes exists in ML. The choice between them depends on the size of a dataset and a practitioner's computational resources. This brings us to our first encounter with the question of sample size in psychiatric ML. How big should a dataset be before a practitioner decides to use an ML strategy? Ignoring this question and applying ML to small datasets has given rise to an important concern: ML studies using larger samples have commonly shown weaker performance than studies using smaller samples26. This observation has led to questions regarding the validity and reliability of preliminary small-N ML studies in psychiatry21. To measure the degree of these effects, Neuhaus and Popescu20,26 collated studies across conditions, including schizophrenia (total observation N = 5563), MDD (N = 2042), and attention deficit hyperactivity disorder (ADHD, N = 8084), finding an inverse relationship between sample size and balanced accuracy (schizophrenia, r = −0.34, p = 0.012; MDD, r = −0.32, p = 0.053; and ADHD, r = −0.43, p = 0.044)20. As we would expect model performance to increase with more data, these findings suggest an underlying problem within our field.

One explanation proposed by Schnack and Kahn21 is that patient characteristics in smaller samples tend to be more homogenous. In the case of small N, participants may be more likely to be recruited from the same data collection site and to be of a similar age (for example, in the case of a university-recruited convenience sample). In addition, stringent recruitment criteria may be easily met, resulting in a well-defined phenotype that is not truly representative of the parent population of interest. As sample size increases, the geographic, demographic, and phenotypic diversity of a sample will also increase, resulting in decreased model performance, yet increased generalizability and model scope (see 'From proof-of-concept studies to clinical application' below). Theoretically, future works may be able to circumvent this trade-off by subtyping patients into well-defined and clinically meaningful clusters; models could then be trained for specific patient subtypes, maintaining phenotypic homogeneity as sample size increases. These considerations underline the importance of further research into patient subtyping in conjunction with supervised ML methods. In addition to the issue of sample homogeneity, the sample size needed to train an ML model is also contingent on the strength of the underlying signal between input features and an outcome of interest, as well as the complexity
of the underlying mapping function (the mathematical function used to derive a line or curve between data points). As these two factors can vary greatly between research questions and datasets, there can, in principle, be no general rule of thumb to estimate the sample size required for analyses. Beyond these sample size and heterogeneity constraints, the choice of cross-validation scheme also bears influence on the variability and bias of accuracy estimates. In the following sections, we will address the methods available for preliminary ML works and their common use cases.

Leave one out cross-validation
Due to sample size constraints, leave-one-out cross-validation (LOOCV) has become a popular strategy for model performance evaluation in applied psychiatric ML. With this strategy, a model is trained on all available observations minus one (n − 1). The trained model is then tested on the one held-out observation. This process is repeated until all available observations have been used for testing. Final model performance is then averaged across the held-out samples and an accuracy estimate is derived. However, previous work has demonstrated the variance properties of LOOCV, suggesting that although this method utilizes all available data, an appealing property given limited N, it consequently leads to unstable and biased estimates due to the high degree of correlation between samples27. Therefore, repeated random splits are preferred, showing less bias and variability27,28. Notwithstanding, further complications will arise when we conduct multiple transformations on our data and optimize a model's hyperparameters. If we were to use LOOCV yet conduct transformations within the same cross-validation scheme, we may optimistically bias our model evaluations by using the same test set (n − 1) to select parameters and evaluate the model29. In this instance, we need a method that allows us to take advantage of all our data for both model selection and evaluation whilst avoiding any circularity bias. In these situations, each transformation needs to be completed in a nesting procedure.
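As a minimal illustration of the contrast discussed above, the two schemes can be swapped in and out of the same scikit-learn scoring call (synthetic placeholder data and an arbitrary linear SVM, not any of the cohorts discussed here):

# Minimal sketch: LOOCV versus repeated, stratified k-fold on placeholder data.
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=20, random_state=0)
clf = SVC(kernel="linear")

# Leave-one-out: n models, each tested on a single held-out observation.
loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())

# Repeated 10-fold: the preferred repeated random splits (here, 5 repeats).
rkf = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
rkf_scores = cross_val_score(clf, X, y, cv=rkf)

print(f"LOOCV mean accuracy: {loo_scores.mean():.2f}")
print(f"Repeated 10-fold mean accuracy: {rkf_scores.mean():.2f} "
      f"(fold-to-fold SD {rkf_scores.std():.2f})")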
Nested cross-validation
As the name implies, nested cross-validation allows for the nesting of multiple cross-validations. First, an inner cross-validation loop is used to conduct data transformations and/or hyperparameter optimization. This loop is akin to a train/validation partition. This loop is then nested inside an outer cross-validation loop that assesses the transformed data and optimized model on different test sets to those used in the inner loop. This outer loop is akin to a validation/test partition and allows for the approximation of the selected model's performance.

Once this architecture is defined, the appropriate number of cross-validation iterations needs to be set. Previous work by Kohavi has investigated this topic in detail28. For model selection, he found that 10-fold cross-validation best balanced the bias/variance trade-off. In addition, repeated runs of the 10-fold cycle were recommended to avoid opportune splits that may lead to overly optimistic estimates. Three to five repeats are commonly used. Finally, given the balance between computational cost and the bias/variance trade-off, it is common to use another 10-fold cycle in the outer cross-validation loop to assess the final model's performance. See Fig. 1 for a visualization of this structure. It is important to note that no "optimal" number of repeats can be given, as this will likely depend on the complexity of the model, the size and amount of signal in a sample, and the resulting stability of the final cross-validation estimates. If, for example, variance is high between cross-validation folds, more repeats should converge on a less variable mean estimate and reduce dependency on spurious data partitions. On the contrary, if the variance of estimates is low between folds, a high number of repeats will be of less use and will increase computational cost. However, whether repeated runs actually achieve this intended variance reduction has been contested30. Nonetheless, in small samples it may decrease the effects of favorable cross-validation splits without compromising test set sizes in a k-fold scheme.

While this method provides a means to conduct multiple transformations and estimate parameters without overtly double dipping into a dataset, it is possible that a nested scheme may generate test sets that are overly similar to training sets, as well as similar training sets across cross-validation folds29,31. In theory, this could also lead to systematic overestimation. One alternative in the case of small N is to keep data transformations and hyperparameter optimization to a minimum27. In the case of many features, only a few subsets could be tested in the inner loop whilst hyperparameters for the model could be left at default to decrease the risk of double dipping. For an example of this method in Python, see the following documentation32.
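In scikit-learn terms (mirroring the documentation cited in ref. 32, but with synthetic placeholder data and an illustrative grid of our own), the inner loop is a hyperparameter search that is itself cross-validated by the outer loop:

# Minimal sketch of nested cross-validation: the inner loop tunes
# hyperparameters, the outer loop estimates generalization performance.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=50, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC(kernel="linear"))])
param_grid = {"svm__C": [0.01, 0.1, 1, 10]}

inner_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

# Inner loop (akin to train/validation): hyperparameter optimization.
search = GridSearchCV(pipe, param_grid, cv=inner_cv)

# Outer loop (akin to validation/test): assesses the tuned pipeline on
# test folds that the inner search never saw.
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print(f"Nested CV accuracy: {nested_scores.mean():.2f} +/- {nested_scores.std():.2f}")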
Fig. 1 Visualization of a nested cross-validation scheme. All steps from 2a–2c should be conducted inside a pipeline, inside the inner cross-validation loop.

Train/test/validate
In the case of a sufficiently large sample, the problem of train/validate/test set overlap in a nested scheme can be avoided by using an entirely separate test partition to the train/validate partitions used for model selection. For example, if a balanced sample of N = 500 is available, N = 300 could be partitioned for training and validation, whilst N = 200 could be held out for final testing. In this case, both sample size over/under-estimation and partition overlap could be minimized. Commonly, partitions will be stratified by the outcome label for prediction. Transformations and model parameters will be learnt in the train/validation partitions and analyzed in the remaining test partition. For deployment of this method in the Python programming language, see the following documentation33.
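A minimal sketch of this partitioning for the N = 500 example above (synthetic data; the classifier and grid are placeholders), where model selection touches only the train/validate partition and the held-out test partition is scored once at the end:

# Minimal sketch of an a-priori train/validate vs. test partition,
# stratified by the outcome label.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=100, random_state=0)

# N = 300 for training/validation, N = 200 held out for final testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=200, stratify=y, random_state=0)

# Transformations and hyperparameters are learnt on the train/validate data only.
model = GridSearchCV(
    Pipeline([("scale", StandardScaler()),
              ("clf", LogisticRegression(solver="liblinear"))]),
    param_grid={"clf__C": [0.01, 0.1, 1, 10]},
    cv=10)
model.fit(X_train, y_train)

# The untouched test partition is analyzed once, at the very end.
print(f"Held-out test accuracy: {model.score(X_test, y_test):.2f}")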
Leave group out cross-validation
What about when a sample is multisite, and each data collection site has its own unique characteristics? Further, what if outcome distributions vary across these sites? One site, a treatment center, may have primarily collected a convenience sample of patients, whilst another may have predominantly focused on controls. When we tune a model's hyperparameters, how do we avoid tuning them directly to these differences, which may proxy for disparities in outcome distributions or processing equipment? Here, the solution is leave-group-out cross-validation, also known as Monte Carlo or leave-site-out cross-validation34,35. In this situation we need to assess whether a model trained on a particular data collection site will generalize to other sites. To achieve this, we hold out samples according to a third-party array of integer groups that represent each site. These integer codes can then be used to encode site-specific cross-validation folds and assess the generalizability of model performance. This approach ensures that all samples in the cross-validation folds come from sites that are not represented at all in the paired training folds. In situations where consortia/multisite data are used, this method is required. For deployment of this method in the Python programming language, see the following documentation36.
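A minimal sketch of this scheme using the scikit-learn splitter referenced in ref. 36 (synthetic data; the three-site structure and integer site codes are hypothetical):

# Minimal sketch of leave-site-out cross-validation with an integer array
# of site codes passed as the 'groups' argument.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=30, random_state=0)
site = np.repeat([1, 2, 3], [150, 100, 50])  # third-party array of site codes

clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))

# Each fold tests on one entire site that the paired training folds never saw.
scores = cross_val_score(clf, X, y, groups=site, cv=LeaveOneGroupOut())
for held_out_site, score in zip([1, 2, 3], scores):
    print(f"Trained on the other sites, tested on site {held_out_site}: {score:.2f}")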
Data leakage
While the use of proper cross-validation schemes and minimum sample sizes helps reduce estimate variability and systematic overestimation, another overlooked source of error comes in the form of data leakage. The effects of leakage on estimates are profound yet appear to be rarely discussed in the psychiatric ML literature. Named one of the top 10 data mining mistakes37, leakage refers to the introduction of information about the outcome label (e.g., case/control) that would otherwise not be available to learn from. A trivial example of data leakage would be either selecting features or imputing missing values on an entire dataset before partitioning it into train/validate/test folds. Here, feature distribution information from the validate/test folds would be leaked into the train set, hyperparameters would be tuned to these distributions, and inevitably, test set outcomes would be predicted with a high degree of accuracy. For an in-depth appraisal of this problem and solutions, see Kaufman37.

This problem can be easily avoided through careful selection of features (only selecting features that would truly be available at the time of analysis) and the partitioning of cross-validation folds prior to data transformations. However, as sample size is often constrained in psychiatric cohorts, as discussed above, automated k-fold and nested schemes are commonly used instead of a-priori train/validate/test partitions. As multiple k-fold runs are conducted, it is not as simple as learning transformations on a train partition and then predicting them into validation/test partitions. Here, data leakage risk increases substantially, yet it can still be avoided through the use of pipeline architecture.

Pipeline architecture
A machine learning pipeline can be thought of as an object that sequentially chains together a list of transformers and a final estimator into one object. This sequential chaining has three advantages. First, a practitioner only has to call 'fit' and 'predict' once on a set of data, rather than at each transformation. Secondly, hyperparameters from each estimator can be tuned in unison. Finally, pipelines help avoid the leakage of statistics by guaranteeing that the same samples are used to train the transformers and the final classifier. It is easy to overlook the importance of this final step. As datasets are commonly repositioned for ML analysis, the a-priori pre-processing of datasets is common. As demonstrated above, if a transformation as simple as imputation is completed on a dataset in its entirety, this is enough to leak statistics and cause optimistic bias.

To demonstrate the correct use of pipeline architecture in the Python programming language and quantify the magnitude of data leakage effects, we have provided an example Python script in the references38. In this example, we randomly generate a balanced large-P-small-N dataset containing 3000 features and 500 observations. To emulate the low signal-to-noise ratio commonly inherent in psychiatric cohorts, only 10 of the 3000 features are related to the binary outcome vector y = {1, 0}. To demonstrate data leakage effects, we first mean-center the dataset and select a subset of features using regularized logistic regression (LASSO)39 on the full dataset. Following this, we train and test a linear SVM with default parameters using 10-fold cross-validation, attaining a test set AUC of 99.89. Next, we conduct the same procedure, now implementing each transformation within a sklearn pipeline to ensure the use of the same cross-validation folds over each transformation. Now, we attain a test AUC of 50.23. Here, we see that when a pipeline is not used, the model overfits to leaked statistics and appears to be a near perfect classifier. However, when we use a pipeline, conduct transformations in the same folds, and only learn from the true signal in the features, predictive accuracy is no better than chance.
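The authors' full script is available in ref. 38; the sketch below is our own simplified reconstruction of the same comparison (different random data generation, so the exact AUCs will differ from the 99.89 and 50.23 reported above), contrasting transformations fit on the full dataset with the same steps wrapped in a Pipeline:

# Simplified reconstruction of the leakage demonstration (see ref. 38 for the
# authors' original script; exact numbers here will differ).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Balanced large-P-small-N data: 500 observations, 3000 features, only a
# handful informative, with labels noised to keep the true signal weak.
X, y = make_classification(n_samples=500, n_features=3000, n_informative=10,
                           n_redundant=0, flip_y=0.4, class_sep=0.5,
                           random_state=0)

lasso = LogisticRegression(penalty="l1", solver="liblinear")

# Leaky version: centering and LASSO-based feature selection are fit on the
# FULL dataset before cross-validation ever sees it.
X_leaky = StandardScaler().fit_transform(X)
X_leaky = SelectFromModel(lasso).fit_transform(X_leaky, y)
leaky_auc = cross_val_score(SVC(kernel="linear"), X_leaky, y,
                            cv=10, scoring="roc_auc").mean()

# Correct version: the same steps chained in a Pipeline, so scaling and
# selection are re-fit within the training folds only.
pipe = Pipeline([("scale", StandardScaler()),
                 ("select", SelectFromModel(lasso)),
                 ("svm", SVC(kernel="linear"))])
honest_auc = cross_val_score(pipe, X, y, cv=10, scoring="roc_auc").mean()

print(f"AUC with leaked transformations: {leaky_auc:.2f}")  # optimistically inflated
print(f"AUC with pipeline: {honest_auc:.2f}")               # close to the true, weak signal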
Given that, at current, the open sourcing of code for peer review is not requested by journals, and given the ease with which these mistakes can be made by those newer to the field, it is possible that, beyond just sample size effects, many of the highly optimistic studies (>90% accuracy) may be due to data leakage. To counteract this problem, we recommend that journals require code reviews in the peer review process. In addition, the open sourcing of code should be encouraged. Beyond the aforementioned sample size estimation biases, such minor transgressions in code structure may go some way to explaining the large degree of variability currently observed in the literature. Finally, permutation tests should be conducted regardless of the cross-validation strategy deployed. This way, the null distribution and statistical significance of a classifier can be obtained40.
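A minimal sketch of such a permutation test using the scikit-learn implementation (synthetic placeholder data; ref. 40 describes the underlying method):

# Minimal sketch of a permutation test for classifier significance: labels are
# shuffled repeatedly to build a null distribution of cross-validated scores.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, permutation_test_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))

score, null_scores, p_value = permutation_test_score(
    clf, X, y, cv=StratifiedKFold(n_splits=10), n_permutations=100,
    scoring="balanced_accuracy", random_state=0)

print(f"Observed balanced accuracy: {score:.2f}")
print(f"Mean of the null distribution: {null_scores.mean():.2f}")
print(f"Permutation p-value: {p_value:.3f}")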
From proof-of-concept studies to clinical application
While consortia efforts and open-sourced models are becoming more prevalent, publication numbers of studies at all points along the project maturity continuum continue to rise. It is therefore of utmost importance to ensure the methodological rigor and conceptual maturity of these publications. However, beyond best-practice recommendations like those provided above, the current lack of practical guidelines to systematically evaluate ML quality and maturity makes it difficult for researchers, stakeholders, journals, and funding agencies to objectively gauge the quality and current clinical utility of an ML model or publication. In addition, the lack of evaluation guidelines risks an overly optimistic or an unduly skeptical perception of findings, both in the scientific community and the public eye. Therefore, we propose a practical set of guidelines for the assessment of clinical utility and maturity for ML models in psychiatry. Based on the conceptual framework of AI Transparency31, we have derived a straightforward "checklist" with which to quantify a project's maturity, ranging from the initial proof-of-concept stage through to the clinical application stage. Specifically, the checklist comprises six categories in which an ML project is evaluated. In each category, scores range from "proof-of-concept stage" (0) to "ready for clinical application" (2).

The first category, generalization, refers to model performance in previously unseen data and has been outlined in detail above. It constitutes the most rudimentary performance measure of an ML model. While employing cross-validation techniques (score 0) avoids data leakage and provides a principled estimate of generalization performance, it is important to note that (nested) cross-validation should be used for initial model evaluation and hyperparameter optimization only. At current, the results of most machine learning studies in psychiatry might well have arisen from small test sets, as is typical for cross-validation. In contrast, using a large, independent test set (score 1) yields a more stable, reliable estimate of future performance. Finally, using a large, external test set (score 2), i.e., a test set to which the creators of the initial model did not have access at training time, and which was drawn independently from the training set, is optimal. To this end, online model repositories (e.g., www.photon-ai.com/repo) provide valuable infrastructure which greatly simplifies external validation in practice. Here, a research group can make a trained model available and have it tested on independent data, providing an opportunity to assess the geographic, demographic, and phenotypic generalizability of a published model. In addition, incremental learning can be used to train certain ML models, allowing a research group to further train a pre-trained model with their own data, affording cross-institution collaboration without the need for data sharing (for more information, see the partial_fit method in sklearn41).
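A minimal sketch of this kind of incremental learning (hypothetical data for two sites; SGDClassifier is one scikit-learn estimator exposing partial_fit, as per ref. 41):

# Minimal sketch of incremental learning: a second site continues training a
# received model on its own data, without any raw data ever being pooled.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Data held at two different institutions (never shared directly).
X_site_a, y_site_a = make_classification(n_samples=400, n_features=30, random_state=0)
X_site_b, y_site_b = make_classification(n_samples=400, n_features=30, random_state=1)

clf = SGDClassifier(random_state=0)

# Site A trains first and shares only the fitted model object.
clf.partial_fit(X_site_a, y_site_a, classes=np.array([0, 1]))

# Site B further trains the pre-trained model on its own cohort.
clf.partial_fit(X_site_b, y_site_b)

print(f"Accuracy on site B data after both updates: {clf.score(X_site_b, y_site_b):.2f}")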
The second category, model scope, refers to the group of individuals about whom the model can make reasonable predictions. If a convenience sample (score 0) is used for training and testing, we cannot expect the model to perform well on different samples. If the sample is representative for a site or local subpopulation (score 1), we can expect it to perform as intended at this site, without any guarantees with regard to other sites or subpopulations. Only testing on a representative sample of the target population (score 2) ensures reliable performance estimates that translate into clinical practice. From this point of view, the current use of exclusion criteria appears highly problematic. While perfectly reasonable if we seek to test hypotheses to gain insight into mechanisms or advance theory, excluding patients with e.g., certain comorbidities inevitably entails that our model's utility for this patient group cannot be estimated, thereby severely hampering its applicability in clinical practice.

The third category, incremental utility, refers to the added value a machine learning model confers as compared to current practice42. While most studies do not assess incremental utility (score 0), showing higher efficiency or effectiveness with regard to the current state-of-the-art (score 1) is essential. If a model or project cannot show or does not intend to do this, little in the way of clinical translation can be expected. Thus, the essential goal should always be that a project reaches a stage in which it outperforms the current state-of-the-art in real-life workflow (score 2). Although experimental approaches such as randomized controlled trials seem optimally suited to show incremental utility, they are rarely used in medical ML research today. This concept also highlights a misalignment of expectations with regard to model performance. A commonly held view is that a model needs to be highly accurate to be of clinical use. However, if a model can outperform its opportunity cost, that is,
current state-of-the-art clinical practice free of decision support systems, then patient utility can be maximized at scale. As previously mentioned, clinician-rated accuracy to predict remission in the STAR*D cohort was ~49.3%, whilst that of a weakly to moderately accurate AI model ranged between 59.6% and 64.6%. Whilst far from perfect, a clinically meaningful increase in prognostic certainty was attained. Therefore, the absolute accuracy of a classifier should not serve as an indicator of clinical utility, but rather the relative increase in prognostication compared to current state-of-the-art practice.

Using this simple checklist, we can now evaluate a project or publication with regard to its position on the continuum spanning from "proof-of-concept stage" to "ready for clinical application". Note that the three categories outlined above build upon each other in the sense that a model with bad generalization cannot have a broad model scope. Likewise, to show incremental utility, a model must generalize well and have reasonable model scope. Thus, insufficient scores in one category cannot be compensated by higher scores in any other category. See Fig. 2 for an illustration of this general workflow.
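To make this scoring rule concrete, the sketch below is our own illustrative encoding of the three scored categories (it is not part of the original framework, and the field names and comments are ours); taking the minimum across categories captures the point that a weak score in any one category caps a project's overall maturity:

# Illustrative encoding of the maturity checklist: each category is scored 0-2
# and, because the categories build on each other, the overall maturity is
# capped by the weakest category (no compensation across categories).
from dataclasses import dataclass

@dataclass
class MaturityChecklist:
    generalization: int       # 0 = cross-validation only, 1 = large independent test set, 2 = external test set
    model_scope: int          # 0 = convenience sample, 1 = site-representative, 2 = target population
    incremental_utility: int  # 0 = not assessed, 1 = beats state of the art, 2 = beats it in real-life workflow

    def maturity_score(self) -> int:
        """0 = proof-of-concept stage ... 2 = ready for clinical application."""
        return min(self.generalization, self.model_scope, self.incremental_utility)

# Example: externally validated model, site-level sample, utility not yet assessed.
project = MaturityChecklist(generalization=2, model_scope=1, incremental_utility=0)
print(project.maturity_score())  # -> 0, i.e., still at the proof-of-concept stage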
Once an ML model reaches the level we term "ready for clinical application", other considerations regarding post-deployment evaluation, security, and algorithmic bias come into focus31. While often crucially important, their impact depends heavily on the context in which the model is used. If, for example, we could be sure that the model draws on measures of causal mechanisms, we can assume that the relationship will not change over time, rendering post-deployment evaluation less important. Also, an ML model deployed in a consumer application for smartphones might require much higher security (e.g., regarding adversarial attacks and data security, i.e., hacking of smartphone input data leading to erroneous model outputs and therapeutic decisions) than is necessary in a closed hardware-based device used in the clinic only. Finally, as ML models are not programmed but trained, they will mimic systematic biases inherent in their training data. This potential algorithmic bias must be carefully investigated. While it can often be eradicated by retraining without the variables causing the bias, we need to be aware of a bias to be able to remove it. Thus, human involvement is crucial at this point (for an in-depth discussion, see ref. 43).

Importantly, these guidelines are intended to gauge ML model maturity with regard to clinical application only. They should explicitly not be applied to ML research projects developing methods or to proof-of-concept studies. Judging those with the checklist outlined here would inevitably result in low scores and might therefore stifle even the most promising of methodological developments. It is this creativity and ingenuity, however, that allows the field to move at such breathtaking speed. If researchers claim to develop a clinically useful tool, as is done in the introductions and discussion sections of countless publications and funding proposals, even if nothing but classic statistical group inference or the simplest of ML models is planned, we can judge project planning, as well as subsequent results, using this approach to delineate projects aiming for ML with high clinical utility from studies seeking insight into mechanisms or proof-of-concept studies. The ability of journals and funding agencies, as well as the general public, to evaluate the aims and quality of research projects in this manner is crucial for all translational efforts in psychiatry.

Understanding ML models
In the wake of ever-more powerful machine learning applications emerging across a range of industries, the question of explainability, i.e., understanding which (patterns of) variables lead to which model predictions, has increasingly come into focus. Identifying the (pattern of) variables driving predictions is of obvious scientific interest. Extracting these relevant variables in an ML study would provide theoretic insights similar to classical (usually univariate) statistical inference while retaining predictive performance. In addition, if the relevant variables could be extracted, equally well-performing models could be trained with only a small subset of variables, saving resources on all levels, from data acquisition to processing and storage.

With regard to clinical evaluation as outlined in the previous section, quantifying the effect of variables can also be beneficial in at least three ways. First, with regard to model utility, identifying relevant variables might help to detect trivial and erroneous models. For example, Lapuschkin et al.44 used Layer-wise Relevance Propagation (LRP)45 to show that an ML model trained to detect certain objects in photographs (in this case horses) in fact used nothing but the text of a tag present on all images of horses in the training data. If this tag was inserted into other photographs, the images would be classified as horses independent of their actual content. In addition, identifying relevant variables might enable domain experts to judge whether the associations on which the model relies are likely to remain stable. Second, with regard to model fairness, knowledge of the relevant variables may help to identify algorithmic bias. If, for example, gender and age are identified as the most relevant variables in a model built to identify suitable job applicants, this bias could be explicitly addressed. Third, with regard to model security, identifying the relevant variables provides information on where the model could be most easily attacked. This, in turn, might help to immunize the model.
Fig. 2 Illustration of the full best practice workflow from pipeline construction through to project maturity assessment. Dependent on the sample, cross-validation scheme, and measurement of incremental utility compared to current clinical practice, a project can fall into 3 distinct phases of project maturity dictating its readiness for clinical use.

Given that (almost all) ML models are by no means "black boxes", but apply a transparent and deterministic, albeit often rather complex, rule to make predictions, a large number of algorithms aiming to help us understand this rule, i.e., to identify relevant variables, have been developed (for a general introduction, see ref. 46). While it is beyond the scope of this article to review the numerous families of such algorithms, they usually quantify the contribution of each individual variable used in a prediction. This can be done, for example, in a straightforward manner by systematically obscuring certain sets of variables in novel data and analyzing the resulting decrease in performance, as sketched below.
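A minimal sketch of this obscure-and-re-score idea on synthetic data (our own example; newer scikit-learn versions offer the same logic as sklearn.inspection.permutation_importance):

# Minimal sketch of variable relevance by obscuring: each feature column of the
# held-out data is permuted in turn and the resulting drop in performance noted.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = SVC(kernel="linear").fit(X_train, y_train)
baseline = clf.score(X_test, y_test)

rng = np.random.default_rng(0)
for j in range(X_test.shape[1]):
    X_perm = X_test.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # obscure one variable in novel data
    drop = baseline - clf.score(X_perm, y_test)
    print(f"Feature {j}: performance drop {drop:+.3f}")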
Another group of approaches investigates the trained ML model itself. Examples of this include simple weight maps for Support Vector Machines and decision-tree-based importance scores in tree-based models, as well as more complex approaches, such as visualizing the process of layer-wise data transformation in neural networks, including LRP45, or more generally applicable approaches, e.g., based on game theory47. While the wealth of research in this area has vastly increased the toolbox available for model interpretation in recent years, it also indicates that no complete solution has been found. In fact, different approaches may lead to different variables being identified.

However, the problems with ML model explainability run much deeper. On the one hand, all algorithms which identify variables driving model prediction require human interpretation. Given the multivariate nature of ML models, this renders such an interpretation extremely difficult for even the simplest of (linear) models. On the other hand, the insight into model utility, fairness, and security we can gain is not only very limited but can usually be accomplished much better by straightforward model evaluation. For example, while quantifying variable relevance might indicate that problematic variables (such as gender or age) are driving predictions, explicitly excluding these variables in the first place is much more effective. However, it is possible that seemingly bias-free variables may be confounded even after the exclusion of problematic ones. Further analyses may be required to elucidate such hidden confounds prior to selecting variables for model consideration. Further work in this area is required and ongoing.

While we believe complete transparency regarding the previously outlined checklist to be essential, and facilitated by practices such as variable assessment for bias and the sharing of code, data, and trained models for scientific reproducibility, we deem the disclosure of "the algorithm" itself (specifically, the mathematical underpinnings of the ML model) to be of no use. In fact, even knowing every single one of the hundreds of millions of parameter values in a given ML model would fail to provide even a speck of practically useful insight into the inner workings of a trained model. While we could re-create every decision, we would have no additional way to investigate its quality, much less assess its real-world impact. The disclosure of detailed information regarding generalization, model scope, and risk profiles, however, ensures utmost transparency.
models comprise at least as many free parameters as there
Future goals are variables measured52. For complex models (e.g., Deep
Automatized ML development and optimization Learning), the number of parameters may easily increase
The success of ML model development in practice to tens of millions of parameters53. Training such large
crucially relies on machine learning experts to preprocess models with “only” hundreds or even thousands of sam-
data (including e.g., imputation and cleaning), select and ples may induce so-called overfitting—a situation in
construct appropriate features (feature engineering; often which the large number of free parameters allows the
with the help of domain experts), select an appropriate model to essentially “memorize” all of the training data,
model family (e.g., neural networks or random forests leading to perfect performance on the training set, but
etc.), optimize model hyperparameters (e.g., learning extremely bad generalization to new, real-world data.
rate), and evaluate the model with regard to general- While generally effective, the numerous countermeasures
ization, model scope, and risk profiles (including incre- employed lower the complexity of what a model can learn,
mental utility, fairness, and security as outlined above). potentially rendering it unable to capture true associa-
The complexity of this generic ML development pipeline, tions in the data54–56. Thus, model performance is
as well as the necessity to avoid data leakage and ensure severely limited by the number of patients available,
proper evaluation can be challenging even for experienced especially whenever high dimensional data sources such
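The packages named above automate this search end-to-end; as a lightweight illustration of the underlying point that no single machine should be assumed optimal, several model families can at least be compared under one cross-validation scheme in plain scikit-learn (our sketch; the candidate models and grids are arbitrary):

# Lightweight sketch (plain scikit-learn, not a dedicated AutoML package):
# searching over several model families and hyperparameters under the same
# cross-validation scheme instead of committing to a single machine.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=50, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])
search_space = [
    {"clf": [SVC(kernel="linear")], "clf__C": [0.1, 1, 10]},
    {"clf": [LogisticRegression(solver="liblinear")], "clf__C": [0.1, 1, 10]},
    {"clf": [RandomForestClassifier(random_state=0)], "clf__n_estimators": [100, 500]},
]

search = GridSearchCV(pipe, search_space, cv=10, scoring="balanced_accuracy")
search.fit(X, y)
print("Best pipeline:", search.best_params_)
print(f"Best cross-validated score: {search.best_score_:.2f}")
# For an unbiased performance estimate, wrap this search in an outer
# cross-validation loop or score it on a held-out test set (see above).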
Fig. 3 Illustration of workflows for the different techniques, exemplified using Magnetic Resonance Imaging (MRI) data. a Data augmentation approach using stochastic and image processing methodology. b Cross-domain transfer learning applying low-level filters learnt by a Convolutional Neural Network (CNN) from the Imagenet database. c Intra-domain transfer learning deriving a statistical embedding from a large database of MRI images employing a Generative Adversarial Network (GAN).

Learning complex models from small samples
While exceptionally successful in many areas, ML model training may require large amounts of data, as models comprise at least as many free parameters as there are variables measured52. For complex models (e.g., Deep Learning), the number of parameters may easily increase to tens of millions53. Training such large models with "only" hundreds or even thousands of samples may induce so-called overfitting, a situation in which the large number of free parameters allows the model to essentially "memorize" all of the training data, leading to perfect performance on the training set but extremely bad generalization to new, real-world data. While generally effective, the numerous countermeasures employed lower the complexity of what a model can learn, potentially rendering it unable to capture true associations in the data54–56. Thus, model performance is severely limited by the number of patients available, especially whenever high-dimensional data sources such as neuroimaging data are of interest (commonly known as large-P-small-N problems)48. Acquiring hundreds of thousands of patient samples, however, is usually not feasible, especially in psychiatry, where comprehensive phenotype data is often crucial.

The fundamental problem of model training on small datasets has recently been addressed in other areas with great success. First, data augmentation, i.e., perturbing existing data to create new samples, provides a mathematically principled means to artificially enlarge training set size, enabling the generation of an arbitrarily large number of training samples using stochastic and image processing methodology custom-tailored for imaging data57,58 (including e.g., image synthesis, sample pairing techniques, etc.; Fig. 3a). Second, transfer learning has been used to transform variables into a lower-dimensional representation based on what has been learnt from other data sets59–61. In neuroimaging, for example, we could leverage a cross-domain transfer learning approach by extracting basic visual features of a pre-trained image classification algorithm (i.e., a Convolutional Neural Network, CNN) trained on 1.2 million natural images (Imagenet57, Fig. 3b). In the same vein, employing intra-domain transfer learning, we can extract general statistical properties from large datasets of healthy controls using state-of-the-art Generative Adversarial Networks (GAN, Fig. 3c). Such GANs can, for example, represent MRI data on a lower-dimensional manifold and enable the generation of an arbitrary number of MRI images from this distribution. While highly effective, these techniques have thus far not been systematically applied in psychiatry.
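As a toy illustration of the augmentation idea in Fig. 3a (our own sketch; real MRI augmentation would use purpose-built tools and anatomically sensible transforms), each existing sample can spawn several perturbed copies that share its label:

# Toy sketch of stochastic data augmentation for imaging-like arrays:
# random left-right flips plus small Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)
images = rng.normal(size=(20, 64, 64))    # 20 hypothetical 2-D image slices
labels = rng.integers(0, 2, size=20)      # hypothetical binary labels

def augment(img, rng):
    """Return a perturbed copy of one image: optional flip plus noise."""
    out = img[:, ::-1] if rng.random() < 0.5 else img
    return out + rng.normal(scale=0.05, size=out.shape)

# Each original sample spawns five perturbed copies with the same label.
aug_images = np.stack([augment(img, rng) for img in images for _ in range(5)])
aug_labels = np.repeat(labels, 5)
print(aug_images.shape, aug_labels.shape)  # (100, 64, 64) (100,)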
From ML models to clinical decision support systems
At current, applied ML in psychiatry shows promise yet is still in its early days. Once best practice is attained and proof-of-concept studies have been conducted, forward testing will be necessary to demonstrate prognostic stability, incremental utility, and real-world estimates of performance in the same context as models will be clinically deployed. In such a forward test, a clinician could make their prediction and clinical decision (e.g., will a patient enter remission if they prescribe a certain drug? If so, prescribe the drug). In parallel, a trained model could also make its prediction, with both the model and clinician assessed at a 12-week endpoint. However, given the interpersonal nature of psychiatric care, it is unlikely that, even if ML models prove to outperform clinician prognostication, they will ever solely drive the clinical decision-making process. Therefore, the testing of AI decision support systems alongside clinicians will likely provide a more realistic approximation of what to expect in terms of socially accepted clinical use. In this third study arm, a clinician could make a prediction and decision based not only on their own clinical expertise, but also on the binary prediction of an ML model, as well as its predicted probabilities62. Here, the synergy between a clinician and an AI decision support system could be measured. In this case, not only the binary prediction but also the probabilistic estimates are of importance. Therefore, model calibration should also be carefully considered (see Niculescu-Mizil & Caruana63 and the sklearn documentation64). If the collaboration of the clinician and the AI system significantly outperforms clinician prognostication alone, the model could then move towards clinical deployment.
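A minimal sketch of such calibration in scikit-learn (ref. 64; placeholder data and an arbitrary linear SVM standing in for the decision support model):

# Minimal sketch of probability calibration: a margin-based classifier is
# wrapped so that it outputs calibrated probabilities a clinician can weigh.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# LinearSVC has no predict_proba; CalibratedClassifierCV fits a sigmoid
# (Platt-style) calibration map via internal cross-validation.
calibrated = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)

proba = calibrated.predict_proba(X_test)[:, 1]
print("Calibrated probability of the positive class, first five test cases:")
print(proba[:5].round(2))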
Psychiatry and beyond
Whilst the current overview has focused on the application of ML to psychiatric phenotypes, these recommendations will generalize to most areas of applied ML in medicine. However, as psychiatric disorders are defined based on deviations of phenotypic characteristics, not causal biological models, a disorder can be highly heterogeneous with regard to its biological underpinnings. Therefore, diagnostic labels in psychiatry likely contain noise not seen in other medical disciplines. This presents a set of unique challenges that are distinct to psychiatry, with implications for model selection, accuracy, reliability, reproducibility, and ceiling effects on model performance. In areas that do not face these challenges, for example, oncology, specific problems discussed here, e.g., sample-size-based systematic overestimation and overfitting risks, may be less prevalent. Therefore, the domain of application should always be considered. Moving forward, only by parsing this heterogeneity in psychiatric phenotypes will we be able to lay the groundwork for more targeted models and improved patient care65,66.

Conclusion
In this overview, we have suggested and discussed best-practice guidelines with the intention that they may help stakeholders, journals, and funding agencies to obtain a more realistic view of (1) what can be expected of a planned research project in terms of clinical utility, (2) how much closer a particular finding has brought us to clinical application, and (3) what remains to be done before we can expect improvements in daily practice. In addition, understanding how to develop, train, and evaluate ML models and publications might help researchers new to the field of ML in psychiatry to better plan and monitor their ML projects, creating robust best-practice procedures in the mid-term. In the long run, we hope that these guidelines can help to channel funding, as well as media attention, towards the most promising developments regarding improved patient care, circumventing the evident dangers of the current hype around machine learning and artificial intelligence.

Acknowledgements
This work was supported by the Australian Government Research Training Program Scholarship.

Author details
1Discipline of Psychiatry, School of Medicine, University of Adelaide, Adelaide, SA 5005, Australia. 2Institute of Translational Psychiatry, University of Münster, 48149 Münster, Germany. 3Department of Psychiatry, University of Münster, 48149 Münster, Germany. 4Department of Psychiatry, Melbourne Medical School, The University of Melbourne, Parkville, VIC 3010, Australia. 5The Florey Institute of Neuroscience and Mental Health, The University of Melbourne, Parkville, VIC 3010, Australia.

Conflict of interest
The authors declare that they have no conflict of interest.

Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Received: 21 May 2019 Revised: 5 July 2019 Accepted: 30 July 2019

References
1. Bzdok, D. & Meyer-Lindenberg, A. Machine learning for precision psychiatry: opportunities and challenges. Biol. Psychiatry Cogn. Neurosci. Neuroimaging 3, 223–230 (2018).
2. Engel, G. L. The clinical application of the biopsychosocial model. J. Med. Philos. 6, 101–123 (1981).
3. Bzdok, D., Altman, N. & Krzywinski, M. Statistics versus machine learning. Nat. Methods 15, 233–234 (2018).
4. AL-Asadi, A. M., Klein, B. & Meyer, D. Multiple comorbidities of 21 psychological disorders and relationships with psychosocial variables: a study of the online assessment and diagnostic system within a web-based population. J. Med. Internet Res. 17, e55 (2015).
5. Anker, E., Bendiksen, B. & Heir, T. Comorbid psychiatric disorders in a clinical sample of adults with ADHD, and associations with education, work and social characteristics: a cross-sectional study. BMJ Open 8, e019700 (2018).
6. Strawbridge, R., Young, A. H. & Cleare, A. J. Biomarkers for depression: recent insights, current challenges and future prospects. Neuropsychiatr. Dis. Treat. 13, 1245–1262 (2017).
7. Yahata, N., Kasai, K. & Kawato, M. Computational neuroscience approach to biomarkers and treatments for mental disorders. Psychiatry Clin. Neurosci. 71, 215–237 (2017).
8. Cross-Disorder Group of the Psychiatric Genomics Consortium. Genetic relationship between five psychiatric disorders estimated from genome-wide SNPs. Nat. Genet. 45, 984 (2013).
9. Del Boca, F. K. & Noll, J. A. Truth or consequences: the validity of self-report data in health services research on addictions. Addiction 95, S347–S360 (2000).
10. Nguyen, T. M. U., Caze, A. L. & Cottrell, N. What are validated self-report adherence scales really measuring?: a systematic review. Br. J. Clin. Pharmacol. 77, 427–445 (2014).
11. Yarkoni, T. & Westfall, J. Choosing prediction over explanation in psychology: lessons from machine learning. Perspect. Psychol. Sci. 12, 1100–1122 (2017).
12. Kirsch, I. & Sapirstein, G. Listening to Prozac but hearing placebo: a meta-analysis of antidepressant medication. Prev. Treat. 1, 2a (1998).
13. Chekroud, A. M. et al. Reevaluating the efficacy and predictability of antidepressant treatments: a symptom clustering approach. JAMA Psychiatry 74, 370–378 (2017).
14. Koutsouleris, N. et al. Multisite prediction of 4-week and 52-week treatment outcomes in patients with first-episode psychosis: a machine learning approach. Lancet Psychiatry 3, 935–946 (2016).
15. Redlich, R. et al. Prediction of individual response to electroconvulsive therapy via machine learning on structural magnetic resonance imaging data. JAMA Psychiatry 73, 557–564 (2016).
16. Chekroud, A. M. et al. Cross-trial prediction of treatment outcome in depression: a machine learning approach. Lancet Psychiatry 3, 243–250 (2016).
17. Sackett, P. R., Lievens, F., Van Iddekinge, C. H. & Kuncel, N. R. Individual differences and their measurement: a review of 100 years of research. J. Appl. Psychol. 102, 254 (2017).
18. Speelman, C. P. & McGann, M. Editorial: challenges to mean-based analysis in psychology: the contrast between individual people and general science. Front. Psychol. 7, 1234 (2016).
19. Chekroud, A. M., Lane, C. E. & Ross, D. A. Computational psychiatry: embracing uncertainty and focusing on individuals, not averages. Biol. Psychiatry 82, e45–e47 (2017).
20. Neuhaus, A. H. & Popescu, F. C. Sample size, model robustness, and classification accuracy in diagnostic multivariate neuroimaging analyses. Biol. Psychiatry 84, e81–e82 (2018).
21. Schnack, H. G. & Kahn, R. S. Detecting neuroimaging biomarkers for psychiatric disorders: sample size matters. Front. Psychiatry 7, 50 (2016).
22. Hutson, M. Artificial intelligence faces reproducibility crisis. Science 359, 725–726 (2018).
23. Open Science Collaboration. Estimating the reproducibility of psychological science. Science 349, aac4716 (2015).
24. Klein, R. A. et al. Many Labs 2: investigating variation in replicability across samples and settings. Advances in Methods and Practices in Psychological Science 1, 443–490 (2018).
25. Koutsouleris, N. et al. Prediction models of functional outcomes for individuals in the clinical high-risk state for psychosis or with recent-onset depression: a multimodal, multisite machine learning analysis. JAMA Psychiatry 75, 1156–1172 (2018).
26. Kambeitz, J. et al. Reply to: sample size, model robustness, and classification accuracy in diagnostic multivariate neuroimaging analyses. Biol. Psychiatry 84, e83–e84 (2018).
27. Varoquaux, G. et al. Assessing and tuning brain decoders: cross-validation, caveats, and guidelines. NeuroImage 145, 166–179 (2017).
28. Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selection. IJCAI 14, 1137–1145 (1995).
29. Cawley, G. C. & Talbot, N. L. On over-fitting in model selection and subsequent selection bias in performance evaluation. J. Mach. Learn. Res. 11, 2079–2107 (2010).
30. Vanwinckelen, G. & Blockeel, H. On estimating model accuracy with repeated cross-validation. In Proc. 21st Belgian-Dutch Conference on Machine Learning, 39–44 (2012).
31. Hahn, T., Ebner-Priemer, U. & Meyer-Lindenberg, A. Transparent artificial intelligence – a conceptual framework for evaluating AI-based clinical decision support systems. SSRN 3303123 (2018).
32. Pedregosa, F. et al. Nested versus non-nested cross-validation. Scikit-learn documentation. https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html (2019).
33. Pedregosa, F. et al. Train/test/split cross-validation documentation. Scikit-learn documentation. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html (2019).
34. Roberts, D. R. et al. Cross-validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography 40, 913–929 (2017).
35. Xu, Q.-S. & Liang, Y.-Z. Monte Carlo cross validation. Chemometrics Intell. Lab. Syst. 56, 1–11 (2001).
36. Pedregosa, F. et al. Leave one group out cross-validation. Scikit-learn documentation. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneGroupOut.html#sklearn.model_selection.LeaveOneGroupOut (2019).
37. Kaufman, S., Rosset, S., Perlich, C. & Stitelman, O. Leakage in data mining: formulation, detection, and avoidance. ACM Trans. Knowl. Discov. Data 6, 15 (2012).
38. Cearns, M. Code based data leakage gist for Translational Psychiatry. https://gist.github.com/Micah0808/6d9e4d0919c9f43dcb3e53d21f405c97 (2019).
39. Tang, J., Alelyani, S. & Liu, H. Feature selection for classification: a review. Data Class. Algorithms Appl. 37 (2014).
40. Ojala, M. & Garriga, G. C. Permutation tests for studying classifier performance. J. Mach. Learn. Res. 11, 1833–1863 (2010).
41. Pedregosa, F. et al. Strategies to scale computationally: bigger data. https://scikit-learn.org/0.15/modules/scaling_strategies.html (2019).
42. Gabrieli, J. D., Ghosh, S. S. & Whitfield-Gabrieli, S. Prediction as a humanitarian and pragmatic contribution from human cognitive neuroscience. Neuron 85, 11–26 (2015).
43. FATM. Fairness, accountability, and transparency in machine learning. Retrieved December 24 (2018).
44. Lapuschkin, S. et al. Unmasking Clever Hans predictors and assessing what machines really learn. Nat. Commun. 10, 1096 (2019).
45. Bach, S. et al. On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10, e0130140 (2015).
46. Molnar, C. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. (Leanpub, 2018).
47. Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, 4765–4774 (2017).
48. Arbabshirani, M. R., Plis, S., Sui, J. & Calhoun, V. D. Single subject prediction of brain disorders in neuroimaging: promises and pitfalls. NeuroImage 145, 137–165 (2017).
49. Wolpert, D. H. & Macready, W. G. No free lunch theorems for search. Technical Report SFI-TR-95-02-010, Santa Fe Institute (1995).
50. Feurer, M. et al. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems, 2962–2970 (2015).
51. Jin, H., Song, Q. & Hu, X. Efficient neural architecture search with network morphism. Preprint at arXiv:1806.10282 (2018).
52. Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning. Springer Series in Statistics (Springer, 2001).
53. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).
54. Hahn, T. et al. Integrating neurobiological markers of depression. Arch. Gen. Psychiatry 68, 361–368 (2011).
55. Rondina, J. M. et al. SCoRS - a method based on stability for feature selection and mapping in neuroimaging (vol 33, pg 85, 2014). IEEE Trans. Med. Imaging 33, 794 (2014).
56. Hahn, T. et al. Predicting treatment response to cognitive behavioral therapy in panic disorder with agoraphobia by integrating local neural information. JAMA Psychiatry 72, 68–74 (2015).
57. Krizhevsky, A., Sutskever, I. & Hinton, G. E. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 84–90 (2017).
58. Inoue, H. Data augmentation by pairing samples for images classification. Preprint at arXiv:1801.02929 (2018).
59. Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
60. Cheng, B. et al. Multimodal manifold-regularized transfer learning for MCI conversion prediction. Brain Imaging Behav. 9, 913–926 (2015).
61. Donahue, J., Krähenbühl, P. & Darrell, T. Adversarial feature learning. Preprint at arXiv:1605.09782 (2016).
62. Hahn, T. et al. A novel approach to probabilistic biomarker-based classification using functional near-infrared spectroscopy. Hum. Brain Mapp. 34, 1102–1114 (2013).
63. Niculescu-Mizil, A. & Caruana, R. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, 625–632 (ACM, 2005).
64. Pedregosa, F. et al. Probability calibration. Scikit-learn documentation. https://scikit-learn.org/stable/modules/calibration.html (2019).
65. Marquand, A. F., Wolfers, T., Mennes, M., Buitelaar, J. & Beckmann, C. F. Beyond lumping and splitting: a review of computational approaches for stratifying psychiatric disorders. Biol. Psychiatry Cogn. Neurosci. Neuroimaging 1, 433–447 (2016).
66. Marquand, A. F., Rezek, I., Buitelaar, J. & Beckmann, C. F. Understanding heterogeneity in clinical cohorts using normative models: beyond case-control studies. Biol. Psychiatry 80, 552–561 (2016).
