
nature biomedical engineering

Article: https://doi.org/10.1038/s41551-023-01049-7
Nature Biomedical Engineering | Volume 7 | June 2023 | 756–779

Robust and data-efficient generalization of self-supervised machine learning for diagnostic imaging

Received: 22 July 2022; Accepted: 2 May 2023; Published online: 8 June 2023

Shekoofeh Azizi1,5, Laura Culp1,5, Jan Freyberg1,5, Basil Mustafa1, Sebastien Baur1, Simon Kornblith1, Ting Chen1, Nenad Tomasev2, Jovana Mitrović2, Patricia Strachan1, S. Sara Mahdavi1, Ellery Wulczyn1, Boris Babenko1, Megan Walker1, Aaron Loh1, Po-Hsuan Cameron Chen1, Yuan Liu1, Pinal Bavishi1, Scott Mayer McKinney1, Jim Winkens1, Abhijit Guha Roy1, Zach Beaver1, Fiona Ryan3, Justin Krogue1, Mozziyar Etemadi4, Umesh Telang1, Yun Liu1, Lily Peng1, Greg S. Corrado1, Dale R. Webster1, David Fleet1, Geoffrey Hinton1, Neil Houlsby1, Alan Karthikesalingam1, Mohammad Norouzi1 & Vivek Natarajan1

Machine-learning models for medical tasks can match or surpass the performance of clinical experts. However, in settings differing from those of the training dataset, the performance of a model can deteriorate substantially. Here we report a representation-learning strategy for machine-learning models applied to medical-imaging tasks that mitigates such 'out of distribution' performance problem and that improves model robustness and training efficiency. The strategy, which we named REMEDIS (for 'Robust and Efficient Medical Imaging with Self-supervision'), combines large-scale supervised transfer learning on natural images and intermediate contrastive self-supervised learning on medical images and requires minimal task-specific customization. We show the utility of REMEDIS in a range of diagnostic-imaging tasks covering six imaging domains and 15 test datasets, and by simulating three realistic out-of-distribution scenarios. REMEDIS improved in-distribution diagnostic accuracies up to 11.5% with respect to strong supervised baseline models, and in out-of-distribution settings required only 1–33% of the data for retraining to match the performance of supervised models retrained using all available data. REMEDIS may accelerate the development lifecycle of machine-learning models for medical imaging.

Machine learning (ML) methods based on deep learning1 have delivered impressive results across medical-imaging modalities, including radiology2–5, dermatology6,7, pathology8–10 and ophthalmology11,12. ML models have in fact shown potential to match the performance of clinical experts in disease-classification tasks. When deployed in healthcare systems, such task-specific and custom-designed ML systems promise to aid caregivers and improve health outcomes13.

1Google Research, Mountain View, CA, USA. 2DeepMind, London, UK. 3Georgia Institute of Technology, Computer Science, Atlanta, GA, USA. 4School of Medicine/School of Engineering, Northwestern University, Chicago, IL, USA. 5These authors contributed equally: Shekoofeh Azizi, Laura Culp, Jan Freyberg. e-mail: [email protected]; [email protected]; [email protected]


However, generalization remains a key translational challenge for medical-imaging applications. Medical ML systems can be evaluated and deployed in either in-distribution (ID) or out-of-distribution (OOD) settings. In the controlled ID setting, evaluation of a ML model is performed in a dataset similar to the one in which it was trained, whereas in the OOD setting the model is evaluated or deployed in a new clinical environment that differs from that of the training data. ML models frequently exhibit excellent performance in ID settings but fail to maintain this expert-level (and therefore clinically applicable) performance in OOD settings. Distribution shifts are common in deployment and can have far-reaching consequences, with model accuracy shown to degrade in previously unseen environments in multiple different applications for ML in medical imaging14–16. Furthermore, model performance and calibration may degrade to a greater extent in underrepresented subgroups, which can propagate existing health disparities17–19. The ability of medical ML models to maintain performance by efficiently generalizing to clinical settings not seen during training is necessary for their safe and effective deployment at scale20–24. Rigorous evaluation of medical ML models therefore requires assessment of their performance in OOD settings to guard against 'under-specification' (that is, when the model is not constrained by clinical domain knowledge), resulting in unanticipated poor performance during deployment25.

Despite the unmet need for medical ML generalization, few practical solutions currently exist. Adapting to new clinical deployment settings by retraining, fine-tuning or developing the ML model from scratch with data from the new distribution is perhaps the most favoured approach14,26,27. However, this may be prohibitively expensive or impractical owing to the requirement of acquiring and annotating large volumes of medical data for each new type of distribution shift; for example, the use of new imaging equipment or deployment in a new clinic28. In turn, this considerably slows down the lifecycle of development and deployment of ML for medical imaging, and presents an important barrier to its widespread adoption.

To formulate this problem more concretely, we use the notion of data-efficient generalization. Specifically, we capture the ability of the ML model to generalize to new deployment distributions with considerably reduced need for expert-annotated data from the new clinical setting. We measure this in two ways: as the improvement in zero-shot generalization to OOD settings (assessing performance in an OOD evaluation set, with zero access to training data from the OOD dataset); and as a significant reduction in the need for annotated data from the OOD settings to reach a performance equivalent to that of clinical experts (or a threshold demonstrating clinical utility) while maintaining or improving ID performance.

The desire to reduce reliance on hard-to-acquire labelled data for developing ML systems29,30 and improve their OOD generalization31,32 is a long-standing challenge for the wider ML community. The recent development and use of self-supervised-learning techniques in diverse applications across computer vision33–37, natural language understanding38,39 and speech recognition40 indicate their broad effectiveness towards tackling this challenge. These methods use various pretext tasks to train models to produce high-quality representations without using any label information. Contrastive self-supervised learning, in particular, has emerged as a strong approach in computer vision where models are trained by aligning the representations of multiple views of the same instance created via data augmentation or other means41–44. On the popular and competitive natural-images benchmark, ImageNet45, models developed using such contrastive self-supervised methods are starting to approach the performance of fully supervised methods46,47. Furthermore, contrastive self-supervised learning provides additional benefits, such as improved robustness48 and OOD detection performance49–51. Although progress in self-supervised learning shows promise for applications in medical ML, the effect has not been rigorously studied for data-efficient generalization to OOD clinical settings. Most previous works often only evaluated the models on a limited number of tasks and featured custom design choices, which makes it difficult to broadly apply them in practice. Furthermore, it is unclear how these methods interact or can be combined with other representation-learning strategies.

Motivated by the need for data-efficient generalization of medical-imaging ML and building on top of progress in self-supervised learning for tackling this problem, we here present 'Robust and Efficient Medical Imaging with Self-supervision' (REMEDIS), a unified transfer-learning strategy for developing robust medical-imaging ML with minimal customization across multiple domains. We present robust evidence of the efficacy of our approach across multiple clinical tasks and settings. The key insight of this strategy is to learn transferable and generalizable visual representations that can be further fine-tuned and deployed for the downstream medical-imaging task using limited labelled data from the clinical deployment setting. Unlike current standard transfer-learning strategies, which rely on standard supervised representation and modality or task-specific design choices, REMEDIS benefits from both large-scale supervised representation and task-specific self-supervised representations in a unified framework and with minimal customization across domains.

The standard approach for learning transferable representations in computer vision involves supervised pretraining52,53 on large-scale natural-image datasets that range from a million images in the ImageNet dataset45 to several hundred million54 and beyond55. Although these representations show strong transfer-learning performance on downstream natural-image tasks56 and can even perform well on medical-imaging tasks57, they tend to be suboptimal for the medical-imaging domain given the large distribution shift from natural images58. To alleviate this discrepancy, further supervised pretraining using medical data could be employed; but, as discussed previously, this is challenging, as acquiring annotations for medical data is expensive and time-consuming. Self-supervised learning, however, does not require annotations, and unlabelled medical data are often more easily available (although a considerable effort is still needed to acquire, clean, pre-process and harmonize the data28).

Following this reasoning, our method leverages a combination of both large-scale supervised pretraining on natural images52 as well as intermediate contrastive self-supervised learning44 on unlabelled domain-specific medical data to learn transferable and generalizable representations for medical images59. These representations can be used for developing medical-imaging ML by fine-tuning them on task-specific labelled medical data. The proposed strategy introduces minimal changes to the standard transfer-learning workflow while drastically improving representation by wisely using the pool of available labelled and unlabelled data. Figure 1 shows an overview of our strategy for developing medical-imaging ML systems that exhibit strong data-efficient generalization.

The value of exploiting unlabelled data via self-supervised learning to improve ID performance has been studied in several individual medical-imaging domains, such as pathology60,61, dermatology62 and chest-X-ray63 interpretation. However, these previous works studied self-supervised learning in isolation, without considering how it might be combined with other representation-learning techniques. They often rely on task-specific design choices, which makes it difficult to broadly apply them in practice. Furthermore, the evaluation protocols are limited and fail to comprehensively consider and evaluate data-efficient generalization.


[Figure 1 is a workflow diagram: supervised pretraining on non-medical labelled data (minimizing cross entropy), intermediate self-supervised pretraining on domain-specific unlabelled medical data (maximizing agreement), supervised fine-tuning on in-distribution labelled medical data, optional fine-tuning on out-of-distribution labelled medical data, and in-distribution and data-efficient-generalization evaluation of the fine-tuned task-specific models.]

Fig. 1 | Overview of the REMEDIS approach for developing robust and efficient ML for medical imaging. REMEDIS starts with representations initialized using large-scale natural-image pretraining following the BiT method52. We then adapt the model to the medical domain using intermediate contrastive self-supervised learning without using any labelled medical data. Finally, we fine-tune the model to specific downstream medical-imaging ML tasks. We evaluate the ML model in both ID and OOD settings to establish the data-efficient generalization performance of the model.

For example, ref. 62 reported a custom approach to contrastive self-supervised learning by leveraging multiview images to create natural augmentation pairs. However, the approach is not generally applicable and its evaluation was limited to two medical-imaging tasks without consideration for data-efficient generalization to OOD settings. Another closely related work16 investigated the related topic of domain generalization in clinical settings. Once again, this study was restricted to only two medical tasks and the authors used synthetic datasets for their experiments. The methods considered were more focused on learning invariant predictors by pooling together data from different domains and less on representation-learning strategies for transfer learning. Furthermore, the focus of the work was on domain generalization without taking the data-efficiency aspect into account. Perhaps closest in scope to our study is the work in ref. 64, where a self-supervised approach showed its potential across five three-dimensional (3D) medical-imaging tasks. The work used a corrupt-and-reconstruct paradigm in which various transformations were applied to subvolumes, and an encoder-decoder model was trained to reconstruct the original subvolume. This approach has the added benefit of providing both pretrained encoders for classification and pretrained decoders for segmentation, which our current work does not provide. However, the study failed to consider the interaction of self-supervised learning with other representation-learning strategies. Furthermore, whereas the study reported label efficiency for training, it did not evaluate the approach rigorously on data-efficient generalization. Our method uses contrastive self-supervised learning, which is known to work better than image-reconstruction-based methods65. The supervised pretraining considered here also includes much larger natural-image datasets beyond ImageNet.

In contrast to previous works, we report a unified strategy that leverages large-scale supervised pretraining and intermediate self-supervised learning, and that can be applied across multiple medical-imaging modalities without domain-specific customization. Given that a key part of our method is representation learning from unlabelled data using contrastive self-supervision, the method is particularly well-suited to the medical-imaging ML setting, given the challenges of expert labelling and the relative abundance of unlabelled medical images. Across six different medical-imaging tasks and 15 different evaluation sets, we show that the strategy substantially improves ID performance over strong supervised ML approaches, with up to 11.5% relative improvement in diagnostic accuracy.

We conceptualized the notion of 'data-efficient' generalization to new clinical deployment settings as an important unmet need for medical ML and designed rigorous evaluation benchmarks to study this using retrospective data. ML models developed using REMEDIS deliver considerable improvements in data-efficient generalization performance when evaluated in realistic OOD settings (new and previously unseen clinical environments). Models trained using our strategy needed only 6% to 33% of the amount of retraining data to match the performance of a strong supervised baseline as well as widely used supervised ML models that were provided access to all of the available training data from new clinical settings. These improvements in performance in new environments would result in savings of thousands of valuable clinician hours that would otherwise be needed for medical data acquisition (years) and annotation (days), thereby potentially accelerating the lifecycle for the development, deployment and democratization of medical-imaging ML systems. Moreover, our study shows empirically the effectiveness of contrastive self-supervised learning towards data-efficient OOD generalization, both in natural-image and medical-imaging settings. Refs. 50,51 provide theoretical grounding for our observations.

Results
Beyond the comprehensive experimental results that we report, the approach and insights described here have been integrated in several of Google's medical-imaging research projects, such as dermatology20 and mammography66. In addition to open-sourcing the code used for developing REMEDIS as well as other associated details, we provide a comprehensive appendix that can be used as an independent guide for building on our results. We hope that this enables the medical ML community to replicate our study, derive further scientific value and realize positive clinical impact.

A unified framework for robust medical imaging
The goal of our method is to learn a predictor for each domain-specific medical task with low prediction error on both the ID and the OOD data. Since it has been shown that pretraining on massive unlabelled datasets potentially improves accuracy under distribution shift59,67, here we focused on predictors that leverage these pretrained representations and further fine-tuned these predictors using the labelled data.


In the representation-learning step, we trained an encoder f(⋅) to produce representations by minimizing some loss: a cross-entropy loss (a multiclass generalization of the logistic loss) for supervised pretraining or a contrastive loss for self-supervised pretraining. Then we trained a classification head g(⋅) to map representations to the medical task-specific label space. The final classifier h = g ∘ f was composed of the encoder learnt in the representation-learning step, followed by the classification head. Under this setup, given labelled training examples {(x1, y1), …, (xn, yn)} sampled from the labelled medical dataset D, the predictor h(⋅) maps each input example x to the corresponding class label y. The visual representation-learning procedure that we consider consisted of two pretraining steps: first, supervised pretraining using large-scale labelled natural images; then, contrastive self-supervised pretraining using unlabelled domain-specific medical images.
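To make the composition h = g ∘ f concrete, the following is a minimal TensorFlow/Keras sketch of an encoder plus classification head fine-tuned with cross-entropy on labelled medical data. It is only an illustration under stated assumptions: the publicly available ImageNet-pretrained ResNet-50 stands in for the JFT-pretrained BiT backbones used in the paper, and the input size, class count and optimizer settings are hypothetical.

```python
import tensorflow as tf

# Encoder f(.): a ResNet backbone initialized from large-scale supervised pretraining.
# Keras' ImageNet-pretrained ResNet-50 is a publicly available stand-in for the
# BiT ResNet-50 (1x) / ResNet-152 (2x) backbones pretrained on JFT.
encoder = tf.keras.applications.ResNet50(include_top=False,
                                         weights="imagenet",
                                         pooling="avg")

# Classification head g(.): maps the representation to the task-specific label space.
num_classes = 27  # hypothetical, e.g. 26 skin conditions + 'Other'
head = tf.keras.layers.Dense(num_classes)

# Final classifier h = g o f, fine-tuned end to end on labelled medical data.
inputs = tf.keras.Input(shape=(224, 224, 3))  # input size is illustrative
h = tf.keras.Model(inputs, head(encoder(inputs)))
h.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9),
          loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
          metrics=["accuracy"])
# h.fit(labelled_medical_train_ds, validation_data=val_ds, epochs=...)
```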
Supervised pretraining. Models trained for classification using large-scale natural-image datasets such as ImageNet68 are commonly used for transfer learning. It has been found that using generic and large-scale supervised pretrained models can have various benefits, such as speeding up training or improving downstream task performance57,58,69,70. Big Transfer (BiT)52 scaled this process up and, in conjunction with subtle architecture changes71 and more recent training procedures, improved the performance of transfer learning, achieving state-of-the-art results on several downstream transfer-learning tasks56. To exploit these benefits, we initialized the backbone encoder with weights from BiT models trained on the JFT54 dataset by minimizing a supervised cross-entropy loss. Given that deployment settings constrain the size of the ML model (in terms of number of model parameters) that can be used, it is important that our proposed approach works when using both small and large model architecture sizes. To study this in detail, we considered two ResNet architectures with commonly used depth and width multipliers: ResNet-50 (1×) and ResNet-152 (2×) as the backbone encoder networks46. The pretrained encoder network obtained from this step is indicated by fϕ(⋅) and was further fine-tuned in an additional pretraining step using medical-domain data.

Contrastive self-supervised pretraining. To learn visual representations effectively from unlabelled medical images, we adopted SimCLR44,72, a self-supervised-learning algorithm based on contrastive learning. Intuitively, SimCLR learns representations by maximizing agreement73 between differently augmented views of the same training example via a contrastive loss in a hidden layer of a feed-forward neural net. Given a randomly sampled mini-batch of images, each image xi is augmented in two different ways, creating two views of the same example x2k−1 and x2k. The two images are mapped via an encoder network fθ(⋅) to generate representations that are transformed again with a nonlinear transformation network, yielding representations z2k−1 and z2k that are used for computing the contrastive loss objective. With a mini-batch of encoded examples, the contrastive loss between a positive pair of examples i, j (different augmentations of the same image) is given as follows:

$$\ell_{i,j}^{\mathrm{NT\text{-}Xent}} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)}, \qquad (1)$$

where sim(⋅, ⋅) is the cosine similarity between two vectors and τ is a scalar denoting the temperature. The pretrained encoder network obtained after this step of intermediate self-supervision is indicated by fθ(⋅). As suggested in ref. 44, a 2-layer multi-layer perceptron projection head is used to project the representation to a 128-dimensional latent space. There were multiple hyperparameters influencing the contrastive-learning procedure, including the type of optimizer, learning rate, weight decay, temperature, training epochs, batch size and the selection of negative examples (that is, zi and zk), as well as data-augmentation strategies (details provided in Methods).
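To connect equation (1) to an implementation, the following is a minimal TensorFlow sketch of the NT-Xent objective. It is illustrative only: the convention that the two views of an image occupy adjacent rows of the projected batch, the temperature value and the masking constant are assumptions rather than the exact settings used for REMEDIS.

```python
import tensorflow as tf

def nt_xent_loss(z, temperature=0.1):
    """Minimal sketch of the NT-Xent objective in equation (1).

    `z` holds the 2N projected views of a mini-batch, shape (2N, d), with rows
    2k and 2k+1 assumed to be the two augmented views of image k.
    """
    z = tf.math.l2_normalize(z, axis=1)                     # dot products become cosine similarities
    sim = tf.matmul(z, z, transpose_b=True) / temperature   # (2N, 2N) pairwise sim(z_i, z_k) / tau
    n = tf.shape(z)[0]
    sim -= 1e9 * tf.eye(n)                                  # implements the indicator 1[k != i]
    idx = tf.range(n)
    positives = tf.where(idx % 2 == 0, idx + 1, idx - 1)    # partner view of each row
    loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=positives, logits=sim)
    return tf.reduce_mean(loss)                             # average over all 2N anchors
```

In SimCLR the loss is computed on the projection-head outputs; only the encoder fθ(⋅) is carried forward to the fine-tuning step described next.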
accumulation of intraretinal fluid. Although it is possible to screen for
DME using colour fundus photographs (CFP) by detecting hard exudates
where (⋅, ⋅) is the cosine similarity between two vectors and τ is a scalar near the fovea as a surrogate for the presence of fluid, extracting the
denoting the temperature. The pretrained encoder network obtained thickness directly from a 3D optical coherence tomography (OCT) vol-
after this step of intermediate self-supervision is indicated by fθ (⋅). As ume has become the gold standard for making a diagnosis74. Neverthe-
suggested in ref. 44, a 2-layer multi-layer perceptron projection head less, the use of OCT machines for DME diagnosis worldwide is limited
is used to project the representation to a 128-dimensional latent space. due to high cost. In this task, we followed the approach of ref. 75 to lever-
There were multiple hyperparameters influencing the contrastive age a dataset of paired CFP and OCT data, and trained a model that takes
learning procedure, including the type of optimizer, learning rate, a CFP as input and predicts central retinal thickness (CRT) measured
weight decay, temperature, training epochs, batch size and the from the corresponding OCT. Specifically, CRT was defined as Early


Specifically, CRT was defined as Early Treatment Diabetic Retinopathy Study zone 1/central subfield thickness ≥300 μm (refs. 76,77). For pretraining purposes, we used the de-identified and unlabelled dataset D^T2_U from EyePACS Inc., which included 2,287,716 fundus images from 308,507 patients. Hispanic is the most prevalent race/ethnicity within this dataset population. The ID dataset D^T2_in, collected in Thailand, included 6,039 CFPs from 4,035 patients. We also used a primary de-identified OOD dataset, D^T2_out, to investigate the generalization of our proposed strategy under distribution shift. Unlike D^T2_in, this dataset was collected in Australia and includes a total of 3,779 CFPs from 879 patients. Additionally, we used a secondary OOD dataset consisting of 909 fundus images from 323 patients collected in India for zero-shot OOD performance evaluation.

Task 3: Chest-X-ray classification. The chest-X-ray-condition classification task (T3) involves multilabel classification of chest-X-ray (CXR) images. Three publicly available datasets were used for training and evaluation purposes in this task: CheXpert78, MIMIC-CXR79 and ChestX-ray 14 (ref. 66). In particular, we used the combination of the training split of MIMIC-CXR79 and CheXpert78 as D^T3_U. MIMIC-CXR79 consisted of 215,695 radiographic studies collected at the Beth Israel Deaconess Medical Center in Boston, Massachusetts. Each study contained one or more views, so we sampled from these images during pretraining (preferentially sampling posterior anterior/anterior posterior views over lateral views if available). We used CheXpert78 as D^T3_in. This dataset was a large open-source dataset of 224,316 de-identified CXRs from 65,240 unique patients. The ground-truth labels for the training data were automatically extracted from radiology reports. The radiologist report was then mapped to a label space of 14 radiological findings. We predicted the five most prevalent pathologies used in ref. 78: atelectasis, consolidation, pulmonary oedema, pleural effusion and cardiomegaly78. This dataset included missing labels where, following the suggestion of ref. 78, 'unmentioned' and 'uncertain' values were imputed to zero. We modelled each finding as an independent binary prediction. Furthermore, following previous work57,58,62,80, to facilitate a robust comparison of REMEDIS to standard approaches, we defined a custom development subset of the CheXpert dataset which differed from the original set78. We also used ChestX-ray 14, which was collected at the National Institutes of Health Clinical Center, Maryland, as D^T3_out. The images in D^T3_out were annotated using a technique similar to CheXpert by extracting common findings from radiologist reports, and it consisted of 47,699 CXRs.

Task 4: Metastases detection. In the lymph node metastases detection task (T4), the goal was to detect cancer metastases in digital whole-slide images of lymph-node histology slides. The models were trained in a weakly supervised manner, using only case-level labels and without any local annotations. To make case-level predictions, embeddings from 2^14 = 16,384 patches per case were combined via an attention layer81. A random sample of 50 M patches from 10,705 cases (29,018 slides) spanning 32 'studies' (cancer types) from The Cancer Genome Atlas (TCGA) was used for self-supervised pretraining (D^T4_U). Breast lymph-node slides from the CAMELYON16 challenge82 were used for model development and ID evaluation (D^T4_in). Lymph-node slides from 5,161 stage II and III colorectal cancer cases (36,520 slides) collected between 1984 and 2007 from the Institute of Pathology and the Biobank at the Medical University of Graz were used for OOD evaluation (D^T4_out). This dataset is further described in ref. 10; however, cases here were not excluded on the basis of having insufficient tumour content in the primary tissue slides, and we focused on the lymph-node slides instead of the primary tissue.

Task 5: Colorectal-cancer-survival prediction. The objective of the colorectal cancer survival prediction task (T5) was to predict 5 yr disease-specific survival (DSS) using digitized whole-slide images of primary colorectal tissue histology slides. Models used the same self-supervised pretraining dataset (D^T4_U) as the lymph node metastases detection task (T4). Colorectal tissue slides from 4,496 stage II and III colorectal cancer cases (36,841 slides) collected between 1984 and 2007 from the Institute of Pathology and the Biobank at the Medical University of Graz were used for model development and ID validation (D^T5_in). A temporal split of 671 cases (6,419 slides) collected between 2008 and 2013 from the same institution were used for OOD evaluation (D^T5_out). This dataset is further described in ref. 10; however, only cases not lost to follow-up for DSS within 5 yr were included here.

Task 6: Mammography classification. In the mammography cancer classification task (T6), the goal was to predict whether there will be a biopsy-confirmed cancer occurring in the 39 months following the screening episode, as described in ref. 4. We utilized multiple datasets collected in various geographic locations for this task. This included a labelled dataset collected in the United Kingdom, a labelled dataset from the United States (Northwestern Memorial Hospital), an unlabelled set of images from five clusters of hospitals across five different cities in India (Bangalore, Bhubaneswar, Chennai, Hyderabad and New Delhi) and another unlabelled set of images collected from Northwestern Memorial Hospital (Chicago). Each of these datasets contained four different images per patient: medio lateral oblique and craniocaudal views, and left and right breasts. The UK and US datasets are described in more detail in ref. 4. The UK dataset was used as the labelled ID data (D^T6_in), which included a total of 89,018 cases. The labelled US dataset was used as the OOD data, D^T6_out, which consisted of a total of 41,043 cases. For pretraining, the unlabelled dataset (D^T6_U) was formed by removing labels from the labelled data from the UK dataset and combining it with the unlabelled data from India. During pretraining, as suggested in refs. 62,83 to improve the positive pair mining procedure, a single image was randomly selected from the four possible views and used to generate a positive pair with augmentation. Put together, the medical-imaging domains and tasks described above comprise a comprehensive clinical evaluation setup to rigorously evaluate the data-efficient generalization capability of REMEDIS and the supervised ML baseline.

Experimental setup. In each of these settings, we compared REMEDIS to baseline models that had used a standard paradigm of supervised transfer learning to demonstrate clinician-level (or otherwise clinically applicable) performance. This includes strong supervised models that have been pretrained on the JFT-300M dataset and the standard supervised models that have been pretrained on the ImageNet-1K dataset and further fine-tuned for the specific medical-imaging task. To focus on our primary objective of improving data-efficient generalization, we tested REMEDIS and the baseline ML models on previously unseen datasets from a different clinical setting from that in which the ML system was originally trained.

Each specific modality and task included an unlabelled pretraining dataset (DU), an ID dataset (Din) and one or more OOD datasets (Dout), which reflected a variety of realistic distribution shifts owing to data-acquisition devices or clinical demographics26 (Fig. 2 and Supplementary Table 5 provide more details). Examining performance in these previously unseen clinical settings enabled a rigorous test of model robustness to multiple distribution-shift scenarios (comprising 12 large datasets with extensive variation in image size, label spaces and class distributions, among others). These tasks and datasets embodied many common characteristics of medical imaging, such as class-label imbalance, variation of pathologies of interest from small local lesions to more global abnormalities and other image-characteristic variations.

Performance of REMEDIS
In this section, we explore and analyse how REMEDIS improves data-efficient generalization. We also compare our method to several state-of-the-art baselines to evaluate its performance.
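As a concrete illustration of the evaluation scenarios used throughout the results, the following sketch fine-tunes a copy of the ID model on increasing fractions of the OOD training set and reports OOD test AUC. The helper names, label fractions, training schedule and single-logit binary setup are hypothetical, not the exact protocol used in our experiments.

```python
import numpy as np
import tensorflow as tf

def ood_label_fraction_curve(build_model, ood_train, ood_test,
                             fractions=(0.0, 0.01, 0.1, 0.33, 1.0)):
    """Fine-tune h(.) on increasing fractions of the OOD training data and report OOD AUC.

    `build_model` returns a fresh copy of the ID-fine-tuned classifier; `ood_train` and
    `ood_test` are (images, labels) arrays from the new clinical setting. A fraction of
    0.0 corresponds to zero-shot OOD evaluation.
    """
    x_tr, y_tr = ood_train
    x_te, y_te = ood_test
    results = {}
    for frac in fractions:
        model = build_model()
        if frac > 0:
            n = int(frac * len(x_tr))
            idx = np.random.permutation(len(x_tr))[:n]
            model.fit(x_tr[idx], y_tr[idx], epochs=5, batch_size=32, verbose=0)
        probs = tf.sigmoid(model.predict(x_te, verbose=0))  # assumes a single-logit binary task
        auc = tf.keras.metrics.AUC()
        auc.update_state(y_te, probs)
        results[frac] = float(auc.result())
    return results
```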


[Figure 2 summarizes, for each task, the number of dataset samples (log scale) used as unlabelled, ID and OOD data, and the type of distribution shift between ID and OOD settings. Sample counts (unlabelled / ID / OOD): T1 Dermatology: 207,032 / 20,676 / 28,300; T2 DME: 2,287,716 / 6,039 / 3,779; T3 Chest X-ray: 215,695 / 223,414 / 47,699; T4 Pathology, metastases detection: 10,705 / 399 / 5,161; T5 Pathology, survival prediction: 10,705 / 4,496 / 671; T6 Mammography classification: 8,271 / 89,018 / 41,043.]

Fig. 2 | Overview of clinical settings for evaluating REMEDIS. We evaluated REMEDIS as well as baseline ML models on five different domains containing six tasks and involving a wide and complex variety of distribution shifts in clinical settings.

Self-supervision leads to statistically significantly improved data-efficient generalization. Figure 3, Supplementary Fig. 1 and Table 1 show an overview of the results demonstrating data-efficient generalization of our proposed self-supervised-based representation-learning method, REMEDIS, as well as the strong supervised baseline pretrained on JFT-300M and the standard supervised baseline pretrained on ImageNet-1K for the dermatology-condition classification (T1), DME classification (T2), chest-X-ray-condition classification (T3), pathology metastases detection (T4), pathology colorectal survival prediction (T5) and mammography classification task (T6). REMEDIS achieves superior OOD diagnostic performance with substantially reduced requirements for labelled data from new sites compared with baselines. The supervised ML baseline models were pretrained on either the ImageNet-1K46,68 or the JFT-300M54 dataset followed by medical task-specific fine-tuning. More details on the supervised baselines are available in Supplementary Information.

These results indicate that in a setup where the ML model has no access to training labels from the new clinical setting, use of REMEDIS leads to a statistically significant (P < 0.05) improvement compared with both strong and standard supervised baselines when evaluated on the OOD test dataset, improving top-3 accuracy or area under the curve (AUC) in six different and challenging medical-image-analysis tasks. REMEDIS exhibits substantially improved OOD performance with up to 10.7% relative improvement in diagnostic accuracy over a strong supervised baseline and up to 15.8% relative improvement in diagnostic accuracy over a standard supervised baseline when there is no access to retraining data in the new clinical setting (0%/zero-shot out-of-distribution data regime). Furthermore, the previous best performance was matched with access to 1.4% (0.0%, 4.9%) to 33.2% (25.7%, 39.3%) of the labels among all of these tasks, indicating achieving the same accuracy as baseline specialized models using 3–100× less data. This improved data efficiency manifests in 193–2,878 clinician hours saved, translating to years of data collection as well as US$32,000 to US$394,000 saving in annotation cost and a total of more than US$1M annotation cost saving.

Self-supervised medical ML requires fewer labels to reach clinically applicable performance. Obtaining an accurate measure of expert clinician performance can be challenging for several reasons. For example, in Task 2, clinicians usually diagnose DME from OCT rather than the fundus image utilized by ML. Therefore, a measure for clinically applicable performance was not available for all the medical tasks considered in this work (Supplementary Table 4). However, wherever available, we observed that self-supervised models required considerably fewer labels to reach the clinically applicable performance range necessary for safe deployment (see Table 2). Moreover, we observed that in certain cases such as chest-X-ray interpretation, the baseline models were unable to reach clinically applicable performance even with access to all the labels from the new distribution, as shown in Table 2. An alternative approach might be to measure the performance with a fixed count of labels; however, we are primarily concerned with the data needed to reach clinically acceptable levels of performance here.

Self-supervision leads to better ID performance. When compared to the strong and standard supervised training baselines, REMEDIS not only exhibited substantial improvement in OOD performance and data-efficient generalization but also led to significant improvement in ID performance in five out of six tasks, as seen in Fig. 3 and Table 1. In particular, for the dermatology task, the ID top-3 accuracy improved modestly but was statistically significant (P < 0.05), increasing from 0.900 (0.897, 0.903) for the standard supervised baselines and 0.923 (0.922, 0.925) for the strong supervised baselines to 0.926 (0.925, 0.928) using our strategy. For the task of predicting DME from fundus images, we observed a significant improvement (P < 0.001) in AUC from 0.891 (0.889, 0.892) for the strong supervised baselines to 0.902 (0.900, 0.902). The improvements were more pronounced in the chest-X-ray interpretation task, with AUC improving from 0.816 (0.815, 0.816) for the strong supervised baseline to 0.833 (0.832, 0.833). Similarly, in the pathology metastases detection task, we observed a significant increase in AUC from 0.856 (0.851, 0.864) for the standard supervised method and 0.916 (0.916, 0.917) for the strong supervised baseline to 0.954 (0.950, 0.960). For the pathology survival prediction task, the AUC improved from 0.699 (0.698, 0.699) for the strong supervised baseline to 0.748 (0.747, 0.748) using REMEDIS. Finally, for the mammography classification task, we observed an improvement in AUC from 0.869 (0.866, 0.872) to 0.870 (0.868, 0.872) when compared to the strong supervised baseline; however, this is not a significant improvement. Moreover, when compared to the standard supervised baseline, we observed a significant improvement (P < 0.001) using REMEDIS.

The fact that REMEDIS improved not only the OOD generalization performance but also led to improvements in ID performance suggests that the benefits of OOD generalization do not come at the expense of ID performance and that the learnt representations are stronger across the board.


[Figure 3 panels show, for each task (T1–T6), ID performance and OOD performance (top-3 accuracy or AUC) of REMEDIS and the strong supervised baseline as a function of the percentage of the OOD training set used for fine-tuning, together with the label fraction at which REMEDIS matches the best baseline performance and the corresponding clinician hours, annotation cost and number of annotated samples saved. Significance markers: * P value in (0.01, 0.05); ** P value in (0.001, 0.01); *** P value > 0.05.]

Fig. 3 | Data-efficient generalization. Overview of the results showing overall performance and data-efficient generalization of REMEDIS as well as of the strong supervised baseline pretrained on JFT-300M for the dermatology-condition classification (T1), DME classification (T2), chest-X-ray-condition classification (T3), pathology metastases detection (T4), pathology colorectal survival prediction (T5) and mammography classification task (T6). We observed considerably improved OOD generalization and substantial reduction in need for labelled medical data when using REMEDIS. We calculated 95% confidence intervals by running each label fraction and experiment up to ten times, and intervals are shown by the shaded area and error bars. A two-sided t-test was also done for each label fraction as well as when computing the ID results. If no * is shown, the P value is less than 0.001, otherwise, the P value is as indicated. The red lines indicate the amount of data that REMEDIS needs to match the highest supervised ML baseline performance when simulated in a new OOD clinical deployment setting. The amount of annotated data (%) and clinician hours potentially saved by using REMEDIS for each medical task considered are also indicated above and below two-sided arrows, respectively.

REMEDIS and alternative data-efficient-learning techniques
Alternative self-supervised-learning techniques. Although the primary goal of our work is not to compare different self-supervised techniques but rather to come up with a simple unified representation-learning strategy to address the unmet need of data-efficient generalization, we believe that REMEDIS is compatible with other self-supervised learning methods beyond SimCLR. To demonstrate this, we replaced SimCLR with multiple state-of-the-art alternatives, namely, improved Momentum Contrastive Learning (improved MoCo)84, Representation Learning via Invariant Causal Mechanisms (RELIC)85 and Self-Supervised Learning via Redundancy Reduction (Barlow Twins)86.

These methods introduced multiple improvements over the design of vanilla SimCLR, which resulted in higher performance on natural-image classification benchmarks. The improved MoCo method uses an extra momentum encoder, which acts as a memory mechanism and enables a large and consistent dictionary for learning visual representations. RELIC enforces invariant prediction of proxy targets across augmentations through an invariance regularizer. Based on causal theory, the use of invariance penalties in RELIC leads to more robust representations that generalize better than those obtained using SimCLR or BYOL85. Finally, Barlow Twins, unlike SimCLR, does not require large batches or the use of asymmetry between the network twins, such as a predictor network, gradient stopping or a moving average on the weight updates. We re-implemented improved MoCo (MoCo-v2), RELIC and Barlow Twins in Tensorflow as explained in refs. 84–86 and used the same experimental setup as above to ensure a fair comparison.

Figure 4 shows the performance of REMEDIS with SimCLR, RELIC, MoCo and Barlow Twins variants as the self-supervised learning method and ResNet-152 (2×) as the encoder. We evaluated these methods ID and OOD for T1, T2 and T3. We observed significant improvements in data-efficient generalization over the strong supervised baseline for all the REMEDIS variants. Our results suggest that there is not a single best self-supervision method in our evaluation setup and that one can select the underlying self-supervision method on the basis of ease of implementation, availability of computing and other relevant design choices. Although several classes of self-supervised-learning methods exist, in this work we focused on common contrastive self-supervised-learning methods, given that they are simple to implement, are domain-agnostic and have yielded similarly impressive results on natural-image benchmarks. Additional results are provided in Supplementary Information.

Weak supervision. To reduce reliance on expert-annotated clinical datasets, weak-supervision methods have been studied where the unlabelled data are annotated with higher-level, noisier labels and often in a programmatic manner87 by leveraging auxiliary data modalities such as text reports when available. The use of weak supervision for ID medical image analysis has been explored in multiple previous studies87–90; however, the effectiveness of this approach for OOD settings is yet to be comprehensively studied.


Table 1 | Comparison of REMEDIS with supervised baselines

Tasks | Method | ID | OOD (0%) | OOD (100%)
Task 1 (Top-3 acc.) | Supervised (ImageNet) | 0.900 (0.897, 0.903) | 0.738 (0.734, 0.743) | 0.839 (0.838, 0.840)
Task 1 (Top-3 acc.) | Supervised (JFT) | 0.923 (0.922, 0.925)* | 0.755 (0.750, 0.760)* | 0.844 (0.842, 0.845)
Task 1 (Top-3 acc.) | REMEDIS | 0.926 (0.925, 0.928) | 0.763 (0.760, 0.769) | 0.864 (0.863, 0.866)
Task 2 (AUC) | Supervised (ImageNet) | 0.887 (0.886, 0.887) | 0.685 (0.682, 0.688) | 0.761 (0.759, 0.764)
Task 2 (AUC) | Supervised (JFT) | 0.891 (0.889, 0.892) | 0.718 (0.715, 0.720) | 0.755 (0.750, 0.761)
Task 2 (AUC) | REMEDIS | 0.902 (0.900, 0.902) | 0.731 (0.727, 0.736) | 0.816 (0.811, 0.821)
Task 3 (AUC) | Supervised (ImageNet) | 0.818 (0.818, 0.819) | 0.786 (0.783, 0.788) | 0.812 (0.807, 0.817)
Task 3 (AUC) | Supervised (JFT) | 0.816 (0.815, 0.816) | 0.785 (0.781, 0.788) | 0.825 (0.824, 0.826)
Task 3 (AUC) | REMEDIS | 0.833 (0.832, 0.833) | 0.798 (0.796, 0.800) | 0.835 (0.834, 0.836)
Task 4 (AUC) | Supervised (ImageNet) | 0.856 (0.851, 0.864) | 0.757 (0.755, 0.758) | 0.892 (0.886, 0.895)
Task 4 (AUC) | Supervised (JFT) | 0.916 (0.916, 0.917) | 0.791 (0.790, 0.792) | 0.905 (0.897, 0.911)
Task 4 (AUC) | REMEDIS | 0.954 (0.950, 0.960) | 0.876 (0.876, 0.876) | 0.958 (0.956, 0.960)
Task 5 (AUC) | Supervised (ImageNet) | 0.714 (0.712, 0.715) | 0.649 (0.645, 0.655) | 0.725 (0.719, 0.729)
Task 5 (AUC) | Supervised (JFT) | 0.699 (0.698, 0.699) | 0.664 (0.661, 0.667) | 0.760 (0.757, 0.763)
Task 5 (AUC) | REMEDIS | 0.748 (0.747, 0.748) | 0.712 (0.710, 0.714) | 0.798 (0.792, 0.804)
Task 6 (AUC) | Supervised (ImageNet) | 0.852 (0.848, 0.856) | 0.700 (0.697, 0.702) | 0.727 (0.725, 0.728)
Task 6 (AUC) | Supervised (JFT) | 0.869 (0.866, 0.872)† | 0.711 (0.709, 0.715) | 0.734 (0.732, 0.736)
Task 6 (AUC) | REMEDIS | 0.870 (0.868, 0.872) | 0.725 (0.724, 0.726) | 0.750 (0.749, 0.751)

Comparison of the ID and OOD performance of REMEDIS with the strong supervised baseline pretrained on JFT-300M and the standard supervised baseline pretrained on ImageNet-1K images, for three evaluation scenarios. The results are either the average AUC or the top-3 accuracy, with 95% confidence intervals in parentheses. If no * is shown, REMEDIS significantly outperformed the baseline with P < 0.001; otherwise * shows P < 0.05 and † shows a non-significant improvement.

Table 2 | Data required to achieve clinically applicable performance

Tasks | T1‡ | T3† | T5‡
REMEDIS | 0.0% (0) | 4.2% (1,179) | 1.0% (43)
Supervised (JFT) | 0.1% (17) | 9.4% (2,630) | 9.9% (380)
Supervised (ImageNet) | 2.1% (369) | 42.7% (11,959) | 47.1% (1,826)

Amount of data required to achieve clinically applicable accuracy for REMEDIS and the supervised baseline in the OOD clinical setting. † only achieves the lower range of clinician performance reported in Supplementary Table 4, and ‡ shows the additional data needed to achieve the upper bound of clinician performance (T1 achieves average clinician performance with 0% of the data needed).

We further highlight the interaction between weakly supervised methods and the proposed large-scale and self-supervised pretraining components in REMEDIS in the lymph-node-metastases detection task (T4) as well as in the colorectal-cancer-survival prediction task (T5). As detailed earlier, our existing pathology experimental setup leverages a state-of-the-art weak-supervision deep multi-instance learning (MIL) pipeline10,81. The models were trained in a weakly supervised manner using only case-level labels and without any local annotations. To make case-level predictions, embeddings from 2^14 = 16,384 patches per case were combined via an attention layer81. We observed significant improvements from using REMEDIS in this setting, as demonstrated in Fig. 5.
cant improvements from using REMEDIS in this setting as demon- performed significantly better (P < 0.01) than the strong supervised
strated in Fig. 5. baseline for the chest-X-ray-condition classification task, both in ID and
In general, it is difficult to generalize the same weak-supervision OOD settings. However, it is still preferable to use the in-domain
approach to multiple domains, as the approach makes highly self-supervised learning when possible.
task-specific design choices. Nevertheless, as our experiments high-
light, weak supervision and self-supervised learning are complemen- Synthetic shift. Following the examples in refs. 91,92, we applied a set
tary. We believe that weak supervision is a promising approach and that of multiple common visual corruptions including Gaussian blur, Gauss-
it should be considered when appropriate and feasible for developing ian noise, contrast and brightness distortions to the dermatology OOD
medical-imaging ML using unlabelled data. dataset ( DT1
out ). Dermatology images are visually similar to natural


[Figure 4 panel values (top-3 accuracy for T1, AUC for T2 and T3), listed as SimCLR, RELIC, MoCo, Barlow Twins:
T1 ID: 0.926, 0.923, 0.918, 0.922; T1 OOD (0%): 0.763, 0.764, 0.775, 0.760†; T1 OOD (100%): 0.864, 0.864, 0.866, 0.863.
T2 ID: 0.907, 0.900, 0.905, 0.909; T2 OOD (0%): 0.746, 0.737, 0.719, 0.753; T2 OOD (100%): 0.778, 0.799, 0.775, 0.798.
T3 ID: 0.833, 0.830, 0.830, 0.825; T3 OOD (0%): 0.798, 0.802, 0.794, 0.794; T3 OOD (100%): 0.835, 0.829†, 0.832, 0.831.]

Fig. 4 | Data-efficient generalization of REMEDIS with various self-supervised learning techniques. Overview of the results showing performance and data-efficient generalization of REMEDIS with SimCLR, RELIC, MoCo and Barlow Twins as the self-supervised strategy for the dermatology-condition classification (T1), DME classification (T2) and chest-X-ray-condition classification (T3) with ResNet-152 (2×) as the encoder. The grey shadowed area indicates the performance margin of the strong supervised baseline pretrained on JFT. We observed that REMEDIS is compatible with MoCo, RELIC and Barlow Twins as alternative self-supervised learning strategies and that all the REMEDIS variants lead to data-efficient generalization improvements over the strong supervised baseline. The 95% confidence intervals were calculated by running each label fraction and experiment up to ten times, and intervals are shown using the shaded area and error bar. The strong supervised baseline pretrained on JFT-300M is shown by the dashed grey lines. A two-sided t-test was performed between the strong supervised baseline and the REMEDIS variants, and P > 0.05 is indicated with †.

Dermatology images are visually similar to natural images, and these corruptions can appear in low-lighting conditions, when the imaging device is out of focus, or when lighting conditions or skin colour change. The synthetically shifted dermatology dataset, D^T1_out, was generated using five severity degrees, wherein a combination of these corruptions was applied at each severity degree. We used an extended version of the distortion setup suggested in ref. 92 to produce this shifted dataset. For Gaussian noise, σ was selected from a range of 5 logarithmically spaced values between 10^−2 and 10^−1. For Gaussian blur, σ was selected from a range of 5 logarithmically spaced values between 10^−1 and 10. Brightness distortion adds a δ in a range of five linearly spaced values between 0.5 and 6.5 to the intensity channel of each image. Contrast severity was adjusted by α in a range of five linearly spaced values between 2.0 and 0.15, where smaller values are associated with a more severe contrast change. Extended Data Fig. 1 shows an example of distortion for each severity level; at the highest severity level of distortion, 5, there is no visually observable feature.
early spaced values between 2.0 and 0.15, where smaller values are Analysis of clinical impact
associated with a more severe contrast change. Extended Data Fig. 1 As detailed above, for each of the different tasks, REMEDIS led to sub-
shows an example of distortion for each severity level; at the highest stantial reductions in the amount of labelled data required for new
severity level of distortion, 5, there is no visually observable feature. deployment sites. This could substantially impact the feasibility of
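As a concrete illustration of the severity sweep described under 'Synthetic shift' above, the per-level corruption parameters could be laid out as in the sketch below. This is a minimal sketch assuming float-valued images; the intensity scale on which the brightness shift operates and the order in which the corruptions are composed are assumptions and may differ from our pipeline.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# One parameter value per severity level 1-5, matching the ranges stated above.
NOISE_SIGMA  = np.logspace(-2, -1, 5)     # Gaussian noise sigma, 1e-2 .. 1e-1
BLUR_SIGMA   = np.logspace(-1, 1, 5)      # Gaussian blur sigma, 1e-1 .. 10
BRIGHTNESS_D = np.linspace(0.5, 6.5, 5)   # delta added to the intensity channel
CONTRAST_A   = np.linspace(2.0, 0.15, 5)  # smaller alpha = more severe contrast change

def gaussian_noise(img, s):
    return img + np.random.normal(scale=NOISE_SIGMA[s - 1], size=img.shape)

def gaussian_blur(img, s):
    # Blur spatial axes only; leave the channel axis untouched for RGB images.
    sigma = (BLUR_SIGMA[s - 1], BLUR_SIGMA[s - 1], 0) if img.ndim == 3 else BLUR_SIGMA[s - 1]
    return gaussian_filter(img, sigma=sigma)

def brightness(intensity_channel, s):
    return intensity_channel + BRIGHTNESS_D[s - 1]

def contrast(img, s):
    m = img.mean()
    return CONTRAST_A[s - 1] * (img - m) + m

def corrupt(img, severity):
    """Apply a combination of the corruptions at one severity level (1-5)."""
    out = gaussian_blur(gaussian_noise(img, severity), severity)
    out = contrast(out, severity)
    # brightness() would additionally be applied to the intensity channel of `out`.
    return out
```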


[Fig. 5 panels: T4 (ID, OOD 0%, OOD 100%) and T5 (ID, OOD 0%, OOD 100%) matrices of relative improvement between R, L+W and W.]

Fig. 5 | REMEDIS versus weakly supervised DeepMIL. Overview of the results showing the relative improvement between REMEDIS (R = SSL+L+W), DeepMIL81
pretrained using large-scale JFT data (L+W) and DeepMIL pretrained using ImageNet data (W). SSL, self-supervised learning; L, large-scale supervised pretraining using
JFT data; W, weakly supervised learning.
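Fig. 5 contrasts REMEDIS with DeepMIL-style weak supervision, in which case-level predictions are formed by pooling patch embeddings with a learned attention layer (the same mechanism used for tasks T4 and T5). A minimal sketch of such attention-based multiple-instance pooling is shown below; the layer sizes, gating and classifier head of the actual DeepMIL models are not reproduced here, and the parameters are random placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_mil_pool(patch_embeddings, V, w):
    """DeepMIL-style attention pooling over a bag of patch embeddings.

    patch_embeddings: (n_patches, d) per-patch features from the encoder.
    V: (d, h) and w: (h,) are learned attention parameters; the pooled
    vector is then fed to a case-level classifier.
    """
    scores = np.tanh(patch_embeddings @ V) @ w   # one score per patch
    attn = softmax(scores)                       # attention weights sum to 1
    return attn @ patch_embeddings, attn         # pooled (d,) vector and weights

# Illustrative bag of 1,024 patches with 2,048-dimensional ResNet embeddings
# (the models described here pool 2**14 = 16,384 patches per case).
rng = np.random.default_rng(0)
emb = rng.normal(size=(1024, 2048))
V, w = rng.normal(size=(2048, 64)) * 0.01, rng.normal(size=(64,))
case_vector, weights = attention_mil_pool(emb, V, w)
```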

site-specific adaptation or retraining of a model by resolving several In particular, recent results suggest that large-scale self-supervised
considerable concerns: the speed of acquiring local labels (which may ML models not only lead to better ID task performance but also general-
take years to manifest if based on outcomes); time and costs for curat- ize better to OOD settings and have exciting few-shot or data-efficient
ing, cleaning and de-identifying data for research; the costs associ- learning capabilities39,98,99. This key insight formed the basis of our
ated with the machine-learning-software infrastructure and the skills exploration into leveraging large-scale pretraining and self-supervision
for model training (without incurring negative transfer phenomena93 for developing robust and efficient medical ML, which is particularly
that worsen model performance in unexpected ways); valuable clini- important given the safety-critical nature of the field.
cian hours spent on annotating data. More concrete examples: for the Although several classes of self-supervised-learning methods exist,
survival-prediction task, the collection of 645 examples takes over 5 yr; in this work we focused on contrastive self-supervised-learning meth-
and a low incidence of positive cases in tasks such as mammography ods because they are simple to implement, are domain-agnostic and
screening considerably slows down the process of data collection to have yielded impressive results on natural-image benchmarks. There
an extent where a proper dataset is collected over 20 yr. Therefore, the are also several different classes of contrastive self-supervised-learning
saved clinician hours can be translated to saving the multiple years that strategies; however, which techniques work best under which circum-
it takes to properly recruit patients, and to collect and curate a medical stances is unclear and all yield comparable results on natural-image
dataset. Across tasks, we estimate that self-supervision leads to savings benchmarks. It is likely that our results are not specific to SimCLR
of over 5,000 clinician annotation-hours alone. This is likely to be a and that similar results could be obtained with other contrastive
lower bound because the data are often labelled multiple times by dif- approaches such as those reported in refs. 42,43,84,85,100. The effec-
ferent clinicians for improved label quality (typically 3–10 times), with tiveness of these self-supervised-learning alternatives in improving the
further expense and time required for the definition and monitoring ID performance of medical ML models has been studied in previous
of labelling practices, and for consensus or adjudication approaches works (such as refs. 63,83,101–103); however, their effectiveness in the
between annotators. As a result, we estimate that for a model that OOD setting has not been explored.
might require US$1M to adapt to a new site, REMEDIS could deliver More broadly, unlabelled data can also be leveraged without
a substantial cost reduction of more than 50% (Supplementary Table self-supervision through methods such as self-training104. These meth-
7). Importantly, the analysis presented here is approximate and there ods require a teacher model trained solely on annotated data and use pre-
are various other factors at play not considered here that could impact dictions made by this model on unlabelled data to train a student model.
costs; however, we believe that this analysis directionally holds true We compared REMEDIS with this technique (Supplementary Fig. 9)
across settings. and observed REMEDIS to be better under our evaluation settings.
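The savings quoted above follow from simple arithmetic over the per-image annotation time, the clinician wage and the label fraction REMEDIS needs to match the baseline. The sketch below reproduces the pathology metastases example using the figures reported in the 'Analysis of clinical impact' Methods section; the exact numbers for any given site will of course differ.

```python
def annotation_cost(n_images, minutes_per_image, hourly_wage, reads_per_image=1):
    """Approximate clinician hours and cost to label a dataset from scratch."""
    hours = n_images * reads_per_image * minutes_per_image / 60.0
    return hours, hours * hourly_wage

# Pathology metastases detection OOD training split: 17,904 slides,
# ~10 min per slide, ~US$138 per pathologist hour (figures from Methods).
full_hours, full_cost = annotation_cost(17_904, 10, 138)

# If REMEDIS matches the supervised baseline with ~6% of the OOD labels,
# roughly 94% of the annotation effort is avoided.
saved_hours, saved_cost = 0.94 * full_hours, 0.94 * full_cost
print(round(full_hours), round(full_cost), round(saved_hours), round(saved_cost))
```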
Although the benefit of representation learning and
Discussion self-supervised learning specifically for medical imaging has been
The problem of ML generalization is long-standing and several model- demonstrated in several previous works, these works are often
ling approaches have been proposed to tackle it. A recent benchmark task-specific61,102,105–107 and feature domain-specific design choices.
paper94 suggested that under careful evaluation settings, ML models For example, refs. 61,105 explored the use of consistency training
developed using empirical risk minimization16,95,96 remain a strong and leveraged hard examples in a dynamic curriculum learning set-
baseline for this problem. In recent times, with large-scale computing ting along with self-supervision. However, this study was restricted
capabilities, attention has shifted to how large volumes of data can be to histopathology-image analysis. Similarly, refs. 106,107 showed
better leveraged to create more robust and efficient ML39,52,97. Although the use of multiview, multistyle and anatomy-aware augmentation
supervised learning approaches (especially with weak or noisy labels) for self-supervised learning. Once again, these studies considered a
have been studied and have yielded good results in domains such as limited number of modalities and did not consider how they might
computer vision52,55, self-supervised and semi-supervised approaches be combined with other representation-learning strategies. It is chal-
that can leverage unlabelled data at scale have also gained popularity. lenging to apply these methods in practice, as they rely on several


Although we have covered a large variety of tasks and medical-imaging domains, there are many related machine-learning strategies, such as weak supervision, self-training and active training, that have not been considered here, as they were beyond the scope and motivation of this study. In general, when appropriate and feasible, all of the labelled and unlabelled data should be leveraged to improve the performance of medical ML models in both ID and OOD settings, and our goal was to come up with a unified representation-learning strategy that allows for this. Furthermore, despite the strong performance of REMEDIS across several diverse tasks and datasets, there may be tasks or domains that benefit from domain-specific designs or distribution shifts not covered here. An interesting line of future work could be to analyse the performance of self-supervised representations under synthetic or adversarial shifts. Nevertheless, we believe that REMEDIS can be a good starting point even for such domains and tasks.

[Fig. 6 panels: ID and OOD (0%) AUC for REMEDIS (JFT + CXR), REMEDIS (JFT + mammo) and the supervised baseline (JFT).]
Fig. 6 | Cross-domain shift. Comparison of the cross-domain pretrained model, indicated as REMEDIS (JFT+Mammo), versus the in-domain REMEDIS chest-X-ray model and the strong supervised baseline. We observed that the cross-domain REMEDIS (JFT+Mammo) performed significantly better (**P < 0.01) than the strong supervised baseline for the chest-X-ray-condition classification task, both ID and OOD. However, it remains inferior to the in-domain REMEDIS (JFT+CXR) variant.

Another important direction of future research is to quantify the impacts of model calibration as well as the fairness, privacy and ethical considerations of leveraging self-supervised learning on large-scale data for developing medical ML. We provide some initial subgroup analysis for REMEDIS in the dermatology and mammography settings in Supplementary Fig. 8. However, rigorous research is required to understand the implications along these key axes to design appropriate mitigation strategies and procedural best practices in ethics.
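Quantifying calibration under distribution shift, as called for here, can start from a standard binned summary such as the expected calibration error; the binning below is an illustrative choice and is not part of the REMEDIS evaluation protocol.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binned expected calibration error for binary predictions.

    probs: predicted probabilities for the positive class, shape (n,).
    labels: binary ground truth, shape (n,).
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (probs >= lo) & (probs <= hi) if hi == 1.0 else (probs >= lo) & (probs < hi)
        if mask.any():
            # weight each bin by its size; gap is |mean confidence - empirical accuracy|
            ece += mask.mean() * abs(probs[mask].mean() - labels[mask].mean())
    return ece

# Example: the same model could be compared on its ID and OOD test predictions.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
probs = np.clip(labels * 0.7 + rng.normal(0.15, 0.2, size=1000), 0.0, 1.0)
print(expected_calibration_error(probs, labels))
```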
Given the interest and progress in self-supervised learning across
task-specific design choices during pretraining and fine-tuning108,109, the wider ML community, we expect rapid progress in the development
and often require a high level of expertise to make these models per- of more computing-efficient self-supervised methods. Although in this
form well. Furthermore, these studies do not consider the problem of work we used self-supervised learning and focused on a single architec-
data-efficient generalization and often only feature ID evaluations. tural family (ResNet), the approach is in principle architecture-agnostic
To address these challenges, we have introduced REMEDIS, a uni- and can be applied to other model classes such as transformers110, as
fied representation-learning strategy that leads to improved ID and well as to other semi-supervised learning and domain-adaptation
OOD performance in multiple medical-imaging tasks. Our method methods.
leverages large-scale supervised pretraining and self-supervised learn- We believe that this work lays the foundation for multimodal rep-
ing without additional domain-specific modifications. We have shown resentation learning for ML in healthcare. Health data are inherently
the general applicability of REMEDIS over 6 diverse medical-imaging multimodal in nature; they include a mix of images, electronic health
domains using 15 different evaluation datasets simulating real-world records, sensors and wearable data, as well as genomics data at both
distribution-shift scenarios. the individual-patient and hospital-system levels. We believe that ML
To assist medical-imaging ML developers and researchers with systems that leverage these data at scale using self-supervised learning
using REMEDIS, we characterize the OOD scenarios with dataset fin- will be the foundation of next-generation-learning health systems that
gerprints and provide a comprehensive guide that we hope will reduce may scale world-class healthcare to everyone.
the search space for empirical design choices when developing medical
imaging ML. Given the simplicity of REMEDIS, its minimal modifica- Methods
tion to the current widespread transfer-learning framework and the Overall, REMEDIS comprises the following steps: (1) supervised
ubiquity of unlabelled medical-imaging data, we expect widespread representation learning on a large-scale dataset of labelled natural
adoption of such learning strategies for developing medical-imaging images; (2) self-supervised contrastive representation learning on an
ML on top of previously standard supervised transfer learning. We unlabelled dataset of ID medical images; and (3) supervised fine-tuning
characterize the impact of this strategy in terms of valuable clinician on labelled ID medical images.
hours saved from the acquisition and annotation of data for medical To rigorously evaluate the data-efficient generalization of the
ML, and in terms of cost and time reductions for developing these ML models, we further fine-tuned them using labelled data from the
medical ML solutions. The real-world impact of REMEDIS is further OOD setting. Figure 1 provides an outline of REMEDIS and Extended
being assessed in prospective settings in several medical-imaging Data Fig. 2 summarizes the evaluation setup. In the following, we pro-
research projects at Google. vide details on the experimental setups and on the medical-imaging
ML-development methodology as well as the clinical evaluation setups.
Outlook
Self-supervised learning remains a relatively new topic in ML and its Experimental setup
successful application for a given task can be challenging, especially In this section, we lay out the experimental setup and explain the design
for researchers and developers with limited computing resources. We choices made in our study. This includes the pretraining procedure,
hope to reduce the barrier to using self-supervised representations for the choice of base ML network architecture, data pre-processing and
developing medical-imaging ML by open-sourcing our code as well as augmentation strategies, hyperparameter search procedure and
relevant technical details to allow the wider community to replicate and fine-tuning setup.
build on our study. The study was conducted using retrospective data For contrastive pretraining, we build on SimCLR, which proposes
and, as such, further rigorous health-economic studies are required to a simple approach for contrastive learning (462 for images). We per-
quantify the impact of our approach, which would account for other formed a disjoint hyperparameter tuning procedure to select factors
hurdles and costs in the real-world deployment and generalization of influencing the quality of the learned representation, which we
medical ML; for example, integration into clinical workflows, infra- measured by the model performance in the downstream tasks using
structure and other IT considerations. the validation set of D . We considered various factors influencing


performance including the network architecture, the initial model Pretraining data augmentations details. In our default contrastive
pretrained on labelled natural images, the contrastive-learning hyper- pretraining setting, we used random cropping (C), random colour
parameters and the augmentation strategy used in contrastive distortion (D), rotation (R) and random Gaussian blur (G) as the data
learning. For each of these factors, we conducted a comprehensive augmentation strategy.
hyperparameter search as follows. First, we pretrained a model on DU Due to the greyscale nature of radiology images (that is, mam-
given the target hyperparameters, then we measured the performance mography and chest-X-ray images), for these images we opted for
of the pretrained model on the validation set of the ID dataset D when stronger data augmentation to reduce the chances of overfitting. We
fine-tuned on the corresponding training set. For the selection of further improved the final performance by incorporating histogram
architecture and pretrained model ( fϕ), and also the augmentation equalization115 and elastic deformation115,116 in addition to our default
strategy, we used the SimCLR default learning rate, batch size and data augmentation strategy. For these greyscale images, histogram
temperature settings as described below. equalization (H) corrects the contrast, ensuring that all further contrast
changes start from a uniform place, and makes colour distortion work
Base network and pretrained models. In our experiments, we used better. Furthermore, elastic deformation (E) can realistically occur
a standard ResNet architecture44,52,57,62,72 with two different model archi- during breast cancer screening and in general for most human organs
tecture sizes to ensure the observed phenomenon is disentangled from during imaging when undergoing any internal or external pressure115. By
the model size and number of parameters in the network. In particular, including these augmentation strategies, the network learns invariance
we considered two ResNet architectures with two commonly used to such deformations without the need to see these transformations
depths and width multipliers (hidden layer widening factors) as the in the image corpus.
backbone networks: ResNet-50 (1×) and ResNet-152 (2×). Following the As studied in refs. 41,62, we used a standard inception-style random
SimCLR44 method for contrastive pretraining, we used two fully con- cropping procedure117 as one of the fundamental pre-processing
nected layers to map the 2,048-dimensional output of each ResNet to steps for contrastive learning. In all pretraining experiments, images
a 128-dimensional representation embedding space. We performed were randomly cropped and resized to 224 × 224 pixels. This image
SimCLR pretraining on DTn U
where n = {1, 2, … , 6} indicates different size is mainly used for pretraining due to memory constraints of the
medical imaging tasks. underlying hardware to help increase the mini-batch size. Previous
In REMEDIS, we initialized the backbone ResNet architecture studies62 suggest that pretraining with 224 × 224 images does not have
with weights from BiT52 pretrained models. In addition to the model a substantial impact on the final performance.
architecture, BiT models vary on the basis of the pretraining dataset: In the pathology tasks, to capture details specifically present in
BiT-S, BiT-M and BiT-L, where S(mall), M(edium) and L(arge) indicate high-resolution pathology slides, we obtained patches from various
whether the pretraining was done on ILSVRC-2012 (ImageNet-1K)68, magnification levels. Additionally, this was followed by a random hori-
ImageNet-21K45 or JFT54, respectively. The BiT-L family resulted in the zontal left-to-right flip with a 50% probability. We applied random
best performance on each domain-specific validation set, and hence, rotation by angle δ ∼ U(−45∘ , 45∘ ) and random colour distortion with
was selected as our main backbone model. The BiT models all use a maximum strength 1.0 that included random brightness, contrast,
ResNet-v2 architecture111, which replace all Batch normalization112 saturation and hue changes. We blurred the image 50% of the time using
layers with Group normalization71 and use Weight standardization113 in a Gaussian kernel with σ ∈ [0.1, 2.0] and size of 10% of the image height
all convolutional layers. This setup differs from the standard SimCLR and width. Additional details about the selection of batch size, learning
setup, and to incorporate BiT pretrained models into the SimCLR-based rate and augmentations are provided in Supplementary Table 1.
pretraining, we reused pretrained weights of convolution layers, while
Batch normalization was replaced with Group normalization via default Fine-tuning and evaluation protocol. Previous works in transfer learn-
initialization. ing and semi-supervised learning primarily use linear probing for evalu-
In addition to the above setup used across our key experiments, we ation of the learned representations108,118,119. End-to-end fine-tuning
have included detailed ablation studies in Supplementary Information, is not commonly considered given the computation costs incurred
where we considered setups both with and without initialization from during evaluation.
pretrained weights and ran several other experiments to understand In general, obtaining an effective fine-tuned model requires an
our method and results comprehensively. in-depth understanding of training dynamics and goes beyond solely
minimizing the loss function using the labelled data. Fine-tuning and
Pretraining hyperparameters. The contrastive-learning procedure is training from scratch both optimize the same training loss but differ
influenced by multiple hyperparameters, including the choice of opti- in their initial weights. However, the selection of learning hyperpara­
mizer, learning rate, weight decay, temperature, training epochs and meters, various output layer/head options and different gradient
batch size. We based our training on SimCLR44 and used the LARS opti- flow choices can dramatically affect the performance of the final model.
mizer114 to stabilize training during pretraining as previously sug- To this end, we considered a detailed fine-tuning and evaluation protocol
gested44. We also used the default weight decay of 10−6 and trained as explained below, with a visual overview in Extended Data Fig. 2.
all of the models for 1,000 epochs. For each task, we investigated pre-
training models with learning rate ( lr ) in {0.1, 0.3} and temperature ( τ ) Fine-tuning hyperparameters. After obtaining the pretrained model
in {0.1, 0.2} , and also with the largest possible batch size in for each task, we further fine-tuned the model end-to-end using the
{1, 024, 2, 048, 4, 096} that is compatible with memory constraints of our training set of the labelled ID medical data. Following a previously
hardware infrastructure. Our experiments showed that in all tasks, described approach44,62,72, we initialized the weights of the network used
1,000 epochs using a learning rate of 0.3 and temperature of 0.1 led to for the downstream task from the weights of the pretrained network
a domain-specific pretrained model with optimal performance. Since obtained in the previous step. For every combination of pretraining
contrastive pretraining often leads to spiky learning curves, after pre- strategy and downstream task, an extensive hyperparameter search
training the model for 1,000 epochs, we selected our final checkpoint was performed. This included the choice of optimizer, learning rate,
by considering all of the checkpoints in a window of [M(1 − 0.001M), M], weight decay, decay steps and decay factor. In particular, as recom-
where M is the maximum iteration and picking the checkpoint with the mended in much previous research, we tried both the Adam120 opti-
minimum contrastive loss. Supplementary Table 1 shows the final mizer or the standard stochastic gradient descent (SGD) with Nesterov
selected hyperparameter for each task based on the performance of Momentum121,122 using a linear or decaying learning rate. For all tasks,
the models on the validation set of the downstream task. we also performed data augmentation and pre-processing to achieve


the best performance after downstream fine-tuning. However, unlike initial learning rates between 10−4.0 and 10−2.0, and three logarithmically
the pretraining step, we did not study the effect of each data aug- spaced values of weight decay between 10−6.0 and 10−4.0 , as well as
mentation separately in the fine-tuning step and mainly followed the no weight decay and considered learning rate decay steps in
standard augmentation that has been used in previous published work {10, 000, 25, 000}.
for these tasks4,7,10.
We performed model selection on the basis of the model perfor- Zero-shot OOD performance evaluation. To evaluate the robustness
mance in the validation set of the ID labelled medical data, and report of our models to distribution shifts, we considered a generalization
the final performance using the test set of the labelled ID data. We also setting where the model post pretraining and end-to-end fine-tuning
followed a standard early stopping procedure to avoid overfitting123. To on ID were used to make predictions on the shifted OOD dataset without
estimate the variability around the model performance and investigate using any labels from the new dataset. In this setting, the OOD dataset
any statistically significant improvement, the chosen hyperparam- ( Dout ) and the ID dataset ( Din) had the same label space but the OOD
eters were used for the training and testing of ten model runs (unless dataset underwent distribution shifts due to demographic, technology
otherwise specified), and task performance is reported on the basis of or behavioural changes. Towards this end, we directly evaluated the
mean and standard deviations of the performance across the model performance of fine-tuned models, which included the classification
runs. Details of the fine-tuning procedure for each task are explained head on the test split of Dout for each task, and report the performance
in the following paragraphs and summarized in Supplementary Table 2. on the basis of the task metrics. Similar to the ID dataset, to estimate
the variability around the model performance, we report the mean and
Dermatology-condition classification. For dermatology-condition standard deviations of 10 different models trained with the selected
classification, we used the Adam optimizer with a linear learning rate hyperparameters in the fine-tuning step (unless otherwise
and we fine-tuned all models for 150,000 steps with a batch size of 16. specified).
During fine-tuning, we augmented the dermatology images by per-
forming random colour distortion, cropping with resizing to 448 × 448 Data-efficient OOD fine-tuning and evaluation. We also considered
pixels, blurring, rotation and random flipping. We selected the learning the setting in which with site-specific retraining, we can improve the
rate and weight decay after a grid search of seven logarithmically performance of the model on the out-of-domain dataset. For this
spaced learning rates between 10−6.0 and 10−3.0, and three logarithmi- purpose and to investigate the data efficiency of our models, we
cally spaced values of weight decay between 10−6.0 and 10−4.0, as well fine-tuned our models on different fractions of out-of-domain labelled
as no weight decay. training data. To simulate the label scarcity encountered when develop-
ing medical imaging ML, we used various fractions of labelled Dtrain out
DME classification. In the DME classification task, we used an SGD including {10%, 20%, 50%, 100%}, and after fine-tuning of the model using
optimizer with momentum parameter of 0.9 and an exponential learn- these fractions, we evaluated the models on the test split of Dtest
out . Due
ing rate schedule, and we fine-tuned all models for a maximum of 1,000 to computation constraints, we performed the hyperparameter selec-
steps with a batch size 8. The final checkpoint selection was based on tion on the basis of the previously explained fine-tuning protocol using
the performance of the model on the validation set. During the training, only 100% of the Dtrain
out and selected the hyperparameters on the basis
we used an augmentation strategy consisting of random colour distor- of the performance of the fine-tuned models on the validation split
tion, cropping with resizing to 587 × 587 pixels, blurring and random Dval
out for each task. This implies that one could reach even higher per-
flipping. We tuned the initial learning rate and weight decay by perform- formance and data efficiency than the numbers we report for smaller
ing a grid search of seven logarithmically spaced learning rates in 10−4.0 data-fraction sizes by performing a more thorough hyperparameter
and 10−5.0 , and three logarithmically spaced values of weight decay search. Our experiment also showed that using early stopping is essen-
between 10−5.0 and 10−3.0, as well as no weight decay. We set the decay tial for smaller data fractions.
rate to 0.1 of the initial learning rates. The amount of OOD labelled data that we use for the adaptation,
the severity of the distribution shift and the architecture of models
Chest-X-ray classification. We used the Adam optimizer with an expo- used can affect the final performance124. For these reasons, in each task,
nential learning rate decay for the chest-X-ray classification task. We we additionally considered three distinct scenarios for initializing the
trained all models in this task up to a maximum of 250,000 steps using weights of the network: using the domain-specific pretrained model
a batch size of 64. We pre-processed and augmented the chest-X-ray weights and adding a random classification head; starting from the ID
images by applying random cropping, scaling to 224 × 224 pixels, rota- fine-tuned model and also keeping the classification head obtained
tion up to 15∘ and random colour distortions. We selected the initial from the ID fine-tuning step; and using the ID fine-tuned model and
learning rate and weight decay after a grid search of seven logarithmi- adding a random classification head.
cally spaced learning rates in 10−8.0 and 10−2.0, and three logarithmically We selected the best setup on the basis of the performance on the
spaced values of weight decay between 10−6.0 and 10−4.0 , as well as validation split Dval
out for a given task to report our results.
no weight decay.
Baselines
Pathology classification. For both pathology tasks, we used the Adam For all tasks, we considered three baselines: (1) the prevailing and
optimizer with a linear learning rate and we fine-tuned all models for standard paradigm for developing medical-imaging ML using models
a maximum of 25,000 steps with a batch size of 8. We tuned the learning pretrained on ImageNet-1K using supervised learning, (2) the strong
rate using a grid search of five logarithmically spaced learning rates supervised baseline pretrained on JFT-300M images and (3) the
in 10−7.0 and 10−3.0, and we used no weight decay. task-specific clinically applicable performance wherever available.

Mammography classification. For the breast cancer classification Supervised baseline. The standard supervised baseline we consid-
task, we used the SGD optimizer with a momentum parameter of 0.9 ered for all the medical imaging tasks in this study is an ImageNet-1K
and exponential learning rate schedule, and we fine-tuned all models supervised pretrained ResNet. Initialization from a model pretrained
for a maximum of 100,000 steps with a batch size of 1. During training, on ImageNet-1K is the standard baseline for transfer learning in medi-
we used an augmentation strategy consisting of random colour distor- cal imaging as demonstrated by their use in previously published
tion, resizing to 2,048 × 2,048 pixels, random flipping and elastic defor- works4,6,7,11 and ResNets remain a popular and competitive architecture
mation. We performed a grid search of five logarithmically spaced baseline for computer vision tasks46,58,125. We also considered a strong


supervised baseline, which is a ResNet model pretrained on JFT-300M example, we used the pretrained model obtained from the unlabelled
that has shown significant improvement over standard baseline for pathology data for both pathology metastases detection and survival
medical image analysis57. For fine-tuning from the supervised pre- prediction tasks.
trained baselines, we followed exactly the same protocol as for the
models developed with REMEDIS. For fair comparison, we performed Clinical evaluation settings
identical extensive hyperparameter sweeps as in REMEDIS so that the In the following text, we provide details of our clinical evaluation setting
baseline models attained the highest performance on the validation set. including medical imaging tasks description, details of the datasets,
Based on our experiments, the supervised baseline models generally description of distribution shifts, definition of clinically applicable per-
required more iterations to fully converge and our chosen maximum formance and details of the clinical impact analysis of using REMEDIS
step and early stopping mechanism ensured that they attained optimal for developing medical-imaging ML.
performance.
Furthermore, for each baseline as well as for REMEDIS, we consid- Tasks and datasets. We investigated 14 distinct datasets across differ-
ered two architecture sizes: ResNet-50 (1×) and ResNet-152 (2×) (except ent imaging modalities, dataset sizes, label spaces and class distribu-
in the pathology domain due to computational considerations). This tions to reflect the heterogeneity of medical imaging problems and
helped us further understand the influence of the model size and the evaluate various distribution shift scenarios. Specifically, we consid-
number of parameters in REMEDIS as well as the baselines. In addition ered five popular modalities in medical imaging: dermatology, mam-
to the supervised baselines, it is also possible to train medical imaging mography, digital pathology, fundus imaging and X-ray, with different
models by randomly initializing the model weights from scratch and tasks listed in Supplementary Tables 5 and 6. These tasks are repre-
fine-tuning on the downstream task. However, previous work57,58 sug- sentative of many common characteristics and challenges of medical
gests this is suboptimal compared with the supervised baseline, hence imaging (for example, class label imbalance, variation in pathologies
we did not consider it in this study. of interest from small local patches to more global patches and
image-characteristic variations). Each specific modality and task
Clinically applicable performance. We defined clinically applicable included an unlabelled pretraining dataset ( DU), an ID dataset ( Din) and
performance as accuracy demonstrated by expert clinicians in the OOD one or more out-of-domain datasets ( Dout), which have been collected
setting for the given task at hand, Dout. Obtaining a measure of clinician under clinical distribution shifts due to new data acquisition devices
expert performance in the various clinical settings considered in this or different clinical demographics26. Supplementary Table 5 summa-
study was challenging for several reasons (for example, cost or time rizes the tasks and datasets and Fig. 2 shows visual samples of each
taken for annotation). As a result, we had a measure for clinically appli- task.
cable performance only for some of the medical tasks considered in
this work as we detail below. Supplementary Table 4 summarizes the Task 1: Dermatology-condition classification. The dermatology-
clinically applicable performance range calculated for each task. condition classification task (T1) targets the identification of various
For the dermatology (T1) task, we defined the clinician applicable types of skin conditions from digital camera images. For this task, the
performance following a one-versus-all approach7. This was done by experiment setup and dataset of ref. 7,62 were followed. Unlabelled
computing the accuracy of the differential diagnosis of skin conditions pretraining dataset and ID training dataset were collected and
provided by one US-board-certified dermatologist against the aggre- de-identified by a US-based teledermatology service, with images of
gated reading from a panel of several US-board-certified dermatolo- skin conditions taken using consumer-grade digital cameras. Intrinsi-
gists as the ground truth. Similarly, for the chest-X-ray (T3) classification cally, these images exhibited variations in pose, lighting and camera
task, we followed the same strategy to compute the accuracy of one focus. Additionally, the target body part and also the backgrounds
radiologist versus the rest of the radiologists as suggested in ref. 5. For embodied noise artefacts such as variations in clothing. Each case
the pathology survival prediction (T5) task, we followed the approach included between one to six images and during the data preparation
proposed in ref. 106. For the DME diagnosis (T2) task, the ground truth step, cases with the occurrence of multiple skin conditions or ungrada-
was collected using the worldwide gold-standard OCT machine, thus ble images were filtered out (for details, see ref. 7). These data were
the clinician expert performance was not applicable. The clinician’s collected in the United States and the ground-truth labels were aggre-
expert performance was also not available for the pathology metasta- gated from a panel of several US-board-certified dermatologists who
ses detection (T4 ) OOD datasets, as well as for the mammography provided a differential diagnosis of skin conditions in each case. As in
classification (T6) task. actual clinical settings, the distribution of different skin conditions
was heavily skewed in this dataset, ranging from some skin conditions
Implementation details making up more than 10% of the training data such as acne, eczema and
We built on top of the TensorFlow implementation of SimCLR as our psoriasis, to those making up less than 1% such as lentigo, melanoma
model pretraining code base. We further integrated an extended data and stasis dermatitis7. To ensure the existence of sufficient data in each
pipeline for each task to be able to read, pre-process and sample images category for model development, the 26 most common skin conditions
on the basis of patient metadata for modalities such as pathology, mam- out of 419 unique conditions were identified and the rest were grouped
mography and chest X-rays. We also implemented extra augmentation into an additional ‘Other’ class, leading to a final label space of 27 classes
strategies relevant to medical imaging data, including implementation for this task. The 26 target skin conditions included: acne, actinic kera-
of functions for elastic deformation, histogram equalization and ran- tosis, allergic contact dermatitis, alopecia areata, androgenetic alope-
dom rotation. We pretrained our models using 16–256 Google Cloud cia, basal cell carcinoma, cyst, eczema, folliculitis, hidradenitis, lentigo,
TPU cores depending on the chosen batch size, memory and the size of melanocytic naevus, melanoma, post-inflammatory hyperpigmenta-
the unlabelled dataset for each task. With the fixed input image size of tion, psoriasis, squamous cell carcinoma/squamous cell carcinoma
224 × 224 pixels and using 64 TPU cores, ~24 h was needed to pretrain in-situ, seborrhoeic keratosis, scar condition, seborrhoeic dermatitis,
a ResNet-50 (1×) with batch size of 1,024 for 1,000 epochs in a dataset skin tag, stasis dermatitis, tinea, tinea versicolor, urticaria, verruca
with 200,000 examples. However, pretraining using large unlabelled vulgaris and vitiligo.
datasets such as CFPs and image patches extracted from digital pathol- In total, the ID dataset included a total of 20,676 unique cases. The
ogy whole-slide images that include millions of examples can take up final train, validation and test sets included a total of 15,340 cases, 1,190
to 7 d (~150 h) using 256 TPU cores. Once the models were pretrained, cases and 4,146 cases, respectively. We also used an additional
they were fine-tuned for the domain-specific downstream tasks. For de-identified OOD dataset, DT1 out , to investigate the generalization per-


formance of REMEDIS under distribution shift. Unlike DT1 in


, this dataset to those reported in ref. 78. Nonetheless, we believe the relative per-
was primarily focused on skin cancers on the ground-truth labels formance of models is representative, informative and comparable to
obtained from biopsies62. DT1 out was further split to train, validation and those in refs. 57,58,62,80. ChestX-ray 14 is an open-source dataset col-
test sets consisting of 17,322 cases, 4,339 cases and 6,639 cases, respec- lected at the National Institutes of Health Clinical Center in Maryland.
tively. The OOD dataset, DT1 out , was collected by a chain of skin cancer The data were labelled in a manner similar to CheXpert by extracting
clinics in Australia and New Zealand. This dataset had a much higher common findings from radiologist reports.
prevalence of skin cancers such as melanoma, basal cell carcinoma and
actinic keratosis. For self-supervised pretraining, DT1 U
was formed by Task 4: Lymph-node-metastases detection. In the lymph-node-
using a total of 207,032 unlabelled images from DT1 in
where we removed metastases detection task (T4), the goal is to detect cancer metastases
the annotations. in digital whole-slide images of lymph-node histology slides. The mod-
els were trained in a weakly supervised manner using only case-level
Task 2: DME classification. DME is distinguished by thickness of the labels and without any local annotations. To make case-level predic-
central area of the retina due to accumulation of intraretinal fluid. While tions, embeddings from 214 = 16, 384 patches per case were combined
it is possible to screen for DME using CFPs by detecting hard exudates via an attention layer81. A random sample of 50 M patches from 10,705
near the fovea as a surrogate for the presence of fluid, extracting the cases (29,018 slides) spanning 32 studies from TCGA was used for
thickness directly from a 3D OCT volume has become the gold stand- self-supervised pretraining ( DT4 U
). Breast lymph-node slides from the
ard for making a diagnosis74. Nevertheless, the use of OCT machines CAMELYON16 challenge82 were used for model development and ID
for DME diagnosis word-wide is limited due to high cost. In this task, evaluation ( DT4
in
). Lymph node slides from 5,161 stage II and III colorectal
we followed the approach of refs. 12,75,126 to leverage a dataset of cancer cases (36,520 slides) collected between 1984 and 2007 from the
paired CFP and OCT data, and trained a model that takes CFP as input Institute of Pathology and the Biobank at the Medical University of
and predicts CRT measured from the corresponding OCT. Specifically, Graz were used for OOD evaluation ( DT4 T4
out ). Dout was divided into train,
CRT was defined as Early Treatment Diabetic Retinopathy Study zone validation and test sets including 2,577 cases (17,904 slides), 1,295 cases
1/central subfield thickness ≥300 μm (refs. 75–77). (9,313 slides) and 1,289 cases (9,303 slides), respectively. This dataset
For the pretraining purposes, we used the unlabelled dataset from is further described in ref. 106; however, here cases were not excluded
EyePACS Inc., DT2 U
, which included 2,287,716 fundus images from a on the basis of having insufficient tumour content in the primary tissue
cohort of 308,507 patients where Hispanic is the most prevalent race/ slides. Additionally, we used a fraction of the CAMELYON-17 dataset,
ethnicity. The ID dataset (collected in Thailand) DT2 in
includes 6,039 including 273 pathology slides, as an additional OOD dataset for pathol-
fundus images from 4,035 patients. We split this dataset to train, valida- ogy metastases detection. We particularly removed overlapping
tion and test sets including a total of 3,874 images, 973 images and 1,192 CAMELYON16 pathology slides and cases from the original
images, respectively. We also used a primary de-identified OOD dataset, CAMELYON-17 dataset.
DT2
out, to investigate the generalization performance of REMEDIS under
distribution shift. Unlike DT2
in
, this dataset was collected in Australia and Task 5: Colorectal-cancer-survival prediction. The objective of the
included 3,779 fundus images from 879 patients. DT2 out was also further colorectal cancer survival prediction task (T5) is to predict 5 yr DSS
divided into train, validation and test sets including 2,524 images, 643 using digitized whole-slide images of primary colorectal tissue histo­
images and 612 images, respectively. Additionally, we used a secondary logy slides. Models were trained in a weakly supervised manner and
de-identified OOD dataset consisting of 909 fundus images from 323 used the same self-supervised pretraining dataset ( DT4 U
) as the lymph
patients collected in India for zero-shot OOD performance node metastases detection task (T4). Colorectal tissue slides from 4,496
evaluation. stage II and III colorectal cancer cases (36,841 slides) collected between
1984 and 2007 from the Institute of Pathology and the Biobank at the
Task 3: Chest-X-ray-condition classification. The chest-X-ray- Medical University of Graz were used for model development and ID
condition classification task (T3) involves multilabel classification of validation ( DT5in
). A temporal split of 671 cases (6,419 slides) collected
chest-X-ray images among five common findings: atelectasis, consoli- between 2008 and 2013 from the same institution was used for OOD
dation, pulmonary oedema, effusion and cardiomegaly. These were evaluation ( DT5 T4
out ). Dout was divided into train, validation and test sets
chosen due to their prevalence, in particular in DT3 in
. Each finding was including 402 cases (3,873 slides), 101 cases (913 slides) and 168 cases
modelled as an independent binary prediction. Three publicly available (1,633 slides), respectively. This dataset is further described in ref. 10;
datasets were used for training and evaluation purposes: CheXpert78, however, only cases not lost to follow-up for DSS within 5 yr were
MIMIC-CXR79 and ChestX-ray 14 (ref. 66). included here.
We used the training split of MIMIC-CXR79 as DT3 U
, a dataset consist-
ing of 215,695 radiographic studies collected at Beth Israel Deaconess Task 6: Mammography classification. In the mammography cancer
Medical Center in Boston, MA, for pretraining. Each study contained classification task (T6 ), the goal is to predict whether there will be
potentially multiple views, so we sampled from these images during biopsy-confirmed cancer occurring in the 39 months following the
pretraining (preferentially sampling posterior anterior or anterior screening episode, as described in ref. 4. We used multiple different
posterior, if available). This dataset was combined with the ID dataset datasets collected in various geographic locations for this task. This
during pretraining. included a labelled dataset collected in the United Kingdom, a labelled
CheXpert78 is a large open-source dataset of de-identified chest dataset from the United States (Northwestern Memorial Hospital), an
radiograph (X-ray) images (224,316 chest radiographs coming from unlabelled set of images from five clusters of hospitals across five dif-
65,240 unique patients). The ground-truth labels for training data were ferent cities in India (Bangalore, Bhubaneswar, Chennai, Hyderabad
automatically extracted from radiology reports. The radiologist report and New Delhi) and another unlabelled set of images collected from
was then mapped to a label space of 14 neurological observations. We Northwestern Memorial Hospital (Chicago). Each of these datasets
predicted the five most prevalent pathologies used in ref. 78. Following contained four different images per patient and for each breast (left
previous work57,58,62,80, to facilitate a robust comparison of our method and right), there is a medio lateral oblique and craniocaudal view. The
to standard approaches, we defined a custom subset of the CheXpert UK and USA datasets are described in more detail in ref. 4.
dataset. For this purpose, the full training set was randomly re-split into The UK dataset was used as the labelled ID data ( DT6 in
), which
training, validation and test images (see Supplementary Table 5). This included a total of 89,018 cases. The training set of this dataset con-
means the performances of our models are not directly comparable tained 26,739 cases, which were unevenly divided between 4,751


positive cases and 21,988 negatives. We upsampled positive cases with demographic shift, the rate of diabetes and the prevalence of disease
a factor of 10 to balance the distribution of negative and positive cases differed between the two datasets. Also, clinical incentives to detect
and further improve performance. The original validation set included DME between these two countries varied, which introduced behav-
49,831 cases with a total of only 992 positive cases. Also, the test data ioural shifts. Thus, we observed technology shift, demographic change
contained 12,448 cases with 249 positives and 12,199 negatives. The and behavioural shifts between ID and OOD datasets in the DME task.
labelled dataset from the United States was used as out-of-domain These shifts also held between DME ID and the additional OOD (col-
data, DT6
out. This dataset, which included a total number of 41,043 cases, lected in India) datasets as detailed in Supplementary Information.
was split into train, validation and test sets with 27,083 cases, 6,901
cases and 7,059 cases, respectively. In this dataset, positive cases were Chest-X-ray classification. The chest-X-ray ID (CheXpert, Stanford)
only 9% of all of the cases. For pretraining, the unlabelled dataset ( DT6
U
) and OOD (ChestX-ray 14, Virginia) were collected in different demo-
was formed by removing labels from the labelled data from the UK graphics and hospitals and underwent technology and population
dataset and combining it with the unlabelled data from India. During shifts. However, there is no evidence for behaviour changes between
pretraining, as suggested in ref. 62,83 to improve the positive pair these datasets.
mining procedure, a single image was randomly selected from the four
possible views that were further used to generate a pair. Pathology classification. In the metastases task, the ID data tar-
geted the breast lymph-node slides, while the OOD data consisted
Qualitative analysis of distribution shifts of lymph-node slides of colorectal cancer. Also, the data had been
Here we provide details of the distribution shifts, including the defini- collected in different clinical sites and scanned with different devices.
tion, visual effect of distribution shifts and distribution shift analysis Therefore, we observed technology and population shifts between ID
across the different medical imaging tasks in this study. and OOD datasets in the pathology lymph-node metastases detection
task. The data used for survival prediction ID (collected between 1984
Type of distribution shift. Distribution shifts, where the training and 2007) and OOD (collected between 2008 and 2013) were also col-
distribution differs from the target test distribution, can substantially lected in different clinical setups and at different times.
degrade the diagnostic performance and model calibration of ML
systems deployed in new clinical setups and further negatively affect Mammography classification. The mammography ID (UK) and OOD
existing health disparities26. To qualitatively benchmark the extent of (US) data were collected in different countries and with various scan-
the distribution shifts, in each task and dataset, we viewed the overall ners. However, factors affecting behaviour shift between these datasets
data distribution shifts as a mixture of multiple possible changes rep- were unknown. Therefore, we considered technology and population
resented by Si. Distribution shift from a given ID set ( Din) to a given OOD shifts between ID and OOD datasets in this task.
set ( Dout) can be represented by ⋃i∈I Si, where S = {S1 , S2 , … , Sn }represents
the set of possible distribution shift sources and is the index set. Here Visual effects of distribution shifts. As discussed above, models used
we limit n to the three most frequent sources of data distribution shifts: in medical ML applications are often trained on data from a small num-
population shifts ( S1), technology shifts ( S2) and behavioural shifts ( S3) ber of hospitals (ID data), but the goal is usually to deploy these models
as suggested in ref. 26. A large |I| represents a severe degree of distribu- more generally across other hospitals or other varied clinical settings
tion shift between two datasets. (OOD). Variations in data collection and processing can degrade model
Changes in technology, one of the most common causes of data accuracy on data from new clinical deployment settings. This variation
distribution shifts, can consist of new types or different models or can be subtle or visually pronounced, ranging from changes in contrast,
brands of data-acquisition device, and changes in IT practices and sharpness or tint, to nonlinear effects of X-ray sensor construction,
infrastructure for the upstream task. Changes in demographics were differences in zoom levels and so on. For example, in histopathology,
identified by changes in characteristics of the population in which the this variation can arise from sources such as slide staining and image
model was developed versus the target test population and included acquisition differences. Extended Data Fig. 3 shows visual samples of
but were not limited to shifts in age, sex, race and ethnicity. Behavioural ID and OOD images used in this study. As there is no canonical way to
changes arise from changes in patient behaviour, clinician behaviour, quantify these distribution shifts, we assessed the quality of the dataset
clinical practice, clinical nomenclature, clinical incentives and finally shift according to ref. 26.
ML system-induced behavioural changes26. For example, adding surgi-
cal skin marking can affect the accuracy of the dermatology classifier127. Analysis of clinical impact
We focused on these distribution shifts because they collectively Medical data acquisition and annotation are often extremely expensive
capture the structure of most of the shifts in medical applications. and time-consuming28. The costs associated with medical data consist
However, it is worth noting that identifying and analysing all of the of multiple components such as acquisition cost, handling cost (for
causal factors in different clinical settings is challenging and some- example, image de-identification), curation cost (for example, data
times impossible. authentication, archiving, management) and annotation cost. To gain
a better understanding of the potential benefits of our data-efficient
Dermatology-condition classification. In the dermatology classifi- generalization strategy, we attempted to collect an estimate of the
cation task, we observed both technology shift and population shift cost and clinician hours associated with data annotation for each of
between ID and OOD datasets. Different acquisition devices were our target modalities and specifically for the OOD data. Since the exact
used to collect the ID (consumer-grade digital cameras) and OOD cost of data annotation is not available for many datasets considered in
(teledermatology service) images. In addition, IT practice, software this study, we estimated this using public data for US-based clinicians
and technology also shifted due to changes in clinical location and from salary.com (https://www.salary.com/). Estimates of clinician
setting. Demographic shift, disease prevalence shift and seasonal shift hours needed for data annotation were also collected from sources
also occurred since the ID dataset was collected in multiple US clinics, relevant to each task such as refs. 128–130. These are rough approxi-
whereas the OOD dataset was collected in Australia and New Zealand. mations and the exact labelling cost depends on clinical practice in a
given site. We assumed that labels need to be generated from scratch.
DME classification. The DME ID and OOD data were collected using We included true annotation cost and no additional overhead. Cost
different scanners and in different hospitals and countries, with the also largely depends on the experience level of the clinicians provid-
former collected in Thailand and the latter in Australia. Due to this ing the labels and the choices made by the medical ML developers (for


In certain cases, such as mammography, the ground truth was derived from real-world data such as biopsies or outcomes and clinician annotation was primarily used for comparison. However, such data, while being more reliable, are again extremely time-consuming to obtain, often spanning many years.

Supplementary Table 7 summarizes clinical costs associated with the acquisition, curation and annotation of each OOD dataset, and in all cases, we focused on train splits. The annotation cost and hours savings were approximated on the basis of the percentage of the data that REMEDIS needed to match the performance of the supervised baseline as depicted in Fig. 3. The overall annotation cost of a dataset was equal to (average annotation cost per image) × (number of training images) and the overall annotation clinician hours for each dataset was equal to (average annotation time per image) × (number of training images). We also report the approximate duration needed to collect each dataset starting from the first patient recruitment. A similar calculation for acquisition-time savings could be done. Since with REMEDIS a smaller OOD dataset is sufficient for reaching baseline or clinically applicable performance, the total data acquisition duration can be considerably reduced. However, this is not shown because the calculation implies multiple assumptions, including a uniform acquisition of different classes or pathologies over time, which are not always valid.
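The cost model above is simple enough to express directly. The following Python sketch reproduces the two formulas and the way savings were approximated from the fraction of OOD data needed to match the supervised baseline; the function names are illustrative rather than part of any released code base, and the worked example uses the dermatology figures quoted in the next paragraph.

```python
# Minimal sketch of the annotation cost model described above; helper names
# and rounding are illustrative assumptions, not part of the REMEDIS release.

def annotation_cost(n_images, seconds_per_image, hourly_wage_usd):
    """Total clinician hours and US$ cost to label a dataset from scratch."""
    hours = n_images * seconds_per_image / 3600.0
    return hours, hours * hourly_wage_usd

def data_efficiency_savings(hours, cost, fraction_needed):
    """Hours and cost saved if only `fraction_needed` of the OOD train split
    must be labelled to match the fully supervised baseline (cf. Fig. 3)."""
    saved = 1.0 - fraction_needed
    return hours * saved, cost * saved

# Worked example with the dermatology OOD train split reported below:
# 17,322 cases, ~60 s per case, ~US$172 per dermatologist hour, 33% of data needed.
hours, cost = annotation_cost(17_322, 60, 172)
h_saved, usd_saved = data_efficiency_savings(hours, cost, 0.33)
print(f"{hours:.0f} h, US${cost:,.0f}; saved ~{h_saved:.0f} h, ~US${usd_saved:,.0f}")
# -> roughly 289 h and US$49,700, with ~193 h and ~US$33,000 saved,
#    close to the task-specific figures quoted below.
```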
Dermatology-condition classification. For dermatology, we used a median diagnosis time of 60 s for this particular dataset130. The average hourly wage of a dermatologist is US$172 (https://www.salary.com/research/salary/alternate/dermatologist-hourly-wages), resulting in an average cost per case of US$2.86. The cost of annotation of the dermatology OOD train dataset, which consisted of 17,322 dermatology cases, was estimated at US$49,540 and the total clinician hours associated with this dataset was 289 h. In this task, REMEDIS could reach the highest supervised baseline performance using only 33% of the total training data, which means REMEDIS could potentially offer 193 clinician hours and US$33,000 savings. In addition, D^T1_out was collected over 9 yr between 2007 and 2016 (very roughly estimating the effect of dataset acquisition duration when using REMEDIS; with only 33% of the data needed for reaching baseline performance, the data collected after the first 3 yr might have been sufficient for REMEDIS).

DME classification. For DME, while we had access to expert reader numbers, OCT-generated labels were used for the ML model development. This means the costs associated with this task are not comparable to clinician annotator cost in other tasks. Nevertheless, based on WebMD and ref. 131, the amount of time it takes to perform an OCT test is 5 min and 6.5 min, respectively. We estimated an average of 5.75 min (345 s) per case to obtain an OCT-generated label. The average hourly wage of an ophthalmologist is US$147 (https://www.salary.com/research/salary/alternate/ophthalmologist-hourly-wages), resulting in an average cost per case of US$14. Considering the 2,524 cases in the D^T2_out train split, the cost of annotation of this dataset was estimated at US$35,336, with an annotation duration of 242 h using OCT machines. In this task, REMEDIS could reach the highest supervised baseline performance using only 7% of total training data, which means REMEDIS could possibly offer 224 clinician hours and US$33,000 savings. In addition, D^T2_out was collected over 7 yr between 2003 and 2020 (ref. 75).

Chest-X-ray-condition classification. For chest X-rays, we used a reported interpretation time of 122 s per image128 and an average hourly wage of US$205 (https://www.salary.com/research/salary/alternate/radiologist-hourly-wages) for radiologists, resulting in a cost of US$6.98 per image. The train partition of our OOD dataset in this task included 27,978 cases, which translated to US$194,369 in cost of collection and 948 clinician hours. REMEDIS could reach the highest supervised baseline performance without any OOD training data, which means REMEDIS could offer 789 clinician hours and US$162,000 savings. The D^T3_out dataset was collected over 23 yr between 1992 and 2015.

Pathology tasks. For the pathology task data, the annotation cost of a single pathology slide was estimated at around US$23, where the time spent on a single slide was estimated at 10 min and the average salary of a pathologist estimated at US$138 (https://www.salary.com/research/salary/alternate/pathologist-hourly-wages) per hour. Based on this, annotation of the pathology metastases detection OOD (D^T4_out) train split, which included 2,577 cases (17,904 pathology slides), cost around US$411,792 and required 2,984 clinician hours. Our method could attain baseline performance using only 6% of the data, suggesting that it could contribute to over US$385,000 savings in the cost of the annotations. D^T4_out, which included lymph-node slides from stage II and III colorectal cancer cases, was collected over the 23 yr between 1984 and 2007. In pathology survival prediction (T5), each slide was annotated for the possible clinical outcome, and the OOD dataset, which included colorectal tissue slides from stage II and III colorectal cancer cases, was collected over 5 yr between 2008 and 2013. The D^T5_out train split consisted of 402 cases (3,873 pathology slides), the cost of annotation was around US$89,079 and 645 clinician hours were needed for the annotation. REMEDIS could cut this cost to US$13,000 and offer US$76,000 savings by using only 14% of the whole data.

Mammography classification. For the mammography data, the annotation task included localization and grading of the images. The cost of annotation per hour was estimated on the basis of the average salary of a radiologist, reported as US$205 (https://www.salary.com/research/salary/alternate/radiologist-hourly-wages) per hour. Moreover, it takes 5–6 min to annotate a single mammography image, hence the annotation cost for a single mammography image was estimated at US$20.5. For the mammography classification OOD dataset training split, this translated to a total cost of US$352,149 for 17,178 images, requiring 1,718 clinician hours for annotation. This dataset was collected over 17 yr between 2001 and 2018. In this task, REMEDIS could attain the highest supervised baseline performance without using any OOD example, which means REMEDIS could offer 1,569 clinician hours and US$322,000 savings.

It is worth mentioning that the saved annotation overhead can potentially considerably impact the feasibility of site-specific adaptation/retraining of a model. In addition to the visible cost and clinician hours that hinder medical image data collection and medical ML model adaptation, there are several substantial concerns, such as the speed of acquiring local labels, which may take years to manifest if based on outcomes. For example, in the survival prediction task, collection of 645 examples takes over 5 yr, and the low incidence of positive cases in tasks such as mammography screening substantially slows down the process of data collection to an extent where a proper dataset is collected over 20 yr. Therefore, the saved clinician hours can be translated to saving the multiple years that it takes to properly recruit patients, collect and curate a medical dataset.

Additional related work
Improving the robustness and generalization ability of ML models under distribution shifts is a long-standing challenge that has become more critical given the progressive adoption of these models in safety-critical settings such as healthcare and self-driving cars in recent years. Unlike previous studies, here we specifically consider 'data-efficient generalization', which targets a more practical (following real-world deployment scenarios) but also challenging setup that aims to generalize to a new distribution by using none or very few data examples from the target deployment data distribution. REMEDIS is closely related to transfer learning, self-supervised and contrastive learning, as well as to literature on robustness, domain adaptation and generalization. Here we review previous works that are key to understanding the context of our study.
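As a schematic of this data-efficient generalization setup, the sketch below sweeps over fractions of the OOD training split and records the smallest fraction at which an adapted model matches the supervised baseline; `fine_tune`, `evaluate` and the dataset objects are hypothetical placeholders rather than an actual API.

```python
# Hypothetical sketch of the data-efficient generalization protocol; the
# callables passed in stand for model fine-tuning and metric evaluation.

FRACTIONS = (0.0, 0.01, 0.05, 0.1, 0.25, 0.5, 1.0)

def smallest_matching_fraction(pretrained_model, ood_train, ood_test,
                               baseline_metric, fine_tune, evaluate):
    """Smallest OOD-train fraction at which the adapted model reaches the
    supervised baseline on the OOD test set (None if it never does)."""
    for frac in FRACTIONS:
        subset = ood_train.sample(frac) if frac > 0 else None  # no OOD labels at 0%
        model = fine_tune(pretrained_model, subset) if subset is not None else pretrained_model
        if evaluate(model, ood_test) >= baseline_metric:
            return frac
    return None
```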

Robustness and OOD generalization. To tackle the challenges from distribution shift, numerous efforts have been made over the years, resulting in a rich literature. The proposed techniques vary greatly, ranging from causality132 to representation learning133 and model-based134,135 techniques. On the basis of the model development strategy, one can address OOD generalization by improving the underlying representation-learning method, introducing a new mapping function between distributions or formulating the generalization as a new optimization problem136–138. Nevertheless, a recent paper94 suggests that, under careful evaluation settings, models developed using standard empirical risk minimization95,96 remain a strong baseline for the generalization problem.

Representation-learning techniques aim to learn distinct and informative representations (ref. 139) that can improve transfer learning and OOD generalization140,141. For this purpose, conventional disentangled representation techniques based on variational autoencoders (VAE)142, as well as causal representation learning132 and CausalVAE143, have been studied. However, it remains unclear whether disentangled representation benefits OOD generalization. Some findings suggest that the learned disentangled representation fails to extrapolate to unseen data140,141,144, while multiple studies verify the ability of these representations to generalize under OOD circumstances145–147. Nevertheless, the adaptation of these methods to medical image analysis has been limited and is yet to be explored.

There is also rapid development around large-scale pretraining models such as BiT52 and CLIP148 to improve the learned representation and consequent generalization. Ref. 147 demonstrated that such pretrained models in the middle of fine-tuning, as well as zero-shot pretrained models, represent an entire class of techniques that exhibit a high amount of effective robustness149. Our method is closely related to this group; however, here we consider learning representations using a combination of large-scale pretraining and self-supervision. Unlike these previous works, which use standard synthetic visual corruption datasets and benchmarks such as ImageNet-C92, we consider a realistic real-world setup by using datasets that undergo distribution shifts in the clinical setting.

End-to-end learning mechanisms for OOD generalization design various models and learning strategies to address the generalization problem, which also includes the category of domain adaptation methods. Representation learning is still a principal component in domain adaptation150–153. A popular baseline for domain adaptation is the domain-adversarial neural network (DANN)152,153. DANN learns representations that are discriminative and invariant to domains by jointly optimizing the underlying features, a label predictor that predicts class labels and is used both in the training and inference phases, and a domain classifier that discriminates between the source and the target domains during training. The representations are trained to confuse the domain classifier using gradient reversal so that domain-invariant features are learned. There are multiple variants of DANNs150,151,154, including a conditional invariant adversarial network151 that learns class-specific adversarial networks.
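For context, the gradient-reversal mechanism at the heart of DANN152,153 can be sketched in a few lines. This is a generic illustration of the cited prior work, written here with TensorFlow's custom-gradient support; it is not a component of REMEDIS.

```python
import tensorflow as tf

def make_gradient_reversal(weight=1.0):
    """Identity in the forward pass; multiplies the gradient by -weight in the
    backward pass, so shared features are pushed to confuse the domain head."""
    @tf.custom_gradient
    def reverse(x):
        def grad(dy):
            return -weight * dy
        return tf.identity(x), grad
    return reverse

# Schematic DANN-style wiring (label and domain heads are placeholders):
# label_logits  = label_head(features)
# domain_logits = domain_head(make_gradient_reversal(1.0)(features))
# total_loss    = label_loss + domain_loss
```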
Most domain adaptation methods belong to domain alignment, where the central idea is to minimize the difference between source and target domain data by learning domain-invariant representations138,154–156. Some other previous works also study training strategies for domain generalization, including meta-learning, ensemble learning, and unsupervised and semi-supervised domain adaptation139. However, there are limited works investigating domain adaptation and generalization for medical imaging despite its being a critical unmet need. Practically, the most common solution to this problem is site-specific data collection and model retraining26, which, as we show, can be prohibitively expensive and time-consuming for large-scale real-world medical imaging ML deployment. Recently, the closely related topic of domain generalization in the medical setting has been studied in ref. 16, where a framework to induce synthetic but realistic domain shifts is introduced. They benchmark the performance of eight domain generalization methods on multisite clinical time series and medical imaging datasets. This setup is different from ours, where we focused on retraining performance on retrospective data, mirroring real-world clinical settings.

Transfer learning. Despite the differences in image statistics, scale and task-relevant features, transfer learning from natural images is commonly used in medical imaging3,4,6,7,157,158. Multiple empirical studies suggest that such a transfer-learning strategy improves performance70,159,160 for medical imaging tasks. However, detailed investigations of this strategy58 demonstrate that this does not always improve performance in medical imaging contexts; nevertheless, transfer learning from ImageNet can speed up convergence and is particularly helpful when the medical image training data are limited. Importantly, the study used relatively small architectures and found pronounced improvements with small amounts of data, especially when using their largest architecture of ResNet-50 (1×)46 (which is the smallest architecture we considered in our study).

Transfer learning from ID data can help alleviate the domain mismatch issue. For example, refs. 160–163 report performance improvements when pretraining on labelled data in the same domain. However, this approach is often infeasible for many medical tasks in which labelled data are expensive and time-consuming to obtain. Recent advances in self-supervised learning provide a promising alternative, enabling the use of unlabelled medical data that are often easier to acquire.
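A minimal sketch of this fine-tuning recipe follows, assuming a pretrained backbone that maps images to pooled feature vectors; the input size, optimizer settings and dataset objects are placeholders and not the exact configuration used in our experiments.

```python
import tensorflow as tf

def fine_tune(backbone: tf.keras.Model, num_classes: int,
              labelled_ds: tf.data.Dataset, epochs: int = 10) -> tf.keras.Model:
    """Attach a task head to a pretrained backbone and fine-tune end to end on
    a (possibly small) labelled medical dataset yielding (image, label) pairs."""
    inputs = tf.keras.Input(shape=(224, 224, 3))
    features = backbone(inputs, training=True)  # assumed to return a pooled feature vector
    outputs = tf.keras.layers.Dense(num_classes)(features)
    model = tf.keras.Model(inputs, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(1e-4),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    model.fit(labelled_ds, epochs=epochs)
    return model
```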
Self-supervised and contrastive learning. Preliminary works in self-supervised representation learning focused on the problem of learning representations without labels such that a small linear classifier network operating on these representations could achieve high classification accuracy33,35,164,165. Contrastive self-supervised methods such as instance discrimination166, CPC167,168, Deep InfoMax169,170, AMDIM171, CMC172, RELIC85, MoCo42,84, PIRL173, SimCLR44,72 and SwAV174, among others, were the first to achieve linear classification accuracy approaching that of end-to-end supervised training. Recently, these methods have been harnessed to achieve dramatic improvements in label efficiency for semi-supervised learning. Specifically, one can first pretrain in a task-agnostic manner with self-supervised learning using all data and then perform task-specific fine-tuning on the labelled subset with a standard supervised objective44,72,167. Ref. 72 shows that this approach benefits from large (high-capacity) models for pretraining and fine-tuning, but that after a large model is trained, it can be effectively distilled to a much smaller model with little loss of accuracy.

Although self-supervised learning has only recently become viable on standard image classification datasets, it has already seen some application within the medical domain. While some works have attempted to design domain-specific pretext tasks175–178, other works concentrate on tailoring contrastive learning to medical data179–184. Most closely related to our work, ref. 63 explores the use of MoCo64 pretraining for classification of the CheXpert dataset through linear evaluation. In addition to task- and medical-domain-specific contrastive learning, several recent publications investigate semi-supervised learning for medical imaging tasks using self-supervised pretrained models (for example, refs. 185–188). When compared to these efforts, our study proposes a method, REMEDIS, that leverages both large-scale pretraining and self-supervision. We rigorously evaluate and quantify the impact of REMEDIS in varied and challenging clinical settings, thereby demonstrating its clear potential for accelerating the lifecycle of medical imaging ML development and deployment, and facilitating its widespread uptake in the real world.
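To make the contrastive objective referenced above concrete, the following NumPy sketch computes a SimCLR-style NT-Xent loss44,72 for a batch of paired embeddings. It illustrates the general recipe rather than the exact pretraining code used in this work.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.1):
    """z1[i] and z2[i] are embeddings of two augmented views of the same image."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)   # L2-normalize
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    z = np.concatenate([z1, z2], axis=0)                  # (2N, d)
    sim = z @ z.T / temperature                           # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                        # exclude self-similarity
    n = z1.shape[0]
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), positives].mean())
```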
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
The datasets from Northwestern Medicine and Apollo Hospitals were used under a licence for the current study and are not publicly available. Applications for access to the Optimam database can be made using this web form. The de-identified teledermatology data used in this study are not publicly available owing to restrictions in the data-sharing agreement. The unlabelled dataset used for DME classification is de-identified data from EyePACS Inc. Interested researchers should contact [email protected] to enquire about access to EyePACS data and approach the Office of Research and Development to enquire about access to VA data. The rest of the annotated data for the ID and OOD DME classification tasks were collected at the Rajavithi Hospital Thailand and at the Lions Eye Institute and are not publicly available owing to restrictions in the data-sharing agreement. Data used in the evaluation and pretraining of the chest-X-ray-condition classification, including MIMIC-CXR, CheXpert and ChestX-ray 14, are publicly available. Data used for the ID fine-tuning and evaluation of the detection of metastases are publicly available on the CAMELYON challenge website. The TCGA data used for pretraining for both the pathology-based metastases-detection and survival-prediction tasks are available via the NIH website. The rest of the data used in pathology tasks are not publicly available owing to restrictions in the data-sharing agreement. Moreover, ImageNet-1K (ILSVRC-2012)68, used for the pretraining of baseline supervised models, and ImageNet-21K, used for the pretraining of BiT-M models, are publicly available via the ImageNet website. BiT-L models trained on the JFT-300M54 dataset are not publicly available owing to restrictions in the data-sharing agreement.

Code availability
Several major components of the work are available in open-source repositories, such as the T library. The code base and pretrained weights used for self-supervised pretraining are available at S. The code base and pretrained weights for the BiT models are available at B. All experiments and implementation details are described in sufficient detail in Methods and in Supplementary Information to support replication with non-proprietary libraries. The code base used for our comparison to ResNet-RS was based on R. A number of the checkpoints and models generated through REMEDIS are readily accessible to researchers via the P. Additionally, the Foundation Medical ML repositories on GitHub offer access to code that can be used to train REMEDIS-based models.

References
1. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
2. Yala, A., Lehman, C., Schuster, T., Portnoi, T. & Barzilay, R. A deep learning mammography-based model for improved breast cancer risk prediction. Radiology 292, 60–66 (2019).
3. Wu, N. et al. Deep neural networks improve radiologists' performance in breast cancer screening. IEEE Trans. Med. Imaging 39, 1184–1194 (2019).
4. McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94 (2020).
5. Rajpurkar, P. et al. Deep learning for chest radiograph diagnosis: a retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Med. 15, e1002686 (2018).
6. Esteva, A. et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature 542, 115–118 (2017).
7. Liu, Y. et al. A deep learning system for differential diagnosis of skin diseases. Nat. Med. 26, 900–908 (2020).
8. Bera, K., Schalper, K. A., Rimm, D. L., Velcheti, V. & Madabhushi, A. Artificial intelligence in digital pathology—new tools for diagnosis and precision oncology. Nat. Rev. Clin. Oncol. 16, 703–715 (2019).
9. Rakha, E. A. et al. Current and future applications of artificial intelligence in pathology: a clinical perspective. J. Clin. Pathol. 74, 409–414 (2021).
10. Wulczyn, E. et al. Interpretable survival prediction for colorectal cancer using deep learning. npj Digit. Med. 4, 71 (2021).
11. Gulshan, V. et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA 316, 2402–2410 (2016).
12. De Fauw, J. et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat. Med. 24, 1342–1350 (2018).
13. Zhou, S. K. et al. A review of deep learning in medical imaging: imaging traits, technology trends, case studies with progress highlights, and future promises. Proc. IEEE 109, 820–838 (2021).
14. Condon, J. J. J. et al. Replication of an open-access deep learning system for screening mammography: reduced performance mitigated by retraining on local data. Preprint at medRxiv https://doi.org/10.1101/2021.05.28.21257892 (2021).
15. Zech, J. R. et al. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS Med. 15, e1002683 (2018).
16. Zhang, H. et al. An empirical framework for domain generalization in clinical settings. In Proc. Conference on Health, Inference, and Learning (eds Ghassemi, M. et al.) 279–290 (Association for Computing Machinery, 2021).
17. Seyyed-Kalantari, L., Liu, G., McDermott, M., Chen, I. Y. & Ghassemi, M. CheXclusion: fairness gaps in deep chest X-ray classifiers. Pac. Symp. Biocomput. 26, 232–243 (2021).
18. Kadambi, A. Achieving fairness in medical devices. Science 372, 30–31 (2021).
19. Pierson, E., Cutler, D. M., Leskovec, J., Mullainathan, S. & Obermeyer, Z. An algorithmic approach to reducing unexplained pain disparities in underserved populations. Nat. Med. 27, 136–140 (2021).
20. Artificial Intelligence in Health Care: Benefits and Challenges of Technologies to Augment Patient Care (US Government Accountability Office, 2020).
21. Kelly, C. J., Karthikesalingam, A., Suleyman, M., Corrado, G. & King, D. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 17, 195 (2019).
22. Roberts, M. et al. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nat. Mach. Intell. 3, 199–217 (2021).
23. Van Leeuwen, K. G., Schalekamp, S., Rutten, M. J., van Ginneken, B. & de Rooij, M. Artificial intelligence in radiology: 100 commercially available products and their scientific evidence. Eur. Radiol. 31, 3797–3804 (2021).
24. Freeman, K. et al. Use of artificial intelligence for image analysis in breast cancer screening programmes: systematic review of test accuracy. BMJ 374, n1872 (2021).
25. D'Amour, A. et al. Underspecification presents challenges for credibility in modern machine learning. J. Mach. Learn. Res. 23, 1–61 (2020).
26. Finlayson, S. G. et al. The clinician and dataset shift in artificial intelligence. N. Engl. J. Med. 386, 283–286 (2020).
27. Futoma, J., Simons, M., Panch, T., Doshi-Velez, F. & Celi, L. A. The myth of generalisability in clinical research and machine learning in health care. Lancet Dig. Health 2, e489–e492 (2020).
28. Willemink, M. J. et al. Preparing medical imaging data for machine learning. Radiology 295, 4–15 (2020).
29. Li, F.-F., Fergus, R. & Perona, P. One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell. 28, 594–611 (2006).
30. Zhu, X., Ghahramani, Z. & Lafferty, J. D. Semi-supervised learning using gaussian fields and harmonic functions. In Proc. 20th International Conference on Machine Learning (eds Fawcett, T. & Mishra, N.) 912–919 (AAAI Press, 2003).

31. Cohn, D., Atlas, L. & Ladner, R. Improving generalization with 51. HaoChen, J. Z., Wei, C., Kumar, A. & Ma, T. Beyond separability:
active learning. Mach. Learn. 15, 201–221 (1994). analyzing the linear transferability of contrastive representations
32. Sutton, R. S. Generalization in reinforcement learning: successful to related subpopulations. Preprint at https://arxiv.org/
examples using sparse coarse coding. Adv. Neural Inf. Process. abs/2204.02683 (2022).
Syst. 8, 1038–1044 (1996). 52. Kolesnikov, A. et al. Big transfer (BiT): general visual
33. Doersch, C., Gupta, A. & Efros, A. A. Unsupervised visual representation learning. In Proc. European Conference on
representation learning by context prediction. In Proc. IEEE Computer Vision (eds Vedaldi, A. et al.) 491–507 (Springer, 2020).
International Conference on Computer Vision 1422–1430 (IEEE, 53. Huh, M., Agrawal, P. & Efros, A. A. What makes ImageNet good
2015). for transfer learning? Preprint at https://arxiv.org/abs/1608.08614
34. Doersch, C. & Zisserman, A. Multi-task self-supervised visual (2016).
learning. In Proc. IEEE International Conference on Computer 54. Sun, C., Shrivastava, A., Singh, S. & Gupta, A. Revisiting
Vision 2070–2079 (IEEE, 2017). unreasonable effectiveness of data in deep learning era. In Proc.
35. Gidaris, S., Singh, P. & Komodakis, N. Unsupervised representation IEEE International Conference on Computer Vision 843–852
learning by predicting image rotations. Preprint at https://arxiv. (IEEE, 2017).
org/abs/1803.07728 (2018). 55. Mahajan, D. et al. Exploring the limits of weakly supervised
36. Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T. & Efros, A. A. pretraining. In Proc. European Conference on Computer Vision
Context encoders: Feature learning by inpainting. In Proc. IEEE (eds Ferrari, V. et al.) 185–201 (Springer, 2018).
Conference on Computer Vision and Pattern Recognition 2536– 56. Houlsby, N. & Zhai, X. The Visual Task Adaptation Benchmark
2544 (IEEE, 2016). (Google Research, 2019).
37. Larsson, G., Maire, M. & Shakhnarovich, G. Colorization as a 57. Mustafa, B. et al. Supervised transfer learning at scale for
proxy task for visual understanding. In Proc. IEEE Conference on medical imaging. Preprint at https://arxiv.org/abs/2101.05913
Computer Vision and Pattern Recognition 6874–6883 (2021).
(IEEE, 2017). 58. Raghu, M., Zhang, C., Kleinberg, J. & Bengio, S. Transfusion:
38. Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. Bert: pre-training understanding transfer learning for medical imaging. Adv. Neural
of deep bidirectional transformers for language understanding. Inf. Process. Syst. 33, 3347–3357 (2019).
Preprint at https://arxiv.org/abs/1810.04805 (2018). 59. Hendrycks, D., Lee, K. & Mazeika, M. Using pre-training can
39. Brown, T. B. et al. Language models are few-shot learners. Adv. improve model robustness and uncertainty. In Proc. 36th
Neural Inf. Process Syst. 33, 1877–1901 (2020). International Conference on Machine Learning (eds Chaudhuri, K.
40. Baevski, A., Auli, M. & Mohamed, A. Effectiveness of & Salakhutdinov, R.) 2712–2721 (PMLR, 2019).
self-supervised pre-training for speech recognition. Preprint at 60. Li, J., Lin, T. & Xu, Y. SSLP: Spatial guided self-supervised learning
https://arxiv.org/abs/1911.03912 (2019). on pathological images. In International Conference on Medical
41. Chen, L. et al. Self-supervised learning for medical image analysis Image Computing and Computer-Assisted Intervention (eds de
using image context restoration. Med. Image Anal. 58, 101539 Bruijne, M. et al.) 3–12 (Springer, 2021).
(2019). 61. Srinidhi, C. L. & Martel, A. L. Improving self-supervised learning
42. He, K., Fan, H., Wu, Y., Xie, S. & Girshick, R. Momentum contrast with hardness-aware dynamic curriculum learning: an application
for unsupervised visual representation learning. In Proc. IEEE/ to digital pathology. In Proc. IEEE/CVF International Conference on
CVF Conference on Computer Vision and Pattern Recognition Computer Vision 562–571 (IEEE, 2021).
9729–9738 (IEEE, 2020). 62. Azizi, S. et al. Big self-supervised models advance medical image
43. Grill, J.-B. et al. Bootstrap your own latent: a new approach classification. In IEEE/CVF International Conference on Computer
to self-supervised learning. Adv. Neural Inf. Process. Syst. 33, Vision (ICCV) 3458–3468 (IEEE, 2021).
21271–21284 (2020). 63. Sowrirajan, H., Yang, J., Ng, A. Y. & Rajpurkar, P. MoCo
44. Chen, T., Kornblith, S., Norouzi, M. & Hinton, G. A simple pretraining improves representation and transferability
framework for contrastive learning of visual representations. In of chest X-ray models. In Proc. Fourth Conference on Medical
Proc. 37th International Conference on Machine Learning (eds Imaging with Deep Learning (eds Heinrich, M. et al.)
Daumé, H. & Singh, A.) 1597–1607 (JMLR, 2020). 728–744 (PMLR, 2021).
45. Deng, J. et al. Imagenet: a large-scale hierarchical image 64. Zhou, Z. et al. Models genesis: generic autodidactic models for
database. In 2009 IEEE Conference on Computer Vision and 3D medical image analysis. In International Conference on Medical
Pattern Recognition 248–255 (IEEE, 2009). Image Computing and Computer-Assisted Intervention (eds Shen,
46. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for D. et al.) 384–393 (2019).
image recognition. In Proc. IEEE Conference on Computer Vision 65. Liu, X. et al. Self-supervised learning: generative or contrastive.
and Pattern Recognition 770–778 (IEEE, 2016). IEEE Trans. Knowl. Data Eng. 35, 857–876 (2023).
47. Touvron, H. et al. Training data-efficient image transformers 66. Wang, X. et al. Chestx-ray8: hospital-scale chest X-ray database
and distillation through attention. In Proc. 38th International and benchmarks on weakly-supervised classification and
Conference on Machine Learning (eds Meila, M. & Zhang, T.) localization of common thorax diseases. In Proc. IEEE Conference
10347–10357 (PMLR, 2021). on Computer Vision and Pattern Recognition 3462–3471 (IEEE,
48. Liu, H. & Abbeel, P. Hybrid discriminative-generative training via 2017).
contrastive learning. Preprint at https://arxiv.org/abs/2007.09070 67. Hendrycks, D. et al. Pretrained transformers improve
(2020). out-of-distribution robustness. Preprint at https://arxiv.org/
49. Winkens, J. et al. Contrastive training for improved abs/2004.06100 (2020).
out-of-distribution detection. Preprint at https://arxiv.org/ 68. Russakovsky, O. et al. Imagenet large scale visual recognition
abs/2007.05566 (2020). challenge. Int. J. Comput. Vis. 115, 211–252 (2015).
50. Shen, K. et al. Connect, not collapse: explaining contrastive 69. Alzubaidi, L. et al. Optimizing the performance of breast
learning for unsupervised domain adaptation. In Proc. 39th cancer classification by employing the same domain transfer
International Conference on Machine Learning (eds Chaudhuri, K. learning from hybrid deep convolutional neural network model.
et al.) 19847–19878 (PMLR, 2022). Electronics 9, 445 (2020).

70. Graziani, M., Andrearczyk, V. & Müller, H. Visualizing and 91. Wenzel, F. et al. Assaying out-of-distribution generalization in
interpreting feature reuse of pretrained CNNs for histopathology. transfer learning. Adv. Neural Inf. Process. Syst. 35, 7181–7198
In Proc. IMVIP 2019: Irish Machine Vision and Image Processing (2022).
(Technological University Dublin, 2019). 92. Hendrycks, D. & Dietterich, T. Benchmarking neural network
71. Wu, Y. & He, K. Group normalization. In Proc. European Conference robustness to common corruptions and perturbations. Preprint at
on Computer Vision (ECCV) 3–19 (2018). https://arxiv.org/abs/1903.12261 (2019).
72. Chen, T., Kornblith, S., Swersky, K., Norouzi, M. & Hinton, G. Big 93. Wang, Z., Dai, Z., Póczos, B. & Carbonell, J. Characterizing and
self-supervised models are strong semi-supervised learners. Adv. avoiding negative transfer. In Proc. IEEE/CVF Conference on
Neural Inf. Process. Syst. 33, 22243–22255 (2020). Computer Vision and Pattern Recognition 11285–11294 (IEEE, 2019).
73. Becker, S. & Hinton, G. E. Self-organizing neural network that 94. Gulrajani, I. & Lopez-Paz, D. In search of lost domain
discovers surfaces in random-dot stereograms. Nature 355, generalization. Preprint at https://arxiv.org/abs/2007.01434
161–163 (1992). (2020).
74. Virgili, G. et al. Optical coherence tomography (OCT) for 95. Vapnik, V. N. Statistical Learning Theory (Wiley-Interscience, 1998).
detection of macular oedema in patients with diabetic 96. Zhang, H., Cisse, M., Dauphin, Y. N. & Lopez-Paz, D. mixup:
retinopathy. Cochrane Database Syst. Rev. 1, CD008081 (2015). beyond empirical risk minimization. Preprint at https://arxiv.org/
75. Liu, X. et al. Deep learning to detect optical coherence abs/1710.09412 (2017).
tomography-derived diabetic macular edema from retinal 97. Goyal, P. et al. Self-supervised pretraining of visual features in the
photographs: a multicenter validation study. Ophthalmol. Retina wild. Preprint at https://arxiv.org/abs/2103.01988 (2021).
6, 398–410 (2022). 98. Bubeck, S. & Sellke, M. A universal law of robustness via
76. Brown, J. C. et al. Detection of diabetic foveal edema: contact lens isoperimetry. J. ACM 70, 1–18 (2023).
biomicroscopy compared with optical coherence tomography. 99. Ericsson, L., Gouk, H. & Hospedales, T. M. How well do
Arch. Ophthalmol. 122, 330–335 (2004). self-supervised models transfer? In Proc. IEEE/CVF Conference on
77. Sadda, S. R. et al. Automated detection of clinically significant Computer Vision and Pattern Recognition 5410–5419 (IEEE, 2021).
macular edema by grid scanning optical coherence tomography. 100. Chen, X. & He, K. Exploring simple Siamese representation
Ophthalmology 113, 1187.e1-12 (2006). learning. In Proc. IEEE/CVF Conference on Computer Vision and
78. Irvin, J. et al. Chexpert: a large chest radiograph dataset with Pattern Recognition 15745–15753 (IEEE, 2021).
uncertainty labels and expert comparison. Proc. Conf. AAAI Artif. 101. Ciga, O., Martel, A. L. & Xu, T. Self-supervised contrastive learning
Intell. 33, 590–597 (2019). for digital histopathology. Mach. Learn. 7, 100198 (2022).
79. Johnson, A. E. et al. MIMIC-CXR, a de-identified publicly available 102. Taher, M. R. H., Haghighi, F., Gotway, M. B. & Liang, J. CAiD:
database of chest radiographs with free-text reports. Sci. Data 6, context-aware instance discrimination for self-supervised
317 (2019). learning in medical imaging. In Proc. 5th International Conference
80. Neyshabur, B., Sedghi, H. & Zhang, C. What is being transferred in on Medical Imaging with Deep Learning (eds Konukoglu, E. et al.)
transfer learning? Adv. Neural Inf. Process. Syst. 33, 512–523 (2020). 535–551 (PMLR, 2022).
81. Ilse, M., Tomczak, J. & Welling, M. Attention-based deep multiple 103. Taher, M. R. H., Haghighi, F., Feng, R., Gotway, M. B. & Liang, J. in
instance learning. In Proc. 35th International Conference on Domain Adaptation and Representation Transfer, and Affordable
Machine Learning (eds Dy, J. & Krause, A.) 2127–2136 (PMLR, 2018). Healthcare and AI for Resource Diverse Global Health (eds
82. Bejnordi, B. E. et al. Diagnostic assessment of deep learning Albarqouni, S. et al.) 3–13 (Springer, 2021).
algorithms for detection of lymph node metastases in women 104. Xie, Q., Luong, M.-T., Hovy, E. & Le, Q. V. Self-training with noisy
with breast cancer. JAMA 318, 2199–2210 (2017). student improves imagenet classification. In Proc. IEEE/CVF
83. Vu, Y. N. T. et al. MedAug: contrastive learning leveraging patient Conference on Computer Vision and Pattern Recognition 10684–
metadata improves representations for chest X-ray interpretation. 10695 (IEEE, 2020).
In Proc. 6th Machine Learning for Healthcare Conference (eds 105. Srinidhi, C. L., Kim, S. W., Chen, F.-D. & Martel, A. L.
Jung, K. et al.) 755–769 (PMLR, 2021). Self-supervised driven consistency training for annotation
84. Chen, X., Fan, H., Girshick, R. & He, K. Improved baselines with efficient histopathology image analysis. Med. Image Anal. 75,
momentum contrastive learning. Preprint at https://arxiv.org/ 102256 (2022).
abs/2003.04297 (2020). 106. Li, Z. et al. Domain generalization for mammography detection
85. Mitrovic, J., McWilliams, B., Walker, J., Buesing, L. & Blundell, C. via multi-style and multi-view contrastive learning. In International
Representation learning via invariant causal mechanisms. Preprint Conference on Medical Image Computing and Computer-Assisted
at https://arxiv.org/abs/2010.07922 (2020). Intervention (eds de Bruijne, M. et al.) 98–108 (Springer, 2021).
86. Zbontar, J., Jing, L., Misra, I., LeCun, Y. & Deny, S. Barlow twins: 107. Sato, J. et al. Anatomy-aware self-supervised learning for anomaly
self-supervised learning via redundancy reduction. In Proc. 38th detection in chest radiographs. Preprint at https://arxiv.org/
International Conference on Machine Learning (eds Meila, M. & abs/2205.04282 (2022).
Zhang, T.) 12310–12320 (PMLR, 2021). 108. Wortsman, M. et al. Robust fine-tuning of zero-shot models.
87. Dunnmon, J. A. et al. Cross-modal data programming enables In Proc. IEEE/CVF Conference on Computer Vision and Pattern
rapid medical machine learning. Patterns 1, 100019 (2020). Recognition 7959–7971 (IEEE, 2022).
88. Campanella, G. et al. Clinical-grade computational pathology 109. Nguyen, T., Raghu, M. & Kornblith, S. Do wide and deep networks
using weakly supervised deep learning on whole slide images. learn the same things? Uncovering how neural network
Nat. Med. 25, 1301–1309 (2019). representations vary with width and depth. Preprint at https://
89. Eyuboglu, S. et al. Multi-task weak supervision enables arxiv.org/abs/2010.15327 (2020).
anatomically-resolved abnormality detection in whole-body 110. Dosovitskiy, A. et al. An image is worth 16x16 words: transformers
FDG-PET/CT. Nat. Commun. 12, 1880 (2021). for image recognition at scale. In International Conference on
90. Bakalo, R., Ben-Ari, R. & Goldberger, J. Classification and Learning Representations (ICLR) (OpenReview, 2021).
detection in mammograms with weak supervision via dual 111. He, K., Zhang, X., Ren, S. & Sun, J. Identity mappings in deep
branch deep neural net. In IEEE 16th International Symposium on residual networks. In European Conference on Computer Vision
Biomedical Imaging (ISBI) 1905–1909 (IEEE, 2019). (eds Leibe, B. et al.) 630–645 (Springer, 2016).

112. Ioffe, S. & Szegedy, C. Batch normalization: accelerating deep 134. Liu, J., Hu, Z., Cui, P., Li, B. & Shen, Z. Heterogeneous risk
network training by reducing internal covariate shift. In Proc. 32nd minimization. In Proc. 38th International Conference on Machine
International Conference on Machine Learning (eds Bach, F. & Blei, Learning (eds Meila, M. & Zhang, T) 6804–6814 (PMLR, 2021).
D.) 448–456 (2015). 135. Robey, A., Pappas, G. J. & Hassani, H. Model-based domain
113. Qiao, S., Wang, H., Liu, C., Shen, W. & Yuille, A. Micro-batch training generalization. Adv. Neural Inf. Process. Syst. 34, 20210–20229
with batch-channel normalization and weight standardization. (2021).
Preprint at https://arxiv.org/abs/1903.10520 (2019). 136. Shen, Z. et al. Towards out-of-distribution generalization: a survey.
114. You, Y., Gitman, I. & Ginsburg, B. Large batch training Preprint at https://arxiv.org/abs/2108.13624 (2021).
of convolutional networks. Preprint at https://arxiv.org/ 137. Wang, J. et al. Generalizing to unseen domains: a survey on
abs/1708.03888 (2017). domain generalization. IEEE Trans. Knowl. Data Eng. (2022).
115. Castro, E., Cardoso, J. S. & Pereira, J. C. Elastic deformations for 138. Zhou, K., Liu, Z., Qiao, Y., Xiang, T. & Loy, C. C. Domain
data augmentation in breast cancer mass detection. In IEEE EMBS generalization: a survey. Preprint at https://arxiv.org/
International Conference on Biomedical and Health Informatics abs/2103.02503 (2021).
(BHI) 230–234 (IEEE, 2018). 139. Locatello, F. et al. Challenging common assumptions in the
116. Ronneberger, O., Fischer, P. & Brox, T. U-net: convolutional unsupervised learning of disentangled representations. In
networks for biomedical image segmentation. In International Proc. 36th International Conference on Machine Learning (eds
Conference on Medical Image Computing and Computer-Assisted Chaudhuri, K. & Salakhutdinov, R.) 4114–4124 (PMLR, 2019).
Intervention (eds Navab, N. et al.) 234–241 (Springer, 2015). 140. Geirhos, R. et al. ImageNet-trained CNNs are biased towards
117. Szegedy, C. et al. Going deeper with convolutions. In Proc. IEEE texture; increasing shape bias improves accuracy and robustness.
Conference on Computer Vision and Pattern Recognition 1–9 (IEEE, Preprint at https://arxiv.org/abs/1811.12231 (2018).
2015). 141. Geirhos, R. et al. Generalisation in humans and deep neural
118. Tripuraneni, N., Jordan, M. I. & Jin, C. On the theory of transfer networks. Adv. Neural Inf. Process. Syst. 31, 7538–7550 (2018).
learning: the importance of task diversity. Adv. Neural Inf. Process. 142. Kim, H. & Mnih, A. Disentangling by factorising. In Proc. 35th
Syst. 33, 7852–7862 (2020). International Conference on Machine Learning (eds Dy, J. &
119. Du, S. S., Hu, W., Kakade, S. M., Lee, J. D. & Lei, Q. Few-shot Krause, A.) 2649–2658 (PMLR, 2018).
learning via learning the representation, provably. Preprint at 143. Yang, M. et al. CausalVAE: disentangled representation learning
https://arxiv.org/abs/2002.09434 (2020). via neural structural causal models. In Proc. IEEE/CVF Conference
120. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. on Computer Vision and Pattern Recognition 9588–9597 (IEEE,
Preprint at https://arxiv.org/abs/1412.6980 (2014). 2021).
121. Loshchilov, I. & Hutter, F. Sgdr: stochastic gradient descent with 144. Leeb, F. et al. Structure by architecture: disentangled
warm restarts. Preprint at https://arxiv.org/abs/1608.03983 representations without regularization. Preprint at https://arxiv.
(2016). org/abs/2006.07796 (2020).
122. Goyal, P. et al. Accurate, large minibatch sgd: training imagenet in 145. Träuble, F. et al. On disentangled representations learned from
1 hour. Preprint at https://arxiv.org/abs/1706.02677 (2017). correlated data. In Proc. 38th International Conference on Machine
123. Bengio, Y., Goodfellow, I. & Courville, A. Deep Learning (MIT Press, Learning (eds Meila, M. & Zhang, T.) 10401–10412 (PMLR, 2021).
2017). 146. Dittadi, A. et al. On the transfer of disentangled representations
124. Wang, M. & Deng, W. Deep visual domain adaptation: a survey. in realistic settings. Preprint at https://arxiv.org/abs/2010.14407
Neurocomputing 312, 135–153 (2018). (2020).
125. Bello, I. et al. Revisiting resnets: improved training and scaling 147. Andreassen, A., Bahri, Y., Neyshabur, B. & Roelofs, R. The evolution
strategies. Adv. Neural Inf. Process. Syst. 34, 22614–22627 (2021). of out-of-distribution robustness throughout fine-tuning. Preprint
126. Varadarajan, A. V. et al. Predicting optical coherence at https://arxiv.org/abs/2106.15831 (2021).
tomography-derived diabetic macular edema grades from fundus 148. Radford, A. et al. Learning transferable visual models from natural
photographs using deep learning. Nat. Commun. 11, 130 (2020). language supervision. In Proc. 38th International Conference on
127. Winkler, J. K. et al. Association between surgical skin markings Machine Learning (eds Meila, M. & Zhang, T.) 8748–8763 (PMLR,
in dermoscopic images and diagnostic performance of a deep 2021).
learning convolutional neural network for melanoma recognition. 149. Taori, R. et al. When robustness doesn’t promote robustness:
JAMA Dermatol. 155, 1135–1141 (2019). synthetic vs. natural distribution shifts on ImageNet. In
128. Seah, J. C. et al. Effect of a comprehensive deep-learning model International Conference on Learning Representations (ICLR)
on the accuracy of chest X-ray interpretation by radiologists: a (2019).
retrospective, multireader multicase study. Lancet Digit. Health 3, 150. Albuquerque, I., Monteiro, J., Darvishi, M., Falk, T. H. & Mitliagkas,
e496–e506 (2021). I. Adversarial Target-Invariant Representation Learning for Domain
129. Haygood, T. M. et al. Timed efficiency of interpretation of digital Generalization (DeepAI, 2020).
and film-screen screening mammograms. AJR Am. J. Roentgenol. 151. Li, Y. et al. Deep domain generalization via conditional invariant
192, 216–220 (2009). adversarial networks. In Proc. European Conference on Computer
130. Jain, A. et al. Development and assessment of an artificial Vision (ECCV) (eds Ferrari, V. et al.) 624–663 (Springer, 2018).
intelligence–based tool for skin condition diagnosis by primary 152. Ganin, Y. & Lempitsky, V. Unsupervised domain adaptation by
care physicians and nurse practitioners in teledermatology backpropagation. In Proc. 32nd International Conference on
practices. JAMA Netw. Open 4, e217249 (2021). Machine Learning (eds Bach, F. & Blei, D.) 1180–1189
131. Pugh, J. A. et al. Screening for diabetic retinopathy: the (JMLR, 2015).
wide-angle retinal camera. Diabetes Care 16, 889–895 (1993). 153. Ganin, Y. et al. Domain-adversarial training of neural networks.
132. Schölkopf, B. et al. Toward causal representation learning. Proc. J. Mach. Learn. Res. 17, 2096–2030 (2016).
IEEE 109, 612–634 (2021). 154. Shao, R., Lan, X., Li, J. & Yuen, P. C. Multi-adversarial discriminative
133. Bengio, Y., Courville, A. & Vincent, P. Representation learning: deep domain generalization for face presentation attack
a review and new perspectives. IEEE Trans. Pattern Anal. Mach. detection. In Proc. IEEE/CVF Conference on Computer Vision and
Intell. 35, 1798–1828 (2013). Pattern Recognition 10015–10023 (IEEE, 2019).

155. Motiian, S., Piccirilli, M., Adjeroh, D. A. & Doretto, G. Unified deep 174. Caron, M. et al. Unsupervised learning of visual features by
supervised domain adaptation and generalization. In Proc. IEEE contrasting cluster assignments. Adv. Neural Inf. Process. Syst. 33,
International Conference on Computer Vision 5716–5726 (IEEE, 2017). 9912–9924 (2020).
156. Muandet, K., Balduzzi, D. & Schölkopf, B. Domain generalization 175. Bai, W. et al. Self-supervised learning for cardiac MR image
via invariant feature representation. In Proc. 30th International segmentation by anatomical position prediction. In International
Conference on Machine Learning (eds Dasgupta, S. & McAllester, Conference on Medical Image Computing and Computer-Assisted
D.) I-10–I-18 (2013). Intervention (eds Shen, D. et al.) 541–549 (Springer, 2019).
157. Menegola, A. et al. Knowledge transfer for melanoma screening 176. Spitzer, H., Kiwitz, K., Amunts, K., Harmeling, S. & Dickscheid, T.
with deep learning. In IEEE 14th International Symposium on Improving cytoarchitectonic segmentation of human
Biomedical Imaging (ISBI) 297–300 (IEEE, 2017). brain areas with self-supervised Siamese networks. In
158. Xie, H. et al. Dual network architecture for few-view CT-trained International Conference on Medical Image Computing and
on ImageNet data and transferred for medical imaging. In Proc. Computer-Assisted Intervention (eds Frangi, A. F. et al.) 663–671
SPIE 11113, Developments in X-Ray Tomography XII (eds Müller, B. & (Springer, 2018).
Wang, G.) 111130V (SPIE, 2019). 177. Zhuang, X. et al. Self-supervised feature learning for 3D medical
159. Alzubaidi, L. et al. Towards a better understanding of transfer images by playing a Rubik’s cube. In International Conference on
learning for medical imaging: a case study. Appl. Sci. 10, 4523 Medical Image Computing and Computer-Assisted Intervention
(2020). (eds Shen, D. et al.) 420–428 (Springer, 2019).
160. Heker, M. & Greenspan, H. Joint liver lesion segmentation and 178. Zhu, J. et al. Rubik’s Cube+: a self-supervised feature learning
classification via transfer learning. Preprint at https://arxiv.org/ framework for 3D medical image analysis. Med. Image Anal. 64,
abs/2004.12352 (2020). 101746 (2020).
161. Chen, S., Ma, K. & Zheng, Y. Med3D: transfer learning for 179. Chaitanya, K., Erdil, E., Karani, N. & Konukoglu, E. Contrastive
3D medical image analysis. Preprint at https://arxiv.org/ learning of global and local features for medical image
abs/1904.00625 (2019). segmentation with limited annotations. Adv. Neural Inf. Process.
162. Liang, G. & Zheng, L. A transfer learning method with deep Syst. 33, 12546–12558 (2020).
residual network for pediatric pneumonia diagnosis. Comput. 180. He, X. et al. Sample-efficient deep learning for COVID-19
Methods Prog. Biomed. 187, 104964 (2020). diagnosis based on CT scans. Adv. Neural Inf. Process. Syst. 33,
163. Geyer, R., Corinzia, L. & Wegmayr, V. Transfer learning by adaptive 12546–12558 (2020).
merging of multiple models. In Proc. 2nd International Conference 181. Li, H. et al. Imbalance-aware self-supervised learning for 3D
on Medical Imaging with Deep Learning (eds Cardoso, M. J. et al.) radiomic representations. In International Conference on Medical
185–196 (PMLR, 2019). Image Computing and Computer-Assisted Intervention (eds de
164. Noroozi, M. & Favaro, P. Unsupervised learning of visual Bruijne, M. et al.) 36–46 (Springer, 2021).
representations by solving jigsaw puzzles. In European 182. Liu, J. et al. Align, attend and locate: chest X-ray diagnosis via
Conference on Computer Vision (eds Leibe, B. et al.) 69–84 contrast induced attention network with limited supervision.
(Springer, 2016). In Proc. IEEE/CVF International Conference on Computer Vision
165. Zhang, R., Isola, P. & Efros, A. A. Colorful image colorization. In 106321–10640 (IEEE, 2019).
European Conference on Computer Vision (eds Leibe, B. et al.) 183. Zhou, H.-Y. et al. Comparing to learn: surpassing ImageNet
649–666 (Springer, 2016). pretraining on radiographs by comparing image representations.
166. Wu, Z., Xiong, Y., Yu, S. X. & Lin, D. Unsupervised feature learning In International Conference on Medical Image Computing and
via non-parametric instance discrimination. In Proc. IEEE/CVF Computer-Assisted Intervention (eds Martel, A. L.) 398–407
Conference on Computer Vision and Pattern Recognition 3733– (Springer, 2020).
3742 (IEEE, 2018). 184. Soni, P. N., Shi, S., Sriram, P. R., Ng, A. Y. & Rajpurkar, P. Contrastive
167. Hénaff, O. J. et al. Data-efficient image recognition with learning of heart and lung sounds for label-efficient diagnosis.
contrastive predictive coding. In Proc. 37th International Patterns 3, 100400 (2021).
Conference on Machine Learning (eds Daumé, H. & Singh, A.) 185. Liu, Q., Yu, L., Luo, L., Dou, Q. & Heng, P. A. Semi-supervised
4182–4192 (PMLR, 2020). medical image classification with relation-driven self-ensembling
168. van den Oord, A., Li, Y. & Vinyals, O. Representation learning model. IEEE Trans. Med. Imaging 39, 3429–3440 (2020).
with contrastive predictive coding. Preprint at https://arxiv.org/ 186. Wang, D., Zhang, Y., Zhang, K. & Wang, L. FocalMix:
abs/1807.03748 (2018). semi-supervised learning for 3D medical image detection. In Proc.
169. Hjelm, R. D. et al. Learning deep representations by mutual IEEE/CVF Conference on Computer Vision and Pattern Recognition
information estimation and maximization. Preprint at https://arxiv. 3950–3959 (IEEE, 2020).
org/abs/1808.06670v5 (2019). 187. Zhang, Y., Jiang, H., Miura, Y., Manning, C. D. & Langlotz, C. P.
170. Ye, M., Zhang, X., Yuen, P. C. & Chang, S.-F. Unsupervised Contrastive learning of medical visual representations from
embedding learning via invariant and spreading instance feature. paired images and text. In Proc. 7th Machine Learning for
In Proc. IEEE/CVF Conference on Computer Vision and Pattern Healthcare Conference (eds Lipton, Z. et al.) 2–25
Recognition 6203–6212 (IEEE, 2019). (PMLR, 2020).
171. Bachman, P., Hjelm, R. D. & Buchwalter, W. Learning 188. Truong, T., Mohammadi, S. & Lenga, M. How transferable are
representations by maximizing mutual information across views. self-supervised features in medical image classification tasks? In
Adv. Neural Inf. Process. Syst. 15535–15545 (2019). Proc. Machine Learning for Health (eds Roy, S. et al.) 54–74 (PMLR,
172. Tian, Y., Krishnan, D. & Isola, P. Contrastive multiview coding. In 2021).
European Conference on Computer Vision (eds Vedaldi, A. et al.)
776–794 (Springer, 2019). Acknowledgements
173. Misra, I. & Maaten, L. V. D. Self-supervised learning of This project was an extensive collaboration between Google Brain
pretext-invariant representations. In Proc. IEEE/CVF Conference and the Google Health AI Team. We thank Z. Ghahramani for valuable
on Computer Vision and Pattern Recognition 6706–6716 (IEEE, feedback and continuous support through the course of the project;
2020). M. Raghu, J. Krause, D. Eck and M. Howell for valuable feedback in

improving the quality of the work; J. Uszkoreit, J. Deaton, V. Godbole, Competing interests
M. Sieniek, S. Prabhakara, D. Golden, D. Steiner, X. Zhai, A. Giurgiu, This study was funded by Google LLC and/or a subsidiary thereof
T. Duerig, C. Semturs, P. Bui, J. Hartford, S. Jansen, S. Shetty, T. Spitz, (‘Google’). J.F., L.C., S.A., V.N., N.H., A.K., M.N., B.M., S.B., P.S., S.S.M.,
D. Tran, J. Luo, O. Wichrowska and A. Ward for support throughout this S.K., T.C., N.T., J.M., B.B., P.B., E.W., P.-H.C.C., Yuan Liu, Yun Liu, S.M.,
project; multiple contributors to this international project: Rajavithi A.L., J.W., M.W., Z.B., A.G.R., U.T., D.R.W., D.F., L.P., G.S.C., J.K. and G.H.
Hospital Thailand, Lions Eye Institute and Derbarl Yerrigan Health are employees of Google and may own stock as part of the standard
Service, Western Australia, Stanford Center for Artificial Intelligence compensation package. M.E. received funding from Google to
in Medicine and Imaging, MIT Laboratory for Computational support the research collaboration.
Physiology and PhysioNet, and NIH Clinical Centre; our collaborators
at DermPath AI, Apollo Hospitals and EyePACS for support of this Additional information
work; collaborators at Northwestern medicine and all members of the Extended data is available for this paper at
Etemadi Research Group for support of this work. https://doi.org/10.1038/s41551-023-01049-7.
The images and data used in this publication were derived from
the Optimam database, the creation of which was funded by Supplementary information The online version contains supplementary
Cancer Research UK. Part of the retinal image dataset was material available at https://doi.org/10.1038/s41551-023-01049-7.
provided for the study by Sankara Nethralaya, Chennai, India.
The results included in this paper are in whole or in part based Correspondence and requests for materials should be addressed to
on data generated by The Cancer Genome Atlas (TCGA) Shekoofeh Azizi, Alan Karthikesalingam or Vivek Natarajan.
managed by the NCI and NHGRI. Information about TCGA can
be found at the NIH website. This study also used archived and Peer review information Nature Biomedical Engineering thanks Pranav
anonymized pathology slides, clinicopathologic variables, and Rajpurkar and the other, anonymous, reviewer(s) for their contribution
outcomes from the Institute of Pathology and the Biobank at the to the peer review of this work.
Medical University of Graz. The study also used pathology slides from
the CAMELYON challenge. Reprints and permissions information is available at
www.nature.com/reprints.
Author contributions
S.A., J.F., L.C., V.N., N.H., A.K., M.N., S.K., T.C., N.T., J.M., B.M., P.S., Publisher’s note Springer Nature remains neutral with regard to
S.S.M., F.R., E.W., P.-H.C.C. and G.H. contributed to the conception and jurisdictional claims in published maps and institutional affiliations.
design of the work. S.A., L.C., J.F., V.N., A.K., B.B., P.B., E.W., P.-H.C.C.,
Yuan Liu, Yun Liu, S.M.M., A.L., J.W., M.W., Z.B., A.G.R., D.R.W., L.P., Springer Nature or its licensor (e.g. a society or other partner) holds
G.S.C., U.T. and J.K. contributed to data acquisition. S.A., L.C., J.F., exclusive rights to this article under a publishing agreement with
S.B., B.M. and V.N. majorly contributed to the evaluation of the work. the author(s) or other rightsholder(s); author self-archiving of the
S.A., L.C., J.F., V.N., N.H., A.K., M.N., S.B., S.K., T.C., B.B., D.R.W., D.F., accepted manuscript version of this article is solely governed by the
G.S.C. and M.E. contributed to analysis and interpretation of the data. terms of such publishing agreement and applicable law.
S.A., L.C., J.F., V.N., N.H., A.K., M.N., S.K., E.W., P.S., S.S.M. and M.E.
contributed to drafting and revising the paper. N.H., A.K., M. N. and © The Author(s), under exclusive licence to Springer Nature Limited
V.N. contributed equally as co-advisers. 2023


Extended Data Fig. 1 | REMEDIS comparison with strong supervised JFT baseline under severe synthetic data shifts. We observe that, under increasing severity of synthetic shifts, the performance of both REMEDIS and the supervised baseline drops. However, the drop is more gradual for REMEDIS.


Extended Data Fig. 2 | Overview of our experimental setup for the development of REMEDIS and of the baseline AI models across the various medical-imaging tasks. The different stages in which unlabeled and labeled data (both ID and OOD) are used for model development and evaluation.


Extended Data Fig. 3 | Visual samples of distribution shifts across the medical-imaging tasks considered in this study. Variation between ID and OOD data can be visually subtle or pronounced. This variation includes (but is not limited to) changes in contrast, sharpness or tint, differences in non-linear effects of X-ray-sensor construction or in zoom levels. The underlying cause of the distribution shift can be associated with technology shift, demographic shift or behavioral shift45.

Corresponding author(s): Shekoofeh Azizi, Alan Karthikesalingam and Vivek Natarajan
Last updated by author(s): Apr 27, 2023

Reporting Summary
Nature Portfolio wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency
in reporting. For further information on Nature Portfolio policies, see our Editorial Policies and the Editorial Policy Checklist.

Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.
- The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
- A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
- The statistical test(s) used AND whether they are one- or two-sided. Only common tests should be described solely by name; describe more complex techniques in the Methods section.
- A description of all covariates tested
- A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
- A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)
- For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted. Give P values as exact values whenever suitable.
- For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
- For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
- Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated
Our web collection on statistics for biologists contains articles on many of the points above.

Software and code


Policy information about availability of computer code
Data collection: No software was used for data collection.

Data analysis: Several major components of the work are available in open-source repositories, such as the TensorFlow library (https://www.tensorflow.org). The code-base and pretrained weights used for self-supervised pre-training are available at the SimCLR GitHub repository (https://github.com/google-research/simclr). The code-base and pretrained weights for the BiT models are available at the Big Transfer GitHub repository (https://github.com/google-research/big_transfer). All experiments and implementation details are described in sufficient detail in Methods and in the Supplementary Information to support replication with non-proprietary libraries. The codebase used for our comparison to ResNet-RS was based on the ResNet-RS GitHub repository (https://github.com/tensorflow/tpu/tree/master/models/official/resnet/resnet_rs). A number of the checkpoints and models generated through REMEDIS are readily accessible to researchers via the PhysioNet open access project (https://physionet.org/projects/N0nzs56nBt1IBM073DOr). Additionally, the Foundation Medical ML repositories on GitHub offer access to code that can be used to train REMEDIS-based models.
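As a purely illustrative sketch of how the publicly released components listed above could be combined, the snippet below loads a pretrained BiT-M backbone from TensorFlow Hub and attaches a small classification head. The TF Hub handle, input size, class count and optimizer settings are assumptions made for illustration and do not reproduce the exact REMEDIS training configuration.

# Illustrative sketch only: fine-tune a publicly released BiT-M backbone with a
# small task head. Handle and hyperparameters are assumptions, not the REMEDIS setup.
import tensorflow as tf
import tensorflow_hub as hub

NUM_CLASSES = 2  # hypothetical binary diagnostic task

# BiT-M ResNet-50x1 feature extractor (one of the publicly listed BiT handles).
backbone = hub.KerasLayer("https://tfhub.dev/google/bit/m-r50x1/1", trainable=True)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),
    backbone,                                      # emits a 2048-d feature vector
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=3e-3, momentum=0.9),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(labeled_train_ds, validation_data=labeled_val_ds, epochs=10)
# `labeled_train_ds` / `labeled_val_ds` are placeholder tf.data pipelines of
# (image, label) pairs resized to 224x224.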
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors and
reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Portfolio guidelines for submitting code & software for further information.

Data
Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A description of any restrictions on data availability
- For clinical datasets or third party data, please ensure that the statement adheres to our policy

The datasets from Northwestern Medicine and Apollo Hospitals were used under a licence for the current study, and are not publicly available. Applications for access to the Optimam database can be made using this web form (https://medphys.royalsurrey.nhs.uk/omidb/getting-access). The de-identified tele-dermatology data used in this study are not publicly available owing to restrictions in the data-sharing agreement. The unlabelled dataset used for DME classification is de-identified data from EyePACS Inc. Interested researchers should contact [email protected] to enquire about access to EyePACS data and approach the Office of Research and Development (https://www.research.va.gov/resources/ORD_Admin/ord_contacts.cfm) to enquire about access to VA data. The rest of the annotated data for the ID and OOD DME classification tasks were collected at the Rajavithi Hospital Thailand and at the Lions Eye Institute and are not publicly available owing to restrictions in the data-sharing agreement. Data used in the evaluation and pre-training of the chest-X-ray-condition classification, including MIMIC-CXR (https://physionet.org/content/mimic-cxr/2.0.0), CheXpert (https://stanfordmlgroup.github.io/competitions/chexpert) and ChestX-ray14 (https://www.kaggle.com/datasets/nih-chest-xrays/data), are publicly available. Data used for the ID fine-tuning and evaluation of the detection of metastases are publicly available on the CAMELYON challenge website (https://camelyon16.grand-challenge.org/Data). The Cancer Genome Atlas (TCGA) data have been used for pre-training for both the pathology-based metastases-detection and survival-prediction tasks, and are available via the NIH website (https://www.cancer.gov/ccg/research/genome-sequencing/tcga). The rest of the data used in pathology tasks are not publicly available, owing to restrictions in the data-sharing agreement. Moreover, ImageNet-1K (ILSVRC2012) [68] has been used for the pre-training of baseline supervised models, and ImageNet-21K has been used for the pretraining of BiT-M models. Both of these are publicly available via the ImageNet website (https://image-net.org/download.php). Please note that ID and OOD here refer to "in distribution" and "out of distribution", respectively.

Research involving human participants, their data, or biological material


Policy information about studies with human participants or human data. See also policy information about sex, gender (identity/presentation),
and sexual orientation and race, ethnicity and racism.
Reporting on sex and gender: Sex subgroup reporting is in Supplementary Fig. 8. 'Sex' (rather than gender) is the right term because the datasets contain information on male/female as categories.

Reporting on race, ethnicity, or other socially relevant groupings: Age and breast-density subgroup reporting is provided in Supplementary Fig. 8. Individual race/ethnicity data points were not consistently available.

Population characteristics: Age, sex and other relevant population information are available in these papers:
Liu, Y. et al. A deep learning system for differential diagnosis of skin diseases. Nature Medicine 26, 900–908 (2020).
McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94 (2020).

Recruitment: Not applicable, as we used retrospective de-identified data.

Ethics oversight: The institutional-review-board waiver for this study on retrospective de-identified data was obtained from Advarra IRB.

Note that full information on the approval of the study protocol must also be provided in the manuscript.

Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.

Life sciences | Behavioural & social sciences | Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Life sciences study design


All studies must disclose on these points even when the disclosure is negative.
Sample size: Test sets for all 7 medical tasks have been previously published, and the sample size has therefore been shown to be sufficient for the estimation of a model's diagnostic accuracy with acceptable uncertainty.

Data exclusions: The datasets used reflect those used in previous papers; no additional data exclusions were applied.

Replication: The code is available to ensure that the work can be replicated externally. Internal tests for replication are a standard engineering practice at Google.

Randomization: Randomization was performed to create the training, test and validation splits of the different datasets used in the study.

Blinding: Blinding was not applicable to the study.



Reporting for specific materials, systems and methods
We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material,
system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response.

Materials & experimental systems
- Antibodies
- Eukaryotic cell lines
- Palaeontology and archaeology
- Animals and other organisms
- Clinical data
- Dual use research of concern
- Plants

Methods
- ChIP-seq
- Flow cytometry
- MRI-based neuroimaging
