
On Pretraining Data Diversity for Self-Supervised Learning

Hasan Abed Al Kader Hammoud* (KAUST)   Tuhin Das* (University of Oxford)   Fabio Pizzati* (University of Oxford)
Philip Torr (University of Oxford)   Adel Bibi (University of Oxford)   Bernard Ghanem (KAUST)

arXiv:2403.13808v2 [cs.CV] 5 Apr 2024

Abstract

We explore the impact of training with more diverse datasets, characterized by the number of unique samples, on the performance of self-supervised learning (SSL) under a fixed computational budget. Our findings consistently demonstrate that increasing pretraining data diversity enhances SSL performance, albeit only when the distribution distance to the downstream data is minimal. Notably, even with an exceptionally large pretraining data diversity achieved through methods like web crawling or diffusion-generated data, among other ways, the distribution shift remains a challenge. Our experiments are comprehensive with seven SSL methods using large-scale datasets such as ImageNet and YFCC100M amounting to over 200 GPU days. The code and trained models will be available at https://github.com/hammoudhasan/DiversitySSL.

Figure 1. Impact of Diversity on Pretraining: Self-supervised learning (SSL) can be used to pretrain vision models on smaller datasets closely aligned to downstream task data, e.g., pets classification, hence with a small distribution shift (top, wild animals pretraining). Conversely, we could pretrain on an extensively varied dataset, with wide distribution differences (outdoor scenes, bottom). We demystify the role of pretraining diversity in SSL under a fixed computational budget, and highlight its effects in relation to the distribution shift.

1. Introduction

Self-supervised learning (SSL) has recently emerged as a new paradigm to pretrain large vision models at scale [26, 27, 46]. Leveraging the ability to learn from unlabelled data, pretraining on millions—or even billions [27, 52]—of images turned from an unachievable goal to a common practice. This exposure to extremely diverse datasets, i.e., composed of a remarkable number of unique samples, granted impressive performance and unprecedented generalization capabilities to a growing number of vision models [46]. Large-scale datasets, in conjunction with substantial computational resources, have been the key driving forces for the success of SSL-based approaches. For instance, SEER [26], pretrained for approximately 11 GPU years on a billion images, exemplifies the massive computation and data resources employed for these models. It has become the implicit norm that increasing computation and data is beneficial, without any detailed analysis of how they separately impact SSL effectiveness. In particular, it is not clear to what extent large datasets are responsible for the impressive generalization capabilities of SSL models. Indeed, consider the example in Figure 1. Assuming a fixed monetary budget, in the form of computational expenses, for pretraining a vision model under the SSL paradigm, does iterating over a large set of images work best given a downstream task, or is it better to train repeatedly on a smaller set of samples visually close to the ones of the downstream task? In this context, we face the ever-lasting problem of distribution shift on vision models [58]. Without a proper understanding of the impact of data, there could be a wrong allocation of efforts in increasing data or computation, leading to inefficiencies in the deployment process of SSL models.

* Equal contribution. Correspondence: [email protected], [email protected] or [email protected]
In this paper, we study the role of the diversity of the SSL pretraining data, under a fixed computational budget, when the pretraining data matches the distribution of the downstream data, as well as when they differ. Our experiments span various SSL methods [4, 8–10, 14, 28, 65] and multiple datasets [5, 18, 37–39, 43, 47, 54, 55, 69], and are tested under several computational budgets amounting to a total of 200 GPU days. We summarize our contributions below:
1. We show that SSL pretraining strategies are currently data-inefficient in compensating for distribution shifts. Under normalized computational costs, we verify that pretraining on large datasets with high diversity cannot outperform pretraining on datasets with limited diversity but a lower distribution shift with respect to the downstream tasks.
2. We conclude that there is a wide margin for improvement in the performance of current SSL methods on out-of-distribution classes, and we propose insights for a fair evaluation. This shows the need for future research on new SSL techniques that leverage data diversity better, to improve generalization capabilities beyond the training data distribution.
3. We propose a novel strategy for a computationally-normalized evaluation of SSL and a carefully designed set of experiments focusing on pretraining data diversity, enabling us to draw our novel conclusions.
Ultimately, our work provides a comprehensive analysis, from experiments worth 200 GPU days, of the interaction between data diversity and computational resources, and their impact on the performance of SSL models, with an aim to improve pretraining practices. Now, we will analyze the relevant literature.

2. Related Work

Self-Supervised Learning Early work in self-supervised learning used simple pretext tasks, such as relative patch prediction [19, 20], image colorization [67], image rotation prediction [24], or solving jigsaw puzzles [44], to train feature extractors in the absence of annotations. Representations learned with those methods are of limited effectiveness, hence more recent literature has moved towards more sophisticated approaches, such as using image augmentations to generate correlated views of a training sample and learning to extract augmentation-invariant representations for these correlated pairs. Among these multi-view methods, many exploit contrastive losses [1, 8, 10, 11, 13, 31, 34, 35, 41, 42, 45], enforcing similarity between views of the same image (positives) and dissimilarity from other images (negatives). Due to the need for many negative samples, contrastive methods often require large batch sizes to work effectively [10, 14]. Cluster-based methods such as SwAV [8], DINO [9], and DeepCluster v2 [6] learn generalizable representations by grouping samples into cluster prototypes. Others exploit prediction of features with siamese networks [12], learn features in a teacher-student fashion [28], or use redundancy reduction techniques [4, 65]. Finally, masked image modeling [3, 32, 63] emerged as a scalable alternative for Vision Transformers [21], learning representations by predicting masked image patches as the SSL task.

Pretraining at Scale SSL is most effective when pretraining is conducted at scale, benefiting from large datasets and great computational resources. Initial attempts at large-scale pretraining were made by combining contrastive learning and clustering [7, 56]. The SEER model [26] was trained on 1 billion internet images using SwAV [8] applied on a RegNet [48] backbone. The importance of model scaling to leverage large-scale datasets is also shown in [25, 27, 64], as well as the need for increasing training duration [64]. Additional strategies are necessary to achieve the best performance at scale [17, 23, 66]. All considered works focus on reaching the best representations, without many considerations about training costs, thus encouraging unfair comparisons about the efficacy of data. A preliminary work [15] found for SimCLR [10] that increasing data from 500 thousand to 1 million images leads to a modest boost in downstream performance, but without limiting the training budget. While large uncurated datasets have been further explored [56], their efficacy in relation to the distribution shift under normalized computation is still limitedly investigated.

Distribution Shift in SSL Some studies have investigated how the pretraining data domain affects downstream performance on other domains. For various pretraining datasets, preliminary works [15, 25, 36, 40] observed that the best performing representations were learned on datasets that were similar to the downstream test task. Additionally, combining datasets before pretraining or combining self-supervised features learned from various datasets did not lead to significant improvements in [15]. In [51, 68], they showcase higher generalization capabilities of SSL models compared to their supervised counterparts, for several downstream tasks in the presence of a distribution shift. In [59], they pretrained on several datasets, and observed different generalization depending on the object-centric or scene-centric appearance of the pretraining dataset. Furthermore, initial considerations on the impact of external data on downstream tasks under distribution shift have been proposed in [60]. Some compensate the distribution shift with the scale of training datasets [29]. Although partial information can be inferred by these works, there is still a lack of a fair, computation-normalized evaluation that allows studying the effects of the distribution shift in a controlled environment.
3. Preliminaries

Pretraining We first outline the general pretraining procedure common to state-of-the-art self-supervised learning methods. Specific differences among these methods are detailed in the supplementary material. The overall pretraining pipeline, common across many SSL approaches [4, 8–10, 28, 65], goes as follows: (1) Data Sampling: from a large-scale, unlabeled upstream pretraining dataset, denoted as DSSL, an image x is randomly sampled; (2) View Generation: two correlated views, x̃_A and x̃_B, are generated from x using two image augmentation functions sampled at random. These random augmentations include random resized cropping, horizontal flipping, blurring, and color adjustments [2], among others; (3) Feature Extraction and Projection: the correlated views undergo feature extraction through a network, fθf, parameterized by θf, such as a ResNet [30] or a ViT [21], leading to representations h_A = fθf(x̃_A) and h_B = fθf(x̃_B). A linear projection head, gθg, parameterized by θg, then maps these representations to a latent space, resulting in z_A = gθg(h_A) and z_B = gθg(h_B); (4) Joint Optimization: the feature extractor fθf and the projection head gθg are jointly optimized according to the following objective:

\theta_f^*, \theta_g^* = \arg\min_{\theta_f, \theta_g} \; \mathbb{E}_{x \sim D_{SSL}} \big[ L_{SSL}(z_A, z_B) \big], \quad (1)

where LSSL is a loss function specific to the chosen SSL pretraining method.

After pretraining, the feature extractor fθf* can be deployed for various downstream tasks such as image classification, object detection, or segmentation. This is typically achieved by training a task-specific head. Alternatively, the feature extractor fθf* can either be fine-tuned or used together with a k-nearest neighbors (kNN) classifier.
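To make the four steps above concrete, the following is a minimal PyTorch-style sketch of a single pretraining iteration. The callables `augment`, `backbone`, `projector`, `ssl_loss`, and `optimizer` are hypothetical placeholders standing in for the method-specific augmentation pipeline, fθf, gθg, LSSL, and the optimizer; they do not correspond to functions in the released codebase.

```python
def pretraining_step(batch, backbone, projector, ssl_loss, augment, optimizer):
    """One SSL pretraining iteration following steps (1)-(4); all arguments
    are assumed, method-specific components (e.g., SimCLR augmentations,
    a ResNet/ViT backbone f, a projection head g, and a loss L_SSL)."""
    # (2) View generation: two random augmentations of the same images.
    x_a, x_b = augment(batch), augment(batch)
    # (3) Feature extraction and projection.
    h_a, h_b = backbone(x_a), backbone(x_b)
    z_a, z_b = projector(h_a), projector(h_b)
    # (4) Joint optimization of backbone and projector, as in Eq. (1).
    loss = ssl_loss(z_a, z_b)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Only the loss changes between the seven methods compared in this paper; the data sampling, view generation, and optimization loop follow this shared structure.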
Quantifying Pretraining Diversity Even under normal-
Linear Probing There are several ways to evaluate the performance of a self-supervised learning method, such as linear probing [10, 14, 28], kNN [8, 9, 62, 71], and few-shot evaluation [22, 26]. Consistent with the general protocol established in the literature [10, 15, 31, 67], we use linear probing to measure the quality of the features extracted for classification tasks. The procedure is as follows: (1) a labeled downstream dataset, Dtask, consisting of image-class pairs (x, y) ∼ Dtask, is selected for evaluation. (2) For each image x, its representation is extracted using the pretrained feature extractor fθf*, after which the linear classification head tθt, parameterized by θt, is applied to obtain tθt(fθf*(x)). (3) The linear head tθt is optimized as follows:

\theta_t^* = \arg\min_{\theta_t} \; \mathbb{E}_{(x,y) \sim D_{task}} \big[ L_{task}(t_{\theta_t}(f_{\theta_f^*}(x)), y) \big]. \quad (2)

Note that only the parameters of the linear head θt are optimized, while the parameters θf* of the feature extractor are kept frozen. The quality of features extracted by fθf* is directly inferred from the classification accuracy achieved on the test set of Dtask, which serves as the primary indicator of the quality of the extracted features.
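A minimal sketch of this protocol is given below, assuming the backbone outputs a flat feature vector; the training schedule (optimizer, epochs, learning rate) shown here is illustrative and not the exact recipe used in the experiments.

```python
import torch
import torch.nn as nn

def linear_probe(backbone, train_loader, num_classes, feat_dim, epochs=90, lr=0.1):
    """Fit a linear head t on frozen features, as in Eq. (2)."""
    backbone.eval()
    for p in backbone.parameters():
        p.requires_grad_(False)          # theta_f* stays frozen
    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(head.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()    # L_task for classification
    for _ in range(epochs):
        for x, y in train_loader:
            with torch.no_grad():
                feats = backbone(x)      # f_theta_f*(x), no gradients
            loss = criterion(head(feats), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```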
4. Normalized Evaluation

We stress how, for a correct understanding of the impact of data diversity, we need to analyze its effects isolating them from the impact of increased computation. To enable this, we introduce (1) a computational budget used to normalize computation across experiments, and (2) a quantification of the data diversity seen by models during pretraining.

Computational Budget Current progress in SSL pretraining simultaneously scales computational budget and dataset size to achieve the best performance [26]. This makes it difficult to assess the reasons behind the success of many SSL algorithms. Do these methods benefit mainly from the large amounts of computation, i.e., running SSL pretraining for large numbers of epochs, or do they benefit from data diversity in larger datasets containing a vast variety of visual features? To perform a disentangled evaluation of these two factors, we first introduce C as a measure of computational budget, which quantifies the total number of images an SSL method is allowed to process during pretraining. This is calculated as C = N · E, where N is the number of unique samples in the pretraining dataset DSSL, hence the data diversity of the dataset, and E is the number of pretraining epochs. Constraining multiple models pretrained with SSL to a fixed computational budget C allows for meaningful comparison, as it is guaranteed that all SSL methods will have processed the same number of pretraining images.

Quantifying Pretraining Diversity Even under normalized computation C, various SSL approaches may be exposed to different data diversity as they are trained with different datasets. For instance, a model trained for E = 1000 epochs on a dataset of size N = 1000 will see less diversity than a model pretrained for E = 1 epoch on a dataset of size N = 10^6, despite both being pretrained under normalized computation, processing the same number of images. Hence, to capture this notion of exposure to diversity while pretraining under normalized computation C, we define a pretraining diversity D, which captures the number of unique samples encountered during training given a fixed C, as D = N/C = 1/E. A model pretrained with larger D is presented with a larger number of unique images during training with fixed C, and hence is exposed to more pretraining data diversity. In the next section, we explore the effects of variations in D on SSL performance under a distribution shift, while keeping C constant.
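The two quantities are tied together by the budget; a small helper makes the bookkeeping explicit (a sketch, with the example values taken from Section 5.1):

```python
def pretraining_schedule(budget_c, num_unique_n):
    """Given a budget C (total images processed) and a dataset of N unique
    samples, return the epochs E = C / N and the diversity D = N / C = 1 / E."""
    epochs = budget_c / num_unique_n
    diversity = num_unique_n / budget_c
    return epochs, diversity

# Example matching Section 5.1: full CIFAR-100 (N = 50e3) under C = 50e6
# gives E = 1000 epochs and D = 1e-3, as reported in Table 1a.
epochs, diversity = pretraining_schedule(50e6, 50e3)
```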
(a) CIFAR-100, C = 50 × 10^6. Linear probing accuracy (↑):
N (×10^3) | D (×10^-3) | SimCLR | B.T. | BYOL | SwAV | VICReg | MoCoV3 | DINO
5 | 0.1 | 37.95 | 44.30 | 41.14 | 14.06 | 44.06 | 12.51 | 15.74
25 | 0.5 | 50.13 | 54.29 | 51.01 | 45.44 | 50.37 | 51.58 | 35.89
50 | 1.0 | 58.59 | 58.42 | 58.28 | 56.37 | 55.83 | 55.61 | 40.93

(b) Tiny ImageNet, C = 50 × 10^6. Linear probing accuracy (↑):
N (×10^3) | D (×10^-3) | SimCLR | B.T. | BYOL | SwAV | VICReg | MoCoV3 | DINO
10 | 0.2 | 36.91 | 40.98 | 35.62 | 34.23 | 37.58 | 36.56 | 34.54
50 | 1.0 | 48.77 | 52.01 | 48.05 | 43.63 | 48.75 | 46.45 | 44.39
100 | 2.0 | 49.83 | 55.60 | 50.29 | 47.48 | 54.10 | 48.58 | 47.32

Table 1. Impact of Data Diversity on CIFAR-100 and Tiny ImageNet SSL Pretraining Performance: We study the effects of diversity on CIFAR-100 (a) and Tiny ImageNet (b) across seven different methods and three data diversity settings for a ResNet-18 pretraining, where for all, Dtask = DSSL. This comparison includes analyzing classification accuracy through linear probing on the training set and evaluation on the test set of the respective datasets. Although performance fluctuates among different methods, a consistent trend is observed: higher data diversity typically leads to the generation of higher quality representations.

5. Fixed Budget SSL: In & Out-of-Distribution

Training Configuration We evaluate seven SSL methods: SimCLR [10], MoCoV3 [14], VICReg [4], DINO [9], BYOL [28], Barlow Twins [65], and SwAV [8]. We use different datasets for both pretraining DSSL and linear probing Dtask in different sections. We use solo-learn [16] as a codebase. For each method, we use the default parameters provided when available; otherwise, we conduct a hyperparameter search to ensure proper convergence.

5.1. Performance on the Same Distribution

We aim to answer a first question: does collecting more samples from the same distribution help SSL pretraining with a fixed budget? We conduct experiments to capture how the pretraining diversity D influences SSL pretraining within a normalized C, focusing on the simplest setting where the upstream and downstream datasets belong to the same distribution, such that DSSL = Dtask. This serves as a fundamental ground for our further experiments.

Setup For pretraining, we use CIFAR-100 [38] and Tiny ImageNet [39] as DSSL, which contain 50 × 10^3 and 100 × 10^3 images, respectively. We set C = 50 × 10^6, chosen such that it allows all methods to converge during pretraining. In Section 6, we study the effect of varying the budget C. We pretrain on subsets of DSSL with different sizes N (10%, 50%, and 100% of DSSL), enabling us to observe the effects of pretraining diversity D on training outcomes, where E is adjusted accordingly to match the budget C. For example, using 100% of CIFAR-100 involves 1000 epochs of training, while 10% and 50% of the dataset lead to 10000 and 2000 epochs, respectively. These subsets of DSSL are created by shuffling the dataset and then selecting the first 10%, 50%, or 100% split. All models in this section use a ResNet-18 [30] backbone pretrained from scratch. For evaluation, we use in-distribution linear probing, i.e., Dtask = DSSL, where performance is measured by classification accuracy on the test set.
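A sketch of this subset construction under a fixed budget follows; the seed and the exact data-ordering files released with the code are assumptions here, not the actual artifacts.

```python
import numpy as np

def budgeted_subsets(num_images, budget_c, fractions=(0.10, 0.50, 1.00), seed=0):
    """Shuffle the dataset indices once, take the first N = f * |D_SSL| samples
    for each fraction f, and set the epochs so that N * E matches the budget C."""
    rng = np.random.default_rng(seed)   # illustrative; released ordering may differ
    order = rng.permutation(num_images)
    subsets = {}
    for f in fractions:
        n = int(f * num_images)
        subsets[f] = {"indices": order[:n], "epochs": int(budget_c // n)}
    return subsets

# Example: CIFAR-100 with C = 50e6 yields 10000 / 2000 / 1000 epochs
# for the 10% / 50% / 100% splits, as described above.
splits = budgeted_subsets(50_000, 50_000_000)
```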
Results The results of linear probing are presented in Tables 1a and 1b for CIFAR-100 and Tiny ImageNet, respectively. While different SSL methods show varying levels of performance, we observe that, for all methods, an increase in the diversity D consistently leads to higher linear probing accuracy, suggesting that including more in-distribution samples in DSSL helps under a normalized C. For example, SimCLR achieves an accuracy of 37.95 when 10% of CIFAR-100 is provided, whereas this performance improves by around 12% when 50% of the dataset is provided, and by another 8% after pretraining on 100% of unique samples. This suggests that SSL methods substantially benefit from having a more diverse pretraining set in computationally budgeted in-distribution evaluation, a fundamental verification that allows us to proceed in our analysis.

Insight (1): When the distributions of the upstream and downstream tasks are the same, i.e., DSSL = Dtask, in a normalized computation setup, increasing pretraining diversity D proves to be an effective strategy for enhancing SSL pretraining performance.

5.2. Increasing Data Diversity

As observed in Section 5.1, if DSSL = Dtask, having a more diverse pretraining set benefits SSL methods under a normalized computation assumption. However, to increase diversity, sampling from the same distribution Dtask to extend DSSL is not always attainable. In fact, pretraining data is often sampled from a distribution different than Dtask. The added samples will then introduce a distribution shift between DSSL and Dtask.
Figure 2. Data Collection Strategies: We analyze strategies for collecting additional data (A), i.e., collecting more source data, crawling the web, or using synthetic images. Using a class prior (top row) simulates In-distribution trainings. We also collect images without class prior (bottom row) to analyze the interactions between diversity and Out-of-distribution classes.

Then, we raise the following question: is increasing pretraining diversity still beneficial when there is a distribution shift between DSSL and Dtask, i.e., DSSL ≠ Dtask? We explore strategies for acquiring new data for increasing D, namely, including existing samples, crawled internet data, and data synthetically generated by diffusion models. To evaluate the effects of the distribution shift in a controlled scenario, we analyze distributions closer to Dtask by using a class prior (in-distribution) and without a class prior (out-of-distribution).

Setup We use ImageNet-100 [55] as Dtask, and we construct multiple DSSL to evaluate the effects of different data collection strategies and the distribution shift. We first introduce a set B composed of 65 × 10^3 images from ImageNet-100 (50% of the dataset), which we use as a baseline for DSSL with minimum diversity. We denote the 100 classes in ImageNet-100 as T100, and sample B uniformly, including images from all classes. Next, we compare with pretraining on more diverse datasets as DSSL, imposing DSSL = B ∪ A, where A includes 65 × 10^3 images sampled with one of three strategies. To highlight the effects of the distribution shift, we include in A either images from In-distribution classes, i.e., selecting images from classes overlapping with T100, or images from Out-of-distribution classes, which do not overlap with T100. This is to study the effects of the distribution shift, since we do not assume access to downstream classes in real scenarios. For the Out-of-distribution samples, we define A as (1) random images sampled from the full ImageNet [18] dataset; (2) images crawled from Flickr, Bing, and DuckDuckGo; or (3) images generated with Stable Diffusion V2.1 [50]. We respectively call these sets A^Out_Source, A^Out_Web, and A^Out_Synthetic. We also define their In-distribution counterparts as A^In_Source, A^In_Web, and A^In_Synthetic, respectively. Note how B ∪ A^In_Source is the full ImageNet-100, coherently with Section 5.1. Figure 2 shows each collection strategy, for which we provide implementation details in the supplementary material. Although many factors (such as the appearance of images) impact the distribution shift, using a class prior imposes that any strategy using it would still result in a closer distribution with respect to the same strategy without class priors [15]. We pretrain a ResNet-18 from scratch, with the same settings as Section 5.1, and C = 50 × 10^6. Note that this results in D = 1.25 × 10^-3 for pretraining on B, and D = 2.5 × 10^-3 for pretraining on B ∪ A, introducing a D difference of a factor of 2.
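The in-/out-of-distribution split can be summarized by the following sketch; the pool format, sizes, and sampling here are illustrative assumptions, not the released collection scripts.

```python
import random

def build_pretraining_sets(baseline_b, candidate_pool, t100_classes, extra=65_000, seed=0):
    """Sketch of the D_SSL = B ∪ A construction. `candidate_pool` is assumed to be
    a list of (image_path, class_name) pairs from one source (ImageNet, web crawl,
    or Stable Diffusion)."""
    rng = random.Random(seed)
    in_dist = [s for s in candidate_pool if s[1] in t100_classes]       # with class prior (In)
    out_dist = [s for s in candidate_pool if s[1] not in t100_classes]  # without class prior (Out)
    a_in = rng.sample(in_dist, extra)
    a_out = rng.sample(out_dist, extra)
    return baseline_b + a_in, baseline_b + a_out   # B ∪ A^In and B ∪ A^Out
```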
Results We report the linear probing accuracies for pretraining on each B ∪ A as bars in Figure 3, showing the C-normalized training on B as a dashed line. Surprisingly, without class priors (Out), including A^Out_Source, A^Out_Web, or A^Out_Synthetic underperforms compared to pretraining on B only. For instance, for SimCLR, while B scores 72.3% accuracy, increasing diversity reduces the accuracy by 1% in all cases. This might appear to conflict with our findings in Section 5.1; however, the inclusion of Out samples leads to DSSL ≠ Dtask, since we sample only classes not included in Dtask. We infer that, with normalized C, increasing D without distribution priors may negatively impact performance. Conversely, when class priors are available (In), increasing pretraining diversity D improves performance compared to B pretraining. For instance, pretraining on A^In_Web performs comparably to augmenting with additional ImageNet samples (A^In_Source), as in the case of SimCLR, where the inclusion of A^In_Web scores only 0.2% lower than A^In_Source. Including A^In_Synthetic data also helps, although more limitedly, due to the visual discrepancies between generated and real images. Ultimately, the effectiveness of diversity is linked to the distribution shift. These findings highlight the impact of the distribution shift on computationally-normalized SSL pretraining and help define evaluation practices for this task (see Section 7).

Insight (2): When the distributions of the upstream and downstream tasks differ, DSSL ≠ Dtask, and in a normalized computation setup, increasing pretraining diversity D may harm SSL pretraining performance, reflecting the adverse effects of distribution shift.
Figure 3. Effect of Various Data Sources on SSL Pretraining: We use a baseline set B (black dashed line), comprising 65 × 10^3 images from ImageNet-100, for pretraining a ResNet-18 with C = 50 × 10^6. Augmenting B with In-distribution images enhances performance (above black line), while Out-of-distribution augmentations reduce it (below black line).

SimCLR (C = 98 × 10^6). Linear probing accuracy (↑):
Backbone | DSSL | N (×10^6) | D (×10^-3) | ImageNet | Cars | Flow. | Pets | Places | Food
ResNet-50 | ImageNet | 0.128 | 1.31 | 56.9 | 43.0 | 82.3 | 73.9 | 45.4 | 59.9
ResNet-50 | ImageNet | 0.256 | 2.61 | 61.1 | 45.5 | 84.0 | 76.0 | 47.0 | 64.4
ResNet-50 | ImageNet | 0.640 | 6.54 | 63.7 | 46.8 | 84.2 | 78.8 | 48.3 | 67.2
ResNet-50 | ImageNet | 1.281 | 13.0 | 64.5 | 46.4 | 84.6 | 79.5 | 48.8 | 68.0
ResNet-50 | YFCC | 98.17 | 1000 | 57.3 | 37.2 | 76.6 | 58.9 | 50.1 | 62.1
ViT-B/16 | ImageNet | 0.128 | 1.31 | 54.2 | 37.2 | 81.8 | 70.3 | 44.6 | 64.3
ViT-B/16 | ImageNet | 0.256 | 2.61 | 61.3 | 39.5 | 83.8 | 77.7 | 47.2 | 69.4
ViT-B/16 | ImageNet | 0.640 | 6.54 | 65.5 | 39.9 | 84.6 | 80.4 | 49.1 | 73.1
ViT-B/16 | ImageNet | 1.281 | 13.0 | 66.7 | 39.5 | 83.7 | 81.7 | 50.0 | 73.6
ViT-B/16 | YFCC | 98.17 | 1000 | 54.5 | 25.0 | 72.5 | 59.7 | 49.5 | 65.4

MoCoV3 (C = 98 × 10^6). Linear probing accuracy (↑):
Backbone | DSSL | N (×10^6) | D (×10^-3) | ImageNet | Cars | Flow. | Pets | Places | Food
ResNet-50 | ImageNet | 0.128 | 1.31 | 58.1 | 40.6 | 81.8 | 76.04 | 45.1 | 63.9
ResNet-50 | ImageNet | 0.256 | 2.61 | 62.9 | 45.3 | 85.0 | 79.8 | 47.6 | 68.7
ResNet-50 | ImageNet | 0.640 | 6.54 | 65.4 | 48.4 | 86.1 | 81.9 | 49.2 | 71.0
ResNet-50 | ImageNet | 1.281 | 13.0 | 65.9 | 48.8 | 86.6 | 82.6 | 49.5 | 71.9
ResNet-50 | YFCC | 98.17 | 1000 | 60.4 | 42.2 | 82.6 | 66.3 | 50.7 | 67.3
ViT-B/16 | ImageNet | 0.128 | 1.31 | 57.9 | 33.4 | 82.8 | 78.0 | 46.7 | 67.8
ViT-B/16 | ImageNet | 0.256 | 2.61 | 63.7 | 35.0 | 85.3 | 82.9 | 48.4 | 71.0
ViT-B/16 | ImageNet | 0.640 | 6.54 | 67.2 | 39.5 | 85.0 | 85.8 | 49.7 | 72.8
ViT-B/16 | ImageNet | 1.281 | 13.0 | 68.8 | 41.9 | 86.5 | 86.5 | 50.3 | 73.8
ViT-B/16 | YFCC | 98.17 | 1000 | 57.2 | 25.2 | 70.3 | 43.8 | 50.3 | 64.0

Table 2. Performance of ImageNet and YFCC100M SSL Pretraining on Various Downstream Tasks: We train ResNet-50 and ViT-B/16 under normalized computation (C = 98 × 10^6) using SimCLR (top) and MoCoV3 (bottom) on ImageNet and YFCC100M (YFCC) with multiple D. Despite the significantly larger D when trained on YFCC100M, these models cannot offset the effects of the distribution shift and are outperformed by models pretrained on ImageNet in the majority of downstream tasks.

5.3. Scaling Pretraining Diversity

We showed that diversity improves pretraining performance when the training set and downstream task share the same data distribution (DSSL = Dtask), as discussed in Section 5.1. This may change when a distribution shift is introduced, as explored in Section 5.2. However, it is still unclear from Section 5.2 whether including a larger number of samples, and thus increasing considerably the pretraining diversity, can compensate for the negative effects of the distribution shift. To address this, the following section presents larger-scale experiments, employing significantly varied D values, aiming to explore the interplay between pretraining diversity and different distributions using millions of samples.

Setup For our large-scale pretraining experiments, we set DSSL to be two datasets of significantly different sizes: ImageNet [18] and YFCC100M [54], comprising 1.28 million and 98 million images, respectively. Following Section 5.1, we explore multiple D values for pretraining. We set C = 98 × 10^6, which corresponds to one epoch on YFCC100M, iterating once through each of its 98 million images to maximize diversity (D = 1). Normalizing C (see Section 5.1), we pretrain on ImageNet for approximately 77 epochs, cumulatively utilizing 98 million samples. Due to the extensive cost of these experiments, we focus on SimCLR and MoCoV3 only. We also employ larger architectures, namely ResNet-50 [30] and ViT-B/16 [21], to leverage the extensive scale of the pretraining datasets. We also use multiple Dtask, including ImageNet [18], Stanford Cars [37], Flowers-102 [43], Oxford-IIIT Pets [47], Places365 [69], and Food-101 [5].

Results The results of our large-scale experiments are detailed in Table 2. Consistently with the findings in Section 5.1, increasing D leads to better pretraining efficacy when DSSL = Dtask. This is evident when ImageNet is used for both pretraining and downstream classification, reinforcing that our observations hold even at a larger scale. Instead, models pretrained on YFCC100M show substantially lower performance compared to those pretrained on ImageNet, despite having much higher D. This highlights the inefficiency of collecting data indiscriminately without considering distribution shifts. To stress this, note how the model pretrained on YFCC100M (D = 1) often performs similarly to those pretrained with drastically lower D on ImageNet (D = 1.31 × 10^-3).
This aligns with our observations in Section 5.2, emphasizing that distribution differences remain a significant factor even when training with large datasets. However, the YFCC100M-pretrained model outperforms the ImageNet-pretrained model on Places365, suggesting a closer distribution relationship between YFCC100M and Places365. We explore this hypothesis further in Section 6, where we analyze distribution distances with existing metrics. Ultimately, our analysis highlights that scaling the data is not an effective solution for compensating for the distribution shift when computational resources are normalized.

Insight (3): Even an extremely large data diversity cannot mitigate the distribution shift under normalized computation. This emphasizes the importance of further research into how effectively SSL pretraining algorithms generalize.

6. Additional Analysis

Previously, we proposed a computationally-normalized evaluation to assess the role of D with and without the distribution shift. We highlighted that, although pretraining diversity helps when DSSL = Dtask, vision models are unable to compensate for the distributional differences under normalized computation. Now, we analyze additional elements that support our previous observations.

VisualDNA (↓):
DSSL | Backbone | ImageNet | Cars | Flow. | Pets | Places | Food
ImageNet | MUGS | 0.00 | 11.49 | 12.46 | 6.09 | 7.19 | 9.08
YFCC | MUGS | 3.72 | 11.57 | 12.71 | 7.93 | 6.20 | 9.72
ImageNet | DINO v2 | 0.00 | 7.05 | 6.95 | 4.43 | 3.52 | 7.26
YFCC | DINO v2 | 2.37 | 7.12 | 7.46 | 5.75 | 2.74 | 7.90

FID (↓):
DSSL | Backbone | ImageNet | Cars | Flow. | Pets | Places | Food
ImageNet | Inception | 0.00 | 143.40 | 192.89 | 88.85 | 64.91 | 114.92
YFCC | Inception | 48.14 | 174.27 | 214.78 | 145.79 | 38.86 | 154.51

Table 3. Comparing VisualDNA and FID Scores Across Datasets: We assess the relationship between VisualDNA [49] and FID score [33] for various large-scale DSSL and several downstream tasks. For VisualDNA, activations are extracted as suggested [49] with MUGS [70] with ViT-B/16 (MUGS) or DINO v2 [46] with ViT-B/16 (DINO v2), while FID activations are obtained via an Inception network [53]. Consistently, across all metrics, the ranking of dataset distances between DSSL and various Dtask aligns with the accuracy ranking in Table 2. Models exhibiting lower VisualDNA/FID scores benefit more from the diversity in pretraining data for SSL.

Distribution Distances Using FID & VisualDNA In Section 5.3, we showed that pretraining on ImageNet typically outperforms YFCC100M on a variety of downstream tasks, with Places365 being the exception. We speculate that the distribution of ImageNet is more aligned with those of the downstream tasks compared to YFCC100M. To verify this, we evaluate the similarity between the datasets using FID [33] and VisualDNA [49], and report results in Table 3. With both FID and VisualDNA, the distribution of ImageNet is always closer to the downstream tasks, except for Places365, where YFCC100M is closer. This aligns with the lower performance of ImageNet on this dataset only (Table 2), further suggesting that the performance drop is caused by the distribution shift.
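For reference, the FID values in Table 3 follow the standard Frechet distance between Gaussian fits of network activations from [33]; a sketch is shown below, assuming the activations have already been extracted (the choice of network and layer follows [33, 49] and is not shown).

```python
import numpy as np
from scipy import linalg

def fid(act1, act2):
    """Frechet distance between two activation sets (rows = samples),
    following Heusel et al. [33]."""
    mu1, mu2 = act1.mean(axis=0), act2.mean(axis=0)
    sigma1 = np.cov(act1, rowvar=False)
    sigma2 = np.cov(act2, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):   # discard small imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```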
of linear probing models pretrained using SimCLR and
Importance of Normalized Computation We now MoCoV3 with ResNet-50 and ViT-B/16 on ImageNet are
study the impact of normalizing computation on the shown in Figure 4. Interestingly, we obtain a performance
performance of SSL methods. We aim to understand if C plateau. This observation points to a saturation point in
is not normalized across methods, this will lead to mis- performance, showing that simply increasing the dataset
leading conclusions, and thus motivating our normalized further would not bridge the gap between the DSSL and
computation. In Table 4, we compare the performance of Dtask . Hence, arbitrarily increasing pretraining diversity D

7
MoCoV3, DSSL = Dtask = ImageNet. Linear probing accuracy (↑) on ImageNet:
Network | DSSL | C (×10^6) | N (×10^6) | D (×10^-3) | Accuracy
ResNet-50 | ImageNet | 98 | 0.128 | 1.31 | 58.1
ResNet-50 | ImageNet | 98 | 0.640 | 6.54 | 65.4
ResNet-50 | ImageNet | 98 | 1.281 | 13.0 | 65.9
ResNet-50 | ImageNet | 294 | 0.128 | 0.43 | 57.4
ResNet-50 | ImageNet | 294 | 0.640 | 2.17 | 69.8
ResNet-50 | ImageNet | 294 | 1.281 | 4.35 | 71.4
ViT-B/16 | ImageNet | 98 | 0.128 | 1.31 | 57.9
ViT-B/16 | ImageNet | 98 | 0.640 | 6.54 | 67.2
ViT-B/16 | ImageNet | 98 | 1.281 | 13.0 | 68.8
ViT-B/16 | ImageNet | 294 | 0.128 | 0.43 | 56.9
ViT-B/16 | ImageNet | 294 | 0.640 | 2.17 | 71.9
ViT-B/16 | ImageNet | 294 | 1.281 | 4.35 | 74.9

Table 4. Importance of Normalization. We report the accuracy of MoCoV3 trained on ImageNet with different data diversity and variable C, for the in-distribution assumption (DSSL = Dtask). For a given network, cells in red highlight inconsistencies where, although the model trained with C = 294 × 10^6 has seen fewer unique samples, it outperforms the best model trained with one third of the computational budget (C = 98 × 10^6), showing the importance of normalization for understanding how models exploit data.

Figure 4. Data Diversity Impact on YFCC100M Pretraining Performance: Pretraining (C = 98 × 10^6) of networks with DSSL = YFCC100M and Dtask = ImageNet for several dataset subsets. In the presence of a distribution shift, performance tends to saturate and does not benefit from additional data diversity.

SimCLR (C = 98 × 10^6). Accuracy (↑) vs. labeled fraction of Dtask:
Network | DSSL | N (×10^6) | D (×10^-3) | 5% | 10% | 50% | 100%
ResNet-50 | ImageNet | 1.28 | 1.31 | 50.3 | 54.2 | 61.7 | 64.5
ResNet-50 | YFCC | 98.2 | 1000 | 39.4 | 44.3 | 54.0 | 57.3
ViT-B/16 | ImageNet | 1.28 | 1.31 | 55.2 | 59.4 | 65.6 | 66.7
ViT-B/16 | YFCC | 98.2 | 1000 | 40.7 | 45.7 | 53.3 | 54.5

MoCoV3 (C = 98 × 10^6). Accuracy (↑) vs. labeled fraction of Dtask:
Network | DSSL | N (×10^6) | D (×10^-3) | 5% | 10% | 50% | 100%
ResNet-50 | ImageNet | 1.28 | 1.31 | 52.1 | 56.2 | 63.8 | 65.9
ResNet-50 | YFCC | 98.2 | 1000 | 42.8 | 48.2 | 57.7 | 60.4
ViT-B/16 | ImageNet | 1.28 | 1.31 | 59.3 | 63.0 | 68.3 | 68.8
ViT-B/16 | YFCC | 98.2 | 1000 | 43.9 | 49.0 | 56.0 | 57.2

Table 5. Evaluating Network Accuracy With Varied Label Quantity: We evaluate the accuracy of networks trained on ImageNet and YFCC with different labeling percentages of Dtask = ImageNet. The increased diversity still does not compensate for the distribution shift. However, for in-distribution data, one can get away with fewer labels with more diverse pretraining data.

Label Quantity A question that arises in our setting is whether having a higher pretraining diversity leads to requiring fewer labeled samples during linear probing. In this section, we focus on understanding the impact of increased D on the number of labeled samples required for effective linear probing in downstream tasks. We use the same trained models from Section 5.3, i.e., SimCLR and MoCoV3, with ResNet-50 and ViT-B/16 architectures. In this setup, we set Dtask = ImageNet, with two upstream dataset scenarios: ImageNet (where DSSL = Dtask) and YFCC100M (where DSSL ≠ Dtask). Our experiments are summarized in Table 5. We note that with ViT-B/16, if ImageNet is used for pretraining, linear probing with just 5% labeled samples can surpass the performance achieved using 100% labeled data when YFCC100M serves as DSSL. This also implies that when DSSL ≠ Dtask, a higher quantity of labeled data is necessary to attain competitive performance, and increasing D does not bring considerable benefits on label quantity under distribution shift. This implies that our findings in Section 5.3 hold regardless of the linear probing labeled set. We note that in scenarios where DSSL = Dtask, using only 50% of the labeled data can achieve similar performance as utilizing the full 100% of labeled samples, implying that increasing D leads to reduced label requirement efforts for downstream tasks.
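A sketch of the label-fraction selection used conceptually in this experiment follows; the class-balanced sampling and seed are assumptions of this sketch rather than a description of the released evaluation code.

```python
import numpy as np

def subsample_labels(labels, fraction, seed=0):
    """Select a class-balanced subset of labeled indices (e.g., 5%, 10%, 50%)
    on which to fit the linear probe; the frozen backbone is unchanged."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    keep = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        keep.extend(idx[: max(1, int(round(fraction * len(idx))))])
    return np.sort(np.asarray(keep))
```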
7. Discussion

Here, we discuss the implications of our findings. We highlight inefficiencies in current strategies, and provide takeaways for good SSL practices.

Main Conclusions Our set of experiments leads to Insight (3) in Section 5.3, revealing that with normalized computational costs, SSL pretrainings with large diversity cannot compensate for the distribution shift. This is surprising, since the variety of information that SSL algorithms could benefit from during training is much higher in large generic datasets than in small ones. Hence, since our evaluation is cost-normalized, (3) also implies that SSL strategies are not efficiently exploiting the pretraining diversity on large datasets for representation extraction.
This inefficiency reflects in a wide margin for improvement in the generalization performance of SSL models, making better use of the computational power involved for training. The role of existing models must also be discussed in this context. Following Insights (1) and (2), respectively in Sections 5.1 and 5.2, we have studied how in-distribution and out-of-distribution data impact performance in a controlled scenario. We argue that this behavior should be taken into account in the evaluation of the performance of SSL models. Indeed, training at scale may enlarge the in-distribution space, including classes of the downstream tasks in the training set. In recent literature, this is a design choice to maximize performance on popular benchmarks [46]. While this allows for achieving impressive results in many tasks, we stress that this does not permit a fair evaluation. Now, we summarize practical takeaways.

Training Takeaways Coherently with our findings in Section 5.1, and similarly to prior art [15, 25, 36], we find that aligned distributions benefit performance; in particular, increasing DSSL diversity helps in learning better feature extractors as long as the distribution of the new samples matches that of the downstream task data. Differently from the state-of-the-art, we demonstrate that this holds also in a computationally-normalized setting, implying that collecting large-scale in-distribution data matching the downstream task could be an effective and efficient approach to improving SSL. Hence, for practical applications, distribution priors should be used, if available. On the contrary, for a fair evaluation of models, this should not be the case, as specified below.

Evaluation Takeaways Our analysis reveals that to permit a fair evaluation of SSL methods, computationally normalized tests are necessary to avoid inconsistencies, as shown in Section 6. Moreover, it is crucial to identify out-of-distribution downstream tasks for a correct evaluation of generalization performance. By evaluating only on Dtask with a low distribution shift, there is a risk of reporting inflated metrics, not representative of a real gain in generalization. This is important, since new SSL approaches may be reporting higher downstream performance when pretrained on a different dataset. We relate this to Sections 5 and 6, where we show that increasing the computation and the in-distribution data, respectively, can improve performance. Ultimately, wrong practices may result in incorrectly concluding that an SSL algorithm is performing better.

Differences With Language Models In Section 5.3 we showed that even very diverse datasets, such as YFCC100M, fall short of satisfactory generalization performance. Beyond paving the way for further exploration into generalization for SSL pretraining, this opens doors to investigating why language models enjoy enhanced generalization when exposed to a wide SSL pretraining diversity compared to vision models [61].

Acknowledgements. Fabio Pizzati is financed by KAUST (Grant DFR07910). The authors thank Csaba Botos, Alasdair Paren, Ameya Phrabu and Francesco Pinto for their feedback.

References

[1] Philip Bachman, R Devon Hjelm, and William Buchwalter. Learning representations by maximizing mutual information across views. In NeurIPS, 2019.
[2] Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari Morcos, Shashank Shekhar, Tom Goldstein, Florian Bordes, Adrien Bardes, Gregoire Mialon, Yuandong Tian, et al. A cookbook of self-supervised learning. arXiv preprint arXiv:2304.12210, 2023.
[3] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. In ICML, 2022.
[4] Adrien Bardes, Jean Ponce, and Yann LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. In ICLR, 2022.
[5] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101: Mining discriminative components with random forests. In ECCV, 2014.
[6] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
[7] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In ICCV, 2019.
[8] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, 2020.
[9] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
[10] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
[11] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey Hinton. Big self-supervised models are strong semi-supervised learners. In NeurIPS, 2020.
[12] Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In CVPR, 2021.
[13] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
[14] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In ICCV, 2021.
[15] Elijah Cole, Xuan Yang, Kimberly Wilber, Oisin Mac Aodha, and Serge Belongie. When does contrastive visual representation learning work? In CVPR, 2022.
[16] Victor Guilherme Turrisi da Costa, Enrico Fini, Moin Nabi, Nicu Sebe, and Elisa Ricci. solo-learn: A library of self-supervised methods for visual representation learning. In JMLR, 2022.
[17] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In ICML, 2023.
[18] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[19] Carl Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. In ICCV, 2017.
[20] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
[21] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[22] Linus Ericsson, Henry Gouk, and Timothy M Hospedales. How well do self-supervised models transfer? In CVPR, 2021.
[23] Enrico Fini, Pietro Astolfi, Adriana Romero-Soriano, Jakob Verbeek, and Michal Drozdzal. Improved baselines for vision-language pre-training. TMLR, 2023.
[24] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. In ICLR, 2018.
[25] Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. In ICCV, 2019.
[26] Priya Goyal, Mathilde Caron, Benjamin Lefaudeux, Min Xu, Pengchao Wang, Vivek Pai, Mannat Singh, Vitaliy Liptchinsky, Ishan Misra, Armand Joulin, et al. Self-supervised pretraining of visual features in the wild. arXiv preprint arXiv:2103.01988, 2021.
[27] Priya Goyal, Quentin Duval, Isaac Seessel, Mathilde Caron, Ishan Misra, Levent Sagun, Armand Joulin, and Piotr Bojanowski. Vision models are more robust and fair when pretrained on uncurated images without supervision. arXiv preprint arXiv:2202.08360, 2022.
[28] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, 2020.
[29] Hasan Abed Al Kader Hammoud, Hani Itani, Fabio Pizzati, Philip Torr, Adel Bibi, and Bernard Ghanem. SynthCLIP: Are we ready for a fully synthetic CLIP training? arXiv preprint arXiv:2402.01832, 2024.
[30] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[31] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
[32] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.
[33] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
[34] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In ICLR, 2018.
[35] Olivier J. Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, S. M. Ali Eslami, and Aaron van den Oord. Data-efficient image recognition with contrastive predictive coding. In ICML, 2019.
[36] Klemen Kotar, Gabriel Ilharco, Ludwig Schmidt, Kiana Ehsani, and Roozbeh Mottaghi. Contrasting contrastive self-supervised representation learning pipelines. In ICCV, 2021.
[37] Jonathan Krause, Jia Deng, Michael Stark, and Li Fei-Fei. Collecting a large-scale dataset of fine-grained cars. In FGVC, 2013.
[38] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[39] Ya Le and Xuan Yang. Tiny ImageNet visual recognition challenge. CS 231N, 2015.
[40] Alexander Cong Li, Ellis Langham Brown, Alexei A Efros, and Deepak Pathak. Internet Explorer: Targeted representation learning on the open web. In ICML, 2023.
[41] Junnan Li, Pan Zhou, Caiming Xiong, and Steven C. H. Hoi. Prototypical contrastive learning of unsupervised representations. In ICLR, 2021.
[42] Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In CVPR, 2020.
[43] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In ICVGIP, 2008.
[44] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
[45] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
[46] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[47] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In CVPR, 2012.
[48] Ilija Radosavovic, Raj Prateek Kosaraju, Ross Girshick, Kaiming He, and Piotr Dollár. Designing network design spaces. In CVPR, 2020.
[49] Benjamin Ramtoula, Matthew Gadd, Paul Newman, and Daniele De Martini. Visual DNA: Representing and comparing images using distributions of neuron activations. In CVPR, 2023.
[50] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[51] Yuge Shi, Imant Daunhawer, Julia E Vogt, Philip Torr, and Amartya Sanyal. How robust is unsupervised representation learning to distribution shift? In ICLR, 2022.
[52] Chen Sun, Abhinav Shrivastava, Saurabh Singh, and Abhinav Gupta. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV, 2017.
[53] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
[54] Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. In Commun. ACM, 2016.
[55] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. In ECCV, 2020.
[56] Yonglong Tian, Olivier J. Henaff, and Aaron Van Den Oord. Divide and contrast: Self-supervised learning from uncurated data. In ICCV, 2021.
[57] Shengbang Tong, Yubei Chen, Yi Ma, and Yann Lecun. EMP-SSL: Towards self-supervised learning in one training epoch. arXiv preprint arXiv:2304.03977, 2023.
[58] Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In CVPR, 2011.
[59] Wouter Van Gansbeke, Simon Vandenhende, Stamatios Georgoulis, and Luc V Gool. Revisiting contrastive methods for unsupervised learning of visual representations. In NeurIPS, 2021.
[60] Grant Van Horn, Elijah Cole, Sara Beery, Kimberly Wilber, Serge Belongie, and Oisin Mac Aodha. Benchmarking representation learning for natural world image collections. In CVPR, 2021.
[61] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. TMLR, 2022.
[62] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018.
[63] Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. SimMIM: A simple framework for masked image modeling. In CVPR, 2022.
[64] Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Yixuan Wei, Qi Dai, and Han Hu. On data scaling in masked image modeling. In CVPR, 2023.
[65] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow Twins: Self-supervised learning via redundancy reduction. In ICML, 2021.
[66] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In CVPR, 2022.
[67] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization. In ECCV, 2016.
[68] Nanxuan Zhao, Zhirong Wu, Rynson W. H. Lau, and Stephen Lin. What makes instance discrimination good for transfer learning? In ICML, 2021.
[69] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. T-PAMI, 2017.
[70] Pan Zhou, Yichen Zhou, Chenyang Si, Weihao Yu, Teck Khim Ng, and Shuicheng Yan. Mugs: A multi-granular self-supervised learning framework. arXiv preprint arXiv:2203.14415, 2022.
[71] Chengxu Zhuang, Alex Lin Zhai, and Daniel Yamins. Local aggregation for unsupervised learning of visual embeddings. In ICCV, 2019.
On Pretraining Data Diversity
for Self-Supervised Learning
Supplementary Material

In this supplementary material, we present additional experiments and insights on the findings presented in the main paper. First, we further develop the inconsistencies found within an incorrectly-normalized framework (Section A). Then, we propose different settings for our analysis at scale (Section B). We extend our analysis on label quantity in Section C. Finally, we introduce additional details about our settings and implementations (Section D). For reproducibility of our results, we will share our codebase on GitHub, along with the fine-tuned parameters and the data ordering files.
For improved readability, all references to the main paper Sections and Tables are in blue (e.g., "Section 1").

A. Importance of Normalization of Computation


In this section, we aim to complement our experiment in Section 6, providing further proof highlighting the importance of having a
normalized computational budget when evaluating the performance of SSL methods. In the following experiments, we show that if
computation is not normalized properly, one might fall into unfair comparisons and misleading conclusions.

A.1. Increasing Total Computation


In our first setup, we pretrain SimCLR [10] and MoCoV3 [14] on Tiny ImageNet with various D, over a range of increasing budgets C. We assume DSSL = Dtask. We take the same subsets of Tiny ImageNet as in Section 5.1, consisting of 10%, 50% and 100% of the training data. We vary C from 5 × 10^6 to 100 × 10^6 and measure the accuracy of the pretrained models on the Dtask test set; the results are shown in Table 6. Note that we refer to models in this section by their data diversity N instead of the pretraining diversity D, as each cell in Table 6 corresponds to a different D. Following our previous experiments, we argue that comparisons between different diversities only hold as long as computation is normalized, i.e., only within a given C. In agreement with our prior findings, the third row with the highest diversity always outperforms the lower-diversity models on in-domain evaluation, for both SimCLR and MoCoV3. However, when comparing across columns, i.e., across different amounts of computation, models pretrained with lower diversity may outperform higher-diversity models. For example, for both SimCLR and MoCoV3, the models pretrained with N = 50 × 10^3 and C = 10 × 10^6 outperform the models with higher data diversity N = 100 × 10^3 but less computation C = 5 × 10^6. Models with lower pretraining diversity can thus still outperform models with higher diversity, given that more computation is used. This highlights the importance of normalizing computation costs when evaluating the effects of diversity.
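The mapping between budget and schedule is straightforward: since C counts the total number of images processed during pretraining, a dataset with N unique samples is simply repeated for C/N epochs. A minimal sketch using the values of Table 6 (the helper name is illustrative, not from our codebase):

```python
# Minimal sketch: converting a fixed computational budget C (total images seen)
# into an epoch count for a dataset with N unique samples.
def num_epochs(budget_c: float, num_samples: int) -> float:
    return budget_c / num_samples

# Tiny ImageNet subsets and budgets from Table 6.
for n in (10_000, 50_000, 100_000):           # N = 10, 50, 100 (x10^3)
    for c in (5e6, 10e6, 25e6, 50e6, 100e6):  # C = 5 ... 100 (x10^6)
        print(f"N={n:>7,}  C={int(c):>11,}  ->  {num_epochs(c, n):6.0f} epochs")
```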

A.2. Epoch-based normalization


In Section 5.2, we adhered to a fixed computational budget of C = 50 × 10^6, pretraining models on DSSL = B for 800 epochs and on B ∪ A for 400 epochs, since the latter dataset is twice the size. We further demonstrate the importance of a compute-normalized evaluation by exposing the inconsistencies of an alternative epoch-based normalization, in which networks are trained for 400 epochs regardless of the dataset size.
We propose this alternative scenario in Figure 5, where the compute-normalized baseline (the black dashed line in Figure 3) is replaced with an epoch-normalized baseline (indicated by the red dashed line), obtained by pretraining for 400 epochs on B. Here, we observe that augmenting with additional samples consistently enhances performance, irrespective of the data augmentation technique used and whether the sample labels are in or out of distribution. This finding does not align with the insights from Section 5.2, and it ignores the difference in cost between training models for the same number of epochs on a dataset twice the size. Hence, this constitutes an unfair comparison that may lead to incorrect conclusions, advocating for the effectiveness of our compute-based normalization.
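To make the mismatch concrete, the sketch below compares the number of images processed under the two normalizations, assuming the dataset sizes of Section D.2 (65,000 samples for B, and twice that for B ∪ A):

```python
# Minimal sketch: total images processed under epoch-based vs. compute-based
# normalization (dataset sizes assumed from Section D.2).
def samples_seen(epochs: int, dataset_size: int) -> int:
    return epochs * dataset_size

base_run = samples_seen(400, 65_000)        # 400 epochs on B
augmented_run = samples_seen(400, 130_000)  # 400 epochs on B u A
print(base_run, augmented_run)              # the augmented run processes twice the
                                            # images, so it is not compute-normalized
```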

SimCLR, in-domain accuracy (%):

  N (×10^3)   C = 5   C = 10   C = 25   C = 50   C = 100   (C in ×10^6)
  10          36.92   36.63    36.30    36.91    35.03
  50          40.76   44.30    47.69    48.77    48.91
  100         41.43   44.76    49.32    49.83    51.62

MoCoV3, in-domain accuracy (%):

  N (×10^3)   C = 5   C = 10   C = 25   C = 50   C = 100   (C in ×10^6)
  10          39.78   41.82    40.06    36.56    28.92
  50          39.88   43.42    46.68    46.45    48.14
  100         40.35   44.03    47.63    48.58    50.71

Table 6. Pretraining Diversity With Increasing Computational Budget: We show for both SimCLR (top) and MoCoV3 (bottom) that increasing pretraining diversity always leads to better in-domain downstream accuracies, given that computation is normalized, i.e., comparisons hold within the columns of the tables. Comparing models across different columns may lead to inconsistencies, where lower diversity models with more computation obtain higher results than higher diversity models with less computation.

Figure 5. Impact of Epoch Normalization on SSL Pretraining Performance: This figure contrasts an epoch-normalized baseline (red line) with the methods trained in the main paper, Figure 3. Under epoch normalization we observe contrasting findings: more diverse pretraining data, irrespective of its origin (source, web, or synthetic) and label distribution (in- or out-of-distribution), consistently enhances performance. This is an unfair comparison, due to the greater cost of each augmented pretraining when epochs are normalized, and it illustrates how an alternative normalization can lead to wrong conclusions compared to compute normalization. The DINO B epoch-normalized baseline is reported as text only (Acc. 41.14) for ease of visualization.

Barlow Twins (C = 98 × 10^6), linear evaluation accuracy↑ (%):

  DSSL       N (×10^6)   D (×10^-3)   ImageNet   Cars    Flow.   Pets    Places   Food
  ImageNet   0.128       1.31         57.17      51.51   85.84   75.81   45.87    63.72
  ImageNet   1.281       13.0         65.40      59.56   89.16   83.89   49.19    70.42
  YFCC       98.17       1000         57.85      44.76   83.46   67.21   50.01    65.20

BYOL (C = 98 × 10^6), linear evaluation accuracy↑ (%):

  DSSL       N (×10^6)   D (×10^-3)   ImageNet   Cars    Flow.   Pets    Places   Food
  ImageNet   0.128       1.31         61.82      46.62   85.84   80.28   46.91    67.01
  ImageNet   1.281       13.0         68.39      52.51   88.77   84.75   49.96    73.52
  YFCC       98.17       1000         60.73      42.78   84.38   68.63   50.92    68.63

Table 7. Non-contrastive pretraining: We explore two further pretraining methods, Barlow Twins and BYOL, in our large-scale pretraining experiments, with the budget again set to C = 98 × 10^6. Our earlier conclusions still hold: (1) ImageNet pretraining outperforms YFCC100M (YFCC) pretraining on all downstream datasets except Places365, due to the distribution shift; (2) increased pretraining diversity D generally correlates with improved downstream performance. These findings are consistent for both Barlow Twins and BYOL.

B. Alternative Settings
B.1. Non-Contrastive Methods
For the large-scale experiments in Section 5.3 we only considered SimCLR [10] and MoCoV3 [14], both of which are contrastive SSL methods. Here, we show that the results are consistent for the non-contrastive methods Barlow Twins [65] and BYOL [28]. Since these experiments are computationally intensive, we explore a reduced setting with a single backbone and fewer values of D. We pretrained a ResNet-50 backbone using Barlow Twins and BYOL on ImageNet and the same subsets as in Section 5.3, as well as on the full YFCC100M dataset, ensuring that the total compute equals a single epoch on YFCC100M, i.e., C = 98 × 10^6. As before, we report linear evaluation on multiple downstream datasets, including ImageNet [18], Stanford Cars [37], Flowers-102 [43], Oxford-IIIT Pets [47], Places365 [69], and Food-101 [5], in Table 7. In accordance with Section 5.3, we observe that pretraining with higher diversity leads to improved downstream accuracy when DSSL = Dtask, i.e., when pretraining and evaluating on ImageNet. Moreover, the highest ImageNet pretraining diversity yields the best results on all downstream datasets except Places365, where pretraining on YFCC100M performs best; we refer again to the distribution distance analysis in Section 6. For these methods as well, the maximum diversity of YFCC100M is not enough to overcome the domain gap between the pretraining data and the downstream datasets other than Places365.

MoCoV3, in-domain accuracy (%):

  DSSL       C (×10^6)   N (×10^6)   D (×10^-3)   ResNet-50   ResNet-101   ViT-S/16   ViT-B/16
  ImageNet   98          0.128       1.31         58.1        58.9         56.3       57.9
  ImageNet   98          0.640       6.54         65.4        67.2         64.7       67.2
  ImageNet   98          1.281       13.0         65.9        67.7         65.4       68.8
  ImageNet   294         0.128       0.43         57.5        59.0         52.9       56.9
  ImageNet   294         0.640       2.17         69.8        71.4         68.9       71.9
  ImageNet   294         1.281       4.35         71.4        73.3         71.4       74.9

Table 8. Pretraining Diversity With Different Architecture Sizes: We investigate how pretraining diversity, total computational budget and model architecture size interact for MoCoV3 when pretraining and evaluating on ImageNet. Regardless of C and the architecture choice, increasing pretraining diversity remains a reliable method to improve downstream results. Further, increasing model size also seems to consistently lead to better learned representations. Again, comparing pretraining diversity values only holds when the model architecture and C are fixed.

B.2. Different Architectures


We investigate how pretraining diversity interacts with the backbone architecture size and the total computational budget, aiming to highlight how different models react to pretraining diversity. To benchmark the interaction of these factors, we focus on MoCoV3 and pretrain and evaluate on ImageNet using C = 98 × 10^6 and the tripled budget C = 294 × 10^6. We use two architecture sizes for both ViT and ResNet backbones: ViT-Small/16 paired with ViT-Base/16, and ResNet-50 paired with ResNet-101.
Results are shown in Table 8. The first observation is that, for any combination of architecture size and total computation, the model pretrained with the largest pretraining diversity available at that budget (the full N = 1.281 × 10^6 samples) always achieves the highest in-domain downstream performance. Increasing pretraining diversity thus remains a reliable way to improve the quality of learned representations, regardless of the architecture size. Secondly, for every diversity value, regardless of the backbone type or the amount of computation, increasing the backbone size, i.e., ResNet-50 to ResNet-101 or ViT-S/16 to ViT-B/16, leads to improved performance. It is thus again important to compare models with different pretraining diversities only at a fixed model size, just as we do at a fixed computational budget.
Finally, keep in mind that larger architectures require more computation per image, which is not captured by C, as this term only counts the number of images seen during pretraining.

B.2.1 A note on MAE

Masked Autoencoders (MAE) [32] is a Transformer-specific pretraining method based on a masked reconstruction loss. This differs considerably from the framework presented in Section 3, and it has a significant impact on the components needed for our normalized analysis. Indeed, with a budget of C = 98 × 10^6, MAE is far from providing optimal performance [32], making comparisons unfair without incurring unsustainable costs. Moreover, the reconstruction task used for supervision extracts features that require a deep decoder for best linear probing performance [32], and MAE achieves considerably better performance only with full fine-tuning. We will analyze the effects of pretraining diversity on MAE in future work.

B.3. Convergence insights


The convergence of the models trained on YFCC100M and ImageNet, which leads to our Insight 3, deserves further discussion (see main paper, Section 5.3). One may argue that although C = 98 × 10^6 maximizes pretraining diversity on YFCC100M, it may not be enough for the trained models to fully converge. First, we highlight that relevant literature sets similar training budgets of C = 100 × 10^6 as a requirement for drawing reliable conclusions [10]. Secondly, bringing both the ImageNet- and YFCC100M-pretrained models to convergence would inevitably result in different computational budgets, preventing a fair evaluation. Alternatively, increasing the computational budget until both settings converge would lead to overfitting of the model trained on ImageNet. This could produce misleading results, since the overfitting-related loss of performance could lead to wrong conclusions about the impact of the distribution shift. Instead, our setup guarantees a reliable evaluation by preventing overfitting while training long enough to extract reasonable representations. Moreover, we point to relevant literature highlighting the importance of single-epoch training for representation extractors [57].

SimCLR, in-domain accuracy (%) per label fraction:

  Network     DSSL       N (×10^6)   D (×10^-3)   5%     10%    50%    100%
  ResNet-50   ImageNet   0.128       1.31         42.1   45.4   53.2   56.9
  ResNet-50   ImageNet   0.256       2.61         46.8   50.3   57.8   61.1
  ResNet-50   ImageNet   0.640       6.54         49.1   53.0   60.6   63.7
  ResNet-50   ImageNet   1.281       13.0         50.3   54.2   61.7   64.5
  ViT-B/16    ImageNet   0.128       1.31         40.5   44.8   53.0   54.2
  ViT-B/16    ImageNet   0.256       2.61         48.0   52.2   59.6   61.3
  ViT-B/16    ImageNet   0.640       6.54         53.4   57.6   64.3   65.5
  ViT-B/16    ImageNet   1.281       13.0         55.2   59.4   65.6   66.7

MoCoV3, in-domain accuracy (%) per label fraction:

  Network     DSSL       N (×10^6)   D (×10^-3)   5%     10%    50%    100%
  ResNet-50   ImageNet   0.128       1.31         43.9   47.4   55.2   58.1
  ResNet-50   ImageNet   0.256       2.61         49.0   52.7   60.4   62.9
  ResNet-50   ImageNet   0.640       6.54         51.3   55.4   63.1   65.4
  ResNet-50   ImageNet   1.281       13.0         52.1   56.2   63.8   65.9
  ViT-B/16    ImageNet   0.128       1.31         47.2   51.3   58.0   57.9
  ViT-B/16    ImageNet   0.256       2.61         53.5   57.6   63.1   63.7
  ViT-B/16    ImageNet   0.640       6.54         57.9   61.7   66.7   67.2
  ViT-B/16    ImageNet   1.281       13.0         59.3   63.0   68.3   68.8

Table 9. Evaluating Network Accuracy With Varied Label Quantity: We evaluate networks pretrained on ImageNet with various pretraining diversities, using different labeling percentages of Dtask = ImageNet for linear evaluation. For in-distribution data, one can get away with fewer labels when using more diverse pretraining data.

C. Additional Insights on Label Quantity


In Section 6 we considered how pretraining diversity affects the number of labels necessary for the best downstream ImageNet accuracy when pretraining on ImageNet (DSSL = Dtask) or on YFCC100M (DSSL ≠ Dtask). Here, we explore the setting where upstream and downstream data are the same, i.e., DSSL = Dtask, and repeat the experiment with models pretrained on ImageNet with various diversities. Table 9 shows the in-domain results on ImageNet for SimCLR and MoCoV3 pretrained with ResNet-50 and ViT-B/16 backbones. For in-domain evaluation, the models pretrained with the largest pretraining diversity always perform best, regardless of the label quantity used for linear evaluation. More interestingly, it is possible to achieve better performance with fewer labels if a model is pretrained with higher D. For example, for every combination of backbone and SSL method, the models pretrained with maximum diversity D = 13.0 × 10^-3 using 50% of the labels outperform the models pretrained with D = 2.61 × 10^-3 using 100% of the labels. Thus, if models are evaluated or deployed in few-shot downstream tasks, it may be desirable to use the models pretrained with the highest pretraining diversity available.
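For completeness, a minimal sketch of a label-restricted linear evaluation is given below. The actual evaluation uses the solo-learn linear probe; the scikit-learn classifier here is only an illustrative stand-in, and all names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Minimal sketch: linear evaluation on frozen features using only a fraction of
# the training labels (illustrative stand-in for the solo-learn linear probe).
def linear_eval(train_feats, train_labels, test_feats, test_labels,
                label_fraction=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n = len(train_labels)
    idx = rng.choice(n, size=int(label_fraction * n), replace=False)
    clf = LogisticRegression(max_iter=1000).fit(train_feats[idx], train_labels[idx])
    return clf.score(test_feats, test_labels)  # top-1 accuracy on the test set
```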

D. Additional details
D.1. SSL Methods
In Section 3 we described a general framework for self-supervised pretraining that is common to many state-of-the-art SSL methods. Although all the methods used in our experiments largely follow this procedure, they differ in their loss functions as well as in certain architectural choices. For each method, we describe the key aspects that define it and distinguish it from the framework introduced in Section 3. Further details can be found in the respective papers and repositories.
SimCLR [10], Barlow Twins [65] and VICReg [4] closely follow the general framework and mainly differ in the loss function LSSL used during optimization. SimCLR uses the InfoNCE loss [45], which is applied to the representations of each positive pair of samples and also incorporates negatives from the current batch. Barlow Twins uses a loss that pushes the cross-correlation matrix between the representations of the two distorted views as close to the identity matrix as possible; as a result, representations of correlated views of the same sample are forced to become similar, while redundancy between the components of these vectors is minimized. The VICReg loss combines the mean-squared Euclidean distance between representations with additional variance and covariance terms for feature decorrelation and to avoid representation collapse.
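As an illustration of the redundancy-reduction objective described above, a minimal sketch of the Barlow Twins loss follows; the weight lambd is an assumed default, not necessarily the value used in our runs.

```python
import torch

# Minimal sketch of the Barlow Twins objective: push the cross-correlation
# matrix between the two views' (batch-standardized) representations towards
# the identity (assumed redundancy-reduction weight lambd).
def barlow_twins_loss(z_a: torch.Tensor, z_b: torch.Tensor, lambd: float = 5e-3):
    n = z_a.size(0)
    z_a = (z_a - z_a.mean(0)) / z_a.std(0)          # standardize along the batch
    z_b = (z_b - z_b.mean(0)) / z_b.std(0)
    c = (z_a.T @ z_b) / n                           # d x d cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()  # pull diagonal towards 1
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # decorrelate the rest
    return on_diag + lambd * off_diag
```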
BYOL [28], DINO [9] and MoCoV3 [14] deviate more visibly from the proposed framework, as they do not share the parameters θf and θg of the feature extractor and projection head between the two augmented views. Instead, the two augmented views pass through two different networks: a student network with feature extractor fθf and projection head gθg, parameterised by θf and θg, and a teacher network with its own components fθf′ and gθg′, with separate parameters θf′ and θg′. The teacher weights θf′ and θg′ are an exponential moving average of the student weights θf and θg. The three methods differ in how they compute LSSL from the student representation zA and the teacher representation zB. In BYOL, after the correlated views are passed through the two networks, an additional predictor network qθq, parameterised by θq, predicts the teacher representation zB from the student output as qθq(zA), and the mean squared error between the teacher representation and the student prediction is minimised. DINO performs knowledge distillation into the student by minimising the cross-entropy loss between the outputs zA and zB. MoCoV3 uses the student and teacher networks to generate representations of the augmented views called queries zA and keys zB. The contrastive InfoNCE loss is again used as the SSL objective, with matching queries and keys as positive pairs and the remaining keys as negatives. For all three methods, a stop-gradient operator is applied after the teacher network, to avoid any weight updates to the teacher during backpropagation.
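A minimal sketch of the student-teacher mechanism shared by these methods, i.e., the EMA update with a stop-gradient on the teacher, is given below; the momentum value is an assumed default, not necessarily the one used in our runs.

```python
import copy
import torch
import torch.nn as nn

# Minimal sketch: EMA teacher update shared by BYOL, DINO and MoCoV3.
# Gradients never flow through the teacher (assumed momentum m = 0.996).
@torch.no_grad()
def update_teacher(student: nn.Module, teacher: nn.Module, m: float = 0.996) -> None:
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.data.mul_(m).add_((1.0 - m) * p_s.data)

def make_teacher(student: nn.Module) -> nn.Module:
    """The teacher starts as a frozen copy of the student (stop-gradient)."""
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher
```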

SwAV [8] does share the weights of fθf and gθg between correlated views, but relies on additional components. First, the representations zA and zB of the different views are assigned to prototype vectors, resulting in codes qA and qB. The prototypes are trainable vectors representing the notion of clusters. A swapped prediction problem is then solved, in which the code of one augmented view is predicted from the other view. This defines the loss as LSSL(zA, zB) = ℓ(zA, qB) + ℓ(zB, qA), where ℓ measures the fit between features and codes.
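A minimal sketch of the swapped prediction term ℓ follows; the temperature is an assumed default, and the Sinkhorn-Knopp assignment that produces the codes is omitted.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of l(z, q): the code q of one view supervises the softmax over
# prototypes of the other view's features (assumed temperature 0.1; the
# Sinkhorn-Knopp step that computes the codes is omitted).
def swapped_prediction(z: torch.Tensor, prototypes: torch.Tensor,
                       q_other: torch.Tensor, temp: float = 0.1) -> torch.Tensor:
    p = F.log_softmax(z @ prototypes.T / temp, dim=1)
    return -(q_other * p).sum(dim=1).mean()

# L_SSL(zA, zB) = swapped_prediction(zA, prototypes, qB)
#               + swapped_prediction(zB, prototypes, qA)
```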

D.2. Data Collection Strategies


This section outlines the data collection strategies for the three approaches detailed in Section 5.2: Source, Web, and Synthetic. All strategies build on the Base dataset B (introduced in Section 5.2), consisting of half of ImageNet100 and totaling 65,000 samples.
Source Dataset. We form A^In_Source by integrating the remaining half of ImageNet100 into B. For A^Out_Source, we begin by selecting 100 random, non-overlapping classes from ImageNet, gather 65,000 corresponding samples from these classes, and add them to B.
Web Dataset. We use three search engines, Flickr, Bing, and DuckDuckGo, to gather web samples, with SafeSearch enabled for content appropriateness. Our queries, based on class names, are carefully crafted to avoid ambiguity. For A^In_Web, we collect approximately 100,000 samples from the ImageNet100 classes and select the top 65,000 to add to B. For A^Out_Web, we follow the same process for the 100 randomly selected classes from the Source dataset.
Synthetic Dataset. For synthetic sample generation, we employ Stable Diffusion V2.1 (SDV2.1). Using the prompt “A photo of a <class name>”, we generate images for each class in ImageNet100 for A^In_Synthetic, and for the 100 distinct ImageNet classes for A^Out_Synthetic. Each class contributes 650 images, totaling 65,000 samples. We use the DPMSolver++ scheduler with start and end β values of 0.00085 and 0.012, respectively. The guidance scale is set to w = 7.5, and each image is generated with 50 denoising steps.
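A minimal sketch of this generation step using the diffusers library is shown below; the model identifier, output path, and example class name are illustrative assumptions rather than our exact pipeline configuration.

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

# Minimal sketch of the synthetic sample generation described above.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
# Swap in the DPMSolver++ scheduler, overriding the beta range as reported above.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, beta_start=0.00085, beta_end=0.012
)

class_name = "goldfish"  # hypothetical example; iterate over the 100 class names
image = pipe(
    f"A photo of a {class_name}",
    guidance_scale=7.5,        # w = 7.5
    num_inference_steps=50,    # 50 denoising steps per image
).images[0]
image.save(f"{class_name}_0000.png")
```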

D.3. Details on Distribution Distances


In Section 6 of our study, we explored the relationship between the pretraining datasets, ImageNet and YFCC100M, and the downstream datasets, which include ImageNet, Stanford Cars, Flowers102, Oxford-IIIT Pets, Places365, and Food101. To quantitatively measure the distance between these datasets, we employed two distinct metrics: VisualDNA (VDNA) and the Fréchet Inception Distance (FID).
To compute the distribution distances, we selected 50,000 samples from each dataset; if a dataset contains fewer than 50,000 samples, the whole dataset is used. This ensures a robust and comprehensive comparison of the dataset distributions. For VisualDNA, we used two different architectures: MUGS ViT-B/16, as recommended by the original paper [49], and DinoV2 ViT-B/16. The FID scores were computed with the standard Inception-network approach, as detailed in [33].
The results, discussed in Section 6, are consistent across both VDNA and FID: a greater distance between the upstream and downstream datasets correlates with a decrease in downstream classification accuracy.
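As an illustration, a minimal sketch of the FID side of this computation is given below, using torchmetrics as a stand-in for the standard Inception-based implementation of [33]; the function and loader names are hypothetical.

```python
from torchmetrics.image.fid import FrechetInceptionDistance

# Minimal sketch: FID between two datasets, each provided as a DataLoader
# yielding uint8 image batches of shape (B, 3, H, W) in [0, 255].
def dataset_fid(loader_a, loader_b) -> float:
    fid = FrechetInceptionDistance(feature=2048)
    for batch in loader_a:       # e.g. 50,000 samples of the upstream dataset
        fid.update(batch, real=True)
    for batch in loader_b:       # e.g. 50,000 samples of the downstream dataset
        fid.update(batch, real=False)
    return fid.compute().item()
```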

E. Implementation
In all our experiments, we use the solo-learn library [16] as our main codebase. For the ImageNet100 and CIFAR100 experiments, we used the parameters provided by solo-learn. For ImageNet, we started from the parameters provided in the original papers and made slight modifications to optimize performance, namely changes to the number of warmup epochs and an adjustment of the learning rate. For the YFCC100M dataset, we found the parameters optimized for ImageNet to be the most effective, whereas for Tiny ImageNet we used the CIFAR100 parameters provided by solo-learn.
To create different fractions of each dataset, we first build an H5 file containing all image paths, which is shuffled once and saved. When a specific percentage of the data is required for SSL pretraining, we simply select the first k% of the image paths from this H5 file. Since we use a fixed computational budget, we scale the number of epochs accordingly, by a factor of 100/k. For example, if we use 10% of the data for pretraining, we increase the base number of epochs by a factor of 10.
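A minimal sketch of this procedure is given below; the file and dataset names are illustrative, not the exact ones in our codebase.

```python
import h5py
import numpy as np

# Minimal sketch: build a shuffled index of image paths once, then read back
# the first k% of it and scale the epoch count to keep the budget fixed.
def build_index(image_paths, out_file="dataset_paths.h5", seed=42):
    paths = np.array(image_paths, dtype=object)
    np.random.default_rng(seed).shuffle(paths)
    with h5py.File(out_file, "w") as f:
        f.create_dataset("paths", data=paths, dtype=h5py.string_dtype())

def load_subset(h5_file, percent, base_epochs):
    """Return the first `percent`% of paths and the compute-normalized epoch count."""
    with h5py.File(h5_file, "r") as f:
        paths = [p.decode() for p in f["paths"][:]]
    k = int(len(paths) * percent / 100)
    scaled_epochs = int(base_epochs * 100 / percent)  # e.g. 10% of data -> 10x epochs
    return paths[:k], scaled_epochs
```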
