
Scaling Down Deep Learning with MNIST-1D

Sam Greydanus 1,2   Dmitry Kobak 3,4

1 Oregon State University, USA; 2 The ML Collective; 3 University of Tübingen, Germany; 4 Heidelberg University, Germany. Correspondence to: Sam Greydanus <[email protected]>.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

arXiv:2011.14439v5 [cs.LG] 3 Jun 2024

Abstract

Although deep learning models have taken on commercial and political relevance, key aspects of their training and operation remain poorly understood. This has sparked interest in science of deep learning projects, many of which require large amounts of time, money, and electricity. But how much of this research really needs to occur at scale? In this paper, we introduce MNIST-1D: a minimalist, procedurally generated, low-memory, and low-compute alternative to classic deep learning benchmarks. Although the dimensionality of MNIST-1D is only 40 and its default training set size only 4000, MNIST-1D can be used to study inductive biases of different deep architectures, find lottery tickets, observe deep double descent, metalearn an activation function, and demonstrate guillotine regularization in self-supervised learning. All these experiments can be conducted on a GPU or often even on a CPU within minutes, allowing for fast prototyping, educational use cases, and cutting-edge research on a low budget.

Figure 1: Constructing the MNIST-1D dataset (panels: original MNIST examples; represent digits as 1D patterns; pad, translate & transform). Unlike MNIST, each sample is a one-dimensional sequence. To generate each sample, we begin with a hand-crafted digit template loosely inspired by MNIST shapes. Then we randomly pad, translate, and add noise to produce 1D sequences with 40 points each. [CODE]

1. Introduction

The deep learning analogue of Drosophila melanogaster is the MNIST dataset. Drosophila, the fruit fly, has a life cycle that is just a few days long, its nutritional needs are negligible, and it is easier to work with than mammals, especially humans. Like Drosophila, MNIST is easy to use: training a classifier on it takes only a few minutes, whereas training full-size vision and language models can take months of time and millions of dollars (Sharir et al., 2020).

But in spite of their small size, both test systems have had a major impact on their respective fields. A number of seminal discoveries in medicine, including multiple Nobel prizes, have been awarded for work performed with fruit flies. Early work in genetics — work that paved the way for the Human Genome Project, which involved billions of dollars of funding, dozens of institutions, and over a decade of accelerated research (Lander et al., 2001) — was performed on fruit flies and other simple organisms. To this day, experiments on Drosophila are a cornerstone of biomedical research.

Meanwhile, MNIST has served as the initial proving ground for a large number of deep learning innovations including dropout, Adam, convolutional networks, generative adversarial networks, and variational autoencoders (Srivastava et al., 2014; Kingma and Ba, 2014; LeCun et al., 1989; Goodfellow et al., 2014; Kingma and Welling, 2014). Once a proof of concept was established in small-scale experiments, researchers were able to justify the time and resources needed for larger and more impactful applications.

However, despite its historical significance, MNIST has three notable shortcomings. First, it is too simple. Linear classifiers, fully-connected networks, and convolutional models all perform similarly well, obtaining above 90% accuracy (Table 1). This makes it hard to measure the contribution of a CNN's spatial priors or to judge the relative effectiveness of different regularization schemes.


Table 1: Test accuracies (in %) of common classifiers on the MNIST and MNIST-1D datasets. Most classifiers achieve
similar test accuracy on MNIST. By contrast, the MNIST-1D dataset is capable of separating different models based on
their inductive biases. The drop in CNN and GRU performance when using shuffled features indicates that spatial priors
are important on this dataset. All models except logistic regression achieve 100% accuracy on the training set. Standard
deviation is computed over three runs. [CODE]

Dataset               Logistic regression   MLP      CNN      GRU      Human expert

MNIST                 94 ± 0.5              > 99     > 99     > 99     > 99
MNIST-1D              32 ± 1                68 ± 2   94 ± 2   91 ± 2   96 ± 1
MNIST-1D (shuffled)   32 ± 1                68 ± 2   56 ± 2   57 ± 2   ∼ 30 ± 10

Second, it is too large. Each sample in MNIST is a 28 × 28 image, resulting in 784 input dimensions. Together with its sample size n = 70 000, this requires an unnecessarily large amount of computation to perform a hyperparameter search or debug a metalearning loop. Third, it is hard to hack. MNIST is a fixed dataset and it is difficult to increase the sample size or to change the noise distribution. The ideal toy dataset should be procedurally generated to allow researchers to vary its parameters at will.

In order to address these shortcomings, we propose the MNIST-1D dataset (Figure 1). It is a minimalist, low-memory, and low-compute alternative to MNIST, designed for exploratory deep learning research where rapid prototyping and short latency are a priority. MNIST-1D has 40 dimensions, many fewer than MNIST's 784 or CIFAR's 3,072. The sample size can be arbitrarily large, but the frozen default dataset contains 4000 training and 1000 test samples, many fewer than the 70,000 in MNIST and 60,000 in CIFAR-10/100. Although our dataset is procedurally generated, its samples are intuitive enough for a human expert to match or even outperform a CNN.

MNIST-1D does a much better job than the original MNIST at differentiating between model architectures: a linear classifier can only achieve 32% accuracy (Table 1), while a CNN reaches 94%. Below we show that MNIST-1D can be used to study phenomena ranging from deep double descent to self-supervised learning. Crucially, the experiments we present in this paper take only a few minutes to run on a single GPU (in some cases just a CPU) whereas they would require multiple GPU hours or even GPU days when using MNIST or CIFAR. This makes MNIST-1D valuable as a playground for quick initial experiments and invaluable for researchers without access to powerful GPUs.

All our experiments are in Jupyter notebooks and are available at https://github.com/greydanus/mnist1d, with direct links from figure captions. We provide a mnist1d package that can be installed via pip install mnist1d.

2. Related work

There are a number of small-scale datasets that are commonly used to investigate science of deep learning questions. We have already alluded to MNIST (LeCun et al., 1998), CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009). The CIFAR datasets consist of 32 × 32 colored natural images and have sample sizes similar to MNIST. They are better at discriminating between MLP and CNN architectures and also between different types of CNNs: for example, vanilla CNNs versus ResNets (He et al., 2016). The FashionMNIST dataset (Xiao et al., 2017) has the same size as MNIST and is somewhat more difficult. It aims to rectify some of the most serious problems with MNIST: in particular, that MNIST is too easy and thus all neural network models attain roughly the same test accuracy. None of these datasets is substantially smaller than MNIST and this hampers their use in fast-paced exploratory research or compute-heavy applications such as metalearning.

There are very few datasets smaller than MNIST that are of interest for deep learning research. Toy datasets provided by Scikit-learn (Pedregosa et al., 2011), such as the two moons dataset, can be useful for studying clustering or training very simple classifiers, but are not sufficiently complex for deep learning investigations. Indeed, these datasets are just 2D point clouds, devoid of spatial or temporal correlations between features and lacking manifold structures that a deep nonlinear classifier could use to escape the curse of dimensionality (Bellman and Kalaba, 1959).

To the best of our knowledge, the MNIST-1D dataset is unique in that it is over two orders of magnitude smaller than MNIST but can be used just as effectively — and in a number of important cases, more effectively — for studying fundamental deep learning questions. This may be why MNIST-1D was used as a teaching tool in the recent Understanding Deep Learning textbook (Prince, 2023)¹.

MNIST-1D bears philosophical similarities to the Synthetic Petri Dish by Rawal et al. (2020). The authors make similar references to biology in order to motivate the use of small synthetic datasets for exploratory research. Their work differs from ours in that they use metalearning to obtain their datasets whereas we construct ours by hand. In doing so, we are able to control various causal factors such as the amount of noise, translation, and padding. Our dataset is more intuitive to humans: an experienced human can outperform a strong CNN on the MNIST-1D classification task. This is not possible on the Synthetic Petri Dish dataset.

¹The initial preprint of this manuscript was released on arXiv in November 2020.


Table 2: Default parameters for MNIST-1D generation.

Parameter               Value
Train/test split        4000/1000
Template length         12
Padding points          36–60
Max. translation        48
Gaussian filter width   2
Gaussian noise scale    0.25
White noise scale       0.02
Shear scale             0.75
Final seq. length       40
Random seed             42

3. The MNIST-1D dataset

Dimensionality. Our first design choice was to use one-dimensional time series instead of two-dimensional grayscale images or three-dimensional tensors corresponding to colored images. Our rationale was that one-dimensional signals require far less computation to train on but can be designed to have many of the same biases, distortions, and distribution shifts that are of interest to researchers studying fundamental deep learning questions.

Constructing the dataset. We began with ten one-dimensional template patterns which resemble the digits 0–9 when plotted as in Figure 1. Each of these templates consisted of 12 hand-crafted x coordinates. Next we padded the end of each sequence with 36–60 additional zero values, did a random circular shift by up to 48 indices, applied a random scaling, added Gaussian noise, and added a constant linear signal. We used Gaussian smoothing with σ = 2 to induce spatial correlations. Finally, we downsampled the sequences to 40 data points that play the role of pixels in the resulting MNIST-1D (Figure 1). Table 2 gives the values of all the default hyperparameters used in these transformations.
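To make these transformations concrete, the following numpy sketch walks through the same steps (pad, circular shift, scale, noise, linear trend, smooth, downsample). It is an illustration only: the function name make_example, the toy template, and the scaling and trend ranges are ours, and the exact shear and noise handling in the mnist1d package differs.

import numpy as np
from scipy.ndimage import gaussian_filter1d

def make_example(template, pad_points=48, max_shift=48, noise_scale=0.25,
                 filter_width=2, final_len=40, rng=np.random.default_rng()):
    # Illustrative sketch of the MNIST-1D generative transforms; the real
    # implementation in the mnist1d package differs in details.
    x = np.concatenate([template, np.zeros(pad_points)])      # pad with zeros
    x = np.roll(x, rng.integers(0, max_shift))                 # random circular shift
    x = x * rng.uniform(0.7, 1.3)                              # random scaling (assumed range)
    x = x + noise_scale * rng.normal(size=len(x))              # Gaussian noise
    x = x + rng.uniform(-1, 1) * np.linspace(-1, 1, len(x))    # constant linear signal
    x = gaussian_filter1d(x, sigma=filter_width)               # smooth to induce spatial correlations
    idx = np.linspace(0, len(x) - 1, final_len).astype(int)    # downsample to 40 points
    return x[idx]

template = np.linspace(0, 1, 12)     # stand-in for one of the ten hand-crafted templates
print(make_example(template).shape)  # (40,)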
Implementation. Our goal was to make the code as simple, modular, and extensible as possible. The code for generating the dataset occupies two Python files and fits in a total of 150 lines. The get_dataset method has a simple API for changing dataset features such as maximum digit translation, correlated noise scale, shear scale, final sequence length, and more (Table 2). The following code snippet shows how to install the mnist1d package, choose a custom number of samples, and generate a dataset:

# install the package from PyPI
# pip install mnist1d

from mnist1d.data import make_dataset
from mnist1d.data import get_dataset_args

args = get_dataset_args()  # default params
args.num_samples = 10_000
data = make_dataset(args)
x, y = data["x"], data["y"]

The frozen dataset with 4000 + 1000 samples can be found on GitHub as mnist1d_data.pkl.

Figure 2: Train and test accuracy of common classification models on MNIST-1D. The logistic regression model fares worse than the MLP. Meanwhile, the MLP fares worse than the CNN and GRU, which use translation invariance and local connectivity to bias optimization towards solutions that generalize well. When local spatial correlations are destroyed by shuffling feature indices (dashed lines), the MLP performs the best. CPU runtime: ∼10 minutes. [CODE]

4. Classification

We used PyTorch to implement and train simple logistic, MLP (fully-connected), CNN (with 1D convolutions), and GRU (gated recurrent unit) models. We used the Adam optimizer and early stopping for model selection and evaluation. We obtained 32% accuracy with logistic regression, 68% using an MLP, 91% using a GRU, and 94% using a CNN (Table 1). Even though the test performance was markedly different between MLP, GRU, and CNN, all of them easily achieved 100% accuracy on the training set (Figure 2). While for the MLP this is a manifestation of overfitting, for the CNN this is an example of benign overfitting.
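For reference, a minimal PyTorch training sketch for this setup is shown below. It trains only the MLP baseline, uses Adam with hyperparameters of our choosing, omits early stopping, and performs a simple split of the generated samples rather than using the frozen 4000/1000 split; adapt as needed.

import torch
import torch.nn as nn
from mnist1d.data import make_dataset, get_dataset_args

data = make_dataset(get_dataset_args())                    # default MNIST-1D parameters
x = torch.tensor(data["x"], dtype=torch.float32)
y = torch.tensor(data["y"], dtype=torch.long)
n_train = int(0.8 * len(x))                                # simple split for illustration
x_tr, y_tr, x_te, y_te = x[:n_train], y[:n_train], x[n_train:], y[n_train:]

model = nn.Sequential(                                     # small MLP baseline
    nn.Linear(40, 100), nn.ReLU(),
    nn.Linear(100, 100), nn.ReLU(),
    nn.Linear(100, 10),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):
    idx = torch.randint(0, len(x_tr), (128,))              # random mini-batch
    loss = loss_fn(model(x_tr[idx]), y_tr[idx])
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    acc = (model(x_te).argmax(1) == y_te).float().mean()
print(f"test accuracy: {acc:.3f}")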


For comparison, we also report the accuracy of a human expert (one of the authors) trained on the training set and evaluated (one-shot) on the test set. His accuracy was 96%. The purpose of this comparison was to show that MNIST-1D is a task that is as intuitive for humans as it is for machine learning models with spatial priors. This suggests that the models are not achieving high performance by exploiting some unintuitive statistical artifacts. Rather, they are using the relative position of various features associated with each 1-D digit. Interestingly, the CNN and the human expert had similar per-digit error rates (Figure S1).

Shuffling sanity check. We also trained the same models on a version of the dataset which was permuted along the spatial dimension. This 'shuffled' version measured each of the models' performances in the absence of local spatial structure (Zhang et al., 2017; Li et al., 2018). The test accuracy of CNNs and GRUs decreased by about 35 percentage points after shuffling whereas the MLP and logistic models performed about the same (Table 1, Figure 2). This makes sense, as the former two models have spatial and temporal locality priors whereas the latter two do not.
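The shuffled control is straightforward to reproduce: draw one random permutation of the 40 feature indices and apply it to every train and test sample, so that local spatial structure is destroyed while the information content is unchanged. A minimal sketch (the placeholder arrays stand in for the real splits loaded as in Section 3):

import numpy as np

rng = np.random.default_rng(0)
perm = rng.permutation(40)               # one fixed permutation of the 40 positions

x_train = rng.normal(size=(4000, 40))    # placeholders; substitute the real MNIST-1D splits
x_test = rng.normal(size=(1000, 40))

x_train_shuffled = x_train[:, perm]      # same permutation applied to every training sample
x_test_shuffled = x_test[:, perm]        # ...and to every test sample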
Dimensionality reduction. We used t-SNE (Van der Maaten and Hinton, 2008) to visualize MNIST and MNIST-1D in two dimensions. We observed ten well-defined clusters in the MNIST dataset, suggesting that the classes are separable with a kNN classifier in pixel space (Figure 3a). In contrast, there were few well-defined clusters in the MNIST-1D visualization, suggesting that the nearest neighbors in pixel space are not semantically meaningful (Figure 3b). This is well known to be the case for natural image datasets such as CIFAR-10/100, and is therefore a benefit of MNIST-1D, making it more interesting.

Figure 3: Visualizing the MNIST and MNIST-1D datasets with t-SNE (panel a: MNIST; panel b: MNIST-1D). The well-defined clusters in the MNIST embedding indicate that the classes are separable via a simple kNN classifier in pixel space. The MNIST-1D plot reveals little structure and a lack of clusters, indicating that nearest neighbors in pixel space are not semantically meaningful, as is the case with natural image datasets. [CODE]
5. Science of deep learning with MNIST-1D

In this section we show how MNIST-1D can be used to explore empirical science of deep learning topics.

5.1. Lottery tickets and spatial inductive biases

It is not unusual for deep learning models to have many times more parameters than necessary to perfectly fit the training set (Prince, 2023). This overparameterization helps training but increases computational overhead. One solution is to progressively prune weights from a model during training so that the final network is just a fraction of its original size. Although this approach works, conventional wisdom holds that sparse networks do not train well from scratch. Recent work by Frankle and Carbin (2019) challenges this conventional wisdom. The authors report finding sparse subnetworks inside of larger networks that can be trained in isolation to equivalent or even higher accuracies. These lottery ticket subnetworks can be found through a simple iterative procedure: train a network, prune the smallest weights, reset the remaining weights to their original values at initialization, and then retrain and repeat the process until the desired sparsity threshold is reached.
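A compact sketch of this iterative magnitude-pruning loop is given below. It is our own simplified rendering of the procedure (a single global magnitude threshold, a fixed number of rounds, biases left dense, and the supervised training loop elided), not the exact code used to produce Figure 4.

import copy
import torch
import torch.nn as nn

def weights_of(model):
    # mask only the weight matrices; biases stay dense in this toy version
    return [p for name, p in model.named_parameters() if name.endswith("weight")]

def prune(model, masks, frac=0.2):
    # zero out the smallest `frac` of the currently surviving weights (global threshold)
    alive = torch.cat([p.abs().flatten()[m.flatten() > 0]
                       for p, m in zip(weights_of(model), masks)])
    cutoff = torch.quantile(alive, frac)
    return [((p.abs() > cutoff) & (m > 0)).float() for p, m in zip(weights_of(model), masks)]

def train(model, masks):
    # placeholder: standard supervised training on MNIST-1D goes here,
    # re-applying the masks after every optimizer step so pruned weights stay zero
    with torch.no_grad():
        for p, m in zip(weights_of(model), masks):
            p.mul_(m)

model = nn.Sequential(nn.Linear(40, 100), nn.ReLU(), nn.Linear(100, 10))
init_state = copy.deepcopy(model.state_dict())             # remember the initialization
masks = [torch.ones_like(p) for p in weights_of(model)]

for _ in range(10):                                        # ten rounds of 20% pruning ≈ 89% sparsity
    train(model, masks)
    masks = prune(model, masks)
    model.load_state_dict(init_state)                      # rewind surviving weights to their initial values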
Since the original paper was published, many works have sought to explain this phenomenon and then harness it on larger datasets and models. However, very few works have attempted to isolate a minimal working example of this effect so as to investigate it more carefully. We were able to demonstrate the existence of lottery tickets in an MLP classifier trained on MNIST-1D (Figure 4a–b). Lottery ticket subnetworks that we found performed better than random subnetworks with the same level of sparsity. Remarkably, even at high (>95%) rates of sparsity, the lottery tickets we found performed better than the original dense network.

The asymptotic performance of lottery tickets with 92% sparsity was around 70% (Figure 4c). When we reversed all the 1D patterns in the dataset, effectively preserving the spatial structure but changing the actual locations of all features (analogous to flipping an image upside down), the original lottery tickets continued to perform at around 70% accuracy (Figure 4d). This suggests that the lottery tickets did not overfit to the original dataset; instead, something about their connectivity and initial weights gave them an inherent advantage over random sparse networks. This reproduces the findings of Morcos et al. (2019), which showed that lottery tickets can transfer between datasets.

Next, we asked whether spatial inductive biases were a factor in the high performance of the lottery tickets we had found. To answer this question, we trained the same tickets on a feature-shuffled version of the MNIST-1D dataset.


Figure 4: Finding and analyzing lottery tickets. (a–b) The test loss and test accuracy of lottery tickets at different levels of
sparsity, compared to randomly selected subnetworks and to the original dense network. (c) Performance of lottery tickets
with 92% sparsity. (d) Performance of the same lottery tickets when trained on flipped data. (e) Performance of the same
lottery tickets when trained on data with shuffled features. (f) Performance of the same lottery tickets but with randomly
initialized weights, when trained on original data. (g) Lottery tickets had more adjacent non-zero weights in the first layer
compared to random subnetworks. Runtime: ∼30 minutes. [CODE]

In other words, we permuted the feature indices in order to remove any spatial structure from the data. Shuffling greatly reduced the performance of the lottery tickets: they performed appreciably worse — worse, in fact, than the original dense network (Figure 4e). This suggests that part of the lottery tickets' performance can be attributed to a spatial inductive bias in their sparse connectivity structure.

Furthermore, on the original (non-shuffled) MNIST-1D, when we froze the sparsity patterns of lottery tickets but initialized them with different random weights, they still continued to outperform the original dense network (Figure 4f). This suggests that not the weight values but rather the sparsity patterns represent the spatial inductive bias of lottery tickets. We verified this hypothesis by measuring how often non-zero weights in a lottery ticket were adjacent to each other in the first layer of the model. The lottery tickets had more adjacent weights than expected by chance (Figure 4g), implying a bias towards local connectivity. See Figure S2 for a visualization of the actual sparsity patterns of several lottery tickets.
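The adjacency statistic is easy to compute once the first-layer mask is in hand: for a binary mask of shape (hidden units, 40), count how often two neighbouring input positions are both unmasked for the same hidden unit. A sketch under these assumptions (the exact counting used for Figure 4g may differ):

import numpy as np

def adjacent_pairs(mask):
    # mask: binary array of shape (hidden_units, 40); counts pairs of
    # neighbouring input positions that are both unmasked for the same unit
    return int(np.sum(mask[:, 1:] * mask[:, :-1]))

rng = np.random.default_rng(0)
lottery_mask = rng.random((100, 40)) < 0.08   # placeholders at 92% sparsity;
random_mask = rng.random((100, 40)) < 0.08    # substitute the masks found by pruning
print(adjacent_pairs(lottery_mask), adjacent_pairs(random_mask))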
The original lottery ticket paper (Frankle and Carbin, 2019), as well as some of the follow-up studies (Morcos et al., 2019), required a large number of GPUs and multiple days of runtime. By contrast, all the experiments we presented here took around 30 minutes to complete on a single GPU.


5.2. Deep double descent

An intriguing property of neural networks is the double descent phenomenon. This phrase refers to a training regime where more data, more model parameters, or more gradient descent steps can reduce the test accuracy before it increases again (Trunk, 1979; Belkin et al., 2019; Geiger et al., 2019; Nakkiran et al., 2020). This happens around the so-called interpolation threshold where the learning procedure, consisting of a model and an optimization algorithm, is just barely able to fit the entire training set. At this threshold there is effectively just a single model that can fit the data and this model is very sensitive to label noise and model mis-specification, resulting in overfitting and poor test performance. In contrast, larger models tend to exhibit benign overfitting wherein SGD selects a smooth model out of the many possible models fitting the training set (implicit regularization).

Figure 5: Deep double descent in MNIST-1D classification (panels: a, MLP without label noise; b, MLP with label noise; c, CNN with label noise; classification error in % versus hidden layer size or number of channels). Here the test set had 12 000 samples. (a) MLP classifier with one hidden layer. (b) MLP classifier; 15% label noise. (c) CNN classifier with three convolutional layers; 15% label noise. Adapted with permission from Prince (2023, Section 8.4). CPU runtime: ∼60 minutes. [CODE]

Despite the above intuition, many aspects of double descent, such as what factors affect its width and location, are not well understood. We argue that MNIST-1D is well suited for exploring these questions. We observed double descent when training an MLP classifier on MNIST-1D, varying the size of the single hidden layer. In the presence of 15% label noise, the test error peaked at the interpolation threshold (training error reaching zero), at around 50 neurons in the hidden layer (Figure 5b). Further increasing the model size led to the test error dropping again. Without label noise, the test error did not peak (Figure 5a). We observed qualitatively similar behavior using the CNN architecture (Figure 5c). The runtime of this experiment was ∼60 minutes on a CPU.
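A minimal sketch of this sweep is shown below: corrupt 15% of the training labels, train a one-hidden-layer MLP long enough to (nearly) interpolate at every width, and record the test error. The widths, optimizer settings, and helper names are our choices rather than the exact configuration behind Figure 5, and the placeholder tensors should be replaced by real MNIST-1D data.

import torch
import torch.nn as nn

def corrupt_labels(y, frac=0.15, n_classes=10, seed=0):
    # replace a random 15% of labels with uniformly random classes
    g = torch.Generator().manual_seed(seed)
    y = y.clone()
    idx = torch.rand(len(y), generator=g) < frac
    y[idx] = torch.randint(0, n_classes, (int(idx.sum()),), generator=g)
    return y

def test_error(width, x_tr, y_tr, x_te, y_te, steps=5000):
    model = nn.Sequential(nn.Linear(40, width), nn.ReLU(), nn.Linear(width, 10))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):                                  # full-batch training until (near) interpolation
        loss = nn.functional.cross_entropy(model(x_tr), y_tr)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        return (model(x_te).argmax(1) != y_te).float().mean().item()

x_tr, y_tr = torch.randn(4000, 40), torch.randint(0, 10, (4000,))   # placeholders
x_te, y_te = torch.randn(1000, 40), torch.randint(0, 10, (1000,))
y_tr_noisy = corrupt_labels(y_tr)

errors = {w: test_error(w, x_tr, y_tr_noisy, x_te, y_te)
          for w in [5, 10, 25, 50, 100, 200, 300]}
print(errors)   # with real MNIST-1D data, expect a peak near ~50 hidden units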
5.3. Gradient-based metalearning

The goal of metalearning is to learn how to learn. This can be implemented by having two levels of optimization: a fast inner optimization loop which corresponds to a traditional learning objective and a slow outer loop which updates some meta properties of the learning process. One of the simplest examples of metalearning is gradient-based hyperparameter optimization. This concept was proposed in Bengio (2000) and then scaled to deep learning models by Maclaurin et al. (2015). The basic idea is to implement a fully differentiable training loop and then backpropagate through the entire process in order to optimize hyperparameters such as the learning rate or the weight decay.

Metalearning is a promising line of research but it is very difficult to scale. Metalearning algorithms can consume enormous amounts of time and compute due to their nested optimization, and tend to grow complex because most deep learning frameworks are not well suited for them. This places an especially high incentive on developing and debugging metalearning algorithms on small-scale datasets such as MNIST-1D.

Figure 6: Metalearning the learning rate of SGD optimization of an MLP classifier on MNIST-1D (panels: outer objective, i.e. mean inner loss, and metalearned learning rate, both versus outer training step, for init_lr=1.5 and init_lr=0.15). The outer training converges to the optimal learning rate of 0.62 regardless of whether the initial learning rate is too high or too low. Runtime: ∼1 minute. [CODE]

We implemented a metalearning optimization for an MLP classifier on MNIST-1D with an explicitly written inner optimization loop using SGD. The gradient-based hyperparameter optimization converges to the optimal learning rate of 0.62 regardless of whether the initial learning rate is too high or too low (Figure 6). The whole optimization process took only one minute on a CPU.
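The essence of this experiment fits in a short differentiable training loop: treat the (log) learning rate as a tensor with requires_grad=True, unroll a handful of plain SGD steps while keeping the computation graph, and backpropagate the loss at the end of the unroll into the learning rate. The sketch below is a simplified rendering of that idea (tiny model, short unroll, placeholder data, log-parameterized learning rate), not the exact code behind Figure 6.

import torch
import torch.nn.functional as F

x = torch.randn(256, 40)                                   # placeholder batch; substitute MNIST-1D samples
y = torch.randint(0, 10, (256,))

log_lr = torch.tensor(0.0, requires_grad=True)             # metalearned parameter: lr = exp(log_lr)
outer_opt = torch.optim.Adam([log_lr], lr=0.05)

def inner_loss(params):
    w1, b1, w2, b2 = params
    h = torch.relu(F.linear(x, w1, b1))
    return F.cross_entropy(F.linear(h, w2, b2), y)

for outer_step in range(50):
    params = [torch.randn(100, 40) * 0.1, torch.zeros(100),  # fresh init every outer step
              torch.randn(10, 100) * 0.1, torch.zeros(10)]
    params = [p.requires_grad_() for p in params]
    lr = log_lr.exp()
    for inner_step in range(20):                            # unrolled, differentiable SGD
        grads = torch.autograd.grad(inner_loss(params), params, create_graph=True)
        params = [p - lr * g for p, g in zip(params, grads)]
    meta_loss = inner_loss(params)                          # outer objective: loss after the unroll
    outer_opt.zero_grad(); meta_loss.backward(); outer_opt.step()

print("metalearned learning rate:", log_lr.exp().item())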


5.4. Metalearning an activation function

The small size of MNIST-1D allows researchers to perform more challenging metalearning optimizations. For example, it permits the metalearning of an activation function — something that to the best of our knowledge has not been studied before. We parameterized our classifier's activation function with a separate neural network (MLP with layer dimensionalities 1 → 100 → 100 → 1 using tanh activations, with outputs added to an ELU function such that it could be trained to produce perturbations to the ELU shape) and then learned its weights using meta-gradients. The learned activation function substantially outperformed common nonlinearities such as ReLU, ELU, and Swish (Figure 7), achieving over 5 percentage points higher test accuracy. The resulting activation function had a non-monotonic shape with two local extrema (Figure 7).
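The parameterization described above can be written as a small PyTorch module: an ELU plus a learnable perturbation produced by a 1 → 100 → 100 → 1 tanh network applied pointwise. This is a sketch of the module only; the meta-gradient loop that actually learns its weights (analogous to the one in Section 5.3) is omitted.

import torch
import torch.nn as nn

class LearnedActivation(nn.Module):
    # ELU plus a learnable pointwise perturbation, as described in the text
    def __init__(self, hidden=100):
        super().__init__()
        self.perturbation = nn.Sequential(
            nn.Linear(1, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        shape = x.shape
        z = x.reshape(-1, 1)                              # apply the 1 -> 1 network elementwise
        return (nn.functional.elu(z) + self.perturbation(z)).reshape(shape)

act = LearnedActivation()                                 # drop-in activation for the classifier being metalearned
classifier = nn.Sequential(nn.Linear(40, 100), act, nn.Linear(100, 10))
print(classifier(torch.randn(8, 40)).shape)               # torch.Size([8, 10])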

There has been work on optimizing activation functions (Clevert et al., 2016; Ramachandran et al., 2018; Vercellino and Wang), but none has used analytical gradients computed via nested optimization. Moreover, some of these prior experiments (e.g. Ramachandran et al., 2018) used multi-day training runs on large clusters of GPUs and TPUs, whereas our entire training took around 1 hour of CPU runtime.

Figure 7: Metalearning an activation function. Starting from an ELU shape, we use gradient-based metalearning to find the optimal activation function for a neural network trained on the MNIST-1D dataset. The activation function itself is parameterized by a second (meta) neural network. Note that the ELU baseline (red) is obscured by the tanh baseline (blue) in the figure above. Runtime: ∼1 hour. [CODE]

5.5. Self-supervised learning

As shown in Table 1, logistic classification accuracy for MNIST-1D in pixel space was low (32%). A powerful approach to self-supervised representation learning in computer vision is to rely on data augmentations: each input image is augmented twice, forming 'positive pairs' which the network is trained to map to close locations in its output space while pushing away representations of other input images (Balestriero et al., 2023). In particular, in SimCLR (Chen et al., 2020), each positive pair is repulsed from all other positive pairs in the same mini-batch via the InfoNCE loss function.

We implemented the SimCLR algorithm for MNIST-1D, using a network with three convolutional and two fully-connected layers ('projection head') with output dimensionality 16. Our data augmentations consisted of regressing out the linear slope, circularly shifting by up to 10 pixels, and then reintroducing a random linear slope. We achieved 82% linear classification accuracy before the projection head in ∼1 minute of CPU training (for comparison, training SimCLR on CIFAR-10/100 datasets typically takes ∼10 GPU hours). In the output space, digits 0, 3, 6, and 8 appeared as isolated clusters (Figure 8a). Note that here we used both training and test sets of MNIST-1D for the self-supervised training, and the linear classifier was subsequently trained on the training set and evaluated on the test set.

Figure 8: SimCLR-style (Chen et al., 2020) learning on MNIST-1D (panel a: t-SNE embedding of the outputs; panel b: linear accuracy versus layer for the default, no-projection, and deeper-projection networks). (a) t-SNE embedding of the output representation after training (n = 5000). (b) Linear classification accuracy after each layer. Layer 0 stands for the input (pixel space). Accuracy always peaks in the middle (Bordes et al., 2023). CPU runtime: ∼5 minutes. [CODE]
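A compact sketch of the two ingredients just described — the augmentation and the InfoNCE (NT-Xent) objective — is given below. The encoder here is a stand-in MLP and the temperature is our choice; the experiment in the text uses three convolutional layers plus a two-layer projection head.

import torch
import torch.nn.functional as F

def augment(x, max_shift=10):
    # regress out the linear slope, circularly shift by up to `max_shift`
    # positions, then reintroduce a random linear slope
    n, length = x.shape
    t = torch.linspace(-1, 1, length)
    slope = (x * t).sum(1, keepdim=True) / (t * t).sum()   # per-sample linear trend
    x = x - slope * t
    shift = torch.randint(-max_shift, max_shift + 1, (n, 1))
    idx = (torch.arange(length) + shift) % length          # per-sample circular shift
    x = torch.gather(x, 1, idx)
    return x + torch.randn(n, 1) * t                       # new random linear slope

def info_nce(z1, z2, temperature=0.5):
    # standard SimCLR (NT-Xent) loss for two augmented views of the same batch
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    sim = z @ z.T / temperature
    sim = sim.masked_fill(torch.eye(len(z), dtype=torch.bool), float("-inf"))
    n = len(z1)
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])  # positives sit n rows away
    return F.cross_entropy(sim, targets)

x = torch.randn(128, 40)                                   # placeholder batch of MNIST-1D samples
encoder = torch.nn.Sequential(torch.nn.Linear(40, 64), torch.nn.ReLU(), torch.nn.Linear(64, 16))
loss = info_nce(encoder(augment(x)), encoder(augment(x)))
print(loss.item())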
Empirically, it has been observed that the representation quality (as measured via linear classification accuracy) is higher before the projection head rather than after (Chen et al., 2020); removal of the projection head after training has been dubbed guillotine regularization (Bordes et al., 2023) but remains poorly understood. We observed the same effect in our experiment: classification accuracy was the highest after the second layer (Figure 8b). Furthermore, when using a deeper projection head with four layers, or a network without any projection head at all, we achieved similar representation quality, and the accuracy always peaked in the middle (Figure 8b). This suggests that MNIST-1D is sufficiently rich to study cutting-edge open problems in self-supervised learning.

5.6. Benchmarking pooling methods

In our final case study we asked: What is the relationship between pooling and sample efficiency? Here we define pooling as any operation that combines activations from two or more neurons into a single feature. We are not aware of any prior literature on whether pooling makes models more or less sample efficient.


Figure 9: Benchmarking common pooling methods (panels: 1000, 5000, and 50 000 training examples; test accuracy versus train step for no_pool, stride_2, max_pool, avg_pool, and l2_pool). Pooling was helpful in low-data regimes but hindered performance in high-data regimes. Runtime: ∼5 minutes. [CODE]

With this in mind, we trained CNN models for MNIST-1D classification with different pooling methods and training set sizes. Note that here we make use of the procedural generation of MNIST-1D that allows one to generate additional samples at will. We found that, while pooling (but not striding!) was very effective in low-data regimes, it did not make much of a difference when more training data was available (Figure 9). We hypothesize that pooling is a poor man's architectural prior which is better than nothing with insufficient data but restricts model expression otherwise.
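Below is a sketch of the kind of model family compared in Figure 9: a small 1D CNN in which the downsampling operation is swapped between no pooling, strided convolution, average pooling, and max pooling. The layer sizes and option names are ours; the l2_pool variant from the figure is omitted.

import torch
import torch.nn as nn

def make_cnn(pool="max", channels=32):
    # small 1D CNN for MNIST-1D with a configurable downsampling operation
    if pool == "stride":
        down = lambda: nn.Conv1d(channels, channels, kernel_size=3, stride=2, padding=1)
    elif pool == "avg":
        down = lambda: nn.AvgPool1d(2)
    elif pool == "max":
        down = lambda: nn.MaxPool1d(2)
    else:                                   # "none": keep full resolution
        down = lambda: nn.Identity()
    return nn.Sequential(
        nn.Conv1d(1, channels, kernel_size=5, padding=2), nn.ReLU(), down(),
        nn.Conv1d(channels, channels, kernel_size=5, padding=2), nn.ReLU(), down(),
        nn.Flatten(),
        nn.LazyLinear(10),                  # infers the flattened size on the first forward pass
    )

x = torch.randn(8, 1, 40)                   # (batch, channels, sequence length)
for pool in ["none", "stride", "avg", "max"]:
    print(pool, make_cnn(pool)(x).shape)    # torch.Size([8, 10]) in every case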

6. Discussion

When to scale. This paper is not an argument against large-scale machine learning research. That research has proven its worth and has come to represent one of the most exciting aspects of the ML research ecosystem. Rather, we wish to promote small-scale machine learning research. Neural networks do not have problems with scaling or performance — but they do have problems with interpretability, reproducibility, and training speed. We see carefully-controlled, small-scale experiments as a great way to address these problems.

In fact, small-scale research is complementary to large-scale research. As in biology, where fruit fly genetics helped guide the Human Genome Project, we believe that small-scale research should always have an eye on how to successfully scale. For example, several of the findings reported in this paper are at the point where they could be investigated at scale. It would be interesting to show that large-scale lottery tickets also learn spatial inductive biases and feature local connectivity. It would also be interesting to try metalearning an activation function on a larger model in order to find an activation that can outperform ReLU and Swish in practical deep learning systems.

Understanding vs. performance. There has been some debate over the relative value of understanding neural nets versus increasing their performance. Some researchers contend that a high-performing algorithm need not be interpretable as long as it saves lives or produces economic value. Others argue that hard-to-interpret deep learning models should not be deployed in sensitive real-world contexts until we understand them better. Both arguments have merit. However, we believe that the process of identifying things we do not understand about large-scale neural networks, reproducing them in toy settings like MNIST-1D, and then performing careful ablation studies to isolate their causal mechanisms is likely to improve both performance and interpretability in the long run.

Reducing environmental impact. There is hope that deep learning will have positive environmental applications (Loehle, 1987; Rolnick et al., 2019). This may be true in the long run, but so far, artificial intelligence has done little to solve environmental problems. Deep learning models do, however, require massive amounts of electricity to train and deploy (Strubell et al., 2019). Running experiments on smaller datasets — and waiting to scale until one has a solid grasp of the phenomena involved — is a good way to reduce the electricity costs and environmental impact of this research.

The scaling down manifesto. We would like to provocatively suggest that in order to explore the limits of how large we can scale neural networks, we may need to explore the limits of how small we can scale them first. Scaling models and datasets down in a way that preserves the nuances of their behaviors will allow researchers to iterate more quickly on fundamental and creative ideas. This fast iteration cycle is the best way to obtain insights on how to incorporate progressively more complex inductive biases into our models. We can then transfer these inductive biases across scales in order to dramatically improve the sample efficiency and generalization of large models. The MNIST-1D dataset is a first step in that direction.


Acknowledgements

We would like to thank Luke Metz for the interesting conversations, Tony Zador for the encouragement to release this dataset, Simon Prince for improving our initial double descent experiments and for the permission to adapt his code for Figure 5, and Peter Steinbach for his help with preparing the Python package.

This work was partially funded by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) via Germany's Excellence Strategy (Excellence cluster 2064 "Machine Learning — New Perspectives for Science", EXC 390727645; Excellence cluster 2181 "STRUCTURES", EXC 390900948), the German Ministry of Science and Education (BMBF) via the Tübingen AI Center (01IS18039A), and the Gemeinnützige Hertie-Stiftung.

Impact Statement

The goal of our paper is to advance the field of machine learning. We do not see any potential societal consequences of our work that need to be highlighted in this section.

References

R. Balestriero, M. Ibrahim, V. Sobal, A. Morcos, S. Shekhar, T. Goldstein, F. Bordes, A. Bardes, G. Mialon, Y. Tian, et al. A cookbook of self-supervised learning. arXiv preprint arXiv:2304.12210, 2023.

M. Belkin, D. Hsu, S. Ma, and S. Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.

R. Bellman and R. Kalaba. On adaptive control processes. IRE Transactions on Automatic Control, 4(2):1–9, 1959.

Y. Bengio. Gradient-based optimization of hyperparameters. Neural Computation, 12(8):1889–1900, 2000.

F. Bordes, R. Balestriero, Q. Garrido, A. Bardes, and P. Vincent. Guillotine regularization: Why removing layers is needed to improve generalization in self-supervised learning. Transactions on Machine Learning Research, 2023.

T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.

D.-A. Clevert, T. Unterthiner, and S. Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). International Conference on Learning Representations, 2016.

J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR), 2019.

M. Geiger, S. Spigler, S. d'Ascoli, L. Sagun, M. Baity-Jesi, G. Biroli, and M. Wyart. Jamming transition as a paradigm to understand the loss landscape of deep neural networks. Physical Review E, 100(1):012115, 2019.

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2014.

D. P. Kingma and M. Welling. Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR), 2014.

A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.

E. S. Lander, L. M. Linton, B. Birren, C. Nusbaum, M. C. Zody, J. Baldwin, K. Devon, K. Dewar, M. Doyle, W. FitzHugh, et al. Initial sequencing and analysis of the human genome. Nature, 2001.

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

C. Li, H. Farkhoor, R. Liu, and J. Yosinski. Measuring the intrinsic dimension of objective landscapes. In International Conference on Learning Representations, Apr. 2018.

C. Loehle. Applying artificial intelligence techniques to ecological modeling. Ecological Modelling, 38(3-4):191–212, 1987.

D. Maclaurin, D. Duvenaud, and R. Adams. Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pages 2113–2122, 2015.

A. Morcos, H. Yu, M. Paganini, and Y. Tian. One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. In Advances in Neural Information Processing Systems, pages 4933–4943, 2019.

P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever. Deep double descent: Where bigger models and more data hurt. International Conference on Learning Representations, 2020.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,
B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss,
V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau,
M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn:
Machine learning in Python. Journal of Machine Learning
Research, 12:2825–2830, 2011.
S. J. Prince. Understanding Deep Learning. MIT Press,
2023. URL http://udlbook.com.
P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions. International Conference on Learning Representations (workshop track), 2018.
A. Rawal, J. Lehman, F. P. Such, J. Clune, and K. O. Stanley.
Synthetic petri dish: A novel surrogate model for rapid
architecture search. arXiv preprint arXiv:2005.13092, 2020.
D. Rolnick, P. L. Donti, L. H. Kaack, K. Kochanski, A. La-
coste, K. Sankaran, A. S. Ross, N. Milojevic-Dupont,
N. Jaques, A. Waldman-Brown, et al. Tackling climate
change with machine learning. Neural Information Process-
ing Systems Workshop on Climate Change AI, 2019.
O. Sharir, B. Peleg, and Y. Shoham. The cost of training NLP models: A concise overview. arXiv preprint arXiv:2004.08900, 2020.
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and
R. Salakhutdinov. Dropout: a simple way to prevent neural
networks from overfitting. The journal of machine learning
research, 15(1):1929–1958, 2014.
E. Strubell, A. Ganesh, and A. McCallum. Energy and policy considerations for deep learning in NLP. 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019.
G. V. Trunk. A problem of dimensionality: A simple exam-
ple. IEEE Transactions on pattern analysis and machine
intelligence, (3):306–307, 1979.
L. Van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11), 2008.
C. J. Vercellino and W. Y. Wang. Hyperactivations for
activation function exploration.
H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a
novel image dataset for benchmarking machine learning
algorithms, 2017.
C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. International Conference on Learning Representations (ICLR), 2017.


A. Supplementary Figures

Figure S1: Classwise errors of a CNN and a human subject on the test split of the MNIST-1D dataset. As described in the
main text, we estimated human performance on MNIST-1D by training one of the authors to perform classification and
then evaluating his accuracy on 500 test images. His accuracy was 96%. The CNN’s accuracy was 94%. Both humans
and the CNN struggled primarily with classifying 2’s and 7’s, and to a lesser degree 4’s. The human subject had a harder
time classifying 9’s whereas the CNN had a harder time classifying 1’s. Both had zero errors classifying 3’s and 6’s. It is
interesting that a human could outperform a CNN on this task. Part of the reason may be that the CNN was only given 4000
training examples — with more examples it could possibly match and eventually exceed the human baseline. Even though
the data is low-dimensional, the classification objective is quite difficult and spatial/relational priors matter a lot. It may
be that the architecture of the CNN prevents it from learning all of the tricks that humans are capable of using. It is worth
noting that modern CNNs can outperform human subjects on most large-scale image classification tasks like ImageNet. But
in our tiny benchmark a human was still competitive.


(Figure S2 panels: Random ticket #1, Random ticket #2, Lottery ticket #1, and Lottery ticket #2, each 92% sparse; every panel plots the input axis against the hidden layer axis.)

Figure S2: First layer weight masks of random tickets and lottery tickets. We sorted the masks along their hidden layer axes,
according to the number of adjacent unmasked parameters. This helps to reveal a bias towards local connectivity in the
lottery ticket masks. Notice how there are many more vertically-adjacent unmasked parameters in the lottery ticket masks.
These vertically-adjacent parameters correspond to local connectivity along the input dimension, which in turn biases the
sparse model towards data with spatial structure.
