Scaling Down Deep Learning with MNIST-1D
Table 1: Test accuracies (in %) of common classifiers on the MNIST and MNIST-1D datasets. Most classifiers achieve
similar test accuracy on MNIST. By contrast, the MNIST-1D dataset is capable of separating different models based on
their inductive biases. The drop in CNN and GRU performance when using shuffled features indicates that spatial priors
are important on this dataset. All models except logistic regression achieve 100% accuracy on the training set. Standard
deviation is computed over three runs. [CODE]
Table 2: Default hyperparameters of the MNIST-1D generation process.

Parameter               Value
Template length         12
Padding points          36–60
Max. translation        48
Gaussian filter width   2
Gaussian noise scale    0.25
White noise scale       0.02
Shear scale             0.75
Final seq. length       40
Random seed             42
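To make these defaults concrete, the generation procedure described in Section 3 can be sketched roughly as follows. This is our own paraphrase using numpy/scipy; the scaling range and the exact order of operations are assumptions, and the authoritative implementation is the mnist1d package itself:

import numpy as np
from scipy.ndimage import gaussian_filter1d

def make_sample(template, rng):
    # pad the 12-point template with 36-60 zeros at the end
    x = np.concatenate([template, np.zeros(rng.integers(36, 61))])
    # random circular shift of up to 48 indices
    x = np.roll(x, rng.integers(0, 49))
    # random scaling (this range is our assumption)
    x *= rng.uniform(0.5, 1.5)
    # correlated Gaussian noise plus white noise
    x += 0.25 * gaussian_filter1d(rng.standard_normal(x.size), sigma=2)
    x += 0.02 * rng.standard_normal(x.size)
    # constant linear ("shear") signal
    x += 0.75 * np.linspace(-1, 1, x.size)
    # smooth to induce spatial correlations, then downsample to 40 points
    x = gaussian_filter1d(x, sigma=2)
    return x[np.linspace(0, x.size - 1, 40).astype(int)]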
Figure 2: Train and test accuracy of common classification models on MNIST-1D. The logistic regression model fares worse than the MLP. Meanwhile, the MLP fares worse than the CNN and GRU, which use translation invariance and local connectivity to bias optimization towards solutions that generalize well. When local spatial correlations are destroyed by shuffling feature indices (dashed lines), the MLP performs the best. CPU runtime: ∼10 minutes. [CODE]

references to biology in order to motivate the use of small synthetic datasets for exploratory research. Their work differs from ours in that they use metalearning to obtain their datasets, whereas we construct ours by hand. In doing so, we are able to control various causal factors such as the amount of noise, translation, and padding. Our dataset is more intuitive to humans: an experienced human can outperform a strong CNN on the MNIST-1D classification task. This is not possible on the Synthetic Petri Dish dataset.

3. The MNIST-1D dataset

Dimensionality. Our first design choice was to use one-dimensional time series instead of two-dimensional grayscale images or three-dimensional tensors corresponding to colored images. Our rationale was that one-dimensional signals require far less computation to train on but can be designed to have many of the same biases, distortions, and distribution shifts that are of interest to researchers studying fundamental deep learning questions.

Constructing the dataset. We began with ten one-dimensional template patterns which resemble the digits 0–9 when plotted as in Figure 1. Each of these templates consisted of 12 hand-crafted x coordinates. Next we padded the end of each sequence with 36–60 additional zero values, did a random circular shift by up to 48 indices, applied a random scaling, added Gaussian noise, and added a constant linear signal. We used Gaussian smoothing with σ = 2 to induce spatial correlations. Finally, we downsampled the sequences to 40 data points that play the role of pixels in the resulting MNIST-1D (Figure 1). Table 2 gives the values of all the default hyperparameters used in these transformations.

Implementation. Our goal was to make the code as simple, modular, and extensible as possible. The code for generating the dataset occupies two Python files and fits in a total of 150 lines. The get_dataset method has a simple API for changing dataset features such as maximum digit translation, correlated noise scale, shear scale, final sequence length, and more (Table 2). The following code snippet shows how to install the mnist1d package, choose a custom number of samples, and generate a dataset:

# install the package from PyPI
# pip install mnist1d

from mnist1d.data import make_dataset
from mnist1d.data import get_dataset_args

args = get_dataset_args()  # default params
args.num_samples = 10_000  # custom number of samples
data = make_dataset(args)
x, y = data["x"], data["y"]

The frozen dataset with 4000 + 1000 samples can be found on GitHub as mnist1d_data.pkl.

4. Classification

We used PyTorch to implement and train simple logistic, MLP (fully-connected), CNN (with 1D convolutions), and GRU (gated recurrent unit) models. We used the Adam optimizer and early stopping for model selection and evaluation. We obtained 32% accuracy with logistic regression, 68% using an MLP, 91% using a GRU, and 94% using a CNN (Table 1). Even though the test performance was markedly different between the MLP, GRU, and CNN, all of them easily achieved 100% accuracy on the training set (Figure 2). While for the MLP this is a manifestation of overfitting, for
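As a concrete illustration of the kind of model benchmarked here, below is a minimal PyTorch sketch of a 1D CNN classifier for 40-point MNIST-1D inputs. The layer sizes and kernel widths are our assumptions, not the exact benchmark architecture (see the linked code for that):

import torch
import torch.nn as nn

class ConvNet1D(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.head = nn.Linear(32 * 40, n_classes)

    def forward(self, x):                  # x: (batch, 40)
        h = self.features(x.unsqueeze(1))  # -> (batch, 32, 40)
        return self.head(h.flatten(1))

model = ConvNet1D()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()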
Figure 4: Finding and analyzing lottery tickets. (a–b) The test loss and test accuracy of lottery tickets at different levels of
sparsity, compared to randomly selected subnetworks and to the original dense network. (c) Performance of lottery tickets
with 92% sparsity. (d) Performance of the same lottery tickets when trained on flipped data. (e) Performance of the same
lottery tickets when trained on data with shuffled features. (f) Performance of the same lottery tickets but with randomly
initialized weights, when trained on original data. (g) Lottery tickets had more adjacent non-zero weights in the first layer
compared to random subnetworks. Runtime: ∼30 minutes. [CODE]
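The masks behind Figure 4 can be illustrated with a one-shot magnitude-pruning sketch. This is a simplified stand-in for the actual lottery-ticket procedure (which prunes iteratively); pruning each weight matrix separately is our assumption:

import torch

def magnitude_masks(model, sparsity):
    # keep the largest-magnitude weights in each weight matrix,
    # zeroing out a `sparsity` fraction (one-shot variant)
    masks = {}
    for name, w in model.named_parameters():
        if w.dim() < 2:                      # skip biases
            continue
        k = max(1, int(sparsity * w.numel()))
        threshold = w.abs().flatten().kthvalue(k).values
        masks[name] = (w.abs() > threshold).float()
    return masks

# To evaluate a ticket: reset surviving weights to their original
# initialization, re-apply the masks after every optimizer step, and
# retrain; compare against masks of the same sparsity chosen at random.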
In other words, we permuted the feature indices in order to remove any spatial structure from the data. Shuffling greatly reduced the performance of the lottery tickets: they

[Figure: train and test classification error (%) for (a) an MLP without label noise, (b) an MLP with label noise, and (c) a CNN with label noise.]
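The shuffling itself is simple: one fixed permutation of the 40 feature indices is applied to every sample. A minimal numpy sketch, assuming arrays x_train and x_test of shape (n, 40):

import numpy as np

# x_train, x_test: arrays of shape (n, 40), assumed given
rng = np.random.default_rng(0)   # fixed seed -> one fixed permutation
perm = rng.permutation(40)       # permute the 40 feature indices
x_train_shuf = x_train[:, perm]  # same permutation for every sample
x_test_shuf = x_test[:, perm]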
[Figure: (a) arrangement of the ten digit classes 0–9; (b) linear accuracy (%) of probes at layers 0–7 for the Default, No proj., and Deeper proj. model variants.]
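The "linear accuracy" plotted in panel (b) is the test accuracy of a linear classifier trained on frozen activations of a given layer. A minimal sketch, assuming per-layer activations are available as arrays (the function name and use of scikit-learn are our choices):

from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(h_train, y_train, h_test, y_test):
    # fit a linear classifier on frozen activations from one layer
    # and report its test accuracy (the "linear accuracy" of that layer)
    clf = LogisticRegression(max_iter=1000).fit(h_train, y_train)
    return clf.score(h_test, y_test)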
[Figure 9 panels: test accuracy over training steps for no_pool, stride_2, max_pool, avg_pool, and l2_pool models at increasing training set sizes.]
Figure 9: Benchmarking common pooling methods. Pooling was helpful in low-data regimes but hindered performance in
high-data regimes. Runtime: ∼5 minutes. [CODE]
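A sketch of how such pooling variants can be swapped inside a small 1D CNN, in PyTorch. The channel count and kernel size are our assumptions, and l2_pool is realized here with LPPool1d; the benchmark's exact architectures are in the linked code:

import torch.nn as nn

def conv_stage(pool):
    # one convolutional stage whose downsampling method is swappable
    conv = nn.Conv1d(32, 32, kernel_size=3, padding=1,
                     stride=2 if pool == "stride_2" else 1)
    layers = [conv, nn.ReLU()]
    if pool == "max_pool":
        layers.append(nn.MaxPool1d(2))
    elif pool == "avg_pool":
        layers.append(nn.AvgPool1d(2))
    elif pool == "l2_pool":
        layers.append(nn.LPPool1d(2, kernel_size=2))
    # "no_pool" keeps the sequence at full resolution
    return nn.Sequential(*layers)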
or less sample efficient.

With this in mind, we trained CNN models for MNIST-1D classification with different pooling methods and training set sizes. Note that here we make use of the procedural generation of MNIST-1D, which allows one to generate additional samples at will. We found that, while pooling (but not striding!) was very effective in low-data regimes, it did not make much of a difference when more training data was available (Figure 9). We hypothesize that pooling is a poor man's architectural prior: better than nothing when data is scarce, but restricting model expressivity otherwise.

versus increasing their performance. Some researchers contend that a high-performing algorithm need not be interpretable as long as it saves lives or produces economic value. Others argue that hard-to-interpret deep learning models should not be deployed in sensitive real-world contexts until we understand them better. Both arguments have merit. However, we believe that the process of identifying things we do not understand about large-scale neural networks, reproducing them in toy settings like MNIST-1D, and then performing careful ablation studies to isolate their causal mechanisms is likely to improve both performance and interpretability in the long run.
A. Supplementary Figures
Figure S1: Classwise errors of a CNN and a human subject on the test split of the MNIST-1D dataset. As described in the main text, we estimated human performance on MNIST-1D by training one of the authors to perform classification and then evaluating his accuracy on 500 test examples. His accuracy was 96%; the CNN's accuracy was 94%. Both the human and the CNN struggled primarily with classifying 2's and 7's, and to a lesser degree 4's. The human subject had a harder time classifying 9's, whereas the CNN had a harder time classifying 1's. Both had zero errors classifying 3's and 6's. It is interesting that a human could outperform a CNN on this task. Part of the reason may be that the CNN was only given 4000 training examples; with more examples it could possibly match and eventually exceed the human baseline. Even though the data is low-dimensional, the classification objective is quite difficult, and spatial/relational priors matter a lot. It may be that the architecture of the CNN prevents it from learning all of the tricks that humans are capable of using. It is worth noting that modern CNNs can outperform human subjects on most large-scale image classification tasks like ImageNet, but on our tiny benchmark a human was still competitive.
Figure S2: First layer weight masks of random tickets and lottery tickets. We sorted the masks along their hidden layer axes,
according to the number of adjacent unmasked parameters. This helps to reveal a bias towards local connectivity in the
lottery ticket masks. Notice how there are many more vertically-adjacent unmasked parameters in the lottery ticket masks.
These vertically-adjacent parameters correspond to local connectivity along the input dimension, which in turn biases the
sparse model towards data with spatial structure.
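The sorting used in Figure S2 can be sketched as follows, assuming a binary first-layer mask of shape (hidden units × input features); the orientation and function name are our assumptions:

import numpy as np

def sort_by_adjacency(mask):
    # count pairs of adjacent unmasked weights along the input axis
    # for each hidden unit, then sort units by that count
    adjacent = (mask[:, 1:] * mask[:, :-1]).sum(axis=1)
    return mask[np.argsort(-adjacent)]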