WiSe 2023/24
Deep Learning 1
Lecture 5: Overfitting & Robustness (1)
Recap: Lectures 1-4
Lectures 1-4:
▶ With flexible neural network architectures, powerful optimization
techniques, and fast machines, we have the means of producing functions
that can accurately fit large amounts of highly nonlinear data.
Question:
▶ Do the learned neural networks generalize to new data, e.g. will they be
able to correctly classify new images?
The data on which we train the model also matter!
A Bit of Theory
So far, we have only considered the error we use to optimize the model (aka.
the training error):
E_{\mathrm{train}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( f(x_i, \theta) - t_i \right)^2
In practice, what we really care about is the true error:
E_{\mathrm{true}}(\theta) = \int \left( f(x, \theta) - t \right)^2 \, p(x, t) \, dx \, dt
where p(x, t) is the true probability distribution from which the data is drawn
at test time. The true error is much harder to minimize, because we don't
know p(x, t).
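As a minimal numerical illustration (the model, data distribution, and sample sizes below are hypothetical), the training error can be computed directly on the available data, while the true error can only be approximated, e.g. by Monte-Carlo on a large held-out sample from p(x, t):

import numpy as np

def f(x, theta):
    # Hypothetical regression model f(x, theta) = theta[0] * x + theta[1]
    return theta[0] * x + theta[1]

rng = np.random.default_rng(0)
theta = np.array([2.0, -1.0])

# Small training sample from the (here known) distribution p(x, t)
x_train = rng.uniform(-2, 2, size=20)
t_train = 2.5 * x_train - 1.0 + rng.normal(scale=0.5, size=20)

# Training error: mean squared error on the available data
E_train = np.mean((f(x_train, theta) - t_train) ** 2)

# The true error is an expectation over p(x, t); here we approximate it
# with a large held-out sample (Monte-Carlo estimate)
x_test = rng.uniform(-2, 2, size=100_000)
t_test = 2.5 * x_test - 1.0 + rng.normal(scale=0.5, size=100_000)
E_true_approx = np.mean((f(x_test, theta) - t_test) ** 2)

print(E_train, E_true_approx)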
Characterizing Datasets
Factors that make the available dataset D and the true distribution p(x, t)
diverge:
▶ The fact that the dataset is composed of few data points drawn randomly
from the underlying data distribution (finite data).
▶ The fact that the dataset may overrepresent certain parts of the
underlying distribution, e.g. people of a certain age group (dataset bias).
▶ The fact that the dataset may have been generated from an underlying
distribution p_old(x, t) that is now obsolete (distribution shift).
Practical Examples
Data types and their properties:
▶ Image/text data: thousands of images per class, aggregated from many
sources. Some image compositions may be overrepresented (dataset bias /
spurious correlations).
▶ Sensor data: potentially very large datasets, but sensors may become
decalibrated over time (distribution shift).
▶ Games (Go, chess, etc.): an infinite number of game states can be
produced through computer-computer play, but since master-level play is
more expensive to generate, simple games may be overrepresented
(dataset bias).
Practical Examples (cont.)
Data types and their properties:
▶ Simulated data (e.g. physics, car driving): theoretically infinite, but
practically limited due to the cost of running simulations. In practice, we
only generate few instances (finite data).
▶ Medical data: limited number of patients due to the rarity of a particular
disease, or regulatory constraints (finite data, dataset bias). Acquisition
devices may evolve over time (distribution shift).
▶ Social data: large amounts of data, but only recent data is relevant. Risk
of not capturing the most recent trends (distribution shift).
Outline
The Problem of Finite Data
▶ The problem of overfitting
▶ Mitigating overfitting
Dataset Bias 1: Imbalance Between Subgroups
▶ Data from Multiple Domains
▶ Building a 'Domain'-Invariant Classifier
Dataset Bias 2: Spurious Correlations
▶ Examples of Spurious Correlations
▶ Detecting and Mitigating Spurious Correlations
Part 1: The Problem of Limited Data
Finite Data and Overfitting
[Figure: decision boundaries of the theoretical optimum vs. a model learned in practice on a finite sample from two Gaussian distributions.]
▶ Assume each data point x ∈ R^d and its label y ∈ {0, 1} is generated
iid. from two Gaussian distributions.
▶ With limited data, one class or target value may be locally predominant
'by chance'. Learning these spurious variations is known as overfitting.
▶ An overfitted model predicts the training data perfectly but works
poorly on new data.
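A minimal sketch of this setup (sample sizes and parameters are chosen here purely for illustration): a very flexible classifier fits the small training sample perfectly but does noticeably worse on fresh data from the same two Gaussians.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def sample(n):
    # Two overlapping Gaussians in 2D, one per class (means at x = -1 and x = +1)
    y = rng.integers(0, 2, size=n)
    x = rng.normal(loc=np.stack([2.0 * y - 1.0, np.zeros(n)], axis=1), scale=1.0)
    return x, y

x_train, y_train = sample(50)        # limited training data
x_test, y_test = sample(100_000)     # large sample approximating the true distribution

clf = KNeighborsClassifier(n_neighbors=1).fit(x_train, y_train)  # very flexible model
print("train acc:", clf.score(x_train, y_train))  # 1.0 by construction
print("test acc:", clf.score(x_test, y_test))     # noticeably lower: overfitting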
Model Error and Model Complexity
William of Ockham (1287-1347)
Linked model complexity to how suitable the model is for
explaining phenomena: "Entia non sunt multiplicanda praeter
necessitatem" (entities must not be multiplied beyond necessity).
Vladimir Vapnik
Showed a formal relation between model complexity
(measured as the VC-dimension) and the error of a classifier.
Complexity and Generalization Error
Generalization Bound [Vapnik]
Let h denote the VC-dimension of the model class F. The difference between
the true error E_true(θ) and the training error E_train(θ) is, with probability
1 − η, upper-bounded as:

E_{\mathrm{true}}(\theta) - E_{\mathrm{train}}(\theta) \leq \sqrt{\frac{h\left(\log\frac{2N}{h} + 1\right) - \log(\eta/4)}{N}}

The VC-dimension h defines the complexity (or flexibility) of the class of
considered models.
Factors that reduce the gap between test error Etrue (θ) and training error
Etrain (θ):
▶ Lowering the VC-dimension h.
▶ Increasing the number of data points N.
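As a quick numerical illustration of the two factors above (purely for intuition; the constants follow the bound as stated, with η a chosen confidence parameter):

import numpy as np

def vc_gap_bound(h, N, eta=0.05):
    # Upper bound on E_true - E_train for VC-dimension h and N samples,
    # holding with probability 1 - eta (formula above)
    return np.sqrt((h * (np.log(2 * N / h) + 1) - np.log(eta / 4)) / N)

# The gap shrinks with more data and grows with model complexity
print(vc_gap_bound(h=100, N=1_000))    # larger gap
print(vc_gap_bound(h=100, N=100_000))  # smaller gap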
Characterizing Complexity (One-Layer Networks)
Interpretation:
▶ Model complexity can be kept low if the input data is low-dimensional
or if the model builds a large margin (i.e. has low sensitivity).
Question:
▶ Can we build similar concepts for deep neural networks?
Reducing Complexity via Low Dimensionality
[Diagram: hard-coded feature extraction producing a few features a1, a2, followed by a neural network with learned parameters.]
Approach:
▶ First, generate a low-dimensional representation by extracting a few
features from the high-dimensional input data (either hand-designed, or
automatically generated using methods such as PCA).
▶ Then, learn a neural network on the resulting low-dimensional data.
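A minimal sketch of this two-step approach using scikit-learn (the dataset, dimensionalities, and architecture are placeholders):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5000))     # high-dimensional inputs (e.g. gene expression)
y = rng.integers(0, 2, size=200)     # binary labels (placeholder)

# Step 1: low-dimensional representation via PCA (could also be hand-designed features)
pca = PCA(n_components=10).fit(X)
Z = pca.transform(X)

# Step 2: learn a neural network on the resulting low-dimensional data
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(Z, y)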
Reducing Complexity via Low Dimensionality
Observations:
▶ Building low-dimensional representations can be useful when predicting
from noisy high-dimensional data such as gene expression in biology
(d > 20000).
▶ On other tasks such as image recognition, low-dimensional
representations can also remove class-relevant information (e.g. edges).
Reducing Complexity by Reducing Sensitivity
[Diagram: a neural network reading the pixels a1, ..., ad directly, with learned parameters and regularization.]
Weight Decay [4]:
▶ Include in the objective a term that makes the weights tend to zero if
they are not necessary for the prediction task.
E(\theta) = \sum_{i=1}^{N} \left( f(x_i, \theta) - t_i \right)^2 + \lambda \|\theta\|^2
▶ The higher the parameter λ, the more the model's sensitivity to
variations in the input domain is reduced.
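A minimal PyTorch sketch of this objective (the model, data, and λ are placeholders; in practice one can equivalently pass weight_decay to the optimizer):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
lam = 1e-3  # regularization strength lambda

x = torch.randn(64, 10)   # placeholder batch
t = torch.randn(64, 1)

optimizer.zero_grad()
pred = model(x)
mse = ((pred - t) ** 2).sum()                          # data term
l2 = sum((p ** 2).sum() for p in model.parameters())   # ||theta||^2
loss = mse + lam * l2                                  # weight decay objective
loss.backward()
optimizer.step()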
Reducing Complexity by Reducing Sensitivity
Dropout [7]:
▶ Alternative to weight decay, which consists of adding artificial
multiplicative noise to the input and intermediate neurons, and training
the model subject to that noise.
▶ This is achieved by inserting a dropout layer in the neural network,
which multiplies each input (or activation) aj by a random variable
bj ∼ Bernoulli(p), yielding zj = bj · aj.
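A minimal PyTorch sketch (the architecture and dropout rate are placeholders; note that torch.nn.Dropout takes the probability of zeroing a unit, i.e. the keep probability of the slide corresponds to 1 - p here):

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # each activation is multiplied by b ~ Bernoulli(1 - p), then rescaled
    nn.Linear(256, 10),
)

model.train()                    # dropout is active during training
out = model(torch.randn(32, 784))

model.eval()                     # dropout is disabled (identity) at test time
out = model(torch.randn(32, 784))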
Reducing Complexity by Reducing Sensitivity
[Figure: effect of dropout on performance on the MNIST dataset.]
Note:
▶ On neural networks for image data, dropout tends to yield superior
performance compared to simple weight decay.
Choosing a Model with Appropriate Complexity
Holdout Validation:
▶ Train multiple neural network models with different regularization
parameters (e.g. λ), and retain the one that performs best on
some validation set disjoint from the training data (as sketched below).
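A minimal sketch of this selection loop (the candidate λ values, model, and data splits are placeholders):

import torch
import torch.nn as nn

def train_model(lam, x_tr, t_tr, epochs=100):
    # Train a small regression network with weight decay strength lam
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=lam)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((model(x_tr) - t_tr) ** 2).mean()
        loss.backward()
        opt.step()
    return model

x_tr, t_tr = torch.randn(200, 10), torch.randn(200, 1)   # training split
x_val, t_val = torch.randn(50, 10), torch.randn(50, 1)   # disjoint validation split

best_model, best_err = None, float("inf")
for lam in [0.0, 1e-4, 1e-3, 1e-2]:                      # candidate regularization parameters
    model = train_model(lam, x_tr, t_tr)
    with torch.no_grad():
        val_err = ((model(x_val) - t_val) ** 2).mean().item()
    if val_err < best_err:
        best_model, best_err = model, val_err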
Problem:
▶ Training a model for each parameter λ can be costly. One would
potentially benefit from training a bigger model only once.
Accelerating Model Selection
Early Stopping Technique [6]:
▶ View the iterative procedure for training a neural network as generating
a sequence of increasingly complex models θ1, ..., θT.
▶ Monitor the validation error throughout training and keep a snapshot
of the model when it had the lowest validation error.
Early stopping:
θ⋆ = None
E⋆ = ∞
for t = 1 ... T do
    Run a few SGD steps, and collect the current parameters θt
    if Eval(θt) < E⋆ then
        θ⋆ ← θt
        E⋆ ← Eval(θt)
    end if
end for
Advantage:
▶ No need to train several models (e.g. with different λ's). Only one
training run is needed!
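A minimal PyTorch sketch of this procedure (the model, data, and number of steps are placeholders):

import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

x_tr, t_tr = torch.randn(200, 10), torch.randn(200, 1)   # placeholder splits
x_val, t_val = torch.randn(50, 10), torch.randn(50, 1)

best_state, best_err = None, float("inf")
for t in range(100):                                      # T outer iterations
    # Run a few SGD steps
    for _ in range(10):
        opt.zero_grad()
        loss = ((model(x_tr) - t_tr) ** 2).mean()
        loss.backward()
        opt.step()
    # Keep a snapshot of the model with the lowest validation error so far
    with torch.no_grad():
        val_err = ((model(x_val) - t_val) ** 2).mean().item()
    if val_err < best_err:
        best_err = val_err
        best_state = copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)   # restore the best snapshot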
Very Large Models
▶ When the model becomes very large, there is an interesting 'double
descent' [2] phenomenon that occurs in the context of neural networks,
where the generalization error starts to go down again as model
complexity increases.
▶ This can be interpreted as some implicit averaging between the many
components of the model (interpolating regime).
▶ Increasing model size substantially may thus help to achieve a lower
test-set error, even without further regularization techniques.
Part 2: Imbalances between Subgroups
Data from Multiple Domains
▶ The data might come from different
domains (P, Q).
▶ Domains may e.g. correspond to
different acquisition devices, or
different ways they are
configured/calibrated.
▶ One of the domains may be
overrepresented in the available
data, or the ML model may learn
better on a given domain at the
expense of another domain.
Image source: Aubreville et al. Quantifying the Scanner-Induced Domain Gap in Mitosis Detection. CoRR abs/2103.16515 (2021)
Addressing Multiple Domains
Simple Approach (one-layer networks):
▶ Denoting by P and Q the two domains, regularize the ML model
(w^⊤ x) so that both domains generate the same responses on average
at the output:

\min_{w} \; E(w) + \lambda \cdot \left( \mathbb{E}_P[w^\top x] - \mathbb{E}_Q[w^\top x] \right)^2
(aka. moment matching). The approach can be enhanced to include
higher-order moments such as variance, etc.
▶ In practice, more powerful tools exist to constrain distributions more
finely in representation space, such as the Wasserstein distance.
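A minimal PyTorch sketch of the first-moment matching penalty from the formula above (the linear model, the data from domains P and Q, and λ are placeholders):

import torch

w = torch.zeros(10, requires_grad=True)      # parameters of a linear model w^T x
opt = torch.optim.SGD([w], lr=1e-2)
lam = 1.0

x_p, t_p = torch.randn(100, 10), torch.randn(100)   # labeled data from domain P
x_q = torch.randn(100, 10) + 0.5                     # (possibly unlabeled) data from domain Q

opt.zero_grad()
err = ((x_p @ w - t_p) ** 2).mean()                  # task error E(w), here on domain P
gap = (x_p @ w).mean() - (x_q @ w).mean()            # first-moment mismatch of the outputs
loss = err + lam * gap ** 2                          # moment-matching objective
loss.backward()
opt.step()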
Addressing Multiple Domains
More Advanced Approach [1]:
▶ Learn an auxiliary neural network (domain critic φ) that tries to classify
the two domains. Learn the parameters of the feature extractor in a
way that the domain critic φ is no longer able to distinguish between
the two domains.
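One possible sketch of such an adversarial scheme (a simplified alternating-update version; the architecture, loss weighting, and data are placeholders and do not follow [1] exactly):

import torch
import torch.nn as nn

feat = nn.Sequential(nn.Linear(10, 32), nn.ReLU())   # feature extractor
task = nn.Linear(32, 2)                               # task classifier
critic = nn.Linear(32, 2)                             # domain critic (predicts P vs Q)

opt_model = torch.optim.Adam(list(feat.parameters()) + list(task.parameters()), lr=1e-3)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

x_p, y_p = torch.randn(64, 10), torch.randint(0, 2, (64,))   # labeled domain P
x_q = torch.randn(64, 10) + 0.5                               # unlabeled domain Q

for step in range(100):
    z_p, z_q = feat(x_p), feat(x_q)
    dom = torch.cat([torch.zeros(64), torch.ones(64)]).long()     # domain labels

    # 1) Update the critic to distinguish the two domains in feature space
    critic_loss = ce(critic(torch.cat([z_p, z_q]).detach()), dom)
    opt_critic.zero_grad()
    critic_loss.backward()
    opt_critic.step()

    # 2) Update feature extractor + task head: solve the task while fooling the critic
    task_loss = ce(task(z_p), y_p)
    confusion = -ce(critic(torch.cat([z_p, z_q])), dom)           # maximize the critic's error
    loss = task_loss + 0.1 * confusion
    opt_model.zero_grad()
    loss.backward()
    opt_model.step()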
Addressing Multiple Domains
Example:
▶ Example of one particular class of the Office-Caltech dataset and the
different domains from which the data is taken.
▶ Models equipped with a domain critic, although losing performance on
some domains, achieve better worst-case accuracy.
Part 3: Spurious Correlations
Spurious Correlations
▶ Artifact of the distribution of available data (P) where one or several
task-irrelevant input variables are spuriously correlated with the
task-relevant variables.
▶ Spurious correlations are very common in practical datasets, e.g. a
copyright tag occurring only on images of a certain class;
histopathological images of a certain class having been acquired with a
particular device and therefore exhibiting a different color profile, etc.
Spurious Correlations and the Clever Hans Effect
[Figure: available data (P) vs. new data (Q).]
▶ An ML classifier is technically able to classify the available data (P)
using either the correct features or the spurious ones. The ML model
doesn't know a priori which feature (the correct one or the spurious
one) to use. A model that bases its decision strategy on the spurious
feature is "right for the wrong reasons" and is also known as a Clever
Hans classifier.
▶ A Clever Hans classifier may fail to function well on the new data (Q)
where the spurious correlation no longer exists, e.g. horses without
copyright tags, or images of a different class with copyright tags.
Spurious Correlations and the Clever Hans Effect
▶ Test set accuracy doesn't give much information on whether the model
bases its decision on the correct features or exploits the spurious
correlation.
▶ Only an inspection of the decision structure by the user (e.g. using
LRP heatmaps) enables the detection of the flaw in the model [5].
Generating LRP heatmaps
[Figure: LRP applied to a Fisher Vector pipeline (local features, GMM fitting, Fisher vector, normalization + linear SVM / Hellinger's kernel SVM); relevance is redistributed layer by layer, following a conservation principle, to produce a heatmap for an image of class 'bicycle'.]
▶ Explanations are produced using a layer-wise redistribution process
from the output of the model to the input features.
▶ Each layer can have its own redistribution scheme. The redistribution
rules are designed in a way that maximizes explanation quality.
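As an illustration, here is a minimal numpy sketch of one common redistribution rule (the LRP-ε rule for a linear/ReLU layer; the small network and inputs are placeholders, and real pipelines combine several such rules):

import numpy as np

def lrp_epsilon(a, w, b, relevance_out, eps=1e-6):
    # Redistribute relevance from a layer's outputs to its inputs (LRP-epsilon rule)
    z = a @ w + b                                   # forward pre-activations
    s = relevance_out / (z + eps * np.sign(z))      # stabilized ratio
    return a * (w @ s)                              # relevance assigned to each input

rng = np.random.default_rng(0)
a = rng.random(4)                                   # input activations
w1, b1 = rng.normal(size=(4, 3)), np.zeros(3)       # layer 1 (ReLU)
w2, b2 = rng.normal(size=(3, 1)), np.zeros(1)       # layer 2 (output)

h = np.maximum(0, a @ w1 + b1)                      # forward pass
out = h @ w2 + b2

# Backward redistribution: output relevance -> hidden layer -> input features
r_hidden = lrp_epsilon(h, w2, b2, out)
r_input = lrp_epsilon(a, w1, b1, r_hidden)
# Conservation holds approximately: r_input.sum() ≈ out.sum() (zero biases, small eps)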
Mitigating Reliance on Spurious Correlations
Feature Selection / Unlearning:
▶ Retrain without the feature containing the artifact (e.g. crop images
to avoid copyright tags).
▶ Actively look in the model for units (e.g. subsets of neurons) that
respond to the artifact and remove such units from the model (e.g.
[3]).
Dataset Design:
▶ Manually remove the artifact (e.g. copyright tags) from the classes
that contain it, or alternatively, inject the artifact in every class (so
that it cannot be used anymore for discriminating between classes).
▶ Stratify the dataset in a way that the spurious features are present in
all classes in similar proportions.
Learning with Explanation Constraints:
▶ Include an extra term in the objective that penalizes decision strategies
that are based on unwanted features (revealed beforehand by an
explanation technique), as sketched below.
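One way to implement such a constraint is an input-gradient penalty on regions flagged as unwanted (in the spirit of "right for the right reasons" approaches; the mask, model, and weighting below are placeholders):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

x = torch.randn(32, 100)                      # batch of inputs (placeholder)
y = torch.randint(0, 2, (32,))                # labels
mask = torch.zeros(32, 100)
mask[:, 90:] = 1                              # 1 = unwanted features (e.g. artifact region)

x.requires_grad_(True)
logits = model(x)
task_loss = ce(logits, y)

# Gradient of the output scores w.r.t. the inputs (a simple explanation proxy)
grads = torch.autograd.grad(logits.sum(), x, create_graph=True)[0]

# Penalize any reliance on the masked (unwanted) features
expl_penalty = ((grads * mask) ** 2).sum()

loss = task_loss + 0.1 * expl_penalty
opt.zero_grad()
loss.backward()
opt.step()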
Summary
▶ While deep learning can in principle fit very complex prediction
functions, how models perform in practice is in large part determined
by the amount and quality of the data.
▶ Limited data may subject the ML model to overfitting and lead to
lower performance on new data. Various methods exist to prevent
overfitting (e.g. generating a low-dimensional input representation, or
building a model with limited sensitivity to its input, e.g. via weight
decay or dropout).
▶ Another problem is dataset bias, where certain parts of the distribution
are over-/under-represented, or plagued with spurious correlations.
Reliance of the model on spurious correlations can lead to low test
performance, but this can be detected by Explainable AI approaches. A
number of methods exist to reduce reliance on spurious correlations.
References
[1] L. Andéol, Y. Kawakami, Y. Wada, T. Kanamori, K.-R. Müller, and G. Montavon.
Learning domain invariant representations by joint wasserstein distance minimization.
Neural Networks, 167:233–243, 2023.
[2] M. Belkin, D. Hsu, S. Ma, and S. Mandal.
Reconciling modern machine-learning practice and the classical bias–variance trade-off.
PNAS, 116(32):15849–15854, 2019.
[3] P. Chormai, J. Herrmann, K. Müller, and G. Montavon.
Disentangled explanations of neural network predictions by finding relevant subspaces.
CoRR, abs/2212.14855, 2022.
[4] A. Krogh and J. A. Hertz.
A simple weight decay can improve generalization.
In NIPS, pages 950–957, 1991.
[5] S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.-R. Müller.
Unmasking clever hans predictors and assessing what machines really learn.
Nature Communications, 10(1), Mar. 2019.
[6] L. Prechelt.
Early stopping - but when?
In Neural Networks: Tricks of the Trade, volume 1524 of LNCS, pages 55–69. Springer, 1996.
[7] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov.
Dropout: a simple way to prevent neural networks from overfitting.
J. Mach. Learn. Res., 15(1):1929–1958, 2014.