WiSe 2023/24
Deep Learning 1
Lecture 5: Overfitting & Robustness (1)
Recap: Lectures 1-4
Lectures 1-4:
▶ With flexible neural network architectures, powerful optimization
techniques, and fast machines, we have the means of producing functions
that can accurately fit large amounts of highly nonlinear data.
Question:
▶ Do the learned neural networks generalize to new data, e.g. will they be
able to correctly classify new images?
The data on which we train the model also matter!
A Bit of Theory
So far, we have only considered the error we use to optimize the model (aka.
the training error):
E_{\mathrm{train}}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left( f(x_i, \theta) - t_i \right)^2
In practice, what we really care about is the true error:
E_{\mathrm{true}}(\theta) = \int \left( f(x, \theta) - t \right)^2 \, p(x, t) \, dx \, dt
where p(x, t) is the true probability distribution from which the data is drawn
at test time. The true error is much harder to minimize, because we don't
know p(x, t).
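As a minimal numerical illustration (the model, data distribution, and sample sizes below are hypothetical), the training error can be computed directly on the available data, while the true error can only be approximated, e.g. by Monte-Carlo on a large held-out sample from p(x, t):

import numpy as np

def f(x, theta):
    # Hypothetical regression model f(x, theta) = theta[0] * x + theta[1]
    return theta[0] * x + theta[1]

rng = np.random.default_rng(0)
theta = np.array([2.0, -1.0])

# Small training sample from the (here known) distribution p(x, t)
x_train = rng.uniform(-2, 2, size=20)
t_train = 2.5 * x_train - 1.0 + rng.normal(scale=0.5, size=20)

# Training error: mean squared error on the available data
E_train = np.mean((f(x_train, theta) - t_train) ** 2)

# The true error is an expectation over p(x, t); here we approximate it
# with a large held-out sample (Monte-Carlo estimate)
x_test = rng.uniform(-2, 2, size=100_000)
t_test = 2.5 * x_test - 1.0 + rng.normal(scale=0.5, size=100_000)
E_true_approx = np.mean((f(x_test, theta) - t_test) ** 2)

print(E_train, E_true_approx)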
Characterizing Datasets
Factors that make the available dataset D and the true distribution p(x, t)
diverge:
▶ The fact that the dataset is composed of few data points drawn randomly
from the underlying data distribution (finite data).
▶ The fact that the dataset may overrepresent certain parts of the
underlying distribution, e.g. people of a certain age group (dataset bias).
▶ The fact that the dataset may have been generated from an underlying
distribution p_old(x, t) that is now obsolete (distribution shift).
Practical Examples
Data types and their properties:
▶ Image/text data: thousands of images per class, aggregated from many
sources. Some image compositions may be overrepresented (dataset bias /
spurious correlations).
▶ Sensor data: potentially very large datasets, but sensors may become
decalibrated over time (distribution shift).
▶ Games (Go, chess, etc.): an infinite number of game states can be
produced through computer-computer play, but since master-level play is
more expensive to generate, simple games may be overrepresented
(dataset bias).
Practical Examples (cont.)
Data types and their properties:
▶ Simulated data (e.g. physics, car driving): theoretically infinite, but
practically limited due to the cost of running simulations. In practice, we
only generate few instances (finite data).
▶ Medical data: limited number of patients due to the rarity of a particular
disease, or regulatory constraints (finite data, dataset bias). Acquisition
devices may evolve over time (distribution shift).
▶ Social data: large amounts of data, but only recent data is relevant. Risk
of not capturing the most recent trends (distribution shift).
Outline
The Problem of Finite Data
▶ The problem of overfitting
▶ Mitigating overfitting
Dataset Bias 1: Imbalance Between Subgroups
▶ Data from Multiple Domains
▶ Building a 'Domain'-Invariant Classifier
Dataset Bias 2: Spurious Correlations
▶ Examples of Spurious Correlations
▶ Detecting and Mitigating Spurious Correlations
Part 1: The Problem of Limited Data
Finite Data and Overfitting
[Figure: decision boundaries of the theoretical optimum vs. a model learned in practice on a finite sample from two Gaussian distributions.]
▶ Assume each data point x ∈ R^d and its label y ∈ {0, 1} is generated
iid. from two Gaussian distributions.
▶ With limited data, one class or target value may be locally predominant
'by chance'. Learning these spurious variations is known as overfitting.
▶ An overfitted model predicts the training data perfectly but works
poorly on new data.
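A minimal sketch of this setup (sample sizes and parameters are chosen here purely for illustration): a very flexible classifier fits the small training sample perfectly but does noticeably worse on fresh data from the same two Gaussians.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def sample(n):
    # Two overlapping Gaussians in 2D, one per class (means at x = -1 and x = +1)
    y = rng.integers(0, 2, size=n)
    x = rng.normal(loc=np.stack([2.0 * y - 1.0, np.zeros(n)], axis=1), scale=1.0)
    return x, y

x_train, y_train = sample(50)        # limited training data
x_test, y_test = sample(100_000)     # large sample approximating the true distribution

clf = KNeighborsClassifier(n_neighbors=1).fit(x_train, y_train)  # very flexible model
print("train acc:", clf.score(x_train, y_train))  # 1.0 by construction
print("test acc:", clf.score(x_test, y_test))     # noticeably lower: overfitting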
Model Error and Model Complexity
William of Ockham (1287-1347)
Linked model complexity to how suitable the model is for
explaining phenomena: "Entia non sunt multiplicanda praeter
necessitatem" (entities must not be multiplied beyond necessity).
Vladimir Vapnik
Showed a formal relation between model complexity
(measured as the VC-dimension) and the error of a classifier.
Complexity and Generalization Error
Generalization Bound [Vapnik]
Let h denote the VC-dimension of the model class F. The difference between
the true error E_true(θ) and the training error E_train(θ) is, with probability
1 − η, upper-bounded as:

E_{\mathrm{true}}(\theta) - E_{\mathrm{train}}(\theta) \leq \sqrt{\frac{h\left(\log\frac{2N}{h} + 1\right) - \log(\eta/4)}{N}}

The VC-dimension h defines the complexity (or flexibility) of the class of
considered models.
Factors that reduce the gap between test error Etrue (θ) and training error
Etrain (θ):
▶ Lowering the VC-dimension h.
▶ Increasing the number of data points N.
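As a quick numerical illustration of the two factors above (purely for intuition; the constants follow the bound as stated, with η a chosen confidence parameter):

import numpy as np

def vc_gap_bound(h, N, eta=0.05):
    # Upper bound on E_true - E_train for VC-dimension h and N samples,
    # holding with probability 1 - eta (formula above)
    return np.sqrt((h * (np.log(2 * N / h) + 1) - np.log(eta / 4)) / N)

# The gap shrinks with more data and grows with model complexity
print(vc_gap_bound(h=100, N=1_000))    # larger gap
print(vc_gap_bound(h=100, N=100_000))  # smaller gap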
Characterizing Complexity (One-Layer Networks)
Interpretation:
▶ Model complexity can be kept low if the input data is low-dimensional
or if the model builds a large margin (i.e. has low sensitivity).
Question:
▶ Can we build similar concepts for deep neural networks?
Reducing Complexity via Low Dimensionality
[Diagram: hard-coded feature extraction producing a few features a1, a2, followed by a neural network with learned parameters.]
Approach:
▶ First, generate a low-dimensional representation by extracting a few
features from the high-dimensional input data (either hand-designed, or
automatically generated using methods such as PCA).
▶ Then, learn a neural network on the resulting low-dimensional data.
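A minimal sketch of this two-step approach using scikit-learn (the dataset, dimensionalities, and architecture are placeholders):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5000))     # high-dimensional inputs (e.g. gene expression)
y = rng.integers(0, 2, size=200)     # binary labels (placeholder)

# Step 1: low-dimensional representation via PCA (could also be hand-designed features)
pca = PCA(n_components=10).fit(X)
Z = pca.transform(X)

# Step 2: learn a neural network on the resulting low-dimensional data
clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500).fit(Z, y)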
Reducing Complexity via Low Dimensionality
Observations:
▶ Building low-dimensional representations can be useful when predicting
from noisy high-dimensional data such as gene expression in biology
(d > 20000).
▶ On other tasks such as image recognition, low-dimensional
representations can also remove class-relevant information (e.g. edges).
Reducing Complexity by Reducing Sensitivity
[Diagram: a neural network reading the pixels a1, ..., ad directly, with learned parameters and regularization.]
Weight Decay [4]:
▶ Include in the objective a term that makes the weights tend to zero if
they are not necessary for the prediction task.
E(\theta) = \sum_{i=1}^{N} \left( f(x_i, \theta) - t_i \right)^2 + \lambda \|\theta\|^2
▶ The higher the parameter λ, the more the model's sensitivity to
variations in the input domain is reduced.
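A minimal PyTorch sketch of this objective (the model, data, and λ are placeholders; in practice one can equivalently pass weight_decay to the optimizer):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
lam = 1e-3  # regularization strength lambda

x = torch.randn(64, 10)   # placeholder batch
t = torch.randn(64, 1)

optimizer.zero_grad()
pred = model(x)
mse = ((pred - t) ** 2).sum()                          # data term
l2 = sum((p ** 2).sum() for p in model.parameters())   # ||theta||^2
loss = mse + lam * l2                                  # weight decay objective
loss.backward()
optimizer.step()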
Reducing Complexity by Reducing Sensitivity
Dropout [7]:
▶ Alternative to weight decay, which consists of adding artificial
multiplicative noise to the input and intermediate neurons, and training
the model subject to that noise.
▶ This is achieved by inserting a dropout layer in the neural network,
which multiplies each input (or activation) aj by a random variable
bj ∼ Bernoulli(p), yielding zj = bj · aj.
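A minimal PyTorch sketch (the architecture and dropout rate are placeholders; note that torch.nn.Dropout takes the probability of zeroing a unit, i.e. the keep probability of the slide corresponds to 1 - p here):

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # each activation is multiplied by b ~ Bernoulli(1 - p), then rescaled
    nn.Linear(256, 10),
)

model.train()                    # dropout is active during training
out = model(torch.randn(32, 784))

model.eval()                     # dropout is disabled (identity) at test time
out = model(torch.randn(32, 784))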
Reducing Complexity by Reducing Sensitivity
[Figure: effect of dropout on performance on the MNIST dataset.]
Note:
▶ On neural networks for image data, dropout tends to yield superior
performance compared to simple weight decay.
Choosing a Model with Appropriate Complexity
Holdout Validation:
▶ Train multiple neural network models with different regularization
parameters (e.g. λ), and retain the one that performs best on
some validation set disjoint from the training data (as sketched below).
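A minimal sketch of this selection loop (the candidate λ values, model, and data splits are placeholders):

import torch
import torch.nn as nn

def train_model(lam, x_tr, t_tr, epochs=100):
    # Train a small regression network with weight decay strength lam
    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=lam)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((model(x_tr) - t_tr) ** 2).mean()
        loss.backward()
        opt.step()
    return model

x_tr, t_tr = torch.randn(200, 10), torch.randn(200, 1)   # training split
x_val, t_val = torch.randn(50, 10), torch.randn(50, 1)   # disjoint validation split

best_model, best_err = None, float("inf")
for lam in [0.0, 1e-4, 1e-3, 1e-2]:                      # candidate regularization parameters
    model = train_model(lam, x_tr, t_tr)
    with torch.no_grad():
        val_err = ((model(x_val) - t_val) ** 2).mean().item()
    if val_err < best_err:
        best_model, best_err = model, val_err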
Problem:
▶ Training a model for each parameter λ can be costly. One would
potentially benefit from training a bigger model only once.
Accelerating Model Selection
Early Stopping Technique [6]:
▶ View the iterative procedure for training a neural network as generating
a sequence of increasingly complex models θ1, ..., θT.
▶ Monitor the validation error throughout training and keep a snapshot
of the model when it had the lowest validation error.
Early stopping:
θ⋆ = None
E⋆ = ∞
for t = 1 ... T do
    Run a few SGD steps, and collect the current parameters θt
    if Eval(θt) < E⋆ then
        θ⋆ ← θt
        E⋆ ← Eval(θt)
    end if
end for
Advantage:
▶ No need to train several models (e.g. with different λ's). Only one
training run is needed!
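A minimal PyTorch sketch of this procedure (the model, data, and number of steps are placeholders):

import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

x_tr, t_tr = torch.randn(200, 10), torch.randn(200, 1)   # placeholder splits
x_val, t_val = torch.randn(50, 10), torch.randn(50, 1)

best_state, best_err = None, float("inf")
for t in range(100):                                      # T outer iterations
    # Run a few SGD steps
    for _ in range(10):
        opt.zero_grad()
        loss = ((model(x_tr) - t_tr) ** 2).mean()
        loss.backward()
        opt.step()
    # Keep a snapshot of the model with the lowest validation error so far
    with torch.no_grad():
        val_err = ((model(x_val) - t_val) ** 2).mean().item()
    if val_err < best_err:
        best_err = val_err
        best_state = copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)   # restore the best snapshot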
Very Large Models
▶ When the model becomes very large, there is an interesting 'double
descent' [2] phenomenon that occurs in the context of neural networks,
where the generalization error starts to go down again as model
complexity increases.
▶ This can be interpreted as some implicit averaging between the many
components of the model (interpolating regime).
▶ Increasing model size substantially may thus help to achieve a lower
test-set error, even without further regularization techniques.
Part 2: Imbalances between Subgroups
Data from Multiple Domains
▶ The data might come from different
domains (P, Q).
▶ Domains may e.g. correspond to
different acquisition devices, or
different ways they are
configured/calibrated.
▶ One of the domains may be
overrepresented in the available
data, or the ML model may learn
better on a given domain at the
expense of another domain.
Image source: Aubreville et al. Quantifying the Scanner-Induced Domain Gap in Mitosis Detection. CoRR abs/2103.16515 (2021)
Addressing Multiple Domains
Simple Approach (one-layer networks):
▶ Denoting by P and Q the two domains, regularize the ML model
(w^⊤ x) so that both domains generate the same responses on average
at the output:

\min_{w} \; E(w) + \lambda \cdot \left( \mathbb{E}_P[w^\top x] - \mathbb{E}_Q[w^\top x] \right)^2
(aka. moment matching). The approach can be enhanced to include
higher-order moments such as variance, etc.
▶ In practice, more powerful tools exist to constrain distributions more
finely in representation space, such as the Wasserstein distance.
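A minimal PyTorch sketch of the first-moment matching penalty from the formula above (the linear model, the data from domains P and Q, and λ are placeholders):

import torch

w = torch.zeros(10, requires_grad=True)      # parameters of a linear model w^T x
opt = torch.optim.SGD([w], lr=1e-2)
lam = 1.0

x_p, t_p = torch.randn(100, 10), torch.randn(100)   # labeled data from domain P
x_q = torch.randn(100, 10) + 0.5                     # (possibly unlabeled) data from domain Q

opt.zero_grad()
err = ((x_p @ w - t_p) ** 2).mean()                  # task error E(w), here on domain P
gap = (x_p @ w).mean() - (x_q @ w).mean()            # first-moment mismatch of the outputs
loss = err + lam * gap ** 2                          # moment-matching objective
loss.backward()
opt.step()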
Addressing Multiple Domains
More Advanced Approach [1]:
▶ Learn an auxiliary neural network (domain critic φ) that tries to classify
the two domains. Learn the parameters of the feature extractor in a
way that the domain critic φ is no longer able to distinguish between
the two domains.
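One possible sketch of such an adversarial scheme (a simplified alternating-update version; the architecture, loss weighting, and data are placeholders and do not follow [1] exactly):

import torch
import torch.nn as nn

feat = nn.Sequential(nn.Linear(10, 32), nn.ReLU())   # feature extractor
task = nn.Linear(32, 2)                               # task classifier
critic = nn.Linear(32, 2)                             # domain critic (predicts P vs Q)

opt_model = torch.optim.Adam(list(feat.parameters()) + list(task.parameters()), lr=1e-3)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

x_p, y_p = torch.randn(64, 10), torch.randint(0, 2, (64,))   # labeled domain P
x_q = torch.randn(64, 10) + 0.5                               # unlabeled domain Q

for step in range(100):
    z_p, z_q = feat(x_p), feat(x_q)
    dom = torch.cat([torch.zeros(64), torch.ones(64)]).long()     # domain labels

    # 1) Update the critic to distinguish the two domains in feature space
    critic_loss = ce(critic(torch.cat([z_p, z_q]).detach()), dom)
    opt_critic.zero_grad()
    critic_loss.backward()
    opt_critic.step()

    # 2) Update feature extractor + task head: solve the task while fooling the critic
    task_loss = ce(task(z_p), y_p)
    confusion = -ce(critic(torch.cat([z_p, z_q])), dom)           # maximize the critic's error
    loss = task_loss + 0.1 * confusion
    opt_model.zero_grad()
    loss.backward()
    opt_model.step()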
Addressing Multiple Domains
Example:
▶ Example of one particular class of the Office-Caltech dataset and the
different domains from which the data is taken.
▶ Models equipped with a domain critic, although losing performance on
some domains, achieve better worst-case accuracy.
Part 3: Spurious Correlations
Spurious Correlations
▶ Artifact of the distribution of available data (P) where one or several
task-irrelevant input variables are spuriously correlated with the
task-relevant variables.
▶ Spurious correlations are very common in practical datasets, e.g. a
copyright tag occurring only on images of a certain class;
histopathological images of a certain class having been acquired with a
particular device and therefore exhibiting a different color profile, etc.
Spurious Correlations and the Clever Hans Effect
[Figure: available data (P) vs. new data (Q).]
▶ An ML classifier is technically able to classify the available data (P)
using either the correct features or the spurious ones. The ML model
doesn't know a priori which feature (the correct one or the spurious
one) to use. A model that bases its decision strategy on the spurious
feature is "right for the wrong reasons" and is also known as a Clever
Hans classifier.
▶ A Clever Hans classifier may fail to function well on the new data (Q)
where the spurious correlation no longer exists, e.g. horses without
copyright tags, or images of a different class with copyright tags.
Spurious Correlations and the Clever Hans Effect
▶ Test set accuracy doesn't give much information on whether the model
bases its decision on the correct features or exploits the spurious
correlation.
▶ Only an inspection of the decision structure by the user (e.g. using
LRP heatmaps) enables the detection of the flaw in the model [5].
Generating LRP heatmaps
[Figure: LRP applied to a Fisher Vector pipeline (local features, GMM fitting, Fisher vector, normalization + linear SVM / Hellinger's kernel SVM); relevance is redistributed layer by layer, following a conservation principle, to produce a heatmap for an image of class 'bicycle'.]
▶ Explanations are produced using a layer-wise redistribution process
from the output of the model to the input features.
▶ Each layer can have its own redistribution scheme. The redistribution
rules are designed in a way that maximizes explanation quality.
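As an illustration, here is a minimal numpy sketch of one common redistribution rule (the LRP-ε rule for a linear/ReLU layer; the small network and inputs are placeholders, and real pipelines combine several such rules):

import numpy as np

def lrp_epsilon(a, w, b, relevance_out, eps=1e-6):
    # Redistribute relevance from a layer's outputs to its inputs (LRP-epsilon rule)
    z = a @ w + b                                   # forward pre-activations
    s = relevance_out / (z + eps * np.sign(z))      # stabilized ratio
    return a * (w @ s)                              # relevance assigned to each input

rng = np.random.default_rng(0)
a = rng.random(4)                                   # input activations
w1, b1 = rng.normal(size=(4, 3)), np.zeros(3)       # layer 1 (ReLU)
w2, b2 = rng.normal(size=(3, 1)), np.zeros(1)       # layer 2 (output)

h = np.maximum(0, a @ w1 + b1)                      # forward pass
out = h @ w2 + b2

# Backward redistribution: output relevance -> hidden layer -> input features
r_hidden = lrp_epsilon(h, w2, b2, out)
r_input = lrp_epsilon(a, w1, b1, r_hidden)
# Conservation holds approximately: r_input.sum() ≈ out.sum() (zero biases, small eps)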
Mitigating Reliance on Spurious Correlations
Feature Selection / Unlearning:
▶ Retrain without the feature containing the artifact (e.g. crop images
to avoid copyright tags).
▶ Actively look in the model for units (e.g. subsets of neurons) that
respond to the artifact and remove such units from the model (e.g.
[3]).
Dataset Design:
▶ Manually remove the artifact (e.g. copyright tags) from the classes
that contain it, or alternatively, inject the artifact in every class (so
that it cannot be used anymore for discriminating between classes).
▶ Stratify the dataset in a way that the spurious features are present in
all classes in similar proportions.
Learning with Explanation Constraints:
▶ Include an extra term in the objective that penalizes decision strategies
that are based on unwanted features (revealed beforehand by an
explanation technique), as sketched below.
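One way to implement such a constraint is an input-gradient penalty on regions flagged as unwanted (in the spirit of "right for the right reasons" approaches; the mask, model, and weighting below are placeholders):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

x = torch.randn(32, 100)                      # batch of inputs (placeholder)
y = torch.randint(0, 2, (32,))                # labels
mask = torch.zeros(32, 100)
mask[:, 90:] = 1                              # 1 = unwanted features (e.g. artifact region)

x.requires_grad_(True)
logits = model(x)
task_loss = ce(logits, y)

# Gradient of the output scores w.r.t. the inputs (a simple explanation proxy)
grads = torch.autograd.grad(logits.sum(), x, create_graph=True)[0]

# Penalize any reliance on the masked (unwanted) features
expl_penalty = ((grads * mask) ** 2).sum()

loss = task_loss + 0.1 * expl_penalty
opt.zero_grad()
loss.backward()
opt.step()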
Summary
▶ While deep learning can in principle fit very complex prediction
functions, how models perform in practice is in large part determined
by the amount and quality of the data.
▶ Limited data may subject the ML model to overfitting and lead to
lower performance on new data. Various methods exist to prevent
overfitting (e.g. generating a low-dimensional input representation, or
building a model with limited sensitivity to its input, e.g. via weight
decay or dropout).
▶ Another problem is dataset bias, where certain parts of the distribution
are over-/under-represented, or plagued with spurious correlations.
Reliance of the model on spurious correlations can lead to low test
performance, but this can be detected by Explainable AI approaches. A
number of methods exist to reduce reliance on spurious correlations.
References
[1] L. Andéol, Y. Kawakami, Y. Wada, T. Kanamori, K.-R. Müller, and G. Montavon.
Learning domain invariant representations by joint wasserstein distance minimization.
Neural Networks, 167:233–243, 2023.
[2] M. Belkin, D. Hsu, S. Ma, and S. Mandal.
Reconciling modern machine-learning practice and the classical bias–variance trade-off.
PNAS, 116(32):15849–15854, 2019.
[3] P. Chormai, J. Herrmann, K. Müller, and G. Montavon.
Disentangled explanations of neural network predictions by finding relevant subspaces.
CoRR, abs/2212.14855, 2022.
[4] A. Krogh and J. A. Hertz.
A simple weight decay can improve generalization.
In NIPS, pages 950–957, 1991.
[5] S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.-R. Müller.
Unmasking clever hans predictors and assessing what machines really learn.
Nature Communications, 10(1), Mar. 2019.
[6] L. Prechelt.
Early stopping - but when?
In Neural Networks: Tricks of the Trade, volume 1524 of LNCS, pages 55–69. Springer, 1996.
[7] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov.
Dropout: a simple way to prevent neural networks from overfitting.
J. Mach. Learn. Res., 15(1):1929–1958, 2014.