Lecture 1 – Introduction
1.1 Introduction
Team
Lecturer:
I Prof. Dr.-Ing. Andreas Geiger
TAs:
I Dr. Joo-Ho Lee
I Songyou Peng
I Aditya Prakash
I Christian Reiser
I Axel Sauer
Exercises
I Every 2 weeks (6 assignments in total)
I Handed out on Wednesdays via ILIAS and introduced via Zoom
I Q&A every other Wednesday via Zoom
I Can be conducted in groups of up to 2 students
I No sharing across groups
I Every group member must submit the solution
I Find a partner via ILIAS booking pool
I Assignments involve pen & paper as well as coding tasks
I Assignments 1-3: Educational Deep Learning Framework (Python/NumPy)
I Assignments 4-6: PyTorch (Google Colab)
I 50% must be successfully completed to participate in the exam
I 75% successfully completed leads to a 0.3 bonus in the exam
Lecture Notes
Books:
I Goodfellow, Bengio, Courville: Deep Learning
http://www.deeplearningbook.org
Materials & Credits
Courses:
I McAllester (TTI-C): Fundamentals of Deep Learning
http://mcallester.github.io/ttic-31230/Fall2020/
Materials & Credits
Tutorials:
I The Python Tutorial
https://docs.python.org/3/tutorial/
I NumPy Quickstart
https://numpy.org/devdocs/user/quickstart.html
I PyTorch Tutorial
https://pytorch.org/tutorials/
I LaTeX / Overleaf Tutorial
https://www.overleaf.com/learn
Frameworks / IDEs:
I Visual Studio Code
https://code.visualstudio.com/
I Google Colab
https://colab.research.google.com
Prerequisites
Linear Algebra:
I Vectors: $x, y \in \mathbb{R}^n$
I Matrices: $A, B \in \mathbb{R}^{m \times n}$
I Operations: $A^\top$, $A^{-1}$, $\mathrm{Tr}(A)$, $\det(A)$, $A + B$, $AB$, $Ax$, $x^\top y$
I Norms: $\|x\|_1$, $\|x\|_2$, $\|x\|_\infty$, $\|A\|_F$
I SVD: $A = U D V^\top$
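The operations above map directly onto NumPy, which is used in the first assignments. A minimal sketch (matrix sizes chosen arbitrarily for illustration):

```python
import numpy as np

# Minimal sketch of the linear algebra prerequisites in NumPy (shapes are arbitrary).
A = np.random.randn(3, 3)
B = np.random.randn(3, 3)
x = np.random.randn(3)
y = np.random.randn(3)

A.T                              # transpose A^T
np.linalg.inv(A)                 # inverse A^{-1} (assuming A is non-singular)
np.trace(A), np.linalg.det(A)    # Tr(A), det(A)
A + B, A @ B, A @ x              # sum, matrix product, matrix-vector product
x @ y                            # inner product x^T y
np.linalg.norm(x, 1), np.linalg.norm(x, 2), np.linalg.norm(x, np.inf)   # vector norms
np.linalg.norm(A, 'fro')         # Frobenius norm
U, D, Vt = np.linalg.svd(A)      # SVD: A = U diag(D) V^T
```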
1.2 History of Deep Learning
A Brief History of Deep Learning
(Timeline 1950–2020: Cybernetics, Connectionism, Deep Learning)
A Brief History of Deep Learning
Backpropagation
I Remains main workhorse today
(Timeline 1950–2020: Minsky/Papert, Neocognitron, Backpropagation)
Rumelhart, Hinton and Williams: Learning representations by back-propagating errors. Nature, 1986.
A Brief History of Deep Learning
LSTM
Hochreiter, Schmidhuber: Long short-term memory. Neural Computation, 1997.
A Brief History of Deep Learning
Convolutional Neural Networks (ConvNets)
I But did not scale up (yet)
LeCun, Bottou, Bengio, Haffner: Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
A Brief History of Deep Learning
ImageNet / AlexNet
I Breakthrough via GPU training, deep models and data
I Sparked deep learning revolution
Krizhevsky, Sutskever, Hinton: ImageNet classification with deep convolutional neural networks. NIPS, 2012.
A Brief History of Deep Learning
Datasets
Geiger, Lenz and Urtasun: Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. CVPR, 2012.
Dosovitskiy et al.: FlowNet: Learning Optical Flow with Convolutional Networks. ICCV, 2015.
A Brief History of Deep Learning
2014: Generalization
I Empirical demonstration that deep representations generalize well despite large number of parameters
I Pre-train CNN on large amounts of data on generic task (e.g., ImageNet)
I Fine-tune (re-train) only last layers on few data of a new task
I State-of-the-art performance
Razavian, Azizpour, Sullivan, Carlsson: CNN Features Off-the-Shelf: An Astounding Baseline for Recognition. CVPR Workshops, 2014.
A Brief History of Deep Learning
2014: Visualization
I Goal: provide insights into what the network (black box) has learned
I Visualized image regions that most strongly activate various neurons at different layers of the network
I Found that higher levels capture more abstract semantic information
Zeiler and Fergus: Visualizing and Understanding Convolutional Networks. ECCV, 2014.
A Brief History of Deep Learning
Deep Reinforcement Learning
Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015.
A Brief History of Deep Learning
2016: WaveNet
I Deep generative model of raw audio waveforms
I Generates speech which mimics human voice
I Generates music
Oord et al.: WaveNet: A Generative Model for Raw Audio. Arxiv, 2016.
A Brief History of Deep Learning
Style Transfer
Gatys, Ecker and Bethge: Image Style Transfer Using Convolutional Neural Networks. CVPR, 2016.
A Brief History of Deep Learning
AlphaGo
Silver et al.: Mastering the game of Go without human knowledge. Nature, 2017.
A Brief History of Deep Learning
Mask R-CNN
He, Gkioxari, Dollár and Girshick: Mask R-CNN. ICCV, 2017.
A Brief History of Deep Learning
BERT / GLUE
I But: Computers still fail in dialogue
Vaswani et al.: Attention is All you Need. NIPS, 2017.
Devlin, Chang, Lee and Toutanova: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Arxiv, 2018.
Wang et al.: GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. ICLR, 2019.
A Brief History of Deep Learning
Turing Award (Bengio, Hinton, LeCun)
A Brief History of Deep Learning
3D Deep Learning
Niemeyer, Mescheder, Oechsle, Geiger: Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision. CVPR, 2020.
A Brief History of Deep Learning
2020: GPT-3
I Language model by OpenAI
I 175 billion parameters
I Text-in / text-out interface
I Many use cases: coding, poetry, blogging, news articles, chatbots
I Controversial discussions
I Licensed exclusively to Microsoft on September 22, 2020
Brown et al.: Language Models are Few-Shot Learners. Arxiv, 2020.
A Brief History of Deep Learning
Current Challenges
I Un-/Self-Supervised Learning
I Interactive learning
I Accuracy (e.g., self-driving)
I Robustness and generalization
I Inductive biases
I Understanding and mathematics
I Memory and compute
I Ethics and legal questions
I Does “Moore’s Law of AI” continue?
1.3 Machine Learning Basics
Goodfellow et al.: Deep Learning, Chapter 5
http://www.deeplearningbook.org/contents/ml.html
Learning Problems
I Supervised learning
I Learn model parameters using a dataset of data-label pairs $\{(x_i, y_i)\}_{i=1}^{N}$
I Examples: Classification, regression, structured prediction
I Unsupervised learning
I Learn model parameters using a dataset without labels $\{x_i\}_{i=1}^{N}$
I Examples: Clustering, dimensionality reduction, generative models
I Self-supervised learning
I Learn model parameters using a dataset of data-data pairs $\{(x_i, x'_i)\}_{i=1}^{N}$
I Examples: Self-supervised stereo/flow, contrastive learning
I Reinforcement learning
I Learn model parameters using active exploration from sparse rewards
I Examples: Deep Q-learning, policy gradients, actor-critic
Supervised Learning
Classification, Regression, Structured Prediction
Classification / Regression:
$$f : \mathcal{X} \to \mathbb{N} \qquad \text{or} \qquad f : \mathcal{X} \to \mathbb{R}$$
Classification
(Example: an image is mapped to the class label "Beach")
Regression
(Example: predicting a continuous value, e.g., 143.52 €)
I Mapping: $f_w : \mathbb{R}^N \to \mathbb{R}$
Structured Prediction
(Example: natural-language sentence "Das Pferd frisst keinen Gurkensalat.", i.e., "The horse does not eat cucumber salad.")
Structured Prediction
(Example: image labeled with "Monkey" and "Can")
Structured Prediction
I Mapping: $f_w : \mathbb{R}^{W \times H \times N} \to \{0,1\}^{M^3}$
I Suppose: $32^3$ voxels, binary variable per voxel (occupied/free)
I Question: How many different reconstructions? $2^{32^3} = 2^{32768}$
I Comparison: Number of atoms in the universe? $\sim 2^{273}$
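To get a feeling for the size of this output space, a quick check in Python (a minimal sketch; the digit count in the comment is approximate):

```python
# Counting the possible binary reconstructions of a 32^3 voxel grid.
num_voxels = 32 ** 3                     # = 32768 binary variables
num_reconstructions = 2 ** num_voxels    # 2^(32^3) = 2^32768
print(num_voxels)                        # 32768
print(len(str(num_reconstructions)))     # ~9865 decimal digits, vs. 2^273 ≈ 10^82 atoms
```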
Linear Regression
Let $\mathcal{X}$ denote a dataset of size N and let $(x_i, y_i) \in \mathcal{X}$ denote its elements ($y_i \in \mathbb{R}$).
Goal: Predict y for a previously unseen input x. The input x may be multidimensional.
(Plot: ground truth function and noisy observations)
Linear Regression
The error function E(w) measures the displacement along the y dimension between the data points (green) and the model f(x, w) (red) specified by the parameters w:
$$f(x, w) = w^\top x \qquad E(w) = \sum_{i=1}^{N} \left( f(x_i, w) - y_i \right)^2 = \sum_{i=1}^{N} \left( x_i^\top w - y_i \right)^2$$
(Plot: linear fit to the noisy observations and ground truth)
Collecting the inputs in a matrix $X$ and the targets in a vector $y$, the gradient of the error is
$$\nabla_w E(w) = 2\, X^\top X w - 2\, X^\top y$$
As E(w) is quadratic and convex in w, its minimizer (wrt. w) is given in closed form:
$$\nabla_w E(w) = 0 \;\Rightarrow\; w = (X^\top X)^{-1} X^\top y$$
The matrix $(X^\top X)^{-1} X^\top$ is also called Moore-Penrose inverse or pseudoinverse.
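A minimal NumPy sketch of this closed-form solution, assuming synthetic 1D data with a bias feature (the data-generating line and noise level are illustrative):

```python
import numpy as np

# Closed-form linear least squares: w = (X^T X)^{-1} X^T y
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
y = 0.8 * x - 0.3 + 0.1 * rng.standard_normal(50)   # noisy linear targets

X = np.stack([np.ones_like(x), x], axis=1)          # design matrix with bias feature
w = np.linalg.pinv(X) @ y                           # Moore-Penrose pseudoinverse
# equivalently: w = np.linalg.solve(X.T @ X, X.T @ y)
print("estimated w:", w)                            # close to [-0.3, 0.8]
```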
Example: Line Fitting
(Left plot: linear fit to the noisy observations and ground truth; right plot: error curve over $w_1$ with its minimum)
$$f(x, w) = \sum_{j=0}^{M} w_j x^j = w^\top x \qquad \text{with features } x = (1, x^1, x^2, \ldots, x^M)^\top$$
Tasks:
I Training: Estimate w from dataset $\mathcal{X}$
I Inference: Predict y for novel x given estimated w
Note:
I Features can be anything, including multi-dimensional inputs (e.g., images, audio), radial basis functions, sine/cosine functions, etc. In this example: monomials.
Polynomial Curve Fitting
$$f(x, w) = \sum_{j=0}^{M} w_j x^j = w^\top x \qquad \text{with features } x = (1, x^1, x^2, \ldots, x^M)^\top$$
$$E(w) = \sum_{i=1}^{N} \left( f(x_i, w) - y_i \right)^2$$
Polynomial Curve Fitting
The error function from above is quadratic in w, but not in x:
$$E(w) = \sum_{i=1}^{N} \left( f(x_i, w) - y_i \right)^2 = \sum_{i=1}^{N} \left( w^\top x_i - y_i \right)^2 = \sum_{i=1}^{N} \Big( \sum_{j=0}^{M} w_j x_i^j - y_i \Big)^2$$
Plots of polynomials of various degrees M (red) fitted to the data (green). We observe underfitting (M = 0/1) and overfitting (M = 9). This is a model selection problem.
Polynomial Curve Fitting
(Plots: polynomial fits for M = 3 and M = 9 together with a held-out test set)
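The model selection problem can be reproduced numerically. A minimal sketch, assuming sin-shaped toy data similar to the plots (data and noise level are illustrative):

```python
import numpy as np

# Fit polynomials of increasing degree M and compare training vs. test error.
rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + 0.15 * rng.standard_normal(10)
x_test = np.linspace(0, 1, 100)
y_test = np.sin(2 * np.pi * x_test) + 0.15 * rng.standard_normal(100)

for M in [0, 1, 3, 9]:
    w = np.polyfit(x_train, y_train, M)              # least-squares polynomial fit
    train_err = np.mean((np.polyval(w, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(w, x_test) - y_test) ** 2)
    print(f"M={M}: train={train_err:.4f}  test={test_err:.4f}")

# Typically: M=0/1 underfit (high train and test error), M=3 fits well,
# M=9 overfits (near-zero train error, much larger test error).
```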
Capacity, Overfitting and Underfitting
Goal:
I Perform well on new, previously unseen inputs (test set, blue), not only on the training set (green)
I This is called generalization and separates ML from optimization
I Assumption: training and test data are drawn independently and identically distributed (i.i.d.) from a distribution $p_{\text{data}}(x, y)$
(Plot: ground truth, noisy training observations and test set)
Capacity, Overfitting and Underfitting
Terminology:
I Capacity: Complexity of functions which can be represented by model f
I Underfitting: Model too simple, does not achieve low error on the training set
I Overfitting: Training error small, but test error (= generalization error) large
(Plots: polynomial fits for M = 1, 3, 9 with test set; error versus degree of polynomial on a log scale)
Capacity, Overfitting and Underfitting
General Approach: Split the dataset into training, validation and test set
I Choose hyperparameters (e.g., degree of polynomial, learning rate in a neural net, ...) using the validation set. Important: Evaluate only once on the test set (typically not available).
(Diagram: 60% training / 20% validation / 20% test split)
I When the dataset is small, use (k-fold) cross validation instead of a fixed split (see the sketch below).
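A minimal sketch of k-fold cross validation for choosing the polynomial degree M (the data, the value of k and the candidate degrees are illustrative assumptions):

```python
import numpy as np

def cv_error(x, y, degree, k=5, seed=0):
    """Average validation error of a polynomial of given degree over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for i in range(k):
        val = folds[i]                                   # held-out validation fold
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = np.polyfit(x[train], y[train], degree)       # fit on the remaining k-1 folds
        errors.append(np.mean((np.polyval(w, x[val]) - y[val]) ** 2))
    return np.mean(errors)

# Example: pick the degree with the smallest cross-validation error.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)
best_M = min(range(10), key=lambda M: cv_error(x, y, M))
print("selected degree:", best_M)
```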
Ridge Regression
Polynomial Curve Model:
$$f(x, w) = \sum_{j=0}^{M} w_j x^j = w^\top x \qquad \text{with features } x = (1, x^1, x^2, \ldots, x^M)^\top$$
Ridge Regression:
$$E(w) = \sum_{i=1}^{N} \left( f(x_i, w) - y_i \right)^2 + \lambda \sum_{j=0}^{M} w_j^2$$
Plots of a polynomial of degree M = 9 fitted to 10 data points using ridge regression. Left: weak regularization ($\lambda = 10^{-8}$). Right: strong regularization ($\lambda = 10^{3}$).
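The regularized error is still quadratic in w, so a closed-form minimizer exists. Note that the explicit formula $w = (X^\top X + \lambda I)^{-1} X^\top y$ is a standard result not spelled out on the slide; a minimal sketch with illustrative data:

```python
import numpy as np

def ridge_fit(x, y, M, lam):
    """Degree-M polynomial fit with L2-regularized least squares."""
    X = np.vander(x, M + 1, increasing=True)        # features (1, x, ..., x^M)
    A = X.T @ X + lam * np.eye(M + 1)               # regularized normal equations
    return np.linalg.solve(A, X.T @ y)              # w = (X^T X + lam*I)^{-1} X^T y

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.15 * rng.standard_normal(10)

w_weak = ridge_fit(x, y, M=9, lam=1e-8)    # weak regularization: large weights, wiggly fit
w_strong = ridge_fit(x, y, M=9, lam=1e3)   # strong regularization: weights shrunk towards zero
print(np.abs(w_weak).max(), np.abs(w_strong).max())
```

This sketch regularizes all coefficients including the bias, matching the sum over j = 0, ..., M in the objective above.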
Ridge Regression
(Plots: model weights versus regularization weight; training and generalization error versus regularization weight)
Left: With low regularization, parameters can become very large (ill-conditioning).
Right: Select the model with the smallest generalization error on the validation set.
Estimators, Bias and Variance
Point Estimator:
I A point estimator $g(\cdot)$ is a function that maps a dataset $\mathcal{X}$ to model parameters $\hat{w}$:
$$\hat{w} = g(\mathcal{X})$$
Bias: $\mathrm{bias}(\hat{w}) = \mathbb{E}[\hat{w}] - w$, the deviation of the expected estimate from the true parameters.
Variance: $\mathrm{Var}(\hat{w}) = \mathbb{E}\big[(\hat{w} - \mathbb{E}[\hat{w}])^2\big]$, the spread of the estimate across datasets.
Bias-Variance Dilemma:
I Statistical learning theory tells us that we can't have both low bias and low variance ⇒ there is a trade-off
Estimators, Bias and Variance
(Plots: individual estimates, their mean and the ground truth for weak regularization, $\lambda = 10^{-8}$, and strong regularization, $\lambda = 10$)
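The effect shown in the plots can be reproduced with a small simulation. A minimal sketch, assuming an illustrative sin-shaped ground truth and the two regularization weights from above:

```python
import numpy as np

def ridge_fit(x, y, M, lam):
    X = np.vander(x, M + 1, increasing=True)
    return np.linalg.solve(X.T @ X + lam * np.eye(M + 1), X.T @ y)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
f_true = np.sin(2 * np.pi * x)                       # ground truth function values
X_eval = np.vander(x, 10, increasing=True)

for lam in [1e-8, 10.0]:
    preds = []
    for _ in range(100):                             # 100 independently drawn datasets
        y = f_true + 0.3 * rng.standard_normal(len(x))
        preds.append(X_eval @ ridge_fit(x, y, M=9, lam=lam))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - f_true) ** 2)   # squared bias of the mean prediction
    var = np.mean(preds.var(axis=0))                       # average variance across datasets
    print(f"lambda={lam:g}: bias^2={bias2:.3f}  variance={var:.3f}")

# Weak regularization: low bias but high variance; strong regularization: the opposite.
```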
Maximum Likelihood Estimation
Variations:
I If we were choosing $p_{\text{model}}(y|x, w)$ as a Laplace distribution, we would obtain an estimator that minimizes the $\ell_1$ norm: $\hat{w} = \operatorname*{argmin}_w \|Xw - y\|_1$
I Assuming a Gaussian distribution over the parameters w and performing maximum a-posteriori (MAP) estimation yields ridge regression:
$$\operatorname*{argmax}_w \; p(w|y, x) = \operatorname*{argmax}_w \; p(y|x, w)\, p(w)$$
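A brief sketch of why the second point holds, assuming a Gaussian likelihood with noise variance $\sigma^2$ and a zero-mean Gaussian prior with variance $\sigma_w^2$ (these variances are not specified on the slide):

```latex
\begin{aligned}
\hat{w}_{\mathrm{MAP}}
  &= \operatorname*{argmax}_w \; p(y|x, w)\, p(w)
   = \operatorname*{argmin}_w \; \big[ -\log p(y|x, w) - \log p(w) \big] \\
  &= \operatorname*{argmin}_w \; \frac{1}{2\sigma^2} \sum_{i=1}^{N} (w^\top x_i - y_i)^2
   + \frac{1}{2\sigma_w^2} \|w\|_2^2 + \mathrm{const} \\
  &= \operatorname*{argmin}_w \; \sum_{i=1}^{N} (w^\top x_i - y_i)^2 + \lambda \|w\|_2^2
   \qquad \text{with } \lambda = \sigma^2 / \sigma_w^2 ,
\end{aligned}
```

which is exactly the ridge regression objective.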
Maximum Likelihood Estimation
Remarks:
I Consistency: As the number of training samples approaches infinity ($N \to \infty$), the maximum likelihood (ML) estimate converges to the true parameters
I Efficiency: The ML estimate converges most quickly as N increases
I These theoretical considerations make ML estimators appealing