
Deep Learning

Lecture 1 – Introduction

Prof. Dr.-Ing. Andreas Geiger


Autonomous Vision Group
University of Tübingen / MPI-IS
Thies, Elgharib, Tewari, Theobalt and Niessner: Neural Voice Puppetry: Audio-driven Facial Reenactment. ECCV, 2020. 2
Agenda

1.1 Introduction

1.2 History of Deep Learning

1.3 Machine Learning Basics

3
1.1
Introduction
Team

Lecturer:
I Prof. Dr.-Ing. Andreas Geiger

TAs:
I Dr. Joo-Ho Lee
I Songyou Peng
I Aditya Prakash
I Christian Reiser
I Axel Sauer

5
Contents

Goal: Students gain an understanding of the theoretical and practical concepts of
deep neural networks, including optimization, inference, architectures and
applications. After this course, students should be able to develop and train deep
neural networks, reproduce research results and conduct original research.

I History of deep learning
I Linear/logistic regression
I Multi-layer perceptrons
I Backpropagation
I Loss/Activation Functions
I Optimization and Regularization
I Convolutional Neural Networks
I Recurrent Neural Networks
I Natural Language Processing
I Generative Models
I Graph Neural Networks
I Self-Supervised Learning
6
Organization
I SWS: 2V + 2Ü (2 h lecture + 2 h exercises per week), 6 ECTS
I Lectures held via YouTube (provided 1 week before Lecture Q&A)
I Lecture Q&A and Exercises via Zoom: Wednesdays 12:00-14:00 (starting Nov. 4)
I Exam
I Written (date to be defined, announced on course website)
I To qualify for the exam, a student must register for the course, successfully solve
50% of the exercises and submit lecture notes for one lecture
I All students must participate in the primary exam; only students that failed
the first exam or that provide a medical certificate are admitted to a second exam
I A 0.3 bonus can be obtained upon completion of 75% of the exercises
I Course Website with YouTube, Zoom and ILIAS links:
https://uni-tuebingen.de/en/fakultaeten/mathematisch-naturwissenschaftliche-fakultaet/fachbereiche/informatik/lehrstuehle/autonomous-vision/teaching/lecture-deep-learning/

7
Exercises
I Every 2 weeks (6 assignments in total)
I Handed out on Wednesdays via ILIAS and introduced via Zoom
I Q&A every other Wednesday via Zoom
I Can be conducted in groups of up to 2 students
I No sharing across groups
I Every group member must submit the solution
I Find a partner via ILIAS booking pool
I Assignments involve pen & paper as well as coding tasks
I Assignments 1-3: Educational Deep-Learning Framework (Python NumPy)
I Assignments 4-6: PyTorch (Google Colab)
I 50% must be successfully completed to participate in the exam
I 75% successfully completed leads to a 0.3 bonus in the exam
8
Lecture Notes

I We will collaboratively create lecture notes for this new lecture


I Every student writes a summary for one lecture (required to participate in exam)
I Summaries shall be lightweight, comprehensive and mathematically concise
I Summaries shall use the same notation/symbols as used in the slides
I Only the most important illustrations/graphics shall be included
I The compiled lecture notes should not exceed ∼3 MB per lecture
I TAs will consolidate materials and integrate into one large document
I Assignment of students to lectures in first week based on ILIAS registrations
I Deadline for submission: 7 days after the corresponding lecture date
I Link to Latex / Overleaf template provided on course website
9
Materials & Credits

Books:
I Goodfellow, Bengio, Courville: Deep Learning
http://www.deeplearningbook.org

I Bishop: Pattern Recognition and Machine Learning


http://www.springer.com/gp/book/9780387310732

I Zhang, Lipton, Li, Smola: Dive into Deep Learning


http://d2l.ai

I Deisenroth, Faisal, Ong: Mathematics for Machine Learning


https://mml-book.github.io

I Petersen, Pedersen: The Matrix Cookbook


http://cs.toronto.edu/~bonner/courses/2018s/csc338/matrix_cookbook.pdf

10
Materials & Credits

Courses:
I McAllester (TTI-C): Fundamentals of Deep Learning
http://mcallester.github.io/ttic-31230/Fall2020/

I Leal-Taixe, Niessner (TUM): Introduction to Deep Learning


http://niessner.github.io/I2DL/

I Grosse (UoT): Intro to Neural Networks and Machine Learning


http://www.cs.toronto.edu/~rgrosse/courses/csc321_2018/

I Li (Stanford): Convolutional Neural Networks for Visual Recognition


http://cs231n.stanford.edu/

I Abbeel, Chen, Ho, Srinivas (Berkeley): Deep Unsupervised Learning


https://sites.google.com/view/berkeley-cs294-158-sp20/home

11
Materials & Credits
Tutorials:
I The Python Tutorial
https://docs.python.org/3/tutorial/
I NumPy Quickstart
https://numpy.org/devdocs/user/quickstart.html
I PyTorch Tutorial
https://pytorch.org/tutorials/
I Latex / Overleaf Tutorial
https://www.overleaf.com/learn

Frameworks / IDEs:
I Visual Studio Code
https://code.visualstudio.com/
I Google Colab
https://colab.research.google.com 12
Prerequisites

I Basic computer science skills


I Variables, functions, loops, classes, algorithms

I Basic Python coding skills


I If you haven’t written Python code before, follow:
https://docs.python.org/3/tutorial/

I Basic math skills


I Linear algebra, probability and information theory
I If unsure, please read Chapters 1-4 of:
http://www.deeplearningbook.org

13
Prerequisites

Linear Algebra:
I Vectors: $x, y \in \mathbb{R}^n$
I Matrices: $A, B \in \mathbb{R}^{m \times n}$
I Operations: $A^\top$, $A^{-1}$, $\mathrm{Tr}(A)$, $\det(A)$, $A + B$, $AB$, $Ax$, $x^\top y$
I Norms: $\|x\|_1$, $\|x\|_2$, $\|x\|_\infty$, $\|A\|_F$
I SVD: $A = U D V^\top$
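
For reference, the corresponding NumPy calls (a minimal sketch with arbitrary toy shapes; not part of the original slides):

import numpy as np

A = np.random.randn(3, 3)
B = np.random.randn(3, 3)
x = np.random.randn(3)
y = np.random.randn(3)

At = A.T                           # transpose A^T
Ainv = np.linalg.inv(A)            # inverse A^{-1} (assumes A is non-singular)
tr = np.trace(A)                   # trace Tr(A)
d = np.linalg.det(A)               # determinant det(A)
S, P, Ax = A + B, A @ B, A @ x     # sum, matrix product, matrix-vector product
inner = x @ y                      # inner product x^T y
n1 = np.linalg.norm(x, 1)          # L1 norm
n2 = np.linalg.norm(x, 2)          # L2 norm
ninf = np.linalg.norm(x, np.inf)   # L-infinity norm
nF = np.linalg.norm(A, 'fro')      # Frobenius norm
U, D, Vt = np.linalg.svd(A)        # SVD: A = U diag(D) V^T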

14
Prerequisites

Probability and Information Theory:


I Probability distributions: $P(X = x)$
I Marginal/conditional: $p(x) = \int p(x, y)\,dy$, $\quad p(x, y) = p(x|y)\,p(y)$
I Bayes rule: $p(x|y) = p(y|x)\,p(x) / p(y)$
I Conditional independence: $x \perp\!\!\!\perp y \mid z \;\Leftrightarrow\; p(x, y|z) = p(x|z)\,p(y|z)$
I Expectation: $\mathbb{E}_{x \sim p}[f(x)] = \int_x p(x)\,f(x)\,dx$
I Variance: $\mathrm{Var}(f(x)) = \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])^2\big]$
I Distributions: Bernoulli, Categorical, Gaussian, Laplace
I Entropy: $H(x)$, KL Divergence: $D_{KL}(p\|q)$
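
A small NumPy sketch approximating some of these quantities numerically (illustrative assumptions: a Gaussian for the continuous case, two hand-picked categorical distributions for entropy and KL):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)   # samples x ~ N(0, 1)

f = lambda v: np.sin(2 * np.pi * v)
E_f = f(x).mean()                                  # Monte Carlo estimate of E[f(x)]
Var_f = ((f(x) - E_f) ** 2).mean()                 # Var(f(x)) = E[(f(x) - E[f(x)])^2]

p = np.array([0.7, 0.2, 0.1])                      # categorical p(x)
q = np.array([0.5, 0.3, 0.2])                      # categorical q(x)
H_p = -(p * np.log(p)).sum()                       # entropy H(p)
D_kl = (p * np.log(p / q)).sum()                   # KL divergence D_KL(p || q)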

15
Prerequisites

If you need a refresher, we recommend reading Chapters 1-4 of:


http://www.deeplearningbook.org

16
1.2
History of Deep Learning
A Brief History of Deep Learning

Three waves of development:


I 1940-1970: “Cybernetics” (Golden Age)
I Simple computational models of biological learning, simple learning rules
I 1980-2000: “Connectionism” (Dark Age)
I Intelligent behavior through large number of simple units, Backpropagation
I 2006-now: “Deep Learning” (Revolution Age)
I Deeper networks, larger datasets, more computation, state-of-the-art in many areas

18
A Brief History of Deep Learning

1943: McCulloch and Pitts


I Early model for neural activation
I Linear threshold neuron (binary):

$$f_w(x) = \begin{cases} +1 & \text{if } w^\top x \geq 0 \\ -1 & \text{otherwise} \end{cases}$$

I More powerful than AND/OR gates


I But no procedure to learn weights
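
To make the definition concrete, a minimal NumPy sketch of such a threshold neuron with hand-set weights (the AND example and the bias encoding are illustrative choices, not from the slides):

import numpy as np

def threshold_neuron(x, w):
    """Return +1 if w^T x >= 0, else -1."""
    return 1 if w @ x >= 0 else -1

# Hand-chosen weights implementing a logical AND of two binary inputs
# (inputs encoded as 0/1, with a constant bias feature appended).
w_and = np.array([1.0, 1.0, -1.5])          # fires only if x1 + x2 >= 1.5
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array([x1, x2, 1.0])             # append bias feature
    print((x1, x2), threshold_neuron(x, w_and))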


McCulloch and Pitts: A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 1943. 19
A Brief History of Deep Learning

1958-1962: Rosenblatt’s Perceptron


I First algorithm and implementation
to train single linear threshold neuron
I Optimization of the perceptron criterion ($\mathcal{M}$: set of misclassified samples):

$$L(w) = -\sum_{n \in \mathcal{M}} w^\top x_n \, y_n$$

I Novikoff proved convergence (for linearly separable data; see the sketch below)
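
A minimal NumPy sketch of the corresponding learning rule (toy data and hyperparameters are illustrative assumptions): each misclassified sample triggers the update $w \leftarrow w + \eta\, y_n x_n$, i.e. stochastic gradient descent on $L(w)$.

import numpy as np

def train_perceptron(X, y, epochs=100, lr=1.0):
    """X: (N, D) inputs with a bias feature appended, y: (N,) labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for x_n, y_n in zip(X, y):
            if y_n * (w @ x_n) <= 0:       # sample is misclassified
                w += lr * y_n * x_n        # perceptron update
                errors += 1
        if errors == 0:                    # converged (linearly separable data)
            break
    return w

# Toy usage: two linearly separable point clouds
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
X = np.hstack([X, np.ones((40, 1))])       # append bias feature
y = np.array([+1] * 20 + [-1] * 20)
w = train_perceptron(X, y)
print(np.mean(np.sign(X @ w) == y))        # training accuracy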




Rosenblatt: The perceptron - a probabilistic model for information storage and organization in the brain. Psychological Review, 1958. 20
A Brief History of Deep Learning

1958-1962: Rosenblatt’s Perceptron


I First algorithm and implementation
to train single linear threshold neuron
I Overhyped: Rosenblatt claimed that
the perceptron will lead to computers
that walk, talk, see, write, reproduce
and are conscious of their existence


Rosenblatt: The perceptron - a probabilistic model for information storage and organization in the brain. Psychological Review, 1958. 20
A Brief History of Deep Learning

1969: Minsky and Papert publish book


I Several discouraging results
I Showed that single-layer perceptrons
cannot solve some very simple
problems (XOR problem, counting)
I Symbolic AI research dominates the 70s



Minsky and Papert: Perceptrons: An introduction to computational geometry. MIT Press, 1969. 21
A Brief History of Deep Learning

1979: Fukushima’s Neocognitron


I Inspired by Hubel and Wiesel
experiments in the 1950s
I Study of visual cortex in cats
I Found that cells are sensitive to
orientation of edges but insensitive
to their position (simple vs. complex)
I H&W received the Nobel Prize in 1981



Fukushima: Neural network model for a mechanism of pattern recognition unaffected by shift in position. IECE (in Japanese), 1979. 22
A Brief History of Deep Learning

1979: Fukushima’s Neocognitron


I Multi-layer processing
to create intelligent behavior
I Simple (S) and complex (C) cells
implement convolution and pooling
I Reinforcement based learning
I Inspiration for modern ConvNets



Fukushima: Neural network model for a mechanism of pattern recognition unaffected by shift in position. IECE (in Japanese), 1979. 22
A Brief History of Deep Learning

1986: Backpropagation Algorithm


I Efficient calculation of gradients in a
deep network wrt. network weights
I Enables application of gradient
based learning to deep networks
I Known since 1961, but
first empirical success in 1986

I Remains the main workhorse today
Rumelhart, Hinton and Williams: Learning representations by back-propagating errors. Nature, 1986. 23
A Brief History of Deep Learning

1997: Long Short-Term Memory


I In 1991, Hochreiter demonstrated the
problem of vanishing/exploding
gradients in his Diploma Thesis
I Led to development of long short-term memory (LSTM) for sequence modeling
I Uses feedback and forget/keep gate
Hochreiter, Schmidhuber: Long short-term memory. Neural Computation, 1997. 24
A Brief History of Deep Learning

1997: Long Short-Term Memory


I In 1991, Hochreiter demonstrated the
problem of vanishing/exploding
gradients in his Diploma Thesis
I Led to development of long short-term memory (LSTM) for sequence modeling
I Uses feedback and forget/keep gate
I Revolutionized NLP (e.g. at Google) many years later (2015)
Hochreiter, Schmidhuber: Long short-term memory. Neural Computation, 1997. 24
A Brief History of Deep Learning

1998: Convolutional Neural Networks


I Similar to Neocognitron, but trained
end-to-end using backpropagation
I Implements spatial invariance via
convolutions and max-pooling
I Weight sharing reduces parameters
I Tanh/Softmax activations
I Good results on MNIST

I But did not scale up (yet)
LeCun, Bottou, Bengio, Haffner: Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998. 25
A Brief History of Deep Learning

2009-2012: ImageNet and AlexNet


ImageNet
I Recognition benchmark (ILSVRC)
I 10 million annotated images
I 1000 categories
AlexNet
I First neural network to win ILSVRC via GPU training, deep models, data
Krizhevsky, Sutskever, Hinton. ImageNet classification with deep convolutional neural networks. NIPS, 2012. 26
A Brief History of Deep Learning

2009-2012: ImageNet and AlexNet


ImageNet
I Recognition benchmark (ILSVRC)
I 10 million annotated images
I 1000 categories
AlexNet
I First neural network to win ILSVRC via GPU training, deep models, data
I Sparked deep learning revolution
Krizhevsky, Sutskever, Hinton. ImageNet classification with deep convolutional neural networks. NIPS, 2012. 26
A Brief History of Deep Learning

2012-now: Golden Age of Datasets


I KITTI, Cityscapes: Self-driving
I PASCAL, MS COCO: Recognition
I ShapeNet, ScanNet: 3D DL
I GLUE: Language understanding
I Visual Genome: Vision/Language
I VisualQA: Question Answering
I MITOS: Breast cancer

Geiger, Lenz and Urtasun. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. CVPR, 2012. 27
A Brief History of Deep Learning

2012-now: Synthetic Data


I Annotating real data is expensive
I Led to surge of synthetic datasets
I Creating 3D assets is also costly

Dosovitskiy et al.: FlowNet: Learning Optical Flow with Convolutional Networks. ICCV, 2015. 28
A Brief History of Deep Learning

2012-now: Synthetic Data


I Annotating real data is expensive
I Led to surge of synthetic datasets
I Creating 3D assets is also costly
I But even very simple 3D datasets
proved tremendously useful for
pre-training (e.g., in optical flow)

Dosovitskiy et al.: FlowNet: Learning Optical Flow with Convolutional Networks. ICCV, 2015. 28
A Brief History of Deep Learning

2014: Generalization
I Empirical demonstration that deep
representations generalize well
despite large number of parameters
I Pre-train CNN on large amounts of
data on generic task (e.g., ImageNet)
I Fine-tune (re-train) only last layers on
few data of a new task

I State-of-the-art performance
Razavian, Azizpour, Sullivan, Carlsson: CNN Features Off-the-Shelf: An Astounding Baseline for Recognition. CVPR Workshops, 2014. 29
A Brief History of Deep Learning

2014: Visualization
I Goal: provide insights into what the
network (black box) has learned
I Visualized image regions that most
strongly activate various neurons at
different layers of the network
I Found that higher levels capture
more abstract semantic information

Zeiler and Fergus: Visualizing and Understanding Convolutional Networks. ECCV, 2014. 30
A Brief History of Deep Learning

2014: Adversarial Examples


I Accurate image classifiers can be
fooled by imperceptible changes
I Adversarial example:

$$x + \operatorname*{argmin}_{\Delta x} \left\{ \|\Delta x\|_2 : f(x + \Delta x) \neq f(x) \right\}$$
I All images classified as “ostrich”
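
As a rough illustration only: the sketch below crafts a perturbation with a single signed-gradient step on the loss (an FGSM-style heuristic), which is simpler than the box-constrained optimization used by Szegedy et al.; `model`, `image` and `true_label` are assumed to be given.

import torch
import torch.nn.functional as F

def fgsm_perturbation(model, x, label, eps=0.01):
    """Return a small perturbation delta (||delta||_inf <= eps) that tends to
    change the prediction of `model` on the batched input `x`."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), label)   # label: tensor of class indices
    loss.backward()
    return eps * x.grad.sign()

# Usage (assuming `model` is any differentiable image classifier):
# delta = fgsm_perturbation(model, image, true_label)
# adversarial_image = image + delta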



Szegedy et al.: Intriguing properties of neural networks. ICLR, 2014. 31
A Brief History of Deep Learning

2014: Domination of Deep Learning


I Machine translation (Seq2Seq)



Sutskever, Vinyals, Quoc: Sequence to Sequence Learning with Neural Networks. NIPS, 2014. 32
A Brief History of Deep Learning

2014: Domination of Deep Learning


I Machine translation (Seq2Seq)
I Deep generative models (VAEs,
GANs) produce compelling images



Goodfellow, Pouget-Abadie, Mirza, Xu, Warde-Farley, Ozair, Courville, Bengio: Generative Adversarial Networks. NIPS, 2014. 32
A Brief History of Deep Learning

2014: Domination of Deep Learning


I Machine translation (Seq2Seq)
I Deep generative models (VAEs,
GANs) produce compelling images



Zhang, Goodfellow, Metaxas, Odena: Self-Attention Generative Adversarial Networks. ICML, 2019. 32
A Brief History of Deep Learning

2014: Domination of Deep Learning


I Machine translation (Seq2Seq)
I Deep generative models (VAEs,
GANs) produce compelling images
I Graph Neural Networks (GNNs)
revolutionize the prediction of
molecular properties



Duvenaud et al.: Convolutional Networks on Graphs for Learning Molecular Fingerprints. NIPS 2015. 32
A Brief History of Deep Learning

2014: Domination of Deep Learning


I Machine translation (Seq2Seq)
I Deep generative models (VAEs,
GANs) produce compelling images
I Graph Neural Networks (GNNs)
revolutionize the prediction of
molecular properties
I Dramatic gains in vision and speech
(Moore’s Law of AI)



Duvenaud et al.: Convolutional Networks on Graphs for Learning Molecular Fingerprints. NIPS 2015. 32
A Brief History of Deep Learning

2015: Deep Reinforcement Learning


I Learning a policy (state→action)
through random exploration and
reward signals (e.g., game score)
I No other supervision
I Success on many Atari games
I But some games remain hard

Mnih et al.: Human-level control through deep reinforcement learning. Nature, 2015. 33
A Brief History of Deep Learning

2016: WaveNet
I Deep generative model
of raw audio waveforms
I Generates speech which
mimics human voice
I Generates music

Oord et al.: WaveNet: A Generative Model for Raw Audio. Arxiv, 2016. 34
A Brief History of Deep Learning

2016: Style Transfer


I Manipulate photograph to adopt
style of another image (painting)
I Uses deep network pre-trained on
ImageNet for disentangling
content from style
I It is fun! Try yourself:
https://deepart.io/

Gatys, Ecker and Bethge: Image Style Transfer Using Convolutional Neural Networks. CVPR, 2016. 35
A Brief History of Deep Learning

2016: AlphaGo defeats Lee Sedol


I Developed by DeepMind
I Combines deep learning with
Monte Carlo tree search
I First computer program to
defeat professional player
I AlphaZero (2017) learns via self-play
and masters multiple games

Silver et al.: Mastering the game of Go without human knowledge. Nature, 2017. 36
A Brief History of Deep Learning

2017: Mask R-CNN


I Deep neural network for joint object
detection and instance segmentation
I Outputs “structured object”, not only
a single number (class label)
I State-of-the-art on MS-COCO

He, Gkioxari, Dollár and Girshick: Mask R-CNN. ICCV, 2017. 37
A Brief History of Deep Learning

2017-2018: Transformers and BERT


I Transformers: Attention replaces
recurrence and convolutions

Vaswani et al.: Attention is All you Need. NIPS 2017. 38
A Brief History of Deep Learning

2017-2018: Transformers and BERT


I Transformers: Attention replaces
recurrence and convolutions
I BERT: Pre-training of language
models on unlabeled text

Devlin, Chang, Lee and Toutanova: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Arxiv, 2018. 38
A Brief History of Deep Learning

2017-2018: Transformers and BERT


I Transformers: Attention replaces
recurrence and convolutions
I BERT: Pre-training of language
models on unlabeled text
I GLUE: Superhuman performance on
some language understanding tasks
(paraphrase, question answering, ..)

I But: Computers still fail in dialogue
Wang et al.: GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. ICLR, 2019. 38
A Brief History of Deep Learning

2018: Turing Award


In 2018, the “Nobel Prize of computing”
was awarded to:
I Yoshua Bengio
I Geoffrey Hinton
I Yann LeCun

39
A Brief History of Deep Learning

2016-2020: 3D Deep Learning


I First models to successfully output
3D representations
I Voxels, point clouds, meshes,
implicit representations
I Prediction of 3D models
even from a single image
I Geometry, materials, light, motion

Niemeyer, Mescheder, Oechsle, Geiger: Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision. CVPR, 2020. 40
A Brief History of Deep Learning

2020: GPT-3
I Language model by OpenAI
I 175 Billion parameters
I Text-in / text-out interface
I Many use cases: coding, poetry,
blogging, news articles, chatbots
I Controversial discussions
I Licensed exclusively to Microsoft
on September 22, 2020

Brown et al.: Language Models are Few-Shot Learners. Arxiv, 2020. 41
A Brief History of Deep Learning

Current Challenges
I Un-/Self-Supervised Learning
I Interactive learning
I Accuracy (e.g., self-driving)
I Robustness and generalization
I Inductive biases
I Understanding and mathematics
I Memory and compute
I Ethics and legal questions
I Does “Moore’s Law of AI” continue?
42
1.3
Machine Learning Basics
Goodfellow et al.: Deep Learning, Chapter 5
http://www.deeplearningbook.org/contents/ml.html
Learning Problems
I Supervised learning
I Learn model parameters using dataset of data-label pairs $\{(x_i, y_i)\}_{i=1}^N$
I Examples: Classification, regression, structured prediction
I Unsupervised learning
I Learn model parameters using dataset without labels $\{x_i\}_{i=1}^N$
I Examples: Clustering, dimensionality reduction, generative models
I Self-supervised learning
I Learn model parameters using dataset of data-data pairs $\{(x_i, x_i')\}_{i=1}^N$
I Examples: Self-supervised stereo/flow, contrastive learning
I Reinforcement learning
I Learn model parameters using active exploration from sparse rewards
I Examples: Deep Q-learning, policy gradient, actor-critic
45
Supervised Learning
Classification, Regression, Structured Prediction
Classification / Regression:

$$f : \mathcal{X} \to \mathbb{N} \quad \text{or} \quad f : \mathcal{X} \to \mathbb{R}$$

I Inputs $x \in \mathcal{X}$ can be any kind of objects
I images, text, audio, sequence of amino acids, . . .
I Output $y \in \mathbb{N}$ / $y \in \mathbb{R}$ is a discrete or real number
I classification, regression, density estimation, . . .

Structured Output Learning:

$$f : \mathcal{X} \to \mathcal{Y}$$

I Inputs $x \in \mathcal{X}$ can be any kind of objects
I Outputs $y \in \mathcal{Y}$ are complex (structured) objects
I images, text, parse trees, folds of a protein, computer programs, . . .
47
Supervised Learning

Input Model Output

48
Supervised Learning

Input Model Output

I Learning: Estimate parameters $w$ from training data $\{(x_i, y_i)\}_{i=1}^N$

48
Supervised Learning

Input Model Output

I Learning: Estimate parameters $w$ from training data $\{(x_i, y_i)\}_{i=1}^N$
I Inference: Make novel predictions: $y = f_w(x)$

48
Classification

Input Model Output

"Beach"

I Mapping: $f_w : \mathbb{R}^{W \times H} \to \{\text{“Beach”}, \text{“No Beach”}\}$

48
Regression

Input Model Output

143,52 €

I Mapping: $f_w : \mathbb{R}^{N} \to \mathbb{R}$

48
Structured Prediction

Input Model Output

"Das Pferd
frisst keinen
Gurkensalat."

I Mapping: $f_w : \mathbb{R}^{N} \to \{1, \ldots, C\}^{M}$

48
Structured Prediction

Input Model Output

Can
Monkey

I Mapping: $f_w : \mathbb{R}^{W \times H} \to \{1, \ldots, C\}^{W \times H}$

48
Structured Prediction

Input Model Output

I Mapping: $f_w : \mathbb{R}^{W \times H \times N} \to \{0, 1\}^{M^3}$
I Suppose: $32^3$ voxels, binary variable per voxel (occupied/free)
I Question: How many different reconstructions? $2^{32^3} = 2^{32768}$
I Comparison: Number of atoms in the universe? $\sim 2^{273}$
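
The counting argument can be checked directly in plain Python (illustrative only):

n_voxels = 32 ** 3                     # 32768 voxels in a 32x32x32 grid
n_reconstructions = 2 ** n_voxels      # 2**32768 possible occupancy patterns
print(n_voxels)                        # 32768
print(n_reconstructions.bit_length())  # 32769 bits, i.e. ~9865 decimal digits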
48
Linear Regression
Linear Regression
Let $\mathcal{X}$ denote a dataset of size $N$ and let $(x_i, y_i) \in \mathcal{X}$ denote its elements ($y_i \in \mathbb{R}$).
Goal: Predict $y$ for a previously unseen input $x$. The input $x$ may be multidimensional.

[Figure: ground truth function and noisy observations plotted over $x$.]
50
Linear Regression
The error function $E(w)$ measures the displacement along the $y$ dimension between
the data points (green) and the model $f(x, w)$ (red) specified by the parameters $w$:

$$f(x, w) = w^\top \mathbf{x}$$

$$E(w) = \sum_{i=1}^{N} \left( f(x_i, w) - y_i \right)^2 = \sum_{i=1}^{N} \left( \mathbf{x}_i^\top w - y_i \right)^2 = \|Xw - y\|_2^2$$

[Figure: linear fit (red) to ground truth and noisy observations (green).]

Here: $\mathbf{x} = [1, x]^\top \Rightarrow f(x, w) = w_0 + w_1 x$ 51


Linear Regression
The gradient of the error function with respect to the parameters $w$ is given by:

$$\begin{aligned}
\nabla_w E(w) &= \nabla_w \|Xw - y\|_2^2 \\
&= \nabla_w \left( (Xw - y)^\top (Xw - y) \right) \\
&= \nabla_w \left( w^\top X^\top X w - 2\, w^\top X^\top y + y^\top y \right) \\
&= 2\, X^\top X w - 2\, X^\top y
\end{aligned}$$

As $E(w)$ is quadratic and convex in $w$, its minimizer (wrt. $w$) is given in closed form:

$$\nabla_w E(w) = 0 \quad \Rightarrow \quad w = (X^\top X)^{-1} X^\top y$$

The matrix $(X^\top X)^{-1} X^\top$ is also called the Moore-Penrose inverse or pseudoinverse.
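
A minimal NumPy sketch of this closed-form solution on assumed toy data (in practice, np.linalg.lstsq or the pseudoinverse is preferred over explicitly inverting $X^\top X$, which can be ill-conditioned):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 0.5 + 2.0 * x + rng.normal(0, 0.1, size=50)   # noisy line (toy data)

X = np.stack([np.ones_like(x), x], axis=1)        # features [1, x] per sample
w_closed = np.linalg.inv(X.T @ X) @ X.T @ y       # w = (X^T X)^{-1} X^T y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # numerically preferable
print(w_closed, w_lstsq)                          # both close to [0.5, 2.0]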
52
Example: Line Fitting
Line Fitting
Linear least squares fit of model $f(x, w) = w_0 + w_1 x$ (red) to data points (green).
Errors are also shown in red. Right: Error function $E(w)$ wrt. parameter $w_1$ (error curve with its minimum marked).
54
Example: Polynomial Curve Fitting
Polynomial Curve Fitting

Let us choose a polynomial of order M to model dataset X :

$$f(x, w) = \sum_{j=0}^{M} w_j x^j = w^\top \mathbf{x} \qquad \text{with features} \quad \mathbf{x} = (1, x^1, x^2, \ldots, x^M)^\top$$

Tasks:
I Training: Estimate w from dataset X
I Inference: Predict y for novel x given estimated w
Note:
I Features can be anything, including multi-dimensional inputs (e.g., images, audio),
radial basis functions, sine/cosine functions, etc. In this example: monomials.

56
Polynomial Curve Fitting

Let us choose a polynomial of order M to model the dataset X :

$$f(x, w) = \sum_{j=0}^{M} w_j x^j = w^\top \mathbf{x} \qquad \text{with features} \quad \mathbf{x} = (1, x^1, x^2, \ldots, x^M)^\top$$

How can we estimate w from X ?


I Define an error function, e.g.:

$$E(w) = \sum_{i=1}^{N} \left( f(x_i, w) - y_i \right)^2$$

I Goal: Optimize error function wrt. the parameters w.

57
Polynomial Curve Fitting
The error function from above is quadratic in $w$ but not in $x$:

$$E(w) = \sum_{i=1}^{N} \left( f(x_i, w) - y_i \right)^2 = \sum_{i=1}^{N} \left( w^\top \mathbf{x}_i - y_i \right)^2 = \sum_{i=1}^{N} \Bigg( \sum_{j=0}^{M} w_j x_i^j - y_i \Bigg)^2$$

It can be rewritten in matrix-vector notation (i.e., as a linear regression problem)

$$E(w) = \|Xw - y\|_2^2$$

with feature matrix $X$, observation vector $y$ and weight vector $w$:

$$X = \begin{pmatrix} \vdots & \vdots & \vdots & & \vdots \\ 1 & x_i & x_i^2 & \cdots & x_i^M \\ \vdots & \vdots & \vdots & & \vdots \end{pmatrix}, \qquad
y = \begin{pmatrix} \vdots \\ y_i \\ \vdots \end{pmatrix}, \qquad
w = \begin{pmatrix} w_0 \\ \vdots \\ w_M \end{pmatrix}$$
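
A short NumPy sketch of fitting such a polynomial as linear regression in the monomial features (toy data and degree are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=10)

M = 3
X = np.vander(x, N=M + 1, increasing=True)     # rows (1, x_i, x_i^2, ..., x_i^M)
w = np.linalg.lstsq(X, y, rcond=None)[0]       # minimize ||Xw - y||_2^2

x_test = np.linspace(0, 1, 5)
y_pred = np.vander(x_test, N=M + 1, increasing=True) @ w   # inference on new inputs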
58
Polynomial Curve Fitting Results
Polynomial Curve Fitting
[Figure: polynomial fits (red) for M = 0 (left) and M = 1 (right), shown with ground truth, noisy observations and test set.]

Plots of polynomials of various degrees M (red) fitted to the data (green). We observe
underfitting (M = 0/1) and overfitting (M = 9). This is a model selection problem.
60
Polynomial Curve Fitting
[Figure: polynomial fits (red) for M = 3 (left) and M = 9 (right), shown with ground truth, noisy observations and test set.]

Plots of polynomials of various degrees M (red) fitted to the data (green). We observe
underfitting (M = 0/1) and overfitting (M = 9). This is a model selection problem.
60
Capacity, Overfitting and Underfitting
Goal:
I Perform well on new, previously unseen inputs (test set, blue), not only on the training set (green)
I This is called generalization and separates ML from optimization
I Assumption: training and test data are independent and identically distributed (i.i.d.), drawn from distribution $p_{data}(x, y)$
I Here: $p_{data}(x) = U(0, 1)$ and $p_{data}(y|x) = \mathcal{N}(\sin(2\pi x), \sigma)$

[Figure: ground truth, noisy observations (training set) and test set over $x \in [0, 1]$.]

61
Capacity, Overfitting and Underfitting
Terminology:
I Capacity: Complexity of functions which can be represented by model f
I Underfitting: Model too simple, does not achieve low error on training set
I Overfitting: Training error small, but test error (= generalization error) large
[Figure: polynomial fits for M = 1 (capacity too low), M = 3 (capacity about right) and M = 9 (capacity too high).]


62
Capacity, Overfitting and Underfitting
Example: Generalization error for various polynomial degrees M
I Model selection: Select model with the smallest generalization error
[Figure: training error and generalization error (log scale) as a function of the degree of the polynomial (0 to 9).]
63
Capacity, Overfitting and Underfitting
General Approach: Split dataset into training, validation and test set
I Choose hyperparameters (e.g., degree of polynomial, learning rate in neural net, ..)
using validation set. Important: Evaluate once on test set (typically not available).
[Figure: dataset split into 60% training, 20% validation and 20% test.]

I When dataset is small, use (k-fold) cross validation instead of fixed split.
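
A minimal NumPy sketch of k-fold cross validation for selecting the polynomial degree M (helper names and toy setup are illustrative assumptions):

import numpy as np

def kfold_indices(n, k, seed=0):
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cv_error(x, y, M, k=5):
    """Mean validation error of a degree-M polynomial fit over k folds."""
    folds = kfold_indices(len(x), k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        X_tr = np.vander(x[train], N=M + 1, increasing=True)
        X_va = np.vander(x[val], N=M + 1, increasing=True)
        w = np.linalg.lstsq(X_tr, y[train], rcond=None)[0]
        errors.append(np.mean((X_va @ w - y[val]) ** 2))
    return np.mean(errors)

# Model selection: pick the degree with the smallest cross-validation error
# best_M = min(range(10), key=lambda M: cv_error(x, y, M))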
64
Ridge Regression
Ridge Regression
Polynomial Curve Model:

$$f(x, w) = \sum_{j=0}^{M} w_j x^j = w^\top \mathbf{x} \qquad \text{with features} \quad \mathbf{x} = (1, x^1, x^2, \ldots, x^M)^\top$$

Ridge Regression:

$$E(w) = \sum_{i=1}^{N} \left( f(x_i, w) - y_i \right)^2 + \lambda \sum_{j=0}^{M} w_j^2 = \|Xw - y\|_2^2 + \lambda \|w\|_2^2$$

I Idea: Discourage large parameters by adding a regularization term with strength $\lambda$
I Closed form solution: $w = (X^\top X + \lambda I)^{-1} X^\top y$ (see the sketch below)
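
A minimal NumPy sketch of ridge regression via the closed-form solution above (toy data assumed):

import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam I) w = X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=10)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=10)
X = np.vander(x, N=10, increasing=True)      # degree M = 9 features

w_weak = ridge_fit(X, y, lam=1e-8)           # weak regularization: large weights, overfits
w_strong = ridge_fit(X, y, lam=1e3)          # strong regularization: small weights, underfits
print(np.abs(w_weak).max(), np.abs(w_strong).max())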
66
Ridge Regression
Plots of a polynomial with degree M = 9 fitted to 10 data points using ridge regression.
Left: weak regularization ($\lambda = 10^{-8}$). Right: strong regularization ($\lambda = 10^{3}$).
67
Ridge Regression
[Figure, left: model weights vs. regularization weight. Right: training and generalization error vs. regularization weight.]

Left: With low regularization, parameters can become very large (ill-conditioning).
Right: Select model with the smallest generalization error on the validation set.
68
Estimators, Bias and Variance
Estimators, Bias and Variance

Point Estimator:
I A point estimator $g(\cdot)$ is a function that maps a dataset $\mathcal{X}$ to model parameters $\hat{w}$:

$$\hat{w} = g(\mathcal{X})$$

I Example: Estimator of the ridge regression model: $\hat{w} = (X^\top X + \lambda I)^{-1} X^\top y$
I We use the hat notation to denote that $\hat{w}$ is an estimate
I A good estimator is a function that returns a parameter set close to the true one
I The data $\mathcal{X} = \{(x_i, y_i)\}$ is drawn from a random process $(x_i, y_i) \sim p_{data}(\cdot)$
I Thus, any function of the data is random and $\hat{w}$ is a random variable.

70
Estimators, Bias and Variance

Properties of Point Estimators:

Bias: $\mathrm{Bias}(\hat{w}) = \mathbb{E}(\hat{w}) - w$
I Expectation over datasets $\mathcal{X}$
I $\hat{w}$ is unbiased $\Leftrightarrow \mathrm{Bias}(\hat{w}) = 0$
I A good estimator has little bias

Variance: $\mathrm{Var}(\hat{w}) = \mathbb{E}(\hat{w}^2) - \mathbb{E}(\hat{w})^2$
I Variance over datasets $\mathcal{X}$
I $\sqrt{\mathrm{Var}(\hat{w})}$ is called the “standard error”
I A good estimator has low variance

Bias-Variance Dilemma:
I Statistical learning theory tells us that we can’t have both ⇒ there is a trade-off (see the numerical sketch below)
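
The following NumPy sketch estimates bias and variance of the ridge estimator empirically by resampling datasets (the data distribution and settings are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([0.5, 2.0])                    # assumed true parameters

def draw_dataset(n=20):
    x = rng.uniform(-1, 1, size=n)
    X = np.stack([np.ones_like(x), x], axis=1)
    y = X @ w_true + rng.normal(0, 0.3, size=n)
    return X, y

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in [1e-8, 10.0]:
    estimates = np.array([ridge(*draw_dataset(), lam) for _ in range(1000)])
    bias = estimates.mean(axis=0) - w_true       # Bias(w_hat) = E[w_hat] - w
    var = estimates.var(axis=0)                  # Var(w_hat) over datasets
    print(lam, bias, var)                        # stronger lambda: more bias, less variance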

71
Estimators, Bias and Variance
[Figure: individual estimates from many datasets for $\lambda = 10^{-8}$ (left) and $\lambda = 10$ (right).]

Ridge regression with weak ($\lambda = 10^{-8}$) and strong ($\lambda = 10$) regularization.
Green: True model. Black: Plot of the model with mean parameters $\bar{w} = \mathbb{E}(w)$.
72
Estimators, Bias and Variance

I There is a bias-variance tradeoff: $\mathbb{E}[(\hat{w} - w)^2] = \mathrm{Bias}(\hat{w})^2 + \mathrm{Var}(\hat{w})$


I Or not? In deep neural networks the test error decreases with network width!
https://www.bradyneal.com/bias-variance-tradeoff-textbooks-update
Neal et al.: A Modern Take on the Bias-Variance Tradeoff in Neural Networks. ICML Workshops, 2019. 73
Maximum Likelihood Estimation
Maximum Likelihood Estimation
I We now reinterpret our results by taking a probabilistic viewpoint
I Let $\mathcal{X} = \{(x_i, y_i)\}_{i=1}^N$ be a dataset with samples drawn i.i.d. from $p_{data}$
I Let the model $p_{model}(y|x, w)$ be a parametric family of probability distributions
I The conditional maximum likelihood estimator for $w$ is given by

$$\hat{w}_{ML} = \operatorname*{argmax}_{w}\; p_{model}(y|X, w)
\overset{\text{iid}}{=} \operatorname*{argmax}_{w} \prod_{i=1}^{N} p_{model}(y_i|x_i, w)
= \operatorname*{argmax}_{w} \underbrace{\sum_{i=1}^{N} \log p_{model}(y_i|x_i, w)}_{\text{Log-Likelihood}}$$
75
Maximum Likelihood Estimation
Example: Assuming $p_{model}(y|x, w) = \mathcal{N}(y \,|\, w^\top x, \sigma)$, we obtain

$$\begin{aligned}
\hat{w}_{ML} &= \operatorname*{argmax}_{w} \sum_{i=1}^{N} \log p_{model}(y_i|x_i, w) \\
&= \operatorname*{argmax}_{w} \sum_{i=1}^{N} \log \left( \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{1}{2\sigma^2} (w^\top x_i - y_i)^2} \right) \\
&= \operatorname*{argmax}_{w} \; -\sum_{i=1}^{N} \frac{1}{2}\log(2\pi\sigma^2) - \sum_{i=1}^{N} \frac{1}{2\sigma^2} \left( w^\top x_i - y_i \right)^2 \\
&= \operatorname*{argmax}_{w} \; -\sum_{i=1}^{N} \left( w^\top x_i - y_i \right)^2 \\
&= \operatorname*{argmin}_{w} \; \|Xw - y\|_2^2
\end{aligned}$$
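
A small numerical sketch (assuming NumPy and SciPy are available) confirming on toy data that minimizing the Gaussian negative log-likelihood recovers the least squares solution:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
X = np.stack([np.ones_like(x), x], axis=1)
y = X @ np.array([0.5, 2.0]) + rng.normal(0, 0.3, size=50)
sigma = 0.3

def neg_log_likelihood(w):
    r = X @ w - y
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2) + r**2 / (2 * sigma**2))

w_ml = minimize(neg_log_likelihood, x0=np.zeros(2)).x     # maximum likelihood
w_ls = np.linalg.lstsq(X, y, rcond=None)[0]               # least squares
print(w_ml, w_ls)   # the two estimates agree (up to optimizer tolerance)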
76
Maximum Likelihood Estimation

We see that choosing $p_{model}(y|x, w)$ to be Gaussian causes maximum likelihood to
yield exactly the same least squares estimator derived before:

$$\hat{w} = \operatorname*{argmin}_{w} \|Xw - y\|_2^2$$

Variations:
I If we were choosing $p_{model}(y|x, w)$ as a Laplace distribution, we would obtain an
estimator that minimizes the $\ell_1$ norm: $\hat{w} = \operatorname*{argmin}_{w} \|Xw - y\|_1$
I Assuming a Gaussian distribution over the parameters $w$ and performing a
maximum a-posteriori (MAP) estimation yields ridge regression:

$$\operatorname*{argmax}_{w}\; p(w|y, x) = \operatorname*{argmax}_{w}\; p(y|x, w)\,p(w)$$
77
Maximum Likelihood Estimation

We see that choosing $p_{model}(y|x, w)$ to be Gaussian causes maximum likelihood to
yield exactly the same least squares estimator derived before:

$$\hat{w} = \operatorname*{argmin}_{w} \|Xw - y\|_2^2$$

Remarks:
I Consistency: As the number of training samples approaches infinity N → ∞,
the maximum likelihood (ML) estimate converges to the true parameters
I Efficiency: The ML estimate converges most quickly as N increases
I These theoretical considerations make ML estimators appealing

77
