
Deep Learning For Computer Vision

Lecture, summer semester 2024 (SS 2024)
Prof. Dr.-Ing. Rainer Stiefelhagen, Dr. Saquib Sarfraz, Dr.-Ing. Alina Roitberg
Computer Vision for HCI Lab – cv:hci, Institut für Anthropomatik & Robotik
ACCESS@KIT – Zentrum für digitale Barrierefreiheit und Assistive Technologien
Institut für Anthropomatik und Robotik, Fakultät für Informatik

KIT – University of the State of Baden-Württemberg and National Research Center of the Helmholtz Association (www.kit.edu)
Lecture 2
NEURAL NETWORK BASICS

Next lecture: live demo

This lecture: neural network basics

Next lecture:
Generalization, Overfitting and How to Approach it.
Live demo: training neural networks in Pytorch
Homework: watch the Pytorch introduction video
https://youtu.be/I1WcY1gX8PM

Lecture Outline

Motivation: end-to-end learning vs. handcrafted features


Single-Layer Perceptron and Deep Neural Networks
Forward-pass
Backpropagation

Image source: https://medium.com/@shrutijadon10104776/survey-on-activation-functions-for-deep-learning-9689331ba092

Lecture Outline

Optimization Strategies:
Batch and Stochastic Gradient Descent
Momentum and Nesterov Accelerated Gradient
Adaptive Learning Rate
Activation Functions
Loss Functions

Image source: http://cs231n.stanford.edu/slides/2016/winter1516_lecture3.pdf

Traditional computer vision

Features are not learned!

Pipeline: input data (pixels) → hand-crafted feature representation → learning algorithm (e.g., SVM) → prediction ("Stapler")

Popular computer vision features

SIFT

HoG

Gabor filters
and many others…
SURF, LBP, color histograms, GLOH

Learning feature hierarchy (1)

Learn a feature hierarchy end-to-end: from image pixels to classifier output

Pipeline: image / video pixels → Layer 1 → Layer 2 → Layer 3 → simple classifier

Train all layers jointly!

Learning feature hierarchy (2)

Learn a hierarchical representation of data

Feature hierarchy
Fill in the representation gap in recognition:

Input layer: pixels
1st layer: "Edges"
2nd layer: "Object parts"
3rd layer: "Objects"

Lee et al., Unsupervised learning of hierarchical representations with convolutional deep belief networks. CACM 2011.

Taxonomy of feature learning
Supervised, shallow: Logistic Regression, Support Vector Machines, Perceptron
Supervised, deep: Supervised Deep Neural Net, Supervised Convolutional Neural Net, Supervised Recurrent Neural Net, Semi-supervised deep learning

Unsupervised, shallow: Sparse coding, Denoising Autoencoder, Restricted Boltzmann machine, Hierarchical Sparse coding
Unsupervised, deep: Stacked De-noising Autoencoder, Deep Belief Nets, Deep Boltzmann machines, Unsupervised deep learning, Self-supervised deep learning
Train/Test/Validation Splits in Machine Learning

Training set
A training data set is a set of examples used during the learning process to fit the parameters (e.g., the weights) of, for example, a classifier.
For supervised deep learning, the annotations of the training samples are used in the loss calculation to provide supervision.
For self-supervised deep learning, no annotations are used during training.
Test set
A test data set is a data set that is independent of the training data set, but that follows a similar probability distribution as the training data set.
Validation set
A validation data set is a set of examples used to tune the hyperparameters (e.g., the architecture) of a classifier. It is sometimes also called the development set or the "dev set".

Single-Layer Perceptron (1)

Inspired by information processing in biological nervous systems

The output of a biological neuron is a spike train.

Figure: spike train of an electrosensory pyramidal neuron in a fish (Eigenmannia)
Image source: physics.gu.se/~frtbm/joomla/media/mydocs/NeuralNetworks.pdf

Single-Layer Perceptron (2)

Inspired by information processing in biological nervous systems

Both the biological and the artificial neuron follow the pattern: input → process → output.

Figure: the inputs x1 … xn are weighted by w1 … wn, summed together with a bias b, and passed through an activation function f to produce the output y.
Image source: physics.gu.se/~frtbm/joomla/media/mydocs/NeuralNetworks.pdf

Single-Layer Perceptron (3)

A very old model (McCulloch-Pitts neuron, 1943)

Signal processing: weighted sum of the inputs, followed by an activation function $f$:

$y = f\left(\sum_{i=1}^{n} w_i x_i + b\right)$
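To make the weighted-sum-plus-activation concrete, here is a minimal Python/NumPy sketch; the weight, bias, and input values and the binary step activation are made up for illustration and are not from the slides.

```python
import numpy as np

def perceptron(x, w, b, f):
    """Single perceptron: weighted sum of the inputs plus bias, passed through activation f."""
    return f(np.dot(w, x) + b)

# Binary step activation (one classic choice for the original perceptron)
step = lambda v: 1.0 if v >= 0 else 0.0

# Illustrative values only
x = np.array([0.5, -1.0, 2.0])   # inputs x1..x3
w = np.array([0.4, 0.3, -0.2])   # weights w1..w3
b = 0.1                          # bias

y = perceptron(x, w, b, step)
print(y)  # 0.0, since 0.2 - 0.3 - 0.4 + 0.1 = -0.4 < 0
```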
Single-Layer Perceptron (4)

Non-linear activation function $f$ (e.g. binary step, Sigmoid, ReLU)

Without an activation function ($f(x) = x$) the network is a linear regression model.
Multi-Layer Perceptron (1)

Combining layers lets us represent non-linear functions

A single-layer perceptron cannot implement simple functions such as XOR
(Explanation and proof: computing.dcu.ie/~humphrys/Notes/Neural/single.neural.html)

Figure: a network with an input layer (x1, x2, x3 and bias b), a hidden layer, and an output layer producing y1, y2.

Truth tables:
AND: (0,0)→0  (0,1)→0  (1,0)→0  (1,1)→1
OR:  (0,0)→0  (0,1)→1  (1,0)→1  (1,1)→1
XOR: (0,0)→0  (0,1)→1  (1,0)→1  (1,1)→0   XOR is not linearly separable!
Multi-Layer Perceptron (2)

Combining layers lets us represent non-linear functions

Forward propagation: $y = f(Wx + b)$
In the following slides, we will use the ReLU activation function: $f(x) = \max(0, x)$

Figure: the input of each node is computed as a weighted sum of the outputs of the previous layer (plus bias), then passed through $f$.

Alternative graphical representations

Neural Networks

A simple network with two hidden layers:

$\mathbf{x}$ – input layer (pixels)
$\mathbf{h}^1$ – first layer hidden units
$\mathbf{h}^2$ – second layer hidden units
$\mathbf{o}$ – output layer

Forward propagation (1)

Compute the network output given the input

$\mathbf{x} \in \mathbb{R}^{D}$, $W^{1} \in \mathbb{R}^{N_1 \times D}$, $\mathbf{b}^{1} \in \mathbb{R}^{N_1}$, $\mathbf{h}^{1} \in \mathbb{R}^{N_1}$

$\mathbf{h}^{1} = \max(0, W^{1}\mathbf{x} + \mathbf{b}^{1})$

The non-linearity $u = \max(0, v)$ is the Rectified Linear Unit (more on this later).

Forward propagation (2)

Compute the network output given the input

$\mathbf{h}^{2} = \max(0, W^{2}\mathbf{h}^{1} + \mathbf{b}^{2})$

$\mathbf{o} = \max(0, W^{3}\mathbf{h}^{2} + \mathbf{b}^{3})$
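A minimal NumPy sketch of this forward pass; the layer sizes are arbitrary placeholders and the weights are randomly initialized purely for illustration.

```python
import numpy as np

def relu(v):
    return np.maximum(0, v)

rng = np.random.default_rng(0)

D, N1, N2, C = 8, 16, 16, 3                    # input dim, hidden sizes, classes (placeholders)
W1, b1 = rng.standard_normal((N1, D)) * 0.1, np.zeros(N1)
W2, b2 = rng.standard_normal((N2, N1)) * 0.1, np.zeros(N2)
W3, b3 = rng.standard_normal((C, N2)) * 0.1, np.zeros(C)

x = rng.standard_normal(D)                     # one input sample

h1 = relu(W1 @ x + b1)                         # h1 = max(0, W1 x + b1)
h2 = relu(W2 @ h1 + b2)                        # h2 = max(0, W2 h1 + b2)
o  = relu(W3 @ h2 + b3)                        # o  = max(0, W3 h2 + b3)
print(o.shape)                                 # (3,)
```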

Alternative graphical representations

Reasoning (1)

Q: Why can’t the mapping between layers be linear?

A: A composition of linear functions is itself linear, so the whole network would collapse to a (linear) regression model.

Q: What does a hidden unit do?

A: It can be thought of as a classifier or feature detector.

Q: How many layers? How many hidden units?

A: These hyper-parameters are best set using cross-validation. In general, wider and deeper networks allow for more complicated "function" mappings.

Reasoning (2)

Q: Why do we need many layers?


A: Data with hierarchical structure is well exploited with a hierarchical
model architecture where intermediate features can be re-used.

[1 0 1 0 0 0 0 1 1 0 0 … ] truck

[0 1 0 1 0 0 0 1 0 1 1 … ] motorbike

Evolution of Network depth

Why are neural networks becoming deeper?

Wide (shallow) neural networks tend to memorize the data and are prone to overfitting
Deep networks learn features at various levels of abstraction and generalize better

Image: medium.com/@Lidinwise/the-revolution-of-depth-facf174924f5

What is a good network for classification?

Figure: x → h1 = max(0, W1 x) → h2 = max(0, W2 h1) → o = max(0, W3 h2) → Softmax → z, compared in the loss with the label y = [0 0 ... 0 1 0 ... 0 0] (1 at index k).

The network output z should match the expected output y (ground truth / label)

• In this case y is a one-hot-encoded vector
• 1 at the correct class index, 0 everywhere else

For classification, a good model has: predicted class = ground truth
(k – index of the correct class, c – number of all classes)

Softmax Layer with 3 classes
Example with 3 classes: the network θ maps the input x through h1, h2 to the logits o = [1, 2, 0.5]; the softmax layer turns them into z = [0.23, 0.63, 0.14], which the loss compares with the label y = [0 0 1].

θ: the network (weights and biases)
o: output of the last linear layer, also called logits
z: softmax output, class "probabilities"
y: ground-truth label

Softmax Layer - Equations
(Same 3-class example as on the previous slide: logits o = [1, 2, 0.5], softmax output z = [0.23, 0.63, 0.14], label y = [0 0 1].)

Softmax: probability that x belongs to class $c_k$:

$z_k = \dfrac{e^{o_k}}{\sum_j e^{o_j}}$

Used in most classification networks; take argmax(z) to get the predicted class index.

Image Classifier (in the next lectures...)
x: image

The same setup with a CNN (e.g., AlexNet) as the network θ: the image is mapped to logits o = [1, 2, 0.5], which the softmax turns into z = [0.23, 0.63, 0.14], interpreted as P("human"), P("dog"), P("cat"); the loss compares z with the label y = [0 0 1].

Softmax: probability that x belongs to class $c_k$: $z_k = \frac{e^{o_k}}{\sum_j e^{o_j}}$
Used in most classification networks; argmax(z) gives the predicted class index.

Loss modeling

Probability of class $k$ given input image $\mathbf{x}$ (softmax):

$p(c_k = 1 \mid \mathbf{x}) = \dfrac{e^{o_k}}{\sum_j e^{o_j}}$

Loss: negative log-likelihood (per sample $\mathbf{x}$):

$L(\mathbf{x}, y; \boldsymbol{\theta}) = -\sum_j y_j \log p(c_j \mid \mathbf{x})$

A good model is

$\boldsymbol{\theta}^{*} = \arg\min_{\boldsymbol{\theta}} \sum_n L(\mathbf{x}^{n}, y^{n}; \boldsymbol{\theta})$
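A small NumPy sketch of these two formulas; the logits reuse the 3-class example values from the earlier slides, and the helper names are my own.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())            # subtract the max for numerical stability
    return e / e.sum()

def nll_loss(p, y):
    """Negative log-likelihood: L = -sum_j y_j * log p(c_j | x)."""
    return -np.sum(y * np.log(p))

o = np.array([1.0, 2.0, 0.5])          # logits from the 3-class example
y = np.array([0.0, 0.0, 1.0])          # one-hot label

p = softmax(o)                         # ~ [0.23, 0.63, 0.14]
print(p.round(2), nll_loss(p, y))      # loss = -log(0.14) ~ 1.96
```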

Training

Model θ: $W^{n}$ and $\mathbf{b}^{n}$ for every layer $n$

Calculate $L(\mathbf{x}, y; \boldsymbol{\theta})$

Update all $W^{n}$ and $\mathbf{b}^{n}$ based on the loss gradient:

$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \dfrac{\partial L}{\partial \boldsymbol{\theta}}$

How to compute the gradients? Backpropagation!
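A hedged sketch of this training step in PyTorch; the layer sizes, data, and learning rate are placeholders, and the full workflow is covered in the live demo of the next lecture.

```python
import torch

model = torch.nn.Sequential(              # Wn, bn for every layer n
    torch.nn.Linear(8, 16), torch.nn.ReLU(),
    torch.nn.Linear(16, 3))
loss_fn = torch.nn.CrossEntropyLoss()     # softmax + negative log-likelihood combined

x = torch.randn(4, 8)                     # dummy mini-batch
y = torch.tensor([0, 2, 1, 2])            # dummy labels

loss = loss_fn(model(x), y)               # calculate L(x, y; theta)
loss.backward()                           # backpropagation computes dL/dtheta

eta = 0.01
with torch.no_grad():                     # theta <- theta - eta * dL/dtheta
    for p in model.parameters():
        p -= eta * p.grad
        p.grad = None
```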

Backpropagation (1)

Network: x → h1 = max(0, W1 x) → h2 = max(0, W2 h1) → o = max(0, W3 h2) → Softmax → z, compared with the one-hot label y = [0 0 ... 0 1 0 ... 0 0] in the loss L.

Computational graph: the input x, the weights W1, W2, W3 and the biases b1, b2, b3 feed into h1, h2, o, z and finally L (together with y).

• For a simple node: if d is a function of a and c, then ∂d/∂a and ∂d/∂c can be computed directly.
• How to compute ∂L/∂Wⁿ and ∂L/∂bⁿ for all layers?

Backpropagation (2): how to compute ∂L/∂Wⁿ and ∂L/∂bⁿ?

Forward (black arrows):
$\mathbf{h}^1 = \max(0, W^1\mathbf{x} + \mathbf{b}^1)$, $\mathbf{h}^2 = \max(0, W^2\mathbf{h}^1 + \mathbf{b}^2)$, $\mathbf{o} = \max(0, W^3\mathbf{h}^2 + \mathbf{b}^3)$,
$z_k = \dfrac{e^{o_k}}{\sum_j e^{o_j}}$, $L = -\sum_j y_j \log z_j$

Computational graph: x, W1, b1 → h1; W2, b2 → h2; W3, b3 → o → z → L (together with the label y).

Softmax derivative (see the link below); combined with the negative log-likelihood loss it gives $\partial L / \partial \mathbf{o} = \mathbf{z} - \mathbf{y}$.
Softmax derivative: https://towardsdatascience.com/derivative-of-the-softmax-function-and-the-categorical-cross-entropy-loss-ffceefc081d1

Backpropagation (3)–(7): backward pass with the chain rule (red arrows)

These steps traverse the same computational graph backwards: starting from the loss, first $\partial L/\partial \mathbf{z}$ and $\partial L/\partial \mathbf{o}$ are computed (using the softmax derivative above), then $\partial L/\partial W^3$, $\partial L/\partial \mathbf{b}^3$ and $\partial L/\partial \mathbf{h}^2$, then $\partial L/\partial W^2$, $\partial L/\partial \mathbf{b}^2$ and $\partial L/\partial \mathbf{h}^1$, and finally $\partial L/\partial W^1$ and $\partial L/\partial \mathbf{b}^1$. At each step, the local derivative is multiplied with the gradient arriving from the layer above (chain rule).

Softmax derivative: https://towardsdatascience.com/derivative-of-the-softmax-function-and-the-categorical-cross-entropy-loss-ffceefc081d1
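To make the chain-rule steps explicit, here is a small NumPy sketch that backpropagates through the last layer of such a network. It uses the standard result that, for softmax combined with the negative log-likelihood, ∂L/∂o = z − y; the layer sizes and values are illustrative, and the ReLU on the output layer is omitted for simplicity.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

rng = np.random.default_rng(1)
h2 = np.abs(rng.standard_normal(5))      # activations of the last hidden layer (post-ReLU)
W3 = rng.standard_normal((3, 5)) * 0.1   # last-layer weights
b3 = np.zeros(3)
y  = np.array([0.0, 1.0, 0.0])           # one-hot label

# Forward
o = W3 @ h2 + b3
z = softmax(o)
L = -np.sum(y * np.log(z))

# Backward with the chain rule
dL_do  = z - y                           # softmax + negative log-likelihood derivative
dL_dW3 = np.outer(dL_do, h2)             # dL/dW3 = dL/do * do/dW3
dL_db3 = dL_do                           # dL/db3 = dL/do * do/db3
dL_dh2 = W3.T @ dL_do                    # gradient propagated further back to earlier layers
print(dL_dW3.shape, dL_db3.shape, dL_dh2.shape)
```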

Creating non-linear functions

1 input & 1 output


100 hidden units / layer

Optimization with Gradient Descent

Modify the network weights $\boldsymbol{\theta}$ based on the gradient of the loss function:

$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \dfrac{\partial L}{\partial \boldsymbol{\theta}}$

Figure: depending on the sign of the loss gradient, a weight is increased, decreased, or left unchanged (at a minimum).

Stochastic Gradient Descent

Also called mini-batch Gradient Descent

Approximate the full sum over the training set with a mini-batch of examples (e.g. 32, 64, …)

Allows "incremental" training in batches

Mini-batch GD will almost certainly converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces

towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3
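A minimal sketch of the mini-batch loop in Python; `grad_fn`, the data arrays, and all hyperparameter values are placeholders, and only the batching logic is the point here.

```python
import numpy as np

def sgd(theta, data, labels, grad_fn, eta=0.01, batch_size=32, epochs=10):
    """Mini-batch SGD: approximate the full-sum gradient with a random mini-batch."""
    n = len(data)
    for _ in range(epochs):
        order = np.random.permutation(n)                 # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            g = grad_fn(theta, data[idx], labels[idx])   # gradient on the mini-batch only
            theta = theta - eta * g
    return theta
```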
Challenges of Gradient Descent

Choosing the right learning rate

Weight update: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \frac{\partial L}{\partial \boldsymbol{\theta}}$
The learning rate $\eta$ is a crucial hyperparameter
If $\eta$ is too large, the loss will fluctuate around the minimum or, in the worst case, diverge
If $\eta$ is too small, training will converge very slowly

https://medium.com/@hiromi_suenaga/deep-learning-2-part-1-lesson-1-602f73869197
towardsdatascience.com/understanding-learning-rates-and-how-it-improves-performance-in-deep-learning-d0d4059c1c10
Challenges of Gradient Descent

Non-convex loss functions

GD will certainly (batch GD) or almost certainly (mini-batch GD) converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces

Most error functions are highly non-convex
Numerous suboptimal local minima are a big problem!

Image source: www.kdnuggets.com/2016/06/visual-explanation-backpropagation-algorithm-neural-networks.html

„Classic“ or Batch Gradient Descent

Given a model with parameters $\boldsymbol{\theta}$, a training dataset with $n$ examples, and $L_i$ denoting the loss for the $i$-th example.
In Batch Gradient Descent, we try to minimize $L(\boldsymbol{\theta}) = \sum_{i=1}^{n} L_i(\boldsymbol{\theta})$
In other words, we need to build the full sum over all examples to update $\boldsymbol{\theta}$

Batch GD is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces
However: the full sum is expensive when $n$ is large! (→ memory size!)
towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3
Momentum

Momentum hyperparameter $\mu$ (usually 0.9)

Takes the gradients from previous steps into account:

$\Delta_t = \mu \Delta_{t-1} - \eta \dfrac{\partial L(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}}$

$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \Delta_t$

Accelerates if the gradient keeps pointing in the same direction (→ faster convergence) and reduces the updates if the gradient changes direction (→ fewer fluctuations)
cs231n.github.io/neural-networks-3/
Sutskever, Ilya, et al. "On the importance of initialization and momentum in deep learning." International
conference on machine learning. 2013.
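The same momentum update written as a small NumPy sketch; `grad` stands for the current mini-batch gradient and the default hyperparameter values are just examples.

```python
import numpy as np

def momentum_step(theta, delta_prev, grad, eta=0.01, mu=0.9):
    """Momentum: delta_t = mu * delta_{t-1} - eta * dL/dtheta, then theta <- theta + delta_t."""
    delta = mu * delta_prev - eta * grad
    return theta + delta, delta

# Usage sketch: carry delta across steps, starting from zeros
theta, delta = np.zeros(4), np.zeros(4)
theta, delta = momentum_step(theta, delta, grad=np.ones(4))
```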
Annealing the learning rate
■ Gradually decreasing the learning rate helps training

Example: learning rate decrease during ResNet training

The starting learning rate is 0.1; it is reduced to 0.01 at 80 epochs and then to 0.001 at 160 epochs
The first reduction brings a clear improvement; the effect of the second reduction is small

cs231n.github.io/neural-networks-3/
Example source: https://github.com/gcr/torch-residual-networks
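A small sketch of such a step schedule; the boundaries (80, 160 epochs) and decay factors follow the ResNet example above, and the function is written by hand here even though deep learning frameworks provide ready-made schedulers.

```python
def step_lr(epoch, base_lr=0.1):
    """Step decay as in the ResNet example: 0.1 -> 0.01 at epoch 80 -> 0.001 at epoch 160."""
    if epoch >= 160:
        return base_lr * 0.01
    if epoch >= 80:
        return base_lr * 0.1
    return base_lr

print(step_lr(10), step_lr(100), step_lr(200))  # 0.1 0.01 0.001
```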
Nesterov accelerated gradient

■ Modification of the Momentum update

■ Instead of computing the gradient at the current position (red dot), calculate the gradient at the approximate future position (green arrow head) and then update:

$\Delta_t = \mu \Delta_{t-1} - \eta \dfrac{\partial L(\boldsymbol{\theta} + \mu \Delta_{t-1})}{\partial (\boldsymbol{\theta} + \mu \Delta_{t-1})}$

$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \Delta_t$
cs231n.github.io/neural-networks-3/
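A matching sketch of the Nesterov variant: the gradient is evaluated at the look-ahead position θ + μΔ. Here `grad_fn` is a placeholder for a function returning the loss gradient at a given parameter vector.

```python
def nesterov_step(theta, delta_prev, grad_fn, eta=0.01, mu=0.9):
    """Nesterov momentum: evaluate the gradient at the approximate future position."""
    lookahead = theta + mu * delta_prev
    delta = mu * delta_prev - eta * grad_fn(lookahead)
    return theta + delta, delta
```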

Adagrad

Adapting the learning rate per weight

Each weight $\theta_i$ is updated with its own learning rate, depending on its past gradients:
Weights with large past gradients: learning rate is reduced
Weights with small past gradients: learning rate is increased

$\theta_{t+1,i} = \theta_{t,i} - \dfrac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \, g_{t,i}$

$G_t$ is a diagonal matrix where element $(i, i)$ is the sum of the squares of the gradients of the corresponding weight $\theta_i$ up to time step $t$; $g_{t,i}$ is the current gradient and $\epsilon$ a small smoothing constant.

cs231n.github.io/neural-networks-3/
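A per-weight view of this update as a NumPy sketch; the accumulator `G` holds the diagonal of $G_t$ as a vector, and the hyperparameter values are placeholders.

```python
import numpy as np

def adagrad_step(theta, G, grad, eta=0.01, eps=1e-8):
    """Adagrad: accumulate squared gradients per weight; the effective learning
    rate eta / sqrt(G + eps) shrinks for weights with large past gradients."""
    G = G + grad ** 2
    theta = theta - eta / np.sqrt(G + eps) * grad
    return theta, G
```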

Further optimization algorithms with adaptive learning rate

Adadelta
Extension of Adagrad with a less aggressive learning rate decay
Adadelta restricts the window of accumulated past gradients to some fixed size $w$

Adam
Uses exponentially decaying averages of the first and second moments of the gradients
Similar to Adadelta combined with Momentum
Currently one of the most popular optimization algorithms

Read more on GD optimization algorithms: http://ruder.io/optimizing-gradient-descent/index.html
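In practice these optimizers are rarely implemented by hand; a hedged PyTorch sketch of selecting one of them is shown below, with a dummy model and loss and purely illustrative hyperparameter values.

```python
import torch

model = torch.nn.Linear(8, 3)                       # placeholder model

# Pick one of the discussed optimizers:
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
# opt = torch.optim.Adagrad(model.parameters(), lr=0.01)
# opt = torch.optim.Adadelta(model.parameters())
# opt = torch.optim.Adam(model.parameters(), lr=1e-3)

loss = model(torch.randn(4, 8)).pow(2).mean()       # dummy loss
loss.backward()                                     # backpropagation
opt.step()                                          # apply the parameter update
opt.zero_grad()
```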

Demo: different GD Optimization Strategies

medium.com/datathings/neural-networks-and-backpropagation-explained-in-a-simple-way-f540a3611f5e

Activation functions

Non-linear functions that should be differentiable (since training is done with backpropagation)
"Classic" functions: Sigmoid, Tanh
Modern functions: ReLU, Leaky ReLU, Maxout and many more…

Image source: https://medium.com/@shrutijadon10104776/survey-on-activation-functions-for-deep-learning-9689331ba092

Activation functions

Sigmoid function
Formulation: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
Large negative numbers become 0 and large positive numbers become 1
Common in the past, rarely used today

Drawbacks:
Vanishing gradients: the function's gradient at either tail (outputs near 0 or 1) is almost zero
Outputs are not zero-centered, which is undesirable since the data fed to later layers is then not zero-centered either: if the input to a neuron is always positive, the weight gradients become either all positive or all negative, resulting in zig-zagging update dynamics.
http://cs231n.github.io/neural-networks-1

Activation functions

Tanh function
Formulation: $\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
Large negative numbers become -1 and large positive numbers become 1
Similar to Sigmoid, but in contrast to Sigmoid it is zero-centered

Drawbacks:
Vanishing gradients (see Sigmoid function)

http://cs231n.github.io/neural-networks-1

Activation functions

Rectified Linear Unit (ReLU)

Formulation: $f(x) = \max(0, x)$
Simple, inexpensive operation (only comparison, addition and multiplication)
Efficient gradient propagation: no vanishing gradient for positive inputs

Drawback: the "dying ReLU" problem

A large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate again
If this happens, the gradient flowing through the unit will forever be zero from that point on.

http://cs231n.github.io/neural-networks-1

Activation functions

Rectified Linear Unit (ReLU)

Greatly accelerates the convergence of stochastic gradient descent compared to the sigmoid/tanh functions (e.g. by a factor of 6 in Krizhevsky et al.)

Figure from the Krizhevsky et al. (2012) paper: 6x improvement in convergence with the ReLU unit compared to the tanh unit

http://cs231n.github.io/neural-networks-1
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional
neural networks." Advances in neural information processing systems. 2012.
Activation functions

Leaky ReLU
Formulation: $f(x) = \mathbb{1}(x < 0)\,(\alpha x) + \mathbb{1}(x \ge 0)\,(x)$, where $\mathbb{1}(\cdot)$ is the indicator function
$\alpha$ is a small constant
An attempt to fix the "dying ReLU" problem: instead of the function being zero for $x < 0$, a leaky ReLU has a small negative slope there

http://cs231n.github.io/neural-networks-1
https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
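The activations discussed so far as short NumPy sketches; the default $\alpha = 0.01$ is just an example value.

```python
import numpy as np

sigmoid    = lambda x: 1.0 / (1.0 + np.exp(-x))
tanh       = np.tanh
relu       = lambda x: np.maximum(0.0, x)
leaky_relu = lambda x, alpha=0.01: np.where(x < 0, alpha * x, x)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x), leaky_relu(x))   # [0. 0. 3.]  [-0.02  0.    3.  ]
```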
Activation functions

Maxout

Formulation: $f(x) = \max(w_1^{T} x + b_1,\ w_2^{T} x + b_2)$
Introduced by Goodfellow et al. (2013)
Generalizes ReLU and Leaky ReLU: e.g. ReLU is Maxout with $w_1 = 0$ and $b_1 = 0$
Fixes the dying ReLU problem

Drawback: doubles the number of parameters per neuron

http://cs231n.github.io/neural-networks-1

Activation functions

In recent years many new activation functions have appeared; many of them are based on ReLU

Overview paper and table source: Mishkin, Dmytro, Nikolay Sergievskiy, and Jiri Matas. "Systematic
evaluation of CNN advances on the ImageNet." arXiv preprint arXiv:1606.02228 (2016).
Activation functions: practical advice

The choice of the activation function matters!

Example: performance comparison with different activation functions on MNIST

Source: https://towardsdatascience.com/exploring-activation-functions-for-neural-networks-73498da59b02

Activation functions: practical advice

What activation function should I use?

Use ReLU! But be aware that "dead" units are possible if the learning rate is not well-adjusted
If this concerns you, try Leaky ReLU or Maxout
Possibly try out Tanh, but expect it to work worse than ReLU or Maxout
Never use sigmoid

http://cs231n.github.io/neural-networks-1
Image source: towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6

Loss functions

Quantifies what it means to have a “good” model


This definition depends on the task!

Different types of Loss functions for different tasks, such as:

Classification
Regression
Metric Learning
Reinforcement Learning

Loss functions

Classification

Predicting a discrete class label

We had until now:

Loss: negative log-likelihood (per sample $\mathbf{x}$)

$L(\mathbf{x}, y) = -\sum_j y_j \log p(c_j \mid \mathbf{x})$

Used in various multiclass classification methods for NN training

Hinge Loss: used in Support Vector Machines (SVMs)

$L(x, y) = \sum_i \max(0,\ 1 - x_i y_i)$

Loss functions

Regression

Predicting one or multiple continuous quantities $y_1 \ldots y_n$

Minimize the distance between the predicted values $x_j$ and the true values $y_j$

L1-Loss (Mean Absolute Error):
$L(\mathbf{x}, y) = \sum_j \lvert y_j - x_j \rvert$

L2-Loss (Mean Squared Error):
$L(\mathbf{x}, y) = \sum_j (y_j - x_j)^{2}$
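Both regression losses as short NumPy sketches; the prediction and target values are made-up examples.

```python
import numpy as np

def l1_loss(x, y):
    """Sum of absolute errors between predictions x and targets y."""
    return np.sum(np.abs(y - x))

def l2_loss(x, y):
    """Sum of squared errors between predictions x and targets y."""
    return np.sum((y - x) ** 2)

x = np.array([1.0, 2.5, 0.0])   # predictions
y = np.array([1.5, 2.0, 1.0])   # targets
print(l1_loss(x, y), l2_loss(x, y))   # 2.0 1.5
```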

Loss functions

Metric Learning / Similarity Learning

A model for measuring the distance (or similarity) between objects

Example: Triplet Loss

Input: three images – an Anchor image $x_a$, a Positive example $x_p$ (similar) and a Negative example $x_n$ (dissimilar)

Find a model that produces representations of $\{x_a, x_p, x_n\}$ such that the distance between $x_a$ and $x_p$ is small and the distance between $x_a$ and $x_n$ is large

Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "Facenet: A unified embedding for face
recognition and clustering." Proceedings of the IEEE conference on computer vision and pattern
recognition. 2015.

Loss functions

Metric Learning / Similarity Learning

Example: Triplet Loss

Loss function (as in the cited FaceNet paper):

$L = \sum_i \max\bigl(0,\ \lVert f(x_i^{a}) - f(x_i^{p}) \rVert_2^{2} - \lVert f(x_i^{a}) - f(x_i^{n}) \rVert_2^{2} + \alpha\bigr)$

$\alpha$ is the margin constant (usually set to 1)


Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "Facenet: A unified embedding for face
recognition and clustering." Proceedings of the IEEE conference on computer vision and pattern
recognition. 2015.
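A hedged NumPy sketch of this loss for a single triplet of embeddings; the embedding values are purely illustrative.

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=1.0):
    """max(0, ||f_a - f_p||^2 - ||f_a - f_n||^2 + alpha) for one (anchor, positive, negative) triplet."""
    d_pos = np.sum((f_a - f_p) ** 2)
    d_neg = np.sum((f_a - f_n) ** 2)
    return max(0.0, d_pos - d_neg + alpha)

f_a = np.array([0.1, 0.9])   # anchor embedding
f_p = np.array([0.2, 0.8])   # positive (similar) embedding
f_n = np.array([0.9, 0.1])   # negative (dissimilar) embedding
print(triplet_loss(f_a, f_p, f_n))   # 0.0: the negative is already far enough away
```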

Summary

Simple Multi-Layer Perceptron Neural Networks

Stochastic Gradient Descent and related learning algorithms
Backpropagation
Activation Functions
Loss Functions

Next lecture:
Generalization, Overfitting and How to Approach it.
Live demo: training neural networks in Pytorch
Homework: watch the Pytorch introduction video
https://youtu.be/I1WcY1gX8PM

Reading / Learning

The slides should give a good overview

Learn in teams, explain things to each other

Papers serve as additional reading / more details

You don't have to read all papers; focus on the main papers (see slides and the summary at the end of the lecture)
Please see the references listed in the slides
Papers can easily be googled

Book recommendation for overview / background

I. Goodfellow, Y. Bengio, A. Courville, "Deep Learning", MIT Press – chapters 6-10, fundamentals: Ch. 5

Basics of backprop, chain rule, …: look it up
E.g. very nice video tutorials from 3blue1brown.com