V02 SS24 DLforCV NN Basics Teil1
Lecture SS 2024
Prof. Dr.-Ing. Rainer Stiefelhagen, Dr. Saquib Sarfraz, Dr.-Ing. Alina Roitberg
Computer Vision for HCI Lab – cv:hci, Institut für Anthropomatik & Robotik
ACCESS@KIT – Zentrum für digitale Barrierefreiheit und Assistive Technologien
Institut für Anthropomatik und Robotik, Fakultät für Informatik
Optimization Strategies:
Batch and Stochastic Gradient Descent
Momentum and Nesterov Accelerated Gradient
Adaptive Learning Rate
Activation Functions
Loss Functions
Feature Learning
Traditional pipeline: input data (pixels) → hand-crafted representation → learning algorithm (e.g., SVM) → prediction ("stapler").
Hand-crafted features: SIFT, HoG, Gabor filters, SURF, LBP, color histograms, GLOH, and many others…
Learned feature hierarchy: input layer (pixels) → 1st layer ("edges") → 2nd layer ("object parts").
Lee et al., Unsupervised learning of hierarchical representations with convolutional deep belief networks. CACM 2011.
Unsupervised methods, arranged along a shallow-to-deep axis:
• Stacked De-noising Autoencoder
• Sparse coding
• Deep Belief Nets
• Denoising Autoencoder
• Deep Boltzmann machines
• Restricted Boltzmann machine
• Hierarchical Sparse coding
• Unsupervised deep learning
• Self-supervised deep learning
Train/Test/Validation Splits in Machine Learning
Training set
A training data set is a set of examples used during the learning process to fit the parameters (e.g., the weights) of, for example, a classifier.
For supervised deep learning, the annotations of the training samples are used in the loss calculation as the supervision signal.
For self-supervised deep learning, no annotations are used during training; the supervision signal is derived from the data itself.
Test set
A test data set is a data set that is independent of the training data set
but follows a similar probability distribution as the training data set.
Validation set
A validation data set is a data set of examples used to tune the hyperparameters (e.g., the architecture) of a classifier.
It is sometimes also called the development set or "dev set".
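A minimal sketch of creating such splits with PyTorch utilities (the toy dataset and the split ratios are illustrative assumptions, not part of the slides):

```python
import torch
from torch.utils.data import TensorDataset, random_split, DataLoader

# Toy dataset: 1000 samples of 3x32x32 images with integer class labels (illustrative).
images = torch.randn(1000, 3, 32, 32)
labels = torch.randint(0, 10, (1000,))
dataset = TensorDataset(images, labels)

# 70% train / 15% validation / 15% test -- the ratios are an assumption for illustration.
n_train = int(0.7 * len(dataset))
n_val = int(0.15 * len(dataset))
n_test = len(dataset) - n_train - n_val
train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n_test], generator=torch.Generator().manual_seed(0)
)

# The training set fits the weights, the validation set tunes hyperparameters,
# and the test set is only used for the final, independent evaluation.
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = DataLoader(val_set, batch_size=64)
test_loader = DataLoader(test_set, batch_size=64)
```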
Perceptron diagram: the inputs x1, …, xn are weighted by w1, …, wn, summed together with a bias b, and passed through an activation function f to produce the output y.
Single-Layer Perceptron (4)
$\boldsymbol{h}_1 = \max(0, W_1 \boldsymbol{x} + \boldsymbol{b}_1)$
Non-linearity $u = \max(0, v)$: Rectified Linear Unit (more later)
$\boldsymbol{h}_2 = \max(0, W_2 \boldsymbol{h}_1 + \boldsymbol{b}_2)$
$\boldsymbol{o} = \max(0, W_3 \boldsymbol{h}_2 + \boldsymbol{b}_3)$
[1 0 1 0 0 0 0 1 1 0 0 … ] truck
[0 1 0 1 0 0 0 1 0 1 1 … ] motorbike
Image: medium.com/@Lidinwise/the-revolution-of-depth-facf174924f5
Network diagram: x → h1 = max(0, W1 x) → h2 = max(0, W2 h1) → o = max(0, W3 h2) → Softmax → z (class probabilities z1 … zk … zc); the loss compares z with the one-hot label y = [0 0 ... 0 1 0 ... 0 0].
Softmax example: the raw class score 2 for "dog" is mapped to the probability 0.63 = P("dog").
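A minimal sketch of this forward pass, following the equations above (the layer sizes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

# Illustrative layer sizes (assumptions, not from the slides).
d_in, d_h1, d_h2, n_classes = 3072, 512, 256, 10

x = torch.randn(d_in)                              # input vector (e.g., flattened pixels)
W1, b1 = torch.randn(d_h1, d_in), torch.zeros(d_h1)
W2, b2 = torch.randn(d_h2, d_h1), torch.zeros(d_h2)
W3, b3 = torch.randn(n_classes, d_h2), torch.zeros(n_classes)

h1 = torch.clamp(W1 @ x + b1, min=0)               # h1 = max(0, W1 x + b1)
h2 = torch.clamp(W2 @ h1 + b2, min=0)              # h2 = max(0, W2 h1 + b2)
o  = torch.clamp(W3 @ h2 + b3, min=0)              # o  = max(0, W3 h2 + b3), as in the slides
z  = F.softmax(o, dim=0)                           # z_k = exp(o_k) / sum_j exp(o_j)

print(z.sum())                                     # the class probabilities sum to 1
```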
A good model is one that minimizes the loss over all training samples:
$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \sum_n L(\boldsymbol{x}_n, y_n; \boldsymbol{\theta})$
Computational graph: each node is a variable computed from its parent nodes, e.g., a node d that depends on nodes a and c is a function of a and c. For the network above, the graph is x → h1 → h2 → o → z → L, with parameters W1, W2, W3 and the label y entering the loss L.
Softmax: $z_k = \frac{e^{o_k}}{\sum_j e^{o_j}}$
Backpropagation traverses this graph backwards: starting from the loss L, the gradient is propagated through z, o, h2, and h1 to the weights W3, W2, W1 and the biases b3, b2, b1.
Softmax derivative: https://towardsdatascience.com/derivative-of-the-softmax-function-and-the-categorical-cross-entropy-loss-ffceefc081d1
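The slides derive these gradients by hand via the chain rule; as a cross-check, a minimal sketch of obtaining the same gradients from the computational graph with PyTorch autograd (sizes and data are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_in, d_h1, d_h2, n_classes = 20, 16, 12, 5   # illustrative sizes

# Parameters W1..W3 and b1..b3 are leaves of the computational graph.
W1 = torch.randn(d_h1, d_in, requires_grad=True);      b1 = torch.zeros(d_h1, requires_grad=True)
W2 = torch.randn(d_h2, d_h1, requires_grad=True);      b2 = torch.zeros(d_h2, requires_grad=True)
W3 = torch.randn(n_classes, d_h2, requires_grad=True); b3 = torch.zeros(n_classes, requires_grad=True)

x = torch.randn(d_in)
y = torch.tensor(3)                           # class index (one-hot y in the slides)

# Forward pass builds the graph x -> h1 -> h2 -> o -> z -> L.
h1 = torch.relu(W1 @ x + b1)
h2 = torch.relu(W2 @ h1 + b2)
o = W3 @ h2 + b3
L = F.cross_entropy(o.unsqueeze(0), y.unsqueeze(0))    # softmax + cross-entropy

# Backward pass propagates dL/d(.) from L back to every parameter.
L.backward()
print(W1.grad.shape, b3.grad.shape)
```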
Gradient descent update: $\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \frac{\partial L}{\partial \boldsymbol{\theta}}$
(Figure: on the loss curve, the weight is increased where the gradient is negative, decreased where it is positive, and left unchanged where it is zero.)
towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3
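The update rule above as a minimal training-loop sketch on a toy problem (the data, loss, and learning rate are illustrative assumptions):

```python
import torch

torch.manual_seed(0)
# Toy regression problem as a stand-in for L(theta); purely illustrative.
X = torch.randn(100, 3)
y_true = X @ torch.tensor([1.0, -2.0, 0.5]) + 0.1 * torch.randn(100)

theta = torch.zeros(3, requires_grad=True)
eta = 0.1                                   # learning rate (assumed value)

for step in range(100):
    loss = ((X @ theta - y_true) ** 2).mean()
    loss.backward()                         # compute dL/dtheta
    with torch.no_grad():
        theta -= eta * theta.grad           # theta <- theta - eta * dL/dtheta
    theta.grad.zero_()                      # reset the gradient for the next step
```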
Challenges of Gradient Descent
https://medium.com/@hiromi_suenaga/deep-learning-2-part-1-lesson-1-602f73869197
towardsdatascience.com/understanding-learning-rates-and-how-it-improves-performance-in-deep-learning-d0d4059c1c10
Challenges of Gradient Descent
Nesterov accelerated gradient:
$\Delta_t = \mu \Delta_{t-1} - \eta \, \frac{\partial L(\boldsymbol{\theta} + \mu \Delta_{t-1})}{\partial (\boldsymbol{\theta} + \mu \Delta_{t-1})}$
$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \Delta_t$
cs231n.github.io/neural-networks-3/
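A minimal sketch of using this velocity-based update in PyTorch via torch.optim.SGD (the model and hyperparameter values are illustrative assumptions):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                    # placeholder model (assumption)

# momentum=mu adds the velocity term; nesterov=True uses the look-ahead gradient.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, nesterov=True)

x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()                            # theta <- theta + Delta_t
```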
Adadelta
Extension of Adagrad with less aggressive learning rate decay
Adadelta restricts the window of accumulated past gradients to some fixed size $w$
Adam
Makes use of running averages of the gradients and of their second moments
Similar to Adadelta, with momentum
Currently one of the most popular optimization algorithms
medium.com/datathings/neural-networks-and-backpropagation-explained-in-a-simple-way-f540a3611f5e
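Both adaptive optimizers are available in torch.optim; a minimal sketch of instantiating and stepping them (the model and hyperparameter values are illustrative assumptions):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                      # placeholder model (assumption)

# Adadelta: adapts the learning rate from a decaying window of past squared gradients.
opt_adadelta = torch.optim.Adadelta(model.parameters(), rho=0.9)

# Adam: running averages of the gradients (first moment) and squared gradients (second moment).
opt_adam = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(x), y)
opt_adam.zero_grad()
loss.backward()
opt_adam.step()
```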
Sigmoid function
Formulation: $\sigma(x) = \frac{1}{1 + e^{-x}}$
Large negative numbers become 0 and large positive numbers become 1
Common in the past, rarely used today
Drawbacks:
Vanishing gradients: the function's gradient near either tail (outputs close to 0 or 1) is almost zero
Outputs are not zero-centered, which is undesirable: later layers then receive inputs that are always positive, so the gradients of their weights become either all positive or all negative, resulting in zig-zagging update dynamics.
http://cs231n.github.io/neural-networks-1
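A small numeric sketch of the saturation behaviour and the near-zero tail gradients (the input values are arbitrary illustrations):

```python
import torch

x = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0], requires_grad=True)
s = torch.sigmoid(x)            # sigma(x) = 1 / (1 + exp(-x))
s.sum().backward()

print(s)        # large negative inputs -> ~0, large positive inputs -> ~1
print(x.grad)   # gradient sigma(x) * (1 - sigma(x)) is almost zero at both tails
```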
Tanh function
Formulation: $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
Large negative numbers become -1 and large positive numbers become 1
Similar to the Sigmoid
In contrast to the Sigmoid: zero-centered
Drawbacks:
Vanishing gradients (see Sigmoid function)
http://cs231n.github.io/neural-networks-1
ReLU (Rectified Linear Unit)
Formulation: $f(x) = \max(0, x)$
http://cs231n.github.io/neural-networks-1
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional
neural networks." Advances in neural information processing systems. 2012.
Activation functions
Leaky ReLU
Formulation: $f(x) = \mathbb{1}(x < 0)\,(\alpha x) + \mathbb{1}(x \ge 0)\,(x)$   (1(·): indicator function)
α is a small constant
Attempt to fix the "dying ReLU" problem: instead of the function being zero for x < 0, a leaky ReLU has a small negative slope there
http://cs231n.github.io/neural-networks-1
https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
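A minimal sketch comparing ReLU and leaky ReLU on a few values (the slope α = 0.01 is an assumed example value):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-3.0, -0.5, 0.0, 0.5, 3.0])

relu_out = F.relu(x)                                  # max(0, x): zero for all x < 0
leaky_out = F.leaky_relu(x, negative_slope=0.01)      # alpha * x for x < 0 (alpha = 0.01 here)

print(relu_out)    # tensor([0.0000, 0.0000, 0.0000, 0.5000, 3.0000])
print(leaky_out)   # tensor([-0.0300, -0.0050, 0.0000, 0.5000, 3.0000])
```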
Activation functions
Maxout
Formulation: $\max(w_1^T x + b_1,\; w_2^T x + b_2)$
Introduced by Goodfellow et al. (2013)
Generalizes ReLU and Leaky ReLU
E.g., ReLU is Maxout with $w_1 = 0$ and $b_1 = 0$
Fixes the dying ReLU problem
http://cs231n.github.io/neural-networks-1
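Maxout is not a built-in PyTorch activation; below is a minimal sketch of a Maxout unit following the formulation above (class name, sizes, and the number of affine pieces are illustrative assumptions):

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Maxout unit: element-wise max over k affine pieces, max_i (W_i x + b_i)."""
    def __init__(self, in_features, out_features, pieces=2):
        super().__init__()
        self.pieces = pieces
        self.linear = nn.Linear(in_features, out_features * pieces)

    def forward(self, x):
        z = self.linear(x)                               # all affine pieces at once
        z = z.view(*x.shape[:-1], -1, self.pieces)       # (..., out_features, pieces)
        return z.max(dim=-1).values                      # max over the pieces

layer = Maxout(16, 8)
out = layer(torch.randn(4, 16))
print(out.shape)   # torch.Size([4, 8])
```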
Overview paper and table source: Mishkin, Dmytro, Nikolay Sergievskiy, and Jiri Matas. "Systematic
evaluation of CNN advances on the ImageNet." arXiv preprint arXiv:1606.02228 (2016).
Activation functions: practical advice
Source: https://towardsdatascience.com/exploring-activation-functions-for-neural-networks-73498da59b02
Loss functions, by task:
• Classification
• Regression
• Metric Learning
• Reinforcement Learning
Classification
$L(\boldsymbol{x}, y) = -\sum_j y_j \log p(c_j \mid \boldsymbol{x})$
Used in various multiclass classification methods for NN training
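A minimal sketch of this loss in PyTorch; F.cross_entropy combines the softmax and the negative log-likelihood (batch size and class count are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)                 # raw scores o for 4 samples, 10 classes
targets = torch.tensor([3, 7, 0, 2])        # ground-truth class indices (the one-hot y)

# Built-in: softmax + negative log-likelihood in one call.
loss = F.cross_entropy(logits, targets)

# Equivalent manual form of L(x, y) = -sum_j y_j log p(c_j | x), averaged over the batch.
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(4), targets].mean()

print(loss.item(), manual.item())           # the two values agree
```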
Regression
Metric Learning
Schroff, Florian, Dmitry Kalenichenko, and James Philbin. "FaceNet: A unified embedding for face recognition and clustering." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.
Loss function (triplet loss): $L = \sum_i \left[\, \|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha \,\right]_+$, where $x^a$, $x^p$, $x^n$ are anchor, positive, and negative samples and $\alpha$ is the margin.
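A minimal sketch of such a triplet loss on anchor/positive/negative embeddings (embedding size, batch size, and the margin value are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

anchor   = F.normalize(torch.randn(8, 128), dim=1)   # embeddings f(x_a)
positive = F.normalize(torch.randn(8, 128), dim=1)   # same identity as the anchor
negative = F.normalize(torch.randn(8, 128), dim=1)   # different identity

# max(0, ||f(x_a) - f(x_p)||^2 - ||f(x_a) - f(x_n)||^2 + margin), averaged over triplets.
margin = 0.2
d_pos = (anchor - positive).pow(2).sum(dim=1)
d_neg = (anchor - negative).pow(2).sum(dim=1)
loss = torch.clamp(d_pos - d_neg + margin, min=0).mean()

# PyTorch also provides a built-in variant (uses Euclidean distance, not squared distance):
builtin = F.triplet_margin_loss(anchor, positive, negative, margin=0.2)
```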
Next lecture:
Generalization, Overfitting and How to Approach it.
Live demo: training neural networks in PyTorch
Homework: watch the PyTorch introduction video
https://youtu.be/I1WcY1gX8PM