AI WinterSchool
January 20 - 24, 2025
Today’s Program
Part I: Introduction lecture (14:15 - 15:45)
● Overview
● Theoretical Basics
● Data
● Training
● Evaluation
● Design and Techniques
Part II: Hands-on (16:15 - 18:00)
● Questions
● Setup
● Some coding
Introduction to
Deep Learning
Jan 20, 2025
Overview
Machine Learning as Artificial Intelligence
Artificial Intelligence ⊃ Machine Learning ⊃ Deep Learning
● Artificial Intelligence: any technique that enables computers to mimic human behaviour
● Machine Learning: learn to perform tasks from data without being explicitly programmed
● Deep Learning: extract patterns from data using deep neural networks
Disciplines of Machine Learning
Supervised Learning
Labeled Training Data (e.g., shapes annotated as Pentagon, Square, Triangle, Circle)
Model: learning to label
New Data → “Square!”
Disciplines of Machine Learning
Supervised Learning: example applications
● Face recognition (https://www.theguardian.com/technology/2019/jul/29/what-is-facial-recognition-and-how-sinister-is-it)
● Handwritten transcription (https://www.behance.net/gallery/71324093/The-Handwritten-A)
● Speech recognition (https://support.apple.com/de-de/HT208336)
● Medical diagnosis (https://www.wired.com/story/fmri-ai-suicide-ideation/)
Disciplines of Machine Learning
Unsupervised Learning
Unlabeled Training Data
Model: learning meaningful representations
New Data
Disciplines of Machine Learning
Unsupervised Learning: example applications
● Gene clustering (https://ernest-bonat.medium.com/building-machine-learning-clustering-models-for-gene-expression-rna-seq-data-d0e5af10416d)
● Image clustering (https://neurohive.io/en/state-of-the-art/deep-clustering-approach/)
● Language processing (https://www.superannotate.com/blog/what-is-natural-language-processing)
● Generation tasks
Disciplines of Machine Learning
Reinforcement Learning
Unlabeled Training Data
Model: learning to make decisions → best reward!
New Task: “build a pyramid with suitable item”
Disciplines of Machine Learning
Reinforcement Learning: example applications
● Game playing (https://deepmind.google/research/breakthroughs/alphago/)
● Algorithmic trading (https://www.mathworks.com/videos/reinforcement-learning-in-finance-1578033119150.html)
● Robotics (https://www.sciencenews.org/article/reinforcement-learn-ai-humanoid-robots)
● Goal-oriented chatbots (https://towardsdatascience.com/training-a-goal-oriented-chatbot-with-deep-reinforcement-learning-part-i-introduction-and-dce3af21d383)
Disciplines of Machine Learning
● Supervised Learning: labeled training data; the model learns to label new data (“Square!”)
● Unsupervised Learning: unlabeled training data; the model learns to cluster new data
● Reinforcement Learning: unlabeled training data; the model learns to make decisions for the best reward (new task: “build a pyramid with suitable item”)
Supervised Learning Tasks
Classification
● Training: learn to predict a label out of a discrete set
● Testing: accuracy as # of correctly predicted labels
Regression
● Training: predict a label as a continuous value directly
● Testing: distance/similarity to actual outcomes
Unsupervised Learning Tasks
Clustering
● Training: learn to identify groups
Generation
● Training: create representations to sample realistic outputs
Testing? Depends on the availability of ground truth data / other measures of performance…
Figure modified from: https://towardsdatascience.com/training-a-goal-oriented-chatbot-with-deep-reinforcement-learning-part-i-introduction-and-dce3af21d383
Deep Learning
Deep Neural Networks
[Figure: a deep network with input, hidden, and output layers]
Why this?
● Hierarchical processing: several levels
● All-in-one model: human out of the loop (?!)
● Extremely expressive: can learn “anything”
Why now?
● Unprecedented amount of available data
● Parallelization of computations by GPUs
● Many available toolkits
Theoretical Basics
A Neural Network
“Multi-layer Perceptron”
[Figure: input, hidden, and output layers; labelled parts: neuron, layer, weight, perceptron]
A Neural Network
Perceptron
The math: the perceptron forms a linear combination of the inputs with the weights, adds a bias, and passes the sum through a non-linearity (a sketch of this computation in code follows below):

$y = \sigma\Big(\sum_i w_i x_i + b\Big)$

(input $x_i$, weights $w_i$, bias $b$, sum, non-linearity $\sigma$, output $y$)
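A minimal sketch of this computation in PyTorch; the concrete numbers for the input, weights, and bias are made up for illustration:

```python
import torch

# A single perceptron: linear combination of the inputs plus a bias, then a non-linearity.
x = torch.tensor([0.5, -1.2, 3.0])   # input
w = torch.tensor([0.8, 0.1, -0.4])   # weights
b = torch.tensor(0.2)                # bias

z = torch.dot(w, x) + b              # weighted sum (linear combination + bias)
y = torch.sigmoid(z)                 # non-linearity -> output in (0, 1)
print(y)
```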
A Neural Network
Single Layer Network
The math: each neuron in the layer computes its own weighted sum of the input, so the whole layer is

$\mathbf{y} = \sigma(W\mathbf{x} + \mathbf{b})$

with input $\mathbf{x}$, weight matrix $W$, bias vector $\mathbf{b}$, element-wise non-linearity $\sigma$, and layer output $\mathbf{y}$.
A Neural Network
Multi-layer Network
Layers are stacked: the output of one layer becomes the input of the next. The network output (“prediction”) for an input $\mathbf{x}$ with network parameters (= weights) $W$ is

$f(\mathbf{x}; W) = \sigma\big(W_L \cdots \sigma(W_2\, \sigma(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2) \cdots + \mathbf{b}_L\big)$

$f$ has two faces:
● Evaluation function: the weights are fixed, the input varies
● Training function: the input is fixed, the weights vary
(a minimal multi-layer network in code follows below)
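A minimal multi-layer network in PyTorch corresponding to the stacked-layers picture; the layer sizes here are illustrative, not taken from the slides:

```python
import torch
import torch.nn as nn

# Stack of layers: the output of one layer is the input of the next.
model = nn.Sequential(
    nn.Linear(4, 8),   # input layer -> hidden layer (4 inputs, 8 hidden neurons)
    nn.Sigmoid(),
    nn.Linear(8, 3),   # hidden layer -> output layer (3 outputs)
)

x = torch.randn(1, 4)     # one input sample with 4 features
prediction = model(x)     # network output ("prediction"), depends on x and the weights
print(prediction)
print([p.shape for p in model.parameters()])   # the network parameters (= weights and biases)
```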
Non-Linearities: Activation Functions
Biological motivation: activate the neuron only if the threshold b is exceeded (“activate!”), otherwise discard — this is the Heaviside (step) function.
A smooth alternative, the “default”: the sigmoid, with output within [0,1].
(both are evaluated in the sketch below)
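A small sketch comparing the two activation functions named above on a range of inputs (the input values are arbitrary):

```python
import torch

z = torch.linspace(-4.0, 4.0, steps=9)

# Heaviside (step) function: activate if the threshold is exceeded, otherwise discard.
step = torch.heaviside(z, torch.tensor(0.0))

# Sigmoid: smooth "default" activation with output within [0, 1].
sigmoid = torch.sigmoid(z)

for zi, si, gi in zip(z.tolist(), step.tolist(), sigmoid.tolist()):
    print(f"z={zi:+.1f}  step={si:.0f}  sigmoid={gi:.3f}")
```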
Supervised Learning Tasks
Classification: predict a class out of a discrete set (e.g., cat, burger, tree, bed) → output a probability distribution (soft-max)
Regression: predict the value directly (°C, $, ...)
“Expressive Power”
What can a neural network learn? “Anything.”
Universal Approximation Theorem:
“Neural networks with a non-polynomial activation function can approximate any continuous function arbitrarily well.”
Data
What is a dataset?
● An organized collection of data
○ One “unit” of data = an instance / data point
○ Information about a data point = features
○ Labels or other annotations are often included
→ required for supervised tasks but not (necessarily) for unsupervised ones
○ Normalize the features, e.g., $\tilde{x} = (x - \mu)\,/\,\sigma$ (zero mean, unit variance)
Properties of a (good) dataset
● What about dataset size…?
○ Defined entirely by the task (from dozens/hundreds to millions of data points)
○ The only certainty is “the more the merrier”, but also “the more representative the merrier”
● Do not forget the data split (~80/20%):
Data = Training set (used during training) + Test set (check the performance of the finished(!) model)
(a normalization and train/test split sketch follows below)
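A small sketch of the two points above: normalizing features to zero mean / unit variance and making an ~80/20 train/test split. The dataset shape and values are made up:

```python
import torch

# Toy dataset: 100 data points ("instances"), 5 features each, binary labels.
X = torch.randn(100, 5) * 3.0 + 10.0
y = torch.randint(0, 2, (100,))

# Normalize each feature: subtract the mean and divide by the standard deviation.
X = (X - X.mean(dim=0)) / X.std(dim=0)

# ~80/20 split into a training set and a test set (shuffle first).
perm = torch.randperm(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = perm[:split], perm[split:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]
print(X_train.shape, X_test.shape)   # torch.Size([80, 5]) torch.Size([20, 5])
```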
Training
Training
Supervised learning: given samples of training data with corresponding labels
● input x: a matrix of values (e.g., an image)
● label y: a binary vector with a 1 at the correct class (e.g., classes camel, cat, Pikachu)
Goal: optimize the weights such that f(x) = cat for every cat sample x in the training data,
but also f(x) = cat for samples outside the training data!
Training
How to achieve this goal?
Loss function (error, cost): how good is the prediction $\hat{y} = f(x)$ compared to the true label $y$?
● Zero-one loss: is the prediction exactly the same as the label or not?
● Square loss (L2): Euclidean distance between prediction and label, $\|\hat{y} - y\|^2$
● Cross-entropy loss: $-\sum_i y_i \log \hat{y}_i$, i.e., maximize the likelihood of the true class
→ minimizing the loss function will improve the prediction! (the three losses are evaluated in the sketch below)
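A sketch evaluating the three losses on a single toy prediction; the one-hot label and the predicted probabilities are made up:

```python
import torch

y_true = torch.tensor([0.0, 1.0, 0.0])   # one-hot label: class "cat"
y_pred = torch.tensor([0.2, 0.7, 0.1])   # predicted probabilities

# Zero-one loss: is the predicted class exactly the true class or not?
zero_one = float(y_pred.argmax() != y_true.argmax())

# Square (L2) loss: squared Euclidean distance between prediction and label.
l2 = torch.sum((y_pred - y_true) ** 2)

# Cross-entropy loss: negative log-likelihood of the true class.
cross_entropy = -torch.log(y_pred[y_true.argmax()])

print(zero_one, l2.item(), cross_entropy.item())
```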
Training
Idea: start with random weights.
1) Take a sample and measure how good/bad the prediction is: f(x) = Pikachu
2) Update the weights to improve the prediction (i.e., the loss decreases): f(x) = cat
Repeat the process for every sample in the training data set.
[Training loop: initialize the weights → evaluate the model on a sample → compute the loss against the true label → update the weights → go to the next sample; stop if good enough]
(a minimal training loop in code follows below)
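A minimal supervised training loop in PyTorch following the recipe above; the toy dataset, model size, number of epochs, and learning rate are all made up for illustration:

```python
import torch
import torch.nn as nn

# Toy data: 64 samples, 10 features, 3 classes (e.g., camel / cat / Pikachu).
X = torch.randn(64, 10)
y = torch.randint(0, 3, (64,))

# The model starts from random weights.
model = nn.Sequential(nn.Linear(10, 16), nn.Sigmoid(), nn.Linear(16, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(5):                 # repeat over the training data
    for xi, yi in zip(X, y):           # 1) take a sample and measure how good the prediction is
        prediction = model(xi.unsqueeze(0))
        loss = loss_fn(prediction, yi.unsqueeze(0))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()               # 2) update the weights so that the loss decreases
    print(f"epoch {epoch}: last sample loss {loss.item():.3f}")
```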
Training
GOAL: find a weight update rule that produces a sequence of weights which gradually decreases the loss:
as training progresses, later weights should result in smaller losses.
And do it over the whole training set: find the weights which result in minimal loss summed over all training samples.
→ a non-linear, non-convex optimization problem!
Special case: for a linear perceptron with a square loss this is linear regression, i.e., a least squares problem!
Weight Updates: A simple optimization technique
Gradient Descent
Gradient of the loss, $\nabla_W L$: “how does the loss change if a weight changes?”
→ it points in the direction of steepest ascent (i.e., the direction to change the weights so that there is maximal change in the loss)
→ so go in the opposite direction of the steepest ascent:
$W \leftarrow W - \eta\, \nabla_W L$, where $\eta$ is the “learning rate”
Caveat: following the gradient can end up in a local minimum instead of the global minimum.
Algorithm
Initialize the weights
Until convergence:
  Compute the gradient
  Update the weights
Return the weights
(an explicit gradient-descent loop in code follows below)
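Gradient descent written out explicitly for a tiny least-squares problem, using autograd for the gradient; the initial weights, learning rate, and fixed number of steps are illustrative choices:

```python
import torch

# Toy least-squares problem: fit y = X w with a square loss.
X = torch.randn(50, 3)
true_w = torch.tensor([1.0, -2.0, 0.5])
y = X @ true_w

w = torch.zeros(3, requires_grad=True)    # initialize the weights
lr = 0.1                                  # "learning rate"

for step in range(100):                   # "until convergence" (fixed number of steps here)
    loss = torch.mean((X @ w - y) ** 2)   # loss over the whole training set
    loss.backward()                       # gradient of the loss w.r.t. the weights
    with torch.no_grad():
        w -= lr * w.grad                  # go opposite to the direction of steepest ascent
        w.grad.zero_()

print(w.detach(), loss.item())            # return the weights
```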
Training on Batches
Gradient descent is very expensive…
Example: a single step of gradient descent for AlexNet (a neural network with ~60M parameters) on ImageNet (a dataset of ~1.2M images) requires ~2*10^14 FLOPs!
→ Train on small batches of the dataset!
“Training with large minibatches is bad for your health. More importantly, it’s bad for your test error. Friends don’t let friends use minibatches larger than 32.”
- Yann LeCun
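Mini-batch training with a DataLoader; the batch size of 32 echoes the quote above, while the dataset and model are made up:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

X = torch.randn(1000, 10)
y = torch.randint(0, 3, (1000,))

# Instead of one gradient step over all 1000 samples, take cheap steps on small batches.
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

model = torch.nn.Linear(10, 3)
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for xb, yb in loader:        # one epoch of mini-batch (stochastic) gradient descent
    loss = loss_fn(model(xb), yb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```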
Evaluation
Training-Test
Data = Training set + Test set
● during training: monitor the loss/error on the training set
● after training: check performance, i.e., evaluate on unseen data (the test set)
[Figure: training loss curve over the course of training]
Bias-Variance Tradeoff
Over- and underfitting
Example: learn a second-degree polynomial from noisy observations (figures from https://shapeofdata.wordpress.com)
● Ground truth: deg = 2
● Underfitting (deg too low), a simple model: high bias, captures the essentials well, but a bad fit
● Overfitting (deg too high), a complex model: high variance, a good fit to the data, but too specific
→ Trade-off between model assumptions (bias) and model complexity (variance)
Training-Validation-Test
Data = Training set + Validation set + Test set
● during training: intermediate performance check on the validation set
● after training: check performance on the test set
[Figure: the training loss keeps decreasing with training / model complexity, while the validation loss first drops (high bias) and then rises again (high variance) → stop where the validation loss is lowest (“Stop here!”)]
(an early-stopping sketch in code follows below)
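One common way to implement “stop here!” is early stopping on the validation loss; a sketch under made-up data, model, and patience value:

```python
import torch
import torch.nn as nn

# Toy split: 80 training / 20 validation samples.
X, y = torch.randn(100, 5), torch.randn(100, 1)
X_train, y_train, X_val, y_val = X[:80], y[:80], X[80:], y[80:]

model = nn.Linear(5, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

best_val, best_state, patience, bad_epochs = float("inf"), None, 10, 0
for epoch in range(200):
    loss = loss_fn(model(X_train), y_train)        # training loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()   # validation loss
    if val_loss < best_val:                        # validation still improving
        best_val, bad_epochs = val_loss, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                 # validation loss has started rising -> stop here!
            break

model.load_state_dict(best_state)                  # keep the weights with the lowest validation loss
```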
Metrics of performance
● Defined by the task: MSE, accuracy, mAP, etc.
● In case of classification: counts of true/false positives and negatives (the confusion matrix) yield accuracy, precision, recall, etc. (see the sketch below)
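A small sketch of classification metrics computed from predicted vs. true labels; the labels themselves are made up:

```python
import torch

y_true = torch.tensor([1, 0, 1, 1, 0, 1, 0, 0])   # true binary labels
y_pred = torch.tensor([1, 0, 0, 1, 0, 1, 1, 0])   # predicted labels

tp = ((y_pred == 1) & (y_true == 1)).sum().item()   # true positives
fp = ((y_pred == 1) & (y_true == 0)).sum().item()   # false positives
fn = ((y_pred == 0) & (y_true == 1)).sum().item()   # false negatives

accuracy = (y_pred == y_true).float().mean().item()
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```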
Interpretability
● XAI: steering away from the black box
● Crucial in high-responsibility decision making, e.g. medicine
● TOOLS: explainable architecture, post-hoc analysis, etc.
Wu et al., 2023: Discover and Cure - Concept-aware Mitigation of Spurious Correlation
Bias
● Mitigating bias
○ Especially important in decision making with a social effect (e.g., granting parole [1])
● TOOLS: metrics to assess group fairness (demographic parity, equalized
odds, etc.), transparency about biases in the data collection process…
[1]: Angwin et al., 2016: Machine Bias
Design and Techniques
Common Techniques
Regularization: add a regularizing term to the loss, often a norm of the network weights
Dropout: randomly set neuron activations to zero during training
Stochastic Gradient Descent (SGD): use the gradient of a randomly selected subset (mini-batch) of the data
Batch normalization: normalize the samples w.r.t. the other samples in the batch
(see the sketch below for how these map to code)
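A sketch of how these techniques map onto standard PyTorch building blocks; the layer sizes and hyperparameters are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),    # batch normalization: normalize activations w.r.t. the batch
    nn.ReLU(),
    nn.Dropout(p=0.5),     # dropout: randomly zero activations during training
    nn.Linear(64, 3),
)

# SGD on mini-batches = stochastic gradient descent; weight_decay adds an L2
# regularization term on the network weights to the loss.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```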
Popular architectures
Convolutional neural networks: apply “filters” to extract spatial features, textures, patterns, etc.
● Popular choice in image processing.
● Examples: VGG-16, VGG-19, AlexNet, etc.
Autoencoders: learn a compact statistical representation of the data and sample from it.
● Useful in dimensionality reduction, data generation, denoising, etc.
● Example: Variational Autoencoders (VAE)
Residual neural networks: use shortcut connections to skip layers (helps with vanishing gradients).
● Useful in applications requiring large networks: image segmentation, object detection, etc.
● Example: ResNet
Transformers: capture relationships in sequential data by considering the whole context.
● Useful in applications with sequential data (e.g., text), but also otherwise (vision transformers).
● Examples: GPTs, BERT, ViT, DINOv2
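For a feel of how such architectures are expressed in code, a tiny convolutional network; the layer configuration is illustrative and does not correspond to any of the named models:

```python
import torch
import torch.nn as nn

# Minimal convolutional classifier for 3-channel 32x32 images, 10 classes.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # "filters" extracting spatial features
    nn.ReLU(),
    nn.MaxPool2d(2),                               # 32x32 -> 16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                               # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                     # classification head
)

x = torch.randn(1, 3, 32, 32)
print(cnn(x).shape)   # torch.Size([1, 10])
```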
Today’s Program
Part I: Introduction lecture (14:15 - 15:45)
● Overview
● Theoretical Basics
● Data
● Training
● Evaluation
● Design and Techniques
Part II: Hands-on (16:15 - 18:00)
● Questions
● Setup
● Some coding
AI WinterSchool
January 20 - 24, 2025
Exercises
Exercises
Using Google Colab and PyTorch.
Open the notebook Intro_WS_2025.ipynb.
Follow the instructions in the notebook.