I2DL Student Lecture Notes
Authors:
Dan Halperin, Benjamin Heltzel
Contents
II Neural Networks
II.1 Neural Networks
II.1.a Fully Connected Layer
II.1.b Backpropagation
II.2 Activation functions
II.2.a Terms
II.2.b Sigmoid
II.2.c Hyperbolic Tangent - Tanh
II.2.d Rectified Linear Unit - ReLU
II.2.e The "vanishing gradient" vs the "Dying ReLU" problems
II.2.f Leaky ReLU
II.2.g Parametric ReLU
II.2.h Exponential Linear Unit - ELU
II.2.i MaxOut
II.2.j Softmax
II.3 Loss functions
II.3.a Classification loss functions
II.3.b Regression loss functions
II.4 Linear Regression
II.5 Logistic Regression
II.6 Complete Schematic of a Fully Connected Neural Network
II.7 Questions
II.8 Answers
III Convolutions
III.1 Definition
III.2 Unique Convolutional Layers
III.2.a MaxPooling
III.2.b Average pooling
III.2.c Point-wise convolutions
III.2.d Depth-wise Convolutions
III.2.e Upsample
IV Optimization
IV.1 Optimization algorithms
IV.1.a Gradient Descent
IV.1.b Stochastic Gradient Descent
IV.1.c SGD with Momentum
IV.1.d Nesterov Momentum
IV.1.e RMSprop
IV.1.f Adam - Adaptive Moment Estimation
IV.1.g Newton's Method and its Variants
IV.1.h Second Order confusion
IV.2 Optimization problems and solutions
IV.2.a Overfitting vs underfitting
IV.2.b Vanishing and exploding gradients
IV.2.c Learning rate scheduling
IV.2.d Regularization
IV.2.e Batch normalization
IV.2.f Hyperparameter tuning
IV.2.g Weight initialization
IV.2.h Optimization problems
IV.3 Transfer Learning
IV.4 Questions
IV.5 Answers
V Popular Architectures
V.1 LeNet
V.2 AlexNet
V.3 VGGNet
V.4 Skip Connections: Residual Block
V.5 ResNet (Residual Networks)
V.6 GoogleNet: Inception Layer
V.7 Autoencoder
V.8 Fully Convolutional Networks
V.9 U-Net
V.10 Generative Networks
V.10.a Variational Autoencoders (VAEs)
V.10.b Generative Adversarial Networks (GANs)
V.11 Questions
V.12 Answers
VII Appendix
VII.1 Multidimensional derivatives
VII.1.a Affine layer
VII.1.b Derivatives
VII.1.c Stanford Article
VII.1.d Exercise
I Machine Learning Basics
• Supervised learning: a type of learning where the model is trained on an existing and fixed ground truth, relative to the task at hand: classes for classification, depth values for depth estimation, facial key-points, or the next correct letter in a sentence.
• Unsupervised learning: a type of learning where the model is trained on data without any ground truth. The model is supposed to find patterns in the data, which can be used for clustering, dimensionality reduction, or generative tasks (e.g. autoencoders).
• Classification: a discrete goal, where the model assigns the input to one of the C classes. Also applies
to semantic segmentation of images, where each pixel is assigned to one of the C classes. This can
be done in a supervised way (given labeled data) or unsupervised way (clustering). Examples: image
classification, object detection, semantic segmentation, sentence completion, etc.
• Regression: a continuous goal, where the model predicts a continuous value, either bound or unbound
to some range. Examples are depth estimation, facial key-point detection, etc.
I.2 Datasets and Dataloaders

NOTE: This section refers to Exercise 03: Datasets and Data loaders.
• Datasets are the core of machine learning. They are the most important part, and their quality and
quantity can make or break a project. Machine learning models are data-driven, meaning that all
they can learn and use is within the data they are given.
• Datasets are usually split into three parts: training, validation and test set.
– First, the training set is used to train the model. It is the only data the model is directly exposed to, and it is always allotted the largest chunk of the overall dataset.
– The validation set is an unseen part of the dataset. It is commonly stated that it is used for tuning the hyperparameters of the model - but that happens only before we even start training. More importantly, it is used to evaluate the model during training. This is important, because the
model can overfit (subsection IV.2.a) the training data, and the validation set is the main tool used to detect this.
– The test set is the final part of the dataset, and should be kept as unbiased as possible (no optimization choice is made according to it). It is common to argue that this split should be used only once - and indeed, it only represents truly unseen data if we, as developers, cannot make architectural choices based on the performance on it.
– It is important to shuffle the dataset, otherwise different splits could be very biased, e.g. - the
training set would contain only images that were taken during day-time, and the test would
consist only of nighttime images.
– Common split ratios are {80%, 10%, 10%}, {70%, 15%, 15%}, etc. The training set is always the
biggest, as it is the only data that the model is trained on.
• A dataset represents a distribution, which is a sub-part of the real-world distribution. It is limited by the available data, and no matter how big it is, a finite modern dataset cannot fully represent the real-world distribution.
Usually, in modern deep learning frameworks, datasets are represented as classes. Such a class has two main methods that need to be implemented:
• __len__ - this method returns the length of the dataset. It is used by the data loader to know how
many samples are in the dataset.
• __getitem__ - this method returns the i-th element of the dataset. Note that it returns a single sample, and NOT a batch of samples. The dataloader is responsible for calling this function multiple times to get a batch of samples. Within this function, the sample can be loaded from disk, normalized, transformed, etc., along with its label, if one exists.
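A minimal sketch of such a dataset class, in the style of PyTorch's torch.utils.data.Dataset (the class name, file layout, and field names here are hypothetical):

```python
from torch.utils.data import Dataset
from PIL import Image

class ImageFolderDataset(Dataset):
    """Hypothetical image dataset: one file per sample, labels in a list."""

    def __init__(self, image_paths, labels, transform=None):
        self.image_paths = image_paths  # list of file paths on disk
        self.labels = labels            # list of integer class labels
        self.transform = transform      # optional preprocessing / augmentation

    def __len__(self):
        # Number of samples; the dataloader uses this.
        return len(self.image_paths)

    def __getitem__(self, i):
        # Returns a SINGLE sample and its label, never a batch.
        image = Image.open(self.image_paths[i]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)  # normalize / augment / to-tensor
        return image, self.labels[i]
```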
Data augmentation is a technique used to artificially increase the size of the dataset. It is especially useful
when the dataset is small, and the model is prone to overfitting. It is a common practice to use data
augmentation in computer vision tasks, where the images can be flipped, rotated, cropped, etc.
• It is considered a regularization technique (subsection IV.2.d), as it helps the model to generalize better.
• It does NOT increase the physical size of the dataset, but at each epoch, the model sees a different
version of the data with some probability p.
• Unlike normalization and other preprocessing, data augmentation is applied only to the training set,
and not to the validation and test sets.
• While the idea is good and a model can really benefit from data augmentation, the techniques should fit the problem at hand. For example, it is not a good idea to flip the images in a medical imaging task, as the left and the right side of the body are different, or to rotate images of digits by 180°, as the number 6 would become a 9, etc.
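A sketch of this split-dependent pipeline using torchvision transforms (the specific transforms and the 32×32 image size are assumptions, CIFAR-style):

```python
from torchvision import transforms

# Augmentation + preprocessing for the TRAINING split only.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),  # applied with probability p
    transforms.RandomCrop(32, padding=4),    # assumes 32x32 inputs
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
])

# Validation / test splits get ONLY the deterministic preprocessing.
eval_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
])
```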
Data processing is an overall name for techniques that are applied to the data before it is fed to the model:
• Normalization: a technique used to scale the data to a certain range. It is crucial that the input has a reasonable magnitude, so the gradients don't explode or vanish (subsection IV.2.b). It also makes the input to the network more consistent and eases the learning process. It is common practice to normalize the data to the range [0, 1] or [−1, 1]. This is done by calculating the mean and the standard deviation of the training set, and then normalizing the data by subtracting the mean and dividing by the standard deviation:
x_norm = (x − µ) / σ
• These techniques are applied to each sample of the dataset, and are usually done within the __getitem__ method of the dataset class. Unlike data augmentation, they are applied to all splits of the dataset, as the model expects the data to be in the same format (a small sketch follows below).
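A small numpy sketch of this rule - statistics from the training split only, applied to every split (the data here is random, as a stand-in):

```python
import numpy as np

# Stand-in data: 1000 training samples, 200 validation samples, 3 features.
train_x = np.random.rand(1000, 3)
val_x = np.random.rand(200, 3)

# Statistics are computed on the TRAINING split only...
mu = train_x.mean(axis=0)
sigma = train_x.std(axis=0)

# ...and the SAME mean/std are then applied to every split.
train_norm = (train_x - mu) / sigma
val_norm = (val_x - mu) / sigma  # train statistics, not val's own
```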
I.2.e Dataloaders
Dataloaders are classes that are used to load the data from the dataset and to create batches of samples. They are responsible for shuffling the data and assembling the batches. They are usually used in the training loop, where the model is trained on the data.
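A typical usage sketch with PyTorch's DataLoader, assuming train_dataset is an instance of a dataset class like the one sketched above:

```python
from torch.utils.data import DataLoader

# train_dataset: an instance of a dataset class like the one sketched above.
train_loader = DataLoader(train_dataset, batch_size=32,
                          shuffle=True, num_workers=4)

for images, labels in train_loader:
    # images has shape (32, C, H, W): the loader called __getitem__
    # 32 times and stacked the single samples into one batch.
    ...
```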
I.3 Exam questions

I.3.a Questions
d) When can we relax the ratio between the splits to be more even? When the other way around?
e) What would be a consequence of a too small validation set?
f) What can we do as simple deep learning engineers to overcome a too small training set?
8. Why not use the validation set as part of the training set, and just use the test set for validation?
9. What is the main issue with modern datasets’ sizes? Why is it so?
10. To which part of the dataset should we apply data augmentation? Why?
11. To that regard, what is the difference between data augmentation and data processing techniques like
normalization?
12. Why is the mean and std of the data calculated only over the training set?
I.3.b Answers
1. Supervised learning is a type of learning, where the model is trained on an existing and fixed ground
truth, relative to the task at hand: Classes for classification, depth values for depth estimation, facial
key-points or the next correct letter in the sentence. Unsupervised learning is a type of learning, where
the model is trained on data without any ground truth. The model is supposed to find patterns in the
data, which can be used for clustering, dimensionality reduction or generative tasks.
2. Classification is a discrete goal, where the model assigns the input to one of the C classes. Regression
is a continuous goal, where the model predicts a continuous value, either bound or unbound to some
range. Examples are depth estimation, facial key-point detection, etc.
3. Labeled data is expensive, and sometimes impossible to get. Unsupervised learning can be used to find patterns in the data that can be used for other tasks, like classification or regression, without the need to manually or artificially label the data (a process that usually introduces human-made errors). Note (irrelevant for the exam): it was shown that unsupervised settings allow much richer learned representations; for example, when GPT-1 was trained, it was shown that the model learned to detect the sentiment of the text (good or bad), even though it was trained on a next-letter completion task.
4. Autoencoder. We used an autoencoder to reconstruct unlabeled data, to create a feature extractor,
and then used this encoder to train a classifier on the much smaller labeled dataset (Exercise 08).
5. a) K-means clustering is an unsupervised learning algorithm that is used to cluster the data into K clusters. It is an iterative algorithm, where each data point is assigned to the closest cluster, and then the cluster centers are recalculated. This is repeated until the algorithm converges.
b) The key hyperparameter of the K-means clustering algorithm is the number of clusters K.
c) If the value of K is too small, the algorithm could merge clusters that are not similar (underfitting), and if the value of K is too large, the algorithm could split clusters that are similar (overfitting).
6. a) PCA is a linear dimensionality reduction technique, that is used to reduce the number of features
of the data, while keeping the most important features.
b) A deep learning architecture that could perform a similar task is an autoencoder. The autoencoder
is a non-linear dimensionality reduction technique. The deep learning method is preferable for
real world problems, as it can learn non-linear features, and can be used for more complex tasks,
like classification, regression, etc.
7. a) It is important to shuffle the dataset, as if the data is not shuffled it could lead to biased splits,
that do not represent the same overall distribution. Example: if the datasets consist of day-time
and nighttime images sorted accordingly, the training set might hold only day-time images, while
the test would hold only the much more difficult nighttime ones. Also, the model could learn the
order of the data, and not the data itself.
b) Common split ratios are {80%, 10%, 10%}, {70%, 15%, 15%}, etc.
c) The training set is always the largest, as it is the only data that the model is trained on.
d) We can relax the ratio between the splits to be more even, when the dataset is big enough, and
the model is not prone to overfitting.
e) A consequence of a too small validation set could be that the model could overfit the training
set, and the validation set would not be able to detect this.
f) To overcome a too small training set, we can use data augmentation, as it artificially increases
the size of the dataset (not the physical size, but the number of samples that the model sees).
8. With no validation set, we cannot detect overfitting problems. On the other hand, the test set has a
very clear purpose of being the most unbiased unseen data segment of the dataset, and should not be
used to validate our training process.
9. The main issue with modern datasets' sizes is that they are finite and limited by the available data, so they cannot represent the real-world distribution. Data collection is very expensive in terms of both time and resources.
10. Data augmentation is applied only to the training set, as a means of regularization, to help the model
generalize better and to prevent overfitting. It is not applied to the validation and test sets, as they are
supposed to represent the original problem’s data distribution, and the model should be evaluated on
that distribution. If applied, the performance on those datasets would be worse and not comparable
to other models.
11. The difference between data augmentation and data processing techniques like normalization is that
data augmentation is applied to the training set, and is used to artificially increase the size of the
dataset, while data processing techniques are applied to all the splits of the dataset, and are used to
preprocess the data before it is fed to the model.
12. The mean and std of the data are calculated only over the training set, as we must keep the test set as unbiased as possible. That is, no information from the test set should be used to train the model, and the mean and std of the data are part of that information.
II Neural Networks
Neural networks are simply defined as a series of functions f(x), g(x), h(x), ... that are composed together:

y = h(g(f(x)))
A neural network can be described by the following sketch:

[Sketch of a fully connected network - three input neurons, a hidden layer of four neurons, and two output neurons - omitted here.]

In the drawing, the circles (nodes) represent the neurons, and the lines (edges) represent the connections between the neurons (the weights). Each neuron holds a single value. A common way to index those weights would be w_{l,i,o}, where l is the layer, i is the index of the neuron in the input layer, and o is the index of the neuron in the output layer. Usually, neural networks represent non-linear functions. For that, we apply a non-linear activation function to the output of each neuron of a hidden layer, but this is not shown in the drawing.
On top of that, each set of lines between any two groups of neurons is what we refer to as the layers, or the
functions f, g, h that are stated above. The first layer is called the input layer, the last layer is called the
output layer, and the layers in between are called the hidden layers. The number of hidden layers and
the number of neurons in each layer are called the architecture of the neural network. The architecture of
the neural network is a hyperparameter that we need to tune to get the best performance.
Another crucial point is that in the sketch above we see a single input flowing through the network, and not a batch of them. In practice, however, we feed the network a batch of inputs at once, all computed in parallel, but independently.
In matrix notation, the affine layer (in its common name: the fully connected layer) is defined as:

y = XW + b
where N is the number of samples in the batch, D is the number of features in each sample, and M is the number of neurons in the layer. The output y is of shape R^{N×M}.
For the first layer in the example above, we assume that

X ∈ R^{N×3}, W ∈ R^{3×4}, b ∈ R^{1×4}
or in matrices (rows separated by semicolons):

X = [x_{1,1} x_{1,2} x_{1,3}; x_{2,1} x_{2,2} x_{2,3}; ... ; x_{N,1} x_{N,2} x_{N,3}],
W = [w_{1,1} w_{1,2} w_{1,3} w_{1,4}; w_{2,1} w_{2,2} w_{2,3} w_{2,4}; w_{3,1} w_{3,2} w_{3,3} w_{3,4}],
b = [b_1 b_2 b_3 b_4]
That means that each b_i in b corresponds to one feature (column) in a row of XW. Strictly speaking, adding a 1×M vector to an N×M matrix is not defined mathematically; in practice, b is broadcast, i.e. replicated across all N rows before the addition:
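A small numpy sketch of this broadcasting behavior (the values are arbitrary):

```python
import numpy as np

XW = np.arange(8.0).reshape(2, 4)        # pretend output of X @ W, shape (2, 4)
b = np.array([10.0, 20.0, 30.0, 40.0])   # shape (4,), i.e. one bias per column

# (2, 4) + (4,) is not defined in strict linear algebra; numpy broadcasts b
# by replicating it across the N rows:
y = XW + b
# equivalent to: XW + np.tile(b, (2, 1))
```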
II.1.a Fully Connected Layer

The fully connected layer (FC) is a very common computational layer in neural networks. It is also known as the affine layer, or the dense layer. The fully connected layer is defined as:
y = XW + b
where X_{N×D} is the input data, W_{D×M} is the weight matrix, and b_{1×M} is the bias vector. Both W and b
are learnable variables. Please check out the sketch above for a visual representation. In addition, a neural
network that is based on FC layers as its main building blocks is called a Fully Connected network (FCN).
You could also find it under the name of a Multi-Layer Perceptron (MLP).
It is important to note that each neuron in the input layer is connected to each neuron in the output layer. This
means that the number of weights in the fully connected layer is equal to the number of neurons in the input
layer times the number of neurons in the output layer. The number of biases in the fully connected layer
is equal to the number of neurons in the output layer. Therefore, we claim that FC layers capture global
features - but would struggle with local ones. To that end, we use convolutional layers, as described
later on (chapter III).
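A quick sanity check of this parameter count, for hypothetical sizes D = 192 inputs and M = 5 outputs:

```python
# Parameter count of a fully connected layer with D inputs and M outputs:
D, M = 192, 5
n_weights = D * M   # every input neuron connects to every output neuron
n_biases = M        # one bias per output neuron
print(n_weights + n_biases)  # 965
```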
II.1.b Backpropagation
A pro tip: the numerator is ALWAYS a function, and the denominator is always the input to that function. For example, given a function y = ReLU(xw + b), it's better to write ∂ReLU(xw + b)/∂(xw + b) rather than ∂y/∂(xw + b), because it's easier on our brains. Substitute the variable with the actual function for visual convenience, at least when dealing with matrix notation.
Assume the notation w_{l,i,o}, where l, i and o are the layer, input-neuron and output-neuron indices. Let us find the derivative of the weight w_{1,1,1} (colored in red in the sketch) with respect to some loss value L, i.e. ∂L/∂w_{1,1,1}:
Forward pass:
• h = XW1 + b1
• ŷ = hW2 + b2
• L = Loss(ŷ)
• For simplicity, assume no activation.
Backward pass:
• ∂L/∂w_{1,1,1} = (∂L/∂ŷ) · (∂(hW_2 + b_2)/∂h) · (∂(XW_1 + b_1)/∂w_{1,1,1})
• ∂L/∂w_{1,1,1} = Σ_{i=1}^{4} (∂L/∂h_i)(∂h_i/∂w_{1,1,1})
• ∂L/∂h_i = Σ_{j=1}^{2} (∂L/∂ŷ_j)(∂ŷ_j/∂h_i)
• ∂L/∂w_{1,1,1} = Σ_{j=1}^{2} Σ_{i=1}^{4} (∂L/∂ŷ_j)(∂ŷ_j/∂h_i)(∂h_i/∂w_{1,1,1})
• Note that here I found it more convenient to remain with the variable names, as we’re dealing with
matrix entries.
In matrix notation, the gradients of the affine layer's variables with respect to the loss function (a scalar) are:

∂L/∂X = (∂L/∂y) · W^⊺

∂L/∂W = X^⊺ · (∂L/∂y)

∂L/∂b = [1 1 ... 1]_{1×N} · (∂L/∂y)
Where L ∈ R is the loss value, and y ∈ RN ×M is the output of the layer. For more details about how to
derive them, please refer to chapter VII.
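A minimal numpy sketch of these three gradients, assuming the upstream gradient ∂L/∂y is given:

```python
import numpy as np

def affine_backward(dL_dy, X, W):
    """Gradients of y = X @ W + b, given the upstream gradient dL/dy.

    Shapes: dL_dy is (N, M), X is (N, D), W is (D, M).
    """
    dL_dX = dL_dy @ W.T        # (N, D)
    dL_dW = X.T @ dL_dy        # (D, M)
    dL_db = dL_dy.sum(axis=0)  # (M,), the row-of-ones product from above
    return dL_dX, dL_dW, dL_db

# Example with L = sum(y), so dL/dy is a matrix of ones:
X, W = np.random.randn(4, 3), np.random.randn(3, 2)
dX, dW, db = affine_backward(np.ones((4, 2)), X, W)
```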
II.2 Activation functions

The main computing power of a neural network lies in the linear layers, which contain the learnable weights of the model. However, linear functions are not sufficient to learn complex patterns in the data. Therefore, we introduce non-linear functions, called activation functions, that are applied to the outputs of the linear layers. These layers are what make neural networks universal function approximators, capable of learning patterns and relevant features in real-world data. In the following section, we will cover the most common activation functions used in neural networks, and discuss their pros and cons.
II.2.a Terms
• Activation map / Feature map - an output of a linear layer, after applying an activation function. A
tensor of numbers that represents some learnt features from the previous layer.
• Features - what a group of individual numbers represent together. A single neuron holding a value
has no meaning but within the context of its neighbors or all other neurons in the layer, depending on
the linear layer it is created with (Convolutions vs. Affine Layers).
• Zero centered - a property of an activation function whereby the output of the function can be both positive and negative. The mean doesn't have to be exactly 0 (e.g. LeakyReLU). When an activation is not zero-centered, it can introduce bias into the network and lead to a less optimal step direction (all gradients positive or all negative). However, whether the "zigzag" behavior actually happens boils down to the loss function used.
Example: Assume the following network -
– Y1 = XW1 + b1
– Z1 = σ(Y1 )
– Y2 = Z1 W2 + b2
– Z2 = σ(Y2 )
– L(Z2 ) = l ∈ R
In the plot below, one can see that the MAE (|x|) and MSE (x²) both have negative and positive slopes, while the negative logarithm (CE / BCE), in its effective range [0, 1], has only a negative slope. On the other hand, Z_1 is all non-negative by the definition of the sigmoid function. Therefore, with such a loss, all entries of the gradient with respect to W_2 share the same sign (the upstream gradient has a single sign and the inputs Z_1 are all positive), which produces the "zigzag" update behavior.
[Plot: f(x) for |x|, x², and −ln(x) over x ∈ [−1, 1].]
II.2.b Sigmoid
σ : R → (0, 1)

σ(x) = 1 / (1 + e^{−x})

[Plot: σ(x) over x ∈ [−4, 4].]
Main features:
• It is a scalar function, that operates only on scalars. Therefore, it is applied in an element-wise
fashion to the tensor-like input (on each entry independently).
• It projects the input to the range [0, 1], so it can be interpreted as the probability of the input belonging
to a class.
• It is differentiable, and has a simple derivative:

∂σ(x)/∂x = σ(x)(1 − σ(x))

Note that this derivative can be computed from the output σ(x) alone - the input x itself is not needed for the calculation.
• Usually used in binary-classification, to make the network’s logits probability-like, to be used in the
binary-cross-entropy loss function.
Drawbacks:
• The sigmoid function saturates when the input is very large or very small. Its effective range is roughly
[-3.5, 3.5]. Beyond those boundaries, the gradient of the function is very close to zero - saturates.
• The largest gradient is at x = 0: σ(0) = 0.5 → σ′(0) = σ(0)(1 − σ(0)) = 0.25. Even at its peak the gradient is small, which has a negative impact on the learning process.
• It is not zero-centered.
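A minimal numpy sketch of the sigmoid and its derivative; the two-branch form is a common trick (an assumption made here, not from the lecture) to avoid overflowing exp for large negative inputs:

```python
import numpy as np

def sigmoid(x):
    # Two-branch form: the exp() argument is always non-positive,
    # so it never overflows for large |x|.
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

def sigmoid_grad(s):
    # Derivative computed from the cached OUTPUT s = sigmoid(x);
    # the input x itself is not needed.
    return s * (1.0 - s)
```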
II.2.c Hyperbolic Tangent - Tanh

tanh : R → (−1, 1)

[Plot: tanh(x) over x ∈ [−4, 4].]
tanh(x) = 2σ(2x) − 1
Drawbacks:
• The Tanh function saturates when the input is very large or very small. Its effective range is roughly [−2, 2], which is very limiting. Beyond those boundaries, the gradient of the function is very close to zero - it saturates.
II.2.d Rectified Linear Unit - ReLU

[Plot: ReLU(x) over x ∈ [−4, 4].]
ReLU : R → [0, ∞)

ReLU(x) = max(0, x)
Main features:
• It is a scalar function, operates only on scalars. Therefore, it is applied in an element-wise fashion to the tensor-like input.
• It is extremely simple and computationally efficient.
• It is non-linear, and has a simple derivative: ∂ReLU(x)/∂x = 1 if x > 0, and 0 if x ≤ 0.
• It does not saturate for positive inputs.
Drawbacks:
• The ReLU function saturates when the input is negative. This means that the gradient of the func-
tion is zero for negative values, and the weights leading to those neurons are not updated during
backpropagation. This is known as the dying ReLU problem.
• It is not zero-centered.
Modern alternatives: GELU, ELU, LeakyReLU, etc.
II.2.e The "vanishing gradient" vs the "Dying ReLU" problems

These two terms often appear in the literature, and are usually confused. Even modern language models (as of 2024) repeat the hesitant and incomplete definitions they learned from the internet.
1. The vanishing gradient problem: As mentioned above, some activation functions produce very small gradients for certain values of the input. For very small networks, this doesn't pose a problem. However, in much deeper networks we can observe a very negative behavior: since the gradients of those activations are small to begin with, when multiplied by the equally small gradients of layers "up the stream", they shrink exponentially. This causes a phenomenon where the weights of the shallower layers (closer to the input) are updated far less than the weights of the deeper layers (closer to the output), if at all - their gradients simply "vanish".
Figure II.2: In this example, we can see how the magnitude of the gradients decreases exponentially as they flow "up the stream", from the loss function to the input layer, during backpropagation.
2. The dying ReLU problem: The ReLU function saturates when the input is negative. This means that the gradient of the function is zero for negative values, and the weights leading to those neurons are not updated during backpropagation. Neurons that are "dead" are not activated, and do not contribute to the learning process. This can cause the network to learn suboptimal features, and can slow down the learning process. Note, however, that this holds in the scope of a single iteration of the training process, and doesn't mean the weights will never be updated at some later stage. In some rare cases, though, a majority of the neurons could be "dead", and the network would not learn at all.
3. The difference between the two: The vanishing gradient problem is a gradual problem that usually occurs in very deep networks, given certain activation functions. The dying ReLU, on the other hand, is an immediate problem that influences the training process or the performance in a different way. Both could happen at the same time, or not at all.
How should we then deal with those problems?
• The vanishing gradient problem:
– Use activation functions that have a full-magnitude gradient in their effective range, such as the
ReLU function or its variants.
– Use normalization techniques such as batch normalization or layer normalization to stabilize the
gradients during training.
– Use skip connections or residual connections to allow the gradients to flow more easily through
the network, with the "highway" of gradients.
– Use gradient clipping to prevent the gradients from becoming too large (this addresses the closely related exploding gradient problem).
– Use initialization techniques, such as the Xavier initialization, to prevent the neurons from satu-
rating (subsubsection IV.2.g).
• The dying ReLU problem:
– Use the more updated variants of ReLU, such as the Leaky ReLU, Parametric ReLU, or Expo-
nential Linear Unit (ELU) functions.
– Use initialization techniques such as the Kaiming He initialization, which is tailor-made for ReLU, to prevent the neurons from saturating.
II.2.f Leaky ReLU

[Plot: LeakyReLU(x) over x ∈ [−4, 4].]
LeakyReLU : R → R

LeakyReLU(x) = max(αx, x)
Main features:
• It is a scalar function, operates only on scalars. Therefore, it is applied in an element-wise fashion
to the tensor-like input.
• The variable α ∈ R can take any value, but remains fixed after it is set. Originally set to 0.01.
• It is a simple and computationally efficient function, and is a variant of the ReLU function. However,
costs a bit more computationally than simple ReLU.
• It is non-linear, and has a simple derivative:

∂LeakyReLU(x)/∂x = { α if x ≤ 0; 1 if x > 0 }
II.2.g Parametric ReLU

[Plot: PReLU(x) over x ∈ [−4, 4].]
PReLU : R → R

PReLU(x) = max(αx, x)
Main features:
• It is a scalar function, operates only on scalars. Therefore, it is applied in an element-wise fashion to the tensor-like input.
• The variable α ∈ R is learnable, and changes during training through backpropagation. This allows
the network to learn the optimal value of α for each layer.
• It is non-linear, and has a simple derivative:

∂PReLU(x)/∂x = { α if x ≤ 0; 1 if x > 0 }
II.2.h Exponential Linear Unit - ELU

[Plot: ELU(x) over x ∈ [−4, 4].]
ELU : R → R

ELU(x) = { x if x ≥ 0; α(e^x − 1) if x < 0 }
Main features:
• It is a scalar function, operates only on scalars. Therefore, it is applied in an element-wise fashion
to the tensor-like input.
• It is a simple and computationally efficient function, and is a variant of the ReLU function. However,
costs a bit more computationally.
• It is non-linear, and has a simple derivative:

∂ELU(x)/∂x = { 1 if x ≥ 0; αe^x if x < 0 }
II.2.i MaxOut

MaxOut takes the maximum over several learned affine transformations of the input, e.g. MaxOut(x) = max(w_1^⊺x + b_1, w_2^⊺x + b_2), so the non-linearity itself is learned.

Drawbacks:
• Each MaxOut layer requires twice the number of parameters of a ReLU layer, and is therefore more computationally expensive - in practice it is almost never used.
II.2.j Softmax

Softmax : R^K → (0, 1)^K, Softmax(x)_i = e^{x_i} / Σ_{j=1}^{K} e^{x_j}

Main features:
• Usually not used as a regular activation function, but in very specific cases: To make the logits of
the network to be interpretable as probabilities, for example in the case of multi-class classification.
Another example is to sharpen weights of the attention mechanism in transformers.
• It projects the input to the range [0, 1], and can be interpreted as the probability of the input belonging
to the i-th class.
• It magnifies the differences between the values of the input, and can be used to make the network
more confident in its predictions - big numbers get bigger, and small numbers get smaller.
• It is differentiable, and has a simple derivative:

∂Softmax(x)_i / ∂x_j = Softmax(x)_i (δ_{ij} − Softmax(x)_j)
• In order to avoid numerical instability, the input to the Softmax function is usually shifted by a constant value:

Softmax(x)_i = e^{x_i − max(x)} / Σ_{j=1}^{K} e^{x_j − max(x)}
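A minimal numpy sketch of this max-shifted ("stable") softmax:

```python
import numpy as np

def softmax(x):
    # Shift the logits by their max; the output is mathematically
    # unchanged (see the proof in the answers section below).
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([1000.0, 1001.0, 1002.0])
print(softmax(logits))  # fine; the naive exp(1000) would overflow
```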
II.3 Loss functions
Loss functions represent the goal of training a neural network. They are the guidelines for the weight updates, and should be tailored carefully to the task at hand. While some loss functions might look simple at first glance, they can lead to very specific behaviors of the network and of the features it selects - behaviors that are much more complex than one could imagine. Loss values should almost always be either positive or negative, so the network can learn to minimize or maximize them, respectively. Also, a learning process can involve multiple loss functions that are combined together, and the network is trained to minimize their sum - where some terms are more important than others. These we call "regularization terms", as they make the training process more difficult, for the sake of achieving our desired behavior. It is crucial to understand that the sole goal of a training process of a neural network is to minimize the loss value on the training set. The performance on unseen data, such as the validation and test sets, is a byproduct of this process, and can only be achieved if we prevent the network from performing too well on the training set - a phenomenon called "overfitting". In this case, the network learns to capture features too specific to the training set (e.g. memorizing it) and not general features that could also fit data it has never seen before. More on that in the optimization chapter (chapter IV).
Notes:
• The loss function should always result in a scalar: L : R^N × R^N → R. That specific attribute is what allows us to learn the gradients of the single weight values, all the millions of them, in a way that doesn't require computing Jacobians. The variables in neural networks are always scalars, just assembled into matrices and tensors. More on that in the TUM article, which you can find on Piazza, or in chapter VII at the end of these notes.
• The loss function is differentiable (at least almost everywhere, as in the case of MAE [L1]).
• We ALWAYS divide by N, the number of residuals in the loss function, to make the loss value independent of the number of values being compared. This is crucial, as the loss value should be comparable between different models and different datasets. Pay attention that the number of residuals is not necessarily the number of samples in the dataset; e.g. in the case of semantic segmentation, it is the number of pixels from all the images in the batch all together. Also, if we don't divide the sum, the gradient could become very large, depending on how big the input is, and that could lead to exploding gradients.
• You've seen in class the term "cost function". The deep learning literature is saturated with ambiguities and lacks default definitions for important terms. Therefore, we drop this term completely in this course and discuss ONLY loss functions.
We will distinguish between two main types of loss functions: classification loss functions and regression loss
functions.
II.3.a Classification loss functions

Binary Cross-Entropy - BCE

L_BCE(y, ŷ) = −(1/N) Σ_{i=1}^{N} (y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i))

where y_i is the true label, and ŷ_i is the predicted probability of belonging to label 1.
Note: the value of ŷ_i should never be exactly 0 or 1, as the logarithm of 0 is undefined. To prevent this, we can clip the values of ŷ_i to be in the range [0 + ϵ, 1 − ϵ].
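A minimal numpy sketch of BCE with exactly this clipping (the eps value is a hypothetical choice):

```python
import numpy as np

def bce_loss(y, y_hat, eps=1e-7):
    # Clip the predictions away from 0 and 1 so log() stays finite.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```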
Categorical Cross-Entropy - CE

L_CE(y, ŷ) = −(1/R) Σ_{i=1}^{R} Σ_{j=1}^{C} y_ij log(ŷ_ij)

where y_ij is an indicator of 0 or 1 for whether input i belongs to class j, and ŷ_ij is the predicted probability that it actually does. R is the number of residuals (e.g. number of instances in the batch, all the pixels in the batch, etc.).
Here we employ what we call the "one-hot" encoding, where the true label is a vector of length C with a
single 1 at the index of the true class, and zeros elsewhere. The predicted probabilities are in a vector of
length C:
y = [0 1 0; 1 0 0; 0 0 1], ŷ = [0.1 0.8 0.1; 0.9 0.1 0.0; 0.0 0.0 1.0]
Therefore, in practice the number of residuals doesn’t contain the size of the one-hot encoding vector
(number of classes), as we only consider the residuals of the true class - so we don’t also divide by C and
the loss function value is independent of the number of classes.
II.3.b Regression loss functions

Mean Squared Error - MSE

L_MSE(y, ŷ) = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)²
[Plot: L_MSE as a function of the residual ŷ − y: a parabola.]
Mean Absolute Error - MAE

L_MAE(y, ŷ) = (1/N) Σ_{i=1}^{N} |y_i − ŷ_i|

[Plot: L_MAE as a function of the residual ŷ − y: a V shape.]
Note that it is not differentiable at x = 0, which could cause some numerical instabilities and adds a very small computational overhead when calculating the gradient on a computer.
II.4 Linear Regression
Linear regression is a linear technique for modeling the relationship between different data points or variables. The simplest form of the regression equation, with one dependent and one independent variable, is defined by the formula:
Y = XW + b
where X represents the data, W is the weight, and b is the bias. The goal of linear regression is to find
the best-fitting straight line through the data points, possibly in multiple dimensions. The most common
method is to find the best-fitting line that minimizes the sum of the squared differences between the observed
values and the predicted values. This is known as the least squares criterion:
L(y|X, w, b) = (1/N) Σ_{i=1}^{N} (y_i − (w · x_i + b))²
The goal is to find the values of w and b that minimize this error. This can be done using the closed-form
equation:
W = (X^⊺X)^{−1} X^⊺ Y
This equation can be solved directly for W and b. That means, that if we can load all the data into our RAM,
then we can solve the linear regression problem in one step, and with no need for iterative optimization.
However, this is not always possible, and in practice, we often use iterative optimization methods such as
gradient descent (subsection IV.1.a) to find the optimal values of W and b.
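A minimal numpy sketch of the closed-form solution; folding b into W via a column of ones is an assumption made here for compactness:

```python
import numpy as np

# Synthetic data: y = X @ w_true + b_true (with b_true = 0.7).
N, D = 100, 3
X = np.random.randn(N, D)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.7

# Fold b into W by appending a column of ones to X.
X1 = np.hstack([X, np.ones((N, 1))])      # shape (N, D + 1)
W = np.linalg.inv(X1.T @ X1) @ X1.T @ y   # last entry recovers b
# In practice, np.linalg.lstsq(X1, y, rcond=None) is numerically safer.
```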
II.5 Logistic Regression
[Plot: a sigmoid curve σ(x) fitted to two clusters of data points.]
By using a very basic neural network, of simply one linear layer f(x) followed by a non-linear activation function, we can now model a very common non-linear problem: binary classification. In this setting, we do not try to fit a line to the data, but rather a curve that separates the two clusters or "classes". The linear layer is followed by the sigmoid activation function:

ŷ = σ(f(x)) = 1 / (1 + e^{−f(x)})
This function squashes the logits (the neurons in the output layer) into the range [0, 1], and can be in-
terpreted as the probability of the input belonging to the positive class. The loss function for binary
classification is the binary cross-entropy ("BCE") loss:
L_BCE(y, ŷ) = −(1/N) Σ_{i=1}^{N} (y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i))
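A minimal numpy sketch of this model - one affine layer followed by a sigmoid (the weights here are arbitrary):

```python
import numpy as np

def predict(X, w, b):
    logits = X @ w + b                    # the affine layer f(x)
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid -> probability of class 1

X = np.random.randn(8, 2)                 # 8 samples, 2 features
y_hat = predict(X, np.array([0.5, -0.3]), 0.1)
y_class = (y_hat > 0.5).astype(int)       # threshold at 0.5 for the label
```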
II.6 Complete Schematic of a Fully Connected Neural Network
Try to follow the four steps with some actual numbers to check if you understand how one training step
works. Here are some sample values and the updated weights and biases:
x_1 = 1, x_2 = 2, W_{x1,h1} = 0.1, W_{x2,h1} = 0.2, W_{x1,h2} = 0.3, W_{x2,h2} = 0.4, W_{x1,h3} = 0.5, W_{x2,h3} = 0.6, b_{h1} = 0.1, b_{h2} = 0.2, b_{h3} = 0.3, W_{a1,o} = 0.7, W_{a2,o} = 0.8, W_{a3,o} = 0.9, b_o = 0.1, target = 3.2, lr = 0.1

New weights & biases: W_{x1,h1} = 0.09, W_{x2,h1} = 0.18, W_{x1,h2} = 0.29, W_{x2,h2} = 0.37, W_{x1,h3} = 0.49, W_{x2,h3} = 0.57, b_{h1} = 0.09, b_{h2} = 0.19, b_{h3} = 0.29, W_{a1,o} = 0.69, W_{a2,o} = 0.78, W_{a3,o} = 0.87, b_o = 0.08
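A small numpy sketch that reproduces these numbers, under two assumptions not stated explicitly above: the activations are the identity (a = h), and the loss is L = ½(ŷ − target)². With those, the updates match the listed values after rounding to two decimals:

```python
import numpy as np

x = np.array([1.0, 2.0])
W1 = np.array([[0.1, 0.3, 0.5],
               [0.2, 0.4, 0.6]])  # column j holds W_{x1,hj}, W_{x2,hj}
b1 = np.array([0.1, 0.2, 0.3])
W2 = np.array([0.7, 0.8, 0.9])    # W_{a1,o}, W_{a2,o}, W_{a3,o}
b2 = 0.1
target, lr = 3.2, 0.1

# Forward pass (identity activations: a = h)
h = x @ W1 + b1        # [0.6, 1.3, 2.0]
y_hat = h @ W2 + b2    # 3.36

# Backward pass for L = 0.5 * (y_hat - target)**2
dy = y_hat - target    # dL/dy_hat = 0.16
dW2, db2 = dy * h, dy
dh = dy * W2
dW1, db1 = np.outer(x, dh), dh

# Gradient descent update
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
print(np.round(W1, 2), np.round(b1, 2), np.round(W2, 2), round(b2, 2))
# [[0.09 0.29 0.49] [0.18 0.37 0.57]], [0.09 0.19 0.29], [0.69 0.78 0.87], 0.08
```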
II.7 Questions
1. Linear Regression
a) Assume a linear fitting problem with 5 billion points, each with 100 features. Is it feasible to find the optimal solution without a neural network? State yes or no, and explain why.
b) How can we transform a linear problem (with only affine layers) into a classification problem?
2. Fully Connected Layers
a) Given the following layer y = f(X, W, Z, R, T, A) = XW + ZR + T + A, where X_{N×D}, W_{D×M}, Z_{N×D}, R_{D×M}, T_{N×M}, A_{N×M}, and a loss function L : R^{N×M} → R, L = Σ_{i=1}^{N} Σ_{j=1}^{M} y_ij. What is the gradient ∂L/∂Z? What is the gradient ∂L/∂A?
b) Given a batch of images of shape (8 × 3 × 8 × 8), how can we process them through a fully connected layer? What would be the shape of the weight matrix W in the case of logistic regression with BCE as the loss function?
c) Given a batch of images of shape (8 × 3 × 8 × 8), a fully connected network (FCN), and the task of detecting people's eyes in the images, name two disadvantages of trying to solve the task with this setup.
d) Given two affine layers:
y1 = f1 (X, W1 , B1 ) = XW1 + B1
and
y2 = f2 (y1 , W2 , B2 ) = y1 W2 + B2
Show that it could be described as a single affine layer.
3. Activation Functions
a) What is the purpose of the activation functions in a neural network?
b) Give two advantages of the Tanh function over the Sigmoid function.
c) Explain the vanishing gradient problem, and describe one method that could help us solve it.
d) We've learned that the sigmoid function can cause the "vanishing gradient" problem. Explain why the sigmoid function is nevertheless still sometimes used on the logits (the output of the last layer).
e) Assume the LeakyReLU activation function. If we take an α < 0, what property of the function
would we lose?
f) The Softmax function can suffer from numerical instability, given the logits. Show that subtracting the maximum value of the logits from each one of the values in the vector does not change the output of the function:

Softmax(x − max(x))_i = Softmax(x)_i
4. Loss Functions
a) Which property of loss functions allows us to perform backpropagation without computing Jaco-
bians?
b) Assume that you're using the MSE loss function to compare two batches of images of shape (4 × 3 × 8 × 8). What would be the value of N that we should divide the loss value by?
c) In depth estimation, where we predict the depth of each pixel in an image, it was found that the
loss for the majority of pixels is very small, while for some few pixels it is very large. What would
be a better loss function to use in this case out of [MAE, MSE, BCE, CE], and why?
d) What could cause numerical instabilities in the BCE and CE functions? How could we solve
that?
e) Why can the MAE loss function still be used, although it is not differentiable at x = 0?
f) Assume that in the first iteration of training, some of the logits have values above 1000, while the ground-truth values are in the range [0, 1]. If we're using the MSE loss function, what optimization problem should we expect to observe?
g) Let’s use the CE loss function for a task of classification of 100 classes. What is the expected loss
value after the first iteration, and why?
h) BCE: how many neurons are at the output layer?
i) BCE: why do we multiply the result by −1?
j) Why don’t we multiply by −1 in MSE or MAE?
k) You are given a neural network for a classification task with 4 classes and the CE as a loss
function. The batch size is 1000. After the very first iteration of training, what is the expected
loss value?
II.8 Answers
1. Linear Regression
a) No, it is not feasible. Linear regression has a closed-form solution, but it requires loading into RAM a matrix X of size 5·10⁹ × 100 and computing the closed-form solution from it, which is impractical at this scale. Therefore, we would have to use an iterative training scheme with batches, through a neural network.
b) Add a sigmoid function at the end of the network, and use one of the cross-entropy loss functions.
2. Fully Connected Layers
a) ∂L/∂Z = (∂L/∂y) · R^⊺, ∂L/∂A = 1_{N×M} ⊙ (∂L/∂y), where · is matrix multiplication, and ⊙ is element-wise multiplication.
b) We could flatten the images, each to a vector of size 3 × 8 × 8 = 192 → X_{8×192}, and the weight matrix would be of size W_{192×1}.
c) i. FC layers extract only global features, while the task requires local features.
ii. The number of pixels (resolution) is too small, which means a lack of fine details.
d) y2 = f2 (f1 (X, W1 , B1 ), W2 , B2 ) = f2 (XW1 + B1 , W2 , B2 ) = (XW1 + B1 )W2 + B2 = X(W1 W2 ) +
B1 W2 + B2
3. Activation Functions
a) Activation functions introduce non-linearity to the network, and allow the network to learn com-
plex patterns in the data.
b) 1. The Tanh function is zero-centered, while the Sigmoid function is not. 2. The Tanh function
has a larger maximum gradient than the Sigmoid function.
c) The vanishing gradient problem is a phenomenon where the gradients become exponentially
smaller as they propagate "up the stream", leading to a scenario where the weights of the shal-
low layers are updated much more slowly or not at all. One method to solve it is to use skip
connections, that allow "a highway for the gradients" to flow through the network.
d) The sigmoid function is still used on the logits, as it projects the logits to the range [0, 1], and
can be interpreted as the probability of belonging to a class in binary classification.
e) If we take an α < 0, we would lose the zero-centered property of the function, as all the values
would be non-negative.
f) Softmax(x − max(x))_i = e^{x_i − max(x)} / Σ_{j=1}^{K} e^{x_j − max(x)} = (e^{x_i} · e^{−max(x)}) / (e^{−max(x)} · Σ_{j=1}^{K} e^{x_j}) = e^{x_i} / Σ_{j=1}^{K} e^{x_j} = Softmax(x)_i
4. Loss Functions
a) The property that allows us to perform backpropagation without computing Jacobians is that the loss function is a scalar function L : R^{N×M} → R, and that the variables in neural networks are always scalars, just assembled into matrices and tensors. Eventually, we aim to find ∂L/∂v, where v is a single scalar variable, v ∈ R.
b) N = 4 × 3 × 8 × 8 = 768, as we divide by the number of residuals in the loss function.
c) The MAE loss function would be better, as it is less sensitive to outliers, and would fit the
majority of the residuals better.
d) Numerical instabilities could be caused by the logarithm of 0 in the BCE and CE functions. We
could solve that by clipping the predicted probabilities ŷ to be in the range [0 + ϵ, 1 − ϵ].
e) The MAE loss function can still be used, although it is not differentiable at x = 0, because we only require the loss to be differentiable almost everywhere (locally), which covers practically all cases.
f) The gradients would be HUGE, so we should expect to observe the "exploding gradient" problem,
where the update step would be too large, and the network would diverge.
g) The expected loss value after the first iteration would be − log(0.01) = 4.6, as the network would
be initialized with random weights, and the expected value of the loss function is − log(1/C),
where C is the number of classes.
h) The number of neurons at the output layer would be 1, as we have a binary classification problem.
i) We multiply the result by −1 to make the loss value positive, as for all 0 < x ≤ 1: −log(x) ≥ 0.
j) We don’t multiply by −1 in MSE or MAE, as the loss value should be positive, and the network
should learn to minimize it.
k) − ln(0.25)
III Convolutions
III.1 Definition
A convolutional layer slides a kernel over the input tensor; for every window, the output value is the sum of the element-wise products of the window and the kernel:

Σ_{l=1}^{c_in} Σ_{i=1}^{k_h} Σ_{j=1}^{k_w} x_{l,i,j} w_{l,i,j}
The output is a tensor of shape (cout , hout , wout ), where hout and wout are the height and width of the output
tensor. Those spatial values are calculated as follows:
H_out = (H_in + 2p − k_h) / s + 1

W_out = (W_in + 2p − k_w) / s + 1
Note that not all combinations are valid - meaning that if we are not careful with the choice of these parameters, not all windows will be computed. A window that doesn't fit into the input will simply be dropped.
Popular trios:
• k = 1, s = 1, p = 0 - Pointwise convolutions - process only a single pixel across its channels. Keeps the
spatial dimensions intact.
• k = 3, s = 1, p = 1 - The most common one. Has a small neighborhood, and it keeps the spatial size of the input (h_in = h_out, w_in = w_out).
• k = 3, s = 2, p = 1 - The stride is doubled, so the output spatial size is half the input’s.
• k = 7, s = 4, p = 3 - The stride is quadrupled, so the output spatial size is 1/4 of the input’s.
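A small sketch of the output-size formula, checking the trios above (assuming a 32 × 32 input):

```python
def conv_out_size(h_in, w_in, k, s, p):
    # Integer division: windows that don't fit are simply dropped.
    return (h_in + 2 * p - k) // s + 1, (w_in + 2 * p - k) // s + 1

print(conv_out_size(32, 32, k=3, s=1, p=1))  # (32, 32): size preserved
print(conv_out_size(32, 32, k=3, s=2, p=1))  # (16, 16): halved
print(conv_out_size(32, 32, k=7, s=4, p=3))  # (8, 8): quartered
```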
III.2 Unique Convolutional Layers
III.2.a MaxPooling
Max-pool is a layer that operates channel-wise (the kernel has a depth of 1), in contrast to the default
convolution layer, where the kernel has the same depth as the input. Given a window, it returns the
maximum value within it. Usually, it is used in order to reduce the spatial size of the input, while introducing
non-linearity but no additional parameters. However, this layer doesn’t necessarily reduce the spatial size,
and could be used while maintaining the same spatial dimensions, e.g. in the case of the "inception layer"
(section V.6). The most common configuration uses a kernel size of 2 × 2, a stride of 2, and no padding.
Note about the gradient: only the entry (or multiple entries) in the window that holds the maximum value of that window will get a derivative value of 1. The rest are killed with 0s.
III.2.b Average pooling

Similar to maxpool, but instead of returning the maximum value, it returns the average value of all entries of the window. Note that this operation is linear, and does not introduce non-linearity. It is also normally used to reduce the spatial size of the input, with no additional parameters.
III.2.c 1 × 1 convolutions
Also called "Point wise convolution". This is a special case of the convolution layer, where the kernel is 1 × 1.
This layer is used usually to reduce the number of channels in the input, while keeping the spatial size. This
layer is also first introduced in the "inception layer" (section V.6), with the purpose of reducing the channels
before the more expensive operations, such as 3 × 3, 5 × 5 convolutions. With that, they managed to reduce
dramatically the number of parameters and calculations, while maintaining the performance to some extent.
Note that this layer doesn't extract spatial features, as it has no neighborhood, and its receptive field is the same as the layer before it. Also, it is sometimes claimed that this layer adds non-linearity. This is of course NOT TRUE, but one could follow it with some non-linear activation, since it is quite cheap.
III.2.d Depth-wise Convolutions

This is a special case of the convolution layer, where we take a single kernel of depth 1 for each channel. It is common in modern visual-transformer architectures.
III.2.e Upsample
A linear layer that introduces no additional parameters, to increase the spatial size of the input. Usually
using a predefined algorithm, such as nearest neighbor, bilinear or bicubic.
III.2.f Transposed Convolutions

Unlike the upsampling layer, the transposed convolution introduces additional parameters in the form of kernels. While being more capable and able to adjust to the task at hand, it is also more expensive. Note that it is not the same as a "deconvolution" layer.
III.3 Receptive Fields
The receptive field is the number of pixels in the network's input by which an intermediate pixel in some hidden layer is affected - the spatial extent of the connectivity of a convolutional filter. What is the relationship between the input and the output? For example, a 3×3 receptive field means 1 output pixel is connected to 9 input pixels. Note that we do not consider the channel dimension - we only care about the spatial context.
To calculate the receptive field of a pixel in the intermediate layer l, we use the following formula:

r_l = r_{l−1} + (Π_{i=1}^{l−1} s_i) · (k_l − 1),   for l ≥ 2

where s_i is the stride of layer i and k_l is the kernel size of layer l. Also, r_1 is the kernel size of the first layer. Note that we start from the input and go deeper, and not from the requested layer backwards. For example, for a stack of 3×3, stride-1 convolutions (r_1 = 3, r_2 = 5):

r_3 = r_2 + (Π_{i=1}^{2} s_i) · (k_3 − 1) = 5 + (1 · 1) · (3 − 1) = 7
Note: For fully connected layers, the receptive field is the ENTIRE image. Since each output neuron of
that layer is connected to each neuron of the previous one, where each layer represents the entire output,
whether it is down-scaled or upscaled.
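A small sketch of this recursion (the example below assumes a stack of 3 × 3, stride-1 convolutions):

```python
import math

def receptive_field(kernels, strides):
    # r_1 = k_1;  r_l = r_{l-1} + (prod of strides of layers 1..l-1) * (k_l - 1)
    r = kernels[0]
    for l in range(1, len(kernels)):
        r += math.prod(strides[:l]) * (kernels[l] - 1)
    return r

# Three 3x3, stride-1 convolutions: r grows 3 -> 5 -> 7
print(receptive_field([3, 3, 3], [1, 1, 1]))  # 7
```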
III.4 Handcrafted Kernels
Important: For some reason, these keep popping up in the exams. You should know them by heart, or
just decide to skip them in advance. The most common ones are:
• Edge detection:
– [−1 −1 −1; −1 8 −1; −1 −1 −1]
– Sobel - vertical and horizontal filters: [1 0 −1; 2 0 −2; 1 0 −1] and [1 2 1; 0 0 0; −1 −2 −1]
• Blurring:
– Box mean: (1/9) · [1 1 1; 1 1 1; 1 1 1]
– Gaussian blur: (1/16) · [1 2 1; 2 4 2; 1 2 1]
• Sharpen:
– [0 −1 0; −1 5 −1; 0 −1 0]
(rows separated by semicolons)
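A small sketch of applying such a kernel outside a network, e.g. with scipy (the random image is a stand-in):

```python
import numpy as np
from scipy.signal import convolve2d

gaussian = np.array([[1, 2, 1],
                     [2, 4, 2],
                     [1, 2, 1]]) / 16.0

image = np.random.rand(8, 8)  # stand-in grayscale image
# For a symmetric kernel like this one, true convolution and the
# cross-correlation used in deep learning give the same result.
blurred = convolve2d(image, gaussian, mode="same")
```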
III.5 Questions
1. How can we represent an FC layer with 5 output neurons with a convolutional layer, over an image in a batch of size 4 × 3 × 8 × 8? And with a 1×1 convolution?
2. Given a 1 × 1 convolutional layer with input tensor of 10 channels, that outputs a tensor with 5
channels.
a) Write the shape of the weight matrix.
b) State the number of parameters in the layer.
3. Give two reasons to use a convolution or a pooling layer that reduces the spatial size of the input tensor.
4. Does reducing the spatial size throughout the network reduce the number of parameters in the down-
the-stream convolutional layers?
5. Does reducing the spatial size throughout the network reduce the number of parameters in the down-
the-stream FC layers?
6. State two differences between a convolutional layer and a pooling layer.
7. Assume a fully convolutional model for some task:
a) Can we feed the model with images that are double the resolution of the original training set?
b) Can we expect in such case the same performance?
8. Can we apply a pooling layer without reducing the spatial size of the input tensor?
9. Assume maxpool with (k = 2, s = 2, p = 0), where only a single entry in each window holds the maximum value of that window. How many of the total pixels in the tensor would get a live gradient? What is the value of said gradient?
10. State one advantage and one disadvantage of a 1 × 1 convolution over a 3 × 3 convolution.
11. Why don’t we use hand-crafted kernels (e.g. Sobel filter for edge detection) within our deep learning
models?
12. In what technique, however, can we use hand-crafted kernels such as Gaussian blur?
13. Assume that we want to use a transformer to process an intermediate feature map (an output tensor
of a convolutional layer), to learn meaningful relations between the pixels. How can we change the
tensor to do that?
III.6 Answers
1. FC layers take ALL neurons in a single input from the batch. To do that with convolutions, we must
have a kernel with kh = Hin , kw = Win and kc = Cin . The weight matrix would be 5 × 3 × 8 × 8
(Number of kernels × kc × kh × kw ). The output tensor would be of shape 4 × 5 × 1 × 1. With a 1x1
Convolution, we can simply reshape the input tensor to 4 × (3 · 8 · 8) × 1 × 1 and apply a 1 × 1 convolution with 5 output channels.
2. a) • w = 5 × 10 × 1 × 1
• b=1×5×1×1
b) 55.
3. Two reasons to reduce the spatial size of the input tensor are:
• Compression enforces feature selection and extracting meaningful features.
• Using smaller tensors throughout the network reduces the number of calculations, allowing us to
use more layers for a greater capacity.
4. No - as long as the spatial size remains bigger than the convolutional kernel, the kernel sizes don’t
change, so the number of parameters stays the same.
5. Yes - reducing the spatial size throughout the network would result in smaller tensors, and therefore
smaller FC layers, which take the entire previous layer as input.
6. Two differences between a convolutional layer and a pooling layer are:
• A convolutional layer has learnable parameters, while a pooling layer doesn’t.
• Pooling layers work channel-wise (each channel independently), while default convolutional layers
have the same depth as the input tensor.
7. a) Yes, fully convolutional networks can take varying sizes of inputs without any architectural
changes, as long as the input remains large enough for the network's downsampling operations.
b) No, convolutional layers are not invariant to scale changes.
8. Yes, we can apply a pooling layer without reducing the spatial size of the input tensor by using a
relevant trio, e.g. (k=3, s=1, p=1), just like with convolutions.
9. Each window is of size 2 × 2 and therefore has 4 entries. Only a single entry gets a gradient, and
therefore 1/4 of the total entries would have a live gradient, with a value of 1.
10. 1 × 1 vs 3 × 3:
a) Advantage: Much cheaper in terms of parameters and calculations.
b) Disadvantage: Smaller receptive field, so the scope of the layer is much narrower.
11. We don’t use hand-crafted kernels within our deep learning models because they are not learnable,
and thus not able to adapt to the task at hand.
12. We can use hand-crafted kernels such as Gaussian blur within our deep learning models by using them
as the initial weights of the convolutional layers, and then updating them during training / use in data
augmentation.
13. To use a transformer to process an intermediate feature map, we can change each instance in the
batch to a 2D tensor, by transposing the channels to the last dim and flattening the spatial size:
(B × C × H × W) → (B × H × W × C) → (B × (H ∗ W) × C). Now each pixel, across the channels,
is a vector token for the transformer head.
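A minimal PyTorch sketch of this reshaping (the sizes are arbitrary examples):

import torch

B, C, H, W = 4, 64, 16, 16
feat = torch.randn(B, C, H, W)
tokens = feat.permute(0, 2, 3, 1).reshape(B, H * W, C)
print(tokens.shape)  # torch.Size([4, 256, 64]) - one C-dim token per pixel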
IV Optimization
Since we deal with big data and non-linear problems, we use iterative optimization algorithms to find the best
solution. Neural networks compose many different functions, and therefore their optimization
problem is usually non-convex. This means that off-the-shelf basic neural networks face a loss surface
with many different minima. It is common to assume that any local minimum is good enough
as a solution, but we nevertheless always aim for the best solution and performance. In this section
we will discuss the different approaches that exist for solving iterative optimization problems, and see why
some of them are better than others, practically vs. theoretically.
IV.1 Optimization algorithms
IV.1.a Gradient Descent
Gradient descent (GD) is the basic optimization algorithm, on which all modern deep learning optimization
problems are based. It follows a very basic idea: For each variable θ in the model, calculate the gradient
over the entire training-set, and then perform an update on the variable θ in the direction of the negative
gradient. This is done iteratively until the loss function converges to a minimum. The update rule is as
follows:

θ_{t+1} = θ_t − α · ∂L/∂θ_t

Where θ_t is the variable at epoch t, α is the learning rate, and ∂L/∂θ_t is the gradient of the loss function L
with respect to the variable θ_t .
Why go in the direction of the negative gradient? Because the gradient points in the direction of the steepest
ascent, and we want to minimize the loss function, so we go in the opposite direction.
Notes:
• Gradient descent is a first-order optimization algorithm, meaning that it only uses the first derivative
of the loss function.
• In gradient descent, we perform the optimizer step only at the end of the epoch, which is only when
we’ve iterated over the entire training set.
• This could be done in batches. Since datasets are usually too large to handle, we split them into
smaller batches and use gradient accumulation - we accumulate the gradients in some buffer and at
the end of the epoch, we divide by the number of iterations and only then perform the optimizer step.
• While gradient descent calculates the most accurate gradient, it has a few major drawbacks:
– Since we perform the optimizer step only at the end of the epoch, each single step could take a
long time to compute - the model might never converge in reasonable time.
– Calculating the gradient over the entire training set removes the approximation noise that comes
from doing the optimizer step over a small batch. That noise is actually a good regularizing force
that helps avoid saddle points; therefore, GD tends to underfit.
– Calculating the gradient over the entire training set is very computationally expensive, and there-
fore it is not practical for large datasets.
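A minimal sketch of GD via gradient accumulation, assuming hypothetical `model`, `loss_fn` and a `loader` that iterates over the entire training set once per epoch:

import torch

def gd_epoch(model, loss_fn, loader, lr=0.1):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    opt.zero_grad()
    n_batches = 0
    for x, y in loader:
        loss = loss_fn(model(x), y)
        loss.backward()            # grads are summed into the .grad buffers
        n_batches += 1
    for p in model.parameters():   # average the accumulated grads
        if p.grad is not None:
            p.grad /= n_batches
    opt.step()                     # a single optimizer step per epoch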
IV.1.b Stochastic Gradient Descent
Stochastic gradient descent (SGD) is a variation of the gradient descent algorithm that aims to solve the
problems that come with basic GD. While the original definition consists of using a batch size of 1, in
practice we use a small batch from the training set and perform the optimizer step at each iteration. This
way, we can perform the optimizer step many times in a single epoch, and therefore the model will converge
faster, although it has a much noisier gradient.
The update rule is exactly the same as in GD, but we perform the optimizer step at each iteration t + 1:

θ_{t+1} = θ_t − α · ∂L/∂θ_t
Notes:
• SGD is a first-order optimization algorithm, meaning that it only uses the first derivative of the loss
function.
• The original definition of SGD is when the batch size is 1, but in practice, we use a small batch size
(e.g 4, 8, 16, 32, 64, etc.).
• SGD is much faster than GD.
• "Noisier steps" comes from the fact that SGD only approximates the global gradient on a small batch,
which cannot represent the real distribution of the data. Therefore, features that are learned on a
current sample, might not be relevant for a sample from the next batch. However, this could be a
good thing, as it helps avoiding saddle points. The bigger the batch size is, the lower the noise is.
• Implementation-wise, the ONLY difference between GD and SGD is that in SGD we perform the
optimizer step at each iteration, and not at the end of the epoch.
IV.1.c SGD with Momentum
SGD with momentum is a variation of the SGD algorithm that aims to solve the problems that come with
basic SGD. The idea is to add a momentum term to the update rule, which helps the optimizer
converge faster. The update rule is as follows:
m_{t+1} = β · m_t − α · ∂L/∂θ_t
θ_{t+1} = θ_t + m_{t+1}
Where mt is the momentum term at iteration t and β is the momentum coefficient (usually 0.9).
Notes:
• SGD with momentum is a first-order optimization algorithm.
• m0 is usually initialized to 0.
• The momentum term is increased or decreased iteratively, carried over from one iteration to the next.
It is similar, but not identical, to the "exponential moving average" concept.
• When the "slope" is steep, the velocity will increase very fast, allowing us to overshoot saddle-points.
IV.1.d Nesterov Momentum
The Nesterov Momentum optimizer (NAG) improves on standard momentum by anticipating the direction
of the update, computing the gradient at a point slightly ahead ("Look Ahead") of the current position.
This could result in faster convergence and better training performance for deep learning models.
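A minimal NumPy sketch of the momentum update above, on a toy quadratic; the `grad` callable is an assumed stand-in for backpropagation:

import numpy as np

def sgd_momentum_step(theta, m, grad, lr=0.01, beta=0.9):
    m_new = beta * m - lr * grad(theta)   # m_{t+1} = beta*m_t - alpha*dL/dtheta
    theta_new = theta + m_new             # theta_{t+1} = theta_t + m_{t+1}
    return theta_new, m_new

theta, m = np.array([2.0]), np.zeros(1)   # m0 is initialized to 0
for _ in range(100):                      # minimize f(theta) = theta^2
    theta, m = sgd_momentum_step(theta, m, grad=lambda t: 2 * t)
print(theta)                              # approaches 0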
IV.1.e RMSprop
This algorithm follows the idea of dividing the learning rate by the square root of a running average of
the squared gradients, to mitigate big oscillations in the gradient. Since this is done for each variable in the gradient
individually, it results in a different optimization step size for each one of them. The update rule is as
follows:

v_{t+1} = β · v_t + (1 − β) · (∇_θL ∘ ∇_θL)
θ_{t+1} = θ_t − α · ∇_θL / (√(v_{t+1}) + ϵ)

Where v_t is the velocity term at iteration t, β is the decay coefficient (usually 0.99), and ϵ is a small
number to avoid numerical instability.
Notes:
• RMSprop is a second-order optimization algorithm, meaning that it uses the second moment of the
loss function gradient.
• The idea is to divide the learning rate by the square root of the running average of the squared gradients,
to mitigate big oscillations in the gradient.
• The optimization step is different for each variable in the gradient - hence it is called an adaptive
learning rate.
• The squared gradients approximate the second moment of the gradient, which is the variance. Recall
that the variance is

Var(X) = E[(X − µ)²]

We assume that the mean of all variables is µ = 0. Therefore, the element-wise square of the gradient,

∇_θL(θ_t) ∘ ∇_θL(θ_t),

approximates the variance of each gradient entry.
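A minimal NumPy sketch of the update above (element-wise on the parameter array):

import numpy as np

def rmsprop_step(theta, v, g, lr=0.01, beta=0.99, eps=1e-8):
    v = beta * v + (1 - beta) * g * g            # running avg of squared grads
    theta = theta - lr * g / (np.sqrt(v) + eps)  # per-parameter adaptive step
    return theta, v

theta, v = np.array([2.0, -3.0]), np.zeros(2)
for _ in range(500):                             # minimize f = ||theta||^2
    theta, v = rmsprop_step(theta, v, g=2 * theta)
print(theta)                                     # both entries approach 0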
IV.1.f Adam
Adam is a combination of RMSprop and SGD with momentum. It uses the squared gradients to scale
the learning rate like RMSprop, and it uses the momentum term to avoid local minima like SGD with
momentum. The update rule is as follows:

m_{t+1} = β₁ · m_t + (1 − β₁) · ∇_θL
v_{t+1} = β₂ · v_t + (1 − β₂) · (∇_θL ∘ ∇_θL)
m̂_{t+1} = m_{t+1} / (1 − β₁^{t+1})
v̂_{t+1} = v_{t+1} / (1 − β₂^{t+1})
θ_{t+1} = θ_t − α · m̂_{t+1} / (√(v̂_{t+1}) + ϵ)
m represents the running mean of the gradients (the first moment), and v stands for the running
variance - for each single entry of the gradient matrix.
Notes:
• Adam is a second-order optimization algorithm
• The idea is to combine the best of both worlds - RMSprop and SGD with momentum.
• Bias correction: The Adam optimizer uses a bias correction term to correct the bias of the first
and second moments. It was first introduced with Adam, and therefore doesn't exist in RMSprop,
although it should. The idea is that at the first iteration, the variables m, v are initialized to 0, and
therefore they are biased towards 0. To correct this bias, we divide them by the term 1 − β^{t+1}, where
t is the iteration number, used here as the exponent. As the number of iterations increases, the
correction terms vanish; but at the very first iterations, they counteract the bias of m, v towards zero,
so the optimizer step is much more meaningful:
m₁ = β₁ · m₀ + (1 − β₁) · ∇_θL = (1 − β₁) · ∇_θL     (since m₀ = 0)
m̂₁ = m₁ / (1 − β₁) = ∇_θL
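A minimal NumPy sketch of the update above, including the bias correction (the iteration index `t` starts at 0, so the exponent is t + 1):

import numpy as np

def adam_step(theta, m, v, g, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g            # running mean of the gradients
    v = b2 * v + (1 - b2) * g * g        # running (uncentered) variance
    m_hat = m / (1 - b1 ** (t + 1))      # bias correction
    v_hat = v / (1 - b2 ** (t + 1))
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v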
IV.1.g Newton's Method
While stochastic gradient descent and its variants show great capabilities, in theory they lag behind
some other methods. Newton's method is a second-order optimization algorithm that uses the second
derivative of the loss function to find the optimal solution, in much fewer iterations than SGD. It follows
the Newton-Raphson method, a root-finding algorithm that uses the first derivative of a function
to find its root (e.g. y = f(x) → y = 0). The update rule is as follows:
Newton-Raphson method:

x_{t+1} = x_t − f(x_t) / f'(x_t)

But here we want to find a root of the gradient, so we apply the method to the derivative of the loss,
which brings in the second derivative. For a single variable x:

x_{t+1} = x_t − f'(x_t) / f''(x_t)
For a parameter vector θ:

θ_{t+1} = θ_t − H^{−1} · ∇_θL(θ_t)
Where H is the Hessian matrix (matrix of second derivatives) according to the loss function, and H −1 is its
inverse.
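A minimal sketch of the single-variable version on a toy function with known derivatives:

def newton_step(x, f1, f2):
    return x - f1(x) / f2(x)      # x_{t+1} = x_t - f'(x_t) / f''(x_t)

f1 = lambda x: 4 * x**3 - 6 * x   # f'(x)  for f(x) = x^4 - 3x^2 + 2
f2 = lambda x: 12 * x**2 - 6      # f''(x)
x = 2.0
for _ in range(6):
    x = newton_step(x, f1, f2)
print(x)   # converges to sqrt(3/2) ~ 1.2247, a minimum of f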
Notes:
• Newton’s method is a second-order optimization algorithm.
• This method is NOT FASTER than SGD in wall-clock time, although it converges in fewer iterations,
because the inverse Hessian matrix is really expensive to compute.
• So why isn't it used in practice? Newton's method requires the entire training set as input to
compute the inverse Hessian matrix, which is super expensive in memory and time.
• There are some variants of the Newton’s method that are used in practice, that approximate the
Hessian matrix and therefore are much faster. The most common one is the L-BFGS algorithm.
However, they also still require the entire training set as input, and therefore are not practical for
large datasets and deep learning.
In the literature, you'll find a lot of inconsistencies about this term. Some claim that Adam, RMSprop and
Newton's method are all second-order methods, and some say that only Newton's method is.
• It doesn't really matter. It's not important. Don't break your head over it. Simply, until
stated otherwise, in this course - all three of the above are second order.
• Adam and RMSprop are second-order gradient-descent-based optimizers (second moment).
• Newton's method is a second-order optimization method (second derivative).
Both cases are commonly shortened to "second order" methods.
IV.2 Optimization problems and solutions
As explained before, the sole goal of training neural networks is to minimize the loss function; the
generalized performance on unseen data is only a byproduct of that. However, that imposes a problem: if
the model is too capable, it could fit the training data too well. There are plenty of reasons for such a
behavior, such as:
• The model is too complex, and therefore has too many parameters. Instead of learning general patterns,
it memorizes the training data.
• Too little data, so the model has no way to learn the general patterns, but only the specific ones
that fit the small dataset.
• Not stopping early enough, so after learning relevant, more general, features - the model learns the
noise in the data.
That behavior is called in the literature "overfitting". How can we detect it? By looking at the validation
loss. If the "generalization gap", the difference between the training loss and the validation loss, is too big
- the model is overfitting. Usually, by monitoring those losses during training, we would see that while the
training loss is decreasing, the validation loss is either increasing (diverging) or staying the same.
Solutions:
• Increasing the size of the training set: This is NOT a valid solution, as the creation of a dataset is a
very difficult and expensive task, and we should try and work with the limited resources available. A
method that could create more data out of the existing one is called "data augmentation".
• Regularization: The go-to methods. Make training harder, so the network will learn the general
patterns. Examples: Weight-decay, Dropout, data augmentation, etc.
• Early stopping: In case one doesn’t save checkpoints during training, the model could be stopped
when the generalization gap starts to grow. However, this is NOT a regularization term, as it doesn’t
make the training harder, but rather stops it.
• Hyperparameter tuning: The model might be too complex, and therefore we should try and reduce
its capacity by tuning the hyperparameters.
On the other hand, a model might not be able to reduce the training loss to a sufficient level, i.e. it fails
to solve the task even on the training data. This counter-phenomenon is called "underfitting". It could occur for
some of the following reasons:
• The model’s capacity is too low, meaning it is either too small or uses a bad architecture, and therefore
cannot learn the general patterns in the data.
• The model is not trained for enough epochs and therefore has not converged to a minimum.
• The model is using a very basic optimization algorithm, and gets stuck in a "saddle point", where the
gradient is 0.
How to detect it? First, we could look at the training loss. If it is still decreasing (along with the validation
loss) when the model stops training, then it could have been trained for more epochs. Second,
we could assess the accuracy of the model on the training data: check the classification accuracy, visualize
the reconstructed image, etc. Undesirable results would mean that the model fails to fit even the
training data.
Solutions:
• Increasing the model’s capacity: by changing the architecture, adding more layers, etc.
• Training for more epochs: The model might not have converged to a minimum yet.
• Using a better optimization algorithm: The model might be stuck in a saddle point, and therefore we
should use a better optimization algorithm.
• Learning rate decay: The model might be oscillating ("overshooting") around the minimum, and
therefore we should decrease the learning rate.
We’ve discussed in a previous section (subsection II.2.e) the problem of vanishing gradients and methods to
solve it. On the other hand, exploding gradients is a problem that occurs when the gradients are too big,
and therefore the optimizer step is too big, and the model diverges. This could happen for a few reasons:
• Recurrent cells: In RNNs, the gradients are being multiplied by the same matrix at each time step. If
the matrix has an eigenvalue greater than 1, the gradients will explode.
• No normalization: If activations become very large after linear layers, then the gradients of the weights
will be very large as well.
• Bad initialization: If the weights are initialized to very large values, the activations will also be very
large, and that could affect the next layers’ gradients.
Solutions:
• Gradient clipping: The idea is to clip the gradients to a certain value, so they won't explode. This
is a very common method in RNNs, where the gradients are usually very large. Note: this method
doesn't clip them from below, so it doesn't help with vanishing gradients.
• Normalization: The idea is to normalize the activations after each layer, so they won’t become too
large. Examples: Batch normalization, Layer normalization, etc.
• Weight initialization: Use known initialization methods, such as Xavier or Kaiming initializations,
that will prevent the activations from becoming too large.
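A minimal PyTorch sketch of gradient clipping inside a training step (hypothetical `model`, `loss_fn`, `opt` and batch `(x, y)`):

import torch

def train_step(model, loss_fn, opt, x, y, max_norm=1.0):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # rescale the gradients so their global norm is at most max_norm
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    opt.step()
    return loss.item()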
If you don’t implement this crucial technique, stop doing deep learning and go do something else with
your life. Learning rate scheduling is critical, as the learning rate is the hyperparameter that affects the
optimization process the most. The basic notion is to decrease the learning rate as training progresses,
so the optimizer steps become smaller and smaller, and the model converges to a minimum. In addition,
this method could also be used as regularization, if used correctly as described in the amazing paper on
"super-convergence" (Link). Robbins and Monro summarized (in 1951) the conditions on the learning
rates α_t for theoretical convergence for a strictly convex function F(x, θ): the steps must be large enough
overall to reach the minimum, yet eventually vanish - Σ_t α_t = ∞ and Σ_t α_t² < ∞.
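A minimal PyTorch sketch of a built-in scheduler; the `Linear` model is a stand-in, and `StepLR` multiplies the learning rate by `gamma` every `step_size` epochs:

import torch

model = torch.nn.Linear(10, 1)                      # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... the training iterations of this epoch would run here ...
    opt.step()          # placeholder step (a real epoch performs many)
    sched.step()        # the LR is divided by 10 every 30 epochs
print(sched.get_last_lr())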
IV.2.d Regularization
Regularization techniques are common practices in deep learning to prevent overfitting, but also to introduce
constraints or secondary objectives to the optimization problem. Regularization methods usually make the
training process harder, so the model will seek to learn more general patterns in the data, in order to
overcome the added difficulty and keep bringing the training loss down. Usually, we will observe that the
training loss is higher, but the validation loss is lower - hence the generalization gap is minimized.
A basic technique is adding a penalty term to the loss function that tries to minimize the weights themselves,
namely pushing them towards zero (if it fully succeeded, the model would not be able to compute anything -
hence the need to balance it via λ below). The most common regularization terms are the L1 and L2
regularization, which are defined as follows:

L1: λ · Σ_{i=1}^{#Θ} |θ_i|

L2: (λ/2) · Σ_{i=1}^{#Θ} θ_i²
Where λ is the regularization coefficient, and θ_i is the i-th weight in the model. The chosen term - for
example, the L2 regularization - is added to the loss function, and the optimizer will try to minimize it
as well:

L_reg = L + (λ/2) · Σ_{i=1}^{#Θ} θ_i²

∂L_reg/∂θ_{l,i,j} = ∂L/∂θ_{l,i,j} + λ · θ_{l,i,j}
Where L is the original loss function. The L1 regularization is also called "Lasso" regression, and the L2
regularization is also called "Ridge" regression. Note the λ hyperparameter that controls the strength of the
regularization term. If λ is too big, the model will not be able to learn anything, and if it is too small, the
term won’t have any meaningful effect.
The main differences between L1 and L2 regularization are:
• L1 regularization is more likely to produce sparse weights, i.e. pushing some of the weights to zero.
This could be useful for feature selection, as the model will learn which features are important and
which are not.
• L2 regularization is more likely to produce small weights, and spreads the penalty across all the weights.
• Therefore, and because L2 is also easier to compute, it is more common in practice.
Weight decay is a similar term to the formula above, folded directly into the update rule:

θ_{t+1} = θ_t − α · (∂L/∂θ_t + λ · θ_t) = θ_t − α · ∂L/∂θ_t − α · λ · θ_t
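A minimal PyTorch sketch: for plain SGD, the `weight_decay` argument realizes exactly the extra −αλθ term above (the tiny model is a stand-in):

import torch

model = torch.nn.Linear(10, 1)   # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# The equivalent explicit L2 penalty would instead be added to the loss:
# loss = task_loss + 0.5 * 1e-4 * sum((p ** 2).sum() for p in model.parameters())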
Dropout
This is the most common and most useful regularization technique to date. The idea is to randomly "drop"
some of the neurons in the network during training, so the model will not be able to rely on any single
neuron, and will have to learn more general patterns. A key feature here is that it reduces the co-adaptation
of neurons, and they will not be able to rely on each other.
Notes:
• Dropout introduces a hyperparameter p. In literature, it could be referred to as pdrop or pkeep , etc.
In the original paper, it is the probability to keep a neuron, while PyTorch’s implementation is the
probability to drop a neuron.
• The dropout layer with pdrop will drop, on average, a fraction pdrop of the neurons (or channels, for
convolutions). For a single neuron, it is the probability of being dropped.
• The dropout layer drops neurons only during training, and not during inference. At the latter stage,
the neurons are scaled by pkeep , such that x⋆ = x · pkeep . This is done to keep the same magnitude (or
expected value if you’d like) of the activations during inference, as this is what the next layer expects.
• At each iteration, the neurons to be dropped (or kept) are chosen randomly in respect to the parameter
p, and therefore the model is trained on a different “subnetwork” at each iteration. This is why dropout
is usually looked at as an ensemble of different models.
• Dropout is usually applied last in the computation block (Linear → Normalization → Activation →
Dropout).
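A minimal NumPy sketch of (non-inverted) dropout as described above - random drops during training, scaling by p_keep at inference. Note that PyTorch instead implements "inverted" dropout, scaling by 1/p_keep already during training:

import numpy as np

def dropout(x, p_drop=0.5, train=True):
    if train:
        mask = np.random.rand(*x.shape) >= p_drop   # keep with prob 1 - p_drop
        return x * mask
    return x * (1.0 - p_drop)    # inference: scale by p_keep, x* = x * p_keep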
The Dropout layer operates differently for Affine and Convolutional layers. For the Affine layers, single neurons
are dropped randomly, while for the latter, entire channels are dropped. This is because
convolutional layers capture spatial information, in contrast to the Affine layers, where features are learned
by each neuron. Therefore, dropping whole channels retains the capability of extracting meaningful features, while
still introducing regularization.
Figure IV.1: Spatial Dropout for convolutions with dropping probability pdrop = 1/3. The brighter channels
in this example image are dropped (all entries are set to zero).
Data augmentation
Data augmentation creates more training data out of the existing samples by applying random transformations.
The original training set is not changed; rather, at each epoch, the model is trained on different variations of the data, with some
probability p. The idea is to make the model more robust to different variations in the data, and to prevent
overfitting. The most common transformations are: rotation, scaling, translation, flipping, cropping, etc.
Notes:
• Data augmentation is usually applied to image data, but could be applied to any kind of data.
• The transformations are applied randomly, with some probability p.
• The transformation is done solely on the training set, and not on the validation or test sets.
• The transformations are usually applied in the data loader, and not in the model itself.
• Not all transformations are useful for all tasks. For example, rotating images of digits is not useful,
as it will change the label of the image.
• Make sure that in supervised learning, the label of the data is also transformed accordingly, if needed.
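A minimal torchvision sketch of such a pipeline (the transform choices are illustrative):

import torchvision.transforms as T

train_tf = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomRotation(degrees=10),
    T.RandomResizedCrop(size=32, scale=(0.8, 1.0)),
    T.ToTensor(),
])
# Applied in the data loader, training set only, e.g.:
# torchvision.datasets.CIFAR10(root=".", train=True, transform=train_tf)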
Batch Normalization (Link) is a technique that is used to normalize the neurons of the layer. The idea is to
make the optimization process easier, as the activations will not be too large or too small, and the optimizer
will be able to converge faster.
Note that we normalize groups of neurons. Each group consists of all the neurons in the batch that
are created by the same weights, whether it's a set of weights in an FC layer or a kernel in a convolution.
For all corresponding neurons, perform the following normalization:
x̂ = (x − µ) / √(σ² + ϵ)
Where µ is the mean of the activations, σ² is the variance of the activations, and ϵ is a small number to avoid
numerical instability. Then, scale and shift the activations:
y = γ x̂ + β
Where γ and β are learnable parameters, that are learned during training. The idea is that the model
will learn the optimal mean and variance for the activations, and therefore the optimization process will be
easier.
Written out for activations x ∈ R^{N×M} (batch size N, M neurons), γ and β are broadcast row-wise:

y = [γ₁ γ₂ ... γ_M] ⊙ x̂ + [β₁ β₂ ... β_M],   i.e.   y_{n,m} = γ_m · x̂_{n,m} + β_m
Note: the mean µ and std σ are not necessarily 0 and 1, respectively.
For affine layers, the normalization is done for each neuron, across the batch:
Figure IV.2: Batch normalization for affine layers. Mean and std are calculated for each color, across the
batch.
• Note that this is the first and only layer that breaks the abstraction of the independence of the
different instances in the batch, as the normalization of each neuron is done with respect to all the
other corresponding neurons in the batch, i.e. for a neuron vl,i,j , the normalization is done with respect
to all the other neurons vb,l,i,j , where b ∈ {1, 2, ..., B}.
However, for convolutional layers, the normalization is done across the batch, but not for each single neuron,
but across the whole channels. Again, for all the neurons across the batch that are created by the same
kernel:
Figure IV.3: Batch normalization for convolutional layers. Mean and std are calculated for each channel,
across the batch. We take all the corresponding channels across the batch, and normalize them across
all their entries. This could be seen as flattening them all and concatenating them, and then
performing the normalization, just like in the affine layers.
• The normalization layers differ in their actions during training and inference. During training, the
mean and variance are calculated for each batch, and the normalization is done with respect to them.
However, during inference we have a single instance flowing through the network, so the statistics
cannot be obtained in the same way. Therefore, during training we keep a running average of the
mean and variance, and use them during inference:
µ_running = β · µ_running + (1 − β) · µ_batch
σ²_running = β · σ²_running + (1 − β) · σ²_batch
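A minimal NumPy sketch of the forward pass for an affine layer (x of shape N × M), with running statistics updated during training as above:

import numpy as np

def batchnorm_forward(x, gamma, beta, run_mu, run_var,
                      train=True, momentum=0.9, eps=1e-5):
    if train:
        mu = x.mean(axis=0)                      # per-neuron mean over batch
        var = x.var(axis=0)                      # per-neuron variance over batch
        run_mu = momentum * run_mu + (1 - momentum) * mu
        run_var = momentum * run_var + (1 - momentum) * var
    else:
        mu, var = run_mu, run_var                # inference: stored statistics
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, run_mu, run_var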
Hyperparameter tuning is an important stage in the deep learning pipeline. The idea is to find, in a
reasonable amount of time, the optimal hyperparameters for the model - those that make it converge to a
minimum and generalize well on unseen data. Common hyperparameters are: learning rate, batch size,
number of epochs, optimizer, size of hidden layers, activation functions, etc. However,
while being crucial, it is quite easy to get a very good starting point if we know our theory:
1. Learning rate - use a learning rate scheduler, to mitigate the problem of finding a bad initial one (A
list of PyTorch built-in schedulers).
2. Activations - go with ReLU or its variants.
3. Optimizer - There are many amazing ones, but Adam and its variants are a great starting point (see:
Link).
4. Size of hidden layers - always start small, and grow as you go.
That said, the search itself could be quite cumbersome, as the search space is usually very large (each
hyperparameter adds a dimension), and the training process is very time-consuming. There are a few
methods to tackle this problem:
• Grid search: The most basic method, where we define a grid of hyperparameters, and train the model
on all possible combinations. This is very time-consuming, and not practical for large search spaces.
• Random search: A more advanced method, where we randomly sample the hyperparameters from a
distribution and train the model on those. This is usually more efficient than grid search, as it doesn’t
require training on all possible combinations.
• Bayesian optimization: A more advanced method, where we explore (search a wide range) and exploit
(search a narrow range) the search space. However, this method is usually more complex, requires
more computational resources, and doesn't outperform random search to a degree that justifies it.
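A minimal sketch of random search; `train_and_validate` is a hypothetical routine returning a validation score:

import random

def random_search(train_and_validate, n_trials=20):
    """train_and_validate(lr, hidden) is an assumed training routine
    returning a validation score; higher is better."""
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        lr = 10 ** random.uniform(-5, -1)        # sample the LR on a log scale
        hidden = random.choice([64, 128, 256, 512])
        score = train_and_validate(lr, hidden)
        if score > best_score:
            best_cfg, best_score = (lr, hidden), score
    return best_cfg, best_score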
The learnable weights of a model are usually initialized randomly, and not to zero. If all the weights are
initialized with the same value, they will not be able to learn complex patterns, as many groups of weights
will be updated to the same value. Therefore, it is crucial that the weights are sampled randomly from
some distribution. The range from which they are drawn must not be too big, as that could easily lead to
exploding gradients through the backpropagation process. On the other hand, if the range is too small, the
model will not be able to learn anything, as the gradients will be too small and the optimization steps
negligible. Thus, we should follow some meaningful initialization schemes:
Xavier Initialization
Xavier Initialization, also known as Glorot Initialization, aims to keep the scale of the gradients roughly
the same in all layers, which is particularly useful for activation functions like sigmoid and Tanh.
This ensures the variance of the output is the same as that of the input → a Gaussian with:
• Mean: E[X] = 0, E[W] = 0
• Variance: remember that E[X²] = Var[X] + E[X]² and, if X and Y are independent, E[XY] = E[X] · E[Y].
Therefore,

Var(s) = Var(Σ_{i=1}^{n} w_i · x_i) = Σ_{i=1}^{n} Var(w_i · x_i)
       = Σ_{i=1}^{n} ( [E(w_i)]² · Var(x_i) + [E(x_i)]² · Var(w_i) + Var(x_i) · Var(w_i) )
       = Σ_{i=1}^{n} Var(x_i) · Var(w_i) = n · Var(w) · Var(x)

• We want to ensure that the variance of the output s = Σ_i w_i x_i is the same as that of the input: Var(s) = Var(x).

Var(x) = n · Var(w) · Var(x) → Var(w) = 1/n

• Note that n is the number of input neurons for the layer of weights you want to initialize. This n is
not the number N of input data X ∈ R^{N×D}, but n = D.
For a layer with n_in input units and n_out output units, the weights W are initialized as follows:

W ∼ N(0, 2/(n_in + n_out))

Or (what we teach in class, and the version you should know for the exam):

W ∼ N(0, 1/n_in)

Where n_in and n_out are the number of input and output units, respectively.
Kaiming Initialization
Kaiming (He) Initialization is tailored for the ReLU activation:

ReLU(x) = max(0, x)

Because the ReLU function outputs zero for any negative input, the average output of neurons is not zero
but positive, which affects the variance. To maintain the forward signal's variance, the initialization needs
to take into account the rectification effect of ReLU. He et al. proposed scaling the variance of the weights
by 2/n_in, where n_in is the number of input units to the layer.
For a layer with n_in input units, the weights W are initialized as follows:

W ∼ N(0, 2/n_in)
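A minimal NumPy sketch of both schemes:

import numpy as np

def xavier_init(n_in, n_out):
    # variance 1/n_in, as taught in class (suits sigmoid / Tanh)
    return np.random.randn(n_out, n_in) * np.sqrt(1.0 / n_in)

def kaiming_init(n_in, n_out):
    # variance 2/n_in, compensating for ReLU zeroing half the inputs
    return np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)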
Summary
• Xavier Initialization:
– Suitable for sigmoid and Tanh activations.
– Considers both input and output layer sizes.
• Kaiming Initialization:
– Tailored for ReLU activations.
– Focuses on the size of the input layer only.
Figure IV.4: (a) - Mismatch of distribution, (b) - validation set is too easy / data leakage, (c) - Underfitting
• Mismatch of distribution: The training and validation sets are drawn from different distributions.
The model learns the training set well, but fails to generalize to the validation set.
• Validation set is too easy / data leakage: The validation set is too easy, and the model is able
to solve it with a very low loss. This could happen if the validation set is too small, or if the model is
too simple and underfits the training set. Also, a bug in the implementation could cause data leakage,
where some data from the validation set is used in the training process.
• Underfitting: The model is not able to solve the training set, and therefore the validation set as well.
This could happen if the model is too simple, or if the optimization process is not good enough.
IV.3 Transfer Learning
Transfer learning is a machine learning technique that addresses the challenges of limited data and resource-
intensive training. Instead of training a model from scratch, transfer learning utilizes a pre-trained model
that has already learned features from a large dataset, and adapts it for a new but related task.
Key Concepts:
• Pre-trained Models: These models are trained on a large dataset (Distribution P1). They have
learned useful features that can be transferred to other tasks.
• Adaptation to New Task: For a new task with a smaller dataset (Distribution P2), the pre-trained
model is used as a starting point.
– Only the final layer (the classifier) is replaced and trained on the new task’s data.
– Optionally, more layers can be fine-tuned with a lower learning rate if the new dataset is large
enough.
Benefits:
• Efficiency: Reduces the need for large amounts of labeled data and computational resources.
• Performance: Improves model accuracy by leveraging previously learned features.
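A minimal PyTorch sketch with a torchvision backbone; the 10-class head is an arbitrary example for the new task:

import torch
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1")  # pre-trained on P1
for p in model.parameters():
    p.requires_grad = False                       # freeze the feature extractor
model.fc = torch.nn.Linear(model.fc.in_features, 10)   # new head for P2
opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3) # train only the head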
IV.4 Questions
1. Optimizers:
a) What is the main difference between gradient descent (GD) and stochastic gradient descent
(SGD)?
b) Name two advantages of SGD over GD.
c) Why can we call RMS prop an adaptive optimizer?
d) What is the bias correction in Adam? Why isn’t it implemented in RMS prop?
e) Adam with bias correction and Adam without bias correction have different global minimas. True
or False?
f) What two optimizers does Adam combine?
g) Why is SGD+momentum usually better than SGD?
h) i. Why do we always step in the direction of the negative gradient?
ii. Write down a modified version of the SGD optimizer step, in case we want to maximize the
loss instead of minimizing it.
2. Overfitting and underfitting
a) Define overfitting in a short sentence.
b) Define underfitting in a short sentence.
c) State two different behaviors that indicate underfitting.
d) State a possible reason for a situation where the validation loss is lower than the training loss.
e) In a single word, what is the go-to solution to overfitting?
3. Regularization
a) The term "regularization term" is ambiguous. What are the two usages of such "terms"?
b) What is the effect of L1 and L2 regularization terms on the weights?
c) What are the two differences between L1 and L2 regularization terms and "weight decay"?
d) How can we avoid the computational overhead of the dropout layer during inference?
e) Explain how regular dropout works during training and during inference. Hint: crucial to distin-
guish between the definitions of p.
f) Why is data augmentation considered a regularization technique?
g) Why don’t I allow you to consider Early-stopping as a regularization technique?
4. Batch Normalization:
a) Given a loss value L, and a fully connected layer with output shape 8 × 16, followed by a batch
normalization layer: y = BN(x) = γ ∗ x_norm + β
i. What are the dimensions of the parameters γ and β?
ii. Show a derivation of the gradients of those parameters (∂L/∂γ, ∂L/∂β) and show how to use NumPy
to calculate them.
b) Why is batch normalization sometimes referred to as a regularization technique?
c) Explain why we need to save the running averages of the mean and variance in the batch norm
to the memory during training.
d) What optimization problems could batch norm help us solve?
5. Hyperparameter tuning
a) What is the main bottleneck for grid search?
6. Weight initialization
a) What weight initialization scheme fits the ReLU activation function? How does it affect the
output of the activation function?
b) What do we expect to observe if we initialize the weights of the model to the same value?
7. Transfer Learning:
a) Explain one of the scenarios in the exercises where we used transfer learning, and how.
IV.5 Answers
1. Optimizers:
a) GD performs the optimization step at the end of the epoch, while SGD performs the optimization
step at each iteration.
b) i. SGD performs more optimization steps in a single time unit and therefore converges faster.
ii. SGD introduces noise to the training process, and can help to avoid saddle-points.
c) RMS prop adapts the learning rate for each parameter individually, based on the magnitude
of the gradients.
d) The bias correction is a mechanism to overcome the initialization bias of the variables m, v
towards zero, which makes the early steps really slow. It is not implemented in RMS prop, as it
was simply first introduced with Adam.
e) False. The global minimum isn't changed, as the goal and loss don't change, but Adam with the
bias correction would converge faster. However, they could end up in different local minima, as
the first steps would be much bigger. For a good visualization of different minima, look at the
plots here.
V Popular Architectures
V.1 LeNet
• Apply valid convolution: size shrinks (reduced by two pixels on each side)
• 6 × 1 × 5 × 5 convolution filters used in first layer (6 filters, as the depth of the convolution obtained
is 6)
• Average pooling is used (now: Max pooling much more common)
• Reduce first to 120, then to 84
• Tanh/sigmoid activation is used (not common now)
• Has 60k parameters
• As we go deeper: width, height go down and number of filters go up
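A minimal PyTorch sketch of a LeNet-style network following the bullets above (exact hyperparameters vary between LeNet versions):

import torch.nn as nn

lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),    # valid conv: 32x32 -> 28x28
    nn.AvgPool2d(2),                              # -> 14x14
    nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),   # -> 10x10
    nn.AvgPool2d(2),                              # -> 5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Tanh(),        # reduce first to 120
    nn.Linear(120, 84), nn.Tanh(),                # then to 84
    nn.Linear(84, 10),
)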
V.2 AlexNet
V.2 AlexNet
V.3 VGGNet
• CONV = (k = 3, s = 1, p = 1)
• Maxpool = 2x2 filters with stride 2
• 138m parameters
• It’s large, but simplicity makes it appealing
• As we go deeper, width and height go down and number of filters up.
• 19 layers (VGG-19).
V.4 Skip Connections: Residual Block
As neural networks had become deeper and deeper, harsh optimization problems, such as the vanishing
gradients, were a big bottleneck in their capability to perform at full capacity. Skip connections, also known
as residual connections, were introduced as a solution to this problem. They allow the gradient to flow
directly through the network ("highway for gradients") by providing alternate pathways for the gradient
during backpropagation. This innovation not only mitigates the vanishing gradient issue, by allowing a full
gradient to pass through, but also facilitates the training of much deeper networks. By enabling the network
to learn identity mappings more easily, skip connections ensure that the performance of deep networks does
not degrade with depth. Overall, skip connections have become a fundamental component in modern deep
learning architectures, contributing to more stable and efficient training processes.
• Highway for gradients: Assume that Loss(x, θ, y) = l and that x_{L−1} ∈ R^{N×D}; we show mathematically
that the full magnitude of the gradients can pass from the loss function:
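The original derivation is not reproduced here; a sketch of the standard argument, assuming the residual form x_L = x_{L−1} + F(x_{L−1}):

∂L/∂x_{L−1} = ∂L/∂x_L · ∂x_L/∂x_{L−1} = ∂L/∂x_L · (I + ∂F/∂x_{L−1}) = ∂L/∂x_L + ∂L/∂x_L · ∂F/∂x_{L−1}

The first summand is the untouched gradient arriving from the loss - the "highway".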
V.5 ResNet (Residual Networks)
A deep network that utilizes the powerful skip connections and has become the go-to idea in any modern
deep learning architecture.
• 60m parameters
• 152 layers.
• Note that if there is dimensionality reduction, the gradients cannot flow completely uninterrupted all
the way from the loss function to the very first layers, but only at blocks of the network where the
dimensions are equal, where addition between two tensors is possible.
V.6 InceptionNet
Why restrict our model to a specific kernel size? We're Google, we have the computational power - we'll
just use them all!
Inception layer:
• Each block in the network is made up with 4 different convolutions or pooling layers, that are then
concatenated as an output.
• Very expensive to perform all of that. Therefore, 1 × 1 convolutions are used to reduce the number
of channels and shrink the number of computations of the more expensive convolutions.
• Note that in order to maintain the same spatial sizes across all the different branches, the max-pool
layer is initialized with the magic-trio: k = 3, s = 1, p = 1.
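A minimal PyTorch sketch of such a block (the channel counts per branch are free hyperparameters):

import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, c_in, c1, c3r, c3, c5r, c5, cp):
        super().__init__()
        self.b1 = nn.Conv2d(c_in, c1, 1)                          # 1x1 branch
        self.b3 = nn.Sequential(nn.Conv2d(c_in, c3r, 1), nn.ReLU(),
                                nn.Conv2d(c3r, c3, 3, padding=1)) # reduce, then 3x3
        self.b5 = nn.Sequential(nn.Conv2d(c_in, c5r, 1), nn.ReLU(),
                                nn.Conv2d(c5r, c5, 5, padding=2)) # reduce, then 5x5
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),  # magic trio
                                nn.Conv2d(c_in, cp, 1))

    def forward(self, x):
        # all branches keep the spatial size, so we concatenate on channels
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)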
(Figure: the Inception module - (a) the naïve version, (b) the version with dimensionality reduction; illustration taken from the GoogLeNet paper.)

V.7 Autoencoder
One of the most interesting network architectures. The autoencoder is such a powerful tool that was later
perfected, and it is being used intensively in the industry and research, for the tasks of segmentation,
3D reconstruction, generative models, style transfers, etc. Basically, endless applications.
1. The basic architecture is based on fully connected layers.
2. It consists of three main parts:
• Encoder: Reducing the dimensionality towards the latent space. That forces the network to
focus on collecting the most meaningful features from the input data, so the loss of information
would be as small as possible. Therefore, we could say that the Encoder is used as a
features extractor.
• Latent space: the compressed representation of the input data. These latents reside in multiple
directionalities and hold really powerful representations of features. By manipulating them
correctly, one could achieve very interesting things.
– If the size is too small, not much information could eventually pass through to the decoder,
and the reconstruction would be very hard. The result would be very blurry.
– If too big, the network could basically learn to copy the image, without learning any mean-
ingful features.
– The latent space is a very powerful representation of the input. While RGB images, for
example, offer quite redundant information on their own, the compressed latent space could
represent the most meaningful features, that are specific for the task at hand, on a multidi-
mensional manifold. From an RGB image, the latent space could represent so many things
and can even close the gap between two completely different domains. For example, one
could align features from an RGB image with its corresponding semantic segmentation map,
which differs from it completely, such that there is a simple linear transformation between
them (Halperin et al. - may god help me, and soon it will be). One could learn "simple"
features for reconstruction, or even the distribution of the dataset (see VAE later in this
chapter).
– In the literature we say that the latent space represents "high-level features", while the early
layers of the model extract "low-level features".
– Calculations are usually much cheaper to perform on the latent space, which is a low-
dim representation on the input. Therefore, we should aim to perform as little as possible
calculations on higher-resolutions, and instead do most of the heavy-work on those lower-dim
spaces, that are also much more flexible in what they can represent.
• Decoder: Receives the dimensionality-reduced latent space, and aims to reconstruct the input
of the Encoder. The output has the same shape as the input.
3. Without non-linearities, it is very similar to PCA.
But why do we even want to use that? Autoencoders, as used in exercise 08, are an excellent solution when
our dataset is very big, but only a small part of it is actually labeled - like a medical CT
dataset, for example. So, we will have 2 steps:
1. Autoencoder → reconstruct the input. Let the Encoder learn the relevant features about the unla-
beled data. This part can be referred to as unsupervised learning.
2. After the training has converged, remove the decoder and discard it. Then, plug in instead, just after
the latent-space, a very simple fully-connected classifier, and train on the labeled data, given the fact
that the remaining Encoder is already trained as a good features extractor. This part can be referred
to as supervised learning.
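A minimal PyTorch sketch of the two steps (layer sizes are arbitrary examples; in step 2, the encoder can optionally be frozen):

import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                        nn.Linear(256, 32))             # 32-dim latent space
decoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(),
                        nn.Linear(256, 784))

autoencoder = nn.Sequential(encoder, decoder)           # step 1: train with MSE
classifier = nn.Sequential(encoder, nn.Linear(32, 10))  # step 2: labeled data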
V.8 Fully Convolutional Networks
Simply put, these are neural networks that consist only of convolutional layers. Notes:
• With some restrictions, these networks can accept different spatial sizes of inputs, with no change
to the architecture. Example: In exercise 9, we used input images with spatial sizes of 240 × 240,
while the AlexNet (section V.2) backbone was originally taking inputs of shape 224 × 224. If FC layers
are used, then the input size must be fixed, as they process the entire input at once. However, it is
important to remember that convolutional layers are NOT invariant to changes in scale, and without
any further training for adjustment to the new settings, the performance is likely to be suboptimal.
• In computer vision applications, these types of architectures are much superior to the original FC
networks: they usually introduce far fewer parameters and calculations, and are much more capable
of extracting meaningful local features.
V.9 U-Net
Figure V.6: A generic example of the U-net architecture → could be implemented with very different
hyperparameters.
• The U-net incorporates skip connections between corresponding layers in the encoder and the decoder,
for the highway of gradients, and for the usage of fine-grained features from the encoder, which
performs as a features extractor. In contrast to the original ResNet (section V.5), the gradients
flow to the shallower layers of the network with even fewer steps, and could be even more meaningful
than in a ResNet with dimensionality reduction, where gradients cannot flow completely upstream.
• Remember that the main task of an autoencoder is to compress the features, for the selection of the
most meaningful ones. However, as the spatial dimensions shrink, we face an issue of loss of critical
information. To mitigate this, as we decrease the spatial size, we increase the number of channels, to
offer more "breathing room" for features to pass through, and learn a bigger variety of them. But,
when decoding the latent space, the spatial size increases, and therefore we must shrink the number
of channels, to mitigate the exponentially rising number of calculations. A convolution on the original
spatial size, no matter how thin it is parameters-wise, would introduce so many calculations and
become a serious bottleneck for the runtime, while most processing could be done in a much cheaper
fashion at the lower dimensions.
• The latent space here is a tensor of shape B × C × H × W and not a single vector, like in FC layers.
• These networks are widely used in tasks such as semantic segmentation, depth prediction, image
reconstructions, GANs, etc.
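A minimal PyTorch sketch of one decoder ("up") block with such a skip connection:

import torch
import torch.nn as nn

class UpBlock(nn.Module):
    def __init__(self, c_in, c_skip, c_out):
        super().__init__()
        self.up = nn.ConvTranspose2d(c_in, c_out, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(c_out + c_skip, c_out, 3, padding=1), nn.ReLU(),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU())

    def forward(self, x, skip):
        x = self.up(x)                      # double the spatial size
        x = torch.cat([x, skip], dim=1)     # fine-grained encoder features
        return self.conv(x)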
V.10 Generative Networks
Very popular tasks, whose destructive results we encounter today all over, with fake images
flooding every good part of our lives. In this course, we introduce only two of the earlier ones: Variational
Autoencoders (VAEs) and Generative Adversarial Networks (GANs). Stable Diffusion is for the advanced
courses.
Figure V.7: During training: Encode the mean and std, use the reparametrization trick to sample from the
encoded distribution, and then decode the sample.
Based on the vanilla fully-connected Autoencoder (section V.7): instead of extracting features that allow
us to reconstruct the input, we would now like to encode the distribution of the entire training set
into the latent space. This is done by introducing a probabilistic approach to the latent space, where the
latent space is not a single vector, but two vectors (a single vector that is split in practice into two) that
represent the compressed mean and the variance of the data's distribution. During training, we use both
encoder and decoder to reconstruct the input. However, instead of using the encoded latent space directly, we sample
from the distribution it represents. Since we cannot backpropagate through a raw sampling operation, we use the
reparametrization trick to sample from the actual encoded distribution:

z = µ + σ · ϵ

where ϵ ∼ N(0, 1) → sample noise from the standard normal distribution and then scale and shift the
sample using the mean and std of the encoded distribution.
The loss function is now composed of two parts:
1. The reconstruction loss: Just like in the vanilla Autoencoder, we would like to reconstruct the input
as best as possible, to guide the encoder to learn the most meaningful features.
2. The KL-divergence loss: The KL-divergence is a measure of how much two probability distributions
differ from each other. With the target distribution assumed to be the standard normal distribution
(mean 0, variance 1), the term has the closed form:

L_KL = KL( N(µ, σ²) ∥ N(0, 1) ) = ½ (µ² + σ² − log σ² − 1)

where µ and σ² are the parameters of the learned (encoded) distribution.
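A minimal PyTorch sketch of the reparametrization trick and the KL term above; the encoder is assumed to output `mu` and `log_var`:

import torch

def reparametrize(mu, log_var):
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)          # eps ~ N(0, 1)
    return mu + std * eps                # z = mu + sigma * eps

def kl_loss(mu, log_var):
    # 0.5 * (mu^2 + sigma^2 - log sigma^2 - 1), summed over the latent dims
    return 0.5 * torch.sum(mu ** 2 + log_var.exp() - log_var - 1)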
On the other hand, during testing (inference), we discard the encoder: we sample z ∼ N(0, 1) directly from the assumed prior, and feed it to the decoder to generate a new, unseen sample.
Figure V.8: During training: The generator generates fake images, and the discriminator tries to distinguish
between real and fake images.
• Discriminator loss: Binary cross-entropy loss between the output of the discriminator and the target label
(e.g. 1 for real images, 0 for fake images):

L_D = −(1/2N) · Σ_{i=1}^{N} [ y_i · log(D(x_i)) + (1 − y_i) · log(1 − D(G(z_i))) ]

where D(x_i) is the output of the discriminator for the i-th image, N is the batch size and y_i is the
target label.
• Generator loss: since we first update the discriminator weights to push the fake images towards
class 0, we take the complement of the discriminator loss:

L_G = −(1/N) · Σ_{i=1}^{N} log(D(G(z_i)))
These two work in a MinMax fashion, which can quite often cause a case of underfitting, since both sides
"pull the rope", or even mode collapse, where the generator learns to generate only a single image that is
the most likely to fool the discriminator. This is why there are many variations of GANs, such as the
Wasserstein GAN, the Least Squares GAN, the Conditional GAN, etc.
However, GANs proved to generate more real-looking images than VAEs, but at the cost of training instability
and a complex training scheme. They were lately overtaken by other techniques, such as Stable Diffusion.
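A minimal PyTorch sketch of one training step, assuming hypothetical generator G, discriminator D (ending with a sigmoid), their optimizers, and a batch of real images x:

import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, x, z_dim=100):
    real = torch.ones(x.size(0), 1)
    fake = torch.zeros(x.size(0), 1)
    z = torch.randn(x.size(0), z_dim)

    # Discriminator step: push real images to 1, generated images to 0
    opt_d.zero_grad()
    loss_d = 0.5 * (F.binary_cross_entropy(D(x), real) +
                    F.binary_cross_entropy(D(G(z).detach()), fake))
    loss_d.backward()
    opt_d.step()

    # Generator step: fool the discriminator (maximize log D(G(z)))
    opt_g.zero_grad()
    loss_g = F.binary_cross_entropy(D(G(z)), real)
    loss_g.backward()
    opt_g.step()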
V.11 Questions
1. Architectures:
a) LeNet uses average pooling to reduce the spatial size. Give one advantage and one disadvantage
of using average pooling over max pooling.
b) In LeNet - what is the receptive field of a neuron in the first FC layer?
c) Alexnet uses a 11 × 11 convolutional filter in the first layer. Name two disadvantages of using
such a large filter.
d) Alexnet uses ReLU instead of sigmoid or Tanh, as used in LeNet. Explain why it allows Alexnet
to be deeper than LeNet, when coupled with the Kaiming initialization.
e) VGGnet: What is the purpose of the convolutional part of the model? Why do we need the FC
layers at the end?
f) InceptionNet:
i. What was the problem with the first version of InceptionNet? How was it solved?
ii. We learned in class the MaxPool is usually used to reduce the spatial dimensions. Therefore,
how was it possible to use it inside the Inception block, and concatenate its output to all
other outputs?
2. Skip connections:
a) Why can we say that skip connections introduce "highway of gradients"?
b) Given a residual block Xl+1 = Xl + F (Xl ), where Xl , Xl+1 ∈ R - show the highway of gradients
∂L
in the chain rule formula of ∂X l
, given some loss value L ∈ R
c) Can you give a python-like implementation of the residual block?
73
V.12 Answers
3. AutoEncoders:
a) Assume we use an autoencoder to reconstruct an image. What could be used as a loss function?
What do we compare between? What kind of learning is it (supervised or unsupervised)?
b) What is the effect of a latent space that is too small? What is the effect of a latent space that is
too big?
c) What linear approach does this kind of autoencoder resemble? What is the advantage of autoen-
coder over this method?
d) State a scenario in which we would like to use an autoencoder for feature extraction.
4. U-net:
a) Give 3 advantages of U-net over the vanilla Autoencoder
b) How do we mitigate the drop in spatial size, so not to much information is lost?
c) For the task image-reconstruction, does it make sense to use skip-connections between the encoder
and the decoder? Explain.
5. Generative Networks:
a) What is the difference between GANs and VAEs in the way they learn the distribution of the
training set?
b) In the Vanilla GAN, why is it prone to underfitting and mode collapse?
c) In GAN, how do we train the generator to fool the discriminator?
d) VAE: what are the two loss function we use, and what is their purpose?
e) What do we assume the training set distribution to be in both VAEs and GANs?
f) Sampling from a random distribution, that is not the normal distribution, is very hard. How
does VAEs solve this problem?
V.12 Answers
1. Architectures:
a) i. Advantage: all entries of the window get a gradient.
ii. Disadvantage: It is a linear operation, while max pooling offers non-linearity.
b) i. 11 × 11 is very expensive in both parameters and calculations.
ii. It captures more global features than specific local features.
c) The entire input image (32 × 32).
d) It prevented the vanishing gradient problem, that is likely in deeper networks.
e) i. The convolutional part is used as a features extractor.
ii. The FC layers have global receptive field, which is very useful for the task of classification,
for which the model is meant in the first place.
f) InceptionNet:
74
V.12 Answers
i. The first version of InceptionNet is very complex, due to convolutions with large kernel sizes.
The problem was solved by introducing 1 × 1 convolutions, to reduce the number of channels
before the more expensive convolutions.
ii. Simply used the magical trio k = 3, s = 1, p = 1, that is used to keep the spatial dimensions.
2. Skip connections:
a) Skip connections allow skipping whole blocks of layers, by adding the input to the output. That
allows the gradient of that input X to bypass the block, and not be diminished by it.
b)
c) z = conv1 ( x )
z = relu (z)
z = conv2 ( z )
x = relu (x + z)
3. AutoEncoders:
a) Loss functions: MSE, MAE. We compare the input with the reconstructed image. It is an
unsupervised learning approach.
b) If the latent space is too small, the reconstruction will be very hard and the result will be very
blurry - underfitting. If the latent space is too big, the network could learn to copy the image,
without learning any meaningful features, thus - overfitting.
c) PCA. The advantage of autoencoder over PCA is that it is non-linear, and can learn more complex
features.
d) A scenario in which we would like to use an autoencoder for feature extraction is when we have a
lot of unlabeled data, and we want to extract features from it. Then, we can use the pre-trained
autoencoder for a supervised task.
4. U-net:
a) i. Skip connections
ii. Allows making most of the calculations and processing at lower dimensions, for a fraction of
the cost.
iii. Accepts different input spatial sizes, with no any structural changes.
iv. Allows the extraction of meaningful local features instead of the global ones.
v. No, as the model can learn to simply memorize the image and send it to the output layer.
b) We mitigate it by increasing the number of channels, to offer more "breathing room" for features
to pass through, and learn a bigger variety of them.
5. Generative Networks:
a) VAEs learn the distribution explicitly (directly), while the GANs learn the distribution implic-
itly (indirectly).
b) Both Generator and discriminator work against each other, so both could converge to suboptimal
solutions. Also, the generator could learn to produce a single image, which is likely to fool the
discriminator, and thus have a very good accuracy.
75
V.12 Answers
c) Send the fake image through the discriminator, but this time with class 1, like for the real image.
d) VAE: MAE/MSE for reconstruction, and the KL divergence for the latent space’s distribution.
e) We assume the training set distribution to be the standard normal distribution.
f) We use the parametrization trick: z = µ+σϵ, to shift the normal distribution to the actual learned
distribution, where ϵ ∼ N (0, 1) is the sampled noise from the standard normal distribution, and
µ and σ are the mean and variance of the distribution of the training set.
76
VI Recurrent Neural Networks and Transformers
So far we’ve been dealing with neural networks that took as input independent instances (e.g. single RGB
images) and processed them for the sake of some task. In this section, we will be dealing with networks that
process data instances that are dependent on each other with some context, e.g. sentences in text, videos -
that are consecutive images of the same scene or even different parts of the same image, etc.
The very first type of neural network that was designed to handle sequential data. The idea behind RNNs
is to have a network that has some kind of memory of the previous inputs it has seen. This is achieved by
having the network pass the output of the previous time step as an input to the current time step. This is
illustrated in the figure below:
In this notion, we use the same weights to process each time step, but only in a very limited scope of the
history, or "Short Term Memory". The components of the RNN are as follows:
• xt : The input at time t. Representing a word in a sentence, a frame in a video, etc. Usually given as
a vector of a fixed size D, that in literature is called a "token". Note, that this scheme is used also in
other context-related models, such as LSTMs and Transformers.
• ht : The hidden state of time t. This is the memory of the network. It is a vector of size M . This
hidden state is sent to the next time step and compresses into itself all previous time steps.
• ot : The output at time t. This is the prediction of the network at time t. It is a vector of size M .
At each time step t, we use the same weights to process the current token xt with the previous hidden state
ht−1 to get the new hidden state ht and the output ot . x, h and o have a set of weights WxD×M , WhM ×M
and WoM ×O respectively. Note that these weights are shared across all time steps. Therefore, at time t we
compute:
ht = A(xt · Wx + ht−1 · Wh + bh )
ot = A(ht · Wo + bo ), ot ∈ RM
77
VI.1 Recurrent Neural Networks (RNNs)
Where A is an activation function (Sigmoid, Tanh or ReLU) and b⋆ is the relevant bias in the affine equation.
• Number of parameters: Without considering an output variable, the number of parameters in the
RNN cell relies on the size of h, as the tensors need to match in shape for the addition. Assume
h ∈ R1×M , x ∈ R1×D , then, the number of parameters is M · M + D · M + M , for Wh , Wx , bh .
There are different types of RNNs, as shown in the figure above, and the outputs ot could be used at each
time step, only at the end of the sequence, or at every time step. The backpropagation through these
networks is called "backpropagation through time" (BPTT) and is also affected by the choice of using the
intermediate outputs, of course. This process heavily relies on the length of the input sequence and therefore
could be quite expensive and long to compute.
Uses of the different types of RNN (source):
1. One-to-one: A simple neural network, as we know it.
2. One-to-many: Image captioning.
3. Many-to-one: Sentiment analysis.
4. Many-to-many (first sequence, then output) - Language translation, as the entire sentence is processed
first.
5. Many-to-many (both input and output): Labeling images in a video, or some task that doesn’t rely
completely on the entirety of the sequence.
Also, the way h0 is initialized can affect the performance of the RNN. Common initializations are either to
a set of zeros, or sampled from a random distribution, but could also be learned.
Main challenges:
• Backpropagation through time (Example). Assume an RNN with 2 time steps, and a single output
at the end, for simplicity. Also, for the sake of a readable example, assume all variables are scalars
(x, h, Wx , Wh , Wo , bh , bo ∈ R) and that the activation function is the identity function. Remember that
if y = f (x)g(x), then
∂y ∂y ∂f (x)g(x) ∂f (x) ∂g(x)
= f ′ (x)g(x) + f (x)g ′ (x) → = = g(x) + f (x)
∂x ∂x ∂x ∂x ∂x
. The forward pass is as follows:
– y1 = x1 · Wx + h0 · Wh + bh
– h1 = A1 (y1 )
– y2 = x2 · Wx + h1 · Wh + bh
– h2 = A2 (y2 )
– y3 = h2 · Wh + bo
78
VI.2 Long Short Term Memory (LSTM)
– o = A3 (y3 )
∂o ∂A3 (y3 )
Now, let’s calculate the gradient of ∂Wh = ∂Wh :
Where,
∂A3 ∂A2 ∂A1
, , =1
∂y3 ∂y2 ∂y1
• Observe the example above. We could see that the weight matrix is multiplied by itself as we backprop-
agate through time. If the eigenvalues of the weight matrix are smaller than 1, then the gradient will
become smaller and smaller as we go up the stream - and introduce the vanishing gradient problem.
On the other hand, if the eigenvalues of the weight matrix are larger than 1, then the gradients will
explode! A solution could be to use a regularization term to force the weight matrix to be orthogonal,
or clip the gradient values, so it won’t get too small or too large.
• Not much capacity: While it is very cheap in parameters, using only 3 different weight matrices does
not introduce much capacity to the network.
• Can only have short-term context, which makes context-related tasks, such as language translation,
quite hard.
LSTMs try to overcome the vanilla RNN’s problems by changing the inner logic of the RNN cell by using
different gates for past and current information and introducing skip-connections.
79
VI.3 LSTM Gates Explanation
The LSTM (Long Short-Term Memory) network is a type of RNN that addresses the vanishing gradient
problem and can capture long-term dependencies more effectively. The diagram provided illustrates the
internal workings of an LSTM cell. Let’s break down the key components and gates of the LSTM block:
• Cell State (Ct ): The cell state runs straight down the entire chain, with only some minor linear
interactions. This allows information to flow unchanged and provides a memory that can be updated
or reset based on the gates’ actions. Must have the same shape as the hidden state h.
• Hidden State (ht ): This is the output of the LSTM cell at each time step, which can also serve as
an input to the next time step.
LSTM networks have three main gates that regulate the flow of information:
This layer generates new candidate values, which could be added to the cell state. The Tanh function
outputs values between -1 and 1.
• Output Gate (ot ):
ot = σ(ht−1 · Woh + xt · Wox + bo )
The output gate determines what the next hidden state (ht ) should be. It takes the input and previous
hidden state, processes them through a sigmoid function, and then multiplies the output with the Tanh
of the updated cell state to generate the hidden state.
80
VI.4 Transformers: Revolutionizing Neural Network Architectures
The cell state and hidden state are updated as follows (⊙ for element-wise multiplication):
ht = ot ⊙ Tanh(Ct )
The hidden state is updated by multiplying the output gate value with the Tanh of the new cell state.
Transformers have become a cornerstone in modern deep learning, particularly in natural language processing
(NLP). Introduced by Vaswani et al. in 2017 (paper), Transformers have surpassed traditional recurrent
neural networks (RNNs) and convolutional neural networks (CNNs) in various tasks due to their ability to
handle long-range dependencies and parallelize training.
The Transformer architecture relies heavily on self-attention mechanisms, eliminating the need for recur-
rence, in contrast to RNNs and LSTMs. It is made up of an encoder-decoder structure, where both the
encoder and decoder are composed of a stack of identical blocks. Each block consists of sub-layers, including
multi-head self-attention mechanisms, feed-forward neural networks, and normalization layers.
81
VI.4 Transformers: Revolutionizing Neural Network Architectures
Token Embedding
The sequence input to the Transformer architecture is made up of "tokens", which are fixed-size vectors, that
are predefined, usually by some preprocessed "dictionary". The transformer cannot accept tokens of different
sizes, as it is crucial for the mathematical operations within the different components. The embedding layer
converts input tokens into dense vectors that represent the tokens in a high-dimensional space.
Positional Encoding
Transformers lack the sequential nature of RNNs, so they require a way to incorporate the order of the
sequence. This is done through positional embeddings. The first proposal was the sinusoidal positional
encoding:
pos
P E(pos,2i) = sin
100002i/dmodel
pos
P E(pos,2i+1) = cos
100002i/dmodel
These functions provide unique encodings for each position, allowing the model to differentiate between
different positions in the sequence. Before the input embeddings are fed into the encoder, those positional
embeddings are added to their corresponding input tokens.
For a deeper understanding of embedding layers, take a look at this 3B1B video.
The self-attention mechanism is at the heart of the Transformer model. It allows the model to focus on
different parts of the input sequence when encoding a particular word.
82
VI.4 Transformers: Revolutionizing Neural Network Architectures
The scaled dot-product attention computes the attention scores using the following formula:
!
QK T
Attention(Q, K, V ) = softmax √ V
dk
where Q (query), K (key), and V (value) are the input matrices, and dk is the dimension of the key vectors.
√
• The division by dk is another mechanism to prevent the outputs of the linear operations from
becoming too huge, and in turn make the outputs and the gradients explode. In programming, you
would usually see your loss value become nan.
• The Softmax function has two purposes - to normalize the weights to be in the range of [0, 1], and to
make stronger weights stronger, and smaller weights - smaller
Multi-Head Attention
Multi-head attention allows the model to jointly attend to information from different representation sub-
spaces. We simply divide each token into sub-tokens named "heads", and perform the linear operations with
smaller weight matrices, but each head has its own set. Instead of performing a single attention function,
the transformer employs multiple attention heads to capture different aspects of the input data. Each head
performs its own attention calculation, and the results are concatenated and linearly transformed.
Simply reshape your tensor of shape (N, C, D) to (N, C, D//head-size, head-size) before feeding it to the
self-attention mechanism. After the attention, the "concatenation" is performed by simply reshaping the
tensor back to (N, C, D).
VI.4.c Encoder
83
VI.4 Transformers: Revolutionizing Neural Network Architectures
• Self-Attention Mechanism: This allows the model to weigh the importance of different tokens in
a sequence, in respect to each other.
• Feed-Forward Neural Network: A position-wise fully connected feed-forward network applied
independently to each position.
Each sub-layer has a residual connection around it followed by layer normalization. The output of each
sub-layer is:
LayerNorm(x + Sublayer(x))
Note that those normalizations are crucial, as the self-attention mechanisms do not introduce any non-linear
activations or normalizations, and therefore are prone to exploding values or gradients.
VI.4.d Decoder
With an additional sub-layer compared to the encoder, each layer consists of:
• Masked Self-Attention Mechanism: Prevents attending to future tokens. This ensures that the
predictions for the i-th position can depend only on the known outputs at positions less than i.
Prevents from the decoder to simply copy its inputs.
• Encoder-Decoder Attention: Attends to the encoder’s output. Let M be the embedding size - q is
the decoder’s masked self-attention outputs, of shape q ∈ RK×M . k and v are the encoder’s outputs,
of shape N × M . That results in K embeddings, to be processed and compared with the K ground
truth embeddings.
• The output of the network, after applying N -decoder blocks, is a 2D matrix for each instance in the
batch (each sequence) of shape N × C, where N is the number of tokens in the sequence, and C is the
number of classes in the target dictionary, for a classification loss function (CE).
84
VI.5 Questions
VI.5 Questions
1. Give two drawbacks of using RNNs that are not the exploding or vanishing gradients.
2. How did LSTM solve the vanishing gradient problem in RNNs?
3. Is it a good idea to use ReLU instead of Sigmoid as the activation of the input gate of LSTM?
4. Transformers:
a) Can the transformer architecture take embeddings of different sizes?
b) Can it take sequences of different sizes as inputs to the encoder and the decoder?
c) Cross-attention layer. Given the encoder outputs of shape Xe ∈ RN ×M and the decoder outputs
of shape Xd ∈ RK×M , what is the dimension of the output of that layer?
d) Is the task of predicting the next token in the sequence a classification or a regression task?
e) Why is the self-attention mechanism prone to exploding gradients? How does the original archi-
tecture of the transformer solve that?
f) Why is it important in transformers to use positional encoding, in comparison to RNNs?
VI.6 Answers
85
VI.6 Answers
c) q = Xd , k = Xe , v = Xe .
out = (q · k ⊤ ) · v → (K × M ) · (M × N ) · (N × M ) = (K × N ) · (N × M ) = K × M
86
VII Appendix
y = XW + b (VII.1)
y = ax + b
What is X?
Let us take for example this one input instance (image) from the MNIST handwritten digits’ dataset. Each
gray scale image in this dataset is a 1 × 8 × 8 tensor: 1 for the channels, 8 for the height, and 8 for the
width.
87
VII.1 Multidimensional derivatives
For the affine layer, as phrased in (VII.1), each input instance is flattened to be a row vector inside X. Let
us take a batch of 2 images from the MNIST dataset.
What is W?
Notes
• Note: It is not a linear function, but we treat it as an approximation. Why not? It doesn’t follow
the rules of linearity, where
f (x + y) = f (x) + f (y)
or
f (ax) = af (x)
Which calculates the exact same thing, but results in a column vector and not a row. W weight vectors
are now row vectors and X inputs are now column vectors. It is just a matter of how we construct
our inputs and weights.
88
VII.1 Multidimensional derivatives
VII.1.b Derivatives
Figure VII.1: A neural network computational graph. Note: Although we always deal with batches of
inputs, in the sketch, the input layer represents only one input instance (e.g one flattened
image). Each color represents a different weights column vector in W . Also, each neuron in
the input layer (true to any neuron in the network) will collect the gradients from the flow on
the colorful edges that are attached to it.
What is a gradient
• Gradient: ∂f ∂f
∂x ... ∂x1,m
∂f .1,1 ..
f : Rn×m → R, x ∈ Rn×m , n×m
= .. . ... ∈ R
(VII.6)
∂x ∂f ∂f
∂xn,1 ... ∂xn,m
• Jacobian: ∂f ∂f1
1
x1 f1 ...
∂x1 ∂xn
.. .. ∂f . .. ..
f : Rn → Rm , f ( . ) = . , = . . . ∈R
m×n
(VII.7)
∂x ∂f.
xn fm m ∂fm
∂x1 ... ∂xn
• Note that if x was a row vector and so was the function ’image’ (result), then this Jacobian matrix
would have been transposed.
89
VII.1 Multidimensional derivatives
According to the chain rule, the derivative of the loss function value L according to the weight matrix
of our current affine layer W , would be:
∂L ∂L ∂σ(Y ) ∂Y
= ⊕ ⊕ (VII.9)
∂W ∂σ(Y ) ∂Y ∂W
∂L ∂σ(Y )
In this case, ∂σ(Y ) ∂Y is what we call dout, or the upstream gradient, and we assume it is already
calculated before, as in our current scope (according to the relevant functions, of course). Now it is
sent to our current scope, to be calculated as a part of the chain-rule, and sent up the stream to the
next layer.
90
VII.1 Multidimensional derivatives
z = f (X)
,
L = g(f (X)) = g(z) = z1 + . . . + zn = 2x1 + . . . + 2xn
91
VII.1 Multidimensional derivatives
∂L ∂L
Given a loss function Loss(Y ) = L, we want to calculate ∂X or ∂W .
As seen in (VII.6), the derivative of a scalar by a matrix, is a gradient / Jacobian matrix that has the same
shape as the input. Moreover, we saw in (VII.10), that the final derivative of the loss value L by any entry
of any matrix in the whole neural network is just a scalar. For better understanding we could look at the
computational graph of the network in, Figure VII.1, to clearly see that each neuron collects and sums the
upstream derivatives (from the loss up to it) - that it took part in calculation of, during the forward pass.
∂L ∂L ∂L
∂L ∂y1,1 ∂y1,2 ∂y1,3
= (VII.13)
∂Y
∂L ∂L ∂L
∂y2,1 ∂y2,2 ∂y2,3
So, from (VII.6) we know that the gradient of Y will have the same shape of Y , because L is a scalar, and
it is calculated as a part of the chain-rule. This is the abstraction notion that is discussed above.
Let’s derive W . Eventually, after the chain-rule, the derivative of W would have the same shape:
∂L ∂L ∂L
∂L ∂w1,1 ∂w1,2 ∂w1,3
= (VII.14)
∂W
∂L ∂L ∂L
∂w2,1 ∂w2,2 ∂w2,3
Now, this is important. We do not (!!) want to calculate the Jacobians. For a better explanation why,
∂L
refer to the attached article. We have also learned that each entry of ∂W is a scalar, that is computed as in
(VII.10).
So let’s divide and conquer. It is always a better practice, because it’s hard to wrap our minds on something
bigger than scalars.
2 X 3
∂L X ∂L ∂yij
= (VII.15)
∂w11 i=1 j=1
∂yij ∂w11
For better visualization, we could look at it as a dot product, which is elementwise multiplication and
then summation off all cells (Not what we know as np.dot() - this is confusing). Remember: when deriving
a function, it is by the input variable (at least one):
92
VII.1 Multidimensional derivatives
We could, of course, do the exact same thing in order to derive X, and we will see that:
∂L ∂L
= · WT (VII.21)
∂X ∂Y
∂Y
Note: This is only true, because L is a scalar. If we just looked at Y = XW → ∂W would be a Jacobian.
93
VII.1 Multidimensional derivatives
We could, or course, do the trick of merging it into X and W , as we saw in the lecture.
If not:
1.
Y = XW + b
where XN xD , WDxM , b1xM XWN xM
That means that each bi in b corresponds to one feature in a row of XW - but to add them like that,
it is quite impossible mathematically, right?
Now, one can simply follow the exact same paradigm that we’ve shown above to solve for b, or we
could just look at 1N b as another XW , and do the exact same thing as you did for XW , where 1N
was X and b was W .
∂L
2. We see that the derivative of ∂b is,
∂L ∂L hP
N ∂L PN ∂L
i
= (1N )T · = i=1 ∂yi,1 , . . . , i=1 ∂yi,M
∂b ∂Y
Which in NumPy translates into:
np.sum(dout, axis = 0)
94
VII.1 Multidimensional derivatives
VII.1.d Exercise
And Y = Af f ine(X)
1. Show a solution to compute the gradient of the Sigmoid layer, w.r.t to the upstream gradient.
2. For a concrete example with numbers, replace sigmoid in the activation function with f (x) = x2
(applied element-wise) and use input X and weight matrix W below. So now, our network looks
like this: ! !
0 1 1 1
X= W =
2 3 1 1
2×2 2×2
Affine : Y = XW
Activation : f (Y ) = Y 2
#Rows
X #Cols
X
Loss : L(f (Y )) = yi,j
i j
Find:
(a) The loss value L(f (Y ))
∂L
(b) ∂f
∂L
(c) ∂y
∂L
(d) ∂X
∂L
(e) ∂W
95
VII.1 Multidimensional derivatives
Solutions:
∂L ∂L ∂σ
1. Task: "Gradient of Sigmoid" means the derivative of σ with respect to its input: ∂Y = ∂σ ∂Y
!
σ(y1,1 ) σ(y1,2 ) σ(y1,3 )
σ(Y ) =
σ(y2,1 ) σ(y2,2 ) σ(y2,3 )
∂L ∂L ∂L
∂L
= ∂σ1,1 ∂σ1,2 ∂σ1,3
∂σ ∂L ∂L ∂L
∂σ2,1 ∂σ2,2 ∂σ2,3
∂σ ∂σ1,2 ∂σ1,3
1,1
σ1,1 (1 − σ1,1 ) σ1,2 (1 − σ1,2 ) σ1,3 (1 − σ1,3 )
!
∂σ
= ∂y1,1 ∂y1,2 ∂y1,3
=
∂Y ∂σ2,1 ∂σ2,2 ∂σ2,3
∂y2,1 ∂y2,2 ∂y2,3
σ2,1 (1 − σ2,1 ) σ2,2 (1 − σ2,2 ) σ2,3 (1 − σ2,3 )
For scalars, which all of our loss functions (except for Softmax) are, we can compute this as a dot
product i.e. element-wise multiplication:
∂L ∂L ∂L
σ1,1 (1 − σ1,1 ) σ1,2 (1 − σ1,2 ) σ1,3 (1 − σ1,3 )
!
∂L
= ∂σ1,1 ∂σ1,2 ∂σ1,3
· =
∂Y ∂L ∂L ∂L
∂σ2,1 ∂σ2,2 ∂σ2,3 σ2,1 (1 − σ2,1 ) σ2,2 (1 − σ2,2 ) σ2,3 (1 − σ2,3 )
∂L ∂L ∂L
∂σ1,1 · σ1,1 (1 − σ1,1 ) ∂σ1,2 · σ1,2 (1 − σ1,2 ) ∂σ1,3 · σ1,3 (1 − σ1,3 )
=
∂L ∂L ∂L
∂σ2,1 · σ2,1 (1 − σ2,1 ) ∂σ2,2 · σ2,2 (1 − σ2,2 ) ∂σ2,3 · σ2,3 (1 − σ2,3 )
∂L ∂L
In code we can use the element-wise operator ⊕ like this: ∂Y = ∂σ ⊕ σ(1 − σ)
2. Task:
! !
x1,1 w1,1 + x1,2 w2,1 x1,1 w1,2 + x1,2 w2,2 1 1
(a) Affine: Y = =
x2,1 w1,1 + x2,2 w2,1 x2,1 w1,2 + x2,2 w2,2 5 5
2 2
! !
y1,1 y1,2 1 1
Activation: f (Y ) = 2 2
=
y2,1 y2,2 25 25
∂L ∂L ∂f
(c) ∂Y = ∂f ∂Y
! !
∂f 2y1,1 2y1,2 2 2
∂Y = =
2y2,1 2y2,2 10 10
Remember that since f is an element-wise function, the multiplication with the upstream
gradient ("dout") is also element-wise.
! ! !
∂L ∂L ∂f 1 1 2 2 2 2
∂Y = ∂f ∂Y = =
1 1 10 10 10 10
∂L ∂L
(d) ∂X = ∂Y · WT
96
VII.1 Multidimensional derivatives
! ! !
2 2 1 1 4 4
· =
10 10 1 1 20 20
∂L ∂L
(e) ∂W = XT · ∂Y
! ! !
0 2 2 2 20 20
· =
1 3 10 10 32 32
97