I2DL Student Lecture Notes
Authors:
Dan Halperin, Benjamin Heltzel
Contents
II Neural Networks
II.1 Neural Networks
II.1.a Fully Connected Layer
II.1.b Backpropagation
II.2 Activation functions
II.2.a Terms
II.2.b Sigmoid
II.2.c Hyperbolic Tangent - Tanh
II.2.d Rectified Linear Unit - ReLU
II.2.e The "vanishing gradient" vs the "Dying ReLU" problems
II.2.f Leaky ReLU
II.2.g Parametric ReLU
II.2.h Exponential Linear Unit - ELU
II.2.i MaxOut
II.2.j Softmax
II.3 Loss functions
II.3.a Classification loss functions
II.3.b Regression loss functions
II.4 Linear Regression
II.5 Logistic Regression
II.6 Complete Schematic of a Fully Connected Neural Network
II.7 Questions
II.8 Answers
III Convolutions
III.1 Definition
III.2 Unique Convolutional Layers
III.2.a MaxPooling
III.2.b Average pooling
III.2.c Point-wise convolutions
III.2.d Depth-wise Convolutions
III.2.e Upsample
IV Optimization
IV.1 Optimization algorithms
IV.1.a Gradient Descent
IV.1.b Stochastic Gradient Descent
IV.1.c SGD with Momentum
IV.1.d Nesterov Momentum
IV.1.e RMSprop
IV.1.f Adam - Adaptive Moment Estimation
IV.1.g Newton's Method and its Variants
IV.1.h Second Order confusion
IV.2 Optimization problems and solutions
IV.2.a Overfitting vs underfitting
IV.2.b Vanishing and exploding gradients
IV.2.c Learning rate scheduling
IV.2.d Regularization
IV.2.e Batch normalization
IV.2.f Hyperparameter tuning
IV.2.g Weight initialization
IV.2.h Optimization problems
IV.3 Transfer Learning
IV.4 Questions
IV.5 Answers
V Popular Architectures
V.1 LeNet
V.2 AlexNet
V.3 VGGNet
V.4 Skip Connections: Residual Block
V.5 ResNet (Residual Networks)
V.6 GoogleNet: Inception Layer
V.7 Autoencoder
V.8 Fully Convolutional Networks
V.9 U-Net
V.10 Generative Networks
V.10.a Variational Autoencoders (VAEs)
V.10.b Generative Adversarial Networks (GANs)
V.11 Questions
V.12 Answers
VII Appendix
VII.1 Multidimensional derivatives
VII.1.a Affine layer
VII.1.b Derivatives
VII.1.c Stanford Article
VII.1.d Exercise
I Machine Learning Basics
• Supervised learning: a type of learning where the model is trained on an existing and fixed ground truth, relative to the task at hand: classes for classification, depth values for depth estimation, facial key-points, or the next correct letter in a sentence.
• Unsupervised learning: a type of learning where the model is trained on data without any ground truth. The model is supposed to find patterns in the data, which can be used for clustering, dimensionality reduction, or generative tasks (e.g. autoencoders).
• Classification: a discrete goal, where the model assigns the input to one of the C classes. Also applies
to semantic segmentation of images, where each pixel is assigned to one of the C classes. This can
be done in a supervised way (given labeled data) or unsupervised way (clustering). Examples: image
classification, object detection, semantic segmentation, sentence completion, etc.
• Regression: a continuous goal, where the model predicts a continuous value, either bound or unbound
to some range. Examples are depth estimation, facial key-point detection, etc.
I.2 Datasets and Dataloaders

NOTE: This section refers to Exercise 03: Datasets and Data loaders.
• Datasets are the core of machine learning. They are the most important part, and their quality and
quantity can make or break a project. Machine learning models are data-driven, meaning that all
they can learn and use is within the data they are given.
• Datasets are usually split into three parts: training, validation and test set.
– First, the training set is used to train the model. It is the only data the model is directly exposed to, and it is always allotted the largest chunk of the overall dataset.
– The validation set is an unseen part of the dataset. It is commonly stated that it is used for tuning the hyperparameters of the model - but that happens only before we even start training. More importantly, it is used to evaluate the model during training. This is important, because the
model can overfit (subsection IV.2.a) the training data, and the validation set is the main tool used to detect this.
– The test set is the final part of the dataset, and should be kept as unbiased as possible (no optimization choice is made according to it). It is common to argue that this split should be used only once - and indeed, it only represents truly unseen data if we, as developers, cannot make architectural choices based on the performance on it.
– It is important to shuffle the dataset, otherwise different splits could be very biased, e.g. - the
training set would contain only images that were taken during day-time, and the test would
consist only of nighttime images.
– Common split ratios are {80%, 10%, 10%}, {70%, 15%, 15%}, etc. The training set is always the
biggest, as it is the only data that the model is trained on.
• A dataset represents a distribution, which is a sub-part of the real-world distribution. It is limited by the available data, and no matter how big it is, a finite modern dataset cannot fully represent the real-world distribution.
Usually, in modern deep learning frameworks, datasets are represented as classes. Such a class has two main methods that need to be implemented:
• __len__ - this method returns the length of the dataset. It is used by the data loader to know how
many samples are in the dataset.
• __getitem__ - this method returns the i-th element of the dataset. Note that it returns a single sample, and NOT a batch of samples. The dataloader is responsible for calling this function multiple times to get a batch of samples. Within this function, the sample can be loaded from disk, normalized, transformed, etc., along with its label, if one exists.
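A minimal sketch of such a dataset class, in the style of PyTorch's torch.utils.data.Dataset (the class name, file layout, and field names here are hypothetical):

```python
from torch.utils.data import Dataset
from PIL import Image

class ImageFolderDataset(Dataset):
    """Hypothetical image dataset: one file per sample, labels in a list."""

    def __init__(self, image_paths, labels, transform=None):
        self.image_paths = image_paths  # list of file paths on disk
        self.labels = labels            # list of integer class labels
        self.transform = transform      # optional preprocessing / augmentation

    def __len__(self):
        # Number of samples; the dataloader uses this.
        return len(self.image_paths)

    def __getitem__(self, i):
        # Returns a SINGLE sample and its label, never a batch.
        image = Image.open(self.image_paths[i]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)  # normalize / augment / to-tensor
        return image, self.labels[i]
```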
Data augmentation is a technique used to artificially increase the size of the dataset. It is especially useful
when the dataset is small, and the model is prone to overfitting. It is a common practice to use data
augmentation in computer vision tasks, where the images can be flipped, rotated, cropped, etc.
• It is considered a regularization technique (subsection IV.2.d), as it helps the model to generalize better.
• It does NOT increase the physical size of the dataset, but at each epoch, the model sees a different
version of the data with some probability p.
• Unlike normalization and other preprocessing, data augmentation is applied only to the training set,
and not to the validation and test sets.
• While the idea is good and a model can really benefit from data augmentation, the techniques should fit the problem at hand. For example, it is not a good idea to flip the images in a medical imaging task, as the left and the right side of the body are different, or to rotate images of digits by 180°, as the number 6 would become a 9, etc.
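A sketch of this split-dependent pipeline using torchvision transforms (the specific transforms and the 32×32 image size are assumptions, CIFAR-style):

```python
from torchvision import transforms

# Augmentation + preprocessing for the TRAINING split only.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),  # applied with probability p
    transforms.RandomCrop(32, padding=4),    # assumes 32x32 inputs
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
])

# Validation / test splits get ONLY the deterministic preprocessing.
eval_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),
])
```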
Data processing is an overall name for techniques that are applied to the data before it is fed to the model:
• Normalization: a technique used to scale the data to a certain range. It is crucial that the input has a reasonable magnitude, so the gradients don't explode or vanish (subsection IV.2.b). It also makes the input to the network more consistent and eases the learning process. It is common practice to normalize the data to the range [0, 1] or [−1, 1]. This is done by calculating the mean and the standard deviation of the training set, and then normalizing the data by subtracting the mean and dividing by the standard deviation:
x_norm = (x − µ) / σ
• These techniques are applied to each sample of the dataset, and are usually done within the __getitem__ method of the dataset class. Unlike data augmentation, they are applied to all splits of the dataset, as the model expects the data to be in the same format (a small sketch follows below).
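A small numpy sketch of this rule - statistics from the training split only, applied to every split (the data here is random, as a stand-in):

```python
import numpy as np

# Stand-in data: 1000 training samples, 200 validation samples, 3 features.
train_x = np.random.rand(1000, 3)
val_x = np.random.rand(200, 3)

# Statistics are computed on the TRAINING split only...
mu = train_x.mean(axis=0)
sigma = train_x.std(axis=0)

# ...and the SAME mean/std are then applied to every split.
train_norm = (train_x - mu) / sigma
val_norm = (val_x - mu) / sigma  # train statistics, not val's own
```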
I.2.e Dataloaders
Dataloaders are classes that are used to load the data from the dataset and to create batches of samples. They are responsible for shuffling the data and assembling the batches. They are usually used in the training loop, where the model is trained on the data.
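A typical usage sketch with PyTorch's DataLoader, assuming train_dataset is an instance of a dataset class like the one sketched above:

```python
from torch.utils.data import DataLoader

# train_dataset: an instance of a dataset class like the one sketched above.
train_loader = DataLoader(train_dataset, batch_size=32,
                          shuffle=True, num_workers=4)

for images, labels in train_loader:
    # images has shape (32, C, H, W): the loader called __getitem__
    # 32 times and stacked the single samples into one batch.
    ...
```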
I.3 Exam questions

I.3.a Questions
d) When can we relax the ratio between the splits to be more even? When the other way around?
e) What would be a consequence of a too small validation set?
f) What can we do as simple deep learning engineers to overcome a too small training set?
8. Why not use the validation set as part of the training set, and just use the test set for validation?
9. What is the main issue with modern datasets’ sizes? Why is it so?
10. To which part of the dataset should we apply data augmentation? Why?
11. To that regard, what is the difference between data augmentation and data processing techniques like
normalization?
12. Why is the mean and std of the data calculated only over the training set?
I.3.b Answers
1. Supervised learning is a type of learning, where the model is trained on an existing and fixed ground
truth, relative to the task at hand: Classes for classification, depth values for depth estimation, facial
key-points or the next correct letter in the sentence. Unsupervised learning is a type of learning, where
the model is trained on data without any ground truth. The model is supposed to find patterns in the
data, which can be used for clustering, dimensionality reduction or generative tasks.
2. Classification is a discrete goal, where the model assigns the input to one of the C classes. Regression
is a continuous goal, where the model predicts a continuous value, either bound or unbound to some
range. Examples are depth estimation, facial key-point detection, etc.
3. Labeled data is expensive, and sometimes impossible to get. Unsupervised learning can be used to find patterns in the data that can be used for other tasks, like classification or regression, without the need to manually or artificially label the data (a process that usually introduces human-made errors). Note (irrelevant for the exam): it was shown that unsupervised settings allow much richer learned representations; for example, when GPT-1 was trained, it was shown that the model learned to detect the sentiment of the text (good or bad), even though it was trained on a next-letter completion task.
4. Autoencoder. We used an autoencoder to reconstruct unlabeled data, to create a feature extractor,
and then used this encoder to train a classifier on the much smaller labeled dataset (Exercise 08).
5. a) K-means clustering is an unsupervised learning algorithm that is used to cluster the data into K clusters. It is an iterative algorithm, where each data point is assigned to the closest cluster, and then the cluster centers are recalculated. This is repeated until the algorithm converges.
b) The key hyperparameter of the K-means clustering algorithm is the number of clusters K.
c) If the value of K is too small, the algorithm could merge clusters that are not similar (underfitting), and if the value of K is too large, the algorithm could split clusters that are similar (overfitting).
6. a) PCA is a linear dimensionality reduction technique, that is used to reduce the number of features
of the data, while keeping the most important features.
b) A deep learning architecture that could perform a similar task is an autoencoder. The autoencoder
is a non-linear dimensionality reduction technique. The deep learning method is preferable for
real world problems, as it can learn non-linear features, and can be used for more complex tasks,
like classification, regression, etc.
7. a) It is important to shuffle the dataset, as if the data is not shuffled it could lead to biased splits,
that do not represent the same overall distribution. Example: if the datasets consist of day-time
and nighttime images sorted accordingly, the training set might hold only day-time images, while
the test would hold only the much more difficult nighttime ones. Also, the model could learn the
order of the data, and not the data itself.
b) Common split ratios are {80%, 10%, 10%}, {70%, 15%, 15%}, etc.
c) The training set is always the largest, as it is the only data that the model is trained on.
d) We can relax the ratio between the splits to be more even, when the dataset is big enough, and
the model is not prone to overfitting.
e) A consequence of a too small validation set could be that the model could overfit the training
set, and the validation set would not be able to detect this.
f) To overcome a too small training set, we can use data augmentation, as it artificially increases
the size of the dataset (not the physical size, but the number of samples that the model sees).
8. With no validation set, we cannot detect overfitting problems. On the other hand, the test set has a
very clear purpose of being the most unbiased unseen data segment of the dataset, and should not be
used to validate our training process.
9. The main issue with modern datasets' sizes is that they are finite and limited by the available data, so they cannot represent the real-world distribution. Data collection is very expensive in terms of both time and resources.
10. Data augmentation is applied only to the training set, as a means of regularization, to help the model
generalize better and to prevent overfitting. It is not applied to the validation and test sets, as they are
supposed to represent the original problem’s data distribution, and the model should be evaluated on
that distribution. If applied, the performance on those datasets would be worse and not comparable
to other models.
11. The difference between data augmentation and data processing techniques like normalization is that
data augmentation is applied to the training set, and is used to artificially increase the size of the
dataset, while data processing techniques are applied to all the splits of the dataset, and are used to
preprocess the data before it is fed to the model.
12. The mean and std of the data are calculated only over the training set, as we must keep the test set as unbiased as possible. That is, no information from the test set should be used to train the model, and the mean and std of the data are part of that information.
II Neural Networks
Neural networks are simply defined as a series of functions f(x), g(x), h(x), ... that are composed together:

y = h(g(f(x)))
A neural network can be described by the following sketch:

[Sketch of a fully connected network - three input neurons, a hidden layer of four neurons, and two output neurons - omitted here.]

In the drawing, the circles (nodes) represent the neurons, and the lines (edges) represent the connections between the neurons (the weights). Each neuron holds a single value. A common way to index those weights would be w_{l,i,o}, where l is the layer, i is the index of the neuron in the input layer, and o is the index of the neuron in the output layer. Usually, neural networks represent non-linear functions. For that, we apply a non-linear activation function to the output of each neuron of a hidden layer, but this is not shown in the drawing.
On top of that, each set of lines between any two groups of neurons is what we refer to as the layers, or the
functions f, g, h that are stated above. The first layer is called the input layer, the last layer is called the
output layer, and the layers in between are called the hidden layers. The number of hidden layers and
the number of neurons in each layer are called the architecture of the neural network. The architecture of
the neural network is a hyperparameter that we need to tune to get the best performance.
Another crucial point is that in the sketch above we see a single input flowing through the network, and not a batch of them. In practice, however, we feed the network a batch of inputs at once, all computed in parallel, but independently.
In matrix notation, the affine layer (in its common name: the fully connected layer) is defined as:

y = XW + b
where N is the number of samples in the batch, D is the number of features in each sample, and M is the number of neurons in the layer. The output y is of shape R^{N×M}.
For the first layer in the example above, we assume that

X ∈ R^{N×3}, W ∈ R^{3×4}, b ∈ R^{1×4}
or in matrices (rows separated by semicolons):

X = [x_{1,1} x_{1,2} x_{1,3}; x_{2,1} x_{2,2} x_{2,3}; ... ; x_{N,1} x_{N,2} x_{N,3}],
W = [w_{1,1} w_{1,2} w_{1,3} w_{1,4}; w_{2,1} w_{2,2} w_{2,3} w_{2,4}; w_{3,1} w_{3,2} w_{3,3} w_{3,4}],
b = [b_1 b_2 b_3 b_4]
That means that each b_i in b corresponds to one feature (column) in a row of XW. Strictly speaking, adding a 1×M vector to an N×M matrix is not defined mathematically; in practice, b is broadcast, i.e. replicated across all N rows before the addition:
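A small numpy sketch of this broadcasting behavior (the values are arbitrary):

```python
import numpy as np

XW = np.arange(8.0).reshape(2, 4)        # pretend output of X @ W, shape (2, 4)
b = np.array([10.0, 20.0, 30.0, 40.0])   # shape (4,), i.e. one bias per column

# (2, 4) + (4,) is not defined in strict linear algebra; numpy broadcasts b
# by replicating it across the N rows:
y = XW + b
# equivalent to: XW + np.tile(b, (2, 1))
```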
II.1.a Fully Connected Layer

The fully connected layer (FC) is a very common computational layer in neural networks. It is also known as the affine layer, or the dense layer. The fully connected layer is defined as:
y = XW + b
where X_{N×D} is the input data, W_{D×M} is the weight matrix, and b_{1×M} is the bias vector. Both W and b
are learnable variables. Please check out the sketch above for a visual representation. In addition, a neural
network that is based on FC layers as its main building blocks is called a Fully Connected network (FCN).
You could also find it under the name of a Multi-Layer Perceptron (MLP).
It is important to note that each neuron in the input layer is connected to each neuron in the output layer. This
means that the number of weights in the fully connected layer is equal to the number of neurons in the input
layer times the number of neurons in the output layer. The number of biases in the fully connected layer
is equal to the number of neurons in the output layer. Therefore, we claim that FC layers capture global
features - but would struggle with local ones. To that end, we use convolutional layers, as described
later on (chapter III).
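A quick sanity check of this parameter count, for hypothetical sizes D = 192 inputs and M = 5 outputs:

```python
# Parameter count of a fully connected layer with D inputs and M outputs:
D, M = 192, 5
n_weights = D * M   # every input neuron connects to every output neuron
n_biases = M        # one bias per output neuron
print(n_weights + n_biases)  # 965
```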
II.1.b Backpropagation
A pro tip: the numerator is ALWAYS a function, and the denominator is always the input to that function. For example, given a function y = ReLU(xw + b), it's better to write ∂ReLU(xw + b)/∂(xw + b) rather than ∂y/∂(xw + b), because it's easier on our brains. Substitute the variable with the actual function for visual convenience, at least when dealing with matrix notation.
Assume the notation w_{l,i,o}, where l, i and o are the layer, input-neuron and output-neuron indices. Let us find the derivative of the weight w_{1,1,1} (colored in red in the sketch) with respect to some loss value L, i.e. ∂L/∂w_{1,1,1}:
Forward pass:
• h = XW1 + b1
• ŷ = hW2 + b2
• L = Loss(ŷ)
• For simplicity, assume no activation.
Backward pass:
• ∂L/∂w_{1,1,1} = (∂L/∂ŷ) · (∂(hW_2 + b_2)/∂h) · (∂(XW_1 + b_1)/∂w_{1,1,1})
• ∂L/∂w_{1,1,1} = Σ_{i=1}^{4} (∂L/∂h_i)(∂h_i/∂w_{1,1,1})
• ∂L/∂h_i = Σ_{j=1}^{2} (∂L/∂ŷ_j)(∂ŷ_j/∂h_i)
• ∂L/∂w_{1,1,1} = Σ_{j=1}^{2} Σ_{i=1}^{4} (∂L/∂ŷ_j)(∂ŷ_j/∂h_i)(∂h_i/∂w_{1,1,1})
• Note that here I found it more convenient to remain with the variable names, as we’re dealing with
matrix entries.
In matrix notation, the gradients of the affine layer's variables with respect to the loss function (a scalar) are:

∂L/∂X = (∂L/∂y) · W^⊺

∂L/∂W = X^⊺ · (∂L/∂y)

∂L/∂b = [1 1 ... 1]_{1×N} · (∂L/∂y)
Where L ∈ R is the loss value, and y ∈ RN ×M is the output of the layer. For more details about how to
derive them, please refer to chapter VII.
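A minimal numpy sketch of these three gradients, assuming the upstream gradient ∂L/∂y is given:

```python
import numpy as np

def affine_backward(dL_dy, X, W):
    """Gradients of y = X @ W + b, given the upstream gradient dL/dy.

    Shapes: dL_dy is (N, M), X is (N, D), W is (D, M).
    """
    dL_dX = dL_dy @ W.T        # (N, D)
    dL_dW = X.T @ dL_dy        # (D, M)
    dL_db = dL_dy.sum(axis=0)  # (M,), the row-of-ones product from above
    return dL_dX, dL_dW, dL_db

# Example with L = sum(y), so dL/dy is a matrix of ones:
X, W = np.random.randn(4, 3), np.random.randn(3, 2)
dX, dW, db = affine_backward(np.ones((4, 2)), X, W)
```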
II.2 Activation functions

The main computing power of a neural network lies in the linear layers, which contain the learnable weights of the model. However, linear functions are not sufficient to learn complex patterns in the data. Therefore, we introduce non-linear functions, called activation functions, that are applied to the outputs of the linear layers. These layers are what make neural networks universal function approximators, capable of learning patterns and relevant features in real-world data. In the following section, we will cover the most common activation functions used in neural networks, and discuss their pros and cons.
II.2.a Terms
• Activation map / Feature map - an output of a linear layer, after applying an activation function. A
tensor of numbers that represents some learnt features from the previous layer.
• Features - what a group of individual numbers represent together. A single neuron holding a value
has no meaning but within the context of its neighbors or all other neurons in the layer, depending on
the linear layer it is created with (Convolutions vs. Affine Layers).
• Zero centered - a property of an activation function whereby the output of the function can be both positive and negative. The mean doesn't have to be exactly 0 (e.g. LeakyReLU). When an activation is not zero-centered, it can introduce bias into the network and lead to a less optimal step direction (all gradients positive or all negative). However, whether the "zigzag" behavior actually happens boils down to the loss function used.
Example: Assume the following network -
– Y1 = XW1 + b1
– Z1 = σ(Y1 )
– Y2 = Z1 W2 + b2
– Z2 = σ(Y2 )
– L(Z2 ) = l ∈ R
In the plot below, one can see that the MAE (|x|) and MSE (x²) both have negative and positive slopes, while the negative logarithm (CE / BCE), in its effective range [0, 1], has only a negative slope. On the other hand, Z_1 is all non-negative by the definition of the sigmoid function. Therefore, with such a loss, all entries of the gradient with respect to W_2 share the same sign (the upstream gradient has a single sign and the inputs Z_1 are all positive), which produces the "zigzag" update behavior.
[Plot: f(x) for |x|, x², and −ln(x) over x ∈ [−1, 1].]
II.2.b Sigmoid
σ : R → (0, 1)

σ(x) = 1 / (1 + e^{−x})

[Plot: σ(x) over x ∈ [−4, 4].]
Main features:
• It is a scalar function, that operates only on scalars. Therefore, it is applied in an element-wise
fashion to the tensor-like input (on each entry independently).
• It projects the input to the range [0, 1], so it can be interpreted as the probability of the input belonging
to a class.
• It is differentiable, and has a simple derivative:

∂σ(x)/∂x = σ(x)(1 − σ(x))

Note that this derivative can be computed from the output σ(x) alone - the input x itself is not needed for the calculation.
• Usually used in binary-classification, to make the network’s logits probability-like, to be used in the
binary-cross-entropy loss function.
Drawbacks:
• The sigmoid function saturates when the input is very large or very small. Its effective range is roughly
[-3.5, 3.5]. Beyond those boundaries, the gradient of the function is very close to zero - saturates.
• The largest gradient is at x = 0: σ(0) = 0.5 → σ′(0) = σ(0)(1 − σ(0)) = 0.25. Even at its peak the gradient is small, which has a negative impact on the learning process.
• It is not zero-centered.
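A minimal numpy sketch of the sigmoid and its derivative; the two-branch form is a common trick (an assumption made here, not from the lecture) to avoid overflowing exp for large negative inputs:

```python
import numpy as np

def sigmoid(x):
    # Two-branch form: the exp() argument is always non-positive,
    # so it never overflows for large |x|.
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

def sigmoid_grad(s):
    # Derivative computed from the cached OUTPUT s = sigmoid(x);
    # the input x itself is not needed.
    return s * (1.0 - s)
```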
II.2.c Hyperbolic Tangent - Tanh

tanh : R → (−1, 1)

[Plot: tanh(x) over x ∈ [−4, 4].]
tanh(x) = 2σ(2x) − 1
Drawbacks:
• The Tanh function saturates when the input is very large or very small. Its effective range is roughly [−2, 2], which is very limiting. Beyond those boundaries, the gradient of the function is very close to zero - it saturates.
II.2.d Rectified Linear Unit - ReLU

[Plot: ReLU(x) over x ∈ [−4, 4].]
ReLU : R → [0, ∞)

ReLU(x) = max(0, x)
Main features:
• It is a scalar function, operates only on scalars. Therefore, it is applied in an element-wise fashion to the tensor-like input.
• It is extremely simple and computationally efficient.
• It is non-linear, and has a simple derivative: ∂ReLU(x)/∂x = 1 if x > 0, and 0 if x ≤ 0.
• It does not saturate for positive inputs.
Drawbacks:
• The ReLU function saturates when the input is negative. This means that the gradient of the func-
tion is zero for negative values, and the weights leading to those neurons are not updated during
backpropagation. This is known as the dying ReLU problem.
• It is not zero-centered.
Modern alternatives: GELU, ELU, LeakyReLU, etc.
II.2.e The "vanishing gradient" vs the "Dying ReLU" problems

These two terms often appear in the literature, and are usually confused. Even modern language models (as of 2024) repeat the hesitant and incomplete definitions they learned from the internet.
1. The vanishing gradient problem: As mentioned above, some activation functions produce very small gradients for certain values of the input. For very small networks, this doesn't pose a problem. However, in much deeper networks we can observe a very negative behavior: since the gradients of those activations are small to begin with, when multiplied by the equally small gradients of layers "up the stream", they shrink exponentially. This causes a phenomenon where the weights of the shallower layers (closer to the input) are updated far less than the weights of the deeper layers (closer to the output), if at all - their gradients simply "vanish".
Figure II.2: In this example, we can see how the magnitude of the gradients decreases exponentially as they flow "up the stream", from the loss function to the input layer, during backpropagation.
2. The dying ReLU problem: The ReLU function saturates when the input is negative. This means that the gradient of the function is zero for negative values, and the weights leading to those neurons are not updated during backpropagation. Neurons that are "dead" are not activated, and do not contribute to the learning process. This can cause the network to learn suboptimal features, and can slow down the learning process. Note, however, that this holds in the scope of a single iteration of the training process, and doesn't mean the weights will never be updated at some later stage. In some rare cases, though, a majority of the neurons could be "dead", and the network would not learn at all.
3. The difference between the two: The vanishing gradient problem is a gradual problem that usually occurs in very deep networks, given certain activation functions. The dying ReLU, on the other hand, is an immediate problem that influences the training process or the performance in a different way. Both could happen at the same time, or not at all.
How should we then deal with those problems?
• The vanishing gradient problem:
– Use activation functions that have a full-magnitude gradient in their effective range, such as the
ReLU function or its variants.
– Use normalization techniques such as batch normalization or layer normalization to stabilize the
gradients during training.
– Use skip connections or residual connections to allow the gradients to flow more easily through
the network, with the "highway" of gradients.
– Use gradient clipping to prevent the gradients from becoming too large (this addresses the closely related exploding gradient problem).
– Use initialization techniques, such as the Xavier initialization, to prevent the neurons from satu-
rating (subsubsection IV.2.g).
• The dying ReLU problem:
– Use the more updated variants of ReLU, such as the Leaky ReLU, Parametric ReLU, or Expo-
nential Linear Unit (ELU) functions.
– Use initialization techniques such as the Kaiming He initialization, which is tailor-made for ReLU, to prevent the neurons from saturating.
II.2.f Leaky ReLU

[Plot: LeakyReLU(x) over x ∈ [−4, 4].]
LeakyReLU : R → R

LeakyReLU(x) = max(αx, x)
Main features:
• It is a scalar function, operates only on scalars. Therefore, it is applied in an element-wise fashion
to the tensor-like input.
• The variable α ∈ R can take any value, but remains fixed after it is set. Originally set to 0.01.
• It is a simple and computationally efficient function, and is a variant of the ReLU function. However,
costs a bit more computationally than simple ReLU.
• It is non-linear, and has a simple derivative:

∂LeakyReLU(x)/∂x = { α if x ≤ 0; 1 if x > 0 }
II.2.g Parametric ReLU

[Plot: PReLU(x) over x ∈ [−4, 4].]
PReLU : R → R

PReLU(x) = max(αx, x)
Main features:
• It is a scalar function, operates only on scalars. Therefore, it is applied in an element-wise fashion to the tensor-like input.
• The variable α ∈ R is learnable, and changes during training through backpropagation. This allows
the network to learn the optimal value of α for each layer.
• It is non-linear, and has a simple derivative:

∂PReLU(x)/∂x = { α if x ≤ 0; 1 if x > 0 }
II.2.h Exponential Linear Unit - ELU

[Plot: ELU(x) over x ∈ [−4, 4].]
ELU : R → R

ELU(x) = { x if x ≥ 0; α(e^x − 1) if x < 0 }
Main features:
• It is a scalar function, operates only on scalars. Therefore, it is applied in an element-wise fashion
to the tensor-like input.
• It is a simple and computationally efficient function, and is a variant of the ReLU function. However,
costs a bit more computationally.
• It is non-linear, and has a simple derivative:

∂ELU(x)/∂x = { 1 if x ≥ 0; αe^x if x < 0 }
II.2.i MaxOut

MaxOut takes the maximum over several learned affine transformations of the input, e.g. MaxOut(x) = max(w_1^⊺x + b_1, w_2^⊺x + b_2), so the non-linearity itself is learned.

Drawbacks:
• Each MaxOut layer requires twice the number of parameters of a ReLU layer, and is therefore more computationally expensive - in practice it is almost never used.
II.2.j Softmax

Softmax : R^K → (0, 1)^K, Softmax(x)_i = e^{x_i} / Σ_{j=1}^{K} e^{x_j}

Main features:
• Usually not used as a regular activation function, but in very specific cases: To make the logits of
the network to be interpretable as probabilities, for example in the case of multi-class classification.
Another example is to sharpen weights of the attention mechanism in transformers.
• It projects the input to the range [0, 1], and can be interpreted as the probability of the input belonging
to the i-th class.
• It magnifies the differences between the values of the input, and can be used to make the network
more confident in its predictions - big numbers get bigger, and small numbers get smaller.
• It is differentiable, and has a simple derivative:

∂Softmax(x)_i / ∂x_j = Softmax(x)_i (δ_{ij} − Softmax(x)_j)
• In order to avoid numerical instability, the input to the Softmax function is usually shifted by a constant value:

Softmax(x)_i = e^{x_i − max(x)} / Σ_{j=1}^{K} e^{x_j − max(x)}
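A minimal numpy sketch of this max-shifted ("stable") softmax:

```python
import numpy as np

def softmax(x):
    # Shift the logits by their max; the output is mathematically
    # unchanged (see the proof in the answers section below).
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([1000.0, 1001.0, 1002.0])
print(softmax(logits))  # fine; the naive exp(1000) would overflow
```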
II.3 Loss functions
Loss functions represent the goal of training a neural network. They are the guidelines for the weight updates, and should be tailored carefully to the task at hand. While some loss functions might look simple at first glance, they can lead to very specific behaviors of the network and of the features it selects - behaviors that are much more complex than one could imagine. Loss values should almost always be either positive or negative, so the network can learn to minimize or maximize them, respectively. Also, a learning process can involve multiple loss functions that are combined together, and the network is trained to minimize their sum - where some terms are more important than others. These we call "regularization terms", as they make the training process more difficult, for the sake of achieving our desired behavior. It is crucial to understand that the sole goal of a training process of a neural network is to minimize the loss value on the training set. The performance on unseen data, such as the validation and test sets, is a byproduct of this process, and can only be achieved if we prevent the network from performing too well on the training set - a phenomenon called "overfitting". In this case, the network learns to capture features too specific to the training set (e.g. memorizing it) and not general features that could also fit data it has never seen before. More on that in the optimization chapter (chapter IV).
Notes:
• The loss function should always result in a scalar: L : R^N × R^N → R. That specific attribute is what allows us to learn the gradients of the single weight values, all the millions of them, in a way that doesn't require computing Jacobians. The variables in neural networks are always scalars, just assembled into matrices and tensors. More on that in the TUM article, which you can find on Piazza, or in chapter VII at the end of these notes.
• The loss function is differentiable (at least almost everywhere, as in the case of MAE [L1]).
• We ALWAYS divide by N, the number of residuals in the loss function, to make the loss value independent of the number of values being compared. This is crucial, as the loss value should be comparable between different models and different datasets. Pay attention that the number of residuals is not necessarily the number of samples in the dataset; e.g. in the case of semantic segmentation, it is the number of pixels from all the images in the batch all together. Also, if we don't divide the sum, the gradient could become very large, depending on how big the input is, and that could lead to exploding gradients.
• You've seen in class the term "cost function". The deep learning literature is saturated with ambiguities and lacks default definitions for important terms. Therefore, we drop this term completely in this course and discuss ONLY loss functions.
We will distinguish between two main types of loss functions: classification loss functions and regression loss
functions.
II.3.a Classification loss functions

Binary Cross-Entropy - BCE

L_BCE(y, ŷ) = −(1/N) Σ_{i=1}^{N} (y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i))

where y_i is the true label, and ŷ_i is the predicted probability of belonging to label 1.
Note: the value of ŷ_i should never be exactly 0 or 1, as the logarithm of 0 is undefined. To prevent this, we can clip the values of ŷ_i to be in the range [0 + ϵ, 1 − ϵ].
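A minimal numpy sketch of BCE with exactly this clipping (the eps value is a hypothetical choice):

```python
import numpy as np

def bce_loss(y, y_hat, eps=1e-7):
    # Clip the predictions away from 0 and 1 so log() stays finite.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```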
Categorical Cross-Entropy - CE

L_CE(y, ŷ) = −(1/R) Σ_{i=1}^{R} Σ_{j=1}^{C} y_ij log(ŷ_ij)

where y_ij is an indicator of 0 or 1 for whether input i belongs to class j, and ŷ_ij is the predicted probability that it actually does. R is the number of residuals (e.g. number of instances in the batch, all the pixels in the batch, etc.).
Here we employ what we call the "one-hot" encoding, where the true label is a vector of length C with a
single 1 at the index of the true class, and zeros elsewhere. The predicted probabilities are in a vector of
length C:
y = [0 1 0; 1 0 0; 0 0 1], ŷ = [0.1 0.8 0.1; 0.9 0.1 0.0; 0.0 0.0 1.0]
Therefore, in practice the number of residuals doesn’t contain the size of the one-hot encoding vector
(number of classes), as we only consider the residuals of the true class - so we don’t also divide by C and
the loss function value is independent of the number of classes.
II.3.b Regression loss functions

Mean Squared Error - MSE

L_MSE(y, ŷ) = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)²
[Plot: L_MSE as a function of the residual ŷ − y: a parabola.]
Mean Absolute Error - MAE

L_MAE(y, ŷ) = (1/N) Σ_{i=1}^{N} |y_i − ŷ_i|

[Plot: L_MAE as a function of the residual ŷ − y: a V shape.]
Note that it is not differentiable at x = 0, which could cause some numerical instabilities and adds a very small computational overhead when calculating the gradient on a computer.
II.4 Linear Regression
Linear regression is a linear technique for modeling the relationship between different data points or variables. The simplest form of the regression equation, with one dependent and one independent variable, is defined by the formula:
Y = XW + b
where X represents the data, W is the weight, and b is the bias. The goal of linear regression is to find
the best-fitting straight line through the data points, possibly in multiple dimensions. The most common
method is to find the best-fitting line that minimizes the sum of the squared differences between the observed
values and the predicted values. This is known as the least squares criterion:
L(y|X, w, b) = (1/N) Σ_{i=1}^{N} (y_i − (w · x_i + b))²
The goal is to find the values of w and b that minimize this error. This can be done using the closed-form
equation:
W = (X^⊺X)^{−1} X^⊺ Y
This equation can be solved directly for W and b. That means, that if we can load all the data into our RAM,
then we can solve the linear regression problem in one step, and with no need for iterative optimization.
However, this is not always possible, and in practice, we often use iterative optimization methods such as
gradient descent (subsection IV.1.a) to find the optimal values of W and b.
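A minimal numpy sketch of the closed-form solution; folding b into W via a column of ones is an assumption made here for compactness:

```python
import numpy as np

# Synthetic data: y = X @ w_true + b_true (with b_true = 0.7).
N, D = 100, 3
X = np.random.randn(N, D)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.7

# Fold b into W by appending a column of ones to X.
X1 = np.hstack([X, np.ones((N, 1))])      # shape (N, D + 1)
W = np.linalg.inv(X1.T @ X1) @ X1.T @ y   # last entry recovers b
# In practice, np.linalg.lstsq(X1, y, rcond=None) is numerically safer.
```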
II.5 Logistic Regression
[Plot: a sigmoid curve σ(x) fitted to two clusters of data points.]
By using a very basic neural network, of simply one linear layer f(x) followed by a non-linear activation function, we can now model a very common non-linear problem: binary classification. In this setting, we do not try to fit a line to the data, but rather a curve that separates the two clusters or "classes". The linear layer is followed by the sigmoid activation function:

ŷ = σ(f(x)) = 1 / (1 + e^{−f(x)})
This function squashes the logits (the neurons in the output layer) into the range [0, 1], and can be in-
terpreted as the probability of the input belonging to the positive class. The loss function for binary
classification is the binary cross-entropy ("BCE") loss:
L_BCE(y, ŷ) = −(1/N) Σ_{i=1}^{N} (y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i))
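A minimal numpy sketch of this model - one affine layer followed by a sigmoid (the weights here are arbitrary):

```python
import numpy as np

def predict(X, w, b):
    logits = X @ w + b                    # the affine layer f(x)
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid -> probability of class 1

X = np.random.randn(8, 2)                 # 8 samples, 2 features
y_hat = predict(X, np.array([0.5, -0.3]), 0.1)
y_class = (y_hat > 0.5).astype(int)       # threshold at 0.5 for the label
```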
II.6 Complete Schematic of a Fully Connected Neural Network
Try to follow the four steps with some actual numbers to check if you understand how one training step
works. Here are some sample values and the updated weights and biases:
x_1 = 1, x_2 = 2, W_{x1,h1} = 0.1, W_{x2,h1} = 0.2, W_{x1,h2} = 0.3, W_{x2,h2} = 0.4, W_{x1,h3} = 0.5, W_{x2,h3} = 0.6, b_{h1} = 0.1, b_{h2} = 0.2, b_{h3} = 0.3, W_{a1,o} = 0.7, W_{a2,o} = 0.8, W_{a3,o} = 0.9, b_o = 0.1, target = 3.2, lr = 0.1

New weights & biases: W_{x1,h1} = 0.09, W_{x2,h1} = 0.18, W_{x1,h2} = 0.29, W_{x2,h2} = 0.37, W_{x1,h3} = 0.49, W_{x2,h3} = 0.57, b_{h1} = 0.09, b_{h2} = 0.19, b_{h3} = 0.29, W_{a1,o} = 0.69, W_{a2,o} = 0.78, W_{a3,o} = 0.87, b_o = 0.08
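A small numpy sketch that reproduces these numbers, under two assumptions not stated explicitly above: the activations are the identity (a = h), and the loss is L = ½(ŷ − target)². With those, the updates match the listed values after rounding to two decimals:

```python
import numpy as np

x = np.array([1.0, 2.0])
W1 = np.array([[0.1, 0.3, 0.5],
               [0.2, 0.4, 0.6]])  # column j holds W_{x1,hj}, W_{x2,hj}
b1 = np.array([0.1, 0.2, 0.3])
W2 = np.array([0.7, 0.8, 0.9])    # W_{a1,o}, W_{a2,o}, W_{a3,o}
b2 = 0.1
target, lr = 3.2, 0.1

# Forward pass (identity activations: a = h)
h = x @ W1 + b1        # [0.6, 1.3, 2.0]
y_hat = h @ W2 + b2    # 3.36

# Backward pass for L = 0.5 * (y_hat - target)**2
dy = y_hat - target    # dL/dy_hat = 0.16
dW2, db2 = dy * h, dy
dh = dy * W2
dW1, db1 = np.outer(x, dh), dh

# Gradient descent update
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
print(np.round(W1, 2), np.round(b1, 2), np.round(W2, 2), round(b2, 2))
# [[0.09 0.29 0.49] [0.18 0.37 0.57]], [0.09 0.19 0.29], [0.69 0.78 0.87], 0.08
```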
II.7 Questions
1. Linear Regression
a) Assume a linear fitting problem with 5 billion points, each with 100 features. Is it feasible to find the optimal solution without a neural network? State yes or no, and explain why.
b) How can we transform a linear problem (with only affine layers) into a classification problem?
2. Fully Connected Layers
a) Given the following layer y = f(X, W, Z, R, T, A) = XW + ZR + T + A, where X_{N×D}, W_{D×M}, Z_{N×D}, R_{D×M}, T_{N×M}, A_{N×M}, and a loss function L : R^{N×M} → R, L = Σ_{i=1}^{N} Σ_{j=1}^{M} y_ij. What is the gradient ∂L/∂Z? What is the gradient ∂L/∂A?
b) Given a batch of images of shape (8 × 3 × 8 × 8), how can we process them through a fully connected layer? What would be the shape of the weight matrix W in the case of logistic regression with BCE as the loss function?
c) Given a batch of images of shape (8 × 3 × 8 × 8), a fully connected network (FCN), and the task of detecting people's eyes in the images, name two disadvantages of trying to solve the task with this setup.
d) Given two affine layers:
y1 = f1 (X, W1 , B1 ) = XW1 + B1
and
y2 = f2 (y1 , W2 , B2 ) = y1 W2 + B2
Show that it could be described as a single affine layer.
3. Activation Functions
a) What is the purpose of the activation functions in a neural network?
b) Give two advantages of the Tanh function over the Sigmoid function.
c) Explain the vanishing gradient problem, and describe one method that could help us solve it.
d) We've learned that the sigmoid function can cause the "vanishing gradient" problem. Explain why the sigmoid function is nevertheless still sometimes used on the logits (the output of the last layer).
e) Assume the LeakyReLU activation function. If we take an α < 0, what property of the function
would we lose?
f) The Softmax function can suffer from numerical instability, given the logits. Show that subtracting the maximum value of the logits from each one of the values in the vector does not change the output of the function:

Softmax(x − max(x))_i = Softmax(x)_i
4. Loss Functions
a) Which property of loss functions allows us to perform backpropagation without computing Jaco-
bians?
b) Assume that you're using the MSE loss function to compare two batches of images of shape (4 × 3 × 8 × 8). What would be the value of N that we should divide the loss value by?
c) In depth estimation, where we predict the depth of each pixel in an image, it was found that the
loss for the majority of pixels is very small, while for some few pixels it is very large. What would
be a better loss function to use in this case out of [MAE, MSE, BCE, CE], and why?
d) What could cause numerical instabilities in the BCE and CE functions? How could we solve
that?
e) Why can the MAE loss function still be used, although it is not differentiable at x = 0?
f) Assume that in the first iteration of training, some of the logits have values above 1000, while the ground-truth values are in the range [0, 1]. If we're using the MSE loss function, what optimization problem should we expect to observe?
g) Let’s use the CE loss function for a task of classification of 100 classes. What is the expected loss
value after the first iteration, and why?
h) BCE: how many neurons are at the output layer?
i) BCE: why do we multiply the result by −1?
j) Why don’t we multiply by −1 in MSE or MAE?
k) You are given a neural network for a classification task with 4 classes and the CE as a loss
function. The batch size is 1000. After the very first iteration of training, what is the expected
loss value?
II.8 Answers
1. Linear Regression
a) No, it is not feasible. Linear regression has a closed-form solution, but it requires loading into RAM a matrix X of size 5·10⁹ × 100 and computing the closed-form solution from it, which is impractical at this scale. Therefore, we would have to use an iterative training scheme with batches, through a neural network.
b) Add a sigmoid function at the end of the network, and use one of the cross-entropy loss functions.
2. Fully Connected Layers
a) ∂L/∂Z = (∂L/∂y) · R^⊺, ∂L/∂A = 1_{N×M} ⊙ (∂L/∂y), where · is matrix multiplication, and ⊙ is element-wise multiplication.
b) We could flatten the images, each to a vector of size 3 × 8 × 8 = 192 → X_{8×192}, and the weight matrix would be of size W_{192×1}.
c) i. FC layers extract only global features, while the task requires local features.
ii. The number of pixels (resolution) is too small, which means a lack of fine details.
d) y2 = f2 (f1 (X, W1 , B1 ), W2 , B2 ) = f2 (XW1 + B1 , W2 , B2 ) = (XW1 + B1 )W2 + B2 = X(W1 W2 ) +
B1 W2 + B2
3. Activation Functions
a) Activation functions introduce non-linearity to the network, and allow the network to learn com-
plex patterns in the data.
b) 1. The Tanh function is zero-centered, while the Sigmoid function is not. 2. The Tanh function
has a larger maximum gradient than the Sigmoid function.
c) The vanishing gradient problem is a phenomenon where the gradients become exponentially
smaller as they propagate "up the stream", leading to a scenario where the weights of the shal-
low layers are updated much more slowly or not at all. One method to solve it is to use skip
connections, that allow "a highway for the gradients" to flow through the network.
d) The sigmoid function is still used on the logits, as it projects the logits to the range [0, 1], and
can be interpreted as the probability of belonging to a class in binary classification.
e) If we take an α < 0, we would lose the zero-centered property of the function, as all the values
would be non-negative.
f) Softmax(x − max(x))_i = e^{x_i − max(x)} / Σ_{j=1}^{K} e^{x_j − max(x)} = (e^{x_i} · e^{−max(x)}) / (e^{−max(x)} · Σ_{j=1}^{K} e^{x_j}) = e^{x_i} / Σ_{j=1}^{K} e^{x_j} = Softmax(x)_i
4. Loss Functions
a) The property that allows us to perform backpropagation without computing Jacobians is that the loss function is a scalar function L : R^{N×M} → R, and that the variables in neural networks are always scalars, just assembled into matrices and tensors. Eventually, we aim to find ∂L/∂v, where v is a single scalar variable, v ∈ R.
b) N = 4 × 3 × 8 × 8 = 768, as we divide by the number of residuals in the loss function.
c) The MAE loss function would be better, as it is less sensitive to outliers, and would fit the
majority of the residuals better.
d) Numerical instabilities could be caused by the logarithm of 0 in the BCE and CE functions. We
could solve that by clipping the predicted probabilities ŷ to be in the range [0 + ϵ, 1 − ϵ].
e) The MAE loss function can still be used, although it is not differentiable at x = 0, because we only require the loss to be differentiable almost everywhere (locally), which covers practically all cases.
f) The gradients would be HUGE, so we should expect to observe the "exploding gradient" problem,
where the update step would be too large, and the network would diverge.
g) The expected loss value after the first iteration would be − log(0.01) = 4.6, as the network would
be initialized with random weights, and the expected value of the loss function is − log(1/C),
where C is the number of classes.
h) The number of neurons at the output layer would be 1, as we have a binary classification problem.
i) We multiply the result by −1 to make the loss value positive, as for all 0 < x ≤ 1: −log(x) ≥ 0.
j) We don’t multiply by −1 in MSE or MAE, as the loss value should be positive, and the network
should learn to minimize it.
k) − ln(0.25)
III Convolutions
III.1 Definition
A convolutional layer slides a kernel over the input tensor; for every window, the output value is the sum of the element-wise products of the window and the kernel:

Σ_{l=1}^{c_in} Σ_{i=1}^{k_h} Σ_{j=1}^{k_w} x_{l,i,j} w_{l,i,j}
The output is a tensor of shape (cout , hout , wout ), where hout and wout are the height and width of the output
tensor. Those spatial values are calculated as follows:
H_out = (H_in + 2p − k_h) / s + 1

W_out = (W_in + 2p − k_w) / s + 1
Note that not all combinations are valid - meaning that if we are not careful with the choice of these parameters, not all windows will be computed. A window that doesn't fit into the input will simply be dropped.
Popular trios:
• k = 1, s = 1, p = 0 - Pointwise convolutions - process only a single pixel across its channels. Keeps the
spatial dimensions intact.
• k = 3, s = 1, p = 1 - The most common one. Has a small neighborhood, and it keeps the spatial size of the input (h_in = h_out, w_in = w_out).
• k = 3, s = 2, p = 1 - The stride is doubled, so the output spatial size is half the input’s.
• k = 7, s = 4, p = 3 - The stride is quadrupled, so the output spatial size is 1/4 of the input’s.
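A small sketch of the output-size formula, checking the trios above (assuming a 32 × 32 input):

```python
def conv_out_size(h_in, w_in, k, s, p):
    # Integer division: windows that don't fit are simply dropped.
    return (h_in + 2 * p - k) // s + 1, (w_in + 2 * p - k) // s + 1

print(conv_out_size(32, 32, k=3, s=1, p=1))  # (32, 32): size preserved
print(conv_out_size(32, 32, k=3, s=2, p=1))  # (16, 16): halved
print(conv_out_size(32, 32, k=7, s=4, p=3))  # (8, 8): quartered
```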
III.2 Unique Convolutional Layers
III.2.a MaxPooling
Max-pool is a layer that operates channel-wise (the kernel has a depth of 1), in contrast to the default
convolution layer, where the kernel has the same depth as the input. Given a window, it returns the
maximum value within it. Usually, it is used in order to reduce the spatial size of the input, while introducing
non-linearity but no additional parameters. However, this layer doesn’t necessarily reduce the spatial size,
and could be used while maintaining the same spatial dimensions, e.g. in the case of the "inception layer"
(section V.6). The most common configuration uses a kernel size of 2 × 2, a stride of 2, and no padding.
Note about the gradient: only the entry (or multiple entries) in the window that holds the maximum value of that window will get a derivative value of 1. The rest are killed with 0s.
III.2.b Average pooling

Similar to maxpool, but instead of returning the maximum value, it returns the average value of all entries of the window. Note that this operation is linear, and does not introduce non-linearity. It is also normally used to reduce the spatial size of the input, with no additional parameters.
III.2.c 1 × 1 convolutions
Also called "Point wise convolution". This is a special case of the convolution layer, where the kernel is 1 × 1.
This layer is used usually to reduce the number of channels in the input, while keeping the spatial size. This
layer is also first introduced in the "inception layer" (section V.6), with the purpose of reducing the channels
before the more expensive operations, such as 3 × 3, 5 × 5 convolutions. With that, they managed to reduce
dramatically the number of parameters and calculations, while maintaining the performance to some extent.
Note that this layer doesn't extract spatial features, as it has no neighborhood, and its receptive field is the same as the layer before it. Also, it is sometimes claimed that this layer adds non-linearity. This is of course NOT TRUE, but one could follow it with some non-linear activation, since it is quite cheap.
III.2.d Depth-wise Convolutions

This is a special case of the convolution layer, where we take a single kernel of depth 1 for each channel. It is common in modern visual-transformer architectures.
III.2.e Upsample
A linear layer that introduces no additional parameters, to increase the spatial size of the input. Usually
using a predefined algorithm, such as nearest neighbor, bilinear or bicubic.
III.2.f Transposed Convolutions

Unlike the upsampling layer, the transposed convolution introduces additional parameters in the form of kernels. While being more capable and able to adjust to the task at hand, it is also more expensive. Note that it is not the same as a "deconvolution" layer.
III.3 Receptive Fields
The receptive field is the number of pixels in the network's input by which an intermediate pixel in some hidden layer is affected - the spatial extent of the connectivity of a convolutional filter. What is the relationship between the input and the output? For example, a 3×3 receptive field means 1 output pixel is connected to 9 input pixels. Note that we do not consider the channel dimension - we only care about the spatial context.
To calculate the receptive field of a pixel in the intermediate layer l, we use the following formula:

r_l = r_{l−1} + (Π_{i=1}^{l−1} s_i) · (k_l − 1),   for l ≥ 2

where s_i is the stride of layer i and k_l is the kernel size of layer l. Also, r_1 is the kernel size of the first layer. Note that we start from the input and go deeper, and not from the requested layer backwards. For example, for a stack of 3×3, stride-1 convolutions (r_1 = 3, r_2 = 5):

r_3 = r_2 + (Π_{i=1}^{2} s_i) · (k_3 − 1) = 5 + (1 · 1) · (3 − 1) = 7
Note: For fully connected layers, the receptive field is the ENTIRE image. Since each output neuron of
that layer is connected to each neuron of the previous one, where each layer represents the entire output,
whether it is down-scaled or upscaled.
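A small sketch of this recursion (the example below assumes a stack of 3 × 3, stride-1 convolutions):

```python
import math

def receptive_field(kernels, strides):
    # r_1 = k_1;  r_l = r_{l-1} + (prod of strides of layers 1..l-1) * (k_l - 1)
    r = kernels[0]
    for l in range(1, len(kernels)):
        r += math.prod(strides[:l]) * (kernels[l] - 1)
    return r

# Three 3x3, stride-1 convolutions: r grows 3 -> 5 -> 7
print(receptive_field([3, 3, 3], [1, 1, 1]))  # 7
```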
III.4 Handcrafted Kernels
Important: For some reason, these keep popping up in the exams. You should know them by heart, or
just decide to skip them in advance. The most common ones are:
• Edge detection:
– [−1 −1 −1; −1 8 −1; −1 −1 −1]
– Sobel - vertical and horizontal filters: [1 0 −1; 2 0 −2; 1 0 −1] and [1 2 1; 0 0 0; −1 −2 −1]
• Blurring:
– Box mean: (1/9) · [1 1 1; 1 1 1; 1 1 1]
– Gaussian blur: (1/16) · [1 2 1; 2 4 2; 1 2 1]
• Sharpen:
– [0 −1 0; −1 5 −1; 0 −1 0]
(rows separated by semicolons)
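A small sketch of applying such a kernel outside a network, e.g. with scipy (the random image is a stand-in):

```python
import numpy as np
from scipy.signal import convolve2d

gaussian = np.array([[1, 2, 1],
                     [2, 4, 2],
                     [1, 2, 1]]) / 16.0

image = np.random.rand(8, 8)  # stand-in grayscale image
# For a symmetric kernel like this one, true convolution and the
# cross-correlation used in deep learning give the same result.
blurred = convolve2d(image, gaussian, mode="same")
```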
III.5 Questions
1. How can we represent an FC layer with 5 output neurons with a convolutional layer, over an image in a batch of size 4 × 3 × 8 × 8? And with a 1×1 convolution?
2. Given a 1 × 1 convolutional layer with input tensor of 10 channels, that outputs a tensor with 5
channels.
a) Write the shape of the weight matrix.
b) State the number of parameters in the layer.
3. Give two reasons to use a convolution or a pooling layer that reduces the spatial size of the input tensor.
4. Does reducing the spatial size throughout the network reduce the number of parameters in the down-
the-stream convolutional layers?
5. Does reducing the spatial size throughout the network reduce the number of parameters in the down-
the-stream FC layers?
6. State two differences between a convolutional layer and a pooling layer.
7. Assume a fully convolutional model for some task:
a) Can we feed the model with images that are double the resolution of the original training set?
b) Can we expect in such case the same performance?
8. Can we apply a pooling layer without reducing the spatial size of the input tensor?
9. Assume maxpool with (k = 2, s = 2, p = 0), where only a single entry in each window holds the maximum value of that window. How many of the total pixels in the tensor would get a live gradient? What is the value of said gradient?
10. State one advantage and one disadvantage of a 1 × 1 convolution over a 3 × 3 convolution.
11. Why don’t we use hand-crafted kernels (e.g. Sobel filter for edge detection) within our deep learning
models?
12. In what technique, however, can we use hand-crafted kernels such as Gaussian blur?
13. Assume that we want to use a transformer to process an intermediate feature map (an output tensor
of a convolutional layer), to learn meaningful relations between the pixels. How can we change the
tensor to do that?
III.6 Answers
1. FC layers take ALL neurons in a single input from the batch. To do that with convolutions, we must
have a kernel with kh = Hin , kw = Win and kc = Cin . The weight matrix would be 5 × 3 × 8 × 8
(Number of kernels × kc × kh × kw ). The output tensor would be of shape 4 × 5 × 1 × 1. With a 1x1
Convolution, we can simply reshape the input tensor to 4 × (3 · 8 · 8) × 1 × 1 and apply a 1 × 1 convolution with 5 output channels.
2. a) • w = 5 × 10 × 1 × 1
• b=1×5×1×1
b) 55.
3. Two reasons to reduce the spatial size of the input tensor are:
• Compression enforces feature selection and extracting meaningful features.
• Using smaller tensors throughout the network reduces the number of calculations, allowing us to
use more layers for a greater capacity.
4. No - as long as the spatial size remains bigger than the convolutional kernel, the kernel sizes don’t
change, so the number of parameters stays the same.
5. Yes - reducing the spatial size throughout the network would result in smaller tensors, and therefore
smaller FC layers, which take the entire previous layer as input.
6. Two differences between a convolutional layer and a pooling layer are:
• A convolutional layer has learnable parameters, while a pooling layer doesn’t.
• Pooling layers work channel-wise (each channel independently), while default convolutional layers
have the same depth as the input tensor.
7. a) Yes, fully convolutional networks can take varying sizes of inputs without any architectural
changes, as long as the input remains large enough for the network's downsampling operations.
b) No, convolutional layers are not invariant to scale changes.
8. Yes, we can apply a pooling layer without reducing the spatial size of the input tensor by using a
relevant trio, e.g. (k=3, s=1, p=1), just like with convolutions.
9. Each window is of size 2 × 2 and therefore has 4 entries. Only a single entry gets a gradient, and
therefore 1/4 of the total entries would have a live gradient, with a value of 1.
10. 1 × 1 vs 3 × 3:
a) Advantage: Much cheaper in terms of parameters and calculations.
b) Disadvantage: Smaller receptive field, so the scope of the layer is much narrower.
11. We don’t use hand-crafted kernels within our deep learning models because they are not learnable,
and thus not able to adapt to the task at hand.
12. We can use hand-crafted kernels such as Gaussian blur within our deep learning models by using them
as the initial weights of the convolutional layers, and then updating them during training / use in data
augmentation.
13. To use a transformer to process an intermediate feature map, we can change each instance in the
batch to a 2D tensor, by transposing the channels to the last dim and flattening the spatial size:
(B × C × H × W) → (B × H × W × C) → (B × (H ∗ W) × C). Now each pixel, across the channels,
is a vector token for the transformer head.
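A minimal PyTorch sketch of this reshaping (the sizes are arbitrary examples):

import torch

B, C, H, W = 4, 64, 16, 16
feat = torch.randn(B, C, H, W)
tokens = feat.permute(0, 2, 3, 1).reshape(B, H * W, C)
print(tokens.shape)  # torch.Size([4, 256, 64]) - one C-dim token per pixel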
IV Optimization
Since we deal with big data and non-linear problems, we use iterative optimization algorithms to find the best
solution. Neural networks compose many different functions, and therefore their optimization
problem is usually non-convex. This means that off-the-shelf basic neural networks face a loss surface
with many different minima. It is common to assume that any local minimum is good enough
as a solution, but we nevertheless always aim for the best solution and performance. In this section
we will discuss the different approaches that exist for solving iterative optimization problems, and see why
some of them are better than others, practically vs. theoretically.
IV.1 Optimization algorithms
IV.1.a Gradient Descent
Gradient descent (GD) is the basic optimization algorithm, on which all modern deep learning optimization
problems are based. It follows a very basic idea: For each variable θ in the model, calculate the gradient
over the entire training-set, and then perform an update on the variable θ in the direction of the negative
gradient. This is done iteratively until the loss function converges to a minimum. The update rule is as
follows:

θ_{t+1} = θ_t − α · ∂L/∂θ_t

Where θ_t is the variable at epoch t, α is the learning rate, and ∂L/∂θ_t is the gradient of the loss function L
with respect to the variable θ_t .
Why go in the direction of the negative gradient? Because the gradient points in the direction of the steepest
ascent, and we want to minimize the loss function, so we go in the opposite direction.
Notes:
• Gradient descent is a first-order optimization algorithm, meaning that it only uses the first derivative
of the loss function.
• In gradient descent, we perform the optimizer step only at the end of the epoch, which is only when
we’ve iterated over the entire training set.
• This could be done in batches. Since datasets are usually too large to handle, we split them into
smaller batches and use gradient accumulation - we accumulate the gradients in some buffer and at
the end of the epoch, we divide by the number of iterations and only then perform the optimizer step.
• While gradient descent calculates the most accurate gradient, it has a few major drawbacks:
– Since we perform the optimizer step only at the end of the epoch, each single step could take a
long time to compute - the model might never converge in reasonable time.
– Calculating the gradient over the entire training set removes the approximation noise that comes
from doing the optimizer step over a small batch. That noise is actually a good regularizing force
that helps avoid saddle points; therefore, GD tends to underfit.
– Calculating the gradient over the entire training set is very computationally expensive, and there-
fore it is not practical for large datasets.
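A minimal sketch of GD via gradient accumulation, assuming hypothetical `model`, `loss_fn` and a `loader` that iterates over the entire training set once per epoch:

import torch

def gd_epoch(model, loss_fn, loader, lr=0.1):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    opt.zero_grad()
    n_batches = 0
    for x, y in loader:
        loss = loss_fn(model(x), y)
        loss.backward()            # grads are summed into the .grad buffers
        n_batches += 1
    for p in model.parameters():   # average the accumulated grads
        if p.grad is not None:
            p.grad /= n_batches
    opt.step()                     # a single optimizer step per epoch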
IV.1.b Stochastic Gradient Descent
Stochastic gradient descent (SGD) is a variation of the gradient descent algorithm that aims to solve the
problems that come with basic GD. While the original definition consists of using a batch size of 1, in
practice we use a small batch from the training set and perform the optimizer step at each iteration. This
way, we can perform the optimizer step many times in a single epoch, and therefore the model will converge
faster, although it has a much noisier gradient.
The update rule is exactly the same as in GD, but we perform the optimizer step at each iteration t + 1:

θ_{t+1} = θ_t − α · ∂L/∂θ_t
Notes:
• SGD is a first-order optimization algorithm, meaning that it only uses the first derivative of the loss
function.
• The original definition of SGD is when the batch size is 1, but in practice, we use a small batch size
(e.g 4, 8, 16, 32, 64, etc.).
• SGD is much faster than GD.
• "Noisier steps" comes from the fact that SGD only approximates the global gradient on a small batch,
which cannot represent the real distribution of the data. Therefore, features that are learned on a
current sample, might not be relevant for a sample from the next batch. However, this could be a
good thing, as it helps avoiding saddle points. The bigger the batch size is, the lower the noise is.
• Implementation-wise, the ONLY difference between GD and SGD is that in SGD we perform the
optimizer step at each iteration, and not at the end of the epoch.
IV.1.c SGD with Momentum
SGD with momentum is a variation of the SGD algorithm that aims to solve the problems that come with
basic SGD. The idea is to add a momentum term to the update rule, which helps the optimizer
converge faster. The update rule is as follows:
m_{t+1} = β · m_t − α · ∂L/∂θ_t
θ_{t+1} = θ_t + m_{t+1}
Where mt is the momentum term at iteration t and β is the momentum coefficient (usually 0.9).
Notes:
• SGD with momentum is a first-order optimization algorithm.
• m0 is usually initialized to 0.
• The momentum term is increased or decreased iteratively, carried over from one iteration to the next.
It is similar, but not identical, to the "exponential moving average" concept.
• When the "slope" is steep, the velocity will increase very fast, allowing us to overshoot saddle-points.
IV.1.d Nesterov Momentum
The Nesterov Momentum optimizer (NAG) improves on standard momentum by anticipating the direction
of the update, computing the gradient at a point slightly ahead ("Look Ahead") of the current position.
This could result in faster convergence and better training performance for deep learning models.
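A minimal NumPy sketch of the momentum update above, on a toy quadratic; the `grad` callable is an assumed stand-in for backpropagation:

import numpy as np

def sgd_momentum_step(theta, m, grad, lr=0.01, beta=0.9):
    m_new = beta * m - lr * grad(theta)   # m_{t+1} = beta*m_t - alpha*dL/dtheta
    theta_new = theta + m_new             # theta_{t+1} = theta_t + m_{t+1}
    return theta_new, m_new

theta, m = np.array([2.0]), np.zeros(1)   # m0 is initialized to 0
for _ in range(100):                      # minimize f(theta) = theta^2
    theta, m = sgd_momentum_step(theta, m, grad=lambda t: 2 * t)
print(theta)                              # approaches 0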
IV.1.e RMSprop
This algorithm follows the idea of dividing the learning rate by the square root of a running average of
the squared gradients, to mitigate big oscillations in the gradient. Since this is done for each variable in the gradient
individually, it results in a different optimization step size for each one of them. The update rule is as
follows:

v_{t+1} = β · v_t + (1 − β) · (∇_θL ∘ ∇_θL)
θ_{t+1} = θ_t − α · ∇_θL / (√(v_{t+1}) + ϵ)

Where v_t is the velocity term at iteration t, β is the decay coefficient (usually 0.99), and ϵ is a small
number to avoid numerical instability.
Notes:
• RMSprop is a second-order optimization algorithm, meaning that it uses the second moment of the
loss function gradient.
• The idea is to divide the learning rate by the square root of the running average of the squared gradients,
to mitigate big oscillations in the gradient.
• The optimization step is different for each variable in the gradient - hence it is called an adaptive
learning rate.
• The squared gradients approximate the second moment of the gradient, which is the variance. Recall
that the variance is

Var(X) = E[(X − µ)²]

We assume that the mean of all variables is µ = 0. Therefore, the element-wise square of the gradient,

∇_θL(θ_t) ∘ ∇_θL(θ_t),

approximates the variance of each gradient entry.
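A minimal NumPy sketch of the update above (element-wise on the parameter array):

import numpy as np

def rmsprop_step(theta, v, g, lr=0.01, beta=0.99, eps=1e-8):
    v = beta * v + (1 - beta) * g * g            # running avg of squared grads
    theta = theta - lr * g / (np.sqrt(v) + eps)  # per-parameter adaptive step
    return theta, v

theta, v = np.array([2.0, -3.0]), np.zeros(2)
for _ in range(500):                             # minimize f = ||theta||^2
    theta, v = rmsprop_step(theta, v, g=2 * theta)
print(theta)                                     # both entries approach 0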
IV.1.f Adam
Adam is a combination of RMSprop and SGD with momentum. It uses the squared gradients to scale
the learning rate like RMSprop, and it uses the momentum term to avoid local minima like SGD with
momentum. The update rule is as follows:

m_{t+1} = β₁ · m_t + (1 − β₁) · ∇_θL
v_{t+1} = β₂ · v_t + (1 − β₂) · (∇_θL ∘ ∇_θL)
m̂_{t+1} = m_{t+1} / (1 − β₁^{t+1})
v̂_{t+1} = v_{t+1} / (1 − β₂^{t+1})
θ_{t+1} = θ_t − α · m̂_{t+1} / (√(v̂_{t+1}) + ϵ)
m represents the running mean of the gradients (the first moment), and v stands for the running
variance - for each single entry of the gradient matrix.
Notes:
• Adam is a second-order optimization algorithm
• The idea is to combine the best of both worlds - RMSprop and SGD with momentum.
• Bias correction: The Adam optimizer uses a bias correction term to correct the bias of the first
and second moments. It was first introduced with Adam, and therefore doesn't exist in RMSprop,
although it should. The idea is that at the first iteration, the variables m, v are initialized to 0, and
therefore they are biased towards 0. To correct this bias, we divide them by the term 1 − β^{t+1}, where
t is the iteration number, used here as the exponent. As the number of iterations increases, the
correction terms vanish; but at the very first iterations, they counteract the bias of m, v towards zero,
so the optimizer step is much more meaningful:
m₁ = β₁ · m₀ + (1 − β₁) · ∇_θL = (1 − β₁) · ∇_θL     (since m₀ = 0)
m̂₁ = m₁ / (1 − β₁) = ∇_θL
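A minimal NumPy sketch of the update above, including the bias correction (the iteration index `t` starts at 0, so the exponent is t + 1):

import numpy as np

def adam_step(theta, m, v, g, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g            # running mean of the gradients
    v = b2 * v + (1 - b2) * g * g        # running (uncentered) variance
    m_hat = m / (1 - b1 ** (t + 1))      # bias correction
    v_hat = v / (1 - b2 ** (t + 1))
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v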
IV.1.g Newton's Method
While stochastic gradient descent and its variants show great capabilities, in theory they lag behind
some other methods. Newton's method is a second-order optimization algorithm that uses the second
derivative of the loss function to find the optimal solution, in much fewer iterations than SGD. It follows
the Newton-Raphson method, a root-finding algorithm that uses the first derivative of a function
to find its root (e.g. y = f(x) → y = 0). The update rule is as follows:
Newton-Raphson method:

x_{t+1} = x_t − f(x_t) / f'(x_t)

But here we want to find a root of the gradient, so we apply the method to the derivative of the loss,
which brings in the second derivative. For a single variable x:

x_{t+1} = x_t − f'(x_t) / f''(x_t)
For a parameter vector θ:

θ_{t+1} = θ_t − H^{−1} · ∇_θL(θ_t)
Where H is the Hessian matrix (matrix of second derivatives) according to the loss function, and H −1 is its
inverse.
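A minimal sketch of the single-variable version on a toy function with known derivatives:

def newton_step(x, f1, f2):
    return x - f1(x) / f2(x)      # x_{t+1} = x_t - f'(x_t) / f''(x_t)

f1 = lambda x: 4 * x**3 - 6 * x   # f'(x)  for f(x) = x^4 - 3x^2 + 2
f2 = lambda x: 12 * x**2 - 6      # f''(x)
x = 2.0
for _ in range(6):
    x = newton_step(x, f1, f2)
print(x)   # converges to sqrt(3/2) ~ 1.2247, a minimum of f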
Notes:
• Newton’s method is a second-order optimization algorithm.
• This method is NOT FASTER than SGD in wall-clock time, although it converges in fewer iterations,
because the inverse Hessian matrix is really expensive to compute.
• So why isn't it used in practice? Newton's method requires the entire training set as input to
compute the inverse Hessian matrix, which is super expensive in memory and time.
• There are some variants of the Newton’s method that are used in practice, that approximate the
Hessian matrix and therefore are much faster. The most common one is the L-BFGS algorithm.
However, they also still require the entire training set as input, and therefore are not practical for
large datasets and deep learning.
In the literature, you'll find a lot of inconsistencies about this term. Some claim that Adam, RMSprop and
Newton's method are all second-order methods, and some say that only Newton's method is.
• It doesn't really matter. It's not important. Don't break your head over it. Simply, until
stated otherwise, in this course - all three of the above are second order.
• Adam and RMSprop are second-order gradient-descent-based optimizers (second moment).
• Newton's method is a second-order optimization method (second derivative).
Both cases are commonly shortened to "second order" methods.
IV.2 Optimization problems and solutions
As explained before, the sole goal of training neural networks is to minimize the loss function; the
generalized performance on unseen data is only a byproduct of that. However, that imposes a problem: if
the model is too capable, it could fit the training data too well. There are plenty of reasons for such a
behavior, such as:
• The model is too complex, and therefore has too many parameters. Instead of learning general patterns,
it memorizes the training data.
• Too little data, so the model has no way to learn the general patterns, but only the specific ones
that fit the small dataset.
• Not stopping early enough, so after learning relevant, more general, features - the model learns the
noise in the data.
That behavior is called in the literature "overfitting". How can we detect it? By looking at the validation
loss. If the "generalization gap", the difference between the training loss and the validation loss, is too big
- the model is overfitting. Usually, by monitoring those losses during training, we would see that while the
training loss is decreasing, the validation loss is either increasing (diverging) or staying the same.
Solutions:
• Increasing the size of the training set: This is NOT a valid solution, as the creation of a dataset is a
very difficult and expensive task, and we should try and work with the limited resources available. A
method that could create more data out of the existing one is called "data augmentation".
• Regularization: The go-to methods. Make training harder, so the network will learn the general
patterns. Examples: Weight-decay, Dropout, data augmentation, etc.
• Early stopping: In case one doesn’t save checkpoints during training, the model could be stopped
when the generalization gap starts to grow. However, this is NOT a regularization term, as it doesn’t
make the training harder, but rather stops it.
• Hyperparameter tuning: The model might be too complex, and therefore we should try and reduce
its capacity by tuning the hyperparameters.
On the other hand, a model might not be able to reduce the training loss to a sufficient level, i.e. it fails
to solve the task even on the training data. This counter-phenomenon is called "underfitting". It could occur for
some of the following reasons:
• The model’s capacity is too low, meaning it is either too small or uses a bad architecture, and therefore
cannot learn the general patterns in the data.
• The model is not trained for enough epochs and therefore has not converged to a minimum.
• The model is using a very basic optimization algorithm, and gets stuck in a "saddle point", where the
gradient is 0.
How to detect it? First, we could look at the training loss. If it is still decreasing (along with the validation
loss) when the model stops training, then it could have been trained for more epochs. Second,
we could assess the accuracy of the model on the training data: check the classification accuracy, visualize
the reconstructed image, etc. Undesirable results would mean that the model fails to fit even the
training data.
Solutions:
• Increasing the model’s capacity: by changing the architecture, adding more layers, etc.
• Training for more epochs: The model might not have converged to a minimum yet.
• Using a better optimization algorithm: The model might be stuck in a saddle point, and therefore we
should use a better optimization algorithm.
• Learning rate decay: The model might be oscillating ("overshooting") around the minimum, and
therefore we should decrease the learning rate.
We’ve discussed in a previous section (subsection II.2.e) the problem of vanishing gradients and methods to
solve it. On the other hand, exploding gradients is a problem that occurs when the gradients are too big,
and therefore the optimizer step is too big, and the model diverges. This could happen for a few reasons:
• Recurrent cells: In RNNs, the gradients are being multiplied by the same matrix at each time step. If
the matrix has an eigenvalue greater than 1, the gradients will explode.
• No normalization: If activations become very large after linear layers, then the gradients of the weights
will be very large as well.
• Bad initialization: If the weights are initialized to very large values, the activations will also be very
large, and that could affect the next layers’ gradients.
Solutions:
• Gradient clipping: The idea is to clip the gradients to a certain value, so they won't explode. This
is a very common method in RNNs, where the gradients are usually very large. Note: this method
doesn't clip them from below, so it doesn't help with vanishing gradients.
• Normalization: The idea is to normalize the activations after each layer, so they won’t become too
large. Examples: Batch normalization, Layer normalization, etc.
• Weight initialization: Use known initialization methods, such as Xavier or Kaiming initializations,
that will prevent the activations from becoming too large.
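A minimal PyTorch sketch of gradient clipping inside a training step (hypothetical `model`, `loss_fn`, `opt` and batch `(x, y)`):

import torch

def train_step(model, loss_fn, opt, x, y, max_norm=1.0):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # rescale the gradients so their global norm is at most max_norm
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    opt.step()
    return loss.item()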
If you don’t implement this crucial technique, stop doing deep learning and go do something else with
your life. Learning rate scheduling is critical, as the learning rate is the hyperparameter that affects the
optimization process the most. The basic notion is to decrease the learning rate as training progresses,
so the optimizer steps become smaller and smaller, and the model converges to a minimum. In addition,
this method could also be used as regularization, if used correctly as described in the amazing paper on
"super-convergence" (Link). Robbins and Monro summarized (in 1951) the conditions on the learning
rates α_t for theoretical convergence for a strictly convex function F(x, θ): the steps must be large enough
overall to reach the minimum, yet eventually vanish - Σ_t α_t = ∞ and Σ_t α_t² < ∞.
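A minimal PyTorch sketch of a built-in scheduler; the `Linear` model is a stand-in, and `StepLR` multiplies the learning rate by `gamma` every `step_size` epochs:

import torch

model = torch.nn.Linear(10, 1)                      # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... the training iterations of this epoch would run here ...
    opt.step()          # placeholder step (a real epoch performs many)
    sched.step()        # the LR is divided by 10 every 30 epochs
print(sched.get_last_lr())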
IV.2.d Regularization
Regularization techniques are common practices in deep learning to prevent overfitting, but also to introduce
constraints or secondary objectives to the optimization problem. Regularization methods usually make the
training process harder, so the model will seek to learn more general patterns in the data, in order to
overcome the added difficulty and keep bringing the training loss down. Usually, we will observe that the
training loss is higher, but the validation loss is lower - hence the generalization gap is minimized.
A basic technique is adding a penalty term to the loss function that tries to minimize the weights themselves,
namely pushing them towards zero (if it fully succeeded, the model would not be able to compute anything -
hence the need to balance it via λ below). The most common regularization terms are the L1 and L2
regularization, which are defined as follows:

L1: λ · Σ_{i=1}^{#Θ} |θ_i|

L2: (λ/2) · Σ_{i=1}^{#Θ} θ_i²
Where λ is the regularization coefficient, and θ_i is the i-th weight in the model. The chosen term - for
example, the L2 regularization - is added to the loss function, and the optimizer will try to minimize it
as well:

L_reg = L + (λ/2) · Σ_{i=1}^{#Θ} θ_i²

∂L_reg/∂θ_{l,i,j} = ∂L/∂θ_{l,i,j} + λ · θ_{l,i,j}
Where L is the original loss function. The L1 regularization is also called "Lasso" regression, and the L2
regularization is also called "Ridge" regression. Note the λ hyperparameter that controls the strength of the
regularization term. If λ is too big, the model will not be able to learn anything, and if it is too small, the
term won’t have any meaningful effect.
The main differences between L1 and L2 regularization are:
• L1 regularization is more likely to produce sparse weights, i.e. pushing some of the weights to zero.
This could be useful for feature selection, as the model will learn which features are important and
which are not.
• L2 regularization is more likely to produce small weights, and spreads the penalty across all the weights.
• Therefore, and because L2 is also easier to compute, it is more common in practice.
Weight decay is a similar term to the formula above, folded directly into the update rule:

θ_{t+1} = θ_t − α · (∂L/∂θ_t + λ · θ_t) = θ_t − α · ∂L/∂θ_t − α · λ · θ_t
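A minimal PyTorch sketch: for plain SGD, the `weight_decay` argument realizes exactly the extra −αλθ term above (the tiny model is a stand-in):

import torch

model = torch.nn.Linear(10, 1)   # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# The equivalent explicit L2 penalty would instead be added to the loss:
# loss = task_loss + 0.5 * 1e-4 * sum((p ** 2).sum() for p in model.parameters())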
Dropout
This is the most common and most useful regularization technique to date. The idea is to randomly "drop"
some of the neurons in the network during training, so the model will not be able to rely on any single
neuron, and will have to learn more general patterns. A key feature here is that it reduces the co-adaptation
of neurons, and they will not be able to rely on each other.
Notes:
• Dropout introduces a hyperparameter p. In literature, it could be referred to as pdrop or pkeep , etc.
In the original paper, it is the probability to keep a neuron, while PyTorch’s implementation is the
probability to drop a neuron.
• The dropout layer with pdrop will drop, on average, a fraction pdrop of the neurons (or channels, for
convolutions). For a single neuron, it is the probability of being dropped.
• The dropout layer drops neurons only during training, and not during inference. At the latter stage,
the neurons are scaled by pkeep , such that x⋆ = x · pkeep . This is done to keep the same magnitude (or
expected value if you’d like) of the activations during inference, as this is what the next layer expects.
• At each iteration, the neurons to be dropped (or kept) are chosen randomly in respect to the parameter
p, and therefore the model is trained on a different “subnetwork” at each iteration. This is why dropout
is usually looked at as an ensemble of different models.
• Dropout is usually applied last in the computation block (Linear → Normalization → Activation →
Dropout).
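A minimal NumPy sketch of (non-inverted) dropout as described above - random drops during training, scaling by p_keep at inference. Note that PyTorch instead implements "inverted" dropout, scaling by 1/p_keep already during training:

import numpy as np

def dropout(x, p_drop=0.5, train=True):
    if train:
        mask = np.random.rand(*x.shape) >= p_drop   # keep with prob 1 - p_drop
        return x * mask
    return x * (1.0 - p_drop)    # inference: scale by p_keep, x* = x * p_keep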
The Dropout layer operates differently for Affine and Convolutional layers. For the Affine layers, single neurons
are dropped randomly, while for the latter, entire channels are dropped. This is because
convolutional layers capture spatial information, in contrast to the Affine layers, where features are learned
by each neuron. Therefore, dropping whole channels retains the capability of extracting meaningful features, while
still introducing regularization.
Figure IV.1: Spatial Dropout for convolutions with dropping probability pdrop = 1/3. The brighter channels
in this example image are dropped (all entries are set to zero).
Data augmentation
Data augmentation creates more training data out of the existing samples by applying random transformations.
The original training set is not changed; rather, at each epoch, the model is trained on different variations of the data, with some
probability p. The idea is to make the model more robust to different variations in the data, and to prevent
overfitting. The most common transformations are: rotation, scaling, translation, flipping, cropping, etc.
Notes:
• Data augmentation is usually applied to image data, but could be applied to any kind of data.
• The transformations are applied randomly, with some probability p.
• The transformation is done solely on the training set, and not on the validation or test sets.
• The transformations are usually applied in the data loader, and not in the model itself.
• Not all transformations are useful for all tasks. For example, rotating images of digits is not useful,
as it will change the label of the image.
• Make sure that in supervised learning, the label of the data is also transformed accordingly, if needed.
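A minimal torchvision sketch of such a pipeline (the transform choices are illustrative):

import torchvision.transforms as T

train_tf = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomRotation(degrees=10),
    T.RandomResizedCrop(size=32, scale=(0.8, 1.0)),
    T.ToTensor(),
])
# Applied in the data loader, training set only, e.g.:
# torchvision.datasets.CIFAR10(root=".", train=True, transform=train_tf)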
Batch Normalization (Link) is a technique that is used to normalize the neurons of the layer. The idea is to
make the optimization process easier, as the activations will not be too large or too small, and the optimizer
will be able to converge faster.
Note that we normalize groups of neurons. Each group consists of all the neurons in the batch that
are created by the same weights, whether it's a set of weights in an FC layer or a kernel in a convolution.
For all corresponding neurons, perform the following normalization:
x̂ = (x − µ) / √(σ² + ϵ)
Where µ is the mean of the activations, σ² is the variance of the activations, and ϵ is a small number to avoid
numerical instability. Then, scale and shift the activations:
y = γ x̂ + β
Where γ and β are learnable parameters, that are learned during training. The idea is that the model
will learn the optimal mean and variance for the activations, and therefore the optimization process will be
easier.
Written out for activations x ∈ R^{N×M} (batch size N, M neurons), γ and β are broadcast row-wise:

y = [γ₁ γ₂ ... γ_M] ⊙ x̂ + [β₁ β₂ ... β_M],   i.e.   y_{n,m} = γ_m · x̂_{n,m} + β_m
Note: the mean µ and std σ are not necessarily 0 and 1, respectively.
For affine layers, the normalization is done for each neuron, across the batch:
Figure IV.2: Batch normalization for affine layers. Mean and std are calculated for each color, across the
batch.
• Note that this is the first and only layer that breaks the abstraction of the independence of the
different instances in the batch, as the normalization of each neuron is done with respect to all the
other corresponding neurons in the batch, i.e. for a neuron vl,i,j , the normalization is done with respect
to all the other neurons vb,l,i,j , where b ∈ {1, 2, ..., B}.
However, for convolutional layers, the normalization is done across the batch, but not for each single neuron,
but across the whole channels. Again, for all the neurons across the batch that are created by the same
kernel:
Figure IV.3: Batch normalization for convolutional layers. Mean and std are calculated for each channel,
across the batch. We take all the corresponding channels across the batch, and normalize them across
all their entries. This could be seen as flattening them all and concatenating them, and then
performing the normalization, just like in the affine layers.
• The normalization layers differ in their actions during training and inference. During training, the
mean and variance are calculated for each batch, and the normalization is done with respect to them.
However, during inference we have a single instance flowing through the network, so the statistics
cannot be obtained in the same way. Therefore, during training we keep a running average of the
mean and variance, and use them during inference:
µ_running = β · µ_running + (1 − β) · µ_batch
σ²_running = β · σ²_running + (1 − β) · σ²_batch
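A minimal NumPy sketch of the forward pass for an affine layer (x of shape N × M), with running statistics updated during training as above:

import numpy as np

def batchnorm_forward(x, gamma, beta, run_mu, run_var,
                      train=True, momentum=0.9, eps=1e-5):
    if train:
        mu = x.mean(axis=0)                      # per-neuron mean over batch
        var = x.var(axis=0)                      # per-neuron variance over batch
        run_mu = momentum * run_mu + (1 - momentum) * mu
        run_var = momentum * run_var + (1 - momentum) * var
    else:
        mu, var = run_mu, run_var                # inference: stored statistics
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta, run_mu, run_var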
Hyperparameter tuning is an important stage in the deep learning pipeline. The idea is to find, in a
reasonable amount of time, the optimal hyperparameters for the model - those that make it converge to a
minimum and generalize well on unseen data. Common hyperparameters are: learning rate, batch size,
number of epochs, optimizer, size of hidden layers, activation functions, etc. However,
while being crucial, it is quite easy to get a very good starting point if we know our theory:
1. Learning rate - use a learning rate scheduler, to mitigate the problem of finding a bad initial one (A
list of PyTorch built-in schedulers).
2. Activations - go with ReLU or its variants.
3. Optimizer - There are many amazing ones, but Adam and its variants are a great starting point (see:
Link).
4. Size of hidden layers - always start small, and grow as you go.
That said, the search itself could be quite cumbersome, as the search space is usually very large (each
hyperparameter adds a dimension), and the training process is very time-consuming. There are a few
methods to tackle this problem:
• Grid search: The most basic method, where we define a grid of hyperparameters, and train the model
on all possible combinations. This is very time-consuming, and not practical for large search spaces.
• Random search: A more advanced method, where we randomly sample the hyperparameters from a
distribution and train the model on those. This is usually more efficient than grid search, as it doesn’t
require training on all possible combinations.
• Bayesian optimization: A more advanced method, where we explore (search a wide range) and exploit
(search a narrow range) the search space. However, this method is usually more complex, requires
more computational resources, and doesn't outperform random search to a degree that justifies it.
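A minimal sketch of random search; `train_and_validate` is a hypothetical routine returning a validation score:

import random

def random_search(train_and_validate, n_trials=20):
    """train_and_validate(lr, hidden) is an assumed training routine
    returning a validation score; higher is better."""
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        lr = 10 ** random.uniform(-5, -1)        # sample the LR on a log scale
        hidden = random.choice([64, 128, 256, 512])
        score = train_and_validate(lr, hidden)
        if score > best_score:
            best_cfg, best_score = (lr, hidden), score
    return best_cfg, best_score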
The learnable weights of a model are usually initialized randomly, and not to zero. If all the weights are
initialized with the same value, they will not be able to learn complex patterns, as many groups of weights
will be updated to the same value. Therefore, it is crucial that the weights are sampled randomly from
some distribution. The range from which they are drawn must not be too big, as that could easily lead to
exploding gradients through the backpropagation process. On the other hand, if the range is too small, the
model will not be able to learn anything, as the gradients will be too small and the optimization steps
negligible. Thus, we should follow some meaningful initialization schemes:
Xavier Initialization
Xavier Initialization, also known as Glorot Initialization, aims to keep the scale of the gradients roughly
the same in all layers, which is particularly useful for activation functions like sigmoid and Tanh.
This ensures the variance of the output is the same as that of the input → a Gaussian with:
• Mean: E[X] = 0, E[W] = 0
• Variance: remember that E[X²] = Var[X] + E[X]² and, if X and Y are independent, E[XY] = E[X] · E[Y].
Therefore,

Var(s) = Var(Σ_{i=1}^{n} w_i · x_i) = Σ_{i=1}^{n} Var(w_i · x_i)
       = Σ_{i=1}^{n} ( [E(w_i)]² · Var(x_i) + [E(x_i)]² · Var(w_i) + Var(x_i) · Var(w_i) )
       = Σ_{i=1}^{n} Var(x_i) · Var(w_i) = n · Var(w) · Var(x)

• We want to ensure that the variance of the output s = Σ_i w_i x_i is the same as that of the input: Var(s) = Var(x).

Var(x) = n · Var(w) · Var(x) → Var(w) = 1/n

• Note that n is the number of input neurons for the layer of weights you want to initialize. This n is
not the number N of input data X ∈ R^{N×D}, but n = D.
For a layer with n_in input units and n_out output units, the weights W are initialized as follows:

W ∼ N(0, 2/(n_in + n_out))

Or (what we teach in class, and the version you should know for the exam):

W ∼ N(0, 1/n_in)

Where n_in and n_out are the number of input and output units, respectively.
Kaiming Initialization
Kaiming (He) Initialization is tailored for the ReLU activation:

ReLU(x) = max(0, x)

Because the ReLU function outputs zero for any negative input, the average output of neurons is not zero
but positive, which affects the variance. To maintain the forward signal's variance, the initialization needs
to take into account the rectification effect of ReLU. He et al. proposed scaling the variance of the weights
by 2/n_in, where n_in is the number of input units to the layer.
For a layer with n_in input units, the weights W are initialized as follows:

W ∼ N(0, 2/n_in)
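A minimal NumPy sketch of both schemes:

import numpy as np

def xavier_init(n_in, n_out):
    # variance 1/n_in, as taught in class (suits sigmoid / Tanh)
    return np.random.randn(n_out, n_in) * np.sqrt(1.0 / n_in)

def kaiming_init(n_in, n_out):
    # variance 2/n_in, compensating for ReLU zeroing half the inputs
    return np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)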
Summary
• Xavier Initialization:
– Suitable for sigmoid and Tanh activations.
– Considers both input and output layer sizes.
• Kaiming Initialization:
– Tailored for ReLU activations.
– Focuses on the size of the input layer only.
Figure IV.4: (a) - Mismatch of distribution, (b) - validation set is too easy / data leakage, (c) - Underfitting
• Mismatch of distribution: The training and validation sets are drawn from different distributions.
The model learns the training set well, but fails to generalize to the validation set.
• Validation set is too easy / data leakage: The validation set is too easy, and the model is able
to solve it with a very low loss. This could happen if the validation set is too small, or if the model is
too simple and underfits the training set. Also, a bug in the implementation could cause data leakage,
where some data from the validation set is used in the training process.
• Underfitting: The model is not able to solve the training set, and therefore the validation set as well.
This could happen if the model is too simple, or if the optimization process is not good enough.
IV.3 Transfer Learning
Transfer learning is a machine learning technique that addresses the challenges of limited data and resource-
intensive training. Instead of training a model from scratch, transfer learning utilizes a pre-trained model
that has already learned features from a large dataset, and adapts it for a new but related task.
Key Concepts:
• Pre-trained Models: These models are trained on a large dataset (Distribution P1). They have
learned useful features that can be transferred to other tasks.
• Adaptation to New Task: For a new task with a smaller dataset (Distribution P2), the pre-trained
model is used as a starting point.
– Only the final layer (the classifier) is replaced and trained on the new task’s data.
– Optionally, more layers can be fine-tuned with a lower learning rate if the new dataset is large
enough.
Benefits:
• Efficiency: Reduces the need for large amounts of labeled data and computational resources.
• Performance: Improves model accuracy by leveraging previously learned features.
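A minimal PyTorch sketch with a torchvision backbone; the 10-class head is an arbitrary example for the new task:

import torch
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1")  # pre-trained on P1
for p in model.parameters():
    p.requires_grad = False                       # freeze the feature extractor
model.fc = torch.nn.Linear(model.fc.in_features, 10)   # new head for P2
opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3) # train only the head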
IV.4 Questions
1. Optimizers:
a) What is the main difference between gradient descent (GD) and stochastic gradient descent
(SGD)?
b) Name two advantages of SGD over GD.
c) Why can we call RMS prop an adaptive optimizer?
d) What is the bias correction in Adam? Why isn’t it implemented in RMS prop?
e) Adam with bias correction and Adam without bias correction have different global minimas. True
or False?
f) What two optimizers does Adam combine?
g) Why is SGD+momentum usually better than SGD?
h) i. Why do we always step in the direction of the negative gradient?
ii. Write down a modified version of the SGD optimizer step, in case we want to maximize the
loss instead of minimizing it.
2. Overfitting and underfitting
a) Define overfitting in a short sentence.
b) Define underfitting in a short sentence.
c) State two different behaviors that indicate underfitting.
d) State a possible reason for a situation where the validation loss is lower than the training loss.
e) In a single word, what is the go-to solution to overfitting?
3. Regularization
a) The term "regularization term" is ambiguous. What are the two usages of such "terms"?
b) What is the effect of L1 and L2 regularization terms on the weights?
c) What are the two differences between L1 and L2 regularization terms and "weight decay"?
d) How can we avoid the computational overhead of the dropout layer during inference?
e) Explain how regular dropout works during training and during inference. Hint: crucial to distin-
guish between the definitions of p.
f) Why is data augmentation considered a regularization technique?
g) Why don’t I allow you to consider Early-stopping as a regularization technique?
4. Batch Normalization:
a) Given a loss value L, and a fully connected layer with output shape 8 × 16, followed by a batch
normalization layer: y = BN(x) = γ ∗ x_norm + β
i. What are the dimensions of the parameters γ and β?
ii. Show a derivation of the gradients of those parameters (∂L/∂γ, ∂L/∂β) and show how to use NumPy
to calculate them.
b) Why is batch normalization sometimes referred to as a regularization technique?
c) Explain why we need to save the running averages of the mean and variance in the batch norm
to the memory during training.
d) What optimization problems could batch norm help us solve?
5. Hyperparameter tuning
a) What is the main bottleneck for grid search?
6. Weight initialization
a) What weight initialization scheme fits the ReLU activation function? How does it affect the
output of the activation function?
b) What do we expect to observe if we initialize the weights of the model to the same value?
7. Transfer Learning:
a) Explain one of the scenarios in the exercises where we used transfer learning, and how.
IV.5 Answers
1. Optimizers:
a) GD performs the optimization step at the end of the epoch, while SGD performs the optimization
step at each iteration.
b) i. SGD performs more optimization steps in a single time unit and therefore converges faster.
ii. SGD introduces noise to the training process, and can help to avoid saddle-points.
c) RMS prop adapts the learning rate for each parameter individually, based on the magnitude
of the gradients.
d) The bias correction is a mechanism to overcome the initialization bias of the variables m, v
towards zero, which makes the early steps really slow. It is not implemented in RMS prop, as it
was simply first introduced with Adam.
e) False. The global minimum isn't changed, as the goal and loss don't change, but Adam with the
bias correction would converge faster. However, they could end up in different local minima, as
the first steps would be much bigger. For a good visualization of different minima, look at the
plots here.
V Popular Architectures
V.1 LeNet
• Apply valid convolution: size shrinks (reduced by two pixels on each side)
• 6 × 1 × 5 × 5 convolution filters used in first layer (6 filters, as the depth of the convolution obtained
is 6)
• Average pooling is used (now: Max pooling much more common)
• Reduce first to 120, then to 84
• Tanh/sigmoid activation is used (not common now)
• Has 60k parameters
• As we go deeper: width, height go down and number of filters go up
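A minimal PyTorch sketch of a LeNet-style network following the bullets above (exact hyperparameters vary between LeNet versions):

import torch.nn as nn

lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),    # valid conv: 32x32 -> 28x28
    nn.AvgPool2d(2),                              # -> 14x14
    nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),   # -> 10x10
    nn.AvgPool2d(2),                              # -> 5x5
    nn.Flatten(),
    nn.Linear(16 * 5 * 5, 120), nn.Tanh(),        # reduce first to 120
    nn.Linear(120, 84), nn.Tanh(),                # then to 84
    nn.Linear(84, 10),
)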
V.2 AlexNet
V.2 AlexNet
V.3 VGGNet
• CONV = (k = 3, s = 1, p = 1)
• Maxpool = 2x2 filters with stride 2
• 138m parameters
• It’s large, but simplicity makes it appealing
• As we go deeper, width and height go down and number of filters up.
• 19 layers (VGG-19).
V.4 Skip Connections: Residual Block
As neural networks had become deeper and deeper, harsh optimization problems, such as the vanishing
gradients, were a big bottleneck in their capability to perform at full capacity. Skip connections, also known
as residual connections, were introduced as a solution to this problem. They allow the gradient to flow
directly through the network ("highway for gradients") by providing alternate pathways for the gradient
during backpropagation. This innovation not only mitigates the vanishing gradient issue, by allowing a full
gradient to pass through, but also facilitates the training of much deeper networks. By enabling the network
to learn identity mappings more easily, skip connections ensure that the performance of deep networks does
not degrade with depth. Overall, skip connections have become a fundamental component in modern deep
learning architectures, contributing to more stable and efficient training processes.
• Highway for gradients: Assume that Loss(x, θ, y) = l and that x_{L−1} ∈ R^{N×D}; we show mathematically
that the full magnitude of the gradients can pass from the loss function:
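The original derivation is not reproduced here; a sketch of the standard argument, assuming the residual form x_L = x_{L−1} + F(x_{L−1}):

∂L/∂x_{L−1} = ∂L/∂x_L · ∂x_L/∂x_{L−1} = ∂L/∂x_L · (I + ∂F/∂x_{L−1}) = ∂L/∂x_L + ∂L/∂x_L · ∂F/∂x_{L−1}

The first summand is the untouched gradient arriving from the loss - the "highway".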
V.5 ResNet (Residual Networks)
A deep network that utilizes the powerful skip connections and has become the go-to idea in any modern
deep learning architecture.
• 60m parameters
• 152 layers.
• Note that if there is dimensionality reduction, the gradients cannot flow completely uninterrupted all
the way from the loss function to the very first layers, but only at blocks of the network where the
dimensions are equal, where addition between two tensors is possible.
V.6 InceptionNet
Why restrict our model to a specific kernel size? We're Google, we have the computational power - we'll
just use them all!
Inception layer:
• Each block in the network is made up with 4 different convolutions or pooling layers, that are then
concatenated as an output.
• Very expensive to perform all of that. Therefore, 1 × 1 convolutions are used to reduce the number
of channels and shrink the number of computations of the more expensive convolutions.
• Note that in order to maintain the same spatial sizes across all the different branches, the max-pool
layer is initialized with the magic-trio: k = 3, s = 1, p = 1.
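A minimal PyTorch sketch of such a block (the channel counts per branch are free hyperparameters):

import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, c_in, c1, c3r, c3, c5r, c5, cp):
        super().__init__()
        self.b1 = nn.Conv2d(c_in, c1, 1)                          # 1x1 branch
        self.b3 = nn.Sequential(nn.Conv2d(c_in, c3r, 1), nn.ReLU(),
                                nn.Conv2d(c3r, c3, 3, padding=1)) # reduce, then 3x3
        self.b5 = nn.Sequential(nn.Conv2d(c_in, c5r, 1), nn.ReLU(),
                                nn.Conv2d(c5r, c5, 5, padding=2)) # reduce, then 5x5
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),  # magic trio
                                nn.Conv2d(c_in, cp, 1))

    def forward(self, x):
        # all branches keep the spatial size, so we concatenate on channels
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)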
(Figure: the Inception module - (a) the naïve version, (b) the version with dimensionality reduction; illustration taken from the GoogLeNet paper.)

V.7 Autoencoder
One of the most interesting network architectures. The autoencoder is such a powerful tool that was later
perfected, and it is being used intensively in the industry and research, for the tasks of segmentation,
3D reconstruction, generative models, style transfers, etc. Basically, endless applications.
1. The basic architecture is based on fully connected layers.
2. It consists of three main parts:
• Encoder: Reducing the dimensionality towards the latent space. That forces the network to
focus on collecting the most meaningful features from the input data, so the loss of information
would be as small as possible. Therefore, we could say that the Encoder is used as a
features extractor.
• Latent space: the compressed representation of the input data. These latents reside in multiple
directionalities and hold really powerful representations of features. By manipulating them
correctly, one could achieve very interesting things.
– If the size is too small, not much information could eventually pass through to the decoder,
and the reconstruction would be very hard. The result would be very blurry.
– If too big, the network could basically learn to copy the image, without learning any mean-
ingful features.
– The latent space is a very powerful representation of the input. While RGB images, for
example, offer quite redundant information on their own, the compressed latent space could
represent the most meaningful features, that are specific for the task at hand, on a multidi-
mensional manifold. From an RGB image, the latent space could represent so many things
and can even close the gap between two completely different domains. For example, one
could align features from an RGB image with its corresponding semantic segmentation map,
which differs from it completely, such that there is a simple linear transformation between
them (Halperin et al. - may god help me, and soon it will be). One could learn "simple"
features for reconstruction, or even the distribution of the dataset (see VAE later in this
chapter).
– In the literature we say that the latent space represents "high-level features", while the early
layers of the model extract "low-level features".
– Calculations are usually much cheaper to perform on the latent space, which is a low-
dim representation on the input. Therefore, we should aim to perform as little as possible
calculations on higher-resolutions, and instead do most of the heavy-work on those lower-dim
spaces, that are also much more flexible in what they can represent.
• Decoder: Receives the dimensionality-reduced latent space, and aims to reconstruct the input
of the Encoder. The output has the same shape as the input.
3. Without non-linearities, it is very similar to PCA.
But why do we even want to use that? Autoencoders, as used in exercise 08, are an excellent solution when
our dataset is very big, but only a small part of it is actually labeled - like a medical CT
dataset, for example. So, we will have 2 steps:
1. Autoencoder → reconstruct the input. Let the Encoder learn the relevant features about the unla-
beled data. This part can be referred to as unsupervised learning.
2. After the training has converged, remove the decoder and discard it. Then, plug in instead, just after
the latent-space, a very simple fully-connected classifier, and train on the labeled data, given the fact
that the remaining Encoder is already trained as a good features extractor. This part can be referred
to as supervised learning.
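A minimal PyTorch sketch of the two steps (layer sizes are arbitrary examples; in step 2, the encoder can optionally be frozen):

import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                        nn.Linear(256, 32))             # 32-dim latent space
decoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(),
                        nn.Linear(256, 784))

autoencoder = nn.Sequential(encoder, decoder)           # step 1: train with MSE
classifier = nn.Sequential(encoder, nn.Linear(32, 10))  # step 2: labeled data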
V.8 Fully Convolutional Networks
Simply put, these are neural networks that consist only of convolutional layers. Notes:
• With some restrictions, these networks can accept different spatial sizes of inputs, with no change
to the architecture. Example: In exercise 9, we used input images with spatial sizes of 240 × 240,
while the AlexNet (section V.2) backbone was originally taking inputs of shape 224 × 224. If FC layers
are used, then the input size must be fixed, as they process the entire input at once. However, it is
important to remember that convolutional layers are NOT invariant to changes in scale, and without
any further training for adjustment to the new settings, the performance is likely to be suboptimal.
• In computer vision applications, these types of architectures are much superior to the original FC
networks: they usually introduce far fewer parameters and calculations, and are much more capable
of extracting meaningful local features.
V.9 U-Net
Figure V.6: A generic example of the U-net architecture → could be implemented with very different
hyperparameters.
• The U-net incorporates skip connections between corresponding layers in the encoder and the decoder,
for the highway of gradients, and for the usage of fine-grained features from the encoder, which
performs as a features extractor. In contrast to the original ResNet (section V.5), the gradients
flow to the shallower layers of the network with even fewer steps, and could be even more meaningful
than in a ResNet with dimensionality reduction, where gradients cannot flow completely upstream.
• Remember that the main task of an autoencoder is to compress the features, for the selection of the
most meaningful ones. However, as the spatial dimensions shrink, we face an issue of loss of critical
information. To mitigate this, as we decrease the spatial size, we increase the number of channels, to
offer more "breathing room" for features to pass through, and learn a bigger variety of them. But,
when decoding the latent space, the spatial size increases, and therefore we must shrink the number
of channels, to mitigate the exponentially rising number of calculations. A convolution on the original
spatial size, no matter how thin it is parameters-wise, would introduce so many calculations and
become a serious bottleneck for the runtime, while most processing could be done in a much cheaper
fashion at the lower dimensions.
• The latent space here is a tensor of shape B × C × H × W and not a single vector, like in FC layers.
• These networks are widely used in tasks such as semantic segmentation, depth prediction, image
reconstructions, GANs, etc.
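A minimal PyTorch sketch of one decoder ("up") block with such a skip connection:

import torch
import torch.nn as nn

class UpBlock(nn.Module):
    def __init__(self, c_in, c_skip, c_out):
        super().__init__()
        self.up = nn.ConvTranspose2d(c_in, c_out, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(c_out + c_skip, c_out, 3, padding=1), nn.ReLU(),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU())

    def forward(self, x, skip):
        x = self.up(x)                      # double the spatial size
        x = torch.cat([x, skip], dim=1)     # fine-grained encoder features
        return self.conv(x)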
V.10 Generative Networks
Very popular tasks, whose destructive results we encounter today all over, with fake images
flooding every good part of our lives. In this course, we introduce only two of the earlier ones: Variational
Autoencoders (VAEs) and Generative Adversarial Networks (GANs). Stable Diffusion is for the advanced
courses.
Figure V.7: During training: Encode the mean and std, use the reparametrization trick to sample from the
encoded distribution, and then decode the sample.
Based on the vanilla fully-connected Autoencoder (section V.7): instead of extracting features that allow
us to reconstruct the input, we would now like to encode the distribution of the entire training set
into the latent space. This is done by introducing a probabilistic approach to the latent space, where the
latent space is not a single vector, but two vectors (a single vector that is split in practice into two) that
represent the compressed mean and the variance of the data's distribution. During training, we use both
encoder and decoder to reconstruct the input. However, instead of using the encoded latent space directly, we sample
from the distribution it represents. Since we cannot backpropagate through a raw sampling operation, we use the
reparametrization trick to sample from the actual encoded distribution:

z = µ + σ · ϵ

where ϵ ∼ N(0, 1) → sample noise from the standard normal distribution and then scale and shift the
sample using the mean and std of the encoded distribution.
The loss function is now composed of two parts:
1. The reconstruction loss: Just like in the vanilla Autoencoder, we would like to reconstruct the input
as best as possible, to guide the encoder to learn the most meaningful features.
2. The KL-divergence loss: The KL-divergence is a measure of how much two probability distributions
differ from each other. With the target distribution assumed to be the standard normal distribution
(mean 0, variance 1), the term has the closed form:

L_KL = KL( N(µ, σ²) ∥ N(0, 1) ) = ½ (µ² + σ² − log σ² − 1)

where µ and σ² are the parameters of the learned (encoded) distribution.
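A minimal PyTorch sketch of the reparametrization trick and the KL term above; the encoder is assumed to output `mu` and `log_var`:

import torch

def reparametrize(mu, log_var):
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)          # eps ~ N(0, 1)
    return mu + std * eps                # z = mu + sigma * eps

def kl_loss(mu, log_var):
    # 0.5 * (mu^2 + sigma^2 - log sigma^2 - 1), summed over the latent dims
    return 0.5 * torch.sum(mu ** 2 + log_var.exp() - log_var - 1)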
On the other hand, during testing (inference), we discard the encoder: we sample z ∼ N(0, 1) directly from the assumed prior, and feed it to the decoder to generate a new, unseen sample.
Figure V.8: During training: The generator generates fake images, and the discriminator tries to distinguish
between real and fake images.
• Discriminator loss: Binary cross-entropy loss between the output of the discriminator and the target label
(e.g. 1 for real images, 0 for fake images):

L_D = −(1/2N) · Σ_{i=1}^{N} [ y_i · log(D(x_i)) + (1 − y_i) · log(1 − D(G(z_i))) ]

where D(x_i) is the output of the discriminator for the i-th image, N is the batch size and y_i is the
target label.
• Generator loss: since we first update the discriminator weights to push the fake images towards
class 0, we take the complement of the discriminator loss:

L_G = −(1/N) · Σ_{i=1}^{N} log(D(G(z_i)))
These two work in a MinMax fashion, which can quite often cause a case of underfitting, since both sides
"pull the rope", or even mode collapse, where the generator learns to generate only a single image that is
the most likely to fool the discriminator. This is why there are many variations of GANs, such as the
Wasserstein GAN, the Least Squares GAN, the Conditional GAN, etc.
However, GANs proved to generate more real-looking images than VAEs, but at the cost of training instability
and a complex training scheme. They were lately overtaken by other techniques, such as Stable Diffusion.
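A minimal PyTorch sketch of one training step, assuming hypothetical generator G, discriminator D (ending with a sigmoid), their optimizers, and a batch of real images x:

import torch
import torch.nn.functional as F

def gan_step(G, D, opt_g, opt_d, x, z_dim=100):
    real = torch.ones(x.size(0), 1)
    fake = torch.zeros(x.size(0), 1)
    z = torch.randn(x.size(0), z_dim)

    # Discriminator step: push real images to 1, generated images to 0
    opt_d.zero_grad()
    loss_d = 0.5 * (F.binary_cross_entropy(D(x), real) +
                    F.binary_cross_entropy(D(G(z).detach()), fake))
    loss_d.backward()
    opt_d.step()

    # Generator step: fool the discriminator (maximize log D(G(z)))
    opt_g.zero_grad()
    loss_g = F.binary_cross_entropy(D(G(z)), real)
    loss_g.backward()
    opt_g.step()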
V.11 Questions
1. Architectures:
a) LeNet uses average pooling to reduce the spatial size. Give one advantage and one disadvantage
of using average pooling over max pooling.
b) In LeNet - what is the receptive field of a neuron in the first FC layer?
c) Alexnet uses a 11 × 11 convolutional filter in the first layer. Name two disadvantages of using
such a large filter.
d) Alexnet uses ReLU instead of sigmoid or Tanh, as used in LeNet. Explain why it allows Alexnet
to be deeper than LeNet, when coupled with the Kaiming initialization.
e) VGGnet: What is the purpose of the convolutional part of the model? Why do we need the FC
layers at the end?
f) InceptionNet:
i. What was the problem with the first version of InceptionNet? How was it solved?
ii. We learned in class the MaxPool is usually used to reduce the spatial dimensions. Therefore,
how was it possible to use it inside the Inception block, and concatenate its output to all
other outputs?
2. Skip connections:
a) Why can we say that skip connections introduce "highway of gradients"?
b) Given a residual block Xl+1 = Xl + F (Xl ), where Xl , Xl+1 ∈ R - show the highway of gradients
∂L
in the chain rule formula of ∂X l
, given some loss value L ∈ R
c) Can you give a python-like implementation of the residual block?
73
V.12 Answers
3. AutoEncoders:
a) Assume we use an autoencoder to reconstruct an image. What could be used as a loss function?
What do we compare between? What kind of learning is it (supervised or unsupervised)?
b) What is the effect of a latent space that is too small? What is the effect of a latent space that is
too big?
c) What linear approach does this kind of autoencoder resemble? What is the advantage of autoen-
coder over this method?
d) State a scenario in which we would like to use an autoencoder for feature extraction.
4. U-net:
a) Give 3 advantages of U-net over the vanilla Autoencoder
b) How do we mitigate the drop in spatial size, so not to much information is lost?
c) For the task image-reconstruction, does it make sense to use skip-connections between the encoder
and the decoder? Explain.
5. Generative Networks:
a) What is the difference between GANs and VAEs in the way they learn the distribution of the
training set?
b) In the Vanilla GAN, why is it prone to underfitting and mode collapse?
c) In GAN, how do we train the generator to fool the discriminator?
d) VAE: what are the two loss function we use, and what is their purpose?
e) What do we assume the training set distribution to be in both VAEs and GANs?
f) Sampling from a random distribution, that is not the normal distribution, is very hard. How
does VAEs solve this problem?
V.12 Answers
1. Architectures:
a) i. Advantage: all entries of the window get a gradient.
ii. Disadvantage: It is a linear operation, while max pooling offers non-linearity.
b) i. 11 × 11 is very expensive in both parameters and calculations.
ii. It captures more global features than specific local features.
c) The entire input image (32 × 32).
d) It prevented the vanishing gradient problem, that is likely in deeper networks.
e) i. The convolutional part is used as a features extractor.
ii. The FC layers have global receptive field, which is very useful for the task of classification,
for which the model is meant in the first place.
f) InceptionNet:
74
V.12 Answers
i. The first version of InceptionNet is very complex, due to convolutions with large kernel sizes.
The problem was solved by introducing 1 × 1 convolutions, to reduce the number of channels
before the more expensive convolutions.
ii. Simply used the magical trio k = 3, s = 1, p = 1, that is used to keep the spatial dimensions.
2. Skip connections:
a) Skip connections allow skipping whole blocks of layers, by adding the input to the output. That
allows the gradient of that input X to bypass the block, and not be diminished by it.
b)
c) z = conv1 ( x )
z = relu (z)
z = conv2 ( z )
x = relu (x + z)
3. AutoEncoders:
a) Loss functions: MSE, MAE. We compare the input with the reconstructed image. It is an
unsupervised learning approach.
b) If the latent space is too small, the reconstruction will be very hard and the result will be very
blurry - underfitting. If the latent space is too big, the network could learn to copy the image,
without learning any meaningful features, thus - overfitting.
c) PCA. The advantage of autoencoder over PCA is that it is non-linear, and can learn more complex
features.
d) A scenario in which we would like to use an autoencoder for feature extraction is when we have a
lot of unlabeled data, and we want to extract features from it. Then, we can use the pre-trained
autoencoder for a supervised task.
4. U-net:
a) i. Skip connections
ii. Allows making most of the calculations and processing at lower dimensions, for a fraction of
the cost.
iii. Accepts different input spatial sizes, with no any structural changes.
iv. Allows the extraction of meaningful local features instead of the global ones.
v. No, as the model can learn to simply memorize the image and send it to the output layer.
b) We mitigate it by increasing the number of channels, to offer more "breathing room" for features
to pass through, and learn a bigger variety of them.
5. Generative Networks:
a) VAEs learn the distribution explicitly (directly), while the GANs learn the distribution implic-
itly (indirectly).
b) Both Generator and discriminator work against each other, so both could converge to suboptimal
solutions. Also, the generator could learn to produce a single image, which is likely to fool the
discriminator, and thus have a very good accuracy.
75
V.12 Answers
c) Send the fake image through the discriminator, but this time with class 1, like for the real image.
d) VAE: MAE/MSE for reconstruction, and the KL divergence for the latent space’s distribution.
e) We assume the training set distribution to be the standard normal distribution.
f) We use the parametrization trick: z = µ+σϵ, to shift the normal distribution to the actual learned
distribution, where ϵ ∼ N (0, 1) is the sampled noise from the standard normal distribution, and
µ and σ are the mean and variance of the distribution of the training set.
76
VI Recurrent Neural Networks and Transformers
So far we’ve been dealing with neural networks that took as input independent instances (e.g. single RGB
images) and processed them for the sake of some task. In this section, we will be dealing with networks that
process data instances that are dependent on each other with some context, e.g. sentences in text, videos -
that are consecutive images of the same scene or even different parts of the same image, etc.
The very first type of neural network that was designed to handle sequential data. The idea behind RNNs
is to have a network that has some kind of memory of the previous inputs it has seen. This is achieved by
having the network pass the output of the previous time step as an input to the current time step. This is
illustrated in the figure below:
In this notion, we use the same weights to process each time step, but only in a very limited scope of the
history, or "Short Term Memory". The components of the RNN are as follows:
• xt : The input at time t. Representing a word in a sentence, a frame in a video, etc. Usually given as
a vector of a fixed size D, that in literature is called a "token". Note, that this scheme is used also in
other context-related models, such as LSTMs and Transformers.
• ht : The hidden state of time t. This is the memory of the network. It is a vector of size M . This
hidden state is sent to the next time step and compresses into itself all previous time steps.
• ot : The output at time t. This is the prediction of the network at time t. It is a vector of size M .
At each time step t, we use the same weights to process the current token xt with the previous hidden state
ht−1 to get the new hidden state ht and the output ot . x, h and o have a set of weights WxD×M , WhM ×M
and WoM ×O respectively. Note that these weights are shared across all time steps. Therefore, at time t we
compute:
ht = A(xt · Wx + ht−1 · Wh + bh )
ot = A(ht · Wo + bo ), ot ∈ RM
77
VI.1 Recurrent Neural Networks (RNNs)
Where A is an activation function (Sigmoid, Tanh or ReLU) and b⋆ is the relevant bias in the affine equation.
• Number of parameters: Without considering an output variable, the number of parameters in the
RNN cell relies on the size of h, as the tensors need to match in shape for the addition. Assume
h ∈ R1×M , x ∈ R1×D , then, the number of parameters is M · M + D · M + M , for Wh , Wx , bh .
There are different types of RNNs, as shown in the figure above, and the outputs ot could be used at each
time step, only at the end of the sequence, or at every time step. The backpropagation through these
networks is called "backpropagation through time" (BPTT) and is also affected by the choice of using the
intermediate outputs, of course. This process heavily relies on the length of the input sequence and therefore
could be quite expensive and long to compute.
Uses of the different types of RNN (source):
1. One-to-one: A simple neural network, as we know it.
2. One-to-many: Image captioning.
3. Many-to-one: Sentiment analysis.
4. Many-to-many (first sequence, then output) - Language translation, as the entire sentence is processed
first.
5. Many-to-many (both input and output): Labeling images in a video, or some task that doesn’t rely
completely on the entirety of the sequence.
Also, the way h0 is initialized can affect the performance of the RNN. Common initializations are either to
a set of zeros, or sampled from a random distribution, but could also be learned.
Main challenges:
• Backpropagation through time (Example). Assume an RNN with 2 time steps, and a single output
at the end, for simplicity. Also, for the sake of a readable example, assume all variables are scalars
(x, h, Wx , Wh , Wo , bh , bo ∈ R) and that the activation function is the identity function. Remember that
if y = f (x)g(x), then
∂y ∂y ∂f (x)g(x) ∂f (x) ∂g(x)
= f ′ (x)g(x) + f (x)g ′ (x) → = = g(x) + f (x)
∂x ∂x ∂x ∂x ∂x
. The forward pass is as follows:
– y1 = x1 · Wx + h0 · Wh + bh
– h1 = A1 (y1 )
– y2 = x2 · Wx + h1 · Wh + bh
– h2 = A2 (y2 )
– y3 = h2 · Wh + bo
78
VI.2 Long Short Term Memory (LSTM)
– o = A3 (y3 )
∂o ∂A3 (y3 )
Now, let’s calculate the gradient of ∂Wh = ∂Wh :
Where,
∂A3 ∂A2 ∂A1
, , =1
∂y3 ∂y2 ∂y1
• Observe the example above. We could see that the weight matrix is multiplied by itself as we backprop-
agate through time. If the eigenvalues of the weight matrix are smaller than 1, then the gradient will
become smaller and smaller as we go up the stream - and introduce the vanishing gradient problem.
On the other hand, if the eigenvalues of the weight matrix are larger than 1, then the gradients will
explode! A solution could be to use a regularization term to force the weight matrix to be orthogonal,
or clip the gradient values, so it won’t get too small or too large.
• Not much capacity: While it is very cheap in parameters, using only 3 different weight matrices does
not introduce much capacity to the network.
• Can only have short-term context, which makes context-related tasks, such as language translation,
quite hard.
LSTMs try to overcome the vanilla RNN’s problems by changing the inner logic of the RNN cell by using
different gates for past and current information and introducing skip-connections.
79
VI.3 LSTM Gates Explanation
The LSTM (Long Short-Term Memory) network is a type of RNN that addresses the vanishing gradient
problem and can capture long-term dependencies more effectively. The diagram provided illustrates the
internal workings of an LSTM cell. Let’s break down the key components and gates of the LSTM block:
• Cell State (Ct ): The cell state runs straight down the entire chain, with only some minor linear
interactions. This allows information to flow unchanged and provides a memory that can be updated
or reset based on the gates’ actions. Must have the same shape as the hidden state h.
• Hidden State (ht ): This is the output of the LSTM cell at each time step, which can also serve as
an input to the next time step.
LSTM networks have three main gates that regulate the flow of information:
This layer generates new candidate values, which could be added to the cell state. The Tanh function
outputs values between -1 and 1.
• Output Gate (ot ):
ot = σ(ht−1 · Woh + xt · Wox + bo )
The output gate determines what the next hidden state (ht ) should be. It takes the input and previous
hidden state, processes them through a sigmoid function, and then multiplies the output with the Tanh
of the updated cell state to generate the hidden state.
80
VI.4 Transformers: Revolutionizing Neural Network Architectures
The cell state and hidden state are updated as follows (⊙ for element-wise multiplication):
ht = ot ⊙ Tanh(Ct )
The hidden state is updated by multiplying the output gate value with the Tanh of the new cell state.
Transformers have become a cornerstone in modern deep learning, particularly in natural language processing
(NLP). Introduced by Vaswani et al. in 2017 (paper), Transformers have surpassed traditional recurrent
neural networks (RNNs) and convolutional neural networks (CNNs) in various tasks due to their ability to
handle long-range dependencies and parallelize training.
The Transformer architecture relies heavily on self-attention mechanisms, eliminating the need for recur-
rence, in contrast to RNNs and LSTMs. It is made up of an encoder-decoder structure, where both the
encoder and decoder are composed of a stack of identical blocks. Each block consists of sub-layers, including
multi-head self-attention mechanisms, feed-forward neural networks, and normalization layers.
81
VI.4 Transformers: Revolutionizing Neural Network Architectures
Token Embedding
The sequence input to the Transformer architecture is made up of "tokens", which are fixed-size vectors, that
are predefined, usually by some preprocessed "dictionary". The transformer cannot accept tokens of different
sizes, as it is crucial for the mathematical operations within the different components. The embedding layer
converts input tokens into dense vectors that represent the tokens in a high-dimensional space.
Positional Encoding
Transformers lack the sequential nature of RNNs, so they require a way to incorporate the order of the
sequence. This is done through positional embeddings. The first proposal was the sinusoidal positional
encoding:
pos
P E(pos,2i) = sin
100002i/dmodel
pos
P E(pos,2i+1) = cos
100002i/dmodel
These functions provide unique encodings for each position, allowing the model to differentiate between
different positions in the sequence. Before the input embeddings are fed into the encoder, those positional
embeddings are added to their corresponding input tokens.
For a deeper understanding of embedding layers, take a look at this 3B1B video.
The self-attention mechanism is at the heart of the Transformer model. It allows the model to focus on
different parts of the input sequence when encoding a particular word.
82
VI.4 Transformers: Revolutionizing Neural Network Architectures
The scaled dot-product attention computes the attention scores using the following formula:
!
QK T
Attention(Q, K, V ) = softmax √ V
dk
where Q (query), K (key), and V (value) are the input matrices, and dk is the dimension of the key vectors.
√
• The division by dk is another mechanism to prevent the outputs of the linear operations from
becoming too huge, and in turn make the outputs and the gradients explode. In programming, you
would usually see your loss value become nan.
• The Softmax function has two purposes - to normalize the weights to be in the range of [0, 1], and to
make stronger weights stronger, and smaller weights - smaller
Multi-Head Attention
Multi-head attention allows the model to jointly attend to information from different representation sub-
spaces. We simply divide each token into sub-tokens named "heads", and perform the linear operations with
smaller weight matrices, but each head has its own set. Instead of performing a single attention function,
the transformer employs multiple attention heads to capture different aspects of the input data. Each head
performs its own attention calculation, and the results are concatenated and linearly transformed.
Simply reshape your tensor of shape (N, C, D) to (N, C, D//head-size, head-size) before feeding it to the
self-attention mechanism. After the attention, the "concatenation" is performed by simply reshaping the
tensor back to (N, C, D).
VI.4.c Encoder
83
VI.4 Transformers: Revolutionizing Neural Network Architectures
• Self-Attention Mechanism: This allows the model to weigh the importance of different tokens in
a sequence, in respect to each other.
• Feed-Forward Neural Network: A position-wise fully connected feed-forward network applied
independently to each position.
Each sub-layer has a residual connection around it followed by layer normalization. The output of each
sub-layer is:
LayerNorm(x + Sublayer(x))
Note that those normalizations are crucial, as the self-attention mechanisms do not introduce any non-linear
activations or normalizations, and therefore are prone to exploding values or gradients.
VI.4.d Decoder
With an additional sub-layer compared to the encoder, each layer consists of:
• Masked Self-Attention Mechanism: Prevents attending to future tokens. This ensures that the
predictions for the i-th position can depend only on the known outputs at positions less than i.
Prevents from the decoder to simply copy its inputs.
• Encoder-Decoder Attention: Attends to the encoder’s output. Let M be the embedding size - q is
the decoder’s masked self-attention outputs, of shape q ∈ RK×M . k and v are the encoder’s outputs,
of shape N × M . That results in K embeddings, to be processed and compared with the K ground
truth embeddings.
• The output of the network, after applying N -decoder blocks, is a 2D matrix for each instance in the
batch (each sequence) of shape N × C, where N is the number of tokens in the sequence, and C is the
number of classes in the target dictionary, for a classification loss function (CE).
84
VI.5 Questions
VI.5 Questions
1. Give two drawbacks of using RNNs that are not the exploding or vanishing gradients.
2. How did LSTM solve the vanishing gradient problem in RNNs?
3. Is it a good idea to use ReLU instead of Sigmoid as the activation of the input gate of LSTM?
4. Transformers:
a) Can the transformer architecture take embeddings of different sizes?
b) Can it take sequences of different sizes as inputs to the encoder and the decoder?
c) Cross-attention layer. Given the encoder outputs of shape Xe ∈ RN ×M and the decoder outputs
of shape Xd ∈ RK×M , what is the dimension of the output of that layer?
d) Is the task of predicting the next token in the sequence a classification or a regression task?
e) Why is the self-attention mechanism prone to exploding gradients? How does the original archi-
tecture of the transformer solve that?
f) Why is it important in transformers to use positional encoding, in comparison to RNNs?
VI.6 Answers
85
VI.6 Answers
c) q = Xd , k = Xe , v = Xe .
out = (q · k ⊤ ) · v → (K × M ) · (M × N ) · (N × M ) = (K × N ) · (N × M ) = K × M
86
VII Appendix
y = XW + b (VII.1)
y = ax + b
What is X?
Let us take for example this one input instance (image) from the MNIST handwritten digits’ dataset. Each
gray scale image in this dataset is a 1 × 8 × 8 tensor: 1 for the channels, 8 for the height, and 8 for the
width.
87
VII.1 Multidimensional derivatives
For the affine layer, as phrased in (VII.1), each input instance is flattened to be a row vector inside X. Let
us take a batch of 2 images from the MNIST dataset.
What is W?
Notes
• Note: It is not a linear function, but we treat it as an approximation. Why not? It doesn’t follow
the rules of linearity, where
f (x + y) = f (x) + f (y)
or
f (ax) = af (x)
Which calculates the exact same thing, but results in a column vector and not a row. W weight vectors
are now row vectors and X inputs are now column vectors. It is just a matter of how we construct
our inputs and weights.
88
VII.1 Multidimensional derivatives
VII.1.b Derivatives
Figure VII.1: A neural network computational graph. Note: Although we always deal with batches of
inputs, in the sketch, the input layer represents only one input instance (e.g one flattened
image). Each color represents a different weights column vector in W . Also, each neuron in
the input layer (true to any neuron in the network) will collect the gradients from the flow on
the colorful edges that are attached to it.
What is a gradient
• Gradient: ∂f ∂f
∂x ... ∂x1,m
∂f .1,1 ..
f : Rn×m → R, x ∈ Rn×m , n×m
= .. . ... ∈ R
(VII.6)
∂x ∂f ∂f
∂xn,1 ... ∂xn,m
• Jacobian: ∂f ∂f1
1
x1 f1 ...
∂x1 ∂xn
.. .. ∂f . .. ..
f : Rn → Rm , f ( . ) = . , = . . . ∈R
m×n
(VII.7)
∂x ∂f.
xn fm m ∂fm
∂x1 ... ∂xn
• Note that if x was a row vector and so was the function ’image’ (result), then this Jacobian matrix
would have been transposed.
89
VII.1 Multidimensional derivatives
According to the chain rule, the derivative of the loss function value L according to the weight matrix
of our current affine layer W , would be:
∂L ∂L ∂σ(Y ) ∂Y
= ⊕ ⊕ (VII.9)
∂W ∂σ(Y ) ∂Y ∂W
∂L ∂σ(Y )
In this case, ∂σ(Y ) ∂Y is what we call dout, or the upstream gradient, and we assume it is already
calculated before, as in our current scope (according to the relevant functions, of course). Now it is
sent to our current scope, to be calculated as a part of the chain-rule, and sent up the stream to the
next layer.
90
VII.1 Multidimensional derivatives
z = f (X)
,
L = g(f (X)) = g(z) = z1 + . . . + zn = 2x1 + . . . + 2xn
91
VII.1 Multidimensional derivatives
∂L ∂L
Given a loss function Loss(Y ) = L, we want to calculate ∂X or ∂W .
As seen in (VII.6), the derivative of a scalar by a matrix, is a gradient / Jacobian matrix that has the same
shape as the input. Moreover, we saw in (VII.10), that the final derivative of the loss value L by any entry
of any matrix in the whole neural network is just a scalar. For better understanding we could look at the
computational graph of the network in, Figure VII.1, to clearly see that each neuron collects and sums the
upstream derivatives (from the loss up to it) - that it took part in calculation of, during the forward pass.
∂L ∂L ∂L
∂L ∂y1,1 ∂y1,2 ∂y1,3
= (VII.13)
∂Y
∂L ∂L ∂L
∂y2,1 ∂y2,2 ∂y2,3
So, from (VII.6) we know that the gradient of Y will have the same shape of Y , because L is a scalar, and
it is calculated as a part of the chain-rule. This is the abstraction notion that is discussed above.
Let’s derive W . Eventually, after the chain-rule, the derivative of W would have the same shape:
∂L ∂L ∂L
∂L ∂w1,1 ∂w1,2 ∂w1,3
= (VII.14)
∂W
∂L ∂L ∂L
∂w2,1 ∂w2,2 ∂w2,3
Now, this is important. We do not (!!) want to calculate the Jacobians. For a better explanation why,
∂L
refer to the attached article. We have also learned that each entry of ∂W is a scalar, that is computed as in
(VII.10).
So let’s divide and conquer. It is always a better practice, because it’s hard to wrap our minds on something
bigger than scalars.
2 X 3
∂L X ∂L ∂yij
= (VII.15)
∂w11 i=1 j=1
∂yij ∂w11
For better visualization, we could look at it as a dot product, which is elementwise multiplication and
then summation off all cells (Not what we know as np.dot() - this is confusing). Remember: when deriving
a function, it is by the input variable (at least one):
92
VII.1 Multidimensional derivatives
We could, of course, do the exact same thing in order to derive X, and we will see that:
∂L ∂L
= · WT (VII.21)
∂X ∂Y
∂Y
Note: This is only true, because L is a scalar. If we just looked at Y = XW → ∂W would be a Jacobian.
93
VII.1 Multidimensional derivatives
We could, or course, do the trick of merging it into X and W , as we saw in the lecture.
If not:
1.
Y = XW + b
where XN xD , WDxM , b1xM XWN xM
That means that each bi in b corresponds to one feature in a row of XW - but to add them like that,
it is quite impossible mathematically, right?
Now, one can simply follow the exact same paradigm that we’ve shown above to solve for b, or we
could just look at 1N b as another XW , and do the exact same thing as you did for XW , where 1N
was X and b was W .
∂L
2. We see that the derivative of ∂b is,
∂L ∂L hP
N ∂L PN ∂L
i
= (1N )T · = i=1 ∂yi,1 , . . . , i=1 ∂yi,M
∂b ∂Y
Which in NumPy translates into:
np.sum(dout, axis = 0)
94
VII.1 Multidimensional derivatives
VII.1.d Exercise
And Y = Af f ine(X)
1. Show a solution to compute the gradient of the Sigmoid layer, w.r.t to the upstream gradient.
2. For a concrete example with numbers, replace sigmoid in the activation function with f (x) = x2
(applied element-wise) and use input X and weight matrix W below. So now, our network looks
like this: ! !
0 1 1 1
X= W =
2 3 1 1
2×2 2×2
Affine : Y = XW
Activation : f (Y ) = Y 2
#Rows
X #Cols
X
Loss : L(f (Y )) = yi,j
i j
Find:
(a) The loss value L(f (Y ))
∂L
(b) ∂f
∂L
(c) ∂y
∂L
(d) ∂X
∂L
(e) ∂W
95
VII.1 Multidimensional derivatives
Solutions:
∂L ∂L ∂σ
1. Task: "Gradient of Sigmoid" means the derivative of σ with respect to its input: ∂Y = ∂σ ∂Y
!
σ(y1,1 ) σ(y1,2 ) σ(y1,3 )
σ(Y ) =
σ(y2,1 ) σ(y2,2 ) σ(y2,3 )
∂L ∂L ∂L
∂L
= ∂σ1,1 ∂σ1,2 ∂σ1,3
∂σ ∂L ∂L ∂L
∂σ2,1 ∂σ2,2 ∂σ2,3
∂σ ∂σ1,2 ∂σ1,3
1,1
σ1,1 (1 − σ1,1 ) σ1,2 (1 − σ1,2 ) σ1,3 (1 − σ1,3 )
!
∂σ
= ∂y1,1 ∂y1,2 ∂y1,3
=
∂Y ∂σ2,1 ∂σ2,2 ∂σ2,3
∂y2,1 ∂y2,2 ∂y2,3
σ2,1 (1 − σ2,1 ) σ2,2 (1 − σ2,2 ) σ2,3 (1 − σ2,3 )
For scalars, which all of our loss functions (except for Softmax) are, we can compute this as a dot
product i.e. element-wise multiplication:
∂L ∂L ∂L
σ1,1 (1 − σ1,1 ) σ1,2 (1 − σ1,2 ) σ1,3 (1 − σ1,3 )
!
∂L
= ∂σ1,1 ∂σ1,2 ∂σ1,3
· =
∂Y ∂L ∂L ∂L
∂σ2,1 ∂σ2,2 ∂σ2,3 σ2,1 (1 − σ2,1 ) σ2,2 (1 − σ2,2 ) σ2,3 (1 − σ2,3 )
∂L ∂L ∂L
∂σ1,1 · σ1,1 (1 − σ1,1 ) ∂σ1,2 · σ1,2 (1 − σ1,2 ) ∂σ1,3 · σ1,3 (1 − σ1,3 )
=
∂L ∂L ∂L
∂σ2,1 · σ2,1 (1 − σ2,1 ) ∂σ2,2 · σ2,2 (1 − σ2,2 ) ∂σ2,3 · σ2,3 (1 − σ2,3 )
∂L ∂L
In code we can use the element-wise operator ⊕ like this: ∂Y = ∂σ ⊕ σ(1 − σ)
2. Task:
! !
x1,1 w1,1 + x1,2 w2,1 x1,1 w1,2 + x1,2 w2,2 1 1
(a) Affine: Y = =
x2,1 w1,1 + x2,2 w2,1 x2,1 w1,2 + x2,2 w2,2 5 5
2 2
! !
y1,1 y1,2 1 1
Activation: f (Y ) = 2 2
=
y2,1 y2,2 25 25
∂L ∂L ∂f
(c) ∂Y = ∂f ∂Y
! !
∂f 2y1,1 2y1,2 2 2
∂Y = =
2y2,1 2y2,2 10 10
Remember that since f is an element-wise function, the multiplication with the upstream
gradient ("dout") is also element-wise.
! ! !
∂L ∂L ∂f 1 1 2 2 2 2
∂Y = ∂f ∂Y = =
1 1 10 10 10 10
∂L ∂L
(d) ∂X = ∂Y · WT
96
VII.1 Multidimensional derivatives
! ! !
2 2 1 1 4 4
· =
10 10 1 1 20 20
∂L ∂L
(e) ∂W = XT · ∂Y
! ! !
0 2 2 2 20 20
· =
1 3 10 10 32 32
97