The Little Book of Deep Learning
François Fleuret
François Fleuret is a professor of computer sci-
ence at the University of Geneva, Switzerland.
Contents

List of figures
Foreword

I Foundations

1 Machine Learning
1.1 Learning from data
1.2 Basis function regression
1.3 Under and overfitting
1.4 Categories of models

2 Efficient computation
2.1 GPUs, TPUs, and batches
2.2 Tensors

3 Training
3.1 Losses
3.2 Autoregressive models
3.3 Gradient descent
3.4 Backpropagation
3.5 The value of depth
3.6 Training protocols
3.7 The benefits of scale

II Deep models

4 Model components
4.1 The notion of layer
4.2 Linear layers
4.3 Activation functions
4.4 Pooling
4.5 Dropout
4.6 Normalizing layers
4.7 Skip connections
4.8 Attention layers
4.9 Token embedding
4.10 Positional encoding

5 Architectures
5.1 Multi-Layer Perceptrons
5.2 Convolutional networks
5.3 Attention models

Bibliography
Index
List of Figures
4.11 Attention operator interpretation
4.12 Complete attention operator
4.13 Multi-Head Attention layer
Foreword
If you did not get this book from its official URL
https://fleuret.org/public/lbdl.pdf
François Fleuret,
June 23, 2023
PART I
Foundations
Chapter 1
Machine Learning
1.1 Learning from data
The simplest use case for a model trained from
data is when a signal x is accessible, for instance,
the picture of a license plate, from which one
wants to predict a quantity y, such as the string
of characters written on the plate.
Most of the content of this book is about the defi-
nition of f , which, in realistic scenarios, is a com-
plex combination of pre-defined sub-modules.
1.2 Basis function regression
We can illustrate the training of a model in a sim-
ple case where xn and yn are two real numbers,
the loss is the mean squared error

ℒ(w) = (1/N) Σ_{n=1}^{N} (y_n − f(x_n; w))²,    (1.1)

and f is a linear combination of K basis functions f_1, ..., f_K,

f(x; w) = Σ_{k=1}^{K} w_k f_k(x).
The loss ℒ(w) is quadratic with respect to the w_k's, and finding the w∗ that minimizes it boils down to solving a linear system. See Figure 1.1 for an example with Gaussian kernels as the f_k.
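As an illustration, here is a minimal sketch, with arbitrarily chosen kernel centers, bandwidth, and toy data, that solves this linear system with a least-squares routine:

```python
# Minimal sketch (illustrative, not from the text): least-squares fit of a
# linear combination of K Gaussian kernels f_k centered on assumed positions.
import numpy as np

def gaussian_basis(x, centers, sigma=0.1):
    # N x K matrix with entries f_k(x_n) = exp(-(x_n - c_k)^2 / (2 sigma^2))
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)                          # training inputs x_n
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=50)   # noisy targets y_n

centers = np.linspace(0, 1, 10)       # K = 10 kernel centers (arbitrary)
F = gaussian_basis(x, centers)        # design matrix F[n, k] = f_k(x_n)

# Minimizing the quadratic loss (1.1) amounts to a linear least-squares solve.
w_star, *_ = np.linalg.lstsq(F, y, rcond=None)

x_test = np.linspace(0, 1, 200)
y_hat = gaussian_basis(x_test, centers) @ w_star        # f(x; w*) on a grid
```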
1.3 Under and overfitting
A key element is the interplay between the capacity of the model, that is, its flexibility and ability
to fit diverse data, and the amount and quality
of the training data. When the capacity is insuf-
ficient, the model cannot fit the data, resulting
in a high error during training. This is referred
to as underfitting.
When the capacity is too large, the model may fit the training samples at the expense of a poor fit to the global structure of the data, and poor performance on new inputs. This phenomenon is referred to as overfitting.
1.4 Categories of models
We can organize the use of machine learning models into three broad categories: regression, classification, and density modeling.
Chapter 2
Efficient computation
2.1 GPUs, TPUs, and batches
Graphical Processing Units were originally de-
signed for real-time image synthesis, which re-
quires highly parallel architectures that happen
to be well suited for deep models. As their usage
for AI has increased, GPUs have been equipped
with dedicated tensor cores, and deep-learning
specialized chips such as Google’s Tensor Pro-
cessing Units (TPUs) have been developed.
Proceeding by batches allows for copying the model parameters only once, instead of doing it for each sample. In practice, a GPU processes a batch that fits in memory almost as quickly as it would process a single sample.
2.2 Tensors
GPUs and deep learning frameworks such as Py-
Torch or JAX manipulate the quantities to be
processed by organizing them as tensors, which
are series of scalars arranged along several dis-
crete axes. They are elements of RN1×···×ND
that generalize the notion of vector and matrix.
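For instance, with PyTorch, a batch of images is a single tensor that can be moved to a GPU and processed in one call; the sizes below are arbitrary:

```python
# Minimal sketch: tensors as multi-dimensional arrays, here with PyTorch.
import torch

x = torch.randn(64, 3, 32, 32)    # a batch of 64 RGB images of size 32x32,
                                  # an element of R^{64x3x32x32}
w = torch.randn(10, 3 * 32 * 32)  # a weight matrix

if torch.cuda.is_available():     # move data and parameters to the GPU, if any
    x, w = x.cuda(), w.cuda()

logits = x.flatten(1) @ w.T       # process the whole batch in one operation
print(logits.shape)               # torch.Size([64, 10])
```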
Chapter 3
Training
3.1 Losses
The example of the mean squared error from
Equation 1.1 is a standard loss for predicting a
continuous value.
Cross-entropy
For classification, the usual strategy is that the
output of the model is a vector with one com-
ponent f (x;w)y per class y, interpreted as the
logarithm of a non-normalized probability, or
logit.
The estimated posterior probabilities are then obtained with the softargmax:

P̂(Y = y | X = x) = exp f(x; w)_y / Σ_z exp f(x; w)_z.
To be consistent with this interpretation, the
model should be trained to maximize the proba-
bility of the true classes, hence to minimize the
cross-entropy, expressed as:
ℒ_ce(w) = −(1/N) Σ_{n=1}^{N} log P̂(Y = y_n | X = x_n)
        = (1/N) Σ_{n=1}^{N} −log( exp f(x_n; w)_{y_n} / Σ_z exp f(x_n; w)_z ),

where the summand is the per-sample loss L_ce(f(x_n; w), y_n).
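A minimal sketch of this computation from the logits, checked against PyTorch's built-in loss (shapes and values are arbitrary):

```python
# Minimal sketch: cross-entropy computed from logits, compared to the built-in.
import torch
import torch.nn.functional as F

logits = torch.randn(5, 10)          # f(x_n; w) for N = 5 samples, 10 classes
y = torch.randint(0, 10, (5,))       # true classes y_n

log_p = logits - logits.logsumexp(dim=1, keepdim=True)   # log of the softargmax
loss_manual = -log_p[torch.arange(5), y].mean()

loss_builtin = F.cross_entropy(logits, y)                 # same quantity
print(torch.allclose(loss_manual, loss_builtin))          # True
```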
Contrastive loss
In certain setups, even though the value to be
predicted is continuous, the supervision takes
the form of ranking constraints. The typical do-
main where this is the case is metric learning,
where the objective is to learn a measure of dis-
tance between samples such that a sample xa
from a certain semantic class is closer to any
sample xb of the same class than to any sample
xc from another class. For instance, xa and xb
can be two pictures of a certain person, and xc a
picture of someone else.
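One standard way of turning such ranking constraints into a loss, not necessarily the exact formulation used here, is a margin-based triplet loss on the learned embeddings; the margin and dimensions below are arbitrary:

```python
# Sketch of a triplet loss: the anchor x_a should be closer to the positive
# x_b (same class) than to the negative x_c (other class) by at least a margin.
import torch

def triplet_loss(e_a, e_b, e_c, margin=1.0):
    # e_a, e_b, e_c: embeddings of x_a, x_b, x_c, one row per triplet
    d_pos = (e_a - e_b).pow(2).sum(dim=1)   # squared distance to the positive
    d_neg = (e_a - e_c).pow(2).sum(dim=1)   # squared distance to the negative
    return torch.relu(d_pos - d_neg + margin).mean()

e_a, e_b, e_c = torch.randn(3, 16, 8)       # a batch of 16 triplets in R^8
print(triplet_loss(e_a, e_b, e_c))
```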
3.2 Autoregressive models
A key class of methods, particularly for dealing with discrete sequences in natural language processing and computer vision, are the autoregressive models.
Then, a model

f : {∅, 1, ..., K}^T → R^K

which, given such an input, computes a vector l_t of K logits corresponding to

P̂(X_t | X_1 = x_1, ..., X_{t−1} = x_{t−1}),

makes it possible to sample one token given the previous ones.
Causal models
The training procedure we described requires
a different input for each t, and the bulk of the
computation done for t < t′ is repeated for t′.
This is extremely inefficient since T is often of
the order of hundreds or thousands.
The standard solution is to use a model that computes the logits for all positions at once,

f : {1, ..., K}^T → R^{T×K},

but with a computational structure such that the computed logits l_t for x_t depend only on the input values x_1, ..., x_{t−1}.
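A minimal sketch of autoregressive sampling with such a causal model; the `model` below is a placeholder assumed to return T × K logits in one pass:

```python
# Minimal sketch of autoregressive sampling: token t+1 is drawn from the
# logits computed at position t, thanks to the causal structure.
import torch

def sample(model, T, K, start_token=0):
    x = torch.full((1, T), start_token, dtype=torch.long)   # (batch=1, T)
    for t in range(T - 1):
        logits = model(x)                     # assumed shape (1, T, K)
        probs = logits[0, t].softmax(dim=0)   # P(X_{t+1} | x_1, ..., x_t)
        x[0, t + 1] = torch.multinomial(probs, 1)
    return x
```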
Tokenizer
One important technical detail when dealing
with natural languages is that the representation
as tokens can be done in multiple ways, ranging
from the finest granularity of individual symbols
to entire words. The conversion to and from the
token representation is carried out by a separate
algorithm called a tokenizer.
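As an illustration of the finest granularity, a minimal character-level tokenizer can be sketched as follows (real tokenizers such as Byte Pair Encoding operate on larger units):

```python
# Minimal sketch of a character-level tokenizer: a bijection between the
# characters seen in a corpus and integer token ids.
class CharTokenizer:
    def __init__(self, corpus):
        chars = sorted(set(corpus))
        self.to_id = {c: i for i, c in enumerate(chars)}
        self.to_char = {i: c for c, i in self.to_id.items()}

    def encode(self, text):
        return [self.to_id[c] for c in text]

    def decode(self, ids):
        return "".join(self.to_char[i] for i in ids)

tok = CharTokenizer("hello world")
print(tok.decode(tok.encode("hello")))  # "hello"
```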
3.3 Gradient descent
Except in specific cases like the linear regression
we saw in § 1.2, the optimal parameters w∗ do
not have a closed-form expression. In the general
case, the tool of choice to minimize a function is
gradient descent. It starts by initializing the pa-
rameters with a random w0, and then improves
this estimate by iterating gradient steps, each
consisting of computing the gradient of the loss
with respect to the parameters, and subtracting
a fraction of it:

w_{n+1} = w_n − η ∇ℒ|_{w_n}(w_n).
Learning rate
The meta-parameter η is called the learning rate.
It is a positive value that modulates how quickly
the minimization is done, and must be chosen
carefully.
If it is too large, the minimization may bounce around a good minimum and never descend into it. As we will see in § 3.6, it can depend on the iteration number n.
Since the loss is an average of per-sample losses ℒ_n, its gradient is the average of the per-sample gradients:

∇ℒ|_w(w) = (1/N) Σ_{n=1}^{N} ∇ℒ_n|_w(w).    (3.2)
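A minimal sketch of this procedure with mini-batches in PyTorch; the model, data, learning rate, and batch size are arbitrary choices for illustration:

```python
# Minimal sketch of (stochastic) gradient descent with mini-batches.
import torch

model = torch.nn.Linear(10, 1)
X, Y = torch.randn(1000, 10), torch.randn(1000, 1)
eta = 1e-2                                    # learning rate

for epoch in range(10):
    for b in range(0, 1000, 100):             # mini-batches of 100 samples
        loss = ((model(X[b:b+100]) - Y[b:b+100]) ** 2).mean()
        model.zero_grad()
        loss.backward()                       # gradient of the batch loss
        with torch.no_grad():
            for w in model.parameters():
                w -= eta * w.grad             # w_{n+1} = w_n - eta * gradient
```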
3.4 Backpropagation
Using gradient descent requires a technical means to compute ∇ℒ|_w(w) where ℒ(w) = L(f(x; w); y). Given that f and L are
both compositions of standard tensor opera-
tions, as for any mathematical expression, the
chain rule from differential calculus allows us to
get an expression of it.
[Figure: for each layer f^(d)(·; w_d), the forward pass computes x^(d) from x^(d−1), and the backward pass obtains ∇ℒ|_{x^(d−1)} and ∇ℒ|_{w_d} from ∇ℒ|_{x^(d)} through products with the Jacobians J_{f^(d)}|_x and J_{f^(d)}|_w.]
Forward and backward passes
Consider the simple case of a composition of mappings

f = f^(D) ∘ f^(D−1) ∘ ··· ∘ f^(1).
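To make the chain rule concrete, here is a minimal sketch, with an arbitrary two-layer composition, that carries out the backward pass by hand and checks it against PyTorch's autograd:

```python
# Minimal sketch of a forward and backward pass through two composed mappings,
# with the Jacobian products written by hand and verified against autograd.
import torch

w1 = torch.randn(4, 3, requires_grad=True)
w2 = torch.randn(1, 4, requires_grad=True)
x0 = torch.randn(3)

# Forward pass: x1 = f1(x0; w1), x2 = f2(x1; w2)
x1 = torch.tanh(w1 @ x0)
x2 = w2 @ x1
loss = x2.sum()

# Backward pass by hand: propagate the gradient layer by layer.
g2 = torch.ones_like(x2)                  # d loss / d x2
grad_w2 = g2[:, None] * x1[None, :]       # d loss / d w2
g1 = w2.T @ g2                            # d loss / d x1
grad_w1 = ((1 - x1**2) * g1)[:, None] * x0[None, :]   # d loss / d w1 (tanh')

loss.backward()
print(torch.allclose(grad_w1, w1.grad), torch.allclose(grad_w2, w2.grad))  # True True
```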
Resource usage
Regarding the computational cost, as we will
see, the bulk of the computation goes into linear
operations, each requiring one matrix product
for the forward pass and two for the products by
the Jacobians for the backward pass, making the
latter roughly twice as costly as the former.
Vanishing gradient
A key historical issue when training a large net-
work is that when the gradient propagates back-
wards through an operator, it may be scaled by a
multiplicative factor, and consequently decrease
or increase exponentially when it traverses many
layers. A standard method to prevent it from
exploding is gradient norm clipping, which con-
sists of re-scaling the gradient to set its norm to
a fixed threshold if it is above it [Pascanu et al.,
2013].
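A minimal sketch of this clipping with PyTorch's built-in utility; the model, loss, and threshold are arbitrary:

```python
# Minimal sketch of gradient norm clipping, applied after the backward pass
# and before the parameter update.
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

loss = model(torch.randn(8, 10)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # rescale if norm > 1
optimizer.step()
```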
3.5 The value of depth
As the term “deep learning” indicates, useful
models are generally compositions of long se-
ries of mappings. Training them with gradient
descent results in a sophisticated co-adaptation
of the mappings, even though this procedure is
gradual and local.
Such a toy example is limited to two dimensions for visualization, while real models take advantage of representations in high dimensions, which, in particular, facilitates the optimization by providing many degrees of freedom.
3.6 Training protocols
Training a deep network requires defining a pro-
tocol to make the most of computation and data,
and to ensure that performance will be good on
new data.
[Figure: training and validation losses as a function of the number of epochs.]
on the training set [Belkin et al., 2018].
3.7 The benefits of scale
There is an accumulation of empirical results
showing that performance, for instance, esti-
mated through the loss on test data, improves
with the amount of data according to remarkable
scaling laws, as long as the model size increases
correspondingly [Kaplan et al., 2020] (see Figure
3.6).
[Figure 3.6: test loss as a function of compute (peta-FLOP/s-day) and of the number of parameters.]
Dataset        Year   Nb. of images   Size
ImageNet       2012   1.2M            150Gb
Cityscape      2016   25K             60Gb
LAION-5B       2022   5.8B            240Tb

Dataset        Year   Nb. of books    Size
WMT-18-de-en   2018   14M             8Gb
The Pile       2020   1.6B            825Gb
OSCAR          2020   12B             6Tb
[Figure 3.7 plots, against the year, models including AlexNet, GoogLeNet, VGG16, ResNet, Transformer, GPT, BERT, GPT-2, ViT, CLIP-ViT, AlphaGo, AlphaZero, Whisper, LaMDA, GPT-3, and PaLM, with training costs ranging from roughly 10^18 to beyond 10^24 FLOP, and dashed lines at 1KWh, 1MWh, and 1GWh.]
Figure 3.7: Training costs in number of FLOP of some landmark models [Sevilla et al., 2023]. The colors indicate the domains of application: Computer Vision (blue), Natural Language Processing (red), or other (black). The dashed lines correspond to the energy consumption using A100 SXM GPUs in 16-bit precision. For reference, the total electricity consumption in the US in 2021 was 3920 TWh.
The most impressive current successes of artifi-
cial intelligence rely on the so-called Large Lan-
guage Models (LLMs), which we will see in § 5.3
and § 7.1, trained on extremely large text datasets
(see Table 3.1).
PART II
Deep models
Chapter 4
Model components
4.1 The notion of layer
We call layers standard complex compounded
tensor operations that have been designed and
empirically identified as being generic and effi-
cient. They often incorporate trainable param-
eters and correspond to a convenient level of
granularity for designing and describing large
deep models. The term is inherited from sim-
ple multi-layer neural networks, even though
modern models may take the form of a complex
graph of such modules, incorporating multiple
parallel pathways.
[Figure: a model depicted as a stack of layer boxes (here a layer f repeated ×K and a layer g with meta-parameter n=4) mapping an input X of size 32×32 to an output Y of size 4×4, with names, meta-parameters, and tensor sizes added in blue on their right.]
4.2 Linear layers
The most important modules in terms of compu-
tation and number of parameters are the Linear
layers. They benefit from decades of research
and engineering in algorithmic and chip design
for matrix operations.
Convolutional layers
A linear layer can take as input an arbitrarily-
shaped tensor by reshaping it into a vector, as
long as it has the correct number of coefficients.
However, such a layer is poorly adapted to dealing with large signals such as images.
Figure 4.1: A 1D convolution (left) takes as input a D×T tensor X, applies the same affine mapping ϕ(·; w) to every sub-tensor of shape D×K, and stores the resulting D′×1 tensors into Y. A 1D transposed convolution (right) takes as input a D×T tensor, applies the same affine mapping ψ(·; w) to every sub-tensor of shape D×1, and sums the shifted resulting D′×K tensors. Both can process inputs of different sizes.
Figure 4.2: A 2D convolution (left) takes as input a
D × H × W tensor X, applies the same affine mapping
ϕ(·;w) to every sub-tensor of shape D × K × L, and
stores the resulting D′ × 1 × 1 tensors into Y . A 2D
transposed convolution (right) takes as input a D ×
H × W tensor, applies the same affine mapping ψ(·;w)
to every D × 1 × 1 sub-tensor, and sums the shifted
resulting D′ × K × L tensors into Y .
Figure 4.3: Beside its kernel size and number of input
/ output channels, a convolution admits three meta-
parameters: the stride s (left) modulates the step size
when going through the input tensor, the padding p
(top right) specifies how many zero entries are added
around the input tensor before processing it, and the
dilation d (bottom right) parameterizes the index count
between coefficients of the filter.
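A minimal sketch showing how these meta-parameters change the output size of a 2D convolution in PyTorch; the channel counts and kernel size are arbitrary:

```python
# Minimal sketch: effect of stride, padding, and dilation on the output size.
import torch
from torch import nn

x = torch.randn(1, 3, 32, 32)                                   # D=3, H=W=32

print(nn.Conv2d(3, 8, kernel_size=5)(x).shape)                  # (1, 8, 28, 28)
print(nn.Conv2d(3, 8, kernel_size=5, stride=2)(x).shape)        # (1, 8, 14, 14)
print(nn.Conv2d(3, 8, kernel_size=5, padding=2)(x).shape)       # (1, 8, 32, 32)
print(nn.Conv2d(3, 8, kernel_size=5, dilation=2)(x).shape)      # (1, 8, 24, 24)
```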
Natural signals such as images exhibit a strong structure and statistical stationarity with respect to translation, scaling, and certain symmetries. This is not reflected in the inductive bias of a fully connected layer, which completely ignores the signal structure.
Figure 4.4: Given an activation in a series of convolu-
tion layers, here in red, its receptive field is the area in
the input signal, in blue, that modulates its value. Each
intermediate convolutional layer increases the width
and height of that area by roughly those of the kernel.
A stack of convolutional layers is the standard architecture for mapping a large-dimension signal,
such as an image or a sound sample, to a low-
dimension tensor. This can be used, for instance,
to get class scores for classification or a com-
pressed representation. Transposed convolution
layers are used the opposite way to build a large-
dimension signal from a compressed representa-
tion, either to assess that the compressed repre-
sentation contains enough information to recon-
struct the signal or for synthesis, as it is easier
to learn a density model over a low-dimension
representation. We will revisit this in § 5.2.
4.3 Activation functions
If a network were combining only linear com-
ponents, it would itself be a linear operator,
so it is essential to have non-linear operations.
These are implemented in particular with activa-
tion functions, which are layers that transform
each component of the input tensor individually
through a mapping, resulting in a tensor of the
same shape.
[Figure: common activation functions, including Tanh and ReLU.]

Another widely used activation function is the GELU, defined as

gelu(x) = x P(Z ≤ x),

where Z is a standard Gaussian random variable.
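A minimal sketch of this definition, checked against PyTorch's built-in implementation:

```python
# Minimal sketch: GELU written with the Gaussian CDF, compared to F.gelu.
import math
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)
gelu_manual = x * 0.5 * (1 + torch.erf(x / math.sqrt(2)))  # x * P(Z <= x)
print(torch.allclose(gelu_manual, F.gelu(x), atol=1e-6))   # True
```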
4.4 Pooling
A classical strategy to reduce the signal size is to
use a pooling operation that combines multiple
activations into one that ideally summarizes the
information. The most standard operation of this
class is the max pooling layer, which, similarly
to convolution, can operate in 1D and 2D and is
defined by a kernel size.
[Figure: 1D max pooling, computing each output value as the max over a window of the input.]
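A minimal usage sketch in PyTorch, with arbitrary tensor sizes:

```python
# Minimal sketch: max pooling with kernel size 2 halves the resolution.
import torch
import torch.nn.functional as F

x1 = torch.randn(1, 8, 100)        # (batch, channels, T)
x2 = torch.randn(1, 8, 32, 32)     # (batch, channels, H, W)
print(F.max_pool1d(x1, kernel_size=2).shape)  # (1, 8, 50)
print(F.max_pool2d(x2, kernel_size=2).shape)  # (1, 8, 16, 16)
```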
Max pooling also makes the representation partially invariant to local deformations.
4.5 Dropout
Some layers have been designed to explicitly
facilitate training or improve the learned repre-
sentations.
Figure 4.7: Dropout can process a tensor of arbitrary
shape. During training (left), it sets activations at ran-
dom to zero with probability p and applies a multiply-
ing factor to keep the expected values unchanged. Dur-
ing test (right), it keeps all the activations unchanged.
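A minimal sketch of this train/test behavior (PyTorch provides it as a built-in layer, nn.Dropout):

```python
# Minimal sketch of dropout as in Figure 4.7: zero activations with
# probability p and rescale by 1/(1-p) during training; identity at test time.
import torch

def dropout(x, p=0.5, train=True):
    if not train:
        return x
    mask = (torch.rand_like(x) > p).float()
    return x * mask / (1 - p)

x = torch.ones(2, 5)
print(dropout(x, p=0.5, train=True))    # zeros and 2s, expectation preserved
print(dropout(x, p=0.5, train=False))   # unchanged
```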
Figure 4.8: 2D signals such as images generally exhibit
strong short-term correlation and individual activa-
tions can be inferred from their neighbors. This redun-
dancy nullifies the effect of the standard unstructured
dropout, so the usual dropout layer for 2D tensors drops
entire channels instead of individual values.
4.6 Normalizing layers
An important class of operators to facilitate the
training of deep architectures are the normaliz-
ing layers, which force the empirical mean and
variance of groups of activations.
[Figure: batchnorm normalizes each component as (· − m̂_d)/√(v̂_d + ϵ), with per-component statistics computed across the batch (and the spatial dimensions H, W), while layernorm normalizes each sample as (· − m̂_b)/√(v̂_b + ϵ), with statistics computed across its D components.]
During training, batchnorm normalizes each component across the batch, and then recombines the result with a trainable mean β_d and standard deviation γ_d:

∀b,  z_{b,d} = (x_{b,d} − m̂_d) / √(v̂_d + ϵ),
∀b,  y_{b,d} = γ_d z_{b,d} + β_d.
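A minimal sketch of this computation for a B × D input during training; the statistics used at test time and the moving averages are omitted:

```python
# Minimal sketch of the batchnorm computation during training.
import torch

def batchnorm_train(x, gamma, beta, eps=1e-5):
    # x is a B x D tensor; gamma and beta are trainable vectors of dimension D.
    m_hat = x.mean(dim=0)                  # empirical mean per component
    v_hat = x.var(dim=0, unbiased=False)   # empirical variance per component
    z = (x - m_hat) / torch.sqrt(v_hat + eps)
    return gamma * z + beta

x = torch.randn(16, 4) * 3 + 1
y = batchnorm_train(x, gamma=torch.ones(4), beta=torch.zeros(4))
print(y.mean(dim=0), y.var(dim=0, unbiased=False))   # ~0 and ~1 per component
```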
4.7 Skip connections
Another technique that mitigates the vanishing
gradient and allows the training of deep archi-
tectures are skip connections [Long et al., 2014;
Ronneberger et al., 2015]. They are not layers
per se, but an architectural design in which out-
puts of some layers are transported as-is to other
layers further in the model, bypassing process-
ing in between. This unmodified signal can be
concatenated or added to the input of the layer
the connection branches into (see Figure 4.10). A
particular type of skip connections are the resid-
ual connections which combine the signal with
a sum, and usually skip only a few layers (see
Figure 4.10, right).
Figure 4.10: Skip connections, highlighted in red on this
figure, transport the signal unchanged across multiple
layers. Some architectures (center) that downscale and
re-upscale the representation size to operate at multiple
scales, have skip connections to feed outputs from the
early parts of the network to later layers operating at
the same scales [Long et al., 2014; Ronneberger et al.,
2015]. The residual connections (right) are a special
type of skip connections that sum the original signal
to the transformed one, and usually bypass at most a
handful of layers [He et al., 2015].
Their role can also be to facilitate multi-scale rea-
soning in models that reduce the signal size be-
fore re-expanding it, by connecting layers with
compatible sizes, for instance for semantic seg-
mentation (see § 6.4). In the case of residual
connections, they may also facilitate learning
by simplifying the task to finding a differential
improvement instead of a full update.
4.8 Attention layers
In many applications, there is a need for an op-
eration able to combine local information at lo-
cations far apart in a tensor. For instance, this
could be distant details for coherent and realistic
image synthesis, or words at different positions
in a paragraph to make a grammatical or seman-
tic decision in natural language processing.
Attention operator

Given a query tensor Q of size N^Q × D_QK, a key tensor K of size N^KV × D_QK, and a value tensor V of size N^KV × D_V, the attention operator computes a tensor

Y = att(Q, K, V)

of size N^Q × D_V, with

att(Q, K, V) = softargmax( QK^⊤ / √D_QK ) V,

where the softargmax is applied per query, and A = softargmax(QK^⊤ / √D_QK) is the attention matrix.
Figure 4.12: The attention operator Y = att(Q,K,V )
computes first an attention matrix A as the per-query
softargmax of QK⊤, which may be masked by a con-
stant matrix M before the normalization. This atten-
tion matrix goes through a dropout layer before being
multiplied by V to get the resulting Y . This operator
can be made causal by taking M full of 1s below the
diagonal and zeros above.
This operator is usually extended in two ways,
as depicted in Figure 4.12. First, the attention
matrix can be masked by multiplying it before
the softargmax normalization by a Boolean ma-
trix M . This allows, for instance, to make the
operator causal by taking M full of 1s below the
diagonal and zero above, preventing Yq from de-
pending on keys and values of indices k greater
than q. Second, the attention matrix is processed
by a dropout layer (see § 4.5) before being multi-
plied by V , providing the usual benefits during
training.
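A minimal sketch of this operator in PyTorch; masking with −∞ before the softargmax zeroes the same attention coefficients as the element-wise Boolean masking described above:

```python
# Minimal sketch of the attention operator: per-query softargmax of
# QK^T / sqrt(D_QK), optional causal mask, dropout, then multiplication by V.
import math
import torch
import torch.nn.functional as F

def att(Q, K, V, causal=False, p_drop=0.0):
    # Q: (N_Q, D_QK), K: (N_KV, D_QK), V: (N_KV, D_V)
    A = Q @ K.T / math.sqrt(Q.size(-1))
    if causal:
        mask = torch.ones(Q.size(0), K.size(0)).tril().bool()
        A = A.masked_fill(~mask, float("-inf"))   # forbid attending to the future
    A = F.dropout(A.softmax(dim=-1), p=p_drop)
    return A @ V                                  # (N_Q, D_V)

Q, K, V = torch.randn(5, 8), torch.randn(7, 8), torch.randn(7, 16)
print(att(Q, K, V).shape)  # torch.Size([5, 16])
```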
The Multi-Head Attention layer (see Figure 4.13) combines H such attention operators. It is parameterized by weight tensors

• W^Q of size H × D × D_QK,
• W^K of size H × D × D_QK, and
• W^V of size H × D × D_V,

to compute respectively the queries, the keys, and the values from the input, and a final weight matrix W^O of size H D_V × D to aggregate the per-head results.
Figure 4.13: The Multi-head Attention layer applies, for each of its h = 1,...,H heads, a parametrized linear transformation to individual elements of the input sequences X^Q, X^K, X^V to get sequences Q, K, V that are processed by the attention operator to compute Y_h. These H sequences are concatenated along features, and individual elements are passed through one last linear operator to get the final result sequence Y.
It takes as input three sequences

• X^Q of size N^Q × D,
• X^K of size N^KV × D, and
• X^V of size N^KV × D,

from which it computes, for h = 1,...,H,

Y_h = att(X^Q W^Q_h, X^K W^K_h, X^V W^V_h).
4.9 Token embedding
In many situations, we need to convert discrete
tokens into vectors. This can be done with an em-
bedding layer, which consists of a lookup table
that directly maps integers to vectors.
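A minimal usage sketch with PyTorch's built-in embedding layer; vocabulary size and embedding dimension are arbitrary:

```python
# Minimal sketch: an embedding layer is a trainable lookup table from token
# indices to vectors.
import torch
from torch import nn

embed = nn.Embedding(num_embeddings=1000, embedding_dim=64)  # 1000-token vocabulary
tokens = torch.tensor([[3, 17, 42]])                          # a sequence of 3 token ids
print(embed(tokens).shape)                                    # torch.Size([1, 3, 64])
```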
4.10 Positional encoding
While the processing of a fully connected layer
is specific to both the positions of the features
in the input tensor and to the positions of the
resulting activations in the output tensor, con-
volutional layers and Multi-Head Attention lay-
ers are oblivious to the absolute position in the
tensor. This is key to their strong invariance and
inductive bias, which is beneficial for dealing
with a stationary signal.
The absolute position can be provided to the model by adding a positional encoding to the feature representation, for instance

pos-enc[t, d] = sin( t / T^{d/D} )       if d ∈ 2ℕ,
                cos( t / T^{(d−1)/D} )   otherwise,

with T = 10^4.
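A minimal sketch computing this table of values:

```python
# Minimal sketch of the sinusoidal positional encoding defined above.
import torch

def pos_enc(T_max, D, T=10_000):
    t = torch.arange(T_max, dtype=torch.float32)[:, None]
    d = torch.arange(D)[None, :]
    # exponent d/D for even d, (d-1)/D for odd d
    exponent = (2 * torch.div(d, 2, rounding_mode="floor")) / D
    angles = t / T ** exponent
    return torch.where(d % 2 == 0, torch.sin(angles), torch.cos(angles))

print(pos_enc(50, 16).shape)  # torch.Size([50, 16]), added to the embeddings
```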
Chapter 5
Architectures
5.1 Multi-Layer Perceptrons
The simplest deep architecture is the Multi-Layer
Perceptron (MLP), which takes the form of a
succession of fully connected layers separated
by activation functions. See an example in Figure
5.1. For historical reasons, in such a model, the
number of hidden layers refers to the number of
linear layers, excluding the last one.
Figure 5.1: This multi-layer perceptron takes as input
a one-dimensional tensor of size 50, is composed of
three fully connected layers with outputs of dimensions
respectively 25, 10, and 2, the two first followed by
ReLU layers.
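A possible PyTorch implementation of the model of Figure 5.1:

```python
# The MLP of Figure 5.1: 50 -> 25 -> 10 -> 2, with ReLU after the first two
# fully connected layers.
import torch
from torch import nn

mlp = nn.Sequential(
    nn.Linear(50, 25), nn.ReLU(),
    nn.Linear(25, 10), nn.ReLU(),
    nn.Linear(10, 2),
)

x = torch.randn(16, 50)      # a batch of 16 input vectors
print(mlp(x).shape)          # torch.Size([16, 2])
```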
By the universal approximation theorem, provided the activation function σ is continuous and not polynomial, any continuous function f can be approximated arbitrarily well uniformly on a compact domain, which is bounded and contains its boundary, by a model of the form l2 ∘ σ ∘ l1 where l1 and l2 are affine. Such a model is an MLP with a single hidden layer, and this
result implies that it can approximate anything
of practical value. However, this approximation
holds if the dimension of the first linear layer’s
output can be arbitrarily large.
5.2 Convolutional networks
The standard architecture for processing images
is a convolutional network, or convnet, that com-
bines multiple convolutional layers, either to re-
duce the signal size before it can be processed by
fully connected layers, or to output a 2D signal
also of large size.
LeNet-like
The original LeNet model for image classifica-
tion [LeCun et al., 1998] combines a series of 2D
convolutional layers and max pooling layers that
play the role of feature extractor, with a series of
fully connected layers which act as a MLP and
perform the classification per se (see Figure 5.2).
Residual networks
Standard convolutional neural networks that follow the architecture of the LeNet family are not easily extended to deep architectures and suffer from the vanishing gradient problem. The residual networks, or ResNets, proposed by He et al. [2015] address this issue with residual connections (see § 4.7).
[Figure 5.2 structure: X (1×28×28) → conv-2d k=5 → 32×24×24 → maxpool k=3 → 32×8×8 → relu → conv-2d k=5 → 64×4×4 → maxpool k=2 → 64×2×2 → relu → reshape → 256 → fully-conn → 200 → relu → fully-conn → 10 → P̂(Y), with the convolutional part acting as feature extractor and the fully connected part as classifier.]
Figure 5.2: Example of a small LeNet-like network for
classifying 28×28 grayscale images of handwritten
digits [LeCun et al., 1998]. Its first half is convolutional,
and alternates convolutional layers per se and max
pooling layers, reducing the signal dimension from
28 ×28 scalars to 256. Its second half processes this
256-dimensional feature vector through a one hidden
layer perceptron to compute 10 logit scores correspond-
ing to the ten possible digits.
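A possible PyTorch implementation of the network of Figure 5.2:

```python
# A LeNet-like network following the structure of Figure 5.2.
import torch
from torch import nn

lenet = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=5),   # 1x28x28 -> 32x24x24
    nn.MaxPool2d(kernel_size=3),       # -> 32x8x8
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=5),  # -> 64x4x4
    nn.MaxPool2d(kernel_size=2),       # -> 64x2x2
    nn.ReLU(),
    nn.Flatten(),                      # -> 256
    nn.Linear(256, 200),
    nn.ReLU(),
    nn.Linear(200, 10),                # 10 logit scores
)

x = torch.randn(100, 1, 28, 28)        # a batch of 100 grayscale images
print(lenet(x).shape)                  # torch.Size([100, 10])
```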
[Figure 5.3 structure: X (C×H×W) → conv-2d k=1 → C/2×H×W → batchnorm → relu → conv-2d k=3 p=1 → batchnorm → relu → conv-2d k=1 → C×H×W → batchnorm → (+ X) → relu → Y.]
Figure 5.3: A residual block.
[Figure 5.4 structure: as in Figure 5.3, but the first conv-2d k=1 reduces the channels to C/S, the 3×3 convolution has stride S (conv-2d k=3 s=S p=1), the last conv-2d k=1 expands to 4C/S channels, and the skip path goes through a conv-2d k=1 s=S with batchnorm, so that the output Y has size 4C/S × H/S × W/S.]
Figure 5.4: A downscaling residual block. It admits a
meta-parameter S, the stride of the first convolution
layer, which modulates the reduction of the tensor size.
[Figure 5.5 structure: X (3×224×224) → conv-2d k=7 s=2 p=3 → 64×112×112 → batchnorm → relu → maxpool k=3 s=2 p=1 → 64×56×56 → dresblock S=1 → 256×56×56 → resblock ×2 → dresblock S=2 → 512×28×28 → resblock ×3 → dresblock S=2 → 1024×14×14 → resblock ×5 → dresblock S=2 → 2048×7×7 → resblock ×2 → avgpool k=7 → 2048×1×1 → reshape → 2048 → fully-conn → 1000 → P̂(Y).]
Figure 5.5: Structure of the ResNet-50 [He et al., 2015].
The number of parameters of a convolutional layer, and its computational cost,
are quadratic with the number of channels. This
residual block mitigates this problem by first re-
ducing the number of channels with a 1×1 con-
volution, then operating spatially with a 3× 3
convolution on this reduced number of chan-
nels, and then upscaling the number of channels,
again with a 1 × 1 convolution.
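A possible PyTorch implementation of such a bottleneck residual block, following the structure of Figure 5.3:

```python
# A bottleneck residual block: 1x1 convolution to reduce the channels, 3x3
# convolution operating spatially, 1x1 convolution to restore the channels,
# and the input added back before the final ReLU.
import torch
from torch import nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    def __init__(self, C):
        super().__init__()
        self.conv1 = nn.Conv2d(C, C // 2, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(C // 2)
        self.conv2 = nn.Conv2d(C // 2, C // 2, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(C // 2)
        self.conv3 = nn.Conv2d(C // 2, C, kernel_size=1)
        self.bn3 = nn.BatchNorm2d(C)

    def forward(self, x):
        y = F.relu(self.bn1(self.conv1(x)))
        y = F.relu(self.bn2(self.conv2(y)))
        y = self.bn3(self.conv3(y))
        return F.relu(x + y)               # residual connection

print(ResBlock(256)(torch.randn(1, 256, 14, 14)).shape)  # (1, 256, 14, 14)
```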
The first downscaling residual block, with S = 1, performs no downscaling, only an increase of the number of channels by a factor of 4. The output of the last resid-
ual block is 2048× 7×7, which is converted to a
vector of dimension 2048 by an average pooling
of kernel size 7 × 7, and then processed through
a fully-connected layer to get the final logits,
here for 1000 classes.
5.3 Attention models
As stated in § 4.8, many applications, particu-
larly from natural language processing, benefit
greatly from models that include attention mech-
anisms. The architecture of choice for such tasks,
which has been instrumental in recent advances
in deep learning, is the Transformer proposed
by Vaswani et al. [2017].
Transformer
The original Transformer, pictured in Figure 5.7,
was designed for sequence-to-sequence transla-
tion. It combines an encoder that processes the
input sequence to get a refined representation,
and an autoregressive decoder that generates
each token of the result sequence, given the en-
coder’s representation of the input sequence and
the output tokens generated so far.
[Figure 5.6 diagrams: the feed-forward block applies to X^QKV a layernorm, a fully-conn layer, a gelu, a second fully-conn layer, and a dropout, and adds the result to its input; the self-attention block applies a layernorm to X^QKV and an mha layer with Q, K, V all computed from it, added to the input; the cross-attention block computes Q from X^Q and K, V from X^KV.]
Figure 5.6: Feed-forward block (top), self-attention
block (bottom left) and cross-attention block (bottom
right). These specific structures proposed by Radford
et al. [2018] differ slightly from the original architec-
ture of Vaswani et al. [2017], in particular by having
the layer normalization first in the residual blocks.
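A possible PyTorch implementation of these pre-norm blocks, using the built-in multi-head attention layer; the dimensions, number of heads, and dropout probability are arbitrary:

```python
# Pre-norm self-attention and feed-forward blocks in the spirit of Figure 5.6.
import torch
from torch import nn

class SelfAttBlock(nn.Module):
    def __init__(self, D, H):
        super().__init__()
        self.ln = nn.LayerNorm(D)
        self.mha = nn.MultiheadAttention(D, H, batch_first=True)

    def forward(self, x):
        z = self.ln(x)                                  # layer normalization first
        return x + self.mha(z, z, z, need_weights=False)[0]

class FFWBlock(nn.Module):
    def __init__(self, D, hidden, p_drop=0.1):
        super().__init__()
        self.ln = nn.LayerNorm(D)
        self.net = nn.Sequential(
            nn.Linear(D, hidden), nn.GELU(), nn.Linear(hidden, D), nn.Dropout(p_drop),
        )

    def forward(self, x):
        return x + self.net(self.ln(x))                 # residual connection

x = torch.randn(2, 16, 64)                              # (batch, sequence, features)
y = FFWBlock(64, 256)(SelfAttBlock(64, 8)(x))
print(y.shape)                                          # torch.Size([2, 16, 64])
```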
[Figure 5.7: the original encoder-decoder Transformer. The encoder embeds the T input tokens X_1,...,X_T (T×D), adds a positional encoding, and applies N blocks of self-attention and feed-forward layers to produce Z_1,...,Z_T. The decoder embeds the shifted output tokens 0,Y_1,...,Y_{S−1} (S×D), adds a positional encoding, and applies N blocks of causal self-attention, cross-attention with Q from the decoder and K, V from the encoder output, and feed-forward layers; a final fully-conn layer of output size S×V produces the logits of P̂(Y_1),...,P̂(Y_S | Y_{s<S}).]
109
• The self-attention block, pictured on the bot-
tom left of Figure 5.6, is a Multi-Head Attention
layer (see § 4.8), that recombines information
globally, allowing any position to collect infor-
mation from any other positions, preceded by
a layer normalization. This block can be made
causal by using an adequate mask in the atten-
tion layer, as described in § 4.8.
[Figure 5.8: a decoder-only model (such as GPT). The shifted tokens 0,X_1,...,X_{T−1} are embedded (T×D), a positional encoding is added, N blocks of causal self-attention and feed-forward layers are applied, and a final fully-conn layer of output size T×V produces the logits of P̂(X_1),...,P̂(X_T | X_{t<T}).]
Vision Transformer
Transformers have been put to use for image
classification with the Vision Transformer (ViT)
model [Dosovitskiy et al., 2020] (see Figure 5.9).
[Figure 5.9: the Vision Transformer. The M image patches X_1,...,X_M, each flattened to dimension 3P², are encoded with a linear map W^E into E_1,...,E_M and concatenated with an additional token E_0; a positional encoding is added, N blocks of self-attention and feed-forward layers produce Z_0,Z_1,...,Z_M ((M+1)×D), and the first element Z_0 goes through an MLP readout (fully-conn, gelu, fully-conn, gelu, fully-conn) to compute the C logits.]
The first element of the resulting sequence, which corresponds to the additional token E_0, is processed by a two-hidden-layer MLP to get the final C
logits. Such a token, added for a readout of a
class prediction, was introduced by Devlin et al.
[2018] in the BERT model and is referred to as a
CLS token.
PART III
Applications
Chapter 6
Prediction
6.1 Image denoising
A direct application of deep models to image
processing is to recover from degradation by
utilizing the redundancy in the statistical struc-
ture of images. The petals of a sunflower in a
grayscale picture can be colored with high confi-
dence, and the texture of a geometric shape such
as a table on a low-light, grainy picture can be
corrected by averaging it over a large area likely
to be uniform.
with a lossy compression method.
6.2 Image classification
Image classification is the simplest strategy for
extracting semantics from an image and consists
of predicting a class from a finite, predefined
number of classes, given an input image.
6.3 Object detection
A more complex task for image understanding is
object detection, in which the objective is, given
an input image, to predict the classes and posi-
tions of objects of interest.
Figure 6.2: Examples of object detection with the Single-
Shot Detector [Liu et al., 2015].
receptive field, that is larger than this square but
centered on it. This results in a non-ambiguous
matching of any bounding box (x1, x2, y1, y2) to an (s, h, w), determined respectively by max(x2 − x1, y2 − y1), (y1 + y2)/2, and (x1 + x2)/2.
that task involves the regression of geometric
quantities.
6.4 Semantic segmentation
The finest-grain prediction task for image under-
standing is semantic segmentation, which con-
sists of predicting, for each pixel, the class of the
object to which it belongs. This can be achieved
with a standard convolutional neural network
that outputs a convolutional map with as many
channels as classes, carrying the estimated logits
for every pixel.
Figure 6.3: Semantic segmentation results with the
Pyramid Scene Parsing Network [Zhao et al., 2016].
backbone, concatenate the resulting multi-scale
representation after upscaling, before making
the final per-pixel prediction [Zhao et al., 2016].
6.5 Speech recognition
Speech recognition consists of converting a
sound sample into a sequence of words. There
have been plenty of approaches to this problem
historically, but a conceptually simple and recent
one proposed by Radford et al. [2022] consists of
casting it as a sequence-to-sequence translation
and then solving it with a standard attention-
based Transformer, as described in § 5.3.
This approach allows leveraging extremely large
datasets that combine multiple types of sound
sources with diverse ground truths.
6.6 Text-image representations
A powerful approach to image understanding
consists of learning consistent image and text
representations, such that an image, or a textual
description of it, would be mapped to the same
feature vector.
6.7 Reinforcement learning
Many problems, such as strategy games or
robotic control, can be formalized with a discrete-
time state process St and reward process Rt that
can be modulated by choosing actions At. If
St is Markovian, meaning that it carries alone
as much information about the future as all the
past states until that instant, such an object is a
Markovian Decision Process (MDP).
A standard objective is then to find a policy for choosing the actions that maximizes the expected return

E[ Σ_{t≥0} γ^t R_t ],

where γ ∈ (0, 1) is a discount factor.
The optimal state-action value function Q(s, a), the expected return when taking action a in state s and acting optimally afterward, satisfies the Bellman equation:

Q(s, a) = E[ R_t + γ max_{a′} Q(S_{t+1}, a′) | S_t = s, A_t = a ].    (6.1)
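As an illustration, a minimal tabular Q-learning sketch that nudges an estimate of Q toward the right-hand side of Equation 6.1; deep methods such as DQN replace the table with a deep network, and the environment interface, sizes, and hyper-parameters below are arbitrary:

```python
# Minimal tabular Q-learning sketch: move Q[s, a] toward the bootstrapped
# target r + gamma * max_a' Q[s', a'] given an observed transition.
import numpy as np

def q_learning_step(Q, s, a, r, s_next, gamma=0.99, lr=0.1):
    target = r + gamma * Q[s_next].max()     # right-hand side of the Bellman equation
    Q[s, a] += lr * (target - Q[s, a])       # move the estimate toward the target
    return Q

Q = np.zeros((10, 4))                        # 10 states, 4 actions (illustrative)
Q = q_learning_step(Q, s=3, a=1, r=1.0, s_next=4)
```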
Chapter 7
Synthesis
7.1 Text generation
The standard approach to text synthesis is to
use an attention-based, autoregressive model. A
very successful model, proposed by Radford et al. [2018], is the GPT, which we described in § 5.3.
7.2 Image generation
Multiple deep methods have been developed to
model and sample from a high-dimensional den-
sity. A powerful approach for image synthesis
relies on inverting a diffusion process.
This diffusion process exponentially reduces the importance of x_0, and x_t's density can rapidly be approximated with a normal.
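A sketch of one common form of the forward process, a DDPM-style Gaussian diffusion, which is an assumption here rather than the exact formulation of the text; it illustrates how x_t rapidly becomes close to a standard normal sample:

```python
# Sketch (assumed DDPM-style forward process): each step mixes the previous
# sample with Gaussian noise, so x_t forgets x_0 and approaches N(0, I).
import torch

def diffuse(x0, T=1000, beta=2e-2):
    x = x0
    for _ in range(T):
        x = (1 - beta) ** 0.5 * x + beta ** 0.5 * torch.randn_like(x)
    return x                      # approximately a standard normal sample

x0 = torch.rand(3, 64, 64)        # an "image" with values in [0, 1]
xT = diffuse(x0)
print(xT.mean().item(), xT.std().item())   # close to 0 and 1
```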
The missing bits
Autoencoder
An autoencoder is a model that maps an input
signal, possibly of high dimension, to a low-
dimension latent representation, and then maps
it back to the original signal, ensuring that infor-
mation has been preserved. We saw it in § 6.1
for denoising, but it can also be used to auto-
matically discover a meaningful low-dimension
parameterization of the data manifold.
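A minimal sketch of such a model with arbitrary dimensions, trained by minimizing a reconstruction error:

```python
# Minimal sketch of an autoencoder: an encoder maps the signal to a
# low-dimension latent representation, and a decoder maps it back.
import torch
from torch import nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 16))
decoder = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.rand(32, 784)                     # e.g. flattened 28x28 images
x_rec = decoder(encoder(x))                 # reconstruction
loss = ((x_rec - x) ** 2).mean()            # trained to preserve the information
```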
Generative Adversarial Networks

A Generative Adversarial Network combines a generator, which takes as input a random value following a fixed distribution and
produces a structured signal such as an image,
and a discriminator, which takes a sample as
input and predicts whether it comes from the
training set or if it was generated by the genera-
tor.
Graph Neural Networks

Graph Neural Networks process data structured as graphs. These models are composed of layers that com-
pute activations at each vertex by combining
linearly the activations located at its immediate
neighboring vertices. This operation is very sim-
ilar to a standard convolution, except that the
data structure does not reflect any geometrical
information associated with the feature vectors
they carry.
Self-supervised training
As stated in § 7.1, even though they are trained
only to predict the next word, Large Language
Models trained on large unlabeled datasets such
as GPT (see § 5.3) are able to solve various tasks,
such as identifying the grammatical role of a
word, answering questions, or even translating
from one language to another [Radford et al.,
2019].
Self-supervised objectives can also be designed when no such labeled dataset exists. In computer vision, for instance,
image features can be optimized so that they are
invariant to data transformations that do not
change the semantic content of the image, while
being statistically uncorrelated [Zbontar et al.,
2021].
Bibliography
A. Gomez, M. Ren, R. Urtasun, and R. Grosse. The Reversible Residual Network: Backpropagation Without Storing Activations. CoRR, abs/1707.04585, 2017.
S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning (ICML), 2015.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
L. Ouyang, J. Wu, X. Jiang, et al. Training language models to follow instructions with human feedback. CoRR, abs/2203.02155, 2022.
Index
1D convolution, 65
2D convolution, 65
activation, 23, 41
function, 70, 98
map, 68
Adam, 39
affine operation, 60
artificial neural network, 8, 11
attention operator, 87
autoencoder, 146
denoising, 117
Autograd, 42
autoregressive model, see model, autoregressive
average pooling, 75
backpropagation, 42
backward pass, 42
basis function regression, 14
batch, 21, 38
batch normalization, 79, 103
Bellman equation, 134
bias vector, 60, 66
BPE, see Byte Pair Encoding
Byte Pair Encoding, 34, 128
cache memory, 21
capacity, 16
causal, 32, 89, 110
model, see model, causal
chain rule (derivative), 40
chain rule (probability), 30
channel, 23
checkpointing, 43
classification, 18, 26, 100, 119
CLIP, see Contrastive Language-Image
Pre-training
CLS token, 114
computational cost, 43
Contrastive Language-Image Pre-training, 130
contrastive loss, 27, 130
convnet, see convolutional network
convolution, 65
convolutional layer, see layer, convolutional
convolutional network, 100
cross-attention block, 92, 108, 110
cross-entropy, 27, 31, 45
denoising autoencoder, see autoencoder,
denoising
density modeling, 18
depth, 41
diffusion process, 141
dilation, 66, 73
discriminator, 147
downscaling residual block, 105
DQN, see Deep Q-Network
dropout, 76, 90
padding, 66, 73
parameter, 12
meta, 13, 35, 48, 65, 66, 73, 90, 94
parametric model, see model, parametric
peak performance, 22
perplexity, 31
policy, 133
optimal, 133
pooling, 73
positional encoding, 95, 110
posterior probability, 26
pre-trained model, see model, pre-trained
prompt, 138, 139
query, 87
random initialization, 61
receptive field, 67, 123
rectified linear unit, 70, 145
recurrent neural network, 145
regression, 18
Reinforcement Learning, 133, 140
Reinforcement Learning from Human Feedback,
140
ReLU, see rectified linear unit
residual
block, 103
connection, 83, 102
network, 47, 83, 102
ResNet-50, 102
return, 133
reversible layer, see layer, reversible
RL, see Reinforcement Learning
RLHF, see Reinforcement Learning from Human
Feeback
RNN, see recurrent neural network
scaling laws, 51
self-attention block, 92, 108, 110
self-supervised learning, 148
semantic segmentation, 85, 125
SGD, see stochastic gradient descent
Single Shot Detector, 120
skip connection, 83, 126, 145
softargmax, 26, 88
softmax, 26
speech recognition, 128
SSD, see Single Shot Detector
stochastic gradient descent, 38, 45, 51
stride, 66, 73
supervised learning, 19
underfitting, 16
universal approximation theorem, 98
unsupervised learning, 19
weight, 13
decay, 28
matrix, 60
This book is licensed under the Creative Com-
mons BY-NC-SA 4.0 International License.