Demystifying Deep Learning
IEEE Press
445 Hoes Lane
Piscataway, NJ 08854
Douglas J. Santry
University of Kent, United Kingdom
Copyright © 2024 by The Institute of Electrical and Electronics Engineers, Inc.
All rights reserved.
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any
form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise,
except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without
either the prior written permission of the Publisher, or authorization through payment of the
appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers,
MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to
the Publisher for permission should be addressed to the Permissions Department, John Wiley &
Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at
http://www.wiley.com/go/permission.
Trademarks: Wiley and the Wiley logo are trademarks or registered trademarks of John Wiley &
Sons, Inc. and/or its affiliates in the United States and other countries and may not be used
without written permission. All other trademarks are the property of their respective owners.
John Wiley & Sons, Inc. is not associated with any product or vendor mentioned in this book.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best
efforts in preparing this book, they make no representations or warranties with respect to the
accuracy or completeness of the contents of this book and specifically disclaim any implied
warranties of merchantability or fitness for a particular purpose. No warranty may be created or
extended by sales representatives or written sales materials. The advice and strategies contained
herein may not be suitable for your situation. You should consult with a professional where
appropriate. Further, readers should be aware that websites listed in this work may have
changed or disappeared between when this work was written and when it is read. Neither the
publisher nor authors shall be liable for any loss of profit or any other commercial damages,
including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please
contact our Customer Care Department within the United States at (800) 762-2974, outside the
United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in
print may not be available in electronic formats. For more information about Wiley products,
visit our web site at www.wiley.com.
Contents
1 Introduction 1
1.1 AI/ML – Deep Learning? 5
1.2 A Brief History 6
1.3 The Genesis of Models 9
1.3.1 Rise of the Empirical Functions 9
1.3.2 The Biological Phenomenon and the Analogue 13
1.4 Numerical Computation – Computer Numbers Are Not ℝeal 14
1.4.1 The IEEE 754 Floating Point System 15
1.4.2 Numerical Coding Tip: Think in Floating Point 18
1.5 Summary 20
1.6 Projects 21
4 Training Classifiers 73
4.1 Backpropagation for Classifiers 73
4.1.1 Likelihood 74
4.1.2 Categorical Loss Functions 75
4.2 Computing the Derivative of the Loss 77
4.2.1 Initiate Backpropagation 80
4.3 Multilabel Classification 81
4.3.1 Binary Classification 82
4.3.2 Training A Multilabel Classifier ANN 82
4.4 Summary 84
4.5 Projects 85
9 Vistas 175
9.1 The Limits of ANN Learning Capacity 175
9.2 Generative Adversarial Networks 177
9.2.1 GAN Architecture 178
9.2.2 The GAN Loss Function 180
9.3 Reinforcement Learning 183
9.3.1 The Elements of Reinforcement Learning 185
9.3.2 A Trivial RL Training Algorithm 187
9.4 Natural Language Processing Transformed 193
9.4.1 The Challenges of Natural Language 195
9.4.2 Word Embeddings 195
9.4.3 Attention 198
9.4.4 Transformer Blocks 200
9.4.5 Multi-Head Attention 204
9.4.6 Transformer Applications 205
9.5 Neural Turing Machines 207
9.6 Summary 210
9.7 Projects 210
Glossary 221
References 229
Index 243
Acronyms
AI artificial intelligence
ANN artificial neural network
BERT bidirectional encoder representations from transformers
BN Bayesian network
BPG backpropagation
CNN convolutional neural network
CNN classifying neural network
DL deep learning
FFFC feed forward fully connected
GAN generative adversarial network
GANN generative artificial neural network
GPT generative pre-trained transformer
LLM large language model
LSTM long short term memory
ML machine learning
MLE maximum likelihood estimator
MSE mean squared error
NLP natural language processing
RL reinforcement learning
RNN recurrent neural network
SGD stochastic gradient descent
1 Introduction
Interest in deep learning (DL) is increasing every day. It has escaped from the
research laboratories and become a daily fact of life. The achievements and poten-
tial of DL are reported in the lay news and form the subject of discussion at dinner
tables, cafes, and pubs across the world. This is an astonishing change of fortune
considering the technology upon which it is founded was pronounced a research
dead end in 1969 (131) and largely abandoned.
The universe of DL is a veritable alphabet soup of bewildering acronyms. There
are artificial neural networks (ANN)s, RNNs, LSTMs, CNNs, Generative Adversar-
ial Networks (GAN)s, and more are introduced every day. The types and applica-
tions of DL are proliferating rapidly, and the acronyms grow in number with them.
As DL is successfully applied to new problem domains this trend will continue.
Since 2015 the number of artificial intelligence (AI) patents filed per annum has
been growing at a rate of 76.6% and shows no signs of slowing down (169). The
growth rate speaks to the increasing investment in DL and suggests that it is still
accelerating.
DL is based on ANNs. Often just "neural networks" is written, with the "artificial"
implied. ANNs attempt to mathematically model biological assemblies of neurons.
The initial goal of research into ANNs was to realize AI in a computer. The motiva-
tion and means were to mimic the biological mechanisms of cognitive processes in
animal brains. This led to the idea of modeling the networks of neurons in brains.
If biological neural networks could be modeled accurately with mathematics, then
computers could be programmed with the models. Computers would then be able
to perform tasks that were previously thought only possible by humans; the dream
of the electronic brain was born (151). Two problem domains were of particular
interest: natural language processing (NLP), and image recognition. These were
areas where brains were thought to be the only viable instrument; today, these
applications are only the tip of the iceberg.
Figure 1.1 Examples of GAN-generated cats. The matrix on the left contains examples
from the training set. The matrix on the right contains GAN-generated cats. The cats on the
right do not exist. They were generated by the GAN. Source: Karras et al. (81).
example of a generative language model. Images and videos can also be generated.
A GANN can draw an image; this is very different from learning to recognize
an image. A powerful means of building GANNs is with GANs (50); again, very
much an alphabet soup. As an example, a GAN can be taught impressionist
painting by training it with pictures by the impressionist masters. The GAN will
then produce a novel painting very much in the genre of impressionism. The
quality of the images generated is remarkable. Figure 1.1 displays an example
of cats produced by a GAN (81). The GAN was trained to learn what cats look
like and produce examples. The object is to produce photorealistic synthetic
cats. Products such as Adobe Photoshop have included this facility for general
use by the public (90). In the sphere of video and audio, GANs are producing
the so-called “deep fake” videos that are of very high quality. Deep fakes are
becoming increasingly difficult for humans to detect. In the age of information
war and disinformation, the ramifications are serious. GANs are performing
tasks at levels undreamt of a few decades ago; the quality can be striking, even
troubling. As new applications are identified for GANs the resources dedicated
to improving them will continue to grow and produce ever more spectacular
results.
3 The word, learn, is in bold in Mitchell’s text. The author clearly wished to emphasize the
nature of the exercise.
1.2 A Brief History
In the 1930s, Alonzo Church had described his Lambda Calculus model of computation
(21), and his student, Alan Turing, had defined his Turing Machine4 (152), both
formal models of computation. The age of modern computation was dawning.
Warren McCulloch and Walter Pitts wrote a number of papers that proposed arti-
ficial neurons to simulate Turing machines (164). Their first paper was published
in 1943. They showed that artificial neurons could implement logic and arithmetic
functions. Their work hypothesized networks of artificial neurons cooperating to
implement higher-level logic. They did not implement or evaluate their ideas, but
researchers had now begun thinking about artificial neurons.
Donald Hebb, an eminent psychologist, wrote a book in 1949 postulating a learn-
ing rule for artificial neurons (65). It is a supervised learning rule. While the rule
itself is numerically unstable, the rule contains many of the ingredients of modern
ANNs. Hebb's neurons computed state based on the scalar product and weighted
the connections between the individual neurons. Connections between neurons
were reinforced based on use. While modern learning rules and network topolo-
gies are different, Hebb’s work was prescient. Many of the elements of modern
ANNs are recognizable such as a neuron’s state computation, response propaga-
tion, and a general network of weighted connections.
The next step to modern ANNs was Frank Rosenblatt’s perceptron (130). Rosen-
blatt published his first paper in 1958. Building on Hebb’s neuron, he proposed an
updated supervised learning rule called the perceptron rule. Rosenblatt was inter-
ested in computer vision. His first implementation was in software on an IBM 704
mainframe (it had 18 k of memory!). Perceptrons were eventually implemented in
hardware. The machine was a contemporary marvel fitted with an array of 20 × 20
cadmium sulfide photocells used to create a 400 pixel input image. The New York
Times reported it with the headline, “Electronic Brain Teaches Itself.” Hebb’s neu-
ron state was improved with the introduction of a bias, an innovation still very
important today. Perceptrons were capable of learning linear decision boundaries,
that is, the categories of classification had to be linearly separable.
The next milestone was a paper by Widrow and Hoff in 1960 that proposed a
new learning rule, the delta rule. It was more numerically stable than the percep-
tron learning rule. Their research system was called ADALINE (15) and used least
squares to train the network. Like Rosenblatt’s early work, ADALINE was imple-
mented in hardware with memistors. The follow-up system, MADALINE (163),
included multiple layers of perceptrons, another step toward modern ANNs. It
suffered from a similar limitation as Rosenblatt’s perceptrons in that it could only
address linearly separable problems; it was a composition of linear classifiers.
In 1969, Minsky and Papert published a book that cast a pall over ANN
research (106). They demonstrated that ANNs, as they were understood at that
point, suffer from an inherent limitation. It was argued that ANNs could never
solve "interesting" problems; but the assertion was based on the assumption
that ANNs could never practically handle nonlinear decision boundaries. They
famously used the example of the XOR logic gate. As the XOR truth table could
not be learnt by an ANN, and XOR is a trivial concept when compared to image
recognition and other applications, they concluded that the latter applications
were not appropriate. As most interesting problems are nonlinear, including
vision and NLP, they concluded that the ANN was a research dead end. Their
book had the effect of chilling research in ANNs for many years as the AI com-
munity accepted their conclusion. It coincided with a general reassessment of the
practicality of AI research in general and the beginning of the first “AI Winter.”
The fundamental problem facing ANN researchers was how to train multiple
layers of an ANN to solve nonlinear problems. While there were multiple inde-
pendent developments, Rumelhart, Hinton, and Williams are generally credited
with the work that described the backpropagation of error algorithm in the context
of training ANNs (34). This was published in 1986. It is still the basis of train-
ing today. Backpropagation of error is the basis of the majority of modern ANN
training algorithms. Their method provided a means of training ANNs to learn
nonlinear problems reliably.
It was also in 1986 that Rina Dechter coined the term, “Deep Learning” (30).
The usage was not what is meant by DL today. She was describing a backtracking
algorithm for theorem proving with Prolog programs.
The confluence of two trends, the dissemination of the backpropagation algo-
rithm and the advent of widely available workstations, led to unprecedented exper-
imentation and advances in ANNs. By 1989, in a space of just 3 years, ANNs had
been successfully trained to recognize hand-written digits in the form of postal
codes from the United States Postal Service. This feat was achieved by a team led
by Yann Lecun at AT&T Labs (91). The work had all the recognizable features of
DL, but the term had not yet been applied to neural networks in that sense. The
system would evolve into LeNet-5, a classic DL model. The renewed interest in
ANN research has continued unbroken down to this day. In 2006, Hinton et al.
described a multi-layered belief network that was described as a “Deep Belief Net-
work,” (67). The usage arguably led to referring to deep neural networks as DL.
The introduction of AlexNet in 2012 demonstrated how to efficiently use GPUs
to train DL models (89). AlexNet set records in image recognition benchmarks.
Since AlexNet, DL models have dominated most machine learning applications; it
has heralded the DL Age of machine learning.
We leave our abridged history here and conclude with a few thoughts. As
the computing power required to train ANNs grows ever cheaper, access to
the resources required for research becomes more widely available. The IBM
Supercomputer, ASCI White, cost US$110 million in 2001 and occupied a special
purpose room. It had 8192 processors for a total of 123 billion transistors with a
peak performance of 12.3 TFLOPS.5 In 2023, an Apple Mac Studio costs US$4000,
contains 114 billion transistors, and offers peak performance of 20 TFLOPS. It sits
quietly and discreetly on a desk. In conjunction with improvements in hardware,
there is a change in the culture of disseminating results. The results of research
are proliferating in an ever more timely fashion.6 The papers themselves are also
recognizing that describing the algorithms is not the only point of interest. Papers
are including experimental methodology and setup more frequently, making
it easier to reproduce results. This is made possible by ever cheaper and more
powerful hardware. Clearly, the DL boom has just begun.
ẍ = g. (1.2)
Integrating with respect to time gives the velocity,
ẋ = ∫ g dt = gt, (1.3)
which in turn can be integrated to produce the desired model, t = f (h),
h ≡ x = (g∕2) t² ⟹ t = √(2h∕g) = f (h). (1.4)
2 g
This yields an analytical solution obtained from the constraint, which was
obtained from a natural law. Of course this is a very trivial example, and often an
analytical solution is not available. Under those circumstances, the modeler must
resort to numerical methods to solve the equations, but it illustrates the historical
approach.
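A minimal sketch in Python, evaluating the model of Eq. (1.4) for an illustrative height (the height and the value of g are assumptions of the sketch, not values from the text):

import math

def fall_time(h, g=9.81):
    # t = sqrt(2h/g), the analytical model of Eq. (1.4); h in meters, g in m/s^2
    return math.sqrt(2.0 * h / g)

print(fall_time(20.0))   # about 2.02 seconds for an illustrative 20 m drop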
Figure 1.2 The graph of t = f (h) is plotted with empty circles as points. The ANN(h)'s
predictions are crosses plotted over them. The points are from the training dataset. The ANN
seems to have learnt the function.
the use of ANNs to make predictions that are more accurate (22). There are many
more examples.
None of this is to say that DL software is inappropriate for use or not fit for
purpose, quite the contrary, but it is important to have some perspective on the
nature of the simulation and the fundamental differences.
8 The “C” language types float and double often correspond with the 32-bit and 64-bit types
respectively, but it is not a language requirement. Python’s float is the double-precision type
(64-bit).
9 The oldest number system that we know about, Sumerian (c. 3,000 BC), was sexagesimal, base
60, and survives to this day in the form of minutes and seconds.
1 | 01100110 | 01011011111001000111001
Figure 1.3 IEEE-754 representation of the value −4.050000 ⋅ 10⁻⁸, shown as the sign bit, the 8
exponent bits, and the 23 mantissa bits. The exponent is biased and centered about 127, and the
mantissa is assumed to have a leading "1."
The integers are straightforward, but representing real numbers requires more
elaboration. The correct root was written in scientific notation, −4.050000 ⋅ 10⁻⁸.
There are three distinct elements in this form of a number. The mantissa, or sig-
nificand, is the sequence of significant digits, and its length is the precision, 8 in
this case. It is written with a single digit to the left of the decimal point and mul-
tiplied to obtain the correct order of magnitude. This is done by raising the radix,
10 in this case, to the power of the exponent, −8. The IEEE-754 format for 32-bit
floating point values encodes these values to represent a number, see Figure 1.3.
So what can be represented with this system?
Consider a decimal number, abc.def; each position represents an order of mag-
nitude. For decimal numbers, the positions represent:
100 + 10 + 1 + 1∕10 + 1∕100 + 1∕1000,
while for binary numbers they represent:
4 + 2 + 1 + 1∕2 + 1∕4 + 1∕8.
Some floating point examples are 0.5₁₀ = 0.1₂ and 0.125₁₀ = 0.001₂. So far so
good, but what of something "simple" such as 0.1₁₀? Decimal 0.1 is represented
as 0.0 0011 0011 0011 …₂, where the group 0011 is repeated ad infinitum. This
can be written in scientific notation as 1.1001 1001 …₂ ⋅ 2⁻⁴. Using the 32-bit IEEE encod-
ing its representation is 00111101110011001100110011001101. The first bit is the
sign bit. The following 8 bits form the exponent and the remaining 23 bits com-
prise the mantissa. There are two seeming mistakes. First, the exponent is 123. For
efficiently representing normal numbers, the IEEE exponent is biased, that is, cen-
tered about 127 ∶ 127 − 4 = 123. The second odd point is in the mantissa. As the
first digit is always one it is implied, so the encoding of the mantissa starts at the
first digit to the right of the first 1 of the binary representation, so effectively there
can be 24 bits of precision. The programmer does not need to be aware of this – it all
happens automatically in the implementation and the computer language (such
as C++ and Python). Converting IEEE 754 back to decimal, we get, 0.100000001.10
10 What is 10% of a billion dollars? This is a sufficiently serious problem for financial software
that banks often use specialized routines to ensure that the money is correct.
Observe that even a simple number like 1/10 cannot be represented exactly with IEEE
32-bit values, just as 1/3 is difficult for decimal: 0.333…₁₀.
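The bit pattern can be inspected directly. A minimal sketch in Python (the struct-based decoding is an illustration, not something prescribed by the text):

import struct

# Round 0.1 to the nearest 32-bit IEEE 754 value and inspect its bits.
bits = struct.unpack('>I', struct.pack('>f', 0.1))[0]
print(f'{bits:032b}')                  # 00111101110011001100110011001101
sign     = bits >> 31
exponent = (bits >> 23) & 0xFF
mantissa = bits & 0x7FFFFF
print(sign, exponent, exponent - 127)  # 0 123 -4: the biased exponent encodes 2**-4
print(struct.unpack('>f', struct.pack('>f', 0.1))[0])   # 0.10000000149011612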
Let 𝔽 ⊂ ℝ be the set of IEEE single-precision floating point numbers. Being
finite 𝔽 will have minimum and maximum elements. They are 1.17549435E-38
and 3.40282347E+38, respectively. Any operation that strays outside of that range
will not have a result. Values less than the minimum are said to underflow, and
values that exceed the maximum are said to overflow. Values within the supported
range are known as normal numbers. Even within the allowed range, the set is
not continuous and a means of mapping values from ℝ onto 𝔽 is required, that
is, we need a function, fl(x) ∶ ℝ → 𝔽 , and the means prescribed by the IEEE 754
standard is rounding.
All IEEE arithmetical operations are performed with extra bits of precision.
This ensures that a computation will produce a value that is superior to the
bounds. The result is rounded to the nearest element of 𝔽 , with ties going
to the even value. Specifying ties may appear to be overly prescriptive, but
deterministic computational results are very important. IEEE offers 4 rounding
modes,11 but rounding to nearest value in 𝔽 is usually the default. Rounding
error is subject to precision. Given a real number, 1.0, what is the next largest
number? There is no answer. There is an answer for floating point numbers,
and this gap is the machine epsilon, or unit roundoff. For the double precision,
IEEE-754 standard the machine epsilon is 2.2204460492503131e-16. The width
of a proton is 8.83e-16 m, so this is quite small (computation at that scale would
choose a more appropriate unit than meters, such as Angstroms, but this does
demonstrate that double precision is very useful). The machine epsilon gives the
programmer an idea of the error when results are rounded. Denote the machine
epsilon as u. The rounding error is |fl(x) − x| ≤ (1∕2)u. This quantity can be used to
calculate rigorous bounds on the accuracy of computations and algorithms when
required.
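A minimal sketch in Python illustrating the double-precision machine epsilon:

import sys

eps = sys.float_info.epsilon
print(eps)                   # 2.220446049250313e-16
print(1.0 + eps == 1.0)      # False: 1.0 + eps is the next representable double after 1.0
print(1.0 + eps / 4 == 1.0)  # True: an increment below half an epsilon is rounded away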
Let us revisit our computation of the smallest root, which was done in double
precision.
p was very close to the result of the square root. For our values of p and
q, p ≈ √(p² + q), and so computing p − √(p² + q) (p minus almost p) canceled out all of
the information in the result. This effect is known as catastrophic cancellation,
but it is widely misunderstood. A common misconception is that it is a bad
practice to subtract floating point numbers that are close together, or add floating
point numbers that are far apart, but that is not necessarily true. In this case, the
subtraction has merely exposed an earlier problem that no longer has anywhere
to hide. The square root is 12,345,678.000000040500003, but it was rounded to
The choice of 𝜖 will be a suitably small value that the application can toler-
ate. In general, comparison of a computed floating point value should be done
as abs(𝛼 − x), where x is the computed quantity and 𝛼 is the quantity that is being
tested for. Note that printing the contents of I to the screen may appear to produce
the exact values of zero and one, but print format routines, the routines whose job
is to convert IEEE variables to a printable decimal string, do all kinds of things and
often mislead.
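A minimal sketch of such a comparison (the tolerance is an illustrative application choice):

def close_to(x, alpha, eps=1e-9):
    # Is the computed value x within eps of the target alpha?
    return abs(alpha - x) <= eps

total = sum([0.1] * 10)
print(total == 1.0)           # False on most platforms: the sum is 0.9999999999999999
print(close_to(total, 1.0))   # True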
A great contribution of the IEEE system is its quality as a progressive system.
Many computer architectures used to treat floating point errors, such as division
by zero, as terminal. The program would be aborted. IEEE systems continue to
make progress following division by zero and simply record the error condition in
the result. Indeterminate operations such as 0∕0 and √−1 result in the special nonnormal value
"not a number" (NaN) recorded in the result. Overflow, and nonzero values divided by zero, are recorded
as the two special values, ±∞ (combining them, as in ∞ − ∞, produces a NaN). These error
values need to be detected because if they are used, then the error will contaminate
all further use of the tainted value. Often there is a semantic course of action that
can be adopted to fix the error before proceeding. They are notoriously difficult to
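A minimal sketch in Python (note that Python itself raises an exception for float division by zero, so the sketch provokes an overflow instead):

import math

x = 1e308 * 10        # overflow: the error is recorded as +inf rather than aborting
y = x - x             # inf - inf has no meaningful value: NaN
print(x, y)                           # inf nan
print(math.isinf(x), math.isnan(y))   # True True
print(y + 1.0)        # nan: a NaN taints every further use of the value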
1.5 Summary
1.6 Projects
In Chapter 1, it was stated that deep learning (DL) models are based on artificial
neural networks (ANNs). In this chapter, deep learning will be defined more pre-
cisely, although the definition remains quite loose. This will be done by connecting deep learning
to ANNs more concretely. It was also claimed that ANNs can be interpreted as
programmable functions. In this chapter, we describe what those functions look
like and how ANNs compute values. Like a function, an ANN accepts inputs and com-
putes an output. How an ANN turns the input into the output is detailed. We also
introduce the notation and abstractions that we use in the rest of the text.
A deep learning model is a model built with an artificial neural network, that is,
a network of artificial neurons. The neurons are perceptrons. Networks require a
topology. The topology is specified as a hyperparameter to the model. This includes
the number of neurons and how they are connected. The topology determines how
information flows in the resulting ANN configuration and some of the properties
of the ANN. In broad terms, ANNs have many possible topologies, and indeed
there are an infinite number of possibilities. Determining good topologies for a
particular problem can be challenging; what works in one problem domain may
not (probably will not) work in a different domain. ANNs have many applications
and come in many types and sizes. They can be used for classification, regression,
and even generative purposes such as producing a picture. The different domains
often have different topologies dictated by the application. Most ANN applications
employ domain-specific techniques and adopt trade-offs to produce the desired
result. For every application, there are a myriad of ANN configurations and param-
eters, and what works well for one application may not work for others. If this
seems confusing – it is (107; 161). A good rule of thumb is to keep it as simple as
possible (93).
This section presents the rudiments of ANN topology. One topology in particu-
lar, the feed-forward fully-connected (FFFC) topology, is adopted. There is
no loss of generality as all the principles and concepts presented still apply to
other topologies. The focus on a simple topology lends itself to clearer explana-
tions. To make matters more concrete, we begin with a simple example presented
in Figure 2.1. We can see that an ANN is comprised of neurons (nodes), connec-
tions, and many numbers. Observing the figure, it is clear that we can interpret an
ANN as a directed graph, G(N, E). The nodes, or vertices, of the graph are neurons.
Neurons that communicate immediately are connected with a directed edge. The
direction of the edge determines the flow of the signal.
The nodes in the graph are neurons. The neuron abstraction is at the heart of
the ANN. Neurons in ANNs are generally perceptrons. Information, that is, sig-
nals, flow through the network along the directed edges through the neurons. The
arrow indicates that a signal is coming from a source neuron and going to a target
Figure 2.1 A trained ANN that has learnt the sine function. The circles, graph nodes, are
neurons. The arrows on the edges determine which direction the communication
flows.
neuron. There are no rules respecting the number of edges. Most neurons in an
ANN are both a source and a target. Any neuron that is not a source is an output
neuron. Any neuron that is not a target is an input neuron. Input neurons are the
entry point to the graph, and the arguments to the ANN are supplied there. Out-
put neurons provide the final result of the computation. Thus, given ŷ = ANN(x),
x goes to the input neurons and ŷ is read from the output neurons. In the example,
x is the angle, and ŷ is the sine of x.
Each neuron has an independent internal state. A neuron’s state is computed
from the input signals from connected source neurons. The neuron computes
its internal state to build its own signal, also known as a response. This internal
state is then propagated in turn through the directed edges to its target neurons.
The inceptive stage for the computation is the provision of the input arguments
to the input neurons of the ANN. The input neurons compute their states and
propagate them to their target neurons. This is the first signal that triggers the
rest of the activity. The signals propagate through the network, forcing neurons
to update state along the way, until the signal reaches the output neurons, and
then the computation is finished. Only the state in the output neurons, that is,
the output neurons’ responses, matter to an application as they comprise the
“answer,” ŷ .
Neurons are connected with edges. The edges are weighted by a real number.
The weights determine the behavior of the signals as received by the target
neuron – they are the crux of an ANN. It is the values of the weights that
determine whether an ANN is sine, cosine, eˣ – whatever the desired function.
The weights in Figure 2.1 make the ANN sine. The ANN in Figure 2.2 is a cosine.
The graphs in Figure 2.3 present their respective plots for 32 random points. Both
ANNs have the same topologies, but they have very different weights. It is the
weights that determine what an ANN computes. The topology can be thought
of as supporting the weights by ensuring that there is a sufficient number of them
to solve a problem; this is called the learning capacity of an ANN.
The weights of an ANN are the parameters of the model. The task of training
a neural network is determining the weights, w. This is reflected in the notation
ŷ = ANN(x; w) or ŷ = ANN(x|w) where ŷ is conditioned on the vector of parame-
ters, w. Given a training set, the act of training an ANN is reconciling the weights
in a model with the examples in the training set. The fitted weights should then
produce a model that emits the correct output for a given input. Training sets con-
sisting of sine and cosine produced the ANNs in the trigonometric examples,
respectively. Both sine and cosine are continuous functions. As such, building
models for them is an example of regression: we are explaining observed data from
the past to make predictions in the future.
The graph of an ANN can take many forms. Without loss of generality, but
for clarity of exposition, we choose a particular topology, the fully-connected
Figure 2.2 A trained ANN that has learnt the cosine function. The only differences with
the sine model are the weights. Both ANNs have the same topologies.
Figure 2.3 The output of two ANNs is superimposed on the ground truth for the trained
sine and cosine ANN models. The ANN predictions are plotted as crosses. The empty
circles are taken from the training set and comprise the ground truth.
feed-forward architecture, as the basis for all the ANNs that we will discuss.
The principles are the same for all ANNs, but the simplicity of the feed-forward
topology is pedagogically attractive.
Information in the feed-forward ANN flows acyclically from a single input layer,
through a number of hidden layers, and finally to a single output layer; the signals
are always moving forward through the graph. The ANN is specified as a set of
layers. Layers are sets of related, peer, neurons. A layer is specified as the number
of neurons that it contains, the layer’s width. The number of layers in an ANN is
referred to as its depth. All the neurons in a layer share source neurons, specified
as a layer, and target neurons, again, specified as a layer. All of a layer’s source
neurons form a layer as do its target neurons. There are no intralayer connections.
In the language of graph theory, isolating a layer produces a tripartite graph. Thus,
a layer is sandwiched between a shallower, source neuron layer, and a deeper target
layer.
The set of layers can be viewed as a stack. Consider topology in Figure 2.1, with
respect to the stack analogy, the input layer is the top, or the shallowest layer, and
the output layer is the bottom, or the deepest layer. The argument is supplied to
the input layer and the answer read from the output layer. The layers between the
input and output layers are known as hidden layers.
It is the presence of hidden layers that characterizes an ANN as a deep learning
ANN. There is no consensus on how many hidden layers are required to qualify as
deep learning, but the loosest definition is at least 1 hidden layer. A single hidden
layer does not intuitively seem very deep, but its existence in an ANN does put
it in a different generation of model. Rosenblatt’s original implementations were
single layers of perceptrons, but he speculated on deeper arrangements in his book
(130). It was not clear what value multiple layers of perceptrons had given his
linear training methods. Modern deep learning models of 20+ hidden layers are
common, and they continue to grow deeper and wider.
The process of computing the result of an ANN begins with supplying the argu-
ment to the function at the input layer. Every input neuron in the input layer
receives a copy of the full ANN argument. Once every neuron in the input layer
has computed its state with the arguments to the ANN, the input layer is ready
to propagate the result to next layer. As the signals percolate through the ANN,
each layer accepts its source signals from the previous layer, computes the new
state, and then propagates the result to the next layer. This continues until the
final layer is reached; the output layer. The output layer contains the result of
the ANN.
To further simplify our task, we specify that the feed-forward ANN is fully con-
nected, sometimes also called dense. At any given layer, every neuron is connected
to every source neuron in the shallower layer; recursively, this implies that every
neuron in a given layer is a source neuron for the next layer.
Figure 2.4 The sine model in more detail. The layers are labeled. The left is shallowest
and the right is the deepest. Every neuron in a layer is fully connected with its shallower
neighbor. This ANN is specified by the widths of its layers, 3, 2, 1, so the ANN
has a depth of 3.
Let us now reexamine the ANN implementing sine in Figure 2.4 in terms of
layers. We see that there are 3 layers. The first layer, the input layer, has 3 neu-
rons. The input layer accepts the predictors, the inputs of the model. The hidden
layer has 2 neurons, and the output layer has one neuron. There can be as many
hidden layers as desired, and they can also be as wide as needed. The depths and
the widths are the hyperparameters of the model. The number of layers and their
widths should be kept as limited as possible (93). As the depth and widths grow,
the number of weights, that is, trainable parameters, increases rapidly. Too
many weights also lead to other problems that will be examined in later chapters.
As sine is a scalar function, there can only be one output neuron in the ANN's
output layer; that is, where the answer (sine) can be found. Notice, however, that
the number of input neurons is not similarly constrained. Sine has only one pre-
dictor (argument), but observe that there can be any number of neurons in the
input layer. They will each receive a copy of the arguments.
The mechanics of signal propagation form the basis of how ANNs compute.
Having seen how the signals flow in a feed-forward ANN, it remains to examine
what those signals are and how they are computed. This will lead to a recurrence
equation that we will use to succinctly describe the computation. This is the topic
of Section 2.2.
Not just any function can act as an activation function. The activation function plays
an important role when computing the final state. Dot products can produce
arbitrary values. An activation function can tame the dot products. If a specific
range is required then an activation function can be selected accordingly. Two
common requirements are to either map the scalar product into [−1, 1],
or make it strictly positive. Activation functions also add non-linearity to the
neuron's response making it possible to handle more challenging problems. Three
important activation functions are tanh (hyperbolic tangent), sigmoid, and ReLU.
Their curves are depicted in Figure 2.5. While superficially similar, they differ in
important ways. Both tanh and sigmoid tend to squish their domain into a narrow
range. The former is centered about zero and produces a result between [−1, 1].
The sigmoid’s range is [0, 1]. The sigmoid is historically important lending its
symbol, 𝜎, to the notation for activation functions. As we shall see in Chapter
3, all three have important niches. For the moment we examine the sigmoid
function.
The sigmoid was the first activation function (162) and its use was inspired by
biological processes. The idea was that a neuron was either “on” or “off.” This was
abstracted as 0 or 1. While the desired behavior suggests a Heaviside function the
sigmoid was attractive for a number of reasons, and certainly has a number of
advantages. The sigmoid is continuous, differentiable everywhere and non-linear.
As we shall see in Chapter 3, the sigmoid is also very convenient when a derivative
is required. The sigmoid function is defined as:
z = 𝜎(u) = 1∕(1 + e⁻ᵘ). (2.3)
Its range is roughly [0, 1] but its interesting dynamics are in the domain [−3, 3].
As u grows very negative or very positive the sigmoid becomes saturated and the
asymptotic behavior manifests itself; it mimics the Heaviside function and is either
Figure 2.5 Three popular activation functions. The two on the left are superficially
similar, but note the tanh’s range is centered on 0.0 and the sigmoid’s range is centered
on 0.5.
“on” or “off.” Ideally weights, the determiners of u, avoid saturating the activation
functions because adjusting the output becomes more difficult. A neuron that is
fixed in the −10ish domain will always be −1, and may as well be dropped from
the network.
More recently introduced, an important family of activation functions for deep
learning tasks is the rectified linear unit, ReLU. The ReLU is defined as
ReLU(u) = max(0, u). (2.4)
It tends to be deployed in deeper ANNs (ANNs with many hidden layers) for
reasons that will be explained in Section 3.5.4. It adds an element of nonlin-
earity while its derivative has attractive qualities, which is important when train-
ing deep ANNs (42). Saturated weights are less of a problem, but the nonlinearity
is far less pronounced. When introduced to the ANN community, it was used to
set a record, at the time, for depth in an ANN in AlexNet (89) that also resulted in
setting an accuracy record.
We shall refer to the activation function generically as 𝜎(u), where u is the scalar
product of weights and inputs, but unless specified 𝜎 is not a particular activation
function.
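For reference, the three activations can be written down directly; a minimal Python sketch with illustrative probe values:

import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))   # range (0, 1); Eq. (2.3)

def tanh(u):
    return math.tanh(u)                  # range (-1, 1), centered on zero

def relu(u):
    return max(0.0, u)                   # zero for negative u, the identity otherwise

for u in (-3.0, 0.0, 3.0):
    print(u, sigmoid(u), tanh(u), relu(u))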
We have seen that the propagation of signals through an ANN proceeds from layer
to layer, starting from the input. Each layer is comprised of neurons that need to
compute the dot product of its weights with the signal from the previous layer. The
computation can be expressed concisely with matrices.
We can express the propagation, or arrival, of a signal at a layer with a matrix
multiplication. A matrix of weights, W𝓁 , can be constructed by populating each
row with the weights of an individual neuron; the jth row of W𝓁 containing the
weights for the jth neuron. The ith column of W𝓁 corresponds to the weights on the
edges from the ith neuron in the previous layer. Thus the matrix element W𝓁 [j, i] is
the weight for the edge from neuron i in 𝓁−1 to j in 𝓁. Each row in the W𝓁 matrix
encapsulates the weights vector of a neuron in that layer, and every layer has its
own matrix. Note that if the ANN is not fully connected the missing connections
could be represented with zeros in the appropriate matrix entries. The matrix for
the hidden layer of the sine in Figure 2.4 is
Whidden = ⎛ −1.09484   0.74208  1.48851 ⎞ .
           ⎝ −2.76817  −1.37805  1.3605  ⎠
We can now express the first step of computing the states of neurons, the per neu-
ron scalar products of arguments and weights, with a matrix multiplication. The
dot products of all the neurons in a layer can now be written as
u𝓁 = W𝓁 z𝓁−1 + b𝓁 , (2.5)
where u is a vector and b𝓁 the vector of bias weights. In computational linear alge-
bra, this is known as a general Ax + y, or GAXPY operation (49), and there are
many software libraries that implement it efficiently.
Once the per neuron scalar products have been computed the activation func-
tion can be applied and a layer’s output, z𝓁 , computed. This results in
z𝓁 = 𝜎(u𝓁 ) = 𝜎(W𝓁 z𝓁−1 + b𝓁 ). (2.6)
Here the activation function has been applied on a per element basis of its argu-
ment vector, u𝓁 , producing a new vector of the same dimension, the final result
for the layer, z𝓁 .
We can now express the total computation of an ANN concisely. We define ℒ
as the ordered list of the layers of our ANN, that is, { 𝓁1 , … , 𝓁depth }. The computa-
tion of the response for a feed-forward ANN can be expressed with the following
algorithm:
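Starting from the input, each layer applies Eq. (2.5) and then Eq. (2.6) in turn until the output layer is reached. A minimal Python sketch of this loop (NumPy and the sigmoid activation are assumptions of the sketch, not requirements):

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def feed_forward(x, weights, biases):
    # weights[i] is W for layer i and biases[i] is b for layer i, ordered input to output.
    z = np.asarray(x, dtype=float)
    for W, b in zip(weights, biases):
        u = W @ z + b        # Eq. (2.5): the per-layer GAXPY
        z = sigmoid(u)       # Eq. (2.6): element-wise activation
    return z                 # the response of the output layer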
A common convention folds the bias into the weight matrix, which then
includes an extra column. The first column consists of the bias weights for a layer.
For a layer 𝓁−1 with m neurons and a layer 𝓁 with n neurons the weight matrix
looks like:
        ⎛ w1,b  w1,1  w1,2  ···  w1,m ⎞
W𝓁 =    ⎜ w2,b  w2,1  w2,2  ···  w2,m ⎟ .   (2.7)
        ⎜   ⋮                          ⎟
        ⎝ wn,b  wn,1  wn,2  ···  wn,m ⎠
To make this work there is an implied 1.0 in the first position of the input vector,
         ⎛ 1.0 ⎞
z𝓁−1 =   ⎜ z1  ⎟ ,   (2.8)
         ⎜  ⋮  ⎟
         ⎝ zm  ⎠
where the shallower response vector has been prefixed with 1 and the remaining
entries pushed down. The dot product and bias translation can now be written
more concisely as
u𝓁 = W𝓁 z𝓁−1 . (2.9)
Revisiting the example in Figure 2.4 the resulting neural matrix is
Whidden = ⎛ −0.53813  −1.09484   0.74208  1.48851 ⎞ .   (2.10)
           ⎝ −0.65412  −2.76817  −1.37805  1.3605  ⎠
It is just notational sugar for expressing Wz + b. This is proposed for notational con-
venience only; it has no mathematical implications. In the remainder of the text,
Wz can be construed as Wz + b.
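A minimal sketch, assuming NumPy and an illustrative response vector, confirming that the folded form of Eq. (2.9) reproduces Wz + b with the weights of Eq. (2.10):

import numpy as np

W_hidden = np.array([[-0.53813, -1.09484,  0.74208, 1.48851],
                     [-0.65412, -2.76817, -1.37805, 1.3605 ]])   # Eq. (2.10)

z_prev = np.array([0.2, -0.4, 0.9])        # illustrative responses from the shallower layer
z_aug  = np.concatenate(([1.0], z_prev))   # Eq. (2.8): the implied leading 1.0

u_folded   = W_hidden @ z_aug                            # Eq. (2.9)
u_explicit = W_hidden[:, 1:] @ z_prev + W_hidden[:, 0]   # the same thing written as Wz + b
print(np.allclose(u_folded, u_explicit))                 # True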
2.4 Classification
Figure 2.6 A binary diabetes classifier. The predictors are continuous, but the output is
categorical. The grid depicts the decision boundary for a single patient but plotted for
glucose and insulin levels. A black square is a prediction for diabetes and gray is healthy.
1 Released by the National Institute of Diabetes and Digestive and Kidney Diseases.
selected randomly from the training set. Six of the patient’s predictors were kept
constant, while the glucose and insulin values were iterated over to produce the
plot. This results in a graphical projection of the decision boundary with respect
to insulin and glucose. It is clearly more complex than a simple line. In the full
8-dimensional space, the structure is very complicated. The power of deep learn-
ing lies in its ability to find decision boundaries in highly complex feature spaces.
Note that graphing a classifier is very different from a regressor. This is a binary
outcome so color can capture the results.
Figure 2.7 A trained classifier for the Iris dataset. There are 4 predictors and 3 classes.
Each class has an output. The outputs are used by the softmax function to make a
prediction. In this example, virginica is the predicted class.
Problems where an example can be a member of more than one class simultane-
ously are called multilabel problems, and one-hot encoding is not appropriate.
The multilabel problem is framed differently. See Section 4.3 for details.
Categorical features are one-hot encoded vectors. A vector representing a cate-
gory is populated entirely with zeros except for the position corresponding to the
class of x, which will contain a 1. This can remain abstract. The one-hot encoded
vector can, in turn, be encoded simply with the index of the 1 in the vector, for
example, [0, 0, 1] is summarized as 2 – the index of the 1. A one-hot encoded vec-
tor is discrete, so it is a mass function, not a density function, which is continuous,
so the representation is accurate. For applications with many categories, this can
save a great deal of memory and copying.
The iris classifier was built with a famous dataset2 introduced by Ronald Fisher
in 1936 (117), a father of modern statistics and inventor of likelihood (116). Known
as the iris dataset, it is frequently used with introductions to machine learning.
It has 4 continuous predictors and 3 classes; so the classifier looks like, ANN ∶
ℝ4 → ℝ3 . Note the difference between this example and the sine ANN. There are
3 classes so we have 3 output neurons, not the 1 continuous output neuron of
2 It is also sometimes referred to as Anderson’s Iris Data Set as Edgar Anderson collected
the data.
the regression ANN, or the binary classifier. While the number of predictors hap-
pens to correspond with the number of input neurons this is by no means required
(see the sine example above). As it is a dense ANN, we can write the topology as
𝜏 = { 4, 4, 3 }. The ANN has a depth of 3, and there is one hidden layer comprising
4 neurons.
The exponentials in the softmax can overflow
if the logits are large and positive. This is particularly likely at the beginning of
training an ANN. It is a good idea to subtract the largest logit from every element in
the logits vector (z). Translating the logits by the maximum zj achieves two things.
The first is that at least one of the zj will be zero, which means one term in the
denominator’s sum will be 1.0, obviating a division by zero. The second is that the
remaining terms will be negative and so the sum cannot overflow. Underflow and
catastrophic cancelation are not a problem. The class with the maximum value
will produce a 1.0 by construction. The 1 dominates the sum so that class, j, is the
most probable (and the softmax will look like a one-hot encoded vector).
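A minimal sketch of the translated softmax, assuming NumPy; the logits are those of the numerical example below:

import numpy as np

def softmax(z):
    shifted = z - np.max(z)   # the largest logit becomes 0, so no exponential can overflow
    e = np.exp(shifted)
    return e / np.sum(e)

logits = np.array([-2.1, -1.0, 5.2])
p_hat = softmax(logits)
print(p_hat)                   # approximately [0.00067 0.00202 0.9973 ]
print(int(np.argmax(p_hat)))   # 2, i.e. the third class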
Armed with softmax, an algorithm for an ANN classifier can be described
(Algorithm 2.2).
We conclude with a brief numerical example. Let x = (5.6, 2.8, 4.9, 2), an example
from the iris dataset. Then the classifier computes as follows:
1. z = ANN(x) = (−2.1, −1, 5.2)
2. p̂ = softmax((−2.1, −1, 5.2)) = (0.0006737164, 0.002023956, 0.9973023)
3. max(p̂) = 0.9973023
4. the index is 2
5. the predicted class is 𝕂[2] : virginica
All the steps can be carried out inside a model. The application would be obliv-
ious to the individual steps. The model should simply accept the predictors and
return the predicted class to the application.
2.5 Summary
ANNs can be viewed as graphs. A graph can be defined as a list of the widths of
its layers. Layers consist of perceptrons that compute state by evaluating the dot
product of their weights with the input signal and finish with an activation func-
tion. The activation functions introduce nonlinearity. The entire computation can
be expressed with a series of iterative matrix multiplications. Two types of ANNs
have been introduced: regressors and classifiers. Regressors and classifiers differ
in that classifiers are categorical. Classifiers use softmax to map the ANN's logits
to a synthetic probability distribution over the categories.
2.6 Projects
1. Implement Algorithm 2.1 in your favorite computer language. Use the weights
in the sine and cosine examples to test it.
2. Measure the time taken to perform a matrix vector multiplication. Plot a graph
of the time taken as a function of N, the number of rows in a square matrix.
What is the relationship?
3. The binary classifier example in Section 2.4.1 was implemented with a regres-
sor and the sigmoid activation. The outcomes equally shared the space [0, 1].
What would have to change to accommodate the other two activation functions
presented in the chapter?
In Chapter 2 it was shown how ANNs compute values and perform the tasks of
regression and classification. In this chapter, we will learn how ANNs are created.
ANNs depend on good values for weights to produce accurate results. Creating
an ANN is the process of constructing a graph and then finding the appropriate
weights. The latter task is accomplished by training the ANN, the topic of this
chapter.
There are two stages in the life-cycle of an ANN. The stages are training and
inference. Inference formed the subject of Chapter 2; it is the application of the
ANN – the trained model; it has appropriate weights and is ready to be used to
make predictions on data that it has never seen before. Inference is the second
stage of the life-cycle. The first stage is training. A model cannot be used until it
has been trained. The raw clay of the function, a graph with bad weights, must be
molded to fit the constraints of the problem's defining dataset. There are multiple
steps in the course of training and accepting a model, and they will be introduced
to produce a working training framework.
ANNs are defined by their weights. The values of the weights determine the
behavior of an ANN. The weights are found during the process of training the
ANN. Training an ANN requires a dataset. The dataset is used to train the ANN
so it is called the training set. The ANN “learns” from the training set. Training an
ANN consists of fitting weights to the ANN such that the ANN emits the desired
values with respect to the training set. During training, the weights of the ANN are
reconciled with the training data. The correctness of the ANN is measured against
the training set. In this sense, the model (ANN) is “fitted” to the data. This is an
example of supervised learning (105) because the dataset must also include the
answer, also known as the ground truth or labels, so that the ANN’s result can be
verified and its accuracy quantified. The error is used to correct the ANN. Once
the error is sufficiently low, the training is said to have converged. Convergence
will be clarified below.
Ideally the data will make it easy for the scalar products to produce values within
the interesting domains of the activation functions. To make that more likely, two
basic steps of preprocessing are adopted: they are translation and scaling. Both
steps are performed on a per feature basis. These steps are only performed on
numerical features. Translation should ensure that the input data starts centered
on the center of interesting activation dynamics, and scaling ensures that the input
data does not stray outside the interesting activation domain.
Typically, features are centered at 0.0. The first step is to translate the mean of
the distribution of the features to zero. This is done by computing the per feature
mean and then subtracting it element-wise from the feature. Each feature is treated
separately. Iterating over the features, the mean is computed and then subtracted
element-wise. The effect of this translation is that the distribution of every feature
is now centered about zero, that is, has a mean of 0.0.
The second step is to “normalize” the data. Normalization of the data is effected
by scaling it with an appropriate per-feature scalar value. There are a number of
ways of selecting an appropriate scalar. One way is to compute max(|xi|) for
a feature and then divide every element of said feature by the maximum. This
produces values in the feature that are −1 ≤ xi ≤ 1.
Z-scoring is another way to normalize a numerical feature. The standard devia-
tion of the feature is computed and then every element is divided by it. This results
in each element of a feature being the count of the number of standard deviations
that it is away from the mean, which is zero by construction. Z-scoring is usually
preferable to scaling by the maximum.
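A minimal sketch of per-feature translation and Z-scoring, assuming NumPy and illustrative data:

import numpy as np

def fit_standardizer(X):
    # X has shape (examples, features); statistics are computed per feature (per column).
    return X.mean(axis=0), X.std(axis=0)

def standardize(X, mu, sigma):
    return (X - mu) / sigma    # translate to mean 0, then scale by the standard deviation

X_train = np.array([[1.0, 200.0],
                    [2.0, 240.0],
                    [3.0, 280.0]])          # illustrative data
mu, sigma = fit_standardizer(X_train)
print(standardize(X_train, mu, sigma))      # each column now has mean 0 and unit variance
# The same mu and sigma must be reused for any unseen data presented at inference time.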
Normalization is the process of scaling the data to make it more amenable to the
linear transformation that multiplication by the weights matrix effects. Scaling
with either the maximum or standard deviation preserves the relationships
between the data. Moreover, it would appear at first glance that the compact
domain produced by scaling with the maximum is superior. This begs the
question, why is Z-scoring recommended? The answer lies in the application of
the ANN. Training produces a model that has learnt the data, but applications
are usually interested in performing inference with data that the model has
never seen before. Training with the narrow domain [−1, 1] can produce a model
that is not good at dealing with unseen data. The problem is the special nature
of multiplication in that domain: numbers in that subset grow smaller with
multiplication, whereas values outside the subset grow larger.
During training the weights will specialize for the former case. Unseen
data, even after pre-processing, may fall outside the subset [−1, 1]. The values
will grow larger with multiplication, that is, behave completely differently from the
training data. This effect can produce poor predictions. Z-scoring produces a
compact normalized training set while handling unseen data better. This is
because the model was trained with data outside [−1, 1], and the weights will not
have specialized to the exclusion of values outside that subset; the model will be
more robust. Handling unseen data is usually the most important consideration.
Recalling the discussion of activation functions in Section 2.2.1, three particu-
larly important activation functions were proposed. There are many more possibil-
ities, but those selected are important as an introduction to the trade-offs attendant
to selecting an activation function. The ranges of tanh and sigmoid are very similar.
It may not be clear why both are used. Some rules of thumb can now be suggested.
The sigmoid function has a range of [0, 1]. Observe that half of the sigmoid’s inter-
esting domain is not in its range. ANNs are stacks of layers, the output of one layer
acting as the input for the next. The sigmoid produces output that the deeper layer
will have to translate to make full use of its sigmoid’s domain. This makes it harder
to train. The tanh should be preferred as it has a range of [−1, 1]; it is centered. The
tanh was introduced as an activation function once the dynamics of training ANNs
were better understood. The sigmoid is used when the interpretation of the output is
a probability. This is not uncommon so the sigmoid still occupies an important
niche. Finally, observe that preprocessing ensures that the training set is in the
interesting area of the activation functions’ domains. Starting the ANN’s compu-
tation with sympathetic input speeds up training and produces better results.
Data preprocessing needs to be incorporated seamlessly into a model’s workflow.
There are two choices of when to do it when training. One choice is to perform all
preprocessing a priori in a batch. As described above, the means are calculated,
the data translated and normalization performed on a copy of the dataset; it never
needs to be done again in the course of training. Another option is to compute
the per feature means and scaling factors then incorporate them as an ante-layer
in the ANN directly. When arguments arrive at the ante-layer, they are processed
prior to passing them on to the first dense layer of the ANN. The arrangement is
depicted in Figure 3.1. Both methods are correct and the choice is a trade-off, but
the latter is often chosen. There are a number of reasons for this recommendation.
First, once the model is trained and ready to be deployed, then data used for
inference will also need to be preprocessed. To ensure correctness, the para-
meters used for preprocessing of the training set will have to be retained for use
with the application, even if batch preprocessing was employed. Incorporating
the preprocessing as a layer in the ANN obviates a whole class of bugs where
ANNs are invoked with unprocessed arguments. The second reason is a question
of resources. It is a terrible idea to modify training data; it should always be kept as a record. Preprocessing the training data batch style requires creating a copy, and this can be prohibitive. The one seeming advantage of batch preprocessing prior to training is that it is only done once. The ante-layer ANN may seem wasteful, doing the same thing to the same data over and over again each training epoch, but compared with the cost of computing the ANN's value and the weight updates, it is really just noise. Relative to the expense of training, preprocessing is negligible.
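As a concrete illustration, the following is a minimal sketch of such an ante-layer in Python with NumPy. It is not the book's library; the class and method names are hypothetical. It learns per-feature means and standard deviations from the training set once and then standardizes every argument presented to the network.

import numpy as np

class PreprocessingLayer:
    """A minimal z-scoring ante-layer: it learns per-feature statistics once
    and standardizes every argument before the first dense layer sees it."""

    def fit(self, X):
        # X has shape (examples, features); statistics are per feature.
        self.mu = X.mean(axis=0)
        self.sigma = X.std(axis=0)
        self.sigma[self.sigma == 0.0] = 1.0   # guard against constant features
        return self

    def forward(self, x):
        # Applied identically during training and inference.
        return (x - self.mu) / self.sigma

# Hypothetical usage: the raw training set itself is never modified.
X_train = np.random.uniform(0.0, 10.0, size=(100, 3))
pre = PreprocessingLayer().fit(X_train)
z = pre.forward(X_train[0])   # standardized input for the first dense layer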
Figure 3.1 An ANN with a preprocessing layer. The preprocessing nodes are in the first layer. The translation and scaling are performed on a per-feature basis. The data are passed directly to the ANN from the training set unmodified. Training a preprocessing layer consists of learning the per-feature means and standard deviations.

3.2 Weight Initialization
If all of the neurons in a layer started with the same weights they would all give the same answer, which is pointless. The point of having multiple neurons is to increase the capacity of a layer to learn; the weights need to be different so that all of the neurons can contribute.
The neurons not only need to be different, but ideally they should be in a certain
range. The ranges of activation functions are most interesting in certain subsets of
their domains. The internal scalar product of a neuron, u, should satisfy −1 ≤ u ≤ 1. This suggests that the weights should be initialized thus: w ∼ U[−1, 1], where U is the uniform distribution. While this might be appropriate, it turns out that we can do better. The magnitude of the scalar product will also depend on the number of
source neurons – the more there are, the greater the magnitude of u. Glorot and
Bengio (48) suggest that it is important to ensure that the variance of the weights
is high. High variance ensures that the weights are widely dispersed and con-
tributing, not overlapping. The following is their proposed method for initializing
weights:
w \sim U\left[-\frac{\sqrt{6}}{\sqrt{M_{in} + M_{out}}},\ \frac{\sqrt{6}}{\sqrt{M_{in} + M_{out}}}\right], \qquad (3.1)
where the Ms denote the numbers of neurons in the surrounding layers. Equation (3.1) is known as Glorot initialization. The formulation
accounts for the number of weights in the surrounding layers by shrinking the
centered subset appropriately.
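Equation (3.1) is only a few lines of code. The following sketch assumes NumPy; the function name and the example layer sizes are illustrative rather than taken from the text.

import numpy as np

def glorot_uniform(m_in, m_out, rng=None):
    """Sample an (m_out, m_in) weight matrix per Eq. (3.1)."""
    rng = rng or np.random.default_rng()
    limit = np.sqrt(6.0) / np.sqrt(m_in + m_out)
    return rng.uniform(-limit, limit, size=(m_out, m_in))

W_hidden = glorot_uniform(3, 2)   # e.g. the {3, 2, 1} sine topology used later
W_output = glorot_uniform(2, 1)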
It is also very common to sample the initial values for weights from a Gaussian
distribution. Many deep ANNs employ this technique. The method’s use in Deep
Learning seems to have emanated from its use in AlexNET (89), a very success-
ful system that broke accuracy records at the time. The successful use of Gaussian
initialization has not been explained, but it is widespread (62). AlexNET also intro-
duced ReLU as an activation function for Deep Learning, which has a very different
dynamic domain. The latter point may explain the success of the method. More
advanced methods of initialization exist, such as orthogonalization (71), but Glo-
rot is the most commonly implemented.
And a final implementation note. ANNs do not have unique solutions. The
solution depends entirely on the weight initializations, which are random. The
weights do not converge to some unique combination. If determinism is required,
for example, when debugging, control of the order of initialization and the seeding
of the pseudo-random number generator is required. When developing libraries,
it is a good idea to print out the seed for the random number generator before
running any code. If a problem is observed in the run, then it can be reproduced.
Rare bugs are very difficult to fix if there is no means of reproducing them reliably.
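A small sketch of the idea, with NumPy standing in for whatever library is in use; the only point is that the seed is chosen, printed, and reused.

import numpy as np

seed = 12345                       # record or print this before any training run
print(f"RNG seed: {seed}")
rng = np.random.default_rng(seed)  # all weight initialization draws from this rng

W = rng.uniform(-1.0, 1.0, size=(2, 3))
# Re-running with the same seed reproduces W exactly, so a bad run can be replayed.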
3.3 Training Outline
ANNs are trained with example data; the training set. Unlike the historical method
of solving for functions, ANNs are empirically constrained functions. We do not
define the desired function with a law of nature, such as an energy or conservation of mass constraint; we specify the function directly, by example, with data. The data define the function, that is, the desired behavior of the ANN. The real func-
tion is unknown. The ANN is an approximation to the unknown function, that is,
ANN ≡ f̂ ≃ f .
The model is defined by its training set, the data that empirically constrains how
the ANN should behave. The training set might consist of a set of photographs and
labels if it is a classifier, or a set of vectors and continuous outputs if it is a regressor.
In both cases, there is a set of tuples that consists of the inputs, that is, the predictors, x_i.
The data also includes the correct output, yi , the desired response. This is also
known as the ground truth. Datasets that include the correct answers are known
as labeled data. It is the requirement for the inclusion of the answers in the dataset
that makes this an example of supervised learning. Without the answers we do not
know if the training is proceeding well or needs to be adjusted.
Given the training set, the role of training is to reconcile the weights of the model with the data, that is, to fit the weights to the data; thus, a synonym for train-
ing is fitting. Fitting a model consists of presenting the training data to the ANN
repeatedly until the desired behavior is observed. The examples comprising the
training set define the correct behavior. Elements from the training set are pre-
sented to the ANN until it “learns” the data. The process of fitting a model consists
of a sequence of distinct and discrete training epochs where the training data are
presented to the ANN, errors are computed and the weights are updated (and,
hopefully, improved). But what does learn mean? How do we know when training
is completed?
Fitting the model is driven by globally optimizing an objective function over
the training set. By global we mean over the entire training set. ANN objective
functions are generally minimized. Let ŷ = ANN(x), the ANN’s response, and x an
example from the training set. Then the general form of the problem is
\mathrm{GOF}^t = \frac{1}{N}\sum_i^N \mathrm{Loss}(\hat{y}_i^t, y_i), \qquad (3.2)
at epoch t, and where there are N elements in the training set. The global objec-
tive function (GOF) is a sum of the per training example loss functions. Training
an ANN is the act of finding the weights that minimize the objective function.
The local per example loss function quantifies the accuracy of the ANN and must be differentiable. The ANN's result in the local loss function, ŷ_i^t, is superscripted
to denote that the output changes every epoch (as it learns and improves). The
ground truth never changes.
Computing the current value of the GOF constitutes the basis of a training
epoch. To compute the global loss, we need to run through the training set and
compute the per training example losses. At the end of the epoch, the weights are
updated to new values based on the results of computing the objective function.
The updated weights should improve the global loss function.
Training terminates when the objective function converges. What this means
varies between loss functions. The objective functions used in this book have a
minimum of zero. It is, however, common practice to terminate earlier than that.
A sufficiently small threshold for the objective function can be specified below
which the model is accepted. The threshold is determined by the tolerances of the
application. This question is examined more closely below as loss functions are
introduced.
The convergence of the objective function is the gauge of training progress.
At the end of training, the ANN should have weights that minimize the GOF. The
outline of the process is presented in Algorithm 3.1.
Each iteration of the while loop constitutes a training epoch. All the elements
of the training set are evaluated, and the current value of the objective function
is computed. Once the individual losses have been calculated, the weights are
updated. When the objective function is sufficiently small the algorithm termi-
nates. At this point, training is said to have converged. This form of training is
called batch training because every epoch uses the entire dataset; it is processed
as a batch.
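A batch epoch can be sketched as follows. This is a hypothetical Python outline in the spirit of Algorithm 3.1, not the book's code; model.compute_loss is assumed to run the forward pass, evaluate the per-example loss, and accumulate the per-weight gradients (the role of ComputeLoss), and model.update_weights is assumed to apply the accumulated net gradient.

def train(model, training_set, threshold, max_epochs=10_000):
    """Batch training: every epoch evaluates the entire training set,
    then the weights are updated once."""
    for epoch in range(max_epochs):
        total = 0.0
        for x, y in training_set:
            # Assumed to run the forward pass, compute the per-example loss,
            # and accumulate the per-weight gradients.
            total += model.compute_loss(x, y)
        gof = total / len(training_set)   # the global objective, Eq. (3.2)
        if gof < threshold:
            return epoch                  # training has converged
        model.update_weights()            # apply the accumulated net gradient
    return max_epochs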
3.4 Least Squares: A Trivial Example
Figure 3.2 The fit of a least squares solution to a random sample of points and their
sine. The training set is plotted as points. The results of the computed model are plotted
as crosses. The slope is negative because there are more points in the training set on the
right side of the curve.
a different result). This can be used to model lines, planes, and hyperplanes, but
most interesting problems are extremely nonlinear. It is precisely this nonlinearity that Minsky and Papert used to argue that ANNs were a research dead end (131).
Early learning rules were mostly variants of least squares. An improvement can be
made by treating the dataset piecewise and dividing the dataset into ranges, but
this presents its own problems and remains unsatisfactory.
To learn nonlinear relations requires the introduction of nonlinearity. ANNs
get their nonlinearity from activation functions. That is the role of activation functions: they bend the decision boundary around subsets that are not linearly separable. An example is depicted in Figure 3.3. The problem on the right is linearly
separable, and least squares is capable of learning it. The problem on the left has a
more complex decision boundary. It is not linearly separable. The problem on the
left requires more elaborate mechanisms for a satisfactory model. The nonlinear
activation functions make possible learning nonlinear decision boundaries.
While the nonlinear models are very successful at dealing with nonlinear prob-
lems, they lack interpretability. Least squares results in a solution that can be easily
interpreted as a relationship. A trained ANN is a black box. It may make predic-
tions very well, but they usually do not admit of interpretation, making it difficult to draw broad conclusions.
Figure 3.3 The decision boundaries for two classification problems. The example depicted in subfigure (b) features two classes that are neatly separated by a straight line; it is linearly separable. The classification problem presented in subfigure (a) is far more complicated. Its two classes cannot be differentiated with the superposition of linear boundaries.

3.5 Backpropagation of Error for Regression
Figure 3.4 A trained ANN model that has learnt the sine function. The layers are labeled.
At the start of the first epoch of training, the first example is presented to the
untrained model. From the input layer, the signal propagates through the net-
work, and as the weights are initialized randomly, we can expect nonsense in the
output neuron. To progress further, there are two immediate requirements. We
need a means of measuring the accuracy of the ANN’s answer, or how “wrong”
it is, and the means must be quantitative and differentiable. There must also be
a mechanism for using this information to update the weights and thus improve
the solution. We address ourselves to meeting these requirements in the rest of the
section. The starting point is the GOF:
\mathrm{GOF}^t = \frac{1}{N}\sum_i^N \mathrm{Loss}(\hat{y}_i^t, y_i). \qquad (3.3)
The work of a training epoch is computing the individual loss terms. When training, an example is selected from the training set, (x_i, y_i = sin(x_i)), and ŷ_i = ANN(x_i) is computed. Now, almost certainly, we will have ŷ_i ≠ y_i. A loss function will deter-
mine the quality, or lack thereof, of ŷ . For regression, the standard measure is the
squared error loss function:
\mathrm{Loss} = \frac{1}{2}(y - \hat{y})^2, \qquad (3.4)
where y is the ground truth and ŷ is the response of the ANN. The squared error
loss function punishes large differences and forgives smaller differences. As the
difference is squared the result is always positive, which makes it useful in a sum.
Returning to the objective function, it can now be defined more precisely for
regression. To measure the quality of the fit for the entire training set, we employ
the mean squared error (MSE). The MSE is the objective function for regression.
The global loss function is the mean of all the errors of all the examples from the
training set:
\mathrm{GOF}^t = \mathrm{MSE}^t = \frac{1}{N}\sum_i^N \frac{1}{2}(y_i - \hat{y}_i^t)^2. \qquad (3.5)
As the MSE → 0 the accuracy of the model is improving, that is, converging. In prac-
tice it is rarely exactly zero. A nonzero threshold is specified below which the fit
can be tolerated. When the loss of the ANN falls below the threshold the train-
ing process is halted and the model is ready for use. The threshold is left to the
application to select.
Having computed a response with an example from the training set the loss, the
metric for (in)accuracy, is calculated. The loss needs to be used to improve the
accuracy of the ANN so that the ANN can learn from its experience (exposure to
an example from the training set). Recall that the object of fitting a model to a
training set is to look for the appropriate weights for the ANN. It is now clear what
that means: we want values for the weights that reduce the MSE. An update for
the weights is required at epoch, t, such that,
w^t = w^{t-1} + \Delta w^t, \qquad (3.6)
is an improvement. The object of the training epoch is to compute the weight
updates, Δw. Ideally, the update to a weight would decrease the MSE. An update to
the weights is required that decreases the loss. To that end, the weight updates need
to be related to how the MSE is changing. In other words, we want to relate how
the loss is changing with respect to how the weights are changing. The following
equation captures the relationship:
\Delta w \propto -\frac{\partial L}{\partial w}, \qquad (3.7)
where L is the loss function, in this case, squared error. It is negative because
this is a minimization problem. The desired direction of travel is toward minima
and away from maxima, that is, it needs to descend. If the slope is positive, that
is, increasing, we want to go in the opposite direction. If the slope is negative,
decreasing, then we simply keep going.
The canonical means of achieving this is a method known as backpropagation of
error. The backpropagation procedure computes the derivative of the loss and uses
the chain rule from the Calculus to propagate the error gradient backward through
54 3 Training Neural Networks
the ANN’s graph; this is known as the backpropagation phase or the backward pass
of training. We use a recurrence equation and dynamic programming to update
every weight in the ANN starting from the loss function.
To start, the backpropagation algorithm computes the loss with an example from
the training set. This is the forward pass as the responses go forward through the
ANN’s graph in the usual computation. With the ANN’s result, ŷ , the loss is com-
puted and then the gradient is forced backward through the ANN’s graph. This is
the backward phase.
The first step is to compute, ŷ j = ANN(xj ), with the jth example from the training
set. With ŷ available the loss is computed. The inceptive step of backpropagation
is the derivative of the loss. Differentiating the loss with respect to the output of
the ANN we obtain:
\frac{\partial L}{\partial \hat{y}} = \frac{\partial}{\partial \hat{y}}\left(\frac{1}{2}(y - \hat{y})^2\right) = -(y - \hat{y}). \qquad (3.8)
This is how the loss is changing with respect to the output neuron, ŷ ≡ z_output. We now have ∂L/∂ŷ, but we need to compute ∂L/∂w_i for each weight in the ANN's graph. The backpropagation algorithm is a means of doing just that. To start we will compute the update for the weights of the terminal layer, the output layer.
Table 3.1 Activation functions and their derivatives.
Sigmoid:                      σ(u) = 1/(1 + e^(−u))     derivative: σ(1 − σ)
Hyperbolic tangent (tanh):    σ(u) = tanh(u)            derivative: 1 − σ²
Rectified linear unit (ReLU): σ(u) = max(0, u)          derivative: 0 for u ≤ 0, 1 for u > 0
the desired term in the weight update equation. We have computed ∂L/∂ŷ, so that leaves ∂ŷ/∂u and ∂u/∂w_i.
Proceeding from left to right we start with ∂ŷ/∂u. Recall that ŷ ≡ z_output = σ(u); it is short-hand for the terminal activation function, the final output of the neuron, so ∂ŷ/∂u = ∂σ/∂u. As such it depends on the choice of activation function. Our ANN employs a sigmoid activation function so we need to compute its derivative. Recall the definition of the sigmoid function (Table 3.1):
\sigma(u) = \frac{1}{1 + e^{-u}}. \qquad (3.10)
Let v = 1 + e^{-u}; then we can re-write the sigmoid as \sigma = \frac{1}{v}, thus:

\begin{aligned}
\frac{\partial \hat{y}}{\partial u} &= \frac{\partial}{\partial u}\left(\frac{1}{v}\right) \\
 &= -\frac{1}{v^2} \cdot \frac{dv}{du} \\
 &= -\frac{1}{v^2} \cdot (-e^{-u}) \\
 &= \frac{e^{-u}}{v^2} \\
 &= \frac{(1 - 1) + e^{-u}}{v^2} \\
 &= \frac{v - 1}{v^2} \\
 &= \frac{v}{v^2} - \frac{1}{v^2} \\
 &= \frac{1}{v}\left(\frac{v}{v} - \frac{1}{v}\right) \\
 &= \sigma(1 - \sigma). \qquad (3.11)
\end{aligned}
Recalling that ŷ is a synonym for 𝜎(u) we obtain:
\frac{\partial \hat{y}}{\partial u} = \sigma(1 - \sigma) = \hat{y}(1 - \hat{y}). \qquad (3.12)
This is a very convenient form; we already have ŷ, and this was one of the reasons why the sigmoid was originally so attractive. The last derivative is obtained by differentiating the scalar product with respect to the weight in question:

\frac{\partial u}{\partial w_i} = \frac{\partial}{\partial w_i}\left(\sum_j w_j z_j\right) = z_i. \qquad (3.13)
We now have the three derivatives required for the evaluation of (BPG):

\frac{\partial L}{\partial \hat{y}} = -(y - \hat{y}), \qquad (3.14)

\frac{\partial \hat{y}}{\partial u} = \hat{y}(1 - \hat{y}), \qquad (3.15)

and

\frac{\partial u}{\partial w_i} = z_i. \qquad (3.16)

Composing the individual derivatives to produce the final form results in:

\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial u} \cdot \frac{\partial u}{\partial w_i} = -(y - \hat{y}) \cdot \hat{y}(1 - \hat{y}) \cdot z_i. \qquad (3.17)
While there are 3 weights to update in the output layer, observe that only the last derivative, ∂u/∂w_i, is different across the three updates; indeed, it is the only quantity that varies. The first two derivatives (shaded) are common across all 3 updates. Their product is special and referred to as 𝛿. In general, when the activation is unknown, the delta is written as

\delta = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial u} = \frac{\partial L}{\partial u}. \qquad (3.18)

In our example, the activation has been specified; it is the sigmoid activation function. The shaded factors in Eq. (3.17) comprise the delta,

\delta = -(y - \hat{y}) \cdot \hat{y}(1 - \hat{y}). \qquad (3.19)

This is not abstract or symbolic; we have both values to hand: y is from the training set and ŷ was computed by the ANN in the forward pass. The delta is trivially computed. This leads to the actual weight updates.
We are now in a position to compute the updates for the weights in the output
layer for use in the following epoch:
w_i^{t+1} = w_i^t - \Delta w_i^{t+1} = w_i^t - \eta \cdot \frac{\partial L}{\partial w_i} = w_i^t - \eta \cdot \delta \cdot z_i, \qquad (3.20)
where we have introduced 𝜂, a learning rate, to scale the update. The learning rate
is there to regulate the size of the update. The three updates for the weights for the
output neuron are
w_b^{t+1} = w_b^t - \Delta w_b = w_b^t - \eta \cdot \frac{\partial L}{\partial w_b} = w_b^t - \eta\delta, \qquad (3.21)
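For the sine example with a sigmoid output neuron, the delta and the updates of Eqs. (3.19)–(3.21) can be written down directly. The following is a sketch assuming NumPy vectors; the function name is hypothetical and this is not the book's code.

import numpy as np

def output_layer_update(y, y_hat, z, w, bias, eta=0.1):
    """One squared-error/sigmoid update for the output neuron.
    z holds the hidden-layer responses feeding the output neuron."""
    delta = -(y - y_hat) * y_hat * (1.0 - y_hat)   # Eq. (3.19)
    w_new = w - eta * delta * z                    # Eq. (3.20), one entry per weight
    bias_new = bias - eta * delta                  # Eq. (3.21), the bias sees no z
    return w_new, bias_new, delta                  # delta is reused by the next layer back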
\delta_{hidden} = \frac{\partial L}{\partial u_{output}} \cdot \frac{\partial u_{output}}{\partial \sigma_{hidden}} \cdot \frac{\partial \sigma_{hidden}}{\partial u_{hidden}} = \delta_{output} \cdot \frac{\partial u_{output}}{\partial \sigma_{hidden}} \cdot \frac{\partial \sigma_{hidden}}{\partial u_{hidden}}, \qquad (3.24)

where there are two quantities that have not yet been computed. They are \frac{\partial u_{output}}{\partial \sigma_{hidden}} and \frac{\partial \sigma_{hidden}}{\partial u_{hidden}}.
Note that the bias for the output layer does not play any part in the backpropa-
gation. The bias weight receives updates like all other weights in a layer but does
not contribute to further propagation of the gradient; it does not receive signals
from shallower layers in the graph; it is an input (a BPG terminus). Also bear in mind that z_hidden = σ(u_hidden); the sigma notation is used to emphasize that it is a differentiable activation function.
Computing the 𝛿hidden derivative propagates the gradient between layers. Once
the delta is obtained, the weight updates are computed as they were in the output
layer. Every time the gradient is propagated back across layers, the deltas will be
computed. There are two neurons in the hidden layer so there will be two deltas.
Recall that u_{\ell+1} = \sum_i z_{\ell,i} \cdot w_i (the scalar product in the output layer). We need to compute the following twice, for i \in \{0, 1\}:

\frac{\partial u_{output}}{\partial \sigma_{hidden,i}} = \frac{\partial}{\partial z_{hidden,i}}\left(\sum_j^{\text{neurons in hidden}} z_j \cdot w_{output,j}\right) = w_{output,i}. \qquad (3.25)
The derivative is the weight on the edge from the shallower neuron to the deeper
one. At this point, a pattern is emerging with respect to the layers. The final form
of Eq. (3.24) for computing the 𝛿 for a neuron can be generalized:
\delta_{\ell} = \delta_{\ell+1} \cdot \frac{\partial u_{\ell+1}}{\partial \sigma_{\ell}} \cdot \frac{\partial \sigma_{\ell}}{\partial u_{\ell}}, \qquad (3.26)

and the general form for computing the per neuron \delta_{\ell,i} for a layer, \ell, bearing in mind that it is the shallower layer, is

\frac{\partial u_{\ell+1}}{\partial \sigma_{\ell,i}} = \frac{\partial}{\partial z_{\ell,i}}\left(\sum_j^{\text{neurons in } \ell} z_j w_j\right) = w_i, \qquad (3.27)
where wi is the weight on the edge between the two neurons. We already know
how to compute the last derivative for the sigmoid activation function:
\frac{\partial \sigma_{hidden}}{\partial u_{hidden}} = \sigma(1 - \sigma) = z(1 - z), \qquad (3.28)
where z is the neuron’s output. During training the responses for all of the neu-
rons must be retained for the backpropagation phase. During ordinary use the
responses can be discarded as soon as the deeper layer has consumed them, but they are required when training. It is this requirement for retention that is the con-
nection to dynamic programming.
There are two neurons in the hidden layer, and each has a 𝛿, so two of them must be computed. The resulting 𝛿s are
𝛿hidden,0 = 𝛿output ⋅ w0 ⋅ 𝜎hidden,0 ⋅ (1 − 𝜎hidden,0 ), (3.29)
and
𝛿hidden,1 = 𝛿output ⋅ w1 ⋅ 𝜎hidden,1 ⋅ (1 − 𝜎hidden,1 ). (3.30)
With the two 𝛿hidden,i computed the weight updates for the hidden layer can be
computed as they were in the output layer.
The passage of the gradient through the output layer to the hidden layer by way
of computing the 𝛿s is general. Backpropagation pushes the gradient backward
through the network’s layers through the graph from delta to delta, updating the
weights in each layer as they are reached, and halting at dead ends (a bias or an
input layer). The gradient passes through a layer to a neuron, i, using the deeper
layer’s 𝛿 with:
\delta_{\ell,i} = \delta_{\ell+1} \cdot w_i \cdot \frac{\partial \sigma_{\ell,i}}{\partial u_{\ell,i}}, \qquad (3.31)
where no assumptions have been made about 𝜎.
Proceeding with the sine example, the gradient has now reached the input layer.
Notice that in this layer each input neuron is connected to the two deeper hidden
neurons. It is clear that the earlier expression for the gradient crossing layers is
too simple. Where neurons are connected to multiple deeper neurons, they need
to absorb the gradient from all of their connected deeper neurons. A neuron needs
the total derivative. The total derivative is just the sum of all the gradients passing
backward through the neuron. Hence, the following is performed for each neuron
in layer, 𝓁,
\delta_{\ell,i} = \frac{\partial \sigma_{\ell,i}}{\partial u_{\ell,i}} \cdot \sum_j^{\ell+1} \delta_{\ell+1,j} \cdot w_{j,i}, \qquad (3.32)
where w_{j,i} is the weight on the edge connecting the two neurons (row j and column i in the deeper layer's W matrix). In the sine example, there are 3 neurons in the input layer, and we know the form of ∂σ/∂u because the activation function is known (sigmoid), so we have

\delta_{input,0} = z_0(1 - z_0) \cdot (w_{0,0} \cdot \delta_{hidden,0} + w_{1,0} \cdot \delta_{hidden,1}), \qquad (3.33)
Once the input neurons compute their deltas, the updates for their weights can be computed by multiplying them with ∂u/∂w_i = z_i, as demonstrated with the output layer. Here the z_i are the elements of the input tuple for the ANN; in the case of the sine ANN, they are the argument to the ANN, the x from the training set.
In summary, we can now see how propagating the gradient backward through
the graph is used to update all of the weights in an ANN. Backpropagation consists
of computing the weight updates for the output layer, in the course of which we
also compute the output neuron’s 𝛿. The 𝛿’s then percolate backward through the
ANN until the input layer is encountered, and the last weight updates are per-
formed. Once all of the weights in the ANN have been updated we run through
the training set again and measure the improvement. This continues until the MSE
is below the halting threshold. Because the weights are changing in the direction
that reduces the MSE, the objective function should be reduced over the repeated
training epochs. The sine example is trivial, and as presented backpropagation
may seem cumbersome, but it can be expressed more elegantly.
As with ANN computation in the forward pass, backpropagation can be
expressed more succinctly with matrices. Let 𝛿𝓁+1 be the vector of 𝛿s for layer
𝓁 + 1. Then,
\phi_{\ell} = W_{\ell+1}^T \cdot \delta_{\ell+1}, \qquad (3.37)
is the vector of total derivatives for layer 𝓁. Note that the upper-case T superscript
denotes the matrix transpose, not t, the epoch. The matrix multiplication sums the
products of the weights and the deltas on a per shallower neuron basis. To compute
the 𝛿𝓁,i one more step is required:
\delta_{\ell} = \left(\frac{\partial \sigma_i}{\partial u_i}\right) \otimes \phi_{\ell}, \qquad (3.38)
where ⊗ is the element-wise3 vector multiplication operator. 𝛿𝓁 is the vector of
deltas for the layer. The vector on the left-hand side of the multiplication is the per
neuron 𝛿 so i ranges over all the neurons in the layer, 𝓁.
The per weight gradient for the entire layer can now be written as the outer
product:
\Delta W_{\ell} = \delta_{\ell} \cdot z_{\ell-1}^T, \qquad (3.39)
2 It must be emphasized, this expression is for the sine example. The shaded factor is the
specific derivative of the sigmoid activation function.
3 The Hadamard product.
where z is the vector of the outputs of the shallower layer and thus forms the input for the current layer. Recall that a vector times the transpose of a vector produces a matrix (the outer product),4 not a scalar. At the input layer of the ANN, z_{𝓁−1} is the argument to the ANN from the training set. Finally, the weight update is

W_{\ell}^{t+1} = W_{\ell}^t - \eta \cdot \Delta W_{\ell}^{t+1}, \qquad (3.40)

where W is the matrix of weights for the layer, \ell. All the ingredients for an algo-
rithm are now in place.
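A sketch of Eqs. (3.37)–(3.40) for a single dense sigmoid layer, assuming NumPy and that the layer's forward responses were retained; the names are illustrative and this is not the book's implementation.

import numpy as np

def backprop_layer(delta_next, W_next, z, z_prev, W, eta):
    """Propagate deltas from the deeper layer and update this layer's weights.
    delta_next : deltas of layer l+1            (vector)
    W_next     : weight matrix of layer l+1     (rows correspond to deeper neurons)
    z          : this layer's sigmoid outputs   (vector)
    z_prev     : outputs of the shallower layer (this layer's inputs)
    """
    phi = W_next.T @ delta_next          # Eq. (3.37): total derivatives
    delta = z * (1.0 - z) * phi          # Eq. (3.38): Hadamard product with sigma'
    dW = np.outer(delta, z_prev)         # Eq. (3.39): per-weight gradients
    W_new = W - eta * dW                 # Eq. (3.40): the weight update
    return W_new, delta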
The net gradient is used for a number of reasons. Updating the weights can be expensive – there may be millions of them. Moreover, updating them in the middle of
the epoch will affect all further iterations in the epoch. This can lead to suboptimal
and superfluous, even counterproductive, updates if care is not taken. Amortizing
the cost of a weight update over multiple loss calculations nets out contradictory
directions of travel. Netting out the gradients ensures that the update reflects the
information from the entire dataset.
In closing, we observe that Loss(ANN, x, w) is a high-dimensional function. With
backpropagation we are exploring the loss function in an effort to minimize it by
descending its gradient. During training the parameters of the ANN, the weights,
wi , are variable and the training set is constant, that is, they reverse roles. Once
the model is trained the weights are constant, they are parameters, and it is the
arguments to the ANN that are variable.
The rationale is as follows. Recall that sigmoid is a mapping onto [0, 1]. The derivative will be zero when the neuron is saturated (at one of the extrema). For unsaturated neurons, the interesting range of the function, sigmoid maps to (0, 1). Its derivative is 𝜓 = 𝜎(1 − 𝜎), which will be smaller still. Consequently,
as the error is propagated backward through the network the gradient grows
successively fainter through each layer – and in deeper networks simply disap-
pears (the vanishing gradient problem). The desire to implement deep ANNs
led to a re-examination of activation functions. Let us examine the derivative for the sigmoid activation. Its maximum is found where

\frac{d\psi}{d\sigma} = 1 - 2\sigma = 0 \implies \sigma = \frac{1}{2}. \qquad (3.41)
Thus, the maximum for the derivative is 0.25. The sigmoid and its derivative are plotted
in Figure 3.5. As the error propagates backward through the network, at best, it is
scaled by a quarter in each layer:
\delta_{\ell,i} = \delta_{\ell+1} \cdot w_i \cdot \frac{\partial \sigma_{\ell,i}}{\partial u_{\ell,i}} \leq \delta_{\ell+1} \cdot w_i \cdot 0.25, \qquad (3.42)
It can be viewed as attenuating the error at a rate of 0.25^{depth−𝓁}. The sine example used so far is, strictly speaking, a Deep Learning model, but in spirit it is not what
comes to mind when people think of Deep Learning. The state-of-the-art Deep
Learning models today have more than 20 layers. Training such deep networks
simply could not be done if sigmoid was used throughout.
Figure 3.5 The sigmoid activation function plotted in black. Its derivative is
superimposed in gray. The derivative acts to attenuate the error signal between layers.
The ReLU activation has a very different range, [0, ∞). Its derivative is either 0
or 1, a Heaviside function; consequently, the error signal can penetrate further
back up the stack of layers. Gradients flow backward unimpeded, or are stopped
dead. This is a critical property for training Deep Learning models. The more
layers in a model the more an ANN can learn. It does, however, introduce another
problem: the “dead neuron.” Should a neuron's scalar product consistently produce negative values, it can never learn its way out of the hole; the derivative
will be zero forever. Care must be taken to ensure the trap is avoided (99). One
means is to use a slower learning rate. The slower learning rate gives neurons the
chance to avoid death. Note that while the learning rate might be lower it is still a
win as ReLU makes it possible to train deeper networks than previously possible.
For shallower networks, such as the sine example, the non-linearity of sigmoid
is a win.
Figure 3.6 MSE loss for four training runs of a sine ANN. They all show the same
distinctive precipitous decline followed by a very heavy tail. The topology was {3, 2, 1}.
Figure 3.7 The output of an ANN learning the sine function. The black curve is the
ground truth. The remaining curves demonstrate the evolution of the model as it learns.
The straight line is the initial random weights. Over time, the output of the model approaches the ground truth as the weights improve. Plots are taken every 25 epochs.
to sine as the MSE comes down. The differences between the curves of subsequent epochs become less pronounced over time; this is reflected in the heavy tails of the loss graph.
3.7 Verification of a Software Implementation

The computations of a preprocessing layer are easily verified with other software packages such as R or MATLAB. The other types of layers in an ANN library are more challenging.
We want to verify the correctness of the implementation of backpropagation
in the layers. This is not a job for an application developer training a model.
Verification needs to be performed by the people writing the software libraries
that application developers use. It is also useful when doing research. Verifying
an implementation of a new layer is vital to ensure that the experimental results
can be trusted.
Verification of an implementation can be performed by computing the expected
gradient with differencing equations. Consider the gradient at some weight, w_i, in the neural network. Then the backpropagation implementation will compute ∂L/∂w. We wish to confirm that the quantity, ∂L/∂w, is correct. We can verify the result by computing it directly with the definition of a derivative. Both results should agree and confirm the correctness of the backpropagation implementation. The classical definition of a derivative is

\frac{dL}{dw} = \lim_{h \to 0} \frac{L(w + h) - L(w)}{h}. \qquad (3.43)
Notice that we have used L, the loss, and not the ANN, as we need the derivative of the loss with respect to the weight, not the ANN per se. That is how ∂L/∂w is computed.
The derivative is computed by selecting a datum from the training set and com-
puting ComputeLoss((x, y)). The loss is recorded; w + h is then used in a second
invocation. The new loss is computed and the approximate derivative calculated.
If we obtain “acceptable,”

\frac{dL}{dw} \approx_{\epsilon} \frac{\partial L}{\partial w}, \qquad (3.44)
results for every weight in the layer, then we can have confidence that our
backward propagation code works correctly.
In a real computer implementation, such naïve numerical differentiation is unreliable. This can quickly lead to trouble and require large, and potentially
meaningless, choices of 𝜖. It must always be borne in mind that the real number
line does not exist in a computer. An arithmetically correct algorithm is not
necessarily correct when implemented in a computer. The above definition of
a derivative must be approximated with the discrete representation in a com-
puter. Employing a truncated Taylor series expansion we can get an idea of the
arithmetical and analytical error that we can expect,
We can see that this naïve approach leads to error that is linear in h. The error is
very high for an application such as verifying the correctness of gradient propaga-
tion through an ANN implementation. By using a centered form of approximating
the derivative, we can compute the gradient a second way that is far more accurate,
\frac{dL}{dw} = \lim_{h \to 0} \frac{L(w + h) - L(w - h)}{2h}. \qquad (3.47)
It is centered because the computation looks in both directions around w giving us
a better idea of local behavior. The improvement can be quantified by expanding
the Taylor series by a further term and doing it in both directions,
L(w + h) = L(w) + L'(w) \cdot h + \frac{L''(w) \cdot h^2}{2!} + O(h^3), \qquad (3.48)
and,
L(w - h) = L(w) - L'(w) \cdot h + \frac{L''(w) \cdot h^2}{2!} - O(h^3), \qquad (3.49)
Subtracting the two equations, the second-order terms cancel. Solving for the derivative we are interested in produces an expression with second-order error:
\implies \frac{L(w + h) - L(w - h)}{2h} = L'(w) + O(h^2) \approx \frac{dL}{dw}, \qquad (3.50)
and we can see that the error is now quadratic in h. The first-order errors cancelled
themselves out, taking with them many numerical problems as well. In practice,
in a computer implementation, h must be chosen very carefully.
There are now two means of computing the gradient anywhere in a neural net-
work. They will rarely agree exactly so we still need a means of comparing the
results. The following is used,
\frac{|bprop| - |diff|}{|bprop| + |diff|}, \qquad (3.51)
where bprop is the value from backpropagation and diff is the result of the centered
differencing equation. The closer to zero the better. The comparison can be done
at any point in an ANN’s graph.
There are some important practical details to consider. It is best to take only a couple of training steps before performing the verification. Too late and the gradient will
be too faint (the ANN is growing accurate so the error is small). Too early and
the results can be erratic. It is also recommended to use double-precision floating
point variables for verification. The deeper in the network, the smaller h can be
(deeper is closer to the loss function, and the stronger the signal will be). A good
starting point for h is 10−7 . The entire process is shown in Algorithm 3.4. It verifies
the 𝛿s of a dense layer.
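A sketch of the check under the stated choices (double precision, h = 10^-7). Here compute_loss and set_weight are hypothetical stand-ins for the library's ComputeLoss and for writing a single weight; grad_backprop is the value the backpropagation implementation produced for that weight.

def verify_weight_gradient(compute_loss, set_weight, w, grad_backprop, h=1e-7):
    """Compare a backpropagated gradient against the centered difference of
    Eq. (3.47), scored with the relative measure of Eq. (3.51)."""
    set_weight(w + h)
    loss_plus = compute_loss()
    set_weight(w - h)
    loss_minus = compute_loss()
    set_weight(w)                                  # restore the original weight
    diff = (loss_plus - loss_minus) / (2.0 * h)    # centered difference
    score = (abs(grad_backprop) - abs(diff)) / (abs(grad_backprop) + abs(diff))
    return diff, score                             # a score close to zero is good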
Figure 3.8 (schematic) A differencing layer is placed above the dense layer to be verified; for each neuron the comparison (|diff_i| − |G[i]|)∕(|diff_i| + |G[i]|) is computed.
When new types of layers are implemented and added to a library, they can be
verified with the above algorithm. A model can be created with the new type of
layer and with a generic dense layer above it (shallower). If backpropagation is
flowing correctly through the new type of layer, then the instrumentation in the
dense layer will confirm it (Figure 3.8).
It is important to bear in mind that differencing assumes a continuous function; dense layers are continuous. Many types of ANN layers, such as the max pooling and convolutions found in Chapter 6, have important discontinuities. Differencing can still be used with them, but the discontinuities must be accounted for.
Table 3.2 presents examples of the output of the verification Algorithm 3.4 for an ANN implementing sine with two hidden layers of 5 and 4 neurons, respectively. It is clear that the library that was used to build the model is correct. A value of
h = 0.0000001 was used.
3.8 Summary
This chapter introduced the rudiments of training ANNs. It was seen that pre-
processing the data is critical to success. The basis of learning is employing a
quantitative “loss” function to measure progress. The loss function must be dif-
ferentiable. Backpropagation of error from the loss function is used to update the
3.9 Projects 71
3.9 Projects
Training Classifiers
This chapter presents how to train classifier artificial neural networks (ANNs).
The loss functions required are motivated and derived. The inherent and fun-
damental supporting concepts are also demonstrated. Chapter 3 detailed how to
train a regressor ANN. Backpropagation of the loss gradient was used to compute
updates for every learnable parameter in an ANN’s graph. The loss is a quantitative
metric for the incorrectness of an ANN’s output. Classifiers differ substantively
from regressors in only one particular: the loss function. The loss function is very
important as it not only measures progress, but its derivative is the inceptive step of
the backpropagation of error. Mean squared error (MSE) is not appropriate for clas-
sification so we begin by motivating an appropriate loss function for classification.
Let 𝕂 be the set of classes, that is, the range of the classifier, and K = |𝕂| be
the number of classes. Softmax produces a vector of probabilities, and the training
examples use one-hot encoded vectors, which can also be interpreted as a discrete
probability distribution. The two distributions, pi and p̂ i , need to be quantitatively
compared to compute a loss. The MSE does not make sense when comparing dis-
tributions in the form of probability mass functions; they are not continuous. To
train a classifier, we need a loss function that is more appropriate, a measure of
similarity between probability distributions.
4.1.1 Likelihood
Before the loss function is presented, a brief digression is indicated to introduce
the idea that the loss function is based on the notion of likelihood. Likelihood is
often used colloquially as a synonym for probability, but it is subtly different. It is
easiest to grasp the difference with a simple example. Suppose someone claims to
have a fair coin, an equal probability for either a head or a tail, and plays a game
with you. Over the course of 15 tosses 12 heads and 3 tails are observed. Is the coin
fair?
Coin flips are distributed binomially, P(X = S|\theta) = \binom{N}{S}\theta^S(1 - \theta)^{N-S}, where S is the number of tails and we have conditioned on the parameter, \theta, the probability
of a tail. The binomial distribution could be used to compute the probability of
X = 3 tails, as it is claimed that the coin is fair and it is assumed that 𝜃 = 0.5. This
results in a probability of the observed outcome of 0.0138855, or 1.4%: but it is the
claim that the coin is fair that needs to be addressed. We are really interested in
the value of 𝜃 that explains the observations. The task is to fit observed data to a
distribution, not make a prediction (we have the data). The object is to find the
most likely value of 𝜃 that explains the observed data. The tool for such problems
is called a maximum likelihood estimator (MLE).
Probability is used to predict data, and likelihood is used to explain data. All of
X, S, and N are known. 𝜃 is the variable, so we use a likelihood function to find 𝜃.
For discrete probabilities, the likelihood function is the probability function with
the outcome fixed and the parameters varied. For more details and a thorough
introduction see (110).
The likelihood function for the binomial distribution is conditioned on the
observed data:
\mathcal{L}(\theta|X = 3) = \binom{15}{3}\theta^3(1 - \theta)^{12}. \qquad (4.1)
The object is to find 𝜃 that maximizes the likelihood function. Its graph is presented
in Figure 4.1. 𝜃 = 0.5 seems very unlikely. The maximum likelihood lies in the gray
region of the curve. A more plausible value for 𝜃 lies between 0.15 and 0.3, hence
Figure 4.1 Likelihood for the fairness of a coin following 15 tosses and observing 3 tails. The portion of the curve that is gray is where the likelihood is highest, indicating that 𝜃 is likely found there, between [0.1, 0.3]. This suggests that the coin is likely unfair.
we conclude that the coin is not fair. The notion of likelihood is very powerful, and
it can be used to train categorical ANNs.
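The coin example can be reproduced in a few lines of Python; the grid search below is only for illustration and is not how a maximum likelihood estimator would normally be computed.

from math import comb

def binomial_likelihood(theta, n=15, tails=3):
    return comb(n, tails) * theta**tails * (1.0 - theta)**(n - tails)

print(binomial_likelihood(0.5))            # ~0.0139, the 1.4% quoted above
thetas = [i / 100.0 for i in range(1, 100)]
best = max(thetas, key=binomial_likelihood)
print(best)                                # ~0.2 = 3/15, inside the gray region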
where we have conditioned on our parameters, the weights. Given the output of
softmax, p̂ = {p̂ 1 , ..., p̂ K }, we want to maximize the likelihood that p̂ represents the
ground truth, p. The likelihood function, \mathcal{L}, for independent conditionals is

\mathcal{L}(p; \hat{p}, w) = \mathrm{argmax}_w \prod_i^K \hat{p}_i^{\,p_i}. \qquad (4.4)
In other words, the loss function for classification is the response of the ANN expo-
nentiated to the ground truth (the expected number of times they should appear),
the one-hot encoded vector from the training set. This function is maximized when
all the p̂ i = pi . When training categorical ANNs all the pi will be zero except for one
item, the target class.
The problem with this loss function is that it is a product, and underflow is a
serious problem. There is also the ironic problem of correct progress, one zero fac-
tor, which is desired given one-hot encoding, and there is no information at all. It
is more convenient to work with a sum. To that end the natural logarithm is taken,
written log, of the MLE. The choice of base e is not mathematically required, but it will prove to be a very convenient choice when differentiating softmax below.
Taking the negative logarithm of ℒ yields a more convenient function (the expo-
nents are probabilities). We give it below for an ANN with K possible classes:
\mathrm{Loss}(p, \hat{p}) = -\sum_k^K p_k \log(\hat{p}_k), \qquad (4.5)
where we are iterating over a one-hot encoded vector. This function compares the
ground truth with the computed vector. Only one pk is nonzero (it is 1, the target
class), and the rest are zero. Maximizing the original likelihood function is done
by minimizing the loss function.
The loss function can also be arrived at with information theory (26). In the
context of information theory, the loss is known as H, the cross entropy. Infor-
mation theory uses a logarithm with base 2, and this gives units of Shannons,
or Shannon bits, after Claude Shannon. Claude Shannon is the father of modern
information theory. He published a seminal paper, “A Mathematical Theory of Communication,” in Bell Labs' internal journal, the Bell System Technical Journal. Shannon developed information entropy as a means of computing the
optimal length of messages for distributions (such as the frequency of letters in a
language, e.g. English or Arabic). Cross entropy can be interpreted as the loss of
information when using a suboptimal encoding.
With the definition of an appropriate local loss function, the global objective
function can be defined. Training a classifier ANN attempts to minimize:
\mathrm{GOF}^t = -\frac{1}{N}\sum^N \sum_{k=1}^K p_k \log(\hat{p}_k^t), \qquad (4.6)
where there are N training examples and K categories. Again, as with the MSE,
as the cross entropy → 0.0 the information loss approaches zero and the model is
converging. It is rare for the cross entropy to actually reach zero, and it is standard
procedure to specify a loss threshold below which training is halted. The threshold
specified depends on the application. In practice, the inner sum does not need to
be computed as the pk will all be zero except for the target class, which is 1, so only
one term is nonzero. The simplified version is
\mathrm{GOF}^t = -\frac{1}{N}\sum^N \log(\hat{p}_{target}^t). \qquad (4.7)
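A sketch of Eq. (4.7), assuming the softmax outputs and integer target-class indices are already available; the names are illustrative.

import numpy as np

def cross_entropy_objective(p_hat_batch, targets):
    """p_hat_batch: (N, K) softmax outputs; targets: (N,) integer target classes.
    Only the target term survives the inner sum of Eq. (4.6)."""
    n = len(targets)
    picked = p_hat_batch[np.arange(n), targets]      # p_hat of the target class
    return -np.mean(np.log(picked))                  # Eq. (4.7)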
4.2 Computing the Derivative of the Loss

The MLE is used to seed the backpropagation of error to train a classifier. The strategy is the same as for a regressor measured with MSE. The weight updates that reduce the loss function need to be computed, so we need ∂L/∂w for every
weight in the ANN. The output of a classifier is different from a regressor owing to
the softmax function. The softmax accepts the raw output of the ANN, logits, and
from that point backpropagation is no different from a regressor. The strategy then
is to differentiate from the loss function to the logits then perform backpropagation
as usual.
where we have convolved softmax with cross entropy. Let P̂(z) be the softmax function. Then we begin by recalling the definition of softmax. For the class i, its
softmax probability is
\hat{p}_i = \hat{P}(i) = \frac{e^{z_i}}{\sum_k^K e^{z_k}}, \qquad (4.9)
and

\hat{p} = (\hat{P}(0), \ldots, \hat{P}(K - 1))^T, \qquad (4.10)

the vector of all the softmax probabilities.
To compute ∂L/∂z_i we start by differentiating the loss with respect to the output of the neuron for each class i:

\begin{aligned}
\frac{\partial L}{\partial z_i} &= \frac{\partial}{\partial z_i}\left(-\sum_k^K p_k \cdot \log(\hat{p}_k)\right) \\
 &= -\sum_k^K p_k \frac{1}{\hat{p}_k} \frac{\partial \hat{P}(k)}{\partial z_i} \\
 &= -p_i \frac{1}{\hat{p}_i} \frac{\partial \hat{P}(i)}{\partial z_i} - \sum_{k \neq i}^K p_k \frac{1}{\hat{p}_k} \frac{\partial \hat{P}(k)}{\partial z_i}, \qquad (4.11)
\end{aligned}
where the problem has been split into two cases. The first term is the case where i is the class with respect to which we are differentiating; we have broken it out of the sum. The second term is the sum containing the remaining k ≠ i terms. Our strategy is to deal with the two cases separately and recombine the results.
The first case is

-p_i \frac{1}{\hat{p}_i} \frac{\partial \hat{P}(i)}{\partial z_i}. \qquad (4.12)
Let us compute

\frac{\partial \hat{P}(i)}{\partial z_i} = \frac{\partial}{\partial z_i}\left(\frac{e^{z_i}}{\sum_j e^{z_j}}\right). \qquad (4.13)
We employ the product rule to differentiate. Let A = e^{z_i} and B = \left(\sum_j^K e^{z_j}\right)^{-1}. Then AB = \hat{p}_i. This leads to

\begin{aligned}
\frac{\partial}{\partial z_i}(AB) &= A'B + AB' \\
 &= AB + AB' \\
 &= \hat{p}_i - e^{z_i} \cdot B^2 \cdot e^{z_i} \\
 &= \hat{p}_i - \hat{p}_i^2 \\
 &= \hat{p}_i(1 - \hat{p}_i) \\
 &= \frac{\partial \hat{P}(i)}{\partial z_i}.
\end{aligned}

Substituting \frac{\partial \hat{P}(i)}{\partial z_i} back into Eq. (4.12) we obtain

\implies -p_i \frac{1}{\hat{p}_i} \frac{\partial \hat{P}(i)}{\partial z_i} = -p_i \frac{1}{\hat{p}_i} \hat{p}_i(1 - \hat{p}_i) = p_i \hat{p}_i - p_i. \qquad (4.14)
Recombining our results obtained for Eqs. (4.12) and (4.15) we reconstitute the original equation:

\begin{aligned}
-p_i \frac{1}{\hat{p}_i} \frac{\partial \hat{P}(i)}{\partial z_i} - \sum_{k \neq i}^K p_k \frac{1}{\hat{p}_k} \frac{\partial \hat{P}(k)}{\partial z_i} &= p_i \hat{p}_i - p_i + \sum_{k \neq i}^K p_k \hat{p}_i \\
 &= -p_i + \sum_k^K p_k \hat{p}_i \\
 &= -p_i + \hat{p}_i \sum_k^K p_k \\
 &= \hat{p}_i - p_i. \qquad (4.18)
\end{aligned}

The full sum was reassembled and we take advantage of the fact that \sum_k^K p_k = 1 to simplify the final expression. Thus, we arrive at

\frac{\partial L}{\partial z_i} = \hat{p}_i - p_i. \qquad (4.19)
The expression for cross-entropy's derivative looks very similar to the derivative of the squared error used for regression: a simple difference between the computed answer and the ground truth. This is the inceptive step of the recurrence for
backpropagation when training classifiers. It does not look very different from the
squared loss for regression, but note that it is applied to every class, that is, the K
outputs of the ANN, e.g. all three species of iris in the example classifier. Moreover,
pi is zero for every class except the target class, in which case it is 1.
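A brief sketch of this inceptive step: softmax over the logits, then the per-class delta of Eq. (4.19). A one-hot ground truth and NumPy are assumed; the numbers are illustrative only.

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())           # subtract the max for numerical safety
    return e / e.sum()

logits = np.array([1.2, -0.3, 0.4])   # raw ANN outputs for, e.g., three iris species
p = np.array([0.0, 1.0, 0.0])         # one-hot ground truth
p_hat = softmax(logits)
delta = p_hat - p                     # Eq. (4.19); seeds backpropagation at the logits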
Figure 4.2 Detail of an iris classifier’s terminal layer. The softmax layer runs into the
loss function where the back propagation begins. It is differentiated with respect to each
class, k.
4.3 Multilabel Classification
The classification problems presented so far are known as multiclass problems. Given a set of categories, 𝕂, a datum could only be a member of one of the
classes. For example, an iris flower can only be one species. Membership in the cat-
egories is mutually exclusive. There are, however, problems where membership of
multiple classes is desirable, that is, a datum can be a member of more than one
category. The problem is known as multilabel classification. To solve this problem,
a different technique is required, and we present it here.
Consider the problem of classifying an email by subject matter. Emails fre-
quently touch on many subjects. A forensic examination of a batch of emails might
be interested in the following topics, 𝕂= {HR, Finance, Budget, Criminal}.
Forcing an email to be a member of a single category would clearly lose a great
deal of information. To that end we permit an email to be a member of multiple
classes. The training set for such a classifier could not use one-hot encoding. In
this problem, an email can have multiple labels. To account for the difference, the
problem is known as multilabel classification.
Multilabel classification refers to the fact that a datum can have more than
one label. Multiclass classification is framed very differently. The fundamental
difference is the relationship between the outputs. The mutually exclusive class
membership of multiclass problems can be represented with one-hot encoding,
which can be efficiently represented with a single integer, the index of the “1”
in said vector. Softmax produces a distribution over the output vector. Multilabel
classification is different as membership in all the classes is independent. The
question of membership of one class has no bearing on membership in another
class. Thus multilabel class membership can be treated separately with respect to
each category. Each class can be interpreted, and treated, as a binary classifier.
The special case of binary classification is simpler. A datum is either in the class
or it is not. This is captured in the following formulation for a class, k:
lossk = −[pk ⋅ log(p̂ k ) + (1 − pk ) ⋅ log(1 − p̂ k )]. (4.21)
This leads to the local loss function,
\mathrm{loss} = \frac{1}{K}\sum_k^K \mathrm{loss}_k. \qquad (4.22)
Figure 4.3 An example of a multilabel classifier. The outputs corresponding to the 4 classes are independent. The final output is a vector
with 4 entries mapped to category membership based on a probability threshold. The per class 𝛿s are indicated.
The global objective function for multilabel classification is then

\mathrm{GOF} = \frac{1}{N}\sum^N \frac{1}{K}\sum_k^K \mathrm{loss}_k, \qquad (4.23)
where there are N examples in the training set.
A multilabel ANN can be trained with backpropagation. To initiate the inceptive
step, the loss function must be differentiated. It is much simpler to derive as there is
no softmax function, just sigmoid. As the classes are all treated independently, the
work is greatly simplified. The special case of binary cross entropy is the beginning
of the backwards pass:
\frac{\partial L}{\partial \hat{p}_k} = -\frac{\partial}{\partial \hat{p}_k}\left(p_k \cdot \log(\hat{p}_k) + (1 - p_k) \cdot \log(1 - \hat{p}_k)\right), \qquad (4.24)
this leads to

\begin{aligned}
\frac{\partial L}{\partial \hat{p}_k} &= \frac{-p_k}{\hat{p}_k} + \frac{-(1 - p_k)}{1 - \hat{p}_k} \cdot (-1) \\
 &= \frac{-p_k \cdot (1 - \hat{p}_k) + \hat{p}_k \cdot (1 - p_k)}{\hat{p}_k \cdot (1 - \hat{p}_k)} \\
 &= \frac{\hat{p}_k - p_k}{\hat{p}_k \cdot (1 - \hat{p}_k)}. \qquad (4.25)
\end{aligned}
This was a relatively simple derivative to compute, but numerically it is fraught
with danger. Luckily, it does not need to be evaluated.1 The computed probability is p̂_k = σ(u_k). As the activation function is known (it is the sigmoid), and recalling that the derivative of the sigmoid function is p̂_k(1 − p̂_k), the expression can be simplified. It is best to work with the per class 𝛿. It can be easily computed:

\delta_k = \frac{\partial L}{\partial \hat{p}_k} \cdot \frac{\partial \hat{p}_k}{\partial u} = \frac{\hat{p}_k - p_k}{\hat{p}_k \cdot (1 - \hat{p}_k)} \cdot \hat{p}_k \cdot (1 - \hat{p}_k) = \hat{p}_k - p_k. \qquad (4.26)
With 𝛿k computed, the backpropagation of error takes place as usual. The termi-
nal layer computes the deltas using this numerically stable method, updates its
weights, and propagates the error back through the network.
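A sketch of the multilabel case: independent sigmoid outputs, per-class binary cross entropy (Eq. 4.21), and the numerically stable per-class deltas of Eq. (4.26). NumPy is assumed and the numbers are illustrative only.

import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

u = np.array([2.1, -0.7, 0.3, -1.5])      # raw outputs for, e.g., the 4 email topics
p = np.array([1.0, 0.0, 1.0, 0.0])        # multilabel ground truth (not one-hot)
p_hat = sigmoid(u)

per_class = -(p * np.log(p_hat) + (1 - p) * np.log(1 - p_hat))  # Eq. (4.21)
loss = per_class.mean()                                         # Eq. (4.22)
delta = p_hat - p                                               # Eq. (4.26), per class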
4.4 Summary
Backpropagation of error is used for training classifiers, but a different loss
function is employed. Multiclass classification is prediction of membership in a
mutually exclusive set of categories. One-hot encoded vectors are used to represent the set. The loss function is cross entropy across the distribution produced by softmax. Multilabel classification provides for membership of a datum in multiple categories. Each category is treated independently. Instead of softmax, the sigmoid activation is used. The loss is a special case of cross entropy, per category binary cross entropy.

1 Never forget that computing is different from doing mathematics. This is a classic example of anticipating the computational pitfalls and obviating them.
4.5 Projects
5 Weight Update Strategies

In Chapters 3 and 4 it was seen that the training of artificial neural networks is the act of fitting weights to models with respect to a training set. While the updates to the weights were related to how the loss function is changing, how to precisely quantify
the update was not discussed. In this chapter, we examine the process of updating
weights more closely and present some strategies for selecting the updates effi-
ciently. The treatment consists of two prongs. The training step itself is examined
followed by a closer look at the individual weight updates themselves.
An important consideration for training is the batch size used for a training
epoch. The training process, as described, is a sequence of discrete steps called
epochs. During each epoch, ComputeLoss is called for every element in the training
set. ComputeLoss performs backpropagation, and the per weight net gradients are
accumulated. The weights are only updated after the net gradients are computed.
The net gradient reflects all of the information in the training set. A training step
that makes use of the entire training set is called a batch step. Batch steps are not
always desirable. There are alternative methods, one of which is stochastic gradient
descent. It forms the topic of Section 5.1.
5.1 Stochastic Gradient Descent

There are occasions when training sets are so large that batch training is undesirable or even infeasible. Stochastic gradient descent (SGD) is an alternative means
of training artificial neural networks (ANNs) (93). SGD is the basis of almost, if
not all, deep learning (DL) training. SGD is not a method confined to use with
training ANNs, nor did it originate with ANNs or even with machine learning. It
is a general technique used in optimization problems framed as fitting the param-
eters of a function with respect to data. Its modern origins lie in the 1950s, when it was known as the Robbins–Monro method, but it arguably reached maturity in the 1960s as statisticians began to examine the use of the latest technological marvels,
digital computers in the form of mainframes, to deal with experimental data and
regression. By then the technique had evolved into stochastic estimation, intro-
duced by Kiefer and Wolfowitz (83). The interested reader is encouraged to read
the latter as it is both concise and approachable.
SGD attempts to approximate the net gradient by sampling a subset of the gradi-
ents, that is, by only using a subset of the training set between weight updates.
In “pure” SGD, the weights are updated after each evaluation of ComputeLoss,
see Algorithm 5.1. In practice, training ANNs with pure SGD is rarely done. All
of the arguments for batch processing adduced above suggest that updating the
weights following a single sample is a bad idea. An alternative refinement to SGD
is used, called “minibatch,” and described in Algorithm 5.2. Instead of updating
the weights after every call to ComputeLoss, a random subset of data is sampled
from the training set, called a minibatch, and the minibatch is used for the training
epoch.
The insight behind mini-batch SGD is that if the training set describes the problem domain (the ground truth), then a randomly selected subset of the training set will approximate the gradient well. By repeatedly sampling the training set randomly, then, on average,
the correct gradient is computed and the ANN will converge to a good solution.
Moreover, many training examples are redundant or superfluous. If data points
are sufficiently close together in the input space, then there is no need to use all of
them at once. Because the full training set is not used with every epoch, less work is
done. Epochs are shorter and solutions are found sooner. The size of the minibatch,
by which we mean the percentage of the training set used in a minibatch, varies
greatly between applications and datasets.
Figure 5.1 presents the results of training 4 models to recognize the MNIST
dataset of hand-written digits. MNIST consists of 60,000 28 × 28 black and white
images of hand-written digits and 10,000 further examples for testing. It is a plot
of epoch number versus loss demonstrating the progress of training over time. All
four curves are relatively well clustered signifying that all four models are con-
verging at roughly the same rate; but they are not doing the same amount of work!
The run using a minibatch size of 10% is doing roughly 1/10th of the work of a full
batch, but it is converging at a similar rate. This claim is borne out by Table 5.1. As
the size of the mini-batch decreases, the time taken to train exhibits corresponding
decreases as well. The accuracy of the final model (all values represent 100 epochs)
does not seem to suffer. It must be emphasized that 10% is not being proposed as
a universal mini-batch size. The size will vary widely between problem domains
and datasets. But it does demonstrate the advantages of mini-batch training with
SGD.
Ideally, the subsets are sampled from the full training set by randomly permuting
the order of the training set. During batch training, the training set is iterated over
in the same order every time; the order does not matter when the entire training
set is used. The best implementation of SGD is to randomly permute the order
of the training set. Every time the model has finished iterating over the complete
training set, the order is permuted again (so no minibatch is the same). Let n be the
number of mini-batches required to consume the entire dataset once (the inverse of the minibatch).

Figure 5.1 Four training runs using SGD to train a classifier to recognize the MNIST hand-written digits dataset. Each run had a different % of the dataset for a minibatch. All models were trained over 100 epochs.

Every n SGD epochs a new permutation of the training data
order is computed ensuring entirely different minibatches the next time through.
Also, no training example is seen a second time prior to all the other examples
being seen at least once. This ensures that all data examples receive equal weight.
Implementing SGD in a permuted serial fashion has some important qualities.
With this method, we are still using every element of the training set an equal num-
ber of times, that is, each element of the training set is contributing equally to the
final model, all the elements of the training set are weighted equally. By changing
the order of every n epochs, we are also varying the make up of the minibatches
ensuring many combinations of data examples are producing the approximated
gradients. Some training libraries implement SGD with sampling with replace-
ment, but this is discouraged as the resulting distribution of samples is very dif-
ferent (and, arguably, incorrect). SGD should not be an implementation of boot-
strapping or bagging (36). Sampling without replacement produces better qual-
ity updates. SGD minibatch can be implemented efficiently with the “shuffle”
algorithm (86).
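A minimal sketch of this permuted, serial minibatch scheme is given below; the model object and its compute_loss_gradient and update_weights methods are hypothetical placeholders, not the book's library.

import numpy as np

def train_sgd_minibatch(model, X, Y, batch_fraction=0.1, n_sweeps=10):
    # Permute the training set, carve it into minibatches, and only
    # re-permute once every example has been seen exactly once.
    N = len(X)
    batch_size = max(1, int(N * batch_fraction))
    for sweep in range(n_sweeps):
        order = np.random.permutation(N)          # sampling without replacement
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            grad = model.compute_loss_gradient(X[idx], Y[idx])   # hypothetical API
            model.update_weights(grad)             # one weight update per minibatch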
Detecting convergence when training with SGD can be challenging. Batch train-
ing usually exhibits a monotonically decreasing loss function. In contrast, SGD
losses can jump around and exhibit jittery behavior. This is owing to the way losses
are computed in SGD: they depend on a subset of the data, which is usually much
smaller. Sometimes a mini-batch will contain elements that the ANN has learnt
yielding a small loss, then the next minibatch might contain all the pathological
outliers in the training set producing enormous loss. There are many techniques to
cope with the problem. One approach is based on n. If the loss is below the thresh-
old for n epochs in a row, then training has probably converged. An even simpler
variant is to maintain a sum of the losses. At the point of permuting the train-
ing set, the sum is converted to a mean (divided by n) and if acceptable training
is halted. While simple it does have a drawback. Waiting for the next permuta-
tion to test for convergence may postpone acceptance of the solution while doing
superfluous work.
Below we contrast and compare the 3 variants of epochs presented thus far:
1. Full batch training was described in Algorithm 3.2. The entire dataset is
used in every epoch. The weights are only updated once the comprehensive
net gradient representing the entire dataset has been computed. This can be
prohibitively slow for large datasets.
2. Pure SGD, described in Algorithm 5.1. The weights are updated after each loss
computation. This is expensive and also leads to jerky objective function eval-
uations; GOF will be very jittery. The pure form of SGD is expensive because
updating weights can also take a great deal of time for large ANNs. It is more
likely to make “mistakes” as the error gradients are not averaged out before
updating the weights resulting in many superfluous weight updates.
3. Minibatch SGD, described in Algorithm 5.2. In each training epoch, a random
subset from the training set is selected. The gradient is approximated over mul-
tiple examples from the training set, the subset. The idea is that all the approx-
imated gradients will average out over multiple samples. The cost of updating
the weights is amortized over more loss calculations than pure SGD, The objec-
tive function is far smoother than pure SGD.
SGD is an important means of training ANNs. For the large state-of-the-art prob-
lems in DL, it is almost always used. Problem domains where the datasets are
smaller, such as medical clinical studies where the number of patients might only
be a few hundred, SGD is not as important.
Figure 5.2 A loss surface for a hypothetical ANN. A static choice for 𝜂 is problematic and
would depend on the starting position. The precise shape is not known a priori and so a
dynamic choice is required.
epoch, t, we might have a good update for some weight, wi , but updating weight wj
may interfere with it; the loss is a function of both of them. Most methods update
each weight independently, failing to take account of negative, or positive, side
effects on other weights.
Training is an iterative process. An epoch computes the current loss, updates the
weights and repeats. The iteration has the form,
w_i^{t+1} = w_i^t + Δw_i^{t+1},   (5.2)
for weight wi . With increasing epoch, the objective function will be reduced and
the ANN’s weights should converge to a tolerable solution. Iteration is an impor-
tant means of finding the solutions to equations. The remainder of the chapter is
dedicated to presenting a selection of iteration strategies for weight updates. There
is no one best approach. Like any complex problem, there are trade-offs to be con-
sidered. Often the problem domain must be considered, some update strategies
work best with different problems.
Newton was interested in finding the extrema of functions. He used his recently
invented Calculus to do so in the setting of convex optimization. Training neural
networks is often assumed to be a convex optimization problem (18) so his insights
are relevant. Training ANNs is generally concerned with finding minima, as the
objective is to minimize the loss function.
Consider a scalar function, f(w). Let us iteratively find a minimum for f. Then
the object is to find w such that f ′ (w) = 0, a stationary point of f . Suppose that,
for whatever reason, we cannot solve the equation analytically (exactly). Then
we need to use a numerical method to approximate a solution. Training a neu-
ral network is the same problem. The training set is constant. The parameters of
the ANN, the weights, are what we are solving for. So we want to optimize the
objective function by finding appropriate weights.
Newton’s method for optimization approximates a solution by starting with an
initial guess, then iteratively improving it. To proceed, start with an initial guess
for the solution, w0 , then compute a new value, w1 = w0 + Δw1 . Δw is chosen
to improve the guess, that is, f (w0 + Δw1 ) should move the solution closer to an
extremum.
More generally, at each step, we have wt+1 = wt + Δwt+1 . In the case of ANNs,
we want f , the objective function, to decrease so movement should be in the direc-
tion of decreasing f , this suggests that the direction should be −f ′ . So we choose
Δw = −𝜂 ⋅ f ′ (wt ). We still need to choose a value for 𝜂, that is, determine how large
a step we should take in that direction. Too large a step and we may miss the min-
imum. Too small a step and we may have to perform many superfluous updates.
The step size certainly should not be constant.
Newton invented what is now known as a Newton step to determine the step
size. He used the curvature of f . Let us approximate f with a second-order Taylor
series around our current guess, w:
f(w + Δw) ≈ f(w) + f′(w)Δw + f″(w)Δw²/2! + 𝒪(Δw³).   (5.3)
There remains the choice of Δw. Newton used his calculus to find Δw. Differenti-
ating the series with respect to Δw and ignoring error terms yield:
df/dΔw = f′(w) + f″(w)Δw.   (5.4)
Setting the derivative equal to zero and solving for Δw we obtain,
Δw = −f′(w)/f″(w),   (5.5)
and so as expected our update is proportional to the first derivative, but it is now
scaled by the local curvature, the second derivative. The full update is then,
w_i^{t+1} = w_i^t − f′(w_i^t)/f″(w_i^t).   (5.6)
Figure 5.3 An example run of Newton’s Method for optimization for the cosine function.
There are 4 steps until it finds a maximum. Each step is numbered.
This is a very elegant and simple result. An example of Newton’s method with a
scalar function is depicted in Figure 5.3. The trouble lies in the fact that ANNs are
not scalar functions. ANNs can have thousands, and even millions, of parameters
to solve for. Applying Newton’s method for optimization to ANNs requires the use
of matrices.
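A minimal sketch of the scalar case, along the lines of the cosine example in Figure 5.3 (the starting guess of 0.75 is an arbitrary assumption for illustration):

import math

def newton_step(df, d2f, w):
    # One Newton optimization step: w <- w - f'(w)/f''(w), Eq. (5.6).
    return w - df(w) / d2f(w)

w = 0.75                                    # arbitrary initial guess
for step in range(4):
    w = newton_step(lambda x: -math.sin(x),   # f'(w) for f(w) = cos(w)
                    lambda x: -math.cos(x),   # f''(w)
                    w)
    print(step, w, math.cos(w))
# The iterates converge on w = 0, a stationary point of cosine (here a maximum).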
Extending Newton’s method to matrices, we have g = ∇Loss(x; w), where g is
the gradient, ∇Loss, but approximated numerically with backpropagation, and
H = ∇2 Loss(x; w), where H is the Hessian (not the cross-entropy), the matrix of
the second derivatives. Then Δw = H⁻¹g is the vector of updates for the weights.
This is an example of a second-order method for training ANNs.
There are two problems with the equation, Δw = H −1 g. The first is constructing
the Hessian. Computing the derivatives of the ANN for the Hessian is computa-
tionally prohibitive, especially for state-of-the art problems where there can be
millions of parameters. The second problem is solving the Hessian. Even suppos-
ing that the Hessian could be computed practically, solving such a large matrix
is extremely expensive. The Hessian of an ANN is dense, that is, so many ele-
ments are nonzero that it cannot be considered sparse. The state of the art for
solving matrices of that size, Krylov subspace methods, is targeted at sparse matri-
ces (132). The number of ANN parameters (weights) continues to grow as research advances.
5.3 RPROP+
The RPROP+ algorithm addresses itself to the problem of choosing a step size for a
weight update (73). Like most modern learning rules, it relies on backpropagation
to push the gradient through the graph and provide the ∂L/∂w's. RPROP+ computes
the step size.
The RPROP+ acronym stands for Resilient PROPagation. The original RPROP
algorithm was introduced in 1993 (124). The + signifies an amended version of
the algorithm that also includes a means of back tracking, a later improvement on
the original work. RPROP+ fully embraces the limitation of the first derivative.
The sign of the derivative is the only information that it needs. The intuition of
RPROP+ is to continually increase the step size, that is, accelerate, until the sign
of the derivative changes, in other words, it accelerates over the surface of weight
space until it passes a minimum. The solution will then bounce back and forth
over the minimum until it falls in.
The memory requirements are linear in the number of weights. It requires two
variables for each weight and thus consumes 2 ⋅ MG of memory. For each weight,
w_i, RPROP+ keeps an update increment, 𝜂_i^t, and its last error, ∂L/∂w_i^{t−1}. The RPROP+
algorithm consists of three cases, and they are implemented in Algorithm 5.3.
8: w^t ← w^{t−1} − Δw^t
9: ∂L/∂w^t ← 0
10: else
11: Δw^t ← −sign(∂L/∂w^t) ⋅ Δ^t
12: w^t ← w^{t−1} + Δw^t
13: end if
14: end procedure
RPROP+ has 5 hyperparameters. The authors suggest the following values for
the parameters. They are
1. Δ0 = 10−2 : the initial weight update size
2. Δmin = 10−8 : the minimum weight update size
3. Δmax = 50 : the maximum weight update size
4. 𝜂 + = 1.2 : rate of acceleration
5. 𝜂 − = 0.5 : rate of deceleration
The authors suggest that RPROP+ is not very sensitive to these choices. The
algorithm is very robust with respect to the choice of hyperparameters. The default
values suggested in the paper are usually used in implementations.
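A vectorized sketch of the three cases is given below. It assumes state holds, per weight, the previous gradient, the previous step, and the current step sizes (initialized to Δ0); it is an illustration of the idea, not the book's implementation.

import numpy as np

def rprop_plus_update(w, grad, state, eta_plus=1.2, eta_minus=0.5,
                      delta_min=1e-8, delta_max=50.0):
    sign_change = np.sign(grad) * np.sign(state["prev_grad"])
    accel = sign_change > 0                  # same direction: accelerate
    brake = sign_change < 0                  # sign flipped: a minimum was passed

    state["delta"][accel] = np.minimum(state["delta"][accel] * eta_plus, delta_max)
    state["delta"][brake] = np.maximum(state["delta"][brake] * eta_minus, delta_min)

    step = -np.sign(grad) * state["delta"]   # the usual case
    step[brake] = -state["prev_step"][brake] # back-track the previous update
    grad = grad.copy()
    grad[brake] = 0.0                        # suppress the sign test next epoch

    state["prev_grad"], state["prev_step"] = grad, step
    return w + step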
Figure 5.4 An example path for RPROP+ during the training of a sine ANN. It plots the
loss as a function in weight space with respect to two of the weights in the same layer.
Notice the flat plains and the steep canyons of the loss function. It is crucial to produce
updates in response to changing topology.
and the errors can vary widely between epochs. RPROP+ can behave poorly
in these circumstances as from line 3 it is clear that following q uninterrupted
steps forward, the step size will be Δt+q = Δt ⋅ (𝜂 + )q . Following r moves in the
opposite direction slows RPROP+ down by (𝜂 − )r , but even when q = r we have
(𝜂 + )q ⋅ (𝜂 − )q ≠ 1.0. Put another way, Δt+q+r ≠ Δt . The updates have not cancelled
out (with the default parameters it is a net deceleration). This can lead to serious
convergence problems with SGD variants. The only way to make the updates
average out is to use 𝜂 − = 1∕𝜂 + , which results in deceleration that is simply
too fast.
Despite these limitations, RPROP+ has been reexamined in recent years in a
number of contexts (74) and it is widely used in many applications. In practice, it
can dramatically outperform the de rigueur weight update schemes when used in
conjunction with SGD, but the theoretical problems should be borne in mind. If
inexplicable convergence problems are observed, it is a good idea to switch to one
of the schemes presented below.
5.4 Momentum Methods

Momentum methods accumulate a history of how ∂L/∂w is changing. The accumulated history can detect
flat plains and steep precipices. The most popular optimizer for updating weights,
ADAM, is a member of the momentum family, and we will trace its develop-
ment by first presenting some of its important precursors to understand how
it works.
4: Δw ← 𝜂 ⋅ Δ^t / √(Δsq + 𝜖)
5: w^t ← w^{t−1} − Δw
6: end procedure
Every weight requires a variable, the running sum of the square of the gradients.
The final update, Δw, is the scaled gradient. When the gradient is large, dividing by
its square attenuates the step size. Small gradients are amplified by the square. 𝜂
is a hyper-parameter, a learning rate, and 𝜖 is a small number to avoid division by
zero. AdaGrad does well early in training, but Δsq grows monotonically and can
quickly lead to low rates of convergence. If good weights are not trained prior to
the inevitable immobility, then training has to be restarted; ideally, the Δsq can just
be reset. The momentum variable, Δsq , needs a decaying factor. This problem was
addressed by RMSProp.
RMSProp appeared soon after AdaGrad in 2012 (150). It recognized the need
for a decay factor to ensure progress throughout training. This was effected by
introducing another hyper-parameter, 𝜌, that decays the history over time. 𝜌
should be a positive value less than unity. The choice of 𝜌 and 𝜂 will depend on
the model being trained, and usually some experimentation is required to identify
good values. RMSProp is presented in Algorithm 5.5. The algorithm is the same
as AdaGrad except for the introduction of the decay hyperparameter, 𝜌. The additions are shaded.
4: Δw ← 𝜂 ⋅ Δ^t / √(Δsq + 𝜖)
5: w^t ← w^{t−1} − Δw
6: end procedure
RMSProp was a good improvement, but it still left a great deal to be desired.
Choosing the hyperparameters can be difficult, the perennial problem in machine
learning, but RMSProp can be used with minibatches and so it is appropriate for
use with SGD.
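The difference between the two is a single line. A sketch of both update rules follows; the learning rates and the value of 𝜌 below are illustrative assumptions, not prescriptions.

import numpy as np

def adagrad_update(w, grad, delta_sq, eta=0.01, eps=1e-8):
    delta_sq = delta_sq + grad ** 2                       # grows monotonically
    return w - eta * grad / np.sqrt(delta_sq + eps), delta_sq

def rmsprop_update(w, grad, delta_sq, eta=0.01, rho=0.9, eps=1e-8):
    delta_sq = rho * delta_sq + (1.0 - rho) * grad ** 2   # rho decays the history
    return w - eta * grad / np.sqrt(delta_sq + eps), delta_sq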
5.4.2 ADAM
In 2015, a further refinement was developed called ADAM (84). It is an extremely
popular optimizer and possibly the most widely used. ADAM works well with SGD
variants, that is, minibatch schemes, as it scales well to large problems. This makes
ADAM eminently suitable for use with large training sets. The name, ADAM,
is derived from ADAptive Moment estimation. It is based on the idea of using
momentum to choose the step size. ADAM evolved from RMSProp.
ADAM requires two variables per weight. They are its estimate of the first
moment of the gradient (the mean) and its estimate of the uncentered second
moment (the mean of the squared gradient). The estimates are used to scale the step size based
on the recent history of the gradient. The ADAM update rule is presented in
Algorithm 5.6.
3: m^t ← 𝛽1 ⋅ m^{t−1} + (1 − 𝛽1) ⋅ g^t
4: v^t ← 𝛽2 ⋅ v^{t−1} + (1 − 𝛽2) ⋅ (g^t)²
5: m̂ ← m^t / (1 − 𝛽1^t)
6: v̂ ← v^t / (1 − 𝛽2^t)
7: w^t ← w^{t−1} − 𝛼 ⋅ m̂ / (√v̂ + 𝜖)
8: end procedure
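A sketch of Algorithm 5.6 in code, using the default hyperparameter values suggested in the ADAM paper:

import numpy as np

def adam_update(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # t is the 1-based count of updates performed so far.
    m = beta1 * m + (1.0 - beta1) * grad          # first moment (mean)
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # uncentered second moment
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected estimates
    v_hat = v / (1.0 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v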
Figure 5.5 Computed densities of observed weight updates. Training was initiated with
the same initial values for weights. The x-axis is logarithmic.
ADAM’s (the x-axis is logarithmic). There is a barrier that ADAM hits that it seems
to want to cross. The wall curve is the effect of 𝛼 on the update size. Modern imple-
mentations of ADAM tend to be more aggressive, and many models using Keras’s
implementation of the ADAM optimizer use a value of 𝛼 = 0.01.
Of course large weight updates are not useful unless they contribute to earlier
convergence. To measure the effect of the larger updates, the losses and accuracy
are presented in Table 5.2. The table was produced with 5 runs for each training
method. Both models were trained for 100 epochs with SGD, and for each of the
5 runs, the models started with the same initial weights. The RPROP+ algorithm
clearly outperformed ADAM in this instance. While this is an example where
RPROP+ outperformed ADAM despite its theoretical limitations, it must be emphasized
that LeNet-5 is an extremely shallow and simple neural network by today's
standards of the state of the art.¹
While slower than RPROP+, ADAM is steady. When training a network with
RelU activation functions, ADAM is always selected over RPROP+. Slower, steadier
training is essential to avoid the “dead neuron” effect inherent with RelU; recall
that neuron death is not recoverable. RPROP+ can easily shoot into a bad area, it is
designed to, but it must then recover, and RelU is not forgiving: once a neuron is
dead its derivative is always zero and it can never recover. RPROP+ can kill many
neurons with RelU, and progress typically comes to a halt well before convergence.
RelU (and its variants) are almost always used in the deepest DL models,
and ADAM is the optimizer of choice.
Both ADAM and RPROP+ are examples of first-order methods, that is, they only
rely on the first-order derivatives produced with backpropagation. Both strate-
gies employ heuristics to compute the step size. We saw with Newton’s method
1 LeNet-5 is strictly ordered and comprised of 5 layers. This is small compared to the 20+ layers
with skipping in modern visual classifiers.
J = ∇f = ⎛ ∇ᵀANN(x_1) ⎞
         ⎜     ⋮      ⎟ ,   (5.9)
         ⎝ ∇ᵀANN(x_N) ⎠
where the N rows are the results of performing backpropagation on the ANN with
the N elements from the training set. Note that the ANN is differentiated not a
loss function. Let f_i = ANN(x_i); then the Jacobian matrix is generated over a training
epoch, creating a row of M entries, ∂f_i/∂w_j, for each of the N training examples.
For a training set containing N examples, and an ANN with M weights, we obtain
the following Jacobian matrix:
J_{N,M} = ⎛ ∂f_1/∂w_{1,1}   ∂f_1/∂w_{1,2}   ⋯   ∂f_1/∂w_{1,M} ⎞
          ⎜ ∂f_2/∂w_{2,1}   ∂f_2/∂w_{2,2}   ⋯   ∂f_2/∂w_{2,M} ⎟
          ⎜       ⋮                                            ⎟
          ⎝ ∂f_N/∂w_{N,1}   ∂f_N/∂w_{N,2}   ⋯   ∂f_N/∂w_{N,M} ⎠ .   (5.10)
Each row is the per-weight error for an example from the training set. Thus, entry
∂f_i/∂w_{i,j} is the error for weight j at example i from the training set.
Computing the Jacobian also requires a modification to the usual training
epoch. Instead of computing the net gradient by summing the per-weight derivatives
over the examples, the Jacobian records each derivative individually. The
net gradient for w_j is the sum of the jth column in the Jacobian. Both ADAM and
RPROP+ only store the net gradient.
The object for the weight update is to reduce the error with the training set. The
error reduction can be expressed as |f(w + Δw) − y|min , where y is the vector of
the ground truth from the training set, and we are using Euclidean distance. The
method is to solve for the vector Δw by minimizing the residuals. Substituting for
the Taylor series approximation, the following is obtained,
where r is a vector of the residuals (the error). This formulation of the problem
can be solved with least squares. The object is to solve for the vector, Δw. Observe
that r also happens to be the vector of per-training-example derivatives of the MSE
loss function, ∂L/∂z_{output,i}. The global MSE loss can be computed as (1/2N) ⋅ rᵀr. So the matrix
formulation naturally leads to the usual regressor framing of the problem.
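A sketch of a single damped Gauss–Newton (Levenberg–Marquardt style) step built from the Jacobian and the residuals; the damping term 𝜆I and the sign convention r = y − f are assumptions for illustration rather than the book's exact formulation.

import numpy as np

def lm_step(J, r, w, lam=1e-3):
    # Solve the damped normal equations (J^T J + lam*I) dw = J^T r.
    A = J.T @ J + lam * np.eye(J.shape[1])
    dw = np.linalg.solve(A, J.T @ r)
    return w + dw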
Constructing the normal equations for least squares leads to,
The canonical approach when dealing with least squares problems is to employ the
QR decomposition. Decomposing the Jacobian and substituting yields the familiar
LS solution:
2 A Taylor series expansion was used, (B + A)−1 ≈ A−1 - A−1 B A−1 + A−1 B A−1 B A−1 - ...,
where A = 𝜆I.
Figure 5.6 Densities of log scale losses of models following training. LM losses are
centered 2–3 orders of magnitude to the left of the first-order methods.
LM resource demands quickly grow onerous. Both memory and time grow very
quickly, so there are very few problems for which LM is appropriate.
Figure 5.6 presents the results of training ANNs to learn sine. The densities of
the final losses for 30 attempts to train an ANN per strategy are depicted. ADAM
and RPROP+ ANNs were trained for 100 epochs. Comparing LM with ADAM and
RPROP+ is potentially problematic. Running LM for 100 epochs would produce
spectacular results, but take far more time. The first-order methods were used to
calibrate the comparison. The mean CPU time consumed by the resulting com-
bined 60 runs of the first-order methods was used to run LM. Thus, LM training
took the same CPU time, but managed far fewer epochs. The logs of the losses
are presented. LM is orders of magnitude better than the first-order methods. The
experiment is not meant to be misleading. It must be emphasized that sine is a
trivial function to learn. LM is totally impractical for the LeNet-5 experiment in
Table 5.2. Medium sized Deep Learning models have millions of weights. The
space and time requirements of LM render it totally impractical for such prob-
lems. But the dream of such rapid convergence is so promising that researchers
continue to examine second order methods.
5.6 Summary
Backpropagation reliably finds the direction of the correct update, but not the
appropriate size of the step to be taken. The optimal step size is too expensive to
compute so heuristics are used to compute it. RPROP+ is an extremely fast heuris-
tic, but can present challenges when used with SGD. ADAM is the de rigueur
optimizer used for training Deep Learning models. RelU and its variants are the
activation functions of choice.
5.7 Projects
The projects below rely on notebooks that can be found here, https://github.com/
nom-de-guerre/DDL_book/tree/main/Ch05.
1. The Python notebook sine.ipynb looks for the minimum of sine. It contains an
implementation of ADAM and RPROP+. Implement Newton’s Step. Compare
the heuristics’ step sizes with Newton’s step.
2. The iris classifier iris05.ipynb includes ADAM and RPROP+ optimizers. Plot
loss versus epoch for both RPROP+ and ADAM. Experiment with different
topologies. Is there an important difference?
3. The website includes a handwritten digit classifier, called MNIST05.ipynb. Plot
loss for MNIST for both RPROP+ and ADAM. Do the results differ from the iris
experiment?
6 Convolutional Neural Networks
This chapter presents convolutional neural networks (CNNs). For the purposes
of this chapter, CNN means convolutional neural network, not classifying neural
network. In general, when people use the acronym CNN, it is the former meaning
that is intended, a convolutional neural network. CNNs are often classifiers,
so a CNN can be a classifying neural network. When the latter sense is meant,
it is generally written out in full, not as an acronym, to avoid confusion. CNNs
have wide application, often in image recognition, but they have many uses
including in games, generative artificial neural networks (ANNs) and natural
language processing. A CNN is an ANN that includes at least one convolutional
layer. They are used extensively in deep learning (DL) performing many vital
functions in deep neural networks. This chapter motivates the use of convolu-
tional layers, describes their operation inside an ANN, and demonstrates how to
train them.
CNNs were motivated by the observation that cats had neurons dedicated to
fields of vision, that is, they could comprehend sub-regions of an image in parallel.
A cat’s brain divides an image into subimages called receptive fields. The receptive
fields have dedicated neurons. Dedicating neurons to receptive fields produces an
efficient motion detection mechanism. Fukushima was inspired by the biological
use of receptive fields. In 1980, he described how to combine perceptrons with
convolutions to produce a “neocognitron” (43). A neocognitron is a multilayer
neural network and meets the modern definition of “deep learning.” The system
used perceptrons to perform image recognition tasks with convolutions. Another
point of interest is that two modes of training were described, supervised, and
unsupervised. The neocognitron was able to learn and recognize hand-written
digits, but it was arguably the work done in 1989 by Yann LeCun et al. (91) that
set the pattern for modern CNNs. It was the first paper to describe the use of
backpropagation of error to train a CNN. LeCun recognized that introducing
6.1 Motivation
Images are much larger inputs than the example datasets that have been discussed
thus far. Such large inputs require special handling to efficiently process them.
When performing image recognition, many properties are required. The algorithm
should be shift invariant, that is, it should not matter where precisely an object is
in an image, it should be detected. Images can be large. It is a challenge to both
quickly and reliably recognize items in an image.
Consider a color image with a resolution of 1024 × 1024 × 3. There are 3 colour
channels, one for each of red, green, and blue (RGB). RGB images can be thought
of as 3 images exclusively in one of each of the 3 color channels. An RGB image
is a volume, not an area (2 dimensional grid). Such objects can be encapsulated
in a tensor (the reader is directed to section 12.4 in (49) for a brief introduction).
To invoke an ANN, the example image could naively be construed as a vector
with a length of 3,145,728. Passing such a large vector to a fully connected ANN
would result in an enormous number of weights. An input layer with 100 neu-
rons would require 314,572,900 weights (the additional 100 weights accounts for
the bias). Such a large number of weights present an enormous problem in many
respects. The weights consume space (memory), increase training time, and add
latency when performing inference with the final model. Large numbers of fully
connected neurons with dedicated weights can also generalize poorly as they tend
to overfit during training.
Image classification attempts to look for spatial relationships. If a nose is
detected, then there are probably one or two eyes nearby. A spoon will look
like a spoon no matter where it is located in an image. The Figure 6.1 depicts a
sample of hand-written twos. The examples are taken from the modified National
Institute of Standards and Technology (MNIST) hand-written digit dataset. The
dataset is a combination of two datasets.

Figure 6.1 Five examples of hand-written twos from the MNIST dataset. The images are 28 × 28 × 1 (gray scale). There are no simple rules that define what a two looks like.

Both consist of hand-written digits and their labels, but from two separate sources (US Post Office employees and
high school students). The full MNIST dataset includes a training set of 60,000
digits and a test set consisting of 10,000 digits, for a total of 70,000 examples. All
10 digits, 0–9, are represented approximately equally. It was introduced in a paper
describing a very famous DL model, LeNet-5 (92). MNIST digits are convenient to
use as examples as they are relatively small, 28 × 28 × 1, where the 1 denotes that
it is gray-scale, which makes an MNIST digit a 2 dimensional array, not a volume.
The MNIST dataset contains all 10 digits, but we will restrict ourselves to the
twos for the moment. Implementing a model to recognize twos requires learning
what a “2” looks like. A curve implies finding a line in a predictable direction, and
vice-versa. A two has defining features that can be recognized, but perhaps are dif-
ficult to specify formally. Moreover, a two is a composite of a number of distinct
features in the image. Humans can recognize them immediately (and they com-
plain about the hand writing if they cannot). The same can be said of all digits and
the Latin alphabet, which is relatively simple when compared to other alphabets
such as the Japanese alphabet. Instead of writing down rules to distinguish the dif-
ferent symbols, it is desirable for an ANN to simply learn how to recognize them.
The techniques that are developed can be used to recognize anything.
The process starts by applying the kernel to the top-left of the input matrix.
The computation is performed, the result stored, and the kernel is moved along
to the right and the operation repeated with the new submatrix. Applying the ker-
nel and sliding over to the new submatrix are repeated until the end of the image
is reached on the right. At this point, the algorithm returns to the first column
on the left, but slides down. The number of slots to move along (the number of
columns), or the number of rows to move down, is known as the stride. The stride
is one of the hyperparameters of convolutional layers. An example of the process
is shown in Figure 6.2. For clarity of exposition, input matrices and kernels are
assumed square. Neither object is required to be square and in real applications
are frequently not. MNIST images are square.
To be useful, the kernel should extract meaningful information from the original
image. One means of identifying features is to hardcode some masks and use them
as kernels. By applying masks to an image, it is possible to identify vertical lines,
horizontal lines, and diagonals. Each mask produces a feature, and applying them
to the original image decomposes the image in a different way with respect to the
feature. The masks can be interpreted as filters. The mask looking for vertical lines
is filtering the image with respect to vertical lines. The matrices in Eq. (6.1) are
examples of filters. Applying all 4 masks produces 4 separate feature maps. The
idea is that patterns associated with digits will emerge. Comparing feature maps
intra-digit should find similarities, and comparing feature maps inter-digit should
identify differences.

Figure 6.2 The repeated application of a kernel to produce a feature map. The kernel is applied to every submatrix resulting from traversing the image by the stride. The 28 × 28 MNIST image is convolved to a 26 × 26 image produced from the 26 × 26 convolutions resulting from a stride of 1.

Assuming a kernel size of 3 × 3, the following constitute the masks required:
k_hor = ⎛0 0 0⎞        k_ver = ⎛0 1 0⎞
        ⎜1 1 1⎟                ⎜0 1 0⎟
        ⎝0 0 0⎠ ,              ⎝0 1 0⎠ ,

k_dTop = ⎛1 0 0⎞        k_dBot = ⎛0 0 1⎞
         ⎜0 1 0⎟                 ⎜0 1 0⎟
         ⎝0 0 1⎠ ,               ⎝1 0 0⎠ .   (6.1)
The convolution consists of taking a mask, e.g. k_hor, and element-wise applying
a logical binary operator, the not-exclusive-or (NXOR). The kernel is applied to
a submatrix from the image of the same dimensions as the kernel (3 × 3). The
truth table for NXOR is 1 if both arguments are nonzero or if both arguments are
zero, that is, if both arguments agree. It is zero if the arguments disagree. Applying
the mask yields 9 values as there are 9 binary operations between the mask and
the submatrix. To produce the final scalar, the results are summed; a value of 9 is
a perfect match, 0 connotes a complete difference. The kernel is summarizing a
submatrix with a scalar, that is,

∑_{i=1}^{rows} ∑_{j=1}^{columns} k_{i,j} ⊙ subm_{i,j}.   (6.3)
The procedure is repeated until all of the submatrices in the image have been pro-
cessed and the feature matrix is filled.
Figure 6.2 graphically depicts the process for an example MNIST “2”. An MNIST
image is a 28 × 28 grayscale matrix. The mask is 3 × 3, with a stride of 1. As the
stride is 1, the convolution moves over one column following each computation.
Once the last column is reached, computation returns to first column, but moves
down a row. A stride of 1 creates a great many overlapping convolutions decreasing
the chance of “missing something” in the image; the kernel is looking for spatial
relationships. The cost is a bigger feature map. The number of rows in the fea-
ture matrix is Imrows − krows + 1 = 28 − 3 + 1 = 26. As all objects are square in this
example, the resulting feature matrix is 26 × 26. There are 4 masks defined in (6.1)
leading to 4 feature maps, each 26 × 26. The output is therefore 4 × 26 × 26 × 1.
In general, a feature matrix will have dimension,
(Im_rows − k_rows)/s + 1,   (6.4)
where s is the stride. In the case of asymmetric kernels or images, then the column
width will have to be computed as well.
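A sketch of the mask convolution described above, using the NXOR agreement count of Eq. (6.3) and the output size of Eq. (6.4); the random binary image below is only a stand-in for a thresholded MNIST digit.

import numpy as np

def apply_mask(image, mask, stride=1):
    k = mask.shape[0]
    out = (image.shape[0] - k) // stride + 1            # Eq. (6.4)
    fmap = np.zeros((out, out), dtype=int)
    for i in range(out):
        for j in range(out):
            subm = image[i*stride:i*stride+k, j*stride:j*stride+k]
            agree = (mask != 0) == (subm != 0)          # NXOR: both on or both off
            fmap[i, j] = agree.sum()                    # 9 is a perfect 3x3 match
    return fmap

image = (np.random.rand(28, 28) > 0.5).astype(int)      # stand-in binary image
k_hor = np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]])     # the horizontal mask of (6.1)
print(apply_mask(image, k_hor).shape)                   # (26, 26)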
Note that not all elements in the original image participate an equal number
of times in the convolutions. As described, the pixel in the top right corner, and
the top left, each participated in only a single kernel application. The pixels in
the center of the top row will contribute to 3 calculations. If there is something
important on the edges of the images, then this may not be desirable. Images can
be padded to increase the inclusion rate. In the case of an MNIST digit, a 28 × 28
image can have zeros added around the edges depending on the desired level of
padding. A 28 × 28 image can be padded to produce a 29 × 29 or a 30 × 30 image,
whatever is required. To produce a 30 × 30 padded image, a row is added to the
top and bottom, and a column is prepended and appended to the left and right of
the image. The original image is at the center of the padded image. The padding
increases the number of times the edge pixels participate in convolutions, so the
edges receive the full benefit of the kernels. A classifier training on the MNIST
dataset does not need padding, as the images are centered. Not all applications
are so well behaved though. Domain knowledge is required to make a sensible
determination.
The results of applying the masks to an MNIST “2” are displayed in Figure 6.3.
The left-most image is the original input image of a “2”. The remaining images to
the right are the feature maps resulting from applying the masks in order of their
definition in Eq. (6.1). The darker colors are higher numbers, so closer matches
to the kernel. Visual inspection of the results suggests that they are sensitive to a number of factors.

Figure 6.3 The result of applying the kernels in (6.1) to detect features. The original hand-written two is on the left. The results of applying the masks are in order from left to right. The stride is 1, which results in the 28 × 28 image convolving to 26 × 26.

The masks are looking for patterns that are a single pixel
in width. The example “2” does not restrict itself to single pixel width anywhere.
Despite that, some feature definition has emerged. Once computed, the feature
maps are ready for use with the bottom of the classifier, a FFFC ANN.
With the features computed, they can be further processed by a fully connected
ANN. Construing the 4 features as a single, albeit long, vector the features can
be passed into the deeper ANN. The feature vector is 4 × 26 × 26 = 2704 elements
long. While this is far larger than the 28 × 28 = 784 long vector that would result
from starting with the original image, it is far more information rich. The impor-
tant differences between a “3” and an “8” are easier to discern in the feature matri-
ces than the original image. The structural differences emerge more distinctly.
The CNN described is a big improvement on a naive classifying ANN. A num-
ber of questions do, however, immediately suggest themselves. Do the masks that
were defined make sense for digits? Are there better kernels that could be used?
Ideally, the masks would be well suited to the problem domain. What if more
than 4 feature maps are desired? Conceiving bitmasks for use with convolutions
is clearly not ideal, and there are only so many sensible masks that suggest them-
selves. If possible the features should be automatically created during training, not
hardcoded a priori. Learned features increase the number of possible useful fea-
tures as the onus is placed on the CNN to find them, not the human. Features that
are learnt would also be well suited to the problem domain as the features were
learnt from the training set, the definition of the problem domain.
6.3 Filters
The above arguments suggest that a desideratum for the kernels is that they are
learnt. If kernels are learnt, then domain-specific kernels will result yielding better
feature maps. In addition, the number of features is constrained by the needs of
the model, not by how many masks can be dreamt up by a human.
The perceptron makes an excellent kernel. A perceptron employed as a kernel
is known as a filter. Some of its advantages as a kernel include that the perceptron
is well understood, including how to train it. A perceptron accepts multiple inputs
and produces a scalar, precisely what is required for a kernel. The weights of a
perceptron can be used for the values in a mask. Weights are learnt leading nat-
urally to learned features. A 3 × 3 kernel can be implemented with a perceptron
with 9 weights, and in general a krows × kcolumns kernel can be implemented with
krows ⋅ kcolumns weights (and a bias). Perceptron weights are usually numbered lin-
early and construed as a vector. This follows from designing them to accept vector
arguments in an ANN. The vector interpretation also makes it convenient to imple-
ment perceptrons as a row in a weight matrix for a layer. Kernels are small matrices
{w0, w1, w2, w3, w4, w5, w6, w7, w8} → ⎛w0 w1 w2⎞
                                       ⎜w3 w4 w5⎟   (6.5)
                                       ⎝w6 w7 w8⎠
Arranging the weights in a matrix produces a kernel of the correct dimensions,
but the kernel must produce a scalar. Matrix multiplication produces a matrix. A
scalar is obtained with an extension of the vector dot product, known as the inner
product, used by perceptrons. The dot product for a pair of matrices of the same
dimension is known as a Frobenius product. The Frobenius dot product is similar
to a Hadamard operation except that the final result is a scalar, as the results are
summed. The Frobenius dot product for a filter and a submatrix is defined as:
F(f, subm) = ∑_{i}^{rows} ∑_{j}^{columns} w_{i,j} ⋅ subm_{i,j},   (6.6)
where f is a filter (perceptron), and subm is the submatrix. The Frobenius product
is a natural extension of how a perceptron computes its intermediate state,
u, in an ANN: the dot product of its inputs with its weights. Once the Frobenius
product has been computed, the perceptron applies its activation function to produce
the final result. The final expression for the scalar produced by a perceptron
is 𝜎(F); this is the filter kernel. The result is stored in the feature map. Modern
DL CNNs typically use the RelU activation function for the reasons outlined in
Section 3.5.4. It is interesting to note that Fukushima introduced the RelU activa-
tion function when describing his cognitron in 1975 (42), but he did not use it in
the neocognitron with convolutions (Figure 6.4).
Figure 6.4 The figure depicts a 3 × 3 image and the feature map that results following
the application of a 2 × 2 kernel when a stride of 1 is used. The 3 × 3 image convolves to
a 2 × 2 feature map. The shaded elements in the submatrix on the left are used in the
highlighted element in the feature map on the right. The kernel is depicted with a
general activation function, 𝜎, but it is usual to use RelU.
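A sketch of a perceptron used as a filter kernel: the Frobenius product of Eq. (6.6) followed by the activation, RelU here, as is typical.

import numpy as np

def filter_kernel(weights, subm, bias=0.0):
    u = np.sum(weights * subm) + bias     # Frobenius product plus the bias
    return max(0.0, u)                    # sigma(F) with a RelU activation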
To produce the feature map, the filter (perceptron kernel) is applied to each sub-
matrix as described above with naive masks. A feature map is produced per filter,
and the number of filters corresponds to the number of the desired features. The
number of filters required is specified as a hyperparameter of the model.
Typically, multiple filters are applied in parallel in one layer of a CNN. The result
is multiple feature maps produced from a layer of filters. Continuing with the
MNIST example (3 × 3 kernel, stride 1), the input for the filter layer is 28 × 28, and
assuming 5 filters the output would be 5 × 26 × 26. Once the feature maps have
been computed, they can be passed on to a fully connected ANN. Fully connected
ANNs expect vectors, and the output of the filter is the wrong shape. Prior to pass-
ing the feature maps to the ANN classifier, they must be flattened. Flattening is
the process of changing the higher order (complicated) shape to a vector. Follow-
ing with the MNIST example, the 5 × 26 × 26 tensor is flattened to a 3380 element
vector. The vector is the shape expected by an ANN’s input layer. It should be noted
that a good implementation will not perform any copying of data, but merely rein-
terpret the memory occupied by the tensor. Flattening is not to be confused with
concatenation. Concatenation is the operation of fusing, or synthesizing, multiple
tensors of the same underlying shape. For example, concatenating a 3 × 20 × 20
tensor with a 4 × 20 × 20 tensor results in a 7 × 20 × 20 tensor. It is a very different
operation (and almost certainly will involve copying data).
An important interpretation of filters is that of a regularized perceptron. A
feature map has a single perceptron processing an input image of many pixels.
In other words, convolutions are a form of regularization as they are kernels
providing shared weights between two layers. This is described in the section on
regularization (Section 7.4). All of the entries in a feature map share the same
weights. In consequence, overfitting is far less likely and filters generalize well.
Filters extract features learnt from the problem domain; they are learnt during
training. It is clear that filters simultaneously perform two important rôles, they
regularize the network and automatically perform quality feature extraction.
LeCun et al. recognized the utility of perceptrons as filters. It was this insight that
motivated them to train perceptrons as filters with backpropagation of error in a
CNN (91). The arguments remain valid today, and these are the reasons for their
continued widespread use.
6.4 Pooling
In a typical CNN, multiple filters are used in parallel in a layer resulting in a corre-
sponding number of feature maps. At this point, the feature maps can be flattened
and passed to an ANN, but the resulting vector is rather large (usually much larger
than the original image). The feature maps need to be condensed to make them
argmax_{a_{i,j}} ⎛1 2 2⎞
                 ⎜3 5 1⎟ = 7.   (6.7)
                 ⎝7 4 1⎠
It is common to place a maxpooling layer after the first filter layer in a network.
The first layer accepts the input image and is therefore the largest. Maxpooling
is expected to select the most important aspects of the filter’s feature maps. Just
like any convolutional kernel, it is applied to the submatrices of an input matrix
producing a new result matrix. Every feature map accepted as input is pooled to
produce a corresponding output pooled map. The pooled matrices can then be
flattened and passed on to a fully connected ANN.
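A sketch of the pooling operation, using a 2 × 2 kernel with a stride of 2 as in the MNIST example:

import numpy as np

def maxpool(fmap, width=2, stride=2):
    out = (fmap.shape[0] - width) // stride + 1
    pooled = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            pooled[i, j] = fmap[i*stride:i*stride+width,
                                j*stride:j*stride+width].max()
    return pooled

print(maxpool(np.random.rand(26, 26)).shape)    # (13, 13)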
The width of the kernel and the stride are independent hyperparameters of a
pooling layer, but the number of result matrices is dictated by the earlier layer.
The hyperparameters of both the filter and the pooling layers are often considered
together as there is often a target output size, but the resulting pair of hyperparam-
eters are usually different (e.g. a filtering layer with a stride of 1 and kernel width
of 5 followed by a pooling layer with a stride of 2 and a width of 2). The number
of maxpooling maps is dictated by its input layer.
With convolutions defined, the pieces can be assembled to produce the whole:
a CNN. There are 3 hyperparameters defining a convolutional layer with filters.
They are the number of the filters (feature maps), the stride, and the dimensions
of the filter. The immediately following maxpooling layer, if present, is only free
to use different strides and kernel width. For maxpooling layers, the number of
output maps is determined by the number of input maps.
It is common to place a pooling layer following the first filter layer in a CNN.
Figure 6.5 presents a CNN to classify MNIST digits. The FCFF ANN portion of the
model could be quite small, for this example {50, 50, 10} would work well. The
convolutional layers of the CNN can be known as the features portion of the CNN,
but in modern DL CNNs the distinction is blurred. There can be many filters, and
the number of filters, and the number of filter layers, is dictated by the application.
The MNIST dataset is a trivial dataset by modern standards, but more challenging
images such as those produced by the camera of a mobile phone would require a
deeper network. This would include more filter layers, not just a wider filter layer
at the top.
It is useful to examine the output of the individual layers to understand what
is happening inside the CNN (168). This can be invaluable when debugging.
The Figure 6.6 presents the outputs of the convolutional layers of the CNN in
Figure 6.5. It was trained to learn the MNIST dataset. The input image was
deconstructed into 5 features followed by 5 pooling layers. Problem domain filters
have learnt how to recognize features, and maxpooling has summarized the
features and the deeper fully connected ANN has an easier job to do.
When considering and designing CNNs, it is helpful to think in terms of the
shapes of the tensors. The CNN can be viewed as a pipeline. Data starts at the
top and exits at the bottom. In the case of a classifying ANN, the bottom is a soft-
max layer producing the prediction. Layer after layer shapes and passes on the
image. Figure 6.5 depicts the pipeline 28 × 28 → 5 × 26 × 26 → 5 × 13 × 13 → 845.
The 5 final maxpool maps are flattened to produce a vector of 169 × 5 = 845 ele-
ments. 845 is slightly larger than the original 784 vector, but it is packed full of
information.
Figure 6.5 An example of a complete CNN. The CNN is a classifier that accepts
examples from the MNIST dataset. The shapes of each layer are inscribed above the
layers. There are 5 filters, which each receive a complete copy of the digit, followed by a
pooling layer. The strides are 1 and 2, respectively.
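For readers who want to experiment, the Figure 6.5 pipeline could be sketched in Keras roughly as follows. This is an assumed, illustrative implementation, not the book's own library, and Keras orders the shapes channels-last (26 × 26 × 5 rather than 5 × 26 × 26).

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(5, kernel_size=3, strides=1, activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2, strides=2),
    tf.keras.layers.Flatten(),                       # 5 * 13 * 13 = 845 elements
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(50, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.summary()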
Figure 6.6 The top row is the result of applying a set of 5 filters to an example “6” from
the MNIST dataset. The bottom row presents the result of applying a Maxpool layer to
the result. The pipeline is as described in the text.
Convolutions are almost synonymous with DL. The example CNN in Figure 6.5
is trivial by modern standards. A single layer of filters can learn MNIST. The
pipeline can be much deeper and include many layers of filters. A more chal-
lenging problem would require a deeper CNN. The shallower filters identify
useful features and the deeper filters learn compositions of the trivial features.
As the CNN gets deeper, the filters learn more complicated artifacts built on the
earlier decompositions. The output of filters can be flattened and used as input
to a deeper filter layer. A CNN’s architecture is dictated by the application, and
the shallowest possible should be used. The efficient use of GPUs to train CNNs
described in AlexNet (89) set off a race for deeper and wider CNNs. Modern
deep networks can have over 22 convolutional layers. Examples include the
Googlenet (148) and VGGx (140) networks. Training such large systems requires
GPUs and experience. Propagating an accurate gradient so far backward is
challenging and prone to disappearing (becoming zero).
The deepest networks often include a macro structure built with multiple layers
to perform a specific job. Googlenet introduced the inception module. An inception
module combines multiple layers that were designed together to solve a problem.
One of the hyperparameters of a filter is the kernel width. An inception module
includes a filter layer of multiple filter widths. This is far more challenging than it
first appears. The final step of an inception module is to ensure that a consistent
shape is passed on (usually with concatenation). The power of the module is that
it introduces an element of size invariance in the pipeline. Inception modules have
the capacity to detect an object, e.g. an apple, no matter what size it appears at in
an image.
The difficulty and resources required to train DL models have led to the trend
of making pretrained models available. An application developer can simply
flattening layer simply reshapes the multiple objects of its input into a single
output vector. When transmitting the gradient, it simply does the reverse. The
gradient, which is flat when it emerges from the ANN, is reshaped to fit the shape
the flatten layer expects from its input feature maps. The only task then is to
transmit the gradient through the flattening layer. Perceptrons inside of the
classifying fully connected ANN propagate the gradient across layers by
𝜙_𝓁 = ∂L/∂z_𝓁 = W_{𝓁+1}ᵀ ⋅ 𝛿_{𝓁+1},   (6.9)
where z𝓁 is the output vector of the flattening layer, 𝓁, and 𝜙𝓁 is the gradient arriv-
ing at the flattening layer, 𝓁. The only work to be done by the flattening layer is
reshape 𝜙𝓁 to fit the expected input shape. As there are no trainable parameters
in the flattening layer, or inflective operations performed on the data, the flatten-
ing layer has no effect on the gradient. The gradient is simply transmitted through
to be consumed by the immediately shallower layer in the correct shape. Refer-
ring back to the example of the CNN depicted in Figure 6.5, 𝜙𝓁 is a vector of 845
elements. It is reshaped to 5 × 13 × 13 and then propagated backward.
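A sketch of the two directions through a flattening layer; there are no trainable parameters, only a reshape.

import numpy as np

def flatten_forward(feature_maps):
    return feature_maps.reshape(-1)              # e.g. 5 x 13 x 13 -> 845

def flatten_backward(phi, input_shape):
    return phi.reshape(input_shape)              # 845 -> 5 x 13 x 13

print(flatten_backward(np.random.rand(845), (5, 13, 13)).shape)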
Figure 6.7 An example of a pooling layer in a forward training pass and the resulting
backpropagation of error. The 3 × 3 input image is pooled with a 2 × 2 kernel and a stride
of 1, convolving to a 2 × 2 feature map. The ANN computes and propagates the gradient
as usual until it reaches the flattening layer. The gradient is reshaped and passed on to
the pooling layer. The pooling layer assigns the gradient based on the results of the
forward pass (the maxima).
defined, result. Thus we conclude that the derivative for max is defined and equal
to either 1 or 0; 1 for the max and zero for all the other arguments. This is intuitive
as the arguments that are not the maximum have no effect on the downstream
computations. For a pooling layer, 𝓁, 𝜙_𝓁 = ∂L/∂z_𝓁 is passed back to the maxima and
is zero for all other arguments. A pooling layer does not contain any trainable
parameters so it merely passes the gradient through, albeit selectively, similarly
to the flattening layer. The gradient has the same shape as the input, so the gra-
dient emanating from the pooling layer in Figure 6.5 would be a tensor of shape
5 × 26 × 26.
Mathematically pooling layers do not present any challenges to backpropaga-
tion. A pooling layer does, however, require some bookkeeping during training.
When computing the maxima for the feature map, a pooling layer needs to record
which ai,j produced maxima so that the gradient can be passed back efficiently. In
the event of a tie, for example, the background of an image, the gradient should
be split equally among all of the inputs in the submatrix. This is not always done
and should not be assumed when using third-party libraries (Figure 6.7).
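A sketch of the bookkeeping: the gradient 𝜙 arriving at the pooling layer is routed back to the element that produced each maximum. For simplicity this sketch gives ties entirely to the first maximum rather than splitting them.

import numpy as np

def maxpool_backward(fmap, phi, width=2, stride=2):
    grad = np.zeros_like(fmap)
    out = (fmap.shape[0] - width) // stride + 1
    for i in range(out):
        for j in range(out):
            subm = fmap[i*stride:i*stride+width, j*stride:j*stride+width]
            r, c = np.unravel_index(np.argmax(subm), subm.shape)  # recorded maxima
            grad[i*stride + r, j*stride + c] += phi[i, j]
    return grad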
∂L/∂w_m = ∑_{i,j}^{convolutions} x_{p,k} ⋅ 𝛿_{i,j}.   (6.14)
Figure 6.8 Backpropagation from a pooling layer to a filter. The filter gradient table is
simply the Maxpooling gradient renamed (relabeled). The gradient is used to compute
the 𝛿s producing a 𝛿 for each instance of weight sharing (convolution). In this example,
only 3 of the 𝛿s are nonzero, the blue ones.
Figure 6.9 Some example weight updates in a filter. A weight accumulates the loss for
every element in the feature map that it participated in. This can in turn be computed as
a Frobenius product between the table of 𝛿s and the submatrix of the input image that
the weight touched. The kernel is 2 × 2 and the stride is 2.
Note that x_{p,k} has different indexing; it is the argument from the input map that
produced the convolution. This equation can be expressed as a Frobenius product
between the input image and the map of 𝛿s. The Frobenius product is computed
for each weight in the filter. The submatrix of the input image changes to match
those elements that the weight touched when computing the feature map. The
process is depicted graphically in Figure 6.9.
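A sketch of Eq. (6.14) for a 2 × 2 filter with a stride of 2, as in Figure 6.9: every weight accumulates, over all the convolutions it participated in, the input element it touched times the corresponding 𝛿.

import numpy as np

def filter_weight_gradient(image, deltas, k=2, stride=2):
    dW = np.zeros((k, k))
    rows, cols = deltas.shape
    for i in range(rows):
        for j in range(cols):
            subm = image[i*stride:i*stride+k, j*stride:j*stride+k]
            dW += subm * deltas[i, j]        # Frobenius-style accumulation
    return dW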
For the example in Figure 6.5, the backpropagation would be terminated at this
point, but it is a trivial example with a single feature layer. State-of-the-art DL
CNNs have many filter layers. It follows that the gradient must be transmitted
through a filter to shallower layers. The dedicated weight form of interlayer gradi-
ent passage is Eq. (6.9). Not only are the weights shared in a filter but the elements
of the input are as well. It is clear that the gradient propagation equation must also
be emended to deal with convolutions.
The same strategy that was employed for derivatives of filter weights can also be
used for the gradient’s passage. For a particular point in the input image, the total
gradient is accumulated by summing all of the losses emanating from the con-
volutions in which it participated. The actual computation is similar to the filter
weights. Instead of a single computation, the dedicated weight version is applied
repeatedly. The result will be a gradient in the same shape as the layer’s input.
∂L/∂x_{p,k} = ∑_{i,j}^{convolutions} 𝛿_{i,j} ⋅ w_k.   (6.15)
The maximum length of a sum for any point in the transmitted gradient is the ker-
nel width squared. The computation itself can also be expressed as a Frobenius
product. Equation (6.15) is the Frobenius of the filter with the 𝛿 map correspond-
ing with the convolutions in which the pixel participated. Prior to use the filter’s
matrix must be reversed. This is because the input map’s elements interact with
the filter in the reverse order. An input element can participate in at most filter
size number of convolutions. If an element contributes to the maximum number,
then it is in the order starting from wn−1 in reverse order to w0 . To reflect the order
of usage, the filter matrix must be reversed, not just transposed, as the diagonal
needs to change as well.
filter = ⎛w0 w1 w2⎞        filter_BPG = ⎛w8 w7 w6⎞
         ⎜w3 w4 w5⎟    ⟹                ⎜w5 w4 w3⎟   (6.16)
         ⎝w6 w7 w8⎠                      ⎝w2 w1 w0⎠
To propagate the error to a shallower layer, a map of the same dimensions as the
input is built containing the gradient. The complete procedure is presented in
Algorithm 6.1.
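A sketch of Eqs. (6.15) and (6.16): each input element accumulates 𝛿 ⋅ w over every convolution in which it participated. The scatter-add form below is equivalent to convolving the 𝛿 map with the reversed filter and produces a gradient with the same shape as the layer's input.

import numpy as np

def filter_input_gradient(deltas, weights, input_shape, stride=1):
    k = weights.shape[0]
    grad = np.zeros(input_shape)
    rows, cols = deltas.shape
    for i in range(rows):
        for j in range(cols):
            grad[i*stride:i*stride+k, j*stride:j*stride+k] += deltas[i, j] * weights
    return grad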
A filter layer may have either a 1:1 relationship with its input, or a 1:many.
This exposes a potential dichotomy of two cases respecting the input shape. The
example in Figure 6.5 is 1:many. The single input image produces many feature
maps. A filter can also accept a number of maps producing a 1:1 relationship. Both
cases are handled the same way despite the superficial appearance of a complica-
tion. The gradient needs to flow through each element of the input shape, that is,
the total gradient of any convolutions that involved the element needs to be accu-
mulated, regardless of the shape of the input. In the case of 1:1, then gradient has
the shape nFilters × rows × columns. 1:many simply sums the individual gradient
maps to produce the correct shape (and gradient).
The implementation of a convolutional layer must include three functions. The layer implements a forward pass, the application of a kernel. It must also include support for a backward pass. The backward pass must update the learnable parameters as well as propagate the gradient through the layer. Any layer that implements this functionality can be dropped into a CNN library. The importance of designing a flexible API is clear. New convolutional layers must implement the API to be used in a software library. The API should make it easy for new convolutions to be implemented without having to make changes in the rest of the library. Verifying the shapes of the data going between the layers is vital to ensure bugs are caught early.
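The sketch below illustrates such an API under simplifying assumptions; the class and method names are illustrative, not RANT's, and the dense layer omits the activation function to keep the example short.

import numpy as np

class Layer:
    """Minimal interface a layer must expose to be usable in a library."""
    def forward(self, x):            # apply the kernel / weights
        raise NotImplementedError
    def backward(self, delta, lr):   # update parameters and return the gradient
        raise NotImplementedError    # destined for the shallower layer

class Dense(Layer):
    def __init__(self, n_in, n_out):
        rng = np.random.default_rng(0)
        self.W = rng.standard_normal((n_out, n_in)) * 0.1
        self.b = np.zeros(n_out)
    def forward(self, x):
        self.x = x                               # cache input for the backward pass
        return self.W @ x + self.b
    def backward(self, delta, lr):
        dx = self.W.T @ delta                    # gradient for the shallower layer
        self.W -= lr * np.outer(delta, self.x)   # update learnable parameters
        self.b -= lr * delta
        return dx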
6.7 Applications
CNNs have many applications. Their origins lie in image recognition, but their
power extends far beyond that field. Convolutions are adept at dealing with
shift-invariant data that is also quasi-translation invariant. The world abounds
with such problems. The following rules of thumb should be considered when
designing a model:
6.8 Summary
CNN almost always stands for convolutional neural network, not classifying neural network. CNNs are ANNs with at least one convolutional layer. Of particular interest is the filter: it regularizes a neural network while extracting learned features. Filters work on grids by identifying components of interest regardless of where they are in the input grid. They can be trained with normal backpropagation, making them attractive for incorporation in ANN software libraries. When designing and reasoning about a CNN, it is important to think about the flow of data in terms of shapes and what the layers are doing with them.
6.9 Projects
The aim of these projects is to provide insight into, and experience of, how the performance of a CNN responds to changes in its hyperparameters. The projects are based on the Python notebook that can be found here: https://github.com/nom-de-guerre/DDL_book/tree/main/Ch06. The notebook is called MNIST06.ipynb. The site includes directions for obtaining the data (MNIST).
1. Plot the percentage correct of the test set as a function of the number of features
in the first layer.
2. Plot the percentage correct of the test set as a function of the width of the FFFC
initial dense layer.
3. Plot the time versus percentage correct of the test set as a function of kernel
width and number of features.
So far this book has demonstrated how to train artificial neural networks (ANNs). While theoretically the methods presented thus far should be sufficient to train practical models, there are further points to bear in mind when designing practical software to train models. There are techniques that accelerate convergence of training. Trained models also need to be verified to confirm that they are suitable. These considerations imply that there are a few more ancillary concepts required to successfully implement a deep learning (DL) library. We cover the most important points in this chapter.
training error should be the same. The argument is based on the assumption that
the training data and the unseen data are all independently and identically dis-
tributed samples, so the model should perform just as well with unseen data. This
argument is, however, naïve.
Let us examine the problem of estimating the generalization error with respect to the MSE loss function. The expectation of the generalization error can be defined as 𝔼[(ŷ − y)²], which is the expectation of the loss function. The components are ŷ ≡ ANN(x), x is the observed predictor, and y is the ground truth. Of course, the ground truth will probably not be available in production1; if it were available, then there would be no need for the model. Nonetheless, some conclusions can be drawn from the expectation. The expression can be expanded by recalling the definition of the variance of a random variable, σ² = 𝔼[z²] − 𝔼[z]². If we rearrange the terms, we obtain 𝔼[z²] = 𝔼[z]² + σ². Letting zᵢ = ŷᵢ − yᵢ, the result is

$$\mathbb{E}[(\hat{y} - y)^2] = \mathbb{E}[\hat{y} - y]^2 + \sigma^2 .$$
There are two terms on the right. The first is the square of the bias. The second
term is the variance. Let us deal with these two terms separately. But first, it is
important to realize that during training the model is the result of both the act of
training and the effect of the training set. A different training set, or indeed, just
a different seed to the random number generator, would yield a different model.
The trained model is very much a variable in this context.
7.2.1 Bias
The bias of a model is defined as bias = 𝔼[ŷ − y]. It is a measure of how wrong a model is relative to the underlying ground truth, or the generative process. Ideally, a model would be unbiased, that is, the bias would be 0. Informally, we can interpret it as a measure of how wrong our assumptions are about the underlying process producing the data. The canonical example is the Bernoulli distribution,

$$P(q;\mu) = \mu^{q}(1-\mu)^{1-q}, \quad q \in \{0, 1\}.$$

It is the special case of the Binomial distribution where n = 1 and q takes on the values of either 0 or 1. It is used for binary outcomes. Given a set of outcomes,
1 Production means that the Deep Learning model has been deployed and in real use by the
application.
$$\hat{\mu} = \frac{1}{N}\sum_{i}^{N} q_i , \qquad (7.3)$$
which is an unbiased estimator. The bias is 𝔼[𝜇̂ − 𝜇], and we can see that as N → ∞
then 𝜇̂ → 𝜇 and the bias → 0. Thus, it is an unbiased estimator of 𝜇. The estimator
understands the underlying process generating the observable data. Of course this
is a trivial example of parametric estimation. An ANN is far more complicated and
models are more complex objects than simple parametric estimators. The bias is
rarely zero. There are a number of interpretations of model bias.
In the context of ANNs, a model with a high bias does not explain the training
set well. Recall the example of a least squares fitting of the sine curve in Section
3.4. The bias was extremely high, and intuitively it is clear why: sine is not linear,
that is, the model was “wrong.” In machine learning (ML) this phenomenon is
known as underfitting. The trained model does not explain the training data well.
The size of the training set can be increased, but the quality of the results will not
improve. Low bias is a desideratum of a ML model.
Underfitting of ANNs can occur when either the loss threshold is too high or the
loss threshold cannot be met. In the latter case, the ANN’s capacity to learn the
training set was not sufficiently high. Either there are not enough layers or there
are not enough neurons (or both). The more neurons there are the more trainable
parameters, and the learning capacity of the ANN is increased. The number of
neurons can be increased by either increasing the depth of the ANN or widening
a layer.
7.2.2 Variance
The second term of generalized error is the variance. It is a measure of how the
models produced vary with respect to sampling the underlying process. A training
set is a sample of an underlying process or phenomenon. The training set is, in
effect, a random variable. For a given statistic, e.g. arithmetic mean, it will vary
from training set to training set, but the underlying process itself has an unknown
mean. So in a very real sense the model is a function of the training set, which is in
turn a random variable. The variance is a measure of how the performance of the
model varies with respect to sampling the underlying process. Models with high
variance tend to lead to overfitting; learning the data too well to the exclusion of
generalizing. Low variance of a model is a desideratum. Overfitting is the effect
of learning the training set to the point of being unable to recognize similar, but
unseen, data. Consider the 5 examples of the hand-written digit, 3, in Figure 7.1.
If we trained an ANN to learn the right-most example perfectly, then it would fail
to recognize the remaining 4 examples. The model will have overfitted and only
recognize the 3 that it learnt to the exclusion of unseen, but genuine 3s. Overfitting
can be addressed with a number of mechanisms. For example, the training set can
be improved by adding further examples of 3s. More information leads to a more
robust model. The training can also be terminated sooner so the ANN’s idea of
what exactly constitutes a “3” is broader, that is, more general. The latter is far more
difficult to get right. Ideally, the training process itself could be made more robust.
Figure 7.2 kNN models for a static dataset. The data are produced as pairs of normally
distributed (x, y), and the classes have different means to produce the geometrical
differentiation. From left to right k = 5, 10, and 25. Note that the decision boundary
becomes far more regular with increasing k.
Figure 7.3 The classical bias-variance trade-off. The minimum of the error occurs as the
bias and variance intersect. Underfitting occurs to the left of the dotted line and
overfitting to the right.
in the training set, so it has high variance, but we obtain low bias. There are
fewer misclassifications of the training set, but any change in the training set will
produce a very different model. If we increase k to reduce the variance, we observe
that the bias is increasing and the number of misclassifications is also increasing;
the decision boundary is becoming straighter and less jagged. Adjusting either of
the bias or variance has an inverse effect on the other. Thus, it is suggested that
in ML, there is a fundamental tension between overfitting and underfitting, the
bias-variance trade-off (137). The relationship is demonstrated in Figure 7.3. The
diagram depicts the classic “U” shaped error curve. The best a model can do is
find the minimum of the error curve.
Recall that bias can be construed as how “wrong” a model is. Least squares is
an example of a high bias model, but it is also a low variance model. Changing a
few points in the training set is unlikely to dramatically change the resultant slope.
Generally, the more complex a model, the higher the variance and the lower the
bias. In the context of ANNs, this is viewed with respect to the number of param-
eters, the majority of which are usually weights. The more parameters in a model
the greater its capacity to learn. Overfitting is also known as overparameterization.
2 The authors were acting in good faith and behaved correctly. The raw data was presented and
the problem discussed openly. The authors believed they had failed when they had actually
succeeded in finding something very interesting. It was very good science.
The training set is used to fit the model. The test set is kept in reserve to be used to
validate the fitted model following training; its data are never used during training
and remain unseen by the model throughout. Assuming the dataset as a whole
truly represents the ground truth of the problem domain, then the test set can be
used to estimate how the model will really behave when deployed, that is, how it
will generalize to unseen data.
The procedure is as follows. The dataset is split between the training set and
the test set. The model is fitted to the training set. With the test set, an estimation
of the generalization error can now be computed by using the model to perform
inference on every datum in said set. As the ground truth is available for the test
set, the performance of the model can be quantified. For regressors, the MSE loss
is used as the performance metric. The quality of classifiers could also be gauged
with their loss function, but typically accuracy is used (percentage correct). The accuracy typically ranges between 100% and (100/K)%, where K is the number of distinct classes. A test accuracy of 100% suggests that the model is working very well (if that were also the training accuracy, overfitting may have occurred). An accuracy of (100/K)% indicates that the model is totally broken; random guessing is just as accurate, hence the lower bound. If the test loss, or accuracy, is not tolerable, then the model should be revisited and subsequently retrained.
The ratio of the sizes of test to training sets varies with the size of the dataset as a whole. A dataset that consists of circa 100,000 examples or more can be split up as 90:10 training:test. But for smaller datasets, larger proportions of test and training are required. If the training set is small, then an assumption is made that training the model is cheap. In this case, a technique called folding is used.
The technique of folding provides for training and validating a model multiple times. This is actually best practice in general, but as fitting models can be expensive there are occasions when it is simply too expensive to train and validate multiple times. The procedure of folding is as follows. The dataset is randomly split into two disjoint sets; 2/3 training and 1/3 test is a good choice. The random divisions of the data are known as folds. For classification, the proportions of examples for each class should be the same, but this generally happens naturally with uniform sampling. A model is trained with the training set and then its performance with the test set is computed. The process is repeated multiple times. The final result is the mean and standard deviation of the accuracies of the individual runs. Following a number of folds, confidence can be had in the hyperparameters that produced the model, which in turn suggests that the model is good.
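A minimal sketch of the folding procedure follows; train_and_score stands in for whatever fitting routine is being validated and is an assumption of the example, not a prescribed API.

import numpy as np

def fold_evaluate(X, y, train_and_score, n_folds=10, train_frac=2/3, seed=0):
    """Repeatedly split the data into train/test folds, fit, and score.

    train_and_score(X_tr, y_tr, X_te, y_te) is any routine that fits a model
    on the training fold and returns its accuracy on the test fold.
    Returns the mean and standard deviation of the accuracies.
    """
    rng = np.random.default_rng(seed)
    scores = []
    n_train = int(len(X) * train_frac)
    for _ in range(n_folds):
        idx = rng.permutation(len(X))          # a fresh random division (fold)
        tr, te = idx[:n_train], idx[n_train:]
        scores.append(train_and_score(X[tr], y[tr], X[te], y[te]))
    return np.mean(scores), np.std(scores)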
Finally, overfitting can be avoided with a third division of the data, a verification
set. A verification set is disjoint from the training set. During training, the model
is verified with the verification set following the execution of each epoch. When the model achieves 100% accuracy with the verification set, training is halted. This can be expensive, but it is effective.
3 The iris dataset is very easy to learn, so training was terminated early to make the confusion
matrix “interesting.”
The F1 score for a class combines its precision and recall,

$$F1_k = 2 \cdot \frac{\mathrm{precision}_k \cdot \mathrm{recall}_k}{\mathrm{precision}_k + \mathrm{recall}_k} ,$$

where k is the class being examined. The best F1 score is 1. The precision and recall have a maximum of 1, so the fractional part of the expression peaks at 0.5. The F1 is scaled by a factor of 2 to make it a normal metric. It is the harmonic mean of precision and recall.
For the virginica class we have, precision = 66/(66 + 16) = 0.805, and the recall
= 66/(66 + 24) = 0.733. This gives F1 = 0.767. Whether this value is acceptable is
based on the application. For disease detection, this is probably far too low.
The scores can be combined to sum up the performance of the model. The F
score for each class is first computed. The final step is to combine the individual
F1 scores into a single score for the entire matrix. The two most popular methods
are variants of the arithmetical mean. The simplest method is that of the arithmetic
mean of the F-scores for a confusion matrix,
$$F_{total} = \frac{1}{K}\sum_{i}^{K} F_i . \qquad (7.5)$$
The second method is simply a weighted version. The terms are weighted by the percentage of the test set each class represents. The weighted mean is not always appropriate. For example, if used with the healthy-skewed medical dataset, the heart disease class is the most important class, and we do not want it overwhelmed by weighting the healthy people. The weightings can be synthesized to increase the importance of the disease-positive cases.
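A short sketch of the per-class scores and the two means discussed above follows; the row/column convention of the confusion matrix is an assumption stated in the comments, and the function name is illustrative.

import numpy as np

def f1_scores(C):
    """Per-class precision, recall, and F1 from a confusion matrix C.

    Convention assumed here: C[i, j] counts examples of true class i
    predicted as class j, so row sums are true counts per class and
    column sums are predicted counts per class.
    """
    tp = np.diag(C).astype(float)
    precision = tp / C.sum(axis=0)              # TP / (TP + FP)
    recall = tp / C.sum(axis=1)                 # TP / (TP + FN)
    f1 = 2 * precision * recall / (precision + recall)
    macro = f1.mean()                           # Eq. (7.5), the unweighted mean
    weighted = np.average(f1, weights=C.sum(axis=1) / C.sum())
    return precision, recall, f1, macro, weighted

# With 66 true positives, 16 false positives and 24 false negatives for class 0
# (the 44 is an arbitrary filler), this reproduces the virginica example:
# precision ~ 0.805, recall ~ 0.733, F1 ~ 0.767.
C = np.array([[66, 24],
              [16, 44]])
precision, recall, f1, macro, weighted = f1_scores(C)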
There are many different measures possible when examining a confusion
matrix. The Matthews correlation coefficient (MCC) takes better account of
the FNs in multiclass models (14). For important applications where coverage
and efficacy are really important, it can make sense to employ a framework
incorporating many aspects of classification performance. The restrictedness and
bias-dispersion are just two metrics used in a complete framework to quantify
correctness and incorrectness of models by the authors of class dispersion (37).
There are many possibilities. The correct choice depends on the requirements
of the application. Metrics are a double-edged sword. Goodhart's law states that any metric becomes useless as soon as it becomes a target. What he meant was that people tend to lose sight of what they are trying to measure; instead they focus on optimizing the number. A framework of metrics can slow the effect down.
7.4 Regularization
A serious problem when training models is the phenomenon of overfitting. This
is of particular concern in larger and deeper networks. One means of addressing
the problem is with regularization. Regularization is the act of calibrating models such that overfitting is less likely. The particular technique that we describe in this section is called neuron dropout; it is also known as dilution.
An important technique in ML is known as the ensemble method. To increase
the accuracy of a system multiple models are trained. Together, all the trained
models form a set of models, or an ensemble. When performing inference, the
entire ensemble of models is used and some means of aggregating their responses
is taken, that is, all the models cast a vote for the final result (e.g. the arithmetic
mean of their responses). While effective, this technique is impractical for the
larger neural networks where the memory, training time, and inference time may
be prohibitive. A method to realize some of the benefits of ensemble methods in a
single large neural network was developed called dropout (68; 144; 157).
When used with ANNs, the training infrastructure requires some minor modification. Up until now, the forward pass of an ANN was the same regardless of training or inference. Dropout requires a different forward pass for each case. The training infrastructure therefore needs to inform the ANN when training has started and stopped. The remainder of the section describes dropout for all phases of use.
Figure 7.5 An example of an ANN with a dropout layer. The left side shows a possible configuration with a dropout of 0.5. The right side
shows the logical ensemble resulting from training. The subnet changes for every element of the training set.
Algorithm 7.2 Inverted Dropout for a Dense Layer's Training Forward Pass
1: procedure DropoutTrainingForward
2:   z̄ ← σ(Wx + b)
3:   β̄ ← U(M𝓁)                    ⊳ vector uniformly sampled [0, 1] for each neuron
4:   β ← β̄ > p_dropout             ⊳ vector of True/False (1/0)
5:   z ← (1/p_dropout) ⋅ β ⊗ z̄     ⊳ element-wise multiplication
6: end procedure
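A minimal NumPy sketch of the training-time forward pass follows. To keep the expectation argument explicit, the sketch is written in terms of p_keep, the probability that a neuron survives; this is a notational assumption of the example rather than a change to the method. The surviving activations are rescaled so that the expected activation matches the inference-time forward pass, which applies no mask.

import numpy as np

def dropout_forward_train(z_bar, p_keep, rng=None):
    """Inverted dropout applied to a layer's activations during training.

    z_bar  : the activations of the regularized layer
    p_keep : probability that a neuron is retained
    """
    rng = rng or np.random.default_rng()
    beta = rng.uniform(size=z_bar.shape) < p_keep   # 1/0 mask, one per neuron
    return (beta * z_bar) / p_keep                   # rescale the survivors

def dropout_forward_infer(z_bar):
    return z_bar          # inference uses the full network, unscaled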
Despite the potential difficulties, the essence of the method is relatively simple
so dropout is a popular means of regularizing models. There are two ways to imple-
ment the technique efficiently. One way is to retrofit it to existing implementations
of layers, which has been described. The most popular way, however, is to intro-
duce a specialized dropout layer. The dropout layer is placed immediately deeper
to the target dense layer. A specialized layer has two advantages. The particulars
of dropout are encapsulated entirely and isolated in one place. An encapsulated
layer makes code reuse easy. The dropout layer does not have to be retrofitted to
every type of layer in the library; all existing layers in the library get dropout for
free. An example architecture is shown in Figure 7.6. A dedicated dropout layer can be placed as the deeper neighbor of any type of layer that needs to be regularized. The regularized layer is not even aware of it; from the deeper layers' perspective, it is as if the shallower layer itself applied dropout. Dropout layers form an important tool for regularizing DL models. They are, however, used sparingly, for example, as the final layer in a deep ANN prior to softmax. It is rare to see a dropout layer following every dense layer in a model.
Figure 7.5 gives an example of training with dropout. A specialized dropout
layer performs the regularization. During training, the dropout layer decides
which neurons will participate (the light neurons in the figure) and which to drop (the dark neurons). This happens for every example in the training set. The connections between the regularized layer and the dropout layer are 1:1, not dense. The two dense layers are oblivious to the regularization. The dropout layer accepts the output of the shallower layer and zeros out the entries for the excluded neurons. The new result vector is then passed on as usual, all:all. The dropout layer's role in the back-propagation phase of the training step is just as simple: the gradient is received by the dropout layer, the entries for the precluded neurons are zeroed, and the dropout layer then passes the gradient back to the shallower layer.
Regularization is important for training DL models. The learning capacity of an ANN is proportional to the number of neurons. Selecting the right number is difficult. Underfitting is relatively easy to detect, but overfitting is more difficult. Adding neurons to fit the model is a common response, and dropout layers help keep the potential overfitting at bay. A software implementation should be verified with the technique of differencing equations presented in Section 3.7.
In Section 3.1, it was argued that preprocessing data by normalizing was critical for
successful training. Normalizing the training set produces desirable numerical and
statistical effects for the input layer. Hidden layers also have the same problems.
While preprocessing ameliorates potential numerical problems for the input layer,
the hidden layers do not necessarily benefit.
Consider the input layer. It accepts the normalized training data and then pro-
duces its response. The signals produced by the input layer will probably be dis-
tributed very differently from the normalized training data that produced them.
In general, the input for a layer is distributed very differently from its output.
This is especially true at the start of training when the weights are random. All
the arguments for normalizing training data apply to hidden layers as well. More-
over, following weight updates at the end of each training epoch, the distributions change. Indeed, the weight updates are based on the ∂L/∂w terms, each of which approximates the instantaneous rate of change of the loss function at the precise point, Loss(w), where Loss is the global objective function and w is the vector of every weight in the ANN. Changing one entry in w changes L, which in turn changes all the ∂L/∂w terms. Changing all of them simultaneously has an even stronger effect. This effect
retards the training and in consequence the hidden layers are chasing a “moving”
target as each epoch changes what they observe. The effect is demonstrated in
Figure 7.8. The difference between epochs is clearly visible.
$$\forall\, \text{neuron}_i \in \ell, \quad \mu_i = \frac{1}{N}\sum_{j}^{N} z_{i,j} , \qquad (7.11)$$
Figure 7.7 An ANN with a batch normalization layer. The 1:1 links correspond to the per
neuron normalization statistics.
Figure 7.8 The unnormalized means ribboned with the standard deviation for 5 epochs
of training LeNet-5 at layer C5. The epochs are clearly delineated. The weight updates in
the deeper layer will have to adapt to a distribution of signals that it has never seen
before every epoch.
Each neuron's output is normalized with the batch statistics, ẑ_{𝓁−1} = (z_{𝓁−1} − μ)∕√(σ² + ε), and then scaled and shifted, z_𝓁 = γ ⋅ ẑ_{𝓁−1} + β. Both gamma and beta are trainable parameters learnt during fitting of the model. If need be, training can force the parameters to undo the normalization by learning γ = σ and β = μ.
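This is not the book's Algorithm 7.3; the following is a minimal NumPy sketch of the per-neuron statistics and the γ, β transform, assuming the shallower layer's outputs have been buffered as an N × M array.

import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-5):
    """Batch normalization of a buffered batch of activations.

    Z            : buffered outputs of the shallower layer, shape (N, M)
                   -- N samples, M neurons
    gamma, beta  : learnable scale and shift, one pair per neuron
    """
    mu = Z.mean(axis=0)                      # per-neuron mean, Eq. (7.11)
    var = Z.var(axis=0)                      # per-neuron variance
    Z_hat = (Z - mu) / np.sqrt(var + eps)    # normalized activations
    return gamma * Z_hat + beta, (Z_hat, mu, var)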
Batch normalization requires all N outputs from its immediately shallower layer
before it can compute the statistics. It follows that 𝓁 must buffer 𝓁 − 1’s signals. If
𝓁 − 1 has M𝓁−1 neurons, and the training epoch is processing N samples from the
training set, then a buffer of size of M𝓁−1 ⋅ N will be required. This leads to a new
algorithm for processing a training epoch.
Algorithm 7.3 is structured very differently than the earlier training epoch
described in Algorithm 3.1. Note the inversion of the loops. The batch normal-
ization version does not iterate over the training set, but rather, the layers of the
ANN, as the entire batch must progress through the ANN as a whole. The earlier
version of a training epoch used to iterate over the training set. A forward pass
was immediately followed by a backward pass, ping pong style, and there were N
executions of each pass. The batch normalization version only has one forward
pass; the training set moves through the ANN together. While it is not necessarily slower, as it is roughly doing the same amount of work, merely in a different order, it does consume more memory. The individual Z𝓁s must be retained as they are required to compute the ∂L/∂z terms during the backward pass (they are used to
propagate the gradient between layers). The requirement is far more onerous
when processing an entire training batch at once. Back propagation must also
proceed in a batch-oriented fashion; the backward pass cannot proceed until the
forward pass has completed. Back propagation through a batch normalization
layer must also proceed as a batch. This follows from differentiating a path
through the normalization.
Backpropagation must go through the normalization layer so its derivatives are
required. Batch normalization is differentiable, but messy. The derivatives are sep-
arated into two sets, or phases. They need to be evaluated in the order presented
as there are dependencies. Notice the derivatives of the variance and the mean
involve sums over the entire batch. The first set of derivatives is required to prop-
agate the gradient through the layer. The computation of the gradient needs the derivatives of the loss with respect to the variance and the mean:
$$\frac{\partial L}{\partial \sigma^2} = \sum_{i}^{N} \frac{\partial L}{\partial \hat{z}_{\ell-1}} \cdot (z_{\ell-1} - \mu) \cdot \frac{-1}{2}\,(\sigma^2 + \epsilon)^{-\frac{3}{2}} , \qquad (7.16)$$
and,
$$\frac{\partial L}{\partial \mu} = \left(\sum_{i}^{N} \frac{\partial L}{\partial \hat{z}_{\ell-1}} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}}\right) + \frac{\partial L}{\partial \sigma^2} \cdot \frac{-2}{N}\sum_{i}^{N}\left(z_{\ell-1} - \mu\right). \qquad (7.17)$$
Finally, with the above values computed, the expression below is the gradient that
is propagated to the shallower layer:
$$\frac{\partial L}{\partial z_{\ell-1}} = \frac{\partial L}{\partial \hat{z}_{\ell-1}} \cdot \frac{1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial L}{\partial \sigma^2} \cdot \frac{2(z_{\ell-1} - \mu)}{N} + \frac{\partial L}{\partial \mu} \cdot \frac{1}{N} . \qquad (7.18)$$
With the latter derivative, the gradient can be pushed to the shallower layer.
Observe that only layer outputs need to be retained as claimed. The intermediate
ẑ 𝓁−1 values do not appear, just their derivatives, which can be easily computed.
The second set of derivatives is required to update the learnt parameters, 𝛾 and 𝛽.
They too are sums over the entire batch.
$$\frac{\partial L}{\partial \gamma} = \sum^{N} \frac{\partial L}{\partial z_{\ell}} \cdot \hat{z}_{\ell-1} , \qquad (7.19)$$

and

$$\frac{\partial L}{\partial \beta} = \sum^{N} \frac{\partial L}{\partial z_{\ell}} . \qquad (7.20)$$
Bear in mind that these expressions are per neuron in the normalization layer.
There will be M𝓁−1 of them, as dictated by the preceding shallower layer.
When training is terminated, the model is ready to be used for inference, and
the layer will have learnt the M𝓁−1 pairs of 𝛾 and 𝛽, but there is still a require-
ment for the {𝜇i , 𝜎i }. During inference, there are tuples of {𝜇i , 𝜎i , 𝛾i , 𝛽i } required.
The easiest solution is to retain the {𝜇i , 𝜎i } computed during the last training
epoch.
While promising experimentally, batch normalization is not widely implemented. The problems far outweighed the advantages. The memory requirements can be prohibitive. Using SGD does address the memory requirements to some extent. The method also introduces a great deal of potential numerical instability; the above derivatives represent many opportunities for overflow, underflow, and NaN. The algorithm is also difficult to retrofit into existing training libraries
as it required a new data flow during training. Batch normalization did, however, inspire a more practical form of interlayer normalization that forms the subject of Section 7.5.2.
$$\mu = \frac{1}{M_{\ell-1}}\sum_{i}^{M_{\ell-1}} z_{\ell-1}[i], \qquad (7.21)$$

and the standard deviation is

$$\sigma = \sqrt{\frac{1}{M_{\ell-1}}\sum_{i}^{M_{\ell-1}} \left(z_{\ell-1}[i] - \mu\right)^{2}} . \qquad (7.22)$$
The derivatives for layer normalization are the same, except the sums are across
the vector instead of the batch. They are reproduced below to account for the
different summations.
$$\frac{\partial L}{\partial \hat{z}_{\ell-1}} = \frac{\partial L}{\partial z_{\ell}} \cdot \gamma , \qquad (7.23)$$

$$\frac{\partial L}{\partial \sigma^2} = \sum_{i}^{M_{\ell-1}} \frac{\partial L}{\partial \hat{z}_{\ell-1}} \cdot (z_{\ell-1} - \mu) \cdot \frac{-1}{2}\,(\sigma^2 + \epsilon)^{-\frac{3}{2}} , \qquad (7.24)$$

and,

$$\frac{\partial L}{\partial \mu} = \left(\sum_{i}^{M_{\ell-1}} \frac{\partial L}{\partial \hat{z}_{\ell-1}} \cdot \frac{-1}{\sqrt{\sigma^2 + \epsilon}}\right) + \frac{\partial L}{\partial \sigma^2} \cdot \frac{-2}{M_{\ell-1}}\sum_{i}^{M_{\ell-1}}\left(z_{\ell-1} - \mu\right). \qquad (7.25)$$

The gradient passed to the shallower layer is then

$$\frac{\partial L}{\partial z_{\ell-1}} = \frac{\partial L}{\partial \hat{z}_{\ell-1}} \cdot \frac{1}{\sqrt{\sigma^2 + \epsilon}} + \frac{\partial L}{\partial \sigma^2} \cdot \frac{2(z_{\ell-1} - \mu)}{M_{\ell-1}} + \frac{\partial L}{\partial \mu} \cdot \frac{1}{M_{\ell-1}} , \qquad (7.26)$$

and the learnt parameters are updated with

$$\frac{\partial L}{\partial \gamma} = \sum^{M_{\ell-1}} \frac{\partial L}{\partial z_{\ell}} \cdot \hat{z}_{\ell-1} , \qquad (7.27)$$

and

$$\frac{\partial L}{\partial \beta} = \sum^{M_{\ell-1}} \frac{\partial L}{\partial z_{\ell}} . \qquad (7.28)$$
Training is performed with the standard SGD training algorithm. A layer normal-
ization layer is interposed between dense layers where required and is completely
encapsulated. This is a very attractive feature. Software infrastructure does not
need to change to accommodate it, and memory demands are not increased either.
The numerical stability of the derivatives is also far better than full batch normal-
ization. Sums over a vector are better behaved than over an entire training set.
To demonstrate the advantages and the dynamics of layer normalization, a com-
parison between with and without layer normalization for an ANN classifier for
the iris dataset is presented. The graph in Figure 7.9 shows the distribution of the
means for a layer over two attempts to train a model. The model was initialized
with the same weights for both runs. The differences between the distributions are
striking. Layer normalization has a tighter distribution. The deeper layer will have
an easier time of converging as a result. Without normalization, the distribution drifts, creating difficulties for its deeper neighbor.
When implementing algorithms such as layer normalization, it is a very good idea to verify the resulting code with the technique of differencing equations introduced in Section 3.7. If the derivatives are not implemented correctly, they will have a pernicious effect on convergence; this can be hard to detect without direct verification.
Figure 7.9 The distribution of means by type. The layer normalization densities are far
more predictable over time, hence the deeper layer can learn faster. The weight updates
do not confuse it.
7.6 Summary
Evaluating the quality of a trained model is challenging. The applications for mod-
els require them to perform inference on unseen data, and unseen data is unknown
data. The performance of a model with unseen data is the generalization error.
There is a great deal of theory in ML to analyze generalization error for ML, but the
jury is out on its relevance to ANNs. Nonetheless, overfitting and underfitting are
7.7 Projects
The projects below rely on notebooks that can be found here, https://github.com/
nom-de-guerre/DDL_book/tree/main/Chap07.
1. The Python notebook iris07.ipynb contains an implementation of an iris clas-
sifier. Plot a graph of the classifier’s bias versus variance by experimenting with
different topologies.
2. The Python notebook MNIST07.ipynb contains an implementation of an
MNIST classifier. Measure the accuracy difference between the dropout
version and the control version of the model.
3. Implement a confusion matrix with the MNIST classifier in the previous
project.
This chapter describes how to design and implement a software library for
building deep learning (DL) models. Models are built with software libraries. This
is the software that applications use to define, assemble, and train models. The
design principles are all demonstrated in a software library called the Realtime
Artificial Neural Tool (RANT). RANT is an artifact developed for use with
DL experimentation and embedded applications. The source code is available
online.1 The library is written in C/C++ for efficiency, but it does include Python
bindings. Some might question why Python was not used. The answer is that most
implementations of DL training routines are written in a lower level language.
For example, libraries such as TensorFlow are implemented in C/C++ and
exposed in Python. Python is implemented in C. To understand how to implement
a DL library, we need to work in a low-level language. Demonstrating how to
use a DL training library would be appropriate in Python; again, even Python
implementations generally call down into C/C++ code to the workhorse routines.
Many indulge in dogma when prescribing computer languages, but it is
important to bear in mind that computer languages are like any other tool, saws,
hammers, and kitchen blenders. It is important to use the right tool for the job.
All computer languages have strengths and weaknesses. Selecting the appropriate
language for a task is an exercise in objectively examining the trade-offs that are
appropriate for the task at hand. Many students do not understand statements
such as, “Python is slower, but more productive.” Such a claim is laden with
information for a computer scientist, and we examine it here, in the course of
which the use of C/C++ for low-level routines will be motivated.
1 https://github.com/nom-de-guerre/RANT.
2 The IBM 701 Mainframe was a landmark machine whose introduction led to IBM’s
dominance in the following decades.
libraries. The code had been debugged to the point of total reliability and sharing
it further increased productivity. Libraries of important scientific and engineering
routines made writing new applications easier. Every new missile guidance system
no longer began with rewriting the basic matrix and linear algebra routines that
were required. The FORTRAN language took off and is still widely used today.3
FORTRAN is terse and was designed for scientists and engineers. It was
designed for high-performance scientific computing. The business community
had different requirements and users. Business problems are more concerned
with data flow (processing customer records and bills, etc.), code readability,
and documentation to support maintenance and extensions. FORTRAN was
not suitable, so a new language, COBOL, was introduced in 1959. COBOL was more data-record oriented and more like English (to the point where the earliest specification was so ambiguous that COBOL code was not portable; it depended on compiler behavior. As portability was a primary goal, this was a gross error).
These early languages did not stray far from the machine architecture they were
originally designed to run on. The computers they were designed for could barely
add and subtract, there was no compute head room for abstraction (multiplication
was usually a software routine, not a machine instruction). The choice between
COBOL and FORTRAN was a trade-off, usually dictated by the application. The
number of languages available quickly proliferated in the 1960s as computers
became more powerful, cheaper, and more widely available.
Of note, before we fast forward to today is the C language. It was introduced
in 1972 with version 2 of the UNIX operating system. UNIX was designed on a
new class of computer, the mini-computer,4 and was developed on a DEC PDP-7.
It was fast, simple, and made pointers a first class language element. It is still
an important language today for systems programming because it is low level. To
understand the utility of C, we need to examine how it works. It will explain why
some basic machine learning routines are still written in C++ today, even if they
are exposed for use in higher level languages such as Python.
An application is a program of instructions for a CPU to execute; the term of art,
program, has precisely the same meaning in lay English. Computer instructions
are numbers specifying primitive operations such as add the contents of memory
location x to the contents of memory location y and store the result in z. Some
CPU architectures can do that with one instruction. Others might require 4
instructions, load [x] into the CPU, load [y] into the CPU, add the values, and
store in [z]. High-level languages abstract these primitives and leave the details to
3 FORTRAN programs are still used to benchmark and rank the world’s fastest supercomputers.
4 Mini-computers were scaled down mainframes and made possible by the recent invention of
integrated circuits. The Intel 8088 would be introduced 7 years later in 1979 setting off the PC
revolution.
Figure 8.1 A C program and the resultant assembly code following compilation to two
popular architectures. Both architectures make direct use of the stack to store the
automatic variables. A C programmer would expect the stack to be utilized in that way,
have complete access to the addresses of the identifiers, and understand the results are
only valid in that stack frame. Note that just because the ARM version is longer does not
mean that it is slower, the individual machine instructions can be faster.
5 Sadly, programmers often do not possess the necessary understanding, hence dangerous bugs and security flaws in low-level code written in C.
As computers grew more powerful, the computer itself could be used to increase
correctness. Computers now have sufficient power to run programs and provide
many run-time ancillary services. Languages such as Java and Python offer little
or no correspondence between the machine and the language. Java is compiled
to an intermediate assembly language for a Java virtual machine (JVM). Python
is interpreted. Both languages offer memory management. Allocating and free-
ing memory are not concerns for a Python programmer. There is, however, a cost.
When Python evaluates z = x + y, it does not correspond to a few machine instructions but rather to a sequence of high-level language operations that in turn execute many machine instructions. This is not necessarily a bad thing. Speed of implementation (productivity) and eliminating an entire class of memory bugs are well worth the price on modern CPU architectures, but too often students are not aware of the trade-off, much less the cost.
C/C++ are low-level languages, giving them direct access to the hardware and its primitives. This includes control over the IEEE 754 rounding mode. There are four rounding modes, and they can be read and set in C/C++ via fegetround and fesetround (fenv.h).
There are also many platforms where performance is still paramount. Mobile
phones and embedded devices are very sensitive to memory usage and CPU
consumption. Apple uses Swift and Android uses Java (and Kotlin).
Table 8.1 gives an indication of the differing performance. The Python pro-
gram creating a classifier with the Iris dataset is using Keras and Tensorflow.
Consequently, it is eventually calling down into C/C++ code. The comparison
between loops was done with for, which is sympathetic to Python as the differ-
ence between while loops is known to be wider. A Python programming rule of
thumb is to use for instead of while where possible because it is faster.
An examination of the relative energy requirements of computer languages con-
cluded that C is the most energy efficient language, and Python was ranked at #73
(112). The authors also demonstrate that Python consumes a great deal more mem-
ory. This can be an important consideration for embedded applications and mobile
devices.
Table 8.1
Language   Loop (per iteration)   Iris classifier (training)
C++        3 ns                   0.05 s
Python     1628.4 ns              26.09 s
a) Comparison of values computed with backpropagation of error and differencing. The ratio suggests good agreement between both methods suggesting correctness of the BPROP implementation.
The energy footprint, and the related metric of carbon footprint, has for the most
part been ignored in the machine learning community. Research tends to focus on
increases of accuracy and capability. This is not necessarily a bad thing, but some
attention will probably fall on the cost of training as models grow increasingly
hungry for energy. Efforts such as DAWNBench (23) are a step in the right direc-
tion. The benchmark includes metrics such as time to accuracy, which implicitly
admits of an energy efficiency interpretation. It is not much of a leap to refine the
metric to explicitly measure performance in terms of Joules. Competitions such
as JouleSort (125) make a more explicit connection to energy efficiency. Sorting
algorithms are evaluated with respect to speed and energy usage. The DL commu-
nity can learn from such efforts. Spiking neural networks are an attempt to build
models with lower energy requirements (69), but they do not perform well (47).
A powerful compromise to gain the productivity of Python and the power of
C/C++ is the Python package, ctypes. Routines and libraries are written in C/C++
and then exposed higher up in Python. Numpy and Tensorflow are two examples
of this approach. Applications can then be written in Python that call the perfor-
mant code, or access devices such as a GPU, by calling the Python bindings.
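A minimal sketch of the approach follows; libdot.so and its dot routine are hypothetical names used only for illustration. The C side would compile a routine such as double dot(const double *x, const double *y, int n) into a shared library, and Python calls it through ctypes.

import ctypes
import numpy as np

lib = ctypes.CDLL("./libdot.so")           # hypothetical shared library
lib.dot.restype = ctypes.c_double
lib.dot.argtypes = [ctypes.POINTER(ctypes.c_double),
                    ctypes.POINTER(ctypes.c_double),
                    ctypes.c_int]

def fast_dot(x, y):
    # Ensure the buffers are contiguous doubles before handing raw pointers to C.
    x = np.ascontiguousarray(x, dtype=np.float64)
    y = np.ascontiguousarray(y, dtype=np.float64)
    return lib.dot(x.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
                   y.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
                   len(x))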
It must be made very clear that this section was not an anti-Python polemic.
Python is a very productive language. Implementing a computer language
requires selecting a point in the design space dictated by what the language is
trying to achieve. The trade-offs adopted by the language constitute the niche that
the language occupies. The list of modern computer languages is endless. C/C++
and Python are different languages trying to achieve different things. Python
dominates its niche because it is a great language. Python’s eco-system of libraries,
support and documentation are unparalleled for machine learning and ANNs.
The abstractions it offers also yield tremendous productivity. It can, however, be
useful to understand what is happening under the hood of the implementation of a
language. This is not just an abstract point; it can lead to writing better Python code too.
are easily managed with a single matrix of weights (a row per neuron, a column
per incoming connection). The algorithms presented thus far have all been
matrix-centric, and this is one of the reasons.
Training neural networks involves two phases: a forward pass and a backward pass. Both operations are performed, depending on the nature of the problem, hundreds or even tens of thousands of times; it is all that training an ANN does. We have seen in previous chapters how both phases are represented with three matrix operations: a matrix-vector product in the forward phase, or inference, and a transpose matrix-vector product followed by a vector outer product in the backward pass. The performance of an implementation will depend very much on how well
these basic operations are implemented. Before one can understand how to best
implement these operations, a brief review of computer architecture is required.
A note on the scale of the DL models is in order here. Models vary in size and
scale. The algorithms presented so far are correct, but writing down an algorithm
is very different from realizing the implementation of one in a computer. The
trade-offs and design for an implementation that can handle models with billions
of learnable parameters and smaller million parameter models are different. The
most challenging problems today require multiple GPUs to train in a practica-
ble time frame (29). Training language models can take days. The techniques
developed and presented below are for CPU-based implementations.
6 In C/C++ a memory address is called a pointer. Java and Python do not grant direct access to
memory. They offer the safer “reference” abstraction.
The first point is the existence of a quantum unit of access to memory by the
CPU. It is called the CPU cache line. Even if memory is addressable at the granu-
larity of the byte the CPU must access memory, that is, read and write, in units of
the cache line size (e.g. 64 bytes on Intel CPUs and 128 bytes for M1 SoCs). A cache
line is fetched from DRAM and stored in the CPU cache. The CPU accesses are also
aligned on the cache line size. For example, if a program reads byte 438, an M1 will load 128 bytes from offset 384 and store them in the CPU cache; the desired byte is at offset 54 within the cache line. Once the cache line is loaded, the actual byte that has been requested will be fetched into the core's register. In the course of loading the cache line, it may evict a cache line that is already in the CPU's cache. This happens invisibly and is managed entirely by the CPU. Note that in this example the program only needs 0.78% of the data that was loaded from memory into the CPU cache.
The number of CPU cycles required to access DRAM is circa 100 clock ticks. The
number of cycles required to perform an operation, such as addition, is circa 10
clock ticks. The ratio of time to fetch a datum versus the operation is enormous.
Programmers and compilers do not need to be aware of the details of memory
access, but it is important to understand it when writing high-performance code.
The speed mismatch is observable and easily measurable. When a CPU has to wait
for a memory fetch, it is stalled and does no work.7 A CPU clocked at 3.2 GHz could
conceivably execute operations on data at that rate, but rarely gets close owing to
memory fetches.
7 Stalling is such a serious problem that the SPECTRE family of security bugs, etc. was
inevitable.
A = [1 2 3; 4 5 6; 7 8 9], with base address 205 in DRAM. Row order stores 1 2 3 4 5 6 7 8 9; column order stores 1 4 7 2 5 8 3 6 9.
Figure 8.3 Two options for the physical memory layout of a matrix, A. The base address
of the matrix is 205 in DRAM. Row order provides for the rows to be laid out contiguously
in memory. Column order is the opposite. CPUs access memory in units of cache lines so
the choice is important.
8 When the performance of iterating over the entire matrix is not critical, then a matrix can be
synthesized by abstracting it. One method is to build a binary tree or hash table of rows or
columns. The interposition of abstract data structures over linear arrays is not considered here
as the matrix products would be infeasibly slow.
possible. An additional advantage is that the data is only loaded once. For W ∈ ℝn,m, the number of memory fetches is n ⋅ m ÷ (cache line length). If the matrix were laid out in column order, then the number of memory fetches would be n ⋅ m. In both cases, there are n ⋅ m multiplications and n ⋅ (m − 1) additions, but the
expensive operation is the memory access, and memory access will dominate. The
effect is pernicious and easily measured. This is one of the reasons why Python’s
Numpy library is written in C. Fine-grained control of the memory is crucial
for performance. A further advantage to streaming the data is the opportunity
for compilers to detect the contiguous memory multiplications. Many CPUs can
multiply 4–8 products in parallel. Compiler optimizers that detect contiguous
operations can use native CPU vector instructions.
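The effect can be observed even from Python, because NumPy reductions execute in C over the underlying buffer. The sketch below compares traversing a row-order matrix by rows and by columns; the absolute numbers depend on the CPU, but the column-wise traversal is typically markedly slower.

import numpy as np
import time

n = 4096
W = np.random.rand(n, n)          # NumPy uses row order (C layout) by default

def traverse(rows_first):
    total, start = 0.0, time.perf_counter()
    for i in range(n):
        # Row slices are contiguous cache lines; column slices stride across them.
        total += W[i, :].sum() if rows_first else W[:, i].sum()
    return time.perf_counter() - start

t_row = traverse(rows_first=True)
t_col = traverse(rows_first=False)
print(f"row-wise {t_row:.3f}s   column-wise {t_col:.3f}s")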
The NeuralM_t class includes an extra column for the bias. The GAXPY for u𝓁 =
W𝓁 z𝓁−1 + b𝓁 is implemented in Algorithm 8.1.
$$\Delta W_{\ell} = \Delta W_{\ell} + \begin{pmatrix}\delta_1\\ \delta_2\\ \delta_3\end{pmatrix}\cdot\begin{pmatrix}z_1 & z_2 & z_3\end{pmatrix} = \begin{pmatrix}\Delta w_{1,1} {+}{=}\, \delta_1 z_1 & \Delta w_{1,2} {+}{=}\, \delta_1 z_2 & \Delta w_{1,3} {+}{=}\, \delta_1 z_3 \\ \Delta w_{2,1} {+}{=}\, \delta_2 z_1 & \Delta w_{2,2} {+}{=}\, \delta_2 z_2 & \Delta w_{2,3} {+}{=}\, \delta_2 z_3 \\ \Delta w_{3,1} {+}{=}\, \delta_3 z_1 & \Delta w_{3,2} {+}{=}\, \delta_3 z_2 & \Delta w_{3,3} {+}{=}\, \delta_3 z_3\end{pmatrix} . \qquad (8.3)$$
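Algorithm 8.1 itself is not reproduced here; the sketch below illustrates, under simplifying assumptions, the row-order access pattern of the forward GAXPY and the accumulation of Eq. (8.3). The function names are illustrative and not the NeuralM_t interface of RANT.

import numpy as np

def dense_forward(W, b, z_prev):
    """GAXPY for u_l = W_l z_{l-1} + b_l, streaming one row of W at a time."""
    u = b.copy()
    for i in range(W.shape[0]):          # each row is contiguous in memory
        u[i] += W[i, :] @ z_prev
    return u

def accumulate_update(dW, delta, z_prev):
    """Eq. (8.3): the rank-1 update dW += delta * z^T, one contiguous row at a time."""
    for i in range(dW.shape[0]):
        dW[i, :] += delta[i] * z_prev
    return dW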
With efficient implementations of these matrix operations, a sound basis for the
implementation of a training library is created. A great deal of work has been done
to make matrix operations efficient for CPUs. One approach is to extract all of the
performance possible by writing assembly routines optimized for a particular CPU
(52). Matrix operations are highly parallelizable, which lends them to concurrent
computation. A library can be written to compute the individual dot products in
parallel resulting in vast increases in speed (142).
Input image 6 × 6; matrix of kernels 9 × 4; filter 4 × 1; feature map 9 × 1 (row order 3 × 3).
Figure 8.4 The physical layout of the input matrix is on the left. The matrix of kernels
is to the right. The feature map is computed with a matrix multiplication. The kernel is
2 × 2, and the stride is 2.
The answer is the CPU's floating point implementation. On some CPUs, it may not make sense, but for many, the scalar multiplication instruction is expensive. Apple M1s and Intel x86 have limited, but useful, vector instructions. They can compute 4–8 multiplications in parallel, so it is worth the memory copying. On a GPU, the win is far greater.
The matrix of kernels is also useful for backpropagation. Reconstructing the matrix of δs as a vector, a weight's derivative is then the dot product of a column in the kernel matrix and the δ vector. A similar technique can also be used for transmission of the gradient through the feature layer.
One means of producing a matrix of kernels is with Im2Col (17). Im2Col will generate a matrix of kernels, one kernel per row. The scheme can also be directly implemented. The RANT library includes a simple example.
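A minimal sketch of the unrolling follows, reproducing the shapes of Figure 8.4; the filter values are arbitrary and the function is illustrative rather than the Im2Col of (17) or the RANT example.

import numpy as np

def im2col(image, k, stride):
    """Unroll an image into a matrix of kernels, one kernel per row.

    Each row holds the k x k patch visited by one position of the filter,
    so the feature map becomes a single matrix-vector product with the
    flattened filter.
    """
    rows, cols = image.shape
    out_r = (rows - k) // stride + 1
    out_c = (cols - k) // stride + 1
    patches = np.empty((out_r * out_c, k * k))
    for i in range(out_r):
        for j in range(out_c):
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
            patches[i * out_c + j, :] = patch.ravel()
    return patches

image = np.arange(1, 37).reshape(6, 6)       # the 6 x 6 example of Figure 8.4
filt = np.array([1.0, 0.0, 0.0, 1.0])        # an arbitrary flattened 2 x 2 filter
feature_map = (im2col(image, k=2, stride=2) @ filt).reshape(3, 3)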
A model in RANT is assembled as a stack of layer objects such as Stratum_t, Dropout_t, and Softmax_t.
Stitching layers together is potentially fraught with peril. The model enforces
sanity by verifying and enforcing correct shape transitions through the model.
Two dense layers can neighbor each other with no trouble. But a CNN filter with
n features needs to have a deeper neighbor of the same shape, or a flatten layer.
The shape checks are performed at model creation time as the layers are added.
The model can reject an improper layer addition and return an error.
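A minimal sketch of shape verification at model-creation time follows; the attribute and class names are illustrative, not RANT's API.

class Model:
    """Verify shape transitions as layers are added to a model."""
    def __init__(self, input_shape):
        self.layers = []
        self.shape = input_shape
    def add(self, layer):
        # Reject an improper layer addition at creation time, not training time.
        if layer.input_shape != self.shape:
            raise ValueError(
                f"layer expects {layer.input_shape}, model produces {self.shape}")
        self.layers.append(layer)
        self.shape = layer.output_shape   # the shape the next layer must accept
        return self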
8.4 Summary
Computer languages are designed with different objectives. Designing a computer language involves selecting a design point in a large design space. The niche the language is addressing dictates the trade-offs. Machine learning libraries are often implemented in C/C++ because it produces high-performance code.
The libraries are exposed in higher level languages, such as Python, that offer
safety and high-level language abstractions. The RANT library can be exposed to
Python or R, but is also well suited to real time and embedded applications with
demanding requirements and a dearth of resources.
8.5 Projects
The projects below use the Python notebook MNIST08.ipynb that can be found
here, https://github.com/nom-de-guerre/DDLbook/tree/main/Chap08. Use the
notebook to verify your work is both correct and performant.
Vistas
The subject of this book is the introduction to the canonical concepts underpin-
ning deep learning ANNs. The central ideas have been presented, but only the
surface has been scratched in the field of deep learning. The discipline of deep
learning is a very rich and fertile area of research and commercial application.
There are many directions for research and areas of specialization. Armed with
a sound grounding in backpropagation, advanced topics come into view. This
chapter presents some of the more interesting directions in which deep learning
is currently moving. As the sections below will show, the roles of ANNs have far
more potential than as regressors and classifiers.
beyond the scope of what ANNs could master. Their most famous argument
showed that a perceptron could not learn XOR. The conclusions that followed
were controversial and seemed to contradict Rosenblatt’s earlier formal results.
Rosenblatt had already examined the question of convolutions of linear decision
boundaries and concluded the opposite in his famous Existence Theorem (129).
The debate has moved on since then. Non-linear algorithms and the exponential
growth in the power of computers have resulted in new assumptions. ANNs have
broken free of their linear shackles. Nonlinear learning has led to enormous
renewed interest in ANNs, yielding the current state of the art.
Modern work has been done on the question based on new assumptions, leading to many promising results. In the theory of ANNs there are many results concerning what ANNs can represent ("learn"). Such results are known as universal approximation theorems. They are concerned with the following question: given some function, f, unknown and perhaps (usually) unknowable, can it be approximated, ANN ≡ f̂ ≈ f? The answer is, with some weak assumptions, yes, in many cases.
As has been shown, for the ANNs presented in this book, ANN topology has two
hyperparameters, and they are depth and width. This is a two-dimensional hyper-
parameter space. Universal approximation theorems can usually be classified as
either an arbitrary depth argument or an arbitrary width argument. A seminal
paper written by George Cybenko in 1989 showed that an ANN of arbitrary width
and using the sigmoid activation function can approximate many functions (27).
This was an exciting result as the ANN community was emerging from the linearly
constrained stupor of 2 decades. Since then, there have been many further results.
Hornik showed in 1991 that the choice of activation function is not important so
much as the depth of the network (70): a deep learning theorem. Thus theory exists
supporting expanding an ANN in any hyperparameter dimension and expressing
general functions.
Examination of the problem continues today, and recently, Kidger and Lyons
showed in 2020 that for modern network topologies (i.e. deep learning), bounded
width and arbitrary depth also yield a universal approximator (82). While all of
the results are theoretical and do not necessarily lead to implementation insights,
they do provide a sound theoretical basis for ANNs and their training.
Universal approximation theorems also motivate the construction that was
placed on ANNs in Section 1.3. They create the connection between the analogy
of the raw clay of an untrained model alluded to earlier and the final molding of
the clay into any desired shape, that is, a trained model. Clay can be molded to
any shape desired, and universal approximation theorems suggest that ANNs can
be too. ANNs were originally motivated biologically, but we see now that their
use can be motivated more generally: to paraphrase a celebrated mountaineer,
“because they work.” ANNs can be viewed as programmable functions, and we
9.2 Generative Adversarial Networks 177
can program them to represent almost anything. It is this power that has led to
so many more applications for deep learning, and we briefly survey a few in the
following sections.
The general form of the ANN regressor and classifier has been framed in Chapter 2,
and indeed, they are very important applications of ANNs. Both forms of ANNs
share a property: inference is performed by accepting a valid input and producing
a verifiable result. They are subject to ground truth. Generative ANNs are com-
pletely different models. They produce material that is not necessarily “wrong”
or “right”; they have no ground truth per se. Evaluating their results can be more
subjective.
Generative models produce output that resembles examples from their training
domain. The domain is specified with a dataset. A generative ANN produces novel
material, and it does not classify an input. A trained generative ANN should pro-
duce output that is indistinguishable from an example in its training set. The roles
of model and human are reversed in the sense that it is the human that classifies
the model’s output (is it good enough), as opposed to a model classifying a datum
for a human. Consider the problem of learning how to draw a hand-written 2 in the
style of the MNIST dataset. Figure 9.1 shows the intermediate stages of a genera-
tive ANN trying to learn how to draw a 2. Learning to produce an image that looks
like a hand-written 2 is different from training a model to recognize one. But the
concepts are related. Both models must have some idea of how to represent two.
Of particular interest of late is the generative adversarial network (GAN). GANs
were described in 2014 by Goodfellow et al. (50). They were later combined
with CNNs to produce the deep convolutional GAN (DCGAN) (119). This
innovation proved fruitful, and soon GANs were producing photorealistic images
(94). StyleGAN (80) soon followed producing images of human faces that were
indistinguishable from real faces. The realistic images led to the coining of the
Figure 9.1 An example of generating twos. During training, 4 sample twos were
generated every 25 epochs of training. The progression starts from the left and goes to
the right.
phrase, “deepfake.” GANs can now generate realistic images for anything that
has a dataset to represent it such as impressionist paintings, aardvarks, or Rolling
Stones albums (81). There are examples of entire videos being produced from a
single photograph of a real person. The GAN is a fascinating innovation in that it
can produce output that has never been seen before.
Figure 9.1 shows a trivial example of teaching a GAN to learn twos from the
MNIST dataset. Over time the twos become more distinct. They were generated
with code that is available on the book website (see Section 9.7). An Apple M1’s
GPU took 5.6 seconds per epoch, and over 200 epochs consumed 18 minutes.
To limit the computational resources required to train the GAN, the model was kept simple. The training set consists of a single digit class selected from MNIST. The
example on the website is simpler; MNIST 2s have been downsampled to 14 × 14
to make it more accessible to notebook computers.
Figure 9.3 The GAN game. The generator, G, produces fakes and attempts to fool the discriminator, D. D is learning how to recognize the real article from the training set. G learns from D.
Two models are required. CNNs can be used for both models (making it a
DCGAN). Let the discriminator be D; its job is to act as a gatekeeper and
only pass exemplars from the domain training set. The exclusivity is maintained
by rejecting the fakes proposed by its opponent, the generator. The function D’s
range is then 𝕂 = {Real, Fake}. It is the discriminator’s task to learn the hidden
structure of the exemplar dataset. Let x ∈ 𝕋 ⊂ 𝕌 where 𝕋 is the training set.
Then x is an example from the training set. For D(x) ∶ 𝕌 → 𝕂 the answer should
only be Real if x really is from the training set. A perfect D would perform as
follows:
D(w) = { Real ≡ 1, w ∈ 𝕋;  Fake ≡ 0, w ∉ 𝕋 }.   (9.1)
Of course, D will only approximate Eq. (9.1), or the generator would not be very useful; the equation merely represents the ideal. The discriminator is clearly a binary classifier
(see Section 4.3.1). Any ANN that can learn the domain training set can be used
as a discriminator. For example, the discriminator used to create the examples in
Figure 9.1 was a CNN designed to learn how to recognize MNIST 2s.
Generators present more of a challenge. The immediate complication lies in the
fact that ANNs require input. Generators are no different, and a means of gener-
ating input is required in addition to producing output. Let the generator be G. G
must produce its own input and learn to produce sensible output. The first problem
is solved easily; the arguments to a generator are sampled from a random distribu-
tion. The two most popular distributions are the normal and uniform distributions,
the former usually being used.
A vector of samples is obtained, z ∈ ℝd , by sampling one of the distributions.
The connection between ℝd and 𝕌 is not dictated by the range. G must learn the
distribution of 𝕋 in 𝕌. It is the task of G to map the sampled vector to something
sensible in its range, G(z) ∶ ℝd → 𝕌. Not even the shape is important, but a vector
is convenient. It is the learnable parameters in G where the sport lies; they do the
work. Using transposed convolutions and dense layers, the sample vector is shaped
to the desired output dimensions while simultaneously mapping it to the required
output distribution. In contrast to a classifying CNN, a DCGAN generator is learning features in order to produce them, not detect them. Like all training problems, fitting the parameters is a challenge. Provided that G's CNN has the capacity to learn the distribution of 𝕋, the transformation can be learned. During training the parameters
in the convolutions (projecting filters) and dense layers learn how to turn random
noise into desirable output. Returning to Figure 9.1, it shows the evolution of a
generator learning to map ℝ100 → ℝ28×28 such that it produces what looks like
hand-written twos. It is the learning capacity of G that does the work, not the input.
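As an illustration of this construction, the following is a minimal sketch, not the notebook from the book's website, of a DCGAN-style generator in PyTorch. It reshapes a latent sample z ∈ ℝ^100 with a dense layer and upsamples it with two transposed convolutions to a 28 × 28 image; the channel counts and kernel sizes are assumptions chosen for brevity.

```python
# A minimal sketch (not the book's implementation) of a DCGAN-style generator
# mapping a latent vector z in R^100 to a 28x28 image.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100):
        super().__init__()
        # Dense layer shapes the random vector into a small feature map.
        self.project = nn.Linear(latent_dim, 128 * 7 * 7)
        # Transposed convolutions upsample 7x7 -> 14x14 -> 28x28.
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(64, 1, kernel_size=4, stride=2, padding=1),
            nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, z):
        x = self.project(z).view(-1, 128, 7, 7)
        return self.deconv(x)

z = torch.randn(16, 100)          # 16 samples from the normal distribution
fake_images = Generator()(z)      # shape: (16, 1, 28, 28)
```

The transposed convolutions and the dense projection hold all of the learnable parameters; the random input only supplies variety.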
The loss should capture the adversarial relationship between the models. Once
the loss functions are available, both models can be trained with SGD and
backpropagation of error.
The discriminator can be construed as a binary classifier; see Section 4.3.1.
Framing the discriminator problem solely as a classifier does not quite capture the
winner-take-all nature of the exercise. The loss function needs to quantify the cost of losing. Unlike a normal classifier, which can just be “wrong,” the discriminator has not simply misclassified an input: it has lost a game, and there is a real cost
attached to a mistake. We begin with the definition of a binary classifier:

loss_binary = −[p · log(p̂) + (1 − p) · log(1 − p̂)].   (9.2)
The discriminator has two inputs which lead to,

loss_D = Σ_{w ∈ x ∪ G(z)} loss_binary(w) = loss_binary(x) + loss_binary(G(z)).   (9.3)
The two cases can be treated separately and then recombined later. For the case of x ∈ 𝕋, p = 1 and the loss for D(x) is,

loss_x = −[1 · log(D(x)) + (1 − 1) · log(1 − D(x))] = −log(D(x)),   (9.4)

and for the fake attempt, p = 0, giving,

loss_{G(z)} = −[0 · log(D(G(z))) + (1 − 0) · log(1 − D(G(z)))] = −log(1 − D(G(z))).   (9.5)

Recombining the two cases and negating the loss yields,

max_D [log(D(x)) + log(1 − D(G(z)))].   (9.6)

The first term reflects the fact that examples from the training set should produce 1, and the second term reflects the desire to return 0 for the fakes. Negating the loss turns the minimization problem into a maximization problem, so the entire expression needs to be maximized with respect to the discriminator. The generator wants to minimize the expression as its interests are in direct conflict with those of the discriminator. In practical implementations, the minimization and the maximization are broken out, and the generator uses the loss function,

min_G log(1 − D(G(z))).   (9.7)
By making Eq. (9.6) negative, the usual gradient descent can be used; it is the stan-
dard loss minimization problem. When used with SGD the losses are summed and
scaled by the mini-batch size.
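Following the formulation of Eqs. (9.4)–(9.7) above, the mini-batch losses can be sketched in a few lines; d_real and d_fake are assumed names for the discriminator's outputs on real and generated examples, and averaging over the batch provides the scaling by the mini-batch size.

```python
# A sketch of the GAN losses in Eqs. (9.4)-(9.7), assuming d_real = D(x) and
# d_fake = D(G(z)) are arrays of discriminator outputs in (0, 1).
import numpy as np

def gan_losses(d_real, d_fake):
    # Discriminator: minimise the negation of Eq. (9.6), averaged over the batch.
    loss_d = -np.mean(np.log(d_real) + np.log(1.0 - d_fake))
    # Generator: minimise its term of Eq. (9.6), i.e. Eq. (9.7).
    loss_g = np.mean(np.log(1.0 - d_fake))
    return loss_d, loss_g

d_real = np.array([0.9, 0.8, 0.95])   # D is fairly sure these are real
d_fake = np.array([0.2, 0.1, 0.3])    # D is fairly sure these are fake
print(gan_losses(d_real, d_fake))     # low loss_d, negative loss_g
```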
Algorithm 9.1 demonstrates the basic steps when training a GAN. The train-
ing routine loops for the specified number of steps. For clarity of exposition the
Figure 9.4 A chess game following the opening moves of both players. Blue went first; it is thus the black player's turn to move.
required that are now defined. Let us now examine the environment and reward
more formally.
An agent learning how to behave in an abstract environment can be trained with
RL. The environment is provided by the people training the agent. The environ-
ment represents the problem domain. For example, if training an agent to play
chess, the environment must accept a move from the agent and update the state of
the game appropriately. The environment must also detect wins, losses, and draws.
By interacting with the environment, i.e. making chess moves, the agent learns
how to play. To learn, there must be reinforcement, both positive and negative.
Losing discourages bad moves and winning encourages good moves.
A task is accomplished by performing actions in an environment. An agent inter-
acts with an environment by sending it actions to change the state. An environ-
ment is the set, 𝕊, of all possible states for the problem under consideration. The set can be infinite if continuous, such as a robot learning to walk, or the set can be finite if it is discrete. A game such as checkers, which has ≈ 10^{40} valid board configurations, has a finite, if large, state space. At time t the system is in the state, s_t ∈ 𝕊, where s_t is the state at that time. The system advances to a state, s_{t+1}, when the agent sends an action to the environment. The agent must grow adept at selecting a good action to transition to a desirable new state.
An agent has a set of possible actions that it can take in the environment, 𝔸. Of particular importance is the set, 𝔸(s_t), which is the set of all possible valid actions when in the state, s_t. To change state, that is, to transition from state s_t to s_{t+1}, an agent must select and execute an action, a ∈ 𝔸(s_t), that should ideally be an improvement. The action selected and executed is A_t. This leads to the notion of a reward
function. Interacting with the environment results in a reward function attaching
a measure of “goodness” or “badness” to the action. High rewards are sought after
and low rewards avoided.
This leads directly to the notion of best, and what is meant by it. A metric is
required for quantitative comparison. The means employed is a reward function,
R(s) ∈ ℝ. The reward must be computable for every action, but it does not have to
be defined. The reward function looks like, R : 𝕊 → ℝ. At time t one action will
be selected, At , resulting in a new state and its reward, Rt+1 (st+1 ).
The object of training is to produce a policy, 𝜋, that the agent can follow to nav-
igate the environment. The role of the policy is to select the actions that the agent
executes. To change state, an agent must select and execute an action, a ∈ 𝔸(s_t). The actions are selected by 𝜋, and it looks like 𝜋 : 𝕊 → 𝔸. As 𝜋 is the policy it is
responsible for selecting the “best” action. The action selected by 𝜋 is At . Sending
the action to the environment will result in the new state, st+1 and the reward, Rt+1 .
The agent navigates through the environment producing the sequence of states,
{s0 , s1 , … , st } and their attendant rewards, {R0 , R1 , … , Rt }. The sequence of states
is known as a trajectory.
Figure 9.5 (a) Markov Decision Process and (b) the resulting Markov Chain following
computation of a policy, 𝜋, and applying it. Black arrows lead to actions and light arrows
lead to possible results of the actions. In the Markov Chain there is at most one black
arrow emanating from a state. This is the result of applying a policy to the MDP.
action is selected. Note that a particular action does not necessarily produce a pre-
dictable result. For example, when in state S0 and executing action a0 there are
two possible outcomes. Some of the states have multiple actions available to them.
The policy is responsible for selecting the action. Training an agent to produce a
policy results in the solution depicted in Figure 9.5b. The MDP has been reduced
to a Markov Chain as the policy has decided which action to take when in a given
state, e.g. 𝜋(S0 ) = a0 . Consequently there is only one black arrow starting from any
state in the diagram; the policy chose the surviving arrows.
Training an agent with RL produces a policy, 𝜋, that performs a task to an accept-
able level. To motivate the method about to be presented, the initial state of train-
ing must be presented. This will motivate the algorithm while demonstrating a
fundamental trade-off when training with RL. To this end, we introduce a more
concrete example in the form of the game of tic-tac-toe.
The game of tic-tac-toe1 consists of drawing a 3 × 3 matrix. Two players take
turns placing either an x or an o on the board. The object of the game is for a
MDP graph) is required. This is captured with the idea of a value function. Value
functions are derivatives of reward functions, but not in the Calculus sense of the
word. The reward function is the instant gratification of an action. A value function
is based on the reward function, but it is a longer-term view of the reward function.
Value functions differ from reward functions in their time-scale and reflect plan-
ning. For most people, surgery is painful and has a low reward. Its value, however,
is high. In the long run, it can increase life span or may increase the baseline of a
reward function by fixing a chronic problem such as back pain or cataracts. One
way to compute value is to calculate the expectation of reward with respect to a
given action and state tuple, (si , aj ):
Q_{n+1} | (s_i, a_j) = 𝔼(R) = (R_1 + R_2 + ⋯ + R_n) / n,   (9.9)
Q defined in Eq. (9.9) is the action value function. It captures the long term payoff
of pursuing the action in the state. Note that n and the subscripts are not time;
they are the number of observations, that is, the number of times during training
the action was executed in that state. During training, a game will be played many
times. For example, the opening move of the agent in tic-tac-toe might be to place
an x in (2, 1), then Eq. (9.9) is computing the expectation for that particular move
for an empty board. In the initial state of tic-tac-toe all moves are valid so the agent
would maintain 9 action values, one for each move, to learn the best move. As
defined, Q_n is not suitable for computation as it requires storing far too much data (even for a trivial game such as tic-tac-toe, which has 19,683 = 3^9 board configurations, ignoring
symmetry). A different form of the equation is required that is more suitable for
implementation:
Q_{n+1} = (1/n) Σ_{i=1}^{n} R_i
        = (1/n) (R_n + Σ_{i=1}^{n−1} R_i)
        = (1/n) (R_n + (n − 1) · (1/(n − 1)) Σ_{i=1}^{n−1} R_i)
        = (1/n) (R_n + (n − 1) · Q_n)
        = (1/n) (R_n + n · Q_n − Q_n)
        = Q_n + (1/n) (R_n − Q_n).   (9.10)
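Eq. (9.10) is what makes tabular action values practical: only the running mean Q_n and a count n need to be stored for each (state, action) pair, not the full history of rewards. A minimal sketch of the update, with made-up rewards:

```python
# A sketch of the incremental action-value update in Eq. (9.10): the running
# mean Q_n is updated in place, so no history of rewards needs to be stored.
def update_action_value(q, n, reward):
    """Return (Q_{n+1}, n+1) after observing the n-th reward for a (state, action)."""
    n += 1
    q = q + (reward - q) / n
    return q, n

q, n = 0.0, 0
for r in [1.0, 0.0, 1.0, 1.0]:      # rewards observed for one (state, action) pair
    q, n = update_action_value(q, n, r)
print(q)                             # 0.75, the mean of the rewards seen so far
```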
Figure 9.6 An example training run for tic-tac-toe. The progression of the game is from
left to right. The agent’s moves are labeled with the action, At . The result of this game is
a win for the agent. A4 results in s5 and R5 = 1.0. The reward is consumed and values are
updated backward through the trajectory of moves. The linear indices for Z are
row ⋅ 3 + column.
that unsupervised training techniques are easier as a training set does not need
to be curated. It is clear, however, that it merely substitutes one problem for
another, and the challenge should not be underestimated. The reward function
for tic-tac-toe is very simple and intuitive. For a trivial zero-sum game a reward
function is relatively easy to construct. The reward function for a robot learning
to stand, or a hand learning to grasp a glass bottle is far more complicated (and
not obvious). Games with deeper game trees may have to address the problem of
sparse rewards. Google’s Go implementation did manage to train with a zero-sum
terminal reward function (139), but it is not always possible. The 19 × 19 board
used in their implementation can have trees that are 10^{48} moves deep. For
complicated tasks the reward function itself can be learnt. Apprenticeship learning describes itself as inverse RL as it attempts to recover a reward function from a solved system (2). Stochastic approaches, such as Hindsight Replay, have also been
successful (5).
The representation of the policy for tic-tac-toe is expensive. The table driven
approach provides for the explicit storage of the 𝜋s and is simply not feasible for any
interesting problem. Q-Learning is one solution (160) as it takes better advantage
of the MDP property of the problem and does not require tables. Action values
make sense for simple applications, but for more complicated problems state value
functions, V, are used.
RL is a powerful technique for building machine learning models. It is partic-
ularly useful when there is no training set available and the desired outcome is
more behavioral. It should not be viewed in isolation from ANNs. The two fields
are very much intertwined and growing ever more so. RL can be used to train neu-
ral networks and neural networks can be used to train RL models. This mutually
supporting relationship is only set to grow. This section was a necessarily brief
introduction, but it is hoped that the essence of the paradigm has been conveyed.
9.4 Natural Language Processing Transformed
transformers in their current form, “Attention is all You Need” (155). The authors
showed that a well-known technique, attention, could be used by itself to address
the problem of relationships between words in language. Transformers have wider application, but it is in the problem domain of natural language that they have had the biggest impact. The paper described a refinement of attention
called self-attention implemented with a transformer. Transformers not only
perform better than previous methods, but they overcome the inherent sequential
nature of earlier methods and naturally support parallel processing, an important
consideration when using GPUs. The output of transformers is vital for use in
“heavy” NLP applications that consume it downstream in the text processing
pipeline. ANNs consume the output of transformers when they are training for
NLP domain problems. Transformers process text that can then be used for classi-
fication, regression, and generative purposes. Since its publication self-attention
has been wildly successful and arguably set off the current NLP revolution.
An important challenge facing AI is to facilitate communication between com-
puters and humans. The ideal is for computers to learn how humans speak, not
vice versa. Currently, humans are subject to the strictures that computers impose
on them when they interact. The onus is on people to comport their practices to
those of computers; the machines dictate to the humans. Computers are dumb
calculators and do not grasp the intricacies of human language. NLP is the field
of teaching computers to competently deal with human modes of language. The
subject of NLP is worthy of a volume in its own right and any attempt to present it
in a single section can only scratch the surface. For a thorough treatment of NLP
the reader is directed to (77).
Humans employ a different class of language when communicating with
each other than they do with computers. The problem arises from the inherent
differences between the natures of computers and people. Computers only
understand numbers represented as binary integers. Human language consists
of words and context. Despite the differences the gap has been partially bridged.
There are many examples of computer languages that humans can employ to
instruct a machine: C, Python, Rust, Lisp, Swift, Smalltalk, Pascal, Basic, Java…
the examples are legion. The list is by no means exhaustive. There are thousands.
Humans have instructed computers using computer languages for decades, but
they are special languages. The field of computer languages, also referred to as
programming languages, is an active and important area of computer science
research. Humans can communicate with computers, but any programmer would
agree, only in a very superficial and exhausting way. Computer languages are a different type of language than human language, and are specially designed. Programming languages are defined by context-free grammars. Human languages have
context and ambiguity.
2 Hence the joke, Teacher: “A double negative makes a positive, but not vice-versa.” Student:
“Yea, right.”
the basic building block that higher-level NLP tools are built with. This section
motivates and outlines their use.
Sentences are comprised of words, but computers only understand numbers.
The first task is to convert a word to something that a computer program can han-
dle. A trivial means of accomplishing this is called tokenizing. A static dictionary
can map a word, and its inflections (e.g. take and took), to a unique integer. The
representation can be augmented by attaching the part of speech as well. This is
not a very good system as homonyms break it immediately. There is also no infor-
mation contained in the numeric representation. More than a token is required.
In addition to the token, the meaning of the word must be captured. This would
seem to lead to a catch-22. The definition of a word is yet more words leading to
more definitions. A means of representing the meanings of words mathematically
is clearly indicated.
A powerful means of representing words is vector semantics. Words can
be represented with vectors. The vectors are embeddings of words in a vector
space that captures the semantic meaning of a word; the vector space is called
a semantic space. The method is known as word embeddings; the meanings
of words are embedded in the semantic space. The vectors are the means of
encapsulating words for use by an NLP system. The word embeddings are vectors
that encapsulate the meaning of a word, and they are numerical. Computers are
good at dealing with vectors, and indeed that is why such a representation was
selected. Representing words as vectors implies that the full force of mathematics
can be brought to bear on them.
A word embedding should have certain properties. For the embeddings and
mathematical operations to be useful the linguistic value must not be lost. The
mathematical operations should make sense linguistically in the word space.
The vectors are elements of a vector space so a measure of distance can be defined
for them. The measure should reflect linguistic relationships. Dog, cat, tea, and
coffee are all different words so they will have unique embeddings. Let xword be the
embedding of word. The following relations should hold: ||x_dog − x_cat|| < ||x_dog − x_tea|| and ||x_tea − x_coffee|| < ||x_dog − x_tea||. And further, ||x_mammal − x_cat|| < ||x_mammal − x_soda||. The
distance between word embeddings is known as similarity, and a good word
embedding will define a useful measure that captures similarity. It should
naturally capture synonyms and conceptual relationships.
Word embeddings can be implemented as static dictionaries. The dictionaries
are surprisingly easy to generate. An NLP application has a text domain of concern,
a corpus (e.g. 10,000 legal opinions), and the corpus can be treated as a training set.
Words that occur close to each other are usually related; this is called the distributional hypothesis, and it forms the basis of learning an embedding. The word, judge,
is more likely to be in the same sentence with lawyer, opinion, or judgment than
soup or beetle, and this is naturally reflected in the corpus. This property is called
self-supervising, and does not require explicit labels. In processing the text it becomes
clear that the word judge is connected with lawyer, but not giraffe. An application
developer chooses the dimension of the embedding, ℝd , and then trains the dic-
tionary. Once trained the dictionary is used by calling it with a word’s token. The
dictionary returns the word embedding, which is a vector: dict(judge) → xjudge .
The vector is real-valued and dense. Word2vec is the canonical example of this
approach (104). Once the embeddings have been computed they can be used with
a model; they are the input that can be used with an ANN. The dimensionality of
the word embeddings does a better job of capturing the multiple senses of a word than a bare token, but it is not ideal. Nor does an embedding give any indication of which sense is
meant in the current sentence or its effect on other words.
Word embeddings are typically required to produce vectors that are normalized,
||x|| = 1. The set of embeddings describes the surface of a hypersphere, that is, a sphere in d dimensions. Recall that the dot product is defined as,

x^T · y = ||x|| ||y|| cos(𝜃),   (9.11)

where 𝜃 is the angle between the two vectors. The dot product is the projection
of x on to y. When dot products are used with normalized vectors the result is
the cosine of the angle between them. This property applies to word embeddings.
Word embeddings are normalized so the dot product is the cosine of the angle
between the embeddings. Word embeddings can be designed to make the cosine
meaningful. It is often the cosine that is used as a measure of similarity, not the dif-
ference used in the example above. The method is called cosine similarity. Similar
words will have similar embeddings. The angle between them will be small. As
the angle grows smaller the cosine approaches 1. Large angles connote little sim-
ilarity and the cosine approaches −1. The intuitive interpretation and the bounded range, −1 ≤ cosine ≤ 1, make cosine similarity very attractive mathematically.
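As a small illustration of cosine similarity, the following sketch uses made-up 3-dimensional embeddings; real embeddings have hundreds of dimensions and are produced by a trained dictionary, not written by hand.

```python
# A sketch of cosine similarity between word embeddings. The 3-dimensional
# vectors are invented purely to show the mechanics.
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x)

x_dog = normalize(np.array([0.9, 0.1, 0.2]))
x_cat = normalize(np.array([0.8, 0.2, 0.3]))
x_tea = normalize(np.array([0.1, 0.9, 0.4]))

# For normalized vectors the dot product is the cosine of the angle between them.
print(np.dot(x_dog, x_cat))   # close to 1: similar words
print(np.dot(x_dog, x_tea))   # much smaller: dissimilar words
```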
Word embeddings make it possible to encode text such that computers can make sense of it. A sentence of length n is a sequence of tokens, t_1, t_2, … , t_n. The sentence is processed in sequence from 1 to n producing the set of embeddings,

{x_1, x_2, … , x_n}.   (9.12)
Given a sentence with n words, a matrix can be built by placing each word embed-
ding row-wise in a matrix (the transpose of the vectors, x_i^T):

E = ⎛ x_1^T ⎞
    ⎜ x_2^T ⎟
    ⎜   ⋮   ⎟
    ⎝ x_n^T ⎠ .   (9.13)
The embedding matrix can be used to multiply one of the word embeddings,

E x_i = s_i.   (9.14)
The result is a vector of the similarities between the word and all of the words in the
sentence. si is the projection of the word xi onto the rest of the sentence. Moreover,
E E^T = S is a matrix of the similarity vectors for all the words in a sentence, column-
wise. The diagonal will be exclusively 1s as self-similarity has a cosine of 1, 𝜃 = 0.
This property is very useful for NLP systems, especially those that are matrix based,
such as ANNs.
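The mechanics of Eqs. (9.13) and (9.14) can be sketched with made-up numbers: stack normalized embeddings row-wise into E, project one word onto the sentence, and form the similarity matrix EE^T, whose diagonal is 1.

```python
# A sketch of Eqs. (9.13)-(9.14) with invented 3-dimensional embeddings.
import numpy as np

X = np.array([[0.9, 0.1, 0.2],
              [0.8, 0.2, 0.3],
              [0.1, 0.9, 0.4],
              [0.2, 0.8, 0.5]])
E = X / np.linalg.norm(X, axis=1, keepdims=True)   # each row has magnitude 1

s_0 = E @ E[0]          # similarities of word 0 with every word in the sentence
S = E @ E.T             # all pairwise similarities, one word per row/column
print(np.round(s_0, 2))
print(np.round(np.diag(S), 2))   # [1. 1. 1. 1.]: self-similarity has cosine 1
```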
Word embeddings are a very powerful tool for dealing with natural language applications. They do, however, leave something to be desired. Recall the list of contex-
tual sentences above. The meaning of some words depended on words in other
parts of the sentence. A sentence must be examined simultaneously in its entirety
to encapsulate it fully. A static mapping from a word to its vector does not account
for context. The word’s relation to other words in a sentence is important when
determining its meaning, or even the value of its contribution to the sentence.
Representing words with vectors that account for context is a language model.
A language model can also be predictive, that is, produce the distribution of tokens:
P(ti ∣ ti−1 ). Predictive language models can be turned around and made generative.
If the model can predict tokens well then in reverse it can also produce them. Word
embeddings are sufficient for many applications, but applications that must really
understand or mimic human language require language models. Language mod-
els are the basis for chatbots such as ChatGPT (GPT is the name of the language
model) and many others. The current state of the art for building language models
is the transformer. It is the transformer that forms the subject of Section 9.4.3.
9.4.3 Attention
Attentive processes are operations that prioritize a subset of information. They can
be thought of as providing focus on a particular aspect of a problem. Recalling the
embeddings in Eq. (9.12), there is a clear limitation. The individual word embed-
dings have not utilized any information about other words in the sentence; each
embedding was performed in isolation.
For a token, ti , how does the model handle the information latent in some
token, tj , that has already been seen but is now forgotten ( j < i) or is yet to be
observed ( j > i)? Moreover, processing a sentence is sequential, and the job of a language model is to predict the tokens j > i. Accounting exclusively for earlier tokens in
the sentence is of particular importance and is called causal. There are two related
problems when attempting to incorporate relationships between words. The first
is producing a means of detecting the relationships between words in a sentence.
The second problem is the question of how to represent the relationships.
The weights are computed at inference time, so they are called soft weights. The
idea of computing weights during inference was introduced in a generative text
setting (53), but it quickly proved useful for comprehension as well. This is in
contrast to normal weights, parameters of a model, which are fixed following the
conclusion of training. The attentional soft weights can be used in a dot product
with the sentence to attenuate the elements appropriately. One means of comput-
ing attention is to employ the embedding matrix to compute the similarities. The
similarities are then construed as importance and used for attention.
ai = softmax(Exi ). (9.18)
Equation (9.18) produces the vector of attentional weights, ai , for the word xi .
The softmax function ensures that the similarities meet the constraints defined
in Eq. (9.16). By computing the attention with the entire sentence, words are connected across distance. A new embedding can now be computed that accounts for
attention with the soft weights,
y_i = Σ_{j=1}^{i} 𝛼_j · x_j,   (9.19)
where 𝛼_j is the jth entry in the attention tuple, a scalar. The new embedding is
a convex linear combination of nearby words scaled by their attention scores. The
resulting embedding accounts for the importance of other words in the sentence
by incorporating their similarities in its embedding.
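A sketch of Eqs. (9.18) and (9.19) with a made-up embedding matrix follows; the softmax and the weighted sum are restricted here to the words seen so far, per the causal discussion above.

```python
# A sketch of similarity-based attention, Eqs. (9.18)-(9.19), with invented embeddings.
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

E = np.array([[0.9, 0.2, 0.1],
              [0.7, 0.3, 0.4],
              [0.2, 0.8, 0.5]])
E = E / np.linalg.norm(E, axis=1, keepdims=True)   # normalized word embeddings

i = 2                                  # new embedding for the third word
a = softmax(E[: i + 1] @ E[i])         # Eq. (9.18): soft weights over the words seen so far
y_i = a @ E[: i + 1]                   # Eq. (9.19): convex combination of those embeddings
print(np.round(a, 2), np.round(y_i, 2))
```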
The use of similarity for attention is an improvement over static word embed-
dings, but it is not ideal. In particular, it assumes quality word embeddings and
a useful measure of similarity. A further refinement described by Vaswani et al.
in their attention paper is self-attention. They proposed a scheme that employed
where the 𝛼_j are scalars. The new vector is a linear combination of the value vectors
scaled by the weights. The final operation of the self-attentive layer is to apply a
residual connection and then to normalize.
yi = layerNorm(yi + xi ). (9.24)
Layer normalization is described in Section 7.5.2. The residual connection trans-
mits information deeper into a network by jumping over an intermediate, parallel
layer. It also helps with training: the gradient remains stronger because it skips the same
component in the backward direction (64). Following the self-attention layer the
final result of the transformer block is produced by using an ANN.
zi = layerNorm(ANN(yi ) + yi ). (9.25)
The result is residually connected and then normalized to produce the final output.
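A minimal sketch of Eqs. (9.24) and (9.25) follows; the layer normalization is the bare operation without learned gain and bias, and the block's ANN is stood in for by a single ReLU.

```python
# A sketch of the residual connections and layer normalization in Eqs. (9.24)-(9.25).
import numpy as np

def layer_norm(v, eps=1e-5):
    # Bare normalization: zero mean, unit variance (no learned parameters).
    return (v - v.mean()) / np.sqrt(v.var() + eps)

def ann(v):
    return np.maximum(0.0, v)    # stand-in for the block's feed-forward network

x = np.array([0.5, -1.2, 0.3, 2.0])       # token embedding entering the block
y_att = np.array([0.4, -0.9, 0.1, 1.5])   # assumed output of the self-attention layer
y = layer_norm(y_att + x)                  # Eq. (9.24): residual connection, then normalize
z = layer_norm(ann(y) + y)                 # Eq. (9.25): the block's final output
print(np.round(z, 2))
```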
The exposition so far has been vector-centric; the focus was on computing the
self-attention for a single token. There is no reason why more than one token can-
not be processed at once. Indeed, for a sentence of length n all the computations
can be performed in parallel with matrix multiplications. A matrix of the input
vectors is used instead of the individual vectors. X is composed by stacking the xiT
in row order. As they are in row order the parameter matrices are the right-side
factors in the multiplication.
Q = X W^q
K = X W^k
V = X W^v.   (9.26)
The matrix formulation that results can be efficiently implemented with matrix
software for either CPUs or GPUs.
SelfAttention(Q, K, V) = softmax(QK^T / √d) V.   (9.27)
The result is a self-attention matrix. The entries in the upper triangle of the matrix product QK^T need to be set to −∞ so that they do not contribute to the softmax operation. This preserves the property that the self-attention calculation for each token position only uses previously seen tokens.
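Putting Eqs. (9.26) and (9.27) together with the causal mask gives the following sketch; X and the weight matrices are random stand-ins, not trained parameters.

```python
# A sketch of masked (causal) scaled dot-product self-attention, Eqs. (9.26)-(9.27).
import numpy as np

def softmax(M):
    e = np.exp(M - M.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v           # Eq. (9.26)
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Mark the upper triangle as -inf so each position only attends to
    # previously seen tokens; those entries become 0 after the softmax.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf
    return softmax(scores) @ V                     # Eq. (9.27)

rng = np.random.default_rng(0)
n, d = 4, 8                                        # 4 tokens, embedding dimension 8
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)      # (4, 8): same shape as the input
```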
A transformer consists of stacking the transformer blocks (the original paper
used 6). The architecture is presented graphically in Figure 9.7. Input arrives and
it percolates through the transformer blocks in sequence. The output of a block
forms the input to the next. After the final transformer block has finished the result
is a matrix with the new embeddings. The result has the same dimensions as the
input matrix. A transformer block has four trainable pieces. They are the three
weight matrices and the ANN.
Algorithm 9.4 illustrates the complete process for self-attention. It is invoked
with text, e.g.: Transformer(“Transformers are powerful tools.”).

Figure 9.7 The transformer architecture. The figure on the left depicts the transformer block. The block consists of 4 layers. A transformer consists of stacking the discrete blocks to form a composition.

The first step
is to preprocess the text. The text is tokenized followed by a static word embed-
ding. Both tasks are performed by dict. This produces the initial set of vectors for
the transformer. The embedding includes positional information in the sentence,
hence i is passed into dict as well as indexing the sentence. The initial embeddings
are then passed on to the transformer. The data percolates through each trans-
former block until the last one is reached. Each block computes the self-attention
and then runs it through an ANN. After the data exits the transformer, it is ready
for use by a downstream model.
An important implementation note. Each transformer block has its own set
of learned matrices, and the dimensions are fixed at training time. A real trans-
former prescribes n. The original attention paper used a value of 512. This means
that 512 tokens need to be passed in at a time. The initial matrix, X, will have a dimension of n × d. If fewer than n tokens are available then it needs to be padded
with a special “null” embedding that will not contribute to the self-attention. The
matrix product, QK T , has n2 entries. It grows quadratically with the length of the
input. Selecting the size of n needs to be done carefully to avoid excessive memory
consumption. The more usual problem is the opposite of an insufficient number of tokens: a document that is too long. When there are too
many tokens for a single invocation the transformer is invoked multiple times with
n-sized subsets.
Y = Y_1 ⊕ ⋯ ⊕ Y_h ∈ ℝ^{n×h·d}.   (9.28)
The concatenation is row-wise, that is, the number of columns increases. A problem arises respecting the dimensions of Y. It does not conform to the dimensions of the input matrix or the expected output matrix. It has d × h columns. The input
and the output of transformers, hence also transformer blocks, must have the same
dimension. Transformer blocks can be stacked arbitrarily so the dimensions of
input and output must be in agreement. To address the mismatch another step is
introduced in the transformer blocks to shrink Y down to the required dimensions.
The concatenated matrix is projected to produce the final matrix of the correct dimensions. The projection is effected with a learned matrix, W^o ∈ ℝ^{h·d×d}, that has the correct dimensions,
Y_p = Y W^o ∈ ℝ^{n×d}.   (9.29)
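A shape-level sketch of Eqs. (9.28) and (9.29), with random stand-ins for the per-head outputs and for W^o:

```python
# A sketch of multi-head concatenation and projection, Eqs. (9.28)-(9.29).
import numpy as np

rng = np.random.default_rng(1)
n, d, h = 4, 8, 3                          # 4 tokens, dimension 8, 3 heads
heads = [rng.normal(size=(n, d)) for _ in range(h)]

# Concatenate the per-head outputs along each row: columns grow from d to h*d.
Y = np.concatenate(heads, axis=1)          # Eq. (9.28): n rows, h*d columns
W_o = rng.normal(size=(h * d, d))          # learned projection (random stand-in here)
Y_p = Y @ W_o                              # Eq. (9.29): back to n x d
print(Y.shape, Y_p.shape)                  # (4, 24) (4, 8)
```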
transformer-based system (141). While there are concerns that language models
are growing too powerful (12), there is too much at stake and the relentless race
to build better ones will almost certainly continue. Switch-C is setting the stage
for trillion+ parameter language models (40).
The applications of language models are legion. Anything that involves natural
language is a potential application. Language translation performance has dra-
matically improved with transformer technology. Legal document summarization
saves a great deal of money. It is sufficiently accurate for some applications that
paying lawyers to read routine material is no longer required. NLP agents that
monitor news feeds keep abreast of enormous quantities of news automatically,
and reliably. News monitoring is important to many organizations such as gov-
ernments and the trading floors of banks, and it can take many forms. Monitoring television or online news feeds such as Reuters and Bloomberg can create automatic alerts and executive summaries. Routing emails reliably saves hiring personnel to
monitor group mailboxes. ChatGPT has been acknowledged as an author on sci-
entific papers (146). The latest OpenAI technical report on GPT-4 acknowledged
ChatGPT for producing summaries of text and as a copy-editing tool (1).
9.5 Neural Turing Machines
3 Turing called them a-machines; Turing machine was coined by his PhD supervisor, Alonzo
Church.
4 These predate the use of transformers and are responsible for the terminology of queries, keys,
and values.
programmable memory has opened the door to all kinds of new applications. This
includes reformulating problems in a new way. Providing state is arguably taking
ANNs from the stateless, functional programming paradigm of Church’s Lambda
Calculus, to the stateful world of Turing Machines and programming with side
effects. Judgment has not yet been passed on the consequences as it is still far too
early, but it is an alluring prospect.
9.6 Summary
This chapter has examined some of the theoretical limitations of what ANNs can
represent, that is, what they can learn. With some weak assumptions about the
problem, it seems they can learn almost anything. The text has presented the
canonical and fundamental algorithms for training deep learning artificial neural
networks. The basics of training models with backpropagation of error prepare the reader for more advanced study. ANNs are a very active research area.
Generative adversarial networks can be trained to produce novel content from
an exemplar dataset. RL and deep learning are becoming more intertwined
producing hybrid systems. Transformers are very deep ANNs with enormous
numbers of parameters. They are useful with sequential data, such as natural
language and time-series.
9.7 Projects
The following projects can be found on the book’s website: https://github.com/
nom-de-guerre/DDL_book/tree/main/Ch09.
1. The website contains a Python Notebook, SmallGAN.ipynb. The generator’s
latent space is normally distributed. Experiment with a uniform distribution
and determine which latent space is superior.
2. The tic-tac-toe example is available as a Python notebook, Tic-Tac-Toe.ipynb.
The reward function values a draw as 0.5. Experiment with different reward
functions, such as valuing a draw as 0.0, and measure the changes to the win-
ning percentage of the agent.
3. The website includes a Python notebook, ClassifyNews.ipynb, that uses a pre-
trained language model. It is connected to a FFFC ANN. The model is a textual
classifier. Vary the ANN’s topology (both width and depth) and measure the
effect on accuracy. The notebook includes instructions for obtaining the data
and the pretrained model.
Appendix A
Mathematical Review
A.1.1 Vectors
A vector is a matrix, x ∈ ℝn , with multiple rows and only one column. A vector
looks like,
x = ⎛ x_1 ⎞
    ⎜  ⋮  ⎟
    ⎝ x_n ⎠ .
It has n rows and 1 column. A vector can be multiplied by a scalar,

𝛼 · x = ⎛ 𝛼 · x_1 ⎞
        ⎜    ⋮    ⎟
        ⎝ 𝛼 · x_n ⎠ .   (A.1)
Vectors of the same dimensions can be added (or subtracted) as follows:
x + y = ⎛ x_1 ⎞   ⎛ y_1 ⎞   ⎛ x_1 + y_1 ⎞
        ⎜  ⋮  ⎟ + ⎜  ⋮  ⎟ = ⎜     ⋮     ⎟ .   (A.2)
        ⎝ x_n ⎠   ⎝ y_n ⎠   ⎝ x_n + y_n ⎠
The transpose of a vector is created by swapping each element's row and
column, xT = (x1 , … , xn ). Thus, a transposed vector has 1 row and n columns.
The magnitude of a vector is computed with the Pythagorean theorem,
||x||_2 = √( Σ_{i=1}^{n} x_i^2 ).   (A.3)
A vector can be normalized such that its magnitude is 1 by dividing a vector by its
magnitude,
x_normal = x / ||x||_2.   (A.4)
For vectors, x, y ∈ ℝn the inner product is defined as
x^T · y = Σ_{i=1}^{n} x_i y_i = 𝛼.   (A.5)
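The operations in Eqs. (A.1)–(A.5) map directly onto NumPy arrays; a short sketch with made-up vectors:

```python
# A sketch of the vector operations in Eqs. (A.1)-(A.5).
import numpy as np

x = np.array([3.0, 4.0])
y = np.array([1.0, 2.0])

print(2.0 * x)                 # scalar multiplication, Eq. (A.1)
print(x + y)                   # vector addition, Eq. (A.2)
print(np.linalg.norm(x))       # magnitude, Eq. (A.3): 5.0
print(x / np.linalg.norm(x))   # normalization, Eq. (A.4): magnitude 1
print(np.dot(x, y))            # inner product, Eq. (A.5): 11.0
```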
A.1.2 Matrices
The general matrix is defined as A ∈ ℝ^{n,m},

A = ⎛ a_{1,1}  a_{1,2}  ⋯  a_{1,m} ⎞
    ⎜ a_{2,1}  a_{2,2}  ⋯  a_{2,m} ⎟
    ⎜              ⋮              ⎟
    ⎝ a_{n,1}  a_{n,2}  ⋯  a_{n,m} ⎠ .   (A.8)
A matrix is a mapping. The Image(A) is the set of all vectors that are produced
by multiplying it with a vector. It is defined as y ∈ Image(A) ⟹ ∃x|Ax = y.
The Kernel(A) is the set of vectors that when multiplied by A result in a vector
of all zeros: x ∈ Kernel(A) ⟹ Ax = 0.
Two special matrices of note are the NULL matrix, which has zeros in all of its entries and is defined as

𝟎 = ⎛ 0_{1,1}  0_{1,2}  ⋯  0_{1,n} ⎞
    ⎜ 0_{2,1}  0_{2,2}  ⋯  0_{2,n} ⎟
    ⎜              ⋮              ⎟
    ⎝ 0_{n,1}  0_{n,2}  ⋯  0_{n,n} ⎠ ,   (A.9)
and the identity matrix, which has 1’s on the diagonal, and 0’s in all of the
off-diagonal entries,
I = ⎛ 1_{1,1}  0_{1,2}  ⋯  0_{1,n} ⎞
    ⎜ 0_{2,1}  1_{2,2}  ⋯  0_{2,n} ⎟
    ⎜              ⋮              ⎟
    ⎝ 0_{n,1}  0_{n,2}  ⋯  1_{n,n} ⎠ .   (A.10)
A matrix or a vector multiplied by the identity matrix is the original matrix, that
is, IA = AI = A.
For two matrices, A, B ∈ ℝ^{n,m}, matrix addition and subtraction are defined as

A + B = ⎛ a_{1,1} + b_{1,1}  a_{1,2} + b_{1,2}  ⋯  a_{1,m} + b_{1,m} ⎞
        ⎜ a_{2,1} + b_{2,1}  a_{2,2} + b_{2,2}  ⋯  a_{2,m} + b_{2,m} ⎟
        ⎜                            ⋮                              ⎟
        ⎝ a_{n,1} + b_{n,1}  a_{n,2} + b_{n,2}  ⋯  a_{n,m} + b_{n,m} ⎠ .   (A.11)
The number of rows and columns must be equal, that is, the matrices must have
precisely the same dimensions.
For a vector, x ∈ ℝm , and matrix, A ∈ ℝn,m , matrix-vector multiplication is
defined as
Ax = ⎛ Σ_{i=1}^{m} a_{1,i} · x_i ⎞
     ⎜ Σ_{i=1}^{m} a_{2,i} · x_i ⎟
     ⎜             ⋮             ⎟
     ⎝ Σ_{i=1}^{m} a_{n,i} · x_i ⎠ = y,   (A.12)
which produces a new column vector, y ∈ ℝn . Each entry in y is the dot product
of x with a row in A.
For matrices, A ∈ ℝ^{n,m} and B ∈ ℝ^{m,p}, the matrix product AB = C is defined when the number of columns in A equals the number of rows in B. The product is C ∈ ℝ^{n,p} (the matrix–vector product is just the special case p = 1).
AB = ⎛ Σ_{i=1}^{m} a_{1,i} · b_{i,1}   Σ_{i=1}^{m} a_{1,i} · b_{i,2}   …   Σ_{i=1}^{m} a_{1,i} · b_{i,p} ⎞
     ⎜ Σ_{i=1}^{m} a_{2,i} · b_{i,1}   Σ_{i=1}^{m} a_{2,i} · b_{i,2}   …   Σ_{i=1}^{m} a_{2,i} · b_{i,p} ⎟
     ⎜                                          ⋮                                                       ⎟
     ⎝ Σ_{i=1}^{m} a_{n,i} · b_{i,1}   Σ_{i=1}^{m} a_{n,i} · b_{i,2}   …   Σ_{i=1}^{m} a_{n,i} · b_{i,p} ⎠ .   (A.13)
Each entry in C is c_{i,j}, the dot product of row i in A and column j in B (hence the columns and rows must agree in the two operands).
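A short NumPy sketch of Eqs. (A.12) and (A.13) with made-up operands; the last line checks that an entry of the product is the corresponding row–column dot product.

```python
# A sketch of the matrix-vector and matrix-matrix products, Eqs. (A.12)-(A.13).
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])      # A in R^{2x3}
x = np.array([1.0, 0.0, -1.0])       # x in R^3
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])           # B in R^{3x2}

print(A @ x)      # Eq. (A.12): a vector in R^2 -> [-2. -2.]
print(A @ B)      # Eq. (A.13): a 2x2 matrix
print((A @ B)[0, 1] == np.dot(A[0], B[:, 1]))   # entry c_{1,2} is row 1 . column 2 -> True
```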
The transpose of a matrix,
A = ⎛ a_{1,1}  a_{1,2}  ⋯  a_{1,m} ⎞
    ⎜ a_{2,1}  a_{2,2}  ⋯  a_{2,m} ⎟
    ⎜              ⋮              ⎟
    ⎝ a_{n,1}  a_{n,2}  ⋯  a_{n,m} ⎠ ,   (A.14)
the column vectors of A. Multiplying with AT computes the dot products with all
of A’s column vectors. This leads to the normal equations. The normal equations
are
A^T(A x̂ − y) = 0 ⟹ A^T A x̂ = A^T y.   (A.18)

The left-hand side must equal zero as the residual vector must be at a right angle to the image of A. The QR decomposition yields the least squares solution, R x̂ = Q^{−1} y. See (145) for more details.
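A sketch of the least squares solution by QR with a made-up system; for the reduced QR computed by NumPy, Q^T plays the role of Q^{−1}, and the final line checks that the normal equations hold at the solution.

```python
# A sketch of least squares via the QR decomposition, per Eq. (A.18).
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0])

Q, R = np.linalg.qr(A)                    # reduced QR: Q is 3x2, R is 2x2
x_hat = np.linalg.solve(R, Q.T @ y)       # least squares solution
print(np.round(x_hat, 3))

# The residual is orthogonal to the image of A, so the normal equations hold.
print(np.round(A.T @ (A @ x_hat - y), 10))   # effectively the zero vector
```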
2 Do not confuse a scalar with a vector with multiple values.
f ′ (x) = g′ ⋅ h + g ⋅ h′ . (A.25)
The Jacobian of a vector function can be interpreted as how the function is chang-
ing with respect to all of its arguments. A detailed exposition can be found in (88).
A.4 Probability
Probability is a mathematical means of attaching a quantitative chance of an
event occurring during an experiment. The universe of possible events is called
Glossary
The following is a glossary of terms and acronyms used in the book. The list is
not exhaustive. Many terms and acronyms are overloaded in the literature and in
practice so the definitions should not be considered authoritative. The terms can
have multiple senses.
References
9 Sergio Barrachina, Manuel F. Dolz, Pablo San Juan, and Enrique S. Quintana-Ortí.
Efficient and portable GEMM-based convolution operators for deep neural
network training on multicore processors. Journal of Parallel and Distributed
Computing, 167:240–254, 2022. ISSN 0743-7315. doi: 10.1016/j.jpdc.2022.05.009.
URL https://www.sciencedirect.com/science/article/pii/S0743731522001241.
10 Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak
features. SIAM Journal on Mathematics of Data Science, 2(4):1167–1180, 2020. doi:
10.1137/20M1336072.
11 R. Bellman. A Markovian decision process. Journal of Mathematics and Mechanics,
6(5):679–684, 1957.
12 Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret
Shmitchell. On the dangers of stochastic parrots: Can language models be too
big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and
Transparency, FAccT ’21, pages 610–623, New York, NY, USA, 2021. Association
for Computing Machinery. ISBN 9781450383097. doi: 10.1145/3442188.3445922.
13 Lucas Beyer, Olivier J. Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron
van den Oord. Are we done with imagenet? CoRR, abs/2006.07159, 2020. URL
https://arxiv.org/abs/2006.07159.
14 B.W. Matthews. Comparison of the predicted and observed secondary structure
of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure,
405(2):442–451, 1975.
15 B. Widrow and M. Hoff. Adaptive switching circuits. In Proceedings WESCON,
pages 96–104, 1960.
16 Ewen Callaway. It will change everything: Deepmind’s AI makes gigantic leap in
solving protein structures. Nature, 588:203–204, 2020.
17 Kumar Chellapilla, Sidd Puri, and Patrice Simard. High performance convolutional
neural networks for document processing. In Tenth International Workshop on
Frontiers in Handwriting Recognition. SuviSoft, 2006.
18 Anna Choromanska, Mikael Henaff, Michaël Mathieu, Gérard Ben Arous, and
Yann LeCun. The loss surface of multilayer networks. CoRR, abs/1412.0233, 2014.
URL http://arxiv.org/abs/1412.0233.
19 Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav
Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebas-
tian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez,
Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran,
Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin,
Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay
Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin
Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek
Lim, Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani
Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai,
Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov,
Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan
Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean,
Slav Petrov, and Noah Fiedel. PaLM: Scaling language modeling with pathways.
CoRR, abs/2204.02311, 2022. doi: 10.48550/arXiv.2204.02311.
20 Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. Empir-
ical evaluation of gated recurrent neural networks on sequence modeling. CoRR,
abs/1412.3555, 2014. URL http://arxiv.org/abs/1412.3555.
21 Alonzo Church. An unsolvable problem of elementary number theory. American
Journal of Mathematics, 58(2):345–363, 1936.
22 Kirkwood A. Cloud, Brian J. Reich, Christopher M. Rozoff, Stefano Alessandrini,
William E. Lewis, and Luca Delle Monache. A feed forward neural network based
on model output statistics for short-term hurricane intensity prediction. Weather
and Forecasting, 34(4):985–997, 2019.
23 Cody Coleman, Daniel Kang, Deepak Narayanan, Luigi Nardi, Tian Zhao, Jian
Zhang, Peter Bailis, Kunle Olukotun, Chris Ré, and Matei Zaharia. Analysis of
DAWNBench, a time-to-accuracy machine learning performance benchmark. ACM
SIGOPS Operating Systems Review, 53(1):14–25, Jul 2019. ISSN 0163-5980. doi:
10.1145/3352020.3352024.
24 Mark Collier and Jöran Beel. Implementing neural turing machines. In Vera
Kurková, Yannis Manolopoulos, Barbara Hammer, Lazaros S. Iliadis, and Ilias
Maglogiannis, editors, Artificial Neural Networks and Machine Learning - ICANN
2018 - 27th International Conference on Artificial Neural Networks, Rhodes, Greece,
October 4-7, 2018, Proceedings, Part III, volume 11141 of Lecture Notes in Computer
Science, pages 94–104. Springer, 2018. doi: 10.1007/978-3-030-01424-7_10.
25 Committee. IEEE standard for binary floating-point arithmetic, 1985.
26 Thomas Cover and Joy Thomas. Elements of Information Theory. Wiley, 2006.
ISBN 0471241954.
27 G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathemat-
ics of Control, 2:303–314, 1989.
28 Yehuda Dar, Vidya Muthukumar, and Richard G. Baraniuk. A farewell to the
bias-variance tradeoff? An overview of the theory of overparameterized machine
learning, 2021.
29 Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V.
Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang,
and Andrew Y. Ng. Large scale distributed deep networks. In NIPS, 2012.
30 Rina Dechter. Learning while searching in constraint-satisfaction-problems.
In Proceedings of the Fifth AAAI National Conference on Artificial Intelligence,
AAAI’86, pages 178–183. AAAI Press, 1986.
31 Howard B. Demuth, Mark H. Beale, Orlando De Jess, and Martin T. Hagan. Neu-
ral Network Design. Martin Hagan, Stillwater, OK, USA, 2nd edition, 2014. ISBN
0971732116.
32 Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Ima-
geNet: A large-scale hierarchical image database. In 2009 IEEE Confer-
ence on Computer Vision and Pattern Recognition, pages 248–255, 2009. doi:
10.1109/CVPR.2009.5206848.
33 Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT:
Pre-training of deep bidirectional transformers for language understanding, 2018.
URL https://arxiv.org/abs/1810.04805.
34 D. Rumelhart, Geoffrey Hinton, and Ronald Williams. Learning representations by
back-propagating errors. Nature, 323(6088):533–536, 1986.
35 John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for
online learning and stochastic optimization. Journal of Machine Learning Research,
12(61):2121–2159, 2011. URL http://jmlr.org/papers/v12/duchi11a.html.
36 B. Efron. Bootstrap methods: Another look at the Jackknife. The Annals of Statis-
tics, 7(1):1–26, 1979. doi: 10.1214/aos/1176344552.
37 Michael Egmont-Petersen, Jan L. Talmon, Jytte Brender, and Peter McNair. On
the quality of neural net classifiers. Artificial Intelligence in Medicine, 6(5):359–381,
1994. ISSN 0933-3657. doi: 10.1016/0933-3657(94)90002-7. URL https://www
.sciencedirect.com/science/article/pii/0933365794900027. Neural Computing in
Medicine.
38 Tyna Eloundou, Sam Manning, Pamela Mishkin, and Daniel Rock. GPTs are
GPTs: An early look at the labor market impact potential of large language mod-
els, 2023.
39 D. Farringdon, S. Waples M. Gill, and J. Agromaniz. The effects of closed-circuit
television on crime: Meta-analysis of an English national quasi-experimental
multi-site evaluation. Journal of Experimental Criminology, 3:21–38, 2007.
40 William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to
trillion parameter models with simple and efficient sparsity. CoRR, abs/2101.03961,
2021. URL https://arxiv.org/abs/2101.03961.
41 G. Forsythe. Pitfalls in computation, or why a math book isn't enough, 1970.
42 Kunihiko Fukushima. Cognitron: A self-organizing multilayered neural network.
Biological Cybernetics, 20(3):121–136, 1975.
43 Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a
mechanism of pattern recognition unaffected by shift in position. Biological Cyber-
netics, 36:193–202, 1980.
44 G. Hinton. Where do features come from? Cognitive Science, 38(6):1078–1101,
2014.
45 Ofer Gal. The New Science. The Origins of Modern Science. Cambridge University
Press, 2021. ISBN 978-1316649701.
46 Stuart Geman, Elie Bienenstock, and René Doursat. Neural networks and the
bias/variance dilemma. Neural Computation, 4(1):1–58, 1992. ISSN 0899-7667.
URL http://portal.acm.org/citation.cfm?id=148062.
77 Daniel Jurafsky and James Martin. Speech and Language Processing: An Intro-
duction to Natural Language Processing, Computational Linguistics, and Speech
Recognition. Prentice Hall, 2008. ISBN 9780131873216.
78 Lukasz Kaiser and Samy Bengio. Can active memory replace attention? In Daniel
D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman
Garnett, editors, Advances in Neural Information Processing Systems 29: Annual
Conference on Neural Information Processing Systems 2016, December 5–10, 2016,
Barcelona, Spain, pages 3774–3782, 2016. URL https://proceedings.neurips.cc/
paper/2016/hash/fb8feff253bb6c834deb61ec76baa89.
79 Eric Kandel, James Schwartz, and Thomas Jessell. Principles of Neural Science, 4th
edition. McGraw-Hill, 2000. ISBN 978-083857701.
80 Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing
of gans for improved quality, stability, and variation. CoRR, abs/1710.10196, 2017.
URL http://arxiv.org/abs/1710.10196.
81 Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and
Timo Aila. Training generative adversarial networks with limited data. In Neural
Information Processing Systems (NeurIPS), 2020.
82 Patrick Kidger and Terry Lyons. Universal approximation with deep narrow net-
works. In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty
Third Conference on Learning Theory, volume 125 of Proceedings of Machine
Learning Research, pages 2306–2327. PMLR, 09–12 Jul 2020. URL https://
proceedings.mlr.press/v125/kidger20a.html.
83 J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regres-
sion function. The Annals of Mathematical Statistics, 23(3):462–466, 1952. ISSN
00034851. URL http://www.jstor.org/stable/2236690.
84 Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimiza-
tion. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference
on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7–9, 2015,
Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.
85 Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In Yoshua
Bengio and Yann LeCun, editors, 2nd International Conference on Learning Rep-
resentations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track
Proceedings, 2014. URL http://arxiv.org/abs/1312.6114.
86 Donald Knuth. Seminumerical Algorithms. The Art of Computer Programming.
Addison-Wesley, 2002. ISBN 0201038226.
87 Mark A. Kramer. Nonlinear principal component analysis using autoassociative
neural networks. AIChE Journal, 37:233–243, 1991.
88 Erwin Kreyszig, Herbert Kreyszig, and E.J. Norminton. Advanced Engineering
Mathematics. Wiley, Hoboken, NJ, 10th edition, 2011. ISBN 0470458364.
89 Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification
with deep convolutional neural networks. In F. Pereira, C.J. Burges, L. Bottou, and
102 James Martens. New insights and perspectives on the natural gradient method.
Journal of Machine Learning Research, 21(146):1–76, 2020. URL http://jmlr.org/
papers/v21/17-678.html.
103 Scott McKinney, Marcin Sieniek, Varun Godbole, Jonathan Godwin, Natasha
Antropova, Hutan Ashrafian, Trevor Back, Mary Chesus, Greg S. Corrado,
Ara Darzi, Mozziyar Etemadi, Florencia Garcia-Vicente, Fiona J. Gilbert,
Mark Halling-Brown, Demis Hassabis, Sunny Jansen, Alan Karthikesalingam,
Christopher J. Kelly, Dominic King, Joseph R. Ledsam, David Melnick,
Hormuz Mostofi, Lily Peng, Joshua Jay Reicher, Bernardino Romera-Paredes,
Richard Sidebottom, Mustafa Suleyman, Daniel Tse, Kenneth C. Young, Jeffrey
De Fauw, and Shravya Shetty. International evaluation of an AI system for breast
cancer screening. Nature, 577:89–94, 2020.
104 Tomás Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean.
Distributed representations of words and phrases and their compositionality.
In Christopher J.C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q.
Weinberger, editors, Advances in Neural Information Processing Systems 26: 27th
Annual Conference on Neural Information Processing Systems 2013. Proceed-
ings of a Meeting Held December 5–8, 2013, Lake Tahoe, Nevada, United States,
pages 3111–3119, 2013. URL https://proceedings.neurips.cc/paper/2013/hash/
9aa42b31882ec039965f3c4923ce901.
105 Tom Mitchell. Machine Learning. McGraw-Hill, 1997. ISBN 9384761168.
106 M. Minsky and S. Papert. Perceptrons. MIT Press, 1969.
107 Nikolai Nowaczyk, Jörg Kienitz, Sarp Kaya Acar, and Qian Liang. How deep is
your model? Network topology selection from a model validation perspective.
Journal of Mathematics in Industry, 12:1, 2022.
108 Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela
Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John
Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda
Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training lan-
guage models to follow instructions with human feedback, 2022. URL https://arxiv
.org/abs/2203.02155.
109 Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank
citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford
InfoLab, November 1999. URL http://ilpubs.stanford.edu:8090/422/. Previous
number = SIDL- WP-1999-0120.
110 Yudi Pawitan. In All Likelihood. Oxford University Press, 2013. ISBN 0199671222.
111 K. Pearson. On the Theory of Contingency and Its Relation to Association and Nor-
mal Correlation, volumes 1–4; volume 7 in Drapers’ company research memoirs:
Biometric series. Dulau and Company, 1904. URL https://books.google.co.uk/
books?id=8h3OxgEACAAJ.
112 Rui Pereira, Marco Couto, Francisco Ribeiro, Rui Rua, Jácome Cunha, João Paulo
Fernandes, and João Saraiva. Ranking programming languages by energy effi-
ciency. Science of Computer Programming, 205:102609, 2021. ISSN 0167-6423.
126 V.K. Rohatgi and A.K.M.E. Saleh. An Introduction to Probability and Statistics.
Wiley Series in Probability and Statistics. Wiley, 2015. ISBN 9781118799642. URL
https://books.google.co.uk/books?id=RCq9BgAAQBAJ.
127 Raul Rojas. Neural Networks - A Systematic Introduction. Springer-Verlag, Berlin,
1996.
128 Kevin Roose. The brilliance and weirdness of ChatGPT. The New York Times,
Section B, page 1, 2022.
129 F. Rosenblatt. Principles of neurodynamics: Perceptrons and the theory of brain
mechanisms. Cornell Aeronautical Laboratory. Report no. VG-1196-G-8. Spartan
Books, 1962. URL https://books.google.co.uk/books?id=7FhRAAAAMAAJ.
130 Frank Rosenblatt. The perceptron—a perceiving and recognizing automaton.
Cornell Aeronautical Laboratory, Report 85-460-1, 1957.
131 M. Minsky and S. Papert. Perceptrons: An Introduction to Computational Geometry.
MIT Press, 1969. ISBN 978-0-262-63022-1.
132 Yousef Saad. Iterative Methods for Sparse Linear Systems. Applied Mathematics,
2nd edition. SIAM, 2003. ISBN 978-0-89871-534-7. doi: 10.1137/1.9780898718003.
URL http://www-users.cs.umn.edu/~saad/IterMethBook2ndEd.pdf.
133 Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How
does batch normalization help optimization? In Proceedings of the 32nd Inter-
national Conference on Neural Information Processing Systems, NIPS’18, pages
2488–2498, Red Hook, NY, USA, 2018. Curran Associates Inc.
134 Dominik Scherer, Andreas Müller, and Sven Behnke. Evaluation of pooling
operations in convolutional architectures for object recognition. In Konstanti-
nos Diamantaras, Wlodek Duch, and Lazaros S. Iliadis, editors, Artificial Neural
Networks – ICANN 2010, pages 92–101. Springer-Verlag, Berlin, Heidelberg, 2010.
ISBN 978-3-642-15825-4.
135 John Searle. Minds, brains and programs. Behavioral and Brain Sciences,
3(3):417–424, 1980.
136 Charles Sherrington. Experiments in examination of the peripheral distribution
of the fibers of the posterior roots of some spinal nerves. Proceedings of the Royal
Society of London, 190:45–186, 1898.
137 Pannaga Shivaswamy and Ashok Chandrashekar. Bias-variance decompo-
sition for ranking. In Proceedings of the 14th ACM International Conference
on Web Search and Data Mining, WSDM ’21, pages 472–480, New York, NY,
USA, 2021. Association for Computing Machinery. ISBN 9781450382977. doi:
10.1145/3437963.3441772.
138 Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper,
and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language
models using model parallelism. CoRR, abs/1909.08053, 2019. URL http://arxiv
.org/abs/1909.08053.
139 David Silver, Aja Huang, Christopher J. Maddison, Arthur Guez, Laurent Sifre,
George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda
Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham,
159 Alex Wang, Nikita Nangia, Yada Pruksachatkun, Julian Michael, Amanpreet
Singh, Omer Levy, Felix Hill, and Samuel Bowman. SuperGLUE: A stickier bench-
mark for general-purpose language understanding systems. In 33rd Conference on
Neural Information Processing Systems, 2019.
160 Christopher John Cornish Hellaby Watkins. Learning from Delayed Rewards.
PhD thesis, King’s College, Cambridge, United Kingdom, 1989.
161 Colin White, Mahmoud Safari, Rhea Sanjay Sukthanker, Binxin Ru, Thomas
Elsken, Arber Zela, Debadeepta Dey, and Frank Hutter. Neural architecture
search: Insights from 1000 papers. ArXiv, abs/2301.08727, 2023.
162 Hugh Wilson and Jack Cowan. Excitatory and inhibitory interactions in localized
populations of model neurons. Biophysical Journal, 12:1–24, 1972.
163 Rodney Winter and Bernard Widrow. MADALINE RULE II: A training algorithm
for neural networks. In IEEE International Conference on Neural Networks, pages
401–408, 1988.
164 W. Pitts and W.S. McCulloch. How we know universals: the perception of auditory
and visual forms. Bulletin of Mathematical Biology, 9(3):127–147, 1947.
165 Jie Xu. A deep learning approach to building an intelligent video surveillance sys-
tem. Multimedia Tools and Applications, 80:5495–5515, 2020.
166 Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, and Junyang Lin. Under-
standing and improving layer normalization. In Advances in Neural Information
Processing Systems, volume 32. Curran Associates, Inc., 2019.
167 Santiago Ramón y Cajal. Manual de Anatomía Patológica general. N. Moya, 1906.
168 Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional
networks. CoRR, abs/1311.2901, 2013.
169 Daniel Zhang, Nestor Maslej, Erik Brynjolfsson, John Etchemendy, Terah Lyons,
James Manyika, Helen Ngo, Juan Niebles, Michael Sellitto, Ellie Sakhaee, Yoav
Shoham, Jack Clark, and Raymond Perrault. The AI index 2022 annual report,
2022.
Index
a
activation function 29, 38, 50
ADAM
  algorithm 101
  performance 103
  SGD 101
agent 185, 186
AlexNet 8, 122
artificial intelligence 5
attention 194, 199
  multi-head 204
  Neural Turing Machines 208, 209
  self 199, 200
  weights 199

b
backpropagation 8, 51
  CNN 123
  CNN implementation 171
  equation 54
  MLE 77
  multi-class 77
  multi-label 84
  softmax 77
Batch Normalization 149
Bernoulli Distribution 134
bias
  definition 134
  least squares 137
  underfitting 135
bias-variance
  dogma 138
  optimization 137
  trade-off 136
binomial distribution 74, 134

c
C++ 159
classification
  binary 34, 82
  loss derivative 79
  loss function 76
  metrics 142
  multi-class 34, 81
  multi-label 81
CNN
  classification 123
  feature 113
  filter 114
  perceptron 117
Confusion Matrix
  classification 140
Convolution
  feature 113
  max pooling 120
  perceptron 117
  training 123
  gradient 60
  Hessian 95
  Jacobian 104
  memory 168
  weights 31
Matthews Correlation Coefficient 142
mean squared error 53
memory 166
MNIST 112
momentum 99

n
neuron 1, 23
Newton’s Method 93
NLP 193
normalization 43
NumPy 168

o
objective function 48, 52
one hot encoding 36
overfitting 135
  verification set 139

p
Pearson table 140
perceptron
  CNN 117
  filter 117
  kernel 117
  neuron 23
  regularizer 119
  Rosenblatt 7
precision 141
preprocessing 42
probability 74, 219
  sigmoid 82
  softmax 38
Python 20, 159, 164, 168
  library 173

r
RANT 168
recall 141
regression 23, 25
  loss 52
regularization
  CNN 119
  dropout 143
  perceptron 119
reinforcement learning 183
  exploitation 189
  reward 189
  temporal difference 191
  value function 190
ReLU
  activation 31
  CNN 118, 126
  derivative 55
  initialization 46
  optimizer 103
Rosenblatt 7, 27, 176
RPROP
  ADAM comparison 102
  algorithm 97
  SGD 98

s
self-attention 199, 200
semantic space 196
SGD 91
  ADAM 101
  RPROP 98
  SGD 87
sigmoid
  activation function 30
  derivative 55
  probability 82
similarity 197
softmax
  activation function 38
  classification 38