Mathematical Introduction To Deep Learning: Methods, Implementations, and Theory
Mathematical Introduction To Deep Learning: Methods, Implementations, and Theory
Introduction to
Deep Learning:
arXiv:2310.20360v1 [cs.LG] 31 Oct 2023
Methods,
Implementations,
and Theory
Arnulf Jentzen
Benno Kuckuck
Philippe von Wurstemberger
Arnulf Jentzen
School of Data Science and Shenzhen Research Institute of Big Data
The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen)
Shenzhen, China
email: [email protected]
Applied Mathematics: Institute for Analysis and Numerics
University of Münster
Münster, Germany
email: [email protected]
Benno Kuckuck
School of Data Science and Shenzhen Research Institute of Big Data
The Chinese University of Hong Kong Shenzhen (CUHK-Shenzhen)
Shenzhen, China
email: [email protected]
Applied Mathematics: Institute for Analysis and Numerics
University of Münster
Münster, Germany
email: [email protected]
Keywords: deep learning, artificial neural network, stochastic gradient descent, optimization
Mathematics Subject Classification (2020): 68T07
All Python source codes in this book can be downloaded from https://github.com/introdeeplearning/
book or from the arXiv page of this book (by clicking on “Other formats” and then “Download source”).
Preface
This book aims to provide an introduction to the topic of deep learning algorithms. Very
roughly speaking, when we speak of a deep learning algorithm we think of a computational
scheme which aims to approximate certain relations, functions, or quantities by means
of so-called deep artificial neural networks (ANNs) and the iterated use of some kind of
data. ANNs, in turn, can be thought of as classes of functions that consist of multiple
compositions of certain nonlinear functions, which are referred to as activation functions,
and certain affine functions. Loosely speaking, the depth of such ANNs corresponds to
the number of involved iterated compositions in the ANN and one starts to speak of deep
ANNs when the number of involved compositions of nonlinear and affine functions is larger
than two.
We hope that this book will be useful for students and scientists who do not yet have
any background in deep learning at all and would like to gain a solid foundation as well
as for practitioners who would like to obtain a firmer mathematical understanding of the
objects and methods considered in deep learning.
After a brief introduction, this book is divided into six parts (see Parts I, II, III, IV,
V, and VI). In Part I we introduce in Chapter 1 different types of ANNs including fully-
connected feedforward ANNs, convolutional ANNs (CNNs), recurrent ANNs (RNNs), and
residual ANNs (ResNets) in all mathematical details and in Chapter 2 we present a certain
calculus for fully-connected feedforward ANNs.
In Part II we present several mathematical results that analyze how well ANNs can
approximate given functions. To make this part more accessible, we first restrict ourselves
in Chapter 3 to one-dimensional functions from the reals to the reals and, thereafter, we
study ANN approximation results for multivariate functions in Chapter 4.
A key aspect of deep learning algorithms is usually to model or reformulate the problem
under consideration as a suitable optimization problem involving deep ANNs. It is precisely
the subject of Part III to study such and related optimization problems and the corresponding
optimization algorithms to approximately solve such problems in detail. In particular, in
the context of deep learning methods such optimization problems – typically given in the
form of a minimization problem – are usually solved by means of appropriate gradient based
optimization methods. Roughly speaking, we think of a gradient based optimization method
as a computational scheme which aims to solve the considered optimization problem by
performing successive steps based on the direction of the (negative) gradient of the function
which one wants to optimize. Deterministic variants of such gradient based optimization
methods such as the gradient descent (GD) optimization method are reviewed and studied
in Chapter 6 and stochastic variants of such gradient based optimization methods such
as the stochastic gradient descent (SGD) optimization method are reviewed and studied
in Chapter 7. GD-type and SGD-type optimization methods can, roughly speaking, be
viewed as time-discrete approximations of solutions of suitable gradient flow (GF) ordinary
differential equations (ODEs). To develop intuitions for GD-type and SGD-type optimization
3
methods and for some of the tools which we employ to analyze such methods, we study in
Chapter 5 such GF ODEs. In particular, we show in Chapter 5 how such GF ODEs can be
used to approximately solve appropriate optimization problems. Implementations of the
gradient based methods discussed in Chapters 6 and 7 require efficient computations of
gradients. The most popular and in some sense most natural method to explicitly compute
such gradients in the case of the training of ANNs is the backpropagation method, which
we derive and present in detail in Chapter 8. The mathematical analyses for gradient
based optimization methods that we present in Chapters 5, 6, and 7 are in almost all
cases too restrictive to cover optimization problems associated to the training of ANNs.
However, such optimization problems can be covered by the Kurdyka–Łojasiewicz (KL)
approach which we discuss in detail in Chapter 9. In Chapter 10 we rigorously review
batch normalization (BN) methods, which are popular methods that aim to accelerate ANN
training procedures in data-driven learning problems. In Chapter 11 we review and study
the approach to optimize an objective function through different random initializations.
The mathematical analysis of deep learning algorithms does not only consist of error
estimates for approximation capacities of ANNs (cf. Part II) and of error estimates for the
involved optimization methods (cf. Part III) but also requires estimates for the generalization
error which, roughly speaking, arises when the probability distribution associated to the
learning problem cannot be accessed explicitly but is approximated by a finite number of
realizations/data. It is precisely the subject of Part IV to study the generalization error.
Specifically, in Chapter 12 we review suitable probabilistic generalization error estimates
and in Chapter 13 we review suitable strong Lp -type generalization error estimates.
In Part V we illustrate how to combine parts of the approximation error estimates
from Part II, parts of the optimization error estimates from Part III, and parts of the
generalization error estimates from Part IV to establish estimates for the overall error in
the exemplary situation of the training of ANNs based on SGD-type optimization methods
with many independent random initializations. Specifically, in Chapter 14 we present a
suitable overall error decomposition for supervised learning problems, which we employ
in Chapter 15 together with some of the findings of Parts II, III, and IV to establish the
aforementioned illustrative overall error analysis.
Deep learning methods have not only become very popular for data-driven learning
problems, but are nowadays also heavily used for approximately solving partial differential
equations (PDEs). In Part VI we review and implement three popular variants of such deep
learning methods for PDEs. Specifically, in Chapter 16 we treat physics-informed neural
networks (PINNs) and deep Galerkin methods (DGMs) and in Chapter 17 we treat deep
Kolmogorov methods (DKMs).
This book contains a number of Python source codes, which can be downloaded
from two sources, namely from the public GitHub repository at https://github.com/
introdeeplearning/book and from the arXiv page of this book (by clicking on the link
“Other formats” and then on “Download source”). For ease of reference, the caption of each
4
source listing in this book contains the filename of the corresponding source file.
This book grew out of a series of lectures held by the authors at ETH Zurich, University
of Münster, and the Chinese University of Hong Kong, Shenzhen. It is in parts based on
recent joint articles of Christian Beck, Sebastian Becker, Weinan E, Lukas Gonon, Robin
Graeber, Philipp Grohs, Fabian Hornung, Martin Hutzenthaler, Nor Jaafari, Joshua Lee
Padgett, Adrian Riekert, Diyora Salimova, Timo Welti, and Philipp Zimmermann with
the authors of this book. We thank all of our aforementioned co-authors for very fruitful
collaborations. Special thanks are due to Timo Welti for his permission to integrate slightly
modified extracts of the article [230] into this book. We also thank Lukas Gonon, Timo
Kröger, Siyu Liang, and Joshua Lee Padget for several insightful discussions and useful
suggestions. Finally, we thank the students of the courses that we held on the basis of
preliminary material of this book for bringing several typos to our notice.
This work was supported by the internal project fund from the Shenzhen Research
Institute of Big Data under grant T00120220001. This work has been partially funded by
the National Science Foundation of China (NSFC) under grant number 12250610192. The
first author gratefully acknowledges the support of the Cluster of Excellence EXC 2044-
390685587, Mathematics Münster: Dynamics-Geometry-Structure funded by the Deutsche
Forschungsgemeinschaft (DFG, German Research Foundation).
5
6
Contents
Preface 3
Introduction 15
7
Contents
2 ANN calculus 77
2.1 Compositions of fully-connected feedforward ANNs . . . . . . . . . . . . . 77
2.1.1 Compositions of fully-connected feedforward ANNs . . . . . . . . . 77
2.1.2 Elementary properties of compositions of fully-connected feedforward
ANNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
2.1.3 Associativity of compositions of fully-connected feedforward ANNs 80
2.1.4 Powers of fully-connected feedforward ANNs . . . . . . . . . . . . 84
2.2 Parallelizations of fully-connected feedforward ANNs . . . . . . . . . . . . 84
2.2.1 Parallelizations of fully-connected feedforward ANNs with the same
length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
2.2.2 Representations of the identities with ReLU activation functions . 89
2.2.3 Extensions of fully-connected feedforward ANNs . . . . . . . . . . 90
2.2.4 Parallelizations of fully-connected feedforward ANNs with different
lengths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
2.3 Scalar multiplications of fully-connected feedforward ANNs . . . . . . . . 96
2.3.1 Affine transformations as fully-connected feedforward ANNs . . . . 96
2.3.2 Scalar multiplications of fully-connected feedforward ANNs . . . . 97
2.4 Sums of fully-connected feedforward ANNs with the same length . . . . . 98
2.4.1 Sums of vectors as fully-connected feedforward ANNs . . . . . . . . 98
2.4.2 Concatenation of vectors as fully-connected feedforward ANNs . . 100
2.4.3 Sums of fully-connected feedforward ANNs . . . . . . . . . . . . . 102
8
Contents
II Approximation 105
3 One-dimensional ANN approximation results 107
3.1 Linear interpolation of one-dimensional functions . . . . . . . . . . . . . . 107
3.1.1 On the modulus of continuity . . . . . . . . . . . . . . . . . . . . . 107
3.1.2 Linear interpolation of one-dimensional functions . . . . . . . . . . 109
3.2 Linear interpolation with fully-connected feedforward ANNs . . . . . . . . 113
3.2.1 Activation functions as fully-connected feedforward ANNs . . . . . 113
3.2.2 Representations for ReLU ANNs with one hidden neuron . . . . . 114
3.2.3 ReLU ANN representations for linear interpolations . . . . . . . . 115
3.3 ANN approximations results for one-dimensional functions . . . . . . . . . 118
3.3.1 Constructive ANN approximation results . . . . . . . . . . . . . . 118
3.3.2 Convergence rates for the approximation error . . . . . . . . . . . . 122
9
Contents
10
Contents
8 Backpropagation 337
8.1 Backpropagation for parametric functions . . . . . . . . . . . . . . . . . . 337
8.2 Backpropagation for ANNs . . . . . . . . . . . . . . . . . . . . . . . . . . 342
11
Contents
9.14 Standard KL inequalities for empirical risks in the training of ANNs with
analytic activation functions . . . . . . . . . . . . . . . . . . . . . . . . . . 388
9.15 Fréchet subdifferentials and limiting Fréchet subdifferentials . . . . . . . . 390
9.16 Non-smooth slope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
9.17 Generalized KL functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
IV Generalization 431
12 Probabilistic generalization error estimates 433
12.1 Concentration inequalities for random variables . . . . . . . . . . . . . . . 433
12.1.1 Markov’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . . 433
12.1.2 A first concentration inequality . . . . . . . . . . . . . . . . . . . . 434
12.1.3 Moment-generating functions . . . . . . . . . . . . . . . . . . . . . 436
12.1.4 Chernoff bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
12.1.5 Hoeffding’s inequality . . . . . . . . . . . . . . . . . . . . . . . . . 438
12.1.6 A strengthened Hoeffding’s inequality . . . . . . . . . . . . . . . . 444
12.2 Covering number estimates . . . . . . . . . . . . . . . . . . . . . . . . . . 445
12.2.1 Entropy quantities . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
12
Contents
13
Contents
Bibliography 559
14
Introduction
Very roughly speaking, the field deep learning can be divided into three subfields, deep
supervised learning, deep unsupervised learning, and deep reinforcement learning. Algorithms
in deep supervised learning often seem to be most accessible for a mathematical analysis.
In the following we briefly sketch in a simplified situation some ideas of deep supervised
learning.
Let d, M ∈ N = {1, 2, 3, . . . }, E ∈ C(Rd , R), x1 , x2 , . . . , xM +1 ∈ Rd , y1 , y2 , . . . , yM ∈ R
satisfy for all m ∈ {1, 2, . . . , M } that
ym = E(xm ). (1)
(x1 , y1 ) = (x1 , E(x1 )), (x2 , y2 ) = (x2 , E(x2 )), . . . , (xM , yM ) = (xM , E(xM )) ∈ Rd × R.
(2)
Observe that (1) ensures that L(E) = 0 and, in particular, we have that the unknown
function E : Rd → R in (1) above is a minimizer of the function
15
Contents
L = L ◦ ψ. (5)
Rd ∋ θ 7→ ψθ ∈ C(Rd , R) (7)
as the parametrization function associated to this set. For example, in the case d = 1 one
could think of (7) as the parametrization function associated to polynomials in the sense
that for all θ = (θ1 , . . . , θd ) ∈ Rd , x ∈ R it holds that
d−1
X
ψθ (x) = θk+1 xk (8)
k=0
Employing the parametrization function in (7), one can also reformulate the optimization
problem in (9) as the optimization problem of computing approximate minimizers of the
function "M #
1 X
Rd ∋ θ 7→ L(θ) = L(ψθ ) = 2
|ψθ (xm ) − ym | ∈ [0, ∞) (10)
M m=1
16
Contents
and this optimization problem now has the potential to be amenable for discrete numer-
ical computations. In the context of deep supervised learning, where one chooses the
parametrization function in (7) as deep ANN parametrizations, one would apply an SGD-
type optimization algorithm to the optimization problem in (10) to compute approximate
minimizers of (10). In Chapter 7 in Part III we present the most common variants of such
SGD-type optimization algorithms. If ϑ ∈ Rd is an approximate minimizer of (10) in the
sense that L(ϑ) ≈ inf θ∈Rd L(θ), one then considers ψϑ (xM +1 ) as an approximation
of the unknown output E(xM +1 ) of the (M + 1)-th input data xM +1 . We note that in deep
supervised learning algorithms one typically aims to compute an approximate minimizer
ϑ ∈ Rd of (10) in the sense that L(ϑ) ≈ inf θ∈Rd L(θ), which is, however, typically not a
minimizer of (10) in the sense that L(ϑ) = inf θ∈Rd L(θ) (cf. Section 9.14).
In (3) above we have set up an optimization problem for the learning problem by using
the standard mean squared error function to measure the loss. This mean squared error
loss function is just one possible example in the formulation of deep learning optimization
problems. In particular, in image classification problems other loss functions such as the
cross-entropy loss function are often used and we refer to Chapter 5 of Part III for a survey
of commonly used loss function in deep learning algorithms (see Section 5.4.2). We also refer
to Chapter 9 for convergence results in the above framework where the parametrization
function in (7) corresponds to fully-connected feedforward ANNs (see Section 9.14).
17
Contents
18
Part I
19
Chapter 1
Basics on ANNs
21
Chapter 1: Basics on ANNs
Input layer 1st hidden layer 2nd hidden layer (L − 1)-th hidden layer Output layer
···
(1st layer) (2nd layer) (3rd layer) (L-th layer) ((L + 1)-th layer)
1 1 ··· 1
1 2 2 ··· 2 1
2 3 3 ··· 3 2
.. 4 4 ··· 4 ..
. .
l0 .. .. .. .. lL
. . . .
l1 l2 ··· lL−1
22
1.1. Fully-connected feedforward ANNs (vectorized description)
Aθ,1
2,2 ((1, 2)) = (8, 6) (1.2)
Exercise 1.1.1. Let θ = (3, 1, −2, 1, −3, 0, 5, 4, −1, −1, 0) ∈ R11 . Specify Aθ,2
2,3 ((−1, 1, −1))
explicitly and prove that your result is correct (cf. Definition 1.1.1)!
23
Chapter 1: Basics on ANNs
and for every k ∈ {1, 2, . . . , L} let Ψk : Rlk → Rlk be a function. Then we denote by
NΨθ,l1 ,Ψ
0
2 ,...,ΨL
: Rl0 → RlL the function which satisfies for all x ∈ Rl0 that
θ, L−1 θ, L−2
P P
k=1 lk (lk−1 +1) k=1 lk (lk−1 +1)
NΨθ,l1 ,Ψ
0
2 ,...,ΨL
(x) = ΨL ◦ A lL ,lL−1 ◦ ΨL−1 ◦ A lL−1 ,lL−2 ◦ ...
θ,l (l0 +1)
l1 ,l0 (x) (1.5)
◦ Ψ1 ◦ Aθ,0
. . . ◦ Ψ2 ◦ Al2 ,l11
Example 1.1.4 (Example for Definition 1.1.3). Let θ = (1, −1, 2, −2, 3, −3, 0, 0, 1) ∈ R9
and let Ψ : R2 → R2 satisfy for all x = (x1 , x2 ) ∈ R2 that
Then
θ,1
(1.7)
NΨ,id R
(2) = 12
(cf. Definition 1.1.3).
Proof for Example 1.1.4. Note that (1.1), (1.5), and (1.6) assure that
θ,1
θ,4 θ,0
θ,4
1 2
NΨ,idR (2) = idR ◦A1,2 ◦ Ψ ◦ A2,1 (2) = A1,2 ◦ Ψ 2 +
−1 −2
(1.8)
4 4 4
= Aθ,4 = Aθ,4
1,2 ◦ Ψ 1,2 = 3 −3 + 0 = 12
−4 0 0
(cf. Definitions 1.1.1 and 1.1.3). The proof for Example 1.1.4 is thus complete.
Exercise 1.1.2. Let θ = (1, −1, 0, 0, 1, −1, 0) ∈ R7 and let Ψ : R2 → R2 satisfy for all
x = (x1 , x2 ) ∈ R2 that
24
1.1. Fully-connected feedforward ANNs (vectorized description)
Definition 1.1.3).
b) Prove or disprove the following statement: It holds that NΦ,Ψ
θ,2
(−1, 1) = (−4, −4)
(cf. Definition 1.1.3).
let Wk ∈ Rlk ×lk−1 , k ∈ {1, 2, . . . , L}, and bk ∈ Rlk , k ∈ {1, 2, . . . , L}, satisfy for all
k ∈ {1, 2, . . . , L} that
θvk−1 +1 θvk−1 +2 ... θvk−1 +lk−1
θv +l +1 θvk−1 +lk−1 +2 ... θvk−1 +2lk−1
k−1 k−1
(1.14)
θv +2l +1 θvk−1 +2lk−1 +2 ... θvk−1 +3lk−1
Wk = k−1 k−1
.. .. .. ..
. . . .
θvk−1 +(lk −1)lk−1 +1 θvk−1 +(lk −1)lk−1 +2 . . . θvk−1 +lk lk−1
| {z }
weight parameters
and (1.15)
bk = θvk−1 +lk lk−1 +1 , θvk−1 +lk lk−1 +2 , . . . , θvk−1 +lk lk−1 +lk ,
| {z }
bias parameters
25
Chapter 1: Basics on ANNs
Input layer 1st hidden layer 2nd hidden layer Output layer
(1st layer) (2nd layer) (3rd layer) (4th layer)
Figure 1.2: Graphical illustration of an ANN. The ANN has 2 hidden layers and
length L = 3 with 3 neurons in the input layer (corresponding to l0 = 3), 6 neurons
in the first hidden layer (corresponding to l1 = 6), 3 neurons in the second hidden
layer (corresponding to l2 = 3), and one neuron in the output layer (corresponding
to l3 = 1). In this situation we have an ANN with 39 weight parameters and 10 bias
parameters adding up to 49 parameters overall. The realization of this ANN is a
function from R3 to R.
and
θ,v
(ii) it holds for all k ∈ {1, 2, . . . , L}, x ∈ Rlk−1 that Alk ,lk−1
k−1
(x) = Wk x + bk
26
1.2. Activation functions
of fully-connected feedforward ANNs, cf. Definition 1.4.5 below for the use of activation
functions in the context of CNNs, cf. Definition 1.5.4 below for the use of activation functions
in the context of ResNets, and cf. Definitions 1.6.3 and 1.6.4 below for the use of activation
functions in the context of RNNs).
Mψ,d1 ,d2 ,...,dT : Rd1 ×d2 ×...×dT → Rd1 ×d2 ×...×dT (1.17)
the function which satisfies for all x = (xk1 ,k2 ,...,kT )(k1 ,k2 ,...,kT )∈(×Tt=1 {1,2,...,dt }) ∈ Rd1 ×d2 ×...×dT ,
y = (yk1 ,k2 ,...,kT )(k1 ,k2 ,...,kT )∈(×Tt=1 {1,2,...,dt }) ∈ Rd1 ×d2 ×...×dT with ∀ k1 ∈ {1, 2, . . . , d1 }, k2 ∈
{1, 2, . . . , d2 }, . . . , kT ∈ {1, 2, . . . , dT } : yk1 ,k2 ,...,kT = ψ(xk1 ,k2 ,...,kT ) that
(1.19)
A= 1 −1 , −2 2 , 3 −3
(1.20)
Mψ,3,1,3 (A) = 1 1 , 4 4 , 9 9
Proof for Example 1.2.2. Note that (1.18) establishes (1.20). The proof for Example 1.2.2
is thus complete.
and let ψ : R → R satisfy for all x ∈ R that ψ(x) = |x|. Specify Mψ,2,3 (A) and Mψ,2,2,2 (B)
explicitly and prove that your results are correct (cf. Definition 1.2.1)!
27
Chapter 1: Basics on ANNs
Specify NM θ,1
and θ,1
(1) explicitly and prove that your results are correct
f,3 ,M g,2
(1) NM g,2 ,M f,3
(cf. Definitions 1.1.3 and 1.2.1)!
..
.
..
.
28
1.2. Activation functions
Lemma 1.2.3 (Fully-connected feedforward ANN with one hidden layer). Let I, H ∈ N,
θ = (θ1 , θ2 , . . . , θHI+2H+1 ) ∈ RHI+2H+1 , x = (x1 , x2 , . . . , xI ) ∈ RI and let ψ : R → R be a
function. Then
" H I #
X
θ,I
(1.24)
P
NMψ,H ,idR (x) = θHI+H+k ψ xi θ(k−1)I+i + θHI+k + θHI+2H+1 .
k=1 i=1
29
Chapter 1: Basics on ANNs
2.0
1.5
1.0
0.5
0.0
2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0
0.5
14 s . set_zorder (0)
15
16 return ax
1 import numpy as np
2 import tensorflow as tf
3 import matplotlib . pyplot as plt
4 import plot_util
5
12 plt . savefig ( " ../../ plots / relu . pdf " , bbox_inches = ’ tight ’)
Rd = Mr,d (1.27)
and we call Rd the d-dimensional ReLU activation function (we call Rd the d-dimensional
rectifier function) (cf. Definitions 1.2.1 and 1.2.4).
30
1.2. Activation functions
Lemma 1.2.6 (An ANN with the ReLU activation function as the activation function).
Let W1 = w1 = 1, W2 = w2 = −1, b1 = b2 = B = 0. Then it holds for all x ∈ R that
W1 max{w1 x + b1 , 0} + W2 max{w2 x + b2 , 0} + B
= max{w1 x + b1 , 0} − max{w2 x + b2 , 0} = max{x, 0} − max{−x, 0} (1.29)
= max{x, 0} + min{x, 0} = x.
Exercise 1.2.3 (Real identity). Prove or disprove the PH following statement: There exist
d, H ∈ N, l1 , l2 , . . . , lH ∈ N, θ ∈ Rd with d ≥ 2l1 + 1 such that
l (l
k=2 k k−1 + 1) + lH +
for all x ∈ R it holds that
NRθ,1l ,Rl ,...,Rl ,idR (x) = x (1.30)
1 2 H
Lemma 1.2.7 (Real identity). Let θ = (1, −1, 0, 0, 1, −1, 0) ∈ R7 . Then it holds for all
x ∈ R that
NRθ,12 ,idR (x) = x (1.31)
Exercise 1.2.4 (Absolute value). Prove or disproveP the following statement: There exist
d, H ∈ N, l1 , l2 , . . . , lH ∈ N, θ ∈ Rd with d ≥ 2l1 + H
1 such that
l (l
k=2 k k−1 + 1) + lH +
for all x ∈ R it holds that
NRθ,1l ,Rl ,...,Rl ,idR (x) = |x| (1.32)
1 2 H
31
Chapter 1: Basics on ANNs
NRθ,k (1.38)
l ,Rl ,...,Rl ,idR
(x1 , x2 , . . . , xk ) = max{x1 , x2 , . . . , xk }
1 2 H
32
1.2. Activation functions
Exercise 1.2.11 (Hat function). Prove or disprove the following statement: There exist
d, l ∈ N, θ ∈ Rd with d ≥ 3l + 1 such that for all x ∈ R it holds that
1 : x≤2
x−1 : 2<x≤3
NRθ,1l ,idR (x) = (1.40)
5−x : 3<x≤4
1 : x>4
33
Chapter 1: Basics on ANNs
2.0
ReLU
(0,1)-clipping
1.5
1.0
0.5
0.0
2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0
0.5
1 import numpy as np
2 import tensorflow as tf
3 import matplotlib . pyplot as plt
4 import plot_util
5
6 ax = plot_util . setup_axis (( -2 ,2) , ( -.5 ,2) )
7
8 x = np . linspace ( -2 , 2 , 100)
9
10 ax . plot (x , tf . keras . activations . relu ( x ) , linewidth =3 , label = ’ ReLU ’)
11 ax . plot (x , tf . keras . activations . relu (x , max_value =1) ,
12 label = ’ (0 ,1) - clipping ’)
13 ax . legend ()
14
15 plt . savefig ( " ../../ plots / clipping . pdf " , bbox_inches = ’ tight ’)
34
1.2. Activation functions
and we call Cu,v,d the d-dimensional (u, v)-clipping activation function (cf. Definitions 1.2.1
and 1.2.9).
4.0
ReLU
softplus 3.5
3.0
2.5
2.0
1.5
1.0
0.5
0.0
4 3 2 1 0 1 2 3 4
0.5
1 import numpy as np
2 import tensorflow as tf
3 import matplotlib . pyplot as plt
4 import plot_util
5
6 ax = plot_util . setup_axis (( -4 ,4) , ( -.5 ,4) )
7
8 x = np . linspace ( -4 , 4 , 100)
9
35
Chapter 1: Basics on ANNs
The next result, Lemma 1.2.12 below, presents a few elementary properties of the
softplus function.
Lemma 1.2.12 (Properties of the softplus function). Let a be the softplus activation
function (cf. Definition 1.2.11). Then
Proof of Lemma 1.2.12. Observe that the fact that 2 ≤ exp(1) ensures that for all x ∈ [0, ∞)
it holds that
x = ln(exp(x)) ≤ ln(1 + exp(x)) = ln(exp(0) + exp(x))
≤ ln(exp(x) + exp(x)) = ln(2 exp(x)) ≤ ln(exp(1) exp(x)) (1.48)
= ln(exp(x + 1)) = x + 1.
Note that Lemma 1.2.12 ensures that s(0) = ln(2) = 0.693 . . . (cf. Definition 1.2.11).
In the next step we introduce the multidimensional version of the softplus function (cf.
Definitions 1.2.1 and 1.2.11 above).
A(x) = (ln(1 + exp(x1 )), ln(1 + exp(x2 )), . . . , ln(1 + exp(xd ))) (1.49)
36
1.2. Activation functions
Proof of Lemma 1.2.14. Throughout this proof, let a be the softplus activation function
(cf. Definition 1.2.11). Note that (1.18) and (1.47) ensure that for all x = (x1 , . . . , xd ) ∈ Rd
it holds that
Ma,d (x) = (ln(1 + exp(x1 )), ln(1 + exp(x2 )), . . . , ln(1 + exp(xd ))) (1.50)
(cf. Definition 1.2.1). The fact that A is the d-dimensional softplus activation function (cf.
Definition 1.2.13) if and only if A = Ma,d hence implies (1.49). The proof of Lemma 1.2.14
is thus complete.
Definition 1.2.15 (GELU activation function). We say that a is the GELU unit activation
function (we say that a is the GELU activation function) if and only if it holds that
a : R → R is the function from R to R which satisfies for all x ∈ R that
Z x
x z2
a(x) = √ exp(− 2 ) dz . (1.51)
2π −∞
3.0
ReLU
softplus 2.5
GELU
2.0
1.5
1.0
0.5
0.0
4 3 2 1 0 1 2 3
0.5
Figure 1.7 (plots/gelu.pdf): A plot of the GELU activation function, the ReLU
activation function, and the softplus activation function
1 import numpy as np
2 import tensorflow as tf
3 import matplotlib . pyplot as plt
4 import plot_util
5
6 ax = plot_util . setup_axis (( -4 ,3) , ( -.5 ,3) )
7
8 x = np . linspace ( -4 , 3 , 100)
37
Chapter 1: Basics on ANNs
9
10 ax . plot (x , tf . keras . activations . relu ( x ) , label = ’ ReLU ’)
11 ax . plot (x , tf . keras . activations . softplus ( x ) , label = ’ softplus ’)
12 ax . plot (x , tf . keras . activations . gelu ( x ) , label = ’ GELU ’)
13 ax . legend ()
14
15 plt . savefig ( " ../../ plots / gelu . pdf " , bbox_inches = ’ tight ’)
Lemma 1.2.16. Let x ∈ R and let a be the GELU activation function (cf. Definition 1.2.15).
Then the following two statements are equivalent:
Proof of Lemma 1.2.16. Note that (1.26) and (1.51) establish that ((i) ↔ (ii)). The proof
of Lemma 1.2.16 is thus complete.
Definition 1.2.17 (Multidimensional GELU unit activation function). Let d ∈ N and let a
be the GELU activation function (cf. Definition 1.2.15). we say that A is the d-dimensional
GELU activation function if and only if A = Ma,d (cf. Definition 1.2.1).
1 exp(x)
a(x) = = . (1.52)
1 + exp(−x) exp(x) + 1
1 import numpy as np
2 import tensorflow as tf
3 import matplotlib . pyplot as plt
4 import plot_util
5
6 ax = plot_util . setup_axis (( -3 ,3) , ( -.5 ,1.5) )
7
8 x = np . linspace ( -3 , 3 , 100)
9
10 ax . plot (x , tf . keras . activations . relu (x , max_value =1) ,
11 label = ’ (0 ,1) - clipping ’)
38
1.2. Activation functions
1.5
(0,1)-clipping
standard logistic 1.0
0.5
0.0
3 2 1 0 1 2 3
0.5
16 plt . savefig ( " ../../ plots / logistic . pdf " , bbox_inches = ’ tight ’)
39
Chapter 1: Basics on ANNs
This establishes item (ii). The proof of Proposition 1.2.20 is thus complete.
Proof of Lemma 1.2.21. Observe that (1.47) implies that for all x ∈ R it holds that
1
′
s (x) = exp(x) = l(x). (1.58)
1 + exp(x)
The fundamental theorem of calculus hence shows that for all w, x ∈ R with w ≤ x it holds
that Z x
l(y) dy = s(x) − s(w). (1.59)
w |{z}
≥0
Combining this with the fact that limw→−∞ s(w) = 0 establishes (1.57). The proof of
Lemma 1.2.21 is thus complete.
40
1.2. Activation functions
3.0
ReLU
GELU 2.5
swish
2.0
1.5
1.0
0.5
0.0
4 3 2 1 0 1 2 3
0.5
Figure 1.9 (plots/swish.pdf): A plot of the swish activation function, the GELU
activation function, and the ReLU activation function
1 import numpy as np
2 import tensorflow as tf
3 import matplotlib . pyplot as plt
4 import plot_util
5
6 ax = plot_util . setup_axis (( -4 ,3) , ( -.5 ,3) )
7
8 x = np . linspace ( -4 , 3 , 100)
9
Lemma 1.2.23 (Relation between the swish activation function and the logistic activation
function). Let β ∈ R, let s be the swish activation function with parameter 1, and let l be
the standard logistic activation function (cf. Definitions 1.2.18 and 1.2.22). Then it holds
for all x ∈ R that
s(x) = xl(βx). (1.61)
Proof of Lemma 1.2.23. Observe that (1.60) and (1.52) establish (1.61). The proof of
Lemma 1.2.23 is thus complete.
Definition 1.2.24 (Multidimensional swish activation functions). Let d ∈ N and let a be
the swish activation function with parameter 1 (cf. Definition 1.2.22). Then we say that A
is the d-dimensional swish activation function if and only if A = Ma,d (cf. Definition 1.2.1).
41
Chapter 1: Basics on ANNs
1.5
(-1,1)-clipping
standard logistic 1.0
tanh
0.5
0.0
3 2 1 0 1 2 3
0.5
1.0
1.5
Figure 1.10 (plots/tanh.pdf): A plot of the hyperbolic tangent, the (−1, 1)-clipping
activation function, and the standard logistic activation function
1 import numpy as np
2 import tensorflow as tf
3 import matplotlib . pyplot as plt
4 import plot_util
5
6 ax = plot_util . setup_axis (( -3 ,3) , ( -1.5 ,1.5) )
7
8 x = np . linspace ( -3 , 3 , 100)
9
10 ax . plot (x , tf . keras . activations . relu ( x +1 , max_value =2) -1 ,
11 label = ’ ( -1 ,1) - clipping ’)
12 ax . plot (x , tf . keras . activations . sigmoid ( x ) ,
13 label = ’ standard logistic ’)
14 ax . plot (x , tf . keras . activations . tanh ( x ) , label = ’ tanh ’)
15 ax . legend ()
16
17 plt . savefig ( " ../../ plots / tanh . pdf " , bbox_inches = ’ tight ’)
42
1.2. Activation functions
Lemma 1.2.27. Let a be the standard logistic activation function (cf. Definition 1.2.18).
Then it holds for all x ∈ R that
Proof of Lemma 1.2.27. Observe that (1.52) and (1.62) ensure that for all x ∈ R it holds
that
exp(2x) 2 exp(2x) − (exp(2x) + 1)
2 a(2x) − 1 = 2 −1=
exp(2x) + 1 exp(2x) + 1
exp(2x) − 1 exp(x)(exp(x) − exp(−x))
= = (1.64)
exp(2x) + 1 exp(x)(exp(x) + exp(−x))
exp(x) − exp(−x)
= = tanh(x).
exp(x) + exp(−x)
θ,1
(1.65)
NM a,l ,Ma,l2 ,...,Ma,lL−1 ,idR (x) = tanh(x)
1
43
Chapter 1: Basics on ANNs
tanh 1
softsign
0
4 2 0 2 4
1
1 import numpy as np
2 import tensorflow as tf
3 import matplotlib . pyplot as plt
4 import plot_util
5
6 ax = plot_util . setup_axis (( -2 ,2) , ( -.5 ,2) )
7
8 x = np . linspace ( -2 , 2 , 100)
44
1.2. Activation functions
2.0
ReLU
leaky ReLU
1.5
1.0
0.5
0.0
2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0
0.5
9
10 ax . plot (x , tf . keras . activations . relu ( x ) , linewidth =3 , label = ’ ReLU ’)
11 ax . plot (x , tf . keras . activations . relu (x , alpha =0.1) ,
12 label = ’ leaky ReLU ’)
13 ax . legend ()
14
15 plt . savefig ( " ../../ plots / leaky_relu . pdf " , bbox_inches = ’ tight ’)
Lemma 1.2.31. Let γ ∈ [0, 1] and let a : R → R be a function. Then a is the leaky ReLU
activation function with leak factor γ if and only if it holds for all x ∈ R that
Proof of Lemma 1.2.31. Note that the fact that γ ≤ 1 and (1.67) establish (1.68). The
proof of Lemma 1.2.31 is thus complete.
Lemma 1.2.32. Let u, β ∈ R, v ∈ (u, ∞), α ∈ (−∞, 0], let a1 be the softplus activation
function, let a2 be the GELU activation function, let a3 be the standard logistic activation
function, let a4 be the swish activation function with parameter β, let a5 be the softsign
activation function, and let l be the leaky ReLU activation function with leaky parameter γ
(cf. Definitions 1.2.11, 1.2.15, 1.2.18, 1.2.22, 1.2.28, and 1.2.30). Then
(i) it holds for all f ∈ {r, cu,v , tanh, a1 , a2 , . . . , a5 } that lim supx→−∞ |f ′ (x)| = 0 and
45
Chapter 1: Basics on ANNs
1 import numpy as np
2 import tensorflow as tf
3 import matplotlib . pyplot as plt
4 import plot_util
5
6 ax = plot_util . setup_axis (( -2 ,2) , ( -1 ,2) )
7
8 x = np . linspace ( -2 , 2 , 100)
9
10 ax . plot (x , tf . keras . activations . relu ( x ) , linewidth =3 , label = ’ ReLU ’)
11 ax . plot (x , tf . keras . activations . relu (x , alpha =0.1) , linewidth =2 ,
label = ’ leaky ReLU ’)
12 ax . plot (x , tf . keras . activations . elu ( x ) , linewidth =0.9 , label = ’ ELU ’)
13 ax . legend ()
14
15 plt . savefig ( " ../../ plots / elu . pdf " , bbox_inches = ’ tight ’)
46
1.2. Activation functions
2.0
ReLU
leaky ReLU
ELU 1.5
1.0
0.5
0.0
2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0
0.5
1.0
Figure 1.13 (plots/elu.pdf): A plot of the ELU activation function with asymptotic
−1, the leaky ReLU activation function with leak factor 1/10, and the ReLU activation
function
Lemma 1.2.35. Let γ ∈ (−∞, 0] and let a be the ELU activation function with asymptotic
γ (cf. Definition 1.2.34). Then
Proof of Lemma 1.2.35. Observe that (1.69) establishes (1.70). The proof of Lemma 1.2.35
is thus complete.
Definition 1.2.37 (RePU activation function). Let p ∈ N. Then we say that a is the RePU
activation function with power p if and only if it holds that a : R → R is the function from
R to R which satisfies for all x ∈ R that
47
Chapter 1: Basics on ANNs
3.0
ReLU
RePU
2.5
2.0
1.5
1.0
0.5
0.0
2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0
0.5
Figure 1.14 (plots/repu.pdf): A plot of the RePU activation function with power
2 and the ReLU activation function
1 import numpy as np
2 import tensorflow as tf
3 import matplotlib . pyplot as plt
4 import plot_util
5
6 ax = plot_util . setup_axis (( -2 ,2) , ( -.5 ,3) )
7 ax . set_ylim ( -.5 , 3)
8
9 x = np . linspace ( -2 , 2 , 100)
10
11 ax . plot (x , tf . keras . activations . relu ( x ) , linewidth =3 , label = ’ ReLU ’)
12 ax . plot (x , tf . keras . activations . relu ( x ) **2 , label = ’ RePU ’)
13 ax . legend ()
14
15 plt . savefig ( " ../../ plots / repu . pdf " , bbox_inches = ’ tight ’)
48
1.2. Activation functions
Definition 1.2.39 (Sine activation function). We say that a is the sine activation function
if and only if it holds that a : R → R is the function from R to R which satisfies for all
x ∈ R that
a(x) = sin(x). (1.72)
1
0
6 4 2 0 2 4 6
1
1 import numpy as np
2 import tensorflow as tf
3 import matplotlib . pyplot as plt
4 import plot_util
5
6 ax = plot_util . setup_axis (( -2* np . pi ,2* np . pi ) , ( -1.5 ,1.5) )
7
8 x = np . linspace ( -2* np . pi , 2* np . pi , 100)
9
10 ax . plot (x , np . sin ( x ) )
11
12 plt . savefig ( " ../../ plots / sine . pdf " , bbox_inches = ’ tight ’)
Definition 1.2.40 (Multidimensional sine activation functions). Let d ∈ N and let a be the
sine activation function (cf. Definition 1.2.39). Then we say that A is the d-dimensional
sine activation function if and only if A = Ma,d (cf. Definition 1.2.1).
49
Chapter 1: Basics on ANNs
if and only if it holds that a : R → R is the function from R to R which satisfies for all
x ∈ R that (
1 :x≥0
a(x) = 1[0,∞) (x) = (1.73)
0 : x < 0.
1.5
Heaviside
standard logistic 1.0
0.5
0.0
3 2 1 0 1 2 3
0.5
1 import numpy as np
2 import tensorflow as tf
3 import matplotlib . pyplot as plt
4 import plot_util
5
6 ax = plot_util . setup_axis (( -3 ,3) , ( -.5 ,1.5) )
7
8 x = np . linspace ( -3 , 3 , 100)
9
10 ax . plot ( x [0:50] , [0]*50 , ’ C0 ’)
11 ax . plot ( x [50:100] , [1]*50 , ’ C0 ’ , label = ’ Heaviside ’)
12 ax . plot (x , tf . keras . activations . sigmoid ( x ) , ’ C1 ’ ,
13 label = ’ standard logistic ’)
14 ax . legend ()
15
16 plt . savefig ( " ../../ plots / heaviside . pdf " , bbox_inches = ’ tight ’)
50
1.3. Fully-connected feedforward ANNs (structured description)
tum
Proof of Lemma 1.2.44. Observe that (1.74) demonstrates that for all x = (x1 , x2 , . . . , xd ) ∈
Rd it holds that
Xd Xd Pd
exp(xk )
Ak (x) = Pd
exp(xk )
= Pk=1
d = 1. (1.76)
( i=1 exp(xi )) i=1 exp(xi )
k=1 k=1
51
Chapter 1: Basics on ANNs
L
×
(Rlk ×lk−1 × Rlk ) ⊆ N we denote by
for every L ∈ N, l0 , l1 , . . . , lL ∈ N, Φ ∈ k=1
P(Φ), L(Φ), I(Φ), O(Φ) ∈ N, H(Φ) ∈ N0 the numbers given by
× L lk ×lk−1 lk
for every n ∈ N0 , L ∈ N, l0 , l1 , . . . , lL ∈ N, Φ ∈ k=1
(R × R ) ⊆ N we denote by
Dn (Φ) ∈ N0 the number given by
(
ln : n ≤ L
Dn (Φ) = (1.79)
0 : n > L,
×
and for every L ∈ N, l0 , l1 , . . . , lL ∈ N, Φ = ((W1 , B1 ), . . . , (WL , BL )) ∈
L
k=1
(Rlk ×lk−1 ×
Rlk ) ⊆ N, n ∈ {1, 2, . . . , L} we denote by Wn,Φ ∈ Rln ×ln−1 , Bn,Φ ∈ Rln the matrix and the
vector given by
Wn,Φ = Wn and Bn,Φ = Bn . (1.81)
Φ∈N (1.82)
52
1.3. Fully-connected feedforward ANNs (structured description)
×L
(Rlk ×lk−1
× Rlk )
S S
Φ∈N= L∈N (l0 ,l1 ,...,lL )∈NL+1 k=1
× L
(Rlk ×lk−1 (1.85)
Φ∈ k=1
× Rlk ) .
RN
a (Φ) : R
I(Φ)
→ RO(Φ) (1.89)
the function which satisfies for all x0 ∈ RD0 (Φ) , x1 ∈ RD1 (Φ) , . . . , xL(Φ) ∈ RDL(Φ) (Φ) with
∀ k ∈ {1, 2, . . . , L(Φ)} : xk = Ma1(0,L(Φ)) (k)+idR 1{L(Φ)} (k),Dk (Φ) (Wk,Φ xk−1 + Bk,Φ ) (1.90)
that
(RN
a (Φ))(x0 ) = xL(Φ) (1.91)
and we call RNa (Φ) the realization function of the fully-connected feedforward ANN Φ with
activation function a (we call RNa (Φ) the realization of the fully-connected feedforward ANN
Φ with activation a) (cf. Definition 1.2.1).
53
Chapter 1: Basics on ANNs
satisfy
−1 2 0
1 3
W1 = , B1 = , W2 = 3 −4, B2 = 0,
(1.93)
2 4
−5 6 0
and (1.94)
W3 = −1 1 −1 , B3 = −4 .
Prove or disprove the following statement: It holds that
(RN
r (Φ))(−1) = 0 (1.95)
RN
tanh (Φ) = a (1.96)
54
1.3. Fully-connected feedforward ANNs (structured description)
1 import torch
2 import torch . nn as nn
3
4
5 class Fu llyConne ctedANN ( nn . Module ) :
6 def __init__ ( self ) :
7 super () . __init__ ()
8 # Define the layers of the network in terms of Modules .
9 # nn . Linear (3 , 20) represents an affine function defined
10 # by a 20 x3 weight matrix and a 20 - dimensional bias vector .
11 self . affine1 = nn . Linear (3 , 20)
12 # The torch . nn . ReLU class simply wraps the
13 # torch . nn . functional . relu function as a Module .
14 self . activation1 = nn . ReLU ()
15 self . affine2 = nn . Linear (20 , 30)
16 self . activation2 = nn . ReLU ()
17 self . affine3 = nn . Linear (30 , 1)
18
55
Chapter 1: Basics on ANNs
1 import torch
2 import torch . nn as nn
3
4 # A Module whose forward method is simply a composition of Modules
5 # can be represented using the torch . nn . Sequential class
6 model = nn . Sequential (
7 nn . Linear (3 , 20) ,
8 nn . ReLU () ,
9 nn . Linear (20 , 30) ,
10 nn . ReLU () ,
11 nn . Linear (30 , 1) ,
12 )
13
14 # Prints a summary of the model architecture
15 print ( model )
16
17 x0 = torch . Tensor ([1 , 2 , 3])
18 print ( model ( x0 ) )
56
1.3. Fully-connected feedforward ANNs (structured description)
..
.
θ(Pk−1 li (li−1 +1))+lk lk−1 +lk
i=1
θ( Pk−1
li (li−1 +1))+1 θ(Pk−1 li (li−1 +1))+2 ··· θ(Pk−1 li (li−1 +1))+lk−1
θ Pk−1i=1 i=1
θ(Pk−1 li (li−1 +1))+lk−1 +2 ···
i=1
θ(Pk−1 li (li−1 +1))+2lk−1
( i=1 li (li−1 +1))+lk−1 +1 i=1 i=1
θ(Pk−1 li (li−1 +1))+2lk−1 +1 θ(Pk−1 li (li−1 +1))+2lk−1 +2 ··· θ( k−1 li (li−1 +1))+3lk−1
P
i=1 i=1 i=1
.. .. .. ..
. . . .
θ( k−1 li (li−1 +1))+(lk −1)lk−1 +1 θ( k−1 li (li−1 +1))+(lk −1)lk−1 +2 · · ·
P P θ(Pk−1 li (li−1 +1))+lk lk−1
i=1 i=1 i=1
(1.97)
Proof of Lemma 1.3.6. Observe that (1.97) establishes (1.98). The proof of Lemma 1.3.6
is thus complete.
57
Chapter 1: Basics on ANNs
Proof of Lemma 1.3.7. Observe that (1.97) establishes (1.99). The proof of Lemma 1.3.7 is
thus complete.
Proof of Lemma 1.3.8. Note that (1.97) implies (1.100). The proof of Lemma 1.3.8 is thus
complete.
Exercise 1.3.3. Prove or disprove the following statement: The function T is injective (cf.
Definition 1.3.5).
Exercise 1.3.4. Prove or disprove the following statement: The function T is surjective (cf.
Definition 1.3.5).
Exercise 1.3.5. Prove or disprove the following statement: The function T is bijective (cf.
Definition 1.3.5).
Note that (1.97) shows that for all k ∈ {1, 2, . . . , L}, x ∈ Rlk−1 it holds that
Pk−1
T (Φ), li (li−1 +1)
Wk,Φ x + Bk,Φ = Alk ,lk−1 i=1
(x) (1.103)
58
1.4. Convolutional ANNs (CNNs)
(cf. Definitions 1.1.1 and 1.3.5). This demonstrates that for all x0 ∈ Rl0 , x1 ∈ Rl1 , . . . ,
xL−1 ∈ RlL−1 with ∀ k ∈ {1, 2, . . . , L − 1} : xk = Ma,lk (Wk,Φ xk−1 + Bk,Φ ) it holds that
x 0 :L=1
T (Φ), L−2
P
l (l +1)
xL−1 = i=1 i i−1
Ma,lL−1 ◦ AlL−1 ,lL−2 (1.104)
T (Φ),
PL−3
l (l +1) T (Φ),0 : L > 1
i=1 i i−1
◦M ◦A
a,lL−2 lL−2 ,lL−3 ◦ ... ◦ M ◦ A
a,l1 (x )
l1 ,l0 0
(cf. Definition 1.2.1). This, (1.103), (1.5), and (1.91) show that for all x0 ∈ Rl0 , x1 ∈
Rl1 , . . . , xL ∈ RlL with ∀ k ∈ {1, 2, . . . , L} : xk = Ma1(0,L) (k)+idR 1{L} (k),lk (Wk,Φ xk−1 + Bk,Φ ) it
holds that
T (Φ), L−1
P
N
l (l +1)
Ra (Φ) (x0 ) = xL = WL,Φ xL−1 + BL,Φ = AlL ,lL−1 i=1 i i−1 (xL−1 )
NidT (Φ),l0 (x0 ) (1.105)
:L=1
RlL
=
N T (Φ),l0
Ma,l ,Ma,l ,...,Ma,l ,id l (x0 ) : L > 1
1 2 L−1 R L
(cf. Definitions 1.1.3 and 1.3.4). The proof of Proposition 1.3.9 is thus complete.
59
Chapter 1: Basics on ANNs
for applications of CNNs to audio processing, and we refer, for example, to [46, 105, 236,
348, 408, 440] for applications of CNNs to time series analysis. Finally, for approximation
results for feedforward CNNs we refer, for instance, to Petersen & Voigtländer [334] and
the references therein.
dt = at − wt + 1. (1.106)
Then we denote by A ∗ W = ((A ∗ W )i1 ,i2 ,...,iT )(i1 ,i2 ,...,iT )∈(×Tt=1 {1,2,...,dt }) ∈ Rd1 ×d2 ×...×dT the
tensor which satisfies for all i1 ∈ {1, 2, . . . , d1 }, i2 ∈ {1, 2, . . . , d2 }, . . . , iT ∈ {1, 2, . . . , dT }
that
w1 X
X w2 wT
X
(A ∗ W )i1 ,i2 ,...,iT = ··· Ai1 −1+r1 ,i2 −1+r2 ,...,iT −1+rT Wr1 ,r2 ,...,rT . (1.107)
r1 =1 r2 =1 rT =1
C=
!
L
× (R
[ [ [
ck,1 ×ck,2 ×...×ck,T lk ×lk−1
× Rlk . (1.108)
)
T,L∈N l0 ,l1 ,...,lL ∈N (ck,t )(k,t)∈{1,2,...,L}×{1,2,...,T } ⊆N k=1
Definition 1.4.3 (Feedforward CNNs). We say that Φ is a feedforward CNN if and only if
it holds that
Φ∈C (1.109)
(cf. Definition 1.4.2).
Idi11,i,d22,...,i
,...,dT
T
= 1. (1.110)
60
1.4. Convolutional ANNs (CNNs)
and
that
(RC
a (Φ))(x0 ) = xL (1.114)
and we call RC a (Φ) the realization function of the feedforward CNN Φ with activation
function a (we call RCa (Φ) the realization of the feedforward CNN Φ with activation a) (cf.
Definitions 1.2.1, 1.4.1, 1.4.2, and 1.4.4).
1 import torch
2 import torch . nn as nn
3
4
5 class ConvolutionalANN ( nn . Module ) :
6 def __init__ ( self ) :
7 super () . __init__ ()
8 # The convolutional layer defined here takes any tensor of
9 # shape (1 , n , m ) [ a single input ] or (N , 1 , n , m ) [ a batch
10 # of N inputs ] where N , n , m are natural numbers satisfying
11 # n >= 3 and m >= 3.
12 self . conv1 = nn . Conv2d (
13 in_channels =1 , out_channels =5 , kernel_size =(3 , 3)
14 )
61
Chapter 1: Basics on ANNs
× (Rck,1 ×ck,2 ×...×ck,T )lk ×lk−1 × Rlk = (R2×2 )2×1 × R2 × (R1×1 )1×2 × R1
Φ∈
k=1
(1.115)
satisfy
0 0
0 0 1
(1.116)
Φ= ,
, −2 2 , 3
.
1 0 −1
0 1
62
1.4. Convolutional ANNs (CNNs)
Then
1 2 3
11 15
C
(1.117)
Rr (Φ) 4 5 6 =
23 27
7 8 9
(cf. Definitions 1.2.4 and 1.4.5).
Proof for Example 1.4.6. Throughout this proof, let x0 ∈ R3×3 , x1 = (x1,1 , x1,2 ) ∈ (R2×2 )2 ,
x2 ∈ R2×2 with satisfy that
1 2 3
0 0
x0 = 4 5 6, 2,2
x1,1 = Mr,2×2 I + x0 ∗ , (1.118)
0 0
7 8 9
1 0
2,2
x1,2 = Mr,2×2 (−1)I + x0 ∗ , (1.119)
0 1
and x2 = MidR ,2×2 3I2,2 + x1,1 ∗ −2 + x1,2 ∗ 2 . (1.120)
Note that (1.114), (1.116), (1.118), (1.119), and (1.120) imply that
1 2 3
RC 4 5 6 = RC (1.121)
r (Φ) r (Φ) (x0 ) = x2 .
7 8 9
Next observe that (1.118) ensures that
2,2 0 0 1 1 0 0
x1,1 = Mr,2×2 I + x0 ∗ = Mr,2×2 +
0 0 1 1 0 0
(1.122)
1 1 1 1
= Mr,2×2 = .
1 1 1 1
Furthermore, note that (1.119) assures that
2,2 1 0 −1 −1 6 8
x1,2 = Mr,2×2 (−1)I + x0 ∗ = Mr,2×2 +
0 1 −1 −1 12 14
(1.123)
5 7 5 7
= Mr,2×2 = .
11 13 11 13
Moreover, observe that this, (1.122), and (1.120) demonstrate that
x2 = MidR ,2×2 3I2,2 + x1,1 ∗ −2 + x1,2 ∗ 2
2,2 1 1 5 7
= MidR ,2×2 3I + ∗ −2 + ∗ 2
1 1 11 13
(1.124)
3 3 −2 −2 10 14
= MidR ,2×2 + +
3 3 −2 −2 22 26
11 15 11 15
= MidR ,2×2 = .
23 27 23 27
63
Chapter 1: Basics on ANNs
This and (1.121) establish (1.117). The proof for Example 1.4.6 is thus complete.
1 import torch
2 import torch . nn as nn
3
4
5 model = nn . Sequential (
6 nn . Conv2d ( in_channels =1 , out_channels =2 , kernel_size =(2 , 2) ) ,
7 nn . ReLU () ,
8 nn . Conv2d ( in_channels =2 , out_channels =1 , kernel_size =(1 , 1) ) ,
9 )
10
11 with torch . no_grad () :
12 model [0]. weight . set_ (
13 torch . Tensor ([[[[0 , 0] , [0 , 0]]] , [[[1 , 0] , [0 , 1]]]])
14 )
15 model [0]. bias . set_ ( torch . Tensor ([1 , -1]) )
16 model [2]. weight . set_ ( torch . Tensor ([[[[ -2]] , [[2]]]]) )
17 model [2]. bias . set_ ( torch . Tensor ([3]) )
18
19 x0 = torch . Tensor ([[[1 , 2 , 3] , [4 , 5 , 6] , [7 , 8 , 9]]])
20 print ( model ( x0 ) )
satisfy
W1,1,1 = (1, −1), W1,2,1 = (2, −2), W1,3,1 = (−3, 3), (B1,n )n∈{1,2,3} = (1, 2, 3), (1.126)
W2,1,1 = (1, −1, 1), W2,1,2 = (2, −2, 2), W2,1,3 = (−3, 3, −3), and B2,1 = −2 (1.127)
(RC
r (Φ))(v) (1.128)
explicitly and prove that your result is correct (cf. Definitions 1.2.4 and 1.4.5)!
64
1.4. Convolutional ANNs (CNNs)
satisfy
W1,3,1 = (−3, −3, 3), (B1,n )n∈{1,2,3} = (3, −2, −1), (1.131)
W2,1,1 = (2, −1), W2,1,2 = (−1, 2), W2,1,3 = (−1, 0), and B2,1 = −2 (1.132)
and let v ∈ R9 satisfy v = (1, −1, 1, −1, 1, −1, 1, −1, 1). Specify
(RC
r (Φ))(v) (1.133)
explicitly and prove that your result is correct (cf. Definitions 1.2.4 and 1.4.5)!
Exercise 1.4.3. Prove or disprove the following statement: For every a ∈ C(R, R), Φ ∈ N
there exists Ψ ∈ C such that for all x ∈ RI(Φ) it holds that RI(Φ) ⊆ Domain(RC
a (Ψ)) and
(RC N
a (Ψ))(x) = (Ra (Φ))(x) (1.134)
d
(1.135)
P
⟨x, y⟩ = xi yi .
i=1
65
Chapter 1: Basics on ANNs
66
1.5. Residual ANNs (ResNets)
Proof of Lemma 1.5.3. Throughout this proof, for all sets A and B let F (A, B) be the set
of all function from A to B. Note that
×
# (r,k)∈S Rlk ×lr = # f ∈ F S, S(r,k)∈S Rlk ×lr : (∀ (r, k) ∈ S : f (r, k) ∈ Rlk ×lr ) .
(1.140)
This and the fact that for all sets B it holds that #(F (∅, B)) = 1 ensure that
×
# (r,k)∈∅ Rlk ×lr = #(F (∅, ∅)) = 1. (1.141)
Next note that (1.140) assures that for all (R, K) ∈ S it holds that
×
# (r,k)∈S Rlk ×lr ≥ # F {(R, K)}, RlK ×lR = ∞. (1.142)
Combining this and (1.141) establishes (1.139). The proof of Lemma 1.5.3 is thus complete.
RR l0
a (Φ) : R → R
lL
(1.143)
the function which satisfies for all x0 ∈ Rl0 , x1 ∈ Rl1 , . . . , xL ∈ RlL with
∀ k ∈ {1, 2, . . . , L} :
xk = Ma1(0,L) (k)+idR 1{L} (k),lk (Wk xk−1 + Bk + r∈N0 ,(r,k)∈S Vr,k xr ) (1.144)
P
that
(RR
a (Φ))(x0 ) = xL (1.145)
and we call RR a (Φ) the realization function of the fully-connected ResNet Φ with activation
function a (we call RR a (Φ) the realization of the fully-connected ResNet Φ with activation
a) (cf. Definitions 1.2.1 and 1.5.1).
67
Chapter 1: Basics on ANNs
Definition 1.5.5 (Identity matrices). Let d ∈ N. Then we denote by Id ∈ Rd×d the identity
matrix in Rd×d .
1 import torch
2 import torch . nn as nn
3
4 class ResidualANN ( nn . Module ) :
5 def __init__ ( self ) :
6 super () . __init__ ()
7 self . affine1 = nn . Linear (3 , 10)
8 self . activation1 = nn . ReLU ()
9 self . affine2 = nn . Linear (10 , 20)
10 self . activation2 = nn . ReLU ()
11 self . affine3 = nn . Linear (20 , 10)
12 self . activation3 = nn . ReLU ()
13 self . affine4 = nn . Linear (10 , 1)
14
15 def forward ( self , x0 ) :
16 x1 = self . activation1 ( self . affine1 ( x0 ) )
17 x2 = self . activation2 ( self . affine2 ( x1 ) )
18 x3 = self . activation3 ( x1 + self . affine3 ( x2 ) )
19 x4 = self . affine4 ( x3 )
20 return x4
satisfy
1 0
(1.147)
W1 = 1 , B1 = 0 , W2 = , B2 = ,
2 1
1 0 0
(1.148)
W3 = , B3 = , W4 = 2 2 , and B4 = 1 ,
0 1 0
and let V = (Vr,k )(r,k)∈S ∈ × (r,k)∈S
Rlk ×lr satisfy
(1.149)
V0,4 = −1 .
68
1.5. Residual ANNs (ResNets)
Then
(RR
r (Φ, V ))(5) = 28 (1.150)
(cf. Definitions 1.2.4 and 1.5.4).
Proof for Example 1.5.6. Throughout this proof, let x0 ∈ R1 , x1 ∈ R1 , x2 ∈ R2 , x3 ∈ R2 ,
x4 ∈ R1 satisfy for all k ∈ {1, 2, 3, 4} that x0 = 5 and
(1.151)
P
xk = Mr1(0,4) (k)+idR 1{4} (k),lk (Wk xk−1 + Bk + r∈N0 ,(r,k)∈S Vr,k xr ).
Observe that (1.151) assures that
(RR
r (Φ, V ))(5) = x4 . (1.152)
Next note that (1.151) ensures that
x1 = Mr,1 (W1 x0 + B1 ) = Mr,1 (5), (1.153)
1 0 5 5
(1.154)
x2 = Mr,2 (W2 x1 + B2 ) = Mr,1 5 + = Mr,1 = ,
2 1 11 11
1 0 5 0 5 5
x3 = Mr,2 (W3 x2 + B3 ) = Mr,1 + = Mr,1 = , (1.155)
0 1 11 0 11 11
and x4 = Mr,1 (W4 x3 + B4 + V0,4 x0 )
(1.156)
5
= Mr,1 2 2 + 1 + −1 5 = Mr,1 (28) = 28.
11
This and (1.152) establish (1.150). The proof for Example 1.5.6 is thus complete.
Exercise 1.5.1. Let l0 = 1, l1 = 2, l2 = 3, l3 = 1, S = {(0, 3), (1, 3)}, let
× 3
(Rlk ×lk−1 (1.157)
Φ = ((W1 , B1 ), (W2 , B2 ), (W3 , B3 )) ∈ k=1
× Rlk )
satisfy
−1 2 0
1 3
W1 = , B1 = ,W2 = 3 −4, B2 = 0, (1.158)
2 4
−5 6 0
and (1.159)
W3 = −1 1 −1 , B3 = −4 ,
and let V = (Vr,k )(r,k)∈S ∈ ×
(r,k)∈S
Rlk ×lr satisfy
and (1.160)
V0,3 = 1 V1,3 = 3 −2 .
Prove or disprove the following statement: It holds that
(RR
r (Φ, V ))(−1) = 0 (1.161)
(cf. Definitions 1.2.4 and 1.5.4).
69
Chapter 1: Basics on ANNs
and we call Rf,T,i the T -times unrolled function f with initial information I.
Definition 1.6.2 (Description of RNNs). Let X, Y, I be sets, let d, T ∈ N, θ ∈ Rd , I ∈ I,
and let N = (Nϑ )ϑ∈Rd : Rd × X × I → Y × I be a function. Then we call R the realization
function of the T -step unrolled RNN with RNN node N, parameter vector θ, and initial
70
1.6. Recurrent ANNs (RNNs)
information I (we call R the realization of the T -step unrolled RNN with RNN node N,
parameter vector θ, and initial information I) if and only if
(i) it holds that r is the realization of the simple fully-connected RNN node with parameters
θ and activations Ψ1 and Ψ2 and
R = Rr,T,I (1.165)
71
Chapter 1: Basics on ANNs
(i) It holds that R is the realization of the T -step unrolled simple fully-connected RNN
with parameter vector θ, activations Ψ1 and Ψ2 , and initial information I (cf. Defini-
tion 1.6.4).
(ii) It holds that R is the realization of the T -step unrolled RNN with RNN node N,
parameter vector θ, and initial information I (cf. Definition 1.6.2).
Proof of Lemma 1.6.5. Observe that (1.163) and (1.165) ensure that ((i) ↔ (ii)). The proof
of Lemma 1.6.5 is thus complete.
Exercise 1.6.1. For every T ∈ N, α ∈ (0, 1) let RT,α be the realization of the T -step
unrolled simple fully-connected RNN with parameter vector (1, 0, 0, α, 0, 1 − α, 0, 0, −1, 1, 0),
activations Mr,2 and idR , and initial information (0, 0) (cf. Definitions 1.2.1, 1.2.4, and
1.6.4). For every T ∈ N, α ∈ (0, 1) specify RT,α (1, 1, . . . , 1) explicitly and prove that your
result is correct!
72
1.7. Further types of ANNs
• We refer, for instance, to [49, 198, 200, 253, 356] for foundational references introducing
and refining the idea of autoencoders,
• we refer, for example, to [402, 403, 416] for so-called denoising autoencoders which
add random pertubation to the input data in the training of autoencoders,
• we refer, for instance, to [51, 107, 246] for so-called variational autoencoders which
use techniques from bayesian statistics in the training of autoencoders,
• we refer, for example, [294, 349] for autoencoders involving convolutions, and
• we refer, for instance, [118, 292] for adversarial autoencoders which combine the
principles of autoencoders with the paradigm of generative adversarial networks (see
Goodfellow et al. [165]).
73
Chapter 1: Basics on ANNs
through the information state passed on from the previous processing step of the RNN.
Consequently, it can be hard for RNNs to learn to understand long-term dependencies in
the input sequence. In Section 1.6.3 above, we briefly discussed the LSTM architecture for
RNNs which is an architecture for RNNs aimed at giving such RNNs the capacity to indeed
learn to understand such long-term dependencies.
Another approach in the literature to design ANN architectures which process sequential
data and are capable to efficiently learn to understand long-term dependencies in data
sequences is called the attention mechanism. Very roughly speaking, in the context of
sequences of the data, the attention mechanism aims to give ANNs the capacity to "pay
attention" to selected parts of the entire input sequence when they are processing a data
point of the sequence. The idea for using attention mechanisms in ANNs was first introduced
in Bahdanau et al. [11] in the context of RNNs trained for machine translation. In this
context the proposed ANN architecture still processes the input sequence sequentially,
however past information is not only available through the information state from the
previous processing step, but also through the attention mechanism, which can directly
extract information from data points far away from the data point being processed.
Likely the most famous ANNs based on the attention mechanism do however not involve
any recurrent elements and have been named Transfomer ANNs by the authors of the
seminal paper Vaswani et al. [397] called "Attention is all you need". Roughly speaking,
Transfomer ANNs are designed to process sequences of data by considering the entire input
sequence at once and relying only on the attention mechanism to understand dependencies
between the data points in the sequence. Transfomer ANNs are the basis for many recently
very successful large language models (LLMs), such as, generative pre-trained transformers
(GPTs) in [54, 320, 341, 342] which are the models behind the famous ChatGPT application,
Bidirectional Encoder Representations from Transformers (BERT) models in Devlin et
al. [104], and many others (cf., for example, [91, 267, 343, 418, 422] and the references
therein).
Beyond the NLP applications for which Transformers and attention mechanisms have
been introduced, similar ideas have been employed in several other areas, such as, computer
vision (cf., for instance, [109, 240, 278, 404]), protein structure prediction (cf., for example,
[232]), multimodal learning (cf., for instance, [283]), and long sequence time-series forecasting
(cf., for example, [441]). Moreover, we refer, for instance, to [81, 288], [157, Chapter 17],
and [164, Section 12.4.5.1] for explorations and explanations of the attention mechanism in
the literature.
74
1.7. Further types of ANNs
for example, West [411] for an introduction on graphs). As a consequence, many ANN
architectures which can process graphs as inputs, so-called graph neural networks (GNNs),
have been introduced in the literature.
• We refer, for instance, to [362, 415, 439, 442] for overview articles on GNNs,
• we refer, for example, to [166, 366] for foundational articles for GNNs,
• we refer, for instance, to [399, 426] for applications of attention mechanisms (cf.
Section 1.7.2 above) to GNNs,
• we refer, for example, to [55, 95, 412, 424] for GNNs involving convolutions on graphs,
and
• we refer, for instance, to [16, 151, 361, 368, 414] for applications of GNNs to problems
from the natural sciences.
75
Chapter 1: Basics on ANNs
392, 406, 413, 432] for extensions and theoretical results on deepONets. For a comparison
between deepONets and FNOs we refer, for example, to Lu et al. [285].
A further natural approach is to employ CNNs (see Section 1.4) to develop neural
operator architectures. We refer, for instance, to [185, 192, 244, 350, 443] for such CNN-
based neural operators. Finally, we refer, for example, to [67, 94, 98, 135, 136, 227, 273,
277, 301, 344, 369, 419] for further neural operator architectures and theoretical results for
neural operators.
76
Chapter 2
ANN calculus
In this chapter we review certain operations that can be performed on the set of fully-
connected feedforward ANNs such as compositions (see Section 2.1), paralellizations (see
Section 2.2), scalar multiplications (see Section 2.3), and sums (see Section 2.4) and thereby
review an appropriate calculus for fully-connected feedforward ANNsṪhe operations and
the calculus for fully-connected feedforward ANNs presented in this chapter will be used in
Chapters 3 and 4 to establish certain ANN approximation results.
In the literature such operations on ANNs and such kind of calculus on ANNs has been
used in many research articles such as [128, 159, 180, 181, 184, 228, 321, 329, 333] and the
references therein. The specific presentation of this chapter is based on Grohs et al. [180,
181].
the function which satisfies for all Φ, Ψ ∈ N, k ∈ {1, 2, . . . , L(Φ) + L(Ψ) − 1} with
I(Φ) = O(Ψ) that L(Φ • Ψ) = L(Φ) + L(Ψ) − 1 and
(Wk,Ψ , Bk,Ψ )
: k < L(Ψ)
(Wk,Φ•Ψ , Bk,Φ•Ψ ) = (W1,Φ WL(Ψ),Ψ , W1,Φ BL(Ψ),Ψ + B1,Φ ) : k = L(Ψ) (2.2)
(Wk−L(Ψ)+1,Φ , Bk−L(Ψ)+1,Φ ) : k > L(Ψ)
77
Chapter 2: ANN calculus
D(Φ • Ψ) = (D0 (Ψ), D1 (Ψ), . . . , DH(Ψ) (Ψ), D1 (Φ), D2 (Φ), . . . , DL(Φ) (Φ)), (2.3)
and
I(Ψ)
(v) it holds for all a ∈ C(R, R) that RN
a (Φ • Ψ) ∈ C(R , RO(Φ) ) and
RN N N
a (Φ • Ψ) = [Ra (Φ)] ◦ [Ra (Ψ)] (2.7)
∀ k ∈ {1, 2, . . . , L} : xk = Ma1(0,L) (k)+idR 1{L} (k),Dk (Φ•Ψ) (Wk,Φ•Ψ xk−1 + Bk,Φ•Ψ ) . (2.8)
Note that the fact that L(Φ • Ψ) = L(Φ) + L(Ψ) − 1 and the fact that for all Θ ∈ N it holds
that H(Θ) = L(Θ) − 1 establish items (ii) and (iii). Observe that item (iii) in Lemma 1.3.3
and (2.2) show that for all k ∈ {1, 2, . . . , L} it holds that
Dk (Ψ)×Dk−1 (Ψ)
R
: k < L(Ψ)
Wk,Φ•Ψ ∈ R D1 (Φ)×DL(Ψ)−1 (Ψ)
: k = L(Ψ) (2.9)
Dk−L(Ψ)+1 (Φ)×Dk−L(Ψ) (Φ)
R : k > L(Ψ).
78
2.1. Compositions of fully-connected feedforward ANNs
This, item (iii) in Lemma 1.3.3, and the fact that H(Ψ) = L(Ψ) − 1 ensure that for all
k ∈ {0, 1, . . . , L} it holds that
(
Dk (Ψ) : k ≤ H(Ψ)
Dk (Φ • Ψ) = (2.10)
Dk−L(Ψ)+1 (Φ) : k > H(Ψ).
This proves item (iv). Observe that (2.10) and item (ii) in Lemma 1.3.3 ensure that
RN
a (Φ • Ψ) ∈ C(R
I(Φ•Ψ)
, RO(Φ•Ψ) ) = C(RI(Ψ) , RO(Φ) ). (2.13)
Next note that (2.2) implies that for all k ∈ N ∩ (1, L(Φ) + 1) it holds that
This and (2.10) ensure that for all a ∈ C(R, R), x = (x0 , x1 , . . . , xL ) ∈ Xa , k ∈ N∩(1, L(Φ)+
1) it holds that
79
Chapter 2: ANN calculus
Furthermore, observe that (2.2) and (2.10) show that for all a ∈ C(R, R), x = (x0 , x1 , . . . ,
xL ) ∈ Xa it holds that
Combining this and (2.15) proves that for all a ∈ C(R, R), x = (x0 , x1 , . . . , xL ) ∈ Xa it
holds that
(RNa (Φ))(WL(Ψ),Ψ xL(Ψ)−1 + BL(Ψ),Ψ ) = xL . (2.17)
Moreover, note that (2.2) and (2.10) imply that for all a ∈ C(R, R), x = (x0 , x1 , . . . , xL ) ∈
Xa , k ∈ N ∩ (0, L(Ψ)) it holds that
This proves that for all a ∈ C(R, R), x = (x0 , x1 , . . . , xL ) ∈ Xa it holds that
(RN
a (Ψ))(x0 ) = WL(Ψ),Ψ xL(Ψ)−1 + BL(Ψ),Ψ . (2.19)
Combining this with (2.17) demonstrates that for all a ∈ C(R, R), x = (x0 , x1 , . . . , xL ) ∈ Xa
it holds that
(RN N N
(2.20)
a (Φ)) (Ra (Ψ))(x0 ) = xL = Ra (Φ • Ψ) (x0 ).
This and (2.13) prove item (v). The proof of Proposition 2.1.2 is thus complete.
Proof of Lemma 2.1.3. Observe that the fact that for all Ψ1 , Ψ2 ∈ N with I(Ψ1 ) = O(Ψ2 )
it holds that L(Ψ1 • Ψ2 ) = L(Ψ1 ) + L(Ψ2 ) − 1 and the assumption that L(Φ2 ) = 1 ensure
that
L(Φ1 • Φ2 ) = L(Φ1 ) and L(Φ2 • Φ3 ) = L(Φ3 ) (2.22)
(cf. Definition 2.1.1). Therefore, we obtain that
80
2.1. Compositions of fully-connected feedforward ANNs
Next note that (2.22), (2.2), and the assumption that L(Φ2 ) = 1 imply that for all
k ∈ {1, 2, . . . , L(Φ1 )} it holds that
(
(W1,Φ1 W1,Φ2 , W1,Φ1 B1,Φ2 + B1,Φ1 ) : k = 1
(Wk,Φ1 •Φ2 , Bk,Φ1 •Φ2 ) = (2.24)
(Wk,Φ1 , Bk,Φ1 ) : k > 1.
This, (2.2), and (2.23) prove that for all k ∈ {1, 2, . . . , L(Φ1 ) + L(Φ3 ) − 1} it holds that
(Wk,(Φ1 •Φ2 )•Φ3 , Bk,(Φ1 •Φ2 )•Φ3 )
(Wk,Φ3 , Bk,Φ3 )
: k < L(Φ3 )
= (W1,Φ1 •Φ2 WL(Φ3 ),Φ3 , W1,Φ1 •Φ2 BL(Φ3 ),Φ3 + B1,Φ1 •Φ2 ) : k = L(Φ3 )
(Wk−L(Φ3 )+1,Φ1 •Φ2 , Bk−L(Φ3 )+1,Φ1 •Φ2 ) : k > L(Φ3 ) (2.25)
(Wk,Φ3 , Bk,Φ3 )
: k < L(Φ3 )
= (W1,Φ1 •Φ2 WL(Φ3 ),Φ3 , W1,Φ1 •Φ2 BL(Φ3 ),Φ3 + B1,Φ1 •Φ2 ) : k = L(Φ3 )
(Wk−L(Φ3 )+1,Φ1 , Bk−L(Φ3 )+1,Φ1 ) : k > L(Φ3 ).
Furthermore, observe that (2.2), (2.22), and (2.23) show that for all k ∈ {1, 2, . . . , L(Φ1 ) +
L(Φ3 ) − 1} it holds that
(Wk,Φ1 •(Φ2 •Φ3 ) , Bk,Φ1 •(Φ2 •Φ3 ) )
(Wk,Φ2 •Φ3 , Bk,Φ2 •Φ3 )
: k < L(Φ2 • Φ3 )
= (W1,Φ1 WL(Φ2 •Φ3 ),Φ2 •Φ3 , W1,Φ BL(Φ2 •Φ3 ),Φ2 •Φ3 + B1,Φ1 ) : k = L(Φ2 • Φ3 )
(Wk−L(Φ2 •Φ3 )+1,Φ1 , Bk−L(Φ2 •Φ3 )+1,Φ1 ) : k > L(Φ2 • Φ3 ) (2.26)
(Wk,Φ3 , Bk,Φ3 )
: k < L(Φ3 )
= (W1,Φ1 WL(Φ3 ),Φ2 •Φ3 , W1,Φ BL(Φ3 ),Φ2 •Φ3 + B1,Φ1 ) : k = L(Φ3 )
(Wk−L(Φ3 )+1,Φ1 , Bk−L(Φ3 )+1,Φ1 ) : k > L(Φ3 ).
Combining this with (2.25) establishes that for all k ∈ {1, 2, . . . , L(Φ1 )+L(Φ3 )−1}\{L(Φ3 )}
it holds that
(Wk,(Φ1 •Φ2 )•Φ3 , Bk,(Φ1 •Φ2 )•Φ3 ) = (Wk,Φ1 •(Φ2 •Φ3 ) , Bk,Φ1 •(Φ2 •Φ3 ) ). (2.27)
Moreover, note that (2.24) and (2.2) ensure that
W1,Φ1 •Φ2 WL(Φ3 ),Φ3 = W1,Φ1 W1,Φ2 WL(Φ3 ),Φ3 = W1,Φ1 WL(Φ3 ),Φ2 •Φ3 . (2.28)
In addition, observe that (2.24) and (2.2) demonstrate that
W1,Φ1 •Φ2 BL(Φ3 ),Φ3 + B1,Φ1 •Φ2 = W1,Φ1 W1,Φ2 BL(Φ3 ),Φ3 + W1,Φ1 B1,Φ2 + B1,Φ1
= W1,Φ1 (W1,Φ2 BL(Φ3 ),Φ3 + B1,Φ2 ) + B1,Φ1 (2.29)
= W1,Φ BL(Φ3 ),Φ2 •Φ3 + B1,Φ1 .
81
Chapter 2: ANN calculus
Combining this and (2.28) with (2.27) proves that for all k ∈ {1, 2, . . . , L(Φ1 ) + L(Φ3 ) − 1}
it holds that
(Wk,(Φ1 •Φ2 )•Φ3 , Bk,(Φ1 •Φ2 )•Φ3 ) = (Wk,Φ1 •(Φ2 •Φ3 ) , Bk,Φ1 •(Φ2 •Φ3 ) ). (2.30)
This and (2.23) imply that
(Φ1 • Φ2 ) • Φ3 = Φ1 • (Φ2 • Φ3 ). (2.31)
The proof of Lemma 2.1.3 is thus complete.
Lemma 2.1.4. Let Φ1 , Φ2 , Φ3 ∈ N satisfy I(Φ1 ) = O(Φ2 ), I(Φ2 ) = O(Φ3 ), and L(Φ2 ) > 1
(cf. Definition 1.3.1). Then
(Φ1 • Φ2 ) • Φ3 = Φ1 • (Φ2 • Φ3 ) (2.32)
(cf. Definition 2.1.1).
Proof of Lemma 2.1.4. Note that the fact that for all Ψ, Θ ∈ N it holds that L(Ψ • Θ) =
L(Ψ) + L(Θ) − 1 ensures that
L((Φ1 • Φ2 ) • Φ3 ) = L(Φ1 • Φ2 ) + L(Φ3 ) − 1
= L(Φ1 ) + L(Φ2 ) + L(Φ3 ) − 2
(2.33)
= L(Φ1 ) + L(Φ2 • Φ3 ) − 1
= L(Φ1 • (Φ2 • Φ3 ))
(cf. Definition 2.1.1). Furthermore, observe that (2.2) shows that for all k ∈ {1, 2, . . . ,
L((Φ1 • Φ2 ) • Φ3 )} it holds that
(Wk,(Φ1 •Φ2 )•Φ3 , Bk,(Φ1 •Φ2 )•Φ3 )
(Wk,Φ3 , Bk,Φ3 )
: k < L(Φ3 )
(2.34)
= (W1,Φ1 •Φ2 WL(Φ3 ),Φ3 , W1,Φ1 •Φ2 BL(Φ3 ),Φ3 + B1,Φ1 •Φ2 ) : k = L(Φ3 )
(Wk−L(Φ3 )+1,Φ1 •Φ2 , Bk−L(Φ3 )+1,Φ1 •Φ2 ) : k > L(Φ3 ).
Moreover, note that (2.2) and the assumption that L(Φ2 ) > 1 ensure that for all k ∈
N ∩ (L(Φ3 ), L((Φ1 • Φ2 ) • Φ3 )] it holds that
(Wk−L(Φ3 )+1,Φ1 •Φ2 , Bk−L(Φ3 )+1,Φ1 •Φ2 )
(Wk−L(Φ3 )+1,Φ2 , Bk−L(Φ3 )+1,Φ2 )
: k − L(Φ3 ) + 1 < L(Φ2 )
= (W1,Φ1 WL(Φ2 ),Φ2 , W1,Φ1 BL(Φ2 ),Φ2 + B1,Φ1 ) : k − L(Φ3 ) + 1 = L(Φ2 )
(Wk−L(Φ3 )+1−L(Φ2 )+1,Φ1 , Bk−L(Φ3 )+1−L(Φ2 )+1,Φ1 ) : k − L(Φ3 ) + 1 > L(Φ2 ) (2.35)
(Wk−L(Φ3 )+1,Φ2 , Bk−L(Φ3 )+1,Φ2 )
:k < L(Φ2 ) + L(Φ3 ) − 1
= (W1,Φ1 WL(Φ2 ),Φ2 , W1,Φ1 BL(Φ2 ),Φ2 + B1,Φ1 ) : k = L(Φ2 ) + L(Φ3 ) − 1
(Wk−L(Φ3 )−L(Φ2 )+2,Φ1 , Bk−L(Φ3 )−L(Φ2 )+2,Φ1 ) : k > L(Φ2 ) + L(Φ3 ) − 1.
82
2.1. Compositions of fully-connected feedforward ANNs
Combining this with (2.34) proves that for all k ∈ {1, 2, . . . , L((Φ1 • Φ2 ) • Φ3 )} it holds
that
In addition, observe that (2.2), the fact that L(Φ2 • Φ3 ) = L(Φ2 ) + L(Φ3 ) − 1, and the
assumption that L(Φ2 ) > 1 demonstrate that for all k ∈ {1, 2, . . . , L(Φ1 • (Φ2 • Φ3 ))} it
holds that
This, (2.36), and (2.33) establish that for all k ∈ {1, 2, . . . , L(Φ1 ) + L(Φ2 ) + L(Φ3 ) − 2} it
holds that
(Wk,(Φ1 •Φ2 )•Φ3 , Bk,(Φ1 •Φ2 )•Φ3 ) = (Wk,Φ1 •(Φ2 •Φ3 ) , Bk,Φ1 •(Φ2 •Φ3 ) ). (2.38)
Hence, we obtain that
(Φ1 • Φ2 ) • Φ3 = Φ1 • (Φ2 • Φ3 ). (2.39)
The proof of Lemma 2.1.4 is thus complete.
83
Chapter 2: ANN calculus
Corollary 2.1.5. Let Φ1 , Φ2 , Φ3 ∈ N satisfy I(Φ1 ) = O(Φ2 ) and I(Φ2 ) = O(Φ3 ) (cf.
Definition 1.3.1). Then
(Φ1 • Φ2 ) • Φ3 = Φ1 • (Φ2 • Φ3 ) (2.40)
(cf. Definition 2.1.1).
Proof of Corollary 2.1.5. Note that Lemma 2.1.3 and Lemma 2.1.4 establish (2.40). The
proof of Corollary 2.1.5 is thus complete.
Proof of Lemma 2.1.7. Observe that Proposition 2.1.2, (2.41), and induction establish
(2.42). The proof of Lemma 2.1.7 is thus complete.
84
2.2. Parallelizations of fully-connected feedforward ANNs
the function which satisfies for all Φ = (Φ1 , . . . , Φn ) ∈ Nn , k ∈ {1, 2, . . . , L(Φ1 )} with
L(Φ1 ) = L(Φ2 ) = · · · = L(Φn ) that
Wk,Φ1 0 0 ··· 0
0
Wk,Φ2 0 ··· 0
L(Pn (Φ)) = L(Φ1 ),
0
Wk,Pn (Φ) = 0 W k,Φ3 ··· 0 ,
.. .. .. .. ..
. . . . .
0 0 0 ··· Wk,Φn
Bk,Φ1
Bk,Φ
(2.44)
2
and Bk,Pn (Φ) = ..
.
Bk,Φn
and
Proof of Lemma 2.2.2. Note that item (iii) in Lemma 1.3.3 and (2.44) imply that for all
k ∈ {1, 2, . . . , L} it holds that
Pn
Dk (Φj ))×( n
Pn
and (2.48)
P
Wk,Pn (Φ) ∈ R( j=1 j=1 Dk−1 (Φj )) Bk,Pn (Φ) ∈ R( j=1 Dk−1 (Φj ))
(cf. Definition 2.2.1). Item (iii) in Lemma 1.3.3 therefore establishes items (i) and (ii). Note
that item (ii) implies item (iii). The proof of Lemma 2.2.2 is thus complete.
85
Chapter 2: ANN calculus
and
RN
a Pn (Φ) (x1 , x2 , . . . , xn )
[ n
P (2.50)
= (RN N N j=1 O(Φj )]
a (Φ1 ))(x 1 ), (Ra (Φ2 ))(x 2 ), . . . , (Ra (Φn ))(x n ) ∈ R
Proof of Proposition 2.2.3. Throughout this proof, let L = L(Φ1 ), for every j ∈ {1, 2, . . . ,
n} let
∀ k ∈ {1, 2, . . . , L} : xk = Ma1(0,L) (k)+idR 1{L} (k),Dk (Φj ) (Wk,Φj xk−1 + Bk,Φj ) , (2.51)
and let
X = x = (x0 , x1 , . . . , xL ) ∈ RD0 (Pn (Φ)) × RD1 (Pn (Φ)) × · · · × RDL (Pn (Φ)) :
∀ k ∈ {1, 2, . . . , L} : xk = Ma1(0,L) (k)+idR 1{L} (k),Dk (Pn (Φ)) (Wk,Pn (Φ) xk−1 + Bk,Pn (Φ) ) . (2.52)
Observe that item (ii) in Lemma 2.2.2 and item (ii) in Lemma 1.3.3 imply that
n
X n
X
I(Pn (Φ)) = D0 (Pn (Φ)) = D0 (Φn ) = I(Φn ). (2.53)
j=1 j=1
Furthermore, note that item (ii) in Lemma 2.2.2 and item (ii) in Lemma 1.3.3 ensure that
n
X n
X
O(Pn (Φ)) = DL(Pn (Φ)) (Pn (Φ)) = DL(Φn ) (Φn ) = O(Φn ). (2.54)
j=1 j=1
Observe that (2.44) and item (ii) in Lemma 2.2.2 show that for allPa ∈ C(R, R), k ∈
n
{1, 2, . . . , L}, x1 ∈ RDk (Φ1 ) , x2 ∈ RDk (Φ2 ) , . . . , xn ∈ RDk (Φn ) , x ∈ R[ j=1 Dk (Φj )] with x =
86
2.2. Parallelizations of fully-connected feedforward ANNs
This proves that for all k ∈ {1, 2, . . . , L}, x = (x0 , x1 , . . . , xL ) ∈ X, x1 = (x10 , x11 , . . . , x1L ) ∈ X 1 ,
x2 = (x20 , x21 , . . . , x2L ) ∈ X 2 , . . . , xn = (xn0 , xn1 , . . . , xnL ) ∈ X n with xk−1 = (x1k−1 , x2k−1 , . . . ,
xnk−1 ) it holds that
xk = (x1k , x2k , . . . , xnk ). (2.56)
Induction, and (1.91) hence demonstrate that for all k ∈ {1, 2, . . . , L}, x = (x0 , x1 , . . . , xL ) ∈
X, x1 = (x10 , x11 , . . . , x1L ) ∈ X 1 , x2 = (x20 , x21 , . . . , x2L ) ∈ X 2 , . . . , xn = (xn0 , xn1 , . . . , xnL ) ∈ X n
with x0 = (x10 , x20 , . . . , xn0 ) it holds that
RN 1 2 n
a (Pn (Φ)) (x0 ) = xL = (xL , xL , . . . , xL )
(2.57)
= (RN 1 N 2 N n
a (Φ1 ))(x0 ), (Ra (Φ2 ))(x0 ), . . . , (Ra (Φn ))(x0 ) .
This establishes item (ii). The proof of Proposition 2.2.3 is thus complete.
Proof of Proposition 2.2.4. Throughout this proof, for every j ∈ {1, 2, . . . , n}, k ∈ {0, 1,
87
Chapter 2: ANN calculus
. . . , L} let lj,k = Dk (Φj ). Note that item (ii) in Lemma 2.2.2 demonstrates that
L h
X ih P i
Pn n
P(Pn (Φ1 , Φ2 , . . . , Φn )) = i=1 li,k l
i=1 i,k−1 + 1
k=1
L h
X ih P i
Pn n
= i=1 li,k j=1 lj,k−1 +1
k=1
Xn Xn X L n X
X n X
L
≤ li,k (lj,k−1 + 1) ≤ li,k (lj,ℓ−1 + 1)
i=1 j=1 k=1 i=1 j=1 k,ℓ=1
n n
(2.59)
X XhPL ihP
L
i
= k=1 li,k ℓ=1 (lj,ℓ−1 + 1)
i=1 j=1
Xn X n h ihP i
PL 1 L
≤ k=1 2 li,k (l i,k−1 + 1) ℓ=1 lj,ℓ (lj,ℓ−1 + 1)
i=1 j=1
Xn X n hP i2
1 1 n
= 2
P(Φi )P(Φ j ) = 2 i=1 P(Φ i ) .
i=1 j=1
Corollary 2.2.5 (Lower and upper bounds for the numbers of parameters of parallelizations
of fully-connected feedforward ANNs). Let n ∈ N, Φ = (Φ1 , . . . , Φn ) ∈ Nn satisfy D(Φ1 ) =
D(Φ2 ) = . . . = D(Φn ) (cf. Definition 1.3.1). Then
n2 n2 +n 2
(2.60)
Pn
2
P(Φ1 ) ≤ 2
P(Φ1 ) ≤ P(Pn (Φ)) ≤ n2 P(Φ1 ) ≤ 21 i=1 P(Φi )
Observe that (2.61) and the assumption that D(Φ1 ) = D(Φ2 ) = . . . = D(Φn ) imply that for
all j ∈ {1, 2, . . . , n} it holds that
88
2.2. Parallelizations of fully-connected feedforward ANNs
Furthermore, note that the assumption that D(Φ1 ) = D(Φ2 ) = . . . = D(Φn ) and the fact
that P(Φ1 ) ≥ l1 (l0 + 1) ≥ 2 ensure that
n 2 n 2
n2
2 2 1 2 1 1
(2.65)
P P
n P(Φ1 ) ≤ 2 [P(Φ1 )] = 2 [nP(Φ1 )] = 2 P(Φ1 ) = 2 P(Φi ) .
i=1 i=1
Moreover, observe that (2.63) and the fact that for all a, b ∈ N it holds that
2(ab + 1) = ab + 1 + (a − 1)(b − 1) + a + b ≥ ab + a + b + 1 = (a + 1)(b + 1) (2.66)
show that
L
1
P
P(Pn (Φ)) ≥ 2
(nlj )(n + 1)(lj−1 + 1)
j=1
L
(2.67)
n(n+1) P n2 +n
= 2
lj (lj−1 + 1) = 2
P(Φ1 ).
j=1
This, (2.64), and (2.65) establish (2.60). The proof of Corollary 2.2.5 is thus complete.
Exercise 2.2.1. Prove or disprove the following statement: For every n ∈ N, Φ = (Φ1 , . . . ,
Φn ) ∈ Nn with L(Φ1 ) = L(Φ2 ) = . . . = L(Φn ) it holds that
P(Pn (Φ1 , Φ2 , . . . , Φn )) ≤ n ni=1 P(Φi ) . (2.68)
P
Exercise 2.2.2. Prove or disprove the following statement: For every n ∈ N, Φ = (Φ1 , . . . ,
Φn ) ∈ Nn with P(Φ1 ) = P(Φ2 ) = . . . = P(Φn ) it holds that
P(Pn (Φ1 , Φ2 , . . . , Φn )) ≤ n2 P(Φ1 ). (2.69)
89
Chapter 2: ANN calculus
(RN N
r (Id ))(x) = Rr Pd (I1 , I1 , . . . , I1 ) (x1 , x2 , . . . , xd )
= (RN N N
(2.77)
r (I1 ))(x1 ), (Rr (I1 ))(x2 ), . . . , (Rr (I1 ))(xd )
= (x1 , x2 , . . . , xd ) = x
(cf. Definition 2.2.1). This establishes item (ii). The proof of Lemma 2.2.7 is thus complete.
90
2.2. Parallelizations of fully-connected feedforward ANNs
and
(cf. Definition 2.1.6). Combining this with (1.78) and Lemma 1.3.3 ensures that
establishes (2.84) in the base case n = 0 (cf. Definition 1.5.5). For the induction step assume
that there exists n ∈ N0 which satisfies
(
(d, d) :n=0
Nn+2 ∋ D(Ψ•n ) = (2.86)
(d, i, i, . . . , i, d) : n ∈ N.
Note that (2.86), (2.41), (2.83), item (i) in Proposition 2.1.2, and the fact that D(Ψ) =
(d, i, d) ∈ N3 imply that
91
Chapter 2: ANN calculus
(cf. Definition 2.1.1). Induction therefore proves (2.84). This and (2.83) establish item (i).
Observe that (2.79), item (iii) in Proposition 2.1.2, (2.82), and the fact that H(Φ) = L(Φ)−1
imply that for all L ∈ N ∩ [L(Φ), ∞) it holds that
The fact that H EL,Ψ (Φ) = L EL,Ψ (Φ) − 1 hence proves that
(2.89)
L EL,Ψ (Φ) = H EL,Ψ (Φ) + 1 = L.
This establishes item (ii). The proof of Lemma 2.2.9 is thus complete.
RN N
a (EL,I (Φ)) = Ra (Φ) (2.91)
Proof of Lemma 2.2.10. Throughout this proof, let Φ ∈ N, L, d ∈ N satisfy L(Φ) ≤ L and
I(I) = O(Φ) = d. We claim that for all n ∈ N0 it holds that
RN •n d d
a (I ) ∈ C(R , R ) and ∀ x ∈ Rd : (RN •n
a (I ))(x) = x. (2.92)
We now prove (2.92) by induction on n ∈ N0 . Note that (2.41) and the fact that O(I) = d
demonstrate that RN a (I ) ∈ C(R , R ) and ∀ x ∈ R : (Ra (I ))(x) = x. This establishes
•0 d d d N •0
(2.92) in the base case n = 0. For the induction step observe that for all n ∈ N0 with
a (I ) ∈ C(R , R ) and ∀ x ∈ R : (Ra (I ))(x) = x it holds that
•n N •n
RN d d d
RN
a (I
•(n+1)
) = RN •n N N •n d d
a (I • (I )) = (Ra (I)) ◦ (Ra (I )) ∈ C(R , R ) (2.93)
and
•(n+1) N •n
∀ x ∈ Rd : RN N
a (I ) (x) = [R a (I)] ◦ [Ra (I )] (x)
N •n
(2.94)
= (Ra (I)) Ra (I ) (x) = (RN
N
a (I))(x) = x.
92
2.2. Parallelizations of fully-connected feedforward ANNs
Induction therefore proves (2.92). This establishes item (i). Note (2.79), item (v) in
Proposition 2.1.2, item (i), and the fact that I(I) = O(Φ) ensure that
•(L−L(Φ))
RN N
a (EL,I (Φ)) = Ra ((I ) • Φ)
(2.95)
∈ C(RI(Φ) , RO(I) ) = C(RI(Φ) , RI(I) ) = C(RI(Φ) , RO(Φ) )
and
∀ x ∈ RI(Φ) : RN N •(L−L(Φ)) N
a (E L,I (Φ)) (x) = Ra (I ) (Ra (Φ))(x)
(2.96)
= (RN
a (Φ))(x).
This establishes item (ii). The proof of Lemma 2.2.10 is thus complete.
Proof of Lemma 2.2.11. Observe that item (i) in Lemma 2.2.9 demonstrates that
93
Chapter 2: ANN calculus
(2.106)
Pn,Ψ (Φ) = Pn Emaxk∈{1,2,...,n} L(Φk ),Ψ1 (Φ1 ), . . . , Emaxk∈{1,2,...,n} L(Φk ),Ψn (Φn )
RN
a (Pn,I (Φ)) (x1 , x2 , . . . , xn )
[ n
P (2.108)
= (RN N N j=1 O(Φj )]
a (Φ 1 ))(x 1 ), (Ra (Φ 2 ))(x 2 ), . . . , (Ra (Φn ))(x n ) ∈ R
RN N
(2.109)
a (EL,Ij (Φj )) (x) = (Ra (Φj ))(x)
94
2.2. Parallelizations of fully-connected feedforward ANNs
(cf. Definition 2.2.8). Items (i) and (ii) in Proposition 2.2.3 therefore imply
(A) that
Pn Pn
RN ∈ C R[ I(Φj )]
, R[ O(Φj )]
(2.110)
a Pn EL,I1 (Φ1 ), EL,I2 (Φ2 ), . . . , EL,In (Φn )
j=1 j=1
and
RN
a P n E L,I1 (Φ 1 ), E L,I 2 (Φ 2 ), . . . , E L,I n (Φ n ) (x1 , x2 , . . . , xn )
= RN N N
a E L,I1 (Φ 1 ) (x 1 ), R a E L,I2 (Φ 2 ) (x 2 ), . . . , R a EL,In (Φn ) (x n ) (2.111)
= (RN N
a (Φ1 ))(x1 ), (Ra (Φ2 ))(x2 ), . . . , (Ra (Φn ))(xn )
N
(cf. Definition 2.2.1). Combining this with (2.106) and the fact that L = maxj∈{1,2,...,n}
L(Φj ) ensures
(C) that
[ n
Pn
(2.112)
P
RN j=1 I(Φj )] , R[ j=1 O(Φj )]
a Pn,I (Φ) ∈ C R
and
RN
a Pn,I (Φ) (x1 , x2 , . . . , xn )
= RN
a Pn EL,I1 (Φ1 ), EL,I2 (Φ2 ), . . . , EL,In (Φn ) (x1 , x2 , . . . , xn ) (2.113)
N N N
= (Ra (Φ1 ))(x1 ), (Ra (Φ2 ))(x2 ), . . . , (Ra (Φn ))(xn ) .
This establishes items items (i) and (ii). The proof of Lemma 2.2.13 is thus complete.
Exercise 2.2.3. For every d ∈ N let Fd : Rd → Rd satisfy for all x = (x1 , . . . , xd ) ∈ Rd that
Fd (x) = (max{|x1 |}, max{|x1 |, |x2 |}, . . . , max{|x1 |, |x2 |, . . . , |xd |}). (2.114)
Prove or disprove the following statement: For all d ∈ N there exists Φ ∈ N such that
RN
r (Φ) = Fd (2.115)
95
Chapter 2: ANN calculus
(RN
a (AW,B ))(x) = Wx + B (2.118)
This proves item (i). Furthermore, observe that the fact that
and (1.91) ensure that for all a ∈ C(R, R), x ∈ Rn it holds that RN n m
a (AW,B ) ∈ C(R , R )
and
(RN
a (AW,B ))(x) = Wx + B. (2.121)
This establishes items (ii) and (iii). The proof of Lemma 2.3.2 is thus complete. The proof
of Lemma 2.3.2 is thus complete.
Lemma 2.3.3 (Compositions with fully-connected feedforward affine transformation ANNs).
Let Φ ∈ N (cf. Definition 1.3.1). Then
96
2.3. Scalar multiplications of fully-connected feedforward ANNs
(RN N
(2.123)
a (A W,B • Φ))(x) = W (Ra (Φ))(x) + B,
(RN N
a (Φ • AW,B ))(x) = (Ra (Φ))(Wx + B) (2.125)
(RN
a (AW,B ))(x) = Wx + B (2.126)
(cf. Definitions 1.3.4 and 2.3.1). Combining this and Proposition 2.1.2 proves items (i), (ii),
(iii), (iv), (v), and (vi). The proof of Lemma 2.3.3 is thus complete.
λ ⊛ Φ = Aλ IO(Φ) ,0 • Φ (2.127)
97
Chapter 2: ANN calculus
I(Φ)
(ii) it holds for all a ∈ C(R, R) that RN
a (λ ⊛ Φ) ∈ C(R , RO(Φ) ), and
(iii) it holds for all a ∈ C(R, R), x ∈ RI(Φ) that
RN N
(2.128)
a (λ ⊛ Φ) (x) = λ (Ra (Φ))(x)
(cf. Definition 1.3.4). This proves items (ii) and (iii). The proof of Lemma 2.3.5 is thus
complete.
98
2.4. Sums of fully-connected feedforward ANNs with the same length
(cf. Definitions 1.3.4, 1.5.5, and 2.3.1). This proves items (ii) and (iii). The proof of
Lemma 2.4.2 is thus complete.
Lemma 2.4.3. Let m, n ∈ N, a ∈ C(R, R), Φ ∈ N satisfy O(Φ) = mn (cf. Definition 1.3.1).
Then
I(Φ)
(i) it holds that RN
a (Sm,n • Φ) ∈ C(R , Rm ) and
(ii) it holds for all x ∈ RI(Φ) , y1 , y2 , . . . , yn ∈ Rm with (RN
a (Φ))(x) = (y1 , y2 , . . . , yn ) that
n
RN (2.138)
P
a (Sm,n • Φ) (x) = yk
k=1
(cf. Definitions 1.3.4 and 2.4.1). Combining this and item (v) in Proposition 2.1.2 establishes
items (i) and (ii). The proof of Lemma 2.4.3 is thus complete.
99
Chapter 2: ANN calculus
(cf. Definitions 1.3.4 and 2.4.1). Combining this and item (v) in Proposition 2.1.2 proves
items (i) and (ii). The proof of Lemma 2.4.4 is thus complete.
(RN
a (Tm,n ))(x) = (x, x, . . . , x) (2.144)
100
2.4. Sums of fully-connected feedforward ANNs with the same length
101
Chapter 2: ANN calculus
102
2.4. Sums of fully-connected feedforward ANNs with the same length
(cf. Definition 2.2.1). Furthermore, note that item (i) in Lemma 2.4.2 demonstrates that
D(SO(Φm ),n−m+1 ) = ((n − m + 1)O(Φm ), O(Φm )) (2.157)
(cf. Definition 2.4.1). This, (2.156), and item (i) in Proposition 2.1.2 show that
D SO(Φm ),n−m+1 • Pn−m+1 (Φm , Φm+1 , . . . , Φn )
(2.158)
n n n
P P P
= (n − m + 1)I(Φm ), D1 (Φk ), D2 (Φk ), . . . , DL(Φm )−1 (Φk ), O(Φm ) .
k=m k=m k=m
(cf. Definition 2.4.10). This proves items (i) and (ii). Note that Lemma 2.4.9 and (2.156)
imply that for all a ∈ C(R, R), x ∈ RI(Φm ) it holds that
RN I(Φm )
, R(n−m+1)O(Φm ) ) (2.161)
a [Pn−m+1 (Φm , Φm+1 , . . . , Φn )] • TI(Φm ),n−m+1 ∈ C(R
and
RN
a [Pn−m+1 (Φm , Φm+1 , . . . , Φn )] • TI(Φm ),n−m+1 (x)
(2.162)
= RN
a Pn−m+1 (Φm , Φm+1 , . . . , Φn ) (x, x, . . . , x)
(cf. Definition 1.3.4). Combining this with item (ii) in Proposition 2.2.3 demonstrates that
for all a ∈ C(R, R), x ∈ RI(Φm ) it holds that
RN
a [Pn−m+1 (Φm , Φm+1 , . . . , Φn )] • TI(Φm ),n−m+1 (x)
(2.163)
= (RN N N (n−m+1)O(Φm )
a (Φm ))(x), (Ra (Φm+1 ))(x), . . . , (Ra (Φn ))(x) ∈ R .
Lemma 2.4.3, (2.157), and Corollary 2.1.5 hence show that for all a ∈ C(R, R), x ∈ RI(Φm )
it holds that RN n I(Φm )
, RO(Φm ) ) and
L
a k=m Φk ∈ C(R
n
N
L
Ra Φk (x)
k=m
= RN
a SO(Φm ),n−m+1 • [Pn−m+1 (Φm , Φm+1 , . . . , Φn )] • TI(Φm ),n−m+1 (x) (2.164)
X n
= (RN
a (Φk ))(x).
k=m
This establishes item (iii). The proof of Lemma 2.4.11 is thus complete.
103
Chapter 2: ANN calculus
104
Part II
Approximation
105
Chapter 3
In learning problems ANNs are heavily used with the aim to approximate certain target
functions. In this chapter we review basic ReLU ANN approximation results for a class
of one-dimensional target functions (see Section 3.3). ANN approximation results for
multi-dimensional target functions are treated in Chapter 4 below.
In the scientific literature the capacity of ANNs to approximate certain classes of target
functions has been thoroughly studied; cf., for instance, [14, 41, 89, 203, 204] for early
universal ANN approximation results, cf., for example, [28, 43, 175, 333, 374, 423] and
the references therein for more recent ANN approximation results establishing rates in the
approximation of different classes of target functions, and cf., for instance, [128, 179, 259,
370] and the references therein for approximation capacities of ANNs related to solutions of
PDEs (cf. also Chapters 16 and 17 in Part VI of these lecture notes for machine learning
methods for PDEs). This chapter is based on Ackermann et al. [3, Section 4.2] (cf., for
example, also Hutzenthaler et al. [209, Section 3.4]).
107
Chapter 3: One-dimensional ANN approximation results
Lemma 3.1.2 (Elementary properties of moduli of continuity). Let A ⊆ R be a set and let
f : A → R be a function. Then
(i) it holds that wf is non-decreasing,
(ii) it holds that f is uniformly continuous if and only if limh↘0 wf (h) = 0,
(iii) it holds that f is globally bounded if and only if wf (∞) < ∞, and
(iv) it holds for all x, y ∈ A that |f (x) − f (y)| ≤ wf (|x − y|)
(cf. Definition 3.1.1).
Proof of Lemma 3.1.2. Observe that (3.1) proves items (i), (ii), (iii), and (iv). The proof
of Lemma 3.1.2 is thus complete.
Lemma 3.1.3 (Subadditivity of moduli of continuity). Let a ∈ [−∞, ∞], b ∈ [a, ∞], let
f : ([a, b] ∩ R) → R be a function, and let h, h ∈ [0, ∞]. Then
wf (h + h) ≤ wf (h) + wf (h) (3.2)
(cf. Definition 3.1.1).
Proof of Lemma 3.1.3. Throughout this proof, assume without loss of generality that
h ≤ h < ∞. Note that the fact that for all x, y ∈ [a, b] ∩ R with |x − y| ≤ h + h it
holds that [x − h, x + h] ∩ [y − h, y + h] ∩ [a, b] ̸= ∅ ensures that for all x, y ∈ [a, b] ∩ R with
|x − y| ≤ h + h there exists z ∈ [a, b] ∩ R such that
|x − z| ≤ h and |y − z| ≤ h. (3.3)
Items (i) and (iv) in Lemma 3.1.2 therefore imply that for all x, y ∈ [a, b] ∩ R with
|x − y| ≤ h + h there exists z ∈ [a, b] ∩ R such that
|f (x) − f (y)| ≤ |f (x) − f (z)| + |f (y) − f (z)|
(3.4)
≤ wf (|x − z|) + wf (|y − z|) ≤ wf (h) + wf (h)
(cf. Definition 3.1.1). Combining this with (3.1) demonstrates that
wf (h + h) ≤ wf (h) + wf (h). (3.5)
The proof of Lemma 3.1.3 is thus complete.
Lemma 3.1.4 (Properties of moduli of continuity of Lipschitz continuous functions). Let
A ⊆ R, L ∈ [0, ∞), let f : A → R satisfy for all x, y ∈ A that
|f (x) − f (y)| ≤ L|x − y|, (3.6)
and let h ∈ [0, ∞). Then
wf (h) ≤ Lh (3.7)
(cf. Definition 3.1.1).
108
3.1. Linear interpolation of one-dimensional functions
Proof of Lemma 3.1.4. Observe that (3.1) and (3.6) show that
wf (h) = sup |f (x) − f (y)| ∈ [0, ∞) : (x, y ∈ A with |x − y| ≤ h) ∪ {0}
≤ sup({Lh, 0}) = Lh
(cf. Definition 3.1.1). The proof of Lemma 3.1.4 is thus complete.
109
Chapter 3: One-dimensional ANN approximation results
Proposition 3.1.7 (Approximation and continuity properties for the linear interpolation
operator). Let K ∈ N, x0 , x1 , . . . , xK ∈ R satisfy x0 < x1 < . . . < xK and let f : [x0 , xK ] → R
be a function. Then
(Lx0f,x(x10,...,x
),f (x1 ),...,f (xK )
K
)(x) − (Lx0f,x(x10,...,x
),f (x1 ),...,f (xK )
)(y)
K
(3.16)
wf (xk − xk−1 )
≤ max |x − y|
k∈{1,2,...,K} xk − xk−1
and
l(x) = (Lx0f,x(x10,...,x
),f (x1 ),...,f (xK )
K
)(x) (3.19)
(cf. Definitions 3.1.1 and 3.1.5). Observe that item (ii) in Lemma 3.1.6, item (iv) in
Lemma 3.1.2, and (3.18) ensure that for all k ∈ {1, 2, . . . , K}, x, y ∈ [xk−1 , xk ] with x ≠ y it
holds that
Furthermore, note that that the triangle inequality and item (i) in Lemma 3.1.6 imply that
for all k, l ∈ {1, 2, . . . , K}, x ∈ [xk−1 , xk ], y ∈ [xl−1 , xl ] with k < l it holds that
110
3.1. Linear interpolation of one-dimensional functions
Item (iv) in Lemma 3.1.2, and (3.18) hence demonstrate that for all k, l ∈ {1, 2, . . . , K},
x ∈ [xk−1 , xk ], y ∈ [xl−1 , xl ] with k < l and x ̸= y it holds that
|l(x) − l(y)|
l−1
!
X
≤ |l(x) − l(xk )| + wf (|xj − xj−1 |) + |l(xl−1 ) − l(y)|
j=k+1
l−1 ! (3.22)
X wf (xj − xj−1 )
= |l(x) − l(xk )| + (xj − xj−1 ) + |l(xl−1 ) − l(y)|
j=k+1
xj − xj−1
≤ |l(xk ) − l(x)| + L(xl−1 − xk ) + |l(y) − l(xl−1 )|.
This and (3.21) show that for all k, l ∈ {1, 2, . . . , K}, x ∈ [xk−1 , xk ], y ∈ [xl−1 , xl ] with k < l
and x ̸= y it holds that
l−1
! !
X
|l(x) − l(y)| ≤ L (xk − x) + (xj − xj−1 ) + (y − xl−1 ) = L|x − y|. (3.23)
j=k+1
Combining this and (3.20) proves that for all x, y ∈ [x0 , xK ] with x ̸= y it holds that
This, the fact that for all x, y ∈ (−∞, x0 ] with x ̸= y it holds that
and the triangle inequality therefore establish that for all x, y ∈ R with x ̸= y it holds that
This proves item (i). Observe that item (iii) in Lemma 3.1.6 ensures that for all k ∈
{1, 2, . . . , K}, x ∈ [xk−1 , xk ] it holds that
xk − x x − xk−1
|l(x) − f (x)| = f (xk−1 ) + f (xk ) − f (x)
xk − xk−1 xk − xk−1
xk − x x − xk−1
= (f (xk−1 ) − f (x)) + (f (xk ) − f (x)) (3.28)
xk − xk−1 xk − xk−1
xk − x x − xk−1
≤ |f (xk−1 ) − f (x)| + |f (xk ) − f (x)|.
xk − xk−1 xk − xk−1
111
Chapter 3: One-dimensional ANN approximation results
Combining this with (3.1) and Lemma 3.1.2 implies that for all k ∈ {1, 2, . . . , K}, x ∈
[xk−1 , xk ] it holds that
xk − x x − xk−1
|l(x) − f (x)| ≤ wf (|xk − xk−1 |) +
xk − xk−1 xk − xk−1 (3.29)
= wf (|xk − xk−1 |) ≤ wf (maxj∈{1,2,...,K} |xj − xj−1 |).
This establishes item (ii). The proof of Proposition 3.1.7 is thus complete.
Corollary 3.1.8 (Approximation and Lipschitz continuity properties for the linear inter-
polation operator). Let K ∈ N, L, x0 , x1 , . . . , xK ∈ R satisfy x0 < x1 < . . . < xK and let
f : [x0 , xK ] → R satisfy for all x, y ∈ [x0 , xK ] that
Then
(Lx0f,x(x10,...,x
),f (x1 ),...,f (xK )
K
)(x) − (Lx0f,x(x10,...,x
),f (x1 ),...,f (xK )
K
)(y) ≤ L|x − y| (3.31)
and
Proof of Corollary 3.1.8. Note that the assumption that for all x, y ∈ [x0 , xK ] it holds that
|f (x) − f (y)| ≤ L|x − y| demonstrates that
Combining this, Lemma 3.1.4, and the assumption that for all x, y ∈ [x0 , xK ] it holds that
|f (x) − f (y)| ≤ L|x − y| with item (i) in Proposition 3.1.7 shows that for all x, y ∈ R it
holds that
(Lx0f,x(x10,...,x
),f (x1 ),...,f (xK )
K
)(x) − (Lx0f,x(x10,...,x
),f (x1 ),...,f (xK )
)(y)
K
(3.34)
L|xk − xk−1 |
≤ max |x − y| = L|x − y|.
k∈{1,2,...,K} |xk − xk−1 |
112
3.2. Linear interpolation with fully-connected feedforward ANNs
This proves item (i). Observe that the assumption that for all x, y ∈ [x0 , xK ] it holds that
|f (x) − f (y)| ≤ L|x − y|, Lemma 3.1.4, and item (ii) in Proposition 3.1.7 ensure that
f (x0 ),f (x1 ),...,f (xK )
sup (Lx0 ,x1 ,...,xK )(x) − f (x) ≤ wf max |xk − xk−1 |
x∈[x0 ,xK ] k∈{1,2,...,K}
(3.35)
≤L max |xk − xk−1 | .
k∈{1,2,...,K}
This establishes item (ii). The proof of Corollary 3.1.8 is thus complete.
RN
a (in ) = Ma,n (3.38)
Proof of Lemma 3.2.2. Note that the fact that in ∈ ((Rn×n × Rn ) × (Rn×n × Rn )) ⊆ N
implies that
D(in ) = (n, n, n) ∈ N3 (3.39)
(cf. Definitions 1.3.1 and 3.2.1). This proves item (i). Observe that (1.91) and the fact that
113
Chapter 3: One-dimensional ANN approximation results
(RN
a (in ))(x) = In (Ma,n (In x + 0)) + 0 = Ma,n (x). (3.41)
This establishes item (ii). The proof of Lemma 3.2.2 is thus complete.
Lemma 3.2.3 (Compositions of fully-connected feedforward activation ANNs with general
fully-connected feedforward ANNs). Let Φ ∈ N (cf. Definition 1.3.1). Then
(i) it holds that
D(iO(Φ) • Φ)
(3.42)
= (D0 (Φ), D1 (Φ), D2 (Φ), . . . , DL(Φ)−1 (Φ), DL(Φ) (Φ), DL(Φ) (Φ)) ∈ NL(Φ)+2 ,
I(Φ)
(ii) it holds for all a ∈ C(R, R) that RN
a (iO(Φ) • Φ) ∈ C(R , RO(Φ) ),
D(Φ • iI(Φ) )
(3.43)
= (D0 (Φ), D0 (Φ), D1 (Φ), D2 (Φ), . . . , DL(Φ)−1 (Φ), DL(Φ) (Φ)) ∈ NL(Φ)+2 ,
I(Φ)
(v) it holds for all a ∈ C(R, R) that RN
a (Φ • iI(Φ) ) ∈ C(R , RO(Φ) ), and
114
3.2. Linear interpolation with fully-connected feedforward ANNs
and ∀ x ∈ R : (RN
r (Aα,β ))(x) = αx + β (cf. Definitions 1.2.4 and 1.3.4). Proposition 2.1.2,
Lemma 3.2.2, Lemma 3.2.3, (1.26), (1.91), and (2.2) hence imply that
This establishes items (i), (ii), (iii), and (iv). The proof of Lemma 3.2.4 is thus complete.
(cf. Definitions 1.3.1, 2.1.1, 2.3.1, 2.3.4, 2.4.10, and 3.2.1). Then
115
Chapter 3: One-dimensional ANN approximation results
Proof of Proposition 3.2.5. Throughout this proof, let c0 , c1 , . . . , cK ∈ R satisfy for all
k ∈ {0, 1, . . . , K} that
RN
r (Φk ) ∈ C(R, R), D(Φk ) = (1, 1, 1) ∈ N3 , (3.54)
and ∀ x ∈ R: (RN
r (Φk ))(x) = ck max{x − xk , 0} (3.55)
(cf. Definitions 1.2.4 and 1.3.4). This, Lemma 2.3.3, Lemma 2.4.11, and (3.51) prove that
This establishes item (i). Observe that item (i) and (1.78) ensure that
This implies item (iii). Note that (3.52), (3.55), Lemma 2.3.3, and Lemma 2.4.11 demonstrate
that for all x ∈ R it holds that
K
X K
X
(RN
r (F))(x) = f0 + N
(Rr (Φk ))(x) = f0 + ck max{x − xk , 0}. (3.58)
k=0 k=0
This and the fact that for all k ∈ {0, 1, . . . , K} it holds that x0 ≤ xk show that for all
x ∈ (−∞, x0 ] it holds that
(RN
r (F))(x) = f0 + 0 = f0 . (3.59)
Next we claim that for all k ∈ {1, 2, . . . , K} it holds that
k−1
X fk − fk−1
cn = . (3.60)
n=0
xk − xk−1
We now prove (3.60) by induction on k ∈ {1, 2, . . . , K}. For the base case k = 1 observe
that (3.52) proves that
0
X f1 − f0
cn = c0 = . (3.61)
n=0
x 1 − x0
116
3.2. Linear interpolation with fully-connected feedforward ANNs
This establishes (3.60) in the base case k = 1. For the induction step observe that (3.52)
fk−1 −fk−2
n=0 cn = xk−1 −xk−2 it holds that
ensures that for all k ∈ N ∩ (1, ∞) ∩ (0, K] with k−2
P
k−1 k−2
X X fk − fk−1 fk−1 − fk−2 fk−1 − fk−2 fk − fk−1
cn = ck−1 + cn = − + = . (3.62)
n=0 n=0
xk − xk−1 xk−1 − xk−2 xk−1 − xk−2 xk − xk−1
Induction thus implies (3.60). Furthermore, note that (3.58), (3.60), and the fact that for
all k ∈ {1, 2, . . . , K} it holds that xk−1 < xk demonstrate that for all k ∈ {1, 2, . . . , K},
x ∈ [xk−1 , xk ] it holds that
K
X
(RN
r (F))(x) − (RN
r (F))(xk−1 ) = cn (max{x − xn , 0} − max{xk−1 − xn , 0})
n=0
k−1 k−1
cn (x − xk−1 ) (3.63)
X X
= cn [(x − xn ) − (xk−1 − xn )] =
n=0 n=0
fk − fk−1
= (x − xk−1 ).
xk − xk−1
Next we claim that for all k ∈ {1, 2, . . . , K}, x ∈ [xk−1 , xk ] it holds that
fk − fk−1
N
(Rr (F))(x) = fk−1 + (x − xk−1 ). (3.64)
xk − xk−1
We now prove (3.64) by induction on k ∈ {1, 2, . . . , K}. For the base case k = 1 observe
that (3.59) and (3.63) show that for all x ∈ [x0 , x1 ] it holds that
f1 − f0
N N N N
(Rr (F))(x) = (Rr (F))(x0 )+(Rr (F))(x)−(Rr (F))(x0 ) = f0 + (x − x0 ). (3.65)
x1 − x0
This proves (3.64) in the base case k = 1. For the induction step note that (3.63) establishes
that for all k ∈ N ∩ (1, ∞) ∩ [1, K], x ∈ [xk−1 , xk ] with ∀ y ∈ [xk−2 , xk−1 ] : (RN
r (F))(y) =
fk−1 −fk−2
fk−2 + xk−1 −xk−2 (y − xk−2 ) it holds that
(RN N N N
r (F))(x) = (Rr (F))(xk−1 ) + (Rr (F))(x) − (Rr (F))(xk−1 )
fk−1 − fk−2 fk − fk−1
= fk−2 + (xk−1 − xk−2 ) + (x − xk−1 )
xk−1 − xk−2 xk − xk−1 (3.66)
fk − fk−1
= fk−1 + (x − xk−1 ).
xk − xk−1
Induction thus ensures (3.64). Moreover, observe that (3.52) and (3.60) imply that
K K−1
X X fK − fK−1 fK − fK−1
cn = cK + cn = − + = 0. (3.67)
n=0 n=0
xK − xK−1 xK − xK−1
117
Chapter 3: One-dimensional ANN approximation results
The fact that for all k ∈ {0, 1, . . . , K} it holds that xk ≤ xK and (3.58) therefore demonstrate
that for all x ∈ [xK , ∞) it holds that
" K #
X
N N
(Rr (F))(x) − (Rr (F))(xK ) = cn (max{x − xn , 0} − max{xK − xn , 0})
n=0
K K
(3.68)
X X
= cn [(x − xn ) − (xK − xn )] = cn (x − xK ) = 0.
n=0 n=0
This and (3.64) show that for all x ∈ [xK , ∞) it holds that
fK −fK−1
(RN N
r (F))(x) = (Rr (F))(xK ) = fK−1 + xK −xK−1
(xK − xK−1 ) = fK . (3.69)
Combining this, (3.59), (3.64), and (3.11) proves item (ii). The proof of Proposition 3.2.5 is
thus complete.
Exercise 3.2.1. Prove or disprove the following statement: There exists Φ ∈ N such that
P(Φ) ≤ 16 and
sup cos(x) − (RN 1
r (Φ))(x) ≤ 2 (3.70)
x∈[−2π,2π]
118
3.3. ANN approximations results for one-dimensional functions
(cf. Definitions 1.3.1, 2.1.1, 2.3.1, 2.3.4, 2.4.10, and 3.2.1). Then
−1
(iv) it holds that supx∈[a,b] |(RN
r (F))(x) − f (x)| ≤ L(b − a)K , and
Proof of Proposition 3.3.1. Note that the fact that for all k ∈ {0, 1, . . . , K} it holds that
This and Proposition 3.2.5 prove items (i), (ii), and (v). Observe that item (i) in Corol-
lary 3.1.8, item (ii), and the assumption that for all x, y ∈ [a, b] it holds that
prove item (iii). Note that item (ii), the assumption that for all x, y ∈ [a, b] it holds that
item (ii) in Corollary 3.1.8, and the fact that for all k ∈ {1, 2, . . . , K} it holds that
(b − a)
xk − xk−1 = (3.77)
K
ensure that for all x ∈ [a, b] it holds that
L(b − a)
N
|(Rr (F))(x) − f (x)| ≤ L max |xk − xk−1 | = . (3.78)
k∈{1,2,...,K} K
This establishes item (iv). The proof of Proposition 3.3.1 is thus complete.
119
Chapter 3: One-dimensional ANN approximation results
Proof of Lemma 3.3.2. Observe that items (i) and (ii) in Lemma 2.3.3, and items (ii)
and (iii) in Lemma 3.2.4 establish items (i) and (ii). Note that item (iii) in Lemma 2.3.3
and item (iii) in Lemma 2.3.5 imply that for all x ∈ R it holds that
(RN N
r (F))(x) = (Rr (0 ⊛ (i1 • A1,−ξ )))(x) + f (ξ)
(3.81)
= 0 (RN
r (i1 • A1,−ξ ))(x) + f (ξ) = f (ξ)
(cf. Definitions 1.2.4 and 1.3.4). This proves item (iii). Observe that (3.81), the fact that
ξ ∈ [a, b], and the assumption that for all x, y ∈ [a, b] it holds that
|(RN
r (F))(x) − f (x)| = |f (ξ) − f (x)| ≤ L|x − ξ| ≤ L max{ξ − a, b − ξ}. (3.83)
This establishes item (iv). Note that (1.78) and item (i) show that
This proves item (v). The proof of Lemma 3.3.2 is thus complete.
120
3.3. ANN approximations results for one-dimensional functions
Corollary 3.3.3 (Explicit ANN approximations with prescribed error tolerances). Let
L(b−a) L(b−a)
ε ∈ (0, ∞), L, a ∈ R, b ∈ (a, ∞), K ∈ N0 ∩ ε
, ε + 1 , x0 , x1 , . . . , xK ∈ R satisfy for
k(b−a)
all k ∈ {0, 1, . . . , K} that xk = a + max{K,1} , let f : [a, b] → R satisfy for all x, y ∈ [a, b] that
(cf. Definitions 1.3.1, 2.1.1, 2.3.1, 2.3.4, 2.4.10, and 3.2.1). Then
L(b−a)
(iv) it holds that supx∈[a,b] |(RN
r (F))(x) − f (x)| ≤ max{K,1}
≤ ε, and
L(b − a)
K ≤1+ , (3.88)
ε
imply that
3L(b − a)
P(F) = 3K + 4 ≤ + 7. (3.89)
ε
This proves item (v). The proof of Corollary 3.3.3 is thus complete.
121
Chapter 3: One-dimensional ANN approximation results
Proof of Corollary 3.3.5. Throughout this proof, assume without loss of generality that
a < b, let K ∈ N0 ∩ L(b−a) , L(b−a) + 1 , x0 , x1 , . . . , xK ∈ [a, b], c0 , c1 , . . . , cK ∈ R satisfy for
ε ε
all k ∈ {0, 1, . . . , K} that
(cf. Definitions 1.3.1, 2.1.1, 2.3.1, 2.3.4, 2.4.10, and 3.2.1). Observe that Corollary 3.3.3
demonstrates that
122
3.3. ANN approximations results for one-dimensional functions
(cf. Definitions 1.2.4 and 1.3.4). This establishes items (i), (iv), and (v). Note that item (I)
and the fact that
L(b − a)
K ≤1+ (3.94)
ε
prove items (ii) and (iii). Observe that item (ii) and items (I) and (V) show that
3L(b − a)
P(F) = 3K + 4 = 3(K + 1) + 1 = 3(D1 (F)) + 1 ≤ + 7. (3.95)
ε
This proves item (vi). Note that Lemma 3.2.4 ensures that for all k ∈ {0, 1, . . . , K} it holds
that
ck ⊛ (i1 • A1,−xk ) = ((1, −xk ), (ck , 0)). (3.96)
Combining this with (2.152), (2.143), (2.134), and (2.2) implies that
1 −x0
1 −x1
F = .. , .. , c0 c1 · · · cK , f (x0 )
. .
1 −xK
∈ (R(K+1)×1 × RK+1 ) × (R1×(K+1) × R). (3.97)
∥T (F)∥∞ = max{|x0 |, |x1 |, . . . , |xK |, |c0 |, |c1 |, . . . , |cK |, |f (x0 )|, 1} (3.98)
(cf. Definitions 1.3.5 and 3.3.4). Furthermore, observe that the assumption that for all
x, y ∈ [a, b] it holds that
|f (x) − f (y)| ≤ L|x − y| (3.99)
and the fact that for all k ∈ N ∩ (0, K + 1) it holds that
(b − a)
xk − xk−1 = (3.100)
max{K, 1}
123
Chapter 3: One-dimensional ANN approximation results
This and (3.98) prove item (vii). The proof of Corollary 3.3.5 is thus complete.
Then there exist C ∈ R such that for all ε ∈ (0, 1] there exists F ∈ N such that
RN
r (F) ∈ C(R, R), supx∈[a,b] |(RN
r (F))(x) − f (x)| ≤ ε, H(F) = 1, (3.103)
∥T (F)∥∞ ≤ max{1, |a|, |b|, 2L, |f (a)|}, and P(F) ≤ Cε −1
(3.104)
Proof of Corollary 3.3.6. Throughout this proof, assume without loss of generality that
a < b and let
C = 3L(b − a) + 7. (3.105)
Note that the assumption that a < b shows that L ≥ 0. Furthermore, observe that (3.105)
ensures that for all ε ∈ (0, 1] it holds that
This and Corollary 3.3.5 imply that for all ε ∈ (0, 1] there exists F ∈ N such that
RN
r (F) ∈ C(R, R), supx∈[a,b] |(RN
r (F))(x) − f (x)| ≤ ε, H(F) = 1, (3.107)
∥T (F)∥∞ ≤ max{1, |a|, |b|, 2L, |f (a)|}, and P(F) ≤ 3L(b − a)ε −1
+ 7 ≤ Cε −1
(3.108)
(cf. Definitions 1.2.4, 1.3.1, 1.3.4, 1.3.5, and 3.3.4). The proof of Corollary 3.3.6 is thus
complete.
124
3.3. ANN approximations results for one-dimensional functions
Then there exist C ∈ R such that for all ε ∈ (0, 1] there exists F ∈ N such that
RN
r (F) ∈ C(R, R), supx∈[a,b] |(RN
r (F))(x) − f (x)| ≤ ε, and P(F) ≤ Cε−1 (3.110)
Proof of Corollary 3.3.7. Note that Corollary 3.3.6 establishes (3.110). The proof of Corol-
lary 3.3.7 is thus complete.
Exercise 3.3.1. Let f : [−2, 3] → R satisfy for all x ∈ [−2, 3] that
Prove or disprove the following statement: There exist c ∈ R and F = (Fε )ε∈(0,1] : (0, 1] → N
such that for all ε ∈ (0, 1] it holds that
RN
r (Fε ) ∈ C(R, R), supx∈[−2,3] |(RN
r (Fε ))(x) − f (x)| ≤ ε, and P(Fε ) ≤ cε−1 (3.112)
125
Chapter 3: One-dimensional ANN approximation results
126
Chapter 4
In this chapter we review basic deep ReLU ANN approximation results for possibly multi-
dimensional target functions. We refer to the beginning of Chapter 3 for a small selection
of ANN approximation results from the literature. The specific presentation of this chapter
is strongly based on [25, Sections 2.2.6, 2.2.7, 2.2.8, and 3.1], [226, Sections 3 and 4.2], and
[230, Section 3].
(positive definiteness),
(triangle inequality).
127
Chapter 4: Multi-dimensional ANN approximation results
Definition 4.1.2 (Metric space). We say that E is a metric space if and only if there exist
a set E and a metric δ on E such that
E = (E, δ) (4.4)
Proof of Proposition 4.1.3. First, observe that the assumption that for all x ∈ D, y ∈ M
it holds that |f (x) − f (y)| ≤ Lδ(x, y) ensures that for all x ∈ D, y ∈ M it holds that
This establishes item (ii). Moreover, note that (4.5) implies that for all x ∈ M it holds that
This and (4.8) establish item (i). Note that (4.7) (applied for every y, z ∈ M with x ↶ y,
y ↶ z in the notation of (4.7)) and the triangle inequality ensure that for all x ∈ E,
y, z ∈ M it holds that
128
4.1. Approximations through supremal convolutions
This and the assumption that M = ̸ ∅ prove item (iii). Note that item (iii), (4.5), and the
triangle inequality show that for all x, y ∈ E it holds that
F (x) − F (y) = sup (f (v) − Lδ(x, v)) − sup (f (w) − Lδ(y, w))
v∈M w∈M
= sup f (v) − Lδ(x, v) − sup (f (w) − Lδ(y, w))
v∈M w∈M
(4.12)
≤ sup f (v) − Lδ(x, v) − (f (v) − Lδ(y, v))
v∈M
= sup (Lδ(y, v) − Lδ(x, v))
v∈M
≤ sup (Lδ(y, x) + Lδ(x, v) − Lδ(x, v)) = Lδ(x, y).
v∈M
This and the fact that for all x, y ∈ E it holds that δ(x, y) = δ(y, x) establish item (iv).
Observe that items (i) and (iv), the triangle inequality, and the assumption that ∀ x ∈
D, y ∈ M : |f (x) − f (y)| ≤ Lδ(x, y) ensure that for all x ∈ D it holds that
This establishes item (v). The proof of Proposition 4.1.3 is thus complete.
. Then
(iii) it holds for all x, y ∈ E that |F (x) − F (y)| ≤ Lδ(x, y), and
129
Chapter 4: Multi-dimensional ANN approximation results
Proof of Corollary 4.1.4. Note that Proposition 4.1.3 establishes items (i), (ii), (iii), and
(iv). The proof of Corollary 4.1.4 is thus complete.
Exercise 4.1.1. Prove or disprove the following statement: There exists Φ ∈ N such that
I(Φ) = 2, O(Φ) = 1, P(Φ) ≤ 3 000 000 000, and
and
130
4.2. ANN representations
Proof of Proposition 4.2.2. Observe that the fact that D(L1 ) = (1, 2, 1) and Lemma 2.2.2
show that
D(Pd (L1 , L1 , . . . , L1 )) = (d, 2d, d) (4.18)
(cf. Definitions 1.3.1, 2.2.1, and 4.2.1). Combining this, Proposition 2.1.2, and Lemma 2.3.2
ensures that
(4.19)
D(Ld ) = D S1,d • Pd (L1 , L1 , . . . , L1 ) = (d, 2d, 1)
(cf. Definitions 2.1.1 and 2.4.1). This establishes item (i). Note that (4.17) assures that for
all x ∈ R it holds that
(RN
r (L1 ))(x) = r(x) + r(−x) = max{x, 0} + max{−x, 0} = |x| = ∥x∥1 (4.20)
(cf. Definitions 1.2.4, 1.3.4, and 3.3.4). Combining this and Proposition 2.2.3 shows that for
all x = (x1 , . . . , xd ) ∈ Rd it holds that
RN (4.21)
r (Pd (L1 , L1 , . . . , L1 )) (x) = (|x1 |, |x2 |, . . . , |xd |).
This and Lemma 2.4.2 demonstrate that for all x = (x1 , . . . , xd ) ∈ Rd it holds that
(RN N
r (Ld ))(x) = Rr (S1,d • Pd (L1 , L1 , . . . , L1 )) (x)
d (4.22)
= RN
P
r (S1,d ) (|x 1 |, |x 2 |, . . . , |x d |) = |x k | = ∥x∥ 1 .
k=1
This establishes items (ii) and (iii). The proof of Proposition 4.2.2 is thus complete.
Lemma 4.2.3. Let d ∈ N. Then
(i) it holds that B1,Ld = 0 ∈ R2d ,
(ii) it holds that B2,Ld = 0 ∈ R,
(iii) it holds that W1,Ld ∈ {−1, 0, 1}(2d)×d ,
(iv) it holds for all x ∈ Rd that ∥W1,Ld x∥∞ = ∥x∥∞ , and
(v) it holds that W2,Ld = 1 1 · · · 1 ∈ R1×(2d)
(cf. Definitions 1.3.1, 3.3.4, and 4.2.1).
Proof of Lemma 4.2.3. Throughout this proof, assume without loss of generality that d > 1.
Note that the fact that B1,L1 = 0 ∈ R2 , the fact that B2,L1 = 0 ∈ R, the fact that B1,S1,d
= 0 ∈ R, and the fact that Ld = S1,d • Pd (L1 , L1 , . . . , L1 ) establish items (i) and (ii) (cf.
Definitions 1.3.1, 2.1.1, 2.2.1, 2.4.1, and 4.2.1). In addition, observe that the fact that
W1,L1 0 ··· 0
0 W1,L1 · · · 0
1
W1,L1 = and W1,Ld = .. .. ... .. ∈ R
(2d)×d
(4.23)
−1 . . .
0 0 · · · W1,L1
131
Chapter 4: Multi-dimensional ANN approximation results
proves item (iii). Next note that (4.23) implies item (iv). Moreover, note that the fact that
W2,L1 = (1 1) and the fact that Ld = S1,d • Pd (L1 , L1 , . . . , L1 ) show that
1 ∈ R1×(2d) .
= 1 1 ···
This establishes item (v). The proof of Lemma 4.2.3 is thus complete.
Exercise 4.2.1. Let d = 9, S = {(1, 3), (3, 5)}, V = (Vr,k )(r,k)∈S ∈ × (r,k)∈S
Rd×d satisfy
V1,3 = V3,5 = Id , let Ψ ∈ N satisfy
(RR
r (Φ))(x) (4.27)
explicitly and prove that your result is correct (cf. Definitions 1.2.4 and 1.5.4)!
132
4.2. ANN representations
(v) it holds for all d ∈ {2, 3, 4, . . .} that ϕ2d = ϕd • Pd (ϕ2 , ϕ2 , . . . , ϕ2 ) , and
(vi) it holds for all d ∈ {2, 3, 4, . . .} that ϕ2d−1 = ϕd • Pd (ϕ2 , ϕ2 , . . . , ϕ2 , I1 )
(cf. Definitions 1.3.1, 2.1.1, 2.2.1, 2.2.6, and 2.3.1).
Proof of Lemma 4.2.4. Throughout this proof, let ψ ∈ N satisfy
1 −1 0
1 , 0, 1 1 −1 , 0 ∈ (R3×2 × R3 ) × (R1×3 × R1 ) (4.29)
ψ= 0
0 −1 0
(cf. Definition 1.3.1). Observe that (4.29) and Lemma 2.2.7 demonstrate that
Lemma 2.2.2 and Lemma 2.2.7 therefore prove that for all d ∈ N ∩ (1, ∞) it holds that
133
Chapter 4: Multi-dimensional ANN approximation results
(v) it holds for all d ∈ {2, 3, 4, . . .} that M2d = Md • Pd (M2 , M2 , . . . , M2 ) , and
(vi) it holds for all d ∈ {2, 3, 4, . . .} that M2d−1 = Md • Pd (M2 , M2 , . . . , M2 , I1 )
(cf. Definitions 1.3.1, 2.1.1, 2.2.1, 2.2.6, and 2.3.1 and Lemma 4.2.4).
Definition 4.2.6 (Floor and ceiling of real numbers). We denote by ⌈·⌉ : R → Z and
⌊·⌋ : R → Z the functions which satisfy for all x ∈ R that
⌈x⌉ = min(Z ∩ [x, ∞)) and ⌊x⌋ = max(Z ∩ (−∞, x]). (4.35)
Exercise 4.2.2. Prove or disprove the following statement: For all n ∈ {3, 5, 7, . . . } it holds
that ⌈log2 (n + 1)⌉ = ⌈log2 (n)⌉.
Proposition 4.2.7 (Properties of fully-connected feedforward maxima ANNs). Let d ∈ N.
Then
(i) it holds that H(Md ) = ⌈log2 (d)⌉,
H(M2 ) = 1 (4.36)
(cf. Definitions 1.3.1 and 4.2.5). This and (2.44) demonstrate that for all d ∈ {2, 3, 4, . . .} it
holds that
(cf. Definitions 2.2.1 and 2.2.6). Combining this with Proposition 2.1.2 establishes that for
all d ∈ {3, 4, 5, . . .} it holds that
(cf. Definition 4.2.6). This assures that for all d ∈ {4, 6, 8, . . .} with H(Md/2 ) = ⌈log2 (d/2)⌉ it
holds that
134
4.2. ANN representations
Furthermore, note that (4.38) and the fact that for all d ∈ {3, 5, 7, . . .} it holds that
⌈log2 (d + 1)⌉ = ⌈log2 (d)⌉ ensure that for all d ∈ {3, 5, 7, . . .} with H(M⌈d/2⌉ ) = ⌈log2 (⌈d/2⌉)⌉
it holds that
H(Md ) = log2 (⌈d/2⌉) + 1 = log2 ((d+1)/2) + 1
(4.40)
= ⌈log2 (d + 1) − 1⌉ + 1 = ⌈log2 (d + 1)⌉ = ⌈log2 (d)⌉.
Combining this and (4.39) demonstrates that for all d ∈ {3, 4, 5, . . .} with ∀ k ∈ {2, 3, . . . ,
d − 1} : H(Mk ) = ⌈log2 (k)⌉ it holds that
The fact that H(M2 ) = 1 and induction hence establish item (i). Observe that the fact that
D(M2 ) = (2, 3, 1) assure that for all i ∈ N it holds that
(4.42)
2
Di (M2 ) ≤ 3 = 3 2i
.
Moreover, note that Proposition 2.1.2 and Lemma 2.2.2 imply that for all d ∈ {2, 3, 4, . . .},
i ∈ N it holds that (
3d :i=1
Di (M2d ) = (4.43)
Di−1 (Md ) : i ≥ 2
and (
3d − 1 :i=1
Di (M2d−1 ) = (4.44)
Di−1 (Md ) : i ≥ 2.
This assures that for all d ∈ {2, 4, 6, . . .} it holds that
D1 (Md ) = 3( 2d ) = 3 (4.45)
d
2
.
In addition, observe that (4.44) ensures that for all d ∈ {3, 5, 7, . . . } it holds that
D1 (Md ) = 3 2d − 1 ≤ 3 2d . (4.46)
This and (4.45) show that for all d ∈ {2, 3, 4, . . .} it holds that
D1 (Md ) ≤ 3 2d . (4.47)
Next note that (4.43) demonstrates that for all d ∈ {4, 6, 8, . . .}, i ∈ {2, 3, 4, . . .} with
Di−1 (Md/2 ) ≤ 3 ( /2) 2i−1 it holds that
1
d
1
= 3 2di . (4.48)
Di (Md ) = Di−1 (Md/2 ) ≤ 3 (d/2) 2i−1
135
Chapter 4: Multi-dimensional ANN approximation results
Furthermore,
d+1 observe that (4.44) and the fact that for all d ∈ {3, 5, 7, . . .}, i ∈ N it holds
that d
assure that for all d ∈ {3, 5, 7, . . .}, i ∈ {2, 3, 4, . . .} with Di−1 (M⌈d/2⌉ ) ≤
2i
= 2i
3 ⌈ /2⌉ 2i−1 it holds that
1
d
1
= 3 d+1 = 3 2di . (4.49)
Di (Md ) = Di−1 (M⌈d/2⌉ ) ≤ 3 ⌈d/2⌉ 2i−1 2i
This and (4.48) ensure that for all d ∈ {3, 4, 5, . . .}, i ∈ {2, 3, 4, . . .} with ∀ k ∈ {2, 3, . . . , d −
1}, j ∈ {1, 2, . . . , i − 1} : Dj (Mk ) ≤ 3 2kj it holds that
Combining this, (4.42), and (4.47) with induction establishes item (ii). Note that (4.34)
ensures that for all x = (x1 , x2 ) ∈ R2 it holds that
(RN
r (M2 ))(x) = max{x1 − x2 , 0} + max{x2 , 0} − max{−x2 , 0}
(4.51)
= max{x1 − x2 , 0} + x2 = max{x1 , x2 }
(cf. Definitions 1.2.4 and 1.3.4). Proposition 2.2.3, Proposition 2.1.2, Lemma 2.2.7, and
induction hence imply that for all d ∈ {2, 3, 4, . . .}, x = (x1 , x2 , . . . , xd ) ∈ Rd it holds that
RN d
and RN (4.52)
r (Md ) ∈ C(R , R) r (Md ) (x) = max{x1 , x2 , . . . , xd }.
This establishes items (iii) and (iv). The proof of Proposition 4.2.7 is thus complete.
Lemma 4.2.8. Let d ∈ N, i ∈ {1, 2, . . . , L(Md )} (cf. Definitions 1.3.1 and 4.2.5). Then
(ii) it holds that Wi,Md ∈ {−1, 0, 1}Di (Md )×Di−1 (Md ) , and
Proof of Lemma 4.2.8. Throughout this proof, assume without loss of generality that d > 2
(cf. items (iii) and (iv) in Definition 4.2.5) and let A1 ∈ R3×2 , A2 ∈ R1×3 , C1 ∈ R2×1 ,
C2 ∈ R1×2 satisfy
1 −1
1
and
A1 = 0 1 ,
A2 = 1 1 −1 , C1 = , C2 = 1 −1 .
−1
0 −1
(4.53)
136
4.2. ANN representations
Note that items (iv), (v), and (vi) in Definition 4.2.5 assure that for all d ∈ {2, 3, 4, . . .} it
holds that
A1 0 · · · 0 0
0 A1 · · · 0 A1 0 · · · 0
0
0 A1 · · · 0
W1,M2d−1 = ... .. . . . ..
. . .. ., W1,M2d = .. .. . . .. ,
. . . .
(4.54)
0 0 · · · A1 0
0 0 · · · A1
0 0 ··· 0 C1 | {z }
∈R(3d)×(2d)
| {z }
∈R(3d−1)×(2d−1)
B1,M2d−1 = 0 ∈ R3d−1 , and B1,M2d = 0 ∈ R3d .
This and (4.53) proves item (iii). Furthermore, note that (4.54) and item (iv) in Defini-
tion 4.2.5 imply that for all d ∈ {2, 3, 4, . . .} it holds that B1,Md = 0. Items (iv), (v), and
(vi) in Definition 4.2.5 hence ensure that for all d ∈ {2, 3, 4, . . .} it holds that
A2 0 · · · 0 0
0 A2 · · · 0 0 A2 0 · · · 0
0 A2 · · · 0
W2,M2d−1 = W1,Md ... .. . . .. .. ,
. . . . W = W 1,Md .. .. . . .. ,
2,M2d
. . . .
0 0 · · · A2 0
0 0 · · · A2
0 0 · · · 0 C2 | {z }
∈Rd×(3d)
| {z }
∈Rd×(3d−1)
B2,M2d−1 = B1,Md = 0, and B2,M2d = B1,Md = 0.
(4.55)
Combining this and item (iv) in Definition 4.2.5 shows that for all d ∈ {2, 3, 4, . . .} it holds
that B2,Md = 0. Moreover, note that (2.2) demonstrates that for all d ∈ {2, 3, 4, . . . , },
i ∈ {3, 4, . . . , L(Md ) + 1} it holds that
This, (4.53), (4.54), (4.55), the fact that for all d ∈ {2, 3, 4, . . .} it holds that B2,Md = 0, and
induction establish items (i) and (ii). The proof of Lemma 4.2.8 is thus complete.
(4.57)
Φ = MK • A−L IK ,y • PK Ld • AId ,−x1 , Ld • AId ,−x2 , . . . , Ld • AId ,−xK • Td,K
(cf. Definitions 1.3.1, 1.5.5, 2.1.1, 2.2.1, 2.3.1, 2.4.6, 4.2.1, and 4.2.5). Then
137
Chapter 4: Multi-dimensional ANN approximation results
Proof of Lemma 4.2.9. Throughout this proof, let Ψk ∈ N, k ∈ {1, 2, . . . , K}, satisfy for
all k ∈ {1, 2, . . . , K} that Ψk = Ld • AId ,−xk , let Ξ ∈ N satisfy
(4.58)
Ξ = A−L IK ,y • PK Ψ1 , Ψ2 , . . . , ΨK • Td,K ,
and let ~·~ : m,n∈N Rm×n → [0, ∞) satisfy for all m, n ∈ N, M = (Mi,j )i∈{1,...,m}, j∈{1,...,n} ∈
S
Rm×n that ~M ~ = maxi∈{1,...,m}, j∈{1,...,n} |Mi,j |. Observe that (4.57) and Proposition 2.1.2
ensure that O(Φ) = O(MK ) = 1 and I(Φ) = I(Td,K ) = d. This proves items (i) and (ii).
Moreover, observe that the fact that for all m, n ∈ N, W ∈ Rm×n , B ∈ Rm it holds that
H(AW,B ) = 0 = H(Td,K ), the fact that H(Ld ) = 1, and Proposition 2.1.2 assure that
(cf. Definition 4.2.6). This establishes item (iii). Next observe that the fact that H(Ξ) = 1,
Proposition 2.1.2, and Proposition 4.2.7 assure that for all i ∈ {2, 3, 4, . . .} it holds that
(4.61)
K
Di (Φ) = Di−1 (MK ) ≤ 3 2i−1 .
This proves item (v). Furthermore, note that Proposition 2.1.2, Proposition 2.2.4, and
Proposition 4.2.2 assure that
K
X K
X
D1 (Φ) = D1 (Ξ) = D1 (PK (Ψ1 , Ψ2 , . . . , ΨK )) = D1 (Ψi ) = D1 (Ld ) = 2dK. (4.62)
i=1 i=1
138
4.2. ANN representations
This establishes item (iv). Moreover, observe that (2.2) and Lemma 4.2.8 imply that
Φ = (W1,Ξ , B1,Ξ ), (W1,MK W2,Ξ , W1,MK B2,Ξ ),
(4.63)
(W2,MK , 0), . . . , (WL(MK ),MK , 0) .
Next note that the fact that for all k ∈ {1, 2, . . . , K} it holds that W1,Ψk = W1,AId ,−xk W1,Ld =
W1,Ld assures that
W1,Ψ1 0 ··· 0 Id
0 W1,Ψ2 · · · 0 I d
W1,Ξ = W1,PK (Ψ1 ,Ψ2 ,...,ΨK ) W1,Td,K = .. .. . .. ..
. . . . . .
0 0 · · · W1,ΨK Id
(4.64)
W1,Ψ1 W1,Ld
W1,Ψ W1,L
2 d
= .. = .. .
. .
W1,ΨK W1,Ld
Lemma 4.2.3 hence demonstrates that ~W1,Ξ ~ = 1. In addition, note that (2.2) implies
that
B1,Ψ1
B1,Ψ
2
B1,Ξ = W1,PK (Ψ1 ,Ψ2 ,...,ΨK ) B1,Td,K + B1,PK (Ψ1 ,Ψ2 ,...,ΨK ) = B1,PK (Ψ1 ,Ψ2 ,...,ΨK ) = .. .
.
B1,ΨK
(4.65)
Furthermore, observe that Lemma 4.2.3 implies that for all k ∈ {1, 2, . . . , K} it holds that
(cf. Definition 3.3.4). Combining this, (4.63), Lemma 4.2.8, and the fact that ~W1,Ξ ~ = 1
shows that
∥T (Φ)∥∞ = max{~W1,Ξ ~, ∥B1,Ξ ∥∞ , ~W1,MK W2,Ξ ~, ∥W1,MK B2,Ξ ∥∞ , 1}
(4.68)
= max 1, maxk∈{1,2,...,K} ∥xk ∥∞ , ~W1,MK W2,Ξ ~, ∥W1,MK B2,Ξ ∥∞
(cf. Definition 1.3.5). Next note that Lemma 4.2.3 ensures that for all k ∈ {1, 2, . . . , K} it
holds that B2,Ψk = B2,Ld = 0. Hence, we obtain that B2,PK (Ψ1 ,Ψ2 ,...,ΨK ) = 0. This implies
that
B2,Ξ = W1,A−L IK ,y B2,PK (Ψ1 ,Ψ2 ,...,ΨK ) + B1,A−L IK ,y = B1,A−L IK ,y = y. (4.69)
139
Chapter 4: Multi-dimensional ANN approximation results
In addition, observe that the fact that for all k ∈ {1, 2, . . . , K} it holds that W2,Ψk = W2,Ld
assures that
W2,Ξ = W1,A−L IK ,y W2,PK (Ψ1 ,Ψ2 ,...,ΨK ) = −LW2,PK (Ψ1 ,Ψ2 ,...,ΨK )
W2,Ψ1 0 ··· 0 −LW2,Ld 0 ··· 0
0 W2,Ψ2 · · · 0 0 −LW2,Ld · · · 0
= −L .. .. .. .. = .. .. .. .. .
. . . . . . . .
0 0 · · · W2,ΨK 0 0 · · · −LW2,Ld
(4.70)
Item (v) in Lemma 4.2.3 and Lemma 4.2.8 hence imply that
Combining this with (4.68) and (4.71) establishes item (vi). Next observe that Proposi-
tion 4.2.2 and Lemma 2.3.3 show that for all x ∈ Rd , k ∈ {1, 2, . . . , K} it holds that
(RN N N
(4.73)
r (Ψk ))(x) = Rr (Ld ) ◦ Rr (AId ,−xk ) (x) = ∥x − xk ∥1 .
This, Proposition 2.2.3, and Proposition 2.1.2 imply that for all x ∈ Rd it holds that
RN (4.74)
r (PK (Ψ1 , Ψ2 , . . . , ΨK ) • Td,K ) (x) = ∥x − x1 ∥1 , ∥x − x2 ∥1 , . . . , ∥x − xK ∥1 .
(cf. Definitions 1.2.4 and 1.3.4). Combining this and Lemma 2.3.3 establishes that for all
x ∈ Rd it holds that
(RN N N
r (Ξ))(x) = Rr (A−L IK ,y ) ◦ Rr (PK (Ψ1 , Ψ2 , . . . , ΨK ) • Td,K ) (x)
(4.75)
= y1 − L∥x − x1 ∥1 , y2 − L∥x − x2 ∥1 , . . . , yK − L∥x − xK ∥1 .
Proposition 2.1.2 and Proposition 4.2.7 hence demonstrate that for all x ∈ Rd it holds that
(RN N N
r (Φ))(x) = Rr (MK ) ◦ Rr (Ξ) (x)
= (RN
r (MK )) y1 − L∥x − x1 ∥1 , y2 − L∥x − x2 ∥1 , . . . , yK − L∥x − xK ∥1
= maxk∈{1,2,...,K} (yk − L∥x − xk ∥1 ).
(4.76)
This establishes item (vii). The proof of Lemma 4.2.9 is thus complete.
140
4.3. ANN approximations results for multi-dimensional functions
141
Chapter 4: Multi-dimensional ANN approximation results
Lemma 4.3.3. Let (E, δ) be a metric space and let r ∈ [0, ∞]. Then
0 :X=∅
inf n ∈ N : ∃ x1 , x2 , . . . , xn ∈ E :
C (E,δ),r =
n : X ̸= ∅
S
E⊆ {v ∈ E : d(xm , v) ≤ r} ∪ {∞}
m=1
(4.83)
(cf. Definition 4.3.2).
Proof of Lemma 4.3.3. Throughout this proof, assume without loss of generality that E ̸=
∅. Observe that Lemma 12.2.4 establishes (4.83). The proof of Lemma 4.3.3 is thus
complete.
Exercise 4.3.2. Prove or disprove the following statement: For every metric space (X, d),
every Y ⊆ X, and every r ∈ [0, ∞] it holds that C (Y,d|Y ×Y ),r ≤ C (X,d),r .
Exercise 4.3.3. Prove or disprove the following statement: For every metric space (E, δ) it
holds that C (E,δ),∞ = 1.
Exercise 4.3.4. Prove or disprove the following statement: For every metric space (E, δ)
and every r ∈ [0, ∞) with C (E,δ),r < ∞ it holds that E is bounded. (Note: A metric space
(E, δ) is bounded if and only if there exists r ∈ [0, ∞) such that it holds for all x, y ∈ E
that δ(x, y) ≤ r.)
Exercise 4.3.5. Prove or disprove the following statement: For every bounded metric space
(E, δ) and every r ∈ [0, ∞] it holds that C (E,δ),r < ∞.
Lemma 4.3.4. Let d ∈ N, a ∈ R, b ∈ (a, ∞), r ∈ (0, ∞) and for every p ∈ [1, ∞) let
δp : ([a, b]d ) × ([a, b]d ) → [0, ∞) satisfy for all x, y ∈ [a, b]d that δp (x, y) = ∥x − y∥p (cf.
Definition 3.3.4). Then it holds for all p ∈ [1, ∞) that
(
d
l 1/p md 1 : r ≥ d(b−a)/2
C ([a,b] ,δp ),r ≤ d (b−a)
2r
≤ d(b−a) d
(4.84)
r
: r < d(b−a)/2.
Proof of Lemma 4.3.4. Throughout this proof, let (Np )p∈[1,∞) ⊆ N satisfy for all p ∈ [1, ∞)
that l 1/p m
Np = d (b−a)
2r
, (4.85)
142
4.3. ANN approximations results for multi-dimensional functions
|x − gN,i | = a + (i−1/2)(b−a)
N
−x≤a+ (i−1/2)(b−a)
N
− a+ (i−1)(b−a)
N
= b−a
2N
. (4.88)
In addition, note that it holds for all N ∈ N, i ∈ {1, 2, . . . , N }, x ∈ [gN,i , a + i(b−a)/N ] that
|x − gN,i | = x − a + (i−1/2)(b−a)
N
≤a+ i(b−a)
N
− a+ (i−1/2)(b−a)
N
= b−a
2N
. (4.89)
|x − y| ≤ b−a
2N
. (4.90)
This establishes that for every p ∈ [1, ∞), x = (x1 , x2 , . . . , xd ) ∈ [a, b]d there exists
y = (y1 , y2 , . . . , yd ) ∈ Ap such that
d
1/p d
1/p
1/p 1
(b−a)p d /p (b−a)2r
p d (b−a)
= r. (4.91)
P P
δp (x, y) = ∥x − y∥p = |xi − yi | ≤ (2Np )p
= 2Np
≤ 2d1/p (b−a)
i=1 i=1
Combining this with (4.82), (4.87), (4.85), and the fact that ∀ x ∈ [0, ∞) : ⌈x⌉ ≤ 1(0,1] (x) +
2x1(1,∞) (x) = 1(0,r] (rx) + 2x1(r,∞) (rx) yields that for all p ∈ [1, ∞) it holds that
d ,δ
l 1/p
md d(b−a) d
d (b−a)
C ([a,b] p ),r
≤ |Ap | = (Np )d = 2r
≤ 2r
≤ 1(0,r] d(b−a)
+ 2d(b−a)
1 d(b−a)
d (4.92)
2 2r (r,∞) 2
143
Chapter 4: Multi-dimensional ANN approximation results
(RN (4.93)
r (F))(x) = f (a+b)/2, (a+b)/2, . . . , (a+b)/2 .
The fact that for all x ∈ [a, b] it holds that |x − (a+b)/2| ≤ (b−a)/2 and the assumption that
for all x, y ∈ [a, b]d it holds that |f (x) − f (y)| ≤ L∥x − y∥1 hence ensure that for all
x = (x1 , x2 , . . . , xd ) ∈ [a, b]d it holds that
|(RN
r (F))(x) − f (x)| = |f
(a+b)/2, (a+b)/2, . . . , (a+b)/2 − f (x)|
≤ L (a+b)/2, (a+b)/2, . . . , (a+b)/2 − x 1
(4.94)
d d
P P L(b−a) dL(b−a)
= L |(a+b)/2 − xi | ≤ 2
= 2
.
i=1 i=1
This and the fact that ∥T (F)∥∞ = |f ((a+b)/2, (a+b)/2, . . . , (a+b)/2)| ≤ supx∈[a,b]d |f (x)| complete
the proof of Lemma 4.3.5.
Proposition 4.3.6. Let d ∈ N, L, a ∈ R, b ∈ (a, ∞), r ∈ (0, d/4), let f : [a, b]d → R and
δ : [a, b]d × [a, b]d → R satisfy for all x, y ∈ [a, b]d that |f (x) − f (y)| ≤ L∥x − y∥1 and
δ(x, y) = ∥x − y∥1 , and let K ∈ N, x1 , x2 , . . . ,xK ∈ [a, b]d , y ∈ RK , F ∈ N satisfy K =
d
C ([a,b] ,δ),(b−a)r , supx∈[a,b]d mink∈{1,2,...,K} δ(x, xk ) ≤ (b − a)r, y = (f (x1 ), f (x2 ), . . . , f (xK )),
and
(4.95)
F = MK • A−L IK ,y • PK Ld • AId ,−x1 , Ld • AId ,−x2 , . . . , Ld • AId ,−xK • Td,K
(cf. Definitions 1.3.1, 1.5.5, 2.1.1, 2.2.1, 2.3.1, 2.4.6, 3.3.4, 4.2.1, 4.2.5, and 4.3.2). Then
(i) it holds that I(F) = d,
144
4.3. ANN approximations results for multi-dimensional functions
3d d
(iv) it holds that D1 (F) ≤ 2d 4r
,
3d d 1
(v) it holds for all i ∈ {2, 3, 4, . . .} that Di (F) ≤ 3 4r 2i−1
,
3d 2d 2
(vi) it holds that P(F) ≤ 35 4r
d,
(vii) it holds that ∥T (F)∥∞ ≤ max{1, L, |a|, |b|, 2[supx∈[a,b]d |f (x)|]}, and
Proof of Proposition 4.3.6. Note that the assumption that for all x, y ∈ [a, b]d it holds that
|f (x) − f (y)| ≤ L∥x − y∥1 assures that L ≥ 0. Next observe that (4.95), Lemma 4.2.9, and
Proposition 4.3.1 demonstrate that
(VI) it holds that ∥T (F)∥∞ ≤ max{1, L, maxk∈{1,2,...,K} ∥xk ∥∞ , 2[maxk∈{1,2,...,K} |f (xk )|]},
and
(cf. Definitions 1.2.4, 1.3.4, 1.3.5, and 4.2.6). Note that items (I) and (II) establish items (i)
and (ii). Next observe that Lemma 4.3.4 and the fact that 2r d
≥ 2 imply that
l md d d d
d ,δ),(b−a)r d(b−a) 3d d (4.96)
K = C ([a,b] 3 d
≤ 2(b−a)r
= 2r
≤ (
2 2r
) = 4r
.
This establishes item (iii). Moreover, note that (4.96) and item (IV) imply that
3d d (4.98)
D1 (F) = 2dK ≤ 2d 4r
.
145
Chapter 4: Multi-dimensional ANN approximation results
This establishes item (iv). In addition, observe that item (V) and (4.96) establish item (v).
Next note that item (III) ensures that for all i ∈ N ∩ (1, H(F)] it holds that
K
2i−1
≥ K
2H(F)−1
= K
2⌈log2 (K)⌉
≥ K
2log2 (K)+1
= K
2K
= 12 . (4.99)
Item (V) and (4.96) hence show that for all i ∈ N ∩ (1, H(F)] it holds that
3d d 3 (4.100)
K 3K
Di (F) ≤ 3 2i−1 ≤ 2i−2
≤ 4r 2i−2
.
Furthermore, note that the fact that for all x ∈ [a, b]d it holds that ∥x∥∞ ≤ max{|a|, |b|}
and item (VI) imply that
This establishes item (vii). Moreover, observe that the assumption that
(4.102)
supx∈[a,b]d mink∈{1,2,...,K} δ(x, xk ) ≤ (b − a)r
supx∈[a,b]d |(RN
r (F))(x) − f (x)| ≤ 2L supx∈[a,b]d mink∈{1,2,...,K} δ(x, xk ) ≤ 2L(b − a)r.
(4.103)
This establishes item (viii). It thus remains to prove item (vi). For this note that items (I)
and (II), (4.98), and (4.100) assure that
L(F)
X
P(F) = Di (F)(Di−1 (F) + 1)
i=1
d d d
≤ 2d 3d (d + 1) + 3d 3 2d 3d
4r 4r 4r
+1 (4.104)
L(F)−1
3d d 3 3d d 3 3d d
X
3
+ 4ri−22 i−3 + 1
4r 2
+
4r 2L(F)−3
+ 1.
i=3
3d d
d d d 3
(d + 1) + 3d 3 2d 3d + 1 + 3d
2d 4r 4r 4r 4r 2L(F)−3
+1
3d 2d 3 (4.105)
≤ 4r
2d(d + 1) + 3(2d + 1) + 21−3 + 1
3d 2d 2 3d 2d 2
≤ 4r
d (4 + 9 + 12 + 1) = 26 4r
d.
146
4.3. ANN approximations results for multi-dimensional functions
This establishes item (vi). The proof of Proposition 4.3.6 is thus complete.
Proposition 4.3.7. Let d ∈ N, L, a ∈ R, b ∈ (a, ∞), r ∈ (0, ∞) and let f : [a, b]d → R
satisfy for all x, y ∈ [a, b]d that |f (x) − f (y)| ≤ L∥x − y∥1 (cf. Definition 3.3.4). Then there
exists F ∈ N such that
(i) it holds that I(F) = d,
(vii) it holds that ∥T (F)∥∞ ≤ max{1, L, |a|, |b|, 2[supx∈[a,b]d |f (x)|]}, and
147
Chapter 4: Multi-dimensional ANN approximation results
(4.110)
supx∈[a,b]d mink∈{1,2,...,K} δ(x, xk ) ≤ (b − a)r.
Combining this with Proposition 4.3.6 establishes items (i), (ii), (iii), (iv), (v), (vi), (vii),
and (viii). The proof of Proposition 4.3.7 is thus complete.
Proposition 4.3.8 (Implicit multi-dimensional ANN approximations with prescribed error
tolerances and explicit parameter bounds). Let d ∈ N, L, a ∈ R, b ∈ [a, ∞), ε ∈ (0, 1] and
let f : [a, b]d → R satisfy for all x, y ∈ [a, b]d that
L(b − a) ̸= 0. (4.112)
Observe that (4.112) ensures that L ̸= 0 and a < b. Combining this with the assumption
that for all x, y ∈ [a, b]d it holds that
ensures that L > 0. Proposition 4.3.7 therefore ensures that there exists F ∈ N which
satisfies that
148
4.3. ANN approximations results for multi-dimensional functions
3dL(b−a) d 1
(V) it holds for all i ∈ {2, 3, 4, . . .} that Di (F) ≤ 3 ,
2ε 2i−1
(VII) it holds that ∥T (F)∥∞ ≤ max{1, L, |a|, |b|, 2[supx∈[a,b]d |f (x)|]}, and
(cf. Definitions 1.2.4, 1.3.1, 1.3.4, 1.3.5, and 4.2.6). Note that item (III) assures that
3d max{L(b−a),1} d
1 ε
+ 1[d/4,∞) ε
D1 (F) ≤ d (0, d/4)
ε 2L(b−a) 2L(b−a)
(4.115)
−d d
≤ ε d(3d max{L(b − a), 1}) .
Moreover, note that item (V) establishes that for all i ∈ {2, 3, 4, . . . } it holds that
3dL(b−a) d 1 (3dL(b−a))d
+ 1 ≤ ε−d 3 (4.116)
Di (F) ≤ 3 2ε 2i−1 2i
+1 .
3d max{L(b−a),1} 2d 2
1 ε
+ (d + 1)1[d/4,∞) ε
P(F) ≤ 9 ε
d (0,d/4) 2L(b−a) 2L(b−a)
2d 2 (4.117)
−2d
≤ε 9 3d max{L(b − a), 1} d.
Combining this, (4.114), (4.115), and (4.116) with items (I), (II), (VII), and (VIII) estab-
lishes items (i), (ii), (iii), (iv), (v), (vi), (vii), and (viii). The proof of Proposition 4.3.8 is
thus complete.
149
Chapter 4: Multi-dimensional ANN approximation results
(cf. Definition 3.3.4). Then there exist C ∈ R such that for all ε ∈ (0, 1] there exists F ∈ N
such that
H(F) ≤ C(log2 (ε−1 ) + 1), ∥T (F)∥∞ ≤ max 1, L, |a|, |b|, 2 supx∈[a,b]d |f (x)| , (4.119)
RN d
r (F) ∈ C(R , R), supx∈[a,b]d |(RN
r (F))(x) − f (x)| ≤ ε, and P(F) ≤ Cε−2d (4.120)
(cf. Definitions 1.2.4, 1.3.1, 1.3.4, and 1.3.5).
Note that items (i), (ii), (iii), (vi), (vii), and (viii) in Proposition 4.3.8 and the fact that for
all ε ∈ (0, 1] it holds that
≤ C(log2 (ε−1 ) + 1)
(4.122)
H(F) ≤ C(log2 (ε−1 ) + 1), ∥T (F)∥∞ ≤ max 1, L, |a|, |b|, 2 supx∈[a,b]d |f (x)| , (4.123)
RN d
r (F) ∈ C(R , R), supx∈[a,b]d |(RN
r (F))(x)−f (x)| ≤ ε, and P(F) ≤ Cε−2d (4.124)
(cf. Definitions 1.2.4, 1.3.1, 1.3.4, and 1.3.5). The proof of Corollary 4.3.9 is thus complete.
Lemma 4.3.10 (Explicit estimates for vector norms). Let d ∈ N, p, q ∈ (0, ∞] satisfy
p ≤ q. Then it holds for all x ∈ Rd that
150
4.3. ANN approximations results for multi-dimensional functions
Proof of Lemma 4.3.10. Throughout this proof, assume without loss of generality that
q < ∞, let e1 , e2 , . . . , ed ∈ Rd satisfy e1 = (1, 0, . . . , 0), e2 = (0, 1, 0, . . . , 0), . . . , ed =
(0, . . . , 0, 1), let r ∈ R satisfy
r = p−1 q, (4.126)
and let x = (x1 , . . . , xd ), y = (y1 , . . . , yd ) ∈ Rd satisfy for all i ∈ {1, 2, . . . , d} that
yi = |xi |p . (4.127)
(4.130)
d
1/p " d
#1/p " d
#1/p " d
#1/p
X X X X
= yi ei ≤ ∥yi ei ∥r = |yi |∥ei ∥r = |yi |
i=1 r i=1 i=1 i=1
1/p
= ∥y∥1 = ∥x∥p .
(cf. Definition 3.3.4). Then there exist C ∈ R such that for all ε ∈ (0, 1] there exists F ∈ N
such that
RN d
r (F) ∈ C(R , R), supx∈[a,b]d |(RN
r (F))(x) − f (x)| ≤ ε, and P(F) ≤ Cε−2d (4.132)
151
Chapter 4: Multi-dimensional ANN approximation results
θ,l
Then we denote by Nu,v : Rl0 → RlL the function which satisfies for all x ∈ Rl0 that
( θ,l0
NCu,v,l (x) :L=1
Nu,vθ,l
(x) = L
(4.134)
NRθ,ll 0,Rl ,...,Rl ,Cu,v,l (x) : L > 1
1 2 L−1 L
152
4.4. Refined ANN approximations results for multi-dimensional functions
(cf. Definition 1.3.4). Furthermore, observe that the assumption that for all k ∈ {1, 2, . . . , l},
i ∈ {1, 2, . . . , lk }, j ∈ N ∩ (lk−1 , lk−1 + 1) it holds that Wk,i,j = 0 shows that for all
k ∈ {1, 2, . . . , L}, x = (x1 , x2 , . . . , xlk−1 ) ∈ Rlk−1 it holds that
πk (Wk x + Bk )
" lk−1 # " lk−1 # " lk−1 # !
X X X
= Wk,1,i xi + Bk,1 , Wk,2,i xi + Bk,2 , . . . , Wk,lk ,i xi + Bk,lk
i=1 i=1 i=1 (4.140)
" lk−1 # " lk−1 # " lk−1 # !
X X X
= Wk,1,i xi + Bk,1 , Wk,2,i xi + Bk,2 , . . . , Wk,lk ,i xi + Bk,lk .
i=1 i=1 i=1
Combining this with the assumption that for all k ∈ {1, 2, . . . , L}, i ∈ {1, 2, . . . , lk }, j ∈
N∩(0, lk−1 ] it holds that Wk,i,j = Wk,i,j and Bk,i = Bk,i ensures that for all k ∈ {1, 2, . . . , L},
x = (x1 , x2 , . . . , xlk−1 ) ∈ Rlk−1 it holds that
πk (Wk x + Bk )
" lk−1 # " lk−1 # " lk−1 # !
(4.141)
X X X
= Wk,1,i xi + Bk,1 , Wk,2,i xi + Bk,2 , . . . , Wk,lk ,i xi + Bk,lk
i=1 i=1 i=1
= Wk πk−1 (x) + Bk .
153
Chapter 4: Multi-dimensional ANN approximation results
Hence, we obtain that for all x0 ∈ Rl0 , x1 ∈ Rl1 , . . . , xL−1 ∈ RlL−1 , k ∈ N ∩ (0, L) with
∀ m ∈ N ∩ (0, L) : xm = Ma,lm (Wm xm−1 + Bm ) it holds that
πk (xk ) = Ma,lk (πk (Wk xk−1 + Bk )) = Ma,lk (Wk πk−1 (xk−1 ) + Bk ) (4.142)
(cf. Definition 1.2.1). Induction, the assumption that l0 = l0 and lL = lL , and (4.141)
therefore imply that for all x0 ∈ Rl0 , x1 ∈ Rl1 , . . . , xL−1 ∈ RlL−1 with ∀ k ∈ N ∩ (0, L) : xk =
Ma,lk (Wk xk−1 + Bk ) it holds that
RN
a ((W1 , B1 ), (W2 , B2 ), . . . , (WL , BL )) (x0 )
= RN
a ((W1 , B1 ), (W2 , B2 ), . . . , (WL , BL )) (π0 (x0 ))
= WL πL−1 (xL−1 ) + BL (4.143)
= πL (WL xL−1 + BL ) = WL xL−1 + BL
= RN
a ((W1 , B1 ), (W2 , B2 ), . . . , (WL , BL )) (x0 ).
154
4.4. Refined ANN approximations results for multi-dimensional functions
(4.149)
∥T ((W1 , B1 ), (W2 , B2 ), . . . , (WL , BL )) ∥∞ = ∥T (Φ)∥∞
(cf. Definitions 1.3.5 and 3.3.4). Moreover, note that Lemma 4.4.3 proves that
RN N
a (Φ) = Ra ((W1 , B1 ), (W2 , B2 ), . . . , (WL , BL ))
(4.150)
= RN
a ((W1 , B1 ), (W2 , B2 ), . . . , (WL , BL ))
(4.151)
∥T (Φ1 • Φ2 )∥∞ ≤ max ∥T (Φ1 )∥∞ , ∥T (Φ2 )∥∞ , T ((W1 WL , W1 BL + B1 )) ∞
Proof of Lemma 4.4.5. Observe that (2.2) and Lemma 1.3.8 establish (4.151). The proof
of Lemma 4.4.5 is thus complete.
Lemma 4.4.6. Let d, L ∈ N, Φ ∈ N satisfy L ≥ L(Φ) and d = O(Φ) (cf. Definition 1.3.1).
Then
∥T (EL,Id (Φ))∥∞ ≤ max{1, ∥T (Φ)∥∞ } (4.152)
(cf. Definitions 1.3.5, 2.2.6, 2.2.8, and 3.3.4).
Proof of Lemma 4.4.6. Throughout this proof, assume without loss of generality that
L > L(Φ) and let l0 , l1 , . . . , lL−L(Φ)+1 ∈ N satisfy
Note that Lemma 2.2.7 shows that D(Id ) = (d, 2d, d) ∈ N3 (cf. Definition 2.2.6). Item (i)
in Lemma 2.2.9 hence ensures that
(cf. Definition 2.1.1). This implies that there exist Wk ∈ Rlk ×lk−1 , k ∈ {1, 2, . . . , L−L(Φ)+1},
and Bk ∈ Rlk , k ∈ {1, 2, . . . , L − L(Φ) + 1}, which satisfy
155
Chapter 4: Multi-dimensional ANN approximation results
Furthermore, observe that (2.44), (2.70), (2.71), (2.2), and (2.41) demonstrate that
1 0 ··· 0
−1 0 · · · 0
0
1 · · · 0
W1 = 0 −1 · · · 0 ∈ R(2d)×d
.. .. . . ..
. . . .
0 0 ··· 1 (4.156)
0 0 · · · −1
1 −1 0 0 · · · 0 0
0 0 1 −1 · · · 0 0
and WL−L(Φ)+1 = .. .. .. .. . . .. .. ∈ Rd×(2d) .
. . . . . . .
0 0 0 0 · · · 1 −1
Moreover, note that (2.44), (2.70), (2.71), (2.2), and (2.41) prove that for all k ∈ N ∩ (1, L −
L(Φ) + 1) it holds that
1 0 ··· 0
−1 0 · · · 0
1 −1 0 0 · · · 0 0
0 1 ··· 0
0 −1 · · · 0 0 0 1 −1 · · · 0 0
Wk = . . . . . . .. ..
.. .. . . .. .. .. .. .. . . .
. . . .
0 0 0 0 · · · 1 −1
0 0 ··· 1 | {z }
0 0 · · · −1 ∈Rd×(2d)
(4.157)
| {z }
∈R(2d)×d
1 −1 0 0 ··· 0 0
−1 1 0 0 · · · 0 0
0
0 1 −1 · · · 0 0
= 0
0 −1 1 · · · 0 0 ∈ R(2d)×(2d) .
.. .. .. .. . . . ..
. . . . . .. .
0 0 0 0 · · · 1 −1
0 0 0 0 · · · −1 1
In addition, observe that (2.70), (2.71), (2.44), (2.41), and (2.2) establish that for all
k ∈ N ∩ [1, L − L(Φ)] it holds that
Bk = 0 ∈ R2d and BL−L(Φ)+1 = 0 ∈ Rd . (4.158)
Combining this, (4.156), and (4.157) shows that
T (Id )•(L−L(Φ)) (4.159)
∞
=1
156
4.4. Refined ANN approximations results for multi-dimensional functions
(cf. Definitions 1.3.5 and 3.3.4). Next note that (4.156) ensures that for all k ∈ N,
W = (wi,j )(i,j)∈{1,2,...,d}×{1,2,...,k} ∈ Rd×k it holds that
w1,1 w1,2 · · · w1,k
−w1,1 −w1,2 · · · −w1,k
w2,1
w 2,2 · · · w 2,k
W1 W = −w2,1 −w2,2 · · · −w2,k ∈ R(2d)×k . (4.160)
.. .. . . .
.
. . . .
wd,1 wd,2 · · · wd,k
−wd,1 −wd,2 · · · −wd,k
Furthermore, observe that (4.156) and (4.158) imply that for all B = (b1 , b2 , . . . , bd ) ∈ Rd
it holds that
1 0 ··· 0 b1
−1 0 · · · 0 −b1
b1
0
1 ··· 0 b2 b2
W1 B + B1 = 0 −1 · · · 0 .. = −b2 ∈ R2d . (4.161)
.. .. . . .. . ..
. . . .
.
b
0 ··· 1 d
0 bd
0 0 · · · −1 −bd
Combining this with (4.160) demonstrates that for all k ∈ N, W ∈ Rd×k , B ∈ Rd it holds
that
(4.162)
T ((W1 W, W1 B + B1 )) ∞ = T ((W, B)) ∞ .
This, Lemma 4.4.5, and (4.159) prove that
L ≥ L, l0 = l0 , and lL = lL , (4.164)
assume for all i ∈ N ∩ [0, L) that li ≥ li , assume for all i ∈ N ∩ (L − 1, L) that li ≥ 2lL , and
let Φ ∈ N satisfy D(Φ) = (l0 , l1 , . . . , lL ) (cf. Definition 1.3.1). Then there exists Ψ ∈ N
such that
157
Chapter 4: Multi-dimensional ANN approximation results
Proof of Lemma 4.4.7. Throughout this proof, let Ξ ∈ N satisfy Ξ = EL,IlL (Φ) (cf. Def-
initions 2.2.6 and 2.2.8). Note that item (i) in Lemma 2.2.7 establishes that D(IlL ) =
(lL , 2lL , lL ) ∈ N3 . Combining this with Lemma 2.2.11 shows that D(Ξ) ∈ NL+1 and
(
(l0 , l1 , . . . , lL ) :L=L
D(Ξ) = (4.166)
(l0 , l1 , . . . , lL−1 , 2lL , 2lL , . . . , 2lL , lL ) : L > L.
(cf. Definitions 1.3.5 and 3.3.4). Moreover, note that item (ii) in Lemma 2.2.7 implies that
for all x ∈ RlL it holds that
(RNr (IlL ))(x) = x (4.168)
(cf. Definitions 1.2.4 and 1.3.4). This and item (ii) in Lemma 2.2.10 prove that
RN N
r (Ξ) = Rr (Φ). (4.169)
In addition, observe that (4.166), the assumption that for all i ∈ [0, L) it holds that
l0 = l0 , lL = lL , and li ≤ li , the assumption that for all i ∈ N ∩ (L − 1, L) it holds
that li ≥ 2lL , and Lemma 4.4.4 (applied with a ↶ r, L ↶ L, (l0 , l1 , . . . , lL ) ↶ D(Ξ),
(l0 , l1 , . . . , lL ) ↶ (l0 , l1 , . . . , lL ), Φ ↶ Ξ in the notation of Lemma 4.4.4) demonstrate that
there exists Ψ ∈ N such that
Combining this with (4.167) and (4.169) proves (4.165). The proof of Lemma 4.4.7 is thus
complete.
and lL = lL , (4.171)
PL PL
d≥ i=1 li (li−1 + 1), d≥ i=1 li (li−1 + 1), L ≥ L, l0 = l0 ,
assume for all i ∈ N ∩ [0, L) that li ≥ li , and assume for all i ∈ N ∩ (L − 1, L) that li ≥ 2lL .
Then there exists ϑ ∈ Rd such that
158
4.4. Refined ANN approximations results for multi-dimensional functions
θ = (η1 , η2 , . . . , ηd ) (4.173)
and let Φ ∈ × L
Rli ×li−1 × Rli satisfy
i=1
(cf. Definitions 1.3.1 and 1.3.5). Note that Lemma 4.4.7 establishes that there exists Ψ ∈ N
which satisfies
(cf. Definitions 1.2.4, 1.3.4, and 3.3.4). Next let ϑ = (ϑ1 , ϑ2 , . . . , ϑd ) ∈ Rd satisfy
Furthermore, note that Lemma 4.4.2 and (4.174) ensure that for all x ∈ Rl0 it holds that
(cf. Definition 4.4.1). Moreover, observe that Lemma 4.4.2, (4.175), and (4.176) imply that
for all x ∈ Rl0 it holds that
ϑ,(l ,l ,...,lL ) T (Ψ),D(Ψ)
0 1
N−∞,∞ (x) = N−∞,∞ (x) = (RN
r (Ψ))(x). (4.179)
Combining this and (4.178) with (4.175) and the assumption that l0 = l0 and lL = lL
demonstrates that
θ,(l0 ,l1 ,...,lL ) ϑ,(l0 ,l1 ,...,lL )
N−∞,∞ = N−∞,∞ . (4.180)
(cf. Definition 1.2.10). This and (4.177) prove (4.172). The proof of Lemma 4.4.8 is thus
complete.
159
Chapter 4: Multi-dimensional ANN approximation results
K
assume for all i ∈ N∩(1, L) that li ≥ 3 2i−1 , let E ⊆ Rd be a set, let x1 , x2 , . . . , xK ∈ E, and
let f : E → R satisfy for all x, y ∈ E that |f (x) − f (y)| ≤ L∥x − y∥1 (cf. Definitions 3.3.4
and 4.2.6). Then there exists θ ∈ Rd such that
and
θ,l
(4.184)
supx∈E f (x) − N−∞,∞ (x) ≤ 2L supx∈E inf k∈{1,2,...,K} ∥x − xk ∥1
(cf. Definition 4.4.1).
(4.185)
Φ = MK • A−L IK ,y • PK Ld • AId ,−x1 , Ld • AId ,−x2 , . . . , Ld • AId ,−xK • Td,K
(cf. Definitions 1.3.1, 1.5.5, 2.1.1, 2.2.1, 2.3.1, 2.4.6, 4.2.1, and 4.2.5). Note that Lemma 4.2.9
and Proposition 4.3.1 establish that
(VI) it holds that ∥T (Φ)∥∞ ≤ max{1, L, maxk∈{1,2,...,K} ∥xk ∥∞ , 2 maxk∈{1,2,...,K} |f (xk )|}, and
(cf. Definitions 1.2.4, 1.3.4, and 1.3.5). Furthermore, observe that the fact that L ≥
⌈log2 (K)⌉ + 2 = L(Φ), the fact that l0 = d = D0 (Φ), the fact that l1 ≥ 2dK = D1 (Φ), the
fact that for all i ∈ {1, 2, . . . , L(Φ) − 1}\{1} it holds that li ≥ 3⌈ 2i−1
K
⌉ ≥ Di (Φ), the fact
that for all i ∈ N ∩ (L(Φ) − 1, L) it holds that li ≥ 3⌈ 2i−1
K
⌉ ≥ 2 = 2DL(Φ) (Φ), the fact that
lL = 1 = DL(Φ) (Φ), and Lemma 4.4.8 show that there exists θ ∈ Rd which satisfies that
160
4.4. Refined ANN approximations results for multi-dimensional functions
Moreover, note that (4.186), Lemma 4.4.2, and item (VII) imply that
θ,(l ,l ,...,lL )
0 1 T (Φ),D(Φ)
supx∈E f (x) − N−∞,∞ (x) = supx∈E f (x) − N−∞,∞ (x)
= supx∈E f (x) − (RN
r (Φ))(x) (4.188)
≤ 2L supx∈E inf k∈{1,2,...,K} ∥x − xk ∥1
(cf. Definition 4.4.1). The proof of Corollary 4.4.9 is thus complete.
Corollary 4.4.10. Let d, K, d, L ∈ N, l = (l0 , l1 , . . . , lL ) ∈ NL+1 , L ∈ [0, ∞), u ∈ [−∞, ∞),
v ∈ (u, ∞] satisfy that
L ≥ ⌈log2 K⌉ + 2, l0 = d, lL = 1, l1 ≥ 2dK, and d ≥ Li=1 li (li−1 + 1), (4.189)
P
K
assume for all i ∈ N ∩ (1, L) that li ≥ 3 2i−1 , let E ⊆ Rd be a set, let x1 , x2 , . . . , xK ∈ E,
and let f : E → ([u, v] ∩ R) satisfy for all x, y ∈ E that |f (x) − f (y)| ≤ L∥x − y∥1 (cf.
Definitions 3.3.4 and 4.2.6). Then there exists θ ∈ Rd such that
∥θ∥∞ ≤ max{1, L, maxk∈{1,2,...,K} ∥xk ∥∞ , 2 maxk∈{1,2,...,K} |f (xk )|} (4.190)
and
θ,l
(4.191)
supx∈E f (x) − Nu,v (x) ≤ 2L supx∈E inf k∈{1,2,...,K} ∥x − xk ∥1 .
(cf. Definition 4.4.1).
Proof of Corollary 4.4.10. Observe that Corollary 4.4.9 demonstrates that there exists
θ ∈ Rd such that
∥θ∥∞ ≤ max{1, L, maxk∈{1,2,...,K} ∥xk ∥∞ , 2 maxk∈{1,2,...,K} |f (xk )|} (4.192)
and
θ,l
(4.193)
supx∈E f (x) − N−∞,∞ (x) ≤ 2L supx∈E inf k∈{1,2,...,K} ∥x − xk ∥1 .
Furthermore, note that the assumption that f (E) ⊆ [u, v] proves that for all x ∈ E it holds
that
f (x) = cu,v (f (x)) (4.194)
(cf. Definitions 1.2.9 and 4.4.1). The fact that for all x, y ∈ R it holds that |cu,v (x)−cu,v (y)| ≤
|x − y| and (4.193) hence establish that
θ,l θ,l
supx∈E f (x) − Nu,v (x) = supx∈E |cu,v (f (x)) − cu,v (N−∞,∞ (x))|
θ,l
(4.195)
≤ supx∈E f (x) − N−∞,∞ (x) ≤ 2L supx∈E inf k∈{1,2,...,K} ∥x − xk ∥1 .
The proof of Corollary 4.4.10 is thus complete.
161
Chapter 4: Multi-dimensional ANN approximation results
f : [a, b]d → ([u, v] ∩ R) satisfy for all x, y ∈ [a, b]d that |f (x) − f (y)| ≤ L∥x − y∥1 (cf.
Definition 3.3.4). Then there exists ϑ ∈ Rd such that ∥ϑ∥∞ ≤ supx∈[a,b]d |f (x)| and
dL(b − a)
ϑ,l
supx∈[a,b]d |Nu,v (x) − f (x)| ≤ (4.196)
2
(cf. Definition 4.4.1).
Proof of Lemma 4.4.11. Throughout this proof, let d = + 1), let m = (m1 , m2 ,
PL
i=1 li (li−1
. . . , md ) ∈ [a, b]d satisfy for all i ∈ {1, 2, . . . , d} that
a+b
mi = , (4.197)
2
and let ϑ = (ϑ1 , ϑ2 , . . . , ϑd ) ∈ Rd satisfy for all i ∈ {1, 2, . . . , d}\{d} that ϑi = 0 and
ϑd = f (m). Observe that the assumption that lL = 1 and the fact that ∀ i ∈ {1, 2, . . . , d −
1} : ϑi = 0 show that for all x = (x1 , x2 , . . . , xlL−1 ) ∈ RlL−1 it holds that
lL−1
ϑ, L−1
P
i=1 li (li−1 +1)
P
A1,lL−1 (x) = ϑ[ L−1 li (li−1 +1)]+i xi + ϑ[PL−1 li (li−1 +1)]+lL−1 +1
P
i=1 i=1
i=1
lL−1
(4.198)
P
= ϑ[PL li (li−1 +1)]−(lL−1 −i+1) xi + ϑPL li (li−1 +1)
i=1 i=1
i=1
lL−1
P
= ϑd−(lL−1 −i+1) xi + ϑd = ϑd = f (m)
i=1
(cf. Definition 1.1.1). Combining this with the fact that f (m) ∈ [u, v] ensures that for all
x ∈ RlL−1 it holds that
ϑ, L−1 ϑ, L−1
P P
i=1 li (li−1 +1) i=1 li (li−1 +1)
Cu,v,lL ◦ AlL ,lL−1 (x) = Cu,v,1 ◦ A1,lL−1 (x)
= cu,v (f (m)) = max{u, min{f (m), v}} (4.199)
= max{u, f (m)} = f (m)
(cf. Definitions 1.2.9 and 1.2.10). This implies for all x ∈ Rd that
ϑ,l
Nu,v (x) = f (m). (4.200)
Furthermore, note that (4.197) demonstrates that for all x ∈ [a, m1 ], x ∈ [m1 , b] it holds
that
|m1 − x| = m1 − x = (a+b)/2 − x ≤ (a+b)/2 − a = (b−a)/2
(4.201)
and |m1 − x| = x − m1 = x − (a+b)/2 ≤ b − (a+b)/2 = (b−a)/2.
162
4.4. Refined ANN approximations results for multi-dimensional functions
The assumption that ∀ x, y ∈ [a, b]d : |f (x) − f (y)| ≤ L∥x − y∥1 and (4.200) therefore prove
that for all x = (x1 , x2 , . . . , xd ) ∈ [a, b]d it holds that
d
ϑ,l
P
|Nu,v (x) − f (x)| = |f (m) − f (x)| ≤ L∥m − x∥1 = L |mi − xi |
i=1
d L(b − a)
(4.202)
d
P P dL(b − a)
= L |m1 − xi | ≤ = .
i=1 i=1 2 2
This and the fact that ∥ϑ∥∞ = maxi∈{1,2,...,d} |ϑi | = |f (m)| ≤ supx∈[a,b]d |f (x)| establish
(4.196). The proof of Lemma 4.4.11 is thus complete.
and let f : [a, b]d → ([u, v] ∩ R) satisfy for all x, y ∈ [a, b]d that
(cf. Definitions 3.3.4 and 4.2.6). Then there exists ϑ ∈ Rd such that ∥ϑ∥∞ ≤ max{1, L, |a|,
allowbreakabsb, 2[supx∈[a,b]d |f (x)|]} and
3dL(b − a)
ϑ,l
supx∈[a,b]d |Nu,v (x) − f (x)| ≤ (4.206)
A1/d
(cf. Definition 4.4.1).
Proof of Proposition 4.4.12. Throughout this proof, assume without loss of generality that
A 1/d
A > 6d (cf. Lemma 4.4.11), let Z = ⌊ 2d ⌋ ∈ Z. Observe that the fact that for all k ∈ N
it holds that 2k ≤ 2(2 ) = 2 shows that 3d = 6d/2d ≤ A/(2d). Hence, we obtain that
k−1 k
1/d 1/d
2≤ 2 A
3 2d
≤ A
2d
− 1 < Z. (4.207)
In the next step let r = d(b−a)/2Z ∈ (0, ∞), let δ : [a, b]d ×[a, b]d → R satisfy for all x, y ∈ [a, b]d
that δ(x, y) = ∥x − y∥1 , and let K = max(2, C ([a,b] ,δ),r ) ∈ N ∪ {∞} (cf. Definition 4.3.2).
d
163
Chapter 4: Multi-dimensional ANN approximation results
(4.211)
supx∈[a,b]d inf k∈{1,2,...,K} δ(x, xk ) ≤ r.
Observe that (4.210), the assumptions that l0 = d, lL = 1, d ≥ Li=1 li (li−1 + 1), and
P
∀ x, y ∈ [a, b]d : |f (x) − f (y)| ≤ L∥x − y∥1 , and Corollary 4.4.10 establish that there exists
ϑ ∈ Rd such that
and
ϑ,l
supx∈[a,b]d |Nu,v (x) − f (x)| ≤ 2L supx∈[a,b]d inf k∈{1,2,...,K} ∥x − xk ∥1
(4.213)
= 2L supx∈[a,b]d inf k∈{1,2,...,K} δ(x, xk ) .
Furthermore, observe that (4.213), (4.207), (4.211), and the fact that for all k ∈ N it holds
that 2k ≤ 2(2k−1 ) = 2k ensure that
ϑ,l
supx∈[a,b]d |Nu,v (x) − f (x)| ≤ 2L supx∈[a,b]d inf k∈{1,2,...,K} δ(x, xk )
dL(b − a) dL(b − a) (2d)1/d 3dL(b − a) 3dL(b − a) (4.215)
≤ 2Lr = ≤ 1/d
= 1/d
≤ .
Z 2 A 2A A1/d
3 2d
Combining this with (4.214) implies (4.206). The proof of Proposition 4.4.12 is thus
complete.
Corollary 4.4.13. Let d ∈ N, a ∈ R, b ∈ (a, ∞), L ∈ (0, ∞) and let f : [a, b]d → R satisfy
for all x, y ∈ [a, b]d that
164
4.4. Refined ANN approximations results for multi-dimensional functions
(cf. Definition 3.3.4). Then there exist C ∈ R such that for all ε ∈ (0, 1] there exists F ∈ N
such that
H(F) ≤ max 0, d(log2 (ε−1 ) + log2 (d) + log2 (3L(b − a)) + 1) , (4.217)
RN d
(4.218)
∥T (F)∥∞ ≤ max 1, L, |a|, |b|, 2[supx∈[a,b]d |f (x)|] , r (F) ∈ C(R , R),
supx∈[a,b]d |(RN
r (F))(x) − f (x)| ≤ ε, and P(F) ≤ Cε−2d (4.219)
(cf. Definitions 1.2.4, 1.3.1, 1.3.4, and 1.3.5).
(cf. Definition 4.2.6). Observe that the fact that for all ε ∈ (0, 1] it holds that Lε ≥
1 + log2 A2dε + 1 1(6d ,∞) (Aε ), the fact that for all ε ∈ (0, 1] it holds that l0 = d,
(ε)
the fact that for all ε ∈ (0, 1] it holds that l1 ≥ Aε 1(6d ,∞) (Aε ), the fact that for all
(ε)
(ε)
ε ∈ (0, 1] it holds that lLε = 1, the fact that for all ε ∈ (0, 1], i ∈ {2, 3, . . . , Lε − 1}
it holds that li ≥ 3⌈ 2Aiεd ⌉1(6d ,∞) (Aε ), Proposition
(ε)
4.4.12, and Lemma
4.4.2 demonstrate
×
Lε (ε) (ε) (ε)
that for all ε ∈ (0, 1] there exists Fε ∈ i=1
Rli ×li−1
×R li
⊆ N which satisfies
∥T (Fε )∥∞ ≤ max{1, L, |a|, |b|, 2[supx∈[a,b]d |f (x)|]} and
3dL(b − a)
supx∈[a,b]d |(RN
r (Fε ))(x) − f (x)| ≤ = ε. (4.224)
(Aε )1/d
(cf. Definitions 1.2.4, 1.3.1, 1.3.4, and 1.3.5). Furthermore, note that the fact that d ≥ 1
proves that for all ε ∈ (0, 1] it holds that
165
Chapter 4: Multi-dimensional ANN approximation results
Combining this and the fact that for all ε ∈ (0, 1] it holds that
log2 (Aε ) = d log2 3dL(b−a) = d log2 (ε−1 ) + log2 (d) + log2 (3L(b − a)) (4.226)
ε
H(Fε ) ≤ max 0, d log2 (ε−1 ) + log2 (d) + log2 (3L(b − a)) + 1 . (4.227)
Moreover, observe that (4.222) and (4.223) show that for all ε ∈ (0, 1] it holds that
Lε
X (ε) (ε)
P(Fε ) = li (li−1 + 1)
i=1
≤ ⌊Aε ⌋ + 1 (d + 1) + 3 A4dε ⌊Aε ⌋ + 2
ε −1
L
X (4.228)
+ max ⌊Aε ⌋ + 1, 3 2LAε −1 3 2Aiεd (3 2i−1
ε Aε
d
+ 1 + d
+ 1)
i=3
L
Xε −1
Aε Aε 3Aε
≤ (Aε + 1)(d + 1) + 3 4
+ 1 Aε + 2 + 3Aε + 4 + 3 2i
+1 2i−1
+4 .
i=3
In addition, note that the fact that ∀ x ∈ (0, ∞) : log2 (x) = log2 (x/2) + 1 ≤ x/2 + 1 ensures
that for all ε ∈ (0, 1] it holds that
Lε ≤ 2 + log2 ( Adε ) ≤ 3 + Aε
2d
≤3+ Aε
2
. (4.229)
This and (4.228) demonstrate that for all ε ∈ (0, 1] it holds that
P(Fε ) ≤ ( 34 + 38 )(Aε )2 + (d + 1 + 29 + 3 + 27
)Aε +d+1+6+4
2
(4.231)
= 98 (Aε )2 + (d + 22)Aε + d + 11.
166
4.4. Refined ANN approximations results for multi-dimensional functions
Combining this with (4.224) and (4.227) establishes (4.217), (4.218), and (4.219). The
proof of Corollary 4.4.13 is thus complete.
Remark 4.4.14 (High-dimensional ANN approximation results). Corollary 4.4.13 above is a
multi-dimensional ANN approximation result in the sense that the input dimension d ∈ N
of the domain of definition [a, b]d of the considered target function f that we intend to
approximate can be any natural number. However, we note that Corollary 4.4.13 does
not provide a useful contribution in the case when the dimension d is large, say d ≥ 5, as
Corollary 4.4.13 does not provide any information on how the constant C in (4.219) grows
in d and as the dimension d appears in the exponent of the reciprocal ε−1 of the prescribed
approximation accuracy ε in the bound for the number of ANN parameters in (4.219).
In the literature there are also a number of suitable high-dimensional ANN approximation
results which assure that the constant in the parameter bound grows at most polynomially
in the dimension d and which assure that the exponent of the reciprocal ε−1 of the prescribed
approximation accuracy ε in the ANN parameter bound is completely independent of the
dimension d. Such results do have the potential to provide a useful practical conclusion for
ANN approximations even when the dimension d is large. We refer, for example, to [14, 15,
28, 70, 121, 160] and the references therein for such high-dimensional ANN approximation
results in the context of general classes of target functions and we refer, for instance, to [3,
29, 35, 123, 128, 161–163, 177, 179, 205, 209, 228, 259, 353] and the references therein for
such high-dimensional ANN approximation results where the target functions are solutions
of PDEs (cf. also Section 18.4 below).
Remark 4.4.15 (Infinite dimensional ANN approximation results). In the literature there
are now also results where the target function that we intend to approximate is defined on
an infinite dimensional vector space and where the dimension of the domain of definition
of the target function is thus infinity (see, for example, [32, 68, 69, 202, 255, 363] and the
references therein). This perspective seems to be very reasonable as in many applications,
input data, such as images and videos, that should be processed through the target function
are more naturally represented by elements of infinite dimensional spaces instead of elements
of finite dimensional spaces.
167
Chapter 4: Multi-dimensional ANN approximation results
168
Part III
Optimization
169
Chapter 5
171
Chapter 5: Optimization through ODEs
L(E) = 0 (5.3)
(cf. Definitions 1.1.3 and 1.2.1). Note that h is the number of hidden layers of the ANNs
in (5.4), note for every i ∈ {1, 2, . . . , h} that li ∈ N is the number of neurons in the i-th
hidden layer of the ANNs in (5.4), and note that d is the number of real parameters used
to describe the ANNs in (5.4). Observe that for every θ ∈ Rd we have that the function
θ,d
Rd ∋ x 7→ NM a,l 1
,Ma,l2 ,...,Ma,lh ,idR ∈R (5.5)
in (5.4) is nothing else than the realization function associated to a fully-connected feed-
forward ANN where before each hidden layer a multidimensional version of the activation
function a : R → R is applied. We restrict ourselves in this section to a differentiable
activation function as this differentiability property allows us to consider gradients (cf. (5.7),
(5.8), and Section 5.3.2 below for details).
We now discretize the optimization problem in (5.2) as the problem of computing
approximate minimizers of the function L : Rd → [0, ∞) which satisfies for all θ ∈ Rd that
"M #
1 X 2
θ,d
(5.6)
L(θ) = NMa,l ,Ma,l ,...,Ma,l ,idR (xm ) − ym
M m=1 1 2 h
172
5.2. Basics for GFs
The process (θn )n∈N0 is the GD process for the minimization problem associated to (5.6)
with learning rates (γn )n∈N and initial value ξ (see Definition 6.1.1 below for the precise
definition).
This plain-vanilla GD optimization method and related GD-type optimization methods
can be regarded as discretizations of solutions of GF ODEs. In the context of the min-
imization problem in (5.6) such solutions of GF ODEs can be described as follows. Let
Θ = (Θt )t∈[0,∞) : [0, ∞) → Rd be a continuously differentiable function which satisfies for all
t ∈ [0, ∞) that
Θ0 = ξ and Θ̇t = ∂
Θ
∂t t
= −(∇L)(Θt ). (5.8)
The process (Θt )t∈[0,∞) is the solution of the GF ODE corresponding to the minimization
problem associated to (5.6) with initial value ξ.
In Chapter 6 below we introduce and study deterministic GD-type optimization methods
such as the GD optimization method in (5.7). To develop intuitions for GD-type optimization
methods and for some of the tools which we employ to analyze such GD-type optimization
methods, we study in the remainder of this chapter GF ODEs such as (5.8) above. In
deep learning algorithms usually not GD-type optimization methods but stochastic variants
of GD-type optimization methods are employed to solve optimization problems of the
form (5.6). Such SGD-type optimization methods can be viewed as suitable Monte Carlo
approximations of deterministic GD-type methods and in Chapter 7 below we treat such
SGD-type optimization methods.
173
Chapter 5: Optimization through ODEs
Then we say that Θ is a GF trajectory for the objective function L with generalized gradient
G and initial value ξ (we say that Θ is a GF trajectory for the objective function L with
initial value ξ, we say that Θ is a solution of the GF ODE for the objective function L
with generalized gradient G and initial value ξ, we say that Θ is a solution of the GF ODE
for the objective function L with initial value ξ) if and only if it holds that Θ : [0, ∞) → Rd
is a function from [0, ∞) to Rd which satisfies for all t ∈ [0, ∞) that
Z t
Θt = ξ − G(Θs ) ds. (5.10)
0
Then
and
Proof of Lemma 5.2.2. Note that (5.11) implies that for all v ∈ Rd it holds that
174
5.2. Basics for GFs
(cf. Definition 1.4.7). The Cauchy–Schwarz inequality hence ensures that for all v ∈ Rd
with ∥v∥2 = r it holds that
(cf. Definition 3.3.4). Furthermore, observe that (5.14) shows that for all c ∈ R it holds that
Combining this and (5.15) proves item (i) and item (ii). The proof of Lemma 5.2.2 is thus
complete.
Lemma 5.2.3. RLet d ∈ N, Θ ∈ C([0, ∞), Rd ), L ∈ C 1 (Rd , R) and assume for all t ∈ [0, ∞)
t
that Θt = Θ0 − 0 (∇L)(Θs ) ds. Then
(i) it holds that Θ ∈ C 1 ([0, ∞), Rd ),
(ii) it holds for all t ∈ (0, ∞) that Θ̇t = −(∇L)(Θt ), and
(iii) it holds for all t ∈ [0, ∞) that
Z t
L(Θt ) = L(Θ0 ) − ∥(∇L)(Θs )∥22 ds (5.17)
0
(cf. Definitions 1.4.7 and 3.3.4). This establishes item (iii). The proof of Lemma 5.2.3 is
thus complete.
Corollary 5.2.4 (Illustration for the negative GF). Let d ∈ d
R tN, Θ ∈ C([0, ∞), R ), L ∈
1 d
C (R , R) and assume for all t ∈ [0, ∞) that Θ(t) = Θ(0) − 0 (∇L)(Θ(s)) ds. Then
(i) it holds that Θ ∈ C 1 ([0, ∞), Rd ),
(ii) it holds for all t ∈ (0, ∞) that
and
175
Chapter 5: Optimization through ODEs
(iii) it holds for all Ξ ∈ C 1 ([0, ∞), Rd ), τ ∈ (0, ∞) with Ξ(τ ) = Θ(τ ) and ∥Ξ′ (τ )∥2 =
∥Θ′ (τ )∥2 that
Proof of Corollary 5.2.4. Observe that Lemma 5.2.3 and the fundamental theorem of cal-
culus imply item (i) and item (ii). Note that Lemma 5.2.2 shows for all Ξ ∈ C 1 ([0, ∞), Rd ),
t ∈ (0, ∞) it holds that
(cf. Definition 3.3.4). Lemma 5.2.3 therefore ensures that for all Ξ ∈ C 1 ([0, ∞), Rd ),
τ ∈ (0, ∞) with Ξ(τ ) = Θ(τ ) and ∥Ξ′ (τ )∥2 = ∥Θ′ (τ )∥2 it holds that
This and item (ii) establish item (iii). The proof of Corollary 5.2.4 is thus complete.
176
5.2. Basics for GFs
177
Chapter 5: Optimization through ODEs
2
2 0 2 4 6
178
5.2. Basics for GFs
16 # Plot arrows
17 for x in np . linspace ( -1.9 ,1.9 ,21) :
18 d = nabla_f ( x )
19 plt . arrow (x , f ( x ) , -.05 * d , 0 ,
20 l e n g t h _ i n c l ud e s _ h e a d = True , head_width =0.08 ,
21 head_length =0.05 , color = ’b ’)
22
23 plt . savefig ( " ../ plots / gradient_plot1 . pdf " )
Source code 5.1 (code/gradient_plot1.py): Python code used to create Figure 5.1
1 import numpy as np
2 import matplotlib . pyplot as plt
3
4 K = [1. , 10.]
5 vartheta = np . array ([1. , 1.])
6
7 def f (x , y ) :
8 result = K [0] / 2. * np . abs ( x - vartheta [0]) **2 \
9 + K [1] / 2. * np . abs ( y - vartheta [1]) **2
10 return result
11
12 def nabla_f ( x ) :
13 return K * ( x - vartheta )
14
15 plt . figure ()
16
17 # Plot contour lines of f
18 x = np . linspace ( -3. , 7. , 100)
19 y = np . linspace ( -2. , 4. , 100)
20 X , Y = np . meshgrid (x , y )
21 Z = f (X , Y )
22 cp = plt . contour (X , Y , Z , colors = " black " ,
23 levels = [0.5 ,2 ,4 ,8 ,16] ,
24 linestyles = " : " )
25
Source code 5.2 (code/gradient_plot2.py): Python code used to create Figure 5.2
179
Chapter 5: Optimization through ODEs
Then
Proof of Lemma 5.3.1. Observe that (5.23) implies that for all x1 ∈ Rd1 , x2 ∈ Rd2 it holds
that
(A1 ◦ F1 ◦ B1 + A2 ◦ F2 ◦ B2 )(x1 , x2 ) = (A1 ◦ F1 )(x1 ) + (A2 ◦ F2 )(x2 )
= (F1 (x1 ), 0) + (0, F2 (x2 )) (5.24)
= (F1 (x1 ), F2 (x2 )).
Combining this and the fact that A1 , A2 , F1 , F2 , B1 , and B2 are differentiable with the chain
rule establishes that f is differentiable. The proof of Lemma 5.3.1 is thus complete.
Lemma 5.3.2. Let d1 , d2 , l0 , l1 , l2 ∈ N, let A : Rd1 × Rd2 × Rl0 → Rd2 × Rd1 +l0 and B : Rd2 ×
Rd1 +l0 → Rd2 × Rl1 satisfy for all θ1 ∈ Rd1 , θ2 ∈ Rd2 , x ∈ Rl0 that
A(θ1 , θ2 , x) = (θ2 , (θ1 , x)) and B(θ2 , (θ1 , x)) = (θ2 , F1 (θ1 , x)), (5.25)
for every k ∈ {1, 2} let Fk : Rdk × Rlk−1 → Rlk be differentiable, and let f : Rd1 × Rd2 × Rl0 →
Rl2 satisfy for all θ1 ∈ Rd1 , θ2 ∈ Rd2 , x ∈ Rl0 that
(5.26)
f (θ1 , θ2 , x) = F2 (θ2 , ·) ◦ F1 (θ1 , ·) (x).
Then
180
5.3. Regularity properties for ANNs
Proof of Lemma 5.3.2. Note that (5.25) and (5.26) ensure that for all θ1 ∈ Rd1 , θ2 ∈ Rd2 ,
x ∈ Rl0 it holds that
f (θ1 , θ2 , x) = F2 (θ2 , F1 (θ1 , x)) = F2 (B(θ2 , (θ1 , x))) = F2 (B(A(θ1 , θ2 , x))). (5.27)
and
is differentiable
(cf. Definition 1.1.3).
Proof of Lemma 5.3.3. Note that (1.1) shows that for all θ1 ∈ Rd1 , θ2 ∈ Rd2 , . . ., θL ∈ RdL ,
k ∈ {1, 2, . . . , L} it holds that
Pk−1
(θ1 ,θ2 ,...,θL ), dj
Alk ,lk−1 j=1
= Aθlkk,l,0k−1 . (5.31)
Hence, we obtain that for all θ1 ∈ Rd1 , θ2 ∈ Rd2 , . . ., θL ∈ RdL , k ∈ {1, 2, . . . , L} it holds
that
(θ1 ,θ2 ,...,θL ), k−1
P
j=1 dj
(5.32)
Fk (θk , x) = Ψk ◦ Alk ,lk−1 (x).
181
Chapter 5: Optimization through ODEs
Combining this with (1.5) establishes item (i). Observe that the assumption that for all
k ∈ {1, 2, . . . , L} it holds that Ψk is differentiable, the fact that for all m, n ∈ N, θ ∈ Rm(n+1)
it holds that Rm(n+1) × Rn ∋ (θ, x) 7→ Aθ,0 m,n (x) ∈ R
m
is differentiable, and the chain rule
ensure that for all k ∈ {1, 2, . . . , L} it holds that Fk is differentiable. Lemma 5.3.2 and
induction hence prove that
is differentiable. This and item (i) prove item (ii). The proof of Lemma 5.3.3 is thus
complete.
Proof of Lemma 5.3.4. Note that Lemma 5.3.3 and Lemma 5.3.1 (applied with d1 ↶ d + l0 ,
d2 ↶ lL , l1 ↶ lL , l2 ↶ lL , F1 ↶ (Rd × Rl0 ∋ (θ, x) 7→ NΨθ,l1 ,Ψ
lL
0
2 ,...,ΨL
(x) ∈ R ), F2 ↶ idRlL
in the notation of Lemma 5.3.1) show that
is differentiable. The assumption that L is differentiable and the chain rule therefore ensure
that for all x ∈ Rl0 , y ∈ RlL it holds that
Rd ∋ θ 7→ L NΨθ,l1 ,Ψ (5.36)
0
2 ,...,ΨL
(x m ), ym ∈R
is differentiable. This implies that L is differentiable. The proof of Lemma 5.3.4 is thus
complete.
Proof of Lemma 5.3.5. Observe that the assumption that a is differentiable, Lemma 5.3.1,
and induction demonstrate that for all m ∈ N it holds that Ma,m is differentiable. The
proof of Lemma 5.3.5 is thus complete.
182
5.4. Loss functions
Proof of Corollary 5.3.6. Note that Lemma 5.3.5, and Lemma 5.3.4 prove that L is differ-
entiable. The proof of Corollary 5.3.6 is thus complete.
(cf. Definitions 1.1.3, 1.2.1, and 1.2.43 and Lemma 1.2.44). Then L is differentiable.
Proof of Corollary 5.3.7. Observe that Lemma 5.3.5, the fact that A is differentiable, and
Lemma 5.3.4 establish that L is differentiable. The proof of Corollary 5.3.7 is thus
complete.
1 import numpy as np
2 import tensorflow as tf
3 import matplotlib . pyplot as plt
4 import plot_util
5
183
Chapter 5: Optimization through ODEs
2.0
1.5
1.0
0.5
0.0
2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0
¹-error
0.5
1 import numpy as np
2 import tensorflow as tf
3 import matplotlib . pyplot as plt
184
5.4. Loss functions
2.0
1.5
1.0
0.5
0.0
2.0 1.5 1.0 0.5 0.0 0.5 1.0 1.5 2.0
Mean squared error
0.5
4 import plot_util
5
6 ax = plot_util . setup_axis (( -2 ,2) , ( -.5 ,2) )
7
8 x = np . linspace ( -2 , 2 , 100)
9
10 mse_loss = tf . keras . losses . MeanSquaredError (
11 reduction = tf . keras . losses . Reduction . NONE )
12 zero = tf . zeros ([100 ,1])
13
14 ax . plot (x , mse_loss ( x . reshape ([100 ,1]) , zero ) ,
15 label = ’ Mean squared error ’)
16 ax . legend ()
17
18 plt . savefig ( " ../../ plots / mseloss . pdf " , bbox_inches = ’ tight ’)
Lemma 5.4.3. Let d ∈ N and let L be the mean squared error loss function based on
Rd ∋ x 7→ ∥x∥2 ∈ [0, ∞) (cf. Definitions 3.3.4 and 5.4.2). Then
L(u, v) = L(x, y)+L′ (x, y)(u−x, v−y)+ 12 L(2) (x, y) (u−x, v−y), (u−x, v−y) . (5.41)
185
Chapter 5: Optimization through ODEs
Proof of Lemma 5.4.3. Note that (5.40) implies that for all x = (x1 , . . . , xd ), y = (y1 , . . . ,
yd ) ∈ Rd it holds that
d
X
L(x, y) = ∥x − y∥22 = ⟨x − y, x − y⟩ = (xi − yi )2 . (5.42)
i=1
Furthermore, observe that (5.43) implies that for all x, y ∈ Rd it holds that L ∈ C 2 (Rd ×
Rd , R) and
2 Id −2 Id
(Hess(x,y) L) = . (5.45)
−2 Id 2 Id
Therefore, we obtain that for all x, y, h, k ∈ Rd it holds that
L(2) (x, y) (h, k), (h, k) = 2⟨h, h⟩ − 2⟨h, k⟩ − 2⟨k, h⟩ + 2⟨k, k⟩ = 2∥h − k∥22 . (5.46)
Combining this with (5.43) shows that for all x, y ∈ Rd , h, k ∈ Rd it holds that L ∈
C ∞ (Rd × Rd , R) and
This implies items (i) and (ii). The proof of Lemma 5.4.3 is thus complete.
186
5.4. Loss functions
4.0
Scaled mean squared error
¹-error3.5
1-Huber-error
3.0
2.5
2.0
1.5
1.0
0.5
0.0
3 2 1 0 1 2 3
0.5
1 import numpy as np
2 import tensorflow as tf
3 import matplotlib . pyplot as plt
4 import plot_util
5
6 ax = plot_util . setup_axis (( -3 ,3) , ( -.5 ,4) )
7
8 x = np . linspace ( -3 , 3 , 100)
9
10 mse_loss = tf . keras . losses . MeanSquaredError (
11 reduction = tf . keras . losses . Reduction . NONE )
12 mae_loss = tf . keras . losses . Me anAbsolu teError (
13 reduction = tf . keras . losses . Reduction . NONE )
14 huber_loss = tf . keras . losses . Huber (
15 reduction = tf . keras . losses . Reduction . NONE )
16
17 zero = tf . zeros ([100 ,1])
18
19 ax . plot (x , mse_loss ( x . reshape ([100 ,1]) , zero ) /2. ,
20 label = ’ Scaled mean squared error ’)
21 ax . plot (x , mae_loss ( x . reshape ([100 ,1]) , zero ) ,
22 label = ’ ℓ 1 - error ’)
23 ax . plot (x , huber_loss ( x . reshape ([100 ,1]) , zero ) ,
24 label = ’1 - Huber - error ’)
25 ax . legend ()
187
Chapter 5: Optimization through ODEs
26
27 plt . savefig ( " ../../ plots / huberloss . pdf " , bbox_inches = ’ tight ’)
3.0
Cross-entropy
2.5
2.0
1.5
1.0
0.5
0.0
0.0 0.2 0.4 0.6 0.8 1.0
1 import numpy as np
2 import tensorflow as tf
3 import matplotlib . pyplot as plt
4 import plot_util
5
6 ax = plot_util . setup_axis ((0 ,1) , (0 ,3) )
7
188
5.4. Loss functions
8 ax . set_aspect (.3)
9
10 x = np . linspace (0 , 1 , 100)
11
12 cce_loss = tf . keras . losses . C a t e g o r i c a l C r o s s e n t r o p y (
13 reduction = tf . keras . losses . Reduction . NONE )
14 y = tf . constant ([[0.3 , 0.7]] * 100 , shape =(100 , 2) )
15
Lemma 5.4.6. Let d ∈ N\{1} and let L be the d-dimensional cross-entropy loss function
(cf. Definition 5.4.5). Then
(i) it holds for all x = (x1 , . . . , xd ), y = (y1 , . . . , yd ) ∈ [0, ∞)d that
(5.50)
(L(x, y) = ∞) ↔ ∃ i ∈ {1, 2, . . . , d} : [(xi = 0) ∧ (yi ̸= 0)] ,
(ii) it holds for all x = (x1 , . . . , xd ), y = (y1 , . . . , yd ) ∈ [0, ∞)d with ∀ i ∈ {1, 2, . . . , d} :
[(xi ̸= 0) ∨ (yi = 0)] that
X
L(x, y) = − ln(xi )yi ∈ R, (5.51)
i∈{1,2,...,d},
yi ̸=0
and
(iii) it holds for all x = (x1 , . . . , xd ) ∈ (0, ∞)d , y = (y1 , . . . , yd ) ∈ [0, ∞)d that
d
X
L(x, y) = − ln(xi )yi ∈ R. (5.52)
i=1
Proof of Lemma 5.4.6. Note that (5.49) and the fact that for all a, b ∈ [0, ∞) it holds that
0 :b=0
(5.53)
lim ln(z)b = ln(a)b : (a ̸= 0) ∧ (b ̸= 0)
z↘a
−∞ : (a = 0) ∧ (b ̸= 0)
prove items (i), (ii), and (iii). The proof of Lemma 5.4.6 is thus complete.
189
Chapter 5: Optimization through ODEs
Lemma 5.4.7. Let d ∈ N\{1}, let L be the d-dimensional cross-entropy loss function, let
x = (x1 , . . . , xd ), y = (y1 , . . . , yd ) ∈ [0, ∞)d satisfy di=1 xi = di=1 yi and x =
P P
̸ y, and let
f : [0, 1] → (−∞, ∞] satisfy for all h ∈ [0, 1] that
f (h) = L(x + h(y − x), y) (5.54)
(cf. Definition 5.4.5). Then f is strictly decreasing.
Proof of Lemma 5.4.7. Throughout this proof, let g : [0, 1) → (−∞, ∞] satisfy for all
h ∈ [0, 1) that
g(h) = f (1 − h) (5.55)
and let J = {i ∈ {1, 2, . . . , d} : yi ̸= 0}. Observe that (5.54) shows that for all h ∈ [0, 1) it
holds that
g(h) = L(x + (1 − h)(y − x), y) = L(y + h(x − y), y). (5.56)
Furthermore, note that the fact that for all i ∈ J it holds that xi ∈ [0, ∞) and yi ∈ (0, ∞)
ensures that for all i ∈ J, h ∈ [0, 1) it holds that
yi + h(xi − yi ) = (1 − h)yi + hxi ≥ (1 − h)yi > 0. (5.57)
This, (5.56), and item (ii) in Lemma 5.4.6 imply that for all h ∈ [0, 1) it holds that
X
g(h) = − ln(yi + h(xi − yi ))yi ∈ R. (5.58)
i∈J
The chain rule hence demonstrates that for all h ∈ [0, 1) it holds that ([0, 1) ∋ z 7→ g(z) ∈
R) ∈ C ∞ ([0, 1), R) and
X yi (xi − yi )
g ′ (h) = − . (5.59)
y i + h(xi − yi )
i∈J
This and the chain rule establish that for all h ∈ [0, 1) it holds that
X yi (xi − yi )2
′′
g (h) = . (5.60)
i∈J
(yi + h(xi − yi ))2
Moreover, observe that the fact that for all z = (z1 , . . . , zd ) ∈ [0, ∞)d with
Pd Pd
i=1 zi = i=1 yi
and ∀ i ∈ J : zi = yi it holds that
" # " #
X X X
zi = zi − zi
i∈{1,2,...,d}\J i∈{1,2,...,d} i∈J
" # " #
(5.61)
X X
= yi − zi
i∈{1,2,...,d} i∈J
X
= (yi − zi ) = 0
i∈J
190
5.4. Loss functions
proves that for all z = (z1 , . . . , zd ) ∈ [0, ∞)d with di=1 zi = di=1 yi and ∀ i ∈ J : zi = yi
P P
it holds that z = y. The assumption that i=1 xi = i=1 yi and x ̸= y therefore ensures
Pd Pd
that there exists i ∈ J such that xi ̸= yi > 0. Combining this with (5.60) shows that for all
h ∈ [0, 1) it holds that
g ′′ (h) > 0. (5.62)
The fundamental theorem of calculus hence implies that for all h ∈ (0, 1) it holds that
Z h
′ ′
g (h) = g (0) + g ′′ (h) dh > g ′ (0). (5.63)
0
In addition, note that (5.59) and the assumption that di=1 xi = di=1 yi demonstrate that
P P
" # " #
X yi (xi − yi ) X X X
g ′ (0) = − = (yi − xi ) = yi − xi
i∈J
yi i∈J i∈J i∈J
" # " # " # " # " # (5.64)
X X X X X
= yi − xi = xi − xi = xi ≥ 0.
i∈{1,2,...,d} i∈J i∈{1,2,...,d} i∈J i∈{1,2,...,d}\J
Combining this and (5.63) establishes that for all h ∈ (0, 1) it holds that
Therefore, we obtain that g is strictly increasing. This and (5.55) prove that f |(0,1] is strictly
decreasing. Next observe that (5.55) and (5.58) ensure that for all h ∈ (0, 1] it holds that
X X
f (h) = − ln(yi + (1 − h)(xi − yi ))yi = − ln(xi + h(yi − xi ))yi ∈ R. (5.66)
i∈J i∈J
Furthermore, note that items (i) and (ii) in Lemma 5.4.6 show that
X
[f (0) = ∞] ∨ f (0) = − ln(xi + 0(yi − xi ))yi ∈ R . (5.67)
i∈J
This and the fact that f |(0,1] is strictly decreasing demonstrate that f is strictly decreasing.
The proof of Lemma 5.4.7 is thus complete.
Pd
Corollary 5.4.8. Let d ∈ N\{1}, let A = {x = (x1 , . . . , xd ) ∈ [0, 1]d : i=1 xi = 1}, let L
be the d-dimensional cross-entropy loss function, and let y ∈ A (cf. Definition 5.4.5). Then
191
Chapter 5: Optimization through ODEs
Proof of Corollary 5.4.8. Observe that Lemma 5.4.7 shows that for all x ∈ A\{y} it holds
that
L(x, y) = L(x + 0(y − x), y) > L(x + 1(y − x), y) = L(y, y). (5.71)
This and item (ii) in Lemma 5.4.6 establish items (i) and (ii). The proof of Corollary 5.4.8
is thus complete.
and
(ii) it holds for all y ∈ [0, ∞) that
(
0 :y=0
z z
(5.73)
lim inf ln x
x = z
= lim sup ln x
x .
x↘y ln y
y :y>0 x↘y
Proof of Lemma 5.4.9. Throughout this proof, let f : (0, ∞) → R and g : (0, ∞) → R
satisfy for all x ∈ (0, ∞) that
Note that the chain rule proves that for all x ∈ (0, ∞) it holds that f is differentiable and
Combining this, the fact that limx→∞ |f (x)| = ∞ = limx→∞ |g(x)|, the fact that g is
differentiable, the fact that for all x ∈ (0, ∞) it holds that g ′ (x) = 1 ̸= 0, and the fact that
−1
limx→∞ −x1 = 0 with l’Hôpital’s rule ensures that
192
5.4. Loss functions
Observe that item (i) and the fact that for all x ∈ (0, ∞) it holds that ln xz x = ln(z)x −
ln(x)x prove item (ii). The proof of Lemma 5.4.9 is thus complete.
Definition 5.4.10. Let d ∈ N\{1}. We say that L is the d-dimensional Kullback–Leibler
divergence loss function if and only if it holds that L : [0, ∞)d × [0, ∞)d → (−∞, ∞] is
the function from [0, ∞)d × [0, ∞)d to (−∞, ∞] which satisfies for all x = (x1 , . . . , xd ),
y = (y1 , . . . , yd ) ∈ [0, ∞)d that
d
X
z
(5.78)
L(x, y) = − lim lim ln u
u
z↘xi u↘yi
i=1
3.0
Kullback-Leibler divergence
Cross-entropy
2.5
2.0
1.5
1.0
0.5
0.0
0.0 0.2 0.4 0.6 0.8 1.0
1 import numpy as np
2 import tensorflow as tf
3 import matplotlib . pyplot as plt
4 import plot_util
193
Chapter 5: Optimization through ODEs
5
6 ax = plot_util . setup_axis ((0 ,1) , (0 ,3) )
7
8 ax . set_aspect (.3)
9
10 x = np . linspace (0 , 1 , 100)
11
12 kld_loss = tf . keras . losses . KLDivergence (
13 reduction = tf . keras . losses . Reduction . NONE )
14 cce_loss = tf . keras . losses . C a t e g o r i c a l C r o s s e n t r o p y (
15 reduction = tf . keras . losses . Reduction . NONE )
16 y = tf . constant ([[0.3 , 0.7]] * 100 , shape =(100 , 2) )
17
18 X = tf . stack ([ x ,1 - x ] , axis =1)
19
20 ax . plot (x , kld_loss (y , X ) , label = ’ Kullback - Leibler divergence ’)
21 ax . plot (x , cce_loss (y , X ) , label = ’ Cross - entropy ’)
22 ax . legend ()
23
24 plt . savefig ( " ../../ plots / kldloss . pdf " , bbox_inches = ’ tight ’)
Lemma 5.4.11. Let d ∈ N\{1}, let LCE be the d-dimensional cross-entropy loss function,
and let LKLD be the d-dimensional Kullback–Leibler divergence loss function (cf. Defini-
tions 5.4.5 and 5.4.10). Then it holds for all x, y ∈ [0, ∞)d that
Proof of Lemma 5.4.11. Note that Lemma 5.4.9 implies that for all a, b ∈ [0, ∞) it holds
that
lim lim ln uz u = lim lim ln(z)u − ln(u)u
z↘a u↘b z↘a u↘b
h i
= lim ln(z)b − lim [ln(u)u] (5.80)
z↘a u↘b
= lim [ln(z)b] − lim [ln(u)u] .
z↘a u↘b
This and (5.78) demonstrate that for all x = (x1 , . . . , xd ), y = (y1 , . . . , yd ) ∈ [0, ∞)d it holds
that
Xd
lim lim ln uz u
LKLD (x, y) = −
z↘xi u↘yi
i=1
d
! d
! (5.81)
X X
=− lim [ln(z)yi ] + lim [ln(u)u] .
z↘xi u↘yi
i=1 i=1
194
5.5. GF optimization in the training of ANNs
Furthermore, observe that Lemma 5.4.9 ensures that for all b ∈ [0, ∞) it holds that
(
0 :b=0
(5.82)
lim ln(u)u = = lim ln(u)b .
u↘b ln(b)b : b > 0 u↘b
Combining this with (5.81) shows that for all x = (x1 , . . . , xd ), y = (y1 , . . . , yd ) ∈ [0, ∞)d it
holds that
d
! d
!
X X
LKLD (x, y) = − lim [ln(z)yi ] + lim [ln(u)yi ] = LCE (x, y) − LCE (y, y). (5.83)
z↘xi u↘yi
i=1 i=1
Proof of Lemma 5.4.12. Note that Lemma 5.4.7 and Lemma 5.4.11 establish (5.84). The
proof of Lemma 5.4.12 is thus complete.
Pd
Corollary 5.4.13. Let d ∈ N\{1}, let A = {x = (x1 , . . . , xd ) ∈ [0, 1]d : i=1 xi = 1},
let L be the d-dimensional Kullback–Leibler divergence loss function, and let y ∈ A (cf.
Definition 5.4.10). Then
Proof of Corollary 5.4.13. Observe that Corollary 5.4.13 and Lemma 5.4.11 prove items (i)
and (ii). The proof of Corollary 5.4.13 is thus complete.
195
Chapter 5: Optimization through ODEs
L : RlL × RlL → R be the mean squared error loss function based on Rd ∋ x 7→ ∥x∥2 ∈ [0, ∞),
let L : Rd → [0, ∞) satisfy for all θ ∈ Rd that
"M #
1 X θ,d
(5.86)
L(θ) = L NMa,l ,Ma,l ,...,Ma,l ,id l (xm ), ym ,
M m=1 1 2 h R L
(cf. Definitions 1.1.3, 1.2.1, 3.3.4, and 5.4.2, Corollary 5.3.6, and Lemma 5.4.3). Then Θ
is a GF trajectory for the objective function L with initial value ξ (cf. Definition 5.2.1).
Proof for Example 5.5.1. Note that (5.9), (5.10), and (5.87) demonstrate that Θ is a GF
trajectory for the objective function L with initial value ξ (cf. Definition 5.2.1). The proof
for Example 5.5.1 is thus complete.
PL
Example 5.5.2. Let d, L, d ∈ N, l1 , l2 , . . . , lL ∈ N satisfy d = l1 (d + 1) + k=2 lk (lk−1 + 1) ,
let a : R → R be differentiable, let A : RlL → RlL be the lL -dimensional softmax activation
function, let M ∈ N, x1 , x2 , . . . , xM ∈ Rd , y1 , y2 , . . . , yM ∈ [0, ∞)lL , let L1 be the lL -
dimensional cross-entropy loss function, let L2 be the lL -dimensional Kullback–Leibler
divergence loss function, for every i ∈ {1, 2} let Li : Rd → [0, ∞) satisfy for all θ ∈ Rd that
"M #
1 X θ,d
(5.88)
Li (θ) = Li NM a,l1 ,Ma,l2 ,...,Ma,lh ,A
(xm ), ym ,
M m=1
let ξ ∈ Rd , and for every i ∈ {1, 2} let Θi : [0, ∞) → Rd satisfy for all t ∈ [0, ∞) that
Z t
i
Θt = ξ − (∇Li )(Θis ) ds (5.89)
0
(cf. Definitions 1.1.3, 1.2.1, 1.2.43, 5.4.5, and 5.4.10 and Corollary 5.3.7). Then it holds
for all i, j ∈ {1, 2} that Θi is a GF trajectory for the objective function Lj with initial value
ξ (cf. Definition 5.2.1).
Proof for Example 5.5.2. Observe that Lemma 5.4.11 implies that for all x, y ∈ (0, ∞)lL it
holds that
(∇x L1 )(x, y) = (∇x L2 )(x, y). (5.90)
Hence, we obtain that for all x ∈ Rd it holds that
(∇L1 )(x) = (∇L2 )(x). (5.91)
This, (5.9), (5.10), and (5.89) demonstrate that for all i ∈ {1, 2} it holds that Θi is a GF
trajectory for the objective function Lj with initial value ξ (cf. Definition 5.2.1). The proof
for Example 5.5.2 is thus complete.
196
5.6. Lyapunov-type functions for GFs
Proof of Lemma 5.6.1. Throughout this proof, let v : [0, T ] → R satisfy for all t ∈ [0, T ]
that Z t
v(t) = eαt
e−αs)
β(s) ds (5.94)
0
Note that the product rule and the fundamental theorem of calculus demonstrate that for
all t ∈ [0, T ] it holds that v ∈ C 1 ([0, T ], R) and
Z t Z t
′ α(t−s) α(t−s)
v (t) = αe β(s) ds + β(t) = α e β(s) ds + β(t) = αv(t) + β(t).
0 0
(5.96)
The assumption that ϵ ∈ C 1 ([0, T ], R) and the product rule therefore ensure that for all
t ∈ [0, T ] it holds that u ∈ C 1 ([0, T ], R) and
Combining this with the assumption that for all t ∈ [0, T ] it holds that ϵ′ (t) ≤ αϵ(t) + β(t)
proves that for all t ∈ [0, T ] it holds that
197
Chapter 5: Optimization through ODEs
This and the fundamental theorem of calculus imply that for all t ∈ [0, T ] it holds that
Z t Z t
u(t) = u(0) + ′
u (s) ds ≤ u(0) + 0 ds = u(0) = ϵ(0). (5.99)
0 0
Combining this, (5.94), and (5.95) shows that for all t ∈ [0, T ] it holds that
Z t
αt αt αt
ϵ(t) = e u(t) + v(t) ≤ e ϵ(0) + v(t) ≤ e ϵ(0) + eα(t−s) β(s) ds. (5.100)
0
Proof of Proposition 5.6.2. Throughout this proof, let ϵ, b ∈ C([0, T ], R) satisfy for all
t ∈ [0, T ] that
ϵ(t) = V (Θt ) and b(t) = β(Θt ). (5.103)
Observe that (5.101), (5.103), the fundamental theorem of calculus, and the chain rule
ensure that for all t ∈ [0, T ] it holds that
ϵ′ (t) = dt
d
(V (Θt )) = V ′ (Θt ) Θ̇t = V ′ (Θt )G(Θt ) ≤ αV (Θt ) + β(Θt ) = αϵ(t) + b(t). (5.104)
Lemma 5.6.1 and (5.103) hence demonstrate that for all t ∈ [0, T ] it holds that
Z t Z t
αt
V (Θt ) = ϵ(t) ≤ ϵ(0)e + e α(t−s) αt
b(s) ds = V (Θ0 )e + eα(t−s) β(Θs ) ds. (5.105)
0 0
198
5.6. Lyapunov-type functions for GFs
Proof of Corollary 5.6.3. Note that Proposition 5.6.2 and (5.106) establish (5.107). The
proof of Corollary 5.6.3 is thus complete.
Proof of Corollary 5.6.5. Throughout this proof, let G : O → Rd satisfy for all θ ∈ O that
199
Chapter 5: Optimization through ODEs
Observe that Lemma 5.6.4 and (5.112) ensure that for all θ ∈ O it holds that V ∈ C 1 (O, R)
and
V ′ (θ)G(θ) = ⟨(∇V )(θ), G(θ)⟩ = ⟨2(θ − ϑ), G(θ)⟩
(5.116)
= −2⟨(θ − ϑ), (∇L)(θ)⟩ ≤ −2c∥θ − ϑ∥22 = −2cV (θ).
Corollary 5.6.3 hence proves that for all t ∈ [0, T ] it holds that
Proof of Lemma 5.6.6. Throughout this proof, let v ∈ Rd \{0} satisfy v = −(∇L)(ϑ), let
δ ∈ (0, ∞) satisfy for all t ∈ (−δ, δ) that
ϑ + tv = ϑ − t(∇L)(ϑ) ∈ O, (5.118)
200
5.6. Lyapunov-type functions for GFs
The fact that ∥v∥22 > 0 therefore demonstrates that there exists t ∈ (0, δ) such that
∥v∥22
L(t) − L(0)
+ ∥v∥22 < . (5.123)
t 2
The triangle inequality and the fact that ∥v∥22 > 0 hence prove that
L(t) − L(0) L(t) − L(0) 2 2 L(t) − L(0)
= + ∥v∥2 − ∥v∥2 ≤ + ∥v∥22 − ∥v∥22
t t t
2 2
(5.124)
∥v∥2 ∥v∥2
< − ∥v∥22 = − < 0.
2 2
This ensures that
L(ϑ + tv) = L(t) < L(0) = L(ϑ). (5.125)
The proof of Lemma 5.6.6 is thus complete.
Lemma 5.6.7 (A necessary condition for a local minimum point). Let d ∈ N, let O ⊆ Rd
be open, let ϑ ∈ O, let L : O → R be a function, assume that L is differentiable at ϑ, and
assume
L(ϑ) = inf θ∈O L(θ). (5.126)
Then (∇L)(ϑ) = 0.
Proof of Lemma 5.6.7. We prove Lemma 5.6.7 by contradiction. We thus assume that
(∇L)(ϑ) ̸= 0. Lemma 5.6.6 then implies that there exists θ ∈ O such that L(θ) < L(ϑ).
Combining this with (5.126) shows that
Lemma 5.6.8 (A sufficient condition for a local minimum point). Let d ∈ N, c ∈ (0, ∞),
r ∈ (0, ∞], ϑ ∈ Rd , B = {w ∈ Rd : ∥w − ϑ∥2 ≤ r}, L ∈ C 1 (Rd , R) satisfy for all θ ∈ B that
201
Chapter 5: Optimization through ODEs
Proof of Lemma 5.6.8. Throughout this proof, let B be the set given by
Note that (5.128) implies that for all v ∈ Rd with ∥v∥2 ≤ r it holds that
The fundamental theorem of calculus hence demonstrates that for all θ ∈ B it holds that
t=1
L(θ) − L(ϑ) = L(ϑ + t(θ − ϑ)) t=0
Z 1
= L ′ (ϑ + t(θ − ϑ))(θ − ϑ) dt
Z0 1 (5.131)
1
= ⟨(∇L)(ϑ + t(θ − ϑ)), t(θ − ϑ)⟩ dt
t
Z0 1 Z 1
21 2
≥ c∥t(θ − ϑ)∥2 dt = c∥θ − ϑ∥2 t dt = 2c ∥θ − ϑ∥22 .
0 t 0
This proves item (i). Next observe that (5.131) ensures that for all θ ∈ B\{ϑ} it holds that
This establishes item (ii). It thus remains thus remains to prove item (iii). For this observe
that item (ii) ensures that
Combining this, the fact that B is open, and Lemma 5.6.7 (applied with d ↶ d, O ↶ B,
ϑ ↶ ϑ, L ↶ L|B in the notation of Lemma 5.6.7) assures that (∇L)(ϑ) = 0. This
establishes item (iii). The proof of Lemma 5.6.8 is thus complete.
202
5.7. Optimization through flows of ODEs
Proof of Lemma 5.6.9. Observe that (5.135), the Cauchy-Schwarz inequality, and the fun-
damental theorem of calculus ensure that for all θ ∈ B it holds that
t=1
L(θ) − L(ϑ) = L(ϑ + t(θ − ϑ)) t=0
Z 1
= L ′ (ϑ + t(θ − ϑ))(θ − ϑ) dt
Z0 1
= ⟨(∇L)(ϑ + t(θ − ϑ)), θ − ϑ⟩ dt
0
Z 1 (5.137)
≤ ∥(∇L)(ϑ + t(θ − ϑ))∥2 ∥θ − ϑ∥2 dt
Z0 1
≤ L∥ϑ + t(θ − ϑ) − ϑ∥2 ∥θ − ϑ∥2 dt
0
Z 1
2
= L∥θ − ϑ∥2 t dt = L2 ∥θ − ϑ∥22
0
203
Chapter 5: Optimization through ODEs
(ii) it holds for all t ∈ [0, T ] that ∥Θt − ϑ∥2 ≤ e−ct ∥ξ − ϑ∥2 , and
Proof of Proposition 5.7.1. Throughout this proof, let V : Rd → [0, ∞) satisfy for all θ ∈ Rd
that V (θ) = ∥θ − ϑ∥22 , let ϵ : [0, T ] → [0, ∞) satisfy for all t ∈ [0, T ] that ϵ(t) = ∥Θt − ϑ∥22 =
V (Θt ), and let τ ∈ [0, T ] be the real number given by
Note that (5.138) and item (ii) in Lemma 5.6.8 establish item (i). Next observe that
Lemma 5.6.4 implies that for all θ ∈ Rd it holds that V ∈ C 1 (Rd , [0, ∞)) and
Moreover, observe that the fundamental theorem of calculus (see, for example, Coleman
[85, Theorem 3.9]) and the fact that Rd ∋ v 7→ (∇L)(v) ∈ Rd and Θ : [0, T ] → Rd are
continuous functions ensure that for all t ∈ [0, T ] it holds that Θ ∈ C 1 ([0, T ], Rd ) and
d
dt
(Θt ) = −(∇L)(Θt ). (5.142)
Combining (5.138) and (5.141) hence demonstrates that for all t ∈ [0, τ ] it holds that
ϵ ∈ C 1 ([0, T ], [0, ∞)) and
ϵ′ (t) = dt
d
V (Θt ) = V ′ (Θt ) dt
d
(Θt )
d
= ⟨(∇V )(Θt ), dt (Θt )⟩
= ⟨2(Θt − ϑ), −(∇L)(Θt )⟩ (5.143)
= −2⟨(Θt − ϑ), (∇L)(Θt )⟩
≤ −2c∥Θt − ϑ∥22 = −2cϵ(t).
The Gronwall inequality, for instance, in Lemma 5.6.1 therefore implies that for all t ∈ [0, τ ]
it holds that
ϵ(t) ≤ ϵ(0)e−2ct . (5.144)
Hence, we obtain for all t ∈ [0, τ ] that
(5.145)
p p
∥Θt − ϑ∥2 = ϵ(t) ≤ ϵ(0)e−ct = ∥Θ0 − ϑ∥2 e−ct = ∥ξ − ϑ∥2 e−ct .
204
5.7. Optimization through flows of ODEs
In our proof of (5.146) we distinguish between the case ε(0) = 0 and the case ε(0) > 0. We
first prove (5.146) in the case
ε(0) = 0. (5.147)
Observe that (5.147), the assumption that r ∈ (0, ∞], and the fact that ϵ : [0, T ] → [0, ∞)
is a continuous function show that
This establishes (5.146) in the case ε(0) = 0. In the next step we prove (5.146) in the case
Note that (5.143) and the assumption that c ∈ (0, ∞) assure that for all t ∈ [0, τ ] with
ϵ(t) > 0 it holds that
ϵ′ (t) ≤ −2cϵ(t) < 0. (5.150)
Combining this with (5.149) shows that
The fact that ϵ′ : [0, T ] → [0, ∞) is a continuous function and the assumption that T ∈ (0, ∞)
therefore demonstrate that
Next note that the fundamental theorem of calculus and the assumption that ξ ∈ B imply
that for all s ∈ [0, T ] with s < inf({t ∈ [0, T ] : ϵ′ (t) > 0} ∪ {T }) it holds that
Z s
ϵ(s) = ϵ(0) + ϵ′ (u) du ≤ ϵ(0) = ∥ξ − ϑ∥22 ≤ r2 . (5.153)
0
This establishes (5.146) in the case ε(0) > 0. Observe that (5.145), (5.146), and the
assumption that c ∈ (0, ∞) demonstrate that
The fact that ϵ : [0, T ] → [0, ∞) is a continuous function, (5.140), and (5.146) hence assure
that τ = T . Combining this with (5.145) proves that for all t ∈ [0, T ] it holds that
205
Chapter 5: Optimization through ODEs
This establishes item (ii). It thus remains to prove item (iii). For this observe that (5.138)
and item (i) in Lemma 5.6.8 demonstrate that for all θ ∈ B it holds that
Combining this and item (ii) implies that for all t ∈ [0, T ] it holds that
This establishes item (iii). The proof of Proposition 5.7.1 is thus complete.
Proof of Lemma 5.7.2. Note that, for example, Teschl [394, Theorem 2.2 and Corollary 2.16]
implies (5.159) (cf., for instance, [5, Theorem 7.6] and [222, Theorem 1.1]). The proof of
Lemma 5.7.2 is thus complete.
Lemma 5.7.3 (Local existence of maximal solution of ODEs on an infinite time interval).
Let d ∈ N, ξ ∈ Rd , let ~·~ : Rd → [0, ∞) be a norm, and let G : Rd → Rd be locally Lipschitz
continuous. Then there exist a unique extended real number τ ∈ (0, ∞] and a unique
continuous function Θ : [0, τ ) → Rd such that for all t ∈ [0, τ ) it holds that
Z t
(5.160)
lim inf ~Θs ~ + s = ∞ and Θt = ξ + G(Θs ) ds.
s↗τ 0
Proof of Lemma 5.7.3. First, observe that Lemma 5.7.2 implies that there exist unique real
numbers τn ∈ (0, n], n ∈ N, and unique continuous functions Θ(n) : [0, τn ) → Rd , n ∈ N,
such that for all n ∈ N, t ∈ [0, τn ) it holds that
h i Z t
(n)
and (5.161)
(n)
1
lim inf Θs + (n−s) = ∞ Θt = ξ + G(Θ(n)
s ) ds.
s↗τn 0
This shows that for all n ∈ N, t ∈ [0, min{τn+1 , n}) it holds that
lim inf_{s↗τ_{n+1}} [~Θ_s^{(n+1)}~ + 1/(n + 1 − s)] = ∞   and   Θ_t^{(n+1)} = ξ + ∫_0^t G(Θ_s^{(n+1)}) ds. (5.162)
Hence, we obtain that for all n ∈ N, t ∈ [0, min{τn+1 , n}) it holds that
lim inf_{s↗min{τ_{n+1}, n}} [~Θ_s^{(n+1)}~ + 1/(n − s)] = ∞ (5.163)
and   Θ_t^{(n+1)} = ξ + ∫_0^t G(Θ_s^{(n+1)}) ds. (5.164)
Combining this with (5.161) demonstrates that for all n ∈ N it holds that
t = lim τn (5.167)
n→∞
Observe that for all t ∈ [0, t) there exists n ∈ N such that t ∈ [0, τn ). This, (5.161), and
(5.166) assure that for all t ∈ [0, t) it holds that Θ ∈ C([0, t), Rd ) and
Θt = ξ + ∫_0^t G(Θs) ds. (5.169)
In addition, note that (5.165) ensures that for all n ∈ N, k ∈ N ∩ [n, ∞) it holds that
This shows that for all n ∈ N, k ∈ N ∩ (n, ∞) it holds that min{τk , n} = min{τk−1 , n}.
Hence, we obtain that for all n ∈ N, k ∈ N ∩ (n, ∞) it holds that
Combining this with the fact that (τn )n∈N ⊆ [0, ∞) is a non-decreasing sequence implies
that for all n ∈ N it holds that
n o
(5.172)
min{t, n} = min lim τk , n = lim min{τk , n} = lim τn = τn .
k→∞ k→∞ k→∞
τn = min{t, n} = t. (5.173)
This, (5.161), and (5.168) demonstrate that for all n ∈ N with t < n it holds that
lim inf ~Θs ~ = lim inf ~Θs ~ = lim inf Θ(n)
s
s↗t s↗τn s↗τn
h i
1
= − (n−t) + lim inf Θ(n)
s
+ 1
(n−t) (5.174)
s↗τn
h i
1
= − (n−t) + lim inf Θ(n)
s
+ 1
(n−s)
= ∞.
s↗τn
Next note that for all t̂ ∈ (0, ∞], Θ̂ ∈ C([0, t̂), RRd ), n ∈ N, t ∈ [0, min{t̂, n}) with
s
lim inf s↗t̂ [~Θ̂s ~ + s] = ∞ and ∀ s ∈ [0, t̂) : Θ̂s = ξ + 0 G(Θ̂u ) du it holds that
h i Z t
1
lim inf ~Θ̂s ~ + (n−s) = ∞ and Θ̂t = ξ + G(Θ̂s ) ds. (5.176)
s↗min{t̂,n} 0
This and (5.161) prove that for all t̂ ∈ (0, ∞], Θ̂ ∈ C([0, t̂), Rd ), n ∈ N with lim inf t↗t̂ [~Θ̂t ~+
Rt
t] = ∞ and ∀ t ∈ [0, t̂) : Θ̂t = ξ + 0 G(Θ̂s ) ds it holds that
Combining (5.169) and (5.175) hence assures that for all t̂ ∈ R(0, ∞], Θ̂ ∈ C([0, t̂), Rd ),
t
n ∈ N with lim inf t↗t̂ [~Θ̂t ~ + t] = ∞ and ∀ t ∈ [0, t̂) : Θ̂t = ξ + 0 G(Θ̂s ) ds it holds that
This and (5.167) show that for all t̂ ∈ (0, ∞], Θ̂ ∈ C([0, t̂), Rd ) with lim inf t↗t̂ [~Θ̂t ~+t] = ∞
Rt
and ∀ t ∈ [0, t̂) : Θ̂t = ξ + 0 G(Θ̂s ) ds it holds that
t̂ = t and Θ̂ = Θ. (5.179)
Combining this, (5.169), and (5.175) completes the proof of Lemma 5.7.3.
(i) there exists a unique continuous function Θ : [0, ∞) → Rd such that for all t ∈ [0, ∞)
it holds that
Θt = ξ − ∫_0^t (∇L)(Θs) ds, (5.181)
(iii) it holds for all t ∈ [0, ∞) that ∥Θt − ϑ∥2 ≤ e−ct ∥ξ − ϑ∥2 , and
Proof of Theorem 5.7.4. First, observe that the assumption that L ∈ C 2 (Rd , R) ensures
that
Rd ∋ θ 7→ −(∇L)(θ) ∈ Rd (5.183)
is continuously differentiable. The fundamental theorem of calculus hence implies that
Rd ∋ θ 7→ −(∇L)(θ) ∈ Rd (5.184)
is locally Lipschitz continuous. Combining this with Lemma 5.7.3 (applied with G ↶ (Rd ∋
θ 7→ −(∇L)(θ) ∈ Rd ) in the notation of Lemma 5.7.3) proves that there exists a unique
extended real number τ ∈ (0, ∞] and a unique continuous function Θ : [0, τ ) → Rd such
that for all t ∈ [0, τ ) it holds that
lim inf_{s↗τ} [∥Θs∥₂ + s] = ∞   and   Θt = ξ − ∫_0^t (∇L)(Θs) ds. (5.185)
Next observe that Proposition 5.7.1 proves that for all t ∈ [0, τ ) it holds that
(i) there exists a unique continuous function Θ : [0, ∞) → Rd such that for all t ∈ [0, ∞)
it holds that
Θt = ξ − ∫_0^t (∇L)(Θs) ds, (5.190)
(iii) it holds for all t ∈ [0, ∞) that ∥Θt − ϑ∥2 ≤ e−ct ∥ξ − ϑ∥2 , and
Proof of Corollary 5.7.5. Theorem 5.7.4 and Lemma 5.6.9 establish items (i), (ii), (iii), and
(iv). The proof of Corollary 5.7.5 is thus complete.
Chapter 6

Deterministic GD optimization methods
This chapter reviews and studies deterministic GD-type optimization methods such as the
classical plain-vanilla GD optimization method (see Section 6.1 below) as well as more
sophisticated GD-type optimization methods including GD optimization methods with
momenta (cf. Sections 6.3, 6.4, and 6.8 below) and GD optimization methods with adaptive
modifications of the learning rates (cf. Sections 6.5, 6.6, 6.7, and 6.8 below).
There are several other outstanding reviews on gradient based optimization methods in
the literature; cf., for example, the books [9, Chapter 5], [52, Chapter 9], [57, Chapter 3],
[164, Sections 4.3 and 5.9 and Chapter 8], [303], and [373, Chapter 14] and the references
therein and, for instance, the survey articles [33, 48, 122, 354, 386] and the references
therein.
6.1 GD optimization
In this section we review and study the classical plain-vanilla GD optimization method
(cf., for example, [303, Section 1.2.3], [52, Section 9.3], and [57, Chapter 3]). A simple
intuition behind the GD optimization method is the idea to solve a minimization problem
by performing successive steps in direction of the steepest descents of the objective function,
that is, by performing successive steps in the opposite direction of the gradients of the
objective function.
A slightly different and maybe a bit more accurate perspective for the GD optimization
method is to view the GD optimization method as a plain-vanilla Euler discretization of
the associated GF ODE (see, for example, Theorem 5.7.4 in Chapter 5 above)
Definition 6.1.1 (GD optimization method). Let d ∈ N, (γn )n∈N ⊆ [0, ∞), ξ ∈ Rd and
let L : Rd → R and G : Rd → Rd satisfy for all U ∈ {V ⊆ Rd : V is open}, θ ∈ U with
let ξ ∈ Rd, let (γn)n∈N ⊆ [0, ∞), and let Θ : N0 → Rd satisfy for all n ∈ N that
Θ0 = ξ and Θn = Θn−1 − γn (∇L)(Θn−1 ) (6.5)
(cf. Definitions 1.1.3 and 1.2.1 and Corollary 5.3.6). Then Θ is the GD process for the
objective function L with learning rates (γn )n∈N and initial value ξ.
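For orientation, the GD recursion in (6.5) can be written out in a few lines of Python. The quadratic objective function, the constant learning rates, and the initial value in the following sketch are illustrative choices and are not part of the recursion itself.

import numpy as np

# Minimal sketch of the GD recursion in (6.5); the objective function,
# the learning rates, and the initial value are example choices.
vartheta = np.array([1.0, -2.0])                # minimizer of the example objective
def nabla_L(theta):                             # gradient of L(theta) = 0.5*||theta - vartheta||_2^2
    return theta - vartheta
def gamma(n):                                   # learning rates (gamma_n)_{n in N}
    return 0.1

Theta = np.array([5.0, 3.0])                    # Theta_0 = xi
for n in range(1, 51):
    Theta = Theta - gamma(n) * nabla_L(Theta)   # Theta_n = Theta_{n-1} - gamma_n (nabla L)(Theta_{n-1})
print(Theta)                                    # approximately vartheta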
Proof for Example 6.1.2. Note that (6.5), (6.1), and (6.2) demonstrate that Θ is the GD
process for the objective function L with learning rates (γn )n∈N and initial value ξ. The
proof for Example 6.1.2 is thus complete.
Proof of Theorem 6.1.3. Observe that the fundamental theorem of calculus assures that
for all g ∈ C 1 ([0, 1], R) it holds that
g(1) = g(0) + ∫_0^1 g′(r) dr = g(0) + ∫_0^1 (g′(r)(1 − r)^0)/0! dr. (6.7)
Furthermore, note that integration by parts ensures that for all n ∈ N, g ∈ C n+1 ([0, 1], R)
it holds that
∫_0^1 (g^{(n)}(r)(1 − r)^{n−1})/(n − 1)! dr = [−(g^{(n)}(r)(1 − r)^n)/n!]_{r=0}^{r=1} + ∫_0^1 (g^{(n+1)}(r)(1 − r)^n)/n! dr
= g^{(n)}(0)/n! + ∫_0^1 (g^{(n+1)}(r)(1 − r)^n)/n! dr. (6.8)
Combining this with (6.7) and induction shows that for all g ∈ C^N([0, 1], R) it holds that
g(1) = [∑_{n=0}^{N−1} g^{(n)}(0)/n!] + ∫_0^1 (g^{(N)}(r)(1 − r)^{N−1})/(N − 1)! dr. (6.9)
Proof of Lemma 6.1.4. Note that the fundamental theorem of calculus, the hypothesis that
G ∈ C 1 (Rd , Rd ), and (6.10) assure that for all t ∈ (0, ∞) it holds that Θ ∈ C 1 ([0, ∞), Rd )
and
Combining this with the hypothesis that G ∈ C 1 (Rd , Rd ) and the chain rule ensures that
for all t ∈ (0, ∞) it holds that Θ ∈ C 2 ([0, ∞), Rd ) and
∥Θ_{T+γ} − θ∥₂ = ∥Θ_T + γG(Θ_T) + γ² ∫_0^1 (1 − r)G′(Θ_{T+rγ})G(Θ_{T+rγ}) dr − (Θ_T + γG(Θ_T))∥₂
≤ γ² ∫_0^1 (1 − r)∥G′(Θ_{T+rγ})G(Θ_{T+rγ})∥₂ dr (6.16)
≤ c²γ² ∫_0^1 r dr = c²γ²/2 ≤ c²γ².
Corollary 6.1.5 (Local error of the Euler method for GF ODEs). Let d ∈ N, T, γ, c ∈ [0, ∞),
L ∈ C 2 (Rd , R), Θ ∈ C([0, ∞), Rd ), θ ∈ Rd satisfy for all x, y ∈ Rd , t ∈ [0, ∞) that
Θt = Θ0 − ∫_0^t (∇L)(Θs) ds,   θ = Θ_T − γ(∇L)(Θ_T), (6.17)
Proof of Corollary 6.1.5. Throughout this proof, let G : Rd → Rd satisfy for all θ ∈ Rd that
Proof of Proposition 6.1.6. We prove (6.23) by induction on n ∈ N0 . For the base case
n = 0 note that the assumption that Θ0 = ξ ensures that V (Θ0 ) = V (ξ). This establishes
(6.23) in the base case n = 0. For the
Q induction step observe that (6.22) and (6.21) ensure
that for all n ∈ N0 with V (Θn ) ≤ ( nk=1 ε(γk ))V (ξ) it holds that
Induction thus establishes (6.23). The proof of Proposition 6.1.6 is thus complete.
Corollary 6.1.7 (On quadratic Lyapunov-type functions for the GD optimization method).
Let d ∈ N, ϑ, ξ ∈ Rd , c ∈ (0, ∞), (γn )n∈N ⊆ [0, c], L ∈ C 1 (Rd , R), let ~·~ : Rd → [0, ∞) be
a norm, let ε : [0, c] → [0, ∞) satisfy for all θ ∈ Rd , t ∈ [0, c] that
~θ − t(∇L)(θ) − ϑ~2 ≤ ε(t)~θ − ϑ~2 , (6.25)
and let Θ : N0 → Rd satisfy for all n ∈ N that
Θ0 = ξ and Θn = Θn−1 − γn (∇L)(Θn−1 ). (6.26)
Then it holds for all n ∈ N0 that
~Θn − ϑ~ ≤ [∏_{k=1}^{n} (ε(γk))^{1/2}] ~ξ − ϑ~. (6.27)
(iii) it holds for all γ ∈ (0, 2c/L²) that 0 ≤ 1 − 2γc + γ²L² < 1, and
Proof of Lemma 6.1.8. First of all, note that (6.30) ensures that for all θ ∈ B, γ ∈ [0, ∞)
it holds that
0 ≤ ~θ − γ(∇L)(θ) − ϑ~2 = ~(θ − ϑ) − γ(∇L)(θ)~2
= ~θ − ϑ~2 − 2γ ⟨⟨θ − ϑ, (∇L)(θ)⟩⟩ + γ 2 ~(∇L)(θ)~2
(6.33)
≤ ~θ − ϑ~2 − 2γc~θ − ϑ~2 + γ 2 L2 ~θ − ϑ~2
= (1 − 2γc + γ 2 L2 )~θ − ϑ~2 .
This establishes item (ii). Moreover, note that the fact that B\{ϑ} ≠ ∅ and (6.33) assure
that for all γ ∈ [0, ∞) it holds that
1 − 2γc + γ²L² ≥ 0. (6.34)
This implies that c²/L² ≤ 1. Therefore, we obtain that c² ≤ L². This establishes item (i).
Furthermore, observe that (6.34) ensures that for all γ ∈ (0, 2c/L²) it holds that
This proves item (iii). In addition, note that for all γ ∈ [0, c/L²] it holds that
Combining this with (6.33) establishes item (iv). The proof of Lemma 6.1.8 is thus
complete.
Exercise 6.1.3. Prove or disprove the following statement: There exist d ∈ N, γ ∈ (0, ∞),
ε ∈ (0, 1), r ∈ (0, ∞], ϑ, θ ∈ Rd and there exists a function G : Rd → Rd such that
∥θ − ϑ∥2 ≤ r, ∀ ξ ∈ {w ∈ Rd : ∥w − ϑ∥2 ≤ r} : ∥ξ − γg(ξ) − ϑ∥2 ≤ ε∥ξ − ϑ∥2 , and
2 γ
⟨θ − ϑ, g(θ)⟩ < min 1−ε , 2 max ∥θ − ϑ∥22 , ∥G(θ)∥22 . (6.38)
2γ
Exercise 6.1.4. Prove or disprove the following statement: For all d ∈ N, r ∈ (0, ∞],
ϑ ∈ Rd and for every function G : Rd → Rd which satisfies ∀ θ ∈ {w ∈ Rd : ∥w − ϑ∥2 ≤
r} : ⟨θ − ϑ, G(θ)⟩ ≥ 12 max{∥θ − ϑ∥22 , ∥G(θ)∥22 } it holds that
∀θ ∈ {w ∈ Rd : ∥w − ϑ∥2 ≤ r} : ⟨θ − ϑ, G(θ)⟩ ≥ 21 ∥θ − ϑ∥22 ∧ ∥G(θ)∥2 ≤ 2∥θ − ϑ∥2 . (6.39)
Exercise 6.1.5. Prove or disprove the following statement: For all d ∈ N, c ∈ (0, ∞),
r ∈ (0, ∞], ϑ, v ∈ Rd , L ∈ C 1 (Rd , R), s, t ∈ [0, 1] such that ∥v∥2 ≤ r, s ≤ t, and
∀ θ ∈ {w ∈ Rd : ∥w − ϑ∥2 ≤ r} : ⟨θ − ϑ, (∇L)(θ)⟩ ≥ c∥θ − ϑ∥22 it holds that
L(ϑ + tv) − L(ϑ + sv) ≥ 2c (t2 − s2 )∥v∥22 . (6.40)
Exercise 6.1.6. Prove or disprove the following statement: For every d ∈ N, c ∈ (0, ∞),
r ∈ (0, ∞], ϑ ∈ Rd and for every L ∈ C 1 (Rd , R) which satisfies for all v ∈ Rd , s, t ∈ [0, 1]
with ∥v∥2 ≤ r and s ≤ t that L(ϑ + tv) − L(ϑ + sv) ≥ c(t2 − s2 )∥v∥22 it holds that
∀ θ ∈ {w ∈ Rd : ∥w − ϑ∥2 ≤ r} : ⟨θ − ϑ, (∇L)(θ)⟩ ≥ 2c∥θ − ϑ∥22 . (6.41)
Exercise 6.1.7. Let d ∈ N and for every v ∈ Rd , R ∈ [0, ∞] let BR (v) = {w ∈ Rd : ∥w−v∥2 ≤
R}. Prove or disprove the following statement: For all r ∈ (0, ∞], ϑ ∈ Rd , L ∈ C 1 (Rd , R)
the following two statements are equivalent:
(i) There exists c ∈ (0, ∞) such that for all θ ∈ Br (ϑ) it holds that
⟨θ − ϑ, (∇L)(θ)⟩ ≥ c∥θ − ϑ∥22 . (6.42)
(ii) There exists c ∈ (0, ∞) such that for all v, w ∈ Br (ϑ), s, t ∈ [0, 1] with s ≤ t it holds
that
L(ϑ + t(v − ϑ)) − L(ϑ + s(v − ϑ)) ≥ c(t2 − s2 )∥v − ϑ∥22 . (6.43)
Exercise 6.1.8. Let d ∈ N and for every v ∈ Rd , R ∈ [0, ∞] let BR (v) = {w ∈ Rd : ∥v −w∥2 ≤
R}. Prove or disprove the following statement: For all r ∈ (0, ∞], ϑ ∈ Rd , L ∈ C 1 (Rd , R)
the following three statements are equivalent:
(i) There exist c, L ∈ (0, ∞) such that for all θ ∈ Br (ϑ) it holds that
⟨θ − ϑ, (∇L)(θ)⟩ ≥ c∥θ − ϑ∥22 and ∥(∇L)(θ)∥2 ≤ L∥θ − ϑ∥2 . (6.44)
(ii) There exist γ ∈ (0, ∞), ε ∈ (0, 1) such that for all θ ∈ Br (ϑ) it holds that
∥θ − γ(∇L)(θ) − ϑ∥2 ≤ ε∥θ − ϑ∥2 . (6.45)
(iii) There exists c ∈ (0, ∞) such that for all θ ∈ Br (ϑ) it holds that
⟨θ − ϑ, (∇L)(θ)⟩ ≥ c max ∥θ − ϑ∥22 , ∥(∇L)(θ)∥22 . (6.46)
(iii) it holds for all n ∈ N that ∥Θn − ϑ∥2 ≤ (1 − 2cγn + (γn )2 L2 )1/2 ∥Θn−1 − ϑ∥2 ≤ r,
and
Proof of Proposition 6.1.9. First, observe that (6.47) and item (ii) in Lemma 5.6.8 prove
item (i). Moreover, note that (6.47), item (iii) in Lemma 6.1.8, the assumption that for all
n ∈ N it holds that γn ∈ [0, 2c/L²], and the fact that
1 − 2c(2c/L²) + (2c/L²)²L² = 1 − 4c²/L² + (4c²/L⁴)L² = 1 − 4c²/L² + 4c²/L² = 1 (6.51)
establish item (ii). Next we claim that for all n ∈ N it holds that
∥Θn − ϑ∥₂ ≤ (1 − 2cγn + (γn)²L²)^{1/2} ∥Θ_{n−1} − ϑ∥₂ ≤ r. (6.52)
We now prove (6.52) by induction on n ∈ N. For the base case n = 1 observe that (6.48),
the assumption that Θ0 = ξ ∈ B, item (ii) in Lemma 6.1.8, and item (ii) ensure that
∥Θ1 − ϑ∥22 = ∥Θ0 − γ1 (∇L)(Θ0 ) − ϑ∥22
≤ (1 − 2cγ1 + (γ1 )2 L2 )∥Θ0 − ϑ∥22 (6.53)
≤ ∥Θ0 − ϑ∥22 ≤ r2 .
This establishes (6.52) in the base case n = 1. For the induction step note that (6.48),
item (ii) in Lemma 6.1.8, and item (ii) imply that for all n ∈ N with Θn ∈ B it holds that
∥Θn+1 − ϑ∥22 = ∥Θn − γn+1 (∇L)(Θn ) − ϑ∥22
≤ (1 − 2cγn+1 + (γn+1 )2 L2 )∥Θn − ϑ∥22
| {z } (6.54)
∈[0,1]
≤ ∥Θn − ϑ∥22 ≤ r2 .
This demonstrates that for all n ∈ N with ∥Θn − ϑ∥2 ≤ r it holds that
∥Θ_{n+1} − ϑ∥₂ ≤ (1 − 2cγ_{n+1} + (γ_{n+1})²L²)^{1/2} ∥Θn − ϑ∥₂ ≤ r. (6.55)
Induction thus proves (6.52). Next observe that (6.52) establishes item (iii). Moreover, note
that induction, item (ii), and item (iii) prove item (iv). Furthermore, observe that item (iii)
and the fact that Θ0 = ξ ∈ B ensure that for all n ∈ N0 it holds that Θn ∈ B. Combining
this, (6.47), and Lemma 5.6.9 with items (i) and (iv) establishes item (v). The proof of
Proposition 6.1.9 is thus complete.
and
(iv) it holds for all n ∈ N0 that
0 ≤ L(Θn) − L(ϑ) ≤ (L/2)∥Θn − ϑ∥₂² ≤ (L/2)(1 − 2cγ + γ²L²)^n ∥ξ − ϑ∥₂². (6.59)
Proof of Corollary 6.1.10. Observe that item (iii) in Lemma 6.1.8 proves item (ii). In
addition, note that Proposition 6.1.9 establishes items (i), (iii), and (iv). The proof of
Corollary 6.1.10 is thus complete.
Corollary 6.1.10 above establishes under suitable hypotheses convergence of the con-
sidered GD process in the case where the learning rates are constant and strictly smaller
than 2c/L². The next result, Theorem 6.1.11 below, demonstrates that the condition that
the learning rates are strictly smaller than 2c/L² in Corollary 6.1.10 can, in general, not be
relaxed.
Theorem 6.1.11 (Sharp bounds on the learning rate for the convergence of GD ). Let
d ∈ N, α ∈ (0, ∞), γ ∈ R, ϑ ∈ Rd , ξ ∈ Rd \{ϑ}, let L : Rd → R satisfy for all θ ∈ Rd that
Proof of Theorem 6.1.11. First of all, note that Lemma 5.6.4 ensures that for all θ ∈ Rd it
holds that L ∈ C∞(Rd, R) and
(∇L)(θ) = α(θ − ϑ). (6.63)
This proves item (ii). Moreover, observe that (6.63) assures that for all θ ∈ Rd it holds that
(cf. Definition 1.4.7). This establishes item (i). Observe that (6.61) and (6.63) demonstrate
that for all n ∈ N it holds that
Θn − ϑ = Θn−1 − γ(∇L)(Θn−1 ) − ϑ
= Θn−1 − γα(Θn−1 − ϑ) − ϑ (6.65)
= (1 − γα)(Θn−1 − ϑ).
The assumption that Θ0 = ξ and induction hence prove that for all n ∈ N0 it holds that
Θn − ϑ = (1 − γα)^n (Θ0 − ϑ) = (1 − γα)^n (ξ − ϑ). (6.66)
This establishes item (iii). Combining item (iii) with the fact that for all t ∈ (0, 2/α) it holds
that |1 − tα| ∈ [0, 1), the fact that for all t ∈ {0, 2/α} it holds that |1 − tα| = 1, the fact
that for all t ∈ R\[0, 2/α] it holds that |1 − tα| ∈ (1, ∞), and the fact that ∥ξ − ϑ∥2 > 0
establishes item (iv). The proof of Theorem 6.1.11 is thus complete.
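The sharpness of the learning rate threshold 2/α in Theorem 6.1.11 can also be observed numerically. The following Python sketch uses the example objective L(θ) = (α/2)∥θ − ϑ∥₂²; the concrete values of α, ϑ, and ξ are assumptions made only for this experiment.

import numpy as np

# Illustration of items (iii)-(iv) in Theorem 6.1.11: the error equals
# |1 - gamma*alpha|^n * ||xi - vartheta||_2, so GD converges if and only if
# gamma lies in (0, 2/alpha). alpha, vartheta, xi are example choices.
alpha = 2.0
vartheta = np.array([1.0, 1.0])
xi = np.array([3.0, -1.0])

def gd_error(gamma, n):
    Theta = xi.copy()
    for _ in range(n):
        Theta = Theta - gamma * alpha * (Theta - vartheta)  # (nabla L)(theta) = alpha*(theta - vartheta)
    return np.linalg.norm(Theta - vartheta)

for gamma in [0.5 / alpha, 1.9 / alpha, 2.0 / alpha, 2.1 / alpha]:
    print(gamma, gd_error(gamma, 30))   # decays fast, decays slowly, stays constant, grows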
Θ(r)
(r) −r
n = Θn−1 − n (∇L)(Θn−1 ).
(r)
(6.72)
Prove or disprove the following statement: It holds for all r ∈ (1, ∞) that
Θ(r)
(r) −r
n = Θn−1 − n (∇L)(Θn−1 ).
(r)
(6.75)
Prove or disprove the following statement: It holds for all r ∈ (1, ∞) that
Corollary 6.1.12 (Qualitative convergence of GD). Let d ∈ N, L ∈ C 1 (Rd , R), (γn )n∈N ⊆
R, c, L ∈ (0, ∞), ξ, ϑ ∈ Rd satisfy for all θ ∈ Rd that
(ii) there exist ϵ ∈ (0, 1), C ∈ R such that for all n ∈ N0 it holds that
and
(iii) there exist ϵ ∈ (0, 1), C ∈ R such that for all n ∈ N0 it holds that
(cf. (6.78)), let m ∈ N satisfy for all n ∈ N that γm+n ∈ [α, β], and let h : R → R satisfy for
all t ∈ R that
h(t) = 1 − 2ct + t2 L2 . (6.83)
Observe that (6.77) and item (ii) in Lemma 5.6.8 prove item (i). In addition, observe that
the fact that for all t ∈ R it holds that h′(t) = −2c + 2tL² implies that for all t ∈ (−∞, c/L²]
it holds that
h′(t) ≤ −2c + 2(c/L²)L² = 0. (6.84)
The fundamental theorem of calculus hence assures that for all t ∈ [α, β] ∩ [0, c/L²] it holds
that
h(t) = h(α) + ∫_α^t h′(s) ds ≤ h(α) + ∫_α^t 0 ds = h(α) ≤ max{h(α), h(β)}. (6.85)
Furthermore, observe that the fact that for all t ∈ R it holds that h′(t) = −2c + 2tL² implies
that for all t ∈ [c/L², ∞) it holds that
h′(t) ≥ −2c + 2(c/L²)L² = 0. (6.86)
The fundamental theorem of calculus hence ensures that for all t ∈ [α, β] ∩ [c/L², ∞) it holds
that
max{h(α), h(β)} ≥ h(β) = h(t) + ∫_t^β h′(s) ds ≥ h(t) + ∫_t^β 0 ds = h(t). (6.87)
Combining this and (6.85) establishes that for all t ∈ [α, β] it holds that
Moreover, observe that the fact that α, β ∈ (0, 2c/L²) and item (iii) in Lemma 6.1.8 ensure
that
{h(α), h(β)} ⊆ [0, 1). (6.89)
Hence, we obtain that
max{h(α), h(β)} ∈ [0, 1). (6.90)
This implies that there exists ε ∈ R such that
Next note that the fact that for all n ∈ N it holds that γ_{m+n} ∈ [α, β] ⊆ [0, 2c/L²], items (ii)
and (iv) in Proposition 6.1.9 (applied with d ↶ d, c ↶ c, L ↶ L, r ↶ ∞, (γn)n∈N ↶ (γ_{m+n})n∈N,
ϑ ↶ ϑ, ξ ↶ Θm, L ↶ L in the notation of Proposition 6.1.9), (6.77), (6.79), and (6.88)
demonstrate that for all n ∈ N it holds that
∥Θ_{m+n} − ϑ∥₂ ≤ [∏_{k=1}^{n} (1 − 2cγ_{m+k} + (γ_{m+k})²L²)^{1/2}] ∥Θm − ϑ∥₂
= [∏_{k=1}^{n} (h(γ_{m+k}))^{1/2}] ∥Θm − ϑ∥₂ ≤ ε^{n/2} ∥Θm − ϑ∥₂. (6.92)
Hence, we obtain for all n ∈ N0 ∩ [m, ∞) that
∥Θn − ϑ∥₂ ≤ ε^{(n−m)/2} ∥Θm − ϑ∥₂. (6.93)
This proves item (ii). In addition, note that Lemma 5.6.9, item (i), and (6.95) assure that
for all n ∈ N0 it holds that
0 ≤ L(Θn) − L(ϑ) ≤ (L/2)∥Θn − ϑ∥₂² ≤ (L/2) ε^n max{ ∥Θk − ϑ∥₂² / ε^k : k ∈ {0, 1, . . . , m} }. (6.96)
This establishes item (iii). The proof of Corollary 6.1.12 is thus complete.
and
Proof of Corollary 6.1.13. Note that item (ii) in Proposition 6.1.9 and the assumption that
for all n ∈ N it holds that γn ∈ [0, c/L²] ensure that for all n ∈ N it holds that
0 ≤ 1 − 2cγn + (γn)²L² ≤ 1 − 2cγn + γn(c/L²)L² = 1 − 2cγn + γnc = 1 − cγn ≤ 1. (6.101)
This proves item (ii). Moreover, note that (6.101) and Proposition 6.1.9 establish items (i),
(iii), and (iv). The proof of Corollary 6.1.13 is thus complete.
In the next result, Corollary 6.1.14 below, we, roughly speaking, specialize Corol-
lary 6.1.13 above to the case where the learning rates (γn)n∈N ⊆ [0, c/L²] are a constant
sequence.
Corollary 6.1.14 (Error estimates in the case of small and constant learning rates). Let
d ∈ N, c, L ∈ (0, ∞), r ∈ (0, ∞], γ ∈ (0, c/L²], ϑ ∈ Rd, B = {w ∈ Rd : ∥w − ϑ∥2 ≤ r}, ξ ∈ B,
L ∈ C 1 (Rd , R) satisfy for all θ ∈ B that
⟨θ − ϑ, (∇L)(θ)⟩ ≥ c∥θ − ϑ∥22 and ∥(∇L)(θ)∥2 ≤ L∥θ − ϑ∥2 , (6.102)
and let Θ : N0 → Rd satisfy for all n ∈ N that
Θ0 = ξ and Θn = Θn−1 − γ(∇L)(Θn−1 ) (6.103)
(cf. Definitions 1.4.7 and 3.3.4). Then
(i) it holds that {θ ∈ B : L(θ) = inf w∈B L(w)} = {ϑ},
(ii) it holds that 0 ≤ 1 − cγ < 1,
(iii) it holds for all n ∈ N0 that ∥Θn − ϑ∥2 ≤ (1 − cγ)n/2 ∥ξ − ϑ∥2 , and
(iv) it holds for all n ∈ N0 that 0 ≤ L(Θn) − L(ϑ) ≤ (L/2)(1 − cγ)^n ∥ξ − ϑ∥₂².
Proof of Corollary 6.1.14. Corollary 6.1.14 is an immediate consequence of Corollary 6.1.13.
The proof of Corollary 6.1.14 is thus complete.
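The error estimate in item (iii) of Corollary 6.1.14 can be checked numerically on a simple quadratic example. In the Python sketch below the objective function, the values λ_i, ϑ, ξ, and the choice γ = c/L² are assumptions made for illustration; c and L are taken as the smallest and largest λ_i, for which (6.102) holds.

import numpy as np

# Numerical check of item (iii) in Corollary 6.1.14 for the example objective
# L(theta) = 0.5 * sum_i lam_i * (theta_i - vartheta_i)^2 with c = min_i lam_i, L = max_i lam_i.
lam = np.array([1.0, 4.0, 9.0])
vartheta = np.zeros(3)
xi = np.array([1.0, 1.0, 1.0])
c, L = lam.min(), lam.max()
gamma = c / L**2                                   # constant learning rate in (0, c/L^2]

Theta = xi.copy()
for n in range(1, 21):
    Theta = Theta - gamma * lam * (Theta - vartheta)
    if n % 5 == 0:
        bound = (1 - c * gamma) ** (n / 2) * np.linalg.norm(xi - vartheta)
        print(n, np.linalg.norm(Theta - vartheta), bound)   # the error stays below the bound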
Lemma 6.1.15 (Properties of the spectrum of real symmetric matrices). Let d ∈ N, let
A ∈ Rd×d be a symmetric matrix, and let
Then
and
Proof of Lemma 6.1.15. Throughout this proof, let e1 , e2 , . . . , ed ∈ Rd be the vectors given
by
Observe that the spectral theorem for symmetric matrices (see, for example, Petersen [331,
Theorem 4.3.4]) proves that there exist (d × d)-matrices Λ = (Λi,j )(i,j)∈{1,2,...,d}2 , O =
(Oi,j )(i,j)∈{1,2,...,d}2 ∈ Rd×d such that S = {Λ1,1 , Λ2,2 , . . . , Λd,d }, O∗ O = OO∗ = Id , A = OΛO∗ ,
and
Λ1,1 0
.. d×d
(6.108)
Λ= . ∈R
0 Λd,d
(cf. Definition 1.5.5). Hence, we obtain that S ⊆ R. Next note that the assumption
that S = {λ ∈ C : (∃ v ∈ Cd \{0} : Av = λv)} ensures that for every λ ∈ S there exists
v ∈ Cd \{0} such that
The fact that S ⊆ R therefore demonstrates that for every λ ∈ S there exists v ∈ Rd \{0}
such that Av = λv. This and the fact that S ⊆ R ensure that S ⊆ {λ ∈ R : (∃ v ∈
Rd \{0} : Av = λv)}. Combining this and the fact that {λ ∈ R : (∃ v ∈ Rd \{0} : Av =
λv)} ⊆ S proves item (i). Furthermore, note that (6.108) assures that for all v =
(v1 , v2 , . . . , vd ) ∈ Rd it holds that
" d
#1/2 " d
#1/2
X X
2
max |Λ1,1 |2 , . . . , |Λd,d |2 |vi |2
∥Λv∥2 = |Λi,i vi | ≤
i=1 i=1
i1/2
(6.110)
h 2 2
= max |Λ1,1 |, . . . , |Λd,d | ∥v∥2
= max |Λ1,1 |, . . . , |Λd,d | ∥v∥2
= maxλ∈S |λ| ∥v∥2
(cf. Definition 3.3.4). The fact that O is an orthogonal matrix and the fact that A = OΛO∗
therefore imply that for all v ∈ Rd it holds that
In addition, note that the fact that S = {Λ1,1 , Λ2,2 . . . , Λd,d } ensures that there exists
j ∈ {1, 2, . . . , d} such that
|Λj,j | = maxλ∈S |λ|. (6.113)
Next observe that the fact that A = OΛO∗ , the fact that O is an orthogonal matrix, and
(6.113) imply that
∥Av∥2 ∥AOej ∥2
sup ≥ = ∥OΛO∗ Oej ∥2 = ∥OΛej ∥2
d
v∈R \{0} ∥v∥ 2 ∥Oe ∥
j 2 (6.114)
= ∥Λej ∥2 = ∥Λj,j ej ∥2 = |Λj,j | = maxλ∈S |λ|.
Combining this and (6.112) establishes item (ii). It thus remains to prove item (iii). For
this note that (6.108) ensures that for all v = (v1 , v2 , . . . , vd ) ∈ Rd it holds that
d
X d
X
2
⟨v, Λv⟩ = Λi,i |vi | ≤ max{Λ1,1 , . . . , Λd,d }|vi |2
i=1 i=1
(6.115)
= max{Λ1,1 , . . . , Λd,d }∥v∥22 = max(S)∥v∥22
(cf. Definition 1.4.7). The fact that O is an orthogonal matrix and the fact that A = OΛO∗
therefore demonstrate that for all v ∈ Rd it holds that
Moreover, observe that (6.108) implies that for all v = (v1 , v2 , . . . , vd ) ∈ Rd it holds that
d
X d
X
2
⟨v, Λv⟩ = Λi,i |vi | ≥ min{Λ1,1 , . . . , Λd,d }|vi |2
i=1 i=1
(6.117)
= min{Λ1,1 , . . . , Λd,d }∥v∥22 = min(S)∥v∥22 .
The fact that O is an orthogonal matrix and the fact that A = OΛO∗ hence demonstrate
that for all v ∈ Rd it holds that
Combining this with (6.116) establishes item (iii). The proof of Lemma 6.1.15 is thus
complete.
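The two facts used in the proof above (for a real symmetric matrix the operator 2-norm equals the largest absolute eigenvalue, and ⟨v, Av⟩ lies between min(S)∥v∥₂² and max(S)∥v∥₂²) are easy to check numerically. The random matrix and test vector in the following Python sketch are illustrative data only.

import numpy as np

# Numerical illustration of the spectral facts in Lemma 6.1.15 for a random
# real symmetric matrix (the matrix and the test vector are example data).
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                     # real symmetric matrix
S = np.linalg.eigvalsh(A)             # spectrum of A (real by symmetry)

print(np.linalg.norm(A, 2), np.max(np.abs(S)))                # operator 2-norm equals max |lambda|

v = rng.standard_normal(4)
print(S.min() * (v @ v) <= v @ A @ v <= S.max() * (v @ v))    # True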
We now present the promised Proposition 6.1.16 which discloses suitable conditions
(cf. (6.119) and (6.120) below) on the Hessians of the objective function of the considered
optimization problem which are sufficient to ensure that (6.47) is satisfied so that we are
in the position to apply the error analysis in Sections 6.1.4.1, 6.1.4.2, 6.1.4.3, and 6.1.4.4
above.
Proposition 6.1.16 (Conditions on the spectrum of the Hessian of the objective function
at a local minimum point). Let d ∈ N, let ~·~ : Rd×d → [0, ∞) satisfy for all A ∈ Rd×d that
~A~ = sup_{v∈Rd\{0}} (∥Av∥₂/∥v∥₂), and let λ, α ∈ (0, ∞), β ∈ [α, ∞), ϑ ∈ Rd, L ∈ C2(Rd, R) satisfy
for all v, w ∈ Rd that
Proof of Proposition 6.1.16. Throughout this proof, let B ⊆ Rd be the set given by
B = {w ∈ Rd : ∥w − ϑ∥₂ ≤ α/λ} (6.122)
Note that the fact that (Hess L)(ϑ) ∈ Rd×d is a symmetric matrix, item (i) in Lemma 6.1.15,
and (6.120) imply that
Next observe that the assumption that (∇L)(ϑ) = 0 and the fundamental theorem of
calculus ensure that for all θ, w ∈ Rd it holds that
(cf. Definition 1.4.7). The fact that (Hess L)(ϑ) ∈ Rd×d is a symmetric matrix, item (iii)
in Lemma 6.1.15, and the Cauchy-Schwarz inequality therefore imply that for all θ ∈ B it
holds that
⟨θ − ϑ, (∇L)(θ)⟩
≥ θ − ϑ, [(Hess L)(ϑ)](θ − ϑ)
Z 1
− θ − ϑ, (Hess L)(ϑ + t(θ − ϑ)) − (Hess L)(ϑ) (θ − ϑ) dt
0 (6.126)
≥ min(S)∥θ − ϑ∥22
Z 1
− ∥θ − ϑ∥2 (Hess L)(ϑ + t(θ − ϑ)) − (Hess L)(ϑ) (θ − ϑ) 2
dt.
0
Combining this with (6.124) and (6.119) shows that for all θ ∈ B it holds that
⟨θ − ϑ, (∇L)(θ)⟩
≥ α∥θ − ϑ∥22
Z 1
− ∥θ − ϑ∥2 ~(Hess L)(ϑ + t(θ − ϑ)) − (Hess L)(ϑ)~∥θ − ϑ∥2 dt
0
(6.127)
Z 1
2
≥ α∥θ − ϑ∥2 − λ∥ϑ + t(θ − ϑ) − ϑ∥2 dt ∥θ − ϑ∥22
Z 1 0
t dt λ∥θ − ϑ∥2 ∥θ − ϑ∥22 = α − λ2 ∥θ − ϑ∥2 ∥θ − ϑ∥22
= α−
0
≥ α − 2λ ∥θ − ϑ∥22 = α2 ∥θ − ϑ∥22 .
λα
Moreover, observe that (6.119), (6.124), (6.125), the fact that (Hess L)(ϑ) ∈ Rd×d is a
symmetric matrix, item (ii) in Lemma 6.1.15, the Cauchy-Schwarz inequality, and the
assumption that α ≤ β ensure that for all θ ∈ B, w ∈ Rd with ∥w∥2 = 1 it holds that
⟨w, (∇L)(θ)⟩
≤ w, [(Hess L)(ϑ)](θ − ϑ)
Z 1
+ w, (Hess L)(ϑ + t(θ − ϑ)) − (Hess L)(ϑ) (θ − ϑ) dt
0
≤ ∥w∥2 ∥[(Hess L)(ϑ)](θ − ϑ)∥2
Z 1
+ ∥w∥2 ∥[(Hess L)(ϑ + t(θ − ϑ)) − (Hess L)(ϑ)](θ − ϑ)∥2 dt
0
" #
∥[(Hess L)(ϑ)]v∥2
≤ sup ∥θ − ϑ∥2 (6.128)
v∈Rd \{0} ∥v∥2
Z 1
+ ~(Hess L)(ϑ + t(θ − ϑ)) − (Hess L)(ϑ)~∥θ − ϑ∥2 dt
0
Z 1
≤ max S ∥θ − ϑ∥2 + λ∥ϑ + t(θ − ϑ) − ϑ∥2 dt ∥θ − ϑ∥2
0
Z 1
t dt ∥θ − ϑ∥2 ∥θ − ϑ∥2 = β + λ2 ∥θ − ϑ∥2 ∥θ − ϑ∥2
≤ β+λ
0
≤ β + 2λ ∥θ − ϑ∥2 = 2β+α
λα
∥θ − ϑ∥2 ≤ 3β
2 2
∥θ − ϑ∥2 .
Therefore, we obtain for all θ ∈ B that
∥(∇L)(θ)∥2 = sup [⟨w, (∇L)(θ)⟩] ≤ 3β
2
∥θ − ϑ∥2 . (6.129)
w∈Rd , ∥w∥2 =1
Combining this and (6.127) establishes (6.121). The proof of Proposition 6.1.16 is thus
complete.
The next result, Corollary 6.1.17 below, combines Proposition 6.1.16 with Proposi-
tion 6.1.9 to obtain an error analysis which assumes the conditions in (6.119) and (6.120)
in Proposition 6.1.16 above. A result similar to Corollary 6.1.17 can, for instance, be found
in Nesterov [303, Theorem 1.2.4].
Corollary 6.1.17 (Error analysis for the GD optimization method under conditions on the
Hessian of the objective function). Let d ∈ N, let ~·~ : Rd×d → R satisfy for all A ∈ Rd×d that
~A~ = sup_{v∈Rd\{0}} (∥Av∥₂/∥v∥₂), and let λ, α ∈ (0, ∞), β ∈ [α, ∞), (γn)n∈N ⊆ [0, 4α/(9β²)], ϑ, ξ ∈ Rd,
and
(iv) it holds for all n ∈ N0 that
0 ≤ L(Θn) − L(ϑ) ≤ (3β/4) [∏_{k=1}^{n} (1 − αγk + 9β²(γk)²/4)] ∥ξ − ϑ∥₂². (6.134)
Proof of Corollary 6.1.17. Note that (6.130), (6.131), and Proposition 6.1.16 prove that for
all θ ∈ {w ∈ Rd : ∥w − ϑ∥₂ ≤ α/λ} it holds that
⟨θ − ϑ, (∇L)(θ)⟩ ≥ (α/2)∥θ − ϑ∥₂²   and   ∥(∇L)(θ)∥₂ ≤ (3β/2)∥θ − ϑ∥₂ (6.135)
(cf. Definition 1.4.7). Combining this, the assumption that
∥ξ − ϑ∥₂ ≤ α/λ, (6.136)
(6.132), and items (iv) and (v) in Proposition 6.1.9 (applied with c ↶ α/2, L ↶ 3β/2, r ↶ α/λ in
the notation of Proposition 6.1.9) establishes items (i), (ii), (iii), and (iv). The proof of
the notation of Proposition 6.1.9) establishes items (i), (ii), (iii), and (iv). The proof of
Corollary 6.1.17 is thus complete.
Proof of Lemma 6.1.19. First, note that (6.137) ensures that for all θ ∈ B it holds that
ε2 ~θ − ϑ~2 ≥ ~θ − γG(θ) − ϑ~2 = ~(θ − ϑ) − γG(θ)~2
(6.139)
= ~θ − ϑ~2 − 2γ ⟨⟨θ − ϑ, G(θ)⟩⟩ + γ 2 ~G(θ)~2 .
Hence, we obtain for all θ ∈ B that
2γ⟨⟨θ − ϑ, G(θ)⟩⟩ ≥ (1 − ε2 )~θ − ϑ~2 + γ 2 ~G(θ)~2
(6.140)
≥ max (1 − ε2 )~θ − ϑ~2 , γ 2 ~G(θ)~2 ≥ 0.
(cf. Definition 1.4.7). Hence, we obtain that for all θ ∈ Rd \{ϑ} with ∥θ − ϑ∥2 < r it holds
that
⟨θ − ϑ, (∇L)(θ)⟩ ≥ 2c∥θ − ϑ∥22 . (6.153)
Combining this with the fact that the function
Rd ∋ v 7→ (∇L)(v) ∈ Rd (6.154)
is continuous establishes (6.151). The proof of Lemma 6.1.22 is thus complete.
Lemma 6.1.23. Let d ∈ N, L ∈ (0, ∞), r ∈ (0, ∞], ϑ ∈ Rd , B = {w ∈ Rd : ∥w − ϑ∥2 ≤ r},
L ∈ C 1 (Rd , R) satisfy for all θ ∈ B that
∥(∇L)(θ)∥2 ≤ L∥θ − ϑ∥2 (6.155)
(cf. Definition 3.3.4). Then it holds for all v, w ∈ B that
|L(v) − L(w)| ≤ L max{∥v − ϑ∥₂, ∥w − ϑ∥₂} ∥v − w∥₂. (6.156)
Proof of Lemma 6.1.23. Observe that (6.155), the fundamental theorem of calculus, and
the Cauchy-Schwarz inequality assure that for all v, w ∈ B it holds that
h=1
|L(v) − L(w)| = L(w + h(v − w)) h=0
Z 1
= L ′ (w + h(v − w))(v − w) dh
0
Z 1
= (∇L) w + h(v − w) , v − w dh
0
Z 1
≤ ∥(∇L) hv + (1 − h)w ∥2 ∥v − w∥2 dh
Z0 1
≤ L∥hv + (1 − h)w − ϑ∥2 ∥v − w∥2 dh (6.157)
0
Z 1
≤ L h∥v − ϑ∥2 + (1 − h)∥w − ϑ∥2 ∥v − w∥2 dh
0
Z 1
= L ∥v − w∥2 h∥v − ϑ∥2 + h∥w − ϑ∥2 dh
0
Z 1
= L ∥v − ϑ∥2 + ∥w − ϑ∥2 ∥v − w∥2 h dh
0
≤ L max{∥v − ϑ∥2 , ∥w − ϑ∥2 }∥v − w∥2
(6.158)
|L(v) − L(w)| ≤ L max ∥v − ϑ∥2 , ∥w − ϑ∥2 ∥v − w∥2
Proof of Lemma 6.1.24. Note that (6.158) implies that for all θ ∈ Rd with ∥θ − ϑ∥2 < r it
holds that
h i
∥(∇L)(θ)∥2 = sup L ′ (θ)(w)
w∈Rd ,∥w∥2 =1
h 1 i
= sup lim h (L(θ + hw) − L(θ))
w∈Rd ,∥w∥2 =1 h↘0
h i
L
≤ sup lim inf h max ∥θ + hw − ϑ∥2 , ∥θ − ϑ∥2 ∥θ + hw − θ∥2
w∈Rd ,∥w∥2 =1 h↘0
h i
1
= sup lim inf L max ∥θ + hw − ϑ∥2 , ∥θ − ϑ∥2 h
∥hw∥2
w∈Rd ,∥w∥2 =1 h↘0
h i
= sup lim inf L max ∥θ + hw − ϑ∥2 , ∥θ − ϑ∥2
w∈Rd ,∥w∥2 =1 h↘0
h i
= sup L∥θ − ϑ∥2 = L∥θ − ϑ∥2 .
w∈Rd ,∥w∥2 =1
(6.160)
The fact that the function Rd ∋ v 7→ (∇L)(v) ∈ Rd is continuous therefore establishes
(6.159). The proof of Lemma 6.1.24 is thus complete.
Corollary 6.1.25. Let d ∈ N, r ∈ (0, ∞], ϑ ∈ Rd , B = {w ∈ Rd : ∥w − ϑ∥2 ≤ r},
L ∈ C 1 (Rd , R) (cf. Definition 3.3.4). Then the following four statements are equivalent:
(i) There exist c, L ∈ (0, ∞) such that for all θ ∈ B it holds that
⟨θ − ϑ, (∇L)(θ)⟩ ≥ c∥θ − ϑ∥22 and ∥(∇L)(θ)∥2 ≤ L∥θ − ϑ∥2 . (6.161)
(ii) There exist γ ∈ (0, ∞), ε ∈ (0, 1) such that for all θ ∈ B it holds that
∥θ − γ(∇L)(θ) − ϑ∥2 ≤ ε∥θ − ϑ∥2 . (6.162)
(iii) There exists c ∈ (0, ∞) such that for all θ ∈ B it holds that
⟨θ − ϑ, (∇L)(θ)⟩ ≥ c max{∥θ − ϑ∥₂², ∥(∇L)(θ)∥₂²}. (6.163)
(iv) There exist c, L ∈ (0, ∞) such that for all v, w ∈ B, s, t ∈ [0, 1] with s ≤ t it holds that
L(ϑ + t(v − ϑ)) − L(ϑ + s(v − ϑ)) ≥ c(t² − s²)∥v − ϑ∥₂² (6.164)
and |L(v) − L(w)| ≤ L max{∥v − ϑ∥₂, ∥w − ϑ∥₂} ∥v − w∥₂ (6.165)
(cf. Definition 1.4.7).
Proof of Corollary 6.1.25. Note that items (ii) and (iii) in Lemma 6.1.8 prove that ((i) →
(ii)). Observe that Lemma 6.1.19 demonstrates that ((ii) → (iii)). Note that Lemma 6.1.20
establishes that ((iii) → (i)). Observe that Lemma 6.1.21 and Lemma 6.1.23 show that ((i)
→ (iv)). Note that Lemma 6.1.22 and Lemma 6.1.24 establish that ((iv) → (i)). The proof
of Corollary 6.1.25 is thus complete.
6.2 Explicit midpoint GD optimization
Then we say that Θ is the explicit midpoint GD process for the objective function L with
generalized gradient G, learning rates (γn )n∈N , and initial value ξ (we say that Θ is the
explicit midpoint GD process for the objective function L with learning rates (γn )n∈N and
initial value ξ) if and only if it holds that Θ : N0 → Rd is the function from N0 to Rd which
satisfies for all n ∈ N that
∥G(x)∥2 ≤ c, ∥G′ (x)y∥2 ≤ c∥y∥2 , and ∥G′′ (x)(y, z)∥2 ≤ c∥y∥2 ∥z∥2 (6.169)
(cf. Definition 3.3.4). Then
Proof of Lemma 6.2.2. Note that the fundamental theorem of calculus, the assumption that
G ∈ C 2 (Rd , Rd ), and (6.168) assure that for all t ∈ [0, ∞) it holds that Θ ∈ C 1 ([0, ∞), Rd )
and
Combining this with the assumption that G ∈ C 2 (Rd , Rd ) and the chain rule ensures that
for all t ∈ [0, ∞) it holds that Θ ∈ C 2 ([0, ∞), Rd ) and
γ2 1
hγ i Z
ΘT + γ2 − ΘT − G(ΘT ) = (1 − r)G′ (ΘT +rγ/2 )G(ΘT +rγ/2 ) dr. (6.174)
2 4 0
Combining this, the fact that for all x, y ∈ Rd it holds that ∥G(x) − G(y)∥2 ≤ c∥x − y∥2 ,
and (6.169) ensures that
cγ 2 1
Z
≤ (1 − r) G′ (ΘT +rγ/2 )G(ΘT +rγ/2 ) 2 dr
4 0 (6.175)
1
c3 γ 2 c3 γ 2
Z
≤ r dr = .
4 0 8
Furthermore, observe that (6.171), (6.172), the hypothesis that G ∈ C 2 (Rd , Rd ), the product
rule, and the chain rule assure that for all t ∈ [0, ∞) it holds that Θ ∈ C 3 ([0, ∞), Rd ) and
...
Θ t = G′′ (Θt )(Θ̇t , G(Θt )) + G′ (Θt )G′ (Θt )Θ̇t
(6.176)
= G′′ (Θt )(G(Θt ), G(Θt )) + G′ (Θt )G′ (Θt )G(Θt ).
Theorem 6.1.3, (6.171), and (6.172) hence imply that for all s, t ∈ [0, ∞) it holds that
γ3 1
Z
= γG(ΘT + 2 ) +
γ (1 − r) G′′ (ΘT +(1+r)γ/2 )(G(ΘT +(1+r)γ/2 ), G(ΘT +(1+r)γ/2 ))
2
16 0
+ G (ΘT +(1+r)γ/2 )G′ (ΘT +(1+r)γ/2 )G(ΘT +(1+r)γ/2 )
′
Corollary 6.2.3 (Local error of the explicit midpoint method for GF ODEs). Let d ∈ N,
T, γ, c ∈ [0, ∞), L ∈ C 3 (Rd , R), Θ ∈ C([0, ∞), Rd ), θ ∈ Rd satisfy for all x, y, z ∈ Rd ,
t ∈ [0, ∞) that
Θt = Θ0 − ∫_0^t (∇L)(Θs) ds,   θ = Θ_T − γ(∇L)(Θ_T − (γ/2)(∇L)(Θ_T)), (6.180)
∥(∇L)(x)∥₂ ≤ c, ∥(Hess L)(x)y∥₂ ≤ c∥y∥₂, and ∥(∇L)′′(x)(y, z)∥₂ ≤ c∥y∥₂∥z∥₂ (6.181)
(cf. Definition 3.3.4). Then
Proof of Corollary 6.2.3. Throughout this proof, let G : Rd → Rd satisfy for all θ ∈ Rd that
G(θ) = −(∇L)(θ).
Note that the fact that for all t ∈ [0, ∞) it holds that
Θt = Θ0 + ∫_0^t G(Θs) ds, (6.184)
the fact that for all x ∈ Rd it holds that ∥G(x)∥2 ≤ c, the fact that for all x, y ∈ Rd it holds
that ∥G′ (x)y∥2 ≤ c∥y∥2 , the fact that for all x, y, z ∈ Rd it holds that
6.3 GD optimization with classical momentum
method (see Definition 6.3.1 below). The idea to improve GD optimization methods with a
momentum term was first introduced in Polyak [337]. To illustrate the advantage of the
momentum GD optimization method over the plain-vanilla GD optimization method we
now review a result proving that the momentum GD optimization method does indeed
outperform the classical plain-vanilla GD optimization method in the case of a simple class
of optimization problems (see Section 6.3.3 below).
In the scientific literature there are several very similar, but not exactly equivalent
optimization techniques which are referred to as optimization with momentum. Our
definition of the momentum GD optimization method in Definition 6.3.1 below is based on
[247, 306] and (7) in [111]. A different version where, roughly speaking, the factor (1 − αn )
in (6.189) in Definition 6.3.1 is replaced by 1 can, for instance, be found in [112, Algorithm
2]. A further alternative definition where, roughly speaking, the momentum terms are
accumulated over the increments of the optimization process instead of over the gradients
of the objective function (cf. (6.190) in Definition 6.3.1 below) can, for example, be found
in (9) in [337], (2) in [339], and (4) in [354].
Definition 6.3.1 (Momentum GD optimization method). Let d ∈ N, (γn )n∈N ⊆ [0, ∞),
(αn)n∈N ⊆ [0, 1], ξ ∈ Rd and let L : Rd → R and G : Rd → Rd satisfy for all U ∈ {V ⊆ Rd : V is open}, θ ∈ U with L|U ∈ C1(U, R) that G(θ) = (∇(L|U))(θ).
Then we say that Θ is the momentum GD process for the objective function L with
generalized gradient G, learning rates (γn )n∈N , momentum decay factors (αn )n∈N , and initial
value ξ (we say that Θ is the momentum GD process for the objective function L with
learning rates (γn )n∈N , momentum decay factors (αn )n∈N , and initial value ξ) if and only
if it holds that Θ : N0 → Rd is the function from N0 to Rd which satisfies that there exists
m : N0 → Rd such that for all n ∈ N it holds that
Θ0 = ξ,   m0 = 0, (6.189)
mn = αn m_{n−1} + (1 − αn)G(Θ_{n−1}), (6.190)
and   Θn = Θ_{n−1} − γn mn. (6.191)
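The momentum recursion of Definition 6.3.1 can be sketched in Python as follows; the quadratic objective function, the constant learning rates and momentum decay factors, and the initial value are example choices and are not part of the definition.

import numpy as np

# Sketch of the momentum GD recursion in Definition 6.3.1; all parameter
# choices and the objective function are illustrative examples.
vartheta = np.array([1.0, 1.0])
def G(theta):                          # gradient of the example objective 0.5*||theta - vartheta||_2^2
    return theta - vartheta
def gamma(n): return 0.1               # learning rates (gamma_n)_{n in N}
def alpha(n): return 0.5               # momentum decay factors (alpha_n)_{n in N}

Theta = np.array([5.0, 3.0])           # Theta_0 = xi
m = np.zeros(2)                        # m_0 = 0
for n in range(1, 51):
    m = alpha(n) * m + (1 - alpha(n)) * G(Theta)   # m_n = alpha_n m_{n-1} + (1 - alpha_n) G(Theta_{n-1})
    Theta = Theta - gamma(n) * m                   # Theta_n = Theta_{n-1} - gamma_n m_n
print(Theta)                           # approximately vartheta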
Definition 6.3.1). Specify Θ1 , Θ2 , and Θ3 explicitly and prove that your results are correct!
Exercise 6.3.2. Let ξ = (ξ1 , ξ2 ) ∈ R2 satisfy (ξ1 , ξ2 ) = (2, 3), let L : R2 → R satisfy for all
θ = (θ1 , θ2 ) ∈ R2 that
and let Θ be the momentum GD process for the objective function L with learning rates
N ∋ n 7→ 2/n ∈ [0, ∞), momentum decay factors N ∋ n 7→ 1/2 ∈ [0, 1], and initial value ξ (cf.
Definition 6.3.1). Specify Θ1 and Θ2 explicitly and prove that your results are correct!
Lemma 6.3.2. Let (αn )n∈N ⊆ R and let (mn )n∈N0 ⊆ R satisfy for all n ∈ N that m0 = 0
and
mn = αn m_{n−1} + 1 − αn. (6.192)
Then it holds for all n ∈ N0 that
mn = 1 − ∏_{k=1}^{n} αk. (6.193)
Proof of Lemma 6.3.2. We prove (6.193) by induction on n ∈ N0 . For the base case n = 0
observe that the assumption that m0 = 0 establishes that
0
Y
m0 = 0 = 1 − αk . (6.194)
k=1
This establishes (6.193) in the base case nQ= 0. For the induction step note that (6.192)
assures that for all n ∈ N0 with mn = 1 − nk=1 αk it holds that
" n
#
Y
mn+1 = αn+1 mn + 1 − αn+1 = αn+1 1 − αk + 1 − αn+1
k=1
n+1 n+1
(6.195)
Y Y
= αn+1 − αk + 1 − αn+1 = 1 − αk .
k=1 k=1
Induction hence establishes (6.193). The proof of Lemma 6.3.2 is thus complete.
Lemma 6.3.3 (An explicit representation of momentum terms). Let d ∈ N, (αn )n∈N ⊆ R,
(a_{n,k})_{(n,k)∈(N0)²} ⊆ R, (Gn)n∈N0 ⊆ Rd, (mn)n∈N0 ⊆ Rd satisfy for all n ∈ N, k ∈ {0, 1, . . . , n − 1} that
m0 = 0,   mn = αn m_{n−1} + (1 − αn)G_{n−1},   and   a_{n,k} = (1 − α_{k+1}) [∏_{l=k+2}^{n} αl] (6.196)
Then
(i) it holds for all n ∈ N0 that
mn = ∑_{k=0}^{n−1} a_{n,k} Gk (6.197)
and
(ii) it holds for all n ∈ N0 that
∑_{k=0}^{n−1} a_{n,k} = 1 − ∏_{k=1}^{n} αk. (6.198)
Proof of Lemma 6.3.3. Throughout this proof, let (mn )n∈N0 ⊆ R satisfy for all n ∈ N0 that
n−1
X
mn = an,k . (6.199)
k=0
We now prove item (i) by induction on n ∈ N0 . For the base case n = 0 note that (6.196)
ensures that
X−1
m0 = 0 = an,kGk . (6.200)
k=0
Induction thus proves item (i). Furthermore, observe that (6.196) and (6.199) demonstrate
that for all n ∈ N it holds that m0 = 0 and
n−1 n−1
" n # n−2
" n #
X X Y X Y
mn = an,k = (1 − αk+1 ) αl = 1 − αn + (1 − αk+1 ) αl
k=0 k=0 l=k+2 k=0 l=k+2
n−2
" n−1
# n−2
X Y X
= 1 − αn + (1 − αk+1 )αn αl = 1 − αn + αn an−1,k = 1 − αn + αn mn−1 .
k=0 l=k+2 k=0
(6.202)
Combining this with Lemma 6.3.2 implies that for all n ∈ N0 it holds that
n
Y
mn = 1 − αk . (6.203)
k=1
This establishes item (ii). The proof of Lemma 6.3.3 is thus complete.
Corollary 6.3.4 (On a representation of the momentum GD optimization method). Let
d ∈ N, (γn )n∈N ⊆ [0, ∞), (αn )n∈N ⊆ [0, 1], (an,k )(n,k)∈(N0 )2 ⊆ R, ξ ∈ Rd satisfy for all n ∈ N,
k ∈ {0, 1, . . . , n − 1} that
a_{n,k} = (1 − α_{k+1}) [∏_{l=k+2}^{n} αl], (6.204)
and
(iii) it holds for all n ∈ N that
Θn = Θ_{n−1} − γn [∑_{k=0}^{n−1} a_{n,k} G(Θk)]. (6.207)
Proof of Corollary 6.3.4. Throughout this proof, let m : N0 → Rd satisfy for all n ∈ N that
m0 = 0 and mn = αn mn−1 + (1 − αn )G(Θn−1 ). (6.208)
Note that (6.204) implies item (i). Observe that (6.204), (6.208), and Lemma 6.3.3 assure
that for all n ∈ N0 it holds that
mn = ∑_{k=0}^{n−1} a_{n,k} G(Θk)   and   ∑_{k=0}^{n−1} a_{n,k} = 1 − ∏_{k=1}^{n} αk. (6.209)
This proves item (ii). Note that (6.189), (6.190), (6.191), (6.208), and (6.209) demonstrate
that for all n ∈ N it holds that
Θn = Θ_{n−1} − γn mn = Θ_{n−1} − γn [∑_{k=0}^{n−1} a_{n,k} G(Θk)]. (6.210)
This establishes item (iii). The proof of Corollary 6.3.4 is thus complete.
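Item (iii) of Corollary 6.3.4 states that the momentum iterates can equivalently be computed from weighted sums of the past gradients with the weights a_{n,k}. The short Python check below uses an example quadratic gradient together with example learning rates and momentum decay factors; all concrete values are assumptions made for illustration.

import numpy as np

# Check of the representation in item (iii) of Corollary 6.3.4: the momentum
# recursion and the weighted sum of past gradients yield the same iterates.
vartheta = np.array([1.0, -1.0])
def G(theta):                                    # example gradient
    return theta - vartheta
gamma = [0.1 / (n + 1) for n in range(20)]       # gamma_1, gamma_2, ...
alpha = [0.5] * 20                               # alpha_1, alpha_2, ...
xi = np.array([2.0, 3.0])

def a(n, k):                                     # a_{n,k} = (1 - alpha_{k+1}) * prod_{l=k+2}^{n} alpha_l
    return (1 - alpha[k]) * np.prod(alpha[k + 1:n])

Theta_mom, m = [xi], np.zeros(2)
Theta_rep = [xi]
for n in range(1, 21):
    m = alpha[n - 1] * m + (1 - alpha[n - 1]) * G(Theta_mom[-1])
    Theta_mom.append(Theta_mom[-1] - gamma[n - 1] * m)
    Theta_rep.append(Theta_rep[-1] - gamma[n - 1]
                     * sum(a(n, k) * G(Theta_rep[k]) for k in range(n)))
print(np.allclose(Theta_mom[-1], Theta_rep[-1]))   # True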
Proof of Corollary 6.3.6. Throughout this proof, let m : N0 → Rd satisfy for all n ∈ N that
m0 = 0 and mn = αn mn−1 + (1 − αn )G(Θn−1 ) (6.219)
and let (bn,k )(n,k)∈(N0 )2 ⊆ R satisfy for all n ∈ N, k ∈ {0, 1, . . . , n − 1} that
" n #
Y
bn,k = (1 − αk+1 ) αl . (6.220)
l=k+2
Observe that (6.215) implies item (i). Note that (6.215), (6.219), (6.220), and Lemma 6.3.3
assure that for all n ∈ N it holds that
n−1 n−1 Pn−1
1 − nk=1 αk
Q
X X bn,k
mn = bn,kG(Θk ) and an,k = k=0
Qn = Qn = 1. (6.221)
k=0 k=0
1 − k=1 α k 1 − k=1 α k
This proves item (ii). Observe that (6.212), (6.213), (6.214), (6.219), and (6.221) demon-
strate that for all n ∈ N it holds that
" n−1 #
γn mn X bn,k
Θn = Θn−1 − = Θn−1 − γn G(Θk )
1 − nl=1 αl 1 − nl=1 αl
Q Q
k=0
" n−1 # (6.222)
X
= Θn−1 − γn an,kG(Θk ) .
k=0
This establishes item (iii). The proof of Corollary 6.3.6 is thus complete.
Θ0 = ξ   and   Θn = Θ_{n−1} − (2/(K + κ))(∇L)(Θ_{n−1}). (6.224)
Proof of Lemma 6.3.7. Throughout this proof, let Θ^{(1)}, Θ^{(2)}, . . . , Θ^{(d)} : N0 → R satisfy for
all n ∈ N0 that Θn = (Θ_n^{(1)}, Θ_n^{(2)}, . . . , Θ_n^{(d)}). Note that (6.223) implies that for all θ =
(θ1, θ2, . . . , θd) ∈ Rd, i ∈ {1, 2, . . . , d} it holds that
(∂L/∂θi)(θ) = λi(θi − ϑi). (6.226)
Combining this and (6.224) ensures that for all n ∈ N, i ∈ {1, 2, . . . , d} it holds that
Θ_n^{(i)} − ϑi = Θ_{n−1}^{(i)} − (2/(K + κ))(∂L/∂θi)(Θ_{n−1}) − ϑi = Θ_{n−1}^{(i)} − ϑi − (2/(K + κ))λi(Θ_{n−1}^{(i)} − ϑi)
= (1 − 2λi/(K + κ))(Θ_{n−1}^{(i)} − ϑi). (6.227)
Hence, we obtain that for all n ∈ N it holds that
∥Θn − ϑ∥₂² = ∑_{i=1}^{d} |Θ_n^{(i)} − ϑi|² = ∑_{i=1}^{d} [1 − 2λi/(K + κ)]² |Θ_{n−1}^{(i)} − ϑi|²
≤ max{[1 − 2λ1/(K + κ)]², . . . , [1 − 2λd/(K + κ)]²} [∑_{i=1}^{d} |Θ_{n−1}^{(i)} − ϑi|²] (6.228)
= [max{|1 − 2λ1/(K + κ)|, . . . , |1 − 2λd/(K + κ)|}]² ∥Θ_{n−1} − ϑ∥₂²
(cf. Definition 3.3.4). Moreover, note that the fact that for all i ∈ {1, 2, . . . , d} it holds that
λi ≥ κ implies that for all i ∈ {1, 2, . . . , d} it holds that
1 − 2λi/(K + κ) ≤ 1 − 2κ/(K + κ) = (K + κ − 2κ)/(K + κ) = (K − κ)/(K + κ). (6.229)
In addition, observe that the fact that for all i ∈ {1, 2, . . . , d} it holds that λi ≤ K implies
that for all i ∈ {1, 2, . . . , d} it holds that
1 − 2λi/(K + κ) ≥ 1 − 2K/(K + κ) = (K + κ − 2K)/(K + κ) = −(K − κ)/(K + κ). (6.230)
This and (6.229) ensure that for all i ∈ {1, 2, . . . , d} it holds that
|1 − 2λi/(K + κ)| ≤ (K − κ)/(K + κ). (6.231)
Combining this with (6.228) demonstrates that for all n ∈ N it holds that
∥Θn − ϑ∥₂ ≤ [max{|1 − 2λ1/(K + κ)|, . . . , |1 − 2λd/(K + κ)|}] ∥Θ_{n−1} − ϑ∥₂ ≤ ((K − κ)/(K + κ)) ∥Θ_{n−1} − ϑ∥₂. (6.232)
Induction therefore establishes that for all n ∈ N0 it holds that
∥Θn − ϑ∥₂ ≤ ((K − κ)/(K + κ))^n ∥Θ0 − ϑ∥₂ = ((K − κ)/(K + κ))^n ∥ξ − ϑ∥₂. (6.233)
The proof of Lemma 6.3.7 is thus complete.
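The contraction factor (K − κ)/(K + κ) from Lemma 6.3.7 can be observed numerically. In the Python sketch below the eigenvalues λ_i, the point ϑ, and the initial value ξ are example choices.

import numpy as np

# Numerical check of the bound in Lemma 6.3.7: with learning rate 2/(K + kappa),
# GD on the example quadratic contracts at rate (K - kappa)/(K + kappa).
lam = np.array([1.0, 3.0, 10.0])
kappa, K = lam.min(), lam.max()
vartheta = np.zeros(3)
xi = np.array([1.0, -2.0, 0.5])

Theta = xi.copy()
rate = (K - kappa) / (K + kappa)
for n in range(1, 31):
    Theta = Theta - (2.0 / (K + kappa)) * lam * (Theta - vartheta)
    if n % 10 == 0:
        print(n, np.linalg.norm(Theta - vartheta), rate ** n * np.linalg.norm(xi - vartheta))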
Lemma 6.3.7 above establishes, roughly speaking, the convergence rate (K − κ)/(K + κ) (see (6.225)
above for the precise statement) for the GD optimization method in the case of the objective
function in (6.223). The next result, Lemma 6.3.8 below, essentially proves in the situation
of Lemma 6.3.7 that this convergence rate cannot be improved by means of a different
choice of the learning rate.
Lemma 6.3.8 (Lower bound for the convergence rate of GD for quadratic objective
functions). Let d ∈ N, ξ = (ξ1, ξ2, . . . , ξd), ϑ = (ϑ1, ϑ2, . . . , ϑd) ∈ Rd, γ, κ, K, λ1, λ2, . . . , λd ∈
Lemma 6.3.8 (Lower bound for the convergence rate of GD for quadratic objective
functions). Let d ∈ N, ξ = (ξ1 , ξ2 , . . . , ξd ), ϑ = (ϑ1 , ϑ2 , . . . , ϑd ) ∈ Rd , γ, κ, K, λ1 , λ2 . . . , λd ∈
(0, ∞) satisfy κ = min{λ1 , λ2 , . . . , λd } and K = max{λ1 , λ2 , . . . , λd }, let L : Rd → R satisfy
for all θ = (θ1, θ2, . . . , θd) ∈ Rd that
L(θ) = (1/2) [∑_{i=1}^{d} λi |θi − ϑi|²], (6.234)
Proof of Lemma 6.3.8. Throughout this proof, let Θ(1) , Θ(2) , . . . , Θ(d) : N0 → R satisfy for
(1) (2) (d)
all n ∈ N0 that Θn = (Θn , Θn , . . . , Θn ) and let ι, I ∈ {1, 2, . . . , d} satisfy λι = κ and
λI = K. Observe that (6.234) implies that for all θ = (θ1 , θ2 , . . . , θd ) ∈ Rd , i ∈ {1, 2, . . . , d}
it holds that
∂f
(6.237)
∂θi
(θ) = λi (θi − ϑi ).
Combining this with (6.235) implies that for all n ∈ N, i ∈ {1, 2, . . . , d} it holds that
(i) ∂f
Θ(i)
n − ϑi = Θn−1 − γ ∂θi
(Θn−1 ) − ϑi
(i)
= Θn−1 − ϑi − γλi (Θn−1 − ϑi )
(i)
(6.238)
(i)
= (1 − γλi )(Θn−1 − ϑi ).
Θ(i) n (i) n
n − ϑi = (1 − γλi ) (Θ0 − ϑi ) = (1 − γλi ) (ξi − ϑi ). (6.239)
i=1 i=1
" d #
(6.240)
X
≥ min |ξ1 − ϑ1 |2 , . . . , |ξd − ϑd |2 |1 − γλi |2n
2 2
i=1
max{|1 − γλ1 |2n , . . . , |1 − γλd |2n }
≥ min |ξ1 − ϑ1 | , . . . , |ξd − ϑd |
2 2n
= min |ξ1 − ϑ1 |, . . . , |ξd − ϑd | max{|1 − γλ1 |, . . . , |1 − γλd |}
and Lemma 6.3.10 below. Lemma 6.3.9 is a special case of the so-called Gelfand spectral
radius formula in the literature. Lemma 6.3.10 establishes a formula for the determinants
of quadratic block matrices (see (6.247) below for the precise statement). Lemma 6.3.10
and its proof can, for example, be found in Silvester [377, Theorem 3].
Lemma 6.3.9 (A special case of Gelfand’s spectral radius formula for real matrices). Let
d ∈ N, A ∈ Rd×d , S = {λ ∈ C : (∃ v ∈ Cd \{0} : Av = λv)} and let ~·~ : Rd → [0, ∞) be a
norm. Then
" #1/n " #1/n
n n
~A v~ = lim sup sup ~A v~ = max |λ|. (6.246)
lim inf sup
n→∞ d
v∈R \{0} ~v~ n→∞ d
v∈R \{0} ~v~ λ∈S∪{0}
Proof of Lemma 6.3.9. Note that, for instance, Einsiedler & Ward [127, Theorem 11.6]
establishes (6.246) (cf., for example, Tropp [395]). The proof of Lemma 6.3.9 is thus
complete.
Proof of Lemma 6.3.10. Throughout this proof, let Dx ∈ Cd×d , x ∈ C, satisfy for all x ∈ C
that
Dx = D − x Id (6.248)
(cf. Definition 1.5.5). Observe that the fact that for all x ∈ C it holds that CDx = Dx C
and the fact that for all X, Y, Z ∈ Cd×d it holds that
X Y X 0
det = det(X) det(Z) = det (6.249)
0 Z Y Z
(cf., for instance, Petersen [331, Proposition 5.5.3 and Proposition 5.5.4]) imply that for all
x ∈ C it holds that
A B Dx 0 (ADx − BC) B
det = det
C Dx −C Id (CDx − Dx C) Dx
(6.250)
(ADx − BC) B
= det
0 Dx
= det(ADx − BC) det(Dx ).
Moreover, note that (6.249) and the multiplicative property of the determinant (see, for
example, Petersen [331, (1) in Proposition 5.5.2]) imply that for all x ∈ C it holds that
A B Dx 0 A B Dx 0
det = det det
C Dx −C Id C Dx −C Id
A B
= det det(Dx ) det(Id ) (6.251)
C Dx
A B
= det det(Dx ).
C Dx
Combining this and (6.250) demonstrates that for all x ∈ C it holds that
A B
det det(Dx ) = det(ADx − BC) det(Dx ). (6.252)
C Dx
Hence, we obtain for all x ∈ C that
A B
det − det(ADx − BC) det(Dx ) = 0. (6.253)
C Dx
This implies that for all x ∈ C with det(Dx ) ̸= 0 it holds that
A B
det − det(ADx − BC) = 0. (6.254)
C Dx
Moreover, note that the fact that C ∋ x 7→ det(D − x Id ) ∈ C is a polynomial function of
degree d ensures that {x ∈ C : det(Dx ) = 0} = {x ∈ C : det(D − x Id ) = 0} is a finite set.
Combining this and (6.254) with the fact that the function
A B
C ∋ x 7→ det − det(ADx − BC) ∈ C (6.255)
C Dx
is continuous shows that for all x ∈ C it holds that
A B
det − det(ADx − BC) = 0. (6.256)
C Dx
Hence, we obtain for all x ∈ C that
A B
det = det(ADx − BC). (6.257)
C Dx
This establishes that
A B A B
det = det = det(AD0 − BC) = det(AD0 − BC). (6.258)
C D C D0
The proof of Lemma 6.3.10 is thus completed.
We are now in the position to formulate and prove the promised error analysis for
the momentum GD optimization method in the case of the considered class of quadratic
objective functions; see Proposition 6.3.11 below.
Proposition 6.3.11 (Error analysis for the momentum GD optimization method in
the case of quadratic objective functions). Let d ∈ N, ξ ∈ Rd , ϑ = (ϑ1 , . . . , ϑd ) ∈ Rd ,
κ, K, λ1 , λ2 , . . . , λd ∈ (0, ∞) satisfy κ = min{λ1 , λ2 , . . . , λd } and K = max{λ1 , λ2 , . . . , λd },
let L : Rd → R satisfy for all θ = (θ1, . . . , θd) ∈ Rd that
L(θ) = (1/2) [∑_{i=1}^{d} λi |θi − ϑi|²], (6.259)
Then
(i) it holds that Θ|_{N0} : N0 → Rd is the momentum GD process for the objective function
L with learning rates N ∋ n ↦ 1/√(Kκ) ∈ [0, ∞), momentum decay factors N ∋ n ↦
((K^{1/2} − κ^{1/2})/(K^{1/2} + κ^{1/2}))² ∈ [0, 1], and initial value ξ and
(ii) for every ε ∈ (0, ∞) there exists C ∈ (0, ∞) such that for all n ∈ N0 it holds that
∥Θn − ϑ∥₂ ≤ C [ (√K − √κ)/(√K + √κ) + ε ]^n (6.261)
(cf. Definition 1.5.5). Observe that (6.260), (6.263), and the fact that
√ √ √ √ h√ √ √ √ √ √ √ √ i
( K+ κ)2 −( K− κ)2 1
4
= 4 ( K + κ + K − κ)( K + κ − [ K − κ])
h √ √ i √ (6.268)
= 14 (2 K)(2 κ) = Kκ
Moreover, note that (6.263) implies that for all n ∈ N0 it holds that
In addition, observe that the assumption that Θ−1 = Θ0 = ξ and (6.263) ensure that
√
(6.271)
m0 = − Kκ Θ0 − Θ−1 = 0.
Combining this and the assumption that Θ0 = ξ with (6.269) and (6.270) proves item (i).
It thus remains to prove item (ii). For this observe that (6.259) implies that for all
θ = (θ1 , θ2 , . . . , θd ) ∈ Rd , i ∈ {1, 2, . . . , d} it holds that
∂f
(6.272)
∂θi
(θ) = λi (θi − ϑi ).
This, (6.260), and (6.264) imply that for all n ∈ N, i ∈ {1, 2, . . . , d} it holds that
(i) ∂f (i) (i)
Θ(i)
n − ϑi = Θn−1 − ϱ ∂θi (Θn−1 ) + α(Θn−1 − Θn−2 ) − ϑi
(i) (i) (i) (i)
(6.273)
= (Θn−1 − ϑi ) − ϱλi (Θn−1 − ϑi ) + α (Θn−1 − ϑi ) − (Θn−2 − ϑi )
(i) (i)
= (1 − ϱλi + α)(Θn−1 − ϑi ) − α(Θn−2 − ϑi ).
Combining this with (6.265) demonstrates that for all n ∈ N it holds that
Rd ∋ (Θn − ϑ) = M (Θn−1 − ϑ) − α(Θn−2 − ϑ)
Θn−1 − ϑ
= M (−α Id ) . (6.274)
| {z } Θn−2 − ϑ
∈ Rd×2d | {z }
∈ R2d
This implies that there exists m ∈ N which satisfies for all n ∈ N0 ∩ [m, ∞) that
n 1/n
~A ~ ≤ ε + max |µ|. (6.279)
µ∈S∪{0}
Combining this and (6.280) proves that for all n ∈ N0 it holds that
h in h n k~
o i
~An ~ ≤ ε + max |µ| max (ε+max~A µ∈S∪{0} |µ|)
k : k ∈ N0 ∩ [0, m) ∪ {1} . (6.282)
µ∈S∪{0}
Next observe that Lemma 6.3.10, (6.266), and the fact that for all µ ∈ C it holds that
Id (−µ Id ) = −µ Id = (−µ Id ) Id ensure that for all µ ∈ C it holds that
(M − µ Id ) (−α Id )
det(A − µ I2d ) = det
Id −µ Id
(6.283)
= det (M − µ Id )(−µ Id ) − (−α Id ) Id
= det (M − µ Id )(−µ Id ) + α Id .
This and (6.265) demonstrate that for all µ ∈ C it holds that
(1 − ϱλ1 + α − µ)(−µ) + α 0
..
det(A − µ I2d ) = det .
0 (1 − ϱλd + α − µ)(−µ) + α
d
Y
= (1 − ϱλi + α − µ)(−µ) + α
i=1
Yd
µ2 − (1 − ϱλi + α)µ + α .
=
i=1
(6.284)
µ ∈ C : µ2 − (1 − ϱλi + α)µ + α = 0
h i2 h i
(1−ϱλi +α) 1
2
= µ ∈ C: µ − 2
= 4 1 − ϱλi + α − 4α
√ √
(6.286)
(1−ϱλi +α)+ [1−ϱλi +α]2 −4α (1−ϱλi +α)− [1−ϱλi +α]2 −4α
= 2
, 2
,
[ q
2
1
= 2
1 − ϱλi + α + s (1 − ϱλi + α) − 4α .
s∈{−1,1}
S = {µ ∈ C : det(A − µ I2d ) = 0}
( " d #)
Y
2
= µ ∈ C: µ − (1 − ϱλi + α)µ + α = 0
i=1
d
[ (6.287)
µ ∈ C : µ2 − (1 − ϱλi + α)µ + α = 0
=
i=1
[d [ q
1 2
= 2
1 − ϱλi + α + s (1 − ϱλi + α) − 4α .
i=1 s∈{−1,1}
Moreover, observe that the fact that for all i ∈ {1, 2, . . . , d} it holds that λi ≥ κ and (6.264)
ensure that for all i ∈ {1, 2, . . . , d} it holds that
h i √ √ 2
1 − ϱλi + α ≤ 1 − ϱκ + α = 1 − (√K+4√κ)2 κ + ((√K− √
κ)
K+ κ)2
√ √ √ √ √ √ √ √
( K+ κ)2 −4κ+( K− κ)2 K+2 K κ+κ−4κ+K−2 K κ+κ
= √ √ 2
( K+ κ)
= √ √ 2
( K+ κ)
(6.288)
√ √ √ √ h√ √ i
2( K− κ)( K+ κ)
= √2K−2κ
√
( K+ κ)2
= √ √
( K+ κ)2
= 2 √K− √
K+ κ
κ
≥ 0.
In addition, note that the fact that for all i ∈ {1, 2, . . . , d} it holds that λi ≤ K and (6.264)
1
h p i (6.291)
= max max 1 − ϱλi + α + s (−1)(4α − [1 − ϱλi + α] ) 2
2 i∈{1,2,...,d} s∈{−1,1}
1
h p i 2 1/2
= max max 1 − ϱλi + α + si 4α − (1 − ϱλi + α) 2 .
2 i∈{1,2,...,d} s∈{−1,1}
Combining this with (6.290) proves that
1/2
1 2 p
2 2
max |µ| = 2 max max 1 − ϱλi + α + s 4α − (1 − ϱλi + α)
µ∈S∪{0} i∈{1,2,...,d} s∈{−1,1}
1/2 (6.292)
1 2 2
= max max
2 i∈{1,2,...,d} s∈{−1,1}
(1 − ϱλi + α) + 4α − (1 − ϱλi + α)
√
= 21 [4α] /2 =
1
α.
Combining (6.277) and (6.282) hence ensures that for all n ∈ N0 it holds that
√
Θn − ϑ 2 ≤ 2 ∥ξ − ϑ∥2 ~An ~
n
√
≤ 2 ∥ξ − ϑ∥2 ε + max |µ|
µ∈S∪{0}
h n k~
o i
· max (ε+max~A µ∈S∪{0} |µ|)
k ∈ R : k ∈ N 0 ∩ [0, m) ∪ {1}
√ n h n
~Ak ~
o i
= 2 ∥ξ − ϑ∥2 ε + α /2 max (ε+α
1
1/2 )k ∈ R : k ∈ N 0 ∩ [0, m) ∪ {1}
√ h √
K− κ
√ in h n
~Ak ~
o i
= 2 ∥ξ − ϑ∥2 ε + K+√κ
√ max (ε+α1/2 )k ∈ R : k ∈ N0 ∩ [0, m) ∪ {1} .
(6.293)
This establishes item (ii). The proof of Proposition 6.3.11 is thus complete.
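The parameter choices in item (i) of Proposition 6.3.11 can be tried out numerically. In the following Python sketch the eigenvalues λ_i, the point ϑ, and the initial value ξ are example choices; since the bound in item (ii) involves an unspecified constant C and an arbitrarily small ε, the printed reference values only indicate the expected order of decay.

import numpy as np

# Momentum GD with the parameters from item (i) of Proposition 6.3.11 on an
# example quadratic objective; lam, vartheta, xi are illustrative choices.
lam = np.array([1.0, 100.0])
kappa, K = lam.min(), lam.max()
vartheta = np.zeros(2)
xi = np.array([5.0, 3.0])

gamma = 1.0 / np.sqrt(K * kappa)
alpha = ((np.sqrt(K) - np.sqrt(kappa)) / (np.sqrt(K) + np.sqrt(kappa))) ** 2
rate = (np.sqrt(K) - np.sqrt(kappa)) / (np.sqrt(K) + np.sqrt(kappa))

Theta, m = xi.copy(), np.zeros(2)
for n in range(1, 101):
    m = alpha * m + (1 - alpha) * lam * (Theta - vartheta)
    Theta = Theta - gamma * m
    if n % 25 == 0:
        print(n, np.linalg.norm(Theta - vartheta), rate ** n * np.linalg.norm(xi - vartheta))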
Then
(i) there exist γ, C ∈ (0, ∞) such that for all n ∈ N0 it holds that
∥Θ_n^γ − ϑ∥₂ ≤ C ((K − κ)/(K + κ))^n, (6.299)
(iii) for every ε ∈ (0, ∞) there exists C ∈ (0, ∞) such that for all n ∈ N0 it holds that
∥Mn − ϑ∥₂ ≤ C [ (√K − √κ)/(√K + √κ) + ε ]^n, (6.301)
and
(iv) it holds that (√K − √κ)/(√K + √κ) < (K − κ)/(K + κ).
The next elementary result, Lemma 6.3.14 below, demonstrates that the momentum GD
optimization method in (6.298) and the plain-vanilla GD optimization method in (6.224)
in Lemma 6.3.7 above coincide in the case where min{λ1 , . . . , λd } = max{λ1 , . . . , λd }.
Lemma 6.3.14 (Concurrence of the GD optimization method and the momentum GD
optimization method). Let d ∈ N, ξ, ϑ ∈ Rd , α ∈ (0, ∞), let L : Rd → R satisfy for all
θ ∈ Rd that
L(θ) = (α/2)∥θ − ϑ∥₂², (6.303)
let Θ : N0 → Rd satisfy for all n ∈ N that
Θ0 = ξ   and   Θn = Θ_{n−1} − (2/(α + α))(∇L)(Θ_{n−1}), (6.304)
Mn = M_{n−1} − (4/(2√α)²)(∇L)(M_{n−1}) = M_{n−1} − (1/α)(∇L)(M_{n−1}). (6.306)
Combining this with the assumption that M0 = ξ establishes item (i). Next note that
(6.304) ensures that for all n ∈ N it holds that
Combining this with (6.306) and the assumption that Θ0 = ξ = M0 proves item (ii).
Furthermore, observe that Lemma 5.6.4 assures that for all θ ∈ Rd it holds that
(∇L)(θ) = α(θ − ϑ). (6.308)
In the next step we claim that for all n ∈ N it holds that
Θn = ϑ. (6.309)
We now prove (6.309) by induction on n ∈ N. For the base case n = 1 note that (6.307)
and (6.308) imply that
This establishes (6.309) in the base case n = 1. For the induction step observe that (6.307)
and (6.308) assure that for all n ∈ N with Θn = ϑ it holds that
Induction thus proves (6.309). Combining (6.309) and item (ii) establishes item (iii). The
proof of Lemma 6.3.14 is thus complete.
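The concurrence asserted in Lemma 6.3.14 is easy to confirm numerically. In the Python sketch below the value of α, the point ϑ, and the initial value ξ are example choices, and the momentum decay factor is 0 because κ = K = α.

import numpy as np

# Check of Lemma 6.3.14: for L(theta) = (alpha/2)*||theta - vartheta||_2^2 the GD process
# with learning rate 2/(alpha + alpha) and the momentum GD process with learning rate
# 1/alpha and momentum decay factor 0 coincide (and both hit vartheta after one step).
alpha = 3.0
vartheta = np.array([1.0, 2.0])
xi = np.array([-4.0, 7.0])

Theta, M, m = xi.copy(), xi.copy(), np.zeros(2)
for n in range(1, 6):
    Theta = Theta - (2.0 / (alpha + alpha)) * alpha * (Theta - vartheta)   # GD step as in (6.304)
    m = 0.0 * m + (1 - 0.0) * alpha * (M - vartheta)                       # momentum with decay factor 0
    M = M - (1.0 / alpha) * m                                              # momentum step as in (6.306)
    print(n, np.allclose(Theta, M), np.allclose(Theta, vartheta))          # True True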
Θ1 = Θ0 − (2/11)(∇L)(Θ0) ≈ Θ0 − 0.18(∇L)(Θ0) = (5, 3) − 0.18(5 − 1, 10(3 − 1))
= (5 − 0.72, 3 − 3.6) = (4.28, −0.6), (6.318)
Θ2 ≈ Θ1 − 0.18(∇L)(Θ1) = (4.28, −0.6) − 0.18(4.28 − 1, 10(−0.6 − 1))
= (4.28 − 0.5904, −0.6 + 2.88) = (3.6896, 2.28) ≈ (3.69, 2.28), (6.319)
Θ3 ≈ Θ2 − 0.18(∇L)(Θ2) ≈ (3.69, 2.28) − 0.18(3.69 − 1, 10(2.28 − 1))
= (3.69 − 0.4842, 2.28 − 2.304) = (3.2058, −0.024) ≈ (3.21, −0.02), (6.320)
…
and
m1 = 0.5(m0 + (∇L)(M0)) = 0.5((0, 0) + (5 − 1, 10(3 − 1))) = (0.5(0 + 4), 0.5(0 + 10·2)) = (2, 10), (6.322)
M1 = M0 − 0.3 m1 = (5, 3) − 0.3(2, 10) = (4.4, 0), (6.323)
m2 = 0.5(m1 + (∇L)(M1)) = 0.5((2, 10) + (4.4 − 1, 10(0 − 1))) = (0.5(2 + 3.4), 0.5(10 − 10)) = (2.7, 0), (6.324)
M2 = M1 − 0.3 m2 = (4.4, 0) − 0.3(2.7, 0) = (4.4 − 0.81, 0) = (3.59, 0), (6.325)
m3 = 0.5(m2 + (∇L)(M2)) = 0.5((2.7, 0) + (3.59 − 1, 10(0 − 1))) = (0.5(2.7 + 2.59), 0.5(0 − 10)) = (2.645, −5) ≈ (2.65, −5), (6.326)
M3 = M2 − 0.3 m3 ≈ (3.59, 0) − 0.3(2.65, −5) = (3.59 − 0.795, 1.5) = (2.795, 1.5) ≈ (2.8, 1.5), (6.327)
…
# The first lines of this listing (imports and parameter choices) are missing from the
# extracted source; they are reconstructed here so that the listing runs. The values of
# K, vartheta, and xi match the worked example above; N is an illustrative choice.
import numpy as np
import matplotlib.pyplot as plt

N = 10                               # number of GD / momentum GD steps (illustrative choice)
d = 2
K = np.array([1., 10.])              # curvatures (lambda_1, lambda_2)
vartheta = np.array([1., 1.])        # minimizer of the objective function
xi = np.array([5., 3.])              # initial value

def f(x, y):
    result = K[0] / 2. * np.abs(x - vartheta[0])**2 \
        + K[1] / 2. * np.abs(y - vartheta[1])**2
    return result

def nabla_f(x):
    return K * (x - vartheta)

# Coefficients for GD
gamma_GD = 2 / (K[0] + K[1])

# Coefficients for momentum
gamma_momentum = 0.3
alpha = 0.5

# Placeholder for processes
Theta = np.zeros((N+1, d))
M = np.zeros((N+1, d))
m = np.zeros((N+1, d))

Theta[0] = xi
M[0] = xi

# Perform gradient descent
for i in range(N):
    Theta[i+1] = Theta[i] - gamma_GD * nabla_f(Theta[i])

# Perform momentum GD
for i in range(N):
    m[i+1] = alpha * m[i] + (1 - alpha) * nabla_f(M[i])
    M[i+1] = M[i] - gamma_momentum * m[i+1]


### Plot ###
plt.figure()

# Plot the gradient descent process
plt.plot(Theta[:, 0], Theta[:, 1],
         label="GD", color="c",
         linestyle="--", marker="*")

# Plot the momentum gradient descent process
plt.plot(M[:, 0], M[:, 1],
         label="Momentum", color="orange", marker="*")

# Target value
plt.scatter(vartheta[0], vartheta[1],
            label="vartheta", color="red", marker="x")
[Figure: trajectories of the GD process (GD) and the momentum GD process (Momentum) produced by the listing above, both approaching the target value vartheta.]
Exercise 6.3.3. Let (γn)n∈N ⊆ [0, ∞), (αn)n∈N ⊆ [0, 1] satisfy for all n ∈ N that γn = 1/n and
αn = 1/2, let L : R → R satisfy for all θ ∈ R that L(θ) = θ², and let Θ be the momentum
GD process for the objective function L with learning rates (γn )n∈N , momentum decay
factors (αn )n∈N , and initial value 1 (cf. Definition 6.3.1). Specify Θ1 , Θ2 , Θ3 , and Θ4
explicitly and prove that your results are correct!
Then we say that Θ is the Adagrad GD process for the objective function L with generalized
gradient G, learning rates (γn )n∈N , regularizing factor ε, and initial value ξ (we say that Θ is
the Adagrad GD process for the objective function L with learning rates (γn )n∈N , regularizing
factor ε, and initial value ξ) if and only if it holds that Θ = (Θ(1) , . . . , Θ(d) ) : N0 → Rd is
the function from N0 to Rd which satisfies for all n ∈ N, i ∈ {1, 2, . . . , d} that
Θ0 = ξ   and   Θ_n^{(i)} = Θ_{n−1}^{(i)} − γn [ε + ∑_{k=0}^{n−1} |Gi(Θk)|²]^{−1/2} Gi(Θ_{n−1}). (6.333)
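The Adagrad recursion in (6.333) can be sketched in Python as follows; the objective function, the learning rate, the regularizing factor, the initial value, and the number of steps are example choices.

import numpy as np

# Sketch of the Adagrad recursion in (6.333); all concrete values are example choices.
vartheta = np.array([1.0, -1.0])
def G(theta):                                   # example gradient of 0.5*||theta - vartheta||_2^2
    return theta - vartheta
gamma, eps = 0.5, 1e-8
Theta = np.array([4.0, 2.0])                    # Theta_0 = xi

sum_sq = np.zeros(2)                            # running sums sum_{k=0}^{n-1} |G_i(Theta_k)|^2
for n in range(1, 501):
    g = G(Theta)
    sum_sq += g ** 2
    Theta = Theta - gamma * (eps + sum_sq) ** (-0.5) * g
print(Theta)                                    # close to vartheta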
6.6 Root mean square propagation GD optimization (RMSprop)

Definition 6.6.1 (RMSprop GD optimization method). Let d ∈ N, (γn)n∈N ⊆ [0, ∞),
(βn)n∈N ⊆ [0, 1], ε ∈ (0, ∞), ξ ∈ Rd and let L : Rd → R and G = (G1, . . . , Gd) : Rd → Rd
satisfy for all U ∈ {V ⊆ Rd : V is open}, θ ∈ U with L|U ∈ C1(U, R) that G(θ) = (∇(L|U))(θ).
Then we say that Θ is the RMSprop GD process for the objective function L with generalized
gradient G, learning rates (γn )n∈N , second moment decay factors (βn )n∈N , regularizing factor
ε, and initial value ξ (we say that Θ is the RMSprop GD process for the objective function
L with learning rates (γn )n∈N , second moment decay factors (βn )n∈N , regularizing factor ε,
and initial value ξ) if and only if it holds that Θ = (Θ(1) , . . . , Θ(d) ) : N0 → Rd is the function
from N0 to Rd which satisfies that there exists M = (M(1) , . . . , M(d) ) : N0 → Rd such that
for all n ∈ N, i ∈ {1, 2, . . . , d} it holds that
Θ_0 = ξ,   M_0 = 0,   M_n^{(i)} = β_n M_{n−1}^{(i)} + (1 − β_n) |G_i(Θ_{n−1})|²,   (6.335)
and   Θ_n^{(i)} = Θ_{n−1}^{(i)} − γ_n (ε + M_n^{(i)})^{−1/2} G_i(Θ_{n−1}).   (6.336)
and let Θ = (Θ(1) , . . . , Θ(d) ) : N0 → Rd be the RMSprop GD process for the objective function
L with generalized gradient G, learning rates (γn )n∈N , second moment decay factors (βn )n∈N ,
regularizing factor ε, and initial value ξ (cf. Definition 6.6.1). Then
and
Proof of Lemma 6.6.2. Throughout this proof, let M = (M(1) , . . . , M(d) ) : N0 → Rd satisfy
for all n ∈ N, i ∈ {1, 2, . . . , d} that M_0^{(i)} = 0 and
M_n^{(i)} = β_n M_{n−1}^{(i)} + (1 − β_n) |G_i(Θ_{n−1})|².   (6.341)
Note that (6.337) implies item (i). Furthermore, observe that (6.337), (6.341), and
Lemma 6.3.3 assure that for all n ∈ N, i ∈ {1, 2, . . . , d} it holds that
M_n^{(i)} = ∑_{k=0}^{n−1} b_{n,k} |G_i(Θ_k)|²   and   ∑_{k=0}^{n−1} b_{n,k} = 1 − ∏_{k=1}^{n} β_k.   (6.342)
This proves item (ii). Moreover, note that (6.335), (6.336), (6.341), and (6.342) demonstrate
that for all n ∈ N, i ∈ {1, 2, . . . , d} it holds that
Θ_n^{(i)} = Θ_{n−1}^{(i)} − γ_n (ε + M_n^{(i)})^{−1/2} G_i(Θ_{n−1})
= Θ_{n−1}^{(i)} − γ_n [ε + ∑_{k=0}^{n−1} b_{n,k} |G_i(Θ_k)|²]^{−1/2} G_i(Θ_{n−1}).   (6.343)
This establishes item (iii). The proof of Lemma 6.6.2 is thus complete.
β1 < 1 (6.344)
Then we say that Θ is the bias-adjusted RMSprop GD process for the objective function L
with generalized gradient G, learning rates (γn )n∈N , second moment decay factors (βn )n∈N ,
regularizing factor ε, and initial value ξ (we say that Θ is the bias-adjusted RMSprop GD
process for the objective function L with learning rates (γn )n∈N , second moment decay
factors (βn )n∈N , regularizing factor ε, and initial value ξ) if and only if it holds that
Θ = (Θ(1) , . . . , Θ(d) ) : N0 → Rd is the function from N0 to Rd which satisfies that there exists
M = (M(1) , . . . , M(d) ) : N0 → Rd such that for all n ∈ N, i ∈ {1, 2, . . . , d} it holds that
Θ_0 = ξ,   M_0 = 0,   M_n^{(i)} = β_n M_{n−1}^{(i)} + (1 − β_n) |G_i(Θ_{n−1})|²,   (6.346)
and   Θ_n^{(i)} = Θ_{n−1}^{(i)} − γ_n (ε + [M_n^{(i)} / (1 − ∏_{k=1}^{n} β_k)]^{1/2})^{−1} G_i(Θ_{n−1}).   (6.347)
Lemma 6.6.4 (On a representation of the second order terms in bias-adjusted RMSprop).
Let d ∈ N, (γn )n∈N ⊆ [0, ∞), (βn )n∈N ⊆ [0, 1), (bn,k )(n,k)∈(N0 )2 ⊆ R, ε ∈ (0, ∞), ξ ∈ Rd
satisfy for all n ∈ N, k ∈ {0, 1, . . . , n − 1} that
b_{n,k} = [(1 − β_{k+1}) ∏_{l=k+2}^{n} β_l] / [1 − ∏_{k=1}^{n} β_k],   (6.348)
Proof of Lemma 6.6.4. Throughout this proof, let M = (M(1) , . . . , M(d) ) : N0 → Rd satisfy
for all n ∈ N, i ∈ {1, 2, . . . , d} that M_0^{(i)} = 0 and
M_n^{(i)} = β_n M_{n−1}^{(i)} + (1 − β_n) |G_i(Θ_{n−1})|²   (6.352)
and let (Bn,k )(n,k)∈(N0 )2 ⊆ R satisfy for all n ∈ N, k ∈ {0, 1, . . . , n − 1} that
" n #
Y
Bn,k = (1 − βk+1 ) βl . (6.353)
l=k+2
Observe that (6.348) implies item (i). Note that (6.348), (6.352), (6.353), and Lemma 6.3.3
assure that for all n ∈ N, i ∈ {1, 2, . . . , d} it holds that
M_n^{(i)} = ∑_{k=0}^{n−1} B_{n,k} |G_i(Θ_k)|²   and   ∑_{k=0}^{n−1} b_{n,k} = [∑_{k=0}^{n−1} B_{n,k}] / [1 − ∏_{k=1}^{n} β_k] = [1 − ∏_{k=1}^{n} β_k] / [1 − ∏_{k=1}^{n} β_k] = 1.   (6.354)
This proves item (ii). Observe that (6.346), (6.347), (6.352), and (6.354) demonstrate that
for all n ∈ N, i ∈ {1, 2, . . . , d} it holds that
Θ_n^{(i)} = Θ_{n−1}^{(i)} − γ_n (ε + [M_n^{(i)} / (1 − ∏_{k=1}^{n} β_k)]^{1/2})^{−1} G_i(Θ_{n−1})
= Θ_{n−1}^{(i)} − γ_n (ε + [∑_{k=0}^{n−1} b_{n,k} |G_i(Θ_k)|²]^{1/2})^{−1} G_i(Θ_{n−1}).   (6.355)
This establishes item (iii). The proof of Lemma 6.6.4 is thus complete.
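The identity ∑_{k=0}^{n−1} b_{n,k} = 1 appearing in (6.354) can also be checked numerically. The following short Python sketch evaluates the weights from (6.348) for randomly chosen decay factors; the concrete choices of n and of the decay factors are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
n = 8
beta = rng.uniform(0.0, 1.0, size=n)      # beta_1, ..., beta_n in [0, 1)

def b(k):
    # b_{n,k} = (1 - beta_{k+1}) * prod_{l=k+2}^{n} beta_l / (1 - prod_{l=1}^{n} beta_l), cf. (6.348)
    return (1 - beta[k]) * np.prod(beta[k + 1:n]) / (1 - np.prod(beta[:n]))

print(sum(b(k) for k in range(n)))        # prints 1.0 up to rounding errors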
Definition 6.7.1 (Adadelta GD optimization method). Let d ∈ N, (βn )n∈N ⊆ [0, 1],
(δn )n∈N ⊆ [0, 1], ε ∈ (0, ∞), ξ ∈ Rd and let L : Rd → R and G = (G1 , . . . , Gd ) : Rd → Rd
satisfy for all U ∈ {V ⊆ Rd : V is open}, θ ∈ U with L|U ∈ C 1 (U, R) that G(θ) = (∇L)(θ).
Then we say that Θ is the Adadelta GD process for the objective function L with generalized
gradient G, second moment decay factors (βn )n∈N , delta decay factors (δn )n∈N , regularizing
factor ε, and initial value ξ (we say that Θ is the Adadelta GD process for the objective func-
tion L with second moment decay factors (βn )n∈N , delta decay factors (δn )n∈N , regularizing
factor ε, and initial value ξ) if and only if it holds that Θ = (Θ(1) , . . . , Θ(d) ) : N0 → Rd is
the function from N0 to Rd which satisfies that there exist M = (M(1) , . . . , M(d) ) : N0 → Rd
and ∆ = (∆(1) , . . . , ∆(d) ) : N0 → Rd such that for all n ∈ N, i ∈ {1, 2, . . . , d} it holds that
Θ_0 = ξ,   M_0 = 0,   ∆_0 = 0,   (6.357)
M_n^{(i)} = β_n M_{n−1}^{(i)} + (1 − β_n) |G_i(Θ_{n−1})|²,   (6.358)
Θ_n^{(i)} = Θ_{n−1}^{(i)} − [(ε + ∆_{n−1}^{(i)}) / (ε + M_n^{(i)})]^{1/2} G_i(Θ_{n−1}),   (6.359)
and   ∆_n^{(i)} = δ_n ∆_{n−1}^{(i)} + (1 − δ_n) |Θ_n^{(i)} − Θ_{n−1}^{(i)}|².   (6.360)
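The following short Python sketch illustrates the Adadelta GD method in (6.357)–(6.360) on a simple quadratic objective function; the objective function, the decay factors, and the regularizing factor below are illustrative choices and not prescribed by the text.

import numpy as np

K = np.array([1.0, 10.0])                 # illustrative quadratic objective as above
vartheta = np.array([1.0, 1.0])
beta, delta, eps, N = 0.9, 0.9, 1e-6, 5000

Theta = np.array([5.0, 3.0])
M = np.zeros(2)                           # decaying average of squared gradients, cf. (6.358)
Delta = np.zeros(2)                       # decaying average of squared increments, cf. (6.360)
for n in range(N):
    g = K * (Theta - vartheta)
    M = beta * M + (1 - beta) * g**2
    step = np.sqrt((eps + Delta) / (eps + M)) * g            # cf. (6.359)
    Delta = delta * Delta + (1 - delta) * step**2            # |Theta_n - Theta_{n-1}|^2, cf. (6.360)
    Theta = Theta - step
print(Theta)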
Definition 6.8.1 (Adam GD optimization method). Let d ∈ N, (γn )n∈N ⊆ [0, ∞),
(αn )n∈N ⊆ [0, 1], (βn )n∈N ⊆ [0, 1], ε ∈ (0, ∞), ξ ∈ Rd satisfy
Then we say that Θ is the Adam GD process for the objective function L with generalized
gradient G, learning rates (γn )n∈N , momentum decay factors (αn )n∈N , second moment decay
factors (βn )n∈N , regularizing factor ε, and initial value ξ (we say that Θ is the Adam GD
process for the objective function L with learning rates (γn )n∈N , momentum decay factors
(αn )n∈N , second moment decay factors (βn )n∈N , regularizing factor ε, and initial value ξ) if
and only if it holds that Θ = (Θ(1) , . . . , Θ(d) ) : N0 → Rd is the function from N0 to Rd which
satisfies that there exist m = (m(1) , . . . , m(d) ) : N0 → Rd and M = (M(1) , . . . , M(d) ) : N0 →
Rd such that for all n ∈ N, i ∈ {1, 2, . . . , d} it holds that
Θ0 = ξ, m0 = 0, M0 = 0, (6.363)
Chapter 7
Stochastic gradient descent (SGD) optimization methods
This chapter reviews and studies SGD-type optimization methods such as the classical
plain-vanilla SGD optimization method (see Section 7.2) as well as more sophisticated
SGD-type optimization methods including SGD-type optimization methods with momenta
(cf. Sections 7.4, 7.5, and 7.9 below) and SGD-type optimization methods with adaptive
modifications of the learning rates (cf. Sections 7.6, 7.7, 7.8, and 7.9 below).
For a brief list of resources in the scientific literature providing reviews on gradient
based optimization methods we refer to the beginning of Chapter 6.
ym = E(xm ). (7.1)
Note that the process (θn )n∈N0 is the GD process for the objective function L with learning
rates (γn )n∈N and initial value ξ (cf. Definition 6.1.1). Moreover, observe that the assumption
that a is differentiable ensures that L in (7.4) is also differentiable (see Section 5.3.2 above
for details).
In typical practical deep learning applications the number M of available known input-
output data pairs is very large, say, for example, M ≥ 106 . As a consequence it is typically
computationally prohibitively expensive to determine the exact gradient of the objective
function to perform steps of deterministic GD-type optimization methods. As a remedy for
this, deep learning algorithms usually employ stochastic variants of GD-type optimization
methods, where in each step of the optimization method the precise gradient of the objective
function is replaced by a Monte Carlo approximation of the gradient of the objective function.
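The following short Python sketch illustrates this idea in the simplest possible setting: for a large data set, the mean gradient over a small, uniformly sampled minibatch is already close to the full gradient while being much cheaper to compute. The per-example loss below and all concrete sizes are illustrative choices and not prescribed by the text.

import numpy as np

rng = np.random.default_rng(0)
M, d, J = 10**6, 10, 64                  # many data points, small minibatch
X = rng.normal(size=(M, d))              # "training data" x_1, ..., x_M
theta = np.ones(d)

# For the simple per-example loss l(theta, x) = 1/2 * ||theta - x||^2 one has
# (nabla_theta l)(theta, x) = theta - x, so the full gradient is theta - mean(X).
full_grad = theta - X.mean(axis=0)

# Monte Carlo estimate based on a uniformly sampled minibatch of size J
idx = rng.integers(0, M, size=J)
minibatch_grad = theta - X[idx].mean(axis=0)

print(np.linalg.norm(full_grad - minibatch_grad))   # small, at cost J instead of M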
We now sketch this approach for the GD optimization method in (7.4) resulting in the
popular SGD optimization method applied to (7.2).
Specifically, let S = {1, 2, . . . , M }, J ∈ N, let (Ω, F, P) be a probability space, for every
n ∈ N, j ∈ {1, 2, . . . , J} let mn,j : Ω → S be a uniformly distributed random variable, let
l : Rd × S → R satisfy for all θ ∈ Rd , m ∈ S that
l(θ, m) = |N^{θ,d}_{M_{a,l_1}, M_{a,l_2}, ..., M_{a,l_h}, id_R}(x_m) − y_m|²,   (7.5)
Definition 7.2.1 (SGD optimization method). Let d ∈ N, (γn )n∈N ⊆ [0, ∞), (Jn )n∈N ⊆ N,
let (Ω, F, P) be a probability space, let (S, S) be a measurable space, let ξ : Ω → Rd be
a random variable, for every n ∈ N, j ∈ {1, 2, . . . , Jn } let Xn,j : Ω → S be a random
variable, and let l = (l(θ, x))(θ,x)∈Rd ×S : Rd × S → R and g : Rd × S → Rd satisfy for all
U ∈ {V ⊆ Rd : V is open}, x ∈ S, θ ∈ U with (U ∋ ϑ 7→ l(ϑ, x) ∈ R) ∈ C 1 (U, R) that g(θ, x) = (∇θ l)(θ, x).
Then we say that Θ is the SGD process on ((Ω, F, P), (S, S)) for the loss function l with
generalized gradient g, learning rates (γn )n∈N , batch sizes (Jn )n∈N , initial value ξ, and data
(Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } (we say that Θ is the SGD process for the loss function l with
learning rates (γn )n∈N , batch sizes (Jn )n∈N , initial value ξ, and data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } )
if and only if it holds that Θ : N0 × Ω → Rd is the function from N0 × Ω to Rd which satisfies
for all n ∈ N that
" Jn
#
1 X
Θ0 = ξ and Θn = Θn−1 − γn g(Θn−1 , Xn,j ) . (7.9)
Jn j=1
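A from-scratch implementation of the plain SGD recursion (7.9) requires only a few lines. The following Python sketch uses the simple per-example loss l(θ, x) = ½∥θ − x∥² (whose gradient is θ − x) and constant learning rates; these concrete choices are illustrative and not prescribed by the definition.

import numpy as np

rng = np.random.default_rng(0)
M, d, J, N = 10000, 5, 32, 2000
X = rng.normal(loc=3.0, size=(M, d))     # data; the expected loss is minimized at E[X_1]

Theta = np.zeros(d)
for n in range(1, N + 1):
    gamma_n = 0.1                                     # constant learning rate (illustrative)
    batch = X[rng.integers(0, M, size=J)]             # data X_{n,1}, ..., X_{n,J_n}
    grad_estimate = (Theta - batch).mean(axis=0)      # (1/J_n) * sum_j g(Theta_{n-1}, X_{n,j})
    Theta = Theta - gamma_n * grad_estimate           # cf. (7.9)
print(Theta)
print(X.mean(axis=0))                                 # Theta ends up close to this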
let (Ω, F, P) be a probability space, let (Jn )n∈N ⊆ N, for every n ∈ N, j ∈ {1, 2, . . . , Jn } let
mn,j : Ω → S be a uniformly distributed random variable, and let Θ : N0 × Ω → Rd satisfy
for all n ∈ N that
" Jn
#
1 X
Θ0 = ξ and Θn = Θn−1 − γn (∇θ ℓ)(Θn−1 , mn,j ) (7.13)
Jn j=1
(ii) it holds that Θ is the SGD process for the loss function ℓ with learning rates (γn )n∈N ,
batch sizes (Jn )n∈N , initial value ξ, and data (mn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } , and
Proof for Example 7.2.2. Note that (7.12) proves item (i). Observe that (7.13) proves
item (ii). Note that (7.11), (7.10), and the assumption that for all n ∈ N, j ∈ {1, 2, . . . , Jn }
it holds that mn,j is uniformly distributed imply that for all n ∈ N, j ∈ {1, 2, . . . , Jn } it
holds that
"M #
1 X
E[ℓ(η, mn,j )] = ℓ(η, m)
M m=1
"M # (7.15)
1 X θ,d
2
= NM a,l1 ,Ma,l2 ,...,Ma,lh ,idR
(xm ) − ym = L(θ).
M m=1
Source codes 7.1 and 7.2 give two concrete implementations in PyTorch of the
framework described in Example 7.2.2 with different data and network architectures. The
plots generated by these codes can be found in Figures 7.1 and 7.2, respectively. They
show the approximations of the respective target functions by the realization functions of
the ANNs at various points during the training.
1 import torch
2 import torch . nn as nn
3 import numpy as np
4 import matplotlib . pyplot as plt
5
6 M = 10000 # number of training samples
7
8 # We fix a random seed . This is not necessary for training a
9 # neural network , but we use it here to ensure that the same
10 # plot is created on every run .
11 torch . manual_seed (0)
12
13 # Here , we define the training set .
14 # Create a tensor of shape (M , 1) with entries sampled from a
15 # uniform distribution on [ -2 * pi , 2 * pi )
16 X = ( torch . rand (( M , 1) ) - 0.5) * 4 * np . pi
17 # We use the sine as the target function , so this defines the
18 # desired outputs .
19 Y = torch . sin ( X )
20
44 y = torch . sin ( x )
45 for ax in axs . flatten () :
46 ax . plot (x , y , label = " Target " )
47 ax . set_xlim ([ -2 * np . pi , 2 * np . pi ])
48 ax . set_ylim ([ -1.1 , 1.1])
49
50 plot_after = [1 , 30 , 100 , 300 , 1000 , 3000 , 10000 , 30000 , 100000]
51
70 if n + 1 in plot_after :
71 # Plot the realization function of the ANN
72 i = plot_after . index ( n + 1)
73 ax = axs [ i // 3][ i % 3]
74 ax . set_title ( f " Batch { n +1} " )
75
[Figure 7.1: Plot produced by Source code 7.1, showing the approximation of the target function sin by the realization function of the ANN after various numbers of SGD steps (panels "Batch 1" to "Batch 100000") together with the target function.]
1 import torch
2 import torch . nn as nn
3 import numpy as np
6 def plot_heatmap ( ax , g ) :
7 x = np . linspace ( -2 * np . pi , 2 * np . pi , 100)
8 y = np . linspace ( -2 * np . pi , 2 * np . pi , 100)
9 x , y = np . meshgrid (x , y )
10
11 # flatten the grid to [ num_points , 2] and convert to tensor
12 grid = np . vstack ([ x . flatten () , y . flatten () ]) . T
13 grid_torch = torch . from_numpy ( grid ) . float ()
14
15 # pass the grid through the network
16 z = g ( grid_torch )
17
36 N = 100000
37
38 loss = nn . MSELoss ()
39 gamma = 0.05
40
41 fig , axs = plt . subplots (
42 3 , 3 , figsize =(12 , 12) , sharex = " col " , sharey = " row " ,
43 )
44
45 net = nn . Sequential (
46 nn . Linear (2 , 50) ,
47 nn . Softplus () ,
48 nn . Linear (50 ,50) ,
49 nn . Softplus () ,
50 nn . Linear (50 , 1)
51 )
52
55 for n in range ( N + 1) :
56 indices = torch . randint (0 , M , (J ,) )
57
58 x = X [ indices ]
59 y = Y [ indices ]
60
61 net . zero_grad ()
62
63 loss_val = loss ( net ( x ) , y )
64 loss_val . backward ()
65
66 with torch . no_grad () :
67 for p in net . parameters () :
68 p . sub_ ( gamma * p . grad )
69
70 if n in plot_after :
71 i = plot_after . index ( n )
72
[Figure 7.2: Plot produced by Source code 7.2, showing heatmaps of the realization function of the two-dimensional ANN after various numbers of SGD steps (panels "Batch 1" to "Batch 100000") together with the target function (panel "Target").]
~v~ = (⟨⟨v, v⟩⟩)^{1/2},   (7.17)
let (Ω, F, P) be a probability space, and let Z : Ω → Rd be a random variable with E[~Z~] <
∞. Then
E ~Z − ϑ~2 = E ~Z − E[Z]~2 + ~E[Z] − ϑ~2 . (7.18)
Proof of Lemma 7.2.3. Observe that the assumption that E[~Z~] < ∞ and the Cauchy-
Schwarz inequality demonstrate that
E |⟨⟨Z − E[Z], E[Z] − ϑ⟩⟩| ≤ E ~Z − E[Z]~~E[Z] − ϑ~
(7.19)
≤ (E[~Z~] + ~E[Z]~)~E[Z] − ϑ~ < ∞.
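The bias–variance type decomposition in (7.18) is easy to verify numerically. The following short Python sketch approximates both sides by Monte Carlo sampling for a Gaussian random variable and the standard Euclidean scalar product; all concrete choices are illustrative.

import numpy as np

rng = np.random.default_rng(0)
d, num_samples = 3, 10**6
Z = rng.normal(loc=2.0, scale=1.5, size=(num_samples, d))   # samples of Z
vartheta = np.zeros(d)

lhs = np.mean(np.sum((Z - vartheta)**2, axis=1))            # approximates E[||Z - vartheta||^2]
mean_Z = Z.mean(axis=0)                                     # approximates E[Z]
rhs = np.mean(np.sum((Z - mean_Z)**2, axis=1)) + np.sum((mean_Z - vartheta)**2)
print(lhs, rhs)                                             # agree up to Monte Carlo error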
Proof of Lemma 7.2.4. Note that the fact that (X, Y )(P) = (X(P)) ⊗ (Y (P)), the integral
transformation theorem, Fubini’s theorem, and the assumption that E[|X| + |Y |] < ∞ show
that
E[|XY|] = ∫_Ω |X(ω) Y(ω)| P(dω) = ∫_{R×R} |xy| ((X, Y)(P))(dx, dy)
= ∫_R ∫_R |xy| (X(P))(dx) (Y(P))(dy) = ∫_R |y| [∫_R |x| (X(P))(dx)] (Y(P))(dy)   (7.21)
= [∫_R |x| (X(P))(dx)] [∫_R |y| (Y(P))(dy)] = E[|X|] E[|Y|] < ∞.
This establishes item (i). Observe that item (i), the fact that (X, Y )(P) = (X(P)) ⊗ (Y (P)),
the integral transformation theorem, and Fubini’s theorem prove that
E[XY] = ∫_Ω X(ω) Y(ω) P(dω) = ∫_{R×R} xy ((X, Y)(P))(dx, dy)
= ∫_R ∫_R xy (X(P))(dx) (Y(P))(dy) = ∫_R y [∫_R x (X(P))(dx)] (Y(P))(dy)   (7.22)
= [∫_R x (X(P))(dx)] [∫_R y (Y(P))(dy)] = E[X] E[Y].
This establishes item (ii). The proof of Lemma 7.2.4 is thus complete.
Lemma 7.2.5. Let (Ω, F, P) be a probability space, let d ∈ N, let ⟨⟨·, ·⟩⟩ : Rd × Rd → R be a
scalar product, let ~·~ : Rd → [0, ∞) satisfy for all v ∈ Rd that
~v~ = (⟨⟨v, v⟩⟩)^{1/2},   (7.23)
let X : Ω → Rd be a random variable, assume E[~X~²] < ∞, let e_1, e_2, . . . , e_d ∈ Rd satisfy
for all i, j ∈ {1, 2, . . . , d} that ⟨⟨e_i, e_j⟩⟩ = 1_{{i}}(j), and for every random variable Y : Ω → Rd
with E[~Y~²] < ∞ let Cov(Y) ∈ R^{d×d} satisfy
Cov(Y) = (E[⟨⟨e_i, Y − E[Y]⟩⟩ ⟨⟨e_j, Y − E[Y]⟩⟩])_{(i,j)∈{1,2,...,d}²}.   (7.24)
Then
Trace(Cov(X)) = E ~X − E[X]~2 . (7.25)
Proof of Lemma 7.2.5. First, note that the fact that ∀ i, j ∈ {1, 2, . . . , d} : ⟨⟨e_i, e_j⟩⟩ = 1_{{i}}(j)
implies that for all v ∈ Rd it holds that ∑_{i=1}^{d} ⟨⟨e_i, v⟩⟩ e_i = v. Combining this with the fact
that ∀ i, j ∈ {1, 2, . . . , d} : ⟨⟨e_i, e_j⟩⟩ = 1_{{i}}(j) demonstrates that
Trace(Cov(X)) = ∑_{i=1}^{d} E[⟨⟨e_i, X − E[X]⟩⟩ ⟨⟨e_i, X − E[X]⟩⟩]
= ∑_{i=1}^{d} ∑_{j=1}^{d} E[⟨⟨e_i, X − E[X]⟩⟩ ⟨⟨e_j, X − E[X]⟩⟩ ⟨⟨e_i, e_j⟩⟩]   (7.26)
= E[⟨⟨∑_{i=1}^{d} ⟨⟨e_i, X − E[X]⟩⟩ e_i, ∑_{j=1}^{d} ⟨⟨e_j, X − E[X]⟩⟩ e_j⟩⟩]
= E[⟨⟨X − E[X], X − E[X]⟩⟩] = E[~X − E[X]~²].
~v~ = (⟨⟨v, v⟩⟩)^{1/2},   (7.27)
let (Ω, F, P) be a probability space, let X_k : Ω → Rd, k ∈ {1, 2, . . . , n}, be independent
random variables, and assume ∑_{k=1}^{n} E[~X_k~] < ∞. Then
E[~∑_{k=1}^{n} (X_k − E[X_k])~²] = ∑_{k=1}^{n} E[~X_k − E[X_k]~²].   (7.28)
Proof of Lemma 7.2.6. First, observe that Lemma 7.2.4 and the assumption that E[~X1 ~ +
~X2 ~ + . . . + ~Xn ~] < ∞ ensure that for all k1 , k2 ∈ {1, 2, . . . , n} with k1 ̸= k2 it holds that
E |⟨⟨Xk1 − E[Xk1 ], Xk2 − E[Xk2 ]⟩⟩| ≤ E ~Xk1 − E[Xk1 ]~~Xk2 − E[Xk2 ]~ < ∞ (7.29)
and
E ⟨⟨Xk1 − E[Xk1 ], Xk2 − E[Xk2 ]⟩⟩
= ⟨⟨E[Xk1 − E[Xk1 ]], E[Xk2 − E[Xk2 ]]⟩⟩ (7.30)
= ⟨⟨E[Xk1 ] − E[Xk1 ], E[Xk2 ] − E[Xk2 ]⟩⟩ = 0.
Then
(i) it holds that the function ϕ is Y/B([0, ∞])-measurable and
Proof of Lemma 7.2.7. First, note that Fubini’s theorem (cf., for example, Klenke [248,
(14.6) in Theorem 14.16]), the assumption that the function X : Ω → X is F/X -measurable,
is Y/B([0, ∞])-measurable. This proves item (i). Observe that the integral transformation
theorem, the fact that (X, Y )(P) = (X(P)) ⊗ (Y (P)), and Fubini’s theorem establish that
E[Φ(X, Y)] = ∫_Ω Φ(X(ω), Y(ω)) P(dω) = ∫_{X×Y} Φ(x, y) ((X, Y)(P))(dx, dy)
= ∫_Y ∫_X Φ(x, y) (X(P))(dx) (Y(P))(dy)   (7.35)
= ∫_Y E[Φ(X, y)] (Y(P))(dy) = ∫_Y ϕ(y) (Y(P))(dy) = E[ϕ(Y)].
This proves item (ii). The proof of Lemma 7.2.7 is thus complete.
Lemma 7.2.8. Let d ∈ N, let (S, S) be a measurable space, let l = (l(θ, x))(θ,x)∈Rd ×S :
Rd × S → R be (B(Rd ) ⊗ S)/B(R)-measurable, and assume for every x ∈ S that the function
Rd ∋ θ 7→ l(θ, x) ∈ R is differentiable. Then the function
Rd × S ∋ (θ, x) 7→ (∇θ l)(θ, x) ∈ Rd (7.36)
is (B(Rd ) ⊗ S)/B(Rd )-measurable.
Proof of Lemma 7.2.8. Throughout this proof, let g = (g1 , . . . , gd ) : Rd × S → Rd satisfy
for all θ ∈ Rd , x ∈ S that
g(θ, x) = (∇θ l)(θ, x). (7.37)
The assumption that the function l : Rd × S → R is (B(Rd ) ⊗ S)/B(R)-measurable implies
that for all i ∈ {1, 2, . . . , d}, h ∈ R\{0} it holds that the function
Rd × S ∋ (θ, x) = ((θ_1, . . . , θ_d), x) 7→ [l((θ_1, . . . , θ_{i−1}, θ_i + h, θ_{i+1}, . . . , θ_d), x) − l(θ, x)]/h ∈ R   (7.38)
is (B(Rd )⊗S)/B(R)-measurable. The fact that for all i ∈ {1, 2, . . . , d}, θ = (θ1 , . . . , θd ) ∈ Rd ,
x ∈ S it holds that
g_i(θ, x) = lim_{n→∞} [l((θ_1, . . . , θ_{i−1}, θ_i + 2^{−n}, θ_{i+1}, . . . , θ_d), x) − l(θ, x)] / 2^{−n}   (7.39)
hence demonstrates that for all i ∈ {1, 2, . . . , d} it holds that the function gi : Rd × S → R
is (B(Rd ) ⊗ S)/B(R)-measurable. This ensures that g is (B(Rd ) ⊗ S)/B(Rd )-measurable.
The proof of Lemma 7.2.8 is thus complete.
Lemma 7.2.9. Let d ∈ N, (γn )n∈N ⊆ (0, ∞), (Jn )n∈N ⊆ N, let ⟨⟨·, ·⟩⟩ : Rd × Rd → R be a
scalar product, let ~·~ : Rd → [0, ∞) satisfy for all v ∈ Rd that
(7.40)
p
~v~ = ⟨⟨v, v⟩⟩,
and let Θ : N0 × Ω → Rd be the stochastic process which satisfies for all n ∈ N that
Θ_0 = ξ   and   Θ_n = Θ_{n−1} − γ_n [(1/J_n) ∑_{j=1}^{J_n} (∇_θ l)(Θ_{n−1}, X_{n,j})].   (7.42)
Proof of Lemma 7.2.9. Throughout this proof, for every n ∈ N let ϕn : Rd → [0, ∞] satisfy
for all θ ∈ Rd that
ϕ_n(θ) = E[~θ − (γ_n/J_n) ∑_{j=1}^{J_n} (∇_θ l)(θ, X_{n,j}) − ϑ~²].   (7.44)
Note that Lemma 7.2.3 shows that for all ϑ ∈ Rd and all random variables Z : Ω → Rd with
E[~Z~] < ∞ it holds that
Lemma 7.2.6, the fact that Xn,j : Ω → S, j ∈ {1, 2, . . . , Jn }, n ∈ N, are i.i.d. random
variables, and the fact that for all n ∈ N, j ∈ {1, 2, . . . , Jn }, θ ∈ Rd it holds that
E[~(∇_θ l)(θ, X_{n,j})~] = E[~(∇_θ l)(θ, X_{1,1})~] < ∞   (7.47)
Furthermore, observe that (7.42), (7.44), the fact that for all n ∈ N it holds that Θn−1
and (Xn,j )j∈{1,2,...,Jn } are independent random variables, and Lemma 7.2.7 prove that for all
n ∈ N, ϑ ∈ Rd it holds that
E[~Θ_n − ϑ~²] = E[~Θ_{n−1} − (γ_n/J_n) ∑_{j=1}^{J_n} (∇_θ l)(Θ_{n−1}, X_{n,j}) − ϑ~²] = E[ϕ_n(Θ_{n−1})].   (7.49)
Combining this with (7.48) implies that for all n ∈ N, ϑ ∈ Rd it holds that
E[~Θ_n − ϑ~²] ≥ E[((γ_n)²/J_n) V(Θ_{n−1})] = ((γ_n)²/J_n) E[V(Θ_{n−1})].   (7.50)
Corollary 7.2.10. Let d ∈ N, ε ∈ (0, ∞), (γn )n∈N ⊆ (0, ∞), (Jn )n∈N ⊆ N, let ⟨⟨·, ·⟩⟩ : Rd ×
Rd → R be a scalar product, let ~·~ : Rd → [0, ∞) satisfy for all v ∈ Rd that
~v~ = (⟨⟨v, v⟩⟩)^{1/2},   (7.51)
and let Θ : N0 × Ω → Rd be the stochastic process which satisfies for all n ∈ N that
" Jn
#
1 X
Θ0 = ξ and Θn = Θn−1 − γn (∇θ l)(Θn−1 , Xn,j ) . (7.53)
Jn j=1
Then
(i) it holds for all n ∈ N, ϑ ∈ Rd that
(E[~Θ_n − ϑ~²])^{1/2} ≥ ε γ_n / (J_n)^{1/2}   (7.54)
and
(ii) it holds for all ϑ ∈ Rd that
lim inf_{n→∞} (E[~Θ_n − ϑ~²])^{1/2} ≥ ε lim inf_{n→∞} [γ_n / (J_n)^{1/2}].   (7.55)
Proof of Corollary 7.2.10. Throughout this proof, let V : Rd → [0, ∞] satisfy for all θ ∈ Rd
that
V(θ) = E ~(∇θ l)(θ, X1,1 ) − E (∇θ l)(θ, X1,1 ) ~2 . (7.56)
Then
E[|l(θ, X1 )| + ∥(∇θ l)(θ, X1,1 )∥2 ] < ∞,   (7.61)
and
Proof of Lemma 7.2.11. First, note that (7.59) and Lemma 5.6.4 prove that for all θ, x ∈ Rd
it holds that
(∇θ l)(θ, x) = (1/2)(2(θ − x)) = θ − x.   (7.64)
The assumption that E[∥X1,1 ∥2 ] < ∞ hence implies that for all θ ∈ Rd it holds that
E[∥(∇θ l)(θ, X1,1 )∥2 ] = E[∥θ − X1,1 ∥2 ] ≤ ∥θ∥2 + E[∥X1,1 ∥2 ] < ∞.   (7.65)
This establishes item (i). Furthermore, observe that (7.64) and item (i) demonstrate that
for all θ ∈ Rd it holds that
This proves item (ii). Note that item (i) in Corollary 7.2.10 and items (i) and (ii) establish
item (iii). The proof of Lemma 7.2.11 is thus complete.
Lemma 7.2.12 (A lower bound for the natural logarithm). It holds for all x ∈ (0, ∞) that
ln(x) ≥ (x − 1)/x.   (7.67)
Proof of Lemma 7.2.12. First, observe that the fundamental theorem of calculus ensures
that for all x ∈ [1, ∞) it holds that
ln(x) = ln(x) − ln(1) = ∫_1^x (1/t) dt ≥ ∫_1^x (1/x) dt = (x − 1)/x.   (7.68)
Furthermore, note that the fundamental theorem of calculus shows that for all x ∈ (0, 1] it
holds that
ln(x) = ln(x) − ln(1) = −(ln(1) − ln(x)) = −∫_x^1 (1/t) dt
≥ −∫_x^1 (1/x) dt = −(1 − x)/x = (x − 1)/x.   (7.69)
This and (7.68) prove (7.67). The proof of Lemma 7.2.12 is thus complete.
Lemma 7.2.13 (GD fails to converge for a summable sequence of learning rates). Let
d ∈ N, ϑ ∈ Rd , ξ ∈ Rd \{ϑ}, α ∈ (0, ∞), (γn )n∈N ⊆ [0, ∞)\{1/α} satisfy ∑_{n=1}^{∞} γ_n < ∞, let
L : Rd → R satisfy for all θ ∈ Rd that
Then
and
Proof of Lemma 7.2.13. Throughout this proof, let m ∈ N satisfy for all k ∈ N ∩ [m, ∞)
that γk < 1/(2α). Observe that Lemma 5.6.4 implies that for all θ ∈ Rd it holds that
(∇L)(θ) = (α/2)(2(θ − ϑ)) = α(θ − ϑ).   (7.75)
Therefore, we obtain for all n ∈ N that
Θn − ϑ = Θn−1 − γn (∇L)(Θn−1 ) − ϑ
= Θn−1 − γn α(Θn−1 − ϑ) − ϑ (7.76)
= (1 − γn α)(Θn−1 − ϑ).
Induction hence demonstrates that for all n ∈ N it holds that
" n #
Y
Θn − ϑ = (1 − γk α) (Θ0 − ϑ), (7.77)
k=1
This and the assumption that Θ0 = ξ establish item (i). Note that the fact that for all
k ∈ N it holds that γk α ̸= 1 ensures that
∏_{k=1}^{m−1} |1 − γ_k α| > 0.   (7.78)
Moreover, note that the fact that for all k ∈ N ∩ [m, ∞) it holds that γk α ∈ [0, 1/2) assures
that for all k ∈ N ∩ [m, ∞) it holds that
(1 − γk α) ∈ (1/2, 1]. (7.79)
This, Lemma 7.2.12, and the assumption that ∑_{n=1}^{∞} γ_n < ∞ show that for all n ∈ N ∩ [m, ∞)
it holds that
ln(∏_{k=m}^{n} (1 − γ_k α)) = ∑_{k=m}^{n} ln(1 − γ_k α) ≥ ∑_{k=m}^{n} [(1 − γ_k α) − 1]/(1 − γ_k α)
= −∑_{k=m}^{n} γ_k α/(1 − γ_k α) ≥ −∑_{k=m}^{n} γ_k α/(1/2)   (7.80)
= −2α [∑_{k=m}^{n} γ_k] ≥ −2α [∑_{k=1}^{∞} γ_k] > −∞.
Combining this with (7.78) proves that for all n ∈ N ∩ [m, ∞) it holds that
∏_{k=1}^{n} |1 − γ_k α| = [∏_{k=1}^{m−1} |1 − γ_k α|] exp(ln(∏_{k=m}^{n} (1 − γ_k α)))
≥ [∏_{k=1}^{m−1} |1 − γ_k α|] exp(−2α [∑_{k=1}^{∞} γ_k]) > 0.   (7.81)
This establishes item (ii). Observe that items (i) and (ii) and the assumption that ξ ̸= ϑ
imply that
" n #
Y
lim inf ∥Θn − ϑ∥2 = lim inf (1 − γk α) (ξ − ϑ)
n→∞ n→∞
k=1
n
!2
Y
= lim inf (1 − γk α) ∥ξ − ϑ∥2 (7.83)
n→∞
k=1
" n
#!
Y
= ∥ξ − ϑ∥2 lim inf 1 − γk α > 0.
n→∞
k=1
This proves item (iii). The proof of Lemma 7.2.13 is thus complete.
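The phenomenon described in Lemma 7.2.13 can be observed numerically. The following short Python sketch runs plain GD for a one-dimensional quadratic objective function with the summable learning rates γn = n^{−2}; the concrete values of α, ϑ, and the initial value are illustrative choices.

# GD for L(theta) = (alpha/2)*(theta - vartheta)^2, i.e., (nabla L)(theta) = alpha*(theta - vartheta),
# with the summable learning rates gamma_n = 1/n^2.
alpha, vartheta, Theta = 0.5, 0.0, 5.0
for n in range(1, 10**5 + 1):
    gamma_n = 1.0 / n**2
    Theta = Theta - gamma_n * alpha * (Theta - vartheta)
print(abs(Theta - vartheta))    # stays bounded away from 0, in line with Lemma 7.2.13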
and let Θ : N0 × Ω → Rd be the stochastic process which satisfies for all n ∈ N that Θ0 = 0
and
Θ_n = Θ_{n−1} − (1/n) (∇_θ l)(Θ_{n−1}, X_n)   (7.85)
(cf. Definition 3.3.4). Then
and
Proof for Example 7.2.14. Note that the assumption that E[∥X1 ∥22 ] < ∞ and Lemma 7.2.3
demonstrate that for all θ ∈ Rd it holds that
This establishes item (i). Observe that Lemma 5.6.4 ensures that for all θ, x ∈ Rd it holds
that
(∇θ l)(θ, x) = (1/2)(2(θ − x)) = θ − x.   (7.89)
This and (7.85) assure that for all n ∈ N it holds that
Θ_n = Θ_{n−1} − (1/n)(Θ_{n−1} − X_n) = ((n−1)/n) Θ_{n−1} + (1/n) X_n   (7.90)
and, as we verify next, that
Θ_n = (1/n)(X_1 + X_2 + . . . + X_n).   (7.91)
We now prove (7.91) by induction on n ∈ N. For the base case n = 1 note that (7.90)
implies that
Θ_1 = (0/1) Θ_0 + (1/1) X_1 = (1/1)(X_1).   (7.92)
This establishes (7.91) in the base case n = 1. For the induction step note that (7.90) shows
that for all n ∈ {2, 3, 4, . . .} with Θ_{n−1} = (1/(n−1))(X_1 + X_2 + . . . + X_{n−1}) it holds that
Θ_n = ((n−1)/n) Θ_{n−1} + (1/n) X_n = [(n−1)/n] [(1/(n−1))(X_1 + X_2 + . . . + X_{n−1})] + (1/n) X_n
= (1/n)(X_1 + X_2 + . . . + X_{n−1}) + (1/n) X_n = (1/n)(X_1 + X_2 + . . . + X_n).   (7.93)
Induction hence implies (7.91). Furthermore, note that (7.91) proves item (ii). Observe
that Lemma 7.2.6, item (ii), and the fact that (Xn )n∈N are i.i.d. random variables with
This proves item (iv). The proof for Example 7.2.14 is thus complete.
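The identity (7.91) can also be observed numerically: the SGD recursion (7.85) with learning rates 1/n computes exactly the running sample mean of the data. The following short Python sketch illustrates this in dimension one; the distribution of the data is an illustrative choice.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=1.0, size=1000)               # i.i.d. samples X_1, X_2, ...

Theta = 0.0
for n, x in enumerate(X, start=1):
    Theta = Theta - (1.0 / n) * (Theta - x)      # SGD step (7.85) for l(theta, x) = 1/2*(theta - x)^2
print(Theta, X.mean())                           # agree up to rounding errors, cf. (7.91)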
The next result, Theorem 7.2.15 below, specifies strong and weak convergence rates for
the SGD optimization method in dependence on the asymptotic behavior of the sequence
of learning rates. The statement and the proof of Theorem 7.2.15 can be found in Jentzen
et al. [229, Theorem 1.1].
Theorem 7.2.15 (Convergence rates in dependence of learning rates). Let d ∈ N, α, γ, ν ∈
(0, ∞), ξ ∈ Rd , let (Ω, F, P) be a probability space, let Xn : Ω → Rd , n ∈ N, be i.i.d. random
variables with E[∥X1 ∥22 ] < ∞ and P(X1 = E[X1 ]) < 1, let (rε,i )(ε,i)∈(0,∞)×{0,1} ⊆ R satisfy
for all ε ∈ (0, ∞), i ∈ {0, 1} that
r_{ε,i} = ν/2 if ν < 1,   r_{ε,i} = min{1/2, γα + (−1)^i ε} if ν = 1,   and   r_{ε,i} = 0 if ν > 1,   (7.96)
and let Θ : N0 × Ω → Rd be the stochastic process which satisfies for all n ∈ N that
Θ_0 = ξ   and   Θ_n = Θ_{n−1} − (γ/n^ν) (∇_θ l)(Θ_{n−1}, X_n).   (7.98)
Then
(i) there exists a unique ϑ ∈ Rd which satisfies that {θ ∈ Rd : L(θ) = inf w∈Rd L(w)} =
{ϑ},
(ii) for every ε ∈ (0, ∞) there exist c0 , c1 ∈ (0, ∞) such that for all n ∈ N it holds that
c_0 n^{−r_{ε,0}} ≤ (E[∥Θ_n − ϑ∥_2^2])^{1/2} ≤ c_1 n^{−r_{ε,1}},   (7.99)
and
(iii) for every ε ∈ (0, ∞) there exist c0 , c1 ∈ (0, ∞) such that for all n ∈ N it holds that
Proof of Theorem 7.2.15. Note that Jentzen et al. [229, Theorem 1.1] establishes items (i),
(ii), and (iii). The proof of Theorem 7.2.15 is thus complete.
E[|l(θ, X_1)| + ∥(∇_θ l)(θ, X_1)∥_2] < ∞,   (7.101)
⟨θ − ϑ, E[(∇_θ l)(θ, X_1)]⟩ ≥ c max{∥θ − ϑ∥_2^2, ∥E[(∇_θ l)(θ, X_1)]∥_2^2},   (7.102)
Θ_0 = ξ   and   Θ_n = Θ_{n−1} − (α/n^ν) (∇_θ l)(Θ_{n−1}, X_n)   (7.104)
(i) it holds that {θ ∈ Rd : L(θ) = inf_{w∈Rd} L(w)} = {ϑ} and
Proof of Theorem 7.2.16. Observe that Jentzen et al. [221, Theorem 1.1] proves items (i)
and (ii). The proof of Theorem 7.2.16 is thus complete.
Definition 7.3.1 (Explicit midpoint SGD optimization method). Let d ∈ N, (γn )n∈N ⊆
[0, ∞), (Jn )n∈N ⊆ N, let (Ω, F, P) be a probability space, let (S, S) be a measurable space, let
ξ : Ω → Rd be a random variable, for every n ∈ N, j ∈ {1, 2, . . . , Jn } let Xn,j : Ω → S be a
random variable, and let l = (l(θ, x))(θ,x)∈Rd ×S : Rd × S → R and g = (g1 , . . . , gd ) : Rd ×
S → Rd satisfy for all U ∈ {V ⊆ Rd : V is open}, x ∈ S, θ ∈ U with (U ∋ ϑ 7→ l(ϑ, x) ∈
R) ∈ C 1 (U, R) that
g(θ, x) = (∇θ l)(θ, x). (7.106)
Then we say that Θ is the explicit midpoint SGD process for the loss function l with
generalized gradient g, learning rates (γn )n∈N , and initial value ξ (we say that Θ is the
explicit midpoint SGD process for the loss function l with learning rates (γn )n∈N and initial
value ξ) if and only if it holds that Θ : N0 × Ω → Rd is the function from N0 × Ω to Rd
which satisfies for all n ∈ N that
" Jn
#
1 X γn h 1 PJn i
Θ0 = ξ and Θn = Θn−1 − γn g Θn−1 − g(Θn−1 , Xn,j ) , Xn,j .
Jn j=1 2 Jn j=1
(7.107)
9 M = 1000
10
11 X = torch . rand (( M , 1) ) * 4 * np . pi - 2 * np . pi
12 Y = torch . sin ( X )
13
14 J = 64
15
16 N = 150000
17
18 loss = nn . MSELoss ()
19 lr = 0.003
20
21 for n in range ( N ) :
22 indices = torch . randint (0 , M , (J ,) )
23
24 x = X [ indices ]
25 y = Y [ indices ]
26
27 net . zero_grad ()
28
58 p . copy_ ( param )
59
60 if n % 1000 == 0:
61 with torch . no_grad () :
62 x = torch . rand ((1000 , 1) ) * 4 * np . pi - 2 * np . pi
63 y = torch . sin ( x )
64 loss_val = loss ( net ( x ) , y )
65 print ( f " Iteration : { n +1} , Loss : { loss_val } " )
Definition 7.4.1 (Momentum SGD optimization method). Let d ∈ N, (γn )n∈N ⊆ [0, ∞),
(Jn )n∈N ⊆ N, (αn )n∈N ⊆ [0, 1], let (Ω, F, P) be a probability space, let (S, S) be a measurable
space, let ξ : Ω → Rd be a random variable, for every n ∈ N, j ∈ {1, 2, . . . , Jn } let
Xn,j : Ω → S be a random variable, and let l = (l(θ, x))(θ,x)∈Rd ×S : Rd × S → R and
g = (g1 , . . . , gd ) : Rd × S → Rd satisfy for all U ∈ {V ⊆ Rd : V is open}, x ∈ S, θ ∈ U
with (U ∋ ϑ 7→ l(ϑ, x) ∈ R) ∈ C 1 (U, R) that
Then we say that Θ is the momentum SGD process on ((Ω, F, P), (S, S)) for the loss function
l with generalized gradient g, learning rates (γn )n∈N , batch sizes (Jn )n∈N , momentum decay
factors (αn )n∈N , initial value ξ, and data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } (we say that Θ is the
momentum SGD process for the loss function l with learning rates (γn )n∈N , batch sizes
(Jn )n∈N , momentum decay factors (αn )n∈N , initial value ξ, and data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } )
if and only if Θ : N0 × Ω → Rd is the function from N0 × Ω to Rd which satisfies that there
exists m : N0 × Ω → Rd such that for all n ∈ N it holds that
Θ_0 = ξ,   m_0 = 0,   (7.109)
m_n = α_n m_{n−1} + (1 − α_n) [(1/J_n) ∑_{j=1}^{J_n} g(Θ_{n−1}, X_{n,j})],   (7.110)
and   Θ_n = Θ_{n−1} − γ_n m_n.   (7.111)
16 loss = nn . MSELoss ()
17 lr = 0.01
18 alpha = 0.999
19
20 fig , axs = plt . subplots (1 , 4 , figsize =(12 , 3) , sharey = ’ row ’)
21
22 net = nn . Sequential (
23 nn . Linear (1 , 200) , nn . ReLU () , nn . Linear (200 , 1)
24 )
25
26 for i , alpha in enumerate ([0 , 0.9 , 0.99 , 0.999]) :
27 print ( f " alpha = { alpha } " )
28
29 for lr in [0.1 , 0.03 , 0.01 , 0.003]:
30 torch . manual_seed (0)
31 net . apply (
32 lambda m : m . reset_parameters ()
33 if isinstance (m , nn . Linear )
34 else None
35 )
36
37 momentum = [
38 p . clone () . detach () . zero_ () for p in net . parameters ()
39 ]
40
41 losses = []
42 print ( f " lr = { lr } " )
43
44 for n in range ( N ) :
45 indices = torch . randint (0 , M , (J ,) )
46
47 x = X [ indices ]
48 y = Y [ indices ]
49
50 net . zero_grad ()
51
52 loss_val = loss ( net ( x ) , y )
53 loss_val . backward ()
54
55 with torch . no_grad () :
56 for m , p in zip ( momentum , net . parameters () ) :
57 m . mul_ ( alpha )
58 m . add_ ((1 - alpha ) * p . grad )
59 p . sub_ ( lr * m )
60
61 if n % 100 == 0:
62 with torch . no_grad () :
63 x = ( torch . rand ((1000 , 1) ) - 0.5) * 4 * np . pi
64 y = torch . sin ( x )
65 loss_val = loss ( net ( x ) , y )
66 losses . append ( loss_val . item () )
67
68 axs [ i ]. plot ( losses , label = f " $ \\ gamma = { lr } $ " )
69
70 axs [ i ]. set_yscale ( " log " )
71 axs [ i ]. set_ylim ([1 e -6 , 1])
72 axs [ i ]. set_title ( f " $ \\ alpha = { alpha } $ " )
73
74 axs [0]. legend ()
75
76 plt . tight_layout ()
77 plt . savefig ( " ../ plots / sgd_momentum . pdf " , bbox_inches = ’ tight ’)
[Figure: Plot produced by the source code above (saved as plots/sgd_momentum.pdf), showing the test loss during training with the momentum SGD method for momentum decay factors α ∈ {0, 0.9, 0.99, 0.999} (one panel each) and learning rates γ ∈ {0.1, 0.03, 0.01, 0.003}.]
Then we say that Θ is the bias-adjusted momentum SGD process on ((Ω, F, P), (S, S)) for
the loss function l with generalized gradient g, learning rates (γn )n∈N , batch sizes (Jn )n∈N ,
momentum decay factors (αn )n∈N , initial value ξ, and data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } (we say
that Θ is the bias-adjusted momentum SGD process for the loss function l with learning
rates (γn )n∈N , batch sizes (Jn )n∈N , momentum decay factors (αn )n∈N , initial value ξ, and
data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } ) if and only if Θ : N0 × Ω → Rd is the function from N0 × Ω
to Rd which satisfies that there exists m : N0 × Ω → Rd such that for all n ∈ N it holds that
Θ_0 = ξ,   m_0 = 0,   (7.113)
m_n = α_n m_{n−1} + (1 − α_n) [(1/J_n) ∑_{j=1}^{J_n} g(Θ_{n−1}, X_{n,j})],   (7.114)
and   Θ_n = Θ_{n−1} − γ_n m_n / (1 − ∏_{l=1}^{n} α_l).   (7.115)
14 J = 64
15
16 N = 150000
17
18 loss = nn . MSELoss ()
19 lr = 0.01
20 alpha = 0.99
21 adj = 1
22
23 momentum = [ p . clone () . detach () . zero_ () for p in net . parameters () ]
24
25 for n in range ( N ) :
26 indices = torch . randint (0 , M , (J ,) )
27
28 x = X [ indices ]
29 y = Y [ indices ]
30
31 net . zero_grad ()
32
33 loss_val = loss ( net ( x ) , y )
34 loss_val . backward ()
35
36 adj *= alpha
37
44 if n % 1000 == 0:
45 with torch . no_grad () :
46 x = torch . rand ((1000 , 1) ) * 4 * np . pi - 2 * np . pi
47 y = torch . sin ( x )
48 loss_val = loss ( net ( x ) , y )
49 print ( f " Iteration : { n +1} , Loss : { loss_val } " )
Definition 7.5.1 (Nesterov accelerated SGD optimization method). Let d ∈ N, (γn )n∈N ⊆
[0, ∞), (Jn )n∈N ⊆ N, (αn )n∈N ⊆ [0, 1], let (Ω, F, P) be a probability space, let (S, S) be a
measurable space, let ξ : Ω → Rd be a random variable, for every n ∈ N, j ∈ {1, 2, . . . , Jn }
let Xn,j : Ω → S be a random variable, and let l = (l(θ, x))(θ,x)∈Rd ×S : Rd × S → R and
g = (g1 , . . . , gd ) : Rd × S → Rd satisfy for all U ∈ {V ⊆ Rd : V is open}, x ∈ S, θ ∈ U
with (U ∋ ϑ 7→ l(ϑ, x) ∈ R) ∈ C 1 (U, R) that
Then we say that Θ is the Nesterov accelerated SGD process on ((Ω, F, P), (S, S)) for the
loss function l with generalized gradient g, learning rates (γn )n∈N , batch sizes (Jn )n∈N ,
momentum decay factors (αn )n∈N , initial value ξ, and data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } (we
say that Θ is the Nesterov accelerated SGD process for the loss function l with learning
rates (γn )n∈N , batch sizes (Jn )n∈N , momentum decay rates (αn )n∈N , initial value ξ, and data
(Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } ) if and only if Θ : N0 × Ω → Rd is the function from N0 × Ω to Rd
which satisfies that there exists m : N0 × Ω → Rd such that for all n ∈ N it holds that
Θ_0 = ξ,   m_0 = 0,   (7.117)
m_n = α_n m_{n−1} + (1 − α_n) [(1/J_n) ∑_{j=1}^{J_n} g(Θ_{n−1} − γ_n α_n m_{n−1}, X_{n,j})],   (7.118)
and   Θ_n = Θ_{n−1} − γ_n m_n.   (7.119)
9 M = 1000
10
11 X = torch . rand (( M , 1) ) * 4 * np . pi - 2 * np . pi
12 Y = torch . sin ( X )
13
14 J = 64
15
16 N = 150000
17
18 loss = nn . MSELoss ()
19 lr = 0.003
20 alpha = 0.999
21
Lemma 7.5.2 (Relations between Definition 7.5.1 and Definition 7.5.3). Let d ∈ N,
(γn )n∈N ⊆ [0, ∞), (Jn )n∈N ⊆ N, (αn )n∈N0 ⊆ [0, 1), let (Ω, F, P) be a probability space, let
(S, S) be a measurable space, let ξ : Ω → Rd be a random variable, for every n ∈ N, j ∈
{1, 2, . . . , Jn } let Xn,j : Ω → S be a random variable, let l = (l(θ, x))(θ,x)∈Rd ×S : Rd ×S → R
and g = (g1 , . . . , gd ) : Rd × S → Rd satisfy for all x ∈ S, θ ∈ {v ∈ Rd : l(·, x) is
differentiable at v} that
g(θ, x) = (∇θ l)(θ, x), (7.120)
let Θ : N0 × Ω → Rd and m : N0 × Ω → Rd satisfy for all n ∈ N that
Θ_0 = ξ,   m_0 = 0,   (7.121)
m_n = α_n m_{n−1} + (1 − α_n) [(1/J_n) ∑_{j=1}^{J_n} g(Θ_{n−1} − γ_n α_n m_{n−1}, X_{n,j})],   (7.122)
β_n = α_n (1 − α_{n−1}) / (1 − α_n)   and   δ_n = (1 − α_n) γ_n,   (7.124)
(i) it holds that Θ is the Nesterov accelerated SGD process on ((Ω, F, P), (S, S)) for the
loss function l with generalized gradient g, learning rates (γn )n∈N , batch sizes (Jn )n∈N ,
momentum decay factors (αn )n∈N , initial value ξ, and data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk }
and
Proof of Lemma 7.5.2. Note that (7.121), (7.122), and (7.123) show item (i). Observe that
(7.122) and (7.125) imply that for all n ∈ N it holds that
Jn
αn mn−1 1 X
mn =
+ g Ψn−1 , Xn,j
1 − αn Jn j=1
(7.129)
αn (1 − αn−1 )mn−1
Jn
1 X
= + g Ψn−1 , Xn,j .
1 − αn Jn j=1
Jn
1 X
mn = βn mn−1 + (7.130)
g Ψn−1 , Xn,j .
Jn j=1
Furthermore, note that (7.122), (7.123), and (7.125) ensure that for all n ∈ N it holds that
Ψn = Θn − γn+1 αn+1 mn
= Θn−1 − γn mn − γn+1 αn+1 mn
= Ψn−1 + γn αn mn−1 − γn mn − γn+1 αn+1 mn
"
Jn
#
1 X
= Ψn−1 + γn αn mn−1 − γn αn mn−1 − γn (1 − αn ) g Ψn−1 , Xn,j
Jn j=1
− γn+1 αn+1 mn (7.131)
" Jn
#
1 X
= Ψn−1 − γn+1 αn+1 mn − γn (1 − αn ) g Ψn−1 , Xn,j
Jn j=1
" Jn
#
1 X
= Ψn−1 − γn+1 αn+1 (1 − αn )mn − γn (1 − αn )
g Ψn−1 , Xn,j .
Jn j=1
Combining this with (7.121), (7.125), and (7.130) proves item (ii). The proof of Lemma 7.5.2
is thus complete.
Definition 7.5.3 (Simplified Nesterov accelerated SGD optimization method). Let d ∈ N,
(γn )n∈N ⊆ [0, ∞), (Jn )n∈N ⊆ N, (αn )n∈N ⊆ [0, ∞), let (Ω, F, P) be a probability space, let
(S, S) be a measurable space, let ξ : Ω → Rd be a random variable, for every n ∈ N, j ∈
{1, 2, . . . , Jn } let Xn,j : Ω → S be a random variable, and let l = (l(θ, x))(θ,x)∈Rd ×S : Rd ×
S → R and g = (g1 , . . . , gd ) : Rd × S → Rd satisfy for all x ∈ S, θ ∈ {v ∈ Rd : l(·, x) is
differentiable at v} that
g(θ, x) = (∇θ l)(θ, x). (7.133)
Then we say that Θ is the simplified Nesterov accelerated SGD process on ((Ω, F, P), (S, S))
for the loss function l with generalized gradient g, learning rates (γn )n∈N , batch sizes
(Jn )n∈N , momentum decay factors (αn )n∈N , initial value ξ, and data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk }
(we say that Θ is the simplified Nesterov accelerated SGD process for the loss function l
with learning rates (γn )n∈N , batch sizes (Jn )n∈N , momentum decay rates (αn )n∈N , initial
value ξ, and data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } ) if and only if Θ : N0 × Ω → Rd is the function
from N0 × Ω to Rd which satisfies that there exists m : N0 × Ω → Rd such that for all n ∈ N
it holds that
Θ_0 = ξ,   m_0 = 0,   (7.134)
m_n = α_n m_{n−1} + (1/J_n) ∑_{j=1}^{J_n} g(Θ_{n−1}, X_{n,j}),   (7.135)
and   Θ_n = Θ_{n−1} − γ_n α_n m_n − γ_n [(1/J_n) ∑_{j=1}^{J_n} g(Θ_{n−1}, X_{n,j})].   (7.136)
The simplified Nesterov accelerated SGD optimization method as described in Defini-
tion 7.5.3 is implemented in PyTorch in the form of the torch.optim.SGD optimizer with
the nesterov=True option.
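A minimal usage sketch of this built-in optimizer, with an arbitrary placeholder model and minibatch, looks as follows.

import torch
import torch.nn as nn
import torch.optim as optim

net = nn.Linear(10, 1)                                  # placeholder model
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9, nesterov=True)

x, y = torch.randn(32, 10), torch.randn(32, 1)          # placeholder minibatch
loss = nn.MSELoss()(net(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()                                        # one Nesterov-type SGD step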
Definition 7.6.1 (Adagrad SGD optimization method). Let d ∈ N, (γn )n∈N ⊆ [0, ∞),
(Jn )n∈N ⊆ N, ε ∈ (0, ∞), let (Ω, F, P) be a probability space, let (S, S) be a measurable
space, let ξ : Ω → Rd be a random variable, for every n ∈ N, j ∈ {1, 2, . . . , Jn } let
Xn,j : Ω → S be a random variable, and let l = (l(θ, x))(θ,x)∈Rd ×S : Rd × S → R and
g = (g1 , . . . , gd ) : Rd × S → Rd satisfy for all U ∈ {V ⊆ Rd : V is open}, x ∈ S, θ ∈ U
with (U ∋ ϑ 7→ l(ϑ, x) ∈ R) ∈ C 1 (U, R) that g(θ, x) = (∇θ l)(θ, x).
Then we say that Θ is the Adagrad SGD process on ((Ω, F, P), (S, S)) for the loss function l
with generalized gradient g, learning rates (γn )n∈N , batch sizes (Jn )n∈N , regularizing factor
ε, initial value ξ, and data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } (we say that Θ is the Adagrad SGD
process for the loss function l with learning rates (γn )n∈N , batch sizes (Jn )n∈N , regularizing
factor ε, initial value ξ, and data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } ) if and only if it holds that
Θ = (Θ(1) , . . . , Θ(d) ) : N0 × Ω → Rd is the function from N0 × Ω to Rd which satisfies for all
n ∈ N, i ∈ {1, 2, . . . , d} that Θ0 = ξ and
" n
#1/2 !−1 " Jn
#
(i)
X PJk 2 1 X
Θ(i)
n = Θn−1 − γn ε + 1
Jk j=1 gi (Θk−1 , Xk,j ) gi (Θn−1 , Xn,j ) .
k=1
Jn j=1
(7.138)
An implementation in PyTorch of the Adagrad SGD optimization method as described
in Definition 7.6.1 above is given in Source code 7.7. The Adagrad SGD optimization
method as described in Definition 7.6.1 above is also available in PyTorch in the form of
the built-in torch.optim.Adagrad optimizer (which, for applications, is generally much
preferable to implementing it from scratch).
1 import torch
2 import torch . nn as nn
3 import numpy as np
4
5 net = nn . Sequential (
6 nn . Linear (1 , 200) , nn . ReLU () , nn . Linear (200 , 1)
7 )
8
9 M = 1000
10
11 X = torch . rand (( M , 1) ) * 4 * np . pi - 2 * np . pi
12 Y = torch . sin ( X )
13
14 J = 64
15
16 N = 150000
17
18 loss = nn . MSELoss ()
19 lr = 0.02
20 eps = 1e -10
21
22 sum_sq_grad = [ p . clone () . detach () . fill_ ( eps ) for p in net . parameters () ]
23
24 for n in range ( N ) :
25 indices = torch . randint (0 , M , (J ,) )
26
27 x = X [ indices ]
28 y = Y [ indices ]
29
30 net . zero_grad ()
31
Definition 7.7.1 (RMSprop SGD optimization method). Let d ∈ N, (γn )n∈N ⊆ [0, ∞),
(Jn )n∈N ⊆ N, (βn )n∈N ⊆ [0, 1], ε ∈ (0, ∞), let (Ω, F, P) be a probability space, let (S, S) be a
measurable space, let ξ : Ω → Rd be a random variable, for every n ∈ N, j ∈ {1, 2, . . . , Jn }
let Xn,j : Ω → S be a random variable, and let l = (l(θ, x))(θ,x)∈Rd ×S : Rd × S → R and
g = (g1 , . . . , gd ) : Rd × S → Rd satisfy for all U ∈ {V ⊆ Rd : V is open}, x ∈ S, θ ∈ U
with (U ∋ ϑ 7→ l(ϑ, x) ∈ R) ∈ C 1 (U, R) that g(θ, x) = (∇θ l)(θ, x).
Then we say that Θ is the RMSprop SGD process on ((Ω, F, P), (S, S)) for the loss function l
with generalized gradient g, learning rates (γn )n∈N , batch sizes (Jn )n∈N , second moment decay
factors (βn )n∈N , regularizing factor ε, initial value ξ, and data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } (we
say that Θ is the RMSprop SGD process for the loss function l with learning rates (γn )n∈N ,
batch sizes (Jn )n∈N , second moment decay factors (βn )n∈N , regularizing factor ε, initial
value ξ, and data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } ) if and only if it holds that Θ = (Θ(1) , . . . , Θ(d) ) :
N0 × Ω → Rd is the function from N0 × Ω to Rd which satisfies that there exists M =
(M(1) , . . . , M(d) ) : N0 × Ω → Rd such that for all n ∈ N, i ∈ {1, 2, . . . , d} it holds that
Θ_0 = ξ,   M_0 = 0,   (7.140)
M_n^{(i)} = β_n M_{n−1}^{(i)} + (1 − β_n) [(1/J_n) ∑_{j=1}^{J_n} g_i(Θ_{n−1}, X_{n,j})]²,   (7.141)
and   Θ_n^{(i)} = Θ_{n−1}^{(i)} − [γ_n / (ε + M_n^{(i)})^{1/2}] [(1/J_n) ∑_{j=1}^{J_n} g_i(Θ_{n−1}, X_{n,j})].   (7.142)
5 net = nn . Sequential (
6 nn . Linear (1 , 200) , nn . ReLU () , nn . Linear (200 , 1)
7 )
8
9 M = 1000
10
11 X = torch . rand (( M , 1) ) * 4 * np . pi - 2 * np . pi
12 Y = torch . sin ( X )
13
14 J = 64
15
16 N = 150000
17
18 loss = nn . MSELoss ()
19 lr = 0.001
20 beta = 0.9
21 eps = 1e -10
22
23 moments = [ p . clone () . detach () . zero_ () for p in net . parameters () ]
24
25 for n in range ( N ) :
26 indices = torch . randint (0 , M , (J ,) )
27
28 x = X [ indices ]
29 y = Y [ indices ]
30
31 net . zero_grad ()
32
33 loss_val = loss ( net ( x ) , y )
34 loss_val . backward ()
35
42 if n % 1000 == 0:
43 with torch . no_grad () :
44 x = torch . rand ((1000 , 1) ) * 4 * np . pi - 2 * np . pi
45 y = torch . sin ( x )
46 loss_val = loss ( net ( x ) , y )
47 print ( f " Iteration : { n +1} , Loss : { loss_val } " )
Then we say that Θ is the bias-adjusted RMSprop SGD process on ((Ω, F, P), (S, S)) for
the loss function l with generalized gradient g, learning rates (γn )n∈N , batch sizes (Jn )n∈N ,
second moment decay factors (βn )n∈N , regularizing factor ε, initial value ξ, and data
(Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } (we say that Θ is the bias-adjusted RMSprop SGD process for the
loss function l with learning rates (γn )n∈N , batch sizes (Jn )n∈N , second moment decay factors
(βn )n∈N , regularizing factor ε, initial value ξ, and data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } ) if and only
if it holds that Θ = (Θ(1) , . . . , Θ(d) ) : N0 × Ω → Rd is the function from N0 × Ω to Rd
which satisfies that there exists M = (M(1) , . . . , M(d) ) : N0 × Ω → Rd such that for all n ∈ N,
i ∈ {1, 2, . . . , d} it holds that
Θ_0 = ξ,   M_0 = 0,   (7.144)
M_n^{(i)} = β_n M_{n−1}^{(i)} + (1 − β_n) [(1/J_n) ∑_{j=1}^{J_n} g_i(Θ_{n−1}, X_{n,j})]²,   (7.145)
and   Θ_n^{(i)} = Θ_{n−1}^{(i)} − γ_n (ε + [M_n^{(i)} / (1 − ∏_{l=1}^{n} β_l)]^{1/2})^{−1} [(1/J_n) ∑_{j=1}^{J_n} g_i(Θ_{n−1}, X_{n,j})].   (7.146)
18 loss = nn . MSELoss ()
19 lr = 0.001
20 beta = 0.9
21 eps = 1e -10
22 adj = 1
23
24 moments = [ p . clone () . detach () . zero_ () for p in net . parameters () ]
25
26 for n in range ( N ) :
27 indices = torch . randint (0 , M , (J ,) )
28
29 x = X [ indices ]
30 y = Y [ indices ]
31
32 net . zero_grad ()
33
34 loss_val = loss ( net ( x ) , y )
35 loss_val . backward ()
36
Then we say that Θ is the Adadelta SGD process on ((Ω, F, P), (S, S)) for the loss function l
with generalized gradient g, batch sizes (Jn )n∈N , second moment decay factors (βn )n∈N , delta
decay factors (δn )n∈N , regularizing factor ε, initial value ξ, and data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk }
(we say that Θ is the Adadelta SGD process for the loss function l with batch sizes
(Jn )n∈N , second moment decay factors (βn )n∈N , delta decay factors (δn )n∈N , regularizing
factor ε, initial value ξ, and data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } ) if and only if it holds that
Θ = (Θ(1) , . . . , Θ(d) ) : N0 × Ω → Rd is the function from N0 × Ω to Rd which satisfies that
there exist M = (M(1) , . . . , M(d) ) : N0 × Ω → Rd and ∆ = (∆(1) , . . . , ∆(d) ) : N0 × Ω → Rd
such that for all n ∈ N, i ∈ {1, 2, . . . , d} it holds that
Θ_0 = ξ,   M_0 = 0,   ∆_0 = 0,   (7.148)
M_n^{(i)} = β_n M_{n−1}^{(i)} + (1 − β_n) [(1/J_n) ∑_{j=1}^{J_n} g_i(Θ_{n−1}, X_{n,j})]²,   (7.149)
Θ_n^{(i)} = Θ_{n−1}^{(i)} − [(ε + ∆_{n−1}^{(i)}) / (ε + M_n^{(i)})]^{1/2} [(1/J_n) ∑_{j=1}^{J_n} g_i(Θ_{n−1}, X_{n,j})],   (7.150)
and   ∆_n^{(i)} = δ_n ∆_{n−1}^{(i)} + (1 − δ_n) |Θ_n^{(i)} − Θ_{n−1}^{(i)}|².   (7.151)
9 M = 1000
10
11 X = torch . rand (( M , 1) ) * 4 * np . pi - 2 * np . pi
12 Y = torch . sin ( X )
13
14 J = 64
15
16 N = 150000
17
18 loss = nn . MSELoss ()
19 beta = 0.9
20 delta = 0.9
21 eps = 1e -10
22
23 moments = [ p . clone () . detach () . zero_ () for p in net . parameters () ]
24 Delta = [ p . clone () . detach () . zero_ () for p in net . parameters () ]
25
26 for n in range ( N ) :
27 indices = torch . randint (0 , M , (J ,) )
28
29 x = X [ indices ]
30 y = Y [ indices ]
31
32 net . zero_grad ()
33
46 if n % 1000 == 0:
47 with torch . no_grad () :
48 x = torch . rand ((1000 , 1) ) * 4 * np . pi - 2 * np . pi
49 y = torch . sin ( x )
50 loss_val = loss ( net ( x ) , y )
51 print ( f " Iteration : { n +1} , Loss : { loss_val } " )
Then we say that Θ is the Adam SGD process on ((Ω, F, P), (S, S)) for the loss function
l with generalized gradient g, learning rates (γn )n∈N , batch sizes (Jn )n∈N , momentum
decay factors (αn )n∈N , second moment decay factors (βn )n∈N , regularizing factor ε ∈ (0, ∞),
initial value ξ, and data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } (we say that Θ is the Adam SGD pro-
cess for the loss function l with learning rates (γn )n∈N , batch sizes (Jn )n∈N , momen-
tum decay factors (αn )n∈N , second moment decay factors (βn )n∈N , regularizing factor
ε ∈ (0, ∞), initial value ξ, and data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } ) if and only if it holds that
Θ = (Θ(1) , . . . , Θ(d) ) : N0 × Ω → Rd is the function from N0 × Ω to Rd which satisfies that
there exist m = (m(1) , . . . , m(d) ) : N0 × Ω → Rd and M = (M(1) , . . . , M(d) ) : N0 × Ω → Rd
such that for all n ∈ N, i ∈ {1, 2, . . . , d} it holds that
Θ_0 = ξ,   m_0 = 0,   M_0 = 0,   (7.154)
m_n = α_n m_{n−1} + (1 − α_n) [(1/J_n) ∑_{j=1}^{J_n} g(Θ_{n−1}, X_{n,j})],   (7.155)
M_n^{(i)} = β_n M_{n−1}^{(i)} + (1 − β_n) [(1/J_n) ∑_{j=1}^{J_n} g_i(Θ_{n−1}, X_{n,j})]²,   (7.156)
and   Θ_n^{(i)} = Θ_{n−1}^{(i)} − γ_n (ε + [M_n^{(i)} / (1 − ∏_{l=1}^{n} β_l)]^{1/2})^{−1} [m_n^{(i)} / (1 − ∏_{l=1}^{n} α_l)].   (7.157)
and 10^{−8} = ε as default values for (γn )n∈N ⊆ [0, ∞), (αn )n∈N ⊆ [0, 1], (βn )n∈N ⊆ [0, 1],
ε ∈ (0, ∞) in Definition 7.9.1.
An implementation in PyTorch of the Adam SGD optimization method as described
in Definition 7.9.1 above is given in Source code 7.11. The Adam SGD optimization method
as described in Definition 7.9.1 above is also available in PyTorch in the form of the
built-in torch.optim.Adam optimizer (which, for applications, is generally much preferable
to implementing it from scratch).
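A minimal usage sketch of the built-in optimizer, with a placeholder model and a single minibatch, looks as follows; the keyword arguments betas and eps correspond to the momentum and second moment decay factors and to the regularizing factor in Definition 7.9.1.

import torch
import torch.nn as nn
import torch.optim as optim

net = nn.Sequential(nn.Linear(1, 200), nn.ReLU(), nn.Linear(200, 1))
optimizer = optim.Adam(net.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)

x = torch.rand((64, 1)) * 4 * torch.pi - 2 * torch.pi   # placeholder minibatch
y = torch.sin(x)
loss = nn.MSELoss()(net(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()                                        # one Adam step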
1 import torch
2 import torch . nn as nn
3 import numpy as np
4
5 net = nn . Sequential (
6 nn . Linear (1 , 200) , nn . ReLU () , nn . Linear (200 , 1)
7 )
8
9 M = 1000
10
11 X = torch . rand (( M , 1) ) * 4 * np . pi - 2 * np . pi
12 Y = torch . sin ( X )
13
14 J = 64
15
16 N = 150000
17
18 loss = nn . MSELoss ()
19 lr = 0.0001
20 alpha = 0.9
21 beta = 0.999
22 eps = 1e -8
23 adj = 1.
24 adj2 = 1.
25
26 m = [ p . clone () . detach () . zero_ () for p in net . parameters () ]
27 MM = [ p . clone () . detach () . zero_ () for p in net . parameters () ]
28
29 for n in range ( N ) :
30 indices = torch . randint (0 , M , (J ,) )
31
32 x = X [ indices ]
33 y = Y [ indices ]
34
35 net . zero_grad ()
36
37 loss_val = loss ( net ( x ) , y )
38 loss_val . backward ()
39
40 with torch . no_grad () :
41 adj *= alpha
42 adj2 *= beta
43 for m_p , M_p , p in zip (m , MM , net . parameters () ) :
44 m_p . mul_ ( alpha )
45 m_p . add_ ((1 - alpha ) * p . grad )
46 M_p . mul_ ( beta )
47 M_p . add_ ((1 - beta ) * p . grad * p . grad )
48 p . sub_ ( lr * m_p / ((1 - adj ) * ( eps + ( M_p / (1 - adj2 ) ) . sqrt () ) ) )
49
50 if n % 1000 == 0:
51 with torch . no_grad () :
52 x = torch . rand ((1000 , 1) ) * 4 * np . pi - 2 * np . pi
53 y = torch . sin ( x )
54 loss_val = loss ( net ( x ) , y )
55 print ( f " Iteration : { n +1} , Loss : { loss_val } " )
22 # ( usually by integers ) .
23 # The torchvision . datasets module contains functions for loading
24 # popular machine learning datasets , possibly downloading and
25 # transforming the data .
26
27 # Here we load the MNIST dataset , containing 28 x28 grayscale images
28 # of handwritten digits with corresponding labels in
29 # {0 , 1 , ... , 9}.
30
31 # First load the training portion of the data set , downloading it
32 # from an online source to the local folder ./ data ( if it is not
33 # yet there ) and transforming the data to PyTorch Tensors .
34 mnist_train = datasets . MNIST (
35 " ./ data " ,
36 train = True ,
37 transform = transforms . ToTensor () ,
38 download = True ,
39 )
40 # Next load the test portion
41 mnist_test = datasets . MNIST (
42 " ./ data " ,
43 train = False ,
44 transform = transforms . ToTensor () ,
45 download = True ,
46 )
47
71 nn . Conv2d (5 , 3 , 5) , # (N , 3 , 16 , 16)
72 nn . ReLU () ,
73 nn . Flatten () , # (N , 3 * 16 * 16) = (N , 768)
74 nn . Linear (768 , 128) , # (N , 128)
75 nn . ReLU () ,
76 nn . Linear (128 , 10) , # output shape (N , 10)
77 ) . to ( device )
78
120 # entries .
121 correct_count += torch . sum (
122 pred_labels == labels
123 ) . item ()
124 avg_test_loss = total_test_loss / len ( mnist_test )
125 accuracy = correct_count / len ( mnist_test )
126 return ( avg_test_loss , accuracy )
127
128
129 # Initialize a list that holds the computed loss on every
130 # batch during training
131 train_losses = []
132
133 # Every 10 batches , we will compute the loss on the entire test
134 # set as well as the accuracy of the model ’s predictions on the
135 # entire test set . We do this for the purpose of illustrating in
136 # the produced plot the generalization capability of the ANN .
137 # Computing these losses and accuracies so frequently with such a
138 # relatively large set of datapoints ( compared to the training
139 # set ) is extremely computationally expensive , however ( most of
140 # the training runtime will be spent computing these values ) and
141 # so is not advisable during normal neural network training .
142 # Usually , the test set is only used at the very end to judge the
143 # performance of the final trained network . Often , a third set of
144 # datapoints , called the validation set ( not used to train the
145 # network directly nor to evaluate it at the end ) is used to
146 # judge overfitting or to tune hyperparameters .
147 test_interval = 10
148 test_losses = []
149 accuracies = []
150
151 # We run the training for 5 epochs , i . e . , 5 full iterations
152 # through the training set .
153 i = 0
154 for e in range (5) :
155 for images , labels in train_loader :
156 # Move the data to the device
157 images = images . to ( device )
158 labels = labels . to ( device )
159
160 # Zero out the gradients
161 optimizer . zero_grad ()
162 # Compute the output of the neural network on the current
163 # minibatch
164 output = net ( images )
165 # Compute the cross entropy loss
166 loss = loss_fn ( output , labels )
167 # Compute the gradients
168 loss . backward ()
Source code 7.12 (code/mnist.py): Python code training an ANN on the MNIST
dataset in PyTorch. This code produces a plot showing the progression of the
average loss on the test set and the accuracy of the model’s predictions, see Figure 7.4.
Figure 7.4 (plots/mnist.pdf): The plot produced by Source code 7.12, showing
the average loss over each minibatch used during training (training loss) as well as
the average loss over the test set and the accuracy of the model’s predictions over
the course of the training.
Source code 7.13 compares the performance of several of the optimization methods
introduced in this chapter, namely the plain vanilla SGD optimization method introduced
in Definition 7.2.1, the momentum SGD optimization method introduced in Definition 7.4.1,
the simplified Nesterov accelerated SGD optimization method introduced in Definition 7.5.3,
the Adagrad SGD optimization method introduced in Definition 7.6.1, the RMSprop SGD
optimization method introduced in Definition 7.7.1, the Adadelta SGD optimization method
introduced in Definition 7.8.1, and the Adam SGD optimization method introduced in
Definition 7.9.1, during training of an ANN on the MNIST dataset. The code produces two
plots showing the progression of the training loss as well as the accuracy of the model’s
predictions on the test set, see Figure 7.5. Note that this compares the performance of
the optimization methods only on one particular problem and without any efforts towards
choosing good hyperparameters for the considered optimization methods. Thus, the results
are not necessarily representative of the performance of these optimization methods in
general.
1 import torch
2 import torchvision . datasets as datasets
3 import torchvision . transforms as transforms
4 import torch . nn as nn
5 import torch . utils . data as data
6 import torch . optim as optim
7 import matplotlib . pyplot as plt
8 from matplotlib . ticker import ScalarFormatter , NullFormatter
9 import copy
10
34 )
35 test_loader = data.DataLoader(
36     mnist_test, batch_size=64, shuffle=False
37 )
38
39 # Define a neural network
40 net = nn.Sequential(              # input shape (N, 1, 28, 28)
41     nn.Conv2d(1, 5, 5),           # (N, 5, 24, 24)
42     nn.ReLU(),
43     nn.Conv2d(5, 5, 3),           # (N, 5, 22, 22)
44     nn.ReLU(),
45     nn.Conv2d(5, 3, 3),           # (N, 3, 20, 20)
46     nn.ReLU(),
47     nn.Flatten(),                 # (N, 3 * 20 * 20) = (N, 1200)
48     nn.Linear(1200, 128),         # (N, 128)
49     nn.ReLU(),
50     nn.Linear(128, 10),           # output shape (N, 10)
51 ).to(device)
52
53 # Save the initial state of the neural network
54 initial_state = copy.deepcopy(net.state_dict())
55
56 # Define the loss function
57 loss_fn = nn.CrossEntropyLoss()
58
59 # Define the optimizers that we want to compare. Each entry in the
60 # list is a tuple of a label (for the plot) and an optimizer
61 optimizers = [
62     # For SGD we use a learning rate of 0.001
63     (
64         "SGD",
65         optim.SGD(net.parameters(), lr=1e-3),
66     ),
67     (
68         "SGD with momentum",
69         optim.SGD(net.parameters(), lr=1e-3, momentum=0.9),
70     ),
71     (
72         "Nesterov SGD",
73         optim.SGD(
74             net.parameters(), lr=1e-3, momentum=0.9, nesterov=True
75         ),
76     ),
77     # For the adaptive optimization methods we use the default
78     # hyperparameters
79     (
80         "RMSprop",
81         optim.RMSprop(net.parameters()),
82     ),
83     (
84         "Adagrad",
85         optim.Adagrad(net.parameters()),
86     ),
87     (
88         "Adadelta",
89         optim.Adadelta(net.parameters()),
90     ),
91     (
92         "Adam",
93         optim.Adam(net.parameters()),
94     ),
95 ]
96
97 def compute_test_loss_and_accuracy():
98     total_test_loss = 0.0
99     correct_count = 0
100     with torch.no_grad():
101         for images, labels in test_loader:
102             images = images.to(device)
103             labels = labels.to(device)
104
105             output = net(images)
106             loss = loss_fn(output, labels)
107
108             total_test_loss += loss.item() * images.size(0)
109             pred_labels = torch.max(output, dim=1).indices
110             correct_count += torch.sum(
111                 pred_labels == labels
112             ).item()
113
114     avg_test_loss = total_test_loss / len(mnist_test)
115     accuracy = correct_count / len(mnist_test)
116
117     return (avg_test_loss, accuracy)
118
119
120 loss_plots = []
121 accuracy_plots = []
122
123 test_interval = 100
124
125 for _, optimizer in optimizers:
126     train_losses = []
127     accuracies = []
128     print(optimizer)
129
130     with torch.no_grad():
131         net.load_state_dict(initial_state)
132
133     i = 0
134     for e in range(5):
135         print(f"Epoch {e+1}")
136         for images, labels in train_loader:
137             images = images.to(device)
138             labels = labels.to(device)
139
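The remaining part of Source code 7.13 (the training step itself and the plotting commands) is not reproduced above. As a purely illustrative sketch — not the original code of the book — one training step inside the innermost loop could, for instance, use the variables optimizer, loss_fn, train_losses, accuracies, test_interval, i, and compute_test_loss_and_accuracy from the listing above roughly as follows:

            # Hypothetical sketch of one training step (not the book's code):
            # forward pass, backpropagation, parameter update with the currently
            # selected optimizer, and periodic evaluation on the test set.
            optimizer.zero_grad()
            output = net(images)
            loss = loss_fn(output, labels)
            loss.backward()
            optimizer.step()

            train_losses.append(loss.item())
            if i % test_interval == 0:
                test_loss, accuracy = compute_test_loss_and_accuracy()
                accuracies.append(accuracy)
            i += 1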
[Figure 7.5: The plots produced by Source code 7.13, showing the training loss (on a logarithmic axis) and the accuracy of the model's predictions on the test set for the SGD, SGD with momentum, Nesterov SGD, RMSprop, Adagrad, Adadelta, and Adam optimization methods over the course of the training.]
Chapter 8
Backpropagation
fk(θ, xk−1) = (FL(θL, ·) ◦ FL−1(θL−1, ·) ◦ . . . ◦ Fk(θk, ·))(xk−1),   (8.1)
let ϑ = (ϑ1, ϑ2, . . . , ϑL) ∈ Rd1 × Rd2 × . . . × RdL, x0 ∈ Rl0, x1 ∈ Rl1, . . . , xL ∈ RlL satisfy for all k ∈ {1, 2, . . . , L} that
xk = Fk(ϑk, xk−1),   (8.2)
and let Dk ∈ RlL×lk−1, k ∈ {1, 2, . . . , L + 1}, satisfy DL+1 = IlL and for all k ∈ {1, 2, . . . , L} that
Dk = Dk+1 (∂Fk/∂xk−1)(ϑk, xk−1)   (8.3)
(i) it holds for all k ∈ {1, 2, . . . , L} that fk : Rdk × Rdk+1 × . . . × RdL × Rlk−1 → RlL is
differentiable,
(ii) it holds for all k ∈ {1, 2, . . . , L} that
Dk = (∂fk/∂xk−1)((ϑk, ϑk+1, . . . , ϑL), xk−1),   (8.4)
and
(iii) it holds for all k ∈ {1, 2, . . . , L} that (∂f1/∂θk)(ϑ, x0) = Dk+1 (∂Fk/∂θk)(ϑk, xk−1).
Proof of Proposition 8.1.1. Note that (8.1), the fact that for all k ∈ N∩(0, L), (θk, θk+1, . . . , θL) ∈ Rdk × Rdk+1 × . . . × RdL, xk−1 ∈ Rlk−1 it holds that fk((θk, θk+1, . . . , θL), xk−1) = fk+1((θk+1, θk+2, . . . , θL), Fk(θk, xk−1)), the assumption that for all k ∈ {1, 2, . . . , L} it holds that Fk : Rdk × Rlk−1 → Rlk is differentiable, Lemma 5.3.2, and induction imply that for all k ∈ {1, 2, . . . , L} it holds that fk is differentiable. This proves item (i). Next we prove (8.4) by induction on k ∈ {L, L −
1, . . . , 1}. Note that (8.3), the assumption that DL+1 = IlL , and the fact that fL = FL
assure that
DL = DL+1 (∂FL/∂xL−1)(ϑL, xL−1) = (∂fL/∂xL−1)(ϑL, xL−1).   (8.8)
This establishes (8.4) in the base case k = L. For the induction step note that (8.3), the chain rule, and the fact that for all k ∈ N ∩ (0, L), xk−1 ∈ Rlk−1 it holds that fk((ϑk, ϑk+1, . . . , ϑL), xk−1) = fk+1((ϑk+1, ϑk+2, . . . , ϑL), Fk(ϑk, xk−1)) imply that for all k ∈ N ∩ (0, L) with Dk+1 = (∂fk+1/∂xk)((ϑk+1, ϑk+2, . . . , ϑL), xk) it holds that
(∂fk/∂xk−1)((ϑk, ϑk+1, . . . , ϑL), xk−1)
= (Rlk−1 ∋ xk−1 ↦ fk((ϑk, ϑk+1, . . . , ϑL), xk−1) ∈ RlL)′(xk−1)
= (Rlk−1 ∋ xk−1 ↦ fk+1((ϑk+1, ϑk+2, . . . , ϑL), Fk(ϑk, xk−1)) ∈ RlL)′(xk−1)
= [(Rlk ∋ xk ↦ fk+1((ϑk+1, ϑk+2, . . . , ϑL), xk) ∈ RlL)′(Fk(ϑk, xk−1))] [(Rlk−1 ∋ xk−1 ↦ Fk(ϑk, xk−1) ∈ Rlk)′(xk−1)]
= (∂fk+1/∂xk)((ϑk+1, ϑk+2, . . . , ϑL), xk) (∂Fk/∂xk−1)(ϑk, xk−1)
= Dk+1 (∂Fk/∂xk−1)(ϑk, xk−1) = Dk.   (8.10)
Induction thus proves (8.4). This establishes item (ii). Moreover, observe that (8.1) and (8.2) assure that for all k ∈ N ∩ (0, L), θk ∈ Rdk it holds that f1((ϑ1, . . . , ϑk−1, θk, ϑk+1, . . . , ϑL), x0) = fk+1((ϑk+1, ϑk+2, . . . , ϑL), Fk(θk, xk−1)). Combining this with the chain rule, (8.2), and (8.4) demonstrates that for all k ∈ N ∩ (0, L) it holds that
(∂f1/∂θk)(ϑ, x0)
= (Rdk ∋ θk ↦ fk+1((ϑk+1, ϑk+2, . . . , ϑL), Fk(θk, xk−1)) ∈ RlL)′(ϑk)
= [(Rlk ∋ xk ↦ fk+1((ϑk+1, ϑk+2, . . . , ϑL), xk) ∈ RlL)′(Fk(ϑk, xk−1))] [(Rdk ∋ θk ↦ Fk(θk, xk−1) ∈ Rlk)′(ϑk)]
= (∂fk+1/∂xk)((ϑk+1, ϑk+2, . . . , ϑL), xk) (∂Fk/∂θk)(ϑk, xk−1)
= Dk+1 (∂Fk/∂θk)(ϑk, xk−1).   (8.12)
Furthermore, observe that (8.1) and the fact that DL+1 = IlL ensure that
(∂f1/∂θL)(ϑ, x0) = (RdL ∋ θL ↦ FL(θL, xL−1) ∈ RlL)′(ϑL) = (∂FL/∂θL)(ϑL, xL−1) = DL+1 (∂FL/∂θL)(ϑL, xL−1).   (8.13)
Combining this and (8.12) establishes item (iii). The proof of Proposition 8.1.1 is thus
complete.
Corollary 8.1.2 (Backpropagation for parametric functions with loss). Let L ∈ N,
l0 , l1 , . . . , lL , d1 , d2 , . . . , dL ∈ N, ϑ = (ϑ1 , ϑ2 , . . . , ϑL ) ∈ Rd1 × Rd2 × . . . × RdL , x0 ∈ Rl0 , x1 ∈
Rl1 , . . . , xL ∈ RlL , y ∈ RlL , let C = (C(x, y))(x,y)∈RlL ×RlL : RlL × RlL → R be differentiable,
for every k ∈ {1, 2, . . . , L} let Fk = (Fk (θk , xk−1 ))(θk ,xk−1 )∈Rdk ×Rlk−1 : Rdk × Rlk−1 → Rlk be
differentiable, let L = (L(θ1 , θ2 , . . . , θL ))(θ1 ,θ2 ,...,θL )∈Rd1 ×Rd2 ×...×RdL : Rd1 ×Rd2 ×. . .×RdL → R
satisfy for all θ = (θ1 , θ2 , . . . , θL ) ∈ Rd1 × Rd2 × . . . × RdL that
L(θ) = (C(·, y) ◦ FL(θL, ·) ◦ FL−1(θL−1, ·) ◦ . . . ◦ F1(θ1, ·))(x0),   (8.14)
assume for all k ∈ {1, 2, . . . , L} that xk = Fk(ϑk, xk−1), and let Dk ∈ Rlk−1, k ∈ {1, 2, . . . , L + 1}, satisfy for all k ∈ {1, 2, . . . , L} that
DL+1 = (∇x C)(xL, y)   and   Dk = [(∂Fk/∂xk−1)(ϑk, xk−1)]∗ Dk+1.   (8.16)
Then
(i) it holds that L : Rd1 × Rd2 × . . . × RdL → R is differentiable and
(ii) it holds for all k ∈ {1, 2, . . . , L} that (∂L/∂θk)(ϑ) = [Dk+1]∗ (∂Fk/∂θk)(ϑk, xk−1).
Proof of Corollary 8.1.2. Throughout this proof, let Dk ∈ RlL ×lk−1 , k ∈ {1, 2, . . . , L + 1},
satisfy for all k ∈ {1, 2, . . . , L} that DL+1 = IlL and
Dk = Dk+1 (∂Fk/∂xk−1)(ϑk, xk−1)   (8.18)
and let f = (f (θ1 , θ2 , . . . , θL ))(θ1 ,θ2 ,...,θL )∈Rd1 ×Rd2 ×...×RdL : Rd1 × Rd2 × . . . × RdL → RlL satisfy
for all θ = (θ1 , θ2 , . . . , θL ) ∈ Rd1 × Rd2 × . . . × RdL that
f(θ) = (FL(θL, ·) ◦ FL−1(θL−1, ·) ◦ . . . ◦ F1(θ1, ·))(x0)   (8.19)
(cf. Definition 1.5.5). Note that item (i) in Proposition 8.1.1 ensures that f : Rd1 ×Rd2 ×. . .×
RdL → RlL is differentiable. This, the assumption that C : RlL × RlL → R is differentiable,
and the fact that L = C(·, y) ◦ f ensure that L : Rd1 × Rd2 × . . . × RdL → R is differentiable.
This establishes item (i). Next we claim that for all k ∈ {1, 2, . . . , L + 1} it holds that
[Dk]∗ = (∂C/∂x)(xL, y) Dk.   (8.20)
We now prove (8.20) by induction on k ∈ {L + 1, L, . . . , 1}. For the base case k = L + 1
note that (8.16) and (8.18) assure that
[DL+1]∗ = [(∇x C)(xL, y)]∗ = (∂C/∂x)(xL, y) = (∂C/∂x)(xL, y) IlL = (∂C/∂x)(xL, y) DL+1.   (8.21)
This establishes (8.20) in the base case k = L + 1. For the induction step observe that (8.16) and (8.18) demonstrate that for all k ∈ {L, L − 1, . . . , 1} with [Dk+1]∗ = (∂C/∂x)(xL, y) Dk+1 it holds that
[Dk]∗ = [Dk+1]∗ (∂Fk/∂xk−1)(ϑk, xk−1) = (∂C/∂x)(xL, y) Dk+1 (∂Fk/∂xk−1)(ϑk, xk−1) = (∂C/∂x)(xL, y) Dk.   (8.22)
Induction thus establishes (8.20). Furthermore, note that item (iii) in Proposition 8.1.1
assures that for all k ∈ {1, 2, . . . , L} it holds that
(∂f/∂θk)(ϑ) = Dk+1 (∂Fk/∂θk)(ϑk, xk−1).   (8.23)
Combining this with the chain rule, the fact that L = C(·, y) ◦ f, and (8.20) ensures that for all k ∈ {1, 2, . . . , L} it holds that
(∂L/∂θk)(ϑ) = (∂C/∂x)(f(ϑ), y) (∂f/∂θk)(ϑ) = (∂C/∂x)(xL, y) Dk+1 (∂Fk/∂θk)(ϑk, xk−1) = [Dk+1]∗ (∂Fk/∂θk)(ϑk, xk−1).   (8.24)
This establishes item (ii). The proof of Corollary 8.1.2 is thus complete.
let L = (L((W1, B1), . . . , (WL, BL)))((W1,B1),...,(WL,BL))∈×_{k=1}^{L}(Rlk×lk−1×Rlk) : ×_{k=1}^{L} (Rlk×lk−1 × Rlk) → R satisfy for all Ψ ∈ ×_{k=1}^{L} (Rlk×lk−1 × Rlk) that
L(Ψ) = C((R^N_a(Ψ))(x0), y),   (8.28)
and let Dk ∈ Rlk−1, k ∈ {1, 2, . . . , L + 1}, satisfy for all k ∈ {1, 2, . . . , L − 1} that
Proof of Corollary 8.2.2. Throughout this proof, for every k ∈ {1, 2, . . . , L} let Fk = (Fk^{(m)})m∈{1,2,...,lk} : (Rlk×lk−1 × Rlk) × Rlk−1 → Rlk satisfy for all (Wk, Bk) ∈ Rlk×lk−1 × Rlk, xk−1 ∈ Rlk−1 that
Fk((Wk, Bk), xk−1) = M_{a 1_{[0,L)}(k)+id_R 1_{{L}}(k), lk}(Wk xk−1 + Bk)   (8.34)
Moreover, observe that (8.27) and (8.34) imply that for all k ∈ {1, 2, . . . , L} it holds that
Combining this and (8.37) with (8.29) and (8.30) demonstrates that for all k ∈ {1, 2, . . . , L}
it holds that
DL+1 = (∇x C)(xL, y)   and   Dk = [(∂Fk/∂xk−1)(ϑk, xk−1)]∗ Dk+1.   (8.39)
Next note that this, (8.35), (8.36), and Corollary 8.1.2 prove that for all k ∈ {1, 2, . . . , L}
it holds that
(∇Bk L)(Φ) = [(∂Fk/∂Bk)((Wk, Bk), xk−1)]∗ Dk+1   and   (8.40)
(∇Wk L)(Φ) = [(∂Fk/∂Wk)((Wk, Bk), xk−1)]∗ Dk+1.   (8.41)
Moreover, observe that (8.34) implies that
(∂FL/∂BL)((WL, BL), xL−1) = IlL.   (8.42)
This establishes item (ii). Furthermore, note that (8.34) assures that for all k ∈ {1, 2, . . . , L−
1} it holds that
(∂Fk/∂Bk)((Wk, Bk), xk−1) = diag(Ma′,lk(Wk xk−1 + Bk)).   (8.44)
Combining this with (8.40) implies that for all k ∈ {1, 2, . . . , L − 1} it holds that
This establishes item (iii). In addition, observe that (8.34) ensures that for all m, i ∈
{1, 2, . . . , lL }, j ∈ {1, 2, . . . , lL−1 } it holds that
!
(m)
∂FL
((WL , BL ), xL−1 ) = 1{m} (i)⟨xL−1 , ej L−1 ⟩
(l )
(8.46)
∂WL,i,j
(i,j)∈{1,2,...,lk }×{1,2,...,lk−1 }
∗
= [diag(Ma′ ,lk (Wk xk−1 + Bk ))]Dk+1 [xk−1 ] .
(8.49)
This establishes item (v). The proof of Corollary 8.2.2 is thus complete.
Corollary 8.2.3 (Backpropagation for ANNs with minibatches). Let L, M ∈ N, l0, l1, . . . , lL ∈ N, Φ = ((W1, B1), . . . , (WL, BL)) ∈ ×_{k=1}^{L} (Rlk×lk−1 × Rlk), let a : R → R and C = (C(x, y))(x,y)∈RlL×RlL : RlL × RlL → R be differentiable, for every m ∈ {1, 2, . . . , M} let x_0^{(m)} ∈ Rl0, x_1^{(m)} ∈ Rl1, . . . , x_L^{(m)} ∈ RlL, y^{(m)} ∈ RlL satisfy for all k ∈ {1, 2, . . . , L} that
x_k^{(m)} = M_{a 1_{[0,L)}(k)+id_R 1_{{L}}(k), lk}(Wk x_{k−1}^{(m)} + Bk),   (8.50)
let L = (L((W1, B1), . . . , (WL, BL)))((W1,B1),...,(WL,BL))∈×_{k=1}^{L}(Rlk×lk−1×Rlk) : ×_{k=1}^{L} (Rlk×lk−1 × Rlk) → R satisfy for all Ψ ∈ ×_{k=1}^{L} (Rlk×lk−1 × Rlk) that
L(Ψ) = (1/M) [∑_{m=1}^{M} C((R^N_a(Ψ))(x_0^{(m)}), y^{(m)})],   (8.51)
and for every m ∈ {1, 2, . . . , M} let D_k^{(m)} ∈ Rlk−1, k ∈ {1, 2, . . . , L + 1}, satisfy for all k ∈ {1, 2, . . . , L − 1} that
D_{L+1}^{(m)} = (∇x C)(x_L^{(m)}, y^{(m)}),   D_L^{(m)} = [WL]∗ D_{L+1}^{(m)},   and   (8.52)
D_k^{(m)} = [Wk]∗ [diag(Ma′,lk(Wk x_{k−1}^{(m)} + Bk))] D_{k+1}^{(m)}   (8.53)
(cf. Definitions 1.2.1, 1.3.4, and 8.2.1). Then
Note that (8.56) and (8.51) ensure that for all Ψ ∈ ×_{k=1}^{L} (Rlk×lk−1 × Rlk) it holds that
L(Ψ) = (1/M) [∑_{m=1}^{M} L^{(m)}(Ψ)].   (8.57)
Corollary 8.2.2 hence establishes items (i), (ii), (iii), (iv), and (v). The proof of Corollary 8.2.3
is thus complete.
Corollary 8.2.4 (Backpropagation for ANNs with quadratic loss and minibatches). Let L, M ∈ N, l0, l1, . . . , lL ∈ N, Φ = ((W1, B1), . . . , (WL, BL)) ∈ ×_{k=1}^{L} (Rlk×lk−1 × Rlk), let a : R → R be differentiable, for every m ∈ {1, 2, . . . , M} let x_0^{(m)} ∈ Rl0, x_1^{(m)} ∈ Rl1, . . . , x_L^{(m)} ∈ RlL, y^{(m)} ∈ RlL satisfy for all k ∈ {1, 2, . . . , L} that
x_k^{(m)} = M_{a 1_{[0,L)}(k)+id_R 1_{{L}}(k), lk}(Wk x_{k−1}^{(m)} + Bk),   (8.58)
let L = (L((W1, B1), . . . , (WL, BL)))((W1,B1),...,(WL,BL))∈×_{k=1}^{L}(Rlk×lk−1×Rlk) : ×_{k=1}^{L} (Rlk×lk−1 × Rlk) → R satisfy for all Ψ ∈ ×_{k=1}^{L} (Rlk×lk−1 × Rlk) that
L(Ψ) = (1/M) [∑_{m=1}^{M} ∥(R^N_a(Ψ))(x_0^{(m)}) − y^{(m)}∥2²],   (8.59)
and for every m ∈ {1, 2, . . . , M} let D_k^{(m)} ∈ Rlk−1, k ∈ {1, 2, . . . , L + 1}, satisfy for all k ∈ {1, 2, . . . , L − 1} that
D_{L+1}^{(m)} = 2(x_L^{(m)} − y^{(m)}),   D_L^{(m)} = [WL]∗ D_{L+1}^{(m)},   and   (8.60)
D_k^{(m)} = [Wk]∗ [diag(Ma′,lk(Wk x_{k−1}^{(m)} + Bk))] D_{k+1}^{(m)}   (8.61)
(cf. Definitions 1.2.1, 1.3.4, 3.3.4, and 8.2.1). Then
Proof of Corollary 8.2.4. Throughout this proof, let C = (C(x, y))(x,y)∈RlL×RlL : RlL × RlL → R satisfy for all x, y ∈ RlL that
C(x, y) = ∥x − y∥2².   (8.64)
Observe that (8.64) ensures that for all m ∈ {1, 2, . . . , M} it holds that
(∇x C)(x_L^{(m)}, y^{(m)}) = 2(x_L^{(m)} − y^{(m)}) = D_{L+1}^{(m)}.   (8.65)
Combining this, (8.58), (8.59), (8.60), and (8.61) with Corollary 8.2.3 establishes items (i),
(ii), (iii), (iv), and (v). The proof of Corollary 8.2.4 is thus complete.
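To illustrate the backpropagation recursion of Corollary 8.2.4, the following Python sketch computes the vectors D_k^{(m)} from (8.60)–(8.61) and the corresponding gradients for a small fully-connected ANN with quadratic loss and a single input-output pair (M = 1), using the softplus function as an example for the differentiable activation a. It is only an illustration of the formulas above (including a finite-difference check of one partial derivative); it is not part of the book's code.

import numpy as np

# Example differentiable activation a (softplus) and its derivative
a = lambda z: np.log1p(np.exp(z))
da = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
dims = [3, 4, 4, 2]                                  # l_0, l_1, ..., l_L
L = len(dims) - 1
Ws = [rng.standard_normal((dims[k + 1], dims[k])) for k in range(L)]
Bs = [rng.standard_normal(dims[k + 1]) for k in range(L)]
x0, y = rng.standard_normal(dims[0]), rng.standard_normal(dims[-1])

# Forward pass as in (8.58): softplus in layers 1, ..., L-1, identity in layer L
xs, zs = [x0], []
for k in range(L):
    z = Ws[k] @ xs[-1] + Bs[k]
    zs.append(z)
    xs.append(a(z) if k < L - 1 else z)

# Backward pass as in (8.60)-(8.61):
# D_{L+1} = 2(x_L - y), D_L = W_L^* D_{L+1},
# D_k = W_k^* diag(a'(W_k x_{k-1} + B_k)) D_{k+1} for k < L
D = 2.0 * (xs[-1] - y)
grad_W, grad_B = [None] * L, [None] * L
for k in reversed(range(L)):                         # k = L-1, ..., 0 (layer k+1)
    local = D if k == L - 1 else da(zs[k]) * D       # diag(a'(...)) D_{k+2}
    grad_B[k] = local                                # gradient w.r.t. B_{k+1}
    grad_W[k] = np.outer(local, xs[k])               # gradient w.r.t. W_{k+1}
    D = Ws[k].T @ local

# Finite-difference check of one weight derivative of the risk ||x_L - y||_2^2
def risk(W_list):
    x = x0
    for k in range(L):
        z = W_list[k] @ x + Bs[k]
        x = a(z) if k < L - 1 else z
    return np.sum((x - y) ** 2)

eps = 1e-6
Wp = [W.copy() for W in Ws]
Wp[0][0, 0] += eps
print(grad_W[0][0, 0], (risk(Wp) - risk(Ws)) / eps)  # should nearly agree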
Chapter 9
Kurdyka–Łojasiewicz (KL) inequalities
Observe that the assumption that for all θ ∈ O it holds that |L(θ)−L(ϑ)|α ≤ C∥(∇L)(θ)∥2
shows that for all θ ∈ O it holds that
Furthermore, note that (9.8) ensures that for all θ ∈ O it holds that V ∈ C 1 (O, R) and
Combining this with (9.9) implies that for all θ ∈ O it holds that
⟨(∇V)(θ), −(∇L)(θ)⟩ = −|L(θ) − L(ϑ)|^{−2} ∥(∇L)(θ)∥2² ≤ −C^{−2} |L(θ) − L(ϑ)|^{2α−2} ≤ −c^{−1}.   (9.11)
The assumption that for all t ∈ [0, ∞) it holds that Θt ∈ O, the assumption that for all t ∈ [0, ∞) it holds that Θt = Θ0 − ∫_0^t (∇L)(Θs) ds, and Proposition 5.6.2 therefore establish that for all t ∈ [0, ∞) it holds that
−|L(Θt) − L(ϑ)|^{−1} = V(Θt) ≤ V(Θ0) + ∫_0^t (−c^{−1}) ds = V(Θ0) − c^{−1} t = −|L(Θ0) − L(ϑ)|^{−1} − c^{−1} t.   (9.12)
Moreover, observe that (9.8) ensures that for all θ ∈ O it holds that U ∈ C 1 (O, R) and
The assumption that for all θ ∈ O it holds that |L(θ) − L(ϑ)|α ≤ C∥(∇L)(θ)∥2 therefore
demonstrates that for all θ ∈ O it holds that
⟨(∇U)(θ), −(∇L)(θ)⟩ = −(1 − α)|L(θ) − L(ϑ)|^{−α} ∥(∇L)(θ)∥2² ≤ −C^{−1}(1 − α)∥(∇L)(θ)∥2.   (9.15)
Combining this, the assumption that for all t ∈ [0, ∞) it holds that Θt ∈ O, the fact that
for all s, t ∈ [0, ∞) it holds that
Θs+t = Θs − ∫_0^t (∇L)(Θs+u) du,   (9.16)
and Proposition 5.6.2 (applied for every s ∈ [0, ∞), t ∈ (s, ∞) with d ↶ d, T ↶ t − s,
O ↶ O, α ↶ 0, β ↶ (O ∋ θ 7→ −C−1 (1 − α)∥(∇L)(θ)∥2 ∈ R), G ↶ (∇L), Θ ↶
([0, t − s] ∋ u 7→ Θs+u ∈ O) in the notation of Proposition 5.6.2) ensures that for all
s, t ∈ [0, ∞) with s < t it holds that
0 ≤ |L(Θt) − L(ϑ)|^{1−α} = U(Θt) ≤ U(Θs) + ∫_s^t (−C^{−1}(1 − α)∥(∇L)(Θu)∥2) du = |L(Θs) − L(ϑ)|^{1−α} − C^{−1}(1 − α) ∫_s^t ∥(∇L)(Θu)∥2 du.   (9.17)
This implies that for all s, t ∈ [0, ∞) with s < t it holds that
∫_s^t ∥(∇L)(Θu)∥2 du ≤ C(1 − α)^{−1} |L(Θs) − L(ϑ)|^{1−α}.   (9.18)
Combining this and the assumption that L is continuous with (9.13) demonstrates that
L(ψ) = L(limt→∞ Θt) = limt→∞ L(Θt) = L(ϑ).   (9.23)
Next observe that (9.22), (9.18), and (9.21) show that for all t ∈ [0, ∞) it holds that
∥Θt − ψ∥2 = ∥Θt − lims→∞ Θs∥2 = lims→∞ ∥Θt − Θs∥2 ≤ ∫_t^∞ ∥(∇L)(Θu)∥2 du ≤ C(1 − α)^{−1} |L(Θt) − L(ϑ)|^{1−α}.   (9.24)
Combining this with (9.13) and (9.23) establishes items (i), (ii), and (iii). The proof of
Proposition 9.2.1 is thus complete.
Proof of Lemma 9.3.1. First, note that the fact that for all ϑ ∈ Rd it holds that
L(ϑ) = (∥ϑ∥2²)^{p/2}   (9.27)
Proof of Lemma 9.4.1. Note that (9.31) and the triangle inequality ensure that for all
ϑ ∈ U it holds that
c∥(∇L)(θ)∥2 = c∥(∇L)(ϑ) + [(∇L)(θ) − (∇L)(ϑ)]∥2 ≤ c∥(∇L)(ϑ)∥2 + c∥(∇L)(θ) − (∇L)(ϑ)∥2 ≤ c∥(∇L)(ϑ)∥2 + c∥(∇L)(θ)∥2/2.   (9.33)
Combining this with (9.31) establishes that for all ϑ ∈ U it holds that
Proof of Corollary 9.4.2. Observe that the assumption that L ∈ C 1 (Rd , R) ensures that
lim supε↘0 supϑ∈{v∈Rd: ∥v−θ∥2<ε} ∥(∇L)(θ) − (∇L)(ϑ)∥2 = 0   (9.37)
(cf. Definition 3.3.4). Combining this and the fact that c > 0 with the fact that L is
continuous demonstrates that
lim supε↘0 supϑ∈{v∈Rd: ∥v−θ∥2<ε} max{|L(θ) − L(ϑ)|^α, c∥(∇L)(θ) − (∇L)(ϑ)∥2} = 0.   (9.38)
The fact that c > 0 and the fact that ∥(∇L)(θ)∥2 > 0 therefore prove that there exists
ε ∈ (0, 1) which satisfies
Note that (9.39) ensures that for all ϑ ∈ {v ∈ Rd : ∥v − θ∥2 < ε} it holds that
This and Lemma 9.4.1 establish (9.36). The proof of Corollary 9.4.2 is thus complete.
and let β ∈ (α, ∞), C ∈ R satisfy C = c(supϑ∈U |L(θ) − L(ϑ)|β−α ). Then it holds for all
ϑ ∈ U that
|L(θ) − L(ϑ)|β ≤ C|G(ϑ)|. (9.42)
Proof of Lemma 9.5.1. Observe that (9.41) shows that for all ϑ ∈ U it holds that
(9.43)
Proof of Corollary 9.5.2. Note that Lemma 9.5.1 establishes (9.45). The proof of Corol-
lary 9.5.2 is thus complete.
p(x) = ∑_{n=0}^{N} βn (x − ξ)^n.   (9.46)
Proof of Corollary 9.6.1. Observe that Theorem 6.1.3 establishes (9.46). The proof of
Corollary 9.6.1 is thus complete.
and ε = (1/2) [∑_{n=1}^{N} |βn n|/|βm m|]^{−1}. Then it holds for all x ∈ [ξ − ε, ξ + ε] that
Proof of Corollary 9.6.2. Note that Corollary 9.6.1 ensures that for all x ∈ R it holds that
p(x) − p(ξ) = ∑_{n=1}^{N} βn (x − ξ)^n.   (9.49)
p′(x) = ∑_{n=1}^{N} βn n (x − ξ)^{n−1}   (9.50)
|p(x) − p(ξ)|^α ≤ ∑_{n=m}^{N} |βn|^α |x − ξ|^{nα}.   (9.52)
|p(x) − p(ξ)|^α ≤ ∑_{n=m}^{N} |βn|^α |x − ξ|^{nα} ≤ ∑_{n=m}^{N} |βn|^α |x − ξ|^{m−1} = |x − ξ|^{m−1} ∑_{n=m}^{N} |βn|^α = |x − ξ|^{m−1} ∑_{n=1}^{N} |βn|^α.   (9.53)
|p(x) − p(ξ)|^α ≤ |x − ξ|^{m−1} ∑_{n=1}^{N} |βn|^α = (c/2) |x − ξ|^{m−1} |βm m|.   (9.54)
Furthermore, observe that (9.51) ensures that for all x ∈ R with |x − ξ| ≤ 1 it holds that
|p′(x)| = |∑_{n=m}^{N} βn n (x − ξ)^{n−1}| ≥ |βm m||x − ξ|^{m−1} − |∑_{n=m+1}^{N} βn n (x − ξ)^{n−1}|
≥ |x − ξ|^{m−1} |βm m| − ∑_{n=m+1}^{N} |x − ξ|^{n−1} |βn n|
≥ |x − ξ|^{m−1} |βm m| − ∑_{n=m+1}^{N} |x − ξ|^m |βn n|
= |x − ξ|^{m−1} [|βm m| − |x − ξ| ∑_{n=m+1}^{N} |βn n|].   (9.55)
Therefore, we obtain for all x ∈ R with |x − ξ| ≤ (1/2) [∑_{n=m}^{N} |βn n|/|βm m|]^{−1} that
|p′(x)| ≥ (1/2) |x − ξ|^{m−1} |βm m|.   (9.56)
Combining this with (9.54) demonstrates that for all x ∈ R with |x − ξ| ≤ (1/2) [∑_{n=m}^{N} |βn n|/|βm m|]^{−1} it holds that
|p(x) − p(ξ)|^α ≤ (c/2) |x − ξ|^{m−1} |βm m| ≤ c|p′(x)|.   (9.57)
This establishes (9.48). The proof of Corollary 9.6.2 is thus complete.
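As a simple illustration of (9.57) (this example is not contained in the original text), consider the polynomial p(x) = x^2 around ξ = 0, so that β0 = β1 = 0, β2 = 1, m = N = 2, and take α = 1/2 ≥ 1 − N^{−1}. Then for all x ∈ R it holds that

|p(x) − p(0)|^{1/2} = |x^2|^{1/2} = |x| = (1/2)|2x| = (1/2)|p′(x)|,

so that the standard KL inequality |p(x) − p(0)|^α ≤ c|p′(x)| holds on all of R with exponent α = 1/2 and constant c = 1/2.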
Corollary 9.6.3 (Quantitative standard KL inequalities for general one-dimensional
polynomials). Let ξ ∈ R, N ∈ N, p ∈ C ∞ (R, R) satisfy for all x ∈ R that p(N +1) (x) = 0,
let β0, β1, . . . , βN ∈ R satisfy for all n ∈ {0, 1, . . . , N} that βn = p^{(n)}(ξ)/n!, let ρ ∈ R satisfy
ρ = 1_{{0}}(∑_{n=1}^{N} |βn n|) + min([(⋃_{n=1}^{N} {|βn n|})\{0}] ∪ {∑_{n=1}^{N} |βn n|}),
and let α ∈ (0, 1], c, ε ∈ [0, ∞) satisfy
α ≥ 1 − N^{−1},   c ≥ 2ρ^{−1} [∑_{n=1}^{N} |βn|^α],   and   ε ≤ ρ [1_{{0}}(∑_{n=1}^{N} |βn|) + 2(∑_{n=1}^{N} |βn n|)]^{−1}.   (9.58)
Then it holds for all x ∈ [ξ − ε, ξ + ε] that
Proof of Corollary 9.6.3. Throughout this proof, assume without loss of generality that
Note that Corollary 9.6.1 and (9.60) ensure that ∑_{n=1}^{N} |βn| > 0. Hence, we obtain that there exists m ∈ {1, 2, . . . , N} which satisfies
|βm| > 0 = ∑_{n=1}^{m−1} |βn|.   (9.61)
Observe that (9.61), the fact that α ≥ 1 − N −1 , and Corollary 9.6.2 ensure that for all
x ∈ R with |x − ξ| ≤ (1/2) [∑_{n=1}^{N} |βn n|/|βm m|]^{−1} it holds that
|p(x) − p(ξ)|^α ≤ [∑_{n=1}^{N} 2|βn|^α/|βm m|] |p′(x)| ≤ [(2/ρ) ∑_{n=1}^{N} |βn|^α] |p′(x)| ≤ c|p′(x)|.   (9.62)
Proof of Corollary 9.6.4. Note that Corollary 9.6.3 establishes (9.63). The proof of Corol-
lary 9.6.4 is thus complete.
Proof of Corollary 9.6.5. Observe that (9.2) and Corollary 9.6.4 establish that L is a
standard KL function (cf. Definition 9.1.2). The proof of Corollary 9.6.5 is thus complete.
and
f^{(l)}(x)(v1, . . . , vl) = ∑_{k=l}^{∞} [k!/(k−l)!] Ak(v1, v2, . . . , vl, x, x, . . . , x),   (9.68)
and
(iv) it holds for all k ∈ N that f (k) (0) = k!Ak .
Proof of Proposition 9.7.2. Throughout this proof, for every K ∈ N0 let FK : Rm → Rn
satisfy for all x ∈ Rm that
FK(x) = f(0) + ∑_{k=1}^{K} Ak(x, x, . . . , x).   (9.69)
"∞ # (9.73)
X ∥x∥2 k
εx εx εx
≤ sup Ak ∥x∥2 , ∥x∥2 , . . . , ∥x∥2 2 < ∞.
k=1
ε k∈N
Combining this with (9.65) establishes item (i). Observe that, for instance, Krantz &
Parks [254, Proposition 2.2.3] implies items (ii) and (iii). Note that (9.68) implies item (iv).
The proof of Proposition 9.7.2 is thus complete.
(iii) It holds for all compact C ⊆ U that there exists c ∈ R such that for all x ∈ C, k ∈ N,
v ∈ Rm it holds that
∥f (k) (x)(v, v, . . . , v)∥2 ≤ k! ck ∥v∥k2 . (9.76)
Proof of Proposition 9.7.3. The equivalence is a direct consequence from Proposition 9.7.2.
The proof of Proposition 9.7.3 is thus complete.
Lemma 9.8.1. Let ε ∈ (0, ∞), let U ⊆ R satisfy U = {x ∈ R : |x| ≤ ε}, let (ak )k∈N ⊆ R,
and let f : U → R satisfy for all x ∈ U that
lim supK→∞ |f(x) − f(0) − ∑_{k=1}^{K} ak x^k| = 0.   (9.77)
Then
(i) it holds for all x ∈ {y ∈ U : |y| < ε} that ∑_{k=1}^{∞} |ak||x|^k < ∞ and
f(x) = f(0) + ∑_{k=1}^{∞} ak x^k,   (9.78)
(iii) it holds for all x ∈ {y ∈ U : |y| < ε}, l ∈ N that ∑_{k=l}^{∞} [k!/(k−l)!] |ak||x|^{k−l} < ∞ and
f^{(l)}(x) = ∑_{k=l}^{∞} [k!/(k−l)!] ak x^{k−l},   (9.79)
and
Lemma 9.8.2. Let ε, δ ∈ (0, 1), N ∈ N\{1}, (an )n∈N0 ⊆ R satisfy N = min({k ∈ N : ak ̸=
0} ∪ {∞}), let U ⊆ R satisfy U = {ξ ∈ R : |ξ| ≤ ε}, let L : U → R satisfy for all θ ∈ U
that
lim supK→∞ |L(θ) − L(0) − ∑_{k=1}^{K} ak θ^k| = 0,   (9.80)
and let M ∈ N ∩ (N, ∞) satisfy for all k ∈ N ∩ [M, ∞) that k|ak| ≤ (2ε^{−1})^k and
δ = min{ε/4, |aN| [2(max{|a1|, |a2|, . . . , |aM|}) + (2ε^{−1})^{N+1}]^{−1}}.   (9.81)
Proof of Lemma 9.8.2. Note that the assumption that for all k ∈ N ∩ [M, ∞) it holds that
|ak | ≤ k|ak | ≤ (2ε−1 )k ensures that for all K ∈ N ∩ [M, ∞) it holds that
∑_{k=N+1}^{K+N+1} |ak||θ|^k
= |θ|^{N+1} ∑_{k=0}^{K} |a_{k+N+1}||θ|^k
= |θ|^{N+1} [∑_{k=0}^{M} |a_{k+N+1}||θ|^k + ∑_{k=M+1}^{K} |a_{k+N+1}||θ|^k]
≤ |θ|^{N+1} [(max{|a1|, |a2|, . . . , |aM|}) ∑_{k=0}^{M} |θ|^k + ∑_{k=M+1}^{K} (2ε^{−1})^{k+N+1} |θ|^k]
= |θ|^{N+1} [(max{|a1|, |a2|, . . . , |aM|}) ∑_{k=0}^{M} |θ|^k + (2ε^{−1})^{N+1} ∑_{k=M+1}^{K} (2ε^{−1}|θ|)^k].   (9.83)
This demonstrates that for all θ ∈ R with |θ| ≤ δ it holds that
∑_{k=N+1}^{∞} |ak||θ|^k ≤ |aN||θ|^N.   (9.85)
Next observe that the assumption that for all k ∈ N ∩ [M, ∞) it holds that k|ak | ≤ (2ε−1 )k
ensures that for all K ∈ N ∩ [M, ∞) it holds that
∑_{k=N+1}^{N+K+1} k|ak||θ|^{k−1}
= |θ|^N [∑_{k=0}^{M−N−1} (k + N + 1)|a_{k+N+1}||θ|^k + ∑_{k=M−N}^{K} (k + N + 1)|a_{k+N+1}||θ|^k]
≤ |θ|^N [(max{|a1|, 2|a2|, . . . , M|aM|}) ∑_{k=0}^{M−N−1} |θ|^k + ∑_{k=M−N}^{K} (2ε^{−1})^{k+N+1} |θ|^k]
≤ |θ|^N [(max{|a1|, 2|a2|, . . . , M|aM|}) ∑_{k=0}^{M−N−1} |θ|^k + (2ε^{−1})^{N+1} ∑_{k=M−N}^{K−N} |2ε^{−1}θ|^k].   (9.87)
≤ |θ|^N [(max{|a1|, 2|a2|, . . . , M|aM|}) ∑_{k=0}^{∞} (1/4)^k + (2ε^{−1})^{N+1} ∑_{k=1}^{∞} (1/2)^k]
≤ |θ|^N [2(max{|a1|, 2|a2|, . . . , M|aM|}) + (2ε^{−1})^{N+1}].   (9.88)
Hence, we obtain for all K ∈ N ∩ [N, ∞), θ ∈ R with |θ| < δ that
|∑_{k=1}^{K} k ak θ^{k−1}| = |∑_{k=N}^{K} k ak θ^{k−1}| ≥ N|aN||θ|^{N−1} − ∑_{k=N+1}^{∞} k|ak||θ|^{k−1} ≥ (N − 1)|aN||θ|^{N−1}.   (9.90)
Proposition 9.7.2 therefore proves that for all θ ∈ {ξ ∈ R : |ξ| < ε} it holds that ∑_{k=1}^{∞} k|ak||θ|^{k−1} < ∞ and
|L′(θ)| = |∑_{k=1}^{∞} k ak θ^{k−1}| ≥ (N − 1)|aN||θ|^{N−1}.   (9.91)
Combining this with (9.86) shows that for all θ ∈ R with |θ| ≤ δ it holds that
|L(θ) − L(0)|^{(N−1)/N} ≤ |2aN|^{(N−1)/N} |θ|^{N−1} ≤ |2aN|^{(N−1)/N} (N − 1)^{−1} |aN|^{−1} |L′(θ)| ≤ 2|L′(θ)|.   (9.92)
The proof of Lemma 9.8.2 is thus complete.
Corollary 9.8.3. Let ε ∈ (0, ∞), (ak)k∈N ⊆ R, let U ⊆ R satisfy U = {θ ∈ R : |θ| ≤ ε}, and let L : U → R satisfy for all θ ∈ U that
lim supK→∞ |L(θ) − L(0) − ∑_{k=1}^{K} ak θ^k| = 0.   (9.93)
K→∞ k=1
Then there exist δ ∈ (0, ε), c ∈ (0, ∞), α ∈ (0, 1) such that for all θ ∈ {ξ ∈ R : |ξ| < δ} it
holds that
|L(θ) − L(0)|α ≤ c |L ′ (0)|. (9.94)
Proof of Corollary 9.8.3. Throughout this proof, assume without loss of generality that
ε < 1, let N ∈ N ∪ {∞} satisfy N = min({k ∈ N : ak ̸= 0} ∪ {∞}), and assume without
loss of generality that 1 < N < ∞ (cf. item (iv) in Lemma 9.8.1 and Corollary 9.4.2). Note
that item (iii) in Lemma 9.8.1 ensures that for all θ ∈ R with |θ| < ε it holds that
∑_{k=1}^{∞} k|ak||θ|^{k−1} < ∞.   (9.95)
Then there exist δ ∈ (0, ε), c ∈ (0, ∞), α ∈ (0, 1) such that for all θ ∈ {ξ ∈ R : |ξ − ϑ| < δ}
it holds that
|L(θ) − L(ϑ)|^α ≤ c |L′(θ)|.   (9.100)
Corollary 9.8.3 hence establishes that there exist δ ∈ (0, ε), c ∈ (0, ∞), α ∈ (0, 1) which
satisfy that for all θ ∈ {ξ ∈ R : |ξ| < δ} it holds that
Proof of Corollary 9.8.5. Note that Corollary 9.8.4 establishes (9.104). The proof of Corol-
lary 9.8.5 is thus complete.
Proof of Corollary 9.8.6. Observe that (9.2) and Corollary 9.8.5 establish that L is a
standard KL function (cf. Definition 9.1.2). The proof of Corollary 9.8.6 is thus complete.
Proof of Theorem 9.9.1. Note that Łojasiewicz [281, Proposition 1] demonstrates (9.105)
(cf., for example, also Bierstone & Milman [38, Proposition 6.8]). The proof of Theorem 9.9.1
is thus complete.
Corollary 9.9.2. Let d ∈ N and let L : Rd → R be analytic (cf. Definition 9.7.1). Then
L is a standard KL function (cf. Definition 9.1.2).
Proof of Corollary 9.9.2. Observe that (9.2) and Theorem 9.9.1 establish that L is a
standard KL function (cf. Definition 9.1.2). The proof of Corollary 9.9.2 is thus complete.
9.10 Counterexamples
Example 9.10.1 (Example of a smooth function that is not a standard KL function). Let
L : R → R satisfy for all x ∈ R that
L(x) = exp(−x^{−1}) if x > 0   and   L(x) = 0 if x ≤ 0.   (9.106)
Then
supx∈(0,ε) [|L(x) − L(0)|^α / |L′(x)|] = ∞,   (9.107)
and
and for every f ∈ C((0, ∞), R) let Gf : (0, ∞) → R satisfy for all x ∈ (0, ∞) that
Note that the chain rule and the product rule ensure that for all f ∈ C 1 ((0, ∞), R),
x ∈ (0, ∞) it holds that Gf ∈ C 1 ((0, ∞), R) and
(Gp )′ = Gq . (9.111)
Combining this and (9.110) with induction ensures that for all p ∈ P , n ∈ N it holds that
This and the fact that for all p ∈ P it holds that limx↘0 Gp (x) = 0 establish that for all
p ∈ P it holds that
lim (Gp )(n) (x) = 0. (9.113)
x↘0
The fact that L|(0,∞) = G(0,∞)∋x7→1∈R and (9.110) therefore establish item (i) and item (iii).
Observe that (9.106) and the fact that for all y ∈ (0, ∞) it holds that
exp(y) = ∑_{k=0}^{∞} y^k/k! ≥ y^3/3! = y^3/6   (9.114)
ensure that for all α ∈ (0, 1), ε ∈ (0, ∞), x ∈ (0, ε) it holds that
|L(x) − L(0)|^α/|L′(x)| = |L(x)|^α/|L′(x)| = x²|L(x)|^α/L(x) = x²|L(x)|^{α−1} = x² exp((1 − α)/x) ≥ x² (1 − α)³/(6x³) = (1 − α)³/(6x).   (9.115)
Hence, we obtain for all α ∈ (0, 1), ε ∈ (0, ∞) that
supx∈(0,ε) [|L(x) − L(0)|^α/|L′(x)|] ≥ supx∈(0,ε) [(1 − α)³/(6x)] = ∞.   (9.116)
Example 9.10.2 (Example of a differentiable function that fails to satisfy the standard
KL inequality). Let L : R → R satisfy for all x ∈ R that
L(x) = ∫_0^{max{x,0}} y|sin(y^{−1})| dy.   (9.117)
Then
(ii) it holds for all c ∈ R, α, ε ∈ (0, ∞) that there exist x ∈ (0, ε) such that
and
(iii) it holds for all c ∈ R, α, ε ∈ (0, ∞) that we do not have that L satisfies the standard
KL inequality at 0 on [0, ε) with exponent α and constant c
Proof for Example 9.10.2. Throughout this proof, let G : R → R satisfy for all x ∈ R that
G(x) = x|sin(x^{−1})| if x > 0   and   G(x) = 0 if x ≤ 0.   (9.119)
Therefore, we obtain that G is continuous. This, (9.117), and the fundamental theorem of
calculus ensure that L is continuously differentiable with
L ′ = G. (9.122)
Combining this with (9.120) demonstrates that for all c ∈ R, α ∈ (0, ∞), k ∈ N it holds
that
|L((kπ)−1 ) − L(0)|α = [L((kπ)−1 )]α > 0 = c|G((kπ)−1 )| = c|L ′ ((kπ)−1 )|. (9.123)
|L(θ)−L(ϑ)|α ≤ C∥G(θ)∥2 , c = |L(Θ0 )−L(ϑ)|, C(1−α)−1 c1−α +∥Θ0 −ϑ∥2 < ε, (9.126)
and inf t∈{s∈[0,∞) : ∀ r∈[0,s] : ∥Θr −ϑ∥2 <ε} L(Θt ) ≥ L(ϑ) (cf. Definition 3.3.4). Then there exists
ψ ∈ Rd such that
(i) it holds that L(ψ) = L(ϑ),
(iii) it holds for all t ∈ [0, ∞) that 0 ≤ L(Θt ) − L(ψ) ≤ C2 c2 (1{0} (c) + C2 c + c2α t)−1 , and
Proof of Proposition 9.11.2. Throughout this proof, let L : [0, ∞) → R satisfy for all t ∈
[0, ∞) that
L(t) = L(Θt ) − L(ϑ), (9.128)
let B ⊆ Rd satisfy
B = {θ ∈ Rd : ∥θ − ϑ∥2 < ε}, (9.129)
let T ∈ [0, ∞] satisfy
T = inf({t ∈ [0, ∞) : Θt ∉ B} ∪ {∞}),   (9.130)
let τ ∈ [0, T ] satisfy
τ = inf({t ∈ [0, T ) : L(t) = 0} ∪ {T }), (9.131)
let g = (gt)t∈[0,∞) : [0, ∞) → [0, ∞] satisfy for all t ∈ [0, ∞) that gt = ∫_t^∞ ∥G(Θs)∥2 ds, and
let D ∈ R satisfy D = C2 c(2−2α) . In the first step of our proof of items (i), (ii), (iii), and
(iv) we show that for all t ∈ [0, ∞) it holds that
Θt ∈ B. (9.132)
demonstrate that for almost all t ∈ [0, ∞) it holds that L is differentiable at t and satisfies
L′(t) = (d/dt)(L(Θt)) = −∥G(Θt)∥2².   (9.135)
Furthermore, observe that the assumption that inf t∈{s∈[0,∞) : ∀ r∈[0,s] : ∥Θr −ϑ∥2 <ε} L(Θt ) ≥ L(ϑ)
ensures that for all t ∈ [0, T ) it holds that
L(t) ≥ 0. (9.136)
Combining this with (9.126), (9.128), and (9.131) establishes that for all t ∈ [0, τ ) it holds
that
0 < [L(t)]α = |L(Θt ) − L(ϑ)|α ≤ C∥G(Θt )∥2 . (9.137)
The chain rule and (9.135) hence prove that for almost all t ∈ [0, τ ) it holds that
(d/dt)([L(t)]^{1−α}) = (1 − α)[L(t)]^{−α} (−∥G(Θt)∥2²) ≤ −(1 − α)C^{−1} ∥G(Θt)∥2^{−1} ∥G(Θt)∥2² = −C^{−1}(1 − α)∥G(Θt)∥2.   (9.138)
Moreover, note that (9.134) shows that [0, ∞) ∋ t 7→ L(t) ∈ R is absolutely continuous.
This and the fact that for all r ∈ (0, ∞) it holds that [r, ∞) ∋ y 7→ y 1−α ∈ R is Lipschitz
continuous imply that for all t ∈ [0, τ ) it holds that [0, t] ∋ s 7→ [L(s)]1−α ∈ R is absolutely
continuous. Combining this with (9.138) demonstrates that for all s, t ∈ [0, τ ) with s ≤ t it
holds that
∫_s^t ∥G(Θu)∥2 du ≤ −C(1 − α)^{−1} ([L(t)]^{1−α} − [L(s)]^{1−α}) ≤ C(1 − α)^{−1} [L(s)]^{1−α}.   (9.139)
In the next step we observe that (9.134) ensures that [0, ∞) ∋ t 7→ L(Θt ) ∈ R is non-
increasing. This and (9.128) establish that L is non-increasing. Combining (9.131) and
(9.136) therefore proves that for all t ∈ [τ, T ) it holds that L(t) = 0. Hence, we obtain that
for all t ∈ (τ, T ) it holds that
L′ (t) = 0. (9.140)
This and (9.135) show that for almost all t ∈ (τ, T ) it holds that
G(Θt ) = 0. (9.141)
Combining this with (9.139) implies that for all s, t ∈ [0, T ) with s ≤ t it holds that
∫_s^t ∥G(Θu)∥2 du ≤ C(1 − α)^{−1} [L(s)]^{1−α}.   (9.142)
In addition, note that (9.126) demonstrates that Θ0 ∈ B. Combining this with (9.130)
ensures that T > 0. This, (9.143), and (9.126) establish that
∫_0^T ∥G(Θu)∥2 du ≤ C(1 − α)^{−1} [L(0)]^{1−α} < ε < ∞.   (9.144)
T = ∞. (9.145)
This establishes (9.132). In the next step of our proof of items (i), (ii), (iii), and (iv) we
verify that Θt ∈ Rd , t ∈ [0, ∞), is convergent (see (9.147) below). For this observe that the
370
9.11. Convergence analysis for solutions of GF ODEs
assumption that for all t ∈ [0, ∞) it holds that Θt = Θ0 − ∫_0^t G(Θs) ds shows that for all r, s, t ∈ [0, ∞) with r ≤ s ≤ t it holds that
∥Θt − Θs∥2 = ∥∫_s^t G(Θu) du∥2 ≤ ∫_s^t ∥G(Θu)∥2 du ≤ ∫_r^∞ ∥G(Θu)∥2 du = gr.   (9.146)
Next note that (9.144) and (9.145) imply that ∞ > g0 ≥ lim supr→∞ gr = 0. Combining
this with (9.146) demonstrates that there exist ψ ∈ Rd which satisfies
In the next step of our proof of items (i), (ii), (iii), and (iv) we show that L(Θt ), t ∈ [0, ∞),
converges to L(ψ) with convergence order 1. We accomplish this by bringing a suitable
differential inequality for the reciprocal of the function L in (9.128) into play (see (9.150)
below for details). More specifically, observe that (9.135), (9.145), (9.130), and (9.126)
ensure that for almost all t ∈ [0, ∞) it holds that
Hence, we obtain that L is non-increasing. This proves that for all t ∈ [0, ∞) it holds that
L(t) ≤ L(0). This and the fact that for all t ∈ [0, τ ) it holds that L(t) > 0 establish that
for almost all t ∈ [0, τ ) it holds that
L′ (t) ≤ −C−2 [L(t)](2α−2) [L(t)]2 ≤ −C−2 [L(0)](2α−2) [L(t)]2 = −D−1 [L(t)]2 . (9.149)
L(t) ≤ D (D[L(0)]−1 + t)−1 = C2 c2−2α (C2 c1−2α + t)−1 = C2 c2 (C2 c + c2α t)−1 . (9.154)
The fact that for all t ∈ [τ, ∞) it holds that L(t) = 0 and (9.131) hence ensure that for all
t ∈ [0, ∞) it holds that
0 ≤ L(t) ≤ C2 c2 (1{0} (c) + C2 c + c2α t)−1 . (9.155)
Moreover, observe that (9.147) and the assumption that L ∈ C(Rd , R) prove that
lim supt→∞ |L(Θt ) − L(ψ)| = 0. Combining this with (9.155) establishes that L(ψ) = L(ϑ).
This and (9.155) show that for all t ∈ [0, ∞) it holds that
0 ≤ L(Θt ) − L(ψ) ≤ C2 c2 (1{0} (c) + C2 c + c2α t)−1 . (9.156)
In the final step of our proof of items (i), (ii), (iii), and (iv) we establish convergence rates
for the real numbers ∥Θt − ψ∥2 , t ∈ [0, ∞). Note that (9.147), (9.146), and (9.142) imply
that for all t ∈ [0, ∞) it holds that
∥Θt −ψ∥2 = ∥Θt − [lims→∞ Θs ]∥2 = lims→∞ ∥Θt −Θs ∥2 ≤ gt ≤ C(1−α)−1 [L(t)]1−α . (9.157)
This and (9.156) demonstrate that for all t ∈ [0, ∞) it holds that
∥Θt − ψ∥2 ≤ gt ≤ C(1 − α)^{−1} [L(Θt) − L(ψ)]^{1−α} ≤ C(1 − α)^{−1} [C²c²(1_{{0}}(c) + C²c + c^{2α}t)^{−1}]^{1−α}.   (9.158)
Proof of Corollary 9.11.3. Observe that Proposition 9.11.2 ensures that there exists ψ ∈ Rd
which satisfies that
(i) it holds that L(ψ) = L(ϑ),
(ii) it holds for all t ∈ [0, ∞) that ∥Θt − ϑ∥2 < ε,
(iii) it holds for all t ∈ [0, ∞) that 0 ≤ L(Θt ) − L(ψ) ≤ C2 c2 (1{0} (c) + C2 c + c2α t)−1 , and
(iv) it holds for all t ∈ [0, ∞) that
∥Θt − ψ∥2 ≤ ∫_t^∞ ∥G(Θs)∥2 ds ≤ C(1 − α)^{−1} [L(Θt) − L(ψ)]^{1−α} ≤ C^{3−2α} c^{2−2α} (1 − α)^{−1} (1_{{0}}(c) + C²c + c^{2α}t)^{α−1}.   (9.162)
Note that item (iii) and the assumption that c ≤ 1 establish that for all t ∈ [0, ∞) it holds
that
0 ≤ L(Θt ) − L(ψ) ≤ c2 (C−2 1{0} (c) + c + C−2 c2α t)−1 ≤ (1 + C−2 t)−1 . (9.163)
This and item (iv) show that for all t ∈ [0, ∞) it holds that
∥Θt − ψ∥2 ≤ ∫_t^∞ ∥G(Θs)∥2 ds ≤ C(1 − α)^{−1} [L(Θt) − L(ψ)]^{1−α} ≤ C(1 − α)^{−1} (1 + C^{−2}t)^{α−1}.   (9.164)
Combining this with item (i), item (ii), and (9.163) proves (9.161). The proof of Corol-
lary 9.11.3 is thus complete.
and assume lim inf t→∞ ∥Θt ∥2 < ∞. Then there exist ϑ ∈ Rd , C, τ, β ∈ (0, ∞) such that for
all t ∈ [τ, ∞) it holds that
∥Θt − ϑ∥2 ≤ (1 + C(t − τ))^{−β}   and   0 ≤ L(Θt) − L(ϑ) ≤ (1 + C(t − τ))^{−1}.   (9.167)
Proof of Proposition 9.11.4. Observe that (9.166) implies that [0, ∞) ∋ t 7→ L(Θt ) ∈ R is
non-increasing. Therefore, we obtain that there exists m ∈ [−∞, ∞) which satisfies
m = lim supt→∞ L(Θt ) = lim inf t→∞ L(Θt ) = inf t∈[0,∞) L(Θt ). (9.168)
Furthermore, note that the assumption that lim inf t→∞ ∥Θt ∥2 < ∞ demonstrates that there
exist ϑ ∈ Rd and δ = (δn )n∈N : N → [0, ∞) which satisfy
Observe that (9.168), (9.169), and the fact that L is continuous ensure that
Next let ε, C ∈ (0, ∞), α ∈ (0, 1) satisfy for all θ ∈ Rd with ∥θ − ϑ∥2 < ε that
Note that (9.169) and the fact that L is continuous demonstrate that there exist n ∈ N,
c ∈ [0, 1] which satisfy
c = |L(Θδn ) − L(ϑ)| and C(1 − α)−1 c1−α + ∥Θδn − ϑ∥2 < ε. (9.172)
Next let Φ = (Φt)t∈[0,∞) : [0, ∞) → Rd satisfy for all t ∈ [0, ∞) that
Φt = Θδn+t.   (9.173)
Observe that (9.166), (9.170), and (9.173) establish that for all t ∈ [0, ∞) it holds that
L(Φt) = L(Φ0) − ∫_0^t ∥G(Φs)∥2² ds,   Φt = Φ0 − ∫_0^t G(Φs) ds,   and   L(Φt) ≥ L(ϑ).   (9.174)
Combining this with (9.171), (9.172), (9.173), and Corollary 9.11.3 (applied with Θ ↶ Φ in
the notation of Corollary 9.11.3) establishes that there exists ψ ∈ Rd which satisfies for all
t ∈ [0, ∞) that
0 ≤ L(Φt ) − L(ψ) ≤ (1 + C−2 t)−1 , ∥Φt − ψ∥2 ≤ C(1 − α)−1 (1 + C−2 t)α−1 , (9.175)
and L(ψ) = L(ϑ). Note that (9.173) and (9.175) show for all t ∈ [0, ∞) that 0 ≤
L(Θδn +t ) − L(ψ) ≤ (1 + C−2 t)−1 and ∥Θδn +t − ψ∥2 ≤ C(1 − α)−1 (1 + C−2 t)α−1 . Hence, we
obtain for all τ ∈ [δn , ∞), t ∈ [τ, ∞) that
and
Observe that (9.176), (9.177), and (9.178) demonstrate for all t ∈ [τ, ∞) that
and
∥Θt − ψ∥2 ≤ C(1 − α)^{−1} [1 + C^{−2}(τ − δn) + C^{−2}(t − τ)]^{α−1} ≤ [1 + C^{−1}(t − τ)]^{α−1}.   (9.180)
and assume lim inf t→∞ ∥Θt ∥2 < ∞ (cf. Definition 3.3.4). Then there exist ϑ ∈ Rd , C, β ∈
(0, ∞) which satisfy for all t ∈ [0, ∞) that
∥Θt − ϑ∥2 ≤ C(1 + t)−β and 0 ≤ L(Θt ) − L(ϑ) ≤ C(1 + t)−1 . (9.183)
Proof of Corollary 9.11.5. Note that Proposition 9.11.4 demonstrates that there exist ϑ ∈
Rd , C, τ, β ∈ (0, ∞) which satisfy for all t ∈ [τ, ∞) that
∥Θt − ϑ∥2 ≤ (1 + C(t − τ))^{−β}   and   0 ≤ L(Θt) − L(ϑ) ≤ (1 + C(t − τ))^{−1}.   (9.184)
(9.185)
Observe that (9.184), (9.185), and the fact that [0, ∞) ∋ t 7→ L(Θt ) ∈ R is non-increasing
prove for all t ∈ [0, τ ] that
∥Θt − ϑ∥2 ≤ sups∈[0,τ ] ∥Θs − ϑ∥2 ≤ C(1 + τ )−β ≤ C(1 + t)−β (9.186)
and
0 ≤ L(Θt ) − L(ϑ) ≤ L(Θ0 ) − L(ϑ) ≤ C(1 + τ )−1 ≤ C(1 + t)−1 . (9.187)
Furthermore, note that (9.184) and (9.185) imply for all t ∈ [τ, ∞) that
∥Θt − ϑ∥2 ≤ (1 + C(t − τ))^{−β} = C (C^{1/β} + C^{1/β} C(t − τ))^{−β} ≤ C (C^{1/β} + t − τ)^{−β} ≤ C(1 + t)^{−β}.   (9.188)
Moreover, observe that (9.184) and (9.185) demonstrate for all t ∈ [τ, ∞) that
0 ≤ L(Θt) − L(ϑ) ≤ C (C + CC(t − τ))^{−1} ≤ C (C − τ + t)^{−1} ≤ C(1 + t)^{−1}.   (9.189)
Corollary 9.11.6. Let d ∈ N, Θ ∈ C([0, ∞), Rd ), L ∈ C 1 (Rd , R), assume that for all
ϑ ∈ Rd there exist ε, C ∈ (0, ∞), α ∈ (0, 1) such that for all θ ∈ Rd with ∥θ − ϑ∥2 < ε it
holds that
|L(θ) − L(ϑ)|α ≤ C∥(∇L)(θ)∥2 , (9.190)
assume for all t ∈ [0, ∞) that
Θt = Θ0 − ∫_0^t (∇L)(Θs) ds,   (9.191)
and assume lim inf t→∞ ∥Θt ∥2 < ∞ (cf. Definition 3.3.4). Then there exist ϑ ∈ Rd , C, β ∈
(0, ∞) which satisfy for all t ∈ [0, ∞) that
Proof of Corollary 9.11.6. Note that Lemma 9.11.1 demonstrates that for all t ∈ [0, ∞) it
holds that
L(Θt) = L(Θ0) − ∫_0^t ∥(∇L)(Θs)∥2² ds.   (9.193)
Corollary 9.11.5 therefore establishes that there exist ϑ ∈ Rd , C, β ∈ (0, ∞) which satisfy
for all t ∈ [0, ∞) that
∥Θt − ϑ∥2 ≤ C(1 + t)−β and 0 ≤ L(Θt ) − L(ϑ) ≤ C(1 + t)−1 . (9.194)
(∇L)(ϑ) = 0. (9.199)
Combining this with (9.194) establishes (9.192). The proof of Corollary 9.11.6 is thus
complete.
and assume lim inf t→∞ ∥Θt ∥2 < ∞ (cf. Definitions 3.3.4 and 9.7.1). Then there exist ϑ ∈ Rd ,
C, β ∈ (0, ∞) which satisfy for all t ∈ [0, ∞) that
Proof of Corollary 9.11.7. Note that Theorem 9.9.1 shows that for all ϑ ∈ Rd there exist
ε, C ∈ (0, ∞), α ∈ (0, 1) such that for all θ ∈ Rd with ∥θ − ϑ∥2 < ε it holds that
|L(θ) − L(ϑ)|α ≤ C∥(∇L)(θ)∥2 . (9.202)
Corollary 9.11.6 therefore establishes (9.201). The proof of Corollary 9.11.7 is thus complete.
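The algebraic convergence rates guaranteed by Corollary 9.11.7 can be observed numerically. The following Python sketch — an illustration only, not part of the book's code — approximates the GF ODE for the analytic function L(θ) = θ^4 (whose only critical point is ϑ = 0) by an explicit Euler discretization with a small step size and prints L(Θt) and |Θt|, both of which decay to zero at an algebraic rate in t, consistent with bounds of the form C(1 + t)^{−1} and C(1 + t)^{−β}.

import numpy as np

# Explicit Euler approximation of the gradient flow Theta' = -(nabla L)(Theta)
# for the analytic function L(theta) = theta**4 with initial value Theta_0 = 1.
L_fn = lambda theta: theta ** 4
grad = lambda theta: 4.0 * theta ** 3

theta, dt = 1.0, 1e-2
for step in range(1, 100_001):
    theta -= dt * grad(theta)
    if step in (100, 1_000, 10_000, 100_000):
        t = step * dt
        print(f"t = {t:8.1f}   L(Theta_t) = {L_fn(theta):.3e}   |Theta_t| = {abs(theta):.3e}")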
Exercise 9.11.1. Prove or disprove the following statement: For all d ∈ N, L ∈ (0, ∞),
γ ∈ [0, L−1 ], all open and convex sets U ⊆ Rd , and all L ∈ C 1 (U, R), x ∈ U with
x − γ(∇L)(x) ∈ U and ∀ v, w ∈ U : ∥(∇L)(v) − (∇L)(w)∥2 ≤ L∥v − w∥2 it holds that
L(x − γ(∇L)(x)) ≤ L(x) − (γ/2)∥(∇L)(x)∥2²   (9.203)
(cf. Definition 3.3.4).
Proof of Corollary 9.12.2. Observe that Lemma 9.12.1 ensures that for all x ∈ U with
x − γ(∇L)(x) ∈ U it holds that
L(x − γ(∇L)(x)) ≤ L(x) + ⟨(∇L)(x), −γ(∇L)(x)⟩ + (L/2)∥γ(∇L)(x)∥2² = L(x) − γ∥(∇L)(x)∥2² + (Lγ²/2)∥(∇L)(x)∥2².   (9.209)
This establishes (9.208). The proof of Corollary 9.12.2 is thus complete.
Corollary 9.12.3. Let d ∈ N, L ∈ (0, ∞), γ ∈ [0, L−1 ], let U ⊆ Rd be open and convex, let
L ∈ C 1 (U, R), and assume for all x, y ∈ U that
∥(∇L)(x) − (∇L)(y)∥2 ≤ L∥x − y∥2 (9.210)
(cf. Definition 3.3.4). Then it holds for all x ∈ U with x − γ(∇L)(x) ∈ U that
L(x − γ(∇L)(x)) ≤ L(x) − (γ/2)∥(∇L)(x)∥2² ≤ L(x).   (9.211)
Proof of Corollary 9.12.3. Note that Corollary 9.12.2, the fact that γ ≥ 0, and the fact that Lγ/2 − 1 ≤ −1/2 establish (9.211). The proof of Corollary 9.12.3 is thus complete.
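The descent inequality (9.211) can easily be checked numerically for a concrete objective function. The following Python sketch — an illustration, not code from the book — verifies it for the function L(x) = 2x + sin(x) considered in Exercise 9.12.1 below, whose derivative L′(x) = 2 + cos(x) is Lipschitz continuous with Lipschitz constant L = 1, for several step sizes γ ∈ (0, L^{−1}] and randomly chosen points x.

import numpy as np

# Numerical check of the descent inequality
#     L(x - gamma L'(x)) <= L(x) - (gamma/2) |L'(x)|^2
# for L(x) = 2x + sin(x), whose derivative 2 + cos(x) has Lipschitz constant 1.
f = lambda x: 2.0 * x + np.sin(x)
df = lambda x: 2.0 + np.cos(x)

rng = np.random.default_rng(1)
worst = -np.inf
for gamma in (0.1, 0.5, 1.0):
    for x in rng.uniform(-10.0, 10.0, size=1000):
        violation = f(x - gamma * df(x)) - (f(x) - 0.5 * gamma * df(x) ** 2)
        worst = max(worst, violation)
print("largest violation of (9.211):", worst)   # nonpositive up to rounding errors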
Exercise 9.12.1. Let (γn)n∈N ⊆ (0, ∞) satisfy for all n ∈ N that γn = 1/(n + 1) and let L : R → R satisfy for all x ∈ R that
L(x) = 2x + sin(x).   (9.212)
Prove or disprove the following statement: For every Θ = (Θk)k∈N0 : N0 → R with ∀ k ∈ N : Θk = Θk−1 − γk(∇L)(Θk−1) and every n ∈ N it holds that
L(Θn) ≤ L(Θn−1) − (1/(n + 1)) (1 − 3/(2(n + 1))) |2 + cos(Θn−1)|².   (9.213)
Prove or disprove the following statement: For every Θ = (Θn)n∈N0 : N0 → R with ∀ n ∈ N : Θn = Θn−1 − (1/(n + 1)) (∇L)(Θn−1) and every k ∈ N it holds that
L(Θk) < L(Θk−1).   (9.215)
|L(θ) − L(ϑ)|^α ≤ C∥G(θ)∥2,   c = |L(Θ0) − L(ϑ)|,   2C(1 − α)^{−1}c^{1−α} + ∥Θ0 − ϑ∥2 < ε/(γL + 1),   (9.217)
and inf n∈{m∈N0 : ∀ k∈N0 ∩[0,m] : Θk ∈B} L(Θn ) ≥ L(ϑ) (cf. Definition 3.3.4). Then there exists
ψ ∈ L −1 ({L(ϑ)}) ∩ G−1 ({0}) ∩ B such that
(ii) it holds for all n ∈ N0 that 0 ≤ L(Θn ) − L(ψ) ≤ 2C2 c2 (1{0} (c) + c2α nγ + 2C2 c)−1 ,
and
T = inf({n ∈ N0 : Θn ∉ B} ∪ {∞}),   (9.219)
let L : N0 → R satisfy for all n ∈ N0 that L(n) = L(Θn ) − L(ϑ), and let τ ∈ N0 ∪ {∞}
satisfy
τ = inf({n ∈ N0 ∩ [0, T ) : L(n) = 0} ∪ {T }). (9.220)
Observe that the assumption that G(ϑ) = 0 implies for all θ ∈ B that
This, the fact that ∥Θ0 − ϑ∥2 < ε, and the fact that
∥Θ1 − ϑ∥2 ≤ ∥Θ1 − Θ0 ∥2 + ∥Θ0 − ϑ∥2 = γ∥G(Θ0 )∥2 + ∥Θ0 − ϑ∥2 ≤ (γL + 1)∥Θ0 − ϑ∥2 < ε
(9.222)
L(n) ≥ 0. (9.224)
Furthermore, observe that the fact that B ⊆ Rd is open and convex, Corollary 9.12.3, and
(9.217) demonstrate for all n ∈ N0 ∩ [0, T − 1) that
L(n + 1) − L(n) = L(Θn+1) − L(Θn) ≤ −(γ/2)∥G(Θn)∥2² = −(1/2)∥G(Θn)∥2 ∥γG(Θn)∥2 = −(1/2)∥G(Θn)∥2 ∥Θn+1 − Θn∥2 ≤ −(2C)^{−1} |L(Θn) − L(ϑ)|^α ∥Θn+1 − Θn∥2 = −(2C)^{−1} [L(n)]^α ∥Θn+1 − Θn∥2 ≤ 0.   (9.225)
L(n) = 0. (9.227)
The fact that γ > 0 therefore establishes for all n ∈ N0 ∩ [τ, T − 1) that G(Θn ) = 0. Hence,
we obtain for all n ∈ N0 ∩ [τ, T ) that
Θn = Θτ . (9.229)
Moreover, note that (9.220) and (9.225) ensure for all n ∈ N0 ∩ [0, τ ) ∩ [0, T − 1) that
Combining this with the triangle inequality proves for all m, n ∈ N0 ∩ [0, T ) with m ≤ n
that
∥Θn − Θm∥2 ≤ ∑_{k=m}^{n−1} ∥Θk+1 − Θk∥2 ≤ (2C/(1 − α)) ∑_{k=m}^{n−1} ([L(k)]^{1−α} − [L(k + 1)]^{1−α}) = 2C([L(m)]^{1−α} − [L(n)]^{1−α})/(1 − α) ≤ 2C[L(m)]^{1−α}/(1 − α).   (9.232)
This and (9.217) demonstrate for all n ∈ N0 ∩ [0, T ) that
−L(n) ≤ L(n + 1) − L(n) ≤ −(γ/2)∥G(Θn)∥2² ≤ −(γ/(2C²))[L(n)]^{2α} ≤ −(γ/(2C²c^{2−2α}))[L(n)]².   (9.240)
fact that for all n ∈ N0 ∩ [τ, ∞) it holds that L(n) = 0 shows that for all n ∈ N0 we have
that
L(n) ≤ 2C²c² (1_{{0}}(c) + c^{2α}nγ + 2C²c)^{−1}.   (9.244)
This, (9.237), and the assumption that L is continuous prove that
L(ψ) = limn→∞ L(Θn ) = L(ϑ). (9.245)
Combining this with (9.244) implies for all n ∈ N0 that
0 ≤ L(Θn) − L(ψ) ≤ 2C²c² (1_{{0}}(c) + c^{2α}nγ + 2C²c)^{−1}.   (9.246)
Furthermore, observe that the fact that B ∋ θ 7→ G(θ) ∈ Rd is continuous, the fact that
ψ ∈ B, and (9.237) demonstrate that
G(ψ) = limn→∞ G(Θn ) = limn→∞ (γ −1 (Θn − Θn+1 )) = 0. (9.247)
Next note that (9.244) and (9.232) ensure for all n ∈ N0 that
∥Θn − ψ∥2 = limm→∞ ∥Θn − Θm∥2 ≤ ∑_{k=n}^{∞} ∥Θk+1 − Θk∥2 ≤ 2C[L(n)]^{1−α}/(1 − α) ≤ 2^{2−α} C^{3−2α} c^{2−2α} [(1 − α)(1_{{0}}(c) + c^{2α}nγ + 2C²c)^{1−α}]^{−1}.   (9.248)
Combining this with (9.245), (9.235), (9.247), and (9.246) establishes items (i), (ii), and
(iii). The proof of Proposition 9.12.4 is thus complete.
Corollary 9.12.5. Let d ∈ N, c ∈ [0, 1], ε, L, C ∈ (0, ∞), α ∈ (0, 1), γ ∈ (0, L−1 ], ϑ ∈ Rd ,
let B ⊆ Rd satisfy B = {θ ∈ Rd : ∥θ − ϑ∥2 < ε}, let L ∈ C(Rd , R) satisfy L|B ∈ C 1 (B, R),
let G : Rd → Rd satisfy for all θ ∈ B that G(θ) = (∇L)(θ), assume for all θ1 , θ2 ∈ B that
∥G(θ1 ) − G(θ2 )∥2 ≤ L∥θ1 − θ2 ∥2 , (9.249)
let Θ = (Θn )n∈N0 : N0 → Rd satisfy for all n ∈ N0 that
Proof of Corollary 9.12.5. Observe that the fact that L(ϑ) = inf θ∈B L(θ) ensures that
G(ϑ) = (∇L)(ϑ) = 0 and inf n∈{m∈N0 : ∀ k∈N0 ∩[0,m] : Θk ∈B} L(Θn ) ≥ L(ϑ). Combining this
with Proposition 9.12.4 ensures that there exists ψ ∈ L −1 ({L(ϑ)}) ∩ G−1 ({0}) such that
(I) it holds for all n ∈ N0 that Θn ∈ B,
(II) it holds for all n ∈ N0 that 0 ≤ L(Θn) − L(ψ) ≤ 2C²c² (1_{{0}}(c) + c^{2α}nγ + 2C²c)^{−1}, and
Note that item (II) and the assumption that c ≤ 1 establish for all n ∈ N0 that
Prove or disprove the following statement: For every continuous Θ = (Θt)t∈[0,∞) : [0, ∞) → R with supt∈[0,∞) |Θt| < ∞ and ∀ t ∈ [0, ∞) : Θt = Θ0 − ∫_0^t (∇L)(Θs) ds there exists ϑ ∈ R such that
Prove or disprove the following statement: For every Θ ∈ C([0, ∞), R) with supt∈[0,∞) |Θt| < ∞ and ∀ t ∈ [0, ∞) : Θt = Θ0 − ∫_0^t (∇L)(Θs) ds there exist ϑ ∈ R, C, β ∈ (0, ∞) such that for all t ∈ [0, ∞) it holds that
Proof of Proposition 9.13.1. Observe that Faà di Bruno’s formula (cf., for instance, Fraenkel
[134]) establishes that f ◦ g is analytic (cf. also, for example, Krantz & Parks [254, Proposi-
tion 2.8]). The proof of Proposition 9.13.1 is thus complete.
Lemma 9.13.2. Let d1 , d2 , l1 , l2 ∈ N, for every k ∈ {1, 2} let Fk : Rdk → Rlk be analytic,
and let f : Rd1 × Rd2 → Rl1 × Rl2 satisfy for all x1 ∈ Rd1 , x2 ∈ Rd2 that
Proof of Lemma 9.13.2. Throughout this proof, let A1 : Rl1 → Rl1 × Rl2 and A2 : Rl2 →
Rl1 × Rl2 satisfy for all x1 ∈ Rl1 , x2 ∈ Rl2 that
and for every k ∈ {1, 2} let Bk : Rl1 × Rl2 → Rlk satisfy for all x1 ∈ Rl1 , x2 ∈ Rl2 that
Bk (x1 , x2 ) = xk . (9.263)
f = A1 ◦ F1 ◦ B1 + A2 ◦ F2 ◦ B2 . (9.264)
This, the fact that A1, A2, F1, F2, B1, and B2 are analytic, and Proposition 9.13.1 establish that f is analytic. The proof of Lemma 9.13.2 is thus complete.
Lemma 9.13.3. Let d1 , d2 , l0 , l1 , l2 ∈ N, for every k ∈ {1, 2} let Fk : Rdk × Rlk−1 → Rlk be
analytic, and let f : Rd1 × Rd2 × Rl0 → Rl2 satisfy for all θ1 ∈ Rd1 , θ2 ∈ Rd2 , x ∈ Rl0 that
f(θ1, θ2, x) = (F2(θ2, ·) ◦ F1(θ1, ·))(x)   (9.265)
Proof of Lemma 9.13.3. Throughout this proof, let A : Rd1 × Rd2 × Rl0 → Rd2 × Rd1 +l0 and
B : Rd2 × Rd1 +l0 → Rd2 × Rl1 satisfy for all θ1 ∈ Rd1 , θ2 ∈ Rd2 , x ∈ Rl0 that
A(θ1 , θ2 , x) = (θ2 , (θ1 , x)) and B(θ2 , (θ1 , x)) = (θ2 , F1 (θ1 , x)), (9.266)
f = F2 ◦ B ◦ A. (9.267)
Proof of Corollary 9.13.4. Throughout this proof, for every k ∈ {1, 2, . . . , L} let dk =
lk (lk−1 + 1) and for every k ∈ {1, 2, . . . , L} let Fk : Rdk × Rlk−1 → Rlk satisfy for all θ ∈ Rdk ,
x ∈ Rlk−1 that
Fk(θ, x) = Ψk(A^{θ,0}_{lk,lk−1}(x))   (9.269)
(cf. Definition 1.1.1). Observe that item (i) in Lemma 5.3.3 demonstrates that for all
θ1 ∈ Rd1 , θ2 ∈ Rd2 , . . ., θL ∈ RdL , x ∈ Rl0 it holds that
(θ ,θ ,...,θ ),l
NΨ11,Ψ22 ,...,ΨLL 0 (x) = (FL (θL , ·) ◦ FL−1 (θL−1 , ·) ◦ . . . ◦ F1 (θ1 , ·))(x) (9.270)
(cf. Definition 1.1.3). Note that the assumption that for all k ∈ {1, 2, . . . , L} it holds that Ψk
is analytic, the fact that for all m, n ∈ N, θ ∈ Rm(n+1) it holds that Rm(n+1) × Rn ∋ (θ, x) 7→
Aθ,0
m,n (x) ∈ R
m
is analytic, and Proposition 9.13.1 ensure that for all k ∈ {1, 2, . . . , L} it
holds that Fk is analytic. Lemma 5.3.2 and induction hence ensure that
is analytic. The assumption that L is differentiable and the chain rule therefore establish
that for all x ∈ Rl0 , y ∈ RlL it holds that
Rd ∋ θ 7→ L NM θ,l0
(9.275)
a,l ,Ma,l ,...,M a,l ,id l
(x m ), ym ∈R
1 2 L−1 R L
is analytic. This proves (9.273). The proof of Corollary 9.13.5 is thus complete.
(cf. Definitions 1.1.3, 1.2.1, 3.3.4, and 9.7.1). Then there exist ϑ ∈ Rd , c, β ∈ (0, ∞) such
that for all t ∈ (0, ∞) it holds that
Proof of Theorem 9.14.1. Note that Corollary 9.13.5 demonstrates that L is analytic.
Combining this with Corollary 9.11.7 establishes (9.278). The proof of Theorem 9.14.1 is
thus complete.
Lemma 9.14.2. Let a : R → R be the softplus activation function (cf. Definition 1.2.11).
Then a is analytic (cf. Definition 9.7.1).
Proof of Lemma 9.14.2. Throughout this proof, let f : R → (0, ∞) satisfy for all x ∈ R that
f (x) = 1 + exp(x). Observe that the fact that R ∋ x 7→ exp(x) ∈ R is analytic implies that
f is analytic (cf. Definition 9.7.1). Combining this and the fact that (0, ∞) ∋ x 7→ ln(x) ∈ R
is analytic with Proposition 9.13.1 and (1.47) demonstrates that a is analytic. The proof of
Lemma 9.14.2 is thus complete.
Lemma 9.14.3. Let d ∈ N and let L be the mean squared error loss function based
on Rd ∋ x 7→ ∥x∥2 ∈ [0, ∞) (cf. Definitions 3.3.4 and 5.4.2). Then L is analytic (cf.
Definition 9.7.1).
Proof of Lemma 9.14.3. Note that Lemma 5.4.3 ensures that L is analytic (cf. Defini-
tion 9.7.1). The proof of Lemma 9.14.3 is thus complete.
Corollary 9.14.4 (Empirical risk minimization for ANNs with softplus activation). Let
L, d ∈ N\{1}, M, l0 , l1 , . . . , lL ∈ N, x1 , x2 , . . . , xM ∈ Rl0 , y1 , y2 , . . . , yM ∈ RlL satisfy
d = ∑_{k=1}^{L} lk(lk−1 + 1), let a be the softplus activation function, let L : Rd → R satisfy for all θ ∈ Rd that
"M #
1 X 2
L(θ) = θ,l0
ym − NMa,l ,Ma,l ,...,Ma,l ,id l (xm ) 2 , (9.279)
M m=1 1 2 L−1 R L
(cf. Definitions 1.1.3, 1.2.1, 1.2.11, and 3.3.4). Then there exist ϑ ∈ Rd , c, β ∈ (0, ∞) such
that for all t ∈ (0, ∞) it holds that
Proof of Corollary 9.14.4. Observe that Lemma 9.14.2, Lemma 9.14.3, and Theorem 9.14.1
establish (9.281). The proof of Corollary 9.14.4 is thus complete.
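The setting of Corollary 9.14.4 — a fully-connected feedforward ANN with softplus activation in the hidden layers, an affine output layer, and the mean squared error empirical risk (9.279) — can be set up in PyTorch as in the following sketch. This is only an illustration of the setting and not code from the book; in particular, plain gradient descent steps are used here merely as a crude stand-in for the GF dynamics analyzed above, and the architecture and data are arbitrary examples.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Architecture (l_0, l_1, l_2, l_3) = (2, 8, 8, 1): softplus activation in the
# hidden layers and an affine (identity activation) output layer
net = nn.Sequential(
    nn.Linear(2, 8), nn.Softplus(),
    nn.Linear(8, 8), nn.Softplus(),
    nn.Linear(8, 1),
)

# M = 64 example input-output data pairs (x_m, y_m)
x = torch.rand(64, 2)
y = torch.sin(x.sum(dim=1, keepdim=True))

risk_fn = nn.MSELoss()                      # mean squared error empirical risk
optimizer = torch.optim.SGD(net.parameters(), lr=0.05)

for step in range(2001):
    optimizer.zero_grad()
    risk = risk_fn(net(x), y)
    risk.backward()
    optimizer.step()
    if step % 500 == 0:
        print(step, risk.item())            # the risk decreases along the iterates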
Remark 9.14.5 (Convergence to a good suboptimal critical point whose risk value is close
to the optimal risk value). Corollary 9.14.4 establishes convergence of a non-divergent GF
trajectory in the training of fully-connected feedforward ANNs to a critical point ϑ ∈ Rd of
the objective function. In several scenarios in the training of ANNs such limiting critical
points seem to be with high probability not global minimum points but suboptimal critical
points at which the value of the objective function is, however, not far away from the
minimal value of the objective function (cf. Ibragimov et al. [216] and also [144, 409]). In
view of this, there has been an increased interest in landscape analyses associated to the
objective function to gather more information on critical points of the objective function
(cf., for instance, [12, 72, 79, 80, 92, 113, 141, 215, 216, 239, 312, 357, 358, 365, 381–383,
400, 435, 436] and the references therein).
In most cases it remains an open problem to rigorously prove that the value of the objective function at the limiting critical point is indeed with high probability close to the minimal/infimal value1 of the objective function and thereby to establish a full convergence analysis. However, in the so-called overparametrized regime, where there are many more ANN parameters than input-output training data pairs, several convergence
analyses for the training of ANNs have been achieved (cf., for instance, [74, 75, 114, 218]
and the references therein).
Remark 9.14.6 (Almost surely excluding strict saddle points). We also note that in several
situations it has been shown that the limiting critical point of the considered GF trajectory
1
It is of interest to note that it seems to strongly depend on the activation function, the architecture of
the ANN, and the underlying probability distribution of the data of the considered learning problem whether
the infimal value of the objective function is also a minimal value of the objective function or whether there
exists no minimal value of the objective function (cf., for example, [99, 142] and Remark 9.14.7 below).
and we call (DL)(x) the set limiting Fréchet subgradients of f at x (cf. Definitions 1.4.7
and 3.3.4).
(i) it holds for all y ∈ {z ∈ Rd : ∥z − x∥2 < ε} that A(y) = ⟨a, y − x⟩ + L(x) and
Proof of Lemma 9.15.2. Note that (9.285) shows for all y ∈ {z ∈ Rd : ∥z − x∥2 < ε} that
This establishes item (i). Observe that (9.285) and item (i) ensure for all h ∈ {z ∈ Rd : 0 <
∥z∥2 < ε} that
This and (9.283) establish item (ii). The proof of Lemma 9.15.2 is thus complete.
(DL)(x) = {y ∈ Rd : ∃ z = (z1, z2) : N → Rd × Rd : [(∀ k ∈ N : z2(k) ∈ (DL)(z1(k))) ∧ lim supk→∞(∥z1(k) − x∥2 + ∥z2(k) − y∥2) = 0]},   (9.288)
Proof of Lemma 9.15.3. Throughout this proof, for every x, y ∈ Rd let Z x,y = (Z1x,y ,
Z2x,y ) : N → Rd × Rd satisfy for all k ∈ N that
Note that (9.284) proves that for all x ∈ Rd , y ∈ (DL)(x), ε ∈ (0, ∞) it holds that
y ∈ ⋃v∈{w∈Rd: ∥x−w∥2<ε} (DL)(v).   (9.290)
This implies that for all x ∈ Rd, y ∈ (DL)(x) and all ε, δ ∈ (0, ∞) there exists Y ∈ ⋃v∈{w∈Rd: ∥x−w∥2<ε} (DL)(v) such that
∥y − Y∥2 < δ.   (9.291)
This ensures that for all x, y ∈ Rd , ε, δ ∈ (0, ∞) and all z = (z1 , z2 ) : N → Rd × Rd with
lim supk→∞ (∥z1 (k) − x∥2 + ∥z2 (k) − y∥2 ) = 0 and ∀ k ∈ N : z2 (k) ∈ (DL)(z1 (k)) there
exist v ∈ {w ∈ Rd : ∥x − w∥2 < ε}, Y ∈ (DL)(v) such that ∥y − Y ∥2 < δ. Therefore,
we obtain that for all x, y ∈ Rd , ε, δ ∈ (0, ∞) and all z = (z1 , z2 ) : N → Rd × Rd with
Sk→∞ (∥z1 (k) − x∥2 + ∥z2 (k) − y∥2 ) = 0 and ∀ k ∈ N : z2 (k) ∈ (DL)(z1 (k)) there exists
lim sup
Y ∈ v∈{w∈Rd : ∥x−w∥2 <ε} (DL)(v) such that
∥y − Y ∥2 < δ. (9.296)
This establishes that for all x, y ∈ Rd , ε ∈ (0, ∞) and all z = (z1 , z2 ) : N → Rd × Rd with
lim supk→∞ (∥z1 (k) − x∥2 + ∥z2 (k) − y∥2 ) = 0 and ∀ k ∈ N : z2 (k) ∈ (DL)(z1 (k)) it holds
that
y ∈ ⋃v∈{w∈Rd: ∥x−w∥2<ε} (DL)(v).   (9.297)
This and (9.284) show that for all x, y ∈ Rd and all z = (z1 , z2 ) : N → Rd × Rd with
lim supk→∞ (∥z1 (k) − x∥2 + ∥z2 (k) − y∥2 ) = 0 and ∀ k ∈ N : z2 (k) ∈ (DL)(z1 (k)) it holds
that
y ∈ (DL)(x). (9.298)
Combining this with (9.293) proves item (i). Note that (9.289) implies that for all x ∈ Rd ,
y ∈ (DL)(x) it holds that
∀ k ∈ N : Z2^{x,y}(k) ∈ (DL)(Z1^{x,y}(k))   ∧   lim supk→∞(∥Z1^{x,y}(k) − x∥2 + ∥Z2^{x,y}(k) − y∥2) = 0   (9.299)
(cf. Definitions 3.3.4 and 9.15.1). Combining this with item (i) establishes item (ii).
Observe that the fact that for all a ∈ R it holds that −a ≤ |a| demonstrates that for all
x ∈ {y ∈ Rd : L is differentiable at y} it holds that
lim inf Rd \{0}∋h→0 L(x+h)−L(x)−⟨(∇L)(x),h⟩
∥h∥2
≥ − lim inf d
R \{0}∋h→0
L(x+h)−L(x)−⟨(∇L)(x),h⟩
∥h∥2
h i
|L(x+h)−L(x)−⟨(∇L)(x),h⟩|
≥ − lim supRd \{0}∋h→0 ∥h∥2
=0 (9.300)
Combining this with (9.301) proves item (iii). Observe that items (ii) and (iii) ensure that
for all open U ⊆ Rd and all x ∈ U with L|U ∈ C 1 (U, R) it holds that
In addition, note that for all open U ⊆ Rd , all x ∈ U , y ∈ Rd and all z = (z1 , z2 ) : N →
Rd ×Rd with lim supk→∞ (∥z1 (k)−x∥2 +∥z2 (k)−y∥2 ) = 0 and ∀ k ∈ N : z2 (k) ∈ (DL)(z1 (k))
there exists K ∈ N such that for all k ∈ N ∩ [K, ∞) it holds that
z1 (k) ∈ U. (9.305)
Combining this with item (iii) shows that for all open U ⊆ Rd , all x ∈ U , y ∈ Rd and all
z = (z1 , z2 ) : N → Rd ×Rd with L|U ∈ C 1 (U, R), lim supk→∞ (∥z1 (k)−x∥2 +∥z2 (k)−y∥2 ) = 0
and ∀ k ∈ N : z2 (k) ∈ (DL)(z1 (k)) there exists K ∈ N such that ∀ k ∈ N∩[K, ∞) : z1 (k) ∈ U
and
lim supN∩[K,∞)∋k→∞ (∥z1 (k) − x∥2 + ∥(∇L)(z1 (k)) − y∥2 )
(9.306)
= lim supk→∞ (∥z1 (k) − x∥2 + ∥z2 (k) − y∥2 ) = 0.
This and item (i) imply that for all open U ⊆ Rd and all x ∈ U , y ∈ (DL)(x) with
L|U ∈ C 1 (U, R) it holds that
y = (∇L)(x). (9.307)
Combining this with (9.304) establishes item (iv). Observe that (9.284) demonstrates that
for all x ∈ Rd it holds that
(9.308)
Rd \((DL)(x)) = ε∈(0,∞) Rd \ Sy∈{z∈Rd : ∥x−z∥2 <ε} (DL)(y)
S
Therefore, we obtain for all x ∈ Rd that Rd \((DL)(x)) is open. This proves item (v). The
proof of Lemma 9.15.3 is thus complete.
Lemma 9.15.4 (Fréchet subgradients for maxima). Let c ∈ R and let L : R → R satisfy
for all x ∈ R that L(x) = max{x, c}. Then
(i) it holds for all x ∈ (−∞, c) that (DL)(x) = {0},
Furthermore, note that the assumption that for all x ∈ R it holds that L(x) = max{x, c}
ensures that for all a ∈ (1, ∞), h ∈ (0, ∞) it holds that
[L(c + h) − L(c) − ah]/|h| = [(c + h) − c − ah]/h = 1 − a < 0.   (9.310)
Moreover, observe that the assumption that for all x ∈ R it holds that L(x) = max{x, c}
shows that for all a, h ∈ (−∞, 0), it holds that
[L(c + h) − L(c) − ah]/|h| = [c − c − ah]/(−h) = a < 0.   (9.311)
Combining this with (9.310) demonstrates that
(DL)(c) ⊆ [0, 1]. (9.312)
This and (9.309) establish item (iii). The proof of Lemma 9.15.4 is thus complete.
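For example (this illustration is not contained in the original text), for c = 0 the function L coincides with the ReLU activation function and Lemma 9.15.4 yields (DL)(0) = [0, 1]: indeed, for every v ∈ [0, 1] and every h ∈ R\{0} it holds that

[L(h) − L(0) − vh]/|h| = [max{h, 0} − vh]/|h| = (1 − v) 1_{(0,∞)}(h) + v 1_{(−∞,0)}(h) ≥ 0,

so that every v ∈ [0, 1] is a Fréchet subgradient of L at 0, while (9.310) and (9.311) above exclude all v ∈ R\[0, 1].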
Lemma 9.15.5 (Limits of limiting Fréchet subgradients). Let d ∈ N, L ∈ C(Rd , R), let
(xk )k∈N0 ⊆ Rd and (yk )k∈N0 ⊆ Rd satisfy
lim supk→∞ (∥xk − x0 ∥2 + ∥yk − y0 ∥2 ) = 0, (9.313)
and assume for all k ∈ N that yk ∈ (DL)(xk ) (cf. Definitions 3.3.4 and 9.15.1). Then
y0 ∈ (DL)(x0 ).
Proof of Lemma 9.15.5. Note that item (i) in Lemma 9.15.3 and the fact that for all k ∈ N it holds that yk ∈ (DL)(xk) imply that for every k ∈ N there exists z^{(k)} = (z1^{(k)}, z2^{(k)}) : N → Rd × Rd which satisfies for all v ∈ N that
z2^{(k)}(v) ∈ (DL)(z1^{(k)}(v))   and   lim supw→∞(∥z1^{(k)}(w) − xk∥2 + ∥z2^{(k)}(w) − yk∥2) = 0.   (9.314)
Observe that (9.314) demonstrates that there exists v = (vk )k∈N : N → N which satisfies for
all k ∈ N that
(k) (k)
∥z1 (vk ) − xk ∥2 + ∥z2 (vk ) − yk ∥2 ≤ 2−k . (9.315)
Next let Z = (Z1 , Z2 ) : N → Rd × Rd satisfy for all j ∈ {1, 2}, k ∈ N that
(k)
Zj (k) = zj (vk ). (9.316)
Note that (9.314), (9.315), (9.316), and the assumption that lim supk→∞ (∥xk − x0 ∥2 + ∥yk −
y0 ∥2 ) = 0 prove that
lim supk→∞ ∥Z1 (k) − x0 ∥2 + ∥Z2 (k) − y0 ∥2
≤ lim supk→∞ ∥Z1 (k) − xk ∥2 + ∥Z2 (k) − yk ∥2
+ lim supk→∞ ∥xk − x0 ∥2 + ∥yk − y0 ∥2
(9.317)
= lim supk→∞ ∥Z1 (k) − xk ∥2 + ∥Z2 (k) − yk ∥2
(k) (k)
= lim supk→∞ ∥z1 (vk ) − xk ∥2 + ∥z2 (vk ) − yk ∥2
≤ lim supk→∞ 2−k = 0.
Furthermore, observe that (9.314) and (9.316) establish that for all k ∈ N it holds that
Z2 (k) ∈ (DL)(Z1 (k)). Combining this and (9.317) with item (i) in Lemma 9.15.3 proves
that y0 ∈ (DL)(x0 ). The proof of Lemma 9.15.5 is thus complete.
Exercise 9.15.1. Prove or disprove the following statement: It holds for all d ∈ N, L ∈
C 1 (Rd , R), x ∈ Rd that (DL)(x) = (DL)(x) (cf. Definition 9.15.1).
Exercise 9.15.2. Prove or disprove the following statement: There exists d ∈ N such that
for all L ∈ C(Rd , R), x ∈ Rd it holds that (DL)(x) ⊆ (DL)(x) (cf. Definition 9.15.1).
Exercise 9.15.3. Prove or disprove the following statement: It holds for all d ∈ N, L ∈
C(Rd , R), x ∈ Rd that (DL)(x) is convex (cf. Definition 9.15.1).
Exercise 9.15.4. Prove or disprove the following statement: It holds for all d ∈ N, L ∈
C(Rd , R), x ∈ Rn that (DL)(x) is convex (cf. Definition 9.15.1).
Exercise 9.15.5. For every $\alpha\in(0,\infty)$, $s\in\{-1,1\}$ let $L_{\alpha,s}\colon\mathbb{R}\to\mathbb{R}$ satisfy for all $x\in\mathbb{R}$ that
\[
L_{\alpha,s}(x)=\begin{cases}x&\colon x>0\\ s|x|^{\alpha}&\colon x\le0.\end{cases}
\tag{9.318}
\]
For every $\alpha\in(0,\infty)$, $s\in\{-1,1\}$, $x\in\mathbb{R}$ specify $(\mathcal{D}L_{\alpha,s})(x)$ and $(\mathbb{D}L_{\alpha,s})(x)$ explicitly and prove that your results are correct (cf. Definition 9.15.1)!
\[
\mathcal{S}_L(\theta)=\inf\Bigl(\bigl\{r\in\mathbb{R}\colon\bigl(\exists\,h\in(\mathbb{D}L)(\theta)\colon r=\|h\|_2\bigr)\bigr\}\cup\{\infty\}\Bigr)
\tag{9.319}
\]
and we call $\mathcal{S}_L$ the non-smooth slope of $L$ (cf. Definitions 3.3.4 and 9.15.1).
Remark 9.17.3 (Examples and convergence results for generalized KL functions). In Theo-
rem 9.9.1 and Corollary 9.13.5 above we have seen that in the case of an analytic activation
function we have that the associated empirical risk function is also analytic and therefore
a standard KL function. In deep learning algorithms often deep ANNs with non-analytic
activation functions such as the ReLU activation (cf. Section 1.2.3) and the leaky ReLU
activation (cf. Section 1.2.11) are used. In the case of such non-differentiable activation
functions, the associated risk function is typically not a standard KL function. However,
under suitable assumptions on the target function and the underlying probability measure of
the input data of the considered learning problem, using Bolte et al. [44, Theorem 3.1] one
can verify in the case of such non-differentiable activation functions that the risk function
is a generalized KL function in the sense of Definition 9.17.2 above; cf., for instance, [126,
224]. Similar as for standard KL functions (cf., for example, Dereich & Kassing [100] and
Sections 9.11 and 9.12) one can then also develop a convergence theory for gradient based
optimization methods for generalized KL function (cf., for instance, Bolte et al. [44, Section
4] and Corollary 9.11.5).
Remark 9.17.4 (Further convergence analyses). We refer, for example, to [2, 7, 8, 44, 100,
391] and the references therein for convergence analyses under KL-type conditions for
gradient based optimization methods in the literature. Beyond the KL approach reviewed
in this chapter there are also several other approaches in the literature with which one
can conclude convergence of gradient based optimization methods to suitable generalized
critical points; cf., for instance, [45, 65, 93] and the references therein.
Chapter 10

ANNs with batch normalization

In data-driven learning problems, popular methods that aim to accelerate ANN training procedures are batch normalization (BN) methods. In this chapter we rigorously review such methods in detail. In the literature, BN methods were first introduced in Ioffe & Szegedy [217].
Further investigations of BN techniques and applications of such methods can, for example, be found in [4, Section 12.3.3], [131, Section 6.2.3], [164, Section 8.7.1], and [40, 364].
and we call Batchvar(x) the batch variance of the batch x (cf. Definition 10.1.2).
Lemma 10.1.4. Let $d,M\in\mathbb{N}$, $x=(x^{(m)})_{m\in\{1,2,\dots,M\}}=((x_i^{(m)})_{i\in\{1,2,\dots,d\}})_{m\in\{1,2,\dots,M\}}\in(\mathbb{R}^d)^M$, let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, and let $U\colon\Omega\to\{1,2,\dots,M\}$ be a $\{1,2,\dots,M\}$-uniformly distributed random variable. Then

(i) it holds that $\operatorname{Batchmean}(x)=\mathbb{E}\bigl[x^{(U)}\bigr]$ and

(ii) it holds for all $i\in\{1,2,\dots,d\}$ that $\operatorname{Batchvar}_i(x)=\operatorname{Var}\bigl(x_i^{(U)}\bigr)$.
Proof of Lemma 10.1.4. Note that (10.1) proves item (i). Furthermore, note that item (i)
and (10.3) establish item (ii). The proof of Lemma 10.1.4 is thus complete.
Definition 10.1.5 (BN operations for given batch mean and batch variance). Let d ∈ N,
ε ∈ (0, ∞), β = (β1 , . . . , βd ), γ = (γ1 , . . . , γd ), µ = (µ1 , . . . , µd ) ∈ Rd , V = (V1 , . . . , Vd ) ∈
[0, ∞)d . Then we denote by
batchnormβ,γ,µ,V,ε : Rd → Rd (10.4)
and we call batchnormβ,γ,µ,V,ε the BN operation with mean parameter β, standard deviation
parameter γ, and regularization parameter ε given the batch mean µ and batch variance V .
Definition 10.1.6 (Batch normalization). Let d ∈ N, ε ∈ (0, ∞), β, γ ∈ Rd . Then we
denote by
\[
\operatorname{Batchnorm}_{\beta,\gamma,\varepsilon}\colon\textstyle\bigcup_{M\in\mathbb{N}}(\mathbb{R}^d)^M\to\bigcup_{M\in\mathbb{N}}(\mathbb{R}^d)^M
\tag{10.6}
\]
the function which satisfies for all M ∈ N, x = (x(m) )m∈{1,2,...,M } ∈ (Rd )M that
(10.7)
and we call Batchnormβ,γ,ε the BN with mean parameter β, standard deviation parameter
γ, and regularization parameter ε (cf. Definitions 10.1.2, 10.1.3, and 10.1.5).
Lemma 10.1.7. Let $d,M\in\mathbb{N}$, $\beta=(\beta_1,\dots,\beta_d)$, $\gamma=(\gamma_1,\dots,\gamma_d)\in\mathbb{R}^d$. Then

(i) it holds for all $\varepsilon\in(0,\infty)$, $x=((x_i^{(m)})_{i\in\{1,2,\dots,d\}})_{m\in\{1,2,\dots,M\}}\in(\mathbb{R}^d)^M$ that
\[
\operatorname{Batchnorm}_{\beta,\gamma,\varepsilon}(x)=\Biggl(\biggl(\gamma_i\biggl[\frac{x_i^{(m)}-\operatorname{Batchmean}_i(x)}{\sqrt{\operatorname{Batchvar}_i(x)+\varepsilon}}\biggr]+\beta_i\biggr)_{\!i\in\{1,2,\dots,d\}}\Biggr)_{\!m\in\{1,2,\dots,M\}},
\tag{10.8}
\]
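A minimal NumPy sketch of the BN operation in (10.8) may help to fix ideas. The $(M,d)$ array layout and the function name below are our own choices; the snippet only illustrates the formula and is not an implementation of the book's BN ANN calculus.

```python
import numpy as np

def batchnorm(x, beta, gamma, eps):
    """Apply the BN operation of (10.8) to a batch x of shape (M, d)."""
    mean = x.mean(axis=0)                    # Batchmean(x)
    var = ((x - mean) ** 2).mean(axis=0)     # Batchvar(x) (biased batch variance)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[0., 1., 2.], [1., 0., 4.], [2., 2., 6.], [3., 1., 0.]])  # M = 4 vectors in R^3
y = batchnorm(x, beta=np.zeros(3), gamma=np.ones(3), eps=1e-2)
print(y.mean(axis=0))   # approximately 0 in every component
print(y.var(axis=0))    # approximately 1 (slightly smaller due to eps)
```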
Φ∈B (10.15)
and x = Mr,l1 ,M (W y + (B, B, B, B)) (cf. Definitions 1.2.1 and 1.2.4). Prove the following
statement: It holds that
0 4 9 0
x= 2 , 0 , 0 , 3.
(10.17)
0 4 9 0
\[
\mathcal{R}^{\mathrm{B}}_{a,\varepsilon}\colon\mathbf{B}\to\textstyle\bigcup_{k,l\in\mathbb{N}}C\bigl(\bigcup_{M\in\mathbb{N}}(\mathbb{R}^k)^M,\,\bigcup_{M\in\mathbb{N}}(\mathbb{R}^l)^M\bigr)
\tag{10.18}
\]
\[
\mathcal{R}^{\mathrm{B}}_{a,\varepsilon}(\Phi)\in C\bigl(\textstyle\bigcup_{B\in\mathbb{N}}(\mathbb{R}^{l_0})^B,\,\bigcup_{B\in\mathbb{N}}(\mathbb{R}^{l_L})^B\bigr)
\qquad\text{and}\qquad
\bigl(\mathcal{R}^{\mathrm{B}}_{a,\varepsilon}(\Phi)\bigr)(x_0)=y_L\in(\mathbb{R}^{l_L})^M
\tag{10.21}
\]
and for every Φ ∈ B we call RB a,ε (Φ) the realization function of the fully-connected feed-
forward ANN with BN Φ with activation function a and BN regularization parameter ε
(we call RBa,ε (Φ) the realization of the fully-connected feedforward ANN with BN Φ with
activation a and BN regularization parameter ε) (cf. Definitions 1.2.1, 10.1.6, and 10.2.1).
Φ∈b (10.23)
by
\[
\mathcal{R}^{\mathrm{b}}_{a,\varepsilon}\colon\mathbf{b}\to\textstyle\bigcup_{k,l\in\mathbb{N}}C(\mathbb{R}^k,\mathbb{R}^l)
\tag{10.24}
\]
and
\[
\bigl(\mathcal{R}^{\mathrm{B}}_{a,\varepsilon}(\Phi)\bigr)(x)=\Bigl(\bigl(\mathcal{R}^{\mathrm{b}}_{a,\varepsilon}(\Psi)\bigr)\bigl(x^{(m)}\bigr)\Bigr)_{m\in\{1,2,\dots,M\}}
\tag{10.31}
\]
Proof of Lemma 10.6.2. Observe that (10.19), (10.20), (10.21), (10.25), (10.26), (10.27),
(10.28), (10.29), and (10.30) establish (10.31). The proof of Lemma 10.6.2 is thus complete.
Exercise 10.6.1. Let l0 = 2, l1 = 3, l2 = 1, N = {0, 1}, γ0 = (2, 2), β0 = (0, 0), γ1 = (1, 1, 1),
β1 = (0, 1, 0), x = ((0, 1), (1, 0), (−2, 2), (2, −2)), Φ ∈ B satisfy
1 2
−1 −1 1 −1
Φ = 3 4, , , −2 , ((γk , βk ))k∈N
−1 1 −1 1 (10.32)
5 6
× 2
Rlk ×lk−1 × Rlk × ×
∈ k=1 k∈N
(Rlk )2
and let $\Psi\in\mathbf{b}$ be the fully-connected feedforward ANN with BN for given batch means and batch variances associated to $(\Phi,x,r,0.01)$. Compute $\bigl(\mathcal{R}^{\mathrm{B}}_{r,1/100}(\Phi)\bigr)(x)$ and $\bigl(\mathcal{R}^{\mathrm{b}}_{r,1/100}(\Psi)\bigr)(-1,1)$ explicitly and prove that your results are correct (cf. Definitions 1.2.4, 10.2.1, 10.3.1, 10.4.1, 10.5.1, and 10.6.1)!
Chapter 11
Furthermore, observe that the fact that [0, ∞)2 ∋ (x, y) 7→ 1[y,∞) (x) ∈ R is (B([0, ∞)) ⊗
B([0, ∞)))/B(R)-measurable, the assumption that µ is a sigma-finite measure, and Fubini’s
Combining this with (11.2) shows that for all $\varepsilon\in(0,\infty)$ it holds that
\[
\int_0^\infty x\,\mu(\mathrm{d}x)=\int_0^\infty\mu([y,\infty))\,\mathrm{d}y\ge\int_0^\infty\mu((y,\infty))\,\mathrm{d}y
\ge\int_0^\infty\mu([y+\varepsilon,\infty))\,\mathrm{d}y=\int_\varepsilon^\infty\mu([y,\infty))\,\mathrm{d}y.
\tag{11.4}
\]
Proof of Lemma 11.1.2. Throughout this proof, let $Y\colon\Omega\to[0,\infty)$ satisfy for all $\omega\in\Omega$ that $Y(\omega)=\min_{k\in\{1,2,\dots,K\}}[\delta(X_k(\omega),x)]^p$. Note that the fact that $Y$ is a random variable, the assumption that $\forall\,y\in E,\,\omega\in\Omega\colon|\mathcal{R}(x,\omega)-\mathcal{R}(y,\omega)|\le L\delta(x,y)$, and Lemma 11.1.1 demonstrate that
\[
\begin{split}
\mathbb{E}\Bigl[\min_{k\in\{1,2,\dots,K\}}|\mathcal{R}(X_k)-\mathcal{R}(x)|^p\Bigr]
&\le L^p\,\mathbb{E}\Bigl[\min_{k\in\{1,2,\dots,K\}}[\delta(X_k,x)]^p\Bigr]
=L^p\,\mathbb{E}[Y]
=L^p\int_0^\infty y\,\mathbb{P}_Y(\mathrm{d}y)
=L^p\int_0^\infty\mathbb{P}_Y((\varepsilon,\infty))\,\mathrm{d}\varepsilon\\
&=L^p\int_0^\infty\mathbb{P}(Y>\varepsilon)\,\mathrm{d}\varepsilon
=L^p\int_0^\infty\mathbb{P}\Bigl(\min_{k\in\{1,2,\dots,K\}}[\delta(X_k,x)]^p>\varepsilon\Bigr)\mathrm{d}\varepsilon.
\end{split}
\tag{11.7}
\]
Furthermore, observe that the assumption that $X_k$, $k\in\{1,2,\dots,K\}$, are i.i.d. random variables establishes that for all $\varepsilon\in(0,\infty)$ it holds that
\[
\mathbb{P}\Bigl(\min_{k\in\{1,2,\dots,K\}}[\delta(X_k,x)]^p>\varepsilon\Bigr)
=\prod_{k=1}^K\mathbb{P}\bigl([\delta(X_k,x)]^p>\varepsilon\bigr)
=\bigl[\mathbb{P}\bigl([\delta(X_1,x)]^p>\varepsilon\bigr)\bigr]^K
=\bigl[\mathbb{P}\bigl(\delta(X_1,x)>\varepsilon^{1/p}\bigr)\bigr]^K.
\tag{11.8}
\]
Combining this with (11.7) proves (11.6). The proof of Lemma 11.1.2 is thus complete.
Proof of Lemma 11.2.1. Throughout this proof, let x, y ∈ (0, ∞), let Φ : (0, ∞) × (0, 1) →
(0, ∞)2 satisfy for all u ∈ (0, ∞), v ∈ (0, 1) that
and let f : (0, ∞)2 → (0, ∞) satisfy for all s, t ∈ (0, ∞) that
Note that the integration by parts formula proves that for all $x\in(0,\infty)$ it holds that
\[
\Gamma(x+1)=\int_0^\infty t^{(x+1)-1}e^{-t}\,\mathrm{d}t
=-\int_0^\infty t^x\,\tfrac{\mathrm{d}}{\mathrm{d}t}\bigl(e^{-t}\bigr)\,\mathrm{d}t
=-\Bigl(\bigl[t^xe^{-t}\bigr]_{t=0}^{t=\infty}-x\int_0^\infty t^{(x-1)}e^{-t}\,\mathrm{d}t\Bigr)
=x\int_0^\infty t^{(x-1)}e^{-t}\,\mathrm{d}t=x\,\Gamma(x).
\tag{11.11}
\]
This and item (i) prove item (ii). Moreover, note that the integral transformation theorem with the diffeomorphism $(1,\infty)\ni t\mapsto\frac1t\in(0,1)$ ensures that
\[
\begin{split}
B(x,y)=\int_0^1 t^{(x-1)}(1-t)^{(y-1)}\,\mathrm{d}t
&=\int_1^\infty\Bigl(\frac1t\Bigr)^{(x-1)}\Bigl(1-\frac1t\Bigr)^{(y-1)}\frac1{t^2}\,\mathrm{d}t
=\int_1^\infty t^{(-x-1)}\Bigl(\frac{t-1}{t}\Bigr)^{(y-1)}\,\mathrm{d}t
=\int_1^\infty t^{(-x-y)}(t-1)^{(y-1)}\,\mathrm{d}t\\
&=\int_0^\infty(t+1)^{(-x-y)}\,t^{(y-1)}\,\mathrm{d}t
=\int_0^\infty\frac{t^{(y-1)}}{(t+1)^{(x+y)}}\,\mathrm{d}t.
\end{split}
\tag{11.13}
\]
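The identity $B(x,y)=\Gamma(x)\Gamma(y)/\Gamma(x+y)$ towards which this computation is heading (item (iii) of Lemma 11.2.1) can be sanity-checked numerically, assuming SciPy is available; the concrete values of $x$ and $y$ below are arbitrary.

```python
from scipy.special import beta, gamma
from scipy.integrate import quad

x, y = 2.5, 1.7
B_int, _ = quad(lambda t: t**(x - 1) * (1 - t)**(y - 1), 0.0, 1.0)  # B(x, y) from its integral definition
print(B_int)                               # integral value
print(beta(x, y))                          # SciPy's Beta function
print(gamma(x) * gamma(y) / gamma(x + y))  # the Gamma quotient of item (iii) in Lemma 11.2.1
# all three printed values agree up to quadrature error
```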
In addition, observe that the fact that for all (u, v) ∈ (0, ∞) × (0, 1) it holds that
1 − v −u
′
Φ (u, v) = (11.14)
v u
shows that for all (u, v) ∈ (0, ∞) × (0, 1) it holds that
det(Φ′ (u, v)) = (1 − v)u − v(−u) = u − vu + vu = u ∈ (0, ∞). (11.15)
This, the fact that
Z ∞ Z ∞
(x−1) −t (y−1) −t
Γ(x) · Γ(y) = t e dt t e dt
0 0
Z ∞ Z ∞
(x−1) −s (y−1) −t
= s e ds t e dt
0
Z ∞Z ∞ 0
(11.16)
= s(x−1) t(y−1) e−(s+t) dt ds
Z0 0
Proof of Proposition 11.2.3. Throughout this proof, let ⌊·⌋ : [0, ∞) → N0 satisfy for all
x ∈ [0, ∞) that ⌊x⌋ = max([0, x] ∩ N0 ). Observe that the fact that for all t ∈ (0, ∞) it holds
that R ∋ x 7→ tx ∈ (0, ∞) is convex establishes that for all x, y ∈ (0, ∞), α ∈ [0, 1] it holds
that
Z ∞ Z ∞
αx+(1−α)y−1 −t
Γ(αx + (1 − α)y) = t e dt = tαx+(1−α)y t−1 e−t dt
Z0 ∞ 0
Next observe that Lemma 11.2.2 demonstrates that for all x ∈ (0, ∞), α ∈ [0, 1] it holds
that
α
α α max{x + α − 1, 0}
(max{x + α − 1, 0}) = (x + α)
x+α
α
α 1
= (x + α) max 1 − ,0
x+α (11.31)
α α α x
≤ (x + α) 1 − = (x + α)
x+α x+α
x
= .
(x + α)1−α
This and (11.27) prove item (iii). Furthermore, note that induction, item (i) in Lemma 11.2.1,
the fact that ∀ α ∈ [0, ∞) : α − ⌊α⌋ ∈ [0, 1), and item (iii) ensure that for all x ∈ (0, ∞),
α ∈ [0, ∞) it holds that
⌊α⌋ ⌊α⌋
Γ(x + α) Γ(x + α − ⌊α⌋)
(x + α − i) xα−⌊α⌋
Q Q
= (x + α − i) ≤
Γ(x) i=1 Γ(x) i=1
Moreover, observe that the fact that ∀ α ∈ [0, ∞) : α − ⌊α⌋ ∈ [0, 1), item (iii), induction,
and item (i) in Lemma 11.2.1 show that for all x ∈ (0, ∞), α ∈ [0, ∞) it holds that
Combining this with (11.32) establishes item (iv). The proof of Proposition 11.2.3 is thus
complete.
Corollary 11.2.4. Let $B\colon(0,\infty)^2\to(0,\infty)$ satisfy for all $x,y\in(0,\infty)$ that $B(x,y)=\int_0^1 t^{x-1}(1-t)^{y-1}\,\mathrm{d}t$ and let $\Gamma\colon(0,\infty)\to(0,\infty)$ satisfy for all $x\in(0,\infty)$ that $\Gamma(x)=\int_0^\infty t^{x-1}e^{-t}\,\mathrm{d}t$. Then it holds for all $x,y\in(0,\infty)$ with $x+y>1$ that
Proof of Corollary 11.2.4. Note that item (iii) in Lemma 11.2.1 implies that for all x, y ∈
(0, ∞) it holds that
Γ(x)Γ(y)
B(x, y) = . (11.35)
Γ(y + x)
Furthermore, observe that the fact that for all $x,y\in(0,\infty)$ with $x+y>1$ it holds that $y+\min\{x-1,0\}>0$ and item (iv) in Proposition 11.2.3 demonstrate that for all $x,y\in(0,\infty)$ with $x+y>1$ it holds that
\[
0<(y+\min\{x-1,0\})^x\le\frac{\Gamma(y+x)}{\Gamma(y)}\le(y+\max\{x-1,0\})^x.
\tag{11.36}
\]
Combining this with (11.35) and item (ii) in Proposition 11.2.3 proves that for all x, y ∈
(0, ∞) with x + y > 1 it holds that
Then
and
Proof of Lemma 11.2.5. Throughout this proof, let D = (D1 , . . . , Dn ) : E → Rn satisfy for
all x ∈ E that
D(x) = (D1 (x), D2 (x), . . . , Dn (x)) = (d(x, e1 ), d(x, e2 ), . . . , d(x, en )). (11.40)
This establishes item (i). It thus remains to prove item (ii). For this observe that the
fact that d : E × E → [0, ∞) is continuous shows that D : E → Rn is continuous. Hence,
we obtain that D : E → Rn is B(E)/B(Rn )-measurable. Furthermore, note that item (i)
implies that for all k ∈ {1, 2, . . . , n}, x ∈ P −1 ({ek }) it holds that
Therefore, we obtain that for all k ∈ {1, 2, . . . , n}, x ∈ P −1 ({ek }) it holds that
Moreover, observe that (11.38) demonstrates that for all k ∈ {1, 2, . . . , n}, x ∈ P −1 ({ek })
it holds that
min l ∈ {1, 2, . . . , n} : d(x, el ) = min d(x, eu )
u∈{1,2,...,n} (11.44)
∈ l ∈ {1, 2, . . . , n} : el = ek ⊆ k, k + 1, . . . , n .
Hence, we obtain that for all k ∈ {1, 2, . . . , n}, x ∈ P −1 ({ek }) with ek ∈ l∈N∩[0,k) {el } it
S
/
holds that
min l ∈ {1, 2, . . . , n} : d(x, el ) = min d(x, eu ) ≥ k. (11.45)
u∈{1,2,...,n}
Combining this with (11.43) proves that for all k ∈ {1, 2, . . . , n}, x ∈ P ({ek }) with
−1
This and (11.38) ensure that for all k ∈ {1, 2, . . . , n} with ek ∈ l∈N∩[0,k) {el } it holds that
S
/
−1
P ({ek }) = x ∈ E : min l ∈ {1, 2, . . . , n} : d(x, el ) = min d(x, eu ) = k .
u∈{1,2,...,n}
(11.48)
Combining (11.40) with the fact that D : E →SR is B(E)/B(R n
)-measurable hence estab-
n
lishes that for all k ∈ {1, 2, . . . , n} with ek ∈ l∈N∩[0,k) {el } it holds that
/
P −1 ({ek })
= x ∈ E : min l ∈ {1, 2, . . . , n} : d(x, el ) = min d(x, eu ) = k
u∈{1,2,...,n}
= x ∈ E : min l ∈ {1, 2, . . . , n} : Dl (x) = min Du (x) = k
u∈{1,2,...,n}
(11.49)
∀ l ∈ N ∩ [0, k) : Dk (x) < Dl (x) and
= x ∈ E:
∀ l ∈ {1, 2, . . . , n} : Dk (x) ≤ Dl (x)
k−1 n
\ \ \
= {x ∈ E : Dk (x) < Dl (x)} {x ∈ E : Dk (x) ≤ Dl (x)} ∈ B(E).
| {z } | {z }
l=1 l=1
∈B(E) ∈B(E)
This proves item (ii). The proof of Lemma 11.2.5 is thus complete.
Lemma 11.2.6. Let (E, d) be a separable metric space, let (E, δ) be a metric space, let (Ω, F)
be a measurable space, let X : E × Ω → E, assume for all e ∈ E that Ω ∋ ω 7→ X(e, ω) ∈ E
is F/B(E)-measurable, and assume for all ω ∈ Ω that E ∋ e 7→ X(e, ω) ∈ E is continuous.
Then X : E × Ω → E is (B(E) ⊗ F)/B(E)-measurable.
Proof of Lemma 11.2.6. Throughout this proof, let e = (em )m∈N : N → E satisfy
{em : m ∈ N} = E, (11.52)
Note that (11.54) shows that for all n ∈ N, B ∈ B(E) it holds that
Item (ii) in Lemma 11.2.5 therefore implies that for all n ∈ N, B ∈ B(E) it holds that
[ n h io
−1
(Xn ) (B) = (x, ω) ∈ E × Ω : X(y, ω) ∈ B and x ∈ (Pn ) ({y})
−1
y∈Im(Pn )
[
−1
(11.56)
= {(x, ω) ∈ E × Ω : X(y, ω) ∈ B} ∩ (Pn ) ({y}) × Ω
y∈Im(Pn )
[
E × (X(y, ·))−1 (B) ∩ (Pn )−1 ({y}) × Ω ∈ (B(E) ⊗ F).
=
| {z } | {z }
y∈Im(Pn )
∈(B(E)⊗F) ∈(B(E)⊗F)
Combining this with the fact that for all n ∈ N it holds that Xn : E × Ω → E is (B(E) ⊗
F)/B(E)-measurable establishes that X : E × Ω → E is (B(E) ⊗ F)/B(E)-measurable. The
proof of Lemma 11.2.6 is thus complete.
\[
\Bigl(\mathbb{E}\Bigl[\min_{k\in\{1,2,\dots,K\}}|\mathcal{R}(\Theta_k)-\mathcal{R}(\theta)|^p\Bigr]\Bigr)^{\!1/p}
\le\frac{L(\beta-\alpha)\max\{1,(p/d)^{1/d}\}}{K^{1/d}}
\le\frac{L(\beta-\alpha)\max\{1,p\}}{K^{1/d}}.
\tag{11.58}
\]
Proof of Proposition 11.2.7. Throughout this proof, assume without loss of generality that
L > 0, let δ : ([α, β]d ) × ([α, β]d ) → [0, ∞) satisfy for all θ, ϑ ∈ [α, β]d that
Z 1
B(x, y) = tx−1 (1 − t)y−1 dt, (11.60)
0
and let Θ1,1 , Θ1,2 , . . . , Θ1,d : Ω → [α, β] satisfy Θ1 = (Θ1,1 , Θ1,2 , . . . , Θ1,d ). First, note that
the assumption that for all θ, ϑ ∈ [α, β]d , ω ∈ Ω it holds that
proves that for all ω ∈ Ω it holds that [α, β]d ∋ θ 7→ R(θ, ω) ∈ R is continuous. Combining
this with the fact that ([α, β]d , δ) is a separable metric space, the fact that for all θ ∈ [α, β]d
it holds that Ω ∋ ω 7→ R(θ, ω) ∈ R is F/B(R)-measurable, and Lemma 11.2.6 establishes
item (i). Observe that the fact that for all θ ∈ [α, β], ε ∈ [0, ∞) it holds that
and the assumption that Θ1 is continuously uniformly distributed on [α, β]d show that for
Hence, we obtain for all $\theta\in[\alpha,\beta]^d$, $p\in(0,\infty)$, $\varepsilon\in[0,\infty)$ that
\[
\mathbb{P}\bigl(\|\Theta_1-\theta\|_\infty>\varepsilon^{1/p}\bigr)
\le1-\min\Bigl\{1,\tfrac{\varepsilon^{d/p}}{(\beta-\alpha)^d}\Bigr\}
=\max\Bigl\{0,\,1-\tfrac{\varepsilon^{d/p}}{(\beta-\alpha)^d}\Bigr\}.
\tag{11.64}
\]
This, item (i), the assumption that for all $\theta,\vartheta\in[\alpha,\beta]^d$, $\omega\in\Omega$ it holds that $|\mathcal{R}(\theta,\omega)-\mathcal{R}(\vartheta,\omega)|\le L\|\theta-\vartheta\|_\infty$, the assumption that $\Theta_k$, $k\in\{1,2,\dots,K\}$, are i.i.d. random variables, and Lemma 11.1.2 (applied with $(E,\delta)\curvearrowleft([\alpha,\beta]^d,\delta)$, $(X_k)_{k\in\{1,2,\dots,K\}}\curvearrowleft(\Theta_k)_{k\in\{1,2,\dots,K\}}$ in the notation of Lemma 11.1.2) imply that for all $\theta\in[\alpha,\beta]^d$, $p\in(0,\infty)$ it holds that
\[
\begin{split}
\mathbb{E}\Bigl[\min_{k\in\{1,2,\dots,K\}}|\mathcal{R}(\Theta_k)-\mathcal{R}(\theta)|^p\Bigr]
&\le L^p\int_0^\infty\bigl[\mathbb{P}\bigl(\|\Theta_1-\theta\|_\infty>\varepsilon^{1/p}\bigr)\bigr]^K\,\mathrm{d}\varepsilon
\le L^p\int_0^\infty\Bigl[\max\Bigl\{0,1-\tfrac{\varepsilon^{d/p}}{(\beta-\alpha)^d}\Bigr\}\Bigr]^K\,\mathrm{d}\varepsilon\\
&=L^p\int_0^{(\beta-\alpha)^p}\Bigl[1-\tfrac{\varepsilon^{d/p}}{(\beta-\alpha)^d}\Bigr]^K\,\mathrm{d}\varepsilon
=\tfrac{p}{d}\,L^p(\beta-\alpha)^p\int_0^1 t^{p/d-1}(1-t)^K\,\mathrm{d}t\\
&=\tfrac{p}{d}\,L^p(\beta-\alpha)^p\int_0^1 t^{p/d-1}(1-t)^{K+1-1}\,\mathrm{d}t
=\tfrac{p}{d}\,L^p(\beta-\alpha)^p\,B(p/d,K+1).
\end{split}
\tag{11.66}
\]
Corollary 11.2.4 (applied with $x\curvearrowleft p/d$, $y\curvearrowleft K+1$ for $p\in(0,\infty)$ in the notation of (11.34) in Corollary 11.2.4) therefore demonstrates that for all $\theta\in[\alpha,\beta]^d$, $p\in(0,\infty)$ it holds that
\[
\mathbb{E}\Bigl[\min_{k\in\{1,2,\dots,K\}}|\mathcal{R}(\Theta_k)-\mathcal{R}(\theta)|^p\Bigr]
\le\frac{L^p(\beta-\alpha)^p\max\{1,(p/d)^{p/d}\}}{(K+1+\min\{p/d-1,0\})^{p/d}}
\le\frac{L^p(\beta-\alpha)^p\max\{1,(p/d)^{p/d}\}}{K^{p/d}}.
\tag{11.67}
\]
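The $K^{-1/d}$-decay in (11.58) and (11.67) can be observed in a small Monte Carlo experiment. The snippet below uses a toy Lipschitz risk function of our own choosing (the max-norm distance to a known minimizer) and only illustrates the scaling, not the constants.

```python
import numpy as np

rng = np.random.default_rng(0)
d, alpha, beta_, L = 2, -1.0, 1.0, 1.0
R = lambda th: L * np.max(np.abs(th), axis=-1)     # L-Lipschitz (max-norm), minimized at theta* = 0

for K in [10, 100, 1000, 10000]:
    errs = [np.min(R(rng.uniform(alpha, beta_, size=(K, d)))) for _ in range(200)]
    bound = L * (beta_ - alpha) / K ** (1 / d)      # right-hand side of (11.58) with p = 1
    print(f"K = {K:6d}:  E[min_k R(Theta_k)] ~ {np.mean(errs):.4f}   bound {bound:.4f}")
```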
11.3. Strong convergences rates for the optimization error involving ANNs
Proof of Lemma 11.3.3. Observe that Lemma 11.3.1 demonstrates (11.73). The proof of
Lemma 11.3.3 is thus complete.
Lemma 11.3.4. Let d ∈ N, u ∈ [−∞, ∞), v ∈ (u, ∞]. Then it holds for all x, y ∈ Rd that
Proof of Lemma 11.3.4. Note that Lemma 11.3.1, Corollary 11.3.2, and the fact that for
all x ∈ R it holds that max{−∞, x} = x = min{x, ∞} imply that for all x, y ∈ R it holds
that
|cu,v (x) − cu,v (y)| = |max{u, min{x, v}} − max{u, min{y, v}}|
(11.75)
≤ |min{x, v} − min{y, v}| ≤ |x − y|
(cf. Definition 1.2.9). Hence, we obtain that for all x = (x1 , x2 , . . . , xd ), y = (y1 , y2 , . . . , yd ) ∈
Rd it holds that
(cf. Definitions 1.2.10 and 3.3.4). The proof of Lemma 11.3.4 is thus complete.
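A quick numerical check of the 1-Lipschitz property of the clipping function used in (11.75)–(11.76), with numpy.clip standing in for the componentwise clipping $c_{u,v}$ (an identification we make only for this illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
u, v, d = -1.0, 1.0, 5
worst = -np.inf
for _ in range(10000):
    x, y = 3 * rng.normal(size=d), 3 * rng.normal(size=d)
    lhs = np.max(np.abs(np.clip(x, u, v) - np.clip(y, u, v)))   # ||C_{u,v,d}(x) - C_{u,v,d}(y)||_inf
    rhs = np.max(np.abs(x - y))                                 # ||x - y||_inf
    worst = max(worst, lhs - rhs)
print("largest observed lhs - rhs:", worst)   # never positive
```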
Lemma 11.3.5 (Row sum norm, operator norm induced by the maximum norm). Let $a,b\in\mathbb{N}$, $M=(M_{i,j})_{(i,j)\in\{1,2,\dots,a\}\times\{1,2,\dots,b\}}\in\mathbb{R}^{a\times b}$. Then
\[
\sup_{v\in\mathbb{R}^b\backslash\{0\}}\biggl[\frac{\|Mv\|_\infty}{\|v\|_\infty}\biggr]
=\max_{i\in\{1,2,\dots,a\}}\sum_{j=1}^b|M_{i,j}|
\le b\max_{i\in\{1,2,\dots,a\}}\max_{j\in\{1,2,\dots,b\}}|M_{i,j}|
\tag{11.77}
\]
\[
\begin{split}
\sup_{v\in\mathbb{R}^b\backslash\{0\}}\biggl[\frac{\|Mv\|_\infty}{\|v\|_\infty}\biggr]
&=\sup_{v=(v_1,v_2,\dots,v_b)\in[-1,1]^b}\|Mv\|_\infty
=\sup_{v=(v_1,v_2,\dots,v_b)\in[-1,1]^b}\Biggl(\max_{i\in\{1,2,\dots,a\}}\biggl|\sum_{j=1}^bM_{i,j}v_j\biggr|\Biggr)\\
&=\max_{i\in\{1,2,\dots,a\}}\Biggl(\sup_{v=(v_1,v_2,\dots,v_b)\in[-1,1]^b}\biggl|\sum_{j=1}^bM_{i,j}v_j\biggr|\Biggr)
=\max_{i\in\{1,2,\dots,a\}}\Biggl(\sum_{j=1}^b|M_{i,j}|\Biggr)
\end{split}
\tag{11.78}
\]
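The identity in (11.77)–(11.78), namely that the operator norm induced by the maximum norm equals the maximal absolute row sum, can be verified for a small random matrix by brute force over the vertices of $[-1,1]^b$, where the supremum in (11.78) is attained:

```python
import numpy as np
from itertools import product

M = np.random.default_rng(2).normal(size=(3, 4))
row_sum = np.max(np.sum(np.abs(M), axis=1))                  # max_i sum_j |M_{i,j}|
brute = max(np.max(np.abs(M @ np.array(v)))                  # sup over the vertices of [-1, 1]^4
            for v in product([-1.0, 1.0], repeat=4))
print(row_sum, brute)   # the two printed values coincide
```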
let Wj,k ∈ Rlk ×lk−1 , k ∈ {1, 2, . . . , L}, j ∈ {1, 2}, and Bj,k ∈ Rlk , k ∈ {1, 2, . . . , L},
j ∈ {1, 2}, satisfy for all j ∈ {1, 2}, k ∈ {1, 2, . . . , L} that
(11.82)
T (Wj,1 , Bj,1 ), (Wj,2 , Bj,2 ), . . . , (Wj,L , Bj,L ) = (θj,1 , θj,2 , . . . , θj,d ),
let ϕj,k ∈ N, k ∈ {1, 2, . . . , L}, j ∈ {1, 2}, satisfy for all j ∈ {1, 2}, k ∈ {1, 2, . . . , L} that
h k i
×
ϕj,k = (Wj,1 , Bj,1 ), (Wj,2 , Bj,2 ), . . . , (Wj,k , Bj,k ) ∈ i=1
Rli ×li−1 li
×R , (11.83)
let D = [a, b]l0 , let mj,k ∈ [0, ∞), j ∈ {1, 2}, k ∈ {0, 1, . . . , L}, satisfy for all j ∈ {1, 2},
k ∈ {0, 1, . . . , L} that
(
max{1, |a|, |b|} :k=0
mj,k = N
(11.84)
max 1, supx∈D ∥(Rr (ϕj,k ))(x)∥∞ : k > 0,
and let ek ∈ [0, ∞), k ∈ {0, 1, . . . , L}, satisfy for all k ∈ {0, 1, . . . , L} that
(
0 :k=0
ek = (11.85)
supx∈D ∥(RN N
r (ϕ1,k ))(x) − (Rr (ϕ2,k ))(x)∥∞ :k>0
(cf. Definitions 1.2.4, 1.3.1, 1.3.4, 1.3.5, and 3.3.4). Note that Lemma 11.3.5 ensures that
e1 = sup ∥(RN N
r (ϕ1,1 ))(x) − (Rr (ϕ2,1 ))(x)∥∞
x∈D
= sup ∥(W1,1 x + B1,1 ) − (W2,1 x + B2,1 )∥∞
x∈D
≤ sup ∥(W1,1 − W2,1 )x∥∞ + ∥B1,1 − B2,1 ∥∞
x∈D
" # (11.86)
∥(W1,1 − W2,1 )v∥∞
≤ sup sup ∥x∥∞ + ∥B1,1 − B2,1 ∥∞
v∈Rl0 \{0} ∥v∥∞ x∈D
Furthermore, observe that the triangle inequality proves that for all k ∈ {1, 2, . . . , L}∩(1, ∞)
it holds that
ek = sup ∥(RN N
r (ϕ1,k ))(x) − (Rr (ϕ2,k ))(x)∥∞
x∈D
h i
= sup W1,k Rlk−1 (RN
r (ϕ1,k−1 ))(x) + B1,k
x∈D
(11.87)
h i
N
− W2,k Rlk−1 (Rr (ϕ2,k−1 ))(x) + B2,k
∞
≤ sup W1,k Rlk−1 (RN
r (ϕ1,k−1 ))(x) − W2,k Rlk−1 (RN
r (ϕ2,k−1 ))(x)
x∈D ∞
+ ∥θ1 − θ2 ∥∞ .
The triangle inequality therefore establishes that for all j ∈ {1, 2}, k ∈ {1, 2, . . . , L} ∩ (1, ∞)
it holds that
N
ek ≤ sup W1,k − W2,k Rlk−1 (Rr (ϕj,k−1 ))(x) ∞
x∈D
N
N
+ sup W3−j,k Rlk−1 (Rr (ϕ1,k−1 ))(x) − Rlk−1 (Rr (ϕ2,k−1 ))(x)
x∈D ∞
+ ∥θ1 − θ2 ∥∞
" #
∥(W1,k − W2,k )v∥∞ (11.88)
sup Rlk−1 (RN
≤ sup r (ϕj,k−1 ))(x) ∞
v∈Rlk−1 \{0}
∥v∥∞ x∈D
" #
∥W3−j,k v∥∞
sup Rlk−1 (RN
+ sup r (ϕ1,k−1 ))(x)
v∈Rlk−1 \{0}
∥v∥∞ x∈D
N
− Rlk−1 (Rr (ϕ2,k−1 ))(x) ∞ + ∥θ1 − θ2 ∥∞ .
Lemma 11.3.5 and Lemma 11.3.3 hence show that for all j ∈ {1, 2}, k ∈ {1, 2, . . . , L}∩(1, ∞)
it holds that
N
ek ≤ lk−1 ∥θ1 − θ2 ∥∞ sup Rlk−1 (Rr (ϕj,k−1 ))(x) ∞ + ∥θ1 − θ2 ∥∞
x∈D
N N
+ lk−1 ∥θ3−j ∥∞ sup Rlk−1 (Rr (ϕ1,k−1 ))(x) − Rlk−1 (Rr (ϕ2,k−1 ))(x) ∞
x∈D
(11.89)
N
≤ lk−1 ∥θ1 − θ2 ∥∞ sup (Rr (ϕj,k−1 ))(x) ∞ + ∥θ1 − θ2 ∥∞
x∈D
N N
+ lk−1 ∥θ3−j ∥∞ sup (Rr (ϕ1,k−1 ))(x) − (Rr (ϕ2,k−1 ))(x) ∞
x∈D
≤ ∥θ1 − θ2 ∥∞ (lk−1 mj,k−1 + 1) + lk−1 ∥θ3−j ∥∞ ek−1 .
Therefore, we obtain that for all j ∈ {1, 2}, k ∈ {1, 2, . . . , L} ∩ (1, ∞) it holds that
Combining this with (11.86), the fact that e0 = 0, and the fact that m1,0 = m2,0 demonstrates
that for all j ∈ {1, 2}, k ∈ {1, 2, . . . , L} it holds that
This implies that for all j = (jn )n∈{0,1,...,L} : {0, 1, . . . , L} → {1, 2} and all k ∈ {1, 2, . . . , L}
it holds that
Hence, we obtain that for all j = (jn )n∈{0,1,...,L} : {0, 1, . . . , L} → {1, 2} and all k ∈
{1, 2, . . . , L} it holds that
k−1
" k−1 # !
X Y
ek ≤ lm ∥θ3−jm ∥∞ mjn ,n (ln + 1)∥θ1 − θ2 ∥∞
n=0 m=n+1
" k−1 " k−1 # !# (11.93)
X Y
= ∥θ1 − θ2 ∥∞ lm ∥θ3−jm ∥∞ mjn ,n (ln + 1) .
n=0 m=n+1
Moreover, note that Lemma 11.3.5 ensures that for all j ∈ {1, 2}, k ∈ {1, 2, . . . , L} ∩ (1, ∞),
x ∈ D it holds that
∥(RNr (ϕj,k ))(x)∥∞
N
= Wj,k Rlk−1 (Rr (ϕj,k−1 ))(x) + Bj,k
∞
" #
∥Wj,k v∥∞
Rlk−1 (RN
≤ sup r (ϕj,k−1 ))(x) ∞
+ ∥Bj,k ∥∞
v∈Rlk−1 \{0}
∥v∥∞
(11.94)
≤ lk−1 ∥θj ∥∞ Rlk−1 (RN
r (ϕj,k−1 ))(x) ∞
+ ∥θj ∥∞
≤ lk−1 ∥θj ∥∞ (RN
r (ϕj,k−1 ))(x) ∞
+ ∥θj ∥∞
= lk−1 (RN
r (ϕj,k−1 ))(x) ∞
+ 1 ∥θj ∥∞
≤ (lk−1 mj,k−1 + 1)∥θj ∥∞ ≤ mj,k−1 (lk−1 + 1)∥θj ∥∞ .
In addition, observe that Lemma 11.3.5 proves that for all j ∈ {1, 2}, x ∈ D it holds that
∥(RN
r (ϕj,1 ))(x)∥∞ = ∥Wj,1 x + Bj,1 ∥∞
" #
∥Wj,1 v∥∞
≤ sup ∥x∥∞ + ∥Bj,1 ∥∞
v∈Rl0 \{0} ∥v∥∞ (11.96)
≤ l0 ∥θj ∥∞ ∥x∥∞ + ∥θj ∥∞ ≤ l0 ∥θj ∥∞ max{|a|, |b|} + ∥θj ∥∞
= (l0 max{|a|, |b|} + 1)∥θj ∥∞ ≤ m1,0 (l0 + 1)∥θj ∥∞ .
Combining this with (11.95) establishes that for all j ∈ {1, 2}, k ∈ {1, 2, . . . , L} it holds
that
mj,k ≤ max{1, mj,k−1 (lk−1 + 1)∥θj ∥∞ }. (11.98)
Therefore, we obtain that for all j ∈ {1, 2}, k ∈ {0, 1, . . . , L} it holds that
"k−1 #
Y k
(11.99)
mj,k ≤ mj,0 (ln + 1) max{1, ∥θj ∥∞ } .
n=0
Combining this with (11.93) shows that for all j = (jn )n∈{0,1,...,L} : {0, 1, . . . , L} → {1, 2}
and all k ∈ {1, 2, . . . , L} it holds that
" k−1 " k−1 #
X Y
ek ≤ ∥θ1 − θ2 ∥∞ lm ∥θ3−jm ∥∞
n=0 m=n+1
"n−1 # !!#
Y
n
· mjn ,0 (lv + 1) max{1, ∥θjn ∥∞ }(ln + 1)
v=0
" k−1 " k−1 # " n
# !!#
X Y Y
(lv + 1) max{1, ∥θjn ∥n∞ }
= m1,0 ∥θ1 − θ2 ∥∞ lm ∥θ3−jm ∥∞
n=0 m=n+1 v=0
" k−1 " k−1 #"k−1 # !#
X Y Y
≤ m1,0 ∥θ1 − θ2 ∥∞ ∥θ3−jm ∥∞ (lv + 1) max{1, ∥θjn ∥n∞ }
n=0 m=n+1 v=0
"k−1 #" k−1 " k−1 # !#
Y X Y
= m1,0 ∥θ1 − θ2 ∥∞ (ln + 1) ∥θ3−jm ∥∞ max{1, ∥θjn ∥n∞ } .
n=0 n=0 m=n+1
(11.100)
Hence, we obtain that for all j ∈ {1, 2}, k ∈ {1, 2, . . . , L} it holds that
"k−1 #" k−1 " k−1 # !#
Y X Y
ek ≤ m1,0 ∥θ1 − θ2 ∥∞ (ln + 1) ∥θ3−j ∥∞ max{1, ∥θj ∥n∞ }
n=0 n=0 m=n+1
"k−1 #" k−1 #
(11.101)
Y X
max{1, ∥θj ∥n∞ } ∥θ3−j ∥k−1−n
= m1,0 ∥θ1 − θ2 ∥∞ (ln + 1) ∞
n=0 n=0
" k−1 #
k−1
Y
≤ k m1,0 ∥θ1 − θ2 ∥∞ (max{1, ∥θ1 ∥∞ , ∥θ2 ∥∞ }) lm + 1 .
m=0
Proof of Corollary 11.3.7. Note that Lemma 11.3.4 and Theorem 11.3.6 demonstrate that for all $\theta,\vartheta\in\mathbb{R}^d$ it holds that
\[
\begin{split}
\sup_{x\in[a,b]^{l_0}}\|\mathcal{N}^{\theta,l}_{u,v}(x)-\mathcal{N}^{\vartheta,l}_{u,v}(x)\|_\infty
&=\sup_{x\in[a,b]^{l_0}}\|C_{u,v,l_L}(\mathcal{N}^{\theta,l}_{-\infty,\infty}(x))-C_{u,v,l_L}(\mathcal{N}^{\vartheta,l}_{-\infty,\infty}(x))\|_\infty\\
&\le\sup_{x\in[a,b]^{l_0}}\|\mathcal{N}^{\theta,l}_{-\infty,\infty}(x)-\mathcal{N}^{\vartheta,l}_{-\infty,\infty}(x)\|_\infty\\
&\le L\max\{1,|a|,|b|\}(\|l\|_\infty+1)^L(\max\{1,\|\theta\|_\infty,\|\vartheta\|_\infty\})^{L-1}\|\theta-\vartheta\|_\infty
\end{split}
\tag{11.104}
\]
(cf. Definitions 1.2.10, 3.3.4, and 4.4.1). The proof of Corollary 11.3.7 is thus complete.
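The qualitative content of Corollary 11.3.7, namely that the realization depends Lipschitz-continuously on the parameter vector uniformly on a compact input set, can be illustrated numerically. The forward pass below is a generic fully-connected ReLU network of our own making (it is not the book's realization operator from Chapter 1), so the experiment only mirrors the spirit of (11.104).

```python
import numpy as np

rng = np.random.default_rng(3)
l = [1, 8, 8, 1]                                     # layer dimensions (l_0, ..., l_L)
sizes = [(l[k], l[k - 1]) for k in range(1, len(l))]
dim = sum(m * n + m for (m, n) in sizes)             # number of parameters

def realize(theta, x):
    """Fully-connected ReLU network with affine output layer; x has shape (N, l_0)."""
    y, i = x, 0
    for k, (m, n) in enumerate(sizes):
        W = theta[i:i + m * n].reshape(m, n); i += m * n
        b = theta[i:i + m]; i += m
        y = y @ W.T + b
        if k < len(sizes) - 1:
            y = np.maximum(y, 0.0)                   # ReLU on all hidden layers
    return y

theta = rng.uniform(-1, 1, dim)
grid = np.linspace(-1, 1, 201).reshape(-1, 1)        # a fine grid in [a, b]^{l_0} = [-1, 1]
for eps in [1e-1, 1e-2, 1e-3]:
    vartheta = theta + rng.uniform(-eps, eps, dim)
    sup_diff = np.max(np.abs(realize(theta, grid) - realize(vartheta, grid)))
    print(f"||theta - vartheta||_inf <= {eps:.0e}  ->  sup_x |difference| = {sup_diff:.2e}")
# the sup-norm difference of the two realizations shrinks in proportion to the
# parameter perturbation, as the Lipschitz estimate (11.104) suggests
```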
(cf. Definition 4.4.1). Then it holds for all θ, ϑ ∈ [−B, B]d , ω ∈ Ω that
Proof of Lemma 11.3.8. Observe that the fact that for all x1 , x2 , y ∈ R it holds that
(x1 − y)2 − (x2 − y)2 = (x1 − x2 )((x1 − y) + (x2 − y)), the fact that for all θ ∈ Rd , x ∈ Rd
it holds that Nu,v
θ,l
(x) ∈ [u, v], and the assumption that for all j ∈ {1, 2, . . . , M }, ω ∈ Ω it
holds that Yj (ω) ∈ [u, v] imply that for all θ, ϑ ∈ [−B, B]d , ω ∈ Ω it holds that
Furthermore, note that the assumption that D ⊆ [−b, b]d , d ≥ Li=1 li (li−1 + 1), l0 = d,
P
lL = 1, b ≥ 1, and B ≥ 1 and Corollary 11.3.7 (applied with a ↶ −b, b ↶ b, u ↶ u, v ↶ v,
d ↶ d, L ↶ L, l ↶ l in the notation of Corollary 11.3.7) ensure that for all θ, ϑ ∈ [−B, B]d
it holds that
θ,l ϑ,l θ,l ϑ,l
supx∈D |Nu,v (x) − Nu,v (x)| ≤ supx∈[−b,b]d |Nu,v (x) − Nu,v (x)|
≤ L max{1, b}(∥l∥∞ + 1)L (max{1, ∥θ∥∞ , ∥ϑ∥∞ })L−1 ∥θ − ϑ∥∞ (11.108)
L L−1
≤ bL(∥l∥∞ + 1) B ∥θ − ϑ∥∞
(cf. Definition 3.3.4). This and (11.107) prove that for all θ, ϑ ∈ [−B, B]d , ω ∈ Ω it holds
that
|R(θ, ω) − R(ϑ, ω)| ≤ 2(v − u)bL(∥l∥∞ + 1)L B L−1 ∥θ − ϑ∥∞ . (11.109)
The proof of Lemma 11.3.8 is thus complete.
Corollary 11.3.9. Let d, d, d, L, M, K ∈ N, B, b ∈ [1, ∞), u ∈ R, vP∈ (u, ∞), l =
(l0 , l1 , . . . , lL ) ∈ NL+1 , D ⊆ [−b, b]d , assume l0 = d, lL = 1, and d ≥ d = Li=1 li (li−1 + 1),
let (Ω, F, P) be a probability space, let Θk : Ω → [−B, B]d , k ∈ {1, 2, . . . , K}, be i.i.d.
random variables, assume that Θ1 is continuously uniformly distributed on [−B, B]d , let
Xj : Ω → D, j ∈ {1, 2, . . . , M }, and Yj : Ω → [u, v], j ∈ {1, 2, . . . , M }, be random variables,
and let R : [−B, B]d × Ω → [0, ∞) satisfy for all θ ∈ [−B, B]d , ω ∈ Ω that
\[
R(\theta,\omega)=\frac1M\Biggl[\sum_{j=1}^M\bigl|\mathcal{N}^{\theta,l}_{u,v}(X_j(\omega))-Y_j(\omega)\bigr|^2\Biggr]
\tag{11.110}
\]
Observe that the fact that ∀ θ ∈ [−B, B]d : Nu,v θ,l P (θ),l
= Nu,v establishes that for all θ ∈
[−B, B] , ω ∈ Ω it holds that
d
M
1 P θ,l 2
R(θ, ω) = |N (Xj (ω)) − Yj (ω)|
M j=1 u,v
M (11.113)
1 P P (θ),l 2
= |N (Xj (ω)) − Yj (ω)| = R(P (θ), ω).
M j=1 u,v
Furthermore, note that Lemma 11.3.8 (applied with d ↶ d, R ↶ ([−B, B]d × Ω ∋ (θ, ω) 7→
R(θ, ω) ∈ [0, ∞)) in the notation of Lemma 11.3.8) shows that for all θ, ϑ ∈ [−B, B]d ,
ω ∈ Ω it holds that
|R(θ, ω) − R(ϑ, ω)| ≤ 2(v − u)bL(∥l∥∞ + 1)L B L−1 ∥θ − ϑ∥∞ = L∥θ − ϑ∥∞ . (11.114)
Moreover, observe that the assumption that Xj , j ∈ {1, 2, . . . , M }, and Yj , j ∈ {1, 2, . . . ,
M }, are random variables demonstrates that R : [−B, B]d × Ω → R is a random field. This,
(11.114), the fact that P ◦ Θk : Ω → [−B, B]d , k ∈ {1, 2, . . . , K}, are i.i.d. random variables,
the fact that P ◦Θ1 is continuously uniformly distributed on [−B, B]d , and Proposition 11.2.7
(applied with d ↶ d, α ↶ −B, β ↶ B, R ↶ R, (Θk )k∈{1,2,...,K} ↶ (P ◦ Θk )k∈{1,2,...,K} in
the notation of Proposition 11.2.7) imply that for all θ ∈ [−B, B]d , p ∈ (0, ∞) it holds that
R is (B([−B, B]d ) ⊗ F)/B(R)-measurable and
1/p
E mink∈{1,2,...,K} |R(P (Θk )) − R(P (θ))|p
The fact that P is B([−B, B]d )/B([−B, B]d )-measurable and (11.113) therefore prove
item (i). In addition, note that (11.113), (11.115), and the fact that 2 ≤ d = i=1 li (li−1 +
PL
1) ≤ L(∥l∥∞ + 1)2 ensure that for all θ ∈ [−B, B]d , p ∈ (0, ∞) it holds that
1/p
E mink∈{1,2,...,K} |R(Θk ) − R(θ)|p
1/p
= E mink∈{1,2,...,K} |R(P (Θk )) − R(P (θ))|p
(11.116)
p
4(v − u)bL(∥l∥∞ + 1)L B L max{1, p/d}
≤
K 1/d
4(v − u)bL(∥l∥∞ + 1)L B L max{1, p}
≤ .
K [L−1 (∥l∥∞ +1)−2 ]
This establishes item (ii). The proof of Corollary 11.3.9 is thus complete.
Part IV
Generalization
Chapter 12

Probabilistic generalization error estimates
In Chapter 15 below we establish a full error analysis for the training of ANNs in the specific
situation of GD-type optimization methods with many independent random initializations
(see Corollary 15.2.3). For this combined error analysis we do not only employ estimates
for the approximation error (see Part II above) and the optimization error (see Part III
above) but we also employ suitable generalization error estimates. Such generalization error
estimates are the subject of this chapter (cf. Corollary 12.3.10 below) and the next (cf.
Corollary 13.3.3 below). While in this chapter, we treat probabilistic generalization error
estimates, in Chapter we will present generalization error estimates in the strong Lp -sense.
In the literature, related generalization error estimates can, for instance, be found in
the survey articles and books [25, 35, 36, 87, 373] and the references therein. The specific
material in Section 12.1 is inspired by Duchi [116], the specific material in Section 12.2
is inspired by Cucker & Smale [87, Section 6 in Chapter I] and Carl & Stephani [61,
Section 1.1], and the specific presentation of Section 12.3 is strongly based on Beck et al. [25,
Section 3.2].
Proof of Lemma 12.1.1. Observe that the fact that $X\ge0$ proves that
\[
\mathbb{1}_{\{X\ge\varepsilon\}}=\frac{\varepsilon\mathbb{1}_{\{X\ge\varepsilon\}}}{\varepsilon}\le\frac{X\mathbb{1}_{\{X\ge\varepsilon\}}}{\varepsilon}\le\frac{X}{\varepsilon}.
\tag{12.2}
\]
Hence, we obtain that
\[
\mu(X\ge\varepsilon)=\int_\Omega\mathbb{1}_{\{X\ge\varepsilon\}}\,\mathrm{d}\mu\le\frac{\int_\Omega X\,\mathrm{d}\mu}{\varepsilon}.
\tag{12.3}
\]
The proof of Lemma 12.1.1 is thus complete.
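Lemma 12.1.1 is the Markov inequality. A quick Monte Carlo illustration for an exponentially distributed random variable (our choice of example):

```python
import numpy as np

X = np.random.default_rng(4).exponential(scale=1.0, size=10**6)   # nonnegative with E[X] = 1
for eps in [0.5, 1.0, 2.0, 4.0]:
    print(f"eps = {eps}:  P(X >= eps) ~ {np.mean(X >= eps):.4f}  <=  E[X]/eps = {1.0 / eps:.4f}")
```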
Lemma 12.1.5. Let (Ω, F, P) be a probability space, let a ∈ R, b ∈ [a, ∞), and let
X : Ω → [a, b] be a random variable. Then
(b − a)2
Var(X) ≤ . (12.9)
4
Proof of Lemma 12.1.5. Throughout this proof, assume without loss of generality that $a<b$. Observe that Lemma 12.1.4 implies that
\[
\begin{split}
\operatorname{Var}(X)=\mathbb{E}\bigl[(X-\mathbb{E}[X])^2\bigr]
&=(b-a)^2\,\mathbb{E}\biggl[\Bigl(\tfrac{X-a-(\mathbb{E}[X]-a)}{b-a}\Bigr)^{\!2}\biggr]
=(b-a)^2\,\mathbb{E}\biggl[\Bigl(\tfrac{X-a}{b-a}-\mathbb{E}\bigl[\tfrac{X-a}{b-a}\bigr]\Bigr)^{\!2}\biggr]\\
&=(b-a)^2\operatorname{Var}\bigl(\tfrac{X-a}{b-a}\bigr)
\le(b-a)^2\bigl(\tfrac14\bigr)=\frac{(b-a)^2}{4}.
\end{split}
\tag{12.10}
\]
The proof of Lemma 12.1.5 is thus complete.
and we call MX,P the moment-generating function of X with respect to P (we call MX,P the
moment-generating function of X).
Proof of Lemma 12.1.8. Observe that Fubini's theorem ensures that for all $t\in\mathbb{R}$ it holds that
\[
M_{\sum_{n=1}^NX_n}(t)=\mathbb{E}\Bigl[e^{t(\sum_{n=1}^NX_n)}\Bigr]=\mathbb{E}\Biggl[\prod_{n=1}^Ne^{tX_n}\Biggr]=\prod_{n=1}^N\mathbb{E}\bigl[e^{tX_n}\bigr]=\prod_{n=1}^NM_{X_n}(t).
\tag{12.17}
\]
The proof of Lemma 12.1.8 is thus complete.
Proof of Proposition 12.1.9. Note that Lemma 12.1.1 ensures that for all $\lambda\in[0,\infty)$ it holds that
\[
\mathbb{P}(X\ge\varepsilon)\le\mathbb{P}(\lambda X\ge\lambda\varepsilon)=\mathbb{P}(\exp(\lambda X)\ge\exp(\lambda\varepsilon))\le\frac{\mathbb{E}[\exp(\lambda X)]}{\exp(\lambda\varepsilon)}=e^{-\lambda\varepsilon}\,\mathbb{E}\bigl[e^{\lambda X}\bigr].
\tag{12.19}
\]
The proof of Proposition 12.1.9 is thus complete.
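The exponential Markov bound (12.19) underlies all Chernoff-type estimates in this section. For a standard normal random variable the moment-generating function satisfies $\mathbb{E}[e^{\lambda X}]=e^{\lambda^2/2}$, so optimizing the right-hand side of (12.19) over $\lambda$ (at $\lambda=\varepsilon$) yields the bound $e^{-\varepsilon^2/2}$; the snippet below compares this with empirical tail probabilities.

```python
import numpy as np

X = np.random.default_rng(5).standard_normal(10**6)
for eps in [1.0, 2.0, 3.0]:
    empirical = np.mean(X >= eps)
    chernoff = np.exp(-eps**2 / 2)   # (12.19) with E[e^{lambda X}] = e^{lambda^2/2} and lambda = eps
    print(f"eps = {eps}:  P(X >= eps) ~ {empirical:.5f}  <=  {chernoff:.5f}")
```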
Corollary 12.1.10. Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $X\colon\Omega\to\mathbb{R}$ be a random variable, and let $c,\varepsilon\in\mathbb{R}$. Then
\[
\mathbb{P}(X\ge c+\varepsilon)\le\inf_{\lambda\in[0,\infty)}\bigl[e^{-\lambda\varepsilon}M_{X-c}(\lambda)\bigr].
\tag{12.20}
\]
Proof of Corollary 12.1.11. Observe that Corollary 12.1.10 (applied with c ↶ E[X] in the
notation of Corollary 12.1.10) establishes (12.23). The proof of Corollary 12.1.11 is thus
complete.
Proof of Corollary 12.1.12. Throughout this proof, let $\tilde c\in\mathbb{R}$ satisfy $\tilde c=-c$ and let $\tilde X\colon\Omega\to\mathbb{R}$ satisfy
\[
\tilde X=-X.
\tag{12.25}
\]
Observe that Corollary 12.1.10 and (12.25) ensure that
\[
\mathbb{P}(X\le c-\varepsilon)=\mathbb{P}(-X\ge-c+\varepsilon)=\mathbb{P}(\tilde X\ge\tilde c+\varepsilon)\le\inf_{\lambda\in[0,\infty)}\bigl[e^{-\lambda\varepsilon}M_{\tilde X-\tilde c}(\lambda)\bigr]=\inf_{\lambda\in[0,\infty)}\bigl[e^{-\lambda\varepsilon}M_{c-X}(\lambda)\bigr].
\tag{12.26}
\]
\[
Y_n=X_n-\mathbb{E}[X_n].
\tag{12.28}
\]
Observe that Proposition 12.1.9, Lemma 12.1.8, and (12.28) ensure that
\[
\begin{split}
\mathbb{P}\Biggl(\Biggl[\sum_{n=1}^N\bigl(X_n-\mathbb{E}[X_n]\bigr)\Biggr]\ge\varepsilon\Biggr)
=\mathbb{P}\Biggl(\Biggl[\sum_{n=1}^NY_n\Biggr]\ge\varepsilon\Biggr)
&\le\inf_{\lambda\in[0,\infty)}\Bigl[e^{-\lambda\varepsilon}M_{\sum_{n=1}^NY_n}(\lambda)\Bigr]\\
&=\inf_{\lambda\in[0,\infty)}\Biggl[e^{-\lambda\varepsilon}\prod_{n=1}^NM_{Y_n}(\lambda)\Biggr]
=\inf_{\lambda\in[0,\infty)}\Biggl[e^{-\lambda\varepsilon}\prod_{n=1}^NM_{X_n-\mathbb{E}[X_n]}(\lambda)\Biggr].
\end{split}
\tag{12.29}
\]
The fact that R ∋ x 7→ ex ∈ R is convex hence demonstrates that for all x ∈ [a, b] it holds
that
λx b−x b−x b − x λa b−x
e = exp λa + 1 − λb ≤ e + 1− eλb .
b−a b−a b−a b−a
(12.34)
The assumption that E[X] = 0 therefore assures that
b b
E eλX ≤ λa
eλb . (12.35)
e + 1−
b−a b−a
demonstrates that
b b
E eλX ≤ λa
eλb
e + 1−
b−a b−a
= (1 − p)eλa + [1 − (1 − p)]eλb (12.37)
λa λb
= (1 − p)e + p e
= (1 − p) + p eλ(b−a) eλa .
Proof of Lemma 12.1.15. Observe that the fundamental theorem of calculus ensures that
for all x ∈ R it holds that
Z x
ϕ(x) = ϕ(0) + ϕ′ (y) dy
0
Z xZ y
′
= ϕ(0) + ϕ (0)x + ϕ′′ (z) dz dy (12.39)
0 0
x2
′ ′′
≤ ϕ(0) + ϕ (0)x + sup ϕ (z) .
2 z∈R
to obtain that for all x ∈ R it holds that ϕ′′ (x) ≤ 14 . This, (12.39), and (12.41) ensure that
for all x ∈ R it holds that
x2 x2 x2 x2
′ ′′ ′′
ϕ(x) ≤ ϕ(0) + ϕ (0)x + sup ϕ (z) = ϕ(0) + sup ϕ (z) ≤ ϕ(0) + = .
2 z∈R 2 z∈R 8 8
(12.43)
Lemma 12.1.16. Let (Ω, F, P) be a probability space, let a ∈ R, b ∈ [a, ∞), λ ∈ R, and let
X : Ω → [a, b] be a random variable with E[X] = 0. Then
\[
\mathbb{E}\bigl[\exp(\lambda X)\bigr]\le\exp\Bigl(\tfrac{\lambda^2(b-a)^2}{8}\Bigr).
\tag{12.44}
\]
Proof of Lemma 12.1.16. Throughout this proof, assume without loss of generality that
a < b, let p ∈ R satisfy p = (b−a)
−a
, and let ϕr : R → R, r ∈ [0, 1], satisfy for all r ∈ [0, 1],
x ∈ R that
ϕr (x) = ln(1 − r + rex ) − rx. (12.45)
Observe that the assumption that E[X] = 0 and the fact that a ≤ E[X] ≤ b ensures that
a ≤ 0 ≤ b. Combining this with the assumption that a < b implies that
−a (b − a)
0≤p= ≤ = 1. (12.46)
(b − a) (b − a)
Combining this and (12.49) establishes (12.48). The proof of Lemma 12.1.17 is thus
complete.
Corollary 12.1.18. Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $N\in\mathbb{N}$, $\varepsilon\in[0,\infty)$, $a_1,a_2,\dots,a_N\in\mathbb{R}$, $b_1\in[a_1,\infty)$, $b_2\in[a_2,\infty)$, \dots, $b_N\in[a_N,\infty)$ satisfy $\sum_{n=1}^N(b_n-a_n)^2\neq0$, and let $X_n\colon\Omega\to[a_n,b_n]$, $n\in\{1,2,\dots,N\}$, be independent random variables. Then
\[
\mathbb{P}\Biggl(\Biggl[\sum_{n=1}^N\bigl(X_n-\mathbb{E}[X_n]\bigr)\Biggr]\ge\varepsilon\Biggr)\le\exp\Biggl(\frac{-2\varepsilon^2}{\sum_{n=1}^N(b_n-a_n)^2}\Biggr).
\tag{12.51}
\]
Moreover, note that Lemma 12.1.16 proves that for all n ∈ {1, 2, . . . , N } it holds that
2 2
2
λ (bn −an )2
MXn −E[Xn ] (λ) ≤ exp λ [(bn −E[Xn ])−(a
8
n −E[Xn ])]
= exp 8
. (12.54)
Proof of Corollary 12.1.19. Throughout this proof, let Xn : Ω → [−bn , −an ], n ∈ {1, 2, . . . ,
N }, satisfy for all n ∈ {1, 2, . . . , N } that
Xn = −Xn . (12.57)
Combining this with Corollary 12.1.18 and Corollary 12.1.19 establishes (12.59). The proof
of Corollary 12.1.20 is thus complete.
Corollary 12.1.21. Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $N\in\mathbb{N}$, $\varepsilon\in[0,\infty)$, $a_1,a_2,\dots,a_N\in\mathbb{R}$, $b_1\in[a_1,\infty)$, $b_2\in[a_2,\infty)$, \dots, $b_N\in[a_N,\infty)$ satisfy $\sum_{n=1}^N(b_n-a_n)^2\neq0$, and let $X_n\colon\Omega\to[a_n,b_n]$, $n\in\{1,2,\dots,N\}$, be independent random variables. Then
\[
\mathbb{P}\Biggl(\Biggl|\frac1N\sum_{n=1}^N\bigl(X_n-\mathbb{E}[X_n]\bigr)\Biggr|\ge\varepsilon\Biggr)\le2\exp\Biggl(\frac{-2\varepsilon^2N^2}{\sum_{n=1}^N(b_n-a_n)^2}\Biggr).
\tag{12.61}
\]
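A Monte Carlo check of the Hoeffding-type bound (12.61) for i.i.d. $\mathcal{U}_{[0,1]}$-distributed random variables (so $a_n=0$, $b_n=1$ and the right-hand side becomes $2\exp(-2\varepsilon^2N)$):

```python
import numpy as np

N, trials, eps = 50, 10**5, 0.1
X = np.random.default_rng(6).uniform(0.0, 1.0, size=(trials, N))
deviation = np.abs(X.mean(axis=1) - 0.5)                 # |1/N sum_n (X_n - E[X_n])|
print("empirical P(deviation >= eps):", np.mean(deviation >= eps))
print("Hoeffding bound 2*exp(-2*eps^2*N):", 2 * np.exp(-2 * eps**2 * N))
```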
N
! 2
1 X −ε N
P (Xn − E[Xn ]) ≥ ε ≤ 2 exp . (12.63)
N i=1 2
Exercise 12.1.2. Prove or disprove the following statement: For every probability space
(Ω, F, P), every N ∈ N, and every random variable X = (XQ 1 , X2 , . . . , XN ) : Ω → [−1, 1]
N
with ∀ a = (a1 , a2 , . . . , aN ) ∈ [−1, 1]N : P( i=1 {Xi ≤ ai }) = i=1 ai2+1 it holds that
N N
T
N
!
1 X 1 h e iN
P (Xn − E[Xn ]) ≥ ≤2 . (12.64)
N n=1 2 4
Exercise 12.1.3. Prove or disprove the following statement: For every probability space
(Ω, F, P), every N ∈ N, and every random variable X = (XQ 1 , X2 , . . . , XN ) : Ω → [−1, 1]
N
N
! N
e − e−3
1 X 1
P (Xn − E[Xn ]) ≥ ≤2 . (12.65)
N n=1 2 4
Exercise 12.1.4. Prove or disprove the following statement: For every probability space
(Ω, F, P), every N ∈ N, ε ∈ [0, ∞), and every standard normal random variable X =
(X1 , X2 , . . . , XN ) : Ω → RN it holds that
N
! 2
1 X −ε N
P (Xn − E[Xn ]) ≥ ε ≤ 2 exp . (12.66)
N n=1 2
f (x) f (x)
(i) it holds that limx→∞ g(x)
= limx↘0 g(x)
= 0 and
Proof of Lemma 12.1.22. Note that the fact that limx→∞ exp(−x) x−1
= limx↘0 exp(−x)
x−1
= 0
establishes item (i). Moreover, observe that the fact that e < 3 implies item (ii). The proof
of Lemma 12.1.22 is thus complete.
Corollary 12.1.23. Let (Ω, F, P) be a probability space, let NP∈ N, ε ∈ (0, ∞), a1 , a2 , . . . ,
aN ∈ R, b1 ∈ [a1 , ∞), b2 ∈ [a2 , ∞), . . . , bN ∈ [aN , ∞) satisfy N 2
n=1 (bn − an ) ̸= 0, and let
Xn : Ω → [an , bn ], n ∈ {1, 2, . . . , N }, be independent random variables. Then
N
! ( ! P )
N 2
X −2ε2 (b n − a n )
, n=1 2
P Xn − E[Xn ] ≥ ε ≤ min 1, 2 exp PN .
n=1 (b
n=1 n − a n ) 2 4ε
(12.67)
Proof of Corollary 12.1.23. Observe that Lemma 12.1.6, Corollary 12.1.20, and the fact
that for all B ∈ F it holds that P(B) ≤ 1 establish (12.67). The proof of Corollary 12.1.23
is thus complete.
Lemma 12.2.2. Let (X, d) be a metric space, let n ∈ N, r ∈ [0, ∞], assume X ̸= ∅,
and let A ⊆ X satisfy |A| ≤ n and ∀ x ∈ X : ∃ a ∈ A : d(a, x) ≤ r. Then there exist
x1 , x2 , . . . , xn ∈ X such that
"n #
[
X⊆ {v ∈ X : d(xi , v) ≤ r} . (12.69)
i=1
Proof of Lemma 12.2.2. Note that the assumption that X ̸= ∅ and the assumption that
|A| ≤ n imply that there exist x1 , x2 , . . . , xn ∈ X which satisfy A ⊆ {x1 , x2 , . . . , xn }. This
and the assumption that ∀ x ∈ X : ∃ a ∈ A : d(a, x) ≤ r ensure that
" # "n #
[ [
X⊆ {v ∈ X : d(a, v) ≤ r} ⊆ {v ∈ X : d(xi , v) ≤ r} . (12.70)
a∈A i=1
∀ x ∈ X : ∃ a ∈ A : d(a, x) ≤ r. (12.71)
Proof of Lemma 12.2.3.SThroughout this proof, let A = {x1 , x2 , . . . , xn }. Note that the
assumption that X ⊆ i=1 {v ∈ X : d(xi , v) ≤ r} implies that for all v ∈ X there exists
n
Exercise 12.2.1. Prove or disprove the following statement: For every metric space (X, d) and
every n, m ∈ N it holds that C(X,d),n < ∞ if and only if C(X,d),m < ∞ (cf. Definition 12.2.1)
Exercise 12.2.2. Prove or disprove the following statement: For every metric space (X, d) and
every n ∈ N it holds that (X, d) is bounded if and only if C(X,d),n < ∞ (cf. Definition 12.2.1).
Exercise 12.2.3. Prove or disprove the following statement: For every n ∈ N and every
metric space (X, d) with X ̸= ∅ it holds that
and we call P(X,d),n the n-packing radius of (X, d) (we call P X,r the n-packing radius of X).
Exercise 12.2.4. Prove or disprove the following statement: For every n ∈ N and every
metric space (X, d) with X ̸= ∅ it holds that
P (X,d),r = sup
n ∈ N : ∃ x1 , x2 , . . . , xn+1 ∈ X :
mini,j∈{1,2,...,n+1}, i̸=j d(xi , xj ) > 2r ∪ {0} (12.77)
and we call P (X,d),r the r-packing number of (X, d) (we call P X,r the r-packing number of
X).
Proof of Lemma 12.2.8. Note that (12.77) ensures that there exist x1 , x2 , . . . , xn+1 ∈ X
such that
(12.78)
mini,j∈{1,2,...,n+1}, i̸=j d(xi , xj ) > 2r.
This implies that P(X,d),n ≥ r (cf. Definition 12.2.6). The proof of Lemma 12.2.8 is thus
complete.
12.2.2.2 Upper bounds for packing numbers based on upper bounds for packing
radii
Lemma 12.2.9. Let (X, d) be a metric space and let n ∈ N, r ∈ [0, ∞] satisfy P(X,d),n < r
(cf. Definition 12.2.6). Then P (X,d),r < n (cf. Definition 12.2.7).
Proof of Lemma 12.2.9. Observe that Lemma 12.2.8 establishes that P (X,d),r < n (cf. Defi-
nition 12.2.7). The proof of Lemma 12.2.9 is thus complete.
12.2.2.3 Upper bounds for packing radii based on upper bounds for covering
radii
Lemma 12.2.10. Let (X, d) be a metric space and let n ∈ N. Then P(X,d),n ≤ C(X,d),n (cf.
Definitions 12.2.1 and 12.2.6).
Proof of Lemma 12.2.10. Throughout this proof, assume without loss of generality that
C(X,d),n < ∞ and P(X,d),n > 0, let r ∈ [0, ∞), x1 , x2 , . . . , xn ∈ X satisfy
" n #
[
X⊆ {v ∈ X : d(xm , v) ≤ r} , (12.79)
m=1
(12.80)
mini,j∈{1,2,...,n+1}, i̸=j d(xi , xj ) > 2r,
(cf. Definitions 12.2.1 and 12.2.6 and Lemma 12.2.5). Observe that (12.81) shows that for
all v ∈ X it holds that
(12.82)
v ∈ w ∈ X : d(xφ(v) , w) ≤ r .
Moreover, note that the fact that φ(x1 ), φ(x2 ), . . . , φ(xn+1 ) ∈ {1, 2, . . . , n} ensures that
there exist i, j ∈ {1, 2, . . . , n + 1} which satisfy
2r < d(xi , xj ) ≤ d(xi , xφ(xi ) ) + d(xφ(xi ) , xj ) = d(xi , xφ(xi ) ) + d(xj , xφ(xj ) ) ≤ 2r. (12.85)
This implies that r < r. The proof of Lemma 12.2.10 is thus complete.
Lemma 12.2.11. Let (X, d) be a metric space, let n ∈ N, x ∈ X, r ∈ (0, ∞], and let
S = {v ∈ X : d(x, v) ≤ r}. Then P(S,d|S×S ),n ≤ r (cf. Definition 12.2.6).
Proof of Lemma 12.2.11. Throughout this proof, assume without loss of generality that
P(S,d|S×S ),n > 0 (cf. Definition 12.2.6). Observe that for all x1 , x2 , . . . , xn+1 ∈ S, i, j ∈
{1, 2, . . . , n + 1} it holds that
Moreover, note that (12.75) ensures that for all ρ ∈ [0, P(S,d|S×S ),n ) there exist x1 , x2 , . . . ,
xn+1 ∈ S such that
mini,j∈{1,2,...,n+1},i̸=j d(xi , xj ) > 2ρ. (12.88)
This and (12.87) demonstrate that for all ρ ∈ [0, P(S,d|S×S ),n ) it holds that 2ρ < 2r. The
proof of Lemma 12.2.11 is thus complete.
This establishes that C (X,d),r ≤ n (cf. Definition 4.3.2). The proof of Lemma 12.2.12 is thus
complete.
Lemma 12.2.13. Let (X, d) be a compact metric space and let r ∈ [0, ∞], n ∈ N, satisfy
C(X,d),n ≤ r (cf. Definition 12.2.1). Then C (X,d),r ≤ n (cf. Definition 4.3.2).
Proof of Lemma 12.2.13. Throughout this proof, assume without loss of generality that
X ̸= ∅ and let xk,m ∈ X, m ∈ {1, 2, . . . , n}, k ∈ N, satisfy for all k ∈ N that
" n #
[
X⊆ v ∈ X : d(xk,m , v) ≤ r + k1 (12.90)
m=1
(cf. Lemma 12.2.4). Note that the assumption that (X, d) is a compact metric space
demonstrates that there exist x = (xm )m∈{1,2,...,n} : {1, 2, . . . , n} → X and k = (kl )l∈N : N →
N which satisfy that
lim supl→∞ maxm∈{1,2,...,n} d(xm , xkl ,m ) = 0 and lim supl→∞ kl = ∞. (12.91)
Next observe that the assumption that d is a metric ensures that for all v ∈ X, m ∈
{1, 2, . . . , n}, l ∈ N it holds that
d(v, xm ) ≤ d(v, xkl ,m ) + d(xkl ,m , xm ). (12.92)
This and (12.90) prove that for all v ∈ X, l ∈ N it holds that
minm∈{1,2,...,n} d(v, xm ) ≤ minm∈{1,2,...,n} [d(v, xkl ,m ) + d(xkl ,m , xm )]
≤ minm∈{1,2,...,n} d(v, xkl ,m ) + maxm∈{1,2,...,n} d(xkl ,m , xm ) (12.93)
This establishes that C (X,d),r ≤ n (cf. Definition 4.3.2). The proof of Lemma 12.2.13 is thus
complete.
12.2.3.2 Upper bounds for covering radii based on upper bounds for covering
numbers
Lemma 12.2.14. Let (X, d) be a metric space and let r ∈ [0, ∞], n ∈ N satisfy C (X,d),r ≤ n
(cf. Definition 4.3.2). Then C(X,d),n ≤ r (cf. Definition 12.2.1).
Proof of Lemma 12.2.14. Observe that the assumption that C (X,d),r ≤ n ensures that there
exists A ⊆ X such that |A| ≤ n and
" #
[
X⊆ {v ∈ X : d(a, v) ≤ r} . (12.95)
a∈A
This establishes that C(X,d),n ≤ r (cf. Definition 12.2.1). The proof of Lemma 12.2.14 is
thus complete.
12.2.3.3 Upper bounds for covering radii based on upper bounds for packing
radii
Lemma 12.2.15. Let (X, d) be a metric space and let n ∈ N. Then C(X,d),n ≤ 2P(X,d),n (cf.
Definitions 12.2.1 and 12.2.6).
Proof of Lemma 12.2.15. Throughout this proof, assume w.l.o.g. that X ̸= ∅, assume
without loss of generality that P(X,d),n < ∞, let r ∈ [0, ∞] satisfy r > P(X,d),n , and let
N ∈ N0 ∪ {∞} satisfy N = P (X,d),r (cf. Definitions 12.2.6 and 12.2.7). Observe that
Lemma 12.2.9 ensures that
N = P (X,d),r < n. (12.96)
Moreover, note that the fact that N = P (X,d),r and (12.77) demonstrate that for all
x1 , x2 , . . . , xN +1 , xN +2 ∈ X it holds that
In addition, observe that the fact that N = P (X,d),r and (12.77) imply that there exist
x1 , x2 , . . . , xN +1 ∈ X which satisfy that
(12.98)
min {d(xi , xj ) : i, j ∈ {1, 2, . . . , N + 1}, i ̸= j} ∪ {∞} > 2r.
Combining this with (12.97) establishes that for all v ∈ X it holds that
Combining this and Lemma 12.2.5 shows that C(X,d),n ≤ 2r (cf. Definition 12.2.1). The
proof of Lemma 12.2.15 is thus complete.
Then
(i) it holds that Φ is linear,
PN 1/2 PN 1/2
(ii) it holds for all r = (r1 , r2 , . . . , rN ) ∈ RN that ~Φ(r)~ ≤ n=1 ~bn ~
2
n=1 |rn |
2
,
(vi) it holds for all r ∈ (0, ∞), v ∈ V , A ∈ B(V ) that ν({(ra + v) ∈ V : a ∈ A}) =
rN ν(A),
(vii) it holds for all r ∈ (0, ∞) that ν({v ∈ V : ~v~ ≤ r}) = rN ν({v ∈ V : ~v~ ≤ 1}), and
This establishes item (ii). Moreover, note that item (ii) proves item (iii). Furthermore,
observe that the assumption that b1 , b2 , . . . , bN ∈ V is a Hamel-basis of V establishes
item (iv). Next note that (12.102) and item (iii) prove item (v). In addition, observe that
the integral transformation theorem shows that for all r ∈ (0, ∞), v ∈ RN , A ∈ B(RN ) it
holds that
Z
N N
1{ra∈RN : a∈A} (x) dx
λ (ra + v) ∈ R : a ∈ A = λ ra ∈ R : a ∈ A =
RN
Z Z (12.105)
= 1A ( r ) dx = r
x N
1A (x) dx = r λ(A).
N
RN RN
Combining item (i) and item (iv) hence demonstrates that for all r ∈ (0, ∞), v ∈ V ,
A ∈ B(V ) it holds that
ν({(ra + v) ∈ V : a ∈ A}) = λ Φ−1 ({(ra + v) ∈ V : a ∈ A})
= λ Φ−1 (ra + v) ∈ RN : a ∈ A
Hence, we obtain that ν({v ∈ V : ~v~ ≤ 1}) ̸= 0. This establishes item (viii). The proof of
Lemma 12.2.17 is thus complete.
Next observe that Lemma 12.2.17 demonstrates that ν(X) > 0. Combining this with
(12.117) assures that (n + 1)ρN ≤ 2N . Therefore, we obtain that ρN ≤ (n + 1)−1 2N . Hence,
we obtain that ρ ≤ 2(n + 1)− /N . The proof of Lemma 12.2.18 is thus complete.
1
Proof of Corollary 12.2.19. Observe that Corollary 12.2.16 and Lemma 12.2.18 establish
(12.118). The proof of Corollary 12.2.19 is thus complete.
Proof of Lemma 12.2.20. Throughout this proof, assume without loss of generality that
C(X,d),n < ∞, let ρ ∈ (C(X,d),n , ∞), let λ : B(RN ) → [0, ∞] be the Lebesgue-Borel measure
on RN , let b1 , b2 , . . . , bN ∈ V be a Hamel-basis of V , let Φ : RN → V satisfy for all
r = (r1 , r2 , . . . , rN ) ∈ RN that
Φ(r) = r1 b1 + r2 b2 + . . . + rN bN , (12.120)
(cf. Definition 12.2.1). The fact that ρ > C(X,d),n demonstrates that there exist x1 , x2 , . . . ,
xn ∈ X which satisfy " n #
[
X⊆ {v ∈ X : d(xm , v) ≤ ρ} . (12.122)
m=1
This and Lemma 12.2.17 demonstrate that 1 ≤ nρN . Hence, we obtain that ρN ≥ n−1 .
This ensures that ρ ≥ n−1/N . The proof of Lemma 12.2.20 is thus complete.
Proof of Corollary 12.2.21. Observe that Corollary 12.2.19 and Lemma 12.2.20 establish
(12.124). The proof of Corollary 12.2.21 is thus complete.
Proof of Lemma 12.2.22. Throughout this proof, let Φ : V → V satisfy for all v ∈ V that
Φ(v) = rv. Observe that Exercise 12.2.3 shows that
r C(X,d),n = r inf x1 ,x2 ,...,xn ∈X supv∈X mini∈{1,2,...,n} d(xi , v)
= inf x1 ,x2 ,...,xn ∈X supv∈X mini∈{1,2,...,n} ~rxi − rv~
= inf x1 ,x2 ,...,xn ∈X supv∈X mini∈{1,2,...,n} ~Φ(xi ) − Φ(v)~
(12.126)
= inf x1 ,x2 ,...,xn ∈X supv∈X mini∈{1,2,...,n} d(Φ(xi ), Φ(v))
= inf x1 ,x2 ,...,xn ∈X supv∈X mini∈{1,2,...,n} d(Φ(xi ), v)
= inf x1 ,x2 ,...,xn ∈X supv∈X mini∈{1,2,...,n} d(xi , v) = C(X,d|X×X ),n
(cf. Definition 12.2.1). This establishes (12.125). The proof of Lemma 12.2.22 is thus
complete.
d(v, w) = ~v − w~ (12.129)
Proposition 12.2.24. Let $d\in\mathbb{N}$, $a\in\mathbb{R}$, $b\in(a,\infty)$, $r\in(0,\infty)$ and let $\delta\colon([a,b]^d)\times([a,b]^d)\to[0,\infty)$ satisfy for all $x,y\in[a,b]^d$ that $\delta(x,y)=\|x-y\|_\infty$ (cf. Definition 3.3.4). Then
\[
\mathcal{C}^{([a,b]^d,\delta),r}\le\begin{cases}1&\colon r\ge\frac{b-a}{2}\\[1ex]\bigl\lceil\frac{b-a}{2r}\bigr\rceil^d\le\bigl(\frac{b-a}{r}\bigr)^d&\colon r<\frac{b-a}{2}\end{cases}
\tag{12.135}
\]
(cf. Definitions 4.2.6 and 4.3.2).
N = b−a (12.136)
2r
,
In addition, note that it holds for all N ∈ N, i ∈ {1, 2, . . . , N }, x ∈ [gN,i , a + i(b−a)/N ] that
such that
\[
\delta(x,y)=\|x-y\|_\infty=\max_{i\in\{1,2,\dots,d\}}|x_i-y_i|\le\frac{b-a}{2N}\le\frac{(b-a)2r}{2(b-a)}=r.
\tag{12.142}
\]
Combining this with (4.82), (12.138), (12.136), and the fact that $\forall\,x\in[0,\infty)\colon\lceil x\rceil\le\mathbb{1}_{(0,r]}(rx)+2x\,\mathbb{1}_{(r,\infty)}(rx)$ demonstrates that
\[
\mathcal{C}^{([a,b]^d,\delta),r}\le|A|=N^d=\Bigl\lceil\tfrac{b-a}{2r}\Bigr\rceil^d\le\Bigl[\mathbb{1}_{(0,r]}\bigl(\tfrac{b-a}{2}\bigr)+\tfrac{b-a}{r}\,\mathbb{1}_{(r,\infty)}\bigl(\tfrac{b-a}{2}\bigr)\Bigr]^d
\tag{12.143}
\]
(cf. Definition 4.3.2). The proof of Proposition 12.2.24 is thus complete.
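The covering constructed in the proof of Proposition 12.2.24, that is, the midpoints of $N=\lceil(b-a)/(2r)\rceil$ equal subintervals per coordinate, can be built explicitly; the snippet below does so for a small example of our choosing and verifies the covering property on random points.

```python
import numpy as np
from itertools import product

a, b, d, r = 0.0, 1.0, 2, 0.15
N = int(np.ceil((b - a) / (2 * r)))                             # points per coordinate, as in (12.136)
centers_1d = a + (2 * np.arange(N) + 1) * (b - a) / (2 * N)     # midpoints of N equal subintervals
centers = np.array(list(product(centers_1d, repeat=d)))         # N^d centers in [a, b]^d

pts = np.random.default_rng(7).uniform(a, b, size=(10**5, d))
dist = np.max(np.abs(pts[:, None, :] - centers[None, :, :]), axis=2).min(axis=1)
print("number of centers:", len(centers), "   bound ((b-a)/r)^d =", ((b - a) / r) ** d)
print("max sup-norm distance to the nearest center:", dist.max(), "  <=  r =", r)
```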
Combining this with the fact that for all v ∈ F ∩ {em ∈ E : m ∈ N} it holds that
inf m∈N d(v, fm ) = 0 ensures that the set {fn ∈ F : n ∈ N} is dense in F . The proof of
Lemma 12.3.1 is thus complete.
Then
Proof of Lemma 12.3.2. Observe that the assumption that E is dense in E shows that for
all g ∈ C(E, R) it holds that
sup g(x) = sup g(x). (12.148)
x∈E x∈E
This and the assumption that for all ω ∈ Ω it holds that E ∋ x 7→ fx (ω) ∈ R is continuous
demonstrate that for all ω ∈ Ω it holds that
This proves item (i). Furthermore, note that item (i) and the assumption that for all
x ∈ E it holds that fx : Ω → R is F/B(R)-measurable establish item (ii). The proof of
Lemma 12.3.2 is thus complete.
Proof of Lemma 12.3.3. Throughout this proof, let B1 , B2 , . . . , BN ⊆ E satisfy for all
i ∈ {1, 2, . . . , N } that Bi = {x ∈ E : 2Lδ(x, zi ) ≤ ε}. Observe that the triangle inequality
and the assumption that for all x, y ∈ E it holds that |Zx − Zy | ≤ Lδ(x, y) show that for
all i ∈ {1, 2, . . . , N }, x ∈ Bi it holds that
Combining this with Lemma 12.3.2 and Lemma 12.3.1 proves that for all i ∈ {1, 2, . . . , N }
it holds that
P supx∈Bi |Zx | ≥ ε ≤ P 2ε + |Zzi | ≥ ε = P |Zzi | ≥ 2ε . (12.152)
N
X N
X (12.153)
ε
≤ P supx∈Bi |Zx | ≥ ε ≤ P |Zzi | ≥ 2
.
i=1 i=1
sume without loss of generality that N < ∞, and let z1 , z2 , . . . , zN ∈ E satisfy E ⊆ i=1 {x ∈
ε
E : δ(x, zi ) ≤ 2L } (cf. Definition 4.3.2). Observe that Lemma 12.3.2 and Lemma 12.3.3
establish that
N
X
ε
≤ N supx∈E P |Zx | ≥ 2ε . (12.155)
P(supx∈E |Zx | ≥ ε) ≤ P |Zzi | ≥ 2
i=1
and
(ii) it holds that Ω ∋ η 7→ supx∈E |Zx (η) − E[Zx ]| ∈ [0, ∞] is F/B([0, ∞])-measurable.
Proof of Lemma 12.3.5. Observe that the assumption that for all x, y ∈ E it holds that
|Zx − Zy | ≤ Lδ(x, y) implies that for all x, y ∈ E, η ∈ Ω it holds that
|(Zx (η) − E[Zx ]) − (Zy (η) − E[Zy ])| = |(Zx (η) − Zy (η)) + (E[Zy ] − E[Zx ])|
≤ |Zx (η) − Zy (η)| + |E[Zx ] − E[Zy ]|
≤ Lδ(x, y) + |E[Zx ] − E[Zy ]|
(12.157)
= Lδ(x, y) + |E[Zx − Zy ]|
≤ Lδ(x, y) + E[|Zx − Zy |]
≤ Lδ(x, y) + Lδ(x, y) = 2Lδ(x, y).
This ensures item (i). Note that item (i) shows that for all η ∈ Ω it holds that E ∋ x 7→
|Zx (η) − E[Zx ]| ∈ R is continuous. Combining this and the assumption that E is separable
with Lemma 12.3.2 proves item (ii). The proof of Lemma 12.3.5 is thus complete.
Lemma 12.3.6. Let (E, δ) be a separable metric space, assume E = ̸ ∅, let ε, L ∈ (0, ∞),
let (Ω, F, P) be a probability space, and let Zx : Ω → R, x ∈ E, be random variables which
satisfy for all x, y ∈ E that E[|Zx |] < ∞ and |Zx − Zy | ≤ Lδ(x, y). Then
(E,δ), ε −1
P(supx∈E |Zx − E[Zx ]| ≥ ε) ≤ supx∈E P |Zx − E[Zx ]| ≥ 2ε . (12.158)
C 4L
Proof of Lemma 12.3.6. Throughout this proof, let Yx : Ω → R, x ∈ E, satisfy for all x ∈ E,
η ∈ Ω that Yx (η) = Zx (η) − E[Zx ]. Observe that Lemma 12.3.5 ensures that for all x, y ∈ E
it holds that
|Yx − Yy | ≤ 2Lδ(x, y). (12.159)
This and Lemma 12.3.4 (applied with (E, δ) ↶ (E, δ), ε ↶ ε, L ↶ 2L, (Ω, F, P) ↶
(Ω, F, P), (Zx )x∈E ↶ (Yx )x∈E in the notation of Lemma 12.3.4) establish (12.158). The
proof of Lemma 12.3.6 is thus complete.
Then
(ii) it holds that Ω ∋ η 7→ supx∈E |Zx (η) − E[Zx ]| ∈ [0, ∞] is F/B([0, ∞])-measurable, and
Next note that the assumption that for all x ∈ E, m ∈ {1, 2, . . . , M }, ω ∈ Ω it holds that
|Yx,m (ω)| ∈ [0, D] ensures that for all x ∈ E it holds that
" "M ## "M #
1 X 1 X
(12.163)
E |Zx | = E Yx,m = E Yx,m ≤ D < ∞.
M m=1 M m=1
This proves item (i). Furthermore, note that item (i), (12.162), and Lemma 12.3.5 establish
item (ii). Next observe that (12.160) shows that for all x ∈ E it holds that
"M # " "M ## M
1 X 1 X 1 X
Yx,m − E Yx,m . (12.164)
|Zx −E[Zx ]| = Yx,m − E Yx,m =
M m=1 M m=1 M m=1
Combining this with Corollary 12.1.21 (applied with (Ω, F, P) ↶ (Ω, F, P), N ↶ M ,
ε ↶ 2ε , (a1 , a2 , . . . , aN ) ↶ (0, 0, . . . , 0), (b1 , b2 , . . . , bN ) ↶ (D, D, . . . , D), (Xn )n∈{1,2,...,N } ↶
(Yx,m )m∈{1,2,...,M } for x ∈ E in the notation of Corollary 12.1.21) ensures that for all x ∈ E
it holds that
ε 2 2 ! 2
−2 M −ε M
ε 2
(12.165)
P |Zx − E[Zx ]| ≥ 2 ≤ 2 exp = 2 exp .
M D2 2D2
Combining this, (12.162), and (12.163) with Lemma 12.3.6 establishes item (iii). The proof
of Lemma 12.3.7 is thus complete.
Observe that the fact that for all x1 , x2 , y ∈ R it holds that (x1 − y)2 − (x2 − y)2 =
(x1 − x2 )((x1 − y) + (x2 − y)), the assumption that for all x ∈ E, m ∈ {1, 2, . . . , M } it holds
that |Xx,m − Ym | ≤ D, and the assumption that for all x, y ∈ E, m ∈ {1, 2, . . . , M } it holds
that |Xx,m − Xy,m | ≤ Lδ(x, y) imply that for all x, y ∈ E, m ∈ {1, 2, . . . , M } it holds that
In addition, note that (12.166) and the assumption that for all x ∈ E it holds that
(Xx,m , Ym ), m ∈ {1, 2, . . . , M }, are i.i.d. random variables show that for all x ∈ E it holds
that
"M # "M # "M #
1 X 1 X 1 X
E |Xx,m − Ym |2 = E |Xx,1 − Y1 |2 =
E Ex = Ex = Ex .
M m=1 M m=1 M m=1
(12.170)
Furthermore, observe that the assumption that for all x ∈ E it holds that (Xx,m , Ym ),
m ∈ {1, 2, . . . , M }, are i.i.d. random variables ensures that for all x ∈ E it holds that Ex,m ,
m ∈ {1, 2, . . . , M }, are i.i.d. random variables. Combining this, (12.169), and (12.170)
with Lemma 12.3.7 (applied with (E, δ) ↶ (E, δ), M ↶ M , ε ↶ ε, L ↶ 2LD, D ↶ D2 ,
(Ω, F, P) ↶ (Ω, F, P), (Yx,m )x∈E, m∈{1,2,...,M } ↶ (Ex,m )x∈E, m∈{1,2,...,M } , (Zx )x∈E = (Ex )x∈E in
the notation of Lemma 12.3.7) establishes (12.167). The proof of Lemma 12.3.8 is thus
complete.
(cf. Definition 3.3.4). Then Ω ∋ ω 7→ supθ∈[−R,R]d |E(θ, ω) − E(Hθ )| ∈ [0, ∞] is F/B([0, ∞])-
measurable and
d 2
16LRR −ε M
(12.172)
P supθ∈[−R,R]d |E(θ) − E(Hθ )| ≥ ε ≤ 2 max 1, exp .
ε 2R4
Proof of Proposition 12.3.9. Throughout this proof, let B ⊆ Rd satisfy B = [−R, R]d =
{θ ∈ Rd : ∥θ∥∞ ≤ R} and let δ : B × B → [0, ∞) satisfy for all θ, ϑ ∈ B that
Observe that the assumption that (Xm , Ym ), m ∈ {1, 2, . . . , M }, are i.i.d. random vari-
ables and the assumption that for all θ ∈ [−R, R]d it holds that Hθ is continuous imply
that for all θ ∈ B it holds that (Hθ (Xm ), Ym ), m ∈ {1, 2, . . . , M }, are i.i.d. random
variables. Combining this, the assumption that for all θ, ϑ ∈ B, x ∈ D it holds that
|Hθ (x) − Hϑ (x)| ≤ L∥θ − ϑ∥∞ , and the assumption that for all θ ∈ B, m ∈ {1, 2, . . . , M }
it holds that |Hθ (Xm ) − Ym | ≤ R with Lemma 12.3.8 (applied with (E, δ) ↶ (B, δ),
M ↶ M , ε ↶ ε, L ↶ L, D ↶ R, (Ω, F, P) ↶ (Ω, F, P), (Xx,m )x∈E, m∈{1,2,...,M } ↶
(Hθ (Xm ))θ∈B, m∈{1,2,...,M } , (Ym )m∈{1,2,...,M } ↶ (Ym )m∈{1,2,...,M } , (Ex )x∈E ↶ (Ω ∋ ω 7→
E(θ, ω) ∈ [0, ∞)) θ∈B , (Ex )x∈E ↶ (E(Hθ ))θ∈B in the notation of Lemma 12.3.8) estab-
(cf. Definition 4.3.2). Moreover, note that Proposition 12.2.24 (applied with d ↶ d, a ↶ −R,
b ↶ R, r ↶ 8LR ε
, δ ↶ δ in the notation of Proposition 12.2.23) demonstrates that
d
ε 16LRR
C (B,δ), 8LR
≤ max 1, . (12.175)
ε
This and (12.174) prove (12.172). The proof of Proposition 12.3.9 is thus complete.
Corollary 12.3.10. Let d, M, L ∈ N, u ∈ P R, v ∈ (u, ∞), R ∈ [1, ∞), ε, b ∈ (0, ∞),
l = (l0 , l1 , . . . , lL ) ∈ N L+1
satisfy lL = 1 and Lk=1 lk (lk−1 + 1) ≤ d, let D ⊆ [−b, b]l0 be a
compact set, let (Ω, F, P) be a probability space, let Xm : Ω → D, m ∈ {1, 2, . . . , M }, and
Ym : Ω → [u, v], m ∈ {1, 2, . . . , M }, be functions, assume that (Xm , Ym ), m ∈ {1, 2, . . . , M },
are i.i.d. random variables, let E : C(D, R) → [0, ∞) satisfy for all f ∈ C(D, R) that
E(f ) = E[|f (X1 ) − Y1 |2 ], and let E : [−R, R]d × Ω → [0, ∞) satisfy for all θ ∈ [−R, R]d ,
ω ∈ Ω that "M #
1 X θ,l
E(θ, ω) = |N (Xm (ω)) − Ym (ω)|2 (12.176)
M m=1 u,v
(cf. Definition 4.4.1). Then
θ,l
(i) it holds that Ω ∋ ω 7→ supθ∈[−R,R]d E(θ, ω) − E Nu,v |D ∈ [0, ∞] is F/B([0, ∞])-
measurable and
466
12.3. Empirical risk minimization
Furthermore, observe that the fact that for all θ ∈ Rd , x ∈ Rl0 it holds that Nu,v
θ,l
(x) ∈ [u, v]
and the assumption that for all m ∈ {1, 2, . . . , M }, ω ∈ Ω it holds that Ym (ω) ∈ [u, v]
demonstrate that for all θ ∈ [−R, R]d , m ∈ {1, 2, . . . , M } it holds that
θ,l
|Nu,v (Xm ) − Ym | ≤ v − u. (12.180)
d
−ε2 M
θ,l
16LR(v − u)
P supθ∈[−R,R]d E(θ) − E Nu,v |D≥ ε ≤ 2 max 1, exp .
ε 2(v − u)4
(12.181)
The proof of Corollary 12.3.10 is thus complete.
467
Chapter 12: Probabilistic generalization error estimates
468
Chapter 13
M M 2 1/2
1 P 1 P
E Xj − E Xj
M j=1 M j=1
2
(13.1)
1 2 1/2
≤√ max E ∥Xj − E[Xj ]∥2 .
M j∈{1,2,...,M }
Proof of Proposition 13.1.1. Observe that the fact that for all x ∈ Rd it holds that ⟨x, x⟩ =
469
Chapter 13: Strong generalization error estimates
1 P M 2
= 2 Xj − E[Xj ] (13.2)
M j=1 2
M
1 P
= 2 Xi − E[Xi ], Xj − E[Xj ]
M i,j=1
M
1 P 2 1 P
= 2 ∥Xj − E[Xj ]∥2 + 2 Xi − E[Xi ], Xj − E[Xj ]
M j=1 M (i,j)∈{1,2,...,M }2 , i̸=j
(cf. Definition 1.4.7). This, the fact that for all independent random variables Y : Ω → Rd
and Z : Ω → Rd with E[∥Y ∥2 + ∥Z∥2 ] < ∞ it holds that E[|⟨Y, Z⟩|] < ∞ and E[⟨Y, Z⟩] =
⟨E[Y ], E[Z]⟩, and the assumption that Xj : Ω → Rd , j ∈ {1, 2, . . . , M }, are independent
random variables establish that
M M
2
1 P 1 P
E Xj − E Xj
M j=1 M j=1 2
M
1 P 2
1 P
= E ∥Xj − E[Xj ]∥2 + 2 E Xi − E[Xi ] , E Xj − E[Xj ]
M 2 j=1 M (i,j)∈{1,2,...,M }2 , i̸=j
M
1 P 2
(13.3)
= E ∥Xj − E[Xj ]∥2
M 2 j=1
1 2
≤ max E ∥Xj − E[Xj ]∥2 .
M j∈{1,2,...,M }
Definition 13.1.2 (Rademacher family). Let (Ω, F, P) be a probability space and let J
be a set. Then we say that (rj )j∈J is a P-Rademacher family if and only if it holds that
rj : Ω → {−1, 1}, j ∈ J, are independent random variables with
470
13.1. Monte Carlo estimates
(13.6)
p
Kp ≤ p − 1 < ∞
471
Chapter 13: Strong generalization error estimates
Proof of Corollary 13.1.6. Note that Proposition 13.1.5 and Lemma 13.1.4 show that
M M
p 1/p
1 P 1 P
E Xj − E Xj
M j=1 M j=1 2
M M p 1/p
1 P P
= E Xj − E Xj
M j=1 j=1 2
M 1/2
2Kp P 2/p
≤ E ∥Xj − E[Xj ]∥p2
M j=1 (13.11)
1/2
2Kp p 2/p
≤ M max E ∥Xj − E[Xj ]∥2
M j∈{1,2,...,M }
2Kp p 1/p
=√ max E ∥Xj − E[Xj ]∥2
M j∈{1,2,...,M }
√
2 p−1 p 1/p
≤ √ max E ∥Xj − E[Xj ]∥2
M j∈{1,2,...,M }
E⊆ N (13.12)
S
n=1 {x ∈ E : δ(x, zn ) ≤ rn },
Proof of Lemma 13.2.1. Throughout this proof, for every n ∈ {1, 2, . . . , N } let
Bn = {x ∈ E : δ(x, zn ) ≤ rn }. (13.14)
E⊆ N and (13.15)
S SN
n=1 Bn E⊇ n=1 Bn .
472
13.2. Uniform strong error estimates for random fields
(13.17)
N N
p
E supx∈Bn |Zx |p .
P P
≤E supx∈Bn |Zx | =
n=1 n=1
(cf. Lemma 12.3.2). Furthermore, note that the assumption that for all x, y ∈ E it holds
that |Zx − Zy | ≤ Lδ(x, y) demonstrates that for all n ∈ {1, 2, . . . , N }, x ∈ Bn it holds that
|Zx | = |Zx − Zzn + Zzn | ≤ |Zx − Zzn | + |Zzn | ≤ Lδ(x, zn ) + |Zzn | ≤ Lrn + |Zzn |. (13.18)
E⊆ N (13.21)
S
n=1 {x ∈ E : δ(x, zn ) ≤ r}
473
Chapter 13: Strong generalization error estimates
Lemma 13.2.3. Let (E, δ) be a non-empty separable metric space, let (Ω, F, P) be a
probability space, for every x ∈ E let Zx : Ω → R be a random variable with E[|Zx |] < ∞, let
L ∈ (0, ∞) satisfy for all x, y ∈ E that |Zx − Zy | ≤ Lδ(x, y), and let p ∈ [1, ∞), r ∈ (0, ∞).
Then
1/p h 1/p i
E supx∈E |Zx − E[Zx ]|p ≤ (C (E,δ),r ) /p 2Lr + supx∈E E |Zx − E[Zx ]|p (13.23)
1
(E,δ),r 1/p
h i
p 1/p
(13.26)
≤ (C ) 2Lr + supx∈E E |Yx |
h 1/p i
= (C (E,δ),r ) /p 2Lr + supx∈E E |Zx − E[Zx ]|p
1
.
474
13.2. Uniform strong error estimates for random fields
This establishes item (i). Note that (13.27) demonstrates that for all x, y ∈ E it holds that
M M M
1 P 1 P
|Yx,m − Yy,m | ≤ Lδ(x, y). (13.31)
P
|Zx − Zy | = Yx,m − Yy,m ≤
M m=1 m=1 M m=1
Item (i) and Lemma 12.3.5 therefore prove item (ii). It thus remains to show item (iii).
For this observe that item (i), (13.31), and Lemma 13.2.3 imply that for all p ∈ [1, ∞),
r ∈ (0, ∞) it holds that
h i
p 1/p p 1/p
(E,δ),r 1/p
(13.32)
E supx∈E |Zx − E[Zx ]| ≤ (C ) 2Lr + supx∈E E |Zx − E[Zx ]|
(cf. Definition 4.3.2). Furthermore, note that (13.30) and Corollary 13.1.6 (applied with
d ↶ 1, (Xm )m∈{1,2,...,M } ↶ (Yx,m )m∈{1,2,...,M } for x ∈ E in the notation of Corollary 13.1.6)
ensure that for all x ∈ E, p ∈ [2, ∞), r ∈ (0, ∞) it holds that
M M
p 1/p
p 1/p
1 P 1 P
E |Zx − E[Zx ]| = E Yx,m − E Yx,m
M m=1 M m=1
√ (13.33)
2 p−1 p 1/p
≤ √ max E |Yx,m − E[Yx,m ]| .
M m∈{1,2,...,M }
Combining this with (13.32) shows that for all p ∈ [2, ∞), r ∈ (0, ∞) it holds that
1/p
E supx∈E |Zx − E[Zx ]|p
h √ i
(E,δ),r 1/p 2 √p−1 p 1/p
≤ (C ) 2Lr + M supx∈E maxm∈{1,2,...,M } E |Yx,m − E[Yx,m ]| (13.34)
h √ 1/p
i
= 2(C (E,δ),r ) /p Lr + √p−1 supx∈E maxm∈{1,2,...,M } E |Yx,m − E[Yx,m ]|p
1
M
.
The proof of Lemma 13.2.4 is thus complete.
475
Chapter 13: Strong generalization error estimates
Corollary 13.2.5. Let (E, δ) be a non-empty separable metric space, let (Ω, F, P) be a
probability space, let M ∈ N, for every x ∈ E let Yx,m : Ω → R, m ∈ {1, 2, . . . , M }, be
independent random variables with E |Yx,1 | + |Yx,2 | + . . . + |Yx,m | < ∞, let L ∈ (0, ∞) satisfy
for all x, y ∈ E, m ∈ {1, 2, . . . , M } that |Yx,m − Yy,m | ≤ Lδ(x, y), and for every x ∈ E let
Zx : Ω → R satisfy M
1 P
Zx = Yx,m . (13.35)
M m=1
Then
(ii) it holds that Ω ∋ ω 7→ supx∈E |Zx (ω) − E[Zx ]| ∈ [0, ∞] is F/B([0, ∞])-measurable, and
(cf. Definition 4.3.2). This proves item (iii). The proof of Corollary 13.2.5 is thus complete.
476
13.3. Strong convergence rates for the generalisation error
Then
Proof of Lemma 13.3.1. Throughout this proof, for every x ∈ E, m ∈ {1, 2, . . . , M } let
Yx,m : Ω → R satisfy Yx,m = |Xx,m − Ym |2 . Observe that the assumption that for all x ∈ E
it holds that (Xx,m , Ym ), m ∈ {1, 2, . . . , M }, are i.i.d. random variables implies that for all
x ∈ E it holds that
2
|X − |
M
1 P M E x,1 Y 1
E |Xx,m − Ym |2 = (13.41)
E[R(x)] = = R(x).
M m=1 M
Furthermore, note that the assumption that for all x ∈ E, m ∈ {1, 2, . . . , M } it holds that
|Xx,m − Ym | ≤ b shows that for all x ∈ E, m ∈ {1, 2, . . . , M } it holds that
and
E[Yx,m ] − Yx,m = E |Xx,m − Ym |2 − |Xx,m − Ym |2 ≤ E |Xx,m − Ym |2 ≤ b2 . (13.44)
Observe that (13.42), (13.43), and (13.44) ensure for all x ∈ E, m ∈ {1, 2, . . . , M },
p ∈ (0, ∞) that
1/p 1/p
E |Yx,m − E[Yx,m ]|p ≤ E b2p = b2 . (13.45)
477
Chapter 13: Strong generalization error estimates
Moreover, note that (13.38) and the fact that for all x1 , x2 , y ∈ R it holds that (x1 − y)2 −
(x2 − y)2 = (x1 − x2 )((x1 − y) + (x2 − y)) show that for all x, y ∈ E, m ∈ {1, 2, . . . , M } it
holds that
|Yx,m − Yy,m | = |(Xx,m − Ym )2 − (Xy,m − Ym )2 |
≤ |Xx,m − Xy,m |(|Xx,m − Ym | + |Xy,m − Ym |) (13.46)
≤ 2b|Xx,m − Xy,m | ≤ 2bLδ(x, y).
The fact that for all x ∈ E it holds that Yx,m , m ∈ {1, 2, . . . , M }, are independent
random variables, (13.42), and Corollary 13.2.5 (applied with (Yx,m )x∈E, m∈{1,2,...,M } ↶
(Yx,m )x∈E, m∈{1,2,...,M } , L ↶ 2bL, (Zx )x∈E ↶ (Ω ∋ ω 7→ R(x, ω) ∈ R)x∈E in the notation of
Corollary 13.2.5) hence establish that
(I) it holds that Ω ∋ ω 7→ supx∈E |R(x, ω) − R(x)| ∈ [0, ∞] is F/B([0, ∞])-measurable
and
(II) it holds for all p ∈ [2, ∞), c ∈ (0, ∞) that
√ 2√ 1/p h
(E,δ), cb √p−1
1/p
2 √p−1
E supx∈E |R(x) − E[R(x)]|p cb2
≤ M
C 2bL M
i
p 1/p
. (13.47)
+ supx∈E maxm∈{1,2,...,M } E |Yx,m − E[Yx,m ]|
Observe that item (II), (13.41), (13.42), and (13.45) demonstrate that for all p ∈ [2, ∞),
c ∈ (0, ∞) it holds that
√
√ 1/p
(E,δ), cb √p−1
p 1/p 2 √p−1
[cb2 + b2 ]
E supx∈E |R(x) − R(x)| ≤ M C 2L M
478
13.3. Strong convergence rates for the generalisation error
Proof of Proposition 13.3.2. Throughout this proof, let (κc )c∈(0,∞) ⊆ (0, ∞) satisfy for all
c ∈ (0, ∞) that √
2 M L(β − α)
κc = , (13.52)
cb
let Xθ,m : Ω → R, m ∈ {1, 2, . . . , M }, θ ∈ [α, β]d , satisfy for all θ ∈ [α, β]d , m ∈
{1, 2, . . . , M } that
Xθ,m = fθ (Xm ), (13.53)
and let δ : [α, β]d × [α, β]d → [0, ∞) satisfy for all θ, ϑ ∈ [α, β]d that
First, note that the assumption that for all θ ∈ [α, β]d , m ∈ {1, 2, . . . , M } it holds that
|fθ (Xm ) − Ym | ≤ b implies for all θ ∈ [α, β]d , m ∈ {1, 2, . . . , M } that
Furthermore, observe that the assumption that for all θ, ϑ ∈ [α, β]d , x ∈ D it holds that
|fθ (x) − fϑ (x)| ≤ L∥θ − ϑ∥∞ ensures for all θ, ϑ ∈ [α, β]d , m ∈ {1, 2, . . . , M } that
|Xθ,m − Xϑ,m | = |fθ (Xm ) − fϑ (Xm )| ≤ supx∈D |fθ (x) − fϑ (x)| ≤ L∥θ − ϑ∥∞ = Lδ(θ, ϑ).
(13.56)
The fact that for all θ ∈ [α, β]d it holds that (Xθ,m , Ym ), m ∈ {1, 2, . . . , M }, are i.i.d. random
variables, (13.55), and Lemma 13.3.1 (applied with p ↶ q, C ↶ C, (E, δ) ↶ ([α, β]d , δ),
(Xx,m )x∈E, m∈{1,2,...,M } ↶ (Xθ,m )θ∈[α,β]d , m∈{1,2,...,M } for p ∈ [2, ∞), C ∈ (0, ∞) in the notation
of Lemma 13.3.1) therefore ensure that for all p ∈ [2, ∞), c ∈ (0, ∞) it holds that
Ω ∋ ω 7→ supθ∈[α,β]d |R(θ, ω) − R(θ)| ∈ [0, ∞] is F/B([0, ∞])-measurable and
√ 1/p 2(c + 1)b2 √p − 1
([α,β]d ,δ), cb √p−1
p 1/p
(13.57)
E supθ∈[α,β]d |R(θ) − R(θ)| ≤ C 2L M √
M
479
Chapter 13: Strong generalization error estimates
(cf. Definition 4.3.2). This establishes item (i). Note that Proposition 12.2.24 (applied with
d ↶ d, a ↶ α, b ↶ β, r ↶ r for r ∈ (0, ∞) in the notation of Proposition 12.2.24) shows
that for all r ∈ (0, ∞) it holds that
This, (13.57), and Jensen’s inequality demonstrate that for all c, ε, p ∈ (0, ∞) it holds that
1/p
E supθ∈[α,β]d |R(θ) − R(θ)|p
1
≤ E supθ∈[α,β]d |R(θ) − R(θ)|max{2,p, /ε} max{2,p,d/ε}
d
n d
o 2(c + 1)b2 pmax{2, p, d/ε} − 1
≤ max 1, (κc ) max{2,p, d/ε}
√
p
M (13.60)
2
2(c + 1)b max{1, p − 1, d/ε − 1}
= max 1, (κc )min{ /2, /p,ε}
d d
√
M
2 ε
p
2(c + 1)b max{1, (κc ) } max{1, p, d/ε}
≤ √ .
M
Moreover, observe that the fact that for all a ∈ (1, ∞) it holds that
√
a /(2 ln(a)) = e /(2 ln(a)) = e /2 = e ≥ 1 (13.61)
1 ln(a) 1
480
13.3. Strong convergence rates for the generalisation error
and (13.60) therefore imply that for all p ∈ (0, ∞) it holds that
1/p
E supθ∈[α,β]d |R(θ) − R(θ)|p
" p #
2(c + 1)b2 max{1, (κc )ε } max{1, p, d/ε}
≤ inf √
c,ε∈(0,∞) M
" √ p #
2(c + 1)b2 max{1, [2 M L(β − α)(cb)−1 ]ε } max{1, p, d/ε}
= inf √
c,ε∈(0,∞) M (13.64)
" p #
2(c + 1)b2 e max{1, p, d ln([κc ]2 )}
≤ inf √
c∈(0,∞) M
" p #
2(c + 1)b2 e max{1, p, d ln(4M L2 (β − α)2 (cb)−2 )}
= inf √ .
c∈(0,∞) M
This establishes item (ii). The proof of Proposition 13.3.2 is thus complete.
481
Chapter 13: Strong generalization error estimates
and let R : [−B, B]d × Ω → [0, ∞) satisfy for all θ ∈ [−B, B]d , ω ∈ Ω that
M M
1 P 1 P
R(θ, ω) = 2
|fθ (Xm (ω)) − Ym (ω)| = θ,l
|N (Xm (ω)) − Ym (ω)|2
(13.69)
M m=1 M m=1 u,v
(cf. Definition 3.3.4). Note that the fact that for all θ ∈ Rd , x ∈ Rd it holds that Nu,v θ,l
(x) ∈
[u, v] and the assumption that for all m ∈ {1, 2, . . . , M } it holds that Ym (Ω) ⊆ [u, v] ensure
for all θ ∈ [−B, B]d , m ∈ {1, 2, . . . , M } that
θ,l
|fθ (Xm ) − Ym | = |Nu,v (Xm ) − Ym | ≤ supy1 ,y2 ∈[u,v] |y1 − y2 | = v − u. (13.70)
Furthermore, observe that the assumption that D ⊆ [−b, b]d , l0 = d, and lL = 1, Corol-
lary 11.3.7 (applied with a ↶ −b, b ↶ b, u ↶ u, v ↶ v, d ↶ d, L ↶ L, l ↶ l in the
notation of Corollary 11.3.7), and the assumption that b ≥ 1 and B ≥ 1 show that for all
θ, ϑ ∈ [−B, B]d , x ∈ D it holds that
θ,l ϑ,l
|fθ (x) − fϑ (x)| ≤ supy∈[−b,b]d |Nu,v (y) − Nu,v (y)|
≤ L max{1, b}(∥l∥∞ + 1)L (max{1, ∥θ∥∞ , ∥ϑ∥∞ })L−1 ∥θ − ϑ∥∞ (13.71)
L L−1
≤ bL(∥l∥∞ + 1) B ∥θ − ϑ∥∞ = L∥θ − ϑ∥∞ .
Moreover, note that the fact that d ≥ d and the fact that for all θ = (θ1 , θ2 , . . . , θd ) ∈ Rd it
holds that Nu,v
θ,l (θ1 ,θ2 ,...,θd ),l
= Nu,v demonstrates that for all ω ∈ Ω it holds that
482
13.3. Strong convergence rates for the generalisation error
In addition, observe that (13.70), (13.71), Proposition 13.3.2 (applied with α ↶ −B, β ↶ B,
d ↶ d, b ↶ v − u, R ↶ R, R ↶ R in the notation of Proposition 13.3.2), the fact that
v − u ≥ (u + 1) − u = 1 (13.73)
prove that for all p ∈ (0, ∞) it holds that Ω ∋ ω 7→ supθ∈[−B,B]d |R(θ, ω) − R(θ)| ∈ [0, ∞] is
F/B([0, ∞])-measurable and
1/p
E supθ∈[−B,B]d |R(θ) − R(θ)|p
" p #
2(C + 1)(v − u)2 e max{1, p, d ln(4M L2 (2B)2 (C[v − u])−2 )}
≤ inf √
C∈(0,∞) M (13.75)
" p #
2(C + 1)(v − u)2 e max{1, p, L(∥l∥∞ + 1)2 ln(24 M L2 B 2 C −2 )}
≤ inf √ .
C∈(0,∞) M
Combining this with (13.72) establishes item (i). Note that (13.72), (13.75), the fact that
26 L2 ≤ 26 · 22(L−1) = 24+2L ≤ 24L+2L = 26L , the fact that 3 ≥ e, and the assumption that
B ≥ 1, L ≥ 1, M ≥ 1, and b ≥ 1 imply that for all p ∈ (0, ∞) it holds that
1/p 1/p
E supθ∈[−B,B]d |R(θ) − R(θ)|p = E supθ∈[−B,B]d |R(θ) − R(θ)|p
p
2(1/2 + 1)(v − u)2 e max{1, p, L(∥l∥∞ + 1)2 ln(24 M L2 B 2 22 )}
≤ √
M
2
p
3(v − u) e max{p, L(∥l∥∞ + 1)2 ln(26 M b2 L2 (∥l∥∞ + 1)2L B 2L )}
= √
M
(13.76)
p
2
3(v − u) e max{p, 3L2 (∥l∥∞ + 1)2 ln([26L M b2 (∥l∥∞ + 1)2L B 2L ]1/(3L) )}
≤ √
M
p
2
3(v − u) 3 max{p, 3L (∥l∥∞ + 1) ln(22 (M b2 )1/(3L) (∥l∥∞ + 1)B)}
2 2
≤ √
M
p
2
9(v − u) L(∥l∥∞ + 1) max{p, ln(4(M b)1/L (∥l∥∞ + 1)B)}
≤ √ .
M
Next observe that the fact that for all n ∈ N it holds that n ≤ 2n−1 and the fact that
∥l∥∞ ≥ 1 ensure that
4(∥l∥∞ + 1) ≤ 22 · 2(∥l∥∞ +1)−1 = 23 · 2(∥l∥∞ +1)−2 ≤ 32 · 3(∥l∥∞ +1)−2 = 3(∥l∥∞ +1) . (13.77)
483
Chapter 13: Strong generalization error estimates
484
Part V
485
Chapter 14
In Chapter 15 below we combine parts of the approximation error estimates from Part II,
parts of the optimization error estimates from Part III, and parts of the generalization
error estimates from Part IV to establish estimates for the overall error in the training of
ANNs in the specific situation of GD-type optimization methods with many independent
random initializations. For such a combined error analysis we employ a suitable overall
error decomposition for supervised learning problems. It is the subject of this chapter to
review and derive this overall error decomposition (see Proposition 14.2.1 below).
In the literature such kind of error decompositions can, for example, be found in [25, 35,
36, 87, 230]. The specific presentation of this chapter is strongly based on [25, Section 4.1]
and [230, Section 6.1].
Then
(i) it holds for all f ∈ L2 (PX ; R) that
r(f ) = E |f (X) − E[Y |X]|2 + E |Y − E[Y |X]|2 , (14.2)
and
487
Chapter 14: Overall error decomposition
Proof of Lemma 14.1.1. First, note that (14.1) shows that for all f ∈ L2 (PX ; R) it holds
that
r(f ) = E |f (X) − Y |2 = E |(f (X) − E[Y |X]) + (E[Y |X] − Y )|2
+ E |E[Y |X] − Y |2
Furthermore, observe that the tower rule demonstrates that for all f ∈ L2 (PX ; R) it holds
that
E f (X) − E[Y |X] E[Y |X] − Y
h i
= E E f (X) − E[Y |X] E[Y |X] − Y X
h i (14.6)
= E f (X) − E[Y |X] E E[Y |X] − Y X
= E f (X) − E[Y |X] E[Y |X] − E[Y |X] = 0.
Combining this with (14.5) establishes that for all f ∈ L2 (PX ; R) it holds that
r(f ) = E |f (X) − E[Y |X]|2 + E |E[Y |X] − Y |2 . (14.7)
Combining this with (14.7) and (14.8) proves items (i), (ii), and (iii). The proof of
Lemma 14.1.1 is thus complete.
Then
f ∈ L2 (PX ; R) : E(f ) = inf g∈L2 (PX ;R) E(g)
488
14.1. Bias-variance decomposition
Proof of Proposition 14.1.2. Note that Lemma 14.1.1 ensures that for all g ∈ L2 (PX ; R) it
holds that
E(g) = E |g(X) − E[Y |X]|2 + E |E[Y |X] − Y |2 . (14.12)
Combining this with (14.13) establishes (14.11). The proof of Proposition 14.1.2 is thus
complete.
Corollary 14.1.3. Let (Ω, F, P) be a probability space, let (S, S) be a measurable space,
let X : Ω → S be a random variable, let M = {(f : S → R) : f is S/B(R)-measurable}, let
φ ∈ M, and let E : M → [0, ∞) satisfy for all f ∈ M that
Then
Proof of Corollary 14.1.3. Note that (14.15) demonstrates that E(φ) = 0. Therefore, we
obtain that
inf E(g) = 0. (14.17)
g∈M
∈ M : E |f (X) − φ(X)|2 = 0
{f ∈ M : E(f ) = 0} = f
= f ∈ M : P {ω ∈ Ω : f (X(ω)) ̸= φ(X(ω))} = 0
(14.18)
∈ M : P X −1 ({x ∈ S : f (x) ̸= φ(x)}) = 0
= f
= {f ∈ M : PX ({x ∈ S : f (x) ̸= φ(x)}) = 0}.
489
Chapter 14: Overall error decomposition
(14.20)
PL
l0 = d, lL = 1, and d≥ i=1 li (li−1 + 1),
(cf. Definitions 3.3.4 and 4.4.1). Then it holds for all ϑ ∈ [−B, B]d that
Z
Θk ,l
|Nu,v (x) − E(x)|2 PX1 (dx)
D
ϑ,l (14.25)
(x) − E(x)|2 + 2 supθ∈[−B,B]d |R(θ) − R(θ)|
≤ supx∈D |Nu,v
+ min(k,n)∈{1,2,...,K}×T, ∥Θk,n ∥∞ ≤B [R(Θk,n ) − R(ϑ)].
Proof of Proposition 14.2.1. Throughout this proof, let r : L2 (PX1 ; R) → [0, ∞) satisfy for
all f ∈ L2 (PX1 ; R) that
r(f ) = E[|f (X1 ) − Y1 |2 ]. (14.26)
Observe that the assumption that for all ω ∈ Ω it holds that Y1 (ω) ∈ [u, v] and the fact
that for all θ ∈ Rd , x ∈ Rd it holds that Nu,v θ,l
(x) ∈ [u, v] imply that for all θ ∈ Rd it holds
that E[|Y1 |2 ] ≤ max{u2 , v 2 } < ∞ and
Z
θ,l
(x)|2 PX1 (dx) = E |Nu,v (14.27)
θ,l
(X1 )|2 ≤ max{u2 , v 2 } < ∞.
|Nu,v
D
490
14.2. Overall error decomposition
Item (iii) in Lemma 14.1.1 (applied with (Ω, F, P) ↶ (Ω, F, P), (S, S) ↶ (D, B(D)),
X ↶ X1 , Y ↶ (Ω ∋ ω 7→ Y1 (ω) ∈ R), r ↶ r, f ↶ Nu,v θ,l
|D , g ↶ Nu,v
ϑ,l
|D for θ, ϑ ∈ Rd in the
notation of item (iii) in Lemma 14.1.1) hence proves that for all θ, ϑ ∈ Rd it holds that
Z
θ,l
|Nu,v (x) − E(x)|2 PX1 (dx)
D
θ,l (14.28)
(X1 ) − E(X1 )|2 = E |Nu,v
θ,l
(X1 ) − E[Y1 |X1 ]|2
= E |Nu,v
ϑ,l
(X1 ) − E[Y1 |X1 ]|2 + r(Nu,v
θ,l ϑ,l
= E |Nu,v |D ) − r(Nu,v |D )
Combining this with (14.26) and (14.19) ensures that for all θ, ϑ ∈ Rd it holds that
Z
θ,l
|Nu,v (x) − E(x)|2 PX1 (dx)
D
(14.29)
ϑ,l
(X1 ) − E(X1 )|2 + E |Nu,v
θ,l
(X1 ) − Y1 |2 − E |Nu,v
ϑ,l
(X1 ) − Y1 |2
= E |Nu,v
Z
ϑ,l
= |Nu,v (x) − E(x)|2 PX1 (dx) + R(θ) − R(ϑ).
D
491
Chapter 14: Overall error decomposition
492
Chapter 15
In Part II we have established several estimates for the approximation error, in Part III
we have established several estimates for the optimization error, and in Part IV we have
established several estimates for the generalization error. In this chapter we employ the error
decomposition from Chapter 14 as well as parts of Parts II, III, and IV (see Proposition 4.4.12
and Corollaries 11.3.9 and 13.3.3) to establish estimates for the overall error in the training
of ANNs in the specific situation of GD-type optimization methods with many independent
random initializations.
In the literature such overall error analyses can, for instance, be found in [25, 226, 230].
The material in this chapter consist of slightly modified extracts from Jentzen & Welti [230,
Sections 6.2 and 6.3].
493
Chapter 15: Composed error estimates
is F/B([0, ∞])-measurable
(cf. Definition 4.4.1).
Proof of Lemma 15.1.1. Throughout this proof let Ξ : Ω → Rd satisfy for all ω ∈ Ω that
Ξ(ω) = Θk(ω) (ω). (15.3)
Observe that the assumption that Θk,n : Ω → Rd , k, n ∈ N0 , and k : Ω → (N0 )2 are random
variables implies that for all U ∈ B(Rd ) it holds that
Ξ−1 (U ) = {ω ∈ Ω : Ξ(ω) ∈ U } = {ω ∈ Ω : Θk(ω) (ω) ∈ U }
= ω ∈ Ω : ∃ k, n ∈ N0 : ([Θk,n (ω) ∈ U ] ∧ [k(ω) = (k, n)])
∞ S∞
(15.4)
S
= {ω ∈ Ω : Θk,n (ω) ∈ U } ∩ {ω ∈ Ω : k(ω) = (k, n)}
k=0 n=0
∞ S ∞
[(Θk,n )−1 (U )] ∩ [k−1 ({(k, n)})] ∈ F.
S
=
k=0 n=0
(cf. Definitions 3.3.4 and 4.4.1). This shows for all x ∈ Rd that
Rd ∋ θ 7→ Nu,v
θ,l
(x) ∈ R (15.7)
is continuous. Moreover, observe that the fact that for all θ ∈ Rd it holds that Nu,v
θ,l
∈
C(R , R) establishes that for all θ ∈ R it holds that Nu,v (x) is B(R )/B(R)-measurable.
d d θ,l d
This, (15.7), the fact that (Rd , ∥·∥∞ |Rd ) is a separable normed R-vector space, and
Lemma 11.2.6 prove item (i). Note that item (i) and (15.5) demonstrate that
Ω × Rd ∋ (ω, x) 7→ Nu,v
Θk(ω) (ω),l
(x) ∈ R (15.8)
is (F ⊗ B(Rd ))/B(R)-measurable. This implies item (ii). Observe that item (ii) and the
assumption that E : D → R is B(D)/B(R)-measurable ensure that for all p ∈ [0, ∞) it holds
that
Θk(ω) (ω),l
Ω × D ∋ (ω, x) 7→ |Nu,v (x) − E(x)|p ∈ [0, ∞) (15.9)
is (F ⊗ B(D))/B([0, ∞))-measurable. Tonelli’s theorem hence establishes item (iii). The
proof of Lemma 15.1.1 is thus complete.
494
15.1. Full strong error analysis for the training of ANNs
(cf. Definitions 3.3.4 and 4.4.1). Then it holds for all p ∈ (0, ∞) that
hZ p i1/p
Θk ,l
E |Nu,v (x) − E(x)|2 PX1 (dx)
D
4(v − u)bL(∥l∥∞ + 1)L cL max{1, p}
θ,l
(x) − E(x)|2 + (15.15)
≤ inf θ∈[−c,c]d supx∈D |Nu,v
K [L−1 (∥l∥∞ +1)−2 ]
18 max{1, (v − u)2 }L(∥l∥∞ + 1)2 max{p, ln(3M Bb)}
+ √
M
(cf. Lemma 15.1.1).
Proof of Proposition 15.1.2. Throughout this proof, let R : Rd → [0, ∞) satisfy for all
θ ∈ Rd that
θ,l
R(θ) = E[|Nu,v (X1 ) − Y1 |2 ]. (15.16)
Note that Proposition 14.2.1 shows that for all ϑ ∈ [−B, B]d it holds that
Z
Θk ,l
|Nu,v (x) − E(x)|2 PX1 (dx)
D
ϑ,l (15.17)
(x) − E(x)|2 + 2 supθ∈[−B,B]d |R(θ) − R(θ)|
≤ supx∈D |Nu,v
+ min(k,n)∈{1,2,...,K}×T, ∥Θk,n ∥∞ ≤B |R(Θk,n ) − R(ϑ)|.
495
Chapter 15: Composed error estimates
The assumption that ∞ k=1 Θk,0 (Ω) ⊆ [−B, B] and the assumption that 0 ∈ T therefore
d
S
prove that
Z
Θk ,l
|Nu,v (x) − E(x)|2 PX1 (dx)
D
ϑ,l
(x) − E(x)|2 + 2 supθ∈[−B,B]d |R(θ) − R(θ)|
≤ supx∈D |Nu,v
(15.18)
+ mink∈{1,2,...,K}, ∥Θk,0 ∥∞ ≤B |R(Θk,0 ) − R(ϑ)|
ϑ,l
(x) − E(x)|2 + 2 supθ∈[−B,B]d |R(θ) − R(θ)|
= supx∈D |Nu,v
+ mink∈{1,2,...,K} |R(Θk,0 ) − R(ϑ)|.
Minkowski’s inequality hence demonstrates that for all p ∈ [1, ∞), ϑ ∈ [−c, c]d ⊆ [−B, B]d
it holds that
hZ p i1/p
Θk ,l
E |Nu,v (x) − E(x)|2 PX1 (dx)
D
ϑ,l
1/p 1/p
(x) − E(x)|2p + 2 E supθ∈[−B,B]d |R(θ) − R(θ)|p
≤ E supx∈D |Nu,v
+ E mink∈{1,2,...,K} |R(Θk,0 ) − R(ϑ)|p
1/p (15.19)
ϑ,l
1/p
(x) − E(x)|2 + 2 E supθ∈[−B,B]d |R(θ) − R(θ)|p
≤ supx∈D |Nu,v
1/p
+ supθ∈[−c,c]d E mink∈{1,2,...,K} |R(Θk,0 ) − R(θ)|p
(cf. item (i) in Corollary 13.3.3 and item (i) in Corollary 11.3.9). Furthermore, observe that
Corollary 13.3.3 (applied with v ↶ max{u + 1, v}, R ↶ R|[−B,B]d , R ↶ R|[−B,B]d ×Ω in
the notation of Corollary 13.3.3) implies that for all p ∈ (0, ∞) it holds that
1/p
E supθ∈[−B,B]d |R(θ) − R(θ)|p
Combining this and (15.20) with (15.19) establishes that for all p ∈ [1, ∞) it holds that
hZ p i1/p
Θk ,l
E |Nu,v (x) − E(x)|2 PX1 (dx)
D
4(v − u)bL(∥l∥∞ + 1)L cL max{1, p}
θ,l
(x) − E(x)|2 + (15.22)
≤ inf θ∈[−c,c]d supx∈D |Nu,v
K [L−1 (∥l∥∞ +1)−2 ]
18 max{1, (v − u)2 }L(∥l∥∞ + 1)2 max{p, ln(3M Bb)}
+ √ .
M
In addition, observe that that Jensen’s inequality shows that for all p ∈ (0, ∞) it holds that
hZ p i1/p
Θk ,l 2
E |Nu,v (x) − E(x)| PX1 (dx)
D
Z 1
max{1,p} max{1,p} (15.23)
Θk ,l 2
≤ E |Nu,v (x) − E(x)| PX1 (dx)
D
This, (15.22), and the fact that ln(3M Bb) ≥ 1 prove that for all p ∈ (0, ∞) it holds that
hZ p i1/p
Θk ,l
E |Nu,v (x) − E(x)|2 PX1 (dx)
D
4(v − u)bL(∥l∥∞ + 1)L cL max{1, p}
θ,l
(x) − E(x)|2 + (15.24)
≤ inf θ∈[−c,c]d supx∈D |Nu,v
K [L−1 (∥l∥∞ +1)−2 ]
18 max{1, (v − u)2 }L(∥l∥∞ + 1)2 max{p, ln(3M Bb)}
+ √ .
M
The proof of Proposition 15.1.2 is thus complete.
1/p
a px
Lemma 15.1.3. Let a, x, p ∈ (0, ∞). Then axp ≤ exp e
.
Proof of Lemma 15.1.3. Note that the fact that for all y ∈ R it holds that y + 1 ≤ ey
demonstrates that
1/p p 1/p p 1/p
axp = (a /p x)p = e a e x − 1 + 1 = exp a e px . (15.25)
1
≤ e exp a e x − 1
497
Chapter 15: Composed error estimates
l0 = d, l1 ≥ A1(6d ,∞) (A), li ≥ 1(6d ,∞) (A) max{A/d − 2i + 3, 2}, and lL = 1, (15.28)
PL
let d ∈ N satisfy d ≥ i=1 li (li−1 + 1), let R : Rd × Ω → [0, ∞) satisfy for all θ ∈ Rd that
M
1 P
R(θ) = θ,l 2
|N (Xj ) − Yj | , (15.29)
M j=1 u,v
let L ∈ R satisfy for all x, y ∈ [a, b]d that |E(x) − E(y)| ≤ L∥x − y∥1 , let K ∈ N, c ∈
[max{1, L, |a|, |b|, 2|u|, 2|v|},
S∞ ∞), B ∈ [c, ∞), for every k, n ∈ N0 let Θk,n : Ω → Rd be a
random variable, assume k=1 Θk,0 (Ω) ⊆ [−B, B]d , assume that Θk,0 , k ∈ {1, 2, . . . , K},
are i.i.d., assume that Θ1,0 is continuously uniformly distributed on [−c, c]d , let N ∈ N,
T ⊆ {0, 1, . . . , N } satisfy 0 ∈ T, let k : Ω → (N0 )2 be a random variable, and assume for
all ω ∈ Ω that
(cf. Definitions 3.3.4 and 4.4.1). Then it holds for all p ∈ (0, ∞) that
hZ p i1/p
Θk ,l 2
E d
|Nu,v (x) − E(x)| P X1 (dx)
[a,b]
2 4
36d c 4L(∥l∥∞ + 1)L cL+2 max{1, p}
≤ + (15.33)
A 2/d
K [L−1 (∥l∥∞ +1)−2 ]
23B 3 L(∥l∥∞ + 1)2 max{p, ln(eM )}
+ √
M
Proof of Theorem 15.1.5. Note that the assumption that for all x, y ∈ [a, b]d it holds
that |E(x) − E(y)| ≤ L∥x − y∥1 establishes that E : [a, b]d → [u, v] is B([a, b]d )/B([u, v])-
measurable. Proposition 15.1.2 (applied with b ↶ max{1, |a|, |b|}, D ↶ [a, b]d in the
498
15.1. Full strong error analysis for the training of ANNs
notation of Proposition 15.1.2) hence shows that for all p ∈ (0, ∞) it holds that
hZ p i1/p
Θk ,l 2
E d
|Nu,v (x) − E(x)| PX1 (dx)
[a,b] θ,l
(x) − E(x)|2
≤ inf θ∈[−c,c]d supx∈[a,b]d |Nu,v
4(v − u) max{1, |a|, |b|}L(∥l∥∞ + 1)L cL max{1, p} (15.34)
+
K [L−1 (∥l∥∞ +1)−2 ]
18 max{1, (v − u)2 }L(∥l∥∞ + 1)2 max{p, ln(3M B max{1, |a|, |b|})}
+ √ .
M
The fact that max{1, |a|, |b|} ≤ c therefore proves that for all p ∈ (0, ∞) it holds that
hZ p i1/p
Θk ,l 2
E d
|Nu,v (x) − E(x)| PX1 (dx)
[a,b] θ,l
(x) − E(x)|2
≤ inf θ∈[−c,c]d supx∈[a,b]d |Nu,v
4(v − u)L(∥l∥∞ + 1)L cL+1 max{1, p} (15.35)
+ [L −1 (∥l∥ +1)−2 ]
K ∞
499
Chapter 15: Composed error estimates
Moreover, note that the fact that max{1, L, |a|, |b|} ≤ c and (b−a)2 ≤ (|a|+|b|)2 ≤ 2(a2 +b2 )
shows that
9L2 (b − a)2 ≤ 18c2 (a2 + b2 ) ≤ 18c2 (c2 + c2 ) = 36c4 . (15.40)
In addition, observe that the fact that B ≥ c ≥ 1, the fact that M ≥ 1, and Lemma 15.1.4
prove that ln(3M Bc) ≤ 23B 18
ln(eM ). This, (15.40), the fact that (v − u) ≤ 2 max{|u|, |v|} =
max{2|u|, 2|v|} ≤ c ≤ B, and the fact that B ≥ 1 demonstrate that for all p ∈ (0, ∞) it
holds that
9d2 L2 (b − a)2 4(v − u)L(∥l∥∞ + 1)L cL+1 max{1, p}
+
A2/d K [L−1 (∥l∥∞ +1)−2 ]
18 max{1, (v − u)2 }L(∥l∥∞ + 1)2 max{p, ln(3M Bc)}
+ √
M
2 4 L L+2
(15.41)
36d c 4L(∥l∥∞ + 1) c max{1, p}
≤ +
A2/d K [L−1 (∥l∥∞ +1)−2 ]
23B L(∥l∥∞ + 1)2 max{p, ln(eM )}
3
+ √ .
M
Combining this with (15.39) implies (15.33). The proof of Theorem 15.1.5 is thus complete.
let L ∈ R satisfy for all x, y ∈ [a, b]d that |E(x) − E(y)| ≤ L∥x − y∥1 , let K ∈ N, c ∈
[max{1, L, |a|, |b|, 2|u|, 2|v|},
S∞ ∞), B ∈ [c, ∞), for every k, n ∈ N0 let Θk,n : Ω → Rd be a
random variable, assume k=1 Θk,0 (Ω) ⊆ [−B, B]d , assume that Θk,0 , k ∈ {1, 2, . . . , K},
are i.i.d., assume that Θ1,0 is continuously uniformly distributed on [−c, c]d , let N ∈ N,
T ⊆ {0, 1, . . . , N } satisfy 0 ∈ T, let k : Ω → (N0 )2 be a random variable, and assume for
all ω ∈ Ω that
500
15.1. Full strong error analysis for the training of ANNs
(cf. Definitions 3.3.4 and 4.4.1). Then it holds for all p ∈ (0, ∞) that
hZ p/2 i1/p
Θk ,l 2
E d
|Nu,v (x) − E(x)| P X1 (dx)
[a,b]
6dc2 2L(∥l∥∞ + 1)L cL+1 max{1, p}
≤ + (15.47)
[min({L} ∪ {li : i ∈ N ∩ [0, L)})]1/d K [(2L)−1 (∥l∥∞ +1)−2 ]
5B 2 L(∥l∥∞ + 1) max{p, ln(eM )}
+
M 1/4
L ≥ A = A − 1 + 1 ≥ (A − 1)1[2,∞) (A) + 1
A1 (15.49)
≥ A − A2 1[2,∞) (A) + 1 = [2,∞) + 1 ≥ A1(6 2d
(A) d ,∞) (A)
2
+ 1.
Furthermore, observe that the assumption that lL = 1 and (15.48) establish that
l1 = l1 1{1} (L) + l1 1[2,∞) (L) ≥ 1{1} (L) + A1[2,∞) (L) = A ≥ A1(6d ,∞) (A). (15.50)
Moreover, note that (15.48) shows that for all i ∈ {2, 3, 4, . . .} ∩ [0, L) it holds that
Combining this, (15.49), and (15.50) with Theorem 15.1.5 (applied with p ↶ p/2 for
p ∈ (0, ∞) in the notation of Theorem 15.1.5) proves that for all p ∈ (0, ∞) it holds that
hZ p/2 i2/p
Θk ,l 2
E d
|Nu,v (x) − E(x)| P X1 (dx)
[a,b]
36d2 c4 4L(∥l∥∞ + 1)L cL+2 max{1, p/2}
≤ + (15.52)
A2/d K [L−1 (∥l∥∞ +1)−2 ]
23B 3 L(∥l∥∞ + 1)2 max{p/2, ln(eM )}
+ √ .
M
This, (15.48), and the fact that L ≥ 1, c ≥ 1, B ≥ 1, and ln(eM ) ≥ 1 demonstrate that for
501
Chapter 15: Composed error estimates
let L ∈ R satisfy for all x, y ∈ [a, b]d that |E(x) − E(y)| ≤ L∥x − y∥1 , let (Jn )n∈N ⊆ N, for
every k, n ∈ N let G k,n : Rd × Ω → Rd satisfy for all ω ∈ Ω, θ ∈ {ϑ ∈ Rd : (RJk,n n
(·, ω) :
d
R → [0, ∞) is differentiable at ϑ)} that
let K ∈ N, c ∈ [max{1, L, |a|, |b|, 2|u|, 2|v|},S∞), B ∈ [c, ∞), for every k, n ∈ N0 let
Θk,n : Ω → Rd be a random variable, assume ∞ d
k=1 Θk,0 (Ω) ⊆ [−B, B] , assume that Θk,0 ,
502
15.2. Full strong error analysis with optimization via SGD with random initializations
k ∈ {1, 2, . . . , K}, are i.i.d., assume that Θ1,0 is continuously uniformly distributed on
[−c, c]d , let (γn )n∈N ⊆ R satisfy for all k, n ∈ N that
Θk,n = Θk,n−1 − γn G k,n (Θk,n−1 ), (15.58)
let N ∈ N, T ⊆ {0, 1, . . . , N } satisfy 0 ∈ T, let k : Ω → (N0 )2 be a random variable, and
assume for all ω ∈ Ω that
k(ω) ∈ {(k, n) ∈ {1, 2, . . . , K} × T : ∥Θk,n (ω)∥∞ ≤ B} (15.59)
and R(Θk(ω) (ω)) = min(k,n)∈{1,2,...,K}×T, ∥Θk,n (ω)∥∞ ≤B R(Θk,n (ω)) (15.60)
(cf. Definitions 3.3.4 and 4.4.1). Then it holds for all p ∈ (0, ∞) that
hZ p/2 i1/p
Θk ,l 2
E d
|Nu,v (x) − E(x)| P 0,0 (dx)
X1
[a,b]
6dc2 2L(∥l∥∞ + 1)L cL+1 max{1, p}
≤ + (15.61)
[min({L} ∪ {li : i ∈ N ∩ [0, L)})]1/d
K [(2L)−1 (∥l∥∞ +1)−2 ]
5B 2 L(∥l∥∞ + 1) max{p, ln(eM )}
+
M 1/4
(cf. Lemma 15.1.1).
Proof of Corollary 15.2.1. Note that Corollary 15.1.6 (applied with (Xj )j∈N ↶ (Xj0,0 )j∈N ,
(Yj )j∈N ↶ (Yj0,0 )j∈N , R ↶ RM0,0
in the notation of Corollary 15.1.6) implies (15.61). The
proof of Corollary 15.2.1 is thus complete.
Corollary 15.2.2. Let (Ω, F, P) be a probability space, let M, d ∈ N, a, u ∈ R, b ∈
(a, ∞), v ∈ (u, ∞), for every k, n, j ∈ N0 let Xjk,n : Ω → [a, b]d and Yjk,n : Ω → [u, v] be
random variables, assume that (Xj0,0 , Yj0,0 ), j ∈ {1, 2, . . . , M }, are i.i.d., let d, L ∈ N,
l = (l0 , l1 , . . . , lL ) ∈ NL+1 satisfy
d ≥ Li=1 li (li−1 + 1), (15.62)
P
l0 = d, lL = 1, and
for every k, n ∈ N0 , J ∈ N let RJk,n : Rd × Ω → [0, ∞) satisfy for all θ ∈ Rd that
J
1 P
k,n
RJ (θ) = θ,l k,n k,n 2
|N (X ) − Yj | , (15.63)
J j=1 u,v j
503
Chapter 15: Composed error estimates
let K ∈ N, c ∈ [max{1, L, |a|, |b|, 2|u|, 2|v|},S∞), B ∈ [c, ∞), for every k, n ∈ N0 let
Θk,n : Ω → Rd be a random variable, assume ∞ d
k=1 Θk,0 (Ω) ⊆ [−B, B] , assume that Θk,0 ,
k ∈ {1, 2, . . . , K}, are i.i.d., assume that Θ1,0 is continuously uniformly distributed on
[−c, c]d , let (γn )n∈N ⊆ R satisfy for all k, n ∈ N that
This and Corollary 15.2.1 (applied with p ↶ 1 in the notation of Corollary 15.2.1) establish
(15.69). The proof of Corollary 15.2.2 is thus complete.
Corollary 15.2.3. Let (Ω, F, P) be a probability space, M, d ∈ N, for every k, n, j ∈ N0
let Xjk,n : Ω → [0, 1]d and Yjk,n : Ω → [0, 1] be random variables, assume that (Xj0,0 , Yj0,0 ),
j ∈ {1, 2, . . . , M }, are i.i.d., for every k, n ∈ N0 , J ∈ N let RJk,n : Rd × Ω → [0, ∞) satisfy
for all θ ∈ Rd that
J
1 P
k,n
RJ (θ, ω) = θ,l k,n k,n
|N (X (ω)) − Yj (ω)| , 2
(15.71)
J j=1 0,1 j
(15.72)
PL
l0 = d, lL = 1, and d≥ i=1 li (li−1 + 1),
504
15.2. Full strong error analysis with optimization via SGD with random initializations
let c ∈ [2, ∞), satisfy for all x, y ∈ [0, 1]d that |E(x) − E(y)| ≤ c∥x − y∥1 , let (Jn )n∈N ⊆ N,
for every k, n ∈ N let G k,n : Rd × Ω → Rd satisfy for all ω ∈ Ω, θ ∈ {ϑ ∈ Rd : (RJk,n n
(·, ω) :
d
R → [0, ∞) is differentiable at ϑ)} that
505
Chapter 15: Composed error estimates
506
Part VI
507
Chapter 16
Deep learning methods have not only become very popular for data-driven learning problems,
but are nowadays also heavily used for solving mathematical equations such as ordinary and
partial differential equations (cf., for example, [119, 187, 347, 379]). In particular, we refer
to the overview articles [24, 56, 88, 145, 237, 355] and the references therein for numerical
simulations and theoretical investigations for deep learning methods for PDEs.
Often deep learning methods for PDEs are obtained, first, by reformulating the PDE
problem under consideration as an infinite dimensional stochastic optimization problem,
then, by approximating the infinite dimensional stochastic optimization problem through
finite dimensional stochastic optimization problems involving deep ANNs as approximations
for the PDE solution and/or its derivatives, and thereafter, by approximately solving the
resulting finite dimensional stochastic optimization problems through SGD-type optimization
methods.
Among the most basic schemes of such deep learning methods for PDEs are PINNs
and DGMs; see [347, 379]. In this chapter we present in Theorem 16.1.1 in Section 16.1 a
reformulation of PDE problems as stochastic optimization problems, we use the theoretical
considerations from Section 16.1 to briefly sketch in Section 16.2 a possible derivation of
PINNs and DGMs, and we present in Sections 16.3 and 16.4 numerical simulations for
PINNs and DGMs. For simplicity and concreteness we restrict ourselves in this chapter
to the case of semilinear heat PDEs. The specific presentation of this chapter is based on
Beck et al. [24].
509
Chapter 16: Physics-informed neural networks (PINNs)
let f : R → R be Lipschitz continuous, and let L : C 1,2 ([0, T ] × Rd , R) → [0, ∞] satisfy for
all v = (v(t, x))(t,x)∈[0,T ]×Rd ∈ C 1,2 ([0, T ] × Rd , R) that
Proof of Theorem 16.1.1. Observe that (16.2) proves that for all v ∈ C 1,2 ([0, T ] × Rd , R)
with ∀ x ∈ Rd : u(0, x) = g(x) and ∀ t ∈ [0, T ], x ∈ Rd : ∂u
∂t
(t, x) = (∆x u)(t, x) + f (u(t, x))
it holds that
L(v) = 0. (16.4)
This and the fact that for all v ∈ C 1,2 ([0, T ] × Rd , R) it holds that L(v) ≥ 0 establish that
((ii) → (i)). Note that the assumption that f is Lipschitz continuous, the assumption that
g is twice continuously differentiable, and the assumption that g has at most polynomially
growing partial derivatives demonstrate that there exists v ∈ C 1,2 ([0, T ] × Rd , R) which
satisfies for all t ∈ [0, T ], x ∈ Rd that v(0, x) = g(x) and
∂v
(16.5)
∂t
(t, x) = (∆x v)(t, x) + f (v(t, x))
(cf., for instance, Beck et al. [23, Corollary 3.4]). This and (16.4) show that
510
16.2. Derivation of PINNs and deep Galerkin methods (DGMs)
Furthermore, observe that (16.2), (16.1), and the assumption that T and X are independent
imply that for all v ∈ C 1,2 ([0, T ] × Rd , R) it holds that
Z
2
|v(0, x) − g(x)|2 + ∂v
L(v) = ∂t
(t, x) − (∆ x v)(t, x) − f (v(t, x)) t(t)x(x) d(t, x).
[0,T ]×Rd
(16.7)
The assumption that t and x are continuous and the fact that for all t ∈ [0, T ], x ∈ Rd
it holds that t(t) ≥ 0 and x(x) ≥ 0 therefore ensure that for all v ∈ C 1,2 ([0, T ] × Rd , R),
t ∈ [0, T ], x ∈ Rd with L(v) = 0 it holds that
2
|v(0, x) − g(x)|2 + ∂v (16.8)
∂t
(t, x) − (∆ x v)(t, x) − f (v(t, x)) t(t)x(x) = 0.
This and the assumption that for all t ∈ [0, T ], x ∈ Rd it holds that t(t) > 0 and x(x) > 0
show that for all v ∈ C 1,2 ([0, T ] × Rd , R), t ∈ [0, T ], x ∈ Rd with L(v) = 0 it holds that
2
|v(0, x) − g(x)|2 + ∂v (16.9)
∂t
(t, x) − (∆x v)(t, x) − f (v(t, x)) = 0.
Combining this with (16.6) proves that ((i) → (ii)). The proof of Theorem 16.1.1 is thus
complete.
511
Chapter 16: Physics-informed neural networks (PINNs)
and let L : C 1,2 ([0, T ]×Rd , R) → [0, ∞] satisfy for all v = (v(t, x))(t,x)∈[0,T ]×Rd ∈ C 1,2 ([0, T ]×
Rd , R) that
Observe that Theorem 16.1.1 assures that the unknown function u satisfies
L(u) = 0 (16.13)
∂NMθ,d+1,M (16.14)
a,l2 ,...,Ma,lh ,idR θ,d+1
a,l1
+ ∂t
(T , X ) − ∆x NM a,l1 ,Ma,l2 ,...,Ma,lh ,idR
(T , X )
θ,d+1
2
− f NM a,l ,Ma,l ,...,Ma,lh ,idR (T , X ))
1 2
(cf. Definitions 1.1.3 and 1.2.1). We can now compute an approximate minimizer of the
function L by computing an approximate minimizer ϑ ∈ Rd of the function L and employing
the realization NM ϑ,d+1
a,l1 ,Ma,l2 ,...,Ma,lh ,idR
of the ANN associated to this approximate minimizer
as an approximate minimizer of L.
The third and last step of this derivation is to approximately compute such an ap-
proximate minimizer of L by means of SGD-type optimization methods. We now sketch
this in the case of the plain-vanilla SGD optimization method (cf. Definition 7.2.1). Let
ξ ∈ Rd , J ∈ N, (γn )n∈N ⊆ [0, ∞), for every n ∈ N, j ∈ {1, 2, . . . , J} let Tn,j : Ω → [0, T ] and
Xn,j : Ω → Rd be random variables, assume for all n ∈ N, j ∈ {1, 2, . . . , J}, A ∈ B([0, T ]),
B ∈ B(Rd ) that
512
16.3. Implementation of PINNs
Finally, the idea of PINNs and DGMs is then to choose for large enough n ∈ N the
realization NMΘn ,d+1
a,l ,Ma,l ,...,Ma,l ,idR
as an approximation
1 2 h
Θn ,d+1
NM a,l ,Ma,l
1 2
,...,Ma,lh ,idR ≈u (16.18)
with u(0, x) = sin(∥x∥22 ) for t ∈ [0, 3], x ∈ R2 . This implementation follows the original
proposal in Raissi et al. [347] in that it first chooses 20000 realizations of the random variable
513
Chapter 16: Physics-informed neural networks (PINNs)
514
16.3. Implementation of PINNs
40 x = x_data [ indices , :]
41 t = t_data [ indices , :]
42
43 x1 , x2 = x [: , 0:1] , x [: , 1:2]
44
45 x1 . requires_grad_ ()
46 x2 . requires_grad_ ()
47 t . requires_grad_ ()
48
49 optimizer . zero_grad ()
50
51 # Denoting by u the realization function of the ANN , compute
52 # u (0 , x ) for each x in the batch
53 u0 = N ( torch . hstack (( torch . zeros_like ( t ) , x ) ) )
54 # Compute the loss for the initial condition
55 initial_loss = ( u0 - phi ( x ) ) . square () . mean ()
56
57 # Compute the partial derivatives using automatic
58 # differentiation
59 u = N ( torch . hstack (( t , x1 , x2 ) ) )
60 ones = torch . ones_like ( u )
61 u_t = grad (u , t , ones , create_graph = True ) [0]
62 u_x1 = grad (u , x1 , ones , create_graph = True ) [0]
63 u_x2 = grad (u , x2 , ones , create_graph = True ) [0]
64 ones = torch . ones_like ( u_x1 )
65 u_x1x1 = grad ( u_x1 , x1 , ones , create_graph = True ) [0]
66 u_x2x2 = grad ( u_x2 , x2 , ones , create_graph = True ) [0]
67
68 # Compute the loss for the PDE
69 Laplace = u_x1x1 + u_x2x2
70 pde_loss = ( u_t - (0.005 * Laplace + u - u **3) ) . square () . mean ()
71
72 # Compute the total loss and perform a gradient step
73 loss = initial_loss + pde_loss
74 loss . backward ()
75 optimizer . step ()
76
77
78 # ## Plot the solution at different times
79
80 mesh = 128
81 a , b = -3 , 3
82
83 gs = GridSpec (2 , 4 , width_ratios =[1 , 1 , 1 , 0.05])
84 fig = plt . figure ( figsize =(16 , 10) , dpi =300)
85
86 x , y = torch . meshgrid (
87 torch . linspace (a , b , mesh ) ,
88 torch . linspace (a , b , mesh ) ,
515
Chapter 16: Physics-informed neural networks (PINNs)
with u(0, x) = sin(x1 ) sin(x2 ) for t ∈ [0, 3], x = (x1 , x2 ) ∈ R2 . As originally proposed
in Sirignano & Spiliopoulos [379], this implementation chooses for each training step a
batch of 256 realizations of the random variable (T , X ), where T is continuously uniformly
distributed on [0, 3] and where X is normally distributed on R2 with mean 0 ∈ R2 and
covariance 4 I2 ∈ R2×2 (cf. Definition 1.5.5). Like the PINN implementation in Source
code 16.1, it trains a fully connected feed-forward ANN with 4 hidden layers (with 50
516
16.4. Implementation of DGMs
2 2 2 1.0
1 1 1
0 0 0
1 1 1 0.5
2 2 2
3 3 3
3 2 1 0 1 2 3 3 2 1 0 1 2 3 3 2 1 0 1 2 3
0.0
t = 1.8 t = 2.4 t = 3.0
3 3 3
2 2 2
1 1 1 0.5
0 0 0
1 1 1
2 2 2 1.0
3 3 3
3 2 1 0 1 2 3 3 2 1 0 1 2 3 3 2 1 0 1 2 3
Figure 16.1 (plots/pinn.pdf): Plots for the functions [−3, 3]2 ∋ x 7→ U (t, x) ∈ R,
where t ∈ {0, 0.6, 1.2, 1.8, 2.4, 3} and where U ∈ C([0, 3] × R2 , R) is an approximation
of the
function u ∈ C 1,2 ([0, 3] × R2 , R) which satisfies for all t ∈ [0, 3], x ∈ R2 that
∂u
∂t
(t, x) = 200 (∆x u)(t, x) + u(t, x) − [u(t, x)]3 and u(0, x) = sin(∥x∥22 ) computed by
1
means of the PINN method as implemented in Source code 16.1 (cf. Definition 3.3.4).
neurons on each hidden layer) and using the swish activation function with parameter 1 (cf.
Section 1.2.8). The training is performed using the Adam SGD optimization method (cf.
Section 7.9). A plot of the resulting approximation of the solution u after 30000 training
steps is shown in Figure 16.2.
1 import torch
2 import matplotlib . pyplot as plt
3 from torch . autograd import grad
4 from matplotlib . gridspec import GridSpec
5 from matplotlib . cm import ScalarMappable
6
7
8 dev = torch . device ( " cuda :0 " if torch . cuda . is_available () else
9 " cpu " )
10
11 T = 3.0 # the time horizom
517
Chapter 16: Physics-informed neural networks (PINNs)
12
13 # The initial value
14 def phi ( x ) :
15 return x . sin () . prod ( axis =1 , keepdims = True )
16
17 torch . manual_seed (0)
18
19 # We use a network with 4 hidden layers of 50 neurons each and the
20 # Swish activation function ( called SiLU in PyTorch )
21 N = torch . nn . Sequential (
22 torch . nn . Linear (3 , 50) , torch . nn . SiLU () ,
23 torch . nn . Linear (50 , 50) , torch . nn . SiLU () ,
24 torch . nn . Linear (50 , 50) , torch . nn . SiLU () ,
25 torch . nn . Linear (50 , 50) , torch . nn . SiLU () ,
26 torch . nn . Linear (50 , 1) ,
27 ) . to ( dev )
28
29 optimizer = torch . optim . Adam ( N . parameters () , lr =3 e -4)
30
31 J = 256 # the batch size
32
45 optimizer . zero_grad ()
46
47 # Denoting by u the realization function of the ANN , compute
48 # u (0 , x ) for each x in the batch
49 u0 = N ( torch . hstack (( torch . zeros_like ( t ) , x ) ) )
50 # Compute the loss for the initial condition
51 initial_loss = ( u0 - phi ( x ) ) . square () . mean ()
52
53 # Compute the partial derivatives using automatic
54 # differentiation
55 u = N ( torch . hstack (( t , x1 , x2 ) ) )
56 ones = torch . ones_like ( u )
57 u_t = grad (u , t , ones , create_graph = True ) [0]
58 u_x1 = grad (u , x1 , ones , create_graph = True ) [0]
59 u_x2 = grad (u , x2 , ones , create_graph = True ) [0]
60 ones = torch . ones_like ( u_x1 )
518
16.4. Implementation of DGMs
76 mesh = 128
77 a , b = - torch . pi , torch . pi
78
79 gs = GridSpec (2 , 4 , width_ratios =[1 , 1 , 1 , 0.05])
80 fig = plt . figure ( figsize =(16 , 10) , dpi =300)
81
82 x , y = torch . meshgrid (
83 torch . linspace (a , b , mesh ) ,
84 torch . linspace (a , b , mesh ) ,
85 indexing = " xy "
86 )
87 x = x . reshape (( mesh * mesh , 1) ) . to ( dev )
88 y = y . reshape (( mesh * mesh , 1) ) . to ( dev )
89
90 for i in range (6) :
91 t = torch . full (( mesh * mesh , 1) , i * T / 5) . to ( dev )
92 z = N ( torch . cat (( t , x , y ) , 1) )
93 z = z . detach () . cpu () . numpy () . reshape (( mesh , mesh ) )
94
95 ax = fig . add_subplot ( gs [ i // 3 , i % 3])
96 ax . set_title ( f " t = { i * T / 5} " )
97 ax . imshow (
98 z , cmap = " viridis " , extent =[ a , b , a , b ] , vmin = -1.2 , vmax =1.2
99 )
100
101 # Add the colorbar to the figure
102 norm = plt . Normalize ( vmin = -1.2 , vmax =1.2)
103 sm = ScalarMappable ( cmap = " viridis " , norm = norm )
104 cax = fig . add_subplot ( gs [: , 3])
105 fig . colorbar ( sm , cax = cax , orientation = ’ vertical ’)
106
107 fig . savefig ( " ../ plots / dgm . pdf " , bbox_inches = " tight " )
519
Chapter 16: Physics-informed neural networks (PINNs)
u(t, x) − [u(t, x)]3 and u(0, x) = sin(x1 ) sin(x2 ). The plot created by this code is
shown in Figure 16.2.
2 2 2 1.0
1 1 1
0 0 0
1 1 1 0.5
2 2 2
3 3 3
3 2 1 0 1 2 3 3 2 1 0 1 2 3 3 2 1 0 1 2 3
0.0
t = 1.8 t = 2.4 t = 3.0
3 3 3
2 2 2
1 1 1 0.5
0 0 0
1 1 1
2 2 2 1.0
3 3 3
3 2 1 0 1 2 3 3 2 1 0 1 2 3 3 2 1 0 1 2 3
Figure 16.2 (plots/dgm.pdf): Plots for the functions [−π, π]2 ∋ x 7→ U (t, x) ∈ R,
where t ∈ {0, 0.6, 1.2, 1.8, 2.4, 3} and where U ∈ C([0, 3] × R2 , R) is an approximation
of the function u ∈ C 1,2 ([0, 3]×R2 , R) which satisfies for all t ∈ [0, 3], x = (x1 , x2 ) ∈ R2
that u(0, x) = sin(x1 ) sin(x2 ) and ∂u 1
∂t
(t, x) = 200 (∆x u)(t, x) + u(t, x) − [u(t, x)]3
computed by means of Source code 16.2.
520
Chapter 17
The PINNs and the DGMs presented in Chapter 16 do, on the one hand, not exploit a lot
of structure of the underlying PDE in the process of setting up the associated stochastic
optimization problems and have as such the key advantage to be very widely applicable
deep learning methods for PDEs. On the other hand, deep learning methods for PDEs that
in some way exploit the specific structure of the considered PDE problem often result in
more accurate approximations (cf., for example, Beck et al. [24] and the references therein).
In particular, there are several deep learning approximation methods in the literature which
exploit in the process of setting up stochastic optimization problems that the PDE itself
admits a stochastic representation. In the literature there are a lot of deep learning methods
which are based on such stochastic formulations of PDEs and therefore have a strong link
to stochastic analysis and formulas of the Feynman–Kac-type (cf., for instance, [20, 119,
145, 187, 207, 336] and the references therein).
The schemes in Beck et al. [19], which we refer to as DKMs, belong to the simplest of
such deep learning methods for PDEs. In this chapter we present in Sections 17.1, 17.2,
17.3, and 17.4 theoretical considerations leading to a reformulation of heat PDE problems
as stochastic optimization problems (see Proposition 17.4.1 below), we use these theoretical
considerations to derive DKMs in the specific case of heat equations in Section 17.5, and we
present an implementation of DKMs in the case of a simple two-dimensional heat equation
in Section 17.6.
Sections 17.1 and 17.2 are slightly modified extracts from Beck et al. [18], Section 17.3
is inspired by Beck et al. [23, Section 2], and Sections 17.4 and 17.5 are inspired by Beck et
al. [18].
521
Chapter 17: Deep Kolmogorov methods (DKMs)
and
(iii) it holds that
E |X − E[X]|2 = inf E |X − y|2 . (17.3)
y∈R
Proof of Lemma 17.1.1. Note that Lemma 7.2.3 establishes item (i). Observe that item (i)
proves items (ii) and (iii). The proof of Lemma 17.1.1 is thus complete.
and
(ii) it holds for all x ∈ [a, b]d that u(x) = E[Xx ].
Proof of Proposition 17.2.1. Note that item (i) in Lemma 17.1.1 and the assumption that for
all x ∈ [a, b]d it holds that E[|Xx |2 ] < ∞ demonstrate that for every function u : [a, b]d → R
and every x ∈ [a, b]d it holds that
522
17.2. Stochastic optimization problems for expectations of random fields
Fubini’s theorem (see, for example, Klenke [248, Theorem 14.16]) hence implies that for all
u ∈ C([a, b]d , R) it holds that
Z Z Z
2 2
|E[Xx ] − u(x)|2 dx. (17.6)
E |Xx − u(x)| dx = E |Xx − E[Xx ]| dx +
[a,b]d [a,b]d [a,b]d
The assumption that [a, b]d ∋ x 7→ E[Xx ] ∈ R is continuous therefore shows that
Z Z
2 2
E |Xx − E[Xx ]| dx ≥ inf E |Xx − E[Xx ]| dx
[a,b]d v∈C([a,b]d ,R) [a,b]d
Z (17.8)
E |Xx − E[Xx ]|2 dx.
=
[a,b]d
The fact that the function [a, b]d ∋ x 7→ E[Xx ] ∈ R is continuous therefore establishes that
there exists u ∈ C([a, b]d , R) such that
Z Z
2 2
(17.10)
E |Xx − u(x)| dx = inf E |Xx − v(x)| dx .
[a,b]d v∈C([a,b]d ,R) [a,b]d
Furthermore, observe that (17.6) and (17.9) prove that for all u ∈ C([a, b]d , R) with
Z Z
2 2
(17.11)
E |Xx − u(x)| dx = inf E |Xx − v(x)| dx
[a,b]d v∈C([a,b]d ,R) [a,b]d
it holds that
Z
E |Xx − E[Xx ]|2 dx
[a,b]d
Z Z
2
E |Xx − u(x)|2 dx (17.12)
= inf E |Xx − v(x)| dx =
v∈C([a,b]d ,R) [a,b]d [a,b]d
Z Z
E |Xx − E[Xx ]|2 dx + |E[Xx ] − u(x)|2 dx.
=
[a,b]d [a,b]d
523
Chapter 17: Deep Kolmogorov methods (DKMs)
it holds that Z
|E[Xx ] − u(x)|2 dx = 0. (17.14)
[a,b]d
This and the assumption that [a, b]d ∋ x 7→ E[Xx ] ∈ R is continuous demonstrate that for
all y ∈ [a, b]d , u ∈ C([a, b]d , R) with
Z Z
2 2
(17.15)
E |Xx − u(x)| dx = inf E |Xx − v(x)| dx
[a,b]d v∈C([a,b]d ,R) [a,b]d
it holds that u(y) = E[Xy ]. Combining this with (17.10) establishes items (i) and (ii). The
proof of Proposition 17.2.1 is thus complete.
Proof of Lemma 17.3.1. Note that, for instance, the variant of Lebesgue’s theorem on
dominated convergence in Klenke [248, Corollary 6.26] proves items (i), (ii), and (iii). The
proof of Lemma 17.3.1 is thus complete.
524
17.3. Feynman–Kac formulas
Then
(i) it holds that u ∈ C 1,2 ([0, T ] × Rd , R) and
and for√every t ∈ [0, T ], x ∈ Rd let ψt,x : Rm → R, satisfy for all y ∈ Rm that ψt,x (y) =
φ(x + tBy). Note that the assumption that φ ∈ C 2 (Rd , R), the chain rule, Lemma 17.3.1,
and (17.17) imply that
(I) for all x ∈ Rd it holds that (0, T ] ∋ t 7→ u(t, x) ∈ R is differentiable,
and
(cf. Definition 1.4.7). Note that items (III) and (IV), the assumption that φ ∈ C 2 (Rd , R),
the assumption that
∂2
Pd ∂
(17.23)
supx∈Rd i,j=1 φ(x) + | ∂xi φ (x)| + ∂xi ∂xj
φ (x) < ∞,
the fact that E ∥Z∥2 < ∞, and Lemma 17.3.1 ensure that
525
Chapter 17: Deep Kolmogorov methods (DKMs)
and
[0, T ] × Rd ∋ (t, x) 7→ (Hessx u)(t, x) ∈ Rd×d (17.25)
are continuous (cf. Definition 3.3.4). Furthermore, observe that item (IV) and the fact
that for all X ∈ Rm×d , Y ∈ Rd×m it holds that Trace(XY ) = Trace(Y X) show that for all
t ∈ (0, T ], x ∈ Rd it holds that
1 ∗
h
1 ∗
√ i
2
Trace BB (Hess x u)(t, x) = E 2
Trace BB (Hess φ)(x + tBZ)
√ √
h m
1 ∗
i 1 P ∗
= 2 E Trace B (Hess φ)(x + tBZ)B = 2 E ⟨ek , B (Hess φ)(x + tBZ)Bek ⟩
k=1
√ √
m m
1
P 1
P ′′
= 2E ⟨Bek , (Hess φ)(x + tBZ)Bek ⟩ = 2 E φ (x + tBZ)(Bek , Bek )
k=1 k=1
m m
1 ′′ 1
P ∂2
ψ (Z) = 2t1 E[(∆ψt,x )(Z)]
P
= 2t E (ψt,x ) (Z)(ek , ek ) = 2t E ∂y 2 t,x
k
k=1 k=1
(17.26)
(cf. Definition 2.4.5). The assumption that Z : Ω → Rm is a standard normal random
variable and integration by parts therefore demonstrate that for all t ∈ (0, T ], x ∈ Rd it
holds that
1 ∗
2
Trace BB (Hess x u)(t, x)
" # " #
exp ⟨y,y⟩ exp − ⟨y,y⟩
Z Z
1 2 1 2
= (∆ψt,x )(y) dy = ⟨(∇ψt,x )(y), y⟩ dy
2t Rm (2π)m/2 2t Rm (2π)m/2
(17.27)
" #
1
Z D √ E exp − ⟨y,y⟩
= √ B ∗ (∇φ)(x + tBy), y 2
dy
2 t Rm (2π)m/2
1 √ √
= √ E ⟨B ∗ (∇φ)(x + tBZ), Z⟩ = E (∇φ)(x + tBZ), 2√ 1
t
BZ .
2 t
Item (III) hence establishes that for all t ∈ (0, T ], x ∈ Rd it holds that
∂u
(t, x) = 12 Trace BB ∗ (Hessx u)(t, x) . (17.28)
∂t
The fundamental theorem of calculus therefore proves that for all t, s ∈ (0, T ], x ∈ Rd it
holds that
Z t Z t
∂u 1 ∗
(17.29)
u(t, x) − u(s, x) = ∂t
(r, x) dr = 2
Trace BB (Hess x u)(r, x) dr.
s s
The fact that [0, T ] × Rd ∋ (t, x) 7→ (Hessx u)(t, x) ∈ Rd×d is continuous hence implies for
all t ∈ (0, T ], x ∈ Rd that
1 t1
u(t, x) − u(0, x) u(t, x) − u(s, x)
Z
Trace BB ∗ (Hessx u)(r, x) dr. (17.30)
= lim = 2
t s↘0 t t 0
526
17.3. Feynman–Kac formulas
This and the fact that [0, T ] × Rd ∋ (t, x) 7→ (Hessx u)(t, x) ∈ Rd×d is continuous ensure
that for all x ∈ Rd it holds that
u(t, x) − u(0, x) 1
− 2 Trace BB ∗ (Hessx u)(0, x)
lim sup
t↘0 t
Z t
1 1 ∗
1 ∗
≤ lim sup Trace BB (Hessx u)(s, x) − 2 Trace BB (Hessx u)(0, x) ds
t↘0 t 0 2
" #
≤ lim sup sup 12 Trace BB ∗ (Hessx u)(s, x) − (Hessx u)(0, x)
= 0.
t↘0 s∈[0,t]
(17.31)
Item (I) therefore shows that for all x ∈ Rd it holds that [0, T ] ∋ t 7→ u(t, x) ∈ R is
differentiable. Combining this with (17.31) and (17.28) ensures that for all t ∈ [0, T ], x ∈ Rd
it holds that
∂u 1 ∗
(17.32)
∂t
(t, x) = 2
Trace BB (Hess x u)(t, x) .
This and the fact that [0, T ] × Rd ∋ (t, x) 7→ (Hessx u)(t, x) ∈ Rd×d is continuous establish
item (i). Note that (17.32) proves item (ii). The proof of Proposition 17.3.2 is thus
complete.
Definition 17.3.3 (Standard Brownian motions). Let (Ω, F, P) be a probability space.
We say that W is an m-dimensional P-standard Brownian motion (we say that W is a
P-standard Brownian motion, we say that W is a standard Brownian motion) if and only
if there exists T ∈ (0, ∞) such that
(i) it holds that m ∈ N,
(v) it holds for all t1 ∈ [0, T ], t2 ∈ [0, T ] with t1 < t2 that Ω ∋ ω 7→ (t2 − t1 )−1/2 (Wt2 (ω) −
Wt1 (ω)) ∈ Rm is a standard normal random variable, and
1 import numpy as np
2 import matplotlib . pyplot as plt
3
4 def g e n e r a t e _ b r o w n i a n _ m o t i o n (T , N ) :
5 increments = np . random . randn ( N ) * np . sqrt ( T / N )
527
Chapter 17: Deep Kolmogorov methods (DKMs)
6 BM = np . cumsum ( increments )
7 BM = np . insert ( BM , 0 , 0)
8 return BM
9
10 T = 1
11 N = 1000
12 t_values = np . linspace (0 , T , N +1)
13
14 fig , axarr = plt . subplots (2 , 2)
15
16 for i in range (2) :
17 for j in range (2) :
18 BM = g e n e r a t e _ b r o w n i a n _ m o t i o n (T , N )
19 axarr [i , j ]. plot ( t_values , BM )
20
21 plt . tight_layout ()
22 plt . savefig ( ’ ../ plots / brownian_motions . pdf ’)
23 plt . show ()
528
17.3. Feynman–Kac formulas
1.5
2.0
1.5 1.0
1.0
0.5
0.5
0.0 0.0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
0.0
1.5
0.5
1.0
1.0
0.5
1.5
0.0
0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
529
Chapter 17: Deep Kolmogorov methods (DKMs)
and, for example, the consequence of de la Vallée-Poussin’s theorem in Klenke [248, Corol-
lary 6.21] imply that {Xn : n ∈ N} is uniformly integrable. This, (17.37), and Vitali’s
convergence theorem in, for instance, Klenke [248, Theorem 6.25] prove items (i) and (ii).
Observe that items (i) and (ii) establish item (iii). The proof of Lemma 17.3.5 is thus
complete.
Proposition 17.3.6. Let d ∈ N, T, ρ ∈ (0, ∞), f ∈ C([0, T ] × Rd , R), let u ∈ C 1,2 ([0, T ] ×
Rd , R) have at most polynomially growing partial derivatives, assume for all t ∈ [0, T ],
x ∈ Rd that
∂u
(17.39)
∂t
(t, x) = ρ (∆x u)(t, x) + f (t, x),
let (Ω, F, P) be a probability space, and let W : [0, T ] × Ω → Rd be a standard Brownian
motion (cf. Definition 17.3.3). Then it holds for all t ∈ [0, T ], x ∈ Rd that
Z t
(17.40)
p p
u(t, x) = E u(0, x + 2ρWt ) + f (t − s, x + 2ρWs ) ds .
0
Proof of Proposition 17.3.6. Throughout this proof, let D1 : [0, T ] × Rd → R satisfy for all
t ∈ [0, T ], x ∈ Rd that
D1 (t, x) = ∂u (17.41)
∂t
(t, x),
let D2 = (D2,1 , D2,2 , . . . , D2,d ) : [0, T ] × Rd → Rd satisfy for all t ∈ [0, T ], x ∈ Rd that
D2 (t, x) = (∇x u)(t, x), let H = (Hi,j )i,j∈{1,2,...,d} : [0, T ] × Rd → Rd×d satisfy for all t ∈ [0, T ],
x ∈ Rd that
H(t, x) = (Hessx u)(t, x), (17.42)
let γ : Rd → R satisfy for all z ∈ Rd that
∥z∥22
γ(z) = (2π)− /2 exp − (17.43)
d
2
,
and let vt,x : [0, t] → R, t ∈ [0, T ], x ∈ Rd , satisfy for all t ∈ [0, T ], x ∈ Rd , s ∈ [0, t] that
(17.44)
p
vt,x (s) = E u(s, x + 2ρWt−s )
(cf. Definition 3.3.4). Note that the assumption that W is a standard Brownian motion
ensures that for all t ∈ (0, T ], s ∈ [0, t) it holds that (t − s)−1/2 Wt−s : Ω → Rd is a standard
normal random variable. This shows that for all t ∈ (0, T ], x ∈ Rd , s ∈ [0, t) it holds that
p
vt,x (s) = E u(s, x + 2ρ(t − s)(t − s)− /2 Wt−s )
1
(17.45)
Z p
= u(s, x + 2ρ(t − s)z)γ(z) dz.
Rd
530
17.3. Feynman–Kac formulas
theorem therefore demonstrate that for all t ∈ (0, T ], x ∈ Rd , s ∈ [0, t) it holds that
vt,x |[0,t) ∈ C 1 ([0, t), R) and
Z p
p
′ −ρz
(vt,x ) (s) = D1 (s, x + 2ρ(t − s)z) + D2 (s, x + 2ρ(t − s)z), √ γ(z) dz
2ρ(t−s)
Rd
(17.46)
(cf. Definition 1.4.7). Furthermore, observe that the fact that for all z ∈ Rd it holds that
(∇γ)(z) = −γ(z)z implies that for all t ∈ (0, T ], x ∈ Rd , s ∈ [0, t) it holds that
Z p
√ −ρz
D2 (s, x + 2ρ(t − s)z), γ(z) dz
2ρ(t−s)
Rd
Z
(17.47)
p ρ(∇γ)(z)
= D2 (s, x + 2ρ(t − s)z), √ dz
2ρ(t−s)
Rd
X d Z p
ρ ∂γ
=√ D2,i (s, x + 2ρ(t − s)z)( ∂zi )(z1 , z2 , . . . , zd ) dz .
2ρ(t−s) i=1 Rd
Moreover, note that integration by parts proves that for all t ∈ (0, T ], x ∈ Rd , s ∈ [0, t),
i ∈ {1, 2, . . . , d}, a ∈ R, b ∈ (a, ∞) it holds that
Z b p ∂γ
D2,i (s, x + 2ρ(t − s)(z1 , z2 , . . . , zd ))( ∂z i
)(z1 , z2 , . . . , zd ) dzi
a
h izi =b
(17.48)
p
= D2,i (s, x + 2ρ(t − s)(z1 , z2 , . . . , zd ))γ(z1 , z2 , . . . , zd )
zi =a
Z bp p
− 2ρ(t − s)Hi,i (s, x + 2ρ(t − s)(z1 , z2 , . . . , zd ))γ(z1 , z2 , . . . , zd ) dzi .
a
The assumption that u has at most polynomially growing derivatives hence establishes that
for all t ∈ (0, T ], x ∈ Rd , s ∈ [0, t), i ∈ {1, 2, . . . , d} it holds that
Z p ∂γ
D2,i (s, x + 2ρ(t − s)(z1 , z2 , . . . , zd )) ∂z i
(z1 , z2 , . . . , zd ) dzi
R
p Z p (17.49)
= − 2ρ(t − s) Hi,i (s, x + 2ρ(t − s)(z1 , z2 , . . . , zd ))γ(z1 , z2 , . . . , zd ) dzi .
R
Combining this with (17.47) and Fubini’s theorem ensures that for all t ∈ (0, T ], x ∈ Rd ,
s ∈ [0, t) it holds that
Z p
−ρz
D2 (s, x + 2ρ(t − s)z), √ γ(z) dz
2ρ(t−s)
Rd
Xd Z
(17.50)
p
= −ρ Hi,i (s, x + 2ρ(t − s)(z))γ(z) dz
i=1 Rd
Z p
=− ρ Trace H(s, x + 2ρ(t − s)(z)) γ(z) dz.
Rd
531
Chapter 17: Deep Kolmogorov methods (DKMs)
This, (17.46), (17.39), and the fact that for all t ∈ (0, T ], s ∈ [0, t) it holds that (t −
s)−1/2 Wt−s : Ω → Rd is a standard normal random variable show that for all t ∈ (0, T ],
x ∈ Rd , s ∈ [0, t) it holds that
Z p p
′
(vt,x ) (s) = D1 (s, x + 2ρ(t − s)z) − ρ Trace H(s, x + 2ρ(t − s)z) γ(z) dz
d
ZR h i
(17.51)
p p
= f (s, x + 2ρ(t − s)z)γ(z) dz = E f (s, x + 2ρWt−s ) .
Rd
The fact that W0 = 0, the fact that for all t ∈ [0, T ], x ∈ Rd it holds that vt,x : [0, t] → R
is continuous, and the fundamental theorem of calculus therefore demonstrate that for all
t ∈ [0, T ], x ∈ Rd it holds that
h p i Z t
u(t, x) = E u(t, x + 2ρWt−t ) = vt,x (t) = vt,x (0) + (vt,x )′ (s) ds
h i Z t h
0
i (17.52)
p p
= E u(0, x + 2ρWt ) + E f (s, x + 2ρWt−s ) ds.
0
Fubini’s theorem and the fact that u and f are at most polynomially growing hence imply
(17.40). The proof of Proposition 17.3.6 is thus complete.
Corollary 17.3.7. Let d ∈ N, T, ρ ∈ (0, ∞), ϱ = √(2ρT), a ∈ R, b ∈ (a, ∞), let φ : Rd → R be a function, let u ∈ C^{1,2}([0, T] × Rd, R) have at most polynomially growing partial derivatives, assume for all t ∈ [0, T], x ∈ Rd that u(0, x) = φ(x) and
(∂u/∂t)(t, x) = ρ (Δx u)(t, x),   (17.53)
let (Ω, F, P) be a probability space, and let W : Ω → Rd be a standard normal random variable. Then
(i) it holds that φ : Rd → R is twice continuously differentiable with at most polynomially growing partial derivatives and
(ii) it holds for all x ∈ Rd that u(T, x) = E[φ(ϱW + x)].
Proof of Corollary 17.3.7. Observe that the assumption that u ∈ C 1,2 ([0, T ] × Rd , R) has
at most polynomially growing partial derivatives and the fact that for all x ∈ Rd it holds
that φ(x) = u(0, x) prove item (i). Furthermore, note that Proposition 17.3.6 establishes
item (ii). The proof of Corollary 17.3.7 is thus complete.
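For an illustration of item (ii), consider the following minimal Monte Carlo sketch in Python. The choices d = 2, ρ = 1, T = 1, and φ(x) = cos(x1) + cos(x2) are illustrative assumptions and not taken from the text; for this φ the function u(t, x) = e^{−ρt}(cos(x1) + cos(x2)) solves the heat equation in (17.53) with u(0, ·) = φ, so the Monte Carlo average can be compared with a closed-form value.

import torch

# Monte Carlo illustration of item (ii) with the illustrative choices d = 2, rho = 1, T = 1,
# phi(x) = cos(x1) + cos(x2); for this phi the function u(t, x) = exp(-rho*t)*(cos(x1)+cos(x2))
# solves du/dt = rho * Laplace_x(u) with u(0, .) = phi.
torch.manual_seed(0)
d, rho, T = 2, 1.0, 1.0
varrho = (2.0 * rho * T) ** 0.5                     # varrho = sqrt(2*rho*T)
x = torch.tensor([0.3, -1.2])
M = 10**6                                           # number of Monte Carlo samples
W = torch.randn(M, d)                               # standard normal samples W_1, ..., W_M
mc = torch.cos(varrho * W + x).sum(dim=1).mean()    # (1/M) * sum_m phi(varrho*W_m + x)
exact = torch.exp(torch.tensor(-rho * T)) * (torch.cos(x[0]) + torch.cos(x[1]))
print(float(mc), float(exact))                      # the two values agree up to Monte Carlo error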
Definition 17.3.8 (Continuous convolutions). Let d ∈ N and let f : Rd → R and g : Rd → R be B(Rd)/B(R)-measurable. Then we denote by
f ⊛ g : { x ∈ Rd : min{ ∫_{Rd} max{0, f(x−y) g(y)} dy, −∫_{Rd} min{0, f(x−y) g(y)} dy } < ∞ } → [−∞, ∞]   (17.54)
the function which satisfies for all x in its domain of definition that
(f ⊛ g)(x) = ∫_{Rd} f(x−y) g(y) dy.
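As a quick numerical illustration of this notion, the following minimal sketch (with illustrative parameter values, not taken from the text) checks on a grid in one dimension that the convolution of two Gaussian densities γ_{σ1} and γ_{σ2} agrees with the Gaussian density γ_{√(σ1²+σ2²)}.

import numpy as np

# One-dimensional numerical check that the convolution (in the sense of Definition 17.3.8)
# of two Gaussian densities gamma_{s1} and gamma_{s2} is gamma_{sqrt(s1^2 + s2^2)},
# evaluated at a single illustrative point x0
gamma = lambda x, s: np.exp(-x**2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)
s1, s2, x0 = 0.8, 1.5, 0.7
y = np.linspace(-30.0, 30.0, 200001)
conv_at_x0 = np.trapz(gamma(x0 - y, s1) * gamma(y, s2), y)    # integral of f(x0 - y) g(y) dy
print(conv_at_x0, gamma(x0, np.sqrt(s1**2 + s2**2)))          # the two values agree closely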
Exercise 17.3.1. Let d ∈ N, T ∈ (0, ∞), for every σ ∈ (0, ∞) let γσ : Rd → R satisfy for all x ∈ Rd that
γσ(x) = (2πσ²)^{−d/2} exp(−∥x∥₂²/(2σ²)),   (17.57)
and for every ρ ∈ (0, ∞), φ ∈ C²(Rd, R) with sup_{x∈Rd} Σ_{i,j=1}^d [ |φ(x)| + |(∂φ/∂xi)(x)| + |(∂²φ/(∂xi ∂xj))(x)| ] < ∞ let u_{ρ,φ} : [0, T] × Rd → R satisfy for all t ∈ (0, T], x ∈ Rd that
(∂u_{ρ,φ}/∂t)(t, x) = ρ (Δx u_{ρ,φ})(t, x).   (17.59)
Exercise 17.3.2. Prove or disprove the following statement: For every x ∈ R it holds that
e^{−x²/2} = (1/√(2π)) ∫_R e^{−t²/2} e^{−ixt} dt.   (17.60)
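Before attempting a proof, the claimed identity can be explored numerically, for instance with the following minimal sketch (the integration interval and grid size are illustrative assumptions); it approximates the right-hand side of (17.60) by the trapezoidal rule and prints it next to the left-hand side.

import numpy as np

# Numerical exploration of the identity claimed in (17.60): approximate
# (2*pi)^(-1/2) * integral over R of exp(-t^2/2) * exp(-i*x*t) dt by the trapezoidal rule
t = np.linspace(-40.0, 40.0, 400001)          # the integrand is negligible outside [-40, 40]
for x in [0.0, 0.5, 1.0, 2.0]:
    integrand = np.exp(-t**2 / 2) * np.exp(-1j * x * t)
    rhs = np.trapz(integrand, t) / np.sqrt(2 * np.pi)
    print(x, rhs, np.exp(-x**2 / 2))          # compare the two sides of (17.60)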
Exercise 17.3.3. Let d ∈ N, T ∈ (0, ∞), for every σ ∈ (0, ∞) let γσ : Rd → R satisfy for all x ∈ Rd that
γσ(x) = (2πσ²)^{−d/2} exp(−∥x∥₂²/(2σ²)),   (17.61)
for every φ ∈ C²(Rd, R) with sup_{x∈Rd} Σ_{i,j=1}^d [ |φ(x)| + |(∂φ/∂xi)(x)| + |(∂²φ/(∂xi ∂xj))(x)| ] < ∞ let uφ : [0, T] × Rd → R satisfy for all t ∈ (0, T], x ∈ Rd that
(cf. Definitions 3.3.4 and 17.3.8). Prove or disprove the following statement: For all i = (i1, ..., id) ∈ Nd, t ∈ [0, T], x ∈ Rd it holds that
u_{ψi}(t, x) = exp(−π² [Σ_{k=1}^d |ik|²] t) ψi(x).   (17.64)
Exercise 17.3.4. Let d ∈ N, T ∈ (0, ∞), for every σ ∈ (0, ∞) let γσ : Rd → R satisfy for all x ∈ Rd that
γσ(x) = (2πσ²)^{−d/2} exp(−∥x∥₂²/(2σ²)),   (17.65)
and for every i = (i1, ..., id) ∈ Nd let ψi : Rd → R satisfy for all x = (x1, ..., xd) ∈ Rd that
ψi(x) = 2^{d/2} [ Π_{k=1}^d sin(ik π xk) ]   (17.66)
(cf. Definition 3.3.4). Prove or disprove the following statement: For every i = (i1, ..., id) ∈ Nd, s ∈ [0, T], y ∈ Rd and every function u ∈ C^{1,2}([0, T] × Rd, R) with at most polynomially growing partial derivatives which satisfies for all t ∈ (0, T), x ∈ Rd that u(0, x) = ψi(x) and
(∂u/∂t)(t, x) = (Δx u)(t, x)   (17.67)
it holds that
u(s, y) = exp(−π² [Σ_{k=1}^d |ik|²] s) ψi(y).   (17.68)
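The statement of Exercise 17.3.4 can be explored numerically by comparing a Monte Carlo approximation of E[ψi(y + √(2s) W)] (cf. Corollary 17.3.7 applied with ρ = 1 and T replaced by s) with the claimed closed-form expression. The concrete choices of d, i, s, and y in the following sketch are illustrative assumptions.

import torch

# Monte Carlo comparison of E[psi_i(y + sqrt(2*s)*W)] with exp(-pi^2*(i1^2+...+id^2)*s)*psi_i(y)
# for the illustrative choices d = 2, i = (1, 2), s = 0.1, y = (0.3, 0.7)
torch.manual_seed(0)
d, s = 2, 0.1
i = torch.tensor([1.0, 2.0])
y = torch.tensor([0.3, 0.7])
psi = lambda z: 2.0 ** (d / 2) * torch.sin(i * torch.pi * z).prod(dim=-1)
W = torch.randn(10**6, d)
mc = psi(y + (2.0 * s) ** 0.5 * W).mean()
closed_form = torch.exp(-torch.pi**2 * (i**2).sum() * s) * psi(y)
print(float(mc), float(closed_form))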
17.4. Reformulation of PDE problems as stochastic optimization problems
(iii) it holds for every x ∈ [a, b]d that U (x) = u(T, x).
Proof of Proposition 17.4.1. First, observe that (17.69), the assumption that W is a stan-
dard normal random variable, and Corollary 17.3.7 ensure that for all x ∈ Rd it holds that
φ : Rd → R is twice continuously differentiable with at most polynomially growing partial
derivatives and
u(T, x) = E[u(0, ϱW + x)] = E[φ(ϱW + x)].   (17.71)
Furthermore, note that the assumption that W is a standard normal random variable and the fact that φ is continuous and has at most polynomially growing partial derivatives show that
(I) it holds that [a, b]d × Ω ∋ (x, ω) ↦ φ(ϱW(ω) + x) ∈ R is (B([a, b]d) ⊗ F)/B(R)-measurable and
(II) it holds for all x ∈ [a, b]d that E[|φ(ϱW + x)|²] < ∞.
Combining items (I) and (II) with (17.71) hence ensures that
(A) there exists a unique continuous function U : [a, b]d → R which satisfies that
∫_{[a,b]d} E[|φ(ϱW + x) − U(x)|²] dx = inf_{v ∈ C([a,b]d, R)} ∫_{[a,b]d} E[|φ(ϱW + x) − v(x)|²] dx   (17.72)
and
(B) it holds for all x ∈ [a, b]d that U(x) = u(T, x).
Moreover, observe that the assumption that W and X are independent, item (I), and the
assumption that X is continuously uniformly distributed on [a, b]d demonstrate that for all
v ∈ C([a, b]d , R) it holds that
E[|φ(ϱW + X) − v(X)|²] = (b − a)^{−d} ∫_{[a,b]d} E[|φ(ϱW + x) − v(x)|²] dx.   (17.73)
Combining this with item (A) implies item (ii). Note that items (A) and (B) and (17.73)
prove item (iii). The proof of Proposition 17.4.1 is thus complete.
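The content of items (ii) and (iii) can be illustrated with a small least-squares experiment in one dimension: fitting a function of X by least squares to samples of φ(ϱW + X) approximately recovers the function x ↦ E[φ(ϱW + x)] = u(T, x). The choices φ = cos, ρ = T = 1, [a, b] = [−3, 3], and the polynomial ansatz in the following sketch are illustrative assumptions, not taken from the text.

import numpy as np

# Least-squares illustration of Proposition 17.4.1 in one dimension: with phi = cos and
# rho = T = 1 (so varrho = sqrt(2)) the L2-minimizer over functions of X is
# x -> E[cos(varrho*W + x)] = exp(-1) * cos(x) = u(T, x).
rng = np.random.default_rng(0)
a, b, varrho = -3.0, 3.0, np.sqrt(2.0)
M = 10**6
X = rng.uniform(a, b, size=M)
W = rng.standard_normal(M)
Y = np.cos(varrho * W + X)                    # samples of phi(varrho*W + X)
coeffs = np.polyfit(X, Y, deg=8)              # least-squares fit over polynomials of degree 8
x_test = np.array([-2.0, 0.0, 1.5])
print(np.polyval(coeffs, x_test))             # approximately exp(-1) * cos(x_test)
print(np.exp(-1.0) * np.cos(x_test))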
While Proposition 17.4.1 above recasts the solutions of the PDE in (17.69) at a particular
point in time as the solutions of a stochastic optimization problem, we can also derive from
this a corollary which shows that the solutions of the PDE over an entire timespan are
similarly the solutions of a stochastic optimization problem.
Corollary 17.4.2. Let d ∈ N, T, ρ ∈ (0, ∞), ϱ = √(2ρ), a ∈ R, b ∈ (a, ∞), let φ : Rd → R be a function, let u ∈ C^{1,2}([0, T] × Rd, R) be a function with at most polynomially growing partial derivatives which satisfies for all t ∈ [0, T], x ∈ Rd that u(0, x) = φ(x) and
(∂u/∂t)(t, x) = ρ (Δx u)(t, x),   (17.74)
let (Ω, F, P) be a probability space, let W : Ω → Rd be a standard normal random variable, let τ : Ω → [0, T] be a continuously uniformly distributed random variable, let X : Ω → [a, b]d be a continuously uniformly distributed random variable, and assume that W, τ, and X are independent. Then
(i) there exists a unique U ∈ C([0, T] × [a, b]d, R) which satisfies that
E[|φ(ϱ√τ W + X) − U(τ, X)|²] = inf_{v ∈ C([0,T]×[a,b]d, R)} E[|φ(ϱ√τ W + X) − v(τ, X)|²]   (17.75)
and
(ii) it holds for all t ∈ [0, T], x ∈ [a, b]d that U(t, x) = u(t, x).
Proof of Corollary 17.4.2. Throughout this proof, let F : C([0, T ] × [a, b]d , R) → [0, ∞]
satisfy for all v ∈ C([0, T ] × [a, b]d , R) that
F(v) = E[|φ(ϱ√τ W + X) − v(τ, X)|²].   (17.76)
Observe that Proposition 17.4.1 establishes that for all v ∈ C([0, T ] × [a, b]d , R), s ∈ [0, T ]
it holds that
E[|φ(ϱ√s W + X) − v(s, X)|²] ≥ E[|φ(ϱ√s W + X) − u(s, X)|²].   (17.77)
Furthermore, note that the assumption that W, τ , and X are independent, the assumption
that τ : Ω → [0, T ] is continuously uniformly distributed, and Fubini’s theorem ensure that
for all v ∈ C([0, T ] × [a, b]d , R) it holds that
F(v) = E[|φ(ϱ√τ W + X) − v(τ, X)|²] = T^{−1} ∫_{[0,T]} E[|φ(ϱ√s W + X) − v(s, X)|²] ds.   (17.78)
This and (17.77) show that for all v ∈ C([0, T ] × [a, b]d , R) it holds that
F(v) ≥ T^{−1} ∫_{[0,T]} E[|φ(ϱ√s W + X) − u(s, X)|²] ds.   (17.79)
Combining this with (17.78) demonstrates that for all v ∈ C([0, T ] × [a, b]d , R) it holds that
F (v) ≥ F (u). Therefore, we obtain that
F(u) = inf_{v ∈ C([0,T]×[a,b]d, R)} F(v).   (17.80)
This and (17.78) imply that for all U ∈ C([0, T] × [a, b]d, R) with
F(U) = inf_{v ∈ C([0,T]×[a,b]d, R)} F(v)   (17.81)
it holds that
∫_{[0,T]} E[|φ(ϱ√s W + X) − U(s, X)|²] ds = ∫_{[0,T]} E[|φ(ϱ√s W + X) − u(s, X)|²] ds.   (17.82)
Combining this with (17.77) proves that for all U ∈ C([0, T] × [a, b]d, R) with F(U) = inf_{v ∈ C([0,T]×[a,b]d, R)} F(v) there exists A ⊆ [0, T] with ∫_A 1 dx = T such that for all s ∈ A it holds that
E[|φ(ϱ√s W + X) − U(s, X)|²] = E[|φ(ϱ√s W + X) − u(s, X)|²].   (17.83)
Proposition 17.4.1 therefore establishes that for all U ∈ C([0, T] × [a, b]d, R) with F(U) = inf_{v ∈ C([0,T]×[a,b]d, R)} F(v) there exists A ⊆ [0, T] with ∫_A 1 dx = T such that for all s ∈ A it holds that U(s) = u(s). The fact that u ∈ C([0, T] × [a, b]d, R) hence ensures that for all U ∈ C([0, T] × [a, b]d, R) with F(U) = inf_{v ∈ C([0,T]×[a,b]d, R)} F(v) it holds that U = u.
Combining this with (17.80) proves items (i) and (ii). The proof of Corollary 17.4.2 is thus
complete.
17.5. Derivation of DKMs

Throughout this section, let d ∈ N, T, ρ ∈ (0, ∞), a ∈ R, b ∈ (a, ∞), let φ : Rd → R be a function, and let u ∈ C^{1,2}([0, T] × Rd, R) have at most polynomially growing partial derivatives and satisfy for all t ∈ [0, T], x ∈ Rd that u(0, x) = φ(x) and
(∂u/∂t)(t, x) = ρ (Δx u)(t, x).   (17.84)
In this framework we think of u as the unknown PDE solution. The objective of this derivation is to develop deep learning methods which aim to approximate the unknown PDE solution u(T, ·)|_{[a,b]d} : [a, b]d → R at time T restricted to [a, b]d.
In the first step, we employ Proposition 17.4.1 to recast the unknown target function u(T, ·)|_{[a,b]d} : [a, b]d → R as the solution of an optimization problem. For this let ϱ = √(2ρT), let (Ω, F, P) be a probability space, let W : Ω → Rd be a standard normally distributed random variable, let X : Ω → [a, b]d be a continuously uniformly distributed random variable, assume that W and X are independent, and let L : C([a, b]d, R) → [0, ∞] satisfy for all v ∈ C([a, b]d, R) that
L(v) = E[|φ(ϱW + X) − v(X)|²].   (17.85)
Proposition 17.4.1 then ensures that the unknown target function u(T, ·)|[a,b]d : [a, b]d → R
is the unique global minimizer of the function L : C([a, b]d , R) → [0, ∞]. Minimizing L is,
however, not yet amenable to numerical computations.
In the second step, we therefore reduce this infinite-dimensional stochastic optimization problem to a finite-dimensional stochastic optimization problem involving ANNs. Specifically, let a : R → R be differentiable, let h ∈ N, l1, l2, ..., lh, 𝔡 ∈ N satisfy
𝔡 = l1(d + 1) + [Σ_{k=2}^h lk(l_{k−1} + 1)] + lh + 1,
and let L : R^𝔡 → [0, ∞) satisfy for all θ ∈ R^𝔡 that
L(θ) = L( N^{θ,d}_{M_{a,l1}, M_{a,l2}, ..., M_{a,lh}, id_R} |_{[a,b]d} ) = E[ |φ(ϱW + X) − N^{θ,d}_{M_{a,l1}, M_{a,l2}, ..., M_{a,lh}, id_R}(X)|² ]   (17.86)
(cf. Definitions 1.1.3 and 1.2.1). We can now compute an approximate minimizer of the function L : C([a, b]d, R) → [0, ∞] by computing an approximate minimizer ϑ ∈ R^𝔡 of the function L : R^𝔡 → [0, ∞) and employing the realization N^{ϑ,d}_{M_{a,l1}, M_{a,l2}, ..., M_{a,lh}, id_R}|_{[a,b]d} ∈ C([a, b]d, R) of the ANN associated to this approximate minimizer restricted to [a, b]d as an approximate minimizer of L.
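For a concrete feeling for the size of the parameter dimension 𝔡, consider the following minimal sketch with the illustrative choices d = 2, h = 2, l1 = l2 = 50 (which are assumptions and not taken from the text); the formula above agrees with the number of parameters of a corresponding PyTorch model.

import torch

# Parameter dimension of a fully-connected feedforward ANN with input dimension d,
# hidden layer widths l1, ..., lh, and one-dimensional output:
# l1*(d+1) + sum_{k=2}^{h} lk*(l_{k-1}+1) + lh + 1
def param_dim(d, layers):
    dims = [d] + list(layers) + [1]
    return sum(dims[k + 1] * (dims[k] + 1) for k in range(len(dims) - 1))

d, layers = 2, [50, 50]
model = torch.nn.Sequential(
    torch.nn.Linear(d, 50), torch.nn.ReLU(),
    torch.nn.Linear(50, 50), torch.nn.ReLU(),
    torch.nn.Linear(50, 1),
)
print(param_dim(d, layers))                         # 2751
print(sum(p.numel() for p in model.parameters()))   # 2751 as well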
In the third step, we use SGD-type methods to compute such an approximate minimizer ϑ ∈ R^𝔡 of the function L : R^𝔡 → [0, ∞). We now sketch this in the case of the plain-vanilla SGD optimization method (cf. Definition 7.2.1). Let ξ ∈ R^𝔡, J ∈ N, (γn)_{n∈N} ⊆ [0, ∞), for every n ∈ N, j ∈ {1, 2, ..., J} let W_{n,j} : Ω → Rd be a standard normally distributed random variable and let X_{n,j} : Ω → [a, b]d be a continuously uniformly distributed random variable, let l : R^𝔡 × Rd × [a, b]d → R satisfy for all θ ∈ R^𝔡, w ∈ Rd, x ∈ [a, b]d that
l(θ, w, x) = |φ(ϱw + x) − N^{θ,d}_{M_{a,l1}, M_{a,l2}, ..., M_{a,lh}, id_R}(x)|²,   (17.87)
and let Θ = (Θn)_{n∈N0} : N0 × Ω → R^𝔡 satisfy for all n ∈ N that Θ0 = ξ and
Θn = Θ_{n−1} − γn [ (1/J) Σ_{j=1}^J (∇θ l)(Θ_{n−1}, W_{n,j}, X_{n,j}) ].   (17.88)
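A minimal sketch of a single step of this recursion, using torch autograd for the mini-batch gradient, could look as follows; the concrete sizes and the names net, phi, varrho, and gamma are illustrative assumptions and not notation from the text.

import torch

# One mini-batch step of the plain-vanilla SGD recursion (17.88) for the loss (17.87),
# with illustrative choices d = 2, J = 32, rho = T = 1, phi(x) = cos(x1) + cos(x2)
d, J, gamma = 2, 32, 0.01
a, b, varrho = -5.0, 5.0, (2.0 * 1.0 * 1.0) ** 0.5
phi = lambda x: torch.cos(x[:, 0:1]) + torch.cos(x[:, 1:2])
net = torch.nn.Sequential(torch.nn.Linear(d, 50), torch.nn.ReLU(), torch.nn.Linear(50, 1))

W = torch.randn(J, d)                      # realizations of W_{n,1}, ..., W_{n,J}
X = a + (b - a) * torch.rand(J, d)         # realizations of X_{n,1}, ..., X_{n,J}
loss = torch.mean((phi(varrho * W + X) - net(X)) ** 2)   # (1/J) sum_j l(theta, W_{n,j}, X_{n,j})
loss.backward()                            # gradients with respect to the ANN parameters
with torch.no_grad():
    for p in net.parameters():             # Theta_n = Theta_{n-1} - gamma_n * mini-batch gradient
        p -= gamma * p.grad
        p.grad = None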
Finally, the idea of DKMs is to consider for large enough n ∈ N the realization function N^{Θn,d}_{M_{a,l1}, M_{a,l2}, ..., M_{a,lh}, id_R} as an approximation
N^{Θn,d}_{M_{a,l1}, M_{a,l2}, ..., M_{a,lh}, id_R}|_{[a,b]d} ≈ u(T, ·)|_{[a,b]d}   (17.89)
of the unknown solution u of the PDE in (17.84) at time T restricted to [a, b]d.
An implementation of the DKMs derived above in the case of a two-dimensional heat equation, which employs the more sophisticated Adam SGD optimization method instead of the plain-vanilla SGD optimization method, can be found in the next section.
17.6. Implementation of DKMs

The implementation presented in this section approximates the solution u of a two-dimensional heat equation with initial condition u(0, x) = cos(x1) + cos(x2) for t ∈ [0, 2], x = (x1, x2) ∈ R2. It trains a fully-connected feedforward ANN with 2 hidden layers (with 50 neurons on each hidden layer) using the ReLU activation function (cf. Section 1.2.3). The training uses batches of size 256, with each batch consisting of 256 randomly chosen realizations of the random variable (T, X), where T is a continuously uniformly distributed random variable on [0, 2] and where X is a continuously uniformly distributed random variable on [−5, 5]2. The training is performed using the Adam SGD optimization method (cf. Section 7.9). A plot of the resulting approximation of the solution u after 3000 training steps is shown in Figure 16.1.
import torch
import matplotlib.pyplot as plt

# Use the GPU if available
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Define a neural network with two hidden layers with 50 neurons
# each using ReLU activations
N = torch.nn.Sequential(
    torch.nn.Linear(d + 1, 50), torch.nn.ReLU(),
    torch.nn.Linear(50, 50), torch.nn.ReLU(),
    torch.nn.Linear(50, 1)
).to(dev)
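A minimal self-contained sketch of a complete DKM training loop in the spirit of the listing above might look as follows. It assumes the heat equation ∂u/∂t = Δx u (i.e., ρ = 1), the initial condition φ(x) = cos(x1) + cos(x2), and the Adam optimizer with its default hyperparameters; the variable names d, T, phi and the concrete structure are assumptions and the sketch is not identical to the book's source code.

import torch

# A sketch of a DKM training loop for the two-dimensional heat equation du/dt = Laplace_x(u)
# with u(0, x) = cos(x1) + cos(x2), trained on (t, x) in [0, 2] x [-5, 5]^2
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
d, T = 2, 2.0
a, b = -5.0, 5.0
batch_size, train_steps = 256, 3000

def phi(x):                                   # initial condition
    return torch.cos(x[:, 0:1]) + torch.cos(x[:, 1:2])

N = torch.nn.Sequential(
    torch.nn.Linear(d + 1, 50), torch.nn.ReLU(),
    torch.nn.Linear(50, 50), torch.nn.ReLU(),
    torch.nn.Linear(50, 1),
).to(dev)
optimizer = torch.optim.Adam(N.parameters())

for step in range(train_steps):
    # Sample a batch of (t, x) uniformly from [0, T] x [a, b]^d
    t = T * torch.rand(batch_size, 1, device=dev)
    x = a + (b - a) * torch.rand(batch_size, d, device=dev)
    # Feynman-Kac targets phi(x + sqrt(2 t) W) with W standard normal
    w = torch.randn(batch_size, d, device=dev)
    target = phi(x + torch.sqrt(2.0 * t) * w)
    # Empirical L2 loss between the ANN realization and the targets
    loss = torch.mean((N(torch.cat([t, x], dim=1)) - target) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()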
Chapter 18

Further deep learning methods for PDEs
Besides the PINNs, DGMs, and DKMs reviewed in Chapters 16 and 17 above, there are also a
large number of other works which propose and study deep learning based approximation
methods for various classes of PDEs. In the following we mention a selection of such methods
from the literature roughly grouped into three classes. Specifically, we consider deep learning
methods for PDEs which employ strong formulations of PDEs to set up learning problems in
Section 18.1, we consider deep learning methods for PDEs which employ weak or variational
formulations of PDEs to set up learning problems in Section 18.2, and we consider deep
learning methods for PDEs which employ intrinsic stochastic representations of PDEs to
set up learning problems in Section 18.3. Finally, in Section 18.4 we also point to several
theoretical results and error analyses for deep learning methods for PDEs in the literature.
Our selection of references for methods as well as theoretical results is by no means
complete. For more complete reviews of the literature on deep learning methods for PDEs
and corresponding theoretical results we refer, for instance, to the overview articles [24, 56,
88, 120, 145, 237, 355].
• the extended PINNs (XPINNs) methodology in Jagtap & Karniadakis [90] which
generalizes the domain decomposition idea of Jagtap et al. [219] to other types of
PDEs,
• the Navier-Stokes flow nets (NSFnets) methodology in Jin et al. [231] which explores
the use of PINNs for the incompressible Navier-Stokes PDEs,
• the Bayesian PINNs methodology in Yang et al. [421] which combines PINNs with
Bayesian neural networks (BNNs) from Bayesian learning (cf., for instance, [287,
300]),
• the parareal PINNs (PPINNs) methodology for time-dependent PDEs with long time
horizons in Meng et al. [295] which combines the PINNs methodology with ideas
from parareal algorithms (cf., for example, [42, 290]) in order to split up long-time
problems into many independent short-time problems,
• the SelectNets methodology in Gu et al. [183] which extends the PINNs methodology
by employing a second ANN to adaptively select during the training process the
points at which the residual of the PDE is considered, and
• the fractional PINNs (fPINNs) methodology in Pang et al. [324] which extends the
PINNs methodology to PDEs with fractional derivatives such as space-time fractional
advection-diffusion equations.
We also refer to the article Lu et al. [286] which introduces an elegant Python library for
PINNs called DeepXDE and also provides a good introduction to PINNs.
• the VarNets methodology in Khodayi-Mehr & Zavlanos [243] which employs a methodology similar to that of VPINNs but also considers parametric PDEs,
• the weak form TGNN methodology in Xu et al. [420] which further extends the VPINNs methodology by (amongst other adaptations) considering test functions in the weak formulation of PDEs tailored to the considered problem,
• the deep Fourier residual method in Taylor et al. [393] which is based on minimizing the dual norm of the weak-form residual operator of PDEs by employing Fourier-type representations of this dual norm which can efficiently be approximated using the discrete sine transform (DST) and discrete cosine transform (DCT),
• the weak adversarial networks (WANs) methodology in Zang et al. [428] (cf. also Bao
et al. [13]) which is based on approximating both the solution of the PDE and the test
function in the weak formulation of the PDE by ANNs and on using an adversarial
approach (cf., for instance, Goodfellow et al. [165]) to train both networks to minimize
and maximize, respectively, the weak-form residual of the PDE,
• the Friedrichs learning methodology in Chen et al. [66] which is similar to the WAN methodology but uses a different minimax formulation for the weak solution related to Friedrichs' theory on symmetric systems of PDEs (see Friedrichs [139]),
• the deep Ritz method for elliptic PDEs in E & Yu [124] which employs variational
minimization problems associated to PDEs to set up a learning problem,
• the deep Nitsche method in Liao & Ming [274] which refines the deep Ritz method
using Nitsche’s method (see Nitsche [313]) to enforce boundary conditions, and
• the deep domain decomposition method (D3M) in Li et al. [268] which refines the deep
Ritz method using domain decompositions.
We also refer to the multi-scale deep neural networks (MscaleDNNs) in Cai et al. [58, 279]
for a refined ANN architecture which can be employed in both the strong-form-based PINNs
methodology and the variational-form-based deep Ritz methodology.
18.3. Deep learning methods based on stochastic representations of PDEs

• the deep BSDE methodology in E et al. [119, 187] which suggests approximating solutions of semilinear parabolic PDEs by approximately solving the BSDE associated to the considered PDE through the nonlinear Feynman–Kac formula (see Pardoux & Peng [325, 326]) using a new deep learning methodology based on, roughly speaking, approximating the gradients of the PDE solution along a time discretization by ANNs,
• the generalization of the deep BSDE methodology in Han & Long [188] for semilinear and quasilinear parabolic PDEs based on forward-backward stochastic differential equations (FBSDEs),
• the refinements of the deep BSDE methodology in [64, 140, 196, 317, 346] which
explore different nontrivial variations and extensions of the original deep BSDE
methodology including different ANN architectures, initializations, and loss functions,
• the extension of the deep BSDE methodology to fully nonlinear parabolic PDEs in Beck et al. [20] which is based on a nonlinear Feynman–Kac formula involving second-order BSDEs (see Cheridito et al. [73]),
• the deep backward schemes for semilinear parabolic PDEs in Huré et al. [207] which
also rely on BSDEs but set up many separate learning problems which are solved
inductively backwards in time instead of one single optimization problem,
• the deep backward schemes in Pham et al. [336] which extend the methodology in
Huré et al. [207] to fully nonlinear parabolic PDEs,
• the deep splitting method for semilinear parabolic PDEs in Beck et al. [17] which iteratively solves, on small time increments, linear approximations of the semilinear parabolic PDEs using DKMs,
• the methods in Nguwi et al. [308, 309, 311] which are based on representations of
PDE solutions involving branching-type processes (cf., for example, also [195, 197, 310] and the references therein for nonlinear Feynman–Kac-type formulas based on such branching-type processes), and
• the methodology for elliptic PDEs in Kremsner et al. [256] which relies on suitable
representations of elliptic PDEs involving BSDEs with random terminal times.
18.4. Error analyses for deep learning methods for PDEs

also referred to as polynomial tractability (cf., for instance, [314, Definition 4.44], [315], and [316]).
Index of abbreviations

List of figures

List of source codes
List of definitions
Chapter 1
Definition 1.1.1: Affine functions
Definition 1.1.3: Vectorized description of fully-connected feedforward ANNs
Definition 1.2.1: Multidimensional versions of one-dimensional functions
Definition 1.2.4: ReLU activation function
Definition 1.2.5: Multidimensional ReLU activation functions
Definition 1.2.9: Clipping activation function
Definition 1.2.10: Multidimensional clipping activation functions
Definition 1.2.11: Softplus activation function
Definition 1.2.13: Multidimensional softplus activation functions
Definition 1.2.15: GELU activation function
Definition 1.2.17: Multidimensional GELU unit activation function
Definition 1.2.18: Standard logistic activation function
Definition 1.2.19: Multidimensional standard logistic activation functions
Definition 1.2.22: Swish activation function
Definition 1.2.24: Multidimensional swish activation functions
Definition 1.2.25: Hyperbolic tangent activation function
Definition 1.2.26: Multidimensional hyperbolic tangent activation functions
Definition 1.2.28: Softsign activation function
Definition 1.2.29: Multidimensional softsign activation functions
Definition 1.2.30: Leaky ReLU activation function
Definition 1.2.33: Multidimensional leaky ReLU activation function
Definition 1.2.34: ELU activation function
Definition 1.2.36: Multidimensional ELU activation function
Definition 1.2.37: RePU activation function
Definition 1.2.38: Multidimensional RePU activation function
Definition 1.2.39: Sine activation function
Definition 1.2.40: Multidimensional sine activation functions
Definition 1.2.41: Heaviside activation function
Definition 1.2.42: Multidimensional Heaviside activation functions
Definition 1.2.43: Softmax activation function
Chapter 4
Definition 4.1.2: Metric space
Definition 4.2.1: 1-norm ANN representations
Definition 4.2.5: Maxima ANN representations
Definition 4.2.6: Floor and ceiling of real numbers
Definition 4.3.2: Covering numbers
Definition 4.4.1: Rectified clipped ANNs
Chapter 6
Definition 6.1.1: GD optimization method
Definition 6.2.1: Explicit midpoint GD optimization method
Definition 6.3.1: Momentum GD optimization method
Definition 6.3.5: Bias-adjusted momentum GD optimization method
Definition 6.4.1: Nesterov accelerated GD optimization method
Definition 6.5.1: Adagrad GD optimization method
Definition 6.6.1: RMSprop GD optimization method
Definition 6.6.3: Bias-adjusted RMSprop GD optimization method
Definition 6.7.1: Adadelta GD optimization method
Definition 6.8.1: Adam GD optimization method
Chapter 7
Definition 7.2.1: SGD optimization method
Definition 7.3.1: Explicit midpoint SGD optimization method
Definition 7.4.1: Momentum SGD optimization method
Definition 7.4.2: Bias-adjusted momentum SGD optimization method
Definition 7.5.1: Nesterov accelerated SGD optimization method
Definition 7.5.3: Simplified Nesterov accelerated SGD optimization method
Definition 7.6.1: Adagrad SGD optimization method
Definition 7.7.1: RMSprop SGD optimization method
Definition 7.7.3: Bias-adjusted RMSprop SGD optimization method
Definition 7.8.1: Adadelta SGD optimization method
Definition 7.9.1: Adam SGD optimization method
Chapter 8
Definition 8.2.1: Diagonal matrices
Chapter 9
Definition 9.1.1: Standard KL inequalities
Definition 9.1.2: Standard KL functions
Definition 9.7.1: Analytic functions
Definition 9.15.1: Fréchet subgradients and limiting Fréchet subgradients
Definition 9.16.1: Non-smooth slope
Definition 9.17.1: Generalized KL inequalities
Definition 9.17.2: Generalized KL functions
Chapter 10
Bibliography
[1] Abdel-Hamid, O., Mohamed, A., Jiang, H., Deng, L., Penn, G., and Yu, D.
Convolutional Neural Networks for Speech Recognition. IEEE/ACM Trans. Audio,
Speech, Language Process. 22, 10 (2014), pp. 1533–1545. url: doi.org/10.1109/
TASLP.2014.2339736.
[2] Absil, P.-A., Mahony, R., and Andrews, B. Convergence of the iterates of
descent methods for analytic cost functions. SIAM J. Optim. 16, 2 (2005), pp. 531–
547. url: doi.org/10.1137/040605266.
[3] Ackermann, J., Jentzen, A., Kruse, T., Kuckuck, B., and Padgett, J. L.
Deep neural networks with ReLU, leaky ReLU, and softplus activation provably
overcome the curse of dimensionality for Kolmogorov partial differential equations
with Lipschitz nonlinearities in the Lp -sense. arXiv:2309.13722 (2023), 52 pp. url:
arxiv.org/abs/2309.13722.
[4] Alpaydın, E. Introduction to Machine Learning. 4th ed. MIT Press, Cambridge,
Mass., 2020. 712 pp.
[5] Amann, H. Ordinary differential equations. Walter de Gruyter & Co., Berlin, 1990.
xiv+458 pp. url: doi.org/10.1515/9783110853698.
[6] Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg,
E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., Chen, J.,
Chen, J., Chen, Z., Chrzanowski, M., Coates, A., Diamos, G., Ding, K.,
Du, N., Elsen, E., Engel, J., Fang, W., Fan, L., Fougner, C., Gao, L.,
Gong, C., Hannun, A., Han, T., Johannes, L., Jiang, B., Ju, C., Jun, B.,
LeGresley, P., Lin, L., Liu, J., Liu, Y., Li, W., Li, X., Ma, D., Narang, S.,
Ng, A., Ozair, S., Peng, Y., Prenger, R., Qian, S., Quan, Z., Raiman, J.,
Rao, V., Satheesh, S., Seetapun, D., Sengupta, S., Srinet, K., Sriram, A.,
Tang, H., Tang, L., Wang, C., Wang, J., Wang, K., Wang, Y., Wang, Z.,
Wang, Z., Wu, S., Wei, L., Xiao, B., Xie, W., Xie, Y., Yogatama, D.,
Yuan, B., Zhan, J., and Zhu, Z. Deep Speech 2 : End-to-End Speech Recognition
in English and Mandarin. In Proceedings of The 33rd International Conference on
Machine Learning (New York, NY, USA, June 20–22, 2016). Ed. by Balcan, M. F.
[17] Beck, C., Becker, S., Cheridito, P., Jentzen, A., and Neufeld, A. Deep
splitting method for parabolic PDEs. SIAM J. Sci. Comput. 43, 5 (2021), A3135–
A3154. url: doi.org/10.1137/19M1297919.
[18] Beck, C., Becker, S., Grohs, P., Jaafari, N., and Jentzen, A. Solving
stochastic differential equations and Kolmogorov equations by means of deep learning.
arXiv:1806.00421 (2018), 56 pp. url: arxiv.org/abs/1806.00421.
[19] Beck, C., Becker, S., Grohs, P., Jaafari, N., and Jentzen, A. Solving
the Kolmogorov PDE by means of deep learning. J. Sci. Comput. 88, 3 (2021),
Art. No. 73, 28 pp. url: doi.org/10.1007/s10915-021-01590-0.
[20] Beck, C., E, W., and Jentzen, A. Machine learning approximation algorithms
for high-dimensional fully nonlinear partial differential equations and second-order
backward stochastic differential equations. J. Nonlinear Sci. 29, 4 (2019), pp. 1563–
1619. url: doi.org/10.1007/s00332-018-9525-3.
[21] Beck, C., Gonon, L., and Jentzen, A. Overcoming the curse of dimensionality in
the numerical approximation of high-dimensional semilinear elliptic partial differential
equations. arXiv:2003.00596 (2020), 50 pp. url: arxiv.org/abs/2003.00596.
[22] Beck, C., Hornung, F., Hutzenthaler, M., Jentzen, A., and Kruse, T.
Overcoming the curse of dimensionality in the numerical approximation of Allen-
Cahn partial differential equations via truncated full-history recursive multilevel
Picard approximations. J. Numer. Math. 28, 4 (2020), pp. 197–222. url: doi.org/
10.1515/jnma-2019-0074.
[23] Beck, C., Hutzenthaler, M., and Jentzen, A. On nonlinear Feynman–Kac
formulas for viscosity solutions of semilinear parabolic partial differential equations.
Stoch. Dyn. 21, 8 (2021), Art. No. 2150048, 68 pp. url: doi.org/10.1142/S0219493721500489.
[24] Beck, C., Hutzenthaler, M., Jentzen, A., and Kuckuck, B. An overview
on deep learning-based approximation methods for partial differential equations.
Discrete Contin. Dyn. Syst. Ser. B 28, 6 (2023), pp. 3697–3746. url: doi.org/10.
3934/dcdsb.2022238.
[25] Beck, C., Jentzen, A., and Kuckuck, B. Full error analysis for the training
of deep neural networks. Infin. Dimens. Anal. Quantum Probab. Relat. Top. 25, 2
(2022), Art. No. 2150020, 76 pp. url: doi.org/10.1142/S021902572150020X.
[26] Belak, C., Hager, O., Reimers, C., Schnell, L., and Würschmidt, M.
Convergence Rates for a Deep Learning Algorithm for Semilinear PDEs (2021).
Available at SSRN, 42 pp. url: doi.org/10.2139/ssrn.3981933.
[27] Bellman, R. Dynamic programming. Reprint of the 1957 edition. Princeton
University Press, Princeton, NJ, 2010, xxx+340 pp. url: doi.org/10.1515/9781400835386.
[28] Beneventano, P., Cheridito, P., Graeber, R., Jentzen, A., and Kuck-
uck, B. Deep neural network approximation theory for high-dimensional functions.
arXiv:2112.14523 (2021), 82 pp. url: arxiv.org/abs/2112.14523.
[29] Beneventano, P., Cheridito, P., Jentzen, A., and von Wurstemberger, P.
High-dimensional approximation spaces of artificial neural networks and applications
to partial differential equations. arXiv:2012.04326 (2020). url: arxiv.org/abs/
2012.04326.
[30] Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies
with gradient descent is difficult. IEEE Trans. Neural Netw. 5, 2 (1994), pp. 157–166.
url: doi.org/10.1109/72.279181.
[31] Bengio, Y., Boulanger-Lewandowski, N., and Pascanu, R. Advances in
optimizing recurrent networks. In 2013 IEEE International Conference on Acoustics,
Speech and Signal Processing (Vancouver, BC, Canada, May 26–31, 2013). 2013,
pp. 8624–8628. url: doi.org/10.1109/ICASSP.2013.6639349.
[32] Benth, F. E., Detering, N., and Galimberti, L. Neural networks in Fréchet
spaces. Ann. Math. Artif. Intell. 91, 1 (2023), pp. 75–103. url: doi.org/10.1007/
s10472-022-09824-z.
[33] Bercu, B. and Fort, J.-C. Generic Stochastic Gradient Methods. In Wiley
Encyclopedia of Operations Research and Management Science. Ed. by Cochran,
J. J., Cox Jr., L. A., Keskinocak, P., Kharoufeh, J. P., and Smith, J. C. John Wiley
& Sons, Ltd., 2013. url: doi.org/10.1002/9780470400531.eorms1068.
[34] Berg, J. and Nyström, K. A unified deep artificial neural network approach to
partial differential equations in complex geometries. Neurocomputing 317 (2018),
pp. 28–41. url: doi.org/10.1016/j.neucom.2018.06.056.
[35] Berner, J., Grohs, P., and Jentzen, A. Analysis of the Generalization Error:
Empirical Risk Minimization over Deep Artificial Neural Networks Overcomes the
Curse of Dimensionality in the Numerical Approximation of Black–Scholes Partial
Differential Equations. SIAM J. Math. Data Sci. 2, 3 (2020), pp. 631–657. url:
doi.org/10.1137/19M125649X.
[36] Berner, J., Grohs, P., Kutyniok, G., and Petersen, P. The Modern
Mathematics of Deep Learning. In Mathematical Aspects of Deep Learning. Ed.
by Grohs, P. and Kutyniok, G. Cambridge University Press, 2022, pp. 1–111. url:
doi.org/10.1017/9781009025096.002.
[37] Beznea, L., Cimpean, I., Lupascu-Stamate, O., Popescu, I., and Zarnescu,
A. From Monte Carlo to neural networks approximations of boundary value problems.
arXiv:2209.01432 (2022), 40 pp. url: arxiv.org/abs/2209.01432.
[38] Bierstone, E. and Milman, P. D. Semianalytic and subanalytic sets. Inst. Hautes
Études Sci. Publ. Math. 67 (1988), pp. 5–42. url: doi.org/10.1007/BF02699126.
[39] Bishop, C. M. Neural networks for pattern recognition. The Clarendon Press, Oxford
University Press, New York, 1995, xviii+482 pp.
[40] Bjorck, N., Gomes, C. P., Selman, B., and Weinberger, K. Q. Understand-
ing Batch Normalization. In Advances in Neural Information Processing Systems
(NeurIPS 2018). Ed. by Bengio, S., Wallach, H., Larochelle, H., Grauman, K.,
Cesa-Bianchi, N., and Garnett, R. Vol. 31. Curran Associates, Inc., 2018. url:
proceedings.neurips.cc/paper_files/paper/2018/file/36072923bfc3cf477
45d704feb489480-Paper.pdf.
[41] Blum, E. K. and Li, L. K. Approximation theory and feedforward networks. Neural
Networks 4, 4 (1991), pp. 511–515. url: doi.org/10.1016/0893-6080(91)90047-9.
[42] Blumers, A. L., Li, Z., and Karniadakis, G. E. Supervised parallel-in-time
algorithm for long-time Lagrangian simulations of stochastic dynamics: Application
to hydrodynamics. J. Comput. Phys. 393 (2019), pp. 214–228. url: doi.org/10.
1016/j.jcp.2019.05.016.
[43] Bölcskei, H., Grohs, P., Kutyniok, G., and Petersen, P. Optimal approxi-
mation with sparsely connected deep neural networks. SIAM J. Math. Data Sci. 1, 1
(2019), pp. 8–45. url: doi.org/10.1137/18M118709X.
[44] Bolte, J., Daniilidis, A., and Lewis, A. The Łojasiewicz inequality for nons-
mooth subanalytic functions with applications to subgradient dynamical systems.
SIAM J. Optim. 17, 4 (2006), pp. 1205–1223. url: doi.org/10.1137/050644641.
[45] Bolte, J. and Pauwels, E. Conservative set valued fields, automatic differentia-
tion, stochastic gradient methods and deep learning. Math. Program. 188, 1 (2021),
pp. 19–51. url: doi.org/10.1007/s10107-020-01501-5.
[46] Borovykh, A., Bohte, S., and Oosterlee, C. W. Conditional Time Series
Forecasting with Convolutional Neural Networks. arXiv:1703.04691 (2017), 22 pp.
url: arxiv.org/abs/1703.04691.
[47] Bottou, L., Cortes, C., Denker, J., Drucker, H., Guyon, I., Jackel,
L., LeCun, Y., Muller, U., Sackinger, E., Simard, P., and Vapnik, V.
Comparison of classifier methods: a case study in handwritten digit recognition. In
Proceedings of the 12th IAPR International Conference on Pattern Recognition, Vol.
3 - Conference C: Signal Processing (Cat. No.94CH3440-5) (Jerusalem, Israel, Oct. 9–
13, 1994). Vol. 2. 1994, pp. 77–82. url: doi.org/10.1109/ICPR.1994.576879.
[48] Bottou, L., Curtis, F. E., and Nocedal, J. Optimization Methods for Large-
Scale Machine Learning. SIAM Rev. 60, 2 (2018), pp. 223–311. url: doi.org/10.
1137/16M1080173.
[49] Bourlard, H. and Kamp, Y. Auto-association by multilayer perceptrons and
singular value decomposition. Biol. Cybernet. 59, 4–5 (1988), pp. 291–294. url:
doi.org/10.1007/BF00332918.
[50] Boussange, V., Becker, S., Jentzen, A., Kuckuck, B., and Pellissier, L.
Deep learning approximations for non-local nonlinear PDEs with Neumann boundary
conditions. arXiv:2205.03672 (2022), 59 pp. url: arxiv.org/abs/2205.03672.
[51] Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A., Jozefowicz, R., and
Bengio, S. Generating Sentences from a Continuous Space. In Proceedings of the
20th SIGNLL Conference on Computational Natural Language Learning (Berlin,
Germany, Aug. 7–12, 2016). Ed. by Riezler, S. and Goldberg, Y. Association for
Computational Linguistics, 2016, pp. 10–21. url: doi.org/10.18653/v1/K16-1002.
[52] Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University
Press, 2004. 727 pp. url: doi.org/10.1017/CBO9780511804441.
[53] Brandstetter, J., van den Berg, R., Welling, M., and Gupta, J. K.
Clifford Neural Layers for PDE Modeling. arXiv:2209.04934 (2022), 58 pp. url:
arxiv.org/abs/2209.04934.
[54] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal,
P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S.,
Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A.,
Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E.,
Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S.,
Radford, A., Sutskever, I., and Amodei, D. Language Models are Few-Shot
Learners. arXiv:2005.14165 (2020), 75 pp. url: arxiv.org/abs/2005.14165.
[55] Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. Spectral Networks
and Locally Connected Networks on Graphs. arXiv:1312.6203 (2013), 14 pp. url:
arxiv.org/abs/1312.6203.
[56] Brunton, S. L. and Kutz, J. N. Machine Learning for Partial Differential
Equations. arXiv:2303.17078 (2023), 16 pp. url: arxiv.org/abs/2303.17078.
[57] Bubeck, S. Convex Optimization: Algorithms and Complexity. Found. Trends
Mach. Learn. 8, 3–4 (2015), pp. 231–357. url: doi.org/10.1561/2200000050.
[58] Cai, W. and Xu, Z.-Q. J. Multi-scale Deep Neural Networks for Solving High
Dimensional PDEs. arXiv:1910.11710 (2019), 14 pp. url: arxiv.org/abs/1910.
11710.
[59] Cakir, E., Parascandolo, G., Heittola, T., Huttunen, H., and Virtanen,
T. Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection.
IEEE/ACM Trans. Audio, Speech and Lang. Proc. 25, 6 (2017), pp. 1291–1303. url:
doi.org/10.1109/TASLP.2017.2690575.
[60] Calin, O. Deep learning architectures—a mathematical approach. Springer, Cham,
2020, xxx+760 pp. url: doi.org/10.1007/978-3-030-36721-3.
[73] Cheridito, P., Soner, H. M., Touzi, N., and Victoir, N. Second-order
backward stochastic differential equations and fully nonlinear parabolic PDEs. Comm.
Pure Appl. Math. 60, 7 (2007), pp. 1081–1110. url: doi.org/10.1002/cpa.20168.
[74] Chizat, L. and Bach, F. On the Global Convergence of Gradient Descent for Over-
parameterized Models using Optimal Transport. In Advances in Neural Information
Processing Systems (NeurIPS 2018). Ed. by Bengio, S., Wallach, H., Larochelle,
H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. Vol. 31. Curran Associates,
Inc., 2018. url: proceedings . neurips . cc / paper _ files / paper / 2018 / file /
a1afc58c6ca9540d057299ec3016d726-Paper.pdf.
[75] Chizat, L., Oyallon, E., and Bach, F. On Lazy Training in Differentiable
Programming. In Advances in Neural Information Processing Systems (NeurIPS
2019). Ed. by Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox,
E., and Garnett, R. Vol. 32. Curran Associates, Inc., 2019. url: proceedings .
neurips.cc/paper_files/paper/2019/file/ae614c557843b1df326cb29c57225
459-Paper.pdf.
[76] Cho, K., van Merriënboer, B., Bahdanau, D., and Bengio, Y. On the
Properties of Neural Machine Translation: Encoder–Decoder Approaches. In Proceed-
ings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical
Translation (Doha, Qatar, Oct. 25, 2014). Association for Computational Linguistics,
2014, pp. 103–111. url: doi.org/10.3115/v1/W14-4012.
[77] Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares,
F., Schwenk, H., and Bengio, Y. Learning Phrase Representations using RNN
Encoder–Decoder for Statistical Machine Translation. arXiv:1406.1078 (2014), 15 pp.
url: arxiv.org/abs/1406.1078.
[78] Choi, K., Fazekas, G., Sandler, M., and Cho, K. Convolutional recurrent
neural networks for music classification. In 2017 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP) (New Orleans, LA, USA, Mar. 5–
9, 2017). 2017, pp. 2392–2396. url: doi.org/10.1109/ICASSP.2017.7952585.
[79] Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G., and Le-
Cun, Y. The Loss Surfaces of Multilayer Networks. In Proceedings of the Eighteenth
International Conference on Artificial Intelligence and Statistics (San Diego, Cal-
ifornia, USA, May 9–12, 2015). Ed. by Lebanon, G. and Vishwanathan, S. V. N.
Vol. 38. Proceedings of Machine Learning Research. PMLR, 2015, pp. 192–204. url:
proceedings.mlr.press/v38/choromanska15.html.
[80] Choromanska, A., LeCun, Y., and Ben Arous, G. Open Problem: The
landscape of the loss surfaces of multilayer networks. In Proceedings of The 28th
Conference on Learning Theory (Paris, France, July 3–6, 2015). Ed. by Grünwald, P.,
Hazan, E., and Kale, S. Vol. 40. Proceedings of Machine Learning Research. PMLR,
2015, pp. 1756–1760. url: proceedings.mlr.press/v40/Choromanska15.html.
[81] Chorowski, J. K., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y.
Attention-Based Models for Speech Recognition. In Advances in Neural Informa-
tion Processing Systems (NeurIPS 2015). Ed. by Cortes, C., Lawrence, N., Lee,
D., Sugiyama, M., and Garnett, R. Vol. 28. Curran Associates, Inc., 2015. url:
proceedings.neurips.cc/paper_files/paper/2015/file/1068c6e4c8051cfd4
e9ea8072e3189e2-Paper.pdf.
[82] Cioica-Licht, P. A., Hutzenthaler, M., and Werner, P. T. Deep neural
networks overcome the curse of dimensionality in the numerical approximation
of semilinear partial differential equations. arXiv:2205.14398 (2022), 34 pp. url:
arxiv.org/abs/2205.14398.
[83] Clevert, D.-A., Unterthiner, T., and Hochreiter, S. Fast and Accurate
Deep Network Learning by Exponential Linear Units (ELUs). arXiv:1511.07289
(2015), 14 pp. url: arxiv.org/abs/1511.07289.
[84] Colding, T. H. and Minicozzi II, W. P. Łojasiewicz inequalities and applications.
In Surveys in Differential Geometry 2014. Regularity and evolution of nonlinear
equations. Vol. 19. Int. Press, Somerville, MA, 2015, pp. 63–82. url: doi.org/10.
4310/SDG.2014.v19.n1.a3.
[85] Coleman, R. Calculus on normed vector spaces. Springer New York, 2012, xi+249
pp. url: doi.org/10.1007/978-1-4614-3894-6.
[86] Cox, S., Hutzenthaler, M., Jentzen, A., van Neerven, J., and Welti, T.
Convergence in Hölder norms with applications to Monte Carlo methods in infinite
dimensions. IMA J. Numer. Anal. 41, 1 (2020), pp. 493–548. url: doi.org/10.
1093/imanum/drz063.
[87] Cucker, F. and Smale, S. On the mathematical foundations of learning. Bull.
Amer. Math. Soc. (N.S.) 39, 1 (2002), pp. 1–49. url: doi.org/10.1090/S0273-
0979-01-00923-5.
[88] Cuomo, S., Di Cola, V. S., Giampaolo, F., Rozza, G., Raissi, M., and Pic-
cialli, F. Scientific Machine Learning Through Physics–Informed Neural Networks:
Where we are and What’s Next. J. Sci. Comp. 92, 3 (2022), Art. No. 88, 62 pp. url:
doi.org/10.1007/s10915-022-01939-z.
[89] Cybenko, G. Approximation by superpositions of a sigmoidal function. Math.
Control Signals Systems 2, 4 (1989), pp. 303–314. url: doi.org/10.1007/BF02551
274.
[90] D. Jagtap, A. and Em Karniadakis, G. Extended Physics-Informed Neural
Networks (XPINNs): A Generalized Space-Time Domain Decomposition Based Deep
Learning Framework for Nonlinear Partial Differential Equations. Commun. Comput.
Phys. 28, 5 (2020), pp. 2002–2041. url: doi.org/10.4208/cicp.OA-2020-0164.
[91] Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q., and Salakhutdinov,
R. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context.
In Proceedings of the 57th Annual Meeting of the Association for Computational
Linguistics (Florence, Italy, July 28–Aug. 2, 2019). Association for Computational
Linguistics, 2019, pp. 2978–2988. url: doi.org/10.18653/v1/P19-1285.
[92] Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and
Bengio, Y. Identifying and attacking the saddle point problem in high-dimensional
non-convex optimization. In Advances in Neural Information Processing Systems.
Ed. by Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberger, K.
Vol. 27. Curran Associates, Inc., 2014. url: proceedings.neurips.cc/paper_
files/paper/2014/file/17e23e50bedc63b4095e3d8204ce063b-Paper.pdf.
[93] Davis, D., Drusvyatskiy, D., Kakade, S., and Lee, J. D. Stochastic sub-
gradient method converges on tame functions. Found. Comput. Math. 20, 1 (2020),
pp. 119–154. url: doi.org/10.1007/s10208-018-09409-5.
[94] De Ryck, T. and Mishra, S. Generic bounds on the approximation error for
physics-informed (and) operator learning. arXiv:2205.11393 (2022), 40 pp. url:
arxiv.org/abs/2205.11393.
[95] Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional Neural
Networks on Graphs with Fast Localized Spectral Filtering. In Advances in Neural
Information Processing Systems. Ed. by Lee, D., Sugiyama, M., Luxburg, U., Guyon,
I., and Garnett, R. Vol. 29. Curran Associates, Inc., 2016. url: proceedings .
neurips.cc/paper_files/paper/2016/file/04df4d434d481c5bb723be1b6df1
ee65-Paper.pdf.
[96] Défossez, A., Bottou, L., Bach, F., and Usunier, N. A Simple Convergence
Proof of Adam and Adagrad. arXiv:2003.02395 (2020), 30 pp. url: arxiv.org/
abs/2003.02395.
[97] Deisenroth, M. P., Faisal, A. A., and Ong, C. S. Mathematics for machine
learning. Cambridge University Press, Cambridge, 2020, xvii+371 pp. url: doi.
org/10.1017/9781108679930.
[98] Deng, B., Shin, Y., Lu, L., Zhang, Z., and Karniadakis, G. E. Approximation
rates of DeepONets for learning operators arising from advection–diffusion equations.
Neural Networks 153 (2022), pp. 411–426. url: doi.org/10.1016/j.neunet.2022.
06.019.
[99] Dereich, S., Jentzen, A., and Kassing, S. On the existence of minimizers in
shallow residual ReLU neural network optimization landscapes. arXiv:2302.14690
(2023), 26 pp. url: arxiv.org/abs/2302.14690.
[110] Dos Santos, C. and Gatti, M. Deep Convolutional Neural Networks for Sentiment
Analysis of Short Texts. In Proceedings of COLING 2014, the 25th International Con-
ference on Computational Linguistics: Technical Papers (Dublin, Ireland, Aug. 23–29,
2014). Dublin City University and Association for Computational Linguistics, 2014,
pp. 69–78. url: aclanthology.org/C14-1008.
[111] Dozat, T. Incorporating Nesterov momentum into Adam. https://openreview.
net/forum?id=OM0jvwB8jIp57ZJjtNEZ. [Accessed 6-December-2017]. 2016.
[112] Dozat, T. Incorporating Nesterov momentum into Adam. http://cs229.stanford.
edu/proj2015/054_report.pdf. [Accessed 6-December-2017]. 2016.
[113] Du, S. and Lee, J. On the Power of Over-parametrization in Neural Networks
with Quadratic Activation. In Proceedings of the 35th International Conference on
Machine Learning (Stockholm, Sweden, July 10–15, 2018). Ed. by Dy, J. and Krause,
A. Vol. 80. Proceedings of Machine Learning Research. PMLR, 2018, pp. 1329–1338.
url: proceedings.mlr.press/v80/du18a.html.
[114] Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. Gradient Descent Finds Global
Minima of Deep Neural Networks. In Proceedings of the 36th International Conference
on Machine Learning (Long Beach, CA, USA, June 9–15, 2019). Ed. by Chaudhuri,
K. and Salakhutdinov, R. Vol. 97. Proceedings of Machine Learning Research. PMLR,
2019, pp. 1675–1685. url: proceedings.mlr.press/v97/du19c.html.
[115] Du, T., Huang, Z., and Li, Y. Approximation and Generalization of DeepONets
for Learning Operators Arising from a Class of Singularly Perturbed Problems.
arXiv:2306.16833 (2023), 32 pp. url: arxiv.org/abs/2306.16833.
[116] Duchi, J. Probability Bounds. https://stanford.edu/~jduchi/projects/probability_bounds.pdf. [Accessed 27-October-2023].
[117] Duchi, J., Hazan, E., and Singer, Y. Adaptive Subgradient Methods for Online
Learning and Stochastic Optimization. J. Mach. Learn. Res. 12 (2011), pp. 2121–
2159. url: jmlr.org/papers/v12/duchi11a.html.
[118] Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Ar-
jovsky, M., and Courville, A. Adversarially Learned Inference. arXiv:1606.00704
(2016), 18 pp. url: arxiv.org/abs/1606.00704.
[119] E, W., Han, J., and Jentzen, A. Deep learning-based numerical methods for
high-dimensional parabolic partial differential equations and backward stochastic
differential equations. Commun. Math. Stat. 5, 4 (2017), pp. 349–380. url: doi.
org/10.1007/s40304-017-0117-6.
[120] E, W., Han, J., and Jentzen, A. Algorithms for solving high dimensional PDEs:
from nonlinear Monte Carlo to machine learning. Nonlinearity 35, 1 (2021), p. 278.
url: doi.org/10.1088/1361-6544/ac337f.
[121] E, W., Ma, C., and Wu, L. The Barron space and the flow-induced function
spaces for neural network models. Constr. Approx. 55, 1 (2022), pp. 369–406. url:
doi.org/10.1007/s00365-021-09549-y.
[122] E, W., Ma, C., Wu, L., and Wojtowytsch, S. Towards a Mathematical
Understanding of Neural Network-Based Machine Learning: What We Know and
What We Don’t. CSIAM Trans. Appl. Math. 1, 4 (2020), pp. 561–615. url: doi.
org/10.4208/csiam-am.SO-2020-0002.
[123] E, W. and Wojtowytsch, S. Some observations on high-dimensional partial
differential equations with Barron data. In Proceedings of the 2nd Mathematical
and Scientific Machine Learning Conference (Aug. 16–19, 2021). Ed. by Bruna, J.,
Hesthaven, J., and Zdeborova, L. Vol. 145. Proceedings of Machine Learning Research.
PMLR, 2022, pp. 253–269. url: proceedings.mlr.press/v145/e22a.html.
[124] E, W. and Yu, B. The deep Ritz method: a deep learning-based numerical algorithm
for solving variational problems. Commun. Math. Stat. 6, 1 (2018), pp. 1–12. url:
doi.org/10.1007/s40304-018-0127-z.
[125] Eberle, S., Jentzen, A., Riekert, A., and Weiss, G. Normalized gradient flow
optimization in the training of ReLU artificial neural networks. arXiv:2207.06246
(2022), 26 pp. url: arxiv.org/abs/2207.06246.
[126] Eberle, S., Jentzen, A., Riekert, A., and Weiss, G. S. Existence, uniqueness,
and convergence rates for gradient flows in the training of artificial neural networks
with ReLU activation. Electron. Res. Arch. 31, 5 (2023), pp. 2519–2554. url:
doi.org/10.3934/era.2023128.
[127] Einsiedler, M. and Ward, T. Functional analysis, spectral theory, and applica-
tions. Vol. 276. Springer, Cham, 2017, xiv+614 pp. url: doi.org/10.1007/978-3-
319-58540-6.
[128] Elbrächter, D., Grohs, P., Jentzen, A., and Schwab, C. DNN expression
rate analysis of high-dimensional PDEs: application to option pricing. Constr. Approx.
55, 1 (2022), pp. 3–71. url: doi.org/10.1007/s00365-021-09541-6.
[129] Encyclopedia of Mathematics: Lojasiewicz inequality. https://encyclopediaofmath.
org/wiki/Lojasiewicz_inequality. [Accessed 28-August-2023].
[130] Fabbri, M. and Moro, G. Dow Jones Trading with Deep Learning: The Un-
reasonable Effectiveness of Recurrent Neural Networks. In Proceedings of the 7th
International Conference on Data Science, Technology and Applications (Porto,
Portugal, July 26–28, 2018). Ed. by Bernardino, J. and Quix, C. SciTePress - Science
and Technology Publications, 2018. url: doi.org/10.5220/0006922101420153.
[131] Fan, J., Ma, C., and Zhong, Y. A selective overview of deep learning. Statist.
Sci. 36, 2 (2021), pp. 264–290. url: doi.org/10.1214/20-sts783.
[132] Fehrman, B., Gess, B., and Jentzen, A. Convergence Rates for the Stochastic
Gradient Descent Method for Non-Convex Objective Functions. J. Mach. Learn.
Res. 21, 136 (2020), pp. 1–48. url: jmlr.org/papers/v21/19-636.html.
[133] Fischer, T. and Krauss, C. Deep learning with long short-term memory networks
for financial market predictions. European J. Oper. Res. 270, 2 (2018), pp. 654–669.
url: doi.org/10.1016/j.ejor.2017.11.054.
[134] Fraenkel, L. E. Formulae for high derivatives of composite functions. Math.
Proc. Cambridge Philos. Soc. 83, 2 (1978), pp. 159–165. url: doi.org/10.1017/
S0305004100054402.
[135] Fresca, S., Dede’, L., and Manzoni, A. A comprehensive deep learning-based
approach to reduced order modeling of nonlinear time-dependent parametrized PDEs.
J. Sci. Comput. 87, 2 (2021), Art. No. 61, 36 pp. url: doi.org/10.1007/s10915-
021-01462-7.
[136] Fresca, S. and Manzoni, A. POD-DL-ROM: enhancing deep learning-based
reduced order models for nonlinear parametrized PDEs by proper orthogonal decom-
position. Comput. Methods Appl. Mech. Engrg. 388 (2022), Art. No. 114181, 27 pp.
url: doi.org/10.1016/j.cma.2021.114181.
[137] Frey, R. and Köck, V. Convergence Analysis of the Deep Splitting Scheme:
the Case of Partial Integro-Differential Equations and the associated FBSDEs with
Jumps. arXiv:2206.01597 (2022), 21 pp. url: arxiv.org/abs/2206.01597.
[138] Frey, R. and Köck, V. Deep Neural Network Algorithms for Parabolic PIDEs
and Applications in Insurance and Finance. Computation 10, 11 (2022). url: doi.
org/10.3390/computation10110201.
[139] Friedrichs, K. O. Symmetric positive linear differential equations. Comm. Pure
Appl. Math. 11 (1958), pp. 333–418. url: doi.org/10.1002/cpa.3160110306.
[140] Fujii, M., Takahashi, A., and Takahashi, M. Asymptotic Expansion as Prior
Knowledge in Deep Learning Method for High dimensional BSDEs. Asia-Pacific
Financial Markets 26, 3 (2019), pp. 391–408. url: doi.org/10.1007/s10690-019-
09271-7.
[141] Fukumizu, K. and Amari, S. Local minima and plateaus in hierarchical structures
of multilayer perceptrons. Neural Networks 13, 3 (2000), pp. 317–327. url: doi.
org/10.1016/S0893-6080(00)00009-5.
[142] Gallon, D., Jentzen, A., and Lindner, F. Blow up phenomena for gra-
dient descent optimization methods in the training of artificial neural networks.
arXiv:2211.15641 (2022), 84 pp. url: arxiv.org/abs/2211.15641.
[143] Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. Con-
volutional Sequence to Sequence Learning. In Proceedings of the 34th International
Conference on Machine Learning (Sydney, Australia, Aug. 6–11, 2017). Ed. by Pre-
cup, D. and Teh, Y. W. Vol. 70. Proceedings of Machine Learning Research. PMLR,
2017, pp. 1243–1252. url: proceedings.mlr.press/v70/gehring17a.html.
[144] Gentile, R. and Welper, G. Approximation results for Gradient Descent trained
Shallow Neural Networks in 1d. arXiv:2209.08399 (2022), 49 pp. url: arxiv.org/
abs/2209.08399.
[145] Germain, M., Pham, H., and Warin, X. Neural networks-based algorithms
for stochastic control and PDEs in finance. arXiv:2101.08068 (2021), 27 pp. url:
arxiv.org/abs/2101.08068.
[146] Germain, M., Pham, H., and Warin, X. Approximation error analysis of some
deep backward schemes for nonlinear PDEs. SIAM J. Sci. Comput. 44, 1 (2022),
A28–A56. url: doi.org/10.1137/20M1355355.
[147] Gers, F. A., Schmidhuber, J., and Cummins, F. Learning to Forget: Continual
Prediction with LSTM. Neural Comput. 12, 10 (2000), pp. 2451–2471. url: doi.
org/10.1162/089976600300015015.
[148] Gers, F. A., Schraudolph, N. N., and Schmidhuber, J. Learning precise
timing with LSTM recurrent networks. J. Mach. Learn. Res. 3, 1 (2003), pp. 115–143.
url: doi.org/10.1162/153244303768966139.
[149] Gess, B., Kassing, S., and Konarovskyi, V. Stochastic Modified Flows, Mean-
Field Limits and Dynamics of Stochastic Gradient Descent. arXiv:2302.07125 (2023),
24 pp. url: arxiv.org/abs/2302.07125.
[150] Giles, M. B., Jentzen, A., and Welti, T. Generalised multilevel Picard ap-
proximations. arXiv:1911.03188 (2019), 61 pp. url: arxiv.org/abs/1911.03188.
[151] Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E.
Neural Message Passing for Quantum Chemistry. In Proceedings of the 34th Interna-
tional Conference on Machine Learning (Sydney, Australia, Aug. 6–11, 2017). Ed. by
Precup, D. and Teh, Y. W. Vol. 70. Proceedings of Machine Learning Research.
PMLR, 2017, pp. 1263–1272. url: proceedings.mlr.press/v70/gilmer17a.html.
[152] Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich Feature Hierar-
chies for Accurate Object Detection and Semantic Segmentation. In Proceedings of
the 2014 IEEE Conference on Computer Vision and Pattern Recognition (Columbus,
OH, USA, June 23–28, 2014). CVPR ’14. IEEE Computer Society, 2014, pp. 580–587.
url: doi.org/10.1109/CVPR.2014.81.
[153] Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedfor-
ward neural networks. In Proceedings of the Thirteenth International Conference on
Artificial Intelligence and Statistics (Chia Laguna Resort, Sardinia, Italy, May 13–15,
2010). Ed. by Teh, Y. W. and Titterington, M. Vol. 9. Proceedings of Machine
Learning Research. PMLR, 2010, pp. 249–256. url: proceedings.mlr.press/v9/
glorot10a.html.
[154] Gnoatto, A., Patacca, M., and Picarelli, A. A deep solver for BSDEs with
jumps. arXiv:2211.04349 (2022), 31 pp. url: arxiv.org/abs/2211.04349.
[155] Gobet, E. Monte-Carlo methods and stochastic processes. From linear to non-linear.
CRC Press, Boca Raton, FL, 2016, xxv+309 pp.
[156] Godichon-Baggioni, A. and Tarrago, P. Non asymptotic analysis of Adaptive
stochastic gradient algorithms and applications. arXiv:2303.01370 (2023), 59 pp.
url: arxiv.org/abs/2303.01370.
[157] Goldberg, Y. Neural Network Methods for Natural Language Processing. Springer
Cham, 2017, xx+292 pp. url: doi.org/10.1007/978-3-031-02165-7.
[158] Gonon, L. Random Feature Neural Networks Learn Black-Scholes Type PDEs
Without Curse of Dimensionality. J. Mach. Learn. Res. 24, 189 (2023), pp. 1–51.
url: jmlr.org/papers/v24/21-0987.html.
[159] Gonon, L., Graeber, R., and Jentzen, A. The necessity of depth for artificial
neural networks to approximate certain classes of smooth and bounded functions
without the curse of dimensionality. arXiv:2301.08284 (2023), 101 pp. url: arxiv.
org/abs/2301.08284.
[160] Gonon, L., Grigoryeva, L., and Ortega, J.-P. Approximation bounds for
random neural networks and reservoir systems. Ann. Appl. Probab. 33, 1 (2023),
pp. 28–69. url: doi.org/10.1214/22-aap1806.
[161] Gonon, L., Grohs, P., Jentzen, A., Kofler, D., and Šiška, D. Uniform error
estimates for artificial neural network approximations for heat equations. IMA J.
Numer. Anal. 42, 3 (2022), pp. 1991–2054. url: doi.org/10.1093/imanum/drab027.
[162] Gonon, L. and Schwab, C. Deep ReLU network expression rates for option
prices in high-dimensional, exponential Lévy models. Finance Stoch. 25, 4 (2021),
pp. 615–657. url: doi.org/10.1007/s00780-021-00462-7.
[163] Gonon, L. and Schwab, C. Deep ReLU neural networks overcome the curse of
dimensionality for partial integrodifferential equations. Anal. Appl. (Singap.) 21, 1
(2023), pp. 1–47. url: doi.org/10.1142/S0219530522500129.
[164] Goodfellow, I., Bengio, Y., and Courville, A. Deep learning. MIT Press,
Cambridge, MA, 2016, xxii+775 pp. url: www.deeplearningbook.org/.
[165] Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley,
D., Ozair, S., Courville, A., and Bengio, Y. Generative Adversarial Networks.
arXiv:1406.2661 (2014), 9 pp. url: arxiv.org/abs/1406.2661.
[166] Gori, M., Monfardini, G., and Scarselli, F. A new model for learning in
graph domains. In Proceedings. 2005 IEEE International Joint Conference on Neural
Networks, 2005. Vol. 2. 2005, pp. 729–734. url: doi.org/10.1109/IJCNN.2005.
1555942.
[167] Goswami, S., Jagtap, A. D., Babaee, H., Susi, B. T., and Karniadakis,
G. E. Learning stiff chemical kinetics using extended deep neural operators.
arXiv:2302.12645 (2023), 21 pp. url: arxiv.org/abs/2302.12645.
[168] Graham, C. and Talay, D. Stochastic simulation and Monte Carlo methods.
Vol. 68. Mathematical foundations of stochastic simulation. Springer, Heidelberg,
2013, xvi+260 pp. url: doi.org/10.1007/978-3-642-39363-1.
[169] Graves, A. Generating Sequences With Recurrent Neural Networks. arXiv:1308.0850
(2013), 43 pp. url: arxiv.org/abs/1308.0850.
[170] Graves, A. and Jaitly, N. Towards End-To-End Speech Recognition with Re-
current Neural Networks. In Proceedings of the 31st International Conference on
Machine Learning (Bejing, China, June 22–24, 2014). Ed. by Xing, E. P. and Jebara,
T. Vol. 32. Proceedings of Machine Learning Research 2. PMLR, 2014, pp. 1764–1772.
url: proceedings.mlr.press/v32/graves14.html.
[171] Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., and
Schmidhuber, J. A Novel Connectionist System for Unconstrained Handwriting
Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 31, 5 (2009), pp. 855–868.
url: doi.org/10.1109/TPAMI.2008.137.
[172] Graves, A., Mohamed, A.-r., and Hinton, G. E. Speech recognition with deep
recurrent neural networks. In 2013 IEEE International Conference on Acoustics,
Speech and Signal Processing (Vancouver, BC, Canada, May 26–31, 2013). 2013,
pp. 6645–6649. url: doi.org/10.1109/ICASSP.2013.6638947.
[173] Graves, A. and Schmidhuber, J. Framewise phoneme classification with bidirec-
tional LSTM and other neural network architectures. Neural Networks 18, 5 (2005).
IJCNN 2005, pp. 602–610. url: doi.org/10.1016/j.neunet.2005.06.042.
[174] Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., and Schmid-
huber, J. LSTM: A Search Space Odyssey. IEEE Trans. Neural Netw. Learn. Syst.
28, 10 (2017), pp. 2222–2232. url: doi.org/10.1109/TNNLS.2016.2582924.
[175] Gribonval, R., Kutyniok, G., Nielsen, M., and Voigtlaender, F. Approx-
imation spaces of deep neural networks. Constr. Approx. 55, 1 (2022), pp. 259–367.
url: doi.org/10.1007/s00365-021-09543-4.
[176] Griewank, A. and Walther, A. Evaluating Derivatives. 2nd ed. Society for
Industrial and Applied Mathematics, 2008. url: doi.org/10.1137/1.9780898717
761.
[177] Grohs, P. and Herrmann, L. Deep neural network approximation for high-
dimensional elliptic PDEs with boundary conditions. IMA J. Numer. Anal. 42, 3
(May 2021), pp. 2055–2082. url: doi.org/10.1093/imanum/drab031.
[178] Grohs, P. and Herrmann, L. Deep neural network approximation for high-
dimensional parabolic Hamilton-Jacobi-Bellman equations. arXiv:2103.05744 (2021),
23 pp. url: arxiv.org/abs/2103.05744.
[179] Grohs, P., Hornung, F., Jentzen, A., and von Wurstemberger, P. A
proof that artificial neural networks overcome the curse of dimensionality in the
numerical approximation of Black-Scholes partial differential equations. Mem. Amer.
Math. Soc. 284, 1410 (2023), v+93 pp. url: doi.org/10.1090/memo/1410.
[180] Grohs, P., Hornung, F., Jentzen, A., and Zimmermann, P. Space-time error
estimates for deep neural network approximations for differential equations. Adv.
Comput. Math. 49, 1 (2023), Art. No. 4, 78 pp. url: doi.org/10.1007/s10444-
022-09970-2.
[181] Grohs, P., Jentzen, A., and Salimova, D. Deep neural network approximations
for solutions of PDEs based on Monte Carlo algorithms. Partial Differ. Equ. Appl.
3, 4 (2022), Art. No. 45, 41 pp. url: doi.org/10.1007/s42985-021-00100-z.
[182] Grohs, P. and Kutyniok, G., eds. Mathematical aspects of deep learning. Cambridge
University Press, Cambridge, 2023, xviii+473 pp. url: doi.org/10.1016/j.enganabound.2022.10.033.
[183] Gu, Y., Yang, H., and Zhou, C. SelectNet: Self-paced learning for high-dimensio-
nal partial differential equations. J. Comput. Phys. 441 (2021), p. 110444. url:
doi.org/10.1016/j.jcp.2021.110444.
[184] Gühring, I., Kutyniok, G., and Petersen, P. Error bounds for approximations
with deep ReLU neural networks in W^{s,p} norms. Anal. Appl. (Singap.) 18, 5 (2020),
pp. 803–859. url: doi.org/10.1142/S0219530519410021.
[185] Guo, X., Li, W., and Iorio, F. Convolutional Neural Networks for Steady Flow
Approximation. In Proceedings of the 22nd ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining (San Francisco, California, USA, Aug. 13–
17, 2016). KDD ’16. New York, NY, USA: Association for Computing Machinery,
2016, pp. 481–490. url: doi.org/10.1145/2939672.2939738.
[186] Han, J. and E, W. Deep Learning Approximation for Stochastic Control Problems.
arXiv:1611.07422 (2016), 9 pp. url: arxiv.org/abs/1611.07422.
[187] Han, J., Jentzen, A., and E, W. Solving high-dimensional partial differential
equations using deep learning. Proc. Natl. Acad. Sci. USA 115, 34 (2018), pp. 8505–
8510. url: doi.org/10.1073/pnas.1718942115.
[188] Han, J. and Long, J. Convergence of the deep BSDE method for coupled FBSDEs.
Probab. Uncertain. Quant. Risk 5 (2020), Art. No. 5, 33 pp. url: doi.org/10.
1186/s41546-020-00047-w.
[189] Hastie, T., Tibshirani, R., and Friedman, J. The elements of statistical
learning. 2nd ed. Data mining, inference, and prediction. Springer, New York, 2009,
xxii+745 pp. url: doi.org/10.1007/978-0-387-84858-7.
[190] He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image
Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) (Las Vegas, NV, USA, June 27–30, 2016). 2016, pp. 770–778. url: doi.
org/10.1109/CVPR.2016.90.
[191] He, K., Zhang, X., Ren, S., and Sun, J. Identity Mappings in Deep Residual
Networks. In Computer Vision – ECCV 2016, 14th European Conference, Proceedings
Part IV (Amsterdam, The Netherlands, Oct. 11–14, 2016). Ed. by Leibe, B., Matas,
J., Sebe, N., and Welling, M. Springer, Cham, 2016, pp. 630–645. url: doi.org/10.
1007/978-3-319-46493-0_38.
[192] Heiß, C., Gühring, I., and Eigel, M. Multilevel CNNs for Parametric PDEs.
arXiv:2304.00388 (2023), 42 pp. url: arxiv.org/abs/2304.00388.
[193] Hendrycks, D. and Gimpel, K. Gaussian Error Linear Units (GELUs).
arXiv:1606.08415v4 (2016), 10 pp. url: arxiv.org/abs/1606.08415.
[194] Henry, D. Geometric theory of semilinear parabolic equations. Vol. 840. Springer-
Verlag, Berlin, 1981, iv+348 pp.
[195] Henry-Labordere, P. Counterparty Risk Valuation: A Marked Branching Diffu-
sion Approach. arXiv:1203.2369 (2012), 17 pp. url: arxiv.org/abs/1203.2369.
[196] Henry-Labordere, P. Deep Primal-Dual Algorithm for BSDEs: Applications of
Machine Learning to CVA and IM (2017). Available at SSRN. url: doi.org/10.
2139/ssrn.3071506.
[197] Henry-Labordère, P. and Touzi, N. Branching diffusion representation for
nonlinear Cauchy problems and Monte Carlo approximation. Ann. Appl. Probab. 31,
5 (2021), pp. 2350–2375. url: doi.org/10.1214/20-aap1649.
[198] Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data
with neural networks. Science 313, 5786 (2006), pp. 504–507. url: doi.org/10.
1126/science.1127647.
[199] Hinton, G., Srivastava, N., and Swersky, K. Lecture 6e: RMSprop: Divide
the gradient by a running average of its recent magnitude. https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf. [Accessed
01-December-2017].
[200] Hinton, G. E. and Zemel, R. Autoencoders, Minimum Description Length and
Helmholtz Free Energy. In Advances in Neural Information Processing Systems.
Ed. by Cowan, J., Tesauro, G., and Alspector, J. Vol. 6. Morgan-Kaufmann, 1993.
url: proceedings.neurips.cc/paper_files/paper/1993/file/9e3cfc48eccf8
1a0d57663e129aef3cb-Paper.pdf.
[201] Hochreiter, S. and Schmidhuber, J. Long Short-Term Memory. Neural Comput.
9, 8 (1997), pp. 1735–1780. url: doi.org/10.1162/neco.1997.9.8.1735.
[202] Hornik, K. Some new results on neural network approximation. Neural Networks
6, 8 (1993), pp. 1069–1072. url: doi.org/10.1016/S0893-6080(09)80018-X.
[203] Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural
Networks 4, 2 (1991), pp. 251–257. url: doi.org/10.1016/0893-6080(91)90009-T.
[204] Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks
are universal approximators. Neural Networks 2, 5 (1989), pp. 359–366. url: doi.
org/10.1016/0893-6080(89)90020-8.
[205] Hornung, F., Jentzen, A., and Salimova, D. Space-time deep neural network
approximations for high-dimensional partial differential equations. arXiv:2006.02199
(2020), 52 pp. url: arxiv.org/abs/2006.02199.
[206] Huang, G., Liu, Z., Maaten, L. V. D., and Weinberger, K. Q. Densely
Connected Convolutional Networks. In 2017 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR) (Honolulu, HI, USA, July 21–26, 2017). Los
Alamitos, CA, USA: IEEE Computer Society, 2017, pp. 2261–2269. url: doi.org/
10.1109/CVPR.2017.243.
[207] Huré, C., Pham, H., and Warin, X. Deep backward schemes for high-dimensional
nonlinear PDEs. Math. Comp. 89, 324 (2020), pp. 1547–1579. url: doi.org/10.
1090/mcom/3514.
[208] Hutzenthaler, M., Jentzen, A., and Kruse, T. Overcoming the curse of
dimensionality in the numerical approximation of parabolic partial differential equa-
tions with gradient-dependent nonlinearities. Found. Comput. Math. 22, 4 (2022),
pp. 905–966. url: doi.org/10.1007/s10208-021-09514-y.
[209] Hutzenthaler, M., Jentzen, A., Kruse, T., and Nguyen, T. A. A proof
that rectified deep neural networks overcome the curse of dimensionality in the
numerical approximation of semilinear heat equations. SN Partial Differ. Equ. Appl.
10, 1 (2020). url: doi.org/10.1007/s42985-019-0006-9.
[210] Hutzenthaler, M., Jentzen, A., Kruse, T., and Nguyen, T. A. Multilevel
Picard approximations for high-dimensional semilinear second-order PDEs with
Lipschitz nonlinearities. arXiv:2009.02484 (2020), 37 pp. url: arxiv.org/abs/
2009.02484.
[211] Hutzenthaler, M., Jentzen, A., Kruse, T., and Nguyen, T. A. Overcoming
the curse of dimensionality in the numerical approximation of backward stochastic
differential equations. arXiv:2108.10602 (2021), 34 pp. url: arxiv.org/abs/2108.
10602.
[212] Hutzenthaler, M., Jentzen, A., Kruse, T., Nguyen, T. A., and von
Wurstemberger, P. Overcoming the curse of dimensionality in the numerical
approximation of semilinear parabolic partial differential equations. Proc. A. 476,
2244 (2020), Art. No. 20190630, 25 pp. url: doi.org/10.1098/rspa.2019.0630.
[213] Hutzenthaler, M., Jentzen, A., Pohl, K., Riekert, A., and Scarpa, L.
Convergence proof for stochastic gradient descent in the training of deep neural
networks with ReLU activation for constant target functions. arXiv:2112.07369
(2021), 71 pp. url: arxiv.org/abs/2112.07369.
[214] Hutzenthaler, M., Jentzen, A., and von Wurstemberger, P. Overcoming
the curse of dimensionality in the approximative pricing of financial derivatives with
default risks. Electron. J. Probab. 25 (2020), Art. No. 101, 73 pp. url: doi.org/10.
1214/20-ejp423.
[215] Ibragimov, S., Jentzen, A., Kröger, T., and Riekert, A. On the existence
of infinitely many realization functions of non-global local minima in the training of
artificial neural networks with ReLU activation. arXiv:2202.11481 (2022), 49 pp.
url: arxiv.org/abs/2202.11481.
[216] Ibragimov, S., Jentzen, A., and Riekert, A. Convergence to good non-optimal
critical points in the training of neural networks: Gradient descent optimization
with one random initialization overcomes all bad non-global local minima with high
probability. arXiv:2212.13111 (2022), 98 pp. url: arxiv.org/abs/2212.13111.
[217] Ioffe, S. and Szegedy, C. Batch Normalization: Accelerating Deep Network
Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd Inter-
national Conference on Machine Learning – Volume 37 (Lille, France, July 6–11,
2015). Ed. by Bach, F. and Blei, D. ICML’15. JMLR.org, 2015, pp. 448–456.
[218] Jacot, A., Gabriel, F., and Hongler, C. Neural Tangent Kernel: Convergence
and Generalization in Neural Networks. In Advances in Neural Information Processing
Systems. Ed. by Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi,
N., and Garnett, R. Vol. 31. Curran Associates, Inc., 2018. url:
proceedings.neurips.cc/paper_files/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf.
[229] Jentzen, A. and von Wurstemberger, P. Lower error bounds for the stochastic
gradient descent optimization algorithm: Sharp convergence rates for slowly and fast
decaying learning rates. J. Complexity 57 (2020), Art. No. 101438. url: doi.org/
10.1016/j.jco.2019.101438.
[230] Jentzen, A. and Welti, T. Overall error analysis for the training of deep neural
networks via stochastic gradient descent with random initialisation. Appl. Math.
Comput. 455 (2023), Art. No. 127907, 34 pp. url: doi.org/10.1016/j.amc.2023.
127907.
[231] Jin, X., Cai, S., Li, H., and Karniadakis, G. E. NSFnets (Navier-Stokes
flow nets): Physics-informed neural networks for the incompressible Navier-Stokes
equations. J. Comput. Phys. 426 (2021), Art. No. 109951. url: doi.org/10.1016/
j.jcp.2020.109951.
[232] Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ron-
neberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko,
A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie,
A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., Back, T.,
Petersen, S., Reiman, D., Clancy, E., Zielinski, M., Steinegger, M.,
Pacholska, M., Berghammer, T., Bodenstein, S., Silver, D., Vinyals, O.,
Senior, A. W., Kavukcuoglu, K., Kohli, P., and Hassabis, D. Highly
accurate protein structure prediction with AlphaFold. Nature 596, 7873 (2021),
pp. 583–589. url: doi.org/10.1038/s41586-021-03819-2.
[233] Kainen, P. C., Kůrková, V., and Vogt, A. Best approximation by linear
combinations of characteristic functions of half-spaces. J. Approx. Theory 122, 2
(2003), pp. 151–159. url: doi.org/10.1016/S0021-9045(03)00072-8.
[234] Karatzas, I. and Shreve, S. E. Brownian motion and stochastic calculus. 2nd ed.
Vol. 113. Springer-Verlag, New York, 1991, xxiv+470 pp. url: doi.org/10.1007/
978-1-4612-0949-2.
[235] Karevan, Z. and Suykens, J. A. Transductive LSTM for time-series prediction:
An application to weather forecasting. Neural Networks 125 (2020), pp. 1–9. url:
doi.org/10.1016/j.neunet.2019.12.030.
[236] Karim, F., Majumdar, S., Darabi, H., and Chen, S. LSTM Fully Convolutional
Networks for Time Series Classification. IEEE Access 6 (2018), pp. 1662–1669. url:
doi.org/10.1109/ACCESS.2017.2779939.
[237] Karniadakis, G. E., Kevrekidis, I. G., Lu, L., Perdikaris, P., Wang, S.,
and Yang, L. Physics-informed machine learning. Nat. Rev. Phys. 3, 6 (2021),
pp. 422–440. url: doi.org/10.1038/s42254-021-00314-5.
[238] Karpathy, A., Johnson, J., and Fei-Fei, L. Visualizing and Understanding
Recurrent Networks. arXiv:1506.02078 (2015), 12 pp. url: arxiv.org/abs/1506.
02078.
[239] Kawaguchi, K. Deep Learning without Poor Local Minima. In Advances in Neural
Information Processing Systems. Ed. by Lee, D., Sugiyama, M., Luxburg, U., Guyon,
I., and Garnett, R. Vol. 29. Curran Associates, Inc., 2016. url:
proceedings.neurips.cc/paper_files/paper/2016/file/f2fc990265c712c49d51a18a32b39f0c-Paper.pdf.
[240] Khan, S., Naseer, M., Hayat, M., Zamir, S. W., Khan, F. S., and Shah, M.
Transformers in Vision: A Survey. ACM Comput. Surv. 54, 10s (2022), Art. No. 200,
41 pp. url: doi.org/10.1145/3505244.
[241] Kharazmi, E., Zhang, Z., and Karniadakis, G. E. Variational Physics-Informed
Neural Networks For Solving Partial Differential Equations. arXiv:1912.00873 (2019),
24 pp. url: arxiv.org/abs/1912.00873.
[242] Kharazmi, E., Zhang, Z., and Karniadakis, G. E. M. hp-VPINNs: variational
physics-informed neural networks with domain decomposition. Comput. Methods
Appl. Mech. Engrg. 374 (2021), Art. No. 113547, 25 pp. url: doi.org/10.1016/j.
cma.2020.113547.
[243] Khodayi-Mehr, R. and Zavlanos, M. VarNet: Variational Neural Networks for
the Solution of Partial Differential Equations. In Proceedings of the 2nd Conference
on Learning for Dynamics and Control (June 10–11, 2020). Ed. by Bayen, A. M.,
Jadbabaie, A., Pappas, G., Parrilo, P. A., Recht, B., Tomlin, C., and Zeilinger, M.
Vol. 120. Proceedings of Machine Learning Research. PMLR, 2020, pp. 298–307.
url: proceedings.mlr.press/v120/khodayi-mehr20a.html.
[244] Khoo, Y., Lu, J., and Ying, L. Solving parametric PDE problems with artificial
neural networks. European J. Appl. Math. 32, 3 (2021), pp. 421–435. url: doi.org/
10.1017/S0956792520000182.
[245] Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings
of the 2014 Conference on Empirical Methods in Natural Language Processing
(EMNLP) (Doha, Qatar, Oct. 25–29, 2014). Ed. by Moschitti, A., Pang, B., and
Daelemans, W. Association for Computational Linguistics, 2014, pp. 1746–1751.
url: doi.org/10.3115/v1/D14-1181.
[246] Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. arXiv:1312.
6114 (2013), 14 pp. url: arxiv.org/abs/1312.6114.
[247] Kingma, D. P. and Ba, J. Adam: A Method for Stochastic Optimization.
arXiv:1412.6980 (2014), 15 pp. url: arxiv.org/abs/1412.6980.
[248] Klenke, A. Probability Theory. 2nd ed. Springer-Verlag London Ltd., 2014.
xii+638 pp. url: doi.org/10.1007/978-1-4471-5361-0.
[260] Lagaris, I., Likas, A., and Fotiadis, D. Artificial neural networks for solving
ordinary and partial differential equations. IEEE Trans. Neural Netw. 9, 5 (1998),
pp. 987–1000. url: doi.org/10.1109/72.712178.
[261] Lanthaler, S., Molinaro, R., Hadorn, P., and Mishra, S. Nonlinear Re-
construction for Operator Learning of PDEs with Discontinuities. arXiv:2210.01074
(2022), 40 pp. url: arxiv.org/abs/2210.01074.
[262] LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E.,
Hubbard, W., and Jackel, L. D. Backpropagation Applied to Handwritten Zip
Code Recognition. Neural Comput. 1, 4 (1989), pp. 541–551. url: doi.org/10.
1162/neco.1989.1.4.541.
[263] LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature 521 (2015),
pp. 436–444. url: doi.org/10.1038/nature14539.
[264] Lee, C.-Y., Xie, S., Gallagher, P., Zhang, Z., and Tu, Z. Deeply-Supervised
Nets. In Proceedings of the Eighteenth International Conference on Artificial Intelli-
gence and Statistics (San Diego, California, USA, May 9–12, 2015). Ed. by Lebanon,
G. and Vishwanathan, S. V. N. Vol. 38. Proceedings of Machine Learning Research.
PMLR, 2015, pp. 562–570. url: proceedings.mlr.press/v38/lee15a.html.
[265] Lee, J. D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M. I.,
and Recht, B. First-order methods almost always avoid strict saddle points. Math.
Program. 176, 1–2 (2019), pp. 311–337. url: doi.org/10.1007/s10107-019-01374-3.
[266] Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient Descent
Only Converges to Minimizers. In 29th Annual Conference on Learning Theory
(Columbia University, New York, NY, USA, June 23–26, 2016). Ed. by Feldman, V.,
Rakhlin, A., and Shamir, O. Vol. 49. Proceedings of Machine Learning Research.
PMLR, 2016, pp. 1246–1257. url: proceedings.mlr.press/v49/lee16.html.
[267] Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O.,
Stoyanov, V., and Zettlemoyer, L. BART: Denoising Sequence-to-Sequence
Pre-training for Natural Language Generation, Translation, and Comprehension.
arXiv:1910.13461 (2019). url: arxiv.org/abs/1910.13461.
[268] Li, K., Tang, K., Wu, T., and Liao, Q. D3M: A Deep Domain Decomposition
Method for Partial Differential Equations. IEEE Access 8 (2020), pp. 5283–5294.
url: doi.org/10.1109/ACCESS.2019.2957200.
[269] Li, Z., Huang, D. Z., Liu, B., and Anandkumar, A. Fourier Neural Operator
with Learned Deformations for PDEs on General Geometries. arXiv:2207.05209
(2022). url: arxiv.org/abs/2207.05209.
[270] Li, Z., Kovachki, N., Azizzadenesheli, K., Liu, B., Bhattacharya, K.,
Stuart, A., and Anandkumar, A. Neural Operator: Graph Kernel Network
for Partial Differential Equations. arXiv:2003.03485 (2020). url: arxiv.org/abs/
2003.03485.
[271] Li, Z., Kovachki, N., Azizzadenesheli, K., Liu, B., Bhattacharya, K.,
Stuart, A., and Anandkumar, A. Fourier Neural Operator for Parametric Partial
Differential Equations. In International Conference on Learning Representations.
2021. url: openreview.net/forum?id=c8P9NQVtmnO.
[272] Li, Z., Kovachki, N., Azizzadenesheli, K., Liu, B., Stuart, A., Bhat-
tacharya, K., and Anandkumar, A. Multipole graph neural operator for para-
metric partial differential equations. Advances in Neural Information Processing
Systems 33 (2020), pp. 6755–6766.
[273] Li, Z., Zheng, H., Kovachki, N., Jin, D., Chen, H., Liu, B., Azizzadenesheli,
K., and Anandkumar, A. Physics-Informed Neural Operator for Learning Partial
Differential Equations. arXiv:2111.03794 (2021). url: arxiv.org/abs/2111.03794.
[274] Liao, Y. and Ming, P. Deep Nitsche Method: Deep Ritz Method with Essential
Boundary Conditions. Commun. Comput. Phys. 29, 5 (2021), pp. 1365–1384. url:
doi.org/10.4208/cicp.OA-2020-0219.
[275] Liu, C. and Belkin, M. Accelerating SGD with momentum for over-parameterized
learning. arXiv:1810.13395 (2018). url: arxiv.org/abs/1810.13395.
[276] Liu, L. and Cai, W. DeepPropNet–A Recursive Deep Propagator Neural Network
for Learning Evolution PDE Operators. arXiv:2202.13429 (2022). url: arxiv.org/
abs/2202.13429.
[277] Liu, Y., Kutz, J. N., and Brunton, S. L. Hierarchical deep learning of multiscale
differential equation time-steppers. Philos. Trans. Roy. Soc. A 380, 2229 (2022),
Art. No. 20210200, 17 pp. url: doi.org/10.1098/rsta.2021.0200.
[278] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo,
B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In
Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
(Montreal, QC, Canada, Oct. 10–17, 2021). IEEE Computer Society, 2021, pp. 10012–
10022. url: doi.org/10.1109/ICCV48922.2021.00986.
[279] Liu, Z., Cai, W., and Xu, Z.-Q. J. Multi-scale deep neural network (MscaleDNN)
for solving Poisson-Boltzmann equation in complex domains. Commun. Comput.
Phys. 28, 5 (2020), pp. 1970–2001.
[280] Loizou, N. and Richtárik, P. Momentum and stochastic momentum for stochas-
tic gradient, Newton, proximal point and subspace descent methods. Comput. Optim.
Appl. 77, 3 (2020), pp. 653–710. url: doi.org/10.1007/s10589-020-00220-z.
[301] Nelsen, N. H. and Stuart, A. M. The random feature model for input-output
maps between Banach spaces. SIAM J. Sci. Comput. 43, 5 (2021), A3212–A3243.
url: doi.org/10.1137/20M133957X.
[302] Nesterov, Y. A method of solving a convex programming problem with convergence
rate O(1/k^2). In Soviet Mathematics Doklady. Vol. 27. 1983, pp. 372–376.
[303] Nesterov, Y. Introductory lectures on convex optimization: A basic course. Vol. 87.
Springer, New York, 2013, xviii+236 pp. url: doi.org/10.1007/978-1-4419-
8853-9.
[304] Neufeld, A. and Wu, S. Multilevel Picard approximation algorithm for semilinear
partial integro-differential equations and its complexity analysis. arXiv:2205.09639
(2022). url: arxiv.org/abs/2205.09639.
[305] Neufeld, A. and Wu, S. Multilevel Picard algorithm for general semilinear
parabolic PDEs with gradient-dependent nonlinearities. arXiv:2310.12545 (2023).
url: arxiv.org/abs/2310.12545.
[306] Ng, A. coursera: Improving Deep Neural Networks: Hyperparameter tuning, Reg-
ularization and Optimization. https://www.coursera.org/learn/deep-neural-
network. [Accessed 6-December-2017].
[307] Ng, J. Y.-H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga,
R., and Toderici, G. Beyond Short Snippets: Deep Networks for Video Classifica-
tion. arXiv:1503.08909 (2015). url: arxiv.org/abs/1503.08909.
[308] Nguwi, J. Y., Penent, G., and Privault, N. A deep branching solver for fully
nonlinear partial differential equations. arXiv:2203.03234 (2022). url: arxiv.org/
abs/2203.03234.
[309] Nguwi, J. Y., Penent, G., and Privault, N. Numerical solution of the incom-
pressible Navier-Stokes equation by a deep branching algorithm. arXiv:2212.13010
(2022). url: arxiv.org/abs/2212.13010.
[310] Nguwi, J. Y., Penent, G., and Privault, N. A fully nonlinear Feynman-Kac
formula with derivatives of arbitrary orders. J. Evol. Equ. 23, 1 (2023), Art. No. 22,
29 pp. url: doi.org/10.1007/s00028-023-00873-3.
[311] Nguwi, J. Y. and Privault, N. Numerical solution of the modified and non-
Newtonian Burgers equations by stochastic coded trees. Jpn. J. Ind. Appl. Math. 40,
3 (2023), pp. 1745–1763. url: doi.org/10.1007/s13160-023-00611-9.
[312] Nguyen, Q. and Hein, M. The Loss Surface of Deep and Wide Neural Networks.
In Proceedings of the 34th International Conference on Machine Learning (Sydney,
Australia, Aug. 6–11, 2017). Ed. by Precup, D. and Teh, Y. W. Vol. 70. Proceedings
of Machine Learning Research. PMLR, 2017, pp. 2603–2612. url: proceedings.
mlr.press/v70/nguyen17a.html.
[313] Nitsche, J. Über ein Variationsprinzip zur Lösung von Dirichlet-Problemen bei
Verwendung von Teilräumen, die keinen Randbedingungen unterworfen sind. Abh.
Math. Sem. Univ. Hamburg 36 (1971), pp. 9–15. url: doi.org/10.1007/BF029959
04.
[314] Novak, E. and Woźniakowski, H. Tractability of multivariate problems. Vol. I:
Linear information. Vol. 6. European Mathematical Society (EMS), Zürich, 2008,
xii+384 pp. url: doi.org/10.4171/026.
[315] Novak, E. and Woźniakowski, H. Tractability of multivariate problems. Volume
II: Standard information for functionals. Vol. 12. European Mathematical Society
(EMS), Zürich, 2010, xviii+657 pp. url: doi.org/10.4171/084.
[316] Novak, E. and Woźniakowski, H. Tractability of multivariate problems. Volume
III: Standard information for operators. Vol. 18. European Mathematical Society
(EMS), Zürich, 2012, xviii+586 pp. url: doi.org/10.4171/116.
[317] Nüsken, N. and Richter, L. Solving high-dimensional Hamilton-Jacobi-Bellman
PDEs using neural networks: perspectives from the theory of controlled diffusions
and measures on path space. Partial Differ. Equ. Appl. 2, 4 (2021), Art. No. 48,
48 pp. url: doi.org/10.1007/s42985-021-00102-x.
[318] Øksendal, B. Stochastic differential equations. 6th ed. An introduction with
applications. Springer-Verlag, Berlin, 2003, xxiv+360 pp. url: doi.org/10.1007/
978-3-642-14394-6.
[319] Olah, C. Understanding LSTM Networks. http://colah.github.io/posts/2015-
08-Understanding-LSTMs/. [Accessed 9-October-2023].
[320] OpenAI. GPT-4 Technical Report. arXiv:2303.08774 (2023). url: arxiv.org/
abs/2303.08774.
[321] Opschoor, J. A. A., Petersen, P. C., and Schwab, C. Deep ReLU networks
and high-order finite element methods. Anal. Appl. (Singap.) 18, 5 (2020), pp. 715–
770. url: doi.org/10.1142/S0219530519410136.
[322] Panageas, I. and Piliouras, G. Gradient Descent Only Converges to Minimizers:
Non-Isolated Critical Points and Invariant Regions. arXiv:1605.00405 (2016). url:
arxiv.org/abs/1605.00405.
[323] Panageas, I., Piliouras, G., and Wang, X. First-order methods almost al-
ways avoid saddle points: The case of vanishing step-sizes. In Advances in Neu-
ral Information Processing Systems. Ed. by Wallach, H., Larochelle, H., Beygelz-
imer, A., d’Alché-Buc, F., Fox, E., and Garnett, R. Vol. 32. Curran Associates,
Inc., 2019. url:
proceedings.neurips.cc/paper_files/paper/2019/file/3fb04953d95a94367bb133f862402bce-Paper.pdf.
[324] Pang, G., Lu, L., and Karniadakis, G. E. fPINNs: Fractional Physics-Informed
Neural Networks. SIAM J. Sci. Comput. 41, 4 (2019), A2603–A2626. url: doi.org/
10.1137/18M1229845.
[325] Pardoux, É. and Peng, S. Backward stochastic differential equations and quasilin-
ear parabolic partial differential equations. In Stochastic partial differential equations
and their applications. Vol. 176. Lect. Notes Control Inf. Sci. Springer, Berlin, 1992,
pp. 200–217. url: doi.org/10.1007/BFb0007334.
[326] Pardoux, É. and Peng, S. G. Adapted solution of a backward stochastic differ-
ential equation. Systems Control Lett. 14, 1 (1990), pp. 55–61. url: doi.org/10.
1016/0167-6911(90)90082-6.
[327] Pardoux, E. and Tang, S. Forward-backward stochastic differential equations and
quasilinear parabolic PDEs. Probab. Theory Related Fields 114, 2 (1999), pp. 123–150.
url: doi.org/10.1007/s004409970001.
[328] Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent
neural networks. In Proceedings of the 30th International Conference on Machine
Learning (Atlanta, GA, USA, June 17–19, 2013). Ed. by Dasgupta, S. and McAllester,
D. Vol. 28. Proceedings of Machine Learning Research 3. PMLR, 2013, pp. 1310–1318.
url: proceedings.mlr.press/v28/pascanu13.html.
[329] Perekrestenko, D., Grohs, P., Elbrächter, D., and Bölcskei, H. The
universal approximation power of finite-width deep ReLU networks. arXiv:1806.01528
(2018). url: arxiv.org/abs/1806.01528.
[330] Pérez-Ortiz, J. A., Gers, F. A., Eck, D., and Schmidhuber, J. Kalman filters
improve LSTM network performance in problems unsolvable by traditional recurrent
nets. Neural Networks 16, 2 (2003), pp. 241–250. url: doi.org/10.1016/S0893-
6080(02)00219-8.
[331] Petersen, P. Linear Algebra. Springer New York, 2012. x+390 pp. url: doi.org/
10.1007/978-1-4614-3612-6.
[332] Petersen, P., Raslan, M., and Voigtlaender, F. Topological properties of
the set of functions generated by neural networks of fixed size. Found. Comput. Math.
21, 2 (2021), pp. 375–444. url: doi.org/10.1007/s10208-020-09461-0.
[333] Petersen, P. and Voigtlaender, F. Optimal approximation of piecewise smooth
functions using deep ReLU neural networks. Neural Networks 108 (2018), pp. 296–
330. url: doi.org/10.1016/j.neunet.2018.08.019.
[334] Petersen, P. and Voigtlaender, F. Equivalence of approximation by convolu-
tional neural networks and fully-connected networks. Proc. Amer. Math. Soc. 148, 4
(2020), pp. 1567–1581. url: doi.org/10.1090/proc/14789.
[357] Safran, I. and Shamir, O. On the Quality of the Initial Basin in Overspecified
Neural Networks. In Proceedings of The 33rd International Conference on Machine
Learning (New York, NY, USA, June 20–22, 2016). Vol. 48. Proceedings of Machine
Learning Research. PMLR, 2016, pp. 774–782. url: proceedings.mlr.press/v48/
safran16.html.
[358] Safran, I. and Shamir, O. Spurious Local Minima are Common in Two-Layer
ReLU Neural Networks. In Proceedings of the 35th International Conference on
Machine Learning (Stockholm, Sweden, July 10–15, 2018). Vol. 80. Proceedings of
Machine Learning Research. ISSN: 2640-3498. PMLR, 2018, pp. 4433–4441. url:
proceedings.mlr.press/v80/safran18a.html.
[359] Sainath, T. N., Mohamed, A., Kingsbury, B., and Ramabhadran, B. Deep
convolutional neural networks for LVCSR. In 2013 IEEE International Conference
on Acoustics, Speech and Signal Processing (Vancouver, BC, Canada, May 26–31,
2013). IEEE Computer Society, 2013, pp. 8614–8618. url: doi.org/10.1109/
ICASSP.2013.6639347.
[360] Sak, H., Senior, A., and Beaufays, F. Long Short-Term Memory Based Re-
current Neural Network Architectures for Large Vocabulary Speech Recognition.
arXiv:1402.1128 (2014). url: arxiv.org/abs/1402.1128.
[361] Sanchez-Gonzalez, A., Godwin, J., Pfaff, T., Ying, R., Leskovec, J., and
Battaglia, P. W. Learning to Simulate Complex Physics with Graph Networks.
arXiv:2002.09405 (Feb. 2020). url: arxiv.org/abs/2002.09405.
[362] Sanchez-Lengeling, B., Reif, E., Pearce, A., and Wiltschko, A. B. A
Gentle Introduction to Graph Neural Networks. https://distill.pub/2021/gnn-
intro/. [Accessed 10-October-2023].
[363] Sandberg, I. Approximation theorems for discrete-time systems. IEEE Trans.
Circuits Syst. 38, 5 (1991), pp. 564–566. url: doi.org/10.1109/31.76498.
[364] Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. How Does Batch
Normalization Help Optimization? In Advances in Neural Information Processing
Systems. Ed. by Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi,
N., and Garnett, R. Vol. 31. Curran Associates, Inc., 2018. url:
proceedings.neurips.cc/paper_files/paper/2018/file/905056c1ac1dad141560467e0a99e1cf-Paper.pdf.
[365] Sarao Mannelli, S., Vanden-Eijnden, E., and Zdeborová, L. Optimization
and Generalization of Shallow Neural Networks with Quadratic Activation Functions.
In Advances in Neural Information Processing Systems. Ed. by Larochelle, H.,
Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. Vol. 33. Curran Associates, Inc.,
2020, pp. 13445–13455. url:
proceedings.neurips.cc/paper_files/paper/2020/file/9b8b50fb590c590ffbf1295ce92258dc-Paper.pdf.
[366] Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini,
G. The Graph Neural Network Model. IEEE Trans. Neural Netw. 20, 1 (2009),
pp. 61–80. url: doi.org/10.1109/TNN.2008.2005605.
[367] Schmidhuber, J. Deep learning in neural networks: An overview. Neural Networks
61 (2015), pp. 85–117. url: doi.org/10.1016/j.neunet.2014.09.003.
[368] Schütt, K. T., Sauceda, H. E., Kindermans, P.-J., Tkatchenko, A., and
Müller, K.-R. SchNet – A deep learning architecture for molecules and materials.
The Journal of Chemical Physics 148, 24 (2018). url: doi.org/10.1063/1.5019779.
[369] Schwab, C., Stein, A., and Zech, J. Deep Operator Network Approximation
Rates for Lipschitz Operators. arXiv:2307.09835 (2023). url: arxiv.org/abs/
2307.09835.
[370] Schwab, C. and Zech, J. Deep learning in high dimension: neural network
expression rates for generalized polynomial chaos expansions in UQ. Anal. Appl.
(Singap.) 17, 1 (2019), pp. 19–55. url: doi.org/10.1142/S0219530518500203.
[371] Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun,
Y. OverFeat: Integrated Recognition, Localization and Detection using Convolutional
Networks. arXiv:1312.6229 (2013). url: arxiv.org/abs/1312.6229.
[372] Sezer, O. B., Gudelek, M. U., and Ozbayoglu, A. M. Financial time series
forecasting with deep learning: A systematic literature review: 2005–2019. Appl. Soft
Comput. 90 (2020), Art. No. 106181. url: doi.org/10.1016/j.asoc.2020.106181.
[373] Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning.
From Theory to Algorithms. Cambridge University Press, 2014, xvi+397 pp. url:
doi.org/10.1017/CBO9781107298019.
[374] Shen, Z., Yang, H., and Zhang, S. Deep network approximation characterized
by number of neurons. Commun. Comput. Phys. 28, 5 (2020), pp. 1768–1811. url:
doi.org/10.4208/cicp.oa-2020-0149.
[375] Shi, X., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-k., and Woo, W.-c.
Convolutional LSTM Network: A Machine Learning Approach for Precipitation
Nowcasting. In Advances in Neural Information Processing Systems. Ed. by Cortes,
C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R. Vol. 28. Curran Associates,
Inc., 2015. url:
proceedings.neurips.cc/paper_files/paper/2015/file/07563a3fe3bbe7e3ba84431ad9d055af-Paper.pdf.
[376] Siami-Namini, S., Tavakoli, N., and Siami Namin, A. A Comparison of ARIMA
and LSTM in Forecasting Time Series. In 2018 17th IEEE International Conference
on Machine Learning and Applications (ICMLA) (Orlando, FL, USA, Dec. 17–20,
2018). IEEE Computer Society, 2018, pp. 1394–1401. url: doi.org/10.1109/
ICMLA.2018.00227.
[377] Silvester, J. R. Determinants of block matrices. Math. Gaz. 84, 501 (2000),
pp. 460–467. url: doi.org/10.2307/3620776.
[378] Simonyan, K. and Zisserman, A. Very Deep Convolutional Networks for Large-
Scale Image Recognition. arXiv:1409.1556 (2014). url: arxiv.org/abs/1409.1556.
[379] Sirignano, J. and Spiliopoulos, K. DGM: A deep learning algorithm for solving
partial differential equations. J. Comput. Phys. 375 (2018), pp. 1339–1364. url:
doi.org/10.1016/j.jcp.2018.08.029.
[380] Sitzmann, V., Martel, J. N. P., Bergman, A. W., Lindell, D. B., and
Wetzstein, G. Implicit Neural Representations with Periodic Activation Functions.
arXiv:2006.09661 (2020). url: arxiv.org/abs/2006.09661.
[381] Soltanolkotabi, M., Javanmard, A., and Lee, J. D. Theoretical Insights Into
the Optimization Landscape of Over-Parameterized Shallow Neural Networks. IEEE
Trans. Inform. Theory 65, 2 (2019), pp. 742–769. url: doi.org/10.1109/TIT.2018.
2854560.
[382] Soudry, D. and Carmon, Y. No bad local minima: Data independent training
error guarantees for multilayer neural networks. arXiv:1605.08361 (2016). url:
arxiv.org/abs/1605.08361.
[383] Soudry, D. and Hoffer, E. Exponentially vanishing sub-optimal local minima in
multilayer neural networks. arXiv:1702.05777 (2017). url: arxiv.org/abs/1702.
05777.
[384] Srivastava, R. K., Greff, K., and Schmidhuber, J. Training Very Deep
Networks. In Advances in Neural Information Processing Systems. Ed. by Cortes, C.,
Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R. Vol. 28. Curran Associates,
Inc., 2015. url:
proceedings.neurips.cc/paper_files/paper/2015/file/215a71a12769b056c3c32e7299f1c5ed-Paper.pdf.
[385] Srivastava, R. K., Greff, K., and Schmidhuber, J. Highway Networks.
arXiv:1505.00387 (2015). url: arxiv.org/abs/1505.00387.
[386] Sun, R. Optimization for deep learning: theory and algorithms. arXiv:1912.08957
(Dec. 2019). url: arxiv.org/abs/1912.08957.
[387] Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of
initialization and momentum in deep learning. In Proceedings of the 30th International
Conference on Machine Learning (Atlanta, GA, USA, June 17–19, 2013). Ed. by
Dasgupta, S. and McAllester, D. Vol. 28. Proceedings of Machine Learning Research
3. PMLR, 2013, pp. 1139–1147. url: proceedings.mlr.press/v28/sutskever13.
html.
[388] Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to Sequence Learning with
Neural Networks. In Advances in Neural Information Processing Systems. Ed. by
Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberger, K. Vol. 27.
Curran Associates, Inc., 2014. url:
proceedings.neurips.cc/paper_files/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf.
[389] Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction.
2nd ed. MIT Press, Cambridge, MA, 2018, xxii+526 pp.
[390] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Er-
han, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions.
In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(Boston, MA, USA, June 7–12, 2015). IEEE Computer Society, 2015, pp. 1–9. url:
doi.org/10.1109/CVPR.2015.7298594.
[391] Tadić, V. B. Convergence and convergence rate of stochastic gradient search in the
case of multiple and non-isolated extrema. Stochastic Process. Appl. 125, 5 (2015),
pp. 1715–1755. url: doi.org/10.1016/j.spa.2014.11.001.
[392] Tan, L. and Chen, L. Enhanced DeepONet for modeling partial differential
operators considering multiple input functions. arXiv:2202.08942 (2022). url: arxiv.
org/abs/2202.08942.
[393] Taylor, J. M., Pardo, D., and Muga, I. A deep Fourier residual method for
solving PDEs using neural networks. Comput. Methods Appl. Mech. Engrg. 405
(2023), Art. No. 115850, 27 pp. url: doi.org/10.1016/j.cma.2022.115850.
[394] Teschl, G. Ordinary differential equations and dynamical systems. Vol. 140. Amer-
ican Mathematical Society, Providence, RI, 2012, xii+356 pp. url: doi.org/10.
1090/gsm/140.
[395] Tropp, J. A. An Elementary Proof of the Spectral Radius Formula for Matrices.
http://users.cms.caltech.edu/~jtropp/notes/Tro01-Spectral-Radius.pdf.
[Accessed 16-February-2018]. 2001.
[396] Van den Oord, A., Dieleman, S., and Schrauwen, B. Deep content-based
music recommendation. In Advances in Neural Information Processing Systems.
Ed. by Burges, C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K.
Vol. 26. Curran Associates, Inc., 2013. url: proceedings.neurips.cc/paper_
files/paper/2013/file/b3ba8f1bee1238a2f37603d90b58898d-Paper.pdf.
[397] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez,
A. N., Kaiser, Ł., and Polosukhin, I. Attention is All you Need. In Advances in
Neural Information Processing Systems. Ed. by Guyon, I., Luxburg, U. V., Bengio,
S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. Vol. 30. Curran
Associates, Inc., 2017. url: proceedings.neurips.cc/paper_files/paper/2017/
file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[398] Vatanen, T., Raiko, T., Valpola, H., and LeCun, Y. Pushing Stochastic
Gradient towards Second-Order Methods – Backpropagation Learning with Transfor-
mations in Nonlinearities. In Neural Information Processing. Ed. by Lee, M., Hirose,
A., Hou, Z.-G., and Kil, R. M. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013,
pp. 442–449.
[399] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and
Bengio, Y. Graph Attention Networks. arXiv:1710.10903 (2017). url: arxiv.org/
abs/1710.10903.
[400] Venturi, L., Bandeira, A. S., and Bruna, J. Spurious Valleys in One-hidden-
layer Neural Network Optimization Landscapes. J. Mach. Learn. Res. 20, 133 (2019),
pp. 1–34. url: jmlr.org/papers/v20/18-674.html.
[401] Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T.,
and Saenko, K. Sequence to Sequence – Video to Text. In Proceedings of the IEEE
International Conference on Computer Vision (ICCV) (Santiago, Chile, Dec. 7–13,
2015). IEEE Computer Society, 2015. url: doi.org/10.1109/ICCV.2015.515.
[402] Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting
and Composing Robust Features with Denoising Autoencoders. In Proceedings of the
25th International Conference on Machine Learning. ICML ’08. Helsinki, Finland:
Association for Computing Machinery, 2008, pp. 1096–1103. url: doi.org/10.
1145/1390156.1390294.
[403] Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A.
Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network
with a Local Denoising Criterion. J. Mach. Learn. Res. 11, 110 (2010), pp. 3371–3408.
url: jmlr.org/papers/v11/vincent10a.html.
[404] Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., and
Tang, X. Residual Attention Network for Image Classification. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Honolulu,
HI, USA, July 21–26, 2017). IEEE Computer Society, 2017. url: doi.org/10.1109/
CVPR.2017.683.
[405] Wang, N., Zhang, D., Chang, H., and Li, H. Deep learning of subsurface
flow via theory-guided neural network. J. Hydrology 584 (2020), p. 124700. url:
doi.org/10.1016/j.jhydrol.2020.124700.
[406] Wang, S., Wang, H., and Perdikaris, P. Learning the solution operator of
parametric partial differential equations with physics-informed DeepONets. Science
Advances 7, 40 (2021), eabi8605. url: doi.org/10.1126/sciadv.abi8605.
[407] Wang, Y., Zou, R., Liu, F., Zhang, L., and Liu, Q. A review of wind speed
and wind power forecasting with deep neural networks. Appl. Energy 304 (2021),
Art. No. 117766. url: doi.org/10.1016/j.apenergy.2021.117766.
[408] Wang, Z., Yan, W., and Oates, T. Time series classification from scratch with
deep neural networks: A strong baseline. In 2017 International Joint Conference on
Neural Networks (IJCNN). 2017, pp. 1578–1585. url: doi.org/10.1109/IJCNN.
2017.7966039.
[409] Welper, G. Approximation Results for Gradient Descent trained Neural Networks.
arXiv:2309.04860 (2023). url: arxiv.org/abs/2309.04860.
[410] Wen, G., Li, Z., Azizzadenesheli, K., Anandkumar, A., and Benson,
S. M. U-FNO – An enhanced Fourier neural operator-based deep-learning model for
multiphase flow. arXiv:2109.03697 (2021). url: arxiv.org/abs/2109.03697.
[411] West, D. Introduction to Graph Theory. Prentice Hall, 2001. 588 pp.
[412] Wu, F., Souza, A., Zhang, T., Fifty, C., Yu, T., and Weinberger, K.
Simplifying Graph Convolutional Networks. In Proceedings of the 36th International
Conference on Machine Learning (Long Beach, California, USA, June 9–15, 2019).
Ed. by Chaudhuri, K. and Salakhutdinov, R. Vol. 97. Proceedings of Machine
Learning Research. PMLR, 2019, pp. 6861–6871. url: proceedings.mlr.press/
v97/wu19e.html.
[413] Wu, K., Yan, X.-b., Jin, S., and Ma, Z. Asymptotic-Preserving Convolutional
DeepONets Capture the Diffusive Behavior of the Multiscale Linear Transport
Equations. arXiv:2306.15891 (2023). url: arxiv.org/abs/2306.15891.
[414] Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu,
A. S., Leswing, K., and Pande, V. MoleculeNet: a benchmark for molecular
machine learning. Chem. Sci. 9 (2 2018), pp. 513–530. url: doi.org/10.1039/
C7SC02664A.
[415] Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Yu, P. S. A Compre-
hensive Survey on Graph Neural Networks. IEEE Trans. Neural Netw. Learn. Syst.
32, 1 (2021), pp. 4–24. url: doi.org/10.1109/TNNLS.2020.2978386.
[416] Xie, J., Xu, L., and Chen, E. Image Denoising and Inpainting with Deep Neural
Networks. In Advances in Neural Information Processing Systems. Ed. by Pereira, F.,
Burges, C., Bottou, L., and Weinberger, K. Vol. 25. Curran Associates, Inc., 2012.
url: proceedings.neurips.cc/paper_files/paper/2012/file/6cdd60ea0045
eb7a6ec44c54d29ed402-Paper.pdf.
[417] Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated Residual
Transformations for Deep Neural Networks. In 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) (Honolulu, HI, USA, July 21–26, 2017).
IEEE Computer Society, 2017, pp. 5987–5995. url: doi.org/10.1109/CVPR.2017.
634.
[418] Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H.,
Lan, Y., Wang, L., and Liu, T.-Y. On Layer Normalization in the Transformer
Architecture. In Proceedings of the 37th International Conference on Machine Learn-
ing (July 13–18, 2020). ICML’20. JMLR.org, 2020, 975, pp. 10524–10533. url:
proceedings.mlr.press/v119/xiong20b.html.
[419] Xiong, W., Huang, X., Zhang, Z., Deng, R., Sun, P., and Tian, Y. Koopman
neural operator as a mesh-free solver of non-linear partial differential equations.
arXiv:2301.10022 (2023). url: arxiv.org/abs/2301.10022.
[420] Xu, R., Zhang, D., Rong, M., and Wang, N. Weak form theory-guided neural
network (TgNN-wf) for deep learning of subsurface single- and two-phase flow. J.
Comput. Phys. 436 (2021), Art. No. 110318, 20 pp. url: doi.org/10.1016/j.jcp.
2021.110318.
[421] Yang, L., Meng, X., and Karniadakis, G. E. B-PINNs: Bayesian physics-
informed neural networks for forward and inverse PDE problems with noisy data. J.
Comput. Phys. 425 (2021), Art. No. 109913. url: doi.org/10.1016/j.jcp.2020.
109913.
[422] Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., and Le,
Q. V. XLNet: Generalized Autoregressive Pretraining for Language Understanding.
arXiv:1906.08237 (2019). url: arxiv.org/abs/1906.08237.
[423] Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural
Networks 94 (2017), pp. 103–114. url: doi.org/10.1016/j.neunet.2017.07.002.
[424] Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W. L., and
Leskovec, J. Graph Convolutional Neural Networks for Web-Scale Recommender
Systems. In Proceedings of the 24th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining (London, United Kingdom, Aug. 19–23, 2018).
KDD ’18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 974–
983. url: doi.org/10.1145/3219819.3219890.
[425] Yu, Y., Si, X., Hu, C., and Zhang, J. A Review of Recurrent Neural Networks:
LSTM Cells and Network Architectures. Neural Comput. 31, 7 (July 2019), pp. 1235–
1270. url: doi.org/10.1162/neco_a_01199.
[426] Yun, S., Jeong, M., Kim, R., Kang, J., and Kim, H. J. Graph Transformer
Networks. In Advances in Neural Information Processing Systems. Ed. by Wallach, H.,
Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., and Garnett, R. Vol. 32.
Curran Associates, Inc., 2019. url:
proceedings.neurips.cc/paper_files/paper/2019/file/9d63484abb477c97640154d40595a3bb-Paper.pdf.
[427] Zagoruyko, S. and Komodakis, N. Wide Residual Networks. arXiv:1605.07146
(2016). url: arxiv.org/abs/1605.07146.
[428] Zang, Y., Bao, G., Ye, X., and Zhou, H. Weak adversarial networks for high-
dimensional partial differential equations. J. Comput. Phys. 411 (2020), Art. No. 109409,
14 pp. url: doi.org/10.1016/j.jcp.2020.109409.
[429] Zeiler, M. D. ADADELTA: An Adaptive Learning Rate Method. arXiv:1212.5701
(2012). url: arxiv.org/abs/1212.5701.
[430] Zeng, D., Liu, K., Lai, S., Zhou, G., and Zhao, J. Relation Classification
via Convolutional Deep Neural Network. In Proceedings of COLING 2014, the 25th
International Conference on Computational Linguistics: Technical Papers. Dublin,
Ireland: Dublin City University and Association for Computational Linguistics, Aug.
2014, pp. 2335–2344. url: aclanthology.org/C14-1220.
[431] Zhang, A., Lipton, Z. C., Li, M., and Smola, A. J. Dive into Deep Learning.
Cambridge University Press, 2023. url: d2l.ai.
[432] Zhang, J., Zhang, S., Shen, J., and Lin, G. Energy-Dissipative Evolutionary
Deep Operator Neural Networks. arXiv:2306.06281 (2023). url: arxiv.org/abs/
2306.06281.
[433] Zhang, J., Mokhtari, A., Sra, S., and Jadbabaie, A. Direct Runge-Kutta
Discretization Achieves Acceleration. arXiv:1805.00521 (2018). url: arxiv.org/
abs/1805.00521.
[434] Zhang, X., Zhao, J., and LeCun, Y. Character-level Convolutional Networks for
Text Classification. In Advances in Neural Information Processing Systems. Ed. by
Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R. Vol. 28. Curran
Associates, Inc., 2015. url: proceedings.neurips.cc/paper_files/paper/2015/
file/250cf8b51c773f3f8dc8b4be867a9a02-Paper.pdf.
[435] Zhang, Y., Li, Y., Zhang, Z., Luo, T., and Xu, Z.-Q. J. Embedding Principle:
a hierarchical structure of loss landscape of deep neural networks. arXiv:2111.15527
(2021). url: arxiv.org/abs/2111.15527.
[436] Zhang, Y., Zhang, Z., Luo, T., and Xu, Z.-Q. J. Embedding Principle of Loss
Landscape of Deep Neural Networks. arXiv:2105.14573 (2021). url: arxiv.org/
abs/2105.14573.
[437] Zhang, Y. and Wallace, B. A Sensitivity Analysis of (and Practitioners’ Guide
to) Convolutional Neural Networks for Sentence Classification. In Proceedings of the
Eighth International Joint Conference on Natural Language Processing (Volume 1:
Long Papers) (Taipei, Taiwan, Nov. 27–Dec. 1, 2017). Asian Federation of Natural
Language Processing, 2017, pp. 253–263. url: aclanthology.org/I17-1026.
[438] Zhang, Y., Chen, C., Shi, N., Sun, R., and Luo, Z.-Q. Adam Can Converge
Without Any Modification On Update Rules. arXiv:2208.09632 (2022). url: arxiv.
org/abs/2208.09632.
[439] Zhang, Z., Cui, P., and Zhu, W. Deep Learning on Graphs: A Survey. IEEE
Trans. Knowledge Data Engrg. 34, 1 (2022), pp. 249–270. url: doi.org/10.1109/
TKDE.2020.2981333.
[440] Zheng, Y., Liu, Q., Chen, E., Ge, Y., and Zhao, J. L. Time Series Classification
Using Multi-Channels Deep Convolutional Neural Networks. In Web-Age Information
Management. Ed. by Li, F., Li, G., Hwang, S.-w., Yao, B., and Zhang, Z. Springer,
Cham, 2014, pp. 298–310. url: doi.org/10.1007/978-3-319-08010-9_33.
[441] Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., and Zhang, W.
Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecast-
ing. Proceedings of the AAAI Conference on Artificial Intelligence 35, 12 (2021),
pp. 11106–11115. url: doi.org/10.1609/aaai.v35i12.17325.
[442] Zhou, J., Cui, G., Hu, S., Zhang, Z., Yang, C., Liu, Z., Wang, L., Li, C.,
and Sun, M. Graph neural networks: A review of methods and applications. AI
Open 1 (2020), pp. 57–81. url: doi.org/10.1016/j.aiopen.2021.01.001.
[443] Zhu, Y. and Zabaras, N. Bayesian deep convolutional encoder-decoder networks
for surrogate modeling and uncertainty quantification. J. Comput. Phys. 366 (2018),
pp. 415–447. url: doi.org/10.1016/j.jcp.2018.04.018.