Mathematical Introduction To Deep Learning: Methods, Implementations, and Theory

This document introduces mathematical concepts related to deep learning including neural networks, optimization methods, generalization error, and applications to partial differential equations. It aims to provide students and scientists without a background in deep learning a solid foundation of these methods and their theory.


Mathematical Introduction to Deep Learning: Methods, Implementations, and Theory

arXiv:2310.20360v1 [cs.LG] 31 Oct 2023

Arnulf Jentzen
Benno Kuckuck
Philippe von Wurstemberger
Arnulf Jentzen
School of Data Science and Shenzhen Research Institute of Big Data
The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen)
Shenzhen, China
email: [email protected]
Applied Mathematics: Institute for Analysis and Numerics
University of Münster
Münster, Germany
email: [email protected]

Benno Kuckuck
School of Data Science and Shenzhen Research Institute of Big Data
The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen)
Shenzhen, China
email: [email protected]
Applied Mathematics: Institute for Analysis and Numerics
University of Münster
Münster, Germany
email: [email protected]

Philippe von Wurstemberger
School of Data Science
The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen)
Shenzhen, China
email: [email protected]
Risklab, Department of Mathematics
ETH Zurich
Zurich, Switzerland
email: [email protected]

Keywords: deep learning, artificial neural network, stochastic gradient descent, optimization
Mathematics Subject Classification (2020): 68T07

Version of November 1, 2023

All Python source codes in this book can be downloaded from https://github.com/introdeeplearning/book
or from the arXiv page of this book (by clicking on “Other formats” and then “Download source”).
Preface
This book aims to provide an introduction to the topic of deep learning algorithms. Very
roughly speaking, when we speak of a deep learning algorithm we think of a computational
scheme which aims to approximate certain relations, functions, or quantities by means
of so-called deep artificial neural networks (ANNs) and the iterated use of some kind of
data. ANNs, in turn, can be thought of as classes of functions that consist of multiple
compositions of certain nonlinear functions, which are referred to as activation functions,
and certain affine functions. Loosely speaking, the depth of such ANNs corresponds to
the number of involved iterated compositions in the ANN and one starts to speak of deep
ANNs when the number of involved compositions of nonlinear and affine functions is larger
than two.
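The description of ANNs as iterated compositions of affine functions and activation functions can be made concrete with a small sketch. The following is only an illustration under assumptions that are not taken from the book: the helper names `relu`, `affine`, and `ann` are hypothetical, the ReLU is used as the activation function, and the weights are hand-picked so that the resulting network realizes the absolute value function.

```python
def relu(v):
    """Rectified linear unit applied componentwise to a list of floats."""
    return [max(0.0, t) for t in v]


def affine(W, b, v):
    """Affine map v -> W v + b, with W given as a list of rows and b as a list."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) + b_i
            for row, b_i in zip(W, b)]


def ann(layers, x, activation=relu):
    """Alternate affine maps and the activation; the last layer stays affine."""
    for W, b in layers[:-1]:
        x = activation(affine(W, b, x))
    W, b = layers[-1]
    return affine(W, b, x)


# Hypothetical weights: input dimension 1, two hidden neurons, output dimension 1.
# The network realizes x -> |x| because relu(x) + relu(-x) = |x|.
layers = [
    ([[1.0], [-1.0]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.0]),
]
print(ann(layers, [-3.0]))  # [3.0]
```

Adding further `(W, b)` pairs to `layers` increases the number of compositions and hence, loosely speaking, the depth of the network.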
We hope that this book will be useful for students and scientists who do not yet have
any background in deep learning at all and would like to gain a solid foundation as well
as for practitioners who would like to obtain a firmer mathematical understanding of the
objects and methods considered in deep learning.
After a brief introduction, this book is divided into six parts (see Parts I, II, III, IV,
V, and VI). In Part I we introduce in Chapter 1 different types of ANNs including fully-
connected feedforward ANNs, convolutional ANNs (CNNs), recurrent ANNs (RNNs), and
residual ANNs (ResNets) in all mathematical details and in Chapter 2 we present a certain
calculus for fully-connected feedforward ANNs.
In Part II we present several mathematical results that analyze how well ANNs can
approximate given functions. To make this part more accessible, we first restrict ourselves
in Chapter 3 to one-dimensional functions from the reals to the reals and, thereafter, we
study ANN approximation results for multivariate functions in Chapter 4.
A key aspect of deep learning algorithms is usually to model or reformulate the problem
under consideration as a suitable optimization problem involving deep ANNs. It is precisely
the subject of Part III to study such and related optimization problems and the corresponding
optimization algorithms to approximately solve such problems in detail. In particular, in
the context of deep learning methods such optimization problems – typically given in the
form of a minimization problem – are usually solved by means of appropriate gradient based
optimization methods. Roughly speaking, we think of a gradient based optimization method
as a computational scheme which aims to solve the considered optimization problem by
performing successive steps based on the direction of the (negative) gradient of the function
which one wants to optimize. Deterministic variants of such gradient based optimization
methods such as the gradient descent (GD) optimization method are reviewed and studied
in Chapter 6 and stochastic variants of such gradient based optimization methods such
as the stochastic gradient descent (SGD) optimization method are reviewed and studied
in Chapter 7. GD-type and SGD-type optimization methods can, roughly speaking, be
viewed as time-discrete approximations of solutions of suitable gradient flow (GF) ordinary
differential equations (ODEs). To develop intuitions for GD-type and SGD-type optimization
methods and for some of the tools which we employ to analyze such methods, we study in
Chapter 5 such GF ODEs. In particular, we show in Chapter 5 how such GF ODEs can be
used to approximately solve appropriate optimization problems. Implementations of the
gradient based methods discussed in Chapters 6 and 7 require efficient computations of
gradients. The most popular and in some sense most natural method to explicitly compute
such gradients in the case of the training of ANNs is the backpropagation method, which
we derive and present in detail in Chapter 8. The mathematical analyses for gradient
based optimization methods that we present in Chapters 5, 6, and 7 are in almost all
cases too restrictive to cover optimization problems associated to the training of ANNs.
However, such optimization problems can be covered by the Kurdyka–Łojasiewicz (KL)
approach which we discuss in detail in Chapter 9. In Chapter 10 we rigorously review
batch normalization (BN) methods, which are popular methods that aim to accelerate ANN
training procedures in data-driven learning problems. In Chapter 11 we review and study
the approach to optimize an objective function through different random initializations.
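The idea of performing successive steps in the direction of the negative gradient, as described above, can be sketched in a few lines of Python. This is a minimal illustration under assumptions that are not from the book: the objective f(θ) = (θ − 1)² is a hypothetical one-dimensional example, the learning rate 0.1 and the number of steps are arbitrary choices, and the function name `gradient_descent` is made up for this sketch.

```python
def gradient_descent(grad, theta0, learning_rate, num_steps):
    """Plain GD: theta_{n+1} = theta_n - learning_rate * grad(theta_n)."""
    theta = theta0
    for _ in range(num_steps):
        theta = theta - learning_rate * grad(theta)
    return theta


def grad_f(theta):
    """Gradient of the hypothetical objective f(theta) = (theta - 1)^2."""
    return 2.0 * (theta - 1.0)


theta = gradient_descent(grad_f, theta0=5.0, learning_rate=0.1, num_steps=100)
print(abs(theta - 1.0) < 1e-6)  # True
```

Each update is one explicit Euler step of size `learning_rate` for the GF ODE Θ'(t) = −f'(Θ(t)), which is the sense in which GD can be viewed as a time-discrete approximation of a gradient flow.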
The mathematical analysis of deep learning algorithms does not only consist of error
estimates for approximation capacities of ANNs (cf. Part II) and of error estimates for the
involved optimization methods (cf. Part III) but also requires estimates for the generalization
error which, roughly speaking, arises when the probability distribution associated to the
learning problem cannot be accessed explicitly but is approximated by a finite number of
realizations/data. It is precisely the subject of Part IV to study the generalization error.
Specifically, in Chapter 12 we review suitable probabilistic generalization error estimates
and in Chapter 13 we review suitable strong Lp -type generalization error estimates.
In Part V we illustrate how to combine parts of the approximation error estimates
from Part II, parts of the optimization error estimates from Part III, and parts of the
generalization error estimates from Part IV to establish estimates for the overall error in
the exemplary situation of the training of ANNs based on SGD-type optimization methods
with many independent random initializations. Specifically, in Chapter 14 we present a
suitable overall error decomposition for supervised learning problems, which we employ
in Chapter 15 together with some of the findings of Parts II, III, and IV to establish the
aforementioned illustrative overall error analysis.
Deep learning methods have not only become very popular for data-driven learning
problems, but are nowadays also heavily used for approximately solving partial differential
equations (PDEs). In Part VI we review and implement three popular variants of such deep
learning methods for PDEs. Specifically, in Chapter 16 we treat physics-informed neural
networks (PINNs) and deep Galerkin methods (DGMs) and in Chapter 17 we treat deep
Kolmogorov methods (DKMs).
This book contains a number of Python source codes, which can be downloaded
from two sources, namely from the public GitHub repository at https://github.com/introdeeplearning/book
and from the arXiv page of this book (by clicking on the link
“Other formats” and then on “Download source”). For ease of reference, the caption of each
source listing in this book contains the filename of the corresponding source file.
This book grew out of a series of lectures held by the authors at ETH Zurich, University
of Münster, and the Chinese University of Hong Kong, Shenzhen. It is in parts based on
recent joint articles of Christian Beck, Sebastian Becker, Weinan E, Lukas Gonon, Robin
Graeber, Philipp Grohs, Fabian Hornung, Martin Hutzenthaler, Nor Jaafari, Joshua Lee
Padgett, Adrian Riekert, Diyora Salimova, Timo Welti, and Philipp Zimmermann with
the authors of this book. We thank all of our aforementioned co-authors for very fruitful
collaborations. Special thanks are due to Timo Welti for his permission to integrate slightly
modified extracts of the article [230] into this book. We also thank Lukas Gonon, Timo
Kröger, Siyu Liang, and Joshua Lee Padgett for several insightful discussions and useful
suggestions. Finally, we thank the students of the courses that we held on the basis of
preliminary material of this book for bringing several typos to our notice.
This work was supported by the internal project fund from the Shenzhen Research
Institute of Big Data under grant T00120220001. This work has been partially funded by
the National Science Foundation of China (NSFC) under grant number 12250610192. The
first author gratefully acknowledges the support of the Cluster of Excellence EXC 2044-
390685587, Mathematics Münster: Dynamics-Geometry-Structure funded by the Deutsche
Forschungsgemeinschaft (DFG, German Research Foundation).

Shenzhen and Münster, November 2023
Arnulf Jentzen
Benno Kuckuck
Philippe von Wurstemberger

Contents

Preface

Introduction

I Artificial neural networks (ANNs)

1 Basics on ANNs
1.1 Fully-connected feedforward ANNs (vectorized description)
1.1.1 Affine functions
1.1.2 Vectorized description of fully-connected feedforward ANNs
1.1.3 Weight and bias parameters of fully-connected feedforward ANNs
1.2 Activation functions
1.2.1 Multidimensional versions
1.2.2 Single hidden layer fully-connected feedforward ANNs
1.2.3 Rectified linear unit (ReLU) activation
1.2.4 Clipping activation
1.2.5 Softplus activation
1.2.6 Gaussian error linear unit (GELU) activation
1.2.7 Standard logistic activation
1.2.8 Swish activation
1.2.9 Hyperbolic tangent activation
1.2.10 Softsign activation
1.2.11 Leaky rectified linear unit (leaky ReLU) activation
1.2.12 Exponential linear unit (ELU) activation
1.2.13 Rectified power unit (RePU) activation
1.2.14 Sine activation
1.2.15 Heaviside activation
1.2.16 Softmax activation
1.3 Fully-connected feedforward ANNs (structured description)
1.3.1 Structured description of fully-connected feedforward ANNs
1.3.2 Realizations of fully-connected feedforward ANNs
1.3.3 On the connection to the vectorized description
1.4 Convolutional ANNs (CNNs)
1.4.1 Discrete convolutions
1.4.2 Structured description of feedforward CNNs
1.4.3 Realizations of feedforward CNNs
1.5 Residual ANNs (ResNets)
1.5.1 Structured description of fully-connected ResNets
1.5.2 Realizations of fully-connected ResNets
1.6 Recurrent ANNs (RNNs)
1.6.1 Description of RNNs
1.6.2 Vectorized description of simple fully-connected RNNs
1.6.3 Long short-term memory (LSTM) RNNs
1.7 Further types of ANNs
1.7.1 ANNs with encoder-decoder architectures: autoencoders
1.7.2 Transformers and the attention mechanism
1.7.3 Graph neural networks (GNNs)
1.7.4 Neural operators

2 ANN calculus
2.1 Compositions of fully-connected feedforward ANNs
2.1.1 Compositions of fully-connected feedforward ANNs
2.1.2 Elementary properties of compositions of fully-connected feedforward ANNs
2.1.3 Associativity of compositions of fully-connected feedforward ANNs
2.1.4 Powers of fully-connected feedforward ANNs
2.2 Parallelizations of fully-connected feedforward ANNs
2.2.1 Parallelizations of fully-connected feedforward ANNs with the same length
2.2.2 Representations of the identities with ReLU activation functions
2.2.3 Extensions of fully-connected feedforward ANNs
2.2.4 Parallelizations of fully-connected feedforward ANNs with different lengths
2.3 Scalar multiplications of fully-connected feedforward ANNs
2.3.1 Affine transformations as fully-connected feedforward ANNs
2.3.2 Scalar multiplications of fully-connected feedforward ANNs
2.4 Sums of fully-connected feedforward ANNs with the same length
2.4.1 Sums of vectors as fully-connected feedforward ANNs
2.4.2 Concatenation of vectors as fully-connected feedforward ANNs
2.4.3 Sums of fully-connected feedforward ANNs

II Approximation

3 One-dimensional ANN approximation results
3.1 Linear interpolation of one-dimensional functions
3.1.1 On the modulus of continuity
3.1.2 Linear interpolation of one-dimensional functions
3.2 Linear interpolation with fully-connected feedforward ANNs
3.2.1 Activation functions as fully-connected feedforward ANNs
3.2.2 Representations for ReLU ANNs with one hidden neuron
3.2.3 ReLU ANN representations for linear interpolations
3.3 ANN approximations results for one-dimensional functions
3.3.1 Constructive ANN approximation results
3.3.2 Convergence rates for the approximation error

4 Multi-dimensional ANN approximation results
4.1 Approximations through supremal convolutions
4.2 ANN representations
4.2.1 ANN representations for the 1-norm
4.2.2 ANN representations for maxima
4.2.3 ANN representations for maximum convolutions
4.3 ANN approximations results for multi-dimensional functions
4.3.1 Constructive ANN approximation results
4.3.2 Covering number estimates
4.3.3 Convergence rates for the approximation error
4.4 Refined ANN approximations results for multi-dimensional functions
4.4.1 Rectified clipped ANNs
4.4.2 Embedding ANNs in larger architectures
4.4.3 Approximation through ANNs with variable architectures
4.4.4 Refined convergence rates for the approximation error

III Optimization

5 Optimization through gradient flow (GF) trajectories
5.1 Introductory comments for the training of ANNs
5.2 Basics for GFs
5.2.1 GF ordinary differential equations (ODEs)
5.2.2 Direction of negative gradients
5.3 Regularity properties for ANNs
5.3.1 On the differentiability of compositions of parametric functions
5.3.2 On the differentiability of realizations of ANNs
5.4 Loss functions
5.4.1 Absolute error loss
5.4.2 Mean squared error loss
5.4.3 Huber error loss
5.4.4 Cross-entropy loss
5.4.5 Kullback–Leibler divergence loss
5.5 GF optimization in the training of ANNs
5.6 Lyapunov-type functions for GFs
5.6.1 Gronwall differential inequalities
5.6.2 Lyapunov-type functions for ODEs
5.6.3 On Lyapunov-type functions and coercivity-type conditions
5.6.4 Sufficient and necessary conditions for local minimum points
5.6.5 On a linear growth condition
5.7 Optimization through flows of ODEs
5.7.1 Approximation of local minimum points through GFs
5.7.2 Existence and uniqueness of solutions of ODEs
5.7.3 Approximation of local minimum points through GFs revisited
5.7.4 Approximation error with respect to the objective function

6 Deterministic gradient descent (GD) optimization methods
6.1 GD optimization
6.1.1 GD optimization in the training of ANNs
6.1.2 Euler discretizations for GF ODEs
6.1.3 Lyapunov-type stability for GD optimization
6.1.4 Error analysis for GD optimization
6.2 Explicit midpoint GD optimization
6.2.1 Explicit midpoint discretizations for GF ODEs
6.3 GD optimization with classical momentum
6.3.1 Representations for GD optimization with momentum
6.3.2 Bias-adjusted GD optimization with momentum
6.3.3 Error analysis for GD optimization with momentum
6.3.4 Numerical comparisons for GD optimization with and without momentum
6.4 GD optimization with Nesterov momentum
6.5 Adagrad GD optimization (Adagrad)
6.6 Root mean square propagation GD optimization (RMSprop)
6.6.1 Representations of the mean square terms in RMSprop
6.6.2 Bias-adjusted root mean square propagation GD optimization
6.7 Adadelta GD optimization
6.8 Adaptive moment estimation GD optimization (Adam)

7 Stochastic gradient descent (SGD) optimization methods
7.1 Introductory comments for the training of ANNs with SGD
7.2 SGD optimization
7.2.1 SGD optimization in the training of ANNs
7.2.2 Non-convergence of SGD for not appropriately decaying learning rates
7.2.3 Convergence rates for SGD for quadratic objective functions
7.2.4 Convergence rates for SGD for coercive objective functions
7.3 Explicit midpoint SGD optimization
7.4 SGD optimization with classical momentum
7.4.1 Bias-adjusted SGD optimization with classical momentum
7.5 SGD optimization with Nesterov momentum
7.5.1 Simplified SGD optimization with Nesterov momentum
7.6 Adagrad SGD optimization (Adagrad)
7.7 Root mean square propagation SGD optimization (RMSprop)
7.7.1 Bias-adjusted root mean square propagation SGD optimization
7.8 Adadelta SGD optimization
7.9 Adaptive moment estimation SGD optimization (Adam)

8 Backpropagation
8.1 Backpropagation for parametric functions
8.2 Backpropagation for ANNs

9 Kurdyka–Łojasiewicz (KL) inequalities
9.1 Standard KL functions
9.2 Convergence analysis using standard KL functions (regular regime)
9.3 Standard KL inequalities for monomials
9.4 Standard KL inequalities around non-critical points
9.5 Standard KL inequalities with increased exponents
9.6 Standard KL inequalities for one-dimensional polynomials
9.7 Power series and analytic functions
9.8 Standard KL inequalities for one-dimensional analytic functions
9.9 Standard KL inequalities for analytic functions
9.10 Counterexamples
9.11 Convergence analysis for solutions of GF ODEs
9.11.1 Abstract local convergence results for GF processes
9.11.2 Abstract global convergence results for GF processes
9.12 Convergence analysis for GD processes
9.12.1 One-step descent property for GD processes
9.12.2 Abstract local convergence results for GD processes
9.13 On the analyticity of realization functions of ANNs
9.14 Standard KL inequalities for empirical risks in the training of ANNs with analytic activation functions
9.15 Fréchet subdifferentials and limiting Fréchet subdifferentials
9.16 Non-smooth slope
9.17 Generalized KL functions

10 ANNs with batch normalization
10.1 Batch normalization (BN)
10.2 Structured descr. of fully-connected feedforward ANNs with BN (training)
10.3 Realizations of fully-connected feedforward ANNs with BN (training)
10.4 Structured descr. of fully-connected feedforward ANNs with BN (inference)
10.5 Realizations of fully-connected feedforward ANNs with BN (inference)
10.6 On the connection between BN for training and BN for inference

11 Optimization through random initializations
11.1 Analysis of the optimization error
11.1.1 The complementary distribution function formula
11.1.2 Estimates for the optimization error involving complementary distribution functions
11.2 Strong convergences rates for the optimization error
11.2.1 Properties of the gamma and the beta function
11.2.2 Product measurability of continuous random fields
11.2.3 Strong convergences rates for the optimization error
11.3 Strong convergences rates for the optimization error involving ANNs
11.3.1 Local Lipschitz continuity estimates for the parametrization functions of ANNs
11.3.2 Strong convergences rates for the optimization error involving ANNs

IV Generalization

12 Probabilistic generalization error estimates
12.1 Concentration inequalities for random variables
12.1.1 Markov’s inequality
12.1.2 A first concentration inequality
12.1.3 Moment-generating functions
12.1.4 Chernoff bounds
12.1.5 Hoeffding’s inequality
12.1.6 A strengthened Hoeffding’s inequality
12.2 Covering number estimates
12.2.1 Entropy quantities
12.2.2 Inequalities for packing entropy quantities in metric spaces
12.2.3 Inequalities for covering entropy quantities in metric spaces
12.2.4 Inequalities for entropy quantities in finite dimensional vector spaces
12.3 Empirical risk minimization
12.3.1 Concentration inequalities for random fields
12.3.2 Uniform estimates for the statistical learning error

13 Strong generalization error estimates
13.1 Monte Carlo estimates
13.2 Uniform strong error estimates for random fields
13.3 Strong convergence rates for the generalisation error

V Composed error analysis

14 Overall error decomposition
14.1 Bias-variance decomposition
14.1.1 Risk minimization for measurable functions
14.2 Overall error decomposition

15 Composed error estimates
15.1 Full strong error analysis for the training of ANNs
15.2 Full strong error analysis with optimization via SGD with random initializations

VI Deep learning for partial differential equations (PDEs)

16 Physics-informed neural networks (PINNs)
16.1 Reformulation of PDE problems as stochastic optimization problems
16.2 Derivation of PINNs and deep Galerkin methods (DGMs)
16.3 Implementation of PINNs
16.4 Implementation of DGMs

17 Deep Kolmogorov methods (DKMs)
17.1 Stochastic optimization problems for expectations of random variables
17.2 Stochastic optimization problems for expectations of random fields
17.3 Feynman–Kac formulas
17.3.1 Feynman–Kac formulas providing existence of solutions
17.3.2 Feynman–Kac formulas providing uniqueness of solutions
17.4 Reformulation of PDE problems as stochastic optimization problems
17.5 Derivation of DKMs
17.6 Implementation of DKMs

18 Further deep learning methods for PDEs
18.1 Deep learning methods based on strong formulations of PDEs
18.2 Deep learning methods based on weak formulations of PDEs
18.3 Deep learning methods based on stochastic representations of PDEs
18.4 Error analyses for deep learning methods for PDEs

Index of abbreviations

List of figures

List of source codes

List of definitions

Bibliography

Introduction

Very roughly speaking, the field of deep learning can be divided into three subfields: deep
supervised learning, deep unsupervised learning, and deep reinforcement learning. Algorithms
in deep supervised learning often seem to be most accessible for a mathematical analysis.
In the following we briefly sketch in a simplified situation some ideas of deep supervised
learning.
Let $d, M \in \mathbb{N} = \{1, 2, 3, \dots\}$, $\mathcal{E} \in C(\mathbb{R}^d, \mathbb{R})$, $x_1, x_2, \dots, x_{M+1} \in \mathbb{R}^d$, $y_1, y_2, \dots, y_M \in \mathbb{R}$
satisfy for all $m \in \{1, 2, \dots, M\}$ that

$$y_m = \mathcal{E}(x_m). \tag{1}$$

In the framework described in the previous sentence we think of $M \in \mathbb{N}$ as the number of
available known input-output data pairs, we think of $d \in \mathbb{N}$ as the dimension of the input
data, we think of $\mathcal{E} \colon \mathbb{R}^d \to \mathbb{R}$ as an unknown function which relates input and output data
through (1), we think of $x_1, x_2, \dots, x_{M+1} \in \mathbb{R}^d$ as the available known input data, and we
think of $y_1, y_2, \dots, y_M \in \mathbb{R}$ as the available known output data.
In the context of a learning problem of the type (1) the objective then is to approximately
compute the output $\mathcal{E}(x_{M+1})$ of the $(M+1)$-th input data $x_{M+1}$ without using explicit
knowledge of the function $\mathcal{E} \colon \mathbb{R}^d \to \mathbb{R}$ but instead by using the knowledge of the $M$
input-output data pairs

$$(x_1, y_1) = (x_1, \mathcal{E}(x_1)),\ (x_2, y_2) = (x_2, \mathcal{E}(x_2)),\ \dots,\ (x_M, y_M) = (x_M, \mathcal{E}(x_M)) \in \mathbb{R}^d \times \mathbb{R}. \tag{2}$$

To accomplish this, one considers the optimization problem of computing approximate
minimizers of the function $L \colon C(\mathbb{R}^d, \mathbb{R}) \to [0, \infty)$ which satisfies for all $\varphi \in C(\mathbb{R}^d, \mathbb{R})$ that
\[
  L(\varphi) = \frac{1}{M} \Biggl[ \sum_{m=1}^{M} |\varphi(x_m) - y_m|^2 \Biggr]. \tag{3}
\]

Observe that (1) ensures that L(E) = 0 and, in particular, we have that the unknown
function E : Rd → R in (1) above is a minimizer of the function

L : C(Rd , R) → [0, ∞). (4)


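The optimization problem of computing approximate minimizers of the function $L$ is not
suitable for discrete numerical computations on a computer as the function $L$ is defined on
the infinite dimensional vector space $C(\mathbb{R}^d, \mathbb{R})$.
To overcome this we introduce a spatially discretized version of this optimization
problem. More specifically, let $\mathfrak{d} \in \mathbb{N}$, let $\psi = (\psi_\theta)_{\theta \in \mathbb{R}^{\mathfrak{d}}} \colon \mathbb{R}^{\mathfrak{d}} \to C(\mathbb{R}^d, \mathbb{R})$ be a function, and
let $\mathcal{L} \colon \mathbb{R}^{\mathfrak{d}} \to [0, \infty)$ satisfy
\[
  \mathcal{L} = L \circ \psi. \tag{5}
\]
Here $\mathfrak{d}$ is the number of parameters (as opposed to the input dimension $d$) and $\mathcal{L}$ is the loss viewed as a function of the parameters (as opposed to the loss $L$ on $C(\mathbb{R}^d, \mathbb{R})$).
We think of the set
\[
  \{\psi_\theta \colon \theta \in \mathbb{R}^{\mathfrak{d}}\} \subseteq C(\mathbb{R}^d, \mathbb{R}) \tag{6}
\]
as a parametrized set of functions which we employ to approximate the infinite dimensional
vector space $C(\mathbb{R}^d, \mathbb{R})$ and we think of the function
\[
  \mathbb{R}^{\mathfrak{d}} \ni \theta \mapsto \psi_\theta \in C(\mathbb{R}^d, \mathbb{R}) \tag{7}
\]
as the parametrization function associated to this set. For example, in the case $d = 1$ one
could think of (7) as the parametrization function associated to polynomials in the sense
that for all $\theta = (\theta_1, \dots, \theta_{\mathfrak{d}}) \in \mathbb{R}^{\mathfrak{d}}$, $x \in \mathbb{R}$ it holds that
\[
  \psi_\theta(x) = \sum_{k=0}^{\mathfrak{d}-1} \theta_{k+1} x^k \tag{8}
\]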

or one could think of (7) as the parametrization associated to trigonometric polynomials.
However, in the context of deep supervised learning one neither chooses (7) as parametrization
of polynomials nor as parametrization of trigonometric polynomials, but instead one chooses
(7) as a parametrization associated to deep ANNs. In Chapter 1 in Part I we present
different types of such deep ANN parametrization functions in all mathematical details.
Taking the set in (6) and its parametrization function in (7) into account, we then intend
to compute approximate minimizers of the function $L$ restricted to the set $\{\psi_\theta \colon \theta \in \mathbb{R}^{\mathfrak{d}}\}$,
that is, we consider the optimization problem of computing approximate minimizers of the
function
\[
  \{\psi_\theta \colon \theta \in \mathbb{R}^{\mathfrak{d}}\} \ni \varphi \mapsto L(\varphi) = \frac{1}{M} \Biggl[ \sum_{m=1}^{M} |\varphi(x_m) - y_m|^2 \Biggr] \in [0, \infty). \tag{9}
\]
Employing the parametrization function in (7), one can also reformulate the optimization
problem in (9) as the optimization problem of computing approximate minimizers of the
function
\[
  \mathbb{R}^{\mathfrak{d}} \ni \theta \mapsto \mathcal{L}(\theta) = L(\psi_\theta) = \frac{1}{M} \Biggl[ \sum_{m=1}^{M} |\psi_\theta(x_m) - y_m|^2 \Biggr] \in [0, \infty) \tag{10}
\]


and this optimization problem now has the potential to be amenable to discrete numerical
computations. In the context of deep supervised learning, where one chooses the
parametrization function in (7) as deep ANN parametrizations, one would apply an SGD-type
optimization algorithm to the optimization problem in (10) to compute approximate
minimizers of (10). In Chapter 7 in Part III we present the most common variants of such
SGD-type optimization algorithms. If $\vartheta \in \mathbb{R}^{\mathfrak{d}}$ is an approximate minimizer of the
parametrized loss $\mathcal{L}$ in (10) in the sense that $\mathcal{L}(\vartheta) \approx \inf_{\theta \in \mathbb{R}^{\mathfrak{d}}} \mathcal{L}(\theta)$, one then considers
$\psi_\vartheta(x_{M+1})$ as an approximation
\[
  \psi_\vartheta(x_{M+1}) \approx E(x_{M+1}) \tag{11}
\]
of the unknown output $E(x_{M+1})$ of the $(M+1)$-th input data $x_{M+1}$. We note that in deep
supervised learning algorithms one typically aims to compute an approximate minimizer
$\vartheta \in \mathbb{R}^{\mathfrak{d}}$ of (10) in the sense that $\mathcal{L}(\vartheta) \approx \inf_{\theta \in \mathbb{R}^{\mathfrak{d}}} \mathcal{L}(\theta)$, which is, however, typically not a
minimizer of (10) in the sense that $\mathcal{L}(\vartheta) = \inf_{\theta \in \mathbb{R}^{\mathfrak{d}}} \mathcal{L}(\theta)$ (cf. Section 9.14).
In (3) above we have set up an optimization problem for the learning problem by using
the standard mean squared error function to measure the loss. This mean squared error
loss function is just one possible example in the formulation of deep learning optimization
problems. In particular, in image classification problems other loss functions such as the
cross-entropy loss function are often used and we refer to Chapter 5 of Part III for a survey
of commonly used loss functions in deep learning algorithms (see Section 5.4.2). We also refer
to Chapter 9 for convergence results in the above framework where the parametrization
function in (7) corresponds to fully-connected feedforward ANNs (see Section 9.14).
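
The workflow sketched in this introduction can be illustrated with a small numerical experiment. The following is only a minimal sketch under stated assumptions: the unknown function is taken to be the sine function, the parametrization is the polynomial one from (8) with four parameters, and plain gradient descent with a fixed step size stands in for the SGD-type methods mentioned above. The names poly_model, empirical_risk, and risk_gradient are hypothetical helpers introduced for this illustration.

```python
import numpy as np


def poly_model(theta, x):
    """Evaluate the polynomial parametrization (8): sum_k theta[k] * x**k."""
    powers = np.arange(theta.size)
    return (x[:, None] ** powers) @ theta


def empirical_risk(theta, x, y):
    """Mean squared error (10) of the polynomial model on the data (x, y)."""
    return float(np.mean((poly_model(theta, x) - y) ** 2))


def risk_gradient(theta, x, y):
    """Gradient of (10) with respect to theta for the polynomial model."""
    features = x[:, None] ** np.arange(theta.size)
    return (2.0 / x.size) * features.T @ (features @ theta - y)


# Assumed training data: M = 20 inputs and outputs y_m = E(x_m) with E = sin.
x_train = np.linspace(-1.0, 1.0, 20)
y_train = np.sin(x_train)

theta = np.zeros(4)  # four parameters, i.e. a cubic polynomial
initial_risk = empirical_risk(theta, x_train, y_train)
for _ in range(5000):  # plain gradient descent with step size 0.1
    theta = theta - 0.1 * risk_gradient(theta, x_train, y_train)
final_risk = empirical_risk(theta, x_train, y_train)

# Approximate the output at a new input, in the spirit of (11).
x_new = np.array([0.5])
prediction = float(poly_model(theta, x_new)[0])
print(initial_risk, final_risk, prediction, np.sin(0.5))
```

In deep supervised learning the polynomial model would be replaced by an ANN parametrization and the full gradient by a stochastic one; the structure of the loop stays the same.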

Part I

Artificial neural networks (ANNs)

Chapter 1

Basics on ANNs

In this chapter we review different types of architectures of ANNs such as fully-connected
feedforward ANNs (see Sections 1.1 and 1.3), CNNs (see Section 1.4), ResNets (see
Section 1.5), and RNNs (see Section 1.6), we review different types of popular activation
functions used in applications such as the rectified linear unit (ReLU) activation (see
Section 1.2.3), the Gaussian error linear unit (GELU) activation (see Section 1.2.6), and
the standard logistic activation (see Section 1.2.7) among others, and we review different
procedures for how ANNs can be formulated in rigorous mathematical terms (see Section 1.1
for a vectorized description and Section 1.3 for a structured description).
In the literature different types of ANN architectures and activation functions have been
reviewed in several excellent works; cf., for example, [4, 9, 39, 60, 63, 97, 164, 182, 189, 367,
373, 389, 431] and the references therein. The specific presentation of Sections 1.1 and 1.3
is based on [19, 20, 25, 159, 180].

1.1 Fully-connected feedforward ANNs (vectorized description)
We start the mathematical content of this book with a review of fully-connected feedforward
ANNs, the most basic type of ANNs. Roughly speaking, fully-connected feedforward
ANNs can be thought of as parametric functions resulting from successive compositions of
affine functions followed by nonlinear functions, where the parameters of a fully-connected
feedforward ANN correspond to all the entries of the linear transformation matrices and
translation vectors of the involved affine functions (cf. Definition 1.1.3 below for a precise
definition of fully-connected feedforward ANNs and Figure 1.2 below for a graphical
illustration of fully-connected feedforward ANNs). The linear transformation matrices and
translation vectors are sometimes called weight matrices and bias vectors, respectively, and
can be thought of as the trainable parameters of fully-connected feedforward ANNs (cf.
Remark 1.1.5 below).


In this section we introduce in Definition 1.1.3 below a vectorized description of
fully-connected feedforward ANNs in the sense that all the trainable parameters of a
fully-connected feedforward ANN are represented by the components of a single Euclidean
vector. In Section 1.3 below we will discuss an alternative way to describe fully-connected
feedforward ANNs in which the trainable parameters of a fully-connected feedforward ANN
are represented by a tuple of matrix-vector pairs corresponding to the weight matrices and
bias vectors of the fully-connected feedforward ANNs (cf. Definitions 1.3.1 and 1.3.4 below).

Figure 1.1: Graphical illustration of a fully-connected feedforward ANN consisting of
L ∈ N affine transformations (i.e., consisting of L + 1 layers: one input layer, L − 1
hidden layers, and one output layer) with l0 ∈ N neurons on the input layer (i.e.,
with l0 -dimensional input layer), with l1 ∈ N neurons on the first hidden layer (i.e.,
with l1 -dimensional first hidden layer), with l2 ∈ N neurons on the second hidden
layer (i.e., with l2 -dimensional second hidden layer), . . . , with lL−1 neurons on the
(L − 1)-th hidden layer (i.e., with (lL−1 )-dimensional (L − 1)-th hidden layer), and
with lL neurons in the output layer (i.e., with lL -dimensional output layer).


1.1.1 Affine functions


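Definition 1.1.1 (Affine functions). Let $d, m, n \in \mathbb{N}$, $s \in \mathbb{N}_0$, $\theta = (\theta_1, \theta_2, \dots, \theta_d) \in \mathbb{R}^d$
satisfy $d \geq s + mn + m$. Then we denote by $\mathcal{A}^{\theta,s}_{m,n} \colon \mathbb{R}^n \to \mathbb{R}^m$ the function which satisfies
for all $x = (x_1, x_2, \dots, x_n) \in \mathbb{R}^n$ that
\[
\begin{aligned}
  \mathcal{A}^{\theta,s}_{m,n}(x)
  &= \begin{pmatrix}
       \theta_{s+1} & \theta_{s+2} & \cdots & \theta_{s+n} \\
       \theta_{s+n+1} & \theta_{s+n+2} & \cdots & \theta_{s+2n} \\
       \theta_{s+2n+1} & \theta_{s+2n+2} & \cdots & \theta_{s+3n} \\
       \vdots & \vdots & \ddots & \vdots \\
       \theta_{s+(m-1)n+1} & \theta_{s+(m-1)n+2} & \cdots & \theta_{s+mn}
     \end{pmatrix}
     \begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}
     + \begin{pmatrix} \theta_{s+mn+1} \\ \theta_{s+mn+2} \\ \theta_{s+mn+3} \\ \vdots \\ \theta_{s+mn+m} \end{pmatrix} \\
  &= \Bigl( \bigl[\textstyle\sum_{k=1}^{n} x_k \theta_{s+k}\bigr] + \theta_{s+mn+1},\,
            \bigl[\textstyle\sum_{k=1}^{n} x_k \theta_{s+n+k}\bigr] + \theta_{s+mn+2},\, \dots,\,
            \bigl[\textstyle\sum_{k=1}^{n} x_k \theta_{s+(m-1)n+k}\bigr] + \theta_{s+mn+m} \Bigr)
\end{aligned}
\tag{1.1}
\]
and we call $\mathcal{A}^{\theta,s}_{m,n}$ the affine function from $\mathbb{R}^n$ to $\mathbb{R}^m$ associated to $(\theta, s)$.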

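Example 1.1.2 (Example for Definition 1.1.1). Let $\theta = (0, 1, 2, 0, 3, 3, 0, 1, 7) \in \mathbb{R}^9$. Then
\[
  \mathcal{A}^{\theta,1}_{2,2}((1, 2)) = (8, 6) \tag{1.2}
\]
(cf. Definition 1.1.1).

Proof for Example 1.1.2. Observe that (1.1) ensures that
\[
  \mathcal{A}^{\theta,1}_{2,2}((1, 2))
  = \begin{pmatrix} 1 & 2 \\ 0 & 3 \end{pmatrix} \begin{pmatrix} 1 \\ 2 \end{pmatrix} + \begin{pmatrix} 3 \\ 0 \end{pmatrix}
  = \begin{pmatrix} 1 + 4 \\ 0 + 6 \end{pmatrix} + \begin{pmatrix} 3 \\ 0 \end{pmatrix}
  = \begin{pmatrix} 8 \\ 6 \end{pmatrix}. \tag{1.3}
\]
The proof for Example 1.1.2 is thus complete.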
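
The computation in Example 1.1.2 can be reproduced numerically. The following is a minimal sketch, assuming NumPy and a hypothetical helper named affine; since Python indexes from zero, the entry written θ_j in Definition 1.1.1 is stored as theta[j - 1].

```python
import numpy as np


def affine(theta, s, m, n, x):
    """Evaluate the affine function of Definition 1.1.1 associated to (theta, s)."""
    theta = np.asarray(theta, dtype=float)
    if theta.size < s + m * n + m:
        raise ValueError("theta needs at least s + m*n + m entries")
    # Entries s+1, ..., s+mn fill the m x n matrix row by row.
    W = theta[s : s + m * n].reshape(m, n)
    # Entries s+mn+1, ..., s+mn+m form the translation vector.
    b = theta[s + m * n : s + m * n + m]
    return W @ np.asarray(x, dtype=float) + b


theta = [0, 1, 2, 0, 3, 3, 0, 1, 7]
result = affine(theta, s=1, m=2, n=2, x=[1, 2])
print(result)
```

Entries of θ beyond position s + mn + m (here the trailing 0, 1, 7) are simply ignored, in line with the inequality d ≥ s + mn + m in the definition.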

Exercise 1.1.1. Let $\theta = (3, 1, -2, 1, -3, 0, 5, 4, -1, -1, 0) \in \mathbb{R}^{11}$. Specify $\mathcal{A}^{\theta,2}_{2,3}((-1, 1, -1))$
explicitly and prove that your result is correct (cf. Definition 1.1.1)!

1.1.2 Vectorized description of fully-connected feedforward ANNs


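Definition 1.1.3 (Vectorized description of fully-connected feedforward ANNs). Let $d, L \in \mathbb{N}$,
$l_0, l_1, \dots, l_L \in \mathbb{N}$, $\theta \in \mathbb{R}^d$ satisfy
\[
  d \geq \sum_{k=1}^{L} l_k (l_{k-1} + 1) \tag{1.4}
\]
and for every $k \in \{1, 2, \dots, L\}$ let $\Psi_k \colon \mathbb{R}^{l_k} \to \mathbb{R}^{l_k}$ be a function. Then we denote by
$\mathcal{N}^{\theta,l_0}_{\Psi_1,\Psi_2,\dots,\Psi_L} \colon \mathbb{R}^{l_0} \to \mathbb{R}^{l_L}$ the function which satisfies for all $x \in \mathbb{R}^{l_0}$ that
\[
\begin{aligned}
  \bigl(\mathcal{N}^{\theta,l_0}_{\Psi_1,\Psi_2,\dots,\Psi_L}\bigr)(x)
  = \Bigl( & \Psi_L \circ \mathcal{A}^{\theta,\sum_{k=1}^{L-1} l_k(l_{k-1}+1)}_{l_L,l_{L-1}}
    \circ \Psi_{L-1} \circ \mathcal{A}^{\theta,\sum_{k=1}^{L-2} l_k(l_{k-1}+1)}_{l_{L-1},l_{L-2}} \circ \dots \\
  & \dots \circ \Psi_2 \circ \mathcal{A}^{\theta,l_1(l_0+1)}_{l_2,l_1}
    \circ \Psi_1 \circ \mathcal{A}^{\theta,0}_{l_1,l_0} \Bigr)(x)
\end{aligned}
\tag{1.5}
\]
and we call $\mathcal{N}^{\theta,l_0}_{\Psi_1,\Psi_2,\dots,\Psi_L}$ the realization function of the fully-connected feedforward ANN
associated to $\theta$ with $L + 1$ layers with dimensions $(l_0, l_1, \dots, l_L)$ and activation functions
$(\Psi_1, \Psi_2, \dots, \Psi_L)$ (we call $\mathcal{N}^{\theta,l_0}_{\Psi_1,\Psi_2,\dots,\Psi_L}$ the realization of the fully-connected feedforward
ANN associated to $\theta$ with $L + 1$ layers with dimensions $(l_0, l_1, \dots, l_L)$ and activations
$(\Psi_1, \Psi_2, \dots, \Psi_L)$) (cf. Definition 1.1.1).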

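Example 1.1.4 (Example for Definition 1.1.3). Let $\theta = (1, -1, 2, -2, 3, -3, 0, 0, 1) \in \mathbb{R}^9$
and let $\Psi \colon \mathbb{R}^2 \to \mathbb{R}^2$ satisfy for all $x = (x_1, x_2) \in \mathbb{R}^2$ that
\[
  \Psi(x) = (\max\{x_1, 0\}, \max\{x_2, 0\}). \tag{1.6}
\]
Then
\[
  \bigl(\mathcal{N}^{\theta,1}_{\Psi,\operatorname{id}_{\mathbb{R}}}\bigr)(2) = 12 \tag{1.7}
\]
(cf. Definition 1.1.3).

Proof for Example 1.1.4. Note that (1.1), (1.5), and (1.6) assure that
\[
\begin{aligned}
  \bigl(\mathcal{N}^{\theta,1}_{\Psi,\operatorname{id}_{\mathbb{R}}}\bigr)(2)
  &= \bigl(\operatorname{id}_{\mathbb{R}} \circ \mathcal{A}^{\theta,4}_{1,2} \circ \Psi \circ \mathcal{A}^{\theta,0}_{2,1}\bigr)(2)
   = \bigl(\mathcal{A}^{\theta,4}_{1,2} \circ \Psi\bigr)\biggl( \begin{pmatrix} 1 \\ -1 \end{pmatrix} 2 + \begin{pmatrix} 2 \\ -2 \end{pmatrix} \biggr) \\
  &= \bigl(\mathcal{A}^{\theta,4}_{1,2} \circ \Psi\bigr)\biggl( \begin{pmatrix} 4 \\ -4 \end{pmatrix} \biggr)
   = \mathcal{A}^{\theta,4}_{1,2}\biggl( \begin{pmatrix} 4 \\ 0 \end{pmatrix} \biggr)
   = \begin{pmatrix} 3 & -3 \end{pmatrix} \begin{pmatrix} 4 \\ 0 \end{pmatrix} + 0 = 12
\end{aligned}
\tag{1.8}
\]
(cf. Definitions 1.1.1 and 1.1.3). The proof for Example 1.1.4 is thus complete.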
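
The composition in (1.5) translates into a short loop over the layers. The sketch below assumes NumPy; the function name realization and its argument names are hypothetical and chosen for this illustration only. The running variable offset plays the role of the second superscript of the affine functions in (1.5).

```python
import numpy as np


def realization(theta, dims, activations, x):
    """Evaluate the realization function of Definition 1.1.3.

    theta       -- flat parameter vector (0-indexed here)
    dims        -- layer dimensions (l_0, l_1, ..., l_L)
    activations -- callables (Psi_1, ..., Psi_L)
    """
    theta = np.asarray(theta, dtype=float)
    out = np.asarray(x, dtype=float)
    offset = 0  # number of parameters consumed by the previous layers
    for k in range(1, len(dims)):
        n_in, n_out = dims[k - 1], dims[k]
        W = theta[offset : offset + n_out * n_in].reshape(n_out, n_in)
        b = theta[offset + n_out * n_in : offset + n_out * (n_in + 1)]
        out = activations[k - 1](W @ out + b)
        offset += n_out * (n_in + 1)
    return out


def relu(v):
    return np.maximum(v, 0.0)


def identity(v):
    return v


# Data of Example 1.1.4: one input, two hidden neurons, one output.
theta = [1, -1, 2, -2, 3, -3, 0, 0, 1]
value = realization(theta, dims=(1, 2, 1), activations=(relu, identity), x=[2])
print(value)
```

Only the first 7 of the 9 entries of θ are used for the layer dimensions (1, 2, 1), which is consistent with the inequality (1.4).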

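Exercise 1.1.2. Let $\theta = (1, -1, 0, 0, 1, -1, 0) \in \mathbb{R}^7$ and let $\Psi \colon \mathbb{R}^2 \to \mathbb{R}^2$ satisfy for all
$x = (x_1, x_2) \in \mathbb{R}^2$ that
\[
  \Psi(x) = (\max\{x_1, 0\}, \min\{x_2, 0\}). \tag{1.9}
\]
Prove or disprove the following statement: It holds that
\[
  \bigl(\mathcal{N}^{\theta,1}_{\Psi,\operatorname{id}_{\mathbb{R}}}\bigr)(-1) = -1 \tag{1.10}
\]
(cf. Definition 1.1.3).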


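Exercise 1.1.3. Let $\theta = (\theta_1, \theta_2, \dots, \theta_{10}) \in \mathbb{R}^{10}$ satisfy
\[
  \theta = (\theta_1, \theta_2, \dots, \theta_{10}) = (1, 0, 2, -1, 2, 0, -1, 1, 2, 1)
\]
and let $m \colon \mathbb{R} \to \mathbb{R}$ and $q \colon \mathbb{R} \to \mathbb{R}$ satisfy for all $x \in \mathbb{R}$ that
\[
  m(x) = \max\{-x, 0\} \qquad\text{and}\qquad q(x) = x^2. \tag{1.11}
\]
Specify $\bigl(\mathcal{N}^{\theta,1}_{q,m,q}\bigr)(0)$, $\bigl(\mathcal{N}^{\theta,1}_{q,m,q}\bigr)(1)$, and $\bigl(\mathcal{N}^{\theta,1}_{q,m,q}\bigr)(1/2)$ explicitly and prove that your results are
correct (cf. Definition 1.1.3)!

Exercise 1.1.4. Let $\theta = (\theta_1, \theta_2, \dots, \theta_{15}) \in \mathbb{R}^{15}$ satisfy
\[
  (\theta_1, \theta_2, \dots, \theta_{15}) = (1, -2, 0, 3, 2, -1, 0, 3, 1, -1, 1, -1, 2, 0, -1) \tag{1.12}
\]
and let $\Phi \colon \mathbb{R}^2 \to \mathbb{R}^2$ and $\Psi \colon \mathbb{R}^2 \to \mathbb{R}^2$ satisfy for all $x, y \in \mathbb{R}$ that $\Phi(x, y) = (y, x)$ and
$\Psi(x, y) = (xy, xy)$.
a) Prove or disprove the following statement: It holds that $\bigl(\mathcal{N}^{\theta,2}_{\Phi,\Psi}\bigr)(1, -1) = (4, 4)$
(cf. Definition 1.1.3).
b) Prove or disprove the following statement: It holds that $\bigl(\mathcal{N}^{\theta,2}_{\Phi,\Psi}\bigr)(-1, 1) = (-4, -4)$
(cf. Definition 1.1.3).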

1.1.3 Weight and bias parameters of fully-connected feedforward ANNs
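Remark 1.1.5 (Weights and biases for fully-connected feedforward ANNs). Let $L \in \{2, 3, 4, \dots\}$,
$v_0, v_1, \dots, v_{L-1} \in \mathbb{N}_0$, $l_0, l_1, \dots, l_L, d \in \mathbb{N}$, $\theta = (\theta_1, \theta_2, \dots, \theta_d) \in \mathbb{R}^d$ satisfy for all
$k \in \{0, 1, \dots, L - 1\}$ that
\[
  d \geq \sum_{i=1}^{L} l_i (l_{i-1} + 1) \qquad\text{and}\qquad v_k = \sum_{i=1}^{k} l_i (l_{i-1} + 1), \tag{1.13}
\]
let $W_k \in \mathbb{R}^{l_k \times l_{k-1}}$, $k \in \{1, 2, \dots, L\}$, and $b_k \in \mathbb{R}^{l_k}$, $k \in \{1, 2, \dots, L\}$, satisfy for all
$k \in \{1, 2, \dots, L\}$ that
\[
  W_k = \underbrace{\begin{pmatrix}
    \theta_{v_{k-1}+1} & \theta_{v_{k-1}+2} & \dots & \theta_{v_{k-1}+l_{k-1}} \\
    \theta_{v_{k-1}+l_{k-1}+1} & \theta_{v_{k-1}+l_{k-1}+2} & \dots & \theta_{v_{k-1}+2l_{k-1}} \\
    \theta_{v_{k-1}+2l_{k-1}+1} & \theta_{v_{k-1}+2l_{k-1}+2} & \dots & \theta_{v_{k-1}+3l_{k-1}} \\
    \vdots & \vdots & \ddots & \vdots \\
    \theta_{v_{k-1}+(l_k-1)l_{k-1}+1} & \theta_{v_{k-1}+(l_k-1)l_{k-1}+2} & \dots & \theta_{v_{k-1}+l_k l_{k-1}}
  \end{pmatrix}}_{\text{weight parameters}} \tag{1.14}
\]
and
\[
  b_k = \underbrace{\bigl( \theta_{v_{k-1}+l_k l_{k-1}+1}, \theta_{v_{k-1}+l_k l_{k-1}+2}, \dots, \theta_{v_{k-1}+l_k l_{k-1}+l_k} \bigr)}_{\text{bias parameters}}, \tag{1.15}
\]
and let $\Psi_k \colon \mathbb{R}^{l_k} \to \mathbb{R}^{l_k}$, $k \in \{1, 2, \dots, L\}$, be functions. Then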


Figure 1.2: Graphical illustration of an ANN. The ANN has 2 hidden layers and
length L = 3 with 3 neurons in the input layer (corresponding to l0 = 3), 6 neurons
in the first hidden layer (corresponding to l1 = 6), 3 neurons in the second hidden
layer (corresponding to l2 = 3), and one neuron in the output layer (corresponding
to l3 = 1). In this situation we have an ANN with 39 weight parameters and 10 bias
parameters adding up to 49 parameters overall. The realization of this ANN is a
function from R3 to R.

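(i) it holds that
\[
  \mathcal{N}^{\theta,l_0}_{\Psi_1,\Psi_2,\dots,\Psi_L}
  = \Psi_L \circ \mathcal{A}^{\theta,v_{L-1}}_{l_L,l_{L-1}} \circ \Psi_{L-1} \circ \mathcal{A}^{\theta,v_{L-2}}_{l_{L-1},l_{L-2}} \circ \Psi_{L-2} \circ \dots \circ \mathcal{A}^{\theta,v_1}_{l_2,l_1} \circ \Psi_1 \circ \mathcal{A}^{\theta,v_0}_{l_1,l_0} \tag{1.16}
\]
and

(ii) it holds for all $k \in \{1, 2, \dots, L\}$, $x \in \mathbb{R}^{l_{k-1}}$ that $\mathcal{A}^{\theta,v_{k-1}}_{l_k,l_{k-1}}(x) = W_k x + b_k$

(cf. Definitions 1.1.1 and 1.1.3).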
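
The bookkeeping in Remark 1.1.5 can be made concrete by splitting a flat parameter vector into weight matrices and bias vectors. The following is a minimal sketch assuming NumPy; split_parameters is a hypothetical helper name, and the layer dimensions (3, 6, 3, 1) are those of the ANN illustrated in Figure 1.2.

```python
import numpy as np


def split_parameters(theta, dims):
    """Split theta into the weight matrices W_k and bias vectors b_k of Remark 1.1.5."""
    theta = np.asarray(theta, dtype=float)
    weights, biases = [], []
    v = 0  # offset v_{k-1}: parameters used by the layers before layer k
    for n_in, n_out in zip(dims[:-1], dims[1:]):
        weights.append(theta[v : v + n_out * n_in].reshape(n_out, n_in))
        biases.append(theta[v + n_out * n_in : v + n_out * (n_in + 1)])
        v += n_out * (n_in + 1)
    return weights, biases


dims = (3, 6, 3, 1)
theta = np.arange(1, 50)  # placeholder values 1, 2, ..., 49
weights, biases = split_parameters(theta, dims)
num_weights = sum(W.size for W in weights)
num_biases = sum(b.size for b in biases)
print(num_weights, num_biases, num_weights + num_biases)
```

For these layer dimensions the counts are 39 weight parameters and 10 bias parameters, matching the numbers stated in the caption of Figure 1.2.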

1.2 Activation functions


In this section we review a few popular activation functions from the literature (cf.
Definition 1.1.3 above and Definition 1.3.4 below for the use of activation functions in the
context of fully-connected feedforward ANNs, cf. Definition 1.4.5 below for the use of activation
functions in the context of CNNs, cf. Definition 1.5.4 below for the use of activation functions
in the context of ResNets, and cf. Definitions 1.6.3 and 1.6.4 below for the use of activation
functions in the context of RNNs).

1.2.1 Multidimensional versions


To describe multidimensional activation functions, we frequently employ the concept of the
multidimensional version of a function. This concept is the subject of the next notion.

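Definition 1.2.1 (Multidimensional versions of one-dimensional functions). Let $T \in \mathbb{N}$,
$d_1, d_2, \dots, d_T \in \mathbb{N}$ and let $\psi \colon \mathbb{R} \to \mathbb{R}$ be a function. Then we denote by
\[
  \mathfrak{M}_{\psi,d_1,d_2,\dots,d_T} \colon \mathbb{R}^{d_1 \times d_2 \times \dots \times d_T} \to \mathbb{R}^{d_1 \times d_2 \times \dots \times d_T} \tag{1.17}
\]
the function which satisfies for all
$x = (x_{k_1,k_2,\dots,k_T})_{(k_1,k_2,\dots,k_T) \in (\times_{t=1}^{T} \{1,2,\dots,d_t\})} \in \mathbb{R}^{d_1 \times d_2 \times \dots \times d_T}$,
$y = (y_{k_1,k_2,\dots,k_T})_{(k_1,k_2,\dots,k_T) \in (\times_{t=1}^{T} \{1,2,\dots,d_t\})} \in \mathbb{R}^{d_1 \times d_2 \times \dots \times d_T}$
with $\forall\, k_1 \in \{1, 2, \dots, d_1\},\, k_2 \in \{1, 2, \dots, d_2\},\, \dots,\, k_T \in \{1, 2, \dots, d_T\} \colon y_{k_1,k_2,\dots,k_T} = \psi(x_{k_1,k_2,\dots,k_T})$ that
\[
  \mathfrak{M}_{\psi,d_1,d_2,\dots,d_T}(x) = y \tag{1.18}
\]
and we call $\mathfrak{M}_{\psi,d_1,d_2,\dots,d_T}$ the $d_1 \times d_2 \times \dots \times d_T$-dimensional version of $\psi$.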

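Example 1.2.2 (Example for Definition 1.2.1). Let $A \in \mathbb{R}^{3 \times 1 \times 2}$ satisfy
\[
  A = \bigl( \begin{pmatrix} 1 & -1 \end{pmatrix}, \begin{pmatrix} -2 & 2 \end{pmatrix}, \begin{pmatrix} 3 & -3 \end{pmatrix} \bigr) \tag{1.19}
\]
and let $\psi \colon \mathbb{R} \to \mathbb{R}$ satisfy for all $x \in \mathbb{R}$ that $\psi(x) = x^2$. Then
\[
  \mathfrak{M}_{\psi,3,1,2}(A) = \bigl( \begin{pmatrix} 1 & 1 \end{pmatrix}, \begin{pmatrix} 4 & 4 \end{pmatrix}, \begin{pmatrix} 9 & 9 \end{pmatrix} \bigr). \tag{1.20}
\]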

Proof for Example 1.2.2. Note that (1.18) establishes (1.20). The proof for Example 1.2.2
is thus complete.
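
In array-based code the multidimensional version of a one-dimensional function is simply its entrywise application. The sketch below assumes NumPy and uses np.vectorize for clarity rather than speed; multidim_version is a hypothetical helper name. It reproduces the data of Example 1.2.2 with an array of shape (3, 1, 2).

```python
import numpy as np


def multidim_version(psi):
    """Return the map that applies psi : R -> R to every entry of an array."""
    return np.vectorize(psi, otypes=[float])


# The array A from (1.19), stored with shape (3, 1, 2).
A = np.array([[[1, -1]], [[-2, 2]], [[3, -3]]])
square_all = multidim_version(lambda t: t ** 2)
result = square_all(A)
print(result.shape)
print(result.tolist())
```

For functions that NumPy already evaluates entrywise (for instance np.abs or np.exp), calling them directly on the array is the more idiomatic and faster choice.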

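Exercise 1.2.1. Let $A \in \mathbb{R}^{2 \times 3}$, $B \in \mathbb{R}^{2 \times 2 \times 2}$ satisfy
\[
  A = \begin{pmatrix} 3 & -2 & 5 \\ 1 & 0 & -2 \end{pmatrix}
  \qquad\text{and}\qquad
  B = \biggl( \begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}, \begin{pmatrix} -3 & -4 \\ 5 & 2 \end{pmatrix} \biggr), \tag{1.21}
\]
and let $\psi \colon \mathbb{R} \to \mathbb{R}$ satisfy for all $x \in \mathbb{R}$ that $\psi(x) = |x|$. Specify $\mathfrak{M}_{\psi,2,3}(A)$ and $\mathfrak{M}_{\psi,2,2,2}(B)$
explicitly and prove that your results are correct (cf. Definition 1.2.1)!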


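Exercise 1.2.2. Let $\theta = (\theta_1, \theta_2, \dots, \theta_{14}) \in \mathbb{R}^{14}$ satisfy
\[
  (\theta_1, \theta_2, \dots, \theta_{14}) = (0, 1, 2, 2, 1, 0, 1, 1, 1, -3, -1, 4, 0, 1) \tag{1.22}
\]
and let $f \colon \mathbb{R} \to \mathbb{R}$ and $g \colon \mathbb{R} \to \mathbb{R}$ satisfy for all $x \in \mathbb{R}$ that
\[
  f(x) = \frac{1}{1 + |x|} \qquad\text{and}\qquad g(x) = x^2. \tag{1.23}
\]
Specify $\bigl(\mathcal{N}^{\theta,1}_{\mathfrak{M}_{f,3},\mathfrak{M}_{g,2}}\bigr)(1)$ and $\bigl(\mathcal{N}^{\theta,1}_{\mathfrak{M}_{g,2},\mathfrak{M}_{f,3}}\bigr)(1)$ explicitly and prove that your results are correct
(cf. Definitions 1.1.3 and 1.2.1)!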

1.2.2 Single hidden layer fully-connected feedforward ANNs

Figure 1.3: Graphical illustration of a fully-connected feedforward ANN consisting of
two affine transformations (i.e., consisting of 3 layers: one input layer, one hidden
layer, and one output layer) with I ∈ N neurons on the input layer (i.e., with
I-dimensional input layer), with H ∈ N neurons on the hidden layer (i.e., with
H-dimensional hidden layer), and with one neuron in the output layer (i.e., with
1-dimensional output layer).


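Lemma 1.2.3 (Fully-connected feedforward ANN with one hidden layer). Let $I, H \in \mathbb{N}$,
$\theta = (\theta_1, \theta_2, \dots, \theta_{HI+2H+1}) \in \mathbb{R}^{HI+2H+1}$, $x = (x_1, x_2, \dots, x_I) \in \mathbb{R}^I$ and let $\psi \colon \mathbb{R} \to \mathbb{R}$ be a
function. Then
\[
  \mathcal{N}^{\theta,I}_{\mathfrak{M}_{\psi,H},\operatorname{id}_{\mathbb{R}}}(x)
  = \Biggl[ \sum_{k=1}^{H} \theta_{HI+H+k}\, \psi\Bigl( \textstyle\sum_{i=1}^{I} x_i \theta_{(k-1)I+i} + \theta_{HI+k} \Bigr) \Biggr] + \theta_{HI+2H+1} \tag{1.24}
\]
(cf. Definitions 1.1.1, 1.1.3, and 1.2.1).

Proof of Lemma 1.2.3. Observe that (1.5) and (1.18) show that
\[
\begin{aligned}
  \mathcal{N}^{\theta,I}_{\mathfrak{M}_{\psi,H},\operatorname{id}_{\mathbb{R}}}(x)
  &= \bigl( \operatorname{id}_{\mathbb{R}} \circ \mathcal{A}^{\theta,HI+H}_{1,H} \circ \mathfrak{M}_{\psi,H} \circ \mathcal{A}^{\theta,0}_{H,I} \bigr)(x) \\
  &= \mathcal{A}^{\theta,HI+H}_{1,H}\bigl( \mathfrak{M}_{\psi,H}\bigl( \mathcal{A}^{\theta,0}_{H,I}(x) \bigr) \bigr) \\
  &= \Biggl[ \sum_{k=1}^{H} \theta_{HI+H+k}\, \psi\Bigl( \textstyle\sum_{i=1}^{I} x_i \theta_{(k-1)I+i} + \theta_{HI+k} \Bigr) \Biggr] + \theta_{HI+2H+1}.
\end{aligned}
\tag{1.25}
\]
The proof of Lemma 1.2.3 is thus complete.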
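
The index arithmetic in (1.24) is easy to get wrong, so a numerical cross-check can be useful. The following sketch, assuming NumPy, compares the explicit double sum with a layer-by-layer evaluation for randomly drawn parameters. The helper names, the choice of tanh as activation, and the sizes I = 3 and H = 4 are assumptions made for this illustration.

```python
import numpy as np


def shallow_formula(theta, num_inputs, num_hidden, psi, x):
    """Right-hand side of (1.24); theta is 0-indexed, so theta_j is theta[j - 1]."""
    I, H = num_inputs, num_hidden
    total = theta[H * I + 2 * H]  # output bias theta_{HI+2H+1}
    for k in range(1, H + 1):
        inner = sum(x[i - 1] * theta[(k - 1) * I + i - 1] for i in range(1, I + 1))
        total += theta[H * I + H + k - 1] * psi(inner + theta[H * I + k - 1])
    return total


def shallow_layers(theta, num_inputs, num_hidden, psi, x):
    """The same network evaluated as output affine map after psi after hidden affine map."""
    I, H = num_inputs, num_hidden
    W1 = theta[: H * I].reshape(H, I)
    b1 = theta[H * I : H * I + H]
    w2 = theta[H * I + H : H * I + 2 * H]
    b2 = theta[H * I + 2 * H]
    return w2 @ psi(W1 @ x + b1) + b2


rng = np.random.default_rng(0)
num_inputs, num_hidden = 3, 4
theta = rng.normal(size=num_hidden * num_inputs + 2 * num_hidden + 1)
x = rng.normal(size=num_inputs)
value_formula = shallow_formula(theta, num_inputs, num_hidden, np.tanh, x)
value_layers = shallow_layers(theta, num_inputs, num_hidden, np.tanh, x)
print(value_formula, value_layers)
```

Agreement of the two values up to rounding is of course no substitute for the proof above; it only guards against off-by-one mistakes in the indices.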

1.2.3 Rectified linear unit (ReLU) activation


In this subsection we formulate the ReLU function which is one of the most frequently used
activation functions in deep learning applications (cf., for example, LeCun et al. [263]).
Definition 1.2.4 (ReLU activation function). We denote by r : R → R the function which
satisfies for all x ∈ R that
r(x) = max{x, 0} (1.26)
and we call r the ReLU activation function (we call r the rectifier function).

Figure 1.4 (plots/relu.pdf): A plot of the ReLU activation function

    import matplotlib.pyplot as plt


    def setup_axis(xlim, ylim):
        # Create axes with the given limits and equal scaling on both axes
        _, ax = plt.subplots()

        ax.set_aspect("equal")
        ax.set_xlim(xlim)
        ax.set_ylim(ylim)
        # Draw the left and bottom spines through the origin, hide the others
        ax.spines["left"].set_position("zero")
        ax.spines["bottom"].set_position("zero")
        ax.spines["right"].set_color("none")
        ax.spines["top"].set_color("none")
        # Keep the axis lines behind the plotted curves
        for s in ax.spines.values():
            s.set_zorder(0)

        return ax

Source code 1.1 (code/activation_functions/plot_util.py): Python code for
the plot_util module used in the code listings throughout this subsection

    import numpy as np
    import tensorflow as tf
    import matplotlib.pyplot as plt
    import plot_util

    ax = plot_util.setup_axis((-2, 2), (-.5, 2))

    # Sample the interval [-2, 2] and plot the ReLU values
    x = np.linspace(-2, 2, 100)

    ax.plot(x, tf.keras.activations.relu(x))

    plt.savefig("../../plots/relu.pdf", bbox_inches='tight')

Source code 1.2 (code/activation_functions/relu_plot.py): Python code used
to create Figure 1.4

Definition 1.2.5 (Multidimensional ReLU activation functions). Let $d \in \mathbb{N}$. Then we
denote by $\mathfrak{R}_d \colon \mathbb{R}^d \to \mathbb{R}^d$ the function given by
\[
  \mathfrak{R}_d = \mathfrak{M}_{r,d} \tag{1.27}
\]
and we call $\mathfrak{R}_d$ the d-dimensional ReLU activation function (we call $\mathfrak{R}_d$ the d-dimensional
rectifier function) (cf. Definitions 1.2.1 and 1.2.4).


Lemma 1.2.6 (An ANN with the ReLU activation function as the activation function).
Let W1 = w1 = 1, W2 = w2 = −1, b1 = b2 = B = 0. Then it holds for all x ∈ R that

x = W1 max{w1 x + b1 , 0} + W2 max{w2 x + b2 , 0} + B. (1.28)

Proof of Lemma 1.2.6. Observe that for all x ∈ R it holds that

W1 max{w1 x + b1 , 0} + W2 max{w2 x + b2 , 0} + B
= max{w1 x + b1 , 0} − max{w2 x + b2 , 0} = max{x, 0} − max{−x, 0} (1.29)
= max{x, 0} + min{x, 0} = x.

The proof of Lemma 1.2.6 is thus complete.
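
The identity (1.28) can also be observed numerically on a grid of sample points. The snippet below is a small sketch assuming NumPy; the grid on [-3, 3] is an arbitrary choice.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


# Parameters from Lemma 1.2.6
W1, w1, W2, w2 = 1.0, 1.0, -1.0, -1.0
b1, b2, B = 0.0, 0.0, 0.0

x = np.linspace(-3.0, 3.0, 13)
network_output = W1 * relu(w1 * x + b1) + W2 * relu(w2 * x + b2) + B
print(np.array_equal(network_output, x))
```

Since only one of the two ReLU terms is nonzero at any given point, the reconstruction of x is exact in floating-point arithmetic and not merely accurate up to rounding.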

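Exercise 1.2.3 (Real identity). Prove or disprove the following statement: There exist
$d, H \in \mathbb{N}$, $l_1, l_2, \dots, l_H \in \mathbb{N}$, $\theta \in \mathbb{R}^d$ with $d \geq 2 l_1 + \bigl[\sum_{k=2}^{H} l_k (l_{k-1} + 1)\bigr] + l_H + 1$ such that
for all $x \in \mathbb{R}$ it holds that
\[
  \bigl(\mathcal{N}^{\theta,1}_{\mathfrak{R}_{l_1},\mathfrak{R}_{l_2},\dots,\mathfrak{R}_{l_H},\operatorname{id}_{\mathbb{R}}}\bigr)(x) = x \tag{1.30}
\]
(cf. Definitions 1.1.3 and 1.2.5).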


The statement of the next lemma, Lemma 1.2.7, provides a partial answer to Exer-
cise 1.2.3. Lemma 1.2.7 follows from an application of Lemma 1.2.6 and the detailed proof
of Lemma 1.2.7 is left as an exercise.

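Lemma 1.2.7 (Real identity). Let $\theta = (1, -1, 0, 0, 1, -1, 0) \in \mathbb{R}^7$. Then it holds for all
$x \in \mathbb{R}$ that
\[
  \bigl(\mathcal{N}^{\theta,1}_{\mathfrak{R}_2,\operatorname{id}_{\mathbb{R}}}\bigr)(x) = x \tag{1.31}
\]
(cf. Definitions 1.1.3 and 1.2.5).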

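Exercise 1.2.4 (Absolute value). Prove or disprove the following statement: There exist
$d, H \in \mathbb{N}$, $l_1, l_2, \dots, l_H \in \mathbb{N}$, $\theta \in \mathbb{R}^d$ with $d \geq 2 l_1 + \bigl[\sum_{k=2}^{H} l_k (l_{k-1} + 1)\bigr] + l_H + 1$ such that
for all $x \in \mathbb{R}$ it holds that
\[
  \bigl(\mathcal{N}^{\theta,1}_{\mathfrak{R}_{l_1},\mathfrak{R}_{l_2},\dots,\mathfrak{R}_{l_H},\operatorname{id}_{\mathbb{R}}}\bigr)(x) = |x| \tag{1.32}
\]
(cf. Definitions 1.1.3 and 1.2.5).

Exercise 1.2.5 (Exponential). Prove or disprove the following statement: There exist
$d, H \in \mathbb{N}$, $l_1, l_2, \dots, l_H \in \mathbb{N}$, $\theta \in \mathbb{R}^d$ with $d \geq 2 l_1 + \bigl[\sum_{k=2}^{H} l_k (l_{k-1} + 1)\bigr] + l_H + 1$ such that
for all $x \in \mathbb{R}$ it holds that
\[
  \bigl(\mathcal{N}^{\theta,1}_{\mathfrak{R}_{l_1},\mathfrak{R}_{l_2},\dots,\mathfrak{R}_{l_H},\operatorname{id}_{\mathbb{R}}}\bigr)(x) = e^x \tag{1.33}
\]
(cf. Definitions 1.1.3 and 1.2.5).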


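Exercise 1.2.6 (Two-dimensional maximum). Prove or disprove the following statement:
There exist $d, H \in \mathbb{N}$, $l_1, l_2, \dots, l_H \in \mathbb{N}$, $\theta \in \mathbb{R}^d$ with $d \geq 3 l_1 + \bigl[\sum_{k=2}^{H} l_k (l_{k-1} + 1)\bigr] + l_H + 1$
such that for all $x, y \in \mathbb{R}$ it holds that
\[
  \bigl(\mathcal{N}^{\theta,2}_{\mathfrak{R}_{l_1},\mathfrak{R}_{l_2},\dots,\mathfrak{R}_{l_H},\operatorname{id}_{\mathbb{R}}}\bigr)(x, y) = \max\{x, y\} \tag{1.34}
\]
(cf. Definitions 1.1.3 and 1.2.5).

Exercise 1.2.7 (Real identity with two hidden layers). Prove or disprove the following
statement: There exist $d, l_1, l_2 \in \mathbb{N}$, $\theta \in \mathbb{R}^d$ with $d \geq 2 l_1 + l_1 l_2 + 2 l_2 + 1$ such that for all
$x \in \mathbb{R}$ it holds that
\[
  \bigl(\mathcal{N}^{\theta,1}_{\mathfrak{R}_{l_1},\mathfrak{R}_{l_2},\operatorname{id}_{\mathbb{R}}}\bigr)(x) = x \tag{1.35}
\]
(cf. Definitions 1.1.3 and 1.2.5).

The statement of the next lemma, Lemma 1.2.8, provides a partial answer to
Exercise 1.2.7. The proof of Lemma 1.2.8 is left as an exercise.

Lemma 1.2.8 (Real identity with two hidden layers). Let $\theta = (1, -1, 0, 0, 1, -1, -1, 1, 0, 0, 1, -1, 0) \in \mathbb{R}^{13}$.
Then it holds for all $x \in \mathbb{R}$ that
\[
  \bigl(\mathcal{N}^{\theta,1}_{\mathfrak{R}_2,\mathfrak{R}_2,\operatorname{id}_{\mathbb{R}}}\bigr)(x) = x \tag{1.36}
\]
(cf. Definitions 1.1.3 and 1.2.5).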
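
Before attempting the proofs of Lemma 1.2.7 and Lemma 1.2.8, it can be reassuring to evaluate the two parameter vectors numerically. The following sketch assumes NumPy; relu_network is a hypothetical helper that applies the multidimensional ReLU activation on all hidden layers and the identity on the output layer. A check on finitely many points is, of course, not a proof.

```python
import numpy as np


def relu_network(theta, dims, x):
    """Realization with ReLU on the hidden layers and identity on the output layer."""
    theta = np.asarray(theta, dtype=float)
    out = np.atleast_1d(np.asarray(x, dtype=float))
    offset = 0
    num_layers = len(dims) - 1
    for k in range(1, num_layers + 1):
        n_in, n_out = dims[k - 1], dims[k]
        W = theta[offset : offset + n_out * n_in].reshape(n_out, n_in)
        b = theta[offset + n_out * n_in : offset + n_out * (n_in + 1)]
        out = W @ out + b
        if k < num_layers:  # hidden layer: apply the ReLU activation
            out = np.maximum(out, 0.0)
        offset += n_out * (n_in + 1)
    return out


theta_one = [1, -1, 0, 0, 1, -1, 0]                       # Lemma 1.2.7
theta_two = [1, -1, 0, 0, 1, -1, -1, 1, 0, 0, 1, -1, 0]   # Lemma 1.2.8

xs = np.linspace(-2.0, 2.0, 9)
out_one = np.array([relu_network(theta_one, (1, 2, 1), [x])[0] for x in xs])
out_two = np.array([relu_network(theta_two, (1, 2, 2, 1), [x])[0] for x in xs])
print(np.array_equal(out_one, xs), np.array_equal(out_two, xs))
```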


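Exercise 1.2.8 (Three-dimensional maximum). Prove or disprove the following statement:
There exist $d, H \in \mathbb{N}$, $l_1, l_2, \dots, l_H \in \mathbb{N}$, $\theta \in \mathbb{R}^d$ with $d \geq 4 l_1 + \bigl[\sum_{k=2}^{H} l_k (l_{k-1} + 1)\bigr] + l_H + 1$
such that for all $x, y, z \in \mathbb{R}$ it holds that
\[
  \bigl(\mathcal{N}^{\theta,3}_{\mathfrak{R}_{l_1},\mathfrak{R}_{l_2},\dots,\mathfrak{R}_{l_H},\operatorname{id}_{\mathbb{R}}}\bigr)(x, y, z) = \max\{x, y, z\} \tag{1.37}
\]
(cf. Definitions 1.1.3 and 1.2.5).

Exercise 1.2.9 (Multidimensional maxima). Prove or disprove the following statement:
For every $k \in \mathbb{N}$ there exist $d, H \in \mathbb{N}$, $l_1, l_2, \dots, l_H \in \mathbb{N}$, $\theta \in \mathbb{R}^d$ with
$d \geq (k + 1) l_1 + \bigl[\sum_{j=2}^{H} l_j (l_{j-1} + 1)\bigr] + l_H + 1$ such that for all $x_1, x_2, \dots, x_k \in \mathbb{R}$ it holds that
\[
  \bigl(\mathcal{N}^{\theta,k}_{\mathfrak{R}_{l_1},\mathfrak{R}_{l_2},\dots,\mathfrak{R}_{l_H},\operatorname{id}_{\mathbb{R}}}\bigr)(x_1, x_2, \dots, x_k) = \max\{x_1, x_2, \dots, x_k\} \tag{1.38}
\]
(cf. Definitions 1.1.3 and 1.2.5).

Exercise 1.2.10. Prove or disprove the following statement: There exist $d, H \in \mathbb{N}$, $l_1, l_2, \dots, l_H \in \mathbb{N}$,
$\theta \in \mathbb{R}^d$ with $d \geq 2 l_1 + \bigl[\sum_{k=2}^{H} l_k (l_{k-1} + 1)\bigr] + (l_H + 1)$ such that for all $x \in \mathbb{R}$ it
holds that
\[
  \bigl(\mathcal{N}^{\theta,1}_{\mathfrak{R}_{l_1},\mathfrak{R}_{l_2},\dots,\mathfrak{R}_{l_H},\operatorname{id}_{\mathbb{R}}}\bigr)(x) = \max\{x, \tfrac{x}{2}\} \tag{1.39}
\]
(cf. Definitions 1.1.3 and 1.2.5).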


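Exercise 1.2.11 (Hat function). Prove or disprove the following statement: There exist
$d, l \in \mathbb{N}$, $\theta \in \mathbb{R}^d$ with $d \geq 3l + 1$ such that for all $x \in \mathbb{R}$ it holds that
\[
  \bigl(\mathcal{N}^{\theta,1}_{\mathfrak{R}_l,\operatorname{id}_{\mathbb{R}}}\bigr)(x) =
  \begin{cases}
    1 & \colon x \leq 2 \\
    x - 1 & \colon 2 < x \leq 3 \\
    5 - x & \colon 3 < x \leq 4 \\
    1 & \colon x > 4
  \end{cases}
  \tag{1.40}
\]
(cf. Definitions 1.1.3 and 1.2.5).

Exercise 1.2.12. Prove or disprove the following statement: There exist $d, l \in \mathbb{N}$, $\theta \in \mathbb{R}^d$
with $d \geq 3l + 1$ such that for all $x \in \mathbb{R}$ it holds that
\[
  \bigl(\mathcal{N}^{\theta,1}_{\mathfrak{R}_l,\operatorname{id}_{\mathbb{R}}}\bigr)(x) =
  \begin{cases}
    -2 & \colon x \leq 1 \\
    2x - 4 & \colon 1 < x \leq 3 \\
    2 & \colon x > 3
  \end{cases}
  \tag{1.41}
\]
(cf. Definitions 1.1.3 and 1.2.5).

Exercise 1.2.13. Prove or disprove the following statement: There exist $d, H \in \mathbb{N}$, $l_1, l_2, \dots, l_H \in \mathbb{N}$,
$\theta \in \mathbb{R}^d$ with $d \geq 2 l_1 + \bigl[\sum_{k=2}^{H} l_k (l_{k-1} + 1)\bigr] + (l_H + 1)$ such that for all $x \in \mathbb{R}$ it
holds that
\[
  \bigl(\mathcal{N}^{\theta,1}_{\mathfrak{R}_{l_1},\mathfrak{R}_{l_2},\dots,\mathfrak{R}_{l_H},\operatorname{id}_{\mathbb{R}}}\bigr)(x) =
  \begin{cases}
    0 & \colon x \leq 1 \\
    x - 1 & \colon 1 \leq x \leq 2 \\
    1 & \colon x \geq 2
  \end{cases}
  \tag{1.42}
\]
(cf. Definitions 1.1.3 and 1.2.5).

Exercise 1.2.14. Prove or disprove the following statement: There exist $d, l \in \mathbb{N}$, $\theta \in \mathbb{R}^d$
with $d \geq 3l + 1$ such that for all $x \in [0, 1]$ it holds that
\[
  \bigl(\mathcal{N}^{\theta,1}_{\mathfrak{R}_l,\operatorname{id}_{\mathbb{R}}}\bigr)(x) = x^2 \tag{1.43}
\]
(cf. Definitions 1.1.3 and 1.2.5).

Exercise 1.2.15. Prove or disprove the following statement: There exist $d, H \in \mathbb{N}$, $l_1, l_2, \dots, l_H \in \mathbb{N}$,
$\theta \in \mathbb{R}^d$ with $d \geq 2 l_1 + \bigl[\sum_{k=2}^{H} l_k (l_{k-1} + 1)\bigr] + (l_H + 1)$ such that
\[
  \sup_{x \in [-3,-2]} \bigl| \bigl(\mathcal{N}^{\theta,1}_{\mathfrak{R}_{l_1},\mathfrak{R}_{l_2},\dots,\mathfrak{R}_{l_H},\operatorname{id}_{\mathbb{R}}}\bigr)(x) - (x + 2)^2 \bigr| \leq \tfrac{1}{4} \tag{1.44}
\]
(cf. Definitions 1.1.3 and 1.2.5).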


1.2.4 Clipping activation


Definition 1.2.9 (Clipping activation function). Let $u \in [-\infty, \infty)$, $v \in (u, \infty]$. Then we
denote by $c_{u,v} \colon \mathbb{R} \to \mathbb{R}$ the function which satisfies for all $x \in \mathbb{R}$ that
\[
  c_{u,v}(x) = \max\{u, \min\{x, v\}\} \tag{1.45}
\]
and we call $c_{u,v}$ the (u, v)-clipping activation function.

Figure 1.5 (plots/clipping.pdf): A plot of the (0, 1)-clipping activation function
and the ReLU activation function

    import numpy as np
    import tensorflow as tf
    import matplotlib.pyplot as plt
    import plot_util

    ax = plot_util.setup_axis((-2, 2), (-.5, 2))

    x = np.linspace(-2, 2, 100)

    # ReLU capped at max_value=1 is used here to draw the (0,1)-clipping function
    ax.plot(x, tf.keras.activations.relu(x), linewidth=3, label='ReLU')
    ax.plot(x, tf.keras.activations.relu(x, max_value=1),
            label='(0,1)-clipping')
    ax.legend()

    plt.savefig("../../plots/clipping.pdf", bbox_inches='tight')

Source code 1.3 (code/activation_functions/clipping_plot.py): Python code
used to create Figure 1.5


Definition 1.2.10 (Multidimensional clipping activation functions). Let $d \in \mathbb{N}$,
$u \in [-\infty, \infty)$, $v \in (u, \infty]$. Then we denote by $\mathfrak{C}_{u,v,d} \colon \mathbb{R}^d \to \mathbb{R}^d$ the function given by
\[
  \mathfrak{C}_{u,v,d} = \mathfrak{M}_{c_{u,v},d} \tag{1.46}
\]
and we call $\mathfrak{C}_{u,v,d}$ the d-dimensional (u, v)-clipping activation function (cf. Definitions 1.2.1
and 1.2.9).
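
The clipping function, including the cases in which u or v is infinite, can be written directly from (1.45). The sketch below assumes NumPy; clipping is a hypothetical helper name, and the comparison with np.clip is included only as a cross-check for finite bounds.

```python
import numpy as np


def clipping(x, u, v):
    """(u, v)-clipping of Definition 1.2.9; u may be -inf and v may be +inf."""
    return np.maximum(u, np.minimum(x, v))


x = np.linspace(-2.0, 2.0, 9)

clipped = clipping(x, 0.0, 1.0)            # (0, 1)-clipping
relu_like = clipping(x, 0.0, np.inf)       # (0, inf)-clipping equals ReLU
unchanged = clipping(x, -np.inf, np.inf)   # (-inf, inf)-clipping is the identity

print(clipped.tolist())
print(np.array_equal(clipped, np.clip(x, 0.0, 1.0)))
```

Because NumPy functions act entrywise, the same helper also serves as the d-dimensional version from Definition 1.2.10 when x is an array.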

1.2.5 Softplus activation


Definition 1.2.11 (Softplus activation function). We say that a is the softplus activation
function if and only if it holds that a : R → R is the function from R to R which satisfies
for all x ∈ R that
a(x) = ln(1 + exp(x)). (1.47)

Figure 1.6 (plots/softplus.pdf): A plot of the softplus activation function and
the ReLU activation function

    import numpy as np
    import tensorflow as tf
    import matplotlib.pyplot as plt
    import plot_util

    ax = plot_util.setup_axis((-4, 4), (-.5, 4))

    x = np.linspace(-4, 4, 100)

    # Plot ReLU and softplus on the same axes for comparison
    ax.plot(x, tf.keras.activations.relu(x), label='ReLU')
    ax.plot(x, tf.keras.activations.softplus(x), label='softplus')
    ax.legend()

    plt.savefig("../../plots/softplus.pdf", bbox_inches='tight')

Source code 1.4 (code/activation_functions/softplus_plot.py): Python code
used to create Figure 1.6

The next result, Lemma 1.2.12 below, presents a few elementary properties of the
softplus function.

Lemma 1.2.12 (Properties of the softplus function). Let a be the softplus activation
function (cf. Definition 1.2.11). Then

(i) it holds for all x ∈ [0, ∞) that x ≤ a(x) ≤ x + 1,

(ii) it holds that limx→−∞ a(x) = 0,

(iii) it holds that limx→∞ a(x) = ∞, and

(iv) it holds that a(0) = ln(2).

Proof of Lemma 1.2.12. Observe that the fact that 2 ≤ exp(1) ensures that for all x ∈ [0, ∞)
it holds that
\[
\begin{aligned}
  x = \ln(\exp(x)) &\leq \ln(1 + \exp(x)) = \ln(\exp(0) + \exp(x)) \\
  &\leq \ln(\exp(x) + \exp(x)) = \ln(2 \exp(x)) \leq \ln(\exp(1) \exp(x)) \\
  &= \ln(\exp(x + 1)) = x + 1.
\end{aligned}
\tag{1.48}
\]
This establishes item (i). Items (ii) and (iv) follow directly from (1.47), and item (iii)
follows from item (i). The proof of Lemma 1.2.12 is thus complete.
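
The properties in Lemma 1.2.12 can be observed numerically. The following sketch assumes NumPy and evaluates the softplus function through np.logaddexp, which computes ln(exp(0) + exp(x)) without overflowing for large x; the grid on [0, 50] is an arbitrary choice.

```python
import numpy as np


def softplus(x):
    """ln(1 + exp(x)), evaluated via logaddexp to avoid overflow for large x."""
    return np.logaddexp(0.0, x)


x = np.linspace(0.0, 50.0, 501)
values = softplus(x)

lower_ok = bool(np.all(x <= values))          # x <= a(x) on [0, inf)
upper_ok = bool(np.all(values <= x + 1.0))    # a(x) <= x + 1 on [0, inf)
at_zero = float(softplus(0.0))                # should be ln(2)
far_left = float(softplus(-50.0))             # close to 0 for very negative x
print(lower_ok, upper_ok, at_zero, far_left)
```

A direct implementation as np.log(1 + np.exp(x)) agrees with this for moderate x but overflows once exp(x) exceeds the floating-point range, which is the reason for the detour through logaddexp.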

Note that Lemma 1.2.12 ensures that a(0) = ln(2) = 0.693 . . . (cf. Definition 1.2.11).
In the next step we introduce the multidimensional version of the softplus function (cf.
Definitions 1.2.1 and 1.2.11 above).

Definition 1.2.13 (Multidimensional softplus activation functions). Let d ∈ N and let
a be the softplus activation function (cf. Definition 1.2.11). Then we say that A is the
d-dimensional softplus activation function if and only if A = Ma,d (cf. Definition 1.2.1).

Lemma 1.2.14. Let d ∈ N and let A : Rd → Rd be a function. Then A is the d-dimensional
softplus activation function if and only if it holds for all x = (x1 , . . . , xd ) ∈ Rd that

A(x) = (ln(1 + exp(x1 )), ln(1 + exp(x2 )), . . . , ln(1 + exp(xd ))) (1.49)

(cf. Definition 1.2.13).


Proof of Lemma 1.2.14. Throughout this proof, let a be the softplus activation function
(cf. Definition 1.2.11). Note that (1.18) and (1.47) ensure that for all x = (x1 , . . . , xd ) ∈ Rd
it holds that

Ma,d (x) = (ln(1 + exp(x1 )), ln(1 + exp(x2 )), . . . , ln(1 + exp(xd ))) (1.50)

(cf. Definition 1.2.1). The fact that A is the d-dimensional softplus activation function (cf.
Definition 1.2.13) if and only if A = Ma,d hence implies (1.49). The proof of Lemma 1.2.14
is thus complete.

1.2.6 Gaussian error linear unit (GELU) activation


Another popular activation function is the GELU activation function first introduced in
Hendrycks & Gimpel [193]. This activation function is the subject of the next definition.

Definition 1.2.15 (GELU activation function). We say that a is the Gaussian error linear unit
activation function (we say that a is the GELU activation function) if and only if it holds that
a : R → R is the function from R to R which satisfies for all x ∈ R that
\[
  a(x) = \frac{x}{\sqrt{2\pi}} \biggl[ \int_{-\infty}^{x} \exp\bigl(-\tfrac{z^2}{2}\bigr) \, \mathrm{d}z \biggr]. \tag{1.51}
\]

Figure 1.7 (plots/gelu.pdf): A plot of the GELU activation function, the ReLU
activation function, and the softplus activation function

    import numpy as np
    import tensorflow as tf
    import matplotlib.pyplot as plt
    import plot_util

    ax = plot_util.setup_axis((-4, 3), (-.5, 3))

    x = np.linspace(-4, 3, 100)

    # Plot ReLU, softplus, and GELU on the same axes for comparison
    ax.plot(x, tf.keras.activations.relu(x), label='ReLU')
    ax.plot(x, tf.keras.activations.softplus(x), label='softplus')
    ax.plot(x, tf.keras.activations.gelu(x), label='GELU')
    ax.legend()

    plt.savefig("../../plots/gelu.pdf", bbox_inches='tight')

Source code 1.5 (code/activation_functions/gelu_plot.py): Python code used
to create Figure 1.7

Lemma 1.2.16. Let x ∈ R and let a be the GELU activation function (cf. Definition 1.2.15).
Then the following two statements are equivalent:

(i) It holds that a(x) > 0.

(ii) It holds that r(x) > 0 (cf. Definition 1.2.4).

Proof of Lemma 1.2.16. Note that (1.26) and (1.51) establish that ((i) ↔ (ii)). The proof
of Lemma 1.2.16 is thus complete.
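
The integral in (1.51) is x times the standard normal distribution function, which can be expressed through the error function as (x/2)(1 + erf(x/√2)). The following sketch relies on this standard identity and on math.erf from the Python standard library to evaluate the GELU activation and to check the equivalence in Lemma 1.2.16 on a grid; the grid on [-4, 4] is an arbitrary choice.

```python
import math


def gelu(x):
    """GELU via the normal distribution function: (x / 2) * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu(x):
    return max(x, 0.0)


grid = [k / 10.0 for k in range(-40, 41)]
# Lemma 1.2.16: gelu(x) > 0 holds exactly when relu(x) > 0, i.e. when x > 0.
signs_agree = all((gelu(x) > 0) == (relu(x) > 0) for x in grid)
print(signs_agree, gelu(1.0))
```

For negative inputs the GELU activation takes small negative values rather than zero, which is the visible difference from the ReLU activation in Figure 1.7.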

Definition 1.2.17 (Multidimensional GELU activation function). Let d ∈ N and let a
be the GELU activation function (cf. Definition 1.2.15). Then we say that A is the d-dimensional
GELU activation function if and only if A = Ma,d (cf. Definition 1.2.1).

1.2.7 Standard logistic activation


Definition 1.2.18 (Standard logistic activation function). We say that a is the standard
logistic activation function if and only if it holds that a : R → R is the function from R to
R which satisfies for all x ∈ R that
\[
  a(x) = \frac{1}{1 + \exp(-x)} = \frac{\exp(x)}{\exp(x) + 1}. \tag{1.52}
\]

Figure 1.8 (plots/logistic.pdf): A plot of the standard logistic activation function
and the (0, 1)-clipping activation function

    import numpy as np
    import tensorflow as tf
    import matplotlib.pyplot as plt
    import plot_util

    ax = plot_util.setup_axis((-3, 3), (-.5, 1.5))

    x = np.linspace(-3, 3, 100)

    # Plot the (0,1)-clipping function and the standard logistic function
    ax.plot(x, tf.keras.activations.relu(x, max_value=1),
            label='(0,1)-clipping')
    ax.plot(x, tf.keras.activations.sigmoid(x),
            label='standard logistic')
    ax.legend()

    plt.savefig("../../plots/logistic.pdf", bbox_inches='tight')

Source code 1.6 (code/activation_functions/logistic_plot.py): Python code
used to create Figure 1.8

Definition 1.2.19 (Multidimensional standard logistic activation functions). Let d ∈ N


and let a be the standard logistic activation function (cf. Definition 1.2.18). Then we say
that A is the d-dimensional standard logistic activation function if and only if A = Ma,d
(cf. Definition 1.2.1).

1.2.7.1 Derivative of the standard logistic activation function


Proposition 1.2.20 (Logistic ODE). Let a be the standard logistic activation function (cf.
Definition 1.2.18). Then
(i) it holds that a : R → R is infinitely often differentiable and
(ii) it holds for all x ∈ R that
\[ a(0) = \tfrac{1}{2}, \qquad a'(x) = a(x)(1 - a(x)) = a(x) - [a(x)]^2, \qquad\text{and} \tag{1.53} \]
\[ a''(x) = a(x)(1 - a(x))(1 - 2a(x)) = 2[a(x)]^3 - 3[a(x)]^2 + a(x). \tag{1.54} \]
Proof of Proposition 1.2.20. Note that (1.52) implies item (i). Next observe that (1.52)
ensures that for all x ∈ R it holds that
 
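\[
\begin{split}
a'(x) &= \frac{\exp(-x)}{(1 + \exp(-x))^2} = a(x) \left( \frac{\exp(-x)}{1 + \exp(-x)} \right) \\
&= a(x) \left( \frac{1 + \exp(-x) - 1}{1 + \exp(-x)} \right) = a(x) \left( 1 - \frac{1}{1 + \exp(-x)} \right) \\
&= a(x)(1 - a(x)).
\end{split}
\tag{1.55}
\]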


Hence, we obtain that for all x ∈ R it holds that


\[
\begin{split}
a''(x) &= \bigl[ a(x)(1 - a(x)) \bigr]' = a'(x)(1 - a(x)) + a(x)(1 - a(x))' \\
&= a'(x)(1 - a(x)) - a(x) \, a'(x) = a'(x)(1 - 2a(x)) \\
&= a(x)(1 - a(x))(1 - 2a(x)) \\
&= \bigl( a(x) - [a(x)]^2 \bigr)(1 - 2a(x)) = a(x) - [a(x)]^2 - 2[a(x)]^2 + 2[a(x)]^3 \\
&= 2[a(x)]^3 - 3[a(x)]^2 + a(x).
\end{split}
\tag{1.56}
\]

This establishes item (ii). The proof of Proposition 1.2.20 is thus complete.
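The identities (1.53) and (1.54) can be checked numerically with central finite differences. The following sketch is only an illustration under the assumption of a step size h = 1e-5 and a handful of sample points; the function names are hypothetical and chosen for this example.

```python
import math


def logistic(x):
    """Standard logistic activation function, cf. (1.52)."""
    return 1.0 / (1.0 + math.exp(-x))


def first_derivative(x):
    """Right-hand side of (1.53): a(x)(1 - a(x))."""
    a = logistic(x)
    return a * (1.0 - a)


def second_derivative(x):
    """Right-hand side of (1.54): 2 a(x)^3 - 3 a(x)^2 + a(x)."""
    a = logistic(x)
    return 2.0 * a**3 - 3.0 * a**2 + a


def central_difference(f, x, h=1e-5):
    """Second-order accurate finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


points = [-3.0, -1.0, 0.0, 0.5, 2.0]
max_err_first = max(
    abs(central_difference(logistic, x) - first_derivative(x)) for x in points
)
max_err_second = max(
    abs(central_difference(first_derivative, x) - second_derivative(x))
    for x in points
)
print(max_err_first, max_err_second)
```

Both errors are expected to be of the order of the discretization and rounding errors of the finite-difference quotient.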

1.2.7.2 Integral of the standard logistic activation function


Lemma 1.2.21 (Primitive of the standard logistic activation function). Let s be the softplus
activation function and let l be the standard logistic activation function (cf. Definitions 1.2.11
and 1.2.18). Then it holds for all x ∈ R that
\[ \int_{-\infty}^{x} l(y) \, dy = \int_{-\infty}^{x} \left( \frac{1}{1 + e^{-y}} \right) dy = \ln(1 + \exp(x)) = s(x). \tag{1.57} \]

Proof of Lemma 1.2.21. Observe that (1.47) implies that for all x ∈ R it holds that
 
\[ s'(x) = \left( \frac{1}{1 + \exp(x)} \right) \exp(x) = l(x). \tag{1.58} \]

The fundamental theorem of calculus hence shows that for all w, x ∈ R with w ≤ x it holds
that
\[ \int_{w}^{x} \underbrace{l(y)}_{\geq 0} \, dy = s(x) - s(w). \tag{1.59} \]

Combining this with the fact that limw→−∞ s(w) = 0 establishes (1.57). The proof of
Lemma 1.2.21 is thus complete.
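Lemma 1.2.21 can be illustrated numerically by integrating the standard logistic activation function with the composite trapezoidal rule and comparing the result with the softplus activation function. The lower integration limit w = -40 (standing in for -∞), the number of steps, and all function names in the following sketch are assumptions made for this illustration.

```python
import math


def logistic(y):
    """Standard logistic activation function l, cf. (1.52)."""
    return 1.0 / (1.0 + math.exp(-y))


def softplus(x):
    """Softplus activation function s(x) = ln(1 + exp(x))."""
    return math.log1p(math.exp(x))


def trapezoid(f, lower, upper, steps=20_000):
    """Composite trapezoidal rule for the integral of f over [lower, upper]."""
    h = (upper - lower) / steps
    total = 0.5 * (f(lower) + f(upper))
    for i in range(1, steps):
        total += f(lower + i * h)
    return h * total


x = 1.5
w = -40.0  # stands in for -infinity; s(w) is about exp(-40)
integral = trapezoid(logistic, w, x)
difference = softplus(x) - softplus(w)  # right-hand side of (1.59)
print(integral, difference, softplus(x))
```

Since s(w) is negligible for w = -40, the three printed numbers should agree up to the quadrature error.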

1.2.8 Swish activation


Definition 1.2.22 (Swish activation function). Let β ∈ R. Then we say that a is the swish
activation function with parameter β if and only if it holds that a : R → R is the function
from R to R which satisfies for all x ∈ R that
\[ a(x) = \frac{x}{1 + \exp(-\beta x)}. \tag{1.60} \]


Figure 1.9 (plots/swish.pdf): A plot of the swish activation function, the GELU
activation function, and the ReLU activation function

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import plot_util

ax = plot_util.setup_axis((-4, 3), (-.5, 3))

x = np.linspace(-4, 3, 100)

ax.plot(x, tf.keras.activations.relu(x), label='ReLU')
ax.plot(x, tf.keras.activations.gelu(x), label='GELU')
ax.plot(x, tf.keras.activations.swish(x), label='swish')
ax.legend()

plt.savefig("../../plots/swish.pdf", bbox_inches='tight')

Source code 1.7 (code/activation_functions/swish_plot.py): Python code used to create Figure 1.9

Lemma 1.2.23 (Relation between the swish activation function and the logistic activation
function). Let β ∈ R, let s be the swish activation function with parameter β, and let l be
the standard logistic activation function (cf. Definitions 1.2.18 and 1.2.22). Then it holds
for all x ∈ R that
s(x) = xl(βx). (1.61)
Proof of Lemma 1.2.23. Observe that (1.60) and (1.52) establish (1.61). The proof of
Lemma 1.2.23 is thus complete.
Definition 1.2.24 (Multidimensional swish activation functions). Let d ∈ N and let a be
the swish activation function with parameter 1 (cf. Definition 1.2.22). Then we say that A
is the d-dimensional swish activation function if and only if A = Ma,d (cf. Definition 1.2.1).


1.2.9 Hyperbolic tangent activation


Definition 1.2.25 (Hyperbolic tangent activation function). We denote by tanh : R → R
the function which satisfies for all x ∈ R that
\[ \tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)} \tag{1.62} \]
and we call tanh the hyperbolic tangent activation function (we call tanh the hyperbolic
tangent).


Figure 1.10 (plots/tanh.pdf): A plot of the hyperbolic tangent, the (−1, 1)-clipping
activation function, and the standard logistic activation function

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import plot_util

ax = plot_util.setup_axis((-3, 3), (-1.5, 1.5))

x = np.linspace(-3, 3, 100)

# Shifted and capped ReLU coincides with the (-1,1)-clipping activation
ax.plot(x, tf.keras.activations.relu(x + 1, max_value=2) - 1,
        label='(-1,1)-clipping')
ax.plot(x, tf.keras.activations.sigmoid(x),
        label='standard logistic')
ax.plot(x, tf.keras.activations.tanh(x), label='tanh')
ax.legend()

plt.savefig("../../plots/tanh.pdf", bbox_inches='tight')

Source code 1.8 (code/activation_functions/tanh_plot.py): Python code used to create Figure 1.10

Definition 1.2.26 (Multidimensional hyperbolic tangent activation functions). Let d ∈ N.


Then we say that A is the d-dimensional hyperbolic tangent activation function if and only
if A = Mtanh,d (cf. Definitions 1.2.1 and 1.2.25).

Lemma 1.2.27. Let a be the standard logistic activation function (cf. Definition 1.2.18).
Then it holds for all x ∈ R that

tanh(x) = 2 a(2x) − 1 (1.63)

(cf. Definitions 1.2.18 and 1.2.25).

Proof of Lemma 1.2.27. Observe that (1.52) and (1.62) ensure that for all x ∈ R it holds
that
 
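\[
\begin{split}
2a(2x) - 1 &= 2 \left( \frac{\exp(2x)}{\exp(2x) + 1} \right) - 1 = \frac{2\exp(2x) - (\exp(2x) + 1)}{\exp(2x) + 1} \\
&= \frac{\exp(2x) - 1}{\exp(2x) + 1} = \frac{\exp(x)(\exp(x) - \exp(-x))}{\exp(x)(\exp(x) + \exp(-x))} \\
&= \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)} = \tanh(x).
\end{split}
\tag{1.64}
\]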

The proof of Lemma 1.2.27 is thus complete.
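The identity (1.63) is easy to confirm numerically. The following short sketch compares math.tanh with the rescaled standard logistic activation function on a few sample points; the function name logistic and the choice of sample points are assumptions made for this illustration.

```python
import math


def logistic(x):
    """Standard logistic activation function, cf. (1.52)."""
    return 1.0 / (1.0 + math.exp(-x))


# Compare tanh(x) with 2 a(2x) - 1 as in (1.63)
sample_points = [-5.0, -1.0, -0.1, 0.0, 0.3, 2.0, 4.0]
max_gap = max(
    abs(math.tanh(x) - (2.0 * logistic(2.0 * x) - 1.0)) for x in sample_points
)
print(max_gap)
```

The printed gap should be of the order of floating-point rounding errors.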


Exercise 1.2.16. Let a be the standard logistic activation function (cf. Definition 1.2.18).
Prove or disprove the following statement: There exist L ∈ {2, 3, . . .}, d, l1 , l2 , . . . , lL−1 ∈ N,
θ ∈ Rd with $d \geq 2 l_1 + \bigl[ \sum_{k=2}^{L-1} l_k(l_{k-1} + 1) \bigr] + (l_{L-1} + 1)$ such that for all x ∈ R it holds that

\[ \bigl( \mathcal{N}^{\theta,1}_{\mathfrak{M}_{a,l_1}, \mathfrak{M}_{a,l_2}, \dots, \mathfrak{M}_{a,l_{L-1}}, \mathrm{id}_{\mathbb{R}}} \bigr)(x) = \tanh(x) \tag{1.65} \]

(cf. Definitions 1.1.3, 1.2.1, and 1.2.25).

1.2.10 Softsign activation


Definition 1.2.28 (Softsign activation function). We say that a is the softsign activation
function if and only if it holds that a : R → R is the function from R to R which satisfies
for all x ∈ R that
\[ a(x) = \frac{x}{|x| + 1}. \tag{1.66} \]
Figure 1.11 (plots/softsign.pdf): A plot of the softsign activation function and the hyperbolic tangent

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import plot_util

ax = plot_util.setup_axis((-5, 5), (-1.5, 1.5))

x = np.linspace(-5, 5, 100)

ax.plot(x, tf.keras.activations.tanh(x), label='tanh')
ax.plot(x, tf.keras.activations.softsign(x), label='softsign')
ax.legend()

plt.savefig("../../plots/softsign.pdf", bbox_inches='tight')

Source code 1.9 (code/activation_functions/softsign_plot.py): Python code used to create Figure 1.11

Definition 1.2.29 (Multidimensional softsign activation functions). Let d ∈ N and let


a be the softsign activation function (cf. Definition 1.2.28). Then we say that A is the
d-dimensional softsign activation function if and only if A = Ma,d (cf. Definition 1.2.1).

1.2.11 Leaky rectified linear unit (leaky ReLU) activation


Definition 1.2.30 (Leaky ReLU activation function). Let γ ∈ [0, ∞). Then we say that a
is the leaky ReLU activation function with leak factor γ if and only if it holds that a : R → R
is the function from R to R which satisfies for all x ∈ R that
\[ a(x) = \begin{cases} x & : x > 0 \\ \gamma x & : x \leq 0. \end{cases} \tag{1.67} \]

Figure 1.12 (plots/leaky_relu.pdf): A plot of the leaky ReLU activation function with leak factor 1/10 and the ReLU activation function

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import plot_util

ax = plot_util.setup_axis((-2, 2), (-.5, 2))

x = np.linspace(-2, 2, 100)

ax.plot(x, tf.keras.activations.relu(x), linewidth=3, label='ReLU')
# The alpha argument is the slope for negative inputs (leak factor 1/10)
ax.plot(x, tf.keras.activations.relu(x, alpha=0.1),
        label='leaky ReLU')
ax.legend()

plt.savefig("../../plots/leaky_relu.pdf", bbox_inches='tight')

Source code 1.10 (code/activation_functions/leaky_relu_plot.py): Python code used to create Figure 1.12

Lemma 1.2.31. Let γ ∈ [0, 1] and let a : R → R be a function. Then a is the leaky ReLU
activation function with leak factor γ if and only if it holds for all x ∈ R that

a(x) = max{x, γx} (1.68)

(cf. Definition 1.2.30).

Proof of Lemma 1.2.31. Note that the fact that γ ≤ 1 and (1.67) establish (1.68). The
proof of Lemma 1.2.31 is thus complete.
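The role of the assumption γ ≤ 1 in Lemma 1.2.31 can be seen in a small experiment. The following sketch compares the piecewise definition (1.67) with the maximum representation (1.68) on a grid, once with leak factor 0.1 and once with the leak factor 2, which lies outside of [0, 1]. The function names and the grid are illustrative assumptions.

```python
def leaky_relu(x, gamma):
    """Piecewise definition (1.67) of the leaky ReLU with leak factor gamma."""
    return x if x > 0 else gamma * x


def leaky_relu_max(x, gamma):
    """Maximum representation (1.68): max{x, gamma * x}."""
    return max(x, gamma * x)


grid = [k / 10 for k in range(-30, 31)]

# For gamma in [0, 1] both representations coincide on the grid
agree_small_gamma = all(
    leaky_relu(x, 0.1) == leaky_relu_max(x, 0.1) for x in grid
)

# For gamma > 1 the maximum picks the wrong branch whenever x != 0
differ_large_gamma = any(
    leaky_relu(x, 2.0) != leaky_relu_max(x, 2.0) for x in grid
)
print(agree_small_gamma, differ_large_gamma)
```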

Lemma 1.2.32. Let u ∈ R, β ∈ (0, ∞), v ∈ (u, ∞), γ ∈ [0, ∞), let a1 be the softplus activation
function, let a2 be the GELU activation function, let a3 be the standard logistic activation
function, let a4 be the swish activation function with parameter β, let a5 be the softsign
activation function, and let l be the leaky ReLU activation function with leak factor γ
(cf. Definitions 1.2.11, 1.2.15, 1.2.18, 1.2.22, 1.2.28, and 1.2.30). Then

(i) it holds for all f ∈ {r, cu,v , tanh, a1 , a2 , . . . , a5 } that lim supx→−∞ |f ′ (x)| = 0 and


(ii) it holds that limx→−∞ l′ (x) = γ


(cf. Definitions 1.2.4, 1.2.9, and 1.2.25).
Proof of Lemma 1.2.32. Note that (1.26), (1.45), (1.47), (1.51), (1.52), (1.60), (1.62), and
(1.66) prove item (i). Observe that (1.67) establishes item (ii). The proof of Lemma 1.2.32
is thus complete.
Definition 1.2.33 (Multidimensional leaky ReLU activation function). Let d ∈ N, γ ∈ [0, ∞)
and let a be the leaky ReLU activation function with leak factor γ (cf. Definition 1.2.30).
Then we say that A is the d-dimensional leaky ReLU activation function with
leak factor γ if and only if A = Ma,d (cf. Definition 1.2.1).

1.2.12 Exponential linear unit (ELU) activation


Another popular activation function is the so-called exponential linear unit (ELU) activation
function which has been introduced in Clevert et al. [83]. This activation function is the
subject of the next notion.
Definition 1.2.34 (ELU activation function). Let γ ∈ (−∞, 0]. Then we say that a is
the ELU activation function with asymptotic γ if and only if it holds that a : R → R is the
function from R to R which satisfies for all x ∈ R that
\[ a(x) = \begin{cases} x & : x > 0 \\ \gamma(1 - \exp(x)) & : x \leq 0. \end{cases} \tag{1.69} \]

Figure 1.13 (plots/elu.pdf): A plot of the ELU activation function with asymptotic
−1, the leaky ReLU activation function with leak factor 1/10, and the ReLU activation
function

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import plot_util

ax = plot_util.setup_axis((-2, 2), (-1, 2))

x = np.linspace(-2, 2, 100)

ax.plot(x, tf.keras.activations.relu(x), linewidth=3, label='ReLU')
ax.plot(x, tf.keras.activations.relu(x, alpha=0.1), linewidth=2,
        label='leaky ReLU')
ax.plot(x, tf.keras.activations.elu(x), linewidth=0.9, label='ELU')
ax.legend()

plt.savefig("../../plots/elu.pdf", bbox_inches='tight')

Source code 1.11 (code/activation_functions/elu_plot.py): Python code used to create Figure 1.13

Lemma 1.2.35. Let γ ∈ (−∞, 0] and let a be the ELU activation function with asymptotic
γ (cf. Definition 1.2.34). Then

\[ \limsup_{x \to -\infty} a(x) = \liminf_{x \to -\infty} a(x) = \gamma. \tag{1.70} \]

Proof of Lemma 1.2.35. Observe that (1.69) establishes (1.70). The proof of Lemma 1.2.35
is thus complete.
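The limit in (1.70) can be observed numerically. The following sketch implements (1.69) directly and evaluates the ELU activation function with asymptotic γ = -1 at increasingly negative arguments; the function name elu and the sample arguments are assumptions made for this illustration.

```python
import math


def elu(x, gamma):
    """ELU activation function with asymptotic gamma <= 0, cf. (1.69)."""
    return x if x > 0 else gamma * (1.0 - math.exp(x))


gamma = -1.0
values = [elu(-t, gamma) for t in (1.0, 5.0, 10.0, 50.0)]

# Distance to the asymptotic gamma; by (1.69) it equals |gamma| * exp(x)
gaps = [abs(v - gamma) for v in values]
print(values)
print(gaps)
```

The gaps decay like exp(x) as x → -∞, which is consistent with Lemma 1.2.35.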

Definition 1.2.36 (Multidimensional ELU activation function). Let d ∈ N, γ ∈ (−∞, 0]


and let a be the ELU activation function with asymptotic γ (cf. Definition 1.2.34). Then
we say that A is the d-dimensional ELU activation function with asymptotic γ if and only
if A = Ma,d (cf. Definition 1.2.1).

1.2.13 Rectified power unit (RePU) activation


Another popular activation function is the so-called rectified power unit (RePU) activation
function. This concept is the subject of the next notion.

Definition 1.2.37 (RePU activation function). Let p ∈ N. Then we say that a is the RePU
activation function with power p if and only if it holds that a : R → R is the function from
R to R which satisfies for all x ∈ R that

\[ a(x) = (\max\{x, 0\})^p. \tag{1.71} \]


Figure 1.14 (plots/repu.pdf): A plot of the RePU activation function with power
2 and the ReLU activation function

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import plot_util

ax = plot_util.setup_axis((-2, 2), (-.5, 3))
ax.set_ylim(-.5, 3)

x = np.linspace(-2, 2, 100)

ax.plot(x, tf.keras.activations.relu(x), linewidth=3, label='ReLU')
# RePU with power 2 is the square of the ReLU
ax.plot(x, tf.keras.activations.relu(x)**2, label='RePU')
ax.legend()

plt.savefig("../../plots/repu.pdf", bbox_inches='tight')

Source code 1.12 (code/activation_functions/repu_plot.py): Python code used to create Figure 1.14

Definition 1.2.38 (Multidimensional RePU activation function). Let d, p ∈ N and let a


be the RePU activation function with power p (cf. Definition 1.2.37). Then we say that A
is the d-dimensional RePU activation function with power p if and only if A = Ma,d (cf.
Definition 1.2.1).


1.2.14 Sine activation


The sine function has been proposed as activation function in Sitzmann et al. [380]. This is
formulated in the next notion.

Definition 1.2.39 (Sine activation function). We say that a is the sine activation function
if and only if it holds that a : R → R is the function from R to R which satisfies for all
x ∈ R that
a(x) = sin(x). (1.72)


Figure 1.15 (plots/sine.pdf): A plot of the sine activation function

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import plot_util

ax = plot_util.setup_axis((-2*np.pi, 2*np.pi), (-1.5, 1.5))

x = np.linspace(-2*np.pi, 2*np.pi, 100)

ax.plot(x, np.sin(x))

plt.savefig("../../plots/sine.pdf", bbox_inches='tight')

Source code 1.13 (code/activation_functions/sine_plot.py): Python code used to create Figure 1.15

Definition 1.2.40 (Multidimensional sine activation functions). Let d ∈ N and let a be the
sine activation function (cf. Definition 1.2.39). Then we say that A is the d-dimensional
sine activation function if and only if A = Ma,d (cf. Definition 1.2.1).

1.2.15 Heaviside activation


Definition 1.2.41 (Heaviside activation function). We say that a is the Heaviside activation
function (we say that a is the Heaviside step function, we say that a is the unit step function)
if and only if it holds that a : R → R is the function from R to R which satisfies for all
x ∈ R that
\[ a(x) = \mathbf{1}_{[0,\infty)}(x) = \begin{cases} 1 & : x \geq 0 \\ 0 & : x < 0. \end{cases} \tag{1.73} \]


Figure 1.16 (plots/heaviside.pdf): A plot of the Heaviside activation function and the standard logistic activation function

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import plot_util

ax = plot_util.setup_axis((-3, 3), (-.5, 1.5))

x = np.linspace(-3, 3, 100)

# Draw the two constant pieces of the Heaviside function separately
ax.plot(x[0:50], [0]*50, 'C0')
ax.plot(x[50:100], [1]*50, 'C0', label='Heaviside')
ax.plot(x, tf.keras.activations.sigmoid(x), 'C1',
        label='standard logistic')
ax.legend()

plt.savefig("../../plots/heaviside.pdf", bbox_inches='tight')

Source code 1.14 (code/activation_functions/heaviside_plot.py): Python code used to create Figure 1.16

Definition 1.2.42 (Multidimensional Heaviside activation functions). Let d ∈ N and let


a be the Heaviside activation function (cf. Definition 1.2.41). Then we say that A is the
d-dimensional Heaviside activation function (we say that A is the d-dimensional Heaviside
step function, we say that A is the d-dimensional unit step function) if and only if A = Ma,d
(cf. Definition 1.2.1).


1.2.16 Softmax activation


Definition 1.2.43 (Softmax activation function). Let d ∈ N. Then we say that A is the
d-dimensional softmax activation function if and only if it holds that A : Rd → Rd is the
function from Rd to Rd which satisfies for all x = (x1 , x2 , . . . , xd ) ∈ Rd that
 
\[ A(x) = \left( \frac{\exp(x_1)}{\sum_{i=1}^{d} \exp(x_i)}, \frac{\exp(x_2)}{\sum_{i=1}^{d} \exp(x_i)}, \dots, \frac{\exp(x_d)}{\sum_{i=1}^{d} \exp(x_i)} \right). \tag{1.74} \]

Lemma 1.2.44. Let d ∈ N and let A = (A1 , A2 , . . . , Ad ) be the d-dimensional softmax


activation function (cf. Definition 1.2.43). Then

(i) it holds for all x ∈ Rd , k ∈ {1, 2, . . . , d} that Ak (x) ∈ (0, 1] and

(ii) it holds for all x ∈ Rd that


\[ \sum_{k=1}^{d} A_k(x) = 1 \tag{1.75} \]

(cf. Definition 1.2.43).

Proof of Lemma 1.2.44. Observe that (1.74) demonstrates that for all x = (x1 , x2 , . . . , xd ) ∈
Rd it holds that
\[ \sum_{k=1}^{d} A_k(x) = \sum_{k=1}^{d} \frac{\exp(x_k)}{\sum_{i=1}^{d} \exp(x_i)} = \frac{\sum_{k=1}^{d} \exp(x_k)}{\sum_{i=1}^{d} \exp(x_i)} = 1. \tag{1.76} \]

The proof of Lemma 1.2.44 is thus complete.
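The two properties in Lemma 1.2.44 can be observed in a direct implementation of (1.74). The following sketch subtracts the maximal component before exponentiating; this shift is an implementation choice made here to avoid floating-point overflow (it is not part of Definition 1.2.43) and it does not change the result, since the common factor cancels in the quotient. The function name softmax and the inputs are illustrative assumptions.

```python
import math


def softmax(x):
    """d-dimensional softmax activation function, cf. (1.74)."""
    m = max(x)  # shift for numerical stability (assumption of this sketch)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))

# Without the shift, exp(1000.0) would overflow in floating-point arithmetic
large = softmax([1000.0, 1000.0])
print(large)
```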

1.3 Fully-connected feedforward ANNs (structured description)

In this section we present an alternative way to describe the fully-connected feedforward
ANNs introduced in Section 1.1 above. Roughly speaking, in Section 1.1 above we defined a
vectorized description of fully-connected feedforward ANNs in the sense that the trainable
parameters of a fully-connected feedforward ANN are represented by the components of a
single Euclidean vector (cf. Definition 1.1.3 above). In this section we introduce a structured
description of fully-connected feedforward ANNs in which the trainable parameters of
a fully-connected feedforward ANN are represented by a tuple of matrix-vector pairs
corresponding to the weight matrices and bias vectors of the fully-connected feedforward
ANNs (cf. Definitions 1.3.1 and 1.3.4 below).


1.3.1 Structured description of fully-connected feedforward ANNs


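Definition 1.3.1 (Structured description of fully-connected feedforward ANNs). We denote
by $\mathbf{N}$ the set given by
\[ \mathbf{N} = \bigcup_{L \in \mathbb{N}} \bigcup_{l_0, l_1, \dots, l_L \in \mathbb{N}} \Bigl( \mathop{\times}_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \Bigr), \tag{1.77} \]
for every $L \in \mathbb{N}$, $l_0, l_1, \dots, l_L \in \mathbb{N}$, $\Phi \in \bigl( \mathop{\times}_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \bigr) \subseteq \mathbf{N}$ we denote by
$\mathcal{P}(\Phi), \mathcal{L}(\Phi), \mathcal{I}(\Phi), \mathcal{O}(\Phi) \in \mathbb{N}$, $\mathcal{H}(\Phi) \in \mathbb{N}_0$ the numbers given by
\[ \mathcal{P}(\Phi) = \sum_{k=1}^{L} l_k(l_{k-1} + 1), \quad \mathcal{L}(\Phi) = L, \quad \mathcal{I}(\Phi) = l_0, \quad \mathcal{O}(\Phi) = l_L, \quad\text{and}\quad \mathcal{H}(\Phi) = L - 1, \tag{1.78} \]
for every $n \in \mathbb{N}_0$, $L \in \mathbb{N}$, $l_0, l_1, \dots, l_L \in \mathbb{N}$, $\Phi \in \bigl( \mathop{\times}_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \bigr) \subseteq \mathbf{N}$ we denote by
$\mathbb{D}_n(\Phi) \in \mathbb{N}_0$ the number given by
\[ \mathbb{D}_n(\Phi) = \begin{cases} l_n & : n \leq L \\ 0 & : n > L, \end{cases} \tag{1.79} \]
for every $\Phi \in \mathbf{N}$ we denote by $\mathcal{D}(\Phi) \in \mathbb{N}^{\mathcal{L}(\Phi)+1}$ the tuple given by
\[ \mathcal{D}(\Phi) = (\mathbb{D}_0(\Phi), \mathbb{D}_1(\Phi), \dots, \mathbb{D}_{\mathcal{L}(\Phi)}(\Phi)), \tag{1.80} \]
and for every $L \in \mathbb{N}$, $l_0, l_1, \dots, l_L \in \mathbb{N}$, $\Phi = ((W_1, B_1), \dots, (W_L, B_L)) \in \bigl( \mathop{\times}_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \bigr) \subseteq \mathbf{N}$,
$n \in \{1, 2, \dots, L\}$ we denote by $\mathcal{W}_{n,\Phi} \in \mathbb{R}^{l_n \times l_{n-1}}$, $\mathcal{B}_{n,\Phi} \in \mathbb{R}^{l_n}$ the matrix and the
vector given by
\[ \mathcal{W}_{n,\Phi} = W_n \qquad\text{and}\qquad \mathcal{B}_{n,\Phi} = B_n. \tag{1.81} \]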

Definition 1.3.2 (Fully-connected feedforward ANNs). We say that Φ is a fully-connected


feedforward ANN if and only if it holds that

Φ∈N (1.82)

(cf. Definition 1.3.1).

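Lemma 1.3.3. Let Φ ∈ N (cf. Definition 1.3.1). Then

(i) it holds that $\mathcal{D}(\Phi) \in \mathbb{N}^{\mathcal{L}(\Phi)+1}$,

(ii) it holds that
\[ \mathcal{I}(\Phi) = \mathbb{D}_0(\Phi) \qquad\text{and}\qquad \mathcal{O}(\Phi) = \mathbb{D}_{\mathcal{L}(\Phi)}(\Phi), \tag{1.83} \]
and

(iii) it holds for all $n \in \{1, 2, \dots, \mathcal{L}(\Phi)\}$ that
\[ \mathcal{W}_{n,\Phi} \in \mathbb{R}^{\mathbb{D}_n(\Phi) \times \mathbb{D}_{n-1}(\Phi)} \qquad\text{and}\qquad \mathcal{B}_{n,\Phi} \in \mathbb{R}^{\mathbb{D}_n(\Phi)}. \tag{1.84} \]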

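Proof of Lemma 1.3.3. Note that the assumption that
\[ \Phi \in \mathbf{N} = \bigcup_{L \in \mathbb{N}} \bigcup_{(l_0, l_1, \dots, l_L) \in \mathbb{N}^{L+1}} \Bigl( \mathop{\times}_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \Bigr) \]
ensures that there exist $L \in \mathbb{N}$, $l_0, l_1, \dots, l_L \in \mathbb{N}$ which satisfy that
\[ \Phi \in \Bigl( \mathop{\times}_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \Bigr). \tag{1.85} \]
Observe that (1.85), (1.78), and (1.79) imply that
\[ \mathcal{L}(\Phi) = L, \qquad \mathcal{I}(\Phi) = l_0 = \mathbb{D}_0(\Phi), \qquad\text{and}\qquad \mathcal{O}(\Phi) = l_L = \mathbb{D}_L(\Phi). \tag{1.86} \]
This shows that
\[ \mathcal{D}(\Phi) = (l_0, l_1, \dots, l_L) \in \mathbb{N}^{L+1} = \mathbb{N}^{\mathcal{L}(\Phi)+1}. \tag{1.87} \]
Next note that (1.85), (1.79), and (1.81) ensure that for all $n \in \{1, 2, \dots, \mathcal{L}(\Phi)\}$ it holds
that
\[ \mathcal{W}_{n,\Phi} \in \mathbb{R}^{l_n \times l_{n-1}} = \mathbb{R}^{\mathbb{D}_n(\Phi) \times \mathbb{D}_{n-1}(\Phi)} \qquad\text{and}\qquad \mathcal{B}_{n,\Phi} \in \mathbb{R}^{l_n} = \mathbb{R}^{\mathbb{D}_n(\Phi)}. \tag{1.88} \]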
The proof of Lemma 1.3.3 is thus complete.
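To make the quantities in Definition 1.3.1 concrete, the following sketch represents an ANN Φ = ((W1, B1), ..., (WL, BL)) as a Python list of (matrix, vector) pairs built from nested lists and computes the numbers in (1.78) and the tuple in (1.80). The representation, the function names layer_dims and describe, and the example weights are assumptions made for this illustration.

```python
def layer_dims(phi):
    """Return the tuple (l_0, l_1, ..., l_L) from (1.80).

    phi is a list [(W_1, B_1), ..., (W_L, B_L)], where each W_k is a list of
    rows and each B_k is a list of numbers.
    """
    dims = [len(phi[0][0][0])]  # l_0 is the number of columns of W_1
    for W, B in phi:
        # W_k must be an l_k x l_{k-1} matrix and B_k must have l_k entries
        if len(W) != len(B) or any(len(row) != dims[-1] for row in W):
            raise ValueError("incompatible layer shapes")
        dims.append(len(W))
    return tuple(dims)


def describe(phi):
    """Collect the quantities from (1.78) and (1.80) in a dictionary."""
    d = layer_dims(phi)
    L = len(phi)
    return {
        "P": sum(d[k] * (d[k - 1] + 1) for k in range(1, L + 1)),
        "L": L,
        "I": d[0],
        "O": d[-1],
        "H": L - 1,
        "D": d,
    }


# Example values: one hidden layer with l_0 = 2, l_1 = 3, l_2 = 1
phi = [
    ([[1, 0], [0, -1], [-2, 2]], [0, 2, -1]),
    ([[1, -2, 3]], [1]),
]
info = describe(phi)
print(info)
```

In this example the number of parameters is 3·(2+1) + 1·(3+1) = 13.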

1.3.2 Realizations of fully-connected feedforward ANNs


Definition 1.3.4 (Realizations of fully-connected feedforward ANNs). Let Φ ∈ N and let
a : R → R be a function (cf. Definition 1.3.1). Then we denote by

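\[ \mathcal{R}^{\mathbf{N}}_a(\Phi) \colon \mathbb{R}^{\mathcal{I}(\Phi)} \to \mathbb{R}^{\mathcal{O}(\Phi)} \tag{1.89} \]

the function which satisfies for all $x_0 \in \mathbb{R}^{\mathbb{D}_0(\Phi)}$, $x_1 \in \mathbb{R}^{\mathbb{D}_1(\Phi)}$, \dots, $x_{\mathcal{L}(\Phi)} \in \mathbb{R}^{\mathbb{D}_{\mathcal{L}(\Phi)}(\Phi)}$ with

\[ \forall\, k \in \{1, 2, \dots, \mathcal{L}(\Phi)\} \colon \quad x_k = \mathfrak{M}_{a \mathbf{1}_{(0,\mathcal{L}(\Phi))}(k) + \mathrm{id}_{\mathbb{R}} \mathbf{1}_{\{\mathcal{L}(\Phi)\}}(k), \, \mathbb{D}_k(\Phi)} (\mathcal{W}_{k,\Phi} x_{k-1} + \mathcal{B}_{k,\Phi}) \tag{1.90} \]

that
\[ (\mathcal{R}^{\mathbf{N}}_a(\Phi))(x_0) = x_{\mathcal{L}(\Phi)} \tag{1.91} \]
and we call $\mathcal{R}^{\mathbf{N}}_a(\Phi)$ the realization function of the fully-connected feedforward ANN Φ with
activation function a (we call $\mathcal{R}^{\mathbf{N}}_a(\Phi)$ the realization of the fully-connected feedforward ANN
Φ with activation a) (cf. Definition 1.2.1).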
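The recursion (1.90) and (1.91) translates directly into a short loop: the activation is applied componentwise after every affine map except the last one, where the identity is used. The following sketch uses plain nested lists; the function names and the example network (one hidden layer with three neurons) are assumptions made for this illustration.

```python
def relu(t):
    """ReLU activation r(t) = max{t, 0}."""
    return max(t, 0)


def realization(phi, activation, x0):
    """Evaluate the realization function from (1.89)-(1.91) at x0.

    phi is a list [(W_1, B_1), ..., (W_L, B_L)] of nested-list matrices and
    list vectors; activation is a function from R to R.
    """
    x = list(x0)
    L = len(phi)
    for k, (W, B) in enumerate(phi, start=1):
        # Affine map W_k x_{k-1} + B_k
        z = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W, B)]
        # Componentwise activation for k < L, identity for k = L, cf. (1.90)
        x = z if k == L else [activation(t) for t in z]
    return x


# Example values: l_0 = 2, l_1 = 3, l_2 = 1
phi = [
    ([[1, 0], [0, -1], [-2, 2]], [0, 2, -1]),
    ([[1, -2, 3]], [1]),
]
output = realization(phi, relu, [1, 2])
print(output)  # [5]
```

For the input (1, 2) the hidden layer evaluates to (1, 0, 1) and the output layer to 1·1 − 2·0 + 3·1 + 1 = 5.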


Exercise 1.3.1. Let

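\[ \Phi = ((W_1, B_1), (W_2, B_2), (W_3, B_3)) \in (\mathbb{R}^{2 \times 1} \times \mathbb{R}^2) \times (\mathbb{R}^{3 \times 2} \times \mathbb{R}^3) \times (\mathbb{R}^{1 \times 3} \times \mathbb{R}^1) \tag{1.92} \]

satisfy
\[ W_1 = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \qquad B_1 = \begin{pmatrix} 3 \\ 4 \end{pmatrix}, \qquad W_2 = \begin{pmatrix} -1 & 2 \\ 3 & -4 \\ -5 & 6 \end{pmatrix}, \qquad B_2 = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \tag{1.93} \]
\[ W_3 = \begin{pmatrix} -1 & 1 & -1 \end{pmatrix}, \qquad\text{and}\qquad B_3 = \begin{pmatrix} -4 \end{pmatrix}. \tag{1.94} \]
Prove or disprove the following statement: It holds that

\[ (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\Phi))(-1) = 0 \tag{1.95} \]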

(cf. Definitions 1.2.4 and 1.3.4).


Exercise 1.3.2. Let a be the standard logistic activation function (cf. Definition 1.2.18).
Prove or disprove the following statement: There exists Φ ∈ N such that

\[ \mathcal{R}^{\mathbf{N}}_{\tanh}(\Phi) = a \tag{1.96} \]

(cf. Definitions 1.2.25, 1.3.1, and 1.3.4).


import torch
import torch.nn as nn
import torch.nn.functional as F


# To define a neural network, we define a class that inherits from
# torch.nn.Module
class FullyConnectedANN(nn.Module):
    def __init__(self):
        super().__init__()
        # In the constructor, we define the weights and biases.
        # Wrapping the tensors in torch.nn.Parameter objects tells
        # PyTorch that these are parameters that should be
        # optimized during training.
        self.W1 = nn.Parameter(
            torch.Tensor([[1, 0], [0, -1], [-2, 2]])
        )
        self.B1 = nn.Parameter(torch.Tensor([0, 2, -1]))
        self.W2 = nn.Parameter(torch.Tensor([[1, -2, 3]]))
        self.B2 = nn.Parameter(torch.Tensor([1]))

    # The realization function of the network
    def forward(self, x0):
        x1 = F.relu(self.W1 @ x0 + self.B1)
        x2 = self.W2 @ x1 + self.B2
        return x2


model = FullyConnectedANN()

x0 = torch.Tensor([1, 2])
# Print the output of the realization function for input x0
print(model.forward(x0))

# As a consequence of inheriting from torch.nn.Module we can just
# "call" the model itself (which will call the forward method
# implicitly)
print(model(x0))

# Wrapping a tensor in a Parameter object and assigning it to an
# instance variable of the Module makes PyTorch register it as a
# parameter. We can access all parameters via the parameters
# method.
for p in model.parameters():
    print(p)

Source code 1.15 (code/fc-ann-manual.py): Python code for implementing a
fully-connected feedforward ANN in PyTorch. The model created here represents
the fully-connected feedforward ANN
\[ \left( \left( \begin{pmatrix} 1 & 0 \\ 0 & -1 \\ -2 & 2 \end{pmatrix}, \begin{pmatrix} 0 \\ 2 \\ -1 \end{pmatrix} \right), \left( \begin{pmatrix} 1 & -2 & 3 \end{pmatrix}, \begin{pmatrix} 1 \end{pmatrix} \right) \right) \in (\mathbb{R}^{3 \times 2} \times \mathbb{R}^3) \times (\mathbb{R}^{1 \times 3} \times \mathbb{R}^1) \subseteq \mathbf{N} \]
using the ReLU activation function after the hidden layer.

import torch
import torch.nn as nn


class FullyConnectedANN(nn.Module):
    def __init__(self):
        super().__init__()
        # Define the layers of the network in terms of Modules.
        # nn.Linear(3, 20) represents an affine function defined
        # by a 20x3 weight matrix and a 20-dimensional bias vector.
        self.affine1 = nn.Linear(3, 20)
        # The torch.nn.ReLU class simply wraps the
        # torch.nn.functional.relu function as a Module.
        self.activation1 = nn.ReLU()
        self.affine2 = nn.Linear(20, 30)
        self.activation2 = nn.ReLU()
        self.affine3 = nn.Linear(30, 1)

    def forward(self, x0):
        x1 = self.activation1(self.affine1(x0))
        x2 = self.activation2(self.affine2(x1))
        x3 = self.affine3(x2)
        return x3


model = FullyConnectedANN()

x0 = torch.Tensor([1, 2, 3])
print(model(x0))

# Assigning a Module to an instance variable of a Module registers
# all of the former's parameters as parameters of the latter
for p in model.parameters():
    print(p)

Source code 1.16 (code/fc-ann.py): Python code for implementing a fully-connected
feedforward ANN in PyTorch. The model implemented here represents
a fully-connected feedforward ANN with two hidden layers, 3 neurons in the input
layer, 20 neurons in the first hidden layer, 30 neurons in the second hidden layer,
and 1 neuron in the output layer. Unlike Source code 1.15, this code uses the
torch.nn.Linear class to represent the affine transformations.

import torch
import torch.nn as nn

# A Module whose forward method is simply a composition of Modules
# can be represented using the torch.nn.Sequential class
model = nn.Sequential(
    nn.Linear(3, 20),
    nn.ReLU(),
    nn.Linear(20, 30),
    nn.ReLU(),
    nn.Linear(30, 1),
)

# Prints a summary of the model architecture
print(model)

x0 = torch.Tensor([1, 2, 3])
print(model(x0))

Source code 1.17 (code/fc-ann2.py): Python code for creating a fully-connected
feedforward ANN in PyTorch. This creates the same model as Source code 1.16
but uses the torch.nn.Sequential class instead of defining a new subclass of
torch.nn.Module.


1.3.3 On the connection to the vectorized description


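Definition 1.3.5 (Transformation from the structured to the vectorized description of
fully-connected feedforward ANNs). We denote by $\mathcal{T} \colon \mathbf{N} \to \bigl( \bigcup_{d \in \mathbb{N}} \mathbb{R}^d \bigr)$ the function which
satisfies for all $\Phi \in \mathbf{N}$, $k \in \{1, 2, \dots, \mathcal{L}(\Phi)\}$, $d \in \mathbb{N}$, $\theta = (\theta_1, \theta_2, \dots, \theta_d) \in \mathbb{R}^d$ with $\mathcal{T}(\Phi) = \theta$
that, with $l_i = \mathbb{D}_i(\Phi)$ for $i \in \{0, 1, \dots, \mathcal{L}(\Phi)\}$,
\[
d = \mathcal{P}(\Phi), \qquad
\mathcal{B}_{k,\Phi} =
\begin{pmatrix}
\theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + l_k l_{k-1} + 1} \\
\theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + l_k l_{k-1} + 2} \\
\theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + l_k l_{k-1} + 3} \\
\vdots \\
\theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + l_k l_{k-1} + l_k}
\end{pmatrix},
\qquad\text{and}
\]
\[
\mathcal{W}_{k,\Phi} =
\begin{pmatrix}
\theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + 1} & \theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + 2} & \cdots & \theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + l_{k-1}} \\
\theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + l_{k-1} + 1} & \theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + l_{k-1} + 2} & \cdots & \theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + 2 l_{k-1}} \\
\theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + 2 l_{k-1} + 1} & \theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + 2 l_{k-1} + 2} & \cdots & \theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + 3 l_{k-1}} \\
\vdots & \vdots & \ddots & \vdots \\
\theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + (l_k - 1) l_{k-1} + 1} & \theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + (l_k - 1) l_{k-1} + 2} & \cdots & \theta_{(\sum_{i=1}^{k-1} l_i(l_{i-1}+1)) + l_k l_{k-1}}
\end{pmatrix}
\tag{1.97}
\]

(cf. Definition 1.3.1).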

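Lemma 1.3.6. Let $\Phi \in (\mathbb{R}^{3 \times 3} \times \mathbb{R}^3) \times (\mathbb{R}^{2 \times 3} \times \mathbb{R}^2)$ satisfy
\[
\Phi = \left( \left( \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}, \begin{pmatrix} 10 \\ 11 \\ 12 \end{pmatrix} \right), \left( \begin{pmatrix} 13 & 14 & 15 \\ 16 & 17 & 18 \end{pmatrix}, \begin{pmatrix} 19 \\ 20 \end{pmatrix} \right) \right).
\tag{1.98}
\]
Then $\mathcal{T}(\Phi) = (1, 2, 3, \dots, 19, 20) \in \mathbb{R}^{20}$.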

Proof of Lemma 1.3.6. Observe that (1.97) and (1.98) establish that T (Φ) = (1, 2, 3, . . . , 19, 20).
The proof of Lemma 1.3.6 is thus complete.

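Lemma 1.3.7. Let $a, b \in \mathbb{N}$, $W = (W_{i,j})_{(i,j) \in \{1,2,\dots,a\} \times \{1,2,\dots,b\}} \in \mathbb{R}^{a \times b}$, $B = (B_1, B_2, \dots, B_a) \in \mathbb{R}^a$. Then
\[
\mathcal{T}\bigl( ((W, B)) \bigr)
= \bigl( W_{1,1}, W_{1,2}, \dots, W_{1,b}, W_{2,1}, W_{2,2}, \dots, W_{2,b}, \dots, W_{a,1}, W_{a,2}, \dots, W_{a,b}, B_1, B_2, \dots, B_a \bigr)
\tag{1.99}
\]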

(cf. Definition 1.3.5).


Proof of Lemma 1.3.7. Observe that (1.97) establishes (1.99). The proof of Lemma 1.3.7 is
thus complete.

Lemma 1.3.8. Let $L \in \mathbb{N}$, $l_0, l_1, \dots, l_L \in \mathbb{N}$ and for every $k \in \{1, 2, \dots, L\}$ let
$W_k = (W_{k,i,j})_{(i,j) \in \{1,2,\dots,l_k\} \times \{1,2,\dots,l_{k-1}\}} \in \mathbb{R}^{l_k \times l_{k-1}}$, $B_k = (B_{k,1}, B_{k,2}, \dots, B_{k,l_k}) \in \mathbb{R}^{l_k}$. Then
\[
\begin{split}
&\mathcal{T}\bigl( ((W_1, B_1), (W_2, B_2), \dots, (W_L, B_L)) \bigr) \\
&= \bigl( W_{1,1,1}, W_{1,1,2}, \dots, W_{1,1,l_0}, \dots, W_{1,l_1,1}, W_{1,l_1,2}, \dots, W_{1,l_1,l_0}, B_{1,1}, B_{1,2}, \dots, B_{1,l_1}, \\
&\quad\; W_{2,1,1}, W_{2,1,2}, \dots, W_{2,1,l_1}, \dots, W_{2,l_2,1}, W_{2,l_2,2}, \dots, W_{2,l_2,l_1}, B_{2,1}, B_{2,2}, \dots, B_{2,l_2}, \\
&\quad\; \dots, \\
&\quad\; W_{L,1,1}, W_{L,1,2}, \dots, W_{L,1,l_{L-1}}, \dots, W_{L,l_L,1}, W_{L,l_L,2}, \dots, W_{L,l_L,l_{L-1}}, B_{L,1}, B_{L,2}, \dots, B_{L,l_L} \bigr)
\end{split}
\tag{1.100}
\]

(cf. Definition 1.3.5).

Proof of Lemma 1.3.8. Note that (1.97) implies (1.100). The proof of Lemma 1.3.8 is thus
complete.
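The ordering described in (1.100) amounts to writing, layer by layer, the rows of the weight matrix one after another followed by the bias vector. The following sketch implements this flattening for ANNs stored as lists of (matrix, vector) pairs and reproduces the vector from Lemma 1.3.6; the function name to_vector and the nested-list representation are assumptions made for this illustration.

```python
def to_vector(phi):
    """Flatten phi = [(W_1, B_1), ..., (W_L, B_L)] in the order of (1.100)."""
    theta = []
    for W, B in phi:
        for row in W:        # rows of W_k, one after another
            theta.extend(row)
        theta.extend(B)      # followed by the bias vector B_k
    return theta


# The ANN from (1.98)
phi = [
    ([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [10, 11, 12]),
    ([[13, 14, 15], [16, 17, 18]], [19, 20]),
]
theta = to_vector(phi)
print(theta)
```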
Exercise 1.3.3. Prove or disprove the following statement: The function T is injective (cf.
Definition 1.3.5).
Exercise 1.3.4. Prove or disprove the following statement: The function T is surjective (cf.
Definition 1.3.5).
Exercise 1.3.5. Prove or disprove the following statement: The function T is bijective (cf.
Definition 1.3.5).

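Proposition 1.3.9. Let a ∈ C(R, R), Φ ∈ N (cf. Definition 1.3.1). Then
\[
\mathcal{R}^{\mathbf{N}}_a(\Phi) =
\begin{cases}
\mathcal{N}^{\mathcal{T}(\Phi), \mathcal{I}(\Phi)}_{\mathrm{id}_{\mathbb{R}^{\mathcal{O}(\Phi)}}} & : \mathcal{H}(\Phi) = 0 \\[1ex]
\mathcal{N}^{\mathcal{T}(\Phi), \mathcal{I}(\Phi)}_{\mathfrak{M}_{a,\mathbb{D}_1(\Phi)}, \mathfrak{M}_{a,\mathbb{D}_2(\Phi)}, \dots, \mathfrak{M}_{a,\mathbb{D}_{\mathcal{H}(\Phi)}(\Phi)}, \mathrm{id}_{\mathbb{R}^{\mathcal{O}(\Phi)}}} & : \mathcal{H}(\Phi) > 0
\end{cases}
\tag{1.101}
\]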

(cf. Definitions 1.1.3, 1.2.1, 1.3.4, and 1.3.5).

Proof of Proposition 1.3.9. Throughout this proof, let L ∈ N, l0 , l1 , . . . , lL ∈ N satisfy that

L(Φ) = L and D(Φ) = (l0 , l1 , . . . , lL ). (1.102)

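Note that (1.97) shows that for all $k \in \{1, 2, \dots, L\}$, $x \in \mathbb{R}^{l_{k-1}}$ it holds that
\[ \mathcal{W}_{k,\Phi} x + \mathcal{B}_{k,\Phi} = \Bigl( \mathcal{A}^{\mathcal{T}(\Phi), \sum_{i=1}^{k-1} l_i(l_{i-1}+1)}_{l_k, l_{k-1}} \Bigr)(x) \tag{1.103} \]

(cf. Definitions 1.1.1 and 1.3.5). This demonstrates that for all $x_0 \in \mathbb{R}^{l_0}$, $x_1 \in \mathbb{R}^{l_1}$, \dots,
$x_{L-1} \in \mathbb{R}^{l_{L-1}}$ with $\forall\, k \in \{1, 2, \dots, L-1\} \colon x_k = \mathfrak{M}_{a,l_k}(\mathcal{W}_{k,\Phi} x_{k-1} + \mathcal{B}_{k,\Phi})$ it holds that
\[
x_{L-1} =
\begin{cases}
x_0 & : L = 1 \\[1ex]
\Bigl( \mathfrak{M}_{a,l_{L-1}} \circ \mathcal{A}^{\mathcal{T}(\Phi), \sum_{i=1}^{L-2} l_i(l_{i-1}+1)}_{l_{L-1}, l_{L-2}} \circ \mathfrak{M}_{a,l_{L-2}} \circ \mathcal{A}^{\mathcal{T}(\Phi), \sum_{i=1}^{L-3} l_i(l_{i-1}+1)}_{l_{L-2}, l_{L-3}} \circ \dots \circ \mathfrak{M}_{a,l_1} \circ \mathcal{A}^{\mathcal{T}(\Phi), 0}_{l_1, l_0} \Bigr)(x_0) & : L > 1
\end{cases}
\tag{1.104}
\]

(cf. Definition 1.2.1). This, (1.103), (1.5), and (1.91) show that for all $x_0 \in \mathbb{R}^{l_0}$, $x_1 \in \mathbb{R}^{l_1}$, \dots,
$x_L \in \mathbb{R}^{l_L}$ with $\forall\, k \in \{1, 2, \dots, L\} \colon x_k = \mathfrak{M}_{a \mathbf{1}_{(0,L)}(k) + \mathrm{id}_{\mathbb{R}} \mathbf{1}_{\{L\}}(k), \, l_k}(\mathcal{W}_{k,\Phi} x_{k-1} + \mathcal{B}_{k,\Phi})$ it
holds that
\[
\begin{split}
\bigl( \mathcal{R}^{\mathbf{N}}_a(\Phi) \bigr)(x_0) &= x_L = \mathcal{W}_{L,\Phi} x_{L-1} + \mathcal{B}_{L,\Phi} = \Bigl( \mathcal{A}^{\mathcal{T}(\Phi), \sum_{i=1}^{L-1} l_i(l_{i-1}+1)}_{l_L, l_{L-1}} \Bigr)(x_{L-1}) \\
&= \begin{cases}
\Bigl( \mathcal{N}^{\mathcal{T}(\Phi), l_0}_{\mathrm{id}_{\mathbb{R}^{l_L}}} \Bigr)(x_0) & : L = 1 \\[1ex]
\Bigl( \mathcal{N}^{\mathcal{T}(\Phi), l_0}_{\mathfrak{M}_{a,l_1}, \mathfrak{M}_{a,l_2}, \dots, \mathfrak{M}_{a,l_{L-1}}, \mathrm{id}_{\mathbb{R}^{l_L}}} \Bigr)(x_0) & : L > 1
\end{cases}
\end{split}
\tag{1.105}
\]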
(cf. Definitions 1.1.3 and 1.3.4). The proof of Proposition 1.3.9 is thus complete.
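Proposition 1.3.9 can be illustrated by evaluating one and the same ANN in two ways: from its structured description and from the flattened parameter vector, where the k-th affine map is read off at the offset $\sum_{i=1}^{k-1} l_i(l_{i-1}+1)$ as in (1.103). The following sketch does this for a small example; the function names, the nested-list representation, and the example weights are assumptions made for this illustration.

```python
def relu(t):
    """ReLU activation r(t) = max{t, 0}."""
    return max(t, 0)


def realization_structured(phi, activation, x0):
    """Evaluate phi = [(W_1, B_1), ..., (W_L, B_L)] as in (1.90)-(1.91)."""
    x = list(x0)
    for k, (W, B) in enumerate(phi, start=1):
        z = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W, B)]
        x = z if k == len(phi) else [activation(t) for t in z]
    return x


def realization_vectorized(theta, dims, activation, x0):
    """Evaluate the ANN with parameter vector theta and layer dimensions dims."""
    x = list(x0)
    offset = 0  # equals sum_{i<k} l_i (l_{i-1} + 1) in the k-th step
    L = len(dims) - 1
    for k in range(1, L + 1):
        rows, cols = dims[k], dims[k - 1]
        z = []
        for i in range(rows):
            start = offset + i * cols
            w_row = theta[start:start + cols]      # i-th row of W_k
            bias = theta[offset + rows * cols + i]  # i-th entry of B_k
            z.append(sum(w * v for w, v in zip(w_row, x)) + bias)
        x = z if k == L else [activation(t) for t in z]
        offset += rows * (cols + 1)
    return x


# Example values: l_0 = 2, l_1 = 3, l_2 = 1
phi = [
    ([[1, 0], [0, -1], [-2, 2]], [0, 2, -1]),
    ([[1, -2, 3]], [1]),
]
theta = [1, 0, 0, -1, -2, 2, 0, 2, -1, 1, -2, 3, 1]  # flattened as in (1.100)
dims = (2, 3, 1)

out_structured = realization_structured(phi, relu, [1, 2])
out_vectorized = realization_vectorized(theta, dims, relu, [1, 2])
print(out_structured, out_vectorized)
```

Both evaluations should return the same output vector for every input.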

1.4 Convolutional ANNs (CNNs)


In this section we review CNNs, which are ANNs designed to process data with a spatial
structure. In a broad sense, CNNs can be thought of as any ANNs involving a convolution
operation (cf., for instance, Definition 1.4.1 below). Roughly speaking, convolutional
operations allow CNNs to exploit spatial invariance of data by performing the same
operations across different regions of an input data point. In principle, such convolution
operations can be employed in combinations with other ANN architecture elements, such as
fully-connected layers (cf., for example, Sections 1.1 and 1.3 above), residual layers (cf., for
instance, Section 1.5 below), and recurrent structures (cf., for example, Section 1.6 below).
However, for simplicity we introduce in this section in all mathematical details feedforward
CNNs only involving convolutional layers based on the discrete convolution operation
without padding (sometimes called valid padding) in Definition 1.4.1 (see Definitions 1.4.2
and 1.4.5 below). We refer, for instance, to [4, Section 12.5], [60, Chapter 16], [63, Section
4.2], [164, Chapter 9], and [36, Sectino 1.6.1] for other introductions on CNNs.
CNNs were introduced in LeCun et al. [262] for computer vision (CV) applications. The
first successful modern CNN architecture is widely considered to be the AlexNet architecture
proposed in Krizhevsky et al. [257]. A few other very successful early CNN architectures for
CV include [152, 190, 206, 282, 291, 371, 378, 390]. While CV is by far the most popular
domain of application for CNNs, CNNs have also been employed successfully in several other
areas. In particular, we refer, for example, to [110, 143, 245, 430, 434, 437] for applications
of CNNs to natural language processing (NLP), we refer, for instance, to [1, 59, 78, 359, 396]
for applications of CNNs to audio processing, and we refer, for example, to [46, 105, 236,
348, 408, 440] for applications of CNNs to time series analysis. Finally, for approximation
results for feedforward CNNs we refer, for instance, to Petersen & Voigtländer [334] and
the references therein.

1.4.1 Discrete convolutions


Definition 1.4.1 (Discrete convolutions). Let $T \in \mathbb{N}$, $a_1, a_2, \ldots, a_T, w_1, w_2, \ldots, w_T, d_1,
d_2, \ldots, d_T \in \mathbb{N}$ and let $A = (A_{i_1,i_2,\ldots,i_T})_{(i_1,i_2,\ldots,i_T) \in (\times_{t=1}^{T} \{1,2,\ldots,a_t\})} \in \mathbb{R}^{a_1 \times a_2 \times \ldots \times a_T}$,
$W = (W_{i_1,i_2,\ldots,i_T})_{(i_1,i_2,\ldots,i_T) \in (\times_{t=1}^{T} \{1,2,\ldots,w_t\})} \in \mathbb{R}^{w_1 \times w_2 \times \ldots \times w_T}$ satisfy for all $t \in \{1, 2, \ldots, T\}$ that
\[
d_t = a_t - w_t + 1.
\tag{1.106}
\]
Then we denote by $A * W = ((A * W)_{i_1,i_2,\ldots,i_T})_{(i_1,i_2,\ldots,i_T) \in (\times_{t=1}^{T} \{1,2,\ldots,d_t\})} \in \mathbb{R}^{d_1 \times d_2 \times \ldots \times d_T}$ the
tensor which satisfies for all $i_1 \in \{1, 2, \ldots, d_1\}$, $i_2 \in \{1, 2, \ldots, d_2\}$, $\ldots$, $i_T \in \{1, 2, \ldots, d_T\}$
that
\[
(A * W)_{i_1,i_2,\ldots,i_T}
= \sum_{r_1=1}^{w_1} \sum_{r_2=1}^{w_2} \cdots \sum_{r_T=1}^{w_T}
A_{i_1-1+r_1,\, i_2-1+r_2,\, \ldots,\, i_T-1+r_T}\, W_{r_1,r_2,\ldots,r_T}.
\tag{1.107}
\]
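The sum in (1.107) can be evaluated directly with array slicing. The following is a minimal NumPy sketch, not part of the original text; the function name `discrete_convolution` and the example arrays are illustrative choices. Indices are zero-based in the code, so the window starting at the code index i corresponds to the index i + 1 in (1.107). Note that the kernel is not flipped, which is the same convention that deep learning libraries usually refer to as cross-correlation.

```python
import itertools

import numpy as np


def discrete_convolution(A, W):
    """Return A * W in the sense of (1.107): no padding, no kernel flip."""
    A = np.asarray(A, dtype=float)
    W = np.asarray(W, dtype=float)
    if A.ndim != W.ndim:
        raise ValueError("A and W must have the same number of axes T")
    # Output sizes d_t = a_t - w_t + 1, cf. (1.106).
    d = tuple(a - w + 1 for a, w in zip(A.shape, W.shape))
    if min(d) < 1:
        raise ValueError("W must not be larger than A along any axis")
    out = np.empty(d)
    for i in itertools.product(*(range(d_t) for d_t in d)):
        # Window A[i_1 : i_1 + w_1, ..., i_T : i_T + w_T] (zero-based).
        window = A[tuple(slice(i_t, i_t + w) for i_t, w in zip(i, W.shape))]
        out[i] = np.sum(window * W)
    return out


A = np.arange(1.0, 10.0).reshape(3, 3)  # rows (1, 2, 3), (4, 5, 6), (7, 8, 9)
W = np.array([[1.0, 0.0], [0.0, 1.0]])
result = discrete_convolution(A, W)
print(result)  # rows (6, 8) and (12, 14)
```

The loop over all output indices is chosen for readability, since it mirrors the definition term by term; it is not meant to be efficient.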

1.4.2 Structured description of feedforward CNNs


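Definition 1.4.2 (Structured description of feedforward CNNs). We denote by $\mathbf{C}$ the set
given by
\[
\mathbf{C} =
\bigcup_{T,L \in \mathbb{N}}\;
\bigcup_{l_0,l_1,\ldots,l_L \in \mathbb{N}}\;
\bigcup_{(c_{k,t})_{(k,t) \in \{1,2,\ldots,L\} \times \{1,2,\ldots,T\}} \subseteq \mathbb{N}}
\Bigl( \mathop{\times}_{k=1}^{L} \bigl( (\mathbb{R}^{c_{k,1} \times c_{k,2} \times \ldots \times c_{k,T}})^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k} \bigr) \Bigr).
\tag{1.108}
\]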

Definition 1.4.3 (Feedforward CNNs). We say that Φ is a feedforward CNN if and only if
it holds that
Φ∈C (1.109)
(cf. Definition 1.4.2).

1.4.3 Realizations of feedforward CNNs


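Definition 1.4.4 (One tensor). Let $T \in \mathbb{N}$, $d_1, d_2, \ldots, d_T \in \mathbb{N}$. Then we denote by
$\mathbf{I}^{d_1,d_2,\ldots,d_T} = (\mathbf{I}^{d_1,d_2,\ldots,d_T}_{i_1,i_2,\ldots,i_T})_{(i_1,i_2,\ldots,i_T) \in (\times_{t=1}^{T} \{1,2,\ldots,d_t\})} \in \mathbb{R}^{d_1 \times d_2 \times \ldots \times d_T}$ the tensor which satisfies for
all $i_1 \in \{1, 2, \ldots, d_1\}$, $i_2 \in \{1, 2, \ldots, d_2\}$, $\ldots$, $i_T \in \{1, 2, \ldots, d_T\}$ that
\[
\mathbf{I}^{d_1,d_2,\ldots,d_T}_{i_1,i_2,\ldots,i_T} = 1.
\tag{1.110}
\]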


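Definition 1.4.5 (Realizations associated to feedforward CNNs). Let $T, L \in \mathbb{N}$, $l_0, l_1, \ldots,
l_L \in \mathbb{N}$, let $(c_{k,t})_{(k,t) \in \{1,2,\ldots,L\} \times \{1,2,\ldots,T\}} \subseteq \mathbb{N}$, let
$\Phi = (((W_{k,n,m})_{(n,m) \in \{1,2,\ldots,l_k\} \times \{1,2,\ldots,l_{k-1}\}}, (B_{k,n})_{n \in \{1,2,\ldots,l_k\}}))_{k \in \{1,2,\ldots,L\}}
\in \bigl( \mathop{\times}_{k=1}^{L} ((\mathbb{R}^{c_{k,1} \times c_{k,2} \times \ldots \times c_{k,T}})^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \bigr) \subseteq \mathbf{C}$, and let $a \colon \mathbb{R} \to \mathbb{R}$
be a function. Then we denote by
\[
\mathcal{R}^{\mathbf{C}}_a(\Phi) \colon
\Biggl( \bigcup_{\substack{d_1,d_2,\ldots,d_T \in \mathbb{N} \\ \forall\, t \in \{1,2,\ldots,T\} \colon\, d_t - \sum_{k=1}^{L} (c_{k,t}-1) \geq 1}} (\mathbb{R}^{d_1 \times d_2 \times \ldots \times d_T})^{l_0} \Biggr)
\to
\Biggl( \bigcup_{d_1,d_2,\ldots,d_T \in \mathbb{N}} (\mathbb{R}^{d_1 \times d_2 \times \ldots \times d_T})^{l_L} \Biggr)
\tag{1.111}
\]
the function which satisfies for all $(d_{k,t})_{(k,t) \in \{0,1,\ldots,L\} \times \{1,2,\ldots,T\}} \subseteq \mathbb{N}$, $x_0 = (x_{0,1}, \ldots, x_{0,l_0}) \in
(\mathbb{R}^{d_{0,1} \times d_{0,2} \times \ldots \times d_{0,T}})^{l_0}$, $x_1 = (x_{1,1}, \ldots, x_{1,l_1}) \in (\mathbb{R}^{d_{1,1} \times d_{1,2} \times \ldots \times d_{1,T}})^{l_1}$, $\ldots$, $x_L = (x_{L,1}, \ldots, x_{L,l_L}) \in
(\mathbb{R}^{d_{L,1} \times d_{L,2} \times \ldots \times d_{L,T}})^{l_L}$ with
\[
\forall\, k \in \{1, 2, \ldots, L\},\ t \in \{1, 2, \ldots, T\} \colon \quad d_{k,t} = d_{k-1,t} - c_{k,t} + 1
\tag{1.112}
\]
and
\[
\forall\, k \in \{1, 2, \ldots, L\},\ n \in \{1, 2, \ldots, l_k\} \colon \quad
x_{k,n} = \mathfrak{M}_{a \mathbf{1}_{(0,L)}(k) + \mathrm{id}_{\mathbb{R}} \mathbf{1}_{\{L\}}(k),\, d_{k,1},d_{k,2},\ldots,d_{k,T}}
\Bigl( B_{k,n}\, \mathbf{I}^{d_{k,1},d_{k,2},\ldots,d_{k,T}} + \sum_{m=1}^{l_{k-1}} x_{k-1,m} * W_{k,n,m} \Bigr)
\tag{1.113}
\]
that
\[
\bigl(\mathcal{R}^{\mathbf{C}}_a(\Phi)\bigr)(x_0) = x_L
\tag{1.114}
\]
and we call $\mathcal{R}^{\mathbf{C}}_a(\Phi)$ the realization function of the feedforward CNN $\Phi$ with activation
function $a$ (we call $\mathcal{R}^{\mathbf{C}}_a(\Phi)$ the realization of the feedforward CNN $\Phi$ with activation $a$) (cf.
Definitions 1.2.1, 1.4.1, 1.4.2, and 1.4.4).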
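The recursion (1.113) can also be traced step by step without a deep learning library. The following NumPy sketch is an illustration under simplifying assumptions and not the implementation used in this book: the names `conv_valid` and `cnn_realization` are hypothetical, a CNN is stored as a list of pairs (W, B) of nested Python lists, and the parameter values are those of Example 1.4.6 below.

```python
import numpy as np


def conv_valid(A, W):
    """Discrete convolution without padding as in (1.107)."""
    d = tuple(a - w + 1 for a, w in zip(A.shape, W.shape))
    out = np.empty(d)
    for i in np.ndindex(*d):
        window = A[tuple(slice(i_t, i_t + w) for i_t, w in zip(i, W.shape))]
        out[i] = np.sum(window * W)
    return out


def cnn_realization(layers, activation, x0):
    """Evaluate the recursion (1.113) and return x_L as a list of channels.

    layers: list of (W, B), where W[n][m] is the kernel from input channel m
        to output channel n and B[n] is the scalar bias of output channel n.
    x0: list of l_0 input arrays of equal shape.
    """
    x = [np.asarray(channel, dtype=float) for channel in x0]
    for k, (W, B) in enumerate(layers, start=1):
        new_x = []
        for n in range(len(W)):
            # Adding the scalar B[n] corresponds to adding B_{k,n} times the
            # one tensor; the sum runs over all input channels m.
            z = B[n] + sum(
                conv_valid(x[m], np.asarray(W[n][m], dtype=float))
                for m in range(len(x))
            )
            # Activation on hidden layers, identity on the output layer.
            new_x.append(activation(z) if k < len(layers) else z)
        x = new_x
    return x


def relu(z):
    return np.maximum(z, 0.0)


# Parameters of Example 1.4.6: two 2x2 kernels, then two 1x1 kernels.
phi = [
    ([[[[0, 0], [0, 0]]], [[[1, 0], [0, 1]]]], [1, -1]),
    ([[[[-2]], [[2]]]], [3]),
]
x0 = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
output = cnn_realization(phi, relu, x0)
print(output[0])  # rows (11, 15) and (23, 27)
```

The printed array agrees with the value stated in (1.117).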

import torch
import torch.nn as nn


class ConvolutionalANN(nn.Module):
    def __init__(self):
        super().__init__()
        # The convolutional layer defined here takes any tensor of
        # shape (1, n, m) [a single input] or (N, 1, n, m) [a batch
        # of N inputs] where N, n, m are natural numbers satisfying
        # n >= 3 and m >= 3.
        self.conv1 = nn.Conv2d(
            in_channels=1, out_channels=5, kernel_size=(3, 3)
        )
        self.activation1 = nn.ReLU()
        self.conv2 = nn.Conv2d(
            in_channels=5, out_channels=5, kernel_size=(5, 3)
        )

    def forward(self, x0):
        x1 = self.activation1(self.conv1(x0))
        print(x1.shape)
        x2 = self.conv2(x1)
        print(x2.shape)
        return x2


model = ConvolutionalANN()
x0 = torch.rand(1, 20, 20)
# This will print the shapes of the outputs of the two layers of
# the model, in this case:
# torch.Size([5, 18, 18])
# torch.Size([5, 14, 16])
model(x0)

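Source code 1.18 (code/conv-ann.py): Python code implementing a feedforward
CNN in PyTorch. The implemented model here corresponds to a feedforward
CNN $\Phi \in \mathbf{C}$ where $T = 2$, $L = 2$, $l_0 = 1$, $l_1 = 5$, $l_2 = 5$, $(c_{1,1}, c_{1,2}) = (3, 3)$,
$(c_{2,1}, c_{2,2}) = (5, 3)$, and $\Phi \in \bigl( \mathop{\times}_{k=1}^{L} ((\mathbb{R}^{c_{k,1} \times c_{k,2} \times \ldots \times c_{k,T}})^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \bigr) = ((\mathbb{R}^{3 \times 3})^{5 \times 1} \times
\mathbb{R}^5) \times ((\mathbb{R}^{5 \times 3})^{5 \times 5} \times \mathbb{R}^5)$. The model, given an input of shape $(1, d_1, d_2)$ with
$d_1 \in \mathbb{N} \cap [7, \infty)$, $d_2 \in \mathbb{N} \cap [5, \infty)$, produces an output of shape $(5, d_1 - 6, d_2 - 4)$
(corresponding to the realization function $\mathcal{R}^{\mathbf{C}}_a(\Phi)$ for $a \in C(\mathbb{R}, \mathbb{R})$ having domain
$\bigcup_{d_1,d_2 \in \mathbb{N},\, d_1 \geq 7,\, d_2 \geq 5} (\mathbb{R}^{d_1 \times d_2})^1$ and satisfying for all $d_1 \in \mathbb{N} \cap [7, \infty)$, $d_2 \in \mathbb{N} \cap [5, \infty)$,
$x_0 \in (\mathbb{R}^{d_1 \times d_2})^1$ that $(\mathcal{R}^{\mathbf{C}}_a(\Phi))(x_0) \in (\mathbb{R}^{(d_1 - 6) \times (d_2 - 4)})^5$).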

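Example 1.4.6 (Example for Definition 1.4.5). Let $T = 2$, $L = 2$, $l_0 = 1$, $l_1 = 2$, $l_2 = 1$,
$c_{1,1} = 2$, $c_{1,2} = 2$, $c_{2,1} = 1$, $c_{2,2} = 1$ and let
\[
\Phi \in \Bigl( \mathop{\times}_{k=1}^{L} \bigl( (\mathbb{R}^{c_{k,1} \times c_{k,2} \times \ldots \times c_{k,T}})^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k} \bigr) \Bigr)
= \bigl( (\mathbb{R}^{2 \times 2})^{2 \times 1} \times \mathbb{R}^2 \bigr) \times \bigl( (\mathbb{R}^{1 \times 1})^{1 \times 2} \times \mathbb{R}^1 \bigr)
\tag{1.115}
\]
satisfy
\[
\Phi = \left( \left(
\begin{pmatrix} \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix} \\[2ex] \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \end{pmatrix},
\begin{pmatrix} 1 \\ -1 \end{pmatrix}
\right),
\left(
\begin{pmatrix} \begin{pmatrix} -2 \end{pmatrix} & \begin{pmatrix} 2 \end{pmatrix} \end{pmatrix},
\begin{pmatrix} 3 \end{pmatrix}
\right) \right).
\tag{1.116}
\]
Then
\[
\bigl(\mathcal{R}^{\mathbf{C}}_{\mathfrak{r}}(\Phi)\bigr)
\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}
= \begin{pmatrix} 11 & 15 \\ 23 & 27 \end{pmatrix}
\tag{1.117}
\]
(cf. Definitions 1.2.4 and 1.4.5).
Proof for Example 1.4.6. Throughout this proof, let $x_0 \in \mathbb{R}^{3 \times 3}$, $x_1 = (x_{1,1}, x_{1,2}) \in (\mathbb{R}^{2 \times 2})^2$,
$x_2 \in \mathbb{R}^{2 \times 2}$ satisfy that
\[
x_0 = \begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix},
\qquad
x_{1,1} = \mathfrak{M}_{\mathfrak{r}, 2 \times 2}\Bigl( \mathbf{I}^{2,2} + x_0 * \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix} \Bigr),
\tag{1.118}
\]
\[
x_{1,2} = \mathfrak{M}_{\mathfrak{r}, 2 \times 2}\Bigl( (-1)\, \mathbf{I}^{2,2} + x_0 * \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \Bigr),
\tag{1.119}
\]
\[
\text{and} \qquad
x_2 = \mathfrak{M}_{\mathrm{id}_{\mathbb{R}}, 2 \times 2}\Bigl( 3\, \mathbf{I}^{2,2} + x_{1,1} * \begin{pmatrix} -2 \end{pmatrix} + x_{1,2} * \begin{pmatrix} 2 \end{pmatrix} \Bigr).
\tag{1.120}
\]
Note that (1.114), (1.116), (1.118), (1.119), and (1.120) imply that
\[
\bigl(\mathcal{R}^{\mathbf{C}}_{\mathfrak{r}}(\Phi)\bigr)
\begin{pmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{pmatrix}
= \bigl(\mathcal{R}^{\mathbf{C}}_{\mathfrak{r}}(\Phi)\bigr)(x_0) = x_2.
\tag{1.121}
\]
Next observe that (1.118) ensures that
\[
\begin{split}
x_{1,1}
&= \mathfrak{M}_{\mathfrak{r}, 2 \times 2}\Bigl( \mathbf{I}^{2,2} + x_0 * \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix} \Bigr)
= \mathfrak{M}_{\mathfrak{r}, 2 \times 2}\Bigl( \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} + \begin{pmatrix} 0 & 0 \\ 0 & 0 \end{pmatrix} \Bigr) \\
&= \mathfrak{M}_{\mathfrak{r}, 2 \times 2}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}
= \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}.
\end{split}
\tag{1.122}
\]
Furthermore, note that (1.119) assures that
\[
\begin{split}
x_{1,2}
&= \mathfrak{M}_{\mathfrak{r}, 2 \times 2}\Bigl( (-1)\, \mathbf{I}^{2,2} + x_0 * \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \Bigr)
= \mathfrak{M}_{\mathfrak{r}, 2 \times 2}\Bigl( \begin{pmatrix} -1 & -1 \\ -1 & -1 \end{pmatrix} + \begin{pmatrix} 6 & 8 \\ 12 & 14 \end{pmatrix} \Bigr) \\
&= \mathfrak{M}_{\mathfrak{r}, 2 \times 2}\begin{pmatrix} 5 & 7 \\ 11 & 13 \end{pmatrix}
= \begin{pmatrix} 5 & 7 \\ 11 & 13 \end{pmatrix}.
\end{split}
\tag{1.123}
\]
Moreover, observe that this, (1.122), and (1.120) demonstrate that
\[
\begin{split}
x_2
&= \mathfrak{M}_{\mathrm{id}_{\mathbb{R}}, 2 \times 2}\Bigl( 3\, \mathbf{I}^{2,2} + x_{1,1} * \begin{pmatrix} -2 \end{pmatrix} + x_{1,2} * \begin{pmatrix} 2 \end{pmatrix} \Bigr) \\
&= \mathfrak{M}_{\mathrm{id}_{\mathbb{R}}, 2 \times 2}\Bigl( 3\, \mathbf{I}^{2,2} + \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} * \begin{pmatrix} -2 \end{pmatrix} + \begin{pmatrix} 5 & 7 \\ 11 & 13 \end{pmatrix} * \begin{pmatrix} 2 \end{pmatrix} \Bigr) \\
&= \mathfrak{M}_{\mathrm{id}_{\mathbb{R}}, 2 \times 2}\Bigl( \begin{pmatrix} 3 & 3 \\ 3 & 3 \end{pmatrix} + \begin{pmatrix} -2 & -2 \\ -2 & -2 \end{pmatrix} + \begin{pmatrix} 10 & 14 \\ 22 & 26 \end{pmatrix} \Bigr) \\
&= \mathfrak{M}_{\mathrm{id}_{\mathbb{R}}, 2 \times 2}\begin{pmatrix} 11 & 15 \\ 23 & 27 \end{pmatrix}
= \begin{pmatrix} 11 & 15 \\ 23 & 27 \end{pmatrix}.
\end{split}
\tag{1.124}
\]
This and (1.121) establish (1.117). The proof for Example 1.4.6 is thus complete.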

import torch
import torch.nn as nn


model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=2, kernel_size=(2, 2)),
    nn.ReLU(),
    nn.Conv2d(in_channels=2, out_channels=1, kernel_size=(1, 1)),
)

# Overwrite the randomly initialized parameters with those from (1.116).
with torch.no_grad():
    model[0].weight.set_(
        torch.Tensor([[[[0, 0], [0, 0]]], [[[1, 0], [0, 1]]]])
    )
    model[0].bias.set_(torch.Tensor([1, -1]))
    model[2].weight.set_(torch.Tensor([[[[-2]], [[2]]]]))
    model[2].bias.set_(torch.Tensor([3]))

x0 = torch.Tensor([[[1, 2, 3], [4, 5, 6], [7, 8, 9]]])
print(model(x0))

Source code 1.19 (code/conv-ann-ex.py): Python code implementing the
feedforward CNN Φ from Example 1.4.6 (see (1.116)) in PyTorch and verifying
(1.117).

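Exercise 1.4.1. Let
\[
\Phi = \bigl( ((W_{1,n,m})_{(n,m) \in \{1,2,3\} \times \{1\}}, (B_{1,n})_{n \in \{1,2,3\}}),\,
((W_{2,n,m})_{(n,m) \in \{1\} \times \{1,2,3\}}, (B_{2,n})_{n \in \{1\}}) \bigr)
\in ((\mathbb{R}^2)^{3 \times 1} \times \mathbb{R}^3) \times ((\mathbb{R}^3)^{1 \times 3} \times \mathbb{R}^1)
\tag{1.125}
\]
satisfy
\[
W_{1,1,1} = (1, -1), \quad W_{1,2,1} = (2, -2), \quad W_{1,3,1} = (-3, 3), \quad (B_{1,n})_{n \in \{1,2,3\}} = (1, 2, 3),
\tag{1.126}
\]
\[
W_{2,1,1} = (1, -1, 1), \quad W_{2,1,2} = (2, -2, 2), \quad W_{2,1,3} = (-3, 3, -3), \quad \text{and} \quad B_{2,1} = -2
\tag{1.127}
\]
and let $v \in \mathbb{R}^9$ satisfy $v = (1, 2, 3, 4, 5, 4, 3, 2, 1)$. Specify
\[
\bigl(\mathcal{R}^{\mathbf{C}}_{\mathfrak{r}}(\Phi)\bigr)(v)
\tag{1.128}
\]
explicitly and prove that your result is correct (cf. Definitions 1.2.4 and 1.4.5)!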


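Exercise 1.4.2. Let
\[
\Phi = \bigl( ((W_{1,n,m})_{(n,m) \in \{1,2,3\} \times \{1\}}, (B_{1,n})_{n \in \{1,2,3\}}),\,
((W_{2,n,m})_{(n,m) \in \{1\} \times \{1,2,3\}}, (B_{2,n})_{n \in \{1\}}) \bigr)
\in ((\mathbb{R}^3)^{3 \times 1} \times \mathbb{R}^3) \times ((\mathbb{R}^2)^{1 \times 3} \times \mathbb{R}^1)
\tag{1.129}
\]
satisfy
\[
W_{1,1,1} = (1, 1, 1), \quad W_{1,2,1} = (2, -2, -2),
\tag{1.130}
\]
\[
W_{1,3,1} = (-3, -3, 3), \quad (B_{1,n})_{n \in \{1,2,3\}} = (3, -2, -1),
\tag{1.131}
\]
\[
W_{2,1,1} = (2, -1), \quad W_{2,1,2} = (-1, 2), \quad W_{2,1,3} = (-1, 0), \quad \text{and} \quad B_{2,1} = -2
\tag{1.132}
\]
and let $v \in \mathbb{R}^9$ satisfy $v = (1, -1, 1, -1, 1, -1, 1, -1, 1)$. Specify
\[
\bigl(\mathcal{R}^{\mathbf{C}}_{\mathfrak{r}}(\Phi)\bigr)(v)
\tag{1.133}
\]
explicitly and prove that your result is correct (cf. Definitions 1.2.4 and 1.4.5)!
Exercise 1.4.3. Prove or disprove the following statement: For every $a \in C(\mathbb{R}, \mathbb{R})$, $\Phi \in \mathbf{N}$
there exists $\Psi \in \mathbf{C}$ such that for all $x \in \mathbb{R}^{\mathcal{I}(\Phi)}$ it holds that $\mathbb{R}^{\mathcal{I}(\Phi)} \subseteq \mathrm{Domain}(\mathcal{R}^{\mathbf{C}}_a(\Psi))$ and
\[
\bigl(\mathcal{R}^{\mathbf{C}}_a(\Psi)\bigr)(x) = \bigl(\mathcal{R}^{\mathbf{N}}_a(\Phi)\bigr)(x)
\tag{1.134}
\]
(cf. Definitions 1.3.1, 1.3.4, 1.4.2, and 1.4.5).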

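Definition 1.4.7 (Standard scalar products). We denote by $\langle \cdot, \cdot \rangle \colon \bigl( \bigcup_{d \in \mathbb{N}} (\mathbb{R}^d \times \mathbb{R}^d) \bigr) \to \mathbb{R}$
the function which satisfies for all $d \in \mathbb{N}$, $x = (x_1, x_2, \ldots, x_d)$, $y = (y_1, y_2, \ldots, y_d) \in \mathbb{R}^d$ that
\[
\langle x, y \rangle = \sum_{i=1}^{d} x_i y_i.
\tag{1.135}
\]

Exercise 1.4.4. For every $d \in \mathbb{N}$ let $e^{(d)}_1, e^{(d)}_2, \ldots, e^{(d)}_d \in \mathbb{R}^d$ satisfy $e^{(d)}_1 = (1, 0, \ldots, 0)$,
$e^{(d)}_2 = (0, 1, 0, \ldots, 0)$, $\ldots$, $e^{(d)}_d = (0, \ldots, 0, 1)$. Prove or disprove the following statement:
For all $a \in C(\mathbb{R}, \mathbb{R})$, $\Phi \in \mathbf{N}$, $D \in \mathbb{N}$, $x = ((x_{i,j})_{j \in \{1,2,\ldots,D\}})_{i \in \{1,2,\ldots,\mathcal{I}(\Phi)\}} \in (\mathbb{R}^D)^{\mathcal{I}(\Phi)}$ it holds
that
\[
\bigl(\mathcal{R}^{\mathbf{C}}_a(\Phi)\bigr)(x)
= \Bigl( \bigl( \bigl\langle e^{(\mathcal{O}(\Phi))}_k, \bigl(\mathcal{R}^{\mathbf{N}}_a(\Phi)\bigr)\bigl((x_{i,j})_{i \in \{1,2,\ldots,\mathcal{I}(\Phi)\}}\bigr) \bigr\rangle \bigr)_{j \in \{1,2,\ldots,D\}} \Bigr)_{k \in \{1,2,\ldots,\mathcal{O}(\Phi)\}}
\tag{1.136}
\]
(cf. Definitions 1.3.1, 1.3.4, 1.4.5, and 1.4.7).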


1.5 Residual ANNs (ResNets)


In this section we review ResNets. Roughly speaking, plain-vanilla feedforward ANNs can be
seen as having a computational structure consisting of sequentially chained layers in which
each layer feeds information forward to the next layer (cf., for example, Definitions 1.1.3
and 1.3.4 above). ResNets, in turn, are ANNs involving so-called skip connections in their
computational structure, which allow information from one layer to be fed not only to the
next layer, but also to other layers further down the computational structure. In principle,
such skip connections can be employed in combination with other ANN architecture
elements, such as fully-connected layers (cf., for instance, Sections 1.1 and 1.3 above),
convolutional layers (cf., for example, Section 1.4 above), and recurrent structures (cf., for
instance, Section 1.6 below). However, for simplicity we introduce in this section in all
mathematical details feedforward fully-connected ResNets in which the skip connection is a
learnable linear map (see Definitions 1.5.1 and 1.5.4 below).
ResNets were introduced in He et al. [190] as an attempt to improve the performance of
deep ANNs which typically are much harder to train than shallow ANNs (cf., for example,
[30, 153, 328]). The ResNets in He et al. [190] only involve skip connections that are
identity mappings without trainable parameters, and are thus a special case of the definition
of ResNets provided in this section (see Definitions 1.5.1 and 1.5.4 below). The idea of
skip connections (sometimes also called shortcut connections) had already been introduced
before ResNets and had been used in earlier ANN architectures such as the highway nets in
Srivastava et al. [384, 385] (cf. also [264, 293, 345, 390, 398]). In addition, we refer to [191,
206, 404, 417, 427] for a few successful ANN architectures building on the ResNets in He et
al. [190].

1.5.1 Structured description of fully-connected ResNets


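Definition 1.5.1 (Structured description of fully-connected ResNets). We denote by $\mathbf{R}$ the
set given by
\[
\mathbf{R} =
\bigcup_{L \in \mathbb{N}}\;
\bigcup_{l_0,l_1,\ldots,l_L \in \mathbb{N}}\;
\bigcup_{S \subseteq \{(r,k) \in (\mathbb{N}_0)^2 \colon r < k \leq L\}}
\Bigl( \Bigl( \mathop{\times}_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \Bigr) \times \Bigl( \mathop{\times}_{(r,k) \in S} \mathbb{R}^{l_k \times l_r} \Bigr) \Bigr).
\tag{1.137}
\]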

Definition 1.5.2 (Fully-connected ResNets). We say that Φ is a fully-connected ResNet if


and only if it holds that
Φ∈R (1.138)

(cf. Definition 1.5.1).


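Lemma 1.5.3 (On an empty set of skip connections). Let $L \in \mathbb{N}$, $l_0, l_1, \ldots, l_L \in \mathbb{N}$,
$S \subseteq \{(r, k) \in (\mathbb{N}_0)^2 \colon r < k \leq L\}$. Then
\[
\# \Bigl( \mathop{\times}_{(r,k) \in S} \mathbb{R}^{l_k \times l_r} \Bigr) =
\begin{cases}
1 & : S = \emptyset \\
\infty & : S \neq \emptyset.
\end{cases}
\tag{1.139}
\]

Proof of Lemma 1.5.3. Throughout this proof, for all sets $A$ and $B$ let $F(A, B)$ be the set
of all functions from $A$ to $B$. Note that
\[
\# \Bigl( \mathop{\times}_{(r,k) \in S} \mathbb{R}^{l_k \times l_r} \Bigr)
= \# \Bigl( \Bigl\{ f \in F\Bigl( S, \bigcup_{(r,k) \in S} \mathbb{R}^{l_k \times l_r} \Bigr) \colon \bigl( \forall\, (r, k) \in S \colon f(r, k) \in \mathbb{R}^{l_k \times l_r} \bigr) \Bigr\} \Bigr).
\tag{1.140}
\]
This and the fact that for all sets $B$ it holds that $\#(F(\emptyset, B)) = 1$ ensure that
\[
\# \Bigl( \mathop{\times}_{(r,k) \in \emptyset} \mathbb{R}^{l_k \times l_r} \Bigr) = \#(F(\emptyset, \emptyset)) = 1.
\tag{1.141}
\]
Next note that (1.140) assures that for all $(R, K) \in S$ it holds that
\[
\# \Bigl( \mathop{\times}_{(r,k) \in S} \mathbb{R}^{l_k \times l_r} \Bigr) \geq \# \bigl( F\bigl( \{(R, K)\}, \mathbb{R}^{l_K \times l_R} \bigr) \bigr) = \infty.
\tag{1.142}
\]
Combining this and (1.141) establishes (1.139). The proof of Lemma 1.5.3 is thus complete.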

1.5.2 Realizations of fully-connected ResNets


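Definition 1.5.4 (Realizations associated to fully-connected ResNets). Let $L \in \mathbb{N}$, $l_0, l_1,
\ldots, l_L \in \mathbb{N}$, $S \subseteq \{(r, k) \in (\mathbb{N}_0)^2 \colon r < k \leq L\}$, $\Phi = ((W_k, B_k)_{k \in \{1,2,\ldots,L\}}, (V_{r,k})_{(r,k) \in S}) \in
\bigl( \bigl( \mathop{\times}_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \bigr) \times \bigl( \mathop{\times}_{(r,k) \in S} \mathbb{R}^{l_k \times l_r} \bigr) \bigr) \subseteq \mathbf{R}$ and let $a \colon \mathbb{R} \to \mathbb{R}$ be a function. Then
we denote by
\[
\mathcal{R}^{\mathbf{R}}_a(\Phi) \colon \mathbb{R}^{l_0} \to \mathbb{R}^{l_L}
\tag{1.143}
\]
the function which satisfies for all $x_0 \in \mathbb{R}^{l_0}$, $x_1 \in \mathbb{R}^{l_1}$, $\ldots$, $x_L \in \mathbb{R}^{l_L}$ with
\[
\forall\, k \in \{1, 2, \ldots, L\} \colon \quad
x_k = \mathfrak{M}_{a \mathbf{1}_{(0,L)}(k) + \mathrm{id}_{\mathbb{R}} \mathbf{1}_{\{L\}}(k),\, l_k}
\Bigl( W_k x_{k-1} + B_k + \sum_{r \in \mathbb{N}_0,\, (r,k) \in S} V_{r,k}\, x_r \Bigr)
\tag{1.144}
\]
that
\[
\bigl(\mathcal{R}^{\mathbf{R}}_a(\Phi)\bigr)(x_0) = x_L
\tag{1.145}
\]
and we call $\mathcal{R}^{\mathbf{R}}_a(\Phi)$ the realization function of the fully-connected ResNet $\Phi$ with activation
function $a$ (we call $\mathcal{R}^{\mathbf{R}}_a(\Phi)$ the realization of the fully-connected ResNet $\Phi$ with activation
$a$) (cf. Definitions 1.2.1 and 1.5.1).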
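The recursion (1.144) differs from the one for plain fully-connected ANNs only in the additional sum over the skip connections ending in layer k. The following NumPy sketch illustrates this under the assumption that the skip connections are stored in a dictionary keyed by the pairs (r, k); the function name `resnet_realization` and this storage format are illustrative choices, and the parameter values are those of Example 1.5.6 below.

```python
import numpy as np


def resnet_realization(layers, skips, activation, x0):
    """Evaluate the recursion (1.144) and return x_L.

    layers: list [(W_1, B_1), ..., (W_L, B_L)] of weight matrices and biases.
    skips: dict mapping (r, k) with 0 <= r < k <= L to the matrix V_{r,k}.
    """
    L = len(layers)
    xs = [np.asarray(x0, dtype=float)]  # xs[k] stores x_k
    for k, (W, B) in enumerate(layers, start=1):
        z = np.asarray(W, dtype=float) @ xs[k - 1] + np.asarray(B, dtype=float)
        # Add V_{r,k} x_r for every skip connection (r, k) ending in layer k.
        for (r, target), V in skips.items():
            if target == k:
                z = z + np.asarray(V, dtype=float) @ xs[r]
        # Activation on hidden layers, identity on the output layer.
        xs.append(activation(z) if k < L else z)
    return xs[L]


def relu(z):
    return np.maximum(z, 0.0)


# Parameters of Example 1.5.6 with the single skip connection (0, 4).
layers = [
    ([[1]], [0]),
    ([[1], [2]], [0, 1]),
    ([[1, 0], [0, 1]], [0, 0]),
    ([[2, 2]], [1]),
]
skips = {(0, 4): [[-1]]}
value = resnet_realization(layers, skips, relu, [5])
print(value)  # one entry with value 28
```

Keeping all intermediate vectors in the list `xs` is what makes skip connections from arbitrary earlier layers possible; a plain feedforward evaluation would only need the most recent one.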


Definition 1.5.5 (Identity matrices). Let $d \in \mathbb{N}$. Then we denote by $\mathrm{I}_d \in \mathbb{R}^{d \times d}$ the identity
matrix in $\mathbb{R}^{d \times d}$.

import torch
import torch.nn as nn


class ResidualANN(nn.Module):
    def __init__(self):
        super().__init__()
        self.affine1 = nn.Linear(3, 10)
        self.activation1 = nn.ReLU()
        self.affine2 = nn.Linear(10, 20)
        self.activation2 = nn.ReLU()
        self.affine3 = nn.Linear(20, 10)
        self.activation3 = nn.ReLU()
        self.affine4 = nn.Linear(10, 1)

    def forward(self, x0):
        x1 = self.activation1(self.affine1(x0))
        x2 = self.activation2(self.affine2(x1))
        # Skip connection: x1 is added to the output of the third affine map.
        x3 = self.activation3(x1 + self.affine3(x2))
        x4 = self.affine4(x3)
        return x4

Source code 1.20 (code/res-ann.py): Python code implementing a fully-connected
ResNet in PyTorch. The implemented model here corresponds to a fully-connected
ResNet $(\Phi, V)$ where $l_0 = 3$, $l_1 = 10$, $l_2 = 20$, $l_3 = 10$, $l_4 = 1$,
$\Phi = ((W_1, B_1), (W_2, B_2), (W_3, B_3), (W_4, B_4)) \in \bigl( \mathop{\times}_{k=1}^{4} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \bigr)$, $S = \{(1, 3)\}$,
$V = (V_{r,k})_{(r,k) \in S} \in \bigl( \mathop{\times}_{(r,k) \in S} \mathbb{R}^{l_k \times l_r} \bigr)$, and $V_{1,3} = \mathrm{I}_{10}$ (cf. Definition 1.5.5).

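Example 1.5.6 (Example for Definition 1.5.2). Let $l_0 = 1$, $l_1 = 1$, $l_2 = 2$, $l_3 = 2$, $l_4 = 1$,
$S = \{(0, 4)\}$, let
\[
\Phi = ((W_1, B_1), (W_2, B_2), (W_3, B_3), (W_4, B_4)) \in \Bigl( \mathop{\times}_{k=1}^{4} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \Bigr)
\tag{1.146}
\]
satisfy
\[
W_1 = \begin{pmatrix} 1 \end{pmatrix}, \quad
B_1 = \begin{pmatrix} 0 \end{pmatrix}, \quad
W_2 = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \quad
B_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix},
\tag{1.147}
\]
\[
W_3 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad
B_3 = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad
W_4 = \begin{pmatrix} 2 & 2 \end{pmatrix}, \quad \text{and} \quad
B_4 = \begin{pmatrix} 1 \end{pmatrix},
\tag{1.148}
\]
and let $V = (V_{r,k})_{(r,k) \in S} \in \bigl( \mathop{\times}_{(r,k) \in S} \mathbb{R}^{l_k \times l_r} \bigr)$ satisfy
\[
V_{0,4} = \begin{pmatrix} -1 \end{pmatrix}.
\tag{1.149}
\]
Then
\[
\bigl(\mathcal{R}^{\mathbf{R}}_{\mathfrak{r}}(\Phi, V)\bigr)(5) = 28
\tag{1.150}
\]
(cf. Definitions 1.2.4 and 1.5.4).
Proof for Example 1.5.6. Throughout this proof, let $x_0 \in \mathbb{R}^1$, $x_1 \in \mathbb{R}^1$, $x_2 \in \mathbb{R}^2$, $x_3 \in \mathbb{R}^2$,
$x_4 \in \mathbb{R}^1$ satisfy for all $k \in \{1, 2, 3, 4\}$ that $x_0 = 5$ and
\[
x_k = \mathfrak{M}_{\mathfrak{r} \mathbf{1}_{(0,4)}(k) + \mathrm{id}_{\mathbb{R}} \mathbf{1}_{\{4\}}(k),\, l_k}
\Bigl( W_k x_{k-1} + B_k + \sum_{r \in \mathbb{N}_0,\, (r,k) \in S} V_{r,k}\, x_r \Bigr).
\tag{1.151}
\]
Observe that (1.151) assures that
\[
\bigl(\mathcal{R}^{\mathbf{R}}_{\mathfrak{r}}(\Phi, V)\bigr)(5) = x_4.
\tag{1.152}
\]
Next note that (1.151) ensures that
\[
x_1 = \mathfrak{M}_{\mathfrak{r},1}(W_1 x_0 + B_1) = \mathfrak{M}_{\mathfrak{r},1}(5) = 5,
\tag{1.153}
\]
\[
x_2 = \mathfrak{M}_{\mathfrak{r},2}(W_2 x_1 + B_2)
= \mathfrak{M}_{\mathfrak{r},2}\Bigl( \begin{pmatrix} 1 \\ 2 \end{pmatrix} 5 + \begin{pmatrix} 0 \\ 1 \end{pmatrix} \Bigr)
= \mathfrak{M}_{\mathfrak{r},2}\begin{pmatrix} 5 \\ 11 \end{pmatrix}
= \begin{pmatrix} 5 \\ 11 \end{pmatrix},
\tag{1.154}
\]
\[
x_3 = \mathfrak{M}_{\mathfrak{r},2}(W_3 x_2 + B_3)
= \mathfrak{M}_{\mathfrak{r},2}\Bigl( \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} 5 \\ 11 \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \end{pmatrix} \Bigr)
= \mathfrak{M}_{\mathfrak{r},2}\begin{pmatrix} 5 \\ 11 \end{pmatrix}
= \begin{pmatrix} 5 \\ 11 \end{pmatrix},
\tag{1.155}
\]
\[
\begin{split}
\text{and} \qquad x_4 &= \mathfrak{M}_{\mathrm{id}_{\mathbb{R}},1}(W_4 x_3 + B_4 + V_{0,4}\, x_0) \\
&= \mathfrak{M}_{\mathrm{id}_{\mathbb{R}},1}\Bigl( \begin{pmatrix} 2 & 2 \end{pmatrix} \begin{pmatrix} 5 \\ 11 \end{pmatrix} + \begin{pmatrix} 1 \end{pmatrix} + \begin{pmatrix} -1 \end{pmatrix} 5 \Bigr)
= \mathfrak{M}_{\mathrm{id}_{\mathbb{R}},1}(28) = 28.
\end{split}
\tag{1.156}
\]
This and (1.152) establish (1.150). The proof for Example 1.5.6 is thus complete.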
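Exercise 1.5.1. Let $l_0 = 1$, $l_1 = 2$, $l_2 = 3$, $l_3 = 1$, $S = \{(0, 3), (1, 3)\}$, let
\[
\Phi = ((W_1, B_1), (W_2, B_2), (W_3, B_3)) \in \Bigl( \mathop{\times}_{k=1}^{3} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \Bigr)
\tag{1.157}
\]
satisfy
\[
W_1 = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \quad
B_1 = \begin{pmatrix} 3 \\ 4 \end{pmatrix}, \quad
W_2 = \begin{pmatrix} -1 & 2 \\ 3 & -4 \\ -5 & 6 \end{pmatrix}, \quad
B_2 = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix},
\tag{1.158}
\]
\[
W_3 = \begin{pmatrix} -1 & 1 & -1 \end{pmatrix}, \quad \text{and} \quad
B_3 = \begin{pmatrix} -4 \end{pmatrix},
\tag{1.159}
\]
and let $V = (V_{r,k})_{(r,k) \in S} \in \bigl( \mathop{\times}_{(r,k) \in S} \mathbb{R}^{l_k \times l_r} \bigr)$ satisfy
\[
V_{0,3} = \begin{pmatrix} 1 \end{pmatrix} \quad \text{and} \quad
V_{1,3} = \begin{pmatrix} 3 & -2 \end{pmatrix}.
\tag{1.160}
\]
Prove or disprove the following statement: It holds that
\[
\bigl(\mathcal{R}^{\mathbf{R}}_{\mathfrak{r}}(\Phi, V)\bigr)(-1) = 0
\tag{1.161}
\]
(cf. Definitions 1.2.4 and 1.5.4).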


1.6 Recurrent ANNs (RNNs)


In this section we review RNNs, a type of ANNs designed to take sequences of data points
as inputs. Roughly speaking, unlike in feedforward ANNs where an input is processed by
the successive application of a series of different parametric functions (cf. Definitions 1.1.3,
1.3.4, 1.4.5, and 1.5.4 above), in RNNs an input sequence is processed by a repeated
application of the same parametric function whereby after the first application, each
subsequent application of the parametric function takes as input a new element of the input
sequence and a partial output from the previous application of the parametric function.
The output of an RNN is then given by a sequence of partial outputs coming from the
repeated applications of the parametric function (see Definition 1.6.2 below for a precise
description of RNNs and cf., for instance, [4, Section 12.7], [60, Chapter 17], [63, Chapter 5],
and [164, Chapter 10] for other introductions to RNNs).
The repeatedly applied parametric function in an RNN is typically called an RNN node
and any RNN architecture is determined by specifying the architecture of the corresponding
RNN node. We review a simple variant of such RNN nodes and the corresponding RNNs in
Section 1.6.2 in detail and we briefly address one of the most commonly used RNN nodes,
the so-called long short-term memory (LSTM) node, in Section 1.6.3.
There is a wide range of application areas where sequential data are considered and
RNN-based deep learning methods are being employed and developed. Examples of such
application areas are NLP including language translation (cf., for example, [11, 76, 77, 388]
and the references therein), language generation (cf., for instance, [51, 169, 238, 340] and
the references therein), and speech recognition (cf., for example, [6, 81, 170, 172, 360] and
the references therein), time series prediction analysis including stock market prediction
(cf., for instance, [130, 133, 372, 376] and the references therein) and weather prediction (cf.,
for example, [352, 375, 407] and the references therein) and video analysis (cf., for instance,
[108, 235, 307, 401] and the references therein).

1.6.1 Description of RNNs


Definition 1.6.1 (Function unrolling). Let $X, Y, I$ be sets, let $f \colon X \times I \to Y \times I$ be a
function, and let $T \in \mathbb{N}$, $\mathbb{I} \in I$. Then we denote by $R_{f,T,\mathbb{I}} \colon X^T \to Y^T$ the function which
satisfies for all $x_1, x_2, \ldots, x_T \in X$, $y_1, y_2, \ldots, y_T \in Y$, $i_0, i_1, \ldots, i_T \in I$ with $i_0 = \mathbb{I}$ and
$\forall\, t \in \{1, 2, \ldots, T\} \colon (y_t, i_t) = f(x_t, i_{t-1})$ that
\[
R_{f,T,\mathbb{I}}(x_1, x_2, \ldots, x_T) = (y_1, y_2, \ldots, y_T)
\tag{1.162}
\]
and we call $R_{f,T,\mathbb{I}}$ the $T$-times unrolled function $f$ with initial information $\mathbb{I}$.
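Function unrolling as in Definition 1.6.1 amounts to a loop that threads the information state through repeated applications of f. The following sketch in plain Python is meant as an illustration only; the names `unroll` and `running_sum_node` are hypothetical, and the running sum serves merely as a simple example of a function f from X × I to Y × I.

```python
def unroll(f, T, initial_info):
    """Return the T-times unrolled function of f with the given initial information."""

    def unrolled(xs):
        if len(xs) != T:
            raise ValueError("expected exactly T inputs")
        info = initial_info  # i_0
        ys = []
        for x in xs:
            # (y_t, i_t) = f(x_t, i_{t-1})
            y, info = f(x, info)
            ys.append(y)
        return tuple(ys)

    return unrolled


def running_sum_node(x, info):
    """Example node: the information state is the sum of all inputs so far."""
    total = info + x
    return total, total


running_sum = unroll(running_sum_node, 4, 0)
print(running_sum((1, 2, 3, 4)))  # (1, 3, 6, 10)
```

The same function f is applied in every step; only the input element and the information state change from one application to the next.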
Definition 1.6.2 (Description of RNNs). Let $X, Y, I$ be sets, let $d, T \in \mathbb{N}$, $\theta \in \mathbb{R}^d$, $\mathbb{I} \in I$,
and let $N = (N_\vartheta)_{\vartheta \in \mathbb{R}^d} \colon \mathbb{R}^d \times X \times I \to Y \times I$ be a function. Then we call $R$ the realization
function of the $T$-step unrolled RNN with RNN node $N$, parameter vector $\theta$, and initial
information $\mathbb{I}$ (we call $R$ the realization of the $T$-step unrolled RNN with RNN node $N$,
parameter vector $\theta$, and initial information $\mathbb{I}$) if and only if
\[
R = R_{N_\theta, T, \mathbb{I}}
\tag{1.163}
\]
(cf. Definition 1.6.1).

1.6.2 Vectorized description of simple fully-connected RNNs


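Definition 1.6.3 (Vectorized description of simple fully-connected RNN nodes). Let
$\mathfrak{x}, \mathfrak{y}, \mathfrak{i} \in \mathbb{N}$, $\theta \in \mathbb{R}^{(\mathfrak{x}+\mathfrak{i}+1)\mathfrak{i}+(\mathfrak{i}+1)\mathfrak{y}}$ and let $\Psi_1 \colon \mathbb{R}^{\mathfrak{i}} \to \mathbb{R}^{\mathfrak{i}}$ and $\Psi_2 \colon \mathbb{R}^{\mathfrak{y}} \to \mathbb{R}^{\mathfrak{y}}$ be functions. Then we
call $r$ the realization function of the simple fully-connected RNN node with parameter vector
$\theta$ and activation functions $\Psi_1$ and $\Psi_2$ (we call $r$ the realization of the simple fully-connected
RNN node with parameter vector $\theta$ and activations $\Psi_1$ and $\Psi_2$) if and only if it holds that
$r \colon \mathbb{R}^{\mathfrak{x}} \times \mathbb{R}^{\mathfrak{i}} \to \mathbb{R}^{\mathfrak{y}} \times \mathbb{R}^{\mathfrak{i}}$ is the function from $\mathbb{R}^{\mathfrak{x}} \times \mathbb{R}^{\mathfrak{i}}$ to $\mathbb{R}^{\mathfrak{y}} \times \mathbb{R}^{\mathfrak{i}}$ which satisfies for all $x \in \mathbb{R}^{\mathfrak{x}}$,
$i \in \mathbb{R}^{\mathfrak{i}}$ that
\[
r(x, i) = \Bigl( \bigl( \Psi_2 \circ \mathcal{A}^{\theta,\,(\mathfrak{x}+\mathfrak{i}+1)\mathfrak{i}}_{\mathfrak{y},\,\mathfrak{i}} \circ \Psi_1 \circ \mathcal{A}^{\theta,\,0}_{\mathfrak{i},\,\mathfrak{x}+\mathfrak{i}} \bigr)(x, i),\;
\bigl( \Psi_1 \circ \mathcal{A}^{\theta,\,0}_{\mathfrak{i},\,\mathfrak{x}+\mathfrak{i}} \bigr)(x, i) \Bigr)
\tag{1.164}
\]
(cf. Definition 1.1.1).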
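To make the parameter bookkeeping in (1.164) concrete, the following NumPy sketch builds such a node from a flat parameter vector. It rests on an assumption about the affine functions from Definition 1.1.1 that is not restated in this section, namely that the weight matrix is read row by row from the parameter vector starting at the given offset and that the bias entries follow directly after it. The names `affine` and `simple_rnn_node` as well as the numerical values are hypothetical.

```python
import numpy as np


def affine(theta, v, m, n, x):
    """Affine map R^n -> R^m read from theta at offset v.

    Assumed layout: m * n weights in row-major order, followed by m biases.
    """
    W = np.asarray(theta[v:v + m * n], dtype=float).reshape(m, n)
    b = np.asarray(theta[v + m * n:v + m * n + m], dtype=float)
    return W @ x + b


def simple_rnn_node(theta, dim_x, dim_y, dim_i, psi1, psi2):
    """Return r: (x, i) -> (y, new_i) following the structure of (1.164)."""
    expected = (dim_x + dim_i + 1) * dim_i + (dim_i + 1) * dim_y
    if len(theta) != expected:
        raise ValueError(f"theta must have {expected} entries")

    def r(x, i):
        # The first affine map acts on the concatenation of input and information.
        xi = np.concatenate([np.atleast_1d(x), np.atleast_1d(i)]).astype(float)
        new_info = psi1(affine(theta, 0, dim_i, dim_x + dim_i, xi))
        # The second affine map starts after the (x + i + 1) * i first parameters.
        offset = (dim_x + dim_i + 1) * dim_i
        y = psi2(affine(theta, offset, dim_y, dim_i, new_info))
        return y, new_info

    return r


def relu(z):
    return np.maximum(z, 0.0)


def identity(z):
    return z


# Hypothetical node with one-dimensional input, output, and information.
node = simple_rnn_node([1.0, 0.5, 0.0, 2.0, 1.0], 1, 1, 1, relu, identity)
y, info = node(3.0, 4.0)
print(y, info)  # y has the single entry 11, info the single entry 5
```

In this example the new information is relu(1 * 3 + 0.5 * 4 + 0) = 5 and the output is 2 * 5 + 1 = 11, so that the second component of the node's output is exactly the quantity that is passed on to the next application of the node.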

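Definition 1.6.4 (Vectorized description of simple fully-connected RNNs). Let $\mathfrak{x}, \mathfrak{y}, \mathfrak{i}, T \in \mathbb{N}$,
$\theta \in \mathbb{R}^{(\mathfrak{x}+\mathfrak{i}+1)\mathfrak{i}+(\mathfrak{i}+1)\mathfrak{y}}$, $\mathbb{I} \in \mathbb{R}^{\mathfrak{i}}$ and let $\Psi_1 \colon \mathbb{R}^{\mathfrak{i}} \to \mathbb{R}^{\mathfrak{i}}$ and $\Psi_2 \colon \mathbb{R}^{\mathfrak{y}} \to \mathbb{R}^{\mathfrak{y}}$ be functions. Then we call
$R$ the realization function of the $T$-step unrolled simple fully-connected RNN with parameter
vector $\theta$, activation functions $\Psi_1$ and $\Psi_2$, and initial information $\mathbb{I}$ (we call $R$ the realization
of the $T$-step unrolled simple fully-connected RNN with parameter vector $\theta$, activations $\Psi_1$
and $\Psi_2$, and initial information $\mathbb{I}$) if and only if there exists $r \colon \mathbb{R}^{\mathfrak{x}} \times \mathbb{R}^{\mathfrak{i}} \to \mathbb{R}^{\mathfrak{y}} \times \mathbb{R}^{\mathfrak{i}}$ such that

(i) it holds that $r$ is the realization of the simple fully-connected RNN node with parameters
$\theta$ and activations $\Psi_1$ and $\Psi_2$ and

(ii) it holds that
\[
R = R_{r,T,\mathbb{I}}
\tag{1.165}
\]

(cf. Definitions 1.6.1 and 1.6.3).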

Lemma 1.6.5. Let $\mathfrak{x}, \mathfrak{y}, \mathfrak{i}, d, T \in \mathbb{N}$, $\theta \in \mathbb{R}^d$, $\mathbb{I} \in \mathbb{R}^{\mathfrak{i}}$ satisfy $d = (\mathfrak{x} + \mathfrak{i} + 1)\mathfrak{i} + (\mathfrak{i} + 1)\mathfrak{y}$, let
$\Psi_1 \colon \mathbb{R}^{\mathfrak{i}} \to \mathbb{R}^{\mathfrak{i}}$ and $\Psi_2 \colon \mathbb{R}^{\mathfrak{y}} \to \mathbb{R}^{\mathfrak{y}}$ be functions, and let $N = (N_\vartheta)_{\vartheta \in \mathbb{R}^d} \colon \mathbb{R}^d \times \mathbb{R}^{\mathfrak{x}} \times \mathbb{R}^{\mathfrak{i}} \to \mathbb{R}^{\mathfrak{y}} \times \mathbb{R}^{\mathfrak{i}}$
satisfy for all $\vartheta \in \mathbb{R}^d$ that $N_\vartheta$ is the realization of the simple fully-connected RNN node with
parameter vector $\vartheta$ and activations $\Psi_1$ and $\Psi_2$ (cf. Definition 1.6.3). Then the following
two statements are equivalent:

(i) It holds that $R$ is the realization of the $T$-step unrolled simple fully-connected RNN
with parameter vector $\theta$, activations $\Psi_1$ and $\Psi_2$, and initial information $\mathbb{I}$ (cf. Definition 1.6.4).
(ii) It holds that $R$ is the realization of the $T$-step unrolled RNN with RNN node $N$,
parameter vector $\theta$, and initial information $\mathbb{I}$ (cf. Definition 1.6.2).
Proof of Lemma 1.6.5. Observe that (1.163) and (1.165) ensure that ((i) ↔ (ii)). The proof
of Lemma 1.6.5 is thus complete.
Exercise 1.6.1. For every $T \in \mathbb{N}$, $\alpha \in (0, 1)$ let $R_{T,\alpha}$ be the realization of the $T$-step
unrolled simple fully-connected RNN with parameter vector $(1, 0, 0, \alpha, 0, 1 - \alpha, 0, 0, -1, 1, 0)$,
activations $\mathfrak{M}_{\mathfrak{r},2}$ and $\mathrm{id}_{\mathbb{R}}$, and initial information $(0, 0)$ (cf. Definitions 1.2.1, 1.2.4, and
1.6.4). For every $T \in \mathbb{N}$, $\alpha \in (0, 1)$ specify $R_{T,\alpha}(1, 1, \ldots, 1)$ explicitly and prove that your
result is correct!

1.6.3 Long short-term memory (LSTM) RNNs


In this section we briefly discuss a very popular type of RNN nodes called LSTM nodes and
the corresponding RNNs called LSTM networks which were introduced in Hochreiter &
Schmidhuber [201]. Loosely speaking, LSTM nodes were invented to tackle
the issue that most RNNs based on simple RNN nodes, such as the simple fully-connected
RNN nodes in Section 1.6.2 above, struggle to learn to understand long-term dependencies
in sequences of data (cf., for example, [30, 328]). Roughly speaking, an RNN processes
an input sequence by repeatedly applying an RNN node to a tuple consisting of a new
element of the input sequence and a partial output of the previous application of the RNN
node (see Definition 1.6.2 above for a precise description of RNNs). Therefore, the only
information on previously processed elements of the input sequence that any application
of an RNN node has access to, is the information encoded in the output produced by the
last application of the RNN node. For this reason, RNNs can be seen as only having a
short-term memory. The LSTM architecture, however, is designed with the aim of facilitating
the transmission of long-term information within this short-term memory. LSTM networks
can thus be seen as having a sort of long short-term memory.
For a precise definition of LSTM networks we refer to the original article Hochreiter &
Schmidhuber [201] and, for instance, to the excellent explanations in [133, 169, 319]. For a
few selected references on LSTM networks in the literature we refer, for example, to [11, 77,
133, 147, 148, 169, 171–174, 288, 330, 360, 367, 388, 425] and the references therein.

1.7 Further types of ANNs


In this section we present a selection of references and some rough comments on a couple of
further popular types of ANNs in the literature which were not discussed in the previous
sections of this chapter above.

1.7.1 ANNs with encoder-decoder architectures: autoencoders


In this section we discuss the idea of autoencoders which are based on encoder-decoder
ANN architectures. Roughly speaking, the goal of autoencoders is to learn a simplified
representation of data points and a way to closely reconstruct the original data points
from the simplified representation. The simplified representation of data points is usually
called the encoding and is obtained by applying an encoder ANN to the data points. The
approximate reconstruction of the original data points from the encoded representations is,
in turn, called the decoding and is obtained by applying a decoder ANN to the encoded
representations. The composition of the encoder ANN with the decoder ANN is called the
autoencoder. In the simplest situations the encoder ANN and decoder ANN are trained to
perform their respective desired functions by training the full autoencoder to be as close to
the identity mapping on the data points as possible.
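The encoder-decoder structure and the training objective described above can be summarized in a few lines of code. The following NumPy sketch is purely illustrative: the dimensions, the single-layer encoder and decoder, and the random (untrained) parameters are assumptions made for this example, and no training procedure is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 8-dimensional data points, 3-dimensional encodings.
DATA_DIM, CODE_DIM = 8, 3

# Untrained, randomly initialized parameters (for illustration only).
W_enc = rng.normal(size=(CODE_DIM, DATA_DIM))
b_enc = np.zeros(CODE_DIM)
W_dec = rng.normal(size=(DATA_DIM, CODE_DIM))
b_dec = np.zeros(DATA_DIM)


def encoder(x):
    """Map a data point to its encoding (one affine map followed by ReLU)."""
    return np.maximum(W_enc @ x + b_enc, 0.0)


def decoder(z):
    """Map an encoding back to the data space (one affine map)."""
    return W_dec @ z + b_dec


def autoencoder(x):
    """Apply the encoder first and the decoder afterwards."""
    return decoder(encoder(x))


def reconstruction_loss(batch):
    """Mean squared distance between data points and their reconstructions."""
    return float(np.mean([np.sum((autoencoder(x) - x) ** 2) for x in batch]))


batch = rng.normal(size=(16, DATA_DIM))
loss = reconstruction_loss(batch)
print(encoder(batch[0]).shape, autoencoder(batch[0]).shape, loss)
```

Training an autoencoder in the simplest setting described above would correspond to minimizing a quantity such as `reconstruction_loss` over the encoder and decoder parameters.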
A large number of different architectures and training procedures for autoencoders have
been proposed in the literature. In the following we list a selection of a few popular ideas
from the scientific literature.

• We refer, for instance, to [49, 198, 200, 253, 356] for foundational references introducing
and refining the idea of autoencoders,

• we refer, for example, to [402, 403, 416] for so-called denoising autoencoders which
add random perturbations to the input data in the training of autoencoders,

• we refer, for instance, to [51, 107, 246] for so-called variational autoencoders which
use techniques from Bayesian statistics in the training of autoencoders,

• we refer, for example, to [294, 349] for autoencoders involving convolutions, and

• we refer, for instance, to [118, 292] for adversarial autoencoders which combine the
principles of autoencoders with the paradigm of generative adversarial networks (see
Goodfellow et al. [165]).

1.7.2 Transformers and the attention mechanism


In Section 1.6 we reviewed RNNs which are a type of ANNs designed to take sequences
of data points as inputs. Very roughly speaking, RNNs process a sequence of data points
by sequentially processing one data point of the sequence after the other and thereby
constantly updating an information state encoding previously processed information (see
Section 1.6.1 above for a precise description of RNNs). When processing a data point of the
sequence, any information coming from earlier data points is thus only available to the RNN
through the information state passed on from the previous processing step of the RNN.
Consequently, it can be hard for RNNs to learn to understand long-term dependencies in
the input sequence. In Section 1.6.3 above, we briefly discussed the LSTM architecture for
RNNs which is an architecture for RNNs aimed at giving such RNNs the capacity to indeed
learn to understand such long-term dependencies.
Another approach in the literature to design ANN architectures which process sequential
data and are capable of efficiently learning to understand long-term dependencies in data
sequences is called the attention mechanism. Very roughly speaking, in the context of
sequences of data, the attention mechanism aims to give ANNs the capacity to "pay
attention" to selected parts of the entire input sequence when they are processing a data
point of the sequence. The idea for using attention mechanisms in ANNs was first introduced
in Bahdanau et al. [11] in the context of RNNs trained for machine translation. In this
context the proposed ANN architecture still processes the input sequence sequentially;
however, past information is not only available through the information state from the
previous processing step, but also through the attention mechanism, which can directly
extract information from data points far away from the data point being processed.
Likely the most famous ANNs based on the attention mechanism do, however, not involve
any recurrent elements and have been named Transformer ANNs by the authors of the
seminal paper Vaswani et al. [397] called "Attention is all you need". Roughly speaking,
Transformer ANNs are designed to process sequences of data by considering the entire input
sequence at once and relying only on the attention mechanism to understand dependencies
between the data points in the sequence. Transformer ANNs are the basis for many recently
very successful large language models (LLMs), such as generative pre-trained transformers
(GPTs) in [54, 320, 341, 342] which are the models behind the famous ChatGPT application,
Bidirectional Encoder Representations from Transformers (BERT) models in Devlin et
al. [104], and many others (cf., for example, [91, 267, 343, 418, 422] and the references
therein).
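The attention mechanism is not defined formally in this section. As a rough illustration, the following NumPy sketch implements the scaled dot-product attention used in Transformer ANNs, in which every query vector is compared with all key vectors and the resulting softmax weights are used to average the value vectors. The function names and the toy data are assumptions made for this example, and the learned linear maps that produce queries, keys, and values in an actual Transformer are omitted.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Return the attended values and the attention weights.

    Q: queries of shape (n_queries, d_k); K: keys of shape (n_keys, d_k);
    V: values of shape (n_keys, d_v).
    """
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights


# Toy self-attention: the sequence itself serves as queries, keys, and values.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
attended, weights = scaled_dot_product_attention(X, X, X)
print(attended.shape, weights.sum(axis=1))
```

Each row of `weights` is a probability distribution over all positions of the sequence, which is how every output position can draw on information from arbitrarily distant data points in a single step.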
Beyond the NLP applications for which Transformers and attention mechanisms have
been introduced, similar ideas have been employed in several other areas, such as computer
vision (cf., for instance, [109, 240, 278, 404]), protein structure prediction (cf., for example,
[232]), multimodal learning (cf., for instance, [283]), and long sequence time-series forecasting
(cf., for example, [441]). Moreover, we refer, for instance, to [81, 288], [157, Chapter 17],
and [164, Section 12.4.5.1] for explorations and explanations of the attention mechanism in
the literature.

1.7.3 Graph neural networks (GNNs)


All ANNs reviewed in the previous sections of this book are designed to take real-valued
vectors or sequences of real-valued vectors as inputs. However, there are several learning
problems based on data, such as social network data or molecular data, that are not
optimally represented by real-valued vectors but are better represented by graphs (see,
for example, West [411] for an introduction to graphs). As a consequence, many ANN
architectures which can process graphs as inputs, so-called graph neural networks (GNNs),
have been introduced in the literature.

• We refer, for instance, to [362, 415, 439, 442] for overview articles on GNNs,

• we refer, for example, to [166, 366] for foundational articles for GNNs,

• we refer, for instance, to [399, 426] for applications of attention mechanisms (cf.
Section 1.7.2 above) to GNNs,

• we refer, for example, to [55, 95, 412, 424] for GNNs involving convolutions on graphs,
and

• we refer, for instance, to [16, 151, 361, 368, 414] for applications of GNNs to problems
from the natural sciences.

1.7.4 Neural operators


In this section we review a few popular ANN-type architectures employed in operator
learning. Roughly speaking, in operator learning one is not interested in learning a map
between finite-dimensional Euclidean spaces, but in learning a map from a space of functions
to a space of functions. Such a map between (typically infinite-dimensional) vector spaces
is usually called an operator. An example of such a map is the solution operator of an
evolutionary PDE which maps the initial condition of the PDE to the corresponding
terminal value of the PDE. To approximate/learn operators it is necessary to develop
parametrized families of operators, objects which we refer to as neural operators. Many
different architectures for such neural operators have been proposed in the literature, some
of which we now list in the next paragraphs.
Among the most successful neural operator architectures are the so-called Fourier neural
operators (FNOs) introduced in Li et al. [271] (cf. also Kovachki et al. [252]). Very roughly
speaking, FNOs are parametric maps on function spaces, which involve transformations on
function values as well as on Fourier coefficients. FNOs have been derived based on the
neural operators introduced in Li et al. [270, 272] which are based on integral transformations
with parametric integration kernels. We refer, for example, to [53, 251, 269, 410] and the
references therein for extensions and theoretical results on FNOs.
A simple and successful architecture for neural operators, which is based on a universal
approximation theorem for neural operators, are the deep operator networks (deepONets)
introduced in Lu et al. [284]. Roughly speaking, a deepONet consists of two ANNs that take
as input the evaluation point of the output space and input function values at predetermined
"sensor" points respectively, and that are joined together by a scalar product to produce
the output of the deepONet. We refer, for instance, to [115, 167, 249, 261, 276, 297, 335,
392, 406, 413, 432] for extensions and theoretical results on deepONets. For a comparison
between deepONets and FNOs we refer, for example, to Lu et al. [285].
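The rough description of a deepONet above can be illustrated by the following minimal sketch. It is not the implementation of Lu et al. [284]: the helper names `mlp`, `random_mlp`, and `deeponet`, the number of sensors `m`, the latent width `p`, the tanh activation, and the random (untrained) weights are all assumptions made purely for illustration.

```python
import numpy as np

def mlp(params, x, act=np.tanh):
    """Realization of a fully-connected feedforward ANN given as a list of (W, b)."""
    for W, b in params[:-1]:
        x = act(W @ x + b)
    W, b = params[-1]
    return W @ x + b

def random_mlp(dims, rng):
    """Untrained ANN with layer dimensions dims (illustration only)."""
    return [(rng.standard_normal((dims[k + 1], dims[k])),
             rng.standard_normal(dims[k + 1]))
            for k in range(len(dims) - 1)]

rng = np.random.default_rng(0)
m, p = 16, 8                            # assumed: number of sensors, latent width
sensors = np.linspace(0.0, 1.0, m)      # assumed sensor points in [0, 1]
branch = random_mlp([m, 32, p], rng)    # acts on (u(s_1), ..., u(s_m))
trunk = random_mlp([1, 32, p], rng)     # acts on the evaluation point y

def deeponet(u, y):
    """Scalar product of the branch output and the trunk output."""
    return float(mlp(branch, u(sensors)) @ mlp(trunk, np.array([y])))

value = deeponet(np.sin, 0.3)           # output for input function sin at y = 0.3
```

In this sketch only the sampled values of the input function at the sensor points enter the branch ANN, which is what makes the architecture applicable to inputs from a function space.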
A further natural approach is to employ CNNs (see Section 1.4) to develop neural
operator architectures. We refer, for instance, to [185, 192, 244, 350, 443] for such CNN-
based neural operators. Finally, we refer, for example, to [67, 94, 98, 135, 136, 227, 273,
277, 301, 344, 369, 419] for further neural operator architectures and theoretical results for
neural operators.

Chapter 2

ANN calculus

In this chapter we review certain operations that can be performed on the set of
fully-connected feedforward ANNs such as compositions (see Section 2.1), parallelizations (see
Section 2.2), scalar multiplications (see Section 2.3), and sums (see Section 2.4) and thereby
review an appropriate calculus for fully-connected feedforward ANNs. The operations and
the calculus for fully-connected feedforward ANNs presented in this chapter will be used in
Chapters 3 and 4 to establish certain ANN approximation results.
In the literature such operations on ANNs and such a calculus on ANNs have been
used in many research articles such as [128, 159, 180, 181, 184, 228, 321, 329, 333] and the
references therein. The specific presentation of this chapter is based on Grohs et al. [180,
181].

2.1 Compositions of fully-connected feedforward ANNs


2.1.1 Compositions of fully-connected feedforward ANNs
Definition 2.1.1 (Composition of ANNs). We denote by

(·) • (·) : {(Φ, Ψ) ∈ N × N : I(Φ) = O(Ψ)} → N (2.1)

the function which satisfies for all Φ, Ψ ∈ N, k ∈ {1, 2, . . . , L(Φ) + L(Ψ) − 1} with
I(Φ) = O(Ψ) that L(Φ • Ψ) = L(Φ) + L(Ψ) − 1 and

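\[
(W_{k,\Phi\bullet\Psi}, B_{k,\Phi\bullet\Psi}) =
\begin{cases}
(W_{k,\Psi}, B_{k,\Psi}) & : k < \mathcal{L}(\Psi) \\
(W_{1,\Phi} W_{\mathcal{L}(\Psi),\Psi},\; W_{1,\Phi} B_{\mathcal{L}(\Psi),\Psi} + B_{1,\Phi}) & : k = \mathcal{L}(\Psi) \\
(W_{k-\mathcal{L}(\Psi)+1,\Phi}, B_{k-\mathcal{L}(\Psi)+1,\Phi}) & : k > \mathcal{L}(\Psi)
\end{cases}
\tag{2.2}
\]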

(cf. Definition 1.3.1).
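The composition in (2.2) can be tried out numerically with the following minimal sketch. It assumes that an ANN is stored as a Python list of NumPy pairs (W_k, B_k); the helper names `random_ann`, `realize`, and `compose` as well as the choice of tanh as activation are illustrative assumptions and not part of the text.

```python
import numpy as np

def random_ann(dims, rng):
    """ANN as a list of (W_k, B_k) with layer dimensions dims = (l_0, ..., l_L)."""
    return [(rng.standard_normal((dims[k + 1], dims[k])),
             rng.standard_normal(dims[k + 1]))
            for k in range(len(dims) - 1)]

def realize(ann, x, act=np.tanh):
    """Realization: activation after every layer except the last one."""
    for W, B in ann[:-1]:
        x = act(W @ x + B)
    W, B = ann[-1]
    return W @ x + B

def compose(phi, psi):
    """Composition phi • psi as in (2.2): the last affine map of psi is merged
    with the first affine map of phi."""
    W1, B1 = phi[0]
    WL, BL = psi[-1]
    assert W1.shape[1] == WL.shape[0], "need I(phi) = O(psi)"
    return psi[:-1] + [(W1 @ WL, W1 @ BL + B1)] + phi[1:]

rng = np.random.default_rng(0)
psi = random_ann([3, 5, 4], rng)
phi = random_ann([4, 6, 2], rng)
comp = compose(phi, psi)
x = rng.standard_normal(3)
direct = realize(phi, realize(psi, x))   # realization of phi after realization of psi
merged = realize(comp, x)                # realization of the composed ANN
```

Up to floating point rounding, `direct` and `merged` agree, and `len(comp)` equals `len(phi) + len(psi) - 1`.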


2.1.2 Elementary properties of compositions of fully-connected feedforward ANNs
Proposition 2.1.2 (Properties of standard compositions of fully-connected feedforward
ANNs). Let Φ, Ψ ∈ N satisfy I(Φ) = O(Ψ) (cf. Definition 1.3.1). Then
(i) it holds that

D(Φ • Ψ) = (D0 (Ψ), D1 (Ψ), . . . , DH(Ψ) (Ψ), D1 (Φ), D2 (Φ), . . . , DL(Φ) (Φ)), (2.3)

(ii) it holds that


[L(Φ • Ψ) − 1] = [L(Φ) − 1] + [L(Ψ) − 1], (2.4)

(iii) it holds that


H(Φ • Ψ) = H(Φ) + H(Ψ), (2.5)

(iv) it holds that

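\[
\begin{aligned}
\mathcal{P}(\Phi \bullet \Psi) &= \mathcal{P}(\Phi) + \mathcal{P}(\Psi) + \mathbb{D}_1(\Phi)(\mathbb{D}_{\mathcal{L}(\Psi)-1}(\Psi) + 1) \\
&\quad - \mathbb{D}_1(\Phi)(\mathbb{D}_0(\Phi) + 1) - \mathbb{D}_{\mathcal{L}(\Psi)}(\Psi)(\mathbb{D}_{\mathcal{L}(\Psi)-1}(\Psi) + 1) \\
&\le \mathcal{P}(\Phi) + \mathcal{P}(\Psi) + \mathbb{D}_1(\Phi)\,\mathbb{D}_{\mathcal{H}(\Psi)}(\Psi),
\end{aligned}
\tag{2.6}
\]

and

(v) it holds for all a ∈ C(R, R) that $\mathcal{R}^{\mathbf{N}}_a(\Phi \bullet \Psi) \in C(\mathbb{R}^{\mathcal{I}(\Psi)}, \mathbb{R}^{\mathcal{O}(\Phi)})$ and

\[
\mathcal{R}^{\mathbf{N}}_a(\Phi \bullet \Psi) = [\mathcal{R}^{\mathbf{N}}_a(\Phi)] \circ [\mathcal{R}^{\mathbf{N}}_a(\Psi)]
\tag{2.7}
\]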

(cf. Definitions 1.3.4 and 2.1.1).


Proof of Proposition 2.1.2. Throughout this proof, let L = L(Φ • Ψ) and for every a ∈
C(R, R) let

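\[
\begin{aligned}
X_a = \Big\{ & x = (x_0, x_1, \dots, x_L) \in \mathbb{R}^{\mathbb{D}_0(\Phi\bullet\Psi)} \times \mathbb{R}^{\mathbb{D}_1(\Phi\bullet\Psi)} \times \dots \times \mathbb{R}^{\mathbb{D}_L(\Phi\bullet\Psi)} \colon \\
& \big( \forall\, k \in \{1, 2, \dots, L\} \colon x_k = \mathfrak{M}_{a \mathbf{1}_{(0,L)}(k) + \operatorname{id}_{\mathbb{R}} \mathbf{1}_{\{L\}}(k),\, \mathbb{D}_k(\Phi\bullet\Psi)}(W_{k,\Phi\bullet\Psi}\, x_{k-1} + B_{k,\Phi\bullet\Psi}) \big) \Big\}.
\end{aligned}
\tag{2.8}
\]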


Note that the fact that L(Φ • Ψ) = L(Φ) + L(Ψ) − 1 and the fact that for all Θ ∈ N it holds
that H(Θ) = L(Θ) − 1 establish items (ii) and (iii). Observe that item (iii) in Lemma 1.3.3
and (2.2) show that for all k ∈ {1, 2, . . . , L} it holds that

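\[
W_{k,\Phi\bullet\Psi} \in
\begin{cases}
\mathbb{R}^{\mathbb{D}_k(\Psi) \times \mathbb{D}_{k-1}(\Psi)} & : k < \mathcal{L}(\Psi) \\
\mathbb{R}^{\mathbb{D}_1(\Phi) \times \mathbb{D}_{\mathcal{L}(\Psi)-1}(\Psi)} & : k = \mathcal{L}(\Psi) \\
\mathbb{R}^{\mathbb{D}_{k-\mathcal{L}(\Psi)+1}(\Phi) \times \mathbb{D}_{k-\mathcal{L}(\Psi)}(\Phi)} & : k > \mathcal{L}(\Psi).
\end{cases}
\tag{2.9}
\]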


This, item (iii) in Lemma 1.3.3, and the fact that H(Ψ) = L(Ψ) − 1 ensure that for all
k ∈ {0, 1, . . . , L} it holds that
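\[
\mathbb{D}_k(\Phi \bullet \Psi) =
\begin{cases}
\mathbb{D}_k(\Psi) & : k \le \mathcal{H}(\Psi) \\
\mathbb{D}_{k-\mathcal{L}(\Psi)+1}(\Phi) & : k > \mathcal{H}(\Psi).
\end{cases}
\tag{2.10}
\]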

This establishes item (i). Note that (2.10) implies that


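\[
\begin{aligned}
\mathcal{P}(\Phi \bullet \Psi) &= \sum_{j=1}^{L} \mathbb{D}_j(\Phi\bullet\Psi)(\mathbb{D}_{j-1}(\Phi\bullet\Psi) + 1) \\
&= \Bigg[ \sum_{j=1}^{\mathcal{H}(\Psi)} \mathbb{D}_j(\Psi)(\mathbb{D}_{j-1}(\Psi) + 1) \Bigg] + \mathbb{D}_1(\Phi)(\mathbb{D}_{\mathcal{H}(\Psi)}(\Psi) + 1) \\
&\quad + \Bigg[ \sum_{j=\mathcal{L}(\Psi)+1}^{L} \mathbb{D}_{j-\mathcal{L}(\Psi)+1}(\Phi)(\mathbb{D}_{j-\mathcal{L}(\Psi)}(\Phi) + 1) \Bigg] \\
&= \Bigg[ \sum_{j=1}^{\mathcal{L}(\Psi)-1} \mathbb{D}_j(\Psi)(\mathbb{D}_{j-1}(\Psi) + 1) \Bigg] + \mathbb{D}_1(\Phi)(\mathbb{D}_{\mathcal{H}(\Psi)}(\Psi) + 1) \\
&\quad + \Bigg[ \sum_{j=2}^{\mathcal{L}(\Phi)} \mathbb{D}_j(\Phi)(\mathbb{D}_{j-1}(\Phi) + 1) \Bigg] \\
&= \big[ \mathcal{P}(\Psi) - \mathbb{D}_{\mathcal{L}(\Psi)}(\Psi)(\mathbb{D}_{\mathcal{L}(\Psi)-1}(\Psi) + 1) \big] + \mathbb{D}_1(\Phi)(\mathbb{D}_{\mathcal{H}(\Psi)}(\Psi) + 1) \\
&\quad + \big[ \mathcal{P}(\Phi) - \mathbb{D}_1(\Phi)(\mathbb{D}_0(\Phi) + 1) \big].
\end{aligned}
\tag{2.11}
\]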

This proves item (iv). Observe that (2.10) and item (ii) in Lemma 1.3.3 ensure that

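\[
\begin{aligned}
& \mathcal{I}(\Phi \bullet \Psi) = \mathbb{D}_0(\Phi \bullet \Psi) = \mathbb{D}_0(\Psi) = \mathcal{I}(\Psi) \\
\text{and}\quad & \mathcal{O}(\Phi \bullet \Psi) = \mathbb{D}_{\mathcal{L}(\Phi\bullet\Psi)}(\Phi \bullet \Psi) = \mathbb{D}_{\mathcal{L}(\Phi\bullet\Psi)-\mathcal{L}(\Psi)+1}(\Phi) = \mathbb{D}_{\mathcal{L}(\Phi)}(\Phi) = \mathcal{O}(\Phi).
\end{aligned}
\tag{2.12}
\]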

This demonstrates that for all a ∈ C(R, R) it holds that

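\[
\mathcal{R}^{\mathbf{N}}_a(\Phi \bullet \Psi) \in C(\mathbb{R}^{\mathcal{I}(\Phi\bullet\Psi)}, \mathbb{R}^{\mathcal{O}(\Phi\bullet\Psi)}) = C(\mathbb{R}^{\mathcal{I}(\Psi)}, \mathbb{R}^{\mathcal{O}(\Phi)}).
\tag{2.13}
\]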

Next note that (2.2) implies that for all k ∈ N ∩ (1, L(Φ) + 1) it holds that

(WL(Ψ)+k−1,Φ•Ψ , BL(Ψ)+k−1,Φ•Ψ ) = (Wk,Φ , Bk,Φ ). (2.14)

This and (2.10) ensure that for all a ∈ C(R, R), x = (x0 , x1 , . . . , xL ) ∈ Xa , k ∈ N∩(1, L(Φ)+
1) it holds that

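\[
\begin{aligned}
x_{\mathcal{L}(\Psi)+k-1} &= \mathfrak{M}_{a \mathbf{1}_{(0,L)}(\mathcal{L}(\Psi)+k-1) + \operatorname{id}_{\mathbb{R}} \mathbf{1}_{\{L\}}(\mathcal{L}(\Psi)+k-1),\, \mathbb{D}_k(\Phi)}(W_{k,\Phi}\, x_{\mathcal{L}(\Psi)+k-2} + B_{k,\Phi}) \\
&= \mathfrak{M}_{a \mathbf{1}_{(0,\mathcal{L}(\Phi))}(k) + \operatorname{id}_{\mathbb{R}} \mathbf{1}_{\{\mathcal{L}(\Phi)\}}(k),\, \mathbb{D}_k(\Phi)}(W_{k,\Phi}\, x_{\mathcal{L}(\Psi)+k-2} + B_{k,\Phi}).
\end{aligned}
\tag{2.15}
\]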


Furthermore, observe that (2.2) and (2.10) show that for all a ∈ C(R, R), x = (x0 , x1 , . . . ,
xL ) ∈ Xa it holds that

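\[
\begin{aligned}
x_{\mathcal{L}(\Psi)} &= \mathfrak{M}_{a \mathbf{1}_{(0,L)}(\mathcal{L}(\Psi)) + \operatorname{id}_{\mathbb{R}} \mathbf{1}_{\{L\}}(\mathcal{L}(\Psi)),\, \mathbb{D}_{\mathcal{L}(\Psi)}(\Phi\bullet\Psi)}(W_{\mathcal{L}(\Psi),\Phi\bullet\Psi}\, x_{\mathcal{L}(\Psi)-1} + B_{\mathcal{L}(\Psi),\Phi\bullet\Psi}) \\
&= \mathfrak{M}_{a \mathbf{1}_{(0,\mathcal{L}(\Phi))}(1) + \operatorname{id}_{\mathbb{R}} \mathbf{1}_{\{\mathcal{L}(\Phi)\}}(1),\, \mathbb{D}_1(\Phi)}(W_{1,\Phi} W_{\mathcal{L}(\Psi),\Psi}\, x_{\mathcal{L}(\Psi)-1} + W_{1,\Phi} B_{\mathcal{L}(\Psi),\Psi} + B_{1,\Phi}) \\
&= \mathfrak{M}_{a \mathbf{1}_{(0,\mathcal{L}(\Phi))}(1) + \operatorname{id}_{\mathbb{R}} \mathbf{1}_{\{\mathcal{L}(\Phi)\}}(1),\, \mathbb{D}_1(\Phi)}(W_{1,\Phi}(W_{\mathcal{L}(\Psi),\Psi}\, x_{\mathcal{L}(\Psi)-1} + B_{\mathcal{L}(\Psi),\Psi}) + B_{1,\Phi}).
\end{aligned}
\tag{2.16}
\]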

Combining this and (2.15) proves that for all a ∈ C(R, R), x = (x0 , x1 , . . . , xL ) ∈ Xa it
holds that
(RNa (Φ))(WL(Ψ),Ψ xL(Ψ)−1 + BL(Ψ),Ψ ) = xL . (2.17)
Moreover, note that (2.2) and (2.10) imply that for all a ∈ C(R, R), x = (x0 , x1 , . . . , xL ) ∈
Xa , k ∈ N ∩ (0, L(Ψ)) it holds that

xk = Ma,Dk (Ψ) (Wk,Ψ xk−1 + Bk,Ψ ) (2.18)

This proves that for all a ∈ C(R, R), x = (x0 , x1 , . . . , xL ) ∈ Xa it holds that

\[
(\mathcal{R}^{\mathbf{N}}_a(\Psi))(x_0) = W_{\mathcal{L}(\Psi),\Psi}\, x_{\mathcal{L}(\Psi)-1} + B_{\mathcal{L}(\Psi),\Psi}.
\tag{2.19}
\]

Combining this with (2.17) demonstrates that for all a ∈ C(R, R), x = (x0 , x1 , . . . , xL ) ∈ Xa
it holds that
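\[
(\mathcal{R}^{\mathbf{N}}_a(\Phi))\big( (\mathcal{R}^{\mathbf{N}}_a(\Psi))(x_0) \big) = x_L = \big( \mathcal{R}^{\mathbf{N}}_a(\Phi \bullet \Psi) \big)(x_0).
\tag{2.20}
\]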

This and (2.13) prove item (v). The proof of Proposition 2.1.2 is thus complete.
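Items (i) and (iv) can be checked on a small example using only the layer dimension vectors. The following sketch is an illustration under assumptions: the function names `params` and `composed_dims` and the concrete dimension vectors are hypothetical choices and do not appear in the text.

```python
def params(dims):
    """Number of parameters P of an ANN with layer dimensions dims = (l_0, ..., l_L)."""
    return sum(dims[k] * (dims[k - 1] + 1) for k in range(1, len(dims)))

def composed_dims(d_phi, d_psi):
    """Layer dimensions of phi • psi according to item (i), cf. (2.3)."""
    assert d_phi[0] == d_psi[-1], "need I(phi) = O(psi)"
    return d_psi[:-1] + d_phi[1:]

d_psi, d_phi = [3, 5, 4], [4, 6, 2]       # example architectures (assumed)
d_comp = composed_dims(d_phi, d_psi)      # (3, 5, 6, 2)

lhs = params(d_comp)
# right-hand side of the identity in (2.6)
rhs = (params(d_phi) + params(d_psi)
       + d_phi[1] * (d_psi[-2] + 1)
       - d_phi[1] * (d_phi[0] + 1)
       - d_psi[-1] * (d_psi[-2] + 1))
# upper bound in (2.6)
bound = params(d_phi) + params(d_psi) + d_phi[1] * d_psi[-2]
```

For these dimensions the composed ANN has 70 parameters, which coincides with the right-hand side of the identity in (2.6) and lies below the bound 118.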

2.1.3 Associativity of compositions of fully-connected feedforward ANNs
Lemma 2.1.3. Let Φ1 , Φ2 , Φ3 ∈ N satisfy I(Φ1 ) = O(Φ2 ), I(Φ2 ) = O(Φ3 ), and L(Φ2 ) = 1
(cf. Definition 1.3.1). Then

(Φ1 • Φ2 ) • Φ3 = Φ1 • (Φ2 • Φ3 ) (2.21)

(cf. Definition 2.1.1).

Proof of Lemma 2.1.3. Observe that the fact that for all Ψ1 , Ψ2 ∈ N with I(Ψ1 ) = O(Ψ2 )
it holds that L(Ψ1 • Ψ2 ) = L(Ψ1 ) + L(Ψ2 ) − 1 and the assumption that L(Φ2 ) = 1 ensure
that
L(Φ1 • Φ2 ) = L(Φ1 ) and L(Φ2 • Φ3 ) = L(Φ3 ) (2.22)
(cf. Definition 2.1.1). Therefore, we obtain that

L((Φ1 • Φ2 ) • Φ3 ) = L(Φ1 ) + L(Φ3 ) = L(Φ1 • (Φ2 • Φ3 )). (2.23)


Next note that (2.22), (2.2), and the assumption that L(Φ2 ) = 1 imply that for all
k ∈ {1, 2, . . . , L(Φ1 )} it holds that
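\[
(W_{k,\Phi_1\bullet\Phi_2}, B_{k,\Phi_1\bullet\Phi_2}) =
\begin{cases}
(W_{1,\Phi_1} W_{1,\Phi_2},\; W_{1,\Phi_1} B_{1,\Phi_2} + B_{1,\Phi_1}) & : k = 1 \\
(W_{k,\Phi_1}, B_{k,\Phi_1}) & : k > 1.
\end{cases}
\tag{2.24}
\]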

This, (2.2), and (2.23) prove that for all k ∈ {1, 2, . . . , L(Φ1 ) + L(Φ3 ) − 1} it holds that
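\[
\begin{aligned}
&(W_{k,(\Phi_1\bullet\Phi_2)\bullet\Phi_3}, B_{k,(\Phi_1\bullet\Phi_2)\bullet\Phi_3}) \\
&= \begin{cases}
(W_{k,\Phi_3}, B_{k,\Phi_3}) & : k < \mathcal{L}(\Phi_3) \\
(W_{1,\Phi_1\bullet\Phi_2} W_{\mathcal{L}(\Phi_3),\Phi_3},\; W_{1,\Phi_1\bullet\Phi_2} B_{\mathcal{L}(\Phi_3),\Phi_3} + B_{1,\Phi_1\bullet\Phi_2}) & : k = \mathcal{L}(\Phi_3) \\
(W_{k-\mathcal{L}(\Phi_3)+1,\Phi_1\bullet\Phi_2}, B_{k-\mathcal{L}(\Phi_3)+1,\Phi_1\bullet\Phi_2}) & : k > \mathcal{L}(\Phi_3)
\end{cases} \\
&= \begin{cases}
(W_{k,\Phi_3}, B_{k,\Phi_3}) & : k < \mathcal{L}(\Phi_3) \\
(W_{1,\Phi_1\bullet\Phi_2} W_{\mathcal{L}(\Phi_3),\Phi_3},\; W_{1,\Phi_1\bullet\Phi_2} B_{\mathcal{L}(\Phi_3),\Phi_3} + B_{1,\Phi_1\bullet\Phi_2}) & : k = \mathcal{L}(\Phi_3) \\
(W_{k-\mathcal{L}(\Phi_3)+1,\Phi_1}, B_{k-\mathcal{L}(\Phi_3)+1,\Phi_1}) & : k > \mathcal{L}(\Phi_3).
\end{cases}
\end{aligned}
\tag{2.25}
\]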

Furthermore, observe that (2.2), (2.22), and (2.23) show that for all k ∈ {1, 2, . . . , L(Φ1 ) +
L(Φ3 ) − 1} it holds that
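\[
\begin{aligned}
&(W_{k,\Phi_1\bullet(\Phi_2\bullet\Phi_3)}, B_{k,\Phi_1\bullet(\Phi_2\bullet\Phi_3)}) \\
&= \begin{cases}
(W_{k,\Phi_2\bullet\Phi_3}, B_{k,\Phi_2\bullet\Phi_3}) & : k < \mathcal{L}(\Phi_2\bullet\Phi_3) \\
(W_{1,\Phi_1} W_{\mathcal{L}(\Phi_2\bullet\Phi_3),\Phi_2\bullet\Phi_3},\; W_{1,\Phi_1} B_{\mathcal{L}(\Phi_2\bullet\Phi_3),\Phi_2\bullet\Phi_3} + B_{1,\Phi_1}) & : k = \mathcal{L}(\Phi_2\bullet\Phi_3) \\
(W_{k-\mathcal{L}(\Phi_2\bullet\Phi_3)+1,\Phi_1}, B_{k-\mathcal{L}(\Phi_2\bullet\Phi_3)+1,\Phi_1}) & : k > \mathcal{L}(\Phi_2\bullet\Phi_3)
\end{cases} \\
&= \begin{cases}
(W_{k,\Phi_3}, B_{k,\Phi_3}) & : k < \mathcal{L}(\Phi_3) \\
(W_{1,\Phi_1} W_{\mathcal{L}(\Phi_3),\Phi_2\bullet\Phi_3},\; W_{1,\Phi_1} B_{\mathcal{L}(\Phi_3),\Phi_2\bullet\Phi_3} + B_{1,\Phi_1}) & : k = \mathcal{L}(\Phi_3) \\
(W_{k-\mathcal{L}(\Phi_3)+1,\Phi_1}, B_{k-\mathcal{L}(\Phi_3)+1,\Phi_1}) & : k > \mathcal{L}(\Phi_3).
\end{cases}
\end{aligned}
\tag{2.26}
\]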

Combining this with (2.25) establishes that for all k ∈ {1, 2, . . . , L(Φ1 )+L(Φ3 )−1}\{L(Φ3 )}
it holds that
(Wk,(Φ1 •Φ2 )•Φ3 , Bk,(Φ1 •Φ2 )•Φ3 ) = (Wk,Φ1 •(Φ2 •Φ3 ) , Bk,Φ1 •(Φ2 •Φ3 ) ). (2.27)
Moreover, note that (2.24) and (2.2) ensure that
W1,Φ1 •Φ2 WL(Φ3 ),Φ3 = W1,Φ1 W1,Φ2 WL(Φ3 ),Φ3 = W1,Φ1 WL(Φ3 ),Φ2 •Φ3 . (2.28)
In addition, observe that (2.24) and (2.2) demonstrate that
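\[
\begin{aligned}
W_{1,\Phi_1\bullet\Phi_2} B_{\mathcal{L}(\Phi_3),\Phi_3} + B_{1,\Phi_1\bullet\Phi_2} &= W_{1,\Phi_1} W_{1,\Phi_2} B_{\mathcal{L}(\Phi_3),\Phi_3} + W_{1,\Phi_1} B_{1,\Phi_2} + B_{1,\Phi_1} \\
&= W_{1,\Phi_1}(W_{1,\Phi_2} B_{\mathcal{L}(\Phi_3),\Phi_3} + B_{1,\Phi_2}) + B_{1,\Phi_1} \\
&= W_{1,\Phi_1} B_{\mathcal{L}(\Phi_3),\Phi_2\bullet\Phi_3} + B_{1,\Phi_1}.
\end{aligned}
\tag{2.29}
\]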


Combining this and (2.28) with (2.27) proves that for all k ∈ {1, 2, . . . , L(Φ1 ) + L(Φ3 ) − 1}
it holds that
(Wk,(Φ1 •Φ2 )•Φ3 , Bk,(Φ1 •Φ2 )•Φ3 ) = (Wk,Φ1 •(Φ2 •Φ3 ) , Bk,Φ1 •(Φ2 •Φ3 ) ). (2.30)
This and (2.23) imply that
(Φ1 • Φ2 ) • Φ3 = Φ1 • (Φ2 • Φ3 ). (2.31)
The proof of Lemma 2.1.3 is thus complete.
Lemma 2.1.4. Let Φ1 , Φ2 , Φ3 ∈ N satisfy I(Φ1 ) = O(Φ2 ), I(Φ2 ) = O(Φ3 ), and L(Φ2 ) > 1
(cf. Definition 1.3.1). Then
(Φ1 • Φ2 ) • Φ3 = Φ1 • (Φ2 • Φ3 ) (2.32)
(cf. Definition 2.1.1).
Proof of Lemma 2.1.4. Note that the fact that for all Ψ, Θ ∈ N it holds that L(Ψ • Θ) =
L(Ψ) + L(Θ) − 1 ensures that
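\[
\begin{aligned}
\mathcal{L}((\Phi_1 \bullet \Phi_2) \bullet \Phi_3) &= \mathcal{L}(\Phi_1 \bullet \Phi_2) + \mathcal{L}(\Phi_3) - 1 \\
&= \mathcal{L}(\Phi_1) + \mathcal{L}(\Phi_2) + \mathcal{L}(\Phi_3) - 2 \\
&= \mathcal{L}(\Phi_1) + \mathcal{L}(\Phi_2 \bullet \Phi_3) - 1 \\
&= \mathcal{L}(\Phi_1 \bullet (\Phi_2 \bullet \Phi_3))
\end{aligned}
\tag{2.33}
\]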
(cf. Definition 2.1.1). Furthermore, observe that (2.2) shows that for all k ∈ {1, 2, . . . ,
L((Φ1 • Φ2 ) • Φ3 )} it holds that
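\[
\begin{aligned}
&(W_{k,(\Phi_1\bullet\Phi_2)\bullet\Phi_3}, B_{k,(\Phi_1\bullet\Phi_2)\bullet\Phi_3}) \\
&= \begin{cases}
(W_{k,\Phi_3}, B_{k,\Phi_3}) & : k < \mathcal{L}(\Phi_3) \\
(W_{1,\Phi_1\bullet\Phi_2} W_{\mathcal{L}(\Phi_3),\Phi_3},\; W_{1,\Phi_1\bullet\Phi_2} B_{\mathcal{L}(\Phi_3),\Phi_3} + B_{1,\Phi_1\bullet\Phi_2}) & : k = \mathcal{L}(\Phi_3) \\
(W_{k-\mathcal{L}(\Phi_3)+1,\Phi_1\bullet\Phi_2}, B_{k-\mathcal{L}(\Phi_3)+1,\Phi_1\bullet\Phi_2}) & : k > \mathcal{L}(\Phi_3).
\end{cases}
\end{aligned}
\tag{2.34}
\]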

Moreover, note that (2.2) and the assumption that L(Φ2 ) > 1 ensure that for all k ∈
N ∩ (L(Φ3 ), L((Φ1 • Φ2 ) • Φ3 )] it holds that
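\[
\begin{aligned}
&(W_{k-\mathcal{L}(\Phi_3)+1,\Phi_1\bullet\Phi_2}, B_{k-\mathcal{L}(\Phi_3)+1,\Phi_1\bullet\Phi_2}) \\
&= \begin{cases}
(W_{k-\mathcal{L}(\Phi_3)+1,\Phi_2}, B_{k-\mathcal{L}(\Phi_3)+1,\Phi_2}) & : k - \mathcal{L}(\Phi_3) + 1 < \mathcal{L}(\Phi_2) \\
(W_{1,\Phi_1} W_{\mathcal{L}(\Phi_2),\Phi_2},\; W_{1,\Phi_1} B_{\mathcal{L}(\Phi_2),\Phi_2} + B_{1,\Phi_1}) & : k - \mathcal{L}(\Phi_3) + 1 = \mathcal{L}(\Phi_2) \\
(W_{k-\mathcal{L}(\Phi_3)+1-\mathcal{L}(\Phi_2)+1,\Phi_1}, B_{k-\mathcal{L}(\Phi_3)+1-\mathcal{L}(\Phi_2)+1,\Phi_1}) & : k - \mathcal{L}(\Phi_3) + 1 > \mathcal{L}(\Phi_2)
\end{cases} \\
&= \begin{cases}
(W_{k-\mathcal{L}(\Phi_3)+1,\Phi_2}, B_{k-\mathcal{L}(\Phi_3)+1,\Phi_2}) & : k < \mathcal{L}(\Phi_2) + \mathcal{L}(\Phi_3) - 1 \\
(W_{1,\Phi_1} W_{\mathcal{L}(\Phi_2),\Phi_2},\; W_{1,\Phi_1} B_{\mathcal{L}(\Phi_2),\Phi_2} + B_{1,\Phi_1}) & : k = \mathcal{L}(\Phi_2) + \mathcal{L}(\Phi_3) - 1 \\
(W_{k-\mathcal{L}(\Phi_3)-\mathcal{L}(\Phi_2)+2,\Phi_1}, B_{k-\mathcal{L}(\Phi_3)-\mathcal{L}(\Phi_2)+2,\Phi_1}) & : k > \mathcal{L}(\Phi_2) + \mathcal{L}(\Phi_3) - 1.
\end{cases}
\end{aligned}
\tag{2.35}
\]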


Combining this with (2.34) proves that for all k ∈ {1, 2, . . . , L((Φ1 • Φ2 ) • Φ3 )} it holds
that

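\[
\begin{aligned}
&(W_{k,(\Phi_1\bullet\Phi_2)\bullet\Phi_3}, B_{k,(\Phi_1\bullet\Phi_2)\bullet\Phi_3}) \\
&= \begin{cases}
(W_{k,\Phi_3}, B_{k,\Phi_3}) & : k < \mathcal{L}(\Phi_3) \\
(W_{1,\Phi_2} W_{\mathcal{L}(\Phi_3),\Phi_3},\; W_{1,\Phi_2} B_{\mathcal{L}(\Phi_3),\Phi_3} + B_{1,\Phi_2}) & : k = \mathcal{L}(\Phi_3) \\
(W_{k-\mathcal{L}(\Phi_3)+1,\Phi_2}, B_{k-\mathcal{L}(\Phi_3)+1,\Phi_2}) & : \mathcal{L}(\Phi_3) < k < \mathcal{L}(\Phi_2) + \mathcal{L}(\Phi_3) - 1 \\
(W_{1,\Phi_1} W_{\mathcal{L}(\Phi_2),\Phi_2},\; W_{1,\Phi_1} B_{\mathcal{L}(\Phi_2),\Phi_2} + B_{1,\Phi_1}) & : k = \mathcal{L}(\Phi_2) + \mathcal{L}(\Phi_3) - 1 \\
(W_{k-\mathcal{L}(\Phi_3)-\mathcal{L}(\Phi_2)+2,\Phi_1}, B_{k-\mathcal{L}(\Phi_3)-\mathcal{L}(\Phi_2)+2,\Phi_1}) & : k > \mathcal{L}(\Phi_2) + \mathcal{L}(\Phi_3) - 1.
\end{cases}
\end{aligned}
\tag{2.36}
\]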

In addition, observe that (2.2), the fact that L(Φ2 • Φ3 ) = L(Φ2 ) + L(Φ3 ) − 1, and the
assumption that L(Φ2 ) > 1 demonstrate that for all k ∈ {1, 2, . . . , L(Φ1 • (Φ2 • Φ3 ))} it
holds that

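\[
\begin{aligned}
&(W_{k,\Phi_1\bullet(\Phi_2\bullet\Phi_3)}, B_{k,\Phi_1\bullet(\Phi_2\bullet\Phi_3)}) \\
&= \begin{cases}
(W_{k,\Phi_2\bullet\Phi_3}, B_{k,\Phi_2\bullet\Phi_3}) & : k < \mathcal{L}(\Phi_2\bullet\Phi_3) \\
(W_{1,\Phi_1} W_{\mathcal{L}(\Phi_2\bullet\Phi_3),\Phi_2\bullet\Phi_3},\; W_{1,\Phi_1} B_{\mathcal{L}(\Phi_2\bullet\Phi_3),\Phi_2\bullet\Phi_3} + B_{1,\Phi_1}) & : k = \mathcal{L}(\Phi_2\bullet\Phi_3) \\
(W_{k-\mathcal{L}(\Phi_2\bullet\Phi_3)+1,\Phi_1}, B_{k-\mathcal{L}(\Phi_2\bullet\Phi_3)+1,\Phi_1}) & : k > \mathcal{L}(\Phi_2\bullet\Phi_3)
\end{cases} \\
&= \begin{cases}
(W_{k,\Phi_2\bullet\Phi_3}, B_{k,\Phi_2\bullet\Phi_3}) & : k < \mathcal{L}(\Phi_2) + \mathcal{L}(\Phi_3) - 1 \\
(W_{1,\Phi_1} W_{\mathcal{L}(\Phi_2)+\mathcal{L}(\Phi_3)-1,\Phi_2\bullet\Phi_3},\; W_{1,\Phi_1} B_{\mathcal{L}(\Phi_2)+\mathcal{L}(\Phi_3)-1,\Phi_2\bullet\Phi_3} + B_{1,\Phi_1}) & : k = \mathcal{L}(\Phi_2) + \mathcal{L}(\Phi_3) - 1 \\
(W_{k-\mathcal{L}(\Phi_2)-\mathcal{L}(\Phi_3)+2,\Phi_1}, B_{k-\mathcal{L}(\Phi_2)-\mathcal{L}(\Phi_3)+2,\Phi_1}) & : k > \mathcal{L}(\Phi_2) + \mathcal{L}(\Phi_3) - 1
\end{cases} \\
&= \begin{cases}
(W_{k,\Phi_3}, B_{k,\Phi_3}) & : k < \mathcal{L}(\Phi_3) \\
(W_{1,\Phi_2} W_{\mathcal{L}(\Phi_3),\Phi_3},\; W_{1,\Phi_2} B_{\mathcal{L}(\Phi_3),\Phi_3} + B_{1,\Phi_2}) & : k = \mathcal{L}(\Phi_3) \\
(W_{k-\mathcal{L}(\Phi_3)+1,\Phi_2}, B_{k-\mathcal{L}(\Phi_3)+1,\Phi_2}) & : \mathcal{L}(\Phi_3) < k < \mathcal{L}(\Phi_2) + \mathcal{L}(\Phi_3) - 1 \\
(W_{1,\Phi_1} W_{\mathcal{L}(\Phi_2),\Phi_2},\; W_{1,\Phi_1} B_{\mathcal{L}(\Phi_2),\Phi_2} + B_{1,\Phi_1}) & : k = \mathcal{L}(\Phi_2) + \mathcal{L}(\Phi_3) - 1 \\
(W_{k-\mathcal{L}(\Phi_2)-\mathcal{L}(\Phi_3)+2,\Phi_1}, B_{k-\mathcal{L}(\Phi_2)-\mathcal{L}(\Phi_3)+2,\Phi_1}) & : k > \mathcal{L}(\Phi_2) + \mathcal{L}(\Phi_3) - 1.
\end{cases}
\end{aligned}
\tag{2.37}
\]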

This, (2.36), and (2.33) establish that for all k ∈ {1, 2, . . . , L(Φ1 ) + L(Φ2 ) + L(Φ3 ) − 2} it
holds that
(Wk,(Φ1 •Φ2 )•Φ3 , Bk,(Φ1 •Φ2 )•Φ3 ) = (Wk,Φ1 •(Φ2 •Φ3 ) , Bk,Φ1 •(Φ2 •Φ3 ) ). (2.38)
Hence, we obtain that
(Φ1 • Φ2 ) • Φ3 = Φ1 • (Φ2 • Φ3 ). (2.39)
The proof of Lemma 2.1.4 is thus complete.


Corollary 2.1.5. Let Φ1 , Φ2 , Φ3 ∈ N satisfy I(Φ1 ) = O(Φ2 ) and I(Φ2 ) = O(Φ3 ) (cf.
Definition 1.3.1). Then
(Φ1 • Φ2 ) • Φ3 = Φ1 • (Φ2 • Φ3 ) (2.40)
(cf. Definition 2.1.1).

Proof of Corollary 2.1.5. Note that Lemma 2.1.3 and Lemma 2.1.4 establish (2.40). The
proof of Corollary 2.1.5 is thus complete.
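The associativity in (2.40) can also be observed numerically, both in the case L(Φ2) = 1 of Lemma 2.1.3 and in the case L(Φ2) > 1 of Lemma 2.1.4. The sketch below assumes that ANNs are stored as lists of NumPy pairs (W_k, B_k); the names `random_ann`, `compose`, and `same_ann` are hypothetical helpers, and equality is only tested up to floating point rounding.

```python
import numpy as np

def random_ann(dims, rng):
    """ANN as a list of (W_k, B_k) with layer dimensions dims."""
    return [(rng.standard_normal((dims[k + 1], dims[k])),
             rng.standard_normal(dims[k + 1]))
            for k in range(len(dims) - 1)]

def compose(phi, psi):
    """Composition phi • psi as in (2.2)."""
    W1, B1 = phi[0]
    WL, BL = psi[-1]
    assert W1.shape[1] == WL.shape[0], "need I(phi) = O(psi)"
    return psi[:-1] + [(W1 @ WL, W1 @ BL + B1)] + phi[1:]

def same_ann(a, b):
    """Layerwise comparison of weights and biases up to rounding errors."""
    return len(a) == len(b) and all(
        Wa.shape == Wb.shape and np.allclose(Wa, Wb) and np.allclose(Ba, Bb)
        for (Wa, Ba), (Wb, Bb) in zip(a, b))

rng = np.random.default_rng(1)
phi1 = random_ann([5, 3, 1], rng)
phi3 = random_ann([2, 4, 3], rng)
results = []
for dims2 in ([3, 5], [3, 4, 5]):        # L(phi2) = 1 and L(phi2) = 2
    phi2 = random_ann(dims2, rng)
    left = compose(compose(phi1, phi2), phi3)
    right = compose(phi1, compose(phi2, phi3))
    results.append(same_ann(left, right))
```

Both entries of `results` are `True` for this example, in line with Corollary 2.1.5.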

2.1.4 Powers of fully-connected feedforward ANNs


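Definition 2.1.6 (Powers of fully-connected feedforward ANNs). We denote by $(\cdot)^{\bullet n} \colon \{\Phi \in \mathbf{N} \colon \mathcal{I}(\Phi) = \mathcal{O}(\Phi)\} \to \mathbf{N}$, $n \in \mathbb{N}_0$, the functions which satisfy for all $n \in \mathbb{N}_0$, $\Phi \in \mathbf{N}$ with
$\mathcal{I}(\Phi) = \mathcal{O}(\Phi)$ that

\[
\Phi^{\bullet n} =
\begin{cases}
\big( \operatorname{I}_{\mathcal{O}(\Phi)}, (0, 0, \dots, 0) \big) \in \mathbb{R}^{\mathcal{O}(\Phi) \times \mathcal{O}(\Phi)} \times \mathbb{R}^{\mathcal{O}(\Phi)} & : n = 0 \\
\Phi \bullet (\Phi^{\bullet (n-1)}) & : n \in \mathbb{N}
\end{cases}
\tag{2.41}
\]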

(cf. Definitions 1.3.1, 1.5.5, and 2.1.1).

Lemma 2.1.7 (Number of hidden layers of powers of ANNs). Let n ∈ N0 , Φ ∈ N satisfy
I(Φ) = O(Φ) (cf. Definition 1.3.1). Then

H(Φ•n ) = nH(Φ) (2.42)

(cf. Definition 2.1.6).

Proof of Lemma 2.1.7. Observe that Proposition 2.1.2, (2.41), and induction establish
(2.42). The proof of Lemma 2.1.7 is thus complete.
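The recursion (2.41) and the identity (2.42) can be illustrated by the following sketch, which again assumes the list-of-pairs representation of ANNs; `random_ann`, `compose`, `power`, and `hidden` are hypothetical helper names.

```python
import numpy as np

def random_ann(dims, rng):
    """ANN as a list of (W_k, B_k) with layer dimensions dims."""
    return [(rng.standard_normal((dims[k + 1], dims[k])),
             rng.standard_normal(dims[k + 1]))
            for k in range(len(dims) - 1)]

def compose(phi, psi):
    """Composition phi • psi as in (2.2)."""
    W1, B1 = phi[0]
    WL, BL = psi[-1]
    return psi[:-1] + [(W1 @ WL, W1 @ BL + B1)] + phi[1:]

def power(phi, n):
    """n-th power of phi as in (2.41); requires I(phi) = O(phi)."""
    d = phi[-1][0].shape[0]
    assert phi[0][0].shape[1] == d, "need I(phi) = O(phi)"
    result = [(np.eye(d), np.zeros(d))]   # phi^{•0} = (identity matrix, 0)
    for _ in range(n):
        result = compose(phi, result)     # phi^{•n} = phi • phi^{•(n-1)}
    return result

def hidden(ann):
    """Number of hidden layers H = L - 1."""
    return len(ann) - 1

rng = np.random.default_rng(2)
phi = random_ann([3, 4, 5, 3], rng)       # H(phi) = 2
hidden_counts = [hidden(power(phi, n)) for n in range(5)]
```

For this example `hidden_counts` equals `[0, 2, 4, 6, 8]`, that is, n times H(Φ) for n = 0, 1, ..., 4.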

2.2 Parallelizations of fully-connected feedforward ANNs


2.2.1 Parallelizations of fully-connected feedforward ANNs with
the same length
Definition 2.2.1 (Parallelization of fully-connected feedforward ANNs). Let n ∈ N. Then
we denote by

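\[
\mathbf{P}_n \colon \big\{ \Phi = (\Phi_1, \dots, \Phi_n) \in \mathbf{N}^n \colon \mathcal{L}(\Phi_1) = \mathcal{L}(\Phi_2) = \dots = \mathcal{L}(\Phi_n) \big\} \to \mathbf{N}
\tag{2.43}
\]

the function which satisfies for all Φ = (Φ1 , . . . , Φn ) ∈ Nn , k ∈ {1, 2, . . . , L(Φ1 )} with
L(Φ1 ) = L(Φ2 ) = · · · = L(Φn ) that

\[
\mathcal{L}(\mathbf{P}_n(\Phi)) = \mathcal{L}(\Phi_1), \qquad
W_{k,\mathbf{P}_n(\Phi)} =
\begin{pmatrix}
W_{k,\Phi_1} & 0 & 0 & \cdots & 0 \\
0 & W_{k,\Phi_2} & 0 & \cdots & 0 \\
0 & 0 & W_{k,\Phi_3} & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & W_{k,\Phi_n}
\end{pmatrix},
\qquad\text{and}\qquad
B_{k,\mathbf{P}_n(\Phi)} =
\begin{pmatrix}
B_{k,\Phi_1} \\ B_{k,\Phi_2} \\ \vdots \\ B_{k,\Phi_n}
\end{pmatrix}
\tag{2.44}
\]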

(cf. Definition 1.3.1).

Lemma 2.2.2 (Architectures of parallelizations of fully-connected feedforward ANNs).
Let n, L ∈ N, Φ = (Φ1 , . . . , Φn ) ∈ Nn satisfy L = L(Φ1 ) = L(Φ2 ) = . . . = L(Φn ) (cf.
Definition 1.3.1). Then

(i) it holds that


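\[
\mathbf{P}_n(\Phi) \in \mathop{\times}_{k=1}^{L} \Big( \mathbb{R}^{(\sum_{j=1}^n \mathbb{D}_k(\Phi_j)) \times (\sum_{j=1}^n \mathbb{D}_{k-1}(\Phi_j))} \times \mathbb{R}^{(\sum_{j=1}^n \mathbb{D}_k(\Phi_j))} \Big),
\tag{2.45}
\]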

(ii) it holds for all k ∈ N0 that

Dk (Pn (Φ)) = Dk (Φ1 ) + Dk (Φ2 ) + . . . + Dk (Φn ), (2.46)

and

(iii) it holds that


\[
\mathcal{D}(\mathbf{P}_n(\Phi)) = \mathcal{D}(\Phi_1) + \mathcal{D}(\Phi_2) + \dots + \mathcal{D}(\Phi_n)
\tag{2.47}
\]

(cf. Definition 2.2.1).

Proof of Lemma 2.2.2. Note that item (iii) in Lemma 1.3.3 and (2.44) imply that for all
k ∈ {1, 2, . . . , L} it holds that
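\[
W_{k,\mathbf{P}_n(\Phi)} \in \mathbb{R}^{(\sum_{j=1}^n \mathbb{D}_k(\Phi_j)) \times (\sum_{j=1}^n \mathbb{D}_{k-1}(\Phi_j))}
\qquad\text{and}\qquad
B_{k,\mathbf{P}_n(\Phi)} \in \mathbb{R}^{(\sum_{j=1}^n \mathbb{D}_k(\Phi_j))}
\tag{2.48}
\]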

(cf. Definition 2.2.1). Item (iii) in Lemma 1.3.3 therefore establishes items (i) and (ii). Note
that item (ii) implies item (iii). The proof of Lemma 2.2.2 is thus complete.


Proposition 2.2.3 (Realizations of parallelizations of fully-connected feedforward ANNs).
Let a ∈ C(R, R), n ∈ N, Φ = (Φ1 , . . . , Φn ) ∈ Nn satisfy L(Φ1 ) = L(Φ2 ) = · · · = L(Φn ) (cf.
Definition 1.3.1). Then

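(i) it holds that

\[
\mathcal{R}^{\mathbf{N}}_a(\mathbf{P}_n(\Phi)) \in C\big( \mathbb{R}^{[\sum_{j=1}^n \mathcal{I}(\Phi_j)]}, \mathbb{R}^{[\sum_{j=1}^n \mathcal{O}(\Phi_j)]} \big)
\tag{2.49}
\]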

and

(ii) it holds for all x1 ∈ RI(Φ1 ) , x2 ∈ RI(Φ2 ) , . . . , xn ∈ RI(Φn ) that

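\[
\begin{aligned}
&\big( \mathcal{R}^{\mathbf{N}}_a(\mathbf{P}_n(\Phi)) \big)(x_1, x_2, \dots, x_n) \\
&= \big( (\mathcal{R}^{\mathbf{N}}_a(\Phi_1))(x_1), (\mathcal{R}^{\mathbf{N}}_a(\Phi_2))(x_2), \dots, (\mathcal{R}^{\mathbf{N}}_a(\Phi_n))(x_n) \big) \in \mathbb{R}^{[\sum_{j=1}^n \mathcal{O}(\Phi_j)]}
\end{aligned}
\tag{2.50}
\]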

(cf. Definitions 1.3.4 and 2.2.1).

Proof of Proposition 2.2.3. Throughout this proof, let L = L(Φ1 ), for every j ∈ {1, 2, . . . ,
n} let

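\[
\begin{aligned}
X^j = \Big\{ & x = (x_0, x_1, \dots, x_L) \in \mathbb{R}^{\mathbb{D}_0(\Phi_j)} \times \mathbb{R}^{\mathbb{D}_1(\Phi_j)} \times \dots \times \mathbb{R}^{\mathbb{D}_L(\Phi_j)} \colon \\
& \big( \forall\, k \in \{1, 2, \dots, L\} \colon x_k = \mathfrak{M}_{a \mathbf{1}_{(0,L)}(k) + \operatorname{id}_{\mathbb{R}} \mathbf{1}_{\{L\}}(k),\, \mathbb{D}_k(\Phi_j)}(W_{k,\Phi_j}\, x_{k-1} + B_{k,\Phi_j}) \big) \Big\},
\end{aligned}
\tag{2.51}
\]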


and let

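\[
\begin{aligned}
X = \Big\{ & x = (x_0, x_1, \dots, x_L) \in \mathbb{R}^{\mathbb{D}_0(\mathbf{P}_n(\Phi))} \times \mathbb{R}^{\mathbb{D}_1(\mathbf{P}_n(\Phi))} \times \dots \times \mathbb{R}^{\mathbb{D}_L(\mathbf{P}_n(\Phi))} \colon \\
& \big( \forall\, k \in \{1, 2, \dots, L\} \colon x_k = \mathfrak{M}_{a \mathbf{1}_{(0,L)}(k) + \operatorname{id}_{\mathbb{R}} \mathbf{1}_{\{L\}}(k),\, \mathbb{D}_k(\mathbf{P}_n(\Phi))}(W_{k,\mathbf{P}_n(\Phi)}\, x_{k-1} + B_{k,\mathbf{P}_n(\Phi)}) \big) \Big\}.
\end{aligned}
\tag{2.52}
\]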


Observe that item (ii) in Lemma 2.2.2 and item (ii) in Lemma 1.3.3 imply that
\[
\mathcal{I}(\mathbf{P}_n(\Phi)) = \mathbb{D}_0(\mathbf{P}_n(\Phi)) = \sum_{j=1}^n \mathbb{D}_0(\Phi_j) = \sum_{j=1}^n \mathcal{I}(\Phi_j).
\tag{2.53}
\]

Furthermore, note that item (ii) in Lemma 2.2.2 and item (ii) in Lemma 1.3.3 ensure that
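\[
\mathcal{O}(\mathbf{P}_n(\Phi)) = \mathbb{D}_{\mathcal{L}(\mathbf{P}_n(\Phi))}(\mathbf{P}_n(\Phi)) = \sum_{j=1}^n \mathbb{D}_{\mathcal{L}(\Phi_j)}(\Phi_j) = \sum_{j=1}^n \mathcal{O}(\Phi_j).
\tag{2.54}
\]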

Observe that (2.44) and item (ii) in Lemma 2.2.2 show that for all $a \in C(\mathbb{R}, \mathbb{R})$, $k \in \{1, 2, \dots, L\}$, $x_1 \in \mathbb{R}^{\mathbb{D}_k(\Phi_1)}$, $x_2 \in \mathbb{R}^{\mathbb{D}_k(\Phi_2)}$, $\dots$, $x_n \in \mathbb{R}^{\mathbb{D}_k(\Phi_n)}$, $x \in \mathbb{R}^{[\sum_{j=1}^n \mathbb{D}_k(\Phi_j)]}$ with $x = (x_1, x_2, \dots, x_n)$ it holds that

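\[
\begin{aligned}
&\mathfrak{M}_{a,\mathbb{D}_k(\mathbf{P}_n(\Phi))}(W_{k,\mathbf{P}_n(\Phi)}\, x + B_{k,\mathbf{P}_n(\Phi)}) \\
&= \mathfrak{M}_{a,\mathbb{D}_k(\mathbf{P}_n(\Phi))}\left(
\begin{pmatrix}
W_{k,\Phi_1} & 0 & 0 & \cdots & 0 \\
0 & W_{k,\Phi_2} & 0 & \cdots & 0 \\
0 & 0 & W_{k,\Phi_3} & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & W_{k,\Phi_n}
\end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}
+
\begin{pmatrix} B_{k,\Phi_1} \\ B_{k,\Phi_2} \\ B_{k,\Phi_3} \\ \vdots \\ B_{k,\Phi_n} \end{pmatrix}
\right) \\
&= \mathfrak{M}_{a,\mathbb{D}_k(\mathbf{P}_n(\Phi))}\left(
\begin{pmatrix}
W_{k,\Phi_1} x_1 + B_{k,\Phi_1} \\
W_{k,\Phi_2} x_2 + B_{k,\Phi_2} \\
W_{k,\Phi_3} x_3 + B_{k,\Phi_3} \\
\vdots \\
W_{k,\Phi_n} x_n + B_{k,\Phi_n}
\end{pmatrix}
\right)
=
\begin{pmatrix}
\mathfrak{M}_{a,\mathbb{D}_k(\Phi_1)}(W_{k,\Phi_1} x_1 + B_{k,\Phi_1}) \\
\mathfrak{M}_{a,\mathbb{D}_k(\Phi_2)}(W_{k,\Phi_2} x_2 + B_{k,\Phi_2}) \\
\mathfrak{M}_{a,\mathbb{D}_k(\Phi_3)}(W_{k,\Phi_3} x_3 + B_{k,\Phi_3}) \\
\vdots \\
\mathfrak{M}_{a,\mathbb{D}_k(\Phi_n)}(W_{k,\Phi_n} x_n + B_{k,\Phi_n})
\end{pmatrix}.
\end{aligned}
\tag{2.55}
\]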

This proves that for all $k \in \{1, 2, \dots, L\}$, $x = (x_0, x_1, \dots, x_L) \in X$, $x^1 = (x^1_0, x^1_1, \dots, x^1_L) \in X^1$,
$x^2 = (x^2_0, x^2_1, \dots, x^2_L) \in X^2$, $\dots$, $x^n = (x^n_0, x^n_1, \dots, x^n_L) \in X^n$ with $x_{k-1} = (x^1_{k-1}, x^2_{k-1}, \dots, x^n_{k-1})$ it holds that

\[
x_k = (x^1_k, x^2_k, \dots, x^n_k).
\tag{2.56}
\]

Induction and (1.91) hence demonstrate that for all $k \in \{1, 2, \dots, L\}$, $x = (x_0, x_1, \dots, x_L) \in X$, $x^1 = (x^1_0, x^1_1, \dots, x^1_L) \in X^1$, $x^2 = (x^2_0, x^2_1, \dots, x^2_L) \in X^2$, $\dots$, $x^n = (x^n_0, x^n_1, \dots, x^n_L) \in X^n$
with $x_0 = (x^1_0, x^2_0, \dots, x^n_0)$ it holds that

\[
\begin{aligned}
\big( \mathcal{R}^{\mathbf{N}}_a(\mathbf{P}_n(\Phi)) \big)(x_0) &= x_L = (x^1_L, x^2_L, \dots, x^n_L) \\
&= \big( (\mathcal{R}^{\mathbf{N}}_a(\Phi_1))(x^1_0), (\mathcal{R}^{\mathbf{N}}_a(\Phi_2))(x^2_0), \dots, (\mathcal{R}^{\mathbf{N}}_a(\Phi_n))(x^n_0) \big).
\end{aligned}
\tag{2.57}
\]

This establishes item (ii). The proof of Proposition 2.2.3 is thus complete.
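The block-diagonal construction in (2.44) and the statement of item (ii) can be reproduced with the following minimal sketch. It assumes ANNs stored as lists of NumPy pairs (W_k, B_k) and tanh as activation; `random_ann`, `realize`, `block_diag`, and `parallelize` are hypothetical helper names.

```python
import numpy as np

def random_ann(dims, rng):
    """ANN as a list of (W_k, B_k) with layer dimensions dims."""
    return [(rng.standard_normal((dims[k + 1], dims[k])),
             rng.standard_normal(dims[k + 1]))
            for k in range(len(dims) - 1)]

def realize(ann, x, act=np.tanh):
    """Realization: activation after every layer except the last one."""
    for W, B in ann[:-1]:
        x = act(W @ x + B)
    W, B = ann[-1]
    return W @ x + B

def block_diag(mats):
    """Block-diagonal matrix with the given blocks on the diagonal."""
    out = np.zeros((sum(m.shape[0] for m in mats), sum(m.shape[1] for m in mats)))
    r = c = 0
    for m in mats:
        out[r:r + m.shape[0], c:c + m.shape[1]] = m
        r, c = r + m.shape[0], c + m.shape[1]
    return out

def parallelize(anns):
    """Parallelization of ANNs with the same length as in (2.44)."""
    L = len(anns[0])
    assert all(len(ann) == L for ann in anns), "all ANNs need the same length"
    return [(block_diag([ann[k][0] for ann in anns]),
             np.concatenate([ann[k][1] for ann in anns]))
            for k in range(L)]

rng = np.random.default_rng(3)
anns = [random_ann([2, 3, 1], rng), random_ann([4, 2, 5], rng)]
xs = [rng.standard_normal(2), rng.standard_normal(4)]
par = parallelize(anns)
joint = realize(par, np.concatenate(xs))
separate = np.concatenate([realize(ann, x) for ann, x in zip(anns, xs)])
```

Up to rounding, `joint` and `separate` coincide, and the weight matrices of `par` have the shapes 5 × 6 and 6 × 5 predicted by item (ii) in Lemma 2.2.2.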

Proposition 2.2.4 (Upper bounds for the numbers of parameters of parallelizations of
fully-connected feedforward ANNs). Let n, L ∈ N, Φ1 , Φ2 , . . . , Φn ∈ N satisfy L = L(Φ1 ) =
L(Φ2 ) = . . . = L(Φn ) (cf. Definition 1.3.1). Then

\[
\mathcal{P}\big( \mathbf{P}_n(\Phi_1, \Phi_2, \dots, \Phi_n) \big) \le \tfrac{1}{2} \Big[ \sum_{j=1}^n \mathcal{P}(\Phi_j) \Big]^2
\tag{2.58}
\]

(cf. Definition 2.2.1).

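Proof of Proposition 2.2.4. Throughout this proof, for every $j \in \{1, 2, \dots, n\}$, $k \in \{0, 1, \dots, L\}$ let $l_{j,k} = \mathbb{D}_k(\Phi_j)$. Note that item (ii) in Lemma 2.2.2 demonstrates that

\[
\begin{aligned}
\mathcal{P}(\mathbf{P}_n(\Phi_1, \Phi_2, \dots, \Phi_n))
&= \sum_{k=1}^{L} \Big[ \sum_{i=1}^n l_{i,k} \Big] \Big[ \Big( \sum_{i=1}^n l_{i,k-1} \Big) + 1 \Big] \\
&= \sum_{k=1}^{L} \Big[ \sum_{i=1}^n l_{i,k} \Big] \Big[ \Big( \sum_{j=1}^n l_{j,k-1} \Big) + 1 \Big] \\
&\le \sum_{i=1}^n \sum_{j=1}^n \sum_{k=1}^{L} l_{i,k}(l_{j,k-1} + 1)
\le \sum_{i=1}^n \sum_{j=1}^n \sum_{k,\ell=1}^{L} l_{i,k}(l_{j,\ell-1} + 1) \\
&= \sum_{i=1}^n \sum_{j=1}^n \Big[ \sum_{k=1}^{L} l_{i,k} \Big] \Big[ \sum_{\ell=1}^{L} (l_{j,\ell-1} + 1) \Big] \\
&\le \sum_{i=1}^n \sum_{j=1}^n \Big[ \sum_{k=1}^{L} \tfrac{1}{2}\, l_{i,k}(l_{i,k-1} + 1) \Big] \Big[ \sum_{\ell=1}^{L} l_{j,\ell}(l_{j,\ell-1} + 1) \Big] \\
&= \sum_{i=1}^n \sum_{j=1}^n \tfrac{1}{2}\, \mathcal{P}(\Phi_i)\, \mathcal{P}(\Phi_j)
= \tfrac{1}{2} \Big[ \sum_{i=1}^n \mathcal{P}(\Phi_i) \Big]^2.
\end{aligned}
\tag{2.59}
\]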

The proof of Proposition 2.2.4 is thus complete.

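Corollary 2.2.5 (Lower and upper bounds for the numbers of parameters of parallelizations
of fully-connected feedforward ANNs). Let n ∈ N, Φ = (Φ1 , . . . , Φn ) ∈ Nn satisfy D(Φ1 ) =
D(Φ2 ) = . . . = D(Φn ) (cf. Definition 1.3.1). Then

\[
\Big[ \tfrac{n^2}{2} \Big] \mathcal{P}(\Phi_1) \le \Big[ \tfrac{n^2+n}{2} \Big] \mathcal{P}(\Phi_1) \le \mathcal{P}(\mathbf{P}_n(\Phi)) \le n^2\, \mathcal{P}(\Phi_1) \le \tfrac{1}{2} \Big[ \sum_{i=1}^n \mathcal{P}(\Phi_i) \Big]^2
\tag{2.60}
\]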

(cf. Definition 2.2.1).

Proof of Corollary 2.2.5. Throughout this proof, let L ∈ N, l0 , l1 , . . . , lL ∈ N satisfy

D(Φ1 ) = (l0 , l1 , . . . , lL ). (2.61)

Observe that (2.61) and the assumption that D(Φ1 ) = D(Φ2 ) = . . . = D(Φn ) imply that for
all j ∈ {1, 2, . . . , n} it holds that

D(Φj ) = (l0 , l1 , . . . , lL ). (2.62)

Combining this with item (iii) in Lemma 2.2.2 demonstrates that


\[
\mathcal{P}(\mathbf{P}_n(\Phi)) = \sum_{j=1}^{L} (n l_j)\big( (n l_{j-1}) + 1 \big).
\tag{2.63}
\]
j=1


Hence, we obtain that


\[
\mathcal{P}(\mathbf{P}_n(\Phi)) \le \sum_{j=1}^{L} (n l_j)\big( (n l_{j-1}) + n \big) = n^2 \Big[ \sum_{j=1}^{L} l_j(l_{j-1} + 1) \Big] = n^2\, \mathcal{P}(\Phi_1).
\tag{2.64}
\]

Furthermore, note that the assumption that D(Φ1 ) = D(Φ2 ) = . . . = D(Φn ) and the fact
that P(Φ1 ) ≥ l1 (l0 + 1) ≥ 2 ensure that
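\[
n^2\, \mathcal{P}(\Phi_1) \le \tfrac{n^2}{2} [\mathcal{P}(\Phi_1)]^2 = \tfrac{1}{2} [n \mathcal{P}(\Phi_1)]^2 = \tfrac{1}{2} \Big[ \sum_{i=1}^n \mathcal{P}(\Phi_1) \Big]^2 = \tfrac{1}{2} \Big[ \sum_{i=1}^n \mathcal{P}(\Phi_i) \Big]^2.
\tag{2.65}
\]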

Moreover, observe that (2.63) and the fact that for all a, b ∈ N it holds that
2(ab + 1) = ab + 1 + (a − 1)(b − 1) + a + b ≥ ab + a + b + 1 = (a + 1)(b + 1) (2.66)
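show that

\[
\begin{aligned}
\mathcal{P}(\mathbf{P}_n(\Phi)) &\ge \tfrac{1}{2} \Big[ \sum_{j=1}^{L} (n l_j)(n + 1)(l_{j-1} + 1) \Big] \\
&= \tfrac{n(n+1)}{2} \Big[ \sum_{j=1}^{L} l_j(l_{j-1} + 1) \Big] = \Big[ \tfrac{n^2+n}{2} \Big] \mathcal{P}(\Phi_1).
\end{aligned}
\tag{2.67}
\]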

This, (2.64), and (2.65) establish (2.60). The proof of Corollary 2.2.5 is thus complete.
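The chain of inequalities (2.60) and the bound (2.58) can be checked on concrete architectures using only the layer dimension vectors. The following sketch is an illustration under assumptions: `params` and `parallel_dims` are hypothetical helper names and the dimension vectors are arbitrary examples.

```python
def params(dims):
    """Number of parameters of an ANN with layer dimensions dims = (l_0, ..., l_L)."""
    return sum(dims[k] * (dims[k - 1] + 1) for k in range(1, len(dims)))

def parallel_dims(dim_lists):
    """Layer dimensions of a parallelization, cf. item (ii) in Lemma 2.2.2."""
    return [sum(column) for column in zip(*dim_lists)]

# Corollary 2.2.5: n copies of the same architecture
dims, n = [2, 3, 1], 3
p1 = params(dims)                               # parameters of one ANN
pn = params(parallel_dims([dims] * n))          # parameters of the parallelization
chain = [n * n * p1, (n * n + n) * p1, 2 * pn, 2 * n * n * p1, (n * p1) ** 2]

# Proposition 2.2.4: different architectures with the same length
dims_a, dims_b = [2, 3, 1], [4, 2, 5]
p_ab = params(parallel_dims([dims_a, dims_b]))
bound_ab = (params(dims_a) + params(dims_b)) ** 2   # twice the bound in (2.58)
```

The list `chain` contains twice the five quantities in (2.60), so that only integers are compared; it is nondecreasing for this example.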
Exercise 2.2.1. Prove or disprove the following statement: For every n ∈ N, Φ = (Φ1 , . . . ,
Φn ) ∈ Nn with L(Φ1 ) = L(Φ2 ) = . . . = L(Φn ) it holds that
\[
\mathcal{P}(\mathbf{P}_n(\Phi_1, \Phi_2, \dots, \Phi_n)) \le n \Big[ \sum_{i=1}^n \mathcal{P}(\Phi_i) \Big].
\tag{2.68}
\]

Exercise 2.2.2. Prove or disprove the following statement: For every n ∈ N, Φ = (Φ1 , . . . ,
Φn ) ∈ Nn with P(Φ1 ) = P(Φ2 ) = . . . = P(Φn ) it holds that
P(Pn (Φ1 , Φ2 , . . . , Φn )) ≤ n2 P(Φ1 ). (2.69)

2.2.2 Representations of the identities with ReLU activation functions
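Definition 2.2.6 (Fully-connected feedforward ReLU identity ANNs). We denote by
$\mathfrak{I}_d \in \mathbf{N}$, $d \in \mathbb{N}$, the fully-connected feedforward ANNs which satisfy for all $d \in \mathbb{N}$ that

\[
\mathfrak{I}_1 = \left( \left( \begin{pmatrix} 1 \\ -1 \end{pmatrix}, \begin{pmatrix} 0 \\ 0 \end{pmatrix} \right), \Big( \begin{pmatrix} 1 & -1 \end{pmatrix}, 0 \Big) \right) \in (\mathbb{R}^{2 \times 1} \times \mathbb{R}^{2}) \times (\mathbb{R}^{1 \times 2} \times \mathbb{R}^{1})
\tag{2.70}
\]

and

\[
\mathfrak{I}_d = \mathbf{P}_d(\mathfrak{I}_1, \mathfrak{I}_1, \dots, \mathfrak{I}_1)
\tag{2.71}
\]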
(cf. Definitions 1.3.1 and 2.2.1).


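Lemma 2.2.7 (Properties of fully-connected feedforward ReLU identity ANNs). Let d ∈ N.
Then
(i) it holds that
\[
\mathcal{D}(\mathfrak{I}_d) = (d, 2d, d) \in \mathbb{N}^3
\tag{2.72}
\]
and
(ii) it holds that
\[
\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathfrak{I}_d) = \operatorname{id}_{\mathbb{R}^d}
\tag{2.73}
\]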
(cf. Definitions 1.3.1, 1.3.4, and 2.2.6).


Proof of Lemma 2.2.7. Throughout this proof, let L = 2, l0 = 1, l1 = 2, l2 = 1. Note that
(2.70) establishes that
D(I1 ) = (1, 2, 1) = (l0 , l1 , l2 ). (2.74)
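This, (2.71), and item (iii) in Lemma 2.2.2 prove that
\[
\mathcal{D}(\mathfrak{I}_d) = (d, 2d, d) \in \mathbb{N}^3.
\tag{2.75}
\]
This establishes item (i). Next note that (2.70) assures that for all x ∈ R it holds that
\[
(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathfrak{I}_1))(x) = \mathfrak{r}(x) - \mathfrak{r}(-x) = \max\{x, 0\} - \max\{-x, 0\} = x.
\tag{2.76}
\]
Combining this and Proposition 2.2.3 demonstrates that for all $x = (x_1, \dots, x_d) \in \mathbb{R}^d$ it
holds that $\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathfrak{I}_d) \in C(\mathbb{R}^d, \mathbb{R}^d)$ and
\[
\begin{aligned}
(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathfrak{I}_d))(x) &= \big( \mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{P}_d(\mathfrak{I}_1, \mathfrak{I}_1, \dots, \mathfrak{I}_1)) \big)(x_1, x_2, \dots, x_d) \\
&= \big( (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathfrak{I}_1))(x_1), (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathfrak{I}_1))(x_2), \dots, (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathfrak{I}_1))(x_d) \big) \\
&= (x_1, x_2, \dots, x_d) = x
\end{aligned}
\tag{2.77}
\]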
(cf. Definition 2.2.1). This establishes item (ii). The proof of Lemma 2.2.7 is thus complete.
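The ReLU identity ANNs from Definition 2.2.6 are easy to write down explicitly. The sketch below builds the block-diagonal weights of (2.71) with Kronecker products, which is one possible way to form the parallelization of d copies of the ANN in (2.70); the names `relu`, `relu_identity`, and `realize_relu` are hypothetical.

```python
import numpy as np

def relu(x):
    """ReLU activation r(x) = max{x, 0}, applied componentwise."""
    return np.maximum(x, 0.0)

def relu_identity(d):
    """ReLU identity ANN with architecture (d, 2d, d), cf. (2.70) and (2.71)."""
    W1 = np.kron(np.eye(d), np.array([[1.0], [-1.0]]))   # (2d) x d, blocks (1, -1)^T
    W2 = np.kron(np.eye(d), np.array([[1.0, -1.0]]))     # d x (2d), blocks (1  -1)
    return [(W1, np.zeros(2 * d)), (W2, np.zeros(d))]

def realize_relu(ann, x):
    """Realization with ReLU after every layer except the last one."""
    for W, B in ann[:-1]:
        x = relu(W @ x + B)
    W, B = ann[-1]
    return W @ x + B

d = 4
ident = relu_identity(d)
architecture = [ident[0][0].shape[1]] + [W.shape[0] for W, _ in ident]
x = np.array([1.5, -2.0, 0.0, 3.25])
y = realize_relu(ident, x)
```

Here `architecture` is `[4, 8, 4]` and `y` reproduces `x`, since every coordinate is mapped to r(x_i) − r(−x_i) = x_i as in (2.76).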

2.2.3 Extensions of fully-connected feedforward ANNs


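Definition 2.2.8 (Extensions of fully-connected feedforward ANNs). Let $L \in \mathbb{N}$, $\mathbb{I} \in \mathbf{N}$
satisfy $\mathcal{I}(\mathbb{I}) = \mathcal{O}(\mathbb{I})$. Then we denote by
\[
\mathcal{E}_{L,\mathbb{I}} \colon \big\{ \Phi \in \mathbf{N} \colon \mathcal{L}(\Phi) \le L \text{ and } \mathcal{O}(\Phi) = \mathcal{I}(\mathbb{I}) \big\} \to \mathbf{N}
\tag{2.78}
\]
the function which satisfies for all $\Phi \in \mathbf{N}$ with $\mathcal{L}(\Phi) \le L$ and $\mathcal{O}(\Phi) = \mathcal{I}(\mathbb{I})$ that
\[
\mathcal{E}_{L,\mathbb{I}}(\Phi) = (\mathbb{I}^{\bullet (L - \mathcal{L}(\Phi))}) \bullet \Phi
\tag{2.79}
\]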
(cf. Definitions 1.3.1, 2.1.1, and 2.1.6).


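Lemma 2.2.9 (Length of extensions of fully-connected feedforward ANNs). Let d, i ∈ N,
Ψ ∈ N satisfy D(Ψ) = (d, i, d) (cf. Definition 1.3.1). Then
(i) it holds for all $n \in \mathbb{N}_0$ that $\mathcal{H}(\Psi^{\bullet n}) = n$, $\mathcal{L}(\Psi^{\bullet n}) = n + 1$, $\mathcal{D}(\Psi^{\bullet n}) \in \mathbb{N}^{n+2}$, and
\[
\mathcal{D}(\Psi^{\bullet n}) =
\begin{cases}
(d, d) & : n = 0 \\
(d, i, i, \dots, i, d) & : n \in \mathbb{N}
\end{cases}
\tag{2.80}
\]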

and

(ii) it holds for all Φ ∈ N, L ∈ N ∩ [L(Φ), ∞) with O(Φ) = d that

L(EL,Ψ (Φ)) = L (2.81)

(cf. Definitions 2.1.6 and 2.2.8).


Proof of Lemma 2.2.9. Throughout this proof, let Φ ∈ N satisfy O(Φ) = d. Observe that
Lemma 2.1.7 and the fact that H(Ψ) = 1 show that for all n ∈ N0 it holds that

H(Ψ•n ) = nH(Ψ) = n (2.82)

(cf. Definition 2.1.6). Combining this with (1.78) and Lemma 1.3.3 ensures that

H(Ψ•n ) = n, L(Ψ•n ) = n + 1, and D(Ψ•n ) ∈ Nn+2 . (2.83)

Next we claim that for all n ∈ N0 it holds that


\[
\mathbb{N}^{n+2} \ni \mathcal{D}(\Psi^{\bullet n}) =
\begin{cases}
(d, d) & : n = 0 \\
(d, i, i, \dots, i, d) & : n \in \mathbb{N}.
\end{cases}
\tag{2.84}
\]

We now prove (2.84) by induction on n ∈ N0 . Note that the fact that

Ψ•0 = (Id , 0) ∈ Rd×d × Rd (2.85)

establishes (2.84) in the base case n = 0 (cf. Definition 1.5.5). For the induction step assume
that there exists n ∈ N0 which satisfies
\[
\mathbb{N}^{n+2} \ni \mathcal{D}(\Psi^{\bullet n}) =
\begin{cases}
(d, d) & : n = 0 \\
(d, i, i, \dots, i, d) & : n \in \mathbb{N}.
\end{cases}
\tag{2.86}
\]

Note that (2.86), (2.41), (2.83), item (i) in Proposition 2.1.2, and the fact that D(Ψ) =
(d, i, d) ∈ N3 imply that

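\[
\mathcal{D}(\Psi^{\bullet (n+1)}) = \mathcal{D}(\Psi \bullet (\Psi^{\bullet n})) = (d, i, i, \dots, i, d) \in \mathbb{N}^{n+3}
\tag{2.87}
\]

(cf. Definition 2.1.1). Induction therefore proves (2.84). This and (2.83) establish item (i).
Observe that (2.79), item (iii) in Proposition 2.1.2, (2.82), and the fact that H(Φ) = L(Φ)−1
imply that for all L ∈ N ∩ [L(Φ), ∞) it holds that

\[
\begin{aligned}
\mathcal{H}\big( \mathcal{E}_{L,\Psi}(\Phi) \big) &= \mathcal{H}\big( (\Psi^{\bullet (L - \mathcal{L}(\Phi))}) \bullet \Phi \big) = \mathcal{H}\big( \Psi^{\bullet (L - \mathcal{L}(\Phi))} \big) + \mathcal{H}(\Phi) \\
&= (L - \mathcal{L}(\Phi)) + \mathcal{H}(\Phi) = L - 1.
\end{aligned}
\tag{2.88}
\]

The fact that $\mathcal{H}(\mathcal{E}_{L,\Psi}(\Phi)) = \mathcal{L}(\mathcal{E}_{L,\Psi}(\Phi)) - 1$ hence proves that

\[
\mathcal{L}\big( \mathcal{E}_{L,\Psi}(\Phi) \big) = \mathcal{H}\big( \mathcal{E}_{L,\Psi}(\Phi) \big) + 1 = L.
\tag{2.89}
\]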

This establishes item (ii). The proof of Lemma 2.2.9 is thus complete.

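Lemma 2.2.10 (Realizations of extensions of fully-connected feedforward ANNs). Let
$a \in C(\mathbb{R}, \mathbb{R})$, $\mathbb{I} \in \mathbf{N}$ satisfy $\mathcal{R}^{\mathbf{N}}_a(\mathbb{I}) = \operatorname{id}_{\mathbb{R}^{\mathcal{I}(\mathbb{I})}}$ (cf. Definitions 1.3.1 and 1.3.4). Then

(i) it holds for all $n \in \mathbb{N}_0$ that

\[
\mathcal{R}^{\mathbf{N}}_a(\mathbb{I}^{\bullet n}) = \operatorname{id}_{\mathbb{R}^{\mathcal{I}(\mathbb{I})}}
\tag{2.90}
\]
and

(ii) it holds for all $\Phi \in \mathbf{N}$, $L \in \mathbb{N} \cap [\mathcal{L}(\Phi), \infty)$ with $\mathcal{O}(\Phi) = \mathcal{I}(\mathbb{I})$ that

\[
\mathcal{R}^{\mathbf{N}}_a(\mathcal{E}_{L,\mathbb{I}}(\Phi)) = \mathcal{R}^{\mathbf{N}}_a(\Phi)
\tag{2.91}
\]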

(cf. Definitions 2.1.6 and 2.2.8).

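Proof of Lemma 2.2.10. Throughout this proof, let $\Phi \in \mathbf{N}$, $L, d \in \mathbb{N}$ satisfy $\mathcal{L}(\Phi) \le L$ and
$\mathcal{I}(\mathbb{I}) = \mathcal{O}(\Phi) = d$. We claim that for all $n \in \mathbb{N}_0$ it holds that

\[
\mathcal{R}^{\mathbf{N}}_a(\mathbb{I}^{\bullet n}) \in C(\mathbb{R}^d, \mathbb{R}^d) \qquad\text{and}\qquad \forall\, x \in \mathbb{R}^d \colon (\mathcal{R}^{\mathbf{N}}_a(\mathbb{I}^{\bullet n}))(x) = x.
\tag{2.92}
\]

We now prove (2.92) by induction on $n \in \mathbb{N}_0$. Note that (2.41) and the fact that $\mathcal{O}(\mathbb{I}) = d$
demonstrate that $\mathcal{R}^{\mathbf{N}}_a(\mathbb{I}^{\bullet 0}) \in C(\mathbb{R}^d, \mathbb{R}^d)$ and $\forall\, x \in \mathbb{R}^d \colon (\mathcal{R}^{\mathbf{N}}_a(\mathbb{I}^{\bullet 0}))(x) = x$. This establishes
(2.92) in the base case n = 0. For the induction step observe that for all $n \in \mathbb{N}_0$ with
$\mathcal{R}^{\mathbf{N}}_a(\mathbb{I}^{\bullet n}) \in C(\mathbb{R}^d, \mathbb{R}^d)$ and $\forall\, x \in \mathbb{R}^d \colon (\mathcal{R}^{\mathbf{N}}_a(\mathbb{I}^{\bullet n}))(x) = x$ it holds that

\[
\mathcal{R}^{\mathbf{N}}_a(\mathbb{I}^{\bullet (n+1)}) = \mathcal{R}^{\mathbf{N}}_a(\mathbb{I} \bullet (\mathbb{I}^{\bullet n})) = (\mathcal{R}^{\mathbf{N}}_a(\mathbb{I})) \circ (\mathcal{R}^{\mathbf{N}}_a(\mathbb{I}^{\bullet n})) \in C(\mathbb{R}^d, \mathbb{R}^d)
\tag{2.93}
\]

and

\[
\begin{aligned}
\forall\, x \in \mathbb{R}^d \colon \big( \mathcal{R}^{\mathbf{N}}_a(\mathbb{I}^{\bullet (n+1)}) \big)(x) &= \big( [\mathcal{R}^{\mathbf{N}}_a(\mathbb{I})] \circ [\mathcal{R}^{\mathbf{N}}_a(\mathbb{I}^{\bullet n})] \big)(x) \\
&= (\mathcal{R}^{\mathbf{N}}_a(\mathbb{I}))\big( (\mathcal{R}^{\mathbf{N}}_a(\mathbb{I}^{\bullet n}))(x) \big) = (\mathcal{R}^{\mathbf{N}}_a(\mathbb{I}))(x) = x.
\end{aligned}
\tag{2.94}
\]

Induction therefore proves (2.92). This establishes item (i). Note that (2.79), item (v) in
Proposition 2.1.2, item (i), and the fact that $\mathcal{I}(\mathbb{I}) = \mathcal{O}(\Phi)$ ensure that

\[
\begin{aligned}
\mathcal{R}^{\mathbf{N}}_a(\mathcal{E}_{L,\mathbb{I}}(\Phi)) &= \mathcal{R}^{\mathbf{N}}_a((\mathbb{I}^{\bullet (L - \mathcal{L}(\Phi))}) \bullet \Phi) \\
&\in C(\mathbb{R}^{\mathcal{I}(\Phi)}, \mathbb{R}^{\mathcal{O}(\mathbb{I})}) = C(\mathbb{R}^{\mathcal{I}(\Phi)}, \mathbb{R}^{\mathcal{I}(\mathbb{I})}) = C(\mathbb{R}^{\mathcal{I}(\Phi)}, \mathbb{R}^{\mathcal{O}(\Phi)})
\end{aligned}
\tag{2.95}
\]

and

\[
\begin{aligned}
\forall\, x \in \mathbb{R}^{\mathcal{I}(\Phi)} \colon \big( \mathcal{R}^{\mathbf{N}}_a(\mathcal{E}_{L,\mathbb{I}}(\Phi)) \big)(x) &= \big( \mathcal{R}^{\mathbf{N}}_a(\mathbb{I}^{\bullet (L - \mathcal{L}(\Phi))}) \big)\big( (\mathcal{R}^{\mathbf{N}}_a(\Phi))(x) \big) \\
&= (\mathcal{R}^{\mathbf{N}}_a(\Phi))(x).
\end{aligned}
\tag{2.96}
\]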

This establishes item (ii). The proof of Lemma 2.2.10 is thus complete.
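The extension (2.79) together with Lemma 2.2.9 and Lemma 2.2.10 can be illustrated for the ReLU activation, where the ReLU identity ANN from Definition 2.2.6 plays the role of the identity-like ANN. The following sketch assumes the list-of-pairs representation of ANNs; all function names are hypothetical helpers and the example architecture is arbitrary.

```python
import numpy as np

def realize_relu(ann, x):
    """Realization with ReLU after every layer except the last one."""
    for W, B in ann[:-1]:
        x = np.maximum(W @ x + B, 0.0)
    W, B = ann[-1]
    return W @ x + B

def compose(phi, psi):
    """Composition phi • psi as in (2.2)."""
    W1, B1 = phi[0]
    WL, BL = psi[-1]
    return psi[:-1] + [(W1 @ WL, W1 @ BL + B1)] + phi[1:]

def relu_identity(d):
    """ReLU identity ANN with architecture (d, 2d, d), cf. Definition 2.2.6."""
    return [(np.kron(np.eye(d), np.array([[1.0], [-1.0]])), np.zeros(2 * d)),
            (np.kron(np.eye(d), np.array([[1.0, -1.0]])), np.zeros(d))]

def power(phi, n):
    """n-th power of phi as in (2.41)."""
    d = phi[-1][0].shape[0]
    result = [(np.eye(d), np.zeros(d))]
    for _ in range(n):
        result = compose(phi, result)
    return result

def extend(phi, L, ident):
    """Extension of phi to length L as in (2.79)."""
    assert len(phi) <= L, "need L(phi) <= L"
    return compose(power(ident, L - len(phi)), phi)

rng = np.random.default_rng(4)
phi = [(rng.standard_normal((4, 3)), rng.standard_normal(4)),
       (rng.standard_normal((2, 4)), rng.standard_normal(2))]   # architecture (3, 4, 2)
ext = extend(phi, 5, relu_identity(2))
x = rng.standard_normal(3)
architecture = [ext[0][0].shape[1]] + [W.shape[0] for W, _ in ext]
```

In this example `ext` has length 5, the architecture (3, 4, 4, 4, 4, 2), and the same ReLU realization as `phi` up to rounding.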

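Lemma 2.2.11 (Architectures of extensions of fully-connected feedforward ANNs). Let
$d, i, L, \mathfrak{L} \in \mathbb{N}$, $l_0, l_1, \dots, l_{L-1} \in \mathbb{N}$, $\Phi, \Psi \in \mathbf{N}$ satisfy

\[
\mathfrak{L} \ge L, \qquad \mathcal{D}(\Phi) = (l_0, l_1, \dots, l_{L-1}, d), \qquad\text{and}\qquad \mathcal{D}(\Psi) = (d, i, d)
\tag{2.97}
\]

(cf. Definition 1.3.1). Then $\mathcal{D}(\mathcal{E}_{\mathfrak{L},\Psi}(\Phi)) \in \mathbb{N}^{\mathfrak{L}+1}$ and

\[
\mathcal{D}(\mathcal{E}_{\mathfrak{L},\Psi}(\Phi)) =
\begin{cases}
(l_0, l_1, \dots, l_{L-1}, d) & : \mathfrak{L} = L \\
(l_0, l_1, \dots, l_{L-1}, i, i, \dots, i, d) & : \mathfrak{L} > L
\end{cases}
\tag{2.98}
\]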

(cf. Definition 2.2.8).

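Proof of Lemma 2.2.11. Observe that item (i) in Lemma 2.2.9 demonstrates that

\[
\mathcal{H}(\Psi^{\bullet (\mathfrak{L} - L)}) = \mathfrak{L} - L, \qquad \mathcal{D}(\Psi^{\bullet (\mathfrak{L} - L)}) \in \mathbb{N}^{\mathfrak{L} - L + 2},
\tag{2.99}
\]

\[
\text{and}\qquad \mathcal{D}(\Psi^{\bullet (\mathfrak{L} - L)}) =
\begin{cases}
(d, d) & : \mathfrak{L} = L \\
(d, i, i, \dots, i, d) & : \mathfrak{L} > L
\end{cases}
\tag{2.100}
\]
(cf. Definition 2.1.6). Combining this with Proposition 2.1.2 establishes that

\[
\mathcal{H}\big( (\Psi^{\bullet (\mathfrak{L} - L)}) \bullet \Phi \big) = \mathcal{H}(\Psi^{\bullet (\mathfrak{L} - L)}) + \mathcal{H}(\Phi) = (\mathfrak{L} - L) + L - 1 = \mathfrak{L} - 1,
\tag{2.101}
\]

\[
\mathcal{D}((\Psi^{\bullet (\mathfrak{L} - L)}) \bullet \Phi) \in \mathbb{N}^{\mathfrak{L}+1},
\tag{2.102}
\]

\[
\text{and}\qquad \mathcal{D}((\Psi^{\bullet (\mathfrak{L} - L)}) \bullet \Phi) =
\begin{cases}
(l_0, l_1, \dots, l_{L-1}, d) & : \mathfrak{L} = L \\
(l_0, l_1, \dots, l_{L-1}, i, i, \dots, i, d) & : \mathfrak{L} > L.
\end{cases}
\tag{2.103}
\]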
This and (2.79) establish (2.98). The proof of Lemma 2.2.11 is thus complete.


2.2.4 Parallelizations of fully-connected feedforward ANNs with different lengths
Definition 2.2.12 (Parallelization of fully-connected feedforward ANNs with different
length). Let n ∈ N, Ψ = (Ψ1 , . . . , Ψn ) ∈ Nn satisfy for all j ∈ {1, 2, . . . , n} that

H(Ψj ) = 1 and I(Ψj ) = O(Ψj ) (2.104)

(cf. Definition 1.3.1). Then we denote by

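\[
\mathrm{P}_{n,\Psi} \colon \big\{ \Phi = (\Phi_1, \dots, \Phi_n) \in \mathbf{N}^n \colon \big( \forall\, j \in \{1, 2, \dots, n\} \colon \mathcal{O}(\Phi_j) = \mathcal{I}(\Psi_j) \big) \big\} \to \mathbf{N}
\tag{2.105}
\]

the function which satisfies for all Φ = (Φ1 , . . . , Φn ) ∈ Nn with ∀ j ∈ {1, 2, . . . , n} :
O(Φj ) = I(Ψj ) that

\[
\mathrm{P}_{n,\Psi}(\Phi) = \mathbf{P}_n\big( \mathcal{E}_{\max_{k \in \{1,2,\dots,n\}} \mathcal{L}(\Phi_k),\, \Psi_1}(\Phi_1), \dots, \mathcal{E}_{\max_{k \in \{1,2,\dots,n\}} \mathcal{L}(\Phi_k),\, \Psi_n}(\Phi_n) \big)
\tag{2.106}
\]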

(cf. Definitions 2.2.1 and 2.2.8 and Lemma 2.2.9).


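Lemma 2.2.13 (Realizations for parallelizations of fully-connected feedforward ANNs
with different length). Let $a \in C(\mathbb{R}, \mathbb{R})$, $n \in \mathbb{N}$, $\mathbb{I} = (\mathbb{I}_1, \dots, \mathbb{I}_n)$, $\Phi = (\Phi_1, \dots, \Phi_n) \in \mathbf{N}^n$
satisfy for all $j \in \{1, 2, \dots, n\}$, $x \in \mathbb{R}^{\mathcal{O}(\Phi_j)}$ that $\mathcal{H}(\mathbb{I}_j) = 1$, $\mathcal{I}(\mathbb{I}_j) = \mathcal{O}(\mathbb{I}_j) = \mathcal{O}(\Phi_j)$, and
$(\mathcal{R}^{\mathbf{N}}_a(\mathbb{I}_j))(x) = x$ (cf. Definitions 1.3.1 and 1.3.4). Then

(i) it holds that

\[
\mathcal{R}^{\mathbf{N}}_a\big( \mathrm{P}_{n,\mathbb{I}}(\Phi) \big) \in C\big( \mathbb{R}^{[\sum_{j=1}^n \mathcal{I}(\Phi_j)]}, \mathbb{R}^{[\sum_{j=1}^n \mathcal{O}(\Phi_j)]} \big)
\tag{2.107}
\]
and

(ii) it holds for all $x_1 \in \mathbb{R}^{\mathcal{I}(\Phi_1)}$, $x_2 \in \mathbb{R}^{\mathcal{I}(\Phi_2)}$, $\dots$, $x_n \in \mathbb{R}^{\mathcal{I}(\Phi_n)}$ that

\[
\begin{aligned}
&\big( \mathcal{R}^{\mathbf{N}}_a(\mathrm{P}_{n,\mathbb{I}}(\Phi)) \big)(x_1, x_2, \dots, x_n) \\
&= \big( (\mathcal{R}^{\mathbf{N}}_a(\Phi_1))(x_1), (\mathcal{R}^{\mathbf{N}}_a(\Phi_2))(x_2), \dots, (\mathcal{R}^{\mathbf{N}}_a(\Phi_n))(x_n) \big) \in \mathbb{R}^{[\sum_{j=1}^n \mathcal{O}(\Phi_j)]}
\end{aligned}
\tag{2.108}
\]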

(cf. Definition 2.2.12).


Proof of Lemma 2.2.13. Throughout this proof, let L ∈ N satisfy L = maxj∈{1,2,...,n} L(Φj ).
Note that item (ii) in Lemma 2.2.9, the assumption that for all j ∈ {1, 2, . . . , n} it holds
that H(Ij ) = 1, (2.79), (2.4), and item (ii) in Lemma 2.2.10 demonstrate
(I) that for all j ∈ {1, 2, . . . , n} it holds that L(EL,Ij (Φj )) = L and RN
a (EL,Ij (Φj )) ∈
C(R I(Φj )
,RO(Φj )
) and

(II) that for all j ∈ {1, 2, . . . , n}, x ∈ RI(Φj ) it holds that

RN N
(2.109)

a (EL,Ij (Φj )) (x) = (Ra (Φj ))(x)

94
2.2. Parallelizations of fully-connected feedforward ANNs

(cf. Definition 2.2.8). Items (i) and (ii) in Proposition 2.2.3 therefore imply

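(A) that

\[ \mathcal{R}^{\mathbf{N}}_a\bigl(\mathbf{P}_n(\mathcal{E}_{L,\mathfrak{I}_1}(\Phi_1), \mathcal{E}_{L,\mathfrak{I}_2}(\Phi_2), \dots, \mathcal{E}_{L,\mathfrak{I}_n}(\Phi_n))\bigr) \in C\bigl(\mathbb{R}^{[\sum_{j=1}^n \mathcal{I}(\Phi_j)]}, \mathbb{R}^{[\sum_{j=1}^n \mathcal{O}(\Phi_j)]}\bigr) \tag{2.110} \]

and

(B) that for all $x_1 \in \mathbb{R}^{\mathcal{I}(\Phi_1)}$, $x_2 \in \mathbb{R}^{\mathcal{I}(\Phi_2)}$, ..., $x_n \in \mathbb{R}^{\mathcal{I}(\Phi_n)}$ it holds that

\[ \begin{split} &\bigl(\mathcal{R}^{\mathbf{N}}_a\bigl(\mathbf{P}_n(\mathcal{E}_{L,\mathfrak{I}_1}(\Phi_1), \mathcal{E}_{L,\mathfrak{I}_2}(\Phi_2), \dots, \mathcal{E}_{L,\mathfrak{I}_n}(\Phi_n))\bigr)\bigr)(x_1, x_2, \dots, x_n) \\ &= \bigl( \bigl(\mathcal{R}^{\mathbf{N}}_a(\mathcal{E}_{L,\mathfrak{I}_1}(\Phi_1))\bigr)(x_1), \bigl(\mathcal{R}^{\mathbf{N}}_a(\mathcal{E}_{L,\mathfrak{I}_2}(\Phi_2))\bigr)(x_2), \dots, \bigl(\mathcal{R}^{\mathbf{N}}_a(\mathcal{E}_{L,\mathfrak{I}_n}(\Phi_n))\bigr)(x_n) \bigr) \\ &= \bigl( (\mathcal{R}^{\mathbf{N}}_a(\Phi_1))(x_1), (\mathcal{R}^{\mathbf{N}}_a(\Phi_2))(x_2), \dots, (\mathcal{R}^{\mathbf{N}}_a(\Phi_n))(x_n) \bigr) \end{split} \tag{2.111} \]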

(cf. Definition 2.2.1). Combining this with (2.106) and the fact that L = maxj∈{1,2,...,n}
L(Φj ) ensures

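(C) that

\[ \mathcal{R}^{\mathbf{N}}_a\bigl(\mathrm{P}_{n,\mathfrak{I}}(\Phi)\bigr) \in C\bigl(\mathbb{R}^{[\sum_{j=1}^n \mathcal{I}(\Phi_j)]}, \mathbb{R}^{[\sum_{j=1}^n \mathcal{O}(\Phi_j)]}\bigr) \tag{2.112} \]

and

(D) that for all $x_1 \in \mathbb{R}^{\mathcal{I}(\Phi_1)}$, $x_2 \in \mathbb{R}^{\mathcal{I}(\Phi_2)}$, ..., $x_n \in \mathbb{R}^{\mathcal{I}(\Phi_n)}$ it holds that

\[ \begin{split} &\bigl(\mathcal{R}^{\mathbf{N}}_a(\mathrm{P}_{n,\mathfrak{I}}(\Phi))\bigr)(x_1, x_2, \dots, x_n) \\ &= \bigl(\mathcal{R}^{\mathbf{N}}_a\bigl(\mathbf{P}_n(\mathcal{E}_{L,\mathfrak{I}_1}(\Phi_1), \mathcal{E}_{L,\mathfrak{I}_2}(\Phi_2), \dots, \mathcal{E}_{L,\mathfrak{I}_n}(\Phi_n))\bigr)\bigr)(x_1, x_2, \dots, x_n) \\ &= \bigl( (\mathcal{R}^{\mathbf{N}}_a(\Phi_1))(x_1), (\mathcal{R}^{\mathbf{N}}_a(\Phi_2))(x_2), \dots, (\mathcal{R}^{\mathbf{N}}_a(\Phi_n))(x_n) \bigr). \end{split} \tag{2.113} \]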

This establishes items (i) and (ii). The proof of Lemma 2.2.13 is thus complete.

Exercise 2.2.3. For every d ∈ N let Fd : Rd → Rd satisfy for all x = (x1 , . . . , xd ) ∈ Rd that

Fd (x) = (max{|x1 |}, max{|x1 |, |x2 |}, . . . , max{|x1 |, |x2 |, . . . , |xd |}). (2.114)

Prove or disprove the following statement: For all d ∈ N there exists Φ ∈ N such that

\[ \mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\Phi) = F_d \tag{2.115} \]

(cf. Definitions 1.2.4, 1.3.1, and 1.3.4).


2.3 Scalar multiplications of fully-connected feedforward ANNs
2.3.1 Affine transformations as fully-connected feedforward ANNs
Definition 2.3.1 (Fully-connected feedforward affine transformation ANNs). Let m, n ∈ N,
W ∈ Rm×n , B ∈ Rm . Then we denote by

AW,B ∈ (Rm×n × Rm ) ⊆ N (2.116)

the fully-connected feedforward ANN given by

AW,B = (W, B) (2.117)

(cf. Definitions 1.3.1 and 1.3.2).


Lemma 2.3.2 (Realizations of fully-connected feedforward affine transformation of ANNs).
Let m, n ∈ N, W ∈ Rm×n , B ∈ Rm . Then
(i) it holds that D(AW,B ) = (n, m) ∈ N2 ,

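(ii) it holds for all a ∈ C(R, R) that $\mathcal{R}^{\mathbf{N}}_a(\mathbf{A}_{W,B}) \in C(\mathbb{R}^n, \mathbb{R}^m)$, and

(iii) it holds for all a ∈ C(R, R), $x \in \mathbb{R}^n$ that

\[ (\mathcal{R}^{\mathbf{N}}_a(\mathbf{A}_{W,B}))(x) = Wx + B \tag{2.118} \]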

(cf. Definitions 1.3.1, 1.3.4, and 2.3.1).


Proof of Lemma 2.3.2. Note that the fact that AW,B ∈ (Rm×n × Rm ) ⊆ N shows that

D(AW,B ) = (n, m) ∈ N2 . (2.119)

This proves item (i). Furthermore, observe that the fact that

AW,B = (W, B) ∈ (Rm×n × Rm ) (2.120)

and (1.91) ensure that for all a ∈ C(R, R), $x \in \mathbb{R}^n$ it holds that $\mathcal{R}^{\mathbf{N}}_a(\mathbf{A}_{W,B}) \in C(\mathbb{R}^n, \mathbb{R}^m)$ and

\[ (\mathcal{R}^{\mathbf{N}}_a(\mathbf{A}_{W,B}))(x) = Wx + B. \tag{2.121} \]

This establishes items (ii) and (iii). The proof of Lemma 2.3.2 is thus complete.
Lemma 2.3.3 (Compositions with fully-connected feedforward affine transformation ANNs).
Let Φ ∈ N (cf. Definition 1.3.1). Then


(i) it holds for all m ∈ N, W ∈ Rm×O(Φ) , B ∈ Rm that

D(AW,B • Φ) = (D0 (Φ), D1 (Φ), . . . , DH(Φ) (Φ), m), (2.122)

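(ii) it holds for all $a \in C(\mathbb{R}, \mathbb{R})$, $m \in \mathbb{N}$, $W \in \mathbb{R}^{m \times \mathcal{O}(\Phi)}$, $B \in \mathbb{R}^m$ that $\mathcal{R}^{\mathbf{N}}_a(\mathbf{A}_{W,B} \bullet \Phi) \in C(\mathbb{R}^{\mathcal{I}(\Phi)}, \mathbb{R}^m)$,

(iii) it holds for all $a \in C(\mathbb{R}, \mathbb{R})$, $m \in \mathbb{N}$, $W \in \mathbb{R}^{m \times \mathcal{O}(\Phi)}$, $B \in \mathbb{R}^m$, $x \in \mathbb{R}^{\mathcal{I}(\Phi)}$ that

\[ (\mathcal{R}^{\mathbf{N}}_a(\mathbf{A}_{W,B} \bullet \Phi))(x) = W\bigl((\mathcal{R}^{\mathbf{N}}_a(\Phi))(x)\bigr) + B, \tag{2.123} \]

(iv) it holds for all $n \in \mathbb{N}$, $W \in \mathbb{R}^{\mathcal{I}(\Phi) \times n}$, $B \in \mathbb{R}^{\mathcal{I}(\Phi)}$ that

\[ \mathcal{D}(\Phi \bullet \mathbf{A}_{W,B}) = (n, \mathbb{D}_1(\Phi), \mathbb{D}_2(\Phi), \dots, \mathbb{D}_{\mathcal{L}(\Phi)}(\Phi)), \tag{2.124} \]

(v) it holds for all $a \in C(\mathbb{R}, \mathbb{R})$, $n \in \mathbb{N}$, $W \in \mathbb{R}^{\mathcal{I}(\Phi) \times n}$, $B \in \mathbb{R}^{\mathcal{I}(\Phi)}$ that $\mathcal{R}^{\mathbf{N}}_a(\Phi \bullet \mathbf{A}_{W,B}) \in C(\mathbb{R}^n, \mathbb{R}^{\mathcal{O}(\Phi)})$, and

(vi) it holds for all $a \in C(\mathbb{R}, \mathbb{R})$, $n \in \mathbb{N}$, $W \in \mathbb{R}^{\mathcal{I}(\Phi) \times n}$, $B \in \mathbb{R}^{\mathcal{I}(\Phi)}$, $x \in \mathbb{R}^n$ that

\[ (\mathcal{R}^{\mathbf{N}}_a(\Phi \bullet \mathbf{A}_{W,B}))(x) = (\mathcal{R}^{\mathbf{N}}_a(\Phi))(Wx + B) \tag{2.125} \]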

(cf. Definitions 1.3.4, 2.1.1, and 2.3.1).


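Proof of Lemma 2.3.3. Note that Lemma 2.3.2 implies that for all $m, n \in \mathbb{N}$, $W \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^m$, $a \in C(\mathbb{R}, \mathbb{R})$, $x \in \mathbb{R}^n$ it holds that $\mathcal{R}^{\mathbf{N}}_a(\mathbf{A}_{W,B}) \in C(\mathbb{R}^n, \mathbb{R}^m)$ and

\[ (\mathcal{R}^{\mathbf{N}}_a(\mathbf{A}_{W,B}))(x) = Wx + B \tag{2.126} \]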

(cf. Definitions 1.3.4 and 2.3.1). Combining this and Proposition 2.1.2 proves items (i), (ii),
(iii), (iv), (v), and (vi). The proof of Lemma 2.3.3 is thus complete.

2.3.2 Scalar multiplications of fully-connected feedforward ANNs


Definition 2.3.4 (Scalar multiplications of ANNs). We denote by (·) ⊛ (·) : R × N → N
the function which satisfies for all λ ∈ R, Φ ∈ N that

\[ \lambda \circledast \Phi = \mathbf{A}_{\lambda \operatorname{I}_{\mathcal{O}(\Phi)}, 0} \bullet \Phi \tag{2.127} \]

(cf. Definitions 1.3.1, 1.5.5, 2.1.1, and 2.3.1).


Lemma 2.3.5. Let λ ∈ R, Φ ∈ N (cf. Definition 1.3.1). Then
(i) it holds that D(λ ⊛ Φ) = D(Φ),


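(ii) it holds for all a ∈ C(R, R) that $\mathcal{R}^{\mathbf{N}}_a(\lambda \circledast \Phi) \in C(\mathbb{R}^{\mathcal{I}(\Phi)}, \mathbb{R}^{\mathcal{O}(\Phi)})$, and
(iii) it holds for all a ∈ C(R, R), $x \in \mathbb{R}^{\mathcal{I}(\Phi)}$ that

\[ \bigl(\mathcal{R}^{\mathbf{N}}_a(\lambda \circledast \Phi)\bigr)(x) = \lambda \bigl((\mathcal{R}^{\mathbf{N}}_a(\Phi))(x)\bigr) \tag{2.128} \]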

(cf. Definitions 1.3.4 and 2.3.4).


Proof of Lemma 2.3.5. Throughout this proof, let L ∈ N, l0 , l1 , . . . , lL ∈ N satisfy
L = L(Φ) and (l0 , l1 , . . . , lL ) = D(Φ). (2.129)
Observe that item (i) in Lemma 2.3.2 demonstrates that
D(Aλ IO(Φ) ,0 ) = (O(Φ), O(Φ)) (2.130)
(cf. Definitions 1.5.5 and 2.3.1). Combining this and item (i) in Lemma 2.3.3 shows that
D(λ ⊛ Φ) = D(Aλ IO(Φ) ,0 • Φ) = (l0 , l1 , . . . , lL−1 , O(Φ)) = D(Φ) (2.131)
(cf. Definitions 2.1.1 and 2.3.4). This establishes item (i). Note that items (ii) and (iii)
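in Lemma 2.3.3 ensure that for all a ∈ C(R, R), $x \in \mathbb{R}^{\mathcal{I}(\Phi)}$ it holds that $\mathcal{R}^{\mathbf{N}}_a(\lambda \circledast \Phi) \in C(\mathbb{R}^{\mathcal{I}(\Phi)}, \mathbb{R}^{\mathcal{O}(\Phi)})$ and

\[ \begin{split} \bigl(\mathcal{R}^{\mathbf{N}}_a(\lambda \circledast \Phi)\bigr)(x) &= \bigl(\mathcal{R}^{\mathbf{N}}_a(\mathbf{A}_{\lambda \operatorname{I}_{\mathcal{O}(\Phi)}, 0} \bullet \Phi)\bigr)(x) \\ &= \lambda \operatorname{I}_{\mathcal{O}(\Phi)} \bigl((\mathcal{R}^{\mathbf{N}}_a(\Phi))(x)\bigr) \\ &= \lambda \bigl((\mathcal{R}^{\mathbf{N}}_a(\Phi))(x)\bigr) \end{split} \tag{2.132} \]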

(cf. Definition 1.3.4). This proves items (ii) and (iii). The proof of Lemma 2.3.5 is thus
complete.
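
The following Python sketch illustrates Lemma 2.3.5 numerically. It is only an illustration under stated assumptions: an ANN is stored as a list of (W, B) layers, the helper names `affine`, `realize`, and `scalar_mult` are hypothetical and not part of these notes, and the composition with the one-layer network A_{λ I, 0} is assumed to merge λ I into the last affine layer, which is consistent with D(λ ⊛ Φ) = D(Φ).

```python
# Hypothetical helpers: an ANN is a list of layers (W, B),
# where W is a list of rows and B is a list of biases.

def affine(W, B, x):
    """Return W x + B."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W, B)]

def realize(phi, x, a=lambda t: max(t, 0.0)):
    """Realization of phi: activation a after every layer except the last."""
    for W, B in phi[:-1]:
        x = [a(t) for t in affine(W, B, x)]
    W, B = phi[-1]
    return affine(W, B, x)

def scalar_mult(lam, phi):
    """lam (*) phi: scale weights and biases of the last layer by lam."""
    W, B = phi[-1]
    last = ([[lam * w for w in row] for row in W], [lam * b for b in B])
    return phi[:-1] + [last]

# Example network with layer dimensions (2, 3, 1) and ReLU activation.
phi = [([[1.0, -2.0], [0.5, 1.0], [-1.0, 1.0]], [0.0, 1.0, -0.5]),
       ([[2.0, -1.0, 3.0]], [0.25])]
x = [0.5, -1.0]
lam = -3.0

lhs = realize(scalar_mult(lam, phi), x)      # realization of lam (*) phi
rhs = [lam * y for y in realize(phi, x)]     # lam times realization of phi
dims_before = [len(B) for _, B in phi]
dims_after = [len(B) for _, B in scalar_mult(lam, phi)]
```

In this example both `lhs` and `rhs` describe the same vector, and the layer dimensions are unchanged, in line with items (i) and (iii) of Lemma 2.3.5.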

2.4 Sums of fully-connected feedforward ANNs with the same length
2.4.1 Sums of vectors as fully-connected feedforward ANNs
Definition 2.4.1 (Sums of vectors as fully-connected feedforward ANNs). Let m, n ∈ N.
Then we denote by
Sm,n ∈ (Rm×(mn) × Rm ) ⊆ N (2.133)
the fully-connected feedforward ANN given by
\[ \mathbb{S}_{m,n} = \mathbf{A}_{(\operatorname{I}_m \; \operatorname{I}_m \; \dots \; \operatorname{I}_m),\, 0} \tag{2.134} \]
(cf. Definitions 1.3.1, 1.3.2, 1.5.5, and 2.3.1).


Lemma 2.4.2. Let m, n ∈ N. Then


(i) it holds that D(Sm,n ) = (mn, m) ∈ N2 ,
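(ii) it holds for all a ∈ C(R, R) that $\mathcal{R}^{\mathbf{N}}_a(\mathbb{S}_{m,n}) \in C(\mathbb{R}^{mn}, \mathbb{R}^m)$, and
(iii) it holds for all a ∈ C(R, R), $x_1, x_2, \dots, x_n \in \mathbb{R}^m$ that

\[ (\mathcal{R}^{\mathbf{N}}_a(\mathbb{S}_{m,n}))(x_1, x_2, \dots, x_n) = \sum_{k=1}^n x_k \tag{2.135} \]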

(cf. Definitions 1.3.1, 1.3.4, and 2.4.1).


Proof of Lemma 2.4.2. Observe that the fact that Sm,n ∈ (Rm×(mn) × Rm ) implies that
D(Sm,n ) = (mn, m) ∈ N2 (2.136)
(cf. Definitions 1.3.1 and 2.4.1). This establishes item (i). Note that items (ii) and (iii)
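in Lemma 2.3.2 demonstrate that for all a ∈ C(R, R), $x_1, x_2, \dots, x_n \in \mathbb{R}^m$ it holds that $\mathcal{R}^{\mathbf{N}}_a(\mathbb{S}_{m,n}) \in C(\mathbb{R}^{mn}, \mathbb{R}^m)$ and

\[ \begin{split} (\mathcal{R}^{\mathbf{N}}_a(\mathbb{S}_{m,n}))(x_1, x_2, \dots, x_n) &= \bigl(\mathcal{R}^{\mathbf{N}}_a(\mathbf{A}_{(\operatorname{I}_m \; \operatorname{I}_m \; \dots \; \operatorname{I}_m),\, 0})\bigr)(x_1, x_2, \dots, x_n) \\ &= (\operatorname{I}_m \; \operatorname{I}_m \; \dots \; \operatorname{I}_m)(x_1, x_2, \dots, x_n) = \sum_{k=1}^n x_k \end{split} \tag{2.137} \]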

(cf. Definitions 1.3.4, 1.5.5, and 2.3.1). This proves items (ii) and (iii). The proof of
Lemma 2.4.2 is thus complete.
Lemma 2.4.3. Let m, n ∈ N, a ∈ C(R, R), Φ ∈ N satisfy O(Φ) = mn (cf. Definition 1.3.1).
Then
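(i) it holds that $\mathcal{R}^{\mathbf{N}}_a(\mathbb{S}_{m,n} \bullet \Phi) \in C(\mathbb{R}^{\mathcal{I}(\Phi)}, \mathbb{R}^m)$ and
(ii) it holds for all $x \in \mathbb{R}^{\mathcal{I}(\Phi)}$, $y_1, y_2, \dots, y_n \in \mathbb{R}^m$ with $(\mathcal{R}^{\mathbf{N}}_a(\Phi))(x) = (y_1, y_2, \dots, y_n)$ that

\[ \bigl(\mathcal{R}^{\mathbf{N}}_a(\mathbb{S}_{m,n} \bullet \Phi)\bigr)(x) = \sum_{k=1}^n y_k \tag{2.138} \]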

(cf. Definitions 1.3.4, 2.1.1, and 2.4.1).


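Proof of Lemma 2.4.3. Observe that Lemma 2.4.2 shows that for all $x_1, x_2, \dots, x_n \in \mathbb{R}^m$ it holds that $\mathcal{R}^{\mathbf{N}}_a(\mathbb{S}_{m,n}) \in C(\mathbb{R}^{mn}, \mathbb{R}^m)$ and

\[ (\mathcal{R}^{\mathbf{N}}_a(\mathbb{S}_{m,n}))(x_1, x_2, \dots, x_n) = \sum_{k=1}^n x_k \tag{2.139} \]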

(cf. Definitions 1.3.4 and 2.4.1). Combining this and item (v) in Proposition 2.1.2 establishes
items (i) and (ii). The proof of Lemma 2.4.3 is thus complete.


Lemma 2.4.4. Let n ∈ N, a ∈ C(R, R), Φ ∈ N (cf. Definition 1.3.1). Then


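(i) it holds that $\mathcal{R}^{\mathbf{N}}_a(\Phi \bullet \mathbb{S}_{\mathcal{I}(\Phi),n}) \in C(\mathbb{R}^{n\mathcal{I}(\Phi)}, \mathbb{R}^{\mathcal{O}(\Phi)})$ and

(ii) it holds for all $x_1, x_2, \dots, x_n \in \mathbb{R}^{\mathcal{I}(\Phi)}$ that

\[ \bigl(\mathcal{R}^{\mathbf{N}}_a(\Phi \bullet \mathbb{S}_{\mathcal{I}(\Phi),n})\bigr)(x_1, x_2, \dots, x_n) = (\mathcal{R}^{\mathbf{N}}_a(\Phi))\Bigl(\sum_{k=1}^n x_k\Bigr) \tag{2.140} \]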

(cf. Definitions 1.3.4, 2.1.1, and 2.4.1).


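Proof of Lemma 2.4.4. Note that Lemma 2.4.2 ensures that for all $m \in \mathbb{N}$, $x_1, x_2, \dots, x_n \in \mathbb{R}^m$ it holds that $\mathcal{R}^{\mathbf{N}}_a(\mathbb{S}_{m,n}) \in C(\mathbb{R}^{mn}, \mathbb{R}^m)$ and

\[ (\mathcal{R}^{\mathbf{N}}_a(\mathbb{S}_{m,n}))(x_1, x_2, \dots, x_n) = \sum_{k=1}^n x_k \tag{2.141} \]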

(cf. Definitions 1.3.4 and 2.4.1). Combining this and item (v) in Proposition 2.1.2 proves
items (i) and (ii). The proof of Lemma 2.4.4 is thus complete.

2.4.2 Concatenation of vectors as fully-connected feedforward ANNs
Definition 2.4.5 (Transpose of a matrix). Let m, n ∈ N, A ∈ Rm×n . Then we denote by
A∗ ∈ Rn×m the transpose of A.
Definition 2.4.6 (Concatenation of vectors as fully-connected feedforward ANNs). Let
m, n ∈ N. Then we denote by

Tm,n ∈ (R(mn)×m × Rmn ) ⊆ N (2.142)

the fully-connected feedforward ANN given by

\[ \mathbb{T}_{m,n} = \mathbf{A}_{(\operatorname{I}_m \; \operatorname{I}_m \; \dots \; \operatorname{I}_m)^{*},\, 0} \tag{2.143} \]

(cf. Definitions 1.3.1, 1.3.2, 1.5.5, 2.3.1, and 2.4.5).


Lemma 2.4.7. Let m, n ∈ N. Then
(i) it holds that D(Tm,n ) = (m, mn) ∈ N2 ,

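(ii) it holds for all a ∈ C(R, R) that $\mathcal{R}^{\mathbf{N}}_a(\mathbb{T}_{m,n}) \in C(\mathbb{R}^m, \mathbb{R}^{mn})$, and

(iii) it holds for all a ∈ C(R, R), $x \in \mathbb{R}^m$ that

\[ (\mathcal{R}^{\mathbf{N}}_a(\mathbb{T}_{m,n}))(x) = (x, x, \dots, x) \tag{2.144} \]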


(cf. Definitions 1.3.1, 1.3.4, and 2.4.6).


Proof of Lemma 2.4.7. Observe that the fact that Tm,n ∈ (R(mn)×m × Rmn ) implies that
D(Tm,n ) = (m, mn) ∈ N2 (2.145)
(cf. Definitions 1.3.1 and 2.4.6). This establishes item (i). Note that item (iii) in Lemma 2.3.2
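demonstrates that for all a ∈ C(R, R), $x \in \mathbb{R}^m$ it holds that $\mathcal{R}^{\mathbf{N}}_a(\mathbb{T}_{m,n}) \in C(\mathbb{R}^m, \mathbb{R}^{mn})$ and

\[ \begin{split} (\mathcal{R}^{\mathbf{N}}_a(\mathbb{T}_{m,n}))(x) &= \bigl(\mathcal{R}^{\mathbf{N}}_a(\mathbf{A}_{(\operatorname{I}_m \; \operatorname{I}_m \; \dots \; \operatorname{I}_m)^{*},\, 0})\bigr)(x) \\ &= (\operatorname{I}_m \; \operatorname{I}_m \; \dots \; \operatorname{I}_m)^{*} x = (x, x, \dots, x) \end{split} \tag{2.146} \]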
(cf. Definitions 1.3.4, 1.5.5, 2.3.1, and 2.4.5). This proves items (ii) and (iii). The proof of
Lemma 2.4.7 is thus complete.
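
As an informal illustration of Lemma 2.4.2 and Lemma 2.4.7, the following Python sketch builds the weight matrices (I_m I_m ... I_m) and (I_m I_m ... I_m)^* explicitly and applies them to vectors. The function names `sum_matrix`, `concat_matrix`, and `matvec` are hypothetical and chosen only for this example; since both ANNs have a single layer with zero bias, their realizations reduce to a matrix-vector product.

```python
def identity(m):
    """Identity matrix I_m as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]

def sum_matrix(m, n):
    """(I_m I_m ... I_m) in R^{m x (mn)}: n identity blocks side by side."""
    return [row * n for row in identity(m)]

def concat_matrix(m, n):
    """(I_m I_m ... I_m)^* in R^{(mn) x m}: n identity blocks stacked."""
    return identity(m) * n

def matvec(W, x):
    """Matrix-vector product W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

m, n = 2, 3
xs = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]   # x_1, x_2, x_3 in R^2
flat = [t for x in xs for t in x]                  # (x_1, x_2, x_3) in R^6

summed = matvec(sum_matrix(m, n), flat)            # realization of S_{2,3}
copied = matvec(concat_matrix(m, n), xs[0])        # realization of T_{2,3}
```

Here `summed` is the componentwise sum x_1 + x_2 + x_3 and `copied` consists of n copies of x_1, matching (2.135) and (2.144).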
Lemma 2.4.8. Let n ∈ N, a ∈ C(R, R), Φ ∈ N (cf. Definition 1.3.1). Then
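(i) it holds that $\mathcal{R}^{\mathbf{N}}_a(\mathbb{T}_{\mathcal{O}(\Phi),n} \bullet \Phi) \in C(\mathbb{R}^{\mathcal{I}(\Phi)}, \mathbb{R}^{n\mathcal{O}(\Phi)})$ and
(ii) it holds for all $x \in \mathbb{R}^{\mathcal{I}(\Phi)}$ that

\[ \bigl(\mathcal{R}^{\mathbf{N}}_a(\mathbb{T}_{\mathcal{O}(\Phi),n} \bullet \Phi)\bigr)(x) = \bigl( (\mathcal{R}^{\mathbf{N}}_a(\Phi))(x), (\mathcal{R}^{\mathbf{N}}_a(\Phi))(x), \dots, (\mathcal{R}^{\mathbf{N}}_a(\Phi))(x) \bigr) \tag{2.147} \]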

(cf. Definitions 1.3.4, 2.1.1, and 2.4.6).


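Proof of Lemma 2.4.8. Observe that Lemma 2.4.7 shows that for all $m \in \mathbb{N}$, $x \in \mathbb{R}^m$ it holds that $\mathcal{R}^{\mathbf{N}}_a(\mathbb{T}_{m,n}) \in C(\mathbb{R}^m, \mathbb{R}^{mn})$ and

\[ (\mathcal{R}^{\mathbf{N}}_a(\mathbb{T}_{m,n}))(x) = (x, x, \dots, x) \tag{2.148} \]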
(cf. Definitions 1.3.4 and 2.4.6). Combining this and item (v) in Proposition 2.1.2 establishes
items (i) and (ii). The proof of Lemma 2.4.8 is thus complete.
Lemma 2.4.9. Let m, n ∈ N, a ∈ C(R, R), Φ ∈ N satisfy I(Φ) = mn (cf. Definition 1.3.1).
Then
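(i) it holds that $\mathcal{R}^{\mathbf{N}}_a(\Phi \bullet \mathbb{T}_{m,n}) \in C(\mathbb{R}^m, \mathbb{R}^{\mathcal{O}(\Phi)})$ and
(ii) it holds for all $x \in \mathbb{R}^m$ that

\[ \bigl(\mathcal{R}^{\mathbf{N}}_a(\Phi \bullet \mathbb{T}_{m,n})\bigr)(x) = (\mathcal{R}^{\mathbf{N}}_a(\Phi))(x, x, \dots, x) \tag{2.149} \]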

(cf. Definitions 1.3.4, 2.1.1, and 2.4.6).


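Proof of Lemma 2.4.9. Note that Lemma 2.4.7 ensures that for all $x \in \mathbb{R}^m$ it holds that $\mathcal{R}^{\mathbf{N}}_a(\mathbb{T}_{m,n}) \in C(\mathbb{R}^m, \mathbb{R}^{mn})$ and

\[ (\mathcal{R}^{\mathbf{N}}_a(\mathbb{T}_{m,n}))(x) = (x, x, \dots, x) \tag{2.150} \]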
(cf. Definitions 1.3.4 and 2.4.6). Combining this and item (v) in Proposition 2.1.2 proves
items (i) and (ii). The proof of Lemma 2.4.9 is thus complete.


2.4.3 Sums of fully-connected feedforward ANNs


Definition 2.4.10 (Sums of fully-connected feedforward ANNs with the same length). Let
m ∈ Z, n ∈ {m, m + 1, . . . }, Φm , Φm+1 , . . . , Φn ∈ N satisfy for all k ∈ {m, m + 1, . . . , n}
that
L(Φk ) = L(Φm ), I(Φk ) = I(Φm ), and O(Φk ) = O(Φm ) (2.151)
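(cf. Definition 1.3.1). Then we denote by $\bigoplus_{k=m}^n \Phi_k \in \mathbf{N}$ (we denote by Φm ⊕ Φm+1 ⊕ . . . ⊕ Φn ∈ N) the fully-connected feedforward ANN given by

\[ \bigoplus_{k=m}^n \Phi_k = \bigl( \mathbb{S}_{\mathcal{O}(\Phi_m), n-m+1} \bullet \bigl[\mathbf{P}_{n-m+1}(\Phi_m, \Phi_{m+1}, \dots, \Phi_n)\bigr] \bullet \mathbb{T}_{\mathcal{I}(\Phi_m), n-m+1} \bigr) \in \mathbf{N} \tag{2.152} \]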

(cf. Definitions 1.3.2, 2.1.1, 2.2.1, 2.4.1, and 2.4.6).


Lemma 2.4.11 (Realizations of sums of fully-connected feedforward ANNs). Let m ∈ Z,
n ∈ {m, m + 1, . . .}, Φm , Φm+1 , . . . , Φn ∈ N satisfy for all k ∈ {m, m + 1, . . . , n} that
L(Φk ) = L(Φm ), I(Φk ) = I(Φm ), and O(Φk ) = O(Φm ) (2.153)
(cf. Definition 1.3.1). Then
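(i) it holds that $\mathcal{L}\bigl(\bigoplus_{k=m}^n \Phi_k\bigr) = \mathcal{L}(\Phi_m)$,

(ii) it holds that

\[ \mathcal{D}\Bigl(\bigoplus_{k=m}^n \Phi_k\Bigr) = \Bigl( \mathcal{I}(\Phi_m), \sum_{k=m}^n \mathbb{D}_1(\Phi_k), \sum_{k=m}^n \mathbb{D}_2(\Phi_k), \dots, \sum_{k=m}^n \mathbb{D}_{\mathcal{H}(\Phi_m)}(\Phi_k), \mathcal{O}(\Phi_m) \Bigr), \tag{2.154} \]

and
(iii) it holds for all a ∈ C(R, R) that

\[ \mathcal{R}^{\mathbf{N}}_a\Bigl(\bigoplus_{k=m}^n \Phi_k\Bigr) = \sum_{k=m}^n (\mathcal{R}^{\mathbf{N}}_a(\Phi_k)) \tag{2.155} \]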

(cf. Definitions 1.3.4 and 2.4.10).


Proof of Lemma 2.4.11. First, observe that Lemma 2.2.2 implies that

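\[ \begin{split} &\mathcal{D}\bigl(\mathbf{P}_{n-m+1}(\Phi_m, \Phi_{m+1}, \dots, \Phi_n)\bigr) \\ &= \Bigl( \sum_{k=m}^n \mathbb{D}_0(\Phi_k), \sum_{k=m}^n \mathbb{D}_1(\Phi_k), \dots, \sum_{k=m}^n \mathbb{D}_{\mathcal{L}(\Phi_m)-1}(\Phi_k), \sum_{k=m}^n \mathbb{D}_{\mathcal{L}(\Phi_m)}(\Phi_k) \Bigr) \\ &= \Bigl( (n-m+1)\mathcal{I}(\Phi_m), \sum_{k=m}^n \mathbb{D}_1(\Phi_k), \sum_{k=m}^n \mathbb{D}_2(\Phi_k), \dots, \sum_{k=m}^n \mathbb{D}_{\mathcal{L}(\Phi_m)-1}(\Phi_k), (n-m+1)\mathcal{O}(\Phi_m) \Bigr) \end{split} \tag{2.156} \]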


(cf. Definition 2.2.1). Furthermore, note that item (i) in Lemma 2.4.2 demonstrates that
D(SO(Φm ),n−m+1 ) = ((n − m + 1)O(Φm ), O(Φm )) (2.157)
(cf. Definition 2.4.1). This, (2.156), and item (i) in Proposition 2.1.2 show that
 
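\[ \begin{split} &\mathcal{D}\bigl( \mathbb{S}_{\mathcal{O}(\Phi_m), n-m+1} \bullet \bigl[\mathbf{P}_{n-m+1}(\Phi_m, \Phi_{m+1}, \dots, \Phi_n)\bigr] \bigr) \\ &= \Bigl( (n-m+1)\mathcal{I}(\Phi_m), \sum_{k=m}^n \mathbb{D}_1(\Phi_k), \sum_{k=m}^n \mathbb{D}_2(\Phi_k), \dots, \sum_{k=m}^n \mathbb{D}_{\mathcal{L}(\Phi_m)-1}(\Phi_k), \mathcal{O}(\Phi_m) \Bigr). \end{split} \tag{2.158} \]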

Moreover, observe that item (i) in Lemma 2.4.7 establishes that

\[ \mathcal{D}\bigl(\mathbb{T}_{\mathcal{I}(\Phi_m), n-m+1}\bigr) = (\mathcal{I}(\Phi_m), (n-m+1)\mathcal{I}(\Phi_m)) \tag{2.159} \]
(cf. Definitions 2.1.1 and 2.4.6). Combining this, (2.158), and item (i) in Proposition 2.1.2
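ensures that

\[ \begin{split} \mathcal{D}\Bigl(\bigoplus_{k=m}^n \Phi_k\Bigr) &= \mathcal{D}\bigl( \mathbb{S}_{\mathcal{O}(\Phi_m),(n-m+1)} \bullet \bigl[\mathbf{P}_{n-m+1}(\Phi_m, \Phi_{m+1}, \dots, \Phi_n)\bigr] \bullet \mathbb{T}_{\mathcal{I}(\Phi_m),(n-m+1)} \bigr) \\ &= \Bigl( \mathcal{I}(\Phi_m), \sum_{k=m}^n \mathbb{D}_1(\Phi_k), \sum_{k=m}^n \mathbb{D}_2(\Phi_k), \dots, \sum_{k=m}^n \mathbb{D}_{\mathcal{L}(\Phi_m)-1}(\Phi_k), \mathcal{O}(\Phi_m) \Bigr) \end{split} \tag{2.160} \]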

(cf. Definition 2.4.10). This proves items (i) and (ii). Note that Lemma 2.4.9 and (2.156)
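imply that for all a ∈ C(R, R), $x \in \mathbb{R}^{\mathcal{I}(\Phi_m)}$ it holds that

\[ \mathcal{R}^{\mathbf{N}}_a\bigl( [\mathbf{P}_{n-m+1}(\Phi_m, \Phi_{m+1}, \dots, \Phi_n)] \bullet \mathbb{T}_{\mathcal{I}(\Phi_m), n-m+1} \bigr) \in C(\mathbb{R}^{\mathcal{I}(\Phi_m)}, \mathbb{R}^{(n-m+1)\mathcal{O}(\Phi_m)}) \tag{2.161} \]

and

\[ \begin{split} &\bigl(\mathcal{R}^{\mathbf{N}}_a\bigl( [\mathbf{P}_{n-m+1}(\Phi_m, \Phi_{m+1}, \dots, \Phi_n)] \bullet \mathbb{T}_{\mathcal{I}(\Phi_m), n-m+1} \bigr)\bigr)(x) \\ &= \bigl(\mathcal{R}^{\mathbf{N}}_a\bigl( \mathbf{P}_{n-m+1}(\Phi_m, \Phi_{m+1}, \dots, \Phi_n) \bigr)\bigr)(x, x, \dots, x) \end{split} \tag{2.162} \]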

(cf. Definition 1.3.4). Combining this with item (ii) in Proposition 2.2.3 demonstrates that
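for all a ∈ C(R, R), $x \in \mathbb{R}^{\mathcal{I}(\Phi_m)}$ it holds that

\[ \begin{split} &\bigl(\mathcal{R}^{\mathbf{N}}_a\bigl( [\mathbf{P}_{n-m+1}(\Phi_m, \Phi_{m+1}, \dots, \Phi_n)] \bullet \mathbb{T}_{\mathcal{I}(\Phi_m), n-m+1} \bigr)\bigr)(x) \\ &= \bigl( (\mathcal{R}^{\mathbf{N}}_a(\Phi_m))(x), (\mathcal{R}^{\mathbf{N}}_a(\Phi_{m+1}))(x), \dots, (\mathcal{R}^{\mathbf{N}}_a(\Phi_n))(x) \bigr) \in \mathbb{R}^{(n-m+1)\mathcal{O}(\Phi_m)}. \end{split} \tag{2.163} \]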
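Lemma 2.4.3, (2.157), and Corollary 2.1.5 hence show that for all a ∈ C(R, R), $x \in \mathbb{R}^{\mathcal{I}(\Phi_m)}$ it holds that $\mathcal{R}^{\mathbf{N}}_a\bigl(\bigoplus_{k=m}^n \Phi_k\bigr) \in C(\mathbb{R}^{\mathcal{I}(\Phi_m)}, \mathbb{R}^{\mathcal{O}(\Phi_m)})$ and

\[ \begin{split} \Bigl(\mathcal{R}^{\mathbf{N}}_a\Bigl(\bigoplus_{k=m}^n \Phi_k\Bigr)\Bigr)(x) &= \bigl(\mathcal{R}^{\mathbf{N}}_a\bigl( \mathbb{S}_{\mathcal{O}(\Phi_m), n-m+1} \bullet [\mathbf{P}_{n-m+1}(\Phi_m, \Phi_{m+1}, \dots, \Phi_n)] \bullet \mathbb{T}_{\mathcal{I}(\Phi_m), n-m+1} \bigr)\bigr)(x) \\ &= \sum_{k=m}^n (\mathcal{R}^{\mathbf{N}}_a(\Phi_k))(x). \end{split} \tag{2.164} \]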

This establishes item (iii). The proof of Lemma 2.4.11 is thus complete.
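
The construction in (2.152) can be checked numerically with a short Python sketch. It rests on assumptions that are not spelled out in this section: an ANN is stored as a list of (W, B) layers, composition merges the last affine layer of the right factor with the first affine layer of the left factor, and parallelization of ANNs with the same length uses block-diagonal weight matrices and stacked biases. All function names are hypothetical.

```python
def matmul(A, B):
    """Matrix product of A and B (lists of rows)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def realize(phi, x, a=lambda t: max(t, 0.0)):
    """Realization with activation a after every layer except the last."""
    for W, B in phi[:-1]:
        x = [a(s + b) for s, b in zip(matvec(W, x), B)]
    W, B = phi[-1]
    return [s + b for s, b in zip(matvec(W, x), B)]

def compose(phi, psi):
    """phi . psi: merge the last layer of psi with the first layer of phi."""
    (W1, B1), (WL, BL) = phi[0], psi[-1]
    merged = (matmul(W1, WL), [s + b for s, b in zip(matvec(W1, BL), B1)])
    return psi[:-1] + [merged] + phi[1:]

def block_diag(mats):
    cols = sum(len(M[0]) for M in mats)
    out, offset = [], 0
    for M in mats:
        for row in M:
            out.append([0.0] * offset + list(row) + [0.0] * (cols - offset - len(row)))
        offset += len(M[0])
    return out

def parallel(nets):
    """Parallelization of ANNs with equal length."""
    return [(block_diag([W for W, _ in layer]), [b for _, B in layer for b in B])
            for layer in zip(*nets)]

def eye(m):
    return [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]

def network_sum(nets):
    """S . [P(nets)] . T as in (2.152)."""
    n = len(nets)
    d_in, d_out = len(nets[0][0][0][0]), len(nets[0][-1][0])
    S = [([row * n for row in eye(d_out)], [0.0] * d_out)]
    T = [(eye(d_in) * n, [0.0] * (d_in * n))]
    return compose(compose(S, parallel(nets)), T)

# Two ReLU networks with layer dimensions (1, 2, 1).
phi1 = [([[1.0], [-1.0]], [0.0, 0.0]), ([[1.0, 1.0]], [0.0])]    # x -> |x|
phi2 = [([[2.0], [0.0]], [1.0, 0.0]), ([[1.0, 0.0]], [-1.0])]    # x -> max{2x+1,0} - 1
total = network_sum([phi1, phi2])
layer_dims = [len(B) for _, B in total]
points = [-1.5, -0.25, 0.0, 0.75, 2.0]
gaps = [abs(realize(total, [x])[0] - realize(phi1, [x])[0] - realize(phi2, [x])[0])
        for x in points]
```

In this example the summed network has layer dimensions (1, 4, 1), as predicted by (2.154), and the entries of `gaps` vanish up to rounding, as predicted by (2.155).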

Part II

Approximation

Chapter 3

One-dimensional ANN approximation results

In learning problems ANNs are heavily used with the aim to approximate certain target
functions. In this chapter we review basic ReLU ANN approximation results for a class
of one-dimensional target functions (see Section 3.3). ANN approximation results for
multi-dimensional target functions are treated in Chapter 4 below.
In the scientific literature the capacity of ANNs to approximate certain classes of target
functions has been thoroughly studied; cf., for instance, [14, 41, 89, 203, 204] for early
universal ANN approximation results, cf., for example, [28, 43, 175, 333, 374, 423] and
the references therein for more recent ANN approximation results establishing rates in the
approximation of different classes of target functions, and cf., for instance, [128, 179, 259,
370] and the references therein for approximation capacities of ANNs related to solutions of
PDEs (cf. also Chapters 16 and 17 in Part VI of these lecture notes for machine learning
methods for PDEs). This chapter is based on Ackermann et al. [3, Section 4.2] (cf., for
example, also Hutzenthaler et al. [209, Section 3.4]).

3.1 Linear interpolation of one-dimensional functions


3.1.1 On the modulus of continuity
Definition 3.1.1 (Modulus of continuity). Let A ⊆ R be a set and let f : A → R be
a function. Then we denote by wf : [0, ∞] → [0, ∞] the function which satisfies for all
h ∈ [0, ∞] that
 
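\[ \begin{split} w_f(h) &= \sup\bigl( \bigl\{ |f(x) - f(y)| \colon (x, y \in A \text{ with } |x - y| \le h) \bigr\} \cup \{0\} \bigr) \\ &= \sup\bigl( \bigl\{ r \in \mathbb{R} \colon (\exists\, x \in A, y \in A \cap [x - h, x + h] \colon r = |f(x) - f(y)|) \bigr\} \cup \{0\} \bigr) \end{split} \tag{3.1} \]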

and we call wf the modulus of continuity of f .


Lemma 3.1.2 (Elementary properties of moduli of continuity). Let A ⊆ R be a set and let
f : A → R be a function. Then
(i) it holds that wf is non-decreasing,
(ii) it holds that f is uniformly continuous if and only if limh↘0 wf (h) = 0,
(iii) it holds that f is globally bounded if and only if wf (∞) < ∞, and
(iv) it holds for all x, y ∈ A that |f (x) − f (y)| ≤ wf (|x − y|)
(cf. Definition 3.1.1).
Proof of Lemma 3.1.2. Observe that (3.1) proves items (i), (ii), (iii), and (iv). The proof
of Lemma 3.1.2 is thus complete.
Lemma 3.1.3 (Subadditivity of moduli of continuity). Let a ∈ [−∞, ∞], b ∈ [a, ∞], let
f : ([a, b] ∩ R) → R be a function, and let h, 𝔥 ∈ [0, ∞]. Then
wf (h + 𝔥) ≤ wf (h) + wf (𝔥) (3.2)
(cf. Definition 3.1.1).
Proof of Lemma 3.1.3. Throughout this proof, assume without loss of generality that
h ≤ 𝔥 < ∞. Note that the fact that for all x, y ∈ [a, b] ∩ R with |x − y| ≤ h + 𝔥 it
holds that [x − h, x + h] ∩ [y − 𝔥, y + 𝔥] ∩ [a, b] ≠ ∅ ensures that for all x, y ∈ [a, b] ∩ R with
|x − y| ≤ h + 𝔥 there exists z ∈ [a, b] ∩ R such that
|x − z| ≤ h and |y − z| ≤ 𝔥. (3.3)
Items (i) and (iv) in Lemma 3.1.2 therefore imply that for all x, y ∈ [a, b] ∩ R with
|x − y| ≤ h + 𝔥 there exists z ∈ [a, b] ∩ R such that
|f (x) − f (y)| ≤ |f (x) − f (z)| + |f (y) − f (z)| ≤ wf (|x − z|) + wf (|y − z|) ≤ wf (h) + wf (𝔥) (3.4)
(cf. Definition 3.1.1). Combining this with (3.1) demonstrates that
wf (h + 𝔥) ≤ wf (h) + wf (𝔥). (3.5)
The proof of Lemma 3.1.3 is thus complete.
Lemma 3.1.4 (Properties of moduli of continuity of Lipschitz continuous functions). Let
A ⊆ R, L ∈ [0, ∞), let f : A → R satisfy for all x, y ∈ A that
|f (x) − f (y)| ≤ L|x − y|, (3.6)
and let h ∈ [0, ∞). Then
wf (h) ≤ Lh (3.7)
(cf. Definition 3.1.1).


Proof of Lemma 3.1.4. Observe that (3.1) and (3.6) show that
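\[ \begin{split} w_f(h) &= \sup\bigl( \bigl\{ |f(x) - f(y)| \in [0, \infty) \colon (x, y \in A \text{ with } |x - y| \le h) \bigr\} \cup \{0\} \bigr) \\ &\le \sup\bigl( \bigl\{ L|x - y| \in [0, \infty) \colon (x, y \in A \text{ with } |x - y| \le h) \bigr\} \cup \{0\} \bigr) \\ &\le \sup(\{Lh, 0\}) = Lh \end{split} \tag{3.8} \]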
(cf. Definition 3.1.1). The proof of Lemma 3.1.4 is thus complete.
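
For intuition, the modulus of continuity in Definition 3.1.1 can be estimated from below by restricting the supremum to finitely many sample points. The sketch below does this in Python; the function name `modulus_on_grid`, the grid, and the example function are assumptions made only for this illustration, and the grid value is in general only a lower estimate of wf (h).

```python
def modulus_on_grid(f, grid, h):
    """Lower estimate of w_f(h): largest |f(x) - f(y)| over grid points
    x, y with |x - y| <= h (the value 0 is always admitted)."""
    best = 0.0
    for x in grid:
        for y in grid:
            if abs(x - y) <= h:
                best = max(best, abs(f(x) - f(y)))
    return best

grid = [k / 100 for k in range(101)]     # sample points in A = [0, 1]
f = lambda x: x * x                      # Lipschitz on [0, 1] with constant L = 2
lipschitz = 2.0
hs = [0.0, 0.1, 0.25, 0.5, 1.0]
values = [modulus_on_grid(f, grid, h) for h in hs]
```

The computed values are non-decreasing in h (item (i) in Lemma 3.1.2) and stay below L h (Lemma 3.1.4), since every admissible pair of grid points is also admissible in (3.1).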

3.1.2 Linear interpolation of one-dimensional functions


Definition 3.1.5 (Linear interpolation operator). Let K ∈ N, x0 , x1 , . . . , xK , f0 , f1 , . . . , fK ∈
R satisfy x0 < x1 < . . . < xK . Then we denote by
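\[ \mathscr{L}^{f_0, f_1, \dots, f_K}_{x_0, x_1, \dots, x_K} \colon \mathbb{R} \to \mathbb{R} \tag{3.9} \]

the function which satisfies for all k ∈ {1, 2, . . . , K}, x ∈ (−∞, x0 ), y ∈ [xk−1 , xk ), z ∈ [xK , ∞) that

\[ (\mathscr{L}^{f_0, f_1, \dots, f_K}_{x_0, x_1, \dots, x_K})(x) = f_0, \qquad (\mathscr{L}^{f_0, f_1, \dots, f_K}_{x_0, x_1, \dots, x_K})(z) = f_K, \tag{3.10} \]

\[ \text{and} \qquad (\mathscr{L}^{f_0, f_1, \dots, f_K}_{x_0, x_1, \dots, x_K})(y) = f_{k-1} + \bigl(\tfrac{y - x_{k-1}}{x_k - x_{k-1}}\bigr)(f_k - f_{k-1}). \tag{3.11} \]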
Lemma 3.1.6 (Elementary properties of the linear interpolation operator). Let K ∈ N,
x0 , x1 , . . . , xK , f0 , f1 , . . . , fK ∈ R satisfy x0 < x1 < . . . < xK . Then
(i) it holds for all k ∈ {0, 1, . . . , K} that
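\[ (\mathscr{L}^{f_0, f_1, \dots, f_K}_{x_0, x_1, \dots, x_K})(x_k) = f_k, \tag{3.12} \]

(ii) it holds for all k ∈ {1, 2, . . . , K}, x ∈ [xk−1 , xk ] that

\[ (\mathscr{L}^{f_0, f_1, \dots, f_K}_{x_0, x_1, \dots, x_K})(x) = f_{k-1} + \bigl(\tfrac{x - x_{k-1}}{x_k - x_{k-1}}\bigr)(f_k - f_{k-1}), \tag{3.13} \]

and
(iii) it holds for all k ∈ {1, 2, . . . , K}, x ∈ [xk−1 , xk ] that

\[ (\mathscr{L}^{f_0, f_1, \dots, f_K}_{x_0, x_1, \dots, x_K})(x) = \bigl(\tfrac{x_k - x}{x_k - x_{k-1}}\bigr) f_{k-1} + \bigl(\tfrac{x - x_{k-1}}{x_k - x_{k-1}}\bigr) f_k. \tag{3.14} \]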

(cf. Definition 3.1.5).


Proof of Lemma 3.1.6. Note that (3.11) establishes items (i) and (ii). Observe that item (ii)
proves that for all k ∈ {1, 2, . . . , K}, x ∈ [xk−1 , xk ] it holds that
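\[ \begin{split} (\mathscr{L}^{f_0, f_1, \dots, f_K}_{x_0, x_1, \dots, x_K})(x) &= \Bigl[ \bigl(\tfrac{x_k - x_{k-1}}{x_k - x_{k-1}}\bigr) - \bigl(\tfrac{x - x_{k-1}}{x_k - x_{k-1}}\bigr) \Bigr] f_{k-1} + \bigl(\tfrac{x - x_{k-1}}{x_k - x_{k-1}}\bigr) f_k \\ &= \bigl(\tfrac{x_k - x}{x_k - x_{k-1}}\bigr) f_{k-1} + \bigl(\tfrac{x - x_{k-1}}{x_k - x_{k-1}}\bigr) f_k. \end{split} \tag{3.15} \]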
This establishes item (iii). The proof of Lemma 3.1.6 is thus complete.
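
A direct transcription of Definition 3.1.5 into Python may help to read (3.10) and (3.11). The function name `linear_interpolation` and the example nodes are hypothetical; the sketch assumes strictly increasing nodes and does not validate its input.

```python
def linear_interpolation(xs, fs):
    """Return the function from Definition 3.1.5 for nodes
    xs[0] < xs[1] < ... < xs[K] and values fs[0], ..., fs[K]."""
    def interp(x):
        if x < xs[0]:                 # constant f_0 to the left, see (3.10)
            return fs[0]
        if x >= xs[-1]:               # constant f_K to the right, see (3.10)
            return fs[-1]
        for k in range(1, len(xs)):   # linear on [x_{k-1}, x_k), see (3.11)
            if xs[k - 1] <= x < xs[k]:
                t = (x - xs[k - 1]) / (xs[k] - xs[k - 1])
                return fs[k - 1] + t * (fs[k] - fs[k - 1])
    return interp

xs = [0.0, 1.0, 3.0]
fs = [2.0, 4.0, 0.0]
g = linear_interpolation(xs, fs)
node_values = [g(x) for x in xs]      # item (i) in Lemma 3.1.6
```

Evaluating `g` at the nodes returns the prescribed values, and outside [x0 , xK ] the function is constant.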


Proposition 3.1.7 (Approximation and continuity properties for the linear interpolation
operator). Let K ∈ N, x0 , x1 , . . . , xK ∈ R satisfy x0 < x1 < . . . < xK and let f : [x0 , xK ] → R
be a function. Then

(i) it holds for all x, y ∈ R with x ̸= y that

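\[ \bigl| (\mathscr{L}^{f(x_0), f(x_1), \dots, f(x_K)}_{x_0, x_1, \dots, x_K})(x) - (\mathscr{L}^{f(x_0), f(x_1), \dots, f(x_K)}_{x_0, x_1, \dots, x_K})(y) \bigr| \le \Bigl( \max_{k \in \{1, 2, \dots, K\}} \Bigl( \frac{w_f(x_k - x_{k-1})}{x_k - x_{k-1}} \Bigr) \Bigr) |x - y| \tag{3.16} \]

and

(ii) it holds that

\[ \sup_{x \in [x_0, x_K]} \bigl| (\mathscr{L}^{f(x_0), f(x_1), \dots, f(x_K)}_{x_0, x_1, \dots, x_K})(x) - f(x) \bigr| \le w_f\bigl( \max_{k \in \{1, 2, \dots, K\}} |x_k - x_{k-1}| \bigr) \tag{3.17} \]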

(cf. Definitions 3.1.1 and 3.1.5).

Proof of Proposition 3.1.7. Throughout this proof, let L ∈ [0, ∞] satisfy


 
\[ L = \max_{k \in \{1, 2, \dots, K\}} \Bigl( \frac{w_f(x_k - x_{k-1})}{x_k - x_{k-1}} \Bigr) \tag{3.18} \]

and let l : R → R satisfy for all x ∈ R that

\[ l(x) = (\mathscr{L}^{f(x_0), f(x_1), \dots, f(x_K)}_{x_0, x_1, \dots, x_K})(x) \tag{3.19} \]

(cf. Definitions 3.1.1 and 3.1.5). Observe that item (ii) in Lemma 3.1.6, item (iv) in
Lemma 3.1.2, and (3.18) ensure that for all k ∈ {1, 2, . . . , K}, x, y ∈ [xk−1 , xk ] with x ≠ y it
holds that

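\[ \begin{split} |l(x) - l(y)| &= \Bigl| \bigl(\tfrac{x - x_{k-1}}{x_k - x_{k-1}}\bigr)(f(x_k) - f(x_{k-1})) - \bigl(\tfrac{y - x_{k-1}}{x_k - x_{k-1}}\bigr)(f(x_k) - f(x_{k-1})) \Bigr| \\ &= \Bigl| \Bigl( \frac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}} \Bigr)(x - y) \Bigr| \le \Bigl( \frac{w_f(x_k - x_{k-1})}{x_k - x_{k-1}} \Bigr) |x - y| \le L|x - y|. \end{split} \tag{3.20} \]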

Furthermore, note that the triangle inequality and item (i) in Lemma 3.1.6 imply that
for all k, l ∈ {1, 2, . . . , K}, x ∈ [xk−1 , xk ], y ∈ [xl−1 , xl ] with k < l it holds that

\[ \begin{split} |l(x) - l(y)| &\le |l(x) - l(x_k)| + |l(x_k) - l(x_{l-1})| + |l(x_{l-1}) - l(y)| \\ &= |l(x) - l(x_k)| + |f(x_k) - f(x_{l-1})| + |l(x_{l-1}) - l(y)| \\ &\le |l(x) - l(x_k)| + \Bigl( \sum_{j=k+1}^{l-1} |f(x_j) - f(x_{j-1})| \Bigr) + |l(x_{l-1}) - l(y)|. \end{split} \tag{3.21} \]


Item (iv) in Lemma 3.1.2, and (3.18) hence demonstrate that for all k, l ∈ {1, 2, . . . , K},
x ∈ [xk−1 , xk ], y ∈ [xl−1 , xl ] with k < l and x ̸= y it holds that

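\[ \begin{split} |l(x) - l(y)| &\le |l(x) - l(x_k)| + \Bigl( \sum_{j=k+1}^{l-1} w_f(|x_j - x_{j-1}|) \Bigr) + |l(x_{l-1}) - l(y)| \\ &= |l(x) - l(x_k)| + \Bigl( \sum_{j=k+1}^{l-1} \Bigl( \frac{w_f(x_j - x_{j-1})}{x_j - x_{j-1}} \Bigr)(x_j - x_{j-1}) \Bigr) + |l(x_{l-1}) - l(y)| \\ &\le |l(x_k) - l(x)| + L(x_{l-1} - x_k) + |l(y) - l(x_{l-1})|. \end{split} \tag{3.22} \]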

This and (3.21) show that for all k, l ∈ {1, 2, . . . , K}, x ∈ [xk−1 , xk ], y ∈ [xl−1 , xl ] with k < l
and x ̸= y it holds that
\[ |l(x) - l(y)| \le L\Bigl( (x_k - x) + \Bigl( \sum_{j=k+1}^{l-1} (x_j - x_{j-1}) \Bigr) + (y - x_{l-1}) \Bigr) = L|x - y|. \tag{3.23} \]

Combining this and (3.20) proves that for all x, y ∈ [x0 , xK ] with x ̸= y it holds that

|l(x) − l(y)| ≤ L|x − y|. (3.24)

This, the fact that for all x, y ∈ (−∞, x0 ] with x ̸= y it holds that

|l(x) − l(y)| = 0 ≤ L|x − y|, (3.25)

the fact that for all x, y ∈ [xK , ∞) with x ̸= y it holds that

|l(x) − l(y)| = 0 ≤ L|x − y|, (3.26)

and the triangle inequality therefore establish that for all x, y ∈ R with x ̸= y it holds that

|l(x) − l(y)| ≤ L|x − y|. (3.27)

This proves item (i). Observe that item (iii) in Lemma 3.1.6 ensures that for all k ∈
{1, 2, . . . , K}, x ∈ [xk−1 , xk ] it holds that
   
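\[ \begin{split} |l(x) - f(x)| &= \Bigl| \bigl(\tfrac{x_k - x}{x_k - x_{k-1}}\bigr) f(x_{k-1}) + \bigl(\tfrac{x - x_{k-1}}{x_k - x_{k-1}}\bigr) f(x_k) - f(x) \Bigr| \\ &= \Bigl| \bigl(\tfrac{x_k - x}{x_k - x_{k-1}}\bigr)(f(x_{k-1}) - f(x)) + \bigl(\tfrac{x - x_{k-1}}{x_k - x_{k-1}}\bigr)(f(x_k) - f(x)) \Bigr| \\ &\le \bigl(\tfrac{x_k - x}{x_k - x_{k-1}}\bigr)|f(x_{k-1}) - f(x)| + \bigl(\tfrac{x - x_{k-1}}{x_k - x_{k-1}}\bigr)|f(x_k) - f(x)|. \end{split} \tag{3.28} \]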


Combining this with (3.1) and Lemma 3.1.2 implies that for all k ∈ {1, 2, . . . , K}, x ∈
[xk−1 , xk ] it holds that
 
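\[ \begin{split} |l(x) - f(x)| &\le w_f(|x_k - x_{k-1}|) \Bigl( \frac{x_k - x}{x_k - x_{k-1}} + \frac{x - x_{k-1}}{x_k - x_{k-1}} \Bigr) \\ &= w_f(|x_k - x_{k-1}|) \le w_f\bigl(\max_{j \in \{1, 2, \dots, K\}} |x_j - x_{j-1}|\bigr). \end{split} \tag{3.29} \]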

This establishes item (ii). The proof of Proposition 3.1.7 is thus complete.

Corollary 3.1.8 (Approximation and Lipschitz continuity properties for the linear inter-
polation operator). Let K ∈ N, L, x0 , x1 , . . . , xK ∈ R satisfy x0 < x1 < . . . < xK and let
f : [x0 , xK ] → R satisfy for all x, y ∈ [x0 , xK ] that

|f (x) − f (y)| ≤ L|x − y|. (3.30)

Then

(i) it holds for all x, y ∈ R that

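\[ \bigl| (\mathscr{L}^{f(x_0), f(x_1), \dots, f(x_K)}_{x_0, x_1, \dots, x_K})(x) - (\mathscr{L}^{f(x_0), f(x_1), \dots, f(x_K)}_{x_0, x_1, \dots, x_K})(y) \bigr| \le L|x - y| \tag{3.31} \]

and

(ii) it holds that

\[ \sup_{x \in [x_0, x_K]} \bigl| (\mathscr{L}^{f(x_0), f(x_1), \dots, f(x_K)}_{x_0, x_1, \dots, x_K})(x) - f(x) \bigr| \le L \Bigl( \max_{k \in \{1, 2, \dots, K\}} |x_k - x_{k-1}| \Bigr) \tag{3.32} \]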

(cf. Definition 3.1.5).

Proof of Corollary 3.1.8. Note that the assumption that for all x, y ∈ [x0 , xK ] it holds that
|f (x) − f (y)| ≤ L|x − y| demonstrates that

\[ 0 \le \frac{|f(x_K) - f(x_0)|}{(x_K - x_0)} \le \frac{L|x_K - x_0|}{(x_K - x_0)} = L. \tag{3.33} \]

Combining this, Lemma 3.1.4, and the assumption that for all x, y ∈ [x0 , xK ] it holds that
|f (x) − f (y)| ≤ L|x − y| with item (i) in Proposition 3.1.7 shows that for all x, y ∈ R it
holds that

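\[ \begin{split} &\bigl| (\mathscr{L}^{f(x_0), f(x_1), \dots, f(x_K)}_{x_0, x_1, \dots, x_K})(x) - (\mathscr{L}^{f(x_0), f(x_1), \dots, f(x_K)}_{x_0, x_1, \dots, x_K})(y) \bigr| \\ &\le \Bigl( \max_{k \in \{1, 2, \dots, K\}} \Bigl( \frac{L|x_k - x_{k-1}|}{|x_k - x_{k-1}|} \Bigr) \Bigr) |x - y| = L|x - y|. \end{split} \tag{3.34} \]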


This proves item (i). Observe that the assumption that for all x, y ∈ [x0 , xK ] it holds that
|f (x) − f (y)| ≤ L|x − y|, Lemma 3.1.4, and item (ii) in Proposition 3.1.7 ensure that
 
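\[ \begin{split} \sup_{x \in [x_0, x_K]} \bigl| (\mathscr{L}^{f(x_0), f(x_1), \dots, f(x_K)}_{x_0, x_1, \dots, x_K})(x) - f(x) \bigr| &\le w_f\Bigl( \max_{k \in \{1, 2, \dots, K\}} |x_k - x_{k-1}| \Bigr) \\ &\le L \Bigl( \max_{k \in \{1, 2, \dots, K\}} |x_k - x_{k-1}| \Bigr). \end{split} \tag{3.35} \]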

This establishes item (ii). The proof of Corollary 3.1.8 is thus complete.
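
The two estimates of Corollary 3.1.8 can be observed numerically. The Python sketch below is only an experiment on finitely many sample points and therefore not a proof; the choice f = sin on [0, 2π] with Lipschitz constant L = 1, the number of nodes, and all function names are assumptions made for this example.

```python
import math

def linear_interpolation(xs, fs):
    """Piecewise linear interpolation, constant outside [xs[0], xs[-1]]."""
    def interp(x):
        if x < xs[0]:
            return fs[0]
        if x >= xs[-1]:
            return fs[-1]
        for k in range(1, len(xs)):
            if xs[k - 1] <= x < xs[k]:
                t = (x - xs[k - 1]) / (xs[k] - xs[k - 1])
                return fs[k - 1] + t * (fs[k] - fs[k - 1])
    return interp

K = 8
xs = [2 * math.pi * k / K for k in range(K + 1)]        # nodes x_0 < ... < x_K
g = linear_interpolation(xs, [math.sin(x) for x in xs])

lipschitz = 1.0                                         # |sin(x) - sin(y)| <= |x - y|
mesh = max(b - a for a, b in zip(xs, xs[1:]))           # max_k |x_k - x_{k-1}|
samples = [2 * math.pi * i / 1000 for i in range(1001)]

# Sampled version of the left-hand side of (3.32).
error = max(abs(g(x) - math.sin(x)) for x in samples)
# Sampled difference quotients of the interpolant, compare (3.31).
slope = max(abs(g(b) - g(a)) / (b - a) for a, b in zip(samples, samples[1:]))
```

In this experiment `error` stays below `lipschitz * mesh` and `slope` stays below `lipschitz`, in agreement with items (ii) and (i) of Corollary 3.1.8.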

3.2 Linear interpolation with fully-connected feedforward ANNs
3.2.1 Activation functions as fully-connected feedforward ANNs
Definition 3.2.1 (Activation functions as fully-connected feedforward ANNs). Let n ∈ N.
Then we denote by
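\[ \mathfrak{i}_n \in ((\mathbb{R}^{n \times n} \times \mathbb{R}^n) \times (\mathbb{R}^{n \times n} \times \mathbb{R}^n)) \subseteq \mathbf{N} \tag{3.36} \]
the fully-connected feedforward ANN given by

\[ \mathfrak{i}_n = ((\operatorname{I}_n, 0), (\operatorname{I}_n, 0)) \tag{3.37} \]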

(cf. Definitions 1.3.1 and 1.5.5).

Lemma 3.2.2 (Realization functions of fully-connected feedforward activation ANNs). Let n ∈ N. Then

(i) it holds that $\mathcal{D}(\mathfrak{i}_n) = (n, n, n) \in \mathbb{N}^3$ and

(ii) it holds for all a ∈ C(R, R) that

\[ \mathcal{R}^{\mathbf{N}}_a(\mathfrak{i}_n) = \mathfrak{M}_{a,n} \tag{3.38} \]

(cf. Definitions 1.2.1, 1.3.1, 1.3.4, and 3.2.1).

Proof of Lemma 3.2.2. Note that the fact that in ∈ ((Rn×n × Rn ) × (Rn×n × Rn )) ⊆ N
implies that
D(in ) = (n, n, n) ∈ N3 (3.39)
(cf. Definitions 1.3.1 and 3.2.1). This proves item (i). Observe that (1.91) and the fact that

in = ((In , 0), (In , 0)) ∈ ((Rn×n × Rn ) × (Rn×n × Rn )) (3.40)


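demonstrate that for all a ∈ C(R, R), $x \in \mathbb{R}^n$ it holds that $\mathcal{R}^{\mathbf{N}}_a(\mathfrak{i}_n) \in C(\mathbb{R}^n, \mathbb{R}^n)$ and

\[ (\mathcal{R}^{\mathbf{N}}_a(\mathfrak{i}_n))(x) = \operatorname{I}_n(\mathfrak{M}_{a,n}(\operatorname{I}_n x + 0)) + 0 = \mathfrak{M}_{a,n}(x). \tag{3.41} \]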

This establishes item (ii). The proof of Lemma 3.2.2 is thus complete.
Lemma 3.2.3 (Compositions of fully-connected feedforward activation ANNs with general
fully-connected feedforward ANNs). Let Φ ∈ N (cf. Definition 1.3.1). Then
(i) it holds that

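\[ \mathcal{D}(\mathfrak{i}_{\mathcal{O}(\Phi)} \bullet \Phi) = (\mathbb{D}_0(\Phi), \mathbb{D}_1(\Phi), \mathbb{D}_2(\Phi), \dots, \mathbb{D}_{\mathcal{L}(\Phi)-1}(\Phi), \mathbb{D}_{\mathcal{L}(\Phi)}(\Phi), \mathbb{D}_{\mathcal{L}(\Phi)}(\Phi)) \in \mathbb{N}^{\mathcal{L}(\Phi)+2}, \tag{3.42} \]

(ii) it holds for all a ∈ C(R, R) that $\mathcal{R}^{\mathbf{N}}_a(\mathfrak{i}_{\mathcal{O}(\Phi)} \bullet \Phi) \in C(\mathbb{R}^{\mathcal{I}(\Phi)}, \mathbb{R}^{\mathcal{O}(\Phi)})$,

(iii) it holds for all a ∈ C(R, R) that $\mathcal{R}^{\mathbf{N}}_a(\mathfrak{i}_{\mathcal{O}(\Phi)} \bullet \Phi) = \mathfrak{M}_{a,\mathcal{O}(\Phi)} \circ (\mathcal{R}^{\mathbf{N}}_a(\Phi))$,

(iv) it holds that

\[ \mathcal{D}(\Phi \bullet \mathfrak{i}_{\mathcal{I}(\Phi)}) = (\mathbb{D}_0(\Phi), \mathbb{D}_0(\Phi), \mathbb{D}_1(\Phi), \mathbb{D}_2(\Phi), \dots, \mathbb{D}_{\mathcal{L}(\Phi)-1}(\Phi), \mathbb{D}_{\mathcal{L}(\Phi)}(\Phi)) \in \mathbb{N}^{\mathcal{L}(\Phi)+2}, \tag{3.43} \]

(v) it holds for all a ∈ C(R, R) that $\mathcal{R}^{\mathbf{N}}_a(\Phi \bullet \mathfrak{i}_{\mathcal{I}(\Phi)}) \in C(\mathbb{R}^{\mathcal{I}(\Phi)}, \mathbb{R}^{\mathcal{O}(\Phi)})$, and

(vi) it holds for all a ∈ C(R, R) that $\mathcal{R}^{\mathbf{N}}_a(\Phi \bullet \mathfrak{i}_{\mathcal{I}(\Phi)}) = (\mathcal{R}^{\mathbf{N}}_a(\Phi)) \circ \mathfrak{M}_{a,\mathcal{I}(\Phi)}$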

(cf. Definitions 1.2.1, 1.3.4, 2.1.1, and 3.2.1).


Proof of Lemma 3.2.3. Note that Lemma 3.2.2 shows that for all n ∈ N, a ∈ C(R, R) it
holds that
\[ \mathcal{R}^{\mathbf{N}}_a(\mathfrak{i}_n) = \mathfrak{M}_{a,n} \tag{3.44} \]
(cf. Definitions 1.2.1, 1.3.4, and 3.2.1). Combining this and Proposition 2.1.2 proves items (i),
(ii), (iii), (iv), (v), and (vi). The proof of Lemma 3.2.3 is thus complete.

3.2.2 Representations for ReLU ANNs with one hidden neuron


Lemma 3.2.4. Let α, β, h ∈ R, H ∈ N satisfy

H = h ⊛ (i1 • Aα,β ) (3.45)

(cf. Definitions 1.3.1, 2.1.1, 2.3.1, 2.3.4, and 3.2.1). Then


(i) it holds that H = ((α, β), (h, 0)),


(ii) it holds that D(H) = (1, 1, 1) ∈ N3 ,

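(iii) it holds that $\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{H}) \in C(\mathbb{R}, \mathbb{R})$, and

(iv) it holds for all x ∈ R that $(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{H}))(x) = h \max\{\alpha x + \beta, 0\}$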

(cf. Definitions 1.2.4 and 1.3.4).

Proof of Lemma 3.2.4. Observe that Lemma 2.3.2 ensures that

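\[ \mathbf{A}_{\alpha,\beta} = (\alpha, \beta), \qquad \mathcal{D}(\mathbf{A}_{\alpha,\beta}) = (1, 1) \in \mathbb{N}^2, \qquad \mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{A}_{\alpha,\beta}) \in C(\mathbb{R}, \mathbb{R}), \tag{3.46} \]

and $\forall\, x \in \mathbb{R} \colon (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{A}_{\alpha,\beta}))(x) = \alpha x + \beta$ (cf. Definitions 1.2.4 and 1.3.4). Proposition 2.1.2,
Lemma 3.2.2, Lemma 3.2.3, (1.26), (1.91), and (2.2) hence imply that

\[ \mathfrak{i}_1 \bullet \mathbf{A}_{\alpha,\beta} = ((\alpha, \beta), (1, 0)), \qquad \mathcal{D}(\mathfrak{i}_1 \bullet \mathbf{A}_{\alpha,\beta}) = (1, 1, 1) \in \mathbb{N}^3, \qquad \mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathfrak{i}_1 \bullet \mathbf{A}_{\alpha,\beta}) \in C(\mathbb{R}, \mathbb{R}), \tag{3.47} \]

\[ \text{and} \qquad \forall\, x \in \mathbb{R} \colon (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathfrak{i}_1 \bullet \mathbf{A}_{\alpha,\beta}))(x) = \mathfrak{r}(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{A}_{\alpha,\beta})(x)) = \max\{\alpha x + \beta, 0\}. \tag{3.48} \]

This, Lemma 2.3.5, and (2.127) demonstrate that

\[ \mathbf{H} = h \circledast (\mathfrak{i}_1 \bullet \mathbf{A}_{\alpha,\beta}) = ((\alpha, \beta), (h, 0)), \qquad \mathcal{D}(\mathbf{H}) = (1, 1, 1), \qquad \mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{H}) \in C(\mathbb{R}, \mathbb{R}), \tag{3.49} \]

\[ \text{and} \qquad (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{H}))(x) = h\bigl((\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathfrak{i}_1 \bullet \mathbf{A}_{\alpha,\beta}))(x)\bigr) = h \max\{\alpha x + \beta, 0\}. \tag{3.50} \]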

This establishes items (i), (ii), (iii), and (iv). The proof of Lemma 3.2.4 is thus complete.

3.2.3 ReLU ANN representations for linear interpolations


Proposition 3.2.5 (ReLU ANN representations for linear interpolations). Let K ∈ N, f0 ,
f1 , . . . , fK , x0 , x1 , . . . , xK ∈ R satisfy x0 < x1 < . . . < xK and let F ∈ N satisfy
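\[ \mathbf{F} = \mathbf{A}_{1,f_0} \bullet \biggl( \bigoplus_{k=0}^K \biggl( \Bigl( \frac{(f_{\min\{k+1,K\}} - f_k)}{(x_{\min\{k+1,K\}} - x_{\min\{k,K-1\}})} - \frac{(f_k - f_{\max\{k-1,0\}})}{(x_{\max\{k,1\}} - x_{\max\{k-1,0\}})} \Bigr) \circledast (\mathfrak{i}_1 \bullet \mathbf{A}_{1,-x_k}) \biggr) \biggr) \tag{3.51} \]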

(cf. Definitions 1.3.1, 2.1.1, 2.3.1, 2.3.4, 2.4.10, and 3.2.1). Then

(i) it holds that D(F) = (1, K + 1, 1) ∈ N3 ,

(ii) it holds that $\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}) = \mathscr{L}^{f_0, f_1, \dots, f_K}_{x_0, x_1, \dots, x_K}$, and

(iii) it holds that P(F) = 3K + 4

(cf. Definitions 1.2.4, 1.3.4, and 3.1.5).


Proof of Proposition 3.2.5. Throughout this proof, let c0 , c1 , . . . , cK ∈ R satisfy for all
k ∈ {0, 1, . . . , K} that

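\[ c_k = \frac{(f_{\min\{k+1,K\}} - f_k)}{(x_{\min\{k+1,K\}} - x_{\min\{k,K-1\}})} - \frac{(f_k - f_{\max\{k-1,0\}})}{(x_{\max\{k,1\}} - x_{\max\{k-1,0\}})} \tag{3.52} \]

and let $\Phi_0, \Phi_1, \dots, \Phi_K \in ((\mathbb{R}^{1 \times 1} \times \mathbb{R}^1) \times (\mathbb{R}^{1 \times 1} \times \mathbb{R}^1)) \subseteq \mathbf{N}$ satisfy for all k ∈ {0, 1, . . . , K} that

\[ \Phi_k = c_k \circledast (\mathfrak{i}_1 \bullet \mathbf{A}_{1,-x_k}). \tag{3.53} \]

Note that Lemma 3.2.4 shows that for all k ∈ {0, 1, . . . , K} it holds that

\[ \mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\Phi_k) \in C(\mathbb{R}, \mathbb{R}), \qquad \mathcal{D}(\Phi_k) = (1, 1, 1) \in \mathbb{N}^3, \tag{3.54} \]

\[ \text{and} \qquad \forall\, x \in \mathbb{R} \colon (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\Phi_k))(x) = c_k \max\{x - x_k, 0\} \tag{3.55} \]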

(cf. Definitions 1.2.4 and 1.3.4). This, Lemma 2.3.3, Lemma 2.4.11, and (3.51) prove that

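\[ \mathcal{D}(\mathbf{F}) = (1, K + 1, 1) \in \mathbb{N}^3 \qquad \text{and} \qquad \mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}) \in C(\mathbb{R}, \mathbb{R}). \tag{3.56} \]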

This establishes item (i). Observe that item (i) and (1.78) ensure that

P(F) = 2(K + 1) + (K + 2) = 3K + 4. (3.57)

This implies item (iii). Note that (3.52), (3.55), Lemma 2.3.3, and Lemma 2.4.11 demonstrate
that for all x ∈ R it holds that

\[ (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x) = f_0 + \sum_{k=0}^K (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\Phi_k))(x) = f_0 + \sum_{k=0}^K c_k \max\{x - x_k, 0\}. \tag{3.58} \]

This and the fact that for all k ∈ {0, 1, . . . , K} it holds that x0 ≤ xk show that for all
x ∈ (−∞, x0 ] it holds that

\[ (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x) = f_0 + 0 = f_0. \tag{3.59} \]
Next we claim that for all k ∈ {1, 2, . . . , K} it holds that
\[ \sum_{n=0}^{k-1} c_n = \frac{f_k - f_{k-1}}{x_k - x_{k-1}}. \tag{3.60} \]

We now prove (3.60) by induction on k ∈ {1, 2, . . . , K}. For the base case k = 1 observe
that (3.52) proves that

\[ \sum_{n=0}^{0} c_n = c_0 = \frac{f_1 - f_0}{x_1 - x_0}. \tag{3.61} \]


This establishes (3.60) in the base case k = 1. For the induction step observe that (3.52)
ensures that for all k ∈ N ∩ (1, ∞) ∩ (0, K] with $\sum_{n=0}^{k-2} c_n = \frac{f_{k-1} - f_{k-2}}{x_{k-1} - x_{k-2}}$ it holds that

\[ \sum_{n=0}^{k-1} c_n = c_{k-1} + \sum_{n=0}^{k-2} c_n = \frac{f_k - f_{k-1}}{x_k - x_{k-1}} - \frac{f_{k-1} - f_{k-2}}{x_{k-1} - x_{k-2}} + \frac{f_{k-1} - f_{k-2}}{x_{k-1} - x_{k-2}} = \frac{f_k - f_{k-1}}{x_k - x_{k-1}}. \tag{3.62} \]

Induction thus implies (3.60). Furthermore, note that (3.58), (3.60), and the fact that for
all k ∈ {1, 2, . . . , K} it holds that xk−1 < xk demonstrate that for all k ∈ {1, 2, . . . , K},
x ∈ [xk−1 , xk ] it holds that
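\[ \begin{split} (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x) - (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x_{k-1}) &= \sum_{n=0}^{K} c_n (\max\{x - x_n, 0\} - \max\{x_{k-1} - x_n, 0\}) \\ &= \sum_{n=0}^{k-1} c_n [(x - x_n) - (x_{k-1} - x_n)] = \sum_{n=0}^{k-1} c_n (x - x_{k-1}) \\ &= \Bigl( \frac{f_k - f_{k-1}}{x_k - x_{k-1}} \Bigr)(x - x_{k-1}). \end{split} \tag{3.63} \]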
Next we claim that for all k ∈ {1, 2, . . . , K}, x ∈ [xk−1 , xk ] it holds that
 
fk − fk−1
N
(Rr (F))(x) = fk−1 + (x − xk−1 ). (3.64)
xk − xk−1
We now prove (3.64) by induction on k ∈ {1, 2, . . . , K}. For the base case k = 1 observe
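that (3.59) and (3.63) show that for all x ∈ [x0 , x1 ] it holds that
\[
(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x) = (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x_0) + (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x) - (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x_0) = f_0 + \Bigl(\frac{f_1 - f_0}{x_1 - x_0}\Bigr)(x - x_0). \tag{3.65}
\]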
This proves (3.64) in the base case k = 1. For the induction step note that (3.63) establishes
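that for all k ∈ N ∩ (1, ∞) ∩ [1, K], x ∈ [xk−1 , xk ] with $\forall\, y \in [x_{k-2}, x_{k-1}] \colon (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(y) = f_{k-2} + \bigl(\frac{f_{k-1} - f_{k-2}}{x_{k-1} - x_{k-2}}\bigr)(y - x_{k-2})$ it holds that

\[
\begin{aligned}
(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x)
&= (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x_{k-1}) + (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x) - (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x_{k-1}) \\
&= f_{k-2} + \Bigl(\frac{f_{k-1} - f_{k-2}}{x_{k-1} - x_{k-2}}\Bigr)(x_{k-1} - x_{k-2}) + \Bigl(\frac{f_k - f_{k-1}}{x_k - x_{k-1}}\Bigr)(x - x_{k-1}) \\
&= f_{k-1} + \Bigl(\frac{f_k - f_{k-1}}{x_k - x_{k-1}}\Bigr)(x - x_{k-1}).
\end{aligned} \tag{3.66}
\]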
Induction thus ensures (3.64). Moreover, observe that (3.52) and (3.60) imply that
\[
\sum_{n=0}^{K} c_n = c_K + \sum_{n=0}^{K-1} c_n = -\frac{f_K - f_{K-1}}{x_K - x_{K-1}} + \frac{f_K - f_{K-1}}{x_K - x_{K-1}} = 0. \tag{3.67}
\]


The fact that for all k ∈ {0, 1, . . . , K} it holds that xk ≤ xK and (3.58) therefore demonstrate
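that for all x ∈ [xK , ∞) it holds that
\[
\begin{aligned}
(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x) - (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x_K)
&= \Bigl[\sum_{n=0}^{K} c_n \bigl(\max\{x - x_n, 0\} - \max\{x_K - x_n, 0\}\bigr)\Bigr] \\
&= \sum_{n=0}^{K} c_n \bigl[(x - x_n) - (x_K - x_n)\bigr] = \sum_{n=0}^{K} c_n (x - x_K) = 0.
\end{aligned} \tag{3.68}
\]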

This and (3.64) show that for all x ∈ [xK , ∞) it holds that
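\[
(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x) = (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x_K) = f_{K-1} + \Bigl(\frac{f_K - f_{K-1}}{x_K - x_{K-1}}\Bigr)(x_K - x_{K-1}) = f_K. \tag{3.69}
\]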

Combining this, (3.59), (3.64), and (3.11) proves item (ii). The proof of Proposition 3.2.5 is
thus complete.
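
The representation (3.58) can be checked numerically. The following Python sketch is an illustration only: the function names are hypothetical, and the coefficients are computed as differences of consecutive slopes, which is the form of the c_k that (3.61), (3.62), and (3.67) in the proof above rely on. It compares the shallow ReLU realization with NumPy's piecewise linear interpolation, which is also constant outside [x_0, x_K].

```python
import numpy as np

def interpolation_coefficients(xs, fs):
    """Return c_0, ..., c_K as differences of consecutive interpolation slopes."""
    xs = np.asarray(xs, dtype=float)
    fs = np.asarray(fs, dtype=float)
    slopes = np.diff(fs) / np.diff(xs)               # (f_k - f_{k-1}) / (x_k - x_{k-1})
    padded = np.concatenate(([0.0], slopes, [0.0]))  # slope 0 left of x_0 and right of x_K
    return np.diff(padded)                           # slope after x_k minus slope before x_k

def shallow_relu_realization(xs, fs, x):
    """Evaluate f_0 + sum_k c_k * max(x - x_k, 0), cf. (3.58)."""
    xs = np.asarray(xs, dtype=float)
    fs = np.asarray(fs, dtype=float)
    c = interpolation_coefficients(xs, fs)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    hidden = np.maximum(x[:, None] - xs[None, :], 0.0)  # K + 1 hidden ReLU neurons
    return fs[0] + hidden @ c

# Example data (arbitrary values chosen for illustration).
xs = np.array([-1.0, 0.0, 2.0, 3.0])
fs = np.array([1.0, 3.0, -1.0, 0.5])
grid = np.linspace(-3.0, 5.0, 161)
print(np.allclose(shallow_relu_realization(xs, fs, grid), np.interp(grid, xs, fs)))
```

The hidden layer has K + 1 neurons, one per grid point, which matches the architecture (1, K + 1, 1) and the parameter count 3K + 4 from (3.57).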
Exercise 3.2.1. Prove or disprove the following statement: There exists Φ ∈ N such that
P(Φ) ≤ 16 and
\[
\sup_{x \in [-2\pi, 2\pi]} \bigl|\cos(x) - (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\Phi))(x)\bigr| \le \tfrac{1}{2} \tag{3.70}
\]

(cf. Definitions 1.2.4, 1.3.1, and 1.3.4).


Exercise 3.2.2. Prove or disprove the following statement: There exists Φ ∈ N such that
I(Φ) = 4, O(Φ) = 1, P(Φ) ≤ 60, and $\forall\, x, y, u, v \in \mathbb{R} \colon (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\Phi))(x, y, u, v) = \max\{x, y, u, v\}$
(cf. Definitions 1.2.4, 1.3.1, and 1.3.4).
Exercise 3.2.3. Prove or disprove the following statement: For every m ∈ N there exists
Φ ∈ N such that $\mathcal{I}(\Phi) = 2^m$, $\mathcal{O}(\Phi) = 1$, $\mathcal{P}(\Phi) \le 3(2^m(2^m + 1))$, and $\forall\, x = (x_1, x_2, \ldots, x_{2^m}) \in \mathbb{R}^{2^m} \colon (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\Phi))(x) = \max\{x_1, x_2, \ldots, x_{2^m}\}$ (cf. Definitions 1.2.4, 1.3.1, and 1.3.4).

3.3 ANN approximations results for one-dimensional functions
3.3.1 Constructive ANN approximation results
Proposition 3.3.1 (ANN approximations through linear interpolations). Let K ∈ N,
L, a, x0 , x1 , . . . , xK ∈ R, b ∈ (a, ∞) satisfy for all k ∈ {0, 1, . . . , K} that $x_k = a + \frac{k(b-a)}{K}$, let
f : [a, b] → R satisfy for all x, y ∈ [a, b] that
|f (x) − f (y)| ≤ L|x − y|, (3.71)
and let F ∈ N satisfy
\[
\mathbf{F} = \mathbf{A}_{1, f(x_0)} \bullet \biggl( \bigoplus_{k=0}^{K} \Bigl( \Bigl( \tfrac{K (f(x_{\min\{k+1,K\}}) - 2 f(x_k) + f(x_{\max\{k-1,0\}}))}{(b-a)} \Bigr) \circledast (\mathfrak{i}_1 \bullet \mathbf{A}_{1, -x_k}) \Bigr) \biggr) \tag{3.72}
\]


(cf. Definitions 1.3.1, 2.1.1, 2.3.1, 2.3.4, 2.4.10, and 3.2.1). Then

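(i) it holds that $\mathcal{D}(\mathbf{F}) = (1, K + 1, 1)$,

(ii) it holds that $\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}) = \mathscr{L}_{x_0, x_1, \ldots, x_K}^{f(x_0), f(x_1), \ldots, f(x_K)}$,

(iii) it holds for all x, y ∈ R that $|(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x) - (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(y)| \le L|x - y|$,

(iv) it holds that $\sup_{x \in [a,b]} |(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x) - f(x)| \le L(b - a)K^{-1}$, and

(v) it holds that $\mathcal{P}(\mathbf{F}) = 3K + 4$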

(cf. Definitions 1.2.4, 1.3.4, and 3.1.5).

Proof of Proposition 3.3.1. Note that the fact that for all k ∈ {0, 1, . . . , K} it holds that

\[
x_{\min\{k+1,K\}} - x_{\min\{k,K-1\}} = x_{\max\{k,1\}} - x_{\max\{k-1,0\}} = (b - a)K^{-1} \tag{3.73}
\]

establishes that for all k ∈ {0, 1, . . . , K} it holds that

\[
\frac{f(x_{\min\{k+1,K\}}) - f(x_k)}{x_{\min\{k+1,K\}} - x_{\min\{k,K-1\}}} - \frac{f(x_k) - f(x_{\max\{k-1,0\}})}{x_{\max\{k,1\}} - x_{\max\{k-1,0\}}} = \frac{K (f(x_{\min\{k+1,K\}}) - 2 f(x_k) + f(x_{\max\{k-1,0\}}))}{(b - a)}. \tag{3.74}
\]

This and Proposition 3.2.5 prove items (i), (ii), and (v). Observe that item (i) in Corol-
lary 3.1.8, item (ii), and the assumption that for all x, y ∈ [a, b] it holds that

|f (x) − f (y)| ≤ L|x − y| (3.75)

prove item (iii). Note that item (ii), the assumption that for all x, y ∈ [a, b] it holds that

|f (x) − f (y)| ≤ L|x − y|, (3.76)

item (ii) in Corollary 3.1.8, and the fact that for all k ∈ {1, 2, . . . , K} it holds that

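\[
x_k - x_{k-1} = \frac{(b - a)}{K} \tag{3.77}
\]
ensure that for all x ∈ [a, b] it holds that
\[
|(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x) - f(x)| \le L \Bigl( \max_{k \in \{1, 2, \ldots, K\}} |x_k - x_{k-1}| \Bigr) = \frac{L(b - a)}{K}. \tag{3.78}
\]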

This establishes item (iv). The proof of Proposition 3.3.1 is thus complete.
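
The error bound in item (iv) can be observed numerically. The sketch below is illustrative and not part of the source: the function names are hypothetical, the target function sin on [0, 2π] with Lipschitz constant L = 1 is an arbitrary choice, and the supremum is only approximated on a finite grid. The output weights are the scaled second differences from (3.72).

```python
import numpy as np

def relu_interpolation_network(f, a, b, K):
    """Return grid points x_k, output weights c_k as in (3.72), and the output bias f(x_0)."""
    k = np.arange(K + 1)
    xs = a + k * (b - a) / K
    fs = f(xs)
    # Second differences with clamped indices min{k+1, K} and max{k-1, 0}.
    c = K * (fs[np.minimum(k + 1, K)] - 2.0 * fs + fs[np.maximum(k - 1, 0)]) / (b - a)
    return xs, c, fs[0]

def realization(xs, c, bias, x):
    """Evaluate bias + sum_k c_k * max(x - x_k, 0)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return bias + np.maximum(x[:, None] - xs[None, :], 0.0) @ c

a, b, K, L = 0.0, 2.0 * np.pi, 20, 1.0   # assumption: f = sin, which is 1-Lipschitz
xs, c, bias = relu_interpolation_network(np.sin, a, b, K)
grid = np.linspace(a, b, 2001)
max_error = np.max(np.abs(realization(xs, c, bias, grid) - np.sin(grid)))
print(max_error, "<=", L * (b - a) / K)
```

For smooth targets the observed error is typically much smaller than L(b − a)/K, since the bound only uses the Lipschitz property.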


Lemma 3.3.2 (Approximations through ANNs with constant realizations). Let L, a ∈ R,
b ∈ [a, ∞), ξ ∈ [a, b], let f : [a, b] → R satisfy for all x, y ∈ [a, b] that

|f (x) − f (y)| ≤ L|x − y|, (3.79)

and let F ∈ N satisfy


F = A1,f (ξ) • (0 ⊛ (i1 • A1,−ξ )) (3.80)
(cf. Definitions 1.3.1, 2.1.1, 2.3.1, 2.3.4, and 3.2.1). Then

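(i) it holds that $\mathcal{D}(\mathbf{F}) = (1, 1, 1)$,

(ii) it holds that $\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}) \in C(\mathbb{R}, \mathbb{R})$,

(iii) it holds for all x ∈ R that $(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x) = f(\xi)$,

(iv) it holds that $\sup_{x \in [a,b]} |(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x) - f(x)| \le L \max\{\xi - a, b - \xi\}$, and

(v) it holds that $\mathcal{P}(\mathbf{F}) = 4$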

(cf. Definitions 1.2.4 and 1.3.4).

Proof of Lemma 3.3.2. Observe that items (i) and (ii) in Lemma 2.3.3, and items (ii)
and (iii) in Lemma 3.2.4 establish items (i) and (ii). Note that item (iii) in Lemma 2.3.3
and item (iii) in Lemma 2.3.5 imply that for all x ∈ R it holds that

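\[
\begin{aligned}
(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x) &= (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(0 \circledast (\mathfrak{i}_1 \bullet \mathbf{A}_{1,-\xi})))(x) + f(\xi) \\
&= 0 \bigl( (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathfrak{i}_1 \bullet \mathbf{A}_{1,-\xi}))(x) \bigr) + f(\xi) = f(\xi)
\end{aligned} \tag{3.81}
\]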

(cf. Definitions 1.2.4 and 1.3.4). This proves item (iii). Observe that (3.81), the fact that
ξ ∈ [a, b], and the assumption that for all x, y ∈ [a, b] it holds that

|f (x) − f (y)| ≤ L|x − y| (3.82)

demonstrate that for all x ∈ [a, b] it holds that

\[
|(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x) - f(x)| = |f(\xi) - f(x)| \le L|x - \xi| \le L \max\{\xi - a, b - \xi\}. \tag{3.83}
\]

This establishes item (iv). Note that (1.78) and item (i) show that

P(F) = 1(1 + 1) + 1(1 + 1) = 4. (3.84)

This proves item (v). The proof of Lemma 3.3.2 is thus complete.


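Corollary 3.3.3 (Explicit ANN approximations with prescribed error tolerances). Let
ε ∈ (0, ∞), L, a ∈ R, b ∈ (a, ∞), $K \in \mathbb{N}_0 \cap \bigl[\frac{L(b-a)}{\varepsilon}, \frac{L(b-a)}{\varepsilon} + 1\bigr)$, x0 , x1 , . . . , xK ∈ R satisfy for
all k ∈ {0, 1, . . . , K} that $x_k = a + \frac{k(b-a)}{\max\{K,1\}}$, let f : [a, b] → R satisfy for all x, y ∈ [a, b] that

|f (x) − f (y)| ≤ L|x − y|, (3.85)

and let F ∈ N satisfy

\[
\mathbf{F} = \mathbf{A}_{1, f(x_0)} \bullet \biggl( \bigoplus_{k=0}^{K} \Bigl( \Bigl( \tfrac{K (f(x_{\min\{k+1,K\}}) - 2 f(x_k) + f(x_{\max\{k-1,0\}}))}{(b-a)} \Bigr) \circledast (\mathfrak{i}_1 \bullet \mathbf{A}_{1, -x_k}) \Bigr) \biggr) \tag{3.86}
\]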

(cf. Definitions 1.3.1, 2.1.1, 2.3.1, 2.3.4, 2.4.10, and 3.2.1). Then

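(i) it holds that $\mathcal{D}(\mathbf{F}) = (1, K + 1, 1)$,

(ii) it holds that $\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}) \in C(\mathbb{R}, \mathbb{R})$,

(iii) it holds for all x, y ∈ R that $|(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x) - (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(y)| \le L|x - y|$,

(iv) it holds that $\sup_{x \in [a,b]} |(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x) - f(x)| \le \frac{L(b-a)}{\max\{K,1\}} \le \varepsilon$, and

(v) it holds that $\mathcal{P}(\mathbf{F}) = 3K + 4 \le 3L(b - a)\varepsilon^{-1} + 7$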

(cf. Definitions 1.2.4, 1.3.1, and 1.3.4).

Proof of Corollary 3.3.3. Observe that the assumption that $K \in \mathbb{N}_0 \cap \bigl[\frac{L(b-a)}{\varepsilon}, \frac{L(b-a)}{\varepsilon} + 1\bigr)$
ensures that
\[
\frac{L(b - a)}{\max\{K, 1\}} \le \varepsilon. \tag{3.87}
\]
This, items (i), (iii), and (iv) in Proposition 3.3.1, and items (i), (ii), (iii), and (iv) in
Lemma 3.3.2 establish items (i), (ii), (iii), and (iv). Note that item (v) in Proposition 3.3.1,
item (v) in Lemma 3.3.2, and the fact that

\[
K \le 1 + \frac{L(b - a)}{\varepsilon} \tag{3.88}
\]
imply that
\[
\mathcal{P}(\mathbf{F}) = 3K + 4 \le \frac{3L(b - a)}{\varepsilon} + 7. \tag{3.89}
\]
This proves item (v). The proof of Corollary 3.3.3 is thus complete.
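
The choice of K in Corollary 3.3.3 and the resulting parameter count can be made concrete with a few lines of Python. This is a minimal sketch with hypothetical function names; it assumes L ≥ 0 and uses floating point arithmetic, so the ceiling may be off by one when L(b − a)/ε is an integer up to rounding.

```python
import math

def hidden_width_for_tolerance(L, a, b, eps):
    """Return the integer K in [t, t + 1) for t = L (b - a) / eps (assumes L >= 0)."""
    return math.ceil(L * (b - a) / eps)

def parameter_count(K):
    """Number of parameters of an ANN with layer dimensions (1, K + 1, 1)."""
    hidden_weights, hidden_biases = K + 1, K + 1
    output_weights, output_bias = K + 1, 1
    return hidden_weights + hidden_biases + output_weights + output_bias  # = 3K + 4

# Example values (assumptions for illustration): L = 1 on [0, 2*pi] with tolerance 0.1.
L, a, b, eps = 1.0, 0.0, 2.0 * math.pi, 0.1
K = hidden_width_for_tolerance(L, a, b, eps)
print(K, parameter_count(K), 3.0 * L * (b - a) / eps + 7.0)
```

Halving ε roughly doubles K and hence the number of parameters, which is the ε^{-1} rate made explicit in item (v).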


3.3.2 Convergence rates for the approximation error


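Definition 3.3.4 (Quasi vector norms). We denote by $\|\cdot\|_p \colon \bigl(\bigcup_{d=1}^{\infty} \mathbb{R}^d\bigr) \to \mathbb{R}$, p ∈ (0, ∞],
the functions which satisfy for all p ∈ (0, ∞), d ∈ N, $\theta = (\theta_1, \ldots, \theta_d) \in \mathbb{R}^d$ that
\[
\|\theta\|_p = \Bigl( \sum_{i=1}^{d} |\theta_i|^p \Bigr)^{1/p} \qquad \text{and} \qquad \|\theta\|_\infty = \max_{i \in \{1, 2, \ldots, d\}} |\theta_i|. \tag{3.90}
\]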
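
As a small illustration of Definition 3.3.4, the following Python sketch (the function name is a hypothetical choice) evaluates the quantities in (3.90) and shows why the word "quasi" is appropriate: for p < 1 the triangle inequality can fail.

```python
import numpy as np

def quasi_norm(theta, p):
    """Evaluate ||theta||_p from (3.90) for p in (0, infinity]."""
    theta = np.abs(np.asarray(theta, dtype=float))
    if np.isinf(p):
        return float(theta.max())
    return float(np.sum(theta ** p) ** (1.0 / p))

# For p = 1/2 the triangle inequality fails:
# ||(1, 0)||_p + ||(0, 1)||_p = 2, whereas ||(1, 1)||_p = (1 + 1)^2 = 4.
lhs = quasi_norm([1.0, 1.0], 0.5)
rhs = quasi_norm([1.0, 0.0], 0.5) + quasi_norm([0.0, 1.0], 0.5)
print(lhs, rhs)
```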

Corollary 3.3.5 (Implicit one-dimensional ANN approximations with prescribed error
tolerances and explicit parameter bounds). Let ε ∈ (0, ∞), L ∈ [0, ∞), a ∈ R, b ∈ [a, ∞)
and let f : [a, b] → R satisfy for all x, y ∈ [a, b] that

|f (x) − f (y)| ≤ L|x − y|. (3.91)

Then there exists F ∈ N such that

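(i) it holds that $\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}) \in C(\mathbb{R}, \mathbb{R})$,

(ii) it holds that $\mathcal{H}(\mathbf{F}) = 1$,

(iii) it holds that $\mathbb{D}_1(\mathbf{F}) \le L(b - a)\varepsilon^{-1} + 2$,

(iv) it holds for all x, y ∈ R that $|(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x) - (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(y)| \le L|x - y|$,

(v) it holds that $\sup_{x \in [a,b]} |(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x) - f(x)| \le \varepsilon$,

(vi) it holds that $\mathcal{P}(\mathbf{F}) = 3(\mathbb{D}_1(\mathbf{F})) + 1 \le 3L(b - a)\varepsilon^{-1} + 7$, and

(vii) it holds that $\|\mathcal{T}(\mathbf{F})\|_\infty \le \max\{1, |a|, |b|, 2L, |f(a)|\}$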

(cf. Definitions 1.2.4, 1.3.1, 1.3.4, 1.3.5, and 3.3.4).

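Proof of Corollary 3.3.5. Throughout this proof, assume without loss of generality that
a < b, let $K \in \mathbb{N}_0 \cap \bigl[\frac{L(b-a)}{\varepsilon}, \frac{L(b-a)}{\varepsilon} + 1\bigr)$, x0 , x1 , . . . , xK ∈ [a, b], c0 , c1 , . . . , cK ∈ R satisfy for
all k ∈ {0, 1, . . . , K} that

\[
x_k = a + \frac{k(b - a)}{\max\{K, 1\}} \qquad \text{and} \qquad c_k = \frac{K (f(x_{\min\{k+1,K\}}) - 2 f(x_k) + f(x_{\max\{k-1,0\}}))}{(b - a)}, \tag{3.92}
\]

and let F ∈ N satisfy

\[
\mathbf{F} = \mathbf{A}_{1, f(x_0)} \bullet \biggl( \bigoplus_{k=0}^{K} \bigl( c_k \circledast (\mathfrak{i}_1 \bullet \mathbf{A}_{1, -x_k}) \bigr) \biggr) \tag{3.93}
\]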

(cf. Definitions 1.3.1, 2.1.1, 2.3.1, 2.3.4, 2.4.10, and 3.2.1). Observe that Corollary 3.3.3
demonstrates that


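(I) it holds that $\mathcal{D}(\mathbf{F}) = (1, K + 1, 1)$,

(II) it holds that $\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}) \in C(\mathbb{R}, \mathbb{R})$,

(III) it holds for all x, y ∈ R that $|(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x) - (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(y)| \le L|x - y|$,

(IV) it holds that $\sup_{x \in [a,b]} |(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x) - f(x)| \le \varepsilon$, and

(V) it holds that $\mathcal{P}(\mathbf{F}) = 3K + 4$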

(cf. Definitions 1.2.4 and 1.3.4). This establishes items (i), (iv), and (v). Note that item (I)
and the fact that
\[
K \le 1 + \frac{L(b - a)}{\varepsilon} \tag{3.94}
\]
prove items (ii) and (iii). Observe that item (ii) and items (I) and (V) show that

\[
\mathcal{P}(\mathbf{F}) = 3K + 4 = 3(K + 1) + 1 = 3(\mathbb{D}_1(\mathbf{F})) + 1 \le \frac{3L(b - a)}{\varepsilon} + 7. \tag{3.95}
\]
This proves item (vi). Note that Lemma 3.2.4 ensures that for all k ∈ {0, 1, . . . , K} it holds
that
ck ⊛ (i1 • A1,−xk ) = ((1, −xk ), (ck , 0)). (3.96)
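Combining this with (2.152), (2.143), (2.134), and (2.2) implies that
\[
\mathbf{F} = \left( \left( \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}, \begin{pmatrix} -x_0 \\ -x_1 \\ \vdots \\ -x_K \end{pmatrix} \right), \left( \begin{pmatrix} c_0 & c_1 & \cdots & c_K \end{pmatrix}, f(x_0) \right) \right)
\in (\mathbb{R}^{(K+1) \times 1} \times \mathbb{R}^{K+1}) \times (\mathbb{R}^{1 \times (K+1)} \times \mathbb{R}). \tag{3.97}
\]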

Lemma 1.3.8 hence demonstrates that

∥T (F)∥∞ = max{|x0 |, |x1 |, . . . , |xK |, |c0 |, |c1 |, . . . , |cK |, |f (x0 )|, 1} (3.98)

(cf. Definitions 1.3.5 and 3.3.4). Furthermore, observe that the assumption that for all
x, y ∈ [a, b] it holds that
|f (x) − f (y)| ≤ L|x − y| (3.99)
and the fact that for all k ∈ N ∩ (0, K + 1) it holds that

\[
x_k - x_{k-1} = \frac{(b - a)}{\max\{K, 1\}} \tag{3.100}
\]


establish that for all k ∈ {0, 1, . . . , K} it holds that

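\[
\begin{aligned}
|c_k| &\le \frac{K \bigl( |f(x_{\min\{k+1,K\}}) - f(x_k)| + |f(x_{\max\{k-1,0\}}) - f(x_k)| \bigr)}{(b - a)} \\
&\le \frac{K L \bigl( |x_{\min\{k+1,K\}} - x_k| + |x_{\max\{k-1,0\}} - x_k| \bigr)}{(b - a)} \\
&\le \frac{2 K L (b - a) [\max\{K, 1\}]^{-1}}{(b - a)} \le 2L.
\end{aligned} \tag{3.101}
\]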

This and (3.98) prove item (vii). The proof of Corollary 3.3.5 is thus complete.
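
The parameter bound in item (vii) can be inspected directly on the matrices in (3.97). The following Python sketch is an illustration under stated assumptions: the function names are hypothetical, K is chosen as the ceiling of L(b − a)/ε, and the example target cos on [−1, 2] with L = 1 is arbitrary. It assembles the weights and biases and compares their largest absolute value, cf. (3.98), with the bound max{1, |a|, |b|, 2L, |f(a)|}.

```python
import numpy as np

def network_parameters(f, L, a, b, eps):
    """Assemble weights and biases of the network in (3.97) for a < b."""
    K = int(np.ceil(L * (b - a) / eps))
    xs = np.linspace(a, b, K + 1)       # x_k = a + k (b - a) / max{K, 1}
    fs = f(xs)
    k = np.arange(K + 1)
    c = K * (fs[np.minimum(k + 1, K)] - 2.0 * fs + fs[np.maximum(k - 1, 0)]) / (b - a)
    W1, B1 = np.ones((K + 1, 1)), -xs              # hidden layer
    W2, B2 = c[None, :], np.array([fs[0]])         # output layer
    return W1, B1, W2, B2

def max_norm(params):
    """Largest absolute value among all parameters, cf. (3.98)."""
    return max(float(np.max(np.abs(p))) for p in params)

L, a, b, eps = 1.0, -1.0, 2.0, 0.07    # assumption: f = cos, which is 1-Lipschitz
params = network_parameters(np.cos, L, a, b, eps)
bound = max(1.0, abs(a), abs(b), 2.0 * L, abs(np.cos(a)))
print(max_norm(params), "<=", bound)
```

The bound does not depend on ε: refining the grid adds neurons but does not enlarge the individual parameters.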

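Corollary 3.3.6 (Implicit one-dimensional ANN approximations with prescribed error
tolerances and asymptotic parameter bounds). Let L, a ∈ R, b ∈ [a, ∞) and let f : [a, b] → R
satisfy for all x, y ∈ [a, b] that

|f (x) − f (y)| ≤ L|x − y|. (3.102)

Then there exists C ∈ R such that for all ε ∈ (0, 1] there exists F ∈ N such that

\[
\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}) \in C(\mathbb{R}, \mathbb{R}), \qquad \sup_{x \in [a,b]} |(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x) - f(x)| \le \varepsilon, \qquad \mathcal{H}(\mathbf{F}) = 1, \tag{3.103}
\]
\[
\|\mathcal{T}(\mathbf{F})\|_\infty \le \max\{1, |a|, |b|, 2L, |f(a)|\}, \qquad \text{and} \qquad \mathcal{P}(\mathbf{F}) \le C\varepsilon^{-1} \tag{3.104}
\]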

(cf. Definitions 1.2.4, 1.3.1, 1.3.4, 1.3.5, and 3.3.4).

Proof of Corollary 3.3.6. Throughout this proof, assume without loss of generality that
a < b and let
C = 3L(b − a) + 7. (3.105)
Note that the assumption that a < b shows that L ≥ 0. Furthermore, observe that (3.105)
ensures that for all ε ∈ (0, 1] it holds that

3L(b − a)ε−1 + 7 ≤ 3L(b − a)ε−1 + 7ε−1 = Cε−1 . (3.106)

This and Corollary 3.3.5 imply that for all ε ∈ (0, 1] there exists F ∈ N such that

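\[
\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}) \in C(\mathbb{R}, \mathbb{R}), \qquad \sup_{x \in [a,b]} |(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x) - f(x)| \le \varepsilon, \qquad \mathcal{H}(\mathbf{F}) = 1, \tag{3.107}
\]
\[
\|\mathcal{T}(\mathbf{F})\|_\infty \le \max\{1, |a|, |b|, 2L, |f(a)|\}, \qquad \text{and} \qquad \mathcal{P}(\mathbf{F}) \le 3L(b - a)\varepsilon^{-1} + 7 \le C\varepsilon^{-1} \tag{3.108}
\]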

(cf. Definitions 1.2.4, 1.3.1, 1.3.4, 1.3.5, and 3.3.4). The proof of Corollary 3.3.6 is thus
complete.


Corollary 3.3.7 (Implicit one-dimensional ANN approximations with prescribed error
tolerances and asymptotic parameter bounds). Let L, a ∈ R, b ∈ [a, ∞) and let f : [a, b] → R
satisfy for all x, y ∈ [a, b] that

|f (x) − f (y)| ≤ L|x − y|. (3.109)

Then there exists C ∈ R such that for all ε ∈ (0, 1] there exists F ∈ N such that

\[
\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}) \in C(\mathbb{R}, \mathbb{R}), \qquad \sup_{x \in [a,b]} |(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}))(x) - f(x)| \le \varepsilon, \qquad \text{and} \qquad \mathcal{P}(\mathbf{F}) \le C\varepsilon^{-1} \tag{3.110}
\]

(cf. Definitions 1.2.4, 1.3.1, and 1.3.4).

Proof of Corollary 3.3.7. Note that Corollary 3.3.6 establishes (3.110). The proof of Corol-
lary 3.3.7 is thus complete.
Exercise 3.3.1. Let f : [−2, 3] → R satisfy for all x ∈ [−2, 3] that

\[
f(x) = x^2 + 2\sin(x). \tag{3.111}
\]

Prove or disprove the following statement: There exist c ∈ R and F = (Fε )ε∈(0,1] : (0, 1] → N
such that for all ε ∈ (0, 1] it holds that

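\[
\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}_\varepsilon) \in C(\mathbb{R}, \mathbb{R}), \qquad \sup_{x \in [-2,3]} |(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{F}_\varepsilon))(x) - f(x)| \le \varepsilon, \qquad \text{and} \qquad \mathcal{P}(\mathbf{F}_\varepsilon) \le c\varepsilon^{-1} \tag{3.112}
\]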

(cf. Definitions 1.2.4, 1.3.1, and 1.3.4).


Exercise 3.3.2. Prove or disprove the following statement: There exists Φ ∈ N such that
P(Φ) ≤ 10 and
\[
\sup_{x \in [0,10]} \bigl| \sqrt{x} - (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\Phi))(x) \bigr| \le \tfrac{1}{4} \tag{3.113}
\]

(cf. Definitions 1.2.4, 1.3.1, and 1.3.4).

Chapter 4

Multi-dimensional ANN approximation results

In this chapter we review basic deep ReLU ANN approximation results for possibly multi-
dimensional target functions. We refer to the beginning of Chapter 3 for a small selection
of ANN approximation results from the literature. The specific presentation of this chapter
is strongly based on [25, Sections 2.2.6, 2.2.7, 2.2.8, and 3.1], [226, Sections 3 and 4.2], and
[230, Section 3].

4.1 Approximations through supremal convolutions


Definition 4.1.1 (Metric). We say that δ is a metric on E if and only if it holds that
δ : E × E → [0, ∞) is a function from E × E to [0, ∞) which satisfies that

(i) it holds that

\[
\{(x, y) \in E^2 \colon \delta(x, y) = 0\} = \bigcup_{x \in E} \{(x, x)\} \tag{4.1}
\]

(positive definiteness),

(ii) it holds for all x, y ∈ E that


δ(x, y) = δ(y, x) (4.2)
(symmetry), and

(iii) it holds for all x, y, z ∈ E that

δ(x, z) ≤ δ(x, y) + δ(y, z) (4.3)

(triangle inequality).


Definition 4.1.2 (Metric space). We say that E is a metric space if and only if there exist
a set E and a metric δ on E such that

E = (E, δ) (4.4)

(cf. Definition 4.1.1).


Proposition 4.1.3 (Approximations through supremal convolutions). Let (E, δ) be a
metric space, let L ∈ [0, ∞), D ⊆ E, M ⊆ D satisfy M ≠ ∅, let f : D → R satisfy for
all x ∈ D, y ∈ M that |f (x) − f (y)| ≤ Lδ(x, y), and let F : E → R ∪ {∞} satisfy for all
x ∈ E that
\[
F(x) = \sup_{y \in M} \bigl[ f(y) - L\delta(x, y) \bigr] \tag{4.5}
\]

(cf. Definition 4.1.2). Then


(i) it holds for all x ∈ M that F (x) = f (x),
(ii) it holds for all x ∈ D that F (x) ≤ f (x),
(iii) it holds for all x ∈ E that F (x) < ∞,
(iv) it holds for all x, y ∈ E that |F (x) − F (y)| ≤ Lδ(x, y), and
(v) it holds for all x ∈ D that
\[
|F(x) - f(x)| \le 2L \Bigl[ \inf_{y \in M} \delta(x, y) \Bigr]. \tag{4.6}
\]

Proof of Proposition 4.1.3. First, observe that the assumption that for all x ∈ D, y ∈ M
it holds that |f (x) − f (y)| ≤ Lδ(x, y) ensures that for all x ∈ D, y ∈ M it holds that

f (y) + Lδ(x, y) ≥ f (x) ≥ f (y) − Lδ(x, y). (4.7)

Hence, we obtain that for all x ∈ D it holds that

\[
f(x) \ge \sup_{y \in M} \bigl[ f(y) - L\delta(x, y) \bigr] = F(x). \tag{4.8}
\]

This establishes item (ii). Moreover, note that (4.5) implies that for all x ∈ M it holds that

F (x) ≥ f (x) − Lδ(x, x) = f (x). (4.9)

This and (4.8) establish item (i). Note that (4.7) (applied for every y, z ∈ M with x ↶ y,
y ↶ z in the notation of (4.7)) and the triangle inequality ensure that for all x ∈ E,
y, z ∈ M it holds that

f (y) − Lδ(x, y) ≤ f (z) + Lδ(y, z) − Lδ(x, y) ≤ f (z) + Lδ(x, z). (4.10)


Hence, we obtain that for all x ∈ E, z ∈ M it holds that

\[
F(x) = \sup_{y \in M} \bigl[ f(y) - L\delta(x, y) \bigr] \le f(z) + L\delta(x, z) < \infty. \tag{4.11}
\]

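This and the assumption that M ≠ ∅ prove item (iii). Note that item (iii), (4.5), and the
triangle inequality show that for all x, y ∈ E it holds that
\[
\begin{aligned}
F(x) - F(y) &= \Bigl[ \sup_{v \in M} \bigl( f(v) - L\delta(x, v) \bigr) \Bigr] - \Bigl[ \sup_{w \in M} \bigl( f(w) - L\delta(y, w) \bigr) \Bigr] \\
&= \sup_{v \in M} \Bigl[ f(v) - L\delta(x, v) - \sup_{w \in M} \bigl( f(w) - L\delta(y, w) \bigr) \Bigr] \\
&\le \sup_{v \in M} \bigl[ f(v) - L\delta(x, v) - \bigl( f(v) - L\delta(y, v) \bigr) \bigr] \\
&= \sup_{v \in M} \bigl( L\delta(y, v) - L\delta(x, v) \bigr) \\
&\le \sup_{v \in M} \bigl( L\delta(y, x) + L\delta(x, v) - L\delta(x, v) \bigr) = L\delta(x, y).
\end{aligned} \tag{4.12}
\]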

This and the fact that for all x, y ∈ E it holds that δ(x, y) = δ(y, x) establish item (iv).
Observe that items (i) and (iv), the triangle inequality, and the assumption that ∀ x ∈
D, y ∈ M : |f (x) − f (y)| ≤ Lδ(x, y) ensure that for all x ∈ D it holds that

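\[
\begin{aligned}
|F(x) - f(x)| &= \inf_{y \in M} |F(x) - F(y) + f(y) - f(x)| \\
&\le \inf_{y \in M} \bigl( |F(x) - F(y)| + |f(y) - f(x)| \bigr) \\
&\le \inf_{y \in M} \bigl( 2L\delta(x, y) \bigr) = 2L \Bigl[ \inf_{y \in M} \delta(x, y) \Bigr].
\end{aligned} \tag{4.13}
\]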

This establishes item (v). The proof of Proposition 4.1.3 is thus complete.
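
Items (i), (ii), and (v) of Proposition 4.1.3 can be observed numerically for a finite set M. The following Python sketch is illustrative only: the function name, the metric δ(x, y) = ∥x − y∥_1 on R^2, the random sample points, and the target f(x) = sin(x_1) + cos(x_2) (which is 1-Lipschitz with respect to the 1-norm) are assumptions made for the example.

```python
import numpy as np

def supremal_convolution(points, values, L, x):
    """Evaluate F(x) = max_k [values[k] - L * ||x - points[k]||_1], cf. (4.5)."""
    distances = np.abs(points - x).sum(axis=1)
    return float(np.max(values - L * distances))

def f(x):
    """Example target, 1-Lipschitz with respect to the 1-norm."""
    x = np.asarray(x, dtype=float)
    return np.sin(x[..., 0]) + np.cos(x[..., 1])

rng = np.random.default_rng(0)
M = rng.uniform(0.0, 1.0, size=(50, 2))   # finite, nonempty set M
L = 1.0
x = np.array([0.3, 0.7])
F_x = supremal_convolution(M, f(M), L, x)
nearest = np.abs(M - x).sum(axis=1).min()
print(F_x <= f(x), abs(F_x - f(x)) <= 2.0 * L * nearest)
```

The approximation F only uses the values of f on M, and its quality at x is controlled by the distance from x to the nearest point of M.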

Corollary 4.1.4 (Approximations through supremum convolutions). Let (E, δ) be a metric
space, let L ∈ [0, ∞), M ⊆ E satisfy M ≠ ∅, let f : E → R satisfy for all x ∈ E, y ∈ M
that |f (x) − f (y)| ≤ Lδ(x, y), and let F : E → R ∪ {∞} satisfy for all x ∈ E that

\[
F(x) = \sup_{y \in M} \bigl[ f(y) - L\delta(x, y) \bigr]. \tag{4.14}
\]

Then

(i) it holds for all x ∈ M that F (x) = f (x),

(ii) it holds for all x ∈ E that F (x) ≤ f (x),

(iii) it holds for all x, y ∈ E that |F (x) − F (y)| ≤ Lδ(x, y), and


(iv) it holds for all x ∈ E that
\[
|F(x) - f(x)| \le 2L \Bigl[ \inf_{y \in M} \delta(x, y) \Bigr]. \tag{4.15}
\]

Proof of Corollary 4.1.4. Note that Proposition 4.1.3 establishes items (i), (ii), (iii), and
(iv). The proof of Corollary 4.1.4 is thus complete.

Exercise 4.1.1. Prove or disprove the following statement: There exists Φ ∈ N such that
I(Φ) = 2, O(Φ) = 1, P(Φ) ≤ 3 000 000 000, and

\[
\sup_{x, y \in [0, 2\pi]} \bigl| \sin(x)\sin(y) - (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\Phi))(x, y) \bigr| \le \tfrac{1}{5}. \tag{4.16}
\]

4.2 ANN representations


4.2.1 ANN representations for the 1-norm
Definition 4.2.1 (1-norm ANN representations). We denote by (Ld )d∈N ⊆ N the fully-
connected feedforward ANNs which satisfy that

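(i) it holds that

\[
\mathbb{L}_1 = \left( \left( \begin{pmatrix} 1 \\ -1 \end{pmatrix}, \begin{pmatrix} 0 \\ 0 \end{pmatrix} \right), \left( \begin{pmatrix} 1 & 1 \end{pmatrix}, 0 \right) \right) \in (\mathbb{R}^{2 \times 1} \times \mathbb{R}^2) \times (\mathbb{R}^{1 \times 2} \times \mathbb{R}^1) \tag{4.17}
\]

and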

(ii) it holds for all d ∈ {2, 3, 4, . . . } that Ld = S1,d • Pd (L1 , L1 , . . . , L1 )

(cf. Definitions 1.3.1, 2.1.1, 2.2.1, and 2.4.1).

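Proposition 4.2.2 (Properties of fully-connected feedforward 1-norm ANNs). Let d ∈ N.
Then

(i) it holds that $\mathcal{D}(\mathbb{L}_d) = (d, 2d, 1)$,

(ii) it holds that $\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbb{L}_d) \in C(\mathbb{R}^d, \mathbb{R})$, and

(iii) it holds for all $x \in \mathbb{R}^d$ that $(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbb{L}_d))(x) = \|x\|_1$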

(cf. Definitions 1.2.4, 1.3.1, 1.3.4, 3.3.4, and 4.2.1).


Proof of Proposition 4.2.2. Observe that the fact that D(L1 ) = (1, 2, 1) and Lemma 2.2.2
show that
D(Pd (L1 , L1 , . . . , L1 )) = (d, 2d, d) (4.18)
(cf. Definitions 1.3.1, 2.2.1, and 4.2.1). Combining this, Proposition 2.1.2, and Lemma 2.3.2
ensures that
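\[
\mathcal{D}(\mathbb{L}_d) = \mathcal{D}\bigl( \mathbb{S}_{1,d} \bullet \mathbf{P}_d(\mathbb{L}_1, \mathbb{L}_1, \ldots, \mathbb{L}_1) \bigr) = (d, 2d, 1) \tag{4.19}
\]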
(cf. Definitions 2.1.1 and 2.4.1). This establishes item (i). Note that (4.17) assures that for
all x ∈ R it holds that
\[
(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbb{L}_1))(x) = \mathfrak{r}(x) + \mathfrak{r}(-x) = \max\{x, 0\} + \max\{-x, 0\} = |x| = \|x\|_1 \tag{4.20}
\]
(cf. Definitions 1.2.4, 1.3.4, and 3.3.4). Combining this and Proposition 2.2.3 shows that for
all x = (x1 , . . . , xd ) ∈ Rd it holds that
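\[
\bigl( \mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbf{P}_d(\mathbb{L}_1, \mathbb{L}_1, \ldots, \mathbb{L}_1)) \bigr)(x) = (|x_1|, |x_2|, \ldots, |x_d|). \tag{4.21}
\]

This and Lemma 2.4.2 demonstrate that for all $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$ it holds that
\[
\begin{aligned}
(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbb{L}_d))(x) &= \bigl( \mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbb{S}_{1,d} \bullet \mathbf{P}_d(\mathbb{L}_1, \mathbb{L}_1, \ldots, \mathbb{L}_1)) \bigr)(x) \\
&= \bigl( \mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbb{S}_{1,d}) \bigr)(|x_1|, |x_2|, \ldots, |x_d|) = \sum_{k=1}^{d} |x_k| = \|x\|_1.
\end{aligned} \tag{4.22}
\]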

This establishes items (ii) and (iii). The proof of Proposition 4.2.2 is thus complete.
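
The network from Definition 4.2.1 can be written down explicitly as a pair of weight matrices with zero biases. The following Python sketch is an illustration with hypothetical function names; it assembles the first-layer block-diagonal matrix with blocks (1, −1)ᵀ and the all-ones second-layer row, and evaluates the ReLU realization.

```python
import numpy as np

def one_norm_network(d):
    """Weights and biases of a (d, 2d, 1) ReLU network computing the 1-norm."""
    W1 = np.kron(np.eye(d), np.array([[1.0], [-1.0]]))  # block diagonal, shape (2d, d)
    B1 = np.zeros(2 * d)
    W2 = np.ones((1, 2 * d))                             # sums max{x_k, 0} + max{-x_k, 0}
    B2 = np.zeros(1)
    return [(W1, B1), (W2, B2)]

def realize(network, x):
    """ReLU realization: ReLU after every layer except the last one."""
    x = np.asarray(x, dtype=float)
    for W, B in network[:-1]:
        x = np.maximum(W @ x + B, 0.0)
    W, B = network[-1]
    return W @ x + B

x = np.array([1.5, -2.0, 0.25])
print(realize(one_norm_network(3), x)[0], np.abs(x).sum())
```

Each input coordinate is fed to two hidden neurons, one for its positive part and one for its negative part, which explains the hidden width 2d in item (i).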
Lemma 4.2.3. Let d ∈ N. Then
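(i) it holds that $B_{1,\mathbb{L}_d} = 0 \in \mathbb{R}^{2d}$,
(ii) it holds that $B_{2,\mathbb{L}_d} = 0 \in \mathbb{R}$,
(iii) it holds that $W_{1,\mathbb{L}_d} \in \{-1, 0, 1\}^{(2d) \times d}$,
(iv) it holds for all $x \in \mathbb{R}^d$ that $\|W_{1,\mathbb{L}_d} x\|_\infty = \|x\|_\infty$, and
(v) it holds that $W_{2,\mathbb{L}_d} = \begin{pmatrix} 1 & 1 & \cdots & 1 \end{pmatrix} \in \mathbb{R}^{1 \times (2d)}$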
(cf. Definitions 1.3.1, 3.3.4, and 4.2.1).
Proof of Lemma 4.2.3. Throughout this proof, assume without loss of generality that d > 1.
Note that the fact that B1,L1 = 0 ∈ R2 , the fact that B2,L1 = 0 ∈ R, the fact that B1,S1,d
= 0 ∈ R, and the fact that Ld = S1,d • Pd (L1 , L1 , . . . , L1 ) establish items (i) and (ii) (cf.
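Definitions 1.3.1, 2.1.1, 2.2.1, 2.4.1, and 4.2.1). In addition, observe that the fact that
\[
W_{1,\mathbb{L}_1} = \begin{pmatrix} 1 \\ -1 \end{pmatrix} \qquad \text{and} \qquad W_{1,\mathbb{L}_d} = \begin{pmatrix} W_{1,\mathbb{L}_1} & 0 & \cdots & 0 \\ 0 & W_{1,\mathbb{L}_1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & W_{1,\mathbb{L}_1} \end{pmatrix} \in \mathbb{R}^{(2d) \times d} \tag{4.23}
\]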


proves item (iii). Next note that (4.23) implies item (iv). Moreover, note that the fact that
W2,L1 = (1 1) and the fact that Ld = S1,d • Pd (L1 , L1 , . . . , L1 ) show that

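\[
\begin{aligned}
W_{2,\mathbb{L}_d} &= W_{1,\mathbb{S}_{1,d}} W_{2,\mathbf{P}_d(\mathbb{L}_1, \mathbb{L}_1, \ldots, \mathbb{L}_1)} \\
&= \underbrace{\begin{pmatrix} 1 & 1 & \cdots & 1 \end{pmatrix}}_{\in \mathbb{R}^{1 \times d}} \underbrace{\begin{pmatrix} W_{2,\mathbb{L}_1} & 0 & \cdots & 0 \\ 0 & W_{2,\mathbb{L}_1} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & W_{2,\mathbb{L}_1} \end{pmatrix}}_{\in \mathbb{R}^{d \times (2d)}} \\
&= \begin{pmatrix} 1 & 1 & \cdots & 1 \end{pmatrix} \in \mathbb{R}^{1 \times (2d)}.
\end{aligned} \tag{4.24}
\]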

This establishes item (v). The proof of Lemma 4.2.3 is thus complete.
Exercise 4.2.1. Let d = 9, S = {(1, 3), (3, 5)}, $V = (V_{r,k})_{(r,k) \in S} \in \mathop{\times}_{(r,k) \in S} \mathbb{R}^{d \times d}$ satisfy
$V_{1,3} = V_{3,5} = \operatorname{I}_d$, let Ψ ∈ N satisfy

Ψ = Id • Pd (L1 , L1 , . . . , L1 ) • Id • Pd (L1 , L1 , . . . , L1 ), (4.25)

and let Φ ∈ R satisfy


Φ = (Ψ, (Vr,k )(r,k)∈S ) (4.26)
(cf. Definitions 1.3.1, 1.5.1, 1.5.5, 2.1.1, 2.2.1, 2.2.6, and 4.2.1). For every x ∈ Rd specify

\[
(\mathcal{R}^{\mathbf{R}}_{\mathfrak{r}}(\Phi))(x) \tag{4.27}
\]

explicitly and prove that your result is correct (cf. Definitions 1.2.4 and 1.5.4)!

4.2.2 ANN representations for maxima


Lemma 4.2.4 (Unique existence of fully-connected feedforward maxima ANNs). There
exist unique (ϕd )d∈N ⊆ N which satisfy that

(i) it holds for all d ∈ N that I(ϕd ) = d,

(ii) it holds for all d ∈ N that O(ϕd ) = 1,

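(iii) it holds that $\phi_1 = \mathbf{A}_{1,0} \in \mathbb{R}^{1 \times 1} \times \mathbb{R}^1$,

(iv) it holds that

\[
\phi_2 = \left( \left( \begin{pmatrix} 1 & -1 \\ 0 & 1 \\ 0 & -1 \end{pmatrix}, \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} \right), \left( \begin{pmatrix} 1 & 1 & -1 \end{pmatrix}, 0 \right) \right) \in (\mathbb{R}^{3 \times 2} \times \mathbb{R}^3) \times (\mathbb{R}^{1 \times 3} \times \mathbb{R}^1), \tag{4.28}
\]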



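(v) it holds for all d ∈ {2, 3, 4, . . .} that $\phi_{2d} = \phi_d \bullet \bigl( \mathbf{P}_d(\phi_2, \phi_2, \ldots, \phi_2) \bigr)$, and

(vi) it holds for all d ∈ {2, 3, 4, . . .} that $\phi_{2d-1} = \phi_d \bullet \bigl( \mathbf{P}_d(\phi_2, \phi_2, \ldots, \phi_2, \mathbb{I}_1) \bigr)$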
(cf. Definitions 1.3.1, 2.1.1, 2.2.1, 2.2.6, and 2.3.1).
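Proof of Lemma 4.2.4. Throughout this proof, let ψ ∈ N satisfy
\[
\psi = \left( \left( \begin{pmatrix} 1 & -1 \\ 0 & 1 \\ 0 & -1 \end{pmatrix}, \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} \right), \left( \begin{pmatrix} 1 & 1 & -1 \end{pmatrix}, 0 \right) \right) \in (\mathbb{R}^{3 \times 2} \times \mathbb{R}^3) \times (\mathbb{R}^{1 \times 3} \times \mathbb{R}^1) \tag{4.29}
\]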

(cf. Definition 1.3.1). Observe that (4.29) and Lemma 2.2.7 demonstrate that

I(ψ) = 2, O(ψ) = I(I1 ) = O(I1 ) = 1, and L(ψ) = L(I1 ) = 2. (4.30)

Lemma 2.2.2 and Lemma 2.2.7 therefore prove that for all d ∈ N ∩ (1, ∞) it holds that

I(Pd (ψ, ψ, . . . , ψ)) = 2d, O(Pd (ψ, ψ, . . . , ψ)) = d, (4.31)

I(Pd (ψ, ψ, . . . , ψ, I1 )) = 2d − 1, and O(Pd (ψ, ψ, . . . , ψ, I1 )) = d (4.32)


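(cf. Definitions 2.2.1 and 2.2.6). Combining (4.30), Proposition 2.1.2, and induction hence
shows that there exist unique $\phi_d \in \mathbf{N}$, d ∈ N, which satisfy for all d ∈ N that $\mathcal{I}(\phi_d) = d$,
$\mathcal{O}(\phi_d) = 1$, and

\[
\phi_d = \begin{cases}
\mathbf{A}_{1,0} & \colon d = 1 \\
\psi & \colon d = 2 \\
\phi_{d/2} \bullet \bigl( \mathbf{P}_{d/2}(\psi, \psi, \ldots, \psi) \bigr) & \colon d \in \{4, 6, 8, \ldots\} \\
\phi_{(d+1)/2} \bullet \bigl( \mathbf{P}_{(d+1)/2}(\psi, \psi, \ldots, \psi, \mathbb{I}_1) \bigr) & \colon d \in \{3, 5, 7, \ldots\}.
\end{cases} \tag{4.33}
\]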

The proof of Lemma 4.2.4 is thus complete.


Definition 4.2.5 (Maxima ANN representations). We denote by (Md )d∈N ⊆ N the fully-
connected feedforward ANNs which satisfy that
(i) it holds for all d ∈ N that I(Md ) = d,
(ii) it holds for all d ∈ N that O(Md ) = 1,
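(iii) it holds that $\mathbb{M}_1 = \mathbf{A}_{1,0} \in \mathbb{R}^{1 \times 1} \times \mathbb{R}^1$,
(iv) it holds that
\[
\mathbb{M}_2 = \left( \left( \begin{pmatrix} 1 & -1 \\ 0 & 1 \\ 0 & -1 \end{pmatrix}, \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} \right), \left( \begin{pmatrix} 1 & 1 & -1 \end{pmatrix}, 0 \right) \right) \in (\mathbb{R}^{3 \times 2} \times \mathbb{R}^3) \times (\mathbb{R}^{1 \times 3} \times \mathbb{R}^1), \tag{4.34}
\]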



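(v) it holds for all d ∈ {2, 3, 4, . . .} that $\mathbb{M}_{2d} = \mathbb{M}_d \bullet \bigl( \mathbf{P}_d(\mathbb{M}_2, \mathbb{M}_2, \ldots, \mathbb{M}_2) \bigr)$, and

(vi) it holds for all d ∈ {2, 3, 4, . . .} that $\mathbb{M}_{2d-1} = \mathbb{M}_d \bullet \bigl( \mathbf{P}_d(\mathbb{M}_2, \mathbb{M}_2, \ldots, \mathbb{M}_2, \mathbb{I}_1) \bigr)$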
(cf. Definitions 1.3.1, 2.1.1, 2.2.1, 2.2.6, and 2.3.1 and Lemma 4.2.4).
Definition 4.2.6 (Floor and ceiling of real numbers). We denote by ⌈·⌉ : R → Z and
⌊·⌋ : R → Z the functions which satisfy for all x ∈ R that

⌈x⌉ = min(Z ∩ [x, ∞)) and ⌊x⌋ = max(Z ∩ (−∞, x]). (4.35)

Exercise 4.2.2. Prove or disprove the following statement: For all n ∈ {3, 5, 7, . . . } it holds
that ⌈log2 (n + 1)⌉ = ⌈log2 (n)⌉.
Proposition 4.2.7 (Properties of fully-connected feedforward maxima ANNs). Let d ∈ N.
Then
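(i) it holds that $\mathcal{H}(\mathbb{M}_d) = \lceil \log_2(d) \rceil$,

(ii) it holds for all i ∈ N that $\mathbb{D}_i(\mathbb{M}_d) \le 3 \bigl\lceil \frac{d}{2^i} \bigr\rceil$,

(iii) it holds that $\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbb{M}_d) \in C(\mathbb{R}^d, \mathbb{R})$, and

(iv) it holds for all $x = (x_1, \ldots, x_d) \in \mathbb{R}^d$ that $(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbb{M}_d))(x) = \max\{x_1, x_2, \ldots, x_d\}$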

(cf. Definitions 1.2.4, 1.3.1, 1.3.4, 4.2.5, and 4.2.6).


Proof of Proposition 4.2.7. Throughout this proof, assume without loss of generality that
d > 1. Note that (4.34) ensures that

H(M2 ) = 1 (4.36)

(cf. Definitions 1.3.1 and 4.2.5). This and (2.44) demonstrate that for all d ∈ {2, 3, 4, . . .} it
holds that

H(Pd (M2 , M2 , . . . , M2 )) = H(Pd (M2 , M2 , . . . , M2 , I1 )) = H(M2 ) = 1 (4.37)

(cf. Definitions 2.2.1 and 2.2.6). Combining this with Proposition 2.1.2 establishes that for
all d ∈ {3, 4, 5, . . .} it holds that

H(Md ) = H(M⌈d/2⌉ ) + 1 (4.38)

(cf. Definition 4.2.6). This assures that for all d ∈ {4, 6, 8, . . .} with H(Md/2 ) = ⌈log2 (d/2)⌉ it
holds that

H(Md ) = ⌈log2 (d/2)⌉ + 1 = ⌈log2 (d) − 1⌉ + 1 = ⌈log2 (d)⌉. (4.39)


Furthermore, note that (4.38) and the fact that for all d ∈ {3, 5, 7, . . .} it holds that
⌈log2 (d + 1)⌉ = ⌈log2 (d)⌉ ensure that for all d ∈ {3, 5, 7, . . .} with H(M⌈d/2⌉ ) = ⌈log2 (⌈d/2⌉)⌉
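it holds that
\[
\begin{aligned}
\mathcal{H}(\mathbb{M}_d) &= \bigl\lceil \log_2(\lceil d/2 \rceil) \bigr\rceil + 1 = \bigl\lceil \log_2((d+1)/2) \bigr\rceil + 1 \\
&= \lceil \log_2(d + 1) - 1 \rceil + 1 = \lceil \log_2(d + 1) \rceil = \lceil \log_2(d) \rceil.
\end{aligned} \tag{4.40}
\]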

Combining this and (4.39) demonstrates that for all d ∈ {3, 4, 5, . . .} with ∀ k ∈ {2, 3, . . . ,
d − 1} : H(Mk ) = ⌈log2 (k)⌉ it holds that

H(Md ) = ⌈log2 (d)⌉. (4.41)

The fact that H(M2 ) = 1 and induction hence establish item (i). Observe that the fact that
D(M2 ) = (2, 3, 1) assures that for all i ∈ N it holds that

\[
\mathbb{D}_i(\mathbb{M}_2) \le 3 = 3 \bigl\lceil \tfrac{2}{2^i} \bigr\rceil. \tag{4.42}
\]

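Moreover, note that Proposition 2.1.2 and Lemma 2.2.2 imply that for all d ∈ {2, 3, 4, . . .},
i ∈ N it holds that
\[
\mathbb{D}_i(\mathbb{M}_{2d}) = \begin{cases} 3d & \colon i = 1 \\ \mathbb{D}_{i-1}(\mathbb{M}_d) & \colon i \ge 2 \end{cases} \tag{4.43}
\]
and
\[
\mathbb{D}_i(\mathbb{M}_{2d-1}) = \begin{cases} 3d - 1 & \colon i = 1 \\ \mathbb{D}_{i-1}(\mathbb{M}_d) & \colon i \ge 2. \end{cases} \tag{4.44}
\]
This assures that for all d ∈ {2, 4, 6, . . .} it holds that

\[
\mathbb{D}_1(\mathbb{M}_d) = 3\bigl(\tfrac{d}{2}\bigr) = 3 \bigl\lceil \tfrac{d}{2} \bigr\rceil. \tag{4.45}
\]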

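In addition, observe that (4.44) ensures that for all d ∈ {3, 5, 7, . . . } it holds that

\[
\mathbb{D}_1(\mathbb{M}_d) = 3 \bigl\lceil \tfrac{d}{2} \bigr\rceil - 1 \le 3 \bigl\lceil \tfrac{d}{2} \bigr\rceil. \tag{4.46}
\]

This and (4.45) show that for all d ∈ {2, 3, 4, . . .} it holds that

\[
\mathbb{D}_1(\mathbb{M}_d) \le 3 \bigl\lceil \tfrac{d}{2} \bigr\rceil. \tag{4.47}
\]

Next note that (4.43) demonstrates that for all d ∈ {4, 6, 8, . . .}, i ∈ {2, 3, 4, . . .} with
$\mathbb{D}_{i-1}(\mathbb{M}_{d/2}) \le 3 \bigl\lceil (d/2) \tfrac{1}{2^{i-1}} \bigr\rceil$ it holds that

\[
\mathbb{D}_i(\mathbb{M}_d) = \mathbb{D}_{i-1}(\mathbb{M}_{d/2}) \le 3 \bigl\lceil (d/2) \tfrac{1}{2^{i-1}} \bigr\rceil = 3 \bigl\lceil \tfrac{d}{2^i} \bigr\rceil. \tag{4.48}
\]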


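Furthermore, observe that (4.44) and the fact that for all d ∈ {3, 5, 7, . . .}, i ∈ N it holds
that $\bigl\lceil \tfrac{d+1}{2^i} \bigr\rceil = \bigl\lceil \tfrac{d}{2^i} \bigr\rceil$ assure that for all d ∈ {3, 5, 7, . . .}, i ∈ {2, 3, 4, . . .} with $\mathbb{D}_{i-1}(\mathbb{M}_{\lceil d/2 \rceil}) \le 3 \bigl\lceil \lceil d/2 \rceil \tfrac{1}{2^{i-1}} \bigr\rceil$ it holds that

\[
\mathbb{D}_i(\mathbb{M}_d) = \mathbb{D}_{i-1}(\mathbb{M}_{\lceil d/2 \rceil}) \le 3 \bigl\lceil \lceil d/2 \rceil \tfrac{1}{2^{i-1}} \bigr\rceil = 3 \bigl\lceil \tfrac{d+1}{2^i} \bigr\rceil = 3 \bigl\lceil \tfrac{d}{2^i} \bigr\rceil. \tag{4.49}
\]

This and (4.48) ensure that for all d ∈ {3, 4, 5, . . .}, i ∈ {2, 3, 4, . . .} with $\forall\, k \in \{2, 3, \ldots, d - 1\},\ j \in \{1, 2, \ldots, i - 1\} \colon \mathbb{D}_j(\mathbb{M}_k) \le 3 \bigl\lceil \tfrac{k}{2^j} \bigr\rceil$ it holds that

\[
\mathbb{D}_i(\mathbb{M}_d) \le 3 \bigl\lceil \tfrac{d}{2^i} \bigr\rceil. \tag{4.50}
\]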


 

Combining this, (4.42), and (4.47) with induction establishes item (ii). Note that (4.34)
ensures that for all x = (x1 , x2 ) ∈ R2 it holds that

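\[
\begin{aligned}
(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbb{M}_2))(x) &= \max\{x_1 - x_2, 0\} + \max\{x_2, 0\} - \max\{-x_2, 0\} \\
&= \max\{x_1 - x_2, 0\} + x_2 = \max\{x_1, x_2\}
\end{aligned} \tag{4.51}
\]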

(cf. Definitions 1.2.4 and 1.3.4). Proposition 2.2.3, Proposition 2.1.2, Lemma 2.2.7, and
induction hence imply that for all d ∈ {2, 3, 4, . . .}, x = (x1 , x2 , . . . , xd ) ∈ Rd it holds that

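\[
\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbb{M}_d) \in C(\mathbb{R}^d, \mathbb{R}) \qquad \text{and} \qquad \bigl( \mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\mathbb{M}_d) \bigr)(x) = \max\{x_1, x_2, \ldots, x_d\}. \tag{4.52}
\]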

This establishes items (iii) and (iv). The proof of Proposition 4.2.7 is thus complete.
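
The recursive construction in Definition 4.2.5 can be reproduced with explicit weight matrices. The following Python sketch is an illustration under stated assumptions: networks are represented as lists of (weight, bias) pairs, the helper names are hypothetical, `compose` merges the last affine layer of the inner network with the first affine layer of the outer network, and `parallel` builds block-diagonal layers for networks of equal depth. The loop at the end checks items (i), (ii), and (iv) of Proposition 4.2.7 for small d.

```python
import numpy as np

A1 = np.array([[1.0, -1.0], [0.0, 1.0], [0.0, -1.0]])
A2 = np.array([[1.0, 1.0, -1.0]])
M2 = [(A1, np.zeros(3)), (A2, np.zeros(1))]                      # cf. (4.34)
I1 = [(np.array([[1.0], [-1.0]]), np.zeros(2)),
      (np.array([[1.0, -1.0]]), np.zeros(1))]                    # identity ANN with dimensions (1, 2, 1)

def block_diag(mats):
    """Block-diagonal matrix built from a list of 2-D arrays."""
    out = np.zeros((sum(m.shape[0] for m in mats), sum(m.shape[1] for m in mats)))
    r = c = 0
    for m in mats:
        out[r:r + m.shape[0], c:c + m.shape[1]] = m
        r, c = r + m.shape[0], c + m.shape[1]
    return out

def parallel(nets):
    """Parallelization of networks that all have the same number of layers."""
    return [(block_diag([n[i][0] for n in nets]), np.concatenate([n[i][1] for n in nets]))
            for i in range(len(nets[0]))]

def compose(outer, inner):
    """Composition outer • inner: merge the two adjacent affine layers."""
    W_in, B_in = inner[-1]
    W_out, B_out = outer[0]
    return inner[:-1] + [(W_out @ W_in, W_out @ B_in + B_out)] + outer[1:]

def maxima_network(d):
    """Recursive construction of the maxima network for input dimension d."""
    if d == 1:
        return [(np.eye(1), np.zeros(1))]
    if d == 2:
        return M2
    blocks = [M2] * (d // 2) + ([I1] if d % 2 else [])  # odd d: pass the last entry through I1
    return compose(maxima_network((d + 1) // 2), parallel(blocks))

def realize(net, x):
    """ReLU realization: ReLU after every layer except the last one."""
    x = np.asarray(x, dtype=float)
    for W, B in net[:-1]:
        x = np.maximum(W @ x + B, 0.0)
    W, B = net[-1]
    return W @ x + B

rng = np.random.default_rng(1)
for d in range(1, 10):
    net, x = maxima_network(d), rng.normal(size=d)
    hidden = len(net) - 1                                # equals ceil(log2(d)) = (d - 1).bit_length()
    widths = [net[i][0].shape[0] for i in range(hidden)]
    print(d, hidden, widths, np.isclose(realize(net, x)[0], x.max()))
```

The construction halves the number of candidates in each hidden layer, which is why the depth grows only logarithmically in d.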

Lemma 4.2.8. Let d ∈ N, i ∈ {1, 2, . . . , L(Md )} (cf. Definitions 1.3.1 and 4.2.5). Then

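(i) it holds that $B_{i,\mathbb{M}_d} = 0 \in \mathbb{R}^{\mathbb{D}_i(\mathbb{M}_d)}$,

(ii) it holds that $W_{i,\mathbb{M}_d} \in \{-1, 0, 1\}^{\mathbb{D}_i(\mathbb{M}_d) \times \mathbb{D}_{i-1}(\mathbb{M}_d)}$, and

(iii) it holds for all $x \in \mathbb{R}^d$ that $\|W_{1,\mathbb{M}_d} x\|_\infty \le 2\|x\|_\infty$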

(cf. Definition 3.3.4).

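Proof of Lemma 4.2.8. Throughout this proof, assume without loss of generality that d > 2
(cf. items (iii) and (iv) in Definition 4.2.5) and let $A_1 \in \mathbb{R}^{3 \times 2}$, $A_2 \in \mathbb{R}^{1 \times 3}$, $C_1 \in \mathbb{R}^{2 \times 1}$,
$C_2 \in \mathbb{R}^{1 \times 2}$ satisfy
\[
A_1 = \begin{pmatrix} 1 & -1 \\ 0 & 1 \\ 0 & -1 \end{pmatrix}, \qquad A_2 = \begin{pmatrix} 1 & 1 & -1 \end{pmatrix}, \qquad C_1 = \begin{pmatrix} 1 \\ -1 \end{pmatrix}, \qquad \text{and} \qquad C_2 = \begin{pmatrix} 1 & -1 \end{pmatrix}. \tag{4.53}
\]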


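Note that items (iv), (v), and (vi) in Definition 4.2.5 assure that for all d ∈ {2, 3, 4, . . .} it
holds that
\[
\begin{gathered}
W_{1,\mathbb{M}_{2d-1}} = \underbrace{\begin{pmatrix} A_1 & 0 & \cdots & 0 & 0 \\ 0 & A_1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & A_1 & 0 \\ 0 & 0 & \cdots & 0 & C_1 \end{pmatrix}}_{\in \mathbb{R}^{(3d-1) \times (2d-1)}}, \qquad
W_{1,\mathbb{M}_{2d}} = \underbrace{\begin{pmatrix} A_1 & 0 & \cdots & 0 \\ 0 & A_1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A_1 \end{pmatrix}}_{\in \mathbb{R}^{(3d) \times (2d)}}, \\
B_{1,\mathbb{M}_{2d-1}} = 0 \in \mathbb{R}^{3d-1}, \qquad \text{and} \qquad B_{1,\mathbb{M}_{2d}} = 0 \in \mathbb{R}^{3d}.
\end{gathered} \tag{4.54}
\]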

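This and (4.53) prove item (iii). Furthermore, note that (4.54) and item (iv) in Definition 4.2.5
imply that for all d ∈ {2, 3, 4, . . .} it holds that $B_{1,\mathbb{M}_d} = 0$. Items (iv), (v), and
(vi) in Definition 4.2.5 hence ensure that for all d ∈ {2, 3, 4, . . .} it holds that
\[
\begin{gathered}
W_{2,\mathbb{M}_{2d-1}} = W_{1,\mathbb{M}_d} \underbrace{\begin{pmatrix} A_2 & 0 & \cdots & 0 & 0 \\ 0 & A_2 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & A_2 & 0 \\ 0 & 0 & \cdots & 0 & C_2 \end{pmatrix}}_{\in \mathbb{R}^{d \times (3d-1)}}, \qquad
W_{2,\mathbb{M}_{2d}} = W_{1,\mathbb{M}_d} \underbrace{\begin{pmatrix} A_2 & 0 & \cdots & 0 \\ 0 & A_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A_2 \end{pmatrix}}_{\in \mathbb{R}^{d \times (3d)}}, \\
B_{2,\mathbb{M}_{2d-1}} = B_{1,\mathbb{M}_d} = 0, \qquad \text{and} \qquad B_{2,\mathbb{M}_{2d}} = B_{1,\mathbb{M}_d} = 0.
\end{gathered} \tag{4.55}
\]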

Combining this and item (iv) in Definition 4.2.5 shows that for all d ∈ {2, 3, 4, . . .} it holds
that B2,Md = 0. Moreover, note that (2.2) demonstrates that for all d ∈ {2, 3, 4, . . .},
i ∈ {3, 4, . . . , L(Md ) + 1} it holds that

Wi,M2d−1 = Wi,M2d = Wi−1,Md and Bi,M2d−1 = Bi,M2d = Bi−1,Md . (4.56)

This, (4.53), (4.54), (4.55), the fact that for all d ∈ {2, 3, 4, . . .} it holds that B2,Md = 0, and
induction establish items (i) and (ii). The proof of Lemma 4.2.8 is thus complete.

4.2.3 ANN representations for maximum convolutions


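Lemma 4.2.9. Let d, K ∈ N, L ∈ [0, ∞), $x_1, x_2, \ldots, x_K \in \mathbb{R}^d$, $y = (y_1, y_2, \ldots, y_K) \in \mathbb{R}^K$,
Φ ∈ N satisfy

\[
\Phi = \mathbb{M}_K \bullet \mathbf{A}_{-L \operatorname{I}_K, y} \bullet \mathbf{P}_K\bigl( \mathbb{L}_d \bullet \mathbf{A}_{\operatorname{I}_d, -x_1}, \mathbb{L}_d \bullet \mathbf{A}_{\operatorname{I}_d, -x_2}, \ldots, \mathbb{L}_d \bullet \mathbf{A}_{\operatorname{I}_d, -x_K} \bigr) \bullet \mathbb{T}_{d,K} \tag{4.57}
\]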

(cf. Definitions 1.3.1, 1.5.5, 2.1.1, 2.2.1, 2.3.1, 2.4.6, 4.2.1, and 4.2.5). Then


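(i) it holds that $\mathcal{I}(\Phi) = d$,

(ii) it holds that $\mathcal{O}(\Phi) = 1$,

(iii) it holds that $\mathcal{H}(\Phi) = \lceil \log_2(K) \rceil + 1$,

(iv) it holds that $\mathbb{D}_1(\Phi) = 2dK$,

(v) it holds for all i ∈ {2, 3, 4, . . .} that $\mathbb{D}_i(\Phi) \le 3 \bigl\lceil \frac{K}{2^{i-1}} \bigr\rceil$,

(vi) it holds that $\|\mathcal{T}(\Phi)\|_\infty \le \max\{1, L, \max_{k \in \{1, 2, \ldots, K\}} \|x_k\|_\infty, 2\|y\|_\infty\}$, and

(vii) it holds for all $x \in \mathbb{R}^d$ that $(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\Phi))(x) = \max_{k \in \{1, 2, \ldots, K\}} (y_k - L\|x - x_k\|_1)$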

(cf. Definitions 1.2.4, 1.3.4, 1.3.5, 3.3.4, and 4.2.6).
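
The layered structure behind item (vii) can be sketched in Python. The code below is an illustration under stated assumptions and not the construction of the lemma verbatim: the function names are hypothetical, the first two affine layers follow the block structure used in the proof (K stacked copies of the first-layer matrix of the 1-norm network with biases −W x_k, followed by −L times block-diagonal all-ones rows plus y), and the final maxima network is replaced by a plain `np.max` as a stand-in.

```python
import numpy as np

def max_convolution_layers(points, y, L):
    """First two affine layers for x -> (y_k - L * ||x - x_k||_1)_{k=1..K}."""
    K, d = points.shape
    W_L = np.kron(np.eye(d), np.array([[1.0], [-1.0]]))   # first layer of the 1-norm network, (2d, d)
    W1 = np.tile(W_L, (K, 1))                              # K stacked copies, shape (2dK, d)
    B1 = np.concatenate([-W_L @ p for p in points])        # shifts by -x_k, shape (2dK,)
    W2 = -L * np.kron(np.eye(K), np.ones((1, 2 * d)))      # shape (K, 2dK)
    B2 = np.asarray(y, dtype=float)
    return W1, B1, W2, B2

def evaluate(points, y, L, x):
    """Evaluate max_k (y_k - L * ||x - x_k||_1) through the layered form."""
    W1, B1, W2, B2 = max_convolution_layers(points, y, L)
    hidden = np.maximum(W1 @ x + B1, 0.0)   # positive and negative parts of x - x_k
    scores = W2 @ hidden + B2               # y_k - L * ||x - x_k||_1
    return float(np.max(scores))            # stand-in for the maxima network

rng = np.random.default_rng(0)
points, y, L = rng.normal(size=(5, 3)), rng.normal(size=5), 2.0
x = rng.normal(size=3)
direct = np.max(y - L * np.abs(points - x).sum(axis=1))
print(np.isclose(evaluate(points, y, L, x), direct))
```

The first hidden layer has 2dK neurons, in agreement with item (iv), and all entries of its weight matrix lie in {−1, 0, 1}.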

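Proof of Lemma 4.2.9. Throughout this proof, let Ψk ∈ N, k ∈ {1, 2, . . . , K}, satisfy for
all k ∈ {1, 2, . . . , K} that $\Psi_k = \mathbb{L}_d \bullet \mathbf{A}_{\operatorname{I}_d, -x_k}$, let Ξ ∈ N satisfy

\[
\Xi = \mathbf{A}_{-L \operatorname{I}_K, y} \bullet \mathbf{P}_K\bigl( \Psi_1, \Psi_2, \ldots, \Psi_K \bigr) \bullet \mathbb{T}_{d,K}, \tag{4.58}
\]

and let ~·~ : $\bigcup_{m,n \in \mathbb{N}} \mathbb{R}^{m \times n} \to [0, \infty)$ satisfy for all m, n ∈ N, $M = (M_{i,j})_{i \in \{1, \ldots, m\},\, j \in \{1, \ldots, n\}} \in \mathbb{R}^{m \times n}$ that ~M~ = $\max_{i \in \{1, \ldots, m\},\, j \in \{1, \ldots, n\}} |M_{i,j}|$. Observe that (4.57) and Proposition 2.1.2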
ensure that O(Φ) = O(MK ) = 1 and I(Φ) = I(Td,K ) = d. This proves items (i) and (ii).
Moreover, observe that the fact that for all m, n ∈ N, W ∈ Rm×n , B ∈ Rm it holds that
H(AW,B ) = 0 = H(Td,K ), the fact that H(Ld ) = 1, and Proposition 2.1.2 assure that

H(Ξ) = H(A−L IK ,y ) + H(PK (Ψ1 , Ψ2 , . . . , ΨK )) + H(Td,K ) = H(Ψ1 ) = H(Ld ) = 1. (4.59)

Proposition 2.1.2 and Proposition 4.2.7 hence ensure that

H(Φ) = H(MK • Ξ) = H(MK ) + H(Ξ) = ⌈log2 (K)⌉ + 1 (4.60)

(cf. Definition 4.2.6). This establishes item (iii). Next observe that the fact that H(Ξ) = 1,
Proposition 2.1.2, and Proposition 4.2.7 assure that for all i ∈ {2, 3, 4, . . .} it holds that

\[
\mathbb{D}_i(\Phi) = \mathbb{D}_{i-1}(\mathbb{M}_K) \le 3 \bigl\lceil \tfrac{K}{2^{i-1}} \bigr\rceil. \tag{4.61}
\]

This proves item (v). Furthermore, note that Proposition 2.1.2, Proposition 2.2.4, and
Proposition 4.2.2 assure that
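\[
\mathbb{D}_1(\Phi) = \mathbb{D}_1(\Xi) = \mathbb{D}_1(\mathbf{P}_K(\Psi_1, \Psi_2, \ldots, \Psi_K)) = \sum_{i=1}^{K} \mathbb{D}_1(\Psi_i) = \sum_{i=1}^{K} \mathbb{D}_1(\mathbb{L}_d) = 2dK. \tag{4.62}
\]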


This establishes item (iv). Moreover, observe that (2.2) and Lemma 4.2.8 imply that
\[ \Phi = \bigl( (W_{1,\Xi}, B_{1,\Xi}), (W_{1,M_K} W_{2,\Xi}, W_{1,M_K} B_{2,\Xi}), (W_{2,M_K}, 0), \ldots, (W_{L(M_K),M_K}, 0) \bigr). \tag{4.63} \]
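Next note that the fact that for all k ∈ {1, 2, . . . , K} it holds that $W_{1,\Psi_k} = W_{1,L_d} W_{1,A_{I_d,-x_k}} = W_{1,L_d}$ assures that
\[ W_{1,\Xi} = W_{1,P_K(\Psi_1,\Psi_2,\ldots,\Psi_K)} W_{1,T_{d,K}} = \begin{pmatrix} W_{1,\Psi_1} & 0 & \cdots & 0 \\ 0 & W_{1,\Psi_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & W_{1,\Psi_K} \end{pmatrix} \begin{pmatrix} I_d \\ I_d \\ \vdots \\ I_d \end{pmatrix} = \begin{pmatrix} W_{1,\Psi_1} \\ W_{1,\Psi_2} \\ \vdots \\ W_{1,\Psi_K} \end{pmatrix} = \begin{pmatrix} W_{1,L_d} \\ W_{1,L_d} \\ \vdots \\ W_{1,L_d} \end{pmatrix}. \tag{4.64} \]
Lemma 4.2.3 hence demonstrates that $|||W_{1,\Xi}||| = 1$. In addition, note that (2.2) implies that
\[ B_{1,\Xi} = W_{1,P_K(\Psi_1,\Psi_2,\ldots,\Psi_K)} B_{1,T_{d,K}} + B_{1,P_K(\Psi_1,\Psi_2,\ldots,\Psi_K)} = B_{1,P_K(\Psi_1,\Psi_2,\ldots,\Psi_K)} = \begin{pmatrix} B_{1,\Psi_1} \\ B_{1,\Psi_2} \\ \vdots \\ B_{1,\Psi_K} \end{pmatrix}. \tag{4.65} \]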
Furthermore, observe that Lemma 4.2.3 implies that for all k ∈ {1, 2, . . . , K} it holds that

B1,Ψk = W1,Ld B1,AId ,−xk + B1,Ld = −W1,Ld xk . (4.66)

This, (4.65), and Lemma 4.2.3 show that


\[ \|B_{1,\Xi}\|_\infty = \max_{k\in\{1,2,\ldots,K\}} \|B_{1,\Psi_k}\|_\infty = \max_{k\in\{1,2,\ldots,K\}} \|W_{1,L_d} x_k\|_\infty = \max_{k\in\{1,2,\ldots,K\}} \|x_k\|_\infty \tag{4.67} \]

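(cf. Definition 3.3.4). Combining this, (4.63), Lemma 4.2.8, and the fact that $|||W_{1,\Xi}||| = 1$ shows that
\[ \begin{aligned} \|T(\Phi)\|_\infty &= \max\bigl\{ |||W_{1,\Xi}|||, \|B_{1,\Xi}\|_\infty, |||W_{1,M_K} W_{2,\Xi}|||, \|W_{1,M_K} B_{2,\Xi}\|_\infty, 1 \bigr\} \\ &= \max\bigl\{ 1, \max\nolimits_{k\in\{1,2,\ldots,K\}} \|x_k\|_\infty, |||W_{1,M_K} W_{2,\Xi}|||, \|W_{1,M_K} B_{2,\Xi}\|_\infty \bigr\} \end{aligned} \tag{4.68} \]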
(cf. Definition 1.3.5). Next note that Lemma 4.2.3 ensures that for all k ∈ {1, 2, . . . , K} it
holds that B2,Ψk = B2,Ld = 0. Hence, we obtain that B2,PK (Ψ1 ,Ψ2 ,...,ΨK ) = 0. This implies
that
\[ B_{2,\Xi} = W_{1,A_{-L I_K,y}} B_{2,P_K(\Psi_1,\Psi_2,\ldots,\Psi_K)} + B_{1,A_{-L I_K,y}} = B_{1,A_{-L I_K,y}} = y. \tag{4.69} \]


In addition, observe that the fact that for all k ∈ {1, 2, . . . , K} it holds that W2,Ψk = W2,Ld
assures that
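\[ \begin{aligned} W_{2,\Xi} &= W_{1,A_{-L I_K,y}} W_{2,P_K(\Psi_1,\Psi_2,\ldots,\Psi_K)} = -L\, W_{2,P_K(\Psi_1,\Psi_2,\ldots,\Psi_K)} \\ &= -L \begin{pmatrix} W_{2,\Psi_1} & 0 & \cdots & 0 \\ 0 & W_{2,\Psi_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & W_{2,\Psi_K} \end{pmatrix} = \begin{pmatrix} -L W_{2,L_d} & 0 & \cdots & 0 \\ 0 & -L W_{2,L_d} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & -L W_{2,L_d} \end{pmatrix}. \end{aligned} \tag{4.70} \]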

Item (v) in Lemma 4.2.3 and Lemma 4.2.8 hence imply that

\[ |||W_{1,M_K} W_{2,\Xi}||| = L\, |||W_{1,M_K}||| \le L. \tag{4.71} \]

Moreover, observe that (4.69) and Lemma 4.2.8 assure that

∥W1,MK B2,Ξ ∥∞ ≤ 2∥B2,Ξ ∥∞ = 2∥y∥∞ . (4.72)

Combining this with (4.68) and (4.71) establishes item (vi). Next observe that Proposi-
tion 4.2.2 and Lemma 2.3.3 show that for all x ∈ Rd , k ∈ {1, 2, . . . , K} it holds that

\[ (R^N_r(\Psi_k))(x) = \bigl( [R^N_r(L_d)] \circ [R^N_r(A_{I_d,-x_k})] \bigr)(x) = \|x - x_k\|_1. \tag{4.73} \]

This, Proposition 2.2.3, and Proposition 2.1.2 imply that for all x ∈ Rd it holds that

\[ \bigl( R^N_r(P_K(\Psi_1, \Psi_2, \ldots, \Psi_K) \bullet T_{d,K}) \bigr)(x) = \bigl( \|x - x_1\|_1, \|x - x_2\|_1, \ldots, \|x - x_K\|_1 \bigr) \tag{4.74} \]

(cf. Definitions 1.2.4 and 1.3.4). Combining this and Lemma 2.3.3 establishes that for all
x ∈ Rd it holds that

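\[ \begin{aligned} (R^N_r(\Xi))(x) &= \bigl( [R^N_r(A_{-L I_K,y})] \circ [R^N_r(P_K(\Psi_1, \Psi_2, \ldots, \Psi_K) \bullet T_{d,K})] \bigr)(x) \\ &= \bigl( y_1 - L\|x - x_1\|_1, y_2 - L\|x - x_2\|_1, \ldots, y_K - L\|x - x_K\|_1 \bigr). \end{aligned} \tag{4.75} \]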

Proposition 2.1.2 and Proposition 4.2.7 hence demonstrate that for all x ∈ Rd it holds that

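\[ \begin{aligned} (R^N_r(\Phi))(x) &= \bigl( [R^N_r(M_K)] \circ [R^N_r(\Xi)] \bigr)(x) \\ &= (R^N_r(M_K))\bigl( y_1 - L\|x - x_1\|_1, y_2 - L\|x - x_2\|_1, \ldots, y_K - L\|x - x_K\|_1 \bigr) \\ &= \max\nolimits_{k\in\{1,2,\ldots,K\}} (y_k - L\|x - x_k\|_1). \end{aligned} \tag{4.76} \]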

This establishes item (vii). The proof of Lemma 4.2.9 is thus complete.
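
The realization formula in item (vii) can be checked on a toy example. The following is a minimal sketch, not the construction from the proof: it evaluates max_k (y_k − L∥x − x_k∥_1) directly and only mimics the ReLU structure through the identity |t| = max{t, 0} + max{−t, 0}. The helper names (`l1_distance_via_relu`, `max_convolution`) and the sample data are hypothetical.

```python
def relu(t):
    """Rectifier function t -> max{t, 0}."""
    return max(t, 0.0)


def l1_distance_via_relu(x, center):
    """Compute ||x - center||_1 using |t| = relu(t) + relu(-t) per coordinate."""
    return sum(relu(xi - ci) + relu(-(xi - ci)) for xi, ci in zip(x, center))


def max_convolution(x, centers, values, lipschitz):
    """Evaluate max_k (y_k - L * ||x - x_k||_1), the function in item (vii)."""
    return max(
        y - lipschitz * l1_distance_via_relu(x, c)
        for c, y in zip(centers, values)
    )


# Hypothetical data: K = 2 points in dimension d = 2 with L = 1.
centers = [(0.0, 0.0), (1.0, 1.0)]
values = [0.0, 2.0]
value_between = max_convolution((1.0, 0.0), centers, values, 1.0)
value_at_center = max_convolution((1.0, 1.0), centers, values, 1.0)
```

For these inputs the first evaluation compares 0 − 1 with 2 − 1, and the second compares 0 − 2 with 2 − 0.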


4.3 ANN approximations results for multi-dimensional functions
4.3.1 Constructive ANN approximation results
Proposition 4.3.1. Let d, K ∈ N, L ∈ [0, ∞), let E ⊆ Rd be a set, let x1 , x2 , . . . , xK ∈ E,
let f : E → R satisfy for all x, y ∈ E that |f (x) − f (y)| ≤ L∥x − y∥1 , and let y ∈ RK ,
Φ ∈ N satisfy y = (f (x1 ), f (x2 ), . . . , f (xK )) and
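\[ \Phi = M_K \bullet A_{-L I_K, y} \bullet P_K\bigl(L_d \bullet A_{I_d,-x_1}, L_d \bullet A_{I_d,-x_2}, \ldots, L_d \bullet A_{I_d,-x_K}\bigr) \bullet T_{d,K} \tag{4.77} \]
(cf. Definitions 1.3.1, 1.5.5, 2.1.1, 2.2.1, 2.3.1, 2.4.6, 3.3.4, 4.2.1, and 4.2.5). Then
\[ \sup\nolimits_{x\in E} |(R^N_r(\Phi))(x) - f(x)| \le 2L \bigl[ \sup\nolimits_{x\in E} \bigl( \min\nolimits_{k\in\{1,2,\ldots,K\}} \|x - x_k\|_1 \bigr) \bigr] \tag{4.78} \]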

(cf. Definitions 1.2.4 and 1.3.4).


Proof of Proposition 4.3.1. Throughout this proof, let F : Rd → R satisfy for all x ∈ Rd
that
F (x) = maxk∈{1,2,...,K} (f (xk ) − L∥x − xk ∥1 ). (4.79)
Observe that Corollary 4.1.4, (4.79), and the assumption that for all x, y ∈ E it holds that
|f (x) − f (y)| ≤ L∥x − y∥1 assure that
\[ \sup\nolimits_{x\in E} |F(x) - f(x)| \le 2L \bigl[ \sup\nolimits_{x\in E} \bigl( \min\nolimits_{k\in\{1,2,\ldots,K\}} \|x - x_k\|_1 \bigr) \bigr]. \tag{4.80} \]
Moreover, note that Lemma 4.2.9 ensures that for all x ∈ E it holds that $F(x) = (R^N_r(\Phi))(x)$.
Combining this and (4.80) establishes (4.78). The proof of Proposition 4.3.1 is thus
complete.
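
The estimate (4.78) can be illustrated numerically. The sketch below takes E to be a finite grid in [0, 1]^2, picks a hypothetical function f that is 1-Lipschitz with respect to the 1-norm, and compares the error of F(x) = max_k (f(x_k) − L∥x − x_k∥_1) with the bound 2L sup_x min_k ∥x − x_k∥_1. All concrete choices (f, the grid, the interpolation points) are assumptions made for illustration only.

```python
def l1(x, y):
    """1-norm distance between two points given as tuples."""
    return sum(abs(s - t) for s, t in zip(x, y))


def f(x):
    """Hypothetical target; |f(x) - f(y)| <= ||x - y||_1, so L = 1 works."""
    return abs(x[0] - 0.3) + 0.5 * abs(x[1])


L = 1.0
# Interpolation points x_1, ..., x_K: a coarse 5 x 5 grid in [0, 1]^2.
centers = [(i / 4, j / 4) for i in range(5) for j in range(5)]
values = [f(c) for c in centers]


def F(x):
    """Maximum convolution max_k (f(x_k) - L * ||x - x_k||_1), cf. (4.79)."""
    return max(y - L * l1(x, c) for c, y in zip(centers, values))


# The set E: a finer 21 x 21 grid that contains all interpolation points.
E = [(i / 20, j / 20) for i in range(21) for j in range(21)]
error = max(abs(F(x) - f(x)) for x in E)
fill_distance = max(min(l1(x, c) for c in centers) for x in E)
bound = 2.0 * L * fill_distance
```

In this sketch `error` plays the role of the left-hand side of (4.78) and `bound` the role of the right-hand side.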
Exercise 4.3.1. Prove or disprove the following statement: There exists Φ ∈ N such that
I(Φ) = 2, O(Φ) = 1, P(Φ) < 20, and
\[ \sup\nolimits_{v=(x,y)\in[0,2]^2} \bigl| x^2 + y^2 - 2x - 2y + 2 - (R^N_r(\Phi))(v) \bigr| \le \tfrac{3}{8}. \tag{4.81} \]

4.3.2 Covering number estimates


Definition 4.3.2 (Covering numbers). Let (E, δ) be a metric space and let r ∈ [0, ∞].
Then we denote by C (E,δ),r ∈ N0 ∪ {∞} (we denote by C E,r ∈ N0 ∪ {∞}) the extended real
number given by
    
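\[ C^{(E,\delta),r} = \min\Bigl( \Bigl\{ n \in \mathbb{N}_0 : \Bigl[ \exists\, A \subseteq E : \bigl( (|A| \le n) \wedge (\forall\, x \in E : \exists\, a \in A : \delta(a,x) \le r) \bigr) \Bigr] \Bigr\} \cup \{\infty\} \Bigr) \tag{4.82} \]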
and we call C (E,δ),r the r-covering number of (E, δ) (we call C E,r the r-covering number of
E).
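
For a finite metric space the covering number in (4.82) can be computed by exhaustive search. The following is a minimal sketch under that finiteness assumption; the function name `covering_number` and the example space are hypothetical, and the search is exponential in the number of points, so it is meant only for small examples.

```python
from itertools import combinations


def covering_number(points, dist, r):
    """Smallest n such that some A with |A| <= n covers all points at radius r."""
    if not points:
        return 0  # the empty set A covers the empty space
    for n in range(1, len(points) + 1):
        for subset in combinations(points, n):
            if all(any(dist(a, x) <= r for a in subset) for x in points):
                return n
    return len(points)  # not reached: A = E always covers E


def absolute_distance(a, b):
    return abs(a - b)


# Hypothetical example: five equally spaced points on the real line.
line = [0.0, 1.0, 2.0, 3.0, 4.0]
covering_r1 = covering_number(line, absolute_distance, 1.0)
covering_r0 = covering_number(line, absolute_distance, 0.0)
covering_r2 = covering_number(line, absolute_distance, 2.0)
```

With radius 1 the two centers 1 and 3 suffice, with radius 0 every point has to be its own center, and with radius 2 the single center 2 covers everything.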


Lemma 4.3.3. Let (E, δ) be a metric space and let r ∈ [0, ∞]. Then



 0   :X=∅


inf n ∈ N : ∃ x1 , x2 , . . . , xn ∈ E :

C (E,δ),r =

  n   : X ̸= ∅
 S
E⊆ {v ∈ E : d(xm , v) ≤ r} ∪ {∞}



m=1
(4.83)
(cf. Definition 4.3.2).

Proof of Lemma 4.3.3. Throughout this proof, assume without loss of generality that E ̸=
∅. Observe that Lemma 12.2.4 establishes (4.83). The proof of Lemma 4.3.3 is thus
complete.
Exercise 4.3.2. Prove or disprove the following statement: For every metric space (X, d),
every Y ⊆ X, and every r ∈ [0, ∞] it holds that C (Y,d|Y ×Y ),r ≤ C (X,d),r .
Exercise 4.3.3. Prove or disprove the following statement: For every metric space (E, δ) it
holds that C (E,δ),∞ = 1.
Exercise 4.3.4. Prove or disprove the following statement: For every metric space (E, δ)
and every r ∈ [0, ∞) with C (E,δ),r < ∞ it holds that E is bounded. (Note: A metric space
(E, δ) is bounded if and only if there exists r ∈ [0, ∞) such that it holds for all x, y ∈ E
that δ(x, y) ≤ r.)
Exercise 4.3.5. Prove or disprove the following statement: For every bounded metric space
(E, δ) and every r ∈ [0, ∞] it holds that C (E,δ),r < ∞.

Lemma 4.3.4. Let d ∈ N, a ∈ R, b ∈ (a, ∞), r ∈ (0, ∞) and for every p ∈ [1, ∞) let
δp : ([a, b]d ) × ([a, b]d ) → [0, ∞) satisfy for all x, y ∈ [a, b]d that δp (x, y) = ∥x − y∥p (cf.
Definition 3.3.4). Then it holds for all p ∈ [1, ∞) that
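\[ C^{([a,b]^d,\delta_p),r} \le \Bigl\lceil \tfrac{d^{1/p}(b-a)}{2r} \Bigr\rceil^{d} \le \begin{cases} 1 & : r \ge d(b-a)/2 \\ \bigl( \tfrac{d(b-a)}{r} \bigr)^{d} & : r < d(b-a)/2 \end{cases} \tag{4.84} \]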

(cf. Definitions 4.2.6 and 4.3.2).

Proof of Lemma 4.3.4. Throughout this proof, let (Np )p∈[1,∞) ⊆ N satisfy for all p ∈ [1, ∞) that
\[ N_p = \Bigl\lceil \tfrac{d^{1/p}(b-a)}{2r} \Bigr\rceil, \tag{4.85} \]

for every N ∈ N, i ∈ {1, 2, . . . , N } let gN,i ∈ [a, b] be given by

gN,i = a + (i−1/2)(b−a)/N (4.86)


and for every p ∈ [1, ∞) let Ap ⊆ [a, b]d be given by

Ap = {gNp ,1 , gNp ,2 , . . . , gNp ,Np }d (4.87)

(cf. Definition 4.2.6). Observe that it holds for all N ∈ N, i ∈ {1, 2, . . . , N }, $x \in [a + (i-1)(b-a)/N, g_{N,i}]$ that
\[ |x - g_{N,i}| = a + \tfrac{(i-1/2)(b-a)}{N} - x \le a + \tfrac{(i-1/2)(b-a)}{N} - \bigl( a + \tfrac{(i-1)(b-a)}{N} \bigr) = \tfrac{b-a}{2N}. \tag{4.88} \]

In addition, note that it holds for all N ∈ N, i ∈ {1, 2, . . . , N }, x ∈ [gN,i , a + i(b−a)/N ] that

\[ |x - g_{N,i}| = x - \bigl( a + \tfrac{(i-1/2)(b-a)}{N} \bigr) \le a + \tfrac{i(b-a)}{N} - \bigl( a + \tfrac{(i-1/2)(b-a)}{N} \bigr) = \tfrac{b-a}{2N}. \tag{4.89} \]

Combining this with (4.88) implies for all N ∈ N, i ∈ {1, 2, . . . , N }, $x \in [a + (i-1)(b-a)/N, a + i(b-a)/N]$ that $|x - g_{N,i}| \le (b-a)/(2N)$. This proves that for every N ∈ N, x ∈ [a, b] there exists
y ∈ {gN,1 , gN,2 , . . . , gN,N } such that
\[ |x - y| \le \tfrac{b-a}{2N}. \tag{4.90} \]

This establishes that for every p ∈ [1, ∞), x = (x1 , x2 , . . . , xd ) ∈ [a, b]d there exists
y = (y1 , y2 , . . . , yd ) ∈ Ap such that
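\[ \delta_p(x,y) = \|x - y\|_p = \Bigl[ \sum_{i=1}^{d} |x_i - y_i|^p \Bigr]^{1/p} \le \Bigl[ \sum_{i=1}^{d} \tfrac{(b-a)^p}{(2N_p)^p} \Bigr]^{1/p} = \tfrac{d^{1/p}(b-a)}{2N_p} \le \tfrac{d^{1/p}(b-a)\,2r}{2 d^{1/p}(b-a)} = r. \tag{4.91} \]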

Combining this with (4.82), (4.87), (4.85), and the fact that ∀ x ∈ [0, ∞) : ⌈x⌉ ≤ 1(0,1] (x) +
2x1(1,∞) (x) = 1(0,r] (rx) + 2x1(r,∞) (rx) yields that for all p ∈ [1, ∞) it holds that

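\[ \begin{aligned} C^{([a,b]^d,\delta_p),r} &\le |A_p| = (N_p)^d = \Bigl\lceil \tfrac{d^{1/p}(b-a)}{2r} \Bigr\rceil^{d} \le \Bigl\lceil \tfrac{d(b-a)}{2r} \Bigr\rceil^{d} \\ &\le \Bigl( 1_{(0,r]}\bigl( \tfrac{d(b-a)}{2} \bigr) + \tfrac{2d(b-a)}{2r}\, 1_{(r,\infty)}\bigl( \tfrac{d(b-a)}{2} \bigr) \Bigr)^{d} \\ &= 1_{(0,r]}\bigl( \tfrac{d(b-a)}{2} \bigr) + \bigl( \tfrac{d(b-a)}{r} \bigr)^{d}\, 1_{(r,\infty)}\bigl( \tfrac{d(b-a)}{2} \bigr) \end{aligned} \tag{4.92} \]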

(cf. Definition 4.3.2). The proof of Lemma 4.3.4 is thus complete.
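
The grid used in this proof is easy to build explicitly. The following sketch constructs N_p from (4.85), the midpoints g_{N,i} from (4.86), and the product set A_p from (4.87), and then measures how far a finite sample of [a, b]^d is from A_p. The function names and the particular parameter values (a = 0, b = 1, d = 2, r = 1/4, p = 1) are illustrative assumptions.

```python
import itertools
import math


def grid_centers(a, b, d, r, p):
    """Return (N_p, A_p) with N_p as in (4.85) and A_p as in (4.87)."""
    n = math.ceil(d ** (1.0 / p) * (b - a) / (2.0 * r))
    axis = [a + (i - 0.5) * (b - a) / n for i in range(1, n + 1)]  # g_{N,i}
    return n, list(itertools.product(axis, repeat=d))


def p_distance(x, y, p):
    """Distance induced by the p-norm for finite p >= 1."""
    return sum(abs(s - t) ** p for s, t in zip(x, y)) ** (1.0 / p)


a, b, d, r, p = 0.0, 1.0, 2, 0.25, 1
n_p, centers = grid_centers(a, b, d, r, p)

# Finite sample of [a, b]^d on which the covering property is tested.
sample = [(i / 20, j / 20) for i in range(21) for j in range(21)]
worst = max(min(p_distance(x, c, p) for c in centers) for x in sample)
upper_bound = (d * (b - a) / r) ** d  # second case of (4.84), since r < d(b-a)/2
```

Here `len(centers)` equals (N_p)^d, which bounds the covering number from above, and `worst` should not exceed r on the sample.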

4.3.3 Convergence rates for the approximation error


Lemma 4.3.5. Let d ∈ N, L, a ∈ R, b ∈ (a, ∞), let f : [a, b]d → R satisfy for all x, y ∈ [a, b]d
that |f (x) − f (y)| ≤ L∥x − y∥1 , and let F = A0,f ((a+b)/2,(a+b)/2,...,(a+b)/2) ∈ R1×d × R1 (cf.
Definitions 2.3.1 and 3.3.4). Then

(i) it holds that I(F) = d,


(ii) it holds that O(F) = 1,

(iii) it holds that H(F) = 0,

(iv) it holds that P(F) = d + 1,

(v) it holds that ∥T (F)∥∞ ≤ supx∈[a,b]d |f (x)|, and


(vi) it holds that $\sup_{x\in[a,b]^d} |(R^N_r(F))(x) - f(x)| \le \tfrac{dL(b-a)}{2}$

(cf. Definitions 1.2.4, 1.3.1, 1.3.4, and 1.3.5).


Proof of Lemma 4.3.5. Note that the assumption that for all x, y ∈ [a, b]d it holds that
|f (x) − f (y)| ≤ L∥x − y∥1 assures that L ≥ 0. Next observe that Lemma 2.3.2 assures that
for all x ∈ Rd it holds that

\[ (R^N_r(F))(x) = f\bigl( (a+b)/2, (a+b)/2, \ldots, (a+b)/2 \bigr). \tag{4.93} \]

The fact that for all x ∈ [a, b] it holds that |x − (a+b)/2| ≤ (b−a)/2 and the assumption that
for all x, y ∈ [a, b]d it holds that |f (x) − f (y)| ≤ L∥x − y∥1 hence ensure that for all
x = (x1 , x2 , . . . , xd ) ∈ [a, b]d it holds that

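\[ \begin{aligned} |(R^N_r(F))(x) - f(x)| &= \bigl| f\bigl( (a+b)/2, (a+b)/2, \ldots, (a+b)/2 \bigr) - f(x) \bigr| \\ &\le L \bigl\| \bigl( (a+b)/2, (a+b)/2, \ldots, (a+b)/2 \bigr) - x \bigr\|_1 \\ &= L \sum_{i=1}^{d} |(a+b)/2 - x_i| \le \sum_{i=1}^{d} \tfrac{L(b-a)}{2} = \tfrac{dL(b-a)}{2}. \end{aligned} \tag{4.94} \]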

This and the fact that ∥T (F)∥∞ = |f ((a+b)/2, (a+b)/2, . . . , (a+b)/2)| ≤ supx∈[a,b]d |f (x)| complete
the proof of Lemma 4.3.5.
Proposition 4.3.6. Let d ∈ N, L, a ∈ R, b ∈ (a, ∞), r ∈ (0, d/4), let f : [a, b]d → R and
δ : [a, b]d × [a, b]d → R satisfy for all x, y ∈ [a, b]d that |f (x) − f (y)| ≤ L∥x − y∥1 and
δ(x, y) = ∥x − y∥1 , and let K ∈ N, x1 , x2 , . . . , xK ∈ [a, b]d , y ∈ RK , F ∈ N satisfy $K = C^{([a,b]^d,\delta),(b-a)r}$, $\sup_{x\in[a,b]^d} \bigl[ \min_{k\in\{1,2,\ldots,K\}} \delta(x, x_k) \bigr] \le (b-a)r$, y = (f (x1 ), f (x2 ), . . . , f (xK )),
and
\[ F = M_K \bullet A_{-L I_K, y} \bullet P_K\bigl(L_d \bullet A_{I_d,-x_1}, L_d \bullet A_{I_d,-x_2}, \ldots, L_d \bullet A_{I_d,-x_K}\bigr) \bullet T_{d,K} \tag{4.95} \]

(cf. Definitions 1.3.1, 1.5.5, 2.1.1, 2.2.1, 2.3.1, 2.4.6, 3.3.4, 4.2.1, 4.2.5, and 4.3.2). Then
(i) it holds that I(F) = d,

(ii) it holds that O(F) = 1,


(iii) it holds that $H(F) \le \bigl\lceil d \log_2\bigl( \tfrac{3d}{4r} \bigr) \bigr\rceil + 1$,

(iv) it holds that $D_1(F) \le 2d \bigl( \tfrac{3d}{4r} \bigr)^{d}$,

(v) it holds for all i ∈ {2, 3, 4, . . .} that $D_i(F) \le 3 \bigl\lceil \bigl( \tfrac{3d}{4r} \bigr)^{d} \tfrac{1}{2^{i-1}} \bigr\rceil$,

(vi) it holds that $P(F) \le 35 \bigl( \tfrac{3d}{4r} \bigr)^{2d} d^2$,

(vii) it holds that ∥T (F)∥∞ ≤ max{1, L, |a|, |b|, 2[supx∈[a,b]d |f (x)|]}, and

(viii) it holds that $\sup_{x\in[a,b]^d} |(R^N_r(F))(x) - f(x)| \le 2L(b-a)r$

(cf. Definitions 1.2.4, 1.3.4, 1.3.5, and 4.2.6).

Proof of Proposition 4.3.6. Note that the assumption that for all x, y ∈ [a, b]d it holds that
|f (x) − f (y)| ≤ L∥x − y∥1 assures that L ≥ 0. Next observe that (4.95), Lemma 4.2.9, and
Proposition 4.3.1 demonstrate that

(I) it holds that I(F) = d,

(II) it holds that O(F) = 1,

(III) it holds that H(F) = ⌈log2 (K)⌉ + 1,

(IV) it holds that D1 (F) = 2dK,

(V) it holds for all i ∈ {2, 3, 4, . . .} that $D_i(F) \le 3 \lceil K/2^{i-1} \rceil$,

(VI) it holds that ∥T (F)∥∞ ≤ max{1, L, maxk∈{1,2,...,K} ∥xk ∥∞ , 2[maxk∈{1,2,...,K} |f (xk )|]},
and

(VII) it holds that $\sup_{x\in[a,b]^d} |(R^N_r(F))(x) - f(x)| \le 2L \bigl[ \sup_{x\in[a,b]^d} \bigl( \min_{k\in\{1,2,\ldots,K\}} \delta(x, x_k) \bigr) \bigr]$

(cf. Definitions 1.2.4, 1.3.4, 1.3.5, and 4.2.6). Note that items (I) and (II) establish items (i)
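and (ii). Next observe that Lemma 4.3.4 and the fact that $\tfrac{d}{2r} \ge 2$ imply that
\[ K = C^{([a,b]^d,\delta),(b-a)r} \le \Bigl\lceil \tfrac{d(b-a)}{2(b-a)r} \Bigr\rceil^{d} = \Bigl\lceil \tfrac{d}{2r} \Bigr\rceil^{d} \le \Bigl( \tfrac{3}{2} \bigl( \tfrac{d}{2r} \bigr) \Bigr)^{d} = \bigl( \tfrac{3d}{4r} \bigr)^{d}. \tag{4.96} \]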

Combining this with item (III) assures that


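\[ H(F) = \lceil \log_2(K) \rceil + 1 \le \Bigl\lceil \log_2\Bigl( \bigl( \tfrac{3d}{4r} \bigr)^{d} \Bigr) \Bigr\rceil + 1 = \bigl\lceil d \log_2\bigl( \tfrac{3d}{4r} \bigr) \bigr\rceil + 1. \tag{4.97} \]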

This establishes item (iii). Moreover, note that (4.96) and item (IV) imply that

\[ D_1(F) = 2dK \le 2d \bigl( \tfrac{3d}{4r} \bigr)^{d}. \tag{4.98} \]

This establishes item (iv). In addition, observe that item (V) and (4.96) establish item (v).
Next note that item (III) ensures that for all i ∈ N ∩ (1, H(F)] it holds that
\[ \tfrac{K}{2^{i-1}} \ge \tfrac{K}{2^{H(F)-1}} = \tfrac{K}{2^{\lceil \log_2(K) \rceil}} \ge \tfrac{K}{2^{\log_2(K)+1}} = \tfrac{K}{2K} = \tfrac{1}{2}. \tag{4.99} \]

Item (V) and (4.96) hence show that for all i ∈ N ∩ (1, H(F)] it holds that

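\[ D_i(F) \le 3 \bigl\lceil \tfrac{K}{2^{i-1}} \bigr\rceil \le \tfrac{3K}{2^{i-2}} \le \bigl( \tfrac{3d}{4r} \bigr)^{d} \tfrac{3}{2^{i-2}}. \tag{4.100} \]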

Furthermore, note that the fact that for all x ∈ [a, b]d it holds that ∥x∥∞ ≤ max{|a|, |b|}
and item (VI) imply that

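\[ \begin{aligned} \|T(F)\|_\infty &\le \max\bigl\{ 1, L, \max\nolimits_{k\in\{1,2,\ldots,K\}} \|x_k\|_\infty, 2[\max\nolimits_{k\in\{1,2,\ldots,K\}} |f(x_k)|] \bigr\} \\ &\le \max\bigl\{ 1, L, |a|, |b|, 2[\sup\nolimits_{x\in[a,b]^d} |f(x)|] \bigr\}. \end{aligned} \tag{4.101} \]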

This establishes item (vii). Moreover, observe that the assumption that

\[ \sup\nolimits_{x\in[a,b]^d} \bigl[ \min\nolimits_{k\in\{1,2,\ldots,K\}} \delta(x, x_k) \bigr] \le (b-a)r \tag{4.102} \]

and item (VII) demonstrate that

\[ \sup\nolimits_{x\in[a,b]^d} |(R^N_r(F))(x) - f(x)| \le 2L \bigl[ \sup\nolimits_{x\in[a,b]^d} \bigl( \min\nolimits_{k\in\{1,2,\ldots,K\}} \delta(x, x_k) \bigr) \bigr] \le 2L(b-a)r. \tag{4.103} \]

This establishes item (viii). It thus remains to prove item (vi). For this note that items (I)
and (II), (4.98), and (4.100) assure that

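\[ \begin{aligned} P(F) &= \sum_{i=1}^{L(F)} D_i(F)(D_{i-1}(F) + 1) \\ &\le 2d \bigl( \tfrac{3d}{4r} \bigr)^{d} (d+1) + \bigl( \tfrac{3d}{4r} \bigr)^{d} 3 \Bigl( 2d \bigl( \tfrac{3d}{4r} \bigr)^{d} + 1 \Bigr) \\ &\quad + \Bigl[ \sum_{i=3}^{L(F)-1} \bigl( \tfrac{3d}{4r} \bigr)^{d} \tfrac{3}{2^{i-2}} \Bigl( \bigl( \tfrac{3d}{4r} \bigr)^{d} \tfrac{3}{2^{i-3}} + 1 \Bigr) \Bigr] + \bigl( \tfrac{3d}{4r} \bigr)^{d} \tfrac{3}{2^{L(F)-3}} + 1. \end{aligned} \tag{4.104} \]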
i=3

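Next note that the fact that $\tfrac{3d}{4r} \ge 3$ ensures that
\[ \begin{aligned} & 2d \bigl( \tfrac{3d}{4r} \bigr)^{d} (d+1) + \bigl( \tfrac{3d}{4r} \bigr)^{d} 3 \Bigl( 2d \bigl( \tfrac{3d}{4r} \bigr)^{d} + 1 \Bigr) + \bigl( \tfrac{3d}{4r} \bigr)^{d} \tfrac{3}{2^{L(F)-3}} + 1 \\ &\le \bigl( \tfrac{3d}{4r} \bigr)^{2d} \Bigl( 2d(d+1) + 3(2d+1) + \tfrac{3}{2^{1-3}} + 1 \Bigr) \\ &\le \bigl( \tfrac{3d}{4r} \bigr)^{2d} d^2 (4 + 9 + 12 + 1) = 26 \bigl( \tfrac{3d}{4r} \bigr)^{2d} d^2. \end{aligned} \tag{4.105} \]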


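Moreover, observe that the fact that $\tfrac{3d}{4r} \ge 3$ implies that
\[ \begin{aligned} \sum_{i=3}^{L(F)-1} \bigl( \tfrac{3d}{4r} \bigr)^{d} \tfrac{3}{2^{i-2}} \Bigl( \bigl( \tfrac{3d}{4r} \bigr)^{d} \tfrac{3}{2^{i-3}} + 1 \Bigr) &\le \bigl( \tfrac{3d}{4r} \bigr)^{2d} \sum_{i=3}^{L(F)-1} \tfrac{3}{2^{i-2}} \Bigl( \tfrac{3}{2^{i-3}} + 1 \Bigr) \\ &= \bigl( \tfrac{3d}{4r} \bigr)^{2d} \sum_{i=3}^{L(F)-1} \Bigl[ \tfrac{9}{2^{2i-5}} + \tfrac{3}{2^{i-2}} \Bigr] \\ &= \bigl( \tfrac{3d}{4r} \bigr)^{2d} \sum_{i=0}^{L(F)-4} \Bigl[ \tfrac{9}{2} (4^{-i}) + \tfrac{3}{2} (2^{-i}) \Bigr] \\ &\le \bigl( \tfrac{3d}{4r} \bigr)^{2d} \Bigl[ \tfrac{9}{2} \bigl( \tfrac{1}{1-4^{-1}} \bigr) + \tfrac{3}{2} \bigl( \tfrac{1}{1-2^{-1}} \bigr) \Bigr] = 9 \bigl( \tfrac{3d}{4r} \bigr)^{2d}. \end{aligned} \tag{4.106} \]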

Combining this, (4.104), and (4.105) demonstrates that


\[ P(F) \le 26 \bigl( \tfrac{3d}{4r} \bigr)^{2d} d^2 + 9 \bigl( \tfrac{3d}{4r} \bigr)^{2d} \le 35 \bigl( \tfrac{3d}{4r} \bigr)^{2d} d^2. \tag{4.107} \]

This establishes item (vi). The proof of Proposition 4.3.6 is thus complete.
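
To get a feeling for the size of the bounds in items (iii), (iv), and (vi), they can be evaluated for concrete values of d and r. The sketch below only evaluates the right-hand sides of the stated inequalities; it does not construct the network F. The function name `proposition_4_3_6_bounds` and the sample values d = 1, r = 1/8 are hypothetical.

```python
import math


def proposition_4_3_6_bounds(d, r):
    """Evaluate the upper bounds of Proposition 4.3.6 for r in (0, d/4)."""
    if not 0.0 < r < d / 4.0:
        raise ValueError("Proposition 4.3.6 assumes r in (0, d/4)")
    base = 3.0 * d / (4.0 * r)  # the quantity 3d/(4r), at least 3 here
    return {
        "K": base ** d,                                   # bound (4.96)
        "H": math.ceil(d * math.log2(base)) + 1,          # item (iii)
        "D1": 2.0 * d * base ** d,                        # item (iv)
        "P": 35.0 * base ** (2 * d) * d ** 2,             # item (vi)
    }


bounds = proposition_4_3_6_bounds(d=1, r=0.125)
```

For d = 1 and r = 1/8 one has 3d/(4r) = 6, so the sketch reports at most 6 interpolation points, at most 4 hidden layers, at most 12 neurons in the first hidden layer, and at most 35 · 36 = 1260 parameters.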
Proposition 4.3.7. Let d ∈ N, L, a ∈ R, b ∈ (a, ∞), r ∈ (0, ∞) and let f : [a, b]d → R
satisfy for all x, y ∈ [a, b]d that |f (x) − f (y)| ≤ L∥x − y∥1 (cf. Definition 3.3.4). Then there
exists F ∈ N such that
(i) it holds that I(F) = d,

(ii) it holds that O(F) = 1,

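(iii) it holds that $H(F) \le \bigl( \bigl\lceil d \log_2\bigl( \tfrac{3d}{4r} \bigr) \bigr\rceil + 1 \bigr) 1_{(0,d/4)}(r)$,

(iv) it holds that $D_1(F) \le 2d \bigl( \tfrac{3d}{4r} \bigr)^{d} 1_{(0,d/4)}(r) + 1_{[d/4,\infty)}(r)$,

(v) it holds for all i ∈ {2, 3, 4, . . .} that $D_i(F) \le 3 \bigl\lceil \bigl( \tfrac{3d}{4r} \bigr)^{d} \tfrac{1}{2^{i-1}} \bigr\rceil$,

(vi) it holds that $P(F) \le 35 \bigl( \tfrac{3d}{4r} \bigr)^{2d} d^2\, 1_{(0,d/4)}(r) + (d+1) 1_{[d/4,\infty)}(r)$,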

(vii) it holds that ∥T (F)∥∞ ≤ max{1, L, |a|, |b|, 2[supx∈[a,b]d |f (x)|]}, and

(viii) it holds that $\sup_{x\in[a,b]^d} |(R^N_r(F))(x) - f(x)| \le 2L(b-a)r$

(cf. Definitions 1.2.4, 1.3.1, 1.3.4, 1.3.5, and 4.2.6).


Proof of Proposition 4.3.7. Throughout this proof, assume without loss of generality that
r < d/4 (cf. Lemma 4.3.5), let δ : [a, b]d × [a, b]d → R satisfy for all x, y ∈ [a, b]d that

δ(x, y) = ∥x − y∥1 , (4.108)


and let K ∈ N ∪ {∞} satisfy


\[ K = C^{([a,b]^d,\delta),(b-a)r}. \tag{4.109} \]
Note that Lemma 4.3.4 assures that K < ∞. This and (4.82) ensure that there exist
x1 , x2 , . . . , xK ∈ [a, b]d such that

\[ \sup\nolimits_{x\in[a,b]^d} \bigl[ \min\nolimits_{k\in\{1,2,\ldots,K\}} \delta(x, x_k) \bigr] \le (b-a)r. \tag{4.110} \]

Combining this with Proposition 4.3.6 establishes items (i), (ii), (iii), (iv), (v), (vi), (vii),
and (viii). The proof of Proposition 4.3.7 is thus complete.
Proposition 4.3.8 (Implicit multi-dimensional ANN approximations with prescribed error
tolerances and explicit parameter bounds). Let d ∈ N, L, a ∈ R, b ∈ [a, ∞), ε ∈ (0, 1] and
let f : [a, b]d → R satisfy for all x, y ∈ [a, b]d that

|f (x) − f (y)| ≤ L∥x − y∥1 (4.111)

(cf. Definition 3.3.4). Then there exists F ∈ N such that


(i) it holds that I(F) = d,
(ii) it holds that O(F) = 1,

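(iii) it holds that $H(F) \le d \bigl( \log_2\bigl( \max\bigl\{ \tfrac{3dL(b-a)}{2}, 1 \bigr\} \bigr) + \log_2(\varepsilon^{-1}) \bigr) + 2$,

(iv) it holds that $D_1(F) \le \varepsilon^{-d} d (3d \max\{L(b-a), 1\})^{d}$,

(v) it holds for all i ∈ {2, 3, 4, . . .} that $D_i(F) \le \varepsilon^{-d}\, 3 \bigl( \tfrac{(3dL(b-a))^{d}}{2^{i}} + 1 \bigr)$,

(vi) it holds that $P(F) \le \varepsilon^{-2d}\, 9 \bigl( 3d \max\{L(b-a), 1\} \bigr)^{2d} d^2$,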
(vii) it holds that ∥T (F)∥∞ ≤ max{1, L, |a|, |b|, 2[supx∈[a,b]d |f (x)|]}, and

(viii) it holds that $\sup_{x\in[a,b]^d} |(R^N_r(F))(x) - f(x)| \le \varepsilon$

(cf. Definitions 1.2.4, 1.3.1, 1.3.4, and 1.3.5).


Proof of Proposition 4.3.8. Throughout this proof, assume without loss of generality that

L(b − a) ̸= 0. (4.112)

Observe that (4.112) ensures that L ̸= 0 and a < b. Combining this with the assumption
that for all x, y ∈ [a, b]d it holds that

|f (x) − f (y)| ≤ L∥x − y∥1 , (4.113)

ensures that L > 0. Proposition 4.3.7 therefore ensures that there exists F ∈ N which
satisfies that


(I) it holds that I(F) = d,

(II) it holds that O(F) = 1,

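(III) it holds that $H(F) \le \bigl( \bigl\lceil d \log_2\bigl( \tfrac{3dL(b-a)}{2\varepsilon} \bigr) \bigr\rceil + 1 \bigr) 1_{(0,d/4)}\bigl( \tfrac{\varepsilon}{2L(b-a)} \bigr)$,

(IV) it holds that $D_1(F) \le 2d \bigl( \tfrac{3dL(b-a)}{2\varepsilon} \bigr)^{d} 1_{(0,d/4)}\bigl( \tfrac{\varepsilon}{2L(b-a)} \bigr) + 1_{[d/4,\infty)}\bigl( \tfrac{\varepsilon}{2L(b-a)} \bigr)$,

(V) it holds for all i ∈ {2, 3, 4, . . .} that $D_i(F) \le 3 \bigl\lceil \bigl( \tfrac{3dL(b-a)}{2\varepsilon} \bigr)^{d} \tfrac{1}{2^{i-1}} \bigr\rceil$,

(VI) it holds that $P(F) \le 35 \bigl( \tfrac{3dL(b-a)}{2\varepsilon} \bigr)^{2d} d^2\, 1_{(0,d/4)}\bigl( \tfrac{\varepsilon}{2L(b-a)} \bigr) + (d+1) 1_{[d/4,\infty)}\bigl( \tfrac{\varepsilon}{2L(b-a)} \bigr)$,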

(VII) it holds that ∥T (F)∥∞ ≤ max{1, L, |a|, |b|, 2[supx∈[a,b]d |f (x)|]}, and

(VIII) it holds that $\sup_{x\in[a,b]^d} |(R^N_r(F))(x) - f(x)| \le \varepsilon$

(cf. Definitions 1.2.4, 1.3.1, 1.3.4, 1.3.5, and 4.2.6). Note that item (III) assures that

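\[ \begin{aligned} H(F) &\le \Bigl( d \bigl( \log_2\bigl( \tfrac{3dL(b-a)}{2} \bigr) + \log_2(\varepsilon^{-1}) \bigr) + 2 \Bigr) 1_{(0,d/4)}\bigl( \tfrac{\varepsilon}{2L(b-a)} \bigr) \\ &\le d \bigl( \max\bigl\{ \log_2\bigl( \tfrac{3dL(b-a)}{2} \bigr), 0 \bigr\} + \log_2(\varepsilon^{-1}) \bigr) + 2. \end{aligned} \tag{4.114} \]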

Furthermore, observe that item (IV) implies that

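\[ \begin{aligned} D_1(F) &\le d \bigl( \tfrac{3d \max\{L(b-a),1\}}{\varepsilon} \bigr)^{d} 1_{(0,d/4)}\bigl( \tfrac{\varepsilon}{2L(b-a)} \bigr) + 1_{[d/4,\infty)}\bigl( \tfrac{\varepsilon}{2L(b-a)} \bigr) \\ &\le \varepsilon^{-d} d (3d \max\{L(b-a), 1\})^{d}. \end{aligned} \tag{4.115} \]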

Moreover, note that item (V) establishes that for all i ∈ {2, 3, 4, . . . } it holds that

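\[ D_i(F) \le 3 \Bigl( \bigl( \tfrac{3dL(b-a)}{2\varepsilon} \bigr)^{d} \tfrac{1}{2^{i-1}} + 1 \Bigr) \le \varepsilon^{-d}\, 3 \Bigl( \tfrac{(3dL(b-a))^{d}}{2^{i}} + 1 \Bigr). \tag{4.116} \]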

In addition, observe that item (VI) ensures that

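\[ \begin{aligned} P(F) &\le 9 \bigl( \tfrac{3d \max\{L(b-a),1\}}{\varepsilon} \bigr)^{2d} d^2\, 1_{(0,d/4)}\bigl( \tfrac{\varepsilon}{2L(b-a)} \bigr) + (d+1) 1_{[d/4,\infty)}\bigl( \tfrac{\varepsilon}{2L(b-a)} \bigr) \\ &\le \varepsilon^{-2d}\, 9 \bigl( 3d \max\{L(b-a), 1\} \bigr)^{2d} d^2. \end{aligned} \tag{4.117} \]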

Combining this, (4.114), (4.115), and (4.116) with items (I), (II), (VII), and (VIII) estab-
lishes items (i), (ii), (iii), (iv), (v), (vi), (vii), and (viii). The proof of Proposition 4.3.8 is
thus complete.
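
The proof applies Proposition 4.3.7 with the radius r = ε/(2L(b − a)) and then weakens the resulting bounds. The following sketch evaluates both sets of bounds for one hypothetical choice of parameters (d = 1, L = 1, [a, b] = [0, 1], ε = 1/4) so that the weakening steps (4.114), (4.115), and (4.117) can be compared numerically. The function names are assumptions made for this illustration.

```python
import math


def bounds_from_4_3_7(d, r):
    """Bounds of Proposition 4.3.7 in the case r < d/4."""
    base = 3.0 * d / (4.0 * r)
    return {
        "H": math.ceil(d * math.log2(base)) + 1,
        "D1": 2.0 * d * base ** d,
        "P": 35.0 * base ** (2 * d) * d ** 2,
    }


def bounds_from_4_3_8(d, lip, a, b, eps):
    """Explicit bounds of Proposition 4.3.8, items (iii), (iv), and (vi)."""
    scale = max(lip * (b - a), 1.0)
    depth = d * (math.log2(max(3.0 * d * lip * (b - a) / 2.0, 1.0))
                 + math.log2(1.0 / eps)) + 2.0
    return {
        "H": depth,
        "D1": eps ** (-d) * d * (3.0 * d * scale) ** d,
        "P": eps ** (-2 * d) * 9.0 * (3.0 * d * scale) ** (2 * d) * d ** 2,
    }


d, lip, a, b, eps = 1, 1.0, 0.0, 1.0, 0.25
radius = eps / (2.0 * lip * (b - a))   # r used in the proof; here 1/8 < d/4
refined = bounds_from_4_3_7(d, radius)
explicit = bounds_from_4_3_8(d, lip, a, b, eps)
```

In this example the error guarantee 2L(b − a)r equals ε, and every entry of `refined` is at most the corresponding entry of `explicit`.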


Corollary 4.3.9 (Implicit multi-dimensional ANN approximations with prescribed error
tolerances and asymptotic parameter bounds). Let d ∈ N, L, a ∈ R, b ∈ [a, ∞) and let
f : [a, b]d → R satisfy for all x, y ∈ [a, b]d that

|f (x) − f (y)| ≤ L∥x − y∥1 (4.118)

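(cf. Definition 3.3.4). Then there exists C ∈ R such that for all ε ∈ (0, 1] there exists F ∈ N
such that
\[ H(F) \le C(\log_2(\varepsilon^{-1}) + 1), \qquad \|T(F)\|_\infty \le \max\bigl\{ 1, L, |a|, |b|, 2\bigl[ \sup\nolimits_{x\in[a,b]^d} |f(x)| \bigr] \bigr\}, \tag{4.119} \]
\[ R^N_r(F) \in C(\mathbb{R}^d, \mathbb{R}), \qquad \sup\nolimits_{x\in[a,b]^d} |(R^N_r(F))(x) - f(x)| \le \varepsilon, \qquad\text{and}\qquad P(F) \le C\varepsilon^{-2d} \tag{4.120} \]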
(cf. Definitions 1.2.4, 1.3.1, 1.3.4, and 1.3.5).

Proof of Corollary 4.3.9. Throughout this proof, let C ∈ R satisfy


\[ C = 9 \bigl( 3d \max\{L(b-a), 1\} \bigr)^{2d} d^2. \tag{4.121} \]

Note that items (i), (ii), (iii), (vi), (vii), and (viii) in Proposition 4.3.8 and the fact that for
all ε ∈ (0, 1] it holds that

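\[ \begin{aligned} d \bigl( \log_2\bigl( \max\bigl\{ \tfrac{3dL(b-a)}{2}, 1 \bigr\} \bigr) + \log_2(\varepsilon^{-1}) \bigr) + 2 &\le d \bigl( \max\bigl\{ \tfrac{3dL(b-a)}{2}, 1 \bigr\} + \log_2(\varepsilon^{-1}) \bigr) + 2 \\ &\le d \max\{ 3dL(b-a), 1 \} + 2 + d \log_2(\varepsilon^{-1}) \\ &\le C(\log_2(\varepsilon^{-1}) + 1) \end{aligned} \tag{4.122} \]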

imply that for every ε ∈ (0, 1] there exists F ∈ N such that

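\[ H(F) \le C(\log_2(\varepsilon^{-1}) + 1), \qquad \|T(F)\|_\infty \le \max\bigl\{ 1, L, |a|, |b|, 2\bigl[ \sup\nolimits_{x\in[a,b]^d} |f(x)| \bigr] \bigr\}, \tag{4.123} \]
\[ R^N_r(F) \in C(\mathbb{R}^d, \mathbb{R}), \qquad \sup\nolimits_{x\in[a,b]^d} |(R^N_r(F))(x) - f(x)| \le \varepsilon, \qquad\text{and}\qquad P(F) \le C\varepsilon^{-2d} \tag{4.124} \]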
(cf. Definitions 1.2.4, 1.3.1, 1.3.4, and 1.3.5). The proof of Corollary 4.3.9 is thus complete.
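
The chain of inequalities (4.122) can be spot-checked for specific parameters. The sketch below computes the constant C from (4.121) and compares the depth bound of Proposition 4.3.8 with C(log_2(ε^{-1}) + 1) for a few tolerances. The parameter values (d = 2, L = 1, [a, b] = [0, 1]) and the function names are hypothetical.

```python
import math


def corollary_constant(d, lip, a, b):
    """The constant C from (4.121)."""
    return 9.0 * (3.0 * d * max(lip * (b - a), 1.0)) ** (2 * d) * d ** 2


def depth_bound(d, lip, a, b, eps):
    """Right-hand side of item (iii) in Proposition 4.3.8."""
    inner = max(3.0 * d * lip * (b - a) / 2.0, 1.0)
    return d * (math.log2(inner) + math.log2(1.0 / eps)) + 2.0


d, lip, a, b = 2, 1.0, 0.0, 1.0
C = corollary_constant(d, lip, a, b)
tolerances = (1.0, 0.5, 0.1, 0.01)
checks = [
    depth_bound(d, lip, a, b, eps) <= C * (math.log2(1.0 / eps) + 1.0)
    for eps in tolerances
]
```

For these values C = 9 · 6^4 · 4 = 46656, which illustrates that the constant in the corollary is far from sharp: it is chosen so that a single C works for both the depth and the parameter count.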

Lemma 4.3.10 (Explicit estimates for vector norms). Let d ∈ N, p, q ∈ (0, ∞] satisfy
p ≤ q. Then it holds for all x ∈ Rd that

∥x∥p ≥ ∥x∥q (4.125)

(cf. Definition 3.3.4).


Proof of Lemma 4.3.10. Throughout this proof, assume without loss of generality that
q < ∞, let e1 , e2 , . . . , ed ∈ Rd satisfy e1 = (1, 0, . . . , 0), e2 = (0, 1, 0, . . . , 0), . . . , ed =
(0, . . . , 0, 1), let r ∈ R satisfy
r = p−1 q, (4.126)
and let x = (x1 , . . . , xd ), y = (y1 , . . . , yd ) ∈ Rd satisfy for all i ∈ {1, 2, . . . , d} that

yi = |xi |p . (4.127)

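Observe that (4.127), the fact that
\[ y = \sum_{i=1}^{d} y_i e_i, \tag{4.128} \]
and the fact that for all v, w ∈ Rd it holds that
\[ \|v + w\|_r \le \|v\|_r + \|w\|_r \tag{4.129} \]
(cf. Definition 3.3.4) ensure that
\[ \begin{aligned} \|x\|_q &= \Bigl[ \sum_{i=1}^{d} |x_i|^q \Bigr]^{1/q} = \Bigl[ \sum_{i=1}^{d} |x_i|^{pr} \Bigr]^{1/q} = \Bigl[ \sum_{i=1}^{d} |y_i|^r \Bigr]^{1/q} = \Bigl[ \sum_{i=1}^{d} |y_i|^r \Bigr]^{1/(pr)} = \|y\|_r^{1/p} \\ &= \Bigl\| \sum_{i=1}^{d} y_i e_i \Bigr\|_r^{1/p} \le \Bigl[ \sum_{i=1}^{d} \|y_i e_i\|_r \Bigr]^{1/p} = \Bigl[ \sum_{i=1}^{d} |y_i| \|e_i\|_r \Bigr]^{1/p} = \Bigl[ \sum_{i=1}^{d} |y_i| \Bigr]^{1/p} \\ &= \|y\|_1^{1/p} = \|x\|_p. \end{aligned} \tag{4.130} \]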

This establishes (4.125). The proof of Lemma 4.3.10 is thus complete.
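
The monotonicity (4.125) is easy to observe numerically. The following sketch evaluates ∥x∥_p for an increasing sequence of exponents, including an exponent below 1 and p = ∞; the helper name `p_norm` and the test vector are hypothetical.

```python
def p_norm(x, p):
    """The p-(quasi-)norm of x for p in (0, infinity]."""
    if p == float("inf"):
        return max(abs(t) for t in x)
    return sum(abs(t) ** p for t in x) ** (1.0 / p)


x = (3.0, -4.0, 1.0)
exponents = (0.5, 1.0, 2.0, 4.0, float("inf"))
norms = [p_norm(x, p) for p in exponents]
is_non_increasing = all(s >= t - 1e-12 for s, t in zip(norms, norms[1:]))
```

For this vector the values decrease from roughly 22.4 at p = 1/2 over 8 at p = 1 down to 4 at p = ∞.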


Corollary 4.3.11 (Implicit multi-dimensional ANN approximations with prescribed error
tolerances and asymptotic parameter bounds). Let d ∈ N, L, a ∈ R, b ∈ [a, ∞) and let
f : [a, b]d → R satisfy for all x, y ∈ [a, b]d that

|f (x) − f (y)| ≤ L∥x − y∥1 (4.131)

(cf. Definition 3.3.4). Then there exists C ∈ R such that for all ε ∈ (0, 1] there exists F ∈ N
such that
\[ R^N_r(F) \in C(\mathbb{R}^d, \mathbb{R}), \qquad \sup\nolimits_{x\in[a,b]^d} |(R^N_r(F))(x) - f(x)| \le \varepsilon, \qquad\text{and}\qquad P(F) \le C\varepsilon^{-2d} \tag{4.132} \]

(cf. Definitions 1.2.4, 1.3.1, and 1.3.4).


Proof of Corollary 4.3.11. Note that Corollary 4.3.9 establishes (4.132). The proof of
Corollary 4.3.11 is thus complete.


4.4 Refined ANN approximations results for multi-dimensional functions
In Chapter 15 below we establish estimates for the overall error in the training of suit-
able rectified clipped ANNs (see Section 4.4.1 below) in the specific situation of GD-type
optimization methods with many independent random initializations. Besides optimiza-
tion error estimates from Part III and generalization error estimates from Part IV, for
this overall error analysis we also employ suitable approximation error estimates with a
somewhat more refined control on the architecture of the approximating ANNs than the
approximation error estimates established in the previous sections of this chapter (cf., for
instance, Corollaries 4.3.9 and 4.3.11 above). It is exactly the subject of this section to
establish such refined approximation error estimates (see Proposition 4.4.12 below).
This section is specifically tailored to the requirements of the overall error analysis
presented in Chapter 15 and does not offer much more significant insights into the approxi-
mation error analyses of ANNs than the content of the previous sections in this chapter. It
can therefore be skipped at the first reading of this book and only needs to be considered
when the reader is studying Chapter 15 in detail.

4.4.1 Rectified clipped ANNs


Definition 4.4.1 (Rectified clipped ANNs). Let L, d ∈ N, u ∈ [−∞, ∞), v ∈ (u, ∞],
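$l = (l_0, l_1, \ldots, l_L) \in \mathbb{N}^{L+1}$, θ ∈ Rd satisfy
\[ d \ge \sum_{k=1}^{L} l_k (l_{k-1} + 1). \tag{4.133} \]
Then we denote by $N^{\theta,l}_{u,v} : \mathbb{R}^{l_0} \to \mathbb{R}^{l_L}$ the function which satisfies for all $x \in \mathbb{R}^{l_0}$ that
\[ N^{\theta,l}_{u,v}(x) = \begin{cases} \bigl( N^{\theta,l_0}_{\mathfrak{C}_{u,v,l_L}} \bigr)(x) & : L = 1 \\ \bigl( N^{\theta,l_0}_{\mathfrak{R}_{l_1}, \mathfrak{R}_{l_2}, \ldots, \mathfrak{R}_{l_{L-1}}, \mathfrak{C}_{u,v,l_L}} \bigr)(x) & : L > 1 \end{cases} \tag{4.134} \]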
(cf. Definitions 1.1.3, 1.2.5, and 1.2.10).


Lemma 4.4.2. Let Φ ∈ N (cf. Definition 1.3.1). Then it holds for all x ∈ RI(Φ) that
\[ \bigl( N^{T(\Phi),D(\Phi)}_{-\infty,\infty} \bigr)(x) = (R^N_r(\Phi))(x) \tag{4.135} \]

(cf. Definitions 1.2.4, 1.3.4, 1.3.5, and 4.4.1).


Proof of Lemma 4.4.2. Observe that Proposition 1.3.9, (4.134), (1.27), and the fact that
for all d ∈ N it holds that C−∞,∞,d = idRd demonstrate (4.135) (cf. Definition 1.2.10). The
proof of Lemma 4.4.2 is thus complete.
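
A rectified clipped ANN in the sense of Definition 4.4.1 can be sketched as follows. The sketch assumes a particular layout of the parameter vector θ, namely that each layer stores its weight matrix row by row followed by its bias vector; this layout is an assumption of the example and should be checked against Definitions 1.1.1 and 1.1.3. The function names are hypothetical.

```python
def affine(theta, offset, m, n, x):
    """Affine map with an m x n weight matrix read row-wise from theta[offset:],
    followed by m bias entries (assumed parameter layout)."""
    out = []
    for i in range(m):
        row = theta[offset + i * n: offset + (i + 1) * n]
        bias = theta[offset + m * n + i]
        out.append(sum(w * t for w, t in zip(row, x)) + bias)
    return out


def clipped_relu_network(theta, layers, u, v, x):
    """ReLU on the hidden layers, clipping to [u, v] on the output layer."""
    offset = 0
    depth = len(layers) - 1
    for k in range(1, depth + 1):
        m, n = layers[k], layers[k - 1]
        x = affine(theta, offset, m, n, x)
        offset += m * (n + 1)
        if k < depth:
            x = [max(t, 0.0) for t in x]
        else:
            x = [min(max(t, u), v) for t in x]
    return x


# Hypothetical example with l = (1, 2, 1): the network computes |x|.
theta = [1.0, -1.0, 0.0, 0.0, 1.0, 1.0, 0.0]
layers = (1, 2, 1)
inf = float("inf")
unclipped = clipped_relu_network(theta, layers, -inf, inf, [-3.0])
clipped = clipped_relu_network(theta, layers, 0.0, 1.0, [-3.0])
```

With u = −∞ and v = ∞ the clipping is the identity, which is the situation described in Lemma 4.4.2; with [u, v] = [0, 1] the output 3 is clipped to 1.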


4.4.2 Embedding ANNs in larger architectures


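Lemma 4.4.3. Let a ∈ C(R, R), L ∈ N, $l_0, l_1, \ldots, l_L, \mathfrak{l}_0, \mathfrak{l}_1, \ldots, \mathfrak{l}_L \in \mathbb{N}$ satisfy for all
k ∈ {1, 2, . . . , L} that $\mathfrak{l}_0 = l_0$, $\mathfrak{l}_L = l_L$, and $\mathfrak{l}_k \ge l_k$, for every k ∈ {1, 2, . . . , L} let $W_k = (W_{k,i,j})_{(i,j)\in\{1,2,\ldots,l_k\}\times\{1,2,\ldots,l_{k-1}\}} \in \mathbb{R}^{l_k\times l_{k-1}}$, $\mathfrak{W}_k = (\mathfrak{W}_{k,i,j})_{(i,j)\in\{1,2,\ldots,\mathfrak{l}_k\}\times\{1,2,\ldots,\mathfrak{l}_{k-1}\}} \in \mathbb{R}^{\mathfrak{l}_k\times \mathfrak{l}_{k-1}}$,
$B_k = (B_{k,i})_{i\in\{1,2,\ldots,l_k\}} \in \mathbb{R}^{l_k}$, $\mathfrak{B}_k = (\mathfrak{B}_{k,i})_{i\in\{1,2,\ldots,\mathfrak{l}_k\}} \in \mathbb{R}^{\mathfrak{l}_k}$, assume for all k ∈ {1, 2, . . . , L},
$i \in \{1, 2, \ldots, l_k\}$, $j \in \mathbb{N} \cap (0, l_{k-1}]$ that
\[ \mathfrak{W}_{k,i,j} = W_{k,i,j} \qquad\text{and}\qquad \mathfrak{B}_{k,i} = B_{k,i}, \tag{4.136} \]
and assume for all k ∈ {1, 2, . . . , L}, $i \in \{1, 2, \ldots, l_k\}$, $j \in \mathbb{N} \cap (l_{k-1}, \mathfrak{l}_{k-1} + 1)$ that $\mathfrak{W}_{k,i,j} = 0$.
Then
\[ R^N_a\bigl( ((W_1, B_1), (W_2, B_2), \ldots, (W_L, B_L)) \bigr) = R^N_a\bigl( ((\mathfrak{W}_1, \mathfrak{B}_1), (\mathfrak{W}_2, \mathfrak{B}_2), \ldots, (\mathfrak{W}_L, \mathfrak{B}_L)) \bigr) \tag{4.137} \]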
(cf. Definition 1.3.4).
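Proof of Lemma 4.4.3. Throughout this proof, let $\pi_k : \mathbb{R}^{\mathfrak{l}_k} \to \mathbb{R}^{l_k}$, k ∈ {0, 1, . . . , L}, satisfy
for all k ∈ {0, 1, . . . , L}, $x = (x_1, x_2, \ldots, x_{\mathfrak{l}_k})$ that
\[ \pi_k(x) = (x_1, x_2, \ldots, x_{l_k}). \tag{4.138} \]
Note that the assumption that $\mathfrak{l}_0 = l_0$ and $\mathfrak{l}_L = l_L$ proves that
\[ R^N_a\bigl( ((\mathfrak{W}_1, \mathfrak{B}_1), (\mathfrak{W}_2, \mathfrak{B}_2), \ldots, (\mathfrak{W}_L, \mathfrak{B}_L)) \bigr) \in C(\mathbb{R}^{l_0}, \mathbb{R}^{l_L}) \tag{4.139} \]
(cf. Definition 1.3.4). Furthermore, observe that the assumption that for all k ∈ {1, 2, . . . , L},
$i \in \{1, 2, \ldots, l_k\}$, $j \in \mathbb{N} \cap (l_{k-1}, \mathfrak{l}_{k-1} + 1)$ it holds that $\mathfrak{W}_{k,i,j} = 0$ shows that for all
k ∈ {1, 2, . . . , L}, $x = (x_1, x_2, \ldots, x_{\mathfrak{l}_{k-1}}) \in \mathbb{R}^{\mathfrak{l}_{k-1}}$ it holds that
\[ \begin{aligned} \pi_k(\mathfrak{W}_k x + \mathfrak{B}_k) &= \Bigl( \Bigl[ \sum_{i=1}^{\mathfrak{l}_{k-1}} \mathfrak{W}_{k,1,i} x_i \Bigr] + \mathfrak{B}_{k,1}, \Bigl[ \sum_{i=1}^{\mathfrak{l}_{k-1}} \mathfrak{W}_{k,2,i} x_i \Bigr] + \mathfrak{B}_{k,2}, \ldots, \Bigl[ \sum_{i=1}^{\mathfrak{l}_{k-1}} \mathfrak{W}_{k,l_k,i} x_i \Bigr] + \mathfrak{B}_{k,l_k} \Bigr) \\ &= \Bigl( \Bigl[ \sum_{i=1}^{l_{k-1}} \mathfrak{W}_{k,1,i} x_i \Bigr] + \mathfrak{B}_{k,1}, \Bigl[ \sum_{i=1}^{l_{k-1}} \mathfrak{W}_{k,2,i} x_i \Bigr] + \mathfrak{B}_{k,2}, \ldots, \Bigl[ \sum_{i=1}^{l_{k-1}} \mathfrak{W}_{k,l_k,i} x_i \Bigr] + \mathfrak{B}_{k,l_k} \Bigr). \end{aligned} \tag{4.140} \]
Combining this with the assumption that for all k ∈ {1, 2, . . . , L}, $i \in \{1, 2, \ldots, l_k\}$, $j \in \mathbb{N} \cap (0, l_{k-1}]$ it holds that $\mathfrak{W}_{k,i,j} = W_{k,i,j}$ and $\mathfrak{B}_{k,i} = B_{k,i}$ ensures that for all k ∈ {1, 2, . . . , L},
$x = (x_1, x_2, \ldots, x_{\mathfrak{l}_{k-1}}) \in \mathbb{R}^{\mathfrak{l}_{k-1}}$ it holds that
\[ \begin{aligned} \pi_k(\mathfrak{W}_k x + \mathfrak{B}_k) &= \Bigl( \Bigl[ \sum_{i=1}^{l_{k-1}} W_{k,1,i} x_i \Bigr] + B_{k,1}, \Bigl[ \sum_{i=1}^{l_{k-1}} W_{k,2,i} x_i \Bigr] + B_{k,2}, \ldots, \Bigl[ \sum_{i=1}^{l_{k-1}} W_{k,l_k,i} x_i \Bigr] + B_{k,l_k} \Bigr) \\ &= W_k \pi_{k-1}(x) + B_k. \end{aligned} \tag{4.141} \]
Hence, we obtain that for all $x_0 \in \mathbb{R}^{\mathfrak{l}_0}$, $x_1 \in \mathbb{R}^{\mathfrak{l}_1}$, . . . , $x_{L-1} \in \mathbb{R}^{\mathfrak{l}_{L-1}}$, k ∈ N ∩ (0, L) with
$\forall\, m \in \mathbb{N} \cap (0, L) : x_m = M_{a,\mathfrak{l}_m}(\mathfrak{W}_m x_{m-1} + \mathfrak{B}_m)$ it holds that
\[ \pi_k(x_k) = M_{a,l_k}(\pi_k(\mathfrak{W}_k x_{k-1} + \mathfrak{B}_k)) = M_{a,l_k}(W_k \pi_{k-1}(x_{k-1}) + B_k) \tag{4.142} \]
(cf. Definition 1.2.1). Induction, the assumption that $\mathfrak{l}_0 = l_0$ and $\mathfrak{l}_L = l_L$, and (4.141)
therefore imply that for all $x_0 \in \mathbb{R}^{\mathfrak{l}_0}$, $x_1 \in \mathbb{R}^{\mathfrak{l}_1}$, . . . , $x_{L-1} \in \mathbb{R}^{\mathfrak{l}_{L-1}}$ with $\forall\, k \in \mathbb{N} \cap (0, L) : x_k = M_{a,\mathfrak{l}_k}(\mathfrak{W}_k x_{k-1} + \mathfrak{B}_k)$ it holds that
\[ \begin{aligned} \bigl( R^N_a\bigl( ((W_1, B_1), (W_2, B_2), \ldots, (W_L, B_L)) \bigr) \bigr)(x_0) &= \bigl( R^N_a\bigl( ((W_1, B_1), (W_2, B_2), \ldots, (W_L, B_L)) \bigr) \bigr)(\pi_0(x_0)) \\ &= W_L \pi_{L-1}(x_{L-1}) + B_L \\ &= \pi_L(\mathfrak{W}_L x_{L-1} + \mathfrak{B}_L) = \mathfrak{W}_L x_{L-1} + \mathfrak{B}_L \\ &= \bigl( R^N_a\bigl( ((\mathfrak{W}_1, \mathfrak{B}_1), (\mathfrak{W}_2, \mathfrak{B}_2), \ldots, (\mathfrak{W}_L, \mathfrak{B}_L)) \bigr) \bigr)(x_0). \end{aligned} \tag{4.143} \]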

The proof of Lemma 4.4.3 is thus complete.


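Lemma 4.4.4. Let a ∈ C(R, R), L ∈ N, $l_0, l_1, \ldots, l_L, \mathfrak{l}_0, \mathfrak{l}_1, \ldots, \mathfrak{l}_L \in \mathbb{N}$ satisfy for all
k ∈ {1, 2, . . . , L} that
\[ \mathfrak{l}_0 = l_0, \qquad \mathfrak{l}_L = l_L, \qquad\text{and}\qquad \mathfrak{l}_k \ge l_k \tag{4.144} \]
and let Φ ∈ N satisfy $D(\Phi) = (l_0, l_1, \ldots, l_L)$ (cf. Definition 1.3.1). Then there exists Ψ ∈ N
such that
\[ D(\Psi) = (\mathfrak{l}_0, \mathfrak{l}_1, \ldots, \mathfrak{l}_L), \qquad \|T(\Psi)\|_\infty = \|T(\Phi)\|_\infty, \qquad\text{and}\qquad R^N_a(\Psi) = R^N_a(\Phi) \tag{4.145} \]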
(cf. Definitions 1.3.4, 1.3.5, and 3.3.4).
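Proof of Lemma 4.4.4. Throughout this proof, let $B_k = (B_{k,i})_{i\in\{1,2,\ldots,l_k\}} \in \mathbb{R}^{l_k}$, k ∈ {1, 2, . . . , L}, and $W_k = (W_{k,i,j})_{(i,j)\in\{1,2,\ldots,l_k\}\times\{1,2,\ldots,l_{k-1}\}} \in \mathbb{R}^{l_k\times l_{k-1}}$, k ∈ {1, 2, . . . , L}, satisfy
\[ \Phi = ((W_1, B_1), (W_2, B_2), \ldots, (W_L, B_L)) \tag{4.146} \]
and let $\mathfrak{W}_k = (\mathfrak{W}_{k,i,j})_{(i,j)\in\{1,2,\ldots,\mathfrak{l}_k\}\times\{1,2,\ldots,\mathfrak{l}_{k-1}\}} \in \mathbb{R}^{\mathfrak{l}_k\times \mathfrak{l}_{k-1}}$, k ∈ {1, 2, . . . , L}, and $\mathfrak{B}_k = (\mathfrak{B}_{k,i})_{i\in\{1,2,\ldots,\mathfrak{l}_k\}} \in \mathbb{R}^{\mathfrak{l}_k}$, k ∈ {1, 2, . . . , L}, satisfy for all k ∈ {1, 2, . . . , L}, $i \in \{1, 2, \ldots, \mathfrak{l}_k\}$,
$j \in \{1, 2, \ldots, \mathfrak{l}_{k-1}\}$ that
\[ \mathfrak{W}_{k,i,j} = \begin{cases} W_{k,i,j} & : (i \le l_k) \wedge (j \le l_{k-1}) \\ 0 & : (i > l_k) \vee (j > l_{k-1}) \end{cases} \qquad\text{and}\qquad \mathfrak{B}_{k,i} = \begin{cases} B_{k,i} & : i \le l_k \\ 0 & : i > l_k. \end{cases} \tag{4.147} \]

Note that (1.77) establishes that $((\mathfrak{W}_1, \mathfrak{B}_1), (\mathfrak{W}_2, \mathfrak{B}_2), \ldots, (\mathfrak{W}_L, \mathfrak{B}_L)) \in \bigl( \times_{i=1}^{L} (\mathbb{R}^{\mathfrak{l}_i\times \mathfrak{l}_{i-1}} \times \mathbb{R}^{\mathfrak{l}_i}) \bigr) \subseteq \mathbf{N}$ and
\[ D\bigl( ((\mathfrak{W}_1, \mathfrak{B}_1), (\mathfrak{W}_2, \mathfrak{B}_2), \ldots, (\mathfrak{W}_L, \mathfrak{B}_L)) \bigr) = (\mathfrak{l}_0, \mathfrak{l}_1, \ldots, \mathfrak{l}_L). \tag{4.148} \]

Furthermore, observe that Lemma 1.3.8 and (4.147) demonstrate that
\[ \bigl\| T\bigl( ((\mathfrak{W}_1, \mathfrak{B}_1), (\mathfrak{W}_2, \mathfrak{B}_2), \ldots, (\mathfrak{W}_L, \mathfrak{B}_L)) \bigr) \bigr\|_\infty = \|T(\Phi)\|_\infty \tag{4.149} \]
(cf. Definitions 1.3.5 and 3.3.4). Moreover, note that Lemma 4.4.3 proves that
\[ \begin{aligned} R^N_a(\Phi) &= R^N_a\bigl( ((W_1, B_1), (W_2, B_2), \ldots, (W_L, B_L)) \bigr) \\ &= R^N_a\bigl( ((\mathfrak{W}_1, \mathfrak{B}_1), (\mathfrak{W}_2, \mathfrak{B}_2), \ldots, (\mathfrak{W}_L, \mathfrak{B}_L)) \bigr) \end{aligned} \tag{4.150} \]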

(cf. Definition 1.3.4). The proof of Lemma 4.4.4 is thus complete.
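
The zero-padding construction behind Lemmas 4.4.3 and 4.4.4 can be sketched in a few lines. A network is represented here as a list of (weight matrix, bias vector) pairs using nested Python lists; this representation, the function names, and the example network are assumptions of the sketch. A sigmoid activation is used on purpose: the padded neurons then output a(0) = 1/2 rather than 0, and the realization is still unchanged because the next layer multiplies them by zero weights.

```python
import math


def realize(network, x, activation):
    """Realization: activation after every layer except the last one."""
    for index, (weights, bias) in enumerate(network):
        x = [sum(w * t for w, t in zip(row, x)) + b
             for row, b in zip(weights, bias)]
        if index < len(network) - 1:
            x = [activation(t) for t in x]
    return x


def embed(network, widths):
    """Pad every layer with zeros so that the layer dimensions become widths."""
    padded = []
    for k, (weights, bias) in enumerate(network, start=1):
        rows, cols = len(weights), len(weights[0])
        new_w = [[weights[i][j] if i < rows and j < cols else 0.0
                  for j in range(widths[k - 1])] for i in range(widths[k])]
        new_b = [bias[i] if i < rows else 0.0 for i in range(widths[k])]
        padded.append((new_w, new_b))
    return padded


def max_abs_parameter(network):
    return max(abs(v) for w, b in network
               for v in [e for row in w for e in row] + b)


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


phi = [([[1.0], [-2.0]], [0.5, 0.0]), ([[1.0, 3.0]], [-1.0])]  # dimensions (1, 2, 1)
psi = embed(phi, (1, 4, 1))                                      # dimensions (1, 4, 1)
inputs = [[-1.0], [0.0], [0.7], [2.5]]
gaps = [abs(realize(phi, x, sigmoid)[0] - realize(psi, x, sigmoid)[0])
        for x in inputs]
```

The wider network `psi` has the same largest parameter in absolute value as `phi` and the same realization on the tested inputs.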

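Lemma 4.4.5. Let $L, \mathfrak{L} \in \mathbb{N}$, $l_0, l_1, \ldots, l_L, \mathfrak{l}_0, \mathfrak{l}_1, \ldots, \mathfrak{l}_{\mathfrak{L}} \in \mathbb{N}$, $\Phi_1 = ((W_1, B_1), (W_2, B_2), \ldots, (W_L, B_L)) \in \bigl( \times_{k=1}^{L} (\mathbb{R}^{l_k\times l_{k-1}} \times \mathbb{R}^{l_k}) \bigr)$, $\Phi_2 = ((\mathfrak{W}_1, \mathfrak{B}_1), (\mathfrak{W}_2, \mathfrak{B}_2), \ldots, (\mathfrak{W}_{\mathfrak{L}}, \mathfrak{B}_{\mathfrak{L}})) \in \bigl( \times_{k=1}^{\mathfrak{L}} (\mathbb{R}^{\mathfrak{l}_k\times \mathfrak{l}_{k-1}} \times \mathbb{R}^{\mathfrak{l}_k}) \bigr)$. Then
\[ \|T(\Phi_1 \bullet \Phi_2)\|_\infty \le \max\bigl\{ \|T(\Phi_1)\|_\infty, \|T(\Phi_2)\|_\infty, \bigl\| T\bigl( ((W_1 \mathfrak{W}_{\mathfrak{L}}, W_1 \mathfrak{B}_{\mathfrak{L}} + B_1)) \bigr) \bigr\|_\infty \bigr\} \tag{4.151} \]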

(cf. Definitions 1.3.5, 2.1.1, and 3.3.4).

Proof of Lemma 4.4.5. Observe that (2.2) and Lemma 1.3.8 establish (4.151). The proof
of Lemma 4.4.5 is thus complete.

Lemma 4.4.6. Let d, L ∈ N, Φ ∈ N satisfy L ≥ L(Φ) and d = O(Φ) (cf. Definition 1.3.1).
Then
∥T (EL,Id (Φ))∥∞ ≤ max{1, ∥T (Φ)∥∞ } (4.152)
(cf. Definitions 1.3.5, 2.2.6, 2.2.8, and 3.3.4).

Proof of Lemma 4.4.6. Throughout this proof, assume without loss of generality that
L > L(Φ) and let l0 , l1 , . . . , lL−L(Φ)+1 ∈ N satisfy

(l0 , l1 , . . . , lL−L(Φ)+1 ) = (d, 2d, 2d, . . . , 2d, d). (4.153)

Note that Lemma 2.2.7 shows that D(Id ) = (d, 2d, d) ∈ N3 (cf. Definition 2.2.6). Item (i)
in Lemma 2.2.9 hence ensures that

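\[ L\bigl( (I_d)^{\bullet(L-L(\Phi))} \bigr) = L - L(\Phi) + 1 \qquad\text{and}\qquad D\bigl( (I_d)^{\bullet(L-L(\Phi))} \bigr) = (l_0, l_1, \ldots, l_{L-L(\Phi)+1}) \in \mathbb{N}^{L-L(\Phi)+2} \tag{4.154} \]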

(cf. Definition 2.1.1). This implies that there exist Wk ∈ Rlk ×lk−1 , k ∈ {1, 2, . . . , L−L(Φ)+1},
and Bk ∈ Rlk , k ∈ {1, 2, . . . , L − L(Φ) + 1}, which satisfy

(Id )•(L−L(Φ)) = ((W1 , B1 ), (W2 , B2 ), . . . , (WL−L(Φ)+1 , BL−L(Φ)+1 )). (4.155)


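Furthermore, observe that (2.44), (2.70), (2.71), (2.2), and (2.41) demonstrate that
\[ W_1 = \begin{pmatrix} 1 & 0 & \cdots & 0 \\ -1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ 0 & -1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \\ 0 & 0 & \cdots & -1 \end{pmatrix} \in \mathbb{R}^{(2d)\times d} \qquad\text{and}\qquad W_{L-L(\Phi)+1} = \begin{pmatrix} 1 & -1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & 1 & -1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 1 & -1 \end{pmatrix} \in \mathbb{R}^{d\times(2d)}. \tag{4.156} \]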

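Moreover, note that (2.44), (2.70), (2.71), (2.2), and (2.41) prove that for all k ∈ N ∩ (1, L −
L(Φ) + 1) it holds that
\[ W_k = \underbrace{\begin{pmatrix} 1 & 0 & \cdots & 0 \\ -1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ 0 & -1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \\ 0 & 0 & \cdots & -1 \end{pmatrix}}_{\in \mathbb{R}^{(2d)\times d}} \underbrace{\begin{pmatrix} 1 & -1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & 1 & -1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 1 & -1 \end{pmatrix}}_{\in \mathbb{R}^{d\times(2d)}} = \begin{pmatrix} 1 & -1 & 0 & 0 & \cdots & 0 & 0 \\ -1 & 1 & 0 & 0 & \cdots & 0 & 0 \\ 0 & 0 & 1 & -1 & \cdots & 0 & 0 \\ 0 & 0 & -1 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & 0 & \cdots & 1 & -1 \\ 0 & 0 & 0 & 0 & \cdots & -1 & 1 \end{pmatrix} \in \mathbb{R}^{(2d)\times(2d)}. \tag{4.157} \]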

In addition, observe that (2.70), (2.71), (2.44), (2.41), and (2.2) establish that for all
k ∈ N ∩ [1, L − L(Φ)] it holds that
Bk = 0 ∈ R2d and BL−L(Φ)+1 = 0 ∈ Rd . (4.158)
Combining this, (4.156), and (4.157) shows that
\[ \bigl\| T\bigl( (I_d)^{\bullet(L-L(\Phi))} \bigr) \bigr\|_\infty = 1 \tag{4.159} \]


(cf. Definitions 1.3.5 and 3.3.4). Next note that (4.156) ensures that for all k ∈ N,
W = (wi,j )(i,j)∈{1,2,...,d}×{1,2,...,k} ∈ Rd×k it holds that
 
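\[ W_1 W = \begin{pmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,k} \\ -w_{1,1} & -w_{1,2} & \cdots & -w_{1,k} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,k} \\ -w_{2,1} & -w_{2,2} & \cdots & -w_{2,k} \\ \vdots & \vdots & \ddots & \vdots \\ w_{d,1} & w_{d,2} & \cdots & w_{d,k} \\ -w_{d,1} & -w_{d,2} & \cdots & -w_{d,k} \end{pmatrix} \in \mathbb{R}^{(2d)\times k}. \tag{4.160} \]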

Furthermore, observe that (4.156) and (4.158) imply that for all B = (b1 , b2 , . . . , bd ) ∈ Rd
it holds that
   
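\[
W_1 B + B_1 = \begin{pmatrix}
1 & 0 & \cdots & 0 \\
-1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
0 & -1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1 \\
0 & 0 & \cdots & -1
\end{pmatrix}
\begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_d \end{pmatrix}
= \begin{pmatrix} b_1 \\ -b_1 \\ b_2 \\ -b_2 \\ \vdots \\ b_d \\ -b_d \end{pmatrix} \in \mathbb{R}^{2d}. \tag{4.161}
\]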

Combining this with (4.160) demonstrates that for all k ∈ N, W ∈ Rd×k , B ∈ Rd it holds
that
\[
\big\| \mathcal{T}\big( (W_1 W, W_1 B + B_1) \big) \big\|_\infty = \big\| \mathcal{T}\big( (W, B) \big) \big\|_\infty. \tag{4.162}
\]
This, Lemma 4.4.5, and (4.159) prove that

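\[
\begin{aligned}
\| \mathcal{T}(\mathcal{E}_{L, \mathfrak{I}_d}(\Phi)) \|_\infty
&= \big\| \mathcal{T}\big( ((\mathfrak{I}_d)^{\bullet (L - \mathcal{L}(\Phi))}) \bullet \Phi \big) \big\|_\infty \\
&\le \max\big\{ \big\| \mathcal{T}\big( (\mathfrak{I}_d)^{\bullet (L - \mathcal{L}(\Phi))} \big) \big\|_\infty, \| \mathcal{T}(\Phi) \|_\infty \big\}
= \max\{ 1, \| \mathcal{T}(\Phi) \|_\infty \}
\end{aligned} \tag{4.163}
\]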

(cf. Definition 2.2.8). The proof of Lemma 4.4.6 is thus complete.
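The structure of the weight matrices in (4.156) and (4.157) can be checked numerically. The following NumPy sketch is only an illustration under stated assumptions: the helper names `identity_network_weights` and `realize` are hypothetical and not part of the text, all biases are taken to be zero as in (4.158), and the ReLU activation is applied after every layer except the last one. It builds the weights of an n-fold composition of the ReLU identity network, and the accompanying checks confirm that the realization is the identity and that the largest absolute weight is 1, in line with (4.159).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def identity_network_weights(d, n):
    """Weight matrices of the n-fold composition (n >= 1) of the ReLU identity network.

    Hypothetical helper: returns n + 1 matrices, all biases are zero (cf. (4.158)).
    """
    w_first = np.kron(np.eye(d), np.array([[1.0], [-1.0]]))   # (2d) x d, cf. (4.156)
    w_last = np.kron(np.eye(d), np.array([[1.0, -1.0]]))      # d x (2d), cf. (4.156)
    w_middle = w_first @ w_last                                # (2d) x (2d), cf. (4.157)
    return [w_first] + [w_middle] * (n - 1) + [w_last]

def realize(weights, x):
    """Apply the network: ReLU after every layer except the last one."""
    for w in weights[:-1]:
        x = relu(w @ x)
    return weights[-1] @ x

weights = identity_network_weights(d=3, n=4)
x = np.array([0.7, -1.3, 2.0])
print(realize(weights, x))                         # reproduces x
print(max(np.abs(w).max() for w in weights))       # largest absolute parameter
```

The Kronecker products merely spell out the block structure of the displayed matrices; any other construction of the same matrices would serve equally well.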


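Lemma 4.4.7. Let L, 𝔏 ∈ N, l0 , l1 , . . . , lL , 𝔩0 , 𝔩1 , . . . , 𝔩𝔏 ∈ N satisfy

\[
\mathfrak{L} \ge L, \qquad \mathfrak{l}_0 = l_0, \qquad\text{and}\qquad \mathfrak{l}_{\mathfrak{L}} = l_L, \tag{4.164}
\]

assume for all i ∈ N ∩ [0, L) that 𝔩i ≥ li , assume for all i ∈ N ∩ (L − 1, 𝔏) that 𝔩i ≥ 2lL , and
let Φ ∈ N satisfy D(Φ) = (l0 , l1 , . . . , lL ) (cf. Definition 1.3.1). Then there exists Ψ ∈ N
such that

\[
\mathcal{D}(\Psi) = (\mathfrak{l}_0, \mathfrak{l}_1, \ldots, \mathfrak{l}_{\mathfrak{L}}), \qquad
\| \mathcal{T}(\Psi) \|_\infty \le \max\{ 1, \| \mathcal{T}(\Phi) \|_\infty \}, \qquad\text{and}\qquad
\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\Psi) = \mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\Phi) \tag{4.165}
\]

(cf. Definitions 1.2.4, 1.3.4, 1.3.5, and 3.3.4).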


Proof of Lemma 4.4.7. Throughout this proof, let Ξ ∈ N satisfy Ξ = E_{𝔏, I_{lL}}(Φ) (cf. Definitions 2.2.6 and 2.2.8). Note that item (i) in Lemma 2.2.7 establishes that D(I_{lL}) =
(lL , 2lL , lL ) ∈ N^3 . Combining this with Lemma 2.2.11 shows that D(Ξ) ∈ N^{𝔏+1} and
\[
\mathcal{D}(\Xi) = \begin{cases}
(l_0, l_1, \ldots, l_L) & : \mathfrak{L} = L \\
(l_0, l_1, \ldots, l_{L-1}, 2 l_L, 2 l_L, \ldots, 2 l_L, l_L) & : \mathfrak{L} > L.
\end{cases} \tag{4.166}
\]

Furthermore, observe that Lemma 4.4.6 (applied with d ↶ lL , L ↶ 𝔏, Φ ↶ Φ in the
notation of Lemma 4.4.6) ensures that

∥T (Ξ)∥∞ ≤ max{1, ∥T (Φ)∥∞ } (4.167)

(cf. Definitions 1.3.5 and 3.3.4). Moreover, note that item (ii) in Lemma 2.2.7 implies that
for all x ∈ RlL it holds that
(RNr (IlL ))(x) = x (4.168)
(cf. Definitions 1.2.4 and 1.3.4). This and item (ii) in Lemma 2.2.10 prove that

\[
\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\Xi) = \mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\Phi). \tag{4.169}
\]

In addition, observe that (4.166), the assumption that for all i ∈ [0, L) it holds that
l0 = l0 , lL = lL , and li ≤ li , the assumption that for all i ∈ N ∩ (L − 1, L) it holds
that li ≥ 2lL , and Lemma 4.4.4 (applied with a ↶ r, L ↶ L, (l0 , l1 , . . . , lL ) ↶ D(Ξ),
(l0 , l1 , . . . , lL ) ↶ (l0 , l1 , . . . , lL ), Φ ↶ Ξ in the notation of Lemma 4.4.4) demonstrate that
there exists Ψ ∈ N such that

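\[
\mathcal{D}(\Psi) = (\mathfrak{l}_0, \mathfrak{l}_1, \ldots, \mathfrak{l}_{\mathfrak{L}}), \qquad
\| \mathcal{T}(\Psi) \|_\infty = \| \mathcal{T}(\Xi) \|_\infty, \qquad\text{and}\qquad
\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\Psi) = \mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\Xi). \tag{4.170}
\]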

Combining this with (4.167) and (4.169) proves (4.165). The proof of Lemma 4.4.7 is thus
complete.

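Lemma 4.4.8. Let u ∈ [−∞, ∞), v ∈ (u, ∞], L, 𝔏, d, 𝔡 ∈ N, θ ∈ R^d , l0 , l1 , . . . , lL , 𝔩0 , 𝔩1 ,
. . . , 𝔩𝔏 ∈ N satisfy that

\[
d \ge \sum_{i=1}^{L} l_i (l_{i-1} + 1), \qquad
\mathfrak{d} \ge \sum_{i=1}^{\mathfrak{L}} \mathfrak{l}_i (\mathfrak{l}_{i-1} + 1), \qquad
\mathfrak{L} \ge L, \qquad \mathfrak{l}_0 = l_0, \qquad\text{and}\qquad \mathfrak{l}_{\mathfrak{L}} = l_L, \tag{4.171}
\]

assume for all i ∈ N ∩ [0, L) that 𝔩i ≥ li , and assume for all i ∈ N ∩ (L − 1, 𝔏) that 𝔩i ≥ 2lL .
Then there exists ϑ ∈ R^𝔡 such that

\[
\| \vartheta \|_\infty \le \max\{ 1, \| \theta \|_\infty \} \qquad\text{and}\qquad
\mathscr{N}^{\vartheta, (\mathfrak{l}_0, \mathfrak{l}_1, \ldots, \mathfrak{l}_{\mathfrak{L}})}_{u,v}
= \mathscr{N}^{\theta, (l_0, l_1, \ldots, l_L)}_{u,v} \tag{4.172}
\]

(cf. Definitions 3.3.4 and 4.4.1).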


Proof of Lemma 4.4.8. Throughout this proof, let η1 , η2 , . . . , ηd ∈ R satisfy

θ = (η1 , η2 , . . . , ηd ) (4.173)

and let Φ ∈ ×_{i=1}^{L} (R^{li × li−1} × R^{li}) satisfy

T (Φ) = (η1 , η2 , . . . , ηP(Φ) ) (4.174)

(cf. Definitions 1.3.1 and 1.3.5). Note that Lemma 4.4.7 establishes that there exists Ψ ∈ N
which satisfies

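\[
\mathcal{D}(\Psi) = (\mathfrak{l}_0, \mathfrak{l}_1, \ldots, \mathfrak{l}_{\mathfrak{L}}), \qquad
\| \mathcal{T}(\Psi) \|_\infty \le \max\{ 1, \| \mathcal{T}(\Phi) \|_\infty \}, \qquad\text{and}\qquad
\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\Psi) = \mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\Phi) \tag{4.175}
\]

(cf. Definitions 1.2.4, 1.3.4, and 3.3.4). Next let ϑ = (ϑ1 , ϑ2 , . . . , ϑ𝔡 ) ∈ R^𝔡 satisfy

\[
(\vartheta_1, \vartheta_2, \ldots, \vartheta_{\mathcal{P}(\Psi)}) = \mathcal{T}(\Psi) \qquad\text{and}\qquad
\forall\, i \in \mathbb{N} \cap (\mathcal{P}(\Psi), \mathfrak{d} + 1) \colon \vartheta_i = 0. \tag{4.176}
\]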

Observe that (4.173), (4.174), (4.175), and (4.176) show that

∥ϑ∥∞ = ∥T (Ψ)∥∞ ≤ max{1, ∥T (Φ)∥∞ } ≤ max{1, ∥θ∥∞ }. (4.177)

Furthermore, note that Lemma 4.4.2 and (4.174) ensure that for all x ∈ Rl0 it holds that

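\[
\big( \mathscr{N}^{\theta, (l_0, l_1, \ldots, l_L)}_{-\infty, \infty} \big)(x)
= \big( \mathscr{N}^{\mathcal{T}(\Phi), \mathcal{D}(\Phi)}_{-\infty, \infty} \big)(x)
= (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\Phi))(x) \tag{4.178}
\]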

(cf. Definition 4.4.1). Moreover, observe that Lemma 4.4.2, (4.175), and (4.176) imply that
for all x ∈ Rl0 it holds that
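\[
\big( \mathscr{N}^{\vartheta, (\mathfrak{l}_0, \mathfrak{l}_1, \ldots, \mathfrak{l}_{\mathfrak{L}})}_{-\infty, \infty} \big)(x)
= \big( \mathscr{N}^{\mathcal{T}(\Psi), \mathcal{D}(\Psi)}_{-\infty, \infty} \big)(x)
= (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\Psi))(x). \tag{4.179}
\]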

Combining this and (4.178) with (4.175) and the assumption that l0 = l0 and lL = lL
demonstrates that
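\[
\mathscr{N}^{\theta, (l_0, l_1, \ldots, l_L)}_{-\infty, \infty}
= \mathscr{N}^{\vartheta, (\mathfrak{l}_0, \mathfrak{l}_1, \ldots, \mathfrak{l}_{\mathfrak{L}})}_{-\infty, \infty}. \tag{4.180}
\]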

Therefore, we obtain that


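\[
\mathscr{N}^{\theta, (l_0, l_1, \ldots, l_L)}_{u,v}
= \mathfrak{C}_{u,v,l_L} \circ \mathscr{N}^{\theta, (l_0, l_1, \ldots, l_L)}_{-\infty, \infty}
= \mathfrak{C}_{u,v,\mathfrak{l}_{\mathfrak{L}}} \circ \mathscr{N}^{\vartheta, (\mathfrak{l}_0, \mathfrak{l}_1, \ldots, \mathfrak{l}_{\mathfrak{L}})}_{-\infty, \infty}
= \mathscr{N}^{\vartheta, (\mathfrak{l}_0, \mathfrak{l}_1, \ldots, \mathfrak{l}_{\mathfrak{L}})}_{u,v} \tag{4.181}
\]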

(cf. Definition 1.2.10). This and (4.177) prove (4.172). The proof of Lemma 4.4.8 is thus
complete.


4.4.3 Approximation through ANNs with variable architectures


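Corollary 4.4.9. Let d, K, 𝔡, 𝔏 ∈ N, 𝔩 = (𝔩0 , 𝔩1 , . . . , 𝔩𝔏 ) ∈ N^{𝔏+1} , L ∈ [0, ∞) satisfy that

\[
\mathfrak{L} \ge \lceil \log_2(K) \rceil + 2, \qquad \mathfrak{l}_0 = d, \qquad \mathfrak{l}_{\mathfrak{L}} = 1, \qquad \mathfrak{l}_1 \ge 2dK, \qquad\text{and}\qquad
\mathfrak{d} \ge \sum_{i=1}^{\mathfrak{L}} \mathfrak{l}_i (\mathfrak{l}_{i-1} + 1), \tag{4.182}
\]

assume for all i ∈ N ∩ (1, 𝔏) that 𝔩i ≥ 3⌈K/2^{i−1}⌉, let E ⊆ R^d be a set, let x1 , x2 , . . . , xK ∈ E, and
let f : E → R satisfy for all x, y ∈ E that |f (x) − f (y)| ≤ L∥x − y∥1 (cf. Definitions 3.3.4
and 4.2.6). Then there exists θ ∈ R^𝔡 such that

\[
\| \theta \|_\infty \le \max\big\{ 1, L, \max\nolimits_{k \in \{1, 2, \ldots, K\}} \| x_k \|_\infty, 2 \max\nolimits_{k \in \{1, 2, \ldots, K\}} |f(x_k)| \big\} \tag{4.183}
\]

and

\[
\sup\nolimits_{x \in E} \big| f(x) - \mathscr{N}^{\theta, \mathfrak{l}}_{-\infty, \infty}(x) \big|
\le 2L \Big[ \sup\nolimits_{x \in E} \Big( \inf\nolimits_{k \in \{1, 2, \ldots, K\}} \| x - x_k \|_1 \Big) \Big] \tag{4.184}
\]

(cf. Definition 4.4.1).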

Proof of Corollary 4.4.9. Throughout this proof, let y ∈ RK , Φ ∈ N satisfy y = (f (x1 ),


f (x2 ), . . . , f (xK )) and

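\[
\Phi = \mathbb{M}_K \bullet \mathbf{A}_{-L \operatorname{I}_K, y} \bullet \mathbf{P}_K\big( \mathbb{L}_d \bullet \mathbf{A}_{\operatorname{I}_d, -x_1}, \mathbb{L}_d \bullet \mathbf{A}_{\operatorname{I}_d, -x_2}, \ldots, \mathbb{L}_d \bullet \mathbf{A}_{\operatorname{I}_d, -x_K} \big) \bullet \mathbb{T}_{d,K} \tag{4.185}
\]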

(cf. Definitions 1.3.1, 1.5.5, 2.1.1, 2.2.1, 2.3.1, 2.4.6, 4.2.1, and 4.2.5). Note that Lemma 4.2.9
and Proposition 4.3.1 establish that

(I) it holds that L(Φ) = ⌈log2 (K)⌉ + 2,

(II) it holds that I(Φ) = d,

(III) it holds that O(Φ) = 1,

(IV) it holds that D1 (Φ) = 2dK,

(V) it holds for all i ∈ {2, 3, . . . , L(Φ) − 1} that Di (Φ) ≤ 3⌈K/2^{i−1}⌉,

(VI) it holds that ∥T (Φ)∥∞ ≤ max{1, L, maxk∈{1,2,...,K} ∥xk ∥∞ , 2 maxk∈{1,2,...,K} |f (xk )|}, and

(VII) it holds that supx∈E |f (x) − (R^N_r (Φ))(x)| ≤ 2L [ supx∈E ( inf k∈{1,2,...,K} ∥x − xk ∥1 ) ]

(cf. Definitions 1.2.4, 1.3.4, and 1.3.5). Furthermore, observe that the fact that 𝔏 ≥
⌈log2 (K)⌉ + 2 = L(Φ), the fact that 𝔩0 = d = D0 (Φ), the fact that 𝔩1 ≥ 2dK = D1 (Φ), the
fact that for all i ∈ {1, 2, . . . , L(Φ) − 1}\{1} it holds that 𝔩i ≥ 3⌈K/2^{i−1}⌉ ≥ Di (Φ), the fact
that for all i ∈ N ∩ (L(Φ) − 1, 𝔏) it holds that 𝔩i ≥ 3⌈K/2^{i−1}⌉ ≥ 2 = 2D_{L(Φ)} (Φ), the fact that
𝔩𝔏 = 1 = D_{L(Φ)} (Φ), and Lemma 4.4.8 show that there exists θ ∈ R^𝔡 which satisfies that

\[
\| \theta \|_\infty \le \max\{ 1, \| \mathcal{T}(\Phi) \|_\infty \} \qquad\text{and}\qquad
\mathscr{N}^{\theta, (\mathfrak{l}_0, \mathfrak{l}_1, \ldots, \mathfrak{l}_{\mathfrak{L}})}_{-\infty, \infty}
= \mathscr{N}^{\mathcal{T}(\Phi), \mathcal{D}(\Phi)}_{-\infty, \infty}. \tag{4.186}
\]


This and item (VI) ensure that

∥θ∥∞ ≤ max{1, L, maxk∈{1,2,...,K} ∥xk ∥∞ , 2 maxk∈{1,2,...,K} |f (xk )|}. (4.187)

Moreover, note that (4.186), Lemma 4.4.2, and item (VII) imply that
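\[
\begin{aligned}
\sup\nolimits_{x \in E} \big| f(x) - \mathscr{N}^{\theta, (\mathfrak{l}_0, \mathfrak{l}_1, \ldots, \mathfrak{l}_{\mathfrak{L}})}_{-\infty, \infty}(x) \big|
&= \sup\nolimits_{x \in E} \big| f(x) - \mathscr{N}^{\mathcal{T}(\Phi), \mathcal{D}(\Phi)}_{-\infty, \infty}(x) \big| \\
&= \sup\nolimits_{x \in E} \big| f(x) - (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\Phi))(x) \big| \\
&\le 2L \Big[ \sup\nolimits_{x \in E} \Big( \inf\nolimits_{k \in \{1, 2, \ldots, K\}} \| x - x_k \|_1 \Big) \Big]
\end{aligned} \tag{4.188}
\]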
(cf. Definition 4.4.1). The proof of Corollary 4.4.9 is thus complete.
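
Corollary 4.4.10. Let d, K, 𝔡, 𝔏 ∈ N, 𝔩 = (𝔩0 , 𝔩1 , . . . , 𝔩𝔏 ) ∈ N^{𝔏+1} , L ∈ [0, ∞), u ∈ [−∞, ∞),
v ∈ (u, ∞] satisfy that
\[
\mathfrak{L} \ge \lceil \log_2(K) \rceil + 2, \qquad \mathfrak{l}_0 = d, \qquad \mathfrak{l}_{\mathfrak{L}} = 1, \qquad \mathfrak{l}_1 \ge 2dK, \qquad\text{and}\qquad
\mathfrak{d} \ge \sum_{i=1}^{\mathfrak{L}} \mathfrak{l}_i (\mathfrak{l}_{i-1} + 1), \tag{4.189}
\]
assume for all i ∈ N ∩ (1, 𝔏) that 𝔩i ≥ 3⌈K/2^{i−1}⌉, let E ⊆ R^d be a set, let x1 , x2 , . . . , xK ∈ E,
and let f : E → ([u, v] ∩ R) satisfy for all x, y ∈ E that |f (x) − f (y)| ≤ L∥x − y∥1 (cf.
Definitions 3.3.4 and 4.2.6). Then there exists θ ∈ R^𝔡 such that
\[
\| \theta \|_\infty \le \max\big\{ 1, L, \max\nolimits_{k \in \{1, 2, \ldots, K\}} \| x_k \|_\infty, 2 \max\nolimits_{k \in \{1, 2, \ldots, K\}} |f(x_k)| \big\} \tag{4.190}
\]
and
\[
\sup\nolimits_{x \in E} \big| f(x) - \mathscr{N}^{\theta, \mathfrak{l}}_{u,v}(x) \big|
\le 2L \Big[ \sup\nolimits_{x \in E} \Big( \inf\nolimits_{k \in \{1, 2, \ldots, K\}} \| x - x_k \|_1 \Big) \Big] \tag{4.191}
\]
(cf. Definition 4.4.1).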
Proof of Corollary 4.4.10. Observe that Corollary 4.4.9 demonstrates that there exists
θ ∈ Rd such that
∥θ∥∞ ≤ max{1, L, maxk∈{1,2,...,K} ∥xk ∥∞ , 2 maxk∈{1,2,...,K} |f (xk )|} (4.192)
and
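\[
\sup\nolimits_{x \in E} \big| f(x) - \mathscr{N}^{\theta, \mathfrak{l}}_{-\infty, \infty}(x) \big|
\le 2L \Big[ \sup\nolimits_{x \in E} \Big( \inf\nolimits_{k \in \{1, 2, \ldots, K\}} \| x - x_k \|_1 \Big) \Big]. \tag{4.193}
\]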
Furthermore, note that the assumption that f (E) ⊆ [u, v] proves that for all x ∈ E it holds
that
f (x) = cu,v (f (x)) (4.194)
(cf. Definitions 1.2.9 and 4.4.1). The fact that for all x, y ∈ R it holds that |cu,v (x)−cu,v (y)| ≤
|x − y| and (4.193) hence establish that
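\[
\begin{aligned}
\sup\nolimits_{x \in E} \big| f(x) - \mathscr{N}^{\theta, \mathfrak{l}}_{u,v}(x) \big|
&= \sup\nolimits_{x \in E} \big| \mathfrak{c}_{u,v}(f(x)) - \mathfrak{c}_{u,v}\big( \mathscr{N}^{\theta, \mathfrak{l}}_{-\infty, \infty}(x) \big) \big| \\
&\le \sup\nolimits_{x \in E} \big| f(x) - \mathscr{N}^{\theta, \mathfrak{l}}_{-\infty, \infty}(x) \big|
\le 2L \Big[ \sup\nolimits_{x \in E} \Big( \inf\nolimits_{k \in \{1, 2, \ldots, K\}} \| x - x_k \|_1 \Big) \Big].
\end{aligned} \tag{4.195}
\]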
The proof of Corollary 4.4.10 is thus complete.


4.4.4 Refined convergence rates for the approximation error


Lemma 4.4.11. Let d, 𝔡, 𝔏 ∈ N, L, a ∈ R, b ∈ (a, ∞), u ∈ [−∞, ∞), v ∈ (u, ∞],
𝔩 = (𝔩0 , 𝔩1 , . . . , 𝔩𝔏 ) ∈ N^{𝔏+1} , assume 𝔩0 = d, 𝔩𝔏 = 1, and 𝔡 ≥ Σ_{i=1}^{𝔏} 𝔩i (𝔩i−1 + 1), and let
f : [a, b]^d → ([u, v] ∩ R) satisfy for all x, y ∈ [a, b]^d that |f (x) − f (y)| ≤ L∥x − y∥1 (cf.
Definition 3.3.4). Then there exists ϑ ∈ R^𝔡 such that ∥ϑ∥∞ ≤ supx∈[a,b]^d |f (x)| and

\[
\sup\nolimits_{x \in [a,b]^d} \big| \mathscr{N}^{\vartheta, \mathfrak{l}}_{u,v}(x) - f(x) \big| \le \frac{dL(b-a)}{2} \tag{4.196}
\]
(cf. Definition 4.4.1).
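Proof of Lemma 4.4.11. Throughout this proof, let 𝐝 = Σ_{i=1}^{𝔏} 𝔩i (𝔩i−1 + 1), let m = (m1 , m2 ,
. . . , md ) ∈ [a, b]^d satisfy for all i ∈ {1, 2, . . . , d} that
\[
m_i = \frac{a+b}{2}, \tag{4.197}
\]
and let ϑ = (ϑ1 , ϑ2 , . . . , ϑ𝔡 ) ∈ R^𝔡 satisfy for all i ∈ {1, 2, . . . , 𝔡}\{𝐝} that ϑi = 0 and
ϑ𝐝 = f (m). Observe that the assumption that 𝔩𝔏 = 1 and the fact that ∀ i ∈ {1, 2, . . . , 𝐝 −
1} : ϑi = 0 show that for all x = (x1 , x2 , . . . , x_{𝔩𝔏−1}) ∈ R^{𝔩𝔏−1} it holds that
\[
\begin{aligned}
\mathcal{A}^{\vartheta, \sum_{k=1}^{\mathfrak{L}-1} \mathfrak{l}_k (\mathfrak{l}_{k-1}+1)}_{1, \mathfrak{l}_{\mathfrak{L}-1}}(x)
&= \Big[ \sum_{i=1}^{\mathfrak{l}_{\mathfrak{L}-1}} \vartheta_{[\sum_{k=1}^{\mathfrak{L}-1} \mathfrak{l}_k (\mathfrak{l}_{k-1}+1)] + i} \, x_i \Big]
+ \vartheta_{[\sum_{k=1}^{\mathfrak{L}-1} \mathfrak{l}_k (\mathfrak{l}_{k-1}+1)] + \mathfrak{l}_{\mathfrak{L}-1} + 1} \\
&= \Big[ \sum_{i=1}^{\mathfrak{l}_{\mathfrak{L}-1}} \vartheta_{[\sum_{k=1}^{\mathfrak{L}} \mathfrak{l}_k (\mathfrak{l}_{k-1}+1)] - (\mathfrak{l}_{\mathfrak{L}-1} - i + 1)} \, x_i \Big]
+ \vartheta_{\sum_{k=1}^{\mathfrak{L}} \mathfrak{l}_k (\mathfrak{l}_{k-1}+1)} \\
&= \Big[ \sum_{i=1}^{\mathfrak{l}_{\mathfrak{L}-1}} \vartheta_{\mathbf{d} - (\mathfrak{l}_{\mathfrak{L}-1} - i + 1)} \, x_i \Big] + \vartheta_{\mathbf{d}}
= \vartheta_{\mathbf{d}} = f(m)
\end{aligned} \tag{4.198}
\]

(cf. Definition 1.1.1). Combining this with the fact that f (m) ∈ [u, v] ensures that for all
x ∈ R^{𝔩𝔏−1} it holds that
\[
\begin{aligned}
\Big( \mathfrak{C}_{u,v,\mathfrak{l}_{\mathfrak{L}}} \circ \mathcal{A}^{\vartheta, \sum_{k=1}^{\mathfrak{L}-1} \mathfrak{l}_k (\mathfrak{l}_{k-1}+1)}_{\mathfrak{l}_{\mathfrak{L}}, \mathfrak{l}_{\mathfrak{L}-1}} \Big)(x)
&= \Big( \mathfrak{C}_{u,v,1} \circ \mathcal{A}^{\vartheta, \sum_{k=1}^{\mathfrak{L}-1} \mathfrak{l}_k (\mathfrak{l}_{k-1}+1)}_{1, \mathfrak{l}_{\mathfrak{L}-1}} \Big)(x) \\
&= \mathfrak{c}_{u,v}(f(m)) = \max\{ u, \min\{ f(m), v \} \} \\
&= \max\{ u, f(m) \} = f(m)
\end{aligned} \tag{4.199}
\]

(cf. Definitions 1.2.9 and 1.2.10). This implies for all x ∈ R^d that
\[
\mathscr{N}^{\vartheta, \mathfrak{l}}_{u,v}(x) = f(m). \tag{4.200}
\]

Furthermore, note that (4.197) demonstrates that for all x ∈ [a, m1 ], 𝔵 ∈ [m1 , b] it holds
that
\[
\begin{aligned}
|m_1 - x| &= m_1 - x = \tfrac{a+b}{2} - x \le \tfrac{a+b}{2} - a = \tfrac{b-a}{2} \\
\text{and}\qquad |m_1 - \mathfrak{x}| &= \mathfrak{x} - m_1 = \mathfrak{x} - \tfrac{a+b}{2} \le b - \tfrac{a+b}{2} = \tfrac{b-a}{2}.
\end{aligned} \tag{4.201}
\]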


The assumption that ∀ x, y ∈ [a, b]^d : |f (x) − f (y)| ≤ L∥x − y∥1 and (4.200) therefore prove
that for all x = (x1 , x2 , . . . , xd ) ∈ [a, b]^d it holds that
\[
\begin{aligned}
\big| \mathscr{N}^{\vartheta, \mathfrak{l}}_{u,v}(x) - f(x) \big|
&= |f(m) - f(x)| \le L \| m - x \|_1 = L \sum_{i=1}^{d} |m_i - x_i| \\
&= L \sum_{i=1}^{d} |m_1 - x_i| \le \sum_{i=1}^{d} \frac{L(b-a)}{2} = \frac{dL(b-a)}{2}.
\end{aligned} \tag{4.202}
\]

This and the fact that ∥ϑ∥∞ = maxi∈{1,2,...,d} |ϑi | = |f (m)| ≤ supx∈[a,b]d |f (x)| establish
(4.196). The proof of Lemma 4.4.11 is thus complete.
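The bound (4.196) can be illustrated numerically. The sketch below is only a sanity check under stated assumptions: the target function f(x) = ⟨w, x⟩ with a randomly drawn weight vector w is a hypothetical example (it is Lipschitz continuous with respect to the 1-norm with constant max_i |w_i|), and the supremum over [a, b]^d is replaced by a maximum over random sample points. The constant network from the proof outputs f(m) for every input, so its error at x is |f(m) − f(x)|.

```python
import numpy as np

rng = np.random.default_rng(0)
d, a, b = 3, -1.0, 2.0

# Hypothetical target function f(x) = <w, x>; Lipschitz w.r.t. the 1-norm
# with constant max_i |w_i|.
w = rng.uniform(-1.0, 1.0, size=d)
lipschitz = np.max(np.abs(w))

def f(x):
    return x @ w

m = np.full(d, (a + b) / 2.0)                 # midpoint of the cube, cf. (4.197)
samples = rng.uniform(a, b, size=(10_000, d))

# The constant ANN from the proof realizes x -> f(m), cf. (4.200).
worst_error = np.max(np.abs(f(m) - f(samples)))
bound = d * lipschitz * (b - a) / 2.0         # right-hand side of (4.196)

print(worst_error, bound)
```

The sampled error can only underestimate the true supremum, so this check cannot replace the proof; it merely shows the inequality in a concrete case.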

Proposition 4.4.12. Let d, 𝔡, 𝔏 ∈ N, A ∈ (0, ∞), L, a ∈ R, b ∈ (a, ∞), u ∈ [−∞, ∞),
v ∈ (u, ∞], 𝔩 = (𝔩0 , 𝔩1 , . . . , 𝔩𝔏 ) ∈ N^{𝔏+1} , assume

\[
\mathfrak{L} \ge 1 + \big( \lceil \log_2(A/(2d)) \rceil + 1 \big) \mathbb{1}_{(6^d, \infty)}(A), \qquad
\mathfrak{l}_0 = d, \qquad \mathfrak{l}_1 \ge A \, \mathbb{1}_{(6^d, \infty)}(A), \qquad \mathfrak{l}_{\mathfrak{L}} = 1, \tag{4.203}
\]

and 𝔡 ≥ Σ_{i=1}^{𝔏} 𝔩i (𝔩i−1 + 1), assume for all i ∈ {1, 2, . . . , 𝔏}\{1, 𝔏} that

\[
\mathfrak{l}_i \ge 3 \lceil A/(2^i d) \rceil \, \mathbb{1}_{(6^d, \infty)}(A), \tag{4.204}
\]

and let f : [a, b]^d → ([u, v] ∩ R) satisfy for all x, y ∈ [a, b]^d that

\[
|f(x) - f(y)| \le L \| x - y \|_1 \tag{4.205}
\]

(cf. Definitions 3.3.4 and 4.2.6). Then there exists ϑ ∈ R^𝔡 such that ∥ϑ∥∞ ≤ max{1, L, |a|,
|b|, 2[supx∈[a,b]^d |f (x)|]} and

\[
\sup\nolimits_{x \in [a,b]^d} \big| \mathscr{N}^{\vartheta, \mathfrak{l}}_{u,v}(x) - f(x) \big| \le \frac{3dL(b-a)}{A^{1/d}} \tag{4.206}
\]
(cf. Definition 4.4.1).

Proof of Proposition 4.4.12. Throughout this proof, assume without loss of generality that
A > 6^d (cf. Lemma 4.4.11), let Z = ⌊(A/(2d))^{1/d}⌋ ∈ ℤ. Observe that the fact that for all k ∈ N
it holds that 2k ≤ 2(2^{k−1}) = 2^k shows that 3^d = 6^d/2^d ≤ A/(2d). Hence, we obtain that

\[
2 \le \tfrac{2}{3} \big( \tfrac{A}{2d} \big)^{1/d} \le \big( \tfrac{A}{2d} \big)^{1/d} - 1 < Z. \tag{4.207}
\]

In the next step let r = d(b−a)/(2Z) ∈ (0, ∞), let δ : [a, b]^d × [a, b]^d → R satisfy for all x, y ∈ [a, b]^d
that δ(x, y) = ∥x − y∥1 , and let K = max(2, C^{([a,b]^d ,δ),r}) ∈ N ∪ {∞} (cf. Definition 4.3.2).
Note that (4.207) and Lemma 4.3.4 ensure that

\[
K = \max\big\{ 2, \mathcal{C}^{([a,b]^d, \delta), r} \big\}
\le \max\Big\{ 2, \big( \big\lceil \tfrac{d(b-a)}{2r} \big\rceil \big)^d \Big\}
= \max\{ 2, (\lceil Z \rceil)^d \} = Z^d < \infty. \tag{4.208}
\]


This implies that


\[
4 \le 2dK \le 2d Z^d \le \frac{2dA}{2d} = A. \tag{4.209}
\]
Combining this and the fact that 𝔏 ≥ 1 + (⌈log2 (A/(2d))⌉ + 1) 1_{(6^d ,∞)} (A) = ⌈log2 (A/(2d))⌉ +
2 therefore demonstrates that ⌈log2 (K)⌉ ≤ ⌈log2 (A/(2d))⌉ ≤ 𝔏 − 2. This, (4.209), the
assumption that 𝔩1 ≥ A 1_{(6^d ,∞)} (A) = A, and the assumption that ∀ i ∈ {2, 3, . . . , 𝔏 − 1} : 𝔩i ≥
3⌈A/(2^i d)⌉ 1_{(6^d ,∞)} (A) = 3⌈A/(2^i d)⌉ prove that for all i ∈ {2, 3, . . . , 𝔏 − 1} it holds that

\[
\mathfrak{L} \ge \lceil \log_2(K) \rceil + 2, \qquad \mathfrak{l}_1 \ge A \ge 2dK, \qquad\text{and}\qquad
\mathfrak{l}_i \ge 3 \big\lceil \tfrac{A}{2^i d} \big\rceil \ge 3 \big\lceil \tfrac{K}{2^{i-1}} \big\rceil. \tag{4.210}
\]

Let x1 , x2 , . . . , xK ∈ [a, b]d satisfy

\[
\sup\nolimits_{x \in [a,b]^d} \Big[ \inf\nolimits_{k \in \{1, 2, \ldots, K\}} \delta(x, x_k) \Big] \le r. \tag{4.211}
\]

Observe that (4.210), the assumptions that 𝔩0 = d, 𝔩𝔏 = 1, 𝔡 ≥ Σ_{i=1}^{𝔏} 𝔩i (𝔩i−1 + 1), and
∀ x, y ∈ [a, b]^d : |f (x) − f (y)| ≤ L∥x − y∥1 , and Corollary 4.4.10 establish that there exists
ϑ ∈ R^𝔡 such that

∥ϑ∥∞ ≤ max{1, L, maxk∈{1,2,...,K} ∥xk ∥∞ , 2 maxk∈{1,2,...,K} |f (xk )|} (4.212)

and
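\[
\begin{aligned}
\sup\nolimits_{x \in [a,b]^d} \big| \mathscr{N}^{\vartheta, \mathfrak{l}}_{u,v}(x) - f(x) \big|
&\le 2L \Big[ \sup\nolimits_{x \in [a,b]^d} \Big( \inf\nolimits_{k \in \{1, 2, \ldots, K\}} \| x - x_k \|_1 \Big) \Big] \\
&= 2L \Big[ \sup\nolimits_{x \in [a,b]^d} \Big( \inf\nolimits_{k \in \{1, 2, \ldots, K\}} \delta(x, x_k) \Big) \Big].
\end{aligned} \tag{4.213}
\]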

Note that (4.212) shows that

∥ϑ∥∞ ≤ max{1, L, |a|, |b|, 2 supx∈[a,b]d |f (x)|}. (4.214)

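Furthermore, observe that (4.213), (4.207), (4.211), and the fact that for all k ∈ N it holds
that 2k ≤ 2(2^{k−1}) = 2^k ensure that
\[
\begin{aligned}
\sup\nolimits_{x \in [a,b]^d} \big| \mathscr{N}^{\vartheta, \mathfrak{l}}_{u,v}(x) - f(x) \big|
&\le 2L \Big[ \sup\nolimits_{x \in [a,b]^d} \Big( \inf\nolimits_{k \in \{1, 2, \ldots, K\}} \delta(x, x_k) \Big) \Big] \\
&\le 2Lr = \frac{dL(b-a)}{Z} \le \frac{dL(b-a)}{\frac{2}{3} \big( \frac{A}{2d} \big)^{1/d}}
= \frac{(2d)^{1/d} \, 3dL(b-a)}{2A^{1/d}} \le \frac{3dL(b-a)}{A^{1/d}}.
\end{aligned} \tag{4.215}
\]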

Combining this with (4.214) implies (4.206). The proof of Proposition 4.4.12 is thus
complete.

Corollary 4.4.13. Let d ∈ N, a ∈ R, b ∈ (a, ∞), L ∈ (0, ∞) and let f : [a, b]^d → R satisfy
for all x, y ∈ [a, b]^d that

\[
|f(x) - f(y)| \le L \| x - y \|_1 \tag{4.216}
\]

(cf. Definition 3.3.4). Then there exists C ∈ R such that for all ε ∈ (0, 1] there exists F ∈ N
such that

\[
\mathcal{H}(F) \le \max\big\{ 0, d \big( \log_2(\varepsilon^{-1}) + \log_2(d) + \log_2(3L(b-a)) + 1 \big) \big\}, \tag{4.217}
\]

\[
\| \mathcal{T}(F) \|_\infty \le \max\big\{ 1, L, |a|, |b|, 2 \big[ \sup\nolimits_{x \in [a,b]^d} |f(x)| \big] \big\}, \qquad
\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(F) \in C(\mathbb{R}^d, \mathbb{R}), \tag{4.218}
\]

\[
\sup\nolimits_{x \in [a,b]^d} \big| (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(F))(x) - f(x) \big| \le \varepsilon, \qquad\text{and}\qquad
\mathcal{P}(F) \le C \varepsilon^{-2d} \tag{4.219}
\]
(cf. Definitions 1.2.4, 1.3.1, 1.3.4, and 1.3.5).

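Proof of Corollary 4.4.13. Throughout this proof let C ∈ R satisfy

\[
C = \tfrac{9}{8} \big( 3dL(b-a) \big)^{2d} + (d + 22) \big( 3dL(b-a) \big)^{d} + d + 11, \tag{4.220}
\]
for every ε ∈ (0, 1] let Aε ∈ (0, ∞), 𝔏ε ∈ N, 𝔩^{(ε)} = (𝔩0^{(ε)} , 𝔩1^{(ε)} , . . . , 𝔩_{𝔏ε}^{(ε)}) ∈ N^{𝔏ε +1} satisfy
\[
A_\varepsilon = \Big( \frac{3dL(b-a)}{\varepsilon} \Big)^{d}, \qquad
\mathfrak{L}_\varepsilon = 1 + \Big( \big\lceil \log_2\big( \tfrac{A_\varepsilon}{2d} \big) \big\rceil + 1 \Big) \mathbb{1}_{(6^d, \infty)}(A_\varepsilon), \tag{4.221}
\]

\[
\mathfrak{l}^{(\varepsilon)}_0 = d, \qquad
\mathfrak{l}^{(\varepsilon)}_1 = \lfloor A_\varepsilon \rfloor \mathbb{1}_{(6^d, \infty)}(A_\varepsilon) + 1, \qquad\text{and}\qquad
\mathfrak{l}^{(\varepsilon)}_{\mathfrak{L}_\varepsilon} = 1, \tag{4.222}
\]
and assume for all ε ∈ (0, 1], i ∈ {2, 3, . . . , 𝔏ε − 1} that

\[
\mathfrak{l}^{(\varepsilon)}_i = 3 \big\lceil \tfrac{A_\varepsilon}{2^i d} \big\rceil \mathbb{1}_{(6^d, \infty)}(A_\varepsilon) \tag{4.223}
\]

(cf. Definition 4.2.6). Observe that the fact that for all ε ∈ (0, 1] it holds that 𝔏ε ≥
1 + (⌈log2 (Aε /(2d))⌉ + 1) 1_{(6^d ,∞)} (Aε ), the fact that for all ε ∈ (0, 1] it holds that 𝔩0^{(ε)} = d,
the fact that for all ε ∈ (0, 1] it holds that 𝔩1^{(ε)} ≥ Aε 1_{(6^d ,∞)} (Aε ), the fact that for all
ε ∈ (0, 1] it holds that 𝔩_{𝔏ε}^{(ε)} = 1, the fact that for all ε ∈ (0, 1], i ∈ {2, 3, . . . , 𝔏ε − 1}
it holds that 𝔩i^{(ε)} ≥ 3⌈Aε /(2^i d)⌉ 1_{(6^d ,∞)} (Aε ), Proposition 4.4.12, and Lemma 4.4.2 demonstrate
that for all ε ∈ (0, 1] there exists Fε ∈ ( ×_{i=1}^{𝔏ε} (R^{𝔩i^{(ε)} × 𝔩i−1^{(ε)}} × R^{𝔩i^{(ε)}}) ) ⊆ N which satisfies
∥T (Fε )∥∞ ≤ max{1, L, |a|, |b|, 2[supx∈[a,b]^d |f (x)|]} and

\[
\sup\nolimits_{x \in [a,b]^d} \big| (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(F_\varepsilon))(x) - f(x) \big|
\le \frac{3dL(b-a)}{(A_\varepsilon)^{1/d}} = \varepsilon \tag{4.224}
\]

(cf. Definitions 1.2.4, 1.3.1, 1.3.4, and 1.3.5). Furthermore, note that the fact that d ≥ 1
proves that for all ε ∈ (0, 1] it holds that

\[
\begin{aligned}
\mathcal{H}(F_\varepsilon) = \mathfrak{L}_\varepsilon - 1
&= \Big( \big\lceil \log_2\big( \tfrac{A_\varepsilon}{2d} \big) \big\rceil + 1 \Big) \mathbb{1}_{(6^d, \infty)}(A_\varepsilon) \\
&= \big\lceil \log_2\big( \tfrac{A_\varepsilon}{d} \big) \big\rceil \mathbb{1}_{(6^d, \infty)}(A_\varepsilon)
\le \max\{ 0, \log_2(A_\varepsilon) + 1 \}.
\end{aligned} \tag{4.225}
\]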

165
Chapter 4: Multi-dimensional ANN approximation results

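Combining this and the fact that for all ε ∈ (0, 1] it holds that
\[
\log_2(A_\varepsilon) = d \log_2\Big( \frac{3dL(b-a)}{\varepsilon} \Big)
= d \big( \log_2(\varepsilon^{-1}) + \log_2(d) + \log_2(3L(b-a)) \big) \tag{4.226}
\]

establishes that for all ε ∈ (0, 1] it holds that

\[
\mathcal{H}(F_\varepsilon) \le \max\big\{ 0, d \big( \log_2(\varepsilon^{-1}) + \log_2(d) + \log_2(3L(b-a)) \big) + 1 \big\}. \tag{4.227}
\]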
 

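Moreover, observe that (4.222) and (4.223) show that for all ε ∈ (0, 1] it holds that

\[
\begin{aligned}
\mathcal{P}(F_\varepsilon)
&= \sum_{i=1}^{\mathfrak{L}_\varepsilon} \mathfrak{l}^{(\varepsilon)}_i \big( \mathfrak{l}^{(\varepsilon)}_{i-1} + 1 \big) \\
&\le \big( \lfloor A_\varepsilon \rfloor + 1 \big)(d + 1) + 3 \big\lceil \tfrac{A_\varepsilon}{4d} \big\rceil \big( \lfloor A_\varepsilon \rfloor + 2 \big)
+ \max\big\{ \lfloor A_\varepsilon \rfloor + 1, 3 \big\lceil \tfrac{A_\varepsilon}{2^{\mathfrak{L}_\varepsilon - 1} d} \big\rceil \big\} + 1 \\
&\quad + \sum_{i=3}^{\mathfrak{L}_\varepsilon - 1} 3 \big\lceil \tfrac{A_\varepsilon}{2^i d} \big\rceil \big( 3 \big\lceil \tfrac{A_\varepsilon}{2^{i-1} d} \big\rceil + 1 \big) \\
&\le (A_\varepsilon + 1)(d + 1) + 3 \big( \tfrac{A_\varepsilon}{4} + 1 \big)(A_\varepsilon + 2) + 3A_\varepsilon + 4
+ \sum_{i=3}^{\mathfrak{L}_\varepsilon - 1} 3 \big( \tfrac{A_\varepsilon}{2^i} + 1 \big) \big( \tfrac{3A_\varepsilon}{2^{i-1}} + 4 \big).
\end{aligned} \tag{4.228}
\]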

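In addition, note that the fact that ∀ x ∈ (0, ∞) : log2 (x) = log2 (x/2) + 1 ≤ x/2 + 1 ensures
that for all ε ∈ (0, 1] it holds that

\[
\mathfrak{L}_\varepsilon \le 2 + \log_2\big( \tfrac{A_\varepsilon}{d} \big) \le 3 + \tfrac{A_\varepsilon}{2d} \le 3 + \tfrac{A_\varepsilon}{2}. \tag{4.229}
\]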

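This implies that for all ε ∈ (0, 1] it holds that

\[
\begin{aligned}
\sum_{i=3}^{\mathfrak{L}_\varepsilon - 1} 3 \big( \tfrac{A_\varepsilon}{2^i} + 1 \big) \big( \tfrac{3A_\varepsilon}{2^{i-1}} + 4 \big)
&\le 9 (A_\varepsilon)^2 \Big[ \sum_{i=3}^{\mathfrak{L}_\varepsilon - 1} 2^{1-2i} \Big]
+ 12 A_\varepsilon \Big[ \sum_{i=3}^{\mathfrak{L}_\varepsilon - 1} 2^{-i} \Big]
+ 9 A_\varepsilon \Big[ \sum_{i=3}^{\mathfrak{L}_\varepsilon - 1} 2^{1-i} \Big]
+ 12 (\mathfrak{L}_\varepsilon - 3) \\
&\le \tfrac{9 (A_\varepsilon)^2}{8} \Big[ \sum_{i=1}^{\infty} 4^{-i} \Big]
+ 3 A_\varepsilon \Big[ \sum_{i=1}^{\infty} 2^{-i} \Big]
+ \tfrac{9 A_\varepsilon}{2} \Big[ \sum_{i=1}^{\infty} 2^{-i} \Big]
+ 6 A_\varepsilon \\
&= \tfrac{3}{8} (A_\varepsilon)^2 + 3 A_\varepsilon + \tfrac{9}{2} A_\varepsilon + 6 A_\varepsilon
= \tfrac{3}{8} (A_\varepsilon)^2 + \tfrac{27}{2} A_\varepsilon.
\end{aligned} \tag{4.230}
\]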

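This and (4.228) demonstrate that for all ε ∈ (0, 1] it holds that

\[
\begin{aligned}
\mathcal{P}(F_\varepsilon)
&\le \big( \tfrac{3}{4} + \tfrac{3}{8} \big) (A_\varepsilon)^2 + \big( d + 1 + \tfrac{9}{2} + 3 + \tfrac{27}{2} \big) A_\varepsilon + d + 1 + 6 + 4 \\
&= \tfrac{9}{8} (A_\varepsilon)^2 + (d + 22) A_\varepsilon + d + 11.
\end{aligned} \tag{4.231}
\]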


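Combining this, (4.220), and (4.221) proves that

\[
\begin{aligned}
\mathcal{P}(F_\varepsilon)
&\le \tfrac{9}{8} \big( 3dL(b-a) \big)^{2d} \varepsilon^{-2d} + (d + 22) \big( 3dL(b-a) \big)^{d} \varepsilon^{-d} + d + 11 \\
&\le \Big[ \tfrac{9}{8} \big( 3dL(b-a) \big)^{2d} + (d + 22) \big( 3dL(b-a) \big)^{d} + d + 11 \Big] \varepsilon^{-2d} = C \varepsilon^{-2d}.
\end{aligned} \tag{4.232}
\]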

Combining this with (4.224) and (4.227) establishes (4.217), (4.218), and (4.219). The
proof of Corollary 4.4.13 is thus complete.
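
The architecture chosen in the proof above can be tabulated explicitly. The following sketch is an illustration under stated assumptions: the function names `architecture`, `num_parameters`, and `constant_c` are hypothetical, the layer widths follow (4.221), (4.222), and (4.223), and the parameter count is the sum appearing in (4.228). The example values d = 2, L = 1, [a, b] = [0, 1], and ε = 1/2 are chosen arbitrarily.

```python
import math

def architecture(d, lip, a, b, eps):
    """Layer widths (l_0, ..., l_depth) as in (4.221)-(4.223); hypothetical helper."""
    A = (3 * d * lip * (b - a) / eps) ** d
    if A <= 6 ** d:
        # The indicator in (4.221) vanishes: one affine layer with a single output.
        return A, [d, 1]
    depth = 1 + math.ceil(math.log2(A / (2 * d))) + 1
    widths = [d, math.floor(A) + 1]
    widths += [3 * math.ceil(A / (2 ** i * d)) for i in range(2, depth)]
    widths.append(1)
    return A, widths

def num_parameters(widths):
    """Number of weights and biases of a fully-connected network with these widths."""
    return sum(n_out * (n_in + 1) for n_in, n_out in zip(widths[:-1], widths[1:]))

def constant_c(d, lip, a, b):
    """The constant C from (4.220)."""
    base = 3 * d * lip * (b - a)
    return 9 / 8 * base ** (2 * d) + (d + 22) * base ** d + d + 11

d, lip, a, b, eps = 2, 1.0, 0.0, 1.0, 0.5
A, widths = architecture(d, lip, a, b, eps)
print(widths, num_parameters(widths), constant_c(d, lip, a, b) * eps ** (-2 * d))
```

In this example the parameter count stays well below C ε^{−2d}, consistent with (4.232); the gap reflects the generous estimates in (4.228) and (4.230).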
Remark 4.4.14 (High-dimensional ANN approximation results). Corollary 4.4.13 above is a
multi-dimensional ANN approximation result in the sense that the input dimension d ∈ N
of the domain of definition [a, b]d of the considered target function f that we intend to
approximate can be any natural number. However, we note that Corollary 4.4.13 does
not provide a useful contribution in the case when the dimension d is large, say d ≥ 5, as
Corollary 4.4.13 does not provide any information on how the constant C in (4.219) grows
in d and as the dimension d appears in the exponent of the reciprocal ε−1 of the prescribed
approximation accuracy ε in the bound for the number of ANN parameters in (4.219).
In the literature there are also a number of suitable high-dimensional ANN approximation
results which assure that the constant in the parameter bound grows at most polynomially
in the dimension d and which assure that the exponent of the reciprocal ε−1 of the prescribed
approximation accuracy ε in the ANN parameter bound is completely independent of the
dimension d. Such results do have the potential to provide a useful practical conclusion for
ANN approximations even when the dimension d is large. We refer, for example, to [14, 15,
28, 70, 121, 160] and the references therein for such high-dimensional ANN approximation
results in the context of general classes of target functions and we refer, for instance, to [3,
29, 35, 123, 128, 161–163, 177, 179, 205, 209, 228, 259, 353] and the references therein for
such high-dimensional ANN approximation results where the target functions are solutions
of PDEs (cf. also Section 18.4 below).
Remark 4.4.15 (Infinite dimensional ANN approximation results). In the literature there
are now also results where the target function that we intend to approximate is defined on
an infinite dimensional vector space and where the dimension of the domain of definition
of the target function is thus infinity (see, for example, [32, 68, 69, 202, 255, 363] and the
references therein). This perspective seems to be very reasonable as in many applications,
input data, such as images and videos, that should be processed through the target function
are more naturally represented by elements of infinite dimensional spaces instead of elements
of finite dimensional spaces.

Part III

Optimization

Chapter 5

Optimization through gradient flow (GF) trajectories

In Chapters 6 and 7 below we study deterministic and stochastic GD-type optimization


methods from the literature. Such methods are widely used in machine learning problems to
approximately minimize suitable objective functions. The SGD-type optimization methods
in Chapter 7 can be viewed as suitable Monte Carlo approximations of the deterministic
GD-type optimization methods in Chapter 6 and the deterministic GD-type optimization
methods in Chapter 6 can, roughly speaking, be viewed as time-discrete approximations of
solutions of suitable GF ODEs. To develop intuitions for GD-type optimization methods
and for some of the tools which we employ to analyze such methods, we study in this
chapter such GF ODEs. In particular, we show in this chapter how such GF ODEs can be
used to approximately solve appropriate optimization problems.
Further investigations on optimization through GF ODEs can, for example, be found in
[2, 44, 126, 216, 224, 225, 258] and the references therein.

5.1 Introductory comments for the training of ANNs


Key components of deep supervised learning algorithms are typically deep ANNs and also
suitable gradient based optimization methods. In Parts I and II we have introduced and
studied different types of ANNs while in Part III we introduce and study gradient based
optimization methods. In this section we briefly outline the main ideas behind gradient
based optimization methods and sketch how such gradient based optimization methods arise
within deep supervised learning algorithms. To do this, we now recall the deep supervised
learning framework from the introduction.
Specifically, let d, M ∈ N, E ∈ C(Rd , R), x1 , x2 , . . . , xM +1 ∈ Rd , y1 , y2 , . . . , yM ∈ R
satisfy for all m ∈ {1, 2, . . . , M } that
ym = E(xm ) (5.1)


and let L : C(Rd , R) → [0, ∞) satisfy for all ϕ ∈ C(Rd , R) that

\[
L(\phi) = \frac{1}{M} \Big[ \sum_{m=1}^{M} |\phi(x_m) - y_m|^2 \Big]. \tag{5.2}
\]

As in the introduction we think of M ∈ N as the number of available known input-output


data pairs, we think of d ∈ N as the dimension of the input data, we think of E : Rd → R
as an unknown function which relates input and output data through (5.1), we think of
x1 , x2 , . . . , xM +1 ∈ Rd as the available known input data, we think of y1 , y2 , . . . , yM ∈ R
as the available known output data, and we have that the function L : C(Rd , R) → [0, ∞)
in (5.2) is the objective function (the function we want to minimize) in the optimization
problem associated to the considered learning problem (cf. (3) in the introduction). In
particular, observe that

L(E) = 0 (5.3)

and we are trying to approximate the function E by computing an approximate minimizer of


the function L : C(Rd , R) → [0, ∞). In order to make this optimization problem amenable
to numerical computations, we consider a spatially discretized version of the optimiza-
tion problem associated to (5.2) by employing parametrizations of ANNs (cf. (7) in the
introduction).
More formally, let a : R → R be differentiable, let h ∈ N, l1 , l2 , . . . , lh , 𝔡 ∈ N satisfy
𝔡 = l1 (d + 1) + [ Σ_{k=2}^{h} lk (lk−1 + 1) ] + lh + 1, and consider the parametrization function
\[
\mathbb{R}^{\mathfrak{d}} \ni \theta \mapsto \mathcal{N}^{\theta, d}_{\mathfrak{M}_{a, l_1}, \mathfrak{M}_{a, l_2}, \ldots, \mathfrak{M}_{a, l_h}, \operatorname{id}_{\mathbb{R}}} \in C(\mathbb{R}^d, \mathbb{R}) \tag{5.4}
\]

(cf. Definitions 1.1.3 and 1.2.1). Note that h is the number of hidden layers of the ANNs
in (5.4), note for every i ∈ {1, 2, . . . , h} that li ∈ N is the number of neurons in the i-th
hidden layer of the ANNs in (5.4), and note that 𝔡 is the number of real parameters used
to describe the ANNs in (5.4). Observe that for every θ ∈ R^𝔡 we have that the function
\[
\mathbb{R}^{d} \ni x \mapsto \mathcal{N}^{\theta, d}_{\mathfrak{M}_{a, l_1}, \mathfrak{M}_{a, l_2}, \ldots, \mathfrak{M}_{a, l_h}, \operatorname{id}_{\mathbb{R}}}(x) \in \mathbb{R} \tag{5.5}
\]

in (5.4) is nothing else than the realization function associated to a fully-connected feed-
forward ANN where before each hidden layer a multidimensional version of the activation
function a : R → R is applied. We restrict ourselves in this section to a differentiable
activation function as this differentiability property allows us to consider gradients (cf. (5.7),
(5.8), and Section 5.3.2 below for details).
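
The parametrization in (5.4) can be made concrete with a small NumPy sketch. It is only an illustration under stated assumptions: the function names `num_params`, `realization`, and `mse_loss` are hypothetical, the activation a is taken to be tanh as one example of a differentiable activation, and the entries of θ are assumed to be consumed layer by layer, first the weight matrix in row-major order and then the bias vector. The last function evaluates the empirical risk (1/M) Σ_m |ϕ(x_m) − y_m|² from (5.2) for the network realization ϕ.

```python
import numpy as np

def num_params(d, hidden):
    """Number of real parameters for input dimension d, hidden widths l_1..l_h, output 1."""
    dims = [d] + list(hidden) + [1]
    return sum(n_out * (n_in + 1) for n_in, n_out in zip(dims[:-1], dims[1:]))

def realization(theta, d, hidden, activation=np.tanh):
    """Return the function x -> network output for the flat parameter vector theta."""
    dims = [d] + list(hidden) + [1]

    def network(x):
        offset = 0
        for k, (n_in, n_out) in enumerate(zip(dims[:-1], dims[1:])):
            w = theta[offset:offset + n_out * n_in].reshape(n_out, n_in)
            offset += n_out * n_in
            bias = theta[offset:offset + n_out]
            offset += n_out
            x = w @ x + bias
            if k < len(dims) - 2:          # no activation after the output layer
                x = activation(x)
        return x[0]

    return network

def mse_loss(theta, d, hidden, xs, ys):
    """Empirical risk (1/M) * sum_m |N(x_m) - y_m|^2 for the network given by theta."""
    net = realization(theta, d, hidden)
    return np.mean([(net(x) - y) ** 2 for x, y in zip(xs, ys)])

d, hidden = 2, (3, 4)
theta = np.zeros(num_params(d, hidden))
theta[-1] = 2.5                            # only the output bias is non-zero
xs = [np.array([0.1, -0.2]), np.array([1.0, 1.0])]
ys = [2.5, 2.0]
print(num_params(d, hidden), mse_loss(theta, d, hidden, xs, ys))
```

With all parameters except the output bias set to zero, the network is constant, which is the construction used in the proof of Lemma 4.4.11.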
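We now discretize the optimization problem in (5.2) as the problem of computing
approximate minimizers of the function L : R^𝔡 → [0, ∞) which satisfies for all θ ∈ R^𝔡 that
\[
L(\theta) = \frac{1}{M} \Big[ \sum_{m=1}^{M} \big| \mathcal{N}^{\theta, d}_{\mathfrak{M}_{a, l_1}, \mathfrak{M}_{a, l_2}, \ldots, \mathfrak{M}_{a, l_h}, \operatorname{id}_{\mathbb{R}}}(x_m) - y_m \big|^2 \Big] \tag{5.6}
\]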


and this resulting optimization problem is now accessible to numerical computations.


Specifically, deep learning algorithms solve optimization problems of the type (5.6) by means
of gradient based optimization methods. Loosely speaking, gradient based optimization
methods aim to minimize the considered objective function (such as (5.6) above) by
performing successive steps based on the direction of the negative gradient of the objective
function. One of the simplest gradient based optimization methods is the plain-vanilla
GD optimization method which performs successive steps in the direction of the negative
gradient and we now sketch the GD optimization method applied to (5.6). Let ξ ∈ Rd , let
(γn )n∈N ⊆ [0, ∞), and let θ = (θn )n∈N0 : N0 → Rd satisfy for all n ∈ N that

θ0 = ξ and θn = θn−1 − γn (∇L)(θn−1 ). (5.7)

The process (θn )n∈N0 is the GD process for the minimization problem associated to (5.6)
with learning rates (γn )n∈N and initial value ξ (see Definition 6.1.1 below for the precise
definition).
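
The recursion (5.7) can be sketched in a few lines of NumPy. The example below is an illustration under stated assumptions: instead of the ANN objective (5.6) it uses the quadratic objective from Figure 5.2 (an assumption made to keep the gradient explicit), the function names are hypothetical, and the constant learning rate 0.05 is an arbitrary choice.

```python
import numpy as np

# Quadratic objective from Figure 5.2: L(theta) = 0.5 * sum_i K_i * (theta_i - vartheta_i)^2
K = np.array([1.0, 10.0])
vartheta = np.array([1.0, 1.0])

def loss(theta):
    return 0.5 * np.sum(K * (theta - vartheta) ** 2)

def grad_loss(theta):
    return K * (theta - vartheta)

def gradient_descent(xi, learning_rates):
    """Iterate theta_n = theta_{n-1} - gamma_n * grad L(theta_{n-1}) starting from theta_0 = xi."""
    theta = np.array(xi, dtype=float)
    trajectory = [theta.copy()]
    for gamma in learning_rates:
        theta = theta - gamma * grad_loss(theta)
        trajectory.append(theta.copy())
    return np.array(trajectory)

trajectory = gradient_descent(xi=[5.0, 3.0], learning_rates=[0.05] * 200)
print(trajectory[-1], loss(trajectory[-1]))
```

For this objective the iteration contracts each coordinate by the factor 1 − γK_i per step, so the learning rate must stay below 2/max_i K_i for the process to converge.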
This plain-vanilla GD optimization method and related GD-type optimization methods
can be regarded as discretizations of solutions of GF ODEs. In the context of the min-
imization problem in (5.6) such solutions of GF ODEs can be described as follows. Let
Θ = (Θt )t∈[0,∞) : [0, ∞) → Rd be a continuously differentiable function which satisfies for all
t ∈ [0, ∞) that

\[
\Theta_0 = \xi \qquad\text{and}\qquad \dot{\Theta}_t = \tfrac{\partial}{\partial t} \Theta_t = -(\nabla L)(\Theta_t). \tag{5.8}
\]

The process (Θt )t∈[0,∞) is the solution of the GF ODE corresponding to the minimization
problem associated to (5.6) with initial value ξ.
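
The relation between (5.7) and (5.8) can be illustrated on an example where the GF ODE has a closed-form solution. The sketch below rests on stated assumptions: it again uses the quadratic objective from Figure 5.2 rather than (5.6), for which Θ_t = ϑ + exp(−Kt)(ξ − ϑ) holds componentwise, and the function names are hypothetical. The explicit Euler method with step size h applied to (5.8) is exactly the GD recursion (5.7) with constant learning rate h.

```python
import numpy as np

K = np.array([1.0, 10.0])
vartheta = np.array([1.0, 1.0])

def grad_loss(theta):
    # Gradient of L(theta) = 0.5 * sum_i K_i * (theta_i - vartheta_i)^2
    return K * (theta - vartheta)

def gradient_flow_exact(xi, t):
    """Closed-form solution of the GF ODE for the quadratic objective."""
    return vartheta + np.exp(-K * t) * (np.asarray(xi, dtype=float) - vartheta)

def explicit_euler(xi, t, n_steps):
    """Explicit Euler for the GF ODE; identical to GD with learning rate t / n_steps."""
    h = t / n_steps
    theta = np.array(xi, dtype=float)
    for _ in range(n_steps):
        theta = theta - h * grad_loss(theta)
    return theta

xi, t = [5.0, 3.0], 1.0
for n_steps in (100, 1_000, 10_000):
    error = np.linalg.norm(explicit_euler(xi, t, n_steps) - gradient_flow_exact(xi, t))
    print(n_steps, error)
```

The printed errors shrink as the step size decreases, which is the sense in which GD is a time-discrete approximation of the GF trajectory.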
In Chapter 6 below we introduce and study deterministic GD-type optimization methods
such as the GD optimization method in (5.7). To develop intuitions for GD-type optimization
methods and for some of the tools which we employ to analyze such GD-type optimization
methods, we study in the remainder of this chapter GF ODEs such as (5.8) above. Deep
learning algorithms, however, usually do not employ GD-type optimization methods themselves
but stochastic variants of GD-type optimization methods to solve optimization problems of the
form (5.6). Such SGD-type optimization methods can be viewed as suitable Monte Carlo
approximations of deterministic GD-type methods and in Chapter 7 below we treat such
SGD-type optimization methods.

5.2 Basics for GFs


5.2.1 GF ordinary differential equations (ODEs)
Definition 5.2.1 (GF trajectories). Let d ∈ N, ξ ∈ Rd , let L : Rd → R be a function, and
let G : Rd → Rd be a B(Rd )/B(Rd )-measurable function which satisfies for all U ∈ {V ⊆
Rd : V is open}, θ ∈ U with L|U ∈ C 1 (U, R) that

\[
G(\theta) = (\nabla L)(\theta). \tag{5.9}
\]

Then we say that Θ is a GF trajectory for the objective function L with generalized gradient
G and initial value ξ (we say that Θ is a GF trajectory for the objective function L with
initial value ξ, we say that Θ is a solution of the GF ODE for the objective function L
with generalized gradient G and initial value ξ, we say that Θ is a solution of the GF ODE
for the objective function L with initial value ξ) if and only if it holds that Θ : [0, ∞) → Rd
is a function from [0, ∞) to Rd which satisfies for all t ∈ [0, ∞) that
\[
\Theta_t = \xi - \int_0^t G(\Theta_s) \, \mathrm{d}s. \tag{5.10}
\]

5.2.2 Direction of negative gradients


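Lemma 5.2.2. Let d ∈ N, L ∈ C 1 (Rd , R), ϑ ∈ Rd , r ∈ (0, ∞) and let G : Rd → R satisfy
for all v ∈ Rd that
\[
G(v) = \lim_{h \to 0} \Big( \frac{L(\vartheta + hv) - L(\vartheta)}{h} \Big) = [L'(\vartheta)](v). \tag{5.11}
\]

Then

(i) it holds that
\[
\sup_{v \in \{ w \in \mathbb{R}^d \colon \| w \|_2 = r \}} G(v) = r \| (\nabla L)(\vartheta) \|_2
= \begin{cases}
0 & : (\nabla L)(\vartheta) = 0 \\
G\Big( \frac{r (\nabla L)(\vartheta)}{\| (\nabla L)(\vartheta) \|_2} \Big) & : (\nabla L)(\vartheta) \neq 0
\end{cases} \tag{5.12}
\]

and

(ii) it holds that
\[
\inf_{v \in \{ w \in \mathbb{R}^d \colon \| w \|_2 = r \}} G(v) = -r \| (\nabla L)(\vartheta) \|_2
= \begin{cases}
0 & : (\nabla L)(\vartheta) = 0 \\
G\Big( \frac{-r (\nabla L)(\vartheta)}{\| (\nabla L)(\vartheta) \|_2} \Big) & : (\nabla L)(\vartheta) \neq 0
\end{cases} \tag{5.13}
\]

(cf. Definition 3.3.4).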

Proof of Lemma 5.2.2. Note that (5.11) implies that for all v ∈ Rd it holds that

G(v) = ⟨(∇L)(ϑ), v⟩ (5.14)


(cf. Definition 1.4.7). The Cauchy–Schwarz inequality hence ensures that for all v ∈ Rd
with ∥v∥2 = r it holds that

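\[
\begin{aligned}
-r \| (\nabla L)(\vartheta) \|_2 = -\| (\nabla L)(\vartheta) \|_2 \| v \|_2
&\le -\langle -(\nabla L)(\vartheta), v \rangle \\
&= G(v) \le \| (\nabla L)(\vartheta) \|_2 \| v \|_2 = r \| (\nabla L)(\vartheta) \|_2
\end{aligned} \tag{5.15}
\]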

(cf. Definition 3.3.4). Furthermore, observe that (5.14) shows that for all c ∈ R it holds that

G(c(∇L)(ϑ)) = ⟨(∇L)(ϑ), c(∇L)(ϑ)⟩ = c∥(∇L)(ϑ)∥22 . (5.16)

Combining this and (5.15) proves item (i) and item (ii). The proof of Lemma 5.2.2 is thus
complete.
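
Lemma 5.2.2 can be checked numerically for a fixed gradient vector. The sketch below is an illustration under stated assumptions: the vector `gradient` stands in for (∇L)(ϑ) and is chosen arbitrarily, and the supremum and infimum over the sphere of radius r are replaced by the extreme values over randomly sampled directions, so the check can only confirm that no sampled direction beats the normalized negative gradient.

```python
import numpy as np

rng = np.random.default_rng(1)

gradient = np.array([3.0, -4.0, 12.0])     # stands in for (nabla L)(vartheta); Euclidean norm 13
r = 2.0

def G(v):
    # Directional derivative [L'(vartheta)](v) = <(nabla L)(vartheta), v>, cf. (5.14)
    return gradient @ v

# Random points on the sphere {w : ||w||_2 = r}
directions = rng.normal(size=(5000, 3))
directions *= r / np.linalg.norm(directions, axis=1, keepdims=True)
values = directions @ gradient

steepest_descent = -r * gradient / np.linalg.norm(gradient)
steepest_ascent = r * gradient / np.linalg.norm(gradient)

print(values.min(), G(steepest_descent))   # sampled minimum vs. -r * ||gradient||_2
print(values.max(), G(steepest_ascent))    # sampled maximum vs.  r * ||gradient||_2
```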

Lemma 5.2.3. Let d ∈ N, Θ ∈ C([0, ∞), Rd ), L ∈ C 1 (Rd , R) and assume for all t ∈ [0, ∞)
that Θt = Θ0 − ∫_0^t (∇L)(Θs ) ds. Then

(i) it holds that Θ ∈ C 1 ([0, ∞), Rd ),

(ii) it holds for all t ∈ (0, ∞) that Θ̇t = −(∇L)(Θt ), and

(iii) it holds for all t ∈ [0, ∞) that
\[
L(\Theta_t) = L(\Theta_0) - \int_0^t \| (\nabla L)(\Theta_s) \|_2^2 \, \mathrm{d}s \tag{5.17}
\]

(cf. Definition 3.3.4).


Proof of Lemma 5.2.3. Note that the fundamental theorem of calculus implies item (i) and
item (ii). Combining item (ii) with the fundamental theorem of calculus and the chain rule
ensures that for all t ∈ [0, ∞) it holds that
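\[
L(\Theta_t) = L(\Theta_0) + \int_0^t \langle (\nabla L)(\Theta_s), \dot{\Theta}_s \rangle \, \mathrm{d}s
= L(\Theta_0) - \int_0^t \| (\nabla L)(\Theta_s) \|_2^2 \, \mathrm{d}s \tag{5.18}
\]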

(cf. Definitions 1.4.7 and 3.3.4). This establishes item (iii). The proof of Lemma 5.2.3 is
thus complete.
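
The identity (5.17) can be verified numerically in a case where the GF trajectory is known in closed form. The sketch below rests on stated assumptions: the objective is the quadratic function from Figure 5.2, for which Θ_t = ϑ + exp(−Kt)(ξ − ϑ) holds componentwise, the integral is approximated by a hand-written trapezoidal rule on a uniform grid, and all names are hypothetical.

```python
import numpy as np

K = np.array([1.0, 10.0])
vartheta = np.array([1.0, 1.0])
xi = np.array([5.0, 3.0])

def loss(theta):
    return 0.5 * np.sum(K * (theta - vartheta) ** 2, axis=-1)

def grad_loss(theta):
    return K * (theta - vartheta)

def gradient_flow_exact(t):
    """Closed-form GF trajectory; t may be a scalar or a 1-d array of times."""
    t = np.asarray(t, dtype=float)[..., None]
    return vartheta + np.exp(-K * t) * (xi - vartheta)

t_end = 2.0
grid = np.linspace(0.0, t_end, 20_001)
integrand = np.sum(grad_loss(gradient_flow_exact(grid)) ** 2, axis=-1)

# Trapezoidal rule for the integral of ||grad L(Theta_s)||_2^2 over [0, t_end]
integral = np.sum((integrand[1:] + integrand[:-1]) / 2.0 * np.diff(grid))

left_hand_side = loss(gradient_flow_exact(t_end))
right_hand_side = loss(xi) - integral
print(left_hand_side, right_hand_side)
```

Both printed numbers agree up to the quadrature error, and the integrand is non-negative, which reflects that the objective is non-increasing along GF trajectories.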

Corollary 5.2.4 (Illustration for the negative GF). Let d ∈ N, Θ ∈ C([0, ∞), Rd ), L ∈
C 1 (Rd , R) and assume for all t ∈ [0, ∞) that Θ(t) = Θ(0) − ∫_0^t (∇L)(Θ(s)) ds. Then
(i) it holds that Θ ∈ C 1 ([0, ∞), Rd ),
(ii) it holds for all t ∈ (0, ∞) that

(L ◦ Θ)′ (t) = −∥(∇L)(Θ(t))∥22 , (5.19)

and


(iii) it holds for all Ξ ∈ C 1 ([0, ∞), Rd ), τ ∈ (0, ∞) with Ξ(τ ) = Θ(τ ) and ∥Ξ′ (τ )∥2 =
∥Θ′ (τ )∥2 that

(L ◦ Θ)′ (τ ) ≤ (L ◦ Ξ)′ (τ ) (5.20)

(cf. Definition 3.3.4).

Proof of Corollary 5.2.4. Observe that Lemma 5.2.3 and the fundamental theorem of calculus
imply item (i) and item (ii). Note that Lemma 5.2.2 shows that for all Ξ ∈ C 1 ([0, ∞), Rd ),
t ∈ (0, ∞) it holds that

\[
\begin{aligned}
(L \circ \Xi)'(t) &= [L'(\Xi(t))](\Xi'(t)) \\
&\ge \inf_{v \in \{ w \in \mathbb{R}^d \colon \| w \|_2 = \| \Xi'(t) \|_2 \}} [L'(\Xi(t))](v) \\
&= -\| \Xi'(t) \|_2 \| (\nabla L)(\Xi(t)) \|_2
\end{aligned} \tag{5.21}
\]

(cf. Definition 3.3.4). Lemma 5.2.3 therefore ensures that for all Ξ ∈ C 1 ([0, ∞), Rd ),
τ ∈ (0, ∞) with Ξ(τ ) = Θ(τ ) and ∥Ξ′ (τ )∥2 = ∥Θ′ (τ )∥2 it holds that

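\[
\begin{aligned}
(L \circ \Xi)'(\tau) &\ge -\| \Xi'(\tau) \|_2 \| (\nabla L)(\Xi(\tau)) \|_2 \ge -\| \Theta'(\tau) \|_2 \| (\nabla L)(\Theta(\tau)) \|_2 \\
&= -\| (\nabla L)(\Theta(\tau)) \|_2^2 = (L \circ \Theta)'(\tau).
\end{aligned} \tag{5.22}
\]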

This and item (ii) establish item (iii). The proof of Corollary 5.2.4 is thus complete.


Figure 5.1 (plots/gradient_plot1.pdf): Illustration of negative gradients
in a one-dimensional example. The plot shows the graph of the function
[−2, 2] ∋ x ↦ x^4 − 3x^2 ∈ R with the value of the negative gradient at several
points indicated by horizontal arrows. The Python code used to produce this
plot is given in Source code 5.1.


Figure 5.2 (plots/gradient_plot2.pdf): Illustration of negative gradients
in a two-dimensional example. The plot shows contour lines of the function
R^2 ∋ (x, y) ↦ (1/2)|x − 1|^2 + 5|y − 1|^2 ∈ R with arrows indicating the direction
and magnitude of the negative gradient at several points along these contour
lines. The Python code used to produce this plot is given in Source code 5.2.

import numpy as np
import matplotlib.pyplot as plt

def f(x):
    return x**4 - 3 * x**2

def nabla_f(x):
    # Derivative of f
    return 4 * x**3 - 6 * x

plt.figure()

# Plot graph of f
x = np.linspace(-2, 2, 100)
plt.plot(x, f(x))

# Plot arrows: at each point p a horizontal arrow whose signed length
# is -0.05 * f'(p), i.e. it points in the direction of the negative gradient
for p in np.linspace(-1.9, 1.9, 21):
    d = nabla_f(p)
    plt.arrow(p, f(p), -.05 * d, 0,
              length_includes_head=True, head_width=0.08,
              head_length=0.05, color='b')

plt.savefig("../plots/gradient_plot1.pdf")

Source code 5.1 (code/gradient_plot1.py): Python code used to create Figure 5.1
import numpy as np
import matplotlib.pyplot as plt

K = np.array([1., 10.])
vartheta = np.array([1., 1.])

def f(x, y):
    result = K[0] / 2. * np.abs(x - vartheta[0])**2 \
        + K[1] / 2. * np.abs(y - vartheta[1])**2
    return result

def nabla_f(x):
    # Gradient of f at the point x = (x[0], x[1])
    return K * (x - vartheta)

plt.figure()

# Plot contour lines of f
x = np.linspace(-3., 7., 100)
y = np.linspace(-2., 4., 100)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)
cp = plt.contour(X, Y, Z, colors="black",
                 levels=[0.5, 2, 4, 8, 16],
                 linestyles=":")

# Plot arrows along contour lines
for l in [0.5, 2, 4, 8, 16]:
    for d in np.linspace(0, 2.*np.pi, 10, endpoint=False):
        # Point at angle d on the ellipse {f = l}
        x = np.cos(d) / ((K[0] / (2*l))**.5) + vartheta[0]
        y = np.sin(d) / ((K[1] / (2*l))**.5) + vartheta[1]
        grad = nabla_f(np.array([x, y]))
        plt.arrow(x, y, -.05 * grad[0], -.05 * grad[1],
                  length_includes_head=True, head_width=.08,
                  head_length=.1, color='b')

plt.savefig("../plots/gradient_plot2.pdf")

Source code 5.2 (code/gradient_plot2.py): Python code used to create Figure 5.2


5.3 Regularity properties for ANNs


5.3.1 On the differentiability of compositions of parametric functions
Lemma 5.3.1. Let d1 , d2 , l1 , l2 ∈ N, let A1 : Rl1 → Rl1 × Rl2 and A2 : Rl2 → Rl1 × Rl2
satisfy for all x1 ∈ Rl1 , x2 ∈ Rl2 that A1 (x1 ) = (x1 , 0) and A2 (x2 ) = (0, x2 ), for every
k ∈ {1, 2} let Bk : Rl1 × Rl2 → Rlk satisfy for all x1 ∈ Rl1 , x2 ∈ Rl2 that Bk (x1 , x2 ) = xk ,
for every k ∈ {1, 2} let Fk : Rdk → Rlk be differentiable, and let f : Rd1 × Rd2 → Rl1 × Rl2
satisfy for all x1 ∈ Rd1 , x2 ∈ Rd2 that

f (x1 , x2 ) = (F1 (x1 ), F2 (x2 )). (5.23)

Then

(i) it holds that f = A1 ◦ F1 ◦ B1 + A2 ◦ F2 ◦ B2 and

(ii) it holds that f is differentiable.

Proof of Lemma 5.3.1. Observe that (5.23) implies that for all x1 ∈ Rd1 , x2 ∈ Rd2 it holds
that
(A1 ◦ F1 ◦ B1 + A2 ◦ F2 ◦ B2 )(x1 , x2 ) = (A1 ◦ F1 )(x1 ) + (A2 ◦ F2 )(x2 )
= (F1 (x1 ), 0) + (0, F2 (x2 )) (5.24)
= (F1 (x1 ), F2 (x2 )).
Combining this and the fact that A1 , A2 , F1 , F2 , B1 , and B2 are differentiable with the chain
rule establishes that f is differentiable. The proof of Lemma 5.3.1 is thus complete.

Lemma 5.3.2. Let d1 , d2 , l0 , l1 , l2 ∈ N, let A : Rd1 × Rd2 × Rl0 → Rd2 × Rd1 +l0 and B : Rd2 ×
Rd1 +l0 → Rd2 × Rl1 satisfy for all θ1 ∈ Rd1 , θ2 ∈ Rd2 , x ∈ Rl0 that

A(θ1 , θ2 , x) = (θ2 , (θ1 , x)) and B(θ2 , (θ1 , x)) = (θ2 , F1 (θ1 , x)), (5.25)

for every k ∈ {1, 2} let Fk : Rdk × Rlk−1 → Rlk be differentiable, and let f : Rd1 × Rd2 × Rl0 →
Rl2 satisfy for all θ1 ∈ Rd1 , θ2 ∈ Rd2 , x ∈ Rl0 that

\[
f(\theta_1, \theta_2, x) = \bigl(F_2(\theta_2, \cdot) \circ F_1(\theta_1, \cdot)\bigr)(x). \tag{5.26}
\]

Then

(i) it holds that f = F2 ◦ B ◦ A and

(ii) it holds that f is differentiable.


Proof of Lemma 5.3.2. Note that (5.25) and (5.26) ensure that for all θ1 ∈ Rd1 , θ2 ∈ Rd2 ,
x ∈ Rl0 it holds that

f (θ1 , θ2 , x) = F2 (θ2 , F1 (θ1 , x)) = F2 (B(θ2 , (θ1 , x))) = F2 (B(A(θ1 , θ2 , x))). (5.27)

Observe that Lemma 5.3.1 (applied with d1 ↶ d2, d2 ↶ d1 + l0, l1 ↶ d2, l2 ↶ l1, F1 ↶ (R^{d2} ∋ θ2 ↦ θ2 ∈ R^{d2}), F2 ↶ (R^{d1+l0} ∋ (θ1, x) ↦ F1(θ1, x) ∈ R^{l1}) in the notation
of Lemma 5.3.1) implies that B is differentiable. Combining this, the fact that A is
differentiable, the fact that F2 is differentiable, and (5.27) with the chain rule assures that
f is differentiable. The proof of Lemma 5.3.2 is thus complete.

5.3.2 On the differentiability of realizations of ANNs


Lemma 5.3.3 (Differentiability of realization functions of ANNs). Let L ∈ N, l0 , l1 , . . . ,
lL ∈ N, for every k ∈ {1, 2, . . . , L} let dk = lk (lk−1 + 1), for every k ∈ {1, 2, . . . , L} let
Ψk : Rlk → Rlk be differentiable, and for every k ∈ {1, 2, . . . , L} let Fk : Rdk × Rlk−1 → Rlk
satisfy for all θ ∈ Rdk , x ∈ Rlk−1 that

\[
F_k(\theta, x) = \Psi_k\bigl(\mathcal{A}^{\theta,0}_{l_k,l_{k-1}}(x)\bigr) \tag{5.28}
\]

(cf. Definition 1.1.1). Then


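(i) it holds for all θ1 ∈ R^{d1}, θ2 ∈ R^{d2}, . . ., θL ∈ R^{dL}, x ∈ R^{l0} that

\[
\mathcal{N}^{(\theta_1,\theta_2,\dots,\theta_L),\,l_0}_{\Psi_1,\Psi_2,\dots,\Psi_L}(x)
= \bigl(F_L(\theta_L, \cdot) \circ F_{L-1}(\theta_{L-1}, \cdot) \circ \dots \circ F_1(\theta_1, \cdot)\bigr)(x) \tag{5.29}
\]

and

(ii) it holds that

\[
\mathbb{R}^{d_1+d_2+\dots+d_L} \times \mathbb{R}^{l_0} \ni (\theta, x) \mapsto
\mathcal{N}^{\theta,l_0}_{\Psi_1,\Psi_2,\dots,\Psi_L}(x) \in \mathbb{R}^{l_L} \tag{5.30}
\]

is differentiable

(cf. Definition 1.1.3).
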
Proof of Lemma 5.3.3. Note that (1.1) shows that for all θ1 ∈ R^{d1}, θ2 ∈ R^{d2}, . . ., θL ∈ R^{dL}, k ∈ {1, 2, . . . , L} it holds that

\[
\mathcal{A}^{(\theta_1,\theta_2,\dots,\theta_L),\,\sum_{j=1}^{k-1} d_j}_{l_k,l_{k-1}}
= \mathcal{A}^{\theta_k,0}_{l_k,l_{k-1}}. \tag{5.31}
\]

Hence, we obtain that for all θ1 ∈ R^{d1}, θ2 ∈ R^{d2}, . . ., θL ∈ R^{dL}, k ∈ {1, 2, . . . , L} it holds that

\[
F_k(\theta_k, x) = \Bigl(\Psi_k \circ \mathcal{A}^{(\theta_1,\theta_2,\dots,\theta_L),\,\sum_{j=1}^{k-1} d_j}_{l_k,l_{k-1}}\Bigr)(x). \tag{5.32}
\]


Combining this with (1.5) establishes item (i). Observe that the assumption that for all
k ∈ {1, 2, . . . , L} it holds that Ψk is differentiable, the fact that for all m, n ∈ N it holds that R^{m(n+1)} × R^n ∋ (θ, x) ↦ A^{θ,0}_{m,n}(x) ∈ R^m is differentiable, and the chain rule
ensure that for all k ∈ {1, 2, . . . , L} it holds that Fk is differentiable. Lemma 5.3.2 and
induction hence prove that

Rd1 × Rd2 × . . . × RdL × Rl0 ∋ (θ1 , θ2 , . . . , θL , x)


7→ (FL (θL , ·) ◦ FL−1 (θL−1 , ·) ◦ . . . ◦ F1 (θ1 , ·))(x) ∈ RlL (5.33)

is differentiable. This and item (i) prove item (ii). The proof of Lemma 5.3.3 is thus
complete.
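
Item (i) of Lemma 5.3.3 says that the realization function is a composition of the layer maps F_k, each reading its own block θ_k of the parameter vector. The sketch below illustrates this in NumPy. The parameter layout in `affine` (m·n weights stored row by row, followed by m biases) is an assumption modelled on Definition 1.1.1, and the names `affine`, `realization`, and `dims` are hypothetical.

```python
import numpy as np

def affine(theta, offset, m, n, x):
    # Assumed layout (cf. Definition 1.1.1): m*n weights stored row by row, then m biases.
    W = theta[offset:offset + m * n].reshape(m, n)
    b = theta[offset + m * n:offset + m * n + m]
    return W @ x + b

def realization(theta, dims, activations, x):
    # dims = (l_0, l_1, ..., l_L); activations = (Psi_1, ..., Psi_L).
    offset = 0
    for k in range(1, len(dims)):
        m, n = dims[k], dims[k - 1]
        x = activations[k - 1](affine(theta, offset, m, n, x))  # F_k(theta_k, .)
        offset += m * (n + 1)                                    # d_k = l_k (l_{k-1} + 1)
    return x

dims = (2, 3, 1)
num_params = sum(dims[k] * (dims[k - 1] + 1) for k in range(1, len(dims)))
rng = np.random.default_rng(0)
theta = rng.normal(size=num_params)
x = np.array([0.5, -1.0])
activations = (np.tanh, lambda z: z)

out = realization(theta, dims, activations, x)

# The same value obtained by composing F_2(theta_2, .) and F_1(theta_1, .) explicitly.
theta1, theta2 = theta[:9], theta[9:]
hidden = np.tanh(affine(theta1, 0, 3, 2, x))
out_composed = affine(theta2, 0, 1, 3, hidden)
```

Both evaluations agree, which is the content of (5.29) for L = 2 in this small example.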

Lemma 5.3.4 (Differentiability of the empirical risk function). Let L, d ∈ N\{1}, M, l0, l1, . . . , lL ∈ N, x1, x2, . . . , xM ∈ R^{l0}, y1, y2, . . . , yM ∈ R^{lL} satisfy d = Σ_{k=1}^{L} lk(lk−1 + 1), for every k ∈ {1, 2, . . . , L} let Ψk : R^{lk} → R^{lk} be differentiable, let 𝐋 : R^{lL} × R^{lL} → R be differentiable, and let ℒ : R^d → R satisfy for all θ ∈ R^d that

\[
\mathcal{L}(\theta) = \frac{1}{M}\Biggl[\sum_{m=1}^{M} \mathbf{L}\bigl(\mathcal{N}^{\theta,l_0}_{\Psi_1,\Psi_2,\dots,\Psi_L}(x_m), y_m\bigr)\Biggr] \tag{5.34}
\]

(cf. Definition 1.1.3). Then ℒ is differentiable.

Proof of Lemma 5.3.4. Note that Lemma 5.3.3 and Lemma 5.3.1 (applied with d1 ↶ d + l0, d2 ↶ lL, l1 ↶ lL, l2 ↶ lL, F1 ↶ (R^d × R^{l0} ∋ (θ, x) ↦ N^{θ,l0}_{Ψ1,Ψ2,...,ΨL}(x) ∈ R^{lL}), F2 ↶ id_{R^{lL}} in the notation of Lemma 5.3.1) show that

\[
\mathbb{R}^{d} \times \mathbb{R}^{l_0} \times \mathbb{R}^{l_L} \ni (\theta, x, y) \mapsto
\bigl(\mathcal{N}^{\theta,l_0}_{\Psi_1,\Psi_2,\dots,\Psi_L}(x), y\bigr) \in \mathbb{R}^{l_L} \times \mathbb{R}^{l_L} \tag{5.35}
\]

is differentiable. The assumption that 𝐋 is differentiable and the chain rule therefore ensure that for all x ∈ R^{l0}, y ∈ R^{lL} it holds that

\[
\mathbb{R}^{d} \ni \theta \mapsto \mathbf{L}\bigl(\mathcal{N}^{\theta,l_0}_{\Psi_1,\Psi_2,\dots,\Psi_L}(x), y\bigr) \in \mathbb{R} \tag{5.36}
\]

is differentiable. This implies that ℒ is differentiable. The proof of Lemma 5.3.4 is thus
complete.

Lemma 5.3.5. Let a : R → R be differentiable and let d ∈ N. Then Ma,d is differentiable.

Proof of Lemma 5.3.5. Observe that the assumption that a is differentiable, Lemma 5.3.1,
and induction demonstrate that for all m ∈ N it holds that Ma,m is differentiable. The
proof of Lemma 5.3.5 is thus complete.


Corollary 5.3.6. Let L, d ∈ N\{1}, M, l0, l1, . . . , lL ∈ N, x1, x2, . . . , xM ∈ R^{l0}, y1, y2, . . . , yM ∈ R^{lL} satisfy d = Σ_{k=1}^{L} lk(lk−1 + 1), let a : R → R and 𝐋 : R^{lL} × R^{lL} → R be differentiable, and let ℒ : R^d → R satisfy for all θ ∈ R^d that

\[
\mathcal{L}(\theta) = \frac{1}{M}\Biggl[\sum_{m=1}^{M} \mathbf{L}\bigl(\mathcal{N}^{\theta,l_0}_{\mathfrak{M}_{a,l_1},\mathfrak{M}_{a,l_2},\dots,\mathfrak{M}_{a,l_{L-1}},\operatorname{id}_{\mathbb{R}^{l_L}}}(x_m), y_m\bigr)\Biggr] \tag{5.37}
\]

(cf. Definitions 1.1.3 and 1.2.1). Then ℒ is differentiable.

Proof of Corollary 5.3.6. Note that Lemma 5.3.5 and Lemma 5.3.4 prove that ℒ is differentiable. The proof of Corollary 5.3.6 is thus complete.

Corollary 5.3.7. Let L, d ∈ N\{1}, M, l0, l1, . . . , lL ∈ N, x1, x2, . . . , xM ∈ R^{l0}, y1, y2, . . . , yM ∈ (0, ∞)^{lL} satisfy d = Σ_{k=1}^{L} lk(lk−1 + 1), let A be the lL-dimensional softmax activation function, let a : R → R and 𝐋 : (0, ∞)^{lL} × (0, ∞)^{lL} → R be differentiable, and let ℒ : R^d → R satisfy for all θ ∈ R^d that

\[
\mathcal{L}(\theta) = \frac{1}{M}\Biggl[\sum_{m=1}^{M} \mathbf{L}\bigl(\mathcal{N}^{\theta,l_0}_{\mathfrak{M}_{a,l_1},\mathfrak{M}_{a,l_2},\dots,\mathfrak{M}_{a,l_{L-1}},A}(x_m), y_m\bigr)\Biggr] \tag{5.38}
\]

(cf. Definitions 1.1.3, 1.2.1, and 1.2.43 and Lemma 1.2.44). Then ℒ is differentiable.

Proof of Corollary 5.3.7. Observe that Lemma 5.3.5, the fact that A is differentiable, and Lemma 5.3.4 establish that ℒ is differentiable. The proof of Corollary 5.3.7 is thus complete.

5.4 Loss functions


5.4.1 Absolute error loss
Definition 5.4.1. Let d ∈ N and let ~·~ : R^d → [0, ∞) be a norm. We say that L is the ℓ¹-error loss function based on ~·~ (we say that L is the absolute error loss function based on ~·~) if and only if it holds that L : R^d × R^d → R is the function from R^d × R^d to R which satisfies for all x, y ∈ R^d that

L(x, y) = ~x − y~. (5.39)

Figure 5.3 (plots/l1loss.pdf): A plot of the function R ∋ x ↦ L(x, 0) ∈ [0, ∞) where L is the ℓ¹-error loss function based on R ∋ x ↦ |x| ∈ [0, ∞) (cf. Definition 5.4.1).

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import plot_util

ax = plot_util.setup_axis((-2, 2), (-.5, 2))

x = np.linspace(-2, 2, 100)

# Reduction.NONE yields one loss value per sample instead of the batch mean
mae_loss = tf.keras.losses.MeanAbsoluteError(
    reduction=tf.keras.losses.Reduction.NONE)
zero = tf.zeros([100, 1])

ax.plot(x, mae_loss(x.reshape([100, 1]), zero),
        label='ℓ¹-error')
ax.legend()

plt.savefig("../../plots/l1loss.pdf", bbox_inches='tight')

Source code 5.3 (code/loss_functions/l1loss_plot.py): Python code used to create Figure 5.3

5.4.2 Mean squared error loss


Definition 5.4.2. Let d ∈ N and let ~·~ : Rd → [0, ∞) be a norm. We say that L is the
mean squared error loss function based on ~·~ if and only if it holds that L : Rd × Rd → R
is the function from Rd × Rd to R which satisfies for all x, y ∈ Rd that

L(x, y) = ~x − y~^2. (5.40)

Figure 5.4 (plots/mseloss.pdf): A plot of the function R ∋ x ↦ L(x, 0) ∈ [0, ∞) where L is the mean squared error loss function based on R ∋ x ↦ |x| ∈ [0, ∞) (cf. Definition 5.4.2).

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import plot_util

ax = plot_util.setup_axis((-2, 2), (-.5, 2))

x = np.linspace(-2, 2, 100)

# Reduction.NONE yields one loss value per sample instead of the batch mean
mse_loss = tf.keras.losses.MeanSquaredError(
    reduction=tf.keras.losses.Reduction.NONE)
zero = tf.zeros([100, 1])

ax.plot(x, mse_loss(x.reshape([100, 1]), zero),
        label='Mean squared error')
ax.legend()

plt.savefig("../../plots/mseloss.pdf", bbox_inches='tight')

Source code 5.4 (code/loss_functions/mseloss_plot.py): Python code used to create Figure 5.4

Lemma 5.4.3. Let d ∈ N and let L be the mean squared error loss function based on
Rd ∋ x 7→ ∥x∥2 ∈ [0, ∞) (cf. Definitions 3.3.4 and 5.4.2). Then

(i) it holds that L ∈ C^∞(R^d × R^d, R) and

(ii) it holds for all x, y, u, v ∈ R^d that

\[
L(u, v) = L(x, y) + L'(x, y)(u - x, v - y) + \tfrac{1}{2} L^{(2)}(x, y)\bigl((u - x, v - y), (u - x, v - y)\bigr). \tag{5.41}
\]



Proof of Lemma 5.4.3. Note that (5.40) implies that for all x = (x1, . . . , xd), y = (y1, . . . , yd) ∈ R^d it holds that

\[
L(x, y) = \|x - y\|_2^2 = \langle x - y, x - y \rangle = \sum_{i=1}^{d} (x_i - y_i)^2. \tag{5.42}
\]

Hence, we obtain that for all x, y ∈ Rd it holds that L ∈ C 1 (Rd × Rd , R) and

(∇L)(x, y) = (2(x − y), −2(x − y)) ∈ R2d . (5.43)

This implies that for all x, y, h, k ∈ Rd it holds that

L′(x, y)(h, k) = ⟨2(x − y), h⟩ + ⟨−2(x − y), k⟩ = 2⟨x − y, h − k⟩. (5.44)

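Furthermore, observe that (5.43) implies that for all x, y ∈ R^d it holds that L ∈ C^2(R^d × R^d, R) and

\[
(\operatorname{Hess}_{(x,y)} L) = \begin{pmatrix} 2 \operatorname{I}_d & -2 \operatorname{I}_d \\ -2 \operatorname{I}_d & 2 \operatorname{I}_d \end{pmatrix}. \tag{5.45}
\]
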
Therefore, we obtain that for all x, y, h, k ∈ Rd it holds that

\[
L^{(2)}(x, y)\bigl((h, k), (h, k)\bigr) = 2\langle h, h \rangle - 2\langle h, k \rangle - 2\langle k, h \rangle + 2\langle k, k \rangle = 2\|h - k\|_2^2. \tag{5.46}
\]


Combining this with (5.43) shows that for all x, y ∈ R^d, h, k ∈ R^d it holds that L ∈ C^∞(R^d × R^d, R) and

\[
\begin{aligned}
L(x, y) + L'(x, y)(h, k) + \tfrac{1}{2} L^{(2)}(x, y)\bigl((h, k), (h, k)\bigr)
&= \|x - y\|_2^2 + 2\langle x - y, h - k \rangle + \|h - k\|_2^2 \\
&= \|x - y + (h - k)\|_2^2 \\
&= L(x + h, y + k).
\end{aligned} \tag{5.47}
\]

This implies items (i) and (ii). The proof of Lemma 5.4.3 is thus complete.
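
As a numerical sanity check of item (ii), the following sketch evaluates both sides of (5.41) for randomly drawn vectors. The helper name `mse`, the dimension, and the random seed are arbitrary choices made for illustration.

```python
import numpy as np

def mse(x, y):
    # Mean squared error loss based on the Euclidean norm, cf. (5.42).
    return float(np.sum((x - y) ** 2))

rng = np.random.default_rng(1)
d = 4
x, y, u, v = rng.normal(size=(4, d))
h, k = u - x, v - y

first_order = 2.0 * np.dot(x - y, h - k)     # L'(x, y)(h, k), cf. (5.44)
second_order = 2.0 * np.dot(h - k, h - k)    # L^(2)(x, y)((h, k), (h, k)), cf. (5.46)
taylor = mse(x, y) + first_order + 0.5 * second_order

print(taylor, mse(u, v))
```

The second-order Taylor polynomial reproduces L(u, v) up to rounding, since L is a quadratic function.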

5.4.3 Huber error loss


Definition 5.4.4. Let d ∈ N, let δ ∈ [0, ∞), and let ~·~ : Rd → [0, ∞) be a norm. We
say that L is the δ-Huber-error loss function based on ~·~ if and only if it holds that
L : Rd × Rd → R is the function from Rd × Rd to R which satisfies for all x, y ∈ Rd that
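\[
L(x, y) = \begin{cases}
\tfrac{1}{2} ~x - y~^2 & \colon ~x - y~ \le \delta \\
\delta\bigl(~x - y~ - \tfrac{\delta}{2}\bigr) & \colon ~x - y~ > \delta.
\end{cases} \tag{5.48}
\]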
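
For d = 1 and the absolute value as norm, (5.48) can be written in a few lines of NumPy. This is a sketch for illustration only; the function name `huber` and the default `delta=1.0` are choices made here.

```python
import numpy as np

def huber(x, y, delta=1.0):
    # delta-Huber-error loss for d = 1 based on the absolute value, cf. (5.48).
    r = np.abs(x - y)
    return np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta))

points = np.array([0.0, 0.5, 1.0, 3.0])
values = huber(points, 0.0)
print(values)
```

The two branches agree at the threshold (both give δ²/2 when the residual equals δ), so the loss is quadratic near zero and grows linearly for large residuals.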


Figure 5.5 (plots/huberloss.pdf): A plot of the functions R ∋ x ↦ L_i(x, 0) ∈ [0, ∞), i ∈ {1, 2, 3}, where L_0 is the mean squared error loss function based on R ∋ x ↦ |x| ∈ [0, ∞), where L_1 : R × R → [0, ∞) satisfies for all x, y ∈ R that L_1(x, y) = (1/2) L_0(x, y), where L_2 is the ℓ¹-error loss function based on R ∋ x ↦ |x| ∈ [0, ∞), and where L_3 is the 1-Huber loss function based on R ∋ x ↦ |x| ∈ [0, ∞).

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import plot_util

ax = plot_util.setup_axis((-3, 3), (-.5, 4))

x = np.linspace(-3, 3, 100)

# Reduction.NONE yields one loss value per sample instead of the batch mean
mse_loss = tf.keras.losses.MeanSquaredError(
    reduction=tf.keras.losses.Reduction.NONE)
mae_loss = tf.keras.losses.MeanAbsoluteError(
    reduction=tf.keras.losses.Reduction.NONE)
# The Keras Huber loss uses the threshold delta=1.0 by default
huber_loss = tf.keras.losses.Huber(
    reduction=tf.keras.losses.Reduction.NONE)

zero = tf.zeros([100, 1])

ax.plot(x, mse_loss(x.reshape([100, 1]), zero)/2.,
        label='Scaled mean squared error')
ax.plot(x, mae_loss(x.reshape([100, 1]), zero),
        label='ℓ¹-error')
ax.plot(x, huber_loss(x.reshape([100, 1]), zero),
        label='1-Huber-error')
ax.legend()

plt.savefig("../../plots/huberloss.pdf", bbox_inches='tight')

Source code 5.5 (code/loss_functions/huberloss_plot.py): Python code used to create Figure 5.5

5.4.4 Cross-entropy loss


Definition 5.4.5. Let d ∈ N\{1}. We say that L is the d-dimensional cross-entropy loss function if and only if it holds that L : [0, ∞)^d × [0, ∞)^d → (−∞, ∞] is the function from [0, ∞)^d × [0, ∞)^d to (−∞, ∞] which satisfies for all x = (x1, . . . , xd), y = (y1, . . . , yd) ∈ [0, ∞)^d that

\[
L(x, y) = -\sum_{i=1}^{d} \lim_{z \searrow x_i} \bigl[\ln(z)\, y_i\bigr]. \tag{5.49}
\]

Figure 5.6 (plots/crossentropyloss.pdf): A plot of the function (0, 1) ∋ x ↦ L((x, 1 − x), (3/10, 7/10)) ∈ R where L is the 2-dimensional cross-entropy loss function (cf. Definition 5.4.5).

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import plot_util

ax = plot_util.setup_axis((0, 1), (0, 3))

ax.set_aspect(.3)

x = np.linspace(0, 1, 100)

# Reduction.NONE yields one loss value per sample instead of the batch mean
cce_loss = tf.keras.losses.CategoricalCrossentropy(
    reduction=tf.keras.losses.Reduction.NONE)
# Target distribution (3/10, 7/10), repeated for each of the 100 samples
y = tf.constant([[0.3, 0.7]] * 100, shape=(100, 2))

# Predicted distributions (x, 1 - x)
X = tf.stack([x, 1-x], axis=1)

# Keras losses are called as loss(y_true, y_pred)
ax.plot(x, cce_loss(y, X), label='Cross-entropy')
ax.legend()

plt.savefig("../../plots/crossentropyloss.pdf", bbox_inches='tight')

Source code 5.6 (code/loss_functions/crossentropyloss_plot.py): Python code used to create Figure 5.6

Lemma 5.4.6. Let d ∈ N\{1} and let L be the d-dimensional cross-entropy loss function
(cf. Definition 5.4.5). Then
(i) it holds for all x = (x1, . . . , xd), y = (y1, . . . , yd) ∈ [0, ∞)^d that

\[
(L(x, y) = \infty) \leftrightarrow \bigl(\exists\, i \in \{1, 2, \dots, d\} \colon [(x_i = 0) \wedge (y_i \neq 0)]\bigr), \tag{5.50}
\]

(ii) it holds for all x = (x1, . . . , xd), y = (y1, . . . , yd) ∈ [0, ∞)^d with ∀ i ∈ {1, 2, . . . , d} : [(xi ≠ 0) ∨ (yi = 0)] that

\[
L(x, y) = -\sum_{i \in \{1, 2, \dots, d\},\, y_i \neq 0} \ln(x_i)\, y_i \in \mathbb{R}, \tag{5.51}
\]

and

(iii) it holds for all x = (x1, . . . , xd) ∈ (0, ∞)^d, y = (y1, . . . , yd) ∈ [0, ∞)^d that

\[
L(x, y) = -\sum_{i=1}^{d} \ln(x_i)\, y_i \in \mathbb{R}. \tag{5.52}
\]

Proof of Lemma 5.4.6. Note that (5.49) and the fact that for all a, b ∈ [0, ∞) it holds that

\[
\lim_{z \searrow a} \bigl[\ln(z)\, b\bigr] = \begin{cases}
0 & \colon b = 0 \\
\ln(a)\, b & \colon (a \neq 0) \wedge (b \neq 0) \\
-\infty & \colon (a = 0) \wedge (b \neq 0)
\end{cases} \tag{5.53}
\]

prove items (i), (ii), and (iii). The proof of Lemma 5.4.6 is thus complete.
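
The case distinction of Lemma 5.4.6 translates directly into code. The following sketch (the function name `cross_entropy` is a hypothetical choice) returns ∞ in the situation of item (i) and otherwise sums only over the indices with y_i ≠ 0, as in item (ii).

```python
import numpy as np

def cross_entropy(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Item (i): the loss is infinite iff some x_i = 0 while y_i != 0.
    if np.any((x == 0) & (y != 0)):
        return np.inf
    # Item (ii): otherwise only indices with y_i != 0 contribute.
    mask = y != 0
    return float(-np.sum(np.log(x[mask]) * y[mask]))

finite_value = cross_entropy([0.5, 0.5], [0.3, 0.7])   # equals ln(2)
infinite_value = cross_entropy([0.0, 1.0], [0.3, 0.7])
zero_value = cross_entropy([0.0, 1.0], [0.0, 1.0])     # the term with x_1 = y_1 = 0 drops out
```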


Lemma 5.4.7. Let d ∈ N\{1}, let L be the d-dimensional cross-entropy loss function, let x = (x1, . . . , xd), y = (y1, . . . , yd) ∈ [0, ∞)^d satisfy Σ_{i=1}^{d} xi = Σ_{i=1}^{d} yi and x ≠ y, and let f : [0, 1] → (−∞, ∞] satisfy for all h ∈ [0, 1] that
f(h) = L(x + h(y − x), y) (5.54)
(cf. Definition 5.4.5). Then f is strictly decreasing.
Proof of Lemma 5.4.7. Throughout this proof, let g : [0, 1) → (−∞, ∞] satisfy for all
h ∈ [0, 1) that
g(h) = f (1 − h) (5.55)
and let J = {i ∈ {1, 2, . . . , d} : yi ̸= 0}. Observe that (5.54) shows that for all h ∈ [0, 1) it
holds that
g(h) = L(x + (1 − h)(y − x), y) = L(y + h(x − y), y). (5.56)
Furthermore, note that the fact that for all i ∈ J it holds that xi ∈ [0, ∞) and yi ∈ (0, ∞)
ensures that for all i ∈ J, h ∈ [0, 1) it holds that
yi + h(xi − yi ) = (1 − h)yi + hxi ≥ (1 − h)yi > 0. (5.57)
This, (5.56), and item (ii) in Lemma 5.4.6 imply that for all h ∈ [0, 1) it holds that
X
g(h) = − ln(yi + h(xi − yi ))yi ∈ R. (5.58)
i∈J

The chain rule hence demonstrates that for all h ∈ [0, 1) it holds that ([0, 1) ∋ z ↦ g(z) ∈ R) ∈ C^∞([0, 1), R) and

\[
g'(h) = -\sum_{i \in J} \frac{y_i (x_i - y_i)}{y_i + h(x_i - y_i)}. \tag{5.59}
\]

This and the chain rule establish that for all h ∈ [0, 1) it holds that

\[
g''(h) = \sum_{i \in J} \frac{y_i (x_i - y_i)^2}{(y_i + h(x_i - y_i))^2}. \tag{5.60}
\]

Moreover, observe that the fact that for all z = (z1, . . . , zd) ∈ [0, ∞)^d with Σ_{i=1}^{d} zi = Σ_{i=1}^{d} yi and ∀ i ∈ J : zi = yi it holds that

\[
\sum_{i \in \{1,2,\dots,d\} \setminus J} z_i
= \Biggl[\sum_{i \in \{1,2,\dots,d\}} z_i\Biggr] - \Biggl[\sum_{i \in J} z_i\Biggr]
= \Biggl[\sum_{i \in \{1,2,\dots,d\}} y_i\Biggr] - \Biggl[\sum_{i \in J} z_i\Biggr]
= \sum_{i \in J} (y_i - z_i) = 0 \tag{5.61}
\]

proves that for all z = (z1, . . . , zd) ∈ [0, ∞)^d with Σ_{i=1}^{d} zi = Σ_{i=1}^{d} yi and ∀ i ∈ J : zi = yi it holds that z = y. The assumption that Σ_{i=1}^{d} xi = Σ_{i=1}^{d} yi and x ≠ y therefore ensures that there exists i ∈ J such that xi ≠ yi > 0. Combining this with (5.60) shows that for all h ∈ [0, 1) it holds that
g′′(h) > 0. (5.62)
The fundamental theorem of calculus hence implies that for all h ∈ (0, 1) it holds that

\[
g'(h) = g'(0) + \int_0^h g''(s) \, ds > g'(0). \tag{5.63}
\]

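In addition, note that (5.59) and the assumption that Σ_{i=1}^{d} xi = Σ_{i=1}^{d} yi demonstrate that

\[
\begin{aligned}
g'(0) &= -\sum_{i \in J} \frac{y_i (x_i - y_i)}{y_i} = \sum_{i \in J} (y_i - x_i)
= \Biggl[\sum_{i \in J} y_i\Biggr] - \Biggl[\sum_{i \in J} x_i\Biggr] \\
&= \Biggl[\sum_{i \in \{1,2,\dots,d\}} y_i\Biggr] - \Biggl[\sum_{i \in J} x_i\Biggr]
= \Biggl[\sum_{i \in \{1,2,\dots,d\}} x_i\Biggr] - \Biggl[\sum_{i \in J} x_i\Biggr]
= \Biggl[\sum_{i \in \{1,2,\dots,d\} \setminus J} x_i\Biggr] \ge 0.
\end{aligned} \tag{5.64}
\]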

Combining this and (5.63) establishes that for all h ∈ (0, 1) it holds that

g ′ (h) > 0. (5.65)

Therefore, we obtain that g is strictly increasing. This and (5.55) prove that f |(0,1] is strictly
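decreasing. Next observe that (5.55) and (5.58) ensure that for all h ∈ (0, 1] it holds that

\[
f(h) = -\sum_{i \in J} \ln\bigl(y_i + (1 - h)(x_i - y_i)\bigr)\, y_i = -\sum_{i \in J} \ln\bigl(x_i + h(y_i - x_i)\bigr)\, y_i \in \mathbb{R}. \tag{5.66}
\]

Furthermore, note that items (i) and (ii) in Lemma 5.4.6 show that

\[
[f(0) = \infty] \vee \Bigl[f(0) = -\sum_{i \in J} \ln\bigl(x_i + 0(y_i - x_i)\bigr)\, y_i \in \mathbb{R}\Bigr]. \tag{5.67}
\]

Combining this with (5.66) implies that

\[
[f(0) = \infty] \vee \Bigl[(\forall\, h \in [0, 1] \colon f(h) \in \mathbb{R}) \wedge \bigl(([0, 1] \ni h \mapsto f(h) \in \mathbb{R}) \in C([0, 1], \mathbb{R})\bigr)\Bigr]. \tag{5.68}
\]
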

This and the fact that f |(0,1] is strictly decreasing demonstrate that f is strictly decreasing.
The proof of Lemma 5.4.7 is thus complete.
Corollary 5.4.8. Let d ∈ N\{1}, let A = {x = (x1, . . . , xd) ∈ [0, 1]^d : Σ_{i=1}^{d} xi = 1}, let L be the d-dimensional cross-entropy loss function, and let y ∈ A (cf. Definition 5.4.5). Then

(i) it holds that

\[
\bigl\{x \in A \colon L(x, y) = \inf\nolimits_{z \in A} L(z, y)\bigr\} = \{y\} \tag{5.69}
\]

and

(ii) it holds that

\[
\inf_{z \in A} L(z, y) = L(y, y) = -\sum_{i \in \{1, 2, \dots, d\},\, y_i \neq 0} \ln(y_i)\, y_i. \tag{5.70}
\]

Proof of Corollary 5.4.8. Observe that Lemma 5.4.7 shows that for all x ∈ A\{y} it holds
that
L(x, y) = L(x + 0(y − x), y) > L(x + 1(y − x), y) = L(y, y). (5.71)
This and item (ii) in Lemma 5.4.6 establish items (i) and (ii). The proof of Corollary 5.4.8
is thus complete.
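
Corollary 5.4.8 can be illustrated numerically by sampling points of the set A and comparing their loss to L(y, y). The sketch below assumes a target y with one vanishing component and draws the samples from a Dirichlet distribution; both choices, as well as the helper name `cross_entropy`, are arbitrary and serve only as illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
y = np.array([0.3, 0.7, 0.0])
support = y != 0

def cross_entropy(x):
    # Formula of item (ii) in Lemma 5.4.6; valid when x_i > 0 wherever y_i > 0.
    return -np.sum(np.log(x[..., support]) * y[support], axis=-1)

samples = rng.dirichlet(np.ones(3), size=2000)   # random points of the set A
values = cross_entropy(samples)
at_y = cross_entropy(y)

print(values.min(), at_y)
```

Every sampled point has a strictly larger loss than y itself, as item (i) predicts.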

5.4.5 Kullback–Leibler divergence loss


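Lemma 5.4.9. Let z ∈ (0, ∞). Then

(i) it holds that

\[
\liminf_{x \searrow 0} \bigl[\ln(x)\, x\bigr] = 0 = \limsup_{x \searrow 0} \bigl[\ln(x)\, x\bigr] \tag{5.72}
\]

and

(ii) it holds for all y ∈ [0, ∞) that

\[
\liminf_{x \searrow y} \bigl[\ln\bigl(\tfrac{z}{x}\bigr)\, x\bigr]
= \begin{cases} 0 & \colon y = 0 \\ \ln\bigl(\tfrac{z}{y}\bigr)\, y & \colon y > 0 \end{cases}
= \limsup_{x \searrow y} \bigl[\ln\bigl(\tfrac{z}{x}\bigr)\, x\bigr]. \tag{5.73}
\]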

Proof of Lemma 5.4.9. Throughout this proof, let f : (0, ∞) → R and g : (0, ∞) → R
satisfy for all x ∈ (0, ∞) that

f (x) = ln(x−1 ) and g(x) = x. (5.74)

Note that the chain rule proves that for all x ∈ (0, ∞) it holds that f is differentiable and

f ′ (x) = −x−2 (x−1 )−1 = −x−1 . (5.75)

Combining this, the fact that lim_{x→∞} |f(x)| = ∞ = lim_{x→∞} |g(x)|, the fact that g is differentiable, the fact that for all x ∈ (0, ∞) it holds that g′(x) = 1 ≠ 0, and the fact that lim_{x→∞} (−x^{−1})/1 = 0 with l'Hôpital's rule ensures that

\[
\liminf_{x \to \infty} \frac{f(x)}{g(x)} = 0 = \limsup_{x \to \infty} \frac{f(x)}{g(x)}. \tag{5.76}
\]

This shows that

\[
\liminf_{x \searrow 0} \frac{f(x^{-1})}{g(x^{-1})} = 0 = \limsup_{x \searrow 0} \frac{f(x^{-1})}{g(x^{-1})}. \tag{5.77}
\]

The fact that for all x ∈ (0, ∞) it holds that f(x^{−1})/g(x^{−1}) = ln(x)x hence establishes item (i). Observe that item (i) and the fact that for all x ∈ (0, ∞) it holds that ln(z/x)x = ln(z)x − ln(x)x prove item (ii). The proof of Lemma 5.4.9 is thus complete.

Definition 5.4.10. Let d ∈ N\{1}. We say that L is the d-dimensional Kullback–Leibler divergence loss function if and only if it holds that L : [0, ∞)^d × [0, ∞)^d → (−∞, ∞] is the function from [0, ∞)^d × [0, ∞)^d to (−∞, ∞] which satisfies for all x = (x1, . . . , xd), y = (y1, . . . , yd) ∈ [0, ∞)^d that

\[
L(x, y) = -\sum_{i=1}^{d} \lim_{z \searrow x_i} \lim_{u \searrow y_i} \bigl[\ln\bigl(\tfrac{z}{u}\bigr)\, u\bigr] \tag{5.78}
\]

(cf. Lemma 5.4.9).

Figure 5.7 (plots/kldloss.pdf): A plot of the functions (0, 1) ∋ x ↦ L_i((x, 1 − x), (3/10, 7/10)) ∈ R, i ∈ {1, 2}, where L_1 is the 2-dimensional Kullback–Leibler divergence loss function and where L_2 is the 2-dimensional cross-entropy loss function (cf. Definitions 5.4.5 and 5.4.10).

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import plot_util

ax = plot_util.setup_axis((0, 1), (0, 3))

ax.set_aspect(.3)

x = np.linspace(0, 1, 100)

# Reduction.NONE yields one loss value per sample instead of the batch mean
kld_loss = tf.keras.losses.KLDivergence(
    reduction=tf.keras.losses.Reduction.NONE)
cce_loss = tf.keras.losses.CategoricalCrossentropy(
    reduction=tf.keras.losses.Reduction.NONE)
# Target distribution (3/10, 7/10), repeated for each of the 100 samples
y = tf.constant([[0.3, 0.7]] * 100, shape=(100, 2))

# Predicted distributions (x, 1 - x)
X = tf.stack([x, 1-x], axis=1)

# Keras losses are called as loss(y_true, y_pred)
ax.plot(x, kld_loss(y, X), label='Kullback-Leibler divergence')
ax.plot(x, cce_loss(y, X), label='Cross-entropy')
ax.legend()

plt.savefig("../../plots/kldloss.pdf", bbox_inches='tight')

Source code 5.7 (code/loss_functions/kldloss_plot.py): Python code used to create Figure 5.7

Lemma 5.4.11. Let d ∈ N\{1}, let LCE be the d-dimensional cross-entropy loss function,
and let LKLD be the d-dimensional Kullback–Leibler divergence loss function (cf. Defini-
tions 5.4.5 and 5.4.10). Then it holds for all x, y ∈ [0, ∞)d that

LCE (x, y) = LKLD (x, y) + LCE (y, y). (5.79)

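Proof of Lemma 5.4.11. Note that Lemma 5.4.9 implies that for all a, b ∈ [0, ∞) it holds that

\[
\begin{aligned}
\lim_{z \searrow a} \lim_{u \searrow b} \bigl[\ln\bigl(\tfrac{z}{u}\bigr)\, u\bigr]
&= \lim_{z \searrow a} \lim_{u \searrow b} \bigl[\ln(z)\, u - \ln(u)\, u\bigr] \\
&= \lim_{z \searrow a} \Bigl[\ln(z)\, b - \lim_{u \searrow b} [\ln(u)\, u]\Bigr] \\
&= \Bigl(\lim_{z \searrow a} [\ln(z)\, b]\Bigr) - \Bigl(\lim_{u \searrow b} [\ln(u)\, u]\Bigr).
\end{aligned} \tag{5.80}
\]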

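This and (5.78) demonstrate that for all x = (x1, . . . , xd), y = (y1, . . . , yd) ∈ [0, ∞)^d it holds that

\[
\begin{aligned}
L_{\mathrm{KLD}}(x, y) &= -\sum_{i=1}^{d} \lim_{z \searrow x_i} \lim_{u \searrow y_i} \bigl[\ln\bigl(\tfrac{z}{u}\bigr)\, u\bigr] \\
&= -\Biggl(\sum_{i=1}^{d} \lim_{z \searrow x_i} [\ln(z)\, y_i]\Biggr) + \Biggl(\sum_{i=1}^{d} \lim_{u \searrow y_i} [\ln(u)\, u]\Biggr).
\end{aligned} \tag{5.81}
\]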


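Furthermore, observe that Lemma 5.4.9 ensures that for all b ∈ [0, ∞) it holds that

\[
\lim_{u \searrow b} \bigl[\ln(u)\, u\bigr] = \begin{cases} 0 & \colon b = 0 \\ \ln(b)\, b & \colon b > 0 \end{cases} = \lim_{u \searrow b} \bigl[\ln(u)\, b\bigr]. \tag{5.82}
\]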

Combining this with (5.81) shows that for all x = (x1, . . . , xd), y = (y1, . . . , yd) ∈ [0, ∞)^d it holds that

\[
L_{\mathrm{KLD}}(x, y) = -\Biggl(\sum_{i=1}^{d} \lim_{z \searrow x_i} [\ln(z)\, y_i]\Biggr) + \Biggl(\sum_{i=1}^{d} \lim_{u \searrow y_i} [\ln(u)\, y_i]\Biggr) = L_{\mathrm{CE}}(x, y) - L_{\mathrm{CE}}(y, y). \tag{5.83}
\]

Therefore, we obtain (5.79). The proof of Lemma 5.4.11 is thus complete.
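
For strictly positive vectors all limits in (5.49) and (5.78) reduce to plain evaluations, so (5.79) can be checked directly. The following sketch uses two arbitrarily chosen probability vectors.

```python
import numpy as np

x = np.array([0.2, 0.5, 0.3])
y = np.array([0.3, 0.6, 0.1])

ce_xy = -np.sum(np.log(x) * y)        # cross-entropy loss L_CE(x, y), cf. (5.52)
ce_yy = -np.sum(np.log(y) * y)        # L_CE(y, y)
kld_xy = -np.sum(np.log(x / y) * y)   # Kullback-Leibler divergence loss, cf. (5.78)

print(ce_xy, kld_xy + ce_yy)
```

The two printed numbers coincide up to rounding, so the two losses differ only by the term L_CE(y, y), which does not depend on x.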

Lemma 5.4.12. Let d ∈ N\{1}, let L be the d-dimensional Kullback–Leibler divergence loss function, let x = (x1, . . . , xd), y = (y1, . . . , yd) ∈ [0, ∞)^d satisfy Σ_{i=1}^{d} xi = Σ_{i=1}^{d} yi and x ≠ y, and let f : [0, 1] → (−∞, ∞] satisfy for all h ∈ [0, 1] that

f(h) = L(x + h(y − x), y) (5.84)

(cf. Definition 5.4.10). Then f is strictly decreasing.

Proof of Lemma 5.4.12. Note that Lemma 5.4.7 and Lemma 5.4.11 establish that f is strictly decreasing. The proof of Lemma 5.4.12 is thus complete.

Corollary 5.4.13. Let d ∈ N\{1}, let A = {x = (x1, . . . , xd) ∈ [0, 1]^d : Σ_{i=1}^{d} xi = 1}, let L be the d-dimensional Kullback–Leibler divergence loss function, and let y ∈ A (cf. Definition 5.4.10). Then

(i) it holds that

\[
\bigl\{x \in A \colon L(x, y) = \inf\nolimits_{z \in A} L(z, y)\bigr\} = \{y\} \tag{5.85}
\]

and

(ii) it holds that inf_{z∈A} L(z, y) = L(y, y) = 0.

Proof of Corollary 5.4.13. Observe that Corollary 5.4.8 and Lemma 5.4.11 prove items (i)
and (ii). The proof of Corollary 5.4.13 is thus complete.

5.5 GF optimization in the training of ANNs


Example 5.5.1. Let d, L, 𝔡 ∈ N, l1, l2, . . . , lL ∈ N satisfy 𝔡 = l1(d + 1) + [Σ_{k=2}^{L} lk(lk−1 + 1)], let a : R → R be differentiable, let M ∈ N, x1, x2, . . . , xM ∈ R^d, y1, y2, . . . , yM ∈ R^{lL}, let 𝐋 : R^{lL} × R^{lL} → R be the mean squared error loss function based on R^{lL} ∋ x ↦ ∥x∥2 ∈ [0, ∞), let ℒ : R^𝔡 → [0, ∞) satisfy for all θ ∈ R^𝔡 that

\[
\mathcal{L}(\theta) = \frac{1}{M}\Biggl[\sum_{m=1}^{M} \mathbf{L}\bigl(\mathcal{N}^{\theta,d}_{\mathfrak{M}_{a,l_1},\mathfrak{M}_{a,l_2},\dots,\mathfrak{M}_{a,l_{L-1}},\operatorname{id}_{\mathbb{R}^{l_L}}}(x_m), y_m\bigr)\Biggr], \tag{5.86}
\]

let ξ ∈ R^𝔡, and let Θ : [0, ∞) → R^𝔡 satisfy for all t ∈ [0, ∞) that

\[
\Theta_t = \xi - \int_0^t (\nabla \mathcal{L})(\Theta_s) \, ds \tag{5.87}
\]

(cf. Definitions 1.1.3, 1.2.1, 3.3.4, and 5.4.2, Corollary 5.3.6, and Lemma 5.4.3). Then Θ is a GF trajectory for the objective function ℒ with initial value ξ (cf. Definition 5.2.1).

Proof for Example 5.5.1. Note that (5.9), (5.10), and (5.87) demonstrate that Θ is a GF trajectory for the objective function ℒ with initial value ξ (cf. Definition 5.2.1). The proof for Example 5.5.1 is thus complete.
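
A GF trajectory as in (5.87) is rarely available in closed form, but it can be approximated by discretizing the integral equation. The sketch below is not taken from the text: it uses explicit Euler steps with a hypothetical step size `step`, a small tanh network with layer dimensions (1, 4, 1), made-up training data, an assumed parameter layout (weights row by row, then biases), and central finite differences in place of the exact gradient.

```python
import numpy as np

def realization(theta, dims, x):
    # tanh hidden layers, identity output layer; assumed layout: weights row by row, then biases.
    offset = 0
    for k in range(1, len(dims)):
        m, n = dims[k], dims[k - 1]
        W = theta[offset:offset + m * n].reshape(m, n)
        b = theta[offset + m * n:offset + m * (n + 1)]
        x = W @ x + b
        if k < len(dims) - 1:
            x = np.tanh(x)
        offset += m * (n + 1)
    return x

def risk(theta, dims, xs, ys):
    # Empirical risk with the mean squared error loss, cf. (5.86).
    return np.mean([np.sum((realization(theta, dims, x) - y) ** 2)
                    for x, y in zip(xs, ys)])

def num_grad(f, theta, eps=1e-6):
    # Central finite differences as a stand-in for the exact gradient.
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

dims = (1, 4, 1)
rng = np.random.default_rng(3)
xs = np.linspace(-1, 1, 8).reshape(-1, 1)
ys = np.sin(np.pi * xs)
num_params = sum(dims[k] * (dims[k - 1] + 1) for k in range(1, len(dims)))
theta = rng.normal(scale=0.5, size=num_params)   # plays the role of the initial value xi

f = lambda th: risk(th, dims, xs, ys)
risks = [f(theta)]
step = 0.01
for _ in range(200):
    theta = theta - step * num_grad(f, theta)    # Euler step for Theta' = -(grad L)(Theta)
    risks.append(f(theta))

print(risks[0], risks[-1])
```

The risk along the discretized trajectory ends below its initial value, mirroring the decay of the objective function along a GF trajectory in (5.19).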
Example 5.5.2. Let d, L, 𝔡 ∈ N, l1, l2, . . . , lL ∈ N satisfy 𝔡 = l1(d + 1) + [Σ_{k=2}^{L} lk(lk−1 + 1)], let a : R → R be differentiable, let A : R^{lL} → R^{lL} be the lL-dimensional softmax activation function, let M ∈ N, x1, x2, . . . , xM ∈ R^d, y1, y2, . . . , yM ∈ [0, ∞)^{lL}, let 𝐋_1 be the lL-dimensional cross-entropy loss function, let 𝐋_2 be the lL-dimensional Kullback–Leibler divergence loss function, for every i ∈ {1, 2} let ℒ_i : R^𝔡 → [0, ∞) satisfy for all θ ∈ R^𝔡 that

\[
\mathcal{L}_i(\theta) = \frac{1}{M}\Biggl[\sum_{m=1}^{M} \mathbf{L}_i\bigl(\mathcal{N}^{\theta,d}_{\mathfrak{M}_{a,l_1},\mathfrak{M}_{a,l_2},\dots,\mathfrak{M}_{a,l_{L-1}},A}(x_m), y_m\bigr)\Biggr], \tag{5.88}
\]

let ξ ∈ R^𝔡, and for every i ∈ {1, 2} let Θ^i : [0, ∞) → R^𝔡 satisfy for all t ∈ [0, ∞) that

\[
\Theta^i_t = \xi - \int_0^t (\nabla \mathcal{L}_i)(\Theta^i_s) \, ds \tag{5.89}
\]

(cf. Definitions 1.1.3, 1.2.1, 1.2.43, 5.4.5, and 5.4.10 and Corollary 5.3.7). Then it holds for all i, j ∈ {1, 2} that Θ^i is a GF trajectory for the objective function ℒ_j with initial value ξ (cf. Definition 5.2.1).

Proof for Example 5.5.2. Observe that Lemma 5.4.11 implies that for all x, y ∈ (0, ∞)^{lL} it holds that
(∇_x 𝐋_1)(x, y) = (∇_x 𝐋_2)(x, y). (5.90)
Hence, we obtain that for all x ∈ R^𝔡 it holds that
(∇ℒ_1)(x) = (∇ℒ_2)(x). (5.91)
This, (5.9), (5.10), and (5.89) demonstrate that for all i, j ∈ {1, 2} it holds that Θ^i is a GF trajectory for the objective function ℒ_j with initial value ξ (cf. Definition 5.2.1). The proof for Example 5.5.2 is thus complete.


5.6 Lyapunov-type functions for GFs


5.6.1 Gronwall differential inequalities
The following lemma, Lemma 5.6.1 below, is referred to as a Gronwall inequality in the
literature (cf., for instance, Henry [194, Chapter 7]). Gronwall inequalities are powerful
tools to study dynamical systems and, especially, solutions of ODEs.
Lemma 5.6.1 (Gronwall inequality). Let T ∈ (0, ∞), α ∈ R, ϵ ∈ C 1 ([0, T ], R), β ∈
C([0, T ], R) satisfy for all t ∈ [0, T ] that

ϵ′ (t) ≤ αϵ(t) + β(t). (5.92)

Then it holds for all t ∈ [0, T] that

\[
\epsilon(t) \le e^{\alpha t} \epsilon(0) + \int_0^t e^{\alpha(t-s)} \beta(s) \, ds. \tag{5.93}
\]

Proof of Lemma 5.6.1. Throughout this proof, let v : [0, T] → R satisfy for all t ∈ [0, T] that

\[
v(t) = e^{\alpha t} \Biggl[\int_0^t e^{-\alpha s} \beta(s) \, ds\Biggr] \tag{5.94}
\]

and let u : [0, T ] → R satisfy for all t ∈ [0, T ] that

u(t) = [ϵ(t) − v(t)]e−αt . (5.95)

Note that the product rule and the fundamental theorem of calculus demonstrate that for all t ∈ [0, T] it holds that v ∈ C^1([0, T], R) and

\[
v'(t) = \Biggl[\int_0^t \alpha e^{\alpha(t-s)} \beta(s) \, ds\Biggr] + \beta(t)
= \alpha \Biggl[\int_0^t e^{\alpha(t-s)} \beta(s) \, ds\Biggr] + \beta(t) = \alpha v(t) + \beta(t). \tag{5.96}
\]

The assumption that ϵ ∈ C 1 ([0, T ], R) and the product rule therefore ensure that for all
t ∈ [0, T ] it holds that u ∈ C 1 ([0, T ], R) and

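\[
\begin{aligned}
u'(t) &= [\epsilon'(t) - v'(t)] e^{-\alpha t} - [\epsilon(t) - v(t)] \alpha e^{-\alpha t} \\
&= [\epsilon'(t) - v'(t) - \alpha \epsilon(t) + \alpha v(t)] e^{-\alpha t} \\
&= [\epsilon'(t) - \alpha v(t) - \beta(t) - \alpha \epsilon(t) + \alpha v(t)] e^{-\alpha t} \\
&= [\epsilon'(t) - \beta(t) - \alpha \epsilon(t)] e^{-\alpha t}.
\end{aligned} \tag{5.97}
\]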

Combining this with the assumption that for all t ∈ [0, T ] it holds that ϵ′ (t) ≤ αϵ(t) + β(t)
proves that for all t ∈ [0, T ] it holds that

u′ (t) ≤ [αϵ(t) + β(t) − β(t) − αϵ(t)]e−αt = 0. (5.98)


This and the fundamental theorem of calculus imply that for all t ∈ [0, T] it holds that

\[
u(t) = u(0) + \int_0^t u'(s) \, ds \le u(0) + \int_0^t 0 \, ds = u(0) = \epsilon(0). \tag{5.99}
\]

Combining this, (5.94), and (5.95) shows that for all t ∈ [0, T] it holds that

\[
\epsilon(t) = e^{\alpha t} u(t) + v(t) \le e^{\alpha t} \epsilon(0) + v(t) \le e^{\alpha t} \epsilon(0) + \int_0^t e^{\alpha(t-s)} \beta(s) \, ds. \tag{5.100}
\]

The proof of Lemma 5.6.1 is thus complete.
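
The inequality (5.93) can be observed numerically. In the sketch below all concrete choices are assumptions made for illustration: α = 1/2, β = cos, ε(0) = 1, and a function ε solving ε′(t) = αε(t) + β(t) − 1, which satisfies (5.92) with strict inequality. The differential equation is integrated with explicit Euler steps and the integral in (5.93) with the trapezoidal rule, so the comparison holds only up to discretization error.

```python
import numpy as np

alpha, T, n = 0.5, 2.0, 2000
t = np.linspace(0.0, T, n + 1)
dt = T / n
beta = np.cos(t)

# Explicit Euler for eps'(t) = alpha*eps(t) + beta(t) - 1 <= alpha*eps(t) + beta(t).
eps = np.empty(n + 1)
eps[0] = 1.0
for i in range(n):
    eps[i + 1] = eps[i] + dt * (alpha * eps[i] + beta[i] - 1.0)

# Right-hand side of (5.93); the integral of exp(-alpha*s)*beta(s) uses the trapezoidal rule.
h = np.exp(-alpha * t) * beta
integral = np.concatenate(([0.0], np.cumsum(0.5 * (h[1:] + h[:-1]) * dt)))
bound = np.exp(alpha * t) * (eps[0] + integral)

print(eps[-1], bound[-1])
```

The computed values of ε stay below the Gronwall bound on the whole grid, with equality at t = 0.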

5.6.2 Lyapunov-type functions for ODEs


Proposition 5.6.2 (Lyapunov-type functions for ODEs). Let d ∈ N, T ∈ (0, ∞), α ∈ R,
let O ⊆ Rd be open, let β ∈ C(O, R), G ∈ C(O, Rd ), V ∈ C 1 (O, R) satisfy for all θ ∈ O
that
V′(θ)G(θ) = ⟨(∇V)(θ), G(θ)⟩ ≤ αV(θ) + β(θ), (5.101)
and let Θ ∈ C([0, T], O) satisfy for all t ∈ [0, T] that Θ_t = Θ_0 + ∫_0^t G(Θ_s) ds (cf. Definition 1.4.7). Then it holds for all t ∈ [0, T] that

\[
V(\Theta_t) \le e^{\alpha t} V(\Theta_0) + \int_0^t e^{\alpha(t-s)} \beta(\Theta_s) \, ds. \tag{5.102}
\]

Proof of Proposition 5.6.2. Throughout this proof, let ϵ, b ∈ C([0, T ], R) satisfy for all
t ∈ [0, T ] that
ϵ(t) = V (Θt ) and b(t) = β(Θt ). (5.103)
Observe that (5.101), (5.103), the fundamental theorem of calculus, and the chain rule ensure that for all t ∈ [0, T] it holds that

\[
\epsilon'(t) = \tfrac{d}{dt}\bigl(V(\Theta_t)\bigr) = V'(\Theta_t)\, \dot{\Theta}_t = V'(\Theta_t)\, G(\Theta_t) \le \alpha V(\Theta_t) + \beta(\Theta_t) = \alpha \epsilon(t) + b(t). \tag{5.104}
\]


Lemma 5.6.1 and (5.103) hence demonstrate that for all t ∈ [0, T] it holds that

\[
V(\Theta_t) = \epsilon(t) \le \epsilon(0) e^{\alpha t} + \int_0^t e^{\alpha(t-s)} b(s) \, ds = V(\Theta_0) e^{\alpha t} + \int_0^t e^{\alpha(t-s)} \beta(\Theta_s) \, ds. \tag{5.105}
\]

The proof of Proposition 5.6.2 is thus complete.


Corollary 5.6.3. Let d ∈ N, T ∈ (0, ∞), α ∈ R, let O ⊆ Rd be open, let G ∈ C(O, Rd ),
V ∈ C 1 (O, R) satisfy for all θ ∈ O that
V′(θ)G(θ) = ⟨(∇V)(θ), G(θ)⟩ ≤ αV(θ), (5.106)
and let Θ ∈ C([0, T], O) satisfy for all t ∈ [0, T] that Θ_t = Θ_0 + ∫_0^t G(Θ_s) ds (cf. Definition 1.4.7). Then it holds for all t ∈ [0, T] that
V(Θ_t) ≤ e^{αt} V(Θ_0). (5.107)


Proof of Corollary 5.6.3. Note that Proposition 5.6.2 (applied with β ↶ (O ∋ θ ↦ 0 ∈ R) in the notation of Proposition 5.6.2) and (5.106) establish (5.107). The
proof of Corollary 5.6.3 is thus complete.

5.6.3 On Lyapunov-type functions and coercivity-type conditions


Lemma 5.6.4 (Derivative of the standard norm). Let d ∈ N, ϑ ∈ Rd and let V : Rd → R
satisfy for all θ ∈ Rd that
\[ V(\theta) = \|\theta - \vartheta\|_2^2 \tag{5.108} \]
(cf. Definition 3.3.4). Then it holds for all θ ∈ Rd that V ∈ C ∞ (Rd , R) and

(∇V )(θ) = 2(θ − ϑ). (5.109)

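Proof of Lemma 5.6.4. Throughout this proof, let $\vartheta_1, \vartheta_2, \dots, \vartheta_d \in \mathbb{R}$ satisfy $\vartheta = (\vartheta_1, \vartheta_2, \dots, \vartheta_d)$. Note that the fact that for all $\theta = (\theta_1, \theta_2, \dots, \theta_d) \in \mathbb{R}^d$ it holds that
\[ V(\theta) = \sum_{i=1}^{d} (\theta_i - \vartheta_i)^2 \tag{5.110} \]
implies that for all $\theta = (\theta_1, \theta_2, \dots, \theta_d) \in \mathbb{R}^d$ it holds that $V \in C^\infty(\mathbb{R}^d, \mathbb{R})$ and
\[ (\nabla V)(\theta) = \begin{pmatrix} \frac{\partial V}{\partial\theta_1}(\theta) \\ \vdots \\ \frac{\partial V}{\partial\theta_d}(\theta) \end{pmatrix} = \begin{pmatrix} 2(\theta_1 - \vartheta_1) \\ \vdots \\ 2(\theta_d - \vartheta_d) \end{pmatrix} = 2(\theta - \vartheta). \tag{5.111} \]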

The proof of Lemma 5.6.4 is thus complete.


Corollary 5.6.5 (On quadratic Lyapunov-type functions and coercivity-type conditions).
Let d ∈ N, c ∈ R, T ∈ (0, ∞), ϑ ∈ Rd , let O ⊆ Rd be open, let L ∈ C 1 (O, R) satisfy for all
θ ∈ O that
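\[ \langle\theta - \vartheta, (\nabla L)(\theta)\rangle \ge c\|\theta - \vartheta\|_2^2, \tag{5.112} \]
and let $\Theta \in C([0,T],O)$ satisfy for all $t \in [0,T]$ that $\Theta_t = \Theta_0 - \int_0^t (\nabla L)(\Theta_s)\,\mathrm{d}s$ (cf. Definitions 1.4.7 and 3.3.4). Then it holds for all $t \in [0,T]$ that

\[ \|\Theta_t - \vartheta\|_2 \le e^{-ct}\|\Theta_0 - \vartheta\|_2. \tag{5.113} \]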

Proof of Corollary 5.6.5. Throughout this proof, let G : O → Rd satisfy for all θ ∈ O that

G(θ) = −(∇L)(θ) (5.114)

and let V : O → R satisfy for all θ ∈ O that

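\[ V(\theta) = \|\theta - \vartheta\|_2^2. \tag{5.115} \]

Observe that Lemma 5.6.4 and (5.112) ensure that for all $\theta \in O$ it holds that $V \in C^1(O, \mathbb{R})$ and
\[ \begin{aligned} V'(\theta)G(\theta) &= \langle(\nabla V)(\theta), G(\theta)\rangle = \langle 2(\theta - \vartheta), G(\theta)\rangle \\ &= -2\langle\theta - \vartheta, (\nabla L)(\theta)\rangle \le -2c\|\theta - \vartheta\|_2^2 = -2cV(\theta). \end{aligned} \tag{5.116} \]

Corollary 5.6.3 hence proves that for all $t \in [0,T]$ it holds that

\[ \|\Theta_t - \vartheta\|_2^2 = V(\Theta_t) \le e^{-2ct}V(\Theta_0) = e^{-2ct}\|\Theta_0 - \vartheta\|_2^2. \tag{5.117} \]

Taking square roots on both sides establishes (5.113). The proof of Corollary 5.6.5 is thus complete.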
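
The exponential decay in (5.113) can be checked numerically on a simple example. The following is a minimal sketch under assumptions that are not part of the text: the objective is the diagonal quadratic L(θ) = ½ Σᵢ aᵢ(θᵢ − ϑᵢ)², for which the coercivity-type condition (5.112) holds with c = min aᵢ and the gradient flow is known in closed form. The names `a`, `flow`, `dist`, and `bound_holds` are illustrative only.

```python
import math

# Assumed toy objective (not from the text):
#   L(theta) = 0.5 * sum_i a[i] * (theta[i] - vartheta[i])**2,
# which satisfies (5.112) with c = min(a) on O = R^d.
a = [1.0, 2.5, 4.0]
vartheta = [0.5, -1.0, 2.0]
xi = [3.0, 1.0, -2.0]  # initial value Theta_0
c = min(a)


def flow(t):
    """Exact gradient flow Theta_t; coordinate i decays like exp(-a[i] * t)."""
    return [v + math.exp(-ai * t) * (x - v) for ai, v, x in zip(a, vartheta, xi)]


def dist(theta):
    """Euclidean distance between theta and the minimizer vartheta."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(theta, vartheta)))


def bound_holds(t, tol=1e-12):
    """Check the estimate (5.113) at time t, up to floating-point tolerance."""
    return dist(flow(t)) <= math.exp(-c * t) * dist(xi) + tol


times = [0.1 * k for k in range(51)]
print(all(bound_holds(t) for t in times))
```

In this example the coordinate with the smallest curvature dominates the long-time behaviour, which is why the rate in (5.113) is governed by c = min aᵢ rather than by the larger curvatures.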

5.6.4 Sufficient and necessary conditions for local minimum points


Lemma 5.6.6. Let d ∈ N, let O ⊆ Rd be open, let ϑ ∈ O, let L : O → R be a function,
assume that L is differentiable at ϑ, and assume that (∇L)(ϑ) ̸= 0. Then there exists
θ ∈ O such that L(θ) < L(ϑ).

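Proof of Lemma 5.6.6. Throughout this proof, let $v \in \mathbb{R}^d\setminus\{0\}$ satisfy $v = -(\nabla L)(\vartheta)$, let $\delta \in (0,\infty)$ satisfy for all $t \in (-\delta,\delta)$ that

\[ \vartheta + tv = \vartheta - t(\nabla L)(\vartheta) \in O, \tag{5.118} \]

and let $\mathbf{L}\colon (-\delta,\delta) \to \mathbb{R}$ satisfy for all $t \in (-\delta,\delta)$ that

\[ \mathbf{L}(t) = L(\vartheta + tv). \tag{5.119} \]

Note that for all $t \in (0,\delta)$ it holds that

\[ \begin{aligned} \left|\frac{\mathbf{L}(t) - \mathbf{L}(0)}{t} + \|v\|_2^2\right| &= \left|\frac{L(\vartheta + tv) - L(\vartheta)}{t} + \|(\nabla L)(\vartheta)\|_2^2\right| \\ &= \left|\frac{L(\vartheta + tv) - L(\vartheta)}{t} + \langle(\nabla L)(\vartheta), (\nabla L)(\vartheta)\rangle\right| \\ &= \left|\frac{L(\vartheta + tv) - L(\vartheta)}{t} - \langle(\nabla L)(\vartheta), v\rangle\right|. \end{aligned} \tag{5.120} \]

Therefore, we obtain that for all $t \in (0,\delta)$ it holds that

\[ \begin{aligned} \left|\frac{\mathbf{L}(t) - \mathbf{L}(0)}{t} + \|v\|_2^2\right| &= \left|\frac{L(\vartheta + tv) - L(\vartheta)}{t} - L'(\vartheta)v\right| \\ &= \left|\frac{L(\vartheta + tv) - L(\vartheta) - L'(\vartheta)tv}{t}\right| = \frac{|L(\vartheta + tv) - L(\vartheta) - L'(\vartheta)tv|}{t}. \end{aligned} \tag{5.121} \]

The assumption that $L$ is differentiable at $\vartheta$ hence demonstrates that

\[ \limsup_{t\searrow 0}\left|\frac{\mathbf{L}(t) - \mathbf{L}(0)}{t} + \|v\|_2^2\right| = 0. \tag{5.122} \]

The fact that $\|v\|_2^2 > 0$ therefore demonstrates that there exists $t \in (0,\delta)$ such that

\[ \left|\frac{\mathbf{L}(t) - \mathbf{L}(0)}{t} + \|v\|_2^2\right| < \frac{\|v\|_2^2}{2}. \tag{5.123} \]

The triangle inequality and the fact that $\|v\|_2^2 > 0$ hence prove that

\[ \begin{aligned} \frac{\mathbf{L}(t) - \mathbf{L}(0)}{t} &= \left[\frac{\mathbf{L}(t) - \mathbf{L}(0)}{t} + \|v\|_2^2\right] - \|v\|_2^2 \le \left|\frac{\mathbf{L}(t) - \mathbf{L}(0)}{t} + \|v\|_2^2\right| - \|v\|_2^2 \\ &< \frac{\|v\|_2^2}{2} - \|v\|_2^2 = -\frac{\|v\|_2^2}{2} < 0. \end{aligned} \tag{5.124} \]

This ensures that
\[ L(\vartheta + tv) = \mathbf{L}(t) < \mathbf{L}(0) = L(\vartheta). \tag{5.125} \]
The proof of Lemma 5.6.6 is thus complete.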

Lemma 5.6.7 (A necessary condition for a local minimum point). Let d ∈ N, let O ⊆ Rd
be open, let ϑ ∈ O, let L : O → R be a function, assume that L is differentiable at ϑ, and
assume
L(ϑ) = inf θ∈O L(θ). (5.126)
Then (∇L)(ϑ) = 0.

Proof of Lemma 5.6.7. We prove Lemma 5.6.7 by contradiction. We thus assume that
(∇L)(ϑ) ̸= 0. Lemma 5.6.6 then implies that there exists θ ∈ O such that L(θ) < L(ϑ).
Combining this with (5.126) shows that

\[ L(\theta) < L(\vartheta) = \inf_{w \in O} L(w) \le L(\theta). \tag{5.127} \]

This contradiction establishes that (∇L)(ϑ) = 0. The proof of Lemma 5.6.7 is thus complete.

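Lemma 5.6.8 (A sufficient condition for a local minimum point). Let $d \in \mathbb{N}$, $c \in (0,\infty)$, $r \in (0,\infty]$, $\vartheta \in \mathbb{R}^d$, $\mathbb{B} = \{w \in \mathbb{R}^d \colon \|w - \vartheta\|_2 \le r\}$, $L \in C^1(\mathbb{R}^d, \mathbb{R})$ satisfy for all $\theta \in \mathbb{B}$ that

\[ \langle\theta - \vartheta, (\nabla L)(\theta)\rangle \ge c\|\theta - \vartheta\|_2^2 \tag{5.128} \]

(cf. Definitions 1.4.7 and 3.3.4). Then

(i) it holds for all $\theta \in \mathbb{B}$ that $L(\theta) - L(\vartheta) \ge \tfrac{c}{2}\|\theta - \vartheta\|_2^2$,

(ii) it holds that $\{\theta \in \mathbb{B} \colon L(\theta) = \inf_{w \in \mathbb{B}} L(w)\} = \{\vartheta\}$, and

(iii) it holds that $(\nabla L)(\vartheta) = 0$.

Proof of Lemma 5.6.8. Throughout this proof, let $B$ be the open ball given by

\[ B = \{w \in \mathbb{R}^d \colon \|w - \vartheta\|_2 < r\}. \tag{5.129} \]

Note that (5.128) implies that for all $v \in \mathbb{R}^d$ with $\|v\|_2 \le r$ it holds that

\[ \langle(\nabla L)(\vartheta + v), v\rangle \ge c\|v\|_2^2. \tag{5.130} \]

The fundamental theorem of calculus hence demonstrates that for all $\theta \in \mathbb{B}$ it holds that
\[ \begin{aligned} L(\theta) - L(\vartheta) &= \big[L(\vartheta + t(\theta - \vartheta))\big]_{t=0}^{t=1} \\ &= \int_0^1 L'(\vartheta + t(\theta - \vartheta))(\theta - \vartheta)\,\mathrm{d}t \\ &= \int_0^1 \langle(\nabla L)(\vartheta + t(\theta - \vartheta)), t(\theta - \vartheta)\rangle\,\tfrac{1}{t}\,\mathrm{d}t \\ &\ge \int_0^1 c\|t(\theta - \vartheta)\|_2^2\,\tfrac{1}{t}\,\mathrm{d}t = c\|\theta - \vartheta\|_2^2\left[\int_0^1 t\,\mathrm{d}t\right] = \tfrac{c}{2}\|\theta - \vartheta\|_2^2. \end{aligned} \tag{5.131} \]

This proves item (i). Next observe that (5.131) ensures that for all $\theta \in \mathbb{B}\setminus\{\vartheta\}$ it holds that

\[ L(\theta) \ge L(\vartheta) + \tfrac{c}{2}\|\theta - \vartheta\|_2^2 > L(\vartheta). \tag{5.132} \]

Hence, we obtain for all $\theta \in \mathbb{B}\setminus\{\vartheta\}$ that

\[ \inf_{w \in \mathbb{B}} L(w) = L(\vartheta) < L(\theta). \tag{5.133} \]

This establishes item (ii). It thus remains to prove item (iii). For this observe that item (ii) ensures that

\[ \{\theta \in B \colon L(\theta) = \inf_{w \in B} L(w)\} = \{\vartheta\}. \tag{5.134} \]

Combining this, the fact that $B$ is open, and Lemma 5.6.7 (applied with $d \curvearrowleft d$, $O \curvearrowleft B$, $\vartheta \curvearrowleft \vartheta$, $L \curvearrowleft L|_B$ in the notation of Lemma 5.6.7) assures that $(\nabla L)(\vartheta) = 0$. This establishes item (iii). The proof of Lemma 5.6.8 is thus complete.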


5.6.5 On a linear growth condition


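Lemma 5.6.9 (On a linear growth condition). Let $d \in \mathbb{N}$, $L \in \mathbb{R}$, $r \in (0,\infty]$, $\vartheta \in \mathbb{R}^d$, $B = \{w \in \mathbb{R}^d \colon \|w - \vartheta\|_2 \le r\}$, $\mathcal{L} \in C^1(\mathbb{R}^d, \mathbb{R})$ satisfy for all $\theta \in B$ that

\[ \|(\nabla\mathcal{L})(\theta)\|_2 \le L\|\theta - \vartheta\|_2 \tag{5.135} \]

(cf. Definition 3.3.4). Then it holds for all $\theta \in B$ that

\[ \mathcal{L}(\theta) - \mathcal{L}(\vartheta) \le \tfrac{L}{2}\|\theta - \vartheta\|_2^2. \tag{5.136} \]

Proof of Lemma 5.6.9. Observe that (5.135), the Cauchy-Schwarz inequality, and the fundamental theorem of calculus ensure that for all $\theta \in B$ it holds that
\[ \begin{aligned} \mathcal{L}(\theta) - \mathcal{L}(\vartheta) &= \big[\mathcal{L}(\vartheta + t(\theta - \vartheta))\big]_{t=0}^{t=1} \\ &= \int_0^1 \mathcal{L}'(\vartheta + t(\theta - \vartheta))(\theta - \vartheta)\,\mathrm{d}t \\ &= \int_0^1 \langle(\nabla\mathcal{L})(\vartheta + t(\theta - \vartheta)), \theta - \vartheta\rangle\,\mathrm{d}t \\ &\le \int_0^1 \|(\nabla\mathcal{L})(\vartheta + t(\theta - \vartheta))\|_2\,\|\theta - \vartheta\|_2\,\mathrm{d}t \\ &\le \int_0^1 L\|\vartheta + t(\theta - \vartheta) - \vartheta\|_2\,\|\theta - \vartheta\|_2\,\mathrm{d}t \\ &= L\|\theta - \vartheta\|_2^2\left[\int_0^1 t\,\mathrm{d}t\right] = \tfrac{L}{2}\|\theta - \vartheta\|_2^2 \end{aligned} \tag{5.137} \]

(cf. Definition 1.4.7). The proof of Lemma 5.6.9 is thus complete.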
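
Taken together, item (i) of Lemma 5.6.8 and Lemma 5.6.9 sandwich the objective between two quadratics on the ball B. The following minimal sketch checks this on an assumed one-dimensional example that is not from the text: the objective θ ↦ θ² + θ⁴/4 with ϑ = 0 and r = 1.5, for which the coercivity constant is c = 2 and the growth constant is 2 + r². The names `objective`, `lip`, and `sandwich_holds` are illustrative only.

```python
# Assumed one-dimensional example (not from the text):
#   L(theta) = theta**2 + theta**4 / 4, vartheta = 0, ball radius r = 1.5.
r = 1.5
c = 2.0              # theta * L'(theta) = 2 theta^2 + theta^4 >= 2 theta^2
lip = 2.0 + r ** 2   # |L'(theta)| = |theta| (2 + theta^2) <= (2 + r^2) |theta| on the ball


def objective(theta):
    """The assumed objective function."""
    return theta ** 2 + theta ** 4 / 4.0


def sandwich_holds(theta, tol=1e-12):
    """Check c/2 * theta^2 <= L(theta) - L(0) <= lip/2 * theta^2."""
    gap = objective(theta) - objective(0.0)
    return 0.5 * c * theta ** 2 - tol <= gap <= 0.5 * lip * theta ** 2 + tol


# Equidistant grid on the closed ball [-r, r].
grid = [-r + 2.0 * r * k / 200 for k in range(201)]
print(all(sandwich_holds(theta) for theta in grid))
```

The upper bound is genuinely local in this example: outside the ball, for instance at θ = 3, the quartic term dominates and the inequality with the constant 2 + r² no longer holds.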

5.7 Optimization through flows of ODEs


5.7.1 Approximation of local minimum points through GFs
Proposition 5.7.1 (Approximation of local minimum points through GFs). Let d ∈ N,
c, T ∈ (0, ∞), r ∈ (0, ∞], ϑ ∈ Rd , B = {w ∈ Rd : ∥w − ϑ∥2 ≤ r}, ξ ∈ B, L ∈ C 1 (Rd , R)
satisfy for all θ ∈ B that
\[ \langle\theta - \vartheta, (\nabla L)(\theta)\rangle \ge c\|\theta - \vartheta\|_2^2, \tag{5.138} \]
and let $\Theta \in C([0,T], \mathbb{R}^d)$ satisfy for all $t \in [0,T]$ that $\Theta_t = \xi - \int_0^t (\nabla L)(\Theta_s)\,\mathrm{d}s$ (cf. Definitions 1.4.7 and 3.3.4). Then

(i) it holds that $\{\theta \in B \colon L(\theta) = \inf_{w \in B} L(w)\} = \{\vartheta\}$,

(ii) it holds for all $t \in [0,T]$ that $\|\Theta_t - \vartheta\|_2 \le e^{-ct}\|\xi - \vartheta\|_2$, and

(iii) it holds for all $t \in [0,T]$ that

\[ 0 \le \tfrac{c}{2}\|\Theta_t - \vartheta\|_2^2 \le L(\Theta_t) - L(\vartheta). \tag{5.139} \]

Proof of Proposition 5.7.1. Throughout this proof, let $V\colon \mathbb{R}^d \to [0,\infty)$ satisfy for all $\theta \in \mathbb{R}^d$ that $V(\theta) = \|\theta - \vartheta\|_2^2$, let $\epsilon\colon [0,T] \to [0,\infty)$ satisfy for all $t \in [0,T]$ that $\epsilon(t) = \|\Theta_t - \vartheta\|_2^2 = V(\Theta_t)$, and let $\tau \in [0,T]$ be the real number given by

\[ \tau = \inf(\{t \in [0,T] \colon \Theta_t \notin B\} \cup \{T\}) = \inf(\{t \in [0,T] \colon \epsilon(t) > r^2\} \cup \{T\}). \tag{5.140} \]

Note that (5.138) and item (ii) in Lemma 5.6.8 establish item (i). Next observe that
Lemma 5.6.4 implies that for all θ ∈ Rd it holds that V ∈ C 1 (Rd , [0, ∞)) and

(∇V )(θ) = 2(θ − ϑ). (5.141)

Moreover, observe that the fundamental theorem of calculus (see, for example, Coleman [85, Theorem 3.9]) and the fact that $\mathbb{R}^d \ni v \mapsto (\nabla L)(v) \in \mathbb{R}^d$ and $\Theta\colon [0,T] \to \mathbb{R}^d$ are continuous functions ensure that for all $t \in [0,T]$ it holds that $\Theta \in C^1([0,T], \mathbb{R}^d)$ and
\[ \tfrac{\mathrm{d}}{\mathrm{d}t}(\Theta_t) = -(\nabla L)(\Theta_t). \tag{5.142} \]

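Combining (5.138) and (5.141) hence demonstrates that for all $t \in [0,\tau]$ it holds that $\epsilon \in C^1([0,T], [0,\infty))$ and

\[ \begin{aligned} \epsilon'(t) &= \tfrac{\mathrm{d}}{\mathrm{d}t}\big(V(\Theta_t)\big) = V'(\Theta_t)\big(\tfrac{\mathrm{d}}{\mathrm{d}t}(\Theta_t)\big) \\ &= \langle(\nabla V)(\Theta_t), \tfrac{\mathrm{d}}{\mathrm{d}t}(\Theta_t)\rangle \\ &= \langle 2(\Theta_t - \vartheta), -(\nabla L)(\Theta_t)\rangle \\ &= -2\langle\Theta_t - \vartheta, (\nabla L)(\Theta_t)\rangle \\ &\le -2c\|\Theta_t - \vartheta\|_2^2 = -2c\,\epsilon(t). \end{aligned} \tag{5.143} \]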

The Gronwall inequality, for instance, in Lemma 5.6.1 therefore implies that for all t ∈ [0, τ ]
it holds that
ϵ(t) ≤ ϵ(0)e−2ct . (5.144)
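Hence, we obtain for all $t \in [0,\tau]$ that

\[ \|\Theta_t - \vartheta\|_2 = \sqrt{\epsilon(t)} \le \sqrt{\epsilon(0)}\,e^{-ct} = \|\Theta_0 - \vartheta\|_2\,e^{-ct} = \|\xi - \vartheta\|_2\,e^{-ct}. \tag{5.145} \]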

In the next step we prove that


τ > 0. (5.146)


In our proof of (5.146) we distinguish between the case $\epsilon(0) = 0$ and the case $\epsilon(0) > 0$. We first prove (5.146) in the case
\[ \epsilon(0) = 0. \tag{5.147} \]
Observe that (5.147), the assumption that $r \in (0,\infty]$, and the fact that $\epsilon\colon [0,T] \to [0,\infty)$ is a continuous function show that

\[ \tau = \inf(\{t \in [0,T] \colon \epsilon(t) > r^2\} \cup \{T\}) > 0. \tag{5.148} \]

This establishes (5.146) in the case $\epsilon(0) = 0$. In the next step we prove (5.146) in the case

\[ \epsilon(0) > 0. \tag{5.149} \]

Note that (5.143) and the assumption that c ∈ (0, ∞) assure that for all t ∈ [0, τ ] with
ϵ(t) > 0 it holds that
ϵ′ (t) ≤ −2cϵ(t) < 0. (5.150)
Combining this with (5.149) shows that

ϵ′ (0) < 0. (5.151)

The fact that $\epsilon'\colon [0,T] \to \mathbb{R}$ is a continuous function and the assumption that $T \in (0,\infty)$ therefore demonstrate that

\[ \inf(\{t \in [0,T] \colon \epsilon'(t) > 0\} \cup \{T\}) > 0. \tag{5.152} \]

Next note that the fundamental theorem of calculus and the assumption that $\xi \in B$ imply that for all $s \in [0,T]$ with $s < \inf(\{t \in [0,T] \colon \epsilon'(t) > 0\} \cup \{T\})$ it holds that
\[ \epsilon(s) = \epsilon(0) + \int_0^s \epsilon'(u)\,\mathrm{d}u \le \epsilon(0) = \|\xi - \vartheta\|_2^2 \le r^2. \tag{5.153} \]

Combining this with (5.152) proves that

\[ \tau = \inf(\{s \in [0,T] \colon \epsilon(s) > r^2\} \cup \{T\}) > 0. \tag{5.154} \]

This establishes (5.146) in the case $\epsilon(0) > 0$. Observe that (5.145), (5.146), and the
assumption that c ∈ (0, ∞) demonstrate that

∥Θτ − ϑ∥2 ≤ ∥ξ − ϑ∥2 e−cτ < r. (5.155)

The fact that ϵ : [0, T ] → [0, ∞) is a continuous function, (5.140), and (5.146) hence assure
that τ = T . Combining this with (5.145) proves that for all t ∈ [0, T ] it holds that

∥Θt − ϑ∥2 ≤ ∥ξ − ϑ∥2 e−ct . (5.156)


This establishes item (ii). It thus remains to prove item (iii). For this observe that (5.138) and item (i) in Lemma 5.6.8 demonstrate that for all $\theta \in B$ it holds that

\[ 0 \le \tfrac{c}{2}\|\theta - \vartheta\|_2^2 \le L(\theta) - L(\vartheta). \tag{5.157} \]

Combining this and item (ii) implies that for all $t \in [0,T]$ it holds that

\[ 0 \le \tfrac{c}{2}\|\Theta_t - \vartheta\|_2^2 \le L(\Theta_t) - L(\vartheta). \tag{5.158} \]

This establishes item (iii). The proof of Proposition 5.7.1 is thus complete.

5.7.2 Existence and uniqueness of solutions of ODEs


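Lemma 5.7.2 (Local existence of maximal solution of ODEs). Let $d \in \mathbb{N}$, $\xi \in \mathbb{R}^d$, $T \in (0,\infty)$, let $|||\cdot|||\colon \mathbb{R}^d \to [0,\infty)$ be a norm, and let $G\colon \mathbb{R}^d \to \mathbb{R}^d$ be locally Lipschitz continuous. Then there exist a unique real number $\tau \in (0,T]$ and a unique continuous function $\Theta\colon [0,\tau) \to \mathbb{R}^d$ such that for all $t \in [0,\tau)$ it holds that
\[ \liminf_{s\nearrow\tau}\Big[|||\Theta_s||| + \tfrac{1}{(T-s)}\Big] = \infty \qquad\text{and}\qquad \Theta_t = \xi + \int_0^t G(\Theta_s)\,\mathrm{d}s. \tag{5.159} \]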

Proof of Lemma 5.7.2. Note that, for example, Teschl [394, Theorem 2.2 and Corollary 2.16]
implies (5.159) (cf., for instance, [5, Theorem 7.6] and [222, Theorem 1.1]). The proof of
Lemma 5.7.2 is thus complete.

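Lemma 5.7.3 (Local existence of maximal solution of ODEs on an infinite time interval). Let $d \in \mathbb{N}$, $\xi \in \mathbb{R}^d$, let $|||\cdot|||\colon \mathbb{R}^d \to [0,\infty)$ be a norm, and let $G\colon \mathbb{R}^d \to \mathbb{R}^d$ be locally Lipschitz continuous. Then there exist a unique extended real number $\tau \in (0,\infty]$ and a unique continuous function $\Theta\colon [0,\tau) \to \mathbb{R}^d$ such that for all $t \in [0,\tau)$ it holds that
\[ \liminf_{s\nearrow\tau}\big[|||\Theta_s||| + s\big] = \infty \qquad\text{and}\qquad \Theta_t = \xi + \int_0^t G(\Theta_s)\,\mathrm{d}s. \tag{5.160} \]

Proof of Lemma 5.7.3. First, observe that Lemma 5.7.2 implies that there exist unique real numbers $\tau_n \in (0,n]$, $n \in \mathbb{N}$, and unique continuous functions $\Theta^{(n)}\colon [0,\tau_n) \to \mathbb{R}^d$, $n \in \mathbb{N}$, such that for all $n \in \mathbb{N}$, $t \in [0,\tau_n)$ it holds that
\[ \liminf_{s\nearrow\tau_n}\Big[|||\Theta^{(n)}_s||| + \tfrac{1}{(n-s)}\Big] = \infty \qquad\text{and}\qquad \Theta^{(n)}_t = \xi + \int_0^t G(\Theta^{(n)}_s)\,\mathrm{d}s. \tag{5.161} \]

This shows that for all $n \in \mathbb{N}$, $t \in [0,\min\{\tau_{n+1},n\})$ it holds that
\[ \liminf_{s\nearrow\tau_{n+1}}\Big[|||\Theta^{(n+1)}_s||| + \tfrac{1}{(n+1-s)}\Big] = \infty \qquad\text{and}\qquad \Theta^{(n+1)}_t = \xi + \int_0^t G(\Theta^{(n+1)}_s)\,\mathrm{d}s. \tag{5.162} \]

Hence, we obtain that for all $n \in \mathbb{N}$, $t \in [0,\min\{\tau_{n+1},n\})$ it holds that
\[ \liminf_{s\nearrow\min\{\tau_{n+1},n\}}\Big[|||\Theta^{(n+1)}_s||| + \tfrac{1}{(n-s)}\Big] = \infty \tag{5.163} \]
\[ \text{and}\qquad \Theta^{(n+1)}_t = \xi + \int_0^t G(\Theta^{(n+1)}_s)\,\mathrm{d}s. \tag{5.164} \]
Combining this with (5.161) demonstrates that for all $n \in \mathbb{N}$ it holds that

\[ \tau_n = \min\{\tau_{n+1},n\} \qquad\text{and}\qquad \Theta^{(n)} = \Theta^{(n+1)}|_{[0,\min\{\tau_{n+1},n\})}. \tag{5.165} \]

Therefore, we obtain that for all $n \in \mathbb{N}$ it holds that

\[ \tau_n \le \tau_{n+1} \qquad\text{and}\qquad \Theta^{(n)} = \Theta^{(n+1)}|_{[0,\tau_n)}. \tag{5.166} \]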

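Next let $\mathfrak{t} \in (0,\infty]$ be the extended real number given by

\[ \mathfrak{t} = \lim_{n\to\infty}\tau_n \tag{5.167} \]

and let $\Theta\colon [0,\mathfrak{t}) \to \mathbb{R}^d$ satisfy for all $n \in \mathbb{N}$, $t \in [0,\tau_n)$ that

\[ \Theta_t = \Theta^{(n)}_t. \tag{5.168} \]

Observe that for all $t \in [0,\mathfrak{t})$ there exists $n \in \mathbb{N}$ such that $t \in [0,\tau_n)$. This, (5.161), and (5.166) assure that for all $t \in [0,\mathfrak{t})$ it holds that $\Theta \in C([0,\mathfrak{t}), \mathbb{R}^d)$ and
\[ \Theta_t = \xi + \int_0^t G(\Theta_s)\,\mathrm{d}s. \tag{5.169} \]

In addition, note that (5.165) ensures that for all $n \in \mathbb{N}$, $k \in \mathbb{N} \cap [n,\infty)$ it holds that

\[ \min\{\tau_{k+1},n\} = \min\{\tau_{k+1},k,n\} = \min\{\min\{\tau_{k+1},k\},n\} = \min\{\tau_k,n\}. \tag{5.170} \]

This shows that for all $n \in \mathbb{N}$, $k \in \mathbb{N} \cap (n,\infty)$ it holds that $\min\{\tau_k,n\} = \min\{\tau_{k-1},n\}$. Hence, we obtain that for all $n \in \mathbb{N}$, $k \in \mathbb{N} \cap (n,\infty)$ it holds that

\[ \min\{\tau_k,n\} = \min\{\tau_{k-1},n\} = \ldots = \min\{\tau_{n+1},n\} = \min\{\tau_n,n\} = \tau_n. \tag{5.171} \]

Combining this with the fact that $(\tau_n)_{n\in\mathbb{N}} \subseteq [0,\infty)$ is a non-decreasing sequence implies that for all $n \in \mathbb{N}$ it holds that
\[ \min\{\mathfrak{t},n\} = \min\Big\{\lim_{k\to\infty}\tau_k, n\Big\} = \lim_{k\to\infty}\min\{\tau_k,n\} = \lim_{k\to\infty}\tau_n = \tau_n. \tag{5.172} \]

Therefore, we obtain that for all $n \in \mathbb{N}$ with $\mathfrak{t} < n$ it holds that

\[ \tau_n = \min\{\mathfrak{t},n\} = \mathfrak{t}. \tag{5.173} \]

This, (5.161), and (5.168) demonstrate that for all $n \in \mathbb{N}$ with $\mathfrak{t} < n$ it holds that
\[ \begin{aligned} \liminf_{s\nearrow\mathfrak{t}} |||\Theta_s||| &= \liminf_{s\nearrow\tau_n} |||\Theta_s||| = \liminf_{s\nearrow\tau_n} |||\Theta^{(n)}_s||| \\ &= -\tfrac{1}{(n-\mathfrak{t})} + \liminf_{s\nearrow\tau_n}\Big[|||\Theta^{(n)}_s||| + \tfrac{1}{(n-\mathfrak{t})}\Big] \\ &= -\tfrac{1}{(n-\mathfrak{t})} + \liminf_{s\nearrow\tau_n}\Big[|||\Theta^{(n)}_s||| + \tfrac{1}{(n-s)}\Big] = \infty. \end{aligned} \tag{5.174} \]

Therefore, we obtain that

\[ \liminf_{s\nearrow\mathfrak{t}}\big[|||\Theta_s||| + s\big] = \infty. \tag{5.175} \]

Next note that for all $\hat{\mathfrak{t}} \in (0,\infty]$, $\hat{\Theta} \in C([0,\hat{\mathfrak{t}}), \mathbb{R}^d)$, $n \in \mathbb{N}$, $t \in [0,\min\{\hat{\mathfrak{t}},n\})$ with $\liminf_{s\nearrow\hat{\mathfrak{t}}}[|||\hat{\Theta}_s||| + s] = \infty$ and $\forall\, s \in [0,\hat{\mathfrak{t}}) \colon \hat{\Theta}_s = \xi + \int_0^s G(\hat{\Theta}_u)\,\mathrm{d}u$ it holds that
\[ \liminf_{s\nearrow\min\{\hat{\mathfrak{t}},n\}}\Big[|||\hat{\Theta}_s||| + \tfrac{1}{(n-s)}\Big] = \infty \qquad\text{and}\qquad \hat{\Theta}_t = \xi + \int_0^t G(\hat{\Theta}_s)\,\mathrm{d}s. \tag{5.176} \]

This and (5.161) prove that for all $\hat{\mathfrak{t}} \in (0,\infty]$, $\hat{\Theta} \in C([0,\hat{\mathfrak{t}}), \mathbb{R}^d)$, $n \in \mathbb{N}$ with $\liminf_{t\nearrow\hat{\mathfrak{t}}}[|||\hat{\Theta}_t||| + t] = \infty$ and $\forall\, t \in [0,\hat{\mathfrak{t}}) \colon \hat{\Theta}_t = \xi + \int_0^t G(\hat{\Theta}_s)\,\mathrm{d}s$ it holds that

\[ \min\{\hat{\mathfrak{t}},n\} = \tau_n \qquad\text{and}\qquad \hat{\Theta}|_{[0,\tau_n)} = \Theta^{(n)}. \tag{5.177} \]

Combining (5.169) and (5.175) hence assures that for all $\hat{\mathfrak{t}} \in (0,\infty]$, $\hat{\Theta} \in C([0,\hat{\mathfrak{t}}), \mathbb{R}^d)$, $n \in \mathbb{N}$ with $\liminf_{t\nearrow\hat{\mathfrak{t}}}[|||\hat{\Theta}_t||| + t] = \infty$ and $\forall\, t \in [0,\hat{\mathfrak{t}}) \colon \hat{\Theta}_t = \xi + \int_0^t G(\hat{\Theta}_s)\,\mathrm{d}s$ it holds that

\[ \min\{\hat{\mathfrak{t}},n\} = \tau_n = \min\{\mathfrak{t},n\} \qquad\text{and}\qquad \hat{\Theta}|_{[0,\tau_n)} = \Theta^{(n)} = \Theta|_{[0,\tau_n)}. \tag{5.178} \]

This and (5.167) show that for all $\hat{\mathfrak{t}} \in (0,\infty]$, $\hat{\Theta} \in C([0,\hat{\mathfrak{t}}), \mathbb{R}^d)$ with $\liminf_{t\nearrow\hat{\mathfrak{t}}}[|||\hat{\Theta}_t||| + t] = \infty$ and $\forall\, t \in [0,\hat{\mathfrak{t}}) \colon \hat{\Theta}_t = \xi + \int_0^t G(\hat{\Theta}_s)\,\mathrm{d}s$ it holds that

\[ \hat{\mathfrak{t}} = \mathfrak{t} \qquad\text{and}\qquad \hat{\Theta} = \Theta. \tag{5.179} \]

Combining this, (5.169), and (5.175) completes the proof of Lemma 5.7.3.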
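
To see that the maximal existence time in Lemma 5.7.3 can indeed be finite, consider the following standard example, which is not taken from the text and is given here only as an illustration: let $d = 1$, $\xi = 1$, let the norm be the absolute value, and let $G(x) = x^2$ for all $x \in \mathbb{R}$ (a locally Lipschitz continuous function). In this case the solution can be written down explicitly:

\[ \Theta_t = \frac{1}{1-t} \quad\text{for } t \in [0,1), \qquad \xi + \int_0^t G(\Theta_s)\,\mathrm{d}s = 1 + \int_0^t \frac{1}{(1-s)^2}\,\mathrm{d}s = \frac{1}{1-t} = \Theta_t, \qquad \liminf_{s\nearrow 1}\big[|\Theta_s| + s\big] = \infty. \]

Hence the pair $(\tau, \Theta)$ from Lemma 5.7.3 is in this example given by $\tau = 1$ and $\Theta_t = \frac{1}{1-t}$: the solution leaves every bounded set in finite time, and the blow-up condition in (5.160) is met at a finite $\tau$.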

5.7.3 Approximation of local minimum points through GFs revisited

Theorem 5.7.4 (Approximation of local minimum points through GFs revisited). Let
d ∈ N, c ∈ (0, ∞), r ∈ (0, ∞], ϑ ∈ Rd , B = {w ∈ Rd : ∥w − ϑ∥2 ≤ r}, ξ ∈ B, L ∈ C 2 (Rd , R)
satisfy for all θ ∈ B that
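\[ \langle\theta - \vartheta, (\nabla L)(\theta)\rangle \ge c\|\theta - \vartheta\|_2^2 \tag{5.180} \]
(cf. Definitions 1.4.7 and 3.3.4). Then

(i) there exists a unique continuous function $\Theta\colon [0,\infty) \to \mathbb{R}^d$ such that for all $t \in [0,\infty)$ it holds that
\[ \Theta_t = \xi - \int_0^t (\nabla L)(\Theta_s)\,\mathrm{d}s, \tag{5.181} \]

(ii) it holds that $\{\theta \in B \colon L(\theta) = \inf_{w \in B} L(w)\} = \{\vartheta\}$,

(iii) it holds for all $t \in [0,\infty)$ that $\|\Theta_t - \vartheta\|_2 \le e^{-ct}\|\xi - \vartheta\|_2$, and

(iv) it holds for all $t \in [0,\infty)$ that

\[ 0 \le \tfrac{c}{2}\|\Theta_t - \vartheta\|_2^2 \le L(\Theta_t) - L(\vartheta). \tag{5.182} \]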
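Proof of Theorem 5.7.4. First, observe that the assumption that $L \in C^2(\mathbb{R}^d, \mathbb{R})$ ensures that
\[ \mathbb{R}^d \ni \theta \mapsto -(\nabla L)(\theta) \in \mathbb{R}^d \tag{5.183} \]
is continuously differentiable. The fundamental theorem of calculus hence implies that

\[ \mathbb{R}^d \ni \theta \mapsto -(\nabla L)(\theta) \in \mathbb{R}^d \tag{5.184} \]

is locally Lipschitz continuous. Combining this with Lemma 5.7.3 (applied with $G \curvearrowleft (\mathbb{R}^d \ni \theta \mapsto -(\nabla L)(\theta) \in \mathbb{R}^d)$ in the notation of Lemma 5.7.3) proves that there exist a unique extended real number $\tau \in (0,\infty]$ and a unique continuous function $\Theta\colon [0,\tau) \to \mathbb{R}^d$ such that for all $t \in [0,\tau)$ it holds that
\[ \liminf_{s\nearrow\tau}\big[\|\Theta_s\|_2 + s\big] = \infty \qquad\text{and}\qquad \Theta_t = \xi - \int_0^t (\nabla L)(\Theta_s)\,\mathrm{d}s. \tag{5.185} \]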

Next observe that Proposition 5.7.1 proves that for all t ∈ [0, τ ) it holds that

∥Θt − ϑ∥2 ≤ e−ct ∥ξ − ϑ∥2 . (5.186)

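This implies that

\[ \begin{aligned} \liminf_{s\nearrow\tau} \|\Theta_s\|_2 &\le \liminf_{s\nearrow\tau}\big[\|\Theta_s - \vartheta\|_2 + \|\vartheta\|_2\big] \\ &\le \liminf_{s\nearrow\tau}\big[e^{-cs}\|\xi - \vartheta\|_2 + \|\vartheta\|_2\big] \le \|\xi - \vartheta\|_2 + \|\vartheta\|_2 < \infty. \end{aligned} \tag{5.187} \]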

This and (5.185) demonstrate that


τ = ∞. (5.188)
This and (5.185) prove item (i). Moreover, note that Proposition 5.7.1 and item (i) establish
items (ii), (iii), and (iv). The proof of Theorem 5.7.4 is thus complete.


5.7.4 Approximation error with respect to the objective function


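Corollary 5.7.5 (Approximation error with respect to the objective function). Let $d \in \mathbb{N}$, $c, L \in (0,\infty)$, $r \in (0,\infty]$, $\vartheta \in \mathbb{R}^d$, $B = \{w \in \mathbb{R}^d \colon \|w - \vartheta\|_2 \le r\}$, $\xi \in B$, $\mathcal{L} \in C^2(\mathbb{R}^d, \mathbb{R})$ satisfy for all $\theta \in B$ that

\[ \langle\theta - \vartheta, (\nabla\mathcal{L})(\theta)\rangle \ge c\|\theta - \vartheta\|_2^2 \qquad\text{and}\qquad \|(\nabla\mathcal{L})(\theta)\|_2 \le L\|\theta - \vartheta\|_2 \tag{5.189} \]

(cf. Definitions 1.4.7 and 3.3.4). Then

(i) there exists a unique continuous function $\Theta\colon [0,\infty) \to \mathbb{R}^d$ such that for all $t \in [0,\infty)$ it holds that
\[ \Theta_t = \xi - \int_0^t (\nabla\mathcal{L})(\Theta_s)\,\mathrm{d}s, \tag{5.190} \]

(ii) it holds that $\{\theta \in B \colon \mathcal{L}(\theta) = \inf_{w \in B} \mathcal{L}(w)\} = \{\vartheta\}$,

(iii) it holds for all $t \in [0,\infty)$ that $\|\Theta_t - \vartheta\|_2 \le e^{-ct}\|\xi - \vartheta\|_2$, and

(iv) it holds for all $t \in [0,\infty)$ that

\[ 0 \le \tfrac{c}{2}\|\Theta_t - \vartheta\|_2^2 \le \mathcal{L}(\Theta_t) - \mathcal{L}(\vartheta) \le \tfrac{L}{2}\|\Theta_t - \vartheta\|_2^2 \le \tfrac{L}{2}e^{-2ct}\|\xi - \vartheta\|_2^2. \tag{5.191} \]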

Proof of Corollary 5.7.5. Theorem 5.7.4 and Lemma 5.6.9 establish items (i), (ii), (iii), and
(iv). The proof of Corollary 5.7.5 is thus complete.

Chapter 6

Deterministic gradient descent (GD) optimization methods

This chapter reviews and studies deterministic GD-type optimization methods such as the
classical plain-vanilla GD optimization method (see Section 6.1 below) as well as more
sophisticated GD-type optimization methods including GD optimization methods with
momenta (cf. Sections 6.3, 6.4, and 6.8 below) and GD optimization methods with adaptive
modifications of the learning rates (cf. Sections 6.5, 6.6, 6.7, and 6.8 below).
There are several other outstanding reviews on gradient-based optimization methods in
the literature; cf., for example, the books [9, Chapter 5], [52, Chapter 9], [57, Chapter 3],
[164, Sections 4.3 and 5.9 and Chapter 8], [303], and [373, Chapter 14] and the references
therein and, for instance, the survey articles [33, 48, 122, 354, 386] and the references
therein.

6.1 GD optimization
In this section we review and study the classical plain-vanilla GD optimization method
(cf., for example, [303, Section 1.2.3], [52, Section 9.3], and [57, Chapter 3]). A simple
intuition behind the GD optimization method is the idea to solve a minimization problem
by performing successive steps in the direction of the steepest descent of the objective function,
that is, by performing successive steps in the opposite direction of the gradients of the
objective function.
A slightly different and perhaps more accurate perspective on the GD optimization
method is to view it as a plain-vanilla Euler discretization of
the associated GF ODE (see, for example, Theorem 5.7.4 in Chapter 5 above).

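Definition 6.1.1 (GD optimization method). Let $d \in \mathbb{N}$, $(\gamma_n)_{n\in\mathbb{N}} \subseteq [0,\infty)$, $\xi \in \mathbb{R}^d$ and let $L\colon \mathbb{R}^d \to \mathbb{R}$ and $G\colon \mathbb{R}^d \to \mathbb{R}^d$ satisfy for all $U \in \{V \subseteq \mathbb{R}^d \colon V \text{ is open}\}$, $\theta \in U$ with $L|_U \in C^1(U, \mathbb{R})$ that
\[ G(\theta) = (\nabla L)(\theta). \tag{6.1} \]
Then we say that $\Theta$ is the GD process for the objective function $L$ with generalized gradient $G$, learning rates $(\gamma_n)_{n\in\mathbb{N}}$, and initial value $\xi$ (we say that $\Theta$ is the GD process for the objective function $L$ with learning rates $(\gamma_n)_{n\in\mathbb{N}}$ and initial value $\xi$) if and only if it holds that $\Theta\colon \mathbb{N}_0 \to \mathbb{R}^d$ is the function from $\mathbb{N}_0$ to $\mathbb{R}^d$ which satisfies for all $n \in \mathbb{N}$ that
\[ \Theta_0 = \xi \qquad\text{and}\qquad \Theta_n = \Theta_{n-1} - \gamma_n G(\Theta_{n-1}). \tag{6.2} \]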
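
The recursion in (6.2) translates directly into a short loop. The following is a minimal sketch in plain Python, not an implementation taken from the text: the function name `gd_process`, the representation of vectors as lists of floats, and the objective used in the demonstration are assumptions made for illustration.

```python
def gd_process(grad, xi, learning_rates):
    """Return [Theta_0, ..., Theta_N] with Theta_n = Theta_{n-1} - gamma_n * grad(Theta_{n-1})."""
    thetas = [list(xi)]
    for gamma in learning_rates:
        prev = thetas[-1]
        g = grad(prev)
        thetas.append([p - gamma * gi for p, gi in zip(prev, g)])
    return thetas


# Hypothetical objective L(theta) = (theta_1 - 1)^2 + 3 * (theta_2 + 2)^2 with minimizer (1, -2).
def grad_L(theta):
    """Gradient of the hypothetical objective above."""
    return [2.0 * (theta[0] - 1.0), 6.0 * (theta[1] + 2.0)]


# Constant learning rates gamma_n = 0.1 for n = 1, ..., 50 and initial value xi = (0, 0).
trajectory = gd_process(grad_L, xi=[0.0, 0.0], learning_rates=[0.1] * 50)
print(trajectory[1])
print(trajectory[-1])
```

Each GD step is one explicit Euler step of size γₙ for the GF ODE with right-hand side −∇L, so the list returned here can be read as an Euler approximation of the gradient flow at the times γ₁, γ₁ + γ₂, and so on.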
Exercise 6.1.1. Let ξ = (ξ1 , ξ2 , ξ3 ) ∈ R3 satisfy ξ = (1, 2, 3), let L : R3 → R satisfy for all
θ = (θ1 , θ2 , θ3 ) ∈ R3 that
L(θ) = 2(θ1 )2 + (θ2 + 1)2 + (θ3 − 1)2 , (6.3)
and let Θ be the GD process for the objective function L with learning rates N ∋ n 7→ 21n ,
and initial value ξ (cf. Definition 6.1.1). Specify Θ1 , Θ2 , and Θ3 explicitly and prove that
your results are correct!
Exercise 6.1.2. Let ξ = (ξ1 , ξ2 , ξ3 ) ∈ R3 satisfy ξ = (ξ1 , ξ2 , ξ3 ) = (3, 4, 5), let L : R3 → R
satisfy for all θ = (θ1 , θ2 , θ3 ) ∈ R3 that
L(θ) = (θ1 )2 + (θ2 − 1)2 + 2 (θ3 + 1)2 ,
and let Θ be the GD process for the objective function L with learning rates N ∋ n ↦
1/3 ∈ [0, ∞) and initial value ξ (cf. Definition 6.1.1). Specify Θ1 , Θ2 , and Θ3 explicitly and
prove that your results are correct.

6.1.1 GD optimization in the training of ANNs


In the next example we apply the GD optimization method in the context of the training of
fully-connected feedforward ANNs in the vectorized description (see Section 1.1) with the
loss function being the mean squared error loss function in Definition 5.4.2 (see Section 5.4.2).
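Example 6.1.2. Let $d, h, \mathfrak{d} \in \mathbb{N}$, $l_1, l_2, \dots, l_h \in \mathbb{N}$ satisfy $\mathfrak{d} = l_1(d+1) + \big[\sum_{k=2}^{h} l_k(l_{k-1}+1)\big] + l_h + 1$, let $a\colon \mathbb{R} \to \mathbb{R}$ be differentiable, let $M \in \mathbb{N}$, $x_1, x_2, \dots, x_M \in \mathbb{R}^d$, $y_1, y_2, \dots, y_M \in \mathbb{R}$, let $L\colon \mathbb{R}^{\mathfrak{d}} \to [0,\infty)$ satisfy for all $\theta \in \mathbb{R}^{\mathfrak{d}}$ that
\[ L(\theta) = \frac{1}{M}\left[\sum_{m=1}^{M} \Big|\mathcal{N}^{\theta,d}_{\mathfrak{M}_{a,l_1},\mathfrak{M}_{a,l_2},\dots,\mathfrak{M}_{a,l_h},\mathrm{id}_{\mathbb{R}}}(x_m) - y_m\Big|^2\right], \tag{6.4} \]

let $\xi \in \mathbb{R}^{\mathfrak{d}}$, let $(\gamma_n)_{n\in\mathbb{N}} \subseteq \mathbb{N}$, and let $\Theta\colon \mathbb{N}_0 \to \mathbb{R}^{\mathfrak{d}}$ satisfy for all $n \in \mathbb{N}$ that
\[ \Theta_0 = \xi \qquad\text{and}\qquad \Theta_n = \Theta_{n-1} - \gamma_n(\nabla L)(\Theta_{n-1}) \tag{6.5} \]
(cf. Definitions 1.1.3 and 1.2.1 and Corollary 5.3.6). Then $\Theta$ is the GD process for the objective function $L$ with learning rates $(\gamma_n)_{n\in\mathbb{N}}$ and initial value $\xi$.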


Proof for Example 6.1.2. Note that (6.5), (6.1), and (6.2) demonstrate that Θ is the GD
process for the objective function L with learning rates (γn )n∈N and initial value ξ. The
proof for Example 6.1.2 is thus complete.

6.1.2 Euler discretizations for GF ODEs


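Theorem 6.1.3 (Taylor's formula). Let $N \in \mathbb{N}$, $\alpha \in \mathbb{R}$, $\beta \in (\alpha,\infty)$, $a, b \in [\alpha,\beta]$, $f \in C^N([\alpha,\beta], \mathbb{R})$. Then
\[ f(b) = \left[\sum_{n=0}^{N-1} \frac{f^{(n)}(a)(b-a)^n}{n!}\right] + \int_0^1 \frac{f^{(N)}(a + r(b-a))(b-a)^N(1-r)^{N-1}}{(N-1)!}\,\mathrm{d}r. \tag{6.6} \]

Proof of Theorem 6.1.3. Observe that the fundamental theorem of calculus assures that for all $g \in C^1([0,1], \mathbb{R})$ it holds that
\[ g(1) = g(0) + \int_0^1 g'(r)\,\mathrm{d}r = g(0) + \int_0^1 \frac{g'(r)(1-r)^0}{0!}\,\mathrm{d}r. \tag{6.7} \]
Furthermore, note that integration by parts ensures that for all $n \in \mathbb{N}$, $g \in C^{n+1}([0,1], \mathbb{R})$ it holds that
\[ \begin{aligned} \int_0^1 \frac{g^{(n)}(r)(1-r)^{n-1}}{(n-1)!}\,\mathrm{d}r &= -\left[\frac{g^{(n)}(r)(1-r)^n}{n!}\right]_{r=0}^{r=1} + \int_0^1 \frac{g^{(n+1)}(r)(1-r)^n}{n!}\,\mathrm{d}r \\ &= \frac{g^{(n)}(0)}{n!} + \int_0^1 \frac{g^{(n+1)}(r)(1-r)^n}{n!}\,\mathrm{d}r. \end{aligned} \tag{6.8} \]
Combining this with (6.7) and induction shows that for all $g \in C^N([0,1], \mathbb{R})$ it holds that
\[ g(1) = \left[\sum_{n=0}^{N-1} \frac{g^{(n)}(0)}{n!}\right] + \int_0^1 \frac{g^{(N)}(r)(1-r)^{N-1}}{(N-1)!}\,\mathrm{d}r. \tag{6.9} \]

Applying (6.9) to the function $[0,1] \ni r \mapsto f(a + r(b-a)) \in \mathbb{R}$, whose $n$-th derivative at $r$ is $f^{(n)}(a + r(b-a))(b-a)^n$, establishes (6.6). The proof of Theorem 6.1.3 is thus complete.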


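Lemma 6.1.4 (Local error of the Euler method). Let $d \in \mathbb{N}$, $T, \gamma, c \in [0,\infty)$, $G \in C^1(\mathbb{R}^d, \mathbb{R}^d)$, $\Theta \in C([0,\infty), \mathbb{R}^d)$, $\theta \in \mathbb{R}^d$ satisfy for all $x, y \in \mathbb{R}^d$, $t \in [0,\infty)$ that
\[ \Theta_t = \Theta_0 + \int_0^t G(\Theta_s)\,\mathrm{d}s, \qquad \theta = \Theta_T + \gamma G(\Theta_T), \tag{6.10} \]

\[ \|G(x)\|_2 \le c, \qquad\text{and}\qquad \|G'(x)y\|_2 \le c\|y\|_2 \tag{6.11} \]

(cf. Definition 3.3.4). Then

\[ \|\Theta_{T+\gamma} - \theta\|_2 \le c^2\gamma^2. \tag{6.12} \]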


Proof of Lemma 6.1.4. Note that the fundamental theorem of calculus, the hypothesis that
G ∈ C 1 (Rd , Rd ), and (6.10) assure that for all t ∈ (0, ∞) it holds that Θ ∈ C 1 ([0, ∞), Rd )
and

Θ̇t = G(Θt ). (6.13)

Combining this with the hypothesis that G ∈ C 1 (Rd , Rd ) and the chain rule ensures that
for all t ∈ (0, ∞) it holds that Θ ∈ C 2 ([0, ∞), Rd ) and

Θ̈t = G′ (Θt )Θ̇t = G′ (Θt )G(Θt ). (6.14)

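Theorem 6.1.3 and (6.13) therefore imply that

\[ \begin{aligned} \Theta_{T+\gamma} &= \Theta_T + \gamma\dot{\Theta}_T + \int_0^1 (1-r)\gamma^2\ddot{\Theta}_{T+r\gamma}\,\mathrm{d}r \\ &= \Theta_T + \gamma G(\Theta_T) + \gamma^2\int_0^1 (1-r)G'(\Theta_{T+r\gamma})G(\Theta_{T+r\gamma})\,\mathrm{d}r. \end{aligned} \tag{6.15} \]

This and (6.10) demonstrate that

\[ \begin{aligned} \|\Theta_{T+\gamma} - \theta\|_2 &= \left\|\Theta_T + \gamma G(\Theta_T) + \gamma^2\int_0^1 (1-r)G'(\Theta_{T+r\gamma})G(\Theta_{T+r\gamma})\,\mathrm{d}r - (\Theta_T + \gamma G(\Theta_T))\right\|_2 \\ &\le \gamma^2\int_0^1 (1-r)\|G'(\Theta_{T+r\gamma})G(\Theta_{T+r\gamma})\|_2\,\mathrm{d}r \\ &\le c^2\gamma^2\int_0^1 r\,\mathrm{d}r = \frac{c^2\gamma^2}{2} \le c^2\gamma^2. \end{aligned} \tag{6.16} \]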

The proof of Lemma 6.1.4 is thus complete.

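Corollary 6.1.5 (Local error of the Euler method for GF ODEs). Let $d \in \mathbb{N}$, $T, \gamma, c \in [0,\infty)$, $L \in C^2(\mathbb{R}^d, \mathbb{R})$, $\Theta \in C([0,\infty), \mathbb{R}^d)$, $\theta \in \mathbb{R}^d$ satisfy for all $x, y \in \mathbb{R}^d$, $t \in [0,\infty)$ that
\[ \Theta_t = \Theta_0 - \int_0^t (\nabla L)(\Theta_s)\,\mathrm{d}s, \qquad \theta = \Theta_T - \gamma(\nabla L)(\Theta_T), \tag{6.17} \]

\[ \|(\nabla L)(x)\|_2 \le c, \qquad\text{and}\qquad \|(\operatorname{Hess} L)(x)y\|_2 \le c\|y\|_2 \tag{6.18} \]

(cf. Definition 3.3.4). Then

\[ \|\Theta_{T+\gamma} - \theta\|_2 \le c^2\gamma^2. \tag{6.19} \]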


Proof of Corollary 6.1.5. Throughout this proof, let $G\colon \mathbb{R}^d \to \mathbb{R}^d$ satisfy for all $\theta \in \mathbb{R}^d$ that

\[ G(\theta) = -(\nabla L)(\theta). \tag{6.20} \]

Note that the fact that for all $t \in [0,\infty)$ it holds that $\Theta_t = \Theta_0 + \int_0^t G(\Theta_s)\,\mathrm{d}s$, the fact that $\theta = \Theta_T + \gamma G(\Theta_T)$, the fact that for all $x \in \mathbb{R}^d$ it holds that $\|G(x)\|_2 \le c$, the fact that for all $x, y \in \mathbb{R}^d$ it holds that $\|G'(x)y\|_2 \le c\|y\|_2$, and Lemma 6.1.4 imply that $\|\Theta_{T+\gamma} - \theta\|_2 \le c^2\gamma^2$.
The proof of Corollary 6.1.5 is thus complete.
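
The quadratic dependence of the local error on the step size γ in (6.19) can be observed numerically. The following minimal sketch relies on an assumed one-dimensional example that does not appear in the text: the objective x ↦ log(cosh(x)), whose first and second derivatives are bounded by c = 1 and whose gradient flow satisfies sinh(Θₜ) = sinh(Θ₀)e⁻ᵗ in closed form. The function names are illustrative only.

```python
import math

# Assumed example (not from the text): L(x) = log(cosh(x)), so that
# |L'(x)| = |tanh(x)| <= 1 and |L''(x)| = 1 / cosh(x)^2 <= 1, i.e. c = 1 in (6.18).
c = 1.0


def flow_step(theta, gamma):
    """Exact GF value after time gamma, using sinh(Theta_t) = sinh(Theta_0) * exp(-t)."""
    return math.asinh(math.sinh(theta) * math.exp(-gamma))


def euler_step(theta, gamma):
    """One explicit Euler (GD) step of size gamma for the GF ODE."""
    return theta - gamma * math.tanh(theta)


def local_error(theta, gamma):
    """Distance between the exact flow and the Euler step after time gamma."""
    return abs(flow_step(theta, gamma) - euler_step(theta, gamma))


starts = [-3.0, -1.0, -0.2, 0.0, 0.5, 1.0, 2.5]
steps = [0.01, 0.05, 0.1, 0.5, 1.0]
ok = all(local_error(th, g) <= c ** 2 * g ** 2 + 1e-15 for th in starts for g in steps)

# Halving the step size should reduce the local error by roughly a factor of four.
ratio = local_error(1.0, 0.1) / local_error(1.0, 0.05)
print(ok, ratio)
```

The ratio computed at the end is a simple way to see the second-order behaviour: a local error of order γ² shrinks by about a factor of four when γ is halved.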

6.1.3 Lyapunov-type stability for GD optimization


Corollary 5.6.3 in Section 5.6.2 and Corollary 5.6.5 in Section 5.6.3 in Chapter 5 above, in
particular, illustrate how Lyapunov-type functions can be employed to establish convergence
properties for GFs. Roughly speaking, the next two results, Proposition 6.1.6 and
Corollary 6.1.7 below, are the time-discrete analogues of Corollary 5.6.3 and Corollary 5.6.5,
respectively.

Proposition 6.1.6 (Lyapunov-type stability for discrete-time dynamical systems). Let


d ∈ N, ξ ∈ Rd , c ∈ (0, ∞), (γn )n∈N ⊆ [0, c], let V : Rd → R, Φ : Rd × [0, ∞) → Rd , and
ε : [0, c] → [0, ∞) satisfy for all θ ∈ Rd , t ∈ [0, c] that

V (Φ(θ, t)) ≤ ε(t)V (θ), (6.21)

and let Θ : N0 → Rd satisfy for all n ∈ N that

Θ0 = ξ and Θn = Φ(Θn−1 , γn ). (6.22)

Then it holds for all $n \in \mathbb{N}_0$ that

\[ V(\Theta_n) \le \left[\prod_{k=1}^{n} \varepsilon(\gamma_k)\right] V(\xi). \tag{6.23} \]

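Proof of Proposition 6.1.6. We prove (6.23) by induction on $n \in \mathbb{N}_0$. For the base case $n = 0$ note that the assumption that $\Theta_0 = \xi$ ensures that $V(\Theta_0) = V(\xi)$. This establishes (6.23) in the base case $n = 0$. For the induction step observe that (6.22) and (6.21) ensure that for all $n \in \mathbb{N}_0$ with $V(\Theta_n) \le \big(\prod_{k=1}^{n} \varepsilon(\gamma_k)\big)V(\xi)$ it holds that

\[ \begin{aligned} V(\Theta_{n+1}) &= V(\Phi(\Theta_n, \gamma_{n+1})) \le \varepsilon(\gamma_{n+1})V(\Theta_n) \\ &\le \varepsilon(\gamma_{n+1})\left[\prod_{k=1}^{n} \varepsilon(\gamma_k)\right] V(\xi) = \left[\prod_{k=1}^{n+1} \varepsilon(\gamma_k)\right] V(\xi). \end{aligned} \tag{6.24} \]

Induction thus establishes (6.23). The proof of Proposition 6.1.6 is thus complete.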


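Corollary 6.1.7 (On quadratic Lyapunov-type functions for the GD optimization method). Let $d \in \mathbb{N}$, $\vartheta, \xi \in \mathbb{R}^d$, $c \in (0,\infty)$, $(\gamma_n)_{n\in\mathbb{N}} \subseteq [0,c]$, $L \in C^1(\mathbb{R}^d, \mathbb{R})$, let $|||\cdot|||\colon \mathbb{R}^d \to [0,\infty)$ be a norm, let $\varepsilon\colon [0,c] \to [0,\infty)$ satisfy for all $\theta \in \mathbb{R}^d$, $t \in [0,c]$ that
\[ |||\theta - t(\nabla L)(\theta) - \vartheta|||^2 \le \varepsilon(t)\,|||\theta - \vartheta|||^2, \tag{6.25} \]
and let $\Theta\colon \mathbb{N}_0 \to \mathbb{R}^d$ satisfy for all $n \in \mathbb{N}$ that
\[ \Theta_0 = \xi \qquad\text{and}\qquad \Theta_n = \Theta_{n-1} - \gamma_n(\nabla L)(\Theta_{n-1}). \tag{6.26} \]
Then it holds for all $n \in \mathbb{N}_0$ that
\[ |||\Theta_n - \vartheta||| \le \left[\prod_{k=1}^{n} [\varepsilon(\gamma_k)]^{1/2}\right] |||\xi - \vartheta|||. \tag{6.27} \]

Proof of Corollary 6.1.7. Throughout this proof, let $V\colon \mathbb{R}^d \to \mathbb{R}$ and $\Phi\colon \mathbb{R}^d \times [0,\infty) \to \mathbb{R}^d$ satisfy for all $\theta \in \mathbb{R}^d$, $t \in [0,\infty)$ that
\[ V(\theta) = |||\theta - \vartheta|||^2 \qquad\text{and}\qquad \Phi(\theta, t) = \theta - t(\nabla L)(\theta). \tag{6.28} \]
Observe that Proposition 6.1.6 (applied with $V \curvearrowleft V$, $\Phi \curvearrowleft \Phi$ in the notation of Proposition 6.1.6) and (6.28) imply that for all $n \in \mathbb{N}_0$ it holds that
\[ |||\Theta_n - \vartheta|||^2 = V(\Theta_n) \le \left[\prod_{k=1}^{n} \varepsilon(\gamma_k)\right] V(\xi) = \left[\prod_{k=1}^{n} \varepsilon(\gamma_k)\right] |||\xi - \vartheta|||^2. \tag{6.29} \]

This establishes (6.27). The proof of Corollary 6.1.7 is thus complete.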


Corollary 6.1.7, in particular, illustrates that the one-step Lyapunov stability assumption
in (6.25) may provide us with suitable estimates for the approximation errors associated with the GD
optimization method; see (6.27) above. The next result, Lemma 6.1.8 below, now provides
us with sufficient conditions which ensure that the one-step Lyapunov stability condition in (6.25)
is satisfied so that we are in the position to apply Corollary 6.1.7 above to obtain estimates
for the approximation errors associated with the GD optimization method. Lemma 6.1.8
employs the growth condition and the coercivity-type condition in (5.189) in Corollary 5.7.5
above. Results similar to Lemma 6.1.8 can, for example, be found in [103, Remark 2.1] and
[221, Lemma 2.1]. We will employ the statement of Lemma 6.1.8 in our error analysis for
the GD optimization method in Section 6.1.4 below.
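Lemma 6.1.8 (Sufficient conditions for a one-step Lyapunov-type stability condition). Let $d \in \mathbb{N}$, let $\langle\langle\cdot,\cdot\rangle\rangle\colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a scalar product, let $|||\cdot|||\colon \mathbb{R}^d \to \mathbb{R}$ satisfy for all $v \in \mathbb{R}^d$ that $|||v||| = \sqrt{\langle\langle v, v\rangle\rangle}$, and let $c, L \in (0,\infty)$, $r \in (0,\infty]$, $\vartheta \in \mathbb{R}^d$, $B = \{w \in \mathbb{R}^d \colon |||w - \vartheta||| \le r\}$, $\mathcal{L} \in C^1(\mathbb{R}^d, \mathbb{R})$ satisfy for all $\theta \in B$ that
\[ \langle\langle\theta - \vartheta, (\nabla\mathcal{L})(\theta)\rangle\rangle \ge c\,|||\theta - \vartheta|||^2 \qquad\text{and}\qquad |||(\nabla\mathcal{L})(\theta)||| \le L\,|||\theta - \vartheta|||. \tag{6.30} \]
Then

(i) it holds that $c \le L$,

(ii) it holds for all $\theta \in B$, $\gamma \in [0,\infty)$ that

\[ |||\theta - \gamma(\nabla\mathcal{L})(\theta) - \vartheta|||^2 \le (1 - 2\gamma c + \gamma^2 L^2)\,|||\theta - \vartheta|||^2, \tag{6.31} \]

(iii) it holds for all $\gamma \in (0, \tfrac{2c}{L^2})$ that $0 \le 1 - 2\gamma c + \gamma^2 L^2 < 1$, and

(iv) it holds for all $\theta \in B$, $\gamma \in [0, \tfrac{c}{L^2}]$ that

\[ |||\theta - \gamma(\nabla\mathcal{L})(\theta) - \vartheta|||^2 \le (1 - c\gamma)\,|||\theta - \vartheta|||^2. \tag{6.32} \]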

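Proof of Lemma 6.1.8. First of all, note that (6.30) ensures that for all $\theta \in B$, $\gamma \in [0,\infty)$ it holds that
\[ \begin{aligned} 0 \le |||\theta - \gamma(\nabla\mathcal{L})(\theta) - \vartheta|||^2 &= |||(\theta - \vartheta) - \gamma(\nabla\mathcal{L})(\theta)|||^2 \\ &= |||\theta - \vartheta|||^2 - 2\gamma\,\langle\langle\theta - \vartheta, (\nabla\mathcal{L})(\theta)\rangle\rangle + \gamma^2\,|||(\nabla\mathcal{L})(\theta)|||^2 \\ &\le |||\theta - \vartheta|||^2 - 2\gamma c\,|||\theta - \vartheta|||^2 + \gamma^2 L^2\,|||\theta - \vartheta|||^2 \\ &= (1 - 2\gamma c + \gamma^2 L^2)\,|||\theta - \vartheta|||^2. \end{aligned} \tag{6.33} \]

This establishes item (ii). Moreover, note that the fact that $B\setminus\{\vartheta\} \neq \emptyset$ and (6.33) assure that for all $\gamma \in [0,\infty)$ it holds that

\[ 1 - 2\gamma c + \gamma^2 L^2 \ge 0. \tag{6.34} \]

Hence, we obtain that

\[ \begin{aligned} 1 - \tfrac{c^2}{L^2} &= 1 - \tfrac{2c^2}{L^2} + \tfrac{c^2}{L^2} = 1 - 2\big[\tfrac{c}{L^2}\big]c + \big[\tfrac{c^2}{L^4}\big]L^2 \\ &= 1 - 2\big[\tfrac{c}{L^2}\big]c + \big[\tfrac{c}{L^2}\big]^2 L^2 \ge 0. \end{aligned} \tag{6.35} \]

This implies that $\tfrac{c^2}{L^2} \le 1$. Therefore, we obtain that $c^2 \le L^2$. This establishes item (i). Furthermore, observe that (6.34) ensures that for all $\gamma \in (0, \tfrac{2c}{L^2})$ it holds that

\[ 0 \le 1 - 2\gamma c + \gamma^2 L^2 = 1 - \underbrace{\gamma}_{>0}\,\underbrace{(2c - \gamma L^2)}_{>0} < 1. \tag{6.36} \]

This proves item (iii). In addition, note that for all $\gamma \in [0, \tfrac{c}{L^2}]$ it holds that

\[ 1 - 2\gamma c + \gamma^2 L^2 \le 1 - 2\gamma c + \gamma\big[\tfrac{c}{L^2}\big]L^2 = 1 - c\gamma. \tag{6.37} \]

Combining this with (6.33) establishes item (iv). The proof of Lemma 6.1.8 is thus
complete.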
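
The one-step estimates of Lemma 6.1.8 and the resulting geometric decay from Corollary 6.1.7 can be checked on a small example. The following is a minimal sketch under assumptions that are not part of the text: the standard scalar product on R², ϑ = 0, r = ∞, and the diagonal quadratic objective θ ↦ ½(a₁θ₁² + a₂θ₂²), for which the conditions in (6.30) hold with c = min aᵢ and growth constant max aᵢ (called `lip` below). All names are illustrative.

```python
import math

# Assumed toy objective (not from the text), with the standard scalar product:
#   L(theta) = 0.5 * (a[0] * theta[0]**2 + a[1] * theta[1]**2), vartheta = (0, 0).
a = [1.0, 4.0]
c, lip = min(a), max(a)  # constants in (6.30): c = 1 and growth constant 4


def grad(theta):
    """Gradient of the assumed quadratic objective."""
    return [ai * ti for ai, ti in zip(a, theta)]


def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(x * x for x in v)


def gd_step(theta, gamma):
    """One GD step theta - gamma * grad(theta)."""
    return [ti - gamma * gi for ti, gi in zip(theta, grad(theta))]


def item_ii_holds(theta, gamma, tol=1e-12):
    """Check the one-step estimate (6.31)."""
    factor = 1.0 - 2.0 * gamma * c + gamma ** 2 * lip ** 2
    return sq_norm(gd_step(theta, gamma)) <= factor * sq_norm(theta) + tol


def item_iv_holds(theta, gamma, tol=1e-12):
    """Check the one-step estimate (6.32), intended for gamma in [0, c / lip^2]."""
    return sq_norm(gd_step(theta, gamma)) <= (1.0 - c * gamma) * sq_norm(theta) + tol


points = [[1.0, 0.0], [0.0, 1.0], [2.0, -3.0], [-0.5, 0.25]]
ii_ok = all(item_ii_holds(p, g) for p in points for g in [0.0, 0.05, 0.2, 1.0])
iv_ok = all(item_iv_holds(p, g) for p in points for g in [0.0, 0.02, c / lip ** 2])

# Iterating the one-step estimate gives geometric decay of the distance to vartheta.
xi = [2.0, -3.0]
gamma = c / lip ** 2
theta = list(xi)
contraction_ok = True
for n in range(1, 101):
    theta = gd_step(theta, gamma)
    bound = (1.0 - c * gamma) ** (n / 2.0) * math.sqrt(sq_norm(xi))
    contraction_ok = contraction_ok and math.sqrt(sq_norm(theta)) <= bound + 1e-12
print(ii_ok, iv_ok, contraction_ok)
```

The final loop combines item (iv) with Corollary 6.1.7 for the constant learning rate γ = c/L², which is the largest step size for which the simple factor 1 − cγ from (6.32) is available.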


Exercise 6.1.3. Prove or disprove the following statement: There exist d ∈ N, γ ∈ (0, ∞),
ε ∈ (0, 1), r ∈ (0, ∞], ϑ, θ ∈ Rd and there exists a function G : Rd → Rd such that
∥θ − ϑ∥2 ≤ r, ∀ ξ ∈ {w ∈ Rd : ∥w − ϑ∥2 ≤ r} : ∥ξ − γg(ξ) − ϑ∥2 ≤ ε∥ξ − ϑ∥2 , and
 2 γ
⟨θ − ϑ, g(θ)⟩ < min 1−ε , 2 max ∥θ − ϑ∥22 , ∥G(θ)∥22 . (6.38)


Exercise 6.1.4. Prove or disprove the following statement: For all d ∈ N, r ∈ (0, ∞],
ϑ ∈ Rd and for every function G : Rd → Rd which satisfies ∀ θ ∈ {w ∈ Rd : ∥w − ϑ∥2 ≤
r} : ⟨θ − ϑ, G(θ)⟩ ≥ 12 max{∥θ − ϑ∥22 , ∥G(θ)∥22 } it holds that
∀θ ∈ {w ∈ Rd : ∥w − ϑ∥2 ≤ r} : ⟨θ − ϑ, G(θ)⟩ ≥ 21 ∥θ − ϑ∥22 ∧ ∥G(θ)∥2 ≤ 2∥θ − ϑ∥2 . (6.39)


Exercise 6.1.5. Prove or disprove the following statement: For all d ∈ N, c ∈ (0, ∞),
r ∈ (0, ∞], ϑ, v ∈ Rd , L ∈ C 1 (Rd , R), s, t ∈ [0, 1] such that ∥v∥2 ≤ r, s ≤ t, and
∀ θ ∈ {w ∈ Rd : ∥w − ϑ∥2 ≤ r} : ⟨θ − ϑ, (∇L)(θ)⟩ ≥ c∥θ − ϑ∥22 it holds that
L(ϑ + tv) − L(ϑ + sv) ≥ 2c (t2 − s2 )∥v∥22 . (6.40)
Exercise 6.1.6. Prove or disprove the following statement: For every d ∈ N, c ∈ (0, ∞),
r ∈ (0, ∞], ϑ ∈ Rd and for every L ∈ C 1 (Rd , R) which satisfies for all v ∈ Rd , s, t ∈ [0, 1]
with ∥v∥2 ≤ r and s ≤ t that L(ϑ + tv) − L(ϑ + sv) ≥ c(t2 − s2 )∥v∥22 it holds that
∀ θ ∈ {w ∈ Rd : ∥w − ϑ∥2 ≤ r} : ⟨θ − ϑ, (∇L)(θ)⟩ ≥ 2c∥θ − ϑ∥22 . (6.41)
Exercise 6.1.7. Let d ∈ N and for every v ∈ Rd , R ∈ [0, ∞] let BR (v) = {w ∈ Rd : ∥w−v∥2 ≤
R}. Prove or disprove the following statement: For all r ∈ (0, ∞], ϑ ∈ Rd , L ∈ C 1 (Rd , R)
the following two statements are equivalent:
(i) There exists c ∈ (0, ∞) such that for all θ ∈ Br (ϑ) it holds that
⟨θ − ϑ, (∇L)(θ)⟩ ≥ c∥θ − ϑ∥22 . (6.42)

(ii) There exists c ∈ (0, ∞) such that for all v, w ∈ Br (ϑ), s, t ∈ [0, 1] with s ≤ t it holds
that
L(ϑ + t(v − ϑ)) − L(ϑ + s(v − ϑ)) ≥ c(t2 − s2 )∥v − ϑ∥22 . (6.43)
Exercise 6.1.8. Let d ∈ N and for every v ∈ Rd , R ∈ [0, ∞] let BR (v) = {w ∈ Rd : ∥v −w∥2 ≤
R}. Prove or disprove the following statement: For all r ∈ (0, ∞], ϑ ∈ Rd , L ∈ C 1 (Rd , R)
the following three statements are equivalent:
(i) There exist c, L ∈ (0, ∞) such that for all θ ∈ Br (ϑ) it holds that
⟨θ − ϑ, (∇L)(θ)⟩ ≥ c∥θ − ϑ∥22 and ∥(∇L)(θ)∥2 ≤ L∥θ − ϑ∥2 . (6.44)

(ii) There exist γ ∈ (0, ∞), ε ∈ (0, 1) such that for all θ ∈ Br (ϑ) it holds that
∥θ − γ(∇L)(θ) − ϑ∥2 ≤ ε∥θ − ϑ∥2 . (6.45)

(iii) There exists c ∈ (0, ∞) such that for all θ ∈ Br (ϑ) it holds that
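\[ \langle\theta-\vartheta,(\nabla\mathcal{L})(\theta)\rangle \ge c\max\bigl\{\|\theta-\vartheta\|_2^2, \|(\nabla\mathcal{L})(\theta)\|_2^2\bigr\}. \tag{6.46} \]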


6.1. GD optimization

6.1.4 Error analysis for GD optimization


In this subsection we provide an error analysis for the GD optimization method. In particular,
we show under suitable hypotheses (cf. Proposition 6.1.9 below) that the considered GD
process converges to a local minimum point of the objective function of the considered
optimization problem.

6.1.4.1 Error estimates for GD optimization


Proposition 6.1.9 (Error estimates for the GD optimization method). Let d ∈ N, c, L ∈
(0, ∞), r ∈ (0, ∞], (γn )n∈N ⊆ [0, 2c/L²], ϑ ∈ Rd , B = {w ∈ Rd : ∥w − ϑ∥2 ≤ r}, ξ ∈ B,
L ∈ C 1 (Rd , R) satisfy for all θ ∈ B that

⟨θ − ϑ, (∇L)(θ)⟩ ≥ c∥θ − ϑ∥22 and ∥(∇L)(θ)∥2 ≤ L∥θ − ϑ∥2 , (6.47)

and let Θ : N0 → Rd satisfy for all n ∈ N that

Θ0 = ξ and Θn = Θn−1 − γn (∇L)(Θn−1 ) (6.48)

(cf. Definitions 1.4.7 and 3.3.4). Then

(i) it holds that {θ ∈ B : L(θ) = inf w∈B L(w)} = {ϑ},

(ii) it holds for all n ∈ N that 0 ≤ 1 − 2cγn + (γn )2 L2 ≤ 1,

(iii) it holds for all n ∈ N that ∥Θn − ϑ∥2 ≤ (1 − 2cγn + (γn )2 L2 )1/2 ∥Θn−1 − ϑ∥2 ≤ r,

(iv) it holds for all n ∈ N0 that

\[ \|\Theta_n - \vartheta\|_2 \le \Bigl[\prod_{k=1}^{n} (1 - 2c\gamma_k + (\gamma_k)^2 L^2)^{1/2}\Bigr] \|\xi - \vartheta\|_2, \tag{6.49} \]

and

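(v) it holds for all n ∈ N0 that

\[ 0 \le \mathcal{L}(\Theta_n) - \mathcal{L}(\vartheta) \le \tfrac{L}{2}\|\Theta_n - \vartheta\|_2^2 \le \tfrac{L}{2}\Bigl[\prod_{k=1}^{n} (1 - 2c\gamma_k + (\gamma_k)^2 L^2)\Bigr] \|\xi - \vartheta\|_2^2. \tag{6.50} \]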

Proof of Proposition 6.1.9. First, observe that (6.47) and item (ii) in Lemma 5.6.8 prove
item (i). Moreover, note that (6.47), item (iii) in Lemma 6.1.8, the assumption that for all
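n ∈ N it holds that γn ∈ [0, 2c/L²], and the fact that
\[ 1 - 2c\bigl[\tfrac{2c}{L^2}\bigr] + \bigl[\tfrac{2c}{L^2}\bigr]^2 L^2 = 1 - \tfrac{4c^2}{L^2} + \bigl[\tfrac{4c^2}{L^4}\bigr]L^2 = 1 - \tfrac{4c^2}{L^2} + \tfrac{4c^2}{L^2} = 1 \tag{6.51} \]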


establish item (ii). Next we claim that for all n ∈ N it holds that
\[ \|\Theta_n - \vartheta\|_2 \le (1 - 2c\gamma_n + (\gamma_n)^2 L^2)^{1/2}\,\|\Theta_{n-1} - \vartheta\|_2 \le r. \tag{6.52} \]

We now prove (6.52) by induction on n ∈ N. For the base case n = 1 observe that (6.48),
the assumption that Θ0 = ξ ∈ B, item (ii) in Lemma 6.1.8, and item (ii) ensure that
∥Θ1 − ϑ∥22 = ∥Θ0 − γ1 (∇L)(Θ0 ) − ϑ∥22
≤ (1 − 2cγ1 + (γ1 )2 L2 )∥Θ0 − ϑ∥22 (6.53)
≤ ∥Θ0 − ϑ∥22 ≤ r2 .
This establishes (6.52) in the base case n = 1. For the induction step note that (6.48),
item (ii) in Lemma 6.1.8, and item (ii) imply that for all n ∈ N with Θn ∈ B it holds that
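\[ \begin{aligned} \|\Theta_{n+1}-\vartheta\|_2^2 &= \|\Theta_n - \gamma_{n+1}(\nabla\mathcal{L})(\Theta_n) - \vartheta\|_2^2 \\ &\le \underbrace{(1 - 2c\gamma_{n+1} + (\gamma_{n+1})^2L^2)}_{\in[0,1]}\,\|\Theta_n-\vartheta\|_2^2 \\ &\le \|\Theta_n-\vartheta\|_2^2 \le r^2. \end{aligned} \tag{6.54} \]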
This demonstrates that for all n ∈ N with ∥Θn − ϑ∥2 ≤ r it holds that
\[ \|\Theta_{n+1} - \vartheta\|_2 \le (1 - 2c\gamma_{n+1} + (\gamma_{n+1})^2 L^2)^{1/2}\,\|\Theta_n - \vartheta\|_2 \le r. \tag{6.55} \]

Induction thus proves (6.52). Next observe that (6.52) establishes item (iii). Moreover, note
that induction, item (ii), and item (iii) prove item (iv). Furthermore, observe that item (iii)
and the fact that Θ0 = ξ ∈ B ensure that for all n ∈ N0 it holds that Θn ∈ B. Combining
this, (6.47), and Lemma 5.6.9 with items (i) and (iv) establishes item (v). The proof of
Proposition 6.1.9 is thus complete.
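
The error estimate in item (iv) of Proposition 6.1.9 can be checked numerically. The following Python sketch is an illustration only and is not taken from the source. It assumes the quadratic objective θ ↦ ½⟨θ, Aθ⟩ with a symmetric positive definite matrix A, so that ϑ = 0 and (6.47) holds with r = ∞, with c the smallest and L the largest eigenvalue of A. The function name `gd_errors_and_bounds` is hypothetical.

```python
import numpy as np

def gd_errors_and_bounds(A, xi, gammas):
    """GD on the assumed objective 0.5 * <theta, A theta> (minimizer vartheta = 0).

    Returns the errors ||Theta_n||_2 and the right-hand side of (6.49)
    for n = 0, 1, ..., len(gammas).
    """
    eigvals = np.linalg.eigvalsh(A)        # ascending eigenvalues of the symmetric matrix A
    c, L = eigvals[0], eigvals[-1]         # constants for which (6.47) holds in this example
    theta = np.array(xi, dtype=float)
    bound = np.linalg.norm(theta)
    errors, bounds = [bound], [bound]
    for gamma in gammas:
        if not 0.0 <= gamma <= 2.0 * c / L**2:
            raise ValueError("learning rate outside [0, 2c/L^2]")
        theta = theta - gamma * (A @ theta)    # GD step (6.48); the gradient is A @ theta
        factor = max(1.0 - 2.0 * c * gamma + gamma**2 * L**2, 0.0)  # guard against rounding
        bound *= np.sqrt(factor)
        errors.append(np.linalg.norm(theta))
        bounds.append(bound)
    return np.array(errors), np.array(bounds)
```

For instance, with A = diag(1, 3) the admissible learning rates lie in [0, 2/9], and the returned errors should stay below the returned bounds at every step.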

6.1.4.2 Size of the learning rates


In the next result, Corollary 6.1.10 below, we, roughly speaking, specialize Proposition 6.1.9
to the case where the learning rates (γn )n∈N ⊆ [0, 2c/L²] are a constant sequence.
Corollary 6.1.10 (Convergence of GD for constant learning rates). Let d ∈ N, c, L ∈ (0, ∞),
r ∈ (0, ∞], γ ∈ (0, 2c/L²), ϑ ∈ Rd , B = {w ∈ Rd : ∥w − ϑ∥2 ≤ r}, ξ ∈ B, L ∈ C 1 (Rd , R)
satisfy for all θ ∈ B that
⟨θ − ϑ, (∇L)(θ)⟩ ≥ c∥θ − ϑ∥22 and ∥(∇L)(θ)∥2 ≤ L∥θ − ϑ∥2 , (6.56)
and let Θ : N0 → Rd satisfy for all n ∈ N that
Θ0 = ξ and Θn = Θn−1 − γ(∇L)(Θn−1 ) (6.57)
(cf. Definitions 1.4.7 and 3.3.4). Then


(i) it holds that {θ ∈ B : L(θ) = inf w∈B L(w)} = {ϑ},


(ii) it holds that 0 ≤ 1 − 2cγ + γ 2 L2 < 1,
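(iii) it holds for all n ∈ N0 that
\[ \|\Theta_n - \vartheta\|_2 \le \bigl[1 - 2c\gamma + \gamma^2 L^2\bigr]^{n/2}\,\|\xi - \vartheta\|_2, \tag{6.58} \]

and
(iv) it holds for all n ∈ N0 that
\[ 0 \le \mathcal{L}(\Theta_n) - \mathcal{L}(\vartheta) \le \tfrac{L}{2}\|\Theta_n - \vartheta\|_2^2 \le \tfrac{L}{2}\bigl[1 - 2c\gamma + \gamma^2 L^2\bigr]^{n}\,\|\xi - \vartheta\|_2^2. \tag{6.59} \]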

Proof of Corollary 6.1.10. Observe that item (iii) in Lemma 6.1.8 proves item (ii). In
addition, note that Proposition 6.1.9 establishes items (i), (iii), and (iv). The proof of
Corollary 6.1.10 is thus complete.
Corollary 6.1.10 above establishes under suitable hypotheses convergence of the considered
GD process in the case where the learning rates are constant and strictly smaller
than 2c/L². The next result, Theorem 6.1.11 below, demonstrates that the condition that
the learning rates are strictly smaller than 2c/L² in Corollary 6.1.10 can, in general, not be
relaxed.
Theorem 6.1.11 (Sharp bounds on the learning rate for the convergence of GD ). Let
d ∈ N, α ∈ (0, ∞), γ ∈ R, ϑ ∈ Rd , ξ ∈ Rd \{ϑ}, let L : Rd → R satisfy for all θ ∈ Rd that

\[ \mathcal{L}(\theta) = \tfrac{\alpha}{2}\|\theta - \vartheta\|_2^2, \tag{6.60} \]

and let Θ : N0 → Rd satisfy for all n ∈ N that

Θ0 = ξ and Θn = Θn−1 − γ(∇L)(Θn−1 ) (6.61)

(cf. Definition 3.3.4). Then


(i) it holds for all θ ∈ Rd that ⟨θ − ϑ, (∇L)(θ)⟩ = α∥θ − ϑ∥22 ,
(ii) it holds for all θ ∈ Rd that ∥(∇L)(θ)∥2 = α∥θ − ϑ∥2 ,
(iii) it holds for all n ∈ N0 that ∥Θn − ϑ∥2 = |1 − γα|n ∥ξ − ϑ∥2 , and
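(iv) it holds that

\[ \liminf_{n\to\infty}\|\Theta_n-\vartheta\|_2 = \limsup_{n\to\infty}\|\Theta_n-\vartheta\|_2 = \begin{cases} 0 & : \gamma\in(0,2/\alpha) \\ \|\xi-\vartheta\|_2 & : \gamma\in\{0,2/\alpha\} \\ \infty & : \gamma\in\mathbb{R}\setminus[0,2/\alpha] \end{cases} \tag{6.62} \]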


(cf. Definition 1.4.7).

Proof of Theorem 6.1.11. First of all, note that Lemma 5.6.4 ensures that for all θ ∈ Rd it
holds that L ∈ C ∞ (Rd , R) and

\[ (\nabla\mathcal{L})(\theta) = \tfrac{\alpha}{2}\bigl(2(\theta - \vartheta)\bigr) = \alpha(\theta - \vartheta). \tag{6.63} \]

This proves item (ii). Moreover, observe that (6.63) assures that for all θ ∈ Rd it holds that

⟨θ − ϑ, (∇L)(θ)⟩ = ⟨θ − ϑ, α(θ − ϑ)⟩ = α∥θ − ϑ∥22 (6.64)

(cf. Definition 1.4.7). This establishes item (i). Observe that (6.61) and (6.63) demonstrate
that for all n ∈ N it holds that

Θn − ϑ = Θn−1 − γ(∇L)(Θn−1 ) − ϑ
= Θn−1 − γα(Θn−1 − ϑ) − ϑ (6.65)
= (1 − γα)(Θn−1 − ϑ).

The assumption that Θ0 = ξ and induction hence prove that for all n ∈ N0 it holds that

Θn − ϑ = (1 − γα)n (Θ0 − ϑ) = (1 − γα)n (ξ − ϑ). (6.66)

Therefore, we obtain for all n ∈ N0 that

∥Θn − ϑ∥2 = |1 − γα|n ∥ξ − ϑ∥2 . (6.67)

This establishes item (iii). Combining item (iii) with the fact that for all t ∈ (0, 2/α) it holds
that |1 − tα| ∈ [0, 1), the fact that for all t ∈ {0, 2/α} it holds that |1 − tα| = 1, the fact
that for all t ∈ R\[0, 2/α] it holds that |1 − tα| ∈ (1, ∞), and the fact that ∥ξ − ϑ∥2 > 0
establishes item (iv). The proof of Theorem 6.1.11 is thus complete.
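
The three regimes in item (iv) of Theorem 6.1.11 can be observed directly. The following Python sketch is an illustration and not part of the source: it runs the GD recursion (6.61) in dimension d = 1 and compares the observed error with the closed form from item (iii). The function name `gd_quadratic_errors` and the concrete values of α, ξ, ϑ, and γ are assumptions chosen for the example.

```python
import math

def gd_quadratic_errors(alpha, gamma, xi, vartheta, n_steps):
    """Run GD on L(theta) = (alpha / 2) * (theta - vartheta)**2 in dimension d = 1
    and return the errors |Theta_n - vartheta| for n = 0, 1, ..., n_steps."""
    theta = xi
    errors = [abs(theta - vartheta)]
    for _ in range(n_steps):
        # The gradient is alpha * (theta - vartheta), cf. (6.63).
        theta = theta - gamma * alpha * (theta - vartheta)
        errors.append(abs(theta - vartheta))
    return errors

if __name__ == "__main__":
    alpha, xi, vartheta = 2.0, 3.0, 1.0          # threshold 2 / alpha = 1.0
    for gamma in (0.25, 1.0, 1.25):              # below, at, and above the threshold
        errors = gd_quadratic_errors(alpha, gamma, xi, vartheta, n_steps=20)
        predicted = abs(1 - gamma * alpha) ** 20 * abs(xi - vartheta)   # item (iii)
        print(f"gamma={gamma}: error {errors[-1]:.3e}, predicted {predicted:.3e}")
```

With these values the error should shrink for γ = 0.25, stay constant for γ = 1, and grow for γ = 1.25, in line with (6.62).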

Exercise 6.1.9. Let L : R → R satisfy for all θ ∈ R that

\[ \mathcal{L}(\theta) = 2\theta^2 \tag{6.68} \]

and let Θ : N0 → R satisfy for all n ∈ N that Θ0 = 1 and

\[ \Theta_n = \Theta_{n-1} - n^{-2}\,(\nabla\mathcal{L})(\Theta_{n-1}). \tag{6.69} \]

Prove or disprove the following statement: It holds that

\[ \limsup_{n\to\infty} |\Theta_n| = 0. \tag{6.70} \]


Exercise 6.1.10. Let L : R → R satisfy for all θ ∈ R that

\[ \mathcal{L}(\theta) = 4\theta^2 \tag{6.71} \]

and for every r ∈ (1, ∞) let Θ^(r) : N0 → R satisfy for all n ∈ N that Θ^(r)_0 = 1 and

\[ \Theta^{(r)}_n = \Theta^{(r)}_{n-1} - n^{-r}\,(\nabla\mathcal{L})(\Theta^{(r)}_{n-1}). \tag{6.72} \]

Prove or disprove the following statement: It holds for all r ∈ (1, ∞) that

\[ \liminf_{n\to\infty} |\Theta^{(r)}_n| > 0. \tag{6.73} \]

Exercise 6.1.11. Let L : R → R satisfy for all θ ∈ R that

\[ \mathcal{L}(\theta) = 5\theta^2 \tag{6.74} \]

and for every r ∈ (1, ∞) let Θ^(r) = (Θ^(r)_n)n∈N0 : N0 → R satisfy for all n ∈ N that Θ^(r)_0 = 1
and

\[ \Theta^{(r)}_n = \Theta^{(r)}_{n-1} - n^{-r}\,(\nabla\mathcal{L})(\Theta^{(r)}_{n-1}). \tag{6.75} \]

Prove or disprove the following statement: It holds for all r ∈ (1, ∞) that

\[ \liminf_{n\to\infty} |\Theta^{(r)}_n| > 0. \tag{6.76} \]

6.1.4.3 Convergence rates


The next result, Corollary 6.1.12 below, establishes a convergence rate for the GD optimiza-
tion method in the case of possibly non-constant learning rates. We prove Corollary 6.1.12
through an application of Proposition 6.1.9 above.

Corollary 6.1.12 (Qualitative convergence of GD). Let d ∈ N, L ∈ C 1 (Rd , R), (γn )n∈N ⊆
R, c, L ∈ (0, ∞), ξ, ϑ ∈ Rd satisfy for all θ ∈ Rd that

⟨θ − ϑ, (∇L)(θ)⟩ ≥ c∥θ − ϑ∥22 , ∥(∇L)(θ)∥2 ≤ L∥θ − ϑ∥2 , (6.77)

\[ \text{and}\qquad 0 < \liminf_{n\to\infty}\gamma_n \le \limsup_{n\to\infty}\gamma_n < \tfrac{2c}{L^2}, \tag{6.78} \]

and let Θ : N0 → Rd satisfy for all n ∈ N that

Θ0 = ξ and Θn = Θn−1 − γn (∇L)(Θn−1 ) (6.79)

(cf. Definitions 1.4.7 and 3.3.4). Then


(i) it holds that {θ ∈ Rd : L(θ) = inf w∈Rd L(w)} = {ϑ},

(ii) there exist ϵ ∈ (0, 1), C ∈ R such that for all n ∈ N0 it holds that

∥Θn − ϑ∥2 ≤ ϵn C, (6.80)

and

(iii) there exist ϵ ∈ (0, 1), C ∈ R such that for all n ∈ N0 it holds that

0 ≤ L(Θn ) − L(ϑ) ≤ ϵn C. (6.81)

Proof of Corollary 6.1.12. Throughout this proof, let α, β ∈ R satisfy

\[ 0 < \alpha < \liminf_{n\to\infty}\gamma_n \le \limsup_{n\to\infty}\gamma_n < \beta < \tfrac{2c}{L^2} \tag{6.82} \]

(cf. (6.78)), let m ∈ N satisfy for all n ∈ N that γm+n ∈ [α, β], and let h : R → R satisfy for
all t ∈ R that
h(t) = 1 − 2ct + t2 L2 . (6.83)
Observe that (6.77) and item (ii) in Lemma 5.6.8 prove item (i). In addition, observe that
the fact that for all t ∈ R it holds that h′(t) = −2c + 2tL² implies that for all t ∈ (−∞, c/L²]
it holds that
\[ h'(t) \le -2c + 2\bigl[\tfrac{c}{L^2}\bigr]L^2 = 0. \tag{6.84} \]

The fundamental theorem of calculus hence assures that for all t ∈ [α, β] ∩ [0, c/L²] it holds
that
\[ h(t) = h(\alpha) + \int_\alpha^t h'(s)\,ds \le h(\alpha) + \int_\alpha^t 0\,ds = h(\alpha) \le \max\{h(\alpha), h(\beta)\}. \tag{6.85} \]

Furthermore, observe that the fact that for all t ∈ R it holds that h′(t) = −2c + 2tL² implies
that for all t ∈ [c/L², ∞) it holds that

\[ h'(t) \ge h'\bigl(\tfrac{c}{L^2}\bigr) = -2c + 2\bigl[\tfrac{c}{L^2}\bigr]L^2 = 0. \tag{6.86} \]

The fundamental theorem of calculus hence ensures that for all t ∈ [α, β] ∩ [c/L², ∞) it holds
that
\[ \max\{h(\alpha), h(\beta)\} \ge h(\beta) = h(t) + \int_t^\beta h'(s)\,ds \ge h(t) + \int_t^\beta 0\,ds = h(t). \tag{6.87} \]

Combining this and (6.85) establishes that for all t ∈ [α, β] it holds that

h(t) ≤ max{h(α), h(β)}. (6.88)


Moreover, observe that the fact that α, β ∈ (0, 2c/L²) and item (iii) in Lemma 6.1.8 ensure
that
{h(α), h(β)} ⊆ [0, 1). (6.89)
Hence, we obtain that
max{h(α), h(β)} ∈ [0, 1). (6.90)
This implies that there exists ε ∈ R such that

0 ≤ max{h(α), h(β)} < ε < 1. (6.91)

Next note that the fact that for all n ∈ N it holds that γm+n ∈ [α, β] ⊆ [0, 2c/L²], items (ii)
and (iv) in Proposition 6.1.9 (applied with d ↶ d, c ↶ c, L ↶ L, r ↶ ∞, (γn )n∈N ↶
(γm+n )n∈N , ϑ ↶ ϑ, ξ ↶ Θm , L ↶ L in the notation of Proposition 6.1.9), (6.77), (6.79),
and (6.88) demonstrate that for all n ∈ N it holds that
\[ \begin{aligned} \|\Theta_{m+n}-\vartheta\|_2 &\le \Bigl[\prod_{k=1}^n (1-2c\gamma_{m+k}+(\gamma_{m+k})^2L^2)^{1/2}\Bigr]\|\Theta_m-\vartheta\|_2 \\ &= \Bigl[\prod_{k=1}^n (h(\gamma_{m+k}))^{1/2}\Bigr]\|\Theta_m-\vartheta\|_2 \\ &\le (\max\{h(\alpha),h(\beta)\})^{n/2}\,\|\Theta_m-\vartheta\|_2 \\ &\le \varepsilon^{n/2}\,\|\Theta_m-\vartheta\|_2. \end{aligned} \tag{6.92} \]
This shows that for all n ∈ N with n > m it holds that

\[ \|\Theta_n-\vartheta\|_2 \le \varepsilon^{(n-m)/2}\,\|\Theta_m-\vartheta\|_2. \tag{6.93} \]

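The fact that for all n ∈ N0 with n ≤ m it holds that

\[ \|\Theta_n-\vartheta\|_2 = \Bigl[\frac{\|\Theta_n-\vartheta\|_2}{\varepsilon^{n/2}}\Bigr]\varepsilon^{n/2} \le \Bigl[\max\Bigl\{\frac{\|\Theta_k-\vartheta\|_2}{\varepsilon^{k/2}} : k\in\{0,1,\dots,m\}\Bigr\}\Bigr]\varepsilon^{n/2} \tag{6.94} \]

hence assures that for all n ∈ N0 it holds that

\[ \begin{aligned} \|\Theta_n-\vartheta\|_2 &\le \max\Bigl\{\Bigl[\max\Bigl\{\frac{\|\Theta_k-\vartheta\|_2}{\varepsilon^{k/2}} : k\in\{0,1,\dots,m\}\Bigr\}\Bigr]\varepsilon^{n/2},\ \varepsilon^{(n-m)/2}\|\Theta_m-\vartheta\|_2\Bigr\} \\ &= (\varepsilon^{1/2})^n \max\Bigl\{\max\Bigl\{\frac{\|\Theta_k-\vartheta\|_2}{\varepsilon^{k/2}} : k\in\{0,1,\dots,m\}\Bigr\},\ \varepsilon^{-m/2}\|\Theta_m-\vartheta\|_2\Bigr\} \\ &= (\varepsilon^{1/2})^n \max\Bigl\{\frac{\|\Theta_k-\vartheta\|_2}{\varepsilon^{k/2}} : k\in\{0,1,\dots,m\}\Bigr\}. \end{aligned} \tag{6.95} \]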


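This proves item (ii). In addition, note that Lemma 5.6.9, item (i), and (6.95) assure that
for all n ∈ N0 it holds that
\[ 0 \le \mathcal{L}(\Theta_n) - \mathcal{L}(\vartheta) \le \tfrac{L}{2}\|\Theta_n-\vartheta\|_2^2 \le \frac{\varepsilon^n L}{2}\Bigl[\max\Bigl\{\frac{\|\Theta_k-\vartheta\|_2^2}{\varepsilon^{k}} : k\in\{0,1,\dots,m\}\Bigr\}\Bigr]. \tag{6.96} \]
This establishes item (iii). The proof of Corollary 6.1.12 is thus complete.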

6.1.4.4 Error estimates in the case of small learning rates


The inequality in (6.49) in item (iv) in Proposition 6.1.9 above provides an error
estimate for the GD optimization method in the case where the learning rates (γn )n∈N in
Proposition 6.1.9 satisfy that for all n ∈ N it holds that γn ≤ 2c/L². The error estimate in
(6.49) can be simplified in the special case where the learning rates (γn )n∈N satisfy the more
restrictive condition that for all n ∈ N it holds that γn ≤ c/L². This is the subject of the
next result, Corollary 6.1.13 below. We prove Corollary 6.1.13 through an application of
Proposition 6.1.9 above.
Corollary 6.1.13 (Error estimates in the case of small learning rates). Let d ∈ N,
c, L ∈ (0, ∞), r ∈ (0, ∞], (γn )n∈N ⊆ [0, c/L²], ϑ ∈ Rd , B = {w ∈ Rd : ∥w − ϑ∥2 ≤ r}, ξ ∈ B,
L ∈ C 1 (Rd , R) satisfy for all θ ∈ B that

⟨θ − ϑ, (∇L)(θ)⟩ ≥ c∥θ − ϑ∥22 and ∥(∇L)(θ)∥2 ≤ L∥θ − ϑ∥2 , (6.97)

and let Θ : N0 → Rd satisfy for all n ∈ N that

Θ0 = ξ and Θn = Θn−1 − γn (∇L)(Θn−1 ) (6.98)

(cf. Definitions 1.4.7 and 3.3.4). Then


(i) it holds that {θ ∈ B : L(θ) = inf w∈B L(w)} = {ϑ},

(ii) it holds for all n ∈ N that 0 ≤ 1 − cγn ≤ 1,

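(iii) it holds for all n ∈ N0 that

\[ \|\Theta_n-\vartheta\|_2 \le \Bigl[\prod_{k=1}^n (1-c\gamma_k)^{1/2}\Bigr]\|\xi-\vartheta\|_2 \le \exp\Bigl(-\tfrac{c}{2}\sum_{k=1}^n \gamma_k\Bigr)\|\xi-\vartheta\|_2, \tag{6.99} \]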

and

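(iv) it holds for all n ∈ N0 that

\[ 0 \le \mathcal{L}(\Theta_n) - \mathcal{L}(\vartheta) \le \tfrac{L}{2}\Bigl[\prod_{k=1}^n (1-c\gamma_k)\Bigr]\|\xi-\vartheta\|_2^2 \le \tfrac{L}{2}\exp\Bigl(-c\sum_{k=1}^n \gamma_k\Bigr)\|\xi-\vartheta\|_2^2. \tag{6.100} \]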


Proof of Corollary 6.1.13. Note that item (ii) in Proposition 6.1.9 and the assumption that
for all n ∈ N it holds that γn ∈ [0, c/L²] ensure that for all n ∈ N it holds that
\[ 0 \le 1 - 2c\gamma_n + (\gamma_n)^2L^2 \le 1 - 2c\gamma_n + \gamma_n\bigl[\tfrac{c}{L^2}\bigr]L^2 = 1 - 2c\gamma_n + \gamma_n c = 1 - c\gamma_n \le 1. \tag{6.101} \]
This proves item (ii). Moreover, note that (6.101) and Proposition 6.1.9 establish items (i),
(iii), and (iv). The proof of Corollary 6.1.13 is thus complete.
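
The two bounds in (6.99) can be compared numerically. The following Python sketch is illustrative and not part of the source: it again assumes the quadratic objective θ ↦ ½⟨θ, Aθ⟩ with a symmetric positive definite matrix A (so ϑ = 0, r = ∞, c the smallest and L the largest eigenvalue of A). The function name `small_rate_bounds` is hypothetical.

```python
import numpy as np

def small_rate_bounds(A, xi, gammas):
    """Return the GD errors together with the product bound and the
    exponential bound from (6.99) for learning rates in [0, c/L^2]."""
    eigvals = np.linalg.eigvalsh(A)
    c, L = eigvals[0], eigvals[-1]              # constants in (6.97) for this example
    gammas = np.asarray(gammas, dtype=float)
    if np.any(gammas < 0.0) or np.any(gammas > c / L**2):
        raise ValueError("learning rates must lie in [0, c/L^2]")
    theta = np.array(xi, dtype=float)
    errors = [np.linalg.norm(theta)]
    for gamma in gammas:
        theta = theta - gamma * (A @ theta)     # GD step (6.98)
        errors.append(np.linalg.norm(theta))
    errors = np.array(errors)
    # Prepend 1.0 so that index n = 0 corresponds to the empty product / empty sum.
    prod_bound = errors[0] * np.concatenate(([1.0], np.cumprod(np.sqrt(1.0 - c * gammas))))
    exp_bound = errors[0] * np.concatenate(([1.0], np.exp(-0.5 * c * np.cumsum(gammas))))
    return errors, prod_bound, exp_bound
```

The exponential bound is weaker than the product bound but depends on the learning rates only through their partial sums, which makes it convenient for decaying schedules.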
In the next result, Corollary 6.1.14 below, we, roughly speaking, specialize Corollary 6.1.13
above to the case where the learning rates (γn )n∈N ⊆ [0, c/L²] are a constant
sequence.
Corollary 6.1.14 (Error estimates in the case of small and constant learning rates). Let
d ∈ N, c, L ∈ (0, ∞), r ∈ (0, ∞], γ ∈ (0, c/L²], ϑ ∈ Rd , B = {w ∈ Rd : ∥w − ϑ∥2 ≤ r}, ξ ∈ B,
L ∈ C 1 (Rd , R) satisfy for all θ ∈ B that
⟨θ − ϑ, (∇L)(θ)⟩ ≥ c∥θ − ϑ∥22 and ∥(∇L)(θ)∥2 ≤ L∥θ − ϑ∥2 , (6.102)
and let Θ : N0 → Rd satisfy for all n ∈ N that
Θ0 = ξ and Θn = Θn−1 − γ(∇L)(Θn−1 ) (6.103)
(cf. Definitions 1.4.7 and 3.3.4). Then
(i) it holds that {θ ∈ B : L(θ) = inf w∈B L(w)} = {ϑ},
(ii) it holds that 0 ≤ 1 − cγ < 1,
(iii) it holds for all n ∈ N0 that ∥Θn − ϑ∥2 ≤ (1 − cγ)^(n/2) ∥ξ − ϑ∥2 , and
(iv) it holds for all n ∈ N0 that
\[ 0 \le \mathcal{L}(\Theta_n) - \mathcal{L}(\vartheta) \le \tfrac{L}{2}(1-c\gamma)^n\,\|\xi-\vartheta\|_2^2. \]
Proof of Corollary 6.1.14. Corollary 6.1.14 is an immediate consequence of Corollary 6.1.13.
The proof of Corollary 6.1.14 is thus complete.

6.1.4.5 On the spectrum of the Hessian of the objective function at a local minimum point

A crucial ingredient in our error analysis for the GD optimization method in Sections 6.1.4.1,
6.1.4.2, 6.1.4.3, and 6.1.4.4 above is to employ the growth and the coercivity-type hypothe-
ses, for instance, in (6.47) in Proposition 6.1.9 above. In this subsection we disclose in
Proposition 6.1.16 below suitable conditions on the Hessians of the objective function of
the considered optimization problem which are sufficient to ensure that (6.47) is satisfied
so that we are in the position to apply the error analysis in Sections 6.1.4.1, 6.1.4.2, 6.1.4.3,
and 6.1.4.4 above (cf. Corollary 6.1.17 below). Our proof of Proposition 6.1.16 employs the
following classical result (see Lemma 6.1.15 below) for symmetric matrices with real entries.


Lemma 6.1.15 (Properties of the spectrum of real symmetric matrices). Let d ∈ N, let
A ∈ Rd×d be a symmetric matrix, and let

S = {λ ∈ C : (∃ v ∈ Cd \{0} : Av = λv)}. (6.104)

Then

(i) it holds that S = {λ ∈ R : (∃ v ∈ Rd \{0} : Av = λv)} ⊆ R,

(ii) it holds that

\[ \sup_{v\in\mathbb{R}^d\setminus\{0\}} \Bigl[\frac{\|Av\|_2}{\|v\|_2}\Bigr] = \max_{\lambda\in S}|\lambda|, \tag{6.105} \]

and

(iii) it holds for all v ∈ Rd that

min(S)∥v∥22 ≤ ⟨v, Av⟩ ≤ max(S)∥v∥22 (6.106)

(cf. Definitions 1.4.7 and 3.3.4).

Proof of Lemma 6.1.15. Throughout this proof, let e1 , e2 , . . . , ed ∈ Rd be the vectors given
by

e1 = (1, 0, . . . , 0), e2 = (0, 1, 0, . . . , 0), ..., ed = (0, . . . , 0, 1). (6.107)

Observe that the spectral theorem for symmetric matrices (see, for example, Petersen [331,
Theorem 4.3.4]) proves that there exist (d × d)-matrices Λ = (Λi,j )(i,j)∈{1,2,...,d}2 , O =
(Oi,j )(i,j)∈{1,2,...,d}2 ∈ Rd×d such that S = {Λ1,1 , Λ2,2 , . . . , Λd,d }, O∗ O = OO∗ = Id , A = OΛO∗ ,
and
\[ \Lambda = \begin{pmatrix} \Lambda_{1,1} & & 0 \\ & \ddots & \\ 0 & & \Lambda_{d,d} \end{pmatrix} \in \mathbb{R}^{d\times d} \tag{6.108} \]
(cf. Definition 1.5.5). Hence, we obtain that S ⊆ R. Next note that the assumption
that S = {λ ∈ C : (∃ v ∈ Cd \{0} : Av = λv)} ensures that for every λ ∈ S there exists
v ∈ Cd \{0} such that

ARe(v) + iAIm(v) = Av = λv = λRe(v) + iλIm(v). (6.109)

The fact that S ⊆ R therefore demonstrates that for every λ ∈ S there exists v ∈ Rd \{0}
such that Av = λv. This and the fact that S ⊆ R ensure that S ⊆ {λ ∈ R : (∃ v ∈
Rd \{0} : Av = λv)}. Combining this and the fact that {λ ∈ R : (∃ v ∈ Rd \{0} : Av =


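λv)} ⊆ S proves item (i). Furthermore, note that (6.108) assures that for all v =
(v1 , v2 , . . . , vd ) ∈ Rd it holds that
\[ \begin{aligned} \|\Lambda v\|_2 &= \Bigl[\sum_{i=1}^d |\Lambda_{i,i}v_i|^2\Bigr]^{1/2} \le \Bigl[\sum_{i=1}^d \max\bigl\{|\Lambda_{1,1}|^2,\dots,|\Lambda_{d,d}|^2\bigr\}|v_i|^2\Bigr]^{1/2} \\ &= \Bigl[\max\bigl\{|\Lambda_{1,1}|,\dots,|\Lambda_{d,d}|\bigr\}^2\,\|v\|_2^2\Bigr]^{1/2} \\ &= \max\bigl\{|\Lambda_{1,1}|,\dots,|\Lambda_{d,d}|\bigr\}\,\|v\|_2 \\ &= \bigl[\max\nolimits_{\lambda\in S}|\lambda|\bigr]\,\|v\|_2 \end{aligned} \tag{6.110} \]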

(cf. Definition 3.3.4). The fact that O is an orthogonal matrix and the fact that A = OΛO∗
therefore imply that for all v ∈ Rd it holds that

∥Av∥2 = ∥OΛO∗ v∥2 = ∥ΛO∗ v∥2


≤ maxλ∈S |λ| ∥O∗ v∥2 (6.111)


= maxλ∈S |λ| ∥v∥2 .

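This implies that

\[ \sup_{v\in\mathbb{R}^d\setminus\{0\}} \Bigl[\frac{\|Av\|_2}{\|v\|_2}\Bigr] \le \sup_{v\in\mathbb{R}^d\setminus\{0\}} \Bigl[\frac{[\max_{\lambda\in S}|\lambda|]\,\|v\|_2}{\|v\|_2}\Bigr] = \max\nolimits_{\lambda\in S}|\lambda|. \tag{6.112} \]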

In addition, note that the fact that S = {Λ1,1 , Λ2,2 . . . , Λd,d } ensures that there exists
j ∈ {1, 2, . . . , d} such that
|Λj,j | = maxλ∈S |λ|. (6.113)
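Next observe that the fact that A = OΛO∗ , the fact that O is an orthogonal matrix, and
(6.113) imply that
\[ \begin{aligned} \sup_{v\in\mathbb{R}^d\setminus\{0\}} \Bigl[\frac{\|Av\|_2}{\|v\|_2}\Bigr] &\ge \frac{\|AOe_j\|_2}{\|Oe_j\|_2} = \|O\Lambda O^*Oe_j\|_2 = \|O\Lambda e_j\|_2 \\ &= \|\Lambda e_j\|_2 = \|\Lambda_{j,j}e_j\|_2 = |\Lambda_{j,j}| = \max\nolimits_{\lambda\in S}|\lambda|. \end{aligned} \tag{6.114} \]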

Combining this and (6.112) establishes item (ii). It thus remains to prove item (iii). For
this note that (6.108) ensures that for all v = (v1 , v2 , . . . , vd ) ∈ Rd it holds that
\[ \begin{aligned} \langle v,\Lambda v\rangle &= \sum_{i=1}^d \Lambda_{i,i}|v_i|^2 \le \sum_{i=1}^d \max\{\Lambda_{1,1},\dots,\Lambda_{d,d}\}|v_i|^2 \\ &= \max\{\Lambda_{1,1},\dots,\Lambda_{d,d}\}\,\|v\|_2^2 = \max(S)\,\|v\|_2^2 \end{aligned} \tag{6.115} \]


(cf. Definition 1.4.7). The fact that O is an orthogonal matrix and the fact that A = OΛO∗
therefore demonstrate that for all v ∈ Rd it holds that

⟨v, Av⟩ = ⟨v, OΛO∗ v⟩ = ⟨O∗ v, ΛO∗ v⟩


(6.116)
≤ max(S)∥O∗ v∥22 = max(S)∥v∥22 .

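Moreover, observe that (6.108) implies that for all v = (v1 , v2 , . . . , vd ) ∈ Rd it holds that
\[ \begin{aligned} \langle v,\Lambda v\rangle &= \sum_{i=1}^d \Lambda_{i,i}|v_i|^2 \ge \sum_{i=1}^d \min\{\Lambda_{1,1},\dots,\Lambda_{d,d}\}|v_i|^2 \\ &= \min\{\Lambda_{1,1},\dots,\Lambda_{d,d}\}\,\|v\|_2^2 = \min(S)\,\|v\|_2^2. \end{aligned} \tag{6.117} \]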

The fact that O is an orthogonal matrix and the fact that A = OΛO∗ hence demonstrate
that for all v ∈ Rd it holds that

⟨v, Av⟩ = ⟨v, OΛO∗ v⟩ = ⟨O∗ v, ΛO∗ v⟩


(6.118)
≥ min(S)∥O∗ v∥22 = min(S)∥v∥22 .

Combining this with (6.116) establishes item (iii). The proof of Lemma 6.1.15 is thus
complete.
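
Items (ii) and (iii) of Lemma 6.1.15 can be sanity-checked numerically. The following Python sketch is an illustration under stated assumptions and not part of the source: it compares the spectral norm of a symmetric matrix with its largest absolute eigenvalue and checks on randomly sampled vectors that the quotient ⟨v, Av⟩/∥v∥₂² stays between the extreme eigenvalues. The function name `check_spectral_properties` and the sampling scheme are hypothetical choices.

```python
import numpy as np

def check_spectral_properties(A, n_samples=1000, seed=0):
    """Numerically compare both sides of items (ii) and (iii) in Lemma 6.1.15."""
    assert np.allclose(A, A.T), "A must be symmetric"
    eigvals = np.linalg.eigvalsh(A)          # real eigenvalues in ascending order
    op_norm = np.linalg.norm(A, 2)           # sup ||Av||_2 / ||v||_2 (largest singular value)
    rng = np.random.default_rng(seed)
    V = rng.standard_normal((n_samples, A.shape[0]))   # each row is a test vector v
    # <v, Av> / ||v||_2^2 for every sampled row v
    rayleigh = np.einsum("ni,ij,nj->n", V, A, V) / np.einsum("ni,ni->n", V, V)
    return {
        "op_norm": op_norm,
        "max_abs_eigenvalue": np.max(np.abs(eigvals)),
        "eig_min": eigvals[0],
        "eig_max": eigvals[-1],
        "rayleigh_min": rayleigh.min(),
        "rayleigh_max": rayleigh.max(),
    }
```

A random sample can only illustrate (6.106); it does not replace the proof, since the extreme eigenvalues are attained only along eigenvectors.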

We now present the promised Proposition 6.1.16 which discloses suitable conditions
(cf. (6.119) and (6.120) below) on the Hessians of the objective function of the considered
optimization problem which are sufficient to ensure that (6.47) is satisfied so that we are
in the position to apply the error analysis in Sections 6.1.4.1, 6.1.4.2, 6.1.4.3, and 6.1.4.4
above.

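Proposition 6.1.16 (Conditions on the spectrum of the Hessian of the objective function
at a local minimum point). Let d ∈ N, let |||·||| : Rd×d → [0, ∞) satisfy for all A ∈ Rd×d that
|||A||| = sup_{v∈Rd\{0}} ∥Av∥2 /∥v∥2 , and let λ, α ∈ (0, ∞), β ∈ [α, ∞), ϑ ∈ Rd , L ∈ C 2 (Rd , R) satisfy
for all v, w ∈ Rd that

\[ (\nabla\mathcal{L})(\vartheta) = 0, \qquad |||(\operatorname{Hess}\mathcal{L})(v) - (\operatorname{Hess}\mathcal{L})(w)||| \le \lambda\|v-w\|_2, \tag{6.119} \]

\[ \text{and}\qquad \{\mu\in\mathbb{R} : (\exists\, u\in\mathbb{R}^d\setminus\{0\} : [(\operatorname{Hess}\mathcal{L})(\vartheta)]u = \mu u)\} \subseteq [\alpha,\beta] \tag{6.120} \]

(cf. Definition 3.3.4). Then it holds for all θ ∈ {w ∈ Rd : ∥w − ϑ∥2 ≤ α/λ} that

\[ \langle\theta-\vartheta,(\nabla\mathcal{L})(\theta)\rangle \ge \tfrac{\alpha}{2}\|\theta-\vartheta\|_2^2 \qquad\text{and}\qquad \|(\nabla\mathcal{L})(\theta)\|_2 \le \tfrac{3\beta}{2}\|\theta-\vartheta\|_2 \tag{6.121} \]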

(cf. Definition 1.4.7).


Proof of Proposition 6.1.16. Throughout this proof, let B ⊆ Rd be the set given by

\[ B = \bigl\{w\in\mathbb{R}^d : \|w-\vartheta\|_2 \le \tfrac{\alpha}{\lambda}\bigr\} \tag{6.122} \]

and let S ⊆ C be the set given by

S = {µ ∈ C : (∃ u ∈ Cd \{0} : [(Hess L)(ϑ)]u = µu)}. (6.123)

Note that the fact that (Hess L)(ϑ) ∈ Rd×d is a symmetric matrix, item (i) in Lemma 6.1.15,
and (6.120) imply that

S = {µ ∈ R : (∃ u ∈ Rd \{0} : [(Hess L)(ϑ)]u = µu)} ⊆ [α, β]. (6.124)

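Next observe that the assumption that (∇L)(ϑ) = 0 and the fundamental theorem of
calculus ensure that for all θ, w ∈ Rd it holds that

\[ \begin{aligned} \langle w,(\nabla\mathcal{L})(\theta)\rangle &= \langle w,(\nabla\mathcal{L})(\theta)-(\nabla\mathcal{L})(\vartheta)\rangle \\ &= \bigl\langle w, \bigl[(\nabla\mathcal{L})(\vartheta+t(\theta-\vartheta))\bigr]_{t=0}^{t=1}\bigr\rangle \\ &= \Bigl\langle w, \int_0^1 [(\operatorname{Hess}\mathcal{L})(\vartheta+t(\theta-\vartheta))](\theta-\vartheta)\,dt\Bigr\rangle \\ &= \int_0^1 \bigl\langle w, [(\operatorname{Hess}\mathcal{L})(\vartheta+t(\theta-\vartheta))](\theta-\vartheta)\bigr\rangle\,dt \\ &= \bigl\langle w,[(\operatorname{Hess}\mathcal{L})(\vartheta)](\theta-\vartheta)\bigr\rangle \\ &\quad + \int_0^1 \bigl\langle w, \bigl[(\operatorname{Hess}\mathcal{L})(\vartheta+t(\theta-\vartheta))-(\operatorname{Hess}\mathcal{L})(\vartheta)\bigr](\theta-\vartheta)\bigr\rangle\,dt \end{aligned} \tag{6.125} \]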

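(cf. Definition 1.4.7). The fact that (Hess L)(ϑ) ∈ Rd×d is a symmetric matrix, item (iii)
in Lemma 6.1.15, and the Cauchy-Schwarz inequality therefore imply that for all θ ∈ B it
holds that

\[ \begin{aligned} \langle\theta-\vartheta,(\nabla\mathcal{L})(\theta)\rangle &\ge \bigl\langle\theta-\vartheta,[(\operatorname{Hess}\mathcal{L})(\vartheta)](\theta-\vartheta)\bigr\rangle \\ &\quad - \int_0^1 \bigl\langle\theta-\vartheta, \bigl[(\operatorname{Hess}\mathcal{L})(\vartheta+t(\theta-\vartheta))-(\operatorname{Hess}\mathcal{L})(\vartheta)\bigr](\theta-\vartheta)\bigr\rangle\,dt \\ &\ge \min(S)\,\|\theta-\vartheta\|_2^2 \\ &\quad - \int_0^1 \|\theta-\vartheta\|_2\,\bigl\|\bigl[(\operatorname{Hess}\mathcal{L})(\vartheta+t(\theta-\vartheta))-(\operatorname{Hess}\mathcal{L})(\vartheta)\bigr](\theta-\vartheta)\bigr\|_2\,dt. \end{aligned} \tag{6.126} \]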


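Combining this with (6.124) and (6.119) shows that for all θ ∈ B it holds that
\[ \begin{aligned} \langle\theta-\vartheta,(\nabla\mathcal{L})(\theta)\rangle &\ge \alpha\|\theta-\vartheta\|_2^2 - \int_0^1 \|\theta-\vartheta\|_2\,|||(\operatorname{Hess}\mathcal{L})(\vartheta+t(\theta-\vartheta))-(\operatorname{Hess}\mathcal{L})(\vartheta)|||\,\|\theta-\vartheta\|_2\,dt \\ &\ge \alpha\|\theta-\vartheta\|_2^2 - \Bigl[\int_0^1 \lambda\|\vartheta+t(\theta-\vartheta)-\vartheta\|_2\,dt\Bigr]\|\theta-\vartheta\|_2^2 \\ &= \Bigl[\alpha - \Bigl(\int_0^1 t\,dt\Bigr)\lambda\|\theta-\vartheta\|_2\Bigr]\|\theta-\vartheta\|_2^2 = \bigl[\alpha - \tfrac{\lambda}{2}\|\theta-\vartheta\|_2\bigr]\|\theta-\vartheta\|_2^2 \\ &\ge \bigl[\alpha - \tfrac{\lambda\alpha}{2\lambda}\bigr]\|\theta-\vartheta\|_2^2 = \tfrac{\alpha}{2}\|\theta-\vartheta\|_2^2. \end{aligned} \tag{6.127} \]
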

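Moreover, observe that (6.119), (6.124), (6.125), the fact that (Hess L)(ϑ) ∈ Rd×d is a
symmetric matrix, item (ii) in Lemma 6.1.15, the Cauchy-Schwarz inequality, and the
assumption that α ≤ β ensure that for all θ ∈ B, w ∈ Rd with ∥w∥2 = 1 it holds that
\[ \begin{aligned} \langle w,(\nabla\mathcal{L})(\theta)\rangle &\le \bigl\langle w,[(\operatorname{Hess}\mathcal{L})(\vartheta)](\theta-\vartheta)\bigr\rangle + \int_0^1 \bigl\langle w, \bigl[(\operatorname{Hess}\mathcal{L})(\vartheta+t(\theta-\vartheta))-(\operatorname{Hess}\mathcal{L})(\vartheta)\bigr](\theta-\vartheta)\bigr\rangle\,dt \\ &\le \|w\|_2\,\|[(\operatorname{Hess}\mathcal{L})(\vartheta)](\theta-\vartheta)\|_2 + \int_0^1 \|w\|_2\,\bigl\|\bigl[(\operatorname{Hess}\mathcal{L})(\vartheta+t(\theta-\vartheta))-(\operatorname{Hess}\mathcal{L})(\vartheta)\bigr](\theta-\vartheta)\bigr\|_2\,dt \\ &\le \Bigl[\sup_{v\in\mathbb{R}^d\setminus\{0\}}\frac{\|[(\operatorname{Hess}\mathcal{L})(\vartheta)]v\|_2}{\|v\|_2}\Bigr]\|\theta-\vartheta\|_2 + \int_0^1 |||(\operatorname{Hess}\mathcal{L})(\vartheta+t(\theta-\vartheta))-(\operatorname{Hess}\mathcal{L})(\vartheta)|||\,\|\theta-\vartheta\|_2\,dt \\ &\le \max(S)\,\|\theta-\vartheta\|_2 + \Bigl[\int_0^1 \lambda\|\vartheta+t(\theta-\vartheta)-\vartheta\|_2\,dt\Bigr]\|\theta-\vartheta\|_2 \\ &\le \Bigl[\beta + \lambda\Bigl(\int_0^1 t\,dt\Bigr)\|\theta-\vartheta\|_2\Bigr]\|\theta-\vartheta\|_2 = \bigl[\beta + \tfrac{\lambda}{2}\|\theta-\vartheta\|_2\bigr]\|\theta-\vartheta\|_2 \\ &\le \bigl[\beta + \tfrac{\lambda\alpha}{2\lambda}\bigr]\|\theta-\vartheta\|_2 = \bigl[\tfrac{2\beta+\alpha}{2}\bigr]\|\theta-\vartheta\|_2 \le \tfrac{3\beta}{2}\|\theta-\vartheta\|_2. \end{aligned} \tag{6.128} \]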
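Therefore, we obtain for all θ ∈ B that
\[ \|(\nabla\mathcal{L})(\theta)\|_2 = \sup_{w\in\mathbb{R}^d,\ \|w\|_2=1} \bigl[\langle w,(\nabla\mathcal{L})(\theta)\rangle\bigr] \le \tfrac{3\beta}{2}\|\theta-\vartheta\|_2. \tag{6.129} \]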

Combining this and (6.127) establishes (6.121). The proof of Proposition 6.1.16 is thus
complete.


The next result, Corollary 6.1.17 below, combines Proposition 6.1.16 with Proposition 6.1.9
to obtain an error analysis which assumes the conditions in (6.119) and (6.120)
in Proposition 6.1.16 above. A result similar to Corollary 6.1.17 can, for instance, be found
in Nesterov [303, Theorem 1.2.4].
Corollary 6.1.17 (Error analysis for the GD optimization method under conditions on the
Hessian of the objective function). Let d ∈ N, let |||·||| : Rd×d → R satisfy for all A ∈ Rd×d that
|||A||| = sup_{v∈Rd\{0}} ∥Av∥2 /∥v∥2 , and let λ, α ∈ (0, ∞), β ∈ [α, ∞), (γn )n∈N ⊆ [0, 4α/(9β²)], ϑ, ξ ∈ Rd ,
L ∈ C 2 (Rd , R) satisfy for all v, w ∈ Rd that

\[ (\nabla\mathcal{L})(\vartheta) = 0, \qquad |||(\operatorname{Hess}\mathcal{L})(v) - (\operatorname{Hess}\mathcal{L})(w)||| \le \lambda\|v-w\|_2, \tag{6.130} \]
\[ \{\mu\in\mathbb{R} : (\exists\, u\in\mathbb{R}^d\setminus\{0\} : [(\operatorname{Hess}\mathcal{L})(\vartheta)]u = \mu u)\} \subseteq [\alpha,\beta], \tag{6.131} \]
and ∥ξ − ϑ∥2 ≤ α/λ, and let Θ : N0 → Rd satisfy for all n ∈ N that
Θ0 = ξ and Θn = Θn−1 − γn (∇L)(Θn−1 ) (6.132)
(cf. Definition 3.3.4). Then
(i) it holds that {θ ∈ B : L(θ) = inf w∈B L(w)} = {ϑ},
(ii) it holds for all k ∈ N that 0 ≤ 1 − αγk + 9β²(γk)²/4 ≤ 1,
(iii) it holds for all n ∈ N0 that
\[ \|\Theta_n-\vartheta\|_2 \le \Bigl[\prod_{k=1}^n \bigl[1 - \alpha\gamma_k + \tfrac{9\beta^2(\gamma_k)^2}{4}\bigr]^{1/2}\Bigr]\|\xi-\vartheta\|_2, \tag{6.133} \]

and
(iv) it holds for all n ∈ N0 that
\[ 0 \le \mathcal{L}(\Theta_n) - \mathcal{L}(\vartheta) \le \tfrac{3\beta}{4}\Bigl[\prod_{k=1}^n \bigl[1 - \alpha\gamma_k + \tfrac{9\beta^2(\gamma_k)^2}{4}\bigr]\Bigr]\|\xi-\vartheta\|_2^2. \tag{6.134} \]

Proof of Corollary 6.1.17. Note that (6.130), (6.131), and Proposition 6.1.16 prove that for
all θ ∈ {w ∈ Rd : ∥w − ϑ∥2 ≤ α/λ} it holds that
\[ \langle\theta-\vartheta,(\nabla\mathcal{L})(\theta)\rangle \ge \tfrac{\alpha}{2}\|\theta-\vartheta\|_2^2 \qquad\text{and}\qquad \|(\nabla\mathcal{L})(\theta)\|_2 \le \tfrac{3\beta}{2}\|\theta-\vartheta\|_2 \tag{6.135} \]
(cf. Definition 1.4.7). Combining this, the assumption that
\[ \|\xi-\vartheta\|_2 \le \tfrac{\alpha}{\lambda}, \tag{6.136} \]
(6.132), and items (iv) and (v) in Proposition 6.1.9 (applied with c ↶ α/2, L ↶ 3β/2, r ↶ α/λ in
the notation of Proposition 6.1.9) establishes items (i), (ii), (iii), and (iv). The proof of
Corollary 6.1.17 is thus complete.
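
To see Corollary 6.1.17 at work on a non-quadratic objective, the following Python sketch may help. It is an illustration and not part of the source: it assumes the one-dimensional objective θ ↦ θ²/2 + θ³/6, for which ϑ = 0, the second derivative 1 + θ is Lipschitz continuous with constant λ = 1, and the second derivative at ϑ equals 1, so that α = β = 1. The names `grad` and `gd_with_hessian_bound` are hypothetical.

```python
import math

# Assumed example constants: the second derivative at vartheta = 0 equals 1, so it lies in
# [ALPHA, BETA], and the second derivative 1 + theta is LAM-Lipschitz.
ALPHA, BETA, LAM = 1.0, 1.0, 1.0

def grad(theta):
    """Derivative of the assumed objective theta**2 / 2 + theta**3 / 6."""
    return theta + theta**2 / 2

def gd_with_hessian_bound(xi, gammas):
    """Return pairs (|Theta_n - vartheta|, right-hand side of (6.133)) for n = 0, 1, ..."""
    if abs(xi) > ALPHA / LAM:
        raise ValueError("initial value outside the ball of radius alpha / lambda")
    theta, bound = xi, abs(xi)
    history = [(abs(theta), bound)]
    for gamma in gammas:
        if not 0.0 <= gamma <= 4 * ALPHA / (9 * BETA**2):
            raise ValueError("learning rate outside [0, 4 alpha / (9 beta^2)]")
        theta = theta - gamma * grad(theta)                       # GD step (6.132)
        bound *= math.sqrt(1 - ALPHA * gamma + 9 * BETA**2 * gamma**2 / 4)
        history.append((abs(theta), bound))
    return history
```

In this example the admissible initial values form the interval [−1, 1] and the admissible learning rates the interval [0, 4/9]; at the endpoint 4/9 the contraction factor in (6.133) equals 1, so the bound is informative only for smaller learning rates.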


Remark 6.1.18. In Corollary 6.1.17 we establish convergence of the considered GD process


under, amongst other things, the assumption that all eigenvalues of the Hessian of L : Rd →
R at the local minimum point ϑ are strictly positive (see (6.131)). In the situation where L
is the cost function (integrated loss function) associated to a supervised learning problem in
the training of ANNs, this assumption is basically not satisfied. Nonetheless, the convergence
analysis in Corollary 6.1.17 can, roughly speaking, also be performed under the essentially
(up to the smoothness conditions) more general assumption that there exists k ∈ N0 such
that the set of local minimum points is locally a smooth k-dimensional submanifold of
Rd and that the rank of the Hessian of L is on this set of local minimum points locally
(at least) d − k (cf. Fehrman et al. [132] for details). In certain situations this essentially
generalized assumption has also been shown to be satisfied in the training of ANNs in
suitable supervised learning problems (see Jentzen & Riekert [223]).

6.1.4.6 Equivalent conditions on the objective function


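Lemma 6.1.19. Let d ∈ N, let ⟨⟨·, ·⟩⟩ : Rd × Rd → R be a scalar product, let |||·||| : Rd → R
satisfy for all v ∈ Rd that |||v||| = √⟨⟨v, v⟩⟩, let γ ∈ (0, ∞), ε ∈ (0, 1), r ∈ (0, ∞], ϑ ∈ Rd ,
B = {w ∈ Rd : |||w − ϑ||| ≤ r}, and let G : Rd → Rd satisfy for all θ ∈ B that
\[ |||\theta - \gamma G(\theta) - \vartheta||| \le \varepsilon\,|||\theta-\vartheta|||. \tag{6.137} \]
Then it holds for all θ ∈ B that
\[ \begin{aligned} \langle\langle\theta-\vartheta, G(\theta)\rangle\rangle &\ge \max\Bigl\{\bigl[\tfrac{1-\varepsilon^2}{2\gamma}\bigr]|||\theta-\vartheta|||^2,\ \tfrac{\gamma}{2}|||G(\theta)|||^2\Bigr\} \\ &\ge \min\bigl\{\tfrac{1-\varepsilon^2}{2\gamma}, \tfrac{\gamma}{2}\bigr\}\max\bigl\{|||\theta-\vartheta|||^2, |||G(\theta)|||^2\bigr\}. \end{aligned} \tag{6.138} \]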


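Proof of Lemma 6.1.19. First, note that (6.137) ensures that for all θ ∈ B it holds that
\[ \begin{aligned} \varepsilon^2|||\theta-\vartheta|||^2 &\ge |||\theta-\gamma G(\theta)-\vartheta|||^2 = |||(\theta-\vartheta)-\gamma G(\theta)|||^2 \\ &= |||\theta-\vartheta|||^2 - 2\gamma\langle\langle\theta-\vartheta,G(\theta)\rangle\rangle + \gamma^2|||G(\theta)|||^2. \end{aligned} \tag{6.139} \]
Hence, we obtain for all θ ∈ B that
\[ \begin{aligned} 2\gamma\langle\langle\theta-\vartheta,G(\theta)\rangle\rangle &\ge (1-\varepsilon^2)|||\theta-\vartheta|||^2 + \gamma^2|||G(\theta)|||^2 \\ &\ge \max\bigl\{(1-\varepsilon^2)|||\theta-\vartheta|||^2,\ \gamma^2|||G(\theta)|||^2\bigr\} \ge 0. \end{aligned} \tag{6.140} \]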

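This demonstrates that for all θ ∈ B it holds that

\[ \begin{aligned} \langle\langle\theta-\vartheta,G(\theta)\rangle\rangle &\ge \tfrac{1}{2\gamma}\max\bigl\{(1-\varepsilon^2)|||\theta-\vartheta|||^2,\ \gamma^2|||G(\theta)|||^2\bigr\} \\ &= \max\Bigl\{\bigl[\tfrac{1-\varepsilon^2}{2\gamma}\bigr]|||\theta-\vartheta|||^2,\ \tfrac{\gamma}{2}|||G(\theta)|||^2\Bigr\} \\ &\ge \min\bigl\{\tfrac{1-\varepsilon^2}{2\gamma}, \tfrac{\gamma}{2}\bigr\}\max\bigl\{|||\theta-\vartheta|||^2, |||G(\theta)|||^2\bigr\}. \end{aligned} \tag{6.141} \]
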

The proof of Lemma 6.1.19 is thus complete.


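Lemma 6.1.20. Let d ∈ N, let ⟨⟨·, ·⟩⟩ : Rd × Rd → R be a scalar product, let |||·||| : Rd → R
satisfy for all v ∈ Rd that |||v||| = √⟨⟨v, v⟩⟩, let c ∈ (0, ∞), r ∈ (0, ∞], ϑ ∈ Rd , B = {w ∈
Rd : |||w − ϑ||| ≤ r}, and let G : Rd → Rd satisfy for all θ ∈ B that
\[ \langle\langle\theta-\vartheta,G(\theta)\rangle\rangle \ge c\max\bigl\{|||\theta-\vartheta|||^2, |||G(\theta)|||^2\bigr\}. \tag{6.142} \]
Then it holds for all θ ∈ B that
\[ \langle\langle\theta-\vartheta,G(\theta)\rangle\rangle \ge c\,|||\theta-\vartheta|||^2 \qquad\text{and}\qquad |||G(\theta)||| \le \tfrac{1}{c}|||\theta-\vartheta|||. \tag{6.143} \]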
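Proof of Lemma 6.1.20. Observe that (6.142) and the Cauchy-Schwarz inequality assure
that for all θ ∈ B it holds that
\[ |||G(\theta)|||^2 \le \max\bigl\{|||\theta-\vartheta|||^2, |||G(\theta)|||^2\bigr\} \le \tfrac{1}{c}\langle\langle\theta-\vartheta,G(\theta)\rangle\rangle \le \tfrac{1}{c}|||\theta-\vartheta|||\,|||G(\theta)|||. \tag{6.144} \]
Therefore, we obtain for all θ ∈ B that
\[ |||G(\theta)||| \le \tfrac{1}{c}|||\theta-\vartheta|||. \tag{6.145} \]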
Combining this with (6.142) completes the proof of Lemma 6.1.20.
Lemma 6.1.21. Let d ∈ N, c ∈ (0, ∞), r ∈ (0, ∞], ϑ ∈ Rd , B = {w ∈ Rd : ∥w − ϑ∥2 ≤ r},
L ∈ C 1 (Rd , R) satisfy for all θ ∈ B that
⟨θ − ϑ, (∇L)(θ)⟩ ≥ c∥θ − ϑ∥22 . (6.146)
Then it holds for all v ∈ Rd , s, t ∈ [0, 1] with ∥v∥2 ≤ r and s ≤ t that
\[ \mathcal{L}(\vartheta+tv) - \mathcal{L}(\vartheta+sv) \ge \tfrac{c}{2}(t^2-s^2)\|v\|_2^2. \tag{6.147} \]
Proof of Lemma 6.1.21. First of all, observe that (6.146) implies that for all v ∈ Rd with
∥v∥2 ≤ r it holds that
⟨(∇L)(ϑ + v), v⟩ ≥ c∥v∥22 . (6.148)
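The fundamental theorem of calculus hence ensures that for all v ∈ Rd , s, t ∈ [0, 1] with
∥v∥2 ≤ r and s ≤ t it holds that
\[ \begin{aligned} \mathcal{L}(\vartheta+tv) - \mathcal{L}(\vartheta+sv) &= \bigl[\mathcal{L}(\vartheta+hv)\bigr]_{h=s}^{h=t} = \int_s^t \mathcal{L}'(\vartheta+hv)\,v\,dh \\ &= \int_s^t \tfrac{1}{h}\langle(\nabla\mathcal{L})(\vartheta+hv), hv\rangle\,dh \ge \int_s^t \tfrac{c}{h}\|hv\|_2^2\,dh \\ &= c\Bigl[\int_s^t h\,dh\Bigr]\|v\|_2^2 = \tfrac{c}{2}(t^2-s^2)\|v\|_2^2. \end{aligned} \tag{6.149} \]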

The proof of Lemma 6.1.21 is thus complete.


Lemma 6.1.22. Let d ∈ N, c ∈ (0, ∞), r ∈ (0, ∞], ϑ ∈ Rd , B = {w ∈ Rd : ∥w − ϑ∥2 ≤ r},


L ∈ C 1 (Rd , R) satisfy for all v ∈ Rd , s, t ∈ [0, 1] with ∥v∥2 ≤ r and s ≤ t that
L(ϑ + tv) − L(ϑ + sv) ≥ c(t2 − s2 )∥v∥22 (6.150)
(cf. Definition 3.3.4). Then it holds for all θ ∈ B that
⟨θ − ϑ, (∇L)(θ)⟩ ≥ 2c∥θ − ϑ∥22 (6.151)
(cf. Definition 1.4.7).
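Proof of Lemma 6.1.22. Observe that (6.150) ensures that for all s ∈ (0, r] ∩ R, θ ∈ Rd \{ϑ}
with ∥θ − ϑ∥2 < s it holds that
\[ \begin{aligned} \langle\theta-\vartheta,(\nabla\mathcal{L})(\theta)\rangle &= \mathcal{L}'(\theta)(\theta-\vartheta) = \lim_{h\searrow 0}\Bigl(\tfrac{1}{h}\bigl[\mathcal{L}(\theta+h(\theta-\vartheta)) - \mathcal{L}(\theta)\bigr]\Bigr) \\ &= \lim_{h\searrow 0}\Bigl(\tfrac{1}{h}\Bigl[\mathcal{L}\Bigl(\vartheta + \tfrac{(1+h)\|\theta-\vartheta\|_2}{s}\,\tfrac{s}{\|\theta-\vartheta\|_2}(\theta-\vartheta)\Bigr) - \mathcal{L}\Bigl(\vartheta + \tfrac{\|\theta-\vartheta\|_2}{s}\,\tfrac{s}{\|\theta-\vartheta\|_2}(\theta-\vartheta)\Bigr)\Bigr]\Bigr) \\ &\ge \limsup_{h\searrow 0}\Bigl(\tfrac{c}{h}\Bigl(\Bigl[\tfrac{(1+h)\|\theta-\vartheta\|_2}{s}\Bigr]^2 - \Bigl[\tfrac{\|\theta-\vartheta\|_2}{s}\Bigr]^2\Bigr)\Bigl\|\tfrac{s}{\|\theta-\vartheta\|_2}(\theta-\vartheta)\Bigr\|_2^2\Bigr) \\ &= c\Bigl[\limsup_{h\searrow 0}\Bigl(\tfrac{(1+h)^2-1}{h}\Bigr)\Bigr]\Bigl[\tfrac{\|\theta-\vartheta\|_2}{s}\Bigr]^2\Bigl\|\tfrac{s}{\|\theta-\vartheta\|_2}(\theta-\vartheta)\Bigr\|_2^2 \\ &= c\Bigl[\limsup_{h\searrow 0}\Bigl(\tfrac{2h+h^2}{h}\Bigr)\Bigr]\|\theta-\vartheta\|_2^2 \\ &= c\Bigl[\limsup_{h\searrow 0}(2+h)\Bigr]\|\theta-\vartheta\|_2^2 = 2c\|\theta-\vartheta\|_2^2 \end{aligned} \tag{6.152} \]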

(cf. Definition 1.4.7). Hence, we obtain that for all θ ∈ Rd \{ϑ} with ∥θ − ϑ∥2 < r it holds
that
⟨θ − ϑ, (∇L)(θ)⟩ ≥ 2c∥θ − ϑ∥22 . (6.153)
Combining this with the fact that the function
Rd ∋ v 7→ (∇L)(v) ∈ Rd (6.154)
is continuous establishes (6.151). The proof of Lemma 6.1.22 is thus complete.
Lemma 6.1.23. Let d ∈ N, L ∈ (0, ∞), r ∈ (0, ∞], ϑ ∈ Rd , B = {w ∈ Rd : ∥w − ϑ∥2 ≤ r},
L ∈ C 1 (Rd , R) satisfy for all θ ∈ B that
∥(∇L)(θ)∥2 ≤ L∥θ − ϑ∥2 (6.155)
(cf. Definition 3.3.4). Then it holds for all v, w ∈ B that
\[ |\mathcal{L}(v) - \mathcal{L}(w)| \le L\max\bigl\{\|v-\vartheta\|_2, \|w-\vartheta\|_2\bigr\}\|v-w\|_2. \tag{6.156} \]


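Proof of Lemma 6.1.23. Observe that (6.155), the fundamental theorem of calculus, and
the Cauchy-Schwarz inequality assure that for all v, w ∈ B it holds that

\[ \begin{aligned} |\mathcal{L}(v) - \mathcal{L}(w)| &= \Bigl|\bigl[\mathcal{L}(w+h(v-w))\bigr]_{h=0}^{h=1}\Bigr| = \Bigl|\int_0^1 \mathcal{L}'(w+h(v-w))(v-w)\,dh\Bigr| \\ &= \Bigl|\int_0^1 \bigl\langle(\nabla\mathcal{L})(w+h(v-w)),\, v-w\bigr\rangle\,dh\Bigr| \\ &\le \int_0^1 \|(\nabla\mathcal{L})(hv+(1-h)w)\|_2\,\|v-w\|_2\,dh \\ &\le \int_0^1 L\|hv+(1-h)w-\vartheta\|_2\,\|v-w\|_2\,dh \\ &\le \int_0^1 L\bigl(h\|v-\vartheta\|_2 + (1-h)\|w-\vartheta\|_2\bigr)\|v-w\|_2\,dh \\ &= L\|v-w\|_2\Bigl[\int_0^1 \bigl(h\|v-\vartheta\|_2 + h\|w-\vartheta\|_2\bigr)\,dh\Bigr] \\ &= L\bigl(\|v-\vartheta\|_2 + \|w-\vartheta\|_2\bigr)\|v-w\|_2\Bigl[\int_0^1 h\,dh\Bigr] \\ &\le L\max\{\|v-\vartheta\|_2, \|w-\vartheta\|_2\}\,\|v-w\|_2 \end{aligned} \tag{6.157} \]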

(cf. Definition 1.4.7). The proof of Lemma 6.1.23 is thus complete.

Lemma 6.1.24. Let d ∈ N, L ∈ (0, ∞), r ∈ (0, ∞], ϑ ∈ Rd , B = {w ∈ Rd : ∥w − ϑ∥2 ≤ r},


L ∈ C 1 (Rd , R) satisfy for all v, w ∈ B that

\[ |\mathcal{L}(v) - \mathcal{L}(w)| \le L\max\bigl\{\|v-\vartheta\|_2, \|w-\vartheta\|_2\bigr\}\|v-w\|_2 \tag{6.158} \]

(cf. Definition 3.3.4). Then it holds for all θ ∈ B that

∥(∇L)(θ)∥2 ≤ L∥θ − ϑ∥2 . (6.159)

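Proof of Lemma 6.1.24. Note that (6.158) implies that for all θ ∈ Rd with ∥θ − ϑ∥2 < r it
holds that
\[ \begin{aligned} \|(\nabla\mathcal{L})(\theta)\|_2 &= \sup_{w\in\mathbb{R}^d,\ \|w\|_2=1}\bigl[\mathcal{L}'(\theta)(w)\bigr] \\ &= \sup_{w\in\mathbb{R}^d,\ \|w\|_2=1}\Bigl[\lim_{h\searrow 0}\tfrac{1}{h}\bigl(\mathcal{L}(\theta+hw) - \mathcal{L}(\theta)\bigr)\Bigr] \\ &\le \sup_{w\in\mathbb{R}^d,\ \|w\|_2=1}\Bigl[\liminf_{h\searrow 0}\tfrac{L}{h}\max\bigl\{\|\theta+hw-\vartheta\|_2, \|\theta-\vartheta\|_2\bigr\}\|\theta+hw-\theta\|_2\Bigr] \\ &= \sup_{w\in\mathbb{R}^d,\ \|w\|_2=1}\Bigl[\liminf_{h\searrow 0} L\max\bigl\{\|\theta+hw-\vartheta\|_2, \|\theta-\vartheta\|_2\bigr\}\tfrac{1}{h}\|hw\|_2\Bigr] \\ &= \sup_{w\in\mathbb{R}^d,\ \|w\|_2=1}\Bigl[\liminf_{h\searrow 0} L\max\bigl\{\|\theta+hw-\vartheta\|_2, \|\theta-\vartheta\|_2\bigr\}\Bigr] \\ &= \sup_{w\in\mathbb{R}^d,\ \|w\|_2=1}\bigl[L\|\theta-\vartheta\|_2\bigr] = L\|\theta-\vartheta\|_2. \end{aligned} \tag{6.160} \]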
The fact that the function Rd ∋ v 7→ (∇L)(v) ∈ Rd is continuous therefore establishes
(6.159). The proof of Lemma 6.1.24 is thus complete.
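Corollary 6.1.25. Let $d \in \mathbb{N}$, $r \in (0, \infty]$, $\vartheta \in \mathbb{R}^d$, $B = \{w \in \mathbb{R}^d : \|w - \vartheta\|_2 \le r\}$, $\mathcal{L} \in C^1(\mathbb{R}^d, \mathbb{R})$ (cf. Definition 3.3.4). Then the following four statements are equivalent:
(i) There exist $c, L \in (0, \infty)$ such that for all $\theta \in B$ it holds that
\[ \langle \theta - \vartheta, (\nabla \mathcal{L})(\theta) \rangle \ge c \|\theta - \vartheta\|_2^2 \qquad\text{and}\qquad \|(\nabla \mathcal{L})(\theta)\|_2 \le L \|\theta - \vartheta\|_2. \tag{6.161} \]

(ii) There exist $\gamma \in (0, \infty)$, $\varepsilon \in (0, 1)$ such that for all $\theta \in B$ it holds that
\[ \|\theta - \gamma (\nabla \mathcal{L})(\theta) - \vartheta\|_2 \le \varepsilon \|\theta - \vartheta\|_2. \tag{6.162} \]

(iii) There exists $c \in (0, \infty)$ such that for all $\theta \in B$ it holds that
\[ \langle \theta - \vartheta, (\nabla \mathcal{L})(\theta) \rangle \ge c \max\{\|\theta - \vartheta\|_2^2, \|(\nabla \mathcal{L})(\theta)\|_2^2\}. \tag{6.163} \]

(iv) There exist $c, L \in (0, \infty)$ such that for all $v, w \in B$, $s, t \in [0, 1]$ with $s \le t$ it holds that
\[ \mathcal{L}\bigl(\vartheta + t(v - \vartheta)\bigr) - \mathcal{L}\bigl(\vartheta + s(v - \vartheta)\bigr) \ge c (t^2 - s^2) \|v - \vartheta\|_2^2 \tag{6.164} \]
\[ \text{and}\qquad |\mathcal{L}(v) - \mathcal{L}(w)| \le L \max\{\|v - \vartheta\|_2, \|w - \vartheta\|_2\} \|v - w\|_2 \tag{6.165} \]
(cf. Definition 1.4.7).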
Proof of Corollary 6.1.25. Note that items (ii) and (iii) in Lemma 6.1.8 prove that ((i) →
(ii)). Observe that Lemma 6.1.19 demonstrates that ((ii) → (iii)). Note that Lemma 6.1.20
establishes that ((iii) → (i)). Observe that Lemma 6.1.21 and Lemma 6.1.23 show that ((i)
→ (iv)). Note that Lemma 6.1.22 and Lemma 6.1.24 establish that ((iv) → (i)). The proof
of Corollary 6.1.25 is thus complete.


6.2 Explicit midpoint GD optimization


As discussed in Section 6.1 above, the GD optimization method can be viewed as an
Euler discretization of the associated GF ODE in Theorem 5.7.4 in Chapter 5. In the
literature, methods more sophisticated than the Euler method have also been employed to
approximate the GF ODE. In particular, higher order Runge-Kutta methods have been used
to approximate local minimum points of optimization problems (cf., for example, Zhang et
al. [433]). In this section we illustrate this in the case of the explicit midpoint method.

Definition 6.2.1 (Explicit midpoint GD optimization method). Let d ∈ N, (γn )n∈N ⊆


[0, ∞), ξ ∈ Rd and let L : Rd → R and G : Rd → Rd satisfy for all U ∈ {V ⊆ Rd : V is open},
θ ∈ U with L|U ∈ C 1 (U, R) that

G(θ) = (∇L)(θ). (6.166)

Then we say that Θ is the explicit midpoint GD process for the objective function L with
generalized gradient G, learning rates (γn )n∈N , and initial value ξ (we say that Θ is the
explicit midpoint GD process for the objective function L with learning rates (γn )n∈N and
initial value ξ) if and only if it holds that Θ : N0 → Rd is the function from N0 to Rd which
satisfies for all n ∈ N that

\[ \Theta_0 = \xi \qquad\text{and}\qquad \Theta_n = \Theta_{n-1} - \gamma_n G\bigl(\Theta_{n-1} - \tfrac{\gamma_n}{2} G(\Theta_{n-1})\bigr). \tag{6.167} \]
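
The recursion in (6.167) can be sketched in a few lines of plain Python. This is a minimal illustration only: the function name `explicit_midpoint_gd`, the list-based vector representation, and the example objective L(θ) = θ_1^2 + 2θ_2^2 are assumptions made here and do not come from the text.

```python
def explicit_midpoint_gd(grad, xi, learning_rates):
    """Return [Theta_0, ..., Theta_N] following the recursion (6.167)."""
    thetas = [list(xi)]
    for gamma in learning_rates:
        prev = thetas[-1]
        g_prev = grad(prev)
        # Half step: Theta_{n-1} - (gamma_n / 2) G(Theta_{n-1})
        midpoint = [p - 0.5 * gamma * g for p, g in zip(prev, g_prev)]
        g_mid = grad(midpoint)
        # Full step from Theta_{n-1}, using the gradient at the midpoint
        thetas.append([p - gamma * g for p, g in zip(prev, g_mid)])
    return thetas


# Hypothetical objective L(theta) = theta_1^2 + 2 * theta_2^2 with gradient (2 theta_1, 4 theta_2)
def example_grad(theta):
    return [2.0 * theta[0], 4.0 * theta[1]]


trajectory = explicit_midpoint_gd(example_grad, [1.0, 1.0], [0.1] * 50)
```

Each step evaluates the generalized gradient twice, once at the current iterate and once at the half-step point, which is the price paid for the higher order of the discretization.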

6.2.1 Explicit midpoint discretizations for GF ODEs


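Lemma 6.2.2 (Local error of the explicit midpoint method). Let $d \in \mathbb{N}$, $T, \gamma, c \in [0, \infty)$, $G \in C^2(\mathbb{R}^d, \mathbb{R}^d)$, $\Theta \in C([0, \infty), \mathbb{R}^d)$, $\theta \in \mathbb{R}^d$ satisfy for all $x, y, z \in \mathbb{R}^d$, $t \in [0, \infty)$ that
\[ \Theta_t = \Theta_0 + \int_0^t G(\Theta_s) \, ds, \qquad \theta = \Theta_T + \gamma G\bigl(\Theta_T + \tfrac{\gamma}{2} G(\Theta_T)\bigr), \tag{6.168} \]
\[ \|G(x)\|_2 \le c, \qquad \|G'(x) y\|_2 \le c \|y\|_2, \qquad\text{and}\qquad \|G''(x)(y, z)\|_2 \le c \|y\|_2 \|z\|_2 \tag{6.169} \]
(cf. Definition 3.3.4). Then
\[ \|\Theta_{T+\gamma} - \theta\|_2 \le c^3 \gamma^3. \tag{6.170} \]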

Proof of Lemma 6.2.2. Note that the fundamental theorem of calculus, the assumption that
G ∈ C 2 (Rd , Rd ), and (6.168) assure that for all t ∈ [0, ∞) it holds that Θ ∈ C 1 ([0, ∞), Rd )
and

Θ̇t = G(Θt ). (6.171)


Combining this with the assumption that G ∈ C 2 (Rd , Rd ) and the chain rule ensures that
for all t ∈ [0, ∞) it holds that Θ ∈ C 2 ([0, ∞), Rd ) and

Θ̈t = G′ (Θt )Θ̇t = G′ (Θt )G(Θt ). (6.172)

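Theorem 6.1.3 and (6.171) hence ensure that
\[
\begin{aligned}
\Theta_{T+\gamma/2}
&= \Theta_T + \bigl[\tfrac{\gamma}{2}\bigr] \dot{\Theta}_T + \int_0^1 (1 - r) \bigl[\tfrac{\gamma}{2}\bigr]^2 \ddot{\Theta}_{T + r\gamma/2} \, dr \\
&= \Theta_T + \bigl[\tfrac{\gamma}{2}\bigr] G(\Theta_T) + \tfrac{\gamma^2}{4} \int_0^1 (1 - r) G'(\Theta_{T + r\gamma/2}) G(\Theta_{T + r\gamma/2}) \, dr.
\end{aligned}
\tag{6.173}
\]
Therefore, we obtain that
\[ \Theta_{T+\gamma/2} - \Theta_T - \bigl[\tfrac{\gamma}{2}\bigr] G(\Theta_T) = \tfrac{\gamma^2}{4} \int_0^1 (1 - r) G'(\Theta_{T + r\gamma/2}) G(\Theta_{T + r\gamma/2}) \, dr. \tag{6.174} \]

Combining this, the fact that for all $x, y \in \mathbb{R}^d$ it holds that $\|G(x) - G(y)\|_2 \le c \|x - y\|_2$, and (6.169) ensures that
\[
\begin{aligned}
\bigl\| G(\Theta_{T+\gamma/2}) - G\bigl(\Theta_T + \tfrac{\gamma}{2} G(\Theta_T)\bigr) \bigr\|_2
&\le c \bigl\| \Theta_{T+\gamma/2} - \Theta_T - \tfrac{\gamma}{2} G(\Theta_T) \bigr\|_2 \\
&\le \tfrac{c \gamma^2}{4} \int_0^1 (1 - r) \bigl\| G'(\Theta_{T + r\gamma/2}) G(\Theta_{T + r\gamma/2}) \bigr\|_2 \, dr \\
&\le \tfrac{c^3 \gamma^2}{4} \int_0^1 r \, dr = \tfrac{c^3 \gamma^2}{8}.
\end{aligned}
\tag{6.175}
\]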

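Furthermore, observe that (6.171), (6.172), the hypothesis that $G \in C^2(\mathbb{R}^d, \mathbb{R}^d)$, the product rule, and the chain rule assure that for all $t \in [0, \infty)$ it holds that $\Theta \in C^3([0, \infty), \mathbb{R}^d)$ and
\[
\begin{aligned}
\dddot{\Theta}_t
&= G''(\Theta_t)(\dot{\Theta}_t, G(\Theta_t)) + G'(\Theta_t) G'(\Theta_t) \dot{\Theta}_t \\
&= G''(\Theta_t)(G(\Theta_t), G(\Theta_t)) + G'(\Theta_t) G'(\Theta_t) G(\Theta_t).
\end{aligned}
\tag{6.176}
\]

Theorem 6.1.3, (6.171), and (6.172) hence imply that for all $s, t \in [0, \infty)$ it holds that
\[
\begin{aligned}
\Theta_s
&= \Theta_t + (s - t) \dot{\Theta}_t + \Bigl[\tfrac{(s - t)^2}{2}\Bigr] \ddot{\Theta}_t + \int_0^1 \Bigl[\tfrac{(1 - r)^2 (s - t)^3}{2}\Bigr] \dddot{\Theta}_{t + r(s - t)} \, dr \\
&= \Theta_t + (s - t) G(\Theta_t) + \Bigl[\tfrac{(s - t)^2}{2}\Bigr] G'(\Theta_t) G(\Theta_t) \\
&\quad + \tfrac{(s - t)^3}{2} \int_0^1 (1 - r)^2 \Bigl( G''(\Theta_{t + r(s - t)})\bigl(G(\Theta_{t + r(s - t)}), G(\Theta_{t + r(s - t)})\bigr) \\
&\qquad\qquad + G'(\Theta_{t + r(s - t)}) G'(\Theta_{t + r(s - t)}) G(\Theta_{t + r(s - t)}) \Bigr) \, dr.
\end{aligned}
\tag{6.177}
\]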



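This assures that
\[
\begin{aligned}
&\Theta_{T+\gamma} - \Theta_T \\
&= \Bigl[ \Theta_{T+\gamma/2} + \bigl[\tfrac{\gamma}{2}\bigr] G(\Theta_{T+\gamma/2}) + \bigl[\tfrac{\gamma^2}{8}\bigr] G'(\Theta_{T+\gamma/2}) G(\Theta_{T+\gamma/2}) \\
&\qquad + \tfrac{\gamma^3}{16} \int_0^1 (1 - r)^2 \Bigl( G''(\Theta_{T+(1+r)\gamma/2})\bigl(G(\Theta_{T+(1+r)\gamma/2}), G(\Theta_{T+(1+r)\gamma/2})\bigr) \\
&\qquad\qquad + G'(\Theta_{T+(1+r)\gamma/2}) G'(\Theta_{T+(1+r)\gamma/2}) G(\Theta_{T+(1+r)\gamma/2}) \Bigr) \, dr \Bigr] \\
&\quad - \Bigl[ \Theta_{T+\gamma/2} - \bigl[\tfrac{\gamma}{2}\bigr] G(\Theta_{T+\gamma/2}) + \bigl[\tfrac{\gamma^2}{8}\bigr] G'(\Theta_{T+\gamma/2}) G(\Theta_{T+\gamma/2}) \\
&\qquad - \tfrac{\gamma^3}{16} \int_0^1 (1 - r)^2 \Bigl( G''(\Theta_{T+(1-r)\gamma/2})\bigl(G(\Theta_{T+(1-r)\gamma/2}), G(\Theta_{T+(1-r)\gamma/2})\bigr) \\
&\qquad\qquad + G'(\Theta_{T+(1-r)\gamma/2}) G'(\Theta_{T+(1-r)\gamma/2}) G(\Theta_{T+(1-r)\gamma/2}) \Bigr) \, dr \Bigr] \\
&= \gamma G(\Theta_{T+\gamma/2}) + \tfrac{\gamma^3}{16} \int_0^1 (1 - r)^2 \Bigl( G''(\Theta_{T+(1+r)\gamma/2})\bigl(G(\Theta_{T+(1+r)\gamma/2}), G(\Theta_{T+(1+r)\gamma/2})\bigr) \\
&\qquad\qquad + G'(\Theta_{T+(1+r)\gamma/2}) G'(\Theta_{T+(1+r)\gamma/2}) G(\Theta_{T+(1+r)\gamma/2}) \\
&\qquad\qquad + G''(\Theta_{T+(1-r)\gamma/2})\bigl(G(\Theta_{T+(1-r)\gamma/2}), G(\Theta_{T+(1-r)\gamma/2})\bigr) \\
&\qquad\qquad + G'(\Theta_{T+(1-r)\gamma/2}) G'(\Theta_{T+(1-r)\gamma/2}) G(\Theta_{T+(1-r)\gamma/2}) \Bigr) \, dr.
\end{aligned}
\tag{6.178}
\]

This, (6.169), and (6.175) assure that
\[
\begin{aligned}
\|\Theta_{T+\gamma} - \theta\|_2
&= \bigl\| \Theta_{T+\gamma} - \Theta_T - \gamma G\bigl(\Theta_T + \tfrac{\gamma}{2} G(\Theta_T)\bigr) \bigr\|_2 \\
&\le \bigl\| \Theta_{T+\gamma} - [\Theta_T + \gamma G(\Theta_{T+\gamma/2})] \bigr\|_2 + \gamma \bigl\| G(\Theta_{T+\gamma/2}) - G\bigl(\Theta_T + \tfrac{\gamma}{2} G(\Theta_T)\bigr) \bigr\|_2 \\
&\le \gamma \bigl\| G(\Theta_{T+\gamma/2}) - G\bigl(\Theta_T + \tfrac{\gamma}{2} G(\Theta_T)\bigr) \bigr\|_2 \\
&\quad + \tfrac{\gamma^3}{16} \int_0^1 (1 - r)^2 \Bigl( \bigl\| G''(\Theta_{T+(1+r)\gamma/2})\bigl(G(\Theta_{T+(1+r)\gamma/2}), G(\Theta_{T+(1+r)\gamma/2})\bigr) \bigr\|_2 \\
&\qquad\qquad + \bigl\| G'(\Theta_{T+(1+r)\gamma/2}) G'(\Theta_{T+(1+r)\gamma/2}) G(\Theta_{T+(1+r)\gamma/2}) \bigr\|_2 \\
&\qquad\qquad + \bigl\| G''(\Theta_{T+(1-r)\gamma/2})\bigl(G(\Theta_{T+(1-r)\gamma/2}), G(\Theta_{T+(1-r)\gamma/2})\bigr) \bigr\|_2 \\
&\qquad\qquad + \bigl\| G'(\Theta_{T+(1-r)\gamma/2}) G'(\Theta_{T+(1-r)\gamma/2}) G(\Theta_{T+(1-r)\gamma/2}) \bigr\|_2 \Bigr) \, dr \\
&\le \tfrac{c^3 \gamma^3}{8} + \tfrac{c^3 \gamma^3}{4} \int_0^1 r^2 \, dr = \tfrac{5 c^3 \gamma^3}{24} \le c^3 \gamma^3.
\end{aligned}
\tag{6.179}
\]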

The proof of Lemma 6.2.2 is thus complete.


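Corollary 6.2.3 (Local error of the explicit midpoint method for GF ODEs). Let $d \in \mathbb{N}$, $T, \gamma, c \in [0, \infty)$, $L \in C^3(\mathbb{R}^d, \mathbb{R})$, $\Theta \in C([0, \infty), \mathbb{R}^d)$, $\theta \in \mathbb{R}^d$ satisfy for all $x, y, z \in \mathbb{R}^d$, $t \in [0, \infty)$ that
\[ \Theta_t = \Theta_0 - \int_0^t (\nabla L)(\Theta_s) \, ds, \qquad \theta = \Theta_T - \gamma (\nabla L)\bigl(\Theta_T - \tfrac{\gamma}{2} (\nabla L)(\Theta_T)\bigr), \tag{6.180} \]
\[ \|(\nabla L)(x)\|_2 \le c, \qquad \|(\operatorname{Hess} L)(x) y\|_2 \le c \|y\|_2, \qquad\text{and}\qquad \|(\nabla L)''(x)(y, z)\|_2 \le c \|y\|_2 \|z\|_2 \tag{6.181} \]
(cf. Definition 3.3.4). Then
\[ \|\Theta_{T+\gamma} - \theta\|_2 \le c^3 \gamma^3. \tag{6.182} \]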

Proof of Corollary 6.2.3. Throughout this proof, let G : Rd → Rd satisfy for all θ ∈ Rd that

G(θ) = −(∇L)(θ). (6.183)

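Note that the fact that $G \in C^2(\mathbb{R}^d, \mathbb{R}^d)$ (which follows from the assumption that $L \in C^3(\mathbb{R}^d, \mathbb{R})$), the fact that for all $t \in [0, \infty)$ it holds that
\[ \Theta_t = \Theta_0 + \int_0^t G(\Theta_s) \, ds, \tag{6.184} \]
the fact that
\[ \theta = \Theta_T + \gamma G\bigl(\Theta_T + \tfrac{\gamma}{2} G(\Theta_T)\bigr), \tag{6.185} \]
the fact that for all $x \in \mathbb{R}^d$ it holds that $\|G(x)\|_2 \le c$, the fact that for all $x, y \in \mathbb{R}^d$ it holds that $\|G'(x) y\|_2 \le c \|y\|_2$, the fact that for all $x, y, z \in \mathbb{R}^d$ it holds that
\[ \|G''(x)(y, z)\|_2 \le c \|y\|_2 \|z\|_2, \tag{6.186} \]
and Lemma 6.2.2 show that
\[ \|\Theta_{T+\gamma} - \theta\|_2 \le c^3 \gamma^3. \tag{6.187} \]
The proof of Corollary 6.2.3 is thus complete.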
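
The cubic scaling of the local error in the step size can be observed numerically. The following sketch uses the hypothetical scalar objective L(θ) = λθ²/2, for which the GF solution is known in closed form; note that its gradient is unbounded, so the boundedness hypotheses of the corollary are not literally satisfied and the experiment only illustrates the order γ³, not the constant. All names in the code are assumptions made for this illustration.

```python
import math


def midpoint_step(grad, theta, gamma):
    """One explicit midpoint step: theta - gamma * grad(theta - (gamma / 2) * grad(theta))."""
    return theta - gamma * grad(theta - 0.5 * gamma * grad(theta))


lam = 1.0
theta_T = 1.0


def grad(theta):
    # Gradient of the hypothetical objective L(theta) = lam * theta^2 / 2
    return lam * theta


def local_error(gamma):
    # Exact gradient flow value at time T + gamma when started in theta_T at time T
    exact = math.exp(-lam * gamma) * theta_T
    return abs(exact - midpoint_step(grad, theta_T, gamma))


errors = [local_error(g) for g in (0.1, 0.05, 0.025)]
# Halving the step size should shrink the local error by a factor close to 2^3 = 8
ratios = [errors[i] / errors[i + 1] for i in range(2)]
```

For comparison, a single explicit Euler step has a local error of order γ², so the corresponding ratios would be close to 4.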

6.3 GD optimization with classical momentum


In Section 6.1 above we have introduced and analyzed the classical plain-vanilla GD
optimization method. In the literature there are a number of somewhat more sophisticated
GD-type optimization methods which aim to improve the convergence speed of the classical
plain-vanilla GD optimization method (see, for example, Ruder [354] and Sections 6.4, 6.5,
6.6, 6.7, and 6.8 below). In this section we introduce one such more sophisticated GD-type
optimization method, that is, we introduce the so-called momentum GD optimization


method (see Definition 6.3.1 below). The idea to improve GD optimization methods with a
momentum term was first introduced in Polyak [337]. To illustrate the advantage of the
momentum GD optimization method over the plain-vanilla GD optimization method we
now review a result proving that the momentum GD optimization method does indeed
outperform the classical plain-vanilla GD optimization method in the case of a simple class
of optimization problems (see Section 6.3.3 below).
In the scientific literature there are several very similar, but not exactly equivalent
optimization techniques which are referred to as optimization with momentum. Our
definition of the momentum GD optimization method in Definition 6.3.1 below is based on
[247, 306] and (7) in [111]. A different version where, roughly speaking, the factor (1 − αn )
in (6.189) in Definition 6.3.1 is replaced by 1 can, for instance, be found in [112, Algorithm
2]. A further alternative definition where, roughly speaking, the momentum terms are
accumulated over the increments of the optimization process instead of over the gradients
of the objective function (cf. (6.190) in Definition 6.3.1 below) can, for example, be found
in (9) in [337], (2) in [339], and (4) in [354].
Definition 6.3.1 (Momentum GD optimization method). Let d ∈ N, (γn )n∈N ⊆ [0, ∞),
(αn )n∈N ⊆ [0, 1], ξ ∈ Rd and let L : Rd → R and G : Rd → Rd satisfy for all U ∈ {V ⊆
Rd : V is open}, θ ∈ U with L|U ∈ C 1 (U, R) that

G(θ) = (∇L)(θ). (6.188)

Then we say that Θ is the momentum GD process for the objective function L with
generalized gradient G, learning rates (γn )n∈N , momentum decay factors (αn )n∈N , and initial
value ξ (we say that Θ is the momentum GD process for the objective function L with
learning rates (γn )n∈N , momentum decay factors (αn )n∈N , and initial value ξ) if and only
if it holds that Θ : N0 → Rd is the function from N0 to Rd which satisfies that there exists
m : N0 → Rd such that for all n ∈ N it holds that

Θ0 = ξ, m0 = 0, (6.189)

mn = αn mn−1 + (1 − αn )G(Θn−1 ), (6.190)


and Θn = Θn−1 − γn mn . (6.191)
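
The one-step recursion (6.189) to (6.191) translates directly into code. The sketch below is a minimal plain-Python illustration; the function name `momentum_gd`, the list-based vectors, and the example objective L(θ) = θ² are assumptions made here and are not part of the definition.

```python
def momentum_gd(grad, xi, learning_rates, decay_factors):
    """Return [Theta_0, ..., Theta_N] for the momentum GD recursion (6.189)-(6.191)."""
    thetas = [list(xi)]
    m = [0.0] * len(xi)  # m_0 = 0
    for gamma, alpha in zip(learning_rates, decay_factors):
        g = grad(thetas[-1])
        # (6.190): m_n = alpha_n m_{n-1} + (1 - alpha_n) G(Theta_{n-1})
        m = [alpha * mi + (1.0 - alpha) * gi for mi, gi in zip(m, g)]
        # (6.191): Theta_n = Theta_{n-1} - gamma_n m_n
        thetas.append([t - gamma * mi for t, mi in zip(thetas[-1], m)])
    return thetas


# Hypothetical one-dimensional objective L(theta) = theta^2 with gradient 2 * theta
process = momentum_gd(lambda th: [2.0 * th[0]], [1.0], [0.5] * 3, [0.5] * 3)
```

With these illustrative parameters the iterates are Θ_1 = 0.5, Θ_2 = 0 and Θ_3 = −0.25: the process overshoots the minimizer at the third step because the momentum term still carries earlier gradients.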
Exercise 6.3.1. Let L : R → R satisfy for all θ ∈ R that L(θ) = 2θ2 and let Θ be the
momentum GD process for the objective function L with learning rates N ∋ n ↦
1/2n ∈ [0, ∞), momentum decay factors N ∋ n 7→ 1/2 ∈ [0, 1], and initial value 1 (cf.

Definition 6.3.1). Specify Θ1 , Θ2 , and Θ3 explicitly and prove that your results are correct!
Exercise 6.3.2. Let ξ = (ξ1 , ξ2 ) ∈ R2 satisfy (ξ1 , ξ2 ) = (2, 3), let L : R2 → R satisfy for all
θ = (θ1 , θ2 ) ∈ R2 that

\[ L(\theta) = (\theta_1 - 3)^2 + \tfrac{1}{2} (\theta_2 - 2)^2 + \theta_1 + \theta_2, \]


and let Θ be the momentum GD process for the objective function L with learning rates
N ∋ n 7→ 2/n ∈ [0, ∞), momentum decay factors N ∋ n 7→ 1/2 ∈ [0, 1], and initial value ξ (cf.
Definition 6.3.1). Specify Θ1 and Θ2 explicitly and prove that your results are correct!

6.3.1 Representations for GD optimization with momentum


In (6.189), (6.190), and (6.191) above the momentum GD optimization method is formulated
by means of a one-step recursion. This one-step recursion can efficiently be exploited in
an implementation. In Corollary 6.3.4 below we provide a suitable full-history recursive
representation for the momentum GD optimization method, which enables us to develop a
better intuition for the momentum GD optimization method. Our proof of Corollary 6.3.4
employs the explicit representation of momentum terms in Lemma 6.3.3 below. Our proof
of Lemma 6.3.3, in turn, uses an application of the following result.

Lemma 6.3.2. Let (αn )n∈N ⊆ R and let (mn )n∈N0 ⊆ R satisfy for all n ∈ N that m0 = 0
and

mn = αn mn−1 + 1 − αn . (6.192)

Then it holds for all $n \in \mathbb{N}_0$ that
\[ m_n = 1 - \prod_{k=1}^{n} \alpha_k. \tag{6.193} \]

Proof of Lemma 6.3.2. We prove (6.193) by induction on $n \in \mathbb{N}_0$. For the base case $n = 0$ observe that the assumption that $m_0 = 0$ establishes that
\[ m_0 = 0 = 1 - \prod_{k=1}^{0} \alpha_k. \tag{6.194} \]
This establishes (6.193) in the base case $n = 0$. For the induction step note that (6.192) assures that for all $n \in \mathbb{N}_0$ with $m_n = 1 - \prod_{k=1}^{n} \alpha_k$ it holds that
\[
\begin{aligned}
m_{n+1} &= \alpha_{n+1} m_n + 1 - \alpha_{n+1} = \alpha_{n+1} \Bigl[ 1 - \prod_{k=1}^{n} \alpha_k \Bigr] + 1 - \alpha_{n+1} \\
&= \alpha_{n+1} - \prod_{k=1}^{n+1} \alpha_k + 1 - \alpha_{n+1} = 1 - \prod_{k=1}^{n+1} \alpha_k.
\end{aligned}
\tag{6.195}
\]
Induction hence establishes (6.193). The proof of Lemma 6.3.2 is thus complete.
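
The identity (6.193) is easy to check numerically. The following sketch iterates the recursion (6.192) for randomly drawn factors and compares the result with the closed form; the random factors, the seed, and the variable names are assumptions made for this illustration.

```python
import random

random.seed(0)
# Hypothetical factors alpha_1, ..., alpha_10; the lemma allows arbitrary real numbers
alphas = [random.uniform(-1.0, 1.0) for _ in range(10)]

m = 0.0       # m_0 = 0
prod = 1.0    # running product alpha_1 * ... * alpha_n (empty product for n = 0)
max_gap = 0.0
for alpha in alphas:
    m = alpha * m + 1.0 - alpha   # recursion (6.192)
    prod *= alpha
    max_gap = max(max_gap, abs(m - (1.0 - prod)))   # compare with (6.193)
```

Up to floating-point rounding, `max_gap` is zero.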


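Lemma 6.3.3 (An explicit representation of momentum terms). Let $d \in \mathbb{N}$, $(\alpha_n)_{n \in \mathbb{N}} \subseteq \mathbb{R}$, $(a_{n,k})_{(n,k) \in (\mathbb{N}_0)^2} \subseteq \mathbb{R}$, $(G_n)_{n \in \mathbb{N}_0} \subseteq \mathbb{R}^d$, $(m_n)_{n \in \mathbb{N}_0} \subseteq \mathbb{R}^d$ satisfy for all $n \in \mathbb{N}$, $k \in \{0, 1, \dots, n - 1\}$ that
\[ m_0 = 0, \qquad m_n = \alpha_n m_{n-1} + (1 - \alpha_n) G_{n-1}, \qquad\text{and}\qquad a_{n,k} = (1 - \alpha_{k+1}) \Bigl[ \prod_{l=k+2}^{n} \alpha_l \Bigr]. \tag{6.196} \]
Then
(i) it holds for all $n \in \mathbb{N}_0$ that
\[ m_n = \sum_{k=0}^{n-1} a_{n,k} G_k \tag{6.197} \]
and
(ii) it holds for all $n \in \mathbb{N}_0$ that
\[ \sum_{k=0}^{n-1} a_{n,k} = 1 - \prod_{k=1}^{n} \alpha_k. \tag{6.198} \]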

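Proof of Lemma 6.3.3. Throughout this proof, let $(s_n)_{n \in \mathbb{N}_0} \subseteq \mathbb{R}$ be the real numbers which satisfy for all $n \in \mathbb{N}_0$ that
\[ s_n = \sum_{k=0}^{n-1} a_{n,k}. \tag{6.199} \]
We now prove item (i) by induction on $n \in \mathbb{N}_0$. For the base case $n = 0$ note that (6.196) ensures that
\[ m_0 = 0 = \sum_{k=0}^{-1} a_{0,k} G_k. \tag{6.200} \]
This establishes item (i) in the base case $n = 0$. For the induction step note that (6.196) assures that for all $n \in \mathbb{N}_0$ with $m_n = \sum_{k=0}^{n-1} a_{n,k} G_k$ it holds that
\[
\begin{aligned}
m_{n+1} &= \alpha_{n+1} m_n + (1 - \alpha_{n+1}) G_n \\
&= \Bigl[ \sum_{k=0}^{n-1} \alpha_{n+1} a_{n,k} G_k \Bigr] + (1 - \alpha_{n+1}) G_n \\
&= \Bigl[ \sum_{k=0}^{n-1} \alpha_{n+1} (1 - \alpha_{k+1}) \Bigl[ \prod_{l=k+2}^{n} \alpha_l \Bigr] G_k \Bigr] + (1 - \alpha_{n+1}) G_n \\
&= \Bigl[ \sum_{k=0}^{n-1} (1 - \alpha_{k+1}) \Bigl[ \prod_{l=k+2}^{n+1} \alpha_l \Bigr] G_k \Bigr] + (1 - \alpha_{n+1}) G_n \\
&= \sum_{k=0}^{n} (1 - \alpha_{k+1}) \Bigl[ \prod_{l=k+2}^{n+1} \alpha_l \Bigr] G_k = \sum_{k=0}^{n} a_{n+1,k} G_k.
\end{aligned}
\tag{6.201}
\]
Induction thus proves item (i). Furthermore, observe that (6.196) and (6.199) demonstrate that for all $n \in \mathbb{N}$ it holds that $s_0 = 0$ and
\[
\begin{aligned}
s_n &= \sum_{k=0}^{n-1} a_{n,k} = \sum_{k=0}^{n-1} (1 - \alpha_{k+1}) \Bigl[ \prod_{l=k+2}^{n} \alpha_l \Bigr] = 1 - \alpha_n + \sum_{k=0}^{n-2} (1 - \alpha_{k+1}) \Bigl[ \prod_{l=k+2}^{n} \alpha_l \Bigr] \\
&= 1 - \alpha_n + \sum_{k=0}^{n-2} (1 - \alpha_{k+1}) \alpha_n \Bigl[ \prod_{l=k+2}^{n-1} \alpha_l \Bigr] = 1 - \alpha_n + \alpha_n \sum_{k=0}^{n-2} a_{n-1,k} = 1 - \alpha_n + \alpha_n s_{n-1}.
\end{aligned}
\tag{6.202}
\]
Combining this with Lemma 6.3.2 (applied to the real-valued sequence $(s_n)_{n \in \mathbb{N}_0}$) implies that for all $n \in \mathbb{N}_0$ it holds that
\[ s_n = 1 - \prod_{k=1}^{n} \alpha_k. \tag{6.203} \]
This establishes item (ii). The proof of Lemma 6.3.3 is thus complete.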
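Corollary 6.3.4 (On a representation of the momentum GD optimization method). Let $d \in \mathbb{N}$, $(\gamma_n)_{n \in \mathbb{N}} \subseteq [0, \infty)$, $(\alpha_n)_{n \in \mathbb{N}} \subseteq [0, 1]$, $(a_{n,k})_{(n,k) \in (\mathbb{N}_0)^2} \subseteq \mathbb{R}$, $\xi \in \mathbb{R}^d$ satisfy for all $n \in \mathbb{N}$, $k \in \{0, 1, \dots, n - 1\}$ that
\[ a_{n,k} = (1 - \alpha_{k+1}) \Bigl[ \prod_{l=k+2}^{n} \alpha_l \Bigr], \tag{6.204} \]
let $L \colon \mathbb{R}^d \to \mathbb{R}$ and $G \colon \mathbb{R}^d \to \mathbb{R}^d$ satisfy for all $U \in \{V \subseteq \mathbb{R}^d : V \text{ is open}\}$, $\theta \in U$ with $L|_U \in C^1(U, \mathbb{R})$ that
\[ G(\theta) = (\nabla L)(\theta), \tag{6.205} \]
and let $\Theta$ be the momentum GD process for the objective function $L$ with generalized gradient $G$, learning rates $(\gamma_n)_{n \in \mathbb{N}}$, momentum decay factors $(\alpha_n)_{n \in \mathbb{N}}$, and initial value $\xi$ (cf. Definition 6.3.1). Then
(i) it holds for all $n \in \mathbb{N}$, $k \in \{0, 1, \dots, n - 1\}$ that $0 \le a_{n,k} \le 1$,
(ii) it holds for all $n \in \mathbb{N}_0$ that
\[ \sum_{k=0}^{n-1} a_{n,k} = 1 - \prod_{k=1}^{n} \alpha_k, \tag{6.206} \]
and
(iii) it holds for all $n \in \mathbb{N}$ that
\[ \Theta_n = \Theta_{n-1} - \gamma_n \Bigl[ \sum_{k=0}^{n-1} a_{n,k} G(\Theta_k) \Bigr]. \tag{6.207} \]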


Proof of Corollary 6.3.4. Throughout this proof, let $m \colon \mathbb{N}_0 \to \mathbb{R}^d$ satisfy for all $n \in \mathbb{N}$ that
\[ m_0 = 0 \qquad\text{and}\qquad m_n = \alpha_n m_{n-1} + (1 - \alpha_n) G(\Theta_{n-1}). \tag{6.208} \]
Note that (6.204) implies item (i). Observe that (6.204), (6.208), and Lemma 6.3.3 assure that for all $n \in \mathbb{N}_0$ it holds that
\[ m_n = \sum_{k=0}^{n-1} a_{n,k} G(\Theta_k) \qquad\text{and}\qquad \sum_{k=0}^{n-1} a_{n,k} = 1 - \prod_{k=1}^{n} \alpha_k. \tag{6.209} \]
This proves item (ii). Note that (6.189), (6.190), (6.191), (6.208), and (6.209) demonstrate that for all $n \in \mathbb{N}$ it holds that
\[ \Theta_n = \Theta_{n-1} - \gamma_n m_n = \Theta_{n-1} - \gamma_n \Bigl[ \sum_{k=0}^{n-1} a_{n,k} G(\Theta_k) \Bigr]. \tag{6.210} \]
This establishes item (iii). The proof of Corollary 6.3.4 is thus complete.
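
The full-history representation can be compared with the one-step recursion in a small experiment. In the sketch below the helper `momentum_weights`, the decay factors, and the scalar numbers standing in for the gradient values G(Θ_0), ..., G(Θ_4) are all hypothetical choices made for illustration.

```python
def momentum_weights(alphas, n):
    """Return [a_{n,0}, ..., a_{n,n-1}] from (6.204); alphas[j - 1] stores alpha_j."""
    weights = []
    for k in range(n):
        tail = 1.0
        for l in range(k + 2, n + 1):   # product over l = k+2, ..., n (empty if k = n-1)
            tail *= alphas[l - 1]
        weights.append((1.0 - alphas[k]) * tail)
    return weights


alphas = [0.9, 0.8, 0.7, 0.6, 0.5]    # hypothetical alpha_1, ..., alpha_5
grads = [1.0, -2.0, 0.5, 3.0, -1.0]   # hypothetical stand-ins for G(Theta_0), ..., G(Theta_4)

# One-step recursion: recursive[n - 1] holds m_n
m = 0.0
recursive = []
for alpha, g in zip(alphas, grads):
    m = alpha * m + (1.0 - alpha) * g
    recursive.append(m)

# Full-history form: m_n = sum_k a_{n,k} G(Theta_k)
explicit = [
    sum(a * g for a, g in zip(momentum_weights(alphas, n), grads))
    for n in range(1, 6)
]
weight_sum = sum(momentum_weights(alphas, 5))   # should equal 1 - alpha_1 * ... * alpha_5
```

The weights a_{n,k} decay geometrically in n − k, so the momentum term is an exponentially weighted combination of past gradients whose total weight 1 − α_1 ⋯ α_n is strictly below one; this is the bias that the adjustment in Section 6.3.2 removes.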

6.3.2 Bias-adjusted GD optimization with momentum


Definition 6.3.5 (Bias-adjusted momentum GD optimization method). Let d ∈ N,
(γn )n∈N ⊆ [0, ∞), (αn )n∈N ⊆ [0, 1], ξ ∈ Rd satisfy α1 < 1 and let L : Rd → R and
G : Rd → Rd satisfy for all U ∈ {V ⊆ Rd : V is open}, θ ∈ U with L|U ∈ C 1 (U, R) that
G(θ) = (∇L)(θ). (6.211)
Then we say that Θ is the bias-adjusted momentum GD process for the objective function L
with generalized gradient G, learning rates (γn )n∈N , momentum decay factors (αn )n∈N , and
initial value ξ (we say that Θ is the bias-adjusted momentum GD process for the objective
function L with learning rates (γn )n∈N , momentum decay factors (αn )n∈N , and initial value
ξ) if and only if it holds that Θ : N0 → Rd is the function from N0 to Rd which satisfies that
there exists m : N0 → Rd such that for all n ∈ N it holds that
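\[ \Theta_0 = \xi, \qquad m_0 = 0, \tag{6.212} \]
\[ m_n = \alpha_n m_{n-1} + (1 - \alpha_n) G(\Theta_{n-1}), \tag{6.213} \]
\[ \text{and}\qquad \Theta_n = \Theta_{n-1} - \frac{\gamma_n m_n}{1 - \prod_{l=1}^{n} \alpha_l}. \tag{6.214} \]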

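Corollary 6.3.6 (On a representation of the bias-adjusted momentum GD optimization method). Let $d \in \mathbb{N}$, $(\gamma_n)_{n \in \mathbb{N}} \subseteq [0, \infty)$, $(\alpha_n)_{n \in \mathbb{N}} \subseteq [0, 1]$, $\xi \in \mathbb{R}^d$, $(a_{n,k})_{(n,k) \in (\mathbb{N}_0)^2} \subseteq \mathbb{R}$ satisfy for all $n \in \mathbb{N}$, $k \in \{0, 1, \dots, n - 1\}$ that $\alpha_1 < 1$ and
\[ a_{n,k} = \frac{(1 - \alpha_{k+1}) \bigl[ \prod_{l=k+2}^{n} \alpha_l \bigr]}{1 - \prod_{l=1}^{n} \alpha_l}, \tag{6.215} \]
let $L \colon \mathbb{R}^d \to \mathbb{R}$ and $G \colon \mathbb{R}^d \to \mathbb{R}^d$ satisfy for all $U \in \{V \subseteq \mathbb{R}^d : V \text{ is open}\}$, $\theta \in U$ with $L|_U \in C^1(U, \mathbb{R})$ that
\[ G(\theta) = (\nabla L)(\theta), \tag{6.216} \]
and let $\Theta$ be the bias-adjusted momentum GD process for the objective function $L$ with generalized gradient $G$, learning rates $(\gamma_n)_{n \in \mathbb{N}}$, momentum decay factors $(\alpha_n)_{n \in \mathbb{N}}$, and initial value $\xi$ (cf. Definition 6.3.5). Then
(i) it holds for all $n \in \mathbb{N}$, $k \in \{0, 1, \dots, n - 1\}$ that $0 \le a_{n,k} \le 1$,
(ii) it holds for all $n \in \mathbb{N}$ that
\[ \sum_{k=0}^{n-1} a_{n,k} = 1, \tag{6.217} \]
and
(iii) it holds for all $n \in \mathbb{N}$ that
\[ \Theta_n = \Theta_{n-1} - \gamma_n \Bigl[ \sum_{k=0}^{n-1} a_{n,k} G(\Theta_k) \Bigr]. \tag{6.218} \]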

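Proof of Corollary 6.3.6. Throughout this proof, let $m \colon \mathbb{N}_0 \to \mathbb{R}^d$ satisfy for all $n \in \mathbb{N}$ that
\[ m_0 = 0 \qquad\text{and}\qquad m_n = \alpha_n m_{n-1} + (1 - \alpha_n) G(\Theta_{n-1}) \tag{6.219} \]
and let $(b_{n,k})_{(n,k) \in (\mathbb{N}_0)^2} \subseteq \mathbb{R}$ satisfy for all $n \in \mathbb{N}$, $k \in \{0, 1, \dots, n - 1\}$ that
\[ b_{n,k} = (1 - \alpha_{k+1}) \Bigl[ \prod_{l=k+2}^{n} \alpha_l \Bigr]. \tag{6.220} \]
Observe that (6.215) implies item (i). Note that (6.215), (6.219), (6.220), and Lemma 6.3.3 assure that for all $n \in \mathbb{N}$ it holds that
\[ m_n = \sum_{k=0}^{n-1} b_{n,k} G(\Theta_k) \qquad\text{and}\qquad \sum_{k=0}^{n-1} a_{n,k} = \frac{\sum_{k=0}^{n-1} b_{n,k}}{1 - \prod_{k=1}^{n} \alpha_k} = \frac{1 - \prod_{k=1}^{n} \alpha_k}{1 - \prod_{k=1}^{n} \alpha_k} = 1. \tag{6.221} \]
This proves item (ii). Observe that (6.212), (6.213), (6.214), (6.219), and (6.221) demonstrate that for all $n \in \mathbb{N}$ it holds that
\[
\begin{aligned}
\Theta_n &= \Theta_{n-1} - \frac{\gamma_n m_n}{1 - \prod_{l=1}^{n} \alpha_l} = \Theta_{n-1} - \gamma_n \Bigl[ \sum_{k=0}^{n-1} \Bigl[ \frac{b_{n,k}}{1 - \prod_{l=1}^{n} \alpha_l} \Bigr] G(\Theta_k) \Bigr] \\
&= \Theta_{n-1} - \gamma_n \Bigl[ \sum_{k=0}^{n-1} a_{n,k} G(\Theta_k) \Bigr].
\end{aligned}
\tag{6.222}
\]
This establishes item (iii). The proof of Corollary 6.3.6 is thus complete.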
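
Since the normalized weights sum to one, the bias-adjusted momentum term reproduces a constant gradient exactly, whereas the unadjusted term underestimates it during the first steps. The following scalar sketch illustrates this; the function name `bias_adjusted_momentum_gd` and the constant-gradient example are assumptions made here.

```python
def bias_adjusted_momentum_gd(grad, xi, learning_rates, decay_factors):
    """Scalar sketch of (6.212)-(6.214); assumes alpha_1 < 1 so that no division by zero occurs."""
    theta = xi
    m = 0.0       # m_0 = 0
    prod = 1.0    # running product alpha_1 * ... * alpha_n
    thetas = [theta]
    for gamma, alpha in zip(learning_rates, decay_factors):
        m = alpha * m + (1.0 - alpha) * grad(theta)
        prod *= alpha
        theta = theta - gamma * m / (1.0 - prod)   # bias-adjusted step (6.214)
        thetas.append(theta)
    return thetas


# Hypothetical constant generalized gradient G = 3: every adjusted step has length gamma * 3
steps = bias_adjusted_momentum_gd(lambda th: 3.0, 0.0, [0.1] * 3, [0.9] * 3)
```

Without the division by 1 − α_1 ⋯ α_n the first step in this example would have length 0.1 · 0.3 instead of 0.1 · 3.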


6.3.3 Error analysis for GD optimization with momentum


In this subsection we provide in Section 6.3.3.2 below an error analysis for the momentum
GD optimization method in the case of a class of quadratic objective functions (cf.
Proposition 6.3.11 in Section 6.3.3.2 for the precise statement). In this specific case we also
provide in Section 6.3.3.3 below a comparison of the convergence speeds of the plain-vanilla
GD optimization method and the momentum GD optimization method. In particular,
we prove, roughly speeking, that the momentum GD optimization method outperfoms
the plain-vanilla GD optimization method in the case of the considered class of quadratic
objective functions; see Corollary 6.3.13 in Section 6.3.3.3 for the precise statement. For
this comparison between the plain-vanilla GD optimization method and the momentum GD
optimization method we employ a refined error analysis of the plain-vanilla GD optimization
method for the considered class of quadratic objective functions. This refined error analysis
is the subject of the next section (Section 6.3.3.1 below).
In the literature similar error analyses for the momentum GD optimization method can,
for instance, be found in [48, Section 7.1] and [337].

6.3.3.1 Error analysis for GD optimization in the case of quadratic objective functions

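Lemma 6.3.7 (Error analysis for the GD optimization method in the case of quadratic objective functions). Let $d \in \mathbb{N}$, $\xi \in \mathbb{R}^d$, $\vartheta = (\vartheta_1, \dots, \vartheta_d) \in \mathbb{R}^d$, $\kappa, K, \lambda_1, \lambda_2, \dots, \lambda_d \in (0, \infty)$ satisfy $\kappa = \min\{\lambda_1, \lambda_2, \dots, \lambda_d\}$ and $K = \max\{\lambda_1, \lambda_2, \dots, \lambda_d\}$, let $L \colon \mathbb{R}^d \to \mathbb{R}$ satisfy for all $\theta = (\theta_1, \dots, \theta_d) \in \mathbb{R}^d$ that
\[ L(\theta) = \tfrac{1}{2} \Bigl[ \sum_{i=1}^{d} \lambda_i |\theta_i - \vartheta_i|^2 \Bigr], \tag{6.223} \]
and let $\Theta \colon \mathbb{N}_0 \to \mathbb{R}^d$ satisfy for all $n \in \mathbb{N}$ that
\[ \Theta_0 = \xi \qquad\text{and}\qquad \Theta_n = \Theta_{n-1} - \tfrac{2}{(K + \kappa)} (\nabla L)(\Theta_{n-1}). \tag{6.224} \]
Then it holds for all $n \in \mathbb{N}_0$ that
\[ \|\Theta_n - \vartheta\|_2 \le \Bigl[ \tfrac{K - \kappa}{K + \kappa} \Bigr]^n \|\xi - \vartheta\|_2 \tag{6.225} \]
(cf. Definition 3.3.4).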

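Proof of Lemma 6.3.7. Throughout this proof, let $\Theta^{(1)}, \Theta^{(2)}, \dots, \Theta^{(d)} \colon \mathbb{N}_0 \to \mathbb{R}$ satisfy for all $n \in \mathbb{N}_0$ that $\Theta_n = (\Theta_n^{(1)}, \Theta_n^{(2)}, \dots, \Theta_n^{(d)})$. Note that (6.223) implies that for all $\theta = (\theta_1, \theta_2, \dots, \theta_d) \in \mathbb{R}^d$, $i \in \{1, 2, \dots, d\}$ it holds that
\[ \bigl( \tfrac{\partial L}{\partial \theta_i} \bigr)(\theta) = \lambda_i (\theta_i - \vartheta_i). \tag{6.226} \]
Combining this and (6.224) ensures that for all $n \in \mathbb{N}$, $i \in \{1, 2, \dots, d\}$ it holds that
\[
\begin{aligned}
\Theta_n^{(i)} - \vartheta_i
&= \Theta_{n-1}^{(i)} - \tfrac{2}{(K + \kappa)} \bigl( \tfrac{\partial L}{\partial \theta_i} \bigr)(\Theta_{n-1}) - \vartheta_i \\
&= \Theta_{n-1}^{(i)} - \vartheta_i - \tfrac{2}{(K + \kappa)} \bigl[ \lambda_i (\Theta_{n-1}^{(i)} - \vartheta_i) \bigr] \\
&= \bigl( 1 - \tfrac{2 \lambda_i}{(K + \kappa)} \bigr) (\Theta_{n-1}^{(i)} - \vartheta_i).
\end{aligned}
\tag{6.227}
\]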

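Hence, we obtain that for all $n \in \mathbb{N}$ it holds that
\[
\begin{aligned}
\|\Theta_n - \vartheta\|_2^2
&= \sum_{i=1}^{d} |\Theta_n^{(i)} - \vartheta_i|^2 \\
&= \sum_{i=1}^{d} \Bigl[ \bigl| 1 - \tfrac{2 \lambda_i}{(K + \kappa)} \bigr|^2 |\Theta_{n-1}^{(i)} - \vartheta_i|^2 \Bigr] \\
&\le \Bigl[ \max\Bigl\{ \bigl| 1 - \tfrac{2 \lambda_1}{(K + \kappa)} \bigr|^2, \dots, \bigl| 1 - \tfrac{2 \lambda_d}{(K + \kappa)} \bigr|^2 \Bigr\} \Bigr] \Bigl[ \sum_{i=1}^{d} |\Theta_{n-1}^{(i)} - \vartheta_i|^2 \Bigr] \\
&= \Bigl[ \max\Bigl\{ \bigl| 1 - \tfrac{2 \lambda_1}{(K + \kappa)} \bigr|, \dots, \bigl| 1 - \tfrac{2 \lambda_d}{(K + \kappa)} \bigr| \Bigr\} \Bigr]^2 \|\Theta_{n-1} - \vartheta\|_2^2
\end{aligned}
\tag{6.228}
\]
(cf. Definition 3.3.4). Moreover, note that the fact that for all $i \in \{1, 2, \dots, d\}$ it holds that $\lambda_i \ge \kappa$ implies that for all $i \in \{1, 2, \dots, d\}$ it holds that
\[ 1 - \tfrac{2 \lambda_i}{(K + \kappa)} \le 1 - \tfrac{2 \kappa}{(K + \kappa)} = \tfrac{K + \kappa - 2\kappa}{K + \kappa} = \tfrac{K - \kappa}{K + \kappa} \ge 0. \tag{6.229} \]
In addition, observe that the fact that for all $i \in \{1, 2, \dots, d\}$ it holds that $\lambda_i \le K$ implies that for all $i \in \{1, 2, \dots, d\}$ it holds that
\[ 1 - \tfrac{2 \lambda_i}{(K + \kappa)} \ge 1 - \tfrac{2 K}{(K + \kappa)} = \tfrac{K + \kappa - 2K}{(K + \kappa)} = - \Bigl[ \tfrac{K - \kappa}{K + \kappa} \Bigr] \le 0. \tag{6.230} \]
This and (6.229) ensure that for all $i \in \{1, 2, \dots, d\}$ it holds that
\[ \bigl| 1 - \tfrac{2 \lambda_i}{(K + \kappa)} \bigr| \le \tfrac{K - \kappa}{K + \kappa}. \tag{6.231} \]
Combining this with (6.228) demonstrates that for all $n \in \mathbb{N}$ it holds that
\[
\begin{aligned}
\|\Theta_n - \vartheta\|_2
&\le \Bigl[ \max\Bigl\{ \bigl| 1 - \tfrac{2 \lambda_1}{K + \kappa} \bigr|, \dots, \bigl| 1 - \tfrac{2 \lambda_d}{K + \kappa} \bigr| \Bigr\} \Bigr] \|\Theta_{n-1} - \vartheta\|_2 \\
&\le \Bigl[ \tfrac{K - \kappa}{K + \kappa} \Bigr] \|\Theta_{n-1} - \vartheta\|_2.
\end{aligned}
\tag{6.232}
\]
Induction therefore establishes that for all $n \in \mathbb{N}_0$ it holds that
\[ \|\Theta_n - \vartheta\|_2 \le \Bigl[ \tfrac{K - \kappa}{K + \kappa} \Bigr]^n \|\Theta_0 - \vartheta\|_2 = \Bigl[ \tfrac{K - \kappa}{K + \kappa} \Bigr]^n \|\xi - \vartheta\|_2. \tag{6.233} \]
The proof of Lemma 6.3.7 is thus complete.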
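
The error bound (6.225) can be observed on a small example. In the following sketch the eigenvalues, the minimizer, and the initial value are hypothetical choices; the code runs the GD recursion (6.224) and checks the bound at every step (with a tiny tolerance for floating-point rounding).

```python
import math

lambdas = [1.0, 4.0, 10.0]     # hypothetical lambda_1, ..., lambda_d
vartheta = [0.5, -1.0, 2.0]    # hypothetical minimizer
xi = [3.0, 3.0, 3.0]           # hypothetical initial value

kappa, K = min(lambdas), max(lambdas)
gamma = 2.0 / (K + kappa)           # learning rate from (6.224)
rate = (K - kappa) / (K + kappa)    # contraction factor from (6.225)


def distance(theta):
    return math.sqrt(sum((t - v) ** 2 for t, v in zip(theta, vartheta)))


theta = list(xi)
initial_distance = distance(theta)
bound_holds = True
for n in range(1, 31):
    # (grad L)(theta)_i = lambda_i * (theta_i - vartheta_i), cf. (6.226)
    theta = [t - gamma * lam * (t - v) for t, lam, v in zip(theta, lambdas, vartheta)]
    bound_holds = bound_holds and distance(theta) <= rate ** n * initial_distance + 1e-12
```

In this example the coordinates belonging to the smallest and the largest eigenvalue both contract by exactly the factor 9/11 per step (with alternating sign for the largest eigenvalue), which is why the bound is attained up to a constant.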


Lemma 6.3.7 above establishes, roughly speaking, the convergence rate $\tfrac{K - \kappa}{K + \kappa}$ (see (6.225)
above for the precise statement) for the GD optimization method in the case of the objective
function in (6.223). The next result, Lemma 6.3.8 below, essentially proves in the situation
of Lemma 6.3.7 that this convergence rate cannot be improved by means of a different
choice of the learning rate.

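Lemma 6.3.8 (Lower bound for the convergence rate of GD for quadratic objective functions). Let $d \in \mathbb{N}$, $\xi = (\xi_1, \xi_2, \dots, \xi_d)$, $\vartheta = (\vartheta_1, \vartheta_2, \dots, \vartheta_d) \in \mathbb{R}^d$, $\gamma, \kappa, K, \lambda_1, \lambda_2, \dots, \lambda_d \in (0, \infty)$ satisfy $\kappa = \min\{\lambda_1, \lambda_2, \dots, \lambda_d\}$ and $K = \max\{\lambda_1, \lambda_2, \dots, \lambda_d\}$, let $L \colon \mathbb{R}^d \to \mathbb{R}$ satisfy for all $\theta = (\theta_1, \theta_2, \dots, \theta_d) \in \mathbb{R}^d$ that
\[ L(\theta) = \tfrac{1}{2} \Bigl[ \sum_{i=1}^{d} \lambda_i |\theta_i - \vartheta_i|^2 \Bigr], \tag{6.234} \]
and let $\Theta \colon \mathbb{N}_0 \to \mathbb{R}^d$ satisfy for all $n \in \mathbb{N}$ that
\[ \Theta_0 = \xi \qquad\text{and}\qquad \Theta_n = \Theta_{n-1} - \gamma (\nabla L)(\Theta_{n-1}). \tag{6.235} \]
Then it holds for all $n \in \mathbb{N}_0$ that
\[
\begin{aligned}
\|\Theta_n - \vartheta\|_2
&\ge \bigl[ \max\{\gamma K - 1, 1 - \gamma \kappa\} \bigr]^n \bigl[ \min\{|\xi_1 - \vartheta_1|, \dots, |\xi_d - \vartheta_d|\} \bigr] \\
&\ge \Bigl[ \tfrac{K - \kappa}{K + \kappa} \Bigr]^n \bigl[ \min\{|\xi_1 - \vartheta_1|, \dots, |\xi_d - \vartheta_d|\} \bigr]
\end{aligned}
\tag{6.236}
\]
(cf. Definition 3.3.4).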

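Proof of Lemma 6.3.8. Throughout this proof, let $\Theta^{(1)}, \Theta^{(2)}, \dots, \Theta^{(d)} \colon \mathbb{N}_0 \to \mathbb{R}$ satisfy for all $n \in \mathbb{N}_0$ that $\Theta_n = (\Theta_n^{(1)}, \Theta_n^{(2)}, \dots, \Theta_n^{(d)})$ and let $\iota, I \in \{1, 2, \dots, d\}$ satisfy $\lambda_\iota = \kappa$ and $\lambda_I = K$. Observe that (6.234) implies that for all $\theta = (\theta_1, \theta_2, \dots, \theta_d) \in \mathbb{R}^d$, $i \in \{1, 2, \dots, d\}$ it holds that
\[ \bigl( \tfrac{\partial L}{\partial \theta_i} \bigr)(\theta) = \lambda_i (\theta_i - \vartheta_i). \tag{6.237} \]
Combining this with (6.235) implies that for all $n \in \mathbb{N}$, $i \in \{1, 2, \dots, d\}$ it holds that
\[
\begin{aligned}
\Theta_n^{(i)} - \vartheta_i
&= \Theta_{n-1}^{(i)} - \gamma \bigl( \tfrac{\partial L}{\partial \theta_i} \bigr)(\Theta_{n-1}) - \vartheta_i \\
&= \Theta_{n-1}^{(i)} - \vartheta_i - \gamma \lambda_i (\Theta_{n-1}^{(i)} - \vartheta_i) \\
&= (1 - \gamma \lambda_i)(\Theta_{n-1}^{(i)} - \vartheta_i).
\end{aligned}
\tag{6.238}
\]
Induction hence proves that for all $n \in \mathbb{N}_0$, $i \in \{1, 2, \dots, d\}$ it holds that
\[ \Theta_n^{(i)} - \vartheta_i = (1 - \gamma \lambda_i)^n (\Theta_0^{(i)} - \vartheta_i) = (1 - \gamma \lambda_i)^n (\xi_i - \vartheta_i). \tag{6.239} \]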


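This shows that for all $n \in \mathbb{N}_0$ it holds that
\[
\begin{aligned}
\|\Theta_n - \vartheta\|_2^2
&= \sum_{i=1}^{d} |\Theta_n^{(i)} - \vartheta_i|^2 = \sum_{i=1}^{d} \bigl[ |1 - \gamma \lambda_i|^{2n} |\xi_i - \vartheta_i|^2 \bigr] \\
&\ge \bigl[ \min\{|\xi_1 - \vartheta_1|^2, \dots, |\xi_d - \vartheta_d|^2\} \bigr] \Bigl[ \sum_{i=1}^{d} |1 - \gamma \lambda_i|^{2n} \Bigr] \\
&\ge \bigl[ \min\{|\xi_1 - \vartheta_1|^2, \dots, |\xi_d - \vartheta_d|^2\} \bigr] \bigl[ \max\{|1 - \gamma \lambda_1|^{2n}, \dots, |1 - \gamma \lambda_d|^{2n}\} \bigr] \\
&= \bigl[ \min\{|\xi_1 - \vartheta_1|, \dots, |\xi_d - \vartheta_d|\} \bigr]^2 \bigl[ \max\{|1 - \gamma \lambda_1|, \dots, |1 - \gamma \lambda_d|\} \bigr]^{2n}
\end{aligned}
\tag{6.240}
\]
(cf. Definition 3.3.4). Furthermore, note that
\[
\begin{aligned}
\max\{|1 - \gamma \lambda_1|, \dots, |1 - \gamma \lambda_d|\}
&\ge \max\{|1 - \gamma \lambda_I|, |1 - \gamma \lambda_\iota|\} \\
&= \max\{|1 - \gamma K|, |1 - \gamma \kappa|\} = \max\{1 - \gamma K, \gamma K - 1, 1 - \gamma \kappa, \gamma \kappa - 1\} \\
&= \max\{\gamma K - 1, 1 - \gamma \kappa\}.
\end{aligned}
\tag{6.241}
\]
In addition, observe that for all $\alpha \in (-\infty, \tfrac{2}{K + \kappa}]$ it holds that
\[ \max\{\alpha K - 1, 1 - \alpha \kappa\} \ge 1 - \alpha \kappa \ge 1 - \bigl[ \tfrac{2}{K + \kappa} \bigr] \kappa = \tfrac{K + \kappa - 2\kappa}{K + \kappa} = \tfrac{K - \kappa}{K + \kappa}. \tag{6.242} \]
Moreover, note that for all $\alpha \in [\tfrac{2}{K + \kappa}, \infty)$ it holds that
\[ \max\{\alpha K - 1, 1 - \alpha \kappa\} \ge \alpha K - 1 \ge \bigl[ \tfrac{2}{K + \kappa} \bigr] K - 1 = \tfrac{2K - (K + \kappa)}{K + \kappa} = \tfrac{K - \kappa}{K + \kappa}. \tag{6.243} \]
Combining this, (6.241), and (6.242) proves that
\[ \max\{|1 - \gamma \lambda_1|, \dots, |1 - \gamma \lambda_d|\} \ge \max\{\gamma K - 1, 1 - \gamma \kappa\} \ge \tfrac{K - \kappa}{K + \kappa} \ge 0. \tag{6.244} \]
This and (6.240) demonstrate that for all $n \in \mathbb{N}_0$ it holds that
\[
\begin{aligned}
\|\Theta_n - \vartheta\|_2
&\ge \bigl[ \max\{|1 - \gamma \lambda_1|, \dots, |1 - \gamma \lambda_d|\} \bigr]^n \bigl[ \min\{|\xi_1 - \vartheta_1|, \dots, |\xi_d - \vartheta_d|\} \bigr] \\
&\ge \bigl[ \max\{\gamma K - 1, 1 - \gamma \kappa\} \bigr]^n \bigl[ \min\{|\xi_1 - \vartheta_1|, \dots, |\xi_d - \vartheta_d|\} \bigr] \\
&\ge \Bigl[ \tfrac{K - \kappa}{K + \kappa} \Bigr]^n \bigl[ \min\{|\xi_1 - \vartheta_1|, \dots, |\xi_d - \vartheta_d|\} \bigr].
\end{aligned}
\tag{6.245}
\]
The proof of Lemma 6.3.8 is thus complete.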
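
The key quantity in the proof is the per-step contraction factor max{|1 − γλ_1|, ..., |1 − γλ_d|}, which by (6.244) is never smaller than (K − κ)/(K + κ). A small sweep over learning rates illustrates this; the eigenvalues and the grid of learning rates below are hypothetical choices made for this sketch.

```python
def contraction_factor(gamma, lambdas):
    """max_i |1 - gamma * lambda_i|, the per-step factor appearing in (6.240)."""
    return max(abs(1.0 - gamma * lam) for lam in lambdas)


lambdas = [1.0, 4.0, 10.0]   # hypothetical eigenvalues
kappa, K = min(lambdas), max(lambdas)

best_gamma = 2.0 / (K + kappa)
best_factor = contraction_factor(best_gamma, lambdas)   # equals (K - kappa) / (K + kappa)

# No learning rate on a (hypothetical) grid in (0, 0.4) achieves a smaller factor
grid = [i / 1000.0 for i in range(1, 400)]
min_on_grid = min(contraction_factor(g, lambdas) for g in grid)
```

The sweep is only an illustration; the lemma itself covers every learning rate γ ∈ (0, ∞).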

6.3.3.2 Error analysis for GD optimization with momentum in the case of quadratic objective functions

In this subsection we provide in Proposition 6.3.11 below an error analysis for the momentum
GD optimization method in the case of a class of quadratic objective functions. Our proof of
Proposition 6.3.11 employs the two auxiliary results on square matrices in Lemma 6.3.9
and Lemma 6.3.10 below. Lemma 6.3.9 is a special case of the so-called Gelfand spectral
radius formula in the literature. Lemma 6.3.10 establishes a formula for the determinants
of square block matrices (see (6.247) below for the precise statement). Lemma 6.3.10
and its proof can, for example, be found in Silvester [377, Theorem 3].

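Lemma 6.3.9 (A special case of Gelfand's spectral radius formula for real matrices). Let $d \in \mathbb{N}$, $A \in \mathbb{R}^{d \times d}$, $S = \{\lambda \in \mathbb{C} : (\exists\, v \in \mathbb{C}^d \setminus \{0\} : Av = \lambda v)\}$ and let $|||\cdot||| \colon \mathbb{R}^d \to [0, \infty)$ be a norm. Then
\[ \liminf_{n \to \infty} \Biggl( \Biggl[ \sup_{v \in \mathbb{R}^d \setminus \{0\}} \frac{|||A^n v|||}{|||v|||} \Biggr]^{1/n} \Biggr) = \limsup_{n \to \infty} \Biggl( \Biggl[ \sup_{v \in \mathbb{R}^d \setminus \{0\}} \frac{|||A^n v|||}{|||v|||} \Biggr]^{1/n} \Biggr) = \max_{\lambda \in S \cup \{0\}} |\lambda|. \tag{6.246} \]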

Proof of Lemma 6.3.9. Note that, for instance, Einsiedler & Ward [127, Theorem 11.6]
establishes (6.246) (cf., for example, Tropp [395]). The proof of Lemma 6.3.9 is thus
complete.
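
The formula (6.246) can be illustrated numerically. The sketch below uses a hypothetical upper triangular (2 × 2)-matrix, whose eigenvalues 0.5 and 0.25 can be read off the diagonal, together with the maximum norm on R², whose induced matrix norm is the maximal absolute row sum; these choices and all names are assumptions made for this example.

```python
def matmul(X, Y):
    """Product of two square matrices given as nested lists."""
    size = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def induced_max_norm(X):
    """sup_v |Xv|_max / |v|_max, which equals the maximal absolute row sum of X."""
    return max(sum(abs(entry) for entry in row) for row in X)


A = [[0.5, 1.0], [0.0, 0.25]]   # hypothetical matrix with eigenvalues 0.5 and 0.25
n = 200

power = [[1.0, 0.0], [0.0, 1.0]]
for _ in range(n):
    power = matmul(power, A)    # power = A^n after the loop

estimate = induced_max_norm(power) ** (1.0 / n)   # approaches max |lambda| = 0.5
```

Note that the norm of A itself is 1.5, far above the spectral radius; only the n-th root of the norm of the n-th power reveals the asymptotic contraction factor, which is exactly how the formula is used in the error analysis below.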

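Lemma 6.3.10 (Determinants for block matrices). Let $d \in \mathbb{N}$, $A, B, C, D \in \mathbb{C}^{d \times d}$ satisfy $CD = DC$. Then
\[ \det \underbrace{\begin{pmatrix} A & B \\ C & D \end{pmatrix}}_{\in\, \mathbb{C}^{(2d) \times (2d)}} = \det(AD - BC). \tag{6.247} \]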

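Proof of Lemma 6.3.10. Throughout this proof, let $D_x \in \mathbb{C}^{d \times d}$, $x \in \mathbb{C}$, satisfy for all $x \in \mathbb{C}$ that
\[ D_x = D - x \operatorname{I}_d \tag{6.248} \]
(cf. Definition 1.5.5). Observe that the fact that for all $x \in \mathbb{C}$ it holds that $C D_x = D_x C$ and the fact that for all $X, Y, Z \in \mathbb{C}^{d \times d}$ it holds that
\[ \det \begin{pmatrix} X & Y \\ 0 & Z \end{pmatrix} = \det(X) \det(Z) = \det \begin{pmatrix} X & 0 \\ Y & Z \end{pmatrix} \tag{6.249} \]
(cf., for instance, Petersen [331, Proposition 5.5.3 and Proposition 5.5.4]) imply that for all $x \in \mathbb{C}$ it holds that
\[
\begin{aligned}
\det \left( \begin{pmatrix} A & B \\ C & D_x \end{pmatrix} \begin{pmatrix} D_x & 0 \\ -C & \operatorname{I}_d \end{pmatrix} \right)
&= \det \begin{pmatrix} (A D_x - BC) & B \\ (C D_x - D_x C) & D_x \end{pmatrix} \\
&= \det \begin{pmatrix} (A D_x - BC) & B \\ 0 & D_x \end{pmatrix} \\
&= \det(A D_x - BC) \det(D_x).
\end{aligned}
\tag{6.250}
\]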


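Moreover, note that (6.249) and the multiplicative property of the determinant (see, for example, Petersen [331, (1) in Proposition 5.5.2]) imply that for all $x \in \mathbb{C}$ it holds that
\[
\begin{aligned}
\det \left( \begin{pmatrix} A & B \\ C & D_x \end{pmatrix} \begin{pmatrix} D_x & 0 \\ -C & \operatorname{I}_d \end{pmatrix} \right)
&= \det \begin{pmatrix} A & B \\ C & D_x \end{pmatrix} \det \begin{pmatrix} D_x & 0 \\ -C & \operatorname{I}_d \end{pmatrix} \\
&= \det \begin{pmatrix} A & B \\ C & D_x \end{pmatrix} \det(D_x) \det(\operatorname{I}_d) \\
&= \det \begin{pmatrix} A & B \\ C & D_x \end{pmatrix} \det(D_x).
\end{aligned}
\tag{6.251}
\]
Combining this and (6.250) demonstrates that for all $x \in \mathbb{C}$ it holds that
\[ \det \begin{pmatrix} A & B \\ C & D_x \end{pmatrix} \det(D_x) = \det(A D_x - BC) \det(D_x). \tag{6.252} \]
Hence, we obtain for all $x \in \mathbb{C}$ that
\[ \left( \det \begin{pmatrix} A & B \\ C & D_x \end{pmatrix} - \det(A D_x - BC) \right) \det(D_x) = 0. \tag{6.253} \]
This implies that for all $x \in \mathbb{C}$ with $\det(D_x) \neq 0$ it holds that
\[ \det \begin{pmatrix} A & B \\ C & D_x \end{pmatrix} - \det(A D_x - BC) = 0. \tag{6.254} \]
Moreover, note that the fact that $\mathbb{C} \ni x \mapsto \det(D - x \operatorname{I}_d) \in \mathbb{C}$ is a polynomial function of degree $d$ ensures that $\{x \in \mathbb{C} : \det(D_x) = 0\} = \{x \in \mathbb{C} : \det(D - x \operatorname{I}_d) = 0\}$ is a finite set. Combining this and (6.254) with the fact that the function
\[ \mathbb{C} \ni x \mapsto \det \begin{pmatrix} A & B \\ C & D_x \end{pmatrix} - \det(A D_x - BC) \in \mathbb{C} \tag{6.255} \]
is continuous shows that for all $x \in \mathbb{C}$ it holds that
\[ \det \begin{pmatrix} A & B \\ C & D_x \end{pmatrix} - \det(A D_x - BC) = 0. \tag{6.256} \]
Hence, we obtain for all $x \in \mathbb{C}$ that
\[ \det \begin{pmatrix} A & B \\ C & D_x \end{pmatrix} = \det(A D_x - BC). \tag{6.257} \]
This establishes that
\[ \det \begin{pmatrix} A & B \\ C & D \end{pmatrix} = \det \begin{pmatrix} A & B \\ C & D_0 \end{pmatrix} = \det(A D_0 - BC) = \det(AD - BC). \tag{6.258} \]
The proof of Lemma 6.3.10 is thus complete.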
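
The block determinant formula (6.247) can be checked on a small integer example, where all computations are exact. In the sketch below the matrices A, B, D are hypothetical, and C is chosen as 2 I + 3 D so that C and D commute; the helper functions are written out by hand to keep the example self-contained.

```python
def det(M):
    """Determinant via Laplace expansion along the first row (fine for tiny matrices)."""
    if len(M) == 1:
        return M[0][0]
    return sum(
        (-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
        for j in range(len(M))
    )


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def matsub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


# Hypothetical (2 x 2) integer blocks; C = 2 I + 3 D is a polynomial in D, hence CD = DC
A = [[1, 2], [3, 4]]
B = [[0, 1], [5, -2]]
D = [[2, 1], [1, 3]]
C = [[3 * D[i][j] + (2 if i == j else 0) for j in range(2)] for i in range(2)]

block = [A[i] + B[i] for i in range(2)] + [C[i] + D[i] for i in range(2)]
lhs = det(block)                                   # determinant of the (4 x 4) block matrix
rhs = det(matsub(matmul(A, D), matmul(B, C)))      # det(AD - BC)
```

The commutativity assumption matters: for blocks C and D that do not commute, the two determinants generally differ.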


We are now in the position to formulate and prove the promised error analysis for
the momentum GD optimization method in the case of the considered class of quadratic
objective functions; see Proposition 6.3.11 below.
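Proposition 6.3.11 (Error analysis for the momentum GD optimization method in the case of quadratic objective functions). Let $d \in \mathbb{N}$, $\xi \in \mathbb{R}^d$, $\vartheta = (\vartheta_1, \dots, \vartheta_d) \in \mathbb{R}^d$, $\kappa, K, \lambda_1, \lambda_2, \dots, \lambda_d \in (0, \infty)$ satisfy $\kappa = \min\{\lambda_1, \lambda_2, \dots, \lambda_d\}$ and $K = \max\{\lambda_1, \lambda_2, \dots, \lambda_d\}$, let $L \colon \mathbb{R}^d \to \mathbb{R}$ satisfy for all $\theta = (\theta_1, \dots, \theta_d) \in \mathbb{R}^d$ that
\[ L(\theta) = \tfrac{1}{2} \Bigl[ \sum_{i=1}^{d} \lambda_i |\theta_i - \vartheta_i|^2 \Bigr], \tag{6.259} \]
and let $\Theta \colon \mathbb{N}_0 \cup \{-1\} \to \mathbb{R}^d$ satisfy for all $n \in \mathbb{N}$ that $\Theta_{-1} = \Theta_0 = \xi$ and
\[ \Theta_n = \Theta_{n-1} - \tfrac{4}{(\sqrt{K} + \sqrt{\kappa})^2} (\nabla L)(\Theta_{n-1}) + \Bigl[ \tfrac{\sqrt{K} - \sqrt{\kappa}}{\sqrt{K} + \sqrt{\kappa}} \Bigr]^2 (\Theta_{n-1} - \Theta_{n-2}). \tag{6.260} \]
Then
(i) it holds that $\Theta|_{\mathbb{N}_0} \colon \mathbb{N}_0 \to \mathbb{R}^d$ is the momentum GD process for the objective function $L$ with learning rates $\mathbb{N} \ni n \mapsto \tfrac{1}{\sqrt{K \kappa}} \in [0, \infty)$, momentum decay factors $\mathbb{N} \ni n \mapsto \bigl[ \tfrac{K^{1/2} - \kappa^{1/2}}{K^{1/2} + \kappa^{1/2}} \bigr]^2 \in [0, 1]$, and initial value $\xi$ and

(ii) for every $\varepsilon \in (0, \infty)$ there exists $C \in (0, \infty)$ such that for all $n \in \mathbb{N}_0$ it holds that
\[ \|\Theta_n - \vartheta\|_2 \le C \Bigl[ \tfrac{\sqrt{K} - \sqrt{\kappa}}{\sqrt{K} + \sqrt{\kappa}} + \varepsilon \Bigr]^n \tag{6.261} \]
(cf. Definitions 3.3.4 and 6.3.1).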
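
The difference between the rate (K − κ)/(K + κ) of plain GD and the rate (√K − √κ)/(√K + √κ) of momentum GD becomes visible already in two dimensions. The following sketch runs the GD recursion (6.224) and the momentum recursion (6.260) side by side; the eigenvalues 1 and 100, the minimizer, the initial value, and the number of steps are hypothetical choices made for this illustration.

```python
import math

lambdas = [1.0, 100.0]   # hypothetical eigenvalues, so kappa = 1 and K = 100
vartheta = [0.0, 0.0]    # hypothetical minimizer
xi = [1.0, 1.0]          # hypothetical initial value
kappa, K = min(lambdas), max(lambdas)


def grad(theta):
    return [lam * (t - v) for lam, t, v in zip(lambdas, theta, vartheta)]


def distance(theta):
    return math.sqrt(sum((t - v) ** 2 for t, v in zip(theta, vartheta)))


# Plain GD with learning rate 2 / (K + kappa), cf. (6.224)
gd = list(xi)
for _ in range(100):
    gd = [t - 2.0 / (K + kappa) * g for t, g in zip(gd, grad(gd))]

# Momentum recursion (6.260) with Theta_{-1} = Theta_0 = xi
rho = 4.0 / (math.sqrt(K) + math.sqrt(kappa)) ** 2
alpha = ((math.sqrt(K) - math.sqrt(kappa)) / (math.sqrt(K) + math.sqrt(kappa))) ** 2
prev, cur = list(xi), list(xi)
for _ in range(100):
    g = grad(cur)
    nxt = [c - rho * gi + alpha * (c - p) for c, gi, p in zip(cur, g, prev)]
    prev, cur = cur, nxt

gd_error = distance(gd)
momentum_error = distance(cur)
```

With these parameters the GD rate is 99/101 while the momentum rate is 9/11, so after 100 steps the momentum iterate is several orders of magnitude closer to the minimizer.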


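Proof of Proposition 6.3.11. Throughout this proof, let $\varepsilon \in (0, \infty)$, let $|||\cdot||| \colon \mathbb{R}^{(2d) \times (2d)} \to [0, \infty)$ satisfy for all $B \in \mathbb{R}^{(2d) \times (2d)}$ that
\[ |||B||| = \sup_{v \in \mathbb{R}^{2d} \setminus \{0\}} \Bigl[ \frac{\|Bv\|_2}{\|v\|_2} \Bigr], \tag{6.262} \]
let $\Theta^{(1)}, \Theta^{(2)}, \dots, \Theta^{(d)} \colon \mathbb{N}_0 \cup \{-1\} \to \mathbb{R}$ satisfy for all $n \in \mathbb{N}_0 \cup \{-1\}$ that $\Theta_n = (\Theta_n^{(1)}, \Theta_n^{(2)}, \dots, \Theta_n^{(d)})$, let $m \colon \mathbb{N}_0 \to \mathbb{R}^d$ satisfy for all $n \in \mathbb{N}_0$ that
\[ m_n = -\sqrt{K \kappa}\, (\Theta_n - \Theta_{n-1}), \tag{6.263} \]
let $\varrho \in (0, \infty)$, $\alpha \in [0, 1)$ be given by
\[ \varrho = \tfrac{4}{(\sqrt{K} + \sqrt{\kappa})^2} \qquad\text{and}\qquad \alpha = \Bigl[ \tfrac{\sqrt{K} - \sqrt{\kappa}}{\sqrt{K} + \sqrt{\kappa}} \Bigr]^2, \tag{6.264} \]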


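let $M \in \mathbb{R}^{d \times d}$ be the diagonal $(d \times d)$-matrix given by
\[ M = \begin{pmatrix} (1 - \varrho \lambda_1 + \alpha) & & 0 \\ & \ddots & \\ 0 & & (1 - \varrho \lambda_d + \alpha) \end{pmatrix}, \tag{6.265} \]
let $A \in \mathbb{R}^{2d \times 2d}$ be the $((2d) \times (2d))$-matrix given by
\[ A = \begin{pmatrix} M & (-\alpha \operatorname{I}_d) \\ \operatorname{I}_d & 0 \end{pmatrix}, \tag{6.266} \]
and let $S \subseteq \mathbb{C}$ be the set given by
\[ S = \{\mu \in \mathbb{C} : (\exists\, v \in \mathbb{C}^{2d} \setminus \{0\} : Av = \mu v)\} = \{\mu \in \mathbb{C} : \det(A - \mu \operatorname{I}_{2d}) = 0\} \tag{6.267} \]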

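(cf. Definition 1.5.5). Observe that (6.260), (6.263), and the fact that
\[
\begin{aligned}
\tfrac{(\sqrt{K} + \sqrt{\kappa})^2 - (\sqrt{K} - \sqrt{\kappa})^2}{4}
&= \tfrac{1}{4} \bigl[ (\sqrt{K} + \sqrt{\kappa} + \sqrt{K} - \sqrt{\kappa})(\sqrt{K} + \sqrt{\kappa} - [\sqrt{K} - \sqrt{\kappa}]) \bigr] \\
&= \tfrac{1}{4} \bigl[ (2 \sqrt{K})(2 \sqrt{\kappa}) \bigr] = \sqrt{K \kappa}
\end{aligned}
\tag{6.268}
\]
assure that for all $n \in \mathbb{N}$ it holds that
\[
\begin{aligned}
m_n &= -\sqrt{K \kappa}\, (\Theta_n - \Theta_{n-1}) \\
&= -\sqrt{K \kappa} \Bigl( \Theta_{n-1} - \Bigl[ \tfrac{4}{(\sqrt{K} + \sqrt{\kappa})^2} \Bigr] (\nabla L)(\Theta_{n-1}) + \Bigl[ \tfrac{\sqrt{K} - \sqrt{\kappa}}{\sqrt{K} + \sqrt{\kappa}} \Bigr]^2 (\Theta_{n-1} - \Theta_{n-2}) - \Theta_{n-1} \Bigr) \\
&= \sqrt{K \kappa} \Bigl( \Bigl[ \tfrac{4}{(\sqrt{K} + \sqrt{\kappa})^2} \Bigr] (\nabla L)(\Theta_{n-1}) - \Bigl[ \tfrac{\sqrt{K} - \sqrt{\kappa}}{\sqrt{K} + \sqrt{\kappa}} \Bigr]^2 (\Theta_{n-1} - \Theta_{n-2}) \Bigr) \\
&= \Bigl[ \tfrac{(\sqrt{K} + \sqrt{\kappa})^2 - (\sqrt{K} - \sqrt{\kappa})^2}{4} \Bigr] \Bigl[ \tfrac{4}{(\sqrt{K} + \sqrt{\kappa})^2} \Bigr] (\nabla L)(\Theta_{n-1}) - \sqrt{K \kappa} \Bigl[ \tfrac{\sqrt{K} - \sqrt{\kappa}}{\sqrt{K} + \sqrt{\kappa}} \Bigr]^2 (\Theta_{n-1} - \Theta_{n-2}) \\
&= \Bigl[ 1 - \tfrac{(\sqrt{K} - \sqrt{\kappa})^2}{(\sqrt{K} + \sqrt{\kappa})^2} \Bigr] (\nabla L)(\Theta_{n-1}) + \Bigl[ \tfrac{\sqrt{K} - \sqrt{\kappa}}{\sqrt{K} + \sqrt{\kappa}} \Bigr]^2 \bigl[ -\sqrt{K \kappa}\, (\Theta_{n-1} - \Theta_{n-2}) \bigr] \\
&= \Bigl[ 1 - \Bigl[ \tfrac{\sqrt{K} - \sqrt{\kappa}}{\sqrt{K} + \sqrt{\kappa}} \Bigr]^2 \Bigr] (\nabla L)(\Theta_{n-1}) + \Bigl[ \tfrac{\sqrt{K} - \sqrt{\kappa}}{\sqrt{K} + \sqrt{\kappa}} \Bigr]^2 m_{n-1}.
\end{aligned}
\tag{6.269}
\]
Moreover, note that (6.263) implies that for all $n \in \mathbb{N}_0$ it holds that
\[
\begin{aligned}
\Theta_n &= \Theta_{n-1} + (\Theta_n - \Theta_{n-1}) \\
&= \Theta_{n-1} - \tfrac{1}{\sqrt{K \kappa}} \bigl[ -\sqrt{K \kappa}\, (\Theta_n - \Theta_{n-1}) \bigr] = \Theta_{n-1} - \tfrac{1}{\sqrt{K \kappa}} m_n.
\end{aligned}
\tag{6.270}
\]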


In addition, observe that the assumption that Θ−1 = Θ0 = ξ and (6.263) ensure that

(6.271)

m0 = − Kκ Θ0 − Θ−1 = 0.
Combining this and the assumption that Θ0 = ξ with (6.269) and (6.270) proves item (i).
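It thus remains to prove item (ii). For this observe that (6.259) implies that for all
θ = (θ1 , θ2 , . . . , θd ) ∈ Rd , i ∈ {1, 2, . . . , d} it holds that
\[
  \Big(\frac{\partial f}{\partial \theta_i}\Big)(\theta) = \lambda_i(\theta_i - \vartheta_i). \tag{6.272}
\]
This, (6.260), and (6.264) imply that for all n ∈ N, i ∈ {1, 2, . . . , d} it holds that
\[
\begin{aligned}
  \Theta_n^{(i)} - \vartheta_i
  &= \Theta_{n-1}^{(i)} - \varrho\Big(\frac{\partial f}{\partial \theta_i}\Big)(\Theta_{n-1}) + \alpha\big(\Theta_{n-1}^{(i)} - \Theta_{n-2}^{(i)}\big) - \vartheta_i \\
  &= \big(\Theta_{n-1}^{(i)} - \vartheta_i\big) - \varrho\lambda_i\big(\Theta_{n-1}^{(i)} - \vartheta_i\big) + \alpha\Big(\big(\Theta_{n-1}^{(i)} - \vartheta_i\big) - \big(\Theta_{n-2}^{(i)} - \vartheta_i\big)\Big) \\
  &= (1 - \varrho\lambda_i + \alpha)\big(\Theta_{n-1}^{(i)} - \vartheta_i\big) - \alpha\big(\Theta_{n-2}^{(i)} - \vartheta_i\big).
\end{aligned} \tag{6.273}
\]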
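Combining this with (6.265) demonstrates that for all n ∈ N it holds that
\[
\begin{aligned}
  \mathbb{R}^d \ni (\Theta_n - \vartheta)
  &= M(\Theta_{n-1} - \vartheta) - \alpha(\Theta_{n-2} - \vartheta) \\
  &= \underbrace{\begin{pmatrix} M & (-\alpha \operatorname{I}_d) \end{pmatrix}}_{\in\, \mathbb{R}^{d\times 2d}}
     \underbrace{\begin{pmatrix} \Theta_{n-1} - \vartheta \\ \Theta_{n-2} - \vartheta \end{pmatrix}}_{\in\, \mathbb{R}^{2d}} .
\end{aligned} \tag{6.274}
\]
This and (6.266) assure that for all n ∈ N it holds that
\[
  \mathbb{R}^{2d} \ni \begin{pmatrix} \Theta_n - \vartheta \\ \Theta_{n-1} - \vartheta \end{pmatrix}
  = \begin{pmatrix} M & (-\alpha \operatorname{I}_d) \\ \operatorname{I}_d & 0 \end{pmatrix}
    \begin{pmatrix} \Theta_{n-1} - \vartheta \\ \Theta_{n-2} - \vartheta \end{pmatrix}
  = A \begin{pmatrix} \Theta_{n-1} - \vartheta \\ \Theta_{n-2} - \vartheta \end{pmatrix}. \tag{6.275}
\]
Induction hence proves that for all n ∈ N0 it holds that
\[
  \mathbb{R}^{2d} \ni \begin{pmatrix} \Theta_n - \vartheta \\ \Theta_{n-1} - \vartheta \end{pmatrix}
  = A^n \begin{pmatrix} \Theta_0 - \vartheta \\ \Theta_{-1} - \vartheta \end{pmatrix}
  = A^n \begin{pmatrix} \xi - \vartheta \\ \xi - \vartheta \end{pmatrix}. \tag{6.276}
\]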
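This implies that for all n ∈ N0 it holds that
\[
\begin{aligned}
  \|\Theta_n - \vartheta\|_2
  &\le \sqrt{\|\Theta_n - \vartheta\|_2^2 + \|\Theta_{n-1} - \vartheta\|_2^2}
   = \left\| \begin{pmatrix} \Theta_n - \vartheta \\ \Theta_{n-1} - \vartheta \end{pmatrix} \right\|_2
   = \left\| A^n \begin{pmatrix} \xi - \vartheta \\ \xi - \vartheta \end{pmatrix} \right\|_2 \\
  &\le |\!|\!| A^n |\!|\!| \left\| \begin{pmatrix} \xi - \vartheta \\ \xi - \vartheta \end{pmatrix} \right\|_2
   = |\!|\!| A^n |\!|\!| \sqrt{\|\xi - \vartheta\|_2^2 + \|\xi - \vartheta\|_2^2}
   = |\!|\!| A^n |\!|\!| \, \sqrt{2}\, \|\xi - \vartheta\|_2 .
\end{aligned} \tag{6.277}
\]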


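Next note that (6.267) and Lemma 6.3.9 demonstrate that
\[
  \limsup_{n\to\infty} \Big( |\!|\!| A^n |\!|\!|^{1/n} \Big)
  = \liminf_{n\to\infty} \Big( |\!|\!| A^n |\!|\!|^{1/n} \Big)
  = \max_{\mu \in S \cup \{0\}} |\mu|. \tag{6.278}
\]
This implies that there exists m ∈ N which satisfies for all n ∈ N0 ∩ [m, ∞) that
\[
  \big[ |\!|\!| A^n |\!|\!| \big]^{1/n} \le \varepsilon + \max_{\mu \in S \cup \{0\}} |\mu|. \tag{6.279}
\]
Therefore, we obtain for all n ∈ N0 ∩ [m, ∞) that
\[
  |\!|\!| A^n |\!|\!| \le \Big[ \varepsilon + \max_{\mu \in S \cup \{0\}} |\mu| \Big]^n. \tag{6.280}
\]
Furthermore, note that for all n ∈ N0 ∩ [0, m) it holds that
\[
\begin{aligned}
  |\!|\!| A^n |\!|\!|
  &= \Big[ \varepsilon + \max_{\mu \in S \cup \{0\}} |\mu| \Big]^n
     \left[ \frac{|\!|\!| A^n |\!|\!|}{(\varepsilon + \max_{\mu \in S \cup \{0\}} |\mu|)^n} \right] \\
  &\le \Big[ \varepsilon + \max_{\mu \in S \cup \{0\}} |\mu| \Big]^n
     \left[ \max\!\left( \left\{ \frac{|\!|\!| A^k |\!|\!|}{(\varepsilon + \max_{\mu \in S \cup \{0\}} |\mu|)^k} \colon k \in \mathbb{N}_0 \cap [0, m) \right\} \cup \{1\} \right) \right].
\end{aligned} \tag{6.281}
\]
Combining this and (6.280) proves that for all n ∈ N0 it holds that
\[
  |\!|\!| A^n |\!|\!|
  \le \Big[ \varepsilon + \max_{\mu \in S \cup \{0\}} |\mu| \Big]^n
     \left[ \max\!\left( \left\{ \frac{|\!|\!| A^k |\!|\!|}{(\varepsilon + \max_{\mu \in S \cup \{0\}} |\mu|)^k} \colon k \in \mathbb{N}_0 \cap [0, m) \right\} \cup \{1\} \right) \right]. \tag{6.282}
\]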

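Next observe that Lemma 6.3.10, (6.266), and the fact that for all µ ∈ C it holds that
I_d (−µ I_d) = −µ I_d = (−µ I_d) I_d ensure that for all µ ∈ C it holds that
\[
\begin{aligned}
  \det(A - \mu \operatorname{I}_{2d})
  &= \det \begin{pmatrix} (M - \mu \operatorname{I}_d) & (-\alpha \operatorname{I}_d) \\ \operatorname{I}_d & -\mu \operatorname{I}_d \end{pmatrix} \\
  &= \det\big( (M - \mu \operatorname{I}_d)(-\mu \operatorname{I}_d) - (-\alpha \operatorname{I}_d) \operatorname{I}_d \big) \\
  &= \det\big( (M - \mu \operatorname{I}_d)(-\mu \operatorname{I}_d) + \alpha \operatorname{I}_d \big).
\end{aligned} \tag{6.283}
\]
This and (6.265) demonstrate that for all µ ∈ C it holds that
\[
\begin{aligned}
  \det(A - \mu \operatorname{I}_{2d})
  &= \det \begin{pmatrix}
       (1 - \varrho\lambda_1 + \alpha - \mu)(-\mu) + \alpha & & 0 \\
       & \ddots & \\
       0 & & (1 - \varrho\lambda_d + \alpha - \mu)(-\mu) + \alpha
     \end{pmatrix} \\
  &= \prod_{i=1}^{d} \big[ (1 - \varrho\lambda_i + \alpha - \mu)(-\mu) + \alpha \big] \\
  &= \prod_{i=1}^{d} \big[ \mu^2 - (1 - \varrho\lambda_i + \alpha)\mu + \alpha \big].
\end{aligned} \tag{6.284}
\]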


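Moreover, note that for all µ ∈ C, i ∈ {1, 2, . . . , d} it holds that
\[
\begin{aligned}
  \mu^2 - (1 - \varrho\lambda_i + \alpha)\mu + \alpha
  &= \mu^2 - 2\mu\Big[\tfrac{(1-\varrho\lambda_i+\alpha)}{2}\Big] + \Big[\tfrac{(1-\varrho\lambda_i+\alpha)}{2}\Big]^2 + \alpha - \Big[\tfrac{(1-\varrho\lambda_i+\alpha)}{2}\Big]^2 \\
  &= \Big[\mu - \tfrac{(1-\varrho\lambda_i+\alpha)}{2}\Big]^2 + \alpha - \tfrac{1}{4}[1 - \varrho\lambda_i + \alpha]^2 \\
  &= \Big[\mu - \tfrac{(1-\varrho\lambda_i+\alpha)}{2}\Big]^2 - \tfrac{1}{4}\Big[ [1 - \varrho\lambda_i + \alpha]^2 - 4\alpha \Big].
\end{aligned} \tag{6.285}
\]
Hence, we obtain that for all i ∈ {1, 2, . . . , d} it holds that
\[
\begin{aligned}
  &\big\{ \mu \in \mathbb{C} \colon \mu^2 - (1 - \varrho\lambda_i + \alpha)\mu + \alpha = 0 \big\} \\
  &= \Big\{ \mu \in \mathbb{C} \colon \Big[\mu - \tfrac{(1-\varrho\lambda_i+\alpha)}{2}\Big]^2 = \tfrac{1}{4}\Big[ [1 - \varrho\lambda_i + \alpha]^2 - 4\alpha \Big] \Big\} \\
  &= \Big\{ \tfrac{(1-\varrho\lambda_i+\alpha) + \sqrt{[1-\varrho\lambda_i+\alpha]^2 - 4\alpha}}{2},\ \tfrac{(1-\varrho\lambda_i+\alpha) - \sqrt{[1-\varrho\lambda_i+\alpha]^2 - 4\alpha}}{2} \Big\} \\
  &= \bigcup_{s \in \{-1,1\}} \Big\{ \tfrac{1}{2}\Big[ 1 - \varrho\lambda_i + \alpha + s\sqrt{(1 - \varrho\lambda_i + \alpha)^2 - 4\alpha} \Big] \Big\}.
\end{aligned} \tag{6.286}
\]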

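Combining this, (6.267), and (6.284) demonstrates that
\[
\begin{aligned}
  S &= \{ \mu \in \mathbb{C} \colon \det(A - \mu \operatorname{I}_{2d}) = 0 \} \\
  &= \Big\{ \mu \in \mathbb{C} \colon \Big[ \prod_{i=1}^{d} \big( \mu^2 - (1 - \varrho\lambda_i + \alpha)\mu + \alpha \big) = 0 \Big] \Big\} \\
  &= \bigcup_{i=1}^{d} \big\{ \mu \in \mathbb{C} \colon \mu^2 - (1 - \varrho\lambda_i + \alpha)\mu + \alpha = 0 \big\} \\
  &= \bigcup_{i=1}^{d} \bigcup_{s \in \{-1,1\}} \Big\{ \tfrac{1}{2}\Big[ 1 - \varrho\lambda_i + \alpha + s\sqrt{(1 - \varrho\lambda_i + \alpha)^2 - 4\alpha} \Big] \Big\}.
\end{aligned} \tag{6.287}
\]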

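Moreover, observe that the fact that for all i ∈ {1, 2, . . . , d} it holds that λi ≥ κ and (6.264)
ensure that for all i ∈ {1, 2, . . . , d} it holds that
\[
\begin{aligned}
  1 - \varrho\lambda_i + \alpha
  &\le 1 - \varrho\kappa + \alpha
   = 1 - \Big[\tfrac{4}{(\sqrt{K}+\sqrt{\kappa})^2}\Big]\kappa + \tfrac{(\sqrt{K}-\sqrt{\kappa})^2}{(\sqrt{K}+\sqrt{\kappa})^2} \\
  &= \tfrac{(\sqrt{K}+\sqrt{\kappa})^2 - 4\kappa + (\sqrt{K}-\sqrt{\kappa})^2}{(\sqrt{K}+\sqrt{\kappa})^2}
   = \tfrac{K + 2\sqrt{K}\sqrt{\kappa} + \kappa - 4\kappa + K - 2\sqrt{K}\sqrt{\kappa} + \kappa}{(\sqrt{K}+\sqrt{\kappa})^2} \\
  &= \tfrac{2K - 2\kappa}{(\sqrt{K}+\sqrt{\kappa})^2}
   = \tfrac{2(\sqrt{K}-\sqrt{\kappa})(\sqrt{K}+\sqrt{\kappa})}{(\sqrt{K}+\sqrt{\kappa})^2}
   = 2\Big[\tfrac{\sqrt{K}-\sqrt{\kappa}}{\sqrt{K}+\sqrt{\kappa}}\Big] \ge 0.
\end{aligned} \tag{6.288}
\]
In addition, note that the fact that for all i ∈ {1, 2, . . . , d} it holds that λi ≤ K and (6.264)
assure that for all i ∈ {1, 2, . . . , d} it holds that
\[
\begin{aligned}
  1 - \varrho\lambda_i + \alpha
  &\ge 1 - \varrho K + \alpha
   = 1 - \Big[\tfrac{4}{(\sqrt{K}+\sqrt{\kappa})^2}\Big]K + \tfrac{(\sqrt{K}-\sqrt{\kappa})^2}{(\sqrt{K}+\sqrt{\kappa})^2} \\
  &= \tfrac{(\sqrt{K}+\sqrt{\kappa})^2 - 4K + (\sqrt{K}-\sqrt{\kappa})^2}{(\sqrt{K}+\sqrt{\kappa})^2}
   = \tfrac{K + 2\sqrt{K}\sqrt{\kappa} + \kappa - 4K + K - 2\sqrt{K}\sqrt{\kappa} + \kappa}{(\sqrt{K}+\sqrt{\kappa})^2} \\
  &= \tfrac{-2K + 2\kappa}{(\sqrt{K}+\sqrt{\kappa})^2}
   = -2\Big[\tfrac{K-\kappa}{(\sqrt{K}+\sqrt{\kappa})^2}\Big]
   = -2\Big[\tfrac{(\sqrt{K}-\sqrt{\kappa})(\sqrt{K}+\sqrt{\kappa})}{(\sqrt{K}+\sqrt{\kappa})^2}\Big] \\
  &= -2\Big[\tfrac{\sqrt{K}-\sqrt{\kappa}}{\sqrt{K}+\sqrt{\kappa}}\Big] \le 0.
\end{aligned} \tag{6.289}
\]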
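Combining this, (6.288), and (6.264) implies that for all i ∈ {1, 2, . . . , d} it holds that
\[
  (1 - \varrho\lambda_i + \alpha)^2
  \le \Big[ 2\Big( \tfrac{\sqrt{K}-\sqrt{\kappa}}{\sqrt{K}+\sqrt{\kappa}} \Big) \Big]^2
  = 4\Big[ \tfrac{\sqrt{K}-\sqrt{\kappa}}{\sqrt{K}+\sqrt{\kappa}} \Big]^2
  = 4\alpha. \tag{6.290}
\]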
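This and (6.287) demonstrate that
\[
\begin{aligned}
  \max_{\mu \in S \cup \{0\}} |\mu| = \max_{\mu \in S} |\mu|
  &= \max_{i \in \{1,2,\dots,d\}} \max_{s \in \{-1,1\}} \Big| \tfrac{1}{2}\Big[ 1 - \varrho\lambda_i + \alpha + s\sqrt{(1 - \varrho\lambda_i + \alpha)^2 - 4\alpha} \Big] \Big| \\
  &= \tfrac{1}{2} \max_{i \in \{1,2,\dots,d\}} \max_{s \in \{-1,1\}} \Big| 1 - \varrho\lambda_i + \alpha + s\sqrt{(-1)(4\alpha - [1 - \varrho\lambda_i + \alpha]^2)} \Big| \\
  &= \tfrac{1}{2} \max_{i \in \{1,2,\dots,d\}} \max_{s \in \{-1,1\}} \Big[ \Big| 1 - \varrho\lambda_i + \alpha + s\,\mathbf{i}\sqrt{4\alpha - (1 - \varrho\lambda_i + \alpha)^2} \Big|^2 \Big]^{1/2}.
\end{aligned} \tag{6.291}
\]
Combining this with (6.290) proves that
\[
\begin{aligned}
  \max_{\mu \in S \cup \{0\}} |\mu|
  &= \tfrac{1}{2} \max_{i \in \{1,2,\dots,d\}} \max_{s \in \{-1,1\}} \Big( |1 - \varrho\lambda_i + \alpha|^2 + \Big| s\sqrt{4\alpha - (1 - \varrho\lambda_i + \alpha)^2} \Big|^2 \Big)^{1/2} \\
  &= \tfrac{1}{2} \max_{i \in \{1,2,\dots,d\}} \max_{s \in \{-1,1\}} \Big( (1 - \varrho\lambda_i + \alpha)^2 + 4\alpha - (1 - \varrho\lambda_i + \alpha)^2 \Big)^{1/2} \\
  &= \tfrac{1}{2} [4\alpha]^{1/2} = \sqrt{\alpha}.
\end{aligned} \tag{6.292}
\]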
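Combining (6.277) and (6.282) hence ensures that for all n ∈ N0 it holds that
\[
\begin{aligned}
  \|\Theta_n - \vartheta\|_2
  &\le \sqrt{2}\, \|\xi - \vartheta\|_2 \, |\!|\!| A^n |\!|\!| \\
  &\le \sqrt{2}\, \|\xi - \vartheta\|_2 \Big[ \varepsilon + \max_{\mu \in S \cup \{0\}} |\mu| \Big]^n
     \left[ \max\!\left( \left\{ \tfrac{|\!|\!| A^k |\!|\!|}{(\varepsilon + \max_{\mu \in S \cup \{0\}} |\mu|)^k} \in \mathbb{R} \colon k \in \mathbb{N}_0 \cap [0, m) \right\} \cup \{1\} \right) \right] \\
  &= \sqrt{2}\, \|\xi - \vartheta\|_2 \big[ \varepsilon + \alpha^{1/2} \big]^n
     \left[ \max\!\left( \left\{ \tfrac{|\!|\!| A^k |\!|\!|}{(\varepsilon + \alpha^{1/2})^k} \in \mathbb{R} \colon k \in \mathbb{N}_0 \cap [0, m) \right\} \cup \{1\} \right) \right] \\
  &= \sqrt{2}\, \|\xi - \vartheta\|_2 \Big[ \varepsilon + \tfrac{\sqrt{K}-\sqrt{\kappa}}{\sqrt{K}+\sqrt{\kappa}} \Big]^n
     \left[ \max\!\left( \left\{ \tfrac{|\!|\!| A^k |\!|\!|}{(\varepsilon + \alpha^{1/2})^k} \in \mathbb{R} \colon k \in \mathbb{N}_0 \cap [0, m) \right\} \cup \{1\} \right) \right].
\end{aligned} \tag{6.293}
\]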


This establishes item (ii). The proof of Proposition 6.3.11 is thus complete.
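
The eigenvalue computation in (6.287)–(6.292) can be checked numerically. The following Python sketch is an illustration only; the function name `momentum_eigenvalue_moduli` and the sample values of K, κ, and λ_i are assumptions made for this check. It evaluates both roots of µ² − (1 − ϱλ_i + α)µ + α for several λ_i ∈ [κ, K], with ϱ and α as in (6.264), and compares their moduli with √α.

```python
import cmath
import math


def momentum_eigenvalue_moduli(K, kappa, lambdas):
    """Return alpha from (6.264) and the moduli of all roots in (6.286)."""
    sK, sk = math.sqrt(K), math.sqrt(kappa)
    rho = 4.0 / (sK + sk) ** 2                # varrho in (6.264)
    alpha = ((sK - sk) / (sK + sk)) ** 2      # alpha in (6.264)
    moduli = []
    for lam in lambdas:
        b = 1.0 - rho * lam + alpha
        # Complex square root, since (6.290) gives b^2 - 4 alpha <= 0
        root = cmath.sqrt(b * b - 4.0 * alpha)
        for s in (-1.0, 1.0):
            moduli.append(abs(0.5 * (b + s * root)))
    return alpha, moduli


# Hypothetical sample data: K = 10, kappa = 1, a few lambda_i in [kappa, K]
alpha, moduli = momentum_eigenvalue_moduli(10.0, 1.0, [1.0, 2.5, 7.0, 10.0])
print("sqrt(alpha)    =", math.sqrt(alpha))
print("largest |mu|   =", max(moduli))
print("smallest |mu|  =", min(moduli))
```

Up to rounding, all printed moduli should agree with √α, in line with (6.292).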

6.3.3.3 Comparison of the convergence speeds of GD optimization with and without momentum
In this subsection we provide in Corollary 6.3.13 below a comparison between the convergence
speeds of the plain-vanilla GD optimization method and the momentum GD optimization
method. Our proof of Corollary 6.3.13 employs the auxiliary and elementary estimate
in Lemma 6.3.12 below, the refined error analysis for the plain-vanilla GD optimization
method in Section 6.3.3.1 above (see Lemma 6.3.7 and Lemma 6.3.8 in Section 6.3.3.1), as
well as the error analysis for the momentum GD optimization method in Section 6.3.3.2
above (see Proposition 6.3.11 in Section 6.3.3.2).
Lemma 6.3.12 (Comparison of the convergence rates of the GD optimization method and
the momentum GD optimization method). Let K, κ ∈ (0, ∞) satisfy κ < K. Then
\[
  \frac{\sqrt{K}-\sqrt{\kappa}}{\sqrt{K}+\sqrt{\kappa}} < \frac{K-\kappa}{K+\kappa}. \tag{6.294}
\]
Proof of Lemma 6.3.12. Note that the fact that K − κ > 0 < 2√K√κ ensures that
\[
  \frac{\sqrt{K}-\sqrt{\kappa}}{\sqrt{K}+\sqrt{\kappa}}
  = \frac{(\sqrt{K}-\sqrt{\kappa})(\sqrt{K}+\sqrt{\kappa})}{(\sqrt{K}+\sqrt{\kappa})^2}
  = \frac{K-\kappa}{K+2\sqrt{K}\sqrt{\kappa}+\kappa}
  < \frac{K-\kappa}{K+\kappa}. \tag{6.295}
\]
The proof of Lemma 6.3.12 is thus complete.
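
To get a feeling for the size of the gap in (6.294), the two rates can be tabulated for a few condition numbers. This is a small illustrative sketch; the helper names `gd_rate` and `momentum_rate` and the chosen values of K (with κ = 1) are assumptions and do not appear in the text.

```python
import math


def gd_rate(K, kappa):
    """Right-hand side of (6.294): (K - kappa) / (K + kappa)."""
    return (K - kappa) / (K + kappa)


def momentum_rate(K, kappa):
    """Left-hand side of (6.294): (sqrt(K) - sqrt(kappa)) / (sqrt(K) + sqrt(kappa))."""
    return (math.sqrt(K) - math.sqrt(kappa)) / (math.sqrt(K) + math.sqrt(kappa))


# Hypothetical sample values of K with kappa = 1
for K in (2.0, 10.0, 100.0, 10000.0):
    print(f"K = {K:8.1f}  momentum rate = {momentum_rate(K, 1.0):.6f}  "
          f"GD rate = {gd_rate(K, 1.0):.6f}")
```

The larger the ratio K/κ, the closer both rates are to 1, but the momentum rate approaches 1 markedly more slowly.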


Corollary 6.3.13 (Convergence speed comparisons between the GD optimization method
and the momentum GD optimization method). Let d ∈ N, κ, K, λ1 , λ2 , . . . , λd ∈ (0, ∞), ξ =
(ξ1 , . . . , ξd ), ϑ = (ϑ1 , . . . , ϑd ) ∈ Rd satisfy κ = min{λ1 , λ2 , . . . , λd } < max{λ1 , λ2 , . . . , λd } =
K, let L : Rd → R satisfy for all θ = (θ1 , . . . , θd ) ∈ Rd that
\[
  L(\theta) = \tfrac{1}{2} \Big[ \sum_{i=1}^{d} \lambda_i |\theta_i - \vartheta_i|^2 \Big], \tag{6.296}
\]
for every γ ∈ (0, ∞) let Θ^γ : N0 → Rd satisfy for all n ∈ N that
\[
  \Theta^\gamma_0 = \xi \qquad\text{and}\qquad \Theta^\gamma_n = \Theta^\gamma_{n-1} - \gamma (\nabla L)(\Theta^\gamma_{n-1}), \tag{6.297}
\]
and let M : N0 ∪ {−1} → Rd satisfy for all n ∈ N that M−1 = M0 = ξ and
\[
  M_n = M_{n-1} - \Big[ \tfrac{4}{(\sqrt{K}+\sqrt{\kappa})^2} \Big] (\nabla L)(M_{n-1})
        + \Big[ \tfrac{\sqrt{K}-\sqrt{\kappa}}{\sqrt{K}+\sqrt{\kappa}} \Big]^2 (M_{n-1} - M_{n-2}). \tag{6.298}
\]

Then

(i) there exist γ, C ∈ (0, ∞) such that for all n ∈ N0 it holds that
\[
  \|\Theta^\gamma_n - \vartheta\|_2 \le C \Big[ \tfrac{K-\kappa}{K+\kappa} \Big]^n, \tag{6.299}
\]

(ii) it holds for all γ ∈ (0, ∞), n ∈ N0 that
\[
  \|\Theta^\gamma_n - \vartheta\|_2 \ge \min\{|\xi_1 - \vartheta_1|, \dots, |\xi_d - \vartheta_d|\} \Big[ \tfrac{K-\kappa}{K+\kappa} \Big]^n, \tag{6.300}
\]

(iii) for every ε ∈ (0, ∞) there exists C ∈ (0, ∞) such that for all n ∈ N0 it holds that
\[
  \|M_n - \vartheta\|_2 \le C \Big[ \tfrac{\sqrt{K}-\sqrt{\kappa}}{\sqrt{K}+\sqrt{\kappa}} + \varepsilon \Big]^n, \tag{6.301}
\]

and

(iv) it holds that $\frac{\sqrt{K}-\sqrt{\kappa}}{\sqrt{K}+\sqrt{\kappa}} < \frac{K-\kappa}{K+\kappa}$

(cf. Definition 3.3.4).


Proof of Corollary 6.3.13. First, note that Lemma 6.3.7 proves item (i). Next observe that
Lemma 6.3.8 establishes item (ii). In addition, note that Proposition 6.3.11 proves item (iii).
Finally, observe that Lemma 6.3.12 establishes item (iv). The proof of Corollary 6.3.13 is
thus complete.
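
The comparison in Corollary 6.3.13 can be observed on a small example. The following Python sketch runs the plain-vanilla GD recursion (6.297) and the momentum recursion (6.298) side by side on the quadratic objective (6.296). It is an illustration under stated assumptions: the function names, the choice γ = 2/(K + κ) for the GD learning rate, and the sample data λ = (1, 10), ξ = (5, 3), ϑ = (1, 1) are picked for this sketch only.

```python
import math


def grad(theta, lambdas, vartheta):
    """Gradient of the quadratic objective (6.296)."""
    return [lam * (t - v) for lam, t, v in zip(lambdas, theta, vartheta)]


def dist(a, b):
    """Euclidean distance between two points given as lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def compare(lambdas, xi, vartheta, steps):
    """Return the errors of GD (6.297) and momentum GD (6.298) after `steps` steps."""
    K, kappa = max(lambdas), min(lambdas)
    sK, sk = math.sqrt(K), math.sqrt(kappa)
    gamma = 2.0 / (K + kappa)              # assumed GD learning rate
    rho = 4.0 / (sK + sk) ** 2             # gradient coefficient in (6.298)
    alpha = ((sK - sk) / (sK + sk)) ** 2   # momentum coefficient in (6.298)

    theta = list(xi)
    m_prev, m_cur = list(xi), list(xi)     # M_{-1} = M_0 = xi
    for _ in range(steps):
        g = grad(theta, lambdas, vartheta)
        theta = [t - gamma * gi for t, gi in zip(theta, g)]
        g = grad(m_cur, lambdas, vartheta)
        m_next = [c - rho * gi + alpha * (c - p)
                  for c, gi, p in zip(m_cur, g, m_prev)]
        m_prev, m_cur = m_cur, m_next
    return dist(theta, vartheta), dist(m_cur, vartheta)


err_gd, err_momentum = compare([1.0, 10.0], [5.0, 3.0], [1.0, 1.0], steps=50)
print("GD error after 50 steps:      ", err_gd)
print("momentum error after 50 steps:", err_momentum)
```

With these sample values the momentum error should come out several orders of magnitude below the GD error, which is consistent with items (i)–(iv).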
Corollary 6.3.13 above, roughly speaking, shows in the case of the considered class
of quadratic objective functions that the momentum GD optimization method in (6.298)
outperforms the classical plain-vanilla GD optimization method (and, in particular, the
classical plain-vanilla GD optimization method in (6.224) in Lemma 6.3.7 above) provided
that the parameters λ1 , λ2 , . . . , λd ∈ (0, ∞) in the objective function in (6.296) satisfy the
assumption that

min{λ1 , . . . , λd } < max{λ1 , . . . , λd }. (6.302)

The next elementary result, Lemma 6.3.14 below, demonstrates that the momentum GD
optimization method in (6.298) and the plain-vanilla GD optimization method in (6.224)
in Lemma 6.3.7 above coincide in the case where min{λ1 , . . . , λd } = max{λ1 , . . . , λd }.
Lemma 6.3.14 (Concurrence of the GD optimization method and the momentum GD
optimization method). Let d ∈ N, ξ, ϑ ∈ Rd , α ∈ (0, ∞), let L : Rd → R satisfy for all
θ ∈ Rd that
\[
  L(\theta) = \tfrac{\alpha}{2} \|\theta - \vartheta\|_2^2, \tag{6.303}
\]
let Θ : N0 → Rd satisfy for all n ∈ N that
\[
  \Theta_0 = \xi \qquad\text{and}\qquad \Theta_n = \Theta_{n-1} - \tfrac{2}{(\alpha+\alpha)} (\nabla L)(\Theta_{n-1}), \tag{6.304}
\]
and let M : N0 ∪ {−1} → Rd satisfy for all n ∈ N that M−1 = M0 = ξ and
\[
  M_n = M_{n-1} - \Big[ \tfrac{4}{(\sqrt{\alpha}+\sqrt{\alpha})^2} \Big] (\nabla L)(M_{n-1})
        + \Big[ \tfrac{\sqrt{\alpha}-\sqrt{\alpha}}{\sqrt{\alpha}+\sqrt{\alpha}} \Big]^2 (M_{n-1} - M_{n-2}) \tag{6.305}
\]

(cf. Definition 3.3.4). Then


(i) it holds that M|N0 : N0 → Rd is the momentum GD process for the objective function L
with learning rates N ∋ n ↦ 1/α ∈ [0, ∞), momentum decay factors N ∋ n ↦ 0 ∈ [0, 1],
and initial value ξ,

(ii) it holds for all n ∈ N0 that Mn = Θn , and

(iii) it holds for all n ∈ N that Θn = ϑ = Mn


(cf. Definition 6.3.1).
Proof of Lemma 6.3.14. First, note that (6.305) implies that for all n ∈ N it holds that
\[
  M_n = M_{n-1} - \tfrac{4}{(2\sqrt{\alpha})^2} (\nabla L)(M_{n-1}) = M_{n-1} - \tfrac{1}{\alpha} (\nabla L)(M_{n-1}). \tag{6.306}
\]
Combining this with the assumption that M0 = ξ establishes item (i). Next note that
(6.304) ensures that for all n ∈ N it holds that
\[
  \Theta_n = \Theta_{n-1} - \tfrac{1}{\alpha} (\nabla L)(\Theta_{n-1}). \tag{6.307}
\]
Combining this with (6.306) and the assumption that Θ0 = ξ = M0 proves item (ii).
Furthermore, observe that Lemma 5.6.4 assures that for all θ ∈ Rd it holds that
\[
  (\nabla L)(\theta) = \tfrac{\alpha}{2} (2(\theta - \vartheta)) = \alpha(\theta - \vartheta). \tag{6.308}
\]
Next we claim that for all n ∈ N it holds that
\[
  \Theta_n = \vartheta. \tag{6.309}
\]
We now prove (6.309) by induction on n ∈ N. For the base case n = 1 note that (6.307)
and (6.308) imply that
\[
  \Theta_1 = \Theta_0 - \tfrac{1}{\alpha} (\nabla L)(\Theta_0) = \xi - \tfrac{1}{\alpha} (\alpha(\xi - \vartheta)) = \xi - (\xi - \vartheta) = \vartheta. \tag{6.310}
\]
This establishes (6.309) in the base case n = 1. For the induction step observe that (6.307)
and (6.308) assure that for all n ∈ N with Θn = ϑ it holds that
\[
  \Theta_{n+1} = \Theta_n - \tfrac{1}{\alpha} (\nabla L)(\Theta_n) = \vartheta - \tfrac{1}{\alpha} (\alpha(\vartheta - \vartheta)) = \vartheta. \tag{6.311}
\]

Induction thus proves (6.309). Combining (6.309) and item (ii) establishes item (iii). The
proof of Lemma 6.3.14 is thus complete.


6.3.4 Numerical comparisons for GD optimization with and without momentum
In this subsection we provide in Example 6.3.15, Source code 6.1, and Figure 6.1 a numerical
comparison of the plain-vanilla GD optimization method and the momentum GD optimiza-
tion method in the case of the specific quadratic optimization problem in (6.312)–(6.313)
below.
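Example 6.3.15. Let K = 10, κ = 1, ϑ = (ϑ1 , ϑ2 ) ∈ R2 , ξ = (ξ1 , ξ2 ) ∈ R2 satisfy
\[
  \vartheta = \begin{pmatrix} \vartheta_1 \\ \vartheta_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix}
  \qquad\text{and}\qquad
  \xi = \begin{pmatrix} \xi_1 \\ \xi_2 \end{pmatrix} = \begin{pmatrix} 5 \\ 3 \end{pmatrix}, \tag{6.312}
\]
let L : R2 → R satisfy for all θ = (θ1 , θ2 ) ∈ R2 that
\[
  L(\theta) = \big(\tfrac{\kappa}{2}\big) |\theta_1 - \vartheta_1|^2 + \big(\tfrac{K}{2}\big) |\theta_2 - \vartheta_2|^2, \tag{6.313}
\]
let Θ : N0 → R2 satisfy for all n ∈ N that Θ0 = ξ and
\[
\begin{aligned}
  \Theta_n &= \Theta_{n-1} - \tfrac{2}{(K+\kappa)} (\nabla L)(\Theta_{n-1}) = \Theta_{n-1} - \tfrac{2}{11} (\nabla L)(\Theta_{n-1}) \\
  &= \Theta_{n-1} - 0.\overline{18}\, (\nabla L)(\Theta_{n-1}) \approx \Theta_{n-1} - 0.18\, (\nabla L)(\Theta_{n-1}),
\end{aligned} \tag{6.314}
\]
and let M : N0 → R2 and m : N0 → R2 satisfy for all n ∈ N that M0 = ξ, m0 = 0,
Mn = Mn−1 − 0.3 mn , and
\[
\begin{aligned}
  m_n &= 0.5\, m_{n-1} + (1 - 0.5)\, (\nabla L)(M_{n-1}) \\
  &= 0.5\, (m_{n-1} + (\nabla L)(M_{n-1})).
\end{aligned} \tag{6.315}
\]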
Then
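(i) it holds for all θ = (θ1 , θ2 ) ∈ R2 that
\[
  (\nabla L)(\theta) = \begin{pmatrix} \kappa(\theta_1 - \vartheta_1) \\ K(\theta_2 - \vartheta_2) \end{pmatrix}
  = \begin{pmatrix} \theta_1 - 1 \\ 10\,(\theta_2 - 1) \end{pmatrix}, \tag{6.316}
\]

(ii) it holds that
\[
  \Theta_0 = \begin{pmatrix} 5 \\ 3 \end{pmatrix}, \tag{6.317}
\]
\[
\begin{aligned}
  \Theta_1 &= \Theta_0 - \tfrac{2}{11} (\nabla L)(\Theta_0) \approx \Theta_0 - 0.18\, (\nabla L)(\Theta_0) \\
  &= \begin{pmatrix} 5 \\ 3 \end{pmatrix} - 0.18 \begin{pmatrix} 5 - 1 \\ 10(3 - 1) \end{pmatrix}
   = \begin{pmatrix} 5 - 0.18 \cdot 4 \\ 3 - 0.18 \cdot 10 \cdot 2 \end{pmatrix} \\
  &= \begin{pmatrix} 5 - 0.72 \\ 3 - 3.6 \end{pmatrix} = \begin{pmatrix} 4.28 \\ -0.6 \end{pmatrix},
\end{aligned} \tag{6.318}
\]
\[
\begin{aligned}
  \Theta_2 &\approx \Theta_1 - 0.18\, (\nabla L)(\Theta_1)
   = \begin{pmatrix} 4.28 \\ -0.6 \end{pmatrix} - 0.18 \begin{pmatrix} 4.28 - 1 \\ 10(-0.6 - 1) \end{pmatrix} \\
  &= \begin{pmatrix} 4.28 - 0.18 \cdot 3.28 \\ -0.6 - 0.18 \cdot 10 \cdot (-1.6) \end{pmatrix}
   = \begin{pmatrix} 4.10 - 0.18 \cdot 2 - 0.18 \cdot 0.28 \\ -0.6 + 1.8 \cdot 1.6 \end{pmatrix} \\
  &= \begin{pmatrix} 4.10 - 0.36 - 2 \cdot 9 \cdot 4 \cdot 7 \cdot 10^{-4} \\ -0.6 + 1.6 \cdot 1.6 + 0.2 \cdot 1.6 \end{pmatrix}
   = \begin{pmatrix} 3.74 - 9 \cdot 56 \cdot 10^{-4} \\ -0.6 + 2.56 + 0.32 \end{pmatrix} \\
  &= \begin{pmatrix} 3.74 - 504 \cdot 10^{-4} \\ 2.88 - 0.6 \end{pmatrix}
   = \begin{pmatrix} 3.6896 \\ 2.28 \end{pmatrix} \approx \begin{pmatrix} 3.69 \\ 2.28 \end{pmatrix},
\end{aligned} \tag{6.319}
\]
\[
\begin{aligned}
  \Theta_3 &\approx \Theta_2 - 0.18\, (\nabla L)(\Theta_2)
   \approx \begin{pmatrix} 3.69 \\ 2.28 \end{pmatrix} - 0.18 \begin{pmatrix} 3.69 - 1 \\ 10(2.28 - 1) \end{pmatrix} \\
  &= \begin{pmatrix} 3.69 - 0.18 \cdot 2.69 \\ 2.28 - 0.18 \cdot 10 \cdot 1.28 \end{pmatrix}
   = \begin{pmatrix} 3.69 - 0.2 \cdot 2.69 + 0.02 \cdot 2.69 \\ 2.28 - 1.8 \cdot 1.28 \end{pmatrix} \\
  &= \begin{pmatrix} 3.69 - 0.538 + 0.0538 \\ 2.28 - 1.28 - 0.8 \cdot 1.28 \end{pmatrix}
   = \begin{pmatrix} 3.7438 - 0.538 \\ 1 - 1.28 + 0.2 \cdot 1.28 \end{pmatrix} \\
  &= \begin{pmatrix} 3.2058 \\ 0.256 - 0.280 \end{pmatrix}
   = \begin{pmatrix} 3.2058 \\ -0.024 \end{pmatrix} \approx \begin{pmatrix} 3.21 \\ -0.02 \end{pmatrix},
\end{aligned} \tag{6.320}
\]
\[
  \vdots
\]
and

(iii) it holds that
\[
  M_0 = \begin{pmatrix} 5 \\ 3 \end{pmatrix}, \tag{6.321}
\]
\[
\begin{aligned}
  m_1 &= 0.5\, (m_0 + (\nabla L)(M_0)) = 0.5 \left( \begin{pmatrix} 0 \\ 0 \end{pmatrix} + \begin{pmatrix} 5 - 1 \\ 10(3 - 1) \end{pmatrix} \right) \\
  &= \begin{pmatrix} 0.5\,(0 + 4) \\ 0.5\,(0 + 10 \cdot 2) \end{pmatrix} = \begin{pmatrix} 2 \\ 10 \end{pmatrix},
\end{aligned} \tag{6.322}
\]
\[
  M_1 = M_0 - 0.3\, m_1 = \begin{pmatrix} 5 \\ 3 \end{pmatrix} - 0.3 \begin{pmatrix} 2 \\ 10 \end{pmatrix} = \begin{pmatrix} 4.4 \\ 0 \end{pmatrix}, \tag{6.323}
\]
\[
\begin{aligned}
  m_2 &= 0.5\, (m_1 + (\nabla L)(M_1)) = 0.5 \left( \begin{pmatrix} 2 \\ 10 \end{pmatrix} + \begin{pmatrix} 4.4 - 1 \\ 10(0 - 1) \end{pmatrix} \right) \\
  &= \begin{pmatrix} 0.5\,(2 + 3.4) \\ 0.5\,(10 - 10) \end{pmatrix} = \begin{pmatrix} 2.7 \\ 0 \end{pmatrix},
\end{aligned} \tag{6.324}
\]
\[
  M_2 = M_1 - 0.3\, m_2 = \begin{pmatrix} 4.4 \\ 0 \end{pmatrix} - 0.3 \begin{pmatrix} 2.7 \\ 0 \end{pmatrix}
  = \begin{pmatrix} 4.4 - 0.81 \\ 0 \end{pmatrix} = \begin{pmatrix} 3.59 \\ 0 \end{pmatrix}, \tag{6.325}
\]
\[
\begin{aligned}
  m_3 &= 0.5\, (m_2 + (\nabla L)(M_2)) = 0.5 \left( \begin{pmatrix} 2.7 \\ 0 \end{pmatrix} + \begin{pmatrix} 3.59 - 1 \\ 10(0 - 1) \end{pmatrix} \right) \\
  &= \begin{pmatrix} 0.5\,(2.7 + 2.59) \\ 0.5\,(0 - 10) \end{pmatrix} = \begin{pmatrix} 0.5 \cdot 5.29 \\ 0.5(-10) \end{pmatrix} \\
  &= \begin{pmatrix} 2.5 + 0.145 \\ -5 \end{pmatrix} = \begin{pmatrix} 2.645 \\ -5 \end{pmatrix} \approx \begin{pmatrix} 2.65 \\ -5 \end{pmatrix},
\end{aligned} \tag{6.326}
\]
\[
\begin{aligned}
  M_3 &= M_2 - 0.3\, m_3 \approx \begin{pmatrix} 3.59 \\ 0 \end{pmatrix} - 0.3 \begin{pmatrix} 2.65 \\ -5 \end{pmatrix} \\
  &= \begin{pmatrix} 3.59 - 0.795 \\ 1.5 \end{pmatrix} = \begin{pmatrix} 3 - 0.205 \\ 1.5 \end{pmatrix}
   = \begin{pmatrix} 2.795 \\ 1.5 \end{pmatrix} \approx \begin{pmatrix} 2.8 \\ 1.5 \end{pmatrix},
\end{aligned} \tag{6.327}
\]
\[
  \vdots
\]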

    # Example for GD and momentum GD

    import numpy as np
    import matplotlib.pyplot as plt

    # Number of steps for the schemes
    N = 8

    # Problem setting
    d = 2
    K = [1., 10.]

    vartheta = np.array([1., 1.])
    xi = np.array([5., 3.])

    def f(x, y):
        result = K[0] / 2. * np.abs(x - vartheta[0])**2 \
                 + K[1] / 2. * np.abs(y - vartheta[1])**2
        return result

    def nabla_f(x):
        return K * (x - vartheta)

    # Coefficients for GD
    gamma_GD = 2 / (K[0] + K[1])

    # Coefficients for momentum
    gamma_momentum = 0.3
    alpha = 0.5

    # Placeholder for processes
    Theta = np.zeros((N+1, d))
    M = np.zeros((N+1, d))
    m = np.zeros((N+1, d))

    Theta[0] = xi
    M[0] = xi

    # Perform gradient descent
    for i in range(N):
        Theta[i+1] = Theta[i] - gamma_GD * nabla_f(Theta[i])

    # Perform momentum GD
    for i in range(N):
        m[i+1] = alpha * m[i] + (1 - alpha) * nabla_f(M[i])
        M[i+1] = M[i] - gamma_momentum * m[i+1]


    ### Plot ###
    plt.figure()

    # Plot the gradient descent process
    plt.plot(Theta[:, 0], Theta[:, 1],
             label="GD", color="c",
             linestyle="--", marker="*")

    # Plot the momentum gradient descent process
    plt.plot(M[:, 0], M[:, 1],
             label="Momentum", color="orange", marker="*")

    # Target value
    plt.scatter(vartheta[0], vartheta[1],
                label="vartheta", color="red", marker="x")

    # Plot contour lines of f
    x = np.linspace(-3., 7., 100)
    y = np.linspace(-2., 4., 100)
    X, Y = np.meshgrid(x, y)
    Z = f(X, Y)
    cp = plt.contour(X, Y, Z, colors="black",
                     levels=[0.5, 2, 4, 8, 16],
                     linestyles=":")

    plt.legend()
    plt.savefig("../plots/GD_momentum_plots.pdf")

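Source code 6.1 (code/example_GD_momentum_plots.py): Python code for Figure 6.1

[Figure 6.1: trajectories of the GD process and the Momentum process together with the target point vartheta, drawn over contour lines of the objective function]

Figure 6.1 (plots/GD_momentum_plots.pdf): Result of a call of Python code 6.1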

Exercise 6.3.3. Let (γn )n∈N ⊆ [0, ∞), (αn )n∈N ⊆ [0, 1] satisfy for all n ∈ N that γn = 1/n and
αn = 1/2, let L : R → R satisfy for all θ ∈ R that L(θ) = θ², and let Θ be the momentum
GD process for the objective function L with learning rates (γn )n∈N , momentum decay
factors (αn )n∈N , and initial value 1 (cf. Definition 6.3.1). Specify Θ1 , Θ2 , Θ3 , and Θ4
explicitly and prove that your results are correct!


6.4 GD optimization with Nesterov momentum


In this section we review the Nesterov accelerated GD optimization method, which was
first introduced in Nesterov [302] (cf., for instance, Sutskever et al. [387]). The Nesterov
accelerated GD optimization method can be viewed as building on the momentum GD
optimization method (see Definition 6.3.1) by attempting to provide some kind of foresight
to the scheme. A similar perspective is to see the Nesterov accelerated GD optimization
method as a combination of the momentum GD optimization method (see Definition 6.3.1)
and the explicit midpoint GD optimization method (see Section 6.2).
Definition 6.4.1 (Nesterov accelerated GD optimization method). Let d ∈ N, (γn )n∈N ⊆
[0, ∞), (αn )n∈N ⊆ [0, 1], ξ ∈ Rd and let L : Rd → R and G : Rd → Rd satisfy for all
U ∈ {V ⊆ Rd : V is open}, θ ∈ U with L|U ∈ C 1 (U, Rd ) that
G(θ) = (∇L)(θ). (6.328)
Then we say that Θ is the Nesterov accelerated GD process for the objective function L with
generalized gradient G, learning rates (γn )n∈N , momentum decay factors (αn )n∈N , and initial
value ξ (we say that Θ is the Nesterov accelerated GD process for the objective function L
with learning rates (γn )n∈N , momentum decay factors (αn )n∈N , and initial value ξ) if and
only if it holds that Θ : N0 → Rd is the function from N0 to Rd which satisfies that there
exists m : N0 → Rd such that for all n ∈ N it holds that
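\[
  \Theta_0 = \xi, \qquad m_0 = 0, \tag{6.329}
\]
\[
  m_n = \alpha_n m_{n-1} + (1 - \alpha_n)\, G(\Theta_{n-1} - \gamma_n \alpha_n m_{n-1}), \tag{6.330}
\]
\[
  \text{and}\qquad \Theta_n = \Theta_{n-1} - \gamma_n m_n. \tag{6.331}
\]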
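
The recursion (6.329)–(6.331) translates directly into a short loop. The following Python sketch is an illustration only: the function name `nesterov_gd`, the list-based representation of vectors, and the test objective (the quadratic from Example 6.3.15 with γ_n = 0.3 and α_n = 0.5) are assumptions chosen for this sketch.

```python
def nesterov_gd(grad, xi, gammas, alphas):
    """Iterate (6.329)-(6.331); gammas[n-1] and alphas[n-1] hold gamma_n and alpha_n."""
    theta = list(xi)                 # Theta_0 = xi
    m = [0.0] * len(theta)           # m_0 = 0
    path = [list(theta)]
    for gamma, alpha in zip(gammas, alphas):
        # Gradient is evaluated at the look-ahead point Theta_{n-1} - gamma_n alpha_n m_{n-1}
        lookahead = [t - gamma * alpha * mi for t, mi in zip(theta, m)]
        g = grad(lookahead)
        m = [alpha * mi + (1.0 - alpha) * gi for mi, gi in zip(m, g)]
        theta = [t - gamma * mi for t, mi in zip(theta, m)]
        path.append(list(theta))
    return path


def grad_L(theta):
    """Gradient of the objective in (6.313) with kappa = 1, K = 10, vartheta = (1, 1)."""
    return [1.0 * (theta[0] - 1.0), 10.0 * (theta[1] - 1.0)]


path = nesterov_gd(grad_L, [5.0, 3.0], [0.3] * 8, [0.5] * 8)
for n, point in enumerate(path):
    print(n, point)
```

Since m_0 = 0, the first step coincides with the first momentum GD step; the look-ahead evaluation only changes the iteration from the second step onwards.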

6.5 Adagrad GD optimization (Adagrad)


In this section we review the Adagrad GD optimization method. Roughly speaking, the idea
of the Adagrad GD optimization method is to modify the plain-vanilla GD optimization
method by adapting the learning rates separately for every component of the optimization
process. The name Adagrad is derived from adaptive subgradient method and was first
presented in Duchi et al. [117] in the context of stochastic optimization. For pedagogical
purposes we present in this section a deterministic version of Adagrad optimization and we
refer to Section 7.6 below for the original stochastic version of Adagrad optimization.
Definition 6.5.1 (Adagrad GD optimization method). Let d ∈ N, (γn )n∈N ⊆ [0, ∞),
ε ∈ (0, ∞), ξ ∈ Rd and let L : Rd → R and G = (G1 , . . . , Gd ) : Rd → Rd satisfy for all
U ∈ {V ⊆ Rd : V is open}, θ ∈ U with L|U ∈ C 1 (U, Rd ) that
G(θ) = (∇L)(θ). (6.332)


Then we say that Θ is the Adagrad GD process for the objective function L with generalized
gradient G, learning rates (γn )n∈N , regularizing factor ε, and initial value ξ (we say that Θ is
the Adagrad GD process for the objective function L with learning rates (γn )n∈N , regularizing
factor ε, and initial value ξ) if and only if it holds that Θ = (Θ(1) , . . . , Θ(d) ) : N0 → Rd is
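the function from N0 to Rd which satisfies for all n ∈ N, i ∈ {1, 2, . . . , d} that
\[
  \Theta_0 = \xi \qquad\text{and}\qquad
  \Theta_n^{(i)} = \Theta_{n-1}^{(i)} - \gamma_n \Big[ \varepsilon + \sum_{k=0}^{n-1} |G_i(\Theta_k)|^2 \Big]^{-1/2} G_i(\Theta_{n-1}). \tag{6.333}
\]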
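
As a minimal sketch of (6.333), the following Python function keeps, for each component, the running sum of squared partial derivatives and rescales the step accordingly. The function name `adagrad_gd`, the list-based vectors, and the sample objective (the quadratic from Example 6.3.15) with γ_n = 0.5 and ε = 10⁻⁸ are assumptions made for illustration.

```python
import math


def adagrad_gd(grad, xi, gammas, eps):
    """Iterate (6.333); gammas[n-1] holds gamma_n and eps is the regularizing factor."""
    theta = list(xi)                      # Theta_0 = xi
    sq_sum = [0.0] * len(theta)           # running sums of |G_i(Theta_k)|^2
    path = [list(theta)]
    for gamma in gammas:
        g = grad(theta)
        # The sum in (6.333) runs over k = 0, ..., n-1 and includes the current gradient
        sq_sum = [s + gi * gi for s, gi in zip(sq_sum, g)]
        theta = [t - gamma * gi / math.sqrt(eps + s)
                 for t, gi, s in zip(theta, g, sq_sum)]
        path.append(list(theta))
    return path


def grad_L(theta):
    """Gradient of the objective in (6.313) with kappa = 1, K = 10, vartheta = (1, 1)."""
    return [1.0 * (theta[0] - 1.0), 10.0 * (theta[1] - 1.0)]


path = adagrad_gd(grad_L, [5.0, 3.0], [0.5] * 5, 1e-8)
for n, point in enumerate(path):
    print(n, point)
```

For small ε the first step moves every component by approximately γ_1 regardless of the size of the partial derivative, which illustrates the component-wise adaptation of the learning rates.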

6.6 Root mean square propagation GD optimization (RMSprop)
In this section we review the RMSprop GD optimization method. Roughly speaking, the
RMSprop GD optimization method is a modification of the Adagrad GD optimization
method where the sum over the squares of previous partial derivatives of the objective
function (cf. (6.333) in Definition 6.5.1) is replaced by an exponentially decaying average over
the squares of previous partial derivatives of the objective function (cf. (6.335) and (6.336)
in Definition 6.6.1). RMSprop optimization was introduced by Geoffrey Hinton in his
coursera class on Neural Networks for Machine Learning (see Hinton et al. [199]) in the
context of stochastic optimization. As in the case of Adagrad optimization, we present
for pedagogical purposes first a deterministic version of RMSprop optimization in this
section and we refer to Section 7.7 below for the original stochastic version of RMSprop
optimization.

Definition 6.6.1 (RMSprop GD optimization method). Let d ∈ N, (γn )n∈N ⊆ [0, ∞),
(βn )n∈N ⊆ [0, 1], ε ∈ (0, ∞), ξ ∈ Rd and let L : Rd → R and G = (G1 , . . . , Gd ) : Rd → Rd
satisfy for all U ∈ {V ⊆ Rd : V is open}, θ ∈ U with L|U ∈ C 1 (U, Rd ) that

G(θ) = (∇L)(θ). (6.334)

Then we say that Θ is the RMSprop GD process for the objective function L with generalized
gradient G, learning rates (γn )n∈N , second moment decay factors (βn )n∈N , regularizing factor
ε, and initial value ξ (we say that Θ is the RMSprop GD process for the objective function
L with learning rates (γn )n∈N , second moment decay factors (βn )n∈N , regularizing factor ε,
and initial value ξ) if and only if it holds that Θ = (Θ(1) , . . . , Θ(d) ) : N0 → Rd is the function
from N0 to Rd which satisfies that there exists M = (M(1) , . . . , M(d) ) : N0 → Rd such that
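for all n ∈ N, i ∈ {1, 2, . . . , d} it holds that
\[
  \Theta_0 = \xi, \qquad \mathbb{M}_0 = 0, \qquad
  \mathbb{M}_n^{(i)} = \beta_n \mathbb{M}_{n-1}^{(i)} + (1 - \beta_n) |G_i(\Theta_{n-1})|^2, \tag{6.335}
\]
\[
  \text{and}\qquad \Theta_n^{(i)} = \Theta_{n-1}^{(i)} - \gamma_n \big[ \varepsilon + \mathbb{M}_n^{(i)} \big]^{-1/2} G_i(\Theta_{n-1}). \tag{6.336}
\]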
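
The two-line recursion (6.335)–(6.336) can be sketched in Python as follows. The function name `rmsprop_gd`, the list-based vectors, and the sample objective and parameter values (γ_n = 0.1, β_n = 0.9, ε = 10⁻⁸) are assumptions for illustration and are not prescribed by Definition 6.6.1.

```python
import math


def rmsprop_gd(grad, xi, gammas, betas, eps):
    """Iterate (6.335)-(6.336); gammas[n-1], betas[n-1] hold gamma_n, beta_n."""
    theta = list(xi)                     # Theta_0 = xi
    second = [0.0] * len(theta)          # M_0 = 0
    path = [list(theta)]
    for gamma, beta in zip(gammas, betas):
        g = grad(theta)
        # Exponentially decaying average of squared partial derivatives, (6.335)
        second = [beta * s + (1.0 - beta) * gi * gi for s, gi in zip(second, g)]
        # Component-wise rescaled gradient step, (6.336)
        theta = [t - gamma * gi / math.sqrt(eps + s)
                 for t, gi, s in zip(theta, g, second)]
        path.append(list(theta))
    return path


def grad_L(theta):
    """Gradient of the objective in (6.313) with kappa = 1, K = 10, vartheta = (1, 1)."""
    return [1.0 * (theta[0] - 1.0), 10.0 * (theta[1] - 1.0)]


path = rmsprop_gd(grad_L, [5.0, 3.0], [0.1] * 5, [0.9] * 5, 1e-8)
for n, point in enumerate(path):
    print(n, point)
```

In the first step the average equals (1 − β_1)|G_i(Θ_0)|², so for small ε each component moves by roughly γ_1 (1 − β_1)^{−1/2}; this initial inflation is what the bias adjustment in Section 6.6.2 removes.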


6.6.1 Representations of the mean square terms in RMSprop


Lemma 6.6.2 (On a representation of the second order terms in RMSprop). Let d ∈ N,
(γn )n∈N ⊆ [0, ∞), (βn )n∈N ⊆ [0, 1], (bn,k )(n,k)∈(N0 )2 ⊆ R, ε ∈ (0, ∞), ξ ∈ Rd satisfy for all
n ∈ N, k ∈ {0, 1, . . . , n − 1} that
\[
  b_{n,k} = (1 - \beta_{k+1}) \Big[ \prod_{l=k+2}^{n} \beta_l \Big], \tag{6.337}
\]

let L : Rd → R and G = (G1 , . . . , Gd ) : Rd → Rd satisfy for all U ∈ {V ⊆ Rd : V is open},


θ ∈ U with L|U ∈ C 1 (U, Rd ) that

G(θ) = (∇L)(θ), (6.338)

and let Θ = (Θ(1) , . . . , Θ(d) ) : N0 → Rd be the RMSprop GD process for the objective function
L with generalized gradient G, learning rates (γn )n∈N , second moment decay factors (βn )n∈N ,
regularizing factor ε, and initial value ξ (cf. Definition 6.6.1). Then

(i) it holds for all n ∈ N, k ∈ {0, 1, . . . , n − 1} that 0 ≤ bn,k ≤ 1,

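(ii) it holds for all n ∈ N that
\[
  \sum_{k=0}^{n-1} b_{n,k} = 1 - \prod_{k=1}^{n} \beta_k, \tag{6.339}
\]
and

(iii) it holds for all n ∈ N, i ∈ {1, 2, . . . , d} that
\[
  \Theta_n^{(i)} = \Theta_{n-1}^{(i)} - \gamma_n \Big[ \varepsilon + \sum_{k=0}^{n-1} b_{n,k} |G_i(\Theta_k)|^2 \Big]^{-1/2} G_i(\Theta_{n-1}). \tag{6.340}
\]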

Proof of Lemma 6.6.2. Throughout this proof, let M = (M(1) , . . . , M(d) ) : N0 → Rd satisfy
for all n ∈ N, i ∈ {1, 2, . . . , d} that M0^(i) = 0 and
\[
  \mathbb{M}_n^{(i)} = \beta_n \mathbb{M}_{n-1}^{(i)} + (1 - \beta_n) |G_i(\Theta_{n-1})|^2. \tag{6.341}
\]
Note that (6.337) implies item (i). Furthermore, observe that (6.337), (6.341), and
Lemma 6.3.3 assure that for all n ∈ N, i ∈ {1, 2, . . . , d} it holds that
\[
  \mathbb{M}_n^{(i)} = \sum_{k=0}^{n-1} b_{n,k} |G_i(\Theta_k)|^2
  \qquad\text{and}\qquad
  \sum_{k=0}^{n-1} b_{n,k} = 1 - \prod_{k=1}^{n} \beta_k. \tag{6.342}
\]
This proves item (ii). Moreover, note that (6.335), (6.336), (6.341), and (6.342) demonstrate
that for all n ∈ N, i ∈ {1, 2, . . . , d} it holds that
\[
\begin{aligned}
  \Theta_n^{(i)} &= \Theta_{n-1}^{(i)} - \gamma_n \big[ \varepsilon + \mathbb{M}_n^{(i)} \big]^{-1/2} G_i(\Theta_{n-1}) \\
  &= \Theta_{n-1}^{(i)} - \gamma_n \Big[ \varepsilon + \sum_{k=0}^{n-1} b_{n,k} |G_i(\Theta_k)|^2 \Big]^{-1/2} G_i(\Theta_{n-1}).
\end{aligned} \tag{6.343}
\]
This establishes item (iii). The proof of Lemma 6.6.2 is thus complete.
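
Items (ii) and (iii) of Lemma 6.6.2 lend themselves to a quick numerical sanity check. In the following Python sketch the helper names `rmsprop_weights` and `second_moment` as well as the sample decay factors and squared gradients are hypothetical and serve only to illustrate the identities (6.339) and (6.342).

```python
import math


def rmsprop_weights(betas):
    """Weights b_{n,k}, k = 0, ..., n-1, from (6.337) for n = len(betas).

    betas[j-1] holds beta_j, so betas[k] is beta_{k+1} and betas[k+1:n]
    collects beta_{k+2}, ..., beta_n (an empty product equals 1).
    """
    n = len(betas)
    return [(1.0 - betas[k]) * math.prod(betas[k + 1:n]) for k in range(n)]


def second_moment(betas, sq_grads):
    """M_n from the recursion (6.341) with M_0 = 0; sq_grads[k] stands for |G_i(Theta_k)|^2."""
    value = 0.0
    for beta, s in zip(betas, sq_grads):
        value = beta * value + (1.0 - beta) * s
    return value


# Hypothetical sample data
betas = [0.5, 0.9, 0.8, 0.99]
sq_grads = [4.0, 1.0, 0.25, 9.0]

b = rmsprop_weights(betas)
print("sum of weights:      ", sum(b))
print("1 - product of betas:", 1.0 - math.prod(betas))
print("recursion (6.341):   ", second_moment(betas, sq_grads))
print("weighted sum (6.342):", sum(w * s for w, s in zip(b, sq_grads)))
```

The first two printed values and the last two printed values should agree up to rounding.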

6.6.2 Bias-adjusted root mean square propagation GD optimization
Definition 6.6.3 (Bias-adjusted RMSprop GD optimization method). Let d ∈ N, (γn )n∈N ⊆
[0, ∞), (βn )n∈N ⊆ [0, 1], ε ∈ (0, ∞), ξ ∈ Rd satisfy

β1 < 1 (6.344)

and let L : Rd → R and G = (G1 , . . . , Gd ) : Rd → Rd satisfy for all U ∈ {V ⊆ Rd : V is open},


θ ∈ U with L|U ∈ C 1 (U, Rd ) that

G(θ) = (∇L)(θ). (6.345)

Then we say that Θ is the bias-adjusted RMSprop GD process for the objective function L
with generalized gradient G, learning rates (γn )n∈N , second moment decay factors (βn )n∈N ,
regularizing factor ε, and initial value ξ (we say that Θ is the bias-adjusted RMSprop GD
process for the objective function L with learning rates (γn )n∈N , second moment decay
factors (βn )n∈N , regularizing factor ε, and initial value ξ) if and only if it holds that
Θ = (Θ(1) , . . . , Θ(d) ) : N0 → Rd is the function from N0 to Rd which satisfies that there exists
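M = (M(1) , . . . , M(d) ) : N0 → Rd such that for all n ∈ N, i ∈ {1, 2, . . . , d} it holds that
\[
  \Theta_0 = \xi, \qquad \mathbb{M}_0 = 0, \qquad
  \mathbb{M}_n^{(i)} = \beta_n \mathbb{M}_{n-1}^{(i)} + (1 - \beta_n) |G_i(\Theta_{n-1})|^2, \tag{6.346}
\]
\[
  \text{and}\qquad \Theta_n^{(i)} = \Theta_{n-1}^{(i)} - \gamma_n \Big[ \varepsilon + \Big[ \tfrac{\mathbb{M}_n^{(i)}}{(1 - \prod_{k=1}^{n} \beta_k)} \Big]^{1/2} \Big]^{-1} G_i(\Theta_{n-1}). \tag{6.347}
\]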
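
A minimal Python sketch of (6.346)–(6.347) is given below. The function name `bias_adjusted_rmsprop_gd`, the list-based vectors, and the sample objective and parameters are assumptions made for illustration; the sketch relies on β_1 < 1 from (6.344) so that the denominator 1 − ∏β_k is positive.

```python
import math


def bias_adjusted_rmsprop_gd(grad, xi, gammas, betas, eps):
    """Iterate (6.346)-(6.347); requires betas[0] < 1, cf. (6.344)."""
    theta = list(xi)                     # Theta_0 = xi
    second = [0.0] * len(theta)          # M_0 = 0
    beta_prod = 1.0                      # running product beta_1 * ... * beta_n
    path = [list(theta)]
    for gamma, beta in zip(gammas, betas):
        g = grad(theta)
        second = [beta * s + (1.0 - beta) * gi * gi for s, gi in zip(second, g)]
        beta_prod *= beta
        # Note: here eps is added outside the square root, as in (6.347)
        theta = [t - gamma * gi / (eps + math.sqrt(s / (1.0 - beta_prod)))
                 for t, gi, s in zip(theta, g, second)]
        path.append(list(theta))
    return path


def grad_L(theta):
    """Gradient of the objective in (6.313) with kappa = 1, K = 10, vartheta = (1, 1)."""
    return [1.0 * (theta[0] - 1.0), 10.0 * (theta[1] - 1.0)]


path = bias_adjusted_rmsprop_gd(grad_L, [5.0, 3.0], [0.1] * 5, [0.9] * 5, 1e-8)
for n, point in enumerate(path):
    print(n, point)
```

After the adjustment the first step has size approximately γ_1 in every component, independently of β_1.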

Lemma 6.6.4 (On a representation of the second order terms in bias-adjusted RMSprop).
Let d ∈ N, (γn )n∈N ⊆ [0, ∞), (βn )n∈N ⊆ [0, 1), (bn,k )(n,k)∈(N0 )2 ⊆ R, ε ∈ (0, ∞), ξ ∈ Rd
satisfy for all n ∈ N, k ∈ {0, 1, . . . , n − 1} that
\[
  b_{n,k} = \frac{(1 - \beta_{k+1}) \big[ \prod_{l=k+2}^{n} \beta_l \big]}{1 - \prod_{k=1}^{n} \beta_k}, \tag{6.348}
\]

let L : Rd → R and G = (G1 , . . . , Gd ) : Rd → Rd satisfy for all U ∈ {V ⊆ Rd : V is open},


θ ∈ U with L|U ∈ C 1 (U, Rd ) that
G(θ) = (∇L)(θ), (6.349)
and let Θ = (Θ(1) , . . . , Θ(d) ) : N0 → Rd be the bias-adjusted RMSprop GD process for the
objective function L with generalized gradient G, learning rates (γn )n∈N , second moment
decay factors (βn )n∈N , regularizing factor ε, and initial value ξ (cf. Definition 6.6.3). Then
(i) it holds for all n ∈ N, k ∈ {0, 1, . . . , n − 1} that 0 ≤ bn,k ≤ 1,
(ii) it holds for all n ∈ N that
\[
  \sum_{k=0}^{n-1} b_{n,k} = 1, \tag{6.350}
\]
and
(iii) it holds for all n ∈ N, i ∈ {1, 2, . . . , d} that
\[
  \Theta_n^{(i)} = \Theta_{n-1}^{(i)} - \gamma_n \Big[ \varepsilon + \Big[ \sum_{k=0}^{n-1} b_{n,k} |G_i(\Theta_k)|^2 \Big]^{1/2} \Big]^{-1} G_i(\Theta_{n-1}). \tag{6.351}
\]

Proof of Lemma 6.6.4. Throughout this proof, let M = (M(1) , . . . , M(d) ) : N0 → Rd satisfy
for all n ∈ N, i ∈ {1, 2, . . . , d} that M0^(i) = 0 and
\[
  \mathbb{M}_n^{(i)} = \beta_n \mathbb{M}_{n-1}^{(i)} + (1 - \beta_n) |G_i(\Theta_{n-1})|^2 \tag{6.352}
\]
and let (Bn,k )(n,k)∈(N0 )2 ⊆ R satisfy for all n ∈ N, k ∈ {0, 1, . . . , n − 1} that
\[
  B_{n,k} = (1 - \beta_{k+1}) \Big[ \prod_{l=k+2}^{n} \beta_l \Big]. \tag{6.353}
\]
Observe that (6.348) implies item (i). Note that (6.348), (6.352), (6.353), and Lemma 6.3.3
assure that for all n ∈ N, i ∈ {1, 2, . . . , d} it holds that
\[
  \mathbb{M}_n^{(i)} = \sum_{k=0}^{n-1} B_{n,k} |G_i(\Theta_k)|^2
  \qquad\text{and}\qquad
  \sum_{k=0}^{n-1} b_{n,k} = \frac{\sum_{k=0}^{n-1} B_{n,k}}{1 - \prod_{k=1}^{n} \beta_k}
  = \frac{1 - \prod_{k=1}^{n} \beta_k}{1 - \prod_{k=1}^{n} \beta_k} = 1. \tag{6.354}
\]
This proves item (ii). Observe that (6.346), (6.347), (6.352), and (6.354) demonstrate that
for all n ∈ N, i ∈ {1, 2, . . . , d} it holds that
\[
\begin{aligned}
  \Theta_n^{(i)} &= \Theta_{n-1}^{(i)} - \gamma_n \Big[ \varepsilon + \Big[ \tfrac{\mathbb{M}_n^{(i)}}{1 - \prod_{k=1}^{n} \beta_k} \Big]^{1/2} \Big]^{-1} G_i(\Theta_{n-1}) \\
  &= \Theta_{n-1}^{(i)} - \gamma_n \Big[ \varepsilon + \Big[ \sum_{k=0}^{n-1} b_{n,k} |G_i(\Theta_k)|^2 \Big]^{1/2} \Big]^{-1} G_i(\Theta_{n-1}).
\end{aligned} \tag{6.355}
\]
This establishes item (iii). The proof of Lemma 6.6.4 is thus complete.

6.7 Adadelta GD optimization


The Adadelta GD optimization method reviewed in this section is an extension of the
RMSprop GD optimization method. Like the RMSprop GD optimization method, the
Adadelta GD optimization method adapts the learning rates for every component of the
optimization process separately. To do this, the Adadelta GD optimization method uses
two exponentially decaying averages: one over the squares of the past partial derivatives of
the objective function as does the RMSprop GD optimization method (cf. (6.358) below)
and another one over the squares of the past increments (cf. (6.360) below). As in the
case of Adagrad and RMSprop optimization, Adadelta optimization was introduced in a
stochastic setting (see Zeiler [429]), but for pedagogical purposes we present in this section
a deterministic version of Adadelta optimization. We refer to Section 7.8 below for the
original stochastic version of Adadelta optimization.

Definition 6.7.1 (Adadelta GD optimization method). Let d ∈ N, (βn )n∈N ⊆ [0, 1],
(δn )n∈N ⊆ [0, 1], ε ∈ (0, ∞), ξ ∈ Rd and let L : Rd → R and G = (G1 , . . . , Gd ) : Rd → Rd
satisfy for all U ∈ {V ⊆ Rd : V is open}, θ ∈ U with L|U ∈ C 1 (U, Rd ) that

G(θ) = (∇L)(θ). (6.356)

Then we say that Θ is the Adadelta GD process for the objective function L with generalized
gradient G, second moment decay factors (βn )n∈N , delta decay factors (δn )n∈N , regularizing
factor ε, and initial value ξ (we say that Θ is the Adadelta GD process for the objective func-
tion L with second moment decay factors (βn )n∈N , delta decay factors (δn )n∈N , regularizing
factor ε, and initial value ξ) if and only if it holds that Θ = (Θ(1) , . . . , Θ(d) ) : N0 → Rd is
the function from N0 to Rd which satisfies that there exist M = (M(1) , . . . , M(d) ) : N0 → Rd
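and ∆ = (∆(1) , . . . , ∆(d) ) : N0 → Rd such that for all n ∈ N, i ∈ {1, 2, . . . , d} it holds that
\[
  \Theta_0 = \xi, \qquad \mathbb{M}_0 = 0, \qquad \Delta_0 = 0, \tag{6.357}
\]
\[
  \mathbb{M}_n^{(i)} = \beta_n \mathbb{M}_{n-1}^{(i)} + (1 - \beta_n) |G_i(\Theta_{n-1})|^2, \tag{6.358}
\]
\[
  \Theta_n^{(i)} = \Theta_{n-1}^{(i)} - \Big[ \frac{\varepsilon + \Delta_{n-1}^{(i)}}{\varepsilon + \mathbb{M}_n^{(i)}} \Big]^{1/2} G_i(\Theta_{n-1}), \tag{6.359}
\]
\[
  \text{and}\qquad \Delta_n^{(i)} = \delta_n \Delta_{n-1}^{(i)} + (1 - \delta_n) \big| \Theta_n^{(i)} - \Theta_{n-1}^{(i)} \big|^2. \tag{6.360}
\]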
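
The interplay of the two decaying averages in (6.357)–(6.360) can be sketched in Python as follows. The function name `adadelta_gd`, the list-based vectors, and the sample objective and parameter values (β_n = 0.9, δ_n = 0.9, ε = 10⁻⁶) are assumptions chosen for illustration.

```python
import math


def adadelta_gd(grad, xi, betas, deltas, eps):
    """Iterate (6.357)-(6.360); betas[n-1], deltas[n-1] hold beta_n, delta_n."""
    theta = list(xi)                     # Theta_0 = xi
    second = [0.0] * len(theta)          # M_0 = 0
    incr = [0.0] * len(theta)            # Delta_0 = 0
    path = [list(theta)]
    for beta, delta in zip(betas, deltas):
        g = grad(theta)
        # Decaying average of squared partial derivatives, (6.358)
        second = [beta * s + (1.0 - beta) * gi * gi for s, gi in zip(second, g)]
        # Step scaled by the ratio of the two averages, (6.359)
        new_theta = [t - math.sqrt((eps + d) / (eps + s)) * gi
                     for t, gi, d, s in zip(theta, g, incr, second)]
        # Decaying average of squared increments, (6.360)
        incr = [delta * d + (1.0 - delta) * (nt - t) ** 2
                for d, nt, t in zip(incr, new_theta, theta)]
        theta = new_theta
        path.append(list(theta))
    return path


def grad_L(theta):
    """Gradient of the objective in (6.313) with kappa = 1, K = 10, vartheta = (1, 1)."""
    return [1.0 * (theta[0] - 1.0), 10.0 * (theta[1] - 1.0)]


path = adadelta_gd(grad_L, [5.0, 3.0], [0.9] * 5, [0.9] * 5, 1e-6)
for n, point in enumerate(path):
    print(n, point)
```

No learning rates appear in the signature: the step size is determined by the ratio in (6.359), and since ∆_0 = 0 the first increments are of the order √ε.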


6.8 Adaptive moment estimation GD optimization (Adam)
In this section we introduce the Adam GD optimization method (see Kingma & Ba [247]).
Roughly speaking, the Adam GD optimization method can be viewed as a combination of
the bias-adjusted momentum GD optimization method (see Section 6.3.2) and the bias-
adjusted RMSprop GD optimization method (see Section 6.6.2). As in the case of Adagrad,
RMSprop, and Adadelta optimization, Adam optimization was introduced in a stochastic
setting in Kingma & Ba [247], but for pedagogical purposes we present in this section a
deterministic version of Adam optimization. We refer to Section 7.9 below for the original
stochastic version of Adam optimization.

Definition 6.8.1 (Adam GD optimization method). Let d ∈ N, (γn )n∈N ⊆ [0, ∞),
(αn )n∈N ⊆ [0, 1], (βn )n∈N ⊆ [0, 1], ε ∈ (0, ∞), ξ ∈ Rd satisfy

max{α1 , β1 } < 1 (6.361)

and let L : Rd → R and G = (G1 , . . . , Gd ) : Rd → Rd satisfy for all U ∈ {V ⊆ Rd : V is open},


θ ∈ U with L|U ∈ C 1 (U, Rd ) that

G(θ) = (∇L)(θ). (6.362)

Then we say that Θ is the Adam GD process for the objective function L with generalized
gradient G, learning rates (γn )n∈N , momentum decay factors (αn )n∈N , second moment decay
factors (βn )n∈N , regularizing factor ε, and initial value ξ (we say that Θ is the Adam GD
process for the objective function L with learning rates (γn )n∈N , momentum decay factors
(αn )n∈N , second moment decay factors (βn )n∈N , regularizing factor ε, and initial value ξ) if
and only if it holds that Θ = (Θ(1) , . . . , Θ(d) ) : N0 → Rd is the function from N0 to Rd which
satisfies that there exist m = (m(1) , . . . , m(d) ) : N0 → Rd and M = (M(1) , . . . , M(d) ) : N0 →
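Rd such that for all n ∈ N, i ∈ {1, 2, . . . , d} it holds that
\[
  \Theta_0 = \xi, \qquad m_0 = 0, \qquad \mathbb{M}_0 = 0, \tag{6.363}
\]
\[
  m_n = \alpha_n m_{n-1} + (1 - \alpha_n)\, G(\Theta_{n-1}), \tag{6.364}
\]
\[
  \mathbb{M}_n^{(i)} = \beta_n \mathbb{M}_{n-1}^{(i)} + (1 - \beta_n) |G_i(\Theta_{n-1})|^2, \tag{6.365}
\]
\[
  \text{and}\qquad \Theta_n^{(i)} = \Theta_{n-1}^{(i)} - \gamma_n \Big[ \varepsilon + \Big[ \tfrac{\mathbb{M}_n^{(i)}}{(1 - \prod_{l=1}^{n} \beta_l)} \Big]^{1/2} \Big]^{-1} \Big[ \frac{m_n^{(i)}}{(1 - \prod_{l=1}^{n} \alpha_l)} \Big]. \tag{6.366}
\]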
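
The recursion (6.363)–(6.366) combines the momentum term and the second moment term, each divided by its bias-adjustment factor. The following Python sketch is an illustration only; the function name `adam_gd`, the list-based vectors, and the sample objective and parameter values (γ_n = 0.1, α_n = 0.9, β_n = 0.999, ε = 10⁻⁸) are assumptions and not part of Definition 6.8.1. It relies on max{α_1, β_1} < 1 from (6.361).

```python
import math


def adam_gd(grad, xi, gammas, alphas, betas, eps):
    """Iterate (6.363)-(6.366); requires max(alphas[0], betas[0]) < 1, cf. (6.361)."""
    theta = list(xi)                     # Theta_0 = xi
    first = [0.0] * len(theta)           # m_0 = 0
    second = [0.0] * len(theta)          # M_0 = 0
    alpha_prod, beta_prod = 1.0, 1.0     # running products of alpha_l and beta_l
    path = [list(theta)]
    for gamma, alpha, beta in zip(gammas, alphas, betas):
        g = grad(theta)
        first = [alpha * f + (1.0 - alpha) * gi for f, gi in zip(first, g)]          # (6.364)
        second = [beta * s + (1.0 - beta) * gi * gi for s, gi in zip(second, g)]     # (6.365)
        alpha_prod *= alpha
        beta_prod *= beta
        # Bias-adjusted momentum divided by bias-adjusted root mean square, (6.366)
        theta = [t - gamma * (f / (1.0 - alpha_prod))
                 / (eps + math.sqrt(s / (1.0 - beta_prod)))
                 for t, f, s in zip(theta, first, second)]
        path.append(list(theta))
    return path


def grad_L(theta):
    """Gradient of the objective in (6.313) with kappa = 1, K = 10, vartheta = (1, 1)."""
    return [1.0 * (theta[0] - 1.0), 10.0 * (theta[1] - 1.0)]


path = adam_gd(grad_L, [5.0, 3.0], [0.1] * 5, [0.9] * 5, [0.999] * 5, 1e-8)
for n, point in enumerate(path):
    print(n, point)
```

In the first step both bias-adjusted quantities reduce to G_i(Θ_0) and |G_i(Θ_0)|, so for small ε every component moves by approximately γ_1.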
Q

Chapter 7

Stochastic gradient descent (SGD) optimization methods

This chapter reviews and studies SGD-type optimization methods such as the classical
plain-vanilla SGD optimization method (see Section 7.2) as well as more sophisticated
SGD-type optimization methods including SGD-type optimization methods with momenta
(cf. Sections 7.4, 7.5, and 7.9 below) and SGD-type optimization methods with adaptive
modifications of the learning rates (cf. Sections 7.6, 7.7, 7.8, and 7.9 below).
For a brief list of resources in the scientific literature providing reviews on gradient
based optimization methods we refer to the beginning of Chapter 6.

7.1 Introductory comments for the training of ANNs with SGD
In Chapter 6 we have introduced and studied deterministic GD-type optimization methods.
Deep learning algorithms usually employ not deterministic GD-type optimization methods
but stochastic variants of GD-type optimization methods. Such SGD-type
optimization methods can be viewed as suitable Monte Carlo approximations of deterministic
GD-type methods and in this section we now roughly sketch some of the main ideas of
such SGD-type optimization methods. To do this, we now briefly recall the deep supervised
learning framework developed in the introduction and Section 5.1 above.
Specifically, let $d, M \in \mathbb{N}$, $\mathcal{E} \in C(\mathbb{R}^d, \mathbb{R})$, $x_1, x_2, \dots, x_{M+1} \in \mathbb{R}^d$, $y_1, y_2, \dots, y_M \in \mathbb{R}$ satisfy for all $m \in \{1, 2, \dots, M\}$ that

\[ y_m = \mathcal{E}(x_m). \tag{7.1} \]

As in the introduction and in Section 5.1 we think of $M \in \mathbb{N}$ as the number of available known input-output data pairs, we think of $d \in \mathbb{N}$ as the dimension of the input data, we think of $\mathcal{E} \colon \mathbb{R}^d \to \mathbb{R}$ as an unknown function which we want to approximate, we think of $x_1, x_2, \dots, x_{M+1} \in \mathbb{R}^d$ as the available known input data, we think of $y_1, y_2, \dots, y_M \in \mathbb{R}$ as the available known output data, and we are trying to use the available known input-output data pairs to approximate the unknown function $\mathcal{E}$ by means of ANNs.
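Specifically, let $a \colon \mathbb{R} \to \mathbb{R}$ be differentiable, let $h \in \mathbb{N}$, $l_1, l_2, \dots, l_h, \mathfrak{d} \in \mathbb{N}$ satisfy $\mathfrak{d} = l_1(d+1) + \bigl[\sum_{k=2}^{h} l_k(l_{k-1}+1)\bigr] + l_h + 1$, and let $L \colon \mathbb{R}^{\mathfrak{d}} \to [0,\infty)$ satisfy for all $\theta \in \mathbb{R}^{\mathfrak{d}}$ that
\[ L(\theta) = \frac{1}{M} \Biggl[ \sum_{m=1}^{M} \bigl| \mathcal{N}^{\theta,d}_{\mathfrak{M}_{a,l_1},\mathfrak{M}_{a,l_2},\dots,\mathfrak{M}_{a,l_h},\operatorname{id}_{\mathbb{R}}}(x_m) - y_m \bigr|^2 \Biggr] \tag{7.2} \]
(cf. Definitions 1.1.3 and 1.2.1). Note that $h$ is the number of hidden layers of the ANNs in (7.2), note for every $i \in \{1, 2, \dots, h\}$ that $l_i \in \mathbb{N}$ is the number of neurons in the $i$-th hidden layer of the ANNs in (7.2), and note that $\mathfrak{d}$ is the number of real parameters used to describe the ANNs in (7.2). We recall that we are trying to approximate the function $\mathcal{E}$ by, first, computing an approximate minimizer $\vartheta \in \mathbb{R}^{\mathfrak{d}}$ of the function $L \colon \mathbb{R}^{\mathfrak{d}} \to [0,\infty)$ and, thereafter, employing the realization
\[ \mathbb{R}^d \ni x \mapsto \mathcal{N}^{\vartheta,d}_{\mathfrak{M}_{a,l_1},\mathfrak{M}_{a,l_2},\dots,\mathfrak{M}_{a,l_h},\operatorname{id}_{\mathbb{R}}}(x) \in \mathbb{R} \tag{7.3} \]
of the ANN associated to the approximate minimizer $\vartheta \in \mathbb{R}^{\mathfrak{d}}$ as an approximation of $\mathcal{E}$.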


Deep learning algorithms typically solve optimization problems of the type (7.2) by means
of gradient based optimization methods, which aim to minimize the considered objective
function by performing successive steps based on the direction of the negative gradient
of the objective function. We recall that one of the simplest gradient based optimization
methods is the plain-vanilla GD optimization method, which performs successive steps in
the direction of the negative gradient. In the context of the optimization problem in (7.2)
this GD optimization method reads as follows. Let ξ ∈ Rd , let (γn )n∈N ⊆ [0, ∞), and let
θ = (θn )n∈N0 : N0 → Rd satisfy for all n ∈ N that

θ0 = ξ and θn = θn−1 − γn (∇L)(θn−1 ). (7.4)

Note that the process (θn )n∈N0 is the GD process for the objective function L with learning
rates (γn )n∈N and initial value ξ (cf. Definition 6.1.1). Moreover, observe that the assumption
that a is differentiable ensures that L in (7.4) is also differentiable (see Section 5.3.2 above
for details).
In typical practical deep learning applications the number M of available known input-output
data pairs is very large, say, for example, M ≥ 10^6. As a consequence it is typically
computationally prohibitively expensive to determine the exact gradient of the objective
function to perform steps of deterministic GD-type optimization methods. As a remedy for
this, deep learning algorithms usually employ stochastic variants of GD-type optimization
methods, where in each step of the optimization method the precise gradient of the objective
function is replaced by a Monte Carlo approximation of the gradient of the objective function.


We now sketch this approach for the GD optimization method in (7.4) resulting in the
popular SGD optimization method applied to (7.2).
Specifically, let S = {1, 2, . . . , M }, J ∈ N, let (Ω, F, P) be a probability space, for every
n ∈ N, j ∈ {1, 2, . . . , J} let mn,j : Ω → S be a uniformly distributed random variable, let
l : Rd × S → R satisfy for all θ ∈ Rd , m ∈ S that
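\[ l(\theta, m) = \bigl| \mathcal{N}^{\theta,d}_{\mathfrak{M}_{a,l_1},\mathfrak{M}_{a,l_2},\dots,\mathfrak{M}_{a,l_h},\operatorname{id}_{\mathbb{R}}}(x_m) - y_m \bigr|^2, \tag{7.5} \]
and let $\Theta = (\Theta_n)_{n \in \mathbb{N}_0} \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^{\mathfrak{d}}$ satisfy for all $n \in \mathbb{N}$ that
\[ \Theta_0 = \xi \qquad\text{and}\qquad \Theta_n = \Theta_{n-1} - \gamma_n \Biggl[ \frac{1}{J} \sum_{j=1}^{J} (\nabla_\theta l)(\Theta_{n-1}, m_{n,j}) \Biggr]. \tag{7.6} \]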
The stochastic process (Θn )n∈N0 is an SGD process for the minimization problem associated
to (7.2) with learning rates (γn )n∈N , constant number of Monte Carlo samples (batch
sizes) J, initial value ξ, and data (mn,j )(n,j)∈N×{1,2,...,J} (see Definition 7.2.1 below for the
precise definition). Note that in (7.6) in each step n ∈ N we only employ a Monte Carlo
approximation
\[ \frac{1}{J} \sum_{j=1}^{J} (\nabla_\theta l)(\Theta_{n-1}, m_{n,j}) \approx \frac{1}{M} \sum_{m=1}^{M} (\nabla_\theta l)(\Theta_{n-1}, m) = (\nabla L)(\Theta_{n-1}) \tag{7.7} \]
of the exact gradient of the objective function. Nonetheless, in deep learning applications
the SGD optimization method (or other SGD-type optimization methods) typically results in
good approximate minimizers of the objective function. Note that employing approximate
gradients in the SGD optimization method in (7.6) means that performing any step of the
SGD process involves the computation of a sum with only J summands, while employing
the exact gradient in the GD optimization method in (7.4) means that performing any step
of the process involves the computation of a sum with M summands. In deep learning
applications when M is very large (for instance, M ≥ 10^6) and J is chosen to be reasonably
small (for example, J = 128), this means that performing steps of the SGD process is much
more computationally affordable than performing steps of the GD process. Combining this
with the fact that SGD-type optimization methods do in the training of ANNs often find
good approximate minimizers (cf., for instance, Remark 9.14.5 and [100, 391]) is the key
reason making the SGD optimization method and other SGD-type optimization methods the
optimization methods chosen in almost all deep learning applications. It is the topic of this
chapter to introduce and study SGD-type optimization methods such as the plain-vanilla
SGD optimization method in (7.6) above.
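
The Monte Carlo idea behind (7.7) can be illustrated numerically. The following is a minimal sketch and not the setting of (7.2): it assumes a hypothetical scalar model θ·x in place of an ANN, a made-up target function x ↦ 3x, and illustrative helper names (`full_gradient`, `minibatch_gradient`). It compares the exact gradient, which sums over all M data points, with the average of many mini-batch estimates that each use only J data points.

```python
import random


def full_gradient(theta, xs, ys):
    """Exact gradient of L(theta) = (1/M) * sum_m (theta * x_m - y_m)**2."""
    M = len(xs)
    return sum(2.0 * (theta * x - y) * x for x, y in zip(xs, ys)) / M


def minibatch_gradient(theta, xs, ys, J, rng):
    """Monte Carlo estimate of the gradient from J uniformly drawn indices."""
    M = len(xs)
    idx = [rng.randrange(M) for _ in range(J)]
    return sum(2.0 * (theta * xs[m] - ys[m]) * xs[m] for m in idx) / J


rng = random.Random(0)
M, J = 10_000, 32
xs = [rng.uniform(-1.0, 1.0) for _ in range(M)]
ys = [3.0 * x for x in xs]  # hypothetical target function x -> 3x
theta = 0.5

exact = full_gradient(theta, xs, ys)
estimates = [minibatch_gradient(theta, xs, ys, J, rng) for _ in range(2000)]
average = sum(estimates) / len(estimates)
print(exact, average)
```

Each individual estimate is noisy, but its cost is proportional to J rather than M, and the estimates fluctuate around the exact gradient.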

7.2 SGD optimization


In the next notion we present the promised stochastic version of the plain-vanilla GD
optimization method from Section 6.1, that is, in the next notion we present the plain-vanilla
SGD optimization method.

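Definition 7.2.1 (SGD optimization method). Let $d \in \mathbb{N}$, $(\gamma_n)_{n\in\mathbb{N}} \subseteq [0,\infty)$, $(J_n)_{n\in\mathbb{N}} \subseteq \mathbb{N}$, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $(S, \mathcal{S})$ be a measurable space, let $\xi \colon \Omega \to \mathbb{R}^d$ be a random variable, for every $n \in \mathbb{N}$, $j \in \{1, 2, \dots, J_n\}$ let $X_{n,j} \colon \Omega \to S$ be a random variable, and let $l = (l(\theta,x))_{(\theta,x)\in\mathbb{R}^d\times S} \colon \mathbb{R}^d \times S \to \mathbb{R}$ and $g \colon \mathbb{R}^d \times S \to \mathbb{R}^d$ satisfy for all $U \in \{V \subseteq \mathbb{R}^d \colon V \text{ is open}\}$, $x \in S$, $\theta \in U$ with $(U \ni \vartheta \mapsto l(\vartheta, x) \in \mathbb{R}) \in C^1(U, \mathbb{R})$ that

\[ g(\theta, x) = (\nabla_\theta l)(\theta, x). \tag{7.8} \]

Then we say that $\Theta$ is the SGD process on $((\Omega, \mathcal{F}, \mathbb{P}), (S, \mathcal{S}))$ for the loss function $l$ with generalized gradient $g$, learning rates $(\gamma_n)_{n\in\mathbb{N}}$, batch sizes $(J_n)_{n\in\mathbb{N}}$, initial value $\xi$, and data $(X_{n,j})_{(n,j)\in\{(k,l)\in\mathbb{N}^2 \colon l \le J_k\}}$ (we say that $\Theta$ is the SGD process for the loss function $l$ with learning rates $(\gamma_n)_{n\in\mathbb{N}}$, batch sizes $(J_n)_{n\in\mathbb{N}}$, initial value $\xi$, and data $(X_{n,j})_{(n,j)\in\{(k,l)\in\mathbb{N}^2 \colon l \le J_k\}}$) if and only if it holds that $\Theta \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^d$ is the function from $\mathbb{N}_0 \times \Omega$ to $\mathbb{R}^d$ which satisfies for all $n \in \mathbb{N}$ that
\[ \Theta_0 = \xi \qquad\text{and}\qquad \Theta_n = \Theta_{n-1} - \gamma_n \Biggl[ \frac{1}{J_n} \sum_{j=1}^{J_n} g(\Theta_{n-1}, X_{n,j}) \Biggr]. \tag{7.9} \]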
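
The recursion (7.9) translates almost literally into code. The following is a minimal sketch under simplifying assumptions: the parameter is a scalar (d = 1), and the names `sgd_process` and `draw` are hypothetical helpers introduced only for this illustration. The usage part applies the sketch to the loss l(θ, x) = ½(θ − x)², whose gradient is g(θ, x) = θ − x, with learning rates γ_n = 1/n and normally distributed data with mean 1.

```python
import random


def sgd_process(xi, gammas, batch_sizes, g, draw, rng):
    """Return [Theta_0, ..., Theta_N] following the recursion in (7.9).

    xi          -- initial value (a float here; the definition allows vectors)
    gammas      -- learning rates gamma_1, ..., gamma_N
    batch_sizes -- batch sizes J_1, ..., J_N
    g           -- generalized gradient g(theta, x)
    draw        -- hypothetical helper returning one data sample X_{n,j}
    """
    thetas = [xi]
    for gamma, J in zip(gammas, batch_sizes):
        theta = thetas[-1]
        batch = [draw(rng) for _ in range(J)]
        mean_grad = sum(g(theta, x) for x in batch) / J
        thetas.append(theta - gamma * mean_grad)
    return thetas


# Usage: l(theta, x) = 0.5 * (theta - x)**2, so g(theta, x) = theta - x.
rng = random.Random(0)
N = 2000
path = sgd_process(
    xi=5.0,
    gammas=[1.0 / n for n in range(1, N + 1)],
    batch_sizes=[8] * N,
    g=lambda theta, x: theta - x,
    draw=lambda r: r.gauss(1.0, 1.0),
    rng=rng,
)
print(path[-1])
```

With these particular learning rates the iterate Θ_n is the running average of the first n batch means, so the final value should lie close to the mean of the data distribution.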

7.2.1 SGD optimization in the training of ANNs


In the next example we apply the SGD optimization method in the context of the training
of fully-connected feedforward ANNs in the vectorized description (see Section 1.1) with the
loss function being the mean squared error loss function in Definition 5.4.2 (see Section 5.4.2).
Note that this framework is very similar to the one developed in Section 7.1.
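
Example 7.2.2. Let $d, h, \mathfrak{d} \in \mathbb{N}$, $l_1, l_2, \dots, l_h \in \mathbb{N}$ satisfy $\mathfrak{d} = l_1(d+1) + \bigl[\sum_{k=2}^{h} l_k(l_{k-1}+1)\bigr] + l_h + 1$, let $a \colon \mathbb{R} \to \mathbb{R}$ be differentiable, let $M \in \mathbb{N}$, $x_1, x_2, \dots, x_M \in \mathbb{R}^d$, $y_1, y_2, \dots, y_M \in \mathbb{R}$, let $L \colon \mathbb{R}^{\mathfrak{d}} \to [0,\infty)$ satisfy for all $\theta \in \mathbb{R}^{\mathfrak{d}}$ that
\[ L(\theta) = \frac{1}{M} \Biggl[ \sum_{m=1}^{M} \bigl| \mathcal{N}^{\theta,d}_{\mathfrak{M}_{a,l_1},\mathfrak{M}_{a,l_2},\dots,\mathfrak{M}_{a,l_h},\operatorname{id}_{\mathbb{R}}}(x_m) - y_m \bigr|^2 \Biggr], \tag{7.10} \]
let $S = \{1, 2, \dots, M\}$, let $\ell \colon \mathbb{R}^{\mathfrak{d}} \times S \to \mathbb{R}$ satisfy for all $\theta \in \mathbb{R}^{\mathfrak{d}}$, $m \in S$ that
\[ \ell(\theta, m) = \bigl| \mathcal{N}^{\theta,d}_{\mathfrak{M}_{a,l_1},\mathfrak{M}_{a,l_2},\dots,\mathfrak{M}_{a,l_h},\operatorname{id}_{\mathbb{R}}}(x_m) - y_m \bigr|^2, \tag{7.11} \]
let $\xi \in \mathbb{R}^{\mathfrak{d}}$, let $(\gamma_n)_{n\in\mathbb{N}} \subseteq [0,\infty)$, let $\vartheta \colon \mathbb{N}_0 \to \mathbb{R}^{\mathfrak{d}}$ satisfy for all $n \in \mathbb{N}$ that
\[ \vartheta_0 = \xi \qquad\text{and}\qquad \vartheta_n = \vartheta_{n-1} - \gamma_n (\nabla L)(\vartheta_{n-1}), \tag{7.12} \]
let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $(J_n)_{n\in\mathbb{N}} \subseteq \mathbb{N}$, for every $n \in \mathbb{N}$, $j \in \{1, 2, \dots, J_n\}$ let $m_{n,j} \colon \Omega \to S$ be a uniformly distributed random variable, and let $\Theta \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^{\mathfrak{d}}$ satisfy for all $n \in \mathbb{N}$ that
\[ \Theta_0 = \xi \qquad\text{and}\qquad \Theta_n = \Theta_{n-1} - \gamma_n \Biggl[ \frac{1}{J_n} \sum_{j=1}^{J_n} (\nabla_\theta \ell)(\Theta_{n-1}, m_{n,j}) \Biggr] \tag{7.13} \]
(cf. Corollary 5.3.6). Then

(i) it holds that $\vartheta$ is the GD process for the objective function $L$ with learning rates $(\gamma_n)_{n\in\mathbb{N}}$ and initial value $\xi$,

(ii) it holds that $\Theta$ is the SGD process for the loss function $\ell$ with learning rates $(\gamma_n)_{n\in\mathbb{N}}$, batch sizes $(J_n)_{n\in\mathbb{N}}$, initial value $\xi$, and data $(m_{n,j})_{(n,j)\in\{(k,l)\in\mathbb{N}^2 \colon l \le J_k\}}$, and

(iii) it holds for all $n \in \mathbb{N}$, $\theta \in \mathbb{R}^{\mathfrak{d}}$ that
\[ \mathbb{E}\Biggl[ \theta - \gamma_n \Biggl[ \frac{1}{J_n} \sum_{j=1}^{J_n} (\nabla_\theta \ell)(\theta, m_{n,j}) \Biggr] \Biggr] = \theta - \gamma_n (\nabla L)(\theta). \tag{7.14} \]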

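Proof for Example 7.2.2. Note that (7.12) proves item (i). Observe that (7.13) proves item (ii). Note that (7.11), (7.10), and the assumption that for all $n \in \mathbb{N}$, $j \in \{1, 2, \dots, J_n\}$ it holds that $m_{n,j}$ is uniformly distributed imply that for all $n \in \mathbb{N}$, $j \in \{1, 2, \dots, J_n\}$, $\theta \in \mathbb{R}^{\mathfrak{d}}$ it holds that
\[ \begin{aligned}
\mathbb{E}[\ell(\theta, m_{n,j})] &= \frac{1}{M} \Biggl[ \sum_{m=1}^{M} \ell(\theta, m) \Biggr] \\
&= \frac{1}{M} \Biggl[ \sum_{m=1}^{M} \bigl| \mathcal{N}^{\theta,d}_{\mathfrak{M}_{a,l_1},\mathfrak{M}_{a,l_2},\dots,\mathfrak{M}_{a,l_h},\operatorname{id}_{\mathbb{R}}}(x_m) - y_m \bigr|^2 \Biggr] = L(\theta).
\end{aligned} \tag{7.15} \]

Therefore, we obtain for all $n \in \mathbb{N}$, $\theta \in \mathbb{R}^{\mathfrak{d}}$ that
\[ \begin{aligned}
\mathbb{E}\Biggl[ \theta - \gamma_n \Biggl[ \frac{1}{J_n} \sum_{j=1}^{J_n} (\nabla_\theta \ell)(\theta, m_{n,j}) \Biggr] \Biggr]
&= \theta - \gamma_n \Biggl[ \frac{1}{J_n} \sum_{j=1}^{J_n} \mathbb{E}\bigl[ (\nabla_\theta \ell)(\theta, m_{n,j}) \bigr] \Biggr] \\
&= \theta - \gamma_n \Biggl[ \frac{1}{J_n} \sum_{j=1}^{J_n} (\nabla L)(\theta) \Biggr] \\
&= \theta - \gamma_n (\nabla L)(\theta).
\end{aligned} \tag{7.16} \]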

The proof for Example 7.2.2 is thus complete.
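
The step from (7.15) to (7.16) interchanges the gradient and the expectation. Because the random indices take values in the finite set S = {1, 2, ..., M}, the expectation is a finite sum, so the interchange only uses the linearity of the gradient. A sketch of this intermediate computation, written for fixed n, j, and θ in the notation of Example 7.2.2, reads as follows.

```latex
\begin{aligned}
\mathbb{E}\bigl[(\nabla_\theta \ell)(\theta, m_{n,j})\bigr]
  &= \sum_{m=1}^{M} \mathbb{P}(m_{n,j} = m)\,(\nabla_\theta \ell)(\theta, m)
   = \frac{1}{M} \sum_{m=1}^{M} (\nabla_\theta \ell)(\theta, m) \\
  &= \nabla_\theta \Biggl[ \frac{1}{M} \sum_{m=1}^{M} \ell(\theta, m) \Biggr]
   = (\nabla L)(\theta).
\end{aligned}
```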


Source codes 7.1 and 7.2 give two concrete implementations in PyTorch of the
framework described in Example 7.2.2 with different data and network architectures. The
plots generated by these codes can be found in Figures 7.1 and 7.2, respectively. They
show the approximations of the respective target functions by the realization functions of
the ANNs at various points during the training.
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

M = 10000  # number of training samples

# We fix a random seed. This is not necessary for training a
# neural network, but we use it here to ensure that the same
# plot is created on every run.
torch.manual_seed(0)

# Here, we define the training set.
# Create a tensor of shape (M, 1) with entries sampled from a
# uniform distribution on [-2 * pi, 2 * pi)
X = (torch.rand((M, 1)) - 0.5) * 4 * np.pi
# We use the sine as the target function, so this defines the
# desired outputs.
Y = torch.sin(X)

J = 32  # the batch size
N = 100000  # the number of SGD iterations

loss = nn.MSELoss()  # the mean squared error loss function
gamma = 0.003  # the learning rate

# Define a network with a single hidden layer of 200 neurons and
# tanh activation function
net = nn.Sequential(
    nn.Linear(1, 200), nn.Tanh(), nn.Linear(200, 1)
)

# Set up a 3x3 grid of plots
fig, axs = plt.subplots(
    3,
    3,
    figsize=(12, 8),
    sharex="col",
    sharey="row",
)

# Plot the target function
x = torch.linspace(-2 * np.pi, 2 * np.pi, 1000).reshape((1000, 1))
y = torch.sin(x)
for ax in axs.flatten():
    ax.plot(x, y, label="Target")
    ax.set_xlim([-2 * np.pi, 2 * np.pi])
    ax.set_ylim([-1.1, 1.1])

plot_after = [1, 30, 100, 300, 1000, 3000, 10000, 30000, 100000]

# The training loop
for n in range(N):
    # Choose J samples randomly from the training set
    indices = torch.randint(0, M, (J,))
    X_batch = X[indices]
    Y_batch = Y[indices]

    net.zero_grad()  # Zero out the gradients

    loss_val = loss(net(X_batch), Y_batch)  # Compute the loss
    loss_val.backward()  # Compute the gradients

    # Update the parameters
    with torch.no_grad():
        for p in net.parameters():
            # Subtract the scaled gradient in-place
            p.sub_(gamma * p.grad)

    if n + 1 in plot_after:
        # Plot the realization function of the ANN
        i = plot_after.index(n + 1)
        ax = axs[i // 3][i % 3]
        ax.set_title(f"Batch {n+1}")

        with torch.no_grad():
            ax.plot(x, net(x), label="ANN realization")

axs[0][0].legend(loc="upper right")

plt.tight_layout()
plt.savefig("../../plots/sgd.pdf", bbox_inches="tight")


Source code 7.1 (code/optimization_methods/sgd.py): Python code implementing the SGD optimization method in the training of an ANN as described in Example 7.2.2 in PyTorch. In this code a fully-connected ANN with a single hidden layer with 200 neurons using the hyperbolic tangent activation function is trained so that the realization function approximates the target function $\sin \colon \mathbb{R} \to \mathbb{R}$. Example 7.2.2 is implemented with $d = 1$, $h = 1$, $\mathfrak{d} = 200 \cdot (1 + 1) + 200 + 1 = 601$, $l_1 = 200$, $a = \tanh$, $M = 10000$, $x_1, x_2, \dots, x_M \in \mathbb{R}$, $y_i = \sin(x_i)$ for all $i \in \{1, 2, \dots, M\}$, $\gamma_n = 0.003$ for all $n \in \mathbb{N}$, and $J_n = 32$ for all $n \in \mathbb{N}$ in the notation of Example 7.2.2. The plot generated by this code is shown in Figure 7.1.

[Figure 7.1: 3×3 grid of panels titled Batch 1, Batch 30, Batch 100, Batch 300, Batch 1000, Batch 3000, Batch 10000, Batch 30000, and Batch 100000, each showing the curves "Target" and "ANN realization" on [−2π, 2π] with vertical range [−1.1, 1.1].]

Figure 7.1 (plots/sgd.pdf): A plot showing the realization function of an ANN at several points during training with the SGD optimization method. This plot is generated by the code in Source code 7.1.

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt


def plot_heatmap(ax, g):
    x = np.linspace(-2 * np.pi, 2 * np.pi, 100)
    y = np.linspace(-2 * np.pi, 2 * np.pi, 100)
    x, y = np.meshgrid(x, y)

    # flatten the grid to [num_points, 2] and convert to tensor
    grid = np.vstack([x.flatten(), y.flatten()]).T
    grid_torch = torch.from_numpy(grid).float()

    # pass the grid through the network
    z = g(grid_torch)

    # reshape the predictions back to a 2D grid
    Z = z.numpy().reshape(x.shape)

    # plot the heatmap
    ax.imshow(Z, origin='lower', extent=(-2 * np.pi, 2 * np.pi,
                                         -2 * np.pi, 2 * np.pi))


M = 10000  # number of training samples


def f(x):
    # target function (x_1, x_2) -> sin(x_1) * sin(x_2)
    return torch.sin(x).prod(dim=1, keepdim=True)


torch.manual_seed(0)
# training inputs sampled uniformly from [-2 * pi, 2 * pi)^2
X = torch.rand((M, 2)) * 4 * np.pi - 2 * np.pi
Y = f(X)

J = 32  # the batch size

N = 100000  # the number of SGD iterations

loss = nn.MSELoss()  # the mean squared error loss function
gamma = 0.05  # the learning rate

fig, axs = plt.subplots(
    3, 3, figsize=(12, 12), sharex="col", sharey="row",
)

# network with two hidden layers of 50 neurons and softplus activation
net = nn.Sequential(
    nn.Linear(2, 50),
    nn.Softplus(),
    nn.Linear(50, 50),
    nn.Softplus(),
    nn.Linear(50, 1)
)

plot_after = [0, 100, 300, 1000, 3000, 10000, 30000, 100000]

# The training loop
for n in range(N + 1):
    indices = torch.randint(0, M, (J,))

    x = X[indices]
    y = Y[indices]

    net.zero_grad()

    loss_val = loss(net(x), y)
    loss_val.backward()

    with torch.no_grad():
        for p in net.parameters():
            p.sub_(gamma * p.grad)

    if n in plot_after:
        i = plot_after.index(n)

        with torch.no_grad():
            plot_heatmap(axs[i // 3][i % 3], net)
            axs[i // 3][i % 3].set_title(f"Batch {n}")

# the last panel shows the target function itself
with torch.no_grad():
    plot_heatmap(axs[2][2], f)
    axs[2][2].set_title("Target")

plt.tight_layout()
plt.savefig("../../plots/sgd2.pdf", bbox_inches="tight")

Source code 7.2 (code/optimization_methods/sgd2.py): Python code implementing the SGD optimization method in the training of an ANN as described in Example 7.2.2 in PyTorch. In this code a fully-connected ANN with two hidden layers with 50 neurons each using the softplus activation function is trained so that the realization function approximates the target function $f \colon \mathbb{R}^2 \to \mathbb{R}$ which satisfies for all $x, y \in \mathbb{R}$ that $f(x, y) = \sin(x)\sin(y)$. Example 7.2.2 is implemented with $d = 2$, $h = 2$, $\mathfrak{d} = 50 \cdot (2 + 1) + 50 \cdot (50 + 1) + 50 + 1 = 2751$, $l_1 = l_2 = 50$, $a$ being the softplus activation function, $M = 10000$, $x_1, x_2, \dots, x_M \in \mathbb{R}^2$, $y_i = f(x_i)$ for all $i \in \{1, 2, \dots, M\}$, $\gamma_n = 0.05$ for all $n \in \mathbb{N}$, and $J_n = 32$ for all $n \in \mathbb{N}$ in the notation of Example 7.2.2. The plot generated by this code is shown in Figure 7.2.

[Figure 7.2: 3×3 grid of heatmaps on [−2π, 2π] × [−2π, 2π] titled Batch 0, Batch 100, Batch 300, Batch 1000, Batch 3000, Batch 10000, Batch 30000, Batch 100000, and Target.]

Figure 7.2 (plots/sgd2.pdf): A plot showing the realization function of an ANN at several points during training with the SGD optimization method. This plot is generated by the code in Source code 7.2.


7.2.2 Non-convergence of SGD for not appropriately decaying learning rates

In this section we present two results that, roughly speaking, motivate that the sequence of
learning rates of the SGD optimization method should be chosen such that they converge
to zero (see Corollary 7.2.10 below) but not too fast (see Lemma 7.2.13 below).

7.2.2.1 Bias-variance decomposition of the mean square error


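Lemma 7.2.3 (Bias-variance decomposition of the mean square error). Let $d \in \mathbb{N}$, $\vartheta \in \mathbb{R}^d$, let $\langle\langle \cdot, \cdot \rangle\rangle \colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a scalar product, let $|||\cdot||| \colon \mathbb{R}^d \to [0,\infty)$ satisfy for all $v \in \mathbb{R}^d$ that
\[ |||v||| = \sqrt{\langle\langle v, v \rangle\rangle}, \tag{7.17} \]
let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and let $Z \colon \Omega \to \mathbb{R}^d$ be a random variable with $\mathbb{E}[|||Z|||] < \infty$. Then
\[ \mathbb{E}\bigl[ |||Z - \vartheta|||^2 \bigr] = \mathbb{E}\bigl[ |||Z - \mathbb{E}[Z]|||^2 \bigr] + |||\mathbb{E}[Z] - \vartheta|||^2. \tag{7.18} \]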

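Proof of Lemma 7.2.3. Observe that the assumption that $\mathbb{E}[|||Z|||] < \infty$ and the Cauchy-Schwarz inequality demonstrate that
\[ \begin{aligned}
\mathbb{E}\bigl[ |\langle\langle Z - \mathbb{E}[Z], \mathbb{E}[Z] - \vartheta \rangle\rangle| \bigr]
&\le \mathbb{E}\bigl[ |||Z - \mathbb{E}[Z]|||\,|||\mathbb{E}[Z] - \vartheta||| \bigr] \\
&\le \bigl( \mathbb{E}[|||Z|||] + |||\mathbb{E}[Z]||| \bigr)\,|||\mathbb{E}[Z] - \vartheta||| < \infty.
\end{aligned} \tag{7.19} \]

The linearity of the expectation hence ensures that
\[ \begin{aligned}
\mathbb{E}\bigl[ |||Z - \vartheta|||^2 \bigr]
&= \mathbb{E}\bigl[ |||(Z - \mathbb{E}[Z]) + (\mathbb{E}[Z] - \vartheta)|||^2 \bigr] \\
&= \mathbb{E}\bigl[ |||Z - \mathbb{E}[Z]|||^2 + 2 \langle\langle Z - \mathbb{E}[Z], \mathbb{E}[Z] - \vartheta \rangle\rangle + |||\mathbb{E}[Z] - \vartheta|||^2 \bigr] \\
&= \mathbb{E}\bigl[ |||Z - \mathbb{E}[Z]|||^2 \bigr] + 2 \langle\langle \mathbb{E}[Z] - \mathbb{E}[Z], \mathbb{E}[Z] - \vartheta \rangle\rangle + |||\mathbb{E}[Z] - \vartheta|||^2 \\
&= \mathbb{E}\bigl[ |||Z - \mathbb{E}[Z]|||^2 \bigr] + |||\mathbb{E}[Z] - \vartheta|||^2.
\end{aligned} \tag{7.20} \]
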

The proof of Lemma 7.2.3 is thus complete.
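
The identity (7.18) can be checked numerically in a simple special case. The following is a minimal sketch that assumes d = 1, the usual product as scalar product, and a toy random variable Z that is uniformly distributed on a finite list of values; the helper `expectation` and all numbers are illustrative choices and not part of the lemma.

```python
import random

rng = random.Random(0)
# Z is uniformly distributed on these sample values (a finite toy distribution).
values = [rng.gauss(2.0, 1.5) for _ in range(1000)]
vartheta = 0.5  # arbitrary reference point


def expectation(f):
    """Expectation of f(Z) under the uniform distribution on `values`."""
    return sum(f(z) for z in values) / len(values)


mean = expectation(lambda z: z)
mse = expectation(lambda z: (z - vartheta) ** 2)      # left-hand side of (7.18)
variance = expectation(lambda z: (z - mean) ** 2)     # first term on the right
squared_bias = (mean - vartheta) ** 2                 # second term on the right

print(mse, variance + squared_bias)
```

Up to floating point rounding, the two printed numbers agree for every choice of the reference point.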

7.2.2.2 Non-convergence of SGD for constant learning rates


In this section we present Lemma 7.2.9, Corollary 7.2.10, and Lemma 7.2.11. Our proof of
Lemma 7.2.9 employs the auxiliary results in Lemmas 7.2.4, 7.2.5, 7.2.6, 7.2.7, and 7.2.8
below. Lemma 7.2.4 recalls an elementary and well known property for the expectation
of the product of independent random variables (see, for example, Klenke [248, Theorem
5.4]). In the elementary Lemma 7.2.8 we prove under suitable hypotheses the measurability
of certain derivatives of a function. A result similar to Lemma 7.2.8 can, for instance, be
found in Jentzen et al. [220, Lemma 4.4].


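Lemma 7.2.4. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and let $X, Y \colon \Omega \to \mathbb{R}$ be independent random variables with $\mathbb{E}[|X| + |Y|] < \infty$. Then

(i) it holds that $\mathbb{E}[|XY|] = \mathbb{E}[|X|]\,\mathbb{E}[|Y|] < \infty$ and

(ii) it holds that $\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y]$.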

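Proof of Lemma 7.2.4. Note that the fact that $(X, Y)(\mathbb{P}) = (X(\mathbb{P})) \otimes (Y(\mathbb{P}))$, the integral transformation theorem, Fubini's theorem, and the assumption that $\mathbb{E}[|X| + |Y|] < \infty$ show that
\[ \begin{aligned}
\mathbb{E}[|XY|] &= \int_{\Omega} |X(\omega) Y(\omega)| \,\mathbb{P}(d\omega) \\
&= \int_{\mathbb{R} \times \mathbb{R}} |xy| \,\bigl((X, Y)(\mathbb{P})\bigr)(dx, dy) \\
&= \int_{\mathbb{R}} \biggl[ \int_{\mathbb{R}} |xy| \,(X(\mathbb{P}))(dx) \biggr] (Y(\mathbb{P}))(dy) \\
&= \int_{\mathbb{R}} |y| \biggl[ \int_{\mathbb{R}} |x| \,(X(\mathbb{P}))(dx) \biggr] (Y(\mathbb{P}))(dy) \\
&= \biggl[ \int_{\mathbb{R}} |x| \,(X(\mathbb{P}))(dx) \biggr] \biggl[ \int_{\mathbb{R}} |y| \,(Y(\mathbb{P}))(dy) \biggr] \\
&= \mathbb{E}[|X|]\,\mathbb{E}[|Y|] < \infty.
\end{aligned} \tag{7.21} \]

This establishes item (i). Observe that item (i), the fact that $(X, Y)(\mathbb{P}) = (X(\mathbb{P})) \otimes (Y(\mathbb{P}))$, the integral transformation theorem, and Fubini's theorem prove that
\[ \begin{aligned}
\mathbb{E}[XY] &= \int_{\Omega} X(\omega) Y(\omega) \,\mathbb{P}(d\omega) \\
&= \int_{\mathbb{R} \times \mathbb{R}} xy \,\bigl((X, Y)(\mathbb{P})\bigr)(dx, dy) \\
&= \int_{\mathbb{R}} \biggl[ \int_{\mathbb{R}} xy \,(X(\mathbb{P}))(dx) \biggr] (Y(\mathbb{P}))(dy) \\
&= \int_{\mathbb{R}} y \biggl[ \int_{\mathbb{R}} x \,(X(\mathbb{P}))(dx) \biggr] (Y(\mathbb{P}))(dy) \\
&= \biggl[ \int_{\mathbb{R}} x \,(X(\mathbb{P}))(dx) \biggr] \biggl[ \int_{\mathbb{R}} y \,(Y(\mathbb{P}))(dy) \biggr] \\
&= \mathbb{E}[X]\,\mathbb{E}[Y].
\end{aligned} \tag{7.22} \]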

This establishes item (ii). The proof of Lemma 7.2.4 is thus complete.


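Lemma 7.2.5. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $d \in \mathbb{N}$, let $\langle\langle \cdot, \cdot \rangle\rangle \colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a scalar product, let $|||\cdot||| \colon \mathbb{R}^d \to [0,\infty)$ satisfy for all $v \in \mathbb{R}^d$ that
\[ |||v||| = \sqrt{\langle\langle v, v \rangle\rangle}, \tag{7.23} \]
let $X \colon \Omega \to \mathbb{R}^d$ be a random variable, assume $\mathbb{E}[|||X|||^2] < \infty$, let $e_1, e_2, \dots, e_d \in \mathbb{R}^d$ satisfy for all $i, j \in \{1, 2, \dots, d\}$ that $\langle\langle e_i, e_j \rangle\rangle = \mathbb{1}_{\{i\}}(j)$, and for every random variable $Y \colon \Omega \to \mathbb{R}^d$ with $\mathbb{E}[|||Y|||^2] < \infty$ let $\operatorname{Cov}(Y) \in \mathbb{R}^{d \times d}$ satisfy
\[ \operatorname{Cov}(Y) = \bigl( \mathbb{E}\bigl[ \langle\langle e_i, Y - \mathbb{E}[Y] \rangle\rangle \langle\langle e_j, Y - \mathbb{E}[Y] \rangle\rangle \bigr] \bigr)_{(i,j) \in \{1, 2, \dots, d\}^2}. \tag{7.24} \]
Then
\[ \operatorname{Trace}(\operatorname{Cov}(X)) = \mathbb{E}\bigl[ |||X - \mathbb{E}[X]|||^2 \bigr]. \tag{7.25} \]
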

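Proof of Lemma 7.2.5. First, note that the fact that $\forall\, i, j \in \{1, 2, \dots, d\} \colon \langle\langle e_i, e_j \rangle\rangle = \mathbb{1}_{\{i\}}(j)$ implies that for all $v \in \mathbb{R}^d$ it holds that $\sum_{i=1}^{d} \langle\langle e_i, v \rangle\rangle e_i = v$. Combining this with the fact that $\forall\, i, j \in \{1, 2, \dots, d\} \colon \langle\langle e_i, e_j \rangle\rangle = \mathbb{1}_{\{i\}}(j)$ demonstrates that
\[ \begin{aligned}
\operatorname{Trace}(\operatorname{Cov}(X))
&= \sum_{i=1}^{d} \mathbb{E}\bigl[ \langle\langle e_i, X - \mathbb{E}[X] \rangle\rangle \langle\langle e_i, X - \mathbb{E}[X] \rangle\rangle \bigr] \\
&= \sum_{i=1}^{d} \sum_{j=1}^{d} \mathbb{E}\bigl[ \langle\langle e_i, X - \mathbb{E}[X] \rangle\rangle \langle\langle e_j, X - \mathbb{E}[X] \rangle\rangle \langle\langle e_i, e_j \rangle\rangle \bigr] \\
&= \mathbb{E}\Bigl[ \Bigl\langle\Bigl\langle \textstyle\sum_{i=1}^{d} \langle\langle e_i, X - \mathbb{E}[X] \rangle\rangle e_i, \sum_{j=1}^{d} \langle\langle e_j, X - \mathbb{E}[X] \rangle\rangle e_j \Bigr\rangle\Bigr\rangle \Bigr] \\
&= \mathbb{E}\bigl[ \langle\langle X - \mathbb{E}[X], X - \mathbb{E}[X] \rangle\rangle \bigr] = \mathbb{E}\bigl[ |||X - \mathbb{E}[X]|||^2 \bigr].
\end{aligned} \tag{7.26} \]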

The proof of Lemma 7.2.5 is thus complete.


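Lemma 7.2.6. Let $d, n \in \mathbb{N}$, let $\langle\langle \cdot, \cdot \rangle\rangle \colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a scalar product, let $|||\cdot||| \colon \mathbb{R}^d \to [0,\infty)$ satisfy for all $v \in \mathbb{R}^d$ that
\[ |||v||| = \sqrt{\langle\langle v, v \rangle\rangle}, \tag{7.27} \]
let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $X_k \colon \Omega \to \mathbb{R}^d$, $k \in \{1, 2, \dots, n\}$, be independent random variables, and assume $\sum_{k=1}^{n} \mathbb{E}[|||X_k|||] < \infty$. Then
\[ \mathbb{E}\Bigl[ \bigl|\bigl|\bigl| \textstyle\sum_{k=1}^{n} (X_k - \mathbb{E}[X_k]) \bigr|\bigr|\bigr|^2 \Bigr] = \sum_{k=1}^{n} \mathbb{E}\bigl[ |||X_k - \mathbb{E}[X_k]|||^2 \bigr]. \tag{7.28} \]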

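Proof of Lemma 7.2.6. First, observe that Lemma 7.2.4 and the assumption that $\mathbb{E}[|||X_1||| + |||X_2||| + \ldots + |||X_n|||] < \infty$ ensure that for all $k_1, k_2 \in \{1, 2, \dots, n\}$ with $k_1 \neq k_2$ it holds that
\[ \mathbb{E}\bigl[ |\langle\langle X_{k_1} - \mathbb{E}[X_{k_1}], X_{k_2} - \mathbb{E}[X_{k_2}] \rangle\rangle| \bigr] \le \mathbb{E}\bigl[ |||X_{k_1} - \mathbb{E}[X_{k_1}]|||\,|||X_{k_2} - \mathbb{E}[X_{k_2}]||| \bigr] < \infty \tag{7.29} \]
and
\[ \begin{aligned}
\mathbb{E}\bigl[ \langle\langle X_{k_1} - \mathbb{E}[X_{k_1}], X_{k_2} - \mathbb{E}[X_{k_2}] \rangle\rangle \bigr]
&= \langle\langle \mathbb{E}[X_{k_1} - \mathbb{E}[X_{k_1}]], \mathbb{E}[X_{k_2} - \mathbb{E}[X_{k_2}]] \rangle\rangle \\
&= \langle\langle \mathbb{E}[X_{k_1}] - \mathbb{E}[X_{k_1}], \mathbb{E}[X_{k_2}] - \mathbb{E}[X_{k_2}] \rangle\rangle = 0.
\end{aligned} \tag{7.30} \]

Therefore, we obtain that
\[ \begin{aligned}
&\mathbb{E}\Bigl[ \bigl|\bigl|\bigl| \textstyle\sum_{k=1}^{n} (X_k - \mathbb{E}[X_k]) \bigr|\bigr|\bigr|^2 \Bigr] \\
&= \mathbb{E}\Bigl[ \Bigl\langle\Bigl\langle \textstyle\sum_{k_1=1}^{n} (X_{k_1} - \mathbb{E}[X_{k_1}]), \sum_{k_2=1}^{n} (X_{k_2} - \mathbb{E}[X_{k_2}]) \Bigr\rangle\Bigr\rangle \Bigr] \\
&= \mathbb{E}\Bigl[ \textstyle\sum_{k_1, k_2 = 1}^{n} \langle\langle X_{k_1} - \mathbb{E}[X_{k_1}], X_{k_2} - \mathbb{E}[X_{k_2}] \rangle\rangle \Bigr] \\
&= \mathbb{E}\Biggl[ \Biggl( \sum_{k=1}^{n} |||X_k - \mathbb{E}[X_k]|||^2 \Biggr) + \Biggl( \sum_{\substack{k_1, k_2 \in \{1, 2, \dots, n\}, \\ k_1 \neq k_2}} \langle\langle X_{k_1} - \mathbb{E}[X_{k_1}], X_{k_2} - \mathbb{E}[X_{k_2}] \rangle\rangle \Biggr) \Biggr] \\
&= \Biggl( \sum_{k=1}^{n} \mathbb{E}\bigl[ |||X_k - \mathbb{E}[X_k]|||^2 \bigr] \Biggr) + \Biggl( \sum_{\substack{k_1, k_2 \in \{1, 2, \dots, n\}, \\ k_1 \neq k_2}} \mathbb{E}\bigl[ \langle\langle X_{k_1} - \mathbb{E}[X_{k_1}], X_{k_2} - \mathbb{E}[X_{k_2}] \rangle\rangle \bigr] \Biggr) \\
&= \sum_{k=1}^{n} \mathbb{E}\bigl[ |||X_k - \mathbb{E}[X_k]|||^2 \bigr].
\end{aligned} \tag{7.31} \]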

The proof of Lemma 7.2.6 is thus complete.
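
For finitely many finitely supported random variables, (7.28) can be verified by direct enumeration. The following is a minimal sketch assuming d = 1 and two independent random variables that are each uniformly distributed on a small, arbitrarily chosen set of values; independence is encoded by using the uniform distribution on the product of the two supports.

```python
from itertools import product

# Two independent scalar random variables, each uniform on a small finite set
# (toy distributions chosen for illustration).
supports = [[-1.0, 0.0, 4.0], [2.0, 3.0]]


def mean(values):
    return sum(values) / len(values)


def variance(values):
    mu = mean(values)
    return mean([(v - mu) ** 2 for v in values])


means = [mean(s) for s in supports]

# Independence: the joint distribution is uniform on the product of the supports.
centered_sums = [
    sum(x - mu for x, mu in zip(outcome, means))
    for outcome in product(*supports)
]
lhs = mean([s ** 2 for s in centered_sums])  # E|sum_k (X_k - E[X_k])|^2
rhs = sum(variance(s) for s in supports)     # sum_k E|X_k - E[X_k]|^2
print(lhs, rhs)
```

The mixed terms average out to zero over the product distribution, which is exactly the content of (7.30).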


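Lemma 7.2.7 (Factorization lemma for independent random variables). Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $(\mathbb{X}, \mathcal{X})$ and $(\mathbb{Y}, \mathcal{Y})$ be measurable spaces, let $X \colon \Omega \to \mathbb{X}$ and $Y \colon \Omega \to \mathbb{Y}$ be independent random variables, let $\Phi \colon \mathbb{X} \times \mathbb{Y} \to [0,\infty]$ be $(\mathcal{X} \otimes \mathcal{Y})/\mathcal{B}([0,\infty])$-measurable, and let $\phi \colon \mathbb{Y} \to [0,\infty]$ satisfy for all $y \in \mathbb{Y}$ that

\[ \phi(y) = \mathbb{E}[\Phi(X, y)]. \tag{7.32} \]

Then

(i) it holds that the function $\phi$ is $\mathcal{Y}/\mathcal{B}([0,\infty])$-measurable and

(ii) it holds that
\[ \mathbb{E}[\Phi(X, Y)] = \mathbb{E}[\phi(Y)]. \tag{7.33} \]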

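Proof of Lemma 7.2.7. First, note that Fubini's theorem (cf., for example, Klenke [248, (14.6) in Theorem 14.16]), the assumption that the function $X \colon \Omega \to \mathbb{X}$ is $\mathcal{F}/\mathcal{X}$-measurable, and the assumption that the function $\Phi \colon \mathbb{X} \times \mathbb{Y} \to [0,\infty]$ is $(\mathcal{X} \otimes \mathcal{Y})/\mathcal{B}([0,\infty])$-measurable show that the function
\[ \mathbb{Y} \ni y \mapsto \phi(y) = \mathbb{E}[\Phi(X, y)] = \int_{\Omega} \Phi(X(\omega), y) \,\mathbb{P}(d\omega) \in [0,\infty] \tag{7.34} \]
is $\mathcal{Y}/\mathcal{B}([0,\infty])$-measurable. This proves item (i). Observe that the integral transformation theorem, the fact that $(X, Y)(\mathbb{P}) = (X(\mathbb{P})) \otimes (Y(\mathbb{P}))$, and Fubini's theorem establish that
\[ \begin{aligned}
\mathbb{E}[\Phi(X, Y)] &= \int_{\Omega} \Phi(X(\omega), Y(\omega)) \,\mathbb{P}(d\omega) \\
&= \int_{\mathbb{X} \times \mathbb{Y}} \Phi(x, y) \,\bigl((X, Y)(\mathbb{P})\bigr)(dx, dy) \\
&= \int_{\mathbb{Y}} \biggl[ \int_{\mathbb{X}} \Phi(x, y) \,(X(\mathbb{P}))(dx) \biggr] (Y(\mathbb{P}))(dy) \\
&= \int_{\mathbb{Y}} \mathbb{E}[\Phi(X, y)] \,(Y(\mathbb{P}))(dy) \\
&= \int_{\mathbb{Y}} \phi(y) \,(Y(\mathbb{P}))(dy) = \mathbb{E}[\phi(Y)].
\end{aligned} \tag{7.35} \]
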

This proves item (ii). The proof of Lemma 7.2.7 is thus complete.
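
Lemma 7.2.8. Let $d \in \mathbb{N}$, let $(S, \mathcal{S})$ be a measurable space, let $l = (l(\theta, x))_{(\theta,x) \in \mathbb{R}^d \times S} \colon \mathbb{R}^d \times S \to \mathbb{R}$ be $(\mathcal{B}(\mathbb{R}^d) \otimes \mathcal{S})/\mathcal{B}(\mathbb{R})$-measurable, and assume for every $x \in S$ that the function $\mathbb{R}^d \ni \theta \mapsto l(\theta, x) \in \mathbb{R}$ is differentiable. Then the function
\[ \mathbb{R}^d \times S \ni (\theta, x) \mapsto (\nabla_\theta l)(\theta, x) \in \mathbb{R}^d \tag{7.36} \]
is $(\mathcal{B}(\mathbb{R}^d) \otimes \mathcal{S})/\mathcal{B}(\mathbb{R}^d)$-measurable.

Proof of Lemma 7.2.8. Throughout this proof, let $g = (g_1, \dots, g_d) \colon \mathbb{R}^d \times S \to \mathbb{R}^d$ satisfy for all $\theta \in \mathbb{R}^d$, $x \in S$ that
\[ g(\theta, x) = (\nabla_\theta l)(\theta, x). \tag{7.37} \]
The assumption that the function $l \colon \mathbb{R}^d \times S \to \mathbb{R}$ is $(\mathcal{B}(\mathbb{R}^d) \otimes \mathcal{S})/\mathcal{B}(\mathbb{R})$-measurable implies that for all $i \in \{1, 2, \dots, d\}$, $h \in \mathbb{R} \setminus \{0\}$ it holds that the function
\[ \mathbb{R}^d \times S \ni (\theta, x) = ((\theta_1, \dots, \theta_d), x) \mapsto \frac{l((\theta_1, \dots, \theta_{i-1}, \theta_i + h, \theta_{i+1}, \dots, \theta_d), x) - l(\theta, x)}{h} \in \mathbb{R} \tag{7.38} \]
is $(\mathcal{B}(\mathbb{R}^d) \otimes \mathcal{S})/\mathcal{B}(\mathbb{R})$-measurable. The fact that for all $i \in \{1, 2, \dots, d\}$, $\theta = (\theta_1, \dots, \theta_d) \in \mathbb{R}^d$, $x \in S$ it holds that
\[ g_i(\theta, x) = \lim_{n \to \infty} \frac{l((\theta_1, \dots, \theta_{i-1}, \theta_i + 2^{-n}, \theta_{i+1}, \dots, \theta_d), x) - l(\theta, x)}{2^{-n}} \tag{7.39} \]
hence demonstrates that for all $i \in \{1, 2, \dots, d\}$ it holds that the function $g_i \colon \mathbb{R}^d \times S \to \mathbb{R}$ is $(\mathcal{B}(\mathbb{R}^d) \otimes \mathcal{S})/\mathcal{B}(\mathbb{R})$-measurable. This ensures that $g$ is $(\mathcal{B}(\mathbb{R}^d) \otimes \mathcal{S})/\mathcal{B}(\mathbb{R}^d)$-measurable. The proof of Lemma 7.2.8 is thus complete.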


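Lemma 7.2.9. Let $d \in \mathbb{N}$, $(\gamma_n)_{n\in\mathbb{N}} \subseteq (0,\infty)$, $(J_n)_{n\in\mathbb{N}} \subseteq \mathbb{N}$, let $\langle\langle \cdot, \cdot \rangle\rangle \colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a scalar product, let $|||\cdot||| \colon \mathbb{R}^d \to [0,\infty)$ satisfy for all $v \in \mathbb{R}^d$ that
\[ |||v||| = \sqrt{\langle\langle v, v \rangle\rangle}, \tag{7.40} \]
let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $\xi \colon \Omega \to \mathbb{R}^d$ be a random variable, let $(S, \mathcal{S})$ be a measurable space, let $X_{n,j} \colon \Omega \to S$, $j \in \{1, 2, \dots, J_n\}$, $n \in \mathbb{N}$, be i.i.d. random variables, assume that $\xi$ and $(X_{n,j})_{(n,j)\in\{(k,l)\in\mathbb{N}^2 \colon l \le J_k\}}$ are independent, let $l = (l(\theta, x))_{(\theta,x)\in\mathbb{R}^d \times S} \colon \mathbb{R}^d \times S \to \mathbb{R}$ be $(\mathcal{B}(\mathbb{R}^d) \otimes \mathcal{S})/\mathcal{B}(\mathbb{R})$-measurable, assume for all $x \in S$ that $(\mathbb{R}^d \ni \theta \mapsto l(\theta, x) \in \mathbb{R}) \in C^1(\mathbb{R}^d, \mathbb{R})$, assume for all $\theta \in \mathbb{R}^d$ that $\mathbb{E}[|||(\nabla_\theta l)(\theta, X_{1,1})|||] < \infty$ (cf. Lemma 7.2.8), let $\mathbb{V} \colon \mathbb{R}^d \to [0,\infty]$ satisfy for all $\theta \in \mathbb{R}^d$ that
\[ \mathbb{V}(\theta) = \mathbb{E}\bigl[ |||(\nabla_\theta l)(\theta, X_{1,1}) - \mathbb{E}[(\nabla_\theta l)(\theta, X_{1,1})]|||^2 \bigr], \tag{7.41} \]
and let $\Theta \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^d$ be the stochastic process which satisfies for all $n \in \mathbb{N}$ that
\[ \Theta_0 = \xi \qquad\text{and}\qquad \Theta_n = \Theta_{n-1} - \gamma_n \Biggl[ \frac{1}{J_n} \sum_{j=1}^{J_n} (\nabla_\theta l)(\Theta_{n-1}, X_{n,j}) \Biggr]. \tag{7.42} \]

Then it holds for all $n \in \mathbb{N}$, $\vartheta \in \mathbb{R}^d$ that
\[ \bigl( \mathbb{E}\bigl[ |||\Theta_n - \vartheta|||^2 \bigr] \bigr)^{1/2} \ge \biggl[ \frac{\gamma_n}{(J_n)^{1/2}} \biggr] \bigl( \mathbb{E}\bigl[ \mathbb{V}(\Theta_{n-1}) \bigr] \bigr)^{1/2}. \tag{7.43} \]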

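Proof of Lemma 7.2.9. Throughout this proof, let $\vartheta \in \mathbb{R}^d$ and for every $n \in \mathbb{N}$ let $\phi_n \colon \mathbb{R}^d \to [0,\infty]$ satisfy for all $\theta \in \mathbb{R}^d$ that
\[ \phi_n(\theta) = \mathbb{E}\Bigl[ \bigl|\bigl|\bigl| \theta - \tfrac{\gamma_n}{J_n} \bigl[ \textstyle\sum_{j=1}^{J_n} (\nabla_\theta l)(\theta, X_{n,j}) \bigr] - \vartheta \bigr|\bigr|\bigr|^2 \Bigr]. \tag{7.44} \]

Note that Lemma 7.2.3 shows that for all random variables $Z \colon \Omega \to \mathbb{R}^d$ with $\mathbb{E}[|||Z|||] < \infty$ and all $v \in \mathbb{R}^d$ it holds that
\[ \mathbb{E}\bigl[ |||Z - v|||^2 \bigr] = \mathbb{E}\bigl[ |||Z - \mathbb{E}[Z]|||^2 \bigr] + |||\mathbb{E}[Z] - v|||^2 \ge \mathbb{E}\bigl[ |||Z - \mathbb{E}[Z]|||^2 \bigr]. \tag{7.45} \]

Therefore, we obtain for all $n \in \mathbb{N}$, $\theta \in \mathbb{R}^d$ that
\[ \begin{aligned}
\phi_n(\theta) &= \mathbb{E}\Bigl[ \bigl|\bigl|\bigl| \tfrac{\gamma_n}{J_n} \bigl[ \textstyle\sum_{j=1}^{J_n} (\nabla_\theta l)(\theta, X_{n,j}) \bigr] - (\theta - \vartheta) \bigr|\bigr|\bigr|^2 \Bigr] \\
&\ge \mathbb{E}\Bigl[ \bigl|\bigl|\bigl| \tfrac{\gamma_n}{J_n} \bigl[ \textstyle\sum_{j=1}^{J_n} (\nabla_\theta l)(\theta, X_{n,j}) \bigr] - \mathbb{E}\bigl[ \tfrac{\gamma_n}{J_n} \bigl[ \textstyle\sum_{j=1}^{J_n} (\nabla_\theta l)(\theta, X_{n,j}) \bigr] \bigr] \bigr|\bigr|\bigr|^2 \Bigr] \\
&= \tfrac{(\gamma_n)^2}{(J_n)^2} \, \mathbb{E}\Bigl[ \bigl|\bigl|\bigl| \textstyle\sum_{j=1}^{J_n} \bigl( (\nabla_\theta l)(\theta, X_{n,j}) - \mathbb{E}\bigl[ (\nabla_\theta l)(\theta, X_{n,j}) \bigr] \bigr) \bigr|\bigr|\bigr|^2 \Bigr].
\end{aligned} \tag{7.46} \]

Lemma 7.2.6, the fact that $X_{n,j} \colon \Omega \to S$, $j \in \{1, 2, \dots, J_n\}$, $n \in \mathbb{N}$, are i.i.d. random variables, and the fact that for all $n \in \mathbb{N}$, $j \in \{1, 2, \dots, J_n\}$, $\theta \in \mathbb{R}^d$ it holds that
\[ \mathbb{E}\bigl[ |||(\nabla_\theta l)(\theta, X_{n,j})||| \bigr] = \mathbb{E}\bigl[ |||(\nabla_\theta l)(\theta, X_{1,1})||| \bigr] < \infty \tag{7.47} \]
hence establish that for all $n \in \mathbb{N}$, $\theta \in \mathbb{R}^d$ it holds that
\[ \begin{aligned}
\phi_n(\theta) &\ge \tfrac{(\gamma_n)^2}{(J_n)^2} \Biggl[ \sum_{j=1}^{J_n} \mathbb{E}\bigl[ |||(\nabla_\theta l)(\theta, X_{n,j}) - \mathbb{E}[(\nabla_\theta l)(\theta, X_{n,j})]|||^2 \bigr] \Biggr] \\
&= \tfrac{(\gamma_n)^2}{(J_n)^2} \Biggl[ \sum_{j=1}^{J_n} \mathbb{E}\bigl[ |||(\nabla_\theta l)(\theta, X_{1,1}) - \mathbb{E}[(\nabla_\theta l)(\theta, X_{1,1})]|||^2 \bigr] \Biggr] \\
&= \tfrac{(\gamma_n)^2}{(J_n)^2} \Biggl[ \sum_{j=1}^{J_n} \mathbb{V}(\theta) \Biggr] = \tfrac{(\gamma_n)^2}{(J_n)^2} \bigl[ J_n \mathbb{V}(\theta) \bigr] = \Bigl[ \tfrac{(\gamma_n)^2}{J_n} \Bigr] \mathbb{V}(\theta).
\end{aligned} \tag{7.48} \]

Furthermore, observe that (7.42), (7.44), the fact that for all $n \in \mathbb{N}$ it holds that $\Theta_{n-1}$ and $(X_{n,j})_{j \in \{1, 2, \dots, J_n\}}$ are independent random variables, and Lemma 7.2.7 prove that for all $n \in \mathbb{N}$ it holds that
\[ \begin{aligned}
\mathbb{E}\bigl[ |||\Theta_n - \vartheta|||^2 \bigr] &= \mathbb{E}\Bigl[ \bigl|\bigl|\bigl| \Theta_{n-1} - \tfrac{\gamma_n}{J_n} \bigl[ \textstyle\sum_{j=1}^{J_n} (\nabla_\theta l)(\Theta_{n-1}, X_{n,j}) \bigr] - \vartheta \bigr|\bigr|\bigr|^2 \Bigr] \\
&= \mathbb{E}\bigl[ \phi_n(\Theta_{n-1}) \bigr].
\end{aligned} \tag{7.49} \]

Combining this with (7.48) implies that for all $n \in \mathbb{N}$ it holds that
\[ \mathbb{E}\bigl[ |||\Theta_n - \vartheta|||^2 \bigr] \ge \mathbb{E}\Bigl[ \Bigl[ \tfrac{(\gamma_n)^2}{J_n} \Bigr] \mathbb{V}(\Theta_{n-1}) \Bigr] = \Bigl[ \tfrac{(\gamma_n)^2}{J_n} \Bigr] \mathbb{E}\bigl[ \mathbb{V}(\Theta_{n-1}) \bigr]. \tag{7.50} \]
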

This establishes (7.43). The proof of Lemma 7.2.9 is thus complete.

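Corollary 7.2.10. Let $d \in \mathbb{N}$, $\varepsilon \in (0,\infty)$, $(\gamma_n)_{n\in\mathbb{N}} \subseteq (0,\infty)$, $(J_n)_{n\in\mathbb{N}} \subseteq \mathbb{N}$, let $\langle\langle \cdot, \cdot \rangle\rangle \colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a scalar product, let $|||\cdot||| \colon \mathbb{R}^d \to [0,\infty)$ satisfy for all $v \in \mathbb{R}^d$ that
\[ |||v||| = \sqrt{\langle\langle v, v \rangle\rangle}, \tag{7.51} \]
let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $\xi \colon \Omega \to \mathbb{R}^d$ be a random variable, let $(S, \mathcal{S})$ be a measurable space, let $X_{n,j} \colon \Omega \to S$, $j \in \{1, 2, \dots, J_n\}$, $n \in \mathbb{N}$, be i.i.d. random variables, assume that $\xi$ and $(X_{n,j})_{(n,j)\in\{(k,l)\in\mathbb{N}^2 \colon l \le J_k\}}$ are independent, let $l = (l(\theta, x))_{(\theta,x)\in\mathbb{R}^d \times S} \colon \mathbb{R}^d \times S \to \mathbb{R}$ be $(\mathcal{B}(\mathbb{R}^d) \otimes \mathcal{S})/\mathcal{B}(\mathbb{R})$-measurable, assume for all $x \in S$ that $(\mathbb{R}^d \ni \theta \mapsto l(\theta, x) \in \mathbb{R}) \in C^1(\mathbb{R}^d, \mathbb{R})$, assume for all $\theta \in \mathbb{R}^d$ that $\mathbb{E}[|||(\nabla_\theta l)(\theta, X_{1,1})|||] < \infty$ (cf. Lemma 7.2.8) and
\[ \bigl( \mathbb{E}\bigl[ |||(\nabla_\theta l)(\theta, X_{1,1}) - \mathbb{E}[(\nabla_\theta l)(\theta, X_{1,1})]|||^2 \bigr] \bigr)^{1/2} \ge \varepsilon, \tag{7.52} \]
and let $\Theta \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^d$ be the stochastic process which satisfies for all $n \in \mathbb{N}$ that
\[ \Theta_0 = \xi \qquad\text{and}\qquad \Theta_n = \Theta_{n-1} - \gamma_n \Biggl[ \frac{1}{J_n} \sum_{j=1}^{J_n} (\nabla_\theta l)(\Theta_{n-1}, X_{n,j}) \Biggr]. \tag{7.53} \]

Then

(i) it holds for all $n \in \mathbb{N}$, $\vartheta \in \mathbb{R}^d$ that
\[ \bigl( \mathbb{E}\bigl[ |||\Theta_n - \vartheta|||^2 \bigr] \bigr)^{1/2} \ge \varepsilon \biggl[ \frac{\gamma_n}{(J_n)^{1/2}} \biggr] \tag{7.54} \]
and

(ii) it holds for all $\vartheta \in \mathbb{R}^d$ that
\[ \liminf_{n \to \infty} \bigl( \mathbb{E}\bigl[ |||\Theta_n - \vartheta|||^2 \bigr] \bigr)^{1/2} \ge \varepsilon \biggl[ \liminf_{n \to \infty} \frac{\gamma_n}{(J_n)^{1/2}} \biggr]. \tag{7.55} \]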

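Proof of Corollary 7.2.10. Throughout this proof, let $\mathbb{V} \colon \mathbb{R}^d \to [0,\infty]$ satisfy for all $\theta \in \mathbb{R}^d$ that
\[ \mathbb{V}(\theta) = \mathbb{E}\bigl[ |||(\nabla_\theta l)(\theta, X_{1,1}) - \mathbb{E}[(\nabla_\theta l)(\theta, X_{1,1})]|||^2 \bigr]. \tag{7.56} \]

Note that (7.52) demonstrates that for all $\theta \in \mathbb{R}^d$ it holds that
\[ \mathbb{V}(\theta) \ge \varepsilon^2. \tag{7.57} \]
Lemma 7.2.9 therefore ensures that for all $n \in \mathbb{N}$, $\vartheta \in \mathbb{R}^d$ it holds that
\[ \bigl( \mathbb{E}\bigl[ |||\Theta_n - \vartheta|||^2 \bigr] \bigr)^{1/2} \ge \biggl[ \frac{\gamma_n}{(J_n)^{1/2}} \biggr] \bigl( \mathbb{E}\bigl[ \mathbb{V}(\Theta_{n-1}) \bigr] \bigr)^{1/2} \ge \biggl[ \frac{\gamma_n}{(J_n)^{1/2}} \biggr] (\varepsilon^2)^{1/2} = \frac{\gamma_n \varepsilon}{(J_n)^{1/2}}. \tag{7.58} \]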
This shows item (i). Observe that item (i) implies item (ii). The proof of Corollary 7.2.10
is thus complete.
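
Lemma 7.2.11 (Lower bound for the SGD optimization method). Let $d \in \mathbb{N}$, $(\gamma_n)_{n\in\mathbb{N}} \subseteq (0,\infty)$, $(J_n)_{n\in\mathbb{N}} \subseteq \mathbb{N}$, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $\xi \colon \Omega \to \mathbb{R}^d$ be a random variable, let $X_{n,j} \colon \Omega \to \mathbb{R}^d$, $j \in \{1, 2, \dots, J_n\}$, $n \in \mathbb{N}$, be i.i.d. random variables with $\mathbb{E}[\lVert X_{1,1} \rVert_2] < \infty$, assume that $\xi$ and $(X_{n,j})_{(n,j)\in\{(k,l)\in\mathbb{N}^2 \colon l \le J_k\}}$ are independent, let $l = (l(\theta, x))_{(\theta,x)\in\mathbb{R}^d \times \mathbb{R}^d} \colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ satisfy for all $\theta, x \in \mathbb{R}^d$ that
\[ l(\theta, x) = \tfrac{1}{2} \lVert \theta - x \rVert_2^2, \tag{7.59} \]
and let $\Theta \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^d$ be the stochastic process which satisfies for all $n \in \mathbb{N}$ that
\[ \Theta_0 = \xi \qquad\text{and}\qquad \Theta_n = \Theta_{n-1} - \gamma_n \Biggl[ \frac{1}{J_n} \sum_{j=1}^{J_n} (\nabla_\theta l)(\Theta_{n-1}, X_{n,j}) \Biggr]. \tag{7.60} \]

Then

(i) it holds for all $\theta \in \mathbb{R}^d$ that
\[ \mathbb{E}\bigl[ \lVert (\nabla_\theta l)(\theta, X_{1,1}) \rVert_2 \bigr] < \infty, \tag{7.61} \]

(ii) it holds for all $\theta \in \mathbb{R}^d$ that
\[ \mathbb{E}\bigl[ \lVert (\nabla_\theta l)(\theta, X_{1,1}) - \mathbb{E}[(\nabla_\theta l)(\theta, X_{1,1})] \rVert_2^2 \bigr] = \mathbb{E}\bigl[ \lVert X_{1,1} - \mathbb{E}[X_{1,1}] \rVert_2^2 \bigr], \tag{7.62} \]
and

(iii) it holds for all $n \in \mathbb{N}$, $\vartheta \in \mathbb{R}^d$ that
\[ \bigl( \mathbb{E}\bigl[ \lVert \Theta_n - \vartheta \rVert_2^2 \bigr] \bigr)^{1/2} \ge \bigl( \mathbb{E}\bigl[ \lVert X_{1,1} - \mathbb{E}[X_{1,1}] \rVert_2^2 \bigr] \bigr)^{1/2} \biggl[ \frac{\gamma_n}{(J_n)^{1/2}} \biggr]. \tag{7.63} \]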

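Proof of Lemma 7.2.11. First, note that (7.59) and Lemma 5.6.4 prove that for all $\theta, x \in \mathbb{R}^d$ it holds that
\[ (\nabla_\theta l)(\theta, x) = \tfrac{1}{2} (2(\theta - x)) = \theta - x. \tag{7.64} \]
The assumption that $\mathbb{E}[\lVert X_{1,1} \rVert_2] < \infty$ hence implies that for all $\theta \in \mathbb{R}^d$ it holds that
\[ \mathbb{E}\bigl[ \lVert (\nabla_\theta l)(\theta, X_{1,1}) \rVert_2 \bigr] = \mathbb{E}\bigl[ \lVert \theta - X_{1,1} \rVert_2 \bigr] \le \lVert \theta \rVert_2 + \mathbb{E}\bigl[ \lVert X_{1,1} \rVert_2 \bigr] < \infty. \tag{7.65} \]

This establishes item (i). Furthermore, observe that (7.64) and item (i) demonstrate that for all $\theta \in \mathbb{R}^d$ it holds that
\[ \begin{aligned}
&\mathbb{E}\bigl[ \lVert (\nabla_\theta l)(\theta, X_{1,1}) - \mathbb{E}[(\nabla_\theta l)(\theta, X_{1,1})] \rVert_2^2 \bigr] \\
&= \mathbb{E}\bigl[ \lVert (\theta - X_{1,1}) - \mathbb{E}[\theta - X_{1,1}] \rVert_2^2 \bigr] = \mathbb{E}\bigl[ \lVert X_{1,1} - \mathbb{E}[X_{1,1}] \rVert_2^2 \bigr].
\end{aligned} \tag{7.66} \]
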

This proves item (ii). Note that item (i) in Corollary 7.2.10 and items (i) and (ii) establish
item (iii). The proof of Lemma 7.2.11 is thus complete.
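
The lower bound (7.63) can be made concrete in the scalar case. The following is a minimal sketch under additional assumptions that are not part of the lemma: d = 1, a deterministic initial value, and constant learning rate γ and batch size J. By (7.64) the recursion (7.60) then reads Θ_n = (1 − γ)Θ_{n−1} + γ · (batch mean), so the mean and the variance of Θ_n satisfy simple exact recursions, and no random sampling is needed. The function name `sgd_rmse` and all numerical values are illustrative.

```python
import math


def sgd_rmse(xi, gamma, J, mu, sigma, n_steps, vartheta):
    """Root mean square distance of Theta_n to vartheta for l(theta, x) = 0.5 * (theta - x)**2.

    Uses the exact recursions for the mean and the variance of
    Theta_n = (1 - gamma) * Theta_{n-1} + gamma * (batch mean), where the data
    have mean mu and standard deviation sigma, the initial value xi is a
    deterministic scalar, and gamma and J are constant.
    """
    mean, var = xi, 0.0
    for _ in range(n_steps):
        mean = (1.0 - gamma) * mean + gamma * mu
        var = (1.0 - gamma) ** 2 * var + gamma ** 2 * sigma ** 2 / J
    return math.sqrt(var + (mean - vartheta) ** 2)


gamma, J, mu, sigma = 0.1, 32, 1.0, 2.0
lower_bound = sigma * gamma / math.sqrt(J)  # right-hand side of (7.63)
errors = [
    sgd_rmse(0.0, gamma, J, mu, sigma, n, vartheta=mu)
    for n in (1, 10, 100, 1000)
]
print(lower_bound, errors)
```

Even when the reference point is the minimizer of the expected loss, the error never drops below the bound, which illustrates why a constant learning rate prevents convergence.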

7.2.2.3 Non-convergence of GD for summable learning rates


In the next auxiliary result, Lemma 7.2.12 below, we recall a well known lower bound for
the natural logarithm.

Lemma 7.2.12 (A lower bound for the natural logarithm). It holds for all $x \in (0,\infty)$ that
\[ \ln(x) \ge \frac{(x - 1)}{x}. \tag{7.67} \]

Proof of Lemma 7.2.12. First, observe that the fundamental theorem of calculus ensures that for all $x \in [1,\infty)$ it holds that
\[ \ln(x) = \ln(x) - \ln(1) = \int_1^x \frac{1}{t} \,dt \ge \int_1^x \frac{1}{x} \,dt = \frac{(x - 1)}{x}. \tag{7.68} \]

Furthermore, note that the fundamental theorem of calculus shows that for all $x \in (0,1]$ it holds that
\[ \begin{aligned}
\ln(x) &= \ln(x) - \ln(1) = -(\ln(1) - \ln(x)) = -\biggl[ \int_x^1 \frac{1}{t} \,dt \biggr] \\
&= \int_x^1 \Bigl( -\frac{1}{t} \Bigr) dt \ge \int_x^1 \Bigl( -\frac{1}{x} \Bigr) dt = (1 - x) \Bigl( -\frac{1}{x} \Bigr) = \frac{(x - 1)}{x}.
\end{aligned} \tag{7.69} \]


This and (7.68) prove (7.67). The proof of Lemma 7.2.12 is thus complete.
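
The bound (7.67) can also be checked numerically. The following is a minimal sketch that only uses the Python standard library; the helper name `log_lower_bound` and the chosen sample points are illustrative assumptions and not part of the original text.

```python
import math


def log_lower_bound(x: float) -> float:
    """Right-hand side of (7.67): (x - 1) / x."""
    return (x - 1.0) / x


# Sample points on both sides of x = 1 (the point where equality holds).
samples = [10.0 ** k for k in range(-6, 7)] + [0.5, 1.5, 2.0]

# Gap ln(x) - (x - 1)/x, which (7.67) asserts to be nonnegative.
gaps = [math.log(x) - log_lower_bound(x) for x in samples]
```

On these sample points every entry of `gaps` is nonnegative, and the gap vanishes at x = 1.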

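Lemma 7.2.13 (GD fails to converge for a summable sequence of learning rates). Let
$d \in \mathbb{N}$, $\vartheta \in \mathbb{R}^d$, $\xi \in \mathbb{R}^d \setminus \{\vartheta\}$, $\alpha \in (0, \infty)$, $(\gamma_n)_{n \in \mathbb{N}} \subseteq [0, \infty) \setminus \{1/\alpha\}$ satisfy
$\sum_{n=1}^{\infty} \gamma_n < \infty$, let $L \colon \mathbb{R}^d \to \mathbb{R}$ satisfy for all $\theta \in \mathbb{R}^d$ that
$$ L(\theta) = \tfrac{\alpha}{2} \|\theta - \vartheta\|_2^2, \tag{7.70} $$
and let $\Theta \colon \mathbb{N}_0 \to \mathbb{R}^d$ satisfy for all $n \in \mathbb{N}$ that $\Theta_0 = \xi$ and
$$ \Theta_n = \Theta_{n-1} - \gamma_n (\nabla L)(\Theta_{n-1}). \tag{7.71} $$
Then

(i) it holds for all $n \in \mathbb{N}_0$ that
$$ \Theta_n - \vartheta = \left[ \prod_{k=1}^{n} (1 - \gamma_k \alpha) \right] (\xi - \vartheta), \tag{7.72} $$

(ii) it holds that
$$ \liminf_{n \to \infty} \left| \prod_{k=1}^{n} \big[ 1 - \gamma_k \alpha \big] \right| > 0, \tag{7.73} $$
and

(iii) it holds that
$$ \liminf_{n \to \infty} \|\Theta_n - \vartheta\|_2 > 0. \tag{7.74} $$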

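Proof of Lemma 7.2.13. Throughout this proof, let $m \in \mathbb{N}$ satisfy for all $k \in \mathbb{N} \cap [m, \infty)$
that $\gamma_k < 1/(2\alpha)$. Observe that Lemma 5.6.4 implies that for all $\theta \in \mathbb{R}^d$ it holds that
$$ (\nabla L)(\theta) = \tfrac{\alpha}{2} (2(\theta - \vartheta)) = \alpha(\theta - \vartheta). \tag{7.75} $$
Therefore, we obtain for all $n \in \mathbb{N}$ that
$$ \begin{aligned} \Theta_n - \vartheta &= \Theta_{n-1} - \gamma_n (\nabla L)(\Theta_{n-1}) - \vartheta \\ &= \Theta_{n-1} - \gamma_n \alpha (\Theta_{n-1} - \vartheta) - \vartheta \\ &= (1 - \gamma_n \alpha)(\Theta_{n-1} - \vartheta). \end{aligned} \tag{7.76} $$
Induction hence demonstrates that for all $n \in \mathbb{N}$ it holds that
$$ \Theta_n - \vartheta = \left[ \prod_{k=1}^{n} (1 - \gamma_k \alpha) \right] (\Theta_0 - \vartheta). \tag{7.77} $$
This and the assumption that $\Theta_0 = \xi$ establish item (i). Note that the fact that for all
$k \in \mathbb{N}$ it holds that $\gamma_k \alpha \neq 1$ ensures that
$$ \left| \prod_{k=1}^{m-1} \big[ 1 - \gamma_k \alpha \big] \right| > 0. \tag{7.78} $$
Moreover, note that the fact that for all $k \in \mathbb{N} \cap [m, \infty)$ it holds that $\gamma_k \alpha \in [0, 1/2)$ assures
that for all $k \in \mathbb{N} \cap [m, \infty)$ it holds that
$$ (1 - \gamma_k \alpha) \in (1/2, 1]. \tag{7.79} $$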
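This, Lemma 7.2.12, and the assumption that $\sum_{n=1}^{\infty} \gamma_n < \infty$ show that for all $n \in \mathbb{N} \cap [m, \infty)$
it holds that
$$ \begin{aligned} \ln\left( \prod_{k=m}^{n} \big[ 1 - \gamma_k \alpha \big] \right) &= \sum_{k=m}^{n} \ln(1 - \gamma_k \alpha) \\ &\ge \sum_{k=m}^{n} \frac{(1 - \gamma_k \alpha) - 1}{(1 - \gamma_k \alpha)} = - \sum_{k=m}^{n} \left[ \frac{\gamma_k \alpha}{(1 - \gamma_k \alpha)} \right] \\ &\ge - \sum_{k=m}^{n} \left[ \frac{\gamma_k \alpha}{(\frac{1}{2})} \right] = -2\alpha \left[ \sum_{k=m}^{n} \gamma_k \right] \ge -2\alpha \left[ \sum_{k=1}^{\infty} \gamma_k \right] > -\infty. \end{aligned} \tag{7.80} $$
Combining this with (7.78) proves that for all $n \in \mathbb{N} \cap [m, \infty)$ it holds that
$$ \begin{aligned} \left| \prod_{k=1}^{n} \big[ 1 - \gamma_k \alpha \big] \right| &= \left| \prod_{k=1}^{m-1} \big[ 1 - \gamma_k \alpha \big] \right| \exp\left( \ln\left( \prod_{k=m}^{n} \big[ 1 - \gamma_k \alpha \big] \right) \right) \\ &\ge \left| \prod_{k=1}^{m-1} \big[ 1 - \gamma_k \alpha \big] \right| \exp\left( -2\alpha \left[ \sum_{k=1}^{\infty} \gamma_k \right] \right) > 0. \end{aligned} \tag{7.81} $$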

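Therefore, we obtain that
$$ \liminf_{n \to \infty} \left| \prod_{k=1}^{n} \big[ 1 - \gamma_k \alpha \big] \right| \ge \left| \prod_{k=1}^{m-1} \big[ 1 - \gamma_k \alpha \big] \right| \exp\left( -2\alpha \left[ \sum_{k=1}^{\infty} \gamma_k \right] \right) > 0. \tag{7.82} $$
This establishes item (ii). Observe that items (i) and (ii) and the assumption that $\xi \neq \vartheta$
imply that
$$ \begin{aligned} \liminf_{n \to \infty} \|\Theta_n - \vartheta\|_2 &= \liminf_{n \to \infty} \left\| \left[ \prod_{k=1}^{n} (1 - \gamma_k \alpha) \right] (\xi - \vartheta) \right\|_2 \\ &= \liminf_{n \to \infty} \left( \left| \prod_{k=1}^{n} (1 - \gamma_k \alpha) \right| \|\xi - \vartheta\|_2 \right) \\ &= \|\xi - \vartheta\|_2 \left( \liminf_{n \to \infty} \left| \prod_{k=1}^{n} \big[ 1 - \gamma_k \alpha \big] \right| \right) > 0. \end{aligned} \tag{7.83} $$
This proves item (iii). The proof of Lemma 7.2.13 is thus complete.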
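
The statement of Lemma 7.2.13 can be illustrated numerically in one dimension. The following is a minimal sketch in plain Python; the concrete values $\alpha = 1$, $\vartheta = 0$, $\xi = 1$ and the two learning rate sequences $\gamma_n = 0.25/n^2$ (summable) and $\gamma_n = 0.25/n$ (not summable) are assumptions chosen for illustration, as is the helper name `gd_distance`.

```python
import math


def gd_distance(alpha, theta_star, xi, gammas):
    """Run the GD recursion (7.71) for L(theta) = alpha/2 * (theta - theta_star)**2
    in dimension d = 1 and return |Theta_n - theta_star| after the last step."""
    theta = xi
    for gamma in gammas:
        # Gradient alpha * (theta - theta_star), cf. (7.75)
        theta = theta - gamma * alpha * (theta - theta_star)
    return abs(theta - theta_star)


alpha, theta_star, xi = 1.0, 0.0, 1.0
n_steps = 100_000

# Summable learning rates; all of them are below 1/(2*alpha), so m = 1 in the proof.
summable = [0.25 / n**2 for n in range(1, n_steps + 1)]
# Learning rates of harmonic type, whose sum diverges.
non_summable = [0.25 / n for n in range(1, n_steps + 1)]

dist_summable = gd_distance(alpha, theta_star, xi, summable)
dist_non_summable = gd_distance(alpha, theta_star, xi, non_summable)

# Lower bound taken from (7.81) with m = 1 (the leading product is empty),
# evaluated with the partial sum of the learning rates that were actually used.
lower_bound = math.exp(-2 * alpha * sum(summable)) * abs(xi - theta_star)
```

With the summable sequence the distance to the minimizer stays above `lower_bound` no matter how many steps are taken, whereas with the non-summable sequence it keeps shrinking as the number of steps grows.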

7.2.3 Convergence rates for SGD for quadratic objective functions


Example 7.2.14 below, in particular, provides an error analysis for the SGD optimization
method in the case of one specific stochastic optimization problem (see (7.84) below). More
general error analyses for the SGD optimization method can, for instance, be found in [221,
229] and the references therein (cf. Section 7.2.3 below).

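Example 7.2.14 (Example of an SGD process). Let $d \in \mathbb{N}$, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability
space, let $X_n \colon \Omega \to \mathbb{R}^d$, $n \in \mathbb{N}$, be i.i.d. random variables with $\mathbb{E}[\|X_1\|_2^2] < \infty$, let
$l = (l(\theta, x))_{(\theta, x) \in \mathbb{R}^d \times \mathbb{R}^d} \colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ and $L \colon \mathbb{R}^d \to \mathbb{R}$ satisfy for all $\theta, x \in \mathbb{R}^d$ that
$$ l(\theta, x) = \tfrac{1}{2} \|\theta - x\|_2^2 \qquad\text{and}\qquad L(\theta) = \mathbb{E}\big[ l(\theta, X_1) \big], \tag{7.84} $$
and let $\Theta \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^d$ be the stochastic process which satisfies for all $n \in \mathbb{N}$ that $\Theta_0 = 0$
and
$$ \Theta_n = \Theta_{n-1} - \tfrac{1}{n} (\nabla_\theta l)(\Theta_{n-1}, X_n) \tag{7.85} $$
(cf. Definition 3.3.4). Then

(i) it holds that $\{\theta \in \mathbb{R}^d \colon L(\theta) = \inf_{w \in \mathbb{R}^d} L(w)\} = \{\mathbb{E}[X_1]\}$,

(ii) it holds for all $n \in \mathbb{N}$ that $\Theta_n = \tfrac{1}{n} (X_1 + X_2 + \ldots + X_n)$,

(iii) it holds for all $n \in \mathbb{N}$ that
$$ \big( \mathbb{E}\big[ \|\Theta_n - \mathbb{E}[X_1]\|_2^2 \big] \big)^{1/2} = \big( \mathbb{E}\big[ \|X_1 - \mathbb{E}[X_1]\|_2^2 \big] \big)^{1/2} \, n^{-1/2}, \tag{7.86} $$
and

(iv) it holds for all $n \in \mathbb{N}$ that
$$ \mathbb{E}[L(\Theta_n)] - L(\mathbb{E}[X_1]) = \tfrac{1}{2} \, \mathbb{E}\big[ \|X_1 - \mathbb{E}[X_1]\|_2^2 \big] \, n^{-1}. \tag{7.87} $$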

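Proof for Example 7.2.14. Note that the assumption that $\mathbb{E}[\|X_1\|_2^2] < \infty$ and Lemma 7.2.3
demonstrate that for all $\theta \in \mathbb{R}^d$ it holds that
$$ \begin{aligned} L(\theta) &= \mathbb{E}\big[ l(\theta, X_1) \big] = \tfrac{1}{2} \mathbb{E}\big[ \|X_1 - \theta\|_2^2 \big] \\ &= \tfrac{1}{2} \Big( \mathbb{E}\big[ \|X_1 - \mathbb{E}[X_1]\|_2^2 \big] + \|\theta - \mathbb{E}[X_1]\|_2^2 \Big). \end{aligned} \tag{7.88} $$
This establishes item (i). Observe that Lemma 5.6.4 ensures that for all $\theta, x \in \mathbb{R}^d$ it holds
that
$$ (\nabla_\theta l)(\theta, x) = \tfrac{1}{2} (2(\theta - x)) = \theta - x. \tag{7.89} $$
This and (7.85) assure that for all $n \in \mathbb{N}$ it holds that
$$ \Theta_n = \Theta_{n-1} - \tfrac{1}{n} (\Theta_{n-1} - X_n) = \big( 1 - \tfrac{1}{n} \big) \Theta_{n-1} + \tfrac{1}{n} X_n = \tfrac{(n-1)}{n} \Theta_{n-1} + \tfrac{1}{n} X_n. \tag{7.90} $$
Next we claim that for all $n \in \mathbb{N}$ it holds that
$$ \Theta_n = \tfrac{1}{n} (X_1 + X_2 + \ldots + X_n). \tag{7.91} $$
We now prove (7.91) by induction on $n \in \mathbb{N}$. For the base case $n = 1$ note that (7.90)
implies that
$$ \Theta_1 = \big( \tfrac{0}{1} \big) \Theta_0 + X_1 = \tfrac{1}{1} (X_1). \tag{7.92} $$
This establishes (7.91) in the base case $n = 1$. For the induction step note that (7.90) shows
that for all $n \in \{2, 3, 4, \ldots\}$ with $\Theta_{n-1} = \tfrac{1}{(n-1)} (X_1 + X_2 + \ldots + X_{n-1})$ it holds that
$$ \begin{aligned} \Theta_n &= \tfrac{(n-1)}{n} \Theta_{n-1} + \tfrac{1}{n} X_n = \Big[ \tfrac{(n-1)}{n} \Big] \Big[ \tfrac{1}{(n-1)} (X_1 + X_2 + \ldots + X_{n-1}) \Big] + \tfrac{1}{n} X_n \\ &= \tfrac{1}{n} (X_1 + X_2 + \ldots + X_{n-1}) + \tfrac{1}{n} X_n = \tfrac{1}{n} (X_1 + X_2 + \ldots + X_n). \end{aligned} \tag{7.93} $$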

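Induction hence implies (7.91). Furthermore, note that (7.91) proves item (ii). Observe
that Lemma 7.2.6, item (ii), and the fact that $(X_n)_{n \in \mathbb{N}}$ are i.i.d. random variables with
$\mathbb{E}[\|X_1\|_2] < \infty$ demonstrate that for all $n \in \mathbb{N}$ it holds that
$$ \begin{aligned} \mathbb{E}\big[ \|\Theta_n - \mathbb{E}[X_1]\|_2^2 \big] &= \mathbb{E}\big[ \| \tfrac{1}{n} (X_1 + X_2 + \ldots + X_n) - \mathbb{E}[X_1] \|_2^2 \big] \\ &= \mathbb{E}\left[ \left\| \frac{1}{n} \left[ \sum_{k=1}^{n} (X_k - \mathbb{E}[X_1]) \right] \right\|_2^2 \right] \\ &= \frac{1}{n^2} \left( \mathbb{E}\left[ \left\| \sum_{k=1}^{n} (X_k - \mathbb{E}[X_k]) \right\|_2^2 \right] \right) \\ &= \frac{1}{n^2} \left[ \sum_{k=1}^{n} \mathbb{E}\big[ \|X_k - \mathbb{E}[X_k]\|_2^2 \big] \right] \\ &= \frac{1}{n^2} \Big[ n \, \mathbb{E}\big[ \|X_1 - \mathbb{E}[X_1]\|_2^2 \big] \Big] \\ &= \frac{\mathbb{E}[\|X_1 - \mathbb{E}[X_1]\|_2^2]}{n}. \end{aligned} \tag{7.94} $$
This establishes item (iii). It thus remains to prove item (iv). For this note that (7.88) and
(7.94) ensure that for all $n \in \mathbb{N}$ it holds that
$$ \begin{aligned} \mathbb{E}[L(\Theta_n)] - L(\mathbb{E}[X_1]) &= \mathbb{E}\Big[ \tfrac{1}{2} \Big( \mathbb{E}\big[ \|\mathbb{E}[X_1] - X_1\|_2^2 \big] + \|\Theta_n - \mathbb{E}[X_1]\|_2^2 \Big) \Big] \\ &\quad - \tfrac{1}{2} \Big( \mathbb{E}\big[ \|\mathbb{E}[X_1] - X_1\|_2^2 \big] + \|\mathbb{E}[X_1] - \mathbb{E}[X_1]\|_2^2 \Big) \\ &= \tfrac{1}{2} \mathbb{E}\big[ \|\Theta_n - \mathbb{E}[X_1]\|_2^2 \big] \\ &= \tfrac{1}{2} \, \mathbb{E}\big[ \|X_1 - \mathbb{E}[X_1]\|_2^2 \big] \, n^{-1}. \end{aligned} \tag{7.95} $$
This proves item (iv). The proof for Example 7.2.14 is thus complete.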
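
Item (ii) of Example 7.2.14 says that the SGD process with learning rates $1/n$ and initial value $0$ coincides with the running sample mean. The following minimal sketch checks this in dimension $d = 1$ with the Python standard library; the Gaussian data with mean 2 and standard deviation 3, the sample size, and the variable names are assumptions made purely for illustration.

```python
import random

random.seed(0)

# Hypothetical realizations of X_1, ..., X_n (any distribution with finite
# second moment would do; the Gaussian choice is only an illustration).
data = [random.gauss(2.0, 3.0) for _ in range(10_000)]

theta = 0.0  # Theta_0 = 0
for n, x in enumerate(data, start=1):
    # SGD step (7.85) with the gradient (theta - x) from (7.89)
    theta = theta - (1.0 / n) * (theta - x)

sample_mean = sum(data) / len(data)
gap = abs(theta - sample_mean)
```

Up to floating point rounding, `theta` and `sample_mean` agree, which is the content of (7.91).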
The next result, Theorem 7.2.15 below, specifies strong and weak convergence rates for
the SGD optimization method in dependence on the asymptotic behavior of the sequence
of learning rates. The statement and the proof of Theorem 7.2.15 can be found in Jentzen
et al. [229, Theorem 1.1].
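Theorem 7.2.15 (Convergence rates in dependence of learning rates). Let $d \in \mathbb{N}$, $\alpha, \gamma, \nu \in (0, \infty)$,
$\xi \in \mathbb{R}^d$, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $X_n \colon \Omega \to \mathbb{R}^d$, $n \in \mathbb{N}$, be i.i.d. random
variables with $\mathbb{E}[\|X_1\|_2^2] < \infty$ and $\mathbb{P}(X_1 = \mathbb{E}[X_1]) < 1$, let $(r_{\varepsilon,i})_{(\varepsilon,i) \in (0,\infty) \times \{0,1\}} \subseteq \mathbb{R}$ satisfy
for all $\varepsilon \in (0, \infty)$, $i \in \{0, 1\}$ that
$$ r_{\varepsilon,i} = \begin{cases} \nu/2 & \colon \nu < 1 \\ \min\{1/2, \gamma\alpha + (-1)^i \varepsilon\} & \colon \nu = 1 \\ 0 & \colon \nu > 1, \end{cases} \tag{7.96} $$
let $l = (l(\theta, x))_{(\theta, x) \in \mathbb{R}^d \times \mathbb{R}^d} \colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ and $L \colon \mathbb{R}^d \to \mathbb{R}$ be the functions which satisfy
for all $\theta, x \in \mathbb{R}^d$ that
$$ l(\theta, x) = \tfrac{\alpha}{2} \|\theta - x\|_2^2 \qquad\text{and}\qquad L(\theta) = \mathbb{E}\big[ l(\theta, X_1) \big], \tag{7.97} $$
and let $\Theta \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^d$ be the stochastic process which satisfies for all $n \in \mathbb{N}$ that
$$ \Theta_0 = \xi \qquad\text{and}\qquad \Theta_n = \Theta_{n-1} - \tfrac{\gamma}{n^{\nu}} (\nabla_\theta l)(\Theta_{n-1}, X_n). \tag{7.98} $$
Then

(i) there exists a unique $\vartheta \in \mathbb{R}^d$ which satisfies that $\{\theta \in \mathbb{R}^d \colon L(\theta) = \inf_{w \in \mathbb{R}^d} L(w)\} = \{\vartheta\}$,

(ii) for every $\varepsilon \in (0, \infty)$ there exist $c_0, c_1 \in (0, \infty)$ such that for all $n \in \mathbb{N}$ it holds that
$$ c_0 \, n^{-r_{\varepsilon,0}} \le \big( \mathbb{E}\big[ \|\Theta_n - \vartheta\|_2^2 \big] \big)^{1/2} \le c_1 \, n^{-r_{\varepsilon,1}}, \tag{7.99} $$
and

(iii) for every $\varepsilon \in (0, \infty)$ there exist $c_0, c_1 \in (0, \infty)$ such that for all $n \in \mathbb{N}$ it holds that
$$ c_0 \, n^{-2 r_{\varepsilon,0}} \le \mathbb{E}[L(\Theta_n)] - L(\vartheta) \le c_1 \, n^{-2 r_{\varepsilon,1}}. \tag{7.100} $$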

Proof of Theorem 7.2.15. Note that Jentzen et al. [229, Theorem 1.1] establishes items (i),
(ii), and (iii). The proof of Theorem 7.2.15 is thus complete.

7.2.4 Convergence rates for SGD for coercive objective functions


The statement and the proof of the next result, Theorem 7.2.16 below, can be found in
Jentzen et al. [221, Theorem 1.1].
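Theorem 7.2.16. Let $d \in \mathbb{N}$, $p, \alpha, \kappa, c \in (0, \infty)$, $\nu \in (0, 1)$, $q = \min(\{2, 4, 6, \ldots\} \cap [p, \infty))$,
$\xi, \vartheta \in \mathbb{R}^d$, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $(S, \mathcal{S})$ be a measurable space, let $X_n \colon \Omega \to S$,
$n \in \mathbb{N}$, be i.i.d. random variables, let $l = (l(\theta, x))_{\theta \in \mathbb{R}^d, x \in S} \colon \mathbb{R}^d \times S \to \mathbb{R}$ be
$(\mathcal{B}(\mathbb{R}^d) \otimes \mathcal{S})/\mathcal{B}(\mathbb{R})$-measurable, assume for all $x \in S$ that $(\mathbb{R}^d \ni \theta \mapsto l(\theta, x) \in \mathbb{R}) \in C^1(\mathbb{R}^d, \mathbb{R})$,
assume for all $\theta \in \mathbb{R}^d$ that
$$ \mathbb{E}\big[ |l(\theta, X_1)| + \|(\nabla_\theta l)(\theta, X_1)\|_2 \big] < \infty, \tag{7.101} $$
$$ \big\langle \theta - \vartheta, \mathbb{E}[(\nabla_\theta l)(\theta, X_1)] \big\rangle \ge c \max\big\{ \|\theta - \vartheta\|_2^2, \|\mathbb{E}[(\nabla_\theta l)(\theta, X_1)]\|_2^2 \big\}, \tag{7.102} $$
$$ \text{and}\qquad \mathbb{E}\big[ \|(\nabla_\theta l)(\theta, X_1) - \mathbb{E}[(\nabla_\theta l)(\theta, X_1)]\|_2^q \big] \le \kappa \big( 1 + \|\theta\|_2^q \big), \tag{7.103} $$
let $L \colon \mathbb{R}^d \to \mathbb{R}$ satisfy for all $\theta \in \mathbb{R}^d$ that $L(\theta) = \mathbb{E}[l(\theta, X_1)]$, and let $\Theta \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^d$ be
the stochastic process which satisfies for all $n \in \mathbb{N}$ that
$$ \Theta_0 = \xi \qquad\text{and}\qquad \Theta_n = \Theta_{n-1} - \tfrac{\alpha}{n^{\nu}} (\nabla_\theta l)(\Theta_{n-1}, X_n) \tag{7.104} $$
(cf. Definitions 1.4.7 and 3.3.4). Then

(i) it holds that $\{\theta \in \mathbb{R}^d \colon L(\theta) = \inf_{w \in \mathbb{R}^d} L(w)\} = \{\vartheta\}$ and

(ii) there exists $\mathfrak{c} \in \mathbb{R}$ such that for all $n \in \mathbb{N}$ it holds that
$$ \big( \mathbb{E}\big[ \|\Theta_n - \vartheta\|_2^p \big] \big)^{1/p} \le \mathfrak{c} \, n^{-\nu/2}. \tag{7.105} $$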

Proof of Theorem 7.2.16. Observe that Jentzen et al. [221, Theorem 1.1] proves items (i)
and (ii). The proof of Theorem 7.2.16 is thus complete.

7.3 Explicit midpoint SGD optimization


In this section we introduce the stochastic version of the explicit midpoint GD optimization
method from Section 6.2.

Definition 7.3.1 (Explicit midpoint SGD optimization method). Let d ∈ N, (γn )n∈N ⊆
[0, ∞), (Jn )n∈N ⊆ N, let (Ω, F, P) be a probability space, let (S, S) be a measurable space, let
ξ : Ω → Rd be a random variable, for every n ∈ N, j ∈ {1, 2, . . . , Jn } let Xn,j : Ω → S be a
random variable, and let l = (l(θ, x))(θ,x)∈Rd ×S : Rd × S → R and g = (g1 , . . . , gd ) : Rd ×
S → Rd satisfy for all U ∈ {V ⊆ Rd : V is open}, x ∈ S, θ ∈ U with (U ∋ ϑ 7→ l(ϑ, x) ∈
R) ∈ C 1 (U, R) that
g(θ, x) = (∇θ l)(θ, x). (7.106)
Then we say that Θ is the explicit midpoint SGD process for the loss function l with
generalized gradient g, learning rates (γn )n∈N , and initial value ξ (we say that Θ is the
explicit midpoint SGD process for the loss function l with learning rates (γn )n∈N and initial
value ξ) if and only if it holds that Θ : N0 × Ω → Rd is the function from N0 × Ω to Rd
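which satisfies for all $n \in \mathbb{N}$ that
$$ \Theta_0 = \xi \qquad\text{and}\qquad \Theta_n = \Theta_{n-1} - \gamma_n \left[ \frac{1}{J_n} \sum_{j=1}^{J_n} g\Big( \Theta_{n-1} - \frac{\gamma_n}{2} \Big[ \frac{1}{J_n} \sum_{j=1}^{J_n} g(\Theta_{n-1}, X_{n,j}) \Big], X_{n,j} \Big) \right]. \tag{7.107} $$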

An implementation of the explicit midpoint SGD optimization method in PyTorch is
given in Source code 7.3.

    import torch
    import torch.nn as nn
    import numpy as np

    net = nn.Sequential(
        nn.Linear(1, 200), nn.ReLU(), nn.Linear(200, 1)
    )

    M = 1000

    X = torch.rand((M, 1)) * 4 * np.pi - 2 * np.pi
    Y = torch.sin(X)

    J = 64

    N = 150000

    loss = nn.MSELoss()
    lr = 0.003

    for n in range(N):
        indices = torch.randint(0, M, (J,))

        x = X[indices]
        y = Y[indices]

        net.zero_grad()

        # Remember the original parameters
        params = [p.clone().detach() for p in net.parameters()]
        # Compute the loss
        loss_val = loss(net(x), y)
        # Compute the gradients with respect to the parameters
        loss_val.backward()

        with torch.no_grad():
            # Make a half-step in the direction of the negative
            # gradient
            for p in net.parameters():
                if p.grad is not None:
                    p.sub_(0.5 * lr * p.grad)

        net.zero_grad()
        # Compute the loss and the gradients at the midpoint
        loss_val = loss(net(x), y)
        loss_val.backward()

        with torch.no_grad():
            # Subtract the scaled gradient at the midpoint from the
            # original parameters
            for param, midpoint_param in zip(
                params, net.parameters()
            ):
                param.sub_(lr * midpoint_param.grad)

            # Copy the new parameters into the model
            for param, p in zip(params, net.parameters()):
                p.copy_(param)

        if n % 1000 == 0:
            with torch.no_grad():
                x = torch.rand((1000, 1)) * 4 * np.pi - 2 * np.pi
                y = torch.sin(x)
                loss_val = loss(net(x), y)
                print(f"Iteration: {n+1}, Loss: {loss_val}")

Source code 7.3 (code/optimization_methods/midpoint_sgd.py): Python code
implementing the explicit midpoint SGD optimization method in PyTorch

7.4 SGD optimization with classical momentum


In this section we introduce the stochastic version of the momentum GD optimization
method from Section 6.3 (cf. Polyak [337] and, for example, [111, 247]).

Definition 7.4.1 (Momentum SGD optimization method). Let d ∈ N, (γn )n∈N ⊆ [0, ∞),
(Jn )n∈N ⊆ N, (αn )n∈N ⊆ [0, 1], let (Ω, F, P) be a probability space, let (S, S) be a measurable
space, let ξ : Ω → Rd be a random variable, for every n ∈ N, j ∈ {1, 2, . . . , Jn } let
Xn,j : Ω → S be a random variable, and let l = (l(θ, x))(θ,x)∈Rd ×S : Rd × S → R and
g = (g1 , . . . , gd ) : Rd × S → Rd satisfy for all U ∈ {V ⊆ Rd : V is open}, x ∈ S, θ ∈ U
with (U ∋ ϑ 7→ l(ϑ, x) ∈ R) ∈ C 1 (U, R) that

g(θ, x) = (∇θ l)(θ, x). (7.108)

Then we say that Θ is the momentum SGD process on ((Ω, F, P), (S, S)) for the loss function
l with generalized gradient g, learning rates (γn )n∈N , batch sizes (Jn )n∈N , momentum decay
factors (αn )n∈N , initial value ξ, and data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } (we say that Θ is the
momentum SGD process for the loss function l with learning rates (γn )n∈N , batch sizes
(Jn )n∈N , momentum decay factors (αn )n∈N , initial value ξ, and data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } )
if and only if Θ : N0 × Ω → Rd is the function from N0 × Ω to Rd which satisfies that there
exists m : N0 × Ω → Rd such that for all n ∈ N it holds that

$$ \Theta_0 = \xi, \qquad m_0 = 0, \tag{7.109} $$
$$ m_n = \alpha_n m_{n-1} + (1 - \alpha_n) \left[ \frac{1}{J_n} \sum_{j=1}^{J_n} g(\Theta_{n-1}, X_{n,j}) \right], \tag{7.110} $$
$$ \text{and}\qquad \Theta_n = \Theta_{n-1} - \gamma_n m_n. \tag{7.111} $$


An implementation in PyTorch of the momentum SGD optimization method as
described in Definition 7.4.1 above is given in Source code 7.4. This code produces a plot
which illustrates how different choices of the momentum decay rate and of the learning
rate influence the progression of the loss during the training of a simple ANN with a
single hidden layer, learning an approximation of the sine function. We note that while
Source code 7.4 serves to illustrate a concrete implementation of the momentum SGD
optimization method, for applications it is generally much preferable to use PyTorch's
built-in implementation of the momentum SGD optimization method in the torch.optim.SGD
optimizer, rather than implementing it from scratch.

    import torch
    import torch.nn as nn
    import numpy as np
    import matplotlib.pyplot as plt

    M = 10000

    torch.manual_seed(0)
    X = torch.rand((M, 1)) * 4 * np.pi - 2 * np.pi
    Y = torch.sin(X)

    J = 64

    N = 100000

    loss = nn.MSELoss()
    lr = 0.01
    alpha = 0.999

    fig, axs = plt.subplots(1, 4, figsize=(12, 3), sharey='row')

    net = nn.Sequential(
        nn.Linear(1, 200), nn.ReLU(), nn.Linear(200, 1)
    )

    for i, alpha in enumerate([0, 0.9, 0.99, 0.999]):
        print(f"alpha = {alpha}")

        for lr in [0.1, 0.03, 0.01, 0.003]:
            torch.manual_seed(0)
            net.apply(
                lambda m: m.reset_parameters()
                if isinstance(m, nn.Linear)
                else None
            )

            momentum = [
                p.clone().detach().zero_() for p in net.parameters()
            ]

            losses = []
            print(f"lr = {lr}")

            for n in range(N):
                indices = torch.randint(0, M, (J,))

                x = X[indices]
                y = Y[indices]

                net.zero_grad()

                loss_val = loss(net(x), y)
                loss_val.backward()

                with torch.no_grad():
                    for m, p in zip(momentum, net.parameters()):
                        m.mul_(alpha)
                        m.add_((1 - alpha) * p.grad)
                        p.sub_(lr * m)

                if n % 100 == 0:
                    with torch.no_grad():
                        x = (torch.rand((1000, 1)) - 0.5) * 4 * np.pi
                        y = torch.sin(x)
                        loss_val = loss(net(x), y)
                        losses.append(loss_val.item())

            axs[i].plot(losses, label=f"$\\gamma = {lr}$")

        axs[i].set_yscale("log")
        axs[i].set_ylim([1e-6, 1])
        axs[i].set_title(f"$\\alpha = {alpha}$")

    axs[0].legend()

    plt.tight_layout()
    plt.savefig("../plots/sgd_momentum.pdf", bbox_inches='tight')

Source code 7.4 (code/optimization_methods/momentum_sgd.py): Python code
implementing the SGD optimization method with classical momentum in PyTorch

7.4.1 Bias-adjusted SGD optimization with classical momentum


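Figure 7.3 (plots/sgd_momentum.pdf): A plot showing the influence of the momentum
decay rate and learning rate on the loss during the training of an ANN using the SGD
optimization method with classical momentum. [Four panels, one for each momentum decay
factor α ∈ {0, 0.9, 0.99, 0.999}; each panel shows one loss curve for each learning rate
γ ∈ {0.1, 0.03, 0.01, 0.003} on a logarithmic vertical axis ranging from 10^{-6} to 10^0 and
a horizontal axis ranging from 0 to 1000.]

Definition 7.4.2 (Bias-adjusted momentum SGD optimization method). Let $d \in \mathbb{N}$,
$(\gamma_n)_{n \in \mathbb{N}} \subseteq [0, \infty)$, $(J_n)_{n \in \mathbb{N}} \subseteq \mathbb{N}$, $(\alpha_n)_{n \in \mathbb{N}} \subseteq [0, 1]$ satisfy $\alpha_1 < 1$, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a
probability space, let $(S, \mathcal{S})$ be a measurable space, let $\xi \colon \Omega \to \mathbb{R}^d$ be a random variable,
for every $n \in \mathbb{N}$, $j \in \{1, 2, \ldots, J_n\}$ let $X_{n,j} \colon \Omega \to S$ be a random variable, and let
$l = (l(\theta, x))_{(\theta, x) \in \mathbb{R}^d \times S} \colon \mathbb{R}^d \times S \to \mathbb{R}$ and $g = (g_1, \ldots, g_d) \colon \mathbb{R}^d \times S \to \mathbb{R}^d$ satisfy for all
$U \in \{V \subseteq \mathbb{R}^d \colon V \text{ is open}\}$, $x \in S$, $\theta \in U$ with $(U \ni \vartheta \mapsto l(\vartheta, x) \in \mathbb{R}) \in C^1(U, \mathbb{R})$ that
$$ g(\theta, x) = (\nabla_\theta l)(\theta, x). \tag{7.112} $$
Then we say that $\Theta$ is the bias-adjusted momentum SGD process on $((\Omega, \mathcal{F}, \mathbb{P}), (S, \mathcal{S}))$ for
the loss function $l$ with generalized gradient $g$, learning rates $(\gamma_n)_{n \in \mathbb{N}}$, batch sizes $(J_n)_{n \in \mathbb{N}}$,
momentum decay factors $(\alpha_n)_{n \in \mathbb{N}}$, initial value $\xi$, and data $(X_{n,j})_{(n,j) \in \{(k,l) \in \mathbb{N}^2 \colon l \le J_k\}}$ (we say
that $\Theta$ is the bias-adjusted momentum SGD process for the loss function $l$ with learning
rates $(\gamma_n)_{n \in \mathbb{N}}$, batch sizes $(J_n)_{n \in \mathbb{N}}$, momentum decay factors $(\alpha_n)_{n \in \mathbb{N}}$, initial value $\xi$, and
data $(X_{n,j})_{(n,j) \in \{(k,l) \in \mathbb{N}^2 \colon l \le J_k\}}$) if and only if $\Theta \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^d$ is the function from $\mathbb{N}_0 \times \Omega$
to $\mathbb{R}^d$ which satisfies that there exists $m \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^d$ such that for all $n \in \mathbb{N}$ it holds that
$$ \Theta_0 = \xi, \qquad m_0 = 0, \tag{7.113} $$
$$ m_n = \alpha_n m_{n-1} + (1 - \alpha_n) \left[ \frac{1}{J_n} \sum_{j=1}^{J_n} g(\Theta_{n-1}, X_{n,j}) \right], \tag{7.114} $$
$$ \text{and}\qquad \Theta_n = \Theta_{n-1} - \frac{\gamma_n m_n}{1 - \prod_{l=1}^{n} \alpha_l}. \tag{7.115} $$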
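
To see what the adjustment factor $1 - \prod_{l=1}^{n} \alpha_l$ in (7.115) accomplishes, one can feed the momentum recursion (7.114) with a constant gradient. The following is a minimal sketch in plain Python; the constant scalar gradient `g`, the constant decay factor `alpha`, and the variable names are assumptions made for illustration only.

```python
alpha = 0.99   # hypothetical constant momentum decay factor alpha_n
g = 2.5        # hypothetical constant (scalar) gradient

m = 0.0        # m_0 = 0
prod = 1.0     # running product alpha_1 * ... * alpha_n
raw, adjusted = [], []
for n in range(1, 11):
    m = alpha * m + (1 - alpha) * g   # momentum recursion (7.114)
    prod *= alpha
    raw.append(m)                     # unadjusted momentum m_n
    adjusted.append(m / (1 - prod))   # bias-adjusted momentum used in (7.115)
```

In this setting the unadjusted momentum starts out at only $(1 - \alpha) g$ and grows slowly towards $g$, while the adjusted quantity equals $g$ (up to rounding) from the first step on.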

An implementation of the bias-adjusted momentum SGD optimization method in
PyTorch is given in Source code 7.5.

    import torch
    import torch.nn as nn
    import numpy as np

    net = nn.Sequential(
        nn.Linear(1, 200), nn.ReLU(), nn.Linear(200, 1)
    )

    M = 1000

    X = torch.rand((M, 1)) * 4 * np.pi - 2 * np.pi
    Y = torch.sin(X)

    J = 64

    N = 150000

    loss = nn.MSELoss()
    lr = 0.01
    alpha = 0.99
    adj = 1

    momentum = [p.clone().detach().zero_() for p in net.parameters()]

    for n in range(N):
        indices = torch.randint(0, M, (J,))

        x = X[indices]
        y = Y[indices]

        net.zero_grad()

        loss_val = loss(net(x), y)
        loss_val.backward()

        adj *= alpha

        with torch.no_grad():
            for m, p in zip(momentum, net.parameters()):
                m.mul_(alpha)
                m.add_((1 - alpha) * p.grad)
                p.sub_(lr * m / (1 - adj))

        if n % 1000 == 0:
            with torch.no_grad():
                x = torch.rand((1000, 1)) * 4 * np.pi - 2 * np.pi
                y = torch.sin(x)
                loss_val = loss(net(x), y)
                print(f"Iteration: {n+1}, Loss: {loss_val}")

Source code 7.5 (code/optimization_methods/momentum_sgd_bias_adj.py):
Python code implementing the bias-adjusted momentum SGD optimization method
in PyTorch


7.5 SGD optimization with Nesterov momentum


In this section we introduce the stochastic version of the Nesterov accelerated GD
optimization method from Section 6.4 (cf. [302, 387]).

Definition 7.5.1 (Nesterov accelerated SGD optimization method). Let d ∈ N, (γn )n∈N ⊆
[0, ∞), (Jn )n∈N ⊆ N, (αn )n∈N ⊆ [0, 1], let (Ω, F, P) be a probability space, let (S, S) be a
measurable space, let ξ : Ω → Rd be a random variable, for every n ∈ N, j ∈ {1, 2, . . . , Jn }
let Xn,j : Ω → S be a random variable, and let l = (l(θ, x))(θ,x)∈Rd ×S : Rd × S → R and
g = (g1 , . . . , gd ) : Rd × S → Rd satisfy for all U ∈ {V ⊆ Rd : V is open}, x ∈ S, θ ∈ U
with (U ∋ ϑ 7→ l(ϑ, x) ∈ R) ∈ C 1 (U, R) that

g(θ, x) = (∇θ l)(θ, x). (7.116)

Then we say that Θ is the Nesterov accelerated SGD process on ((Ω, F, P), (S, S)) for the
loss function l with generalized gradient g, learning rates (γn )n∈N , batch sizes (Jn )n∈N ,
momentum decay factors (αn )n∈N , initial value ξ, and data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } (we
say that Θ is the Nesterov accelerated SGD process for the loss function l with learning
rates (γn )n∈N , batch sizes (Jn )n∈N , momentum decay rates (αn )n∈N , initial value ξ, and data
(Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } ) if and only if Θ : N0 × Ω → Rd is the function from N0 × Ω to Rd
which satisfies that there exists m : N0 × Ω → Rd such that for all n ∈ N it holds that

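$$ \Theta_0 = \xi, \qquad m_0 = 0, \tag{7.117} $$
$$ m_n = \alpha_n m_{n-1} + (1 - \alpha_n) \left[ \frac{1}{J_n} \sum_{j=1}^{J_n} g\big( \Theta_{n-1} - \gamma_n \alpha_n m_{n-1}, X_{n,j} \big) \right], \tag{7.118} $$
$$ \text{and}\qquad \Theta_n = \Theta_{n-1} - \gamma_n m_n. \tag{7.119} $$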

An implementation of the Nesterov accelerated SGD optimization method in PyTorch
is given in Source code 7.6.

    import torch
    import torch.nn as nn
    import numpy as np

    net = nn.Sequential(
        nn.Linear(1, 200), nn.ReLU(), nn.Linear(200, 1)
    )

    M = 1000

    X = torch.rand((M, 1)) * 4 * np.pi - 2 * np.pi
    Y = torch.sin(X)

    J = 64

    N = 150000

    loss = nn.MSELoss()
    lr = 0.003
    alpha = 0.999

    m = [p.clone().detach().zero_() for p in net.parameters()]

    for n in range(N):
        indices = torch.randint(0, M, (J,))

        x = X[indices]
        y = Y[indices]

        net.zero_grad()

        # Remember the original parameters
        params = [p.clone().detach() for p in net.parameters()]

        with torch.no_grad():
            # Move the model to the look-ahead point
            # Theta_{n-1} - lr * alpha * m_{n-1} at which the gradient
            # in (7.118) is evaluated (the saved copies in params
            # stay at Theta_{n-1})
            for p, m_p in zip(net.parameters(), m):
                p.sub_(lr * alpha * m_p)

        # Compute the loss
        loss_val = loss(net(x), y)
        # Compute the gradients with respect to the parameters
        loss_val.backward()

        with torch.no_grad():
            for p, m_p, q in zip(net.parameters(), m, params):
                m_p.mul_(alpha)
                m_p.add_((1 - alpha) * p.grad)
                # Update the original parameters as in (7.119)
                q.sub_(lr * m_p)
                p.copy_(q)

        if n % 1000 == 0:
            with torch.no_grad():
                x = torch.rand((1000, 1)) * 4 * np.pi - 2 * np.pi
                y = torch.sin(x)
                loss_val = loss(net(x), y)
                print(f"Iteration: {n+1}, Loss: {loss_val}")

Source code 7.6 (code/optimization_methods/nesterov_sgd.py): Python code
implementing the Nesterov accelerated SGD optimization method in PyTorch


7.5.1 Simplified SGD optimization with Nesterov momentum


For reasons of algorithmic simplicity, in several deep learning libraries including PyTorch
(see [338] and cf., for instance, [31, Section 3.5]) optimization with Nesterov momentum
is not implemented such that it precisely corresponds to Definition 7.5.1. Rather, an
alternative definition for Nesterov accelerated SGD optimization is used, which we present
in Definition 7.5.3. The next result illustrates the connection between the original notion of
Nesterov accelerated SGD optimization in Definition 7.5.1 and the alternative notion of
Nesterov accelerated SGD optimization in Definition 7.5.3 employed by PyTorch (compare
(7.121)–(7.123) with (7.134)–(7.136)).

Lemma 7.5.2 (Relations between Definition 7.5.1 and Definition 7.5.3). Let d ∈ N,
(γn )n∈N ⊆ [0, ∞), (Jn )n∈N ⊆ N, (αn )n∈N0 ⊆ [0, 1), let (Ω, F, P) be a probability space, let
(S, S) be a measurable space, let ξ : Ω → Rd be a random variable, for every n ∈ N, j ∈
{1, 2, . . . , Jn } let Xn,j : Ω → S be a random variable, let l = (l(θ, x))(θ,x)∈Rd ×S : Rd ×S → R
and g = (g1 , . . . , gd ) : Rd × S → Rd satisfy for all x ∈ S, θ ∈ {v ∈ Rd : l(·, x) is
differentiable at v} that
g(θ, x) = (∇θ l)(θ, x), (7.120)
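let $\Theta \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^d$ and $\mathbf{m} \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^d$ satisfy for all $n \in \mathbb{N}$ that
$$ \Theta_0 = \xi, \qquad \mathbf{m}_0 = 0, \tag{7.121} $$
$$ \mathbf{m}_n = \alpha_n \mathbf{m}_{n-1} + (1 - \alpha_n) \left[ \frac{1}{J_n} \sum_{j=1}^{J_n} g\big( \Theta_{n-1} - \gamma_n \alpha_n \mathbf{m}_{n-1}, X_{n,j} \big) \right], \tag{7.122} $$
$$ \text{and}\qquad \Theta_n = \Theta_{n-1} - \gamma_n \mathbf{m}_n, \tag{7.123} $$
let $(\beta_n)_{n \in \mathbb{N}} \subseteq [0, \infty)$, $(\delta_n)_{n \in \mathbb{N}} \subseteq [0, \infty)$ satisfy for all $n \in \mathbb{N}$ that
$$ \beta_n = \frac{\alpha_n (1 - \alpha_{n-1})}{1 - \alpha_n} \qquad\text{and}\qquad \delta_n = (1 - \alpha_n) \gamma_n, \tag{7.124} $$
and let $\Psi \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^d$ and $m \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^d$ satisfy for all $n \in \mathbb{N}_0$ that
$$ m_n = \frac{\mathbf{m}_n}{1 - \alpha_n} \qquad\text{and}\qquad \Psi_n = \Theta_n - \gamma_{n+1} \alpha_{n+1} \mathbf{m}_n. \tag{7.125} $$
Then

(i) it holds that $\Theta$ is the Nesterov accelerated SGD process on $((\Omega, \mathcal{F}, \mathbb{P}), (S, \mathcal{S}))$ for the
loss function $l$ with generalized gradient $g$, learning rates $(\gamma_n)_{n \in \mathbb{N}}$, batch sizes $(J_n)_{n \in \mathbb{N}}$,
momentum decay factors $(\alpha_n)_{n \in \mathbb{N}}$, initial value $\xi$, and data $(X_{n,j})_{(n,j) \in \{(k,l) \in \mathbb{N}^2 \colon l \le J_k\}}$
and

(ii) it holds for all $n \in \mathbb{N}$ that
$$ \Psi_0 = \xi, \qquad m_0 = 0, \tag{7.126} $$
$$ m_n = \beta_n m_{n-1} + \frac{1}{J_n} \sum_{j=1}^{J_n} g\big( \Psi_{n-1}, X_{n,j} \big), \tag{7.127} $$
$$ \text{and}\qquad \Psi_n = \Psi_{n-1} - \delta_{n+1} \beta_{n+1} m_n - \delta_n \left[ \frac{1}{J_n} \sum_{j=1}^{J_n} g\big( \Psi_{n-1}, X_{n,j} \big) \right]. \tag{7.128} $$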

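Proof of Lemma 7.5.2. Note that (7.121), (7.122), and (7.123) show item (i). Observe that
(7.122) and (7.125) imply that for all $n \in \mathbb{N}$ it holds that
$$ \begin{aligned} m_n &= \frac{\alpha_n \mathbf{m}_{n-1}}{1 - \alpha_n} + \frac{1}{J_n} \sum_{j=1}^{J_n} g\big( \Psi_{n-1}, X_{n,j} \big) \\ &= \frac{\alpha_n (1 - \alpha_{n-1}) m_{n-1}}{1 - \alpha_n} + \frac{1}{J_n} \sum_{j=1}^{J_n} g\big( \Psi_{n-1}, X_{n,j} \big). \end{aligned} \tag{7.129} $$
This and (7.124) demonstrate that for all $n \in \mathbb{N}$ it holds that
$$ m_n = \beta_n m_{n-1} + \frac{1}{J_n} \sum_{j=1}^{J_n} g\big( \Psi_{n-1}, X_{n,j} \big). \tag{7.130} $$
Furthermore, note that (7.122), (7.123), and (7.125) ensure that for all $n \in \mathbb{N}$ it holds that
$$ \begin{aligned} \Psi_n &= \Theta_n - \gamma_{n+1} \alpha_{n+1} \mathbf{m}_n \\ &= \Theta_{n-1} - \gamma_n \mathbf{m}_n - \gamma_{n+1} \alpha_{n+1} \mathbf{m}_n \\ &= \Psi_{n-1} + \gamma_n \alpha_n \mathbf{m}_{n-1} - \gamma_n \mathbf{m}_n - \gamma_{n+1} \alpha_{n+1} \mathbf{m}_n \\ &= \Psi_{n-1} + \gamma_n \alpha_n \mathbf{m}_{n-1} - \gamma_n \alpha_n \mathbf{m}_{n-1} - \gamma_n (1 - \alpha_n) \left[ \frac{1}{J_n} \sum_{j=1}^{J_n} g\big( \Psi_{n-1}, X_{n,j} \big) \right] - \gamma_{n+1} \alpha_{n+1} \mathbf{m}_n \\ &= \Psi_{n-1} - \gamma_{n+1} \alpha_{n+1} \mathbf{m}_n - \gamma_n (1 - \alpha_n) \left[ \frac{1}{J_n} \sum_{j=1}^{J_n} g\big( \Psi_{n-1}, X_{n,j} \big) \right] \\ &= \Psi_{n-1} - \gamma_{n+1} \alpha_{n+1} (1 - \alpha_n) m_n - \gamma_n (1 - \alpha_n) \left[ \frac{1}{J_n} \sum_{j=1}^{J_n} g\big( \Psi_{n-1}, X_{n,j} \big) \right]. \end{aligned} \tag{7.131} $$
This and (7.124) establish that for all $n \in \mathbb{N}$ it holds that
$$ \begin{aligned} \Psi_n &= \Psi_{n-1} - \frac{\delta_{n+1} \alpha_{n+1} (1 - \alpha_n) m_n}{1 - \alpha_{n+1}} - \delta_n \left[ \frac{1}{J_n} \sum_{j=1}^{J_n} g\big( \Psi_{n-1}, X_{n,j} \big) \right] \\ &= \Psi_{n-1} - \delta_{n+1} \beta_{n+1} m_n - \delta_n \left[ \frac{1}{J_n} \sum_{j=1}^{J_n} g\big( \Psi_{n-1}, X_{n,j} \big) \right]. \end{aligned} \tag{7.132} $$
Combining this with (7.121), (7.125), and (7.130) proves item (ii). The proof of Lemma 7.5.2
is thus complete.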
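
The relation established in Lemma 7.5.2 can be verified numerically on a toy problem. The following is a minimal sketch in plain Python for dimension $d = 1$, batch size 1, and the loss $l(\theta, x) = \frac{1}{2}(\theta - x)^2$; the specific sequences `alpha` and `gamma`, the data `xs`, and the initial value `xi` are hypothetical choices made only for this illustration.

```python
def alpha(n):
    """Hypothetical momentum decay factors in [0, 1), defined for n >= 0."""
    return 0.9 - 0.5 / (n + 1)


def gamma(n):
    """Hypothetical learning rates, defined for n >= 1."""
    return 0.1 / (1 + 0.1 * n)


def beta(n):
    """beta_n from (7.124)."""
    return alpha(n) * (1 - alpha(n - 1)) / (1 - alpha(n))


def delta(n):
    """delta_n from (7.124)."""
    return (1 - alpha(n)) * gamma(n)


def grad(theta, x):
    """Gradient of 0.5 * (theta - x)**2 with respect to theta."""
    return theta - x


xs = [1.0, -2.0, 0.5, 3.0, -1.5, 2.0, 0.0, 1.0]  # hypothetical data, batch size 1
xi = 4.0                                         # hypothetical initial value

# Recursion (7.121)-(7.123); Psi_n is recorded via its definition (7.125).
theta, mom = xi, 0.0
psi_from_theta = [theta - gamma(1) * alpha(1) * mom]
for n, x in enumerate(xs, start=1):
    g = grad(theta - gamma(n) * alpha(n) * mom, x)
    mom = alpha(n) * mom + (1 - alpha(n)) * g
    theta = theta - gamma(n) * mom
    psi_from_theta.append(theta - gamma(n + 1) * alpha(n + 1) * mom)

# Recursion (7.126)-(7.128), run on its own.
psi, m = xi, 0.0
psi_direct = [psi]
for n, x in enumerate(xs, start=1):
    g = grad(psi, x)
    m = beta(n) * m + g
    psi = psi - delta(n + 1) * beta(n + 1) * m - delta(n) * g
    psi_direct.append(psi)

max_gap = max(abs(a - b) for a, b in zip(psi_from_theta, psi_direct))
```

Both ways of computing $\Psi_n$ agree up to floating point rounding, in line with item (ii) of Lemma 7.5.2.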
Definition 7.5.3 (Simplified Nesterov accelerated SGD optimization method). Let d ∈ N,
(γn )n∈N ⊆ [0, ∞), (Jn )n∈N ⊆ N, (αn )n∈N ⊆ [0, ∞), let (Ω, F, P) be a probability space, let
(S, S) be a measurable space, let ξ : Ω → Rd be a random variable, for every n ∈ N, j ∈
{1, 2, . . . , Jn } let Xn,j : Ω → S be a random variable, and let l = (l(θ, x))(θ,x)∈Rd ×S : Rd ×
S → R and g = (g1 , . . . , gd ) : Rd × S → Rd satisfy for all x ∈ S, θ ∈ {v ∈ Rd : l(·, x) is
differentiable at v} that
g(θ, x) = (∇θ l)(θ, x). (7.133)
Then we say that Θ is the simplified Nesterov accelerated SGD process on ((Ω, F, P), (S, S))
for the loss function l with generalized gradient g, learning rates (γn )n∈N , batch sizes
(Jn )n∈N , momentum decay factors (αn )n∈N , initial value ξ, and data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk }
(we say that Θ is the simplified Nesterov accelerated SGD process for the loss function l
with learning rates (γn )n∈N , batch sizes (Jn )n∈N , momentum decay rates (αn )n∈N , initial
value ξ, and data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } ) if and only if Θ : N0 × Ω → Rd is the function
from N0 × Ω to Rd which satisfies that there exists m : N0 × Ω → Rd such that for all n ∈ N
it holds that
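$$ \Theta_0 = \xi, \qquad m_0 = 0, \tag{7.134} $$
$$ m_n = \alpha_n m_{n-1} + \frac{1}{J_n} \sum_{j=1}^{J_n} g\big( \Theta_{n-1}, X_{n,j} \big), \tag{7.135} $$
$$ \text{and}\qquad \Theta_n = \Theta_{n-1} - \gamma_n \alpha_n m_n - \gamma_n \left[ \frac{1}{J_n} \sum_{j=1}^{J_n} g\big( \Theta_{n-1}, X_{n,j} \big) \right]. \tag{7.136} $$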
The simplified Nesterov accelerated SGD optimization method as described in Defini-
tion 7.5.3 is implemented in PyTorch in the form of the torch.optim.SGD optimizer with
the nesterov=True option.

7.6 Adagrad SGD optimization (Adagrad)


In this section we introduce the stochastic version of the Adagrad GD optimization method
from Section 6.5 (cf. Duchi et al. [117]).


Definition 7.6.1 (Adagrad SGD optimization method). Let d ∈ N, (γn )n∈N ⊆ [0, ∞),
(Jn )n∈N ⊆ N, ε ∈ (0, ∞), let (Ω, F, P) be a probability space, let (S, S) be a measurable
space, let ξ : Ω → Rd be a random variable, for every n ∈ N, j ∈ {1, 2, . . . , Jn } let
Xn,j : Ω → S be a random variable, and let l = (l(θ, x))(θ,x)∈Rd ×S : Rd × S → R and
g = (g1 , . . . , gd ) : Rd × S → Rd satisfy for all U ∈ {V ⊆ Rd : V is open}, x ∈ S, θ ∈ U
with (U ∋ ϑ 7→ l(ϑ, x) ∈ R) ∈ C 1 (U, R) that

g(θ, x) = (∇θ l)(θ, x). (7.137)

Then we say that Θ is the Adagrad SGD process on ((Ω, F, P), (S, S)) for the loss function l
with generalized gradient g, learning rates (γn )n∈N , batch sizes (Jn )n∈N , regularizing factor
ε, initial value ξ, and data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } (we say that Θ is the Adagrad SGD
process for the loss function l with learning rates (γn )n∈N , batch sizes (Jn )n∈N , regularizing
factor ε, initial value ξ, and data (Xn,j )(n,j)∈{(k,l)∈N2 : l≤Jk } ) if and only if it holds that
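$\Theta = (\Theta^{(1)}, \ldots, \Theta^{(d)}) \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^d$ is the function from $\mathbb{N}_0 \times \Omega$ to $\mathbb{R}^d$ which satisfies for all
$n \in \mathbb{N}$, $i \in \{1, 2, \ldots, d\}$ that $\Theta_0 = \xi$ and
$$ \Theta_n^{(i)} = \Theta_{n-1}^{(i)} - \gamma_n \left( \varepsilon + \left[ \sum_{k=1}^{n} \Big( \frac{1}{J_k} \sum_{j=1}^{J_k} g_i(\Theta_{k-1}, X_{k,j}) \Big)^2 \right]^{1/2} \right)^{-1} \left[ \frac{1}{J_n} \sum_{j=1}^{J_n} g_i(\Theta_{n-1}, X_{n,j}) \right]. \tag{7.138} $$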
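
A consequence of (7.138) is that in the first step every coordinate moves by roughly $\gamma_1$, regardless of how large or small the corresponding gradient component is. The following minimal sketch in plain Python illustrates this; the helper name `adagrad_step`, the badly scaled gradient, and the values of the learning rate and of $\varepsilon$ are assumptions made for illustration.

```python
import math


def adagrad_step(theta, acc, grad, lr, eps):
    """One step of (7.138) on coordinate lists; acc holds the running sums
    of squared (batch-averaged) gradient components."""
    new_acc = [a + g * g for a, g in zip(acc, grad)]
    new_theta = [
        t - lr * g / (eps + math.sqrt(a))
        for t, a, g in zip(theta, new_acc, grad)
    ]
    return new_theta, new_acc


theta0 = [1.0, 1.0]
acc0 = [0.0, 0.0]
grad0 = [100.0, 0.01]  # hypothetical gradient with very different scales

theta1, acc1 = adagrad_step(theta0, acc0, grad0, lr=0.1, eps=1e-8)
steps = [t0 - t1 for t0, t1 in zip(theta0, theta1)]
```

Although the two gradient components differ by four orders of magnitude, both entries of `steps` are approximately equal to the learning rate 0.1.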
An implementation in PyTorch of the Adagrad SGD optimization method as described
in Definition 7.6.1 above is given in Source code 7.7. The Adagrad SGD optimization
method as described in Definition 7.6.1 above is also available in PyTorch in the form of
the built-in torch.optim.Adagrad optimizer (which, for applications, is generally much
preferable to implementing it from scratch).

    import torch
    import torch.nn as nn
    import numpy as np

    net = nn.Sequential(
        nn.Linear(1, 200), nn.ReLU(), nn.Linear(200, 1)
    )

    M = 1000

    X = torch.rand((M, 1)) * 4 * np.pi - 2 * np.pi
    Y = torch.sin(X)

    J = 64

    N = 150000

    loss = nn.MSELoss()
    lr = 0.02
    eps = 1e-10

    # Running sums of squared gradients, initialized with eps. Note that
    # eps thereby enters inside the square root, whereas (7.138) adds
    # eps outside the square root.
    sum_sq_grad = [
        p.clone().detach().fill_(eps) for p in net.parameters()
    ]

    for n in range(N):
        indices = torch.randint(0, M, (J,))

        x = X[indices]
        y = Y[indices]

        net.zero_grad()

        loss_val = loss(net(x), y)
        loss_val.backward()

        with torch.no_grad():
            for a, p in zip(sum_sq_grad, net.parameters()):
                a.add_(p.grad * p.grad)
                p.sub_(lr * a.rsqrt() * p.grad)

        if n % 1000 == 0:
            with torch.no_grad():
                x = torch.rand((1000, 1)) * 4 * np.pi - 2 * np.pi
                y = torch.sin(x)
                loss_val = loss(net(x), y)
                print(f"Iteration: {n+1}, Loss: {loss_val}")

Source code 7.7 (code/optimization_methods/adagrad.py): Python code
implementing the Adagrad SGD optimization method in PyTorch

7.7 Root mean square propagation SGD optimization (RMSprop)

In this section we introduce the stochastic version of the RMSprop GD optimization method
from Section 6.6 (cf. Hinton et al. [199]).

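Definition 7.7.1 (RMSprop SGD optimization method). Let $d \in \mathbb{N}$, $(\gamma_n)_{n \in \mathbb{N}} \subseteq [0, \infty)$,
$(J_n)_{n \in \mathbb{N}} \subseteq \mathbb{N}$, $(\beta_n)_{n \in \mathbb{N}} \subseteq [0, 1]$, $\varepsilon \in (0, \infty)$, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $(S, \mathcal{S})$ be a
measurable space, let $\xi \colon \Omega \to \mathbb{R}^d$ be a random variable, for every $n \in \mathbb{N}$, $j \in \{1, 2, \dots, J_n\}$
let $X_{n,j} \colon \Omega \to S$ be a random variable, and let $l = (l(\theta, x))_{(\theta, x) \in \mathbb{R}^d \times S} \colon \mathbb{R}^d \times S \to \mathbb{R}$ and
$g = (g_1, \dots, g_d) \colon \mathbb{R}^d \times S \to \mathbb{R}^d$ satisfy for all $U \in \{V \subseteq \mathbb{R}^d \colon V \text{ is open}\}$, $x \in S$, $\theta \in U$
with $(U \ni \vartheta \mapsto l(\vartheta, x) \in \mathbb{R}) \in C^1(U, \mathbb{R})$ that

\[ g(\theta, x) = (\nabla_\theta l)(\theta, x). \tag{7.139} \]

Then we say that $\Theta$ is the RMSprop SGD process on $((\Omega, \mathcal{F}, \mathbb{P}), (S, \mathcal{S}))$ for the loss function $l$
with generalized gradient $g$, learning rates $(\gamma_n)_{n \in \mathbb{N}}$, batch sizes $(J_n)_{n \in \mathbb{N}}$, second moment decay
factors $(\beta_n)_{n \in \mathbb{N}}$, regularizing factor $\varepsilon$, initial value $\xi$, and data $(X_{n,j})_{(n,j) \in \{(k,l) \in \mathbb{N}^2 \colon l \le J_k\}}$ (we
say that $\Theta$ is the RMSprop SGD process for the loss function $l$ with learning rates $(\gamma_n)_{n \in \mathbb{N}}$,
batch sizes $(J_n)_{n \in \mathbb{N}}$, second moment decay factors $(\beta_n)_{n \in \mathbb{N}}$, regularizing factor $\varepsilon$, initial
value $\xi$, and data $(X_{n,j})_{(n,j) \in \{(k,l) \in \mathbb{N}^2 \colon l \le J_k\}}$) if and only if it holds that $\Theta = (\Theta^{(1)}, \dots, \Theta^{(d)}) \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^d$
is the function from $\mathbb{N}_0 \times \Omega$ to $\mathbb{R}^d$ which satisfies that there exists
$M = (M^{(1)}, \dots, M^{(d)}) \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^d$ such that for all $n \in \mathbb{N}$, $i \in \{1, 2, \dots, d\}$ it holds that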

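\[ \Theta_0 = \xi, \qquad M_0 = 0, \tag{7.140} \]

\[ M_n^{(i)} = \beta_n M_{n-1}^{(i)} + (1 - \beta_n) \left[ \frac{1}{J_n} \sum_{j=1}^{J_n} g_i(\Theta_{n-1}, X_{n,j}) \right]^2, \tag{7.141} \]

\[ \text{and} \qquad \Theta_n^{(i)} = \Theta_{n-1}^{(i)} - \frac{\gamma_n}{\varepsilon + \bigl[ M_n^{(i)} \bigr]^{1/2}} \left[ \frac{1}{J_n} \sum_{j=1}^{J_n} g_i(\Theta_{n-1}, X_{n,j}) \right]. \tag{7.142} \]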
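
The recursion in (7.141) and (7.142) can be illustrated by a minimal sketch for a single real parameter (d = 1) with constant learning rate and decay factor. It assumes a deterministic gradient, so that the batch average over j reduces to one gradient evaluation. The objective f(θ) = (θ − 3)²/2, the hyperparameter values, and the function name `rmsprop` are assumptions chosen for illustration and are not taken from the text.

```python
import math


def rmsprop(grad, theta0, lr=0.01, beta=0.9, eps=1e-8, num_steps=1000):
    """Scalar RMSprop iteration following (7.141)-(7.142) (sketch)."""
    theta = theta0
    second_moment = 0.0  # corresponds to M_0 = 0 in (7.140)
    for _ in range(num_steps):
        g = grad(theta)
        # (7.141): exponential moving average of the squared gradient
        second_moment = beta * second_moment + (1 - beta) * g * g
        # (7.142): eps is added outside the square root
        theta -= lr / (eps + math.sqrt(second_moment)) * g
    return theta


# Hypothetical objective f(theta) = (theta - 3)**2 / 2, gradient theta - 3
theta_final = rmsprop(lambda t: t - 3.0, theta0=0.0)
print(f"theta after 1000 steps: {theta_final:.4f}")
```

Because the second moment estimate is at least (1 − β) times the current squared gradient, each step in this sketch has magnitude at most lr/(1 − β)^{1/2}, independently of the scale of the gradient.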

Remark 7.7.2. In Hinton et al. [199] it is proposed to choose $0.9 = \beta_1 = \beta_2 = \dots$ as default
values for the second moment decay factors $(\beta_n)_{n \in \mathbb{N}} \subseteq [0, 1]$ in Definition 7.7.1.

An implementation in PyTorch of the RMSprop SGD optimization method as described
in Definition 7.7.1 above is given in Source code 7.8. The RMSprop SGD optimization
method as described in Definition 7.7.1 above is also available in PyTorch in the form of
the built-in torch.optim.RMSprop optimizer (which, for applications, is generally much
preferable to implementing it from scratch).
import torch
import torch.nn as nn
import numpy as np

# ANN with one hidden layer of 200 neurons and ReLU activation
net = nn.Sequential(
    nn.Linear(1, 200), nn.ReLU(), nn.Linear(200, 1)
)

# Number of training samples
M = 1000

# Training data: inputs in [-2*pi, 2*pi], outputs sin(input)
X = torch.rand((M, 1)) * 4 * np.pi - 2 * np.pi
Y = torch.sin(X)

# Batch size
J = 64

# Number of training steps
N = 150000

loss = nn.MSELoss()
lr = 0.001
beta = 0.9
eps = 1e-10

# One second moment estimate per parameter tensor, initialized with 0
moments = [p.clone().detach().zero_() for p in net.parameters()]

for n in range(N):
    # Draw a random minibatch of size J from the training data
    indices = torch.randint(0, M, (J,))

    x = X[indices]
    y = Y[indices]

    net.zero_grad()

    loss_val = loss(net(x), y)
    loss_val.backward()

    with torch.no_grad():
        for m, p in zip(moments, net.parameters()):
            # Exponential moving average of the squared gradients
            m.mul_(beta)
            m.add_((1 - beta) * p.grad * p.grad)
            # Note that (eps + m).rsqrt() computes 1/sqrt(eps + m),
            # i.e., eps is added inside the square root here
            p.sub_(lr * (eps + m).rsqrt() * p.grad)

    # Every 1000 steps, report the loss on 1000 fresh sample points
    if n % 1000 == 0:
        with torch.no_grad():
            x = torch.rand((1000, 1)) * 4 * np.pi - 2 * np.pi
            y = torch.sin(x)
            loss_val = loss(net(x), y)
            print(f"Iteration: {n+1}, Loss: {loss_val}")

Source code 7.8 (code/optimization_methods/rmsprop.py): Python code
implementing the RMSprop SGD optimization method in PyTorch
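
Source code 7.8 scales the gradient by 1/(eps + m)^{1/2}, whereas (7.142) uses 1/(ε + m^{1/2}). The following sketch compares the two scaling factors for a few values of the second moment estimate. The helper names `scale_inside` and `scale_outside` and the sample values are assumptions for illustration only. The two variants nearly agree when the second moment estimate is large compared to eps and differ substantially when it is close to zero.

```python
import math


def scale_inside(m, eps):
    """Scaling factor with eps inside the square root (as in the listing)."""
    return 1.0 / math.sqrt(eps + m)


def scale_outside(m, eps):
    """Scaling factor with eps outside the square root (as in (7.142))."""
    return 1.0 / (eps + math.sqrt(m))


eps = 1e-10
for m in (0.0, 1e-12, 1e-4, 1.0):
    print(
        f"m = {m:g}: inside = {scale_inside(m, eps):.6g}, "
        f"outside = {scale_outside(m, eps):.6g}"
    )
```

In both variants eps only serves to keep the scaling factor finite, but the same numerical value of eps corresponds to different effective regularization strengths.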

7.7.1 Bias-adjusted root mean square propagation SGD optimization

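Definition 7.7.3 (Bias-adjusted RMSprop SGD optimization method). Let $d \in \mathbb{N}$,
$(\gamma_n)_{n \in \mathbb{N}} \subseteq [0, \infty)$, $(J_n)_{n \in \mathbb{N}} \subseteq \mathbb{N}$, $(\beta_n)_{n \in \mathbb{N}} \subseteq [0, 1]$, $\varepsilon \in (0, \infty)$ satisfy $\beta_1 < 1$, let $(\Omega, \mathcal{F}, \mathbb{P})$
be a probability space, let $(S, \mathcal{S})$ be a measurable space, let $\xi \colon \Omega \to \mathbb{R}^d$ be a random variable,
for every $n \in \mathbb{N}$, $j \in \{1, 2, \dots, J_n\}$ let $X_{n,j} \colon \Omega \to S$ be a random variable, and let
$l = (l(\theta, x))_{(\theta, x) \in \mathbb{R}^d \times S} \colon \mathbb{R}^d \times S \to \mathbb{R}$ and $g = (g_1, \dots, g_d) \colon \mathbb{R}^d \times S \to \mathbb{R}^d$ satisfy for all
$U \in \{V \subseteq \mathbb{R}^d \colon V \text{ is open}\}$, $x \in S$, $\theta \in U$ with $(U \ni \vartheta \mapsto l(\vartheta, x) \in \mathbb{R}) \in C^1(U, \mathbb{R})$ that

\[ g(\theta, x) = (\nabla_\theta l)(\theta, x). \tag{7.143} \]

Then we say that $\Theta$ is the bias-adjusted RMSprop SGD process on $((\Omega, \mathcal{F}, \mathbb{P}), (S, \mathcal{S}))$ for
the loss function $l$ with generalized gradient $g$, learning rates $(\gamma_n)_{n \in \mathbb{N}}$, batch sizes $(J_n)_{n \in \mathbb{N}}$,
second moment decay factors $(\beta_n)_{n \in \mathbb{N}}$, regularizing factor $\varepsilon$, initial value $\xi$, and data
$(X_{n,j})_{(n,j) \in \{(k,l) \in \mathbb{N}^2 \colon l \le J_k\}}$ (we say that $\Theta$ is the bias-adjusted RMSprop SGD process for the
loss function $l$ with learning rates $(\gamma_n)_{n \in \mathbb{N}}$, batch sizes $(J_n)_{n \in \mathbb{N}}$, second moment decay factors
$(\beta_n)_{n \in \mathbb{N}}$, regularizing factor $\varepsilon$, initial value $\xi$, and data $(X_{n,j})_{(n,j) \in \{(k,l) \in \mathbb{N}^2 \colon l \le J_k\}}$) if and only
if it holds that $\Theta = (\Theta^{(1)}, \dots, \Theta^{(d)}) \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^d$ is the function from $\mathbb{N}_0 \times \Omega$ to $\mathbb{R}^d$
which satisfies that there exists $M = (M^{(1)}, \dots, M^{(d)}) \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^d$ such that for all $n \in \mathbb{N}$,
$i \in \{1, 2, \dots, d\}$ it holds that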
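\[ \Theta_0 = \xi, \qquad M_0 = 0, \tag{7.144} \]

\[ M_n^{(i)} = \beta_n M_{n-1}^{(i)} + (1 - \beta_n) \left[ \frac{1}{J_n} \sum_{j=1}^{J_n} g_i(\Theta_{n-1}, X_{n,j}) \right]^2, \tag{7.145} \]

\[ \text{and} \qquad \Theta_n^{(i)} = \Theta_{n-1}^{(i)} - \gamma_n \left[ \varepsilon + \left[ \frac{M_n^{(i)}}{1 - \prod_{l=1}^{n} \beta_l} \right]^{1/2} \right]^{-1} \left[ \frac{1}{J_n} \sum_{j=1}^{J_n} g_i(\Theta_{n-1}, X_{n,j}) \right]. \tag{7.146} \]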
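
The role of the factor 1 − ∏ β_l in (7.146) can be seen in a small worked example. Assume, purely for illustration, a constant decay factor β and a gradient component that takes the same value g in every step. Then (7.145) with M_0 = 0 yields M_n = (1 − βⁿ) g², so the unadjusted estimate underestimates g² in early steps, while the adjusted quantity M_n / (1 − βⁿ) equals g² in every step. The function name `second_moment_estimates` is hypothetical.

```python
def second_moment_estimates(grad_value, beta=0.9, num_steps=5):
    """Return (n, M_n, M_n / (1 - beta**n)) for a constant gradient."""
    moment = 0.0      # M_0 = 0
    beta_power = 1.0  # running product of the decay factors
    rows = []
    for n in range(1, num_steps + 1):
        moment = beta * moment + (1 - beta) * grad_value ** 2
        beta_power *= beta
        rows.append((n, moment, moment / (1 - beta_power)))
    return rows


rows = second_moment_estimates(grad_value=2.0)
for n, raw, adjusted in rows:
    print(f"n = {n}: M_n = {raw:.4f}, bias-adjusted = {adjusted:.4f}")
```

In this constant-gradient setting the bias-adjusted column stays at g² = 4 (up to rounding), whereas the raw estimate starts at (1 − β) g² and only gradually approaches g².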

An implementation in PyTorch of the bias-adjusted RMSprop SGD optimization
method as described in Definition 7.7.3 above is given in Source code 7.9.
import torch
import torch.nn as nn
import numpy as np

# ANN with one hidden layer of 200 neurons and ReLU activation
net = nn.Sequential(
    nn.Linear(1, 200), nn.ReLU(), nn.Linear(200, 1)
)

# Number of training samples
M = 1000

# Training data: inputs in [-2*pi, 2*pi], outputs sin(input)
X = torch.rand((M, 1)) * 4 * np.pi - 2 * np.pi
Y = torch.sin(X)

# Batch size
J = 64

# Number of training steps
N = 150000

loss = nn.MSELoss()
lr = 0.001
beta = 0.9
eps = 1e-10
# Running product of the decay factors (beta ** n after n steps)
adj = 1

# One second moment estimate per parameter tensor, initialized with 0
moments = [p.clone().detach().zero_() for p in net.parameters()]

for n in range(N):
    # Draw a random minibatch of size J from the training data
    indices = torch.randint(0, M, (J,))

    x = X[indices]
    y = Y[indices]

    net.zero_grad()

    loss_val = loss(net(x), y)
    loss_val.backward()

    with torch.no_grad():
        adj *= beta
        for m, p in zip(moments, net.parameters()):
            # Exponential moving average of the squared gradients
            m.mul_(beta)
            m.add_((1 - beta) * p.grad * p.grad)
            # Divide by (1 - adj) to adjust for the bias caused by
            # the initialization with 0
            p.sub_(lr * (eps + (m / (1 - adj)).sqrt()).reciprocal() * p.grad)

    # Every 1000 steps, report the loss on 1000 fresh sample points
    if n % 1000 == 0:
        with torch.no_grad():
            x = torch.rand((1000, 1)) * 4 * np.pi - 2 * np.pi
            y = torch.sin(x)
            loss_val = loss(net(x), y)
            print(f"Iteration: {n+1}, Loss: {loss_val}")

Source code 7.9 (code/optimization_methods/rmsprop_bias_adj.py): Python
code implementing the bias-adjusted RMSprop SGD optimization method in
PyTorch

7.8 Adadelta SGD optimization


In this section we introduce the stochastic version of the Adadelta GD optimization method
from Section 6.7 (cf. Zeiler [429]).
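Definition 7.8.1 (Adadelta SGD optimization method). Let $d \in \mathbb{N}$, $(J_n)_{n \in \mathbb{N}} \subseteq \mathbb{N}$,
$(\beta_n)_{n \in \mathbb{N}} \subseteq [0, 1]$, $(\delta_n)_{n \in \mathbb{N}} \subseteq [0, 1]$, $\varepsilon \in (0, \infty)$, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $(S, \mathcal{S})$ be
a measurable space, let $\xi \colon \Omega \to \mathbb{R}^d$ be a random variable, for every $n \in \mathbb{N}$, $j \in \{1, 2, \dots, J_n\}$
let $X_{n,j} \colon \Omega \to S$ be a random variable, and let $l = (l(\theta, x))_{(\theta, x) \in \mathbb{R}^d \times S} \colon \mathbb{R}^d \times S \to \mathbb{R}$ and
$g = (g_1, \dots, g_d) \colon \mathbb{R}^d \times S \to \mathbb{R}^d$ satisfy for all $U \in \{V \subseteq \mathbb{R}^d \colon V \text{ is open}\}$, $x \in S$, $\theta \in U$
with $(U \ni \vartheta \mapsto l(\vartheta, x) \in \mathbb{R}) \in C^1(U, \mathbb{R})$ that

\[ g(\theta, x) = (\nabla_\theta l)(\theta, x). \tag{7.147} \]

Then we say that $\Theta$ is the Adadelta SGD process on $((\Omega, \mathcal{F}, \mathbb{P}), (S, \mathcal{S}))$ for the loss function $l$
with generalized gradient $g$, batch sizes $(J_n)_{n \in \mathbb{N}}$, second moment decay factors $(\beta_n)_{n \in \mathbb{N}}$, delta
decay factors $(\delta_n)_{n \in \mathbb{N}}$, regularizing factor $\varepsilon$, initial value $\xi$, and data $(X_{n,j})_{(n,j) \in \{(k,l) \in \mathbb{N}^2 \colon l \le J_k\}}$
(we say that $\Theta$ is the Adadelta SGD process for the loss function $l$ with batch sizes
$(J_n)_{n \in \mathbb{N}}$, second moment decay factors $(\beta_n)_{n \in \mathbb{N}}$, delta decay factors $(\delta_n)_{n \in \mathbb{N}}$, regularizing
factor $\varepsilon$, initial value $\xi$, and data $(X_{n,j})_{(n,j) \in \{(k,l) \in \mathbb{N}^2 \colon l \le J_k\}}$) if and only if it holds that
$\Theta = (\Theta^{(1)}, \dots, \Theta^{(d)}) \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^d$ is the function from $\mathbb{N}_0 \times \Omega$ to $\mathbb{R}^d$ which satisfies that
there exist $M = (M^{(1)}, \dots, M^{(d)}) \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^d$ and $\Delta = (\Delta^{(1)}, \dots, \Delta^{(d)}) \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^d$
such that for all $n \in \mathbb{N}$, $i \in \{1, 2, \dots, d\}$ it holds that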

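\[ \Theta_0 = \xi, \qquad M_0 = 0, \qquad \Delta_0 = 0, \tag{7.148} \]

\[ M_n^{(i)} = \beta_n M_{n-1}^{(i)} + (1 - \beta_n) \left[ \frac{1}{J_n} \sum_{j=1}^{J_n} g_i(\Theta_{n-1}, X_{n,j}) \right]^2, \tag{7.149} \]

\[ \Theta_n^{(i)} = \Theta_{n-1}^{(i)} - \left( \frac{\varepsilon + \Delta_{n-1}^{(i)}}{\varepsilon + M_n^{(i)}} \right)^{1/2} \left[ \frac{1}{J_n} \sum_{j=1}^{J_n} g_i(\Theta_{n-1}, X_{n,j}) \right], \tag{7.150} \]

\[ \text{and} \qquad \Delta_n^{(i)} = \delta_n \Delta_{n-1}^{(i)} + (1 - \delta_n) \bigl| \Theta_n^{(i)} - \Theta_{n-1}^{(i)} \bigr|^2. \tag{7.151} \]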
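
The following sketch traces the first few steps of (7.149) to (7.151) for a single real parameter with constant decay factors and a deterministic gradient. The objective f(θ) = (θ − 3)²/2, the value of eps, and the function name `adadelta_trajectory` are assumptions chosen for illustration. Since Δ_0 = 0, the first increment equals (ε / (ε + M_1))^{1/2} times the gradient, so without any learning rate the initial steps are of the order of ε^{1/2}.

```python
import math


def adadelta_trajectory(grad, theta0, beta=0.9, delta=0.9, eps=1e-6, num_steps=10):
    """Scalar Adadelta iteration following (7.149)-(7.151) (sketch)."""
    theta = theta0
    second_moment = 0.0  # M_0 = 0
    delta_avg = 0.0      # Delta_0 = 0
    trajectory = [theta]
    for _ in range(num_steps):
        g = grad(theta)
        # (7.149): moving average of the squared gradient
        second_moment = beta * second_moment + (1 - beta) * g * g
        # (7.150): the step size is a ratio of two moving averages
        increment = math.sqrt((eps + delta_avg) / (eps + second_moment)) * g
        theta -= increment
        # (7.151): moving average of the squared increments
        delta_avg = delta * delta_avg + (1 - delta) * increment * increment
        trajectory.append(theta)
    return trajectory


# Hypothetical objective f(theta) = (theta - 3)**2 / 2, gradient theta - 3
trajectory = adadelta_trajectory(lambda t: t - 3.0, theta0=0.0)
for n, theta in enumerate(trajectory):
    print(f"n = {n}: theta = {theta:.6f}")
```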

An implementation in PyTorch of the Adadelta SGD optimization method as described
in Definition 7.8.1 above is given in Source code 7.10. The Adadelta SGD optimization
method as described in Definition 7.8.1 above is also available in PyTorch in the form of
the built-in torch.optim.Adadelta optimizer (which, for applications, is generally much
preferable to implementing it from scratch).
import torch
import torch.nn as nn
import numpy as np

# ANN with one hidden layer of 200 neurons and ReLU activation
net = nn.Sequential(
    nn.Linear(1, 200), nn.ReLU(), nn.Linear(200, 1)
)

# Number of training samples
M = 1000

# Training data: inputs in [-2*pi, 2*pi], outputs sin(input)
X = torch.rand((M, 1)) * 4 * np.pi - 2 * np.pi
Y = torch.sin(X)

# Batch size
J = 64

# Number of training steps
N = 150000

loss = nn.MSELoss()
beta = 0.9
delta = 0.9
eps = 1e-10

# Moving averages of the squared gradients (moments) and of the
# squared increments (Delta), one per parameter tensor
moments = [p.clone().detach().zero_() for p in net.parameters()]
Delta = [p.clone().detach().zero_() for p in net.parameters()]

for n in range(N):
    # Draw a random minibatch of size J from the training data
    indices = torch.randint(0, M, (J,))

    x = X[indices]
    y = Y[indices]

    net.zero_grad()

    loss_val = loss(net(x), y)
    loss_val.backward()

    with torch.no_grad():
        for m, D, p in zip(moments, Delta, net.parameters()):
            m.mul_(beta)
            m.add_((1 - beta) * p.grad * p.grad)
            # The increment uses the previous value of D
            inc = ((eps + D) / (eps + m)).sqrt() * p.grad
            p.sub_(inc)
            D.mul_(delta)
            D.add_((1 - delta) * inc * inc)

    # Every 1000 steps, report the loss on 1000 fresh sample points
    if n % 1000 == 0:
        with torch.no_grad():
            x = torch.rand((1000, 1)) * 4 * np.pi - 2 * np.pi
            y = torch.sin(x)
            loss_val = loss(net(x), y)
            print(f"Iteration: {n+1}, Loss: {loss_val}")

Source code 7.10 (code/optimization_methods/adadelta.py): Python code
implementing the Adadelta SGD optimization method in PyTorch

7.9 Adaptive moment estimation SGD optimization (Adam)

In this section we introduce the stochastic version of the Adam GD optimization method
from Section 6.8 (cf. Kingma & Ba [247]).
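Definition 7.9.1 (Adam SGD optimization method). Let $d \in \mathbb{N}$, $(\gamma_n)_{n \in \mathbb{N}} \subseteq [0, \infty)$,
$(J_n)_{n \in \mathbb{N}} \subseteq \mathbb{N}$, $(\alpha_n)_{n \in \mathbb{N}} \subseteq [0, 1]$, $(\beta_n)_{n \in \mathbb{N}} \subseteq [0, 1]$, $\varepsilon \in (0, \infty)$ satisfy

\[ \max\{\alpha_1, \beta_1\} < 1, \tag{7.152} \]

let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $(S, \mathcal{S})$ be a measurable space, let $\xi \colon \Omega \to \mathbb{R}^d$ be a
random variable, for every $n \in \mathbb{N}$, $j \in \{1, 2, \dots, J_n\}$ let $X_{n,j} \colon \Omega \to S$ be a random variable,
and let $l = (l(\theta, x))_{(\theta, x) \in \mathbb{R}^d \times S} \colon \mathbb{R}^d \times S \to \mathbb{R}$ and $g = (g_1, \dots, g_d) \colon \mathbb{R}^d \times S \to \mathbb{R}^d$ satisfy for
all $U \in \{V \subseteq \mathbb{R}^d \colon V \text{ is open}\}$, $x \in S$, $\theta \in U$ with $(U \ni \vartheta \mapsto l(\vartheta, x) \in \mathbb{R}) \in C^1(U, \mathbb{R})$ that

\[ g(\theta, x) = (\nabla_\theta l)(\theta, x). \tag{7.153} \]

Then we say that $\Theta$ is the Adam SGD process on $((\Omega, \mathcal{F}, \mathbb{P}), (S, \mathcal{S}))$ for the loss function
$l$ with generalized gradient $g$, learning rates $(\gamma_n)_{n \in \mathbb{N}}$, batch sizes $(J_n)_{n \in \mathbb{N}}$, momentum
decay factors $(\alpha_n)_{n \in \mathbb{N}}$, second moment decay factors $(\beta_n)_{n \in \mathbb{N}}$, regularizing factor $\varepsilon \in (0, \infty)$,
initial value $\xi$, and data $(X_{n,j})_{(n,j) \in \{(k,l) \in \mathbb{N}^2 \colon l \le J_k\}}$ (we say that $\Theta$ is the Adam SGD process
for the loss function $l$ with learning rates $(\gamma_n)_{n \in \mathbb{N}}$, batch sizes $(J_n)_{n \in \mathbb{N}}$, momentum
decay factors $(\alpha_n)_{n \in \mathbb{N}}$, second moment decay factors $(\beta_n)_{n \in \mathbb{N}}$, regularizing factor
$\varepsilon \in (0, \infty)$, initial value $\xi$, and data $(X_{n,j})_{(n,j) \in \{(k,l) \in \mathbb{N}^2 \colon l \le J_k\}}$) if and only if it holds that
$\Theta = (\Theta^{(1)}, \dots, \Theta^{(d)}) \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^d$ is the function from $\mathbb{N}_0 \times \Omega$ to $\mathbb{R}^d$ which satisfies that
there exist $m = (m^{(1)}, \dots, m^{(d)}) \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^d$ and $M = (M^{(1)}, \dots, M^{(d)}) \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^d$
such that for all $n \in \mathbb{N}$, $i \in \{1, 2, \dots, d\}$ it holds that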

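\[ \Theta_0 = \xi, \qquad m_0 = 0, \qquad M_0 = 0, \tag{7.154} \]

\[ m_n = \alpha_n m_{n-1} + (1 - \alpha_n) \left[ \frac{1}{J_n} \sum_{j=1}^{J_n} g(\Theta_{n-1}, X_{n,j}) \right], \tag{7.155} \]

\[ M_n^{(i)} = \beta_n M_{n-1}^{(i)} + (1 - \beta_n) \left[ \frac{1}{J_n} \sum_{j=1}^{J_n} g_i(\Theta_{n-1}, X_{n,j}) \right]^2, \tag{7.156} \]

\[ \text{and} \qquad \Theta_n^{(i)} = \Theta_{n-1}^{(i)} - \gamma_n \left[ \varepsilon + \left[ \frac{M_n^{(i)}}{1 - \prod_{l=1}^{n} \beta_l} \right]^{1/2} \right]^{-1} \left[ \frac{m_n^{(i)}}{1 - \prod_{l=1}^{n} \alpha_l} \right]. \tag{7.157} \]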

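Remark 7.9.2. In Kingma & Ba [247] it is proposed to choose

\[ 0.001 = \gamma_1 = \gamma_2 = \dots, \qquad 0.9 = \alpha_1 = \alpha_2 = \dots, \qquad 0.999 = \beta_1 = \beta_2 = \dots, \tag{7.158} \]

and $10^{-8} = \varepsilon$ as default values for $(\gamma_n)_{n \in \mathbb{N}} \subseteq [0, \infty)$, $(\alpha_n)_{n \in \mathbb{N}} \subseteq [0, 1]$, $(\beta_n)_{n \in \mathbb{N}} \subseteq [0, 1]$,
$\varepsilon \in (0, \infty)$ in Definition 7.9.1.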
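
A consequence of the two bias adjustments in (7.157) is that the very first Adam step has size close to the learning rate, regardless of the scale of the gradient: for n = 1 the adjusted momentum equals the gradient and the adjusted second moment equals its square. The following scalar sketch illustrates this with the default values from Remark 7.9.2. It assumes constant hyperparameters and a deterministic gradient; the family of objectives f(θ) = c·(θ − 1)²/2 and the function name `adam` are hypothetical and serve only as an illustration.

```python
import math


def adam(grad, theta0, lr=0.001, alpha=0.9, beta=0.999, eps=1e-8, num_steps=1):
    """Scalar Adam iteration following (7.154)-(7.157) (sketch)."""
    theta = theta0
    momentum = 0.0       # m_0 = 0
    second_moment = 0.0  # M_0 = 0
    alpha_power = 1.0    # running product of the momentum decay factors
    beta_power = 1.0     # running product of the second moment decay factors
    for _ in range(num_steps):
        g = grad(theta)
        momentum = alpha * momentum + (1 - alpha) * g                 # (7.155)
        second_moment = beta * second_moment + (1 - beta) * g * g     # (7.156)
        alpha_power *= alpha
        beta_power *= beta
        adjusted_momentum = momentum / (1 - alpha_power)
        adjusted_second_moment = second_moment / (1 - beta_power)
        # (7.157): bias-adjusted update
        theta -= lr * adjusted_momentum / (eps + math.sqrt(adjusted_second_moment))
    return theta


# Gradients of f(theta) = c * (theta - 1)**2 / 2 for very different scales c
first_steps = {}
for scale in (1e-3, 1.0, 1e3):
    first_steps[scale] = adam(lambda t, c=scale: c * (t - 1.0), theta0=0.0)
    print(f"scale = {scale:g}: theta after one step = {first_steps[scale]:.8f}")
```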
An implementation in PyTorch of the Adam SGD optimization method as described
in Definition 7.9.1 above is given in Source code 7.11. The Adam SGD optimization method
as described in Definition 7.9.1 above is also available in PyTorch in the form of the
built-in torch.optim.Adam optimizer (which, for applications, is generally much preferable
to implementing it from scratch).

Chapter 7: Stochastic gradient descent (SGD) optimization methods

import torch
import torch.nn as nn
import numpy as np

# ANN with one hidden layer of 200 neurons and ReLU activation
net = nn.Sequential(
    nn.Linear(1, 200), nn.ReLU(), nn.Linear(200, 1)
)

# Number of training samples
M = 1000

# Training data: inputs in [-2*pi, 2*pi], outputs sin(input)
X = torch.rand((M, 1)) * 4 * np.pi - 2 * np.pi
Y = torch.sin(X)

# Batch size
J = 64

# Number of training steps
N = 150000

loss = nn.MSELoss()
lr = 0.0001
alpha = 0.9
beta = 0.999
eps = 1e-8
# Running products of the decay factors
# (alpha ** n and beta ** n after n steps)
adj = 1.
adj2 = 1.

# Momentum terms (m) and second moment estimates (MM), one per
# parameter tensor, initialized with 0
m = [p.clone().detach().zero_() for p in net.parameters()]
MM = [p.clone().detach().zero_() for p in net.parameters()]

for n in range(N):
    # Draw a random minibatch of size J from the training data
    indices = torch.randint(0, M, (J,))

    x = X[indices]
    y = Y[indices]

    net.zero_grad()

    loss_val = loss(net(x), y)
    loss_val.backward()

    with torch.no_grad():
        adj *= alpha
        adj2 *= beta
        for m_p, M_p, p in zip(m, MM, net.parameters()):
            m_p.mul_(alpha)
            m_p.add_((1 - alpha) * p.grad)
            M_p.mul_(beta)
            M_p.add_((1 - beta) * p.grad * p.grad)
            # Update with bias-adjusted momentum and second moment
            p.sub_(
                lr * m_p / ((1 - adj) * (eps + (M_p / (1 - adj2)).sqrt()))
            )

    # Every 1000 steps, report the loss on 1000 fresh sample points
    if n % 1000 == 0:
        with torch.no_grad():
            x = torch.rand((1000, 1)) * 4 * np.pi - 2 * np.pi
            y = torch.sin(x)
            loss_val = loss(net(x), y)
            print(f"Iteration: {n+1}, Loss: {loss_val}")

Source code 7.11 (code/optimization_methods/adam.py): Python code
implementing the Adam SGD optimization method in PyTorch

Whereas Source code 7.11 and the other source codes presented in this chapter so far
served mostly to elucidate the definitions of the various optimization methods introduced
in this chapter by giving example implementations, in Source code 7.12 we demonstrate
how an actual machine learning problem might be solved using the built-in functionality
of PyTorch. This code trains a neural network with 3 convolutional layers and 2 fully
connected layers (with each hidden layer followed by a ReLU activation function) on the
MNIST dataset (introduced in Bottou et al. [47]), which consists of 28 × 28 pixel grayscale
images of handwritten digits from 0 to 9 and the corresponding labels and is one of the
most commonly used benchmarks for training machine learning systems in the literature.
Source code 7.12 uses the cross-entropy loss function and the Adam SGD optimization
method and outputs a graph showing the progression of the average loss on the training
set and on a test set that is not used for training as well as the accuracy of the model’s
predictions over the course of the training, see Figure 7.4.
import torch
import torchvision.datasets as datasets
import torchvision.transforms as transforms
import torch.nn as nn
import torch.utils.data as data
import torch.optim as optim
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter, NullFormatter

# We use the GPU if available. Otherwise, we use the CPU.
device = torch.device(
    "cuda" if torch.cuda.is_available() else "cpu"
)

# We fix a random seed. This is not necessary for training a
# neural network, but we use it here to ensure that the same
# plot is created on every run.
torch.manual_seed(0)

# The torch.utils.data.Dataset class is an abstraction for a
# collection of instances that has a length and can be indexed
# (usually by integers).
# The torchvision.datasets module contains functions for loading
# popular machine learning datasets, possibly downloading and
# transforming the data.

# Here we load the MNIST dataset, containing 28x28 grayscale images
# of handwritten digits with corresponding labels in
# {0, 1, ..., 9}.

# First load the training portion of the data set, downloading it
# from an online source to the local folder ./data (if it is not
# yet there) and transforming the data to PyTorch Tensors.
mnist_train = datasets.MNIST(
    "./data",
    train=True,
    transform=transforms.ToTensor(),
    download=True,
)
# Next load the test portion
mnist_test = datasets.MNIST(
    "./data",
    train=False,
    transform=transforms.ToTensor(),
    download=True,
)

# The torch.utils.data.DataLoader class allows iterating datasets
# for training and validation. It supports, e.g., batching and
# shuffling of datasets.

# Construct a DataLoader that when iterating returns minibatches
# of 64 instances drawn from a random permutation of the training
# dataset
train_loader = data.DataLoader(
    mnist_train, batch_size=64, shuffle=True
)
# The loader for the test dataset does not need shuffling
test_loader = data.DataLoader(
    mnist_test, batch_size=64, shuffle=False
)

# Define a neural network with 3 convolutional layers, each
# followed by a ReLU activation and then two affine layers,
# the first followed by a ReLU activation
net = nn.Sequential(  # input shape (N, 1, 28, 28)
    nn.Conv2d(1, 5, 5),  # (N, 5, 24, 24)
    nn.ReLU(),
    nn.Conv2d(5, 5, 5),  # (N, 5, 20, 20)
    nn.ReLU(),
    nn.Conv2d(5, 3, 5),  # (N, 3, 16, 16)
    nn.ReLU(),
    nn.Flatten(),  # (N, 3 * 16 * 16) = (N, 768)
    nn.Linear(768, 128),  # (N, 128)
    nn.ReLU(),
    nn.Linear(128, 10),  # output shape (N, 10)
).to(device)

# Define the loss function. For every natural number d, for
# e_1, e_2, ..., e_d the standard basis vectors in R^d, for L the
# d-dimensional cross-entropy loss function, and for A the
# d-dimensional softmax activation function, the function loss_fn
# defined here satisfies for all x in R^d and all natural numbers
# i in [0, d) that
# loss_fn(x, i) = L(A(x), e_i).
# The function loss_fn also accepts batches of inputs, in which
# case it will return the mean of the corresponding outputs.
loss_fn = nn.CrossEntropyLoss()

# Define the optimizer. We use the Adam SGD optimization method.
optimizer = optim.Adam(net.parameters(), lr=1e-3)

# This function computes the average loss of the model over the
# entire test set and the accuracy of the model's predictions.
def compute_test_loss_and_accuracy():
    total_test_loss = 0.0
    correct_count = 0
    with torch.no_grad():
        # On each iteration the test_loader will yield a
        # minibatch of images with corresponding labels
        for images, labels in test_loader:
            # Move the data to the device
            images = images.to(device)
            labels = labels.to(device)
            # Compute the output of the neural network on the
            # current minibatch
            output = net(images)
            # Compute the mean of the cross-entropy losses
            loss = loss_fn(output, labels)
            # For the cumulative total_test_loss, we multiply loss
            # with the batch size (usually 64, as specified above,
            # but might be less for the final batch).
            total_test_loss += loss.item() * images.size(0)
            # For each input, the predicted label is the index of
            # the maximal component in the output vector.
            pred_labels = torch.max(output, dim=1).indices
            # pred_labels == labels compares the two vectors
            # componentwise and returns a vector of booleans.
            # Summing over this vector counts the number of True
            # entries.
            correct_count += torch.sum(
                pred_labels == labels
            ).item()
    avg_test_loss = total_test_loss / len(mnist_test)
    accuracy = correct_count / len(mnist_test)
    return (avg_test_loss, accuracy)


# Initialize a list that holds the computed loss on every
# batch during training
train_losses = []

# Every 10 batches, we will compute the loss on the entire test
# set as well as the accuracy of the model's predictions on the
# entire test set. We do this for the purpose of illustrating in
# the produced plot the generalization capability of the ANN.
# Computing these losses and accuracies so frequently with such a
# relatively large set of datapoints (compared to the training
# set) is extremely computationally expensive, however (most of
# the training runtime will be spent computing these values) and
# so is not advisable during normal neural network training.
# Usually, the test set is only used at the very end to judge the
# performance of the final trained network. Often, a third set of
# datapoints, called the validation set (not used to train the
# network directly nor to evaluate it at the end) is used to
# judge overfitting or to tune hyperparameters.
test_interval = 10
test_losses = []
accuracies = []

# We run the training for 5 epochs, i.e., 5 full iterations
# through the training set.
i = 0
for e in range(5):
    for images, labels in train_loader:
        # Move the data to the device
        images = images.to(device)
        labels = labels.to(device)

        # Zero out the gradients
        optimizer.zero_grad()
        # Compute the output of the neural network on the current
        # minibatch
        output = net(images)
        # Compute the cross entropy loss
        loss = loss_fn(output, labels)
        # Compute the gradients
        loss.backward()
        # Update the parameters of the neural network
        optimizer.step()

        # Append the current loss to the list of training losses.
        # Note that tracking the training loss comes at
        # essentially no computational cost (since we have to
        # compute these values anyway) and so is typically done
        # during neural network training to gauge the training
        # progress.
        train_losses.append(loss.item())

        if (i + 1) % test_interval == 0:
            # Compute the average loss on the test set and the
            # accuracy of the model and add the values to the
            # corresponding list
            test_loss, accuracy = compute_test_loss_and_accuracy()
            test_losses.append(test_loss)
            accuracies.append(accuracy)

        i += 1

fig, ax1 = plt.subplots(figsize=(12, 8))
# We plot the training losses, test losses, and accuracies in the
# same plot, but using two different y-axes
ax2 = ax1.twinx()

# Use a logarithmic scale for the losses
ax1.set_yscale("log")
# Use a logit scale for the accuracies
ax2.set_yscale("logit")
ax2.set_ylim((0.3, 0.99))
N = len(test_losses) * test_interval
ax2.set_xlim((0, N))
# Plot the training losses
(training_loss_line,) = ax1.plot(
    train_losses,
    label="Training loss (left axis)",
)
# Plot test losses
(test_loss_line,) = ax1.plot(
    range(0, N, test_interval),
    test_losses,
    label="Test loss (left axis)",
)
# Plot the accuracies
(accuracies_line,) = ax2.plot(
    range(0, N, test_interval),
    accuracies,
    label="Accuracy (right axis)",
    color="red",
)
ax2.yaxis.set_major_formatter(ScalarFormatter())
ax2.yaxis.set_minor_formatter(NullFormatter())

# Put all the labels in a common legend
lines = [training_loss_line, test_loss_line, accuracies_line]
labels = [l.get_label() for l in lines]
ax2.legend(lines, labels)

plt.tight_layout()
plt.savefig("../plots/mnist.pdf", bbox_inches="tight")

Source code 7.12 (code/mnist.py): Python code training an ANN on the MNIST
dataset in PyTorch. This code produces a plot showing the progression of the
average loss on the test set and the accuracy of the model’s predictions, see Figure 7.4.
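
The shape comments in Source code 7.12 can be checked with a short calculation. The sketch below uses the standard output size formula for a convolution with dilation 1, namely ⌊(size + 2·padding − kernel size) / stride⌋ + 1; the helper name `conv_output_size` is hypothetical. With three 5 × 5 convolutions without padding and with stride 1, a 28 × 28 input shrinks to 24 × 24, 20 × 20, and 16 × 16, so that flattening the 3 output channels gives 3 · 16 · 16 = 768 features, matching the input dimension of the first affine layer.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution with dilation 1."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 28  # MNIST images are 28 x 28 pixels
sizes = []
for kernel_size in (5, 5, 5):  # the three convolutional layers
    size = conv_output_size(size, kernel_size)
    sizes.append(size)

out_channels = 3  # channels produced by the last convolutional layer
flattened = out_channels * size * size
print(sizes, flattened)
```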

[Plot: training loss and test loss (left axis, logarithmic scale) and accuracy (right axis, logit scale), plotted against the minibatch index.]

Figure 7.4 (plots/mnist.pdf): The plot produced by Source code 7.12, showing
the average loss over each minibatch used during training (training loss) as well as
the average loss over the test set and the accuracy of the model’s predictions over
the course of the training.
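
The average test loss in Source code 7.12 is obtained by multiplying each minibatch mean by the size of the minibatch before summing, because the final minibatch may contain fewer than 64 instances. The following sketch, which uses made-up per-sample losses and a hypothetical helper `batches`, illustrates that this weighting recovers the mean over all samples, whereas a plain mean of the minibatch means generally does not when the batch sizes differ.

```python
def batches(values, batch_size):
    """Yield consecutive slices of at most batch_size elements."""
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]


# Made-up per-sample losses; 150 samples give batches of sizes 64, 64, 22
losses = [float(i % 7) for i in range(150)]
batch_size = 64

total = 0.0
batch_means = []
for batch in batches(losses, batch_size):
    batch_mean = sum(batch) / len(batch)
    batch_means.append(batch_mean)
    total += batch_mean * len(batch)  # weight each mean by its batch size

weighted_average = total / len(losses)
naive_average = sum(batch_means) / len(batch_means)
overall_mean = sum(losses) / len(losses)
print(weighted_average, naive_average, overall_mean)
```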

Source code 7.13 compares the performance of several of the optimization methods
introduced in this chapter, namely the plain vanilla SGD optimization method introduced
in Definition 7.2.1, the momentum SGD optimization method introduced in Definition 7.4.1,
the simplified Nesterov accelerated SGD optimization method introduced in Definition 7.5.3,
the Adagrad SGD optimization method introduced in Definition 7.6.1, the RMSprop SGD
optimization method introduced in Definition 7.7.1, the Adadelta SGD optimization method
introduced in Definition 7.8.1, and the Adam SGD optimization method introduced in
Definition 7.9.1, during training of an ANN on the MNIST dataset. The code produces two
plots showing the progression of the training loss as well as the accuracy of the model’s
predictions on the test set, see Figure 7.5. Note that this compares the performance of
the optimization methods only on one particular problem and without any efforts towards
choosing good hyperparameters for the considered optimization methods. Thus, the results
are not necessarily representative of the performance of these optimization methods in
general.
import torch
import torchvision.datasets as datasets
import torchvision.transforms as transforms
import torch.nn as nn
import torch.utils.data as data
import torch.optim as optim
import matplotlib.pyplot as plt
from matplotlib.ticker import ScalarFormatter, NullFormatter
import copy

# Set device as GPU if available or CPU otherwise
device = torch.device(
    "cuda" if torch.cuda.is_available() else "cpu"
)

# Fix a random seed
torch.manual_seed(0)

# Load the MNIST training and test datasets
mnist_train = datasets.MNIST(
    "./data",
    train=True,
    transform=transforms.ToTensor(),
    download=True,
)
mnist_test = datasets.MNIST(
    "./data",
    train=False,
    transform=transforms.ToTensor(),
    download=True,
)
train_loader = data.DataLoader(
    mnist_train, batch_size=64, shuffle=True
)
test_loader = data.DataLoader(
    mnist_test, batch_size=64, shuffle=False
)

# Define a neural network
net = nn.Sequential(  # input shape (N, 1, 28, 28)
    nn.Conv2d(1, 5, 5),  # (N, 5, 24, 24)
    nn.ReLU(),
    nn.Conv2d(5, 5, 3),  # (N, 5, 22, 22)
    nn.ReLU(),
    nn.Conv2d(5, 3, 3),  # (N, 3, 20, 20)
    nn.ReLU(),
    nn.Flatten(),  # (N, 3 * 20 * 20) = (N, 1200)
    nn.Linear(1200, 128),  # (N, 128)
    nn.ReLU(),
    nn.Linear(128, 10),  # output shape (N, 10)
).to(device)

# Save the initial state of the neural network
initial_state = copy.deepcopy(net.state_dict())

# Define the loss function
loss_fn = nn.CrossEntropyLoss()

# Define the optimizers that we want to compare. Each entry in the
# list is a tuple of a label (for the plot) and an optimizer
optimizers = [
    # For SGD we use a learning rate of 0.001
    (
        "SGD",
        optim.SGD(net.parameters(), lr=1e-3),
    ),
    (
        "SGD with momentum",
        optim.SGD(net.parameters(), lr=1e-3, momentum=0.9),
    ),
    (
        "Nesterov SGD",
        optim.SGD(
            net.parameters(), lr=1e-3, momentum=0.9, nesterov=True
        ),
    ),
    # For the adaptive optimization methods we use the default
    # hyperparameters
    (
        "RMSprop",
        optim.RMSprop(net.parameters()),
    ),
    (
        "Adagrad",
        optim.Adagrad(net.parameters()),
    ),
    (
        "Adadelta",
        optim.Adadelta(net.parameters()),
    ),
    (
        "Adam",
        optim.Adam(net.parameters()),
    ),
]

def compute_test_loss_and_accuracy():
    total_test_loss = 0.0
    correct_count = 0
    with torch.no_grad():
        for images, labels in test_loader:
            images = images.to(device)
            labels = labels.to(device)

            output = net(images)
            loss = loss_fn(output, labels)

            total_test_loss += loss.item() * images.size(0)
            pred_labels = torch.max(output, dim=1).indices
            correct_count += torch.sum(
                pred_labels == labels
            ).item()

    avg_test_loss = total_test_loss / len(mnist_test)
    accuracy = correct_count / len(mnist_test)

    return (avg_test_loss, accuracy)


loss_plots = []
accuracy_plots = []

test_interval = 100

for _, optimizer in optimizers:
    train_losses = []
    accuracies = []
    print(optimizer)

    # Reset the neural network to its initial state so that every
    # optimizer starts from the same parameters
    with torch.no_grad():
        net.load_state_dict(initial_state)

    i = 0
    for e in range(5):
        print(f"Epoch {e+1}")
        for images, labels in train_loader:
            images = images.to(device)
            labels = labels.to(device)

            optimizer.zero_grad()
            output = net(images)
            loss = loss_fn(output, labels)
            loss.backward()
            optimizer.step()

            train_losses.append(loss.item())

            if (i + 1) % test_interval == 0:
                (
                    test_loss,
                    accuracy,
                ) = compute_test_loss_and_accuracy()
                print(accuracy)
                accuracies.append(accuracy)

            i += 1

    loss_plots.append(train_losses)
    accuracy_plots.append(accuracies)

# Window length for the moving average of the training losses
WINDOW = 200

_, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 12))
ax1.set_yscale("log")
ax2.set_yscale("logit")
ax2.yaxis.set_major_formatter(ScalarFormatter())
ax2.yaxis.set_minor_formatter(NullFormatter())
for (label, _), train_losses, accuracies in zip(
    optimizers, loss_plots, accuracy_plots
):
    ax1.plot(
        [
            sum(train_losses[max(0, i - WINDOW) : i]) / min(i, WINDOW)
            for i in range(1, len(train_losses))
        ],
        label=label,
    )
    ax2.plot(
        range(0, len(accuracies) * test_interval, test_interval),
        accuracies,
        label=label,
    )

ax1.legend()

plt.tight_layout()
plt.savefig("../plots/mnist_optim.pdf", bbox_inches="tight")

Source code 7.13 (code/mnist_optim.py): Python code comparing the performance of several optimization methods during training of an ANN on the MNIST dataset. See Figure 7.5 for the plots produced by this code.

Remark 7.9.3 (Analysis of accelerated SGD-type optimization methods). In the literature there are numerous research articles which study the accelerated SGD-type optimization methods reviewed in this chapter. In particular, we refer, for example, to [149, 275, 280, 339, 387] and the references therein for articles on SGD-type optimization methods with momentum and we refer, for instance, to [96, 156, 289, 351, 438] and the references therein for articles on adaptive SGD-type optimization methods.

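[Figure 7.5. Upper panel: training loss on a logarithmic vertical axis against the minibatch index (0 to 4000). Lower panel: test accuracy on a logit-scaled vertical axis (ticks 0.100, 0.500, 0.900, 0.990) against the minibatch index (0 to 4000). Legend: SGD, SGD with momentum, Nesterov SGD, RMSprop, Adagrad, Adadelta, Adam.]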

Figure 7.5 (plots/mnist_optim.pdf): The plots produced by Source code 7.13. The upper plot shows the progression of the training loss during the training of the ANNs. More precisely, each line shows a moving average of the training loss over 200 minibatches during the training of an ANN with the corresponding optimization method. The lower plot shows the accuracy of the ANN's predictions on the test set
over the course of the training with each optimization method.
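
The moving average mentioned in this caption can be sketched in a few lines. This is a minimal illustration and not part of Source code 7.13: the helper name moving_average is hypothetical, and, unlike the list comprehension in the listing (which stops one index earlier), this sketch also emits the final point.

```python
def moving_average(values, window):
    """Trailing moving average.

    Entry i - 1 of the result averages values[max(0, i - window):i],
    so the first entries average over fewer than `window` points.
    """
    return [
        sum(values[max(0, i - window):i]) / min(i, window)
        for i in range(1, len(values) + 1)
    ]


if __name__ == "__main__":
    # Small usage example with a window of two entries.
    print(moving_average([1.0, 2.0, 3.0, 4.0], window=2))
```

With window = 200 this reproduces the kind of smoothing applied to the per-minibatch training losses before plotting.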

Chapter 8

Backpropagation

In Chapters 6 and 7 we reviewed common deterministic and stochastic GD-type optimization methods used for the training of ANNs. Implementing such methods requires efficient explicit computations of gradients. The most popular, and arguably the most natural, method to explicitly compute such gradients in the case of the training of ANNs is the backpropagation method. In this chapter we derive and present this method in detail.
Further material on the backpropagation method can, for example, be found in the books and overview articles [176], [4, Section 11.7], [60, Section 6.2.3], [63, Section 3.2.3], [97, Section 5.6], and [373, Section 20.6].

8.1 Backpropagation for parametric functions


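Proposition 8.1.1 (Backpropagation for parametric functions). Let $L \in \mathbb{N}$, $l_0, l_1, \dots, l_L, d_1, d_2, \dots, d_L \in \mathbb{N}$, for every $k \in \{1, 2, \dots, L\}$ let $F_k = (F_k(\theta_k, x_{k-1}))_{(\theta_k, x_{k-1}) \in \mathbb{R}^{d_k} \times \mathbb{R}^{l_{k-1}}} \colon \mathbb{R}^{d_k} \times \mathbb{R}^{l_{k-1}} \to \mathbb{R}^{l_k}$ be differentiable, for every $k \in \{1, 2, \dots, L\}$ let $f_k = (f_k(\theta_k, \theta_{k+1}, \dots, \theta_L, x_{k-1}))_{(\theta_k, \theta_{k+1}, \dots, \theta_L, x_{k-1}) \in \mathbb{R}^{d_k} \times \mathbb{R}^{d_{k+1}} \times \ldots \times \mathbb{R}^{d_L} \times \mathbb{R}^{l_{k-1}}} \colon \mathbb{R}^{d_k} \times \mathbb{R}^{d_{k+1}} \times \ldots \times \mathbb{R}^{d_L} \times \mathbb{R}^{l_{k-1}} \to \mathbb{R}^{l_L}$ satisfy for all $\theta = (\theta_k, \theta_{k+1}, \dots, \theta_L) \in \mathbb{R}^{d_k} \times \mathbb{R}^{d_{k+1}} \times \ldots \times \mathbb{R}^{d_L}$, $x_{k-1} \in \mathbb{R}^{l_{k-1}}$ that

\[ f_k(\theta, x_{k-1}) = \bigl( F_L(\theta_L, \cdot) \circ F_{L-1}(\theta_{L-1}, \cdot) \circ \ldots \circ F_k(\theta_k, \cdot) \bigr)(x_{k-1}), \tag{8.1} \]

let $\vartheta = (\vartheta_1, \vartheta_2, \dots, \vartheta_L) \in \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \times \ldots \times \mathbb{R}^{d_L}$, $x_0 \in \mathbb{R}^{l_0}$, $x_1 \in \mathbb{R}^{l_1}$, ..., $x_L \in \mathbb{R}^{l_L}$ satisfy for all $k \in \{1, 2, \dots, L\}$ that

\[ x_k = F_k(\vartheta_k, x_{k-1}), \tag{8.2} \]

and let $D_k \in \mathbb{R}^{l_L \times l_{k-1}}$, $k \in \{1, 2, \dots, L+1\}$, satisfy for all $k \in \{1, 2, \dots, L\}$ that $D_{L+1} = \operatorname{I}_{l_L}$ and

\[ D_k = D_{k+1} \Bigl[ \Bigl( \frac{\partial F_k}{\partial x_{k-1}} \Bigr)(\vartheta_k, x_{k-1}) \Bigr] \tag{8.3} \]

(cf. Definition 1.5.5). Then

(i) it holds for all $k \in \{1, 2, \dots, L\}$ that $f_k \colon \mathbb{R}^{d_k} \times \mathbb{R}^{d_{k+1}} \times \ldots \times \mathbb{R}^{d_L} \times \mathbb{R}^{l_{k-1}} \to \mathbb{R}^{l_L}$ is differentiable,

(ii) it holds for all $k \in \{1, 2, \dots, L\}$ that

\[ D_k = \Bigl( \frac{\partial f_k}{\partial x_{k-1}} \Bigr)((\vartheta_k, \vartheta_{k+1}, \dots, \vartheta_L), x_{k-1}), \tag{8.4} \]

and

(iii) it holds for all $k \in \{1, 2, \dots, L\}$ that

\[ \Bigl( \frac{\partial f_1}{\partial \theta_k} \Bigr)(\vartheta, x_0) = D_{k+1} \Bigl[ \Bigl( \frac{\partial F_k}{\partial \theta_k} \Bigr)(\vartheta_k, x_{k-1}) \Bigr]. \tag{8.5} \]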

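Proof of Proposition 8.1.1. Note that (8.1), the fact that for all $k \in \mathbb{N} \cap (0, L)$, $(\theta_k, \theta_{k+1}, \dots, \theta_L) \in \mathbb{R}^{d_k} \times \mathbb{R}^{d_{k+1}} \times \ldots \times \mathbb{R}^{d_L}$, $x_{k-1} \in \mathbb{R}^{l_{k-1}}$ it holds that

\[ f_k((\theta_k, \theta_{k+1}, \dots, \theta_L), x_{k-1}) = \bigl( f_{k+1}((\theta_{k+1}, \theta_{k+2}, \dots, \theta_L), \cdot) \circ F_k(\theta_k, \cdot) \bigr)(x_{k-1}), \tag{8.6} \]

the assumption that for all $k \in \{1, 2, \dots, L\}$ it holds that $F_k \colon \mathbb{R}^{d_k} \times \mathbb{R}^{l_{k-1}} \to \mathbb{R}^{l_k}$ is differentiable, Lemma 5.3.2, and induction imply that for all $k \in \{1, 2, \dots, L\}$ it holds that

\[ f_k \colon \mathbb{R}^{d_k} \times \mathbb{R}^{d_{k+1}} \times \ldots \times \mathbb{R}^{d_L} \times \mathbb{R}^{l_{k-1}} \to \mathbb{R}^{l_L} \tag{8.7} \]

is differentiable. This proves item (i). Next we prove (8.4) by induction on $k \in \{L, L-1, \dots, 1\}$. Note that (8.3), the assumption that $D_{L+1} = \operatorname{I}_{l_L}$, and the fact that $f_L = F_L$ assure that

\[ D_L = D_{L+1} \Bigl[ \Bigl( \frac{\partial F_L}{\partial x_{L-1}} \Bigr)(\vartheta_L, x_{L-1}) \Bigr] = \Bigl( \frac{\partial f_L}{\partial x_{L-1}} \Bigr)(\vartheta_L, x_{L-1}). \tag{8.8} \]

This establishes (8.4) in the base case $k = L$. For the induction step note that (8.3), the chain rule, and the fact that for all $k \in \mathbb{N} \cap (0, L)$, $x_{k-1} \in \mathbb{R}^{l_{k-1}}$ it holds that

\[ f_k((\vartheta_k, \vartheta_{k+1}, \dots, \vartheta_L), x_{k-1}) = f_{k+1}((\vartheta_{k+1}, \vartheta_{k+2}, \dots, \vartheta_L), F_k(\vartheta_k, x_{k-1})) \tag{8.9} \]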

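imply that for all $k \in \mathbb{N} \cap (0, L)$ with $D_{k+1} = \bigl( \frac{\partial f_{k+1}}{\partial x_k} \bigr)((\vartheta_{k+1}, \vartheta_{k+2}, \dots, \vartheta_L), x_k)$ it holds that

\[
\begin{split}
& \Bigl( \frac{\partial f_k}{\partial x_{k-1}} \Bigr)((\vartheta_k, \vartheta_{k+1}, \dots, \vartheta_L), x_{k-1}) \\
&= \bigl( \mathbb{R}^{l_{k-1}} \ni x_{k-1} \mapsto f_k((\vartheta_k, \vartheta_{k+1}, \dots, \vartheta_L), x_{k-1}) \in \mathbb{R}^{l_L} \bigr)'(x_{k-1}) \\
&= \bigl( \mathbb{R}^{l_{k-1}} \ni x_{k-1} \mapsto f_{k+1}((\vartheta_{k+1}, \vartheta_{k+2}, \dots, \vartheta_L), F_k(\vartheta_k, x_{k-1})) \in \mathbb{R}^{l_L} \bigr)'(x_{k-1}) \\
&= \Bigl[ \bigl( \mathbb{R}^{l_k} \ni x_k \mapsto f_{k+1}((\vartheta_{k+1}, \vartheta_{k+2}, \dots, \vartheta_L), x_k) \in \mathbb{R}^{l_L} \bigr)'(F_k(\vartheta_k, x_{k-1})) \Bigr] \Bigl[ \bigl( \mathbb{R}^{l_{k-1}} \ni x_{k-1} \mapsto F_k(\vartheta_k, x_{k-1}) \in \mathbb{R}^{l_k} \bigr)'(x_{k-1}) \Bigr] \\
&= \Bigl[ \Bigl( \frac{\partial f_{k+1}}{\partial x_k} \Bigr)((\vartheta_{k+1}, \vartheta_{k+2}, \dots, \vartheta_L), x_k) \Bigr] \Bigl[ \Bigl( \frac{\partial F_k}{\partial x_{k-1}} \Bigr)(\vartheta_k, x_{k-1}) \Bigr] \\
&= D_{k+1} \Bigl[ \Bigl( \frac{\partial F_k}{\partial x_{k-1}} \Bigr)(\vartheta_k, x_{k-1}) \Bigr] = D_k.
\end{split}
\tag{8.10}
\]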

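Induction thus proves (8.4). This establishes item (ii). Moreover, observe that (8.1) and (8.2) assure that for all $k \in \mathbb{N} \cap (0, L)$, $\theta_k \in \mathbb{R}^{d_k}$ it holds that

\[
\begin{split}
& f_1((\vartheta_1, \dots, \vartheta_{k-1}, \theta_k, \vartheta_{k+1}, \dots, \vartheta_L), x_0) \\
&= \bigl( F_L(\vartheta_L, \cdot) \circ \ldots \circ F_{k+1}(\vartheta_{k+1}, \cdot) \circ F_k(\theta_k, \cdot) \circ F_{k-1}(\vartheta_{k-1}, \cdot) \circ \ldots \circ F_1(\vartheta_1, \cdot) \bigr)(x_0) \\
&= \bigl( f_{k+1}((\vartheta_{k+1}, \vartheta_{k+2}, \dots, \vartheta_L), F_k(\theta_k, \cdot)) \bigr) \bigl( (F_{k-1}(\vartheta_{k-1}, \cdot) \circ \ldots \circ F_1(\vartheta_1, \cdot))(x_0) \bigr) \\
&= f_{k+1}((\vartheta_{k+1}, \vartheta_{k+2}, \dots, \vartheta_L), F_k(\theta_k, x_{k-1})).
\end{split}
\tag{8.11}
\]

Combining this with the chain rule, (8.2), and (8.4) demonstrates that for all $k \in \mathbb{N} \cap (0, L)$ it holds that

\[
\begin{split}
\Bigl( \frac{\partial f_1}{\partial \theta_k} \Bigr)(\vartheta, x_0)
&= \bigl( \mathbb{R}^{d_k} \ni \theta_k \mapsto f_{k+1}((\vartheta_{k+1}, \vartheta_{k+2}, \dots, \vartheta_L), F_k(\theta_k, x_{k-1})) \in \mathbb{R}^{l_L} \bigr)'(\vartheta_k) \\
&= \Bigl[ \bigl( \mathbb{R}^{l_k} \ni x_k \mapsto f_{k+1}((\vartheta_{k+1}, \vartheta_{k+2}, \dots, \vartheta_L), x_k) \in \mathbb{R}^{l_L} \bigr)'(F_k(\vartheta_k, x_{k-1})) \Bigr] \Bigl[ \bigl( \mathbb{R}^{d_k} \ni \theta_k \mapsto F_k(\theta_k, x_{k-1}) \in \mathbb{R}^{l_k} \bigr)'(\vartheta_k) \Bigr] \\
&= \Bigl[ \Bigl( \frac{\partial f_{k+1}}{\partial x_k} \Bigr)((\vartheta_{k+1}, \vartheta_{k+2}, \dots, \vartheta_L), x_k) \Bigr] \Bigl[ \Bigl( \frac{\partial F_k}{\partial \theta_k} \Bigr)(\vartheta_k, x_{k-1}) \Bigr] \\
&= D_{k+1} \Bigl[ \Bigl( \frac{\partial F_k}{\partial \theta_k} \Bigr)(\vartheta_k, x_{k-1}) \Bigr].
\end{split}
\tag{8.12}
\]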

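Furthermore, observe that (8.1) and the fact that $D_{L+1} = \operatorname{I}_{l_L}$ ensure that

\[
\begin{split}
\Bigl( \frac{\partial f_1}{\partial \theta_L} \Bigr)(\vartheta, x_0)
&= \bigl( \mathbb{R}^{d_L} \ni \theta_L \mapsto F_L(\theta_L, x_{L-1}) \in \mathbb{R}^{l_L} \bigr)'(\vartheta_L) \\
&= \Bigl( \frac{\partial F_L}{\partial \theta_L} \Bigr)(\vartheta_L, x_{L-1}) \\
&= D_{L+1} \Bigl[ \Bigl( \frac{\partial F_L}{\partial \theta_L} \Bigr)(\vartheta_L, x_{L-1}) \Bigr].
\end{split}
\tag{8.13}
\]

Combining this and (8.12) establishes item (iii). The proof of Proposition 8.1.1 is thus complete.
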
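Corollary 8.1.2 (Backpropagation for parametric functions with loss). Let $L \in \mathbb{N}$, $l_0, l_1, \dots, l_L, d_1, d_2, \dots, d_L \in \mathbb{N}$, $\vartheta = (\vartheta_1, \vartheta_2, \dots, \vartheta_L) \in \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \times \ldots \times \mathbb{R}^{d_L}$, $x_0 \in \mathbb{R}^{l_0}$, $x_1 \in \mathbb{R}^{l_1}$, ..., $x_L \in \mathbb{R}^{l_L}$, $y \in \mathbb{R}^{l_L}$, let $C = (C(x, y))_{(x, y) \in \mathbb{R}^{l_L} \times \mathbb{R}^{l_L}} \colon \mathbb{R}^{l_L} \times \mathbb{R}^{l_L} \to \mathbb{R}$ be differentiable, for every $k \in \{1, 2, \dots, L\}$ let $F_k = (F_k(\theta_k, x_{k-1}))_{(\theta_k, x_{k-1}) \in \mathbb{R}^{d_k} \times \mathbb{R}^{l_{k-1}}} \colon \mathbb{R}^{d_k} \times \mathbb{R}^{l_{k-1}} \to \mathbb{R}^{l_k}$ be differentiable, let $\mathcal{L} = (\mathcal{L}(\theta_1, \theta_2, \dots, \theta_L))_{(\theta_1, \theta_2, \dots, \theta_L) \in \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \times \ldots \times \mathbb{R}^{d_L}} \colon \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \times \ldots \times \mathbb{R}^{d_L} \to \mathbb{R}$ satisfy for all $\theta = (\theta_1, \theta_2, \dots, \theta_L) \in \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \times \ldots \times \mathbb{R}^{d_L}$ that

\[ \mathcal{L}(\theta) = \bigl( C(\cdot, y) \circ F_L(\theta_L, \cdot) \circ F_{L-1}(\theta_{L-1}, \cdot) \circ \ldots \circ F_1(\theta_1, \cdot) \bigr)(x_0), \tag{8.14} \]

assume for all $k \in \{1, 2, \dots, L\}$ that

\[ x_k = F_k(\vartheta_k, x_{k-1}), \tag{8.15} \]

and let $D_k \in \mathbb{R}^{l_{k-1}}$, $k \in \{1, 2, \dots, L+1\}$, satisfy for all $k \in \{1, 2, \dots, L\}$ that

\[ D_{L+1} = (\nabla_x C)(x_L, y) \qquad\text{and}\qquad D_k = \Bigl[ \Bigl( \frac{\partial F_k}{\partial x_{k-1}} \Bigr)(\vartheta_k, x_{k-1}) \Bigr]^* D_{k+1}. \tag{8.16} \]

Then

(i) it holds that $\mathcal{L} \colon \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \times \ldots \times \mathbb{R}^{d_L} \to \mathbb{R}$ is differentiable and

(ii) it holds for all $k \in \{1, 2, \dots, L\}$ that

\[ (\nabla_{\theta_k} \mathcal{L})(\vartheta) = \Bigl[ \Bigl( \frac{\partial F_k}{\partial \theta_k} \Bigr)(\vartheta_k, x_{k-1}) \Bigr]^* D_{k+1}. \tag{8.17} \]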

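Proof of Corollary 8.1.2. Throughout this proof, let $\mathbf{D}_k \in \mathbb{R}^{l_L \times l_{k-1}}$, $k \in \{1, 2, \dots, L+1\}$, satisfy for all $k \in \{1, 2, \dots, L\}$ that $\mathbf{D}_{L+1} = \operatorname{I}_{l_L}$ and

\[ \mathbf{D}_k = \mathbf{D}_{k+1} \Bigl[ \Bigl( \frac{\partial F_k}{\partial x_{k-1}} \Bigr)(\vartheta_k, x_{k-1}) \Bigr] \tag{8.18} \]

and let $f = (f(\theta_1, \theta_2, \dots, \theta_L))_{(\theta_1, \theta_2, \dots, \theta_L) \in \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \times \ldots \times \mathbb{R}^{d_L}} \colon \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \times \ldots \times \mathbb{R}^{d_L} \to \mathbb{R}^{l_L}$ satisfy for all $\theta = (\theta_1, \theta_2, \dots, \theta_L) \in \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \times \ldots \times \mathbb{R}^{d_L}$ that

\[ f(\theta) = \bigl( F_L(\theta_L, \cdot) \circ F_{L-1}(\theta_{L-1}, \cdot) \circ \ldots \circ F_1(\theta_1, \cdot) \bigr)(x_0) \tag{8.19} \]

(cf. Definition 1.5.5). Here the matrices $\mathbf{D}_k$ (bold) are distinct from the vectors $D_k$ in (8.16). Note that item (i) in Proposition 8.1.1 ensures that $f \colon \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \times \ldots \times \mathbb{R}^{d_L} \to \mathbb{R}^{l_L}$ is differentiable. This, the assumption that $C \colon \mathbb{R}^{l_L} \times \mathbb{R}^{l_L} \to \mathbb{R}$ is differentiable, and the fact that $\mathcal{L} = C(\cdot, y) \circ f$ ensure that $\mathcal{L} \colon \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \times \ldots \times \mathbb{R}^{d_L} \to \mathbb{R}$ is differentiable. This establishes item (i). Next we claim that for all $k \in \{1, 2, \dots, L+1\}$ it holds that

\[ [D_k]^* = \Bigl[ \Bigl( \frac{\partial C}{\partial x} \Bigr)(x_L, y) \Bigr] \mathbf{D}_k. \tag{8.20} \]

We now prove (8.20) by induction on $k \in \{L+1, L, \dots, 1\}$. For the base case $k = L+1$ note that (8.16) and (8.18) assure that

\[
[D_{L+1}]^* = [(\nabla_x C)(x_L, y)]^* = \Bigl( \frac{\partial C}{\partial x} \Bigr)(x_L, y)
= \Bigl[ \Bigl( \frac{\partial C}{\partial x} \Bigr)(x_L, y) \Bigr] \operatorname{I}_{l_L}
= \Bigl[ \Bigl( \frac{\partial C}{\partial x} \Bigr)(x_L, y) \Bigr] \mathbf{D}_{L+1}.
\tag{8.21}
\]

This establishes (8.20) in the base case $k = L+1$. For the induction step observe that (8.16) and (8.18) demonstrate that for all $k \in \{L, L-1, \dots, 1\}$ with $[D_{k+1}]^* = \bigl[ \bigl( \frac{\partial C}{\partial x} \bigr)(x_L, y) \bigr] \mathbf{D}_{k+1}$ it holds that

\[
\begin{split}
[D_k]^* &= [D_{k+1}]^* \Bigl[ \Bigl( \frac{\partial F_k}{\partial x_{k-1}} \Bigr)(\vartheta_k, x_{k-1}) \Bigr] \\
&= \Bigl[ \Bigl( \frac{\partial C}{\partial x} \Bigr)(x_L, y) \Bigr] \mathbf{D}_{k+1} \Bigl[ \Bigl( \frac{\partial F_k}{\partial x_{k-1}} \Bigr)(\vartheta_k, x_{k-1}) \Bigr]
= \Bigl[ \Bigl( \frac{\partial C}{\partial x} \Bigr)(x_L, y) \Bigr] \mathbf{D}_k.
\end{split}
\tag{8.22}
\]

Induction thus establishes (8.20). Furthermore, note that item (iii) in Proposition 8.1.1 assures that for all $k \in \{1, 2, \dots, L\}$ it holds that

\[ \Bigl( \frac{\partial f}{\partial \theta_k} \Bigr)(\vartheta) = \mathbf{D}_{k+1} \Bigl[ \Bigl( \frac{\partial F_k}{\partial \theta_k} \Bigr)(\vartheta_k, x_{k-1}) \Bigr]. \tag{8.23} \]

Combining this with the chain rule, the fact that $\mathcal{L} = C(\cdot, y) \circ f$, and (8.20) ensures that for all $k \in \{1, 2, \dots, L\}$ it holds that

\[
\begin{split}
\Bigl( \frac{\partial \mathcal{L}}{\partial \theta_k} \Bigr)(\vartheta)
&= \Bigl[ \Bigl( \frac{\partial C}{\partial x} \Bigr)(f(\vartheta), y) \Bigr] \Bigl[ \Bigl( \frac{\partial f}{\partial \theta_k} \Bigr)(\vartheta) \Bigr] \\
&= \Bigl[ \Bigl( \frac{\partial C}{\partial x} \Bigr)(x_L, y) \Bigr] \mathbf{D}_{k+1} \Bigl[ \Bigl( \frac{\partial F_k}{\partial \theta_k} \Bigr)(\vartheta_k, x_{k-1}) \Bigr] \\
&= [D_{k+1}]^* \Bigl[ \Bigl( \frac{\partial F_k}{\partial \theta_k} \Bigr)(\vartheta_k, x_{k-1}) \Bigr].
\end{split}
\tag{8.24}
\]

Hence, we obtain that for all $k \in \{1, 2, \dots, L\}$ it holds that

\[ (\nabla_{\theta_k} \mathcal{L})(\vartheta) = \Bigl[ \Bigl( \frac{\partial \mathcal{L}}{\partial \theta_k} \Bigr)(\vartheta) \Bigr]^* = \Bigl[ \Bigl( \frac{\partial F_k}{\partial \theta_k} \Bigr)(\vartheta_k, x_{k-1}) \Bigr]^* D_{k+1}. \tag{8.25} \]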

This establishes item (ii). The proof of Corollary 8.1.2 is thus complete.
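
The recursion (8.15)-(8.17) can be illustrated by a small numerical sketch. The following Python code is only an illustration under stated assumptions, not part of the original text: all layers are scalar ($l_k = d_k = 1$), the layer maps are the hypothetical choice $F_k(\theta_k, x_{k-1}) = \tanh(\theta_k x_{k-1})$, and the loss is $C(x, y) = (x - y)^2$. The function names are likewise hypothetical.

```python
import math


def F(theta, x):
    """Layer map F_k(theta_k, x_{k-1}); hypothetical choice tanh(theta * x)."""
    return math.tanh(theta * x)


def dF_dx(theta, x):
    """Partial derivative of F with respect to its input x."""
    return theta * (1.0 - math.tanh(theta * x) ** 2)


def dF_dtheta(theta, x):
    """Partial derivative of F with respect to its parameter theta."""
    return x * (1.0 - math.tanh(theta * x) ** 2)


def loss(thetas, x0, y):
    """Composition (8.14) with C(x, y) = (x - y)^2 (an assumption of this sketch)."""
    x = x0
    for theta in thetas:
        x = F(theta, x)
    return (x - y) ** 2


def backprop(thetas, x0, y):
    """Return the partial derivatives of the loss with respect to each theta_k."""
    # Forward pass, cf. (8.15): store x_0, x_1, ..., x_L.
    xs = [x0]
    for theta in thetas:
        xs.append(F(theta, xs[-1]))
    # Backward pass, cf. (8.16): start with D_{L+1} = (grad_x C)(x_L, y).
    D = 2.0 * (xs[-1] - y)
    grads = [0.0] * len(thetas)
    for j in range(len(thetas), 0, -1):  # layer index j = L, L-1, ..., 1
        theta, x_prev = thetas[j - 1], xs[j - 1]
        grads[j - 1] = dF_dtheta(theta, x_prev) * D  # cf. (8.17), uses D_{j+1}
        D = dF_dx(theta, x_prev) * D                 # D_j from D_{j+1}
    return grads


if __name__ == "__main__":
    print(backprop([0.5, -1.2, 0.8], x0=0.3, y=0.1))
```

In this scalar setting the adjoints in (8.16) and (8.17) reduce to ordinary multiplication, so a comparison with central finite differences gives a quick sanity check of the recursion.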

8.2 Backpropagation for ANNs


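Definition 8.2.1 (Diagonal matrices). We denote by $\operatorname{diag} \colon \bigl( \bigcup_{d \in \mathbb{N}} \mathbb{R}^d \bigr) \to \bigl( \bigcup_{d \in \mathbb{N}} \mathbb{R}^{d \times d} \bigr)$ the function which satisfies for all $d \in \mathbb{N}$, $x = (x_1, \dots, x_d) \in \mathbb{R}^d$ that

\[
\operatorname{diag}(x) =
\begin{pmatrix}
x_1 & 0 & \cdots & 0 \\
0 & x_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & x_d
\end{pmatrix}
\in \mathbb{R}^{d \times d}.
\tag{8.26}
\]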

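Corollary 8.2.2 (Backpropagation for ANNs). Let $L \in \mathbb{N}$, $l_0, l_1, \dots, l_L \in \mathbb{N}$, $\Phi = ((W_1, B_1), \dots, (W_L, B_L)) \in \bigtimes_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k})$, let $C = (C(x, y))_{(x, y) \in \mathbb{R}^{l_L} \times \mathbb{R}^{l_L}} \colon \mathbb{R}^{l_L} \times \mathbb{R}^{l_L} \to \mathbb{R}$ and $a \colon \mathbb{R} \to \mathbb{R}$ be differentiable, let $x_0 \in \mathbb{R}^{l_0}$, $x_1 \in \mathbb{R}^{l_1}$, ..., $x_L \in \mathbb{R}^{l_L}$, $y \in \mathbb{R}^{l_L}$ satisfy for all $k \in \{1, 2, \dots, L\}$ that

\[ x_k = \mathfrak{M}_{a \mathbb{1}_{[0, L)}(k) + \operatorname{id}_{\mathbb{R}} \mathbb{1}_{\{L\}}(k),\, l_k}(W_k x_{k-1} + B_k), \tag{8.27} \]

let $\mathcal{L} = (\mathcal{L}((W_1, B_1), \dots, (W_L, B_L)))_{((W_1, B_1), \dots, (W_L, B_L)) \in \bigtimes_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k})} \colon \bigtimes_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \to \mathbb{R}$ satisfy for all $\Psi \in \bigtimes_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k})$ that

\[ \mathcal{L}(\Psi) = C((\mathcal{R}^{\mathbf{N}}_{a}(\Psi))(x_0), y), \tag{8.28} \]

and let $D_k \in \mathbb{R}^{l_{k-1}}$, $k \in \{1, 2, \dots, L+1\}$, satisfy for all $k \in \{1, 2, \dots, L-1\}$ that

\[ D_{L+1} = (\nabla_x C)(x_L, y), \qquad D_L = [W_L]^* D_{L+1}, \qquad\text{and} \tag{8.29} \]

\[ D_k = [W_k]^* [\operatorname{diag}(\mathfrak{M}_{a', l_k}(W_k x_{k-1} + B_k))] D_{k+1} \tag{8.30} \]

(cf. Definitions 1.2.1, 1.3.4, and 8.2.1). Then

(i) it holds that $\mathcal{L} \colon \bigtimes_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \to \mathbb{R}$ is differentiable,

(ii) it holds that $(\nabla_{B_L} \mathcal{L})(\Phi) = D_{L+1}$,

(iii) it holds for all $k \in \{1, 2, \dots, L-1\}$ that

\[ (\nabla_{B_k} \mathcal{L})(\Phi) = [\operatorname{diag}(\mathfrak{M}_{a', l_k}(W_k x_{k-1} + B_k))] D_{k+1}, \tag{8.31} \]

(iv) it holds that $(\nabla_{W_L} \mathcal{L})(\Phi) = D_{L+1} [x_{L-1}]^*$, and

(v) it holds for all $k \in \{1, 2, \dots, L-1\}$ that

\[ (\nabla_{W_k} \mathcal{L})(\Phi) = [\operatorname{diag}(\mathfrak{M}_{a', l_k}(W_k x_{k-1} + B_k))] D_{k+1} [x_{k-1}]^*. \tag{8.32} \]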

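Proof of Corollary 8.2.2. Throughout this proof, for every $k \in \{1, 2, \dots, L\}$ let

\[
\begin{split}
F_k = (F_k^{(m)})_{m \in \{1, 2, \dots, l_k\}}
&= \bigl( F_k(((W_{k,i,j})_{(i,j) \in \{1, 2, \dots, l_k\} \times \{1, 2, \dots, l_{k-1}\}}, B_k), x_{k-1}) \bigr)_{(((W_{k,i,j})_{(i,j) \in \{1, 2, \dots, l_k\} \times \{1, 2, \dots, l_{k-1}\}}, B_k), x_{k-1}) \in (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \times \mathbb{R}^{l_{k-1}}} \\
&\colon (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \times \mathbb{R}^{l_{k-1}} \to \mathbb{R}^{l_k}
\end{split}
\tag{8.33}
\]

satisfy for all $(W_k, B_k) \in \mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}$, $x_{k-1} \in \mathbb{R}^{l_{k-1}}$ that

\[ F_k((W_k, B_k), x_{k-1}) = \mathfrak{M}_{a \mathbb{1}_{[0, L)}(k) + \operatorname{id}_{\mathbb{R}} \mathbb{1}_{\{L\}}(k),\, l_k}(W_k x_{k-1} + B_k) \tag{8.34} \]

and for every $d \in \mathbb{N}$ let $e_1^{(d)}, e_2^{(d)}, \dots, e_d^{(d)} \in \mathbb{R}^d$ satisfy $e_1^{(d)} = (1, 0, \dots, 0)$, $e_2^{(d)} = (0, 1, 0, \dots, 0)$, ..., $e_d^{(d)} = (0, \dots, 0, 1)$. Observe that the assumption that $a$ is differentiable and (8.27) imply that $\mathcal{L} \colon \bigtimes_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \to \mathbb{R}$ is differentiable. This establishes item (i). Next note that (1.91), (8.28), and (8.34) ensure that for all $\Psi = ((W_1, B_1), \dots, (W_L, B_L)) \in \bigtimes_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k})$ it holds that

\[ \mathcal{L}(\Psi) = \bigl( C(\cdot, y) \circ F_L((W_L, B_L), \cdot) \circ F_{L-1}((W_{L-1}, B_{L-1}), \cdot) \circ \ldots \circ F_1((W_1, B_1), \cdot) \bigr)(x_0). \tag{8.35} \]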

Moreover, observe that (8.27) and (8.34) imply that for all k ∈ {1, 2, . . . , L} it holds that

xk = Fk ((Wk , Bk ), xk−1 ). (8.36)

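In addition, observe that (8.34) assures that

\[ \Bigl( \frac{\partial F_L}{\partial x_{L-1}} \Bigr)((W_L, B_L), x_{L-1}) = W_L. \tag{8.37} \]

Moreover, note that (8.34) implies that for all $k \in \{1, 2, \dots, L-1\}$ it holds that

\[ \Bigl( \frac{\partial F_k}{\partial x_{k-1}} \Bigr)((W_k, B_k), x_{k-1}) = [\operatorname{diag}(\mathfrak{M}_{a', l_k}(W_k x_{k-1} + B_k))] W_k. \tag{8.38} \]

Combining this and (8.37) with (8.29) and (8.30) demonstrates that for all $k \in \{1, 2, \dots, L\}$ it holds that

\[ D_{L+1} = (\nabla_x C)(x_L, y) \qquad\text{and}\qquad D_k = \Bigl[ \Bigl( \frac{\partial F_k}{\partial x_{k-1}} \Bigr)((W_k, B_k), x_{k-1}) \Bigr]^* D_{k+1}. \tag{8.39} \]

Next note that this, (8.35), (8.36), and Corollary 8.1.2 prove that for all $k \in \{1, 2, \dots, L\}$ it holds that

\[ (\nabla_{B_k} \mathcal{L})(\Phi) = \Bigl[ \Bigl( \frac{\partial F_k}{\partial B_k} \Bigr)((W_k, B_k), x_{k-1}) \Bigr]^* D_{k+1} \qquad\text{and} \tag{8.40} \]

\[ (\nabla_{W_k} \mathcal{L})(\Phi) = \Bigl[ \Bigl( \frac{\partial F_k}{\partial W_k} \Bigr)((W_k, B_k), x_{k-1}) \Bigr]^* D_{k+1}. \tag{8.41} \]

Moreover, observe that (8.34) implies that

\[ \Bigl( \frac{\partial F_L}{\partial B_L} \Bigr)((W_L, B_L), x_{L-1}) = \operatorname{I}_{l_L} \tag{8.42} \]

(cf. Definition 1.5.5). Combining this with (8.40) demonstrates that

\[ (\nabla_{B_L} \mathcal{L})(\Phi) = [\operatorname{I}_{l_L}]^* D_{L+1} = D_{L+1}. \tag{8.43} \]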

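This establishes item (ii). Furthermore, note that (8.34) assures that for all $k \in \{1, 2, \dots, L-1\}$ it holds that

\[ \Bigl( \frac{\partial F_k}{\partial B_k} \Bigr)((W_k, B_k), x_{k-1}) = \operatorname{diag}(\mathfrak{M}_{a', l_k}(W_k x_{k-1} + B_k)). \tag{8.44} \]

Combining this with (8.40) implies that for all $k \in \{1, 2, \dots, L-1\}$ it holds that

\[
(\nabla_{B_k} \mathcal{L})(\Phi) = [\operatorname{diag}(\mathfrak{M}_{a', l_k}(W_k x_{k-1} + B_k))]^* D_{k+1}
= [\operatorname{diag}(\mathfrak{M}_{a', l_k}(W_k x_{k-1} + B_k))] D_{k+1}.
\tag{8.45}
\]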

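This establishes item (iii). In addition, observe that (8.34) ensures that for all $m, i \in \{1, 2, \dots, l_L\}$, $j \in \{1, 2, \dots, l_{L-1}\}$ it holds that

\[ \Bigl( \frac{\partial F_L^{(m)}}{\partial W_{L,i,j}} \Bigr)((W_L, B_L), x_{L-1}) = \mathbb{1}_{\{m\}}(i) \langle x_{L-1}, e_j^{(l_{L-1})} \rangle \tag{8.46} \]

(cf. Definition 1.4.7). Combining this with (8.41) demonstrates that

\[
\begin{split}
(\nabla_{W_L} \mathcal{L})(\Phi)
&= \Bigl( \sum_{m=1}^{l_L} \Bigl[ \Bigl( \frac{\partial F_L^{(m)}}{\partial W_{L,i,j}} \Bigr)((W_L, B_L), x_{L-1}) \Bigr] \langle D_{L+1}, e_m^{(l_L)} \rangle \Bigr)_{(i,j) \in \{1, 2, \dots, l_L\} \times \{1, 2, \dots, l_{L-1}\}} \\
&= \Bigl( \sum_{m=1}^{l_L} \mathbb{1}_{\{m\}}(i) \langle e_j^{(l_{L-1})}, x_{L-1} \rangle \langle e_m^{(l_L)}, D_{L+1} \rangle \Bigr)_{(i,j) \in \{1, 2, \dots, l_L\} \times \{1, 2, \dots, l_{L-1}\}} \\
&= \bigl( \langle e_j^{(l_{L-1})}, x_{L-1} \rangle \langle e_i^{(l_L)}, D_{L+1} \rangle \bigr)_{(i,j) \in \{1, 2, \dots, l_L\} \times \{1, 2, \dots, l_{L-1}\}} \\
&= D_{L+1} [x_{L-1}]^*.
\end{split}
\tag{8.47}
\]

This establishes item (iv). Moreover, note that (8.34) implies that for all $k \in \{1, 2, \dots, L-1\}$, $m, i \in \{1, 2, \dots, l_k\}$, $j \in \{1, 2, \dots, l_{k-1}\}$ it holds that

\[ \Bigl( \frac{\partial F_k^{(m)}}{\partial W_{k,i,j}} \Bigr)((W_k, B_k), x_{k-1}) = \mathbb{1}_{\{m\}}(i)\, a'(\langle e_i^{(l_k)}, W_k x_{k-1} + B_k \rangle) \langle e_j^{(l_{k-1})}, x_{k-1} \rangle. \tag{8.48} \]

Combining this with (8.41) demonstrates that for all $k \in \{1, 2, \dots, L-1\}$ it holds that

\[
\begin{split}
(\nabla_{W_k} \mathcal{L})(\Phi)
&= \Bigl( \sum_{m=1}^{l_k} \Bigl[ \Bigl( \frac{\partial F_k^{(m)}}{\partial W_{k,i,j}} \Bigr)((W_k, B_k), x_{k-1}) \Bigr] \langle e_m^{(l_k)}, D_{k+1} \rangle \Bigr)_{(i,j) \in \{1, 2, \dots, l_k\} \times \{1, 2, \dots, l_{k-1}\}} \\
&= \Bigl( \sum_{m=1}^{l_k} \mathbb{1}_{\{m\}}(i)\, a'(\langle e_i^{(l_k)}, W_k x_{k-1} + B_k \rangle) \langle e_j^{(l_{k-1})}, x_{k-1} \rangle \langle e_m^{(l_k)}, D_{k+1} \rangle \Bigr)_{(i,j) \in \{1, 2, \dots, l_k\} \times \{1, 2, \dots, l_{k-1}\}} \\
&= \bigl( a'(\langle e_i^{(l_k)}, W_k x_{k-1} + B_k \rangle) \langle e_j^{(l_{k-1})}, x_{k-1} \rangle \langle e_i^{(l_k)}, D_{k+1} \rangle \bigr)_{(i,j) \in \{1, 2, \dots, l_k\} \times \{1, 2, \dots, l_{k-1}\}} \\
&= [\operatorname{diag}(\mathfrak{M}_{a', l_k}(W_k x_{k-1} + B_k))] D_{k+1} [x_{k-1}]^*.
\end{split}
\tag{8.49}
\]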
This establishes item (v). The proof of Corollary 8.2.2 is thus complete.
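
The formulas in items (ii)-(v) of Corollary 8.2.2 translate almost line by line into code. The following NumPy sketch is an illustration under stated assumptions, not the implementation used elsewhere in this book: the activation is the hypothetical choice $a = \tanh$, the loss is $C(x, y) = \|x - y\|_2^2$, and the function names forward, loss, and backprop are ours. The product with the diagonal matrix in (8.30) is carried out as an elementwise product.

```python
import numpy as np


def a(z):
    """Activation function (assumption of this sketch: tanh)."""
    return np.tanh(z)


def a_prime(z):
    """Derivative of the activation function."""
    return 1.0 - np.tanh(z) ** 2


def forward(weights, biases, x0):
    """Return [x_0, ..., x_L] and the pre-activations [z_1, ..., z_L], cf. (8.27)."""
    xs, zs = [x0], []
    for k, (W, B) in enumerate(zip(weights, biases), start=1):
        z = W @ xs[-1] + B
        zs.append(z)
        # Identity on the output layer, activation on all hidden layers.
        xs.append(z if k == len(weights) else a(z))
    return xs, zs


def loss(weights, biases, x0, y):
    """L(Phi) = C(x_L, y) with C(x, y) = ||x - y||_2^2 (assumption)."""
    xs, _ = forward(weights, biases, x0)
    return float(np.sum((xs[-1] - y) ** 2))


def backprop(weights, biases, x0, y):
    """Gradients with respect to all W_k and B_k following items (ii)-(v)."""
    L = len(weights)
    xs, zs = forward(weights, biases, x0)
    grads_W, grads_B = [None] * L, [None] * L
    D = 2.0 * (xs[L] - y)                       # D_{L+1}, cf. (8.29)
    grads_B[L - 1] = D                          # item (ii)
    grads_W[L - 1] = np.outer(D, xs[L - 1])     # item (iv)
    D = weights[L - 1].T @ D                    # D_L, cf. (8.29)
    for k in range(L - 1, 0, -1):               # k = L-1, ..., 1
        delta = a_prime(zs[k - 1]) * D          # diag(...) D_{k+1}
        grads_B[k - 1] = delta                  # item (iii)
        grads_W[k - 1] = np.outer(delta, xs[k - 1])  # item (v)
        D = weights[k - 1].T @ delta            # D_k, cf. (8.30)
    return grads_W, grads_B
```

Replacing the matrix-vector product with diag(...) by an elementwise product avoids forming an $l_k \times l_k$ matrix; the two are equal by Definition 8.2.1.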
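Corollary 8.2.3 (Backpropagation for ANNs with minibatches). Let $L, M \in \mathbb{N}$, $l_0, l_1, \dots, l_L \in \mathbb{N}$, $\Phi = ((W_1, B_1), \dots, (W_L, B_L)) \in \bigtimes_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k})$, let $a \colon \mathbb{R} \to \mathbb{R}$ and $C = (C(x, y))_{(x, y) \in \mathbb{R}^{l_L} \times \mathbb{R}^{l_L}} \colon \mathbb{R}^{l_L} \times \mathbb{R}^{l_L} \to \mathbb{R}$ be differentiable, for every $m \in \{1, 2, \dots, M\}$ let $x_0^{(m)} \in \mathbb{R}^{l_0}$, $x_1^{(m)} \in \mathbb{R}^{l_1}$, ..., $x_L^{(m)} \in \mathbb{R}^{l_L}$, $y^{(m)} \in \mathbb{R}^{l_L}$ satisfy for all $k \in \{1, 2, \dots, L\}$ that

\[ x_k^{(m)} = \mathfrak{M}_{a \mathbb{1}_{[0, L)}(k) + \operatorname{id}_{\mathbb{R}} \mathbb{1}_{\{L\}}(k),\, l_k}(W_k x_{k-1}^{(m)} + B_k), \tag{8.50} \]

let $\mathcal{L} = (\mathcal{L}((W_1, B_1), \dots, (W_L, B_L)))_{((W_1, B_1), \dots, (W_L, B_L)) \in \bigtimes_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k})} \colon \bigtimes_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \to \mathbb{R}$ satisfy for all $\Psi \in \bigtimes_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k})$ that

\[ \mathcal{L}(\Psi) = \frac{1}{M} \Bigl[ \sum_{m=1}^{M} C((\mathcal{R}^{\mathbf{N}}_{a}(\Psi))(x_0^{(m)}), y^{(m)}) \Bigr], \tag{8.51} \]

and for every $m \in \{1, 2, \dots, M\}$ let $D_k^{(m)} \in \mathbb{R}^{l_{k-1}}$, $k \in \{1, 2, \dots, L+1\}$, satisfy for all $k \in \{1, 2, \dots, L-1\}$ that

\[ D_{L+1}^{(m)} = (\nabla_x C)(x_L^{(m)}, y^{(m)}), \qquad D_L^{(m)} = [W_L]^* D_{L+1}^{(m)}, \qquad\text{and} \tag{8.52} \]

\[ D_k^{(m)} = [W_k]^* [\operatorname{diag}(\mathfrak{M}_{a', l_k}(W_k x_{k-1}^{(m)} + B_k))] D_{k+1}^{(m)} \tag{8.53} \]

(cf. Definitions 1.2.1, 1.3.4, and 8.2.1). Then

(i) it holds that $\mathcal{L} \colon \bigtimes_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \to \mathbb{R}$ is differentiable,

(ii) it holds that $(\nabla_{B_L} \mathcal{L})(\Phi) = \frac{1}{M} \bigl[ \sum_{m=1}^{M} D_{L+1}^{(m)} \bigr]$,

(iii) it holds for all $k \in \{1, 2, \dots, L-1\}$ that

\[ (\nabla_{B_k} \mathcal{L})(\Phi) = \frac{1}{M} \Bigl[ \sum_{m=1}^{M} [\operatorname{diag}(\mathfrak{M}_{a', l_k}(W_k x_{k-1}^{(m)} + B_k))] D_{k+1}^{(m)} \Bigr], \tag{8.54} \]

(iv) it holds that $(\nabla_{W_L} \mathcal{L})(\Phi) = \frac{1}{M} \bigl[ \sum_{m=1}^{M} D_{L+1}^{(m)} [x_{L-1}^{(m)}]^* \bigr]$, and

(v) it holds for all $k \in \{1, 2, \dots, L-1\}$ that

\[ (\nabla_{W_k} \mathcal{L})(\Phi) = \frac{1}{M} \Bigl[ \sum_{m=1}^{M} [\operatorname{diag}(\mathfrak{M}_{a', l_k}(W_k x_{k-1}^{(m)} + B_k))] D_{k+1}^{(m)} [x_{k-1}^{(m)}]^* \Bigr]. \tag{8.55} \]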

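Proof of Corollary 8.2.3. Throughout this proof, let $\mathcal{L}^{(m)} \colon \bigtimes_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \to \mathbb{R}$, $m \in \{1, 2, \dots, M\}$, satisfy for all $m \in \{1, 2, \dots, M\}$, $\Psi \in \bigtimes_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k})$ that

\[ \mathcal{L}^{(m)}(\Psi) = C((\mathcal{R}^{\mathbf{N}}_{a}(\Psi))(x_0^{(m)}), y^{(m)}). \tag{8.56} \]

Note that (8.56) and (8.51) ensure that for all $\Psi \in \bigtimes_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k})$ it holds that

\[ \mathcal{L}(\Psi) = \frac{1}{M} \Bigl[ \sum_{m=1}^{M} \mathcal{L}^{(m)}(\Psi) \Bigr]. \tag{8.57} \]

Corollary 8.2.2 hence establishes items (i), (ii), (iii), (iv), and (v). The proof of Corollary 8.2.3 is thus complete.
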
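Corollary 8.2.4 (Backpropagation for ANNs with quadratic loss and minibatches). Let $L, M \in \mathbb{N}$, $l_0, l_1, \dots, l_L \in \mathbb{N}$, $\Phi = ((W_1, B_1), \dots, (W_L, B_L)) \in \bigtimes_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k})$, let $a \colon \mathbb{R} \to \mathbb{R}$ be differentiable, for every $m \in \{1, 2, \dots, M\}$ let $x_0^{(m)} \in \mathbb{R}^{l_0}$, $x_1^{(m)} \in \mathbb{R}^{l_1}$, ..., $x_L^{(m)} \in \mathbb{R}^{l_L}$, $y^{(m)} \in \mathbb{R}^{l_L}$ satisfy for all $k \in \{1, 2, \dots, L\}$ that

\[ x_k^{(m)} = \mathfrak{M}_{a \mathbb{1}_{[0, L)}(k) + \operatorname{id}_{\mathbb{R}} \mathbb{1}_{\{L\}}(k),\, l_k}(W_k x_{k-1}^{(m)} + B_k), \tag{8.58} \]

let $\mathcal{L} = (\mathcal{L}((W_1, B_1), \dots, (W_L, B_L)))_{((W_1, B_1), \dots, (W_L, B_L)) \in \bigtimes_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k})} \colon \bigtimes_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \to \mathbb{R}$ satisfy for all $\Psi \in \bigtimes_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k})$ that

\[ \mathcal{L}(\Psi) = \frac{1}{M} \Bigl[ \sum_{m=1}^{M} \| (\mathcal{R}^{\mathbf{N}}_{a}(\Psi))(x_0^{(m)}) - y^{(m)} \|_2^2 \Bigr], \tag{8.59} \]

and for every $m \in \{1, 2, \dots, M\}$ let $D_k^{(m)} \in \mathbb{R}^{l_{k-1}}$, $k \in \{1, 2, \dots, L+1\}$, satisfy for all $k \in \{1, 2, \dots, L-1\}$ that

\[ D_{L+1}^{(m)} = 2(x_L^{(m)} - y^{(m)}), \qquad D_L^{(m)} = [W_L]^* D_{L+1}^{(m)}, \qquad\text{and} \tag{8.60} \]

\[ D_k^{(m)} = [W_k]^* [\operatorname{diag}(\mathfrak{M}_{a', l_k}(W_k x_{k-1}^{(m)} + B_k))] D_{k+1}^{(m)} \tag{8.61} \]

(cf. Definitions 1.2.1, 1.3.4, 3.3.4, and 8.2.1). Then

(i) it holds that $\mathcal{L} \colon \bigtimes_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \to \mathbb{R}$ is differentiable,

(ii) it holds that $(\nabla_{B_L} \mathcal{L})(\Phi) = \frac{1}{M} \bigl[ \sum_{m=1}^{M} D_{L+1}^{(m)} \bigr]$,

(iii) it holds for all $k \in \{1, 2, \dots, L-1\}$ that

\[ (\nabla_{B_k} \mathcal{L})(\Phi) = \frac{1}{M} \Bigl[ \sum_{m=1}^{M} [\operatorname{diag}(\mathfrak{M}_{a', l_k}(W_k x_{k-1}^{(m)} + B_k))] D_{k+1}^{(m)} \Bigr], \tag{8.62} \]

(iv) it holds that $(\nabla_{W_L} \mathcal{L})(\Phi) = \frac{1}{M} \bigl[ \sum_{m=1}^{M} D_{L+1}^{(m)} [x_{L-1}^{(m)}]^* \bigr]$, and

(v) it holds for all $k \in \{1, 2, \dots, L-1\}$ that

\[ (\nabla_{W_k} \mathcal{L})(\Phi) = \frac{1}{M} \Bigl[ \sum_{m=1}^{M} [\operatorname{diag}(\mathfrak{M}_{a', l_k}(W_k x_{k-1}^{(m)} + B_k))] D_{k+1}^{(m)} [x_{k-1}^{(m)}]^* \Bigr]. \tag{8.63} \]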

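Proof of Corollary 8.2.4. Throughout this proof, let $C = (C(x, y))_{(x, y) \in \mathbb{R}^{l_L} \times \mathbb{R}^{l_L}} \colon \mathbb{R}^{l_L} \times \mathbb{R}^{l_L} \to \mathbb{R}$ satisfy for all $x, y \in \mathbb{R}^{l_L}$ that

\[ C(x, y) = \|x - y\|_2^2. \tag{8.64} \]

Observe that (8.64) ensures that for all $m \in \{1, 2, \dots, M\}$ it holds that

\[ (\nabla_x C)(x_L^{(m)}, y^{(m)}) = 2(x_L^{(m)} - y^{(m)}) = D_{L+1}^{(m)}. \tag{8.65} \]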

Combining this, (8.58), (8.59), (8.60), and (8.61) with Corollary 8.2.3 establishes items (i),
(ii), (iii), (iv), and (v). The proof of Corollary 8.2.4 is thus complete.
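
The minibatch formulas (8.60)-(8.63) can be evaluated for all samples at once by stacking the samples as rows of a matrix. The following NumPy sketch illustrates this under stated assumptions: the activation is the hypothetical choice $a = \tanh$, the arrays X0 and Y hold $x_0^{(m)}$ and $y^{(m)}$ in their rows, and the function names are ours rather than taken from the source.

```python
import numpy as np


def forward_batch(weights, biases, X0):
    """Row m of the returned arrays holds x_k^(m) and the pre-activations, cf. (8.58)."""
    xs, zs = [X0], []
    L = len(weights)
    for k in range(L):
        z = xs[-1] @ weights[k].T + biases[k]
        zs.append(z)
        xs.append(z if k == L - 1 else np.tanh(z))  # identity on the output layer
    return xs, zs


def minibatch_loss(weights, biases, X0, Y):
    """Mean squared Euclidean distance over the minibatch, cf. (8.59)."""
    xs, _ = forward_batch(weights, biases, X0)
    return float(np.mean(np.sum((xs[-1] - Y) ** 2, axis=1)))


def minibatch_gradients(weights, biases, X0, Y):
    """Gradients of minibatch_loss following items (ii)-(v) of Corollary 8.2.4."""
    M, L = X0.shape[0], len(weights)
    xs, zs = forward_batch(weights, biases, X0)
    grads_W, grads_B = [None] * L, [None] * L
    D = 2.0 * (xs[L] - Y)  # row m is D_{L+1}^(m), cf. (8.60)
    for k in range(L - 1, -1, -1):  # Python index k corresponds to layer k + 1
        if k == L - 1:
            delta = D  # output layer: no activation derivative
        else:
            delta = (1.0 - np.tanh(zs[k]) ** 2) * D  # diag(...) D, row by row
        grads_B[k] = delta.mean(axis=0)   # average over the minibatch
        grads_W[k] = delta.T @ xs[k] / M  # average of the outer products
        D = delta @ weights[k]            # row m is the next D^(m), cf. (8.61)
    return grads_W, grads_B
```

The product delta.T @ xs[k] sums the outer products of the per-sample vectors over the minibatch, so dividing by M reproduces the averages in items (iv) and (v).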

Chapter 9

Kurdyka–Łojasiewicz (KL) inequalities

In Chapter 5 (GF trajectories), Chapter 6 (deterministic GD-type processes), and Chapter 7 (SGD-type processes) we reviewed and studied gradient based processes for the approximate solution of certain optimization problems. In particular, we sketched the approach of general Lyapunov-type functions as well as the special case where the Lyapunov-type function is the squared standard norm around a minimizer, resulting in the coercivity-type conditions used in several convergence results in Chapters 5, 6, and 7. However, the coercivity-type conditions in Chapters 5, 6, and 7 are usually too restrictive to cover the situation of the training of ANNs (cf., for instance, item (ii) in Lemma 5.6.8, [223, item (vi) in Corollary 29], and [213, Corollary 2.19]).
In this chapter we introduce another general class of Lyapunov-type functions which does indeed cover the mathematical analysis of many of the ANN training situations. Specifically, in this chapter we study Lyapunov-type functions that are given by suitable fractional powers of differences of the risk function (cf., for example, (9.8) in the proof of Proposition 9.2.1 below). In that case the resulting Lyapunov-type conditions (cf., for instance, (9.1), (9.4), and (9.11) below) are referred to as KL inequalities in the literature.
Further investigations related to KL inequalities in the scientific literature can, for example, be found in [38, 44, 84, 100].

9.1 Standard KL functions


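Definition 9.1.1 (Standard KL inequalities). Let $d \in \mathbb{N}$, $c \in \mathbb{R}$, $\alpha \in (0, \infty)$, let $\mathcal{L} \colon \mathbb{R}^d \to \mathbb{R}$ be differentiable, let $U \subseteq \mathbb{R}^d$ be a set, and let $\theta \in U$. Then we say that $\mathcal{L}$ satisfies the standard KL inequality at $\theta$ on $U$ with exponent $\alpha$ and constant $c$ (we say that $\mathcal{L}$ satisfies the standard KL inequality at $\theta$) if and only if it holds for all $\vartheta \in U$ that

\[ |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\alpha} \le c\, \|(\nabla \mathcal{L})(\vartheta)\|_2 \tag{9.1} \]

(cf. Definition 3.3.4).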

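Definition 9.1.2 (Standard KL functions). Let $d \in \mathbb{N}$ and let $\mathcal{L} \colon \mathbb{R}^d \to \mathbb{R}$ be differentiable. Then we say that $\mathcal{L}$ is a standard KL function if and only if for all $\theta \in \mathbb{R}^d$ there exist $\varepsilon, c \in (0, \infty)$, $\alpha \in (0, 1)$ such that for all $\vartheta \in \{v \in \mathbb{R}^d \colon \|v - \theta\|_2 < \varepsilon\}$ it holds that

\[ |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\alpha} \le c\, \|(\nabla \mathcal{L})(\vartheta)\|_2 \tag{9.2} \]

(cf. Definition 3.3.4).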

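9.2 Convergence analysis using standard KL functions (regular regime)

Proposition 9.2.1. Let $d \in \mathbb{N}$, $\vartheta \in \mathbb{R}^d$, $c, C, \varepsilon \in (0, \infty)$, $\alpha \in (0, 1)$, $\mathcal{L} \in C^1(\mathbb{R}^d, \mathbb{R})$, let $O \subseteq \mathbb{R}^d$ satisfy

\[ O = \{\theta \in \mathbb{R}^d \colon \|\theta - \vartheta\|_2 < \varepsilon\} \setminus \{\vartheta\} \qquad\text{and}\qquad c = C^2 \sup\nolimits_{\theta \in O} |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{2 - 2\alpha}, \tag{9.3} \]

assume for all $\theta \in O$ that $\mathcal{L}(\theta) > \mathcal{L}(\vartheta)$ and

\[ |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\alpha} \le C \|(\nabla \mathcal{L})(\theta)\|_2, \tag{9.4} \]

and let $\Theta \in C([0, \infty), O)$ satisfy for all $t \in [0, \infty)$ that

\[ \Theta_t = \Theta_0 - \int_0^t (\nabla \mathcal{L})(\Theta_s)\, ds \tag{9.5} \]

(cf. Definition 3.3.4). Then there exists $\psi \in \mathbb{R}^d$ such that

(i) it holds that $\mathcal{L}(\psi) = \mathcal{L}(\vartheta)$,

(ii) it holds for all $t \in [0, \infty)$ that

\[ 0 \le \mathcal{L}(\Theta_t) - \mathcal{L}(\psi) \le [(\mathcal{L}(\Theta_0) - \mathcal{L}(\psi))^{-1} + c^{-1} t]^{-1}, \tag{9.6} \]

and

(iii) it holds for all $t \in [0, \infty)$ that

\[
\begin{split}
\|\Theta_t - \psi\|_2 &\le \int_t^{\infty} \|(\nabla \mathcal{L})(\Theta_s)\|_2\, ds \\
&\le C (1 - \alpha)^{-1} [\mathcal{L}(\Theta_t) - \mathcal{L}(\psi)]^{1 - \alpha} \\
&\le C (1 - \alpha)^{-1} [(\mathcal{L}(\Theta_0) - \mathcal{L}(\psi))^{-1} + c^{-1} t]^{\alpha - 1}.
\end{split}
\tag{9.7}
\]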

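Proof of Proposition 9.2.1. Throughout this proof, let $V \colon O \to \mathbb{R}$ and $U \colon O \to \mathbb{R}$ satisfy for all $\theta \in O$ that

\[ V(\theta) = -|\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{-1} \qquad\text{and}\qquad U(\theta) = |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{1 - \alpha}. \tag{9.8} \]

Observe that the assumption that for all $\theta \in O$ it holds that $|\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\alpha} \le C \|(\nabla \mathcal{L})(\theta)\|_2$ shows that for all $\theta \in O$ it holds that

\[ \|(\nabla \mathcal{L})(\theta)\|_2^2 \ge C^{-2} |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{2\alpha}. \tag{9.9} \]

Furthermore, note that (9.8) ensures that for all $\theta \in O$ it holds that $V \in C^1(O, \mathbb{R})$ and

\[ (\nabla V)(\theta) = |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{-2} (\nabla \mathcal{L})(\theta). \tag{9.10} \]

Combining this with (9.9) implies that for all $\theta \in O$ it holds that

\[
\langle (\nabla V)(\theta), -(\nabla \mathcal{L})(\theta) \rangle = -|\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{-2} \|(\nabla \mathcal{L})(\theta)\|_2^2
\le -C^{-2} |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{2\alpha - 2} \le -c^{-1}.
\tag{9.11}
\]

The assumption that for all $t \in [0, \infty)$ it holds that $\Theta_t \in O$, the assumption that for all $t \in [0, \infty)$ it holds that $\Theta_t = \Theta_0 - \int_0^t (\nabla \mathcal{L})(\Theta_s)\, ds$, and Proposition 5.6.2 therefore establish that for all $t \in [0, \infty)$ it holds that

\[
\begin{split}
-|\mathcal{L}(\Theta_t) - \mathcal{L}(\vartheta)|^{-1} = V(\Theta_t) &\le V(\Theta_0) + \int_0^t -c^{-1}\, ds = V(\Theta_0) - c^{-1} t \\
&= -|\mathcal{L}(\Theta_0) - \mathcal{L}(\vartheta)|^{-1} - c^{-1} t.
\end{split}
\tag{9.12}
\]

Hence, we obtain for all $t \in [0, \infty)$ that

\[ 0 \le \mathcal{L}(\Theta_t) - \mathcal{L}(\vartheta) \le [|\mathcal{L}(\Theta_0) - \mathcal{L}(\vartheta)|^{-1} + c^{-1} t]^{-1}. \tag{9.13} \]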

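Moreover, observe that (9.8) ensures that for all $\theta \in O$ it holds that $U \in C^1(O, \mathbb{R})$ and

\[ (\nabla U)(\theta) = (1 - \alpha) |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{-\alpha} (\nabla \mathcal{L})(\theta). \tag{9.14} \]

The assumption that for all $\theta \in O$ it holds that $|\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\alpha} \le C \|(\nabla \mathcal{L})(\theta)\|_2$ therefore demonstrates that for all $\theta \in O$ it holds that

\[
\langle (\nabla U)(\theta), -(\nabla \mathcal{L})(\theta) \rangle = -(1 - \alpha) |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{-\alpha} \|(\nabla \mathcal{L})(\theta)\|_2^2
\le -C^{-1} (1 - \alpha) \|(\nabla \mathcal{L})(\theta)\|_2.
\tag{9.15}
\]

Combining this, the assumption that for all $t \in [0, \infty)$ it holds that $\Theta_t \in O$, the fact that for all $s, t \in [0, \infty)$ it holds that

\[ \Theta_{s+t} = \Theta_s - \int_0^t (\nabla \mathcal{L})(\Theta_{s+u})\, du, \tag{9.16} \]

and Proposition 5.6.2 (applied for every $s \in [0, \infty)$, $t \in (s, \infty)$ with $d \curvearrowleft d$, $T \curvearrowleft t - s$, $O \curvearrowleft O$, $\alpha \curvearrowleft 0$, $\beta \curvearrowleft (O \ni \theta \mapsto -C^{-1}(1 - \alpha)\|(\nabla \mathcal{L})(\theta)\|_2 \in \mathbb{R})$, $G \curvearrowleft (\nabla \mathcal{L})$, $\Theta \curvearrowleft ([0, t - s] \ni u \mapsto \Theta_{s+u} \in O)$ in the notation of Proposition 5.6.2) ensures that for all $s, t \in [0, \infty)$ with $s < t$ it holds that

\[
\begin{split}
0 \le |\mathcal{L}(\Theta_t) - \mathcal{L}(\vartheta)|^{1 - \alpha} = U(\Theta_t)
&\le U(\Theta_s) + \int_s^t -C^{-1} (1 - \alpha) \|(\nabla \mathcal{L})(\Theta_u)\|_2\, du \\
&= |\mathcal{L}(\Theta_s) - \mathcal{L}(\vartheta)|^{1 - \alpha} - C^{-1} (1 - \alpha) \Bigl[ \int_s^t \|(\nabla \mathcal{L})(\Theta_u)\|_2\, du \Bigr].
\end{split}
\tag{9.17}
\]

This implies that for all $s, t \in [0, \infty)$ with $s < t$ it holds that

\[ \int_s^t \|(\nabla \mathcal{L})(\Theta_u)\|_2\, du \le C (1 - \alpha)^{-1} |\mathcal{L}(\Theta_s) - \mathcal{L}(\vartheta)|^{1 - \alpha}. \tag{9.18} \]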

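Hence, we obtain that

\[ \int_0^{\infty} \|(\nabla \mathcal{L})(\Theta_s)\|_2\, ds \le C (1 - \alpha)^{-1} |\mathcal{L}(\Theta_0) - \mathcal{L}(\vartheta)|^{1 - \alpha} < \infty. \tag{9.19} \]

This demonstrates that

\[ \limsup_{r \to \infty} \int_r^{\infty} \|(\nabla \mathcal{L})(\Theta_s)\|_2\, ds = 0. \tag{9.20} \]

In addition, note that the fundamental theorem of calculus and the assumption that for all $t \in [0, \infty)$ it holds that $\Theta_t = \Theta_0 - \int_0^t (\nabla \mathcal{L})(\Theta_s)\, ds$ establish that for all $r, s, t \in [0, \infty)$ with $r \le s \le t$ it holds that

\[ \|\Theta_t - \Theta_s\|_2 = \Bigl\| \int_s^t (\nabla \mathcal{L})(\Theta_u)\, du \Bigr\|_2 \le \int_s^t \|(\nabla \mathcal{L})(\Theta_u)\|_2\, du \le \int_r^{\infty} \|(\nabla \mathcal{L})(\Theta_u)\|_2\, du. \tag{9.21} \]

This and (9.20) prove that there exists $\psi \in \mathbb{R}^d$ which satisfies

\[ \limsup_{t \to \infty} \|\Theta_t - \psi\|_2 = 0. \tag{9.22} \]

Combining this and the assumption that $\mathcal{L}$ is continuous with (9.13) demonstrates that

\[ \mathcal{L}(\psi) = \mathcal{L}\bigl( \lim\nolimits_{t \to \infty} \Theta_t \bigr) = \lim\nolimits_{t \to \infty} \mathcal{L}(\Theta_t) = \mathcal{L}(\vartheta). \tag{9.23} \]

Next observe that (9.22), (9.18), and (9.21) show that for all $t \in [0, \infty)$ it holds that

\[
\begin{split}
\|\Theta_t - \psi\|_2 = \bigl\| \Theta_t - \bigl[ \lim\nolimits_{s \to \infty} \Theta_s \bigr] \bigr\|_2
&= \lim_{s \to \infty} \|\Theta_t - \Theta_s\|_2 \\
&\le \int_t^{\infty} \|(\nabla \mathcal{L})(\Theta_u)\|_2\, du \\
&\le C (1 - \alpha)^{-1} |\mathcal{L}(\Theta_t) - \mathcal{L}(\vartheta)|^{1 - \alpha}.
\end{split}
\tag{9.24}
\]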

Combining this with (9.13) and (9.23) establishes items (i), (ii), and (iii). The proof of
Proposition 9.2.1 is thus complete.
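
The bounds in items (ii) and (iii) of Proposition 9.2.1 can be checked on a simple example. The following Python sketch is an illustration under stated assumptions, not a result from the source: we take $d = 1$, the risk $\mathcal{L}(\theta) = \theta^2$ with $\vartheta = 0$, $\alpha = 1/2$, $C = 1/2$, and $\varepsilon = 1$, for which (9.4) holds with equality and the gradient flow has the closed form $\Theta_t = \Theta_0 e^{-2t}$. All function names are hypothetical.

```python
import math

# Example constants (assumptions of this sketch): KL constant C, exponent alpha, radius eps.
C, ALPHA, EPS = 0.5, 0.5, 1.0
# c = C^2 * sup_{theta in O} |L(theta) - L(0)|^(2 - 2 alpha), cf. (9.3); here the sup is (EPS^2)^(2 - 2 alpha).
c = C ** 2 * (EPS ** 2) ** (2 - 2 * ALPHA)


def risk(theta):
    """Risk function L(theta) = theta^2 with minimizer 0."""
    return theta ** 2


def flow(theta0, t):
    """Closed-form gradient flow of theta' = -2 theta started at theta0."""
    return theta0 * math.exp(-2.0 * t)


def risk_bound(theta0, t):
    """Right-hand side of (9.6) with psi = 0."""
    return 1.0 / (1.0 / risk(theta0) + t / c)


def distance_bound(theta0, t):
    """Last expression in (9.7) with psi = 0."""
    return C / (1.0 - ALPHA) * risk_bound(theta0, t) ** (1.0 - ALPHA)


if __name__ == "__main__":
    theta0 = 0.8
    for t in [0.0, 0.5, 1.0, 2.0]:
        print(t, risk(flow(theta0, t)), risk_bound(theta0, t),
              abs(flow(theta0, t)), distance_bound(theta0, t))
```

In this example the true risk decays exponentially, while (9.6) only guarantees a decay of order $t^{-1}$; the bound is valid but not sharp here.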

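9.3 Standard KL inequalities for monomials

Lemma 9.3.1 (Standard KL inequalities for monomials). Let $d \in \mathbb{N}$, $p \in (1, \infty)$, $\varepsilon, c, \alpha \in (0, \infty)$ satisfy $c \ge p^{-1} \varepsilon^{p(\alpha - 1) + 1}$ and $\alpha \ge 1 - \frac{1}{p}$ and let $\mathcal{L} \colon \mathbb{R}^d \to \mathbb{R}$ satisfy for all $\vartheta \in \mathbb{R}^d$ that

\[ \mathcal{L}(\vartheta) = \|\vartheta\|_2^p. \tag{9.25} \]

Then

(i) it holds that $\mathcal{L} \in C^1(\mathbb{R}^d, \mathbb{R})$ and

(ii) it holds for all $\vartheta \in \{v \in \mathbb{R}^d \colon \|v\|_2 \le \varepsilon\}$ that

\[ |\mathcal{L}(0) - \mathcal{L}(\vartheta)|^{\alpha} \le c \|(\nabla \mathcal{L})(\vartheta)\|_2. \tag{9.26} \]

Proof of Lemma 9.3.1. First, note that the fact that for all $\vartheta \in \mathbb{R}^d$ it holds that

\[ \mathcal{L}(\vartheta) = (\|\vartheta\|_2^2)^{p/2} \tag{9.27} \]

implies that for all $\vartheta \in \mathbb{R}^d$ it holds that $\mathcal{L} \in C^1(\mathbb{R}^d, \mathbb{R})$ and

\[ \|(\nabla \mathcal{L})(\vartheta)\|_2 = p \|\vartheta\|_2^{p - 1}. \tag{9.28} \]

Furthermore, observe that the assumption that $\alpha \ge 1 - \frac{1}{p}$ ensures that $p(\alpha - 1) + 1 \ge 0$. The assumption that $c \ge p^{-1} \varepsilon^{p(\alpha - 1) + 1}$ therefore demonstrates that for all $\vartheta \in \{v \in \mathbb{R}^d \colon \|v\|_2 \le \varepsilon\}$ it holds that

\[ \|\vartheta\|_2^{p\alpha} \|\vartheta\|_2^{-(p - 1)} = \|\vartheta\|_2^{p(\alpha - 1) + 1} \le \varepsilon^{p(\alpha - 1) + 1} \le c p. \tag{9.29} \]

Combining (9.28) and (9.29) ensures that for all $\vartheta \in \{v \in \mathbb{R}^d \colon \|v\|_2 \le \varepsilon\}$ it holds that

\[ |\mathcal{L}(0) - \mathcal{L}(\vartheta)|^{\alpha} = \|\vartheta\|_2^{p\alpha} \le c p \|\vartheta\|_2^{p - 1} = c \|(\nabla \mathcal{L})(\vartheta)\|_2. \tag{9.30} \]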
This completes the proof of Lemma 9.3.1.
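
Lemma 9.3.1 is easy to probe numerically. The following Python sketch is an illustration only: it uses the closed form (9.28) for the gradient norm, the example parameters $p = 3$, $\alpha = 0.8$, $\varepsilon = 0.5$ are arbitrary admissible choices, and the function name kl_gap is hypothetical.

```python
import math
import random


def kl_gap(theta, p, alpha, c):
    """Return c * ||grad L(theta)||_2 - |L(0) - L(theta)|^alpha for L(theta) = ||theta||_2^p.

    The gradient norm p * ||theta||_2^(p - 1) is taken from (9.28).
    A nonnegative value means that (9.26) holds at theta.
    """
    r = math.sqrt(sum(v * v for v in theta))
    return c * p * r ** (p - 1) - (r ** p) ** alpha


if __name__ == "__main__":
    p, alpha, eps = 3.0, 0.8, 0.5
    c = eps ** (p * (alpha - 1.0) + 1.0) / p  # smallest constant allowed by the lemma
    random.seed(0)
    # Sample points in a square contained in the closed ball of radius eps.
    bound = eps / math.sqrt(2.0)
    points = [(random.uniform(-bound, bound), random.uniform(-bound, bound))
              for _ in range(5)]
    for theta in points:
        print(theta, kl_gap(theta, p, alpha, c))
```

With the smallest admissible constant $c = p^{-1} \varepsilon^{p(\alpha - 1) + 1}$ the gap vanishes exactly on the sphere $\|\vartheta\|_2 = \varepsilon$, which is why a small floating-point tolerance is advisable when testing points near the boundary.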

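9.4 Standard KL inequalities around non-critical points

Lemma 9.4.1 (Standard KL inequality around non-critical points). Let $d \in \mathbb{N}$, let $U \subseteq \mathbb{R}^d$ be open, and let $\mathcal{L} \in C^1(U, \mathbb{R})$, $\theta \in U$, $c \in [0, \infty)$, $\alpha \in (0, \infty)$ satisfy for all $\vartheta \in U$ that

\[ \max\bigl\{ |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\alpha},\, c \|(\nabla \mathcal{L})(\theta) - (\nabla \mathcal{L})(\vartheta)\|_2 \bigr\} \le \frac{c \|(\nabla \mathcal{L})(\theta)\|_2}{2} \tag{9.31} \]

(cf. Definition 3.3.4). Then it holds for all $\vartheta \in U$ that

\[ |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\alpha} \le c \|(\nabla \mathcal{L})(\vartheta)\|_2. \tag{9.32} \]

Proof of Lemma 9.4.1. Note that (9.31) and the triangle inequality ensure that for all $\vartheta \in U$ it holds that

\[
\begin{split}
c \|(\nabla \mathcal{L})(\theta)\|_2 &= c \|(\nabla \mathcal{L})(\vartheta) + [(\nabla \mathcal{L})(\theta) - (\nabla \mathcal{L})(\vartheta)]\|_2 \\
&\le c \|(\nabla \mathcal{L})(\vartheta)\|_2 + c \|(\nabla \mathcal{L})(\theta) - (\nabla \mathcal{L})(\vartheta)\|_2 \le c \|(\nabla \mathcal{L})(\vartheta)\|_2 + \frac{c \|(\nabla \mathcal{L})(\theta)\|_2}{2}.
\end{split}
\tag{9.33}
\]

Hence, we obtain for all $\vartheta \in U$ that

\[ \frac{c \|(\nabla \mathcal{L})(\theta)\|_2}{2} \le c \|(\nabla \mathcal{L})(\vartheta)\|_2. \tag{9.34} \]

Combining this with (9.31) establishes that for all $\vartheta \in U$ it holds that

\[ |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\alpha} \le \frac{c \|(\nabla \mathcal{L})(\theta)\|_2}{2} \le c \|(\nabla \mathcal{L})(\vartheta)\|_2. \tag{9.35} \]

The proof of Lemma 9.4.1 is thus complete.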

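Corollary 9.4.2 (Standard KL inequality around non-critical points). Let $d \in \mathbb{N}$, $\mathcal{L} \in C^1(\mathbb{R}^d, \mathbb{R})$, $\theta \in \mathbb{R}^d$, $c, \alpha \in (0, \infty)$ satisfy $(\nabla \mathcal{L})(\theta) \neq 0$. Then there exists $\varepsilon \in (0, 1)$ such that for all $\vartheta \in \{v \in \mathbb{R}^d \colon \|v - \theta\|_2 < \varepsilon\}$ it holds that

\[ |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\alpha} \le c \|(\nabla \mathcal{L})(\vartheta)\|_2 \tag{9.36} \]

(cf. Definition 3.3.4).

Proof of Corollary 9.4.2. Observe that the assumption that $\mathcal{L} \in C^1(\mathbb{R}^d, \mathbb{R})$ ensures that

\[ \limsup\nolimits_{\varepsilon \searrow 0} \bigl[ \sup\nolimits_{\vartheta \in \{v \in \mathbb{R}^d \colon \|v - \theta\|_2 < \varepsilon\}} \|(\nabla \mathcal{L})(\theta) - (\nabla \mathcal{L})(\vartheta)\|_2 \bigr] = 0 \tag{9.37} \]

(cf. Definition 3.3.4). Combining this and the fact that $c > 0$ with the fact that $\mathcal{L}$ is continuous demonstrates that

\[ \limsup\nolimits_{\varepsilon \searrow 0} \bigl[ \sup\nolimits_{\vartheta \in \{v \in \mathbb{R}^d \colon \|v - \theta\|_2 < \varepsilon\}} \max\bigl\{ |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\alpha},\, c \|(\nabla \mathcal{L})(\theta) - (\nabla \mathcal{L})(\vartheta)\|_2 \bigr\} \bigr] = 0. \tag{9.38} \]

The fact that $c > 0$ and the fact that $\|(\nabla \mathcal{L})(\theta)\|_2 > 0$ therefore prove that there exists $\varepsilon \in (0, 1)$ which satisfies

\[ \sup\nolimits_{\vartheta \in \{v \in \mathbb{R}^d \colon \|v - \theta\|_2 < \varepsilon\}} \max\bigl\{ |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\alpha},\, c \|(\nabla \mathcal{L})(\theta) - (\nabla \mathcal{L})(\vartheta)\|_2 \bigr\} < \frac{c \|(\nabla \mathcal{L})(\theta)\|_2}{2}. \tag{9.39} \]

Note that (9.39) ensures that for all $\vartheta \in \{v \in \mathbb{R}^d \colon \|v - \theta\|_2 < \varepsilon\}$ it holds that

\[ \max\bigl\{ |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\alpha},\, c \|(\nabla \mathcal{L})(\theta) - (\nabla \mathcal{L})(\vartheta)\|_2 \bigr\} \le \frac{c \|(\nabla \mathcal{L})(\theta)\|_2}{2}. \tag{9.40} \]

This and Lemma 9.4.1 establish (9.36). The proof of Corollary 9.4.2 is thus complete.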


9.5 Standard KL inequalities with increased exponents


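Lemma 9.5.1 (Standard KL inequalities with increased exponents). Let $d \in \mathbb{N}$, let $U \subseteq \mathbb{R}^d$ be a set, let $\theta \in U$, $c, \alpha \in (0, \infty)$, let $\mathcal{L} \colon U \to \mathbb{R}$ and $G \colon U \to \mathbb{R}$ satisfy for all $\vartheta \in U$ that

\[ |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\alpha} \le c |G(\vartheta)|, \tag{9.41} \]

and let $\beta \in (\alpha, \infty)$, $C \in \mathbb{R}$ satisfy $C = c \bigl( \sup_{\vartheta \in U} |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\beta - \alpha} \bigr)$. Then it holds for all $\vartheta \in U$ that

\[ |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\beta} \le C |G(\vartheta)|. \tag{9.42} \]

Proof of Lemma 9.5.1. Observe that (9.41) shows that for all $\vartheta \in U$ it holds that

\[
\begin{split}
|\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\beta} &= |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\alpha} |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\beta - \alpha} \le \bigl( c |G(\vartheta)| \bigr) |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\beta - \alpha} \\
&= \bigl( c |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\beta - \alpha} \bigr) |G(\vartheta)| \le C |G(\vartheta)|.
\end{split}
\tag{9.43}
\]

This establishes (9.42). The proof of Lemma 9.5.1 is thus complete.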

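Corollary 9.5.2 (Standard KL inequalities with increased exponents). Let $d \in \mathbb{N}$, $\mathcal{L} \in C^1(\mathbb{R}^d, \mathbb{R})$, $\theta \in \mathbb{R}^d$, $\varepsilon, c, \alpha \in (0, \infty)$, $\beta \in [\alpha, \infty)$ satisfy for all $\vartheta \in \{v \in \mathbb{R}^d \colon \|v - \theta\|_2 < \varepsilon\}$ that

\[ |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\alpha} \le c \|(\nabla \mathcal{L})(\vartheta)\|_2 \tag{9.44} \]

(cf. Definition 3.3.4). Then there exists $C \in (0, \infty)$ such that for all $\vartheta \in \{v \in \mathbb{R}^d \colon \|v - \theta\|_2 < \varepsilon\}$ it holds that

\[ |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\beta} \le C \|(\nabla \mathcal{L})(\vartheta)\|_2. \tag{9.45} \]

Proof of Corollary 9.5.2. Note that Lemma 9.5.1 establishes (9.45). The proof of Corollary 9.5.2 is thus complete.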

9.6 Standard KL inequalities for one-dimensional polynomials

Corollary 9.6.1 (Reparametrization). Let $\xi \in \mathbb{R}$, $N \in \mathbb{N}$, $p \in C^{\infty}(\mathbb{R}, \mathbb{R})$ satisfy for all $x \in \mathbb{R}$ that $p^{(N+1)}(x) = 0$ and let $\beta_0, \beta_1, \dots, \beta_N \in \mathbb{R}$ satisfy for all $n \in \{0, 1, \dots, N\}$ that $\beta_n = \frac{p^{(n)}(\xi)}{n!}$. Then it holds for all $x \in \mathbb{R}$ that

\[ p(x) = \sum_{n=0}^{N} \beta_n (x - \xi)^n. \tag{9.46} \]

Proof of Corollary 9.6.1. Observe that Theorem 6.1.3 establishes (9.46). The proof of Corollary 9.6.1 is thus complete.


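Corollary 9.6.2 (Quantitative standard KL inequalities for non-constant one-dimensional polynomials). Let $\xi \in \mathbb{R}$, $N \in \mathbb{N}$, $p \in C^{\infty}(\mathbb{R}, \mathbb{R})$ satisfy for all $x \in \mathbb{R}$ that $p^{(N+1)}(x) = 0$, let $\beta_0, \beta_1, \dots, \beta_N \in \mathbb{R}$ satisfy for all $n \in \{0, 1, \dots, N\}$ that $\beta_n = \frac{p^{(n)}(\xi)}{n!}$, and let $m \in \{1, 2, \dots, N\}$, $\alpha \in [0, 1]$, $c, \varepsilon \in \mathbb{R}$ satisfy

\[ |\beta_m| > 0 = \sum_{n=1}^{m-1} |\beta_n|, \qquad \alpha \ge 1 - m^{-1}, \qquad c = 2 \Bigl[ \sum_{n=1}^{N} \frac{|\beta_n|^{\alpha}}{|\beta_m m|} \Bigr], \tag{9.47} \]

and $\varepsilon = \frac{1}{2} \bigl[ \sum_{n=1}^{N} \frac{|\beta_n n|}{|\beta_m m|} \bigr]^{-1}$. Then it holds for all $x \in [\xi - \varepsilon, \xi + \varepsilon]$ that

\[ |p(x) - p(\xi)|^{\alpha} \le c |p'(x)|. \tag{9.48} \]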

Proof of Corollary 9.6.2. Note that Corollary 9.6.1 ensures that for all x ∈ R it holds that

p(x) − p(ξ) = N n
(9.49)
P
n=1 βn (x − ξ) .

Hence, we obtain for all x ∈ R that

(9.50)
PN
p′ (x) = n=1 βn n(x − ξ)n−1

Therefore, we obtain for all x ∈ R that

p(x) − p(ξ) = N and (9.51)


PN
n
p′ (x) = βn n(x − ξ)n−1 .
P
n=m βn (x − ξ) n=m

Hence, we obtain for all x ∈ R that

(9.52)
PN 
|p(x) − p(ξ)|α ≤ n=m |βn |α |x − ξ|nα .

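The fact that for all $n \in \{m, m+1, \dots, N\}$, $x \in \mathbb{R}$ with $|x - \xi| \le 1$ it holds that $|x - \xi|^{n\alpha} \le |x - \xi|^{n(1 - m^{-1})} \le |x - \xi|^{m(1 - m^{-1})} = |x - \xi|^{m-1}$ therefore implies that for all $x \in \mathbb{R}$ with $|x - \xi| \le 1$ it holds that
\[
\begin{aligned}
|p(x) - p(\xi)|^{\alpha} &\le \sum_{n=m}^{N} |\beta_n|^{\alpha} |x - \xi|^{n\alpha} \le \sum_{n=m}^{N} |\beta_n|^{\alpha} |x - \xi|^{m-1} \\
&= |x - \xi|^{m-1} \biggl[\sum_{n=m}^{N} |\beta_n|^{\alpha}\biggr] = |x - \xi|^{m-1} \biggl[\sum_{n=1}^{N} |\beta_n|^{\alpha}\biggr].
\end{aligned}
\tag{9.53}
\]
Hence, we obtain for all $x \in \mathbb{R}$ with $|x - \xi| \le 1$ that
\[ |p(x) - p(\xi)|^{\alpha} \le |x - \xi|^{m-1} \biggl[\sum_{n=1}^{N} |\beta_n|^{\alpha}\biggr] = \tfrac{c}{2}\, |x - \xi|^{m-1} |\beta_m m|. \tag{9.54} \]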

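Furthermore, observe that (9.51) ensures that for all $x \in \mathbb{R}$ with $|x - \xi| \le 1$ it holds that
\[
\begin{aligned}
|p'(x)| &= \biggl|\sum_{n=m}^{N} \beta_n n (x - \xi)^{n-1}\biggr| \ge |\beta_m m|\,|x - \xi|^{m-1} - \biggl|\sum_{n=m+1}^{N} \beta_n n (x - \xi)^{n-1}\biggr| \\
&\ge |x - \xi|^{m-1} |\beta_m m| - \biggl[\sum_{n=m+1}^{N} |x - \xi|^{n-1} |\beta_n n|\biggr] \\
&\ge |x - \xi|^{m-1} |\beta_m m| - \biggl[\sum_{n=m+1}^{N} |x - \xi|^{m} |\beta_n n|\biggr] \\
&= |x - \xi|^{m-1} |\beta_m m| - |x - \xi|^{m} \biggl[\sum_{n=m+1}^{N} |\beta_n n|\biggr].
\end{aligned}
\tag{9.55}
\]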


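Therefore, we obtain for all $x \in \mathbb{R}$ with $|x - \xi| \le \frac{1}{2}\bigl[\sum_{n=m}^{N} \frac{|\beta_n n|}{|\beta_m m|}\bigr]^{-1}$ that
\[
\begin{aligned}
|p'(x)| &\ge |x - \xi|^{m-1} \biggl(|\beta_m m| - |x - \xi| \biggl[\sum_{n=m+1}^{N} |\beta_n n|\biggr]\biggr) \\
&\ge |x - \xi|^{m-1} \biggl(|\beta_m m| - \frac{|\beta_m m|}{2}\, |x - \xi| \biggl[\sum_{n=m}^{N} \frac{2|\beta_n n|}{|\beta_m m|}\biggr]\biggr) \\
&\ge |x - \xi|^{m-1} \biggl(|\beta_m m| - \frac{|\beta_m m|}{2}\biggr) = \tfrac{1}{2}\, |x - \xi|^{m-1} |\beta_m m|.
\end{aligned}
\tag{9.56}
\]
Combining this with (9.54) demonstrates that for all $x \in \mathbb{R}$ with $|x - \xi| \le \frac{1}{2}\bigl[\sum_{n=m}^{N} \frac{|\beta_n n|}{|\beta_m m|}\bigr]^{-1}$ it holds that
\[ |p(x) - p(\xi)|^{\alpha} \le \tfrac{c}{2}\, |x - \xi|^{m-1} |\beta_m m| \le c\,|p'(x)|. \tag{9.57} \]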
This establishes (9.48). The proof of Corollary 9.6.2 is thus complete.
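The constants in (9.47) are explicit, so (9.48) can be tested on a grid. The sketch below assumes the coefficients $\beta_0, \dots, \beta_N$ are already known and uses the smallest admissible exponent $\alpha = 1 - m^{-1}$; the function names and the sample polynomial $p(x) = x^2 + x^3$ with $\xi = 0$ are illustrative assumptions.

```python
def kl_constants(betas):
    """Return (m, alpha, c, eps) as in (9.47) for betas = [beta_0, ..., beta_N],
    with alpha = 1 - 1/m (hypothetical helper; needs some beta_n != 0, n >= 1)."""
    N = len(betas) - 1
    m = next(n for n in range(1, N + 1) if betas[n] != 0)  # first nonzero beta_n
    alpha = 1.0 - 1.0 / m
    lead = abs(betas[m] * m)  # |beta_m m|
    # Note: for alpha == 0 Python evaluates 0.0 ** 0.0 as 1.0, which can only enlarge c.
    c = 2.0 * sum(abs(b) ** alpha for b in betas[1:]) / lead
    eps = 0.5 / sum(abs(b * n) / lead for n, b in enumerate(betas) if n >= 1)
    return m, alpha, c, eps


def increment_and_derivative(betas, xi, x):
    """Return (p(x) - p(xi), p'(x)) from the shifted coefficients."""
    diff = sum(b * (x - xi) ** n for n, b in enumerate(betas) if n >= 1)
    deriv = sum(n * b * (x - xi) ** (n - 1) for n, b in enumerate(betas) if n >= 1)
    return diff, deriv


betas, xi = [0.0, 0.0, 1.0, 1.0], 0.0  # p(x) = x^2 + x^3 around xi = 0
m, alpha, c, eps = kl_constants(betas)
grid = [xi - eps + 2.0 * eps * i / 1000 for i in range(1001)]
holds = all(abs(diff) ** alpha <= c * abs(deriv) + 1e-15
            for diff, deriv in (increment_and_derivative(betas, xi, x) for x in grid))
print(m, alpha, c, eps, holds)
```

Here $m = 2$, $\alpha = 1/2$, $c = 2$, and $\varepsilon = 1/5$, and the inequality holds at every grid point of $[\xi - \varepsilon, \xi + \varepsilon]$. A grid check is of course only a sanity test and not a substitute for the proof.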
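Corollary 9.6.3 (Quantitative standard KL inequalities for general one-dimensional polynomials). Let $\xi \in \mathbb{R}$, $N \in \mathbb{N}$, $p \in C^{\infty}(\mathbb{R}, \mathbb{R})$ satisfy for all $x \in \mathbb{R}$ that $p^{(N+1)}(x) = 0$, let $\beta_0, \beta_1, \dots, \beta_N \in \mathbb{R}$ satisfy for all $n \in \{0, 1, \dots, N\}$ that $\beta_n = \frac{p^{(n)}(\xi)}{n!}$, let $\rho \in \mathbb{R}$ satisfy
\[ \rho = \mathbb{1}_{\{0\}}\biggl(\sum_{n=1}^{N} |\beta_n n|\biggr) + \min\biggl(\biggl(\biggl[\bigcup_{n=1}^{N} \{|\beta_n n|\}\biggr] \setminus \{0\}\biggr) \cup \biggl\{\sum_{n=1}^{N} |\beta_n n|\biggr\}\biggr), \]
and let $\alpha \in (0, 1]$, $c, \varepsilon \in [0, \infty)$ satisfy
\[ \alpha \ge 1 - N^{-1}, \qquad c \ge 2\rho^{-1}\biggl[\sum_{n=1}^{N} |\beta_n|^{\alpha}\biggr], \qquad\text{and}\qquad \varepsilon \le \rho\biggl[\mathbb{1}_{\{0\}}\biggl(\sum_{n=1}^{N} |\beta_n|\biggr) + 2\biggl(\sum_{n=1}^{N} |\beta_n n|\biggr)\biggr]^{-1}. \tag{9.58} \]
Then it holds for all $x \in [\xi - \varepsilon, \xi + \varepsilon]$ that
\[ |p(x) - p(\xi)|^{\alpha} \le c\,|p'(x)|. \tag{9.59} \]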

Proof of Corollary 9.6.3. Throughout this proof, assume without loss of generality that

supx∈R |p(x) − p(ξ)| > 0. (9.60)

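Note that Corollary 9.6.1 and (9.60) ensure that $\sum_{n=1}^{N} |\beta_n| > 0$. Hence, we obtain that there exists $m \in \{1, 2, \dots, N\}$ which satisfies
\[ |\beta_m| > 0 = \sum_{n=1}^{m-1} |\beta_n|. \tag{9.61} \]
Observe that (9.61), the fact that $\alpha \ge 1 - N^{-1}$, and Corollary 9.6.2 ensure that for all $x \in \mathbb{R}$ with $|x - \xi| \le \frac{1}{2}\bigl[\sum_{n=1}^{N} \frac{|\beta_n n|}{|\beta_m m|}\bigr]^{-1}$ it holds that
\[ |p(x) - p(\xi)|^{\alpha} \le \biggl[\sum_{n=1}^{N} \frac{2|\beta_n|^{\alpha}}{|\beta_m m|}\biggr] |p'(x)| \le \biggl[\frac{2}{\rho}\biggl[\sum_{n=1}^{N} |\beta_n|^{\alpha}\biggr]\biggr] |p'(x)| \le c\,|p'(x)|. \tag{9.62} \]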

This establishes (9.59). The proof of Corollary 9.6.3 is thus complete.
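The quantities $\rho$, $c$, and $\varepsilon$ in Corollary 9.6.3 do not require knowing the index $m$ in advance and also cover constant polynomials. The following sketch computes them with equality in (9.58); the function name `general_kl_constants` and both sample polynomials are assumptions made for illustration.

```python
def general_kl_constants(betas, alpha):
    """Return (rho, c, eps) with equality in the constraints (9.58).

    betas = [beta_0, ..., beta_N]; alpha must satisfy 0 < alpha <= 1 and
    alpha >= 1 - 1/N (hypothetical helper).
    """
    N = len(betas) - 1
    assert 0.0 < alpha <= 1.0 and alpha >= 1.0 - 1.0 / N
    weighted = [abs(b * n) for n, b in enumerate(betas) if n >= 1]  # |beta_n n|
    total = sum(weighted)
    indicator = 1.0 if total == 0 else 0.0  # equals 1 only for constant polynomials
    rho = indicator + min([w for w in weighted if w != 0] + [total])
    c = 2.0 / rho * sum(abs(b) ** alpha for b in betas[1:])
    eps = rho / (indicator + 2.0 * total)
    return rho, c, eps


# Non-constant case: p(x) = x^2 + x^3 around xi = 0 with N = 3 and alpha = 3/4.
betas, xi, alpha = [0.0, 0.0, 1.0, 1.0], 0.0, 0.75
rho, c, eps = general_kl_constants(betas, alpha)
grid = [xi - eps + 2.0 * eps * i / 1000 for i in range(1001)]
holds = all(
    abs(sum(b * (x - xi) ** n for n, b in enumerate(betas) if n >= 1)) ** alpha
    <= c * abs(sum(n * b * (x - xi) ** (n - 1) for n, b in enumerate(betas) if n >= 1)) + 1e-15
    for x in grid
)

# Constant case: p(x) = 3, where both sides of (9.59) vanish.
rho0, c0, eps0 = general_kl_constants([3.0, 0.0, 0.0], 0.5)
print(rho, c, eps, holds, rho0, c0, eps0)
```

In the non-constant example $\rho = 2$, $c = 2$, and $\varepsilon = 1/5$; in the constant example $\rho = 1$, $c = 0$, and $\varepsilon = 1$, which is consistent with (9.59) since $0^{\alpha} = 0$ for $\alpha > 0$.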


Corollary 9.6.4 (Qualitative standard KL inequalities for general one-dimensional polynomials). Let $\xi \in \mathbb{R}$, $N \in \mathbb{N}$, $p \in C^{\infty}(\mathbb{R}, \mathbb{R})$ satisfy for all $x \in \mathbb{R}$ that $p^{(N)}(x) = 0$. Then there exist $\varepsilon, c \in (0, \infty)$, $\alpha \in (0, 1)$ such that for all $x \in [\xi - \varepsilon, \xi + \varepsilon]$ it holds that
\[ |p(x) - p(\xi)|^{\alpha} \le c\,|p'(x)|. \tag{9.63} \]

Proof of Corollary 9.6.4. Note that Corollary 9.6.3 establishes (9.63). The proof of Corol-
lary 9.6.4 is thus complete.

Corollary 9.6.5. Let L : R → R be a polynomial. Then L is a standard KL function (cf. Definition 9.1.2).

Proof of Corollary 9.6.5. Observe that (9.2) and Corollary 9.6.4 establish that L is a
standard KL function (cf. Definition 9.1.2). The proof of Corollary 9.6.5 is thus complete.

9.7 Power series and analytic functions


Definition 9.7.1 (Analytic functions). Let $m, n \in \mathbb{N}$, let $U \subseteq \mathbb{R}^m$ be open, and let $f \colon U \to \mathbb{R}^n$ be a function. Then we say that $f$ is analytic if and only if for all $x \in U$ there exists $\varepsilon \in (0, \infty)$ such that for all $y \in \{u \in U : \|x - u\|_2 < \varepsilon\}$ it holds that $f \in C^{\infty}(U, \mathbb{R}^n)$ and
\[ \limsup_{K \to \infty} \biggl\| f(y) - \sum_{k=0}^{K} \frac{1}{k!} f^{(k)}(x)(y - x, y - x, \dots, y - x) \biggr\|_2 = 0 \tag{9.64} \]
(cf. Definition 3.3.4).

Proposition 9.7.2 (Power series). Let $m, n \in \mathbb{N}$, $\varepsilon \in (0, \infty)$, let $U \subseteq \mathbb{R}^m$ satisfy $U = \{x \in \mathbb{R}^m : \|x\|_2 \le \varepsilon\}$, for every $k \in \mathbb{N}$ let $A_k \colon (\mathbb{R}^m)^k \to \mathbb{R}^n$ be $k$-linear and symmetric, and let $f \colon U \to \mathbb{R}^n$ satisfy for all $x \in U$ that
\[ \limsup_{K \to \infty} \biggl\| f(x) - f(0) - \sum_{k=1}^{K} A_k(x, x, \dots, x) \biggr\|_2 = 0 \tag{9.65} \]
(cf. Definition 3.3.4). Then

(i) it holds for all $x \in \{u \in U : \|u\|_2 < \varepsilon\}$ that $\sum_{k=1}^{\infty} \|A_k(x, x, \dots, x)\|_2 < \infty$ and
\[ f(x) = f(0) + \sum_{k=1}^{\infty} A_k(x, x, \dots, x), \tag{9.66} \]

(ii) it holds that $f|_{\{u \in U : \|u\|_2 < \varepsilon\}}$ is infinitely often differentiable,

(iii) it holds for all $x \in \{u \in U : \|u\|_2 < \varepsilon\}$, $l \in \mathbb{N}$, $v_1, v_2, \dots, v_l \in \mathbb{R}^m$ that
\[ \sum_{k=l}^{\infty} \biggl[\frac{k!}{(k-l)!}\biggr] \|A_k(v_1, v_2, \dots, v_l, x, x, \dots, x)\|_2 < \infty \tag{9.67} \]
and
\[ f^{(l)}(x)(v_1, \dots, v_l) = \sum_{k=l}^{\infty} \biggl[\frac{k!}{(k-l)!}\biggr] A_k(v_1, v_2, \dots, v_l, x, x, \dots, x), \tag{9.68} \]
and

(iv) it holds for all $k \in \mathbb{N}$ that $f^{(k)}(0) = k!\,A_k$.

Proof of Proposition 9.7.2. Throughout this proof, for every $K \in \mathbb{N}_0$ let $F_K \colon \mathbb{R}^m \to \mathbb{R}^n$ satisfy for all $x \in \mathbb{R}^m$ that
\[ F_K(x) = f(0) + \sum_{k=1}^{K} A_k(x, x, \dots, x). \tag{9.69} \]

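Note that (9.65) ensures that for all $x \in U$ it holds that
\[ \limsup_{K \to \infty} \|f(x) - F_K(x)\|_2 = 0. \tag{9.70} \]
Therefore, we obtain for all $x \in U$ that
\[ \limsup_{K \to \infty} \|F_{K+1}(x) - F_K(x)\|_2 = 0. \tag{9.71} \]
This proves for all $x \in U$ that
\[ \sup_{k \in \mathbb{N}} \|A_k(x, x, \dots, x)\|_2 = \sup_{K \in \mathbb{N}_0} \|F_{K+1}(x) - F_K(x)\|_2 < \infty. \tag{9.72} \]
Hence, we obtain for all $x \in \{u \in U : \|u\|_2 < \varepsilon\} \setminus \{0\}$ that
\[
\begin{aligned}
\sum_{k=1}^{\infty} \|A_k(x, x, \dots, x)\|_2 &= \sum_{k=1}^{\infty} \biggl(\frac{\|x\|_2}{\varepsilon}\biggr)^{k} \biggl\| A_k\Bigl(\tfrac{\varepsilon x}{\|x\|_2}, \tfrac{\varepsilon x}{\|x\|_2}, \dots, \tfrac{\varepsilon x}{\|x\|_2}\Bigr) \biggr\|_2 \\
&\le \biggl[\sum_{k=1}^{\infty} \biggl(\frac{\|x\|_2}{\varepsilon}\biggr)^{k}\biggr] \biggl[\sup_{k \in \mathbb{N}} \biggl\| A_k\Bigl(\tfrac{\varepsilon x}{\|x\|_2}, \tfrac{\varepsilon x}{\|x\|_2}, \dots, \tfrac{\varepsilon x}{\|x\|_2}\Bigr) \biggr\|_2\biggr] < \infty.
\end{aligned}
\tag{9.73}
\]
This shows that for all $x \in \{u \in U : \|u\|_2 < \varepsilon\}$ it holds that
\[ \sum_{k=1}^{\infty} \|A_k(x, x, \dots, x)\|_2 < \infty. \tag{9.74} \]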

Combining this with (9.65) establishes item (i). Observe that, for instance, Krantz &
Parks [254, Proposition 2.2.3] implies items (ii) and (iii). Note that (9.68) implies item (iv).
The proof of Proposition 9.7.2 is thus complete.


Proposition 9.7.3 (Characterization for analytic functions). Let $m, n \in \mathbb{N}$, let $U \subseteq \mathbb{R}^m$ be open, and let $f \in C^{\infty}(U, \mathbb{R}^n)$. Then the following three statements are equivalent:

(i) It holds that $f$ is analytic (cf. Definition 9.7.1).

(ii) It holds for all $x \in U$ that there exists $\varepsilon \in (0, \infty)$ such that for all $y \in \{u \in U : \|x - u\|_2 < \varepsilon\}$ it holds that $\sum_{k=0}^{\infty} \frac{1}{k!} \|f^{(k)}(x)(y - x, y - x, \dots, y - x)\|_2 < \infty$ and
\[ f(y) = \sum_{k=0}^{\infty} \frac{1}{k!} f^{(k)}(x)(y - x, y - x, \dots, y - x). \tag{9.75} \]

(iii) It holds for all compact $C \subseteq U$ that there exists $c \in \mathbb{R}$ such that for all $x \in C$, $k \in \mathbb{N}$, $v \in \mathbb{R}^m$ it holds that
\[ \|f^{(k)}(x)(v, v, \dots, v)\|_2 \le k!\, c^k\, \|v\|_2^k. \tag{9.76} \]

Proof of Proposition 9.7.3. The equivalence is a direct consequence of Proposition 9.7.2.
The proof of Proposition 9.7.3 is thus complete.

9.8 Standard KL inequalities for one-dimensional analytic functions
In Section 9.6 above we have seen that one-dimensional polynomials are standard KL
functions (see Corollary 9.6.5). In this section we verify that one-dimensional analytic
functions are also standard KL functions (see Corollary 9.8.6 below). The main arguments
for this statement are presented in the proof of Lemma 9.8.2 and are inspired by [129].

Lemma 9.8.1. Let $\varepsilon \in (0, \infty)$, let $U \subseteq \mathbb{R}$ satisfy $U = \{x \in \mathbb{R} : |x| \le \varepsilon\}$, let $(a_k)_{k \in \mathbb{N}} \subseteq \mathbb{R}$, and let $f \colon U \to \mathbb{R}$ satisfy for all $x \in U$ that
\[ \limsup_{K \to \infty} \biggl| f(x) - f(0) - \sum_{k=1}^{K} a_k x^k \biggr| = 0. \tag{9.77} \]
Then

(i) it holds for all $x \in \{y \in U : |y| < \varepsilon\}$ that $\sum_{k=1}^{\infty} |a_k|\,|x|^k < \infty$ and
\[ f(x) = f(0) + \sum_{k=1}^{\infty} a_k x^k, \tag{9.78} \]

(ii) it holds that $f|_{\{y \in U : |y| < \varepsilon\}}$ is infinitely often differentiable,

(iii) it holds for all $x \in \{y \in U : |y| < \varepsilon\}$, $l \in \mathbb{N}$ that $\sum_{k=l}^{\infty} \bigl[\frac{k!}{(k-l)!}\bigr] |a_k|\,|x|^{k-l} < \infty$ and
\[ f^{(l)}(x) = \sum_{k=l}^{\infty} \biggl[\frac{k!}{(k-l)!}\biggr] a_k x^{k-l}, \tag{9.79} \]
and

(iv) it holds for all $k \in \mathbb{N}$ that $f^{(k)}(0) = k!\,a_k$.

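Proof of Lemma 9.8.1. Observe that Proposition 9.7.2 (applied with $m \curvearrowleft 1$, $n \curvearrowleft 1$, $\varepsilon \curvearrowleft \varepsilon$, $U \curvearrowleft U$, $(A_k)_{k \in \mathbb{N}} \curvearrowleft \bigl((\mathbb{R}^k \ni (x_1, x_2, \dots, x_k) \mapsto a_k x_1 x_2 \cdots x_k \in \mathbb{R})\bigr)_{k \in \mathbb{N}}$, $f \curvearrowleft f$ in the
notation of Proposition 9.7.2) establishes items (i), (ii), (iii), and (iv). The proof of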
Lemma 9.8.1 is thus complete.
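Items (iii) and (iv) of Lemma 9.8.1 can be illustrated with the geometric series. The sketch below assumes $a_k = 1$ for all $k \in \mathbb{N}$ and $\varepsilon = 1/2$, so that $f(x) = x/(1 - x)$ and $f^{(l)}(x) = l!\,(1 - x)^{-(l+1)}$ for $l \in \mathbb{N}$; the helper name `series_derivative` and the truncation at 200 terms are illustrative choices.

```python
from math import factorial


def series_derivative(a, l, x, terms=200):
    """Partial sum of sum_{k >= l} k!/(k-l)! * a(k) * x**(k-l), cf. (9.79),
    truncated after the index terms - 1 (hypothetical helper, l >= 1)."""
    return sum(factorial(k) // factorial(k - l) * a(k) * x ** (k - l)
               for k in range(l, terms))


def a(k):
    """Coefficients a_k = 1 of f(x) = x / (1 - x) (illustrative choice)."""
    return 1.0


x, l = 0.3, 3
approx = series_derivative(a, l, x)
exact = factorial(l) / (1.0 - x) ** (l + 1)  # closed form of f^(l)(x)
at_zero = series_derivative(a, 4, 0.0)       # item (iv): f^(4)(0) = 4! * a_4
print(approx, exact, at_zero)
```

The truncated series reproduces the closed-form third derivative at $x = 0.3$ up to rounding, and evaluating at $x = 0$ leaves only the term $k = l$, in line with item (iv).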

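Lemma 9.8.2. Let $\varepsilon, \delta \in (0, 1)$, $N \in \mathbb{N} \setminus \{1\}$, $(a_n)_{n \in \mathbb{N}_0} \subseteq \mathbb{R}$ satisfy $N = \min(\{k \in \mathbb{N} : a_k \ne 0\} \cup \{\infty\})$, let $U \subseteq \mathbb{R}$ satisfy $U = \{\xi \in \mathbb{R} : |\xi| \le \varepsilon\}$, let $L \colon U \to \mathbb{R}$ satisfy for all $\theta \in U$ that
\[ \limsup_{K \to \infty} \biggl| L(\theta) - L(0) - \sum_{k=1}^{K} a_k \theta^k \biggr| = 0, \tag{9.80} \]
and let $M \in \mathbb{N} \cap (N, \infty)$ satisfy for all $k \in \mathbb{N} \cap [M, \infty)$ that $k|a_k| \le (2\varepsilon^{-1})^k$ and
\[ \delta = \min\Bigl\{ \tfrac{\varepsilon}{4},\; |a_N| \bigl[ 2(\max\{|a_1|, |a_2|, \dots, |a_M|\}) + (2\varepsilon^{-1})^{N+1} \bigr]^{-1} \Bigr\}. \tag{9.81} \]
Then it holds for all $\theta \in \{\xi \in \mathbb{R} : |\xi| < \delta\}$ that
\[ |L(\theta) - L(0)|^{\frac{N-1}{N}} \le 2\,|L'(\theta)|. \tag{9.82} \]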

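Proof of Lemma 9.8.2. Note that the assumption that for all $k \in \mathbb{N} \cap [M, \infty)$ it holds that $|a_k| \le k|a_k| \le (2\varepsilon^{-1})^k$ ensures that for all $K \in \mathbb{N} \cap [M, \infty)$ it holds that
\[
\begin{aligned}
\sum_{k=N+1}^{K+N+1} |a_k|\,|\theta|^k &= |\theta|^{N+1} \biggl[\sum_{k=0}^{K} |a_{k+N+1}|\,|\theta|^k\biggr] \\
&= |\theta|^{N+1} \biggl[\biggl(\sum_{k=0}^{M} |a_{k+N+1}|\,|\theta|^k\biggr) + \biggl(\sum_{k=M+1}^{K} |a_{k+N+1}|\,|\theta|^k\biggr)\biggr] \\
&\le |\theta|^{N+1} \biggl[(\max\{|a_1|, |a_2|, \dots, |a_M|\}) \biggl(\sum_{k=0}^{M} |\theta|^k\biggr) + \biggl(\sum_{k=M+1}^{K} (2\varepsilon^{-1})^{k+N+1} |\theta|^k\biggr)\biggr] \\
&= |\theta|^{N+1} \biggl[(\max\{|a_1|, |a_2|, \dots, |a_M|\}) \biggl(\sum_{k=0}^{M} |\theta|^k\biggr) + (2\varepsilon^{-1})^{N+1} \biggl(\sum_{k=M+1}^{K} (2\varepsilon^{-1}|\theta|)^k\biggr)\biggr].
\end{aligned}
\tag{9.83}
\]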


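Therefore, we obtain for all $\theta \in \mathbb{R}$ with $|\theta| \le \frac{\varepsilon}{4}$ that
\[
\begin{aligned}
\sum_{k=N+1}^{\infty} |a_k|\,|\theta|^k &\le |\theta|^{N+1} \biggl[(\max\{|a_1|, |a_2|, \dots, |a_M|\}) \biggl(\sum_{k=0}^{\infty} \bigl(\tfrac{1}{4}\bigr)^k\biggr) + (2\varepsilon^{-1})^{N+1} \biggl(\sum_{k=1}^{\infty} \bigl(\tfrac{1}{2}\bigr)^k\biggr)\biggr] \\
&\le |\theta|^{N+1} \bigl[ 2(\max\{|a_1|, |a_2|, \dots, |a_M|\}) + (2\varepsilon^{-1})^{N+1} \bigr].
\end{aligned}
\tag{9.84}
\]
This demonstrates that for all $\theta \in \mathbb{R}$ with $|\theta| \le \delta$ it holds that
\[ \sum_{k=N+1}^{\infty} |a_k|\,|\theta|^k \le |a_N|\,|\theta|^N. \tag{9.85} \]
Hence, we obtain for all $\theta \in \mathbb{R}$ with $|\theta| \le \delta$ that
\[ |L(\theta) - L(0)| = \biggl|\sum_{k=N}^{\infty} a_k \theta^k\biggr| \le |a_N|\,|\theta|^N + \sum_{k=N+1}^{\infty} |a_k|\,|\theta|^k \le 2|a_N|\,|\theta|^N. \tag{9.86} \]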

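Next observe that the assumption that for all $k \in \mathbb{N} \cap [M, \infty)$ it holds that $k|a_k| \le (2\varepsilon^{-1})^k$ ensures that for all $K \in \mathbb{N} \cap [M, \infty)$ it holds that
\[
\begin{aligned}
\sum_{k=N+1}^{N+K+1} k|a_k|\,|\theta|^{k-1} &= |\theta|^{N} \biggl[\biggl(\sum_{k=0}^{M-N-1} (k+N+1)|a_{k+N+1}|\,|\theta|^k\biggr) + \biggl(\sum_{k=M-N}^{K} (k+N+1)|a_{k+N+1}|\,|\theta|^k\biggr)\biggr] \\
&\le |\theta|^{N} \biggl[\max\{|a_1|, 2|a_2|, \dots, M|a_M|\} \biggl(\sum_{k=0}^{M-N-1} |\theta|^k\biggr) + \biggl(\sum_{k=M-N}^{K} (2\varepsilon^{-1})^{k+N+1} |\theta|^k\biggr)\biggr] \\
&\le |\theta|^{N} \biggl[\max\{|a_1|, 2|a_2|, \dots, M|a_M|\} \biggl(\sum_{k=0}^{M-N-1} |\theta|^k\biggr) + (2\varepsilon^{-1})^{N+1} \biggl(\sum_{k=M-N}^{K-N} |2\varepsilon^{-1}\theta|^k\biggr)\biggr].
\end{aligned}
\tag{9.87}
\]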

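Therefore, we obtain for all $\theta \in \mathbb{R}$ with $|\theta| \le \frac{\varepsilon}{4}$ that
\[
\begin{aligned}
\sum_{k=N+1}^{\infty} k|a_k|\,|\theta|^{k-1} &\le |\theta|^{N} \biggl[\max\{|a_1|, 2|a_2|, \dots, M|a_M|\} \biggl(\sum_{k=0}^{\infty} \bigl(\tfrac{1}{4}\bigr)^k\biggr) + (2\varepsilon^{-1})^{N+1} \biggl(\sum_{k=1}^{\infty} \bigl(\tfrac{1}{2}\bigr)^k\biggr)\biggr] \\
&\le |\theta|^{N} \bigl[ 2(\max\{|a_1|, 2|a_2|, \dots, M|a_M|\}) + (2\varepsilon^{-1})^{N+1} \bigr].
\end{aligned}
\tag{9.88}
\]
This establishes that for all $\theta \in \mathbb{R}$ with $|\theta| \le \delta$ it holds that
\[ \sum_{k=N+1}^{\infty} k|a_k|\,|\theta|^{k-1} \le |a_N|\,|\theta|^{N-1}. \tag{9.89} \]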


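Hence, we obtain for all $K \in \mathbb{N} \cap [N, \infty)$, $\theta \in \mathbb{R}$ with $|\theta| < \delta$ that
\[ \biggl|\sum_{k=1}^{K} k a_k \theta^{k-1}\biggr| = \biggl|\sum_{k=N}^{K} k a_k \theta^{k-1}\biggr| \ge N|a_N|\,|\theta|^{N-1} - \sum_{k=N+1}^{\infty} k|a_k|\,|\theta|^{k-1} \ge (N-1)|a_N|\,|\theta|^{N-1}. \tag{9.90} \]
Proposition 9.7.2 therefore proves that for all $\theta \in \{\xi \in \mathbb{R} : |\xi| < \varepsilon\}$ it holds that $\sum_{k=1}^{\infty} k|a_k \theta^{k-1}| < \infty$ and
\[ |L'(\theta)| = \biggl|\sum_{k=1}^{\infty} k a_k \theta^{k-1}\biggr| \ge (N-1)|a_N|\,|\theta|^{N-1}. \tag{9.91} \]
Combining this with (9.86) shows that for all $\theta \in \mathbb{R}$ with $|\theta| \le \delta$ it holds that
\[ |L(\theta) - L(0)|^{\frac{N-1}{N}} \le |2a_N|^{\frac{N-1}{N}} |\theta|^{N-1} \le |2a_N|^{\frac{N-1}{N}} (N-1)^{-1} |a_N|^{-1} |L'(\theta)| \le 2\,|L'(\theta)|. \tag{9.92} \]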
The proof of Lemma 9.8.2 is thus complete.
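The exponent $\frac{N-1}{N}$ in (9.82) can be observed on a concrete analytic function. The sketch below assumes $L(\theta) = 1 - \cos(\theta)$, for which $a_1 = 0$ and $a_2 = \frac{1}{2}$, so that $N = 2$ and the exponent is $\frac{1}{2}$. The radius $0.5$ of the test interval is an illustrative choice and is not the $\delta$ from (9.81); the function name `kl_ratio` is hypothetical.

```python
import math


def kl_ratio(theta):
    """|L(theta) - L(0)|^(1/2) / |L'(theta)| for L(theta) = 1 - cos(theta), theta != 0."""
    return math.sqrt(1.0 - math.cos(theta)) / abs(math.sin(theta))


# Sample points on both sides of 0 (excluding 0 itself, where both sides vanish).
thetas = [sign * 0.5 * i / 200 for i in range(1, 201) for sign in (-1.0, 1.0)]
worst = max(kl_ratio(theta) for theta in thetas)
print(worst)
```

Since $\sqrt{1 - \cos\theta} = \sqrt{2}\,|\sin(\theta/2)|$ and $|\sin\theta| = 2|\sin(\theta/2)|\,|\cos(\theta/2)|$, the ratio equals $(\sqrt{2}\cos(\theta/2))^{-1}$, which stays close to $0.71$ on this interval and hence below the constant $2$ appearing in (9.82).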
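Corollary 9.8.3. Let $\varepsilon \in (0, \infty)$, $(a_k)_{k \in \mathbb{N}} \subseteq \mathbb{R}$, $U \subseteq \mathbb{R}$ satisfy $U = \{\theta \in \mathbb{R} : |\theta| \le \varepsilon\}$ and let $L \colon U \to \mathbb{R}$ satisfy for all $\theta \in U$ that
\[ \limsup_{K \to \infty} \biggl| L(\theta) - L(0) - \sum_{k=1}^{K} a_k \theta^k \biggr| = 0. \tag{9.93} \]
Then there exist $\delta \in (0, \varepsilon)$, $c \in (0, \infty)$, $\alpha \in (0, 1)$ such that for all $\theta \in \{\xi \in \mathbb{R} : |\xi| < \delta\}$ it holds that
\[ |L(\theta) - L(0)|^{\alpha} \le c\,|L'(\theta)|. \tag{9.94} \]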
Proof of Corollary 9.8.3. Throughout this proof, assume without loss of generality that
ε < 1, let N ∈ N ∪ {∞} satisfy N = min({k ∈ N : ak ̸= 0} ∪ {∞}), and assume without
loss of generality that 1 < N < ∞ (cf. item (iv) in Lemma 9.8.1 and Corollary 9.4.2). Note
that item (iii) in Lemma 9.8.1 ensures that for all θ ∈ R with |θ| < ε it holds that

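\[ \sum_{k=1}^{\infty} k|a_k|\,|\theta|^{k-1} < \infty. \tag{9.95} \]
Hence, we obtain that
\[ \sum_{k=1}^{\infty} k|a_k| \bigl(\tfrac{\varepsilon}{2}\bigr)^{k} < \infty. \tag{9.96} \]
This implies that there exists $M \in \mathbb{N} \cap (N, \infty)$ which satisfies that for all $k \in \mathbb{N} \cap [M, \infty)$ it holds that
\[ k|a_k| \le (2\varepsilon^{-1})^{k-1} \le (2\varepsilon^{-1})^{k}. \tag{9.97} \]
Lemma 9.8.2 therefore establishes that for all $\theta \in \bigl\{\xi \in \mathbb{R} : |\xi| < \min\bigl\{\tfrac{\varepsilon}{4}, |a_N| [2(\max\{|a_1|, |a_2|, \dots, |a_M|\}) + (2\varepsilon^{-1})^{N+1}]^{-1}\bigr\}\bigr\}$ it holds that
\[ |L(\theta) - L(0)|^{\frac{N-1}{N}} \le 2\,|L'(\theta)|. \tag{9.98} \]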
The proof of Corollary 9.8.3 is thus complete.


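Corollary 9.8.4. Let $\varepsilon \in (0, \infty)$, $(a_k)_{k \in \mathbb{N}} \subseteq \mathbb{R}$, $U \subseteq \mathbb{R}$, $\vartheta \in U$ satisfy $U = \{\theta \in \mathbb{R} : |\theta - \vartheta| \le \varepsilon\}$ and let $L \colon U \to \mathbb{R}$ satisfy for all $\theta \in U$ that
\[ \limsup_{K \to \infty} \biggl| L(\theta) - L(\vartheta) - \sum_{k=1}^{K} a_k (\theta - \vartheta)^k \biggr| = 0. \tag{9.99} \]
Then there exist $\delta \in (0, \varepsilon)$, $c \in (0, \infty)$, $\alpha \in (0, 1)$ such that for all $\theta \in \{\xi \in \mathbb{R} : |\xi - \vartheta| < \delta\}$ it holds that
\[ |L(\theta) - L(\vartheta)|^{\alpha} \le c\,|L'(\theta)|. \tag{9.100} \]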

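Proof of Corollary 9.8.4. Throughout this proof, let $V \subseteq \mathbb{R}$ satisfy $V = \{\theta \in \mathbb{R} : |\theta| \le \varepsilon\}$ and let $M \colon V \to \mathbb{R}$ satisfy for all $\theta \in V$ that $M(\theta) = L(\theta + \vartheta)$. Observe that (9.99) and the fact that for all $\theta \in V$ it holds that $\theta + \vartheta \in U$ ensure that for all $\theta \in V$ it holds that
\[
\begin{aligned}
&\limsup_{K \to \infty} \biggl| M(\theta) - M(0) - \sum_{k=1}^{K} a_k \theta^k \biggr| \\
&= \limsup_{K \to \infty} \biggl| L(\theta + \vartheta) - L(\vartheta) - \sum_{k=1}^{K} a_k ((\theta + \vartheta) - \vartheta)^k \biggr| = 0.
\end{aligned}
\tag{9.101}
\]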

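Corollary 9.8.3 hence establishes that there exist $\delta \in (0, \varepsilon)$, $c \in (0, \infty)$, $\alpha \in (0, 1)$ which satisfy that for all $\theta \in \{\xi \in \mathbb{R} : |\xi| < \delta\}$ it holds that
\[ |M(\theta) - M(0)|^{\alpha} \le c\,|M'(\theta)|. \tag{9.102} \]
Therefore, we obtain for all $\theta \in \{\xi \in \mathbb{R} : |\xi| < \delta\}$ that
\[ |L(\theta + \vartheta) - L(\vartheta)|^{\alpha} = |M(\theta) - M(0)|^{\alpha} \le c\,|M'(\theta)| = c\,|L'(\theta + \vartheta)|. \tag{9.103} \]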

This implies (9.100). The proof of Corollary 9.8.4 is thus complete.

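Corollary 9.8.5. Let $U \subseteq \mathbb{R}$ be open, let $L \colon U \to \mathbb{R}$ be analytic, and let $\vartheta \in U$ (cf. Definition 9.7.1). Then there exist $\varepsilon, c \in (0, \infty)$, $\alpha \in (0, 1)$ such that for all $\theta \in \{\xi \in U : |\vartheta - \xi| < \varepsilon\}$ it holds that
\[ |L(\vartheta) - L(\theta)|^{\alpha} \le c\,|(\nabla L)(\theta)|. \tag{9.104} \]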

Proof of Corollary 9.8.5. Note that Corollary 9.8.4 establishes (9.104). The proof of Corol-
lary 9.8.5 is thus complete.

Corollary 9.8.6. Let L : R → R be analytic (cf. Definition 9.7.1). Then L is a standard KL function (cf. Definition 9.1.2).

Proof of Corollary 9.8.6. Observe that (9.2) and Corollary 9.8.5 establish that L is a
standard KL function (cf. Definition 9.1.2). The proof of Corollary 9.8.6 is thus complete.


9.9 Standard KL inequalities for analytic functions


Theorem 9.9.1 (Standard KL inequalities for analytic functions). Let d ∈ N, let U ⊆ Rd
be open, let L : U → R be analytic, and let ϑ ∈ U (cf. Definition 9.7.1). Then there exist
ε, c ∈ (0, ∞), α ∈ (0, 1) such that for all θ ∈ {u ∈ U : ∥ϑ − u∥2 < ε} it holds that

|L(ϑ) − L(θ)|α ≤ c ∥(∇L)(θ)∥2 (9.105)

(cf. Definition 3.3.4).

Proof of Theorem 9.9.1. Note that Łojasiewicz [281, Proposition 1] demonstrates (9.105)
(cf., for example, also Bierstone & Milman [38, Proposition 6.8]). The proof of Theorem 9.9.1
is thus complete.

Corollary 9.9.2. Let d ∈ N and let L : Rd → R be analytic (cf. Definition 9.7.1). Then
L is a standard KL function (cf. Definition 9.1.2).

Proof of Corollary 9.9.2. Observe that (9.2) and Theorem 9.9.1 establish that L is a
standard KL function (cf. Definition 9.1.2). The proof of Corollary 9.9.2 is thus complete.

9.10 Counterexamples
Example 9.10.1 (Example of a smooth function that is not a standard KL function). Let $L \colon \mathbb{R} \to \mathbb{R}$ satisfy for all $x \in \mathbb{R}$ that
\[ L(x) = \begin{cases} \exp(-x^{-1}) & : x > 0 \\ 0 & : x \le 0. \end{cases} \tag{9.106} \]

Then

(i) it holds that L ∈ C ∞ (R, R),

(ii) it holds for all x ∈ (0, ∞) that L ′ (x) = x−2 exp(−x−1 ),

(iii) it holds for all $\alpha \in (0, 1)$, $\varepsilon \in (0, \infty)$ that
\[ \sup_{x \in (0, \varepsilon)} \biggl(\frac{|L(x) - L(0)|^{\alpha}}{|L'(x)|}\biggr) = \infty, \tag{9.107} \]

and

(iv) it holds that L is not a standard KL function

(cf. Definition 9.1.2).


Proof for Example 9.10.1. Throughout this proof, let

P = {f ∈ C((0, ∞), R) : f is a polynomial} (9.108)

and for every f ∈ C((0, ∞), R) let Gf : (0, ∞) → R satisfy for all x ∈ (0, ∞) that

Gf (x) = f (x−1 ) exp(−x−1 ). (9.109)

Note that the chain rule and the product rule ensure that for all $f \in C^1((0, \infty), \mathbb{R})$, $x \in (0, \infty)$ it holds that $G_f \in C^1((0, \infty), \mathbb{R})$ and
\[
\begin{aligned}
(G_f)'(x) &= -f'(x^{-1})\,x^{-2} \exp(-x^{-1}) + f(x^{-1})\,x^{-2} \exp(-x^{-1}) \\
&= (f(x^{-1}) - f'(x^{-1}))\,x^{-2} \exp(-x^{-1}).
\end{aligned}
\tag{9.110}
\]
Hence, we obtain for all p ∈ P that there exists q ∈ P such that

(Gp )′ = Gq . (9.111)

Combining this and (9.110) with induction ensures that for all p ∈ P , n ∈ N it holds that

Gp ∈ C ∞ ((0, ∞), R) and (∃ q ∈ P : (Gp )(n) = Gq ). (9.112)

This and the fact that for all $p \in P$ it holds that $\lim_{x \searrow 0} G_p(x) = 0$ establish that for all $p \in P$, $n \in \mathbb{N}$ it holds that
\[ \lim_{x \searrow 0} (G_p)^{(n)}(x) = 0. \tag{9.113} \]
The fact that $L|_{(0,\infty)} = G_{(0,\infty) \ni x \mapsto 1 \in \mathbb{R}}$ and (9.110) therefore establish item (i) and item (ii). Observe that (9.106) and the fact that for all $y \in (0, \infty)$ it holds that
\[ \exp(y) = \sum_{k=0}^{\infty} \frac{y^k}{k!} \ge \frac{y^3}{3!} = \frac{y^3}{6} \tag{9.114} \]

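ensure that for all $\alpha \in (0, 1)$, $\varepsilon \in (0, \infty)$, $x \in (0, \varepsilon)$ it holds that
\[
\begin{aligned}
\frac{|L(x) - L(0)|^{\alpha}}{|L'(x)|} &= \frac{|L(x)|^{\alpha}}{|L'(x)|} = \frac{x^2 |L(x)|^{\alpha}}{L(x)} = x^2 |L(x)|^{\alpha - 1} \\
&= x^2 \exp\Bigl(\frac{1 - \alpha}{x}\Bigr) \ge \frac{x^2 (1 - \alpha)^3}{6x^3} = \frac{(1 - \alpha)^3}{6x}.
\end{aligned}
\tag{9.115}
\]
Hence, we obtain for all $\alpha \in (0, 1)$, $\varepsilon \in (0, \infty)$ that
\[ \sup_{x \in (0, \varepsilon)} \biggl(\frac{|L(x) - L(0)|^{\alpha}}{|L'(x)|}\biggr) \ge \sup_{x \in (0, \varepsilon)} \biggl(\frac{(1 - \alpha)^3}{6x}\biggr) = \infty. \tag{9.116} \]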

The proof for Example 9.10.1 is thus complete.
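The blow-up in (9.116) is already visible for moderately small arguments. The following sketch evaluates the quotient from (9.115) directly and compares it with the lower bound $(1 - \alpha)^3 / (6x)$; the exponent $\alpha = 1/2$, the sample points, and the function names `ratio` and `lower_bound` are illustrative assumptions.

```python
import math


def ratio(x, alpha):
    """|L(x) - L(0)|^alpha / |L'(x)| for L(x) = exp(-1/x) and x > 0."""
    value = math.exp(-1.0 / x)               # L(x)
    derivative = x ** -2 * math.exp(-1.0 / x)  # L'(x), item (ii)
    return value ** alpha / derivative


def lower_bound(x, alpha):
    """Right-hand side of (9.115)."""
    return (1.0 - alpha) ** 3 / (6.0 * x)


alpha = 0.5
xs = [0.1, 0.05, 0.02, 0.01]
values = [ratio(x, alpha) for x in xs]
bounds = [lower_bound(x, alpha) for x in xs]
print(list(zip(xs, values, bounds)))
```

For these sample points the quotient grows from roughly $1.5$ at $x = 0.1$ to more than $10^{17}$ at $x = 0.01$, so no finite constant $c$ can work on any interval $(0, \varepsilon)$.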


Example 9.10.2 (Example of a differentiable function that fails to satisfy the standard KL inequality). Let $L \colon \mathbb{R} \to \mathbb{R}$ satisfy for all $x \in \mathbb{R}$ that
\[ L(x) = \int_0^{\max\{x, 0\}} y\,|\sin(y^{-1})| \, dy. \tag{9.117} \]

Then

(i) it holds that L ∈ C 1 (R, R),

(ii) it holds for all c ∈ R, α, ε ∈ (0, ∞) that there exist x ∈ (0, ε) such that

|L(x) − L(0)|α > c|L ′ (x)|, (9.118)

and

(iii) it holds for all c ∈ R, α, ε ∈ (0, ∞) that we do not have that L satisfies the standard
KL inequality at 0 on [0, ε) with exponent α and constant c

(cf. Definition 9.1.1).

Proof for Example 9.10.2. Throughout this proof, let $G \colon \mathbb{R} \to \mathbb{R}$ satisfy for all $x \in \mathbb{R}$ that
\[ G(x) = \begin{cases} x\,|\sin(x^{-1})| & : x > 0 \\ 0 & : x \le 0. \end{cases} \tag{9.119} \]

Note that (9.119) proves that for all k ∈ N it holds that

G((kπ)−1 ) = (kπ)−1 |sin(kπ)| = 0. (9.120)

Furthermore, observe that (9.119) shows for all x ∈ (0, ∞) that

|G(x) − G(0)| = |x sin(x−1 )| ≤ |x|. (9.121)

Therefore, we obtain that G is continuous. This, (9.117), and the fundamental theorem of
calculus ensure that L is continuously differentiable with

L ′ = G. (9.122)

Combining this with (9.120) demonstrates that for all c ∈ R, α ∈ (0, ∞), k ∈ N it holds
that

|L((kπ)−1 ) − L(0)|α = [L((kπ)−1 )]α > 0 = c|G((kπ)−1 )| = c|L ′ ((kπ)−1 )|. (9.123)

The proof for Example 9.10.2 is thus complete.
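The mechanism behind (9.123) is that the derivative vanishes at the points $(k\pi)^{-1}$ while the function value there is strictly positive. The sketch below illustrates this numerically, assuming a midpoint rule with 20000 subintervals as an approximation of the integral in (9.117); the step count and the chosen values of $k$ are illustrative, and in floating-point arithmetic $G((k\pi)^{-1})$ is only zero up to rounding.

```python
import math


def G(x):
    """Derivative of L, see (9.119)."""
    return x * abs(math.sin(1.0 / x)) if x > 0 else 0.0


def L(x, steps=20000):
    """Midpoint-rule approximation of the integral in (9.117) (illustrative accuracy)."""
    upper = max(x, 0.0)
    h = upper / steps
    return sum(G((i + 0.5) * h) for i in range(steps)) * h


points = [1.0 / (k * math.pi) for k in (1, 2, 5)]
report = [(x, L(x), G(x)) for x in points]
for x, value, derivative in report:
    print(x, value, derivative)
```

At each of these points the approximated value of $L$ is clearly positive while the derivative is zero up to rounding error, so $|L(x) - L(0)|^{\alpha} > c\,|L'(x)|$ regardless of the choice of $c$ and $\alpha$.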


9.11 Convergence analysis for solutions of GF ODEs


9.11.1 Abstract local convergence results for GF processes
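Lemma 9.11.1. Let $d \in \mathbb{N}$, $\Theta \in C([0, \infty), \mathbb{R}^d)$, $L \in C^1(\mathbb{R}^d, \mathbb{R})$, let $G \colon \mathbb{R}^d \to \mathbb{R}^d$ satisfy for all $\theta \in \mathbb{R}^d$ that $G(\theta) = (\nabla L)(\theta)$, and assume for all $t \in [0, \infty)$ that $\Theta_t = \Theta_0 - \int_0^t G(\Theta_s) \, ds$. Then it holds for all $t \in [0, \infty)$ that
\[ L(\Theta_t) = L(\Theta_0) - \int_0^t \|G(\Theta_s)\|_2^2 \, ds \tag{9.124} \]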

(cf. Definition 3.3.4).


Proof of Lemma 9.11.1. Note that Lemma 5.2.3 implies (9.124). This completes the proof
of Lemma 9.11.1.
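The energy identity (9.124) can be verified on an example with a closed-form gradient flow. The sketch below assumes $L(\theta) = \frac{1}{2}\|\theta\|_2^2$, so that $G(\theta) = \theta$ and $\Theta_t = e^{-t}\,\Theta_0$, and approximates the integral with the trapezoidal rule; the initial value, the time $t = 1.5$, and all function names are illustrative choices.

```python
import math

theta0 = [1.0, -2.0]  # illustrative initial value Theta_0


def flow(t):
    """Closed-form GF solution for L(theta) = 0.5 * ||theta||^2, i.e. G(theta) = theta."""
    return [math.exp(-t) * v for v in theta0]


def loss(theta):
    return 0.5 * sum(v * v for v in theta)


def grad_norm_sq(theta):
    """||G(theta)||_2^2 for G(theta) = theta."""
    return sum(v * v for v in theta)


def dissipated(t, steps=10000):
    """Trapezoidal approximation of int_0^t ||G(Theta_s)||_2^2 ds."""
    h = t / steps
    vals = [grad_norm_sq(flow(i * h)) for i in range(steps + 1)]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))


t = 1.5
lhs = loss(flow(t))                    # L(Theta_t)
rhs = loss(theta0) - dissipated(t)     # L(Theta_0) - int_0^t ||G(Theta_s)||^2 ds
print(lhs, rhs)
```

Both sides agree up to the quadrature error, and both equal $\frac{1}{2}\|\Theta_0\|_2^2\, e^{-2t}$ in this example.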
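Proposition 9.11.2. Let $d \in \mathbb{N}$, $\vartheta \in \mathbb{R}^d$, $c \in \mathbb{R}$, $C, \varepsilon \in (0, \infty)$, $\alpha \in (0, 1)$, $\Theta \in C([0, \infty), \mathbb{R}^d)$, $L \in C(\mathbb{R}^d, \mathbb{R})$, let $G \colon \mathbb{R}^d \to \mathbb{R}^d$ be $\mathcal{B}(\mathbb{R}^d)/\mathcal{B}(\mathbb{R}^d)$-measurable, assume for all $t \in [0, \infty)$ that
\[ L(\Theta_t) = L(\Theta_0) - \int_0^t \|G(\Theta_s)\|_2^2 \, ds \qquad\text{and}\qquad \Theta_t = \Theta_0 - \int_0^t G(\Theta_s) \, ds, \tag{9.125} \]
and assume for all $\theta \in \mathbb{R}^d$ with $\|\theta - \vartheta\|_2 < \varepsilon$ that
\[ |L(\theta) - L(\vartheta)|^{\alpha} \le C\|G(\theta)\|_2, \qquad c = |L(\Theta_0) - L(\vartheta)|, \qquad C(1 - \alpha)^{-1} c^{1 - \alpha} + \|\Theta_0 - \vartheta\|_2 < \varepsilon, \tag{9.126} \]
and $\inf_{t \in \{s \in [0, \infty) \colon \forall\, r \in [0, s] \colon \|\Theta_r - \vartheta\|_2 < \varepsilon\}} L(\Theta_t) \ge L(\vartheta)$ (cf. Definition 3.3.4). Then there exists $\psi \in \mathbb{R}^d$ such that

(i) it holds that $L(\psi) = L(\vartheta)$,

(ii) it holds for all $t \in [0, \infty)$ that $\|\Theta_t - \vartheta\|_2 < \varepsilon$,

(iii) it holds for all $t \in [0, \infty)$ that $0 \le L(\Theta_t) - L(\psi) \le C^2 c^2 (\mathbb{1}_{\{0\}}(c) + C^2 c + c^{2\alpha} t)^{-1}$, and

(iv) it holds for all $t \in [0, \infty)$ that
\[
\begin{aligned}
\|\Theta_t - \psi\|_2 &\le \int_t^{\infty} \|G(\Theta_s)\|_2 \, ds \le C(1 - \alpha)^{-1} [L(\Theta_t) - L(\psi)]^{1 - \alpha} \\
&\le C^{3 - 2\alpha} c^{2 - 2\alpha} (1 - \alpha)^{-1} (\mathbb{1}_{\{0\}}(c) + C^2 c + c^{2\alpha} t)^{\alpha - 1}.
\end{aligned}
\tag{9.127}
\]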

Proof of Proposition 9.11.2. Throughout this proof, let L : [0, ∞) → R satisfy for all t ∈
[0, ∞) that
L(t) = L(Θt ) − L(ϑ), (9.128)


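let $B \subseteq \mathbb{R}^d$ satisfy
\[ B = \{\theta \in \mathbb{R}^d : \|\theta - \vartheta\|_2 < \varepsilon\}, \tag{9.129} \]
let $T \in [0, \infty]$ satisfy
\[ T = \inf(\{t \in [0, \infty) : \Theta_t \notin B\} \cup \{\infty\}), \tag{9.130} \]
let $\tau \in [0, T]$ satisfy
\[ \tau = \inf(\{t \in [0, T) : L(t) = 0\} \cup \{T\}), \tag{9.131} \]
let $g = (g_t)_{t \in [0, \infty)} \colon [0, \infty) \to [0, \infty]$ satisfy for all $t \in [0, \infty)$ that $g_t = \int_t^{\infty} \|G(\Theta_s)\|_2 \, ds$, and let $D \in \mathbb{R}$ satisfy $D = C^2 c^{(2 - 2\alpha)}$. In the first step of our proof of items (i), (ii), (iii), and (iv) we show that for all $t \in [0, \infty)$ it holds that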

Θt ∈ B. (9.132)

For this we observe that (9.126), the triangle inequality, and the assumption that for all $t \in [0, \infty)$ it holds that $\Theta_t = \Theta_0 - \int_0^t G(\Theta_s) \, ds$ imply that for all $t \in [0, \infty)$ it holds that
\[
\begin{aligned}
\|\Theta_t - \vartheta\|_2 &\le \|\Theta_t - \Theta_0\|_2 + \|\Theta_0 - \vartheta\|_2 \le \biggl\|\int_0^t G(\Theta_s) \, ds\biggr\|_2 + \|\Theta_0 - \vartheta\|_2 \\
&\le \int_0^t \|G(\Theta_s)\|_2 \, ds + \|\Theta_0 - \vartheta\|_2 < \int_0^t \|G(\Theta_s)\|_2 \, ds - C(1 - \alpha)^{-1} |L(\Theta_0) - L(\vartheta)|^{1 - \alpha} + \varepsilon.
\end{aligned}
\tag{9.133}
\]
To establish (9.132), it is thus sufficient to prove that $\int_0^T \|G(\Theta_s)\|_2 \, ds \le C(1 - \alpha)^{-1} |L(\Theta_0) - L(\vartheta)|^{1 - \alpha}$. We will accomplish this by employing an appropriate differential inequality for a fractional power of the function L in (9.128) (see (9.138) below for details). For this we need several technical preparations. More formally, note that (9.128) and the assumption that for all $t \in [0, \infty)$ it holds that
\[ L(\Theta_t) = L(\Theta_0) - \int_0^t \|G(\Theta_s)\|_2^2 \, ds \tag{9.134} \]
demonstrate that for almost all $t \in [0, \infty)$ it holds that L is differentiable at $t$ and satisfies
\[ L'(t) = \tfrac{d}{dt}(L(\Theta_t)) = -\|G(\Theta_t)\|_2^2. \tag{9.135} \]

Furthermore, observe that the assumption that inf t∈{s∈[0,∞) : ∀ r∈[0,s] : ∥Θr −ϑ∥2 <ε} L(Θt ) ≥ L(ϑ)
ensures that for all t ∈ [0, T ) it holds that

L(t) ≥ 0. (9.136)

Combining this with (9.126), (9.128), and (9.131) establishes that for all t ∈ [0, τ ) it holds
that
0 < [L(t)]α = |L(Θt ) − L(ϑ)|α ≤ C∥G(Θt )∥2 . (9.137)


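The chain rule and (9.135) hence prove that for almost all $t \in [0, \tau)$ it holds that
\[
\begin{aligned}
\tfrac{d}{dt}([L(t)]^{1 - \alpha}) &= (1 - \alpha)[L(t)]^{-\alpha} (-\|G(\Theta_t)\|_2^2) \\
&\le -(1 - \alpha) C^{-1} \|G(\Theta_t)\|_2^{-1} \|G(\Theta_t)\|_2^2 = -C^{-1}(1 - \alpha)\|G(\Theta_t)\|_2.
\end{aligned}
\tag{9.138}
\]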

Moreover, note that (9.134) shows that $[0, \infty) \ni t \mapsto L(t) \in \mathbb{R}$ is absolutely continuous. This and the fact that for all $r \in (0, \infty)$ it holds that $[r, \infty) \ni y \mapsto y^{1 - \alpha} \in \mathbb{R}$ is Lipschitz continuous imply that for all $t \in [0, \tau)$ it holds that $[0, t] \ni s \mapsto [L(s)]^{1 - \alpha} \in \mathbb{R}$ is absolutely continuous. Combining this with (9.138) demonstrates that for all $s, t \in [0, \tau)$ with $s \le t$ it holds that
\[ \int_s^t \|G(\Theta_u)\|_2 \, du \le -C(1 - \alpha)^{-1} ([L(t)]^{1 - \alpha} - [L(s)]^{1 - \alpha}) \le C(1 - \alpha)^{-1} [L(s)]^{1 - \alpha}. \tag{9.139} \]
In the next step we observe that (9.134) ensures that $[0, \infty) \ni t \mapsto L(\Theta_t) \in \mathbb{R}$ is non-increasing. This and (9.128) establish that L is non-increasing. Combining (9.131) and (9.136) therefore proves that for all $t \in [\tau, T)$ it holds that $L(t) = 0$. Hence, we obtain that for all $t \in (\tau, T)$ it holds that
\[ L'(t) = 0. \tag{9.140} \]
This and (9.135) show that for almost all t ∈ (τ, T ) it holds that

G(Θt ) = 0. (9.141)

Combining this with (9.139) implies that for all $s, t \in [0, T)$ with $s \le t$ it holds that
\[ \int_s^t \|G(\Theta_u)\|_2 \, du \le C(1 - \alpha)^{-1} [L(s)]^{1 - \alpha}. \tag{9.142} \]
Therefore, we obtain that for all $t \in [0, T)$ it holds that
\[ \int_0^t \|G(\Theta_u)\|_2 \, du \le C(1 - \alpha)^{-1} [L(0)]^{1 - \alpha}. \tag{9.143} \]
In addition, note that (9.126) demonstrates that $\Theta_0 \in B$. Combining this with (9.130) ensures that $T > 0$. This, (9.143), and (9.126) establish that
\[ \int_0^T \|G(\Theta_u)\|_2 \, du \le C(1 - \alpha)^{-1} [L(0)]^{1 - \alpha} < \varepsilon < \infty. \tag{9.144} \]

Combining (9.130) and (9.133) hence proves that

T = ∞. (9.145)

This establishes (9.132). In the next step of our proof of items (i), (ii), (iii), and (iv) we verify that $\Theta_t \in \mathbb{R}^d$, $t \in [0, \infty)$, is convergent (see (9.147) below). For this observe that the assumption that for all $t \in [0, \infty)$ it holds that $\Theta_t = \Theta_0 - \int_0^t G(\Theta_s) \, ds$ shows that for all $r, s, t \in [0, \infty)$ with $r \le s \le t$ it holds that
\[ \|\Theta_t - \Theta_s\|_2 = \biggl\|\int_s^t G(\Theta_u) \, du\biggr\|_2 \le \int_s^t \|G(\Theta_u)\|_2 \, du \le \int_r^{\infty} \|G(\Theta_u)\|_2 \, du = g_r. \tag{9.146} \]

Next note that (9.144) and (9.145) imply that $\infty > g_0 \ge \limsup_{r \to \infty} g_r = 0$. Combining this with (9.146) demonstrates that there exists $\psi \in \mathbb{R}^d$ which satisfies
\[ \limsup_{t \to \infty} \|\Theta_t - \psi\|_2 = 0. \tag{9.147} \]

In the next step of our proof of items (i), (ii), (iii), and (iv) we show that L(Θt ), t ∈ [0, ∞),
converges to L(ψ) with convergence order 1. We accomplish this by bringing a suitable
differential inequality for the reciprocal of the function L in (9.128) into play (see (9.150)
below for details). More specifically, observe that (9.135), (9.145), (9.130), and (9.126)
ensure that for almost all t ∈ [0, ∞) it holds that

L′ (t) = −∥G(Θt )∥22 ≤ −C−2 [L(t)]2α . (9.148)

Hence, we obtain that L is non-increasing. This proves that for all t ∈ [0, ∞) it holds that
L(t) ≤ L(0). This and the fact that for all t ∈ [0, τ ) it holds that L(t) > 0 establish that
for almost all t ∈ [0, τ ) it holds that

L′ (t) ≤ −C−2 [L(t)](2α−2) [L(t)]2 ≤ −C−2 [L(0)](2α−2) [L(t)]2 = −D−1 [L(t)]2 . (9.149)

Therefore, we obtain that for almost all $t \in [0, \tau)$ it holds that
\[ \frac{d}{dt}\biggl(\frac{D}{L(t)}\biggr) = -\biggl(\frac{D\,L'(t)}{[L(t)]^2}\biggr) \ge 1. \tag{9.150} \]
Furthermore, note that the fact that for all $t \in [0, \tau)$ it holds that $[0, t] \ni s \mapsto L(s) \in (0, \infty)$ is absolutely continuous shows that for all $t \in [0, \tau)$ it holds that $[0, t] \ni s \mapsto D[L(s)]^{-1} \in (0, \infty)$ is absolutely continuous. This and (9.150) imply that for all $t \in [0, \tau)$ it holds that
\[ \frac{D}{L(t)} - \frac{D}{L(0)} \ge t. \tag{9.151} \]
Hence, we obtain that for all $t \in [0, \tau)$ it holds that
\[ \frac{D}{L(t)} \ge \frac{D}{L(0)} + t. \tag{9.152} \]
Therefore, we obtain that for all $t \in [0, \tau)$ it holds that
\[ D\biggl(\frac{D}{L(0)} + t\biggr)^{-1} \ge L(t). \tag{9.153} \]


This demonstrates that for all t ∈ [0, τ ) it holds that

L(t) ≤ D (D[L(0)]−1 + t)−1 = C2 c2−2α (C2 c1−2α + t)−1 = C2 c2 (C2 c + c2α t)−1 . (9.154)

The fact that for all t ∈ [τ, ∞) it holds that L(t) = 0 and (9.131) hence ensure that for all
t ∈ [0, ∞) it holds that
0 ≤ L(t) ≤ C2 c2 (1{0} (c) + C2 c + c2α t)−1 . (9.155)
Moreover, observe that (9.147) and the assumption that L ∈ C(Rd , R) prove that
lim supt→∞ |L(Θt ) − L(ψ)| = 0. Combining this with (9.155) establishes that L(ψ) = L(ϑ).
This and (9.155) show that for all t ∈ [0, ∞) it holds that
0 ≤ L(Θt ) − L(ψ) ≤ C2 c2 (1{0} (c) + C2 c + c2α t)−1 . (9.156)
In the final step of our proof of items (i), (ii), (iii), and (iv) we establish convergence rates
for the real numbers ∥Θt − ψ∥2 , t ∈ [0, ∞). Note that (9.147), (9.146), and (9.142) imply
that for all t ∈ [0, ∞) it holds that
∥Θt −ψ∥2 = ∥Θt − [lims→∞ Θs ]∥2 = lims→∞ ∥Θt −Θs ∥2 ≤ gt ≤ C(1−α)−1 [L(t)]1−α . (9.157)
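This and (9.156) demonstrate that for all $t \in [0, \infty)$ it holds that
\[
\begin{aligned}
\|\Theta_t - \psi\|_2 \le g_t &\le C(1 - \alpha)^{-1} [L(\Theta_t) - L(\psi)]^{1 - \alpha} \\
&\le C(1 - \alpha)^{-1} \bigl[C^2 c^2 (\mathbb{1}_{\{0\}}(c) + C^2 c + c^{2\alpha} t)^{-1}\bigr]^{1 - \alpha} \\
&= C^{3 - 2\alpha} c^{2 - 2\alpha} (1 - \alpha)^{-1} (\mathbb{1}_{\{0\}}(c) + C^2 c + c^{2\alpha} t)^{\alpha - 1}.
\end{aligned}
\tag{9.158}
\]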


Combining this with (9.132) and (9.156) proves items (i), (ii), (iii), and (iv). The proof of
Proposition 9.11.2 is thus complete.
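The rates in items (iii) and (iv) of Proposition 9.11.2 can be compared with an explicitly solvable gradient flow. The sketch below assumes $d = 1$, $L(\theta) = \theta^4$, and $\vartheta = \psi = 0$, so that $G(\theta) = 4\theta^3$, the KL inequality $|L(\theta)|^{3/4} \le \frac{1}{4}|G(\theta)|$ holds globally with $\alpha = \frac{3}{4}$ and $C = \frac{1}{4}$, and the flow is $\Theta_t = \Theta_0 (1 + 8\Theta_0^2 t)^{-1/2}$. The initial value and the function names are illustrative.

```python
def bound_loss(C, c, alpha, t):
    """Right-hand side of item (iii): C^2 c^2 (1_{0}(c) + C^2 c + c^(2 alpha) t)^(-1)."""
    indicator = 1.0 if c == 0 else 0.0
    return C ** 2 * c ** 2 / (indicator + C ** 2 * c + c ** (2 * alpha) * t)


def bound_dist(C, c, alpha, t):
    """Last expression in (9.127)."""
    indicator = 1.0 if c == 0 else 0.0
    return (C ** (3 - 2 * alpha) * c ** (2 - 2 * alpha) / (1 - alpha)
            * (indicator + C ** 2 * c + c ** (2 * alpha) * t) ** (alpha - 1))


theta0, C, alpha = 0.8, 0.25, 0.75  # illustrative initial value; KL constants of theta^4
c = theta0 ** 4                     # c = |L(Theta_0) - L(0)|


def theta(t):
    """Closed-form solution of Theta' = -4 Theta^3 with Theta(0) = theta0."""
    return theta0 / (1.0 + 8.0 * theta0 ** 2 * t) ** 0.5


times = [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]
loss_ok = all(theta(t) ** 4 <= bound_loss(C, c, alpha, t) + 1e-12 for t in times)
dist_ok = all(abs(theta(t)) <= bound_dist(C, c, alpha, t) + 1e-12 for t in times)
print(loss_ok, dist_ok)
```

In this example the bound in item (iii) simplifies to $\Theta_0^4 (1 + 16\Theta_0^2 t)^{-1}$ and the bound in (9.127) to $|\Theta_0| (1 + 16\Theta_0^2 t)^{-1/4}$, while the true quantities decay like $t^{-2}$ and $t^{-1/2}$. The abstract rates are therefore valid but not sharp for this particular function.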
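Corollary 9.11.3. Let $d \in \mathbb{N}$, $\vartheta \in \mathbb{R}^d$, $c \in [0, 1]$, $C, \varepsilon \in (0, \infty)$, $\alpha \in (0, 1)$, $\Theta \in C([0, \infty), \mathbb{R}^d)$, $L \in C(\mathbb{R}^d, \mathbb{R})$, let $G \colon \mathbb{R}^d \to \mathbb{R}^d$ be $\mathcal{B}(\mathbb{R}^d)/\mathcal{B}(\mathbb{R}^d)$-measurable, assume for all $t \in [0, \infty)$ that
\[ L(\Theta_t) = L(\Theta_0) - \int_0^t \|G(\Theta_s)\|_2^2 \, ds \qquad\text{and}\qquad \Theta_t = \Theta_0 - \int_0^t G(\Theta_s) \, ds, \tag{9.159} \]
and assume for all $\theta \in \mathbb{R}^d$ with $\|\theta - \vartheta\|_2 < \varepsilon$ that
\[ |L(\theta) - L(\vartheta)|^{\alpha} \le C\|G(\theta)\|_2, \qquad c = |L(\Theta_0) - L(\vartheta)|, \qquad C(1 - \alpha)^{-1} c^{1 - \alpha} + \|\Theta_0 - \vartheta\|_2 < \varepsilon, \tag{9.160} \]
and $\inf_{t \in \{s \in [0, \infty) \colon \forall\, r \in [0, s] \colon \|\Theta_r - \vartheta\|_2 < \varepsilon\}} L(\Theta_t) \ge L(\vartheta)$ (cf. Definition 3.3.4). Then there exists $\psi \in \mathbb{R}^d$ such that for all $t \in [0, \infty)$ it holds that $L(\psi) = L(\vartheta)$, $\|\Theta_t - \vartheta\|_2 < \varepsilon$, $0 \le L(\Theta_t) - L(\psi) \le (1 + C^{-2} t)^{-1}$, and
\[ \|\Theta_t - \psi\|_2 \le \int_t^{\infty} \|G(\Theta_s)\|_2 \, ds \le C(1 - \alpha)^{-1} (1 + C^{-2} t)^{\alpha - 1}. \tag{9.161} \]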


Proof of Corollary 9.11.3. Observe that Proposition 9.11.2 ensures that there exists ψ ∈ Rd
which satisfies that
(i) it holds that L(ψ) = L(ϑ),
(ii) it holds for all t ∈ [0, ∞) that ∥Θt − ϑ∥2 < ε,
(iii) it holds for all t ∈ [0, ∞) that 0 ≤ L(Θt ) − L(ψ) ≤ C2 c2 (1{0} (c) + C2 c + c2α t)−1 , and
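(iv) it holds for all $t \in [0, \infty)$ that
\[
\begin{aligned}
\|\Theta_t - \psi\|_2 &\le \int_t^{\infty} \|G(\Theta_s)\|_2 \, ds \le C(1 - \alpha)^{-1} [L(\Theta_t) - L(\psi)]^{1 - \alpha} \\
&\le C^{3 - 2\alpha} c^{2 - 2\alpha} (1 - \alpha)^{-1} (\mathbb{1}_{\{0\}}(c) + C^2 c + c^{2\alpha} t)^{\alpha - 1}.
\end{aligned}
\tag{9.162}
\]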

Note that item (iii) and the assumption that c ≤ 1 establish that for all t ∈ [0, ∞) it holds
that
0 ≤ L(Θt ) − L(ψ) ≤ c2 (C−2 1{0} (c) + c + C−2 c2α t)−1 ≤ (1 + C−2 t)−1 . (9.163)
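This and item (iv) show that for all $t \in [0, \infty)$ it holds that
\[
\begin{aligned}
\|\Theta_t - \psi\|_2 &\le \int_t^{\infty} \|G(\Theta_s)\|_2 \, ds \le C(1 - \alpha)^{-1} [L(\Theta_t) - L(\psi)]^{1 - \alpha} \\
&\le C(1 - \alpha)^{-1} (1 + C^{-2} t)^{\alpha - 1}.
\end{aligned}
\tag{9.164}
\]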
Combining this with item (i), item (ii), and (9.163) proves (9.161). The proof of Corol-
lary 9.11.3 is thus complete.

9.11.2 Abstract global convergence results for GF processes


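Proposition 9.11.4. Let $d \in \mathbb{N}$, $\Theta \in C([0, \infty), \mathbb{R}^d)$, $L \in C(\mathbb{R}^d, \mathbb{R})$, let $G \colon \mathbb{R}^d \to \mathbb{R}^d$ be $\mathcal{B}(\mathbb{R}^d)/\mathcal{B}(\mathbb{R}^d)$-measurable, assume that for all $\vartheta \in \mathbb{R}^d$ there exist $\varepsilon, C \in (0, \infty)$, $\alpha \in (0, 1)$ such that for all $\theta \in \mathbb{R}^d$ with $\|\theta - \vartheta\|_2 < \varepsilon$ it holds that
\[ |L(\theta) - L(\vartheta)|^{\alpha} \le C\|G(\theta)\|_2, \tag{9.165} \]
assume for all $t \in [0, \infty)$ that
\[ L(\Theta_t) = L(\Theta_0) - \int_0^t \|G(\Theta_s)\|_2^2 \, ds \qquad\text{and}\qquad \Theta_t = \Theta_0 - \int_0^t G(\Theta_s) \, ds, \tag{9.166} \]
and assume $\liminf_{t \to \infty} \|\Theta_t\|_2 < \infty$. Then there exist $\vartheta \in \mathbb{R}^d$, $\mathscr{C}, \tau, \beta \in (0, \infty)$ such that for all $t \in [\tau, \infty)$ it holds that
\[ \|\Theta_t - \vartheta\|_2 \le \bigl(1 + \mathscr{C}(t - \tau)\bigr)^{-\beta} \qquad\text{and}\qquad 0 \le L(\Theta_t) - L(\vartheta) \le \bigl(1 + \mathscr{C}(t - \tau)\bigr)^{-1}. \tag{9.167} \]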


Proof of Proposition 9.11.4. Observe that (9.166) implies that $[0, \infty) \ni t \mapsto L(\Theta_t) \in \mathbb{R}$ is non-increasing. Therefore, we obtain that there exists $m \in [-\infty, \infty)$ which satisfies

m = lim supt→∞ L(Θt ) = lim inf t→∞ L(Θt ) = inf t∈[0,∞) L(Θt ). (9.168)

Furthermore, note that the assumption that lim inf t→∞ ∥Θt ∥2 < ∞ demonstrates that there
exist ϑ ∈ Rd and δ = (δn )n∈N : N → [0, ∞) which satisfy

lim inf n→∞ δn = ∞ and lim supn→∞ ∥Θδn − ϑ∥2 = 0. (9.169)

Observe that (9.168), (9.169), and the fact that L is continuous ensure that

L(ϑ) = m ∈ R and ∀ t ∈ [0, ∞) : L(Θt ) ≥ L(ϑ). (9.170)

Next let ε, C ∈ (0, ∞), α ∈ (0, 1) satisfy for all θ ∈ Rd with ∥θ − ϑ∥2 < ε that

|L(θ) − L(ϑ)|α ≤ C∥G(θ)∥2 . (9.171)

Note that (9.169) and the fact that L is continuous demonstrate that there exist n ∈ N,
c ∈ [0, 1] which satisfy

c = |L(Θδn ) − L(ϑ)| and C(1 − α)−1 c1−α + ∥Θδn − ϑ∥2 < ε. (9.172)

Next let Φ : [0, ∞) → Rd satisfy for all t ∈ [0, ∞) that

Φt = Θδn +t . (9.173)

Observe that (9.166), (9.170), and (9.173) establish that for all $t \in [0, \infty)$ it holds that
\[ L(\Phi_t) = L(\Phi_0) - \int_0^t \|G(\Phi_s)\|_2^2 \, ds, \qquad \Phi_t = \Phi_0 - \int_0^t G(\Phi_s) \, ds, \qquad\text{and}\qquad L(\Phi_t) \ge L(\vartheta). \tag{9.174} \]
Combining this with (9.171), (9.172), (9.173), and Corollary 9.11.3 (applied with Θ ↶ Φ in
the notation of Corollary 9.11.3) establishes that there exists ψ ∈ Rd which satisfies for all
t ∈ [0, ∞) that

0 ≤ L(Φt ) − L(ψ) ≤ (1 + C−2 t)−1 , ∥Φt − ψ∥2 ≤ C(1 − α)−1 (1 + C−2 t)α−1 , (9.175)

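and $L(\psi) = L(\vartheta)$. Note that (9.173) and (9.175) show for all $t \in [0, \infty)$ that $0 \le L(\Theta_{\delta_n + t}) - L(\psi) \le (1 + C^{-2} t)^{-1}$ and $\|\Theta_{\delta_n + t} - \psi\|_2 \le C(1 - \alpha)^{-1} (1 + C^{-2} t)^{\alpha - 1}$. Hence, we obtain for all $\tau \in [\delta_n, \infty)$, $t \in [\tau, \infty)$ that
\[
\begin{aligned}
0 \le L(\Theta_t) - L(\psi) &\le (1 + C^{-2}(t - \delta_n))^{-1} = (1 + C^{-2}(t - \tau) + C^{-2}(\tau - \delta_n))^{-1} \\
&\le (1 + C^{-2}(t - \tau))^{-1}
\end{aligned}
\tag{9.176}
\]
and
\[
\begin{aligned}
\|\Theta_t - \psi\|_2 &\le C(1 - \alpha)^{-1} (1 + C^{-2}(t - \delta_n))^{\alpha - 1} \\
&= \Bigl[ \bigl[C(1 - \alpha)^{-1}\bigr]^{\frac{1}{\alpha - 1}} (1 + C^{-2}(t - \delta_n)) \Bigr]^{\alpha - 1} \\
&= \Bigl[ \bigl[C(1 - \alpha)^{-1}\bigr]^{\frac{1}{\alpha - 1}} \bigl(1 + C^{-2}(\tau - \delta_n)\bigr) + \Bigl[\bigl[C(1 - \alpha)^{-1}\bigr]^{\frac{1}{1 - \alpha}} C^2\Bigr]^{-1} (t - \tau) \Bigr]^{\alpha - 1}.
\end{aligned}
\tag{9.177}
\]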
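Next let $\mathscr{C}, \tau \in (0, \infty)$ satisfy
\[ \mathscr{C} = \max\Bigl\{ C^2, \bigl[C(1 - \alpha)^{-1}\bigr]^{\frac{1}{1 - \alpha}} C^2 \Bigr\} \qquad\text{and}\qquad \tau = \delta_n + C^2 \bigl[C(1 - \alpha)^{-1}\bigr]^{\frac{1}{1 - \alpha}}. \tag{9.178} \]
Observe that (9.176), (9.177), and (9.178) demonstrate for all $t \in [\tau, \infty)$ that
\[ 0 \le L(\Theta_t) - L(\psi) \le (1 + C^{-2}(t - \tau))^{-1} \le (1 + \mathscr{C}^{-1}(t - \tau))^{-1} \tag{9.179} \]
and
\[
\begin{aligned}
\|\Theta_t - \psi\|_2 &\le \Bigl[ \bigl[C(1 - \alpha)^{-1}\bigr]^{\frac{1}{\alpha - 1}} \bigl(1 + C^{-2}(\tau - \delta_n)\bigr) + \mathscr{C}^{-1}(t - \tau) \Bigr]^{\alpha - 1} \\
&= \Bigl[ \bigl[C(1 - \alpha)^{-1}\bigr]^{\frac{1}{\alpha - 1}} \Bigl(1 + \bigl[C(1 - \alpha)^{-1}\bigr]^{\frac{1}{1 - \alpha}}\Bigr) + \mathscr{C}^{-1}(t - \tau) \Bigr]^{\alpha - 1} \\
&\le \bigl[ 1 + \mathscr{C}^{-1}(t - \tau) \bigr]^{\alpha - 1}.
\end{aligned}
\tag{9.180}
\]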

The proof of Proposition 9.11.4 is thus complete.

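Corollary 9.11.5. Let $d \in \mathbb{N}$, $\Theta \in C([0, \infty), \mathbb{R}^d)$, $L \in C(\mathbb{R}^d, \mathbb{R})$, let $G \colon \mathbb{R}^d \to \mathbb{R}^d$ be $\mathcal{B}(\mathbb{R}^d)/\mathcal{B}(\mathbb{R}^d)$-measurable, assume that for all $\vartheta \in \mathbb{R}^d$ there exist $\varepsilon, C \in (0, \infty)$, $\alpha \in (0, 1)$ such that for all $\theta \in \mathbb{R}^d$ with $\|\theta - \vartheta\|_2 < \varepsilon$ it holds that
\[ |L(\theta) - L(\vartheta)|^{\alpha} \le C\|G(\theta)\|_2, \tag{9.181} \]
assume for all $t \in [0, \infty)$ that
\[ L(\Theta_t) = L(\Theta_0) - \int_0^t \|G(\Theta_s)\|_2^2 \, ds \qquad\text{and}\qquad \Theta_t = \Theta_0 - \int_0^t G(\Theta_s) \, ds, \tag{9.182} \]
and assume $\liminf_{t \to \infty} \|\Theta_t\|_2 < \infty$ (cf. Definition 3.3.4). Then there exist $\vartheta \in \mathbb{R}^d$, $C, \beta \in (0, \infty)$ which satisfy for all $t \in [0, \infty)$ that
\[ \|\Theta_t - \vartheta\|_2 \le C(1 + t)^{-\beta} \qquad\text{and}\qquad 0 \le L(\Theta_t) - L(\vartheta) \le C(1 + t)^{-1}. \tag{9.183} \]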


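Proof of Corollary 9.11.5. Note that Proposition 9.11.4 demonstrates that there exist $\vartheta \in \mathbb{R}^d$, $\mathcal{C}, \tau, \beta \in (0,\infty)$ which satisfy for all t ∈ [τ, ∞) that
\[
  \|\Theta_t - \vartheta\|_2 \leq \bigl(1 + \mathcal{C}(t-\tau)\bigr)^{-\beta}
  \qquad \text{and} \qquad
  0 \leq \mathcal{L}(\Theta_t) - \mathcal{L}(\vartheta) \leq \bigl(1 + \mathcal{C}(t-\tau)\bigr)^{-1}. \tag{9.184}
\]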

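In the following let $C \in (0,\infty)$ satisfy

\[
  C = \max\Bigl\{ 1+\tau, \, (1+\tau)^{\beta}, \, \mathcal{C}^{-1}, \, \mathcal{C}^{-\beta}, \,
  (1+\tau)^{\beta} \sup\nolimits_{s \in [0,\tau]} \|\Theta_s - \vartheta\|_2, \,
  (1+\tau)\bigl(\mathcal{L}(\Theta_0) - \mathcal{L}(\vartheta)\bigr) \Bigr\}. \tag{9.185}
\]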
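Observe that (9.184), (9.185), and the fact that $[0,\infty) \ni t \mapsto \mathcal{L}(\Theta_t) \in \mathbb{R}$ is non-increasing
prove for all t ∈ [0, τ ] that

\[
  \|\Theta_t - \vartheta\|_2 \leq \sup\nolimits_{s \in [0,\tau]} \|\Theta_s - \vartheta\|_2 \leq C(1+\tau)^{-\beta} \leq C(1+t)^{-\beta} \tag{9.186}
\]

and
\[
  0 \leq \mathcal{L}(\Theta_t) - \mathcal{L}(\vartheta) \leq \mathcal{L}(\Theta_0) - \mathcal{L}(\vartheta) \leq C(1+\tau)^{-1} \leq C(1+t)^{-1}. \tag{9.187}
\]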
Furthermore, note that (9.184) and (9.185) imply for all t ∈ [τ, ∞) that
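\[
\begin{split}
  \|\Theta_t - \vartheta\|_2
  &\leq \bigl(1 + \mathcal{C}(t-\tau)\bigr)^{-\beta}
   = C \bigl[ C^{1/\beta} + C^{1/\beta}\, \mathcal{C}(t-\tau) \bigr]^{-\beta} \\
  &\leq C \bigl[ C^{1/\beta} + t - \tau \bigr]^{-\beta} \leq C(1+t)^{-\beta}.
\end{split} \tag{9.188}
\]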

Moreover, observe that (9.184) and (9.185) demonstrate for all t ∈ [τ, ∞) that
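\[
  0 \leq \mathcal{L}(\Theta_t) - \mathcal{L}(\vartheta)
  \leq C \bigl[ C + C\, \mathcal{C}(t-\tau) \bigr]^{-1}
  \leq C \bigl[ C - \tau + t \bigr]^{-1}
  \leq C(1+t)^{-1}. \tag{9.189}
\]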

The proof of Corollary 9.11.5 is thus complete.

Corollary 9.11.6. Let d ∈ N, Θ ∈ C([0, ∞), Rd ), L ∈ C 1 (Rd , R), assume that for all
ϑ ∈ Rd there exist ε, C ∈ (0, ∞), α ∈ (0, 1) such that for all θ ∈ Rd with ∥θ − ϑ∥2 < ε it
holds that
|L(θ) − L(ϑ)|α ≤ C∥(∇L)(θ)∥2 , (9.190)
assume for all t ∈ [0, ∞) that
\[
  \Theta_t = \Theta_0 - \int_0^t (\nabla\mathcal{L})(\Theta_s) \, ds, \tag{9.191}
\]

and assume lim inf t→∞ ∥Θt ∥2 < ∞ (cf. Definition 3.3.4). Then there exist ϑ ∈ Rd , C, β ∈
(0, ∞) which satisfy for all t ∈ [0, ∞) that

∥Θt −ϑ∥2 ≤ C(1+t)−β , 0 ≤ L(Θt )−L(ϑ) ≤ C(1+t)−1 , and (∇L)(ϑ) = 0. (9.192)


Proof of Corollary 9.11.6. Note that Lemma 9.11.1 demonstrates that for all t ∈ [0, ∞) it
holds that
\[
  \mathcal{L}(\Theta_t) = \mathcal{L}(\Theta_0) - \int_0^t \|(\nabla\mathcal{L})(\Theta_s)\|_2^2 \, ds. \tag{9.193}
\]

Corollary 9.11.5 therefore establishes that there exist ϑ ∈ Rd , C, β ∈ (0, ∞) which satisfy
for all t ∈ [0, ∞) that

∥Θt − ϑ∥2 ≤ C(1 + t)−β and 0 ≤ L(Θt ) − L(ϑ) ≤ C(1 + t)−1 . (9.194)

This ensures that


\[
  \limsup_{t \to \infty} \|\Theta_t - \vartheta\|_2 = 0. \tag{9.195}
\]

Combining this with the assumption that L ∈ C 1 (Rd , R) establishes that

\[
  \limsup_{t \to \infty} \|(\nabla\mathcal{L})(\Theta_t) - (\nabla\mathcal{L})(\vartheta)\|_2 = 0. \tag{9.196}
\]

Hence, we obtain that

\[
  \limsup_{t \to \infty} \bigl| \|(\nabla\mathcal{L})(\Theta_t)\|_2 - \|(\nabla\mathcal{L})(\vartheta)\|_2 \bigr| = 0. \tag{9.197}
\]

Furthermore, observe that (9.193) and (9.194) ensure that


\[
  \int_0^{\infty} \|(\nabla\mathcal{L})(\Theta_s)\|_2^2 \, ds < \infty. \tag{9.198}
\]

This and (9.197) demonstrate that

(∇L)(ϑ) = 0. (9.199)

Combining this with (9.194) establishes (9.192). The proof of Corollary 9.11.6 is thus
complete.

Corollary 9.11.7. Let d ∈ N, Θ ∈ C([0, ∞), Rd ), let L : Rd → R be analytic, assume for


all t ∈ [0, ∞) that
\[
  \Theta_t = \Theta_0 - \int_0^t (\nabla\mathcal{L})(\Theta_s) \, ds, \tag{9.200}
\]

and assume lim inf t→∞ ∥Θt ∥2 < ∞ (cf. Definitions 3.3.4 and 9.7.1). Then there exist ϑ ∈ Rd ,
C, β ∈ (0, ∞) which satisfy for all t ∈ [0, ∞) that

∥Θt −ϑ∥2 ≤ C(1+t)−β , 0 ≤ L(Θt )−L(ϑ) ≤ C(1+t)−1 , and (∇L)(ϑ) = 0. (9.201)


Proof of Corollary 9.11.7. Note that Theorem 9.9.1 shows that for all ϑ ∈ Rd there exist
ε, C ∈ (0, ∞), α ∈ (0, 1) such that for all θ ∈ Rd with ∥θ − ϑ∥2 < ε it holds that
|L(θ) − L(ϑ)|α ≤ C∥(∇L)(θ)∥2 . (9.202)
Corollary 9.11.6 therefore establishes (9.201). The proof of Corollary 9.11.7 is thus complete.
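
The rates in (9.201) can be checked by hand on a one-dimensional toy problem. The following minimal sketch assumes the analytic objective θ ↦ θ^4 (chosen here purely for illustration, it does not appear in the text), for which the GF ODE has the closed-form solution Θt = Θ0 (1 + 8 Θ0^2 t)^{-1/2}. The constants `C = 1` and `beta = 0.5` are likewise illustrative choices, not the constants produced by the proof.

```python
import math


def loss(theta: float) -> float:
    """Analytic toy objective with unique critical point 0 (an assumption of this sketch)."""
    return theta ** 4


def grad(theta: float) -> float:
    """Derivative of the toy objective."""
    return 4.0 * theta ** 3


def gf_solution(theta0: float, t: float) -> float:
    """Closed-form solution of Theta'(t) = -grad(Theta(t)) with Theta(0) = theta0."""
    return theta0 / math.sqrt(1.0 + 8.0 * theta0 ** 2 * t)


theta0 = 1.0            # illustrative initial value
C, beta = 1.0, 0.5      # illustrative constants for the rate check
times = [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]

# Check |Theta_t - 0| <= C (1 + t)^(-beta) and 0 <= loss(Theta_t) - loss(0) <= C (1 + t)^(-1).
rates_hold = all(
    abs(gf_solution(theta0, t)) <= C * (1.0 + t) ** (-beta)
    and 0.0 <= loss(gf_solution(theta0, t)) <= C / (1.0 + t)
    for t in times
)

# Central finite differences confirm that the closed form solves the GF ODE.
h = 1e-6
ode_residual = max(
    abs((gf_solution(theta0, t + h) - gf_solution(theta0, t - h)) / (2.0 * h)
        + grad(gf_solution(theta0, t)))
    for t in times[1:]
)
```

In this example the limit point is ϑ = 0, which is a critical point of the toy objective, in line with the third statement in (9.201).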

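Exercise 9.11.1. Prove or disprove the following statement: For all $d \in \mathbb{N}$, $L \in (0,\infty)$,
$\gamma \in [0, L^{-1}]$, all open and convex sets $U \subseteq \mathbb{R}^d$, and all $\mathcal{L} \in C^1(U,\mathbb{R})$, $x \in U$ with
$x - \gamma(\nabla\mathcal{L})(x) \in U$ and $\forall\, v, w \in U \colon \|(\nabla\mathcal{L})(v) - (\nabla\mathcal{L})(w)\|_2 \leq L\|v-w\|_2$ it holds that
\[
  \mathcal{L}(x - \gamma(\nabla\mathcal{L})(x)) \leq \mathcal{L}(x) - \tfrac{\gamma}{2} \|(\nabla\mathcal{L})(x)\|_2^2 \tag{9.203}
\]
(cf. Definition 3.3.4).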

9.12 Convergence analysis for GD processes


9.12.1 One-step descent property for GD processes
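Lemma 9.12.1. Let $d \in \mathbb{N}$, $L \in \mathbb{R}$, let $U \subseteq \mathbb{R}^d$ be open and convex, let $\mathcal{L} \in C^1(U,\mathbb{R})$, and
assume for all x, y ∈ U that
\[
  \|(\nabla\mathcal{L})(x) - (\nabla\mathcal{L})(y)\|_2 \leq L\|x-y\|_2 \tag{9.204}
\]
(cf. Definition 3.3.4). Then it holds for all x, y ∈ U that
\[
  \mathcal{L}(y) \leq \mathcal{L}(x) + \langle (\nabla\mathcal{L})(x), y-x \rangle + \tfrac{L}{2}\|x-y\|_2^2 \tag{9.205}
\]
(cf. Definition 1.4.7).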
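Proof of Lemma 9.12.1. Observe that the fundamental theorem of calculus, the Cauchy-Schwarz inequality, and (9.204) prove that for all x, y ∈ U we have that
\[
\begin{split}
  \mathcal{L}(y) - \mathcal{L}(x)
  &= \bigl[ \mathcal{L}(x + r(y-x)) \bigr]_{r=0}^{r=1}
   = \int_0^1 \langle (\nabla\mathcal{L})(x + r(y-x)), y-x \rangle \, dr \\
  &= \langle (\nabla\mathcal{L})(x), y-x \rangle + \int_0^1 \langle (\nabla\mathcal{L})(x + r(y-x)) - (\nabla\mathcal{L})(x), y-x \rangle \, dr \\
  &\leq \langle (\nabla\mathcal{L})(x), y-x \rangle + \int_0^1 \bigl| \langle (\nabla\mathcal{L})(x + r(y-x)) - (\nabla\mathcal{L})(x), y-x \rangle \bigr| \, dr \\
  &\leq \langle (\nabla\mathcal{L})(x), y-x \rangle + \biggl[ \int_0^1 \|(\nabla\mathcal{L})(x + r(y-x)) - (\nabla\mathcal{L})(x)\|_2 \, dr \biggr] \|y-x\|_2 \\
  &\leq \langle (\nabla\mathcal{L})(x), y-x \rangle + L\|y-x\|_2 \biggl[ \int_0^1 \|r(y-x)\|_2 \, dr \biggr] \\
  &= \langle (\nabla\mathcal{L})(x), y-x \rangle + \tfrac{L}{2}\|x-y\|_2^2
\end{split} \tag{9.206}
\]
(cf. Definition 1.4.7). The proof of Lemma 9.12.1 is thus complete.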
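
The quadratic upper bound (9.205) can be probed numerically. The sketch below uses a hypothetical two-dimensional objective (softplus in the first coordinate, a quadratic in the second, plus a sine coupling term) that is not taken from the text; a Hessian estimate shows that its gradient is Lipschitz continuous with constant at most 3, which is the value assumed for `LIPSCHITZ`. Random sampling is of course only a sanity check, not a proof.

```python
import math
import random

# Assumed Lipschitz constant of the gradient: the Hessian is
# diag(s(1-s), 1) - sin(x1 + x2) * [[1, 1], [1, 1]] with s in (0, 1),
# so its spectral norm is at most 1 + 2 = 3.
LIPSCHITZ = 3.0


def loss(x):
    """Hypothetical smooth objective on R^2 used only for this illustration."""
    x1, x2 = x
    return math.log1p(math.exp(x1)) + 0.5 * x2 ** 2 + math.sin(x1 + x2)


def grad(x):
    """Gradient of the hypothetical objective."""
    x1, x2 = x
    c = math.cos(x1 + x2)
    return (1.0 / (1.0 + math.exp(-x1)) + c, x2 + c)


def descent_gap(x, y):
    """Right-hand side of (9.205) minus its left-hand side; non-negative if (9.205) holds."""
    g = grad(x)
    d = (y[0] - x[0], y[1] - x[1])
    upper = loss(x) + g[0] * d[0] + g[1] * d[1] + 0.5 * LIPSCHITZ * (d[0] ** 2 + d[1] ** 2)
    return upper - loss(y)


rng = random.Random(0)  # fixed seed so that the sampled points are reproducible
gaps = [
    descent_gap(
        (rng.uniform(-3.0, 3.0), rng.uniform(-3.0, 3.0)),
        (rng.uniform(-3.0, 3.0), rng.uniform(-3.0, 3.0)),
    )
    for _ in range(1000)
]
```

All sampled gaps should be non-negative up to floating-point rounding, which is what (9.205) predicts for this choice of the Lipschitz constant.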


Corollary 9.12.2. Let d ∈ N, L, γ ∈ R, let U ⊆ Rd be open and convex, let L ∈ C 1 (U, R),
and assume for all x, y ∈ U that
∥(∇L)(x) − (∇L)(y)∥2 ≤ L∥x − y∥2 (9.207)
(cf. Definition 3.3.4). Then it holds for all x ∈ U with x − γ(∇L)(x) ∈ U that
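\[
  \mathcal{L}(x - \gamma(\nabla\mathcal{L})(x)) \leq \mathcal{L}(x) + \gamma \bigl( \tfrac{L\gamma}{2} - 1 \bigr) \|(\nabla\mathcal{L})(x)\|_2^2. \tag{9.208}
\]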

Proof of Corollary 9.12.2. Observe that Lemma 9.12.1 ensures that for all x ∈ U with
x − γ(∇L)(x) ∈ U it holds that
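\[
\begin{split}
  \mathcal{L}(x - \gamma(\nabla\mathcal{L})(x))
  &\leq \mathcal{L}(x) + \langle (\nabla\mathcal{L})(x), -\gamma(\nabla\mathcal{L})(x) \rangle + \tfrac{L}{2}\|\gamma(\nabla\mathcal{L})(x)\|_2^2 \\
  &= \mathcal{L}(x) - \gamma\|(\nabla\mathcal{L})(x)\|_2^2 + \tfrac{L\gamma^2}{2}\|(\nabla\mathcal{L})(x)\|_2^2.
\end{split} \tag{9.209}
\]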
This establishes (9.208). The proof of Corollary 9.12.2 is thus complete.
Corollary 9.12.3. Let d ∈ N, L ∈ (0, ∞), γ ∈ [0, L−1 ], let U ⊆ Rd be open and convex, let
L ∈ C 1 (U, R), and assume for all x, y ∈ U that
∥(∇L)(x) − (∇L)(y)∥2 ≤ L∥x − y∥2 (9.210)
(cf. Definition 3.3.4). Then it holds for all x ∈ U with x − γ(∇L)(x) ∈ U that
\[
  \mathcal{L}(x - \gamma(\nabla\mathcal{L})(x)) \leq \mathcal{L}(x) - \tfrac{\gamma}{2}\|(\nabla\mathcal{L})(x)\|_2^2 \leq \mathcal{L}(x). \tag{9.211}
\]
Proof of Corollary 9.12.3. Note that Corollary 9.12.2, the fact that γ ≥ 0, and the fact
that $\tfrac{L\gamma}{2} - 1 \leq -\tfrac{1}{2}$ establish (9.211). The proof of Corollary 9.12.3 is thus complete.
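
The one-step descent estimate (9.211) is easy to test on a scalar example. The sketch below assumes the objective x ↦ ln(cosh(x)), which is not part of the text; its derivative tanh is Lipschitz continuous with constant 1, so every learning rate γ ∈ [0, 1] is admissible in Corollary 9.12.3.

```python
import math

LIPSCHITZ = 1.0  # Lipschitz constant of tanh, the derivative of the assumed objective


def loss(x: float) -> float:
    """Assumed illustrative objective ln(cosh(x))."""
    return math.log(math.cosh(x))


def grad(x: float) -> float:
    """Derivative of ln(cosh(x))."""
    return math.tanh(x)


def one_step_gap(x: float, gamma: float) -> float:
    """(loss(x) - gamma/2 * grad(x)^2) - loss(x - gamma * grad(x)); non-negative if (9.211) holds."""
    g = grad(x)
    return (loss(x) - 0.5 * gamma * g * g) - loss(x - gamma * g)


points = [-4.0 + 0.5 * i for i in range(17)]   # grid on [-4, 4]
gammas = [0.0, 0.25, 0.5, 1.0 / LIPSCHITZ]     # learning rates in [0, 1/L]
gaps = [one_step_gap(x, gamma) for x in points for gamma in gammas]
```

Every entry of `gaps` should be non-negative up to rounding, reflecting that one GD step with γ ≤ 1/L decreases the objective by at least (γ/2) times the squared gradient norm.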
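Exercise 9.12.1. Let $(\gamma_n)_{n \in \mathbb{N}} \subseteq (0,\infty)$ satisfy for all $n \in \mathbb{N}$ that $\gamma_n = \tfrac{1}{n+1}$ and let $\mathcal{L} \colon \mathbb{R} \to \mathbb{R}$
satisfy for all x ∈ R that
\[
  \mathcal{L}(x) = 2x + \sin(x). \tag{9.212}
\]

Prove or disprove the following statement: For every $\Theta = (\Theta_k)_{k \in \mathbb{N}_0} \colon \mathbb{N}_0 \to \mathbb{R}$ with $\forall\, k \in \mathbb{N} \colon \Theta_k = \Theta_{k-1} - \gamma_k (\nabla\mathcal{L})(\Theta_{k-1})$ and every n ∈ N it holds that
\[
  \mathcal{L}(\Theta_n) \leq \mathcal{L}(\Theta_{n-1}) - \tfrac{1}{n+1} \bigl( 1 - \tfrac{3}{2(n+1)} \bigr) |2 + \cos(\Theta_{n-1})|^2. \tag{9.213}
\]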

Exercise 9.12.2. Let L : R → R satisfy for all x ∈ R that


L(x) = 4x + 3 sin(x). (9.214)

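Prove or disprove the following statement: For every $\Theta = (\Theta_n)_{n \in \mathbb{N}_0} \colon \mathbb{N}_0 \to \mathbb{R}$ with $\forall\, n \in \mathbb{N} \colon \Theta_n = \Theta_{n-1} - \tfrac{1}{n+1} (\nabla\mathcal{L})(\Theta_{n-1})$ and every k ∈ N it holds that
\[
  \mathcal{L}(\Theta_k) < \mathcal{L}(\Theta_{k-1}). \tag{9.215}
\]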


9.12.2 Abstract local convergence results for GD processes


Proposition 9.12.4. Let d ∈ N, c ∈ R, ε, L, C ∈ (0, ∞), α ∈ (0, 1), γ ∈ (0, L−1 ], ϑ ∈ Rd ,
let B ⊆ Rd satisfy B = {θ ∈ Rd : ∥θ − ϑ∥2 < ε}, let L ∈ C(Rd , R) satisfy L|B ∈ C 1 (B, R),
let G : Rd → Rd satisfy for all θ ∈ B that G(θ) = (∇L)(θ), assume G(ϑ) = 0, assume for
all θ1 , θ2 ∈ B that
∥G(θ1 ) − G(θ2 )∥2 ≤ L∥θ1 − θ2 ∥2 , (9.216)
let Θ : N0 → Rd satisfy for all n ∈ N0 that Θn+1 = Θn − γG(Θn ), and assume for all θ ∈ B
that

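\[
  |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\alpha} \leq C\|G(\theta)\|_2, \qquad
  c = |\mathcal{L}(\Theta_0) - \mathcal{L}(\vartheta)|, \qquad
  2C(1-\alpha)^{-1} c^{1-\alpha} + \|\Theta_0 - \vartheta\|_2 < \frac{\varepsilon}{\gamma L + 1}, \tag{9.217}
\]
and $\inf_{n \in \{m \in \mathbb{N}_0 \colon \forall\, k \in \mathbb{N}_0 \cap [0,m] \colon \Theta_k \in B\}} \mathcal{L}(\Theta_n) \geq \mathcal{L}(\vartheta)$ (cf. Definition 3.3.4). Then there exists
$\psi \in \mathcal{L}^{-1}(\{\mathcal{L}(\vartheta)\}) \cap G^{-1}(\{0\}) \cap B$ such that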

(i) it holds for all n ∈ N0 that Θn ∈ B,

(ii) it holds for all n ∈ N0 that $0 \leq \mathcal{L}(\Theta_n) - \mathcal{L}(\psi) \leq 2C^2c^2 \bigl( \mathbb{1}_{\{0\}}(c) + c^{2\alpha} n\gamma + 2C^2 c \bigr)^{-1}$,
and

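(iii) it holds for all n ∈ N0 that

\[
\begin{split}
  \|\Theta_n - \psi\|_2
  &\leq \sum_{k=n}^{\infty} \|\Theta_{k+1} - \Theta_k\|_2
   \leq 2C(1-\alpha)^{-1} |\mathcal{L}(\Theta_n) - \mathcal{L}(\psi)|^{1-\alpha} \\
  &\leq 2^{2-\alpha} C^{3-2\alpha} c^{2-2\alpha} (1-\alpha)^{-1} \bigl( \mathbb{1}_{\{0\}}(c) + c^{2\alpha} n\gamma + 2C^2 c \bigr)^{\alpha-1}.
\end{split} \tag{9.218}
\]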

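Proof of Proposition 9.12.4. Throughout this proof, let T ∈ N0 ∪ {∞} satisfy

\[
  T = \inf\bigl( \{ n \in \mathbb{N}_0 \colon \Theta_n \notin B \} \cup \{\infty\} \bigr), \tag{9.219}
\]

let $\mathbb{L} \colon \mathbb{N}_0 \to \mathbb{R}$ satisfy for all $n \in \mathbb{N}_0$ that $\mathbb{L}(n) = \mathcal{L}(\Theta_n) - \mathcal{L}(\vartheta)$, and let τ ∈ N0 ∪ {∞}
satisfy
\[
  \tau = \inf\bigl( \{ n \in \mathbb{N}_0 \cap [0,T) \colon \mathbb{L}(n) = 0 \} \cup \{T\} \bigr). \tag{9.220}
\]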
Observe that the assumption that G(ϑ) = 0 implies for all θ ∈ B that

γ∥G(θ)∥2 = γ∥G(θ) − G(ϑ)∥2 ≤ γL∥θ − ϑ∥2 . (9.221)

This, the fact that ∥Θ0 − ϑ∥2 < ε, and the fact that

∥Θ1 − ϑ∥2 ≤ ∥Θ1 − Θ0 ∥2 + ∥Θ0 − ϑ∥2 = γ∥G(Θ0 )∥2 + ∥Θ0 − ϑ∥2 ≤ (γL + 1)∥Θ0 − ϑ∥2 < ε
(9.222)

ensure that T ≥ 2. Next note that the assumption that

\[
  \inf_{n \in \{m \in \mathbb{N}_0 \colon \forall\, k \in \mathbb{N}_0 \cap [0,m] \colon \Theta_k \in B\}} \mathcal{L}(\Theta_n) \geq \mathcal{L}(\vartheta) \tag{9.223}
\]

demonstrates for all n ∈ N0 ∩ [0, T ) that

\[
  \mathbb{L}(n) \geq 0. \tag{9.224}
\]

Furthermore, observe that the fact that B ⊆ Rd is open and convex, Corollary 9.12.3, and
(9.217) demonstrate for all n ∈ N0 ∩ [0, T − 1) that

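\[
\begin{split}
  \mathbb{L}(n+1) - \mathbb{L}(n)
  &= \mathcal{L}(\Theta_{n+1}) - \mathcal{L}(\Theta_n)
   \leq -\tfrac{\gamma}{2} \|G(\Theta_n)\|_2^2
   = -\tfrac{1}{2} \|G(\Theta_n)\|_2 \|\gamma G(\Theta_n)\|_2 \\
  &= -\tfrac{1}{2} \|G(\Theta_n)\|_2 \|\Theta_{n+1} - \Theta_n\|_2
   \leq -(2C)^{-1} |\mathcal{L}(\Theta_n) - \mathcal{L}(\vartheta)|^{\alpha} \|\Theta_{n+1} - \Theta_n\|_2 \\
  &= -(2C)^{-1} [\mathbb{L}(n)]^{\alpha} \|\Theta_{n+1} - \Theta_n\|_2 \leq 0.
\end{split} \tag{9.225}
\]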

Hence, we obtain that


\[
  \mathbb{N}_0 \cap [0,T) \ni n \mapsto \mathbb{L}(n) \in [0,\infty) \tag{9.226}
\]
is non-increasing. Combining this with (9.220) ensures for all n ∈ N0 ∩ [τ, T ) that

\[
  \mathbb{L}(n) = 0. \tag{9.227}
\]

This and (9.225) demonstrate for all n ∈ N0 ∩ [τ, T − 1) that

\[
  0 = \mathbb{L}(n+1) - \mathbb{L}(n) \leq -\tfrac{\gamma}{2} \|G(\Theta_n)\|_2^2 \leq 0. \tag{9.228}
\]

The fact that γ > 0 therefore establishes for all n ∈ N0 ∩ [τ, T − 1) that G(Θn ) = 0. Hence,
we obtain for all n ∈ N0 ∩ [τ, T ) that

Θn = Θτ . (9.229)

Moreover, note that (9.220) and (9.225) ensure for all n ∈ N0 ∩ [0, τ ) ∩ [0, T − 1) that

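\[
\begin{split}
  \|\Theta_{n+1} - \Theta_n\|_2
  &\leq \frac{2C(\mathbb{L}(n) - \mathbb{L}(n+1))}{[\mathbb{L}(n)]^{\alpha}}
   = 2C \int_{\mathbb{L}(n+1)}^{\mathbb{L}(n)} [\mathbb{L}(n)]^{-\alpha} \, du \\
  &\leq 2C \int_{\mathbb{L}(n+1)}^{\mathbb{L}(n)} u^{-\alpha} \, du
   = \frac{2C\bigl([\mathbb{L}(n)]^{1-\alpha} - [\mathbb{L}(n+1)]^{1-\alpha}\bigr)}{1-\alpha}.
\end{split} \tag{9.230}
\]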

This and (9.229) show for all n ∈ N0 ∩ [0, T − 1) that

\[
  \|\Theta_{n+1} - \Theta_n\|_2 \leq \frac{2C\bigl([\mathbb{L}(n)]^{1-\alpha} - [\mathbb{L}(n+1)]^{1-\alpha}\bigr)}{1-\alpha}. \tag{9.231}
\]

Combining this with the triangle inequality proves for all m, n ∈ N0 ∩ [0, T ) with m ≤ n
that
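\[
\begin{split}
  \|\Theta_n - \Theta_m\|_2
  &\leq \sum_{k=m}^{n-1} \|\Theta_{k+1} - \Theta_k\|_2
   \leq \frac{2C}{1-\alpha} \Biggl[ \sum_{k=m}^{n-1} \bigl( [\mathbb{L}(k)]^{1-\alpha} - [\mathbb{L}(k+1)]^{1-\alpha} \bigr) \Biggr] \\
  &= \frac{2C\bigl([\mathbb{L}(m)]^{1-\alpha} - [\mathbb{L}(n)]^{1-\alpha}\bigr)}{1-\alpha}
   \leq \frac{2C[\mathbb{L}(m)]^{1-\alpha}}{1-\alpha}.
\end{split} \tag{9.232}
\]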
This and (9.217) demonstrate for all n ∈ N0 ∩ [0, T ) that

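\[
  \|\Theta_n - \Theta_0\|_2 \leq \frac{2C[\mathbb{L}(0)]^{1-\alpha}}{1-\alpha}
  = \frac{2C|\mathcal{L}(\Theta_0) - \mathcal{L}(\vartheta)|^{1-\alpha}}{1-\alpha}
  = 2C(1-\alpha)^{-1} c^{1-\alpha}. \tag{9.233}
\]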
Combining this with (9.221), (9.217), and the triangle inequality implies for all n ∈ N0 ∩[0, T )
that
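\[
\begin{split}
  \|\Theta_{n+1} - \vartheta\|_2
  &\leq \|\Theta_{n+1} - \Theta_n\|_2 + \|\Theta_n - \vartheta\|_2
   = \gamma\|G(\Theta_n)\|_2 + \|\Theta_n - \vartheta\|_2 \\
  &\leq (\gamma L + 1)\|\Theta_n - \vartheta\|_2
   \leq (\gamma L + 1)\bigl( \|\Theta_n - \Theta_0\|_2 + \|\Theta_0 - \vartheta\|_2 \bigr) \\
  &\leq (\gamma L + 1)\bigl( 2C(1-\alpha)^{-1} c^{1-\alpha} + \|\Theta_0 - \vartheta\|_2 \bigr) < \varepsilon.
\end{split} \tag{9.234}
\]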

Therefore, we obtain that


T = ∞. (9.235)
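Combining this with (9.217), (9.232), and (9.226) demonstrates that

\[
  \sum_{k=0}^{\infty} \|\Theta_{k+1} - \Theta_k\|_2
  = \lim_{n \to \infty} \Biggl[ \sum_{k=0}^{n} \|\Theta_{k+1} - \Theta_k\|_2 \Biggr]
  \leq \frac{2C[\mathbb{L}(0)]^{1-\alpha}}{1-\alpha}
  = \frac{2C c^{1-\alpha}}{1-\alpha} < \varepsilon < \infty. \tag{9.236}
\]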

Hence, we obtain that there exists ψ ∈ Rd which satisfies

lim supn→∞ ∥Θn − ψ∥2 = 0. (9.237)

Observe that (9.234), (9.235), and (9.237) ensure that

∥ψ − ϑ∥2 ≤ (γL + 1)(2C(1 − α)−1 c1−α + ∥Θ0 − ϑ∥2 ) < ε. (9.238)

Therefore, we obtain that


ψ ∈ B. (9.239)
Next note that (9.225), (9.217), and the fact that for all n ∈ N0 it holds that L(n) ≤ L(0) = c
ensure that for all n ∈ N0 ∩ [0, τ ) we have that

\[
  -\mathbb{L}(n) \leq \mathbb{L}(n+1) - \mathbb{L}(n)
  \leq -\tfrac{\gamma}{2} \|G(\Theta_n)\|_2^2
  \leq -\frac{\gamma}{2C^2} [\mathbb{L}(n)]^{2\alpha}
  \leq -\frac{\gamma}{2C^2 c^{2-2\alpha}} [\mathbb{L}(n)]^{2}. \tag{9.240}
\]

This establishes for all n ∈ N0 ∩ [0, τ ) that


\[
  0 < \mathbb{L}(n) \leq \frac{2C^2 c^{2-2\alpha}}{\gamma}. \tag{9.241}
\]
Combining this and (9.240) demonstrates for all n ∈ N0 ∩ [0, τ − 1) that
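\[
\begin{split}
  \frac{1}{\mathbb{L}(n)} - \frac{1}{\mathbb{L}(n+1)}
  &\leq \frac{1}{\mathbb{L}(n)} - \frac{1}{\mathbb{L}(n)\bigl(1 - \frac{\gamma}{2C^2c^{2-2\alpha}} \mathbb{L}(n)\bigr)}
   = \frac{1}{\mathbb{L}(n)} \Biggl[ \frac{\bigl(1 - \frac{\gamma}{2C^2c^{2-2\alpha}} \mathbb{L}(n)\bigr) - 1}{1 - \frac{\gamma}{2C^2c^{2-2\alpha}} \mathbb{L}(n)} \Biggr] \\
  &= \frac{-\frac{\gamma}{2C^2c^{2-2\alpha}}}{1 - \frac{\gamma}{2C^2c^{2-2\alpha}} \mathbb{L}(n)}
   = -\frac{1}{\frac{2C^2c^{2-2\alpha}}{\gamma} - \mathbb{L}(n)}
   < -\frac{\gamma}{2C^2c^{2-2\alpha}}.
\end{split} \tag{9.242}
\]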

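Therefore, we get for all n ∈ N0 ∩ [0, τ ) that

\[
  \frac{1}{\mathbb{L}(n)}
  = \frac{1}{\mathbb{L}(0)} + \sum_{k=0}^{n-1} \biggl[ \frac{1}{\mathbb{L}(k+1)} - \frac{1}{\mathbb{L}(k)} \biggr]
  \geq \frac{1}{\mathbb{L}(0)} + \frac{n\gamma}{2C^2c^{2-2\alpha}}
  = \frac{1}{c} + \frac{n\gamma}{2C^2c^{2-2\alpha}}. \tag{9.243}
\]
Hence, we obtain for all n ∈ N0 ∩ [0, τ ) that $\mathbb{L}(n) \leq \frac{2C^2c^{2-2\alpha}}{n\gamma + 2C^2c^{1-2\alpha}}$. Combining this with the
fact that for all n ∈ N0 ∩ [τ, ∞) it holds that $\mathbb{L}(n) = 0$ shows that for all n ∈ N0 we have
that
\[
  \mathbb{L}(n) \leq \frac{2C^2c^2}{\mathbb{1}_{\{0\}}(c) + c^{2\alpha} n\gamma + 2C^2 c}. \tag{9.244}
\]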
This, (9.237), and the assumption that L is continuous prove that
L(ψ) = limn→∞ L(Θn ) = L(ϑ). (9.245)
Combining this with (9.244) implies for all n ∈ N0 that
\[
  0 \leq \mathcal{L}(\Theta_n) - \mathcal{L}(\psi) \leq \frac{2C^2c^2}{\mathbb{1}_{\{0\}}(c) + c^{2\alpha} n\gamma + 2C^2 c}. \tag{9.246}
\]
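Furthermore, observe that the fact that $B \ni \theta \mapsto G(\theta) \in \mathbb{R}^d$ is continuous, the fact that
ψ ∈ B, and (9.237) demonstrate that
\[
  G(\psi) = \lim_{n \to \infty} G(\Theta_n) = \lim_{n \to \infty} \bigl( \gamma^{-1}(\Theta_n - \Theta_{n+1}) \bigr) = 0. \tag{9.247}
\]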
Next note that (9.244) and (9.232) ensure for all n ∈ N0 that

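\[
\begin{split}
  \|\Theta_n - \psi\|_2
  &= \lim_{m \to \infty} \|\Theta_n - \Theta_m\|_2
   \leq \sum_{k=n}^{\infty} \|\Theta_{k+1} - \Theta_k\|_2
   \leq \frac{2C[\mathbb{L}(n)]^{1-\alpha}}{1-\alpha} \\
  &\leq \frac{2^{2-\alpha} C^{3-2\alpha} c^{2-2\alpha}}{(1-\alpha)\bigl( \mathbb{1}_{\{0\}}(c) + c^{2\alpha} n\gamma + 2C^2 c \bigr)^{1-\alpha}}.
\end{split} \tag{9.248}
\]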
Combining this with (9.245), (9.235), (9.247), and (9.246) establishes items (i), (ii), and
(iii). The proof of Proposition 9.12.4 is thus complete.


Corollary 9.12.5. Let d ∈ N, c ∈ [0, 1], ε, L, C ∈ (0, ∞), α ∈ (0, 1), γ ∈ (0, L−1 ], ϑ ∈ Rd ,
let B ⊆ Rd satisfy B = {θ ∈ Rd : ∥θ − ϑ∥2 < ε}, let L ∈ C(Rd , R) satisfy L|B ∈ C 1 (B, R),
let G : Rd → Rd satisfy for all θ ∈ B that G(θ) = (∇L)(θ), assume for all θ1 , θ2 ∈ B that
∥G(θ1 ) − G(θ2 )∥2 ≤ L∥θ1 − θ2 ∥2 , (9.249)
let Θ = (Θn )n∈N0 : N0 → Rd satisfy for all n ∈ N0 that

Θn+1 = Θn − γG(Θn ), (9.250)

and assume for all θ ∈ B that


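\[
  |\mathcal{L}(\theta) - \mathcal{L}(\vartheta)|^{\alpha} \leq C\|G(\theta)\|_2, \qquad
  c = |\mathcal{L}(\Theta_0) - \mathcal{L}(\vartheta)|, \qquad
  2C(1-\alpha)^{-1} c^{1-\alpha} + \|\Theta_0 - \vartheta\|_2 < \frac{\varepsilon}{\gamma L + 1}, \tag{9.251}
\]
and $\mathcal{L}(\theta) \geq \mathcal{L}(\vartheta)$. Then there exists $\psi \in \mathcal{L}^{-1}(\{\mathcal{L}(\vartheta)\}) \cap G^{-1}(\{0\})$ such that for all n ∈ N0
it holds that Θn ∈ B, $0 \leq \mathcal{L}(\Theta_n) - \mathcal{L}(\psi) \leq 2(2 + C^{-2}\gamma n)^{-1}$, and

\[
  \|\Theta_n - \psi\|_2 \leq \sum_{k=n}^{\infty} \|\Theta_{k+1} - \Theta_k\|_2 \leq 2^{2-\alpha} C (1-\alpha)^{-1} (2 + C^{-2}\gamma n)^{\alpha-1}. \tag{9.252}
\]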

Proof of Corollary 9.12.5. Observe that the fact that L(ϑ) = inf θ∈B L(θ) ensures that
G(ϑ) = (∇L)(ϑ) = 0 and inf n∈{m∈N0 : ∀ k∈N0 ∩[0,m] : Θk ∈B} L(Θn ) ≥ L(ϑ). Combining this
with Proposition 9.12.4 ensures that there exists ψ ∈ L −1 ({L(ϑ)}) ∩ G−1 ({0}) such that
(I) it holds for all n ∈ N0 that Θn ∈ B,
(II) it holds for all n ∈ N0 that $0 \leq \mathcal{L}(\Theta_n) - \mathcal{L}(\psi) \leq \frac{2C^2c^2}{\mathbb{1}_{\{0\}}(c) + c^{2\alpha} n\gamma + 2C^2 c}$, and

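(III) it holds for all n ∈ N0 that

\[
\begin{split}
  \|\Theta_n - \psi\|_2
  &\leq \sum_{k=n}^{\infty} \|\Theta_{k+1} - \Theta_k\|_2
   \leq \frac{2C|\mathcal{L}(\Theta_n) - \mathcal{L}(\psi)|^{1-\alpha}}{1-\alpha} \\
  &\leq \frac{2^{2-\alpha} C^{3-2\alpha} c^{2-2\alpha}}{(1-\alpha)\bigl( \mathbb{1}_{\{0\}}(c) + c^{2\alpha} n\gamma + 2C^2 c \bigr)^{1-\alpha}}.
\end{split} \tag{9.253}
\]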

Note that item (II) and the assumption that c ≤ 1 establish for all n ∈ N0 that

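\[
  0 \leq \mathcal{L}(\Theta_n) - \mathcal{L}(\psi)
  \leq 2c^2 \bigl( C^{-2} \mathbb{1}_{\{0\}}(c) + C^{-2} c^{2\alpha} n\gamma + 2c \bigr)^{-1}
  \leq 2(2 + C^{-2}\gamma n)^{-1}. \tag{9.254}
\]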
This and item (III) demonstrate for all n ∈ N0 that

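\[
  \|\Theta_n - \psi\|_2 \leq \sum_{k=n}^{\infty} \|\Theta_{k+1} - \Theta_k\|_2
  \leq \frac{2C|\mathcal{L}(\Theta_n) - \mathcal{L}(\psi)|^{1-\alpha}}{1-\alpha}
  \leq \biggl[ \frac{2^{2-\alpha} C}{1-\alpha} \biggr] (2 + C^{-2}\gamma n)^{\alpha-1}. \tag{9.255}
\]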
The proof of Corollary 9.12.5 is thus complete.
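
To see the hypotheses and conclusions of Corollary 9.12.5 in a concrete case, the following sketch runs GD on the toy objective θ ↦ θ^4 around ϑ = 0. All numerical choices are assumptions made for this illustration: on the ball of radius ε = 1 the derivative 4θ^3 is Lipschitz continuous with constant 12, and the KL inequality holds with α = 3/4 and C = 1/4 since |θ^4|^{3/4} = |θ|^3 = (1/4)|4θ^3|. In this example the limit point ψ coincides with ϑ = 0.

```python
def loss(theta: float) -> float:
    """Toy objective (an assumption of this sketch) with minimizer 0."""
    return theta ** 4


def grad(theta: float) -> float:
    """Derivative of the toy objective."""
    return 4.0 * theta ** 3


C_KL, ALPHA = 0.25, 0.75   # KL constant and exponent for theta^4 near 0
EPS = 1.0                  # radius of the ball B around THETA_STAR
LIP = 12.0                 # Lipschitz constant of grad on (-1, 1)
GAMMA = 1.0 / LIP          # learning rate in (0, 1/LIP]
THETA_STAR = 0.0           # the point called vartheta in the corollary

theta = 0.1                # illustrative initial value
c = abs(loss(theta) - loss(THETA_STAR))

# Smallness condition (9.251) together with the requirement c <= 1.
smallness = 2.0 * C_KL / (1.0 - ALPHA) * c ** (1.0 - ALPHA) + abs(theta - THETA_STAR)
hypothesis_ok = c <= 1.0 and smallness < EPS / (GAMMA * LIP + 1.0)

bounds_ok = True
for n in range(10001):
    rate = 2.0 + GAMMA * n / C_KL ** 2
    loss_bound = 2.0 / rate                                               # bound on the risk gap
    dist_bound = 2.0 ** (2.0 - ALPHA) * C_KL / (1.0 - ALPHA) * rate ** (ALPHA - 1.0)  # (9.252)
    in_ball = abs(theta - THETA_STAR) < EPS
    bounds_ok = (
        bounds_ok
        and in_ball
        and 0.0 <= loss(theta) - loss(THETA_STAR) <= loss_bound
        and abs(theta - THETA_STAR) <= dist_bound
    )
    theta = theta - GAMMA * grad(theta)   # GD step (9.250)
```

The bounds of the corollary are far from sharp here: the iterates decay like n^{-1/2}, whereas (9.252) only guarantees the rate n^{-1/4}.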


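Exercise 9.12.3. Let $\mathcal{L} \in C^1(\mathbb{R},\mathbb{R})$ satisfy for all θ ∈ R that

\[
  \mathcal{L}(\theta) = \theta^4 + \int_0^1 (\sin(x) - \theta x)^2 \, dx. \tag{9.256}
\]

Prove or disprove the following statement: For every continuous $\Theta = (\Theta_t)_{t \in [0,\infty)} \colon [0,\infty) \to \mathbb{R}$
with $\sup_{t \in [0,\infty)} |\Theta_t| < \infty$ and $\forall\, t \in [0,\infty) \colon \Theta_t = \Theta_0 - \int_0^t (\nabla\mathcal{L})(\Theta_s) \, ds$ there exists ϑ ∈ R
such that

\[
  \limsup_{t \to \infty} |\Theta_t - \vartheta| = 0. \tag{9.257}
\]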

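Exercise 9.12.4. Let $\mathcal{L} \in C^{\infty}(\mathbb{R},\mathbb{R})$ satisfy for all θ ∈ R that

\[
  \mathcal{L}(\theta) = \int_0^1 (\sin(x) - \theta x + \theta^2)^2 \, dx. \tag{9.258}
\]

Prove or disprove the following statement: For every $\Theta \in C([0,\infty),\mathbb{R})$ with $\sup_{t \in [0,\infty)} |\Theta_t| < \infty$ and $\forall\, t \in [0,\infty) \colon \Theta_t = \Theta_0 - \int_0^t (\nabla\mathcal{L})(\Theta_s) \, ds$ there exist ϑ ∈ R, C, β ∈ (0, ∞) such that
for all t ∈ [0, ∞) it holds that

\[
  |\Theta_t - \vartheta| = C(1+t)^{-\beta}. \tag{9.259}
\]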

9.13 On the analyticity of realization functions of ANNs


Proposition 9.13.1 (Compositions of analytic functions). Let l, m, n ∈ N, let U ⊆ Rl and
V ⊆ Rm be open, let f : U → Rm and g : V → Rn be analytic, and assume f (U ) ⊆ V (cf.
Definition 9.7.1). Then
\[
  U \ni u \mapsto g(f(u)) \in \mathbb{R}^n \tag{9.260}
\]
is analytic.

Proof of Proposition 9.13.1. Observe that Faà di Bruno’s formula (cf., for instance, Fraenkel
[134]) establishes that g ◦ f is analytic (cf. also, for example, Krantz & Parks [254, Proposition 2.8]). The proof of Proposition 9.13.1 is thus complete.

Lemma 9.13.2. Let d1 , d2 , l1 , l2 ∈ N, for every k ∈ {1, 2} let Fk : Rdk → Rlk be analytic,
and let f : Rd1 × Rd2 → Rl1 × Rl2 satisfy for all x1 ∈ Rd1 , x2 ∈ Rd2 that

f (x1 , x2 ) = (F1 (x1 ), F2 (x2 )) (9.261)

(cf. Definition 9.7.1). Then f is analytic.


Proof of Lemma 9.13.2. Throughout this proof, let A1 : Rl1 → Rl1 × Rl2 and A2 : Rl2 →
Rl1 × Rl2 satisfy for all x1 ∈ Rl1 , x2 ∈ Rl2 that

A1 (x1 ) = (x1 , 0) and A2 (x2 ) = (0, x2 ) (9.262)

and for every k ∈ {1, 2} let $B_k \colon \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \to \mathbb{R}^{d_k}$ satisfy for all $x_1 \in \mathbb{R}^{d_1}$, $x_2 \in \mathbb{R}^{d_2}$ that

Bk (x1 , x2 ) = xk . (9.263)

Note that item (i) in Lemma 5.3.1 shows that

f = A1 ◦ F1 ◦ B1 + A2 ◦ F2 ◦ B2 . (9.264)

This, the fact that A1 , A2 , F1 , F2 , B1 , and B2 are analytic, the fact that sums of analytic functions are analytic, and Proposition 9.13.1 establish
that f is analytic. The proof of Lemma 9.13.2 is thus complete.

Lemma 9.13.3. Let d1 , d2 , l0 , l1 , l2 ∈ N, for every k ∈ {1, 2} let Fk : Rdk × Rlk−1 → Rlk be
analytic, and let f : Rd1 × Rd2 × Rl0 → Rl2 satisfy for all θ1 ∈ Rd1 , θ2 ∈ Rd2 , x ∈ Rl0 that

\[
  f(\theta_1,\theta_2,x) = \bigl( F_2(\theta_2,\cdot) \circ F_1(\theta_1,\cdot) \bigr)(x) \tag{9.265}
\]

(cf. Definition 9.7.1). Then f is analytic.

Proof of Lemma 9.13.3. Throughout this proof, let A : Rd1 × Rd2 × Rl0 → Rd2 × Rd1 +l0 and
B : Rd2 × Rd1 +l0 → Rd2 × Rl1 satisfy for all θ1 ∈ Rd1 , θ2 ∈ Rd2 , x ∈ Rl0 that

A(θ1 , θ2 , x) = (θ2 , (θ1 , x)) and B(θ2 , (θ1 , x)) = (θ2 , F1 (θ1 , x)). (9.266)

Observe that item (i) in Lemma 5.3.2 proves that

f = F2 ◦ B ◦ A. (9.267)

Furthermore, note that Lemma 9.13.2 (with $d_1 \curvearrowleft d_2$, $d_2 \curvearrowleft d_1 + l_0$, $l_1 \curvearrowleft d_2$, $l_2 \curvearrowleft l_1$,
$F_1 \curvearrowleft (\mathbb{R}^{d_2} \ni \theta_2 \mapsto \theta_2 \in \mathbb{R}^{d_2})$, $F_2 \curvearrowleft (\mathbb{R}^{d_1 + l_0} \ni (\theta_1,x) \mapsto F_1(\theta_1,x) \in \mathbb{R}^{l_1})$ in the notation
of Lemma 9.13.2) implies that B is analytic. Combining this, the fact that A is analytic,
the fact that F2 is analytic, and (9.267) with Proposition 9.13.1 demonstrates that f is
analytic. The proof of Lemma 9.13.3 is thus complete.

Corollary 9.13.4 (Analyticity of realization functions of ANNs). Let $L \in \mathbb{N}$, $l_0, l_1, \dots, l_L \in \mathbb{N}$ and for every $k \in \{1, 2, \dots, L\}$ let $\Psi_k \colon \mathbb{R}^{l_k} \to \mathbb{R}^{l_k}$ be analytic (cf. Definition 9.7.1).
Then
\[
  \mathbb{R}^{\sum_{k=1}^{L} l_k(l_{k-1}+1)} \times \mathbb{R}^{l_0} \ni (\theta,x) \mapsto \bigl( \mathcal{N}^{\theta,l_0}_{\Psi_1,\Psi_2,\dots,\Psi_L} \bigr)(x) \in \mathbb{R}^{l_L} \tag{9.268}
\]
is analytic (cf. Definition 1.1.3).

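Proof of Corollary 9.13.4. Throughout this proof, for every $k \in \{1, 2, \dots, L\}$ let $d_k = l_k(l_{k-1}+1)$ and for every $k \in \{1, 2, \dots, L\}$ let $F_k \colon \mathbb{R}^{d_k} \times \mathbb{R}^{l_{k-1}} \to \mathbb{R}^{l_k}$ satisfy for all $\theta \in \mathbb{R}^{d_k}$, $x \in \mathbb{R}^{l_{k-1}}$ that
\[
  F_k(\theta,x) = \Psi_k\bigl( \mathcal{A}^{\theta,0}_{l_k,l_{k-1}}(x) \bigr) \tag{9.269}
\]
(cf. Definition 1.1.1). Observe that item (i) in Lemma 5.3.3 demonstrates that for all
$\theta_1 \in \mathbb{R}^{d_1}$, $\theta_2 \in \mathbb{R}^{d_2}$, $\dots$, $\theta_L \in \mathbb{R}^{d_L}$, $x \in \mathbb{R}^{l_0}$ it holds that
\[
  \bigl( \mathcal{N}^{(\theta_1,\theta_2,\dots,\theta_L),l_0}_{\Psi_1,\Psi_2,\dots,\Psi_L} \bigr)(x) = \bigl( F_L(\theta_L,\cdot) \circ F_{L-1}(\theta_{L-1},\cdot) \circ \ldots \circ F_1(\theta_1,\cdot) \bigr)(x) \tag{9.270}
\]
(cf. Definition 1.1.3). Note that the assumption that for all $k \in \{1, 2, \dots, L\}$ it holds that $\Psi_k$
is analytic, the fact that for all $m, n \in \mathbb{N}$ it holds that $\mathbb{R}^{m(n+1)} \times \mathbb{R}^{n} \ni (\theta,x) \mapsto \mathcal{A}^{\theta,0}_{m,n}(x) \in \mathbb{R}^{m}$ is analytic, and Proposition 9.13.1 ensure that for all $k \in \{1, 2, \dots, L\}$ it
holds that $F_k$ is analytic. Lemma 5.3.2 and induction hence ensure that

\[
  \mathbb{R}^{d_1} \times \mathbb{R}^{d_2} \times \ldots \times \mathbb{R}^{d_L} \times \mathbb{R}^{l_0} \ni (\theta_1,\theta_2,\dots,\theta_L,x)
  \mapsto \bigl( F_L(\theta_L,\cdot) \circ F_{L-1}(\theta_{L-1},\cdot) \circ \ldots \circ F_1(\theta_1,\cdot) \bigr)(x) \in \mathbb{R}^{l_L} \tag{9.271}
\]
is analytic. This and (9.270) establish that
\[
  \mathbb{R}^{\sum_{k=1}^{L} l_k(l_{k-1}+1)} \times \mathbb{R}^{l_0} \ni (\theta,x) \mapsto \bigl( \mathcal{N}^{\theta,l_0}_{\Psi_1,\Psi_2,\dots,\Psi_L} \bigr)(x) \in \mathbb{R}^{l_L} \tag{9.272}
\]
is analytic. The proof of Corollary 9.13.4 is thus complete.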
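
The decomposition in (9.270) translates directly into code: the parameter vector is cut into blocks of sizes $d_k = l_k(l_{k-1}+1)$ and each block drives one layer map $F_k$. The sketch below is only an illustration. In particular, the parameter layout inside each block (weight matrix row by row, followed by the bias vector) is an assumption about Definition 1.1.1, which is not reproduced in this section, and the helper names `affine`, `layer`, and `realization` are hypothetical.

```python
import math


def affine(theta, m, n, x):
    """Affine map R^n -> R^m; assumed layout: m*n weights row by row, then m biases."""
    return [
        sum(theta[i * n + j] * x[j] for j in range(n)) + theta[m * n + i]
        for i in range(m)
    ]


def softplus_vec(v):
    """Componentwise softplus, an analytic activation."""
    return [math.log1p(math.exp(t)) for t in v]


def identity(v):
    """Identity activation for the output layer."""
    return list(v)


def layer(psi, m, n):
    """Layer map F_k(theta, x) = psi(affine(theta, x)) as in (9.269)."""
    return lambda theta, x: psi(affine(theta, m, n, x))


def realization(theta, dims, activations, x):
    """Composition F_L(theta_L, .) o ... o F_1(theta_1, .) applied to x, cf. (9.270)."""
    offset = 0
    for k in range(1, len(dims)):
        m, n = dims[k], dims[k - 1]
        d_k = m * (n + 1)                      # number of parameters of layer k
        x = layer(activations[k - 1], m, n)(theta[offset:offset + d_k], x)
        offset += d_k
    assert offset == len(theta), "parameter vector has the wrong length"
    return x


dims = [2, 3, 1]                               # l_0 = 2, l_1 = 3, l_2 = 1, hence d = 9 + 4 = 13
theta = [0.0] * 9 + [1.0, 1.0, 1.0, 0.0]       # zero hidden layer, output sums the hidden units
output = realization(theta, dims, [softplus_vec, identity], [0.3, -0.7])
```

With a zero first layer every hidden unit equals softplus(0) = ln 2, so the output is 3 ln 2 regardless of the input, which gives a quick consistency check of the block structure.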
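Corollary 9.13.5 (Analyticity of the empirical risk function). Let $L, d \in \mathbb{N}\setminus\{1\}$, $M, l_0, l_1, \dots, l_L \in \mathbb{N}$, $x_1, x_2, \dots, x_M \in \mathbb{R}^{l_0}$, $y_1, y_2, \dots, y_M \in \mathbb{R}^{l_L}$ satisfy $d = \sum_{k=1}^{L} l_k(l_{k-1}+1)$, let
$a \colon \mathbb{R} \to \mathbb{R}$ and $\mathbf{L} \colon \mathbb{R}^{l_L} \times \mathbb{R}^{l_L} \to \mathbb{R}$ be analytic, let $\mathcal{L} \colon \mathbb{R}^d \to \mathbb{R}$ satisfy for all $\theta \in \mathbb{R}^d$ that
\[
  \mathcal{L}(\theta) = \frac{1}{M} \Biggl[ \sum_{m=1}^{M} \mathbf{L}\Bigl( \bigl( \mathcal{N}^{\theta,l_0}_{\mathfrak{M}_{a,l_1},\mathfrak{M}_{a,l_2},\dots,\mathfrak{M}_{a,l_{L-1}},\operatorname{id}_{\mathbb{R}^{l_L}}} \bigr)(x_m), y_m \Bigr) \Biggr] \tag{9.273}
\]
(cf. Definitions 1.1.3, 1.2.1, and 9.7.1). Then $\mathcal{L}$ is analytic.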


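Proof of Corollary 9.13.5. Observe that the assumption that a is analytic, Lemma 9.13.2,
and induction show that for all $m \in \mathbb{N}$ it holds that $\mathfrak{M}_{a,m}$ is analytic. This, Corollary 9.13.4
and Lemma 9.13.2 (applied with $d_1 \curvearrowleft d + l_0$, $d_2 \curvearrowleft l_L$, $l_1 \curvearrowleft l_L$, $l_2 \curvearrowleft l_L$, $F_1 \curvearrowleft (\mathbb{R}^d \times \mathbb{R}^{l_0} \ni (\theta,x) \mapsto ( \mathcal{N}^{\theta,l_0}_{\mathfrak{M}_{a,l_1},\mathfrak{M}_{a,l_2},\dots,\mathfrak{M}_{a,l_{L-1}},\operatorname{id}_{\mathbb{R}^{l_L}}} )(x) \in \mathbb{R}^{l_L})$, $F_2 \curvearrowleft \operatorname{id}_{\mathbb{R}^{l_L}}$ in the notation of Lemma 9.13.2)
ensure that
\[
  \mathbb{R}^d \times \mathbb{R}^{l_0} \times \mathbb{R}^{l_L} \ni (\theta,x,y) \mapsto \Bigl( \bigl( \mathcal{N}^{\theta,l_0}_{\mathfrak{M}_{a,l_1},\mathfrak{M}_{a,l_2},\dots,\mathfrak{M}_{a,l_{L-1}},\operatorname{id}_{\mathbb{R}^{l_L}}} \bigr)(x), y \Bigr) \in \mathbb{R}^{l_L} \times \mathbb{R}^{l_L} \tag{9.274}
\]
is analytic. The assumption that $\mathbf{L}$ is analytic and Proposition 9.13.1 therefore establish
that for all $m \in \{1, 2, \dots, M\}$ it holds that
\[
  \mathbb{R}^d \ni \theta \mapsto \mathbf{L}\Bigl( \bigl( \mathcal{N}^{\theta,l_0}_{\mathfrak{M}_{a,l_1},\mathfrak{M}_{a,l_2},\dots,\mathfrak{M}_{a,l_{L-1}},\operatorname{id}_{\mathbb{R}^{l_L}}} \bigr)(x_m), y_m \Bigr) \in \mathbb{R} \tag{9.275}
\]
is analytic. Combining this with (9.273) proves that $\mathcal{L}$ is analytic. The proof of Corollary 9.13.5 is thus complete.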

9.14 Standard KL inequalities for empirical risks in the training of ANNs with analytic activation functions
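Theorem 9.14.1 (Empirical risk minimization for ANNs with analytic activation functions).
Let $L, d \in \mathbb{N}\setminus\{1\}$, $M, l_0, l_1, \dots, l_L \in \mathbb{N}$, $x_1, x_2, \dots, x_M \in \mathbb{R}^{l_0}$, $y_1, y_2, \dots, y_M \in \mathbb{R}^{l_L}$ satisfy
$d = \sum_{k=1}^{L} l_k(l_{k-1}+1)$, let $a \colon \mathbb{R} \to \mathbb{R}$ and $\mathbf{L} \colon \mathbb{R}^{l_L} \times \mathbb{R}^{l_L} \to \mathbb{R}$ be analytic, let $\mathcal{L} \colon \mathbb{R}^d \to \mathbb{R}$
satisfy for all $\theta \in \mathbb{R}^d$ that
\[
  \mathcal{L}(\theta) = \frac{1}{M} \Biggl[ \sum_{m=1}^{M} \mathbf{L}\Bigl( \bigl( \mathcal{N}^{\theta,l_0}_{\mathfrak{M}_{a,l_1},\mathfrak{M}_{a,l_2},\dots,\mathfrak{M}_{a,l_{L-1}},\operatorname{id}_{\mathbb{R}^{l_L}}} \bigr)(x_m), y_m \Bigr) \Biggr], \tag{9.276}
\]
and let $\Theta \in C([0,\infty),\mathbb{R}^d)$ satisfy

\[
  \liminf_{t \to \infty} \|\Theta_t\|_2 < \infty
  \qquad \text{and} \qquad
  \forall\, t \in [0,\infty) \colon \Theta_t = \Theta_0 - \int_0^t (\nabla\mathcal{L})(\Theta_s) \, ds \tag{9.277}
\]

(cf. Definitions 1.1.3, 1.2.1, 3.3.4, and 9.7.1). Then there exist $\vartheta \in \mathbb{R}^d$, $c, \beta \in (0,\infty)$ such
that for all t ∈ (0, ∞) it holds that

\[
  \|\Theta_t - \vartheta\|_2 \leq c t^{-\beta}, \qquad
  0 \leq \mathcal{L}(\Theta_t) - \mathcal{L}(\vartheta) \leq c t^{-1}, \qquad \text{and} \qquad
  (\nabla\mathcal{L})(\vartheta) = 0. \tag{9.278}
\]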

Proof of Theorem 9.14.1. Note that Corollary 9.13.5 demonstrates that L is analytic.
Combining this with Corollary 9.11.7 establishes (9.278). The proof of Theorem 9.14.1 is
thus complete.

Lemma 9.14.2. Let a : R → R be the softplus activation function (cf. Definition 1.2.11).
Then a is analytic (cf. Definition 9.7.1).

Proof of Lemma 9.14.2. Throughout this proof, let f : R → (0, ∞) satisfy for all x ∈ R that
f (x) = 1 + exp(x). Observe that the fact that R ∋ x 7→ exp(x) ∈ R is analytic implies that
f is analytic (cf. Definition 9.7.1). Combining this and the fact that (0, ∞) ∋ x 7→ ln(x) ∈ R
is analytic with Proposition 9.13.1 and (1.47) demonstrates that a is analytic. The proof of
Lemma 9.14.2 is thus complete.
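
Analyticity of the softplus function means in particular that it agrees locally with a convergent power series. The following sketch compares softplus, written as the composition of ln with x ↦ 1 + exp(x) as in the proof above, with its degree-6 Taylor polynomial at 0. The coefficients used here are obtained by integrating the Taylor series of the standard logistic function; they are stated as part of this illustration and are not taken from the text.

```python
import math


def softplus(x: float) -> float:
    """Softplus as the composition ln o f with f(x) = 1 + exp(x)."""
    return math.log(1.0 + math.exp(x))


def softplus_taylor(x: float) -> float:
    """Degree-6 Taylor polynomial of softplus at 0: ln 2 + x/2 + x^2/8 - x^4/192 + x^6/2880."""
    return math.log(2.0) + x / 2.0 + x ** 2 / 8.0 - x ** 4 / 192.0 + x ** 6 / 2880.0


points = [-0.5 + 0.05 * i for i in range(21)]   # grid on [-0.5, 0.5]
max_error = max(abs(softplus(x) - softplus_taylor(x)) for x in points)
```

On this interval the truncation error is governed by the omitted term of order x^8, so `max_error` should be well below 1e-6.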

Lemma 9.14.3. Let d ∈ N and let L be the mean squared error loss function based
on Rd ∋ x 7→ ∥x∥2 ∈ [0, ∞) (cf. Definitions 3.3.4 and 5.4.2). Then L is analytic (cf.
Definition 9.7.1).

Proof of Lemma 9.14.3. Note that Lemma 5.4.3 ensures that L is analytic (cf. Defini-
tion 9.7.1). The proof of Lemma 9.14.3 is thus complete.

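Corollary 9.14.4 (Empirical risk minimization for ANNs with softplus activation). Let
$L, d \in \mathbb{N}\setminus\{1\}$, $M, l_0, l_1, \dots, l_L \in \mathbb{N}$, $x_1, x_2, \dots, x_M \in \mathbb{R}^{l_0}$, $y_1, y_2, \dots, y_M \in \mathbb{R}^{l_L}$ satisfy
$d = \sum_{k=1}^{L} l_k(l_{k-1}+1)$, let a be the softplus activation function, let $\mathcal{L} \colon \mathbb{R}^d \to \mathbb{R}$ satisfy for
all $\theta \in \mathbb{R}^d$ that
\[
  \mathcal{L}(\theta) = \frac{1}{M} \Biggl[ \sum_{m=1}^{M} \bigl\| y_m - \bigl( \mathcal{N}^{\theta,l_0}_{\mathfrak{M}_{a,l_1},\mathfrak{M}_{a,l_2},\dots,\mathfrak{M}_{a,l_{L-1}},\operatorname{id}_{\mathbb{R}^{l_L}}} \bigr)(x_m) \bigr\|_2^2 \Biggr], \tag{9.279}
\]
and let $\Theta \in C([0,\infty),\mathbb{R}^d)$ satisfy

\[
  \liminf_{t \to \infty} \|\Theta_t\|_2 < \infty
  \qquad \text{and} \qquad
  \forall\, t \in [0,\infty) \colon \Theta_t = \Theta_0 - \int_0^t (\nabla\mathcal{L})(\Theta_s) \, ds \tag{9.280}
\]

(cf. Definitions 1.1.3, 1.2.1, 1.2.11, and 3.3.4). Then there exist $\vartheta \in \mathbb{R}^d$, $c, \beta \in (0,\infty)$ such
that for all t ∈ (0, ∞) it holds that

\[
  \|\Theta_t - \vartheta\|_2 \leq c t^{-\beta}, \qquad
  0 \leq \mathcal{L}(\Theta_t) - \mathcal{L}(\vartheta) \leq c t^{-1}, \qquad \text{and} \qquad
  (\nabla\mathcal{L})(\vartheta) = 0. \tag{9.281}
\]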

Proof of Corollary 9.14.4. Observe that Lemma 9.14.2, Lemma 9.14.3, and Theorem 9.14.1
establish (9.281). The proof of Corollary 9.14.4 is thus complete.
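
The setting of Corollary 9.14.4 can be imitated numerically on a very small scale. The sketch below trains a network with $l_0 = l_1 = l_2 = 1$ and softplus activation on three made-up data points, using explicit Euler steps as a stand-in for the GF trajectory. The data, the initial value, the step size, and the number of steps are all arbitrary assumptions of this illustration, and an Euler discretization is of course not covered by the corollary itself.

```python
import math


def softplus(z: float) -> float:
    return math.log1p(math.exp(z))


def sigmoid(z: float) -> float:
    """Derivative of softplus."""
    return 1.0 / (1.0 + math.exp(-z))


XS = [-1.0, 0.0, 1.0]   # made-up inputs
YS = [0.5, 1.0, 2.0]    # made-up outputs


def risk(theta):
    """Mean squared error of the network x -> w2 * softplus(w1 * x + b1) + b2, cf. (9.279)."""
    w1, b1, w2, b2 = theta
    return sum((y - (w2 * softplus(w1 * x + b1) + b2)) ** 2 for x, y in zip(XS, YS)) / len(XS)


def risk_gradient(theta):
    """Gradient of the risk with respect to (w1, b1, w2, b2), computed by the chain rule."""
    w1, b1, w2, b2 = theta
    g = [0.0, 0.0, 0.0, 0.0]
    for x, y in zip(XS, YS):
        z = w1 * x + b1
        r = (w2 * softplus(z) + b2) - y      # residual of the prediction
        s = sigmoid(z)
        g[0] += 2.0 * r * w2 * s * x
        g[1] += 2.0 * r * w2 * s
        g[2] += 2.0 * r * softplus(z)
        g[3] += 2.0 * r
    return [gi / len(XS) for gi in g]


def euler_gradient_flow(theta, step, n_steps):
    """Explicit Euler discretization of Theta' = -grad risk(Theta); returns final point and risk history."""
    risks = [risk(theta)]
    for _ in range(n_steps):
        g = risk_gradient(theta)
        theta = [t - step * gi for t, gi in zip(theta, g)]
        risks.append(risk(theta))
    return theta, risks


theta_final, risks = euler_gradient_flow([0.5, 0.0, 0.5, 0.0], step=0.01, n_steps=5000)
```

Inspecting `risks` shows how the risk evolves along the discretized trajectory; whether the limit is a global minimizer is exactly the kind of question that the corollary leaves open.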
Remark 9.14.5 (Convergence to a good suboptimal critical point whose risk value is close
to the optimal risk value). Corollary 9.14.4 establishes convergence of a non-divergent GF
trajectory in the training of fully-connected feedforward ANNs to a critical point ϑ ∈ Rd of
the objective function. In several scenarios in the training of ANNs such limiting critical
points seem to be with high probability not global minimum points but suboptimal critical
points at which the value of the objective function is, however, not far away from the
minimal value of the objective function (cf. Ibragimov et al. [216] and also [144, 409]). In
view of this, there has been an increased interest in landscape analyses associated to the
objective function to gather more information on critical points of the objective function
(cf., for instance, [12, 72, 79, 80, 92, 113, 141, 215, 216, 239, 312, 357, 358, 365, 381–383,
400, 435, 436] and the references therein).
In most cases it remains an open problem to rigorously prove that the value
of the objective function at the limiting critical point is indeed with high probability close
to the minimal/infimal value¹ of the objective function and thereby to establish a full
convergence analysis. However, in the so-called overparametrized regime where there are
many more ANN parameters than input-output training data pairs, several convergence
analyses for the training of ANNs have been achieved (cf., for instance, [74, 75, 114, 218]
and the references therein).

¹It is of interest to note that it seems to strongly depend on the activation function, the architecture of
the ANN, and the underlying probability distribution of the data of the considered learning problem whether
the infimal value of the objective function is also a minimal value of the objective function or whether there
exists no minimal value of the objective function (cf., for example, [99, 142] and Remark 9.14.7 below).

Remark 9.14.6 (Almost surely excluding strict saddle points). We also note that in several
situations it has been shown that the limiting critical point of the considered GF trajectory
with random initialization or of the considered GD process with random initialization is
almost surely not a saddle point but a local minimizer; cf., for example, [71, 265, 266,
322, 323].
Remark 9.14.7 (A priori bounds and existence of minimizers). Under the assumption that
the considered GF trajectory is non-divergent in the sense that
\[
  \liminf_{t \to \infty} \|\Theta_t\|_2 < \infty \tag{9.282}
\]
(see (9.280) above) we have that Corollary 9.14.4 establishes convergence of a GF trajectory
in the training of fully-connected feedforward ANNs to a critical point ϑ ∈ Rd of the
objective function (see (9.281) above). Such non-divergence and slightly stronger
boundedness assumptions, respectively, are very common hypotheses in convergence results
for gradient based optimization methods in the training of ANNs (cf., for instance, [2, 8,
44, 100, 101, 126, 224, 391], Section 9.11.2, and Theorem 9.14.1 in the context of the KL
approach and [93, 101, 225, 296] in the context of other approaches).
In most scenarios in the training of ANNs it remains an open problem to prove or
disprove such non-divergence and boundedness assumptions. In Gallon et al. [142] the
condition in (9.282) has been disproved and divergence of GF trajectories in the training of
shallow fully-connected feedforward ANNs has been established for specific target functions;
see also Petersen et al. [332].
The question of non-divergence of gradient based optimization methods seems to be
closely related to the question whether there exist minimizers in the optimization landscape
of the objective function. We refer to [99, 102, 224, 233] for results proving the existence
of minimizers in optimization landscapes for the training of ANNs and we refer to [142,
332] for results disproving the existence of minimizers in optimization landscapes for the
training of ANNs. We also refer to, for example, [125, 216] for strongly simplified ANN
training scenarios where non-divergence and boundedness conditions of the form (9.282)
have been established.

9.15 Fréchet subdifferentials and limiting Fréchet subdifferentials
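Definition 9.15.1 (Fréchet subgradients and limiting Fréchet subgradients). Let $d \in \mathbb{N}$,
$\mathcal{L} \in C(\mathbb{R}^d,\mathbb{R})$, $x \in \mathbb{R}^d$. Then we denote by $(\mathcal{D}\mathcal{L})(x) \subseteq \mathbb{R}^d$ the set given by
\[
  (\mathcal{D}\mathcal{L})(x) = \biggl\{ y \in \mathbb{R}^d \colon \liminf_{\mathbb{R}^d \setminus \{0\} \ni h \to 0} \biggl( \frac{\mathcal{L}(x+h) - \mathcal{L}(x) - \langle y, h \rangle}{\|h\|_2} \biggr) \geq 0 \biggr\}, \tag{9.283}
\]
we call $(\mathcal{D}\mathcal{L})(x)$ the set of Fréchet subgradients of $\mathcal{L}$ at x, we denote by $(\mathbb{D}\mathcal{L})(x) \subseteq \mathbb{R}^d$ the
set given by
\[
  (\mathbb{D}\mathcal{L})(x) = \bigcap_{\varepsilon \in (0,\infty)} \overline{\Bigl[ \bigcup\nolimits_{y \in \{z \in \mathbb{R}^d \colon \|x-z\|_2 < \varepsilon\}} (\mathcal{D}\mathcal{L})(y) \Bigr]}, \tag{9.284}
\]
and we call $(\mathbb{D}\mathcal{L})(x)$ the set of limiting Fréchet subgradients of $\mathcal{L}$ at x (cf. Definitions 1.4.7
and 3.3.4).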

Lemma 9.15.2 (Convex differentials). Let d ∈ N, L ∈ C(Rd , R), x, a ∈ Rd , b ∈ R,


ε ∈ (0, ∞) and let A : Rd → R satisfy for all y ∈ {z ∈ Rd : ∥z − x∥2 < ε} that

A(y) = ⟨a, y⟩ + b ≤ L(y) and A(x) = L(x) (9.285)

(cf. Definitions 1.4.7 and 3.3.4). Then

(i) it holds for all y ∈ {z ∈ Rd : ∥z − x∥2 < ε} that A(y) = ⟨a, y − x⟩ + L(x) and

(ii) it holds that $a \in (\mathcal{D}\mathcal{L})(x)$

(cf. Definition 9.15.1).

Proof of Lemma 9.15.2. Note that (9.285) shows for all y ∈ {z ∈ Rd : ∥z − x∥2 < ε} that

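\[
\begin{split}
  A(y) &= [A(y) - A(x)] + A(x) = [(\langle a, y \rangle + b) - (\langle a, x \rangle + b)] + A(x) \\
  &= \langle a, y-x \rangle + A(x) = \langle a, y-x \rangle + \mathcal{L}(x).
\end{split} \tag{9.286}
\]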

This establishes item (i). Observe that (9.285) and item (i) ensure for all h ∈ {z ∈ Rd : 0 <
∥z∥2 < ε} that

\[
  \frac{\mathcal{L}(x+h) - \mathcal{L}(x) - \langle a, h \rangle}{\|h\|_2} = \frac{\mathcal{L}(x+h) - A(x+h)}{\|h\|_2} \geq 0. \tag{9.287}
\]

This and (9.283) establish item (ii). The proof of Lemma 9.15.2 is thus complete.
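
A standard one-dimensional illustration of Lemma 9.15.2 is the absolute value function at the point 0: every linear function y ↦ a·y with |a| ≤ 1 lies below |·| and touches it at 0, so each such a is a Fréchet subgradient. The sketch below, which is an example added here and not part of the text, evaluates the difference quotient from (9.283) along shrinking radii for several candidate slopes.

```python
def frechet_quotient(loss, x: float, y: float, h: float) -> float:
    """Difference quotient (loss(x + h) - loss(x) - y * h) / |h| from (9.283) in dimension one."""
    return (loss(x + h) - loss(x) - y * h) / abs(h)


def smallest_quotient(loss, x: float, y: float, radii) -> float:
    """Minimum of the quotient over h = +r and h = -r for the given radii (a proxy for the lim inf)."""
    return min(frechet_quotient(loss, x, y, s * r) for r in radii for s in (1.0, -1.0))


radii = [10.0 ** (-k) for k in range(1, 9)]

# Candidate slopes in [-1, 1] should give non-negative quotients at x = 0 ...
inside = [smallest_quotient(abs, 0.0, y, radii) for y in (-1.0, -0.5, 0.0, 0.5, 1.0)]
# ... whereas slopes outside [-1, 1] should give negative quotients.
outside = [smallest_quotient(abs, 0.0, y, radii) for y in (-1.5, 1.5)]
```

For this example the quotient equals 1 − y·sign(h), so its infimum is 1 − |y|, which is non-negative exactly for |y| ≤ 1.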

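Lemma 9.15.3 (Properties of Fréchet subgradients). Let $d \in \mathbb{N}$, $\mathcal{L} \in C(\mathbb{R}^d,\mathbb{R})$. Then

(i) it holds for all $x \in \mathbb{R}^d$ that

\[
  (\mathbb{D}\mathcal{L})(x) = \Bigl\{ y \in \mathbb{R}^d \colon \exists\, z = (z_1,z_2) \colon \mathbb{N} \to \mathbb{R}^d \times \mathbb{R}^d \colon
  \Bigl( \bigl[ \forall\, k \in \mathbb{N} \colon z_2(k) \in (\mathcal{D}\mathcal{L})(z_1(k)) \bigr] \wedge
  \bigl[ \limsup\nolimits_{k \to \infty} (\|z_1(k) - x\|_2 + \|z_2(k) - y\|_2) = 0 \bigr] \Bigr) \Bigr\}, \tag{9.288}
\]

(ii) it holds for all $x \in \mathbb{R}^d$ that $(\mathcal{D}\mathcal{L})(x) \subseteq (\mathbb{D}\mathcal{L})(x)$,

(iii) it holds for all $x \in \{y \in \mathbb{R}^d \colon \mathcal{L} \text{ is differentiable at } y\}$ that $(\mathcal{D}\mathcal{L})(x) = \{(\nabla\mathcal{L})(x)\}$,

(iv) it holds for all $x \in \bigcup_{U \subseteq \mathbb{R}^d,\, U \text{ is open},\, \mathcal{L}|_U \in C^1(U,\mathbb{R})} U$ that $(\mathbb{D}\mathcal{L})(x) = \{(\nabla\mathcal{L})(x)\}$, and

(v) it holds for all $x \in \mathbb{R}^d$ that $(\mathbb{D}\mathcal{L})(x)$ is closed

(cf. Definitions 3.3.4 and 9.15.1).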

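Proof of Lemma 9.15.3. Throughout this proof, for every $x, y \in \mathbb{R}^d$ let $Z^{x,y} = (Z^{x,y}_1, Z^{x,y}_2) \colon \mathbb{N} \to \mathbb{R}^d \times \mathbb{R}^d$ satisfy for all $k \in \mathbb{N}$ that

\[
  Z^{x,y}_1(k) = x \qquad \text{and} \qquad Z^{x,y}_2(k) = y. \tag{9.289}
\]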

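Note that (9.284) proves that for all $x \in \mathbb{R}^d$, $y \in (\mathbb{D}\mathcal{L})(x)$, $\varepsilon \in (0,\infty)$ it holds that

\[
  y \in \overline{\Bigl[ \bigcup\nolimits_{v \in \{w \in \mathbb{R}^d \colon \|x-w\|_2 < \varepsilon\}} (\mathcal{D}\mathcal{L})(v) \Bigr]}. \tag{9.290}
\]

This implies that for all $x \in \mathbb{R}^d$, $y \in (\mathbb{D}\mathcal{L})(x)$ and all $\varepsilon, \delta \in (0,\infty)$ there exists
$Y \in \bigcup_{v \in \{w \in \mathbb{R}^d \colon \|x-w\|_2 < \varepsilon\}} (\mathcal{D}\mathcal{L})(v)$ such that

\[
  \|y - Y\|_2 < \delta. \tag{9.291}
\]

Hence, we obtain that for all $x \in \mathbb{R}^d$, $y \in (\mathbb{D}\mathcal{L})(x)$, $\varepsilon, \delta \in (0,\infty)$ there exist $v \in \{w \in \mathbb{R}^d \colon \|x-w\|_2 < \varepsilon\}$, $Y \in (\mathcal{D}\mathcal{L})(v)$ such that $\|y - Y\|_2 < \delta$. This demonstrates that for all
$x \in \mathbb{R}^d$, $y \in (\mathbb{D}\mathcal{L})(x)$, $\varepsilon, \delta \in (0,\infty)$ there exist $X \in \mathbb{R}^d$, $Y \in (\mathcal{D}\mathcal{L})(X)$ such that

\[
  \|x - X\|_2 < \varepsilon \qquad \text{and} \qquad \|y - Y\|_2 < \delta. \tag{9.292}
\]

Therefore, we obtain that for all $x \in \mathbb{R}^d$, $y \in (\mathbb{D}\mathcal{L})(x)$, $k \in \mathbb{N}$ there exist $z_1, z_2 \in \mathbb{R}^d$ such
that
\[
  z_2 \in (\mathcal{D}\mathcal{L})(z_1) \qquad \text{and} \qquad \|z_1 - x\|_2 + \|z_2 - y\|_2 < \tfrac{1}{k}. \tag{9.293}
\]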
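Furthermore, observe that for all $x, y \in \mathbb{R}^d$, $\varepsilon \in (0,\infty)$ and all $z = (z_1,z_2) \colon \mathbb{N} \to \mathbb{R}^d \times \mathbb{R}^d$
with $\limsup_{k \to \infty} (\|z_1(k) - x\|_2 + \|z_2(k) - y\|_2) = 0$ and $\forall\, k \in \mathbb{N} \colon z_2(k) \in (\mathcal{D}\mathcal{L})(z_1(k))$ there
exist $X, Y \in \mathbb{R}^d$ such that

\[
  Y \in (\mathcal{D}\mathcal{L})(X) \qquad \text{and} \qquad \|X - x\|_2 + \|Y - y\|_2 < \varepsilon. \tag{9.294}
\]

Hence, we obtain that for all $x, y \in \mathbb{R}^d$, $\varepsilon, \delta \in (0,\infty)$ and all $z = (z_1,z_2) \colon \mathbb{N} \to \mathbb{R}^d \times \mathbb{R}^d$
with $\limsup_{k \to \infty} (\|z_1(k) - x\|_2 + \|z_2(k) - y\|_2) = 0$ and $\forall\, k \in \mathbb{N} \colon z_2(k) \in (\mathcal{D}\mathcal{L})(z_1(k))$ there
exist $X, Y \in \mathbb{R}^d$ such that

\[
  Y \in (\mathcal{D}\mathcal{L})(X), \qquad \|x - X\|_2 < \varepsilon, \qquad \text{and} \qquad \|y - Y\|_2 < \delta. \tag{9.295}
\]

This ensures that for all $x, y \in \mathbb{R}^d$, $\varepsilon, \delta \in (0,\infty)$ and all $z = (z_1,z_2) \colon \mathbb{N} \to \mathbb{R}^d \times \mathbb{R}^d$ with
$\limsup_{k \to \infty} (\|z_1(k) - x\|_2 + \|z_2(k) - y\|_2) = 0$ and $\forall\, k \in \mathbb{N} \colon z_2(k) \in (\mathcal{D}\mathcal{L})(z_1(k))$ there
exist $v \in \{w \in \mathbb{R}^d \colon \|x-w\|_2 < \varepsilon\}$, $Y \in (\mathcal{D}\mathcal{L})(v)$ such that $\|y - Y\|_2 < \delta$. Therefore,
we obtain that for all $x, y \in \mathbb{R}^d$, $\varepsilon, \delta \in (0,\infty)$ and all $z = (z_1,z_2) \colon \mathbb{N} \to \mathbb{R}^d \times \mathbb{R}^d$ with
$\limsup_{k \to \infty} (\|z_1(k) - x\|_2 + \|z_2(k) - y\|_2) = 0$ and $\forall\, k \in \mathbb{N} \colon z_2(k) \in (\mathcal{D}\mathcal{L})(z_1(k))$ there exists
$Y \in \bigcup_{v \in \{w \in \mathbb{R}^d \colon \|x-w\|_2 < \varepsilon\}} (\mathcal{D}\mathcal{L})(v)$ such that

\[
  \|y - Y\|_2 < \delta. \tag{9.296}
\]

This establishes that for all $x, y \in \mathbb{R}^d$, $\varepsilon \in (0,\infty)$ and all $z = (z_1,z_2) \colon \mathbb{N} \to \mathbb{R}^d \times \mathbb{R}^d$ with
$\limsup_{k \to \infty} (\|z_1(k) - x\|_2 + \|z_2(k) - y\|_2) = 0$ and $\forall\, k \in \mathbb{N} \colon z_2(k) \in (\mathcal{D}\mathcal{L})(z_1(k))$ it holds
that
\[
  y \in \overline{\Bigl[ \bigcup\nolimits_{v \in \{w \in \mathbb{R}^d \colon \|x-w\|_2 < \varepsilon\}} (\mathcal{D}\mathcal{L})(v) \Bigr]}. \tag{9.297}
\]

This and (9.284) show that for all $x, y \in \mathbb{R}^d$ and all $z = (z_1,z_2) \colon \mathbb{N} \to \mathbb{R}^d \times \mathbb{R}^d$ with
$\limsup_{k \to \infty} (\|z_1(k) - x\|_2 + \|z_2(k) - y\|_2) = 0$ and $\forall\, k \in \mathbb{N} \colon z_2(k) \in (\mathcal{D}\mathcal{L})(z_1(k))$ it holds
that
\[
  y \in (\mathbb{D}\mathcal{L})(x). \tag{9.298}
\]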
Combining this with (9.293) proves item (i). Note that (9.289) implies that for all x ∈ Rd ,
y ∈ (DL)(x) it holds that
\[ \bigl(\forall\, k \in \mathbb{N} \colon Z_2^{x,y}(k) \in (DL)(Z_1^{x,y}(k))\bigr) \wedge \Bigl(\limsup_{k \to \infty} \bigl(\|Z_1^{x,y}(k) - x\|_2 + \|Z_2^{x,y}(k) - y\|_2\bigr) = 0\Bigr) \tag{9.299} \]
(cf. Definitions 3.3.4 and 9.15.1). Combining this with item (i) establishes item (ii).
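Observe that the fact that for all a ∈ R it holds that −a ≤ |a| demonstrates that for all
x ∈ {y ∈ Rd : L is differentiable at y} it holds that
\[
\begin{split}
\liminf_{\mathbb{R}^d \backslash \{0\} \ni h \to 0} \Bigl( \tfrac{L(x+h) - L(x) - \langle (\nabla L)(x), h \rangle}{\|h\|_2} \Bigr)
&\geq - \liminf_{\mathbb{R}^d \backslash \{0\} \ni h \to 0} \Bigl( \tfrac{L(x+h) - L(x) - \langle (\nabla L)(x), h \rangle}{\|h\|_2} \Bigr) \\
&\geq - \Bigl[ \limsup_{\mathbb{R}^d \backslash \{0\} \ni h \to 0} \Bigl( \tfrac{|L(x+h) - L(x) - \langle (\nabla L)(x), h \rangle|}{\|h\|_2} \Bigr) \Bigr] = 0
\end{split}
\tag{9.300}
\]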

(cf. Definition 1.4.7). This demonstrates that for all x ∈ {y ∈ Rd : L is differentiable at y}


it holds that
(∇L)(x) ∈ (DL)(x). (9.301)
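Moreover, note that for all v ∈ Rd \{0} it holds that
\[
\begin{split}
\liminf_{\mathbb{R}^d \backslash \{0\} \ni h \to 0} \Bigl( \tfrac{\langle v, h \rangle}{\|h\|_2} \Bigr)
&= \sup_{\varepsilon \in (0,\infty)} \inf_{h \in \{w \in \mathbb{R}^d \colon \|w\|_2 \leq \varepsilon\}} \Bigl( \tfrac{\langle v, h \rangle}{\|h\|_2} \Bigr) \\
&\leq \sup_{\varepsilon \in (0,\infty)} \Bigl( \tfrac{\langle v, -\varepsilon \|v\|_2^{-1} v \rangle}{\|{-\varepsilon} \|v\|_2^{-1} v\|_2} \Bigr)
= \sup_{\varepsilon \in (0,\infty)} \langle v, -\|v\|_2^{-1} v \rangle = -\|v\|_2 < 0.
\end{split}
\tag{9.302}
\]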

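Hence, we obtain for all x ∈ {y ∈ Rd : L is differentiable at y}, w ∈ (DL)(x) that
\[
\begin{split}
0 &\leq \liminf_{\mathbb{R}^d \backslash \{0\} \ni h \to 0} \Bigl( \tfrac{L(x+h) - L(x) - \langle w, h \rangle}{\|h\|_2} \Bigr) \\
&= \liminf_{\mathbb{R}^d \backslash \{0\} \ni h \to 0} \Bigl( \tfrac{L(x+h) - L(x) - \langle (\nabla L)(x), h \rangle - \langle w - (\nabla L)(x), h \rangle}{\|h\|_2} \Bigr) \\
&\leq \liminf_{\mathbb{R}^d \backslash \{0\} \ni h \to 0} \Bigl( \tfrac{|L(x+h) - L(x) - \langle (\nabla L)(x), h \rangle| + \langle (\nabla L)(x) - w, h \rangle}{\|h\|_2} \Bigr) \\
&\leq \Bigl[ \liminf_{\mathbb{R}^d \backslash \{0\} \ni h \to 0} \Bigl( \tfrac{\langle (\nabla L)(x) - w, h \rangle}{\|h\|_2} \Bigr) \Bigr] + \Bigl[ \limsup_{\mathbb{R}^d \backslash \{0\} \ni h \to 0} \Bigl( \tfrac{|L(x+h) - L(x) - \langle (\nabla L)(x), h \rangle|}{\|h\|_2} \Bigr) \Bigr] \\
&= \liminf_{\mathbb{R}^d \backslash \{0\} \ni h \to 0} \Bigl( \tfrac{\langle (\nabla L)(x) - w, h \rangle}{\|h\|_2} \Bigr)
\leq -\|(\nabla L)(x) - w\|_2.
\end{split}
\tag{9.303}
\]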


Combining this with (9.301) proves item (iii). Observe that items (ii) and (iii) ensure that
for all open U ⊆ Rd and all x ∈ U with L|U ∈ C 1 (U, R) it holds that

{(∇L)(x)} = (DL)(x) ⊆ (DL)(x). (9.304)

In addition, note that for all open U ⊆ Rd , all x ∈ U , y ∈ Rd and all z = (z1 , z2 ) : N →
Rd ×Rd with lim supk→∞ (∥z1 (k)−x∥2 +∥z2 (k)−y∥2 ) = 0 and ∀ k ∈ N : z2 (k) ∈ (DL)(z1 (k))
there exists K ∈ N such that for all k ∈ N ∩ [K, ∞) it holds that

z1 (k) ∈ U. (9.305)

Combining this with item (iii) shows that for all open U ⊆ Rd , all x ∈ U , y ∈ Rd and all
z = (z1 , z2 ) : N → Rd ×Rd with L|U ∈ C 1 (U, R), lim supk→∞ (∥z1 (k)−x∥2 +∥z2 (k)−y∥2 ) = 0
and ∀ k ∈ N : z2 (k) ∈ (DL)(z1 (k)) there exists K ∈ N such that ∀ k ∈ N∩[K, ∞) : z1 (k) ∈ U
and
lim supN∩[K,∞)∋k→∞ (∥z1 (k) − x∥2 + ∥(∇L)(z1 (k)) − y∥2 )
(9.306)
= lim supk→∞ (∥z1 (k) − x∥2 + ∥z2 (k) − y∥2 ) = 0.

This and item (i) imply that for all open U ⊆ Rd and all x ∈ U , y ∈ (DL)(x) with
L|U ∈ C 1 (U, R) it holds that
y = (∇L)(x). (9.307)
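Combining this with (9.304) establishes item (iv). Observe that (9.284) demonstrates that
for all x ∈ Rd it holds that
\[ \mathbb{R}^d \backslash ((DL)(x)) = \bigcup_{\varepsilon \in (0,\infty)} \Bigl( \mathbb{R}^d \backslash \overline{\textstyle\bigcup_{y \in \{z \in \mathbb{R}^d \colon \|x - z\|_2 < \varepsilon\}} (DL)(y)} \Bigr). \tag{9.308} \]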

Therefore, we obtain for all x ∈ Rd that Rd \((DL)(x)) is open. This proves item (v). The
proof of Lemma 9.15.3 is thus complete.
Lemma 9.15.4 (Fréchet subgradients for maxima). Let c ∈ R and let L : R → R satisfy
for all x ∈ R that L(x) = max{x, c}. Then
(i) it holds for all x ∈ (−∞, c) that (DL)(x) = {0},

(ii) it holds for all x ∈ (c, ∞) that (DL)(x) = {1}, and

(iii) it holds that (DL)(c) = [0, 1]


(cf. Definition 9.15.1).
Proof of Lemma 9.15.4. Note that item (iii) in Lemma 9.15.3 establishes items (i) and (ii).
Observe that Lemma 9.15.2 establishes

[0, 1] ⊆ (DL)(c). (9.309)


Furthermore, note that the assumption that for all x ∈ R it holds that L(x) = max{x, c}
ensures that for all a ∈ (1, ∞), h ∈ (0, ∞) it holds that
\[ \frac{L(c + h) - L(c) - ah}{|h|} = \frac{(c + h) - c - ah}{h} = 1 - a < 0. \tag{9.310} \]
Moreover, observe that the assumption that for all x ∈ R it holds that L(x) = max{x, c}
shows that for all a, h ∈ (−∞, 0) it holds that
\[ \frac{L(c + h) - L(c) - ah}{|h|} = \frac{c - c - ah}{-h} = a < 0. \tag{9.311} \]
Combining this with (9.310) demonstrates that
(DL)(c) ⊆ [0, 1]. (9.312)
This and (9.309) establish item (iii). The proof of Lemma 9.15.4 is thus complete.
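The case distinction in (9.310) and (9.311) can be illustrated numerically. The following is a minimal sketch, not part of the proof: the helper name `frechet_quotient_min`, the grid of increments, and the default values are assumptions chosen for illustration. It evaluates the difference quotient (L(c + h) − L(c) − ah)/|h| on a grid of small nonzero h for L(x) = max{x, c}; the smallest value stays nonnegative for a ∈ [0, 1] and becomes negative for a outside [0, 1].

```python
def frechet_quotient_min(a, c=0.0, radius=1e-3, steps=200):
    """Smallest value of (L(c+h) - L(c) - a*h) / |h| over a grid of
    nonzero h with |h| <= radius, where L(x) = max(x, c)."""
    L = lambda x: max(x, c)
    values = []
    for i in range(1, steps + 1):
        for sign in (-1.0, 1.0):
            h = sign * radius * i / steps
            values.append((L(c + h) - L(c) - a * h) / abs(h))
    return min(values)


# Candidates inside [0, 1] keep the quotient nonnegative (up to rounding),
# candidates outside [0, 1] produce a strictly negative quotient.
for a in (-0.5, 0.0, 0.3, 1.0, 1.5):
    print(a, frechet_quotient_min(a))
```

A grid evaluation of this kind can only suggest the sign of the limit inferior; the rigorous statement is the one proved above.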
Lemma 9.15.5 (Limits of limiting Fréchet subgradients). Let d ∈ N, L ∈ C(Rd , R), let
(xk )k∈N0 ⊆ Rd and (yk )k∈N0 ⊆ Rd satisfy
lim supk→∞ (∥xk − x0 ∥2 + ∥yk − y0 ∥2 ) = 0, (9.313)
and assume for all k ∈ N that yk ∈ (DL)(xk ) (cf. Definitions 3.3.4 and 9.15.1). Then
y0 ∈ (DL)(x0 ).
Proof of Lemma 9.15.5. Note that item (i) in Lemma 9.15.3 and the fact that for all k ∈ N
it holds that yk ∈ (DL)(xk ) imply that for every k ∈ N there exists $z^{(k)} = (z_1^{(k)}, z_2^{(k)}) \colon \mathbb{N} \to \mathbb{R}^d \times \mathbb{R}^d$ which satisfies for all v ∈ N that
\[ z_2^{(k)}(v) \in (DL)(z_1^{(k)}(v)) \qquad\text{and}\qquad \limsup_{w \to \infty} \bigl( \|z_1^{(k)}(w) - x_k\|_2 + \|z_2^{(k)}(w) - y_k\|_2 \bigr) = 0. \tag{9.314} \]


Observe that (9.314) demonstrates that there exists v = (vk )k∈N : N → N which satisfies for
all k ∈ N that
\[ \|z_1^{(k)}(v_k) - x_k\|_2 + \|z_2^{(k)}(v_k) - y_k\|_2 \leq 2^{-k}. \tag{9.315} \]
Next let Z = (Z1 , Z2 ) : N → Rd × Rd satisfy for all j ∈ {1, 2}, k ∈ N that
\[ Z_j(k) = z_j^{(k)}(v_k). \tag{9.316} \]
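Note that (9.314), (9.315), (9.316), and the assumption that lim supk→∞ (∥xk − x0 ∥2 + ∥yk − y0 ∥2 ) = 0 prove that
\[
\begin{split}
&\limsup_{k \to \infty} \bigl( \|Z_1(k) - x_0\|_2 + \|Z_2(k) - y_0\|_2 \bigr) \\
&\leq \limsup_{k \to \infty} \bigl( \|Z_1(k) - x_k\|_2 + \|Z_2(k) - y_k\|_2 \bigr) + \limsup_{k \to \infty} \bigl( \|x_k - x_0\|_2 + \|y_k - y_0\|_2 \bigr) \\
&= \limsup_{k \to \infty} \bigl( \|Z_1(k) - x_k\|_2 + \|Z_2(k) - y_k\|_2 \bigr) \\
&= \limsup_{k \to \infty} \bigl( \|z_1^{(k)}(v_k) - x_k\|_2 + \|z_2^{(k)}(v_k) - y_k\|_2 \bigr)
\leq \limsup_{k \to \infty} 2^{-k} = 0.
\end{split}
\tag{9.317}
\]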



Furthermore, observe that (9.314) and (9.316) establish that for all k ∈ N it holds that
Z2 (k) ∈ (DL)(Z1 (k)). Combining this and (9.317) with item (i) in Lemma 9.15.3 proves
that y0 ∈ (DL)(x0 ). The proof of Lemma 9.15.5 is thus complete.
Exercise 9.15.1. Prove or disprove the following statement: It holds for all d ∈ N, L ∈
C 1 (Rd , R), x ∈ Rd that (DL)(x) = (DL)(x) (cf. Definition 9.15.1).
Exercise 9.15.2. Prove or disprove the following statement: There exists d ∈ N such that
for all L ∈ C(Rd , R), x ∈ Rd it holds that (DL)(x) ⊆ (DL)(x) (cf. Definition 9.15.1).
Exercise 9.15.3. Prove or disprove the following statement: It holds for all d ∈ N, L ∈
C(Rd , R), x ∈ Rd that (DL)(x) is convex (cf. Definition 9.15.1).
Exercise 9.15.4. Prove or disprove the following statement: It holds for all d ∈ N, L ∈
C(Rd , R), x ∈ Rd that (DL)(x) is convex (cf. Definition 9.15.1).
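Exercise 9.15.5. For every α ∈ (0, ∞), s ∈ {−1, 1} let $L_{\alpha,s} \colon \mathbb{R} \to \mathbb{R}$ satisfy for all x ∈ R
that
\[ L_{\alpha,s}(x) = \begin{cases} x & \colon x > 0 \\ s|x|^{\alpha} & \colon x \leq 0. \end{cases} \tag{9.318} \]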

For every α ∈ (0, ∞), s ∈ {−1, 1}, x ∈ R specify (DLα,s )(x) and (DLα,s )(x) explicitly and
prove that your results are correct (cf. Definition 9.15.1)!

9.16 Non-smooth slope


Definition 9.16.1 (Non-smooth slope). Let d ∈ N, L ∈ C(Rd , R). Then we denote by
SL : Rd → [0, ∞] the function which satisfies for all θ ∈ Rd that
\[ S_L(\theta) = \inf\bigl( \{ r \in \mathbb{R} \colon (\exists\, h \in (DL)(\theta) \colon r = \|h\|_2) \} \cup \{\infty\} \bigr) \tag{9.319} \]
and we call SL the non-smooth slope of L (cf. Definitions 3.3.4 and 9.15.1).

9.17 Generalized KL functions


Definition 9.17.1 (Generalized KL inequalities). Let d ∈ N, c ∈ R, α ∈ (0, ∞), L ∈
C(Rd , R), let U ⊆ Rd be a set, and let θ ∈ U . Then we say that L satisfies the generalized
KL inequality at θ on U with exponent α and constant c (we say that L satisfies the
generalized KL inequality at θ) if and only if for all ϑ ∈ U it holds that

|L(θ) − L(ϑ)|α ≤ c |SL (ϑ)| (9.320)

(cf. Definition 9.16.1).


Definition 9.17.2 (Generalized KL functions). Let d ∈ N, L ∈ C(Rd , R). Then we say


that L is a generalized KL function if and only if for all θ ∈ Rd there exist ε, c ∈ (0, ∞),
α ∈ (0, 1) such that for all ϑ ∈ {v ∈ Rd : ∥v − θ∥2 < ε} it holds that

|L(θ) − L(ϑ)|α ≤ c |SL (ϑ)| (9.321)

(cf. Definitions 3.3.4 and 9.16.1).

Remark 9.17.3 (Examples and convergence results for generalized KL functions). In Theo-
rem 9.9.1 and Corollary 9.13.5 above we have seen that in the case of an analytic activation
function we have that the associated empirical risk function is also analytic and therefore
a standard KL function. In deep learning algorithms often deep ANNs with non-analytic
activation functions such as the ReLU activation (cf. Section 1.2.3) and the leaky ReLU
activation (cf. Section 1.2.11) are used. In the case of such non-differentiable activation
functions, the associated risk function is typically not a standard KL function. However,
under suitable assumptions on the target function and the underlying probability measure of
the input data of the considered learning problem, using Bolte et al. [44, Theorem 3.1] one
can verify in the case of such non-differentiable activation functions that the risk function
is a generalized KL function in the sense of Definition 9.17.2 above; cf., for instance, [126,
224]. Similarly to the case of standard KL functions (cf., for example, Dereich & Kassing [100] and
Sections 9.11 and 9.12), one can then also develop a convergence theory for gradient based
optimization methods for generalized KL functions (cf., for instance, Bolte et al. [44, Section
4] and Corollary 9.11.5).
Remark 9.17.4 (Further convergence analyses). We refer, for example, to [2, 7, 8, 44, 100,
391] and the references therein for convergence analyses under KL-type conditions for
gradient based optimization methods in the literature. Beyond the KL approach reviewed
in this chapter there are also several other approaches in the literature with which one
can conclude convergence of gradient based optimization methods to suitable generalized
critical points; cf., for instance, [45, 65, 93] and the references therein.

Chapter 10

ANNs with batch normalization

In data-driven learning problems, batch normalization (BN) methods are popular techniques
that aim to accelerate ANN training procedures. In this chapter we rigorously review such
methods in detail. In the literature, BN methods were first introduced in Ioffe & Szegedy [217].
Further investigations of BN techniques and applications of such methods can, for
example, be found in [4, Section 12.3.3], [131, Section 6.2.3], [164, Section 8.7.1], and [40,
364].

10.1 Batch normalization (BN)


Definition 10.1.1 (Batch). Let d, M ∈ N. Then we say that x is a batch of d-dimensional
data points of size M (we say that x is a batch of M d-dimensional data points, we say that
x is a batch) if and only if it holds that x ∈ (Rd )M .
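Definition 10.1.2 (Batch mean). Let d, M ∈ N, $x = (x^{(m)})_{m \in \{1,2,\dots,M\}} \in (\mathbb{R}^d)^M$. Then we
denote by $\operatorname{Batchmean}(x) = (\operatorname{Batchmean}_1(x), \dots, \operatorname{Batchmean}_d(x)) \in \mathbb{R}^d$ the vector given by
\[ \operatorname{Batchmean}(x) = \frac{1}{M} \Bigl[ \sum_{m=1}^{M} x^{(m)} \Bigr] \tag{10.1} \]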

and we call Batchmean(x) the batch mean of the batch x.


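Definition 10.1.3 (Batch variance). Let d, M ∈ N, $x = ((x_i^{(m)})_{i \in \{1,2,\dots,d\}})_{m \in \{1,2,\dots,M\}} \in (\mathbb{R}^d)^M$. Then we denote by
\[ \operatorname{Batchvar}(x) = (\operatorname{Batchvar}_1(x), \dots, \operatorname{Batchvar}_d(x)) \in \mathbb{R}^d \tag{10.2} \]
the vector which satisfies for all i ∈ {1, 2, . . . , d} that
\[ \operatorname{Batchvar}_i(x) = \frac{1}{M} \Bigl[ \sum_{m=1}^{M} \bigl( x_i^{(m)} - \operatorname{Batchmean}_i(x) \bigr)^2 \Bigr] \tag{10.3} \]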


and we call Batchvar(x) the batch variance of the batch x (cf. Definition 10.1.2).
Lemma 10.1.4. Let d, M ∈ N, $x = (x^{(m)})_{m \in \{1,2,\dots,M\}} = ((x_i^{(m)})_{i \in \{1,2,\dots,d\}})_{m \in \{1,2,\dots,M\}} \in (\mathbb{R}^d)^M$,
let (Ω, F, P) be a probability space, and let U : Ω → {1, 2, . . . , M } be a {1, 2, . . . , M }-
uniformly distributed random variable. Then
(i) it holds that $\operatorname{Batchmean}(x) = \mathbb{E}\bigl[x^{(U)}\bigr]$ and
(ii) it holds for all i ∈ {1, 2, . . . , d} that $\operatorname{Batchvar}_i(x) = \operatorname{Var}(x_i^{(U)})$.
Proof of Lemma 10.1.4. Note that (10.1) proves item (i). Furthermore, note that item (i)
and (10.3) establish item (ii). The proof of Lemma 10.1.4 is thus complete.
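As a small illustration of (10.1) and (10.3), the following sketch computes the batch mean and the batch variance in plain Python. The function names `batch_mean` and `batch_var` and the example batch are assumptions made for illustration; note that the variance divides by M (not by M − 1), in line with item (ii) of Lemma 10.1.4.

```python
def batch_mean(x):
    """Componentwise mean of a batch x = [x^(1), ..., x^(M)] of
    d-dimensional data points, cf. (10.1)."""
    M, d = len(x), len(x[0])
    return [sum(x[m][i] for m in range(M)) / M for i in range(d)]


def batch_var(x):
    """Componentwise variance of the batch with division by M, cf. (10.3)."""
    M, d = len(x), len(x[0])
    mu = batch_mean(x)
    return [sum((x[m][i] - mu[i]) ** 2 for m in range(M)) / M for i in range(d)]


# Hypothetical batch of M = 4 points in R^2.
batch = [[0.0, 1.0], [1.0, 0.0], [2.0, -2.0], [-1.0, 1.0]]
print(batch_mean(batch))  # [0.5, 0.0]
print(batch_var(batch))   # [1.25, 1.5]
```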
Definition 10.1.5 (BN operations for given batch mean and batch variance). Let d ∈ N,
ε ∈ (0, ∞), β = (β1 , . . . , βd ), γ = (γ1 , . . . , γd ), µ = (µ1 , . . . , µd ) ∈ Rd , V = (V1 , . . . , Vd ) ∈
[0, ∞)d . Then we denote by

batchnormβ,γ,µ,V,ε : Rd → Rd (10.4)

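the function which satisfies for all x = (x1 , . . . , xd ) ∈ Rd that
\[ \operatorname{batchnorm}_{\beta,\gamma,\mu,V,\varepsilon}(x) = \Bigl( \gamma_i \Bigl[ \frac{x_i - \mu_i}{\sqrt{V_i + \varepsilon}} \Bigr] + \beta_i \Bigr)_{i \in \{1,2,\dots,d\}} \tag{10.5} \]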

and we call batchnormβ,γ,µ,V,ε the BN operation with mean parameter β, standard deviation
parameter γ, and regularization parameter ε given the batch mean µ and batch variance V .
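Definition 10.1.6 (Batch normalization). Let d ∈ N, ε ∈ (0, ∞), β, γ ∈ Rd . Then we
denote by
\[ \operatorname{Batchnorm}_{\beta,\gamma,\varepsilon} \colon \Bigl( \bigcup_{M \in \mathbb{N}} (\mathbb{R}^d)^M \Bigr) \to \Bigl( \bigcup_{M \in \mathbb{N}} (\mathbb{R}^d)^M \Bigr) \tag{10.6} \]
the function which satisfies for all M ∈ N, $x = (x^{(m)})_{m \in \{1,2,\dots,M\}} \in (\mathbb{R}^d)^M$ that
\[ \operatorname{Batchnorm}_{\beta,\gamma,\varepsilon}(x) = \bigl( \operatorname{batchnorm}_{\beta,\gamma,\operatorname{Batchmean}(x),\operatorname{Batchvar}(x),\varepsilon}(x^{(m)}) \bigr)_{m \in \{1,2,\dots,M\}} \in (\mathbb{R}^d)^M \tag{10.7} \]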

and we call Batchnormβ,γ,ε the BN with mean parameter β, standard deviation parameter
γ, and regularization parameter ε (cf. Definitions 10.1.2, 10.1.3, and 10.1.5).
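Lemma 10.1.7. Let d, M ∈ N, β = (β1 , . . . , βd ), γ = (γ1 , . . . , γd ) ∈ Rd . Then
(i) it holds for all ε ∈ (0, ∞), $x = ((x_i^{(m)})_{i \in \{1,2,\dots,d\}})_{m \in \{1,2,\dots,M\}} \in (\mathbb{R}^d)^M$ that
\[ \operatorname{Batchnorm}_{\beta,\gamma,\varepsilon}(x) = \Bigl( \Bigl( \gamma_i \Bigl[ \frac{x_i^{(m)} - \operatorname{Batchmean}_i(x)}{\sqrt{\operatorname{Batchvar}_i(x) + \varepsilon}} \Bigr] + \beta_i \Bigr)_{i \in \{1,2,\dots,d\}} \Bigr)_{m \in \{1,2,\dots,M\}}, \tag{10.8} \]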


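(ii) it holds for all ε ∈ (0, ∞), x ∈ (Rd )M that
\[ \operatorname{Batchmean}(\operatorname{Batchnorm}_{\beta,\gamma,\varepsilon}(x)) = \beta, \tag{10.9} \]
and
(iii) it holds for all $x = ((x_i^{(m)})_{i \in \{1,2,\dots,d\}})_{m \in \{1,2,\dots,M\}} \in (\mathbb{R}^d)^M$, i ∈ {1, 2, . . . , d} with
$\#\bigl(\bigcup_{m=1}^{M} \{x_i^{(m)}\}\bigr) > 1$ that
\[ \limsup_{\varepsilon \searrow 0} \bigl| \operatorname{Batchvar}_i(\operatorname{Batchnorm}_{\beta,\gamma,\varepsilon}(x)) - (\gamma_i)^2 \bigr| = 0 \tag{10.10} \]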

(cf. Definitions 10.1.2, 10.1.3, and 10.1.6).


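Proof of Lemma 10.1.7. Note that (10.1), (10.3), (10.5), and (10.7) establish item (i). In
addition, note that item (i) ensures that for all ε ∈ (0, ∞), $x = ((x_i^{(m)})_{i \in \{1,2,\dots,d\}})_{m \in \{1,2,\dots,M\}} \in (\mathbb{R}^d)^M$,
i ∈ {1, 2, . . . , d} it holds that
\[
\begin{split}
\operatorname{Batchmean}_i(\operatorname{Batchnorm}_{\beta,\gamma,\varepsilon}(x))
&= \frac{1}{M} \sum_{m=1}^{M} \Bigl( \gamma_i \Bigl[ \frac{x_i^{(m)} - \operatorname{Batchmean}_i(x)}{\sqrt{\operatorname{Batchvar}_i(x) + \varepsilon}} \Bigr] + \beta_i \Bigr) \\
&= \gamma_i \Bigl[ \frac{\frac{1}{M} \bigl( \sum_{m=1}^{M} x_i^{(m)} \bigr) - \operatorname{Batchmean}_i(x)}{\sqrt{\operatorname{Batchvar}_i(x) + \varepsilon}} \Bigr] + \beta_i \\
&= \gamma_i \Bigl[ \frac{\operatorname{Batchmean}_i(x) - \operatorname{Batchmean}_i(x)}{\sqrt{\operatorname{Batchvar}_i(x) + \varepsilon}} \Bigr] + \beta_i = \beta_i
\end{split}
\tag{10.11}
\]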
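(cf. Definitions 10.1.2, 10.1.3, and 10.1.6). This implies item (ii). Furthermore, observe that
(10.11) and item (i) ensure that for all ε ∈ (0, ∞), $x = ((x_i^{(m)})_{i \in \{1,2,\dots,d\}})_{m \in \{1,2,\dots,M\}} \in (\mathbb{R}^d)^M$,
i ∈ {1, 2, . . . , d} it holds that
\[
\begin{split}
&\operatorname{Batchvar}_i(\operatorname{Batchnorm}_{\beta,\gamma,\varepsilon}(x)) \\
&= \frac{1}{M} \sum_{m=1}^{M} \Bigl[ \gamma_i \Bigl[ \frac{x_i^{(m)} - \operatorname{Batchmean}_i(x)}{\sqrt{\operatorname{Batchvar}_i(x) + \varepsilon}} \Bigr] + \beta_i - \operatorname{Batchmean}_i(\operatorname{Batchnorm}_{\beta,\gamma,\varepsilon}(x)) \Bigr]^2 \\
&= (\gamma_i)^2 \, \frac{1}{M} \sum_{m=1}^{M} \Bigl[ \frac{x_i^{(m)} - \operatorname{Batchmean}_i(x)}{\sqrt{\operatorname{Batchvar}_i(x) + \varepsilon}} \Bigr]^2 \\
&= (\gamma_i)^2 \Bigl[ \frac{\frac{1}{M} \sum_{m=1}^{M} (x_i^{(m)} - \operatorname{Batchmean}_i(x))^2}{\operatorname{Batchvar}_i(x) + \varepsilon} \Bigr]
= (\gamma_i)^2 \Bigl[ \frac{\operatorname{Batchvar}_i(x)}{\operatorname{Batchvar}_i(x) + \varepsilon} \Bigr].
\end{split}
\tag{10.12}
\]
Combining this with the fact that for all $x = ((x_i^{(m)})_{i \in \{1,2,\dots,d\}})_{m \in \{1,2,\dots,M\}} \in (\mathbb{R}^d)^M$, i ∈ {1, 2, . . . , d}
with $\#\bigl(\bigcup_{m=1}^{M} \{x_i^{(m)}\}\bigr) > 1$ it holds that
\[ \operatorname{Batchvar}_i(x) > 0 \tag{10.13} \]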


implies item (iii). The proof of Lemma 10.1.7 is thus complete.
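The identities (10.11) and (10.12) can be checked numerically on a small batch. The sketch below is only an illustration under stated assumptions: the helper names `mean_and_var` and `batchnorm` and the example values for β, γ, and ε are hypothetical and not taken from the text. It implements formula (10.8) and compares the batch mean of the output with β and the batch variance of the output with (γ_i)² Batchvar_i(x)/(Batchvar_i(x) + ε).

```python
import math


def mean_and_var(x):
    """Batch mean and batch variance (division by M), cf. (10.1) and (10.3)."""
    M, d = len(x), len(x[0])
    mean = [sum(p[i] for p in x) / M for i in range(d)]
    var = [sum((p[i] - mean[i]) ** 2 for p in x) / M for i in range(d)]
    return mean, var


def batchnorm(x, beta, gamma, eps):
    """Batch normalization of a batch x, following formula (10.8)."""
    mean, var = mean_and_var(x)
    return [[gamma[i] * (p[i] - mean[i]) / math.sqrt(var[i] + eps) + beta[i]
             for i in range(len(p))] for p in x]


# Hypothetical example data.
x = [[0.0, 1.0], [1.0, 0.0], [2.0, -2.0], [-1.0, 1.0]]
beta, gamma = [1.0, -2.0], [3.0, 0.5]
_, var_x = mean_and_var(x)
for eps in (1e-2, 1e-8):
    out_mean, out_var = mean_and_var(batchnorm(x, beta, gamma, eps))
    predicted = [gamma[i] ** 2 * var_x[i] / (var_x[i] + eps) for i in range(2)]
    print(eps, out_mean, out_var, predicted)
```

As ε decreases, the printed output variances approach (γ_i)², which is the content of item (iii).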


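10.2 Structured description of fully-connected feedforward ANNs with BN for training

Definition 10.2.1 (Structured description of fully-connected feedforward ANNs with BN).
We denote by B the set given by
\[ \mathbf{B} = \bigcup_{L \in \mathbb{N}} \bigcup_{l_0, l_1, \dots, l_L \in \mathbb{N}} \bigcup_{N \subseteq \{0,1,\dots,L\}} \Bigl( \Bigl( \times_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \Bigr) \times \Bigl( \times_{k \in N} (\mathbb{R}^{l_k})^2 \Bigr) \Bigr). \tag{10.14} \]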

Definition 10.2.2 (Fully-connected feedforward ANNs with BN). We say that Φ is a


fully-connected feedforward ANN with BN if and only if it holds that

Φ∈B (10.15)

(cf. Definition 10.2.1).

10.3 Realizations of fully-connected feedforward ANNs with BN for training

In the next definition we apply the multidimensional version of Definition 1.2.1 with batches
as input. For this we implicitly identify batches with matrices. This identification is
exemplified in the following exercise.
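Exercise 10.3.1. Let l0 = 2, l1 = 3, M = 4, $W \in \mathbb{R}^{l_1 \times l_0}$, $B \in \mathbb{R}^{l_1}$, $y \in (\mathbb{R}^{l_0})^M$, $x \in (\mathbb{R}^{l_1})^M$
satisfy
\[ W = \begin{pmatrix} 3 & -1 \\ -1 & 3 \\ 3 & -1 \end{pmatrix}, \qquad B = \begin{pmatrix} 1 \\ -1 \\ 1 \end{pmatrix}, \qquad y = \Bigl( \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 2 \\ -2 \end{pmatrix}, \begin{pmatrix} -1 \\ 1 \end{pmatrix} \Bigr), \tag{10.16} \]
and $x = \mathfrak{M}_{\mathfrak{r}, l_1, M}(W y + (B, B, B, B))$ (cf. Definitions 1.2.1 and 1.2.4). Prove the following
statement: It holds that
\[ x = \Bigl( \begin{pmatrix} 0 \\ 2 \\ 0 \end{pmatrix}, \begin{pmatrix} 4 \\ 0 \\ 4 \end{pmatrix}, \begin{pmatrix} 9 \\ 0 \\ 9 \end{pmatrix}, \begin{pmatrix} 0 \\ 3 \\ 0 \end{pmatrix} \Bigr). \tag{10.17} \]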
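The identification of batches with matrices used in this exercise can be made concrete in a few lines of Python. The following sketch is only a numerical sanity check and not a proof; the helper names `relu` and `affine` are assumptions for illustration. It applies the affine map y ↦ W y + B and then the ReLU activation to every data point of the batch separately, which corresponds to the columnwise reading of W y + (B, B, B, B).

```python
def relu(v):
    """Componentwise ReLU activation."""
    return [max(t, 0.0) for t in v]


def affine(W, B, v):
    """Matrix-vector product W v plus bias B."""
    return [sum(W[i][j] * v[j] for j in range(len(v))) + B[i]
            for i in range(len(W))]


W = [[3.0, -1.0], [-1.0, 3.0], [3.0, -1.0]]
B = [1.0, -1.0, 1.0]
y = [[0.0, 1.0], [1.0, 0.0], [2.0, -2.0], [-1.0, 1.0]]

# Each entry of the batch is processed independently of the others.
x = [relu(affine(W, B, point)) for point in y]
print(x)
```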

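Definition 10.3.1 (Realizations associated to fully-connected feedforward ANNs with BN).
Let ε ∈ (0, ∞), a ∈ C(R, R). Then we denote by
\[ \mathcal{R}^{\mathbf{B}}_{a,\varepsilon} \colon \mathbf{B} \to \Bigl( \bigcup_{k, l \in \mathbb{N}} C\Bigl( \bigcup_{M \in \mathbb{N}} (\mathbb{R}^k)^M, \bigcup_{M \in \mathbb{N}} (\mathbb{R}^l)^M \Bigr) \Bigr) \tag{10.18} \]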

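the function which satisfies for all L, M ∈ N, l0 , l1 , . . . , lL ∈ N, N ⊆ {0, 1, . . . , L},
$\Phi = (((W_k, B_k))_{k \in \{1,2,\dots,L\}}, ((\beta_k, \gamma_k))_{k \in N}) \in \bigl( \times_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \bigr) \times \bigl( \times_{k \in N} (\mathbb{R}^{l_k})^2 \bigr)$,
$x_0, y_0 \in (\mathbb{R}^{l_0})^M$, $x_1, y_1 \in (\mathbb{R}^{l_1})^M$, . . ., $x_L, y_L \in (\mathbb{R}^{l_L})^M$ with
\[ \forall\, k \in \{0, 1, \dots, L\} \colon \quad y_k = \begin{cases} \operatorname{Batchnorm}_{\beta_k,\gamma_k,\varepsilon}(x_k) & \colon k \in N \\ x_k & \colon k \notin N \end{cases} \qquad\text{and} \tag{10.19} \]
\[ \forall\, k \in \{1, 2, \dots, L\} \colon \quad x_k = \mathfrak{M}_{a \mathbb{1}_{(0,L)}(k) + \operatorname{id}_{\mathbb{R}} \mathbb{1}_{\{L\}}(k),\, l_k,\, M}(W_k y_{k-1} + (B_k, B_k, \dots, B_k)) \tag{10.20} \]
that
\[ \mathcal{R}^{\mathbf{B}}_{a,\varepsilon}(\Phi) \in C\Bigl( \bigcup_{B \in \mathbb{N}} (\mathbb{R}^{l_0})^B, \bigcup_{B \in \mathbb{N}} (\mathbb{R}^{l_L})^B \Bigr) \qquad\text{and}\qquad \bigl( \mathcal{R}^{\mathbf{B}}_{a,\varepsilon}(\Phi) \bigr)(x_0) = y_L \in (\mathbb{R}^{l_L})^M \tag{10.21} \]
and for every Φ ∈ B we call $\mathcal{R}^{\mathbf{B}}_{a,\varepsilon}(\Phi)$ the realization function of the fully-connected
feedforward ANN with BN Φ with activation function a and BN regularization parameter ε
(we call $\mathcal{R}^{\mathbf{B}}_{a,\varepsilon}(\Phi)$ the realization of the fully-connected feedforward ANN with BN Φ with
activation a and BN regularization parameter ε) (cf. Definitions 1.2.1, 10.1.6, and 10.2.1).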

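10.4 Structured description of fully-connected feedforward ANNs with BN for inference

Definition 10.4.1 (Structured description of fully-connected feedforward ANNs with BN
for given batch means and batch variances). We denote by b the set given by
\[ \mathbf{b} = \bigcup_{L \in \mathbb{N}} \bigcup_{l_0, l_1, \dots, l_L \in \mathbb{N}} \bigcup_{N \subseteq \{0,1,\dots,L\}} \Bigl( \Bigl( \times_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \Bigr) \times \Bigl( \times_{k \in N} \bigl( (\mathbb{R}^{l_k})^3 \times [0,\infty)^{l_k} \bigr) \Bigr) \Bigr). \tag{10.22} \]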
Definition 10.4.2 (Fully-connected feedforward ANNs with BN for given batch means
and batch variances). We say that Φ is a fully-connected feedforward ANN with BN for
given batch means and batch variances if and only if it holds that

Φ∈b (10.23)

(cf. Definition 10.4.1).

10.5 Realizations of fully-connected feedforward ANNs with BN for inference

Definition 10.5.1 (Realizations associated to fully-connected feedforward ANNs with BN
for given batch means and batch variances). Let ε ∈ (0, ∞), a ∈ C(R, R). Then we denote


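by
\[ \mathcal{R}^{\mathbf{b}}_{a,\varepsilon} \colon \mathbf{b} \to \Bigl( \bigcup_{k, l \in \mathbb{N}} C(\mathbb{R}^k, \mathbb{R}^l) \Bigr) \tag{10.24} \]
the function which satisfies for all L ∈ N, l0 , l1 , . . . , lL ∈ N, N ⊆ {0, 1, . . . , L},
$\Phi = (((W_k, B_k))_{k \in \{1,2,\dots,L\}}, ((\beta_k, \gamma_k, \mu_k, V_k))_{k \in N}) \in \bigl( \times_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \bigr) \times \bigl( \times_{k \in N} ((\mathbb{R}^{l_k})^3 \times [0,\infty)^{l_k}) \bigr)$,
$x_0, y_0 \in \mathbb{R}^{l_0}$, $x_1, y_1 \in \mathbb{R}^{l_1}$, . . ., $x_L, y_L \in \mathbb{R}^{l_L}$ with
\[ \forall\, k \in \{0, 1, \dots, L\} \colon \quad y_k = \begin{cases} \operatorname{batchnorm}_{\beta_k,\gamma_k,\mu_k,V_k,\varepsilon}(x_k) & \colon k \in N \\ x_k & \colon k \notin N \end{cases} \qquad\text{and} \tag{10.25} \]
\[ \forall\, k \in \{1, 2, \dots, L\} \colon \quad x_k = \mathfrak{M}_{a \mathbb{1}_{(0,L)}(k) + \operatorname{id}_{\mathbb{R}} \mathbb{1}_{\{L\}}(k),\, l_k}(W_k y_{k-1} + B_k) \tag{10.26} \]
that
\[ \mathcal{R}^{\mathbf{b}}_{a,\varepsilon}(\Phi) \in C(\mathbb{R}^{l_0}, \mathbb{R}^{l_L}) \qquad\text{and}\qquad \bigl( \mathcal{R}^{\mathbf{b}}_{a,\varepsilon}(\Phi) \bigr)(x_0) = y_L \tag{10.27} \]
and for every Φ ∈ b we call $\mathcal{R}^{\mathbf{b}}_{a,\varepsilon}(\Phi)$ the realization function of the fully-connected feedforward
ANN with BN for given batch means and batch variances Φ with activation function a and
BN regularization parameter ε (cf. Definitions 10.1.5 and 10.4.1).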

10.6 On the connection between BN for training and BN for inference

Definition 10.6.1 (Fully-connected feedforward ANNs with BN for given batch means and
batch variances associated to fully-connected feedforward ANNs with BN and given input
batches). Let ε ∈ (0, ∞), a ∈ C(R, R), L, M ∈ N, l0 , l1 , . . . , lL ∈ N, N ⊆ {0, 1, . . . , L},
$\Phi = (((W_k, B_k))_{k \in \{1,2,\dots,L\}}, ((\beta_k, \gamma_k))_{k \in N}) \in \bigl( \times_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \bigr) \times \bigl( \times_{k \in N} (\mathbb{R}^{l_k})^2 \bigr)$, $x \in (\mathbb{R}^{l_0})^M$.
Then we say that Ψ is the fully-connected feedforward ANN with BN for given batch means
and batch variances associated to (Φ, x, a, ε) if and only if there exist $x_0, y_0 \in (\mathbb{R}^{l_0})^M$,
$x_1, y_1 \in (\mathbb{R}^{l_1})^M$, . . ., $x_L, y_L \in (\mathbb{R}^{l_L})^M$ such that
(i) it holds that x0 = x,
(ii) it holds for all k ∈ {0, 1, . . . , L} that
\[ y_k = \begin{cases} \operatorname{Batchnorm}_{\beta_k,\gamma_k,\varepsilon}(x_k) & \colon k \in N \\ x_k & \colon k \notin N, \end{cases} \tag{10.28} \]

(iii) it holds for all k ∈ {1, 2, . . . , L} that

xk = Ma1(0,L) (k)+idR 1{L} (k),lk ,M (Wk yk−1 + (Bk , Bk , . . . , Bk )), (10.29)

and


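(iv) it holds that
\[
\begin{split}
\Psi &= \bigl( ((W_k, B_k))_{k \in \{1,2,\dots,L\}}, ((\beta_k, \gamma_k, \operatorname{Batchmean}(x_k), \operatorname{Batchvar}(x_k)))_{k \in N} \bigr) \\
&\in \Bigl( \times_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \Bigr) \times \Bigl( \times_{k \in N} (\mathbb{R}^{l_k})^4 \Bigr)
\end{split}
\tag{10.30}
\]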

(cf. Definitions 1.2.1, 10.1.2, 10.1.3, and 10.1.6).

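Lemma 10.6.2. Let ε ∈ (0, ∞), a ∈ C(R, R), L, M ∈ N, l0 , l1 , . . . , lL ∈ N, N ⊆ {0, 1, . . . , L},
$\Phi = (((W_k, B_k))_{k \in \{1,2,\dots,L\}}, ((\beta_k, \gamma_k))_{k \in N}) \in \bigl( \times_{k=1}^{L} (\mathbb{R}^{l_k \times l_{k-1}} \times \mathbb{R}^{l_k}) \bigr) \times \bigl( \times_{k \in N} (\mathbb{R}^{l_k})^2 \bigr)$,
$x = (x^{(m)})_{m \in \{1,2,\dots,M\}} \in (\mathbb{R}^{l_0})^M$ and let Ψ be the fully-connected feedforward ANN with BN
for given batch means and batch variances associated to (Φ, x, a, ε) (cf. Definition 10.6.1).
Then
\[ \bigl( \mathcal{R}^{\mathbf{B}}_{a,\varepsilon}(\Phi) \bigr)(x) = \bigl( (\mathcal{R}^{\mathbf{b}}_{a,\varepsilon}(\Psi))(x^{(m)}) \bigr)_{m \in \{1,2,\dots,M\}} \tag{10.31} \]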

(cf. Definitions 10.3.1 and 10.5.1).

Proof of Lemma 10.6.2. Observe that (10.19), (10.20), (10.21), (10.25), (10.26), (10.27),
(10.28), (10.29), and (10.30) establish (10.31). The proof of Lemma 10.6.2 is thus complete.
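The statement of Lemma 10.6.2 can be illustrated with a short plain-Python sketch. Everything named below (`train_forward`, `inference_forward`, the data layout with `layers` as a list of pairs (W, B) and `bn` as a dictionary mapping a layer index k ∈ N to (β_k, γ_k), and the ReLU activation) is an assumption made for illustration. The training-mode pass follows (10.19) and (10.20) and records the batch statistics at every normalized layer; the inference-mode pass follows (10.25) and (10.26) and reuses these recorded statistics for a single data point.

```python
import math


def mean_var(batch):
    M, d = len(batch), len(batch[0])
    mu = [sum(p[i] for p in batch) / M for i in range(d)]
    var = [sum((p[i] - mu[i]) ** 2 for p in batch) / M for i in range(d)]
    return mu, var


def bn_point(p, beta, gamma, mu, var, eps):
    # BN operation for given batch mean and batch variance, cf. (10.5).
    return [gamma[i] * (p[i] - mu[i]) / math.sqrt(var[i] + eps) + beta[i]
            for i in range(len(p))]


def affine(W, B, v):
    return [sum(w * t for w, t in zip(row, v)) + b for row, b in zip(W, B)]


def relu(v):
    return [max(t, 0.0) for t in v]


def train_forward(layers, bn, batch, eps):
    """Training-mode realization: statistics come from the current batch."""
    stats, x, L = {}, batch, len(layers)
    for k in range(L + 1):
        if k >= 1:
            W, B = layers[k - 1]
            x = [affine(W, B, p) for p in x]
            if k < L:  # activation on hidden layers, identity on layer L
                x = [relu(p) for p in x]
        if k in bn:
            mu, var = mean_var(x)
            stats[k] = (mu, var)
            beta, gamma = bn[k]
            x = [bn_point(p, beta, gamma, mu, var, eps) for p in x]
    return x, stats


def inference_forward(layers, bn, stats, point, eps):
    """Inference-mode realization with fixed statistics for one data point."""
    x, L = point, len(layers)
    for k in range(L + 1):
        if k >= 1:
            W, B = layers[k - 1]
            x = affine(W, B, x)
            if k < L:
                x = relu(x)
        if k in bn:
            beta, gamma = bn[k]
            mu, var = stats[k]
            x = bn_point(x, beta, gamma, mu, var, eps)
    return x
```

Under these assumptions, feeding each point of the training batch through `inference_forward` with the statistics returned by `train_forward` reproduces the corresponding entry of the training-mode output, which mirrors (10.31).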

Exercise 10.6.1. Let l0 = 2, l1 = 3, l2 = 1, N = {0, 1}, γ0 = (2, 2), β0 = (0, 0), γ1 = (1, 1, 1),
β1 = (0, 1, 0), x = ((0, 1), (1, 0), (−2, 2), (2, −2)), Φ ∈ B satisfy
    
1 2     
−1  −1 1 −1 
Φ = 3 4, , , −2 , ((γk , βk ))k∈N 
−1 1 −1 1 (10.32)
5 6
× 2
Rlk ×lk−1 × Rlk × ×
 
∈ k=1 k∈N
(Rlk )2

and let Ψ ∈ b be the fully-connected feedforward ANN with BN for given batch means and
batch variances associated to (Φ, x, r, 0.01). Compute $(\mathcal{R}^{\mathbf{B}}_{\mathfrak{r}, 1/100}(\Phi))(x)$ and $(\mathcal{R}^{\mathbf{b}}_{\mathfrak{r}, 1/100}(\Psi))(-1, 1)$
explicitly and prove that your results are correct (cf. Definitions 1.2.4, 10.2.1, 10.3.1, 10.4.1,
10.5.1, and 10.6.1)!

Chapter 11

Optimization through random initializations

In addition to minimizing an objective function through iterative steps of an SGD-type
optimization method, another approach to minimize an objective function is to sample
different random initializations, to iteratively calculate SGD optimization processes starting
at these random initializations, and, thereafter, to pick an SGD trajectory with the smallest
final evaluation of the objective function. The approach to consider different random
initializations is reviewed and analyzed within this chapter in detail. The specific presentation of
this chapter is strongly based on Jentzen & Welti [230, Section 5].

11.1 Analysis of the optimization error


11.1.1 The complementary distribution function formula
Lemma 11.1.1 (Complementary distribution function formula). Let µ : B([0, ∞)) → [0, ∞]
be a sigma-finite measure. Then
\[ \int_0^\infty x \, \mu(\mathrm{d}x) = \int_0^\infty \mu([x, \infty)) \, \mathrm{d}x = \int_0^\infty \mu((x, \infty)) \, \mathrm{d}x. \tag{11.1} \]

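Proof of Lemma 11.1.1. First, note that
\[
\begin{split}
\int_0^\infty x \, \mu(\mathrm{d}x)
&= \int_0^\infty \Bigl[ \int_0^x \mathrm{d}y \Bigr] \mu(\mathrm{d}x)
= \int_0^\infty \Bigl[ \int_0^\infty \mathbb{1}_{(-\infty, x]}(y) \, \mathrm{d}y \Bigr] \mu(\mathrm{d}x) \\
&= \int_0^\infty \int_0^\infty \mathbb{1}_{[y, \infty)}(x) \, \mathrm{d}y \, \mu(\mathrm{d}x).
\end{split}
\tag{11.2}
\]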

Furthermore, observe that the fact that [0, ∞)2 ∋ (x, y) 7→ 1[y,∞) (x) ∈ R is (B([0, ∞)) ⊗
B([0, ∞)))/B(R)-measurable, the assumption that µ is a sigma-finite measure, and Fubini’s


theorem ensure that


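\[ \int_0^\infty \int_0^\infty \mathbb{1}_{[y, \infty)}(x) \, \mathrm{d}y \, \mu(\mathrm{d}x) = \int_0^\infty \int_0^\infty \mathbb{1}_{[y, \infty)}(x) \, \mu(\mathrm{d}x) \, \mathrm{d}y = \int_0^\infty \mu([y, \infty)) \, \mathrm{d}y. \tag{11.3} \]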

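Combining this with (11.2) shows that for all ε ∈ (0, ∞) it holds that
\[
\begin{split}
\int_0^\infty x \, \mu(\mathrm{d}x)
&= \int_0^\infty \mu([y, \infty)) \, \mathrm{d}y \geq \int_0^\infty \mu((y, \infty)) \, \mathrm{d}y \\
&\geq \int_0^\infty \mu([y + \varepsilon, \infty)) \, \mathrm{d}y = \int_\varepsilon^\infty \mu([y, \infty)) \, \mathrm{d}y.
\end{split}
\tag{11.4}
\]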

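Beppo Levi’s monotone convergence theorem hence implies that
\[
\begin{split}
\int_0^\infty x \, \mu(\mathrm{d}x)
&= \int_0^\infty \mu([y, \infty)) \, \mathrm{d}y \geq \int_0^\infty \mu((y, \infty)) \, \mathrm{d}y
\geq \sup_{\varepsilon \in (0,\infty)} \Bigl[ \int_\varepsilon^\infty \mu([y, \infty)) \, \mathrm{d}y \Bigr] \\
&= \sup_{\varepsilon \in (0,\infty)} \Bigl[ \int_0^\infty \mu([y, \infty)) \, \mathbb{1}_{(\varepsilon, \infty)}(y) \, \mathrm{d}y \Bigr] = \int_0^\infty \mu([y, \infty)) \, \mathrm{d}y.
\end{split}
\tag{11.5}
\]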

The proof of Lemma 11.1.1 is thus complete.
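For a measure with finitely many atoms, both sides of (11.1) can be evaluated exactly, which gives a quick sanity check of the formula. The sketch below assumes a hypothetical representation of µ = Σ_j w_j δ_{a_j} as a list of pairs (a_j, w_j) with a_j ≥ 0 and w_j ≥ 0; the function names are illustrative. The tail function x ↦ µ([x, ∞)) is piecewise constant between consecutive atoms, so its integral is a finite sum.

```python
def first_moment(atoms):
    """Integral of x with respect to mu = sum_j w_j * delta_{a_j}."""
    return sum(a * w for a, w in atoms)


def tail_integral(atoms):
    """Integral over [0, inf) of x -> mu([x, inf)) for the same measure.

    Between two consecutive atoms the tail mass is constant, so the
    integral is a sum of (gap length) * (remaining mass)."""
    pts = sorted(atoms)
    total, prev = 0.0, 0.0
    remaining = sum(w for _, w in pts)
    for a, w in pts:
        total += (a - prev) * remaining
        remaining -= w
        prev = a
    return total


# Hypothetical measure with three atoms.
atoms = [(0.5, 2.0), (1.0, 1.0), (3.0, 0.25)]
print(first_moment(atoms), tail_integral(atoms))  # 2.75 2.75
```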

11.1.2 Estimates for the optimization error involving complementary distribution functions

Lemma 11.1.2. Let (E, δ) be a metric space, let x ∈ E, K ∈ N, p, L ∈ (0, ∞), let (Ω, F, P)
be a probability space, let R : E × Ω → R be (B(E) ⊗ F)/B(R)-measurable, assume for all
y ∈ E, ω ∈ Ω that |R(x, ω) − R(y, ω)| ≤ Lδ(x, y), and let Xk : Ω → E, k ∈ {1, 2, . . . , K},
be i.i.d. random variables. Then
\[ \mathbb{E}\bigl[ \min\nolimits_{k \in \{1,2,\dots,K\}} |R(X_k) - R(x)|^p \bigr] \leq L^p \int_0^\infty [\mathbb{P}(\delta(X_1, x) > \varepsilon^{1/p})]^K \, \mathrm{d}\varepsilon. \tag{11.6} \]

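Proof of Lemma 11.1.2. Throughout this proof, let Y : Ω → [0, ∞) satisfy for all ω ∈ Ω
that $Y(\omega) = \min_{k \in \{1,2,\dots,K\}} [\delta(X_k(\omega), x)]^p$. Note that the fact that Y is a random variable,
the assumption that ∀ y ∈ E, ω ∈ Ω : |R(x, ω) − R(y, ω)| ≤ Lδ(x, y), and Lemma 11.1.1
demonstrate that
\[
\begin{split}
\mathbb{E}\bigl[ \min\nolimits_{k \in \{1,2,\dots,K\}} |R(X_k) - R(x)|^p \bigr]
&\leq L^p \, \mathbb{E}\bigl[ \min\nolimits_{k \in \{1,2,\dots,K\}} [\delta(X_k, x)]^p \bigr] \\
&= L^p \, \mathbb{E}[Y] = L^p \int_0^\infty y \, \mathbb{P}_Y(\mathrm{d}y) = L^p \int_0^\infty \mathbb{P}_Y((\varepsilon, \infty)) \, \mathrm{d}\varepsilon \\
&= L^p \int_0^\infty \mathbb{P}(Y > \varepsilon) \, \mathrm{d}\varepsilon = L^p \int_0^\infty \mathbb{P}\bigl( \min\nolimits_{k \in \{1,2,\dots,K\}} [\delta(X_k, x)]^p > \varepsilon \bigr) \, \mathrm{d}\varepsilon.
\end{split}
\tag{11.7}
\]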


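Furthermore, observe that the assumption that Xk , k ∈ {1, 2, . . . , K}, are i.i.d. random
variables establishes that for all ε ∈ (0, ∞) it holds that
\[
\begin{split}
\mathbb{P}\bigl( \min\nolimits_{k \in \{1,2,\dots,K\}} [\delta(X_k, x)]^p > \varepsilon \bigr)
&= \mathbb{P}\bigl( \forall\, k \in \{1, 2, \dots, K\} \colon [\delta(X_k, x)]^p > \varepsilon \bigr) \\
&= \prod_{k=1}^{K} \mathbb{P}([\delta(X_k, x)]^p > \varepsilon) = [\mathbb{P}([\delta(X_1, x)]^p > \varepsilon)]^K = [\mathbb{P}(\delta(X_1, x) > \varepsilon^{1/p})]^K.
\end{split}
\tag{11.8}
\]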

Combining this with (11.7) proves (11.6). The proof of Lemma 11.1.2 is thus complete.
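A simple special case makes the bound (11.6) tangible. The sketch below assumes, purely for illustration, that E = [0, 1] with δ(u, v) = |u − v|, x = 0, R(y, ω) = L·y, and that X_1, . . . , X_K are i.i.d. uniformly distributed on [0, 1]; the function names and the midpoint-rule resolution `n` are hypothetical choices. In this case |R(X_k) − R(x)| = L·δ(X_k, x), so (11.6) holds with equality, the left-hand side is L^p · K · B(p + 1, K), and the right-hand side reduces to L^p ∫_0^1 (1 − ε^{1/p})^K dε.

```python
import math


def rhs_bound(L, p, K, n=100_000):
    """Midpoint-rule approximation of L^p * int_0^1 (1 - eps^(1/p))^K d eps,
    the right-hand side of (11.6) in the uniform example."""
    h = 1.0 / n
    total = sum((1.0 - ((j + 0.5) * h) ** (1.0 / p)) ** K for j in range(n))
    return L ** p * h * total


def lhs_exact(L, p, K):
    """E[min_k |L * X_k|^p] for i.i.d. uniform X_k on [0, 1], using
    E[(min_k X_k)^p] = K * B(p + 1, K) with the beta function B."""
    beta = math.gamma(p + 1) * math.gamma(K) / math.gamma(p + K + 1)
    return L ** p * K * beta


for L, p, K in ((2.0, 1.0, 3), (1.0, 2.0, 5), (0.5, 0.5, 10)):
    print(L, p, K, lhs_exact(L, p, K), rhs_bound(L, p, K))
```

Both columns agree up to the quadrature error, and they decay as K grows, which reflects the benefit of trying several random initializations.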

11.2 Strong convergences rates for the optimization error


11.2.1 Properties of the gamma and the beta function
Lemma 11.2.1. Let Γ : (0, ∞) → (0, ∞) and B : (0, ∞)2 → (0, ∞) satisfy for all x, y ∈
(0, ∞) that $\Gamma(x) = \int_0^\infty t^{x-1} e^{-t} \, \mathrm{d}t$ and $B(x, y) = \int_0^1 t^{x-1} (1 - t)^{y-1} \, \mathrm{d}t$. Then

(i) it holds for all x ∈ (0, ∞) that Γ(x + 1) = x Γ(x),

(ii) it holds that Γ(1) = Γ(2) = 1, and

(iii) it holds for all x, y ∈ (0, ∞) that $B(x, y) = \frac{\Gamma(x)\Gamma(y)}{\Gamma(x+y)}$.

Proof of Lemma 11.2.1. Throughout this proof, let x, y ∈ (0, ∞), let Φ : (0, ∞) × (0, 1) →
(0, ∞)2 satisfy for all u ∈ (0, ∞), v ∈ (0, 1) that

Φ(u, v) = (u(1 − v), uv), (11.9)

and let f : (0, ∞)2 → (0, ∞) satisfy for all s, t ∈ (0, ∞) that

f (s, t) = s(x−1) t(y−1) e−(s+t) . (11.10)

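Note that the integration by parts formula proves that for all x ∈ (0, ∞) it holds that
\[
\begin{split}
\Gamma(x + 1) &= \int_0^\infty t^{((x+1)-1)} e^{-t} \, \mathrm{d}t = - \int_0^\infty t^x \bigl[ -e^{-t} \bigr] \, \mathrm{d}t \\
&= - \Bigl( \bigl[ t^x e^{-t} \bigr]_{t=0}^{t=\infty} - x \int_0^\infty t^{(x-1)} e^{-t} \, \mathrm{d}t \Bigr) = x \int_0^\infty t^{(x-1)} e^{-t} \, \mathrm{d}t = x \cdot \Gamma(x).
\end{split}
\tag{11.11}
\]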
0 0

This establishes item (i). Furthermore, observe that
\[ \Gamma(1) = \int_0^\infty t^0 e^{-t} \, \mathrm{d}t = \bigl[ -e^{-t} \bigr]_{t=0}^{t=\infty} = 1. \tag{11.12} \]


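This and item (i) prove item (ii). Moreover, note that the integral transformation theorem
with the diffeomorphism $(1, \infty) \ni t \mapsto \frac{1}{t} \in (0, 1)$ ensures that
\[
\begin{split}
B(x, y) &= \int_0^1 t^{(x-1)} (1 - t)^{(y-1)} \, \mathrm{d}t = \int_1^\infty \bigl[ \tfrac{1}{t} \bigr]^{(x-1)} \bigl[ 1 - \tfrac{1}{t} \bigr]^{(y-1)} \tfrac{1}{t^2} \, \mathrm{d}t \\
&= \int_1^\infty t^{(-x-1)} \bigl[ \tfrac{t-1}{t} \bigr]^{(y-1)} \, \mathrm{d}t = \int_1^\infty t^{(-x-y)} (t - 1)^{(y-1)} \, \mathrm{d}t \\
&= \int_0^\infty (t + 1)^{(-x-y)} t^{(y-1)} \, \mathrm{d}t = \int_0^\infty \frac{t^{(y-1)}}{(t + 1)^{(x+y)}} \, \mathrm{d}t.
\end{split}
\tag{11.13}
\]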
In addition, observe that the fact that for all (u, v) ∈ (0, ∞) × (0, 1) it holds that
\[ \Phi'(u, v) = \begin{pmatrix} 1 - v & -u \\ v & u \end{pmatrix} \tag{11.14} \]
shows that for all (u, v) ∈ (0, ∞) × (0, 1) it holds that
\[ \det(\Phi'(u, v)) = (1 - v)u - v(-u) = u - vu + vu = u \in (0, \infty). \tag{11.15} \]
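This, the fact that
\[
\begin{split}
\Gamma(x) \cdot \Gamma(y) &= \Bigl[ \int_0^\infty t^{(x-1)} e^{-t} \, \mathrm{d}t \Bigr] \Bigl[ \int_0^\infty t^{(y-1)} e^{-t} \, \mathrm{d}t \Bigr] \\
&= \Bigl[ \int_0^\infty s^{(x-1)} e^{-s} \, \mathrm{d}s \Bigr] \Bigl[ \int_0^\infty t^{(y-1)} e^{-t} \, \mathrm{d}t \Bigr] \\
&= \int_0^\infty \int_0^\infty s^{(x-1)} t^{(y-1)} e^{-(s+t)} \, \mathrm{d}t \, \mathrm{d}s = \int_{(0,\infty)^2} f(s, t) \, \mathrm{d}(s, t),
\end{split}
\tag{11.16}
\]
and the integral transformation theorem imply that
\[
\begin{split}
\Gamma(x) \cdot \Gamma(y) &= \int_{(0,\infty) \times (0,1)} f(\Phi(u, v)) \, |\det(\Phi'(u, v))| \, \mathrm{d}(u, v) \\
&= \int_0^\infty \int_0^1 (u(1 - v))^{(x-1)} (uv)^{(y-1)} e^{-(u(1-v)+uv)} \, u \, \mathrm{d}v \, \mathrm{d}u \\
&= \int_0^\infty \int_0^1 u^{(x+y-1)} e^{-u} v^{(y-1)} (1 - v)^{(x-1)} \, \mathrm{d}v \, \mathrm{d}u \\
&= \Bigl[ \int_0^\infty u^{(x+y-1)} e^{-u} \, \mathrm{d}u \Bigr] \Bigl[ \int_0^1 v^{(y-1)} (1 - v)^{(x-1)} \, \mathrm{d}v \Bigr] = \Gamma(x + y) \, B(y, x).
\end{split}
\tag{11.17}
\]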
This establishes item (iii). The proof of Lemma 11.2.1 is thus complete.
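Items (i)–(iii) lend themselves to a quick numerical cross-check with the Python standard library. In the sketch below, `math.gamma` plays the role of Γ, while the helper names `beta_midpoint` and `beta_via_gamma` and the quadrature resolution are assumptions for illustration. The midpoint rule is only applied for x, y ≥ 1, where the integrand in the definition of B is bounded.

```python
import math


def beta_midpoint(x, y, n=200_000):
    """Midpoint-rule approximation of int_0^1 t^(x-1) (1-t)^(y-1) dt
    (intended for x, y >= 1, where the integrand is bounded)."""
    h = 1.0 / n
    return h * sum(((j + 0.5) * h) ** (x - 1) * (1.0 - (j + 0.5) * h) ** (y - 1)
                   for j in range(n))


def beta_via_gamma(x, y):
    """Right-hand side of item (iii): Gamma(x) Gamma(y) / Gamma(x + y)."""
    return math.gamma(x) * math.gamma(y) / math.gamma(x + y)


print(math.gamma(1.0), math.gamma(2.0))            # item (ii)
print(math.gamma(3.7), 2.7 * math.gamma(2.7))      # item (i) at x = 2.7
print(beta_midpoint(2.0, 3.0), beta_via_gamma(2.0, 3.0))  # item (iii)
```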


Lemma 11.2.2. It holds for all α, x ∈ [0, 1] that $(1 - x)^{\alpha} \leq 1 - \alpha x$.

Proof of Lemma 11.2.2. Note that the fact that for all y ∈ [0, ∞) it holds that
$[0, \infty) \ni z \mapsto y^z \in [0, \infty)$ is convex demonstrates that for all α, x ∈ [0, 1] it holds that
\[ (1 - x)^{\alpha} \leq \alpha (1 - x)^1 + (1 - \alpha)(1 - x)^0 = \alpha - \alpha x + 1 - \alpha = 1 - \alpha x. \tag{11.18} \]
The proof of Lemma 11.2.2 is thus complete.
Proposition 11.2.3. Let Γ : (0, ∞) → (0, ∞) and ⌊·⌋ : (0, ∞) → N0 satisfy for all x ∈
(0, ∞) that $\Gamma(x) = \int_0^\infty t^{x-1} e^{-t} \, \mathrm{d}t$ and ⌊x⌋ = max([0, x) ∩ N0 ). Then
(i) it holds that Γ : (0, ∞) → (0, ∞) is convex,
(ii) it holds for all x ∈ (0, ∞) that $\Gamma(x + 1) = x \, \Gamma(x) \leq x^{\lfloor x \rfloor} \leq \max\{1, x^x\}$,
(iii) it holds for all x ∈ (0, ∞), α ∈ [0, 1] that
\[ (\max\{x + \alpha - 1, 0\})^{\alpha} \leq \frac{x}{(x + \alpha)^{1-\alpha}} \leq \frac{\Gamma(x + \alpha)}{\Gamma(x)} \leq x^{\alpha}, \tag{11.19} \]
and
(iv) it holds for all x ∈ (0, ∞), α ∈ [0, ∞) that
\[ (\max\{x + \min\{\alpha - 1, 0\}, 0\})^{\alpha} \leq \frac{\Gamma(x + \alpha)}{\Gamma(x)} \leq (x + \max\{\alpha - 1, 0\})^{\alpha}. \tag{11.20} \]

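Proof of Proposition 11.2.3. Throughout this proof, let ⌊·⌋ : [0, ∞) → N0 satisfy for all
x ∈ [0, ∞) that ⌊x⌋ = max([0, x] ∩ N0 ). Observe that the fact that for all t ∈ (0, ∞) it holds
that $\mathbb{R} \ni x \mapsto t^x \in (0, \infty)$ is convex establishes that for all x, y ∈ (0, ∞), α ∈ [0, 1] it holds
that
\[
\begin{split}
\Gamma(\alpha x + (1 - \alpha) y) &= \int_0^\infty t^{\alpha x + (1-\alpha) y - 1} e^{-t} \, \mathrm{d}t = \int_0^\infty t^{\alpha x + (1-\alpha) y} \, t^{-1} e^{-t} \, \mathrm{d}t \\
&\leq \int_0^\infty (\alpha t^x + (1 - \alpha) t^y) \, t^{-1} e^{-t} \, \mathrm{d}t \\
&= \alpha \int_0^\infty t^{x-1} e^{-t} \, \mathrm{d}t + (1 - \alpha) \int_0^\infty t^{y-1} e^{-t} \, \mathrm{d}t = \alpha \, \Gamma(x) + (1 - \alpha) \Gamma(y).
\end{split}
\tag{11.21}
\]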
This proves item (i). Furthermore, note that item (ii) in Lemma 11.2.1 and item (i) ensure
that for all α ∈ [0, 1] it holds that

Γ(α + 1) = Γ(α · 2 + (1 − α) · 1) ≤ α Γ(2) + (1 − α)Γ(1) = α + (1 − α) = 1. (11.22)


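This shows for all x ∈ (0, 1] that
\[ \Gamma(x + 1) \leq 1 = x^{\lfloor x \rfloor} = \max\{1, x^x\}. \tag{11.23} \]
Induction, item (i) in Lemma 11.2.1, and the fact that ∀ x ∈ (0, ∞) : x − ⌊x⌋ ∈ (0, 1]
therefore imply that for all x ∈ [1, ∞) it holds that
\[ \Gamma(x + 1) = \Bigl[ \prod_{i=1}^{\lfloor x \rfloor} (x - i + 1) \Bigr] \Gamma(x - \lfloor x \rfloor + 1) \leq x^{\lfloor x \rfloor} \, \Gamma(x - \lfloor x \rfloor + 1) \leq x^{\lfloor x \rfloor} \leq x^x = \max\{1, x^x\}. \tag{11.24} \]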
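Combining this and (11.23) with item (i) in Lemma 11.2.1 establishes item (ii). Moreover,
observe that Hölder’s inequality and item (i) in Lemma 11.2.1 demonstrate that for all
x ∈ (0, ∞), α ∈ [0, 1] it holds that
\[
\begin{split}
\Gamma(x + \alpha) &= \int_0^\infty t^{x+\alpha-1} e^{-t} \, \mathrm{d}t = \int_0^\infty t^{\alpha x} e^{-\alpha t} \, t^{(1-\alpha)x - (1-\alpha)} e^{-(1-\alpha)t} \, \mathrm{d}t \\
&= \int_0^\infty [t^x e^{-t}]^{\alpha} [t^{x-1} e^{-t}]^{1-\alpha} \, \mathrm{d}t
\leq \Bigl[ \int_0^\infty t^x e^{-t} \, \mathrm{d}t \Bigr]^{\alpha} \Bigl[ \int_0^\infty t^{x-1} e^{-t} \, \mathrm{d}t \Bigr]^{1-\alpha} \\
&= [\Gamma(x + 1)]^{\alpha} [\Gamma(x)]^{1-\alpha} = x^{\alpha} [\Gamma(x)]^{\alpha} [\Gamma(x)]^{1-\alpha} = x^{\alpha} \, \Gamma(x).
\end{split}
\tag{11.25}
\]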
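This and item (i) in Lemma 11.2.1 prove that for all x ∈ (0, ∞), α ∈ [0, 1] it holds that
\[ x \, \Gamma(x) = \Gamma(x + 1) = \Gamma(x + \alpha + (1 - \alpha)) \leq (x + \alpha)^{1-\alpha} \, \Gamma(x + \alpha). \tag{11.26} \]
Combining (11.25) and (11.26) ensures that for all x ∈ (0, ∞), α ∈ [0, 1] it holds that
\[ \frac{x}{(x + \alpha)^{1-\alpha}} \leq \frac{\Gamma(x + \alpha)}{\Gamma(x)} \leq x^{\alpha}. \tag{11.27} \]
In addition, note that item (i) in Lemma 11.2.1 and (11.27) show that for all x ∈ (0, ∞),
α ∈ [0, 1] it holds that
\[ \frac{\Gamma(x + \alpha)}{\Gamma(x + 1)} = \frac{\Gamma(x + \alpha)}{x \, \Gamma(x)} \leq x^{\alpha - 1}. \tag{11.28} \]
This implies for all α ∈ [0, 1], x ∈ (α, ∞) that
\[ \frac{\Gamma(x)}{\Gamma(x + (1 - \alpha))} = \frac{\Gamma((x - \alpha) + \alpha)}{\Gamma((x - \alpha) + 1)} \leq (x - \alpha)^{\alpha - 1} = \frac{1}{(x - \alpha)^{1-\alpha}}. \tag{11.29} \]
This, in turn, establishes for all α ∈ [0, 1], x ∈ (1 − α, ∞) that
\[ (x + \alpha - 1)^{\alpha} = (x - (1 - \alpha))^{\alpha} \leq \frac{\Gamma(x + \alpha)}{\Gamma(x)}. \tag{11.30} \]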


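Next observe that Lemma 11.2.2 demonstrates that for all x ∈ (0, ∞), α ∈ [0, 1] it holds
that
\[
\begin{split}
(\max\{x + \alpha - 1, 0\})^{\alpha} &= (x + \alpha)^{\alpha} \Bigl[ \frac{\max\{x + \alpha - 1, 0\}}{x + \alpha} \Bigr]^{\alpha}
= (x + \alpha)^{\alpha} \Bigl[ \max\Bigl\{ 1 - \frac{1}{x + \alpha}, 0 \Bigr\} \Bigr]^{\alpha} \\
&\leq (x + \alpha)^{\alpha} \Bigl[ 1 - \frac{\alpha}{x + \alpha} \Bigr] = (x + \alpha)^{\alpha} \Bigl[ \frac{x}{x + \alpha} \Bigr] = \frac{x}{(x + \alpha)^{1-\alpha}}.
\end{split}
\tag{11.31}
\]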

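This and (11.27) prove item (iii). Furthermore, note that induction, item (i) in Lemma 11.2.1,
the fact that ∀ α ∈ [0, ∞) : α − ⌊α⌋ ∈ [0, 1), and item (iii) ensure that for all x ∈ (0, ∞),
α ∈ [0, ∞) it holds that
\[
\begin{split}
\frac{\Gamma(x + \alpha)}{\Gamma(x)} &= \Bigl[ \prod_{i=1}^{\lfloor \alpha \rfloor} (x + \alpha - i) \Bigr] \frac{\Gamma(x + \alpha - \lfloor \alpha \rfloor)}{\Gamma(x)}
\leq \Bigl[ \prod_{i=1}^{\lfloor \alpha \rfloor} (x + \alpha - i) \Bigr] x^{\alpha - \lfloor \alpha \rfloor} \\
&\leq (x + \alpha - 1)^{\lfloor \alpha \rfloor} \, x^{\alpha - \lfloor \alpha \rfloor} \\
&\leq (x + \max\{\alpha - 1, 0\})^{\lfloor \alpha \rfloor} (x + \max\{\alpha - 1, 0\})^{\alpha - \lfloor \alpha \rfloor}
= (x + \max\{\alpha - 1, 0\})^{\alpha}.
\end{split}
\tag{11.32}
\]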

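Moreover, observe that the fact that ∀ α ∈ [0, ∞) : α − ⌊α⌋ ∈ [0, 1), item (iii), induction,
and item (i) in Lemma 11.2.1 show that for all x ∈ (0, ∞), α ∈ [0, ∞) it holds that
\[
\begin{split}
\frac{\Gamma(x + \alpha)}{\Gamma(x)} &= \frac{\Gamma(x + \lfloor \alpha \rfloor + \alpha - \lfloor \alpha \rfloor)}{\Gamma(x)} \\
&\geq (\max\{x + \lfloor \alpha \rfloor + \alpha - \lfloor \alpha \rfloor - 1, 0\})^{\alpha - \lfloor \alpha \rfloor} \Bigl[ \frac{\Gamma(x + \lfloor \alpha \rfloor)}{\Gamma(x)} \Bigr] \\
&= (\max\{x + \alpha - 1, 0\})^{\alpha - \lfloor \alpha \rfloor} \Bigl[ \prod_{i=1}^{\lfloor \alpha \rfloor} (x + \lfloor \alpha \rfloor - i) \Bigr] \frac{\Gamma(x)}{\Gamma(x)} \\
&\geq (\max\{x + \alpha - 1, 0\})^{\alpha - \lfloor \alpha \rfloor} \, x^{\lfloor \alpha \rfloor}
= (\max\{x + \alpha - 1, 0\})^{\alpha - \lfloor \alpha \rfloor} (\max\{x, 0\})^{\lfloor \alpha \rfloor} \\
&\geq (\max\{x + \min\{\alpha - 1, 0\}, 0\})^{\alpha - \lfloor \alpha \rfloor} (\max\{x + \min\{\alpha - 1, 0\}, 0\})^{\lfloor \alpha \rfloor} \\
&= (\max\{x + \min\{\alpha - 1, 0\}, 0\})^{\alpha}.
\end{split}
\tag{11.33}
\]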

Combining this with (11.32) establishes item (iv). The proof of Proposition 11.2.3 is thus
complete.

Chapter 11: Optimization through random initializations

Corollary 11.2.4. Let $B\colon (0,\infty)^2 \to (0,\infty)$ satisfy for all $x, y \in (0,\infty)$ that $B(x,y) = \int_0^1 t^{x-1}(1-t)^{y-1}\,dt$ and let $\Gamma\colon (0,\infty) \to (0,\infty)$ satisfy for all $x \in (0,\infty)$ that $\Gamma(x) = \int_0^{\infty} t^{x-1}e^{-t}\,dt$. Then it holds for all $x, y \in (0,\infty)$ with $x+y>1$ that
\[
  \frac{\Gamma(x)}{(y+\max\{x-1,0\})^{x}} \le B(x,y) \le \frac{\Gamma(x)}{(y+\min\{x-1,0\})^{x}} \le \frac{\max\{1,x^{x}\}}{x\,(y+\min\{x-1,0\})^{x}}. \tag{11.34}
\]

Proof of Corollary 11.2.4. Note that item (iii) in Lemma 11.2.1 implies that for all $x, y \in (0,\infty)$ it holds that
\[
  B(x,y) = \frac{\Gamma(x)\,\Gamma(y)}{\Gamma(y+x)}. \tag{11.35}
\]
Furthermore, observe that the fact that for all $x, y \in (0,\infty)$ with $x+y>1$ it holds that $y+\min\{x-1,0\}>0$ and item (iv) in Proposition 11.2.3 demonstrate that for all $x, y \in (0,\infty)$ with $x+y>1$ it holds that
\[
  0 < (y+\min\{x-1,0\})^{x} \le \frac{\Gamma(y+x)}{\Gamma(y)} \le (y+\max\{x-1,0\})^{x}. \tag{11.36}
\]
Combining this with (11.35) and item (ii) in Proposition 11.2.3 proves that for all $x, y \in (0,\infty)$ with $x+y>1$ it holds that
\[
  \frac{\Gamma(x)}{(y+\max\{x-1,0\})^{x}} \le B(x,y) \le \frac{\Gamma(x)}{(y+\min\{x-1,0\})^{x}} \le \frac{\max\{1,x^{x}\}}{x\,(y+\min\{x-1,0\})^{x}}. \tag{11.37}
\]
The proof of Corollary 11.2.4 is thus complete.
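
The two-sided estimate (11.34) can be sanity-checked numerically. The following is a minimal sketch, assuming only Python's standard library; the function names and the test grid are illustrative choices and not part of the text, and a small relative tolerance is used because for $x = 1$ all expressions in (11.34) coincide.

```python
import math


def beta(x: float, y: float) -> float:
    """Beta function via B(x, y) = Gamma(x) Gamma(y) / Gamma(x + y), cf. (11.35)."""
    return math.gamma(x) * math.gamma(y) / math.gamma(x + y)


def beta_bounds(x: float, y: float) -> tuple:
    """Return the three bounding expressions in (11.34); requires x + y > 1."""
    lower = math.gamma(x) / (y + max(x - 1.0, 0.0)) ** x
    upper = math.gamma(x) / (y + min(x - 1.0, 0.0)) ** x
    crude = max(1.0, x ** x) / (x * (y + min(x - 1.0, 0.0)) ** x)
    return lower, upper, crude


def check_grid(xs, ys, rel_tol: float = 1e-12) -> bool:
    """Check lower <= B(x, y) <= upper <= crude on a grid, up to rounding."""
    slack = 1.0 + rel_tol
    for x in xs:
        for y in ys:
            if x + y <= 1.0:
                continue  # outside the scope of Corollary 11.2.4
            lower, upper, crude = beta_bounds(x, y)
            b = beta(x, y)
            if not (lower <= b * slack and b <= upper * slack
                    and upper <= crude * slack):
                return False
    return True


# Illustrative grid (an assumption of this sketch, not taken from the text)
GRID_OK = check_grid([0.25, 0.5, 1.0, 2.5, 4.0], [0.8, 1.0, 3.0, 10.0])
```

Such a check is of course no substitute for the proof; it only guards against transcription errors in the exponents and in the $\max$/$\min$ terms.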

11.2.2 Product measurability of continuous random fields


Lemma 11.2.5 (Projections in metric spaces). Let $(E,d)$ be a metric space, let $n \in \mathbb{N}$, $e_1, e_2, \dots, e_n \in E$, and let $P\colon E \to E$ satisfy for all $x \in E$ that
\[
  P(x) = e_{\min\{k \in \{1,2,\dots,n\}\colon d(x,e_k) = \min\{d(x,e_1),\,d(x,e_2),\,\dots,\,d(x,e_n)\}\}}. \tag{11.38}
\]
Then

(i) it holds for all $x \in E$ that
\[
  d(x,P(x)) = \min_{k \in \{1,2,\dots,n\}} d(x,e_k) \tag{11.39}
\]
and

(ii) it holds for all $A \subseteq E$ that $P^{-1}(A) \in \mathcal{B}(E)$.
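
The map $P$ in (11.38) is a nearest-point projection onto $\{e_1, \dots, e_n\}$ in which ties are broken in favor of the smallest index. A minimal sketch of this rule, assuming Python's standard library (the names `project`, `abs_dist`, and `ANCHORS` are hypothetical and only serve the illustration):

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def project(x: T, points: Sequence[T], dist: Callable[[T, T], float]) -> T:
    """Return e_k where k is the smallest index minimizing dist(x, e_k), cf. (11.38)."""
    # min() returns the first minimizer it encounters, so ties are broken
    # in favor of the smallest index, as required by (11.38).
    k = min(range(len(points)), key=lambda i: dist(x, points[i]))
    return points[k]


def abs_dist(a: float, b: float) -> float:
    """The standard metric on the real line (illustrative choice)."""
    return abs(a - b)


ANCHORS = [0.0, 2.0, 5.0]
NEAR = project(4.0, ANCHORS, abs_dist)  # unique closest anchor
TIE = project(1.0, ANCHORS, abs_dist)   # 0.0 and 2.0 are equally close
```

The deterministic tie-breaking is what makes the preimages $P^{-1}(\{e_k\})$ describable by finitely many strict and non-strict inequalities between the distances, which is the mechanism used in the proof of item (ii).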


Proof of Lemma 11.2.5. Throughout this proof, let $D = (D_1, \dots, D_n)\colon E \to \mathbb{R}^n$ satisfy for all $x \in E$ that
\[
  D(x) = (D_1(x), D_2(x), \dots, D_n(x)) = (d(x,e_1), d(x,e_2), \dots, d(x,e_n)). \tag{11.40}
\]
Note that (11.38) ensures that for all $x \in E$ it holds that
\[
  d(x,P(x)) = d\bigl(x, e_{\min\{k \in \{1,2,\dots,n\}\colon d(x,e_k) = \min\{d(x,e_1),\,d(x,e_2),\,\dots,\,d(x,e_n)\}\}}\bigr)
  = \min_{k \in \{1,2,\dots,n\}} d(x,e_k). \tag{11.41}
\]

This establishes item (i). It thus remains to prove item (ii). For this observe that the fact that $d\colon E \times E \to [0,\infty)$ is continuous shows that $D\colon E \to \mathbb{R}^n$ is continuous. Hence, we obtain that $D\colon E \to \mathbb{R}^n$ is $\mathcal{B}(E)/\mathcal{B}(\mathbb{R}^n)$-measurable. Furthermore, note that item (i) implies that for all $k \in \{1,2,\dots,n\}$, $x \in P^{-1}(\{e_k\})$ it holds that
\[
  d(x,e_k) = d(x,P(x)) = \min_{l \in \{1,2,\dots,n\}} d(x,e_l). \tag{11.42}
\]

Therefore, we obtain that for all $k \in \{1,2,\dots,n\}$, $x \in P^{-1}(\{e_k\})$ it holds that
\[
  k \ge \min\{l \in \{1,2,\dots,n\}\colon d(x,e_l) = \min\{d(x,e_1), d(x,e_2), \dots, d(x,e_n)\}\}. \tag{11.43}
\]
Moreover, observe that (11.38) demonstrates that for all $k \in \{1,2,\dots,n\}$, $x \in P^{-1}(\{e_k\})$ it holds that
\[
  \min\Bigl\{l \in \{1,2,\dots,n\}\colon d(x,e_l) = \min_{u \in \{1,2,\dots,n\}} d(x,e_u)\Bigr\}
  \in \bigl\{l \in \{1,2,\dots,n\}\colon e_l = e_k\bigr\} \subseteq \{k, k+1, \dots, n\}. \tag{11.44}
\]
Hence, we obtain that for all $k \in \{1,2,\dots,n\}$, $x \in P^{-1}(\{e_k\})$ with $e_k \notin \bigl(\bigcup_{l \in \mathbb{N} \cap [0,k)} \{e_l\}\bigr)$ it holds that
\[
  \min\Bigl\{l \in \{1,2,\dots,n\}\colon d(x,e_l) = \min_{u \in \{1,2,\dots,n\}} d(x,e_u)\Bigr\} \ge k. \tag{11.45}
\]
Combining this with (11.43) proves that for all $k \in \{1,2,\dots,n\}$, $x \in P^{-1}(\{e_k\})$ with $e_k \notin \bigl(\bigcup_{l \in \mathbb{N} \cap [0,k)} \{e_l\}\bigr)$ it holds that
\[
  \min\Bigl\{l \in \{1,2,\dots,n\}\colon d(x,e_l) = \min_{u \in \{1,2,\dots,n\}} d(x,e_u)\Bigr\} = k. \tag{11.46}
\]
Therefore, we obtain that for all $k \in \{1,2,\dots,n\}$ with $e_k \notin \bigl(\bigcup_{l \in \mathbb{N} \cap [0,k)} \{e_l\}\bigr)$ it holds that
\[
  P^{-1}(\{e_k\}) \subseteq \Bigl\{x \in E\colon \min\Bigl\{l \in \{1,2,\dots,n\}\colon d(x,e_l) = \min_{u \in \{1,2,\dots,n\}} d(x,e_u)\Bigr\} = k\Bigr\}. \tag{11.47}
\]


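This and (11.38) ensure that for all $k \in \{1,2,\dots,n\}$ with $e_k \notin \bigl(\bigcup_{l \in \mathbb{N} \cap [0,k)} \{e_l\}\bigr)$ it holds that
\[
  P^{-1}(\{e_k\}) = \Bigl\{x \in E\colon \min\Bigl\{l \in \{1,2,\dots,n\}\colon d(x,e_l) = \min_{u \in \{1,2,\dots,n\}} d(x,e_u)\Bigr\} = k\Bigr\}. \tag{11.48}
\]
Combining (11.40) with the fact that $D\colon E \to \mathbb{R}^n$ is $\mathcal{B}(E)/\mathcal{B}(\mathbb{R}^n)$-measurable hence establishes that for all $k \in \{1,2,\dots,n\}$ with $e_k \notin \bigl(\bigcup_{l \in \mathbb{N} \cap [0,k)} \{e_l\}\bigr)$ it holds that
\[
\begin{aligned}
P^{-1}(\{e_k\})
&= \Bigl\{x \in E\colon \min\Bigl\{l \in \{1,2,\dots,n\}\colon d(x,e_l) = \min_{u \in \{1,2,\dots,n\}} d(x,e_u)\Bigr\} = k\Bigr\} \\
&= \Bigl\{x \in E\colon \min\Bigl\{l \in \{1,2,\dots,n\}\colon D_l(x) = \min_{u \in \{1,2,\dots,n\}} D_u(x)\Bigr\} = k\Bigr\} \\
&= \Bigl\{x \in E\colon \bigl(\forall\, l \in \mathbb{N} \cap [0,k)\colon D_k(x) < D_l(x)\bigr) \text{ and } \bigl(\forall\, l \in \{1,2,\dots,n\}\colon D_k(x) \le D_l(x)\bigr)\Bigr\} \\
&= \Biggl[\bigcap_{l=1}^{k-1} \underbrace{\{x \in E\colon D_k(x) < D_l(x)\}}_{\in \mathcal{B}(E)}\Biggr]
   \cap \Biggl[\bigcap_{l=1}^{n} \underbrace{\{x \in E\colon D_k(x) \le D_l(x)\}}_{\in \mathcal{B}(E)}\Biggr] \in \mathcal{B}(E).
\end{aligned}
\tag{11.49}
\]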

Therefore, we obtain that for all $f \in \{e_1, e_2, \dots, e_n\}$ it holds that
\[
  P^{-1}(\{f\}) \in \mathcal{B}(E). \tag{11.50}
\]
Hence, we obtain that for all $A \subseteq E$ it holds that
\[
  P^{-1}(A) = P^{-1}(A \cap \{e_1, e_2, \dots, e_n\}) = \bigcup_{f \in A \cap \{e_1, e_2, \dots, e_n\}} \underbrace{P^{-1}(\{f\})}_{\in \mathcal{B}(E)} \in \mathcal{B}(E). \tag{11.51}
\]
This proves item (ii). The proof of Lemma 11.2.5 is thus complete.
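Lemma 11.2.6. Let $(E,d)$ be a separable metric space, let $(\mathcal{E},\delta)$ be a metric space, let $(\Omega,\mathcal{F})$ be a measurable space, let $X\colon E \times \Omega \to \mathcal{E}$, assume for all $e \in E$ that $\Omega \ni \omega \mapsto X(e,\omega) \in \mathcal{E}$ is $\mathcal{F}/\mathcal{B}(\mathcal{E})$-measurable, and assume for all $\omega \in \Omega$ that $E \ni e \mapsto X(e,\omega) \in \mathcal{E}$ is continuous. Then $X\colon E \times \Omega \to \mathcal{E}$ is $(\mathcal{B}(E) \otimes \mathcal{F})/\mathcal{B}(\mathcal{E})$-measurable.

Proof of Lemma 11.2.6. Throughout this proof, let $e = (e_m)_{m \in \mathbb{N}}\colon \mathbb{N} \to E$ satisfy
\[
  \overline{\{e_m\colon m \in \mathbb{N}\}} = E, \tag{11.52}
\]
let $P_n\colon E \to E$, $n \in \mathbb{N}$, satisfy for all $n \in \mathbb{N}$, $x \in E$ that
\[
  P_n(x) = e_{\min\{k \in \{1,2,\dots,n\}\colon d(x,e_k) = \min\{d(x,e_1),\,d(x,e_2),\,\dots,\,d(x,e_n)\}\}}, \tag{11.53}
\]
and let $X_n\colon E \times \Omega \to \mathcal{E}$, $n \in \mathbb{N}$, satisfy for all $n \in \mathbb{N}$, $x \in E$, $\omega \in \Omega$ that
\[
  X_n(x,\omega) = X(P_n(x),\omega). \tag{11.54}
\]
Note that (11.54) shows that for all $n \in \mathbb{N}$, $B \in \mathcal{B}(\mathcal{E})$ it holds that
\[
\begin{aligned}
(X_n)^{-1}(B)
&= \{(x,\omega) \in E \times \Omega\colon X_n(x,\omega) \in B\} \\
&= \bigcup_{y \in \operatorname{Im}(P_n)} \Bigl(\bigl[(X_n)^{-1}(B)\bigr] \cap \bigl[(P_n)^{-1}(\{y\}) \times \Omega\bigr]\Bigr) \\
&= \bigcup_{y \in \operatorname{Im}(P_n)} \bigl\{(x,\omega) \in E \times \Omega\colon \bigl[X_n(x,\omega) \in B \text{ and } x \in (P_n)^{-1}(\{y\})\bigr]\bigr\} \\
&= \bigcup_{y \in \operatorname{Im}(P_n)} \bigl\{(x,\omega) \in E \times \Omega\colon \bigl[X(P_n(x),\omega) \in B \text{ and } x \in (P_n)^{-1}(\{y\})\bigr]\bigr\}.
\end{aligned}
\tag{11.55}
\]
Item (ii) in Lemma 11.2.5 therefore implies that for all $n \in \mathbb{N}$, $B \in \mathcal{B}(\mathcal{E})$ it holds that
\[
\begin{aligned}
(X_n)^{-1}(B)
&= \bigcup_{y \in \operatorname{Im}(P_n)} \bigl\{(x,\omega) \in E \times \Omega\colon \bigl[X(y,\omega) \in B \text{ and } x \in (P_n)^{-1}(\{y\})\bigr]\bigr\} \\
&= \bigcup_{y \in \operatorname{Im}(P_n)} \Bigl(\bigl\{(x,\omega) \in E \times \Omega\colon X(y,\omega) \in B\bigr\} \cap \bigl[(P_n)^{-1}(\{y\}) \times \Omega\bigr]\Bigr) \\
&= \bigcup_{y \in \operatorname{Im}(P_n)} \Bigl(\underbrace{\bigl[E \times \bigl((X(y,\cdot))^{-1}(B)\bigr)\bigr]}_{\in (\mathcal{B}(E) \otimes \mathcal{F})} \cap \underbrace{\bigl[(P_n)^{-1}(\{y\}) \times \Omega\bigr]}_{\in (\mathcal{B}(E) \otimes \mathcal{F})}\Bigr) \in (\mathcal{B}(E) \otimes \mathcal{F}).
\end{aligned}
\tag{11.56}
\]
This demonstrates that for all $n \in \mathbb{N}$ it holds that $X_n$ is $(\mathcal{B}(E) \otimes \mathcal{F})/\mathcal{B}(\mathcal{E})$-measurable. Furthermore, observe that item (i) in Lemma 11.2.5 and the assumption that for all $\omega \in \Omega$ it holds that $E \ni x \mapsto X(x,\omega) \in \mathcal{E}$ is continuous ensure that for all $x \in E$, $\omega \in \Omega$ it holds that
\[
  \lim_{n \to \infty} X_n(x,\omega) = \lim_{n \to \infty} X(P_n(x),\omega) = X(x,\omega). \tag{11.57}
\]
Combining this with the fact that for all $n \in \mathbb{N}$ it holds that $X_n\colon E \times \Omega \to \mathcal{E}$ is $(\mathcal{B}(E) \otimes \mathcal{F})/\mathcal{B}(\mathcal{E})$-measurable establishes that $X\colon E \times \Omega \to \mathcal{E}$ is $(\mathcal{B}(E) \otimes \mathcal{F})/\mathcal{B}(\mathcal{E})$-measurable. The proof of Lemma 11.2.6 is thus complete.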

11.2.3 Strong convergence rates for the optimization error


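Proposition 11.2.7. Let $d, K \in \mathbb{N}$, $L, \alpha \in \mathbb{R}$, $\beta \in (\alpha,\infty)$, let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $R\colon [\alpha,\beta]^d \times \Omega \to \mathbb{R}$ be a random field, assume for all $\theta, \vartheta \in [\alpha,\beta]^d$, $\omega \in \Omega$ that $|R(\theta,\omega) - R(\vartheta,\omega)| \le L\|\theta-\vartheta\|_\infty$, let $\Theta_k\colon \Omega \to [\alpha,\beta]^d$, $k \in \{1,2,\dots,K\}$, be i.i.d. random variables, and assume that $\Theta_1$ is continuously uniformly distributed on $[\alpha,\beta]^d$ (cf. Definition 3.3.4). Then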


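(i) it holds that $R$ is $(\mathcal{B}([\alpha,\beta]^d) \otimes \mathcal{F})/\mathcal{B}(\mathbb{R})$-measurable and

(ii) it holds for all $\theta \in [\alpha,\beta]^d$, $p \in (0,\infty)$ that
\[
  \bigl(\mathbb{E}\bigl[\min\nolimits_{k \in \{1,2,\dots,K\}} |R(\Theta_k) - R(\theta)|^p\bigr]\bigr)^{1/p}
  \le \frac{L(\beta-\alpha)\max\{1,(p/d)^{1/d}\}}{K^{1/d}}
  \le \frac{L(\beta-\alpha)\max\{1,p\}}{K^{1/d}}. \tag{11.58}
\]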

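Proof of Proposition 11.2.7. Throughout this proof, assume without loss of generality that $L > 0$, let $\delta\colon ([\alpha,\beta]^d) \times ([\alpha,\beta]^d) \to [0,\infty)$ satisfy for all $\theta, \vartheta \in [\alpha,\beta]^d$ that
\[
  \delta(\theta,\vartheta) = \|\theta-\vartheta\|_\infty, \tag{11.59}
\]
let $B\colon (0,\infty)^2 \to (0,\infty)$ satisfy for all $x, y \in (0,\infty)$ that
\[
  B(x,y) = \int_0^1 t^{x-1}(1-t)^{y-1}\,dt, \tag{11.60}
\]
and let $\Theta_{1,1}, \Theta_{1,2}, \dots, \Theta_{1,d}\colon \Omega \to [\alpha,\beta]$ satisfy $\Theta_1 = (\Theta_{1,1}, \Theta_{1,2}, \dots, \Theta_{1,d})$. First, note that the assumption that for all $\theta, \vartheta \in [\alpha,\beta]^d$, $\omega \in \Omega$ it holds that
\[
  |R(\theta,\omega) - R(\vartheta,\omega)| \le L\|\theta-\vartheta\|_\infty \tag{11.61}
\]
proves that for all $\omega \in \Omega$ it holds that $[\alpha,\beta]^d \ni \theta \mapsto R(\theta,\omega) \in \mathbb{R}$ is continuous. Combining this with the fact that $([\alpha,\beta]^d,\delta)$ is a separable metric space, the fact that for all $\theta \in [\alpha,\beta]^d$ it holds that $\Omega \ni \omega \mapsto R(\theta,\omega) \in \mathbb{R}$ is $\mathcal{F}/\mathcal{B}(\mathbb{R})$-measurable, and Lemma 11.2.6 establishes item (i). Observe that the fact that for all $\theta \in [\alpha,\beta]$, $\varepsilon \in [0,\infty)$ it holds that
\[
\begin{aligned}
\min\{\theta+\varepsilon,\beta\} - \max\{\theta-\varepsilon,\alpha\}
&= \min\{\theta+\varepsilon,\beta\} + \min\{\varepsilon-\theta,-\alpha\} \\
&= \min\bigl\{\theta+\varepsilon+\min\{\varepsilon-\theta,-\alpha\},\ \beta+\min\{\varepsilon-\theta,-\alpha\}\bigr\} \\
&= \min\bigl\{\min\{2\varepsilon,\theta-\alpha+\varepsilon\},\ \min\{\beta-\theta+\varepsilon,\beta-\alpha\}\bigr\} \\
&\ge \min\bigl\{\min\{2\varepsilon,\alpha-\alpha+\varepsilon\},\ \min\{\beta-\beta+\varepsilon,\beta-\alpha\}\bigr\} \\
&= \min\{2\varepsilon,\varepsilon,\varepsilon,\beta-\alpha\} = \min\{\varepsilon,\beta-\alpha\}
\end{aligned}
\tag{11.62}
\]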

and the assumption that Θ1 is continuously uniformly distributed on [α, β]d show that for


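all $\theta = (\theta_1, \theta_2, \dots, \theta_d) \in [\alpha,\beta]^d$, $\varepsilon \in [0,\infty)$ it holds that
\[
\begin{aligned}
\mathbb{P}(\|\Theta_1-\theta\|_\infty \le \varepsilon)
&= \mathbb{P}\bigl(\max\nolimits_{i \in \{1,2,\dots,d\}} |\Theta_{1,i}-\theta_i| \le \varepsilon\bigr) \\
&= \mathbb{P}\bigl(\forall\, i \in \{1,2,\dots,d\}\colon -\varepsilon \le \Theta_{1,i}-\theta_i \le \varepsilon\bigr) \\
&= \mathbb{P}\bigl(\forall\, i \in \{1,2,\dots,d\}\colon \theta_i-\varepsilon \le \Theta_{1,i} \le \theta_i+\varepsilon\bigr) \\
&= \mathbb{P}\bigl(\forall\, i \in \{1,2,\dots,d\}\colon \max\{\theta_i-\varepsilon,\alpha\} \le \Theta_{1,i} \le \min\{\theta_i+\varepsilon,\beta\}\bigr) \\
&= \mathbb{P}\Bigl(\Theta_1 \in \bigl[\mathop{\times}\nolimits_{i=1}^{d} [\max\{\theta_i-\varepsilon,\alpha\},\min\{\theta_i+\varepsilon,\beta\}]\bigr]\Bigr) \\
&= \frac{1}{(\beta-\alpha)^d} \prod_{i=1}^{d} \bigl(\min\{\theta_i+\varepsilon,\beta\} - \max\{\theta_i-\varepsilon,\alpha\}\bigr) \\
&\ge \frac{1}{(\beta-\alpha)^d}\,[\min\{\varepsilon,\beta-\alpha\}]^d = \min\Bigl\{1,\frac{\varepsilon^d}{(\beta-\alpha)^d}\Bigr\}.
\end{aligned}
\tag{11.63}
\]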

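Hence, we obtain for all $\theta \in [\alpha,\beta]^d$, $p \in (0,\infty)$, $\varepsilon \in [0,\infty)$ that
\[
\begin{aligned}
\mathbb{P}(\|\Theta_1-\theta\|_\infty > \varepsilon^{1/p})
&= 1 - \mathbb{P}(\|\Theta_1-\theta\|_\infty \le \varepsilon^{1/p}) \\
&\le 1 - \min\Bigl\{1,\frac{\varepsilon^{d/p}}{(\beta-\alpha)^d}\Bigr\}
 = \max\Bigl\{0,\,1-\frac{\varepsilon^{d/p}}{(\beta-\alpha)^d}\Bigr\}.
\end{aligned}
\tag{11.64}
\]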

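This, item (i), the assumption that for all $\theta, \vartheta \in [\alpha,\beta]^d$, $\omega \in \Omega$ it holds that
\[
  |R(\theta,\omega) - R(\vartheta,\omega)| \le L\|\theta-\vartheta\|_\infty, \tag{11.65}
\]
the assumption that $\Theta_k$, $k \in \{1,2,\dots,K\}$, are i.i.d. random variables, and Lemma 11.1.2 (applied with $(E,\delta) \curvearrowleft ([\alpha,\beta]^d,\delta)$, $(X_k)_{k \in \{1,2,\dots,K\}} \curvearrowleft (\Theta_k)_{k \in \{1,2,\dots,K\}}$ in the notation of Lemma 11.1.2) imply that for all $\theta \in [\alpha,\beta]^d$, $p \in (0,\infty)$ it holds that
\[
\begin{aligned}
\mathbb{E}\bigl[\min\nolimits_{k \in \{1,2,\dots,K\}} |R(\Theta_k) - R(\theta)|^p\bigr]
&\le L^p \int_0^{\infty} \bigl[\mathbb{P}(\|\Theta_1-\theta\|_\infty > \varepsilon^{1/p})\bigr]^K \,d\varepsilon \\
&\le L^p \int_0^{\infty} \Bigl[\max\Bigl\{0,\,1-\frac{\varepsilon^{d/p}}{(\beta-\alpha)^d}\Bigr\}\Bigr]^K d\varepsilon
 = L^p \int_0^{(\beta-\alpha)^p} \Bigl(1-\frac{\varepsilon^{d/p}}{(\beta-\alpha)^d}\Bigr)^K d\varepsilon \\
&= \frac{p}{d}\,L^p(\beta-\alpha)^p \int_0^1 t^{p/d-1}(1-t)^K\,dt
 = \frac{p}{d}\,L^p(\beta-\alpha)^p \int_0^1 t^{p/d-1}(1-t)^{K+1-1}\,dt \\
&= \frac{p}{d}\,L^p(\beta-\alpha)^p\,B(p/d,\,K+1).
\end{aligned}
\tag{11.66}
\]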

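Corollary 11.2.4 (applied with $x \curvearrowleft p/d$, $y \curvearrowleft K+1$ for $p \in (0,\infty)$ in the notation of (11.34) in Corollary 11.2.4) therefore demonstrates that for all $\theta \in [\alpha,\beta]^d$, $p \in (0,\infty)$ it holds that
\[
\begin{aligned}
\mathbb{E}\bigl[\min\nolimits_{k \in \{1,2,\dots,K\}} |R(\Theta_k) - R(\theta)|^p\bigr]
&\le \frac{\frac{p}{d}\,L^p(\beta-\alpha)^p \max\{1,(p/d)^{p/d}\}}{\frac{p}{d}\,(K+1+\min\{p/d-1,0\})^{p/d}} \\
&\le \frac{L^p(\beta-\alpha)^p \max\{1,(p/d)^{p/d}\}}{K^{p/d}}.
\end{aligned}
\tag{11.67}
\]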

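This ensures for all $\theta \in [\alpha,\beta]^d$, $p \in (0,\infty)$ that
\[
  \bigl(\mathbb{E}\bigl[\min\nolimits_{k \in \{1,2,\dots,K\}} |R(\Theta_k) - R(\theta)|^p\bigr]\bigr)^{1/p}
  \le \frac{L(\beta-\alpha)\max\{1,(p/d)^{1/d}\}}{K^{1/d}}
  \le \frac{L(\beta-\alpha)\max\{1,p\}}{K^{1/d}}. \tag{11.68}
\]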
This proves item (ii). The proof of Proposition 11.2.7 is thus complete.
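
The rate $K^{-1/d}$ in (11.58) can be observed in a small Monte Carlo experiment. The sketch below assumes Python's standard library and uses the illustrative deterministic field $R(\theta) = \|\theta - c\|_\infty$, which is Lipschitz continuous with constant $L = 1$ with respect to the maximum norm; the names `mc_min_error` and `bound`, the seed, and all parameter values are choices made here and are not part of the text.

```python
import random


def sup_dist(theta, vartheta):
    """Maximum-norm distance between two points."""
    return max(abs(s - t) for s, t in zip(theta, vartheta))


def mc_min_error(d, K, p, alpha, beta, center, trials, rng):
    """Monte Carlo estimate of (E[min_k |R(Theta_k) - R(center)|^p])^(1/p)
    for the illustrative choice R(theta) = ||theta - center||_inf (so L = 1)."""
    total = 0.0
    for _ in range(trials):
        # best of K i.i.d. uniform draws from [alpha, beta]^d
        best = min(
            sup_dist([rng.uniform(alpha, beta) for _ in range(d)], center)
            for _ in range(K)
        )
        total += best ** p
    return (total / trials) ** (1.0 / p)


def bound(d, K, p, alpha, beta, lip):
    """First upper bound in (11.58)."""
    return lip * (beta - alpha) * max(1.0, (p / d) ** (1.0 / d)) / K ** (1.0 / d)


RNG = random.Random(0)  # fixed seed for reproducibility
EST = mc_min_error(d=2, K=100, p=2.0, alpha=0.0, beta=1.0,
                   center=[0.5, 0.5], trials=1000, rng=RNG)
BND = bound(d=2, K=100, p=2.0, alpha=0.0, beta=1.0, lip=1.0)
```

In this setting the estimate stays below the bound, in agreement with item (ii); the experiment also makes visible that the number of random initializations $K$ must grow exponentially in the dimension $d$ to reach a prescribed accuracy.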

11.3 Strong convergence rates for the optimization error involving ANNs

11.3.1 Local Lipschitz continuity estimates for the parametrization functions of ANNs
Lemma 11.3.1. Let $a, x, y \in \mathbb{R}$. Then
\[
  |\max\{x,a\} - \max\{y,a\}| \le \max\{x,y\} - \min\{x,y\} = |x-y|. \tag{11.69}
\]
Proof of Lemma 11.3.1. Note that the fact that
\[
\begin{aligned}
|\max\{x,a\} - \max\{y,a\}|
&= \bigl|\max\{\max\{x,y\},a\} - \max\{\min\{x,y\},a\}\bigr| \\
&= \max\bigl\{\max\{x,y\},a\bigr\} - \max\bigl\{\min\{x,y\},a\bigr\} \\
&= \max\Bigl\{\max\{x,y\} - \max\bigl\{\min\{x,y\},a\bigr\},\ a - \max\bigl\{\min\{x,y\},a\bigr\}\Bigr\} \\
&\le \max\Bigl\{\max\{x,y\} - \max\bigl\{\min\{x,y\},a\bigr\},\ a - a\Bigr\} \\
&= \max\Bigl\{\max\{x,y\} - \max\bigl\{\min\{x,y\},a\bigr\},\ 0\Bigr\}
 \le \max\bigl\{\max\{x,y\} - \min\{x,y\},\ 0\bigr\} \\
&= \max\{x,y\} - \min\{x,y\} = |\max\{x,y\} - \min\{x,y\}| = |x-y|
\end{aligned}
\tag{11.70}
\]
establishes (11.69). The proof of Lemma 11.3.1 is thus complete.
Corollary 11.3.2. Let a, x, y ∈ R. Then
|min{x, a} − min{y, a}| ≤ max{x, y} − min{x, y} = |x − y|. (11.71)
Proof of Corollary 11.3.2. Observe that Lemma 11.3.1 shows that
|min{x, a} − min{y, a}| = |−(min{x, a} − min{y, a})|
= |max{−x, −a} − max{−y, −a}| (11.72)
≤ |(−x) − (−y)| = |x − y|.
The proof of Corollary 11.3.2 is thus complete.


Lemma 11.3.3. Let d ∈ N. Then it holds for all x, y ∈ Rd that

∥Rd (x) − Rd (y)∥∞ ≤ ∥x − y∥∞ (11.73)

(cf. Definitions 1.2.5 and 3.3.4).

Proof of Lemma 11.3.3. Observe that Lemma 11.3.1 demonstrates (11.73). The proof of
Lemma 11.3.3 is thus complete.

Lemma 11.3.4. Let d ∈ N, u ∈ [−∞, ∞), v ∈ (u, ∞]. Then it holds for all x, y ∈ Rd that

∥Cu,v,d (x) − Cu,v,d (y)∥∞ ≤ ∥x − y∥∞ (11.74)

(cf. Definitions 1.2.10 and 3.3.4).

Proof of Lemma 11.3.4. Note that Lemma 11.3.1, Corollary 11.3.2, and the fact that for all $x \in \mathbb{R}$ it holds that $\max\{-\infty,x\} = x = \min\{x,\infty\}$ imply that for all $x, y \in \mathbb{R}$ it holds that
\[
\begin{aligned}
|c_{u,v}(x) - c_{u,v}(y)|
&= |\max\{u,\min\{x,v\}\} - \max\{u,\min\{y,v\}\}| \\
&\le |\min\{x,v\} - \min\{y,v\}| \le |x-y|
\end{aligned}
\tag{11.75}
\]
(cf. Definition 1.2.9). Hence, we obtain that for all $x = (x_1, x_2, \dots, x_d)$, $y = (y_1, y_2, \dots, y_d) \in \mathbb{R}^d$ it holds that
\[
\begin{aligned}
\|C_{u,v,d}(x) - C_{u,v,d}(y)\|_\infty
&= \max_{i \in \{1,2,\dots,d\}} |c_{u,v}(x_i) - c_{u,v}(y_i)| \\
&\le \max_{i \in \{1,2,\dots,d\}} |x_i - y_i| = \|x-y\|_\infty
\end{aligned}
\tag{11.76}
\]
(cf. Definitions 1.2.10 and 3.3.4). The proof of Lemma 11.3.4 is thus complete.
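
The one-dimensional estimate (11.75) says that clipping never increases distances. The following sketch, assuming Python's standard library, measures the largest observed difference quotient of $c_{u,v}$ on random pairs; the helper names and the sampling interval are illustrative assumptions, and the ReLU function is recovered as the special case $u = 0$, $v = \infty$.

```python
import math
import random


def clip(u: float, v: float, x: float) -> float:
    """Clipping function c_{u,v}(x) = max{u, min{x, v}}; u = -inf or v = inf is allowed."""
    return max(u, min(x, v))


def max_slope(u: float, v: float, samples: int, rng: random.Random) -> float:
    """Largest observed ratio |c(x) - c(y)| / |x - y| over random pairs."""
    worst = 0.0
    for _ in range(samples):
        x, y = rng.uniform(-5.0, 5.0), rng.uniform(-5.0, 5.0)
        if x != y:
            ratio = abs(clip(u, v, x) - clip(u, v, y)) / abs(x - y)
            worst = max(worst, ratio)
    return worst


RNG = random.Random(1)
SLOPE_CLIP = max_slope(-1.0, 2.0, 10_000, RNG)
SLOPE_RELU = max_slope(0.0, math.inf, 10_000, RNG)       # c_{0,inf} is the ReLU
SLOPE_ID = max_slope(-math.inf, math.inf, 10_000, RNG)   # c_{-inf,inf} is the identity
```

All three observed slopes are at most one, as Lemma 11.3.3 and Lemma 11.3.4 predict componentwise.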

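Lemma 11.3.5 (Row sum norm, operator norm induced by the maximum norm). Let $a, b \in \mathbb{N}$, $M = (M_{i,j})_{(i,j) \in \{1,2,\dots,a\} \times \{1,2,\dots,b\}} \in \mathbb{R}^{a \times b}$. Then
\[
  \sup_{v \in \mathbb{R}^b \setminus \{0\}} \left[\frac{\|Mv\|_\infty}{\|v\|_\infty}\right]
  = \max_{i \in \{1,2,\dots,a\}} \left[\sum_{j=1}^{b} |M_{i,j}|\right]
  \le b \left[\max_{i \in \{1,2,\dots,a\}} \max_{j \in \{1,2,\dots,b\}} |M_{i,j}|\right] \tag{11.77}
\]
(cf. Definition 3.3.4).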


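Proof of Lemma 11.3.5. Observe that
\[
\begin{aligned}
\sup_{v \in \mathbb{R}^b \setminus \{0\}} \left[\frac{\|Mv\|_\infty}{\|v\|_\infty}\right]
&= \sup_{v \in \mathbb{R}^b,\ \|v\|_\infty \le 1} \|Mv\|_\infty
 = \sup_{v = (v_1,v_2,\dots,v_b) \in [-1,1]^b} \|Mv\|_\infty \\
&= \sup_{v = (v_1,v_2,\dots,v_b) \in [-1,1]^b} \left(\max_{i \in \{1,2,\dots,a\}} \left|\sum_{j=1}^{b} M_{i,j} v_j\right|\right) \\
&= \max_{i \in \{1,2,\dots,a\}} \left(\sup_{v = (v_1,v_2,\dots,v_b) \in [-1,1]^b} \left|\sum_{j=1}^{b} M_{i,j} v_j\right|\right) \\
&= \max_{i \in \{1,2,\dots,a\}} \left(\sum_{j=1}^{b} |M_{i,j}|\right)
\end{aligned}
\tag{11.78}
\]
(cf. Definition 3.3.4). The proof of Lemma 11.3.5 is thus complete.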
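
For small matrices the identity in (11.77) can be verified by brute force. The sketch below, assuming Python's standard library, exploits that $v \mapsto \|Mv\|_\infty$ is convex, so that its supremum over the cube $[-1,1]^b$ is attained at one of the $2^b$ sign vectors; the matrix `M` and all function names are illustrative.

```python
from itertools import product


def row_sum_norm(matrix):
    """max_i sum_j |M[i][j]|, the middle expression in (11.77)."""
    return max(sum(abs(entry) for entry in row) for row in matrix)


def sup_norm(vector):
    return max(abs(entry) for entry in vector)


def mat_vec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def operator_norm_by_signs(matrix):
    """Supremum of ||M v||_inf over all sign vectors v in {-1, 1}^b.

    Enumerating the 2^b vertices of the cube is only feasible for small b.
    """
    b = len(matrix[0])
    return max(sup_norm(mat_vec(matrix, v))
               for v in product((-1.0, 1.0), repeat=b))


M = [[1.0, -2.0, 0.5],
     [-3.0, 0.0, 1.0]]
ROW_SUM = row_sum_norm(M)           # max(3.5, 4.0)
OP_NORM = operator_norm_by_signs(M)
CRUDE = len(M[0]) * max(abs(entry) for row in M for entry in row)  # b * max |M_ij|
```

The crude bound `CRUDE` is the one used later in the proof of Theorem 11.3.6, where every matrix entry is estimated by the maximum norm of the parameter vector.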


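Theorem 11.3.6. Let $a \in \mathbb{R}$, $b \in [a,\infty)$, $d, L \in \mathbb{N}$, $l = (l_0, l_1, \dots, l_L) \in \mathbb{N}^{L+1}$ satisfy
\[
  d \ge \sum_{k=1}^{L} l_k(l_{k-1}+1). \tag{11.79}
\]
Then it holds for all $\theta, \vartheta \in \mathbb{R}^d$ that
\[
\begin{aligned}
&\sup_{x \in [a,b]^{l_0}} \|\mathscr{N}^{\theta,l}_{-\infty,\infty}(x) - \mathscr{N}^{\vartheta,l}_{-\infty,\infty}(x)\|_\infty \\
&\le \max\{1,|a|,|b|\}\,\|\theta-\vartheta\|_\infty \left[\prod_{m=0}^{L-1}(l_m+1)\right] \left[\sum_{n=0}^{L-1} \bigl(\max\{1,\|\theta\|_\infty^{n}\}\,\|\vartheta\|_\infty^{L-1-n}\bigr)\right] \\
&\le L \max\{1,|a|,|b|\}\,(\max\{1,\|\theta\|_\infty,\|\vartheta\|_\infty\})^{L-1} \left[\prod_{m=0}^{L-1}(l_m+1)\right] \|\theta-\vartheta\|_\infty \\
&\le L \max\{1,|a|,|b|\}\,(\|l\|_\infty+1)^{L}\,(\max\{1,\|\theta\|_\infty,\|\vartheta\|_\infty\})^{L-1}\,\|\theta-\vartheta\|_\infty
\end{aligned}
\tag{11.80}
\]
(cf. Definitions 3.3.4 and 4.4.1).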
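Proof of Theorem 11.3.6. Throughout this proof, let $\theta_j = (\theta_{j,1}, \theta_{j,2}, \dots, \theta_{j,d}) \in \mathbb{R}^d$, $j \in \{1,2\}$, let $\mathfrak{d} \in \mathbb{N}$ satisfy
\[
  \mathfrak{d} = \sum_{k=1}^{L} l_k(l_{k-1}+1), \tag{11.81}
\]
let $W_{j,k} \in \mathbb{R}^{l_k \times l_{k-1}}$, $k \in \{1,2,\dots,L\}$, $j \in \{1,2\}$, and $B_{j,k} \in \mathbb{R}^{l_k}$, $k \in \{1,2,\dots,L\}$, $j \in \{1,2\}$, satisfy for all $j \in \{1,2\}$, $k \in \{1,2,\dots,L\}$ that
\[
  \mathcal{T}\bigl((W_{j,1},B_{j,1}),(W_{j,2},B_{j,2}),\dots,(W_{j,L},B_{j,L})\bigr) = (\theta_{j,1},\theta_{j,2},\dots,\theta_{j,\mathfrak{d}}), \tag{11.82}
\]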


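let $\phi_{j,k} \in \mathbf{N}$, $k \in \{1,2,\dots,L\}$, $j \in \{1,2\}$, satisfy for all $j \in \{1,2\}$, $k \in \{1,2,\dots,L\}$ that
\[
  \phi_{j,k} = \bigl((W_{j,1},B_{j,1}),(W_{j,2},B_{j,2}),\dots,(W_{j,k},B_{j,k})\bigr) \in \Bigl[\mathop{\times}\nolimits_{i=1}^{k} \bigl(\mathbb{R}^{l_i \times l_{i-1}} \times \mathbb{R}^{l_i}\bigr)\Bigr], \tag{11.83}
\]
let $D = [a,b]^{l_0}$, let $m_{j,k} \in [0,\infty)$, $j \in \{1,2\}$, $k \in \{0,1,\dots,L\}$, satisfy for all $j \in \{1,2\}$, $k \in \{0,1,\dots,L\}$ that
\[
  m_{j,k} =
  \begin{cases}
    \max\{1,|a|,|b|\} & \colon k = 0 \\
    \max\bigl\{1,\ \sup_{x \in D} \|(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{j,k}))(x)\|_\infty\bigr\} & \colon k > 0,
  \end{cases}
  \tag{11.84}
\]
and let $e_k \in [0,\infty)$, $k \in \{0,1,\dots,L\}$, satisfy for all $k \in \{0,1,\dots,L\}$ that
\[
  e_k =
  \begin{cases}
    0 & \colon k = 0 \\
    \sup_{x \in D} \|(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{1,k}))(x) - (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{2,k}))(x)\|_\infty & \colon k > 0
  \end{cases}
  \tag{11.85}
\]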

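(cf. Definitions 1.2.4, 1.3.1, 1.3.4, 1.3.5, and 3.3.4). Note that Lemma 11.3.5 ensures that
\[
\begin{aligned}
e_1 &= \sup_{x \in D} \|(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{1,1}))(x) - (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{2,1}))(x)\|_\infty \\
&= \sup_{x \in D} \|(W_{1,1}x + B_{1,1}) - (W_{2,1}x + B_{2,1})\|_\infty \\
&\le \Bigl[\sup_{x \in D} \|(W_{1,1} - W_{2,1})x\|_\infty\Bigr] + \|B_{1,1} - B_{2,1}\|_\infty \\
&\le \left[\sup_{v \in \mathbb{R}^{l_0} \setminus \{0\}} \left(\frac{\|(W_{1,1} - W_{2,1})v\|_\infty}{\|v\|_\infty}\right)\right] \Bigl[\sup_{x \in D} \|x\|_\infty\Bigr] + \|B_{1,1} - B_{2,1}\|_\infty \\
&\le l_0 \|\theta_1 - \theta_2\|_\infty \max\{|a|,|b|\} + \|B_{1,1} - B_{2,1}\|_\infty \\
&\le l_0 \|\theta_1 - \theta_2\|_\infty \max\{|a|,|b|\} + \|\theta_1 - \theta_2\|_\infty \\
&= \|\theta_1 - \theta_2\|_\infty (l_0 \max\{|a|,|b|\} + 1) \le m_{1,0} \|\theta_1 - \theta_2\|_\infty (l_0 + 1).
\end{aligned}
\tag{11.86}
\]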

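Furthermore, observe that the triangle inequality proves that for all $k \in \{1,2,\dots,L\} \cap (1,\infty)$ it holds that
\[
\begin{aligned}
e_k &= \sup_{x \in D} \|(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{1,k}))(x) - (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{2,k}))(x)\|_\infty \\
&= \sup_{x \in D} \Bigl\| \Bigl[W_{1,k}\bigl(\mathfrak{R}_{l_{k-1}}\bigl((\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{1,k-1}))(x)\bigr)\bigr) + B_{1,k}\Bigr]
 - \Bigl[W_{2,k}\bigl(\mathfrak{R}_{l_{k-1}}\bigl((\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{2,k-1}))(x)\bigr)\bigr) + B_{2,k}\Bigr] \Bigr\|_\infty \\
&\le \Bigl[\sup_{x \in D} \Bigl\| W_{1,k}\bigl(\mathfrak{R}_{l_{k-1}}\bigl((\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{1,k-1}))(x)\bigr)\bigr)
 - W_{2,k}\bigl(\mathfrak{R}_{l_{k-1}}\bigl((\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{2,k-1}))(x)\bigr)\bigr) \Bigr\|_\infty\Bigr]
 + \|\theta_1 - \theta_2\|_\infty.
\end{aligned}
\tag{11.87}
\]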


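The triangle inequality therefore establishes that for all $j \in \{1,2\}$, $k \in \{1,2,\dots,L\} \cap (1,\infty)$ it holds that
\[
\begin{aligned}
e_k &\le \Bigl[\sup_{x \in D} \Bigl\| \bigl(W_{1,k} - W_{2,k}\bigr)\bigl(\mathfrak{R}_{l_{k-1}}\bigl((\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{j,k-1}))(x)\bigr)\bigr) \Bigr\|_\infty\Bigr] \\
&\quad + \Bigl[\sup_{x \in D} \Bigl\| W_{3-j,k}\Bigl(\mathfrak{R}_{l_{k-1}}\bigl((\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{1,k-1}))(x)\bigr) - \mathfrak{R}_{l_{k-1}}\bigl((\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{2,k-1}))(x)\bigr)\Bigr) \Bigr\|_\infty\Bigr]
 + \|\theta_1 - \theta_2\|_\infty \\
&\le \left[\sup_{v \in \mathbb{R}^{l_{k-1}} \setminus \{0\}} \left(\frac{\|(W_{1,k} - W_{2,k})v\|_\infty}{\|v\|_\infty}\right)\right]
 \Bigl[\sup_{x \in D} \bigl\|\mathfrak{R}_{l_{k-1}}\bigl((\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{j,k-1}))(x)\bigr)\bigr\|_\infty\Bigr] \\
&\quad + \left[\sup_{v \in \mathbb{R}^{l_{k-1}} \setminus \{0\}} \left(\frac{\|W_{3-j,k}v\|_\infty}{\|v\|_\infty}\right)\right]
 \Bigl[\sup_{x \in D} \bigl\|\mathfrak{R}_{l_{k-1}}\bigl((\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{1,k-1}))(x)\bigr) - \mathfrak{R}_{l_{k-1}}\bigl((\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{2,k-1}))(x)\bigr)\bigr\|_\infty\Bigr]
 + \|\theta_1 - \theta_2\|_\infty.
\end{aligned}
\tag{11.88}
\]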

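Lemma 11.3.5 and Lemma 11.3.3 hence show that for all $j \in \{1,2\}$, $k \in \{1,2,\dots,L\} \cap (1,\infty)$ it holds that
\[
\begin{aligned}
e_k &\le l_{k-1}\|\theta_1 - \theta_2\|_\infty \Bigl[\sup_{x \in D} \bigl\|\mathfrak{R}_{l_{k-1}}\bigl((\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{j,k-1}))(x)\bigr)\bigr\|_\infty\Bigr] + \|\theta_1 - \theta_2\|_\infty \\
&\quad + l_{k-1}\|\theta_{3-j}\|_\infty \Bigl[\sup_{x \in D} \bigl\|\mathfrak{R}_{l_{k-1}}\bigl((\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{1,k-1}))(x)\bigr) - \mathfrak{R}_{l_{k-1}}\bigl((\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{2,k-1}))(x)\bigr)\bigr\|_\infty\Bigr] \\
&\le l_{k-1}\|\theta_1 - \theta_2\|_\infty \Bigl[\sup_{x \in D} \bigl\|(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{j,k-1}))(x)\bigr\|_\infty\Bigr] + \|\theta_1 - \theta_2\|_\infty \\
&\quad + l_{k-1}\|\theta_{3-j}\|_\infty \Bigl[\sup_{x \in D} \bigl\|(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{1,k-1}))(x) - (\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{2,k-1}))(x)\bigr\|_\infty\Bigr] \\
&\le \|\theta_1 - \theta_2\|_\infty (l_{k-1} m_{j,k-1} + 1) + l_{k-1}\|\theta_{3-j}\|_\infty\, e_{k-1}.
\end{aligned}
\tag{11.89}
\]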
Therefore, we obtain that for all j ∈ {1, 2}, k ∈ {1, 2, . . . , L} ∩ (1, ∞) it holds that

ek ≤ mj,k−1 ∥θ1 − θ2 ∥∞ (lk−1 + 1) + lk−1 ∥θ3−j ∥∞ ek−1 . (11.90)

Combining this with (11.86), the fact that e0 = 0, and the fact that m1,0 = m2,0 demonstrates
that for all j ∈ {1, 2}, k ∈ {1, 2, . . . , L} it holds that

ek ≤ mj,k−1 (lk−1 + 1)∥θ1 − θ2 ∥∞ + lk−1 ∥θ3−j ∥∞ ek−1 . (11.91)

This implies that for all j = (jn )n∈{0,1,...,L} : {0, 1, . . . , L} → {1, 2} and all k ∈ {1, 2, . . . , L}
it holds that

ek ≤ mjk−1 ,k−1 (lk−1 + 1)∥θ1 − θ2 ∥∞ + lk−1 ∥θ3−jk−1 ∥∞ ek−1 . (11.92)


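Hence, we obtain that for all $j = (j_n)_{n \in \{0,1,\dots,L\}}\colon \{0,1,\dots,L\} \to \{1,2\}$ and all $k \in \{1,2,\dots,L\}$ it holds that
\[
\begin{aligned}
e_k &\le \sum_{n=0}^{k-1} \left(\left[\prod_{m=n+1}^{k-1} \bigl(l_m \|\theta_{3-j_m}\|_\infty\bigr)\right] m_{j_n,n}\,(l_n+1)\,\|\theta_1 - \theta_2\|_\infty\right) \\
&= \|\theta_1 - \theta_2\|_\infty \left[\sum_{n=0}^{k-1} \left(\left[\prod_{m=n+1}^{k-1} \bigl(l_m \|\theta_{3-j_m}\|_\infty\bigr)\right] m_{j_n,n}\,(l_n+1)\right)\right].
\end{aligned}
\tag{11.93}
\]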

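Moreover, note that Lemma 11.3.5 ensures that for all $j \in \{1,2\}$, $k \in \{1,2,\dots,L\} \cap (1,\infty)$, $x \in D$ it holds that
\[
\begin{aligned}
\|(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{j,k}))(x)\|_\infty
&= \bigl\|W_{j,k}\bigl(\mathfrak{R}_{l_{k-1}}\bigl((\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{j,k-1}))(x)\bigr)\bigr) + B_{j,k}\bigr\|_\infty \\
&\le \left[\sup_{v \in \mathbb{R}^{l_{k-1}} \setminus \{0\}} \frac{\|W_{j,k}v\|_\infty}{\|v\|_\infty}\right] \bigl\|\mathfrak{R}_{l_{k-1}}\bigl((\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{j,k-1}))(x)\bigr)\bigr\|_\infty + \|B_{j,k}\|_\infty \\
&\le l_{k-1}\|\theta_j\|_\infty \bigl\|\mathfrak{R}_{l_{k-1}}\bigl((\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{j,k-1}))(x)\bigr)\bigr\|_\infty + \|\theta_j\|_\infty \\
&\le l_{k-1}\|\theta_j\|_\infty \bigl\|(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{j,k-1}))(x)\bigr\|_\infty + \|\theta_j\|_\infty \\
&= \bigl(l_{k-1}\bigl\|(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{j,k-1}))(x)\bigr\|_\infty + 1\bigr)\|\theta_j\|_\infty \\
&\le (l_{k-1} m_{j,k-1} + 1)\|\theta_j\|_\infty \le m_{j,k-1}(l_{k-1}+1)\|\theta_j\|_\infty.
\end{aligned}
\tag{11.94}
\]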

Therefore, we obtain for all j ∈ {1, 2}, k ∈ {1, 2, . . . , L} ∩ (1, ∞) that

mj,k ≤ max{1, mj,k−1 (lk−1 + 1)∥θj ∥∞ }. (11.95)

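In addition, observe that Lemma 11.3.5 proves that for all $j \in \{1,2\}$, $x \in D$ it holds that
\[
\begin{aligned}
\|(\mathcal{R}^{\mathbf{N}}_{\mathfrak{r}}(\phi_{j,1}))(x)\|_\infty
&= \|W_{j,1}x + B_{j,1}\|_\infty \\
&\le \left[\sup_{v \in \mathbb{R}^{l_0} \setminus \{0\}} \frac{\|W_{j,1}v\|_\infty}{\|v\|_\infty}\right] \|x\|_\infty + \|B_{j,1}\|_\infty \\
&\le l_0\|\theta_j\|_\infty\|x\|_\infty + \|\theta_j\|_\infty \le l_0\|\theta_j\|_\infty \max\{|a|,|b|\} + \|\theta_j\|_\infty \\
&= (l_0 \max\{|a|,|b|\} + 1)\|\theta_j\|_\infty \le m_{1,0}(l_0+1)\|\theta_j\|_\infty.
\end{aligned}
\tag{11.96}
\]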

Hence, we obtain that for all j ∈ {1, 2} it holds that

mj,1 ≤ max{1, mj,0 (l0 + 1)∥θj ∥∞ }. (11.97)

Combining this with (11.95) establishes that for all j ∈ {1, 2}, k ∈ {1, 2, . . . , L} it holds
that
mj,k ≤ max{1, mj,k−1 (lk−1 + 1)∥θj ∥∞ }. (11.98)


Therefore, we obtain that for all $j \in \{1,2\}$, $k \in \{0,1,\dots,L\}$ it holds that
\[
  m_{j,k} \le m_{j,0} \left[\prod_{n=0}^{k-1}(l_n+1)\right] \max\{1,\|\theta_j\|_\infty^{k}\}. \tag{11.99}
\]

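Combining this with (11.93) shows that for all $j = (j_n)_{n \in \{0,1,\dots,L\}}\colon \{0,1,\dots,L\} \to \{1,2\}$ and all $k \in \{1,2,\dots,L\}$ it holds that
\[
\begin{aligned}
e_k &\le \|\theta_1 - \theta_2\|_\infty \left[\sum_{n=0}^{k-1} \left(\left[\prod_{m=n+1}^{k-1} \bigl(l_m\|\theta_{3-j_m}\|_\infty\bigr)\right]
 \cdot \left(m_{j_n,0} \left[\prod_{v=0}^{n-1}(l_v+1)\right] \max\{1,\|\theta_{j_n}\|_\infty^{n}\}\,(l_n+1)\right)\right)\right] \\
&= m_{1,0}\|\theta_1 - \theta_2\|_\infty \left[\sum_{n=0}^{k-1} \left(\left[\prod_{m=n+1}^{k-1} \bigl(l_m\|\theta_{3-j_m}\|_\infty\bigr)\right]
 \left(\left[\prod_{v=0}^{n}(l_v+1)\right] \max\{1,\|\theta_{j_n}\|_\infty^{n}\}\right)\right)\right] \\
&\le m_{1,0}\|\theta_1 - \theta_2\|_\infty \left[\sum_{n=0}^{k-1} \left(\left[\prod_{m=n+1}^{k-1} \|\theta_{3-j_m}\|_\infty\right]
 \left[\prod_{v=0}^{k-1}(l_v+1)\right] \max\{1,\|\theta_{j_n}\|_\infty^{n}\}\right)\right] \\
&= m_{1,0}\|\theta_1 - \theta_2\|_\infty \left[\prod_{n=0}^{k-1}(l_n+1)\right]
 \left[\sum_{n=0}^{k-1} \left(\left[\prod_{m=n+1}^{k-1} \|\theta_{3-j_m}\|_\infty\right] \max\{1,\|\theta_{j_n}\|_\infty^{n}\}\right)\right].
\end{aligned}
\tag{11.100}
\]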

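Hence, we obtain that for all $j \in \{1,2\}$, $k \in \{1,2,\dots,L\}$ it holds that
\[
\begin{aligned}
e_k &\le m_{1,0}\|\theta_1 - \theta_2\|_\infty \left[\prod_{n=0}^{k-1}(l_n+1)\right]
 \left[\sum_{n=0}^{k-1} \left(\left[\prod_{m=n+1}^{k-1} \|\theta_{3-j}\|_\infty\right] \max\{1,\|\theta_j\|_\infty^{n}\}\right)\right] \\
&= m_{1,0}\|\theta_1 - \theta_2\|_\infty \left[\prod_{n=0}^{k-1}(l_n+1)\right]
 \left[\sum_{n=0}^{k-1} \bigl(\max\{1,\|\theta_j\|_\infty^{n}\}\,\|\theta_{3-j}\|_\infty^{k-1-n}\bigr)\right] \\
&\le k\, m_{1,0}\|\theta_1 - \theta_2\|_\infty \,(\max\{1,\|\theta_1\|_\infty,\|\theta_2\|_\infty\})^{k-1} \left[\prod_{m=0}^{k-1}(l_m+1)\right].
\end{aligned}
\tag{11.101}
\]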

The proof of Theorem 11.3.6 is thus complete.
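
The last estimate in (11.80) can be compared with the observed sensitivity of a small network. The sketch below assumes Python's standard library and an illustrative parameter layout (for each layer the weights in row-major order followed by the biases); this layout, the helper names, the architecture $l = (2,3,1)$, and the sampling scheme are assumptions of the sketch rather than statements about Definition 4.4.1. Since the bound only involves $\|\theta\|_\infty$, $\|\vartheta\|_\infty$, and $\|\theta - \vartheta\|_\infty$, it does not depend on the order in which the entries of the parameter vector are assigned to weights and biases.

```python
import random


def realization(theta, layers, x):
    """Unclipped fully connected ReLU network with layer sizes `layers`.

    Assumed layout (illustrative): for each layer k the l_k * l_{k-1} weights
    in row-major order, followed by the l_k biases.
    """
    pos, out = 0, list(x)
    for k in range(1, len(layers)):
        n_in, n_out = layers[k - 1], layers[k]
        nxt = []
        for i in range(n_out):
            row = theta[pos + i * n_in: pos + (i + 1) * n_in]
            bias = theta[pos + n_out * n_in + i]
            nxt.append(sum(w * o for w, o in zip(row, out)) + bias)
        pos += n_out * (n_in + 1)
        # ReLU on hidden layers only; the output layer is affine
        out = nxt if k == len(layers) - 1 else [max(t, 0.0) for t in nxt]
    return out


def sup_norm(v):
    return max(abs(t) for t in v)


def lipschitz_bound(theta, vartheta, layers, a, b):
    """Right-hand side of the last estimate in (11.80)."""
    depth = len(layers) - 1
    diff = sup_norm([s - t for s, t in zip(theta, vartheta)])
    size = max(1.0, sup_norm(theta), sup_norm(vartheta))
    return (depth * max(1.0, abs(a), abs(b)) * (max(layers) + 1) ** depth
            * size ** (depth - 1) * diff)


RNG = random.Random(2)
LAYERS = [2, 3, 1]
N_PARAMS = sum(LAYERS[k] * (LAYERS[k - 1] + 1) for k in range(1, len(LAYERS)))
THETA = [RNG.uniform(-2.0, 2.0) for _ in range(N_PARAMS)]
VARTHETA = [t + RNG.uniform(-0.1, 0.1) for t in THETA]
A, B = -1.0, 1.0
POINTS = [[RNG.uniform(A, B) for _ in range(LAYERS[0])] for _ in range(2000)]
OBSERVED = max(
    sup_norm([s - t for s, t in zip(realization(THETA, LAYERS, x),
                                    realization(VARTHETA, LAYERS, x))])
    for x in POINTS
)
BOUND = lipschitz_bound(THETA, VARTHETA, LAYERS, A, B)
```

The observed difference is a lower estimate of the supremum in (11.80) because only finitely many inputs are sampled; it is typically far below the bound, which grows like $(\|l\|_\infty + 1)^L$ and is therefore pessimistic for most parameter pairs.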

Corollary 11.3.7. Let $a \in \mathbb{R}$, $b \in [a,\infty)$, $u \in [-\infty,\infty)$, $v \in (u,\infty]$, $d, L \in \mathbb{N}$, $l = (l_0, l_1, \dots, l_L) \in \mathbb{N}^{L+1}$ satisfy
\[
  d \ge \sum_{k=1}^{L} l_k(l_{k-1}+1). \tag{11.102}
\]


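Then it holds for all $\theta, \vartheta \in \mathbb{R}^d$ that
\[
  \sup_{x \in [a,b]^{l_0}} \|\mathscr{N}^{\theta,l}_{u,v}(x) - \mathscr{N}^{\vartheta,l}_{u,v}(x)\|_\infty
  \le L \max\{1,|a|,|b|\}\,(\|l\|_\infty+1)^{L}\,(\max\{1,\|\theta\|_\infty,\|\vartheta\|_\infty\})^{L-1}\,\|\theta-\vartheta\|_\infty \tag{11.103}
\]
(cf. Definitions 3.3.4 and 4.4.1).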

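Proof of Corollary 11.3.7. Note that Lemma 11.3.4 and Theorem 11.3.6 demonstrate that for all $\theta, \vartheta \in \mathbb{R}^d$ it holds that
\[
\begin{aligned}
\sup_{x \in [a,b]^{l_0}} \|\mathscr{N}^{\theta,l}_{u,v}(x) - \mathscr{N}^{\vartheta,l}_{u,v}(x)\|_\infty
&= \sup_{x \in [a,b]^{l_0}} \|C_{u,v,l_L}(\mathscr{N}^{\theta,l}_{-\infty,\infty}(x)) - C_{u,v,l_L}(\mathscr{N}^{\vartheta,l}_{-\infty,\infty}(x))\|_\infty \\
&\le \sup_{x \in [a,b]^{l_0}} \|\mathscr{N}^{\theta,l}_{-\infty,\infty}(x) - \mathscr{N}^{\vartheta,l}_{-\infty,\infty}(x)\|_\infty \\
&\le L \max\{1,|a|,|b|\}\,(\|l\|_\infty+1)^{L}\,(\max\{1,\|\theta\|_\infty,\|\vartheta\|_\infty\})^{L-1}\,\|\theta-\vartheta\|_\infty
\end{aligned}
\tag{11.104}
\]
(cf. Definitions 1.2.10, 3.3.4, and 4.4.1). The proof of Corollary 11.3.7 is thus complete.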

11.3.2 Strong convergence rates for the optimization error involving ANNs
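Lemma 11.3.8. Let $d, \mathfrak{d}, L, M \in \mathbb{N}$, $B, b \in [1,\infty)$, $u \in \mathbb{R}$, $v \in (u,\infty)$, $l = (l_0, l_1, \dots, l_L) \in \mathbb{N}^{L+1}$, $D \subseteq [-b,b]^d$, assume $l_0 = d$, $l_L = 1$, and $\mathfrak{d} \ge \sum_{i=1}^{L} l_i(l_{i-1}+1)$, let $\Omega$ be a set, let $X_j\colon \Omega \to D$, $j \in \{1,2,\dots,M\}$, and $Y_j\colon \Omega \to [u,v]$, $j \in \{1,2,\dots,M\}$, be functions, and let $R\colon [-B,B]^{\mathfrak{d}} \times \Omega \to [0,\infty)$ satisfy for all $\theta \in [-B,B]^{\mathfrak{d}}$, $\omega \in \Omega$ that
\[
  R(\theta,\omega) = \frac{1}{M} \left[\sum_{j=1}^{M} |\mathscr{N}^{\theta,l}_{u,v}(X_j(\omega)) - Y_j(\omega)|^2\right] \tag{11.105}
\]
(cf. Definition 4.4.1). Then it holds for all $\theta, \vartheta \in [-B,B]^{\mathfrak{d}}$, $\omega \in \Omega$ that
\[
  |R(\theta,\omega) - R(\vartheta,\omega)| \le 2(v-u)\,bL(\|l\|_\infty+1)^{L} B^{L-1} \|\theta-\vartheta\|_\infty \tag{11.106}
\]
(cf. Definition 3.3.4).

Proof of Lemma 11.3.8. Observe that the fact that for all $x_1, x_2, y \in \mathbb{R}$ it holds that $(x_1-y)^2 - (x_2-y)^2 = (x_1-x_2)((x_1-y)+(x_2-y))$, the fact that for all $\theta \in \mathbb{R}^{\mathfrak{d}}$, $x \in \mathbb{R}^d$ it holds that $\mathscr{N}^{\theta,l}_{u,v}(x) \in [u,v]$, and the assumption that for all $j \in \{1,2,\dots,M\}$, $\omega \in \Omega$ it holds that $Y_j(\omega) \in [u,v]$ imply that for all $\theta, \vartheta \in [-B,B]^{\mathfrak{d}}$, $\omega \in \Omega$ it holds that
\[
\begin{aligned}
|R(\theta,\omega) - R(\vartheta,\omega)|
&= \frac{1}{M} \left| \left[\sum_{j=1}^{M} |\mathscr{N}^{\theta,l}_{u,v}(X_j(\omega)) - Y_j(\omega)|^2\right] - \left[\sum_{j=1}^{M} |\mathscr{N}^{\vartheta,l}_{u,v}(X_j(\omega)) - Y_j(\omega)|^2\right] \right| \\
&\le \frac{1}{M} \left[\sum_{j=1}^{M} \Bigl| [\mathscr{N}^{\theta,l}_{u,v}(X_j(\omega)) - Y_j(\omega)]^2 - [\mathscr{N}^{\vartheta,l}_{u,v}(X_j(\omega)) - Y_j(\omega)]^2 \Bigr|\right] \\
&= \frac{1}{M} \left[\sum_{j=1}^{M} \Bigl( \bigl|\mathscr{N}^{\theta,l}_{u,v}(X_j(\omega)) - \mathscr{N}^{\vartheta,l}_{u,v}(X_j(\omega))\bigr| \cdot \bigl|[\mathscr{N}^{\theta,l}_{u,v}(X_j(\omega)) - Y_j(\omega)] + [\mathscr{N}^{\vartheta,l}_{u,v}(X_j(\omega)) - Y_j(\omega)]\bigr| \Bigr)\right] \\
&\le \frac{2}{M} \left[\sum_{j=1}^{M} \Bigl( \bigl[\sup\nolimits_{x \in D} |\mathscr{N}^{\theta,l}_{u,v}(x) - \mathscr{N}^{\vartheta,l}_{u,v}(x)|\bigr] \bigl[\sup\nolimits_{y_1,y_2 \in [u,v]} |y_1 - y_2|\bigr] \Bigr)\right] \\
&= 2(v-u) \bigl[\sup\nolimits_{x \in D} |\mathscr{N}^{\theta,l}_{u,v}(x) - \mathscr{N}^{\vartheta,l}_{u,v}(x)|\bigr].
\end{aligned}
\tag{11.107}
\]
Furthermore, note that the assumption that $D \subseteq [-b,b]^d$, $\mathfrak{d} \ge \sum_{i=1}^{L} l_i(l_{i-1}+1)$, $l_0 = d$, $l_L = 1$, $b \ge 1$, and $B \ge 1$ and Corollary 11.3.7 (applied with $a \curvearrowleft -b$, $b \curvearrowleft b$, $u \curvearrowleft u$, $v \curvearrowleft v$, $d \curvearrowleft \mathfrak{d}$, $L \curvearrowleft L$, $l \curvearrowleft l$ in the notation of Corollary 11.3.7) ensure that for all $\theta, \vartheta \in [-B,B]^{\mathfrak{d}}$ it holds that
\[
\begin{aligned}
\sup\nolimits_{x \in D} |\mathscr{N}^{\theta,l}_{u,v}(x) - \mathscr{N}^{\vartheta,l}_{u,v}(x)|
&\le \sup\nolimits_{x \in [-b,b]^d} |\mathscr{N}^{\theta,l}_{u,v}(x) - \mathscr{N}^{\vartheta,l}_{u,v}(x)| \\
&\le L \max\{1,b\}\,(\|l\|_\infty+1)^{L}\,(\max\{1,\|\theta\|_\infty,\|\vartheta\|_\infty\})^{L-1}\,\|\theta-\vartheta\|_\infty \\
&\le bL(\|l\|_\infty+1)^{L} B^{L-1} \|\theta-\vartheta\|_\infty
\end{aligned}
\tag{11.108}
\]
(cf. Definition 3.3.4). This and (11.107) prove that for all $\theta, \vartheta \in [-B,B]^{\mathfrak{d}}$, $\omega \in \Omega$ it holds that
\[
  |R(\theta,\omega) - R(\vartheta,\omega)| \le 2(v-u)\,bL(\|l\|_\infty+1)^{L} B^{L-1} \|\theta-\vartheta\|_\infty. \tag{11.109}
\]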
The proof of Lemma 11.3.8 is thus complete.
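Corollary 11.3.9. Let $d, \mathbf{d}, \mathfrak{d}, L, M, K \in \mathbb{N}$, $B, b \in [1,\infty)$, $u \in \mathbb{R}$, $v \in (u,\infty)$, $l = (l_0, l_1, \dots, l_L) \in \mathbb{N}^{L+1}$, $D \subseteq [-b,b]^d$, assume $l_0 = d$, $l_L = 1$, and $\mathfrak{d} \ge \mathbf{d} = \sum_{i=1}^{L} l_i(l_{i-1}+1)$, let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $\Theta_k\colon \Omega \to [-B,B]^{\mathfrak{d}}$, $k \in \{1,2,\dots,K\}$, be i.i.d. random variables, assume that $\Theta_1$ is continuously uniformly distributed on $[-B,B]^{\mathfrak{d}}$, let $X_j\colon \Omega \to D$, $j \in \{1,2,\dots,M\}$, and $Y_j\colon \Omega \to [u,v]$, $j \in \{1,2,\dots,M\}$, be random variables, and let $R\colon [-B,B]^{\mathfrak{d}} \times \Omega \to [0,\infty)$ satisfy for all $\theta \in [-B,B]^{\mathfrak{d}}$, $\omega \in \Omega$ that
\[
  R(\theta,\omega) = \frac{1}{M} \left[\sum_{j=1}^{M} |\mathscr{N}^{\theta,l}_{u,v}(X_j(\omega)) - Y_j(\omega)|^2\right] \tag{11.110}
\]
(cf. Definition 4.4.1). Then

(i) it holds that $R$ is a $(\mathcal{B}([-B,B]^{\mathfrak{d}}) \otimes \mathcal{F})/\mathcal{B}([0,\infty))$-measurable function and

(ii) it holds for all $\theta \in [-B,B]^{\mathfrak{d}}$, $p \in (0,\infty)$ that
\[
\begin{aligned}
\bigl(\mathbb{E}\bigl[\min\nolimits_{k \in \{1,2,\dots,K\}} |R(\Theta_k) - R(\theta)|^p\bigr]\bigr)^{1/p}
&\le \frac{4(v-u)\,bL(\|l\|_\infty+1)^{L} B^{L} \max\{1,(p/\mathbf{d})^{1/\mathbf{d}}\}}{K^{1/\mathbf{d}}} \\
&\le \frac{4(v-u)\,bL(\|l\|_\infty+1)^{L} B^{L} \max\{1,p\}}{K^{[L^{-1}(\|l\|_\infty+1)^{-2}]}}
\end{aligned}
\tag{11.111}
\]
(cf. Definition 3.3.4).

Proof of Corollary 11.3.9. Throughout this proof, let $\mathfrak{L} = 2(v-u)\,bL(\|l\|_\infty+1)^{L} B^{L-1}$, let $P\colon [-B,B]^{\mathfrak{d}} \to [-B,B]^{\mathbf{d}}$ satisfy for all $\theta = (\theta_1, \theta_2, \dots, \theta_{\mathfrak{d}}) \in [-B,B]^{\mathfrak{d}}$ that $P(\theta) = (\theta_1, \theta_2, \dots, \theta_{\mathbf{d}})$, and let $\mathcal{R}\colon [-B,B]^{\mathbf{d}} \times \Omega \to \mathbb{R}$ satisfy for all $\theta \in [-B,B]^{\mathbf{d}}$, $\omega \in \Omega$ that
\[
  \mathcal{R}(\theta,\omega) = \frac{1}{M} \left[\sum_{j=1}^{M} |\mathscr{N}^{\theta,l}_{u,v}(X_j(\omega)) - Y_j(\omega)|^2\right]. \tag{11.112}
\]
Observe that the fact that $\forall\, \theta \in [-B,B]^{\mathfrak{d}}\colon \mathscr{N}^{\theta,l}_{u,v} = \mathscr{N}^{P(\theta),l}_{u,v}$ establishes that for all $\theta \in [-B,B]^{\mathfrak{d}}$, $\omega \in \Omega$ it holds that
\[
\begin{aligned}
R(\theta,\omega)
&= \frac{1}{M} \left[\sum_{j=1}^{M} |\mathscr{N}^{\theta,l}_{u,v}(X_j(\omega)) - Y_j(\omega)|^2\right] \\
&= \frac{1}{M} \left[\sum_{j=1}^{M} |\mathscr{N}^{P(\theta),l}_{u,v}(X_j(\omega)) - Y_j(\omega)|^2\right] = \mathcal{R}(P(\theta),\omega).
\end{aligned}
\tag{11.113}
\]
Furthermore, note that Lemma 11.3.8 (applied with $\mathfrak{d} \curvearrowleft \mathbf{d}$, $R \curvearrowleft ([-B,B]^{\mathbf{d}} \times \Omega \ni (\theta,\omega) \mapsto \mathcal{R}(\theta,\omega) \in [0,\infty))$ in the notation of Lemma 11.3.8) shows that for all $\theta, \vartheta \in [-B,B]^{\mathbf{d}}$, $\omega \in \Omega$ it holds that
\[
  |\mathcal{R}(\theta,\omega) - \mathcal{R}(\vartheta,\omega)| \le 2(v-u)\,bL(\|l\|_\infty+1)^{L} B^{L-1} \|\theta-\vartheta\|_\infty = \mathfrak{L}\|\theta-\vartheta\|_\infty. \tag{11.114}
\]
Moreover, observe that the assumption that $X_j$, $j \in \{1,2,\dots,M\}$, and $Y_j$, $j \in \{1,2,\dots,M\}$, are random variables demonstrates that $\mathcal{R}\colon [-B,B]^{\mathbf{d}} \times \Omega \to \mathbb{R}$ is a random field. This, (11.114), the fact that $P \circ \Theta_k\colon \Omega \to [-B,B]^{\mathbf{d}}$, $k \in \{1,2,\dots,K\}$, are i.i.d. random variables, the fact that $P \circ \Theta_1$ is continuously uniformly distributed on $[-B,B]^{\mathbf{d}}$, and Proposition 11.2.7 (applied with $d \curvearrowleft \mathbf{d}$, $\alpha \curvearrowleft -B$, $\beta \curvearrowleft B$, $R \curvearrowleft \mathcal{R}$, $(\Theta_k)_{k \in \{1,2,\dots,K\}} \curvearrowleft (P \circ \Theta_k)_{k \in \{1,2,\dots,K\}}$ in the notation of Proposition 11.2.7) imply that for all $\theta \in [-B,B]^{\mathfrak{d}}$, $p \in (0,\infty)$ it holds that $\mathcal{R}$ is $(\mathcal{B}([-B,B]^{\mathbf{d}}) \otimes \mathcal{F})/\mathcal{B}(\mathbb{R})$-measurable and
\[
\begin{aligned}
\bigl(\mathbb{E}\bigl[\min\nolimits_{k \in \{1,2,\dots,K\}} |\mathcal{R}(P(\Theta_k)) - \mathcal{R}(P(\theta))|^p\bigr]\bigr)^{1/p}
&\le \frac{\mathfrak{L}(2B) \max\{1,(p/\mathbf{d})^{1/\mathbf{d}}\}}{K^{1/\mathbf{d}}} \\
&= \frac{4(v-u)\,bL(\|l\|_\infty+1)^{L} B^{L} \max\{1,(p/\mathbf{d})^{1/\mathbf{d}}\}}{K^{1/\mathbf{d}}}.
\end{aligned}
\tag{11.115}
\]
The fact that $P$ is $\mathcal{B}([-B,B]^{\mathfrak{d}})/\mathcal{B}([-B,B]^{\mathbf{d}})$-measurable and (11.113) therefore prove item (i). In addition, note that (11.113), (11.115), and the fact that $2 \le \mathbf{d} = \sum_{i=1}^{L} l_i(l_{i-1}+1) \le L(\|l\|_\infty+1)^2$ ensure that for all $\theta \in [-B,B]^{\mathfrak{d}}$, $p \in (0,\infty)$ it holds that
\[
\begin{aligned}
\bigl(\mathbb{E}\bigl[\min\nolimits_{k \in \{1,2,\dots,K\}} |R(\Theta_k) - R(\theta)|^p\bigr]\bigr)^{1/p}
&= \bigl(\mathbb{E}\bigl[\min\nolimits_{k \in \{1,2,\dots,K\}} |\mathcal{R}(P(\Theta_k)) - \mathcal{R}(P(\theta))|^p\bigr]\bigr)^{1/p} \\
&\le \frac{4(v-u)\,bL(\|l\|_\infty+1)^{L} B^{L} \max\{1,(p/\mathbf{d})^{1/\mathbf{d}}\}}{K^{1/\mathbf{d}}} \\
&\le \frac{4(v-u)\,bL(\|l\|_\infty+1)^{L} B^{L} \max\{1,p\}}{K^{[L^{-1}(\|l\|_\infty+1)^{-2}]}}.
\end{aligned}
\tag{11.116}
\]
This establishes item (ii). The proof of Corollary 11.3.9 is thus complete.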

Part IV

Generalization

Chapter 12

Probabilistic generalization error estimates

In Chapter 15 below we establish a full error analysis for the training of ANNs in the specific
situation of GD-type optimization methods with many independent random initializations
(see Corollary 15.2.3). For this combined error analysis we do not only employ estimates
for the approximation error (see Part II above) and the optimization error (see Part III
above) but we also employ suitable generalization error estimates. Such generalization error
estimates are the subject of this chapter (cf. Corollary 12.3.10 below) and the next (cf.
Corollary 13.3.3 below). While in this chapter we treat probabilistic generalization error
estimates, in Chapter 13 we will present generalization error estimates in the strong $L^p$-sense.
In the literature, related generalization error estimates can, for instance, be found in
the survey articles and books [25, 35, 36, 87, 373] and the references therein. The specific
material in Section 12.1 is inspired by Duchi [116], the specific material in Section 12.2
is inspired by Cucker & Smale [87, Section 6 in Chapter I] and Carl & Stephani [61,
Section 1.1], and the specific presentation of Section 12.3 is strongly based on Beck et al. [25,
Section 3.2].

12.1 Concentration inequalities for random variables


12.1.1 Markov’s inequality
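Lemma 12.1.1 (Markov inequality). Let $(\Omega,\mathcal{F},\mu)$ be a measure space, let $X\colon \Omega \to [0,\infty)$ be $\mathcal{F}/\mathcal{B}([0,\infty))$-measurable, and let $\varepsilon \in (0,\infty)$. Then
\[
  \mu(X \ge \varepsilon) \le \frac{\int_\Omega X \,d\mu}{\varepsilon}. \tag{12.1}
\]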

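Proof of Lemma 12.1.1. Observe that the fact that $X \ge 0$ proves that
\[
  \mathbb{1}_{\{X \ge \varepsilon\}} = \frac{\varepsilon \mathbb{1}_{\{X \ge \varepsilon\}}}{\varepsilon} \le \frac{X \mathbb{1}_{\{X \ge \varepsilon\}}}{\varepsilon} \le \frac{X}{\varepsilon}. \tag{12.2}
\]
Hence, we obtain that
\[
  \mu(X \ge \varepsilon) = \int_\Omega \mathbb{1}_{\{X \ge \varepsilon\}} \,d\mu \le \frac{\int_\Omega X \,d\mu}{\varepsilon}. \tag{12.3}
\]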
The proof of Lemma 12.1.1 is thus complete.
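
Since Lemma 12.1.1 holds for every measure space, it applies in particular to the empirical measure of a finite sample, where both sides of (12.1) can be computed directly. A minimal sketch, assuming Python's standard library; the exponential sample and the thresholds are illustrative choices:

```python
import random


def markov_sides(samples, eps):
    """Left- and right-hand side of (12.1) for the empirical measure of `samples`."""
    n = len(samples)
    lhs = sum(1 for s in samples if s >= eps) / n  # mu(X >= eps)
    rhs = (sum(samples) / n) / eps                 # (integral of X d mu) / eps
    return lhs, rhs


RNG = random.Random(3)
SAMPLES = [RNG.expovariate(1.0) for _ in range(10_000)]  # nonnegative draws
RESULTS = {eps: markov_sides(SAMPLES, eps) for eps in (0.5, 1.0, 2.0, 4.0)}
```

For every threshold the empirical frequency is dominated by the scaled empirical mean, and the gap widens for large thresholds, which illustrates why the sharper bounds of the subsequent subsections are of interest.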

12.1.2 A first concentration inequality


12.1.2.1 On the variance of bounded random variables
Lemma 12.1.2. Let $x \in [0,1]$, $y \in \mathbb{R}$. Then
\[
  (x-y)^2 \le (1-x)y^2 + x(1-y)^2. \tag{12.4}
\]
Proof of Lemma 12.1.2. Observe that the assumption that $x \in [0,1]$ assures that
\[
  (1-x)y^2 + x(1-y)^2 = y^2 - xy^2 + x - 2xy + xy^2 \ge y^2 + x^2 - 2xy = (x-y)^2. \tag{12.5}
\]
This establishes (12.4). The proof of Lemma 12.1.2 is thus complete.
Lemma 12.1.3. It holds that $\sup_{p \in \mathbb{R}} p(1-p) = \tfrac{1}{4}$.
Proof of Lemma 12.1.3. Throughout this proof, let $f\colon \mathbb{R} \to \mathbb{R}$ satisfy for all $p \in \mathbb{R}$ that $f(p) = p(1-p)$. Observe that the fact that $\forall\, p \in \mathbb{R}\colon f'(p) = 1-2p$ implies that $\{p \in \mathbb{R}\colon f'(p) = 0\} = \{1/2\}$. Combining this with the fact that $f$ is strictly concave implies that
\[
  \sup_{p \in \mathbb{R}} p(1-p) = \sup_{p \in \mathbb{R}} f(p) = f(1/2) = 1/4. \tag{12.6}
\]
The proof of Lemma 12.1.3 is thus complete.


Lemma 12.1.4. Let (Ω, F, P) be a probability space and let X : Ω → [0, 1] be a random
variable. Then
Var(X) ≤ 1/4. (12.7)
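Proof of Lemma 12.1.4. Observe that Lemma 12.1.2 implies that
\[
\begin{aligned}
\operatorname{Var}(X) &= \mathbb{E}\bigl[(X - \mathbb{E}[X])^2\bigr] \le \mathbb{E}\bigl[(1-X)(\mathbb{E}[X])^2 + X(1-\mathbb{E}[X])^2\bigr] \\
&= (1-\mathbb{E}[X])(\mathbb{E}[X])^2 + \mathbb{E}[X](1-\mathbb{E}[X])^2 \\
&= (1-\mathbb{E}[X])\,\mathbb{E}[X]\,(\mathbb{E}[X] + (1-\mathbb{E}[X])) \\
&= (1-\mathbb{E}[X])\,\mathbb{E}[X].
\end{aligned}
\tag{12.8}
\]
This and Lemma 12.1.3 demonstrate that $\operatorname{Var}(X) \le 1/4$. The proof of Lemma 12.1.4 is thus complete.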


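Lemma 12.1.5. Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $a \in \mathbb{R}$, $b \in [a,\infty)$, and let $X\colon \Omega \to [a,b]$ be a random variable. Then
\[
  \operatorname{Var}(X) \le \frac{(b-a)^2}{4}. \tag{12.9}
\]
Proof of Lemma 12.1.5. Throughout this proof, assume without loss of generality that $a < b$. Observe that Lemma 12.1.4 implies that
\[
\begin{aligned}
\operatorname{Var}(X) &= \mathbb{E}\bigl[(X - \mathbb{E}[X])^2\bigr] = (b-a)^2\, \mathbb{E}\Bigl[\Bigl(\tfrac{X - a - (\mathbb{E}[X] - a)}{b-a}\Bigr)^2\Bigr] \\
&= (b-a)^2\, \mathbb{E}\Bigl[\Bigl(\tfrac{X-a}{b-a} - \mathbb{E}\bigl[\tfrac{X-a}{b-a}\bigr]\Bigr)^2\Bigr] \\
&= (b-a)^2 \operatorname{Var}\bigl(\tfrac{X-a}{b-a}\bigr) \le (b-a)^2 \bigl(\tfrac{1}{4}\bigr) = \frac{(b-a)^2}{4}.
\end{aligned}
\tag{12.10}
\]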
The proof of Lemma 12.1.5 is thus complete.
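
The bound (12.9) is attained by the distribution that puts mass $1/2$ on each endpoint of $[a,b]$. The following sketch, assuming Python's standard library, compares the population variance of a uniform sample and of such a two-point sample with $(b-a)^2/4$; the interval and the sample sizes are illustrative.

```python
import random
import statistics


def variance_and_bound(samples, a, b):
    """Population variance of `samples` and the bound (b - a)^2 / 4 from (12.9)."""
    return statistics.pvariance(samples), (b - a) ** 2 / 4.0


RNG = random.Random(4)
A, B = -1.0, 3.0
UNIFORM = [RNG.uniform(A, B) for _ in range(5_000)]
TWO_POINT = [A, B] * 2_500  # half of the mass at each endpoint: the extremal case
VAR_UNIFORM, BOUND = variance_and_bound(UNIFORM, A, B)
VAR_TWO_POINT, _ = variance_and_bound(TWO_POINT, A, B)
```

The uniform sample has variance close to $(b-a)^2/12$, well below the bound, while the two-point sample reaches the bound exactly.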

12.1.2.2 A concentration inequality


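Lemma 12.1.6. Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space, let $N \in \mathbb{N}$, $\varepsilon \in (0,\infty)$, $a_1, a_2, \dots, a_N \in \mathbb{R}$, $b_1 \in [a_1,\infty)$, $b_2 \in [a_2,\infty)$, $\dots$, $b_N \in [a_N,\infty)$, and let $X_n\colon \Omega \to [a_n,b_n]$, $n \in \{1,2,\dots,N\}$, be independent random variables. Then
\[
  \mathbb{P}\left(\left|\sum_{n=1}^{N} \bigl(X_n - \mathbb{E}[X_n]\bigr)\right| \ge \varepsilon\right) \le \frac{\sum_{n=1}^{N} (b_n - a_n)^2}{4\varepsilon^2}. \tag{12.11}
\]
Proof of Lemma 12.1.6. Note that Lemma 12.1.1 assures that
\[
\begin{aligned}
\mathbb{P}\left(\left|\sum_{n=1}^{N} \bigl(X_n - \mathbb{E}[X_n]\bigr)\right| \ge \varepsilon\right)
&= \mathbb{P}\left(\left|\sum_{n=1}^{N} \bigl(X_n - \mathbb{E}[X_n]\bigr)\right|^2 \ge \varepsilon^2\right) \\
&\le \frac{\mathbb{E}\Bigl[\bigl|\sum_{n=1}^{N} \bigl(X_n - \mathbb{E}[X_n]\bigr)\bigr|^2\Bigr]}{\varepsilon^2}.
\end{aligned}
\tag{12.12}
\]
In addition, note that the assumption that $X_n\colon \Omega \to [a_n,b_n]$, $n \in \{1,2,\dots,N\}$, are independent random variables and Lemma 12.1.5 demonstrate that
\[
\begin{aligned}
\mathbb{E}\Bigl[\bigl|\textstyle\sum_{n=1}^{N} \bigl(X_n - \mathbb{E}[X_n]\bigr)\bigr|^2\Bigr]
&= \sum_{n,m=1}^{N} \mathbb{E}\Bigl[\bigl(X_n - \mathbb{E}[X_n]\bigr)\bigl(X_m - \mathbb{E}[X_m]\bigr)\Bigr] \\
&= \sum_{n=1}^{N} \mathbb{E}\Bigl[\bigl(X_n - \mathbb{E}[X_n]\bigr)^2\Bigr] \le \frac{\sum_{n=1}^{N} (b_n - a_n)^2}{4}.
\end{aligned}
\tag{12.13}
\]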

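Combining this with (12.12) establishes
\[
  \mathbb{P}\left(\left|\sum_{n=1}^{N} \bigl(X_n - \mathbb{E}[X_n]\bigr)\right| \ge \varepsilon\right) \le \frac{\sum_{n=1}^{N} (b_n - a_n)^2}{4\varepsilon^2}. \tag{12.14}
\]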

The proof of Lemma 12.1.6 is thus complete.
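
The concentration inequality (12.11) can be compared with simulated tail frequencies. The sketch below assumes Python's standard library and the illustrative choice of $N = 100$ independent random variables that are uniformly distributed on $[0,1]$ (so $a_n = 0$, $b_n = 1$, $\mathbb{E}[X_n] = 1/2$); the seed, the threshold, and the number of trials are arbitrary.

```python
import random


def tail_frequency(n_vars, eps, trials, rng):
    """Empirical frequency of |sum_n (X_n - E[X_n])| >= eps for X_n ~ U[0, 1]."""
    hits = 0
    for _ in range(trials):
        centered_sum = sum(rng.random() - 0.5 for _ in range(n_vars))
        if abs(centered_sum) >= eps:
            hits += 1
    return hits / trials


def chebyshev_type_bound(widths, eps):
    """Right-hand side of (12.11): sum_n (b_n - a_n)^2 / (4 eps^2)."""
    return sum(w * w for w in widths) / (4.0 * eps * eps)


RNG = random.Random(5)
N_VARS, EPS = 100, 10.0
FREQ = tail_frequency(N_VARS, EPS, trials=2_000, rng=RNG)
BOUND = chebyshev_type_bound([1.0] * N_VARS, EPS)  # 100 / 400
```

The simulated frequency is much smaller than the bound. This is to be expected, since (12.11) only uses second moments and therefore decays polynomially in $\varepsilon$.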

12.1.3 Moment-generating functions


Definition 12.1.7 (Moment-generating functions). Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space and let $X\colon \Omega \to \mathbb{R}$ be a random variable. Then we denote by $M_{X,\mathbb{P}}\colon \mathbb{R} \to [0,\infty]$ (we denote by $M_X\colon \mathbb{R} \to [0,\infty]$) the function which satisfies for all $t \in \mathbb{R}$ that
\[
  M_{X,\mathbb{P}}(t) = \mathbb{E}\bigl[e^{tX}\bigr] \tag{12.15}
\]
and we call $M_{X,\mathbb{P}}$ the moment-generating function of $X$ with respect to $\mathbb{P}$ (we call $M_X$ the moment-generating function of $X$).

12.1.3.1 Moment-generating function for the sum of independent random variables

Lemma 12.1.8. Let (Ω, F, P) be a probability space, let t ∈ R, N ∈ N, and let $X_n \colon \Omega \to \mathbb{R}$, n ∈ {1, 2, …, N}, be independent random variables. Then
\[
M_{\sum_{n=1}^{N} X_n}(t) = \prod_{n=1}^{N} M_{X_n}(t). \tag{12.16}
\]

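Proof of Lemma 12.1.8. Observe that Fubini's theorem ensures that for all t ∈ R it holds that
\[
M_{\sum_{n=1}^{N} X_n}(t) = \mathbb{E}\Big[e^{t(\sum_{n=1}^{N} X_n)}\Big] = \mathbb{E}\bigg[\prod_{n=1}^{N} e^{tX_n}\bigg] = \prod_{n=1}^{N} \mathbb{E}\big[e^{tX_n}\big] = \prod_{n=1}^{N} M_{X_n}(t). \tag{12.17}
\]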
The proof of Lemma 12.1.8 is thus complete.
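For random variables with finitely many values, the product formula (12.16) can be verified numerically. The sketch below is an illustration under that finiteness assumption: it builds the distribution of the sum of independent random variables by convolution and compares its moment-generating function with the product of the individual ones. The helper names `mgf` and `sum_distribution` are hypothetical.

```python
import math
from itertools import product


def mgf(dist, t):
    """Moment-generating function E[exp(tX)] of a finite distribution.

    dist maps each value x to its probability P(X = x).
    """
    return sum(p * math.exp(t * x) for x, p in dist.items())


def sum_distribution(dists):
    """Distribution of X_1 + ... + X_N for independent X_n with the given laws."""
    total = {0: 1.0}
    for dist in dists:
        new_total = {}
        for (s, ps), (x, px) in product(total.items(), dist.items()):
            new_total[s + x] = new_total.get(s + x, 0.0) + ps * px
        total = new_total
    return total


dists = [{0: 0.3, 1: 0.7}, {-1: 0.5, 2: 0.5}, {0: 0.2, 1: 0.8}]
t = 0.4
lhs = mgf(sum_distribution(dists), t)        # M of the sum
rhs = math.prod(mgf(d, t) for d in dists)    # product of the individual M's
print(lhs, rhs)
```

The convolution step is exactly where independence enters: the probability of a pair of values is the product of the marginal probabilities.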

12.1.4 Chernoff bounds


12.1.4.1 Probability to cross a barrier
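Proposition 12.1.9. Let (Ω, F, P) be a probability space, let X : Ω → R be a random variable, and let ε ∈ R. Then
\[
\mathbb{P}(X \ge \varepsilon) \le \inf_{\lambda \in [0,\infty)} \Big(e^{-\lambda\varepsilon}\, \mathbb{E}\big[e^{\lambda X}\big]\Big) = \inf_{\lambda \in [0,\infty)} \Big(e^{-\lambda\varepsilon} M_X(\lambda)\Big). \tag{12.18}
\]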


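Proof of Proposition 12.1.9. Note that Lemma 12.1.1 ensures that for all λ ∈ [0, ∞) it holds that
\[
\mathbb{P}(X \ge \varepsilon) \le \mathbb{P}(\lambda X \ge \lambda\varepsilon) = \mathbb{P}\big(\exp(\lambda X) \ge \exp(\lambda\varepsilon)\big) \le \frac{\mathbb{E}[\exp(\lambda X)]}{\exp(\lambda\varepsilon)} = e^{-\lambda\varepsilon}\, \mathbb{E}\big[e^{\lambda X}\big]. \tag{12.19}
\]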
The proof of Proposition 12.1.9 is thus complete.
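As an illustration of (12.18), consider the sum X of n independent Bernoulli(p) random variables, whose moment-generating function is (1 − p + p e^λ)^n by Lemma 12.1.8. The following sketch, with hypothetical function names and an arbitrarily chosen grid of λ values, compares the exact tail probability with the Chernoff bound. Since every single λ ∈ [0, ∞) yields a valid bound, minimizing over a finite grid gives a valid (possibly slightly larger) upper bound.

```python
import math


def binomial_tail(n, p, eps):
    """Exact P(X >= eps) for X ~ Binomial(n, p)."""
    k0 = max(math.ceil(eps), 0)
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k0, n + 1))


def chernoff_bound_binomial(n, p, eps, lambdas):
    """min over the given lambdas of exp(-lambda*eps) * M_X(lambda), cf. (12.18)."""
    return min(
        math.exp(-lam * eps) * (1 - p + p * math.exp(lam)) ** n for lam in lambdas
    )


n, p, eps = 50, 0.3, 25
lambdas = [k / 1000 for k in range(0, 5001)]  # grid on [0, 5]; an assumption

exact = binomial_tail(n, p, eps)
bound = chernoff_bound_binomial(n, p, eps, lambdas)
print(exact, bound)
```

The grid contains λ = 0, so the computed bound never exceeds 1.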
Corollary 12.1.10. Let (Ω, F, P) be a probability space, let X : Ω → R be a random variable, and let c, ε ∈ R. Then
\[
\mathbb{P}(X \ge c + \varepsilon) \le \inf_{\lambda \in [0,\infty)} \Big(e^{-\lambda\varepsilon} M_{X-c}(\lambda)\Big). \tag{12.20}
\]

Proof of Corollary 12.1.10. Throughout this proof, let Y : Ω → R satisfy


Y = X − c. (12.21)
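Observe that Proposition 12.1.9 and (12.21) ensure that
\[
\mathbb{P}(X - c \ge \varepsilon) = \mathbb{P}(Y \ge \varepsilon) \le \inf_{\lambda \in [0,\infty)} \Big(e^{-\lambda\varepsilon} M_{Y}(\lambda)\Big) = \inf_{\lambda \in [0,\infty)} \Big(e^{-\lambda\varepsilon} M_{X-c}(\lambda)\Big). \tag{12.22}
\]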

The proof of Corollary 12.1.10 is thus complete.


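Corollary 12.1.11. Let (Ω, F, P) be a probability space, let X : Ω → R be a random variable with E[|X|] < ∞, and let ε ∈ R. Then
\[
\mathbb{P}(X \ge \mathbb{E}[X] + \varepsilon) \le \inf_{\lambda \in [0,\infty)} \Big(e^{-\lambda\varepsilon} M_{X - \mathbb{E}[X]}(\lambda)\Big). \tag{12.23}
\]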

Proof of Corollary 12.1.11. Observe that Corollary 12.1.10 (applied with c ↶ E[X] in the
notation of Corollary 12.1.10) establishes (12.23). The proof of Corollary 12.1.11 is thus
complete.

12.1.4.2 Probability to fall below a barrier


Corollary 12.1.12. Let (Ω, F, P) be a probability space, let X : Ω → R be a random variable, and let c, ε ∈ R. Then
\[
\mathbb{P}(X \le c - \varepsilon) \le \inf_{\lambda \in [0,\infty)} \Big(e^{-\lambda\varepsilon} M_{c-X}(\lambda)\Big). \tag{12.24}
\]

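Proof of Corollary 12.1.12. Throughout this proof, let 𝔠 ∈ R satisfy 𝔠 = −c and let 𝔛 : Ω → R satisfy
\[
\mathfrak{X} = -X. \tag{12.25}
\]
Observe that Corollary 12.1.10 and (12.25) ensure that
\[
\begin{aligned}
\mathbb{P}(X \le c - \varepsilon) &= \mathbb{P}(-X \ge -c + \varepsilon) = \mathbb{P}(\mathfrak{X} \ge \mathfrak{c} + \varepsilon) \le \inf_{\lambda \in [0,\infty)} \Big(e^{-\lambda\varepsilon} M_{\mathfrak{X} - \mathfrak{c}}(\lambda)\Big) \\
&= \inf_{\lambda \in [0,\infty)} \Big(e^{-\lambda\varepsilon} M_{c - X}(\lambda)\Big).
\end{aligned} \tag{12.26}
\]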

The proof of Corollary 12.1.12 is thus complete.


12.1.4.3 Sums of independent random variables


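Corollary 12.1.13. Let (Ω, F, P) be a probability space, let ε ∈ R, N ∈ N, and let $X_n \colon \Omega \to \mathbb{R}$, n ∈ {1, 2, …, N}, be independent random variables with $\max_{n \in \{1,2,\dots,N\}} \mathbb{E}[|X_n|] < \infty$. Then
\[
\mathbb{P}\bigg(\bigg[\sum_{n=1}^{N} \big(X_n - \mathbb{E}[X_n]\big)\bigg] \ge \varepsilon\bigg) \le \inf_{\lambda \in [0,\infty)} \bigg(e^{-\lambda\varepsilon} \bigg[\prod_{n=1}^{N} M_{X_n - \mathbb{E}[X_n]}(\lambda)\bigg]\bigg). \tag{12.27}
\]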

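Proof of Corollary 12.1.13. Throughout this proof, let $Y_n \colon \Omega \to \mathbb{R}$, n ∈ {1, 2, …, N}, satisfy for all n ∈ {1, 2, …, N} that
\[
Y_n = X_n - \mathbb{E}[X_n]. \tag{12.28}
\]
Observe that Proposition 12.1.9, Lemma 12.1.8, and (12.28) ensure that
\[
\begin{aligned}
\mathbb{P}\bigg(\bigg[\sum_{n=1}^{N} \big(X_n - \mathbb{E}[X_n]\big)\bigg] \ge \varepsilon\bigg)
&= \mathbb{P}\bigg(\bigg[\sum_{n=1}^{N} Y_n\bigg] \ge \varepsilon\bigg) \le \inf_{\lambda \in [0,\infty)} \Big(e^{-\lambda\varepsilon} M_{\sum_{n=1}^{N} Y_n}(\lambda)\Big) \\
&= \inf_{\lambda \in [0,\infty)} \bigg(e^{-\lambda\varepsilon} \bigg[\prod_{n=1}^{N} M_{Y_n}(\lambda)\bigg]\bigg) = \inf_{\lambda \in [0,\infty)} \bigg(e^{-\lambda\varepsilon} \bigg[\prod_{n=1}^{N} M_{X_n - \mathbb{E}[X_n]}(\lambda)\bigg]\bigg).
\end{aligned} \tag{12.29}
\]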

The proof of Corollary 12.1.13 is thus complete.

12.1.5 Hoeffding’s inequality


12.1.5.1 On the moment-generating function for bounded random variables
Lemma 12.1.14. Let (Ω, F, P) be a probability space, let λ, a ∈ R, b ∈ (a, ∞), p ∈ [0, 1] satisfy $p = \frac{-a}{(b-a)}$, let X : Ω → [a, b] be a random variable with E[X] = 0, and let ϕ : R → R satisfy for all x ∈ R that $\phi(x) = \ln(1 - p + p e^{x}) - px$. Then
\[
\mathbb{E}\big[e^{\lambda X}\big] \le e^{\phi(\lambda(b-a))}. \tag{12.30}
\]

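Proof of Lemma 12.1.14. Observe that for all x ∈ R it holds that
\[
\begin{aligned}
x(b - a) &= bx - ax = [ab - ax] + [bx - ab] = [a(b - x)] + [b(x - a)] \\
&= a(b - x) + b[b - a - b + x] = a(b - x) + b[(b - a) - (b - x)].
\end{aligned} \tag{12.31}
\]
Hence, we obtain that for all x ∈ R it holds that
\[
x = a\Big(\frac{b-x}{b-a}\Big) + b\Big[1 - \Big(\frac{b-x}{b-a}\Big)\Big]. \tag{12.32}
\]
This implies that for all x ∈ R it holds that
\[
\lambda x = \Big(\frac{b-x}{b-a}\Big)\lambda a + \Big[1 - \Big(\frac{b-x}{b-a}\Big)\Big]\lambda b. \tag{12.33}
\]
The fact that R ∋ x ↦ e^x ∈ R is convex hence demonstrates that for all x ∈ [a, b] it holds that
\[
e^{\lambda x} = \exp\Big(\Big(\frac{b-x}{b-a}\Big)\lambda a + \Big[1 - \Big(\frac{b-x}{b-a}\Big)\Big]\lambda b\Big) \le \Big(\frac{b-x}{b-a}\Big)e^{\lambda a} + \Big[1 - \Big(\frac{b-x}{b-a}\Big)\Big]e^{\lambda b}. \tag{12.34}
\]
The assumption that E[X] = 0 therefore assures that
\[
\mathbb{E}\big[e^{\lambda X}\big] \le \Big(\frac{b}{b-a}\Big)e^{\lambda a} + \Big[1 - \Big(\frac{b}{b-a}\Big)\Big]e^{\lambda b}. \tag{12.35}
\]
Combining this with the fact that
\[
\begin{aligned}
\frac{b}{(b-a)} &= 1 - \Big[1 - \frac{b}{(b-a)}\Big] \\
&= 1 - \Big[\frac{(b-a)}{(b-a)} - \frac{b}{(b-a)}\Big] \\
&= 1 - \Big[\frac{-a}{(b-a)}\Big] = 1 - p
\end{aligned} \tag{12.36}
\]
demonstrates that
\[
\begin{aligned}
\mathbb{E}\big[e^{\lambda X}\big] &\le \Big(\frac{b}{b-a}\Big)e^{\lambda a} + \Big[1 - \Big(\frac{b}{b-a}\Big)\Big]e^{\lambda b} \\
&= (1-p)\,e^{\lambda a} + [1 - (1-p)]\,e^{\lambda b} \\
&= (1-p)\,e^{\lambda a} + p\,e^{\lambda b} \\
&= \big[(1-p) + p\,e^{\lambda(b-a)}\big]e^{\lambda a}.
\end{aligned} \tag{12.37}
\]
Moreover, note that the assumption that $p = \frac{-a}{(b-a)}$ shows that p(b − a) = −a. Hence, we obtain that a = −p(b − a). This and (12.37) assure that
\[
\begin{aligned}
\mathbb{E}\big[e^{\lambda X}\big] &\le \big[(1-p) + p\,e^{\lambda(b-a)}\big]e^{-p\lambda(b-a)} = \exp\Big(\ln\Big(\big[(1-p) + p\,e^{\lambda(b-a)}\big]e^{-p\lambda(b-a)}\Big)\Big) \\
&= \exp\Big(\ln\big((1-p) + p\,e^{\lambda(b-a)}\big) - p\lambda(b-a)\Big) = \exp\big(\phi(\lambda(b-a))\big).
\end{aligned} \tag{12.38}
\]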
The proof of Lemma 12.1.14 is thus complete.


12.1.5.2 Hoeffding’s lemma


Lemma 12.1.15. Let p ∈ [0, 1] and let ϕ : R → R satisfy for all x ∈ R that $\phi(x) = \ln(1 - p + p e^{x}) - px$. Then it holds for all x ∈ R that $\phi(x) \le \frac{x^2}{8}$.

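Proof of Lemma 12.1.15. Observe that the fundamental theorem of calculus ensures that for all x ∈ R it holds that
\[
\begin{aligned}
\phi(x) &= \phi(0) + \int_0^x \phi'(y)\, dy \\
&= \phi(0) + \phi'(0)\,x + \int_0^x \int_0^y \phi''(z)\, dz\, dy \\
&\le \phi(0) + \phi'(0)\,x + \frac{x^2}{2}\Big[\sup_{z \in \mathbb{R}} \phi''(z)\Big].
\end{aligned} \tag{12.39}
\]
Moreover, note that for all x ∈ R it holds that
\[
\phi'(x) = \Big[\frac{p e^{x}}{1 - p + p e^{x}}\Big] - p \qquad\text{and}\qquad \phi''(x) = \Big[\frac{p e^{x}}{1 - p + p e^{x}}\Big] - \Big[\frac{p^2 e^{2x}}{(1 - p + p e^{x})^2}\Big]. \tag{12.40}
\]
Hence, we obtain that
\[
\phi'(0) = \Big[\frac{p}{1 - p + p}\Big] - p = 0. \tag{12.41}
\]
In the next step we combine (12.40) and the fact that for all a ∈ R it holds that
\[
a(1 - a) = a - a^2 = -\Big[a^2 - 2a\big(\tfrac12\big) + \big(\tfrac12\big)^2\Big] + \big(\tfrac12\big)^2 = \tfrac14 - \big[a - \tfrac12\big]^2 \le \tfrac14 \tag{12.42}
\]
(applied for every x ∈ R with $a = \frac{p e^{x}}{1 - p + p e^{x}}$, so that $\phi''(x) = a(1-a)$) to obtain that for all x ∈ R it holds that $\phi''(x) \le \frac14$. This, (12.39), and (12.41) ensure that for all x ∈ R it holds that
\[
\phi(x) \le \phi(0) + \phi'(0)\,x + \frac{x^2}{2}\Big[\sup_{z \in \mathbb{R}} \phi''(z)\Big] = \phi(0) + \frac{x^2}{2}\Big[\sup_{z \in \mathbb{R}} \phi''(z)\Big] \le \phi(0) + \frac{x^2}{8} = \frac{x^2}{8}. \tag{12.43}
\]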

The proof of Lemma 12.1.15 is thus complete.
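A quick numerical sanity check of the inequality ϕ(x) ≤ x²/8 is sketched below. It evaluates ϕ on a grid of values of p and x (the grid and the function names are assumptions made for this illustration) and reports the largest value of ϕ(x) − x²/8, which should be nonpositive up to floating-point rounding.

```python
import math


def phi(p, x):
    """phi(x) = ln(1 - p + p e^x) - p x from Lemma 12.1.15."""
    return math.log(1.0 - p + p * math.exp(x)) - p * x


def max_violation(p_values, x_values):
    """Largest value of phi(x) - x^2/8 over the grid (<= 0 up to rounding)."""
    return max(phi(p, x) - x * x / 8.0 for p in p_values for x in x_values)


p_grid = [k / 20 for k in range(21)]          # p in {0, 0.05, ..., 1}
x_grid = [k / 10 for k in range(-300, 301)]   # x in [-30, 30]

worst = max_violation(p_grid, x_grid)

# For p = 1/2 and small x the bound is nearly attained.
ratio = phi(0.5, 0.1) / (0.1**2 / 8.0)
print(worst, ratio)
```

The ratio close to 1 for p = 1/2 indicates that the constant 1/8 is the best possible in a bound of the form ϕ(x) ≤ c x² valid for all p ∈ [0, 1].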

Lemma 12.1.16. Let (Ω, F, P) be a probability space, let a ∈ R, b ∈ [a, ∞), λ ∈ R, and let X : Ω → [a, b] be a random variable with E[X] = 0. Then
\[
\mathbb{E}\big[\exp(\lambda X)\big] \le \exp\Big(\frac{\lambda^2 (b-a)^2}{8}\Big). \tag{12.44}
\]


Proof of Lemma 12.1.16. Throughout this proof, assume without loss of generality that a < b, let p ∈ R satisfy $p = \frac{-a}{(b-a)}$, and let $\phi_r \colon \mathbb{R} \to \mathbb{R}$, r ∈ [0, 1], satisfy for all r ∈ [0, 1], x ∈ R that
\[
\phi_r(x) = \ln(1 - r + r e^{x}) - rx. \tag{12.45}
\]
Observe that the assumption that E[X] = 0 and the fact that a ≤ E[X] ≤ b ensure that a ≤ 0 ≤ b. Combining this with the assumption that a < b implies that
\[
0 \le p = \frac{-a}{(b-a)} \le \frac{(b-a)}{(b-a)} = 1. \tag{12.46}
\]
Lemma 12.1.14 and Lemma 12.1.15 hence demonstrate that
\[
\mathbb{E}\big[e^{\lambda X}\big] \le e^{\phi_p(\lambda(b-a))} = \exp\big(\phi_p(\lambda(b-a))\big) \le \exp\Big(\frac{(\lambda(b-a))^2}{8}\Big) = \exp\Big(\frac{\lambda^2 (b-a)^2}{8}\Big). \tag{12.47}
\]

The proof of Lemma 12.1.16 is thus complete.

12.1.5.3 Probability to cross a barrier


Lemma 12.1.17. Let β ∈ (0, ∞), ε ∈ [0, ∞) and let f : [0, ∞) → R satisfy for all λ ∈ [0, ∞) that f(λ) = βλ² − ελ. Then
\[
\inf_{\lambda \in [0,\infty)} f(\lambda) = f\big(\tfrac{\varepsilon}{2\beta}\big) = -\frac{\varepsilon^2}{4\beta}. \tag{12.48}
\]

Proof of Lemma 12.1.17. Observe that for all λ ∈ R it holds that

f ′ (λ) = 2βλ − ε. (12.49)

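Moreover, note that
\[
f\big(\tfrac{\varepsilon}{2\beta}\big) = \beta\Big[\frac{\varepsilon}{2\beta}\Big]^2 - \varepsilon\Big[\frac{\varepsilon}{2\beta}\Big] = \frac{\varepsilon^2}{4\beta} - \frac{\varepsilon^2}{2\beta} = -\frac{\varepsilon^2}{4\beta}. \tag{12.50}
\]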

Combining this and (12.49) establishes (12.48). The proof of Lemma 12.1.17 is thus
complete.

Corollary 12.1.18. Let (Ω, F, P) be a probability space, let N ∈ N, ε ∈ [0, ∞), $a_1, a_2, \dots, a_N \in \mathbb{R}$, $b_1 \in [a_1, \infty)$, $b_2 \in [a_2, \infty)$, …, $b_N \in [a_N, \infty)$ satisfy $\sum_{n=1}^{N} (b_n - a_n)^2 \neq 0$, and let $X_n \colon \Omega \to [a_n, b_n]$, n ∈ {1, 2, …, N}, be independent random variables. Then
\[
\mathbb{P}\bigg(\bigg[\sum_{n=1}^{N} \big(X_n - \mathbb{E}[X_n]\big)\bigg] \ge \varepsilon\bigg) \le \exp\bigg(\frac{-2\varepsilon^2}{\sum_{n=1}^{N} (b_n - a_n)^2}\bigg). \tag{12.51}
\]


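Proof of Corollary 12.1.18. Throughout this proof, let β ∈ (0, ∞) satisfy
\[
\beta = \frac{1}{8}\bigg[\sum_{n=1}^{N} (b_n - a_n)^2\bigg]. \tag{12.52}
\]
Observe that Corollary 12.1.13 ensures that
\[
\mathbb{P}\bigg(\bigg[\sum_{n=1}^{N} \big(X_n - \mathbb{E}[X_n]\big)\bigg] \ge \varepsilon\bigg) \le \inf_{\lambda \in [0,\infty)} \bigg(e^{-\lambda\varepsilon} \bigg[\prod_{n=1}^{N} M_{X_n - \mathbb{E}[X_n]}(\lambda)\bigg]\bigg). \tag{12.53}
\]
Moreover, note that Lemma 12.1.16 proves that for all n ∈ {1, 2, …, N} it holds that
\[
M_{X_n - \mathbb{E}[X_n]}(\lambda) \le \exp\Big(\frac{\lambda^2 [(b_n - \mathbb{E}[X_n]) - (a_n - \mathbb{E}[X_n])]^2}{8}\Big) = \exp\Big(\frac{\lambda^2 (b_n - a_n)^2}{8}\Big). \tag{12.54}
\]
Combining this with (12.53) and Lemma 12.1.17 ensures that
\[
\begin{aligned}
\mathbb{P}\bigg(\bigg[\sum_{n=1}^{N} \big(X_n - \mathbb{E}[X_n]\big)\bigg] \ge \varepsilon\bigg)
&\le \inf_{\lambda \in [0,\infty)} \exp\bigg(\bigg[\sum_{n=1}^{N} \frac{\lambda^2 (b_n - a_n)^2}{8}\bigg] - \lambda\varepsilon\bigg) \\
&= \inf_{\lambda \in [0,\infty)} \bigg[\exp\bigg(\lambda^2 \bigg[\frac{\sum_{n=1}^{N} (b_n - a_n)^2}{8}\bigg] - \lambda\varepsilon\bigg)\bigg] = \exp\Big(\inf_{\lambda \in [0,\infty)} \big[\beta\lambda^2 - \varepsilon\lambda\big]\Big) \\
&= \exp\Big(\frac{-\varepsilon^2}{4\beta}\Big) = \exp\bigg(\frac{-2\varepsilon^2}{\sum_{n=1}^{N} (b_n - a_n)^2}\bigg).
\end{aligned} \tag{12.55}
\]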

The proof of Corollary 12.1.18 is thus complete.

12.1.5.4 Probability to fall below a barrier


Corollary 12.1.19. Let (Ω, F, P) be a probability space, let N ∈ N, ε ∈ [0, ∞), $a_1, a_2, \dots, a_N \in \mathbb{R}$, $b_1 \in [a_1, \infty)$, $b_2 \in [a_2, \infty)$, …, $b_N \in [a_N, \infty)$ satisfy $\sum_{n=1}^{N} (b_n - a_n)^2 \neq 0$, and let $X_n \colon \Omega \to [a_n, b_n]$, n ∈ {1, 2, …, N}, be independent random variables. Then
\[
\mathbb{P}\bigg(\bigg[\sum_{n=1}^{N} \big(X_n - \mathbb{E}[X_n]\big)\bigg] \le -\varepsilon\bigg) \le \exp\bigg(\frac{-2\varepsilon^2}{\sum_{n=1}^{N} (b_n - a_n)^2}\bigg). \tag{12.56}
\]

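Proof of Corollary 12.1.19. Throughout this proof, let $\mathfrak{X}_n \colon \Omega \to [-b_n, -a_n]$, n ∈ {1, 2, …, N}, satisfy for all n ∈ {1, 2, …, N} that
\[
\mathfrak{X}_n = -X_n. \tag{12.57}
\]
Observe that Corollary 12.1.18 and (12.57) ensure that
\[
\begin{aligned}
\mathbb{P}\bigg(\bigg[\sum_{n=1}^{N} \big(X_n - \mathbb{E}[X_n]\big)\bigg] \le -\varepsilon\bigg)
&= \mathbb{P}\bigg(\bigg[\sum_{n=1}^{N} \big(-X_n - \mathbb{E}[-X_n]\big)\bigg] \ge \varepsilon\bigg) \\
&= \mathbb{P}\bigg(\bigg[\sum_{n=1}^{N} \big(\mathfrak{X}_n - \mathbb{E}[\mathfrak{X}_n]\big)\bigg] \ge \varepsilon\bigg) \le \exp\bigg(\frac{-2\varepsilon^2}{\sum_{n=1}^{N} (b_n - a_n)^2}\bigg).
\end{aligned} \tag{12.58}
\]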

The proof of Corollary 12.1.19 is thus complete.

12.1.5.5 Hoeffding’s inequality


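Corollary 12.1.20. Let (Ω, F, P) be a probability space, let N ∈ N, ε ∈ [0, ∞), $a_1, a_2, \dots, a_N \in \mathbb{R}$, $b_1 \in [a_1, \infty)$, $b_2 \in [a_2, \infty)$, …, $b_N \in [a_N, \infty)$ satisfy $\sum_{n=1}^{N} (b_n - a_n)^2 \neq 0$, and let $X_n \colon \Omega \to [a_n, b_n]$, n ∈ {1, 2, …, N}, be independent random variables. Then
\[
\mathbb{P}\bigg(\bigg|\sum_{n=1}^{N} \big(X_n - \mathbb{E}[X_n]\big)\bigg| \ge \varepsilon\bigg) \le 2\exp\bigg(\frac{-2\varepsilon^2}{\sum_{n=1}^{N} (b_n - a_n)^2}\bigg). \tag{12.59}
\]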

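Proof of Corollary 12.1.20. Observe that
\[
\begin{aligned}
&\mathbb{P}\bigg(\bigg|\sum_{n=1}^{N} \big(X_n - \mathbb{E}[X_n]\big)\bigg| \ge \varepsilon\bigg) \\
&= \mathbb{P}\bigg(\bigg\{\bigg[\sum_{n=1}^{N} \big(X_n - \mathbb{E}[X_n]\big)\bigg] \ge \varepsilon\bigg\} \cup \bigg\{\bigg[\sum_{n=1}^{N} \big(X_n - \mathbb{E}[X_n]\big)\bigg] \le -\varepsilon\bigg\}\bigg) \\
&\le \mathbb{P}\bigg(\bigg[\sum_{n=1}^{N} \big(X_n - \mathbb{E}[X_n]\big)\bigg] \ge \varepsilon\bigg) + \mathbb{P}\bigg(\bigg[\sum_{n=1}^{N} \big(X_n - \mathbb{E}[X_n]\big)\bigg] \le -\varepsilon\bigg).
\end{aligned} \tag{12.60}
\]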

Combining this with Corollary 12.1.18 and Corollary 12.1.19 establishes (12.59). The proof
of Corollary 12.1.20 is thus complete.

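Corollary 12.1.21. Let (Ω, F, P) be a probability space, let N ∈ N, ε ∈ [0, ∞), $a_1, a_2, \dots, a_N \in \mathbb{R}$, $b_1 \in [a_1, \infty)$, $b_2 \in [a_2, \infty)$, …, $b_N \in [a_N, \infty)$ satisfy $\sum_{n=1}^{N} (b_n - a_n)^2 \neq 0$, and let $X_n \colon \Omega \to [a_n, b_n]$, n ∈ {1, 2, …, N}, be independent random variables. Then
\[
\mathbb{P}\bigg(\frac{1}{N}\bigg|\sum_{n=1}^{N} \big(X_n - \mathbb{E}[X_n]\big)\bigg| \ge \varepsilon\bigg) \le 2\exp\bigg(\frac{-2\varepsilon^2 N^2}{\sum_{n=1}^{N} (b_n - a_n)^2}\bigg). \tag{12.61}
\]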


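Proof of Corollary 12.1.21. Observe that Corollary 12.1.20 ensures that
\[
\begin{aligned}
\mathbb{P}\bigg(\frac{1}{N}\bigg|\sum_{n=1}^{N} \big(X_n - \mathbb{E}[X_n]\big)\bigg| \ge \varepsilon\bigg)
&= \mathbb{P}\bigg(\bigg|\sum_{n=1}^{N} \big(X_n - \mathbb{E}[X_n]\big)\bigg| \ge \varepsilon N\bigg) \\
&\le 2\exp\bigg(\frac{-2(\varepsilon N)^2}{\sum_{n=1}^{N} (b_n - a_n)^2}\bigg).
\end{aligned} \tag{12.62}
\]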

The proof of Corollary 12.1.21 is thus complete.
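The right-hand side of (12.61) is straightforward to evaluate. The sketch below, with hypothetical function names, computes the bound for given interval widths b_n − a_n and, in the special case of equal widths, solves 2 exp(−2ε²N/w²) ≤ δ for the number N of samples; the returned N is computed in floating-point arithmetic and should be read as an estimate.

```python
import math


def hoeffding_mean_bound(widths, eps):
    """Right-hand side of (12.61); widths[n] plays the role of b_n - a_n."""
    n = len(widths)
    return 2.0 * math.exp(-2.0 * eps**2 * n**2 / sum(w * w for w in widths))


def sample_size(width, eps, delta):
    """Number N of samples with 2 exp(-2 eps^2 N / width^2) <= delta (equal widths)."""
    return max(1, math.ceil(width**2 * math.log(2.0 / delta) / (2.0 * eps**2)))


# 100 independent [0, 1]-valued variables, deviation 0.1 of the empirical mean.
bound_100 = hoeffding_mean_bound([1.0] * 100, 0.1)

# Samples needed so that the empirical mean deviates by 0.05 with probability <= 5%.
n_required = sample_size(1.0, 0.05, 0.05)
print(bound_100, n_required)
```

For equal widths the exponent in (12.61) grows linearly in N, so the required number of samples depends only logarithmically on the failure probability δ and quadratically on 1/ε.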


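Exercise 12.1.1. Prove or disprove the following statement: For every probability space (Ω, F, P), every N ∈ N, ε ∈ [0, ∞), and every random variable $X = (X_1, X_2, \dots, X_N) \colon \Omega \to [-1, 1]^N$ with $\forall\, a = (a_1, a_2, \dots, a_N) \in [-1, 1]^N \colon \mathbb{P}\big(\bigcap_{i=1}^{N} \{X_i \le a_i\}\big) = \prod_{i=1}^{N} \frac{a_i + 1}{2}$ it holds that
\[
\mathbb{P}\bigg(\bigg|\frac{1}{N}\sum_{n=1}^{N} \big(X_n - \mathbb{E}[X_n]\big)\bigg| \ge \varepsilon\bigg) \le 2\exp\Big(\frac{-\varepsilon^2 N}{2}\Big). \tag{12.63}
\]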

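Exercise 12.1.2. Prove or disprove the following statement: For every probability space (Ω, F, P), every N ∈ N, and every random variable $X = (X_1, X_2, \dots, X_N) \colon \Omega \to [-1, 1]^N$ with $\forall\, a = (a_1, a_2, \dots, a_N) \in [-1, 1]^N \colon \mathbb{P}\big(\bigcap_{i=1}^{N} \{X_i \le a_i\}\big) = \prod_{i=1}^{N} \frac{a_i + 1}{2}$ it holds that
\[
\mathbb{P}\bigg(\bigg|\frac{1}{N}\sum_{n=1}^{N} \big(X_n - \mathbb{E}[X_n]\big)\bigg| \ge \frac12\bigg) \le 2\Big[\frac{e}{4}\Big]^{N}. \tag{12.64}
\]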

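Exercise 12.1.3. Prove or disprove the following statement: For every probability space (Ω, F, P), every N ∈ N, and every random variable $X = (X_1, X_2, \dots, X_N) \colon \Omega \to [-1, 1]^N$ with $\forall\, a = (a_1, a_2, \dots, a_N) \in [-1, 1]^N \colon \mathbb{P}\big(\bigcap_{i=1}^{N} \{X_i \le a_i\}\big) = \prod_{i=1}^{N} \frac{a_i + 1}{2}$ it holds that
\[
\mathbb{P}\bigg(\bigg|\frac{1}{N}\sum_{n=1}^{N} \big(X_n - \mathbb{E}[X_n]\big)\bigg| \ge \frac12\bigg) \le 2\Big[\frac{e - e^{-3}}{4}\Big]^{N}. \tag{12.65}
\]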

Exercise 12.1.4. Prove or disprove the following statement: For every probability space (Ω, F, P), every N ∈ N, ε ∈ [0, ∞), and every standard normal random variable $X = (X_1, X_2, \dots, X_N) \colon \Omega \to \mathbb{R}^N$ it holds that
\[
\mathbb{P}\bigg(\bigg|\frac{1}{N}\sum_{n=1}^{N} \big(X_n - \mathbb{E}[X_n]\big)\bigg| \ge \varepsilon\bigg) \le 2\exp\Big(\frac{-\varepsilon^2 N}{2}\Big). \tag{12.66}
\]

12.1.6 A strengthened Hoeffding’s inequality


Lemma 12.1.22. Let f, g : (0, ∞) → R satisfy for all x ∈ (0, ∞) that f(x) = 2 exp(−2x) and $g(x) = \frac{1}{4x}$. Then

(i) it holds that $\lim_{x \to \infty} \frac{f(x)}{g(x)} = \lim_{x \searrow 0} \frac{f(x)}{g(x)} = 0$ and

(ii) it holds that $g(\tfrac12) = \tfrac12 < \tfrac23 < \tfrac{2}{e} = f(\tfrac12)$.

Proof of Lemma 12.1.22. Note that the fact that $\lim_{x \to \infty} \frac{\exp(-x)}{x^{-1}} = \lim_{x \searrow 0} \frac{\exp(-x)}{x^{-1}} = 0$ establishes item (i). Moreover, observe that the fact that e < 3 implies item (ii). The proof of Lemma 12.1.22 is thus complete.

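Corollary 12.1.23. Let (Ω, F, P) be a probability space, let N ∈ N, ε ∈ (0, ∞), $a_1, a_2, \dots, a_N \in \mathbb{R}$, $b_1 \in [a_1, \infty)$, $b_2 \in [a_2, \infty)$, …, $b_N \in [a_N, \infty)$ satisfy $\sum_{n=1}^{N} (b_n - a_n)^2 \neq 0$, and let $X_n \colon \Omega \to [a_n, b_n]$, n ∈ {1, 2, …, N}, be independent random variables. Then
\[
\mathbb{P}\bigg(\bigg|\sum_{n=1}^{N} \big(X_n - \mathbb{E}[X_n]\big)\bigg| \ge \varepsilon\bigg) \le \min\bigg\{1,\; 2\exp\bigg(\frac{-2\varepsilon^2}{\sum_{n=1}^{N} (b_n - a_n)^2}\bigg),\; \frac{\sum_{n=1}^{N} (b_n - a_n)^2}{4\varepsilon^2}\bigg\}. \tag{12.67}
\]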

Proof of Corollary 12.1.23. Observe that Lemma 12.1.6, Corollary 12.1.20, and the fact
that for all B ∈ F it holds that P(B) ≤ 1 establish (12.67). The proof of Corollary 12.1.23
is thus complete.
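Lemma 12.1.22 indicates that neither the Hoeffding term nor the Chebyshev-type term from Lemma 12.1.6 dominates the other for all ε, which is why (12.67) takes the minimum. The following sketch evaluates the right-hand side of (12.67); the function name `strengthened_bound` is a hypothetical choice for this illustration.

```python
import math


def strengthened_bound(widths, eps):
    """Right-hand side of (12.67); widths[n] plays the role of b_n - a_n."""
    s = sum(w * w for w in widths)
    hoeffding = 2.0 * math.exp(-2.0 * eps**2 / s)   # term from Corollary 12.1.20
    chebyshev = s / (4.0 * eps**2)                  # term from Lemma 12.1.6
    return min(1.0, hoeffding, chebyshev)


widths = [1.0, 1.0]  # two variables with b_n - a_n = 1, so the sum of squares is 2

# eps^2 / sum = 1/2: the Chebyshev-type term 1/2 beats the Hoeffding term 2/e.
moderate = strengthened_bound(widths, 1.0)

# Large eps: the exponentially small Hoeffding term wins.
large = strengthened_bound(widths, 3.0)

# Small eps: both terms exceed 1, so only the trivial bound 1 remains.
small = strengthened_bound(widths, 0.1)
print(moderate, large, small)
```

The case `moderate` corresponds to x = 1/2 in item (ii) of Lemma 12.1.22.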

12.2 Covering number estimates


12.2.1 Entropy quantities
12.2.1.1 Covering radii (Outer entropy numbers)
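Definition 12.2.1 (Covering radii). Let (X, d) be a metric space and let n ∈ N. Then we denote by $\mathcal{C}_{(X,d),n} \in [0, \infty]$ (we denote by $\mathcal{C}_{X,n} \in [0, \infty]$) the extended real number given by
\[
\mathcal{C}_{(X,d),n} = \inf\Big(\Big\{r \in [0, \infty] \colon \big(\exists\, A \subseteq X \colon (|A| \le n) \wedge (\forall\, x \in X \colon \exists\, a \in A \colon d(a, x) \le r)\big)\Big\}\Big) \tag{12.68}
\]
and we call $\mathcal{C}_{(X,d),n}$ the n-covering radius of (X, d) (we call $\mathcal{C}_{X,n}$ the n-covering radius of X).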

Lemma 12.2.2. Let (X, d) be a metric space, let n ∈ N, r ∈ [0, ∞], assume X ≠ ∅, and let A ⊆ X satisfy |A| ≤ n and ∀ x ∈ X : ∃ a ∈ A : d(a, x) ≤ r. Then there exist $x_1, x_2, \dots, x_n \in X$ such that
\[
X \subseteq \bigg[\bigcup_{i=1}^{n} \{v \in X \colon d(x_i, v) \le r\}\bigg]. \tag{12.69}
\]


Proof of Lemma 12.2.2. Note that the assumption that X ≠ ∅ and the assumption that |A| ≤ n imply that there exist $x_1, x_2, \dots, x_n \in X$ which satisfy $A \subseteq \{x_1, x_2, \dots, x_n\}$. This and the assumption that ∀ x ∈ X : ∃ a ∈ A : d(a, x) ≤ r ensure that
\[
X \subseteq \bigg[\bigcup_{a \in A} \{v \in X \colon d(a, v) \le r\}\bigg] \subseteq \bigg[\bigcup_{i=1}^{n} \{v \in X \colon d(x_i, v) \le r\}\bigg]. \tag{12.70}
\]

The proof of Lemma 12.2.2 is thus complete.

Lemma 12.2.3. Let (X, d) be a metric space, let n ∈ N, r ∈ [0, ∞], $x_1, x_2, \dots, x_n \in X$ satisfy $X \subseteq \big[\bigcup_{i=1}^{n} \{v \in X \colon d(x_i, v) \le r\}\big]$. Then there exists A ⊆ X such that |A| ≤ n and
\[
\forall\, x \in X \colon \exists\, a \in A \colon d(a, x) \le r. \tag{12.71}
\]
Proof of Lemma 12.2.3. Throughout this proof, let $A = \{x_1, x_2, \dots, x_n\}$. Note that the assumption that $X \subseteq \big[\bigcup_{i=1}^{n} \{v \in X \colon d(x_i, v) \le r\}\big]$ implies that for all v ∈ X there exists i ∈ {1, 2, …, n} such that $d(x_i, v) \le r$. Hence, we obtain that


∀ x ∈ X : ∃ a ∈ A : d(a, x) ≤ r. (12.72)
The proof of Lemma 12.2.3 is thus complete.
Lemma 12.2.4. Let (X, d) be a metric space, let n ∈ N, r ∈ [0, ∞], and assume X ≠ ∅. Then the following two statements are equivalent:
(i) There exists A ⊆ X such that |A| ≤ n and ∀ x ∈ X : ∃ a ∈ A : d(a, x) ≤ r.
(ii) There exist $x_1, x_2, \dots, x_n \in X$ such that $X \subseteq \big[\bigcup_{i=1}^{n} \{v \in X \colon d(x_i, v) \le r\}\big]$.
Proof of Lemma 12.2.4. Note that Lemma 12.2.2 and Lemma 12.2.3 prove that ((i) ↔ (ii)).
The proof of Lemma 12.2.4 is thus complete.
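Lemma 12.2.5. Let (X, d) be a metric space and let n ∈ N. Then
\[
\mathcal{C}_{(X,d),n} =
\begin{cases}
0 & \colon X = \emptyset \\[1ex]
\inf\bigg(\bigg\{r \in [0, \infty) \colon \bigg(\begin{array}{l} \exists\, x_1, x_2, \dots, x_n \in X \colon \\ X \subseteq \big[\bigcup_{m=1}^{n} \{v \in X \colon d(x_m, v) \le r\}\big] \end{array}\bigg)\bigg\} \cup \{\infty\}\bigg) & \colon X \neq \emptyset
\end{cases} \tag{12.73}
\]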
(cf. Definition 12.2.1).
Proof of Lemma 12.2.5. Throughout this proof, assume without loss of generality that
X ̸= ∅ and let a ∈ X. Note that the assumption that d is a metric implies that for all
x ∈ X it holds that d(a, x) ≤ ∞. Combining this with Lemma 12.2.4 proves (12.73). This
completes the proof of Lemma 12.2.5.
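For a finite metric space the infimum in (12.73) is attained and can be found by exhaustive search over all choices of n centers. The following sketch is a brute-force illustration under that finiteness assumption; `covering_radius` is a hypothetical helper, and the search is exponential in n, so it is only meant for very small examples.

```python
from itertools import combinations


def covering_radius(points, dist, n):
    """n-covering radius of the finite metric space (points, dist), cf. (12.73).

    Tries every set of min(n, len(points)) centers taken from the space itself
    and returns the smallest radius r such that the closed r-balls around the
    centers cover all points.
    """
    if not points:
        return 0.0
    k = min(n, len(points))  # additional centers can never increase the radius
    return min(
        max(min(dist(center, x) for center in centers) for x in points)
        for centers in combinations(points, k)
    )


def line_dist(x, y):
    return abs(x - y)


points = [0.0, 1.0, 2.0, 3.0, 4.0]
radii = [covering_radius(points, line_dist, n) for n in (1, 2, 3, 5)]
print(radii)
```

In the example one center (the midpoint 2) needs radius 2, two or three centers need radius 1, and five centers cover the space with radius 0.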


Exercise 12.2.1. Prove or disprove the following statement: For every metric space (X, d) and every n, m ∈ N it holds that C(X,d),n < ∞ if and only if C(X,d),m < ∞ (cf. Definition 12.2.1).
Exercise 12.2.2. Prove or disprove the following statement: For every metric space (X, d) and
every n ∈ N it holds that (X, d) is bounded if and only if C(X,d),n < ∞ (cf. Definition 12.2.1).
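Exercise 12.2.3. Prove or disprove the following statement: For every n ∈ N and every metric space (X, d) with X ≠ ∅ it holds that
\[
\begin{aligned}
\mathcal{C}_{(X,d),n} &= \inf\nolimits_{x_1, x_2, \dots, x_n \in X} \sup\nolimits_{v \in X} \min\nolimits_{i \in \{1,2,\dots,n\}} d(x_i, v) \\
&= \inf\nolimits_{x_1, x_2, \dots, x_n \in X} \sup\nolimits_{x_{n+1} \in X} \min\nolimits_{i \in \{1,2,\dots,n\}} d(x_i, x_{n+1})
\end{aligned} \tag{12.74}
\]
(cf. Definition 12.2.1).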

12.2.1.2 Packing radii (Inner entropy numbers)


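Definition 12.2.6 (Packing radii). Let (X, d) be a metric space and let n ∈ N. Then we denote by $\mathcal{P}_{(X,d),n} \in [0, \infty]$ (we denote by $\mathcal{P}_{X,n} \in [0, \infty]$) the extended real number given by
\[
\mathcal{P}_{(X,d),n} = \sup\Big(\Big\{r \in [0, \infty) \colon \big(\exists\, x_1, x_2, \dots, x_{n+1} \in X \colon \big[\min\nolimits_{i,j \in \{1,2,\dots,n+1\},\, i \neq j} d(x_i, x_j)\big] > 2r\big)\Big\} \cup \{0\}\Big) \tag{12.75}
\]
and we call $\mathcal{P}_{(X,d),n}$ the n-packing radius of (X, d) (we call $\mathcal{P}_{X,n}$ the n-packing radius of X).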

Exercise 12.2.4. Prove or disprove the following statement: For every n ∈ N and every metric space (X, d) with X ≠ ∅ it holds that
\[
\mathcal{P}_{(X,d),n} = \tfrac12 \Big[\sup\nolimits_{x_1, x_2, \dots, x_{n+1} \in X} \min\nolimits_{i,j \in \{1,2,\dots,n+1\},\, i \neq j} d(x_i, x_j)\Big] \tag{12.76}
\]
(cf. Definition 12.2.6).

12.2.1.3 Packing numbers


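Definition 12.2.7 (Packing numbers). Let (X, d) be a metric space and let r ∈ [0, ∞]. Then we denote by $\mathcal{P}^{(X,d),r} \in [0, \infty]$ (we denote by $\mathcal{P}^{X,r} \in [0, \infty]$) the extended real number given by
\[
\mathcal{P}^{(X,d),r} = \sup\Big(\Big\{n \in \mathbb{N} \colon \big(\exists\, x_1, x_2, \dots, x_{n+1} \in X \colon \big[\min\nolimits_{i,j \in \{1,2,\dots,n+1\},\, i \neq j} d(x_i, x_j)\big] > 2r\big)\Big\} \cup \{0\}\Big) \tag{12.77}
\]
and we call $\mathcal{P}^{(X,d),r}$ the r-packing number of (X, d) (we call $\mathcal{P}^{X,r}$ the r-packing number of X).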


12.2.2 Inequalities for packing entropy quantities in metric spaces


12.2.2.1 Lower bounds for packing radii based on lower bounds for packing
numbers
Lemma 12.2.8 (Lower bounds for packing radii). Let (X, d) be a metric space and let
n ∈ N, r ∈ [0, ∞] satisfy n ≤ P (X,d),r (cf. Definition 12.2.7). Then r ≤ P(X,d),n (cf.
Definition 12.2.6).

Proof of Lemma 12.2.8. Note that (12.77) ensures that there exist $x_1, x_2, \dots, x_{n+1} \in X$ such that
\[
\big[\min\nolimits_{i,j \in \{1,2,\dots,n+1\},\, i \neq j} d(x_i, x_j)\big] > 2r. \tag{12.78}
\]
This implies that P(X,d),n ≥ r (cf. Definition 12.2.6). The proof of Lemma 12.2.8 is thus
complete.
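For a finite metric space, the packing radius from (12.75) and the packing number from (12.77) can be computed by exhaustive search, which also allows checking the statement of Lemma 12.2.8 on small examples. The sketch below uses hypothetical helper names and is exponential in the number of points.

```python
from itertools import combinations


def min_pairwise(subset, dist):
    """Smallest distance between two different members of subset."""
    return min(dist(x, y) for x, y in combinations(subset, 2))


def packing_radius(points, dist, n):
    """n-packing radius, cf. (12.75): half the best minimal separation of n+1 points."""
    if len(points) < n + 1:
        return 0.0  # no n+1 distinct points exist, only the element 0 remains
    return 0.5 * max(min_pairwise(s, dist) for s in combinations(points, n + 1))


def packing_number(points, dist, r):
    """r-packing number, cf. (12.77): largest n admitting n+1 points more than 2r apart."""
    best = 0
    for m in range(2, len(points) + 1):
        if any(min_pairwise(s, dist) > 2 * r for s in combinations(points, m)):
            best = m - 1
        else:
            break  # subsets of separated sets are separated, so larger m fail too
    return best


def line_dist(x, y):
    return abs(x - y)


points = [0.0, 1.0, 2.0, 3.0, 4.0]
radii = [packing_radius(points, line_dist, n) for n in (1, 2, 4, 5)]
numbers = [packing_number(points, line_dist, r) for r in (0.4, 0.75, 2.0)]
print(radii, numbers)
```

In this example the 0.75-packing number is 2 and the 2-packing radius is 1 ≥ 0.75, in line with Lemma 12.2.8.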

12.2.2.2 Upper bounds for packing numbers based on upper bounds for packing
radii
Lemma 12.2.9. Let (X, d) be a metric space and let n ∈ N, r ∈ [0, ∞] satisfy P(X,d),n < r
(cf. Definition 12.2.6). Then P (X,d),r < n (cf. Definition 12.2.7).

Proof of Lemma 12.2.9. Observe that Lemma 12.2.8 establishes that P (X,d),r < n (cf. Definition 12.2.7). The proof of Lemma 12.2.9 is thus complete.

12.2.2.3 Upper bounds for packing radii based on upper bounds for covering
radii
Lemma 12.2.10. Let (X, d) be a metric space and let n ∈ N. Then P(X,d),n ≤ C(X,d),n (cf.
Definitions 12.2.1 and 12.2.6).

Proof of Lemma 12.2.10. Throughout this proof, assume without loss of generality that C(X,d),n < ∞ and P(X,d),n > 0, let r ∈ [0, ∞), $x_1, x_2, \dots, x_n \in X$ satisfy
\[
X \subseteq \bigg[\bigcup_{m=1}^{n} \{v \in X \colon d(x_m, v) \le r\}\bigg], \tag{12.79}
\]
let 𝔯 ∈ [0, ∞), $\mathfrak{x}_1, \mathfrak{x}_2, \dots, \mathfrak{x}_{n+1} \in X$ satisfy
\[
\big[\min\nolimits_{i,j \in \{1,2,\dots,n+1\},\, i \neq j} d(\mathfrak{x}_i, \mathfrak{x}_j)\big] > 2\mathfrak{r}, \tag{12.80}
\]
and let φ : X → {1, 2, …, n} satisfy for all v ∈ X that
\[
\varphi(v) = \min\{m \in \{1, 2, \dots, n\} \colon v \in \{w \in X \colon d(x_m, w) \le r\}\} \tag{12.81}
\]
(cf. Definitions 12.2.1 and 12.2.6 and Lemma 12.2.5). Observe that (12.81) shows that for all v ∈ X it holds that
\[
v \in \big\{w \in X \colon d(x_{\varphi(v)}, w) \le r\big\}. \tag{12.82}
\]
Hence, we obtain that for all v ∈ X it holds that
\[
d(v, x_{\varphi(v)}) \le r. \tag{12.83}
\]
Moreover, note that the fact that $\varphi(\mathfrak{x}_1), \varphi(\mathfrak{x}_2), \dots, \varphi(\mathfrak{x}_{n+1}) \in \{1, 2, \dots, n\}$ ensures that there exist i, j ∈ {1, 2, …, n + 1} which satisfy
\[
i \neq j \qquad\text{and}\qquad \varphi(\mathfrak{x}_i) = \varphi(\mathfrak{x}_j). \tag{12.84}
\]
The triangle inequality, (12.80), and (12.83) hence show that
\[
2\mathfrak{r} < d(\mathfrak{x}_i, \mathfrak{x}_j) \le d(\mathfrak{x}_i, x_{\varphi(\mathfrak{x}_i)}) + d(x_{\varphi(\mathfrak{x}_i)}, \mathfrak{x}_j) = d(\mathfrak{x}_i, x_{\varphi(\mathfrak{x}_i)}) + d(\mathfrak{x}_j, x_{\varphi(\mathfrak{x}_j)}) \le 2r. \tag{12.85}
\]
This implies that 𝔯 < r. The proof of Lemma 12.2.10 is thus complete.

12.2.2.4 Upper bounds for packing radii in balls of metric spaces

Lemma 12.2.11. Let (X, d) be a metric space, let n ∈ N, x ∈ X, r ∈ (0, ∞], and let
S = {v ∈ X : d(x, v) ≤ r}. Then P(S,d|S×S ),n ≤ r (cf. Definition 12.2.6).

Proof of Lemma 12.2.11. Throughout this proof, assume without loss of generality that
P(S,d|S×S ),n > 0 (cf. Definition 12.2.6). Observe that for all x1 , x2 , . . . , xn+1 ∈ S, i, j ∈
{1, 2, . . . , n + 1} it holds that

d(xi , xj ) ≤ d(xi , x) + d(x, xj ) ≤ 2r. (12.86)

Hence, we obtain that for all x1 , x2 , . . . , xn+1 ∈ S it holds that

mini,j∈{1,2,...,n+1},i̸=j d(xi , xj ) ≤ 2r. (12.87)

Moreover, note that (12.75) ensures that for all ρ ∈ [0, P(S,d|S×S ),n ) there exist x1 , x2 , . . . ,
xn+1 ∈ S such that
mini,j∈{1,2,...,n+1},i̸=j d(xi , xj ) > 2ρ. (12.88)

This and (12.87) demonstrate that for all ρ ∈ [0, P(S,d|S×S ),n ) it holds that 2ρ < 2r. The
proof of Lemma 12.2.11 is thus complete.


12.2.3 Inequalities for covering entropy quantities in metric spaces


12.2.3.1 Upper bounds for covering numbers based on upper bounds for
covering radii
Lemma 12.2.12. Let (X, d) be a metric space and let r ∈ [0, ∞], n ∈ N satisfy C(X,d),n < r
(cf. Definition 12.2.1). Then C (X,d),r ≤ n (cf. Definition 4.3.2).
Proof of Lemma 12.2.12. Observe that the assumption that C(X,d),n < r ensures that there exists A ⊆ X such that |A| ≤ n and
\[
X \subseteq \bigg[\bigcup_{a \in A} \{v \in X \colon d(a, v) \le r\}\bigg]. \tag{12.89}
\]

This establishes that C (X,d),r ≤ n (cf. Definition 4.3.2). The proof of Lemma 12.2.12 is thus
complete.
Lemma 12.2.13. Let (X, d) be a compact metric space and let r ∈ [0, ∞], n ∈ N, satisfy
C(X,d),n ≤ r (cf. Definition 12.2.1). Then C (X,d),r ≤ n (cf. Definition 4.3.2).
Proof of Lemma 12.2.13. Throughout this proof, assume without loss of generality that X ≠ ∅ and let $x_{k,m} \in X$, m ∈ {1, 2, …, n}, k ∈ N, satisfy for all k ∈ N that
\[
X \subseteq \bigg[\bigcup_{m=1}^{n} \big\{v \in X \colon d(x_{k,m}, v) \le r + \tfrac1k\big\}\bigg] \tag{12.90}
\]
(cf. Lemma 12.2.4). Note that the assumption that (X, d) is a compact metric space demonstrates that there exist $x = (x_m)_{m \in \{1,2,\dots,n\}} \colon \{1, 2, \dots, n\} \to X$ and $k = (k_l)_{l \in \mathbb{N}} \colon \mathbb{N} \to \mathbb{N}$ which satisfy that
\[
\limsup\nolimits_{l \to \infty} \max\nolimits_{m \in \{1,2,\dots,n\}} d(x_m, x_{k_l,m}) = 0 \qquad\text{and}\qquad \limsup\nolimits_{l \to \infty} k_l = \infty. \tag{12.91}
\]
Next observe that the assumption that d is a metric ensures that for all v ∈ X, m ∈ {1, 2, …, n}, l ∈ N it holds that
\[
d(v, x_m) \le d(v, x_{k_l,m}) + d(x_{k_l,m}, x_m). \tag{12.92}
\]
This and (12.90) prove that for all v ∈ X, l ∈ N it holds that
\[
\begin{aligned}
\min\nolimits_{m \in \{1,2,\dots,n\}} d(v, x_m) &\le \min\nolimits_{m \in \{1,2,\dots,n\}} \big[d(v, x_{k_l,m}) + d(x_{k_l,m}, x_m)\big] \\
&\le \big[\min\nolimits_{m \in \{1,2,\dots,n\}} d(v, x_{k_l,m})\big] + \big[\max\nolimits_{m \in \{1,2,\dots,n\}} d(x_{k_l,m}, x_m)\big] \\
&\le \big[r + \tfrac{1}{k_l}\big] + \big[\max\nolimits_{m \in \{1,2,\dots,n\}} d(x_{k_l,m}, x_m)\big].
\end{aligned} \tag{12.93}
\]
Hence, we obtain for all v ∈ X that
\[
\min\nolimits_{m \in \{1,2,\dots,n\}} d(v, x_m) \le \limsup\nolimits_{l \to \infty} \Big[\big[r + \tfrac{1}{k_l}\big] + \big[\max\nolimits_{m \in \{1,2,\dots,n\}} d(x_{k_l,m}, x_m)\big]\Big] = r. \tag{12.94}
\]
This establishes that C (X,d),r ≤ n (cf. Definition 4.3.2). The proof of Lemma 12.2.13 is thus complete.


12.2.3.2 Upper bounds for covering radii based on upper bounds for covering
numbers
Lemma 12.2.14. Let (X, d) be a metric space and let r ∈ [0, ∞], n ∈ N satisfy C (X,d),r ≤ n
(cf. Definition 4.3.2). Then C(X,d),n ≤ r (cf. Definition 12.2.1).
Proof of Lemma 12.2.14. Observe that the assumption that C (X,d),r ≤ n ensures that there exists A ⊆ X such that |A| ≤ n and
\[
X \subseteq \bigg[\bigcup_{a \in A} \{v \in X \colon d(a, v) \le r\}\bigg]. \tag{12.95}
\]

This establishes that C(X,d),n ≤ r (cf. Definition 12.2.1). The proof of Lemma 12.2.14 is
thus complete.

12.2.3.3 Upper bounds for covering radii based on upper bounds for packing
radii
Lemma 12.2.15. Let (X, d) be a metric space and let n ∈ N. Then C(X,d),n ≤ 2P(X,d),n (cf.
Definitions 12.2.1 and 12.2.6).
Proof of Lemma 12.2.15. Throughout this proof, assume without loss of generality that X ≠ ∅ and
P(X,d),n < ∞, let r ∈ [0, ∞] satisfy r > P(X,d),n , and let
N ∈ N0 ∪ {∞} satisfy N = P (X,d),r (cf. Definitions 12.2.6 and 12.2.7). Observe that
Lemma 12.2.9 ensures that
N = P (X,d),r < n. (12.96)
Moreover, note that the fact that N = P (X,d),r and (12.77) demonstrate that for all
x1 , x2 , . . . , xN +1 , xN +2 ∈ X it holds that

mini,j∈{1,2,...,N +2}, i̸=j d(xi , xj ) ≤ 2r. (12.97)

In addition, observe that the fact that N = P (X,d),r and (12.77) imply that there exist $x_1, x_2, \dots, x_{N+1} \in X$ which satisfy that
\[
\min\big(\{d(x_i, x_j) \colon i, j \in \{1, 2, \dots, N + 1\},\, i \neq j\} \cup \{\infty\}\big) > 2r. \tag{12.98}
\]

Combining this with (12.97) establishes that for all v ∈ X it holds that
\[
\min\nolimits_{i \in \{1,2,\dots,N+1\}} d(x_i, v) \le 2r. \tag{12.99}
\]
Hence, we obtain that for all w ∈ X it holds that
\[
w \in \bigg[\bigcup_{i=1}^{N+1} \{v \in X \colon d(x_i, v) \le 2r\}\bigg]. \tag{12.100}
\]
Therefore, we obtain that
\[
X \subseteq \bigg[\bigcup_{i=1}^{N+1} \{v \in X \colon d(x_i, v) \le 2r\}\bigg]. \tag{12.101}
\]
Combining this, the fact that N + 1 ≤ n (see (12.96)), and Lemma 12.2.5 shows that C(X,d),n ≤ 2r (cf. Definition 12.2.1). The proof of Lemma 12.2.15 is thus complete.

12.2.3.4 Equivalence of covering and packing radii


Corollary 12.2.16. Let (X, d) be a metric space and let n ∈ N. Then P(X,d),n ≤ C(X,d),n ≤
2P(X,d),n (cf. Definitions 12.2.1 and 12.2.6).
Proof of Corollary 12.2.16. Observe that Lemma 12.2.10 and Lemma 12.2.15 establish
that P(X,d),n ≤ C(X,d),n ≤ 2P(X,d),n (cf. Definitions 12.2.1 and 12.2.6). The proof of Corollary 12.2.16 is thus complete.

12.2.4 Inequalities for entropy quantities in finite-dimensional vector spaces

12.2.4.1 Measures induced by Lebesgue–Borel measures
Lemma 12.2.17. Let (V, ⦀·⦀) be a normed vector space, let N ∈ N, let $b_1, b_2, \dots, b_N \in V$ be a Hamel-basis of V, let $\lambda \colon \mathcal{B}(\mathbb{R}^N) \to [0, \infty]$ be the Lebesgue–Borel measure on $\mathbb{R}^N$, let $\Phi \colon \mathbb{R}^N \to V$ satisfy for all $r = (r_1, r_2, \dots, r_N) \in \mathbb{R}^N$ that $\Phi(r) = r_1 b_1 + r_2 b_2 + \ldots + r_N b_N$, and let $\nu \colon \mathcal{B}(V) \to [0, \infty]$ satisfy for all $A \in \mathcal{B}(V)$ that
\[
\nu(A) = \lambda(\Phi^{-1}(A)). \tag{12.102}
\]
Then

(i) it holds that Φ is linear,

(ii) it holds for all $r = (r_1, r_2, \dots, r_N) \in \mathbb{R}^N$ that ⦀Φ(r)⦀ ≤ $\big[\sum_{n=1}^{N}$ ⦀$b_n$⦀$^2\big]^{1/2} \big[\sum_{n=1}^{N} |r_n|^2\big]^{1/2}$,

(iii) it holds that $\Phi \in C(\mathbb{R}^N, V)$,

(iv) it holds that Φ is bijective,

(v) it holds that $(V, \mathcal{B}(V), \nu)$ is a measure space,

(vi) it holds for all r ∈ (0, ∞), v ∈ V, $A \in \mathcal{B}(V)$ that $\nu(\{(ra + v) \in V \colon a \in A\}) = r^N \nu(A)$,

(vii) it holds for all r ∈ (0, ∞) that ν({v ∈ V : ⦀v⦀ ≤ r}) = $r^N$ ν({v ∈ V : ⦀v⦀ ≤ 1}), and

(viii) it holds that ν({v ∈ V : ⦀v⦀ ≤ 1}) > 0.


Proof of Lemma 12.2.17. Note that for all $r = (r_1, r_2, \dots, r_N)$, $s = (s_1, s_2, \dots, s_N) \in \mathbb{R}^N$, ρ ∈ R it holds that
\[
\Phi(\rho r + s) = (\rho r_1 + s_1) b_1 + (\rho r_2 + s_2) b_2 + \dots + (\rho r_N + s_N) b_N = \rho\,\Phi(r) + \Phi(s). \tag{12.103}
\]
This establishes item (i). Next observe that Hölder's inequality shows that for all $r = (r_1, r_2, \dots, r_N) \in \mathbb{R}^N$ it holds that
\[
⦀\Phi(r)⦀ = ⦀r_1 b_1 + r_2 b_2 + \dots + r_N b_N⦀ \le \sum_{n=1}^{N} |r_n|\, ⦀b_n⦀ \le \bigg[\sum_{n=1}^{N} ⦀b_n⦀^2\bigg]^{1/2} \bigg[\sum_{n=1}^{N} |r_n|^2\bigg]^{1/2}. \tag{12.104}
\]
This establishes item (ii). Moreover, note that item (ii) proves item (iii). Furthermore, observe that the assumption that $b_1, b_2, \dots, b_N \in V$ is a Hamel-basis of V establishes item (iv). Next note that (12.102) and item (iii) prove item (v). In addition, observe that the integral transformation theorem shows that for all r ∈ (0, ∞), $v \in \mathbb{R}^N$, $A \in \mathcal{B}(\mathbb{R}^N)$ it holds that
\[
\begin{aligned}
\lambda\big(\{(ra + v) \in \mathbb{R}^N \colon a \in A\}\big) &= \lambda\big(\{ra \in \mathbb{R}^N \colon a \in A\}\big) = \int_{\mathbb{R}^N} \mathbb{1}_{\{ra \in \mathbb{R}^N \colon a \in A\}}(x)\, dx \\
&= \int_{\mathbb{R}^N} \mathbb{1}_{A}\big(\tfrac{x}{r}\big)\, dx = r^N \int_{\mathbb{R}^N} \mathbb{1}_{A}(x)\, dx = r^N \lambda(A).
\end{aligned} \tag{12.105}
\]
Combining item (i) and item (iv) hence demonstrates that for all r ∈ (0, ∞), v ∈ V, $A \in \mathcal{B}(V)$ it holds that
\[
\begin{aligned}
\nu(\{(ra + v) \in V \colon a \in A\}) &= \lambda\big(\Phi^{-1}(\{(ra + v) \in V \colon a \in A\})\big) \\
&= \lambda\big(\{\Phi^{-1}(ra + v) \in \mathbb{R}^N \colon a \in A\}\big) \\
&= \lambda\big(\{r\,\Phi^{-1}(a) + \Phi^{-1}(v) \in \mathbb{R}^N \colon a \in A\}\big) \\
&= \lambda\big(\{ra + \Phi^{-1}(v) \in \mathbb{R}^N \colon a \in \Phi^{-1}(A)\}\big) \\
&= r^N \lambda(\Phi^{-1}(A)) = r^N \nu(A).
\end{aligned} \tag{12.106}
\]
This establishes item (vi). Hence, we obtain that for all r ∈ (0, ∞) it holds that
\[
\nu(\{v \in V \colon ⦀v⦀ \le r\}) = \nu(\{rv \in V \colon ⦀v⦀ \le 1\}) = r^N \nu(\{v \in V \colon ⦀v⦀ \le 1\}). \tag{12.107}
\]
This establishes item (vii). Furthermore, observe that (12.107) demonstrates that
\[
\begin{aligned}
\infty = \lambda(\mathbb{R}^N) = \nu(V) &= \limsup_{r \to \infty} \big[\nu(\{v \in V \colon ⦀v⦀ \le r\})\big] \\
&= \limsup_{r \to \infty} \big[r^N \nu(\{v \in V \colon ⦀v⦀ \le 1\})\big].
\end{aligned} \tag{12.108}
\]
Hence, we obtain that ν({v ∈ V : ⦀v⦀ ≤ 1}) ≠ 0. This establishes item (viii). The proof of Lemma 12.2.17 is thus complete.

12.2.4.2 Upper bounds for packing radii


Lemma 12.2.18. Let (V, ⦀·⦀) be a normed vector space, let X = {v ∈ V : ⦀v⦀ ≤ 1}, let d : X × X → [0, ∞) satisfy for all v, w ∈ X that d(v, w) = ⦀v − w⦀, and let n, N ∈ N satisfy N = dim(V). Then
\[
\mathcal{P}_{(X,d),n} \le 2\,(n + 1)^{-1/N} \tag{12.109}
\]
(cf. Definition 12.2.6).


Proof of Lemma 12.2.18. Throughout this proof, assume without loss of generality that P(X,d),n > 0, let ρ ∈ [0, P(X,d),n ), let $\lambda \colon \mathcal{B}(\mathbb{R}^N) \to [0, \infty]$ be the Lebesgue–Borel measure on $\mathbb{R}^N$, let $b_1, b_2, \dots, b_N \in V$ be a Hamel-basis of V, let $\Phi \colon \mathbb{R}^N \to V$ satisfy for all $r = (r_1, r_2, \dots, r_N) \in \mathbb{R}^N$ that
\[
\Phi(r) = r_1 b_1 + r_2 b_2 + \ldots + r_N b_N, \tag{12.110}
\]
and let $\nu \colon \mathcal{B}(V) \to [0, \infty]$ satisfy for all $A \in \mathcal{B}(V)$ that
\[
\nu(A) = \lambda(\Phi^{-1}(A)) \tag{12.111}
\]
(cf. Definition 12.2.6). Observe that Lemma 12.2.11 ensures that ρ < P(X,d),n ≤ 1. Moreover, note that (12.75) shows that there exist $x_1, x_2, \dots, x_{n+1} \in X$ which satisfy
\[
\min\nolimits_{i,j \in \{1,2,\dots,n+1\},\, i \neq j} ⦀x_i - x_j⦀ = \min\nolimits_{i,j \in \{1,2,\dots,n+1\},\, i \neq j} d(x_i, x_j) > 2\rho. \tag{12.112}
\]
Observe that (12.112) ensures that for all i, j ∈ {1, 2, …, n + 1} with i ≠ j it holds that
\[
\{v \in V \colon ⦀x_i - v⦀ \le \rho\} \cap \{v \in V \colon ⦀x_j - v⦀ \le \rho\} = \emptyset. \tag{12.113}
\]
Moreover, note that (12.112) and the fact that ρ < 1 show that for all j ∈ {1, 2, …, n + 1}, $w \in \{v \in X \colon d(x_j, v) \le \rho\}$ it holds that
\[
⦀w⦀ \le ⦀w - x_j⦀ + ⦀x_j⦀ \le \rho + 1 \le 2. \tag{12.114}
\]
Therefore, we obtain that for all j ∈ {1, 2, …, n + 1} it holds that
\[
\{v \in V \colon ⦀v - x_j⦀ \le \rho\} \subseteq \{v \in V \colon ⦀v⦀ \le 2\}. \tag{12.115}
\]
Next observe that Lemma 12.2.17 ensures that $(V, \mathcal{B}(V), \nu)$ is a measure space. Combining this and (12.113) with (12.115) proves that
\[
\sum_{j=1}^{n+1} \nu(\{v \in V \colon ⦀v - x_j⦀ \le \rho\}) = \nu\bigg(\bigcup_{j=1}^{n+1} \{v \in V \colon ⦀v - x_j⦀ \le \rho\}\bigg) \le \nu(\{v \in V \colon ⦀v⦀ \le 2\}). \tag{12.116}
\]
Lemma 12.2.17 hence shows that
\[
\begin{aligned}
(n + 1)\rho^N \nu(X) &= \sum_{j=1}^{n+1} \big[\rho^N \nu(\{v \in V \colon ⦀v⦀ \le 1\})\big] \\
&= \sum_{j=1}^{n+1} \nu(\{v \in V \colon ⦀v⦀ \le \rho\}) \\
&= \sum_{j=1}^{n+1} \nu(\{v \in V \colon ⦀v - x_j⦀ \le \rho\}) \le \nu(\{v \in V \colon ⦀v⦀ \le 2\}) \\
&= 2^N \nu(\{v \in V \colon ⦀v⦀ \le 1\}) = 2^N \nu(X).
\end{aligned} \tag{12.117}
\]
Next observe that Lemma 12.2.17 demonstrates that ν(X) > 0. Combining this with (12.117) assures that $(n + 1)\rho^N \le 2^N$. Therefore, we obtain that $\rho^N \le (n + 1)^{-1} 2^N$. Hence, we obtain that $\rho \le 2(n + 1)^{-1/N}$. The proof of Lemma 12.2.18 is thus complete.

12.2.4.3 Upper bounds for covering radii


Corollary 12.2.19. Let (V, ⦀·⦀) be a normed vector space, let X = {v ∈ V : ⦀v⦀ ≤ 1}, let d : X × X → [0, ∞) satisfy for all v, w ∈ X that d(v, w) = ⦀v − w⦀, and let n, N ∈ N satisfy N = dim(V). Then
\[
\mathcal{C}_{(X,d),n} \le 4\,(n + 1)^{-1/N} \tag{12.118}
\]
(cf. Definition 12.2.1).

Proof of Corollary 12.2.19. Observe that Corollary 12.2.16 and Lemma 12.2.18 establish
(12.118). The proof of Corollary 12.2.19 is thus complete.

12.2.4.4 Lower bounds for covering radii


Lemma 12.2.20. Let (V, ⦀·⦀) be a normed vector space, let X = {v ∈ V : ⦀v⦀ ≤ 1}, let d : X × X → [0, ∞) satisfy for all v, w ∈ X that d(v, w) = ⦀v − w⦀, and let n, N ∈ N satisfy N = dim(V). Then
\[
n^{-1/N} \le \mathcal{C}_{(X,d),n} \tag{12.119}
\]
(cf. Definition 12.2.1).

Proof of Lemma 12.2.20. Throughout this proof, assume without loss of generality that
C(X,d),n < ∞, let ρ ∈ (C(X,d),n , ∞), let λ : B(RN ) → [0, ∞] be the Lebesgue-Borel measure
on RN , let b1 , b2 , . . . , bN ∈ V be a Hamel-basis of V , let Φ : RN → V satisfy for all
r = (r1 , r2 , . . . , rN ) ∈ RN that

Φ(r) = r1 b1 + r2 b2 + . . . + rN bN , (12.120)


and let ν : B(V ) → [0, ∞] satisfy for all A ∈ B(V ) that

ν(A) = λ(Φ−1 (A)) (12.121)

(cf. Definition 12.2.1). The fact that ρ > C(X,d),n demonstrates that there exist x1 , x2 , . . . ,
xn ∈ X which satisfy
\[
X \subseteq \Bigl[\bigcup_{m=1}^{n} \{v \in X : d(x_m, v) \le \rho\}\Bigr].
\tag{12.122}
\]

Lemma 12.2.17 hence shows that


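\[
\begin{aligned}
\nu(X)
&\le \nu\Bigl(\bigcup_{m=1}^{n} \{v \in X : d(x_m, v) \le \rho\}\Bigr)
\le \sum_{m=1}^{n} \nu(\{v \in X : d(x_m, v) \le \rho\}) \\
&= \sum_{m=1}^{n} \bigl[\rho^N \nu(\{v \in X : d(x_m, v) \le 1\})\bigr]
\le n \rho^N \nu(X).
\end{aligned}
\tag{12.123}
\]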

This and Lemma 12.2.17 demonstrate that 1 ≤ nρ^N. Hence, we obtain that ρ^N ≥ n^{−1}.
This ensures that ρ ≥ n^{−1/N}. The proof of Lemma 12.2.20 is thus complete.

12.2.4.5 Lower and upper bounds for covering radii


Corollary 12.2.21. Let (V, ~·~) be a normed vector space, let X = {v ∈ V : ~v~ ≤ 1},
let d : X × X → [0, ∞) satisfy for all v, w ∈ X that d(v, w) = ~v − w~, and let n, N ∈ N
satisfy N = dim(V ). Then

\[
n^{-1/N} \le \mathcal{C}_{(X,d),n} \le 4\,(n+1)^{-1/N}
\tag{12.124}
\]

(cf. Definition 12.2.1).

Proof of Corollary 12.2.21. Observe that Corollary 12.2.19 and Lemma 12.2.20 establish
(12.124). The proof of Corollary 12.2.21 is thus complete.
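
As a numerical sanity check of (12.124), the following Python sketch treats the special case V = ℝ^N equipped with the maximum norm, so that X = [−1, 1]^N. This choice of norm, the grid of k^N cell centres used as covering points, and all function names are assumptions made here purely for illustration. The grid has covering radius 1/k, which should lie between n^{−1/N} and 4(n + 1)^{−1/N} for n = k^N.

```python
import itertools

def grid_centers(k, N):
    """Centres of the k**N congruent sub-cubes of [-1, 1]**N."""
    axis = [-1.0 + (2 * i + 1) / k for i in range(k)]
    return list(itertools.product(axis, repeat=N))

def cell_corners(k, N):
    """Corners of the sub-cubes; these are worst-case points for the grid of centres."""
    axis = [-1.0 + 2 * i / k for i in range(k + 1)]
    return list(itertools.product(axis, repeat=N))

def sup_dist(x, y):
    """Maximum-norm distance between two points."""
    return max(abs(a - b) for a, b in zip(x, y))

def grid_radius(k, N):
    """Largest sup-norm distance from a cell corner to its nearest centre."""
    centers = grid_centers(k, N)
    return max(min(sup_dist(v, c) for c in centers) for v in cell_corners(k, N))

def radius_bounds(k, N):
    """Return (lower bound, grid radius, upper bound) for n = k**N points."""
    n = k ** N
    return n ** (-1.0 / N), grid_radius(k, N), 4.0 * (n + 1) ** (-1.0 / N)

if __name__ == "__main__":
    print(radius_bounds(3, 2))
```

In this special case the grid radius coincides with the lower bound n^{−1/N}, so the lower estimate in (12.124) cannot be improved in general, while the upper estimate is off by a factor of at most 4.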

12.2.4.6 Scaling property for covering radii


Lemma 12.2.22. Let (V, |||·|||) be a normed vector space, let d : V × V → [0, ∞) satisfy for
all v, w ∈ V that d(v, w) = |||v − w|||, let n ∈ N, r ∈ (0, ∞), and let X ⊆ V and 𝔛 ⊆ V
satisfy 𝔛 = {rv ∈ V : v ∈ X}. Then

\[
\mathcal{C}_{(\mathfrak{X},\, d|_{\mathfrak{X}\times\mathfrak{X}}),\,n}
= r\, \mathcal{C}_{(X,\, d|_{X\times X}),\,n}
\tag{12.125}
\]

(cf. Definition 12.2.1).

Proof of Lemma 12.2.22. Throughout this proof, let Φ : V → V satisfy for all v ∈ V that
Φ(v) = rv. Observe that Exercise 12.2.3 shows that
\[
\begin{aligned}
r\, \mathcal{C}_{(X,\, d|_{X\times X}),\,n}
&= r \inf_{x_1, \dots, x_n \in X} \sup_{v \in X} \min_{i \in \{1, \dots, n\}} d(x_i, v) \\
&= \inf_{x_1, \dots, x_n \in X} \sup_{v \in X} \min_{i \in \{1, \dots, n\}} |||r x_i - r v||| \\
&= \inf_{x_1, \dots, x_n \in X} \sup_{v \in X} \min_{i \in \{1, \dots, n\}} |||\Phi(x_i) - \Phi(v)||| \\
&= \inf_{x_1, \dots, x_n \in X} \sup_{v \in X} \min_{i \in \{1, \dots, n\}} d(\Phi(x_i), \Phi(v)) \\
&= \inf_{x_1, \dots, x_n \in X} \sup_{v \in \mathfrak{X}} \min_{i \in \{1, \dots, n\}} d(\Phi(x_i), v) \\
&= \inf_{x_1, \dots, x_n \in \mathfrak{X}} \sup_{v \in \mathfrak{X}} \min_{i \in \{1, \dots, n\}} d(x_i, v)
= \mathcal{C}_{(\mathfrak{X},\, d|_{\mathfrak{X}\times\mathfrak{X}}),\,n}
\end{aligned}
\tag{12.126}
\]

(cf. Definition 12.2.1). This establishes (12.125). The proof of Lemma 12.2.22 is thus
complete.

12.2.4.7 Upper bounds for covering numbers


Proposition 12.2.23. Let (V, ~·~) be a normed vector space with dim(V ) < ∞, let
r, R ∈ (0, ∞), X = {v ∈ V : ~v~ ≤ R}, and let d : X × X → [0, ∞) satisfy for all v, w ∈ X
that d(v, w) = ~v − w~. Then
\[
\mathcal{C}^{(X,d),r} \le
\begin{cases}
1 & : r \ge R \\
\bigl(\tfrac{4R}{r}\bigr)^{\dim(V)} & : r < R
\end{cases}
\tag{12.127}
\]

(cf. Definition 4.3.2).


Proof of Proposition 12.2.23. Throughout this proof, assume without loss of generality that
dim(V ) > 0, assume without loss of generality that r < R, let N ∈ ℕ satisfy N = dim(V ),
let n ∈ ℕ satisfy
\[
n = \Bigl\lceil \bigl(\tfrac{4R}{r}\bigr)^{N} - 1 \Bigr\rceil,
\tag{12.128}
\]
let 𝔛 = {v ∈ V : |||v||| ≤ 1}, and let 𝔡 : 𝔛 × 𝔛 → [0, ∞) satisfy for all v, w ∈ 𝔛 that

\[
\mathfrak{d}(v, w) = |||v - w|||
\tag{12.129}
\]

(cf. Definition 4.2.6). Observe that Corollary 12.2.19 proves that

\[
\mathcal{C}_{(\mathfrak{X},\mathfrak{d}),n} \le 4\,(n+1)^{-1/N}
\tag{12.130}
\]

(cf. Definition 12.2.1). The fact that

\[
n + 1 = \Bigl\lceil \bigl(\tfrac{4R}{r}\bigr)^{N} - 1 \Bigr\rceil + 1
\ge \Bigl[\bigl(\tfrac{4R}{r}\bigr)^{N} - 1\Bigr] + 1
= \bigl(\tfrac{4R}{r}\bigr)^{N}
\tag{12.131}
\]

therefore ensures that

\[
\mathcal{C}_{(\mathfrak{X},\mathfrak{d}),n}
\le 4\,(n+1)^{-1/N}
\le 4 \Bigl[\bigl(\tfrac{4R}{r}\bigr)^{N}\Bigr]^{-1/N}
= 4 \bigl(\tfrac{4R}{r}\bigr)^{-1}
= \frac{r}{R}.
\tag{12.132}
\]

This and Lemma 12.2.22 demonstrate that

\[
\mathcal{C}_{(X,d),n} = R\, \mathcal{C}_{(\mathfrak{X},\mathfrak{d}),n} \le R \Bigl[\frac{r}{R}\Bigr] = r.
\tag{12.133}
\]

Lemma 12.2.13 hence ensures that

\[
\mathcal{C}^{(X,d),r} \le n \le \bigl(\tfrac{4R}{r}\bigr)^{N} = \bigl(\tfrac{4R}{r}\bigr)^{\dim(V)}
\tag{12.134}
\]

(cf. Definition 4.3.2). The proof of Proposition 12.2.23 is thus complete.

Proposition 12.2.24. Let d ∈ N, a ∈ R, b ∈ (a, ∞), r ∈ (0, ∞) and let δ : ([a, b]^d ) ×
([a, b]^d ) → [0, ∞) satisfy for all x, y ∈ [a, b]^d that δ(x, y) = ∥x − y∥∞ (cf. Definition 3.3.4).
Then
\[
\mathcal{C}^{([a,b]^d,\delta),r}
\le \Bigl\lceil \frac{b-a}{2r} \Bigr\rceil^{d}
\le
\begin{cases}
1 & : r \ge (b-a)/2 \\
\bigl(\tfrac{b-a}{r}\bigr)^{d} & : r < (b-a)/2
\end{cases}
\tag{12.135}
\]
(cf. Definitions 4.2.6 and 4.3.2).

Proof of Proposition 12.2.24. Throughout this proof, let 𝔑 ∈ ℕ satisfy

\[
\mathfrak{N} = \Bigl\lceil \frac{b-a}{2r} \Bigr\rceil,
\tag{12.136}
\]

for every N ∈ ℕ, i ∈ {1, 2, . . . , N } let g_{N,i} ∈ [a, b] be given by

\[
g_{N,i} = a + \frac{(i - 1/2)(b-a)}{N},
\tag{12.137}
\]

and let A ⊆ [a, b]^d be given by

\[
A = \{g_{\mathfrak{N},1}, g_{\mathfrak{N},2}, \dots, g_{\mathfrak{N},\mathfrak{N}}\}^{d}
\tag{12.138}
\]

(cf. Definition 4.2.6). Observe that it holds for all N ∈ ℕ, i ∈ {1, 2, . . . , N },
x ∈ [a + (i−1)(b−a)/N , g_{N,i} ] that

\[
|x - g_{N,i}|
= a + \frac{(i - 1/2)(b-a)}{N} - x
\le a + \frac{(i - 1/2)(b-a)}{N} - \Bigl(a + \frac{(i-1)(b-a)}{N}\Bigr)
= \frac{b-a}{2N}.
\tag{12.139}
\]

In addition, note that it holds for all N ∈ ℕ, i ∈ {1, 2, . . . , N }, x ∈ [g_{N,i} , a + i(b−a)/N ] that

\[
|x - g_{N,i}|
= x - \Bigl(a + \frac{(i - 1/2)(b-a)}{N}\Bigr)
\le a + \frac{i(b-a)}{N} - \Bigl(a + \frac{(i - 1/2)(b-a)}{N}\Bigr)
= \frac{b-a}{2N}.
\tag{12.140}
\]

Combining this with (12.139) implies for all N ∈ ℕ, i ∈ {1, 2, . . . , N }, x ∈ [a + (i−1)(b−a)/N ,
a + i(b−a)/N ] that |x − g_{N,i} | ≤ (b−a)/(2N ). This proves that for every N ∈ ℕ, x ∈ [a, b] there
exists y ∈ {g_{N,1} , g_{N,2} , . . . , g_{N,N} } such that
\[
|x - y| \le \frac{b-a}{2N}.
\tag{12.141}
\]
This shows that for every x = (x1 , x2 , . . . , xd ) ∈ [a, b]^d there exists y = (y1 , y2 , . . . , yd ) ∈ A
such that
\[
\delta(x, y) = \|x - y\|_\infty = \max_{i \in \{1, 2, \dots, d\}} |x_i - y_i|
\le \frac{b-a}{2\mathfrak{N}}
\le \frac{(b-a)\,2r}{2(b-a)} = r.
\tag{12.142}
\]

Combining this with (4.82), (12.138), (12.136), and the fact that ∀ x ∈ [0, ∞) : ⌈x⌉ ≤
1_{(0,r]}(rx) + 2x 1_{(r,∞)}(rx) demonstrates that
\[
\mathcal{C}^{([a,b]^d,\delta),r} \le |A| = (\mathfrak{N})^{d}
= \Bigl\lceil \frac{b-a}{2r} \Bigr\rceil^{d}
\le \mathbb{1}_{(0,r]}\bigl(\tfrac{b-a}{2}\bigr)
+ \bigl(\tfrac{b-a}{r}\bigr)^{d}\, \mathbb{1}_{(r,\infty)}\bigl(\tfrac{b-a}{2}\bigr)
\tag{12.143}
\]
(cf. Definition 4.3.2). The proof of Proposition 12.2.24 is thus complete.
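
The covering constructed in the proof of Proposition 12.2.24 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names and the random spot check are hypothetical additions, and the spot check only samples finitely many points rather than proving the covering property.

```python
import itertools
import math
import random

def midpoint_grid(a, b, r, d):
    """Grid A from (12.138): per-axis midpoints of n_axis equal subintervals of [a, b]."""
    n_axis = math.ceil((b - a) / (2 * r))  # the integer defined in (12.136)
    axis = [a + (i - 0.5) * (b - a) / n_axis for i in range(1, n_axis + 1)]
    return list(itertools.product(axis, repeat=d))

def dist_to_grid(x, grid):
    """Sup-norm distance from the point x to the nearest grid point."""
    return min(max(abs(xi - yi) for xi, yi in zip(x, y)) for y in grid)

def check_cover(a, b, r, d, trials=1000, seed=0):
    """Return (|A|, right-hand side of (12.135), worst sampled distance to A)."""
    rng = random.Random(seed)
    grid = midpoint_grid(a, b, r, d)
    worst = max(
        dist_to_grid([rng.uniform(a, b) for _ in range(d)], grid)
        for _ in range(trials)
    )
    bound = max(1.0, ((b - a) / r) ** d)
    return len(grid), bound, worst

if __name__ == "__main__":
    print(check_cover(0.0, 1.0, 0.2, 2))
```

For a = 0, b = 1, r = 0.2, d = 2 the grid uses 3 points per axis, so |A| = 9, which is below the bound ((b − a)/r)^d = 25 from (12.135).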

12.3 Empirical risk minimization


12.3.1 Concentration inequalities for random fields
Lemma 12.3.1. Let (E, d) be a separable metric space and let F ⊆ E be a set. Then
(F, d|F ×F ) (12.144)
is a separable metric space.
Proof of Lemma 12.3.1. Throughout this proof, assume without loss of generality that
F ̸= ∅, let e = (en )n∈N : N → E be a sequence of elements in E such that {en ∈ E : n ∈ N}
is dense in E, and let f = (fn )n∈N : N → F be a sequence of elements in F such that for all
n ∈ N it holds that
\[
d(f_n, e_n) \le
\begin{cases}
0 & : e_n \in F \\
\bigl[\inf_{x \in F} d(x, e_n)\bigr] + \frac{1}{2^n} & : e_n \notin F.
\end{cases}
\tag{12.145}
\]
Observe that for all v ∈ F \{em ∈ E : m ∈ N}, n ∈ N it holds that
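\[
\begin{aligned}
\inf_{m \in \mathbb{N}} d(v, f_m)
&\le \inf_{m \in \mathbb{N} \cap [n, \infty)} d(v, f_m) \\
&\le \inf_{m \in \mathbb{N} \cap [n, \infty)} \bigl[d(v, e_m) + d(e_m, f_m)\bigr] \\
&\le \inf_{m \in \mathbb{N} \cap [n, \infty)} \Bigl[d(v, e_m) + \bigl[\inf_{x \in F} d(x, e_m)\bigr] + \frac{1}{2^m}\Bigr] \\
&\le \inf_{m \in \mathbb{N} \cap [n, \infty)} \Bigl[2\, d(v, e_m) + \frac{1}{2^m}\Bigr] \\
&\le 2 \Bigl[\inf_{m \in \mathbb{N} \cap [n, \infty)} d(v, e_m)\Bigr] + \frac{1}{2^n} = \frac{1}{2^n}.
\end{aligned}
\tag{12.146}
\]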


Combining this with the fact that for all v ∈ F ∩ {em ∈ E : m ∈ N} it holds that
inf m∈N d(v, fm ) = 0 ensures that the set {fn ∈ F : n ∈ N} is dense in F . The proof of
Lemma 12.3.1 is thus complete.

Lemma 12.3.2. Let (E, ℰ) be a topological space, assume E ≠ ∅, let 𝐄 ⊆ E be an at most
countable set, assume that 𝐄 is dense in E, let (Ω, F) be a measurable space, for every
x ∈ E let fx : Ω → R be F/B(R)-measurable, assume for all ω ∈ Ω that E ∋ x ↦ fx (ω) ∈ R
is continuous, and let F : Ω → R ∪ {∞} satisfy for all ω ∈ Ω that

\[
F(\omega) = \sup_{x \in \mathbf{E}} f_x(\omega).
\tag{12.147}
\]

Then

(i) it holds for all ω ∈ Ω that F (ω) = sup_{x∈E} fx (ω) and

(ii) it holds that F is F/B(R ∪ {∞})-measurable.

Proof of Lemma 12.3.2. Observe that the assumption that 𝐄 is dense in E shows that for
all g ∈ C(E, R) it holds that
\[
\sup_{x \in \mathbf{E}} g(x) = \sup_{x \in E} g(x).
\tag{12.148}
\]

This and the assumption that for all ω ∈ Ω it holds that E ∋ x ↦ fx (ω) ∈ R is continuous
demonstrate that for all ω ∈ Ω it holds that

\[
F(\omega) = \sup_{x \in \mathbf{E}} f_x(\omega) = \sup_{x \in E} f_x(\omega).
\tag{12.149}
\]

This proves item (i). Furthermore, note that item (i), the fact that 𝐄 is at most countable,
and the assumption that for all x ∈ E it holds that fx : Ω → R is F/B(R)-measurable
establish item (ii). The proof of Lemma 12.3.2 is thus complete.

Lemma 12.3.3. Let (E, δ) be a separable metric space, let ε, L ∈ R, N ∈ N, z1 , z2 , . . . , zN ∈
E satisfy E ⊆ ⋃_{i=1}^N {x ∈ E : 2Lδ(x, zi ) ≤ ε}, let (Ω, F, P) be a probability space, and let
Zx : Ω → R, x ∈ E, be random variables which satisfy for all x, y ∈ E that |Zx − Zy | ≤
Lδ(x, y). Then
\[
\mathbb{P}\bigl(\sup\nolimits_{x \in E} |Z_x| \ge \varepsilon\bigr)
\le \sum_{i=1}^{N} \mathbb{P}\bigl(|Z_{z_i}| \ge \tfrac{\varepsilon}{2}\bigr)
\tag{12.150}
\]

(cf. Lemma 12.3.2).

Proof of Lemma 12.3.3. Throughout this proof, let B1 , B2 , . . . , BN ⊆ E satisfy for all
i ∈ {1, 2, . . . , N } that Bi = {x ∈ E : 2Lδ(x, zi ) ≤ ε}. Observe that the triangle inequality


and the assumption that for all x, y ∈ E it holds that |Zx − Zy | ≤ Lδ(x, y) show that for
all i ∈ {1, 2, . . . , N }, x ∈ Bi it holds that

\[
|Z_x| = |Z_x - Z_{z_i} + Z_{z_i}| \le |Z_x - Z_{z_i}| + |Z_{z_i}|
\le L\delta(x, z_i) + |Z_{z_i}| \le \tfrac{\varepsilon}{2} + |Z_{z_i}|.
\tag{12.151}
\]

Combining this with Lemma 12.3.2 and Lemma 12.3.1 proves that for all i ∈ {1, 2, . . . , N }
it holds that
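\[
\mathbb{P}\bigl(\sup\nolimits_{x \in B_i} |Z_x| \ge \varepsilon\bigr)
\le \mathbb{P}\bigl(\tfrac{\varepsilon}{2} + |Z_{z_i}| \ge \varepsilon\bigr)
= \mathbb{P}\bigl(|Z_{z_i}| \ge \tfrac{\varepsilon}{2}\bigr).
\tag{12.152}
\]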

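This, Lemma 12.3.2, and Lemma 12.3.1 establish that

\[
\begin{aligned}
\mathbb{P}\bigl(\sup\nolimits_{x \in E} |Z_x| \ge \varepsilon\bigr)
&= \mathbb{P}\Bigl(\sup\nolimits_{x \in (\bigcup_{i=1}^{N} B_i)} |Z_x| \ge \varepsilon\Bigr)
= \mathbb{P}\Bigl(\bigcup_{i=1}^{N} \bigl\{\sup\nolimits_{x \in B_i} |Z_x| \ge \varepsilon\bigr\}\Bigr) \\
&\le \sum_{i=1}^{N} \mathbb{P}\bigl(\sup\nolimits_{x \in B_i} |Z_x| \ge \varepsilon\bigr)
\le \sum_{i=1}^{N} \mathbb{P}\bigl(|Z_{z_i}| \ge \tfrac{\varepsilon}{2}\bigr).
\end{aligned}
\tag{12.153}
\]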

This completes the proof of Lemma 12.3.3.


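Lemma 12.3.4. Let (E, δ) be a separable metric space, assume E ≠ ∅, let ε, L ∈ (0, ∞),
let (Ω, F, P) be a probability space, and let Zx : Ω → R, x ∈ E, be random variables which
satisfy for all x, y ∈ E that |Zx − Zy | ≤ Lδ(x, y). Then
\[
\bigl[\mathcal{C}^{(E,\delta),\frac{\varepsilon}{2L}}\bigr]^{-1}\,
\mathbb{P}\bigl(\sup\nolimits_{x \in E} |Z_x| \ge \varepsilon\bigr)
\le \sup_{x \in E} \mathbb{P}\bigl(|Z_x| \ge \tfrac{\varepsilon}{2}\bigr)
\tag{12.154}
\]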

(cf. Definition 4.3.2 and Lemma 12.3.2).


Proof of Lemma 12.3.4. Throughout this proof, let N ∈ ℕ ∪ {∞} satisfy N = C^{(E,δ), ε/(2L)}, assume
without loss of generality that N < ∞, and let z1 , z2 , . . . , zN ∈ E satisfy
E ⊆ ⋃_{i=1}^N {x ∈ E : δ(x, zi ) ≤ ε/(2L)} (cf. Definition 4.3.2). Observe that Lemma 12.3.2 and
Lemma 12.3.3 establish that
\[
\mathbb{P}\bigl(\sup\nolimits_{x \in E} |Z_x| \ge \varepsilon\bigr)
\le \sum_{i=1}^{N} \mathbb{P}\bigl(|Z_{z_i}| \ge \tfrac{\varepsilon}{2}\bigr)
\le N \Bigl[\sup_{x \in E} \mathbb{P}\bigl(|Z_x| \ge \tfrac{\varepsilon}{2}\bigr)\Bigr].
\tag{12.155}
\]

This completes the proof of Lemma 12.3.4.


Lemma 12.3.5. Let (E, δ) be a separable metric space, assume E ̸= ∅, let (Ω, F, P) be
a probability space, let L ∈ R, for every x ∈ E let Zx : Ω → R be a random variable with
E[|Zx |] < ∞, and assume for all x, y ∈ E that |Zx − Zy | ≤ Lδ(x, y). Then
(i) it holds for all x, y ∈ E, η ∈ Ω that

|(Zx (η) − E[Zx ]) − (Zy (η) − E[Zy ])| ≤ 2Lδ(x, y) (12.156)

and


(ii) it holds that Ω ∋ η 7→ supx∈E |Zx (η) − E[Zx ]| ∈ [0, ∞] is F/B([0, ∞])-measurable.

Proof of Lemma 12.3.5. Observe that the assumption that for all x, y ∈ E it holds that
|Zx − Zy | ≤ Lδ(x, y) implies that for all x, y ∈ E, η ∈ Ω it holds that

|(Zx (η) − E[Zx ]) − (Zy (η) − E[Zy ])| = |(Zx (η) − Zy (η)) + (E[Zy ] − E[Zx ])|
≤ |Zx (η) − Zy (η)| + |E[Zx ] − E[Zy ]|
≤ Lδ(x, y) + |E[Zx ] − E[Zy ]|
(12.157)
= Lδ(x, y) + |E[Zx − Zy ]|
≤ Lδ(x, y) + E[|Zx − Zy |]
≤ Lδ(x, y) + Lδ(x, y) = 2Lδ(x, y).

This ensures item (i). Note that item (i) shows that for all η ∈ Ω it holds that E ∋ x 7→
|Zx (η) − E[Zx ]| ∈ R is continuous. Combining this and the assumption that E is separable
with Lemma 12.3.2 proves item (ii). The proof of Lemma 12.3.5 is thus complete.

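Lemma 12.3.6. Let (E, δ) be a separable metric space, assume E ≠ ∅, let ε, L ∈ (0, ∞),
let (Ω, F, P) be a probability space, and let Zx : Ω → R, x ∈ E, be random variables which
satisfy for all x, y ∈ E that E[|Zx |] < ∞ and |Zx − Zy | ≤ Lδ(x, y). Then
\[
\bigl[\mathcal{C}^{(E,\delta),\frac{\varepsilon}{4L}}\bigr]^{-1}\,
\mathbb{P}\bigl(\sup\nolimits_{x \in E} |Z_x - \mathbb{E}[Z_x]| \ge \varepsilon\bigr)
\le \sup_{x \in E} \mathbb{P}\bigl(|Z_x - \mathbb{E}[Z_x]| \ge \tfrac{\varepsilon}{2}\bigr)
\tag{12.158}
\]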

(cf. Definition 4.3.2 and Lemma 12.3.5).

Proof of Lemma 12.3.6. Throughout this proof, let Yx : Ω → R, x ∈ E, satisfy for all x ∈ E,
η ∈ Ω that Yx (η) = Zx (η) − E[Zx ]. Observe that Lemma 12.3.5 ensures that for all x, y ∈ E
it holds that
|Yx − Yy | ≤ 2Lδ(x, y). (12.159)
This and Lemma 12.3.4 (applied with (E, δ) ↶ (E, δ), ε ↶ ε, L ↶ 2L, (Ω, F, P) ↶
(Ω, F, P), (Zx )x∈E ↶ (Yx )x∈E in the notation of Lemma 12.3.4) establish (12.158). The
proof of Lemma 12.3.6 is thus complete.

Lemma 12.3.7. Let (E, δ) be a separable metric space, assume E ̸= ∅, let M ∈ N,


ε, L, D ∈ (0, ∞), let (Ω, F, P) be a probability space, for every x ∈ E let Yx,1 , Yx,2 , . . . ,
Yx,M : Ω → [0, D] be independent random variables, assume for all x, y ∈ E, m ∈ {1, 2, . . . ,
M } that |Yx,m − Yy,m | ≤ Lδ(x, y), and let Zx : Ω → [0, ∞), x ∈ E, satisfy for all x ∈ E that
\[
Z_x = \frac{1}{M} \Bigl[\sum_{m=1}^{M} Y_{x,m}\Bigr].
\tag{12.160}
\]

Then


(i) it holds for all x ∈ E that E[|Zx |] ≤ D < ∞,

(ii) it holds that Ω ∋ η 7→ supx∈E |Zx (η) − E[Zx ]| ∈ [0, ∞] is F/B([0, ∞])-measurable, and

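(iii) it holds that

\[
\mathbb{P}\bigl(\sup\nolimits_{x \in E} |Z_x - \mathbb{E}[Z_x]| \ge \varepsilon\bigr)
\le 2\, \mathcal{C}^{(E,\delta),\frac{\varepsilon}{4L}} \exp\Bigl(\frac{-\varepsilon^2 M}{2D^2}\Bigr)
\tag{12.161}
\]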

(cf. Definition 4.3.2).


Proof of Lemma 12.3.7. First, observe that the triangle inequality and the assumption that
for all x, y ∈ E, m ∈ {1, 2, . . . , M } it holds that |Yx,m − Yy,m | ≤ Lδ(x, y) imply that for all
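x, y ∈ E it holds that
\[
\begin{aligned}
|Z_x - Z_y|
&= \biggl|\frac{1}{M}\Bigl[\sum_{m=1}^{M} Y_{x,m}\Bigr] - \frac{1}{M}\Bigl[\sum_{m=1}^{M} Y_{y,m}\Bigr]\biggr|
= \frac{1}{M} \biggl|\sum_{m=1}^{M} \bigl(Y_{x,m} - Y_{y,m}\bigr)\biggr| \\
&\le \frac{1}{M} \Bigl[\sum_{m=1}^{M} \bigl|Y_{x,m} - Y_{y,m}\bigr|\Bigr] \le L\delta(x, y).
\end{aligned}
\tag{12.162}
\]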

Next note that the assumption that for all x ∈ E, m ∈ {1, 2, . . . , M }, ω ∈ Ω it holds that
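|Yx,m (ω)| ∈ [0, D] ensures that for all x ∈ E it holds that
\[
\mathbb{E}\bigl[|Z_x|\bigr]
= \mathbb{E}\biggl[\frac{1}{M}\Bigl[\sum_{m=1}^{M} Y_{x,m}\Bigr]\biggr]
= \frac{1}{M}\Bigl[\sum_{m=1}^{M} \mathbb{E}\bigl[Y_{x,m}\bigr]\Bigr]
\le D < \infty.
\tag{12.163}
\]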

This proves item (i). Furthermore, note that item (i), (12.162), and Lemma 12.3.5 establish
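item (ii). Next observe that (12.160) shows that for all x ∈ E it holds that
\[
|Z_x - \mathbb{E}[Z_x]|
= \biggl|\frac{1}{M}\Bigl[\sum_{m=1}^{M} Y_{x,m}\Bigr] - \mathbb{E}\biggl[\frac{1}{M}\Bigl[\sum_{m=1}^{M} Y_{x,m}\Bigr]\biggr]\biggr|
= \frac{1}{M} \biggl|\sum_{m=1}^{M} \bigl(Y_{x,m} - \mathbb{E}\bigl[Y_{x,m}\bigr]\bigr)\biggr|.
\tag{12.164}
\]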

Combining this with Corollary 12.1.21 (applied with (Ω, F, P) ↶ (Ω, F, P), N ↶ M ,
ε ↶ ε/2, (a1 , a2 , . . . , aN ) ↶ (0, 0, . . . , 0), (b1 , b2 , . . . , bN ) ↶ (D, D, . . . , D), (Xn )n∈{1,2,...,N } ↶
(Yx,m )m∈{1,2,...,M } for x ∈ E in the notation of Corollary 12.1.21) ensures that for all x ∈ E
it holds that
\[
\mathbb{P}\bigl(|Z_x - \mathbb{E}[Z_x]| \ge \tfrac{\varepsilon}{2}\bigr)
\le 2 \exp\Bigl(\frac{-2 \bigl(\tfrac{\varepsilon}{2}\bigr)^2 M^2}{M D^2}\Bigr)
= 2 \exp\Bigl(\frac{-\varepsilon^2 M}{2D^2}\Bigr).
\tag{12.165}
\]

Combining this, (12.162), and (12.163) with Lemma 12.3.6 establishes item (iii). The proof
of Lemma 12.3.7 is thus complete.


12.3.2 Uniform estimates for the statistical learning error


Lemma 12.3.8. Let (E, δ) be a separable metric space, assume E ≠ ∅, let M ∈ N, ε, L, D ∈
(0, ∞), let (Ω, F, P) be a probability space, let Xx,m : Ω → R, x ∈ E, m ∈ {1, 2, . . . , M },
and Ym : Ω → R, m ∈ {1, 2, . . . , M }, be functions, assume for all x ∈ E that (Xx,m , Ym ),
m ∈ {1, 2, . . . , M }, are i.i.d. random variables, assume for all x, y ∈ E, m ∈ {1, 2, . . . , M }
that |Xx,m − Xy,m | ≤ Lδ(x, y) and |Xx,m − Ym | ≤ D, let 𝓔x : Ω → [0, ∞), x ∈ E, satisfy for
all x ∈ E that
\[
\mathscr{E}_x = \frac{1}{M} \Bigl[\sum_{m=1}^{M} |X_{x,m} - Y_m|^2\Bigr],
\tag{12.166}
\]
and let 𝔈x ∈ [0, ∞), x ∈ E, satisfy for all x ∈ E that 𝔈x = E[|Xx,1 − Y1 |^2 ]. Then
Ω ∋ ω ↦ sup_{x∈E} |𝓔x (ω) − 𝔈x | ∈ [0, ∞] is F/B([0, ∞])-measurable and
\[
\mathbb{P}\bigl(\sup\nolimits_{x \in E} |\mathscr{E}_x - \mathfrak{E}_x| \ge \varepsilon\bigr)
\le 2\, \mathcal{C}^{(E,\delta),\frac{\varepsilon}{8LD}} \exp\Bigl(\frac{-\varepsilon^2 M}{2D^4}\Bigr)
\tag{12.167}
\]
(cf. Definition 4.3.2).
Proof of Lemma 12.3.8. Throughout this proof, let 𝓔x,m : Ω → [0, D^2 ], x ∈ E, m ∈
{1, 2, . . . , M }, satisfy for all x ∈ E, m ∈ {1, 2, . . . , M } that

\[
\mathscr{E}_{x,m} = |X_{x,m} - Y_m|^2.
\tag{12.168}
\]

Observe that the fact that for all x1 , x2 , y ∈ R it holds that (x1 − y)^2 − (x2 − y)^2 =
(x1 − x2 )((x1 − y) + (x2 − y)), the assumption that for all x ∈ E, m ∈ {1, 2, . . . , M } it holds
that |Xx,m − Ym | ≤ D, and the assumption that for all x, y ∈ E, m ∈ {1, 2, . . . , M } it holds
that |Xx,m − Xy,m | ≤ Lδ(x, y) imply that for all x, y ∈ E, m ∈ {1, 2, . . . , M } it holds that

\[
\begin{aligned}
|\mathscr{E}_{x,m} - \mathscr{E}_{y,m}|
&= \bigl|(X_{x,m} - Y_m)^2 - (X_{y,m} - Y_m)^2\bigr| \\
&= |X_{x,m} - X_{y,m}| \, \bigl|(X_{x,m} - Y_m) + (X_{y,m} - Y_m)\bigr| \\
&\le |X_{x,m} - X_{y,m}| \bigl(|X_{x,m} - Y_m| + |X_{y,m} - Y_m|\bigr) \\
&\le 2D |X_{x,m} - X_{y,m}| \le 2LD\,\delta(x, y).
\end{aligned}
\tag{12.169}
\]

In addition, note that (12.166) and the assumption that for all x ∈ E it holds that
(Xx,m , Ym ), m ∈ {1, 2, . . . , M }, are i.i.d. random variables show that for all x ∈ E it holds
that
\[
\mathbb{E}\bigl[\mathscr{E}_x\bigr]
= \frac{1}{M}\Bigl[\sum_{m=1}^{M} \mathbb{E}\bigl[|X_{x,m} - Y_m|^2\bigr]\Bigr]
= \frac{1}{M}\Bigl[\sum_{m=1}^{M} \mathbb{E}\bigl[|X_{x,1} - Y_1|^2\bigr]\Bigr]
= \frac{1}{M}\Bigl[\sum_{m=1}^{M} \mathfrak{E}_x\Bigr]
= \mathfrak{E}_x.
\tag{12.170}
\]
Furthermore, observe that the assumption that for all x ∈ E it holds that (Xx,m , Ym ),
m ∈ {1, 2, . . . , M }, are i.i.d. random variables ensures that for all x ∈ E it holds that 𝓔x,m ,
m ∈ {1, 2, . . . , M }, are i.i.d. random variables. Combining this, (12.169), and (12.170)
with Lemma 12.3.7 (applied with (E, δ) ↶ (E, δ), M ↶ M , ε ↶ ε, L ↶ 2LD, D ↶ D^2 ,
(Ω, F, P) ↶ (Ω, F, P), (Yx,m )x∈E, m∈{1,2,...,M } ↶ (𝓔x,m )x∈E, m∈{1,2,...,M } , (Zx )x∈E ↶ (𝓔x )x∈E in
the notation of Lemma 12.3.7) establishes (12.167). The proof of Lemma 12.3.8 is thus
complete.

Proposition 12.3.9. Let d, 𝐝, M ∈ ℕ, R, L, 𝓡, ε ∈ (0, ∞), let D ⊆ ℝ^d be a compact set,
let (Ω, F, P) be a probability space, let Xm : Ω → D, m ∈ {1, 2, . . . , M }, and Ym : Ω → ℝ,
m ∈ {1, 2, . . . , M }, be functions, assume that (Xm , Ym ), m ∈ {1, 2, . . . , M }, are i.i.d.
random variables, let H = (Hθ )θ∈[−R,R]^𝐝 : [−R, R]^𝐝 → C(D, ℝ) satisfy for all θ, ϑ ∈ [−R, R]^𝐝 ,
x ∈ D that |Hθ (x) − Hϑ (x)| ≤ L∥θ − ϑ∥∞ , assume for all θ ∈ [−R, R]^𝐝 , m ∈ {1, 2, . . . , M }
that |Hθ (Xm ) − Ym | ≤ 𝓡 and E[|Y1 |^2 ] < ∞, let 𝔈 : C(D, ℝ) → [0, ∞) satisfy for all
f ∈ C(D, ℝ) that 𝔈(f ) = E[|f (X1 ) − Y1 |^2 ], and let 𝓔 : [−R, R]^𝐝 × Ω → [0, ∞) satisfy for all
θ ∈ [−R, R]^𝐝 , ω ∈ Ω that
\[
\mathscr{E}(\theta, \omega) = \frac{1}{M} \Bigl[\sum_{m=1}^{M} |H_\theta(X_m(\omega)) - Y_m(\omega)|^2\Bigr]
\tag{12.171}
\]

(cf. Definition 3.3.4). Then Ω ∋ ω ↦ sup_{θ∈[−R,R]^𝐝} |𝓔(θ, ω) − 𝔈(Hθ )| ∈ [0, ∞] is F/B([0, ∞])-
measurable and
\[
\mathbb{P}\bigl(\sup\nolimits_{\theta \in [-R,R]^{\mathbf{d}}} |\mathscr{E}(\theta) - \mathfrak{E}(H_\theta)| \ge \varepsilon\bigr)
\le 2 \max\Bigl\{1, \Bigl[\frac{16 L R \mathcal{R}}{\varepsilon}\Bigr]^{\mathbf{d}}\Bigr\}
\exp\Bigl(\frac{-\varepsilon^2 M}{2\mathcal{R}^4}\Bigr).
\tag{12.172}
\]

Proof of Proposition 12.3.9. Throughout this proof, let B ⊆ ℝ^𝐝 satisfy B = [−R, R]^𝐝 =
{θ ∈ ℝ^𝐝 : ∥θ∥∞ ≤ R} and let δ : B × B → [0, ∞) satisfy for all θ, ϑ ∈ B that

\[
\delta(\theta, \vartheta) = \|\theta - \vartheta\|_\infty.
\tag{12.173}
\]

Observe that the assumption that (Xm , Ym ), m ∈ {1, 2, . . . , M }, are i.i.d. random variables
and the assumption that for all θ ∈ [−R, R]^𝐝 it holds that Hθ is continuous imply
that for all θ ∈ B it holds that (Hθ (Xm ), Ym ), m ∈ {1, 2, . . . , M }, are i.i.d. random
variables. Combining this, the assumption that for all θ, ϑ ∈ B, x ∈ D it holds that
|Hθ (x) − Hϑ (x)| ≤ L∥θ − ϑ∥∞ , and the assumption that for all θ ∈ B, m ∈ {1, 2, . . . , M }
it holds that |Hθ (Xm ) − Ym | ≤ 𝓡 with Lemma 12.3.8 (applied with (E, δ) ↶ (B, δ),
M ↶ M , ε ↶ ε, L ↶ L, D ↶ 𝓡, (Ω, F, P) ↶ (Ω, F, P), (Xx,m )x∈E, m∈{1,2,...,M } ↶
(Hθ (Xm ))θ∈B, m∈{1,2,...,M } , (Ym )m∈{1,2,...,M } ↶ (Ym )m∈{1,2,...,M } , the empirical quantities
in (12.166) ↶ (Ω ∋ ω ↦ 𝓔(θ, ω) ∈ [0, ∞))θ∈B , the expected quantities ↶ (𝔈(Hθ ))θ∈B in
the notation of Lemma 12.3.8) establishes that Ω ∋ ω ↦ sup_{θ∈B} |𝓔(θ, ω) − 𝔈(Hθ )| ∈ [0, ∞]
is F/B([0, ∞])-measurable and

\[
\mathbb{P}\bigl(\sup\nolimits_{\theta \in B} |\mathscr{E}(\theta) - \mathfrak{E}(H_\theta)| \ge \varepsilon\bigr)
\le 2\, \mathcal{C}^{(B,\delta),\frac{\varepsilon}{8L\mathcal{R}}} \exp\Bigl(\frac{-\varepsilon^2 M}{2\mathcal{R}^4}\Bigr)
\tag{12.174}
\]

(cf. Definition 4.3.2). Moreover, note that Proposition 12.2.24 (applied with d ↶ 𝐝, a ↶ −R,
b ↶ R, r ↶ ε/(8L𝓡), δ ↶ δ in the notation of Proposition 12.2.24) demonstrates that
\[
\mathcal{C}^{(B,\delta),\frac{\varepsilon}{8L\mathcal{R}}}
\le \max\Bigl\{1, \Bigl[\frac{16 L R \mathcal{R}}{\varepsilon}\Bigr]^{\mathbf{d}}\Bigr\}.
\tag{12.175}
\]

This and (12.174) prove (12.172). The proof of Proposition 12.3.9 is thus complete.
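
The estimate (12.172) can be inverted to obtain a sufficient sample size for a prescribed failure probability. The following Python sketch does this on the logarithmic scale to avoid overflow. The function and argument names are hypothetical: `lipschitz` stands for the Lipschitz constant of θ ↦ Hθ, `param_radius` for the half-width of the parameter cube, `loss_bound` for the uniform bound on |Hθ(Xm) − Ym|, and `dim` for the number of parameters.

```python
import math

def log_tail_bound(M, eps, lipschitz, param_radius, loss_bound, dim):
    """Natural logarithm of the right-hand side of (12.172)."""
    cover = max(1.0, 16.0 * lipschitz * param_radius * loss_bound / eps)
    return (math.log(2.0) + dim * math.log(cover)
            - eps ** 2 * M / (2.0 * loss_bound ** 4))

def sufficient_samples(p, eps, lipschitz, param_radius, loss_bound, dim):
    """An integer M for which the right-hand side of (12.172) is at most p."""
    cover = max(1.0, 16.0 * lipschitz * param_radius * loss_bound / eps)
    needed = math.log(2.0) + dim * math.log(cover) - math.log(p)
    return max(1, math.ceil(2.0 * loss_bound ** 4 * needed / eps ** 2))

if __name__ == "__main__":
    M = sufficient_samples(0.05, 0.1, 1.0, 1.0, 1.0, 10)
    print(M, math.exp(log_tail_bound(M, 0.1, 1.0, 1.0, 1.0, 10)))
```

The required M grows like ε^{−2} up to a logarithmic factor and only linearly in the number of parameters, which is the qualitative content of the bound.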
Corollary 12.3.10. Let d, M, L ∈ ℕ, u ∈ ℝ, v ∈ (u, ∞), R ∈ [1, ∞), ε, b ∈ (0, ∞),
l = (l0 , l1 , . . . , lL ) ∈ ℕ^{L+1} satisfy lL = 1 and Σ_{k=1}^L lk (lk−1 + 1) ≤ d, let D ⊆ [−b, b]^{l0} be a
compact set, let (Ω, F, P) be a probability space, let Xm : Ω → D, m ∈ {1, 2, . . . , M }, and
Ym : Ω → [u, v], m ∈ {1, 2, . . . , M }, be functions, assume that (Xm , Ym ), m ∈ {1, 2, . . . , M },
are i.i.d. random variables, let 𝔈 : C(D, ℝ) → [0, ∞) satisfy for all f ∈ C(D, ℝ) that
𝔈(f ) = E[|f (X1 ) − Y1 |^2 ], and let 𝓔 : [−R, R]^d × Ω → [0, ∞) satisfy for all θ ∈ [−R, R]^d ,
ω ∈ Ω that
\[
\mathscr{E}(\theta, \omega) = \frac{1}{M} \Bigl[\sum_{m=1}^{M} \bigl|\mathscr{N}^{\theta,l}_{u,v}(X_m(\omega)) - Y_m(\omega)\bigr|^2\Bigr]
\tag{12.176}
\]
(cf. Definition 4.4.1). Then

(i) it holds that Ω ∋ ω ↦ sup_{θ∈[−R,R]^d} |𝓔(θ, ω) − 𝔈(𝒩^{θ,l}_{u,v}|_D )| ∈ [0, ∞] is F/B([0, ∞])-
measurable and

(ii) it holds that

\[
\begin{aligned}
&\mathbb{P}\bigl(\sup\nolimits_{\theta \in [-R,R]^d} \bigl|\mathscr{E}(\theta) - \mathfrak{E}\bigl(\mathscr{N}^{\theta,l}_{u,v}|_D\bigr)\bigr| \ge \varepsilon\bigr) \\
&\quad \le 2 \max\Bigl\{1, \Bigl[\frac{16 L \max\{1, b\} (\|l\|_\infty + 1)^L R^L (v - u)}{\varepsilon}\Bigr]^{d}\Bigr\}
\exp\Bigl(\frac{-\varepsilon^2 M}{2(v - u)^4}\Bigr).
\end{aligned}
\tag{12.177}
\]

Proof of Corollary 12.3.10. Throughout this proof, let 𝔏 ∈ (0, ∞) satisfy

\[
\mathfrak{L} = L \max\{1, b\} (\|l\|_\infty + 1)^L R^{L-1}.
\tag{12.178}
\]

Observe that Corollary 11.3.7 (applied with a ↶ −b, b ↶ b, u ↶ u, v ↶ v, d ↶ d, L ↶ L,
l ↶ l in the notation of Corollary 11.3.7) and the assumption that D ⊆ [−b, b]^{l0} show that
for all θ, ϑ ∈ [−R, R]^d it holds that
\[
\begin{aligned}
\sup_{x \in D} \bigl|\mathscr{N}^{\theta,l}_{u,v}(x) - \mathscr{N}^{\vartheta,l}_{u,v}(x)\bigr|
&\le \sup_{x \in [-b,b]^{l_0}} \bigl|\mathscr{N}^{\theta,l}_{u,v}(x) - \mathscr{N}^{\vartheta,l}_{u,v}(x)\bigr| \\
&\le L \max\{1, b\} (\|l\|_\infty + 1)^L (\max\{1, \|\theta\|_\infty, \|\vartheta\|_\infty\})^{L-1} \|\theta - \vartheta\|_\infty \\
&\le L \max\{1, b\} (\|l\|_\infty + 1)^L R^{L-1} \|\theta - \vartheta\|_\infty
= \mathfrak{L} \|\theta - \vartheta\|_\infty.
\end{aligned}
\tag{12.179}
\]

Furthermore, observe that the fact that for all θ ∈ ℝ^d , x ∈ ℝ^{l0} it holds that 𝒩^{θ,l}_{u,v}(x) ∈ [u, v]
and the assumption that for all m ∈ {1, 2, . . . , M }, ω ∈ Ω it holds that Ym (ω) ∈ [u, v]
demonstrate that for all θ ∈ [−R, R]^d , m ∈ {1, 2, . . . , M } it holds that
\[
\bigl|\mathscr{N}^{\theta,l}_{u,v}(X_m) - Y_m\bigr| \le v - u.
\tag{12.180}
\]

Combining this and (12.179) with Proposition 12.3.9 (applied with input dimension ↶ l0 ,
parameter dimension ↶ d, M ↶ M , R ↶ R, L ↶ 𝔏, the bound on |Hθ (Xm ) − Ym | ↶ v − u,
ε ↶ ε, D ↶ D, (Ω, F, P) ↶ (Ω, F, P), (Xm )m∈{1,2,...,M } ↶ (Xm )m∈{1,2,...,M } ,
(Ym )m∈{1,2,...,M } ↶ ((Ω ∋ ω ↦ Ym (ω) ∈ ℝ))m∈{1,2,...,M } , H ↶ ([−R, R]^d ∋ θ ↦ 𝒩^{θ,l}_{u,v}|_D ∈ C(D, ℝ)),
𝔈 ↶ 𝔈, 𝓔 ↶ 𝓔 in the notation of Proposition 12.3.9) establishes
that Ω ∋ ω ↦ sup_{θ∈[−R,R]^d} |𝓔(θ, ω) − 𝔈(𝒩^{θ,l}_{u,v}|_D )| ∈ [0, ∞] is F/B([0, ∞])-measurable and

\[
\mathbb{P}\bigl(\sup\nolimits_{\theta \in [-R,R]^d} \bigl|\mathscr{E}(\theta) - \mathfrak{E}\bigl(\mathscr{N}^{\theta,l}_{u,v}|_D\bigr)\bigr| \ge \varepsilon\bigr)
\le 2 \max\Bigl\{1, \Bigl[\frac{16 \mathfrak{L} R (v - u)}{\varepsilon}\Bigr]^{d}\Bigr\}
\exp\Bigl(\frac{-\varepsilon^2 M}{2(v - u)^4}\Bigr).
\tag{12.181}
\]
The proof of Corollary 12.3.10 is thus complete.
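
To see how the architecture enters the estimate (12.177), the following Python sketch evaluates the logarithm of its right-hand side for a given tuple of layer widths. The helper names are hypothetical, and the sketch assumes that the parameter dimension d equals the number of network parameters unless another value is passed explicitly.

```python
import math

def num_parameters(layers):
    """Sum over k = 1, ..., L of l_k * (l_{k-1} + 1) for widths (l_0, ..., l_L)."""
    return sum(layers[k] * (layers[k - 1] + 1) for k in range(1, len(layers)))

def log_bound_12_177(M, eps, layers, b, R, u, v, d=None):
    """Natural logarithm of the right-hand side of (12.177)."""
    depth = len(layers) - 1                       # the number L of affine layers
    d = num_parameters(layers) if d is None else d
    width = max(layers) + 1                       # the factor ||l||_inf + 1
    cover = (16.0 * depth * max(1.0, b) * width ** depth * R ** depth
             * (v - u) / eps)
    return (math.log(2.0) + d * math.log(max(1.0, cover))
            - eps ** 2 * M / (2.0 * (v - u) ** 4))

if __name__ == "__main__":
    print(num_parameters((2, 3, 1)))
    print(log_bound_12_177(10 ** 6, 0.1, (2, 3, 1), 1.0, 1.0, 0.0, 1.0))
```

The covering term depends exponentially on the depth through (∥l∥∞ + 1)^L R^L, but it enters the exponent of the bound only logarithmically, multiplied by the number of parameters d.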

Chapter 13

Strong generalization error estimates

In Chapter 12 above we reviewed generalization error estimates in the probabilistic sense.


Besides such probabilistic generalization error estimates, generalization error estimates in
the strong Lp -sense are also considered in the literature and in our overall error analysis in
Chapter 15 below we employ such strong generalization error estimates. These estimates
are precisely the subject of this chapter (cf. Corollary 13.3.3 below).
We refer to the beginning of Chapter 12 for a short list of references in the literature
dealing with similar generalization error estimates. The specific material in this chapter
mostly consists of slightly modified extracts from Jentzen & Welti [230, Section 4].

13.1 Monte Carlo estimates

Proposition 13.1.1. Let d, M ∈ N, let (Ω, F, P) be a probability space, let Xj : Ω → Rd ,


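j ∈ {1, 2, . . . , M }, be independent random variables, and assume max_{j∈{1,2,...,M }} E[∥Xj ∥₂ ] <
∞ (cf. Definition 3.3.4). Then

\[
\biggl(\mathbb{E}\biggl[\Bigl\|\frac{1}{M}\Bigl[\sum_{j=1}^{M} X_j\Bigr] - \mathbb{E}\Bigl[\frac{1}{M}\sum_{j=1}^{M} X_j\Bigr]\Bigr\|_2^2\biggr]\biggr)^{1/2}
\le \frac{1}{\sqrt{M}} \Bigl[\max_{j \in \{1, 2, \dots, M\}} \bigl(\mathbb{E}\bigl[\|X_j - \mathbb{E}[X_j]\|_2^2\bigr]\bigr)^{1/2}\Bigr].
\tag{13.1}
\]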

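Proof of Proposition 13.1.1. Observe that the fact that for all x ∈ Rd it holds that ⟨x, x⟩ =
∥x∥₂² demonstrates that

\[
\begin{aligned}
&\Bigl\|\frac{1}{M}\Bigl[\sum_{j=1}^{M} X_j\Bigr] - \mathbb{E}\Bigl[\frac{1}{M}\sum_{j=1}^{M} X_j\Bigr]\Bigr\|_2^2 \\
&= \frac{1}{M^2} \Bigl\|\Bigl[\sum_{j=1}^{M} X_j\Bigr] - \mathbb{E}\Bigl[\sum_{j=1}^{M} X_j\Bigr]\Bigr\|_2^2 \\
&= \frac{1}{M^2} \Bigl\|\sum_{j=1}^{M} \bigl(X_j - \mathbb{E}[X_j]\bigr)\Bigr\|_2^2 \\
&= \frac{1}{M^2} \Bigl[\sum_{i,j=1}^{M} \bigl\langle X_i - \mathbb{E}[X_i],\, X_j - \mathbb{E}[X_j]\bigr\rangle\Bigr] \\
&= \frac{1}{M^2} \Bigl[\sum_{j=1}^{M} \|X_j - \mathbb{E}[X_j]\|_2^2\Bigr]
+ \frac{1}{M^2} \Bigl[\sum_{(i,j) \in \{1, 2, \dots, M\}^2,\, i \ne j} \bigl\langle X_i - \mathbb{E}[X_i],\, X_j - \mathbb{E}[X_j]\bigr\rangle\Bigr]
\end{aligned}
\tag{13.2}
\]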

(cf. Definition 1.4.7). This, the fact that for all independent random variables Y : Ω → Rd
and Z : Ω → Rd with E[∥Y ∥2 + ∥Z∥2 ] < ∞ it holds that E[|⟨Y, Z⟩|] < ∞ and E[⟨Y, Z⟩] =
⟨E[Y ], E[Z]⟩, and the assumption that Xj : Ω → Rd , j ∈ {1, 2, . . . , M }, are independent
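random variables establish that
\[
\begin{aligned}
&\mathbb{E}\biggl[\Bigl\|\frac{1}{M}\Bigl[\sum_{j=1}^{M} X_j\Bigr] - \mathbb{E}\Bigl[\frac{1}{M}\sum_{j=1}^{M} X_j\Bigr]\Bigr\|_2^2\biggr] \\
&= \frac{1}{M^2} \Bigl[\sum_{j=1}^{M} \mathbb{E}\bigl[\|X_j - \mathbb{E}[X_j]\|_2^2\bigr]\Bigr]
+ \frac{1}{M^2} \Bigl[\sum_{(i,j) \in \{1, 2, \dots, M\}^2,\, i \ne j} \bigl\langle \mathbb{E}\bigl[X_i - \mathbb{E}[X_i]\bigr],\, \mathbb{E}\bigl[X_j - \mathbb{E}[X_j]\bigr]\bigr\rangle\Bigr] \\
&= \frac{1}{M^2} \Bigl[\sum_{j=1}^{M} \mathbb{E}\bigl[\|X_j - \mathbb{E}[X_j]\|_2^2\bigr]\Bigr] \\
&\le \frac{1}{M} \Bigl[\max_{j \in \{1, 2, \dots, M\}} \mathbb{E}\bigl[\|X_j - \mathbb{E}[X_j]\|_2^2\bigr]\Bigr].
\end{aligned}
\tag{13.3}
\]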

The proof of Proposition 13.1.1 is thus complete.
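
The estimate (13.1) can be checked exactly in a small special case. The following Python sketch takes d = 1 and lets X1, ..., XM be i.i.d. fair signs in {−1, 1} (an assumption chosen here for illustration, as are the function names), enumerates all 2^M outcomes, and compares the exact root mean square error of the sample mean with the right-hand side of (13.1).

```python
import itertools
import math

def l2_error_of_mean(M):
    """Exact (E[|mean - E[mean]|^2])^(1/2) for M i.i.d. fair signs in {-1, 1}."""
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=M):
        total += (sum(signs) / M) ** 2      # E[mean] = 0 for fair signs
    return math.sqrt(total / 2 ** M)

def bound_13_1(M):
    """Right-hand side of (13.1); every sign has mean 0 and variance 1."""
    return 1.0 / math.sqrt(M)

if __name__ == "__main__":
    for M in (1, 2, 5, 10):
        print(M, l2_error_of_mean(M), bound_13_1(M))
```

For identically distributed summands the two sides agree, which reflects that the last inequality in (13.3) is then an equality.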

Definition 13.1.2 (Rademacher family). Let (Ω, F, P) be a probability space and let J
be a set. Then we say that (rj )j∈J is a P-Rademacher family if and only if it holds that
rj : Ω → {−1, 1}, j ∈ J, are independent random variables with

∀ j ∈ J : P(rj = 1) = P(rj = −1). (13.4)

Definition 13.1.3 (p-Kahane–Khintchine constant). Let p ∈ (0, ∞). Then we denote by


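Kp ∈ (0, ∞] the extended real number given by

\[
K_p = \sup\left\{ c \in [0, \infty) \colon
\left[
\begin{array}{c}
\exists\, \mathbb{R}\text{-Banach space } (E, |||\cdot|||) \colon \\
\exists\, \text{probability space } (\Omega, \mathcal{F}, \mathbb{P}) \colon \\
\exists\, \mathbb{P}\text{-Rademacher family } (r_j)_{j \in \mathbb{N}} \colon \\
\exists\, k \in \mathbb{N} \colon \exists\, x_1, x_2, \dots, x_k \in E \setminus \{0\} \colon \\
\bigl(\mathbb{E}\bigl[|||\sum_{j=1}^{k} r_j x_j|||^p\bigr]\bigr)^{1/p}
= c \bigl(\mathbb{E}\bigl[|||\sum_{j=1}^{k} r_j x_j|||^2\bigr]\bigr)^{1/2}
\end{array}
\right]
\right\}
\tag{13.5}
\]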

(cf. Definition 13.1.2).


Lemma 13.1.4. It holds for all p ∈ [2, ∞) that

\[
K_p \le \sqrt{p - 1} < \infty
\tag{13.6}
\]

(cf. Definition 13.1.3).


Proof of Lemma 13.1.4. Note that (13.5) and Grohs et al. [179, Corollary 2.5] imply (13.6).
The proof of Lemma 13.1.4 is thus complete.
Proposition 13.1.5. Let d, M ∈ N, p ∈ [2, ∞), let (Ω, F, P) be a probability space, let
Xj : Ω → Rd , j ∈ {1, 2, . . . , M }, be independent random variables, and assume

max E[∥Xj ∥2 ] < ∞ (13.7)


j∈{1,2,...,M }

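(cf. Definition 3.3.4). Then

\[
\biggl(\mathbb{E}\biggl[\Bigl\|\Bigl[\sum_{j=1}^{M} X_j\Bigr] - \mathbb{E}\Bigl[\sum_{j=1}^{M} X_j\Bigr]\Bigr\|_2^p\biggr]\biggr)^{1/p}
\le 2 K_p \biggl[\sum_{j=1}^{M} \bigl(\mathbb{E}\bigl[\|X_j - \mathbb{E}[X_j]\|_2^p\bigr]\bigr)^{2/p}\biggr]^{1/2}
\tag{13.8}
\]

(cf. Definition 13.1.3 and Lemma 13.1.4).


Proof of Proposition 13.1.5. Observe that (13.5) and Cox et al. [86, Corollary 5.11] ensure
(13.8). The proof of Proposition 13.1.5 is thus complete.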
Corollary 13.1.6. Let d, M ∈ N, p ∈ [2, ∞), let (Ω, F, P) be a probability space, let
Xj : Ω → Rd , j ∈ {1, 2, . . . , M }, be independent random variables, and assume

max E[∥Xj ∥2 ] < ∞ (13.9)


j∈{1,2,...,M }

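(cf. Definition 3.3.4). Then

\[
\biggl(\mathbb{E}\biggl[\Bigl\|\frac{1}{M}\Bigl[\sum_{j=1}^{M} X_j\Bigr] - \mathbb{E}\Bigl[\frac{1}{M}\sum_{j=1}^{M} X_j\Bigr]\Bigr\|_2^p\biggr]\biggr)^{1/p}
\le \frac{2\sqrt{p-1}}{\sqrt{M}} \Bigl[\max_{j \in \{1, 2, \dots, M\}} \bigl(\mathbb{E}\bigl[\|X_j - \mathbb{E}[X_j]\|_2^p\bigr]\bigr)^{1/p}\Bigr].
\tag{13.10}
\]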


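Proof of Corollary 13.1.6. Note that Proposition 13.1.5 and Lemma 13.1.4 show that
\[
\begin{aligned}
&\biggl(\mathbb{E}\biggl[\Bigl\|\frac{1}{M}\Bigl[\sum_{j=1}^{M} X_j\Bigr] - \mathbb{E}\Bigl[\frac{1}{M}\sum_{j=1}^{M} X_j\Bigr]\Bigr\|_2^p\biggr]\biggr)^{1/p} \\
&= \frac{1}{M} \biggl(\mathbb{E}\biggl[\Bigl\|\Bigl[\sum_{j=1}^{M} X_j\Bigr] - \mathbb{E}\Bigl[\sum_{j=1}^{M} X_j\Bigr]\Bigr\|_2^p\biggr]\biggr)^{1/p} \\
&\le \frac{2K_p}{M} \biggl[\sum_{j=1}^{M} \bigl(\mathbb{E}\bigl[\|X_j - \mathbb{E}[X_j]\|_2^p\bigr]\bigr)^{2/p}\biggr]^{1/2} \\
&\le \frac{2K_p}{M} \biggl[M \Bigl(\max_{j \in \{1, 2, \dots, M\}} \bigl(\mathbb{E}\bigl[\|X_j - \mathbb{E}[X_j]\|_2^p\bigr]\bigr)^{2/p}\Bigr)\biggr]^{1/2} \\
&= \frac{2K_p}{\sqrt{M}} \Bigl[\max_{j \in \{1, 2, \dots, M\}} \bigl(\mathbb{E}\bigl[\|X_j - \mathbb{E}[X_j]\|_2^p\bigr]\bigr)^{1/p}\Bigr] \\
&\le \frac{2\sqrt{p-1}}{\sqrt{M}} \Bigl[\max_{j \in \{1, 2, \dots, M\}} \bigl(\mathbb{E}\bigl[\|X_j - \mathbb{E}[X_j]\|_2^p\bigr]\bigr)^{1/p}\Bigr]
\end{aligned}
\tag{13.11}
\]

(cf. Definition 13.1.3). The proof of Corollary 13.1.6 is thus complete.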
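
The L^p estimate (13.10) can likewise be inspected by exact enumeration. In the Python sketch below the random variables are again i.i.d. fair signs in {−1, 1} with d = 1 (an illustrative assumption, as are the function names), so that E[Xj ] = 0 and E[|Xj |^p ] = 1 for every p.

```python
import itertools
import math

def lp_error_of_mean(M, p):
    """Exact (E[|mean - E[mean]|^p])^(1/p) for M i.i.d. fair signs in {-1, 1}."""
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=M):
        total += abs(sum(signs) / M) ** p   # E[mean] = 0 for fair signs
    return (total / 2 ** M) ** (1.0 / p)

def bound_13_10(M, p):
    """Right-hand side of (13.10), using (E[|X_j - E[X_j]|^p])^(1/p) = 1."""
    return 2.0 * math.sqrt(p - 1.0) / math.sqrt(M)

if __name__ == "__main__":
    for p in (2, 4, 6):
        print(p, lp_error_of_mean(8, p), bound_13_10(8, p))
```

In this example the bound is not sharp, but both sides decay at the rate M^{−1/2}, and the constant on the right-hand side grows only like the square root of p.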

13.2 Uniform strong error estimates for random fields


Lemma 13.2.1. Let (E, δ) be a separable metric space, let N ∈ N, r1 , r2 , . . . , rN ∈ [0, ∞),
z1 , z2 , . . . , zN ∈ E satisfy

\[
E \subseteq \bigcup_{n=1}^{N} \{x \in E : \delta(x, z_n) \le r_n\},
\tag{13.12}
\]

let (Ω, F, P) be a probability space, for every x ∈ E let Zx : Ω → R be a random variable,
let L ∈ [0, ∞) satisfy for all x, y ∈ E that |Zx − Zy | ≤ Lδ(x, y), and let p ∈ [0, ∞). Then
\[
\mathbb{E}\bigl[\sup\nolimits_{x \in E} |Z_x|^p\bigr]
\le \sum_{n=1}^{N} \mathbb{E}\bigl[(L r_n + |Z_{z_n}|)^p\bigr]
\tag{13.13}
\]

(cf. Lemma 12.3.2).

Proof of Lemma 13.2.1. Throughout this proof, for every n ∈ {1, 2, . . . , N } let

Bn = {x ∈ E : δ(x, zn ) ≤ rn }. (13.14)

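Observe that (13.12) and (13.14) prove that

\[
E \subseteq \bigcup_{n=1}^{N} B_n
\qquad\text{and}\qquad
E \supseteq \bigcup_{n=1}^{N} B_n.
\tag{13.15}
\]

Hence, we obtain that

\[
\sup_{x \in E} |Z_x| = \sup_{x \in (\bigcup_{n=1}^{N} B_n)} |Z_x|
= \max_{n \in \{1, 2, \dots, N\}} \sup_{x \in B_n} |Z_x|.
\tag{13.16}
\]

Therefore, we obtain that

\[
\begin{aligned}
\mathbb{E}\bigl[\sup\nolimits_{x \in E} |Z_x|^p\bigr]
&= \mathbb{E}\bigl[\max\nolimits_{n \in \{1, 2, \dots, N\}} \sup\nolimits_{x \in B_n} |Z_x|^p\bigr] \\
&\le \mathbb{E}\Bigl[\sum_{n=1}^{N} \sup\nolimits_{x \in B_n} |Z_x|^p\Bigr]
= \sum_{n=1}^{N} \mathbb{E}\bigl[\sup\nolimits_{x \in B_n} |Z_x|^p\bigr]
\end{aligned}
\tag{13.17}
\]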

(cf. Lemma 12.3.2). Furthermore, note that the assumption that for all x, y ∈ E it holds
that |Zx − Zy | ≤ Lδ(x, y) demonstrates that for all n ∈ {1, 2, . . . , N }, x ∈ Bn it holds that

|Zx | = |Zx − Zzn + Zzn | ≤ |Zx − Zzn | + |Zzn | ≤ Lδ(x, zn ) + |Zzn | ≤ Lrn + |Zzn |. (13.18)

This and (13.17) establish that

\[
\mathbb{E}\bigl[\sup\nolimits_{x \in E} |Z_x|^p\bigr]
\le \sum_{n=1}^{N} \mathbb{E}\bigl[(L r_n + |Z_{z_n}|)^p\bigr].
\tag{13.19}
\]

The proof of Lemma 13.2.1 is thus complete.


Lemma 13.2.2. Let (E, δ) be a non-empty separable metric space, let (Ω, F, P) be a
probability space, for every x ∈ E let Zx : Ω → R be a random variable, let L ∈ (0, ∞)
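satisfy for all x, y ∈ E that |Zx − Zy | ≤ Lδ(x, y), and let p, r ∈ (0, ∞). Then
\[
\mathbb{E}\bigl[\sup\nolimits_{x \in E} |Z_x|^p\bigr]
\le \mathcal{C}^{(E,\delta),r} \Bigl[\sup_{x \in E} \mathbb{E}\bigl[(L r + |Z_x|)^p\bigr]\Bigr]
\tag{13.20}
\]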

(cf. Definition 4.3.2 and Lemma 12.3.2).


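Proof of Lemma 13.2.2. Throughout this proof, assume without loss of generality that
C^{(E,δ),r} < ∞, let N = C^{(E,δ),r} , and let z1 , z2 , . . . , zN ∈ E satisfy

\[
E \subseteq \bigcup_{n=1}^{N} \{x \in E : \delta(x, z_n) \le r\}
\tag{13.21}
\]

(cf. Definition 4.3.2). Observe that Lemma 13.2.1 (applied with r1 ↶ r, r2 ↶ r, . . . , rN ↶ r
in the notation of Lemma 13.2.1) implies that
\[
\begin{aligned}
\mathbb{E}\bigl[\sup\nolimits_{x \in E} |Z_x|^p\bigr]
&\le \sum_{i=1}^{N} \mathbb{E}\bigl[(L r + |Z_{z_i}|)^p\bigr] \\
&\le \sum_{i=1}^{N} \Bigl[\sup_{x \in E} \mathbb{E}\bigl[(L r + |Z_x|)^p\bigr]\Bigr]
= N \Bigl[\sup_{x \in E} \mathbb{E}\bigl[(L r + |Z_x|)^p\bigr]\Bigr]
\end{aligned}
\tag{13.22}
\]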

(cf. Lemma 12.3.2). The proof of Lemma 13.2.2 is thus complete.


Lemma 13.2.3. Let (E, δ) be a non-empty separable metric space, let (Ω, F, P) be a
probability space, for every x ∈ E let Zx : Ω → R be a random variable with E[|Zx |] < ∞, let
L ∈ (0, ∞) satisfy for all x, y ∈ E that |Zx − Zy | ≤ Lδ(x, y), and let p ∈ [1, ∞), r ∈ (0, ∞).
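Then
\[
\bigl(\mathbb{E}\bigl[\sup\nolimits_{x \in E} |Z_x - \mathbb{E}[Z_x]|^p\bigr]\bigr)^{1/p}
\le \bigl(\mathcal{C}^{(E,\delta),r}\bigr)^{1/p}
\Bigl[2Lr + \sup_{x \in E} \bigl(\mathbb{E}\bigl[|Z_x - \mathbb{E}[Z_x]|^p\bigr]\bigr)^{1/p}\Bigr]
\tag{13.23}
\]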

(cf. Definition 4.3.2 and Lemma 12.3.5).


Proof of Lemma 13.2.3. Throughout this proof, for every x ∈ E let Yx : Ω → R satisfy for
all ω ∈ Ω that
Yx (ω) = Zx (ω) − E[Zx ]. (13.24)
Note that (13.24) and the triangle inequality ensure that for all x, y ∈ E it holds that
|Yx − Yy | = |(Zx − E[Zx ]) − (Zy − E[Zy ])|
= |(Zx − Zy ) − (E[Zx ] − E[Zy ])|
(13.25)
≤ |Zx − Zy | + |E[Zx ] − E[Zy ]|
≤ Lδ(x, y) + E[|Zx − Zy |] ≤ 2Lδ(x, y).
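Lemma 13.2.2 (applied with L ↶ 2L, (Ω, F, P) ↶ (Ω, F, P), (Zx )x∈E ↶ (Yx )x∈E in the
notation of Lemma 13.2.2) hence shows that
\[
\begin{aligned}
\bigl(\mathbb{E}\bigl[\sup\nolimits_{x \in E} |Z_x - \mathbb{E}[Z_x]|^p\bigr]\bigr)^{1/p}
&= \bigl(\mathbb{E}\bigl[\sup\nolimits_{x \in E} |Y_x|^p\bigr]\bigr)^{1/p} \\
&\le \bigl(\mathcal{C}^{(E,\delta),r}\bigr)^{1/p} \Bigl[\sup_{x \in E} \bigl(\mathbb{E}\bigl[(2Lr + |Y_x|)^p\bigr]\bigr)^{1/p}\Bigr] \\
&\le \bigl(\mathcal{C}^{(E,\delta),r}\bigr)^{1/p} \Bigl[2Lr + \sup_{x \in E} \bigl(\mathbb{E}\bigl[|Y_x|^p\bigr]\bigr)^{1/p}\Bigr] \\
&= \bigl(\mathcal{C}^{(E,\delta),r}\bigr)^{1/p} \Bigl[2Lr + \sup_{x \in E} \bigl(\mathbb{E}\bigl[|Z_x - \mathbb{E}[Z_x]|^p\bigr]\bigr)^{1/p}\Bigr].
\end{aligned}
\tag{13.26}
\]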

The proof of Lemma 13.2.3 is thus complete.


Lemma 13.2.4. Let (E, δ) be a non-empty separable metric space, let (Ω, F, P) be a
probability space, let M ∈ N, for every x ∈ E let Yx,m : Ω → R, m ∈ {1, 2, . . . , M }, be
independent random variables with E[|Yx,1 | + |Yx,2 | + . . . + |Yx,M |] < ∞, let L ∈ (0, ∞)
satisfy for all x, y ∈ E, m ∈ {1, 2, . . . , M } that
\[
|Y_{x,m} - Y_{y,m}| \le L\delta(x, y),
\tag{13.27}
\]
and for every x ∈ E let Zx : Ω → R satisfy
\[
Z_x = \frac{1}{M} \Bigl[\sum_{m=1}^{M} Y_{x,m}\Bigr].
\tag{13.28}
\]
Then

(i) it holds for all x ∈ E that E[|Zx |] < ∞,

(ii) it holds that Ω ∋ ω ↦ sup_{x∈E} |Zx (ω) − E[Zx ]| ∈ [0, ∞] is F/B([0, ∞])-measurable, and

(iii) it holds for all p ∈ [2, ∞), r ∈ (0, ∞) that
\[
\begin{aligned}
&\bigl(\mathbb{E}\bigl[\sup\nolimits_{x \in E} |Z_x - \mathbb{E}[Z_x]|^p\bigr]\bigr)^{1/p} \\
&\le 2 \bigl(\mathcal{C}^{(E,\delta),r}\bigr)^{1/p}
\Bigl[Lr + \frac{\sqrt{p-1}}{\sqrt{M}} \Bigl(\sup_{x \in E} \max_{m \in \{1, 2, \dots, M\}} \bigl(\mathbb{E}\bigl[|Y_{x,m} - \mathbb{E}[Y_{x,m}]|^p\bigr]\bigr)^{1/p}\Bigr)\Bigr]
\end{aligned}
\tag{13.29}
\]

(cf. Definition 4.3.2).


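Proof of Lemma 13.2.4. Observe that the assumption that for all x ∈ E, m ∈ {1, 2, . . . , M }
it holds that E[|Yx,m |] < ∞ proves that for all x ∈ E it holds that
\[
\mathbb{E}[|Z_x|]
= \mathbb{E}\biggl[\frac{1}{M} \Bigl|\sum_{m=1}^{M} Y_{x,m}\Bigr|\biggr]
\le \frac{1}{M} \Bigl[\sum_{m=1}^{M} \mathbb{E}[|Y_{x,m}|]\Bigr]
\le \max_{m \in \{1, 2, \dots, M\}} \mathbb{E}[|Y_{x,m}|] < \infty.
\tag{13.30}
\]

This establishes item (i). Note that (13.27) demonstrates that for all x, y ∈ E it holds that
\[
|Z_x - Z_y|
= \frac{1}{M} \biggl|\Bigl[\sum_{m=1}^{M} Y_{x,m}\Bigr] - \Bigl[\sum_{m=1}^{M} Y_{y,m}\Bigr]\biggr|
\le \frac{1}{M} \Bigl[\sum_{m=1}^{M} |Y_{x,m} - Y_{y,m}|\Bigr]
\le L\delta(x, y).
\tag{13.31}
\]
Item (i) and Lemma 12.3.5 therefore prove item (ii). It thus remains to show item (iii).
For this observe that item (i), (13.31), and Lemma 13.2.3 imply that for all p ∈ [1, ∞),
r ∈ (0, ∞) it holds that
\[
\bigl(\mathbb{E}\bigl[\sup\nolimits_{x \in E} |Z_x - \mathbb{E}[Z_x]|^p\bigr]\bigr)^{1/p}
\le \bigl(\mathcal{C}^{(E,\delta),r}\bigr)^{1/p}
\Bigl[2Lr + \sup_{x \in E} \bigl(\mathbb{E}\bigl[|Z_x - \mathbb{E}[Z_x]|^p\bigr]\bigr)^{1/p}\Bigr]
\tag{13.32}
\]

(cf. Definition 4.3.2). Furthermore, note that (13.30) and Corollary 13.1.6 (applied with
d ↶ 1, (Xm )m∈{1,2,...,M } ↶ (Yx,m )m∈{1,2,...,M } for x ∈ E in the notation of Corollary 13.1.6)
ensure that for all x ∈ E, p ∈ [2, ∞), r ∈ (0, ∞) it holds that
\[
\begin{aligned}
\bigl(\mathbb{E}\bigl[|Z_x - \mathbb{E}[Z_x]|^p\bigr]\bigr)^{1/p}
&= \biggl(\mathbb{E}\biggl[\Bigl|\frac{1}{M}\Bigl[\sum_{m=1}^{M} Y_{x,m}\Bigr] - \mathbb{E}\Bigl[\frac{1}{M}\sum_{m=1}^{M} Y_{x,m}\Bigr]\Bigr|^p\biggr]\biggr)^{1/p} \\
&\le \frac{2\sqrt{p-1}}{\sqrt{M}} \Bigl[\max_{m \in \{1, 2, \dots, M\}} \bigl(\mathbb{E}\bigl[|Y_{x,m} - \mathbb{E}[Y_{x,m}]|^p\bigr]\bigr)^{1/p}\Bigr].
\end{aligned}
\tag{13.33}
\]

Combining this with (13.32) shows that for all p ∈ [2, ∞), r ∈ (0, ∞) it holds that
\[
\begin{aligned}
&\bigl(\mathbb{E}\bigl[\sup\nolimits_{x \in E} |Z_x - \mathbb{E}[Z_x]|^p\bigr]\bigr)^{1/p} \\
&\le \bigl(\mathcal{C}^{(E,\delta),r}\bigr)^{1/p}
\Bigl[2Lr + \frac{2\sqrt{p-1}}{\sqrt{M}} \Bigl(\sup_{x \in E} \max_{m \in \{1, 2, \dots, M\}} \bigl(\mathbb{E}\bigl[|Y_{x,m} - \mathbb{E}[Y_{x,m}]|^p\bigr]\bigr)^{1/p}\Bigr)\Bigr] \\
&= 2 \bigl(\mathcal{C}^{(E,\delta),r}\bigr)^{1/p}
\Bigl[Lr + \frac{\sqrt{p-1}}{\sqrt{M}} \Bigl(\sup_{x \in E} \max_{m \in \{1, 2, \dots, M\}} \bigl(\mathbb{E}\bigl[|Y_{x,m} - \mathbb{E}[Y_{x,m}]|^p\bigr]\bigr)^{1/p}\Bigr)\Bigr].
\end{aligned}
\tag{13.34}
\]
The proof of Lemma 13.2.4 is thus complete.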


Corollary 13.2.5. Let (E, δ) be a non-empty separable metric space, let (Ω, F, P) be a
probability space, let M ∈ N, for every x ∈ E let Yx,m : Ω → R, m ∈ {1, 2, . . . , M}, be
independent random variables with E[|Yx,1| + |Yx,2| + . . . + |Yx,m|] < ∞, let L ∈ (0, ∞) satisfy
for all x, y ∈ E, m ∈ {1, 2, . . . , M} that |Yx,m − Yy,m| ≤ Lδ(x, y), and for every x ∈ E let
Zx : Ω → R satisfy
\[ Z_x = \frac{1}{M} \sum_{m=1}^{M} Y_{x,m}. \tag{13.35} \]
Then

(i) it holds for all x ∈ E that E[|Zx |] < ∞,

(ii) it holds that Ω ∋ ω 7→ supx∈E |Zx (ω) − E[Zx ]| ∈ [0, ∞] is F/B([0, ∞])-measurable, and

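(iii) it holds for all p ∈ [2, ∞), c ∈ (0, ∞) that
\[
\bigl(\mathbb{E}\bigl[\sup_{x \in E} |Z_x - \mathbb{E}[Z_x]|^p\bigr]\bigr)^{1/p}
\le \tfrac{2\sqrt{p-1}}{\sqrt{M}}
\Bigl(\mathcal{C}^{(E,\delta),\frac{c\sqrt{p-1}}{L\sqrt{M}}}\Bigr)^{1/p}
\Bigl[ c + \sup_{x \in E} \max_{m \in \{1,2,\dots,M\}} \bigl(\mathbb{E}\bigl[|Y_{x,m} - \mathbb{E}[Y_{x,m}]|^p\bigr]\bigr)^{1/p} \Bigr]
\tag{13.36}
\]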

(cf. Definition 4.3.2).

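Proof of Corollary 13.2.5. Observe that Lemma 13.2.4 establishes items (i) and (ii). Note
that Lemma 13.2.4 (applied with r ↶ c√(p−1)/(L√M) for c ∈ (0, ∞) in the notation of
Lemma 13.2.4) demonstrates that for all p ∈ [2, ∞), c ∈ (0, ∞) it holds that
\[
\begin{aligned}
&\bigl(\mathbb{E}\bigl[\sup_{x \in E} |Z_x - \mathbb{E}[Z_x]|^p\bigr]\bigr)^{1/p} \\
&\le 2 \Bigl(\mathcal{C}^{(E,\delta),\frac{c\sqrt{p-1}}{L\sqrt{M}}}\Bigr)^{1/p}
\Bigl[ L \tfrac{c\sqrt{p-1}}{L\sqrt{M}} + \tfrac{\sqrt{p-1}}{\sqrt{M}} \sup_{x \in E} \max_{m \in \{1,2,\dots,M\}} \bigl(\mathbb{E}\bigl[|Y_{x,m} - \mathbb{E}[Y_{x,m}]|^p\bigr]\bigr)^{1/p} \Bigr] \\
&= \tfrac{2\sqrt{p-1}}{\sqrt{M}} \Bigl(\mathcal{C}^{(E,\delta),\frac{c\sqrt{p-1}}{L\sqrt{M}}}\Bigr)^{1/p}
\Bigl[ c + \sup_{x \in E} \max_{m \in \{1,2,\dots,M\}} \bigl(\mathbb{E}\bigl[|Y_{x,m} - \mathbb{E}[Y_{x,m}]|^p\bigr]\bigr)^{1/p} \Bigr]
\end{aligned}
\tag{13.37}
\]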

(cf. Definition 4.3.2). This proves item (iii). The proof of Corollary 13.2.5 is thus complete.

13.3 Strong convergence rates for the generalisation error

Lemma 13.3.1. Let (E, δ) be a separable metric space, assume E ≠ ∅, let (Ω, F, P) be a
probability space, let M ∈ N, let Xx,m : Ω → R, m ∈ {1, 2, . . . , M}, x ∈ E, and Ym : Ω → R,
m ∈ {1, 2, . . . , M}, be functions, assume for all x ∈ E that (Xx,m, Ym), m ∈ {1, 2, . . . , M},
are i.i.d. random variables, let L, b ∈ (0, ∞) satisfy for all x, y ∈ E, m ∈ {1, 2, . . . , M} that

|Xx,m − Ym| ≤ b and |Xx,m − Xy,m| ≤ Lδ(x, y), (13.38)

and let 𝐑 : E → [0, ∞) and ℛ : E × Ω → [0, ∞) satisfy for all x ∈ E, ω ∈ Ω that
\[
\mathbf{R}(x) = \mathbb{E}\bigl[|X_{x,1} - Y_1|^2\bigr]
\qquad\text{and}\qquad
\mathcal{R}(x, \omega) = \frac{1}{M} \Bigl[ \sum_{m=1}^{M} |X_{x,m}(\omega) - Y_m(\omega)|^2 \Bigr].
\tag{13.39}
\]

Then

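(i) it holds that Ω ∋ ω ↦ supx∈E |ℛ(x, ω) − 𝐑(x)| ∈ [0, ∞] is F/B([0, ∞])-measurable
and

(ii) it holds for all p ∈ [2, ∞), c ∈ (0, ∞) that
\[
\bigl(\mathbb{E}\bigl[\sup_{x \in E} |\mathcal{R}(x) - \mathbf{R}(x)|^p\bigr]\bigr)^{1/p}
\le \Bigl(\mathcal{C}^{(E,\delta),\frac{cb\sqrt{p-1}}{2L\sqrt{M}}}\Bigr)^{1/p}
\Bigl[ \frac{2(c+1) b^2 \sqrt{p-1}}{\sqrt{M}} \Bigr]
\tag{13.40}
\]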

(cf. Definition 4.3.2).

Proof of Lemma 13.3.1. Throughout this proof, for every x ∈ E, m ∈ {1, 2, . . . , M} let
Yx,m : Ω → R satisfy Yx,m = |Xx,m − Ym|^2. Observe that the assumption that for all x ∈ E
it holds that (Xx,m, Ym), m ∈ {1, 2, . . . , M}, are i.i.d. random variables implies that for all
x ∈ E it holds that
\[
\mathbb{E}[\mathcal{R}(x)]
= \frac{1}{M} \Bigl[ \sum_{m=1}^{M} \mathbb{E}\bigl[|X_{x,m} - Y_m|^2\bigr] \Bigr]
= \frac{M \, \mathbb{E}\bigl[|X_{x,1} - Y_1|^2\bigr]}{M}
= \mathbf{R}(x).
\tag{13.41}
\]

Furthermore, note that the assumption that for all x ∈ E, m ∈ {1, 2, . . . , M} it holds that
|Xx,m − Ym| ≤ b shows that for all x ∈ E, m ∈ {1, 2, . . . , M} it holds that
\[
\mathbb{E}[|Y_{x,m}|] = \mathbb{E}\bigl[|X_{x,m} - Y_m|^2\bigr] \le b^2 < \infty,
\tag{13.42}
\]
\[
Y_{x,m} - \mathbb{E}[Y_{x,m}] = |X_{x,m} - Y_m|^2 - \mathbb{E}\bigl[|X_{x,m} - Y_m|^2\bigr] \le |X_{x,m} - Y_m|^2 \le b^2,
\tag{13.43}
\]
and
\[
\mathbb{E}[Y_{x,m}] - Y_{x,m} = \mathbb{E}\bigl[|X_{x,m} - Y_m|^2\bigr] - |X_{x,m} - Y_m|^2 \le \mathbb{E}\bigl[|X_{x,m} - Y_m|^2\bigr] \le b^2.
\tag{13.44}
\]

Observe that (13.42), (13.43), and (13.44) ensure for all x ∈ E, m ∈ {1, 2, . . . , M},
p ∈ (0, ∞) that
\[
\bigl(\mathbb{E}\bigl[|Y_{x,m} - \mathbb{E}[Y_{x,m}]|^p\bigr]\bigr)^{1/p} \le \bigl(\mathbb{E}\bigl[b^{2p}\bigr]\bigr)^{1/p} = b^2.
\tag{13.45}
\]

Moreover, note that (13.38) and the fact that for all x1, x2, y ∈ R it holds that
(x1 − y)^2 − (x2 − y)^2 = (x1 − x2)((x1 − y) + (x2 − y)) show that for all x, y ∈ E,
m ∈ {1, 2, . . . , M} it holds that
\[
\begin{aligned}
|Y_{x,m} - Y_{y,m}| &= |(X_{x,m} - Y_m)^2 - (X_{y,m} - Y_m)^2| \\
&\le |X_{x,m} - X_{y,m}| \, (|X_{x,m} - Y_m| + |X_{y,m} - Y_m|) \\
&\le 2b \, |X_{x,m} - X_{y,m}| \le 2bL\delta(x, y).
\end{aligned}
\tag{13.46}
\]
The fact that for all x ∈ E it holds that Yx,m, m ∈ {1, 2, . . . , M}, are independent
random variables, (13.42), and Corollary 13.2.5 (applied with (Yx,m)x∈E, m∈{1,2,...,M} ↶
(Yx,m)x∈E, m∈{1,2,...,M}, L ↶ 2bL, (Zx)x∈E ↶ (Ω ∋ ω ↦ ℛ(x, ω) ∈ R)x∈E in the notation of
Corollary 13.2.5) hence establish that
(I) it holds that Ω ∋ ω ↦ supx∈E |ℛ(x, ω) − 𝐑(x)| ∈ [0, ∞] is F/B([0, ∞])-measurable
and
(II) it holds for all p ∈ [2, ∞), c ∈ (0, ∞) that
\[
\bigl(\mathbb{E}\bigl[\sup_{x \in E} |\mathcal{R}(x) - \mathbb{E}[\mathcal{R}(x)]|^p\bigr]\bigr)^{1/p}
\le \tfrac{2\sqrt{p-1}}{\sqrt{M}}
\Bigl(\mathcal{C}^{(E,\delta),\frac{cb^2\sqrt{p-1}}{2bL\sqrt{M}}}\Bigr)^{1/p}
\Bigl[ cb^2 + \sup_{x \in E} \max_{m \in \{1,2,\dots,M\}} \bigl(\mathbb{E}\bigl[|Y_{x,m} - \mathbb{E}[Y_{x,m}]|^p\bigr]\bigr)^{1/p} \Bigr].
\tag{13.47}
\]

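Observe that item (II), (13.41), (13.42), and (13.45) demonstrate that for all p ∈ [2, ∞),
c ∈ (0, ∞) it holds that
\[
\begin{aligned}
\bigl(\mathbb{E}\bigl[\sup_{x \in E} |\mathcal{R}(x) - \mathbf{R}(x)|^p\bigr]\bigr)^{1/p}
&\le \tfrac{2\sqrt{p-1}}{\sqrt{M}}
\Bigl(\mathcal{C}^{(E,\delta),\frac{cb\sqrt{p-1}}{2L\sqrt{M}}}\Bigr)^{1/p} \bigl[ cb^2 + b^2 \bigr] \\
&= \Bigl(\mathcal{C}^{(E,\delta),\frac{cb\sqrt{p-1}}{2L\sqrt{M}}}\Bigr)^{1/p}
\Bigl[ \frac{2(c+1) b^2 \sqrt{p-1}}{\sqrt{M}} \Bigr].
\end{aligned}
\tag{13.48}
\]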
This and item (I) prove items (i) and (ii). The proof of Lemma 13.3.1 is thus complete.
Proposition 13.3.2. Let d ∈ N, D ⊆ R^d, let (Ω, F, P) be a probability space, let M ∈ N,
let Xm = (𝒳m, 𝒴m) : Ω → (D × R), m ∈ {1, 2, . . . , M}, be i.i.d. random variables, let α ∈ R,
β ∈ (α, ∞), 𝐝 ∈ N, let f = (fθ)θ∈[α,β]^𝐝 : [α, β]^𝐝 → C(D, R), let L, b ∈ (0, ∞) satisfy for all
θ, ϑ ∈ [α, β]^𝐝, m ∈ {1, 2, . . . , M}, x ∈ D that
\[
|f_\theta(\mathcal{X}_m) - \mathcal{Y}_m| \le b
\qquad\text{and}\qquad
|f_\theta(x) - f_\vartheta(x)| \le L \|\theta - \vartheta\|_\infty,
\tag{13.49}
\]
and let 𝐑 : [α, β]^𝐝 → [0, ∞) and ℛ : [α, β]^𝐝 × Ω → [0, ∞) satisfy for all θ ∈ [α, β]^𝐝, ω ∈ Ω
that
\[
\mathbf{R}(\theta) = \mathbb{E}\bigl[|f_\theta(\mathcal{X}_1) - \mathcal{Y}_1|^2\bigr]
\qquad\text{and}\qquad
\mathcal{R}(\theta, \omega) = \frac{1}{M} \Bigl[ \sum_{m=1}^{M} |f_\theta(\mathcal{X}_m(\omega)) - \mathcal{Y}_m(\omega)|^2 \Bigr]
\tag{13.50}
\]
(cf. Definition 3.3.4). Then


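(i) it holds that Ω ∋ ω ↦ supθ∈[α,β]^𝐝 |ℛ(θ, ω) − 𝐑(θ)| ∈ [0, ∞] is F/B([0, ∞])-measurable
and

(ii) it holds for all p ∈ (0, ∞) that
\[
\begin{aligned}
&\bigl(\mathbb{E}\bigl[\sup_{\theta \in [\alpha,\beta]^{\mathbf{d}}} |\mathcal{R}(\theta) - \mathbf{R}(\theta)|^p\bigr]\bigr)^{1/p} \\
&\le \inf_{c,\varepsilon \in (0,\infty)} \biggl[ \frac{2(c+1) b^2 \max\{1, [2\sqrt{M} L (\beta - \alpha)(cb)^{-1}]^{\varepsilon}\} \sqrt{\max\{1, p, \mathbf{d}/\varepsilon\}}}{\sqrt{M}} \biggr] \\
&\le \inf_{c \in (0,\infty)} \biggl[ \frac{2(c+1) b^2 \sqrt{e \max\{1, p, \mathbf{d} \ln(4 M L^2 (\beta - \alpha)^2 (cb)^{-2})\}}}{\sqrt{M}} \biggr].
\end{aligned}
\tag{13.51}
\]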

Proof of Proposition 13.3.2. Throughout this proof, let (κc)c∈(0,∞) ⊆ (0, ∞) satisfy for all
c ∈ (0, ∞) that
\[ \kappa_c = \frac{2\sqrt{M} L (\beta - \alpha)}{cb}, \tag{13.52} \]
let Xθ,m : Ω → R, m ∈ {1, 2, . . . , M }, θ ∈ [α, β]d , satisfy for all θ ∈ [α, β]d , m ∈
{1, 2, . . . , M } that
Xθ,m = fθ (Xm ), (13.53)
and let δ : [α, β]d × [α, β]d → [0, ∞) satisfy for all θ, ϑ ∈ [α, β]d that

δ(θ, ϑ) = ∥θ − ϑ∥∞ . (13.54)

First, note that the assumption that for all θ ∈ [α, β]d , m ∈ {1, 2, . . . , M } it holds that
|fθ (Xm ) − Ym | ≤ b implies for all θ ∈ [α, β]d , m ∈ {1, 2, . . . , M } that

|Xθ,m − Ym | = |fθ (Xm ) − Ym | ≤ b. (13.55)

Furthermore, observe that the assumption that for all θ, ϑ ∈ [α, β]d , x ∈ D it holds that
|fθ (x) − fϑ (x)| ≤ L∥θ − ϑ∥∞ ensures for all θ, ϑ ∈ [α, β]d , m ∈ {1, 2, . . . , M } that

|Xθ,m − Xϑ,m | = |fθ (Xm ) − fϑ (Xm )| ≤ supx∈D |fθ (x) − fϑ (x)| ≤ L∥θ − ϑ∥∞ = Lδ(θ, ϑ).
(13.56)
The fact that for all θ ∈ [α, β]d it holds that (Xθ,m , Ym ), m ∈ {1, 2, . . . , M }, are i.i.d. random
variables, (13.55), and Lemma 13.3.1 (applied with p ↶ q, C ↶ C, (E, δ) ↶ ([α, β]d , δ),
(Xx,m )x∈E, m∈{1,2,...,M } ↶ (Xθ,m )θ∈[α,β]d , m∈{1,2,...,M } for p ∈ [2, ∞), C ∈ (0, ∞) in the notation
of Lemma 13.3.1) therefore ensure that for all p ∈ [2, ∞), c ∈ (0, ∞) it holds that
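Ω ∋ ω ↦ supθ∈[α,β]^𝐝 |ℛ(θ, ω) − 𝐑(θ)| ∈ [0, ∞] is F/B([0, ∞])-measurable and
\[
\bigl(\mathbb{E}\bigl[\sup_{\theta \in [\alpha,\beta]^{\mathbf{d}}} |\mathcal{R}(\theta) - \mathbf{R}(\theta)|^p\bigr]\bigr)^{1/p}
\le \Bigl(\mathcal{C}^{([\alpha,\beta]^{\mathbf{d}},\delta),\frac{cb\sqrt{p-1}}{2L\sqrt{M}}}\Bigr)^{1/p}
\Bigl[ \frac{2(c+1) b^2 \sqrt{p-1}}{\sqrt{M}} \Bigr]
\tag{13.57}
\]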


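(cf. Definition 4.3.2). This establishes item (i). Note that Proposition 12.2.24 (applied with
d ↶ 𝐝, a ↶ α, b ↶ β, r ↶ r for r ∈ (0, ∞) in the notation of Proposition 12.2.24) shows
that for all r ∈ (0, ∞) it holds that
\[
\begin{aligned}
\mathcal{C}^{([\alpha,\beta]^{\mathbf{d}},\delta),r}
&\le \mathbb{1}_{[0,r]}\bigl(\tfrac{\beta-\alpha}{2}\bigr) + \bigl(\tfrac{\beta-\alpha}{r}\bigr)^{\mathbf{d}} \mathbb{1}_{(r,\infty)}\bigl(\tfrac{\beta-\alpha}{2}\bigr) \\
&\le \max\bigl\{1, \bigl(\tfrac{\beta-\alpha}{r}\bigr)^{\mathbf{d}}\bigr\} \Bigl( \mathbb{1}_{[0,r]}\bigl(\tfrac{\beta-\alpha}{2}\bigr) + \mathbb{1}_{(r,\infty)}\bigl(\tfrac{\beta-\alpha}{2}\bigr) \Bigr) \\
&= \max\bigl\{1, \bigl(\tfrac{\beta-\alpha}{r}\bigr)^{\mathbf{d}}\bigr\}.
\end{aligned}
\tag{13.58}
\]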

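Hence, we obtain for all c ∈ (0, ∞), p ∈ [2, ∞) that
\[
\begin{aligned}
\Bigl(\mathcal{C}^{([\alpha,\beta]^{\mathbf{d}},\delta),\frac{cb\sqrt{p-1}}{2L\sqrt{M}}}\Bigr)^{1/p}
&\le \max\Bigl\{1, \Bigl(\tfrac{2(\beta-\alpha) L \sqrt{M}}{cb\sqrt{p-1}}\Bigr)^{\mathbf{d}/p}\Bigr\} \\
&\le \max\Bigl\{1, \Bigl(\tfrac{2(\beta-\alpha) L \sqrt{M}}{cb}\Bigr)^{\mathbf{d}/p}\Bigr\}
= \max\bigl\{1, (\kappa_c)^{\mathbf{d}/p}\bigr\}.
\end{aligned}
\tag{13.59}
\]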

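This, (13.57), and Jensen's inequality demonstrate that for all c, ε, p ∈ (0, ∞) it holds that
\[
\begin{aligned}
&\bigl(\mathbb{E}\bigl[\sup_{\theta \in [\alpha,\beta]^{\mathbf{d}}} |\mathcal{R}(\theta) - \mathbf{R}(\theta)|^p\bigr]\bigr)^{1/p} \\
&\le \bigl(\mathbb{E}\bigl[\sup_{\theta \in [\alpha,\beta]^{\mathbf{d}}} |\mathcal{R}(\theta) - \mathbf{R}(\theta)|^{\max\{2,p,\mathbf{d}/\varepsilon\}}\bigr]\bigr)^{\frac{1}{\max\{2,p,\mathbf{d}/\varepsilon\}}} \\
&\le \max\Bigl\{1, (\kappa_c)^{\frac{\mathbf{d}}{\max\{2,p,\mathbf{d}/\varepsilon\}}}\Bigr\} \frac{2(c+1) b^2 \sqrt{\max\{2, p, \mathbf{d}/\varepsilon\} - 1}}{\sqrt{M}} \\
&= \max\bigl\{1, (\kappa_c)^{\min\{\mathbf{d}/2, \mathbf{d}/p, \varepsilon\}}\bigr\} \frac{2(c+1) b^2 \sqrt{\max\{1, p-1, \mathbf{d}/\varepsilon - 1\}}}{\sqrt{M}} \\
&\le \frac{2(c+1) b^2 \max\{1, (\kappa_c)^{\varepsilon}\} \sqrt{\max\{1, p, \mathbf{d}/\varepsilon\}}}{\sqrt{M}}.
\end{aligned}
\tag{13.60}
\]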
Moreover, observe that the fact that for all a ∈ (1, ∞) it holds that
\[
a^{1/(2\ln(a))} = e^{\ln(a)/(2\ln(a))} = e^{1/2} = \sqrt{e} \ge 1
\tag{13.61}
\]
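proves that for all c, p ∈ (0, ∞) with κc > 1 it holds that
\[
\begin{aligned}
&\inf_{\varepsilon \in (0,\infty)} \biggl[ \frac{2(c+1) b^2 \max\{1, (\kappa_c)^{\varepsilon}\} \sqrt{\max\{1, p, \mathbf{d}/\varepsilon\}}}{\sqrt{M}} \biggr] \\
&\le \frac{2(c+1) b^2 \max\{1, (\kappa_c)^{1/(2\ln(\kappa_c))}\} \sqrt{\max\{1, p, 2\mathbf{d}\ln(\kappa_c)\}}}{\sqrt{M}} \\
&= \frac{2(c+1) b^2 \sqrt{e \max\{1, p, \mathbf{d}\ln([\kappa_c]^2)\}}}{\sqrt{M}}.
\end{aligned}
\tag{13.62}
\]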


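The fact that for all c, p ∈ (0, ∞) with κc ≤ 1 it holds that
\[
\begin{aligned}
&\inf_{\varepsilon \in (0,\infty)} \biggl[ \frac{2(c+1) b^2 \max\{1, (\kappa_c)^{\varepsilon}\} \sqrt{\max\{1, p, \mathbf{d}/\varepsilon\}}}{\sqrt{M}} \biggr] \\
&= \inf_{\varepsilon \in (0,\infty)} \biggl[ \frac{2(c+1) b^2 \sqrt{\max\{1, p, \mathbf{d}/\varepsilon\}}}{\sqrt{M}} \biggr]
\le \frac{2(c+1) b^2 \sqrt{\max\{1, p\}}}{\sqrt{M}} \\
&\le \frac{2(c+1) b^2 \sqrt{e \max\{1, p, \mathbf{d}\ln([\kappa_c]^2)\}}}{\sqrt{M}}
\end{aligned}
\tag{13.63}
\]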

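and (13.60) therefore imply that for all p ∈ (0, ∞) it holds that
\[
\begin{aligned}
&\bigl(\mathbb{E}\bigl[\sup_{\theta \in [\alpha,\beta]^{\mathbf{d}}} |\mathcal{R}(\theta) - \mathbf{R}(\theta)|^p\bigr]\bigr)^{1/p} \\
&\le \inf_{c,\varepsilon \in (0,\infty)} \biggl[ \frac{2(c+1) b^2 \max\{1, (\kappa_c)^{\varepsilon}\} \sqrt{\max\{1, p, \mathbf{d}/\varepsilon\}}}{\sqrt{M}} \biggr] \\
&= \inf_{c,\varepsilon \in (0,\infty)} \biggl[ \frac{2(c+1) b^2 \max\{1, [2\sqrt{M} L (\beta - \alpha)(cb)^{-1}]^{\varepsilon}\} \sqrt{\max\{1, p, \mathbf{d}/\varepsilon\}}}{\sqrt{M}} \biggr] \\
&\le \inf_{c \in (0,\infty)} \biggl[ \frac{2(c+1) b^2 \sqrt{e \max\{1, p, \mathbf{d}\ln([\kappa_c]^2)\}}}{\sqrt{M}} \biggr] \\
&= \inf_{c \in (0,\infty)} \biggl[ \frac{2(c+1) b^2 \sqrt{e \max\{1, p, \mathbf{d}\ln(4 M L^2 (\beta - \alpha)^2 (cb)^{-2})\}}}{\sqrt{M}} \biggr].
\end{aligned}
\tag{13.64}
\]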

This establishes item (ii). The proof of Proposition 13.3.2 is thus complete.
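
The step from the first to the second infimum in (13.51) rests on the choice ε = 1/(2 ln(κc)) made in (13.62). The following minimal Python sketch evaluates both brackets for one illustrative parameter set and compares them. The function names, the argument name `lip` (standing for the Lipschitz constant L), and all numerical values are assumptions chosen for illustration only; they do not come from the text.

```python
import math

def kappa(c, M, lip, alpha, beta, b):
    """kappa_c as defined in (13.52)."""
    return 2.0 * math.sqrt(M) * lip * (beta - alpha) / (c * b)

def bound_eps(c, eps, p, dim, M, lip, alpha, beta, b):
    """Bracket inside the first infimum of (13.51) for fixed c and eps."""
    k = kappa(c, M, lip, alpha, beta, b)
    return (2.0 * (c + 1.0) * b ** 2 * max(1.0, k ** eps)
            * math.sqrt(max(1.0, p, dim / eps)) / math.sqrt(M))

def bound_log(c, p, dim, M, lip, alpha, beta, b):
    """Bracket inside the second infimum of (13.51) for fixed c."""
    k = kappa(c, M, lip, alpha, beta, b)
    return (2.0 * (c + 1.0) * b ** 2
            * math.sqrt(math.e * max(1.0, p, dim * math.log(k ** 2)))
            / math.sqrt(M))

# Illustrative (hypothetical) parameters with kappa_c > 1.
params = dict(p=2.0, dim=10, M=10_000, lip=1.0, alpha=-1.0, beta=1.0, b=1.0)
c = 1.0
k = kappa(c, params["M"], params["lip"], params["alpha"], params["beta"], params["b"])
eps_star = 1.0 / (2.0 * math.log(k))      # the choice of eps used in (13.62)
value_eps = bound_eps(c, eps_star, **params)
value_log = bound_log(c, **params)
print(k, value_eps, value_log)
```

For this particular ε the two brackets coincide up to rounding, which is exactly the equality in the last line of (13.62); other values of ε may of course give a smaller first bracket.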

Corollary 13.3.3. Let d, M ∈ N, b ∈ [1, ∞), u ∈ R, v ∈ [u + 1, ∞), D ⊆ [−b, b]^d, let
(Ω, F, P) be a probability space, let Xm = (𝒳m, 𝒴m) : Ω → (D × [u, v]), m ∈ {1, 2, . . . , M},
be i.i.d. random variables, let B ∈ [1, ∞), L, 𝐝 ∈ N, l = (l0, l1, . . . , lL) ∈ N^{L+1} satisfy l0 = d,
lL = 1, and 𝐝 ≥ Σ_{i=1}^{L} li(li−1 + 1), let 𝐑 : [−B, B]^𝐝 → [0, ∞) and ℛ : [−B, B]^𝐝 × Ω → [0, ∞)
satisfy for all θ ∈ [−B, B]^𝐝, ω ∈ Ω that
\[
\mathbf{R}(\theta) = \mathbb{E}\bigl[|\mathcal{N}^{\theta,l}_{u,v}(\mathcal{X}_1) - \mathcal{Y}_1|^2\bigr]
\qquad\text{and}\qquad
\mathcal{R}(\theta, \omega) = \frac{1}{M} \Bigl[ \sum_{m=1}^{M} |\mathcal{N}^{\theta,l}_{u,v}(\mathcal{X}_m(\omega)) - \mathcal{Y}_m(\omega)|^2 \Bigr]
\tag{13.65}
\]
(cf. Definition 4.4.1). Then

(i) it holds that Ω ∋ ω ↦ supθ∈[−B,B]^𝐝 |ℛ(θ, ω) − 𝐑(θ)| ∈ [0, ∞] is F/B([0, ∞])-measurable
and

(ii) it holds for all p ∈ (0, ∞) that
\[
\begin{aligned}
\bigl(\mathbb{E}\bigl[\sup_{\theta \in [-B,B]^{\mathbf{d}}} |\mathcal{R}(\theta) - \mathbf{R}(\theta)|^p\bigr]\bigr)^{1/p}
&\le \frac{9(v-u)^2 L (\|l\|_\infty + 1) \sqrt{\max\{p, \ln(4 (Mb)^{1/L} (\|l\|_\infty + 1) B)\}}}{\sqrt{M}} \\
&\le \frac{9(v-u)^2 L (\|l\|_\infty + 1)^2 \max\{p, \ln(3MBb)\}}{\sqrt{M}}
\end{aligned}
\tag{13.66}
\]

(cf. Definition 3.3.4).


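Proof of Corollary 13.3.3. Throughout this proof, let 𝔡 = Σ_{i=1}^{L} li(li−1 + 1) ∈ N, let 𝔏 =
bL(∥l∥∞ + 1)^L B^{L−1} ∈ (0, ∞), for every θ ∈ [−B, B]^𝔡 let fθ : D → R satisfy for all x ∈ D
that
\[ f_\theta(x) = \mathcal{N}^{\theta,l}_{u,v}(x), \tag{13.67} \]
let R : [−B, B]^𝔡 → [0, ∞) satisfy for all θ ∈ [−B, B]^𝔡 that
\[
R(\theta) = \mathbb{E}\bigl[|f_\theta(\mathcal{X}_1) - \mathcal{Y}_1|^2\bigr] = \mathbb{E}\bigl[|\mathcal{N}^{\theta,l}_{u,v}(\mathcal{X}_1) - \mathcal{Y}_1|^2\bigr],
\tag{13.68}
\]
and let 𝓡 : [−B, B]^𝔡 × Ω → [0, ∞) satisfy for all θ ∈ [−B, B]^𝔡, ω ∈ Ω that
\[
\mathscr{R}(\theta, \omega)
= \frac{1}{M} \Bigl[ \sum_{m=1}^{M} |f_\theta(\mathcal{X}_m(\omega)) - \mathcal{Y}_m(\omega)|^2 \Bigr]
= \frac{1}{M} \Bigl[ \sum_{m=1}^{M} |\mathcal{N}^{\theta,l}_{u,v}(\mathcal{X}_m(\omega)) - \mathcal{Y}_m(\omega)|^2 \Bigr]
\tag{13.69}
\]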

(cf. Definition 3.3.4). Note that the fact that for all θ ∈ R^𝔡, x ∈ R^d it holds that
𝒩^{θ,l}_{u,v}(x) ∈ [u, v] and the assumption that for all m ∈ {1, 2, . . . , M} it holds that
𝒴m(Ω) ⊆ [u, v] ensure for all θ ∈ [−B, B]^𝔡, m ∈ {1, 2, . . . , M} that
\[
|f_\theta(\mathcal{X}_m) - \mathcal{Y}_m| = |\mathcal{N}^{\theta,l}_{u,v}(\mathcal{X}_m) - \mathcal{Y}_m| \le \sup_{y_1, y_2 \in [u,v]} |y_1 - y_2| = v - u.
\tag{13.70}
\]

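Furthermore, observe that the assumption that D ⊆ [−b, b]^d, l0 = d, and lL = 1,
Corollary 11.3.7 (applied with a ↶ −b, b ↶ b, u ↶ u, v ↶ v, d ↶ 𝔡, L ↶ L, l ↶ l in the
notation of Corollary 11.3.7), and the assumption that b ≥ 1 and B ≥ 1 show that for all
θ, ϑ ∈ [−B, B]^𝔡, x ∈ D it holds that
\[
\begin{aligned}
|f_\theta(x) - f_\vartheta(x)|
&\le \sup_{y \in [-b,b]^d} |\mathcal{N}^{\theta,l}_{u,v}(y) - \mathcal{N}^{\vartheta,l}_{u,v}(y)| \\
&\le L \max\{1, b\} (\|l\|_\infty + 1)^L (\max\{1, \|\theta\|_\infty, \|\vartheta\|_\infty\})^{L-1} \|\theta - \vartheta\|_\infty \\
&\le b L (\|l\|_\infty + 1)^L B^{L-1} \|\theta - \vartheta\|_\infty = \mathfrak{L} \|\theta - \vartheta\|_\infty.
\end{aligned}
\tag{13.71}
\]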

Moreover, note that the fact that 𝐝 ≥ 𝔡 and the fact that for all θ = (θ1, θ2, . . . , θ𝐝) ∈ R^𝐝 it
holds that 𝒩^{θ,l}_{u,v} = 𝒩^{(θ1,θ2,...,θ𝔡),l}_{u,v} demonstrate that for all ω ∈ Ω it holds that
\[
\sup_{\theta \in [-B,B]^{\mathbf{d}}} |\mathcal{R}(\theta, \omega) - \mathbf{R}(\theta)|
= \sup_{\theta \in [-B,B]^{\mathfrak{d}}} |\mathscr{R}(\theta, \omega) - R(\theta)|.
\tag{13.72}
\]


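In addition, observe that (13.70), (13.71), Proposition 13.3.2 (applied with α ↶ −B, β ↶ B,
𝐝 ↶ 𝔡, b ↶ v − u, 𝐑 ↶ R, ℛ ↶ 𝓡 in the notation of Proposition 13.3.2), the fact that
\[ v - u \ge (u + 1) - u = 1 \tag{13.73} \]
and the fact that
\[ \mathfrak{d} \le L \|l\|_\infty (\|l\|_\infty + 1) \le L (\|l\|_\infty + 1)^2 \tag{13.74} \]
prove that for all p ∈ (0, ∞) it holds that Ω ∋ ω ↦ supθ∈[−B,B]^𝔡 |𝓡(θ, ω) − R(θ)| ∈ [0, ∞] is
F/B([0, ∞])-measurable and
\[
\begin{aligned}
&\bigl(\mathbb{E}\bigl[\sup_{\theta \in [-B,B]^{\mathfrak{d}}} |\mathscr{R}(\theta) - R(\theta)|^p\bigr]\bigr)^{1/p} \\
&\le \inf_{C \in (0,\infty)} \biggl[ \frac{2(C+1)(v-u)^2 \sqrt{e \max\{1, p, \mathfrak{d} \ln(4 M \mathfrak{L}^2 (2B)^2 (C[v-u])^{-2})\}}}{\sqrt{M}} \biggr] \\
&\le \inf_{C \in (0,\infty)} \biggl[ \frac{2(C+1)(v-u)^2 \sqrt{e \max\{1, p, L(\|l\|_\infty + 1)^2 \ln(2^4 M \mathfrak{L}^2 B^2 C^{-2})\}}}{\sqrt{M}} \biggr].
\end{aligned}
\tag{13.75}
\]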

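Combining this with (13.72) establishes item (i). Note that (13.72), (13.75), the fact that
2^6 L^2 ≤ 2^6 · 2^{2(L−1)} = 2^{4+2L} ≤ 2^{4L+2L} = 2^{6L}, the fact that 3 ≥ e, and the assumption that
B ≥ 1, L ≥ 1, M ≥ 1, and b ≥ 1 imply that for all p ∈ (0, ∞) it holds that
\[
\begin{aligned}
&\bigl(\mathbb{E}\bigl[\sup_{\theta \in [-B,B]^{\mathbf{d}}} |\mathcal{R}(\theta) - \mathbf{R}(\theta)|^p\bigr]\bigr)^{1/p}
= \bigl(\mathbb{E}\bigl[\sup_{\theta \in [-B,B]^{\mathfrak{d}}} |\mathscr{R}(\theta) - R(\theta)|^p\bigr]\bigr)^{1/p} \\
&\le \frac{2(\tfrac{1}{2}+1)(v-u)^2 \sqrt{e \max\{1, p, L(\|l\|_\infty + 1)^2 \ln(2^4 M \mathfrak{L}^2 B^2 2^2)\}}}{\sqrt{M}} \\
&= \frac{3(v-u)^2 \sqrt{e \max\{p, L(\|l\|_\infty + 1)^2 \ln(2^6 M b^2 L^2 (\|l\|_\infty + 1)^{2L} B^{2L})\}}}{\sqrt{M}} \\
&\le \frac{3(v-u)^2 \sqrt{e \max\{p, 3L^2(\|l\|_\infty + 1)^2 \ln([2^{6L} M b^2 (\|l\|_\infty + 1)^{2L} B^{2L}]^{1/(3L)})\}}}{\sqrt{M}} \\
&\le \frac{3(v-u)^2 \sqrt{3 \max\{p, 3L^2(\|l\|_\infty + 1)^2 \ln(2^2 (M b^2)^{1/(3L)} (\|l\|_\infty + 1) B)\}}}{\sqrt{M}} \\
&\le \frac{9(v-u)^2 L (\|l\|_\infty + 1) \sqrt{\max\{p, \ln(4 (Mb)^{1/L} (\|l\|_\infty + 1) B)\}}}{\sqrt{M}}.
\end{aligned}
\tag{13.76}
\]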

Next observe that the fact that for all n ∈ N it holds that n ≤ 2^{n−1} and the fact that
∥l∥∞ ≥ 1 ensure that
\[
4(\|l\|_\infty + 1) \le 2^2 \cdot 2^{(\|l\|_\infty + 1) - 1} = 2^3 \cdot 2^{(\|l\|_\infty + 1) - 2} \le 3^2 \cdot 3^{(\|l\|_\infty + 1) - 2} = 3^{(\|l\|_\infty + 1)}.
\tag{13.77}
\]


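Hence, we obtain that for all p ∈ (0, ∞) it holds that
\[
\begin{aligned}
&\frac{9(v-u)^2 L (\|l\|_\infty + 1) \sqrt{\max\{p, \ln(4 (Mb)^{1/L} (\|l\|_\infty + 1) B)\}}}{\sqrt{M}} \\
&\le \frac{9(v-u)^2 L (\|l\|_\infty + 1) \sqrt{\max\{p, (\|l\|_\infty + 1) \ln([3^{(\|l\|_\infty + 1)} (Mb)^{1/L} B]^{1/(\|l\|_\infty + 1)})\}}}{\sqrt{M}} \\
&\le \frac{9(v-u)^2 L (\|l\|_\infty + 1)^2 \max\{p, \ln(3MBb)\}}{\sqrt{M}}.
\end{aligned}
\tag{13.78}
\]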
This and (13.76) prove item (ii). The proof of Corollary 13.3.3 is thus complete.
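
To get a feeling for the size of the two right-hand sides in (13.66), the following short Python sketch evaluates them for given parameters. It is only an illustration under stated assumptions: the function name `generalization_bounds` and the sample values are hypothetical, `layers` stands for the vector l = (l0, l1, . . . , lL), and the hypotheses of Corollary 13.3.3 (b ≥ 1, B ≥ 1, v ≥ u + 1) are assumed to hold for the inputs.

```python
import math

def generalization_bounds(M, b, u, v, B, layers, p):
    """Evaluate the two right-hand sides of (13.66).

    layers = (l_0, ..., l_L); the depth L is len(layers) - 1 and the
    maximum entry of layers plays the role of the sup-norm of l.
    """
    depth = len(layers) - 1
    width = max(layers)
    prefactor = 9.0 * (v - u) ** 2 * depth * (width + 1) / math.sqrt(M)
    log_term = math.log(4.0 * (M * b) ** (1.0 / depth) * (width + 1) * B)
    sharper = prefactor * math.sqrt(max(p, log_term))
    simpler = prefactor * (width + 1) * max(p, math.log(3.0 * M * B * b))
    return sharper, simpler

# Hypothetical example: a network with layer dimensions (2, 5, 1).
sharper, simpler = generalization_bounds(
    M=1000, b=1.0, u=0.0, v=1.0, B=1.0, layers=(2, 5, 1), p=2.0)
sharper_more_data, _ = generalization_bounds(
    M=4000, b=1.0, u=0.0, v=1.0, B=1.0, layers=(2, 5, 1), p=2.0)
print(sharper, simpler, sharper_more_data)
```

Both expressions decay essentially like 1/√M up to logarithmic factors, so quadrupling M roughly halves them.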

Part V

Composed error analysis

Chapter 14

Overall error decomposition

In Chapter 15 below we combine parts of the approximation error estimates from Part II,
parts of the optimization error estimates from Part III, and parts of the generalization
error estimates from Part IV to establish estimates for the overall error in the training of
ANNs in the specific situation of GD-type optimization methods with many independent
random initializations. For such a combined error analysis we employ a suitable overall
error decomposition for supervised learning problems. It is the subject of this chapter to
review and derive this overall error decomposition (see Proposition 14.2.1 below).
In the literature, error decompositions of this kind can, for example, be found in [25, 35,
36, 87, 230]. The presentation in this chapter closely follows [25, Section 4.1]
and [230, Section 6.1].

14.1 Bias-variance decomposition


Lemma 14.1.1 (Bias-variance decomposition). Let (Ω, F, P) be a probability space, let
(S, S) be a measurable space, let X : Ω → S and Y : Ω → R be random variables with
E[|Y|^2] < ∞, and let r : L^2(P_X; R) → [0, ∞) satisfy for all f ∈ L^2(P_X; R) that
\[ r(f) = \mathbb{E}\bigl[|f(X) - Y|^2\bigr]. \tag{14.1} \]
Then
(i) it holds for all f ∈ L^2(P_X; R) that
\[ r(f) = \mathbb{E}\bigl[|f(X) - \mathbb{E}[Y|X]|^2\bigr] + \mathbb{E}\bigl[|Y - \mathbb{E}[Y|X]|^2\bigr], \tag{14.2} \]
(ii) it holds for all f, g ∈ L^2(P_X; R) that
\[ r(f) - r(g) = \mathbb{E}\bigl[|f(X) - \mathbb{E}[Y|X]|^2\bigr] - \mathbb{E}\bigl[|g(X) - \mathbb{E}[Y|X]|^2\bigr], \tag{14.3} \]
and

(iii) it holds for all f, g ∈ L^2(P_X; R) that
\[ \mathbb{E}\bigl[|f(X) - \mathbb{E}[Y|X]|^2\bigr] = \mathbb{E}\bigl[|g(X) - \mathbb{E}[Y|X]|^2\bigr] + \bigl(r(f) - r(g)\bigr). \tag{14.4} \]

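Proof of Lemma 14.1.1. First, note that (14.1) shows that for all f ∈ L^2(P_X; R) it holds
that
\[
\begin{aligned}
r(f) &= \mathbb{E}\bigl[|f(X) - Y|^2\bigr] = \mathbb{E}\bigl[|(f(X) - \mathbb{E}[Y|X]) + (\mathbb{E}[Y|X] - Y)|^2\bigr] \\
&= \mathbb{E}\bigl[|f(X) - \mathbb{E}[Y|X]|^2\bigr] + 2\,\mathbb{E}\bigl[\bigl(f(X) - \mathbb{E}[Y|X]\bigr)\bigl(\mathbb{E}[Y|X] - Y\bigr)\bigr] \\
&\quad + \mathbb{E}\bigl[|\mathbb{E}[Y|X] - Y|^2\bigr].
\end{aligned}
\tag{14.5}
\]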

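Furthermore, observe that the tower rule demonstrates that for all f ∈ L^2(P_X; R) it holds
that
\[
\begin{aligned}
&\mathbb{E}\bigl[\bigl(f(X) - \mathbb{E}[Y|X]\bigr)\bigl(\mathbb{E}[Y|X] - Y\bigr)\bigr] \\
&= \mathbb{E}\Bigl[\mathbb{E}\bigl[\bigl(f(X) - \mathbb{E}[Y|X]\bigr)\bigl(\mathbb{E}[Y|X] - Y\bigr) \,\big|\, X\bigr]\Bigr] \\
&= \mathbb{E}\Bigl[\bigl(f(X) - \mathbb{E}[Y|X]\bigr)\,\mathbb{E}\bigl[\bigl(\mathbb{E}[Y|X] - Y\bigr) \,\big|\, X\bigr]\Bigr] \\
&= \mathbb{E}\bigl[\bigl(f(X) - \mathbb{E}[Y|X]\bigr)\bigl(\mathbb{E}[Y|X] - \mathbb{E}[Y|X]\bigr)\bigr] = 0.
\end{aligned}
\tag{14.6}
\]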
Combining this with (14.5) establishes that for all f ∈ L^2(P_X; R) it holds that
\[ r(f) = \mathbb{E}\bigl[|f(X) - \mathbb{E}[Y|X]|^2\bigr] + \mathbb{E}\bigl[|\mathbb{E}[Y|X] - Y|^2\bigr]. \tag{14.7} \]
This implies that for all f, g ∈ L^2(P_X; R) it holds that
\[ r(f) - r(g) = \mathbb{E}\bigl[|f(X) - \mathbb{E}[Y|X]|^2\bigr] - \mathbb{E}\bigl[|g(X) - \mathbb{E}[Y|X]|^2\bigr]. \tag{14.8} \]
Therefore, we obtain that for all f, g ∈ L^2(P_X; R) it holds that
\[ \mathbb{E}\bigl[|f(X) - \mathbb{E}[Y|X]|^2\bigr] = \mathbb{E}\bigl[|g(X) - \mathbb{E}[Y|X]|^2\bigr] + r(f) - r(g). \tag{14.9} \]

Combining this with (14.7) and (14.8) proves items (i), (ii), and (iii). The proof of
Lemma 14.1.1 is thus complete.
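
The decomposition in item (i) of Lemma 14.1.1 can be checked by hand on a finite probability space. The following Python sketch does this exactly, without sampling, for a toy example; the distribution of (X, Y), the predictor `f`, and all names are assumptions made purely for illustration.

```python
from itertools import product

# Toy model (illustrative assumption): X is uniform on {0, 1, 2} and
# Y = X^2 + noise, where the noise is +1 or -1 with equal probability
# and independent of X. Then E[Y | X] = X^2.
xs = [0.0, 1.0, 2.0]
noise = [-1.0, 1.0]
outcomes = [(x, x ** 2 + e) for x, e in product(xs, noise)]  # equally likely

def expect(h):
    """Exact expectation of h(X, Y) over the finite outcome list."""
    return sum(h(x, y) for x, y in outcomes) / len(outcomes)

def cond_mean(x):
    return x ** 2          # E[Y | X = x] in this toy model

def f(x):
    return 2.0 * x         # an arbitrary predictor

risk = expect(lambda x, y: (f(x) - y) ** 2)                   # r(f)
approx_part = expect(lambda x, y: (f(x) - cond_mean(x)) ** 2)  # first term in (14.2)
noise_part = expect(lambda x, y: (y - cond_mean(x)) ** 2)      # second term in (14.2)
print(risk, approx_part, noise_part)
```

Here the risk splits into a part that depends on the predictor and a noise part that no predictor can remove, which is the content of (14.2).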

14.1.1 Risk minimization for measurable functions


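Proposition 14.1.2. Let (Ω, F, P) be a probability space, let (S, S) be a measurable space, let
X : Ω → S and Y : Ω → R be random variables, assume E[|Y|^2] < ∞, let ℰ : L^2(P_X; R) →
[0, ∞) satisfy for all f ∈ L^2(P_X; R) that
\[ \mathcal{E}(f) = \mathbb{E}\bigl[|f(X) - Y|^2\bigr]. \tag{14.10} \]
Then
\[
\begin{aligned}
&\bigl\{ f \in L^2(\mathbb{P}_X; \mathbb{R}) \colon \mathcal{E}(f) = \inf\nolimits_{g \in L^2(\mathbb{P}_X; \mathbb{R})} \mathcal{E}(g) \bigr\} \\
&= \bigl\{ f \in L^2(\mathbb{P}_X; \mathbb{R}) \colon \mathcal{E}(f) = \mathbb{E}\bigl[|\mathbb{E}[Y|X] - Y|^2\bigr] \bigr\} \\
&= \{ f \in L^2(\mathbb{P}_X; \mathbb{R}) \colon f(X) = \mathbb{E}[Y|X] \ \mathbb{P}\text{-a.s.} \}.
\end{aligned}
\tag{14.11}
\]
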

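Proof of Proposition 14.1.2. Note that Lemma 14.1.1 ensures that for all g ∈ L^2(P_X; R) it
holds that
\[ \mathcal{E}(g) = \mathbb{E}\bigl[|g(X) - \mathbb{E}[Y|X]|^2\bigr] + \mathbb{E}\bigl[|\mathbb{E}[Y|X] - Y|^2\bigr]. \tag{14.12} \]
Hence, we obtain that for all g ∈ L^2(P_X; R) it holds that
\[ \mathcal{E}(g) \ge \mathbb{E}\bigl[|\mathbb{E}[Y|X] - Y|^2\bigr]. \tag{14.13} \]
Furthermore, observe that (14.12) shows that
\[
\begin{aligned}
&\bigl\{ f \in L^2(\mathbb{P}_X; \mathbb{R}) \colon \mathcal{E}(f) = \mathbb{E}\bigl[|\mathbb{E}[Y|X] - Y|^2\bigr] \bigr\} \\
&= \bigl\{ f \in L^2(\mathbb{P}_X; \mathbb{R}) \colon \mathbb{E}\bigl[|f(X) - \mathbb{E}[Y|X]|^2\bigr] = 0 \bigr\} \\
&= \{ f \in L^2(\mathbb{P}_X; \mathbb{R}) \colon f(X) = \mathbb{E}[Y|X] \ \mathbb{P}\text{-a.s.} \}.
\end{aligned}
\tag{14.14}
\]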

Combining this with (14.13) establishes (14.11). The proof of Proposition 14.1.2 is thus
complete.

Corollary 14.1.3. Let (Ω, F, P) be a probability space, let (S, S) be a measurable space,
let X : Ω → S be a random variable, let M = {(f : S → R) : f is S/B(R)-measurable}, let
φ ∈ M, and let ℰ : M → [0, ∞) satisfy for all f ∈ M that
\[ \mathcal{E}(f) = \mathbb{E}\bigl[|f(X) - \varphi(X)|^2\bigr]. \tag{14.15} \]
Then
\[
\begin{aligned}
\{ f \in M \colon \mathcal{E}(f) = \inf\nolimits_{g \in M} \mathcal{E}(g) \}
&= \{ f \in M \colon \mathcal{E}(f) = 0 \} \\
&= \{ f \in M \colon \mathbb{P}(f(X) = \varphi(X)) = 1 \}.
\end{aligned}
\tag{14.16}
\]

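Proof of Corollary 14.1.3. Note that (14.15) demonstrates that ℰ(φ) = 0. Therefore, we
obtain that
\[ \inf_{g \in M} \mathcal{E}(g) = 0. \tag{14.17} \]
Furthermore, observe that
\[
\begin{aligned}
\{ f \in M \colon \mathcal{E}(f) = 0 \}
&= \bigl\{ f \in M \colon \mathbb{E}\bigl[|f(X) - \varphi(X)|^2\bigr] = 0 \bigr\} \\
&= \bigl\{ f \in M \colon \mathbb{P}\bigl(\{ \omega \in \Omega \colon f(X(\omega)) \neq \varphi(X(\omega)) \}\bigr) = 0 \bigr\} \\
&= \bigl\{ f \in M \colon \mathbb{P}\bigl(X^{-1}(\{ x \in S \colon f(x) \neq \varphi(x) \})\bigr) = 0 \bigr\} \\
&= \{ f \in M \colon \mathbb{P}_X(\{ x \in S \colon f(x) \neq \varphi(x) \}) = 0 \}.
\end{aligned}
\tag{14.18}
\]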

The proof of Corollary 14.1.3 is thus complete.


14.2 Overall error decomposition


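Proposition 14.2.1. Let (Ω, F, P) be a probability space, let M, d ∈ N, D ⊆ R^d, u ∈ R,
v ∈ (u, ∞), for every j ∈ {1, 2, . . . , M} let Xj : Ω → D and Yj : Ω → [u, v] be random
variables, let 𝐑 : R^𝐝 → R satisfy for all θ ∈ R^𝐝 that
\[ \mathbf{R}(\theta) = \mathbb{E}\bigl[|\mathcal{N}^{\theta,l}_{u,v}(X_1) - Y_1|^2\bigr], \tag{14.19} \]
let 𝐝, L ∈ N, l = (l0, l1, . . . , lL) ∈ N^{L+1} satisfy
\[ l_0 = d, \qquad l_L = 1, \qquad\text{and}\qquad \mathbf{d} \ge \sum_{i=1}^{L} l_i (l_{i-1} + 1), \tag{14.20} \]
let ℛ : R^𝐝 × Ω → R satisfy for all θ ∈ R^𝐝 that
\[ \mathcal{R}(\theta) = \frac{1}{M} \Bigl[ \sum_{j=1}^{M} |\mathcal{N}^{\theta,l}_{u,v}(X_j) - Y_j|^2 \Bigr], \tag{14.21} \]
let ℰ : D → [u, v] be B(D)/B([u, v])-measurable, assume P-a.s. that
\[ \mathcal{E}(X_1) = \mathbb{E}[Y_1 | X_1], \tag{14.22} \]
let B ∈ [0, ∞), for every k, n ∈ N0 let Θk,n : Ω → R^𝐝 be a function, let K, N ∈ N,
T ⊆ {0, 1, . . . , N}, let 𝐤 : Ω → (N0)^2 satisfy for all ω ∈ Ω that
\[ \mathbf{k}(\omega) \in \{ (k, n) \in \{1, 2, \dots, K\} \times T \colon \|\Theta_{k,n}(\omega)\|_\infty \le B \} \tag{14.23} \]
\[ \text{and}\qquad \mathcal{R}(\Theta_{\mathbf{k}(\omega)}(\omega)) = \min\nolimits_{(k,n) \in \{1,2,\dots,K\} \times T,\ \|\Theta_{k,n}(\omega)\|_\infty \le B} \mathcal{R}(\Theta_{k,n}(\omega)) \tag{14.24} \]
(cf. Definitions 3.3.4 and 4.4.1). Then it holds for all ϑ ∈ [−B, B]^𝐝 that
\[
\begin{aligned}
&\int_D |\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) \\
&\le \bigl[ \sup\nolimits_{x \in D} |\mathcal{N}^{\vartheta,l}_{u,v}(x) - \mathcal{E}(x)|^2 \bigr]
+ 2 \bigl[ \sup\nolimits_{\theta \in [-B,B]^{\mathbf{d}}} |\mathcal{R}(\theta) - \mathbf{R}(\theta)| \bigr] \\
&\quad + \min\nolimits_{(k,n) \in \{1,2,\dots,K\} \times T,\ \|\Theta_{k,n}\|_\infty \le B} [\mathcal{R}(\Theta_{k,n}) - \mathcal{R}(\vartheta)].
\end{aligned}
\tag{14.25}
\]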

Proof of Proposition 14.2.1. Throughout this proof, let r : L^2(P_{X1}; R) → [0, ∞) satisfy for
all f ∈ L^2(P_{X1}; R) that
\[ r(f) = \mathbb{E}\bigl[|f(X_1) - Y_1|^2\bigr]. \tag{14.26} \]
Observe that the assumption that for all ω ∈ Ω it holds that Y1(ω) ∈ [u, v] and the fact
that for all θ ∈ R^𝐝, x ∈ R^d it holds that 𝒩^{θ,l}_{u,v}(x) ∈ [u, v] imply that for all θ ∈ R^𝐝 it holds
that E[|Y1|^2] ≤ max{u^2, v^2} < ∞ and
\[
\int_D |\mathcal{N}^{\theta,l}_{u,v}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) = \mathbb{E}\bigl[|\mathcal{N}^{\theta,l}_{u,v}(X_1)|^2\bigr] \le \max\{u^2, v^2\} < \infty.
\tag{14.27}
\]


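Item (iii) in Lemma 14.1.1 (applied with (Ω, F, P) ↶ (Ω, F, P), (S, S) ↶ (D, B(D)),
X ↶ X1, Y ↶ (Ω ∋ ω ↦ Y1(ω) ∈ R), r ↶ r, f ↶ 𝒩^{θ,l}_{u,v}|_D, g ↶ 𝒩^{ϑ,l}_{u,v}|_D for θ, ϑ ∈ R^𝐝 in the
notation of item (iii) in Lemma 14.1.1) hence proves that for all θ, ϑ ∈ R^𝐝 it holds that
\[
\begin{aligned}
&\int_D |\mathcal{N}^{\theta,l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) \\
&= \mathbb{E}\bigl[|\mathcal{N}^{\theta,l}_{u,v}(X_1) - \mathcal{E}(X_1)|^2\bigr]
= \mathbb{E}\bigl[|\mathcal{N}^{\theta,l}_{u,v}(X_1) - \mathbb{E}[Y_1|X_1]|^2\bigr] \\
&= \mathbb{E}\bigl[|\mathcal{N}^{\vartheta,l}_{u,v}(X_1) - \mathbb{E}[Y_1|X_1]|^2\bigr] + r(\mathcal{N}^{\theta,l}_{u,v}|_D) - r(\mathcal{N}^{\vartheta,l}_{u,v}|_D).
\end{aligned}
\tag{14.28}
\]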
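Combining this with (14.26) and (14.19) ensures that for all θ, ϑ ∈ R^𝐝 it holds that
\[
\begin{aligned}
&\int_D |\mathcal{N}^{\theta,l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) \\
&= \mathbb{E}\bigl[|\mathcal{N}^{\vartheta,l}_{u,v}(X_1) - \mathcal{E}(X_1)|^2\bigr] + \mathbb{E}\bigl[|\mathcal{N}^{\theta,l}_{u,v}(X_1) - Y_1|^2\bigr] - \mathbb{E}\bigl[|\mathcal{N}^{\vartheta,l}_{u,v}(X_1) - Y_1|^2\bigr] \\
&= \int_D |\mathcal{N}^{\vartheta,l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) + \mathbf{R}(\theta) - \mathbf{R}(\vartheta).
\end{aligned}
\tag{14.29}
\]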

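This shows that for all θ, ϑ ∈ R^𝐝 it holds that
\[
\begin{aligned}
&\int_D |\mathcal{N}^{\theta,l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) \\
&= \int_D |\mathcal{N}^{\vartheta,l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) - [\mathcal{R}(\theta) - \mathbf{R}(\theta)] + \mathcal{R}(\vartheta) - \mathbf{R}(\vartheta) \\
&\quad + \mathcal{R}(\theta) - \mathcal{R}(\vartheta) \\
&\le \int_D |\mathcal{N}^{\vartheta,l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) + 2 \bigl[ \max\nolimits_{\eta \in \{\theta,\vartheta\}} |\mathcal{R}(\eta) - \mathbf{R}(\eta)| \bigr] \\
&\quad + \mathcal{R}(\theta) - \mathcal{R}(\vartheta).
\end{aligned}
\tag{14.30}
\]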
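Furthermore, note that (14.23) establishes that for all ω ∈ Ω it holds that Θ_{𝐤(ω)}(ω) ∈
[−B, B]^𝐝. Combining (14.30) with (14.24) therefore demonstrates that for all ϑ ∈ [−B, B]^𝐝
it holds that
\[
\begin{aligned}
&\int_D |\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) \\
&\le \int_D |\mathcal{N}^{\vartheta,l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) + 2 \bigl[ \sup\nolimits_{\theta \in [-B,B]^{\mathbf{d}}} |\mathcal{R}(\theta) - \mathbf{R}(\theta)| \bigr]
+ \mathcal{R}(\Theta_{\mathbf{k}}) - \mathcal{R}(\vartheta) \\
&= \int_D |\mathcal{N}^{\vartheta,l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) + 2 \bigl[ \sup\nolimits_{\theta \in [-B,B]^{\mathbf{d}}} |\mathcal{R}(\theta) - \mathbf{R}(\theta)| \bigr] \\
&\quad + \min\nolimits_{(k,n) \in \{1,2,\dots,K\} \times T,\ \|\Theta_{k,n}\|_\infty \le B} [\mathcal{R}(\Theta_{k,n}) - \mathcal{R}(\vartheta)] \\
&\le \bigl[ \sup\nolimits_{x \in D} |\mathcal{N}^{\vartheta,l}_{u,v}(x) - \mathcal{E}(x)|^2 \bigr] + 2 \bigl[ \sup\nolimits_{\theta \in [-B,B]^{\mathbf{d}}} |\mathcal{R}(\theta) - \mathbf{R}(\theta)| \bigr] \\
&\quad + \min\nolimits_{(k,n) \in \{1,2,\dots,K\} \times T,\ \|\Theta_{k,n}\|_\infty \le B} [\mathcal{R}(\Theta_{k,n}) - \mathcal{R}(\vartheta)].
\end{aligned}
\tag{14.31}
\]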


The proof of Proposition 14.2.1 is thus complete.
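
The structure of the decomposition (approximation error, plus twice the generalization gap, plus optimization error) can be illustrated on a toy problem in which every quantity is computable exactly. The Python sketch below is such an illustration under strong simplifying assumptions: a finite population stands in for the distribution of (X1, Y1), a clipped one-parameter linear model stands in for the clipped ANN realization, and a finite grid of candidate parameters stands in for both the set [−B, B]^𝐝 and the trained parameters Θk,n. All names and values are hypothetical.

```python
import random

random.seed(0)
u, v = -1.0, 2.0                       # clipping range; Y takes values in [u, v]
xs = [0.0, 0.5, 1.0]                   # X is uniform on xs
noise = [-0.25, 0.25]                  # centred noise, independent of X
population = [(x, x + e) for x in xs for e in noise]   # equally likely pairs

def regression(x):
    return x                           # E[Y | X = x] in this toy model

def model(theta, x):
    return min(max(theta * x, u), v)   # clipped one-parameter model

def true_risk(theta):
    return sum((model(theta, x) - y) ** 2 for x, y in population) / len(population)

sample = [random.choice(population) for _ in range(20)]

def emp_risk(theta):
    return sum((model(theta, x) - y) ** 2 for x, y in sample) / len(sample)

def l2_error(theta):
    # integral of |model - regression|^2 with respect to the law of X
    return sum((model(theta, x) - regression(x)) ** 2 for x in xs) / len(xs)

thetas = [k / 4 for k in range(-8, 9)]          # candidate parameters
theta_hat = min(thetas, key=emp_risk)            # empirical risk minimiser
gen_gap = max(abs(emp_risk(t) - true_risk(t)) for t in thetas)

decomposition_holds = all(
    l2_error(theta_hat)
    <= max((model(ref, x) - regression(x)) ** 2 for x in xs)   # approximation
    + 2 * gen_gap                                              # generalization
    + (emp_risk(theta_hat) - emp_risk(ref))                    # optimization
    + 1e-12                                                    # rounding slack
    for ref in thetas
)
print(theta_hat, l2_error(theta_hat), gen_gap, decomposition_holds)
```

The inequality holds for every reference parameter `ref`, mirroring the fact that (14.25) holds for all ϑ; in an error analysis one then picks the reference parameter with the smallest approximation error.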

Chapter 15

Composed error estimates

In Part II we have established several estimates for the approximation error, in Part III
we have established several estimates for the optimization error, and in Part IV we have
established several estimates for the generalization error. In this chapter we employ the error
decomposition from Chapter 14 as well as parts of Parts II, III, and IV (see Proposition 4.4.12
and Corollaries 11.3.9 and 13.3.3) to establish estimates for the overall error in the training
of ANNs in the specific situation of GD-type optimization methods with many independent
random initializations.
In the literature such overall error analyses can, for instance, be found in [25, 226, 230].
The material in this chapter consists of slightly modified extracts from Jentzen & Welti [230,
Sections 6.2 and 6.3].

15.1 Full strong error analysis for the training of ANNs


Lemma 15.1.1. Let d, 𝐝, L ∈ N, l = (l0, l1, . . . , lL) ∈ N^{L+1}, u ∈ [−∞, ∞), v ∈ (u, ∞], let
D ⊆ R^d, assume
\[ l_0 = d, \qquad l_L = 1, \qquad\text{and}\qquad \mathbf{d} \ge \sum_{i=1}^{L} l_i (l_{i-1} + 1), \tag{15.1} \]
let ℰ : D → R be B(D)/B(R)-measurable, let (Ω, F, P) be a probability space, and let
X : Ω → D, 𝐤 : Ω → (N0)^2, and Θk,n : Ω → R^𝐝, k, n ∈ N0, be random variables. Then
(i) it holds that R^𝐝 × R^d ∋ (θ, x) ↦ 𝒩^{θ,l}_{u,v}(x) ∈ R is (B(R^𝐝) ⊗ B(R^d))/B(R)-measurable,
(ii) it holds for all ω ∈ Ω that R^d ∋ x ↦ 𝒩^{Θ_{𝐤(ω)}(ω),l}_{u,v}(x) ∈ R is B(R^d)/B(R)-measurable,
and
(iii) it holds for all p ∈ [0, ∞) that
\[
\Omega \ni \omega \mapsto \int_D |\mathcal{N}^{\Theta_{\mathbf{k}(\omega)}(\omega),l}_{u,v}(x) - \mathcal{E}(x)|^p \, \mathbb{P}_X(\mathrm{d}x) \in [0, \infty]
\tag{15.2}
\]
is F/B([0, ∞])-measurable
(cf. Definition 4.4.1).
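Proof of Lemma 15.1.1. Throughout this proof let Ξ : Ω → R^𝐝 satisfy for all ω ∈ Ω that
\[ \Xi(\omega) = \Theta_{\mathbf{k}(\omega)}(\omega). \tag{15.3} \]
Observe that the assumption that Θk,n : Ω → R^𝐝, k, n ∈ N0, and 𝐤 : Ω → (N0)^2 are random
variables implies that for all U ∈ B(R^𝐝) it holds that
\[
\begin{aligned}
\Xi^{-1}(U) &= \{ \omega \in \Omega \colon \Xi(\omega) \in U \} = \{ \omega \in \Omega \colon \Theta_{\mathbf{k}(\omega)}(\omega) \in U \} \\
&= \bigl\{ \omega \in \Omega \colon \bigl[ \exists\, k, n \in \mathbb{N}_0 \colon ([\Theta_{k,n}(\omega) \in U] \wedge [\mathbf{k}(\omega) = (k, n)]) \bigr] \bigr\} \\
&= \bigcup_{k=0}^{\infty} \bigcup_{n=0}^{\infty} \bigl( \{ \omega \in \Omega \colon \Theta_{k,n}(\omega) \in U \} \cap \{ \omega \in \Omega \colon \mathbf{k}(\omega) = (k, n) \} \bigr) \\
&= \bigcup_{k=0}^{\infty} \bigcup_{n=0}^{\infty} \bigl( [(\Theta_{k,n})^{-1}(U)] \cap [\mathbf{k}^{-1}(\{(k, n)\})] \bigr) \in \mathcal{F}.
\end{aligned}
\tag{15.4}
\]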

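This proves that
\[ \Omega \ni \omega \mapsto \Theta_{\mathbf{k}(\omega)}(\omega) \in \mathbb{R}^{\mathbf{d}} \tag{15.5} \]
is F/B(R^𝐝)-measurable. Furthermore, note that Corollary 11.3.7 (applied with
a ↶ −∥x∥∞, b ↶ ∥x∥∞, u ↶ u, v ↶ v, d ↶ 𝐝, L ↶ L, l ↶ l for x ∈ R^d in the notation
of Corollary 11.3.7) ensures that for all θ, ϑ ∈ R^𝐝, x ∈ R^d it holds that
\[
\begin{aligned}
|\mathcal{N}^{\theta,l}_{u,v}(x) - \mathcal{N}^{\vartheta,l}_{u,v}(x)|
&\le \sup\nolimits_{y \in [-\|x\|_\infty, \|x\|_\infty]^{l_0}} |\mathcal{N}^{\theta,l}_{u,v}(y) - \mathcal{N}^{\vartheta,l}_{u,v}(y)| \\
&\le L \max\{1, \|x\|_\infty\} (\|l\|_\infty + 1)^L (\max\{1, \|\theta\|_\infty, \|\vartheta\|_\infty\})^{L-1} \|\theta - \vartheta\|_\infty
\end{aligned}
\tag{15.6}
\]
(cf. Definitions 3.3.4 and 4.4.1). This shows for all x ∈ R^d that
\[ \mathbb{R}^{\mathbf{d}} \ni \theta \mapsto \mathcal{N}^{\theta,l}_{u,v}(x) \in \mathbb{R} \tag{15.7} \]
is continuous. Moreover, observe that the fact that for all θ ∈ R^𝐝 it holds that 𝒩^{θ,l}_{u,v} ∈
C(R^d, R) establishes that for all θ ∈ R^𝐝 it holds that 𝒩^{θ,l}_{u,v}(x) is B(R^d)/B(R)-measurable.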

This, (15.7), the fact that (R^𝐝, ∥·∥∞|_{R^𝐝}) is a separable normed R-vector space, and
Lemma 11.2.6 prove item (i). Note that item (i) and (15.5) demonstrate that
\[ \Omega \times \mathbb{R}^d \ni (\omega, x) \mapsto \mathcal{N}^{\Theta_{\mathbf{k}(\omega)}(\omega),l}_{u,v}(x) \in \mathbb{R} \tag{15.8} \]
is (F ⊗ B(R^d))/B(R)-measurable. This implies item (ii). Observe that item (ii) and the
assumption that ℰ : D → R is B(D)/B(R)-measurable ensure that for all p ∈ [0, ∞) it holds
that
\[ \Omega \times D \ni (\omega, x) \mapsto |\mathcal{N}^{\Theta_{\mathbf{k}(\omega)}(\omega),l}_{u,v}(x) - \mathcal{E}(x)|^p \in [0, \infty) \tag{15.9} \]
is (F ⊗ B(D))/B([0, ∞))-measurable. Tonelli’s theorem hence establishes item (iii). The
proof of Lemma 15.1.1 is thus complete.


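Proposition 15.1.2. Let (Ω, F, P) be a probability space, let M, d ∈ N, b ∈ [1, ∞),
D ⊆ [−b, b]^d, u ∈ R, v ∈ (u, ∞), for every j ∈ N let Xj : Ω → D and Yj : Ω → [u, v]
be random variables, assume that (Xj, Yj), j ∈ {1, 2, . . . , M}, are i.i.d., let 𝐝, L ∈ N,
l = (l0, l1, . . . , lL) ∈ N^{L+1} satisfy
\[ l_0 = d, \qquad l_L = 1, \qquad\text{and}\qquad \mathbf{d} \ge \sum_{i=1}^{L} l_i (l_{i-1} + 1), \tag{15.10} \]
let ℛ : R^𝐝 × Ω → [0, ∞) satisfy for all θ ∈ R^𝐝 that
\[ \mathcal{R}(\theta) = \frac{1}{M} \Bigl[ \sum_{j=1}^{M} |\mathcal{N}^{\theta,l}_{u,v}(X_j) - Y_j|^2 \Bigr], \tag{15.11} \]
let ℰ : D → [u, v] be B(D)/B([u, v])-measurable, assume P-a.s. that
\[ \mathcal{E}(X_1) = \mathbb{E}[Y_1 | X_1], \tag{15.12} \]
let K ∈ N, c ∈ [1, ∞), B ∈ [c, ∞), for every k, n ∈ N0 let Θk,n : Ω → R^𝐝 be a random variable,
assume ∪_{k=1}^{∞} Θk,0(Ω) ⊆ [−B, B]^𝐝, assume that Θk,0, k ∈ {1, 2, . . . , K}, are i.i.d., assume
that Θ1,0 is continuously uniformly distributed on [−c, c]^𝐝, let N ∈ N, T ⊆ {0, 1, . . . , N}
satisfy 0 ∈ T, let 𝐤 : Ω → (N0)^2 be a random variable, and assume for all ω ∈ Ω that
\[ \mathbf{k}(\omega) \in \{ (k, n) \in \{1, 2, \dots, K\} \times T \colon \|\Theta_{k,n}(\omega)\|_\infty \le B \} \tag{15.13} \]
\[ \text{and}\qquad \mathcal{R}(\Theta_{\mathbf{k}(\omega)}(\omega)) = \min\nolimits_{(k,n) \in \{1,2,\dots,K\} \times T,\ \|\Theta_{k,n}(\omega)\|_\infty \le B} \mathcal{R}(\Theta_{k,n}(\omega)) \tag{15.14} \]
(cf. Definitions 3.3.4 and 4.4.1). Then it holds for all p ∈ (0, ∞) that
\[
\begin{aligned}
&\Bigl( \mathbb{E}\Bigl[ \Bigl( \int_D |\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) \Bigr)^p \Bigr] \Bigr)^{1/p} \\
&\le \bigl[ \inf\nolimits_{\theta \in [-c,c]^{\mathbf{d}}} \sup\nolimits_{x \in D} |\mathcal{N}^{\theta,l}_{u,v}(x) - \mathcal{E}(x)|^2 \bigr]
+ \frac{4(v-u) b L (\|l\|_\infty + 1)^L c^L \max\{1, p\}}{K^{[L^{-1}(\|l\|_\infty + 1)^{-2}]}} \\
&\quad + \frac{18 \max\{1, (v-u)^2\} L (\|l\|_\infty + 1)^2 \max\{p, \ln(3MBb)\}}{\sqrt{M}}
\end{aligned}
\tag{15.15}
\]
(cf. Lemma 15.1.1).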
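Proof of Proposition 15.1.2. Throughout this proof, let 𝐑 : R^𝐝 → [0, ∞) satisfy for all
θ ∈ R^𝐝 that
\[ \mathbf{R}(\theta) = \mathbb{E}\bigl[|\mathcal{N}^{\theta,l}_{u,v}(X_1) - Y_1|^2\bigr]. \tag{15.16} \]
Note that Proposition 14.2.1 shows that for all ϑ ∈ [−B, B]^𝐝 it holds that
\[
\begin{aligned}
&\int_D |\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) \\
&\le \bigl[ \sup\nolimits_{x \in D} |\mathcal{N}^{\vartheta,l}_{u,v}(x) - \mathcal{E}(x)|^2 \bigr] + 2 \bigl[ \sup\nolimits_{\theta \in [-B,B]^{\mathbf{d}}} |\mathcal{R}(\theta) - \mathbf{R}(\theta)| \bigr] \\
&\quad + \min\nolimits_{(k,n) \in \{1,2,\dots,K\} \times T,\ \|\Theta_{k,n}\|_\infty \le B} |\mathcal{R}(\Theta_{k,n}) - \mathcal{R}(\vartheta)|.
\end{aligned}
\tag{15.17}
\]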


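The assumption that ∪_{k=1}^{∞} Θk,0(Ω) ⊆ [−B, B]^𝐝 and the assumption that 0 ∈ T therefore
prove that
\[
\begin{aligned}
&\int_D |\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) \\
&\le \bigl[ \sup\nolimits_{x \in D} |\mathcal{N}^{\vartheta,l}_{u,v}(x) - \mathcal{E}(x)|^2 \bigr] + 2 \bigl[ \sup\nolimits_{\theta \in [-B,B]^{\mathbf{d}}} |\mathcal{R}(\theta) - \mathbf{R}(\theta)| \bigr] \\
&\quad + \min\nolimits_{k \in \{1,2,\dots,K\},\ \|\Theta_{k,0}\|_\infty \le B} |\mathcal{R}(\Theta_{k,0}) - \mathcal{R}(\vartheta)| \\
&= \bigl[ \sup\nolimits_{x \in D} |\mathcal{N}^{\vartheta,l}_{u,v}(x) - \mathcal{E}(x)|^2 \bigr] + 2 \bigl[ \sup\nolimits_{\theta \in [-B,B]^{\mathbf{d}}} |\mathcal{R}(\theta) - \mathbf{R}(\theta)| \bigr] \\
&\quad + \min\nolimits_{k \in \{1,2,\dots,K\}} |\mathcal{R}(\Theta_{k,0}) - \mathcal{R}(\vartheta)|.
\end{aligned}
\tag{15.18}
\]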

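Minkowski's inequality hence demonstrates that for all p ∈ [1, ∞), ϑ ∈ [−c, c]^𝐝 ⊆ [−B, B]^𝐝
it holds that
\[
\begin{aligned}
&\Bigl( \mathbb{E}\Bigl[ \Bigl( \int_D |\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) \Bigr)^p \Bigr] \Bigr)^{1/p} \\
&\le \bigl( \mathbb{E}\bigl[ \sup\nolimits_{x \in D} |\mathcal{N}^{\vartheta,l}_{u,v}(x) - \mathcal{E}(x)|^{2p} \bigr] \bigr)^{1/p}
+ 2 \bigl( \mathbb{E}\bigl[ \sup\nolimits_{\theta \in [-B,B]^{\mathbf{d}}} |\mathcal{R}(\theta) - \mathbf{R}(\theta)|^p \bigr] \bigr)^{1/p} \\
&\quad + \bigl( \mathbb{E}\bigl[ \min\nolimits_{k \in \{1,2,\dots,K\}} |\mathcal{R}(\Theta_{k,0}) - \mathcal{R}(\vartheta)|^p \bigr] \bigr)^{1/p} \\
&\le \bigl[ \sup\nolimits_{x \in D} |\mathcal{N}^{\vartheta,l}_{u,v}(x) - \mathcal{E}(x)|^2 \bigr]
+ 2 \bigl( \mathbb{E}\bigl[ \sup\nolimits_{\theta \in [-B,B]^{\mathbf{d}}} |\mathcal{R}(\theta) - \mathbf{R}(\theta)|^p \bigr] \bigr)^{1/p} \\
&\quad + \sup\nolimits_{\theta \in [-c,c]^{\mathbf{d}}} \bigl( \mathbb{E}\bigl[ \min\nolimits_{k \in \{1,2,\dots,K\}} |\mathcal{R}(\Theta_{k,0}) - \mathcal{R}(\theta)|^p \bigr] \bigr)^{1/p}
\end{aligned}
\tag{15.19}
\]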

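(cf. item (i) in Corollary 13.3.3 and item (i) in Corollary 11.3.9). Furthermore, observe that
Corollary 13.3.3 (applied with v ↶ max{u + 1, v}, 𝐑 ↶ 𝐑|_{[−B,B]^𝐝}, ℛ ↶ ℛ|_{[−B,B]^𝐝×Ω} in
the notation of Corollary 13.3.3) implies that for all p ∈ (0, ∞) it holds that
\[
\begin{aligned}
\bigl( \mathbb{E}\bigl[ \sup\nolimits_{\theta \in [-B,B]^{\mathbf{d}}} |\mathcal{R}(\theta) - \mathbf{R}(\theta)|^p \bigr] \bigr)^{1/p}
&\le \frac{9 (\max\{u + 1, v\} - u)^2 L (\|l\|_\infty + 1)^2 \max\{p, \ln(3MBb)\}}{\sqrt{M}} \\
&= \frac{9 \max\{1, (v-u)^2\} L (\|l\|_\infty + 1)^2 \max\{p, \ln(3MBb)\}}{\sqrt{M}}.
\end{aligned}
\tag{15.20}
\]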

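Moreover, note that Corollary 11.3.9 (applied with 𝔡 ↶ Σ_{i=1}^{L} li(li−1 + 1), B ↶ c,
(Θk)k∈{1,2,...,K} ↶ (Ω ∋ ω ↦ 𝟙_{{Θk,0∈[−c,c]^𝐝}}(ω)Θk,0(ω) ∈ [−c, c]^𝐝)k∈{1,2,...,K}, ℛ ↶ ℛ|_{[−c,c]^𝐝×Ω}
in the notation of Corollary 11.3.9) ensures that for all p ∈ (0, ∞) it holds that
\[
\begin{aligned}
&\sup\nolimits_{\theta \in [-c,c]^{\mathbf{d}}} \bigl( \mathbb{E}\bigl[ \min\nolimits_{k \in \{1,2,\dots,K\}} |\mathcal{R}(\Theta_{k,0}) - \mathcal{R}(\theta)|^p \bigr] \bigr)^{1/p} \\
&= \sup\nolimits_{\theta \in [-c,c]^{\mathbf{d}}} \bigl( \mathbb{E}\bigl[ \min\nolimits_{k \in \{1,2,\dots,K\}} |\mathcal{R}(\mathbb{1}_{\{\Theta_{k,0} \in [-c,c]^{\mathbf{d}}\}} \Theta_{k,0}) - \mathcal{R}(\theta)|^p \bigr] \bigr)^{1/p} \\
&\le \frac{4(v-u) b L (\|l\|_\infty + 1)^L c^L \max\{1, p\}}{K^{[L^{-1}(\|l\|_\infty + 1)^{-2}]}}.
\end{aligned}
\tag{15.21}
\]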

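Combining this and (15.20) with (15.19) establishes that for all p ∈ [1, ∞) it holds that
\[
\begin{aligned}
&\Bigl( \mathbb{E}\Bigl[ \Bigl( \int_D |\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) \Bigr)^p \Bigr] \Bigr)^{1/p} \\
&\le \bigl[ \inf\nolimits_{\theta \in [-c,c]^{\mathbf{d}}} \sup\nolimits_{x \in D} |\mathcal{N}^{\theta,l}_{u,v}(x) - \mathcal{E}(x)|^2 \bigr]
+ \frac{4(v-u) b L (\|l\|_\infty + 1)^L c^L \max\{1, p\}}{K^{[L^{-1}(\|l\|_\infty + 1)^{-2}]}} \\
&\quad + \frac{18 \max\{1, (v-u)^2\} L (\|l\|_\infty + 1)^2 \max\{p, \ln(3MBb)\}}{\sqrt{M}}.
\end{aligned}
\tag{15.22}
\]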
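In addition, observe that Jensen's inequality shows that for all p ∈ (0, ∞) it holds that
\[
\begin{aligned}
&\Bigl( \mathbb{E}\Bigl[ \Bigl( \int_D |\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) \Bigr)^p \Bigr] \Bigr)^{1/p} \\
&\le \Bigl( \mathbb{E}\Bigl[ \Bigl( \int_D |\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) \Bigr)^{\max\{1,p\}} \Bigr] \Bigr)^{\frac{1}{\max\{1,p\}}}.
\end{aligned}
\tag{15.23}
\]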

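This, (15.22), and the fact that ln(3MBb) ≥ 1 prove that for all p ∈ (0, ∞) it holds that
\[
\begin{aligned}
&\Bigl( \mathbb{E}\Bigl[ \Bigl( \int_D |\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) \Bigr)^p \Bigr] \Bigr)^{1/p} \\
&\le \bigl[ \inf\nolimits_{\theta \in [-c,c]^{\mathbf{d}}} \sup\nolimits_{x \in D} |\mathcal{N}^{\theta,l}_{u,v}(x) - \mathcal{E}(x)|^2 \bigr]
+ \frac{4(v-u) b L (\|l\|_\infty + 1)^L c^L \max\{1, p\}}{K^{[L^{-1}(\|l\|_\infty + 1)^{-2}]}} \\
&\quad + \frac{18 \max\{1, (v-u)^2\} L (\|l\|_\infty + 1)^2 \max\{p, \ln(3MBb)\}}{\sqrt{M}}.
\end{aligned}
\tag{15.24}
\]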
The proof of Proposition 15.1.2 is thus complete.
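Lemma 15.1.3. Let a, x, p ∈ (0, ∞). Then
\[ a x^p \le \exp\Bigl( \frac{a^{1/p} p x}{e} \Bigr). \]
Proof of Lemma 15.1.3. Note that the fact that for all y ∈ R it holds that y + 1 ≤ e^y
demonstrates that
\[
a x^p = (a^{1/p} x)^p = \Bigl[ e \Bigl( \tfrac{a^{1/p} x}{e} - 1 + 1 \Bigr) \Bigr]^p
\le \Bigl[ e \exp\Bigl( \tfrac{a^{1/p} x}{e} - 1 \Bigr) \Bigr]^p
= \exp\Bigl( \tfrac{a^{1/p} p x}{e} \Bigr).
\tag{15.25}
\]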


The proof of Lemma 15.1.3 is thus complete.
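
As a quick sanity check of Lemma 15.1.3, the following Python sketch compares the logarithms of both sides of the inequality on a small grid of values. Comparing logarithms avoids overflow in the exponential; the grid and the function names are illustrative assumptions.

```python
import math
from itertools import product

def log_lhs(a, x, p):
    """ln(a * x**p)."""
    return math.log(a) + p * math.log(x)

def log_rhs(a, x, p):
    """ln(exp(a**(1/p) * p * x / e)) = a**(1/p) * p * x / e."""
    return a ** (1.0 / p) * p * x / math.e

grid = [0.1, 0.5, 1.0, 2.0, 5.0]
violations = [
    (a, x, p)
    for a, x, p in product(grid, repeat=3)
    if log_lhs(a, x, p) > log_rhs(a, x, p) + 1e-12
]
# Equality case: a = 1, x = e, p = 1 gives a*x**p = e = exp(1).
equality_gap = log_rhs(1.0, math.e, 1.0) - log_lhs(1.0, math.e, 1.0)
print(violations, equality_gap)
```

The equality case a = 1, x = e, p = 1 shows that the constant 1/e in the exponent cannot be improved in general.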


Lemma 15.1.4. Let M, c ∈ [1, ∞), B ∈ [c, ∞). Then
\[ \ln(3MBc) \le \tfrac{23B}{18} \ln(eM). \]

Proof of Lemma 15.1.4. Observe that Lemma 15.1.3 and the fact that 2√3/e ≤ 23/18 imply
that
\[ 3B^2 \le \exp\Bigl( \tfrac{2\sqrt{3} B}{e} \Bigr) \le \exp\bigl( \tfrac{23B}{18} \bigr). \tag{15.26} \]
The fact that B ≥ c ≥ 1 and M ≥ 1 therefore ensures that
\[ \ln(3MBc) \le \ln(3B^2 M) \le \ln\bigl( [eM]^{23B/18} \bigr) = \tfrac{23B}{18} \ln(eM). \tag{15.27} \]
The proof of Lemma 15.1.4 is thus complete.
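
The constant 23/18 in Lemma 15.1.4 is fairly tight. The following Python sketch checks the numerical fact 2√3/e ≤ 23/18 used in (15.26) and evaluates the slack in the inequality on a small grid of admissible values; the grid and all names are illustrative assumptions.

```python
import math
from itertools import product

def slack(M, B, c):
    """Right-hand side minus left-hand side in Lemma 15.1.4."""
    return (23.0 * B / 18.0) * math.log(math.e * M) - math.log(3.0 * M * B * c)

# Numerical fact used in (15.26).
constant_ok = 2.0 * math.sqrt(3.0) / math.e <= 23.0 / 18.0

values = [1.0, 1.5, 2.0, 10.0, 1000.0]
worst_slack = min(
    slack(M, B, c)
    for M, c, B in product(values, repeat=3)
    if B >= c                       # the lemma requires B >= c >= 1
)
print(constant_ok, worst_slack)
```

On this grid the smallest slack occurs for M = 1 and B = c = 1.5, where it is of the order of 10^(-2), so the bound leaves little room there.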


Theorem 15.1.5. Let (Ω, F, P) be a probability space, let M, d ∈ N, a, u ∈ R, b ∈ (a, ∞),
v ∈ (u, ∞), for every j ∈ N let Xj : Ω → [a, b]^d and Yj : Ω → [u, v] be random variables,
assume that (Xj, Yj), j ∈ {1, 2, . . . , M}, are i.i.d., let A ∈ (0, ∞), L ∈ N satisfy L ≥
A𝟙_{(6^d,∞)}(A)/(2d) + 1, let l = (l0, l1, . . . , lL) ∈ N^{L+1} satisfy for all i ∈ {2, 3, 4, . . .} ∩ [0, L) that
\[
l_0 = d, \qquad l_1 \ge A \mathbb{1}_{(6^d,\infty)}(A), \qquad l_i \ge \mathbb{1}_{(6^d,\infty)}(A) \max\{A/d - 2i + 3, 2\}, \qquad\text{and}\qquad l_L = 1,
\tag{15.28}
\]
let 𝐝 ∈ N satisfy 𝐝 ≥ Σ_{i=1}^{L} li(li−1 + 1), let ℛ : R^𝐝 × Ω → [0, ∞) satisfy for all θ ∈ R^𝐝 that
\[ \mathcal{R}(\theta) = \frac{1}{M} \Bigl[ \sum_{j=1}^{M} |\mathcal{N}^{\theta,l}_{u,v}(X_j) - Y_j|^2 \Bigr], \tag{15.29} \]
let ℰ : [a, b]^d → [u, v] satisfy P-a.s. that
\[ \mathcal{E}(X_1) = \mathbb{E}[Y_1 | X_1], \tag{15.30} \]
let 𝔏 ∈ R satisfy for all x, y ∈ [a, b]^d that |ℰ(x) − ℰ(y)| ≤ 𝔏∥x − y∥1, let K ∈ N, c ∈
[max{1, 𝔏, |a|, |b|, 2|u|, 2|v|}, ∞), B ∈ [c, ∞), for every k, n ∈ N0 let Θk,n : Ω → R^𝐝 be a
random variable, assume ∪_{k=1}^{∞} Θk,0(Ω) ⊆ [−B, B]^𝐝, assume that Θk,0, k ∈ {1, 2, . . . , K},
are i.i.d., assume that Θ1,0 is continuously uniformly distributed on [−c, c]^𝐝, let N ∈ N,
T ⊆ {0, 1, . . . , N} satisfy 0 ∈ T, let 𝐤 : Ω → (N0)^2 be a random variable, and assume for
all ω ∈ Ω that
\[ \mathbf{k}(\omega) \in \{ (k, n) \in \{1, 2, \dots, K\} \times T \colon \|\Theta_{k,n}(\omega)\|_\infty \le B \} \tag{15.31} \]
\[ \text{and}\qquad \mathcal{R}(\Theta_{\mathbf{k}(\omega)}(\omega)) = \min\nolimits_{(k,n) \in \{1,2,\dots,K\} \times T,\ \|\Theta_{k,n}(\omega)\|_\infty \le B} \mathcal{R}(\Theta_{k,n}(\omega)) \tag{15.32} \]
(cf. Definitions 3.3.4 and 4.4.1). Then it holds for all p ∈ (0, ∞) that
\[
\begin{aligned}
&\Bigl( \mathbb{E}\Bigl[ \Bigl( \int_{[a,b]^d} |\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) \Bigr)^p \Bigr] \Bigr)^{1/p} \\
&\le \frac{36 d^2 c^4}{A^{2/d}} + \frac{4 L (\|l\|_\infty + 1)^L c^{L+2} \max\{1, p\}}{K^{[L^{-1}(\|l\|_\infty + 1)^{-2}]}}
+ \frac{23 B^3 L (\|l\|_\infty + 1)^2 \max\{p, \ln(eM)\}}{\sqrt{M}}
\end{aligned}
\tag{15.33}
\]

(cf. Lemma 15.1.1).
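To get a feeling for the sizes involved, the three summands on the right-hand side of (15.33) can be evaluated numerically. The sketch below is only an illustration: the function name `theorem_15_1_5_bound` and the argument names `depth` (for the number of layers) and `width` (for the maximum norm of the layer vector l) are assumptions, and the hypotheses of the theorem are not checked.

```python
import math


def theorem_15_1_5_bound(d, A, depth, width, c, B, K, M, p=1.0):
    """Return the three summands on the right-hand side of (15.33).

    depth stands for the number of layers and width for the maximum
    norm of the layer vector l (hypothetical argument names).
    """
    # First summand: approximation error, controlled by A
    approximation = 36 * d**2 * c**4 / A ** (2 / d)
    # Second summand: optimization error, controlled by the number K
    # of random initializations
    optimization = (
        4 * depth * (width + 1) ** depth * c ** (depth + 2) * max(1.0, p)
        / K ** (1 / (depth * (width + 1) ** 2))
    )
    # Third summand: generalization error, controlled by the sample size M
    generalization = (
        23 * B**3 * depth * (width + 1) ** 2 * max(p, math.log(math.e * M))
        / math.sqrt(M)
    )
    return approximation, optimization, generalization


small = theorem_15_1_5_bound(d=1, A=4, depth=2, width=3, c=1.0, B=1.0, K=1, M=1)
large = theorem_15_1_5_bound(d=1, A=4, depth=2, width=3, c=1.0, B=1.0, K=10**6, M=10**8)
print(small, large)
```

Comparing `small` and `large` shows that the third summand decays like $M^{-1/2}$ up to a logarithm, while the second summand decays only with the very small exponent $[\mathbf{L}(\|l\|_\infty + 1)^2]^{-1}$ in K.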

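Proof of Theorem 15.1.5. Note that the assumption that for all $x, y \in [a,b]^d$ it holds that $|\mathcal{E}(x) - \mathcal{E}(y)| \le L \|x - y\|_1$ establishes that $\mathcal{E} \colon [a,b]^d \to [u,v]$ is $\mathcal{B}([a,b]^d)/\mathcal{B}([u,v])$-measurable. Proposition 15.1.2 (applied with $b \curvearrowleft \max\{1, |a|, |b|\}$, $D \curvearrowleft [a,b]^d$ in the notation of Proposition 15.1.2) hence shows that for all $p \in (0, \infty)$ it holds that
\[
\begin{aligned}
&\Bigl( \mathbb{E}\Bigl[ \Bigl( \int_{[a,b]^d} |\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) \Bigr)^{p} \Bigr] \Bigr)^{1/p} \\
&\le \Bigl[ \inf_{\theta \in [-c,c]^{\mathbf{d}}} \sup_{x \in [a,b]^d} |\mathcal{N}^{\theta,l}_{u,v}(x) - \mathcal{E}(x)|^2 \Bigr] \\
&\quad + \frac{4 (v - u) \max\{1, |a|, |b|\} \mathbf{L} (\|l\|_\infty + 1)^{\mathbf{L}} c^{\mathbf{L}} \max\{1, p\}}{K^{[\mathbf{L}^{-1} (\|l\|_\infty + 1)^{-2}]}} \\
&\quad + \frac{18 \max\{1, (v - u)^2\} \mathbf{L} (\|l\|_\infty + 1)^2 \max\{p, \ln(3MB \max\{1, |a|, |b|\})\}}{\sqrt{M}}.
\end{aligned} \tag{15.34}
\]
The fact that $\max\{1, |a|, |b|\} \le c$ therefore proves that for all $p \in (0, \infty)$ it holds that
\[
\begin{aligned}
&\Bigl( \mathbb{E}\Bigl[ \Bigl( \int_{[a,b]^d} |\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) \Bigr)^{p} \Bigr] \Bigr)^{1/p} \\
&\le \Bigl[ \inf_{\theta \in [-c,c]^{\mathbf{d}}} \sup_{x \in [a,b]^d} |\mathcal{N}^{\theta,l}_{u,v}(x) - \mathcal{E}(x)|^2 \Bigr] \\
&\quad + \frac{4 (v - u) \mathbf{L} (\|l\|_\infty + 1)^{\mathbf{L}} c^{\mathbf{L}+1} \max\{1, p\}}{K^{[\mathbf{L}^{-1} (\|l\|_\infty + 1)^{-2}]}} \\
&\quad + \frac{18 \max\{1, (v - u)^2\} \mathbf{L} (\|l\|_\infty + 1)^2 \max\{p, \ln(3MBc)\}}{\sqrt{M}}.
\end{aligned} \tag{15.35}
\]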
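Furthermore, observe that Proposition 4.4.12 (applied with $f \curvearrowleft \mathcal{E}$ in the notation of Proposition 4.4.12) demonstrates that there exists $\vartheta \in \mathbb{R}^{\mathbf{d}}$ such that $\|\vartheta\|_\infty \le \max\{1, L, |a|, |b|, 2[\sup_{x \in [a,b]^d} |\mathcal{E}(x)|]\}$ and
\[
\sup_{x \in [a,b]^d} |\mathcal{N}^{\vartheta,l}_{u,v}(x) - \mathcal{E}(x)| \le \frac{3 d L (b - a)}{A^{1/d}}. \tag{15.36}
\]
The fact that for all $x \in [a,b]^d$ it holds that $\mathcal{E}(x) \in [u,v]$ hence implies that
\[
\|\vartheta\|_\infty \le \max\{1, L, |a|, |b|, 2|u|, 2|v|\} \le c. \tag{15.37}
\]
This and (15.36) ensure that
\[
\begin{aligned}
\inf_{\theta \in [-c,c]^{\mathbf{d}}} \sup_{x \in [a,b]^d} |\mathcal{N}^{\theta,l}_{u,v}(x) - \mathcal{E}(x)|^2
&\le \sup_{x \in [a,b]^d} |\mathcal{N}^{\vartheta,l}_{u,v}(x) - \mathcal{E}(x)|^2 \\
&\le \biggl[ \frac{3 d L (b - a)}{A^{1/d}} \biggr]^2 = \frac{9 d^2 L^2 (b - a)^2}{A^{2/d}}.
\end{aligned} \tag{15.38}
\]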
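Combining this with (15.35) establishes that for all $p \in (0, \infty)$ it holds that
\[
\begin{aligned}
&\Bigl( \mathbb{E}\Bigl[ \Bigl( \int_{[a,b]^d} |\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) \Bigr)^{p} \Bigr] \Bigr)^{1/p} \\
&\le \frac{9 d^2 L^2 (b - a)^2}{A^{2/d}} + \frac{4 (v - u) \mathbf{L} (\|l\|_\infty + 1)^{\mathbf{L}} c^{\mathbf{L}+1} \max\{1, p\}}{K^{[\mathbf{L}^{-1} (\|l\|_\infty + 1)^{-2}]}} \\
&\quad + \frac{18 \max\{1, (v - u)^2\} \mathbf{L} (\|l\|_\infty + 1)^2 \max\{p, \ln(3MBc)\}}{\sqrt{M}}.
\end{aligned} \tag{15.39}
\]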

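Moreover, note that the fact that $\max\{1, L, |a|, |b|\} \le c$ and $(b - a)^2 \le (|a| + |b|)^2 \le 2(a^2 + b^2)$ shows that
\[
9 L^2 (b - a)^2 \le 18 c^2 (a^2 + b^2) \le 18 c^2 (c^2 + c^2) = 36 c^4. \tag{15.40}
\]
In addition, observe that the fact that $B \ge c \ge 1$, the fact that $M \ge 1$, and Lemma 15.1.4 prove that $\ln(3MBc) \le \tfrac{23B}{18} \ln(eM)$. This, (15.40), the fact that $(v - u) \le 2 \max\{|u|, |v|\} = \max\{2|u|, 2|v|\} \le c \le B$, and the fact that $B \ge 1$ demonstrate that for all $p \in (0, \infty)$ it holds that
\[
\begin{aligned}
&\frac{9 d^2 L^2 (b - a)^2}{A^{2/d}} + \frac{4 (v - u) \mathbf{L} (\|l\|_\infty + 1)^{\mathbf{L}} c^{\mathbf{L}+1} \max\{1, p\}}{K^{[\mathbf{L}^{-1} (\|l\|_\infty + 1)^{-2}]}} \\
&\quad + \frac{18 \max\{1, (v - u)^2\} \mathbf{L} (\|l\|_\infty + 1)^2 \max\{p, \ln(3MBc)\}}{\sqrt{M}} \\
&\le \frac{36 d^2 c^4}{A^{2/d}} + \frac{4 \mathbf{L} (\|l\|_\infty + 1)^{\mathbf{L}} c^{\mathbf{L}+2} \max\{1, p\}}{K^{[\mathbf{L}^{-1} (\|l\|_\infty + 1)^{-2}]}} + \frac{23 B^3 \mathbf{L} (\|l\|_\infty + 1)^2 \max\{p, \ln(eM)\}}{\sqrt{M}}.
\end{aligned} \tag{15.41}
\]
Combining this with (15.39) implies (15.33). The proof of Theorem 15.1.5 is thus complete.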

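Corollary 15.1.6. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $M, d \in \mathbb{N}$, $a, u \in \mathbb{R}$, $b \in (a, \infty)$, $v \in (u, \infty)$, for every $j \in \mathbb{N}$ let $X_j \colon \Omega \to [a,b]^d$ and $Y_j \colon \Omega \to [u,v]$ be random variables, assume that $(X_j, Y_j)$, $j \in \{1, 2, \dots, M\}$, are i.i.d., let $\mathbf{d}, \mathbf{L} \in \mathbb{N}$, $l = (l_0, l_1, \dots, l_{\mathbf{L}}) \in \mathbb{N}^{\mathbf{L}+1}$, assume
\[
l_0 = d, \quad l_{\mathbf{L}} = 1, \quad\text{and}\quad \mathbf{d} \ge \sum_{i=1}^{\mathbf{L}} l_i (l_{i-1} + 1), \tag{15.42}
\]
let $\mathbf{R} \colon \mathbb{R}^{\mathbf{d}} \times \Omega \to [0, \infty)$ satisfy for all $\theta \in \mathbb{R}^{\mathbf{d}}$ that
\[
\mathbf{R}(\theta) = \frac{1}{M} \biggl[ \sum_{j=1}^{M} |\mathcal{N}^{\theta,l}_{u,v}(X_j) - Y_j|^2 \biggr], \tag{15.43}
\]
let $\mathcal{E} \colon [a,b]^d \to [u,v]$ satisfy $\mathbb{P}$-a.s. that
\[
\mathcal{E}(X_1) = \mathbb{E}[Y_1 | X_1], \tag{15.44}
\]
let $L \in \mathbb{R}$ satisfy for all $x, y \in [a,b]^d$ that $|\mathcal{E}(x) - \mathcal{E}(y)| \le L \|x - y\|_1$, let $K \in \mathbb{N}$, $c \in [\max\{1, L, |a|, |b|, 2|u|, 2|v|\}, \infty)$, $B \in [c, \infty)$, for every $k, n \in \mathbb{N}_0$ let $\Theta_{k,n} \colon \Omega \to \mathbb{R}^{\mathbf{d}}$ be a random variable, assume $\bigcup_{k=1}^{\infty} \Theta_{k,0}(\Omega) \subseteq [-B, B]^{\mathbf{d}}$, assume that $\Theta_{k,0}$, $k \in \{1, 2, \dots, K\}$, are i.i.d., assume that $\Theta_{1,0}$ is continuously uniformly distributed on $[-c, c]^{\mathbf{d}}$, let $N \in \mathbb{N}$, $T \subseteq \{0, 1, \dots, N\}$ satisfy $0 \in T$, let $\mathbf{k} \colon \Omega \to (\mathbb{N}_0)^2$ be a random variable, and assume for all $\omega \in \Omega$ that
\[
\mathbf{k}(\omega) \in \{(k, n) \in \{1, 2, \dots, K\} \times T \colon \|\Theta_{k,n}(\omega)\|_\infty \le B\} \tag{15.45}
\]
\[
\text{and}\quad \mathbf{R}(\Theta_{\mathbf{k}(\omega)}(\omega)) = \min_{(k,n) \in \{1,2,\dots,K\} \times T,\, \|\Theta_{k,n}(\omega)\|_\infty \le B} \mathbf{R}(\Theta_{k,n}(\omega)) \tag{15.46}
\]
(cf. Definitions 3.3.4 and 4.4.1). Then it holds for all $p \in (0, \infty)$ that
\[
\begin{aligned}
&\Bigl( \mathbb{E}\Bigl[ \Bigl( \int_{[a,b]^d} |\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) \Bigr)^{p/2} \Bigr] \Bigr)^{1/p} \\
&\le \frac{6 d c^2}{[\min(\{\mathbf{L}\} \cup \{l_i \colon i \in \mathbb{N} \cap [0, \mathbf{L})\})]^{1/d}} + \frac{2 \mathbf{L} (\|l\|_\infty + 1)^{\mathbf{L}} c^{\mathbf{L}+1} \max\{1, p\}}{K^{[(2\mathbf{L})^{-1} (\|l\|_\infty + 1)^{-2}]}} \\
&\quad + \frac{5 B^2 \mathbf{L} (\|l\|_\infty + 1) \max\{p, \ln(eM)\}}{M^{1/4}}
\end{aligned} \tag{15.47}
\]
(cf. Lemma 15.1.1).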

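Proof of Corollary 15.1.6. Throughout this proof, let
\[
A = \min(\{\mathbf{L}\} \cup \{l_i \colon i \in \mathbb{N} \cap [0, \mathbf{L})\}) \in (0, \infty). \tag{15.48}
\]
Note that (15.48) ensures that
\[
\begin{aligned}
\mathbf{L} \ge A = A - 1 + 1 &\ge (A - 1) \mathbb{1}_{[2,\infty)}(A) + 1 \\
&\ge \bigl(A - \tfrac{A}{2}\bigr) \mathbb{1}_{[2,\infty)}(A) + 1 = \frac{A \mathbb{1}_{[2,\infty)}(A)}{2} + 1 \ge \frac{A \mathbb{1}_{(6^d,\infty)}(A)}{2d} + 1.
\end{aligned} \tag{15.49}
\]
Furthermore, observe that the assumption that $l_{\mathbf{L}} = 1$ and (15.48) establish that
\[
l_1 = l_1 \mathbb{1}_{\{1\}}(\mathbf{L}) + l_1 \mathbb{1}_{[2,\infty)}(\mathbf{L}) \ge \mathbb{1}_{\{1\}}(\mathbf{L}) + A \mathbb{1}_{[2,\infty)}(\mathbf{L}) = A \ge A \mathbb{1}_{(6^d,\infty)}(A). \tag{15.50}
\]
Moreover, note that (15.48) shows that for all $i \in \{2, 3, 4, \dots\} \cap [0, \mathbf{L})$ it holds that
\[
\begin{aligned}
l_i \ge A \ge A \mathbb{1}_{[2,\infty)}(A) &\ge \mathbb{1}_{[2,\infty)}(A) \max\{A - 1, 2\} = \mathbb{1}_{[2,\infty)}(A) \max\{A - 4 + 3, 2\} \\
&\ge \mathbb{1}_{[2,\infty)}(A) \max\{A - 2i + 3, 2\} \ge \mathbb{1}_{(6^d,\infty)}(A) \max\{A/d - 2i + 3, 2\}.
\end{aligned} \tag{15.51}
\]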

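Combining this, (15.49), and (15.50) with Theorem 15.1.5 (applied with $p \curvearrowleft p/2$ for $p \in (0, \infty)$ in the notation of Theorem 15.1.5) proves that for all $p \in (0, \infty)$ it holds that
\[
\begin{aligned}
&\Bigl( \mathbb{E}\Bigl[ \Bigl( \int_{[a,b]^d} |\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) \Bigr)^{p/2} \Bigr] \Bigr)^{2/p} \\
&\le \frac{36 d^2 c^4}{A^{2/d}} + \frac{4 \mathbf{L} (\|l\|_\infty + 1)^{\mathbf{L}} c^{\mathbf{L}+2} \max\{1, p/2\}}{K^{[\mathbf{L}^{-1} (\|l\|_\infty + 1)^{-2}]}} + \frac{23 B^3 \mathbf{L} (\|l\|_\infty + 1)^2 \max\{p/2, \ln(eM)\}}{\sqrt{M}}.
\end{aligned} \tag{15.52}
\]
This, (15.48), the fact that for all $x, y, z \in [0, \infty)$ it holds that $(x + y + z)^{1/2} \le x^{1/2} + y^{1/2} + z^{1/2}$, and the fact that $\mathbf{L} \ge 1$, $c \ge 1$, $B \ge 1$, and $\ln(eM) \ge 1$ demonstrate that for all $p \in (0, \infty)$ it holds that
\[
\begin{aligned}
&\Bigl( \mathbb{E}\Bigl[ \Bigl( \int_{[a,b]^d} |\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X_1}(\mathrm{d}x) \Bigr)^{p/2} \Bigr] \Bigr)^{1/p} \\
&\le \frac{6 d c^2}{[\min(\{\mathbf{L}\} \cup \{l_i \colon i \in \mathbb{N} \cap [0, \mathbf{L})\})]^{1/d}} + \frac{2 [\mathbf{L} (\|l\|_\infty + 1)^{\mathbf{L}} c^{\mathbf{L}+2} \max\{1, p/2\}]^{1/2}}{K^{[(2\mathbf{L})^{-1} (\|l\|_\infty + 1)^{-2}]}} \\
&\quad + \frac{5 B^{3/2} [\mathbf{L} (\|l\|_\infty + 1)^2 \max\{p/2, \ln(eM)\}]^{1/2}}{M^{1/4}} \\
&\le \frac{6 d c^2}{[\min(\{\mathbf{L}\} \cup \{l_i \colon i \in \mathbb{N} \cap [0, \mathbf{L})\})]^{1/d}} + \frac{2 \mathbf{L} (\|l\|_\infty + 1)^{\mathbf{L}} c^{\mathbf{L}+1} \max\{1, p\}}{K^{[(2\mathbf{L})^{-1} (\|l\|_\infty + 1)^{-2}]}} \\
&\quad + \frac{5 B^2 \mathbf{L} (\|l\|_\infty + 1) \max\{p, \ln(eM)\}}{M^{1/4}}.
\end{aligned} \tag{15.53}
\]
The proof of Corollary 15.1.6 is thus complete.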

15.2 Full strong error analysis with optimization via SGD with random initializations

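Corollary 15.2.1. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $M, d \in \mathbb{N}$, $a, u \in \mathbb{R}$, $b \in (a, \infty)$, $v \in (u, \infty)$, for every $k, n, j \in \mathbb{N}_0$ let $X^{k,n}_j \colon \Omega \to [a,b]^d$ and $Y^{k,n}_j \colon \Omega \to [u,v]$ be random variables, assume that $(X^{0,0}_j, Y^{0,0}_j)$, $j \in \{1, 2, \dots, M\}$, are i.i.d., let $\mathbf{d}, \mathbf{L} \in \mathbb{N}$, $l = (l_0, l_1, \dots, l_{\mathbf{L}}) \in \mathbb{N}^{\mathbf{L}+1}$ satisfy
\[
l_0 = d, \quad l_{\mathbf{L}} = 1, \quad\text{and}\quad \mathbf{d} \ge \sum_{i=1}^{\mathbf{L}} l_i (l_{i-1} + 1), \tag{15.54}
\]
for every $k, n \in \mathbb{N}_0$, $J \in \mathbb{N}$ let $\mathbf{R}^{k,n}_J \colon \mathbb{R}^{\mathbf{d}} \times \Omega \to [0, \infty)$ satisfy for all $\theta \in \mathbb{R}^{\mathbf{d}}$, $\omega \in \Omega$ that
\[
\mathbf{R}^{k,n}_J(\theta, \omega) = \frac{1}{J} \biggl[ \sum_{j=1}^{J} |\mathcal{N}^{\theta,l}_{u,v}(X^{k,n}_j(\omega)) - Y^{k,n}_j(\omega)|^2 \biggr], \tag{15.55}
\]
let $\mathcal{E} \colon [a,b]^d \to [u,v]$ satisfy $\mathbb{P}$-a.s. that
\[
\mathcal{E}(X^{0,0}_1) = \mathbb{E}[Y^{0,0}_1 | X^{0,0}_1], \tag{15.56}
\]
let $L \in \mathbb{R}$ satisfy for all $x, y \in [a,b]^d$ that $|\mathcal{E}(x) - \mathcal{E}(y)| \le L \|x - y\|_1$, let $(J_n)_{n \in \mathbb{N}} \subseteq \mathbb{N}$, for every $k, n \in \mathbb{N}$ let $\mathcal{G}^{k,n} \colon \mathbb{R}^{\mathbf{d}} \times \Omega \to \mathbb{R}^{\mathbf{d}}$ satisfy for all $\omega \in \Omega$, $\theta \in \{\vartheta \in \mathbb{R}^{\mathbf{d}} \colon (\mathbf{R}^{k,n}_{J_n}(\cdot, \omega) \colon \mathbb{R}^{\mathbf{d}} \to [0, \infty) \text{ is differentiable at } \vartheta)\}$ that
\[
\mathcal{G}^{k,n}(\theta, \omega) = (\nabla_\theta \mathbf{R}^{k,n}_{J_n})(\theta, \omega), \tag{15.57}
\]
let $K \in \mathbb{N}$, $c \in [\max\{1, L, |a|, |b|, 2|u|, 2|v|\}, \infty)$, $B \in [c, \infty)$, for every $k, n \in \mathbb{N}_0$ let $\Theta_{k,n} \colon \Omega \to \mathbb{R}^{\mathbf{d}}$ be a random variable, assume $\bigcup_{k=1}^{\infty} \Theta_{k,0}(\Omega) \subseteq [-B, B]^{\mathbf{d}}$, assume that $\Theta_{k,0}$, $k \in \{1, 2, \dots, K\}$, are i.i.d., assume that $\Theta_{1,0}$ is continuously uniformly distributed on $[-c, c]^{\mathbf{d}}$, let $(\gamma_n)_{n \in \mathbb{N}} \subseteq \mathbb{R}$ satisfy for all $k, n \in \mathbb{N}$ that
\[
\Theta_{k,n} = \Theta_{k,n-1} - \gamma_n \mathcal{G}^{k,n}(\Theta_{k,n-1}), \tag{15.58}
\]
let $N \in \mathbb{N}$, $T \subseteq \{0, 1, \dots, N\}$ satisfy $0 \in T$, let $\mathbf{k} \colon \Omega \to (\mathbb{N}_0)^2$ be a random variable, and assume for all $\omega \in \Omega$ that
\[
\mathbf{k}(\omega) \in \{(k, n) \in \{1, 2, \dots, K\} \times T \colon \|\Theta_{k,n}(\omega)\|_\infty \le B\} \tag{15.59}
\]
\[
\text{and}\quad \mathbf{R}^{0,0}_M(\Theta_{\mathbf{k}(\omega)}(\omega), \omega) = \min_{(k,n) \in \{1,2,\dots,K\} \times T,\, \|\Theta_{k,n}(\omega)\|_\infty \le B} \mathbf{R}^{0,0}_M(\Theta_{k,n}(\omega), \omega) \tag{15.60}
\]
(cf. Definitions 3.3.4 and 4.4.1). Then it holds for all $p \in (0, \infty)$ that
\[
\begin{aligned}
&\Bigl( \mathbb{E}\Bigl[ \Bigl( \int_{[a,b]^d} |\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X^{0,0}_1}(\mathrm{d}x) \Bigr)^{p/2} \Bigr] \Bigr)^{1/p} \\
&\le \frac{6 d c^2}{[\min(\{\mathbf{L}\} \cup \{l_i \colon i \in \mathbb{N} \cap [0, \mathbf{L})\})]^{1/d}} + \frac{2 \mathbf{L} (\|l\|_\infty + 1)^{\mathbf{L}} c^{\mathbf{L}+1} \max\{1, p\}}{K^{[(2\mathbf{L})^{-1} (\|l\|_\infty + 1)^{-2}]}} \\
&\quad + \frac{5 B^2 \mathbf{L} (\|l\|_\infty + 1) \max\{p, \ln(eM)\}}{M^{1/4}}
\end{aligned} \tag{15.61}
\]
(cf. Lemma 15.1.1).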
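Proof of Corollary 15.2.1. Note that Corollary 15.1.6 (applied with $(X_j)_{j \in \mathbb{N}} \curvearrowleft (X^{0,0}_j)_{j \in \mathbb{N}}$, $(Y_j)_{j \in \mathbb{N}} \curvearrowleft (Y^{0,0}_j)_{j \in \mathbb{N}}$, $\mathbf{R} \curvearrowleft \mathbf{R}^{0,0}_M$ in the notation of Corollary 15.1.6) implies (15.61). The proof of Corollary 15.2.1 is thus complete.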
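The training procedure analyzed in Corollary 15.2.1 (K independent uniform initializations, SGD steps as in (15.58), and selection of the admissible iterate with the smallest empirical risk as in (15.59) and (15.60)) can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions: the toy data, the layer vector, the constant learning rate, and all names such as `clipped_net` and `candidates` are hypothetical, and mini-batches are resampled from the M training samples for simplicity.

```python
import torch

torch.manual_seed(0)

# Toy data (assumption): d = 1, inputs in [a, b] = [0, 1], outputs in [u, v] = [0, 1]
M = 256
X = torch.rand(M, 1)
Y = (0.5 + 0.4 * torch.sin(6 * X)).clamp(0.0, 1.0)
u, v = 0.0, 1.0

layers = [1, 8, 1]          # layer dimensions l = (l_0, l_1, l_2)
c, B = 2.0, 10.0            # initialization box [-c, c] and admissible box [-B, B]
K, N_steps, J, lr = 5, 200, 32, 0.05
T_set = {0, 100, 200}       # recorded step indices; must contain 0


def init_params():
    # Uniform initialization on [-c, c] for every weight and bias
    params = []
    for n_in, n_out in zip(layers[:-1], layers[1:]):
        W = (torch.rand(n_out, n_in) * 2 - 1) * c
        b = (torch.rand(n_out) * 2 - 1) * c
        params += [W.requires_grad_(), b.requires_grad_()]
    return params


def clipped_net(params, x):
    # ReLU network whose output is clipped to [u, v]
    h = x
    for i in range(0, len(params) - 2, 2):
        h = torch.relu(h @ params[i].T + params[i + 1])
    return (h @ params[-2].T + params[-1]).clamp(u, v)


def risk(params, x, y):
    return (clipped_net(params, x) - y).square().mean()


def sup_norm(params):
    return max(p.abs().max().item() for p in params)


candidates = []  # tuples (empirical risk on all M samples, k, n)
for k in range(1, K + 1):
    params = init_params()
    for n in range(N_steps + 1):
        # Record the iterate with index n if n is in T_set and it is admissible
        if n in T_set and sup_norm(params) <= B:
            with torch.no_grad():
                candidates.append((risk(params, X, Y).item(), k, n))
        if n == N_steps:
            break
        # One plain SGD step on a mini-batch of size J
        idx = torch.randint(0, M, (J,))
        grads = torch.autograd.grad(risk(params, X[idx], Y[idx]), params)
        with torch.no_grad():
            for p, g in zip(params, grads):
                p -= lr * g

# Select the admissible candidate with the smallest empirical risk
best_risk, best_k, best_n = min(candidates)
print(best_risk, best_k, best_n)
```

Since c is at most B, every initial iterate is admissible, so the candidate set is never empty; this mirrors the role of the assumption that 0 belongs to T in the corollary.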
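Corollary 15.2.2. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $M, d \in \mathbb{N}$, $a, u \in \mathbb{R}$, $b \in (a, \infty)$, $v \in (u, \infty)$, for every $k, n, j \in \mathbb{N}_0$ let $X^{k,n}_j \colon \Omega \to [a,b]^d$ and $Y^{k,n}_j \colon \Omega \to [u,v]$ be random variables, assume that $(X^{0,0}_j, Y^{0,0}_j)$, $j \in \{1, 2, \dots, M\}$, are i.i.d., let $\mathbf{d}, \mathbf{L} \in \mathbb{N}$, $l = (l_0, l_1, \dots, l_{\mathbf{L}}) \in \mathbb{N}^{\mathbf{L}+1}$ satisfy
\[
l_0 = d, \quad l_{\mathbf{L}} = 1, \quad\text{and}\quad \mathbf{d} \ge \sum_{i=1}^{\mathbf{L}} l_i (l_{i-1} + 1), \tag{15.62}
\]
for every $k, n \in \mathbb{N}_0$, $J \in \mathbb{N}$ let $\mathbf{R}^{k,n}_J \colon \mathbb{R}^{\mathbf{d}} \times \Omega \to [0, \infty)$ satisfy for all $\theta \in \mathbb{R}^{\mathbf{d}}$ that
\[
\mathbf{R}^{k,n}_J(\theta) = \frac{1}{J} \biggl[ \sum_{j=1}^{J} |\mathcal{N}^{\theta,l}_{u,v}(X^{k,n}_j) - Y^{k,n}_j|^2 \biggr], \tag{15.63}
\]
let $\mathcal{E} \colon [a,b]^d \to [u,v]$ satisfy $\mathbb{P}$-a.s. that
\[
\mathcal{E}(X^{0,0}_1) = \mathbb{E}[Y^{0,0}_1 | X^{0,0}_1], \tag{15.64}
\]
let $L \in \mathbb{R}$ satisfy for all $x, y \in [a,b]^d$ that $|\mathcal{E}(x) - \mathcal{E}(y)| \le L \|x - y\|_1$, let $(J_n)_{n \in \mathbb{N}} \subseteq \mathbb{N}$, for every $k, n \in \mathbb{N}$ let $\mathcal{G}^{k,n} \colon \mathbb{R}^{\mathbf{d}} \times \Omega \to \mathbb{R}^{\mathbf{d}}$ satisfy for all $\omega \in \Omega$, $\theta \in \{\vartheta \in \mathbb{R}^{\mathbf{d}} \colon (\mathbf{R}^{k,n}_{J_n}(\cdot, \omega) \colon \mathbb{R}^{\mathbf{d}} \to [0, \infty) \text{ is differentiable at } \vartheta)\}$ that
\[
\mathcal{G}^{k,n}(\theta, \omega) = (\nabla_\theta \mathbf{R}^{k,n}_{J_n})(\theta, \omega), \tag{15.65}
\]
let $K \in \mathbb{N}$, $c \in [\max\{1, L, |a|, |b|, 2|u|, 2|v|\}, \infty)$, $B \in [c, \infty)$, for every $k, n \in \mathbb{N}_0$ let $\Theta_{k,n} \colon \Omega \to \mathbb{R}^{\mathbf{d}}$ be a random variable, assume $\bigcup_{k=1}^{\infty} \Theta_{k,0}(\Omega) \subseteq [-B, B]^{\mathbf{d}}$, assume that $\Theta_{k,0}$, $k \in \{1, 2, \dots, K\}$, are i.i.d., assume that $\Theta_{1,0}$ is continuously uniformly distributed on $[-c, c]^{\mathbf{d}}$, let $(\gamma_n)_{n \in \mathbb{N}} \subseteq \mathbb{R}$ satisfy for all $k, n \in \mathbb{N}$ that
\[
\Theta_{k,n} = \Theta_{k,n-1} - \gamma_n \mathcal{G}^{k,n}(\Theta_{k,n-1}), \tag{15.66}
\]
let $N \in \mathbb{N}$, $T \subseteq \{0, 1, \dots, N\}$ satisfy $0 \in T$, let $\mathbf{k} \colon \Omega \to (\mathbb{N}_0)^2$ be a random variable, and assume for all $\omega \in \Omega$ that
\[
\mathbf{k}(\omega) \in \{(k, n) \in \{1, 2, \dots, K\} \times T \colon \|\Theta_{k,n}(\omega)\|_\infty \le B\} \tag{15.67}
\]
\[
\text{and}\quad \mathbf{R}^{0,0}_M(\Theta_{\mathbf{k}(\omega)}(\omega)) = \min_{(k,n) \in \{1,2,\dots,K\} \times T,\, \|\Theta_{k,n}(\omega)\|_\infty \le B} \mathbf{R}^{0,0}_M(\Theta_{k,n}(\omega)) \tag{15.68}
\]
(cf. Definitions 3.3.4 and 4.4.1). Then
\[
\begin{aligned}
&\mathbb{E}\Bigl[ \int_{[a,b]^d} |\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x) - \mathcal{E}(x)| \, \mathbb{P}_{X^{0,0}_1}(\mathrm{d}x) \Bigr] \\
&\le \frac{6 d c^2}{[\min\{\mathbf{L}, l_1, l_2, \dots, l_{\mathbf{L}-1}\}]^{1/d}} + \frac{5 B^2 \mathbf{L} (\|l\|_\infty + 1) \ln(eM)}{M^{1/4}} + \frac{2 \mathbf{L} (\|l\|_\infty + 1)^{\mathbf{L}} c^{\mathbf{L}+1}}{K^{[(2\mathbf{L})^{-1} (\|l\|_\infty + 1)^{-2}]}}
\end{aligned} \tag{15.69}
\]
(cf. Lemma 15.1.1).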


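Proof of Corollary 15.2.2. Observe that Jensen's inequality ensures that
\[
\mathbb{E}\Bigl[ \int_{[a,b]^d} |\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x) - \mathcal{E}(x)| \, \mathbb{P}_{X^{0,0}_1}(\mathrm{d}x) \Bigr] \le \mathbb{E}\Bigl[ \Bigl( \int_{[a,b]^d} |\mathcal{N}^{\Theta_{\mathbf{k}},l}_{u,v}(x) - \mathcal{E}(x)|^2 \, \mathbb{P}_{X^{0,0}_1}(\mathrm{d}x) \Bigr)^{1/2} \Bigr]. \tag{15.70}
\]
This and Corollary 15.2.1 (applied with $p \curvearrowleft 1$ in the notation of Corollary 15.2.1) establish (15.69). The proof of Corollary 15.2.2 is thus complete.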
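Corollary 15.2.3. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $M, d \in \mathbb{N}$, for every $k, n, j \in \mathbb{N}_0$ let $X^{k,n}_j \colon \Omega \to [0,1]^d$ and $Y^{k,n}_j \colon \Omega \to [0,1]$ be random variables, assume that $(X^{0,0}_j, Y^{0,0}_j)$, $j \in \{1, 2, \dots, M\}$, are i.i.d., let $\mathbf{d}, \mathbf{L} \in \mathbb{N}$, $l = (l_0, l_1, \dots, l_{\mathbf{L}}) \in \mathbb{N}^{\mathbf{L}+1}$ satisfy
\[
l_0 = d, \quad l_{\mathbf{L}} = 1, \quad\text{and}\quad \mathbf{d} \ge \sum_{i=1}^{\mathbf{L}} l_i (l_{i-1} + 1), \tag{15.72}
\]
for every $k, n \in \mathbb{N}_0$, $J \in \mathbb{N}$ let $\mathbf{R}^{k,n}_J \colon \mathbb{R}^{\mathbf{d}} \times \Omega \to [0, \infty)$ satisfy for all $\theta \in \mathbb{R}^{\mathbf{d}}$, $\omega \in \Omega$ that
\[
\mathbf{R}^{k,n}_J(\theta, \omega) = \frac{1}{J} \biggl[ \sum_{j=1}^{J} |\mathcal{N}^{\theta,l}_{0,1}(X^{k,n}_j(\omega)) - Y^{k,n}_j(\omega)|^2 \biggr], \tag{15.71}
\]
let $\mathcal{E} \colon [0,1]^d \to [0,1]$ satisfy $\mathbb{P}$-a.s. that
\[
\mathcal{E}(X^{0,0}_1) = \mathbb{E}[Y^{0,0}_1 | X^{0,0}_1], \tag{15.73}
\]
let $c \in [2, \infty)$ satisfy for all $x, y \in [0,1]^d$ that $|\mathcal{E}(x) - \mathcal{E}(y)| \le c \|x - y\|_1$, let $(J_n)_{n \in \mathbb{N}} \subseteq \mathbb{N}$, for every $k, n \in \mathbb{N}$ let $\mathcal{G}^{k,n} \colon \mathbb{R}^{\mathbf{d}} \times \Omega \to \mathbb{R}^{\mathbf{d}}$ satisfy for all $\omega \in \Omega$, $\theta \in \{\vartheta \in \mathbb{R}^{\mathbf{d}} \colon (\mathbf{R}^{k,n}_{J_n}(\cdot, \omega) \colon \mathbb{R}^{\mathbf{d}} \to [0, \infty) \text{ is differentiable at } \vartheta)\}$ that
\[
\mathcal{G}^{k,n}(\theta, \omega) = (\nabla_\theta \mathbf{R}^{k,n}_{J_n})(\theta, \omega), \tag{15.74}
\]
let $K \in \mathbb{N}$, for every $k, n \in \mathbb{N}_0$ let $\Theta_{k,n} \colon \Omega \to \mathbb{R}^{\mathbf{d}}$ be a random variable, assume $\bigcup_{k=1}^{\infty} \Theta_{k,0}(\Omega) \subseteq [-c, c]^{\mathbf{d}}$, assume that $\Theta_{k,0}$, $k \in \{1, 2, \dots, K\}$, are i.i.d., assume that $\Theta_{1,0}$ is continuously uniformly distributed on $[-c, c]^{\mathbf{d}}$, let $(\gamma_n)_{n \in \mathbb{N}} \subseteq \mathbb{R}$ satisfy for all $k, n \in \mathbb{N}$ that
\[
\Theta_{k,n} = \Theta_{k,n-1} - \gamma_n \mathcal{G}^{k,n}(\Theta_{k,n-1}), \tag{15.75}
\]
let $N \in \mathbb{N}$, $T \subseteq \{0, 1, \dots, N\}$ satisfy $0 \in T$, let $\mathbf{k} \colon \Omega \to (\mathbb{N}_0)^2$ be a random variable, and assume for all $\omega \in \Omega$ that
\[
\mathbf{k}(\omega) \in \{(k, n) \in \{1, 2, \dots, K\} \times T \colon \|\Theta_{k,n}(\omega)\|_\infty \le c\} \tag{15.76}
\]
\[
\text{and}\quad \mathbf{R}^{0,0}_M(\Theta_{\mathbf{k}(\omega)}(\omega), \omega) = \min_{(k,n) \in \{1,2,\dots,K\} \times T,\, \|\Theta_{k,n}(\omega)\|_\infty \le c} \mathbf{R}^{0,0}_M(\Theta_{k,n}(\omega), \omega) \tag{15.77}
\]
(cf. Definitions 3.3.4 and 4.4.1). Then
\[
\begin{aligned}
&\mathbb{E}\Bigl[ \int_{[0,1]^d} |\mathcal{N}^{\Theta_{\mathbf{k}},l}_{0,1}(x) - \mathcal{E}(x)| \, \mathbb{P}_{X^{0,0}_1}(\mathrm{d}x) \Bigr] \\
&\le \frac{6 d c^2}{[\min\{\mathbf{L}, l_1, l_2, \dots, l_{\mathbf{L}-1}\}]^{1/d}} + \frac{5 c^2 \mathbf{L} (\|l\|_\infty + 1) \ln(eM)}{M^{1/4}} + \frac{\mathbf{L} (\|l\|_\infty + 1)^{\mathbf{L}} c^{\mathbf{L}+1}}{K^{[(2\mathbf{L})^{-1} (\|l\|_\infty + 1)^{-2}]}}
\end{aligned} \tag{15.78}
\]
(cf. Lemma 15.1.1).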

Proof of Corollary 15.2.3. Note that Corollary 15.2.2 (applied with $a \curvearrowleft 0$, $u \curvearrowleft 0$, $b \curvearrowleft 1$, $v \curvearrowleft 1$, $L \curvearrowleft c$, $c \curvearrowleft c$, $B \curvearrowleft c$ in the notation of Corollary 15.2.2), the fact that $c \ge 2$ and $M \ge 1$, and Lemma 15.1.4 show (15.78). The proof of Corollary 15.2.3 is thus complete.

Part VI

Deep learning for partial differential equations (PDEs)

Chapter 16

Physics-informed neural networks (PINNs)

Deep learning methods have not only become very popular for data-driven learning problems,
but are nowadays also heavily used for solving mathematical equations such as ordinary and
partial differential equations (cf., for example, [119, 187, 347, 379]). In particular, we refer
to the overview articles [24, 56, 88, 145, 237, 355] and the references therein for numerical
simulations and theoretical investigations for deep learning methods for PDEs.

Often deep learning methods for PDEs are obtained, first, by reformulating the PDE
problem under consideration as an infinite dimensional stochastic optimization problem,
then, by approximating the infinite dimensional stochastic optimization problem through
finite dimensional stochastic optimization problems involving deep ANNs as approximations
for the PDE solution and/or its derivatives, and thereafter, by approximately solving the
resulting finite dimensional stochastic optimization problems through SGD-type optimization
methods.

Among the most basic schemes of such deep learning methods for PDEs are PINNs
and DGMs; see [347, 379]. In this chapter we present in Theorem 16.1.1 in Section 16.1 a
reformulation of PDE problems as stochastic optimization problems, we use the theoretical
considerations from Section 16.1 to briefly sketch in Section 16.2 a possible derivation of
PINNs and DGMs, and we present in Sections 16.3 and 16.4 numerical simulations for
PINNs and DGMs. For simplicity and concreteness we restrict ourselves in this chapter
to the case of semilinear heat PDEs. The specific presentation of this chapter is based on
Beck et al. [24].

16.1 Reformulation of PDE problems as stochastic optimization problems

Both PINNs and DGMs are based on reformulations of the considered PDEs as suitable
infinite dimensional stochastic optimization problems. In Theorem 16.1.1 below we present
the theoretical result behind this reformulation in the special case of semilinear heat PDEs.
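Theorem 16.1.1. Let $T \in (0, \infty)$, $d \in \mathbb{N}$, $g \in C^2(\mathbb{R}^d, \mathbb{R})$, $u \in C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R})$, $\mathfrak{t} \in C([0,T], (0, \infty))$, $\mathfrak{x} \in C(\mathbb{R}^d, (0, \infty))$, assume that $g$ has at most polynomially growing partial derivatives, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $\mathcal{T} \colon \Omega \to [0,T]$ and $\mathcal{X} \colon \Omega \to \mathbb{R}^d$ be independent random variables, assume for all $A \in \mathcal{B}([0,T])$, $B \in \mathcal{B}(\mathbb{R}^d)$ that
\[
\mathbb{P}(\mathcal{T} \in A) = \int_A \mathfrak{t}(t) \, \mathrm{d}t \qquad\text{and}\qquad \mathbb{P}(\mathcal{X} \in B) = \int_B \mathfrak{x}(x) \, \mathrm{d}x, \tag{16.1}
\]
let $f \colon \mathbb{R} \to \mathbb{R}$ be Lipschitz continuous, and let $\mathfrak{L} \colon C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R}) \to [0, \infty]$ satisfy for all $v = (v(t,x))_{(t,x) \in [0,T] \times \mathbb{R}^d} \in C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R})$ that
\[
\mathfrak{L}(v) = \mathbb{E}\Bigl[ |v(0, \mathcal{X}) - g(\mathcal{X})|^2 + \bigl| \bigl(\tfrac{\partial v}{\partial t}\bigr)(\mathcal{T}, \mathcal{X}) - (\Delta_x v)(\mathcal{T}, \mathcal{X}) - f(v(\mathcal{T}, \mathcal{X})) \bigr|^2 \Bigr]. \tag{16.2}
\]
Then the following two statements are equivalent:

(i) It holds that $\mathfrak{L}(u) = \inf_{v \in C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R})} \mathfrak{L}(v)$.

(ii) It holds for all $t \in [0,T]$, $x \in \mathbb{R}^d$ that $u(0, x) = g(x)$ and
\[
\bigl(\tfrac{\partial u}{\partial t}\bigr)(t, x) = (\Delta_x u)(t, x) + f(u(t, x)). \tag{16.3}
\]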

Proof of Theorem 16.1.1. Observe that (16.2) proves that for all $v \in C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R})$ with $\forall\, x \in \mathbb{R}^d \colon v(0, x) = g(x)$ and $\forall\, t \in [0,T], x \in \mathbb{R}^d \colon \bigl(\tfrac{\partial v}{\partial t}\bigr)(t, x) = (\Delta_x v)(t, x) + f(v(t, x))$ it holds that
\[
\mathfrak{L}(v) = 0. \tag{16.4}
\]
This and the fact that for all $v \in C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R})$ it holds that $\mathfrak{L}(v) \ge 0$ establish that ((ii) → (i)). Note that the assumption that $f$ is Lipschitz continuous, the assumption that $g$ is twice continuously differentiable, and the assumption that $g$ has at most polynomially growing partial derivatives demonstrate that there exists $v \in C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R})$ which satisfies for all $t \in [0,T]$, $x \in \mathbb{R}^d$ that $v(0, x) = g(x)$ and
\[
\bigl(\tfrac{\partial v}{\partial t}\bigr)(t, x) = (\Delta_x v)(t, x) + f(v(t, x)) \tag{16.5}
\]
(cf., for instance, Beck et al. [23, Corollary 3.4]). This and (16.4) show that
\[
\inf_{v \in C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R})} \mathfrak{L}(v) = 0. \tag{16.6}
\]
Furthermore, observe that (16.2), (16.1), and the assumption that $\mathcal{T}$ and $\mathcal{X}$ are independent imply that for all $v \in C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R})$ it holds that
\[
\mathfrak{L}(v) = \int_{[0,T] \times \mathbb{R}^d} \Bigl[ |v(0, x) - g(x)|^2 + \bigl| \bigl(\tfrac{\partial v}{\partial t}\bigr)(t, x) - (\Delta_x v)(t, x) - f(v(t, x)) \bigr|^2 \Bigr] \mathfrak{t}(t) \mathfrak{x}(x) \, \mathrm{d}(t, x). \tag{16.7}
\]
The assumption that $\mathfrak{t}$ and $\mathfrak{x}$ are continuous and the fact that for all $t \in [0,T]$, $x \in \mathbb{R}^d$ it holds that $\mathfrak{t}(t) \ge 0$ and $\mathfrak{x}(x) \ge 0$ therefore ensure that for all $v \in C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R})$, $t \in [0,T]$, $x \in \mathbb{R}^d$ with $\mathfrak{L}(v) = 0$ it holds that
\[
\Bigl[ |v(0, x) - g(x)|^2 + \bigl| \bigl(\tfrac{\partial v}{\partial t}\bigr)(t, x) - (\Delta_x v)(t, x) - f(v(t, x)) \bigr|^2 \Bigr] \mathfrak{t}(t) \mathfrak{x}(x) = 0. \tag{16.8}
\]
This and the assumption that for all $t \in [0,T]$, $x \in \mathbb{R}^d$ it holds that $\mathfrak{t}(t) > 0$ and $\mathfrak{x}(x) > 0$ show that for all $v \in C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R})$, $t \in [0,T]$, $x \in \mathbb{R}^d$ with $\mathfrak{L}(v) = 0$ it holds that
\[
|v(0, x) - g(x)|^2 + \bigl| \bigl(\tfrac{\partial v}{\partial t}\bigr)(t, x) - (\Delta_x v)(t, x) - f(v(t, x)) \bigr|^2 = 0. \tag{16.9}
\]
Combining this with (16.6) proves that ((i) → (ii)). The proof of Theorem 16.1.1 is thus complete.
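The statement of Theorem 16.1.1 can be illustrated numerically in a simple special case. The sketch below assumes d = 1, T = 1, f = 0, and g = sin, for which u(t, x) = exp(−t) sin(x) solves the heat equation; it estimates the loss in (16.2) by Monte Carlo with the time variable uniformly distributed on [0, T] and the space variable standard normally distributed. The function names `loss_estimate`, `exact`, and `perturbed` are assumptions introduced for this example.

```python
import torch
from torch.autograd import grad

torch.manual_seed(0)
T = 1.0


def g(x):
    return torch.sin(x)


def f(y):
    # Trivial (Lipschitz continuous) nonlinearity f = 0
    return torch.zeros_like(y)


def loss_estimate(v, n_samples=4096):
    # Monte Carlo estimate of the loss in (16.2) for d = 1
    t = (torch.rand(n_samples, 1) * T).requires_grad_()
    x = torch.randn(n_samples, 1).requires_grad_()
    val = v(t, x)
    ones = torch.ones_like(val)
    v_t = grad(val, t, ones, create_graph=True)[0]
    v_x = grad(val, x, ones, create_graph=True)[0]
    v_xx = grad(v_x, x, torch.ones_like(v_x))[0]
    initial = (v(torch.zeros_like(t), x) - g(x)).square()
    residual = (v_t - v_xx - f(val)).square()
    return (initial + residual).mean().item()


def exact(t, x):
    # Solves u_t = u_xx with u(0, x) = sin(x)
    return torch.exp(-t) * torch.sin(x)


def perturbed(t, x):
    # Same initial value, but violates the PDE for t > 0
    return exact(t, x) + 0.1 * t * torch.cos(x)


loss_exact = loss_estimate(exact)
loss_perturbed = loss_estimate(perturbed)
print(loss_exact, loss_perturbed)
```

The estimate for the exact solution vanishes up to floating point rounding, whereas the perturbed function, which has the correct initial value but does not satisfy the PDE, produces a clearly positive loss.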

16.2 Derivation of PINNs and deep Galerkin methods (DGMs)

In this section we employ the reformulation of semilinear PDEs as optimization problems from Theorem 16.1.1 to sketch an informal derivation of deep learning schemes to approximate solutions of semilinear heat PDEs. For this let $T \in (0, \infty)$, $d \in \mathbb{N}$, $u \in C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R})$, $g \in C^2(\mathbb{R}^d, \mathbb{R})$ satisfy that $g$ has at most polynomially growing partial derivatives, let $f \colon \mathbb{R} \to \mathbb{R}$ be Lipschitz continuous, and assume for all $t \in [0,T]$, $x \in \mathbb{R}^d$ that $u(0, x) = g(x)$ and
\[
\bigl(\tfrac{\partial u}{\partial t}\bigr)(t, x) = (\Delta_x u)(t, x) + f(u(t, x)). \tag{16.10}
\]
In the framework described in the previous sentence, we think of $u$ as the unknown PDE solution. The objective of this derivation is to develop deep learning methods which aim to approximate the unknown function $u$.
In the first step we employ Theorem 16.1.1 to reformulate the PDE problem associated to (16.10) as an infinite dimensional stochastic optimization problem over a function space. For this let $\mathfrak{t} \in C([0,T], (0, \infty))$, $\mathfrak{x} \in C(\mathbb{R}^d, (0, \infty))$, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $\mathcal{T} \colon \Omega \to [0,T]$ and $\mathcal{X} \colon \Omega \to \mathbb{R}^d$ be independent random variables, assume for all $A \in \mathcal{B}([0,T])$, $B \in \mathcal{B}(\mathbb{R}^d)$ that
\[
\mathbb{P}(\mathcal{T} \in A) = \int_A \mathfrak{t}(t) \, \mathrm{d}t \qquad\text{and}\qquad \mathbb{P}(\mathcal{X} \in B) = \int_B \mathfrak{x}(x) \, \mathrm{d}x, \tag{16.11}
\]

and let $\mathfrak{L} \colon C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R}) \to [0, \infty]$ satisfy for all $v = (v(t,x))_{(t,x) \in [0,T] \times \mathbb{R}^d} \in C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R})$ that
\[
\mathfrak{L}(v) = \mathbb{E}\Bigl[ |v(0, \mathcal{X}) - g(\mathcal{X})|^2 + \bigl| \bigl(\tfrac{\partial v}{\partial t}\bigr)(\mathcal{T}, \mathcal{X}) - (\Delta_x v)(\mathcal{T}, \mathcal{X}) - f(v(\mathcal{T}, \mathcal{X})) \bigr|^2 \Bigr]. \tag{16.12}
\]
Observe that Theorem 16.1.1 assures that the unknown function $u$ satisfies
\[
\mathfrak{L}(u) = 0 \tag{16.13}
\]
and is thus a minimizer of the optimization problem associated to (16.12). Motivated by this, we aim to find approximations of $u$ by computing approximate minimizers of the function $\mathfrak{L} \colon C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R}) \to [0, \infty]$. Due to its infinite dimensionality, however, this optimization problem is not yet amenable to numerical computations.
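For this reason, in the second step, we reduce this infinite dimensional stochastic optimization problem to a finite dimensional stochastic optimization problem involving ANNs. Specifically, let $a \colon \mathbb{R} \to \mathbb{R}$ be differentiable, let $h \in \mathbb{N}$, $l_1, l_2, \dots, l_h, \mathfrak{d} \in \mathbb{N}$ satisfy $\mathfrak{d} = l_1 (d + 2) + \bigl[\sum_{k=2}^{h} l_k (l_{k-1} + 1)\bigr] + l_h + 1$, and let $\mathcal{L} \colon \mathbb{R}^{\mathfrak{d}} \to [0, \infty)$ satisfy for all $\theta \in \mathbb{R}^{\mathfrak{d}}$ that
\[
\begin{aligned}
\mathcal{L}(\theta) &= \mathfrak{L}\bigl( \mathcal{N}^{\theta,d+1}_{\mathfrak{M}_{a,l_1}, \mathfrak{M}_{a,l_2}, \dots, \mathfrak{M}_{a,l_h}, \operatorname{id}_{\mathbb{R}}} \bigr) \\
&= \mathbb{E}\Bigl[ \bigl| \mathcal{N}^{\theta,d+1}_{\mathfrak{M}_{a,l_1}, \mathfrak{M}_{a,l_2}, \dots, \mathfrak{M}_{a,l_h}, \operatorname{id}_{\mathbb{R}}}(0, \mathcal{X}) - g(\mathcal{X}) \bigr|^2 \\
&\qquad + \bigl| \bigl( \tfrac{\partial}{\partial t} \mathcal{N}^{\theta,d+1}_{\mathfrak{M}_{a,l_1}, \mathfrak{M}_{a,l_2}, \dots, \mathfrak{M}_{a,l_h}, \operatorname{id}_{\mathbb{R}}} \bigr)(\mathcal{T}, \mathcal{X}) - \bigl( \Delta_x \mathcal{N}^{\theta,d+1}_{\mathfrak{M}_{a,l_1}, \mathfrak{M}_{a,l_2}, \dots, \mathfrak{M}_{a,l_h}, \operatorname{id}_{\mathbb{R}}} \bigr)(\mathcal{T}, \mathcal{X}) \\
&\qquad\quad - f\bigl( \mathcal{N}^{\theta,d+1}_{\mathfrak{M}_{a,l_1}, \mathfrak{M}_{a,l_2}, \dots, \mathfrak{M}_{a,l_h}, \operatorname{id}_{\mathbb{R}}}(\mathcal{T}, \mathcal{X}) \bigr) \bigr|^2 \Bigr]
\end{aligned} \tag{16.14}
\]
(cf. Definitions 1.1.3 and 1.2.1). We can now compute an approximate minimizer of the function $\mathfrak{L}$ by computing an approximate minimizer $\vartheta \in \mathbb{R}^{\mathfrak{d}}$ of the function $\mathcal{L}$ and employing the realization $\mathcal{N}^{\vartheta,d+1}_{\mathfrak{M}_{a,l_1}, \mathfrak{M}_{a,l_2}, \dots, \mathfrak{M}_{a,l_h}, \operatorname{id}_{\mathbb{R}}}$ of the ANN associated to this approximate minimizer as an approximate minimizer of $\mathfrak{L}$.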
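The number of real parameters of such an ANN (input dimension d + 1, hidden layers of widths l_1, ..., l_h, and one output neuron) can be checked with a few lines of Python. The sketch below compares the closed-form expression from the text with a direct layer-by-layer count; the function names are assumptions made for this illustration.

```python
def pinn_parameter_count(d, hidden):
    """Closed-form count l_1 (d + 2) + sum_{k=2}^h l_k (l_{k-1} + 1) + l_h + 1."""
    l = hidden
    inner = sum(l[k] * (l[k - 1] + 1) for k in range(1, len(l)))
    return l[0] * (d + 2) + inner + l[-1] + 1


def count_layer_by_layer(d, hidden):
    """Sum of (weights + biases) over all affine layers of the network."""
    dims = [d + 1] + list(hidden) + [1]
    return sum(n_out * (n_in + 1) for n_in, n_out in zip(dims[:-1], dims[1:]))


# Example: d = 2 space dimensions and 4 hidden layers with 50 neurons each
n_params = pinn_parameter_count(2, [50, 50, 50, 50])
print(n_params, count_layer_by_layer(2, [50, 50, 50, 50]))
```

For d = 2 and four hidden layers with 50 neurons each, both counts give 7901 parameters.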
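The third and last step of this derivation is to approximately compute such an approximate minimizer of $\mathcal{L}$ by means of SGD-type optimization methods. We now sketch this in the case of the plain-vanilla SGD optimization method (cf. Definition 7.2.1). Let $\xi \in \mathbb{R}^{\mathfrak{d}}$, $J \in \mathbb{N}$, $(\gamma_n)_{n \in \mathbb{N}} \subseteq [0, \infty)$, for every $n \in \mathbb{N}$, $j \in \{1, 2, \dots, J\}$ let $\mathcal{T}_{n,j} \colon \Omega \to [0,T]$ and $\mathcal{X}_{n,j} \colon \Omega \to \mathbb{R}^d$ be random variables, assume for all $n \in \mathbb{N}$, $j \in \{1, 2, \dots, J\}$, $A \in \mathcal{B}([0,T])$, $B \in \mathcal{B}(\mathbb{R}^d)$ that
\[
\mathbb{P}(\mathcal{T} \in A) = \mathbb{P}(\mathcal{T}_{n,j} \in A) \qquad\text{and}\qquad \mathbb{P}(\mathcal{X} \in B) = \mathbb{P}(\mathcal{X}_{n,j} \in B), \tag{16.15}
\]
let $\mathfrak{l} \colon \mathbb{R}^{\mathfrak{d}} \times [0,T] \times \mathbb{R}^d \to \mathbb{R}$ satisfy for all $\theta \in \mathbb{R}^{\mathfrak{d}}$, $t \in [0,T]$, $x \in \mathbb{R}^d$ that
\[
\begin{aligned}
\mathfrak{l}(\theta, t, x) &= \bigl| \mathcal{N}^{\theta,d+1}_{\mathfrak{M}_{a,l_1}, \mathfrak{M}_{a,l_2}, \dots, \mathfrak{M}_{a,l_h}, \operatorname{id}_{\mathbb{R}}}(0, x) - g(x) \bigr|^2 \\
&\quad + \bigl| \bigl( \tfrac{\partial}{\partial t} \mathcal{N}^{\theta,d+1}_{\mathfrak{M}_{a,l_1}, \mathfrak{M}_{a,l_2}, \dots, \mathfrak{M}_{a,l_h}, \operatorname{id}_{\mathbb{R}}} \bigr)(t, x) - \bigl( \Delta_x \mathcal{N}^{\theta,d+1}_{\mathfrak{M}_{a,l_1}, \mathfrak{M}_{a,l_2}, \dots, \mathfrak{M}_{a,l_h}, \operatorname{id}_{\mathbb{R}}} \bigr)(t, x) \\
&\qquad - f\bigl( \mathcal{N}^{\theta,d+1}_{\mathfrak{M}_{a,l_1}, \mathfrak{M}_{a,l_2}, \dots, \mathfrak{M}_{a,l_h}, \operatorname{id}_{\mathbb{R}}}(t, x) \bigr) \bigr|^2,
\end{aligned} \tag{16.16}
\]
and let $\Theta = (\Theta_n)_{n \in \mathbb{N}_0} \colon \mathbb{N}_0 \times \Omega \to \mathbb{R}^{\mathfrak{d}}$ satisfy for all $n \in \mathbb{N}$ that
\[
\Theta_0 = \xi \qquad\text{and}\qquad \Theta_n = \Theta_{n-1} - \gamma_n \biggl[ \frac{1}{J} \sum_{j=1}^{J} (\nabla_\theta \mathfrak{l})(\Theta_{n-1}, \mathcal{T}_{n,j}, \mathcal{X}_{n,j}) \biggr]. \tag{16.17}
\]
Finally, the idea of PINNs and DGMs is then to choose for large enough $n \in \mathbb{N}$ the realization $\mathcal{N}^{\Theta_n,d+1}_{\mathfrak{M}_{a,l_1}, \mathfrak{M}_{a,l_2}, \dots, \mathfrak{M}_{a,l_h}, \operatorname{id}_{\mathbb{R}}}$ as an approximation
\[
\mathcal{N}^{\Theta_n,d+1}_{\mathfrak{M}_{a,l_1}, \mathfrak{M}_{a,l_2}, \dots, \mathfrak{M}_{a,l_h}, \operatorname{id}_{\mathbb{R}}} \approx u \tag{16.18}
\]
of the unknown solution $u$ of the PDE in (16.10).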


The ideas and the resulting schemes in the above derivation were first introduced as PINNs in Raissi et al. [347] and as DGMs in Sirignano & Spiliopoulos [379]. Very roughly speaking, PINNs and DGMs in their original form differ in the way the joint distribution of the random variables $(\mathcal{T}_{n,j}, \mathcal{X}_{n,j})_{(n,j) \in \mathbb{N} \times \{1,2,\dots,J\}}$ is chosen. Loosely speaking, in the case of PINNs the originally proposed distribution for $(\mathcal{T}_{n,j}, \mathcal{X}_{n,j})_{(n,j) \in \mathbb{N} \times \{1,2,\dots,J\}}$ is based on drawing a finite number of samples of the random variable $(\mathcal{T}, \mathcal{X})$ and then having the random variables $(\mathcal{T}_{n,j}, \mathcal{X}_{n,j})_{(n,j) \in \mathbb{N} \times \{1,2,\dots,J\}}$ be randomly chosen among those samples. In the case of DGMs the original proposal is to choose $(\mathcal{T}_{n,j}, \mathcal{X}_{n,j})_{(n,j) \in \mathbb{N} \times \{1,2,\dots,J\}}$ independent and identically distributed. Implementations of PINNs and DGMs that employ more sophisticated optimization methods, such as the Adam SGD optimization method, can be found in the next sections.
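The difference between the two sampling strategies can be sketched in a few lines of PyTorch. The snippet below is only an illustration: the helper names `sample_tx`, `pinn_batch`, and `dgm_batch` are assumptions, and the concrete distributions (time uniform on [0, 3], space normal with covariance 4 times the identity in two dimensions) are the ones used in the implementations of this chapter.

```python
import torch

torch.manual_seed(0)
T, d, J = 3.0, 2, 256


def sample_tx(n):
    # Draw n realizations of the time-space pair: time uniform on [0, T],
    # space normal with mean 0 and covariance 4 I_d
    return torch.rand(n, 1) * T, torch.randn(n, d) * 2


# PINN-style: fix a finite sample once, then draw every batch from it
t_data, x_data = sample_tx(20000)


def pinn_batch():
    idx = torch.randint(0, t_data.shape[0], (J,))
    return t_data[idx], x_data[idx]


# DGM-style: draw fresh i.i.d. realizations in every training step
def dgm_batch():
    return sample_tx(J)


t_p, x_p = pinn_batch()
t_d, x_d = dgm_batch()
print(t_p.shape, x_p.shape, t_d.shape, x_d.shape)
```

In the PINN-style variant every batch element is one of the 20000 stored samples, while the DGM-style variant never reuses a sample.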

16.3 Implementation of PINNs

In Source code 16.1 below we present a simple implementation of the PINN method, as explained in Section 16.2 above, for finding an approximation of a solution $u \in C^{1,2}([0,3] \times \mathbb{R}^2, \mathbb{R})$ of the two-dimensional Allen–Cahn-type semilinear heat equation
\[
\bigl(\tfrac{\partial u}{\partial t}\bigr)(t, x) = \tfrac{1}{200} (\Delta_x u)(t, x) + u(t, x) - [u(t, x)]^3 \tag{16.19}
\]
with $u(0, x) = \sin(\|x\|_2^2)$ for $t \in [0,3]$, $x \in \mathbb{R}^2$. This implementation follows the original proposal in Raissi et al. [347] in that it first chooses 20000 realizations of the random variable $(\mathcal{T}, \mathcal{X})$, where $\mathcal{T}$ is continuously uniformly distributed on $[0,3]$ and where $\mathcal{X}$ is normally distributed on $\mathbb{R}^2$ with mean $0 \in \mathbb{R}^2$ and covariance $4 \operatorname{I}_2 \in \mathbb{R}^{2 \times 2}$ (cf. Definition 1.5.5). It then trains a fully connected feed-forward ANN with 4 hidden layers (with 50 neurons on each hidden layer) and using the swish activation function with parameter 1 (cf. Section 1.2.8). The training uses batches of size 256 with each batch chosen from the 20000 realizations of the random variable $(\mathcal{T}, \mathcal{X})$ which were picked beforehand. The training is performed using the Adam SGD optimization method (cf. Section 7.9). A plot of the resulting approximation of the solution $u$ after 20000 training steps is shown in Figure 16.1.
import torch
import matplotlib.pyplot as plt
from torch.autograd import grad
from matplotlib.gridspec import GridSpec
from matplotlib.cm import ScalarMappable


dev = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

T = 3.0  # the time horizon
M = 20000  # the number of training samples

torch.manual_seed(0)

x_data = torch.randn(M, 2).to(dev) * 2
t_data = torch.rand(M, 1).to(dev) * T

# The initial value
def phi(x):
    return x.square().sum(axis=1, keepdims=True).sin()

# We use a network with 4 hidden layers of 50 neurons each and the
# Swish activation function (called SiLU in PyTorch)
N = torch.nn.Sequential(
    torch.nn.Linear(3, 50), torch.nn.SiLU(),
    torch.nn.Linear(50, 50), torch.nn.SiLU(),
    torch.nn.Linear(50, 50), torch.nn.SiLU(),
    torch.nn.Linear(50, 50), torch.nn.SiLU(),
    torch.nn.Linear(50, 1),
).to(dev)

optimizer = torch.optim.Adam(N.parameters(), lr=3e-4)

J = 256  # the batch size

for i in range(20000):
    # Choose a random batch of training samples
    indices = torch.randint(0, M, (J,))
    x = x_data[indices, :]
    t = t_data[indices, :]

    x1, x2 = x[:, 0:1], x[:, 1:2]

    x1.requires_grad_()
    x2.requires_grad_()
    t.requires_grad_()

    optimizer.zero_grad()

    # Denoting by u the realization function of the ANN, compute
    # u(0, x) for each x in the batch
    u0 = N(torch.hstack((torch.zeros_like(t), x)))
    # Compute the loss for the initial condition
    initial_loss = (u0 - phi(x)).square().mean()

    # Compute the partial derivatives using automatic
    # differentiation
    u = N(torch.hstack((t, x1, x2)))
    ones = torch.ones_like(u)
    u_t = grad(u, t, ones, create_graph=True)[0]
    u_x1 = grad(u, x1, ones, create_graph=True)[0]
    u_x2 = grad(u, x2, ones, create_graph=True)[0]
    ones = torch.ones_like(u_x1)
    u_x1x1 = grad(u_x1, x1, ones, create_graph=True)[0]
    u_x2x2 = grad(u_x2, x2, ones, create_graph=True)[0]

    # Compute the loss for the PDE
    Laplace = u_x1x1 + u_x2x2
    pde_loss = (u_t - (0.005 * Laplace + u - u**3)).square().mean()

    # Compute the total loss and perform a gradient step
    loss = initial_loss + pde_loss
    loss.backward()
    optimizer.step()


### Plot the solution at different times

mesh = 128
a, b = -3, 3

gs = GridSpec(2, 4, width_ratios=[1, 1, 1, 0.05])
fig = plt.figure(figsize=(16, 10), dpi=300)

x, y = torch.meshgrid(
    torch.linspace(a, b, mesh),
    torch.linspace(a, b, mesh),
    indexing="xy"
)
x = x.reshape((mesh * mesh, 1)).to(dev)
y = y.reshape((mesh * mesh, 1)).to(dev)

for i in range(6):
    t = torch.full((mesh * mesh, 1), i * T / 5).to(dev)
    z = N(torch.cat((t, x, y), 1))
    z = z.detach().cpu().numpy().reshape((mesh, mesh))

    ax = fig.add_subplot(gs[i // 3, i % 3])
    ax.set_title(f"t = {i * T / 5}")
    ax.imshow(
        z, cmap="viridis", extent=[a, b, a, b], vmin=-1.2, vmax=1.2
    )

# Add the colorbar to the figure
norm = plt.Normalize(vmin=-1.2, vmax=1.2)
sm = ScalarMappable(cmap="viridis", norm=norm)
cax = fig.add_subplot(gs[:, 3])
fig.colorbar(sm, cax=cax, orientation='vertical')

fig.savefig("../plots/pinn.pdf", bbox_inches="tight")

Source code 16.1 (code/pinn.py): A simple implementation in PyTorch of the PINN method, computing an approximation of the function $u \in C^{1,2}([0,3] \times \mathbb{R}^2, \mathbb{R})$ which satisfies for all $t \in [0,3]$, $x \in \mathbb{R}^2$ that $\bigl(\tfrac{\partial u}{\partial t}\bigr)(t, x) = \tfrac{1}{200} (\Delta_x u)(t, x) + u(t, x) - [u(t, x)]^3$ and $u(0, x) = \sin(\|x\|_2^2)$ (cf. Definition 3.3.4). The plot created by this code is shown in Figure 16.1.

[Figure: six panels titled t = 0.0, t = 0.6, t = 1.2, t = 1.8, t = 2.4, and t = 3.0, both axes ranging over [−3, 3], with a shared color bar.]

Figure 16.1 (plots/pinn.pdf): Plots for the functions $[-3,3]^2 \ni x \mapsto U(t, x) \in \mathbb{R}$, where $t \in \{0, 0.6, 1.2, 1.8, 2.4, 3\}$ and where $U \in C([0,3] \times \mathbb{R}^2, \mathbb{R})$ is an approximation of the function $u \in C^{1,2}([0,3] \times \mathbb{R}^2, \mathbb{R})$ which satisfies for all $t \in [0,3]$, $x \in \mathbb{R}^2$ that $\bigl(\tfrac{\partial u}{\partial t}\bigr)(t, x) = \tfrac{1}{200} (\Delta_x u)(t, x) + u(t, x) - [u(t, x)]^3$ and $u(0, x) = \sin(\|x\|_2^2)$ computed by means of the PINN method as implemented in Source code 16.1 (cf. Definition 3.3.4).

16.4 Implementation of DGMs

In Source code 16.2 below we present a simple implementation of the DGM, as explained in Section 16.2 above, for finding an approximation for a solution $u \in C^{1,2}([0,3] \times \mathbb{R}^2, \mathbb{R})$ of the two-dimensional Allen–Cahn-type semilinear heat equation
\[
\bigl(\tfrac{\partial u}{\partial t}\bigr)(t, x) = \tfrac{1}{200} (\Delta_x u)(t, x) + u(t, x) - [u(t, x)]^3 \tag{16.20}
\]
with $u(0, x) = \sin(x_1) \sin(x_2)$ for $t \in [0,3]$, $x = (x_1, x_2) \in \mathbb{R}^2$. As originally proposed in Sirignano & Spiliopoulos [379], this implementation chooses for each training step a batch of 256 realizations of the random variable $(\mathcal{T}, \mathcal{X})$, where $\mathcal{T}$ is continuously uniformly distributed on $[0,3]$ and where $\mathcal{X}$ is normally distributed on $\mathbb{R}^2$ with mean $0 \in \mathbb{R}^2$ and covariance $4 \operatorname{I}_2 \in \mathbb{R}^{2 \times 2}$ (cf. Definition 1.5.5). Like the PINN implementation in Source code 16.1, it trains a fully connected feed-forward ANN with 4 hidden layers (with 50 neurons on each hidden layer) and using the swish activation function with parameter 1 (cf. Section 1.2.8). The training is performed using the Adam SGD optimization method (cf. Section 7.9). A plot of the resulting approximation of the solution $u$ after 30000 training steps is shown in Figure 16.2.
import torch
import matplotlib.pyplot as plt
from torch.autograd import grad
from matplotlib.gridspec import GridSpec
from matplotlib.cm import ScalarMappable


dev = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

T = 3.0  # the time horizon

# The initial value
def phi(x):
    return x.sin().prod(axis=1, keepdims=True)

torch.manual_seed(0)

# We use a network with 4 hidden layers of 50 neurons each and the
# Swish activation function (called SiLU in PyTorch)
N = torch.nn.Sequential(
    torch.nn.Linear(3, 50), torch.nn.SiLU(),
    torch.nn.Linear(50, 50), torch.nn.SiLU(),
    torch.nn.Linear(50, 50), torch.nn.SiLU(),
    torch.nn.Linear(50, 50), torch.nn.SiLU(),
    torch.nn.Linear(50, 1),
).to(dev)

optimizer = torch.optim.Adam(N.parameters(), lr=3e-4)

J = 256  # the batch size

for i in range(30000):
    # Choose a random batch of training samples
    x = torch.randn(J, 2).to(dev) * 2
    t = torch.rand(J, 1).to(dev) * T

    x1 = x[:, 0:1]
    x2 = x[:, 1:2]

    x1.requires_grad_()
    x2.requires_grad_()
    t.requires_grad_()

    optimizer.zero_grad()

    # Denoting by u the realization function of the ANN, compute
    # u(0, x) for each x in the batch
    u0 = N(torch.hstack((torch.zeros_like(t), x)))
    # Compute the loss for the initial condition
    initial_loss = (u0 - phi(x)).square().mean()

    # Compute the partial derivatives using automatic
    # differentiation
    u = N(torch.hstack((t, x1, x2)))
    ones = torch.ones_like(u)
    u_t = grad(u, t, ones, create_graph=True)[0]
    u_x1 = grad(u, x1, ones, create_graph=True)[0]
    u_x2 = grad(u, x2, ones, create_graph=True)[0]
    ones = torch.ones_like(u_x1)
    u_x1x1 = grad(u_x1, x1, ones, create_graph=True)[0]
    u_x2x2 = grad(u_x2, x2, ones, create_graph=True)[0]

    # Compute the loss for the PDE
    Laplace = u_x1x1 + u_x2x2
    pde_loss = (u_t - (0.005 * Laplace + u - u**3)).square().mean()

    # Compute the total loss and perform a gradient step
    loss = initial_loss + pde_loss
    loss.backward()
    optimizer.step()


### Plot the solution at different times

mesh = 128
a, b = -torch.pi, torch.pi

gs = GridSpec(2, 4, width_ratios=[1, 1, 1, 0.05])
fig = plt.figure(figsize=(16, 10), dpi=300)

x, y = torch.meshgrid(
    torch.linspace(a, b, mesh),
    torch.linspace(a, b, mesh),
    indexing="xy"
)
x = x.reshape((mesh * mesh, 1)).to(dev)
y = y.reshape((mesh * mesh, 1)).to(dev)

for i in range(6):
    t = torch.full((mesh * mesh, 1), i * T / 5).to(dev)
    z = N(torch.cat((t, x, y), 1))
    z = z.detach().cpu().numpy().reshape((mesh, mesh))

    ax = fig.add_subplot(gs[i // 3, i % 3])
    ax.set_title(f"t = {i * T / 5}")
    ax.imshow(
        z, cmap="viridis", extent=[a, b, a, b], vmin=-1.2, vmax=1.2
    )

# Add the colorbar to the figure
norm = plt.Normalize(vmin=-1.2, vmax=1.2)
sm = ScalarMappable(cmap="viridis", norm=norm)
cax = fig.add_subplot(gs[:, 3])
fig.colorbar(sm, cax=cax, orientation='vertical')

fig.savefig("../plots/dgm.pdf", bbox_inches="tight")


Source code 16.2 (code/dgm.py): A simple implementation in PyTorch of the deep Galerkin method, computing an approximation of the function $u \in C^{1,2}([0,3] \times \mathbb{R}^2, \mathbb{R})$ which satisfies for all $t \in [0,3]$, $x = (x_1, x_2) \in \mathbb{R}^2$ that $\frac{\partial u}{\partial t}(t,x) = \frac{1}{200} (\Delta_x u)(t,x) + u(t,x) - [u(t,x)]^3$ and $u(0,x) = \sin(x_1)\sin(x_2)$. The plot created by this code is shown in Figure 16.2.

[Figure: six heat-map panels with titles t = 0.0, t = 0.6, t = 1.2, t = 1.8, t = 2.4, and t = 3.0; both axes of each panel range over [−π, π]; a shared color bar is shown on the right.]

Figure 16.2 (plots/dgm.pdf): Plots for the functions $[-\pi,\pi]^2 \ni x \mapsto U(t,x) \in \mathbb{R}$, where $t \in \{0, 0.6, 1.2, 1.8, 2.4, 3\}$ and where $U \in C([0,3] \times \mathbb{R}^2, \mathbb{R})$ is an approximation of the function $u \in C^{1,2}([0,3] \times \mathbb{R}^2, \mathbb{R})$ which satisfies for all $t \in [0,3]$, $x = (x_1, x_2) \in \mathbb{R}^2$ that $u(0,x) = \sin(x_1)\sin(x_2)$ and $\frac{\partial u}{\partial t}(t,x) = \frac{1}{200} (\Delta_x u)(t,x) + u(t,x) - [u(t,x)]^3$, computed by means of Source code 16.2.

Chapter 17

Deep Kolmogorov methods (DKMs)

The PINNs and the DGMs presented in Chapter 16 exploit little of the structure of the
underlying PDE when the associated stochastic optimization problems are set up. Their key
advantage is therefore that they are very widely applicable deep learning methods for PDEs.
On the other hand, deep learning methods for PDEs that exploit the specific structure of the
considered PDE problem often result in more accurate approximations (cf., for example,
Beck et al. [24] and the references therein). In particular, there are many deep learning
approximation methods in the literature which exploit, in the process of setting up
stochastic optimization problems, that the PDE itself admits a stochastic representation.
Such methods are based on stochastic formulations of PDEs and therefore have a strong link
to stochastic analysis and formulas of the Feynman–Kac type (cf., for instance, [20, 119,
145, 187, 207, 336] and the references therein).

The schemes in Beck et al. [19], which we refer to as DKMs, belong to the simplest of
such deep learning methods for PDEs. In this chapter we present in Sections 17.1, 17.2,
17.3, and 17.4 theoretical considerations leading to a reformulation of heat PDE problems
as stochastic optimization problems (see Proposition 17.4.1 below), we use these theoretical
considerations to derive DKMs in the specific case of heat equations in Section 17.5, and we
present an implementation of DKMs in the case of a simple two-dimensional heat equation
in Section 17.6.

Sections 17.1 and 17.2 are slightly modified extracts from Beck et al. [18], Section 17.3
is inspired by Beck et al. [23, Section 2], and Sections 17.4 and 17.5 are inspired by Beck et
al. [18].


17.1 Stochastic optimization problems for expectations of random variables
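Lemma 17.1.1. Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and let $X \colon \Omega \to \mathbb{R}$ be a random variable with $\mathbb{E}[|X|^2] < \infty$. Then
(i) it holds for all $y \in \mathbb{R}$ that
\[
  \mathbb{E}\big[|X - y|^2\big] = \mathbb{E}\big[|X - \mathbb{E}[X]|^2\big] + |\mathbb{E}[X] - y|^2, \tag{17.1}
\]
(ii) there exists a unique $z \in \mathbb{R}$ such that
\[
  \mathbb{E}\big[|X - z|^2\big] = \inf_{y \in \mathbb{R}} \mathbb{E}\big[|X - y|^2\big], \tag{17.2}
\]
and
(iii) it holds that
\[
  \mathbb{E}\big[|X - \mathbb{E}[X]|^2\big] = \inf_{y \in \mathbb{R}} \mathbb{E}\big[|X - y|^2\big]. \tag{17.3}
\]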

Proof of Lemma 17.1.1. Note that Lemma 7.2.3 establishes item (i). Observe that item (i)
proves items (ii) and (iii). The proof of Lemma 17.1.1 is thus complete.
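
The decomposition in item (i) of Lemma 17.1.1 can be checked numerically. The following is a minimal sketch, assuming that NumPy is available; the exponential distribution of the samples, the test values of y, and the names mse, ys, and best_y are illustrative choices and are not taken from the text. Expectations are replaced by sample averages, so the identity is verified for the empirical distribution of the samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: samples of an exponentially distributed X with mean 2
X = rng.exponential(scale=2.0, size=200_000)
mean_X = X.mean()


def mse(y):
    """Sample average approximating E[|X - y|^2]."""
    return np.mean((X - y) ** 2)


# Item (i): E[|X - y|^2] = E[|X - E[X]|^2] + |E[X] - y|^2
for y in (-1.0, 0.5, 2.0, 3.5):
    lhs = mse(y)
    rhs = mse(mean_X) + (mean_X - y) ** 2
    print(f"y = {y:5.2f}: lhs = {lhs:.6f}, rhs = {rhs:.6f}")

# Items (ii) and (iii): a grid search over y finds its minimum near the mean
ys = np.linspace(0.0, 4.0, 401)
best_y = ys[np.argmin([mse(y) for y in ys])]
print("grid minimizer:", best_y, "sample mean:", mean_X)
```

Both printed columns should agree up to rounding, and the grid minimizer should lie within one grid spacing of the sample mean.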

17.2 Stochastic optimization problems for expectations of random fields
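Proposition 17.2.1. Let $d \in \mathbb{N}$, $a \in \mathbb{R}$, $b \in (a, \infty)$, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $X = (X_x)_{x \in [a,b]^d} \colon [a,b]^d \times \Omega \to \mathbb{R}$ be $(\mathcal{B}([a,b]^d) \otimes \mathcal{F})/\mathcal{B}(\mathbb{R})$-measurable, assume for every $x \in [a,b]^d$ that $\mathbb{E}[|X_x|^2] < \infty$, and assume that $[a,b]^d \ni x \mapsto \mathbb{E}[X_x] \in \mathbb{R}$ is continuous. Then
(i) there exists a unique $u \in C([a,b]^d, \mathbb{R})$ such that
\[
  \int_{[a,b]^d} \mathbb{E}\big[|X_x - u(x)|^2\big] \, dx
  = \inf_{v \in C([a,b]^d, \mathbb{R})} \left( \int_{[a,b]^d} \mathbb{E}\big[|X_x - v(x)|^2\big] \, dx \right) \tag{17.4}
\]
and
(ii) it holds for all $x \in [a,b]^d$ that $u(x) = \mathbb{E}[X_x]$.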
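Proof of Proposition 17.2.1. Note that item (i) in Lemma 17.1.1 and the assumption that for all $x \in [a,b]^d$ it holds that $\mathbb{E}[|X_x|^2] < \infty$ demonstrate that for every function $u \colon [a,b]^d \to \mathbb{R}$ and every $x \in [a,b]^d$ it holds that
\[
  \mathbb{E}\big[|X_x - u(x)|^2\big] = \mathbb{E}\big[|X_x - \mathbb{E}[X_x]|^2\big] + |\mathbb{E}[X_x] - u(x)|^2. \tag{17.5}
\]
Fubini's theorem (see, for example, Klenke [248, Theorem 14.16]) hence implies that for all $u \in C([a,b]^d, \mathbb{R})$ it holds that
\[
  \int_{[a,b]^d} \mathbb{E}\big[|X_x - u(x)|^2\big] \, dx
  = \int_{[a,b]^d} \mathbb{E}\big[|X_x - \mathbb{E}[X_x]|^2\big] \, dx + \int_{[a,b]^d} |\mathbb{E}[X_x] - u(x)|^2 \, dx. \tag{17.6}
\]
This ensures that
\[
\begin{aligned}
  \int_{[a,b]^d} \mathbb{E}\big[|X_x - \mathbb{E}[X_x]|^2\big] \, dx
  &\ge \inf_{v \in C([a,b]^d, \mathbb{R})} \left( \int_{[a,b]^d} \mathbb{E}\big[|X_x - v(x)|^2\big] \, dx \right) \\
  &= \inf_{v \in C([a,b]^d, \mathbb{R})} \left( \int_{[a,b]^d} \mathbb{E}\big[|X_x - \mathbb{E}[X_x]|^2\big] \, dx + \int_{[a,b]^d} |\mathbb{E}[X_x] - v(x)|^2 \, dx \right).
\end{aligned} \tag{17.7}
\]
The assumption that $[a,b]^d \ni x \mapsto \mathbb{E}[X_x] \in \mathbb{R}$ is continuous therefore shows that
\[
\begin{aligned}
  \int_{[a,b]^d} \mathbb{E}\big[|X_x - \mathbb{E}[X_x]|^2\big] \, dx
  &\ge \inf_{v \in C([a,b]^d, \mathbb{R})} \left( \int_{[a,b]^d} \mathbb{E}\big[|X_x - \mathbb{E}[X_x]|^2\big] \, dx \right) \\
  &= \int_{[a,b]^d} \mathbb{E}\big[|X_x - \mathbb{E}[X_x]|^2\big] \, dx.
\end{aligned} \tag{17.8}
\]
Hence, we obtain that
\[
  \int_{[a,b]^d} \mathbb{E}\big[|X_x - \mathbb{E}[X_x]|^2\big] \, dx
  = \inf_{v \in C([a,b]^d, \mathbb{R})} \left( \int_{[a,b]^d} \mathbb{E}\big[|X_x - v(x)|^2\big] \, dx \right). \tag{17.9}
\]
The fact that the function $[a,b]^d \ni x \mapsto \mathbb{E}[X_x] \in \mathbb{R}$ is continuous therefore establishes that there exists $u \in C([a,b]^d, \mathbb{R})$ such that
\[
  \int_{[a,b]^d} \mathbb{E}\big[|X_x - u(x)|^2\big] \, dx
  = \inf_{v \in C([a,b]^d, \mathbb{R})} \left( \int_{[a,b]^d} \mathbb{E}\big[|X_x - v(x)|^2\big] \, dx \right). \tag{17.10}
\]
Furthermore, observe that (17.6) and (17.9) prove that for all $u \in C([a,b]^d, \mathbb{R})$ with
\[
  \int_{[a,b]^d} \mathbb{E}\big[|X_x - u(x)|^2\big] \, dx
  = \inf_{v \in C([a,b]^d, \mathbb{R})} \left( \int_{[a,b]^d} \mathbb{E}\big[|X_x - v(x)|^2\big] \, dx \right) \tag{17.11}
\]
it holds that
\[
\begin{aligned}
  \int_{[a,b]^d} \mathbb{E}\big[|X_x - \mathbb{E}[X_x]|^2\big] \, dx
  &= \inf_{v \in C([a,b]^d, \mathbb{R})} \left( \int_{[a,b]^d} \mathbb{E}\big[|X_x - v(x)|^2\big] \, dx \right)
   = \int_{[a,b]^d} \mathbb{E}\big[|X_x - u(x)|^2\big] \, dx \\
  &= \int_{[a,b]^d} \mathbb{E}\big[|X_x - \mathbb{E}[X_x]|^2\big] \, dx + \int_{[a,b]^d} |\mathbb{E}[X_x] - u(x)|^2 \, dx.
\end{aligned} \tag{17.12}
\]
Hence, we obtain that for all $u \in C([a,b]^d, \mathbb{R})$ with
\[
  \int_{[a,b]^d} \mathbb{E}\big[|X_x - u(x)|^2\big] \, dx
  = \inf_{v \in C([a,b]^d, \mathbb{R})} \left( \int_{[a,b]^d} \mathbb{E}\big[|X_x - v(x)|^2\big] \, dx \right) \tag{17.13}
\]
it holds that
\[
  \int_{[a,b]^d} |\mathbb{E}[X_x] - u(x)|^2 \, dx = 0. \tag{17.14}
\]
This and the assumption that $[a,b]^d \ni x \mapsto \mathbb{E}[X_x] \in \mathbb{R}$ is continuous demonstrate that for all $y \in [a,b]^d$, $u \in C([a,b]^d, \mathbb{R})$ with
\[
  \int_{[a,b]^d} \mathbb{E}\big[|X_x - u(x)|^2\big] \, dx
  = \inf_{v \in C([a,b]^d, \mathbb{R})} \left( \int_{[a,b]^d} \mathbb{E}\big[|X_x - v(x)|^2\big] \, dx \right) \tag{17.15}
\]
it holds that $u(y) = \mathbb{E}[X_y]$. Combining this with (17.10) establishes items (i) and (ii). The proof of Proposition 17.2.1 is thus complete.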
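
Proposition 17.2.1 suggests that minimizing an empirical mean squared error over a rich class of functions should recover the mean field $x \mapsto \mathbb{E}[X_x]$. The following is a minimal sketch of this idea in dimension d = 1, assuming that NumPy is available. The mean field sin(2πx), the additive standard normal noise, and the use of polynomials of degree 7 in place of all continuous functions are hypothetical choices made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, n = 0.0, 1.0, 100_000


def mean_field(x):
    # Hypothetical choice of the function x -> E[X_x]
    return np.sin(2.0 * np.pi * x)


# Draw x uniformly from [a, b] and one noisy realization of X_x per point
x = rng.uniform(a, b, size=n)
samples = mean_field(x) + rng.standard_normal(n)

# Minimize the empirical mean squared error over polynomials of degree 7
coeffs = np.polynomial.polynomial.polyfit(x, samples, deg=7)

# Compare the least-squares minimizer with the mean field on a grid
grid = np.linspace(a, b, 201)
fit = np.polynomial.polynomial.polyval(grid, coeffs)
max_err = np.max(np.abs(fit - mean_field(grid)))
print("maximal deviation from x -> E[X_x]:", max_err)
```

Although every single sample is perturbed by noise of unit variance, the fitted polynomial should stay close to the mean field, in line with item (ii).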

17.3 Feynman–Kac formulas


17.3.1 Feynman–Kac formulas providing existence of solutions
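Lemma 17.3.1 (A variant of Lebesgue's theorem on dominated convergence). Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, for every $n \in \mathbb{N}_0$ let $X_n \colon \Omega \to \mathbb{R}$ be a random variable, assume for all $\varepsilon \in (0, \infty)$ that
\[
  \limsup_{n \to \infty} \mathbb{P}(|X_n - X_0| > \varepsilon) = 0, \tag{17.16}
\]
let $Y \colon \Omega \to \mathbb{R}$ be a random variable with $\mathbb{E}[|Y|] < \infty$, and assume for all $n \in \mathbb{N}$ that $\mathbb{P}(|X_n| \le Y) = 1$. Then

(i) it holds that $\limsup_{n \to \infty} \mathbb{E}[|X_n - X_0|] = 0$,

(ii) it holds that $\mathbb{E}[|X_0|] < \infty$, and

(iii) it holds that $\limsup_{n \to \infty} |\mathbb{E}[X_n] - \mathbb{E}[X_0]| = 0$.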

Proof of Lemma 17.3.1. Note that, for instance, the variant of Lebesgue’s theorem on
dominated convergence in Klenke [248, Corollary 6.26] proves items (i), (ii), and (iii). The
proof of Lemma 17.3.1 is thus complete.

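Proposition 17.3.2. Let $T \in (0, \infty)$, $d, m \in \mathbb{N}$, $B \in \mathbb{R}^{d \times m}$, $\varphi \in C^2(\mathbb{R}^d, \mathbb{R})$ satisfy
\[
  \sup_{x \in \mathbb{R}^d} \sum_{i,j=1}^d \left( |\varphi(x)| + \big|\big(\tfrac{\partial}{\partial x_i} \varphi\big)(x)\big| + \big|\big(\tfrac{\partial^2}{\partial x_i \partial x_j} \varphi\big)(x)\big| \right) < \infty, \tag{17.17}
\]
let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $Z \colon \Omega \to \mathbb{R}^m$ be a standard normal random variable, and let $u \colon [0,T] \times \mathbb{R}^d \to \mathbb{R}$ satisfy for all $t \in [0,T]$, $x \in \mathbb{R}^d$ that
\[
  u(t,x) = \mathbb{E}\big[\varphi(x + \sqrt{t} B Z)\big]. \tag{17.18}
\]
Then
(i) it holds that $u \in C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R})$ and

(ii) it holds for all $t \in [0,T]$, $x \in \mathbb{R}^d$ that
\[
  \frac{\partial u}{\partial t}(t,x) = \tfrac{1}{2} \operatorname{Trace}\big(B B^* (\operatorname{Hess}_x u)(t,x)\big) \tag{17.19}
\]
(cf. Definition 2.4.5).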


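Proof of Proposition 17.3.2. Throughout this proof, let
\[
  e_1 = (1, 0, \dots, 0), \quad e_2 = (0, 1, \dots, 0), \quad \dots, \quad e_m = (0, \dots, 0, 1) \in \mathbb{R}^m \tag{17.20}
\]
and for every $t \in [0,T]$, $x \in \mathbb{R}^d$ let $\psi_{t,x} \colon \mathbb{R}^m \to \mathbb{R}$ satisfy for all $y \in \mathbb{R}^m$ that $\psi_{t,x}(y) = \varphi(x + \sqrt{t} B y)$. Note that the assumption that $\varphi \in C^2(\mathbb{R}^d, \mathbb{R})$, the chain rule, Lemma 17.3.1, and (17.17) imply that

(I) for all $x \in \mathbb{R}^d$ it holds that $(0,T] \ni t \mapsto u(t,x) \in \mathbb{R}$ is differentiable,

(II) for all $t \in [0,T]$ it holds that $\mathbb{R}^d \ni x \mapsto u(t,x) \in \mathbb{R}$ is twice differentiable,

(III) for all $t \in (0,T]$, $x \in \mathbb{R}^d$ it holds that
\[
  \frac{\partial u}{\partial t}(t,x) = \mathbb{E}\Big[ \Big\langle (\nabla \varphi)(x + \sqrt{t} B Z), \tfrac{1}{2\sqrt{t}} B Z \Big\rangle \Big], \tag{17.21}
\]
and

(IV) for all $t \in [0,T]$, $x \in \mathbb{R}^d$ it holds that
\[
  (\operatorname{Hess}_x u)(t,x) = \mathbb{E}\big[ (\operatorname{Hess} \varphi)(x + \sqrt{t} B Z) \big] \tag{17.22}
\]
(cf. Definition 1.4.7). Note that items (III) and (IV), the assumption that $\varphi \in C^2(\mathbb{R}^d, \mathbb{R})$, the assumption that
\[
  \sup_{x \in \mathbb{R}^d} \sum_{i,j=1}^d \left( |\varphi(x)| + \big|\big(\tfrac{\partial}{\partial x_i} \varphi\big)(x)\big| + \big|\big(\tfrac{\partial^2}{\partial x_i \partial x_j} \varphi\big)(x)\big| \right) < \infty, \tag{17.23}
\]
the fact that $\mathbb{E}[\|Z\|_2] < \infty$, and Lemma 17.3.1 ensure that
\[
  (0,T] \times \mathbb{R}^d \ni (t,x) \mapsto \frac{\partial u}{\partial t}(t,x) \in \mathbb{R} \tag{17.24}
\]
and
\[
  [0,T] \times \mathbb{R}^d \ni (t,x) \mapsto (\operatorname{Hess}_x u)(t,x) \in \mathbb{R}^{d \times d} \tag{17.25}
\]
are continuous (cf. Definition 3.3.4). Furthermore, observe that item (IV) and the fact that for all $X \in \mathbb{R}^{m \times d}$, $Y \in \mathbb{R}^{d \times m}$ it holds that $\operatorname{Trace}(XY) = \operatorname{Trace}(YX)$ show that for all $t \in (0,T]$, $x \in \mathbb{R}^d$ it holds that
\[
\begin{aligned}
  \tfrac{1}{2} \operatorname{Trace}\big(B B^* (\operatorname{Hess}_x u)(t,x)\big)
  &= \mathbb{E}\Big[ \tfrac{1}{2} \operatorname{Trace}\big(B B^* (\operatorname{Hess} \varphi)(x + \sqrt{t} B Z)\big) \Big] \\
  &= \tfrac{1}{2} \mathbb{E}\Big[ \operatorname{Trace}\big(B^* (\operatorname{Hess} \varphi)(x + \sqrt{t} B Z) B\big) \Big]
   = \tfrac{1}{2} \sum_{k=1}^m \mathbb{E}\Big[ \big\langle e_k, B^* (\operatorname{Hess} \varphi)(x + \sqrt{t} B Z) B e_k \big\rangle \Big] \\
  &= \tfrac{1}{2} \sum_{k=1}^m \mathbb{E}\Big[ \big\langle B e_k, (\operatorname{Hess} \varphi)(x + \sqrt{t} B Z) B e_k \big\rangle \Big]
   = \tfrac{1}{2} \sum_{k=1}^m \mathbb{E}\Big[ \varphi''(x + \sqrt{t} B Z)(B e_k, B e_k) \Big] \\
  &= \tfrac{1}{2t} \sum_{k=1}^m \mathbb{E}\Big[ (\psi_{t,x})''(Z)(e_k, e_k) \Big]
   = \tfrac{1}{2t} \sum_{k=1}^m \mathbb{E}\Big[ \big(\tfrac{\partial^2}{\partial y_k^2} \psi_{t,x}\big)(Z) \Big]
   = \tfrac{1}{2t} \mathbb{E}\big[ (\Delta \psi_{t,x})(Z) \big]
\end{aligned} \tag{17.26}
\]
(cf. Definition 2.4.5). The assumption that $Z \colon \Omega \to \mathbb{R}^m$ is a standard normal random variable and integration by parts therefore demonstrate that for all $t \in (0,T]$, $x \in \mathbb{R}^d$ it holds that
\[
\begin{aligned}
  \tfrac{1}{2} \operatorname{Trace}\big(B B^* (\operatorname{Hess}_x u)(t,x)\big)
  &= \frac{1}{2t} \int_{\mathbb{R}^m} (\Delta \psi_{t,x})(y) \left[ \frac{\exp\big(-\frac{\langle y, y \rangle}{2}\big)}{(2\pi)^{m/2}} \right] dy
   = \frac{1}{2t} \int_{\mathbb{R}^m} \big\langle (\nabla \psi_{t,x})(y), y \big\rangle \left[ \frac{\exp\big(-\frac{\langle y, y \rangle}{2}\big)}{(2\pi)^{m/2}} \right] dy \\
  &= \frac{1}{2\sqrt{t}} \int_{\mathbb{R}^m} \big\langle B^* (\nabla \varphi)(x + \sqrt{t} B y), y \big\rangle \left[ \frac{\exp\big(-\frac{\langle y, y \rangle}{2}\big)}{(2\pi)^{m/2}} \right] dy \\
  &= \frac{1}{2\sqrt{t}} \mathbb{E}\Big[ \big\langle B^* (\nabla \varphi)(x + \sqrt{t} B Z), Z \big\rangle \Big]
   = \mathbb{E}\Big[ \Big\langle (\nabla \varphi)(x + \sqrt{t} B Z), \tfrac{1}{2\sqrt{t}} B Z \Big\rangle \Big].
\end{aligned} \tag{17.27}
\]
Item (III) hence establishes that for all $t \in (0,T]$, $x \in \mathbb{R}^d$ it holds that
\[
  \frac{\partial u}{\partial t}(t,x) = \tfrac{1}{2} \operatorname{Trace}\big(B B^* (\operatorname{Hess}_x u)(t,x)\big). \tag{17.28}
\]
The fundamental theorem of calculus therefore proves that for all $t, s \in (0,T]$, $x \in \mathbb{R}^d$ it holds that
\[
  u(t,x) - u(s,x) = \int_s^t \frac{\partial u}{\partial t}(r,x) \, dr = \int_s^t \tfrac{1}{2} \operatorname{Trace}\big(B B^* (\operatorname{Hess}_x u)(r,x)\big) \, dr. \tag{17.29}
\]
The fact that $[0,T] \times \mathbb{R}^d \ni (t,x) \mapsto (\operatorname{Hess}_x u)(t,x) \in \mathbb{R}^{d \times d}$ is continuous hence implies for all $t \in (0,T]$, $x \in \mathbb{R}^d$ that
\[
  \frac{u(t,x) - u(0,x)}{t} = \lim_{s \searrow 0} \left[ \frac{u(t,x) - u(s,x)}{t} \right] = \frac{1}{t} \int_0^t \tfrac{1}{2} \operatorname{Trace}\big(B B^* (\operatorname{Hess}_x u)(r,x)\big) \, dr. \tag{17.30}
\]
This and the fact that $[0,T] \times \mathbb{R}^d \ni (t,x) \mapsto (\operatorname{Hess}_x u)(t,x) \in \mathbb{R}^{d \times d}$ is continuous ensure that for all $x \in \mathbb{R}^d$ it holds that
\[
\begin{aligned}
  &\limsup_{t \searrow 0} \left| \frac{u(t,x) - u(0,x)}{t} - \tfrac{1}{2} \operatorname{Trace}\big(B B^* (\operatorname{Hess}_x u)(0,x)\big) \right| \\
  &\le \limsup_{t \searrow 0} \left[ \frac{1}{t} \int_0^t \Big| \tfrac{1}{2} \operatorname{Trace}\big(B B^* (\operatorname{Hess}_x u)(s,x)\big) - \tfrac{1}{2} \operatorname{Trace}\big(B B^* (\operatorname{Hess}_x u)(0,x)\big) \Big| \, ds \right] \\
  &\le \limsup_{t \searrow 0} \left[ \sup_{s \in [0,t]} \Big| \tfrac{1}{2} \operatorname{Trace}\Big(B B^* \big( (\operatorname{Hess}_x u)(s,x) - (\operatorname{Hess}_x u)(0,x) \big)\Big) \Big| \right] = 0.
\end{aligned} \tag{17.31}
\]
Item (I) therefore shows that for all $x \in \mathbb{R}^d$ it holds that $[0,T] \ni t \mapsto u(t,x) \in \mathbb{R}$ is differentiable. Combining this with (17.31) and (17.28) ensures that for all $t \in [0,T]$, $x \in \mathbb{R}^d$ it holds that
\[
  \frac{\partial u}{\partial t}(t,x) = \tfrac{1}{2} \operatorname{Trace}\big(B B^* (\operatorname{Hess}_x u)(t,x)\big). \tag{17.32}
\]
This and the fact that $[0,T] \times \mathbb{R}^d \ni (t,x) \mapsto (\operatorname{Hess}_x u)(t,x) \in \mathbb{R}^{d \times d}$ is continuous establish item (i). Note that (17.32) proves item (ii). The proof of Proposition 17.3.2 is thus complete.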
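
The stochastic representation in (17.18) can be illustrated by a Monte Carlo simulation. The following is a minimal sketch, assuming that NumPy is available. The matrix B, the vector c, and the initial value $\varphi(x) = \sin(\langle c, x \rangle)$ are hypothetical choices; for this particular $\varphi$ the function $u(t,x) = \sin(\langle c, x \rangle) \exp(-\tfrac{t}{2} \|B^* c\|_2^2)$ satisfies (17.18) and (17.19), which provides a closed-form reference value.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data with d = m = 2
B = np.array([[1.0, 0.5], [0.0, 2.0]])
c = np.array([0.7, -0.3])


def phi(x):
    # Initial value phi(x) = sin(<c, x>); works for one point or a batch
    return np.sin(x @ c)


def u_monte_carlo(t, x, n=400_000):
    # Sample average approximating E[phi(x + sqrt(t) B Z)]
    Z = rng.standard_normal((n, 2))
    return phi(x + np.sqrt(t) * Z @ B.T).mean()


def u_exact(t, x):
    # Closed-form solution for this particular choice of phi
    return np.sin(x @ c) * np.exp(-0.5 * t * np.sum((B.T @ c) ** 2))


t, x = 0.8, np.array([0.4, -1.1])
mc, exact = u_monte_carlo(t, x), u_exact(t, x)
print("Monte Carlo:", mc, "closed form:", exact)
```

The two printed values should agree up to the Monte Carlo error, which is of order $n^{-1/2}$.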
Definition 17.3.3 (Standard Brownian motions). Let (Ω, F, P) be a probability space.
We say that W is an m-dimensional P-standard Brownian motion (we say that W is a
P-standard Brownian motion, we say that W is a standard Brownian motion) if and only
if there exists T ∈ (0, ∞) such that
(i) it holds that m ∈ N,

(ii) it holds that $W \colon [0,T] \times \Omega \to \mathbb{R}^m$ is a function,

(iii) it holds for all ω ∈ Ω that [0, T ] ∋ s 7→ Ws (ω) ∈ Rm is continuous,

(iv) it holds for all ω ∈ Ω that W0 (ω) = 0 ∈ Rm ,

(v) it holds for all t1 ∈ [0, T ], t2 ∈ [0, T ] with t1 < t2 that Ω ∋ ω 7→ (t2 − t1 )−1/2 (Wt2 (ω) −
Wt1 (ω)) ∈ Rm is a standard normal random variable, and

(vi) it holds for all $n \in \{3, 4, 5, \dots\}$, $t_1, t_2, \dots, t_n \in [0,T]$ with $t_1 \le t_2 \le \dots \le t_n$ that $W_{t_2} - W_{t_1}$, $W_{t_3} - W_{t_2}$, $\dots$, $W_{t_n} - W_{t_{n-1}}$ are independent.

import numpy as np
import matplotlib.pyplot as plt


def generate_brownian_motion(T, N):
    # N independent Gaussian increments, each with variance T / N
    increments = np.random.randn(N) * np.sqrt(T / N)
    # Cumulative sums give the path at the times T/N, 2T/N, ..., T
    BM = np.cumsum(increments)
    # Prepend the starting value W_0 = 0
    BM = np.insert(BM, 0, 0)
    return BM


T = 1
N = 1000
t_values = np.linspace(0, T, N + 1)

fig, axarr = plt.subplots(2, 2)

# Plot one independent trajectory in each of the four panels
for i in range(2):
    for j in range(2):
        BM = generate_brownian_motion(T, N)
        axarr[i, j].plot(t_values, BM)

plt.tight_layout()
plt.savefig("../plots/brownian_motions.pdf")
plt.show()

Source code 17.1 (code/brownian_motion.py): Python code producing four trajectories of a 1-dimensional standard Brownian motion.
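
Items (iv), (v), and (vi) in Definition 17.3.3 can be examined empirically on simulated paths. The following is a minimal sketch, assuming that NumPy is available. The parameter values, the grid indices i1 and i2, and all variable names are illustrative choices; the simulation only produces the paths on a time grid, in the same way as Source code 17.1, and the checks are statistical rather than exact.

```python
import numpy as np

rng = np.random.default_rng(3)
T, N, m, paths = 1.0, 200, 3, 5_000

# Simulate m-dimensional paths on the grid 0, T/N, 2T/N, ..., T
increments = rng.standard_normal((paths, N, m)) * np.sqrt(T / N)
W = np.concatenate(
    [np.zeros((paths, 1, m)), np.cumsum(increments, axis=1)], axis=1
)
times = np.linspace(0.0, T, N + 1)

# Two consecutive increments W_{t1} - W_0 and W_{t2} - W_{t1}
i1, i2 = 50, 150
first = W[:, i1, :] - W[:, 0, :]
second = W[:, i2, :] - W[:, i1, :]

# Item (v): the rescaled increment should have unit variance
normalized = second / np.sqrt(times[i2] - times[i1])
var_normalized = normalized.var()

# Item (vi): consecutive increments should be (nearly) uncorrelated
corr = np.corrcoef(first.ravel(), second.ravel())[0, 1]

print("variance of rescaled increment:", var_normalized)
print("correlation of consecutive increments:", corr)
```

Vanishing correlation is only a necessary consequence of independence, so the second check is a plausibility test and not a verification of item (vi).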

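Corollary 17.3.4. Let $T \in (0, \infty)$, $d, m \in \mathbb{N}$, $B \in \mathbb{R}^{d \times m}$, $\varphi \in C^2(\mathbb{R}^d, \mathbb{R})$ satisfy
\[
  \sup_{x \in \mathbb{R}^d} \sum_{i,j=1}^d \left( |\varphi(x)| + \big|\big(\tfrac{\partial}{\partial x_i} \varphi\big)(x)\big| + \big|\big(\tfrac{\partial^2}{\partial x_i \partial x_j} \varphi\big)(x)\big| \right) < \infty, \tag{17.33}
\]
let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $W \colon [0,T] \times \Omega \to \mathbb{R}^m$ be a standard Brownian motion, and let $u \colon [0,T] \times \mathbb{R}^d \to \mathbb{R}$ satisfy for all $t \in [0,T]$, $x \in \mathbb{R}^d$ that
\[
  u(t,x) = \mathbb{E}\big[\varphi(x + B W_t)\big] \tag{17.34}
\]
(cf. Definition 17.3.3). Then
(i) it holds that $u \in C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R})$ and
(ii) it holds for all $t \in [0,T]$, $x \in \mathbb{R}^d$ that
\[
  \frac{\partial u}{\partial t}(t,x) = \tfrac{1}{2} \operatorname{Trace}\big(B B^* (\operatorname{Hess}_x u)(t,x)\big) \tag{17.35}
\]
(cf. Definition 2.4.5).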


Proof of Corollary 17.3.4. First, observe that the assumption that $W \colon [0,T] \times \Omega \to \mathbb{R}^m$ is a standard Brownian motion demonstrates that for all $t \in [0,T]$, $x \in \mathbb{R}^d$ it holds that
\[
  u(t,x) = \mathbb{E}\big[\varphi(x + B W_t)\big] = \mathbb{E}\Big[ \varphi\Big( x + \sqrt{t} B \tfrac{W_T}{\sqrt{T}} \Big) \Big]. \tag{17.36}
\]
The fact that $\frac{W_T}{\sqrt{T}} \colon \Omega \to \mathbb{R}^m$ is a standard normal random variable and Proposition 17.3.2 hence establish items (i) and (ii). The proof of Corollary 17.3.4 is thus complete.

[Figure: four panels, each showing one simulated trajectory over the time interval [0, 1].]

Figure 17.1 (plots/brownian_motions.pdf): Four trajectories of a 1-dimensional standard Brownian motion

17.3.2 Feynman–Kac formulas providing uniqueness of solutions


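Lemma 17.3.5 (A special case of Vitali's convergence theorem). Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $X_n \colon \Omega \to \mathbb{R}$, $n \in \mathbb{N}_0$, be random variables with
\[
  \mathbb{P}\big( \limsup_{n \to \infty} |X_n - X_0| = 0 \big) = 1, \tag{17.37}
\]
and let $p \in (1, \infty)$ satisfy $\sup_{n \in \mathbb{N}} \mathbb{E}[|X_n|^p] < \infty$. Then
(i) it holds that $\limsup_{n \to \infty} \mathbb{E}[|X_n - X_0|] = 0$,
(ii) it holds that $\mathbb{E}[|X_0|] < \infty$, and
(iii) it holds that $\limsup_{n \to \infty} |\mathbb{E}[X_n] - \mathbb{E}[X_0]| = 0$.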
Proof of Lemma 17.3.5. First, note that the assumption that
\[
  \sup_{n \in \mathbb{N}} \mathbb{E}\big[|X_n|^p\big] < \infty \tag{17.38}
\]
and, for example, the consequence of de la Vallée-Poussin's theorem in Klenke [248, Corollary 6.21] imply that $\{X_n \colon n \in \mathbb{N}\}$ is uniformly integrable. This, (17.37), and Vitali's convergence theorem in, for instance, Klenke [248, Theorem 6.25] prove items (i) and (ii). Observe that items (i) and (ii) establish item (iii). The proof of Lemma 17.3.5 is thus complete.

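Proposition 17.3.6. Let $d \in \mathbb{N}$, $T, \rho \in (0, \infty)$, $f \in C([0,T] \times \mathbb{R}^d, \mathbb{R})$, let $u \in C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R})$ have at most polynomially growing partial derivatives, assume for all $t \in [0,T]$, $x \in \mathbb{R}^d$ that
\[
  \frac{\partial u}{\partial t}(t,x) = \rho \, (\Delta_x u)(t,x) + f(t,x), \tag{17.39}
\]
let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and let $W \colon [0,T] \times \Omega \to \mathbb{R}^d$ be a standard Brownian motion (cf. Definition 17.3.3). Then it holds for all $t \in [0,T]$, $x \in \mathbb{R}^d$ that
\[
  u(t,x) = \mathbb{E}\left[ u(0, x + \sqrt{2\rho} \, W_t) + \int_0^t f(t - s, x + \sqrt{2\rho} \, W_s) \, ds \right]. \tag{17.40}
\]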

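Proof of Proposition 17.3.6. Throughout this proof, let $D_1 \colon [0,T] \times \mathbb{R}^d \to \mathbb{R}$ satisfy for all $t \in [0,T]$, $x \in \mathbb{R}^d$ that
\[
  D_1(t,x) = \frac{\partial u}{\partial t}(t,x), \tag{17.41}
\]
let $D_2 = (D_{2,1}, D_{2,2}, \dots, D_{2,d}) \colon [0,T] \times \mathbb{R}^d \to \mathbb{R}^d$ satisfy for all $t \in [0,T]$, $x \in \mathbb{R}^d$ that $D_2(t,x) = (\nabla_x u)(t,x)$, let $H = (H_{i,j})_{i,j \in \{1,2,\dots,d\}} \colon [0,T] \times \mathbb{R}^d \to \mathbb{R}^{d \times d}$ satisfy for all $t \in [0,T]$, $x \in \mathbb{R}^d$ that
\[
  H(t,x) = (\operatorname{Hess}_x u)(t,x), \tag{17.42}
\]
let $\gamma \colon \mathbb{R}^d \to \mathbb{R}$ satisfy for all $z \in \mathbb{R}^d$ that
\[
  \gamma(z) = (2\pi)^{-d/2} \exp\Big( -\frac{\|z\|_2^2}{2} \Big), \tag{17.43}
\]
and let $v_{t,x} \colon [0,t] \to \mathbb{R}$, $t \in [0,T]$, $x \in \mathbb{R}^d$, satisfy for all $t \in [0,T]$, $x \in \mathbb{R}^d$, $s \in [0,t]$ that
\[
  v_{t,x}(s) = \mathbb{E}\big[ u(s, x + \sqrt{2\rho} \, W_{t-s}) \big] \tag{17.44}
\]
(cf. Definition 3.3.4). Note that the assumption that $W$ is a standard Brownian motion ensures that for all $t \in (0,T]$, $s \in [0,t)$ it holds that $(t-s)^{-1/2} W_{t-s} \colon \Omega \to \mathbb{R}^d$ is a standard normal random variable. This shows that for all $t \in (0,T]$, $x \in \mathbb{R}^d$, $s \in [0,t)$ it holds that
\[
\begin{aligned}
  v_{t,x}(s) &= \mathbb{E}\big[ u(s, x + \sqrt{2\rho(t-s)} \, (t-s)^{-1/2} W_{t-s}) \big] \\
  &= \int_{\mathbb{R}^d} u(s, x + \sqrt{2\rho(t-s)} \, z) \, \gamma(z) \, dz.
\end{aligned} \tag{17.45}
\]
The assumption that $u$ has at most polynomially growing partial derivatives, the fact that $(0, \infty) \ni s \mapsto \sqrt{s} \in (0, \infty)$ is differentiable, the chain rule, and Vitali's convergence theorem therefore demonstrate that for all $t \in (0,T]$, $x \in \mathbb{R}^d$, $s \in [0,t)$ it holds that $v_{t,x}|_{[0,t)} \in C^1([0,t), \mathbb{R})$ and
\[
  (v_{t,x})'(s) = \int_{\mathbb{R}^d} \left[ D_1(s, x + \sqrt{2\rho(t-s)} \, z) + \Big\langle D_2(s, x + \sqrt{2\rho(t-s)} \, z), \tfrac{-\rho z}{\sqrt{2\rho(t-s)}} \Big\rangle \right] \gamma(z) \, dz \tag{17.46}
\]
(cf. Definition 1.4.7). Furthermore, observe that the fact that for all $z \in \mathbb{R}^d$ it holds that $(\nabla \gamma)(z) = -\gamma(z) z$ implies that for all $t \in (0,T]$, $x \in \mathbb{R}^d$, $s \in [0,t)$ it holds that
\[
\begin{aligned}
  &\int_{\mathbb{R}^d} \Big\langle D_2(s, x + \sqrt{2\rho(t-s)} \, z), \tfrac{-\rho z}{\sqrt{2\rho(t-s)}} \Big\rangle \gamma(z) \, dz \\
  &= \int_{\mathbb{R}^d} \Big\langle D_2(s, x + \sqrt{2\rho(t-s)} \, z), \tfrac{\rho (\nabla \gamma)(z)}{\sqrt{2\rho(t-s)}} \Big\rangle \, dz \\
  &= \frac{\rho}{\sqrt{2\rho(t-s)}} \sum_{i=1}^d \left[ \int_{\mathbb{R}^d} D_{2,i}(s, x + \sqrt{2\rho(t-s)} \, z) \big(\tfrac{\partial \gamma}{\partial z_i}\big)(z_1, z_2, \dots, z_d) \, dz \right].
\end{aligned} \tag{17.47}
\]
Moreover, note that integration by parts proves that for all $t \in (0,T]$, $x \in \mathbb{R}^d$, $s \in [0,t)$, $i \in \{1, 2, \dots, d\}$, $a \in \mathbb{R}$, $b \in (a, \infty)$ it holds that
\[
\begin{aligned}
  &\int_a^b D_{2,i}(s, x + \sqrt{2\rho(t-s)} \, (z_1, z_2, \dots, z_d)) \big(\tfrac{\partial \gamma}{\partial z_i}\big)(z_1, z_2, \dots, z_d) \, dz_i \\
  &= \Big[ D_{2,i}(s, x + \sqrt{2\rho(t-s)} \, (z_1, z_2, \dots, z_d)) \, \gamma(z_1, z_2, \dots, z_d) \Big]_{z_i = a}^{z_i = b} \\
  &\quad - \int_a^b \sqrt{2\rho(t-s)} \, H_{i,i}(s, x + \sqrt{2\rho(t-s)} \, (z_1, z_2, \dots, z_d)) \, \gamma(z_1, z_2, \dots, z_d) \, dz_i.
\end{aligned} \tag{17.48}
\]
The assumption that $u$ has at most polynomially growing derivatives hence establishes that for all $t \in (0,T]$, $x \in \mathbb{R}^d$, $s \in [0,t)$, $i \in \{1, 2, \dots, d\}$ it holds that
\[
\begin{aligned}
  &\int_{\mathbb{R}} D_{2,i}(s, x + \sqrt{2\rho(t-s)} \, (z_1, z_2, \dots, z_d)) \big(\tfrac{\partial \gamma}{\partial z_i}\big)(z_1, z_2, \dots, z_d) \, dz_i \\
  &= -\sqrt{2\rho(t-s)} \int_{\mathbb{R}} H_{i,i}(s, x + \sqrt{2\rho(t-s)} \, (z_1, z_2, \dots, z_d)) \, \gamma(z_1, z_2, \dots, z_d) \, dz_i.
\end{aligned} \tag{17.49}
\]
Combining this with (17.47) and Fubini's theorem ensures that for all $t \in (0,T]$, $x \in \mathbb{R}^d$, $s \in [0,t)$ it holds that
\[
\begin{aligned}
  \int_{\mathbb{R}^d} \Big\langle D_2(s, x + \sqrt{2\rho(t-s)} \, z), \tfrac{-\rho z}{\sqrt{2\rho(t-s)}} \Big\rangle \gamma(z) \, dz
  &= -\rho \sum_{i=1}^d \int_{\mathbb{R}^d} H_{i,i}(s, x + \sqrt{2\rho(t-s)} \, z) \, \gamma(z) \, dz \\
  &= -\int_{\mathbb{R}^d} \rho \operatorname{Trace}\big( H(s, x + \sqrt{2\rho(t-s)} \, z) \big) \gamma(z) \, dz.
\end{aligned} \tag{17.50}
\]
This, (17.46), (17.39), and the fact that for all $t \in (0,T]$, $s \in [0,t)$ it holds that $(t-s)^{-1/2} W_{t-s} \colon \Omega \to \mathbb{R}^d$ is a standard normal random variable show that for all $t \in (0,T]$, $x \in \mathbb{R}^d$, $s \in [0,t)$ it holds that
\[
\begin{aligned}
  (v_{t,x})'(s) &= \int_{\mathbb{R}^d} \Big[ D_1(s, x + \sqrt{2\rho(t-s)} \, z) - \rho \operatorname{Trace}\big( H(s, x + \sqrt{2\rho(t-s)} \, z) \big) \Big] \gamma(z) \, dz \\
  &= \int_{\mathbb{R}^d} f(s, x + \sqrt{2\rho(t-s)} \, z) \, \gamma(z) \, dz = \mathbb{E}\big[ f(s, x + \sqrt{2\rho} \, W_{t-s}) \big].
\end{aligned} \tag{17.51}
\]
The fact that $W_0 = 0$, the fact that for all $t \in [0,T]$, $x \in \mathbb{R}^d$ it holds that $v_{t,x} \colon [0,t] \to \mathbb{R}$ is continuous, and the fundamental theorem of calculus therefore demonstrate that for all $t \in [0,T]$, $x \in \mathbb{R}^d$ it holds that
\[
\begin{aligned}
  u(t,x) &= \mathbb{E}\big[ u(t, x + \sqrt{2\rho} \, W_{t-t}) \big] = v_{t,x}(t) = v_{t,x}(0) + \int_0^t (v_{t,x})'(s) \, ds \\
  &= \mathbb{E}\big[ u(0, x + \sqrt{2\rho} \, W_t) \big] + \int_0^t \mathbb{E}\big[ f(s, x + \sqrt{2\rho} \, W_{t-s}) \big] \, ds.
\end{aligned} \tag{17.52}
\]
Fubini's theorem and the fact that $u$ and $f$ are at most polynomially growing hence imply (17.40). The proof of Proposition 17.3.6 is thus complete.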
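
As a sanity check of (17.40), both sides can be evaluated by hand in a simple case. The following worked example is a sketch under hypothetical choices which are not taken from the text: $d = 1$, $u(t,x) = x^2 + t$, and the constant source term $f(t,x) = 1 - 2\rho$.

```latex
% Hypothetical example: d = 1, u(t,x) = x^2 + t, f(t,x) = 1 - 2\rho.
% Step 1: u satisfies (17.39), since
\[
  \frac{\partial u}{\partial t}(t,x) = 1
  = \rho \cdot 2 + (1 - 2\rho)
  = \rho \, (\Delta_x u)(t,x) + f(t,x).
\]
% Step 2: evaluate the right-hand side of (17.40), using
% E[W_t] = 0 and E[|W_t|^2] = t.
\[
\begin{aligned}
  \mathbb{E}\Big[ u(0, x + \sqrt{2\rho}\, W_t)
    + \int_0^t f(t-s, x + \sqrt{2\rho}\, W_s) \, ds \Big]
  &= \mathbb{E}\big[ (x + \sqrt{2\rho}\, W_t)^2 \big] + (1 - 2\rho)\, t \\
  &= x^2 + 2\rho t + (1 - 2\rho)\, t \\
  &= x^2 + t = u(t,x).
\end{aligned}
\]
```

In this example the diffusion contribution $2\rho t$ of the initial value is exactly compensated by the source term, so that both sides of (17.40) coincide.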

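Corollary 17.3.7. Let $d \in \mathbb{N}$, $T, \rho \in (0, \infty)$, $\varrho = \sqrt{2\rho T}$, $a \in \mathbb{R}$, $b \in (a, \infty)$, let $\varphi \colon \mathbb{R}^d \to \mathbb{R}$ be a function, let $u \in C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R})$ have at most polynomially growing partial derivatives, assume for all $t \in [0,T]$, $x \in \mathbb{R}^d$ that $u(0,x) = \varphi(x)$ and
\[
  \frac{\partial u}{\partial t}(t,x) = \rho \, (\Delta_x u)(t,x), \tag{17.53}
\]
let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, and let $W \colon \Omega \to \mathbb{R}^d$ be a standard normal random variable. Then
(i) it holds that $\varphi \colon \mathbb{R}^d \to \mathbb{R}$ is twice continuously differentiable with at most polynomially growing partial derivatives and
(ii) it holds for all $x \in \mathbb{R}^d$ that $u(T,x) = \mathbb{E}\big[\varphi(\varrho W + x)\big]$.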
Proof of Corollary 17.3.7. Observe that the assumption that u ∈ C 1,2 ([0, T ] × Rd , R) has
at most polynomially growing partial derivatives and the fact that for all x ∈ Rd it holds
that φ(x) = u(0, x) prove item (i). Furthermore, note that Proposition 17.3.6 establishes
item (ii). The proof of Corollary 17.3.7 is thus complete.
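Definition 17.3.8 (Continuous convolutions). Let $d \in \mathbb{N}$ and let $f \colon \mathbb{R}^d \to \mathbb{R}$ and $g \colon \mathbb{R}^d \to \mathbb{R}$ be $\mathcal{B}(\mathbb{R}^d)/\mathcal{B}(\mathbb{R})$-measurable. Then we denote by
\[
  f \circledast g \colon \Big\{ x \in \mathbb{R}^d \colon \min\Big\{ \int_{\mathbb{R}^d} \max\{0, f(x-y) g(y)\} \, dy, \; -\int_{\mathbb{R}^d} \min\{0, f(x-y) g(y)\} \, dy \Big\} < \infty \Big\} \to [-\infty, \infty] \tag{17.54}
\]
the function which satisfies for all $x \in \mathbb{R}^d$ with
\[
  \min\Big\{ \int_{\mathbb{R}^d} \max\{0, f(x-y) g(y)\} \, dy, \; -\int_{\mathbb{R}^d} \min\{0, f(x-y) g(y)\} \, dy \Big\} < \infty \tag{17.55}
\]
that
\[
  (f \circledast g)(x) = \int_{\mathbb{R}^d} f(x-y) g(y) \, dy. \tag{17.56}
\]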

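Exercise 17.3.1. Let $d \in \mathbb{N}$, $T \in (0, \infty)$, for every $\sigma \in (0, \infty)$ let $\gamma_\sigma \colon \mathbb{R}^d \to \mathbb{R}$ satisfy for all $x \in \mathbb{R}^d$ that
\[
  \gamma_\sigma(x) = (2\pi\sigma^2)^{-\frac{d}{2}} \exp\Big( \frac{-\|x\|_2^2}{2\sigma^2} \Big), \tag{17.57}
\]
and for every $\rho \in (0, \infty)$, $\varphi \in C^2(\mathbb{R}^d, \mathbb{R})$ with $\sup_{x \in \mathbb{R}^d} \sum_{i,j=1}^d \big( |\varphi(x)| + |(\tfrac{\partial}{\partial x_i} \varphi)(x)| + |(\tfrac{\partial^2}{\partial x_i \partial x_j} \varphi)(x)| \big) < \infty$ let $u_{\rho,\varphi} \colon [0,T] \times \mathbb{R}^d \to \mathbb{R}$ satisfy for all $t \in (0,T]$, $x \in \mathbb{R}^d$ that
\[
  u_{\rho,\varphi}(0,x) = \varphi(x) \qquad \text{and} \qquad u_{\rho,\varphi}(t,x) = (\varphi \circledast \gamma_{\sqrt{2t\rho}})(x) \tag{17.58}
\]
(cf. Definitions 3.3.4 and 17.3.8). Prove or disprove the following statement: For all $\rho \in (0, \infty)$, $\varphi \in C^2(\mathbb{R}^d, \mathbb{R})$ with $\sup_{x \in \mathbb{R}^d} \sum_{i,j=1}^d \big( |\varphi(x)| + |(\tfrac{\partial}{\partial x_i} \varphi)(x)| + |(\tfrac{\partial^2}{\partial x_i \partial x_j} \varphi)(x)| \big) < \infty$ it holds for all $t \in (0,T)$, $x \in \mathbb{R}^d$ that $u_{\rho,\varphi} \in C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R})$ and
\[
  \frac{\partial u_{\rho,\varphi}}{\partial t}(t,x) = \rho \, (\Delta_x u_{\rho,\varphi})(t,x). \tag{17.59}
\]
Exercise 17.3.2. Prove or disprove the following statement: For every $x \in \mathbb{R}$ it holds that
\[
  e^{-x^2/2} = \frac{1}{\sqrt{2\pi}} \left[ \int_{\mathbb{R}} e^{-t^2/2} e^{-ixt} \, dt \right]. \tag{17.60}
\]
Exercise 17.3.3. Let $d \in \mathbb{N}$, $T \in (0, \infty)$, for every $\sigma \in (0, \infty)$ let $\gamma_\sigma \colon \mathbb{R}^d \to \mathbb{R}$ satisfy for all $x \in \mathbb{R}^d$ that
\[
  \gamma_\sigma(x) = (2\pi\sigma^2)^{-\frac{d}{2}} \exp\Big( \frac{-\|x\|_2^2}{2\sigma^2} \Big), \tag{17.61}
\]
for every $\varphi \in C^2(\mathbb{R}^d, \mathbb{R})$ with $\sup_{x \in \mathbb{R}^d} \sum_{i,j=1}^d \big( |\varphi(x)| + |(\tfrac{\partial}{\partial x_i} \varphi)(x)| + |(\tfrac{\partial^2}{\partial x_i \partial x_j} \varphi)(x)| \big) < \infty$ let $u_\varphi \colon [0,T] \times \mathbb{R}^d \to \mathbb{R}$ satisfy for all $t \in (0,T]$, $x \in \mathbb{R}^d$ that
\[
  u_\varphi(0,x) = \varphi(x) \qquad \text{and} \qquad u_\varphi(t,x) = (\varphi \circledast \gamma_{\sqrt{2t}})(x), \tag{17.62}
\]
and for every $i = (i_1, \dots, i_d) \in \mathbb{N}^d$ let $\psi_i \colon \mathbb{R}^d \to \mathbb{R}$ satisfy for all $x = (x_1, \dots, x_d) \in \mathbb{R}^d$ that
\[
  \psi_i(x) = 2^{\frac{d}{2}} \left[ \prod_{k=1}^d \sin(i_k \pi x_k) \right] \tag{17.63}
\]
(cf. Definitions 3.3.4 and 17.3.8). Prove or disprove the following statement: For all $i = (i_1, \dots, i_d) \in \mathbb{N}^d$, $t \in [0,T]$, $x \in \mathbb{R}^d$ it holds that
\[
  u_{\psi_i}(t,x) = \exp\Big( -\pi^2 \Big[ \sum_{k=1}^d |i_k|^2 \Big] t \Big) \psi_i(x). \tag{17.64}
\]
Exercise 17.3.4. Let $d \in \mathbb{N}$, $T \in (0, \infty)$, for every $\sigma \in (0, \infty)$ let $\gamma_\sigma \colon \mathbb{R}^d \to \mathbb{R}$ satisfy for all $x \in \mathbb{R}^d$ that
\[
  \gamma_\sigma(x) = (2\pi\sigma^2)^{-\frac{d}{2}} \exp\Big( \frac{-\|x\|_2^2}{2\sigma^2} \Big), \tag{17.65}
\]
and for every $i = (i_1, \dots, i_d) \in \mathbb{N}^d$ let $\psi_i \colon \mathbb{R}^d \to \mathbb{R}$ satisfy for all $x = (x_1, \dots, x_d) \in \mathbb{R}^d$ that
\[
  \psi_i(x) = 2^{\frac{d}{2}} \left[ \prod_{k=1}^d \sin(i_k \pi x_k) \right] \tag{17.66}
\]
(cf. Definition 3.3.4). Prove or disprove the following statement: For every $i = (i_1, \dots, i_d) \in \mathbb{N}^d$, $s \in [0,T]$, $y \in \mathbb{R}^d$ and every function $u \in C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R})$ with at most polynomially growing partial derivatives which satisfies for all $t \in (0,T)$, $x \in \mathbb{R}^d$ that $u(0,x) = \psi_i(x)$ and
\[
  \frac{\partial u}{\partial t}(t,x) = (\Delta_x u)(t,x) \tag{17.67}
\]
it holds that
\[
  u(s,y) = \exp\Big( -\pi^2 \Big[ \sum_{k=1}^d |i_k|^2 \Big] s \Big) \psi_i(y). \tag{17.68}
\]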

17.4 Reformulation of PDE problems as stochastic optimization problems
The proof of the next result, Proposition 17.4.1 below, is based on an application of
Proposition 17.2.1 and Proposition 17.3.6. A more general result than Proposition 17.4.1
with a detailed proof can, for example, be found in Beck et al. [18, Proposition 2.7].

Proposition 17.4.1. Let $d \in \mathbb{N}$, $T, \rho \in (0, \infty)$, $\varrho = \sqrt{2\rho T}$, $a \in \mathbb{R}$, $b \in (a, \infty)$, let $\varphi \colon \mathbb{R}^d \to \mathbb{R}$ be a function, let $u \in C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R})$ have at most polynomially growing partial derivatives, assume for all $t \in [0,T]$, $x \in \mathbb{R}^d$ that $u(0,x) = \varphi(x)$ and
\[
  \frac{\partial u}{\partial t}(t,x) = \rho \, (\Delta_x u)(t,x), \tag{17.69}
\]
let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $W \colon \Omega \to \mathbb{R}^d$ be a standard normal random variable, let $X \colon \Omega \to [a,b]^d$ be a continuously uniformly distributed random variable, and assume that $W$ and $X$ are independent. Then
(i) it holds that $\varphi \colon \mathbb{R}^d \to \mathbb{R}$ is twice continuously differentiable with at most polynomially growing partial derivatives,
(ii) there exists a unique continuous function $U \colon [a,b]^d \to \mathbb{R}$ such that
\[
  \mathbb{E}\big[ |\varphi(\varrho W + X) - U(X)|^2 \big] = \inf_{v \in C([a,b]^d, \mathbb{R})} \mathbb{E}\big[ |\varphi(\varrho W + X) - v(X)|^2 \big], \tag{17.70}
\]
and
(iii) it holds for every $x \in [a,b]^d$ that $U(x) = u(T,x)$.

Proof of Proposition 17.4.1. First, observe that (17.69), the assumption that $W$ is a standard normal random variable, and Corollary 17.3.7 ensure that $\varphi \colon \mathbb{R}^d \to \mathbb{R}$ is twice continuously differentiable with at most polynomially growing partial derivatives and that for all $x \in \mathbb{R}^d$ it holds that
\[
  u(T,x) = \mathbb{E}\big[ u(0, \varrho W + x) \big] = \mathbb{E}\big[ \varphi(\varrho W + x) \big]. \tag{17.71}
\]
Furthermore, note that the assumption that $W$ is a standard normal random variable and the fact that $\varphi$ is continuous and has at most polynomially growing partial derivatives show that

(I) it holds that $[a,b]^d \times \Omega \ni (x, \omega) \mapsto \varphi(\varrho W(\omega) + x) \in \mathbb{R}$ is $(\mathcal{B}([a,b]^d) \otimes \mathcal{F})/\mathcal{B}(\mathbb{R})$-measurable and

(II) it holds for all $x \in [a,b]^d$ that $\mathbb{E}[|\varphi(\varrho W + x)|^2] < \infty$.

Proposition 17.2.1 and (17.71) hence ensure that

(A) there exists a unique continuous function $U \colon [a,b]^d \to \mathbb{R}$ which satisfies that
\[
  \int_{[a,b]^d} \mathbb{E}\big[ |\varphi(\varrho W + x) - U(x)|^2 \big] \, dx
  = \inf_{v \in C([a,b]^d, \mathbb{R})} \left( \int_{[a,b]^d} \mathbb{E}\big[ |\varphi(\varrho W + x) - v(x)|^2 \big] \, dx \right) \tag{17.72}
\]
and

(B) it holds for all $x \in [a,b]^d$ that $U(x) = u(T,x)$.

Moreover, observe that the assumption that $W$ and $X$ are independent, item (I), and the assumption that $X$ is continuously uniformly distributed on $[a,b]^d$ demonstrate that for all $v \in C([a,b]^d, \mathbb{R})$ it holds that
\[
  \mathbb{E}\big[ |\varphi(\varrho W + X) - v(X)|^2 \big] = \frac{1}{(b-a)^d} \int_{[a,b]^d} \mathbb{E}\big[ |\varphi(\varrho W + x) - v(x)|^2 \big] \, dx. \tag{17.73}
\]
Combining this with item (A) implies item (ii). Note that items (A) and (B) and (17.73) prove item (iii). The proof of Proposition 17.4.1 is thus complete.
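
The minimization property in item (ii) of Proposition 17.4.1 can be observed numerically by estimating the expected squared error for a few candidate functions v. The following is a minimal sketch, assuming that NumPy is available. It uses the hypothetical setting d = 1, $\varphi = \sin$, and $[a,b] = [-\pi, \pi]$, in which $u(t,x) = e^{-\rho t} \sin(x)$ solves the heat equation in (17.69); the parameter values and the function and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
rho, T, a, b, n = 0.5, 1.0, -np.pi, np.pi, 400_000
varrho = np.sqrt(2.0 * rho * T)

phi = np.sin  # hypothetical initial value


def u_T(x):
    # Exact solution at time T for phi = sin in dimension d = 1
    return np.exp(-rho * T) * np.sin(x)


# Independent samples of W (standard normal) and X (uniform on [a, b])
W = rng.standard_normal(n)
X = rng.uniform(a, b, size=n)
target = phi(varrho * W + X)


def loss(v):
    """Sample average approximating E[|phi(varrho W + X) - v(X)|^2]."""
    return np.mean((target - v(X)) ** 2)


loss_solution = loss(u_T)          # candidate v = u(T, .)
loss_initial = loss(phi)           # candidate v = phi
loss_zero = loss(np.zeros_like)    # candidate v = 0

print("v = u(T, .):", loss_solution)
print("v = phi:    ", loss_initial)
print("v = 0:      ", loss_zero)
```

Among the three candidates, the estimated loss should be smallest for v = u(T, ·), in accordance with items (ii) and (iii).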

While Proposition 17.4.1 above recasts the solutions of the PDE in (17.69) at a particular
point in time as the solutions of a stochastic optimization problem, we can also derive from
this a corollary which shows that the solutions of the PDE over an entire timespan are
similarly the solutions of a stochastic optimization problem.



Corollary 17.4.2. Let $d \in \mathbb{N}$, $T, \rho \in (0, \infty)$, $\varrho = \sqrt{2\rho}$, $a \in \mathbb{R}$, $b \in (a, \infty)$, let $\varphi \colon \mathbb{R}^d \to \mathbb{R}$ be a function, let $u \in C^{1,2}([0,T] \times \mathbb{R}^d, \mathbb{R})$ be a function with at most polynomially growing partial derivatives which satisfies for all $t \in [0,T]$, $x \in \mathbb{R}^d$ that $u(0,x) = \varphi(x)$ and
\[
  \frac{\partial u}{\partial t}(t,x) = \rho \, (\Delta_x u)(t,x), \tag{17.74}
\]
let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space, let $W \colon \Omega \to \mathbb{R}^d$ be a standard normal random variable, let $\tau \colon \Omega \to [0,T]$ be a continuously uniformly distributed random variable, let $X \colon \Omega \to [a,b]^d$ be a continuously uniformly distributed random variable, and assume that $W$, $\tau$, and $X$ are independent. Then
(i) there exists a unique $U \in C([0,T] \times [a,b]^d, \mathbb{R})$ which satisfies that
\[
  \mathbb{E}\big[ |\varphi(\varrho \sqrt{\tau} W + X) - U(\tau, X)|^2 \big]
  = \inf_{v \in C([0,T] \times [a,b]^d, \mathbb{R})} \mathbb{E}\big[ |\varphi(\varrho \sqrt{\tau} W + X) - v(\tau, X)|^2 \big] \tag{17.75}
\]
and
(ii) it holds for all $t \in [0,T]$, $x \in [a,b]^d$ that $U(t,x) = u(t,x)$.
Proof of Corollary 17.4.2. Throughout this proof, let $F \colon C([0,T] \times [a,b]^d, \mathbb{R}) \to [0, \infty]$ satisfy for all $v \in C([0,T] \times [a,b]^d, \mathbb{R})$ that
\[
  F(v) = \mathbb{E}\big[ |\varphi(\varrho \sqrt{\tau} W + X) - v(\tau, X)|^2 \big]. \tag{17.76}
\]
Observe that Proposition 17.4.1 establishes that for all $v \in C([0,T] \times [a,b]^d, \mathbb{R})$, $s \in [0,T]$ it holds that
\[
  \mathbb{E}\big[ |\varphi(\varrho \sqrt{s} W + X) - v(s, X)|^2 \big] \ge \mathbb{E}\big[ |\varphi(\varrho \sqrt{s} W + X) - u(s, X)|^2 \big]. \tag{17.77}
\]
Furthermore, note that the assumption that $W$, $\tau$, and $X$ are independent, the assumption that $\tau \colon \Omega \to [0,T]$ is continuously uniformly distributed, and Fubini's theorem ensure that for all $v \in C([0,T] \times [a,b]^d, \mathbb{R})$ it holds that
\[
  F(v) = \mathbb{E}\big[ |\varphi(\varrho \sqrt{\tau} W + X) - v(\tau, X)|^2 \big] = \frac{1}{T} \int_{[0,T]} \mathbb{E}\big[ |\varphi(\varrho \sqrt{s} W + X) - v(s, X)|^2 \big] \, ds. \tag{17.78}
\]
This and (17.77) show that for all $v \in C([0,T] \times [a,b]^d, \mathbb{R})$ it holds that
\[
  F(v) \ge \frac{1}{T} \int_{[0,T]} \mathbb{E}\big[ |\varphi(\varrho \sqrt{s} W + X) - u(s, X)|^2 \big] \, ds. \tag{17.79}
\]
Combining this with (17.78) demonstrates that for all $v \in C([0,T] \times [a,b]^d, \mathbb{R})$ it holds that $F(v) \ge F(u)$. Therefore, we obtain that
\[
  F(u) = \inf_{v \in C([0,T] \times [a,b]^d, \mathbb{R})} F(v). \tag{17.80}
\]
This and (17.78) imply that for all $U \in C([0,T] \times [a,b]^d, \mathbb{R})$ with
\[
  F(U) = \inf_{v \in C([0,T] \times [a,b]^d, \mathbb{R})} F(v) \tag{17.81}
\]
it holds that
\[
  \int_{[0,T]} \mathbb{E}\big[ |\varphi(\varrho \sqrt{s} W + X) - U(s, X)|^2 \big] \, ds = \int_{[0,T]} \mathbb{E}\big[ |\varphi(\varrho \sqrt{s} W + X) - u(s, X)|^2 \big] \, ds. \tag{17.82}
\]
Combining this with (17.77) proves that for all $U \in C([0,T] \times [a,b]^d, \mathbb{R})$ with $F(U) = \inf_{v \in C([0,T] \times [a,b]^d, \mathbb{R})} F(v)$ there exists $A \subseteq [0,T]$ with $\int_A 1 \, dx = T$ such that for all $s \in A$ it holds that
\[
  \mathbb{E}\big[ |\varphi(\varrho \sqrt{s} W + X) - U(s, X)|^2 \big] = \mathbb{E}\big[ |\varphi(\varrho \sqrt{s} W + X) - u(s, X)|^2 \big]. \tag{17.83}
\]
Proposition 17.4.1 therefore establishes that for all $U \in C([0,T] \times [a,b]^d, \mathbb{R})$ with $F(U) = \inf_{v \in C([0,T] \times [a,b]^d, \mathbb{R})} F(v)$ there exists $A \subseteq [0,T]$ with $\int_A 1 \, dx = T$ such that for all $s \in A$ it holds that $U(s, \cdot) = u(s, \cdot)|_{[a,b]^d}$. The fact that $u \in C([0,T] \times [a,b]^d, \mathbb{R})$ hence ensures that for all $U \in C([0,T] \times [a,b]^d, \mathbb{R})$ with $F(U) = \inf_{v \in C([0,T] \times [a,b]^d, \mathbb{R})} F(v)$ it holds that $U = u$. Combining this with (17.80) proves items (i) and (ii). The proof of Corollary 17.4.2 is thus complete.
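
The optimization problem in Corollary 17.4.2 can be tried out in a setting where the minimizer is known in closed form. The following is a minimal sketch, assuming that NumPy is available. It uses the hypothetical choices d = 1 and $\varphi(x) = x^2$, for which $u(t,x) = x^2 + 2\rho t$ solves (17.74), and it replaces the class of all continuous functions by the linear span of the features 1, t, x, and $x^2$ (instead of ANNs), so that the empirical minimization reduces to a linear least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(5)
rho, T, a, b, n = 0.3, 2.0, -1.0, 1.0, 400_000
varrho = np.sqrt(2.0 * rho)


def phi(x):
    # Hypothetical initial value; then u(t, x) = x^2 + 2 * rho * t
    return x ** 2


# Independent samples of tau, X (both uniform) and W (standard normal)
tau = rng.uniform(0.0, T, size=n)
X = rng.uniform(a, b, size=n)
W = rng.standard_normal(n)
target = phi(varrho * np.sqrt(tau) * W + X)

# Candidate functions v(t, x) = theta_0 + theta_1 t + theta_2 x + theta_3 x^2
features = np.column_stack([np.ones(n), tau, X, X ** 2])

# Minimize the empirical version of E[|phi(...) - v(tau, X)|^2] over theta
theta, *_ = np.linalg.lstsq(features, target, rcond=None)
print("estimated coefficients:", theta)
print("coefficients of u:     ", [0.0, 2.0 * rho, 0.0, 1.0])
```

The estimated coefficients should be close to those of u on the entire time span [0, T], even though each training sample is based on only one realization of W.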

17.5 Derivation of DKMs


In this section we present in the special case of the heat equation a rough derivation of
the DKMs introduced in Beck et al. [19]. This derivation will proceed along the analogous
steps as the derivation of PINNs and DGMs in Section 16.2. Firstly, we will employ
Proposition 17.4.1 to reformulate the PDE problem under consideration as an infinite
dimensional stochastic optimization problem, secondly, we will employ ANNs to reduce
the infinite dimensional stochastic optimization problem to a finite dimensional stochastic
optimization problem, and thirdly, we will aim to approximately solve this finite dimensional
stochastic optimization problem by means of SGD-type optimization methods. We start
by introducing the setting of the problem. Let d ∈ N, T, ρ ∈ (0, ∞), a ∈ R, b ∈ (a, ∞), let
φ : R^d → R be a function, let u ∈ C^{1,2}([0, T] × R^d, R) have at most polynomially growing
partial derivatives, and assume for all t ∈ [0, T], x ∈ R^d that u(0, x) = φ(x) and
\[
\frac{\partial u}{\partial t}(t, x) = \rho\,(\Delta_x u)(t, x). \tag{17.84}
\]

In the framework described in the previous sentence, we think of u as the unknown PDE
solution. The objective of this derivation is to develop deep learning methods which aim to
approximate the unknown PDE solution u(T, ·)|_{[a,b]^d} : [a, b]^d → R at time T restricted to
[a, b]^d.


In the first step, we employ Proposition 17.4.1 to recast the unknown target function
u(T, ·)|_{[a,b]^d} : [a, b]^d → R as the solution of an optimization problem. For this let ϱ = √(2ρT),
let (Ω, F, P) be a probability space, let W : Ω → R^d be a standard normally distributed
random variable, let X : Ω → [a, b]^d be a continuously uniformly distributed random variable,
assume that W and X are independent, and let L : C([a, b]^d, R) → [0, ∞] satisfy for all
v ∈ C([a, b]^d, R) that
\[
L(v) = \mathbb{E}\bigl[|\varphi(\varrho W + X) - v(X)|^2\bigr]. \tag{17.85}
\]

Proposition 17.4.1 then ensures that the unknown target function u(T, ·)|_{[a,b]^d} : [a, b]^d → R
is the unique global minimizer of the function L : C([a, b]^d, R) → [0, ∞]. Minimizing L is,
however, not yet amenable to numerical computations.
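To make the role of (17.85) concrete, the following minimal sketch estimates L(v) by plain Monte Carlo sampling for three candidate functions v. It assumes, purely for illustration, the initial value φ(x) = cos(x_1) + cos(x_2) with d = 2, ρ = 1, T = 2, and [a, b] = [−5, 5] (the setting of Section 17.6), for which u(T, x) = e^{−ρT} φ(x). The names `phi` and `estimate_loss`, the sample size, and the seed are hypothetical choices and not part of the derivation.

```python
import math
import random

# Illustrative setting (an assumption, mirroring Section 17.6)
d, rho, T, a, b = 2, 1.0, 2.0, -5.0, 5.0
varrho = math.sqrt(2.0 * rho * T)


def phi(x):
    """Initial value phi(x) = cos(x_1) + ... + cos(x_d)."""
    return sum(math.cos(xi) for xi in x)


def estimate_loss(v, num_samples=40000, seed=0):
    """Monte Carlo estimate of L(v) = E[|phi(varrho * W + X) - v(X)|^2]."""
    rng = random.Random(seed)  # same seed => same samples for every candidate v
    total = 0.0
    for _ in range(num_samples):
        x = [rng.uniform(a, b) for _ in range(d)]    # X uniform on [a, b]^d
        w = [rng.gauss(0.0, 1.0) for _ in range(d)]  # W standard normal
        shifted = [varrho * wi + xi for wi, xi in zip(w, x)]
        total += (phi(shifted) - v(x)) ** 2
    return total / num_samples


loss_exact = estimate_loss(lambda x: math.exp(-rho * T) * phi(x))  # v = u(T, .)
loss_zero = estimate_loss(lambda x: 0.0)                           # v = 0
loss_initial = estimate_loss(phi)                                  # v = phi

print(loss_exact, loss_zero, loss_initial)
```

Because the same random samples are used for all three candidates, the estimate for v = u(T, ·) should come out smallest, in line with Proposition 17.4.1. The minimal value itself is not zero, since φ(ϱW + X) remains random even when X is given.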
In the second step, we therefore reduce this infinite dimensional stochastic optimization
problem to a finite dimensional stochastic optimization problem involving ANNs. Specifically,
let a : R → R be differentiable, let h ∈ N, l_1, l_2, …, l_h, 𝔡 ∈ N satisfy 𝔡 = l_1(d + 1) +
[∑_{k=2}^{h} l_k(l_{k−1} + 1)] + l_h + 1, and let ℒ : R^𝔡 → [0, ∞) satisfy for all θ ∈ R^𝔡 that
\[
\begin{aligned}
\mathcal{L}(\theta)
&= L\bigl(\mathcal{N}^{\theta,d}_{\mathfrak{M}_{a,l_1},\mathfrak{M}_{a,l_2},\dots,\mathfrak{M}_{a,l_h},\operatorname{id}_{\mathbb{R}}}\big|_{[a,b]^d}\bigr) \\
&= \mathbb{E}\bigl[|\varphi(\varrho W + X) - \mathcal{N}^{\theta,d}_{\mathfrak{M}_{a,l_1},\mathfrak{M}_{a,l_2},\dots,\mathfrak{M}_{a,l_h},\operatorname{id}_{\mathbb{R}}}(X)|^2\bigr]
\end{aligned}
\tag{17.86}
\]
(cf. Definitions 1.1.3 and 1.2.1). We can now compute an approximate minimizer of the
function L by computing an approximate minimizer ϑ ∈ R^𝔡 of the function ℒ and employing
the realization N^{ϑ,d}_{M_{a,l_1}, M_{a,l_2}, …, M_{a,l_h}, id_R}|_{[a,b]^d} ∈ C([a, b]^d, R) of the ANN
associated to this approximate minimizer restricted to [a, b]^d as an approximate minimizer of L.
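The parameter count in the second step can be checked with a few lines of code. The following sketch evaluates the formula l_1(d + 1) + ∑_{k=2}^{h} l_k(l_{k−1} + 1) + l_h + 1 and cross-checks it by summing the sizes of the weight matrices and bias vectors of a fully connected network with input dimension d, hidden widths l_1, …, l_h, and one output. The function names `num_parameters` and `num_parameters_by_layers` are hypothetical.

```python
def num_parameters(d, widths):
    """Evaluate l_1 (d + 1) + sum_{k=2}^{h} l_k (l_{k-1} + 1) + l_h + 1."""
    total = widths[0] * (d + 1)
    for k in range(1, len(widths)):
        total += widths[k] * (widths[k - 1] + 1)
    return total + widths[-1] + 1


def num_parameters_by_layers(d, widths):
    """Cross-check: add up weight-matrix and bias-vector sizes layer by layer."""
    dims = [d] + list(widths) + [1]  # input dimension, hidden widths, scalar output
    return sum(dims[k + 1] * dims[k] + dims[k + 1] for k in range(len(dims) - 1))


# d = 2 with two hidden layers of 50 neurons each
print(num_parameters(2, [50, 50]))  # 2751
# The network in Source code 17.2 takes (t, x) as input, so its input dimension is 3
print(num_parameters(3, [50, 50]))  # 2801
```

Both functions agree for any choice of widths, which confirms that the formula counts one weight per connection and one bias per neuron, including the output neuron.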
In the third step, we use SGD-type methods to compute such an approximate minimizer
of L. We now sketch this in the case of the plain-vanilla SGD optimization method (cf.
Definition 7.2.1). Let ξ ∈ R^𝔡, J ∈ N, (γ_n)_{n∈N} ⊆ [0, ∞), for every n ∈ N, j ∈ {1, 2, …, J} let
W_{n,j} : Ω → R^d be a standard normally distributed random variable and let X_{n,j} : Ω → [a, b]^d
be a continuously uniformly distributed random variable, let l : R^𝔡 × R^d × [a, b]^d → R satisfy
for all θ ∈ R^𝔡, w ∈ R^d, x ∈ [a, b]^d that
\[
l(\theta, w, x) = \bigl|\varphi(\varrho w + x) - \mathcal{N}^{\theta,d}_{\mathfrak{M}_{a,l_1},\mathfrak{M}_{a,l_2},\dots,\mathfrak{M}_{a,l_h},\operatorname{id}_{\mathbb{R}}}(x)\bigr|^2, \tag{17.87}
\]
and let Θ = (Θ_n)_{n∈N_0} : N_0 × Ω → R^𝔡 satisfy for all n ∈ N that
\[
\Theta_0 = \xi \qquad\text{and}\qquad
\Theta_n = \Theta_{n-1} - \gamma_n \Biggl[\frac{1}{J}\sum_{j=1}^{J} (\nabla_\theta l)(\Theta_{n-1}, W_{n,j}, X_{n,j})\Biggr]. \tag{17.88}
\]
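The recursion in (17.88) can be tried out without any deep learning library if the ANN is replaced by a much simpler parametric family. The following sketch is such a stand-in and not the DKM itself: it assumes φ(x) = cos(x_1) + cos(x_2), d = 2, ρ = 1, T = 2, [a, b] = [−5, 5], and the hypothetical one-parameter model x ↦ θ φ(x) in place of the ANN realization. Since E[cos(y + ϱZ)] = e^{−ϱ²/2} cos(y) for a standard normal Z, the minimizing parameter in this toy model is θ = e^{−ρT}. The names `grad_l` and `sgd`, the batch size, the number of steps, and the learning rates γ_n = 1/(n + 10) are illustrative assumptions.

```python
import math
import random

# Illustrative setting (an assumption, mirroring Section 17.6)
d, rho, T, a, b = 2, 1.0, 2.0, -5.0, 5.0
varrho = math.sqrt(2.0 * rho * T)
rng = random.Random(0)


def phi(x):
    """Initial value phi(x) = cos(x_1) + ... + cos(x_d)."""
    return sum(math.cos(xi) for xi in x)


def grad_l(theta, w, x):
    """Derivative in theta of |phi(varrho * w + x) - theta * phi(x)|^2."""
    target = phi([varrho * wi + xi for wi, xi in zip(w, x)])
    feature = phi(x)
    return -2.0 * (target - theta * feature) * feature


def sgd(xi, J, num_steps, gamma):
    """Theta_n = Theta_{n-1} - gamma_n * (mean of J sampled gradients)."""
    theta = xi
    for n in range(1, num_steps + 1):
        batch_grad = 0.0
        for _ in range(J):
            w = [rng.gauss(0.0, 1.0) for _ in range(d)]  # realization of W_{n,j}
            x = [rng.uniform(a, b) for _ in range(d)]    # realization of X_{n,j}
            batch_grad += grad_l(theta, w, x)
        theta -= gamma(n) * batch_grad / J
    return theta


theta_final = sgd(xi=1.0, J=64, num_steps=1500, gamma=lambda n: 1.0 / (n + 10))
print(theta_final, math.exp(-rho * T))  # SGD iterate and the known minimizer
```

In the actual method the scalar θ is replaced by the full parameter vector of the ANN and the derivative by the gradient with respect to all weights and biases, which is usually computed by automatic differentiation.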

Finally, the idea of DKMs is to consider for large enough n ∈ N the realization function
N^{Θ_n,d}_{M_{a,l_1}, M_{a,l_2}, …, M_{a,l_h}, id_R} as an approximation
\[
\mathcal{N}^{\Theta_n,d}_{\mathfrak{M}_{a,l_1},\mathfrak{M}_{a,l_2},\dots,\mathfrak{M}_{a,l_h},\operatorname{id}_{\mathbb{R}}}\big|_{[a,b]^d} \approx u(T,\cdot)\big|_{[a,b]^d} \tag{17.89}
\]
of the unknown solution u of the PDE in (17.84) at time T restricted to [a, b]^d.
An implementation in the case of a two-dimensional heat equation of the DKMs derived
above that employs the more sophisticated Adam SGD optimization method instead of the
SGD optimization method can be found in the next section.

17.6 Implementation of DKMs


In Source code 17.2 below we present a simple implementation of a DKM, as explained in
Section 17.5 above, for finding an approximation of a solution u ∈ C^{1,2}([0, 2] × R^2, R) of the
two-dimensional heat equation
\[
\frac{\partial u}{\partial t}(t, x) = (\Delta_x u)(t, x) \tag{17.90}
\]
with u(0, x) = cos(x_1) + cos(x_2) for t ∈ [0, 2], x = (x_1, x_2) ∈ R^2. This implementation
trains a fully connected feed-forward ANN with 2 hidden layers (with 50 neurons on each
hidden layer) using the ReLU activation function (cf. Section 1.2.3). The training uses
batches of size 256 with each batch consisting of 256 randomly chosen realizations of the
random variable (𝒯, X), where 𝒯 is a continuously uniformly distributed random variable
on [0, 2] and where X is a continuously uniformly distributed random variable on [−5, 5]^2.
The training is performed using the Adam SGD optimization method (cf. Section 7.9). A
plot of the resulting approximation of the solution u after 3000 training steps is shown in
Figure 17.2.
import torch
import matplotlib.pyplot as plt

# Use the GPU if available
dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# Computes an approximation of E[|phi(sqrt(2*rho*t) W + xi) - N(t, xi)|^2]
# with W a standard normal random variable, using the rows of t and x as
# independent realizations of the time variable and of the random variable xi
def loss(N, rho, phi, t, x):
    W = torch.randn_like(x)  # created on the same device as x
    return (phi(torch.sqrt(2 * rho * t) * W + x) -
            N(torch.cat((t, x), 1))).square().mean()


d = 2              # the input dimension
a, b = -5.0, 5.0   # the domain will be [a, b]^d
T = 2.0            # the time horizon
rho = 1.0          # the diffusivity


# Define the initial value
def phi(x):
    return x.cos().sum(dim=1, keepdim=True)


# Define a neural network with two hidden layers with 50 neurons
# each using ReLU activations; the input is (t, x) and has dimension d + 1
N = torch.nn.Sequential(
    torch.nn.Linear(d + 1, 50), torch.nn.ReLU(),
    torch.nn.Linear(50, 50), torch.nn.ReLU(),
    torch.nn.Linear(50, 1)
).to(dev)

# Configure the training parameters and optimization algorithm
steps = 3000
batch_size = 256
optimizer = torch.optim.Adam(N.parameters())

# Train the network
for step in range(steps):
    # Generate uniformly distributed samples from [a, b]^d and from [0, T]
    x = (torch.rand(batch_size, d) * (b - a) + a).to(dev)
    t = T * torch.rand(batch_size, 1).to(dev)

    optimizer.zero_grad()
    # Compute the loss
    L = loss(N, rho, phi, t, x)
    # Compute the gradients
    L.backward()
    # Apply changes to weights and biases of N
    optimizer.step()

# Plot the result at M+1 timesteps
M = 5
mesh = 128


def toNumpy(t):
    return t.detach().cpu().numpy().reshape((mesh, mesh))


fig, axs = plt.subplots(2, 3, subplot_kw=dict(projection="3d"))
fig.set_size_inches(16, 10)
fig.set_dpi(300)

for i in range(M + 1):
    x = torch.linspace(a, b, mesh)
    y = torch.linspace(a, b, mesh)
    x, y = torch.meshgrid(x, y, indexing="xy")
    x = x.reshape((mesh * mesh, 1)).to(dev)
    y = y.reshape((mesh * mesh, 1)).to(dev)
    # Evaluate the network at the fixed time i * T / M on the whole mesh
    ts = i * T / M * torch.ones(mesh * mesh, 1).to(dev)
    z = N(torch.cat((ts, x, y), 1))

    axs[i // 3, i % 3].set_title(f"t = {i * T / M}")
    axs[i // 3, i % 3].set_zlim(-2, 2)
    axs[i // 3, i % 3].plot_surface(toNumpy(x), toNumpy(y), toNumpy(z),
                                    cmap="viridis")

fig.savefig("../plots/kolmogorov.pdf", bbox_inches="tight")

Source code 17.2 (code/kolmogorov.py): A simple implementation in PyTorch of
the deep Kolmogorov method based on Corollary 17.4.2, computing an approximation
of the function u ∈ C^{1,2}([0, 2] × R^2, R) which satisfies for all t ∈ [0, 2], x = (x_1, x_2) ∈ R^2
that ∂u/∂t (t, x) = (∆_x u)(t, x) and u(0, x) = cos(x_1) + cos(x_2).

[Figure: six three-dimensional surface plots in two rows with panel titles t = 0.0, t = 0.4,
t = 0.8, t = 1.2, t = 1.6, and t = 2.0; the vertical axis ranges from −2.0 to 2.0.]

Figure 17.2 (plots/kolmogorov.pdf): Plots for the functions [−5, 5]^2 ∋ x ↦
U(t, x) ∈ R, where t ∈ {0, 0.4, 0.8, 1.2, 1.6, 2} and where U ∈ C([0, 2] × R^2, R)
is an approximation for the function u ∈ C^{1,2}([0, 2] × R^2, R) which satisfies for all t ∈ [0, 2],
x = (x_1, x_2) ∈ R^2 that ∂u/∂t (t, x) = (∆_x u)(t, x) and u(0, x) = cos(x_1) + cos(x_2)
computed by means of Source code 17.2.
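For the example in (17.90) the exact solution is available in closed form, namely u(t, x) = e^{−t}(cos(x_1) + cos(x_2)), so the plotted surfaces should flatten from amplitude 2 at t = 0 to amplitude 2e^{−2} ≈ 0.27 at t = 2. The following sketch, with the hypothetical helper names `u_exact` and `pde_residual`, checks by central finite differences that this function satisfies the heat equation up to discretization error. It can serve as a reference when judging the trained network.

```python
import math


def u_exact(t, x1, x2):
    """Closed-form solution of (17.90) with u(0, x) = cos(x1) + cos(x2)."""
    return math.exp(-t) * (math.cos(x1) + math.cos(x2))


def pde_residual(t, x1, x2, h=1e-3):
    """Finite-difference approximation of du/dt - Laplacian(u) at (t, x1, x2)."""
    du_dt = (u_exact(t + h, x1, x2) - u_exact(t - h, x1, x2)) / (2.0 * h)
    center = u_exact(t, x1, x2)
    d2_x1 = (u_exact(t, x1 + h, x2) - 2.0 * center + u_exact(t, x1 - h, x2)) / h**2
    d2_x2 = (u_exact(t, x1, x2 + h) - 2.0 * center + u_exact(t, x1, x2 - h)) / h**2
    return du_dt - (d2_x1 + d2_x2)


# A few sample points in [0, 2] x [-5, 5]^2
points = [(0.4, 0.0, 0.0), (1.2, -3.0, 4.5), (1.6, 2.0, -1.0), (2.0, 5.0, 5.0)]
max_residual = max(abs(pde_residual(t, x1, x2)) for t, x1, x2 in points)
print(max_residual)          # small, of the order of the discretization error
print(u_exact(2.0, 0.0, 0.0))  # peak value at the final time, 2 * exp(-2)
```

Replacing `u_exact` by an evaluation of the trained network on the same points gives a simple pointwise error measure for the approximation.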

Chapter 18

Further deep learning methods for PDEs

Besides the PINNs, DGMs, and DKMs reviewed in Chapters 16 and 17 above, there is also a
large number of other works which propose and study deep learning based approximation
methods for various classes of PDEs. In the following we mention a selection of such methods
from the literature roughly grouped into three classes. Specifically, we consider deep learning
methods for PDEs which employ strong formulations of PDEs to set up learning problems in
Section 18.1, we consider deep learning methods for PDEs which employ weak or variational
formulations of PDEs to set up learning problems in Section 18.2, and we consider deep
learning methods for PDEs which employ intrinsic stochastic representations of PDEs to
set up learning problems in Section 18.3. Finally, in Section 18.4 we also point to several
theoretical results and error analyses for deep learning methods for PDEs in the literature.
Our selection of references for methods as well as theoretical results is by no means
complete. For more complete reviews of the literature on deep learning methods for PDEs
and corresponding theoretical results we refer, for instance, to the overview articles [24, 56,
88, 120, 145, 237, 355].

18.1 Deep learning methods based on strong formulations of PDEs

There are a number of deep learning based methods for PDEs in the literature that employ
residuals of strong formulations of PDEs to set up learning problems (cf., for example,
Theorem 16.1.1 and (16.16) for the residual of the strong formulation in the case of semilinear
heat PDEs). Basic methods in this category include the PINNs (see Raissi et al. [347]) and
DGMs (see Sirignano & Spiliopoulos [379]) reviewed in Chapter 16 above, the approach
proposed in Berg & Nyström [34], the theory-guided neural networks (TGNNs) proposed in
Wang et al. [405], and the two early methods proposed in [106, 260]. There are also many
refinements and adaptions of these basic methods in the literature including

• the conservative PINNs (cPINNs) methodology for conservation laws in Jagtap et
al. [219] which relies on multiple ANNs representing a PDE solution on respective
sub-domains,

• the extended PINNs (XPINNs) methodology in Jagtap & Karniadakis [90] which
generalizes the domain decomposition idea of Jagtap et al. [219] to other types of
PDEs,

• the Navier-Stokes flow nets (NSFnets) methodology in Jin et al. [231] which explores
the use of PINNs for the incompressible Navier-Stokes PDEs,

• the Bayesian PINNs methodology in Yang et al. [421] which combines PINNs with
Bayesian neural networks (BNNs) from Bayesian learning (cf., for instance, [287,
300]),

• the parareal PINNs (PPINNs) methodology for time-dependent PDEs with long time
horizons in Meng et al. [295] which combines the PINNs methodology with ideas
from parareal algorithms (cf., for example, [42, 290]) in order to split up long-time
problems into many independent short-time problems,

• the SelectNets methodology in Gu et al. [183] which extends the PINNs methodology
by employing a second ANN to adaptively select during the training process the
points at which the residual of the PDE is considered, and

• the fractional PINNs (fPINNs) methodology in Pang et al. [324] which extends the
PINNs methodology to PDEs with fractional derivatives such as space-time fractional
advection-diffusion equations.
We also refer to the article Lu et al. [286] which introduces an elegant Python library for
PINNs called DeepXDE and also provides a good introduction to PINNs.

18.2 Deep learning methods based on weak formulations of PDEs

Another group of deep learning methods for PDEs relies on weak or variational formulations
of PDEs to set up learning problems. Such methods include
• the variational PINNs (VPINNs) methodology in Kharazmi et al. [241, 242] which
uses the residuals of weak formulations of PDEs for a fixed set of test functions to set
up a learning problem,

• the VarNets methodology in Khodayi-Mehr & Zavlanos [243] which employs a
methodology similar to VPINNs but also considers parametric PDEs,

• the weak form TGNN methodology in Xu et al. [420] which further extends the VPINNs
methodology by (amongst other adaptions) considering test functions in the weak
formulation of PDEs tailored to the considered problem,
• the deep Fourier residual method in Taylor et al. [393] which is based on minimizing
the dual norm of the weak-form residual operator of PDEs by employing Fourier-type
representations of this dual norm which can efficiently be approximated using the
discrete sine transform (DST) and discrete cosine transform (DCT),
• the weak adversarial networks (WANs) methodology in Zang et al. [428] (cf. also Bao
et al. [13]) which is based on approximating both the solution of the PDE and the test
function in the weak formulation of the PDE by ANNs and on using an adversarial
approach (cf., for instance, Goodfellow et al. [165]) to train both networks to minimize
and maximize, respectively, the weak-form residual of the PDE,
• the Friedrichs learning methodology in Chen et al. [66] which is similar to the WAN
methodology but uses a different minimax formulation for the weak solution related
to Friedrichs’ theory on symmetric systems of PDEs (see Friedrichs [139]),
• the deep Ritz method for elliptic PDEs in E & Yu [124] which employs variational
minimization problems associated to PDEs to set up a learning problem,
• the deep Nitsche method in Liao & Ming [274] which refines the deep Ritz method
using Nitsche’s method (see Nitsche [313]) to enforce boundary conditions, and
• the deep domain decomposition method (D3M) in Li et al. [268] which refines the deep
Ritz method using domain decompositions.
We also refer to the multi-scale deep neural networks (MscaleDNNs) in Cai et al. [58, 279]
for a refined ANN architecture which can be employed in both the strong-form-based PINNs
methodology and the variational-form-based deep Ritz methodology.

18.3 Deep learning methods based on stochastic representations of PDEs

A further class of deep learning based methods for PDEs is based on intrinsic links
between PDEs and probability theory such as Feynman–Kac-type formulas; cf., for example,
[318, Section 8.2], [234, Section 4.4] for linear Feynman–Kac formulas based on (forward)
stochastic differential equations (SDEs) and cf., for instance, [73, 325–327] for nonlinear
Feynman–Kac-type formulas based on backward stochastic differential equations (BSDEs).
The DKMs for linear PDEs (see Beck et al. [19]) reviewed in Chapter 17 are one type of
such methods based on linear Feynman–Kac formulas. Other methods based on stochastic
representations of PDEs include

• the deep BSDE methodology in E et al. [119, 187] which proposes to approximate
solutions of semilinear parabolic PDEs by approximately solving the BSDE associated
to the considered PDE through the nonlinear Feynman–Kac formula (see Pardoux &
Peng [325, 326]) using a new deep learning methodology based on

– reinterpreting the BSDE as a stochastic control problem in which the objective
is to minimize the distance between the terminal value of the controlled process
and the terminal value of the BSDE,
– discretizing the control problem in time, and
– approximately solving the discrete time control problem by approximating the
policy functions at each time step by means of ANNs as proposed in E &
Han [186],

• the generalization of the deep BSDE methodology in Han & Long [188] for semilinear
and quasilinear parabolic PDEs based on forward backward stochastic differential
equations (FBSDEs),

• the refinements of the deep BSDE methodology in [64, 140, 196, 317, 346] which
explore different nontrivial variations and extensions of the original deep BSDE
methodology including different ANN architectures, initializations, and loss functions,

• the extension of the deep BSDE methodology to fully nonlinear parabolic PDEs in
Beck et al. [20] which is based on a nonlinear Feynman–Kac formula involving second
order BSDEs (see Cheridito et al. [73]),

• the deep backward schemes for semilinear parabolic PDEs in Huré et al. [207] which
also rely on BSDEs but set up many separate learning problems which are solved
inductively backwards in time instead of one single optimization problem,

• the deep backward schemes in Pham et al. [336] which extend the methodology in
Huré et al. [207] to fully nonlinear parabolic PDEs,

• the deep splitting method for semilinear parabolic PDEs in Beck et al. [17] which
iteratively solves for small time increments linear approximations of the semilinear
parabolic PDEs using DKMs,

• the extensions of the deep backward schemes to partial integro-differential equations
(PIDEs) in [62, 154],

• the extensions of the deep splitting method to PIDEs in [50, 138],

• the methods in Nguwi et al. [308, 309, 311] which are based on representations of
PDE solutions involving branching-type processes (cf., for example, also [195, 197,
310] and the references therein for nonlinear Feynman–Kac-type formulas based on
such branching-type processes), and

• the methodology for elliptic PDEs in Kremsner et al. [256] which relies on suitable
representations of elliptic PDEs involving BSDEs with random terminal times.

18.4 Error analyses for deep learning methods for PDEs


To date, there is no complete error analysis for a GD/SGD based ANN training
approximation scheme for PDEs in the literature (cf. also Remark 9.14.5 above). However,
there are now several partial error analysis results for deep learning methods for PDEs in
the literature (cf., for instance, [26, 137, 146, 158, 188, 298, 299] and the references therein).
In particular, there are nowadays a number of results which rigorously establish that
ANNs have the fundamental capacity to approximate solutions of certain classes of PDEs
without the curse of dimensionality (COD) (cf., for example, [27] and [314, Chapter 1])
in the sense that the number of parameters of the approximating ANN grows at most
polynomially in both the reciprocal 1/ε of the prescribed approximation accuracy ε ∈ (0, ∞)
and the PDE dimension d ∈ N. We refer, for instance, to [10, 35, 37, 128, 161, 162, 177,
179, 181, 205, 228, 259, 353] for such and related ANN approximation results for solutions
of linear PDEs and we refer, for example, to [3, 82, 178, 209] for such and related ANN
approximation results for solutions of nonlinear PDEs.
The proofs in the above named ANN approximation results are usually based, first, on
considering a suitable algorithm which approximates the considered PDEs without the COD
and, thereafter, on constructing ANNs which approximate the considered approximation
algorithm. In the context of linear PDEs the employed approximation algorithms are
typically standard Monte Carlo methods (cf., for instance, [155, 168, 250] and the references
therein) and in the context of nonlinear PDEs the employed approximation algorithms are
typically nonlinear Monte Carlo methods of the multilevel-Picard-type (cf., for example,
[21, 22, 150, 208, 210–212, 214, 304, 305] and the references therein).
In the literature the above named polynomial growth property in both the reciprocal
1/ε of the prescribed approximation accuracy ε ∈ (0, ∞) and the PDE dimension d ∈ N is
also referred to as polynomial tractability (cf., for instance, [314, Definition 4.44], [315], and
[316]).
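The polynomial growth property described above can be written out as a formula. The following display is a sketch in notation that is assumed here and does not appear in the text: 𝒫(d, ε) stands for the number of parameters of an ANN that approximates the solution of the d-dimensional PDE with accuracy ε, the constants c, p, q do not depend on d or ε, and the accuracy is restricted to ε ∈ (0, 1] for simplicity. For contrast, the second display counts the points of a uniform tensor grid with mesh width at most ε on [0, 1]^d, which grows exponentially in d.

```latex
% Polynomial growth in both d and 1/epsilon (assumed notation: \mathcal{P}, c, p, q)
\[
  \exists\, c, p, q \in (0,\infty) \;\; \forall\, d \in \mathbb{N},\ \varepsilon \in (0,1] \colon
  \qquad \mathcal{P}(d,\varepsilon) \le c\, d^{p}\, \varepsilon^{-q}.
\]
% Contrast: number of points of a uniform tensor grid on [0,1]^d with mesh width at most epsilon
\[
  \bigl(\lceil \varepsilon^{-1} \rceil + 1\bigr)^{d} \ge \varepsilon^{-d}.
\]
```

For instance, with ε = 1/10 and d = 100 the grid already has more than 10^{100} points, whereas a bound of the first kind stays polynomial in d.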

Index of abbreviations

ANN (artificial neural network)
BERT (Bidirectional Encoder Representations from Transformers)
BN (batch normalization)
BNN (Bayesian neural network)
BSDE (backward stochastic differential equation)
CNN (convolutional ANN)
COD (curse of dimensionality)
CV (computer vision)
D3M (deep domain decomposition method)
DCT (discrete cosine transform)
DGM (deep Galerkin method)
DKM (deep Kolmogorov method)
DST (discrete sine transform)
ELU (exponential linear unit)
FBSDE (forward backward stochastic differential equation)
FNO (Fourier neural operator)
GD (gradient descent)
GELU (Gaussian error linear unit)
GF (gradient flow)
GNN (graph neural network)
GPT (generative pre-trained transformer)
KL (Kurdyka–Łojasiewicz)
LLM (large language model)
LSTM (long short-term memory)
MscaleDNN (multi-scale deep neural network)
NLP (natural language processing)
NSFnet (Navier-Stokes flow net)
ODE (ordinary differential equation)
PDE (partial differential equation)
PIDE (partial integro-differential equation)
PINN (physics-informed neural network)
PPINN (parareal PINN)
RNN (recurrent ANN)
ReLU (rectified linear unit)
RePU (rectified power unit)
ResNet (residual ANN)
SDE (stochastic differential equation)
SGD (stochastic gradient descent)
TGNN (theory-guided neural network)
VPINN (variational PINN)
WAN (weak adversarial network)
XPINN (extended PINN)
cPINN (conservative PINN)
deepONet (deep operator network)
fPINN (fractional PINN)

List of figures

Figure 1.4: plots/relu.pdf
Figure 1.5: plots/clipping.pdf
Figure 1.6: plots/softplus.pdf
Figure 1.7: plots/gelu.pdf
Figure 1.8: plots/logistic.pdf
Figure 1.9: plots/swish.pdf
Figure 1.10: plots/tanh.pdf
Figure 1.11: plots/softsign.pdf
Figure 1.12: plots/leaky_relu.pdf
Figure 1.13: plots/elu.pdf
Figure 1.14: plots/repu.pdf
Figure 1.15: plots/sine.pdf
Figure 1.16: plots/heaviside.pdf
Figure 5.1: plots/gradient_plot1.pdf
Figure 5.2: plots/gradient_plot2.pdf
Figure 5.3: plots/l1loss.pdf
Figure 5.4: plots/mseloss.pdf
Figure 5.5: plots/huberloss.pdf
Figure 5.6: plots/crossentropyloss.pdf
Figure 5.7: plots/kldloss.pdf
Figure 6.1: plots/GD_momentum_plots.pdf
Figure 7.1: plots/sgd.pdf
Figure 7.2: plots/sgd2.pdf
Figure 7.3: plots/sgd_momentum.pdf
Figure 7.4: plots/mnist.pdf
Figure 7.5: plots/mnist_optim.pdf
Figure 16.1: plots/pinn.pdf
Figure 16.2: plots/dgm.pdf
Figure 17.1: plots/brownian_motions.pdf
Figure 17.2: plots/kolmogorov.pdf

List of source codes

Source code 1.1: code/activation_functions/plot_util.py . . . . . . . . . . . . . . . . . . . . . . . 29


Source code 1.2: code/activation_functions/relu_plot.py . . . . . . . . . . . . . . . . . . . . . . . 30
Source code 1.3: code/activation_functions/clipping_plot.py . . . . . . . . . . . . . . . . . . 34
Source code 1.4: code/activation_functions/softplus_plot.py . . . . . . . . . . . . . . . . . . 35
Source code 1.5: code/activation_functions/gelu_plot.py . . . . . . . . . . . . . . . . . . . . . . . 37
Source code 1.6: code/activation_functions/logistic_plot.py . . . . . . . . . . . . . . . . . . 38
Source code 1.7: code/activation_functions/swish_plot.py . . . . . . . . . . . . . . . . . . . . . . 41
Source code 1.8: code/activation_functions/tanh_plot.py . . . . . . . . . . . . . . . . . . . . . . . 42
Source code 1.9: code/activation_functions/softsign_plot.py . . . . . . . . . . . . . . . . . . 43
Source code 1.10: code/activation_functions/leaky_relu_plot.py . . . . . . . . . . . . . . . 44
Source code 1.11: code/activation_functions/elu_plot.py . . . . . . . . . . . . . . . . . . . . . . . 46
Source code 1.12: code/activation_functions/repu_plot.py . . . . . . . . . . . . . . . . . . . . . . 48
Source code 1.13: code/activation_functions/sine_plot.py . . . . . . . . . . . . . . . . . . . . . . 49
Source code 1.14: code/activation_functions/heaviside_plot.py . . . . . . . . . . . . . . . . 50
Source code 1.15: code/fc-ann-manual.py . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
Source code 1.16: code/fc-ann.py . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Source code 1.17: code/fc-ann2.py . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Source code 1.18: code/conv-ann.py . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Source code 1.19: code/conv-ann-ex.py . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Source code 1.20: code/res-ann.py . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Source code 5.1: code/gradient_plot1.py . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
Source code 5.2: code/gradient_plot2.py . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Source code 5.3: code/loss_functions/l1loss_plot.py . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Source code 5.4: code/loss_functions/mseloss_plot.py . . . . . . . . . . . . . . . . . . . . . . . . . . 184
Source code 5.5: code/loss_functions/huberloss_plot.py . . . . . . . . . . . . . . . . . . . . . . . 187
Source code 5.6: code/loss_functions/crossentropyloss_plot.py . . . . . . . . . . . . . . . 188
Source code 5.7: code/loss_functions/kldloss_plot.py . . . . . . . . . . . . . . . . . . . . . . . . . . 193
Source code 6.1: code/example_GD_momentum_plots.py . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
Source code 7.1: code/optimization_methods/sgd.py . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
Source code 7.2: code/optimization_methods/sgd2.py . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284
Source code 7.3: code/optimization_methods/midpoint_sgd.py . . . . . . . . . . . . . . . . . . 303

Source code 7.4: code/optimization_methods/momentum_sgd.py . . . . . . . . . . . . . . . . . . 306
Source code 7.5: code/optimization_methods/momentum_sgd_bias_adj.py . . . . . . . . 308
Source code 7.6: code/optimization_methods/nesterov_sgd.py . . . . . . . . . . . . . . . . . . 310
Source code 7.7: code/optimization_methods/adagrad.py . . . . . . . . . . . . . . . . . . . . . . . . 315
Source code 7.8: code/optimization_methods/rmsprop.py . . . . . . . . . . . . . . . . . . . . . . . . 317
Source code 7.9: code/optimization_methods/rmsprop_bias_adj.py . . . . . . . . . . . . . . 319
Source code 7.10: code/optimization_methods/adadelta.py . . . . . . . . . . . . . . . . . . . . . . 321
Source code 7.11: code/optimization_methods/adam.py . . . . . . . . . . . . . . . . . . . . . . . . . . . 324
Source code 7.12: code/mnist.py . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
Source code 7.13: code/mnist_optim.py . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
Source code 16.1: code/pinn.py . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 514
Source code 16.2: code/dgm.py . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 517
Source code 17.1: code/brownian_motion.py . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
Source code 17.2: code/kolmogorov.py . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539

List of definitions

Chapter 1
Definition 1.1.1: Affine functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Definition 1.1.3: Vectorized description of fully-connected feedforward ANNs . . . . . . . 23
Definition 1.2.1: Multidimensional versions of one-dimensional functions . . . . . . . . . . . 27
Definition 1.2.4: ReLU activation function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Definition 1.2.5: Multidimensional ReLU activation functions . . . . . . . . . . . . . . . . . . . . . . 30
Definition 1.2.9: Clipping activation function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
Definition 1.2.10: Multidimensional clipping activation functions . . . . . . . . . . . . . . . . . . . 35
Definition 1.2.11: Softplus activation function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Definition 1.2.13: Multidimensional softplus activation functions . . . . . . . . . . . . . . . . . . . 36
Definition 1.2.15: GELU activation function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
Definition 1.2.17: Multidimensional GELU unit activation function . . . . . . . . . . . . . . . . 38
Definition 1.2.18: Standard logistic activation function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Definition 1.2.19: Multidimensional standard logistic activation functions . . . . . . . . . . 39
Definition 1.2.22: Swish activation function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Definition 1.2.24: Multidimensional swish activation functions . . . . . . . . . . . . . . . . . . . . . 41
Definition 1.2.25: Hyperbolic tangent activation function . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Definition 1.2.26: Multidimensional hyperbolic tangent activation functions . . . . . . . . 43
Definition 1.2.28: Softsign activation function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Definition 1.2.29: Multidimensional softsign activation functions . . . . . . . . . . . . . . . . . . . 44
Definition 1.2.30: Leaky ReLU activation function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Definition 1.2.33: Multidimensional leaky ReLU activation function . . . . . . . . . . . . . . . . 46
Definition 1.2.34: ELU activation function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
Definition 1.2.36: Multidimensional ELU activation function . . . . . . . . . . . . . . . . . . . . . . . 47
Definition 1.2.37: RePU activation function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Definition 1.2.38: Multidimensional RePU activation function . . . . . . . . . . . . . . . . . . . . . . 48
Definition 1.2.39: Sine activation function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Definition 1.2.40: Multidimensional sine activation functions . . . . . . . . . . . . . . . . . . . . . . . 49
Definition 1.2.41: Heaviside activation function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
Definition 1.2.42: Multidimensional Heaviside activation functions . . . . . . . . . . . . . . . . . 50
Definition 1.2.43: Softmax activation function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

Definition 1.3.1: Structured description of fully-connected feedforward ANNs . . . . . . 52
Definition 1.3.2: Fully-connected feedforward ANNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Definition 1.3.4: Realizations of fully-connected feedforward ANNs . . . . . . . . . . . . . . . . . 53
Definition 1.3.5: Transformation from the structured to the vectorized description of
fully-connected feedforward ANNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Definition 1.4.1: Discrete convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Definition 1.4.2: Structured description of feedforward CNNs . . . . . . . . . . . . . . . . . . . . . . 60
Definition 1.4.3: Feedforward CNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Definition 1.4.4: One tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Definition 1.4.5: Realizations associated to feedforward CNNs . . . . . . . . . . . . . . . . . . . . . . 61
Definition 1.4.7: Standard scalar products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Definition 1.5.1: Structured description of fully-connected ResNets . . . . . . . . . . . . . . . . . 66
Definition 1.5.2: Fully-connected ResNets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Definition 1.5.4: Realizations associated to fully-connected ResNets . . . . . . . . . . . . . . . . 67
Definition 1.5.5: Identity matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
Definition 1.6.1: Function unrolling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Definition 1.6.2: Description of RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
Definition 1.6.3: Vectorized description of simple fully-connected RNN nodes . . . . . . . 71
Definition 1.6.4: Vectorized description of simple fully-connected RNNs . . . . . . . . . . . . 71
Chapter 2
Definition 2.1.1: Composition of ANNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Definition 2.1.6: Powers of fully-connected feedforward ANNs . . . . . . . . . . . . . . . . . . . . . . 84
Definition 2.2.1: Parallelization of fully-connected feedforward ANNs . . . . . . . . . . . . . . 84
Definition 2.2.6: Fully-connected feedforward ReLU identity ANNs . . . . . . . . . . . . . . . . 89
Definition 2.2.8: Extensions of fully-connected feedforward ANNs . . . . . . . . . . . . . . . . . . 90
Definition 2.2.12: Parallelization of fully-connected feedforward ANNs with different
length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
Definition 2.3.1: Fully-connected feedforward affine transformation ANNs . . . . . . . . . . 96
Definition 2.3.4: Scalar multiplications of ANNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Definition 2.4.1: Sums of vectors as fully-connected feedforward ANNs . . . . . . . . . . . . . 98
Definition 2.4.5: Transpose of a matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Definition 2.4.6: Concatenation of vectors as fully-connected feedforward ANNs . . . 100
Definition 2.4.10: Sums of fully-connected feedforward ANNs with the same length 102
Chapter 3
Definition 3.1.1: Modulus of continuity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Definition 3.1.5: Linear interpolation operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
Definition 3.2.1: Activation functions as fully-connected feedforward ANNs . . . . . . . 113
Definition 3.3.4: Quasi vector norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Chapter 4
Definition 4.1.1: Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

Definition 4.1.2: Metric space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Definition 4.2.1: 1-norm ANN representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
Definition 4.2.5: Maxima ANN representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Definition 4.2.6: Floor and ceiling of real numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
Definition 4.3.2: Covering numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Definition 4.4.1: Rectified clipped ANNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
Chapter 6
Definition 6.1.1: GD optimization method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Definition 6.2.1: Explicit midpoint GD optimization method . . . . . . . . . . . . . . . . . . . . . . 239
Definition 6.3.1: Momentum GD optimization method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
Definition 6.3.5: Bias-adjusted momentum GD optimization method . . . . . . . . . . . . . . 247
Definition 6.4.1: Nesterov accelerated GD optimization method . . . . . . . . . . . . . . . . . . . 269
Definition 6.5.1: Adagrad GD optimization method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
Definition 6.6.1: RMSprop GD optimization method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
Definition 6.6.3: Bias-adjusted RMSprop GD optimization method . . . . . . . . . . . . . . . 272
Definition 6.7.1: Adadelta GD optimization method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
Definition 6.8.1: Adam GD optimization method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
Chapter 7
Definition 7.2.1: SGD optimization method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
Definition 7.3.1: Explicit midpoint SGD optimization method . . . . . . . . . . . . . . . . . . . . 303
Definition 7.4.1: Momentum SGD optimization method . . . . . . . . . . . . . . . . . . . . . . . . . . 305
Definition 7.4.2: Bias-adjusted momentum SGD optimization method . . . . . . . . . . . . . 307
Definition 7.5.1: Nesterov accelerated SGD optimization method . . . . . . . . . . . . . . . . . 310
Definition 7.5.3: Simplified Nesterov accelerated SGD optimization method . . . . . . . 314
Definition 7.6.1: Adagrad SGD optimization method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
Definition 7.7.1: RMSprop SGD optimization method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
Definition 7.7.3: Bias-adjusted RMSprop SGD optimization method . . . . . . . . . . . . . . 318
Definition 7.8.1: Adadelta SGD optimization method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
Definition 7.9.1: Adam SGD optimization method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 322
Chapter 8
Definition 8.2.1: Diagonal matrices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
Chapter 9
Definition 9.1.1: Standard KL inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
Definition 9.1.2: Standard KL functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
Definition 9.7.1: Analytic functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
Definition 9.15.1: Fréchet subgradients and limiting Fréchet subgradients . . . . . . . . . 390
Definition 9.16.1: Non-smooth slope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
Definition 9.17.1: Generalized KL inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
Definition 9.17.2: Generalized KL functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
Chapter 10

Definition 10.1.1: Batch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
Definition 10.1.2: Batch mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
Definition 10.1.3: Batch variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
Definition 10.1.5: BN operations for given batch mean and batch variance . . . . . . . . 400
Definition 10.1.6: Batch normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
Definition 10.2.1: Structured description of fully-connected feedforward ANNs with BN
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
Definition 10.2.2: Fully-connected feedforward ANNs with BN . . . . . . . . . . . . . . . . . . . . 402
Definition 10.3.1: Realizations associated to fully-connected feedforward ANNs with
BN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
Definition 10.4.1: Structured description of fully-connected feedforward ANNs with BN
for given batch means and batch variances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
Definition 10.4.2: Fully-connected feedforward ANNs with BN for given batch means
and batch variances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
Definition 10.5.1: Realizations associated to fully-connected feedforward ANNs with
BN for given batch means and batch variances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
Definition 10.6.1: Fully-connected feed-forward ANNs with BN for given batch means
and batch variances associated to fully-connected feedforward ANNs with BN and
given input batches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
Chapter 12
Definition 12.1.7: Moment generating functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
Definition 12.2.1: Covering radii . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
Definition 12.2.6: Packing radii . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
Definition 12.2.7: Packing numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
Chapter 13
Definition 13.1.2: Rademacher family . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
Definition 13.1.3: p-Kahane–Khintchine constant . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 470
Chapter 17
Definition 17.3.3: Standard Brownian motions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
Definition 17.3.8: Continuous convolutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532

Bibliography

[1] Abdel-Hamid, O., Mohamed, A., Jiang, H., Deng, L., Penn, G., and Yu, D.
Convolutional Neural Networks for Speech Recognition. IEEE/ACM Trans. Audio,
Speech, Language Process. 22, 10 (2014), pp. 1533–1545. url:
doi.org/10.1109/TASLP.2014.2339736.
[2] Absil, P.-A., Mahony, R., and Andrews, B. Convergence of the iterates of
descent methods for analytic cost functions. SIAM J. Optim. 16, 2 (2005), pp. 531–
547. url: doi.org/10.1137/040605266.
[3] Ackermann, J., Jentzen, A., Kruse, T., Kuckuck, B., and Padgett, J. L.
Deep neural networks with ReLU, leaky ReLU, and softplus activation provably
overcome the curse of dimensionality for Kolmogorov partial differential equations
with Lipschitz nonlinearities in the L^p-sense. arXiv:2309.13722 (2023), 52 pp. url:
arxiv.org/abs/2309.13722.
[4] Alpaydın, E. Introduction to Machine Learning. 4th ed. MIT Press, Cambridge,
Mass., 2020. 712 pp.
[5] Amann, H. Ordinary differential equations. Walter de Gruyter & Co., Berlin, 1990.
xiv+458 pp. url: doi.org/10.1515/9783110853698.
[6] Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg,
E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., Chen, J.,
Chen, J., Chen, Z., Chrzanowski, M., Coates, A., Diamos, G., Ding, K.,
Du, N., Elsen, E., Engel, J., Fang, W., Fan, L., Fougner, C., Gao, L.,
Gong, C., Hannun, A., Han, T., Johannes, L., Jiang, B., Ju, C., Jun, B.,
LeGresley, P., Lin, L., Liu, J., Liu, Y., Li, W., Li, X., Ma, D., Narang, S.,
Ng, A., Ozair, S., Peng, Y., Prenger, R., Qian, S., Quan, Z., Raiman, J.,
Rao, V., Satheesh, S., Seetapun, D., Sengupta, S., Srinet, K., Sriram, A.,
Tang, H., Tang, L., Wang, C., Wang, J., Wang, K., Wang, Y., Wang, Z.,
Wang, Z., Wu, S., Wei, L., Xiao, B., Xie, W., Xie, Y., Yogatama, D.,
Yuan, B., Zhan, J., and Zhu, Z. Deep Speech 2: End-to-End Speech Recognition
in English and Mandarin. In Proceedings of The 33rd International Conference on
Machine Learning (New York, NY, USA, June 20–22, 2016). Ed. by Balcan, M. F.
and Weinberger, K. Q. Vol. 48. Proceedings of Machine Learning Research. PMLR,
2016, pp. 173–182. url: proceedings.mlr.press/v48/amodei16.html.
[7] An, J. and Lu, J. Convergence of stochastic gradient descent under a local
Łojasiewicz condition for deep neural networks. arXiv:2304.09221 (2023), 14 pp. url:
arxiv.org/abs/2304.09221.
[8] Attouch, H. and Bolte, J. On the convergence of the proximal algorithm for
nonsmooth functions involving analytic features. Math. Program. 116, 1–2 (2009),
pp. 5–16. url: doi.org/10.1007/s10107-007-0133-5.
[9] Bach, F. Learning Theory from First Principles. Draft version of April 19, 2023.
book draft, to be published by MIT Press. 2023. url:
www.di.ens.fr/%7Efbach/ltfp_book.pdf.
[10] Baggenstos, J. and Salimova, D. Approximation properties of residual neural
networks for Kolmogorov PDEs. Discrete Contin. Dyn. Syst. Ser. B 28, 5 (2023),
pp. 3193–3215. url: doi.org/10.3934/dcdsb.2022210.
[11] Bahdanau, D., Cho, K., and Bengio, Y. Neural Machine Translation by Jointly
Learning to Align and Translate. arXiv:1409.0473 (2014), 15 pp. url:
arxiv.org/abs/1409.0473.
[12] Baldi, P. and Hornik, K. Neural networks and principal component analysis:
Learning from examples without local minima. Neural Networks 2, 1 (1989), pp. 53–
58. url: doi.org/10.1016/0893-6080(89)90014-2.
[13] Bao, G., Ye, X., Zang, Y., and Zhou, H. Numerical solution of inverse problems
by weak adversarial networks. Inverse Problems 36, 11 (2020), Art. No. 115003,
31 pp. url: doi.org/10.1088/1361-6420/abb447.
[14] Barron, A. R. Universal approximation bounds for superpositions of a sigmoidal
function. IEEE Trans. Inform. Theory 39, 3 (1993), pp. 930–945. url:
doi.org/10.1109/18.256500.
[15] Barron, A. R. Approximation and estimation bounds for artificial neural networks.
Mach. Learn. 14, 1 (1994), pp. 115–133. url: doi.org/10.1007/bf00993164.
[16] Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A.,
Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A.,
Faulkner, R., Gulcehre, C., Song, F., Ballard, A., Gilmer, J., Dahl, G.,
Vaswani, A., Allen, K., Nash, C., Langston, V., Dyer, C., Heess, N.,
Wierstra, D., Kohli, P., Botvinick, M., Vinyals, O., Li, Y., and Pascanu,
R. Relational inductive biases, deep learning, and graph networks. arXiv:1806.01261
(2018), 40 pp. url: arxiv.org/abs/1806.01261.

[17] Beck, C., Becker, S., Cheridito, P., Jentzen, A., and Neufeld, A. Deep
splitting method for parabolic PDEs. SIAM J. Sci. Comput. 43, 5 (2021), A3135–
A3154. url: doi.org/10.1137/19M1297919.
[18] Beck, C., Becker, S., Grohs, P., Jaafari, N., and Jentzen, A. Solving
stochastic differential equations and Kolmogorov equations by means of deep learning.
arXiv:1806.00421 (2018), 56 pp. url: arxiv.org/abs/1806.00421.
[19] Beck, C., Becker, S., Grohs, P., Jaafari, N., and Jentzen, A. Solving
the Kolmogorov PDE by means of deep learning. J. Sci. Comput. 88, 3 (2021),
Art. No. 73, 28 pp. url: doi.org/10.1007/s10915-021-01590-0.
[20] Beck, C., E, W., and Jentzen, A. Machine learning approximation algorithms
for high-dimensional fully nonlinear partial differential equations and second-order
backward stochastic differential equations. J. Nonlinear Sci. 29, 4 (2019), pp. 1563–
1619. url: doi.org/10.1007/s00332-018-9525-3.
[21] Beck, C., Gonon, L., and Jentzen, A. Overcoming the curse of dimensionality in
the numerical approximation of high-dimensional semilinear elliptic partial differential
equations. arXiv:2003.00596 (2020), 50 pp. url: arxiv.org/abs/2003.00596.
[22] Beck, C., Hornung, F., Hutzenthaler, M., Jentzen, A., and Kruse, T.
Overcoming the curse of dimensionality in the numerical approximation of Allen-
Cahn partial differential equations via truncated full-history recursive multilevel
Picard approximations. J. Numer. Math. 28, 4 (2020), pp. 197–222. url:
doi.org/10.1515/jnma-2019-0074.
[23] Beck, C., Hutzenthaler, M., and Jentzen, A. On nonlinear Feynman–Kac
formulas for viscosity solutions of semilinear parabolic partial differential equations.
Stoch. Dyn. 21, 8 (2021), Art. No. 2150048, 68 pp. url:
doi.org/10.1142/S0219493721500489.
[24] Beck, C., Hutzenthaler, M., Jentzen, A., and Kuckuck, B. An overview
on deep learning-based approximation methods for partial differential equations.
Discrete Contin. Dyn. Syst. Ser. B 28, 6 (2023), pp. 3697–3746. url:
doi.org/10.3934/dcdsb.2022238.
[25] Beck, C., Jentzen, A., and Kuckuck, B. Full error analysis for the training
of deep neural networks. Infin. Dimens. Anal. Quantum Probab. Relat. Top. 25, 2
(2022), Art. No. 2150020, 76 pp. url: doi.org/10.1142/S021902572150020X.
[26] Belak, C., Hager, O., Reimers, C., Schnell, L., and Würschmidt, M.
Convergence Rates for a Deep Learning Algorithm for Semilinear PDEs (2021).
Available at SSRN, 42 pp. url: doi.org/10.2139/ssrn.3981933.
[27] Bellman, R. Dynamic programming. Reprint of the 1957 edition. Princeton
University Press, Princeton, NJ, 2010, xxx+340 pp. url:
doi.org/10.1515/9781400835386.

[28] Beneventano, P., Cheridito, P., Graeber, R., Jentzen, A., and Kuckuck, B.
Deep neural network approximation theory for high-dimensional functions.
arXiv:2112.14523 (2021), 82 pp. url: arxiv.org/abs/2112.14523.
[29] Beneventano, P., Cheridito, P., Jentzen, A., and von Wurstemberger, P.
High-dimensional approximation spaces of artificial neural networks and applications
to partial differential equations. arXiv:2012.04326 (2020). url:
arxiv.org/abs/2012.04326.
[30] Bengio, Y., Simard, P., and Frasconi, P. Learning long-term dependencies
with gradient descent is difficult. IEEE Trans. Neural Netw. 5, 2 (1994), pp. 157–166.
url: doi.org/10.1109/72.279181.
[31] Bengio, Y., Boulanger-Lewandowski, N., and Pascanu, R. Advances in
optimizing recurrent networks. In 2013 IEEE International Conference on Acoustics,
Speech and Signal Processing (Vancouver, BC, Canada, May 26–31, 2013). 2013,
pp. 8624–8628. url: doi.org/10.1109/ICASSP.2013.6639349.
[32] Benth, F. E., Detering, N., and Galimberti, L. Neural networks in Fréchet
spaces. Ann. Math. Artif. Intell. 91, 1 (2023), pp. 75–103. url:
doi.org/10.1007/s10472-022-09824-z.
[33] Bercu, B. and Fort, J.-C. Generic Stochastic Gradient Methods. In Wiley
Encyclopedia of Operations Research and Management Science. Ed. by Cochran,
J. J., Cox Jr., L. A., Keskinocak, P., Kharoufeh, J. P., and Smith, J. C. John Wiley
& Sons, Ltd., 2013. url: doi.org/10.1002/9780470400531.eorms1068.
[34] Berg, J. and Nyström, K. A unified deep artificial neural network approach to
partial differential equations in complex geometries. Neurocomputing 317 (2018),
pp. 28–41. url: doi.org/10.1016/j.neucom.2018.06.056.
[35] Berner, J., Grohs, P., and Jentzen, A. Analysis of the Generalization Error:
Empirical Risk Minimization over Deep Artificial Neural Networks Overcomes the
Curse of Dimensionality in the Numerical Approximation of Black–Scholes Partial
Differential Equations. SIAM J. Math. Data Sci. 2, 3 (2020), pp. 631–657. url:
doi.org/10.1137/19M125649X.
[36] Berner, J., Grohs, P., Kutyniok, G., and Petersen, P. The Modern
Mathematics of Deep Learning. In Mathematical Aspects of Deep Learning. Ed.
by Grohs, P. and Kutyniok, G. Cambridge University Press, 2022, pp. 1–111. url:
doi.org/10.1017/9781009025096.002.
[37] Beznea, L., Cimpean, I., Lupascu-Stamate, O., Popescu, I., and Zarnescu,
A. From Monte Carlo to neural networks approximations of boundary value problems.
arXiv:2209.01432 (2022), 40 pp. url: arxiv.org/abs/2209.01432.
[38] Bierstone, E. and Milman, P. D. Semianalytic and subanalytic sets. Inst. Hautes
Études Sci. Publ. Math. 67 (1988), pp. 5–42. url: doi.org/10.1007/BF02699126.

[39] Bishop, C. M. Neural networks for pattern recognition. The Clarendon Press, Oxford
University Press, New York, 1995, xviii+482 pp.
[40] Bjorck, N., Gomes, C. P., Selman, B., and Weinberger, K. Q. Understand-
ing Batch Normalization. In Advances in Neural Information Processing Systems
(NeurIPS 2018). Ed. by Bengio, S., Wallach, H., Larochelle, H., Grauman, K.,
Cesa-Bianchi, N., and Garnett, R. Vol. 31. Curran Associates, Inc., 2018. url:
proceedings.neurips.cc/paper_files/paper/2018/file/36072923bfc3cf47745d704feb489480-Paper.pdf.
[41] Blum, E. K. and Li, L. K. Approximation theory and feedforward networks. Neural
Networks 4, 4 (1991), pp. 511–515. url: doi.org/10.1016/0893-6080(91)90047-9.
[42] Blumers, A. L., Li, Z., and Karniadakis, G. E. Supervised parallel-in-time
algorithm for long-time Lagrangian simulations of stochastic dynamics: Application
to hydrodynamics. J. Comput. Phys. 393 (2019), pp. 214–228. url:
doi.org/10.1016/j.jcp.2019.05.016.
[43] Bölcskei, H., Grohs, P., Kutyniok, G., and Petersen, P. Optimal approxi-
mation with sparsely connected deep neural networks. SIAM J. Math. Data Sci. 1, 1
(2019), pp. 8–45. url: doi.org/10.1137/18M118709X.
[44] Bolte, J., Daniilidis, A., and Lewis, A. The Łojasiewicz inequality for nons-
mooth subanalytic functions with applications to subgradient dynamical systems.
SIAM J. Optim. 17, 4 (2006), pp. 1205–1223. url: doi.org/10.1137/050644641.
[45] Bolte, J. and Pauwels, E. Conservative set valued fields, automatic differentia-
tion, stochastic gradient methods and deep learning. Math. Program. 188, 1 (2021),
pp. 19–51. url: doi.org/10.1007/s10107-020-01501-5.
[46] Borovykh, A., Bohte, S., and Oosterlee, C. W. Conditional Time Series
Forecasting with Convolutional Neural Networks. arXiv:1703.04691 (2017), 22 pp.
url: arxiv.org/abs/1703.04691.
[47] Bottou, L., Cortes, C., Denker, J., Drucker, H., Guyon, I., Jackel,
L., LeCun, Y., Muller, U., Sackinger, E., Simard, P., and Vapnik, V.
Comparison of classifier methods: a case study in handwritten digit recognition. In
Proceedings of the 12th IAPR International Conference on Pattern Recognition, Vol.
3 - Conference C: Signal Processing (Cat. No.94CH3440-5) (Jerusalem, Israel, Oct. 9–
13, 1994). Vol. 2. 1994, pp. 77–82. url: doi.org/10.1109/ICPR.1994.576879.
[48] Bottou, L., Curtis, F. E., and Nocedal, J. Optimization Methods for Large-
Scale Machine Learning. SIAM Rev. 60, 2 (2018), pp. 223–311. url:
doi.org/10.1137/16M1080173.
[49] Bourlard, H. and Kamp, Y. Auto-association by multilayer perceptrons and
singular value decomposition. Biol. Cybernet. 59, 4–5 (1988), pp. 291–294. url:
doi.org/10.1007/BF00332918.

[50] Boussange, V., Becker, S., Jentzen, A., Kuckuck, B., and Pellissier, L.
Deep learning approximations for non-local nonlinear PDEs with Neumann boundary
conditions. arXiv:2205.03672 (2022), 59 pp. url: arxiv.org/abs/2205.03672.
[51] Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A., Jozefowicz, R., and
Bengio, S. Generating Sentences from a Continuous Space. In Proceedings of the
20th SIGNLL Conference on Computational Natural Language Learning (Berlin,
Germany, Aug. 7–12, 2016). Ed. by Riezler, S. and Goldberg, Y. Association for
Computational Linguistics, 2016, pp. 10–21. url: doi.org/10.18653/v1/K16-1002.
[52] Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University
Press, 2004. 727 pp. url: doi.org/10.1017/CBO9780511804441.
[53] Brandstetter, J., van den Berg, R., Welling, M., and Gupta, J. K.
Clifford Neural Layers for PDE Modeling. arXiv:2209.04934 (2022), 58 pp. url:
arxiv.org/abs/2209.04934.
[54] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal,
P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S.,
Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A.,
Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E.,
Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S.,
Radford, A., Sutskever, I., and Amodei, D. Language Models are Few-Shot
Learners. arXiv:2005.14165 (2020), 75 pp. url: arxiv.org/abs/2005.14165.
[55] Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. Spectral Networks
and Locally Connected Networks on Graphs. arXiv:1312.6203 (2013), 14 pp. url:
arxiv.org/abs/1312.6203.
[56] Brunton, S. L. and Kutz, J. N. Machine Learning for Partial Differential
Equations. arXiv:2303.17078 (2023), 16 pp. url: arxiv.org/abs/2303.17078.
[57] Bubeck, S. Convex Optimization: Algorithms and Complexity. Found. Trends
Mach. Learn. 8, 3–4 (2015), pp. 231–357. url: doi.org/10.1561/2200000050.
[58] Cai, W. and Xu, Z.-Q. J. Multi-scale Deep Neural Networks for Solving High
Dimensional PDEs. arXiv:1910.11710 (2019), 14 pp. url: arxiv.org/abs/1910.
11710.
[59] Cakir, E., Parascandolo, G., Heittola, T., Huttunen, H., and Virtanen,
T. Convolutional Recurrent Neural Networks for Polyphonic Sound Event Detection.
IEEE/ACM Trans. Audio, Speech and Lang. Proc. 25, 6 (2017), pp. 1291–1303. url:
doi.org/10.1109/TASLP.2017.2690575.
[60] Calin, O. Deep learning architectures—a mathematical approach. Springer, Cham,
2020, xxx+760 pp. url: doi.org/10.1007/978-3-030-36721-3.

[61] Carl, B. and Stephani, I. Entropy, compactness and the approximation of
operators. Vol. 98. Cambridge University Press, Cambridge, 1990, x+277 pp. url:
doi.org/10.1017/CBO9780511897467.
[62] Castro, J. Deep learning schemes for parabolic nonlocal integro-differential equa-
tions. Partial Differ. Equ. Appl. 3, 6 (2022), Art. No. 77, 35 pp. url: doi.org/10.
1007/s42985-022-00213-z.
[63] Caterini, A. L. and Chang, D. E. Deep neural networks in a mathematical
framework. Springer, Cham, 2018, xiii+84 pp. url: doi.org/10.1007/978-3-319-
75304-1.
[64] Chan-Wai-Nam, Q., Mikael, J., and Warin, X. Machine learning for semi linear
PDEs. J. Sci. Comput. 79, 3 (2019), pp. 1667–1712. url: doi.org/10.1007/s10915-
019-00908-3.
[65] Chatterjee, S. Convergence of gradient descent for deep neural networks.
arXiv:2203.16462 (2022), 23 pp. url: arxiv.org/abs/2203.16462.
[66] Chen, F., Huang, J., Wang, C., and Yang, H. Friedrichs Learning: Weak
Solutions of Partial Differential Equations via Deep Learning. SIAM J. Sci. Comput.
45, 3 (2023), A1271–A1299. url: doi.org/10.1137/22M1488405.
[67] Chen, K., Wang, C., and Yang, H. Deep Operator Learning Lessens the Curse
of Dimensionality for PDEs. arXiv:2301.12227 (2023), 21 pp. url: arxiv.org/abs/
2301.12227.
[68] Chen, T. and Chen, H. Approximations of continuous functionals by neural
networks with application to dynamic systems. IEEE Trans. Neural Netw. 4, 6
(1993), pp. 910–918. url: doi.org/10.1109/72.286886.
[69] Chen, T. and Chen, H. Universal approximation to nonlinear operators by neural
networks with arbitrary activation functions and its application to dynamical systems.
IEEE Trans. Neural Netw. 6, 4 (1995), pp. 911–917. url: doi.org/10.1109/72.
392253.
[70] Cheridito, P., Jentzen, A., and Rossmannek, F. Efficient approximation of
high-dimensional functions with neural networks. IEEE Trans. Neural Netw. Learn.
Syst. 33, 7 (2022), pp. 3079–3093. url: doi.org/10.1109/TNNLS.2021.3049719.
[71] Cheridito, P., Jentzen, A., and Rossmannek, F. Gradient descent provably
escapes saddle points in the training of shallow ReLU networks. arXiv:2208.02083
(2022), 16 pp. url: arxiv.org/abs/2208.02083.
[72] Cheridito, P., Jentzen, A., and Rossmannek, F. Landscape analysis for
shallow neural networks: complete classification of critical points for affine target
functions. J. Nonlinear Sci. 32, 5 (2022), Art. No. 64, 45 pp. url: doi.org/10.
1007/s00332-022-09823-8.

[73] Cheridito, P., Soner, H. M., Touzi, N., and Victoir, N. Second-order
backward stochastic differential equations and fully nonlinear parabolic PDEs. Comm.
Pure Appl. Math. 60, 7 (2007), pp. 1081–1110. url: doi.org/10.1002/cpa.20168.
[74] Chizat, L. and Bach, F. On the Global Convergence of Gradient Descent for Over-
parameterized Models using Optimal Transport. In Advances in Neural Information
Processing Systems (NeurIPS 2018). Ed. by Bengio, S., Wallach, H., Larochelle,
H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. Vol. 31. Curran Associates,
Inc., 2018. url:
proceedings.neurips.cc/paper_files/paper/2018/file/a1afc58c6ca9540d057299ec3016d726-Paper.pdf.
[75] Chizat, L., Oyallon, E., and Bach, F. On Lazy Training in Differentiable
Programming. In Advances in Neural Information Processing Systems (NeurIPS
2019). Ed. by Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox,
E., and Garnett, R. Vol. 32. Curran Associates, Inc., 2019. url:
proceedings.neurips.cc/paper_files/paper/2019/file/ae614c557843b1df326cb29c57225459-Paper.pdf.
[76] Cho, K., van Merriënboer, B., Bahdanau, D., and Bengio, Y. On the
Properties of Neural Machine Translation: Encoder–Decoder Approaches. In Proceed-
ings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical
Translation (Doha, Qatar, Oct. 25, 2014). Association for Computational Linguistics,
2014, pp. 103–111. url: doi.org/10.3115/v1/W14-4012.
[77] Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares,
F., Schwenk, H., and Bengio, Y. Learning Phrase Representations using RNN
Encoder–Decoder for Statistical Machine Translation. arXiv:1406.1078 (2014), 15 pp.
url: arxiv.org/abs/1406.1078.
[78] Choi, K., Fazekas, G., Sandler, M., and Cho, K. Convolutional recurrent
neural networks for music classification. In 2017 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP) (New Orleans, LA, USA, Mar. 5–
9, 2017). 2017, pp. 2392–2396. url: doi.org/10.1109/ICASSP.2017.7952585.
[79] Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G., and Le-
Cun, Y. The Loss Surfaces of Multilayer Networks. In Proceedings of the Eighteenth
International Conference on Artificial Intelligence and Statistics (San Diego, Cal-
ifornia, USA, May 9–12, 2015). Ed. by Lebanon, G. and Vishwanathan, S. V. N.
Vol. 38. Proceedings of Machine Learning Research. PMLR, 2015, pp. 192–204. url:
proceedings.mlr.press/v38/choromanska15.html.
[80] Choromanska, A., LeCun, Y., and Ben Arous, G. Open Problem: The
landscape of the loss surfaces of multilayer networks. In Proceedings of The 28th
Conference on Learning Theory (Paris, France, July 3–6, 2015). Ed. by Grünwald, P.,
Hazan, E., and Kale, S. Vol. 40. Proceedings of Machine Learning Research. PMLR,
2015, pp. 1756–1760. url: proceedings.mlr.press/v40/Choromanska15.html.

[81] Chorowski, J. K., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y.
Attention-Based Models for Speech Recognition. In Advances in Neural Informa-
tion Processing Systems (NeurIPS 2015). Ed. by Cortes, C., Lawrence, N., Lee,
D., Sugiyama, M., and Garnett, R. Vol. 28. Curran Associates, Inc., 2015. url:
proceedings.neurips.cc/paper_files/paper/2015/file/1068c6e4c8051cfd4
e9ea8072e3189e2-Paper.pdf.
[82] Cioica-Licht, P. A., Hutzenthaler, M., and Werner, P. T. Deep neural
networks overcome the curse of dimensionality in the numerical approximation
of semilinear partial differential equations. arXiv:2205.14398 (2022), 34 pp. url:
arxiv.org/abs/2205.14398.
[83] Clevert, D.-A., Unterthiner, T., and Hochreiter, S. Fast and Accurate
Deep Network Learning by Exponential Linear Units (ELUs). arXiv:1511.07289
(2015), 14 pp. url: arxiv.org/abs/1511.07289.
[84] Colding, T. H. and Minicozzi II, W. P. Łojasiewicz inequalities and applications.
In Surveys in Differential Geometry 2014. Regularity and evolution of nonlinear
equations. Vol. 19. Int. Press, Somerville, MA, 2015, pp. 63–82. url: doi.org/10.
4310/SDG.2014.v19.n1.a3.
[85] Coleman, R. Calculus on normed vector spaces. Springer New York, 2012, xi+249
pp. url: doi.org/10.1007/978-1-4614-3894-6.
[86] Cox, S., Hutzenthaler, M., Jentzen, A., van Neerven, J., and Welti, T.
Convergence in Hölder norms with applications to Monte Carlo methods in infinite
dimensions. IMA J. Numer. Anal. 41, 1 (2020), pp. 493–548. url: doi.org/10.
1093/imanum/drz063.
[87] Cucker, F. and Smale, S. On the mathematical foundations of learning. Bull.
Amer. Math. Soc. (N.S.) 39, 1 (2002), pp. 1–49. url: doi.org/10.1090/S0273-
0979-01-00923-5.
[88] Cuomo, S., Di Cola, V. S., Giampaolo, F., Rozza, G., Raissi, M., and Pic-
cialli, F. Scientific Machine Learning Through Physics–Informed Neural Networks:
Where we are and What’s Next. J. Sci. Comput. 92, 3 (2022), Art. No. 88, 62 pp. url:
doi.org/10.1007/s10915-022-01939-z.
[89] Cybenko, G. Approximation by superpositions of a sigmoidal function. Math.
Control Signals Systems 2, 4 (1989), pp. 303–314. url: doi.org/10.1007/BF02551
274.
[90] Jagtap, A. D. and Karniadakis, G. E. Extended Physics-Informed Neural
Networks (XPINNs): A Generalized Space-Time Domain Decomposition Based Deep
Learning Framework for Nonlinear Partial Differential Equations. Commun. Comput.
Phys. 28, 5 (2020), pp. 2002–2041. url: doi.org/10.4208/cicp.OA-2020-0164.

[91] Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q., and Salakhutdinov,
R. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context.
In Proceedings of the 57th Annual Meeting of the Association for Computational
Linguistics (Florence, Italy, July 28–Aug. 2, 2019). Association for Computational
Linguistics, 2019, pp. 2978–2988. url: doi.org/10.18653/v1/P19-1285.
[92] Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and
Bengio, Y. Identifying and attacking the saddle point problem in high-dimensional
non-convex optimization. In Advances in Neural Information Processing Systems.
Ed. by Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberger, K.
Vol. 27. Curran Associates, Inc., 2014. url: proceedings.neurips.cc/paper_
files/paper/2014/file/17e23e50bedc63b4095e3d8204ce063b-Paper.pdf.
[93] Davis, D., Drusvyatskiy, D., Kakade, S., and Lee, J. D. Stochastic sub-
gradient method converges on tame functions. Found. Comput. Math. 20, 1 (2020),
pp. 119–154. url: doi.org/10.1007/s10208-018-09409-5.
[94] De Ryck, T. and Mishra, S. Generic bounds on the approximation error for
physics-informed (and) operator learning. arXiv:2205.11393 (2022), 40 pp. url:
arxiv.org/abs/2205.11393.
[95] Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional Neural
Networks on Graphs with Fast Localized Spectral Filtering. In Advances in Neural
Information Processing Systems. Ed. by Lee, D., Sugiyama, M., Luxburg, U., Guyon,
I., and Garnett, R. Vol. 29. Curran Associates, Inc., 2016. url:
proceedings.neurips.cc/paper_files/paper/2016/file/04df4d434d481c5bb723be1b6df1ee65-Paper.pdf.
[96] Défossez, A., Bottou, L., Bach, F., and Usunier, N. A Simple Convergence
Proof of Adam and Adagrad. arXiv:2003.02395 (2020), 30 pp. url: arxiv.org/
abs/2003.02395.
[97] Deisenroth, M. P., Faisal, A. A., and Ong, C. S. Mathematics for machine
learning. Cambridge University Press, Cambridge, 2020, xvii+371 pp. url: doi.
org/10.1017/9781108679930.
[98] Deng, B., Shin, Y., Lu, L., Zhang, Z., and Karniadakis, G. E. Approximation
rates of DeepONets for learning operators arising from advection–diffusion equations.
Neural Networks 153 (2022), pp. 411–426. url: doi.org/10.1016/j.neunet.2022.
06.019.
[99] Dereich, S., Jentzen, A., and Kassing, S. On the existence of minimizers in
shallow residual ReLU neural network optimization landscapes. arXiv:2302.14690
(2023), 26 pp. url: arxiv.org/abs/2302.14690.

[100] Dereich, S. and Kassing, S. Convergence of stochastic gradient descent schemes
for Lojasiewicz-landscapes. arXiv:2102.09385 (2021), 24 pp. url: arxiv.org/abs/
2102.09385.
[101] Dereich, S. and Kassing, S. Cooling down stochastic differential equations:
Almost sure convergence. Stochastic Process. Appl. 152 (2022), pp. 289–311. url:
doi.org/10.1016/j.spa.2022.06.020.
[102] Dereich, S. and Kassing, S. On the existence of optimal shallow feedforward
networks with ReLU activation. arXiv:2303.03950 (2023), 17 pp. url: arxiv.org/
abs/2303.03950.
[103] Dereich, S. and Müller-Gronbach, T. General multilevel adaptations for
stochastic approximation algorithms. arXiv:1506.05482 (2017), 33 pp. url:
arxiv.org/abs/1506.05482.
[104] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of
Deep Bidirectional Transformers for Language Understanding. In Proceedings of the
2019 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
(Minneapolis, MN, USA, June 2–7, 2019). Association for Computational Linguistics,
2019, pp. 4171–4186. url: doi.org/10.18653/v1/N19-1423.
[105] Ding, X., Zhang, Y., Liu, T., and Duan, J. Deep Learning for Event-Driven
Stock Prediction. In Proceedings of the 24th International Conference on Artificial
Intelligence (Buenos Aires, Argentina, July 25–31, 2015). IJCAI’15. AAAI Press,
2015, pp. 2327–2333. url: www.ijcai.org/Proceedings/15/Papers/329.pdf.
[106] Dissanayake, M. W. M. G. and Phan-Thien, N. Neural-network-based approx-
imations for solving partial differential equations. Commun. Numer. Methods Engrg.
10, 3 (1994), pp. 195–201. url: doi.org/10.1002/cnm.1640100303.
[107] Doersch, C. Tutorial on Variational Autoencoders. arXiv:1606.05908 (2016), 23 pp.
url: arxiv.org/abs/1606.05908.
[108] Donahue, J., Hendricks, L. A., Rohrbach, M., Venugopalan, S., Guadar-
rama, S., Saenko, K., and Darrell, T. Long-Term Recurrent Convolutional
Networks for Visual Recognition and Description. IEEE Trans. Pattern Anal. Mach.
Intell. 39, 4 (2017), pp. 677–691. url: doi.org/10.1109/TPAMI.2016.2599174.
[109] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X.,
Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S.,
Uszkoreit, J., and Houlsby, N. An Image is Worth 16x16 Words: Transformers
for Image Recognition at Scale. arXiv:2010.11929 (2020), 22 pp. url: arxiv.org/
abs/2010.11929.

[110] Dos Santos, C. and Gatti, M. Deep Convolutional Neural Networks for Sentiment
Analysis of Short Texts. In Proceedings of COLING 2014, the 25th International Con-
ference on Computational Linguistics: Technical Papers (Dublin, Ireland, Aug. 23–29,
2014). Dublin City University and Association for Computational Linguistics, 2014,
pp. 69–78. url: aclanthology.org/C14-1008.
[111] Dozat, T. Incorporating Nesterov momentum into Adam.
https://openreview.net/forum?id=OM0jvwB8jIp57ZJjtNEZ. [Accessed 6-December-2017]. 2016.
[112] Dozat, T. Incorporating Nesterov momentum into Adam.
http://cs229.stanford.edu/proj2015/054_report.pdf. [Accessed 6-December-2017]. 2016.
[113] Du, S. and Lee, J. On the Power of Over-parametrization in Neural Networks
with Quadratic Activation. In Proceedings of the 35th International Conference on
Machine Learning (Stockholm, Sweden, July 10–15, 2018). Ed. by Dy, J. and Krause,
A. Vol. 80. Proceedings of Machine Learning Research. PMLR, 2018, pp. 1329–1338.
url: proceedings.mlr.press/v80/du18a.html.
[114] Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. Gradient Descent Finds Global
Minima of Deep Neural Networks. In Proceedings of the 36th International Conference
on Machine Learning (Long Beach, CA, USA, June 9–15, 2019). Ed. by Chaudhuri,
K. and Salakhutdinov, R. Vol. 97. Proceedings of Machine Learning Research. PMLR,
2019, pp. 1675–1685. url: proceedings.mlr.press/v97/du19c.html.
[115] Du, T., Huang, Z., and Li, Y. Approximation and Generalization of DeepONets
for Learning Operators Arising from a Class of Singularly Perturbed Problems.
arXiv:2306.16833 (2023), 32 pp. url: arxiv.org/abs/2306.16833.
[116] Duchi, J. Probability Bounds.
https://stanford.edu/~jduchi/projects/probability_bounds.pdf. [Accessed 27-October-2023].
[117] Duchi, J., Hazan, E., and Singer, Y. Adaptive Subgradient Methods for Online
Learning and Stochastic Optimization. J. Mach. Learn. Res. 12 (2011), pp. 2121–
2159. url: jmlr.org/papers/v12/duchi11a.html.
[118] Dumoulin, V., Belghazi, I., Poole, B., Mastropietro, O., Lamb, A., Ar-
jovsky, M., and Courville, A. Adversarially Learned Inference. arXiv:1606.00704
(2016), 18 pp. url: arxiv.org/abs/1606.00704.
[119] E, W., Han, J., and Jentzen, A. Deep learning-based numerical methods for
high-dimensional parabolic partial differential equations and backward stochastic
differential equations. Commun. Math. Stat. 5, 4 (2017), pp. 349–380. url: doi.
org/10.1007/s40304-017-0117-6.
[120] E, W., Han, J., and Jentzen, A. Algorithms for solving high dimensional PDEs:
from nonlinear Monte Carlo to machine learning. Nonlinearity 35, 1 (2021), p. 278.
url: doi.org/10.1088/1361-6544/ac337f.

[121] E, W., Ma, C., and Wu, L. The Barron space and the flow-induced function
spaces for neural network models. Constr. Approx. 55, 1 (2022), pp. 369–406. url:
doi.org/10.1007/s00365-021-09549-y.
[122] E, W., Ma, C., Wu, L., and Wojtowytsch, S. Towards a Mathematical
Understanding of Neural Network-Based Machine Learning: What We Know and
What We Don’t. CSIAM Trans. Appl. Math. 1, 4 (2020), pp. 561–615. url: doi.
org/10.4208/csiam-am.SO-2020-0002.
[123] E, W. and Wojtowytsch, S. Some observations on high-dimensional partial
differential equations with Barron data. In Proceedings of the 2nd Mathematical
and Scientific Machine Learning Conference (Aug. 16–19, 2021). Ed. by Bruna, J.,
Hesthaven, J., and Zdeborova, L. Vol. 145. Proceedings of Machine Learning Research.
PMLR, 2022, pp. 253–269. url: proceedings.mlr.press/v145/e22a.html.
[124] E, W. and Yu, B. The deep Ritz method: a deep learning-based numerical algorithm
for solving variational problems. Commun. Math. Stat. 6, 1 (2018), pp. 1–12. url:
doi.org/10.1007/s40304-018-0127-z.
[125] Eberle, S., Jentzen, A., Riekert, A., and Weiss, G. Normalized gradient flow
optimization in the training of ReLU artificial neural networks. arXiv:2207.06246
(2022), 26 pp. url: arxiv.org/abs/2207.06246.
[126] Eberle, S., Jentzen, A., Riekert, A., and Weiss, G. S. Existence, uniqueness,
and convergence rates for gradient flows in the training of artificial neural networks
with ReLU activation. Electron. Res. Arch. 31, 5 (2023), pp. 2519–2554. url:
doi.org/10.3934/era.2023128.
[127] Einsiedler, M. and Ward, T. Functional analysis, spectral theory, and applica-
tions. Vol. 276. Springer, Cham, 2017, xiv+614 pp. url: doi.org/10.1007/978-3-
319-58540-6.
[128] Elbrächter, D., Grohs, P., Jentzen, A., and Schwab, C. DNN expression
rate analysis of high-dimensional PDEs: application to option pricing. Constr. Approx.
55, 1 (2022), pp. 3–71. url: doi.org/10.1007/s00365-021-09541-6.
[129] Encyclopedia of Mathematics: Lojasiewicz inequality.
https://encyclopediaofmath.org/wiki/Lojasiewicz_inequality. [Accessed 28-August-2023].
[130] Fabbri, M. and Moro, G. Dow Jones Trading with Deep Learning: The Un-
reasonable Effectiveness of Recurrent Neural Networks. In Proceedings of the 7th
International Conference on Data Science, Technology and Applications (Porto,
Portugal, July 26–28, 2018). Ed. by Bernardino, J. and Quix, C. SciTePress - Science
and Technology Publications, 2018. url: doi.org/10.5220/0006922101420153.
[131] Fan, J., Ma, C., and Zhong, Y. A selective overview of deep learning. Statist.
Sci. 36, 2 (2021), pp. 264–290. url: doi.org/10.1214/20-sts783.

[132] Fehrman, B., Gess, B., and Jentzen, A. Convergence Rates for the Stochastic
Gradient Descent Method for Non-Convex Objective Functions. J. Mach. Learn.
Res. 21, 136 (2020), pp. 1–48. url: jmlr.org/papers/v21/19-636.html.
[133] Fischer, T. and Krauss, C. Deep learning with long short-term memory networks
for financial market predictions. European J. Oper. Res. 270, 2 (2018), pp. 654–669.
url: doi.org/10.1016/j.ejor.2017.11.054.
[134] Fraenkel, L. E. Formulae for high derivatives of composite functions. Math.
Proc. Cambridge Philos. Soc. 83, 2 (1978), pp. 159–165. url: doi.org/10.1017/
S0305004100054402.
[135] Fresca, S., Dede’, L., and Manzoni, A. A comprehensive deep learning-based
approach to reduced order modeling of nonlinear time-dependent parametrized PDEs.
J. Sci. Comput. 87, 2 (2021), Art. No. 61, 36 pp. url: doi.org/10.1007/s10915-
021-01462-7.
[136] Fresca, S. and Manzoni, A. POD-DL-ROM: enhancing deep learning-based
reduced order models for nonlinear parametrized PDEs by proper orthogonal decom-
position. Comput. Methods Appl. Mech. Engrg. 388 (2022), Art. No. 114181, 27 pp.
url: doi.org/10.1016/j.cma.2021.114181.
[137] Frey, R. and Köck, V. Convergence Analysis of the Deep Splitting Scheme:
the Case of Partial Integro-Differential Equations and the associated FBSDEs with
Jumps. arXiv:2206.01597 (2022), 21 pp. url: arxiv.org/abs/2206.01597.
[138] Frey, R. and Köck, V. Deep Neural Network Algorithms for Parabolic PIDEs
and Applications in Insurance and Finance. Computation 10, 11 (2022). url: doi.
org/10.3390/computation10110201.
[139] Friedrichs, K. O. Symmetric positive linear differential equations. Comm. Pure
Appl. Math. 11 (1958), pp. 333–418. url: doi.org/10.1002/cpa.3160110306.
[140] Fujii, M., Takahashi, A., and Takahashi, M. Asymptotic Expansion as Prior
Knowledge in Deep Learning Method for High dimensional BSDEs. Asia-Pacific
Financial Markets 26, 3 (2019), pp. 391–408. url: doi.org/10.1007/s10690-019-
09271-7.
[141] Fukumizu, K. and Amari, S. Local minima and plateaus in hierarchical structures
of multilayer perceptrons. Neural Networks 13, 3 (2000), pp. 317–327. url: doi.
org/10.1016/S0893-6080(00)00009-5.
[142] Gallon, D., Jentzen, A., and Lindner, F. Blow up phenomena for gra-
dient descent optimization methods in the training of artificial neural networks.
arXiv:2211.15641 (2022), 84 pp. url: arxiv.org/abs/2211.15641.

[143] Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. Con-
volutional Sequence to Sequence Learning. In Proceedings of the 34th International
Conference on Machine Learning (Sydney, Australia, Aug. 6–11, 2017). Ed. by Pre-
cup, D. and Teh, Y. W. Vol. 70. Proceedings of Machine Learning Research. PMLR,
2017, pp. 1243–1252. url: proceedings.mlr.press/v70/gehring17a.html.
[144] Gentile, R. and Welper, G. Approximation results for Gradient Descent trained
Shallow Neural Networks in 1d. arXiv:2209.08399 (2022), 49 pp. url: arxiv.org/
abs/2209.08399.
[145] Germain, M., Pham, H., and Warin, X. Neural networks-based algorithms
for stochastic control and PDEs in finance. arXiv:2101.08068 (2021), 27 pp. url:
arxiv.org/abs/2101.08068.
[146] Germain, M., Pham, H., and Warin, X. Approximation error analysis of some
deep backward schemes for nonlinear PDEs. SIAM J. Sci. Comput. 44, 1 (2022),
A28–A56. url: doi.org/10.1137/20M1355355.
[147] Gers, F. A., Schmidhuber, J., and Cummins, F. Learning to Forget: Continual
Prediction with LSTM. Neural Comput. 12, 10 (2000), pp. 2451–2471. url: doi.
org/10.1162/089976600300015015.
[148] Gers, F. A., Schraudolph, N. N., and Schmidhuber, J. Learning precise
timing with LSTM recurrent networks. J. Mach. Learn. Res. 3, 1 (2003), pp. 115–143.
url: doi.org/10.1162/153244303768966139.
[149] Gess, B., Kassing, S., and Konarovskyi, V. Stochastic Modified Flows, Mean-
Field Limits and Dynamics of Stochastic Gradient Descent. arXiv:2302.07125 (2023),
24 pp. url: arxiv.org/abs/2302.07125.
[150] Giles, M. B., Jentzen, A., and Welti, T. Generalised multilevel Picard ap-
proximations. arXiv:1911.03188 (2019), 61 pp. url: arxiv.org/abs/1911.03188.
[151] Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E.
Neural Message Passing for Quantum Chemistry. In Proceedings of the 34th Interna-
tional Conference on Machine Learning (Sydney, Australia, Aug. 6–11, 2017). Ed. by
Precup, D. and Teh, Y. W. Vol. 70. Proceedings of Machine Learning Research.
PMLR, 2017, pp. 1263–1272. url: proceedings.mlr.press/v70/gilmer17a.html.
[152] Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich Feature Hierar-
chies for Accurate Object Detection and Semantic Segmentation. In Proceedings of
the 2014 IEEE Conference on Computer Vision and Pattern Recognition (Columbus,
OH, USA, June 23–28, 2014). CVPR ’14. IEEE Computer Society, 2014, pp. 580–587.
url: doi.org/10.1109/CVPR.2014.81.

[153] Glorot, X. and Bengio, Y. Understanding the difficulty of training deep feedfor-
ward neural networks. In Proceedings of the Thirteenth International Conference on
Artificial Intelligence and Statistics (Chia Laguna Resort, Sardinia, Italy, May 13–15,
2010). Ed. by Teh, Y. W. and Titterington, M. Vol. 9. Proceedings of Machine
Learning Research. PMLR, 2010, pp. 249–256. url: proceedings.mlr.press/v9/
glorot10a.html.
[154] Gnoatto, A., Patacca, M., and Picarelli, A. A deep solver for BSDEs with
jumps. arXiv:2211.04349 (2022), 31 pp. url: arxiv.org/abs/2211.04349.
[155] Gobet, E. Monte-Carlo methods and stochastic processes. From linear to non-linear.
CRC Press, Boca Raton, FL, 2016, xxv+309 pp.
[156] Godichon-Baggioni, A. and Tarrago, P. Non asymptotic analysis of Adaptive
stochastic gradient algorithms and applications. arXiv:2303.01370 (2023), 59 pp.
url: arxiv.org/abs/2303.01370.
[157] Goldberg, Y. Neural Network Methods for Natural Language Processing. Springer
Cham, 2017, xx+292 pp. url: doi.org/10.1007/978-3-031-02165-7.
[158] Gonon, L. Random Feature Neural Networks Learn Black-Scholes Type PDEs
Without Curse of Dimensionality. J. Mach. Learn. Res. 24, 189 (2023), pp. 1–51.
url: jmlr.org/papers/v24/21-0987.html.
[159] Gonon, L., Graeber, R., and Jentzen, A. The necessity of depth for artificial
neural networks to approximate certain classes of smooth and bounded functions
without the curse of dimensionality. arXiv:2301.08284 (2023), 101 pp. url: arxiv.
org/abs/2301.08284.
[160] Gonon, L., Grigoryeva, L., and Ortega, J.-P. Approximation bounds for
random neural networks and reservoir systems. Ann. Appl. Probab. 33, 1 (2023),
pp. 28–69. url: doi.org/10.1214/22-aap1806.
[161] Gonon, L., Grohs, P., Jentzen, A., Kofler, D., and Šiška, D. Uniform error
estimates for artificial neural network approximations for heat equations. IMA J.
Numer. Anal. 42, 3 (2022), pp. 1991–2054. url: doi.org/10.1093/imanum/drab027.
[162] Gonon, L. and Schwab, C. Deep ReLU network expression rates for option
prices in high-dimensional, exponential Lévy models. Finance Stoch. 25, 4 (2021),
pp. 615–657. url: doi.org/10.1007/s00780-021-00462-7.
[163] Gonon, L. and Schwab, C. Deep ReLU neural networks overcome the curse of
dimensionality for partial integrodifferential equations. Anal. Appl. (Singap.) 21, 1
(2023), pp. 1–47. url: doi.org/10.1142/S0219530522500129.
[164] Goodfellow, I., Bengio, Y., and Courville, A. Deep learning. MIT Press,
Cambridge, MA, 2016, xxii+775 pp. url: www.deeplearningbook.org/.

[165] Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley,
D., Ozair, S., Courville, A., and Bengio, Y. Generative Adversarial Networks.
arXiv:1406.2661 (2014), 9 pp. url: arxiv.org/abs/1406.2661.
[166] Gori, M., Monfardini, G., and Scarselli, F. A new model for learning in
graph domains. In Proceedings. 2005 IEEE International Joint Conference on Neural
Networks, 2005. Vol. 2. 2005, 729–734 vol. 2. url: doi.org/10.1109/IJCNN.2005.
1555942.
[167] Goswami, S., Jagtap, A. D., Babaee, H., Susi, B. T., and Karniadakis,
G. E. Learning stiff chemical kinetics using extended deep neural operators.
arXiv:2302.12645 (2023), 21 pp. url: arxiv.org/abs/2302.12645.
[168] Graham, C. and Talay, D. Stochastic simulation and Monte Carlo methods.
Vol. 68. Mathematical foundations of stochastic simulation. Springer, Heidelberg,
2013, xvi+260 pp. url: doi.org/10.1007/978-3-642-39363-1.
[169] Graves, A. Generating Sequences With Recurrent Neural Networks. arXiv:1308.0850
(2013), 43 pp. url: arxiv.org/abs/1308.0850.
[170] Graves, A. and Jaitly, N. Towards End-To-End Speech Recognition with Re-
current Neural Networks. In Proceedings of the 31st International Conference on
Machine Learning (Beijing, China, June 22–24, 2014). Ed. by Xing, E. P. and Jebara,
T. Vol. 32. Proceedings of Machine Learning Research 2. PMLR, 2014, pp. 1764–1772.
url: proceedings.mlr.press/v32/graves14.html.
[171] Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., and
Schmidhuber, J. A Novel Connectionist System for Unconstrained Handwriting
Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 31, 5 (2009), pp. 855–868.
url: doi.org/10.1109/TPAMI.2008.137.
[172] Graves, A., Mohamed, A.-r., and Hinton, G. E. Speech recognition with deep
recurrent neural networks. In 2013 IEEE International Conference on Acoustics,
Speech and Signal Processing (Vancouver, BC, Canada, May 26–31, 2013). 2013,
pp. 6645–6649. url: doi.org/10.1109/ICASSP.2013.6638947.
[173] Graves, A. and Schmidhuber, J. Framewise phoneme classification with bidirec-
tional LSTM and other neural network architectures. Neural Networks 18, 5 (2005).
IJCNN 2005, pp. 602–610. url: doi.org/10.1016/j.neunet.2005.06.042.
[174] Greff, K., Srivastava, R. K., Koutník, J., Steunebrink, B. R., and Schmid-
huber, J. LSTM: A Search Space Odyssey. IEEE Trans. Neural Netw. Learn. Syst.
28, 10 (2017), pp. 2222–2232. url: doi.org/10.1109/TNNLS.2016.2582924.
[175] Gribonval, R., Kutyniok, G., Nielsen, M., and Voigtlaender, F. Approx-
imation spaces of deep neural networks. Constr. Approx. 55, 1 (2022), pp. 259–367.
url: doi.org/10.1007/s00365-021-09543-4.

[176] Griewank, A. and Walther, A. Evaluating Derivatives. 2nd ed. Society for
Industrial and Applied Mathematics, 2008. url: doi.org/10.1137/1.9780898717
761.
[177] Grohs, P. and Herrmann, L. Deep neural network approximation for high-
dimensional elliptic PDEs with boundary conditions. IMA J. Numer. Anal. 42, 3
(May 2021), pp. 2055–2082. url: doi.org/10.1093/imanum/drab031.
[178] Grohs, P. and Herrmann, L. Deep neural network approximation for high-
dimensional parabolic Hamilton-Jacobi-Bellman equations. arXiv:2103.05744 (2021),
23 pp. url: arxiv.org/abs/2103.05744.
[179] Grohs, P., Hornung, F., Jentzen, A., and von Wurstemberger, P. A
proof that artificial neural networks overcome the curse of dimensionality in the
numerical approximation of Black-Scholes partial differential equations. Mem. Amer.
Math. Soc. 284, 1410 (2023), v+93 pp. url: doi.org/10.1090/memo/1410.
[180] Grohs, P., Hornung, F., Jentzen, A., and Zimmermann, P. Space-time error
estimates for deep neural network approximations for differential equations. Adv.
Comput. Math. 49, 1 (2023), Art. No. 4, 78 pp. url: doi.org/10.1007/s10444-
022-09970-2.
[181] Grohs, P., Jentzen, A., and Salimova, D. Deep neural network approximations
for solutions of PDEs based on Monte Carlo algorithms. Partial Differ. Equ. Appl.
3, 4 (2022), Art. No. 45, 41 pp. url: doi.org/10.1007/s42985-021-00100-z.
[182] Grohs, P. and Kutyniok, G., eds. Mathematical aspects of deep learning. Cambridge
University Press, Cambridge, 2023, xviii+473 pp. url:
doi.org/10.1016/j.enganabound.2022.10.033.
[183] Gu, Y., Yang, H., and Zhou, C. SelectNet: Self-paced learning for high-dimensional
partial differential equations. J. Comput. Phys. 441 (2021), p. 110444. url:
doi.org/10.1016/j.jcp.2021.110444.
[184] Gühring, I., Kutyniok, G., and Petersen, P. Error bounds for approximations
with deep ReLU neural networks in $W^{s,p}$ norms. Anal. Appl. (Singap.) 18, 5 (2020),
pp. 803–859. url: doi.org/10.1142/S0219530519410021.
[185] Guo, X., Li, W., and Iorio, F. Convolutional Neural Networks for Steady Flow
Approximation. In Proceedings of the 22nd ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining (San Francisco, California, USA, Aug. 13–
17, 2016). KDD ’16. New York, NY, USA: Association for Computing Machinery,
2016, pp. 481–490. url: doi.org/10.1145/2939672.2939738.
[186] Han, J. and E, W. Deep Learning Approximation for Stochastic Control Problems.
arXiv:1611.07422 (2016), 9 pp. url: arxiv.org/abs/1611.07422.

[187] Han, J., Jentzen, A., and E, W. Solving high-dimensional partial differential
equations using deep learning. Proc. Natl. Acad. Sci. USA 115, 34 (2018), pp. 8505–
8510. url: doi.org/10.1073/pnas.1718942115.
[188] Han, J. and Long, J. Convergence of the deep BSDE method for coupled FBSDEs.
Probab. Uncertain. Quant. Risk 5 (2020), Art. No. 5, 33 pp. url: doi.org/10.1186/s41546-020-00047-w.
[189] Hastie, T., Tibshirani, R., and Friedman, J. The elements of statistical
learning. 2nd ed. Data mining, inference, and prediction. Springer, New York, 2009,
xxii+745 pp. url: doi.org/10.1007/978-0-387-84858-7.
[190] He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image
Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) (Las Vegas, NV, USA, June 27–30, 2016). 2016, pp. 770–778. url: doi.org/10.1109/CVPR.2016.90.
[191] He, K., Zhang, X., Ren, S., and Sun, J. Identity Mappings in Deep Residual
Networks. In Computer Vision – ECCV 2016, 14th European Conference, Proceedings
Part IV (Amsterdam, The Netherlands, Oct. 11–14, 2016). Ed. by Leibe, B., Matas,
J., Sebe, N., and Welling, M. Springer, Cham, 2016, pp. 630–645. url: doi.org/10.1007/978-3-319-46493-0_38.
[192] Heiß, C., Gühring, I., and Eigel, M. Multilevel CNNs for Parametric PDEs.
arXiv:2304.00388 (2023), 42 pp. url: arxiv.org/abs/2304.00388.
[193] Hendrycks, D. and Gimpel, K. Gaussian Error Linear Units (GELUs).
arXiv:1606.08415v4 (2016), 10 pp. url: arxiv.org/abs/1606.08415.
[194] Henry, D. Geometric theory of semilinear parabolic equations. Vol. 840. Springer-
Verlag, Berlin, 1981, iv+348 pp.
[195] Henry-Labordère, P. Counterparty Risk Valuation: A Marked Branching Diffusion
Approach. arXiv:1203.2369 (2012), 17 pp. url: arxiv.org/abs/1203.2369.
[196] Henry-Labordère, P. Deep Primal-Dual Algorithm for BSDEs: Applications of
Machine Learning to CVA and IM (2017). Available at SSRN. url: doi.org/10.2139/ssrn.3071506.
[197] Henry-Labordère, P. and Touzi, N. Branching diffusion representation for
nonlinear Cauchy problems and Monte Carlo approximation. Ann. Appl. Probab. 31,
5 (2021), pp. 2350–2375. url: doi.org/10.1214/20-aap1649.
[198] Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data
with neural networks. Science 313, 5786 (2006), pp. 504–507. url: doi.org/10.1126/science.1127647.

[199] Hinton, G., Srivastava, N., and Swersky, K. Lecture 6e: RMSprop: Divide
the gradient by a running average of its recent magnitude.
https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf. [Accessed
01-December-2017].
[200] Hinton, G. E. and Zemel, R. Autoencoders, Minimum Description Length and
Helmholtz Free Energy. In Advances in Neural Information Processing Systems.
Ed. by Cowan, J., Tesauro, G., and Alspector, J. Vol. 6. Morgan-Kaufmann, 1993.
url: proceedings.neurips.cc/paper_files/paper/1993/file/9e3cfc48eccf81a0d57663e129aef3cb-Paper.pdf.
[201] Hochreiter, S. and Schmidhuber, J. Long Short-Term Memory. Neural Comput.
9, 8 (1997), pp. 1735–1780. url: doi.org/10.1162/neco.1997.9.8.1735.
[202] Hornik, K. Some new results on neural network approximation. Neural Networks
6, 8 (1993), pp. 1069–1072. url: doi.org/10.1016/S0893-6080(09)80018-X.
[203] Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural
Networks 4, 2 (1991), pp. 251–257. url: doi.org/10.1016/0893-6080(91)90009-T.
[204] Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks
are universal approximators. Neural Networks 2, 5 (1989), pp. 359–366. url: doi.org/10.1016/0893-6080(89)90020-8.
[205] Hornung, F., Jentzen, A., and Salimova, D. Space-time deep neural network
approximations for high-dimensional partial differential equations. arXiv:2006.02199
(2020), 52 pp. url: arxiv.org/abs/2006.02199.
[206] Huang, G., Liu, Z., Maaten, L. V. D., and Weinberger, K. Q. Densely
Connected Convolutional Networks. In 2017 IEEE Conference on Computer Vision
and Pattern Recognition (CVPR) (Honolulu, HI, USA, July 21–26, 2017). Los
Alamitos, CA, USA: IEEE Computer Society, 2017, pp. 2261–2269. url: doi.org/10.1109/CVPR.2017.243.
[207] Huré, C., Pham, H., and Warin, X. Deep backward schemes for high-dimensional
nonlinear PDEs. Math. Comp. 89, 324 (2020), pp. 1547–1579. url: doi.org/10.1090/mcom/3514.
[208] Hutzenthaler, M., Jentzen, A., and Kruse, T. Overcoming the curse of
dimensionality in the numerical approximation of parabolic partial differential equations
with gradient-dependent nonlinearities. Found. Comput. Math. 22, 4 (2022),
pp. 905–966. url: doi.org/10.1007/s10208-021-09514-y.
[209] Hutzenthaler, M., Jentzen, A., Kruse, T., and Nguyen, T. A. A proof
that rectified deep neural networks overcome the curse of dimensionality in the
numerical approximation of semilinear heat equations. SN Partial Differ. Equ. Appl.
10, 1 (2020). url: doi.org/10.1007/s42985-019-0006-9.

[210] Hutzenthaler, M., Jentzen, A., Kruse, T., and Nguyen, T. A. Multilevel
Picard approximations for high-dimensional semilinear second-order PDEs with
Lipschitz nonlinearities. arXiv:2009.02484 (2020), 37 pp. url: arxiv.org/abs/2009.02484.
[211] Hutzenthaler, M., Jentzen, A., Kruse, T., and Nguyen, T. A. Overcoming
the curse of dimensionality in the numerical approximation of backward stochastic
differential equations. arXiv:2108.10602 (2021), 34 pp. url: arxiv.org/abs/2108.10602.
[212] Hutzenthaler, M., Jentzen, A., Kruse, T., Nguyen, T. A., and von
Wurstemberger, P. Overcoming the curse of dimensionality in the numerical
approximation of semilinear parabolic partial differential equations. Proc. A. 476,
2244 (2020), Art. No. 20190630, 25 pp. url: doi.org/10.1098/rspa.2019.0630.
[213] Hutzenthaler, M., Jentzen, A., Pohl, K., Riekert, A., and Scarpa, L.
Convergence proof for stochastic gradient descent in the training of deep neural
networks with ReLU activation for constant target functions. arXiv:2112.07369
(2021), 71 pp. url: arxiv.org/abs/2112.07369.
[214] Hutzenthaler, M., Jentzen, A., and von Wurstemberger, P. Overcoming
the curse of dimensionality in the approximative pricing of financial derivatives with
default risks. Electron. J. Probab. 25 (2020), Art. No. 101, 73 pp. url: doi.org/10.1214/20-ejp423.
[215] Ibragimov, S., Jentzen, A., Kröger, T., and Riekert, A. On the existence
of infinitely many realization functions of non-global local minima in the training of
artificial neural networks with ReLU activation. arXiv:2202.11481 (2022), 49 pp.
url: arxiv.org/abs/2202.11481.
[216] Ibragimov, S., Jentzen, A., and Riekert, A. Convergence to good non-optimal
critical points in the training of neural networks: Gradient descent optimization
with one random initialization overcomes all bad non-global local minima with high
probability. arXiv:2212.13111 (2022), 98 pp. url: arxiv.org/abs/2212.13111.
[217] Ioffe, S. and Szegedy, C. Batch Normalization: Accelerating Deep Network
Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International
Conference on Machine Learning – Volume 37 (Lille, France, July 6–11,
2015). Ed. by Bach, F. and Blei, D. ICML’15. JMLR.org, 2015, pp. 448–456.
[218] Jacot, A., Gabriel, F., and Hongler, C. Neural Tangent Kernel: Convergence
and Generalization in Neural Networks. In Advances in Neural Information Processing
Systems. Ed. by Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi,
N., and Garnett, R. Vol. 31. Curran Associates, Inc., 2018. url:
proceedings.neurips.cc/paper_files/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf.

[219] Jagtap, A. D., Kharazmi, E., and Karniadakis, G. E. Conservative physics-
informed neural networks on discrete domains for conservation laws: Applications
to forward and inverse problems. Comput. Methods Appl. Mech. Engrg. 365 (2020),
p. 113028. url: doi.org/10.1016/j.cma.2020.113028.
[220] Jentzen, A., Kuckuck, B., Neufeld, A., and von Wurstemberger,
P. Strong error analysis for stochastic gradient descent optimization algorithms.
arXiv:1801.09324 (2018), 75 pp. url: arxiv.org/abs/1801.09324.
[221] Jentzen, A., Kuckuck, B., Neufeld, A., and von Wurstemberger, P.
Strong error analysis for stochastic gradient descent optimization algorithms. IMA J.
Numer. Anal. 41, 1 (2020), pp. 455–492. url: doi.org/10.1093/imanum/drz055.
[222] Jentzen, A., Mazzonetto, S., and Salimova, D. Existence and uniqueness
properties for solutions of a class of Banach space valued evolution equations (2018),
28 pp. url: arxiv.org/abs/1812.06859.
[223] Jentzen, A. and Riekert, A. A proof of convergence for the gradient descent
optimization method with random initializations in the training of neural networks
with ReLU activation for piecewise linear target functions. J. Mach. Learn. Res. 23,
260 (2022), pp. 1–50. url: jmlr.org/papers/v23/21-0962.html.
[224] Jentzen, A. and Riekert, A. On the Existence of Global Minima and Convergence
Analyses for Gradient Descent Methods in the Training of Deep Neural Networks. J.
Mach. Learn. 1, 2 (2022), pp. 141–246. url: doi.org/10.4208/jml.220114a.
[225] Jentzen, A. and Riekert, A. Convergence analysis for gradient flows in the
training of artificial neural networks with ReLU activation. J. Math. Anal. Appl. 517,
2 (2023), Art. No. 126601, 43 pp. url: doi.org/10.1016/j.jmaa.2022.126601.
[226] Jentzen, A. and Riekert, A. Strong Overall Error Analysis for the Training of
Artificial Neural Networks Via Random Initializations. Commun. Math. Stat. (2023).
url: doi.org/10.1007/s40304-022-00292-9.
[227] Jentzen, A., Riekert, A., and von Wurstemberger, P. Algorithmically
Designed Artificial Neural Networks (ADANNs): Higher order deep operator learning
for parametric partial differential equations. arXiv:2302.03286 (2023), 22 pp. url:
arxiv.org/abs/2302.03286.
[228] Jentzen, A., Salimova, D., and Welti, T. A proof that deep artificial neural
networks overcome the curse of dimensionality in the numerical approximation of
Kolmogorov partial differential equations with constant diffusion and nonlinear drift
coefficients. Commun. Math. Sci. 19, 5 (2021), pp. 1167–1205. url: doi.org/10.4310/CMS.2021.v19.n5.a1.

[229] Jentzen, A. and von Wurstemberger, P. Lower error bounds for the stochastic
gradient descent optimization algorithm: Sharp convergence rates for slowly and fast
decaying learning rates. J. Complexity 57 (2020), Art. No. 101438. url: doi.org/10.1016/j.jco.2019.101438.
[230] Jentzen, A. and Welti, T. Overall error analysis for the training of deep neural
networks via stochastic gradient descent with random initialisation. Appl. Math.
Comput. 455 (2023), Art. No. 127907, 34 pp. url: doi.org/10.1016/j.amc.2023.127907.
[231] Jin, X., Cai, S., Li, H., and Karniadakis, G. E. NSFnets (Navier-Stokes
flow nets): Physics-informed neural networks for the incompressible Navier-Stokes
equations. J. Comput. Phys. 426 (2021), Art. No. 109951. url: doi.org/10.1016/j.jcp.2020.109951.
[232] Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger,
O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko,
A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie,
A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., Back, T.,
Petersen, S., Reiman, D., Clancy, E., Zielinski, M., Steinegger, M.,
Pacholska, M., Berghammer, T., Bodenstein, S., Silver, D., Vinyals, O.,
Senior, A. W., Kavukcuoglu, K., Kohli, P., and Hassabis, D. Highly
accurate protein structure prediction with AlphaFold. Nature 596, 7873 (2021),
pp. 583–589. url: doi.org/10.1038/s41586-021-03819-2.
[233] Kainen, P. C., Kůrková, V., and Vogt, A. Best approximation by linear
combinations of characteristic functions of half-spaces. J. Approx. Theory 122, 2
(2003), pp. 151–159. url: doi.org/10.1016/S0021-9045(03)00072-8.
[234] Karatzas, I. and Shreve, S. E. Brownian motion and stochastic calculus. 2nd ed.
Vol. 113. Springer-Verlag, New York, 1991, xxiv+470 pp. url: doi.org/10.1007/978-1-4612-0949-2.
[235] Karevan, Z. and Suykens, J. A. Transductive LSTM for time-series prediction:
An application to weather forecasting. Neural Networks 125 (2020), pp. 1–9. url:
doi.org/10.1016/j.neunet.2019.12.030.
[236] Karim, F., Majumdar, S., Darabi, H., and Chen, S. LSTM Fully Convolutional
Networks for Time Series Classification. IEEE Access 6 (2018), pp. 1662–1669. url:
doi.org/10.1109/ACCESS.2017.2779939.
[237] Karniadakis, G. E., Kevrekidis, I. G., Lu, L., Perdikaris, P., Wang, S.,
and Yang, L. Physics-informed machine learning. Nat. Rev. Phys. 3, 6 (2021),
pp. 422–440. url: doi.org/10.1038/s42254-021-00314-5.

[238] Karpathy, A., Johnson, J., and Fei-Fei, L. Visualizing and Understanding
Recurrent Networks. arXiv:1506.02078 (2015), 12 pp. url: arxiv.org/abs/1506.02078.
[239] Kawaguchi, K. Deep Learning without Poor Local Minima. In Advances in Neural
Information Processing Systems. Ed. by Lee, D., Sugiyama, M., Luxburg, U., Guyon,
I., and Garnett, R. Vol. 29. Curran Associates, Inc., 2016. url:
proceedings.neurips.cc/paper_files/paper/2016/file/f2fc990265c712c49d51a18a32b39f0c-Paper.pdf.
[240] Khan, S., Naseer, M., Hayat, M., Zamir, S. W., Khan, F. S., and Shah, M.
Transformers in Vision: A Survey. ACM Comput. Surv. 54, 10s (2022), Art. No. 200,
41 pp. url: doi.org/10.1145/3505244.
[241] Kharazmi, E., Zhang, Z., and Karniadakis, G. E. Variational Physics-Informed
Neural Networks For Solving Partial Differential Equations. arXiv:1912.00873 (2019),
24 pp. url: arxiv.org/abs/1912.00873.
[242] Kharazmi, E., Zhang, Z., and Karniadakis, G. E. M. hp-VPINNs: variational
physics-informed neural networks with domain decomposition. Comput. Methods
Appl. Mech. Engrg. 374 (2021), Art. No. 113547, 25 pp. url: doi.org/10.1016/j.cma.2020.113547.
[243] Khodayi-Mehr, R. and Zavlanos, M. VarNet: Variational Neural Networks for
the Solution of Partial Differential Equations. In Proceedings of the 2nd Conference
on Learning for Dynamics and Control (June 10–11, 2020). Ed. by Bayen, A. M.,
Jadbabaie, A., Pappas, G., Parrilo, P. A., Recht, B., Tomlin, C., and Zeilinger, M.
Vol. 120. Proceedings of Machine Learning Research. PMLR, 2020, pp. 298–307.
url: proceedings.mlr.press/v120/khodayi-mehr20a.html.
[244] Khoo, Y., Lu, J., and Ying, L. Solving parametric PDE problems with artificial
neural networks. European J. Appl. Math. 32, 3 (2021), pp. 421–435. url: doi.org/10.1017/S0956792520000182.
[245] Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings
of the 2014 Conference on Empirical Methods in Natural Language Processing
(EMNLP) (Doha, Qatar, Oct. 25–29, 2014). Ed. by Moschitti, A., Pang, B., and
Daelemans, W. Association for Computational Linguistics, 2014, pp. 1746–1751.
url: doi.org/10.3115/v1/D14-1181.
[246] Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. arXiv:1312.6114
(2013), 14 pp. url: arxiv.org/abs/1312.6114.
[247] Kingma, D. P. and Ba, J. Adam: A Method for Stochastic Optimization.
arXiv:1412.6980 (2014), 15 pp. url: arxiv.org/abs/1412.6980.
[248] Klenke, A. Probability Theory. 2nd ed. Springer-Verlag London Ltd., 2014.
xii+638 pp. url: doi.org/10.1007/978-1-4471-5361-0.

[249] Kontolati, K., Goswami, S., Karniadakis, G. E., and Shields, M. D.
Learning in latent spaces improves the predictive accuracy of deep neural operators.
arXiv:2304.07599 (2023), 22 pp. url: arxiv.org/abs/2304.07599.
[250] Korn, R., Korn, E., and Kroisandt, G. Monte Carlo methods and models
in finance and insurance. CRC Press, Boca Raton, FL, 2010, xiv+470 pp. url:
doi.org/10.1201/9781420076196.
[251] Kovachki, N., Lanthaler, S., and Mishra, S. On universal approximation
and error bounds for Fourier neural operators. J. Mach. Learn. Res. 22 (2021),
Art. No. 290, 76 pp. url: jmlr.org/papers/v22/21-0806.html.
[252] Kovachki, N., Li, Z., Liu, B., Azizzadenesheli, K., Bhattacharya, K.,
Stuart, A., and Anandkumar, A. Neural Operator: Learning Maps Between
Function Spaces With Applications to PDEs. J. Mach. Learn. Res. 24 (2023), Art.
No. 89, 97 pp. url: jmlr.org/papers/v24/21-1524.html.
[253] Kramer, M. A. Nonlinear principal component analysis using autoassociative
neural networks. AIChE Journal 37, 2 (1991), pp. 233–243. url: doi.org/10.1002/aic.690370209.
[254] Krantz, S. G. and Parks, H. R. A primer of real analytic functions. 2nd ed.
Birkhäuser Boston, Inc., Boston, MA, 2002, xiv+205 pp. url: doi.org/10.1007/978-0-8176-8134-0.
[255] Kratsios, A. The universal approximation property: characterization, construction,
representation, and existence. Ann. Math. Artif. Intell. 89, 5–6 (2021), pp. 435–469.
url: doi.org/10.1007/s10472-020-09723-1.
[256] Kremsner, S., Steinicke, A., and Szölgyenyi, M. A Deep Neural Network
Algorithm for Semilinear Elliptic PDEs with Applications in Insurance Mathematics.
Risks 8, 4 (2020), Art. No. 136, 18 pp. url: doi.org/10.3390/risks8040136.
[257] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet Classification
with Deep Convolutional Neural Networks. In Advances in Neural Information
Processing Systems. Ed. by Pereira, F., Burges, C., Bottou, L., and Weinberger, K.
Vol. 25. Curran Associates, Inc., 2012. url:
proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.
[258] Kurdyka, K., Mostowski, T., and Parusiński, A. Proof of the gradient
conjecture of R. Thom. Ann. of Math. (2) 152, 3 (2000), pp. 763–792. url: doi.org/10.2307/2661354.
[259] Kutyniok, G., Petersen, P., Raslan, M., and Schneider, R. A theoretical
analysis of deep neural networks and parametric PDEs. Constr. Approx. 55, 1 (2022),
pp. 73–125. url: doi.org/10.1007/s00365-021-09551-4.

[260] Lagaris, I., Likas, A., and Fotiadis, D. Artificial neural networks for solving
ordinary and partial differential equations. IEEE Trans. Neural Netw. 9, 5 (1998),
pp. 987–1000. url: doi.org/10.1109/72.712178.
[261] Lanthaler, S., Molinaro, R., Hadorn, P., and Mishra, S. Nonlinear Reconstruction
for Operator Learning of PDEs with Discontinuities. arXiv:2210.01074
(2022), 40 pp. url: arxiv.org/abs/2210.01074.
[262] LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E.,
Hubbard, W., and Jackel, L. D. Backpropagation Applied to Handwritten Zip
Code Recognition. Neural Comput. 1, 4 (1989), pp. 541–551. url: doi.org/10.1162/neco.1989.1.4.541.
[263] LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature 521 (2015),
pp. 436–444. url: doi.org/10.1038/nature14539.
[264] Lee, C.-Y., Xie, S., Gallagher, P., Zhang, Z., and Tu, Z. Deeply-Supervised
Nets. In Proceedings of the Eighteenth International Conference on Artificial Intelligence
and Statistics (San Diego, California, USA, May 9–12, 2015). Ed. by Lebanon,
G. and Vishwanathan, S. V. N. Vol. 38. Proceedings of Machine Learning Research.
PMLR, 2015, pp. 562–570. url: proceedings.mlr.press/v38/lee15a.html.
[265] Lee, J. D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M. I.,
and Recht, B. First-order methods almost always avoid strict saddle points. Math.
Program. 176, 1–2 (2019), pp. 311–337. url: doi.org/10.1007/s10107-019-01374-3.
[266] Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient Descent
Only Converges to Minimizers. In 29th Annual Conference on Learning Theory
(Columbia University, New York, NY, USA, June 23–26, 2016). Ed. by Feldman, V.,
Rakhlin, A., and Shamir, O. Vol. 49. Proceedings of Machine Learning Research.
PMLR, 2016, pp. 1246–1257. url: proceedings.mlr.press/v49/lee16.html.
[267] Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O.,
Stoyanov, V., and Zettlemoyer, L. BART: Denoising Sequence-to-Sequence
Pre-training for Natural Language Generation, Translation, and Comprehension.
arXiv:1910.13461 (2019). url: arxiv.org/abs/1910.13461.
[268] Li, K., Tang, K., Wu, T., and Liao, Q. D3M: A Deep Domain Decomposition
Method for Partial Differential Equations. IEEE Access 8 (2020), pp. 5283–5294.
url: doi.org/10.1109/ACCESS.2019.2957200.
[269] Li, Z., Huang, D. Z., Liu, B., and Anandkumar, A. Fourier Neural Operator
with Learned Deformations for PDEs on General Geometries. arXiv:2207.05209
(2022). url: arxiv.org/abs/2207.05209.

[270] Li, Z., Kovachki, N., Azizzadenesheli, K., Liu, B., Bhattacharya, K.,
Stuart, A., and Anandkumar, A. Neural Operator: Graph Kernel Network
for Partial Differential Equations. arXiv:2003.03485 (2020). url: arxiv.org/abs/2003.03485.
[271] Li, Z., Kovachki, N., Azizzadenesheli, K., Liu, B., Bhattacharya, K.,
Stuart, A., and Anandkumar, A. Fourier Neural Operator for Parametric Partial
Differential Equations. In International Conference on Learning Representations.
2021. url: openreview.net/forum?id=c8P9NQVtmnO.
[272] Li, Z., Kovachki, N., Azizzadenesheli, K., Liu, B., Stuart, A., Bhattacharya,
K., and Anandkumar, A. Multipole graph neural operator for parametric
partial differential equations. Advances in Neural Information Processing
Systems 33 (2020), pp. 6755–6766.
[273] Li, Z., Zheng, H., Kovachki, N., Jin, D., Chen, H., Liu, B., Azizzadenesheli,
K., and Anandkumar, A. Physics-Informed Neural Operator for Learning Partial
Differential Equations. arXiv:2111.03794 (2021). url: arxiv.org/abs/2111.03794.
[274] Liao, Y. and Ming, P. Deep Nitsche Method: Deep Ritz Method with Essential
Boundary Conditions. Commun. Comput. Phys. 29, 5 (2021), pp. 1365–1384. url:
doi.org/10.4208/cicp.OA-2020-0219.
[275] Liu, C. and Belkin, M. Accelerating SGD with momentum for over-parameterized
learning. arXiv:1810.13395 (2018). url: arxiv.org/abs/1810.13395.
[276] Liu, L. and Cai, W. DeepPropNet–A Recursive Deep Propagator Neural Network
for Learning Evolution PDE Operators. arXiv:2202.13429 (2022). url: arxiv.org/abs/2202.13429.
[277] Liu, Y., Kutz, J. N., and Brunton, S. L. Hierarchical deep learning of multiscale
differential equation time-steppers. Philos. Trans. Roy. Soc. A 380, 2229 (2022),
Art. No. 20210200, 17 pp. url: doi.org/10.1098/rsta.2021.0200.
[278] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., and Guo,
B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. In
Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)
(Montreal, QC, Canada, Oct. 10–17, 2021). IEEE Computer Society, 2021, pp. 10012–
10022. url: doi.org/10.1109/ICCV48922.2021.00986.
[279] Liu, Z., Cai, W., and Xu, Z.-Q. J. Multi-scale deep neural network (MscaleDNN)
for solving Poisson-Boltzmann equation in complex domains. Commun. Comput.
Phys. 28, 5 (2020), pp. 1970–2001.
[280] Loizou, N. and Richtárik, P. Momentum and stochastic momentum for stochastic
gradient, Newton, proximal point and subspace descent methods. Comput. Optim.
Appl. 77, 3 (2020), pp. 653–710. url: doi.org/10.1007/s10589-020-00220-z.

[281] Łojasiewicz, S. Ensembles semi-analytiques. Unpublished lecture notes. Institut
des Hautes Études Scientifiques, 1964. url: perso.univ-rennes1.fr/michel.coste/Lojasiewicz.pdf.
[282] Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for
semantic segmentation. In 2015 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR) (Boston, MA, USA, June 7–12, 2015). IEEE Computer Society,
2015, pp. 3431–3440. url: doi.org/10.1109/CVPR.2015.7298965.
[283] Lu, J., Batra, D., Parikh, D., and Lee, S. ViLBERT: Pretraining Task-
Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In Advances
in Neural Information Processing Systems. Ed. by Wallach, H., Larochelle, H.,
Beygelzimer, A., d’Alché-Buc, F., Fox, E., and Garnett, R. Vol. 32. Curran Associates,
Inc., 2019. url:
proceedings.neurips.cc/paper_files/paper/2019/file/c74d97b01eae257e44aa9d5bade97baf-Paper.pdf.
[284] Lu, L., Jin, P., Pang, G., Zhang, Z., and Karniadakis, G. E. Learning
nonlinear operators via DeepONet based on the universal approximation theorem of
operators. Nature Machine Intelligence 3, 3 (2021), pp. 218–229. url: doi.org/10.1038/s42256-021-00302-5.
[285] Lu, L., Meng, X., Cai, S., Mao, Z., Goswami, S., Zhang, Z., and Karniadakis,
G. E. A comprehensive and fair comparison of two neural operators (with
practical extensions) based on FAIR data. Comput. Methods Appl. Mech. Engrg. 393
(2022), Art. No. 114778. url: doi.org/10.1016/j.cma.2022.114778.
[286] Lu, L., Meng, X., Mao, Z., and Karniadakis, G. E. DeepXDE: A Deep Learning
Library for Solving Differential Equations. SIAM Rev. 63, 1 (2021), pp. 208–228.
url: doi.org/10.1137/19M1274067.
[287] Luo, X. and Kareem, A. Bayesian deep learning with hierarchical prior: Predictions
from limited and noisy data. Structural Safety 84 (2020), p. 101918. url:
doi.org/10.1016/j.strusafe.2019.101918.
[288] Luong, M.-T., Pham, H., and Manning, C. D. Effective Approaches to Attention-
based Neural Machine Translation. arXiv:1508.04025 (2015). url: arxiv.org/abs/1508.04025.
[289] Ma, C., Wu, L., and E, W. A Qualitative Study of the Dynamic Behavior for
Adaptive Gradient Algorithms. arXiv:2009.06125 (2020). url: arxiv.org/abs/2009.06125.
[290] Maday, Y. and Turinici, G. A parareal in time procedure for the control of partial
differential equations. C. R. Math. Acad. Sci. Paris 335, 4 (2002), pp. 387–392. url:
doi.org/10.1016/S1631-073X(02)02467-6.

[291] Mahendran, A. and Vedaldi, A. Visualizing deep convolutional neural networks
using natural pre-images. Int. J. Comput. Vis. 120, 3 (2016), pp. 233–255. url:
doi.org/10.1007/s11263-016-0911-8.
[292] Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. Adversarial
Autoencoders. arXiv:1511.05644 (2015). url: arxiv.org/abs/1511.05644.
[293] Mao, X., Shen, C., and Yang, Y.-B. Image Restoration Using Very Deep
Convolutional Encoder-Decoder Networks with Symmetric Skip Connections. In
Advances in Neural Information Processing Systems. Ed. by Lee, D., Sugiyama, M.,
Luxburg, U., Guyon, I., and Garnett, R. Vol. 29. Curran Associates, Inc., 2016. url:
proceedings.neurips.cc/paper_files/paper/2016/file/0ed9422357395a0d4879191c66f4faa2-Paper.pdf.
[294] Masci, J., Meier, U., Cireşan, D., and Schmidhuber, J. Stacked Convolutional
Auto-Encoders for Hierarchical Feature Extraction. In Artificial Neural Networks
and Machine Learning – ICANN 2011 (Espoo, Finland, June 14–17, 2011). Ed. by
Honkela, T., Duch, W., Girolami, M., and Kaski, S. Springer Berlin Heidelberg,
2011, pp. 52–59.
[295] Meng, X., Li, Z., Zhang, D., and Karniadakis, G. E. PPINN: Parareal
physics-informed neural network for time-dependent PDEs. Comput. Methods Appl.
Mech. Engrg. 370 (2020), p. 113250. url: doi.org/10.1016/j.cma.2020.113250.
[296] Mertikopoulos, P., Hallak, N., Kavis, A., and Cevher, V. On the Almost
Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems. In
Advances in Neural Information Processing Systems. Ed. by Larochelle, H., Ranzato,
M., Hadsell, R., Balcan, M., and Lin, H. Vol. 33. Curran Associates, Inc., 2020,
pp. 1117–1128. url:
proceedings.neurips.cc/paper_files/paper/2020/file/0cb5ebb1b34ec343dfe135db691e4a85-Paper.pdf.
[297] Meuris, B., Qadeer, S., and Stinis, P. Machine-learning-based spectral methods
for partial differential equations. Scientific Reports 13, 1 (2023), p. 1739. url:
doi.org/10.1038/s41598-022-26602-3.
[298] Mishra, S. and Molinaro, R. Estimates on the generalization error of Physics
Informed Neural Networks (PINNs) for approximating a class of inverse problems
for PDEs. arXiv:2007.01138 (2020). url: arxiv.org/abs/2007.01138.
[299] Mishra, S. and Molinaro, R. Estimates on the generalization error of Physics
Informed Neural Networks (PINNs) for approximating PDEs. arXiv:2006.16144
(2020). url: arxiv.org/abs/2006.16144.
[300] Neal, R. M. Bayesian Learning for Neural Networks. Springer New York, 1996.
204 pp. url: doi.org/10.1007/978-1-4612-0745-0.

[301] Nelsen, N. H. and Stuart, A. M. The random feature model for input-output
maps between Banach spaces. SIAM J. Sci. Comput. 43, 5 (2021), A3212–A3243.
url: doi.org/10.1137/20M133957X.
[302] Nesterov, Y. A method of solving a convex programming problem with convergence
rate $O(1/k^2)$. In Soviet Mathematics Doklady. Vol. 27. 1983, pp. 372–376.
[303] Nesterov, Y. Introductory lectures on convex optimization: A basic course. Vol. 87.
Springer, New York, 2013, xviii+236 pp. url: doi.org/10.1007/978-1-4419-8853-9.
[304] Neufeld, A. and Wu, S. Multilevel Picard approximation algorithm for semilinear
partial integro-differential equations and its complexity analysis. arXiv:2205.09639
(2022). url: arxiv.org/abs/2205.09639.
[305] Neufeld, A. and Wu, S. Multilevel Picard algorithm for general semilinear
parabolic PDEs with gradient-dependent nonlinearities. arXiv:2310.12545 (2023).
url: arxiv.org/abs/2310.12545.
[306] Ng, A. coursera: Improving Deep Neural Networks: Hyperparameter tuning, Regularization
and Optimization. https://www.coursera.org/learn/deep-neural-network.
[Accessed 6-December-2017].
[307] Ng, J. Y.-H., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga,
R., and Toderici, G. Beyond Short Snippets: Deep Networks for Video Classification.
arXiv:1503.08909 (2015). url: arxiv.org/abs/1503.08909.
[308] Nguwi, J. Y., Penent, G., and Privault, N. A deep branching solver for fully
nonlinear partial differential equations. arXiv:2203.03234 (2022). url: arxiv.org/abs/2203.03234.
[309] Nguwi, J. Y., Penent, G., and Privault, N. Numerical solution of the incompressible
Navier-Stokes equation by a deep branching algorithm. arXiv:2212.13010
(2022). url: arxiv.org/abs/2212.13010.
[310] Nguwi, J. Y., Penent, G., and Privault, N. A fully nonlinear Feynman-Kac
formula with derivatives of arbitrary orders. J. Evol. Equ. 23, 1 (2023), Art. No. 22,
29 pp. url: doi.org/10.1007/s00028-023-00873-3.
[311] Nguwi, J. Y. and Privault, N. Numerical solution of the modified and non-
Newtonian Burgers equations by stochastic coded trees. Jpn. J. Ind. Appl. Math. 40,
3 (2023), pp. 1745–1763. url: doi.org/10.1007/s13160-023-00611-9.
[312] Nguyen, Q. and Hein, M. The Loss Surface of Deep and Wide Neural Networks.
In Proceedings of the 34th International Conference on Machine Learning (Sydney,
Australia, Aug. 6–11, 2017). Ed. by Precup, D. and Teh, Y. W. Vol. 70. Proceedings
of Machine Learning Research. PMLR, 2017, pp. 2603–2612. url:
proceedings.mlr.press/v70/nguyen17a.html.

[313] Nitsche, J. Über ein Variationsprinzip zur Lösung von Dirichlet-Problemen bei
Verwendung von Teilräumen, die keinen Randbedingungen unterworfen sind. Abh.
Math. Sem. Univ. Hamburg 36 (1971), pp. 9–15. url: doi.org/10.1007/BF02995904.
[314] Novak, E. and Woźniakowski, H. Tractability of multivariate problems. Vol. I:
Linear information. Vol. 6. European Mathematical Society (EMS), Zürich, 2008,
xii+384 pp. url: doi.org/10.4171/026.
[315] Novak, E. and Woźniakowski, H. Tractability of multivariate problems. Volume
II: Standard information for functionals. Vol. 12. European Mathematical Society
(EMS), Zürich, 2010, xviii+657 pp. url: doi.org/10.4171/084.
[316] Novak, E. and Woźniakowski, H. Tractability of multivariate problems. Volume
III: Standard information for operators. Vol. 18. European Mathematical Society
(EMS), Zürich, 2012, xviii+586 pp. url: doi.org/10.4171/116.
[317] Nüsken, N. and Richter, L. Solving high-dimensional Hamilton-Jacobi-Bellman
PDEs using neural networks: perspectives from the theory of controlled diffusions
and measures on path space. Partial Differ. Equ. Appl. 2, 4 (2021), Art. No. 48,
48 pp. url: doi.org/10.1007/s42985-021-00102-x.
[318] Øksendal, B. Stochastic differential equations. 6th ed. An introduction with
applications. Springer-Verlag, Berlin, 2003, xxiv+360 pp. url: doi.org/10.1007/
978-3-642-14394-6.
[319] Olah, C. Understanding LSTM Networks. http://colah.github.io/posts/2015-08-Understanding-LSTMs/.
[Accessed 9-October-2023].
[320] OpenAI. GPT-4 Technical Report. arXiv:2303.08774 (2023). url: arxiv.org/
abs/2303.08774.
[321] Opschoor, J. A. A., Petersen, P. C., and Schwab, C. Deep ReLU networks
and high-order finite element methods. Anal. Appl. (Singap.) 18, 5 (2020), pp. 715–
770. url: doi.org/10.1142/S0219530519410136.
[322] Panageas, I. and Piliouras, G. Gradient Descent Only Converges to Minimizers:
Non-Isolated Critical Points and Invariant Regions. arXiv:1605.00405 (2016). url:
arxiv.org/abs/1605.00405.
[323] Panageas, I., Piliouras, G., and Wang, X. First-order methods almost al-
ways avoid saddle points: The case of vanishing step-sizes. In Advances in Neu-
ral Information Processing Systems. Ed. by Wallach, H., Larochelle, H., Beygelz-
imer, A., d’Alché-Buc, F., Fox, E., and Garnett, R. Vol. 32. Curran Associates,
Inc., 2019. url: proceedings.neurips.cc/paper_files/paper/2019/file/
3fb04953d95a94367bb133f862402bce-Paper.pdf.

[324] Pang, G., Lu, L., and Karniadakis, G. E. fPINNs: Fractional Physics-Informed
Neural Networks. SIAM J. Sci. Comput. 41, 4 (2019), A2603–A2626. url: doi.org/
10.1137/18M1229845.
[325] Pardoux, É. and Peng, S. Backward stochastic differential equations and quasilin-
ear parabolic partial differential equations. In Stochastic partial differential equations
and their applications. Vol. 176. Lect. Notes Control Inf. Sci. Springer, Berlin, 1992,
pp. 200–217. url: doi.org/10.1007/BFb0007334.
[326] Pardoux, É. and Peng, S. G. Adapted solution of a backward stochastic differ-
ential equation. Systems Control Lett. 14, 1 (1990), pp. 55–61. url: doi.org/10.
1016/0167-6911(90)90082-6.
[327] Pardoux, E. and Tang, S. Forward-backward stochastic differential equations and
quasilinear parabolic PDEs. Probab. Theory Related Fields 114, 2 (1999), pp. 123–150.
url: doi.org/10.1007/s004409970001.
[328] Pascanu, R., Mikolov, T., and Bengio, Y. On the difficulty of training recurrent
neural networks. In Proceedings of the 30th International Conference on Machine
Learning (Atlanta, GA, USA, June 17–19, 2013). Ed. by Dasgupta, S. and McAllester,
D. Vol. 28. Proceedings of Machine Learning Research 3. PMLR, 2013, pp. 1310–1318.
url: proceedings.mlr.press/v28/pascanu13.html.
[329] Perekrestenko, D., Grohs, P., Elbrächter, D., and Bölcskei, H. The
universal approximation power of finite-width deep ReLU networks. arXiv:1806.01528
(2018). url: arxiv.org/abs/1806.01528.
[330] Pérez-Ortiz, J. A., Gers, F. A., Eck, D., and Schmidhuber, J. Kalman filters
improve LSTM network performance in problems unsolvable by traditional recurrent
nets. Neural Networks 16, 2 (2003), pp. 241–250. url: doi.org/10.1016/S0893-
6080(02)00219-8.
[331] Petersen, P. Linear Algebra. Springer New York, 2012. x+390 pp. url: doi.org/
10.1007/978-1-4614-3612-6.
[332] Petersen, P., Raslan, M., and Voigtlaender, F. Topological properties of
the set of functions generated by neural networks of fixed size. Found. Comput. Math.
21, 2 (2021), pp. 375–444. url: doi.org/10.1007/s10208-020-09461-0.
[333] Petersen, P. and Voigtlaender, F. Optimal approximation of piecewise smooth
functions using deep ReLU neural networks. Neural Networks 108 (2018), pp. 296–
330. url: doi.org/10.1016/j.neunet.2018.08.019.
[334] Petersen, P. and Voigtlaender, F. Equivalence of approximation by convolu-
tional neural networks and fully-connected networks. Proc. Amer. Math. Soc. 148, 4
(2020), pp. 1567–1581. url: doi.org/10.1090/proc/14789.

[335] Pham, H. and Warin, X. Mean-field neural networks: learning mappings on
Wasserstein space. arXiv:2210.15179 (2022). url: arxiv.org/abs/2210.15179.
[336] Pham, H., Warin, X., and Germain, M. Neural networks-based backward scheme
for fully nonlinear PDEs. Partial Differ. Equ. Appl. 2, 1 (2021), Art. No. 16, 24 pp.
url: doi.org/10.1007/s42985-020-00062-8.
[337] Polyak, B. T. Some methods of speeding up the convergence of iteration methods.
USSR Computational Mathematics and Mathematical Physics 4, 5 (1964), pp. 1–17.
[338] PyTorch: SGD. https://pytorch.org/docs/stable/generated/torch.optim.SGD.html.
[Accessed 4-September-2023].
[339] Qian, N. On the momentum term in gradient descent learning algorithms. Neural
Networks 12, 1 (1999), pp. 145–151. url: doi.org/10.1016/S0893-6080(98)00116-
6.
[340] Radford, A., Jozefowicz, R., and Sutskever, I. Learning to Generate Reviews
and Discovering Sentiment. arXiv:1704.01444 (2017). url: arxiv.org/abs/1704.
01444.
[341] Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving
language understanding by generative pre-training (2018), 12 pp. url: openai.com/
research/language-unsupervised.
[342] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever,
I. Language Models are Unsupervised Multitask Learners (2019), 24 pp. url:
openai.com/research/better-language-models.
[343] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M.,
Zhou, Y., Li, W., and Liu, P. J. Exploring the Limits of Transfer Learning with
a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 21, 140 (2020), pp. 1–67.
url: jmlr.org/papers/v21/20-074.html.
[344] Rafiq, M., Rafiq, G., Jung, H.-Y., and Choi, G. S. SSNO: Spatio-Spectral
Neural Operator for Functional Space Learning of Partial Differential Equations.
IEEE Access 10 (2022), pp. 15084–15095. url: doi.org/10.1109/ACCESS.2022.
3148401.
[345] Raiko, T., Valpola, H., and LeCun, Y. Deep Learning Made Easier by Linear
Transformations in Perceptrons. In Proceedings of the Fifteenth International Confer-
ence on Artificial Intelligence and Statistics (La Palma, Canary Islands, Apr. 21–23,
2012). Ed. by Lawrence, N. D. and Girolami, M. Vol. 22. Proceedings of Machine
Learning Research. PMLR, 2012, pp. 924–932. url: proceedings.mlr.press/v22/
raiko12.html.

[346] Raissi, M. Forward-Backward Stochastic Neural Networks: Deep Learning of
High-dimensional Partial Differential Equations. arXiv:1804.07010 (2018). url:
arxiv.org/abs/1804.07010.
[347] Raissi, M., Perdikaris, P., and Karniadakis, G. E. Physics-informed neural
networks: A deep learning framework for solving forward and inverse problems
involving nonlinear partial differential equations. J. Comput. Phys. 378 (2019),
pp. 686–707. url: doi.org/10.1016/j.jcp.2018.10.045.
[348] Rajpurkar, P., Hannun, A. Y., Haghpanahi, M., Bourn, C., and Ng,
A. Y. Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks.
arXiv:1707.01836 (2017). url: arxiv.org/abs/1707.01836.
[349] Ranzato, M., Huang, F. J., Boureau, Y.-L., and LeCun, Y. Unsupervised
Learning of Invariant Feature Hierarchies with Applications to Object Recognition.
In 2007 IEEE Conference on Computer Vision and Pattern Recognition. 2007, pp. 1–
8. url: doi.org/10.1109/CVPR.2007.383157.
[350] Raonić, B., Molinaro, R., Ryck, T. D., Rohner, T., Bartolucci, F.,
Alaifari, R., Mishra, S., and de Bézenac, E. Convolutional Neural Operators
for robust and accurate learning of PDEs. arXiv:2302.01178 (2023). url: arxiv.
org/abs/2302.01178.
[351] Reddi, S. J., Kale, S., and Kumar, S. On the Convergence of Adam and Beyond.
arXiv:1904.09237 (2019). url: arxiv.org/abs/1904.09237.
[352] Reichstein, M., Camps-Valls, G., Stevens, B., Jung, M., Denzler, J.,
Carvalhais, N., and Prabhat. Deep learning and process understanding for
data-driven Earth system science. Nature 566, 7743 (2019), pp. 195–204. url:
doi.org/10.1038/s41586-019-0912-1.
[353] Reisinger, C. and Zhang, Y. Rectified deep neural networks overcome the curse
of dimensionality for nonsmooth value functions in zero-sum games of nonlinear stiff
systems. Anal. Appl. (Singap.) 18, 6 (2020), pp. 951–999. url: doi.org/10.1142/
S0219530520500116.
[354] Ruder, S. An overview of gradient descent optimization algorithms. arXiv:1609.04747
(2016). url: arxiv.org/abs/1609.04747.
[355] Ruf, J. and Wang, W. Neural networks for option pricing and hedging: a literature
review. arXiv:1911.05620 (2019). url: arxiv.org/abs/1911.05620.
[356] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning Internal
Representations by Error Propagation. In Parallel Distributed Processing: Explorations
in the Microstructure of Cognition, Vol. 1: Foundations. Cambridge, MA,
USA: MIT Press, 1986, pp. 318–362.

[357] Safran, I. and Shamir, O. On the Quality of the Initial Basin in Overspecified
Neural Networks. In Proceedings of The 33rd International Conference on Machine
Learning (New York, NY, USA, June 20–22, 2016). Vol. 48. Proceedings of Machine
Learning Research. PMLR, 2016, pp. 774–782. url: proceedings.mlr.press/v48/
safran16.html.
[358] Safran, I. and Shamir, O. Spurious Local Minima are Common in Two-Layer
ReLU Neural Networks. In Proceedings of the 35th International Conference on
Machine Learning (Stockholm, Sweden, July 10–15, 2018). Vol. 80. Proceedings of
Machine Learning Research. ISSN: 2640-3498. PMLR, 2018, pp. 4433–4441. url:
proceedings.mlr.press/v80/safran18a.html.
[359] Sainath, T. N., Mohamed, A., Kingsbury, B., and Ramabhadran, B. Deep
convolutional neural networks for LVCSR. In 2013 IEEE International Conference
on Acoustics, Speech and Signal Processing (Vancouver, BC, Canada, May 26–31,
2013). IEEE Computer Society, 2013, pp. 8614–8618. url: doi.org/10.1109/
ICASSP.2013.6639347.
[360] Sak, H., Senior, A., and Beaufays, F. Long Short-Term Memory Based Re-
current Neural Network Architectures for Large Vocabulary Speech Recognition.
arXiv:1402.1128 (2014). url: arxiv.org/abs/1402.1128.
[361] Sanchez-Gonzalez, A., Godwin, J., Pfaff, T., Ying, R., Leskovec, J., and
Battaglia, P. W. Learning to Simulate Complex Physics with Graph Networks.
arXiv:2002.09405 (Feb. 2020). url: arxiv.org/abs/2002.09405.
[362] Sanchez-Lengeling, B., Reif, E., Pearce, A., and Wiltschko, A. B. A
Gentle Introduction to Graph Neural Networks. https://distill.pub/2021/gnn-intro/.
[Accessed 10-October-2023].
[363] Sandberg, I. Approximation theorems for discrete-time systems. IEEE Trans.
Circuits Syst. 38, 5 (1991), pp. 564–566. url: doi.org/10.1109/31.76498.
[364] Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. How Does Batch
Normalization Help Optimization? In Advances in Neural Information Processing
Systems. Ed. by Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi,
N., and Garnett, R. Vol. 31. Curran Associates, Inc., 2018. url: proceedings.
neurips.cc/paper_files/paper/2018/file/905056c1ac1dad141560467e0a99
e1cf-Paper.pdf.
[365] Sarao Mannelli, S., Vanden-Eijnden, E., and Zdeborová, L. Optimization
and Generalization of Shallow Neural Networks with Quadratic Activation Functions.
In Advances in Neural Information Processing Systems. Ed. by Larochelle, H.,
Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. Vol. 33. Curran Associates, Inc.,
2020, pp. 13445–13455. url: proceedings.neurips.cc/paper_files/paper/
2020/file/9b8b50fb590c590ffbf1295ce92258dc-Paper.pdf.

[366] Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini,
G. The Graph Neural Network Model. IEEE Trans. Neural Netw. 20, 1 (2009),
pp. 61–80. url: doi.org/10.1109/TNN.2008.2005605.
[367] Schmidhuber, J. Deep learning in neural networks: An overview. Neural Networks
61 (2015), pp. 85–117. url: doi.org/10.1016/j.neunet.2014.09.003.
[368] Schütt, K. T., Sauceda, H. E., Kindermans, P.-J., Tkatchenko, A., and
Müller, K.-R. SchNet – A deep learning architecture for molecules and materials.
The Journal of Chemical Physics 148, 24 (2018). url: doi.org/10.1063/1.5019779.
[369] Schwab, C., Stein, A., and Zech, J. Deep Operator Network Approximation
Rates for Lipschitz Operators. arXiv:2307.09835 (2023). url: arxiv.org/abs/
2307.09835.
[370] Schwab, C. and Zech, J. Deep learning in high dimension: neural network
expression rates for generalized polynomial chaos expansions in UQ. Anal. Appl.
(Singap.) 17, 1 (2019), pp. 19–55. url: doi.org/10.1142/S0219530518500203.
[371] Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun,
Y. OverFeat: Integrated Recognition, Localization and Detection using Convolutional
Networks. arXiv:1312.6229 (2013). url: arxiv.org/abs/1312.6229.
[372] Sezer, O. B., Gudelek, M. U., and Ozbayoglu, A. M. Financial time series
forecasting with deep learning: A systematic literature review: 2005–2019. Appl. Soft
Comput. 90 (2020), Art. No. 106181. url: doi.org/10.1016/j.asoc.2020.106181.
[373] Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning.
From Theory to Algorithms. Cambridge University Press, 2014, xvi+397 pp. url:
doi.org/10.1017/CBO9781107298019.
[374] Shen, Z., Yang, H., and Zhang, S. Deep network approximation characterized
by number of neurons. Commun. Comput. Phys. 28, 5 (2020), pp. 1768–1811. url:
doi.org/10.4208/cicp.oa-2020-0149.
[375] Shi, X., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-k., and Woo, W.-c.
Convolutional LSTM Network: A Machine Learning Approach for Precipitation
Nowcasting. In Advances in Neural Information Processing Systems. Ed. by Cortes,
C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R. Vol. 28. Curran Associates,
Inc., 2015. url: proceedings.neurips.cc/paper_files/paper/2015/file/
07563a3fe3bbe7e3ba84431ad9d055af-Paper.pdf.
[376] Siami-Namini, S., Tavakoli, N., and Siami Namin, A. A Comparison of ARIMA
and LSTM in Forecasting Time Series. In 2018 17th IEEE International Conference
on Machine Learning and Applications (ICMLA) (Orlando, FL, USA, Dec. 17–20,
2018). IEEE Computer Society, 2018, pp. 1394–1401. url: doi.org/10.1109/
ICMLA.2018.00227.

[377] Silvester, J. R. Determinants of block matrices. Math. Gaz. 84, 501 (2000),
pp. 460–467. url: doi.org/10.2307/3620776.
[378] Simonyan, K. and Zisserman, A. Very Deep Convolutional Networks for Large-
Scale Image Recognition. arXiv:1409.1556 (2014). url: arxiv.org/abs/1409.1556.
[379] Sirignano, J. and Spiliopoulos, K. DGM: A deep learning algorithm for solving
partial differential equations. J. Comput. Phys. 375 (2018), pp. 1339–1364. url:
doi.org/10.1016/j.jcp.2018.08.029.
[380] Sitzmann, V., Martel, J. N. P., Bergman, A. W., Lindell, D. B., and
Wetzstein, G. Implicit Neural Representations with Periodic Activation Functions.
arXiv:2006.09661 (2020). url: arxiv.org/abs/2006.09661.
[381] Soltanolkotabi, M., Javanmard, A., and Lee, J. D. Theoretical Insights Into
the Optimization Landscape of Over-Parameterized Shallow Neural Networks. IEEE
Trans. Inform. Theory 65, 2 (2019), pp. 742–769. url: doi.org/10.1109/TIT.2018.
2854560.
[382] Soudry, D. and Carmon, Y. No bad local minima: Data independent training
error guarantees for multilayer neural networks. arXiv:1605.08361 (2016). url:
arxiv.org/abs/1605.08361.
[383] Soudry, D. and Hoffer, E. Exponentially vanishing sub-optimal local minima in
multilayer neural networks. arXiv:1702.05777 (2017). url: arxiv.org/abs/1702.
05777.
[384] Srivastava, R. K., Greff, K., and Schmidhuber, J. Training Very Deep
Networks. In Advances in Neural Information Processing Systems. Ed. by Cortes, C.,
Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R. Vol. 28. Curran Associates,
Inc., 2015. url: proceedings.neurips.cc/paper_files/paper/2015/file/
215a71a12769b056c3c32e7299f1c5ed-Paper.pdf.
[385] Srivastava, R. K., Greff, K., and Schmidhuber, J. Highway Networks.
arXiv:1505.00387 (2015). url: arxiv.org/abs/1505.00387.
[386] Sun, R. Optimization for deep learning: theory and algorithms. arXiv:1912.08957
(Dec. 2019). url: arxiv.org/abs/1912.08957.
[387] Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of
initialization and momentum in deep learning. In Proceedings of the 30th International
Conference on Machine Learning (Atlanta, GA, USA, June 17–19, 2013). Ed. by
Dasgupta, S. and McAllester, D. Vol. 28. Proceedings of Machine Learning Research
3. PMLR, 2013, pp. 1139–1147. url: proceedings.mlr.press/v28/sutskever13.
html.

[388] Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to Sequence Learning with
Neural Networks. In Advances in Neural Information Processing Systems. Ed. by
Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., and Weinberger, K. Vol. 27.
Curran Associates, Inc., 2014. url: proceedings.neurips.cc/paper_files/
paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf.
[389] Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction.
2nd ed. MIT Press, Cambridge, MA, 2018, xxii+526 pp.
[390] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Er-
han, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions.
In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(Boston, MA, USA, June 7–12, 2015). IEEE Computer Society, 2015, pp. 1–9. url:
doi.org/10.1109/CVPR.2015.7298594.
[391] Tadić, V. B. Convergence and convergence rate of stochastic gradient search in the
case of multiple and non-isolated extrema. Stochastic Process. Appl. 125, 5 (2015),
pp. 1715–1755. url: doi.org/10.1016/j.spa.2014.11.001.
[392] Tan, L. and Chen, L. Enhanced DeepONet for modeling partial differential
operators considering multiple input functions. arXiv:2202.08942 (2022). url: arxiv.
org/abs/2202.08942.
[393] Taylor, J. M., Pardo, D., and Muga, I. A deep Fourier residual method for
solving PDEs using neural networks. Comput. Methods Appl. Mech. Engrg. 405
(2023), Art. No. 115850, 27 pp. url: doi.org/10.1016/j.cma.2022.115850.
[394] Teschl, G. Ordinary differential equations and dynamical systems. Vol. 140. Amer-
ican Mathematical Society, Providence, RI, 2012, xii+356 pp. url: doi.org/10.
1090/gsm/140.
[395] Tropp, J. A. An Elementary Proof of the Spectral Radius Formula for Matrices.
http://users.cms.caltech.edu/~jtropp/notes/Tro01-Spectral-Radius.pdf.
[Accessed 16-February-2018]. 2001.
[396] Van den Oord, A., Dieleman, S., and Schrauwen, B. Deep content-based
music recommendation. In Advances in Neural Information Processing Systems.
Ed. by Burges, C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K.
Vol. 26. Curran Associates, Inc., 2013. url: proceedings.neurips.cc/paper_
files/paper/2013/file/b3ba8f1bee1238a2f37603d90b58898d-Paper.pdf.
[397] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez,
A. N., Kaiser, Ł., and Polosukhin, I. Attention is All you Need. In Advances in
Neural Information Processing Systems. Ed. by Guyon, I., Luxburg, U. V., Bengio,
S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. Vol. 30. Curran
Associates, Inc., 2017. url: proceedings.neurips.cc/paper_files/paper/2017/
file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.

[398] Vatanen, T., Raiko, T., Valpola, H., and LeCun, Y. Pushing Stochastic
Gradient towards Second-Order Methods – Backpropagation Learning with Transfor-
mations in Nonlinearities. In Neural Information Processing. Ed. by Lee, M., Hirose,
A., Hou, Z.-G., and Kil, R. M. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013,
pp. 442–449.
[399] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and
Bengio, Y. Graph Attention Networks. arXiv:1710.10903 (2017). url: arxiv.org/
abs/1710.10903.
[400] Venturi, L., Bandeira, A. S., and Bruna, J. Spurious Valleys in One-hidden-
layer Neural Network Optimization Landscapes. J. Mach. Learn. Res. 20, 133 (2019),
pp. 1–34. url: jmlr.org/papers/v20/18-674.html.
[401] Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T.,
and Saenko, K. Sequence to Sequence – Video to Text. In Proceedings of the IEEE
International Conference on Computer Vision (ICCV) (Santiago, Chile, Dec. 7–13,
2015). IEEE Computer Society, 2015. url: doi.org/10.1109/ICCV.2015.515.
[402] Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting
and Composing Robust Features with Denoising Autoencoders. In Proceedings of the
25th International Conference on Machine Learning. ICML ’08. Helsinki, Finland:
Association for Computing Machinery, 2008, pp. 1096–1103. url: doi.org/10.
1145/1390156.1390294.
[403] Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A.
Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network
with a Local Denoising Criterion. J. Mach. Learn. Res. 11, 110 (2010), pp. 3371–3408.
url: jmlr.org/papers/v11/vincent10a.html.
[404] Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., and
Tang, X. Residual Attention Network for Image Classification. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Honolulu,
HI, USA, July 21–26, 2017). IEEE Computer Society, 2017. url: doi.org/10.1109/
CVPR.2017.683.
[405] Wang, N., Zhang, D., Chang, H., and Li, H. Deep learning of subsurface
flow via theory-guided neural network. J. Hydrology 584 (2020), p. 124700. url:
doi.org/10.1016/j.jhydrol.2020.124700.
[406] Wang, S., Wang, H., and Perdikaris, P. Learning the solution operator of
parametric partial differential equations with physics-informed DeepONets. Science
Advances 7, 40 (2021), eabi8605. url: doi.org/10.1126/sciadv.abi8605.
[407] Wang, Y., Zou, R., Liu, F., Zhang, L., and Liu, Q. A review of wind speed
and wind power forecasting with deep neural networks. Appl. Energy 304 (2021),
Art. No. 117766. url: doi.org/10.1016/j.apenergy.2021.117766.

[408] Wang, Z., Yan, W., and Oates, T. Time series classification from scratch with
deep neural networks: A strong baseline. In 2017 International Joint Conference on
Neural Networks (IJCNN). 2017, pp. 1578–1585. url: doi.org/10.1109/IJCNN.
2017.7966039.
[409] Welper, G. Approximation Results for Gradient Descent trained Neural Networks.
arXiv:2309.04860 (2023). url: arxiv.org/abs/2309.04860.
[410] Wen, G., Li, Z., Azizzadenesheli, K., Anandkumar, A., and Benson,
S. M. U-FNO – An enhanced Fourier neural operator-based deep-learning model for
multiphase flow. arXiv:2109.03697 (2021). url: arxiv.org/abs/2109.03697.
[411] West, D. Introduction to Graph Theory. Prentice Hall, 2001. 588 pp.
[412] Wu, F., Souza, A., Zhang, T., Fifty, C., Yu, T., and Weinberger, K.
Simplifying Graph Convolutional Networks. In Proceedings of the 36th International
Conference on Machine Learning (Long Beach, California, USA, June 9–15, 2019).
Ed. by Chaudhuri, K. and Salakhutdinov, R. Vol. 97. Proceedings of Machine
Learning Research. PMLR, 2019, pp. 6861–6871. url: proceedings.mlr.press/
v97/wu19e.html.
[413] Wu, K., Yan, X.-b., Jin, S., and Ma, Z. Asymptotic-Preserving Convolutional
DeepONets Capture the Diffusive Behavior of the Multiscale Linear Transport
Equations. arXiv:2306.15891 (2023). url: arxiv.org/abs/2306.15891.
[414] Wu, Z., Ramsundar, B., Feinberg, E. N., Gomes, J., Geniesse, C., Pappu,
A. S., Leswing, K., and Pande, V. MoleculeNet: a benchmark for molecular
machine learning. Chem. Sci. 9, 2 (2018), pp. 513–530. url: doi.org/10.1039/
C7SC02664A.
[415] Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Yu, P. S. A Compre-
hensive Survey on Graph Neural Networks. IEEE Trans. Neural Netw. Learn. Syst.
32, 1 (2021), pp. 4–24. url: doi.org/10.1109/TNNLS.2020.2978386.
[416] Xie, J., Xu, L., and Chen, E. Image Denoising and Inpainting with Deep Neural
Networks. In Advances in Neural Information Processing Systems. Ed. by Pereira, F.,
Burges, C., Bottou, L., and Weinberger, K. Vol. 25. Curran Associates, Inc., 2012.
url: proceedings.neurips.cc/paper_files/paper/2012/file/6cdd60ea0045
eb7a6ec44c54d29ed402-Paper.pdf.
[417] Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. Aggregated Residual
Transformations for Deep Neural Networks. In 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) (Honolulu, HI, USA, July 21–26, 2017).
IEEE Computer Society, 2017, pp. 5987–5995. url: doi.org/10.1109/CVPR.2017.
634.

[418] Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., Zhang, H.,
Lan, Y., Wang, L., and Liu, T.-Y. On Layer Normalization in the Transformer
Architecture. In Proceedings of the 37th International Conference on Machine Learn-
ing (July 13–18, 2020). ICML’20. JMLR.org, 2020, 975, pp. 10524–10533. url:
proceedings.mlr.press/v119/xiong20b.html.
[419] Xiong, W., Huang, X., Zhang, Z., Deng, R., Sun, P., and Tian, Y. Koopman
neural operator as a mesh-free solver of non-linear partial differential equations.
arXiv:2301.10022 (2023). url: arxiv.org/abs/2301.10022.
[420] Xu, R., Zhang, D., Rong, M., and Wang, N. Weak form theory-guided neural
network (TgNN-wf) for deep learning of subsurface single- and two-phase flow. J.
Comput. Phys. 436 (2021), Art. No. 110318, 20 pp. url: doi.org/10.1016/j.jcp.
2021.110318.
[421] Yang, L., Meng, X., and Karniadakis, G. E. B-PINNs: Bayesian physics-
informed neural networks for forward and inverse PDE problems with noisy data. J.
Comput. Phys. 425 (2021), Art. No. 109913. url: doi.org/10.1016/j.jcp.2020.
109913.
[422] Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., and Le,
Q. V. XLNet: Generalized Autoregressive Pretraining for Language Understanding.
arXiv:1906.08237 (2019). url: arxiv.org/abs/1906.08237.
[423] Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural
Networks 94 (2017), pp. 103–114. url: doi.org/10.1016/j.neunet.2017.07.002.
[424] Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W. L., and
Leskovec, J. Graph Convolutional Neural Networks for Web-Scale Recommender
Systems. In Proceedings of the 24th ACM SIGKDD International Conference on
Knowledge Discovery & Data Mining (London, United Kingdom, Aug. 19–23, 2018).
KDD ’18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 974–
983. url: doi.org/10.1145/3219819.3219890.
[425] Yu, Y., Si, X., Hu, C., and Zhang, J. A Review of Recurrent Neural Networks:
LSTM Cells and Network Architectures. Neural Comput. 31, 7 (July 2019), pp. 1235–
1270. url: doi.org/10.1162/neco_a_01199.
[426] Yun, S., Jeong, M., Kim, R., Kang, J., and Kim, H. J. Graph Transformer
Networks. In Advances in Neural Information Processing Systems. Ed. by Wallach, H.,
Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., and Garnett, R. Vol. 32.
Curran Associates, Inc., 2019. url: proceedings.neurips.cc/paper_files/
paper/2019/file/9d63484abb477c97640154d40595a3bb-Paper.pdf.
[427] Zagoruyko, S. and Komodakis, N. Wide Residual Networks. arXiv:1605.07146
(2016). url: arxiv.org/abs/1605.07146.

[428] Zang, Y., Bao, G., Ye, X., and Zhou, H. Weak adversarial networks for
high-dimensional partial differential equations. J. Comput. Phys. 411 (2020),
Art. No. 109409, 14 pp. url: doi.org/10.1016/j.jcp.2020.109409.
[429] Zeiler, M. D. ADADELTA: An Adaptive Learning Rate Method. arXiv:1212.5701
(2012). url: arxiv.org/abs/1212.5701.
[430] Zeng, D., Liu, K., Lai, S., Zhou, G., and Zhao, J. Relation Classification
via Convolutional Deep Neural Network. In Proceedings of COLING 2014, the 25th
International Conference on Computational Linguistics: Technical Papers. Dublin,
Ireland: Dublin City University and Association for Computational Linguistics, Aug.
2014, pp. 2335–2344. url: aclanthology.org/C14-1220.
[431] Zhang, A., Lipton, Z. C., Li, M., and Smola, A. J. Dive into Deep Learning.
Cambridge University Press, 2023. url: d2l.ai.
[432] Zhang, J., Zhang, S., Shen, J., and Lin, G. Energy-Dissipative Evolutionary
Deep Operator Neural Networks. arXiv:2306.06281 (2023). url: arxiv.org/abs/
2306.06281.
[433] Zhang, J., Mokhtari, A., Sra, S., and Jadbabaie, A. Direct Runge-Kutta
Discretization Achieves Acceleration. arXiv:1805.00521 (2018). url: arxiv.org/
abs/1805.00521.
[434] Zhang, X., Zhao, J., and LeCun, Y. Character-level Convolutional Networks for
Text Classification. In Advances in Neural Information Processing Systems. Ed. by
Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R. Vol. 28. Curran
Associates, Inc., 2015. url: proceedings.neurips.cc/paper_files/paper/2015/
file/250cf8b51c773f3f8dc8b4be867a9a02-Paper.pdf.
[435] Zhang, Y., Li, Y., Zhang, Z., Luo, T., and Xu, Z.-Q. J. Embedding Principle:
a hierarchical structure of loss landscape of deep neural networks. arXiv:2111.15527
(2021). url: arxiv.org/abs/2111.15527.
[436] Zhang, Y., Zhang, Z., Luo, T., and Xu, Z.-Q. J. Embedding Principle of Loss
Landscape of Deep Neural Networks. arXiv:2105.14573 (2021). url: arxiv.org/
abs/2105.14573.
[437] Zhang, Y. and Wallace, B. A Sensitivity Analysis of (and Practitioners’ Guide
to) Convolutional Neural Networks for Sentence Classification. In Proceedings of the
Eighth International Joint Conference on Natural Language Processing (Volume 1:
Long Papers) (Taipei, Taiwan, Nov. 27–Dec. 1, 2017). Asian Federation of Natural
Language Processing, 2017, pp. 253–263. url: aclanthology.org/I17-1026.
[438] Zhang, Y., Chen, C., Shi, N., Sun, R., and Luo, Z.-Q. Adam Can Converge
Without Any Modification On Update Rules. arXiv:2208.09632 (2022). url: arxiv.
org/abs/2208.09632.

[439] Zhang, Z., Cui, P., and Zhu, W. Deep Learning on Graphs: A Survey. IEEE
Trans. Knowledge Data Engrg. 34, 1 (2022), pp. 249–270. url: doi.org/10.1109/
TKDE.2020.2981333.
[440] Zheng, Y., Liu, Q., Chen, E., Ge, Y., and Zhao, J. L. Time Series Classification
Using Multi-Channels Deep Convolutional Neural Networks. In Web-Age Information
Management. Ed. by Li, F., Li, G., Hwang, S.-w., Yao, B., and Zhang, Z. Springer,
Cham, 2014, pp. 298–310. url: doi.org/10.1007/978-3-319-08010-9_33.
[441] Zhou, H., Zhang, S., Peng, J., Zhang, S., Li, J., Xiong, H., and Zhang, W.
Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecast-
ing. Proceedings of the AAAI Conference on Artificial Intelligence 35, 12 (2021),
pp. 11106–11115. url: doi.org/10.1609/aaai.v35i12.17325.
[442] Zhou, J., Cui, G., Hu, S., Zhang, Z., Yang, C., Liu, Z., Wang, L., Li, C.,
and Sun, M. Graph neural networks: A review of methods and applications. AI
Open 1 (2020), pp. 57–81. url: doi.org/10.1016/j.aiopen.2021.01.001.
[443] Zhu, Y. and Zabaras, N. Bayesian deep convolutional encoder-decoder networks
for surrogate modeling and uncertainty quantification. J. Comput. Phys. 366 (2018),
pp. 415–447. url: doi.org/10.1016/j.jcp.2018.04.018.
