
The Mathematics of Artificial Intelligence

Gabriel Peyré
CNRS and ENS, Université PSL
[email protected]
January 22, 2025

Abstract
This overview article highlights the critical role of mathematics in artificial
intelligence (AI), emphasizing that mathematics provides tools to better understand and
enhance AI systems. Conversely, AI raises new problems and drives the development of new
mathematics at the intersection of various fields. This article focuses on the
application of analytical and probabilistic tools to model neural network architectures
and better understand their optimization. Statistical questions (particularly the
generalization capacity of these networks) are intentionally set aside, though they are
of crucial importance. We also shed light on the evolution of ideas that have enabled
significant advances in AI through architectures tailored to specific tasks, each echoing
distinct mathematical techniques. The goal is to encourage more mathematicians to take an
interest in and contribute to this exciting field.

1 Supervised Learning
Recent advancements in artificial intelligence have mainly stemmed from the development
of neural networks, particularly deep networks. The first significant successes, after
2010, came from supervised learning, where training pairs (xi, yi)i are provided, with xi
representing data (e.g., images or text) and yi corresponding labels (typically, classes
describing the content of the data). More recently, spectacular progress has been
achieved in unsupervised learning, where labels yi are unavailable, thanks to techniques
known as generative AI. These methods will be discussed in Section 4.

Empirical Risk Minimization. In supervised learning, the objective is to construct
a function fθ(x), dependent on parameters θ, such that it approximates the data well:
yi ≈ fθ(xi). The dominant paradigm involves finding the parameters θ by minimizing an
empirical risk function, defined as:

min_θ E(θ) := Σ_i ℓ(fθ(xi), yi),                                        (1)

where ℓ is a loss function, typically ℓ(y, y′) = ‖y − y′‖² for vector-valued (yi)i.
Optimization of E(θ) is performed using gradient descent:

θt+1 = θt − τ ∇E(θt),                                                   (2)

where τ is the step size. In practice, a variant known as stochastic gradient descent [23]
is used to handle large datasets by randomly sampling a subset at each step t. A
comprehensive theory for the convergence of this method exists in cases where fθ depends
linearly on θ, as E(θ) is convex. However, in the case of deep neural networks, E(θ) is
non-convex, making its theoretical analysis challenging and still largely unresolved. In
some instances, partial mathematical analyses provide insights into the observed
practical successes and guide necessary modifications to improve convergence. For a
detailed overview of existing results, refer to [2].
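To make this concrete, here is a minimal Python/NumPy sketch of stochastic gradient descent (2) on the empirical risk (1), in the linear case fθ(x) = ⟨θ, x⟩ where E(θ) is convex. The synthetic data, step size and batch size are illustrative choices, not taken from the article.

```python
# Minimal sketch of stochastic gradient descent (2) on the empirical risk (1),
# for the linear model f_theta(x) = <theta, x> with squared loss.
# Synthetic data and hyperparameters are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
n_samples, d = 1000, 20
X = rng.normal(size=(n_samples, d))                     # data x_i
theta_true = rng.normal(size=d)
Y = X @ theta_true + 0.1 * rng.normal(size=n_samples)   # labels y_i

def grad_E(theta, xb, yb):
    """Gradient of the mini-batch squared loss sum_i ||<theta, x_i> - y_i||^2."""
    residual = xb @ theta - yb
    return 2.0 * xb.T @ residual / len(yb)

theta = np.zeros(d)
tau, batch_size = 0.05, 32                              # step size and batch size
for t in range(2000):
    idx = rng.choice(n_samples, size=batch_size, replace=False)  # random subset at step t
    theta = theta - tau * grad_E(theta, X[idx], Y[idx])          # update (2)

print("parameter error:", np.linalg.norm(theta - theta_true))
```

For a deep network only the gradient computation changes, which is where automatic differentiation, discussed next, comes in.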

Automatic Differentiation. Computing the gradient ∇E(θ) is fundamental. Finite
difference methods or traditional differentiation of composite functions would be too
costly, with complexity on the order of O(DT), where D is the dimensionality of θ and T
is the computation time of fθ. The advances in deep network optimization rely on the use
of automatic differentiation in reverse mode [13], often referred to as “backpropagation
of the gradient,” with complexity on the order of O(T).
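The following toy, self-contained sketch illustrates the reverse mode on a small computation graph: one forward pass records the operations, and a single backward pass propagates adjoints, so the total cost stays proportional to that of the forward evaluation. It is an illustrative mini-implementation, not the algorithm of [13] as used in practice.

```python
# Toy reverse-mode automatic differentiation ("backpropagation") sketch.
# Each operation records its inputs and local derivatives on a tape (the graph),
# and a single backward sweep accumulates the gradient of the output with
# respect to every intermediate quantity.
import math

class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs (parent_variable, local_derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def tanh(x):
    t = math.tanh(x.value)
    return Var(t, [(x, 1.0 - t * t)])

def backward(output):
    """Propagate d(output)/d(node) from the output back to the leaves."""
    output.grad = 1.0
    order, seen = [], set()
    def visit(v):                        # topological order of the graph
        if id(v) not in seen:
            seen.add(id(v))
            for parent, _ in v.parents:
                visit(parent)
            order.append(v)
    visit(output)
    for v in reversed(order):
        for parent, local in v.parents:
            parent.grad += local * v.grad

# Example: f(a, b) = tanh(a * b + a); one backward pass gives both partial derivatives.
a, b = Var(0.5), Var(-1.3)
f = tanh(a * b + a)
backward(f)
print("df/da =", a.grad, " df/db =", b.grad)
```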

2 Two-Layer Neural Networks


Multi-Layer Perceptrons. A multi-layer network, or multi-layer perceptron (MLP) [25],
computes the function fθ (x) = xL in L steps (or layers) recursively starting from x0 = x:

xℓ+1 = σ(Wℓ xℓ + bℓ ), (3)

where Wℓ is a weight matrix, bℓ a bias vector, and σ a non-linear activation function,
such as the sigmoid σ(s) = e^s/(1 + e^s), which is bounded, or ReLU (Rectified Linear
Unit), σ(s) = max(s, 0), which is unbounded. The parameters to optimize are the weights
and biases θ = (Wℓ, bℓ)ℓ. If σ were linear, then fθ would remain linear regardless of the
number of layers. The non-linearity of σ enriches the set of functions representable by
fθ, and this set is further enlarged by increasing both the width (the dimension of the
intermediate vectors xℓ) and the depth (the number L of layers).
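A direct NumPy transcription of the recursion (3) is given below; layer sizes and the random initialization are arbitrary placeholders.

```python
# Direct implementation of the MLP recursion (3): x_{l+1} = sigma(W_l x_l + b_l),
# starting from x_0 = x. Sizes and random weights are illustrative.
import numpy as np

def relu(s):
    return np.maximum(s, 0.0)            # the ReLU activation sigma(s) = max(s, 0)

def mlp_forward(x, weights, biases, sigma=relu):
    """Compute f_theta(x) = x_L by iterating (3)."""
    for W, b in zip(weights, biases):
        x = sigma(W @ x + b)
    return x

rng = np.random.default_rng(0)
dims = [10, 64, 64, 3]                   # input dim, two hidden widths, output dim
weights = [rng.normal(scale=0.1, size=(dims[i + 1], dims[i])) for i in range(3)]
biases = [np.zeros(dims[i + 1]) for i in range(3)]
x0 = rng.normal(size=dims[0])
print(mlp_forward(x0, weights, biases))
```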

Two-Layer Perceptrons and Universality. The mathematically best-understood
case is that of L = 2 layers. Denoting by (vk)_{k=1..n} the n rows of W1 and by
(uk)_{k=1..n} the n columns of W2 (where n is the network width), we can write fθ as a
sum of contributions from its n neurons:

fθ(x) = (1/n) Σ_{k=1}^n uk σ(⟨vk, x⟩ + bk).                             (4)

The parameters are θ = (uk, vk, bk)_{k=1..n}. Here, we added a normalization factor 1/n
(to later study the case n → +∞) and ignored the non-linearity of the second layer. A
classical result by Cybenko [8] shows that these functions fθ can uniformly approximate
any continuous function on a compact domain. This result is similar to the Weierstrass
theorem for polynomial approximation, except that (4) defines, for a fixed n, a non-linear
function space. The elegant proof by Hornik [15] relies on the Stone-Weierstrass theorem,
which implies the result for σ = cos (since (4) defines an algebra of functions when n
is arbitrary). The proof is then completed by uniformly approximating a cosine in 1-D
using a sum of sigmoids or piecewise affine functions (e.g., for σ = ReLU).
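The 1-D construction can be made explicit. The sketch below (with an arbitrary target function and grid) assembles a two-layer ReLU network of the form (4) whose parameters are chosen by hand, rather than learned, so that the network coincides with the piecewise linear interpolant of the target on a grid; the 1/n factor of (4) is absorbed into the outer weights.

```python
# Constructive 1-D illustration of universality with sigma = ReLU: build the
# hinge coefficients of a two-layer network (4) so that it reproduces the
# piecewise linear interpolant of a continuous target on [0, 1].
import numpy as np

def relu(s):
    return np.maximum(s, 0.0)

target = np.cos                            # continuous target on a compact interval
n = 30                                     # number of neurons = number of grid cells
t = np.linspace(0.0, 1.0, n + 1)           # knots t_0, ..., t_n
slopes = np.diff(target(t)) / np.diff(t)   # slope of the interpolant on each cell

u = np.concatenate(([slopes[0]], np.diff(slopes)))   # change of slope at each knot
v = np.ones(n)                             # inner weights: <v_k, x> = x in 1-D
b = -t[:n]                                 # biases so that ReLU(x - t_k) activates at t_k

def f_theta(x):
    """Two-layer ReLU network: target(t_0) + sum_k u_k ReLU(x - t_k)."""
    return target(t[0]) + relu(np.outer(np.atleast_1d(x), v) + b) @ u

xs = np.linspace(0.0, 1.0, 200)
print("max error |f_theta - target| on [0, 1]:", np.max(np.abs(f_theta(xs) - target(xs))))
```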

Mean-Field Representation. This result is, however, disappointing, as it does not
specify how many neurons are needed to achieve a given approximation error. This is
impossible without adding assumptions about the regularity of the function to
approximate. Barron’s fundamental result [4] introduces a Banach space of regular
functions, defined by the semi-norm ‖f‖_B := ∫ |f̂(ω)| |ω| dω, where f̂ denotes the
Fourier transform of f. Barron shows that, in the L² norm, for any n, there exists a
network fθ with n neurons such that the approximation error on a compact Ω of radius R
is of the order ‖f − fθ‖_{L²(Ω)} = O(R ‖f‖_B / √n). This result is remarkable because it
avoids the “curse of dimensionality”: unlike polynomial approximation, the error does not
grow exponentially with the dimension d (although the dimension affects the constant
‖f‖_B).
Barron’s proof relies on a “mean-field” generalization of (4), where a distribution (a
probability measure) ρ is considered over the parameter space (u, v, b), and the network
is expressed as:

Fρ(x) := ∫ u σ(⟨v, x⟩ + b) dρ(u, v, b).                                 (5)

A finite-size network (4), fθ = Fρ̂, is obtained with a discrete measure
ρ̂ = (1/n) Σ_{k=1}^n δ_(uk, vk, bk). An advantage of the representation (5) is that Fρ
depends linearly on ρ, which is crucial for Barron’s proof. This proof uses a
probabilistic Monte Carlo-like method (where the error decreases as 1/√n as desired): it
involves constructing a distribution ρ from f, then sampling a discrete measure ρ̂ whose
parameters (uk, vk, bk)k are distributed according to ρ.

Wasserstein Gradient Flow. In general, analyzing the convergence of the
optimization (2) is challenging because the function E is non-convex. Recent analyses
have shown that when the number of neurons n is large, the dynamics are not trapped in a
local minimum. The fundamental result by Chizat and Bach [7] is based on the fact that
the distribution ρt defined by gradient descent (2), in the limit of step size τ → 0,
follows a gradient flow in the space of probability distributions equipped with the
optimal transport distance. These Wasserstein gradient flows, introduced by [16] and
studied extensively in the book [1], satisfy a McKean-Vlasov-type partial differential
equation:

∂t ρt + div(ρt V(ρt)) = 0,                                              (6)

where x → V(ρ)(x) is a vector field depending on ρ, which can be computed explicitly
from the data (xi, yi)i. The result by Chizat and Bach can be viewed both as a PDE
result (convergence to an equilibrium of a class of PDEs with a specific vector field
V(ρ)) and as a machine learning result (successful training of a two-layer network via
gradient descent when the number of neurons is sufficiently large).
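The particle viewpoint can be illustrated numerically: the sketch below trains a wide two-layer ReLU network (4), here with scalar outputs on a toy 1-D regression problem, by full-batch gradient descent (2). Each neuron (uk, vk, bk) is a particle, and the empirical distribution of these particles plays the role of ρt. Width, step size and iteration count are arbitrary and untuned.

```python
# Training a two-layer ReLU network (4) with gradient descent (2): the neurons
# (u_k, v_k, b_k) evolve as interacting particles, the viewpoint behind the
# Wasserstein gradient flow (6). Toy 1-D data, illustrative hyperparameters.
import numpy as np

rng = np.random.default_rng(0)
n, n_data = 200, 100                       # network width, number of samples
x = rng.uniform(-1.0, 1.0, size=n_data)    # 1-D inputs x_i
y = np.sin(3.0 * x)                        # targets y_i

u = rng.normal(size=n)                     # outer weights u_k
v = rng.normal(size=n)                     # inner weights v_k
b = rng.normal(size=n)                     # biases b_k
tau = 0.1                                  # step size

def forward(x):
    pre = np.outer(x, v) + b               # <v_k, x_i> + b_k, shape (n_data, n)
    return pre, np.maximum(pre, 0.0) @ u / n   # f_theta(x_i) with the 1/n factor of (4)

for t in range(10000):
    pre, f = forward(x)
    r = 2.0 * (f - y) / n                  # residual factor of the squared loss
    act = np.maximum(pre, 0.0)             # sigma(<v_k, x_i> + b_k)
    dact = (pre > 0).astype(float)         # sigma'(...) for ReLU
    grad_u = r @ act                       # dE/du_k
    grad_v = (r * x) @ (dact * u)          # dE/dv_k
    grad_b = r @ (dact * u)                # dE/db_k
    u, v, b = u - tau * grad_u, v - tau * grad_v, b - tau * grad_b

print("final training error:", np.mean((forward(x)[1] - y) ** 2))
```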

3 Very Deep Networks


The unprecedented recent success of neural networks began with the work of [17], which
demonstrated that deep networks, when trained on large datasets, achieve unmatched
performance. The first key point is that, to achieve this performance, it is necessary to
use weight matrices Wℓ that exploit the structure of the data. For images, this means
using convolutions [18]. However, this approach is not sufficient for extremely deep
networks, with the number of layers L reaching into the hundreds.
Residual Networks. The major breakthrough that empirically demonstrated that network
performance increases with L was the introduction of residual connections, known as
ResNet [14]. The main idea is to ensure that, for most layers ℓ, the dimensions of xℓ and
xℓ+1 are identical, and to replace (3) with the L steps:

xℓ+1 = xℓ + (1/L) Uℓ⊤ σ(Vℓ xℓ + bℓ),                                    (7)

where Uℓ, Vℓ ∈ R^{n×d} are weight matrices, with n being the number of neurons per layer
(which, as in (4), can be increased to enlarge the function space). The intuition for the
success of (7) is that this formula allows, unlike (3), for steps that are small
deformations near identity mappings. This makes the network fθ(x0) = xL obtained after L
steps well-posed even when L is large, and it can be rigorously proven [20] that this
well-posedness is preserved during optimization via gradient descent (2).
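A literal NumPy transcription of the residual recursion (7) follows; dimensions and the random parameters are illustrative only.

```python
# Residual recursion (7): each layer adds a small update (1/L) U_l^T sigma(V_l x_l + b_l)
# to the current state. Sizes and random initialization are illustrative.
import numpy as np

def relu(s):
    return np.maximum(s, 0.0)

def resnet_forward(x, params, L):
    """Iterate (7): x_{l+1} = x_l + (1/L) U_l^T sigma(V_l x_l + b_l)."""
    for U, V, b in params:
        x = x + (1.0 / L) * (U.T @ relu(V @ x + b))
    return x

rng = np.random.default_rng(0)
d, n, L = 16, 64, 200                      # state dimension, width n, number of layers L
params = [(rng.normal(scale=0.1, size=(n, d)),   # U_l in R^{n x d}
           rng.normal(scale=0.1, size=(n, d)),   # V_l in R^{n x d}
           np.zeros(n)) for _ in range(L)]
x0 = rng.normal(size=d)
print("||x_L - x_0|| =", np.linalg.norm(resnet_forward(x0, params, L) - x0))
```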

Neural Differential Equation. As L approaches +∞, (7) can be interpreted as a
discretization of an ordinary differential equation:

dxs/ds = Us⊤ σ(Vs xs + bs),                                             (8)

where s ∈ [0, 1] indexes the network depth. The network fθ(x0) := x1 maps the
initialization xs=0 to the solution xs=1 of (8) at time s = 1. The parameters are
θ = (Us, Vs, bs)s∈[0,1]. This formalization, referred to as a neural ODE, was initially
introduced in [6] to leverage tools from adjoint equation theory to reduce the memory
cost of computing the gradient ∇E(θ) during backpropagation. It also establishes a
connection to control theory, as training via gradient descent (2) computes an optimal
control θ that interpolates between the data xi and labels yi. However, the specificity
of learning theory compared to control theory lies in the goal of computing such a
control using gradient descent (2). To date, no detailed results exist on this question.
Nonetheless, in his thesis, Raphaël Barboni [3] demonstrated that if the network is
initialized near an interpolating network, gradient descent converges to it.
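The depth-continuous viewpoint can be sketched with an explicit Euler discretization of (8) with step 1/L, here with parameters held constant in s for simplicity (an arbitrary simplification); each Euler step then coincides with the residual update (7).

```python
# Euler discretization of the neural ODE (8) on s in [0, 1]. For simplicity the
# depth-continuous parameters (U_s, V_s, b_s) are constant in s here; the step
# size 1/L makes each update identical to the residual layer (7).
import numpy as np

def relu(s):
    return np.maximum(s, 0.0)

def neural_ode_flow(x0, U, V, b, L=1000):
    """Integrate dx/ds = U^T sigma(V x + b) from s = 0 to s = 1 with L Euler steps."""
    x, h = x0.copy(), 1.0 / L
    for _ in range(L):
        x = x + h * (U.T @ relu(V @ x + b))
    return x                                # approximates f_theta(x_0) := x_1

rng = np.random.default_rng(0)
d, n = 8, 32
U, V = rng.normal(scale=0.1, size=(n, d)), rng.normal(scale=0.1, size=(n, d))
b = np.zeros(n)
x0 = rng.normal(size=d)
print(neural_ode_flow(x0, U, V, b))
```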

4 Generative AI for Vector Data


Self-Supervised Pre-Training. The remarkable success of large generative AI models
for vector data, such as images (often referred to as “diffusion models”), and for text (large
language models or LLMs), relies on the use of new large-scale network architectures and
the creation of new training tasks known as “self-supervised”. Manually defining the
labels y i through human intervention is too costly, so these are calculated automatically
by solving simple tasks. For images, these involve denoising tasks, while for text, they
involve predicting the next word. A key advantage of these simple tasks (known as pre-
training tasks) is that it becomes possible to use a pre-trained network fθ generatively.
Starting from an image composed of pure noise and iterating on the network, one can
randomly generate a realistic image [27]. Similarly, for text, starting from a prompt and
sequentially predicting the next word, it is possible to generate text, for example, to
answer questions [5]. We will first describe the case of vector data generation, such as
images, and address LLMs for text in the following section.

Sampling as an Optimization Problem. Generative AI models aim to generate (or
“sample”) random vectors x according to a distribution β, which is learned from a large
training dataset (xi )i . This distribution is obtained by “pushing forward” a reference
distribution α (most commonly an isotropic Gaussian distribution α = N (0, Id)) through
a neural network fθ . Specifically, if X ∼ α is distributed according to α, then fθ (X) ∼ β
follows the law β, which is denoted as (fθ )♯ α = β, where ♯ represents the pushforward
operator (also known as the image measure in probability theory).
Early approaches to generative AI, such as “Generative Adversarial Networks” (GANs) [12],
attempted to directly optimize a distance D between probability distributions (e.g., an
optimal transport distance [22]):
min_θ D((fθ)♯ α, β).                                                    (9)

This problem is challenging to optimize because computing D is expensive, and β must
be approximated from the data (xi)i.

Flow-Based Generation. Recent successes, particularly in the generation of vector
data, involve neural networks fθ that are computed iteratively by integrating a
flow [21], similar to a neural differential equation (8):

fθ(x0) := x1   where   dxs/ds = gθ(xs, s),                              (10)
where gθ : (x, s) ∈ Rd × R → Rd . The input space includes an additional temporal
dimension s, compared to fθ . The most effective neural networks for images are notably
U-Nets [24].
The central mathematical question is how to replace (9) with a simpler optimization
problem when fθ is computed by integrating a flow (10). An extremely elegant solution
was first proposed in the context of diffusion models [27] (corresponding to the specific
case α = N (0, Id)) and later generalized under the name “flow matching” [19].
The main idea is that if x0 ∼ α is initialized randomly, then xs , the solution at time
s of (10), follows a distribution αs that interpolates between α0 = α and α1 , which is
desired to match β. This distribution satisfies a conservation equation:
∂s αs + div(αs vs ) = 0, (11)
where the vector field is defined as vs (x) := gθ (x, s). This equation is similar to the
evolution of neuron distributions during optimization (6), but it is simpler because vs
does not depend on the distribution αs , making the equation linear in αs .
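Once such a velocity field is available, sampling amounts to integrating (10). The sketch below uses an explicit Euler scheme; as a stand-in for a trained gθ it uses a toy closed-form field (a constant translation, which transports α = N(0, Id) onto a shifted Gaussian), since the point here is only the structure of the sampler.

```python
# Sampling by integrating the flow (10) with an explicit Euler scheme: draw
# x_0 ~ alpha = N(0, Id) and push it to x_1. The velocity field below is a toy
# closed-form stand-in (a constant translation by m); in practice g_theta is a
# trained network such as a U-Net.
import numpy as np

def sample_flow(g, d, n_steps=100, rng=None):
    """Draw x_0 ~ N(0, Id) and integrate dx/ds = g(x, s) over s in [0, 1]."""
    if rng is None:
        rng = np.random.default_rng()
    x = rng.normal(size=d)                    # x_0 ~ alpha
    h = 1.0 / n_steps
    for k in range(n_steps):
        x = x + h * g(x, k * h)               # Euler step of (10)
    return x                                  # approximate sample from beta

d = 2
m = np.array([3.0, -1.0])                     # illustrative target mean
g = lambda x, s: m                            # constant field: pushes N(0, Id) to N(m, Id)
rng = np.random.default_rng(0)
samples = np.array([sample_flow(g, d, rng=rng) for _ in range(500)])
print("empirical mean of generated samples:", samples.mean(axis=0))
```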

Denoising Pre-Training. The question is how to find a vector field vs (x) = gθ (x, s)
such that α1 = β, i.e., the final distribution matches the desired one. If vs is known,
the distribution αs is uniquely determined. The key idea is to reason in reverse: starting
from an interpolation αs satisfying α0 = α and α1 = β, how can we compute a vector
field vs such that (11) holds? There are infinitely many possible solutions (since vs can
be modified by a conservative field), but for certain specific interpolations αs , remarkably
simple expressions can be derived.
One example is the interpolation obtained via barycentric averaging: take a pair
x0 ∼ α and x1 ∼ β, and define αs as the distribution of (1 − s)x0 + sx1 . It can be
shown [19] that an admissible vector field is given by a simple conditional expectation:

vs(x) = E_{x0∼α, x1∼β}[ x1 − x0 | (1 − s)x0 + sx1 = x ].                (12)

The key advantage of this formula is that the conditional mean corresponds to a linear
regression, which can be approximated using a neural network vs ≈ gθ(·, s). The
expectation over x1 ∼ β can be replaced by a sum over training data (xi)i, leading to
the following optimization problem:

min_θ E(θ) := ∫_0^1 Σ_i E_{x0∼α} ‖xi − x0 − gθ((1 − s)x0 + s xi, s)‖² ds.

This function E(θ) is an empirical risk function (1), and similar optimization techniques
are used to find an optimal θ. In particular, stochastic gradient descent efficiently
handles both the integral ∫_0^1 and the expectation E_{x0∼α}.
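The sketch below runs stochastic gradient descent on this objective for a toy 2-D dataset. To keep it self-contained, gθ is taken linear in its input (a drastic simplification made only for illustration, echoing the remark that (12) is a regression); in practice gθ is a deep network such as a U-Net.

```python
# Stochastic gradient descent on the flow matching objective: sample s ~ U([0,1]),
# x_0 ~ alpha = N(0, Id) and a data point x_i, then regress x_i - x_0 on the
# interpolation point (1-s) x_0 + s x_i. The model is linear in its input, a toy
# stand-in for a deep network; data and hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, n_data = 2, 2000
data = rng.normal(size=(n_data, d)) + np.array([4.0, 0.0])   # toy dataset (x_i)_i

W = np.zeros((d, d + 1))          # g_theta(x, s) = W [x; s] + c, theta = (W, c)
c = np.zeros(d)
tau = 1e-3                        # step size

for t in range(20000):
    s = rng.uniform()                              # time s in [0, 1]
    x0 = rng.normal(size=d)                        # x_0 ~ alpha
    x1 = data[rng.integers(n_data)]                # x_i sampled from the training data
    z = np.append((1 - s) * x0 + s * x1, s)        # input ((1-s) x_0 + s x_i, s)
    residual = (x1 - x0) - (W @ z + c)             # regression residual of (12)
    W += tau * 2.0 * np.outer(residual, z)         # SGD step on ||residual||^2
    c += tau * 2.0 * residual

# Within this (very restricted) model class, (W, c) approximates the field (12).
print("learned parameters W, c:", W, c)
```

New samples would then be generated by integrating (10) with the learned gθ, as in the Euler sampler above.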
The problem (12) corresponds to an unsupervised pre-training task: there are no
labels y i , but an artificial supervision task is created by adding random noise x0 to the
data xi . This task is called denoising. We will now see that, for textual data, a different
pre-training task is used.

5 Generative AI for Text


Tokenization and Next-Word Prediction. Generative AI methods for text differ
from those used for vector data generation. The neural network architectures are different
(they involve transformers, as we will describe), and the pre-training method is based not
on denoising but on next-word prediction [28]. It is worth noting that these transformer
neural networks are now also used for image generation [9], but specific aspects related to
the causal nature of text remain crucial. The first preliminary step, called “tokenization”,
consists of transforming the input text into a sequence of vectors X = (x[1], . . . , x[P ]),
where the number P is variable (and may increase, for instance, when generating text
from a prompt). Each token x[p] is a vector that encodes a group of letters, generally at
the level of a syllable. The neural network x = fθ (X) is applied to all the tokens and
predicts a new token x. During training, a large corpus of text (X i )i is available. If we
denote by X̃ i the text X i with the last token xi removed, we minimize an empirical risk
for prediction:

min_θ E(θ) := Σ_i ℓ(fθ(X̃i), xi),

which has exactly the same form as (1). When using a pre-trained network fθ for text
generation, one starts with a prompt X and iteratively adds a new token in an
auto-regressive manner: X ← [X, fθ(X)].
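The generation loop itself is simple; in the sketch below the next-token predictor is a stand-in function (an average of the last tokens, a purely hypothetical placeholder for a trained Transformer), and the prompt is a sequence of random vectors.

```python
# Auto-regressive generation: starting from a prompt X = (x[1], ..., x[P]),
# repeatedly append the predicted next token, X <- [X, f_theta(X)]. The predictor
# below is a toy placeholder for a trained Transformer.
import numpy as np

def f_theta(X):
    """Toy next-token predictor: returns one new token given all tokens so far."""
    return X[-3:].mean(axis=0)                 # placeholder, not a real model

rng = np.random.default_rng(0)
d = 4
X = rng.normal(size=(5, d))                    # prompt: P = 5 tokens of dimension d
for _ in range(10):                            # generate 10 new tokens
    new_token = f_theta(X)
    X = np.vstack([X, new_token])              # X <- [X, f_theta(X)]
print("sequence length after generation:", len(X))
```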

Transformers and Attention. The large networks used for text generation tasks are
Transformer networks [10]. Unlike ResNet (7), these networks fθ no longer operate on a
single vector x, but on a set of vectors X = (x[1], . . . , x[P ]) of size P . In Transformers,
the ResNet layer (7), operating on a single vector, is replaced by an attention layer, where
all tokens interact through a barycentric combination of vectors (V x[q])q where V is a
matrix:
Aω(X)p := Σ_q Mp,q (V x[q]),   where   Mp,q := e^⟨Qx[p],Kx[q]⟩ / Σ_ℓ e^⟨Qx[p],Kx[ℓ]⟩
and ω := (Q, K, V).                                                     (13)

The coefficients Mp,q are computed by normalized correlation between x[p] and x[q],
depending on two parameter matrices Q and K. A Transformer fθ (X) is then defined
similarly to ResNet (7), iterating L attention layers with residual connections:
Xℓ+1 = Xℓ + (1/L) Aωℓ(Xℓ).                                              (14)

The parameters of the Transformer fθ(X0) = XL with L layers are
θ = (ωℓ = (Qℓ, Kℓ, Vℓ))ℓ=0,...,L−1.
This description is simplified: in practice, a Transformer network also integrates
normalization layers and MLPs operating independently on each token. For text
applications, attention must be causal, imposing Mp,q = 0 for q > p. This constraint is
essential for recursively generating text and ensuring the next-word prediction task is
meaningful.
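A compact NumPy implementation of the attention layer (13), its causal variant, and the residual iteration (14) is sketched below; dimensions and the random parameter matrices are illustrative (the normalization layers and per-token MLPs mentioned above are omitted).

```python
# Attention layer (13) with an optional causal mask (M_{p,q} = 0 for q > p), and
# the residual Transformer iteration (14). Random parameters, illustrative sizes.
import numpy as np

def attention(X, Q, K, V, causal=True):
    """A_omega(X): barycentric combination of the (V x[q])_q with weights (13)."""
    scores = (X @ Q.T) @ (X @ K.T).T                   # <Q x[p], K x[q]> for all p, q
    if causal:
        P = len(X)
        scores = np.where(np.arange(P)[None, :] > np.arange(P)[:, None], -np.inf, scores)
    M = np.exp(scores - scores.max(axis=1, keepdims=True))
    M = M / M.sum(axis=1, keepdims=True)               # each row of M sums to 1
    return M @ (X @ V.T)                               # sum_q M_{p,q} V x[q]

def transformer(X, params):
    """Iterate the residual attention layers (14)."""
    L = len(params)
    for Q, K, V in params:
        X = X + (1.0 / L) * attention(X, Q, K, V)
    return X

rng = np.random.default_rng(0)
P, d, L = 6, 8, 12                                     # number of tokens, dimension, depth
params = [tuple(rng.normal(scale=0.3, size=(d, d)) for _ in range(3)) for _ in range(L)]
X0 = rng.normal(size=(P, d))
print("output shape:", transformer(X0, params).shape)  # (P, d): one vector per token
```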

Mean-Field Representation of Attention. The attention mechanism (13) can be
viewed as a system of interacting particles, and (14) describes the evolution of tokens
across depth. As with ResNet (8), in the limit L → +∞, one can consider a system of
coupled ordinary differential equations:

dXs/ds = Aωs(Xs).                                                       (15)

A crucial point is that for non-causal Transformers, the system Xs = (xs[p])p is
invariant under permutations of the indices. Thus, this system can be represented as a
probability distribution µs := (1/P) Σ_p δ_{xs[p]} over the token space. This perspective
was adopted in Michael Sander’s thesis [26], which rewrites attention as:

Aω(x) := ( ∫ e^⟨Qx,Kx′⟩ V x′ dµ(x′) ) / ( ∫ e^⟨Qx,Kx′⟩ dµ(x′) ),   where ω := (Q, K, V).
The particle system (15) then becomes a conservation equation for advection by the
vector field Aωs (µ):
∂s µs + div(µs Aωs (µs )) = 0. (16)
Surprisingly, this yields a McKean-Vlasov-type equation, similar to the one describing
the training of two-layer MLPs (6), but with the velocity field Aω(µ)(x) replacing
V(ρ)(u, v, b). However, here the evolution occurs in the token space x rather than in
the neuron space (u, v, b), and the evolution variable corresponds to depth s, not the
optimization time t.
Unlike (6), the evolution (16) is not a gradient flow in the Wasserstein metric [26].
Nonetheless, in certain cases, this evolution can be analyzed, and it can be shown that
the measure µs converges to a single Dirac mass [11] as s → +∞: the tokens tend to
cluster. Better understanding these evolutions, as well as the optimization of parameters
θ = (Qs , Ks , Vs )s∈[0,1] via gradient descent (2), remains an open problem. This problem
can be viewed as a control problem for the PDE (16).
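As a numerical illustration of this clustering behaviour, the sketch below integrates the token dynamics (15) with an explicit Euler scheme and fixed parameters Q = K = V = Id (an arbitrary simplifying choice, not the precise setting of [11]) and tracks how the tokens progressively align across depth.

```python
# Toy integration of the token dynamics (15) with Q = K = V = Id: each token is
# advected towards a softmax-weighted barycenter of all tokens, and the maximal
# pairwise angle between tokens typically shrinks across depth (the tokens
# cluster in direction). Illustrative experiment, not the setting of [11].
import numpy as np

def attention_field(X):
    """A_omega(X) with Q = K = V = Id."""
    scores = X @ X.T
    M = np.exp(scores - scores.max(axis=1, keepdims=True))
    M = M / M.sum(axis=1, keepdims=True)
    return M @ X

def max_pairwise_angle(X):
    """Largest angle (in degrees) between any two tokens."""
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos = np.clip(U @ U.T, -1.0, 1.0)
    return np.degrees(np.arccos(cos.min()))

rng = np.random.default_rng(0)
P, d = 20, 3
X = rng.normal(size=(P, d))                        # initial tokens x_0[p]
h, n_steps = 0.05, 400                             # Euler step and number of steps
print(f"s =  0.0, max pairwise angle = {max_pairwise_angle(X):6.2f} deg")
for k in range(1, n_steps + 1):
    X = X + h * attention_field(X)                 # Euler step of dX/ds = A(X_s)
    if k % 100 == 0:
        print(f"s = {k * h:4.1f}, max pairwise angle = {max_pairwise_angle(X):6.2f} deg")
```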

Conclusion
Mathematics plays a critical role in understanding and improving the performance of
deep architectures while presenting new theoretical challenges. The emergence of
Transformers and generative AI raises immense mathematical problems, particularly for
better understanding the training of these networks and exploring the structure of
"optimal" networks. One essential question remains: whether an LLM merely interpolates
training data or is capable of genuine reasoning. Moreover, issues of resource
efficiency and privacy in AI system development demand significant theoretical
advancements, where mathematics will play a pivotal role. Whether designing
resource-efficient models, ensuring compliance with ethical standards, or exploring the
fundamental limits of these systems, mathematics is poised to be an indispensable tool
for the future of artificial intelligence.

References
[1] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces
and in the space of probability measures. Springer Science & Business Media, 2008.
[2] Francis Bach. Learning theory from first principles. MIT press, 2024.
[3] Raphaël Barboni, Gabriel Peyré, and François-Xavier Vialard. Understanding the
training of infinitely deep and wide resnets with conditional optimal transport. arXiv
preprint arXiv:2403.12887, 2024.
[4] Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal
function. IEEE Transactions on Information theory, 39(3):930–945, 1993.
[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Pra-
fulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell,
et al. Language models are few-shot learners. Advances in neural information pro-
cessing systems, 33:1877–1901, 2020.
[6] Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural
ordinary differential equations. Advances in neural information processing systems,
31, 2018.
[7] Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for
over-parameterized models using optimal transport. Advances in neural information
processing systems, 31, 2018.
[8] George Cybenko. Approximation by superpositions of a sigmoidal function.
Mathematics of control, signals and systems, 2(4):303–314, 1989.
[9] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image
recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[10] A. Vaswani et al. Attention is all you need. Advances in Neural Information
Processing Systems, 2017.
[11] Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy, and Philippe Rigollet. The
emergence of clusters in self-attention dynamics. Advances in Neural Information
Processing Systems, 36, 2024.
[12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks.
Advances in neural information processing systems 27, 2014.

[13] Andreas Griewank and Andrea Walther. Evaluating derivatives: principles and
techniques of algorithmic differentiation. SIAM, 2008.

[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning
for image recognition. In Proceedings of the IEEE conference on computer vision
and pattern recognition, pages 770–778, 2016.

[15] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward
networks are universal approximators. Neural networks, 2(5):359–366, 1989.

[16] Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation
of the Fokker–Planck equation. SIAM journal on mathematical analysis, 29(1):1–17,
1998.

[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification
with deep convolutional neural networks. Advances in neural information processing
systems, 25, 2012.

[18] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based
learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–
2324, 1998.

[19] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le.
Flow matching for generative modeling. Proc. ICLR 2023, 2023.

[20] Pierre Marion, Yu-Han Wu, Michael E Sander, and Gérard Biau. Implicit
regularization of deep residual networks towards neural ODEs. Proc. ICLR’23, 2023.

[21] George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed,
and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and
inference. Journal of Machine Learning Research, 22(57):1–64, 2021.

[22] Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport. Foundations
and Trends® in Machine Learning, 11(5-6):355–607, 2019.

[23] Herbert Robbins and Sutton Monro. A stochastic approximation method. The
annals of mathematical statistics, pages 400–407, 1951.

[24] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks
for biomedical image segmentation. In Medical image computing and computer-assisted
intervention–MICCAI 2015: 18th international conference, Munich, Germany, October
5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.

[25] Frank Rosenblatt. The perceptron: a probabilistic model for information storage
and organization in the brain. Psychological review, 65(6):386, 1958.

[26] Michael E Sander, Pierre Ablin, Mathieu Blondel, and Gabriel Peyré. Sinkformers:
Transformers with doubly stochastic attention. In International Conference on
Artificial Intelligence and Statistics, pages 3515–3530. PMLR, 2022.

[27] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep
unsupervised learning using nonequilibrium thermodynamics. In International
conference on machine learning, pages 2256–2265. PMLR, 2015.

[28] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with
neural networks. Proceedings of the 28th International Conference on Neural
Information Processing Systems, 2014.
