Mathematical Foundations of Deep Learning
Sourangshu Ghosh*
Abstract
We start with a strict problem formulation by establishing the risk functional as a measurable
function space mapping, studying its properties through Fréchet differentiability and convex func-
tional minimization. Deep neural network complexity is studied through VC-dimension theory
and Rademacher complexity, defining generalization bounds and hypothesis class constraints. The
universal approximation capabilities of neural networks are sharpened using convolution operators, the Stone-Weierstrass theorem, and Sobolev embeddings, with quantifiable bounds on expressivity obtained via Fourier analysis and compactness arguments based on the Rellich-Kondrachov theorem. The
depth-width trade-offs in expressivity are examined via capacity measures, spectral representations
of activation functions, and energy-based functional approximations.
The mathematical framework of training dynamics is established through carefully examining gra-
dient flow, stationary points, and Hessian eigenspectrum properties of loss landscapes. The Neural
Tangent Kernel (NTK) regime is abstracted as an asymptotic linearization of deep learning dynam-
ics with exact spectral decomposition techniques offering theoretical explanations of generalization.
PAC-Bayesian methods, spectral regularization, and information-theoretic constraints are used to
prove generalization bounds, explaining the stability of deep networks under probabilistic risk mod-
els.
The work is extended to state-of-the-art deep learning models such as convolutional neural net-
works (CNNs), recurrent neural networks (RNNs), transformers, generative adversarial networks
(GANs), and variational autoencoders (VAEs) with strong functional analysis of representational
capabilities. Optimal transport theory in deep learning is developed through Wasserstein distances, Sinkhorn regularization, and Kantorovich duality, linking generative modeling with embeddings of probability spaces. Theoretical formulations of game-theoretic deep learning architec-
tures are examined, establishing variational inequalities, equilibrium constraints, and evolutionary
stability conditions in adversarial learning paradigms.
Reinforcement learning is formalized by stochastic control theory, Bellman operators, and dynamic
programming principles, with precise derivations of policy optimization methods. We present a
rigorous treatment of optimization methods, including stochastic gradient descent (SGD), adaptive
moment estimation (Adam), and Hessian-based second-order methods, with emphasis on spectral
regularization and convergence guarantees. The information-theoretic constraints in deep learning
generalization are further examined via rate-distortion theory, entropy-based priors, and variational
inference methods.
Metric learning, adversarial robustness, and Bayesian deep learning are mathematically formal-
ized, with clear derivations of Mahalanobis distances, Gaussian mixture models, extreme value
theory, and Bayesian nonparametric priors. Few-shot and zero-shot learning paradigms are ana-
lyzed through meta-learning frameworks, Model-Agnostic Meta-Learning (MAML), and Bayesian
hierarchical inference. The mathematical framework of neural network architecture search (NAS)
is constructed through evolutionary algorithms, reinforcement learning-based policy optimization,
and differential operator constraints.
Theoretical contributions in kernel regression, deep Kolmogorov approaches, and neural approxima-
tions of differential operators are rigorously discussed, relating deep learning models to functional
approximation in infinite-dimensional Hilbert spaces. The mathematical concepts behind causal
inference in deep learning are expressed through structural causal models (SCMs), counterfactual
reasoning, domain adaptation, and invariant risk minimization. Deep learning models are discussed
using the framework of variational functionals, tensor calculus, and high-dimensional probability
theory.
This book offers a mathematically complete, carefully stated, and scientifically sound synthesis
of deep learning theory, linking mathematical fundamentals to the latest developments in neural
network science. Through its integration of functional analysis, information theory, stochastic pro-
cesses, and optimization into a unified theoretical structure, this research is a seminal guide for
scholars who aim to advance the mathematical foundations of deep learning.
Contents
1 Mathematical Foundations 1
1.1 Problem Definition: Risk Functional as a Mapping Between Spaces . . . . . . . . . 1
1.1.1 Measurable Function Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Risk as a Functional . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Approximation Spaces for Neural Networks . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 VC-dimension theory for discrete hypotheses . . . . . . . . . . . . . . . . . . 6
1.2.2 Rademacher complexity for continuous spaces . . . . . . . . . . . . . . . . . 10
1.2.3 Sobolev Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.2.4 Rellich-Kondrachov Compactness Theorem . . . . . . . . . . . . . . . . . . . 14
1.2.5 Fréchet-Kolmogorov compactness criterion . . . . . . . . . . . . . . . . . . . 15
1.2.6 Sobolev-Poincaré inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
26 Appendix 586
26.1 Linear Algebra Essentials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 586
26.1.1 Matrices and Vector Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 586
26.1.2 Vector Spaces and Linear Transformations . . . . . . . . . . . . . . . . . . . 587
26.1.3 Eigenvalues and Eigenvectors . . . . . . . . . . . . . . . . . . . . . . . . . . 587
26.1.4 Singular Value Decomposition (SVD) . . . . . . . . . . . . . . . . . . . . . . 587
26.2 Probability and Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588
26.2.1 Probability Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588
26.2.2 Bayes’ Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589
26.2.3 Statistical Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589
26.3 Optimization Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 592
26.3.1 Gradient Descent (GD) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 592
26.3.2 Stochastic Gradient Descent (SGD) . . . . . . . . . . . . . . . . . . . . . . . 593
26.3.3 Second-Order Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593
26.4 Matrix Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
26.4.1 Matrix Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
26.4.2 Tensor Differentiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595
26.5 Information Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597
26.5.1 Entropy: The Fundamental Measure of Uncertainty . . . . . . . . . . . . . . 597
26.5.2 Source Coding Theorem: Fundamental Limits of Compression . . . . . . . . 600
26.5.3 Noisy Channel Coding Theorem: Fundamental Limits of Communication . . 602
26.5.4 Rate-Distortion Theory: Lossy Data Compression . . . . . . . . . . . . . . . 604
26.5.5 Applications of Information Theory . . . . . . . . . . . . . . . . . . . . . . . 606
26.5.6 Conclusion: Information Theory as a Universal Mathematical Principle . . . 612
27 Acknowledgments 613
1 Mathematical Foundations
1.1 Problem Definition: Risk Functional as a Mapping Between Spaces
The learning problem is to minimize the expected risk
min_θ R(θ) = E_{(x,y)∼P}[ℓ(f_θ(x), y)], (1.1)
where X is the input space, Y is the output space, P is the distribution of data, ℓ(·, ·) is a loss function, and θ are neural network parameters. This problem entails the integration of various fields, each of which is examined in systematic detail below.
1.1.1 Measurable Function Spaces
A measurable space is a pair (X, Σ), where Σ ⊆ 2^X is a σ-algebra: a collection of subsets of X closed under certain set operations. To start with, Σ is closed under complementation, so that for every set A ∈ Σ, the complement A^c = X \ A is also in Σ. This ensures the possibility of establishing the "non-occurrence" of measurable events in a mathematically coherent manner. Second, Σ is closed under countable unions: for any countable family {A_n}_{n=1}^∞ ⊆ Σ, the union ⋃_{n=1}^∞ A_n is also in Σ, allowing measurable sets to be constructed out of countably infinite operations. De Morgan's laws subsequently yield closure under countable intersections, since ⋂_{n=1}^∞ A_n = (⋃_{n=1}^∞ A_n^c)^c, so that the structure supports conjunctions of countable sets of events. Lastly, the presence of the empty set ∅ ∈ Σ is an axiom which allows for a null reference point, so that the σ-algebra is not empty and can be used to express the "impossibility" of some outcomes.
Literature Review: Rao et al. (2024) [1] investigated approximation theory within Lebesgue
measurable function spaces, providing an analysis of operator convergence. They also established
a theoretical framework for function approximation in Lebesgue spaces and provided a rigorous
study of symmetric properties in function spaces. Mukhopadhyay and Ray (2025) [2] provided a
comprehensive introduction to measurable function spaces, with a focus on Lp-spaces and their
completeness properties. They also established the fundamental role of Lp-spaces in measure the-
ory and discussed the relationship between continuous function spaces and measurable functions.
Szoldra (2024) [3] examined measurable function spaces in quantum mechanics, exploring the role
of measurable observables in ergodic theory. They connected functional analysis and measure the-
ory to quantum state evolution and provided a mathematical foundation for quantum machine
learning in function spaces. Lee (2025) [10] investigated metric space theory and functional analy-
sis in the context of measurable function spaces in AI models. He formalized how function spaces
can model self-referential structures in AI and provided a bridge between measure theory and gen-
erative models. Song et al. (2025) [4] discussed measurable function spaces in the context of
urban renewal and performance evaluation. They developed a rigorous evaluation metric using
measurable function spaces and explored function space properties in applied data science and
urban analytics. Mennaoui et al. (2025) [5] explored measurable function spaces in the theory
of evolution equations, a key concept in functional analysis. They established a rigorous study
of measurable operator functions and provided new insights into function spaces and their role in
solving differential equations. Pedroza (2024) [6] investigated domain stability in machine learning
models using function spaces. He established a formal mathematical relationship between function
smoothness and domain adaptation and used topological and measurable function spaces to analyze
stability conditions in learning models. Cerreia-Vioglio and Ok (2024) [7] developed a new integra-
tion theory for measurable set-valued functions. They introduced a generalization of integration
over Banach-valued functions and established weak compactness properties in measurable function
spaces. Averin (2024) [8] applied measurable function spaces to gravitational entropy theory. He
provided a rigorous proof of entropy bounds using function space formalism and connected measure
theory with relativistic field equations. Potter (2025) [9] investigated measurable function spaces
in the context of Fourier analysis and crystallographic structures. He established new results on
Fourier transforms of measurable functions and introduced a novel framework for studying function
spaces in invariant shift operators.
Measurable spaces are not merely abstract structures but are the backbone of measure theory,
probability, and integration. For example, the Borel σ-algebra B(R) on the real numbers R is the
smallest σ-algebra containing all open intervals (a, b) for a, b ∈ R. This σ-algebra is pivotal in
defining Lebesgue measure, where measurable sets generalize the classical notion of intervals to
include sets that are neither open nor closed. Moreover, the construction of a σ-algebra generated
by a collection of subsets C ⊆ 2X , denoted σ(C), provides a minimal framework that includes C
and satisfies all σ-algebra properties, enabling the systematic extension of measurability to more
complex scenarios. For instance, starting with intervals in R, one can build the Borel σ-algebra, a
critical tool in modern analysis.
The structure of a measurable space allows the definition of a measure µ, a function µ : Σ → [0, ∞] that assigns a non-negative value to each set in Σ, adhering to two key axioms: µ(∅) = 0 and countable additivity, which states that for any disjoint collection {A_n}_{n=1}^∞ ⊆ Σ, the measure of their union satisfies µ(⋃_{n=1}^∞ A_n) = Σ_{n=1}^∞ µ(A_n). This property is indispensable in extending intuitive notions of length, area, and volume to arbitrary measurable sets, paving the way for the Lebesgue integral. A function f : X → R is then termed Σ-measurable if for every Borel set B ∈ B(R), the preimage f^{−1}(B) belongs to Σ. This definition ensures that the function is compatible with the σ-algebra, a necessity for defining integrals and expectation in probability theory.
In summary, measurable spaces represent a highly versatile and mathematically rigorous frame-
work, underpinning vast areas of analysis and probability. Their theoretical depth lies in their
ability to systematically handle infinite operations while maintaining closure, consistency, and flex-
ibility for defining measures, measurable functions, and integrals. As such, the rigorous study of
measurable spaces is indispensable for advancing modern mathematical theory, providing a bridge
between abstract set theory and applications in real-world phenomena.
1.1.2 Risk as a Functional
Let (X, Σ_X, µ_X) and (Y, Σ_Y, µ_Y) be measurable spaces. The true risk functional is defined as:
R(f) = ∫_{X×Y} ℓ(f(x), y) dP(x, y), (1.2)
where:
• f belongs to a hypothesis space F ⊆ Lp (X , µX ).
• P(x, y) is a Borel probability measure over X × Y, satisfying ∫_{X×Y} 1 dP = 1.
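As a concrete illustration of (1.2), the following minimal Python sketch (the quadratic loss and the toy data-generating distribution are illustrative assumptions) replaces the integral over P with a Monte Carlo average over samples:

import numpy as np

rng = np.random.default_rng(0)

def squared_loss(y_hat, y):
    # ell(f(x), y) = (f(x) - y)^2
    return (y_hat - y) ** 2

def empirical_risk(f, n=100_000):
    # Monte Carlo estimate of R(f) = E_{(x,y)~P}[ell(f(x), y)]
    # for a toy distribution P: x ~ N(0,1), y = 2x + 0.1*noise.
    x = rng.normal(size=n)
    y = 2.0 * x + 0.1 * rng.normal(size=n)
    return squared_loss(f(x), y).mean()

print(empirical_risk(lambda x: 2.0 * x))  # near the noise floor 0.01
print(empirical_risk(lambda x: 1.5 * x))  # approx 0.26: misspecified f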
In the field of risk management and decision theory, the concept of a risk functional is a mathematical formalization that captures how risk is quantified for a given outcome or state. A risk functional, denoted as R, acts as a map that
takes elements from a given space X (which represents the possible outcomes or states) and returns
a real-valued risk measure. This risk measure, R(x), expresses the degree of risk or the adverse
outcome associated with a particular element x ∈ X. The space X may vary depending on the
context—this could be a space of random variables, trajectories, or more complex function spaces.
The risk functional maps x to R, i.e., R : X → R, where each R(x) reflects the risk associated with
the specific realization x. One of the most foundational forms of risk functionals is based on the
expectation of a loss function L(x), where x ∈ X represents a random variable or state, and L(x)
quantifies the loss associated with that state. The risk functional can be expressed as an expected
loss, written mathematically as:
R(x) = E[L(x)] = ∫_X L(x) p(x) dx (1.4)
where p(x) is the probability density function of the outcome x, and the integration is taken
over the entire space X. In this setup, L(x) can be any function that measures the severity or
unfavorable nature of the outcome x. In a financial context, L(x) could represent the loss function
for a portfolio, and p(x) would be the probability density function of the portfolio’s returns. In
many cases, a specific form of L(x) is used, such as L(x) = (x − µ)2 , where µ is the target or
expected value. This choice results in the risk functional representing the variance of the outcome
x, expressed as:
R(x) = ∫_X (x − µ)² p(x) dx (1.5)
This formulation captures the variability or dispersion of outcomes around a mean value, a common
risk measure in applications like portfolio optimization or risk management. Additionally, another
widely used class of risk functionals arises from quantile-based risk measures, such as Value-
at-Risk (VaR), which focuses on the extreme tail behavior of the loss distribution. The VaR at
a confidence level α ∈ [0, 1] is defined as the smallest value l such that the probability of L(x)
exceeding l is no greater than 1 − α, i.e.,
P (L(x) ≤ l) ≥ α (1.6)
This defines a threshold beyond which the worst outcomes are expected to occur with probability
1 − α. Value-at-Risk provides a measure of the worst-case loss under normal circumstances, but it
does not provide information about the severity of losses exceeding this threshold. To address this
limitation, the Conditional Value-at-Risk (CVaR) is introduced, which measures the expected
loss given that the loss exceeds the VaR threshold. Mathematically, CVaR at the level α is given
by:
CVaRα (x) = E[L(x) | L(x) ≥ VaRα (x)] (1.7)
This conditional expectation provides a more detailed assessment of the potential extreme losses
beyond the VaR threshold. The CVaR is a more comprehensive measure, capturing the tail risk
and providing valuable information about the magnitude of extreme adverse events. In cases where
the space X represents trajectories or paths, such as in the context of continuous-time processes
or dynamical systems, the risk functional is often formulated in terms of integrals over time. For
example, consider x(t) as a trajectory in the function space C([0, T ], Rn ), the space of continuous
functions on the interval [0, T ]. The risk functional in this case might quantify the total deviation
of the trajectory from a reference or target trajectory over time. A typical example could be the
total squared deviation, written as:
R(x) = ∫_0^T ∥x(t) − x̄(t)∥² dt (1.8)
where x̄(t) represents a reference trajectory and ∥ · ∥ is a norm, such as the Euclidean norm. This
risk functional quantifies the total deviation (or energy) of the trajectory from the target path
over the entire time interval, and is used in various applications such as control theory and optimal trajectory planning. A common choice for the norm might be ∥x(t)∥² = Σ_{i=1}^n x_i(t)², where x_i(t) are the components of the trajectory x(t) in R^n. In some cases, the space X of possible
outcomes may not be a finite-dimensional vector space, but instead a Banach space or a Hilbert
space, particularly when x represents a more complex object such as a function or a trajectory.
For example, the space C([0, T ], Rn ) is a Banach space, and the risk functional may involve the
evaluation of integrals over this function space. In such settings, the risk functional can take the
form:
R(x) = ∫_0^T ∥x(t)∥_p^p dt (1.9)
where ∥ · ∥p is the p-norm, and p ≥ 1. For p = 2, this risk functional represents the total energy
of the trajectory, but other norms can be used to emphasize different types of risks. For instance,
the L∞ -norm would focus on the maximum deviation of the trajectory from the target path. The
concept of convexity plays a significant role in the theory of risk functionals. Convexity ensures
that the risk associated with a convex combination of two states x1 and x2 is less than or equal to
the weighted average of the risks of the individual states. Mathematically, for λ ∈ [0, 1], convexity
demands that:
R(λx1 + (1 − λ)x2 ) ≤ λR(x1 ) + (1 − λ)R(x2 ) (1.10)
This property reflects the diversification effect in risk management, where mixing several states or
outcomes generally leads to a reduction in overall risk. Convex risk functionals are particularly
important in portfolio theory, where they allow for risk minimization through diversification. For
example, if R(x) represents the variance of a portfolio’s returns, then the convexity property ensures
that combining different assets will result in a portfolio with lower overall risk than the risk of any
individual asset. Monotonicity is another important property for risk functionals, ensuring that
the risk increases as the outcome becomes more adverse. If x1 is worse than x2 according to some
partial order, we have:
R(x1 ) ≥ R(x2 ) (1.11)
Monotonicity ensures that the risk functional behaves in a way that aligns with intuitive notions
of risk: worse outcomes are associated with higher risk. In financial contexts, this is reflected in
the fact that losses increase the associated risk measure. Finally, in some applications, the risk
functional is derived from perturbation analysis to study how small changes in parameters affect
the overall risk. Consider x(ϵ) as a perturbed trajectory, where ϵ is a small parameter, and the
Fréchet derivative of the risk functional with respect to ϵ is given by:
(d/dϵ) R(x(ϵ)) |_{ϵ=0} (1.12)
This derivative quantifies the sensitivity of the risk to perturbations in the system and is crucial
in the analysis of stability and robustness. Such analyses are essential in areas like stochastic
control and optimization, where it is important to understand how small changes in the model’s
parameters can influence the risk profile.
Thus, the risk functional is a powerful tool for quantifying and managing uncertainty, and its
formulation can be adapted to various settings, from random variables and stochastic processes to
continuous trajectories and dynamic systems. The risk functional provides a rigorous mathemat-
ical framework for assessing and minimizing risk in complex systems, and its flexibility makes it
applicable across a wide range of domains.
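To make the preceding functionals concrete, the sketch below (a toy simulation; the lognormal loss model is an assumption for illustration) computes the expected loss (1.4), VaR (1.6), and CVaR (1.7) from samples:

import numpy as np

rng = np.random.default_rng(0)
losses = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # toy samples of L(x)

alpha = 0.95
expected_loss = losses.mean()            # R(x) = E[L(x)], Eq. (1.4)
var = np.quantile(losses, alpha)         # smallest l with P(L <= l) >= alpha
cvar = losses[losses >= var].mean()      # E[L | L >= VaR_alpha], Eq. (1.7)

print(f"E[L] = {expected_loss:.3f}, VaR_95 = {var:.3f}, CVaR_95 = {cvar:.3f}")
# CVaR >= VaR: the tail average always exceeds the tail threshold.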
1.2 Approximation Spaces for Neural Networks
1.2.1 VC-dimension theory for discrete hypotheses
Literature Review: There are several articles that explore the VC-dimension theory for dis-
crete hypotheses very rigorously. N. Bousquet and S. Thomassé (2015) [18] explored in their paper
the VC-dimension in the context of graph theory, connecting it to structural properties such as the
Erdős-Pósa property. Yıldız and Alpaydin (2009) [19] in their article computed the VC-dimension
for decision tree hypothesis spaces, considering both discrete and continuous features. Zhang et al. (2012) [20] introduced a discretized VC-dimension to bridge real-valued and discrete hypothesis
spaces, offering new theoretical tools for complexity analysis. Riondato and Zdonik (2011) [21]
adapted VC-dimension theory to database systems, analyzing SQL query selectivity using a theo-
retical lens. Riggle and Sonderegger (2010) [22] investigated the VC-dimension in linguistic models,
focusing on grammar hypothesis spaces. Anderson (2023) [23] provided a comprehensive review
of VC-dimension in fuzzy systems, particularly in logic frameworks involving discrete structures.
Fox et al. (2021) [24] proved key conjectures for systems with bounded VC-dimension, offering
insights into combinatorial implications. Johnson (2021) [25] discusses binary representations and
VC-dimensions, with implications for discrete hypothesis modeling. Janzing (2018) [26] in his paper
focuses on hypothesis classes with low VC-dimension in causal inference frameworks. Hüllermeier
and Tehrani (2012) [27] in their paper explored the theoretical VC-dimension of Choquet integrals,
applied to discrete machine learning models. The book titled “Foundations of Machine Learning”
[28] by Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar offers a very good foundational
discussion on VC-dimension in the context of statistical learning. Another book titled “Learning
Theory: An Approximation Theory Viewpoint” by Felipe Cucker and Ding-Xuan Zhou [29] dis-
cusses the role of VC-dimension in approximation theory. Yet another book titled “Understanding
Machine Learning: From Theory to Algorithms” by Shai Shalev-Shwartz and Shai Ben-David [30]
contains detailed chapters on hypothesis spaces and VC-dimension.
For discrete hypotheses, the VC-dimension theory applies to a class of hypotheses that map a
set of input points to binary output labels (typically 0 or 1). The VC-dimension for a hypothesis
class refers to the largest set of data points that can be shattered by that class, where ”shattering”
means that the hypothesis class can realize all possible labelings of these points.
We shall now discuss the Formal Mathematical Framework. Let X be a finite or infinite
set called the instance space, which represents the input space. Consider a hypothesis class H,
where each hypothesis h ∈ H is a function h : X → {0, 1}. The function h classifies each element
of X into one of two classes: 0 or 1. Given a subset S = {x1 , x2 , . . . , xk } ⊆ X, we say that H
shatters S if for every possible labeling ⃗y = (y1 , y2 , . . . , yk ) ∈ {0, 1}k , there exists some h ∈ H
such that for all i ∈ {1, 2, . . . , k}:
h(xi ) = yi (1.15)
In other words, a hypothesis class H shatters S if it can produce every possible binary labeling
on the set S. The VC-dimension VC(H) is defined as the size of the largest set S that can be
shattered by H:
VC(H) = sup{k | ∃S ⊆ X, |S| = k, S is shattered by H} (1.16)
If no set of points can be shattered, then the VC-dimension is 0. Two standard properties of the VC-dimension are worth recording: it is monotone, in that H_1 ⊆ H_2 implies VC(H_1) ≤ VC(H_2), and for any finite class it obeys VC(H) ≤ log_2 |H|, a bound used repeatedly below. A brute-force shattering check is sketched next.
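The check below follows directly from the definition of shattering: enumerate all 2^k labelings of a point set and ask whether some hypothesis realizes each one. The threshold class and helper names are illustrative assumptions; for 1-D thresholds h_t(x) = 1[x ≥ t], the VC-dimension is 1.

import itertools
import numpy as np

def shatters(points, hypotheses):
    # True iff every binary labeling of `points` is realized by some h.
    k = len(points)
    realized = {tuple(int(h(x)) for x in points) for h in hypotheses}
    return all(lab in realized for lab in itertools.product((0, 1), repeat=k))

# 1-D threshold classifiers h_t(x) = 1[x >= t].
H = [lambda x, t=t: x >= t for t in np.linspace(-2, 2, 401)]

print(shatters([0.0], H))       # True: any single point is shattered
print(shatters([0.0, 1.0], H))  # False: the labeling (1, 0) is unrealizable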
We shall now discuss the VC-Dimension and Generalization Bounds (VC Theorem). The VC generalization theorem (proved via Hoeffding-type concentration together with symmetrization) provides a probabilistic guarantee on the relationship between the training error and the true error.
Specifically, it gives an upper bound on the probability that the generalization error exceeds the
empirical error (training error) by more than ϵ.
This result is derived from combinatorial arguments involving the number of hyperplane arrange-
ments that a neural network can realize in a given input space. If the activation function is nonlinear
(such as sigmoid or tanh), the capacity can increase but remains constrained by the number of pa-
rameters. The intuition behind this bound lies in the observation that each neuron introduces a
decision boundary in the input space, and the number of such boundaries grows approximately
as W log W rather than W due to dependencies among neurons. When considering discrete hy-
potheses, the hypothesis space H is restricted to a finite number of functions. This occurs in
scenarios where the weights and biases are quantized to a finite set of values, or where the ac-
tivation functions themselves take on a finite number of distinct outputs. If there are K possible
distinct values for each parameter and there are W parameters, then the total number of possible
functions that the network can realize is at most
|H| ≤ K^W. (1.19)
From the fundamental relationship between the VC-dimension and the number of hypotheses in a finite hypothesis space, we obtain the bound VC(H) ≤ log_2 |H|. (1.20) Applying this result to a neural network with quantized weights and biases, we obtain VC(H) ≤ log_2(K^W) = W log_2 K, (1.21) which ensures that the number of realizable functions is significantly smaller than in the continuous
case, thereby reducing overfitting potential. The interplay between VC-dimension and struc-
tural risk minimization (SRM) further highlights the importance of capacity control in neural networks. For discrete networks, the practical implication of this bound is that models with limited pre-
cision weights or constrained architectures tend to generalize better than overparameterized
networks with unnecessarily high VC-dimension. From an optimization standpoint, discrete neu-
ral networks have a smaller number of local minima in the loss landscape compared to their
continuous counterparts. This is because the number of unique parameter configurations is finite,
leading to a more structured and potentially more tractable optimization problem. However,
the trade-off is that discrete optimization is often NP-hard, requiring specialized techniques such
as simulated annealing, evolutionary algorithms, or integer programming methods. Thus, VC-
dimension theory provides profound insights into the expressive power, generalization ability,
and optimization complexity of neural networks. Discrete neural networks exhibit a reduced
VC-dimension compared to their continuous counterparts, leading to potentially better gen-
eralization, provided that the model retains sufficient expressivity for the given problem. The
trade-off between expressivity and generalization is fundamental to designing efficient neu-
ral network architectures that perform well in practice.
Let D be the distribution from which the training data is drawn, and let err̂(h) and err(h) represent the empirical error and true error of a hypothesis h ∈ H, respectively:
err̂(h) = (1/n) Σ_{i=1}^n 1{h(x_i) ≠ y_i}, (1.27)
err(h) = P_{(x,y)∼D}[h(x) ≠ y]. (1.28)
The VC theorem then guarantees that, with probability at least 1 − δ over the draw of the sample, for all h ∈ H,
|err̂(h) − err(h)| ≤ ϵ. (1.29)
This result shows that the generalization error (the difference between the true and empirical error)
is small with high probability, provided the sample size n is large enough and the VC-dimension
d is not too large. The sample complexity n required to guarantee that the generalization error is
within ϵ with high probability 1 − δ is given by:
n ≥ (C/ϵ²) ( d log(1/ϵ) + log(1/δ) ) (1.31)
where C is a constant depending on the distribution. This bound emphasizes the importance of VC-dimension in controlling the complexity of the hypothesis class. A larger VC-dimension requires a larger sample size to avoid overfitting and ensure reliable generalization. A small numeric sketch of the bound follows, after which some detailed examples are given.
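Since the constant C in (1.31) is not pinned down by the theory, the value used in this minimal sketch is an arbitrary placeholder.

import math

def vc_sample_complexity(d, eps, delta, C=8.0):
    # n >= (C / eps^2) * (d * log(1/eps) + log(1/delta)); C is a
    # distribution-dependent constant, set to 8.0 purely for illustration.
    return math.ceil(C / eps**2 * (d * math.log(1 / eps) + math.log(1 / delta)))

# A class of VC-dimension 3 (e.g., linear classifiers in R^2),
# accuracy eps = 0.05 with confidence 1 - delta = 0.95:
print(vc_sample_complexity(d=3, eps=0.05, delta=0.05))  # ~38,000 samples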
1. Example 1: Linear Classifiers in R²: Consider the hypothesis class of linear classifiers of the form h(x) = 1[w^T x + b ≥ 0], (1.32) where w ∈ R² is the weight vector and b ∈ R is the bias term. The VC-dimension of
linear classifiers in R2 is 3. This can be rigorously shown by noting that for any set of 3
points in R2 , the hypothesis class H can shatter these points. In fact, any possible binary
labeling of the 3 points can be achieved by some linear classifier. However, for 4 points in
R2 , it is impossible to shatter all possible binary labelings (e.g., the four vertices of a convex
quadrilateral), meaning the VC-dimension is 3.
2. Example 2: Polynomial Classifiers of Degree d: Consider a polynomial hypothesis class
in Rn of degree d. The hypothesis class H consists of polynomials of the form:
h(x) = Σ_{i_1+···+i_n ≤ d} α_{i_1,i_2,...,i_n} x_1^{i_1} x_2^{i_2} ··· x_n^{i_n} (1.33)
where the α_{i_1,i_2,...,i_n} are coefficients and x = (x_1, x_2, ..., x_n). The VC-dimension of polynomial classifiers of degree d in R^n grows as O(n^d), implying that the complexity of the
hypothesis class increases rapidly with both the degree d and the dimension n of the input
space.
Neural networks, depending on their architecture, can have very high VC-dimensions. In particu-
lar, the VC-dimension of a neural network with L layers, each containing N neurons, is typically
O(N^L), indicating that the VC-dimension grows exponentially with both the number of neurons
and the number of layers. This result provides insight into the complexity of neural networks and
their capacity to overfit data when the training sample size is insufficient.
1.2.2 Rademacher complexity for continuous spaces
Related results in empirical process theory also show how to control the generalization error of time series models wherein past values of the outcome are used to predict future values.
Given a sample S = (x_1, ..., x_n) ⊆ X and i.i.d. Rademacher signs σ = (σ_1, ..., σ_n) with P(σ_i = ±1) = 1/2, the empirical Rademacher complexity of a function class F is
R̂_S(F) = E_σ[ sup_{f∈F} (1/n) Σ_{i=1}^n σ_i f(x_i) ],
where E_σ denotes expectation over σ and ∥f∥_∞ = ess sup_{x∈X} |f(x)| denotes the essential supremum. For rigor, F is assumed measurable in the sense that for every ϵ > 0, there exists a countable subset F_ϵ ⊆ F such that sup_{f∈F} inf_{g∈F_ϵ} ∥f − g∥_∞ ≤ ϵ. The supremum can be interpreted as a functional dual norm in L^∞(X, R), where F is the unit ball. Using the symmetrization technique, the Rademacher complexity controls the deviation of the empirical mean P_n[f] = (1/n) Σ_{i=1}^n f(x_i) from the population mean D[f] = E[f(x)]:
E_S[ sup_{f∈F} (P_n[f] − D[f]) ] ≤ 2 R_n(F), (1.40)
where
R_n(F) = E_S[ R̂_S(F) ]. (1.41)
This is derived by first symmetrizing the sample and then invoking Jensen’s inequality and the
independence of σ. There are some Complexity Bounds that use Covering Numbers and Entropy
that need to be discussed. In metric entropy, we let ∥·∥_∞ be the metric on F. The covering number N(ϵ, F, ∥·∥_∞) is the minimal number of ϵ-balls (in ∥·∥_∞) needed to cover F, and it is finite for every ϵ > 0 whenever F is totally bounded. Regarding Dudley's entropy integral, for a bounded function class F (compact under ∥·∥_∞):
R_n(F) ≤ inf_{α>0} { 4α + (12/√n) ∫_α^∞ √( log N(ϵ, F, ∥·∥_∞) ) dϵ }. (1.43)
Such chaining bounds reinforce the link between R_n(F) and generalization. There are some applications in continuous function classes. One example is the RKHS with Gaussian kernel. For F the unit ball of an RKHS with kernel k(x, x′), the covering number satisfies:
log N(ϵ, F, ∥·∥_∞) ∼ O(1/ϵ²), (1.45)
yielding:
R_n(F) ∼ O(1/√n). (1.46)
For F ⊆ H^s(R^d), the covering number depends on the smoothness s and dimension d:
R_n(F) ∼ O(n^{−s/d}). (1.47)
Rademacher complexity is deeply embedded in modern empirical process theory. Its intricate
relationship with measure-theoretic tools, symmetrization, and concentration inequalities provides
a robust theoretical foundation for understanding generalization in high-dimensional spaces.
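The empirical quantity R̂_S(F) can be estimated directly by Monte Carlo over the Rademacher signs. The sketch below does this for a small finite class represented by its matrix of values f(x_i); the class itself is a toy assumption.

import numpy as np

rng = np.random.default_rng(0)

def empirical_rademacher(F_values, n_draws=2000):
    # Estimate R_hat_S(F) = E_sigma[ sup_{f in F} (1/n) sum_i sigma_i f(x_i) ].
    # F_values has shape (|F|, n): row j holds f_j evaluated on the sample S.
    m, n = F_values.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)  # Rademacher signs
        total += np.max(F_values @ sigma) / n    # sup over the class
    return total / n_draws

# Toy class: 50 random functions bounded by 1, evaluated on n = 100 points.
F_vals = rng.uniform(-1, 1, size=(50, 100))
print(empirical_rademacher(F_vals))  # decays roughly like O(1/sqrt(n))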
1.2.3 Sobolev Embeddings
For neural network hypotheses f_θ ∈ W^{k,p}(X) with X ⊆ R^d, Sobolev embeddings give conditions for classical smoothness:
W^{k,p}(X) ↪ C^m(X), (1.48)
if k − d/p > m, ensuring f_θ ∈ C^∞(X) for smooth activations σ. For a function u ∈ L^p(Ω), its weak derivative D^α u satisfies:
∫_Ω u(x) D^α ϕ(x) dx = (−1)^{|α|} ∫_Ω v(x) ϕ(x) dx ∀ϕ ∈ C_c^∞(Ω), (1.49)
where v ∈ L^p(Ω) is the weak derivative. This definition extends the classical notion of differentiation to functions that may not be pointwise differentiable. The Sobolev norm encapsulates both function values and their derivatives:
∥u∥_{W^{k,p}(Ω)} = ( Σ_{|α|≤k} ∥D^α u∥_{L^p(Ω)}^p )^{1/p}. (1.50)
Key properties:
• Semi-norm Dominance: The W k,p -norm is controlled by the seminorm |u|W k,p , ensuring
sensitivity to high-order derivatives.
• Poincaré Inequality: For Ω bounded, u − uΩ satisfies:
∥u − uΩ ∥Lp ≤ C∥Du∥Lp . (1.51)
Sobolev spaces W k,p (Ω) embed into Lq (Ω) or C m (Ω), depending on k, p, q, and n. These embeddings
govern the smoothness and integrability of u and its derivatives. There are several Advanced
Theorems on Sobolev Embeddings. They are as follows:
1. Sobolev Embedding Theorem: Let Ω ⊂ R^n be a bounded domain with Lipschitz boundary. Then:
• If k > n/p, W^{k,p}(Ω) ↪ C^{m,α}(Ω) with m = ⌊k − n/p⌋ and α = k − n/p − m.
• If k = n/p, W^{k,p}(Ω) ↪ L^q(Ω) for q < ∞.
• If k < n/p, W^{k,p}(Ω) ↪ L^q(Ω) where 1/q = 1/p − k/n.
For k > n/p, Fourier decay implies uniform bounds, ensuring u ∈ C m,α . Interpolation spaces
bridge Lp and W k,p , providing finer embeddings. Duality: Sobolev embeddings are equivalent to
boundedness of adjoint operators in Lq . For −∆u = f , u ∈ W 2,p (Ω) ensures u ∈ C 0,α (Ω) if p > n/2.
Sobolev spaces govern variational problems in geometry, e.g., minimal surfaces and harmonic maps.
On Ω with fractal boundaries, trace theorems refine Sobolev embeddings.
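Since the case analysis of the Sobolev Embedding Theorem reduces to arithmetic on (k, p, n), the small sketch below (function name hypothetical) encodes the three regimes:

import math

def sobolev_embedding(k, p, n):
    # Classify W^{k,p}(Omega) per the embedding theorem above
    # (bounded Lipschitz domain assumed).
    if k > n / p:
        m = math.floor(k - n / p)
        a = k - n / p - m
        return f"C^({m},{a:.3g})(Omega)"            # Holder regime
    if k == n / p:
        return "L^q(Omega) for every q < infinity"  # borderline case
    q = 1.0 / (1.0 / p - k / n)                     # 1/q = 1/p - k/n
    return f"L^q(Omega) with q = {q:.3g}"

print(sobolev_embedding(1, 2, 3))  # W^{1,2} in R^3 -> L^q with q = p* = 6
print(sobolev_embedding(2, 2, 3))  # k > n/p -> Holder space C^(0,0.5)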
1.2.4 Rellich-Kondrachov Compactness Theorem
Literature Review: Lassoued (2026) [51] examined function spaces on the torus and their lack
of compactness, highlighting cases where the classical Rellich-Kondrachov result fails. He extended
compact embedding results to function spaces with periodic structures. He also discussed trace
theorems and regular function spaces in this new context. Chen et al. (2024) [52] extended the
Rellich-Kondrachov theorem to Hörmander vector fields, a class of differential operators that appear
in hypoelliptic PDEs. They established a degenerate compact embedding theorem, generalizing pre-
vious results in the field. They also provided applications to geometric inequalities, highlighting
the role of compact embeddings in PDE theory. Adams and Fournier (2003) [53] in their book
provided a complete proof of the Rellich-Kondrachov theorem, along with a discussion of compact
embeddings. They also covered function space theory, embedding theorems, and applications in
PDEs. Brezis (2010) [54] wrote a highly recommended resource for understanding Sobolev spaces
and their compactness properties. The book included applications to variational methods and weak
solutions of PDEs. Evans (2022) [55] in his classic PDE textbook includes a discussion of compact
Sobolev embeddings, their implications for weak convergence, and applications in variational meth-
ods. Maz’ya (2011) [56] provided a detailed treatment of Sobolev space theory, including compact
embedding theorems in various settings.
To rigorously state the theorem, we consider a bounded open domain Ω ⊂ R^n with a Lipschitz boundary. For 1 ≤ p < n, the theorem asserts that the embedding W^{1,p}(Ω) ↪ L^q(Ω) is compact for every 1 ≤ q < p* = np/(n − p). In particular, if {u_k} is a sequence with
∥u_k∥_{W^{1,p}(Ω)} = ∥u_k∥_{L^p(Ω)} + ∥∇u_k∥_{L^p(Ω)} ≤ C, (1.56)
then there exists a subsequence {u_{k_j}} and a function u ∈ L^q(Ω) such that u_{k_j} ⇀ u weakly in W^{1,p}(Ω).
However, weak convergence alone does not imply compactness. To obtain strong convergence
in Lq (Ω), we need additional arguments. This is accomplished using the Fréchet-Kolmogorov
compactness criterion, which states that a bounded subset of Lq (Ω) is compact if and only if it
is tight and uniformly equicontinuous. More formally, compactness follows if
1. The translates converge uniformly over the family, i.e., sup_k ∫_Ω |u_k(x + h) − u_k(x)|^q dx → 0 as |h| → 0, and
2. The sequence u_k(x) does not escape to infinity in a way that prevents strong convergence.
To quantify this, we invoke the Sobolev-Poincaré inequality, which states that for p < n, there
exists a constant C such that
∥u − u_Ω∥_{L^q(Ω)} ≤ C ∥∇u∥_{L^p(Ω)}, u_Ω = (1/|Ω|) ∫_Ω u(x) dx. (1.61)
Thus,
∥uk − u∥Lq (Ω) → 0, (1.64)
which establishes the strong convergence in Lq (Ω), completing the proof. The key insight is
that compactness arises because the gradients of uk provide control over the oscillations of uk ,
ensuring that the sequence cannot oscillate indefinitely without converging in norm. The crucial
role of Sobolev embeddings is to guarantee that even though W 1,p (Ω) does not embed compactly
into itself, it does embed compactly into L^q(Ω) for q < np/(n − p). This embedding ensures that weak convergence in W^{1,p}(Ω) implies strong convergence in L^q(Ω), proving the theorem.
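A minimal numerical illustration of this mechanism, assuming the standard oscillating family u_k(x) = sin(2πkx) on [0, 1]: the sequence is bounded in L² but its gradient norms blow up, so it is not bounded in W^{1,2}; conversely, a W^{1,2}-bounded sequence cannot oscillate this way.

import numpy as np

x = np.linspace(0.0, 1.0, 20001)

def l2_norm(f):
    # Discrete L^2(0,1) norm via the trapezoidal rule.
    return np.sqrt(np.trapz(f**2, x))

for k in (1, 10, 100):
    u = np.sin(2 * np.pi * k * x)
    du = 2 * np.pi * k * np.cos(2 * np.pi * k * x)
    print(k, l2_norm(u), l2_norm(du))
# ||u_k||_{L^2} stays near 0.707 while ||u_k'||_{L^2} grows like 4.44*k,
# so the oscillations are paid for in the gradient term of (1.56).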
1.2.5 Fréchet-Kolmogorov Compactness Criterion
To prove necessity, suppose that F is relatively compact in L^p. Then every sequence {f_n} ⊂ F has
a strongly convergent subsequence. This implies that the norms of the functions in F are uniformly
bounded, since otherwise there would exist a sequence {fn } such that ∥fn ∥Lp → ∞, contradicting
compactness. Hence, there exists a constant M > 0 such that for all f ∈ F,
∥f∥_{L^p} = ( ∫_{R^n} |f(x)|^p dx )^{1/p} ≤ M. (1.65)
Next, to establish uniform integrability, assume for contradiction that F is not tight. Then there
exists ε > 0 such that for every compact set K ⊂ Rn , there exists some function fK ∈ F satisfying
∫_{R^n \ K} |f_K(x)|^p dx ≥ ε. (1.66)
This contradicts relative compactness, as it implies the existence of a sequence {fn } with mass
escaping to infinity, preventing any strong convergence in Lp . Hence, for every ε > 0, there exists
a compact K ⊂ R^n such that
sup_{f∈F} ∫_{R^n \ K} |f(x)|^p dx < ε. (1.67)
To establish the necessity of equicontinuity, suppose for contradiction that it does not hold. Then
there exists ε > 0 such that for every δ > 0, there is some function fδ ∈ F and some shift h with
|h| < δ such that
∫_{R^n} |f_δ(x + h) − f_δ(x)|^p dx ≥ ε. (1.68)
This contradicts compactness, as it implies the existence of arbitrarily oscillatory sequences pre-
venting strong convergence in Lp . Hence, for every ε > 0, there exists δ > 0 such that for all
|h| < δ,
sup_{f∈F} ∫_{R^n} |f(x + h) − f(x)|^p dx < ε. (1.69)
To prove sufficiency, assume that F satisfies boundedness, uniform integrability, and equicontinuity.
Consider a sequence {fn } ⊂ F. The first condition guarantees that the sequence is uniformly
bounded in Lp , ensuring weak compactness by Banach-Alaoglu. The second condition ensures that
the functions do not escape to infinity, implying tightness. The third condition ensures that the
sequence is equicontinuous, preventing high-frequency oscillations. By the diagonal argument, we
can extract a subsequence {fnk } that is a Cauchy sequence in Lp , ensuring strong convergence. To
rigorously show that a subsequence converges, define the modulus of continuity functional,
ω_F(δ) = sup_{f∈F} sup_{|h|≤δ} ∫_{R^n} |f(x + h) − f(x)|^p dx, (1.70)
which satisfies ω_F(δ) → 0 as δ → 0 due to equicontinuity. Given ε > 0, choose δ such that ω_F(δ) < ε/3, and a compact set K such that
sup_{f∈F} ∫_{R^n \ K} |f(x)|^p dx < ε/3. (1.71)
By weak compactness, there exists a subsequence fnk converging weakly in Lp to some f . Ap-
plying Vitali’s theorem, we obtain strong convergence, proving compactness. Thus, the Fréchet-
Kolmogorov criterion is fully established.
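A small numerical sketch of the equicontinuity condition (1.69): for a family of smooth, rapidly decaying functions, the translation modulus sup_f ∫ |f(x + h) − f(x)|^p dx shrinks as h → 0. The family and grid are toy assumptions; the periodic shift below is harmless because the functions vanish near the boundary.

import numpy as np

x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]

def shift_modulus(f_vals, steps, p=2):
    # Discrete int |f(x + h) - f(x)|^p dx for a grid shift h = steps*dx.
    g = np.roll(f_vals, -steps)
    return np.trapz(np.abs(g - f_vals)**p, dx=dx)

# Gaussians of bounded width: a bounded, tight, equicontinuous family.
F = [np.exp(-(x / s)**2) for s in (0.5, 1.0, 2.0)]
for steps in (16, 4, 1):
    h = steps * dx
    print(h, max(shift_modulus(f, steps) for f in F))
# The supremum over the family decreases to 0 with h, as (1.69) requires.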
1.2.6 Sobolev-Poincaré Inequality
Let Ω ⊂ R^n be a bounded domain with a Lipschitz boundary, and consider the Sobolev space W^{1,p}(Ω) for 1 ≤ p < n, consisting of functions u ∈ L^p(Ω) whose weak derivatives ∇u also belong to L^p(Ω). The Sobolev-Poincaré inequality asserts the existence of a constant C = C(Ω, p) > 0 such that
∥u − u_Ω∥_{L^{p*}(Ω)} ≤ C ∥∇u∥_{L^p(Ω)} (1.72)
for all u ∈ W^{1,p}(Ω), where p* = np/(n − p) is the Sobolev conjugate exponent and u_Ω is the mean value of u over Ω, given by
u_Ω = (1/|Ω|) ∫_Ω u(x) dx. (1.73)
To prove this inequality, we first use a representation formula for u(x). Given any two points x, y ∈ Ω, we consider the fundamental theorem of calculus applied along the segment connecting x to y. Define the parametrization γ(t) = x + t(y − x) for t ∈ [0, 1], so that
u(y) − u(x) = ∫_0^1 ∇u(γ(t)) · (y − x) dt.
Taking the absolute value and applying Hölder's inequality with conjugate exponents p and p′ = p/(p − 1),
|u(y) − u(x)| ≤ |y − x| ∫_0^1 |∇u(γ(t))| dt. (1.78)
Now, integrating both sides over y ∈ Ω, using Fubini’s theorem and Minkowski’s integral inequality,
∫_Ω |u(y) − u(x)|^{p*} dy ≤ ∫_Ω ( |y − x| ∫_0^1 |∇u(γ(t))| dt )^{p*} dy. (1.79)
Using the properties of integral norms and applying Hölder’s inequality again,
( ∫_Ω |u(y) − u(x)|^{p*} dy )^{1/p*} ≤ C ∥∇u∥_{L^p(Ω)}, (1.80)
where C depends on Ω and p, completing the proof. The consequences and applications are:
1. Regularity of PDE Solutions: The Sobolev-Poincaré inequality is crucial in proving the
existence and regularity of weak solutions to elliptic PDEs.
2. Compactness and Rellich-Kondrachov Theorem: It plays a role in proving the compact
embedding of W 1,p (Ω) into Lq (Ω), which is fundamental in functional analysis.
3. Control of Function Oscillations: It quantifies how much a function can deviate from its
mean, which is used in various areas of mathematical physics and geometry.
One important extension is the sharp form of the Sobolev-Poincaré inequality, which involves ex-
plicit best constants in terms of the domain geometry. Specifically, in some cases, the optimal
constant is related to eigenvalues of the Laplace operator or geometric properties such as the diam-
eter or inradius of Ω. Another important extension is the fractional Sobolev-Poincaré inequality,
which deals with function spaces like W s,p (Ω) where s is fractional, incorporating nonlocal effects.
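A quick numerical check in the simplest setting, assuming Ω = [0, 1] with p = q = 2: for mean-zero smooth functions the sharp (Poincaré-Wirtinger) constant on the unit interval is 1/π, attained by cos(πx), tying in with the eigenvalue remark above.

import numpy as np

x = np.linspace(0.0, 1.0, 10001)

def poincare_ratio(u, du):
    # ||u - mean(u)||_{L^2(0,1)} / ||u'||_{L^2(0,1)}; bounded by 1/pi.
    u_bar = np.trapz(u, x)                     # |Omega| = 1, so mean = integral
    num = np.sqrt(np.trapz((u - u_bar)**2, x))
    den = np.sqrt(np.trapz(du**2, x))
    return num / den

for k in (1, 2, 5):
    u = np.cos(np.pi * k * x)                  # Neumann eigenfunctions
    du = -np.pi * k * np.sin(np.pi * k * x)
    print(k, poincare_ratio(u, du))            # 1/(pi*k); k = 1 attains 1/pi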
2 Universal Approximation Theorem: Refined Proof
The Universal Approximation Theorem (UAT) is a fundamental result in neural network theory,
stating that a feedforward neural network with a single hidden layer containing a finite number of
neurons can approximate any continuous function on a compact subset of Rn to any desired degree
of accuracy, provided that an appropriate activation function is used. This theorem has significant
implications in machine learning, function approximation, and deep learning architectures.
Literature Review: Hornik et al. (1989) [57] in their seminal paper rigorously proved that
multilayer feedforward neural networks with a single hidden layer and a sigmoid activation func-
tion can approximate any continuous function on a compact set. It extends prior results and lays
the foundation for the modern understanding of UAT. Cybenko (1989) [58] provided one of the first
rigorous proofs of the UAT using the sigmoid function as the activation function. They demon-
strated that a single hidden layer network can approximate any continuous function arbitrarily well.
Barron (1993) [59] extended UAT by quantifying the approximation error and analyzing the rate
of convergence. This work is crucial for understanding the practical efficiency of neural networks.
Pinkus (1999) [60] provided a comprehensive survey of UAT from the perspective of approximation
theory and also discussed conditions for approximation with different activation functions and the
theoretical limits of neural networks. Lu et al. (2017) [61] investigated how the width of neural
networks affects their approximation capability, challenging the notion that deeper networks are
always better. They also provided insights into trade-offs between depth and width. Hanin and
Sellke (2018) [62] extended UAT to ReLU activation functions, showing that deep ReLU networks
achieve universal approximation while maintaining minimal width constraints. García-Cervera et al. (2024) [63] extended the universal approximation theorem to set-valued functions and its ap-
plications to Deep Operator Networks (DeepONets), which are useful in control theory and PDE
modeling. Majee et al. (2024) [64] explored the universal approximation properties of deep neu-
ral networks for solving inverse problems using Markov Chain Monte Carlo (MCMC) techniques.
Toscano et al. (2024) [65] introduced Kurkova-Kolmogorov-Arnold Networks (KKANs), an ex-
tension of UAT incorporating Kolmogorov’s superposition theorem for improved approximation
capabilities. Son (2025) [66] established a new framework for operator learning based on the UAT,
providing a theoretical foundation for backpropagation-free deep networks.
2.1 Approximation Using Convolution Operators
Given a continuous target function f : R^n → R and a kernel ϕ : R^n → R, consider the convolution
(f ∗ ϕ)(x) = ∫_{R^n} f(y) ϕ(x − y) dy. (2.1)
The kernel ϕ(x) is typically chosen to be smooth, compactly supported, and normalized such that
∫_{R^n} ϕ(x) dx = 1. (2.2)
To approximate f locally, we introduce a scaling parameter ϵ > 0 and define the scaled kernel ϕ_ϵ(x) as
ϕ_ϵ(x) = ϵ^{−n} ϕ(x/ϵ). (2.3)
The factor ϵ^{−n} ensures that ϕ_ϵ(x) remains a probability density function, satisfying
∫_{R^n} ϕ_ϵ(x) dx = ∫_{R^n} ϕ(x) dx = 1. (2.4)
Convolving f with the scaled kernel gives
(f ∗ ϕ_ϵ)(x) = ∫_{R^n} f(y) ϕ_ϵ(x − y) dy. (2.5)
Performing the change of variables z = (x − y)/ϵ, we have y = x − ϵz and dy = ϵ^n dz. Substituting into the integral, we obtain
(f ∗ ϕ_ϵ)(x) = ∫_{R^n} f(x − ϵz) ϕ(z) dz. (2.6)
This representation shows that (f ∗ ϕϵ )(x) is a smoothed version of f (x), where the smoothing
is controlled by the parameter ϵ. As ϵ → 0, the kernel ϕϵ (x) becomes increasingly concentrated
around x, and we recover f(x) in the limit:
lim_{ϵ→0} (f ∗ ϕ_ϵ)(x) = f(x), (2.7)
assuming f is continuous. This result can be rigorously proven using properties of the kernel
ϕ, such as its smoothness and compact support, and the dominated convergence theorem, which
ensures that the integral converges uniformly to f (x). Now, let us consider the role of convolution
operators in the approximation of f by neural networks. A single-layer feedforward neural network
is expressed as
f̂(x) = Σ_{i=1}^M c_i σ(w_i^T x + b_i), (2.8)
where c_i ∈ R are coefficients, w_i ∈ R^n are weight vectors, b_i ∈ R are biases, and σ : R → R is the
activation function. The activation function σ(wiT x + bi ) can be interpreted as a localized response
function, analogous to the kernel ϕ(x − y) in convolution. By drawing an analogy between the two,
we can write the neural network approximation as
f̂(x) ≈ Σ_{i=1}^M f(x_i) ϕ_ϵ(x − x_i) Δx, (2.9)
where the x_i are sample points and Δx is the volume element of the discretization. By the triangle inequality, the total error splits as ∥f − f̂∥_∞ ≤ ∥f − f ∗ ϕ_ϵ∥_∞ + ∥f ∗ ϕ_ϵ − f̂∥_∞. (2.10)
The term ∥f − f ∗ ϕϵ ∥∞ represents the error introduced by smoothing f with the kernel ϕϵ , and it
can be made arbitrarily small by choosing ϵ sufficiently small, provided f is regular enough (e.g.,
Lipschitz continuous). The term ∥f ∗ ϕϵ − fˆ∥∞ quantifies the error due to discretization, which
vanishes as the number of neurons M → ∞. To rigorously analyze the convergence of fˆ(x) to
f (x), we rely on the density of neural network approximators in function spaces. The Universal
Approximation Theorem states that, for any continuous function f on a compact domain Ω ⊂ R^n and any ϵ > 0, there exists a neural network f̂ with finitely many neurons such that
∥f − f̂∥_∞ = sup_{x∈Ω} |f(x) − f̂(x)| < ϵ. (2.11)
This result hinges on the ability of the activation function σ to generate a rich set of basis func-
tions. For example, if σ(x) = max(0, x) (ReLU), the network approximates f (x) by piecewise linear
functions. If σ(x) = 1/(1 + e^{−x}) (sigmoid), the network generates smooth approximations that resemble
logistic regression.
In this refined proof of the UAT, convolution operators provide a unifying framework for un-
derstanding the smoothing, localization, and discretization processes that underlie neural network
approximations. The interplay between ϕϵ (x), f ∗ ϕϵ (x), and fˆ(x) reveals the profound mathemat-
ical structure that connects classical approximation theory with modern machine learning. This
connection not only enhances our theoretical understanding of neural networks but also guides the
design of architectures and algorithms for practical applications.
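The discretization (2.9) can be reproduced numerically in a few lines. The sketch below is a toy instance: a Gaussian bump stands in for the compactly supported kernel, and the error is measured away from the boundary of [0, 1].

import numpy as np

f = lambda x: np.sin(2 * np.pi * x) + 0.5 * x           # target function
phi = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # kernel phi

def kernel_approx(x, eps=0.02, M=200):
    # f_hat(x) = sum_i f(x_i) * phi_eps(x - x_i) * dx, as in Eq. (2.9),
    # with the scaled kernel phi_eps(x) = phi(x/eps)/eps.
    xi, dx = np.linspace(0.0, 1.0, M, retstep=True)
    return sum(f(c) * phi((x - c) / eps) / eps * dx for c in xi)

xs = np.linspace(0.1, 0.9, 5)                     # interior evaluation points
print(np.max(np.abs(f(xs) - kernel_approx(xs))))  # uniform error ~ O(eps^2)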
Literature Review: Among related work, [78] used the Stone-Weierstrass theorem to analyze function approximation in fuzzy logic systems and explored the applications in control systems and AI.
Let X be a compact Hausdorff space and let C(X) denote the space of continuous real-valued functions on X, equipped with the supremum norm ∥f∥_∞ = sup_{x∈X} |f(x)|. (2.12) This supremum norm is critical in defining the proximity between continuous functions, as we seek
to approximate any function f ∈ C(X) by a function g from a subalgebra A ⊂ C(X). The Stone-
Weierstrass theorem guarantees that if the subalgebra A satisfies two essential properties—(1) it
contains the constant functions, and (2) it separates points—then the closure of A in the supremum
norm will be the entire space C(X). To formalize this, we define the point separation property
as follows: for every pair of distinct points x1 , x2 ∈ X, there exists a function h ∈ A such that
h(x1 ) ̸= h(x2 ). This condition ensures that functions from A are sufficiently “rich” to distinguish
between different points in X. Mathematically, this is expressed as:
∀x_1, x_2 ∈ X with x_1 ≠ x_2, ∃h ∈ A : h(x_1) ≠ h(x_2). (2.13)
Given these two properties, the Stone-Weierstrass theorem asserts that for any continuous function
f ∈ C(X) and any ϵ > 0, there exists an element g ∈ A such that:
∥f − g∥_∞ < ϵ. (2.14)
This result ensures that any continuous function on a compact Hausdorff space can be approxi-
mated arbitrarily closely by functions from a sufficiently rich subalgebra. In the context of the
Universal Approximation Theorem (UAT), we seek to apply the Stone-Weierstrass theorem
to the approximation capabilities of neural networks. Let K ⊆ Rn be a compact subset, and let
f ∈ C(K) be a continuous function defined on this set. A feedforward neural network with a
non-linear activation function σ has the form:
f̂_θ(x) = Σ_{i=1}^N c_i σ(⟨w_i, x⟩ + b_i), (2.15)
where ⟨w_i, x⟩ represents the inner product between the weight vector w_i and the input x, and b_i represents the bias term. The activation function σ is typically non-linear (such as the sigmoid or ReLU function), and the parameters θ = {c_i, w_i, b_i}_{i=1}^N are the output coefficients, weights, and biases of the network. The function f̂_θ(x) is a weighted sum of the non-linear activations applied to the affine transformations of x.
We now explore the connection between neural networks and the Stone-Weierstrass theorem. A
critical observation is that the set of functions defined by a neural network with non-linear activation
is a subalgebra of C(K) provided the activation function σ is sufficiently rich in its non-linearity.
This non-linearity ensures that the network can separate points in K, meaning that for any two
distinct points x1 , x2 ∈ K, there exists a network function fˆθ that takes distinct values at these
points. This satisfies the point separation condition required by the Stone-Weierstrass theorem.
To formalize this, consider two distinct points x1 , x2 ∈ K. Since σ is non-linear, the function fˆθ (x)
with appropriately chosen weights and biases will satisfy:
f̂_θ(x_1) ≠ f̂_θ(x_2). (2.16)
Thus, the algebra of neural network functions satisfies the point separation property. By applying
the Stone-Weierstrass theorem, we conclude that this algebra is dense in C(K), meaning that for
any continuous function f ∈ C(K) and any ϵ > 0, there exists a neural network function fˆθ such
that:
∥f (x) − fˆθ (x)∥∞ < ϵ ∀x ∈ K (2.17)
This rigorous result shows that neural networks with a non-linear activation function can approxi-
mate any continuous function on a compact set arbitrarily closely in the supremum norm, thereby
proving the Universal Approximation Theorem. To further explore this, consider the error term:
E(θ) = ∥f − f̂_θ∥_∞ = sup_{x∈K} |f(x) − f̂_θ(x)|. (2.18)
For a given function f and a compact set K, this error term can be made arbitrarily small by
increasing the number of neurons in the hidden layer of the neural network. This increases the
capacity of the network, effectively enlarging the subalgebra of functions generated by the network,
thereby improving the approximation. As the number of neurons increases, the network’s ability to
approximate any function from C(K) becomes increasingly precise, which aligns with the conclusion
of the Stone-Weierstrass theorem that the network functions form a dense subalgebra in C(K).
Thus, the Universal Approximation Theorem, derived through the Stone-Weierstrass theorem,
rigorously proves that neural networks can approximate any continuous function on a compact set
to any desired degree of accuracy. The combination of the non-linearity of the activation function
and the architecture of the neural network guarantees that the network can generate a dense
subalgebra of continuous functions, ultimately allowing it to approximate any function from C(K).
This result not only formalizes the approximation power of neural networks but also provides a
deep theoretical foundation for understanding their capabilities as universal approximators.
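A minimal empirical companion to this argument, assuming a random-feature variant of (2.15): draw the inner weights (w_i, b_i) at random, fit only the outer coefficients c_i by least squares, and watch the training sup-norm error (2.18) fall as the width N grows.

import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(3 * x) + 0.3 * np.abs(x)   # continuous target on K = [-1, 1]

def sup_error_one_hidden_layer(N, n_train=400):
    # f_hat(x) = sum_i c_i * relu(w_i * x + b_i) with random (w_i, b_i);
    # the c_i are obtained by linear least squares on a training grid.
    x = np.linspace(-1.0, 1.0, n_train)
    W = rng.normal(0.0, 4.0, size=N)
    b = rng.uniform(-4.0, 4.0, size=N)
    Phi = np.maximum(0.0, np.outer(x, W) + b)   # ReLU feature matrix
    c, *_ = np.linalg.lstsq(Phi, f(x), rcond=None)
    return np.max(np.abs(Phi @ c - f(x)))       # sup-norm error on the grid

for N in (5, 20, 100):
    print(N, sup_error_one_hidden_layer(N))     # error shrinks with width N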
The Kolmogorov-Arnold superposition theorem states that every continuous function f : [0, 1]^n → R can be written as
f(x_1, ..., x_n) = Σ_{q=1}^{2n+1} Φ_q( Σ_{p=1}^n ψ_{pq}(x_p) ), (2.19)
where the functions ψ_{pq}(x_p) encode the univariate projections of the input variables x_p, and the
outer functions Φq aggregate these projections into the final output. This decomposition highlights
a fundamental property of multivariate continuous functions: their expressiveness can be captured
through hierarchical compositions of simpler, univariate components.
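A tiny worked instance of such a decomposition (not the canonical Kolmogorov construction, merely an illustrative superposition): the bivariate product x·y written with only univariate maps and addition, via x·y = ((x + y)² − (x − y)²)/4.

# f(x, y) = Phi_1(psi_11(x) + psi_12(y)) + Phi_2(psi_21(x) + psi_22(y))
Phi = [lambda s: s**2 / 4, lambda s: -s**2 / 4]   # outer functions Phi_q
psi = [[lambda x: x, lambda y: y],                # inner maps for q = 1
       [lambda x: x, lambda y: -y]]               # inner maps for q = 2

def f(x, y):
    return sum(Phi[q](psi[q][0](x) + psi[q][1](y)) for q in range(2))

print(f(3.0, 4.0))  # 12.0 == 3 * 4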
Literature Review: There are some Classical References on the Kolmogorov-Arnold Superpo-
sition Theorem (KST). Kolmogorov (1957) [79] in his Foundational Paper on KST established that
any continuous function of several variables can be represented as a superposition of continuous
functions of a single variable and addition. This was groundbreaking because it provided a uni-
versal function decomposition method, independent of inner-product spaces. He proved that there
exist functions ϕq and ψq such that any function f (x1 , x2 , . . . , xn ) can be expressed as:
f(x_1, ..., x_n) = Σ_{q=1}^{2n+1} ϕ_q( Σ_{p=1}^n ψ_{qp}(x_p) ), (2.20)
where the ψqp are univariate functions. Kolmogorov provided a mathematical basis for approxima-
tion theory and neural networks, influencing modern machine learning architectures. Arnold (1963)
[80] refined Kolmogorov’s theorem by proving that one can restrict the superposition to functions
of at most two variables instead of one. Arnold’s formulation led to the Kolmogorov-Arnold
representation:
f(x_1, ..., x_n) = Σ_{q=1}^{2n+1} ϕ_q( x_q + Σ_{p=1}^n ψ_{qp}(x_p) ), (2.21)
making the theorem more suitable for practical computations. Arnold strengthened the expressiv-
ity of neural networks, inspiring alternative function representations in high-dimensional settings.
Lorentz (2008) [81] in his book discusses the significance of KST in approximation theory and con-
structive mathematics. He provided error estimates for approximating multivariate functions using
Kolmogorov-type decompositions. He showed how KST fits within Bernstein approximation the-
ory. He helped frame KST in the context of function approximation, bridging it to computational
applications. Building on this theoretical foundation, Hornik et al. (1989) [57] demonstrated
that multilayer feedforward networks are universal approximators, meaning that neural networks
with a single hidden layer can approximate any continuous function. This work bridged the gap
between the Kolmogorov-Arnold theorem and practical neural network design, providing a rigor-
ous justification for the use of deep architectures. Pinkus (1999) [60] analyzed the role of KST
in multilayer perceptrons (MLPs), showing how it influences function expressibility in neural
networks. He demonstrated that feedforward neural networks can approximate arbitrary functions
using Kolmogorov superposition. He also provided bounds on network depth and width required for
universal approximation. He played a crucial role in understanding the theoretical power of deep
learning. In more recent years, Montúfar, Pascanu, Cho, and Bengio (2014) [441] explored the
expressive power of deep neural networks by analyzing the number of linear regions they can repre-
sent. Their work provided a modern perspective on the Kolmogorov-Arnold theorem, showing how
depth enhances the ability of networks to model complex functions. Schmidt-Hieber (2020) [442]
rigorously analyzed the approximation properties of deep ReLU networks, demonstrating their effi-
ciency in approximating high-dimensional functions and further connecting the Kolmogorov-Arnold
theorem to modern deep learning practices. Yarotsky (2017) [443] complemented this by providing
explicit error bounds for approximating functions using deep ReLU networks, offering insights into
how depth and activation functions influence approximation accuracy. Telgarsky (2016) [444] con-
tributed to this body of work by rigorously proving that deeper networks can represent functions
more efficiently than shallow ones, aligning with the hierarchical decomposition suggested by the
Kolmogorov-Arnold theorem. This work provided theoretical insights into why depth is crucial in
modern neural networks. Lu et al. (2017) [445] explored the expressive power of neural networks
from the perspective of width rather than depth, showing how width can also play a critical role in
function approximation. This complemented the Kolmogorov-Arnold theorem by offering a more
nuanced understanding of network design. Finally, Zhang et al. (2021) [446] provided a rigorous
analysis of how deep learning models generalize, which is closely related to their ability to approx-
imate complex functions. While not directly about the Kolmogorov-Arnold theorem, their work
contextualized these theoretical insights within the broader framework of generalization in deep
learning, offering practical implications for the design and training of neural networks.
There are several very recent contributions in the Kolmogorov-Arnold Superposition Theorem
(KST) (2024–2025). Guilhoto and Perdikaris (2024) [82] explored how KST can be reformulated
using deep learning architectures. They proposed Kolmogorov-Arnold Networks (KANs), a new
type of neural network inspired by KST. They showed that KANs outperform traditional feedfor-
ward networks in function approximation tasks, provided empirical evidence of KAN efficiency on real-world datasets, and introduced a new paradigm in machine learning, making function decomposition more interpretable. Alhafiz et al. (2025) [83] applied KST-based
networks to turbulence modeling in fluid mechanics. They demonstrated how KANs improve pre-
dictive accuracy for Navier-Stokes turbulence models. They showed a reduction in computational
complexity compared to classical turbulence models. They also developed a data-driven turbulence
modeling framework leveraging KST. They advanced machine learning applications in computa-
tional fluid dynamics (CFD). Lorencin, I. et al. (2024) [84] used KST-inspired neural networks for
predicting propulsion system parameters in ships. They implemented KANs to model hybrid ship
propulsion (Combined Diesel-Electric and Gas - CODLAG) and demonstrated a highly accurate
prediction model for propulsion efficiency. They also provided a new benchmark dataset for ship
propulsion research. They extended KST applications to naval engineering and autonomous systems.
In the context of neural networks, this result establishes the theoretical universality of function ap-
proximation. A neural network with a single hidden layer approximates a function f (x1 , x2 , . . . , xn )
by representing it as
f(x_1, x_2, \ldots, x_n) \approx \sum_{i=1}^{W} a_i\, \sigma\!\left( \sum_{j=1}^{n} w_{ij} x_j + b_i \right), \qquad (2.22)
where W is the width of the hidden layer, σ is a nonlinear activation function, wij are weights,
bi are biases, and ai are output weights. The expressive power of such shallow networks depends
critically on the width W , as the universal approximation theorem ensures that W → ∞ suffices
to approximate any continuous function arbitrarily well. However, for a fixed approximation error
ϵ > 0, the required width grows exponentially with the input dimension n, satisfying a lower bound
of
W \ge C \cdot \epsilon^{-n}, \qquad (2.23)
where C depends on the function’s Lipschitz constant. This exponential dependence, sometimes
called the “curse of dimensionality,” underscores the inefficiency of shallow architectures in captur-
ing high-dimensional dependencies.
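As a quick numerical illustration of the lower bound (2.23), the sketch below tabulates ε^{-n} for a fixed tolerance, taking C = 1 purely as an assumption of the example:

```python
# Illustration of the lower bound W >= C * eps^(-n) from Eq. (2.23),
# with C = 1 assumed purely for illustration.
eps = 0.1
for n in [1, 2, 5, 10, 20]:
    W_min = eps ** (-n)
    print(f"input dim n = {n:2d}: required width W >= {W_min:.3e}")
```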
The advantage of depth becomes apparent when we consider deep neural networks, which uti-
lize hierarchical representations. A deep network with D layers and width W per layer constructs
a function as a composition of layer-wise transformations:
h^{(k)} = \sigma\!\left( W^{(k)} h^{(k-1)} + b^{(k)} \right), \qquad k = 1, \ldots, D, \qquad (2.24)
where h^{(k)} denotes the output of the k-th layer (with h^{(0)} = x), W^{(k)} is the weight matrix, b^{(k)} is the bias vector, and σ is the nonlinear activation. The final output of the network is then given by
f(x) = W^{(D+1)} h^{(D)} + b^{(D+1)}. \qquad (2.25)
The depth D of the network allows it to approximate hierarchical compositions of functions. For example, if a target function f(x) has a compositional structure
f(x) = g_D \circ g_{D-1} \circ \cdots \circ g_1(x), \qquad (2.26)
where each g_i is a simple function, the depth D directly corresponds to the number of nested
transformations. This compositional hierarchy enables deep networks to approximate functions
efficiently, achieving a reduction in the required parameter count. The approximation error ϵ for a
deep network decreases polynomially with D, satisfying
\epsilon \le O\!\left( \frac{1}{D^2} \right), \qquad (2.27)
which is exponentially more efficient than the error scaling for shallow networks. In light of the
Kolmogorov-Arnold theorem, the decomposition
f(x_1, x_2, \ldots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \psi_{pq}(x_p) \right) \qquad (2.28)
demonstrates how deep networks align naturally with the structure of multivariate functions. The
inner functions ψpq capture local dependencies, while the outer functions Φq aggregate these into a
global representation. This layered decomposition mirrors the depth-based structure of neural net-
works, where each layer learns a specific aspect of the function’s complexity. Finally, the parameter
count in a deep network with D layers and width W per layer is given by
P \le O(D \cdot W^2), \qquad (2.29)
which compares favorably with the exponential parameter count required by a shallow network of equivalent expressivity, leading to better compression and generalization.
Literature Review: Kim et al. (2024) [220] introduced Neural Fourier
Modelling (NFM), a novel approach to representing time-series data compactly while preserving
its expressivity. It outperforms traditional models like Short-Time Fourier Transform (STFT) in
retaining long-term dependencies. Xie et al. (2024) [221] explored how Fourier basis functions
can be used to enhance the expressivity of tensor networks while maintaining computational ef-
ficiency. It establishes trade-offs between expressivity and model complexity in machine learning
architectures. Liu et al. (2024) [222] integrated spectral modulation and Fourier transforms into
implicit neural representations for text-to-image synthesis. Fourier analysis improves global coher-
ence while preserving local expressivity in generative models. Zhang (2024) [223] demonstrated how
Fourier and Lock-in spectrum techniques can represent long-term variations in mechanical signals.
The Fourier-based decomposition allows for more expressive representations of mechanical failures
and degradation. Hamed and Lachiri (2024) [224] applied Fourier transformations to speech syn-
thesis models, improving their ability to transfer expressive content from text to speech. Fourier
series allows capturing prosody, rhythm, and tone variations effectively. Lehmann et al. (2024)
[225] integrated Fourier-based deep learning models for seismic activity prediction. It explores the
expressivity of Fourier Neural Operators (FNOs) in capturing wave propagations in different geo-
logical environments.
The Fourier analysis of expressivity in neural networks seeks to rigorously quantify how neural
architectures, characterized by their depth and width, can approximate functions through the de-
composition of those functions into their Fourier spectra. Consider a square-integrable function
f : Rd → R, for which the Fourier transform is defined as
\hat{f}(\xi) = \int_{\mathbb{R}^d} f(x)\, e^{-i 2\pi \xi \cdot x}\, dx \qquad (2.31)
where ξ ∈ Rd represents the frequency. The inverse Fourier transform reconstructs the function as
f(x) = \int_{\mathbb{R}^d} \hat{f}(\xi)\, e^{i 2\pi \xi \cdot x}\, d\xi \qquad (2.32)
The magnitude |fˆ(ξ)| reflects the energy contribution of the frequency ξ to f . Neural networks
approximate f by capturing its Fourier spectrum, but the architecture fundamentally governs
how efficiently this approximation can be achieved, especially in the presence of high-frequency
components. For shallow networks with one hidden layer and a finite number of neurons, the
universal approximation theorem establishes that
f(x) \approx \sum_{i=1}^{n} a_i\, \phi(w_i \cdot x + b_i) \qquad (2.33)
where ϕ is the activation function, wi ∈ Rd are weights, bi ∈ R are biases, and ai ∈ R are
coefficients. The Fourier transform of this representation can be expressed as
\hat{f}(\xi) \approx \sum_{i=1}^{n} a_i\, \hat{\phi}(\xi)\, e^{-i 2\pi \xi \cdot b_i} \qquad (2.34)
where ϕ̂(ξ) denotes the Fourier transform of the activation function. For smooth activation func-
tions like sigmoid or tanh, ϕ̂(ξ) decays exponentially as ∥ξ∥ → ∞, limiting the network’s ability
to approximate functions with high-frequency content unless the width n is exceedingly large.
Specifically, the Fourier coefficients decay as
|\hat{f}(\xi)| \sim e^{-\beta \|\xi\|} \qquad (2.35)
where β > 0 depends on the smoothness of ϕ. This restriction implies that shallow networks
are biased toward low-frequency functions unless their width scales exponentially with the input
dimension d. Deep networks, on the other hand, leverage their hierarchical structure to overcome
these limitations. A deep network with L layers recursively composes functions, producing an
output of the form
f(x) = \phi_L\!\left( W^{(L)} \phi_{L-1}\!\left( \cdots \phi_1\!\left( W^{(1)} x + b^{(1)} \right) \cdots \right) + b^{(L)} \right), \qquad (2.36)
where \phi_l is the activation function at layer l, W^{(l)} are weight matrices, and b^{(l)} are bias vectors. The Fourier transform of this composition can be analyzed iteratively. If h^{(l)} = \phi_l(W^{(l)} h^{(l-1)} + b^{(l)}) represents the output of the l-th layer, then
\hat{h}^{(l)}(\xi) = \left( \hat{\phi}_l * \widehat{W^{(l)} h^{(l-1)} + b^{(l)}} \right)(\xi), \qquad (2.37)
where * denotes convolution and \hat{\phi}_l is the Fourier transform of the activation function. The recursive application of this convolution amplifies high-frequency components, enabling deep networks to approximate functions whose Fourier spectra exhibit polynomial decay. Specifically, the Fourier coefficients of a deep network decay as
|\hat{f}(\xi)| \sim \|\xi\|^{-\alpha}, \qquad (2.38)
where \alpha depends on the activation function. This is in stark contrast to the exponential decay
observed in shallow networks.
The activation function plays a pivotal role in shaping the Fourier spectrum of neural networks. For
example, the rectified linear unit (ReLU) ϕ(x) = max(0, x) introduces significant high-frequency
components into the network. The Fourier transform of the ReLU activation is given by
\hat{\phi}(\xi) = \frac{1}{2\pi i \xi} \qquad (2.39)
which decays more slowly than the Fourier transforms of smooth activations. Consequently, ReLU-
based networks are particularly effective at approximating functions with oscillatory behavior. To
illustrate, consider the function
f (x) = sin(2πξ · x) (2.40)
A shallow network requires an exponentially large number of neurons to approximate f when ∥ξ∥ is
large, but a deep network can achieve the same approximation with polynomially fewer parameters
by leveraging its hierarchical structure. The expressivity of deep networks can be further quantified
by considering their ability to approximate bandlimited functions, i.e., functions f whose Fourier
spectra are supported on ∥ξ∥ ≤ ωmax . For a shallow network with width n, the required number
of neurons scales as
n \sim (\omega_{\max})^{d} \qquad (2.41)
where d is the input dimension. In contrast, for a deep network with depth L, the width scales as
n \sim (\omega_{\max})^{d/L}, \qquad (2.42)
reflecting the exponential efficiency of depth in distributing the approximation of frequency compo-
nents across layers. For example, if f (x) = cos(2πξ · x) with ∥ξ∥ = ωmax , a deep network requires
significantly fewer parameters than a shallow network to approximate f to the same accuracy. A
neural network with an input x ∈ Rd and output f (x) can be expressed as a sum of nonlinearly
transformed weighted combinations of the input. Mathematically, this can be written as
f(x) = \sum_{i=1}^{m} a_i\, \sigma(w_i^T x + b_i), \qquad (2.43)
where σ(·) is the activation function, wi ∈ Rd are the weight vectors, bi ∈ R are the biases, and ai
are the output weights. To understand the spectral properties of f (x), it is necessary to analyze
the Fourier transform of the activation function σ(wiT x + bi ), as the network output consists of a
superposition of such transformed functions. The Fourier transform of f (x) is given by
\hat{f}(k) = \int_{\mathbb{R}^d} f(x)\, e^{-2\pi i k \cdot x}\, dx. \qquad (2.44)
The key term in this expression is the Fourier transform of the activation function σ(wT x + b),
which we denote as
\hat{\sigma}(k) = \int_{\mathbb{R}^d} \sigma(w^T x + b)\, e^{-2\pi i k \cdot x}\, dx. \qquad (2.46)
The decay rate of σ̂(k) determines the extent to which different frequency components are retained
in the Fourier spectrum of the neural network. If σ(x) is a smooth function, then σ̂(k) decays
rapidly for large ∥k∥, implying that high-frequency components are suppressed. If σ(x) is piecewise
continuous or non-differentiable at some points, the decay rate of σ̂(k) is slower, allowing the
network to capture higher frequencies. To quantify the decay properties of the Fourier transform
of an activation function, consider the case of the ReLU activation function, σ(x) = max(0, x), whose transform is derived explicitly in the next subsection; it exhibits only a power-law decay. The sigmoid activation, by contrast, is infinitely differentiable, and its Fourier transform decays exponentially in ∥k∥. This strong decay implies that networks with sigmoid activation functions are biased toward learning only low-frequency components. Similarly, the hyperbolic tangent activation function tanh(x) also exhibits an exponential spectral decay. For a more general class of activation functions, the
decay rate of \hat{\sigma}(k) can be estimated as follows. If \sigma(x) is smooth, then
|\hat{\sigma}(k)| \le C e^{-\alpha \|k\|} \qquad (2.51)
for some constants C, \alpha > 0, implying that the function is a strong low-pass filter. If \sigma(x) is piece-
wise smooth but non-differentiable at certain points (such as ReLU), then the Fourier transform
satisfies a power-law decay
|\hat{\sigma}(k)| \le C |k|^{-p} \qquad (2.52)
for some p > 0. A particularly interesting case arises when the activation function is sinusoidal,
such as σ(x) = sin(x), for which the Fourier transform does not decay at all. The implications of
these spectral properties become evident in the training dynamics of neural networks. When using
gradient descent, the evolution of the Fourier coefficients of f (x) over time follows
\frac{d\hat{f}(k, t)}{dt} = -\lambda(k)\, \hat{f}(k, t), \qquad (2.53)
where λ(k) is an effective learning rate for each frequency. The decay behavior of σ̂(k) influences
λ(k), meaning that activation functions with strong spectral decay impose a bottleneck on the
learning of high-frequency components. From a function approximation perspective, the spectral
characteristics of the activation function determine the types of functions that can be efficiently
represented by a neural network. Smooth activation functions lead to approximations that are
predominantly low-frequency, whereas activation functions with slower Fourier decay allow the
network to approximate functions with higher-frequency content. Thus, the activation function
fundamentally shapes the Fourier spectrum of neural networks, controlling the network’s ability to
represent and learn different frequency components.
The following table summarizes the spectral characteristics of the activation functions discussed above and their impact on frequency learning in neural networks:

Activation          | Fourier decay of σ̂(k)          | Frequency bias
------------------- | ------------------------------- | -------------------------------------------
Sigmoid             | exponential, e^{-α∥k∥}          | strongly low-pass; learns low frequencies first
Tanh                | exponential, e^{-α∥k∥}          | strongly low-pass; learns low frequencies first
ReLU                | power law, ~|k|^{-2}            | retains moderate high-frequency content
Leaky ReLU          | power law, (1+α)-modulated      | retains moderate high-frequency content
Sinusoidal          | no decay (spectral spikes)      | represents specific frequencies exactly
In summary, the Fourier analysis of expressivity rigorously demonstrates the superiority of deep
networks over shallow ones in approximating complex functions. Depth introduces a hierarchi-
cal compositional structure that enables the efficient representation of high-frequency components,
while width provides a rich basis for approximating the function’s Fourier spectrum. Together,
these properties explain the remarkable capacity of deep neural networks to approximate functions
with intricate spectral structures, offering a mathematically rigorous foundation for understanding
their expressivity.
We will explicitly derive σ̂(k) for each activation function in a mathematically rigorous manner.
This will include the Sigmoid, Tanh, ReLU, Leaky ReLU, and Sinusoidal activation functions.
For the ReLU activation we write σ(x) = max(0, x) = x H(x), where H(x) is the Heaviside step function. Since H(x) is zero for x < 0, the integral simplifies to:
\hat{\sigma}(k) = \int_0^{\infty} x\, e^{-2\pi i k x}\, dx. \qquad (2.70)
Interpreted as a tempered distribution, this integral produces a |k|^{-2} tail. An analogous computation for the Leaky ReLU (slope α on the negative half-line) modifies only the prefactor; thus, the decay is still |k|^{-2}, but the amplitude is modulated by (1 + α).
These rigorous derivations confirm how different activation functions influence the Fourier spectrum
of neural networks. The decay properties play a crucial role in determining the network’s ability
to learn high- or low-frequency functions.
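These decay claims can also be checked numerically. The sketch below is an illustration under stated assumptions, not part of the derivations: each activation is multiplied by a wide Gaussian window so the FFT is well defined; the window width, grid, and fitted frequency band are arbitrary choices of the example.

```python
import numpy as np

# Numerical check of the spectral-decay claims above. Windowing preserves the
# kink of (leaky) ReLU at the origin, which forces a ~|k|^-2 power-law tail,
# while the smooth sigmoid and tanh decay much faster (quickly reaching the
# FFT noise floor), so only the slope *trend* is meaningful.
N, half = 2**15, 200.0
x = np.linspace(-half, half, N, endpoint=False)
dx = x[1] - x[0]
k = np.fft.rfftfreq(N, d=dx)
window = np.exp(-(x / 40.0) ** 2)

activations = {
    "sigmoid": 1.0 / (1.0 + np.exp(-x)),
    "tanh": np.tanh(x),
    "ReLU": np.maximum(x, 0.0),
    "leaky ReLU (a=0.1)": np.where(x > 0, x, 0.1 * x),
}
for name, s in activations.items():
    S = np.abs(np.fft.rfft(s * window)) * dx
    band = (k > 0.5) & (k < 2.0)          # fit a log-log slope on this band
    slope = np.polyfit(np.log(k[band]), np.log(S[band] + 1e-300), 1)[0]
    print(f"{name:18s}: log-log spectral slope ~ {slope:+.1f}")
```

A slope near -2 for the (leaky) ReLU and a far steeper slope for sigmoid and tanh is consistent with the power-law versus exponential decay derived above.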
Leonhard Euler solved this problem in the 18th century and astonishingly found that the sum converges to \frac{\pi^2}{6}, meaning:
\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6}. \qquad (2.80)
This result is remarkable because it links a seemingly simple infinite sum to the fundamental
mathematical constant π, which is deeply connected to trigonometry and Fourier analysis. A
summary of notable proofs given by mathematicians to the Basel problem is given by Ghosh
(2020) [675]. A solution to the Basel problem using the Calculus of Residues is given by Ghosh
(2021) [776]. In modern mathematics, the Basel Problem’s solution is understood in the broader
context of zeta functions, where the Riemann zeta function ζ(s) is defined as:
\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s}. \qquad (2.81)
For s = 2, we recover the Basel Problem. The significance of such summations extends far beyond
pure mathematics; they appear in areas such as signal processing, numerical analysis, and
machine learning. In particular, deep learning models, which fundamentally rely on function
approximation and optimization, exhibit connections to the types of series that arise in the Basel
Problem, particularly in relation to Fourier series and the decay of spectral components in neural
network approximations.
2.3.1.1 The Basel Problem, Fourier Series, and Function Approximation in Deep
Learning
One of the most natural places where the Basel Problem appears in applied mathematics is in
Fourier series. In Fourier analysis, a periodic function can be decomposed into a sum of sinusoidal
functions with different frequencies. The coefficients of these sinusoidal components determine
the function’s structure, and their squared magnitudes are directly tied to the Basel-type sums.
Specifically, for a function f (x) with Fourier expansion:
f(x) = \sum_{n=1}^{\infty} \left( a_n \cos(nx) + b_n \sin(nx) \right), \qquad (2.82)
Parseval’s theorem states that the total energy of the function is given by:
\sum_{n=1}^{\infty} (a_n^2 + b_n^2). \qquad (2.83)
If the function is smooth, the coefficients a_n, b_n decay rapidly, often behaving like 1/n^2 or faster. The
Basel Problem is thus closely related to the rate of decay of Fourier coefficients, which directly
impacts how well a function can be approximated using a truncated Fourier series. Deep learn-
ing, particularly in neural network function approximation, shares a fundamental connection with
Fourier series. Neural networks with sufficient width and depth can approximate arbitrary func-
tions, and recent research has shown that deep networks tend to first learn lower-frequency
components before higher frequencies—a phenomenon known as spectral bias. The Basel-
type decay of Fourier coefficients is precisely what governs the smoothness of these approximations,
meaning that understanding series like the Basel Problem helps us characterize how neural networks
generalize and learn complex functions.
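The spectral-bias phenomenon just described can be mimicked with a short computation. Under the residual dynamics of Eq. (2.53), each frequency's error decays at its own rate; assuming, purely for illustration, the hypothetical low-pass profile λ(k) = 1/(1 + k²), low frequencies are fit orders of magnitude sooner than high ones:

```python
import numpy as np

# Closed-form residual decay r_hat(k, t) = r_hat(k, 0) * exp(-lambda(k) t),
# following Eq. (2.53); lambda(k) = 1/(1 + k^2) is an assumed low-pass profile.
ks = np.array([1.0, 4.0, 16.0])
lam = 1.0 / (1.0 + ks ** 2)
r0 = np.ones_like(ks)                  # unit initial error at every frequency

for t in [10.0, 100.0, 1000.0]:
    r = r0 * np.exp(-lam * t)
    print(f"t = {t:7.1f}: residuals at k = 1, 4, 16 -> " +
          ", ".join(f"{v:.2e}" for v in r))
```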
2.3.1.2 The Role of the Basel Problem in Regularization and Weight Decay
Another important connection between the Basel Problem and deep learning emerges in the context
of regularization techniques, particularly L2 regularization, also known as weight decay.
Regularization techniques in neural networks help prevent overfitting by penalizing large weight
magnitudes, encouraging smoother function approximations. Mathematically, L2 regularization
adds a penalty term to the loss function:
L_{\text{reg}} = \lambda \sum_{i} w_i^2, \qquad (2.84)
where wi are the weights of the network, and λ is a regularization parameter. This penalty ensures
that the network does not learn excessively large weights, which could lead to high-frequency
oscillations in the approximated function. The sum of squared weights in L2 regularization resembles
the Basel-type summation of reciprocal squares. Just as the Basel Problem reflects the decay
of Fourier coefficients for smooth functions, L2 regularization ensures that the learned function
maintains smoothness by controlling the magnitude of weights. Thus, there is a deep analogy:
both involve penalizing high-frequency components to favor smooth, well-behaved
solutions.
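A minimal sketch of this analogy, with all numbers hypothetical: fitting noisy data with high-frequency random features, an L2 penalty on the coefficients shrinks the weights and visibly smooths the learned function.

```python
import numpy as np

# Minimal sketch: an L2 (ridge) penalty on the coefficients of a
# high-frequency random-features fit suppresses oscillation, mirroring
# the Basel-type decay analogy above.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)   # noisy smooth target

w = rng.normal(scale=50.0, size=300)          # high-frequency random features
b = rng.uniform(0.0, 2 * np.pi, size=300)
Phi = np.cos(np.outer(x, w) + b)

for lam in [1e-8, 1e-3, 1.0]:                 # near-zero to strong weight decay
    a = np.linalg.solve(Phi.T @ Phi + lam * np.eye(300), Phi.T @ y)
    fit = Phi @ a
    roughness = np.mean(np.diff(fit, 2) ** 2) # second-difference roughness proxy
    print(f"lambda = {lam:8.1e}:  ||w||^2 = {np.sum(a**2):10.2f}   "
          f"roughness = {roughness:.2e}")
```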
Thus, the decay of Fourier coefficients (as described by the Basel Problem) provides an intuition for
why neural networks prefer lower frequencies. This insight is valuable in designing deep learning
architectures, especially in areas like implicit neural representations (INRs), where Fourier
features are explicitly incorporated into the model design to control the spectral bias. While the
Basel Problem originates in pure mathematics, its influence extends deeply into areas of applied
mathematics, including function approximation, signal processing, and neural networks. The prob-
lem’s solution reveals a fundamental truth about how series behave, and this truth manifests in
Fourier series, weight decay regularization, and spectral bias in deep learning. The key takeaway is
that the Basel Problem describes the natural decay of Fourier coefficients in function
approximations, and similar decay patterns emerge in deep learning when training networks to
approximate complex functions. Euler’s result continues to play a role in shaping our understanding
of function representations in neural networks today.
3 Training Dynamics and NTK Linearization
Literature Review: Trevisan et al. [85] investigated how knowledge distillation can be ana-
lyzed using the Neural Tangent Kernel (NTK) framework and demonstrated that under certain
conditions, the training dynamics of a student model in knowledge distillation closely follow NTK
linearization. They explored how NTK affects generalization and feature transfer in the distillation
process. They provided theoretical insight into why knowledge distillation improves performance
in deep networks. Bonfanti et al. (2024) [86] studied how NTK behaves in the nonlinear regime,
particularly in Physics-Informed Neural Networks (PINNs). They showed that when PINNs oper-
ate outside the NTK regime, their performance degrades due to high sensitivity to initialization
and weight updates. They established conditions under which NTK linearization is insufficient
for PINNs, emphasizing the need for nonlinear adaptations. They provided practical guidelines
for designing PINNs that maintain stable training dynamics. Jacot et al. (2018) [87] introduced
the Neural Tangent Kernel (NTK) as a fundamental framework for analyzing infinite-width neural
networks. They proved that as width approaches infinity, neural networks evolve as linear models
governed by the NTK. They derived generalization bounds for infinitely wide networks and con-
nected training dynamics to kernel methods. They established NTK as a core tool in deep learning
theory, leading to further developments in training dynamics research. Lee et al. (2019) [88]
extended NTK theory to arbitrarily deep networks, showing that even deep architectures behave as
linear models under gradient descent and proved that training dynamics remain stable regardless of
network depth when width is sufficiently large. They explored practical implications for initializing
and optimizing deep networks. They strengthened NTK theory by confirming its validity beyond
shallow networks. Yang and Hu (2022) [89] challenged the conventional NTK assumption that fea-
ture learning is negligible in infinite-width networks and showed that certain activation functions
can induce nontrivial feature learning even in infinite-width regimes. They suggested that feature
learning can be integrated into NTK theory, opening new directions in kernel-based deep learning
research. Xiang et al. (2023) [90] investigated how finite-width effects impact training dynam-
ics under NTK assumptions and showed that finite-width networks deviate from NTK predictions
due to higher-order corrections in weight updates. They derived corrections to NTK theory for
practical networks, improving its predictive power for real-world architectures. They refined NTK
approximations, making them more applicable to modern deep-learning models. Lee et al. (2019)
[91] extended NTK linearization to deep convolutional networks, analyzing their training dynamics
under infinite width and showed how locality and weight sharing in CNNs impact NTK behav-
ior. They also demonstrated practical consequences for CNN training in real-world applications.
They bridged NTK theory and convolutional architectures, providing new theoretical tools for CNN
analysis.
Another standard text explains the dynamics of gradient descent in the context of neural networks, covering topics such as
backpropagation, vanishing gradients, and saddle points. The book also discusses the role of learn-
ing rates, momentum, and adaptive optimization methods in shaping the trajectory of gradient
flow. Sra et al. (2012) [475] included several chapters dedicated to the theoretical and practi-
cal aspects of gradient-based optimization in machine learning. It provides rigorous mathematical
treatments of gradient flow dynamics, including convergence analysis, the impact of stochasticity in
stochastic gradient descent (SGD), and the geometry of loss landscapes in high-dimensional spaces.
Choromanska et al. (2015) [476] rigorously analyzed the loss surfaces of deep neural networks.
It demonstrates that the loss landscape is highly non-convex but contains a large number of local
minima that are close in function value to the global minimum. The paper provides insights into
how gradient flow navigates these complex landscapes and why it often converges to satisfactory
solutions despite the non-convexity. Arora et al. (2019) [477] provided a theoretical framework for
understanding the dynamics of gradient descent in deep neural networks. It rigorously analyzes
the role of overparameterization in enabling gradient flow to converge to global minima, even in
the absence of explicit regularization. The paper also explores the implicit regularization effects of
gradient descent and their impact on generalization. Du et al. (2019) [468] establishes theoretical
guarantees for the convergence of gradient descent to global minima in overparameterized neural
networks. It rigorously proves that gradient flow can efficiently minimize the training loss to zero,
even in the presence of non-convexity, by leveraging the high-dimensional geometry of the loss land-
scape. The authors provided a rigorous analysis of the exponential convergence of gradient descent
in overparameterized neural networks. It shows that the gradient flow dynamics are characterized
by a rapid decrease in the loss function, driven by the alignment of the network’s parameters with
the data. The paper also discusses the role of initialization in shaping the trajectory of gradient
flow. Zhang et al. (2017) [446] challenged traditional notions of generalization in deep learning.
It rigorously demonstrates that deep neural networks can fit random labels, suggesting that the
dynamics of gradient flow are not solely driven by the data distribution but also by the implicit
biases of the optimization algorithm. The paper highlights the importance of understanding how
gradient flow interacts with the architecture and initialization of neural networks. Baratin et al.
(2020) [478] explored the implicit regularization effects of gradient flow in deep learning from the
perspective of function space. It rigorously demonstrates that gradient descent in overparameter-
ized models tends to converge to solutions that minimize certain norms or complexity measures,
providing insights into why these models generalize well despite their capacity to overfit. Balduzzi
et al. (2018) [479] extended the analysis of gradient flow to multi-agent optimization problems,
such as those encountered in generative adversarial networks (GANs). It rigorously characterizes
the dynamics of gradient descent in games, highlighting the role of rotational forces and the chal-
lenges of convergence in non-cooperative settings. The paper provides tools for understanding how
gradient flow behaves in complex, interactive learning scenarios. Allen-Zhu et al. (2019) [470]
provided a rigorous convergence theory for deep learning models trained with gradient descent. It
shows that overparameterization enables gradient flow to avoid bad local minima and converge to
global minima efficiently. The paper also analyzes the role of initialization, step size, and network
depth in shaping the dynamics of gradient descent.
The dynamics of gradient flow in neural network training are fundamentally governed by the
continuous evolution of parameters θ(t) under the influence of the negative gradient of the loss
function, expressed as
\frac{d\theta(t)}{dt} = -\nabla_\theta L(\theta(t)). \qquad (3.1)
The loss function, typically of the form
L(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \| f(x_i; \theta) - y_i \|^2, \qquad (3.2)
measures the discrepancy between the network’s predicted outputs f (xi ; θ) and the true labels yi .
At stationary points of the flow, the condition
∇θ L(θ∗ ) = 0 (3.3)
holds, indicating that the gradient vanishes. To classify these stationary points, the Hessian ma-
trix H = ∇2θ L(θ) is examined. For eigenvalues {λi } of H, the nature of the stationary point is
determined: λi > 0 for all i corresponds to a local minimum, λi < 0 for all i to a local maximum,
and mixed signs indicate a saddle point. Under gradient flow \frac{d\theta(t)}{dt} = -\nabla_\theta L(\theta(t)), the trajectory
converges to critical points:
\lim_{t \to \infty} \|\nabla_\theta L(\theta(t))\| = 0. \qquad (3.4)
The gradient flow also governs the temporal evolution of the network’s predictions f (x; θ(t)). A
Taylor series expansion of f (x; θ) about an initial parameter θ0 gives:
f(x; \theta) = f(x; \theta_0) + J_f(x; \theta_0)(\theta - \theta_0) + \frac{1}{2} (\theta - \theta_0)^\top H_f(x; \theta_0)(\theta - \theta_0) + O(\|\theta - \theta_0\|^3), \qquad (3.5)
where Jf (x; θ0 ) = ∇θ f (x; θ0 ) is the Jacobian and Hf (x; θ0 ) is the Hessian of f (x; θ) with respect to
θ. In the NTK (neural tangent kernel) regime, higher-order terms are negligible due to the large
parameterization of the network, and the linear approximation suffices:
f(x; \theta) \approx f(x; \theta_0) + J_f(x; \theta_0)(\theta - \theta_0). \qquad (3.6)
Under gradient flow, the time derivative of the network's predictions is given by:
\frac{df(x; \theta(t))}{dt} = J_f(x; \theta(t))\, \frac{d\theta(t)}{dt}. \qquad (3.7)
Substituting the parameter dynamics \frac{d\theta(t)}{dt} = -\nabla_\theta L(\theta(t)) = -\sum_{i=1}^{n} (f(x_i; \theta(t)) - y_i)\, J_f(x_i; \theta(t))^\top, this becomes:
\frac{df(x; \theta(t))}{dt} = -\sum_{i=1}^{n} J_f(x; \theta(t))\, J_f(x_i; \theta(t))^\top (f(x_i; \theta(t)) - y_i). \qquad (3.8)
Defining the NTK as K(x, x′ ; θ) = Jf (x; θ)Jf (x′ ; θ)⊤ , and assuming constancy of the NTK during
training (K(x, x′ ; θ) ≈ K0 (x, x′ )), the evolution equation simplifies to:
\frac{df(x; \theta(t))}{dt} = -\sum_{i=1}^{n} K_0(x, x_i)\, (f(x_i; \theta(t)) - y_i). \qquad (3.9)
Rewriting in matrix form, let f (t) = [f (x1 ; θ(t)), . . . , f (xn ; θ(t))]⊤ and y = [y1 , . . . , yn ]⊤ . The NTK
matrix K0 ∈ Rn×n evaluated at initialization defines the system:
\frac{df(t)}{dt} = -K_0 (f(t) - y). \qquad (3.10)
The solution to this linear system is:
f(t) = y + e^{-K_0 t} (f(0) - y). \qquad (3.11)
As t → ∞, the predictions converge to the labels: f(t) → y, implying zero training error. The eigenvalues of K_0 determine the rates of convergence. Diagonalizing K_0 as K_0 = Q \Lambda Q^\top, where Q is orthogonal and \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n), the dynamics in the eigenbasis are:
\frac{d\tilde{f}(t)}{dt} = -\Lambda (\tilde{f}(t) - \tilde{y}), \qquad (3.12)
where \tilde{f}(t) = Q^\top f(t) and \tilde{y} = Q^\top y.
Each mode decays exponentially with a rate proportional to the eigenvalue λi . Modes with larger
λi converge faster, while smaller eigenvalues slow convergence.
The NTK framework thus rigorously explains the linearization of training dynamics in overparam-
eterized neural networks. This linear behavior ensures that the optimization trajectory remains
within a convex region of the parameter space, leading to both convergence and generalization. By
leveraging the constancy of the NTK, the complexity of nonlinear neural networks is reduced to an
analytically tractable framework that aligns closely with empirical observations.
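The linearized picture can be reproduced numerically. The sketch below, a toy with assumed sizes and a finite-difference Jacobian rather than an exact NTK computation, forms K0 = J J^⊤ for a small random two-layer network at initialization and evolves the residual with the closed-form solution of Eq. (3.11), exhibiting the predicted monotone decay:

```python
import numpy as np

# Sketch of NTK-linearized training: build the Jacobian of a small random
# two-layer tanh network at initialization, form K0 = J J^T, and evolve the
# residual via exp(-K0 t) r0, under the (assumed) constancy of the kernel.
rng = np.random.default_rng(0)
n, d, m = 8, 3, 256                       # data points, input dim, hidden width
X = rng.normal(size=(n, d))
y = np.sin(X @ rng.normal(size=d))        # arbitrary smooth targets

W1 = rng.normal(size=(d, m)) / np.sqrt(d)
a = rng.normal(size=m) / np.sqrt(m)
params = np.concatenate([W1.ravel(), a])

def f(p):
    W1 = p[: d * m].reshape(d, m)
    a = p[d * m :]
    return np.tanh(X @ W1) @ a            # network outputs on all n inputs

eps = 1e-5                                 # central finite differences
J = np.stack([(f(params + eps * e) - f(params - eps * e)) / (2 * eps)
              for e in np.eye(params.size)], axis=1)
K0 = J @ J.T                               # empirical NTK Gram matrix

lam, Q = np.linalg.eigh(K0)
r0 = f(params) - y                         # initial residual
for t in [0.0, 1.0, 10.0, 100.0]:
    r_t = Q @ (np.exp(-lam * t) * (Q.T @ r0))   # exp(-K0 t) r0 via eigenbasis
    print(f"t = {t:6.1f}:  ||residual|| = {np.linalg.norm(r_t):.4f}")
```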
The behavior of the loss function around a specific parameter value θ0 can be rigorously analyzed
using a second-order Taylor expansion. This expansion is given by:
L(\theta) = L(\theta_0) + (\theta - \theta_0)^\top \nabla_\theta L(\theta_0) + \frac{1}{2} (\theta - \theta_0)^\top H(\theta_0)(\theta - \theta_0) + O(\|\theta - \theta_0\|^3). \qquad (3.15)
Here, the term (θ − θ0 )⊤ ∇θ L(θ0 ) represents the linear variation of the loss, while the quadratic
term 21 (θ − θ0 )⊤ H(θ0 )(θ − θ0 ) describes the curvature effects. The eigenvalues of H(θ0 ) dictate the
nature of the critical point θ0 . Specifically, if all λi > 0, θ0 is a local minimum; if all λi < 0, it is
a local maximum; and if the eigenvalues have mixed signs, θ0 is a saddle point. The leading-order
approximation to the change in the loss function, \Delta L \approx \frac{1}{2} \delta\theta^\top H(\theta_0)\, \delta\theta, highlights the dependence
on the eigenstructure of H(θ0 ). In the context of gradient descent, parameter updates follow the
iterative scheme:
θ(t+1) = θ(t) − η∇θ L(θ(t) ), (3.16)
where η is the learning rate. Substituting the Taylor expansion of \nabla_\theta L(\theta^{(t)}) around \theta_0 gives:
\theta^{(t+1)} = \theta^{(t)} - \eta \left[ \nabla_\theta L(\theta_0) + H(\theta_0)(\theta^{(t)} - \theta_0) \right] + O(\|\theta^{(t)} - \theta_0\|^2). \qquad (3.17)
To analyze this update rigorously, we project \theta^{(t)} - \theta_0 onto the eigenbasis \{v_i\} of H(\theta_0), expressing it as:
\theta^{(t)} - \theta_0 = \sum_{i=1}^{d} c_i^{(t)} v_i, \qquad (3.18)
where c_i^{(t)} = v_i^\top (\theta^{(t)} - \theta_0). Substituting this expansion into the gradient descent update rule yields:
c_i^{(t+1)} = c_i^{(t)} - \eta \left[ v_i^\top \nabla_\theta L(\theta_0) + \lambda_i c_i^{(t)} \right]. \qquad (3.19)
The convergence of this iterative scheme is governed by the condition |1−ηλi | < 1, which constrains
the learning rate η relative to the spectrum of H(\theta_0). For eigenvalues \lambda_i with large magnitudes, the admissible learning rate is correspondingly small: stability requires \eta < 2 / \lambda_{\max}.
In the Neural Tangent Kernel (NTK) regime, the evolution of a neural network during train-
ing can be approximated by a linearization of the network output around the initialization. Let
fθ (x) denote the output of the network for input x. Linearizing fθ (x) around θ0 gives:
fθ (x) ≈ fθ0 (x) + ∇θ fθ0 (x)⊤ (θ − θ0 ). (3.20)
The NTK, defined as:
K(x, x′ ) = ∇θ fθ0 (x)⊤ ∇θ fθ0 (x′ ), (3.21)
remains approximately constant during training for sufficiently wide networks. The training dy-
namics of the parameters are described by:
\frac{d\theta}{dt} = -\nabla_\theta L(\theta), \qquad (3.22)
which, under the NTK approximation, becomes:
\frac{d\theta}{dt} = -K \nabla_\theta L(\theta), \qquad (3.23)
where K is the NTK matrix evaluated at initialization. The evolution of the loss function is gov-
erned by the eigenvalues of K, which control the rate of convergence in different directions.
The spectral properties of the Hessian play a pivotal role in the generalization properties of neural
networks. Empirical studies reveal that the eigenvalue spectrum of H(θ) often exhibits a “bulk-
and-spike” structure, with a dense bulk of eigenvalues near zero and a few large outliers. The bulk
corresponds to flat directions in the loss landscape, which contribute to the robustness and gener-
alization of the model, while the spikes represent sharp directions associated with overfitting. This
spectral structure can be analyzed using random matrix theory, where the density of eigenvalues
ρ(λ) is modeled by distributions such as the Marchenko-Pastur law:
\rho(\lambda) = \frac{1}{2\pi \lambda q} \sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}, \qquad (3.24)
where \lambda_\pm = (1 \pm \sqrt{q})^2 are the spectral bounds and q = d/n is the ratio of the number of parameters to
the number of data points. This rigorous analysis links the Hessian structure to both the optimiza-
tion dynamics and the generalization performance of neural networks, providing a comprehensive
mathematical understanding of the training process. The Hessian H(θ) satisfies:
H(\theta) = \nabla^2_\theta L(\theta) = \mathbb{E}_{(x,y)} \left[ \nabla_\theta f_\theta(x)\, \nabla_\theta f_\theta(x)^\top \right]. \qquad (3.25)
For overparameterized networks, H(θ) is nearly degenerate, implying the existence of flat minima.
By the chain rule, the network output evolves as
\frac{\partial f_\theta(x)}{\partial t} = \nabla_\theta f_\theta(x)^\top \frac{d\theta}{dt}. \qquad (3.26)
Substituting \frac{d\theta}{dt} = -\nabla_\theta L(\theta), we have:
\frac{\partial f_\theta(x)}{\partial t} = -\nabla_\theta f_\theta(x)^\top \nabla_\theta L(\theta). \qquad (3.27)
The gradient of the loss function, L(θ), can be expressed explicitly in terms of the training data.
For a generic loss function over the dataset, this takes the form:
L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(f_\theta(x_i), y_i), \qquad (3.28)
where ℓ(fθ (xi ), yi ) represents the loss for the i-th data point. The gradient of the loss with respect
to the parameters is therefore given by:
\nabla_\theta L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \nabla_\theta f_\theta(x_i)\, \nabla_{f_\theta(x_i)} \ell(f_\theta(x_i), y_i). \qquad (3.29)
To introduce the Neural Tangent Kernel (NTK), we define it as the Gram matrix of the Jacobians of the network output with respect to the parameters:
\Theta(x, x'; \theta) = \nabla_\theta f_\theta(x)^\top \nabla_\theta f_\theta(x'). \qquad (3.30)
In the overparameterized regime, where the number of parameters p is significantly larger than the
number of training data points n, it has been empirically and theoretically observed that the NTK
Θ(x, x′ ; θ) remains nearly constant during training. Specifically, Θ(x, x′ ; θ) ≈ Θ(x, x′ ; θ0 ), where
θ0 represents the parameters at initialization. This constancy significantly simplifies the analysis
of the network’s training dynamics. To see this, consider the solution to the differential equation
governing the output dynamics. Let F (t) ∈ Rn×m represent the matrix of network outputs for all
training inputs, where the i-th row corresponds to fθ (xi ). The dynamics can be expressed in matrix
form as:
\frac{\partial F(t)}{\partial t} = -\frac{1}{n} \Theta(\theta_0) \nabla_F L(F), \qquad (3.33)
where Θ(θ0 ) ∈ Rn×n is the NTK matrix evaluated at initialization, and ∇F L(F ) is the gradient of
the loss with respect to the output matrix F . For the special case of a mean squared error loss,
L(F) = \frac{1}{2n} \|F - Y\|_F^2, where Y \in \mathbb{R}^{n \times m} is the matrix of target outputs, the gradient simplifies to:
\nabla_F L(F) = \frac{1}{n} (F - Y). \qquad (3.34)
Substituting this into the dynamics, we obtain:
\frac{\partial F(t)}{\partial t} = -\frac{1}{n^2} \Theta(\theta_0) (F(t) - Y). \qquad (3.35)
The solution to this differential equation is:
F(t) = Y + e^{-\frac{\Theta(\theta_0)}{n^2} t} (F(0) - Y), \qquad (3.36)
where F (0) represents the initial outputs of the network. As t → ∞, the exponential term vanishes,
and the network outputs converge to the targets Y , provided that Θ(θ0 ) is positive definite. The rate
of convergence is determined by the eigenvalues of Θ(θ0 ), with smaller eigenvalues corresponding
to slower convergence along the associated eigenvectors. To understand the stationary points of
this system, we note that these occur when \frac{\partial F(t)}{\partial t} = 0. From the dynamics, this implies:
Θ(θ0 )(F − Y ) = 0. (3.37)
If Θ(θ0 ) is invertible, this yields F = Y , indicating that the network exactly interpolates the train-
ing data at the stationary point. However, if Θ(θ0 ) is not full-rank, the stationary points form a
subspace of solutions satisfying \Pi(F - Y) = 0, where \Pi is the projection operator onto the
column space of Θ(θ0 ).
The NTK framework provides a mathematically rigorous lens to analyze training dynamics, eluci-
dating the interplay between parameter evolution, kernel properties, and loss convergence in neural
networks. By linearizing the training dynamics through the NTK, we achieve a deep understand-
ing of how overparameterized networks evolve under gradient flow and how they reach stationary
points, revealing their capacity to interpolate data with remarkable precision.
Further studies have examined the NTK. Belkin et al. (2019) [474] explored the connection between deep learning and kernel
learning, emphasizing the role of the NTK in understanding generalization and optimization. It
provides a high-level perspective on why the NTK framework is essential for analyzing modern
machine learning models.
The Neural Tangent Kernel (NTK) regime is a fundamental framework for understanding the
dynamics of gradient descent in highly overparameterized neural networks. Consider a neural net-
work f (x; θ) parameterized by θ ∈ RP , where P represents the total number of parameters, and
x \in \mathbb{R}^d is the input vector. For a training dataset \{(x_i, y_i)\}_{i=1}^{N}, the loss function L(t) at time t is given by
L(t) = \frac{1}{2N} \sum_{i=1}^{N} \left( f(x_i; \theta(t)) - y_i \right)^2. \qquad (3.38)
The parameters evolve according to gradient descent as θ(t + 1) = θ(t) − η∇θ L(t), where η > 0 is
the learning rate. In the NTK regime, we consider the first-order Taylor expansion of the network
output around the initialization \theta_0:
f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^\top (\theta - \theta_0). \qquad (3.39)
This linear approximation transforms the nonlinear dynamics of f into a simpler, linearized form. To analyze training, we introduce the Jacobian matrix J \in \mathbb{R}^{N \times P}, where J_{ij} = \frac{\partial f(x_i; \theta_0)}{\partial \theta_j}. The vector of outputs f(t) \in \mathbb{R}^N, aggregating predictions over the dataset, evolves as
f(t) = f(0) + J (\theta(t) - \theta_0), \qquad (3.40)
and the NTK is the Gram matrix \Theta = J J^\top \in \mathbb{R}^{N \times N}. \qquad (3.41)
As P → ∞, the NTK converges to a deterministic matrix that remains nearly constant during
training. Substituting the linearized form of f (t) into the gradient descent update equation gives
f(t+1) = f(t) - \frac{\eta}{N} \Theta\, (f(t) - y), \qquad (3.42)
where y ∈ RN is the vector of true labels. Defining the residual r(t) = f (t) − y, the dynamics of
training reduce to
r(t+1) = \left( I - \frac{\eta}{N} \Theta \right) r(t). \qquad (3.43)
The eigendecomposition Θ = QΛQ⊤ , with orthogonal Q and diagonal Λ = diag(λ1 , . . . , λN ), allows
us to analyze the decay of residuals in the eigenbasis of Θ:
\tilde{r}(t+1) = \left( I - \frac{\eta}{N} \Lambda \right) \tilde{r}(t), \qquad (3.44)
where r̃(t) = Q⊤ r(t). Each component decays as
\tilde{r}_i(t) = \left( 1 - \frac{\eta \lambda_i}{N} \right)^{t} \tilde{r}_i(0). \qquad (3.45)
For small η, the training dynamics are approximately continuous, governed by
\frac{dr(t)}{dt} = -\frac{1}{N} \Theta r(t), \qquad (3.46)
leading to the solution
r(t) = \exp\!\left( -\frac{\Theta t}{N} \right) r(0). \qquad (3.47)
The NTK for specific architectures, such as fully connected ReLU networks, can be derived using
layerwise covariance matrices. Let Σ(l) (x, x′ ) denote the covariance between pre-activations at layer
l. The recurrence relation for Σ(l) is
\Sigma^{(l)}(x, x') = \frac{1}{2\pi} \|z^{(l-1)}(x)\|\, \|z^{(l-1)}(x')\| \left( \sin\theta + (\pi - \theta)\cos\theta \right), \qquad (3.48)
where \theta = \cos^{-1}\!\left( \frac{\Sigma^{(l-1)}(x, x')}{\sqrt{\Sigma^{(l-1)}(x, x)\, \Sigma^{(l-1)}(x', x')}} \right). The NTK, a sum over contributions from all layers,
quantifies how parameter updates propagate through the network.
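As an illustration, the recursion (3.48) can be iterated directly for a pair of inputs. The sketch below assumes the input-layer covariance Σ^(0)(x, x') = x · x' and identifies ∥z^(l)(x)∥² with the diagonal entry Σ^(l)(x, x) (so diagonals halve at each ReLU layer, the θ = 0 case of the formula); both identifications are assumptions of the example.

```python
import numpy as np

# Sketch of the layerwise ReLU covariance recursion of Eq. (3.48), iterated
# from an assumed input kernel Sigma^(0)(x, x') = x . x'.
def relu_cov_step(Sxx, Sxy, Syy):
    """One application of Eq. (3.48) to a pair of inputs."""
    cos_t = np.clip(Sxy / np.sqrt(Sxx * Syy), -1.0, 1.0)
    theta = np.arccos(cos_t)
    norm_prod = np.sqrt(Sxx * Syy)
    new_Sxy = norm_prod * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)
    # Diagonal entries follow with theta = 0: Sigma -> Sigma / 2
    return Sxx / 2.0, new_Sxy, Syy / 2.0

x = np.array([1.0, 0.0])
y = np.array([np.cos(1.0), np.sin(1.0)])   # unit vectors, angle 1 rad apart
Sxx, Sxy, Syy = x @ x, x @ y, y @ y
for l in range(1, 6):
    Sxx, Sxy, Syy = relu_cov_step(Sxx, Sxy, Syy)
    print(f"layer {l}: correlation = {Sxy / np.sqrt(Sxx * Syy):+.4f}")
```

The printed correlations increase toward 1 with depth, showing how repeated ReLU layers progressively align the representations of distinct inputs.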
In the infinite-width limit, the NTK framework predicts generalization properties, as the kernel
matrix Θ governs both training and test-time behavior. The NTK connects neural networks to
classical kernel methods, offering a bridge between deep learning and well-established theoretical
tools in approximation theory. This regime’s deterministic and analytical tractability enables pre-
cise characterizations of network performance, convergence rates, and robustness to initialization
and learning rate variations.
4 Generalization Bounds: PAC-Bayes and Spectral Analysis
At the core of the PAC-Bayes formalism lies the ambition to rigorously quantify the generalization
ability of hypotheses h ∈ H based on their performance on a finite dataset S ∼ Dm , where D
represents the underlying, and typically unknown, data distribution. The PAC framework, which
was originally designed to provide high-confidence guarantees on the true risk
R(h) = \mathbb{E}_{(x,y) \sim D}[\ell(h(x), y)], \qquad (4.1)
relies on the empirical risk
\hat{R}(h, S) = \frac{1}{m} \sum_{i=1}^{m} \ell(h(x_i), y_i), \qquad (4.2)
which serves as a computable proxy. The key question addressed by PAC-Bayes is: How does \hat{R}(h, S)
relate to R(h), and how can we bound the deviation probabilistically? For a distribution Q over H,
these risks are generalized as:
R(Q) = Eh∼Q [R(h)], R̂(Q, S) = Eh∼Q [R̂(h, S)]. (4.3)
This generalization is pivotal because it allows the analysis to transcend individual hypotheses and
consider probabilistic ensembles, where Q(h) represents a posterior belief over the hypothesis space
conditioned on the observed data. We now need to discuss how Prior and Posterior Distributions
encode knowledge and complexity. The prior P is a fixed distribution over H that reflects pre-data
assumptions about the plausibility of hypotheses. Crucially, P must be independent of S to avoid
biasing the bounds. The posterior Q, however, is data-dependent and typically chosen to minimize
a combination of empirical risk and complexity. This choice is guided by the PAC-Bayes inequality,
which regularizes Q via its Kullback-Leibler (KL) divergence from P :
\mathrm{KL}(Q \| P) = \int_{\mathcal{H}} Q(h) \log \frac{Q(h)}{P(h)}\, dh. \qquad (4.4)
The KL divergence quantifies the informational cost of updating P to Q, serving as a penalty term
that discourages overly complex posteriors. This regularization is critical in preventing overfitting,
ensuring that Q achieves a balance between data fidelity and model simplicity.
Let’s derive the PAC-Bayes Inequality: Probabilistic and Information-Theoretic Foundations. The
derivation of the PAC-Bayes inequality hinges on a combination of probabilistic tools and information-
theoretic arguments. A central step involves applying a change of measure from P to Q, leveraging
the identity:
\mathbb{E}_{h \sim Q}[f(h)] = \mathbb{E}_{h \sim P}\!\left[ \frac{Q(h)}{P(h)} f(h) \right]. \qquad (4.5)
This allows the incorporation of Q into bounds that originally apply to fixed h. By analyzing
the moment-generating function of deviations between R̂(h, S) and R(h), and applying Hoeffding’s
inequality to the empirical loss, we arrive at the following bound for any Q and P , with probability
at least 1 - \delta:
R(Q) \le \hat{R}(Q, S) + \sqrt{ \frac{\mathrm{KL}(Q \| P) + \log \frac{1}{\delta}}{2m} }. \qquad (4.6)
The generalization bound is therefore given by:
L(f) - L_{\mathrm{emp}}(f) \le \sqrt{ \frac{\mathrm{KL}(Q \| P) + \log(1/\delta)}{2N} }, \qquad (4.7)
where KL(Q∥P ) quantifies the divergence between the posterior Q and prior P . This bound is
remarkable because it explicitly ties the true risk R(Q) to the empirical risk R̂(Q, S), the KL
divergence, and the sample size m. The PAC-Bayes bound encapsulates three competing forces: the empirical risk \hat{R}(Q, S), the complexity penalty \mathrm{KL}(Q \| P), and the confidence term \sqrt{\log\frac{1}{\delta} / (2m)}. This
interplay reflects a fundamental trade-off in learning:
1. Empirical Risk: R̂(Q, S) captures how well the posterior Q fits the training data.
2. Complexity: The KL divergence ensures that Q remains close to P , discouraging overfitting
and promoting generalization.
3. Confidence: The term \sqrt{\log\frac{1}{\delta} / (2m)} shrinks with increasing sample size, tightening the bound and enhancing reliability.
The KL term also introduces an inherent regularization effect, penalizing hypotheses that deviate
significantly from prior knowledge. This aligns with Occam’s Razor, favoring simpler explanations
that are consistent with the data.
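To see the trade-off numerically, the sketch below evaluates the complexity-plus-confidence term of the bound (4.6) for an isotropic Gaussian prior P = N(0, s²I) and posterior Q = N(µ, s²I), for which KL(Q∥P) = ∥µ∥²/(2s²) in closed form; the posterior displacement and δ are hypothetical numbers chosen for illustration.

```python
import numpy as np

# Toy evaluation of the PAC-Bayes slack in Eq. (4.6) with Gaussian prior
# P = N(0, s^2 I) and posterior Q = N(mu, s^2 I), where
# KL(Q || P) = ||mu||^2 / (2 s^2) in closed form. All numbers hypothetical.
def pac_bayes_gap(kl, m, delta=0.05):
    return np.sqrt((kl + np.log(1.0 / delta)) / (2.0 * m))

s = 1.0
mu_norm2 = 25.0                       # assumed squared norm of posterior mean
kl = mu_norm2 / (2 * s ** 2)
for m in [100, 1_000, 10_000, 100_000]:
    print(f"m = {m:7d}:  R(Q) <= R_hat(Q,S) + {pac_bayes_gap(kl, m):.4f}")
```

The slack shrinks as O(1/sqrt(m)), making visible how more data tightens the guarantee for a fixed posterior complexity.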
There are several extensions and advanced applications of the PAC-Bayes formalism. While the classi-
cal PAC-Bayes framework assumes i.i.d. data, recent advancements have generalized the theory to
handle structured data, such as in time-series and graph-based learning. Furthermore, alternative
divergence measures, like Rényi divergence or Wasserstein distance, have been explored to accom-
modate scenarios where KL divergence may be inappropriate. In practical settings, PAC-Bayes
bounds have been instrumental in analyzing neural networks, Bayesian ensembles, and stochas-
tic processes, offering theoretical guarantees even in high-dimensional, non-convex optimization
landscapes.
4.1.1 KL divergence
The Kullback-Leibler (KL) divergence, also known as the relative entropy, is a fundamental concept
in information theory, probability theory, and statistics. It quantifies the difference between two
probability distributions, measuring the inefficiency of assuming that data is generated from one
distribution when it is actually generated from another. Mathematically, for two discrete probability
distributions P (x) and Q(x) defined over a common sample space X , the KL divergence is given
by
D_{\mathrm{KL}}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)} \qquad (4.8)
where the logarithm is conventionally taken to be the natural logarithm unless otherwise specified.
For continuous probability distributions with probability density functions p(x) and q(x), the KL
divergence is defined as
D_{\mathrm{KL}}(P \| Q) = \int_{\mathcal{X}} p(x) \log \frac{p(x)}{q(x)}\, dx. \qquad (4.9)
The KL divergence can be understood as the expectation of the logarithmic difference between the
two distributions, weighted by the true distribution P (x):
D_{\mathrm{KL}}(P \| Q) = \mathbb{E}_P\!\left[ \log \frac{P(x)}{Q(x)} \right]. \qquad (4.10)
Expanding this expectation in integral form,
D_{\mathrm{KL}}(P \| Q) = \int_{\mathcal{X}} p(x) \log p(x)\, dx - \int_{\mathcal{X}} p(x) \log q(x)\, dx. \qquad (4.11)
The first integral is the negative differential entropy of P,
-H(P) = \int_{\mathcal{X}} p(x) \log p(x)\, dx, \qquad (4.12)
while the second integral represents the cross-entropy between P(x) and Q(x), given by
H(P, Q) = -\int_{\mathcal{X}} p(x) \log q(x)\, dx. \qquad (4.13)
In maximum-likelihood estimation, minimizing D_{\mathrm{KL}}(P \| Q_\theta) over the parameters \theta of a model family Q_\theta is equivalent to maximizing the expected log-likelihood \mathbb{E}_P[\log Q_\theta(x)]. This follows from the fact that the term \mathbb{E}_P[\log P(x)] does not depend on \theta and is therefore irrelevant for optimization. In Bayesian inference, KL divergence plays a central role in variational inference.
Given an intractable posterior P (z | x), one approximates it with a tractable distribution Q(z),
where the objective is to minimize
DKL (Q(z) ∥ P (z | x)). (4.20)
This minimization leads to the Evidence Lower Bound (ELBO),
log P (x) ≥ EQ [log P (x, z)] − EQ [log Q(z)], (4.21)
which is central in modern probabilistic modeling. Moreover, KL divergence is related to mutual
information. Given two random variables X and Y with a joint distribution P (X, Y ) and marginals
P (X) and P (Y ), the mutual information is given by
I(X; Y ) = DKL (P (X, Y ) ∥ P (X)P (Y )), (4.22)
which quantifies the amount of information shared between X and Y . In information geometry,
the Fisher information matrix is derived from the second-order expansion of KL divergence,
g_{ij} = \mathbb{E}_P\!\left[ \frac{\partial \log P(x \mid \theta)}{\partial \theta_i}\, \frac{\partial \log P(x \mid \theta)}{\partial \theta_j} \right]. \qquad (4.23)
KL divergence is also connected to the Pinsker inequality,
\|P - Q\|_{\mathrm{TV}} \le \sqrt{ \frac{1}{2} D_{\mathrm{KL}}(P \| Q) }. \qquad (4.24)
This establishes KL divergence as a fundamental measure of probability distance in statistical
inference and machine learning.
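A short numerical check of the identities above, for two hypothetical three-state distributions: the cross-entropy decomposition of (4.11)-(4.13) and the Pinsker bound (4.24).

```python
import numpy as np

# Numerical illustration: KL divergence, its cross-entropy decomposition,
# and the Pinsker bound for two small discrete distributions.
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.5, 0.3])

kl = np.sum(P * np.log(P / Q))
cross_entropy = -np.sum(P * np.log(Q))
entropy = -np.sum(P * np.log(P))
tv = 0.5 * np.sum(np.abs(P - Q))

print(f"KL(P||Q)             = {kl:.4f}")
print(f"H(P,Q) - H(P)        = {cross_entropy - entropy:.4f}   (equals KL)")
print(f"TV distance          = {tv:.4f}")
print(f"sqrt(KL/2) (Pinsker) = {np.sqrt(kl / 2):.4f}   (upper-bounds TV)")
```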
The Rényi divergence of order \alpha \in (0, 1) \cup (1, \infty) is defined as
D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \ln \int_\Omega p(x)^\alpha q(x)^{1-\alpha}\, d\mu(x), \qquad (4.25)
where p(x) and q(x) denote the Radon-Nikodym derivatives (i.e., probability density functions) of
P and Q with respect to a common reference measure µ. That is,
p(x) = \frac{dP}{d\mu}, \qquad q(x) = \frac{dQ}{d\mu}. \qquad (4.26)
For the case where P and Q are discrete distributions supported on a finite or countable set X , the
Rényi divergence takes the form
D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \ln \sum_{x \in \mathcal{X}} p(x)^\alpha q(x)^{1-\alpha}. \qquad (4.27)
To study the behavior as \alpha \to 1, write F(\alpha) = \int_\Omega p(x)^\alpha q(x)^{1-\alpha}\, d\mu(x); we compute
\frac{d}{d\alpha} F(\alpha) = \int_\Omega p(x)^\alpha q(x)^{1-\alpha} \ln \frac{p(x)}{q(x)}\, d\mu(x). \qquad (4.31)
Applying the logarithmic derivative, we obtain
\lim_{\alpha \to 1} D_\alpha(P \| Q) = \sum_{x \in \mathcal{X}} p(x) \ln \frac{p(x)}{q(x)} = D_{\mathrm{KL}}(P \| Q), \qquad (4.32)
demonstrating the convergence to the KL divergence. The Rényi divergence satisfies the following
fundamental mathematical properties. It is non-negative,
D_\alpha(P \| Q) \ge 0, \qquad (4.33)
with equality if and only if P = Q, and it is monotonically non-decreasing in the order \alpha. \qquad (4.34)
For special values of \alpha, Rényi divergence simplifies to various known divergences. When \alpha = 0, it becomes
D_0(P \| Q) = -\ln \sum_{\{x : p(x) > 0\}} q(x), \qquad (4.35)
which represents the negative logarithm of the probability mass that Q assigns to the support of
P. When \alpha = 2, we obtain
D_2(P \| Q) = \ln \sum_{x \in \mathcal{X}} \frac{p(x)^2}{q(x)}, \qquad (4.36)
and in the limit \alpha \to \infty,
D_\infty(P \| Q) = \ln \sup_{x \in \mathcal{X}} \frac{p(x)}{q(x)}, \qquad (4.37)
which measures the worst-case discrepancy between the two distributions. One crucial property of
the Rényi divergence is its invariance under transformations. Given a Markov kernel T that maps
P and Q to new distributions T P and T Q, the data-processing inequality states
Dα (T P ∥T Q) ≤ Dα (P ∥Q), (4.38)
ensuring that no transformation can increase the divergence between distributions. Rényi diver-
gence also satisfies the joint convexity property: for a mixture of probability distributions Pi and
Q_i with weights \lambda_i,
D_\alpha\!\left( \sum_i \lambda_i P_i \,\Big\|\, \sum_i \lambda_i Q_i \right) \le \sum_i \lambda_i D_\alpha(P_i \| Q_i). \qquad (4.39)
This inequality demonstrates that Rényi divergence behaves well under probabilistic mixing. An-
other fundamental result is the Pinsker-type bound, which states that for α > 1, the Rényi diver-
gence is bounded below by a function of the total variation distance dTV (P, Q),
D_\alpha(P \| Q) \ge \frac{\alpha}{\alpha - 1} \ln\!\left( 1 + \frac{\alpha - 1}{\alpha}\, d_{\mathrm{TV}}^2(P, Q) \right). \qquad (4.40)
For small dTV (P, Q), this inequality shows that Rényi divergence behaves quadratically in the
deviation between P and Q. Furthermore, the Rényi divergence is closely related to the Chernoff
information, which determines the optimal exponent for Bayesian hypothesis testing. The Chernoff
information is defined as
C(P, Q) = -\min_{\alpha \in (0,1)} D_\alpha(P \| Q), \qquad (4.41)
which plays a fundamental role in large deviations theory and hypothesis testing.
In summary, the Rényi divergence is a highly versatile and mathematically rich measure of dissim-
ilarity between probability distributions. It generalizes the KL divergence, interpolates between
various known divergence measures, satisfies essential properties such as monotonicity and data-
processing inequalities, and has deep connections to large deviation theory, statistical estimation,
and hypothesis testing.
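The following sketch evaluates D_α from Eq. (4.27) for two hypothetical discrete distributions, illustrating the monotonicity in α and the α → 1 recovery of the KL divergence:

```python
import numpy as np

# Renyi divergence D_alpha for discrete P, Q (Eq. 4.27): checking the
# alpha -> 1 limit (KL divergence) and the monotonicity in alpha.
P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.5, 0.3])

def renyi(alpha):
    return np.log(np.sum(P ** alpha * Q ** (1 - alpha))) / (alpha - 1)

for a in [0.5, 0.9, 0.999, 1.5, 2.0, 10.0]:
    print(f"alpha = {a:6.3f}:  D_alpha = {renyi(a):.4f}")
kl = np.sum(P * np.log(P / Q))
print(f"KL(P||Q) (alpha -> 1 limit) = {kl:.4f}")
```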
The Wasserstein distance of order p between two probability measures \mu and \nu on a metric space (\mathcal{X}, d) is defined as
W_p(\mu, \nu) = \left( \inf_{\gamma \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{X}} d(x, y)^p\, d\gamma(x, y) \right)^{1/p}, \qquad (4.42)
where \Pi(\mu, \nu) denotes the set of all probability measures \gamma on the product space \mathcal{X} \times \mathcal{X} satisfying the marginal constraints
\int_{\mathcal{X}} \gamma(x, dy) = \mu(x), \qquad \int_{\mathcal{X}} \gamma(dx, y) = \nu(y). \qquad (4.43)
This definition arises from the optimal transport problem, originally formulated by Monge in 1781
and later extended by Kantorovich. The Monge formulation seeks a transport map T : X → X
that pushes µ onto ν, i.e., satisfies
T# µ = ν (4.44)
where T_\# \mu denotes the pushforward measure defined by
T_\# \mu(A) = \mu(T^{-1}(A)) \quad \text{for all measurable } A \subseteq \mathcal{X}. \qquad (4.45)
However, the existence of an optimal transport map is highly constrained and may fail in many
cases. To overcome this, Kantorovich introduced a relaxation in which transport maps are replaced
by probability couplings γ, leading to the variational problem
W_p^p(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{X}} d(x, y)^p\, d\gamma(x, y). \qquad (4.47)
This formulation allows for mass splitting, making the problem more tractable and ensuring the
existence of minimizers under mild conditions. The dual formulation of the Wasserstein distance
is given by
W_p^p(\mu, \nu) = \sup_{\varphi, \psi} \left\{ \int_{\mathcal{X}} \varphi(x)\, d\mu(x) + \int_{\mathcal{X}} \psi(y)\, d\nu(y) \right\}, \qquad (4.48)
where the supremum runs over pairs satisfying \varphi(x) + \psi(y) \le d(x, y)^p. In one dimension, the distance admits the closed form
W_p^p(\mu, \nu) = \int_0^1 \left| F_\mu^{-1}(t) - F_\nu^{-1}(t) \right|^p dt, \qquad (4.49)
where F_\mu and F_\nu are the cumulative distribution functions of \mu and \nu, respectively. For p = 2, the
squared Wasserstein-2 distance is given by
W_2^2(\mu, \nu) = \inf_{\gamma \in \Pi(\mu, \nu)} \int_{\mathcal{X} \times \mathcal{X}} d(x, y)^2\, d\gamma(x, y). \qquad (4.52)
The Wasserstein distance induces a metric on the space of probability measures, providing a notion
of distance that is weaker than total variation but stronger than weak convergence. Specifically, if
\mu_n \to \mu in Wasserstein distance, then for any function f satisfying the growth condition
|f(x)| \le C \left( 1 + d(x, x_0)^p \right) \quad \text{for some } C > 0 \text{ and } x_0 \in \mathcal{X}, \qquad (4.54)
it holds that
\int_{\mathcal{X}} f(x)\, d\mu_n(x) \to \int_{\mathcal{X}} f(x)\, d\mu(x). \qquad (4.55)
The Wasserstein metric endows the space of probability measures with a Riemannian-like structure,
where geodesics are given by McCann interpolation, defined as
\mu_t = \left( (1 - t)\, \mathrm{Id} + t\, T \right)_\# \mu_0, \qquad t \in [0, 1],
with T the optimal transport map from \mu_0 to \mu_1. In computational practice, the Kantorovich problem is often smoothed by an entropic penalty, yielding an entropically regularized approximation of optimal transport. This approximation allows for efficient computation via iterative updates using the Sinkhorn-Knopp
algorithm. The Wasserstein distance plays a crucial role in probability theory, statistics, machine
learning, and functional analysis, providing a rigorous mathematical framework for comparing
distributions with deep connections to convex geometry, measure theory, and differential geometry.
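As a computational illustration, the one-dimensional closed form (4.49) reduces, for two equal-size empirical samples, to sorting and comparing order statistics. The sketch below estimates W_1 and W_2 between samples from two unit-variance Gaussians whose means differ by 2, for which the true distance is 2 for every p (the optimal map is a translation):

```python
import numpy as np

# Wasserstein-p distance between two 1-D empirical measures via the quantile
# (inverse-CDF) formula; for equal-size samples this reduces to sorting.
rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=5000)   # samples from mu
b = rng.normal(loc=2.0, scale=1.0, size=5000)   # samples from nu

for p in [1, 2]:
    wp = np.mean(np.abs(np.sort(a) - np.sort(b)) ** p) ** (1.0 / p)
    print(f"W_{p}(mu, nu) ~ {wp:.3f}   (exact value for these Gaussians: 2.000)")
```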
Literature Review: Jin et al. (2025) [102] introduced a novel confusional spectral regularization
technique to improve fairness in machine learning models. The study focuses on the spectral norm
of the robust confusion matrix and proposes a method to control spectral properties, ensuring more
robust and unbiased learning. It provides insights into how regularization can mitigate biases in
classification tasks. Ye et al. (2025) [103] applied spectral clustering with regularization to detect
small clusters in complex networks. The work enhances spectral clustering techniques by integrat-
ing regularization methods, allowing improved performance in anomaly detection and community
detection tasks. The approach significantly improves robustness in highly noisy data environments.
Bhattacharjee and Bharadwaj (2025) [104] explored how spectral domain representations can ben-
efit from autoencoder-based feature extraction combined with stochastic regularization techniques.
The authors propose a Symmetric Autoencoder (SymAE) that enables better generalization of
spectral features, particularly useful in high-dimensional data and deep learning applications. Wu
et al. (2025) [105] applied spectral regularization to geophysical data processing, specifically for
high-resolution velocity spectrum analysis. The approach enhances the resolution of velocity esti-
mation in seismic imaging by using hyperbolic Radon transform regularization, demonstrating how
spectral regularization can benefit applications beyond traditional ML. Ortega et al. (2025) [106]
applied Tikhonov regularization to atmospheric spectral analysis, optimizing gas retrieval strategies
in high-resolution spectroscopic observations. The work significantly improves methane (CH4) and
nitrous oxide (N2O) detection accuracy by reducing noise in spectral measurements, showcasing
the impact of spectral regularization in remote sensing and environmental monitoring. Kazmi et
al. (2025) [107] proposed a spectral regularization-based federated learning model to improve ro-
bustness in cybersecurity threat detection. The model addresses the issue of non-IID data in SDN
(Software Defined Networks) by utilizing spectral norm-based regularization within deep learning
architectures. Zhao et. al. (2025) [108] introduced a regularized deep spectral clustering method,
which enhances feature selection and clustering robustness. The authors utilize projected adap-
tive feature selection combined with spectral graph regularization, improving clustering accuracy
and interpretability in high-dimensional datasets. Saranya and Menaka (2025) [109] integrated
spectral regularization with quantum-based machine learning to analyze EEG signals for Autism
Spectrum Disorder (ASD) detection. The proposed method improves spatial filtering and feature
extraction using wavelet-based regularization, leading to more reliable EEG pattern recognition.
Dhalbisoi et. al. (2024) [110] developed a Regularized Zero-Forcing (RZF) method for spectral
efficiency optimization in beyond 5G networks. The authors demonstrate that spectral regulariza-
tion techniques can significantly improve signal-to-noise ratios in wireless communication systems,
optimizing data transmission in massive MIMO architectures. Wei et. al. (2025) [111] explored the
use of spectral regularization in medical imaging, particularly in 3D near-infrared spectral tomogra-
phy. The proposed model integrates regularized convolutional neural networks (CNNs) to improve
tissue imaging resolution and accuracy, demonstrating an application of spectral regularization in
biomedical engineering.
Let us define a target function f(x), where x ∈ Rd, and its Fourier transform f̂(ξ) as
\[
\hat{f}(\xi) = \int_{\mathbb{R}^d} f(x)\, e^{-i 2\pi \xi \cdot x} \, dx. \tag{4.60}
\]
This transform breaks down f (x) into frequency components indexed by ξ. In the context of deep
learning, we seek to approximate f (x) with a neural network output fNN (x; θ), where θ represents
the set of trainable parameters. The loss function to be minimized is typically the mean squared error
\[
L(\theta) = \int_{\mathbb{R}^d} \left| f(x) - f_{NN}(x;\theta) \right|^2 dx. \tag{4.61}
\]
We can equivalently express this loss in the Fourier domain, leveraging Parseval's theorem:
\[
L(\theta) = \int_{\mathbb{R}^d} \left| \hat{f}(\xi) - \hat{f}_{NN}(\xi;\theta) \right|^2 d\xi. \tag{4.62}
\]
Gradient descent updates the parameters according to θ ← θ − η∇θL(θ), where η is the learning rate. The gradient of the loss function with respect to θ is
\[
\nabla_\theta L(\theta) = 2 \int_{\mathbb{R}^d} \left( \hat{f}_{NN}(\xi;\theta) - \hat{f}(\xi) \right) \nabla_\theta \hat{f}_{NN}(\xi;\theta) \, d\xi. \tag{4.64}
\]
At the core of this gradient descent process lies the behavior of the gradient ∇θf̂NN(ξ; θ) with respect to the frequency components ξ. For neural networks, particularly those with ReLU activations, the gradients of the output with respect to the parameters are expected to decay for high-frequency components; the relative gradient magnitude R(ξ) can be approximated as
\[
R(\xi) \sim \frac{1}{1 + \|\xi\|^2}, \tag{4.65}
\]
which implies that the neural network is inherently more sensitive to low-frequency components of
the target function during early iterations of training. This spectral decay is a direct consequence
of the structure of the network’s activations, which are more sensitive to low-frequency features due
to their smoother, lower-order terms. To understand the role of the neural tangent kernel (NTK),
which governs the linearized dynamics of the neural network, we define the NTK as
\[
\Theta(x, x'; \theta) = \sum_{i=1}^{P} \frac{\partial f_{NN}(x;\theta)}{\partial \theta_i} \frac{\partial f_{NN}(x';\theta)}{\partial \theta_i}. \tag{4.66}
\]
The NTK essentially describes how the output of the network changes with respect to its param-
eters. The evolution of the network’s output during training can be approximated by the solution
to a linear system governed by the NTK. The output of the network at time t is given by
\[
f_{NN}(x; t) = \sum_k c_k \left( 1 - e^{-\eta \lambda_k t} \right) \phi_k(x), \tag{4.67}
\]
where {λk} are the eigenvalues of Θ and {ϕk(x)} are the corresponding eigenfunctions. The eigenvalues λk determine the speed of convergence for each frequency mode: each coefficient approaches its target at rate ηλk, so low-frequency modes (large λk) converge more quickly than high-frequency ones (small λk).
This differential learning rate for frequency components leads to the spectral regularization phe-
nomenon, where the network learns the low-frequency components of the function first, and the
high-frequency modes only begin to adapt once the low-frequency ones have been approximated
with sufficient accuracy. In a more formal setting, the spectral bias can also be understood in terms
of Sobolev spaces. A neural network function fNN can be seen as a function in a Sobolev space
W^{m,2}, where the norm of a function f in this space is defined as
\[
\|f\|_{W^{m,2}}^2 = \int_{\mathbb{R}^d} \left( 1 + \|\xi\|^2 \right)^m \left| \hat{f}(\xi) \right|^2 d\xi. \tag{4.69}
\]
When training a neural network, the optimization process implicitly regularizes the higher-order
Sobolev norms, meaning that the network will initially approximate the target function in terms
of lower-order derivatives (which correspond to low-frequency modes). This can be expressed by
introducing a regularization term in the loss function,
\[
L_{reg}(\theta) = L(\theta) + \lambda \, \| f_{NN}(\cdot;\theta) \|_{W^{m,2}}^2,
\]
where λ is a regularization parameter that controls the trade-off between data fidelity and smoothness in the approximation.
Thus, spectral regularization emerges as a consequence of the network’s architecture, the nature of
gradient descent optimization, and the inherent smoothness of the functions that neural networks
are capable of learning. The mathematical structure of the NTK and the regularization properties
of the Sobolev spaces provide a rigorous framework for understanding why neural networks prior-
itize the learning of low-frequency modes, reinforcing the idea that neural networks are implicitly
biased toward smooth, low-frequency approximations at the beginning of training. This insight
has profound implications for the generalization behavior of neural networks and their capacity to
approximate complex functions.
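To make the mode-by-mode convergence in (4.67) concrete, the following sketch (an illustrative surrogate, using an RBF Gram matrix in place of the true NTK) eigendecomposes the kernel and evolves each coefficient at its own rate ηλk; the low-frequency component of the target is recovered well before the high-frequency one:

```python
import numpy as np

# Kernel-regression surrogate for NTK training dynamics, eq. (4.67):
# f_t = sum_k c_k (1 - exp(-eta * lam_k * t)) * phi_k
n = 256
x = np.linspace(0.0, 1.0, n)
target = np.sin(2 * np.pi * x) + 0.5 * np.sin(12 * np.pi * x)

width = 0.1  # assumed kernel bandwidth for this toy
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * width**2))
lam, phi = np.linalg.eigh(K)          # eigenvalues/eigenvectors standing in for Theta
c = phi.T @ target                    # coefficients c_k of the target

eta = 1e-3
for t in [10, 100, 1000, 10000]:
    f_t = phi @ (c * (1.0 - np.exp(-eta * lam * t)))
    low = f_t @ np.sin(2 * np.pi * x) / (n / 2)    # recovered low-freq amplitude
    high = f_t @ np.sin(12 * np.pi * x) / (n / 2)  # recovered high-freq amplitude
    print(f"t={t:6d}  low-freq amp ~ {low:.3f} (target 1.0), "
          f"high-freq amp ~ {high:.3f} (target 0.5)")
```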
5 Game-theoretic formulations of Deep Neural Networks
Deep Neural Networks (DNNs) can be rigorously formulated within a game-theoretic framework
by considering the interplay between competing objectives in optimization, adversarial robustness,
and equilibrium strategies. The fundamental nature of DNNs involves a multi-agent optimization
problem where various components interact within a high-dimensional function space. Formally,
a deep neural network is parameterized by a set of weights θ ∈ Rd and seeks to optimize a loss
function L : Rd × Rn → R, where the training data (x, y) ∼ P (x, y) is drawn from an unknown
distribution. The optimization problem can be expressed as
\[
\min_{\theta} \; \mathbb{E}_{(x,y) \sim P(x,y)} \left[ L(\theta; x, y) \right].
\]
In adversarial settings, a perturbation δ is chosen to maximize the loss while θ is chosen to minimize it. The equilibrium of this adversarial game corresponds to a minimax solution, satisfying the condition
\[
\min_{\theta} \max_{\delta} \; \mathbb{E}_{(x,y) \sim P(x,y)} \left[ L(\theta; x + \delta, y) \right].
\]
If the loss function satisfies convexity in θ and concavity in δ, then the minimax theorem guarantees the existence of an equilibrium solution. A canonical instance is the generative adversarial network, where, for a fixed generator G, the optimal discriminator is
\[
D^*(x) = \frac{P_{data}(x)}{P_{data}(x) + P_G(x)}. \tag{5.6}
\]
When PG = Pdata, this yields D*(x) = 1/2 for all x, leading to a situation where the discriminator is unable to distinguish real from generated samples.
If D(x) is parameterized by ϕ, then training the discriminator corresponds to solving
\[
\phi^* = \arg\max_{\phi} \; \mathbb{E}_{x \sim P_{data}} \left[ \log D_\phi(x) \right] + \mathbb{E}_{z \sim P_z} \left[ \log \left( 1 - D_\phi(G_\theta(z)) \right) \right]. \tag{5.8}
\]
Given that neural networks can approximate arbitrary functions, their training can also be for-
mulated in terms of multi-agent differential game theory. Consider a dynamic system where the
evolution of parameters follows
\[
\frac{d\theta}{dt} = -\nabla_\theta L(\theta, \phi), \tag{5.9}
\]
\[
\frac{d\phi}{dt} = \nabla_\phi L(\theta, \phi). \tag{5.10}
\]
The Hamiltonian for this system is given by
\[
H(\theta, p, u) = \nabla_\theta L \cdot u + p \cdot f(\theta, u),
\]
where p is the costate variable and u represents control parameters. The Hamilton-Jacobi-Bellman equation characterizing optimal control in a reinforcement learning-based DNN training scenario is
\[
\frac{\partial V}{\partial t} + \min_u \left[ \nabla_\theta L \cdot u + p \cdot f(\theta, u) \right] = 0. \tag{5.12}
\]
This control-theoretic view can be combined with evolutionary dynamics, providing a mechanism for gradient-based optimization that incorporates selection and mutation principles. Another critical formulation involves Nash equilibria in non-cooperative learning settings, such as federated learning, where multiple learners optimize separate but interdependent loss functions. The Nash equilibrium conditions require that
\[
L_i(\theta_i^*, \theta_{-i}^*) \le L_i(\theta_i, \theta_{-i}^*) \quad \forall \theta_i, \; \forall i,
\]
where θ−i denotes the set of parameters for all players except i. A Nash equilibrium is attained when
\[
\frac{\partial L_i}{\partial \theta_i} = 0, \quad \forall i. \tag{5.16}
\]
Under convexity assumptions, Nash learning dynamics can be characterized by a gradient projection
algorithm:
\[
\theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta L(\theta^{(t)}). \tag{5.17}
\]
Beyond Nash equilibria, deep neural networks can also be analyzed through variational inequalities,
which generalize equilibrium concepts. Given a function F : Rd → Rd , the variational inequality
problem seeks θ∗ such that
⟨F (θ∗ ), θ − θ∗ ⟩ ≥ 0, ∀θ ∈ Rd . (5.18)
Evolutionary dynamics over the parameter distribution p(θ, t) can be modeled by a replicator-mutator equation of the form
\[
\frac{\partial p(\theta, t)}{\partial t} = p(\theta, t) \left( F(\theta) - \bar{F}(t) \right) + \sigma^2 \nabla^2 p(\theta, t),
\]
where F(θ) is a fitness function and F̄(t) its population average. The first term represents selection dynamics, where parameter configurations with higher fitness proliferate, while the second term models mutations, accounting for stochastic noise in learning updates. The steady-state solution to this equation corresponds to an evolutionarily stable distribution p*(θ), which satisfies
\[
\mathbb{E}_{\theta' \sim p^*(\theta)} \left[ F(\theta') \right] > \mathbb{E}_{\theta' \sim p(\theta)} \left[ F(\theta') \right] \tag{5.25}
\]
for any small perturbation p(θ) from p∗ (θ). The learning process in deep networks can thus be
viewed as an evolutionary competition in which weight distributions evolve toward stable equilibria
or limit cycles. The standard stochastic gradient descent (SGD) update rule, given by
θ(t+1) = θ(t) − η∇θ L(θ), (5.26)
can be reinterpreted as a mean-field approximation of the replicator equation with stochastic dif-
fusion,
\[
\frac{d\theta}{dt} = -\nabla_\theta L(\theta) + \sigma \xi(t), \tag{5.27}
\]
where ξ(t) represents a stochastic perturbation modeling noise in the gradient updates. In the limit
of infinitesimally small learning rate η, the SGD update can be seen as an Ornstein-Uhlenbeck
process,
dθ = −∇θ L(θ)dt + σdWt , (5.28)
where Wt denotes a Wiener process. The equilibrium distribution of network parameters is then
governed by the Fokker-Planck equation,
\[
\frac{\partial p(\theta, t)}{\partial t} = \nabla_\theta \cdot \left( p(\theta, t) \nabla_\theta L(\theta) \right) + \frac{\sigma^2}{2} \nabla^2 p(\theta, t). \tag{5.29}
\]
This equation describes the evolution of the probability density of weight configurations under
a combination of gradient descent and stochastic exploration, which can be interpreted as a
mutation-selection process in an evolutionary game. The steady-state solution p∗ (θ) corresponds
to a Boltzmann-Gibbs distribution,
\[
p^*(\theta) = \frac{e^{-\beta L(\theta)}}{Z}, \tag{5.30}
\]
where β ∝ 1/σ 2 acts as an inverse temperature parameter, and Z is the partition function ensuring
normalization. This establishes a direct connection between deep learning optimization and princi-
ples of statistical mechanics, whereby learning dynamics resemble an energy minimization process
subject to thermal noise.
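A small numerical check of the stationary distribution (5.30), a minimal sketch under the stated Langevin assumptions with an illustrative double-well loss, simulates (5.28) with Euler-Maruyama and compares the empirical histogram to e^{−βL}/Z:

```python
import numpy as np

# Euler-Maruyama simulation of d(theta) = -grad L dt + sigma dW, eq. (5.28),
# for the illustrative double-well loss L(theta) = (theta^2 - 1)^2.
rng = np.random.default_rng(0)
sigma, dt, n_steps = 0.8, 1e-3, 500_000
beta = 2.0 / sigma**2                      # inverse temperature, beta = 2/sigma^2

theta, samples = 0.0, np.empty(n_steps)
for k in range(n_steps):
    grad = 4.0 * theta * (theta**2 - 1.0)  # gradient of (theta^2 - 1)^2
    theta += -grad * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    samples[k] = theta

# Coarse comparison with the Boltzmann-Gibbs prediction (5.30) on a grid.
grid = np.linspace(-2, 2, 9)
gibbs = np.exp(-beta * (grid**2 - 1.0) ** 2)
gibbs /= np.trapz(gibbs, grid)             # approximate normalization Z
hist, _ = np.histogram(samples, bins=np.linspace(-2.25, 2.25, 10), density=True)
for g, p_emp, p_th in zip(grid, hist, gibbs):
    print(f"theta={g:+.1f}  empirical={p_emp:.3f}  gibbs={p_th:.3f}")
```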
In the case of generative adversarial networks (GANs), the evolutionary perspective is particularly insightful. A GAN consists of a generator G and a discriminator D,
which play a two-player zero-sum game governed by the minimax objective
\[
\min_G \max_D \; \mathbb{E}_{x \sim P_{data}} \left[ \log D(x) \right] + \mathbb{E}_{z \sim P_z} \left[ \log \left( 1 - D(G(z)) \right) \right]. \tag{5.31}
\]
The training dynamics of GANs exhibit co-evolutionary behavior, which can be captured by the
coupled replicator equations,
\[
\frac{d p_G(\theta_G, t)}{dt} = p_G(\theta_G, t) \left( F_G(\theta_G) - \bar{F}_G \right) + \sigma^2 \nabla^2 p_G(\theta_G, t), \tag{5.32}
\]
\[
\frac{d p_D(\theta_D, t)}{dt} = p_D(\theta_D, t) \left( F_D(\theta_D) - \bar{F}_D \right) + \sigma^2 \nabla^2 p_D(\theta_D, t). \tag{5.33}
\]
Unlike standard deep learning, GAN training often fails to converge to an equilibrium due to cycling
behavior. This phenomenon can be analyzed through Hamiltonian game dynamics,
\[
\frac{d\theta_G}{dt} = -\nabla_{\theta_G} L_G(\theta_G, \theta_D), \tag{5.34}
\]
\[
\frac{d\theta_D}{dt} = \nabla_{\theta_D} L_D(\theta_G, \theta_D). \tag{5.35}
\]
These equations describe a conservative dynamical system, where learning trajectories follow closed
orbits rather than converging to fixed points. This behavior is characteristic of zero-sum games
and is mathematically analogous to the Red Queen effect in evolutionary biology, where competing
species continually adapt without achieving a stable equilibrium.
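The closed orbits described above are easy to reproduce on the simplest bilinear game L(θG, θD) = θG θD; the sketch below (an illustrative toy, not from the original text) integrates (5.34)-(5.35) with explicit Euler and shows that the "radius" θG² + θD² is conserved by the continuous flow but inflates under discrete simultaneous updates:

```python
import numpy as np

# Simultaneous gradient descent/ascent on the bilinear game L = g * d,
# i.e. dg/dt = -d, dd/dt = g (eqs. 5.34-5.35 with L_G = L_D = g*d).
eta, steps = 0.01, 5000
g, d = 1.0, 0.0
for k in range(steps + 1):
    if k % 1000 == 0:
        print(f"step {k:5d}: radius^2 = {g*g + d*d:.4f}")  # conserved by exact flow
    g, d = g - eta * d, d + eta * g  # explicit Euler step

# The exact flow rotates (g, d) on a circle; each Euler step multiplies the
# squared radius by (1 + eta^2), illustrating why naive GAN updates cycle/diverge.
```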
Returning to the variational-inequality view of training, the first-order optimality condition reads
⟨∇L(θ∗ ), θ − θ∗ ⟩ ≥ 0, ∀θ ∈ Rd . (5.36)
This inequality characterizes the equilibrium conditions in the optimization landscape of deep
learning. It expresses the fact that at the optimal point θ∗ , any infinitesimal movement in the
parameter space does not yield a lower value of the loss function. The gradient descent update
rule, given by
θ(k+1) = θ(k) − η∇L(θ(k) ), (5.37)
where η > 0 is the learning rate, can be rewritten as a projection-based variational inequality
if constraints exist. Specifically, if the parameter space is constrained to a closed and convex set
Θ ⊂ Rd , then the variational inequality formulation of the training process is
⟨∇L(θ∗ ), θ − θ∗ ⟩ ≥ 0, ∀θ ∈ Θ. (5.38)
This variational inequality expresses the fact that θ∗ is a stationary point of the constrained opti-
mization problem, meaning that the projected gradient update
\[
\theta^{(k+1)} = P_\Theta \left( \theta^{(k)} - \eta \nabla L(\theta^{(k)}) \right)
\]
ensures that the sequence {θ(k)} converges to the solution of the variational inequality under suitable assumptions on the loss function, where PΘ denotes the Euclidean projection onto Θ. A key property in variational inequalities is monotonicity. The operator associated with the variational inequality is given by F(θ) = ∇L(θ). For convex loss functions, ∇L(θ) is a monotone operator, ensuring the uniqueness of the equilibrium point. If ∇L(θ) is strongly monotone, satisfying
\[
\langle \nabla L(\theta_1) - \nabla L(\theta_2), \theta_1 - \theta_2 \rangle \ge m \|\theta_1 - \theta_2\|^2 \quad \text{for some } m > 0,
\]
then the optimization process exhibits exponential convergence, meaning that the iterates sat-
isfy
∥θ(k) − θ∗ ∥ ≤ Ce−mk ∥θ(0) − θ∗ ∥, (5.43)
where C is a constant. This strong monotonicity property provides a rigorous foundation for
understanding the rate of convergence of deep neural network training. In adversarial training,
deep networks are often formulated as saddle-point problems of the form
\[
\min_\theta \max_z \; L(\theta, z),
\]
where z represents an adversarial perturbation. The equilibrium conditions in this case are described by a system of variational inequalities in (θ, z) jointly. For stochastic training, define the operator F(θ) = Eξ[∇θL(θ; ξ)], where ξ represents a random variable encoding the data distribution. The variational inequality for SGD is then
⟨F (θ∗ ), θ − θ∗ ⟩ ≥ 0, ∀θ ∈ Rd . (5.48)
Stochastic approximation methods ensure convergence to the equilibrium point under conditions
on the variance of the gradient estimator. Specifically, if F (θ) is co-coercive, meaning there exists
L > 0 such that
∥F (θ1 ) − F (θ2 )∥2 ≤ L⟨F (θ1 ) − F (θ2 ), θ1 − θ2 ⟩, (5.49)
then projected SGD satisfies
\[
\mathbb{E}\left[ \|\theta^{(k)} - \theta^*\|^2 \right] \le \frac{L}{2m} \cdot \frac{\sigma^2}{k}, \tag{5.50}
\]
where σ 2 is the variance of the stochastic gradient noise. This establishes a rigorous bound on the
convergence rate of SGD in terms of variational inequalities. The generalization properties of deep
networks can also be rigorously studied using variational inequalities. Given a distribution shift
perturbation δ, the parameter shift θ′ = θ + δ satisfies the generalization bound
\[
\| F(\theta + \delta) - F(\theta) \| \le C \|\delta\|,
\]
where the constant C is determined by the Lipschitz continuity of the gradient field.
ensures that the network’s performance remains stable under small perturbations in the data dis-
tribution. The variational inequality framework provides a unified perspective on optimization,
equilibrium, and generalization in deep learning, offering rigorous theoretical insights into the dy-
namics of training and stability of neural networks.
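As a concrete check of the variational-inequality characterization, the following sketch (a toy quadratic with box constraints, purely illustrative) runs the projected gradient update and verifies ⟨∇L(θ*), θ − θ*⟩ ≥ 0 at the limit point for random feasible θ:

```python
import numpy as np

# Projected gradient descent on L(theta) = 0.5 * ||theta - b||^2 over the box
# Theta = [0, 1]^2, followed by a numerical check of the VI condition (5.38).
rng = np.random.default_rng(0)
b = np.array([1.5, -0.3])     # unconstrained minimizer lies outside the box
theta = np.zeros(2)
eta = 0.1
for _ in range(500):
    grad = theta - b
    theta = np.clip(theta - eta * grad, 0.0, 1.0)   # projection P_Theta

grad_star = theta - b
print("theta* =", theta)                             # expect [1.0, 0.0]
# <grad L(theta*), theta - theta*> >= 0 for all feasible theta:
trials = rng.uniform(0.0, 1.0, size=(10000, 2))
print("min inner product:", np.min((trials - theta) @ grad_star))  # >= 0
```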
Neural network training can also be cast as a discrete-time optimal control problem in which the weights evolve under a control sequence {u_t},
\[
\theta_{t+1} = \theta_t + u_t. \tag{5.52}
\]
Let J(θ0, u) denote the cumulative loss function associated with a given control sequence {u_t}_{t=0}^{T}, defined as
\[
J(\theta_0, u) = \sum_{t=0}^{T} c(\theta_t, u_t), \tag{5.53}
\]
where c(θt, ut) is the instantaneous cost function. In the context of neural network training, the cost function is typically the empirical risk over a dataset D, given by
\[
c(\theta_t, u_t) = \frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} \ell\left( f_{\theta_t}(x), y \right),
\]
where ℓ is a loss function such as cross-entropy or mean squared error, and fθt(x) represents the
neural network’s output for an input x. The optimal control problem consists of finding a control
sequence {u_t^*}_{t=0}^{T} such that
\[
J^*(\theta_0) = \min_u J(\theta_0, u). \tag{5.55}
\]
Applying the principle of dynamic programming, the optimal cost-to-go function is defined recur-
sively as
\[
V(\theta_t) = \min_{u_t} \left\{ c(\theta_t, u_t) + V(\theta_{t+1}) \right\}. \tag{5.56}
\]
Considering the continuous-time limit where the weight updates are infinitesimal, i.e., ut → θ̇(t)dt,
we obtain the Hamilton-Jacobi-Bellman (HJB) equation
\[
\frac{\partial V}{\partial t} + \min_u \left[ \frac{\partial V}{\partial \theta} \cdot u + c(\theta, u) \right] = 0. \tag{5.58}
\]
The optimal control u*(t) satisfies
\[
u^*(t) = \arg\min_u \left[ \frac{\partial V}{\partial \theta} \cdot u + c(\theta, u) \right]. \tag{5.59}
\]
Alternatively, the problem can be analyzed using Pontryagin’s Maximum Principle. Define the
Hamiltonian
\[
H(\theta, u, \lambda) = c(\theta, u) + \lambda^\top (\theta + u), \tag{5.60}
\]
where λ is the costate variable that evolves according to the adjoint equation
\[
\frac{d\lambda}{dt} = -\frac{\partial H}{\partial \theta}. \tag{5.61}
\]
For optimality, the control must satisfy
\[
\frac{\partial H}{\partial u} = \frac{\partial c}{\partial u} + \lambda = 0. \tag{5.62}
\]
Solving for u∗ , we obtain
\[
u^* = -\left( \frac{\partial^2 c}{\partial u^2} \right)^{-1} \left( \frac{\partial c}{\partial u} + \lambda \right). \tag{5.63}
\]
In the reinforcement learning framework, the optimal control formulation is connected to policy
optimization, where the policy π(at |st ) determines the action at given the state st . The expected
return under policy π is given by
" T #
X
J(π) = Eτ ∼π γ t Rt , (5.64)
t=0
where Rt is the reward function and γ ∈ (0, 1] is the discount factor. The policy gradient theorem
states that
\[
\nabla_\theta J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi(a_t | s_t) \, Q^\pi(s_t, a_t) \right], \tag{5.65}
\]
where Qπ(st, at) is the state-action value function satisfying the Bellman equation
\[
Q^\pi(s_t, a_t) = \mathbb{E} \left[ R_t + \gamma \, \mathbb{E}_{a_{t+1} \sim \pi} \left[ Q^\pi(s_{t+1}, a_{t+1}) \right] \right].
\]
By applying the HJB equation, the optimal policy update rule is derived as
\[
\theta_{k+1} = \theta_k + \eta \nabla_\theta J(\pi_\theta),
\]
where η is the learning rate, often adaptively adjusted based on second-order curvature information such as in natural gradient descent. More formally, using the Fisher information matrix F(θ), the optimal update rule can be written as
\[
\theta_{k+1} = \theta_k + \eta \, F(\theta_k)^{-1} \nabla_\theta J(\pi_\theta).
\]
This establishes the link between optimal control and policy optimization in reinforcement learning,
providing a rigorous theoretical foundation for RL-based DNN training.
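To ground the policy gradient theorem (5.65), here is a minimal REINFORCE sketch on a hypothetical two-armed bandit (the arm rewards and step counts are illustrative assumptions, not from the text); the score-function estimate ∇θ log π(a)·R drives a softmax policy toward the better arm:

```python
import numpy as np

# REINFORCE on a 2-armed bandit with a softmax policy pi(a) = softmax(theta)_a.
# Gradient estimate per eq. (5.65), with Q replaced by the sampled reward R.
rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])     # illustrative expected rewards per arm
theta = np.zeros(2)
eta = 0.1

for step in range(2000):
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()
    a = rng.choice(2, p=pi)
    r = rng.normal(true_means[a], 0.1)          # sampled reward R_t
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0                       # grad of log softmax at action a
    theta += eta * grad_log_pi * r              # REINFORCE update

print("final policy:", pi)  # probability mass concentrates on arm 1
```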
6 Optimal transport theory in Deep Neural Networks
Optimal transport theory provides a mathematically rigorous framework for measuring distances
between probability distributions by solving a constrained optimization problem that minimizes
the cost of transporting mass from one distribution to another. Consider two probability measures
µ and ν defined on a common measurable space (Ω, F), where µ has support X ⊆ Rd and ν has
support Y ⊆ Rd . Given a cost function c : X × Y → R≥0 , the optimal transport problem seeks to
determine a transport plan that minimizes the total cost of transforming µ into ν. In the Monge
formulation, this problem is expressed as
\[
\inf_{T : \mathcal{X} \to \mathcal{Y}} \int_{\mathcal{X}} c(x, T(x)) \, d\mu(x), \tag{6.1}
\]
subject to the constraint that the transport map T satisfies the push-forward condition
T# µ = ν, (6.2)
which ensures that the measure ν is obtained by pushing forward µ through T , i.e., for any mea-
surable set A ⊆ Y,
ν(A) = µ(T −1 (A)). (6.3)
The Monge problem is often ill-posed since a transport map may not exist. To alleviate this issue,
the Kantorovich relaxation introduces a coupling γ between µ and ν, where γ is a probability
measure on X × Y satisfying the marginal constraints
\[
\int_{\mathcal{Y}} d\gamma(x, y) = d\mu(x), \qquad \int_{\mathcal{X}} d\gamma(x, y) = d\nu(y). \tag{6.4}
\]
The Kantorovich problem then reads
\[
\inf_{\gamma \in \Pi(\mu,\nu)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y) \, d\gamma(x, y),
\]
where Π(µ, ν) is the set of all couplings satisfying the marginal constraints. A commonly used cost
function is the p-th power of the Euclidean distance, i.e., c(x, y) = ∥x − y∥p , which leads to the
Wasserstein distance of order p, given by
\[
W_p(\mu, \nu) = \left( \inf_{\gamma \in \Pi(\mu,\nu)} \int_{\mathcal{X} \times \mathcal{Y}} \|x - y\|^p \, d\gamma(x, y) \right)^{1/p}. \tag{6.6}
\]
The Wasserstein distance defines a proper metric on the space of probability measures, provided
that µ and ν have finite p-th moments, i.e.,
\[
\int_{\mathcal{X}} \|x\|^p \, d\mu(x) < \infty, \qquad \int_{\mathcal{Y}} \|y\|^p \, d\nu(y) < \infty. \tag{6.7}
\]
In deep neural networks, optimal transport plays a crucial role in generative modeling, where it
provides a more stable alternative to divergence-based training objectives. In Wasserstein gener-
ative adversarial networks (WGANs), the standard Jensen-Shannon divergence is replaced by the
Wasserstein-1 distance, leading to the objective
\[
\min_G \max_{\|f\|_L \le 1} \; \mathbb{E}_{x \sim P_r} \left[ f(x) \right] - \mathbb{E}_{z \sim P_z} \left[ f(G(z)) \right],
\]
where f is a Lipschitz-1 function, Pr is the real data distribution, Pz is the latent distribution,
and G : Z → X is the generator function. The use of the Wasserstein distance improves gradient
behavior and mitigates mode collapse, which is a common issue in standard GAN training. In
domain adaptation, optimal transport is used to align source and target distributions by finding a
transport plan γ that minimizes the Wasserstein distance between them. Given a source distribution
Ps and a target distribution Pt , the goal is to find a transport map T such that
T# Ps ≈ Pt. (6.9)
In practice one solves an entropy-regularized transport problem,
\[
\inf_{\gamma \in \Pi(P_s, P_t)} \int c(x, y) \, d\gamma(x, y) + \lambda H(\gamma),
\]
where the regularization parameter λ > 0 controls the trade-off between accuracy and computational efficiency.
efficiency. The Sinkhorn algorithm, which iteratively updates the dual potentials using Bregman
projections, provides an efficient means of computing approximate solutions to the OT problem.
Optimal transport has also been applied to learning energy-based models, where it is used to define a
structured loss function for training probability models. In probabilistic autoencoders, OT distances
provide a principled approach for matching latent and data distributions. In Bayesian deep learning,
OT-based priors ensure that posterior distributions maintain smooth and stable representations,
improving generalization in uncertainty estimation tasks. The OT-based barycenter problem arises
in multi-modal learning, where the goal is to compute a central distribution that minimizes the
sum of Wasserstein distances to a set of given measures µ1, µ2, …, µn. The barycenter ν is defined as
\[
\nu^* = \arg\min_{\nu} \sum_{i=1}^{n} W_p(\nu, \mu_i)^p. \tag{6.12}
\]
Optimal transport thus provides a fundamental mathematical framework for distributional match-
ing in deep learning. By leveraging Wasserstein distances and entropy-regularized formulations,
OT enables more stable training, improved generalization, and better interpretability in neural
networks, making it an essential tool for modern deep learning methodologies.
The Jensen-Shannon divergence (JSD) between two distributions P and Q is defined through the mixture M = ½(P + Q) as JSD(P, Q) = ½ DKL(P∥M) + ½ DKL(Q∥M), where for discrete distributions
\[
D_{KL}(P \| M) = \sum_i P(i) \log \frac{P(i)}{M(i)},
\]
and P(i) and M(i) are the probabilities assigned to the i-th event by P and M, respectively. For continuous probability distributions, the KLD is expressed as:
\[
D_{KL}(P \| M) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{m(x)} \, dx, \tag{6.17}
\]
where p(x) and m(x) are the probability density functions of P and M , respectively. Substituting
the definition of M into the JSD, we obtain:
\[
JSD(P, Q) = \frac{1}{2} \sum_i P(i) \log \frac{P(i)}{\frac{1}{2}(P(i) + Q(i))} + \frac{1}{2} \sum_i Q(i) \log \frac{Q(i)}{\frac{1}{2}(P(i) + Q(i))}. \tag{6.18}
\]
For continuous distributions,
\[
JSD(P, Q) = \frac{1}{2} \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{\frac{1}{2}(p(x) + q(x))} \, dx + \frac{1}{2} \int_{-\infty}^{\infty} q(x) \log \frac{q(x)}{\frac{1}{2}(p(x) + q(x))} \, dx. \tag{6.19}
\]
The JSD can also be expressed in terms of Shannon entropy. Let H(P ) denote the Shannon entropy
of P , defined for discrete distributions as:
\[
H(P) = -\sum_i P(i) \log P(i). \tag{6.20}
\]
In these terms, JSD(P, Q) = H(½(P + Q)) − ½ H(P) − ½ H(Q). With logarithms taken base 2, the JSD is bounded:
0 ≤ JSD(P, Q) ≤ 1. (6.23)
The lower bound is achieved when P = Q, and the upper bound is approached when P and Q are
maximally different. The square root of the JSD is a metric, satisfying the triangle inequality:
\[
\sqrt{JSD(P, Q)} \le \sqrt{JSD(P, R)} + \sqrt{JSD(R, Q)}, \tag{6.24}
\]
for any probability distribution R. This property makes the JSD particularly useful in applications
requiring a distance metric, such as clustering and classification. The JSD is also related to the
total variation distance δ(P, Q) by the inequality:
\[
\delta(P, Q) \le \sqrt{2 \cdot JSD(P, Q)}. \tag{6.25}
\]
This relationship connects the JSD to other statistical distances, providing a bridge between
information-theoretic and measure-theoretic perspectives. The JSD is widely used in machine
learning, bioinformatics, and statistics due to its symmetry, boundedness, and interpretability. It
is particularly useful for comparing empirical distributions, as it avoids the infinite values that can
arise with the KLD. The JSD can also be generalized to compare more than two distributions. For
n distributions P1 , P2 , . . . , Pn , the generalized JSD is:
\[
JSD(P_1, P_2, \ldots, P_n) = H\!\left( \frac{1}{n} \sum_{i=1}^{n} P_i \right) - \frac{1}{n} \sum_{i=1}^{n} H(P_i). \tag{6.26}
\]
This generalization retains the properties of the standard JSD, including symmetry and bounded-
ness. The JSD is also closely related to the concept of mutual information, as it can be interpreted
as the mutual information between a random variable and a mixture distribution. Specifically, for
two distributions P and Q, the JSD equals the mutual information I(X; Z) between a sample X drawn from the mixture M = ½(P + Q) and the binary indicator Z of which component generated it. In the context of PAEs, this translates to minimizing the discrepancy between the encoded latent distribution qϕ(z|x) and the prior distribution p(z), as well as between the decoded data distribution pθ(x|z) and the true data distribution pdata(x).
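A short numerical check of the discrete formulas above, a minimal sketch with arbitrary example distributions, verifies symmetry and the base-2 bound:

```python
import numpy as np

def jsd(p, q):
    """Discrete Jensen-Shannon divergence, eq. (6.18), in bits (log base 2)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.1, 0.8])
print(jsd(p, q), jsd(q, p))          # symmetric
print(jsd(p, p))                     # 0 when P = Q
print(jsd([1, 0, 0], [0, 1, 0]))     # 1.0: maximally different (upper bound)
```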
Let µ and ν be two probability measures defined on metric spaces X and Y, respectively. The
OT problem seeks a coupling π ∈ Π(µ, ν) that minimizes the total transportation cost:
\[
W_c(\mu, \nu) = \inf_{\pi \in \Pi(\mu,\nu)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y) \, d\pi(x, y), \tag{6.28}
\]
where c(x, y) is a cost function, typically the Euclidean distance c(x, y) = ∥x − y∥p for p ≥ 1. The
p-Wasserstein distance is a special case of OT, defined as:
\[
W_p(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu,\nu)} \int_{\mathcal{X} \times \mathcal{Y}} \|x - y\|^p \, d\pi(x, y) \right)^{1/p}. \tag{6.29}
\]
In PAEs, the latent distribution qϕ (z|x) is often modeled as a Gaussian N (µϕ (x), Σϕ (x)), and the
prior p(z) is typically a standard Gaussian N (0, I). The OT distance between these distributions
can be computed using the 2-Wasserstein distance, which has a closed-form expression for Gaussian
distributions:
\[
W_2^2\left( \mathcal{N}(\mu_1, \Sigma_1), \mathcal{N}(\mu_2, \Sigma_2) \right) = \|\mu_1 - \mu_2\|^2 + \mathrm{Tr}\left( \Sigma_1 + \Sigma_2 - 2 \left( \Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2} \right)^{1/2} \right). \tag{6.30}
\]
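The closed form (6.30) is straightforward to evaluate numerically; the sketch below (a minimal illustration using SciPy's matrix square root, with arbitrary example Gaussians) computes W2²:

```python
import numpy as np
from scipy.linalg import sqrtm

def w2_squared_gaussian(mu1, S1, mu2, S2):
    """Squared 2-Wasserstein distance between Gaussians, eq. (6.30)."""
    S1_half = sqrtm(S1)
    cross = sqrtm(S1_half @ S2 @ S1_half)
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(S1 + S2 - 2 * np.real(cross)))

mu1, S1 = np.zeros(2), np.eye(2)
mu2, S2 = np.array([3.0, 0.0]), np.diag([2.0, 0.5])
print(w2_squared_gaussian(mu1, S1, mu2, S2))
# For commuting covariances this reduces to ||mu1-mu2||^2 + sum (sqrt(s1)-sqrt(s2))^2:
print(9.0 + (np.sqrt(2) - 1) ** 2 + (np.sqrt(0.5) - 1) ** 2)
```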
For the decoded data distribution pθ (x|z), the OT distance to the true data distribution pdata (x)
can be approximated using discrete OT methods. Given samples {x_i}_{i=1}^N from pdata(x) and {x̂_i}_{i=1}^N from pθ(x|z), the empirical OT problem becomes
\[
W_c(\hat{\mu}, \hat{\nu}) = \min_{\pi \in \Pi(\hat{\mu}, \hat{\nu})} \sum_{i,j} c(\hat{x}_i, x_j) \, \pi_{ij}, \tag{6.31}
\]
where µ̂ = (1/N) Σ_{i=1}^N δ_{x̂_i} and ν̂ = (1/N) Σ_{i=1}^N δ_{x_i} are empirical measures. The coupling π is a doubly stochastic matrix up to scaling, satisfying Σ_i π_ij = 1/N and Σ_j π_ij = 1/N. The OT distance can be incorporated
into the PAE objective function as a regularization term. The overall loss function L combines the
reconstruction loss and the OT-based regularization:
L(θ, ϕ) = Ex∼pdata [− log pθ (x|z)] + λW22 (qϕ (z|x), p(z)) + γWc (pθ (x|z), pdata (x)), (6.32)
where λ and γ are hyperparameters controlling the strength of the regularization terms. The
first term ensures accurate reconstruction, while the second and third terms enforce consistency
between the latent and prior distributions, and between the decoded and true data distributions,
respectively. The OT framework also allows for the use of entropy-regularized OT, which introduces
a regularization term to the OT problem to make it computationally tractable. The entropy-
regularized OT problem is given by:
\[
W_{c,\epsilon}(\mu, \nu) = \inf_{\pi \in \Pi(\mu,\nu)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y) \, d\pi(x, y) + \epsilon H(\pi), \tag{6.33}
\]
where H(π) = ∫_{X×Y} π(x, y) log π(x, y) dx dy is the entropy of the coupling π, and ϵ > 0 is the regularization parameter. This formulation leads to the Sinkhorn algorithm, which provides an efficient
way to compute approximate OT distances.
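A minimal Sinkhorn-Knopp implementation for the discrete problem (6.31) with entropic regularization (6.33) might look as follows (an illustrative sketch; the point-cloud data are arbitrary):

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iters=500):
    """Entropy-regularized OT between histograms a, b with cost matrix C."""
    K = np.exp(-C / eps)                 # Gibbs kernel K_ij = exp(-C_ij / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):             # alternating marginal-matching updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]      # optimal coupling pi_ij
    return P, float(np.sum(P * C))       # coupling and transport cost <P, C>

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(50, 2))   # samples standing in for {x_hat_i}
y = rng.normal(2.0, 1.0, size=(50, 2))   # samples standing in for {x_i}
C = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)  # squared distances
a = np.full(50, 1 / 50)
b = np.full(50, 1 / 50)
P, cost = sinkhorn(a, b, C)
print("entropic OT cost:", cost)         # roughly W2^2 between the two clouds
```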
In summary, OT distances offer a principled and mathematically rigorous approach for matching
latent and data distributions in PAEs. By minimizing the Wasserstein distance between distri-
butions, PAEs can achieve better alignment between the encoded latent space and the prior, as
well as between the decoded data and the true data distribution. This results in more robust and
interpretable models, with improved generative and reconstructive capabilities.
The learning objective is to maximize the marginal likelihood p(x), which can be expressed as:
\[
p(x) = \int p_\theta(x|z) \, p(z) \, dz. \tag{6.36}
\]
However, this integral is intractable due to the high-dimensional latent space. Instead, the evidence
lower bound (ELBO) is maximized, which is derived using variational inference:
log p(x) ≥ Eqϕ (z|x) [log pθ (x|z)] − DKL (qϕ (z|x)∥p(z)) = ELBO. (6.37)
The first term in the ELBO is the reconstruction loss, which encourages the model to accurately
reconstruct the input data. The second term is the Kullback-Leibler (KL) divergence between the
variational posterior qϕ (z|x) and the prior p(z), which regularizes the latent space to match the
prior distribution. The ELBO can be rewritten as:
\[
ELBO = \mathbb{E}_{q_\phi(z|x)} \left[ \log p_\theta(x|z) \right] - \frac{1}{2} \sum_{i=1}^{d} \left( \sigma_{\phi,i}^2(x) + \mu_{\phi,i}^2(x) - 1 - \log \sigma_{\phi,i}^2(x) \right), \tag{6.38}
\]
where µϕ,i(x) and σ²ϕ,i(x) are the mean and variance of the i-th dimension of the latent variable z.
The parameters ϕ and θ are optimized using stochastic gradient descent. The expectation term in
the ELBO is approximated using Monte Carlo sampling. For a single sample z(l) ∼ qϕ(z|x), the ELBO can be approximated as
\[
ELBO \approx \log p_\theta(x | z^{(l)}) - D_{KL}\left( q_\phi(z|x) \,\|\, p(z) \right).
\]
To enable backpropagation through the sampling process, the reparameterization trick is used. Specifically, z is reparameterized as
\[
z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon,
\]
where ϵ ∼ N (0, I) and ⊙ denotes element-wise multiplication. This allows gradients to flow through
the sampling process, enabling efficient optimization. The reconstruction loss log pθ (x|z) depends
on the type of data. For continuous data, it is often the negative log-likelihood of a Gaussian
distribution:
\[
\log p_\theta(x|z) = -\frac{1}{2} \| x - \mu_\theta(z) \|_2^2 + \text{constant}. \tag{6.41}
\]
For binary data, the Bernoulli distribution is used, and the reconstruction loss becomes:
\[
\log p_\theta(x|z) = \sum_{i=1}^{D} x_i \log \hat{x}_i + (1 - x_i) \log(1 - \hat{x}_i), \tag{6.42}
\]
where x̂i is the i-th element of the reconstructed data µθ (z). The KL divergence term DKL (qϕ (z|x)∥p(z))
can be computed analytically for Gaussian distributions. For a standard Gaussian prior p(z) =
N (0, I) and a Gaussian variational posterior qϕ (z|x) = N (z; µϕ (x), Σϕ (x)), the KL divergence is:
\[
D_{KL}\left( q_\phi(z|x) \,\|\, p(z) \right) = \frac{1}{2} \left( \mathrm{tr}(\Sigma_\phi(x)) + \|\mu_\phi(x)\|_2^2 - d - \log \det(\Sigma_\phi(x)) \right). \tag{6.43}
\]
If the variational posterior is diagonal, i.e., Σϕ(x) = diag(σ²ϕ,1(x), …, σ²ϕ,d(x)), the KL divergence simplifies to:
\[
D_{KL}\left( q_\phi(z|x) \,\|\, p(z) \right) = \frac{1}{2} \sum_{i=1}^{d} \left( \sigma_{\phi,i}^2(x) + \mu_{\phi,i}^2(x) - 1 - \log \sigma_{\phi,i}^2(x) \right). \tag{6.44}
\]
The optimization of the ELBO is performed using gradient-based methods. The gradients of the
ELBO with respect to ϕ and θ are computed using automatic differentiation. For a mini-batch of
data {x(1) , . . . , x(B) }, the stochastic gradient estimate of the ELBO is:
\[
\nabla_{\phi,\theta} ELBO \approx \frac{1}{B} \sum_{b=1}^{B} \nabla_{\phi,\theta} \left[ \log p_\theta(x^{(b)} | z^{(b)}) - D_{KL}\left( q_\phi(z|x^{(b)}) \,\|\, p(z) \right) \right], \tag{6.45}
\]
where z(b) ∼ qϕ (z|x(b) ). The probabilistic autoencoder can be extended to more complex architec-
tures, such as hierarchical models with multiple layers of latent variables. In such cases, the latent
variables are structured hierarchically, and the joint distribution is factorized as:
\[
p(x, z_1, \ldots, z_L) = p(x|z_1) \prod_{l=1}^{L-1} p(z_l | z_{l+1}) \, p(z_L), \tag{6.46}
\]
where z1 , . . . , zL are the latent variables at different levels of the hierarchy. The variational posterior
is also factorized hierarchically:
\[
q_\phi(z_1, \ldots, z_L | x) = q_\phi(z_1 | x) \prod_{l=2}^{L} q_\phi(z_l | z_{l-1}). \tag{6.47}
\]
The ELBO for hierarchical models includes additional KL divergence terms for each level of the
hierarchy:
\[
ELBO = \mathbb{E}_{q_\phi(z_1,\ldots,z_L|x)} \left[ \log p_\theta(x|z_1) \right] - \sum_{l=1}^{L} D_{KL}\left( q_\phi(z_l | z_{l-1}) \,\|\, p(z_l | z_{l+1}) \right). \tag{6.48}
\]
In summary, probabilistic autoencoders provide a principled framework for learning latent repre-
sentations of data by combining the strengths of autoencoders and probabilistic models. The ELBO
serves as the optimization objective, balancing reconstruction accuracy and regularization of the
latent space. The reparameterization trick enables efficient gradient-based optimization, and the
framework can be extended to hierarchical models for more complex data structures.
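The per-sample objective (6.38), with the reparameterization of (6.40) and the diagonal KL of (6.44), can be sketched in PyTorch as follows (a minimal illustrative encoder/decoder; layer sizes and names are assumptions, not the original authors' model):

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)   # outputs (mu, log sigma^2)
        self.dec = nn.Linear(z_dim, x_dim)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        eps = torch.randn_like(mu)
        z = mu + torch.exp(0.5 * logvar) * eps    # reparameterization trick
        x_hat = torch.sigmoid(self.dec(z))
        # Bernoulli reconstruction term, eq. (6.42), negated as a loss:
        rec = nn.functional.binary_cross_entropy(x_hat, x, reduction="sum")
        # Diagonal-Gaussian KL, eq. (6.44):
        kl = 0.5 * torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar)
        return rec + kl                           # negative ELBO

model = TinyVAE()
x = torch.rand(8, 784)                            # stand-in batch in [0, 1]
loss = model(x)
loss.backward()                                   # stochastic gradients, eq. (6.45)
print(float(loss))
```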
The family of p-Wasserstein distances is defined for p ≥ 1. The 2-Wasserstein distance is particularly significant due to its geometric
interpretation and its applications in optimal transport theory, statistics, and machine learning.
Given two probability measures µ and ν on X, the 2-Wasserstein distance W2 (µ, ν) is defined
as the infimum of the expected squared distance between random variables X and Y , where X is
distributed according to µ and Y is distributed according to ν, over all possible joint distributions
π of (X, Y ) with marginals µ and ν. Mathematically, this is expressed as:
\[
W_2(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu,\nu)} \int_{X \times X} d(x, y)^2 \, d\pi(x, y) \right)^{1/2}, \tag{6.49}
\]
where Π(µ, ν) denotes the set of all joint probability measures π on X × X with marginals µ and
ν, i.e., for any measurable sets A, B ⊂ X, π(A × X) = µ(A) and π(X × B) = ν(B). The distance d(x, y) is the metric on the space X, and the integral ∫_{X×X} d(x, y)² dπ(x, y) represents the expected
squared distance under the coupling π. The 2-Wasserstein distance can also be expressed in terms
of the cumulative distribution functions (CDFs) of µ and ν when X = R. Let Fµ and Fν be the
CDFs of µ and ν, respectively. Then, the 2-Wasserstein distance is given by:
\[
W_2(\mu, \nu) = \left( \int_0^1 \left| F_\mu^{-1}(q) - F_\nu^{-1}(q) \right|^2 dq \right)^{1/2}, \tag{6.50}
\]
where Fµ−1 and Fν−1 are the quantile functions (inverse CDFs) of µ and ν, respectively. This for-
mulation is particularly useful in one-dimensional settings, as it reduces the problem to computing
the integral of the squared difference between the quantile functions.
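For equal-size samples, the quantile formula (6.50) reduces to sorting: the empirical quantile functions pair the i-th smallest points. A minimal sketch with illustrative data:

```python
import numpy as np

def w2_empirical_1d(xs, ys):
    """Empirical W2 via eq. (6.50): pair sorted samples (equal sample sizes)."""
    xs, ys = np.sort(xs), np.sort(ys)
    return np.sqrt(np.mean((xs - ys) ** 2))

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, 100_000)
b = rng.normal(2.0, 1.0, 100_000)
# For N(0,1) vs N(2,1), W2 = 2 exactly (a pure shift of equal variances).
print(w2_empirical_1d(a, b))  # ~2.0
```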
In the case where µ and ν are absolutely continuous with respect to the Lebesgue measure, with probability density functions fµ
and fν , the 2-Wasserstein distance can be related to the optimal transport map T that pushes µ
forward to ν, i.e., T# µ = ν. The map T minimizes the transport cost:
\[
W_2(\mu, \nu) = \left( \int_X |x - T(x)|^2 \, d\mu(x) \right)^{1/2}. \tag{6.51}
\]
The optimal transport map T is often characterized by the Monge-Ampère equation, which relates the densities fµ and fν through the Jacobian determinant of T:
\[
f_\mu(x) = f_\nu(T(x)) \left| \det \nabla T(x) \right|.
\]
In higher dimensions, the 2-Wasserstein distance is more challenging to compute, but it can be
approximated using numerical methods such as linear programming, entropic regularization, or
Sinkhorn iterations. The dual formulation of the 2-Wasserstein distance, derived from Kantorovich
duality, provides another perspective:
\[
W_2^2(\mu, \nu) = \sup_{\phi, \psi} \left[ \int_X \phi(x) \, d\mu(x) + \int_X \psi(y) \, d\nu(y) \right], \tag{6.53}
\]
where the supremum is taken over all pairs of continuous functions ϕ and ψ on X satisfying the
constraint:
ϕ(x) + ψ(y) ≤ d(x, y)2 ∀x, y ∈ X. (6.54)
This dual formulation is particularly useful in theoretical analyses and provides a connection to
convex optimization. The 2-Wasserstein distance has several important properties, including sym-
metry, non-negativity, and the triangle inequality, making it a true metric on the space of probability
measures with finite second moments. It is also sensitive to the geometry of the underlying space
X, as it incorporates the metric d(x, y) directly into its definition. This geometric sensitivity makes
it particularly useful in applications such as image processing, where the spatial arrangement of
pixels is crucial.
In summary, the 2-Wasserstein distance is a powerful tool for comparing probability distributions,
with deep connections to optimal transport theory, convex analysis, and geometry. Its rigorous
mathematical formulation and rich theoretical properties make it a cornerstone of modern proba-
bility and statistics.
Let (X, d) be a complete separable metric space, and let Pp (X) denote the space of probability
measures on X with finite p-th moments. For P, Q ∈ Pp (X), the p-Wasserstein distance Wp (P, Q)
is defined as:
\[
W_p(P, Q) = \left( \inf_{\pi \in \Pi(P,Q)} \int_{X \times X} d(x, y)^p \, d\pi(x, y) \right)^{1/p}, \tag{6.55}
\]
where Π(P, Q) is the set of all couplings (joint distributions) with marginals P and Q. The Wasser-
stein distance induces a metric on Pp (X), endowing it with a geometric structure that is sensitive
to the underlying topology of X. This sensitivity ensures that the Wasserstein distance captures
not only global but also local differences between distributions, making it particularly suitable for
regularizing posterior distributions in Bayesian inference. In Bayesian Deep Learning, the prior p(θ)
and the posterior q(θ) are probability measures over the parameter space Θ. To ensure smoothness
and stability of the posterior, we introduce an OT-based regularization term into the variational
inference objective. Specifically, we augment the evidence lower bound (ELBO) with a Wasserstein
penalty:
\[
\mathcal{L}_{ELBO}(q) = \mathbb{E}_{q(\theta)} \left[ \log p(\mathcal{D} \mid \theta) \right] - KL(q(\theta) \,\|\, p(\theta)) + \lambda W_p(p(\theta), q(\theta)), \tag{6.56}
\]
where λ > 0 is a regularization hyperparameter, and KL(q(θ) ∥ p(θ)) is the Kullback-Leibler
divergence. The Wasserstein term Wp (p(θ), q(θ)) enforces geometric consistency between the prior
and the posterior, ensuring that the posterior does not deviate excessively from the prior in a
manner that respects the underlying metric structure of Θ. The smoothness of the posterior is
guaranteed by the properties of the Wasserstein distance. Specifically, the Wasserstein distance is
Lipschitz-continuous with respect to perturbations in the distributions. For any P, Q, R ∈ Pp(X), the triangle inequality holds:
\[
W_p(P, R) \le W_p(P, Q) + W_p(Q, R).
\]
This inequality ensures that small changes in the data distribution D or the prior p(θ) lead to
proportionally small changes in the posterior q(θ), thereby promoting stability. Furthermore, the
Wasserstein distance is convex in its arguments, which facilitates efficient optimization and guaran-
tees the existence of a unique minimizer under appropriate conditions. To compute the Wasserstein
distance in practice, we often employ entropic regularization, which transforms the OT problem into
a strictly convex optimization problem. The entropic regularized Wasserstein distance Wp,ϵ (P, Q)
is defined as:
\[
W_{p,\epsilon}(P, Q) = \inf_{\pi \in \Pi(P,Q)} \int_{X \times X} d(x, y)^p \, d\pi(x, y) + \epsilon H(\pi), \tag{6.58}
\]
where H(π) = ∫_{X×X} π(x, y) log π(x, y) dx dy is the entropy of the coupling π, and ϵ > 0 is the
regularization parameter. The entropic regularization ensures that the optimal coupling π ∗ is
unique and has the form:
\[
\pi^*(x, y) = \exp\left( \frac{f(x) + g(y) - d(x, y)^p}{\epsilon} \right), \tag{6.59}
\]
where f and g are dual potentials that satisfy the Schrödinger system of equations. This formulation
allows for efficient computation using the Sinkhorn-Knopp algorithm, which iteratively updates the
dual potentials f and g. The stability and smoothness of the posterior are further reinforced by
the dual formulation of the Wasserstein distance. By the Kantorovich-Rubinstein duality, the
1-Wasserstein distance can be expressed as:
\[
W_1(P, Q) = \sup_{f \in \mathrm{Lip}_1(X)} \left[ \int_X f(x) \, dP(x) - \int_X f(x) \, dQ(x) \right], \tag{6.60}
\]
where Lip1 (X) is the set of 1-Lipschitz functions on X. This dual formulation highlights the role
of Lipschitz continuity in controlling the smoothness of the posterior, as the Wasserstein distance
penalizes functions that vary too rapidly.
In summary, OT-based priors ensure that posterior distributions maintain smooth and stable rep-
resentations by leveraging the geometric and analytical properties of the Wasserstein distance. The
Wasserstein distance provides a rigorous framework for regularizing Bayesian inference, ensuring
that the posterior remains close to the prior in a geometrically meaningful way. This approach is
computationally tractable due to entropic regularization and the Sinkhorn-Knopp algorithm, and it
is theoretically grounded in measure theory, convex analysis, and functional analysis. The resulting
Bayesian framework is robust to noise and distributional shifts, leading to improved generalization
in uncertainty estimation tasks.
The OT distance between the data distribution and the model distribution is
\[
W_c(P_{data}, P_\theta) = \inf_{\gamma \in \Gamma(P_{data}, P_\theta)} \int_{X \times X} c(x, y) \, d\gamma(x, y),
\]
where Γ(Pdata, Pθ) denotes the set of all joint probability measures γ on X × X with marginals Pdata and Pθ, and c(x, y) is a cost function, typically chosen as c(x, y) = d(x, y)^p for p ≥ 1. For p = 2, this yields the squared Euclidean cost, and the corresponding OT distance is the 2-Wasserstein distance:
\[
W_2(P_{data}, P_\theta) = \left( \inf_{\gamma \in \Gamma(P_{data}, P_\theta)} \int_{X \times X} \|x - y\|_2^2 \, d\gamma(x, y) \right)^{1/2}. \tag{6.62}
\]
In the context of energy-based models, the model distribution Pθ (x) is defined via an energy function
Eθ(x) and the Boltzmann-Gibbs distribution:
\[
P_\theta(x) = \frac{1}{Z(\theta)} \exp(-E_\theta(x)), \tag{6.63}
\]
\[
Z(\theta) = \int_X \exp(-E_\theta(x)) \, dx, \tag{6.64}
\]
where Z(θ) is the partition function. The goal is to minimize the discrepancy between Pdata and
Pθ , which can be achieved by minimizing the Wasserstein distance W2 (Pdata , Pθ ). The Kantorovich
duality theorem provides a dual formulation of the OT problem:
\[
W_2^2(P_{data}, P_\theta) = \sup_{\varphi, \psi} \left[ \int_X \varphi(x) \, dP_{data}(x) + \int_X \psi(y) \, dP_\theta(y) \right], \tag{6.65}
\]
subject to the constraint φ(x)+ψ(y) ≤ ∥x−y∥2 for all x, y ∈ X. Here, φ and ψ are the Kantorovich
potentials, which are related via the c-transform
\[
\psi(y) = \inf_{x \in X} \left[ \|x - y\|^2 - \varphi(x) \right].
\]
The Kantorovich potentials φ and ψ can be parameterized using neural networks, leading to the Wasserstein Generative Adversarial Network (WGAN) framework. In this setting, the discriminator learns φ, while the generator minimizes the Wasserstein distance. The gradient of the Wasserstein distance with respect to the model parameters θ is given by
\[
\nabla_\theta W_2^2(P_{data}, P_\theta) = \mathbb{E}_{x \sim P_{data}} \left[ \nabla_\theta \varphi_\theta(x) \right] - \mathbb{E}_{y \sim P_\theta} \left[ \nabla_\theta \varphi_\theta(y) \right],
\]
where φθ is the Kantorovich potential parameterized by θ. This gradient can be estimated using
Monte Carlo sampling from Pdata and Pθ , enabling stochastic optimization. To enhance computa-
tional efficiency and stability, entropy regularization is often introduced, leading to the Sinkhorn
divergence:
\[
W_{2,\epsilon}^2(P_{data}, P_\theta) = \inf_{\gamma \in \Gamma(P_{data}, P_\theta)} \left[ \int_{X \times X} \|x - y\|_2^2 \, d\gamma(x, y) + \epsilon H(\gamma) \right], \tag{6.69}
\]
where
\[
H(\gamma) = \int_{X \times X} \gamma(x, y) \log \gamma(x, y) \, dx \, dy \tag{6.70}
\]
is the entropy of the coupling γ, and ϵ > 0 is the regularization parameter. The Sinkhorn di-
vergence provides a smoothed approximation of the Wasserstein distance, which is particularly
advantageous in high-dimensional settings. The optimization problem for learning EBMs using OT
can be formulated as:
\[
\min_\theta \; W_{2,\epsilon}^2(P_{data}, P_\theta) + \lambda R(\theta), \tag{6.71}
\]
θ
where R(θ) is a regularization term, such as ∥θ∥22 , and λ is a hyperparameter. The gradient of the
Sinkhorn divergence with respect to θ is:
\[
\nabla_\theta W_{2,\epsilon}^2(P_{data}, P_\theta) = \mathbb{E}_{x \sim P_{data}} \left[ \nabla_\theta \varphi_\theta(x) \right] - \mathbb{E}_{y \sim P_\theta} \left[ \nabla_\theta \varphi_\theta(y) \right] + \epsilon \nabla_\theta H(\gamma_\theta), \tag{6.72}
\]
where γθ is the optimal coupling under entropy regularization. This gradient can be computed
efficiently using the Sinkhorn-Knopp algorithm, which iteratively updates the coupling γ and the
potentials φ.
In the discrete setting, the Wasserstein distance between histograms a and b with cost matrix C is W(a, b) = min_{P∈Π(a,b)} ⟨P, C⟩, where Π(a, b) is the set of all coupling matrices satisfying the marginal constraints. The entropy-regularized version of the Wasserstein distance, known as the Sinkhorn distance, is given by:
\[
W_\epsilon(a, b) = \langle P^*, C \rangle - \epsilon H(P^*), \tag{6.80}
\]
where P ∗ is the solution to the entropy-regularized optimal transport problem. The Sinkhorn-
Knopp algorithm provides an efficient way to compute P ∗ and, consequently, the Sinkhorn distance.
The algorithm can also be extended to handle unbalanced optimal transport problems, where
the source and target distributions do not necessarily have the same total mass. In this case,
the optimization problem is modified to include additional penalty terms for deviations from the
marginal constraints:
\[
\min_{P \in \mathbb{R}^{n \times m}_{+}} \; \langle P, C \rangle - \epsilon H(P) + \lambda_1 \, KL(P \mathbf{1}_m \| a) + \lambda_2 \, KL(P^\top \mathbf{1}_n \| b), \tag{6.81}
\]
where KL(·∥·) is the Kullback-Leibler divergence, and λ1, λ2 > 0 are regularization parameters. The Sinkhorn-Knopp algorithm can be adapted to solve this problem by modifying the update rules for u and v:
\[
u_i^{(k+1)} = \frac{a_i}{\sum_{j=1}^{m} K_{ij} v_j^{(k)}} + \lambda_1, \qquad v_j^{(k+1)} = \frac{b_j}{\sum_{i=1}^{n} K_{ij} u_i^{(k+1)}} + \lambda_2. \tag{6.82}
\]
This extension allows the algorithm to handle a broader range of applications in deep learning,
such as semi-supervised learning and data augmentation, where the source and target distributions
may not be perfectly aligned.
The discrete optimal transport problem is
\[
OT(a, b) = \min_{P \in U(a,b)} \langle P, C \rangle,
\]
where U(a, b) = {P ∈ R^{n×m}_+ : P1_m = a, P^⊤1_n = b} is the set of all valid transport plans, 1n and 1m are vectors of ones, and ⟨P, C⟩ = Σ_{i=1}^n Σ_{j=1}^m P_ij C_ij is the Frobenius inner product. The Sinkhorn distance introduces an entropic regularization term to this problem, resulting in the following optimization problem:
\[
OT_\lambda(a, b) = \min_{P \in U(a,b)} \langle P, C \rangle - \lambda H(P), \tag{6.84}
\]
The corresponding dual problem reads
\[
\max_{f \in \mathbb{R}^n, \, g \in \mathbb{R}^m} \; \langle f, a \rangle + \langle g, b \rangle - \lambda \sum_{i,j} \exp\left( \frac{f_i + g_j - C_{ij}}{\lambda} \right),
\]
where f ∈ R^n and g ∈ R^m are dual variables. The optimal dual variables f* and g* are related to the scaling vectors u* and v* by:
\[
u^* = \exp\left( \frac{f^*}{\lambda} \right) \quad \text{and} \quad v^* = \exp\left( \frac{g^*}{\lambda} \right). \tag{6.90}
\]
The Sinkhorn distance can thus be computed using either the primal or dual formulation, depending
on the specific application. The dual formulation is particularly useful for deriving theoretical
properties of the Sinkhorn distance, such as its convexity and smoothness with respect to the
input distributions a and b. The Sinkhorn distance satisfies several key mathematical properties.
First, it is symmetric, meaning that Sinkhornλ (a, b) = Sinkhornλ (b, a). Second, it satisfies the
triangle inequality, making it a valid metric on the space of probability distributions. Third, it is
differentiable with respect to the input distributions a and b, which is useful for gradient-based
optimization in machine learning applications. Finally, the Sinkhorn distance is computationally
efficient, with a time complexity of O(nm) per iteration of the Sinkhorn-Knopp algorithm, making it
scalable to high-dimensional problems. The Sinkhorn distance can also be generalized to continuous
probability measures using the Kantorovich formulation of optimal transport. Let µ and ν be
probability measures defined on metric spaces X and Y, respectively, and let c : X × Y → R+ be a
continuous cost function. The entropically regularized optimal transport problem in the continuous
setting is given by:
\[
OT_\lambda(\mu, \nu) = \min_{\pi \in \Pi(\mu,\nu)} \int_{\mathcal{X} \times \mathcal{Y}} c(x, y) \, d\pi(x, y) - \lambda H(\pi), \tag{6.91}
\]
This formulation extends the Sinkhorn distance to continuous probability measures, enabling its
application to a broader class of problems in probability theory, statistics, and machine learning.
The Sinkhorn distance thus provides a rigorous and computationally efficient framework for mea-
suring the dissimilarity between probability distributions, combining the theoretical foundations of
optimal transport with the practical advantages of entropic regularization.
In WGANs, the distance of interest is
\[
W_1(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x,y) \sim \gamma} \left[ \|x - y\| \right].
\]
Here, Π(Pr, Pg) represents the set of all joint distributions γ(x, y) with marginals Pr and Pg, and
the infimum is taken over all such joint distributions. The Wasserstein-1 distance quantifies the
minimal cost required to transport mass from Pr to Pg , where the cost is proportional to the
Euclidean distance ∥x−y∥. In WGANs, the generator G and the critic D are optimized to minimize
and maximize the Wasserstein-1 distance, respectively. The critic D is trained to approximate the
Wasserstein-1 distance between Pr and Pg. The objective function for the critic is given by:
\[
\mathcal{L}_D = \mathbb{E}_{x \sim P_r} \left[ D(x) \right] - \mathbb{E}_{z \sim p_z(z)} \left[ D(G(z)) \right].
\]
Here, z is a random noise vector sampled from a prior distribution pz (z), and G(z) is the generated
sample. To ensure that the critic D is a 1-Lipschitz function, a constraint is imposed on its gradients.
This constraint is typically enforced using a gradient penalty term, leading to the modified critic
loss function:
\[
\mathcal{L}_D^{GP} = \mathbb{E}_{x \sim P_r} \left[ D(x) \right] - \mathbb{E}_{z \sim p_z(z)} \left[ D(G(z)) \right] + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}} \left[ \left( \| \nabla_{\hat{x}} D(\hat{x}) \|_2 - 1 \right)^2 \right]. \tag{6.97}
\]
In this equation, λ is a hyperparameter controlling the strength of the gradient penalty, and x̂ is
sampled uniformly along straight lines connecting pairs of points sampled from Pr and Pg . The term
(∥∇x̂ D(x̂)∥2 − 1)2 penalizes deviations from the 1-Lipschitz constraint, ensuring that the critic’s
gradients have a norm of at most 1. The generator G is trained to minimize the Wasserstein-1
distance by maximizing the critic’s output on generated samples. The objective function for the
generator is given by:
LG = −Ez∼pz (z) [D(G(z))] (6.98)
The optimization process alternates between updating the critic and the generator. The critic is
updated multiple times for each update of the generator to ensure that it remains close to the
optimal 1-Lipschitz function. The parameter updates for the critic and generator are given by:
\[
\theta_D \leftarrow \theta_D - \eta_D \nabla_{\theta_D} \mathcal{L}_D^{GP}, \tag{6.99}
\]
\[
\theta_G \leftarrow \theta_G - \eta_G \nabla_{\theta_G} \mathcal{L}_G. \tag{6.100}
\]
Here, θD and θG are the parameters of the critic and generator, respectively, and ηD and ηG are
the learning rates for the critic and generator. The training continues until the generator pro-
duces samples that are indistinguishable from real data according to the critic. The Wasserstein-
1 distance provides several theoretical advantages over traditional GAN objectives, such as the
1 distance provides several theoretical advantages over traditional GAN objectives, such as the
Jensen-Shannon divergence used in the original GAN formulation. The Wasserstein-1 distance is
continuous and differentiable almost everywhere, which leads to more stable training dynamics.
Additionally, the Wasserstein-1 distance correlates better with sample quality, providing a mean-
ingful metric for evaluating the performance of the generator. The mathematical properties of the
Wasserstein-1 distance, including its sensitivity to the geometry of the underlying space and its
ability to provide meaningful gradients even when the supports of Pr and Pg are disjoint, make
WGANs a powerful tool for generative modeling.
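The gradient-penalty term of (6.97) is typically implemented with an interpolation between real and generated batches; below is a minimal PyTorch sketch (illustrative critic and shapes; not the original authors' code). Note the critic's maximized objective is negated to form a minimized loss:

```python
import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))

def gradient_penalty(critic, real, fake, lam=10.0):
    """Penalty of eq. (6.97): (||grad_xhat D(xhat)||_2 - 1)^2 on interpolates."""
    eps = torch.rand(real.size(0), 1)                  # uniform mixing coefficients
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    d_hat = critic(x_hat)
    grads = torch.autograd.grad(outputs=d_hat.sum(), inputs=x_hat,
                                create_graph=True)[0]  # grad of D w.r.t. x_hat
    return lam * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

real = torch.randn(32, 2) + 2.0          # stand-in real batch
fake = torch.randn(32, 2)                # stand-in generated batch
# Negated critic objective plus penalty, to be minimized (cf. eq. 6.99):
loss_d = critic(fake).mean() - critic(real).mean() + gradient_penalty(critic, real, fake)
loss_d.backward()
print(float(loss_d))
```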
Equivalently, by Kantorovich-Rubinstein duality,
\[
W_1(P_r, P_g) = \sup_{D \in \mathcal{D}} \; \mathbb{E}_{x \sim P_r} \left[ D(x) \right] - \mathbb{E}_{z \sim p_z(z)} \left[ D(G(z)) \right],
\]
where D is the set of 1-Lipschitz functions. The Wasserstein-1 distance W1(Pr, Pg) is approximated
by the critic D, and the generator G is trained to minimize this distance. The use of the Wasserstein-
1 distance in WGANs provides a theoretically grounded framework for generative modeling, leading
to improved stability and performance compared to traditional GANs. The mathematical rigor of
the Wasserstein-1 distance and its properties ensure that WGANs are well-suited for a wide range
of generative tasks.
The Kantorovich duality states that the optimal transport cost can equivalently be expressed in terms of potential functions φ : X → R and ψ : Y → R satisfying the inequality
\[
\varphi(x) + \psi(y) \le c(x, y) \quad \forall (x, y) \in \mathcal{X} \times \mathcal{Y}.
\]
Under appropriate conditions, the optimal transport cost is given by the dual formulation
\[
\sup_{\varphi, \psi} \left[ \int_{\mathcal{X}} \varphi(x) \, d\mu(x) + \int_{\mathcal{Y}} \psi(y) \, d\nu(y) \right]. \tag{6.105}
\]
For the specific case of the Wasserstein-1 distance, where c(x, y) = ∥x − y∥, the dual problem simplifies to
\[
W_1(\mu, \nu) = \sup_{\|\varphi\|_L \le 1} \left[ \int_{\mathcal{X}} \varphi(x) \, d\mu(x) - \int_{\mathcal{Y}} \varphi(y) \, d\nu(y) \right]. \tag{6.108}
\]
This adversarial framework provides a more stable training procedure than standard GANs, which
rely on the Jensen-Shannon divergence. The Kantorovich duality ensures that the discriminator
function learns a transport potential that characterizes the optimal transport map, guiding the
generator toward minimizing the Wasserstein distance between the real and generated distributions.
Beyond generative modeling, optimal transport theory is also applied in supervised learning, where
the Wasserstein distance serves as a loss function for comparing distributions in high-dimensional
spaces. When computational efficiency is required, entropic regularization techniques introduce an
additional entropy penalty term H(γ), yielding the regularized transport problem
\[
\inf_{\gamma \in \Pi(\mu,\nu)} \left[ \int_{\mathcal{X} \times \mathcal{Y}} c(x, y) \, d\gamma(x, y) + \epsilon H(\gamma) \right], \tag{6.111}
\]
where
\[
H(\gamma) = \int_{\mathcal{X} \times \mathcal{Y}} \gamma(x, y) \log \gamma(x, y) \, dx \, dy. \tag{6.112}
\]
X ×Y
This modification allows for efficient computation via Sinkhorn iterations, leveraging the dual
formulation
\[
\sup_{\varphi, \psi} \left[ \int_{\mathcal{X}} \varphi(x) \, d\mu(x) + \int_{\mathcal{Y}} \psi(y) \, d\nu(y) - \epsilon \int_{\mathcal{X} \times \mathcal{Y}} e^{(\varphi(x) + \psi(y) - c(x,y))/\epsilon} \, dx \, dy \right]. \tag{6.113}
\]
Such approaches are particularly useful in deep learning, where high-dimensional transport prob-
lems arise in applications such as domain adaptation, clustering, and metric learning. The theoret-
ical underpinnings of Kantorovich duality thus play a critical role in designing robust algorithms
for optimizing probability distributions in high-dimensional spaces, ensuring efficient convergence
and improved generalization performance in deep neural networks.
The entropy-regularized transport problem is
\[
W_c^\lambda(\mu, \nu) = \inf_{\gamma \in \Pi(\mu,\nu)} \left[ \int_{\mathcal{X} \times \mathcal{Y}} c(x, y) \, d\gamma(x, y) + \lambda H(\gamma) \right],
\]
where Π(µ, ν) is the space of joint probability distributions on X × Y with marginals µ and ν, respectively, satisfying the marginalization conditions
\[
\int_{\mathcal{Y}} d\gamma(x, y) = d\mu(x), \qquad \int_{\mathcal{X}} d\gamma(x, y) = d\nu(y). \tag{6.115}
\]
The dual potentials ϕ(x) and ψ(y) satisfy the iterative update equations
\[
\phi^{(k+1)}(x) = -\lambda \log \int_{\mathcal{Y}} e^{\frac{\psi^{(k)}(y) - c(x,y)}{\lambda}} \, d\nu(y), \tag{6.121}
\]
\[
\psi^{(k+1)}(y) = -\lambda \log \int_{\mathcal{X}} e^{\frac{\phi^{(k)}(x) - c(x,y)}{\lambda}} \, d\mu(x). \tag{6.122}
\]
These updates converge exponentially to the optimal dual potentials, enabling the efficient com-
putation of entropy-regularized transport distances. In the discrete setting, where the probability
measures µ and ν are represented as vectors u ∈ Rn and v ∈ Rm , the entropy-regularized optimal
transport problem reduces to a matrix scaling problem. Given a cost matrix C ∈ R^{n×m}, the optimal transport matrix T satisfies
\[
T_{ij} = \exp\left( -\frac{C_{ij}}{\lambda} \right) u_i v_j. \tag{6.123}
\]
This leads to the Sinkhorn-Knopp iteration
\[
u^{(k+1)} = \frac{\mu}{K v^{(k)}}, \qquad v^{(k+1)} = \frac{\nu}{K^\top u^{(k+1)}}, \tag{6.124}
\]
where Kij = e−Cij /λ and division is element-wise. These updates ensure fast convergence and nu-
merical stability, making entropy-regularized optimal transport suitable for large-scale applications in deep learning.
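For small λ the scaling vectors in (6.124) under- or overflow, so practical implementations discretize the dual updates (6.121)-(6.122) in the log domain. A minimal sketch with illustrative data, using logsumexp:

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_log(a, b, C, lam=0.05, n_iters=200):
    """Log-domain Sinkhorn: discretized dual updates (6.121)-(6.122)."""
    f = np.zeros(len(a))
    g = np.zeros(len(b))
    log_a, log_b = np.log(a), np.log(b)
    for _ in range(n_iters):
        # f_i <- -lam * log sum_j b_j * exp((g_j - C_ij)/lam), and symmetrically:
        f = -lam * logsumexp((g[None, :] - C) / lam + log_b[None, :], axis=1)
        g = -lam * logsumexp((f[:, None] - C) / lam + log_a[:, None], axis=0)
    P = np.exp((f[:, None] + g[None, :] - C) / lam) * a[:, None] * b[None, :]
    return P, float(np.sum(P * C))

rng = np.random.default_rng(0)
C = rng.random((40, 30))
a = np.full(40, 1 / 40)
b = np.full(30, 1 / 30)
P, cost = sinkhorn_log(a, b, C)
print(cost, P.sum(axis=1)[:3])   # transport cost and row marginals ~ a
```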
In the context of deep neural networks, entropy-regularized optimal transport
plays a crucial role in generative modeling, domain adaptation, and adversarial training. Given a
parameterized generator Gθ (z) with latent variable z ∼ PZ , the entropy-regularized Wasserstein
loss for training a generative model is given by
\[
L(\theta) = W_c^\lambda(P_{G_\theta}, P_X) = \inf_{\gamma \in \Pi(P_{G_\theta}, P_X)} \left[ \int c(G_\theta(z), x) \, d\gamma(z, x) + \lambda H(\gamma) \right]. \tag{6.125}
\]
This formulation enables stable gradient-based training by ensuring smoothness in the transport
plan. Furthermore, in domain adaptation, the entropy-regularized Wasserstein distance is mini-
mized between the source and target feature distributions PS and PT , leading to the optimization
problem
\[
L_{OT} = W_c^\lambda(P_S, P_T) = \inf_{\gamma \in \Pi(P_S, P_T)} \left[ \int c(x, y) \, d\gamma(x, y) + \lambda H(\gamma) \right]. \tag{6.126}
\]
This objective ensures that the learned representation preserves geometric consistency between
domains while mitigating domain shift. Thus, entropy regularization in optimal transport provides
a mathematically rigorous, computationally efficient, and theoretically well-grounded framework for
deep learning applications, ensuring convexity, uniqueness, and numerical stability while enabling
large-scale optimization in high-dimensional probability spaces.
which ensures that mass is conserved. Given a cost function c : X × Y → R, the total transport cost associated with the map T is given by the functional
\[
C[T] = \int_{\mathcal{X}} c(x, T(x)) \, f(x) \, dx. \tag{6.128}
\]
Monge's problem is then
\[
\inf_{T : T_{\#}\mu = \nu} C[T],
\]
where the infimum is taken over all measurable transport maps satisfying the pushforward condition.
The primary difficulty in solving Monge’s problem lies in the fact that a transport map may not
exist in general, particularly when the measure µ is more spread out than ν, or when mass-splitting
is required, which motivates the relaxation of Monge’s problem to the Kantorovich formulation.
Instead of restricting attention to deterministic transport maps, Kantorovich introduced the notion
of transport plans, which are probability measures γ on the product space X × Y satisfying the
marginal constraints
\[
\int_{\mathcal{Y}} d\gamma(x, y) = d\mu(x), \qquad \int_{\mathcal{X}} d\gamma(x, y) = d\nu(y). \tag{6.130}
\]
The set of all such couplings is denoted as Π(µ, ν), which consists of all probability measures
γ(x, y) whose projections onto X and Y coincide with µ and ν, respectively. The Kantorovich
optimal transport problem is then formulated as the minimization problem
\[
\inf_{\gamma \in \Pi(\mu,\nu)} \int_{X \times Y} c(x, y) \, d\gamma(x, y). \tag{6.131}
\]
Under mild conditions, such as compactness of X and Y and continuity of c(x, y), existence of an
optimal transport plan γ ∗ is guaranteed. If γ ∗ is supported on the graph of a function y = T (x),
then the optimal plan corresponds to a Monge-type transport. The problem of characterizing such
solutions leads naturally to Kantorovich duality, which provides a variational formulation of
optimal transport in terms of potentials φ : X → R and ψ : Y → R. The dual problem is expressed
as Z Z
sup φ(x)dµ(x) + ψ(y)dν(y) (6.132)
φ,ψ X Y
The optimality conditions for the Kantorovich dual problem imply the complementary slackness
condition
φ(x) + ψ(y) = c(x, y) for (x, y) ∈ supp(γ ∗ ), (6.134)
which characterizes the support of the optimal transport plan. When the cost function is the
squared Euclidean distance, c(x, y) = ∥x − y∥2 , the transport problem reduces to finding a con-
vex function u(x) whose gradient defines the optimal transport map. This leads to the Monge-
Ampère equation
\[
\det D^2 u(x) = \frac{f(x)}{g(\nabla u(x))}, \tag{6.135}
\]
which provides a geometric interpretation of optimal transport: the transport map is given
by the gradient of a convex function, ensuring volume preservation under mass transport. In this
setting, the transport map satisfies
T (x) = ∇u(x), (6.136)
which is the Brenier map for the quadratic cost function. The induced transport cost is given by
the Wasserstein-2 distance
\[
W_2(\mu, \nu) = \left( \inf_{\gamma \in \Pi(\mu,\nu)} \int_{X \times Y} \|x - y\|^2 \, d\gamma(x, y) \right)^{1/2}. \tag{6.137}
\]
These equations reveal deep connections between optimal transport, geometric analysis, and func-
tional inequalities, providing a unifying framework for problems in probability, physics, and machine
learning.
Consider two proper, convex, and lower semicontinuous functions f : X → R ∪ {+∞} and
g : X → R ∪ {+∞}. The primal optimization problem is given by
\[
\inf_{x \in X} \; \left\{ f(x) + g(x) \right\},
\]
which seeks to minimize the sum of these two convex functions over the domain X. The associated
Fenchel conjugate function (also referred to as the Legendre-Fenchel transform) is defined
as
\[
f^*(x^*) = \sup_{x \in X} \left\{ \langle x^*, x \rangle - f(x) \right\}, \quad x^* \in X^*, \tag{6.141}
\]
which captures the maximum possible affine lower bound of f (x). Similarly, the conjugate of g is
given by
\[
g^*(x^*) = \sup_{x \in X} \left\{ \langle x^*, x \rangle - g(x) \right\}, \quad x^* \in X^*. \tag{6.142}
\]
The Fenchel-Rockafellar duality theorem states that under appropriate constraint qualifications,
strong duality holds, meaning that the optimal values of the primal and dual problems coincide:
\[
\inf_{x \in X} \left\{ f(x) + g(x) \right\} = \sup_{x^* \in X^*} \left\{ -f^*(x^*) - g^*(-x^*) \right\}.
\]
To ensure strong duality, a sufficient condition is the existence of a point x0 ∈ X such that f(x0) < ∞, g(x0) < ∞, and f is continuous at x0. This constraint qualification ensures that the subdifferential intersection necessary for optimality is nonempty. In particular, the subdifferential of a convex function f at a point x is defined as
\[
\partial f(x) = \left\{ x^* \in X^* : f(y) \ge f(x) + \langle x^*, y - x \rangle \;\; \forall y \in X \right\},
\]
and optimality of a point x̄ is characterized by 0 ∈ ∂f(x̄) + ∂g(x̄).
This condition provides the necessary and sufficient optimality criterion in terms of subdifferential
calculus. The theorem extends naturally to infinite-dimensional settings, where it serves as a
foundation for convex variational problems in Hilbert and Banach spaces. By employing the Fenchel
conjugates, it allows for a reformulation of many classical variational problems in dual terms,
enabling powerful analytical techniques such as saddle-point theory, strong convexity conditions,
and dual space representations. The theorem’s implications in optimization, functional analysis,
and mathematical economics highlight its fundamental role in modern analysis.
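As a numerical illustration (not from the source, assuming only numpy), the conjugate (6.141) can be approximated by a brute-force Legendre transform over a grid; for f(x) = x²/2 the conjugate is again x²/2, which gives a quick sanity check.

```python
# Minimal sketch: grid-based Legendre-Fenchel transform f*(x*) = sup_x [x* x - f(x)].
import numpy as np

def fenchel_conjugate(f_vals, xs, x_stars):
    # Maximize <x*, x> - f(x) over the sampled grid, one row per x*.
    return (x_stars[:, None] * xs[None, :] - f_vals[None, :]).max(axis=1)

xs = np.linspace(-5.0, 5.0, 2001)
x_stars = np.linspace(-2.0, 2.0, 5)
print(fenchel_conjugate(0.5 * xs**2, xs, x_stars))  # approx 0.5 * x_star^2
print(0.5 * x_stars**2)
```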
7 Open Set Learning
Open set learning is a foundational challenge in statistical learning theory and pattern recognition,
characterized by the necessity of classifying data instances drawn from both known and unknown
distributions. Consider a training dataset D = {(x_i, y_i)}_{i=1}^N, where each instance x_i ∈ R^d is associated with a label y_i from a finite set of known classes Y_train. The fundamental challenge in
open set learning arises when test samples are drawn from a distribution Ptest that includes both
the training distribution P_train and an additional unknown component P_unknown, such that
\[ P_{test} = (1 - \pi)\, P_{train} + \pi\, P_{unknown} \quad \text{for some mixing weight } \pi \in (0, 1). \]
In the closed-set setting, one would simply minimize the empirical risk over the known classes. However, in the presence of unknown classes, minimizing only this risk function leads to over-
confident misclassifications when a sample xu ∼ Punknown is assigned to one of the known classes.
This necessitates the introduction of a rejection mechanism, leading to an extended open set risk
functional
\[ R_{open}(f) = \mathbb{E}_{(x,y) \sim P_{train}}[\ell(f(x), y)] + \lambda\, \mathbb{E}_{x_u \sim P_{unknown}}[\ell_{reject}(f(x_u))] \tag{7.4} \]
where ℓreject is a loss function that penalizes misclassification of unknown samples. A common
approach is to introduce a threshold-based rejection criterion where the classification function f
satisfies
\[ \ell_{reject}(f(x_u)) = \max_c f_c(x_u) - \tau, \tag{7.5} \]
where τ is a rejection threshold such that if maxc fc (xu ) < τ , the sample is assigned to an unknown
category. The introduction of this threshold formally defines an open space O(f ), given by
\[ O(f) = \left\{ x \in \mathbb{R}^d : \max_c f_c(x) < \tau \right\}, \tag{7.6} \]
which is crucial in minimizing the open space risk introduced by Scheirer et al. (2013):
\[ R_{open}(f) = \int_{O(f)} P(x)\, dx. \tag{7.7} \]
Incorporating open space risk minimization into the learning framework, the overall objective func-
tion becomes
L(f ) = Remp (f ) + βRopen (f ), (7.8)
where β is a hyperparameter that balances classification accuracy with rejection performance. The
underlying theoretical framework necessitates consideration of both discriminative and generative modeling approaches.
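A minimal sketch of the threshold-based rejection rule in (7.5)-(7.6), assuming softmax outputs as the class scores f_c; the threshold and toy logits are illustrative, not from the source.

```python
# Minimal sketch: assign max-probability class, or reject as unknown when the
# maximum softmax score falls below the rejection threshold tau, cf. (7.6).
import numpy as np

def open_set_predict(logits, tau=0.7, unknown_label=-1):
    z = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    preds = probs.argmax(axis=1)
    preds[probs.max(axis=1) < tau] = unknown_label   # open-space rejection
    return preds

logits = np.array([[4.0, 0.5, 0.2],    # confident known sample -> class 0
                   [1.1, 1.0, 0.9]])   # ambiguous sample -> rejected
print(open_set_predict(logits))        # [0, -1]
```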
In generative modeling, the problem is framed as estimating the likelihood P (x | y) under a known
class distribution model, such as a Gaussian Mixture Model (GMM):
\[ P(x \mid y = c) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k), \tag{7.10} \]
but with a nonparametric prior that ensures more flexible decision boundaries.
4. Extreme Value Theory (EVT) Models: The tail distribution of softmax probabilities is modeled using an Extreme Value Theory (EVT) approach. Given softmax scores σ_c(x), we fit a Weibull distribution to the tail:
\[ P_{Weibull}(x) = 1 - \exp\left( -\left( \frac{x - \lambda}{\kappa} \right)^{\beta} \right). \tag{7.20} \]
6. Support Vector Models (OC-SVM and SVDD): One-Class SVM (OC-SVM) finds a separating hyperplane w^T φ(x) + b = 0 by solving
\[ \min_{w, b} \frac{1}{2} \|w\|^2 \tag{7.25} \]
subject to the corresponding separation constraints on the training samples. Support Vector Data Description (SVDD) instead encloses the known data in a minimal hypersphere with center c, radius R, and slack variables ξ_i:
\[ \|\phi(x_i) - c\|^2 \le R^2 + \xi_i. \tag{7.28} \]
A sample is rejected if:
\[ \|\phi(x) - c\| > R. \tag{7.29} \]
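As a rough illustration of these support-vector rejection rules, the sketch below uses scikit-learn's OneClassSVM (an assumed dependency, not cited in the source); with an RBF kernel the ν-parameterized OC-SVM is closely related to SVDD, so a single fit stands in for both variants.

```python
# Minimal sketch: one-class support vector rejection of unknown samples.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_known = rng.normal(size=(200, 2))          # features of the known classes
X_test = np.array([[0.1, -0.2],              # in-distribution point
                   [6.0, 6.0]])              # far outlier

oc_svm = OneClassSVM(kernel="rbf", nu=0.05).fit(X_known)
print(oc_svm.predict(X_test))                # +1 = accept, -1 = reject as unknown
```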
Each model presents a trade-off between expressivity, computational complexity, and interpretabil-
ity. The choice of model for Open Set Learning depends on the underlying data distribution and
the required rejection confidence.
A distance-based approach further refines the rejection mechanism by computing the Mahalanobis distance
\[ D_M(x, \mu_c, \Sigma_c) = \sqrt{ (x - \mu_c)^T \Sigma_c^{-1} (x - \mu_c) } \tag{7.30} \]
and rejecting samples whose minimum distance to all known class means exceeds a chosen threshold.
The challenge of domain adaptation in open set scenarios was further addressed by Saito et al. (2018) [1170], who used adversarial training to align known-class features while ensuring that unknown-class features remained distinguishable. Their approach improved generalization across datasets with shifting distributions. Expanding on the theoretical foundations of OSL, Geng et
al. (2020) [1171] provided a structured taxonomy of existing methods, evaluating their relative
strengths and weaknesses. This survey has become a key reference for researchers entering the
field. A novel approach to learning class boundaries in an open set scenario was proposed by
Chen et al. (2020) [1172]. They introduced the concept of Reciprocal Points, which helped define
the extra-class space for each known category, reducing misclassification of unknowns. This work
significantly improved decision boundary learning for open set classification.
Author(s) and contribution:
- Scheirer et al. (2012) [1167]: Introduced the concept of open space risk and proposed the 1-vs-Set Machine classifier, which minimizes both empirical and open space risk to identify unseen categories.
- Bendale and Boult (2015) [1168]: Extended the framework to open world learning with the Nearest Non-Outlier (NNO) algorithm, allowing incremental learning from evolving data.
- Busto and Gall (2017) [1169]: Proposed the Assign-and-Transform Iterative (ATI) method for domain adaptation when the target domain includes unknown classes not present in the source.
- Saito et al. (2018) [1170]: Used adversarial training to align features of known classes and distinguish unknowns, improving generalization in open set domain adaptation scenarios.
- Geng et al. (2020) [1171]: Offered a comprehensive taxonomy and theoretical framework for Open Set Learning methods, becoming a foundational survey in the field.
- Chen et al. (2020) [1172]: Introduced Reciprocal Points to define extra-class space and improve the separation between known and unknown classes.
Few-shot learning in open set recognition presents additional complexities, as limited labeled data
makes it difficult to differentiate between known and unknown categories. Addressing this, Liu et
al. (2020) [1173] introduced the PEELER algorithm, which integrates meta-learning and entropy
maximization to enhance recognition capabilities. Another innovative approach came from Kong and Ramanan (2021) [1174], who leveraged Generative Adversarial Networks (GANs) to
synthesize diverse open-set examples. This method helped in explicitly modeling the open space,
improving classifier robustness. Advancing the theoretical aspects, Fang et al. (2021) [1175] pro-
vided generalization bounds for open set learning and introduced the Auxiliary Open-Set Risk
(AOSR) algorithm, ensuring robust decision-making under open-world conditions.
Active learning, which involves selecting the most informative samples for labeling, was linked
to open set recognition by Mandivarapu et al. (2022) [1176]. They proposed a framework where
unknown instances were queried and incorporated into the learning process, making the classifier
more adaptive to novel data. The impact of semi-supervised learning on OSL was explored by En-
gelbrecht and du Preez (2020) [1177]. They combined positive and unlabeled learning, improving
model robustness when training data contained unknown categories. Similarly, Zhou et al. (2024)
[1185] introduced a contrastive learning framework that used an unknown score to separate known
and unknown instances more effectively.
Another fundamental challenge in open set recognition is handling distributional shifts, which
was rigorously analyzed by Shao et al. (2022) [1178]. Their work addressed scenarios where class
distributions evolved between training and testing, ensuring that OSL models could generalize de-
spite such shifts. Theoretical underpinnings of OSL were further explored by Park et al. (2024) [1179], who provided insights into how neural networks distinguish between known and unknown classes using Jacobian-based metrics. Open set recognition also extends to object detection, as demonstrated by Liu et al. (2022) [1180], who developed a hybrid model that utilized both labeled and unlabeled data for detecting unknown objects in complex visual scenes.
Abouzaid et al. (2023) [1186] present a cost-effective alternative to conventional vector network analyzers by combining a D-band Frequency-Modulated Continuous-Wave (FMCW) radar system with deep learning models for material characterization. The integration of an open-set recognition technique allows the system to reject measurements from unknown materials that were not encountered during training, ensuring enhanced reliability and robustness in practical applications. Their work demonstrates how deep learning, when coupled with domain-specific hardware, can improve classification confidence in real-world settings where unknown materials may appear frequently. Similarly, Cevikalp et al. (2023) [1187] propose a novel approach that unifies anomaly detection and open set recognition by approximating class acceptance regions with compact hypersphere models, providing a clear separation between known and unknown instances. Their method improves the generalization capability of OSL models by explicitly defining decision boundaries that allow the model to reject unseen samples with greater confidence, bridging a crucial gap between two previously distinct problem domains and offering a new perspective on how class boundaries should be defined in high-dimensional spaces. Their work emphasizes the importance of modeling uncertainty in classification tasks to improve the identification of unknown samples in real-world applications.
Palechor et al. (2023) [1188] highlight a critical limitation in current open-set classification research: existing evaluations are often performed on small-scale, low-resolution datasets that do not accurately reflect real-world challenges. They therefore introduce three large-scale open-set protocols built from subsets of ImageNet, varying the similarity between known and unknown classes to provide a more realistic assessment of open-set recognition models. They also propose a novel validation metric designed to evaluate a model's ability to both classify known samples and correctly reject unknown ones, contributing significantly to the standardization of open-set evaluation methodologies and setting new benchmarks for future research in this domain. Their large-scale approach helps bridge the gap between academic research and practical deployment scenarios by providing a more realistic evaluation framework.
Lastly, Cen et al. (2023) [1189] examine Unified Open-Set Recognition (UOSR), which aims not only to reject unknown samples but also to handle known samples that are misclassified by the model. Analyzing the distribution of uncertainty scores, they find that misclassified known samples exhibit uncertainty distributions similar to those of truly unknown samples, highlighting a fundamental limitation of existing open-set classifiers. They further explore how different training settings, such as pre-training and outlier exposure, affect UOSR performance, and they propose the FS-KNNS method for few-shot UOSR settings, which achieves state-of-the-art performance across multiple evaluation conditions. These results indicate that a unified framework capable of handling both unknown rejection and misclassification is necessary for real-world applications where classification errors can have significant consequences.
Huang et al. (2022) [1190] introduce a novel approach that leverages semantic reconstruction to
bridge the gap between known and unknown classes by focusing on class-specific feature recovery,
enabling more robust rejection of out-of-distribution samples. Complementarily, Wang et al. (2022)
[1191] propose an Area Under the Curve (AUC)-optimized objective function that directly improves
open-set recognition performance by training deep networks to balance the trade-off between closed-
set classification and unknown sample detection, providing a novel perspective on learning decision
boundaries in an end-to-end manner. Additionally, Alliegro et al. (2022) [1192] present a bench-
mark dataset tailored for 3D open-set learning, evaluating deep learning architectures on object
point cloud classification where previously unseen object categories must be identified and re-
jected, highlighting the need for better feature representations in 3D data. Meanwhile, Grieggs et
al. (2021) [1193] propose a unique application of OSL in the context of handwriting recognition,
demonstrating how human perception can be leveraged to identify errors in automatic transcription
systems when encountering unfamiliar handwriting styles, further expanding the applicability of
OSL methodologies beyond traditional classification problems.
The intersection of open-set learning and object detection has also received attention, with Liu
et al. (2022) [1180] introducing a semi-supervised framework that leverages labeled and unlabeled
data to improve object detection models in open-world settings where new object categories emerge
post-training, making their method highly relevant for real-world surveillance and autonomous driv-
ing applications. Similarly, Grcić et al. (2022) [1194] proposed a hybrid approach that combines
traditional anomaly detection with deep feature learning to improve open-set performance in dense
prediction tasks such as semantic segmentation. Moreover, Moon et al. (2022) [1195] introduced
a simulator that generates synthetic samples mimicking the characteristics of unknown instances,
allowing models to improve their robustness against unfamiliar data distributions. Meanwhile,
Kuchibhotla et al. (2022) [1196] addressed the problem of incrementally learning new unknown
categories, proposing a framework that adapts dynamically to new unseen data without requiring
retraining, making it well-suited for applications like autonomous agents and continual learning
scenarios.
Another critical research direction is open-set image generation, explored by Katsumata et al.
(2022) [1197], who propose a Generative Adversarial Network (GAN)-based framework for
semi-supervised image synthesis, ensuring that generated images align with both known class distri-
butions and the potential feature spaces of unknown samples, bridging the gap between generative
modeling and open-set classification. Bao et al. (2022) [1198] extend OSL into the domain of tem-
poral action recognition, developing a method that detects and localizes previously unseen human
actions in video sequences, an essential capability for video surveillance and activity monitoring.
Dietterich and Guyer (2022) [1199] conducted a theoretical analysis of why deep networks strug-
gle with open-set generalization, proposing that model behavior is largely dictated by the level of
feature familiarity during training, thereby highlighting the importance of designing architectures
that explicitly account for feature generalization beyond closed-set learning.
Long-tailed and class imbalance issues are further addressed by Cai et al. (2022) [1200], who propose a method for localizing unfamiliar samples in long-tailed distributions by leveraging
feature similarity measures to identify and reject outliers, providing an important step towards
integrating OSL with long-tailed classification problems. Similarly, Wang et al. (2022) [1201]
present a framework that generalizes open-world learning paradigms to specific user-defined tasks,
improving model adaptability by considering the dynamics of real-world data distributions. Zhang
et al. (2022) [1202] introduce an architecture search algorithm optimized for OSL, demonstrating
that network design itself plays a crucial role in determining a model’s ability to reject unknown
instances, thus contributing a new perspective to the development of OSL-optimized deep learning
architectures. Lu et al. (2022) [1203] present a prototype-based approach that enhances open-set
rejection by mining robust feature prototypes from known classes, refining decision boundaries by
ensuring that unknown samples are adequately separated from learned class distributions. Their
work aligns closely with recent efforts in prototype learning for anomaly detection and open-world
learning, reinforcing the importance of well-structured feature representations in OSL.
The following five papers collectively advance the field of Open Set Learning (OSL) by addressing
key challenges in recognizing known classes while rejecting unknown or out-of-distribution (OOD)
samples. Xia et al. (2021) [1204] propose an Adversarial Motorial Prototype Framework (AMPF),
leveraging adversarial learning to refine class prototypes and explicitly model uncertainty bound-
aries, offering strong theoretical grounding but facing instability in training. Kong and Ramanan
(2021) [1205] introduce OpenGAN, which generates synthetic OOD data via GANs, improving gen-
eralization but requiring auxiliary OOD data for optimal performance. Huang et al. (2021) [1206]
take a semi-supervised approach in Trash to Treasure, using cross-modal matching to mine useful
OOD samples from unlabeled data, though this depends on multi-modal availability. Wang et al.
(2021) [1207] present an energy-based model (EBM) for uncertainty calibration, providing a prin-
cipled confidence measure without OOD data but at higher computational cost. Lastly, Zhang and
Ding (2021) [1208] adapt prototypical matching for zero-shot segmentation with open-set rejection,
offering efficiency but relying on pre-defined class embeddings. These works highlight diverse strate-
gies—adversarial learning, generative modeling, multi-modal mining, energy-based uncertainty, and
metric-based rejection—each with unique trade-offs in stability, data requirements, and scalability.
A comparative analysis reveals that generative approaches (OpenGAN, AMPF) excel in synthesiz-
ing or refining unknown-class representations but often suffer from training instability or data de-
pendency. In contrast, discriminative methods (Huang et al. (2021) [1206], Xia et al. (2021) [1204])
leverage existing data more efficiently but may struggle with feature overlap or modality constraints.
The energy-based framework stands out for its theoretical robustness in uncertainty quantification
but lacks computational efficiency. Future directions could integrate these paradigms—e.g., com-
bining generative synthetic data with energy-based calibration or extending contrastive learning
to reduce multi-modal reliance. Additionally, bridging zero-shot learning (Zhang and Ding (2021)
[1208]) with open-set recognition (Wang et al. (2021) [1207]) could yield more scalable solutions
for real-world open-world scenarios.
Girish et al. (2021) [1209] propose a framework for discovering and attributing GAN-generated
images in an open-world context, leveraging contrastive learning and unsupervised clustering to
identify novel synthetic sources. This work is pivotal in extending open-set learning to generative
models, where unknown sources must be incrementally detected. Similarly, Wang et al. (2021)
[1210] tackle open-world video object segmentation by introducing a benchmark for dense, uniden-
tified object segmentation, emphasizing the need for models to reject unknown objects while in-
crementally learning new categories. Their approach combines uncertainty estimation with spatio-
temporal consistency, providing a robust baseline for dynamic open-world settings. Cen et al.
(2021) [1211] further advance this direction by integrating deep metric learning into semantic seg-
mentation, proposing a prototype-based mechanism to distinguish known and unknown classes
through margin-based feature separation. Their work bridges open-set recognition and dense pre-
diction, demonstrating significant improvements in unknown class rejection.
Wu et al. (2021) [1212] introduce NGC, a unified framework for learning with open-world noisy
data, combining noise robustness with open-set rejection via graph-based label propagation and
uncertainty-aware sample selection. This work is particularly rigorous in its theoretical analysis of
noise tolerance in open-set scenarios. Bastan et al. (2021) [1213] explore large-scale open-set logo
detection, employing hierarchical clustering and outlier-aware loss functions to handle real-world
open-set noise. Their empirical results highlight the scalability challenges in open-set detection.
Saito et al. (2021) [1214] propose OpenMatch, a semi-supervised learning method that enforces
consistency regularization while explicitly handling outliers, offering a principled way to integrate
open-set robustness into semi-supervised pipelines. Finally, Esmaeilpour et al. (2022) [1215] extend
CLIP for zero-shot open-set detection, leveraging vision-language pretraining to recognize novel
categories without labeled examples. Their work underscores the potential of multimodal models
in open-world settings but also exposes limitations in fine-grained unknown class discrimination.
Collectively, these papers advance open-set learning through novel architectures, benchmarks, and
theoretical insights, though challenges remain in scalability, noise robustness, and generalizability
to real-world open-ended environments.
Chen et al. (2021) [1216] introduce Adversarial Reciprocal Points Learning, which leverages ad-
versarial optimization to create reciprocal points that define decision boundaries for known classes
while rejecting unknowns through a novel geometric margin constraint. This approach is theo-
retically grounded in metric learning and adversarial robustness, demonstrating superior perfor-
mance on benchmark datasets. Similarly, Guo et al. (2021) [1217] propose a Conditional Varia-
tional Capsule Network, combining capsule networks with conditional variational autoencoders to
model complex data distributions, enabling better uncertainty quantification for open-set scenar-
ios. Their work extends the theoretical framework of variational inference to hierarchical feature
representations, improving discriminability between known and unknown classes. Bao et al. (2021)
[1218] explore Evidential Deep Learning for action recognition, employing subjective logic to model
epistemic uncertainty explicitly. Their method provides a probabilistic interpretation of open-set
recognition, offering robustness against outliers in video data. Meanwhile, Sun et al. (2021) [1219]
introduce M2IOSR, which maximizes mutual information between input data and latent represen-
tations to enhance feature discriminability. Their information-theoretic approach ensures compact
class-specific manifolds while maintaining separability from unknown samples, supported by rig-
orous bounds on mutual information optimization. Hwang et al. (2021) [1220] tackle Open-Set
Panoptic Segmentation, proposing an exemplar-based approach that leverages prototype learning
to distinguish known and unknown objects in segmentation tasks. Their method combines metric
learning with panoptic segmentation, providing a novel way to handle open-set scenarios in dense
prediction tasks. Finally, Balasubramanian et al. (2021) [1221] focus on real-world applications,
combining deep learning with ensemble methods for detecting unknown traffic scenarios. Their
work emphasizes practical robustness, using ensemble diversity to improve uncertainty estimation
in safety-critical environments.
Salomon et al. (2020) [1228] propose Siamese networks for open-set face recognition in small gal-
leries, leveraging metric learning to distinguish known from unknown classes effectively. Similarly,
Jia and Chan (2021) [1227] introduce the MMF loss, extending traditional feature learning by
incorporating margin-based constraints to enhance discriminability in open-set scenarios. Their
follow-up work (Jia and Chan, 2022) [1226] presents a self-supervised de-transformation autoen-
coder that learns robust representations by reconstructing original images from augmented views,
improving generalization to unknown classes. These approaches emphasize representation learning
but differ in their mechanisms—metric learning vs. margin-based losses vs. self-supervised recon-
struction—highlighting the trade-offs between discriminative power and generalization. Meanwhile,
Yue et al. (2021) [1225] tackle open-set and zero-shot recognition through counterfactual reasoning,
generating synthetic unknowns to refine decision boundaries. Their method bridges OSR and zero-
shot learning by leveraging generative models, offering a unified perspective on handling unseen
categories. Cevikalp et al. (2021) [1224] propose a deep polyhedral conic classifier, formulating
OSR as a compact geometric problem where classes are modeled as convex cones, enabling both
closed-set accuracy and open-set robustness. Zhou et al. (2021) [1223] take a different approach
by learning ”placeholder” prototypes for potential unknown classes during training, dynamically
adjusting decision boundaries to accommodate novel test-time instances. Their method explicitly
models uncertainty, contrasting with the geometric rigidity of polyhedral classifiers. Lastly, Jang
and Kim (2023) [1222] present a teacher-explorer-student (TES) paradigm, where an ”explorer”
network identifies challenging open-set samples to guide the student’s learning process. This tripar-
tite framework introduces a novel meta-learning aspect to OSR, emphasizing adaptive exploration
over static modeling.
Sun et al. (2020) [1229] propose a Conditional Gaussian Distribution Learning (CGDL) method,
which models class-conditional distributions to improve open-set recognition by leveraging uncer-
tainty estimation. Similarly, Perera et al. (2020) [1230] introduce a hybrid generative-discriminative
framework that combines variational autoencoders (VAEs) with discriminative classifiers to enhance
feature separation between known and unknown classes. Meanwhile, Ditria et al. (2020) [1231]
present OpenGAN, a generative adversarial network (GAN) variant designed to generate outliers
that improve open-set detection by training the discriminator to reject synthetic unknowns. These
works collectively emphasize the importance of probabilistic modeling and generative techniques in
distinguishing between in-distribution and out-of-distribution samples, while also addressing scala-
bility and generalization challenges in real-world applications. Further expanding the scope of OSL,
Geng and Chen (2020) [1232] propose a collective decision framework that aggregates multiple clas-
sifiers to improve robustness in open-set scenarios, demonstrating the benefits of ensemble-based
uncertainty quantification. Jang and Kim (2020) [1233] develop a One-vs-Rest deep probability
model that explicitly estimates the probability of a sample belonging to an unknown class, offering
a computationally efficient alternative to traditional threshold-based methods. Lastly, Zhang et
al. (2020) [1234] explore hybrid models that combine discriminative and generative components,
achieving state-of-the-art performance by jointly optimizing feature learning and open-set rejec-
tion. These studies highlight a trend toward hybrid architectures that integrate multiple learning
paradigms to enhance open-set robustness.
Shao et al. (2020) [1235] tackle adversarial vulnerabilities in OSR with Open-set Adversarial De-
fense, proposing a defense mechanism that integrates open-set robustness into adversarial training,
ensuring resilience against both adversarial attacks and unknown class intrusions. These works col-
lectively advance OSR by refining feature learning and decision boundaries, though they differ in
their emphasis on hybrid architectures, latent space optimization, and adversarial robustness. Yu
et al. (2020) [1236] introduce a Multi-Task Curriculum Framework for Open-Set Semi-Supervised
Learning, dynamically balancing supervised and unsupervised learning to progressively handle un-
known classes, demonstrating strong performance in mixed-label settings. Miller et al. (2021) [1237]
propose Class Anchor Clustering, a distance-based loss function that enforces compact class-specific
clusters while maintaining separation, improving open-set classification by directly optimizing fea-
ture space geometry. Jia and Chan (2021) [1238] contribute MMF, a loss extension that enhances
feature discriminability in OSR by jointly minimizing intra-class variance and maximizing inter-
class separation. Finally, Oliveira et al. (2020) [1239] extend OSR to dense prediction tasks with
Fully Convolutional Open Set Segmentation, introducing uncertainty-aware mechanisms to reject
unknown pixels in segmentation, a novel application of open-set principles.
Yang et al. (2020) [1240] propose S2OSC, a holistic semi-supervised framework that integrates con-
sistency regularization and entropy minimization to enhance open set classification, demonstrating
robustness in handling both labeled and unlabeled data. Similarly, Sun et al. (2020) [1241] leverage
conditional probabilistic generative models to estimate the likelihood of samples belonging to known
classes, using uncertainty thresholds to reject unknowns, thus providing a probabilistic foundation
for open set recognition. Yang et al. (2020) [1242] introduce the Convolutional Prototype Network
(CPN), which learns discriminative prototypes for known classes and employs a distance-based
rejection mechanism, achieving state-of-the-art performance on benchmark datasets. Dhamija et
al. (2020) [1243] highlight the understudied challenge of open set detection in object recognition,
proposing an evaluation framework that underscores the limitations of current object detectors in
rejecting unknown classes. Their work calls for a paradigm shift in designing detection systems
to account for open-world assumptions. Meyer and Drummond (2019) [1244] emphasize metric
learning as a pivotal tool for open set recognition in robotic vision, demonstrating its efficacy in
active learning scenarios where unknown classes are incrementally discovered. Oza and Patel (2019)
[1245] present a multi-task learning approach that jointly optimizes classification and reconstruc-
tion tasks, leveraging autoencoders to model the latent space of known classes and detect outliers.
Yoshihashi et al. (2019) [1246] propose Classification-Reconstruction Learning (CROSR), which
combines discriminative and generative learning to improve open set performance by reconstructing
input samples and using reconstruction error as a secondary cue for rejection.
Malalur and Jaakkola (2019) [1247] propose an alignment-based matching network that lever-
ages metric learning for one-shot classification and OSR, emphasizing the importance of feature
alignment in distinguishing known from unknown classes. Similarly, Schlachter et al. (2019) [1248]
introduce intra-class splitting, a technique that refines decision boundaries by decomposing known
classes into sub-clusters, thereby improving open-set discriminability. Meanwhile, Imoscopi et al.
(2019) [1249] investigate speaker identification, demonstrating that discriminatively trained neural
networks can effectively reject unknown speakers by optimizing confidence thresholds. These works
collectively highlight the role of discriminative training and refined feature representations in OSR.
On the other hand, Mundt et al. (2019) [1250] challenge the necessity of generative models for out-
of-distribution detection, showing that deep neural network uncertainty measures—such as softmax
entropy and Monte Carlo dropout—can achieve competitive performance without explicit genera-
tive modeling. This finding suggests that discriminative approaches, when properly calibrated, can
suffice for open-set scenarios.
Liu et al. (2019) [1251] tackle large-scale, long-tailed recognition in an open world, proposing a de-
coupled learning framework that balances head and tail class performance while rejecting unknown
samples—a critical advancement for practical deployment. Perera and Patel (2019) [1252] focus
on deep transfer learning for novelty detection, leveraging pre-trained models to identify multiple
unknown classes, thus bridging the gap between OSR and transfer learning. Finally, Xiong et al.
(2019) [1253] shift the focus from classification to counting, presenting a spatial divide-and-conquer
approach that transitions from open-set to closed-set object counting, demonstrating the versatility
of OSR principles beyond traditional recognition tasks.
Yang et al. (2019) [1254] explore open-set human activity recognition using micro-Doppler sig-
natures, leveraging the distinct radar-based motion patterns to differentiate known from unknown
activities, thus demonstrating the applicability of open-set methods in sensor-based domains. Sim-
ilarly, Oza and Patel (2019) [1255] propose C2AE, a class-conditioned autoencoder that learns
discriminative latent representations by reconstructing input samples conditioned on their class,
effectively separating known and unknown samples through reconstruction error thresholds. This
work is particularly notable for its integration of generative and discriminative learning, providing
a robust framework for open-set recognition in computer vision. Liu et al. (2018) [1256] take a
theoretical approach, introducing PAC guarantees for open category detection, ensuring probabilis-
tic bounds on detection errors—a crucial contribution for safety-critical applications. Venkataram
et al. (2018) [1257] adapt convolutional neural networks (CNNs) for open-set text classification,
employing prototype-based rejection mechanisms, highlighting the adaptability of deep learning
models to textual open-set scenarios. Meanwhile, Hassen and Chan (2018) [1258] propose a neural-
network-based representation learning method that explicitly models uncertainty, improving open-
set robustness by learning decision boundaries that account for unseen data distributions.
Further expanding the scope, Shu et al. (2018) [1259] investigate open-world classification, in-
troducing a framework for discovering unseen classes incrementally, which bridges open-set recog-
nition and continual learning. Dhamija et al. (2018) [1260] address ”agnostophobia” (fear of the
unknown) in deep networks by proposing loss functions that penalize overconfidence on unknown
samples, enhancing model calibration in open-set environments. Finally, Zheng et al. (2018) [1261]
explore open-set adversarial examples, revealing vulnerabilities in open-set systems and proposing
defenses against adversarial attacks that exploit unknown-class detection mechanisms.
Table 7.8: Summary of Key Contributions in Open Set Recognition and Related Tasks
Neal et al. (2018) [1262] introduce a novel approach using counterfactual image generation to
simulate unknown classes, enhancing model robustness by exposing classifiers to synthetic outliers
during training. Similarly, Rudd et al. (2017) [1263] propose the Extreme Value Machine (EVM),
leveraging extreme value theory (EVT) to model the probability of sample inclusion, providing
a theoretically grounded method for open set recognition. Vignotto and Engelke (2018) [1264]
further refine EVT-based approaches by comparing Generalized Pareto Distribution (GPD) and
Generalized Extreme Value (GEV) classifiers, demonstrating their effectiveness in modeling tail
distributions for open set scenarios. Meanwhile, Cardoso et al. (2017) [1265] explore weightless
neural networks, which rely on probabilistic memory structures rather than traditional weight
updates, offering a unique solution for open set recognition that avoids retraining and adapts dy-
namically to new data. These works collectively advance the theoretical foundations of open set
learning, with EVT-based methods and synthetic data generation emerging as dominant paradigms.
Adversarial and generative techniques also play a pivotal role in open set learning, as evidenced by
several studies. Rozsa et al. (2017) [1266] compare Softmax and Openmax in adversarial settings,
revealing Openmax’s superior robustness due to its ability to reject uncertain samples. Shu et al.
(2017) [1267] extend open set principles to text classification with DOC, a deep learning framework
that estimates the probability of a document belonging to an unknown class by modeling semantic
boundaries. Generative approaches are further explored by Ge et al. (2017) [1268], who introduce
Generative Openmax, synthesizing unknown class samples to improve multi-class open set clas-
sification. Complementing this, Yu et al. (2017) [1269] employ adversarial sample generation to
create realistic outliers, training classifiers to distinguish known from unknown categories effectively.
A controversial perspective was presented by Vaze et al. (2021) [1181]. They argued that well-
trained closed-set classifiers could perform open set recognition without specialized modifications,
challenging existing methodologies. In addition to individual papers, several surveys and repos-
itories have played crucial roles in advancing OSL research. Barcina-Blanco et al. (2023) [1182]
provided an extensive literature review, identifying gaps and future directions in OSL, particularly
its connections to out-of-distribution detection and uncertainty estimation. A curated collection of papers and resources maintained by iCGY96 on GitHub [1183] serves as a valuable knowledge repository for researchers.
Finally, while not explicitly about OSL, advances in topological deep learning have implications
for open set scenarios. The Wikipedia article on ”Topological Deep Learning” [1184] discusses
how topological representations—such as graphs, simplicial complexes, and hypergraphs—can im-
prove the ability of deep learning models to handle complex data structures. Understanding these
mathematical foundations could lead to new techniques for open set classification by leveraging
topological features to distinguish known from unknown data distributions.
These contributions represent a broad spectrum of advancements in Open Set Learning, ranging
from foundational theory to practical applications across diverse domains. The field continues to
evolve, with researchers developing more robust methods to handle real-world uncertainties where
unknown categories must be identified dynamically.
Some of these approaches detect unknown samples not by probability estimation, but through geometric conformity, paving the way for structural inductive biases in OSR.

7.3 Mahalanobis Distance

Recall the class-conditional Mahalanobis distance D_M(x, μ_c, Σ_c) of (7.30). This metric effectively measures how many standard deviations away the point x is from the mean μ_c when considering the full covariance structure. The inverse covariance matrix Σ_c^{-1} accounts
for correlations between different feature dimensions, ensuring that directions of high variance
contribute less to the distance measurement, while directions of low variance contribute more. If
the covariance matrix Σ is diagonal, the Mahalanobis distance reduces to a normalized Euclidean distance, with each feature being scaled by the inverse of its standard deviation. Explicitly, if Σ_c = diag(σ_1^2, …, σ_d^2), then
\[ D_M(x, c) = \sqrt{ \sum_{i=1}^{d} \frac{(x_i - \mu_{c,i})^2}{\sigma_i^2} }. \tag{7.36} \]
However, in high-dimensional feature spaces where correlations exist between different feature com-
ponents, the full covariance matrix must be considered. The primary utility of Mahalanobis distance
in Open Set Learning stems from its ability to identify samples that deviate significantly from the
known class distributions. The decision rule for classification in Open Set Learning using Maha-
lanobis distance is defined as follows: for a given test sample x, the predicted label ŷ is assigned
based on
\[ \hat{y} = \arg\min_{c \in K} D_M(x, c), \tag{7.37} \]
where K is the set of known classes. However, in Open Set Learning, the classification decision must
also incorporate a rejection criterion for unknown samples. This is accomplished by introducing a
rejection threshold τ, such that if
\[ \min_{c \in K} D_M(x, c) > \tau, \tag{7.38} \]
the sample is rejected as unknown.
In practice, the class covariance matrices are estimated from finite data and are typically regularized, for example by shrinkage toward the identity; this ensures robust covariance estimation and improves the generalization ability of the Mahalanobis distance metric. In deep Open Set Learning settings, feature representations
obtained from neural networks may not be perfectly Gaussian, necessitating additional adaptation
techniques. One approach is to learn feature embeddings such that Mahalanobis distance becomes
more effective for Open Set Recognition. If fθ : Rd → Rd is a neural network feature extractor
parameterized by θ, then the loss function can be augmented to explicitly minimize intra-class
Mahalanobis distances while maximizing inter-class separability:
\[ \mathcal{L}_M = \sum_{(x,y) \in \mathcal{D}_{train}} D_M(f_\theta(x), y). \tag{7.43} \]
A complementary out-of-distribution term L_OOD encourages the Mahalanobis distance of unknown samples to remain above the rejection threshold τ, improving the ability of the model to detect unknown samples.
The combined objective function used in Open Set Learning can then be formulated as
\[ \mathcal{L} = \mathcal{L}_M + \lambda_{OOD}\, \mathcal{L}_{OOD}, \tag{7.45} \]
where λOOD is a balancing hyperparameter. In addition to the rejection-based approach, Maha-
lanobis distance can be integrated with probabilistic modeling techniques such as Gaussian mixture
models (GMMs) for better open set decision boundaries. Given a mixture model with C Gaussian
components, the probability of a feature vector x belonging to class c is computed as
\[ P(c \mid x) = \frac{\pi_c\, \mathcal{N}(x \mid \mu_c, \Sigma_c)}{\sum_{c' \in K} \pi_{c'}\, \mathcal{N}(x \mid \mu_{c'}, \Sigma_{c'})}, \tag{7.46} \]
where
\[ \mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right) \tag{7.47} \]
is the Gaussian density function. This formulation enables a soft probabilistic decision rule for
Open Set Learning. The Mahalanobis distance, due to its statistical foundation, provides a rigorous
means of identifying unknown samples in Open Set Learning by leveraging class-specific feature
distributions. Its ability to incorporate covariance structures makes it significantly more robust than
Euclidean distance, leading to improved open set classification performance in high-dimensional
spaces.
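The decision rule (7.37)-(7.38) is straightforward to implement; the sketch below (toy class statistics and threshold, all illustrative) classifies to the nearest class mean under each class covariance and rejects when the minimum distance exceeds τ.

```python
# Minimal sketch: Mahalanobis-distance open-set classification, cf. (7.37)-(7.38).
import numpy as np

def mahalanobis(x, mu, cov):
    d = x - mu
    return np.sqrt(d @ np.linalg.solve(cov, d))   # solve avoids explicit inverse

def open_set_classify(x, means, covs, tau):
    dists = np.array([mahalanobis(x, m, S) for m, S in zip(means, covs)])
    c = int(dists.argmin())
    return (-1 if dists[c] > tau else c, dists[c])  # -1 denotes "unknown"

means = [np.zeros(2), np.array([5.0, 5.0])]
covs = [np.eye(2), np.diag([0.5, 2.0])]
print(open_set_classify(np.array([0.3, -0.4]), means, covs, tau=3.0))   # known
print(open_set_classify(np.array([10.0, -8.0]), means, covs, tau=3.0))  # unknown
```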
Key Insight: BNNs excel in scalability and integration with deep learning but often rely on
approximations that may underestimate uncertainty.
Oza and Patel (2019) [1255] proposed C2AE, a class-conditioned autoencoder that uses reconstruction error and Dirichlet un-
certainty for OSL. Yoshihashi et al. (2019, CVPR) [1246] combined VAEs with Dirichlet priors
to jointly optimize classification and reconstruction for open-set robustness. Kong and Ramanan
(2021, ICCV) [1174] developed OpenGAN, where Dirichlet uncertainty guides GANs to synthe-
size adversarial unknowns. Neal et al. (2018, ECCV) [1262] generated counterfactual unknowns
using Dirichlet-based sampling to augment training data. Zhang et al. (2020, ECCV) [1234] hy-
bridized Dirichlet models with discriminative classifiers, achieving SOTA on open-set benchmarks.
Charoenphakdee et al. (2021, ICML) [1295] analyzed Dirichlet calibration under label noise, im-
proving OSL reliability. Hendrycks et al. (2019, NeurIPS) [1296] used Dirichlet-based confidence
thresholds to detect anomalies. Vaze et al. (2022, CVPR) [1297] generalized EVM to few-shot OSL
by integrating meta-learning with Dirichlet uncertainty.
Key Insight: Dirichlet methods provide interpretable uncertainty but require careful calibration
to avoid overconfidence.
Key Insight: GPs offer gold-standard uncertainty quantification but face computational bot-
tlenecks in high dimensions.
Key Insight: VAEs excel in generative novelty detection but struggle with likelihood-based met-
rics; hybrid approaches are often necessary.
Future Directions: Hybrid architectures (e.g., BNNs + GPs), tighter PAC-Bayesian bounds,
and calibration-aware training are promising avenues. These Bayesian approaches collectively ad-
vance OSL by unifying uncertainty quantification with adaptive learning.
A Bayesian treatment of open set recognition centers on the posterior probability
\[ P(y \mid x, D), \tag{7.48} \]
where y belongs to the set of known classes Y_known, and D = {(x_i, y_i)}_{i=1}^N represents the training
data. The Bayesian formulation considers a prior distribution P (y) over classes and updates it
using the likelihood P (x | y) via Bayes’ theorem:
\[ P(y \mid x, D) = \frac{P(x \mid y)\, P(y)}{P(x)}, \tag{7.49} \]
where the evidence term is given by marginalization over all known classes:
\[ P(x) = \sum_{y' \in Y_{known}} P(x \mid y')\, P(y'). \tag{7.50} \]
To account for a potential unknown class y_u, the evidence is augmented with an unknown-class term, yielding
\[ P(y \mid x, D) = \frac{P(x \mid y)\, P(y)}{P(x) + P(x \mid y_u)\, P(y_u)}, \tag{7.51} \]
where P(y_u) is the prior probability of encountering an unknown class. If P(y_u) is set too low, the model may overcommit to known classes. The likelihood must therefore be modeled via generative distributions: to estimate P(x | y), a Bayesian model assumes a parametric likelihood such as a Gaussian mixture:
\[ P(x \mid y = k) = \sum_{j=1}^{M_k} \pi_j^{(k)}\, \mathcal{N}(x \mid \mu_j^{(k)}, \Sigma_j^{(k)}). \tag{7.52} \]
If the resulting uncertainty measure, such as the predictive entropy, exceeds a threshold, the sample is classified as unknown. Bayesian Neural Networks can also be
used for Open Set Learning. A Bayesian neural network (BNN) introduces a prior distribution over
the network parameters w, leading to a posterior distribution:
\[ P(w \mid D) = \frac{P(D \mid w)\, P(w)}{P(D)}. \tag{7.56} \]
The predictive posterior for a test input x is then given by:
\[ P(y \mid x, D) = \int P(y \mid x, w)\, P(w \mid D)\, dw. \tag{7.57} \]
Since this integral is intractable, approximation techniques such as Variational Inference (VI) or
Monte Carlo Dropout are used. In VI, the posterior is approximated by a parametric distribution
qϕ (w):
\[ \min_{\phi} D_{KL}\left( q_\phi(w) \,\|\, P(w \mid D) \right). \tag{7.58} \]
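A widely used practical instance of this approximation is Monte Carlo dropout, where dropout is kept active at test time and the predictive distribution (7.57) is approximated by averaging stochastic forward passes. The PyTorch sketch below is illustrative; the architecture and sample count are arbitrary choices, not from the source.

```python
# Minimal sketch: MC-dropout approximation of the Bayesian predictive (7.57).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 32), nn.ReLU(),
                      nn.Dropout(p=0.5), nn.Linear(32, 3))

def mc_dropout_predict(model, x, n_samples=100):
    model.train()                        # keep dropout stochastic at test time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(n_samples)])
    return probs.mean(0), probs.std(0)   # predictive mean and epistemic spread

mean, std = mc_dropout_predict(model, torch.randn(1, 4))
print(mean, std)   # a large std is a cue that the sample may be unknown
```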
For open-set detection, an unknown class component is introduced, leading to an augmented Dirichlet prior:
\[ P(p \mid \alpha, \alpha_u) \propto \prod_{i=1}^{K} p_i^{\alpha_i - 1} \cdot \left( 1 - \sum_{i=1}^{K} p_i \right)^{\alpha_u - 1}. \tag{7.65} \]
If the posterior probability mass assigned to the unknown component is significant, the sample is
rejected as open-set.
In conclusion, a Bayesian approach to Open Set Learning provides a principled uncertainty quan-
tification mechanism, distinguishing unknown from known samples by leveraging posterior proba-
bilities, entropy-based rejection criteria, and epistemic uncertainty estimates. The use of Bayesian
Neural Networks, Gaussian Processes, and Dirichlet distributions enables robust open-set recogni-
tion, improving model reliability in real-world applications.
For class-conditional density estimation, each known class y is modeled as a Gaussian mixture,
\[ P(x \mid y) = \sum_{k=1}^{K} \pi_k^{(y)}\, \mathcal{N}(x \mid \mu_k^{(y)}, \Sigma_k^{(y)}), \tag{7.66} \]
where each Gaussian component k for class y is parameterized by a mean vector μ_k^{(y)} ∈ R^d, a covariance matrix Σ_k^{(y)} ∈ R^{d×d}, and a mixture weight π_k^{(y)} satisfying the normalization constraint:
\[ \sum_{k=1}^{K} \pi_k^{(y)} = 1, \qquad 0 \le \pi_k^{(y)} \le 1. \tag{7.67} \]
The probability density function of each multivariate normal component is given explicitly by:
\[ \mathcal{N}(x \mid \mu_k^{(y)}, \Sigma_k^{(y)}) = \frac{1}{(2\pi)^{d/2} |\Sigma_k^{(y)}|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_k^{(y)})^T (\Sigma_k^{(y)})^{-1} (x - \mu_k^{(y)}) \right). \tag{7.68} \]
By expanding the quadratic form in the exponent, the probability density function can be rewritten in terms of the Mahalanobis distance:
\[ \mathcal{N}(x \mid \mu_k^{(y)}, \Sigma_k^{(y)}) = \frac{1}{(2\pi)^{d/2} |\Sigma_k^{(y)}|^{1/2}} \exp\left( -\frac{1}{2} \sum_{i=1}^{d} \sum_{j=1}^{d} (x_i - \mu_{k,i}^{(y)})\, (\Sigma_k^{(y)})^{-1}_{ij}\, (x_j - \mu_{k,j}^{(y)}) \right). \tag{7.69} \]
The total log-likelihood of the data under the GMM model is given by:
\[ \log P(x \mid y) = \log \sum_{k=1}^{K} \pi_k^{(y)}\, \mathcal{N}(x \mid \mu_k^{(y)}, \Sigma_k^{(y)}). \tag{7.70} \]
Direct computation of the log-sum can lead to numerical instability, which is mitigated using the log-sum-exp trick:
\[ \log P(x \mid y) = m + \log \sum_{k=1}^{K} \pi_k^{(y)} \exp\left( -\frac{1}{2} (x - \mu_k^{(y)})^T (\Sigma_k^{(y)})^{-1} (x - \mu_k^{(y)}) - m \right), \tag{7.71} \]
where
\[ m = \max_k \left\{ -\frac{1}{2} (x - \mu_k^{(y)})^T (\Sigma_k^{(y)})^{-1} (x - \mu_k^{(y)}) \right\} \tag{7.72} \]
is chosen to stabilize exponentiation. For open set recognition, an input sample x is classified as
an out-of-distribution (OOD) sample if its likelihood P (x | y) falls below a predefined threshold τ ,
i.e.,
\[ \log P(x \mid y) < \tau \ \Rightarrow\ x \text{ is out-of-distribution.} \tag{7.73} \]
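A minimal numpy sketch of this stabilized evaluation together with the threshold test (7.73); the mixture parameters and threshold are illustrative toy values.

```python
# Minimal sketch: numerically stable GMM log-likelihood via log-sum-exp,
# cf. (7.71)-(7.72), with the OOD threshold test of (7.73).
import numpy as np

def gmm_log_likelihood(x, weights, means, covs):
    d = x.shape[0]
    terms = []
    for pi, mu, S in zip(weights, means, covs):
        diff = x - mu
        quad = diff @ np.linalg.solve(S, diff)
        log_norm = -0.5 * (d * np.log(2 * np.pi) + np.linalg.slogdet(S)[1])
        terms.append(np.log(pi) + log_norm - 0.5 * quad)
    terms = np.array(terms)
    m = terms.max()                              # stabilizer, cf. (7.72)
    return m + np.log(np.exp(terms - m).sum())   # log-sum-exp trick, cf. (7.71)

weights = [0.6, 0.4]
means = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
tau = -10.0
ll = gmm_log_likelihood(np.array([20.0, -20.0]), weights, means, covs)
print(ll, "-> OOD" if ll < tau else "-> in-distribution")
```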
An alternative formulation leverages Bayesian inference to determine the most probable class y* given an observed sample x. Using Bayes' theorem,
\[ P(y \mid x) = \frac{P(x \mid y)\, P(y)}{\sum_{y'} P(x \mid y')\, P(y')}, \tag{7.74} \]
where P (y) is the prior probability of class y, which is often assumed uniform over the set of
known classes. The most probable class is then determined via the maximum a posteriori (MAP)
estimation:
\[ y^* = \arg\max_{y} P(y \mid x). \tag{7.75} \]
If P(y* | x) is below a rejection threshold η, the sample is assigned to an unknown category.
The Gaussian Mixture Model provides a theoretically rigorous and computationally efficient ap-
proach for open set learning, leveraging probabilistic likelihood estimation, Bayesian inference,
and robust distance measures to distinguish in-distribution and out-of-distribution samples. The
parameter optimization via Expectation-Maximization ensures maximum likelihood estimation of
the mixture components, and the decision criteria based on likelihood thresholds or Mahalanobis
distances provide strong theoretical guarantees for identifying unknown samples.
7.7 Dirichlet Process Gaussian Mixture Model (DP-GMM)

The Dirichlet Process Gaussian Mixture Model (DP-GMM) is a Bayesian nonparametric extension of the finite Gaussian Mixture Model (GMM), effectively allowing the number of clusters to grow dynamically with the data. Given an observed dataset D = {(x_i, y_i)}_{i=1}^N, where each x_i ∈ R^d represents a sample in a d-dimensional feature space and y_i is its corresponding class label, the likelihood estimation problem can be expressed in the form
\[ P(x \mid y) = \sum_{k=1}^{\infty} P(x \mid z = k, y)\, P(z = k \mid y), \tag{7.83} \]
where z represents a latent variable denoting the cluster assignment. The likelihood P (x|z =
k, y) follows a multivariate Gaussian distribution parameterized by cluster-specific mean µk and
covariance Σk , given by
\[ P(x \mid z = k, y) = \mathcal{N}(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right). \tag{7.84} \]
Since the number of mixture components is unknown, a Dirichlet Process (DP) prior is imposed
on the cluster assignment probabilities, leading to the hierarchical model
\[ (\mu_k, \Sigma_k) \sim G, \qquad G \sim \mathrm{DP}(\alpha, G_0), \]
where G is a discrete random probability measure drawn from a Dirichlet Process with concen-
tration parameter α > 0 and base distribution G0 , which defines the prior over component
parameters. The construction of the DP-GMM is commonly represented using the stick-breaking
process, where the mixture weights are recursively defined as
\[ \pi_k = v_k \prod_{j=1}^{k-1} (1 - v_j), \qquad v_k \sim \mathrm{Beta}(1, \alpha). \tag{7.86} \]
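Sampling from this construction is only a few lines of numpy; the truncation level K in the sketch below is an illustrative choice (exact DP samples have infinitely many sticks).

```python
# Minimal sketch: truncated stick-breaking sample of DP mixture weights (7.86).
import numpy as np

def stick_breaking(alpha, K, rng):
    v = rng.beta(1.0, alpha, size=K)          # v_k ~ Beta(1, alpha)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
    return v * remaining                      # pi_k = v_k * prod_{j<k}(1 - v_j)

rng = np.random.default_rng(0)
pi = stick_breaking(alpha=2.0, K=20, rng=rng)
print(pi.round(3), pi.sum())   # weights decay; the sum approaches 1 as K grows
```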
Given this formulation, the posterior probability of assigning a sample to cluster k combines the mixture weight with the component likelihood, P(z = k | x, y) ∝ π_k N(x | μ_k, Σ_k).
The open-set learning problem requires the ability to identify and appropriately model unseen
categories. This is naturally handled by the Chinese Restaurant Process (CRP) represen-
tation of the DP, where a new data point x∗ is assigned to a new cluster with probability
\[ P(z^* = K + 1 \mid x^*, y^*) = \frac{\alpha}{\alpha + N}, \tag{7.89} \]
where N is the total number of previously observed samples. Consequently, the likelihood of an
unseen data point under the DP-GMM model expands as
\[ P(x^* \mid y^*) = \sum_{k=1}^{K} P(x^* \mid z^* = k, y^*)\, P(z^* = k \mid y^*) + P(x^* \mid z^* = K + 1, y^*)\, P(z^* = K + 1 \mid y^*). \tag{7.90} \]
To further quantify the uncertainty in cluster assignments, the Shannon entropy of the
posterior is computed as
\[ H(x) = - \sum_{k=1}^{K+1} P(z = k \mid x, y) \log P(z = k \mid x, y). \tag{7.92} \]
A high entropy indicates significant uncertainty, suggesting that the sample does not belong to
any of the known categories and should be treated as an open-set example. If entropy exceeds a
critical threshold, the sample is automatically assigned to a new cluster, thereby allowing the
model to flexibly accommodate previously unseen classes. The hierarchical Bayesian formulation
of the DP-GMM further provides a robust probabilistic framework for managing inherent
ambiguity in classification by leveraging both the prior knowledge encoded in G0 and the
adaptive nature of the Dirichlet Process, which dynamically introduces new components
as necessary. From an inference perspective, model parameters (µk , Σk ) and mixture proportions
πk are estimated using variational inference or Gibbs sampling, where the posterior updates
involve integrating over all possible assignments z given the prior over cluster structure, with the posterior over mixture components obtained by combining these assignment updates with the stick-breaking prior.
A conjugate Bayesian alternative places a Normal-Inverse-Wishart (NIW) prior on the class-conditional parameters. Given labeled data with y_i ∈ {1, 2, . . . , K} for known classes or y_i = K + 1 for an unknown class, the likelihood P(x|y) is modeled as a multivariate normal distribution with class-specific parameters μ_y and Σ_y. These parameters are drawn from the NIW prior distribution:
\[ \Sigma_y \sim \mathcal{W}^{-1}(\Psi, \nu), \qquad \mu_y \mid \Sigma_y \sim \mathcal{N}\!\left( \mu_0, \tfrac{1}{\lambda} \Sigma_y \right), \]
with hyperparameters μ_0, λ, Ψ, and ν.
The posterior distribution of µy and Σy given the observed data Dy = {xi : yi = y} is also an
NIW distribution, with updated hyperparameters µ∗y , λ∗ , Ψ∗ , ν ∗ . The updated hyperparameters are
derived as:
\[ \mu_y^* = \frac{\lambda \mu_0 + N_y \bar{x}_y}{\lambda + N_y}, \tag{7.98} \]
\[ \lambda^* = \lambda + N_y, \tag{7.99} \]
\[ \Psi^* = \Psi + S_y + \frac{\lambda N_y}{\lambda + N_y} (\bar{x}_y - \mu_0)(\bar{x}_y - \mu_0)^T, \tag{7.100} \]
\[ \nu^* = \nu + N_y, \tag{7.101} \]
where N_y = |D_y| is the number of data points in class y, \bar{x}_y = \frac{1}{N_y} \sum_{x_i \in D_y} x_i is the sample mean of the data points in class y, and S_y = \sum_{x_i \in D_y} (x_i - \bar{x}_y)(x_i - \bar{x}_y)^T is the scatter matrix for class y.
The posterior predictive density P(x|y) = ∫∫ N(x | μ_y, Σ_y) P(μ_y, Σ_y | D_y) dμ_y dΣ_y can, due to the conjugacy of the NIW prior, be evaluated analytically, resulting in a multivariate Student's t-distribution:
\[ P(x \mid y) = T\!\left( x \,\Big|\, \mu_y^*,\ \frac{(\lambda^* + 1)\, \Psi^*}{\lambda^* (\nu^* - d + 1)},\ \nu^* - d + 1 \right), \tag{7.103} \]
where T(x | μ, Σ, ν) is the multivariate Student's t-distribution with mean μ, scale matrix Σ, and degrees of freedom ν, given by:
\[ T(x \mid \mu, \Sigma, \nu) = \frac{\Gamma\!\left(\frac{\nu + d}{2}\right)}{\Gamma\!\left(\frac{\nu}{2}\right) (\nu \pi)^{d/2} |\Sigma|^{1/2}} \left( 1 + \frac{1}{\nu} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)^{-(\nu + d)/2}. \tag{7.104} \]
In the open set learning framework, the likelihood P (x|y) is used to compute the probability that
a new data point x belongs to a known class y. If P (x|y) falls below a predefined threshold
for all known classes y, the data point x is classified as belonging to an unknown class. This
approach leverages the statistical properties of the NIW distribution and the multivariate Student’s
t-distribution to provide a rigorous and principled solution to the open set learning problem. The
mathematical formulation ensures that the model can effectively distinguish between known and
unknown classes while maintaining robustness and interpretability.
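The updates (7.98)-(7.101) and the predictive (7.103) translate directly into code. The sketch below uses scipy's multivariate Student's t (available in scipy 1.6+, an assumed dependency); the prior hyperparameters and data are illustrative.

```python
# Minimal sketch: NIW posterior updates and the Student-t posterior predictive.
import numpy as np
from scipy.stats import multivariate_t

def niw_posterior_predictive(X, mu0, lam, Psi, nu):
    N, d = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)                     # scatter matrix S_y
    mu_n = (lam * mu0 + N * xbar) / (lam + N)         # (7.98)
    lam_n, nu_n = lam + N, nu + N                     # (7.99), (7.101)
    Psi_n = Psi + S + (lam * N / (lam + N)) * np.outer(xbar - mu0, xbar - mu0)  # (7.100)
    df = nu_n - d + 1
    shape = Psi_n * (lam_n + 1) / (lam_n * df)        # predictive scale, (7.103)
    return multivariate_t(loc=mu_n, shape=shape, df=df)

rng = np.random.default_rng(0)
X = rng.normal(loc=[1.0, -1.0], size=(50, 2))
pred = niw_posterior_predictive(X, mu0=np.zeros(2), lam=1.0, Psi=np.eye(2), nu=4.0)
print(pred.logpdf(np.array([1.0, -1.0])))   # high log-density: likely known
print(pred.logpdf(np.array([9.0, 9.0])))    # low log-density: flag as unknown
```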
The theoretical foundation of this approach is the Fisher-Tippett-Gnedenko theorem, which states that the maximum of independent and identically distributed (i.i.d.) ran-
dom variables, properly normalized, converges to one of three types of extreme value distributions:
Gumbel, Fréchet, or Weibull. The key observation in open set recognition is that samples from
an unknown class exhibit extreme deviations in their likelihood estimates, which can be rigorously
modeled through EVT-based tail fitting.
The first step in this formulation is to define a log-likelihood function, which quantifies the prob-
ability of observing a sample given a particular class. Let the log-likelihood function be given
by
L(x) = log P (x|y) (7.105)
where L(x) captures the statistical relationship between the observed data and the conditional
density function. In EVT-based approaches, one considers the distribution of extreme values of
L(x), specifically in the lower tail, since OOD samples tend to have significantly lower likelihoods.
The fundamental result from EVT states that for a sufficiently large number of independent ob-
servations, the limiting distribution of the minimum values follows a Generalized Extreme Value
(GEV) distribution, given by
\[ F(z; \xi, \mu, \sigma) = \begin{cases} \exp\left( -\left( 1 + \frac{\xi (z - \mu)}{\sigma} \right)^{-1/\xi} \right), & \xi \neq 0, \\[4pt] \exp\left( -e^{-(z - \mu)/\sigma} \right), & \xi = 0, \end{cases} \tag{7.106} \]
where ξ is the shape parameter, µ is the location parameter, and σ is the scale parameter. In
the context of likelihood estimation, the limiting behavior of L(x) follows a Generalized Pareto
Distribution (GPD), which describes the excess over a given threshold. The probability density
function (PDF) of the GPD is given by
\[ g(z; \xi, \beta) = \begin{cases} \frac{1}{\beta} \left( 1 + \frac{\xi z}{\beta} \right)^{-1 - 1/\xi}, & \xi \neq 0,\ z > 0, \\[4pt] \frac{1}{\beta}\, e^{-z/\beta}, & \xi = 0,\ z > 0, \end{cases} \tag{7.107} \]
where z = Lmax − L(x) represents the deviation from the maximum likelihood observed in training
data. The fundamental insight from EVT is that for sufficiently high thresholds, the distribution
of these deviations converges to the GPD. This allows us to estimate the probability that a given
likelihood falls below a certain threshold τ using
\[ P(L(x) < \tau) = 1 - \left( 1 + \frac{\xi (\tau - \mu)}{\sigma} \right)^{-1/\xi}, \tag{7.108} \]
which provides a principled way of defining OOD detection thresholds. A sample x is considered
to be OOD if
\[ P(x \mid y) < \tau, \tag{7.109} \]
where τ is selected based on the EVT-derived confidence interval. Given a set of training samples {x_i}_{i=1}^N from a known class, one can fit the EVT parameters by solving the maximum likelihood estimation (MLE) problem
\[ (\hat{\xi}, \hat{\sigma}) = \arg\max_{\xi, \sigma} \sum_{i=1}^{N} \log g(L_{max} - L(x_i); \xi, \sigma). \tag{7.110} \]
This provides an optimal estimate of the tail parameters, ensuring that the EVT model accurately
represents the behavior of extreme likelihood values. Once the parameters are estimated, one can
compute a dynamic threshold for rejecting OOD samples by solving
\tau = \mu + \frac{\sigma}{\xi}\left(\left(\frac{1-p}{N}\right)^{-\xi} - 1\right) (7.111)
where p is the desired false positive rate for in-distribution samples. The EVT-based formulation
allows for a probabilistic characterization of whether a given sample belongs to a known or unknown
class, rather than relying on heuristics. To incorporate this within a Bayesian framework, the
posterior probability of a class given an observation is computed using Bayes’ theorem as
P(y|x) = \frac{P(x|y)P(y)}{P(x)}. (7.112)
Since EVT provides a distributional form for P (x) under extreme value behavior, one can estimate
the denominator using the cumulative probability function of the GPD:
P(x) = \sum_{y'} P(x|y')P(y'). (7.113)
Replacing P (x|y) with the EVT-derived approximation, the posterior can be rewritten as
P(y|x) = \frac{\left[1 + \frac{\xi(L(x)-\mu)}{\sigma}\right]^{-1/\xi} P(y)}{\sum_{y'}\left[1 + \frac{\xi(L(x)-\mu)}{\sigma}\right]^{-1/\xi} P(y')}. (7.114)
This formulation ensures that as L(x) moves further into the tail, the posterior probability of all
known classes decreases, leading to a natural rejection mechanism for OOD samples. An alterna-
tive approach to likelihood estimation involves using a logistic function to approximate the EVT
threshold behavior, given by
P(x|y) \approx \frac{1}{1 + e^{-\alpha(L(x)-\tau)}} (7.115)
where α is a scale parameter controlling the steepness of the probability drop-off. The EVT-based
model ensures that open set detection is grounded in rigorous statistical principles, as opposed
to ad-hoc thresholding methods. The theoretical justification of EVT-based likelihood estimation
follows from the asymptotic stability of the GPD under maximum domain of attraction conditions,
ensuring that the fitted distribution remains valid for new samples. Consequently, the EVT-based
formulation provides a robust and theoretically sound approach to estimating likelihoods under
generative models, enabling principled open set recognition.
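The tail-fitting procedure can be sketched as follows; the training log-likelihoods, the 95% tail cutoff, and the false positive rate are all illustrative assumptions, and scipy's genpareto is used for the GPD fit.

import numpy as np
from scipy.stats import genpareto

# Hypothetical log-likelihoods of training samples under a known class
# (in practice these come from the fitted generative model).
rng = np.random.default_rng(0)
L_train = rng.normal(loc=-5.0, scale=1.0, size=1000)

# Tail deviations z = L_max - L(x), keeping only exceedances over a
# high empirical quantile, since EVT applies to the distribution tail.
z = L_train.max() - L_train
u = np.quantile(z, 0.95)
excess = z[z > u] - u

# Fit the GPD to the excesses; loc is pinned to 0 by construction.
xi, _, beta = genpareto.fit(excess, floc=0.0)

def is_ood(L_x, p=0.01):
    # Tail probability of the deviation; reject when it is too small.
    z_x = L_train.max() - L_x
    if z_x <= u:
        return False  # not even in the modeled tail
    tail_prob = genpareto.sf(z_x - u, xi, loc=0.0, scale=beta)
    return tail_prob < p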
In a Bayesian neural network, the posterior over the network parameters θ given the observed data D follows from Bayes' theorem:
P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)}, (7.116)
where P (D|θ) is the likelihood function, P (θ) is the prior over the network parameters, and P (D)
is the marginal likelihood (also known as the model evidence), which ensures proper normalization.
Since the marginal likelihood is intractable in deep neural networks due to the high-dimensional
integral over all possible parameter configurations, approximate inference techniques such as vari-
ational inference, Monte Carlo methods (e.g., Hamiltonian Monte Carlo or Stochastic Gradient
Langevin Dynamics), or Laplace approximations are employed to obtain an approximation to the
posterior distribution. Given this posterior, the Bayesian predictive distribution for a new input
x∗ conditioned on a class label y ∗ is obtained by marginalizing over θ:
P(x^*|y^*) = \int P(x^*|y^*, \theta)\, P(\theta|D)\, d\theta. (7.117)
This integral encodes both aleatoric uncertainty (due to inherent noise in the data) and epistemic
uncertainty (due to lack of knowledge). Since this integral is computationally intractable for high-
dimensional parameter spaces, it is typically approximated via Monte Carlo integration using an
ensemble of sampled parameters θm ∼ P (θ|D):
P(x^*|y^*) \approx \frac{1}{M} \sum_{m=1}^{M} P(x^*|y^*, \theta_m). (7.118)
A fundamental property of Bayesian models in the context of open set learning is that the variance
of the predictive distribution provides a direct measure of epistemic uncertainty. The predictive
mean and variance are given by
E[x^*|y^*, D] = \int x^*\, P(x^*|y^*, \theta)\, P(\theta|D)\, d\theta, (7.119)
Var[x^*|y^*, D] = \int x^{*2}\, P(x^*|y^*, \theta)\, P(\theta|D)\, d\theta - \left(E[x^*|y^*, D]\right)^2. (7.120)
For inputs x∗ that are outside the training distribution, the variance of the predictive distribution is
expected to be significantly higher because the posterior over θ is conditioned only on the observed
training data and lacks information about unseen samples. This property allows Bayesian models
to naturally detect OOD data by thresholding on predictive uncertainty. To model the conditional
likelihood P (x|y), generative Bayesian models introduce latent variables z to capture underlying
structure in the data:
P(x|y, \theta) = \int P(x|z, \theta)\, P(z|y, \theta)\, dz. (7.121)
Since direct inference over the latent variable posterior P (z|x, y, θ) is intractable, variational in-
ference approximates it with a variational distribution q(z|x, y, ϕ), leading to the Evidence Lower
Bound (ELBO):
log P (x|y) ≥ Eq(z|x,y,ϕ) [log P (x|z, θ)] − DKL (q(z|x, y, ϕ)||P (z|y, θ)). (7.122)
Here, DKL (q||p) denotes the Kullback-Leibler divergence, which enforces regularization by min-
imizing the difference between the approximate posterior and the true prior. In practice, deep
generative models such as Variational Autoencoders (VAEs), Normalizing Flows, and Deep Energy-
Based Models leverage Bayesian formulations to learn structured representations of P (x|y) in high-
dimensional spaces. From a Bayesian decision-theoretic perspective, the total uncertainty in a
prediction is measured using the predictive entropy:
H(x^*|y^*) = -\int P(x^*|y^*, \theta) \log P(x^*|y^*, \theta)\, d\theta. (7.123)
Decomposing this entropy into aleatoric and epistemic components via mutual information, the epistemic uncertainty is captured by the expected Kullback-Leibler divergence between individual parameter-conditioned predictions and the full posterior predictive distribution.
Higher epistemic uncertainty indicates that the model is less confident in its prediction due to
insufficient training data, which is a key indicator of an OOD sample. By setting an uncertainty
threshold, a Bayesian classifier can reject predictions on OOD data, effectively implementing an
open set classifier.
Bayesian neural networks thus provide a mathematically rigorous solution to the open set learning
problem by naturally incorporating uncertainty through posterior inference over model parame-
ters, leveraging predictive distributions to measure epistemic uncertainty, and applying principled
probabilistic inference techniques such as variational approximations and Monte Carlo integration.
The estimation of P (x|y) in this framework allows the model to dynamically adjust its confidence
in predictions and provide robust detection of OOD samples by analyzing likelihood distributions.
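A minimal sketch of this Monte Carlo uncertainty decomposition is given below; the (M, K) array of per-sample class probabilities is a hypothetical placeholder for predictions under posterior draws θ_m, and the rejection threshold is illustrative.

import numpy as np

# Hypothetical predictions from M posterior samples over K known classes.
M, K = 50, 10
rng = np.random.default_rng(1)
probs = rng.dirichlet(alpha=np.ones(K), size=M)  # placeholder predictions

p_mean = probs.mean(axis=0)  # Monte Carlo posterior predictive, Eq. (7.118)

# Total uncertainty: entropy of the averaged predictive distribution.
pred_entropy = -np.sum(p_mean * np.log(p_mean + 1e-12))

# Aleatoric part: average entropy of the individual predictions.
aleatoric = -np.mean(np.sum(probs * np.log(probs + 1e-12), axis=1))

# Epistemic part (mutual information): total minus aleatoric.
epistemic = pred_entropy - aleatoric

# Reject as OOD when epistemic uncertainty exceeds a chosen threshold.
is_ood = epistemic > 0.5  # illustrative threshold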
Under the support vector machine (SVM) framework, the class-conditional likelihood is obtained from Bayes' theorem as
P(x|y) = \frac{P(y|x)P(x)}{P(y)}, (7.125)
where P (y|x) is the posterior probability of class y given x, P (x) is the marginal distribution
of x, and P (y) is the prior probability of class y. Under the SVM framework, the posterior
P (y|x) is modeled using a discriminant function fy (x), which is optimized to separate classes while
maximizing the margin. The optimization problem for learning fy (x) is formulated as:
\min_{f_y, \xi} \frac{1}{2}\|f_y\|_H^2 + C \sum_{i=1}^{N} \xi_i, (7.126)
subject to
y_i f_y(x_i) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad \forall i = 1, \dots, N, (7.127)
where ∥fy ∥H denotes the norm of fy in the reproducing kernel Hilbert space (RKHS) H, C is a
regularization parameter controlling the trade-off between margin maximization and classification
error, and ξi are slack variables accounting for misclassifications. The discriminant function fy (x)
is expressed as:
fy (x) = wyT ϕ(x) + by , (7.128)
where wy is the weight vector in the feature space, ϕ(x) is a feature mapping function, and by is the
bias term. To extend this framework to open set learning, the likelihood P (x|y) is augmented with
a rejection mechanism to handle unknown classes. This is achieved by introducing a class-specific
threshold τy , such that:
P(x|y) = \begin{cases} \frac{P(y|x)P(x)}{P(y)}, & \text{if } f_y(x) \geq \tau_y, \\ 0, & \text{otherwise.} \end{cases} (7.129)
The threshold τ_y is determined by minimizing the empirical risk on a validation set D_val = \{(x_i, y_i)\}_{i=1}^{M}, formalized as:
\tau_y = \arg\min_{\tau} \sum_{i=1}^{M} I(f_y(x_i) < \tau) \cdot L(y_i, \text{unknown}), (7.130)
where I(·) is the indicator function, and L(yi , unknown) is the loss incurred when an instance
from an unknown class is misclassified as belonging to class y. This ensures that the model rejects
instances from unknown classes with high confidence. The likelihood P (x|y) is further refined using
kernel methods, which enable SVMs to operate in a high-dimensional feature space. The kernel
function k(x, x′ ) is defined as:
k(x, x′ ) = ⟨ϕ(x), ϕ(x′ )⟩H , (7.131)
where ⟨·, ·⟩H denotes the inner product in the RKHS H. The discriminant function fy (x) in the
kernelized form is given by:
f_y(x) = \sum_{i=1}^{N} \alpha_i y_i k(x_i, x) + b_y, (7.132)
where αi are the Lagrange multipliers obtained from solving the dual optimization problem:
\max_{\alpha} \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j k(x_i, x_j), (7.133)
subject to the constraints:
0 \leq \alpha_i \leq C, \quad \sum_{i=1}^{N} \alpha_i y_i = 0. (7.134)
The kernelized discriminant function allows the model to capture complex decision boundaries in
the input space. The likelihood P (x|y) is then estimated by combining the kernelized discriminant
function with the rejection mechanism:
P(x|y) = \begin{cases} \frac{\exp(f_y(x))}{\sum_{y' \in Y} \exp(f_{y'}(x))}, & \text{if } f_y(x) \geq \tau_y, \\ 0, & \text{otherwise.} \end{cases} (7.135)
This formulation ensures that the model assigns high likelihood to instances from known classes
while rejecting instances from unknown classes. The overall objective is to maximize the log-
likelihood of the observed data under the model:
L(\theta) = \sum_{i=1}^{N} \log P(x_i|y_i; \theta), (7.136)
where θ represents the parameters of the model, including wy , by , and τy . To further enhance the
rigor, the optimization problem is regularized using a penalty term Ω(θ) to prevent overfitting:
\min_{\theta} -L(\theta) + \lambda \Omega(\theta), (7.137)
where λ is a regularization parameter, and Ω(θ) is typically chosen as the L2 -norm of the parameters:
Ω(θ) = ∥wy ∥22 + ∥by ∥22 . (7.138)
This ensures that the model generalizes well to unseen data while maintaining discriminative power.
In conclusion, the open set learning problem under SVMs is rigorously framed as the estima-
tion of P (x|y) through a combination of probabilistic modeling, kernel methods, and optimization.
The discriminant function fy (x) is optimized to separate known classes while incorporating a re-
jection mechanism for unknown classes, ensuring robust performance in open set scenarios. The
likelihood P (x|y) is derived using Bayes’ theorem and refined through kernelized discriminant func-
tions, resulting in a mathematically precise and computationally tractable framework for open set
learning.
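A minimal sketch of the rejection mechanism follows, assuming scikit-learn's SVC as the underlying margin classifier; the one-vs-rest decision values stand in for f_y(x), and the threshold τ is a placeholder rather than the validation-optimized value of Eq. (7.130).

import numpy as np
from sklearn.svm import SVC

# Hypothetical labelled training data from three known classes.
rng = np.random.default_rng(2)
X_train = rng.normal(size=(200, 5))
y_train = rng.integers(0, 3, size=200)

clf = SVC(kernel="rbf", C=1.0, decision_function_shape="ovr")
clf.fit(X_train, y_train)

def predict_open_set(x, tau=0.0):
    # One-vs-rest decision values play the role of f_y(x).
    scores = clf.decision_function(x.reshape(1, -1)).ravel()
    best = int(np.argmax(scores))
    # Reject as "unknown" when no class clears its threshold.
    return clf.classes_[best] if scores[best] >= tau else "unknown"

print(predict_open_set(rng.normal(size=5)))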
where αi ∈ R+ are the Lagrange multipliers obtained from the dual formulation of the SVDD
optimization problem. The dual problem is derived by introducing Lagrange multipliers αi and γi
for the inequality constraints, resulting in the following Lagrangian:
L(R, a, \xi_i, \alpha_i, \gamma_i) = R^2 + C \sum_{i=1}^{N} \xi_i - \sum_{i=1}^{N} \alpha_i \left( R^2 + \xi_i - \|x_i - a\|^2 \right) - \sum_{i=1}^{N} \gamma_i \xi_i. (7.143)
By setting the derivatives of the Lagrangian with respect to R, a, and ξi to zero, the dual opti-
mization problem is obtained as:
\max_{\alpha} \sum_{i=1}^{N} \alpha_i K(x_i, x_i) - \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j K(x_i, x_j), (7.144)
The radius R is determined by selecting any support vector xk for which 0 < αk < C, and
computing:
R^2 = \|x_k - a\|^2 = K(x_k, x_k) - 2 \sum_{i=1}^{N} \alpha_i K(x_k, x_i) + \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j K(x_i, x_j). (7.147)
The likelihood P(x|y) is then used to classify a test point x as belonging to the open set if the likelihood falls below a threshold τ ∈ R+, which is determined based on the desired trade-off between false positives and false negatives; the decision rule simply thresholds this likelihood.
The threshold τ can be estimated using cross-validation or other statistical techniques to optimize
the model’s performance. This formulation provides a rigorous and mathematically sound frame-
work for open set learning, combining principles from optimization, kernel methods, and proba-
bilistic modeling to estimate the likelihood P (x|y) and distinguish between known and unknown
classes in a principled manner. The SVDD-based approach ensures that the model generalizes well
to unseen data while maintaining computational efficiency and robustness to outliers.
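As an illustrative sketch, the boundary can be fit with scikit-learn's ν-one-class SVM, which coincides with SVDD for RBF kernels (where K(x, x) is constant); the data and the ν value are placeholders.

import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical training set containing only known-class data.
rng = np.random.default_rng(3)
X_known = rng.normal(size=(300, 4))

# nu upper-bounds the fraction of training points treated as outliers,
# playing a role analogous to C in the SVDD primal.
svdd = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
svdd.fit(X_known)

# decision_function > 0 means inside the learned ball; <= 0 is rejected
# as open set. The sign threshold corresponds to the radius R.
x_test = rng.normal(size=(5, 4))
inside = svdd.decision_function(x_test) > 0
labels = np.where(inside, "known", "unknown")
print(labels)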
8 Zero-Shot Learning
Zero-shot learning (ZSL) has emerged as a pivotal paradigm in machine learning, enabling models to
recognize classes not seen during training by leveraging auxiliary information such as attributes, se-
mantic embeddings, or textual descriptions. The foundational work of Lampert et al. (2009) [1317]
introduced attribute-based classification, where classes are described by high-level attributes, and a
probabilistic framework is used to infer unseen classes. This approach formalizes ZSL as a transfer
learning problem, where knowledge from seen classes Y s is transferred to unseen classes Y u via a
shared semantic space S. The key idea is to learn a mapping ϕ : X → S from the input space
X to the semantic space S, followed by a compatibility function f : S × Y → R that scores the
alignment between embeddings and class labels. Early methods relied on linear projections, as
seen in the work of Akata et al. (2013) [1318], who proposed a bilinear compatibility function
f (ϕ(x), y) = ϕ(x)T W ψ(y), where ψ(y) is the semantic embedding of class y and W is a learned
weight matrix.
The literature has since evolved to address critical challenges such as the semantic gap and do-
main shift. Romera-Paredes and Torr (2015) [1319] proposed an elegant closed-form solution us-
ing ridge regression to learn the projection matrix W , minimizing ∥W X − S∥2 + λ∥W ∥2 , where
X and S are matrices of seen class instances and their attributes, respectively. However, this
approach assumes a linear relationship, which is often insufficient. Nonlinear extensions, such
as deep neural networks, were introduced by Xian et al. (2017) [1320], who employed a multi-
modal embedding space to align visual and semantic representations. Their objective function L = \sum_{(x,y) \in Y^s} \max(0, \Delta(y, y') + f(\phi(x), y') - f(\phi(x), y)) incorporates a margin-based ranking loss
to ensure correct class rankings. Concurrently, Zhang et al. (2017) [1321] proposed a genera-
tive adversarial network (GAN) framework to synthesize features for unseen classes, formalized as
minG maxD Ex∼pdata [log D(x)] + Ez∼pz [log(1 − D(G(z)))], where G generates fake features condi-
tioned on class embeddings.
The domain shift problem, where the projected features of unseen classes deviate from their true
distribution, was addressed by Fu et al. (2015) [1322] through transductive learning, leveraging
unlabeled unseen class data during training. Their formulation minimizes L = Lsup + λLunsup ,
where Lsup is the supervised loss on seen classes and Lunsup regularizes the projection using un-
seen class instances. Similarly, Kodirov et al. (2017) [1323] proposed a self-training framework
with L = ∥W X − S∥2 + λ∥W ∥2,1 , enforcing sparsity to improve generalization. Another line of
work focuses on hybrid models, as in the case of Changpinyo et al. (2016) [1324], who com-
bined attribute-based and semantic embedding methods using a joint optimization framework
minW,A ∥W X − AS∥2 + λ1 ∥W ∥2 + λ2 ∥A∥2 , where A is a matrix mapping attributes to embed-
dings.
Recent advances have explored graph-based methods to model relationships between classes. Kampffmeyer
et al. (2019) [1325] proposed a graph convolutional network (GCN) to propagate information across
classes, with the loss L = \sum_{i,j} \|f(x_i) - f(x_j)\|^2 A_{ij}, where A_{ij} is the adjacency matrix encoding
class relationships. Similarly, Wang et al. (2018) [1326] introduced a knowledge graph framework
to enrich semantic representations, while Li et al. (2019) [1327] leveraged hierarchical class struc-
tures to improve ZSL performance. The emergence of large-scale pretrained vision-language models
like CLIP (Radford et al., 2021) [1328] has further revolutionized ZSL, enabling zero-shot transfer
via natural language prompts. Their contrastive learning objective L = -\log \frac{\exp(\phi(x)^T \psi(y)/\tau)}{\sum_{y' \in Y} \exp(\phi(x)^T \psi(y')/\tau)} aligns images and text embeddings in a shared space, achieving state-of-the-art results.
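A minimal sketch of this zero-shot inference rule, assuming hypothetical precomputed image and prompt embeddings in place of the actual CLIP encoders:

import numpy as np

# Hypothetical embeddings standing in for phi(x) and psi(y) from
# pretrained image and text encoders.
rng = np.random.default_rng(4)
image_emb = rng.normal(size=128)
text_embs = rng.normal(size=(5, 128))  # one prompt embedding per class
tau = 0.07  # temperature

# L2-normalize so dot products are cosine similarities.
image_emb /= np.linalg.norm(image_emb)
text_embs /= np.linalg.norm(text_embs, axis=1, keepdims=True)

logits = text_embs @ image_emb / tau
probs = np.exp(logits - logits.max())
probs /= probs.sum()          # softmax over class prompts
pred_class = int(np.argmax(probs))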
Despite these advancements, challenges remain, such as bias towards seen classes and the need
for robust evaluation protocols. Chao et al. (2016) [1329] highlighted the importance of gener-
alized ZSL (GZSL), where the test set includes both seen and unseen classes, and proposed the
harmonic mean H = \frac{2 \cdot acc_s \cdot acc_u}{acc_s + acc_u} to balance performance. Verma et al. (2018) [1330] addressed bias
mitigation through calibrated stacking, while Huynh and Elhamifar (2020) [1331] proposed a com-
positional framework for fine-grained ZSL. Theoretical insights from Palatucci et al. (2009) [1332]
and Socher et al. (2013) [1333] have also shaped the field, emphasizing the role of semantic spaces
in knowledge transfer. The integration of meta-learning (Hariharan and Girshick, 2017) [1334]
and few-shot learning (Xian et al., 2018) [1335] has further expanded ZSL’s applicability, while
emerging paradigms like open-set recognition (Scheirer et al., 2013) [1336] and out-of-distribution
detection (Yang et al., 2021) [1337] continue to push its boundaries.
The problem can be formulated using an attribute-based representation, where each class y is
associated with a high-dimensional semantic embedding a(y) ∈ Rd , which encodes meaningful re-
lationships between classes. The objective is to learn a function g : X → Rd that maps input
samples to the same semantic space. Given a new sample x′ , the predicted class is determined by
minimizing a distance function d:
\hat{y} = \arg\min_{y \in U} d(g(x'), a(y)) (8.1)
Generative formulations instead model the class-conditional likelihood P(x|a(y)) by marginalizing over a latent variable z representing latent features shared across seen and unseen classes. Variational inference is commonly employed to approximate this marginalization, leading to an evidence lower bound (ELBO):
log P (x|a(y)) ≥ Eq(z|x) [log P (x|z)] − DKL (q(z|x)∥P (z|a(y))) (8.9)
where q(z|x) is an inference network. Zero-shot learning can also be formulated through a linear
mapping W such that:
g(x) = W f (x) (8.10)
where f (x) is a feature extractor (e.g., a deep CNN), and W aligns it with the semantic space.
This leads to a ridge regression formulation:
W^* = \arg\min_{W} \sum_{(x,y) \in S} \|W f(x) - a(y)\|^2 + \lambda \|W\|^2 (8.11)
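This ridge problem admits a closed-form solution, sketched below with hypothetical feature and attribute matrices; the matrix orientation (features mapped to attributes) is a convention of the sketch.

import numpy as np

# F holds features f(x) for N seen-class samples (rows); A holds the
# corresponding class attribute vectors a(y). Both are placeholders.
rng = np.random.default_rng(5)
N, d_feat, d_sem = 100, 64, 16
F = rng.normal(size=(N, d_feat))
A = rng.normal(size=(N, d_sem))
lam = 1.0

# W minimizes ||F W - A||^2 + lam ||W||^2, so
# W = (F^T F + lam I)^{-1} F^T A  (here W maps features -> attributes).
W = np.linalg.solve(F.T @ F + lam * np.eye(d_feat), F.T @ A)

# Predict: project a test feature and pick the nearest class prototype.
class_protos = rng.normal(size=(10, d_sem))  # hypothetical a(y) per class
x_feat = rng.normal(size=d_feat)
proj = x_feat @ W
pred = int(np.argmin(np.linalg.norm(class_protos - proj, axis=1)))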
At test time, the instance is assigned to the nearest class in the semantic space. Generalized zero-shot learning (GZSL) extends this by allowing both seen and unseen classes at test time, requiring a calibrated compatibility function:
\hat{y} = \arg\max_{y \in S \cup U} \left[ g(x')^T a(y) - \gamma I(y \in S) \right] (8.14)
where γ controls the bias toward seen classes. The choice of semantic space and feature extractor
significantly affects performance, as different embedding choices induce different generalization
behaviors. Alternative formulations include energy-based models:
E(x, y) = g(x)T W a(y) + by (8.15)
trained via hinge loss:
L = \sum_{(x,y) \in S} \sum_{y' \neq y} \max(0, 1 - E(x, y) + E(x, y')) (8.16)
When these energies are converted into class posteriors via a Boltzmann distribution, the denominator represents the partition function, ensuring normalization across all possible class labels. This formulation enforces a probabilistic interpretation of the energy function, ensuring that samples are more likely assigned to classes with lower energy. In zero-shot learning, the
fundamental challenge is to extend knowledge to unseen classes using auxiliary information, often
in the form of semantic embeddings a(y), which encode prior knowledge about each class y. The
energy function is typically parameterized as a compatibility function between the input x and the
semantic representation a(y), formulated as E(x, y) = -f(x)^T W a(y), where f(x) is a feature extractor mapping x into a latent space, and W is a learnable weight matrix.
The objective function is constructed to minimize the energy for correct class-attribute associations
while maximizing it for incorrect ones. A contrastive loss is often used to enforce this property:
" #
X X
L= E(x, y) + log exp(−E(x, y ′ )) (8.20)
(x,y) y ′ ̸=y
which penalizes incorrect associations by ensuring the energy of the correct class y is lower than
all incorrect alternatives y ′ . To enhance generalization, additional regularization terms are often
introduced. A common approach is the margin-based loss:
L_{\text{margin}} = \sum_{(x,y)} \sum_{y' \neq y} \max(0, \Delta + E(x, y) - E(x, y')) (8.21)
where ∆ is a margin enforcing separation between correct and incorrect classes. This formulation
ensures robustness against overfitting to seen classes and enables transfer to unseen classes. To
improve stability, energy-based models often employ gradient-based optimization strategies, lever-
aging backpropagation to refine W , f (x), and other parameters. The gradients are computed
as " #
′
∂L X ∂E(x, y) X ∂E(x, y )
= − P (y ′ |x) (8.22)
∂W ∂W y′
∂W
(x,y)
" #
′
∂L X ∂E(x, y) X ∂E(x, y )
= − P (y ′ |x) (8.23)
∂f (x) ∂f (x) y ′
∂f (x)
(x,y)
where the expectation over incorrect classes ensures optimization is guided towards reducing energy
for true class-attribute associations. A crucial aspect of EBMs in ZSL is the use of generative
constraints, where the model is regularized using a reconstruction loss to ensure feature consistency
across seen and unseen categories. A common approach involves a variational autoencoder (VAE)-
based loss:
LVAE = Eq(z|x) [−E(x, y) + λDKL (q(z|x) ∥ p(z))] (8.24)
where q(z|x) is an approximate posterior, and p(z) is a prior distribution. This enforces meaningful
latent representations for generalization to unseen classes. To further refine the learned embeddings,
contrastive self-supervised techniques can be integrated. The contrastive learning loss is defined as
L_{\text{contrastive}} = -\sum_{x_i, x_j \sim P(x)} \log \frac{\exp(E(x_i, x_j)/\tau)}{\sum_{x_k} \exp(E(x_i, x_k)/\tau)} (8.25)
where τ is a temperature parameter and P(x) is a positive sample pair distribution. This reg-
ularization aligns semantically similar instances, ensuring the energy function learns robust and
transferable representations. The inference stage in zero-shot learning using EBMs involves solv-
ing an optimization problem to assign a query sample x to the class y that minimizes the energy
function:
\hat{y} = \arg\min_{y} E(x, y) (8.26)
where E(x, y) is evaluated for both seen and unseen classes. Since the energy function is trained
using both contrastive and probabilistic constraints, it generalizes effectively to novel categories. In
summary, EBMs for ZSL leverage compatibility functions, contrastive constraints, and probabilistic
formulations to ensure robust zero-shot generalization. The interplay between energy minimization,
semantic embeddings, and self-supervised learning techniques provides a mathematically rigorous
approach for knowledge transfer across seen and unseen categories.
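A minimal sketch of energy-based zero-shot inference with the bilinear parameterization discussed above; the features, weight matrix, and class embeddings are random placeholders for learned quantities.

import numpy as np

# E(x, y) = -phi(x)^T W psi(y); all inputs below are stand-ins.
rng = np.random.default_rng(6)
d_vis, d_sem, n_unseen = 32, 16, 7
phi_x = rng.normal(size=d_vis)            # visual features of a query
W = rng.normal(size=(d_vis, d_sem))       # learned alignment matrix
psi = rng.normal(size=(n_unseen, d_sem))  # unseen-class embeddings

energies = -(psi @ (W.T @ phi_x))         # E(x, y) for every unseen class
y_hat = int(np.argmin(energies))          # Eq. (8.26): minimize energy

# Boltzmann posterior over classes, Eq. (8.49)-style normalization.
p = np.exp(-energies - np.max(-energies))
p /= p.sum()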
where y + represents the true label, y − represents a negative label, I(·) is an indicator function,
and m is a margin parameter ensuring sufficient separation between positive and negative pairs.
The energy function E(x, y) is typically parameterized using neural networks and optimized via
stochastic gradient descent. The gradient updates follow
\nabla_\theta L_{\text{contrastive}} = \sum_{(x,y)} \left[ I(y = y^+) \nabla_\theta E(x, y^+) - I(y \neq y^+) \nabla_\theta \max(0, m - E(x, y^-)) \right]. (8.28)
Margin-based loss extends the contrastive loss by enforcing a stricter separation criterion between
positive and negative pairs. The loss function is
L_{\text{margin}} = \sum_{(x,y)} \left[ E(x, y^+) - E(x, y^-) + m \right]_+ (8.29)
where [z]+ = max(0, z) ensures non-negative loss contributions. This formulation encourages
E(x, y + ) to be strictly lower than E(x, y − ) by at least margin m. The gradient computation
follows
\nabla_\theta L_{\text{margin}} = \sum_{(x,y)} I(E(x, y^-) - E(x, y^+) < m) \left( \nabla_\theta E(x, y^+) - \nabla_\theta E(x, y^-) \right). (8.30)
Variational Autoencoder (VAE)-based loss introduces probabilistic modeling into EBMs for ZSL by
defining a latent variable z that captures the underlying structure of the data. The energy function
is parameterized as
E(x, y, z) = − log p(x|z) − log p(z|y) − log p(y). (8.31)
The VAE-based loss is the negative Evidence Lower Bound (ELBO), expressed in terms of the approximate posterior q(z|x), the prior p(z|y), and the Kullback-Leibler divergence D_KL between them; training proceeds by gradient updates on this bound.
Contrastive learning loss in EBMs for ZSL builds upon the contrastive framework by explicitly
constructing positive and negative pairs in an embedding space where similarity is measured. A
common formulation is the InfoNCE loss
L_{\text{contrastive learning}} = -E_{(x,y)} \left[ \log \frac{\exp(-E(x, y))}{\sum_{y' \in Y} \exp(-E(x, y'))} \right]. (8.34)
This formulation ensures that the energy of the correct label E(x, y) is minimized while normalizing
across all possible labels. The corresponding gradient updates are
" #
X
∇θ Lcontrastive learning = E(x,y) ∇θ E(x, y) − p(y ′ |x)∇θ E(x, y ′ ) , (8.35)
y ′ ∈Y
where p(y ′ |x) is the probability of a label under the energy model. Each of these loss functions
plays a crucial role in shaping the energy landscape for Zero-Shot Learning, ensuring effective
generalization to unseen classes.
Here p_S(x, y) and p_U(x, y) denote the joint distributions over seen and unseen data, respectively. This
constraint ensures that the energy landscape maintains generalizability by aligning unseen class
embeddings with seen class statistics. The generative constraints can be explicitly incorporated via
an adversarial loss term that minimizes the Kullback-Leibler (KL) divergence between seen and unseen class distributions in the latent space, where z = f_θ(x) represents the latent feature encoding. Since direct estimation of p_U(z) is intractable, it is typically approximated using a generative model q_ϕ(z|y), leading to the variational formulation:
L_{\text{gen}} = E_{y \sim p_U(y)} E_{z \sim q_\phi(z|y)} \left[ \log \frac{q_\phi(z|y)}{p_S(z)} \right]. (8.39)
This enforces a shared latent space between seen and unseen distributions, facilitating knowledge
transfer in a generative manner. A key generative constraint in EBMs is the entropy-based reg-
ularization, which ensures that the energy landscape does not collapse into degenerate solutions.
The entropy of the conditional distribution p(y|x; θ) ∝ e−E(x,y;θ) is maximized via
L_{\text{entropy}} = -E_{p(x)} \sum_{y} p(y|x; \theta) \log p(y|x; \theta), (8.40)
which prevents the model from overfitting to seen classes by enforcing a smooth energy manifold.
This is further complemented by a contrastive loss enforcing inter-class separability:
L_{\text{contrast}} = E_{(x,y) \sim p_S(x,y)} \sum_{y' \neq y} e^{E(x, y'; \theta) - E(x, y; \theta)}. (8.41)
Together, these constraints regularize the energy function such that the generative structure cap-
tures semantic transferability between seen and unseen categories. Another fundamental genera-
tive constraint involves aligning class prototypes with their corresponding semantic representations
through a structural loss
L_{\text{proto}} = \sum_{y \in Y_S \cup Y_U} \left\| g_\theta(y) - E_{p(x|y)}[f_\theta(x)] \right\|^2. (8.42)
This ensures that the learned embeddings remain consistent with high-level class descriptions, en-
abling robust generalization to unseen categories. A further refinement of the generative constraint
is the incorporation of a marginal energy regularization term, enforcing stability across variations
in input perturbations:
L_{\text{marginal}} = E_{(x,y) \sim p_S(x,y)} \left[ \max_{x' \sim B(x)} E(x', y; \theta) - E(x, y; \theta) \right], (8.43)
where B(x) denotes a local perturbation neighborhood of x. This encourages energy invariance
within class-consistent transformations, further reinforcing generalization properties. Finally, the overall training objective for EBMs in zero-shot learning integrates the aforementioned generative constraints into a single weighted optimization problem combining these loss terms.
Mathematically, given a dataset D = \{(x_i, y_i)\}_{i=1}^{N} consisting of training examples with inputs x_i ∈ X and their corresponding class labels y_i ∈ Y_seen, the goal of zero-shot inference is to predict labels y^* ∈ Y_unseen for new inputs. The energy function is typically defined as
E(x, y) = −fθ (x, y) (8.45)
where fθ (x, y) is a learnable compatibility function parameterized by θ. Inference in EBMs for ZSL
is performed by solving the minimization problem
y^* = \arg\min_{y \in Y_{\text{unseen}}} E(x, y). (8.46)
The energy function is often decomposed into different terms capturing multiple constraints, such
as semantic consistency, visual-semantic alignment, and regularization. A common formulation
incorporates a bilinear compatibility function between input feature representations ϕ(x) and class
embeddings ψ(y):
E(x, y) = −ϕ(x)T W ψ(y), (8.47)
where W is a learnable weight matrix encoding the mapping between the visual and semantic
spaces. The inference process then reduces to
y^* = \arg\max_{y \in Y_{\text{unseen}}} \phi(x)^T W \psi(y). (8.48)
Energy minimization for inference can also be formulated as a probabilistic model using the Boltz-
mann distribution:
P(y|x) = \frac{\exp(-E(x, y))}{\sum_{y' \in Y} \exp(-E(x, y'))}. (8.49)
A margin parameter ∆ can additionally be imposed, requiring that the correct label have a lower energy than incorrect ones by at least ∆. In scenarios where inference involves generative modeling, energy-based models can be coupled with contrastive divergence training. The gradient of the energy function with respect to the input features provides a principled way to update representations:
\frac{\partial E(x, y)}{\partial x} = -\frac{\partial f_\theta(x, y)}{\partial x}. (8.52)
By leveraging learned semantic embeddings, energy minimization enables inference even in the
absence of explicit training data for unseen categories. The effectiveness of energy-based models for
zero-shot learning thus fundamentally relies on the quality of the learned energy landscape, ensuring
smooth generalization from seen to unseen classes through structured energy minimization.
Given training samples with labels y_i ∈ Y_train, the objective in zero-shot learning is to classify a test sample x^* into one of the unseen categories Y_test, such that Y_train ∩ Y_test = ∅.
Here β is the meta-learning rate of the outer-loop update. Another approach is the meta-embedding strategy, where feature
representations are learned to generalize across tasks. Let ϕ(x; θ) denote an embedding function
mapping inputs to a latent space Rd . The objective is to learn an embedding that maintains a
consistent similarity structure across both seen and unseen categories. The compatibility function
F(x, y; θ) is often modeled as a bilinear compatibility function F(x, y; θ) = ϕ(x; θ)^T W ψ(y), where ψ(y) is a semantic embedding of label y, and W is a learned transformation matrix. The
model is trained to maximize the compatibility of correct pairs while minimizing that of incorrect
pairs, often using a ranking loss:
L = \sum_{i} \sum_{j \neq i} \max(0, \gamma - F(x_i, y_i) + F(x_i, y_j)), (8.57)
A query sample x∗ is classified by computing a distance metric, commonly the squared Euclidean
distance:
p(y = c|x^*) = \frac{\exp(-d(\phi(x^*; \theta), p_c))}{\sum_{c'} \exp(-d(\phi(x^*; \theta), p_{c'}))}. (8.59)
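A minimal sketch of this nearest-prototype soft classification, with an identity embedding standing in for the learned ϕ(·; θ) and a synthetic support set:

import numpy as np

rng = np.random.default_rng(7)

def embed(x):
    return x  # identity placeholder for the learned embedding

# Hypothetical support set: 3 classes, 5 embedded examples each.
support = {c: rng.normal(loc=c, size=(5, 8)) for c in range(3)}
protos = {c: embed(xs).mean(axis=0) for c, xs in support.items()}

# Soft assignment of a query via negative squared Euclidean distances.
x_q = rng.normal(loc=1, size=8)
d2 = np.array([np.sum((embed(x_q) - p) ** 2) for p in protos.values()])
logits = -d2
probs = np.exp(logits - logits.max())
probs /= probs.sum()          # Eq. (8.59)
pred = int(np.argmax(probs))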
Graph-based meta-learning frameworks construct a graph G = (V, E) where nodes represent cate-
gories, and edges encode semantic or structural relationships. Given a graph convolutional network
(GCN) with adjacency matrix A and feature matrix X, the node embeddings are computed itera-
tively as:
H^{(l+1)} = \sigma(A H^{(l)} W^{(l)}), (8.60)
where W (l) are layer-specific learnable parameters and σ is a non-linearity. The final node embed-
dings serve as classifiers for unseen categories. Bayesian meta-learning introduces uncertainty quan-
tification by modeling the parameters as distributions rather than point estimates. Let θ ∼ p(θ)
be a prior distribution over parameters. The posterior is computed using Bayes’ rule:
p(\theta|D) = \frac{p(D|\theta)p(\theta)}{p(D)}. (8.61)
Inference is performed by marginalizing over the posterior:
p(y^*|x^*, D) = \int p(y^*|x^*, \theta)\, p(\theta|D)\, d\theta. (8.62)
Approximate inference methods such as variational Bayes or Monte Carlo sampling are used to
estimate the integral. Each of these approaches forms the foundation of meta-learning in zero-
shot learning, providing a mechanism to generalize across unseen categories using transferable
knowledge.
Here U(θ, T) represents the model parameters after adaptation on task T. In the standard
supervised learning setting, given a task-specific loss function LT (θ), the adaptation process involves
performing one or more gradient descent steps with respect to this loss function. The adapted
parameters for task T_i after one step of gradient descent are θ_i' = θ − α∇_θ L_{T_i}(θ), where α is the inner-loop learning rate. The meta-objective then involves optimizing the initial parameters θ to ensure that the adapted parameters θ_i' yield low loss across tasks. This optimization is performed using gradient descent, leading to a second-order gradient update of the form
\theta \leftarrow \theta - \beta \sum_{i} \nabla_\theta L_{T_i}(\theta - \alpha \nabla_\theta L_{T_i}(\theta)) (8.66)
where β is the meta-learning rate. The second-order term in this update makes MAML computa-
tionally expensive, but it allows for more expressive updates. In zero-shot learning, where the model
must generalize to tasks without direct exposure to them during meta-training, the effectiveness
of MAML is attributed to its ability to produce an initialization that is inherently generalizable.
Specifically, if the task distribution p(T ) is designed such that the training tasks span a diverse
range of possible learning scenarios, then the learned initialization can generalize to unseen tasks
by leveraging a smooth loss landscape across task variations. Mathematically, this can be expressed
as minimizing the expected task-generalization error
\min_\theta E_{T \sim p(T)} E_{T' \sim p_{\text{test}}(T')} \left[ L_{T'}(U(\theta, T)) \right] (8.67)
where ptest (T ′ ) represents the distribution over unseen tasks. To approximate this expectation, a
common strategy is to incorporate a regularization term that promotes smooth adaptation across
task variations, yielding the regularized meta-objective
\min_\theta \sum_{i} L_{T_i}(\theta_i') + \lambda R(\theta) (8.68)
where R(θ) is a regularization function that penalizes large gradients or promotes parameter
smoothness, and λ controls the strength of this regularization. The zero-shot capability of MAML
can be further improved by explicitly optimizing for robustness in the presence of distribution shifts.
This can be achieved by modifying the meta-objective to incorporate an adversarial worst-case term over test-task distributions, where p_adv(T') represents an adversarially chosen distribution over test tasks. This leads to a
minimax optimization problem, ensuring that the learned initialization is robust to extreme vari-
ations in task distributions. Additionally, when applied to function approximation problems such
as few-shot regression or reinforcement learning, MAML can be interpreted as an implicit form
of hierarchical Bayesian inference. In this interpretation, the meta-learned parameters θ serve as
a prior, and task-specific fine-tuning corresponds to a Bayesian posterior update. This perspec-
tive connects MAML with variational inference methods, where the meta-learning process can be
rewritten as optimizing an evidence lower bound (ELBO) in which q_T(θ') represents the posterior distribution over task-specific parameters and p(θ) is the learned meta-prior. The KL-divergence term D_KL enforces consistency between the meta-learned
initialization and the task-specific updates.
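The following is a minimal first-order sketch of the MAML loop (the second-order term of Eq. (8.66) is dropped, as in FO-MAML); the per-task quadratic loss and its gradient oracle are illustrative assumptions.

import numpy as np

alpha, beta = 0.1, 0.01        # inner and meta learning rates
rng = np.random.default_rng(8)
theta = rng.normal(size=4)
tasks = [rng.normal(size=4) for _ in range(10)]  # each task: target vector

def loss_grad(th, target):
    return 2.0 * (th - target)  # gradient of ||theta - target||^2

for _ in range(100):                       # meta-training iterations
    meta_grad = np.zeros_like(theta)
    for target in tasks:
        theta_i = theta - alpha * loss_grad(theta, target)   # inner step
        meta_grad += loss_grad(theta_i, target)              # FO-MAML grad
    theta -= beta * meta_grad / len(tasks)                   # outer step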
To achieve this, a semantic embedding space S ∈ Rm is constructed, where each class c (both
seen and unseen) is represented by a prototype vector sc ∈ Rm . The embeddings are often derived
from textual descriptions (e.g., word vectors, attribute annotations) and must be aligned with the visual feature space through a compatibility function
F(x, s_c) = x^T W s_c (8.71)
where W ∈ Rd×m is a learnable transformation matrix. The objective function for training is then
given by a ranking loss that enforces the correct pair (x, sy ) to have a higher compatibility score
than incorrect class pairs:
\sum_{i=1}^{N} \sum_{c \neq y_i} \max(0, \gamma - F(x_i, s_{y_i}) + F(x_i, s_c)) (8.72)
where γ is a margin parameter ensuring separation between correct and incorrect matches. To
enable generalization to unseen classes, meta-embedding strategies are introduced, where a meta-
learned function G constructs a class prototype dynamically:
sc = G(zc ) (8.73)
where zc is a raw class descriptor (e.g., word embeddings, attribute vectors) and G is a neural
network that transforms zc into a semantically meaningful embedding. The training objective
extends to minimizing the discrepancy between predicted and actual prototypes:
\min_{G} \sum_{c \in Y_{\text{seen}}} \|G(z_c) - s_c\|^2 (8.74)
so that for unseen classes, sc = G(zc ) provides a meaningful representation. This ensures that when
an unseen instance x_u is projected into the space, the classification decision is based on learned semantic relationships rather than direct supervision. Additionally, variational
methods can be employed to refine embeddings by incorporating uncertainty estimation via a
probabilistic latent space:
p(sc |zc ) = N (µG (zc ), ΣG (zc )) (8.76)
where µG (zc ) and ΣG (zc ) are the mean and covariance functions modeled via neural networks. The
overall loss function then includes a Kullback-Leibler (KL) divergence term to regularize the learned
distributions:
L = \sum_{c \in Y_{\text{seen}}} \|s_c - \mu_G(z_c)\|^2 + \lambda \sum_{c \in Y_{\text{seen}}} D_{KL}(p(s_c|z_c) \| N(0, I)) (8.77)
The goal is to learn a function f : X → Rd that maps input instances into an embedding space
where classification is performed based on distances to class prototypes, enabling generalization to
an unseen class set Yu without direct training data.
The embedding function f (x) is parameterized as a deep neural network that learns a transforma-
tion from the raw input space to a metric space with a well-defined distance function d : Rd × Rd →
R. One commonly employed distance function is the squared Euclidean distance:
d(z_i, z_j) = \|z_i - z_j\|_2^2 (8.78)
where zi = f (xi ) and zj = f (xj ). Alternatively, cosine similarity is often used for improved
generalization:
d_{\cos}(z_i, z_j) = 1 - \frac{z_i^T z_j}{\|z_i\| \|z_j\|} (8.79)
A central component of metric-based meta-learning is the construction of prototypes cy for each
class, which serve as the representative embeddings for classification. The prototype of a class y is
computed as the mean embedding of all support examples belonging to that class:
c_y = \frac{1}{|S_y|} \sum_{(x_i, y_i) \in S_y} f(x_i) (8.80)
where Sy is the set of all training instances labeled as y. Given a query instance xq , its class
assignment is determined by computing the distance to all class prototypes and selecting the closest
one:
\hat{y}_q = \arg\min_{y \in Y_s} d(f(x_q), c_y) (8.81)
In the meta-learning paradigm, training occurs episodically, where each episode simulates a ZSL
scenario with a small support set S and a query set Q. The meta-objective minimizes the classifi-
cation loss over the query set using a nearest-prototype strategy:
L_{\text{meta}} = \sum_{(x_q, y_q) \in Q} \ell(y_q, \hat{y}_q) (8.82)
Training involves optimizing f and W jointly using episodic meta-learning, ensuring that f learns to
generalize from seen to unseen classes by aligning feature embeddings with semantic representations.
The meta-learning update follows:
θ ← θ − η∇θ Lmeta , W ← W − η∇W Lmeta (8.86)
where θ denotes the parameters of the embedding function f , η is the learning rate, and gradients
are computed over multiple episodes. The effectiveness of metric-based meta-learning in ZSL is
enhanced by incorporating contrastive loss:
L_{\text{contrastive}} = \sum_{i,j} \left[ y_{ij}\, d(z_i, z_j) + (1 - y_{ij}) \max(0, \tau - d(z_i, z_j)) \right] (8.87)
where yij is 1 if xi and xj belong to the same class, 0 otherwise, and τ is a margin parameter.
This loss encourages similar instances to be closer while pushing dissimilar ones apart. Zero-shot
generalization relies on the alignment between the semantic space and the learned metric space.
Regularization techniques such as distribution calibration adjust the prototype computation by
leveraging seen class distributions:
c_y = \lambda g(a_y) + (1 - \lambda) \frac{1}{|N_y|} \sum_{y' \in N_y} c_{y'} (8.88)
where Ny is a set of semantically similar seen classes to y, and λ controls the balance between
semantic projection and distribution calibration. By iteratively refining the metric space and
semantic alignment, metric-based meta-learning enables effective zero-shot classification, where
new classes are recognized based on learned semantic-visual correspondences without requiring
additional samples. The integration of episodic training, prototype-based classification, contrastive
learning, and semantic projection ensures that the learned model generalizes robustly to unseen
categories.
Let each class c ∈ C = Cs ∪Cu be represented by a feature vector zc ∈ Rd , typically derived from word
embeddings such as Word2Vec or GloVe. The task is to learn a classifier g(x) = arg maxc P (c|x),
where P (c|x) is the probability of an input sample x belonging to class c. The underlying principle
of the graph-based approach is to propagate information across the graph structure using message
passing frameworks such as Graph Neural Networks (GNNs). The node feature update at layer t
is given by
h_c^{(t)} = \sigma\left( \sum_{c' \in N(c)} w_{cc'} W^{(t)} h_{c'}^{(t-1)} + b^{(t)} \right) (8.89)
where h_c^{(t)} is the feature representation of class c at layer t, N(c) denotes the neighboring classes,
wcc′ is an edge weight encoding semantic similarity between class c and c′ , and σ is a nonlinear
activation function such as ReLU. The parameters W(t) and b(t) are learned via backpropagation.
To transfer knowledge, meta-learning is employed by training the model on a sequence of meta-
tasks Ti = (Si , Qi ), where Si is the support set and Qi is the query set. The meta-learning objective
is to minimize the expected loss across tasks:
\min_\theta \sum_{i} E_{T_i}\left[ L(g_\theta(S_i), Q_i) \right] (8.90)
where gθ is the classifier parameterized by θ and L is the classification loss. The classifier is
optimized via an episodic training strategy, which mimics the zero-shot setting by holding out
certain classes during training. Specifically, we define a parameterized distance metric dϕ (x, hc ) for
classification, such as cosine similarity
d_\phi(x, h_c) = \frac{x^T h_c}{\|x\| \|h_c\|} (8.91)
or a Mahalanobis distance
dϕ (x, hc ) = (x − hc )T Σ−1 (x − hc ) (8.92)
where Σ is a learned covariance matrix capturing intra-class variations. The final classification
decision is made via a softmax layer
P(c|x) = \frac{\exp(d_\phi(x, h_c))}{\sum_{c' \in C} \exp(d_\phi(x, h_{c'}))} (8.93)
To improve generalization to unseen classes, the model incorporates a regularization term based on
the graph Laplacian L = D − A, where D is the degree matrix and A is the adjacency matrix of
the class graph. The regularization enforces smoothness in feature representations across the graph
L_{\text{reg}} = \sum_{c,c'} w_{cc'} \|h_c - h_{c'}\|^2 (8.94)
Thus, the final objective function combines classification loss and regularization
L = \sum_{i} E_{T_i}\left[ L_{\text{class}}(g_\theta(S_i), Q_i) \right] + \lambda L_{\text{reg}} (8.95)
An additional contrastive term may be included over semantically similar class pairs (c, c^+). The model is trained using stochastic gradient descent with updates
θ ← θ − η∇θ L (8.97)
where η is the learning rate. The learned class representations are then used to classify unseen class samples by assigning them to the nearest prototype in the embedding space.
By leveraging the structured class graph, propagating knowledge through message passing, and
optimizing via meta-learning, this approach enables robust zero-shot classification even when visual
samples from unseen classes are unavailable.
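A minimal sketch of one propagation step over a hypothetical class graph, together with the Laplacian smoothness penalty of Eq. (8.94); the adjacency structure and features are random placeholders.

import numpy as np

rng = np.random.default_rng(11)
n_classes, d = 6, 16
A = rng.random((n_classes, n_classes)) < 0.4       # random class graph
A = np.maximum(A, A.T).astype(float)               # symmetrize
np.fill_diagonal(A, 1.0)                           # self-loops
A /= A.sum(axis=1, keepdims=True)                  # normalize rows

H = rng.normal(size=(n_classes, d))                # initial class features
W = rng.normal(size=(d, d)) * 0.1                  # layer weights

H_next = np.maximum(0.0, A @ H @ W)                # H^(t+1) = ReLU(A H W)

# Smoothness penalty: small when connected classes embed similarly.
L_reg = sum(A[c, cp] * np.sum((H_next[c] - H_next[cp]) ** 2)
            for c in range(n_classes) for cp in range(n_classes))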
In Bayesian meta-learning, we assume that the task distribution follows a hierarchical generative
model. Given a hyperprior distribution p(z), the task-specific parameters θi are sampled condition-
ally on z as
θi ∼ p(θ|z), (8.100)
where θ_i are the task-specific parameters for T_i. The likelihood of observing dataset D_i given θ_i is then
p(D_i|\theta_i) = \prod_{j=1}^{m_i} p(y_{ij}|x_{ij}, \theta_i). (8.101)
For meta-learning, we define a posterior distribution over z conditioned on all observed tasks:
p(z|T ) ∝ p(T |z)p(z), (8.102)
where
p(T|z) = \prod_{i=1}^{N} \int p(D_i|\theta_i)\, p(\theta_i|z)\, d\theta_i. (8.103)
For zero-shot learning, where a new task T∗ with dataset D∗ is encountered without labeled exam-
ples, the goal is to infer the posterior predictive distribution
p(y_*|x_*, T) = \int p(y_*|x_*, \theta_*)\, p(\theta_*|T)\, d\theta_*. (8.104)
A variational Bayesian approach approximates p(z|T ) using a variational distribution q(z), typically
assumed to be Gaussian:
q(z) = N (z|µ, Σ). (8.105)
The variational objective is to minimize the KL divergence
L = DKL (q(z)∥p(z|T )), (8.106)
which expands to the evidence lower bound (ELBO)
" N #
X
L = Eq(z) log p(Di |z) − DKL (q(z)∥p(z)). (8.107)
i=1
For amortized inference, we introduce an encoder q(z|D), parameterized as a neural network, such
that
q(z|D) = N (µ(D), Σ(D)). (8.108)
During inference, for a new task T∗ , the posterior predictive distribution is given by marginalizing
out z:
p(y_*|x_*, T) = \int p(y_*|x_*, z)\, q(z|T)\, dz. (8.109)
Using Monte Carlo approximation, the integral can be estimated as
p(y_*|x_*, T) \approx \frac{1}{K} \sum_{k=1}^{K} p(y_*|x_*, z_k), (8.110)
where zk ∼ q(z|T ). The function p(y∗ |x∗ , z) is modeled using a deep neural network fθ (x∗ , z), which
outputs a probability distribution over classes. For zero-shot learning, the classifier must generalize
to unseen classes by leveraging the inferred latent structure. This is achieved by incorporating
semantic embeddings e(y) that define a probability distribution over labels:
p(y|x, z) ∝ exp(e(y)T fθ (x, z)). (8.111)
This formulation enables the model to predict classes that were never observed in the training
set by mapping the latent variable z to a space shared across seen and unseen classes. The final
optimization objective for meta-learning is
L_{\text{meta}} = \sum_{i=1}^{N} E_{q(z|D_i)}\left[ \log p(D_i|z) \right] - D_{KL}(q(z|D_i) \| p(z)). (8.112)
Through this Bayesian hierarchical approach, zero-shot learning emerges as a byproduct of inferring
a shared structure across tasks, allowing for principled generalization to unseen distributions.
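A minimal sketch of the Monte Carlo posterior predictive of Eq. (8.110), assuming hypothetical encoder outputs (a diagonal Gaussian for q(z|T), as in Eq. (8.108)) and a toy compatibility model in place of the learned network f_θ:

import numpy as np

rng = np.random.default_rng(12)
mu, sigma = np.zeros(8), 0.5 * np.ones(8)    # assumed encoder outputs
n_classes = 4
E = rng.normal(size=(n_classes, 8))          # semantic embeddings e(y)

def predict(x, z):
    # p(y|x, z) ∝ exp(e(y)^T f(x, z)); f is a simple sum placeholder.
    logits = E @ (x + z)
    p = np.exp(logits - logits.max())
    return p / p.sum()

x_star = rng.normal(size=8)
K = 100
samples = mu + sigma * rng.normal(size=(K, 8))   # z_k ~ q(z|T)
p_pred = np.mean([predict(x_star, z) for z in samples], axis=0)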
9 Neural Network Basics
Literature Review: Goodfellow et al. (2016) [112] wrote one of the most comprehensive books
on deep learning, covering the theoretical foundations of neural networks, optimization techniques,
and probabilistic models. It is widely used in academic courses and research. Haykin (2009)
[114] explained neural networks from a signal processing perspective, covering perceptrons, back-
propagation, and recurrent networks with a strong mathematical approach. Schmidhuber (2015)
[115] gave a historical and theoretical review of deep learning architectures, including recurrent
neural networks (RNNs), convolutional neural networks (CNNs), and long short-term memory
(LSTM). Bishop (2006) [116] gave a Bayesian perspective on neural networks and probabilistic
graphical models, emphasizing the statistical underpinnings of learning. Poggio and Smale (2003)
[117] established theoretical connections between neural networks, kernel methods, and function
approximation. LeCun (2015) [118] discussed the principles behind modern deep learning, includ-
ing backpropagation, unsupervised learning, and hierarchical feature extraction. Cybenko (1989)
[58] proved the universal approximation theorem, demonstrating that a neural network with a sin-
gle hidden layer can approximate any continuous function. Hornik et al. (1989) [57] extended
Cybenko’s theorem, proving that multilayer perceptrons (MLPs) are universal function approxi-
mators. Pinkus (1999) [60] gave a rigorous mathematical discussion on neural networks from the
perspective of approximation theory. Tishby and Zaslavsky (2015) [119] introduced the information
bottleneck framework for understanding deep neural networks, explaining how networks learn to
compress and encode information efficiently.
To determine the output, the weighted input z = \vec{w}^T \vec{x} + b is passed through the step activation function, defined mathematically as
\phi(z) = \begin{cases} 1, & z \geq 0, \\ 0, & z < 0. \end{cases} (9.2)
Thus, the perceptron’s decision-making process can be expressed as
y = \phi(\vec{w}^T \vec{x} + b), (9.3)
where y ∈ {0, 1}. The equation \vec{w}^T \vec{x} + b = 0 defines a hyperplane in R^n, which acts as the decision boundary. For any input \vec{x}, the classification is determined by the sign of \vec{w}^T \vec{x} + b: specifically, y = 1 if \vec{w}^T \vec{x} + b ≥ 0 and y = 0 otherwise. Geometrically, this classification corresponds to partitioning the input space into two distinct half-spaces. To train the perceptron, a supervised learning algorithm
adjusts the weights \vec{w} and the bias b iteratively using labeled training data \{(\vec{x}_i, y_i)\}_{i=1}^{m}, where y_i represents the ground truth. When the predicted output y_{\text{pred}} = \phi(\vec{w}^T \vec{x}_i + b) differs from y_i, the weight vector and bias are updated according to the rule
\vec{w} \leftarrow \vec{w} + \eta(y_i - y_{\text{pred}})\vec{x}_i, (9.4)
and
b ← b + η(yi − ypred ), (9.5)
where η > 0 is the learning rate; each individual weight w_j is updated analogously, component by component.
For a linearly separable dataset, the Perceptron Convergence Theorem asserts that the algorithm
will converge to a solution after a finite number of updates. Specifically, the number of updates is
bounded by
\frac{R^2}{\gamma^2}, (9.7)
where R = \max_i \|\vec{x}_i\| is the maximum norm of the input vectors, and γ is the minimum margin, defined as
\gamma = \min_i \frac{y_i(\vec{w}^T \vec{x}_i + b)}{\|\vec{w}\|}. (9.8)
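A minimal sketch of the perceptron learning rule of Eqs. (9.4)–(9.5) on a synthetic linearly separable dataset; convergence after finitely many updates is guaranteed here by the theorem above.

import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(100, 2))
y = (X @ np.array([1.0, -2.0]) + 0.5 >= 0).astype(int)  # teacher labels

w = np.zeros(2)
b = 0.0
eta = 0.1

for epoch in range(50):
    errors = 0
    for x_i, y_i in zip(X, y):
        y_pred = int(w @ x_i + b >= 0)       # step activation, Eq. (9.2)
        if y_pred != y_i:
            w += eta * (y_i - y_pred) * x_i  # Eq. (9.4)
            b += eta * (y_i - y_pred)        # Eq. (9.5)
            errors += 1
    if errors == 0:   # converged: guaranteed for separable data
        break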
The limitations of the perceptron, particularly its inability to solve linearly inseparable problems
such as the XOR problem, necessitate the extension to artificial neurons with non-linear activation
functions. A popular choice is the sigmoid activation function
1
ϕ(z) = , (9.9)
1 + e−z
which maps z ∈ R to the continuous interval (0, 1). The derivative of the sigmoid function, essential for gradient-based optimization, is
\phi'(z) = \phi(z)(1 - \phi(z)). (9.10)
Another widely used activation function is the hyperbolic tangent tanh(z), defined as
\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}, (9.11)
with derivative
\tanh'(z) = 1 - \tanh^2(z). (9.12)
ReLU, or Rectified Linear Unit, is defined as
\phi(z) = \max(0, z), (9.13)
with derivative
\phi'(z) = \begin{cases} 1, & z > 0, \\ 0, & z \leq 0. \end{cases} (9.14)
These non-linear activations enable the network to approximate non-linear decision boundaries,
a capability absent in the perceptron. Artificial neurons form the building blocks of multi-layer
perceptrons (MLPs), where neurons are organized into layers. For an L-layer network, the input ⃗x
is transformed layer by layer. At layer l, the output is
\vec{z}^{(l)} = \phi^{(l)}(W^{(l)} \vec{z}^{(l-1)} + \vec{b}^{(l)}), (9.15)
where W^{(l)} ∈ R^{n_l × n_{l-1}} is the weight matrix, \vec{b}^{(l)} ∈ R^{n_l} is the bias vector, and \phi^{(l)} is the activation function. The network's output is
\hat{\vec{y}} = \phi^{(L)}(W^{(L)} \vec{z}^{(L-1)} + \vec{b}^{(L)}). (9.16)
The Universal Approximation Theorem guarantees that MLPs with sufficient neurons and non-linear activations can approximate any continuous function f : R^n → R^m on a compact domain to arbitrary precision. Formally, for any ϵ > 0, there exists an MLP g(\vec{x}) such that
\|f(\vec{x}) - g(\vec{x})\| < \epsilon (9.17)
for all \vec{x} in the compact domain. Training an MLP minimizes a loss function L that quantifies the error between
predicted outputs ⃗yˆ and ground truth labels ⃗y . For regression, the mean squared error is
L = \frac{1}{m} \sum_{i=1}^{m} \|\hat{\vec{y}}_i - \vec{y}_i\|^2, (9.18)
Optimization uses stochastic gradient descent (SGD), updating the parameters \Theta = \{W^{(l)}, \vec{b}^{(l)}\}_{l=1}^{L} as
\Theta \leftarrow \Theta - \eta \nabla_\Theta L. (9.20)
This recursive structure, combined with chain rule applications, efficiently propagates error signals
from the output layer back to the input layer.
Artificial neurons and their extensions have thus provided the foundation for modern deep learn-
ing. Their mathematical underpinnings and computational frameworks are instrumental in solving
a wide array of problems, from classification and regression to complex decision-making. The in-
terplay of linear algebra, calculus, and optimization theory in their formulation ensures that these
networks are both theoretically robust and practically powerful.
Here, W_k ∈ R^{m_k × m_{k-1}} represents the weight matrix, \vec{b}_k ∈ R^{m_k} is the bias vector, and f_k : R^{m_k} → R^{m_k} is a component-wise activation function. Formally, if we denote the input layer as \vec{a}_0 = \vec{x}, the
final output of the network, ⃗y ∈ Rm , is given by ⃗aL = fL (WL⃗aL−1 + ⃗bL ). Each transformation in
this sequence can be described as ⃗zk = Wk⃗ak−1 + ⃗bk , followed by the activation ⃗ak = fk (⃗zk ). The
affine transformation ⃗zk = Wk⃗ak−1 + ⃗bk encapsulates the linear combination of inputs with weights
Wk and the addition of biases ⃗bk . For any two layers k and k + 1, the overall transformation can
be represented by
⃗zk+1 = Wk+1 (Wk⃗ak−1 + ⃗bk ) + ⃗bk+1 . (9.24)
Expanding this, we have
⃗zk+1 = Wk+1 Wk⃗ak−1 + Wk+1⃗bk + ⃗bk+1 . (9.25)
Without the nonlinearity introduced by fk , the network reduces to a single affine transformation
⃗y = W ⃗x + ⃗b, (9.26)
Thus, the incorporation of nonlinear activation functions is critical, as it enables the network to
approximate non-linear mappings. Activation functions fk are applied element-wise to the pre-
activation vector ⃗zk . The choice of activation significantly affects the network’s behavior and
training. For example, the sigmoid activation f(x) = \frac{1}{1+e^{-x}} compresses inputs into the range (0, 1) and has a derivative given by
f ′ (x) = f (x)(1 − f (x)). (9.28)
The hyperbolic tangent activation f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} maps inputs to (−1, 1) with derivative f'(x) = 1 - \tanh^2(x).
The ReLU activation f(x) = \max(0, x), commonly used in modern networks, has derivative
f'(x) = \begin{cases} 1, & x > 0, \\ 0, & x \leq 0. \end{cases} (9.30)
These derivatives are essential for gradient-based optimization. The objective of training a feedfor-
ward neural network is to minimize a loss function L, which measures the discrepancy between the
predicted outputs \vec{y}_i and the true targets \vec{t}_i over a dataset \{(\vec{x}_i, \vec{t}_i)\}_{i=1}^{N}. For regression problems, the mean squared error (MSE) is often used, given by
L = \frac{1}{N} \sum_{i=1}^{N} \|\vec{y}_i - \vec{t}_i\|^2 = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{m} (y_{i,j} - t_{i,j})^2. (9.31)
where ti,j represents the one-hot encoded labels. The gradient of L with respect to the network
parameters is computed using backpropagation, which applies the chain rule iteratively to propagate
errors from the output layer to the input layer. During backpropagation, the error signal at the
output layer is computed as
\delta_L = \frac{\partial L}{\partial \vec{z}_L} = \nabla_{\vec{y}} L \odot f'_L(\vec{z}_L), (9.33)
where ⊙ denotes the Hadamard product. For hidden layers, the error signal propagates backward
as
\delta_k = (W_{k+1}^T \delta_{k+1}) \odot f'_k(\vec{z}_k). (9.34)
The gradients of the loss with respect to the weights and biases are then given by
\frac{\partial L}{\partial W_k} = \delta_k \vec{a}_{k-1}^T, \qquad \frac{\partial L}{\partial \vec{b}_k} = \delta_k. (9.35)
These gradients are used to update the parameters through optimization algorithms like stochastic
gradient descent (SGD), where
W_k \leftarrow W_k - \eta \frac{\partial L}{\partial W_k}, \qquad \vec{b}_k \leftarrow \vec{b}_k - \eta \frac{\partial L}{\partial \vec{b}_k}, (9.36)
with η > 0 as the learning rate. The universal approximation theorem rigorously establishes that
a feedforward neural network with at least one hidden layer and sufficiently many neurons can
approximate any continuous function f : Rn → Rm on a compact domain D ⊂ Rn . Specifically,
for any ϵ > 0, there exists a network fˆ such that ∥f (⃗x) − fˆ(⃗x)∥ < ϵ for all ⃗x ∈ D. This expressive
capability arises because the composition of affine transformations and nonlinear activations allows
the network to approximate highly complex functions by partitioning the input space into regions
and assigning different functional behaviors to each.
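A minimal sketch of one training loop implementing Eqs. (9.33)–(9.36) for a two-layer network with sigmoid hidden units, a linear output layer, and an MSE loss; the data are synthetic placeholders.

import numpy as np

rng = np.random.default_rng(10)
X = rng.normal(size=(32, 4))          # mini-batch of inputs
T = rng.normal(size=(32, 2))          # regression targets

W1, b1 = rng.normal(size=(4, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(8, 2)) * 0.1, np.zeros(2)
eta = 0.05

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(200):
    # Forward: z_k = a_{k-1} W_k + b_k, a_k = f_k(z_k)
    z1 = X @ W1 + b1
    a1 = sigmoid(z1)
    y = a1 @ W2 + b2                      # linear output layer

    # Output error signal, Eq. (9.33); f_L is linear so f_L' = 1.
    delta2 = (2.0 / len(X)) * (y - T)
    # Hidden error signal, Eq. (9.34), with sigmoid' = a(1 - a).
    delta1 = (delta2 @ W2.T) * a1 * (1.0 - a1)

    # Gradients, Eq. (9.35), and SGD updates, Eq. (9.36).
    W2 -= eta * a1.T @ delta2; b2 -= eta * delta2.sum(axis=0)
    W1 -= eta * X.T @ delta1;  b1 -= eta * delta1.sum(axis=0)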
The neuron first computes the net input z = w^⊤ x + b, where w^⊤ x represents the dot product of the weight vector and the input vector. The activation
function σ(z) is then applied to this net input to obtain the output of the neuron a:
a = \sigma(z) = \sigma\left( \sum_{i=1}^{n} w_i x_i + b \right). (9.38)
i=1
The activation function introduces a non-linearity into the neuron’s response, which is a crucial
aspect of neural networks because, without it, the network would only be able to perform linear
transformations of the input data, limiting its ability to approximate complex, real-world func-
tions. The non-linearity introduced by σ(z) is fundamental because it enables the network to
capture intricate relationships between the input and output, making neural networks capable of
solving problems that require hierarchical feature extraction, such as image classification, time-
series forecasting, and language modeling. The importance of non-linearity is most clearly evident
when considering the mathematical formulation of a multi-layer neural network. For a feed-forward
neural network with L layers, the output ŷ of the network is given by the composition of successive
affine transformations and activation functions. Let x denote the input vector, Wk and bk be the
weight matrix and bias vector for the k-th layer, and σk be the activation function for the k-th
layer. The output of the network is:
\hat{y} = \sigma_L(W_L\, \sigma_{L-1}(\cdots \sigma_1(W_1 x + b_1) \cdots) + b_L). (9.39)
If σ(z) were a linear function, say σ(z) = c·z for some constant c, the composition of such functions
would still result in a linear function. Specifically, if each σk were linear, the overall network function
would simplify to a single linear transformation:
ŷ = c1 · x + c2 , (9.40)
where c1 and c2 are constants dependent on the parameters of the network. In this case, the
network would have no greater expressive power than a simple linear regression model, regardless
of the number of layers. Thus, the non-linearity introduced by activation functions allows neural
networks to approximate any continuous function, as guaranteed by the universal approximation
theorem. This theorem states that a feed-forward neural network with at least one hidden layer
and a sufficiently large number of neurons can approximate any continuous function f (x), provided
the activation function is non-linear and the network has enough capacity.
Next, consider the mathematical properties that the activation function σ(z) must possess. First,
it must be differentiable to allow the use of gradient-based optimization methods like backpropaga-
tion for training. Backpropagation relies on the chain rule of calculus to compute the gradients of
the loss function L with respect to the parameters (weights and biases) of the network. Suppose
L = L(ŷ, y) is the loss function, where ŷ is the predicted output of the network and y is the true
label. During training, we compute the gradient of L with respect to the weights using the chain
rule. Let ak = σk (zk ) represent the output of the activation function at layer k, where zk is the
input to the activation function. The gradient of the loss with respect to the weights at layer k is
given by:
$$\frac{\partial L}{\partial W_k} = \frac{\partial L}{\partial a_k} \cdot \frac{\partial a_k}{\partial z_k} \cdot \frac{\partial z_k}{\partial W_k}. \qquad (9.41)$$
The term ∂a_k/∂z_k is the derivative of the activation function, which must exist and be well-defined
for gradient-based optimization to work effectively. If the activation function is not differentiable,
the backpropagation algorithm cannot compute the gradients, preventing the training process from
proceeding.
Now consider the specific forms of activation functions commonly used in practice. The sigmoid
activation function is one of the most well-known, defined as:
$$\sigma(z) = \frac{1}{1 + e^{-z}}. \qquad (9.42)$$
Its derivative is:
σ ′ (z) = σ(z)(1 − σ(z)), (9.43)
which can be derived by applying the chain rule to the expression for σ(z). Although sigmoid
is differentiable and smooth, it suffers from the vanishing gradient problem, especially for large
positive or negative values of z. Specifically, as z → ∞, σ ′ (z) → 0, and similarly as z → −∞,
σ ′ (z) → 0. This results in very small gradients during backpropagation, making it difficult for
the network to learn when the input values become extreme. To mitigate the vanishing gradient
problem, the hyperbolic tangent (tanh) function is often used as an alternative. It is defined
as:
$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}, \qquad (9.44)$$
with derivative:
tanh′ (z) = 1 − tanh2 (z). (9.45)
The tanh function outputs values in the range (−1, 1), which helps to center the data around
zero. While the tanh function overcomes some of the vanishing gradient issues associated with the
sigmoid function, it still suffers from the problem for large |z|, where the gradients approach zero.
The Rectified Linear Unit (ReLU) is another commonly used activation function. It is defined
as:
ReLU(z) = max(0, z), (9.46)
with derivative:
$$\mathrm{ReLU}'(z) = \begin{cases} 1, & z > 0, \\ 0, & z \le 0. \end{cases} \qquad (9.47)$$
A popular variant, the Leaky ReLU, is defined as
$$\text{Leaky ReLU}(z) = \begin{cases} z, & z > 0, \\ \alpha z, & z \le 0, \end{cases} \qquad (9.48)$$
where α is a small constant, typically chosen to be 0.01. The derivative of the Leaky ReLU is:
$$\text{Leaky ReLU}'(z) = \begin{cases} 1, & z > 0, \\ \alpha, & z \le 0. \end{cases} \qquad (9.49)$$
Leaky ReLU ensures that neurons do not become entirely inactive by allowing a small, non-zero
gradient for negative values of z. Finally, for classification tasks, particularly when there are
multiple classes, the Softmax activation function is often used in the output layer of the neural
network. The Softmax function is defined as:
$$\mathrm{Softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}, \qquad (9.50)$$
where zi is the input to the i-th neuron in the output layer, and the denominator ensures that the
outputs sum to 1, making them interpretable as probabilities. The Softmax function is typically
used in multi-class classification problems, where the network must predict one class out of several
possible categories.
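The activation functions and derivatives above translate directly into code. The short NumPy sketch below implements Eqs. 9.42-9.50; the max-subtraction in the softmax is a standard numerical-stability device assumed here, not part of the mathematical definition.

```python
import numpy as np

def sigmoid(z):       return 1.0 / (1.0 + np.exp(-z))
def d_sigmoid(z):     s = sigmoid(z); return s * (1.0 - s)           # Eq. (9.43)
def d_tanh(z):        return 1.0 - np.tanh(z) ** 2                   # Eq. (9.45)
def relu(z):          return np.maximum(0.0, z)                      # Eq. (9.46)
def d_relu(z):        return (z > 0).astype(float)                   # Eq. (9.47)
def leaky_relu(z, alpha=0.01):   return np.where(z > 0, z, alpha * z)
def d_leaky_relu(z, alpha=0.01): return np.where(z > 0, 1.0, alpha)  # Eq. (9.49)

def softmax(z):
    z = z - np.max(z)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()         # Eq. (9.50): non-negative, sums to 1

z = np.array([-4.0, -0.5, 0.0, 0.5, 4.0])
print(d_sigmoid(z))   # gradients shrink toward 0 for large |z| (vanishing gradient)
print(softmax(z))     # interpretable as class probabilities
```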
In summary, activation functions are a vital component of neural networks, enabling them to learn
intricate patterns in data, allowing for the successful application of neural networks to diverse tasks.
Different activation functions—such as sigmoid, tanh, ReLU, Leaky ReLU, and Softmax—each of-
fer distinct advantages and limitations, and their choice significantly impacts the performance and
training dynamics of the neural network.
The training objective is the empirical risk over the dataset,
$$\mathcal{L}(\mathbf{W}) = \frac{1}{N}\sum_{i=1}^{N} L(y_i, \hat{y}_i), \qquad (9.51)$$
where L(yi, ŷi) represents the loss function that computes the error between the true output yi
and the predicted output ŷi for each data point. To minimize this objective function, optimization
algorithms such as gradient descent are used. The general update rule for the weights W is given
by:
W ← W − η∇W L(W) (9.52)
where η is the learning rate, and ∇W L(W) is the gradient of the loss function with respect to
the weights. The gradient is computed using backpropagation, which applies the chain rule
of calculus to propagate the error backward through the network, updating the parameters to
minimize the loss. For this, we use the partial derivatives of the loss with respect to each layer’s
weights and biases, ensuring the error is distributed appropriately across all layers. For regression
tasks, the Mean Squared Error (MSE) loss is frequently used. This loss function quantifies
the error as the average squared difference between the predicted and true values. The MSE for a
dataset of N examples is given by:
$$L_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \qquad (9.53)$$
where ŷi = f (xi ; W) is the network’s predicted output for the i-th input xi . The gradient of the
MSE with respect to the network’s output ŷi is:
$$\frac{\partial L_{\text{MSE}}}{\partial \hat{y}_i} = \frac{2}{N}\,(\hat{y}_i - y_i) \qquad (9.54)$$
This gradient guides the weight update in the direction that minimizes the squared error, leading
to a better fit of the model to the training data. For classification tasks, the cross-entropy
loss is often employed, as it is particularly well-suited to tasks where the output is a probability
distribution over multiple classes. In the binary classification case, where the target label yi is
either 0 or 1, the binary cross-entropy loss function is defined as:
$$L_{\text{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \big[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\big] \qquad (9.55)$$
where ŷi = f (xi ; W) is the predicted probability that the i-th sample belongs to the positive class
(i.e., class 1). For multiclass classification, where the target label yi is a one-hot encoded vector
representing the true class, the general form of the cross-entropy loss is:
$$L_{\text{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c}) \qquad (9.56)$$
where C is the number of classes, and ŷi,c = f (xi ; W) is the predicted probability that the i-th
sample belongs to class c. When the output layer applies a softmax, the gradient of the cross-entropy loss with respect to the pre-softmax logits z_{i,c} takes the simple form
$$\frac{\partial L_{\text{CE}}}{\partial z_{i,c}} = \hat{y}_{i,c} - y_{i,c}. \qquad (9.57)$$
This gradient facilitates the weight update by adjusting the model’s parameters to reduce the dif-
ference between the predicted probabilities and the actual class labels.
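As a worked illustration of these loss functions and their gradients, the following NumPy sketch implements Eqs. 9.53-9.57. The clipping constant eps is an assumed numerical guard against log(0), not part of the mathematical definitions.

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)                        # Eq. (9.53)

def grad_mse(y, y_hat):
    return 2.0 * (y_hat - y) / y.size                       # Eq. (9.54), with the 1/N factor

def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1.0 - eps)                          # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))   # Eq. (9.55)

def categorical_cross_entropy(Y, P, eps=1e-12):
    # Y: one-hot targets (N, C); P: predicted class probabilities (N, C)
    return -np.mean(np.sum(Y * np.log(np.clip(P, eps, 1.0)), axis=1))  # Eq. (9.56)

def grad_ce_wrt_logits(Y, P):
    # For a softmax output layer: dL/dz = (P - Y) / N       cf. Eq. (9.57)
    return (P - Y) / Y.shape[0]
```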
In neural network training, the optimization process often involves regularization techniques to
prevent overfitting, especially in cases with high-dimensional data or deep networks. L2 reg-
ularization (also known as Ridge regression) is one common approach, which penalizes large
weights by adding a term proportional to the squared L2 norm of the weights to the loss function.
The regularized loss function becomes:
$$L_{\text{reg}} = L_{\text{MSE}} + \lambda \sum_{j=1}^{n} W_j^2 \qquad (9.58)$$
where λ is the regularization strength, and Wj represents the parameters of the network. The
gradient of the regularized loss with respect to the weights is:
$$\frac{\partial L_{\text{reg}}}{\partial W_j} = \frac{\partial L_{\text{MSE}}}{\partial W_j} + 2\lambda W_j \qquad (9.59)$$
This additional term discourages large values of the weights, reducing the complexity of the model
and helping it generalize better to unseen data. Another form of regularization is L1 regulariza-
tion (or Lasso regression), which promotes sparsity in the model by adding the L1 norm of the
weights to the loss function. The L1 regularized loss function is:
$$L_{\text{reg}} = L_{\text{MSE}} + \lambda \sum_{j=1}^{n} |W_j| \qquad (9.60)$$
The gradient of this regularized loss function with respect to the weights is:
$$\frac{\partial L_{\text{reg}}}{\partial W_j} = \frac{\partial L_{\text{MSE}}}{\partial W_j} + \lambda\, \mathrm{sign}(W_j) \qquad (9.61)$$
where sign(Wj ) is the sign function, which returns 1 for positive values of Wj , −1 for negative
values, and 0 for Wj = 0. L1 regularization encourages the model to select only a small subset of
features by forcing many of the weights to exactly zero, thus simplifying the model and promoting
interpretability. The optimization process for neural networks can be viewed as solving a non-
convex optimization problem, given the highly non-linear activation functions and the deep
architectures typically used. In this context, stochastic gradient descent (SGD) is commonly
employed to perform the optimization by updating the weights based on the gradient computed
from a random mini-batch of the data. The update rule for SGD can be expressed as:
$$\mathbf{W} \leftarrow \mathbf{W} - \eta \nabla_{\mathbf{W}} \mathcal{L}_{\text{batch}}(\mathbf{W}),$$
where ∇W Lbatch is the gradient of the loss function computed over the mini-batch, and η is the
learning rate. Due to the non-convexity of the objective function, SGD tends to converge to a local
minimum or a saddle point, rather than the global minimum, especially in deep neural networks
with many layers.
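A single regularized SGD step, combining the mini-batch update above with the penalty gradients of Eqs. 9.59 and 9.61, might look as follows; sgd_step and the toy stand-in gradient are illustrative assumptions, not a fixed API.

```python
import numpy as np

def sgd_step(W, grad_batch, eta=0.01, l2=0.0, l1=0.0):
    """One SGD update on a mini-batch gradient with optional L2/L1 penalties.
    The penalty gradients 2*l2*W and l1*sign(W) follow Eqs. (9.59) and (9.61)."""
    grad = grad_batch + 2.0 * l2 * W + l1 * np.sign(W)
    return W - eta * grad

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
g = rng.normal(size=(4, 3))        # stand-in for a mini-batch loss gradient
W = sgd_step(W, g, eta=0.1, l2=1e-3, l1=1e-4)
```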
In summary, the loss function plays a central role in guiding the optimization process in neu-
ral network training by quantifying the error between the predicted and true outputs. Different
loss functions are employed depending on the nature of the problem, with MSE being common for
regression and cross-entropy used for classification. Regularization techniques such as L2 and L1
regularization are incorporated to prevent overfitting and ensure better generalization. Through
optimization algorithms like gradient descent, the neural network parameters are iteratively up-
dated based on the gradients of the loss function, with the ultimate goal of minimizing the loss
across all training examples.
10 Few-Shot Learning
Few-shot learning (FSL) refers to the problem of training a model to generalize to new tasks using
only a limited number of labeled examples. Mathematically, given a dataset D = {(x_i, y_i)}_{i=1}^{N}
consisting of N labeled instances, a conventional supervised learning model aims to learn a mapping
f : X → Y by minimizing an empirical risk functional of the form
$$\mathcal{L}(f) = \sum_{i=1}^{N} \ell(f(x_i), y_i) \qquad (10.1)$$
where ℓ is a pointwise loss and the mapping f is typically parameterized by θ. However, in a few-shot setting, the number of labeled
examples per class is very small, typically denoted as K, where K ≪ N , making it challenging to
generalize in a conventional learning paradigm. Few-shot learning is often formalized using meta-
learning approaches where the goal is to optimize a learning algorithm itself such that it can quickly
adapt to new tasks using very few labeled samples. A common meta-learning formulation involves
the optimization of a meta-objective defined over a distribution of tasks T . Given a distribution
p(T ), a meta-learner optimizes parameters θ such that the expected loss over tasks is minimized:
$$\min_{\theta}\; \mathbb{E}_{T \sim p(T)}\big[\mathcal{L}_T(f_\theta)\big],$$
where each task T consists of a small training set Strain (the support set) and a validation set Stest
(the query set). The loss function for a given task T is typically written as:
$$\mathcal{L}_T(f_\theta) = \sum_{(x,y)\in S_{\text{test}}} \ell\big(f_\theta^{S_{\text{train}}}(x), y\big) \qquad (10.4)$$
where fθStrain denotes the model adapted using the small support set Strain . A well-known approach to
FSL is Model-Agnostic Meta-Learning (MAML), where the optimization is performed by updating
θ such that it rapidly adapts to new tasks via a few gradient updates. Specifically, MAML defines a
bi-level optimization problem where an inner loop computes task-specific parameters θ′ via gradient
descent:
θ′ = θ − α∇θ LT (fθ ) (10.5)
where α is the step size. The outer loop then updates θ based on the loss incurred on the query
set:
$$\theta \leftarrow \theta - \beta \sum_{T \sim p(T)} \nabla_\theta \mathcal{L}_T(f_{\theta'}) \qquad (10.6)$$
where β is another learning rate. The goal of MAML is to find an initialization θ such that a
small number of gradient steps suffices for good generalization. Another prominent FSL approach
is metric-based learning, where models learn an embedding function gθ : X → Rd such that a simi-
larity function S(gθ (xi ), gθ (xj )) enables effective classification in a low-data regime. A widely used
method is prototypical networks, which represent each class in the support set with a prototype:
$$c_k = \frac{1}{|S_{\text{train},k}|} \sum_{(x_i, y_i) \in S_{\text{train},k}} g_\theta(x_i) \qquad (10.7)$$
where Strain,k is the subset of the support set corresponding to class k. Classification is then
performed using a softmax function over distances:
$$P(y = k \mid x) = \frac{\exp\big(-d(g_\theta(x), c_k)\big)}{\sum_{k'} \exp\big(-d(g_\theta(x), c_{k'})\big)} \qquad (10.8)$$
where d(·, ·) is typically the squared Euclidean distance. The training objective is the negative
log-likelihood:
$$\mathcal{L} = -\sum_{(x,y)\in S_{\text{test}}} \log P(y \mid x) \qquad (10.9)$$
Bayesian methods in FSL adopt a probabilistic framework where uncertainty in model parameters
is explicitly modeled using a prior p(θ). Given a support set Strain , a Bayesian model infers a
posterior over parameters:
p(θ|Strain ) ∝ p(Strain |θ)p(θ) (10.10)
A predictive distribution over a new query point x∗ is obtained by marginalizing over θ:
$$p(y^* \mid x^*, S_{\text{train}}) = \int p(y^* \mid x^*, \theta)\, p(\theta \mid S_{\text{train}})\, d\theta \qquad (10.11)$$
Ultimately, the success of FSL hinges on leveraging shared structure across tasks, designing effective
adaptation mechanisms, and optimizing inductive biases to mitigate data scarcity.
10.1 Meta-Learning Formulation in Few Shot Learning
The meta-learning objective can be written as
$$\min_{\theta}\; \mathbb{E}_{T \sim p(T)}\big[\mathcal{L}_T(f_\theta)\big],$$
where LT (fθ ) represents the loss function associated with the task T , typically computed over the
query set after adapting the model using the support set. One of the most widely used meta-learning
approaches is Model-Agnostic Meta-Learning (MAML), which optimizes for an initial parameter
set θ that can be quickly adapted to a new task using only a few gradient steps. The inner-loop
adaptation updates the model parameters as:
$$\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{T_i}(f_\theta),$$
where α is the inner-loop learning rate. The meta-objective then minimizes the loss computed on
the query set with respect to the updated parameters:
$$\min_\theta \sum_{T_i \sim p(T)} \mathcal{L}_{T_i}(f_{\theta_i'}), \qquad (10.14)$$
which is minimized with outer-loop gradient steps θ ← θ − β Σ_{T_i} ∇θ L_{T_i}(f_{θ_i'}), where β is the meta-learning rate. Another formulation follows metric-based meta-learning, where
the model learns an embedding function gθ that maps input samples to a feature space where
classification can be performed using a simple distance metric. In Prototypical Networks, the class
prototype for each class c in the support set is computed as:
$$c_c = \frac{1}{|S_c|} \sum_{(x_j, y_j) \in S_c} g_\theta(x_j) \qquad (10.17)$$
where Sc is the set of examples belonging to class c. The probability of assigning a query sample
xq to class c is then given by:
$$p(y_q = c \mid x_q) = \frac{\exp\big(-d(g_\theta(x_q), c_c)\big)}{\sum_{c'} \exp\big(-d(g_\theta(x_q), c_{c'})\big)} \qquad (10.18)$$
where d(·, ·) is a distance metric, often chosen as the squared Euclidean distance. Contrastive
learning-based meta-learning methods instead optimize a loss function based on relative similarities.
A common choice is the contrastive loss:
$$\mathcal{L} = \sum_{i,j} \Big[\, \mathbb{I}[y_i = y_j] \cdot d\big(g_\theta(x_i), g_\theta(x_j)\big) + (1 - \mathbb{I}[y_i = y_j]) \cdot \max\big(0,\, m - d(g_\theta(x_i), g_\theta(x_j))\big) \Big] \qquad (10.19)$$
(both terms enter as additive penalties, so similar pairs are pulled together while dissimilar pairs are pushed beyond the margin m)
where m is a margin parameter and I[·] is an indicator function. Bayesian meta-learning approaches
introduce a probabilistic treatment where the model parameters θ are sampled from a learned
posterior distribution p(θ | D), updated via variational inference:
$$q_\phi(\theta) = \arg\min_{q} D_{\mathrm{KL}}\big(q(\theta)\,\|\,p(\theta \mid D)\big) \qquad (10.20)$$
where DKL (· ∥ ·) denotes the Kullback-Leibler divergence. The expected posterior predictive
distribution for a new task is then computed as:
p(yq | xq , S) = Eθ∼qϕ (θ) [p(yq | xq , θ)] (10.21)
which can be approximated using Monte Carlo sampling. Gradient-based meta-learning methods
can also be formulated using recurrent neural networks, where the meta-learner is a recurrent
model that updates a task-specific hidden state ht over a sequence of examples. Given a sequence
of support set updates, the parameterized recurrence is given by:
ht = fψ (ht−1 , ∇θ Lt ) (10.22)
and the model parameters for a given task are inferred as:
θt = gψ (ht ) (10.23)
where fψ and gψ are recurrent function approximators parameterized by ψ. These meta-learning
paradigms share a common objective: to minimize the expected generalization error of a model
trained over a distribution of tasks, enabling rapid adaptation to new scenarios with minimal labeled
data. The fundamental mechanism across formulations involves optimizing a bi-level objective
where the outer loop adjusts meta-parameters and the inner loop performs task-specific adaptation.
The optimization problem governing this framework can be expressed in a general form as:
$$\min_\theta\; \mathbb{E}_{T \sim p(T)}\Big[\mathcal{L}_T\big(f_{\theta - \nabla_\theta \mathcal{L}_T(f_\theta)}\big)\Big] \qquad (10.24)$$
For a given task Ti , we assume a dataset Di composed of a support set Ditrain and a query set
Ditest . Let the model be parameterized by θ and let the task-specific loss function be LTi . The
adaptation to task Ti involves performing one or more gradient descent steps with respect to LTi
evaluated on D_i^train. The standard adaptation rule for a single gradient step is given by:
$$\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{T_i}\big(\theta, D_i^{\text{train}}\big),$$
where α is the learning rate for task-specific updates. The meta-objective is to find an initialization
θ such that after task-specific adaptation, the model performs well on the query set Ditest . The
meta-loss is computed over the query set:
$$\mathcal{L}_{\text{meta}} = \sum_{T_i \sim p(T)} \mathcal{L}_{T_i}\big(\theta_i', D_i^{\text{test}}\big), \qquad (10.26)$$
and the meta-parameters are updated as θ ← θ − β ∇θ L_meta, where β is the meta-learning rate. Expanding the gradient term using the chain rule,
$$\nabla_\theta \mathcal{L}_{T_i}\big(\theta_i', D_i^{\text{test}}\big) = \nabla_{\theta_i'} \mathcal{L}_{T_i}\big(\theta_i', D_i^{\text{test}}\big) \cdot \frac{d\theta_i'}{d\theta}, \qquad (10.28)$$
with
with
$$\frac{d\theta_i'}{d\theta} = I - \alpha \nabla_\theta^2 \mathcal{L}_{T_i}\big(\theta, D_i^{\text{train}}\big), \qquad (10.29)$$
where I is the identity matrix and ∇2θ LTi is the Hessian of the loss function with respect to θ. The
optimization process involves computing second-order derivatives, which can be computationally
expensive. A first-order approximation, called First-Order MAML (FOMAML), simplifies this
update by ignoring the Hessian term:
$$\theta \leftarrow \theta - \beta \sum_{T_i \sim p(T)} \nabla_{\theta_i'} \mathcal{L}_{T_i}\big(\theta_i', D_i^{\text{test}}\big), \qquad (10.30)$$
where θi′ is still obtained using the standard gradient descent update. This approximation reduces
computational overhead while still achieving strong meta-learning performance. The effectiveness
of MAML depends on its ability to learn an initialization θ that allows rapid adaptation across a
diverse set of tasks. Formally, we define the expected meta-loss over the task distribution:
$$\mathcal{L}_{\text{meta}}(\theta) = \mathbb{E}_{T \sim p(T)}\Big[\mathcal{L}_T\big(\theta - \alpha \nabla_\theta \mathcal{L}_T(\theta)\big)\Big],$$
which represents the expectation over the loss incurred after a single adaptation step. The meta-
optimization aims to minimize this expected loss, ensuring that θ provides an effective starting
point for task-specific fine-tuning. For multiple gradient adaptation steps, the parameter update
for a given task follows an iterative process:
$$\theta_i^{(k)} = \theta_i^{(k-1)} - \alpha \nabla_{\theta_i^{(k-1)}} \mathcal{L}_{T_i}\big(\theta_i^{(k-1)}, D_i^{\text{train}}\big) \qquad (10.32)$$
for k = 1, . . . , K, where K denotes the number of inner-loop gradient updates. The final adapted parameters θ_i^{(K)} are then used to compute the meta-loss:
$$\mathcal{L}_{\text{meta}} = \sum_{T_i \sim p(T)} \mathcal{L}_{T_i}\big(\theta_i^{(K)}, D_i^{\text{test}}\big), \qquad (10.33)$$
which determines the outer-loop optimization of θ. The full meta-gradient computation involves
backpropagating through K gradient updates, leading to higher-order derivative terms:
$$\nabla_\theta \mathcal{L}_{\text{meta}} = \sum_{T_i \sim p(T)} \nabla_{\theta_i^{(K)}} \mathcal{L}_{T_i}\big(\theta_i^{(K)}, D_i^{\text{test}}\big) \cdot \prod_{k=1}^{K} \frac{d\theta_i^{(k)}}{d\theta_i^{(k-1)}}, \qquad (10.34)$$
where each term captures the second-order effects of iterative gradient updates.
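A minimal first-order MAML (FOMAML, Eqs. 10.25-10.30 with the Hessian term dropped) can be sketched on synthetic one-parameter regression tasks. The task family y = a·x, the scalar model, and all hyperparameters below are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def task_loss_grad(theta, a, x):
    """Squared-error loss and gradient for the toy task y = a*x, model y_hat = theta*x."""
    err = theta * x - a * x
    return np.mean(err ** 2), np.mean(2.0 * err * x)

def fomaml(theta, n_iters=500, alpha=0.05, beta=0.01, tasks_per_batch=8):
    for _ in range(n_iters):
        meta_grad = 0.0
        for _ in range(tasks_per_batch):
            a = rng.uniform(-2, 2)                              # task T_i ~ p(T)
            x_s, x_q = rng.normal(size=5), rng.normal(size=5)   # support / query sets
            _, g = task_loss_grad(theta, a, x_s)
            theta_i = theta - alpha * g                         # inner-loop adaptation
            _, g_q = task_loss_grad(theta_i, a, x_q)
            meta_grad += g_q            # FOMAML: ignore the Hessian term, cf. Eq. (10.30)
        theta -= beta * meta_grad                               # outer-loop update
    return theta

theta_init = fomaml(0.0)   # an initialization from which one gradient step adapts quickly
```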
10.3 Metric-Based Few Shot Learning
The training objective typically combines an intra-class attraction term with an inter-class separation term, weighted by a factor λ that controls the relative importance of inter-class separation. The metric function dθ (xi , xj ) is often implemented using a neural network that maps instances into an embedding space via an encoder fθ : X → Rd , such that
$$d_\theta(x_i, x_j) = \|f_\theta(x_i) - f_\theta(x_j)\|_2^2.$$
One of the key approaches in metric-based few-shot learning is prototypical networks, where each
class in the support set is represented by a prototype ck , computed as the mean of embedded
support examples belonging to class k:
$$c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\theta(x_i). \qquad (10.37)$$
The classification of a query example xq is then performed by computing its similarity to each
prototype, typically using the squared Euclidean distance,
$$p(y_q = k \mid x_q) = \frac{\exp\big(-d(f_\theta(x_q), c_k)\big)}{\sum_{k'} \exp\big(-d(f_\theta(x_q), c_{k'})\big)}. \qquad (10.38)$$
Another prominent approach is relation networks, which replace the explicit distance metric with a learned similarity function gϕ : Rd × Rd → R, parameterized by ϕ. The relation score between query xq and class k is computed as r_{q,k} = gϕ([fθ(xq), ck]), trained to approach 1 for matching pairs and 0 otherwise.
Siamese networks approach metric learning through pairwise comparisons, where a shared encoder
fθ maps pairs of examples (xi , xj ) into an embedding space, and the similarity is computed as
$$s_\theta(x_i, x_j) = \sigma\big(\mathbf{W}^\top |f_\theta(x_i) - f_\theta(x_j)|\big),$$
where W is a learned weight vector applied to the componentwise absolute difference and σ is the sigmoid activation function. The model is trained
to minimize the binary cross-entropy loss,
$$\mathcal{L} = -\sum_{(x_i, y_i), (x_j, y_j) \in S} \Big[\mathbb{I}(y_i = y_j) \log s_\theta(x_i, x_j) + \big(1 - \mathbb{I}(y_i = y_j)\big) \log\big(1 - s_\theta(x_i, x_j)\big)\Big]. \qquad (10.41)$$
Triplet networks extend this by considering an anchor example xa , a positive example xp (same
class), and a negative example xn (different class), enforcing the constraint
$$\|f_\theta(x_a) - f_\theta(x_p)\|_2^2 + \alpha \le \|f_\theta(x_a) - f_\theta(x_n)\|_2^2$$
for a margin α > 0.
By optimizing these loss functions, metric-based few-shot learning methods enable models to gener-
alize well to novel classes, as the learned metric encodes class-invariant representations of instances.
The embedding space acts as a structured manifold where geometric distances capture semantic
similarities, allowing rapid adaptation to unseen tasks with minimal labeled data.
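The prototype computation (Eq. 10.37) and distance-softmax classification (Eq. 10.38) admit a compact sketch. The identity encoder and the toy 3-way 5-shot episode below are illustrative assumptions; in practice fθ would be a trained deep network.

```python
import numpy as np

def prototypes(embeddings, labels, n_classes):
    """Class prototypes c_k as mean embeddings of the support set (Eq. 10.37)."""
    return np.stack([embeddings[labels == k].mean(axis=0) for k in range(n_classes)])

def proto_log_probs(query_emb, protos):
    """log p(y=k|x) via a softmax over negative squared distances (Eq. 10.38)."""
    d2 = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(-1)  # (Q, K)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

# Toy episode: identity encoder f_theta(x) = x, 3-way 5-shot support set
rng = np.random.default_rng(0)
sup = np.concatenate([rng.normal(k, 0.3, size=(5, 2)) for k in range(3)])
sup_y = np.repeat(np.arange(3), 5)
qry = rng.normal(1, 0.3, size=(4, 2))                 # queries drawn near class 1
protos = prototypes(sup, sup_y, 3)
nll = -proto_log_probs(qry, protos)[:, 1].mean()      # negative log-likelihood, cf. Eq. (10.9)
```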
The Bayesian framework enables meta-learning by assuming that tasks in FSL are drawn from
an unknown distribution over tasks p(T ). Given a set of tasks {Ti }, each with a small support set
Si and query set Qi , the goal is to learn a distribution over task-specific parameters θi conditioned
on the support set:
$$p(\theta_i \mid S_i) = \frac{p(S_i \mid \theta_i)\, p(\theta_i)}{p(S_i)} \qquad (10.45)$$
A hierarchical Bayesian model treats the task-specific parameters θi as drawn from a global distri-
bution with hyperparameters ϕ:
p(θi |ϕ) = N (µϕ , Σϕ ) (10.46)
where ϕ = (µϕ , Σϕ ) defines a prior distribution over θi . This enables the model to generalize across
tasks by updating the prior p(ϕ) using multiple tasks:
p(ϕ|{Si }) ∝ p({Si }|ϕ)p(ϕ) (10.47)
Given a new task with support set S∗ , the predictive distribution over labels y∗ for query inputs x∗
is obtained by marginalizing over θ:
$$p(y_* \mid x_*, S_*) = \int p(y_* \mid x_*, \theta)\, p(\theta \mid S_*)\, d\theta \qquad (10.48)$$
Since exact inference of p(θ|D) is often intractable, variational inference is commonly used. We
approximate p(θ|D) with a variational distribution q(θ|λ), parameterized by λ, minimizing the
Kullback-Leibler (KL) divergence:
L(λ) = Eq(θ|λ) [log p(D|θ)] − DKL (q(θ|λ)∥p(θ)) (10.49)
This leads to an optimization problem where λ is updated via gradient descent to minimize L(λ),
ensuring the learned distribution q(θ|λ) approximates the true posterior. Another approach is
Bayesian nonparametrics, where the distribution over functions f : X → Y is modeled using a
Gaussian Process (GP), providing a principled way to quantify uncertainty. Given a prior GP:
f (x) ∼ GP(m(x), k(x, x′ )) (10.50)
where m(x) is the mean function and k(x, x′ ) is the covariance function, the posterior over functions
given training data D is:
p(f |D) = GP(m̃(x), k̃(x, x′ )) (10.51)
where m̃(x) and k̃(x, x′ ) are the posterior mean and covariance functions derived using Bayes’ rule.
The predictive distribution for a new input x∗ follows:
$$p(y_* \mid x_*, D) = \int p\big(y_* \mid f(x_*)\big)\, p(f \mid D)\, df \qquad (10.52)$$
which yields a closed-form Gaussian predictive distribution due to the properties of GPs. Another
key Bayesian technique is Bayesian Neural Networks (BNNs), where instead of learning a single
weight matrix W , a distribution over weights is maintained:
p(W |D) ∝ p(D|W )p(W ) (10.53)
Inference in BNNs requires approximations like Monte Carlo Dropout, where approximate Bayesian
inference is performed by applying dropout at both training and test time, yielding an empirical
posterior:
$$p(y_* \mid x_*, D) \approx \frac{1}{T} \sum_{t=1}^{T} p(y_* \mid x_*, W_t) \qquad (10.54)$$
where Wt are sampled from the variational posterior q(W ). This captures uncertainty in predic-
tions, crucial for robust few-shot learning. Bayesian optimization further refines FSL by selecting
informative examples to maximize learning efficiency. Given an acquisition function a(x) based on
the posterior predictive distribution:
$$x^* = \arg\max_{x}\; a(x) \qquad (10.55)$$
a new query point is selected to minimize uncertainty and maximize information gain, enhancing
generalization from few samples. Thus, Bayesian methods in FSL provide a mathematically rigorous
probabilistic framework to model uncertainty, transfer knowledge across tasks, and make robust
predictions even with limited training data.
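Monte Carlo Dropout as in Eq. 10.54 can be sketched as follows for a one-hidden-layer ReLU network: keeping the dropout mask active at prediction time and averaging T stochastic passes yields an empirical predictive mean and variance. The inverted-dropout scaling by 1/(1−p) is an assumed standard convention, and the random weights are illustrative stand-ins for a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_dropout_predict(x, W1, W2, p_drop=0.5, T=100):
    """Approximate the Bayesian predictive distribution (cf. Eq. 10.54) by
    keeping dropout active at test time and averaging T stochastic passes."""
    preds = []
    for _ in range(T):
        mask = rng.binomial(1, 1 - p_drop, size=W1.shape[0]) / (1 - p_drop)
        h = np.maximum(0.0, W1 @ x) * mask         # ReLU hidden layer with dropout
        preds.append(W2 @ h)                       # one sample W_t from the posterior
    preds = np.array(preds)
    return preds.mean(axis=0), preds.var(axis=0)   # predictive mean and uncertainty

W1, W2 = rng.normal(size=(16, 4)), rng.normal(size=(1, 16))
mean, var = mc_dropout_predict(rng.normal(size=4), W1, W2)
```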
11 Metric Learning
where the triplet loss ensures that the distance between similar points is smaller than the distance
between dissimilar ones by a margin of at least 1. An alternative approach is the information-
theoretic metric learning (ITML) framework, which minimizes the Kullback-Leibler divergence
between two Gaussian distributions defined by different Mahalanobis distances:
$$D_{\mathrm{KL}}(p_{\mathbf{M}} \,\|\, p_{\mathbf{M}_0}) = \frac{1}{2}\Big(\mathrm{tr}\big(\mathbf{M}_0^{-1}\mathbf{M}\big) - \log\det\big(\mathbf{M}_0^{-1}\mathbf{M}\big) - d\Big) \qquad (11.6)$$
Deep metric learning extends these concepts by parameterizing the metric function using a neural
network fθ with learnable parameters θ, leading to a learned embedding space where distances are
computed as
dθ (xi , xj ) = ∥fθ (xi ) − fθ (xj )∥2 (11.7)
Instead of learning a linear transformation as in Mahalanobis-based methods, deep metric learning
implicitly learns a nonlinear mapping such that the Euclidean distance in the transformed space
aligns with semantic similarity. The contrastive loss is commonly used in this context:
$$\mathcal{L} = \sum_{(i,j)\in S} d_\theta(x_i, x_j) + \sum_{(i,j)\in D} \max\big(0,\, m - d_\theta(x_i, x_j)\big)^2 \qquad (11.8)$$
A more recent approach, the normalized temperature-scaled cross-entropy loss (NT-Xent), utilizes
a softmax formulation:
$$\ell = -\sum_{i} \log \frac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k \neq i} \exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)}, \qquad (11.10)$$
$$\mathrm{sim}(z_i, z_j) = \frac{z_i^\top z_j}{\|z_i\|\,\|z_j\|}. \qquad (11.11)$$
By optimizing the metric under these constraints, metric learning produces feature representations
that are well-structured for downstream tasks such as retrieval, verification, and clustering.
This term ensures that the squared distances between each point and its target neighbors remain
small. The push term, on the other hand, ensures that impostors—points xk of a different class
than xi but closer than its farthest target neighbor—are pushed away by at least a margin 1,
formulated using hinge-loss slack variables as
$$\sum_{i} \sum_{j \in N_i} \sum_{k} \xi_{ijk}, \qquad (11.14)$$
where C is a regularization parameter controlling the tradeoff between the pull and push terms.
The constraint M ⪰ 0 ensures that the learned metric remains a valid Mahalanobis distance.
The optimization is typically solved using semidefinite programming (SDP) or projected gradient
descent methods while enforcing the PSD constraint on M. To interpret the impact of the learned
metric, consider an eigenvalue decomposition of M, given by
M = UΛUT (11.17)
In the ITML framework, distances are measured by the Mahalanobis form d_M(x_i, x_j) = (x_i − x_j)⊤ M (x_i − x_j), where M is a symmetric positive semi-definite matrix, i.e., M ⪰ 0, ensuring that dM (xi , xj ) defines
a proper distance function. The fundamental idea of ITML is to find a metric M that minimizes
the divergence from a given prior metric M0 , typically the identity matrix I, subject to constraints
that enforce distances between pairs of points to satisfy specified conditions. To formalize this,
ITML minimizes the KL divergence between the distributions parameterized by M and M0 . Given
that Gaussian distributions parameterized by a Mahalanobis metric have a natural representation in
terms of covariance matrices, the KL divergence between two Gaussian distributions with covariance
matrices M⁻¹ and M₀⁻¹ is
$$D_{\mathrm{KL}}\big(\mathcal{N}(0, \mathbf{M}^{-1}) \,\|\, \mathcal{N}(0, \mathbf{M}_0^{-1})\big) = \frac{1}{2}\Big(\mathrm{tr}\big(\mathbf{M}_0^{-1}\mathbf{M}\big) - \log\det\big(\mathbf{M}_0^{-1}\mathbf{M}\big) - d\Big), \qquad (11.19)$$
where tr(·) denotes the trace operator, and log det(·) is the logarithm of the determinant. The
optimization problem in ITML is thus formulated as
$$\min_{\mathbf{M} \succeq 0}\; \mathrm{tr}\big(\mathbf{M}_0^{-1}\mathbf{M}\big) - \log\det\big(\mathbf{M}_0^{-1}\mathbf{M}\big), \qquad (11.20)$$
subject to pairwise constraints of the form d_M(x_i, x_j) ≤ u for similar pairs and d_M(x_i, x_j) ≥ ℓ for dissimilar pairs. This objective function ensures that the learned metric remains close to the prior M0 while satisfying the constraints. To solve this optimization problem, a log-barrier method is commonly employed, leading to an iterative update for M of the form
$$\mathbf{M}_{t+1} = \mathbf{M}_t + \gamma_t\, \mathbf{M}_t (x_i - x_j)(x_i - x_j)^\top \mathbf{M}_t,$$
where γt is an adaptive step size chosen to satisfy the constraints. This iterative procedure en-
sures that M remains positive semi-definite at every step. The learned metric is influenced by the
data-driven constraints while maintaining a balance between adaptation and preservation of the
prior structure. The ITML framework exhibits strong theoretical properties, particularly in terms
of generalization and convexity. Since the optimization problem is convex in M, the solution is
globally optimal, ensuring robustness. Moreover, the constraint formulation allows for a natural
interpretation: if the prior M0 = I, then the learned metric is a transformation of Euclidean space
that deforms distances according to the prescribed similarity relationships. In practice, ITML is
highly effective due to its ability to incorporate prior knowledge and adapt flexibly to data distri-
butions. The reliance on KL divergence ensures stability in learning, preventing excessive deviation
from the prior. By iteratively refining the metric with convex optimization, ITML efficiently learns
a Mahalanobis metric that best captures the underlying relationships within the data.
∥fθ (xa ) − fθ (xp )∥22 + α < ∥fθ (xa ) − fθ (xn )∥22 , (11.24)
where α > 0 is a margin that enforces a minimum separation between positive and negative pairs.
To achieve this objective, DML often employs contrastive loss, which minimizes the Euclidean
distance between positive pairs while maximizing it for negative pairs. Given a pair (xi , xj ), the
contrastive loss function is
$$\mathcal{L}_{\text{contrastive}} = y_{ij}\, d_{ij}^2 + (1 - y_{ij}) \max(0,\, m - d_{ij})^2,$$
where yij = 1 if xi and xj belong to the same class, yij = 0 otherwise, dij = ∥fθ (xi ) − fθ (xj )∥2 is
the Euclidean distance, and m is a margin parameter. Another widely used loss function in DML
is the triplet loss, given by
$$\mathcal{L}_{\text{triplet}} = \sum_{(a,p,n)\in\mathcal{T}} \max\big(0,\; \|f_\theta(x_a) - f_\theta(x_p)\|_2^2 - \|f_\theta(x_a) - f_\theta(x_n)\|_2^2 + \alpha\big). \qquad (11.26)$$
This loss function ensures that the anchor-positive distance is always smaller than the anchor-
negative distance by at least α, thereby enforcing clustering of similar instances while separating
dissimilar ones. In deep learning-based metric learning, the embedding function fθ is typically
parameterized by a deep neural network such as a convolutional neural network (CNN) for image
data or a recurrent neural network (RNN) for sequential data. Let x be an input sample and let
the network be denoted as fθ (x). The network is optimized by minimizing the empirical risk of the chosen metric loss over the data distribution D. To further refine the embeddings, methods such as hard
negative mining are employed, where the negative samples are chosen such that
$$x_n^* = \arg\min_{x_n :\, y_n \neq y_a} \|f_\theta(x_a) - f_\theta(x_n)\|_2^2.$$
This strategy ensures that the network focuses on difficult examples that are closer to the anchor,
making learning more effective. To generalize beyond triplet-based losses, one can also consider the
N-pair loss, which extends the triplet loss by incorporating multiple negatives:
$$\mathcal{L}_{\text{N-pair}} = \sum_{i=1}^{N} \log\Big(1 + \sum_{j \neq i} \exp\big(\|f_\theta(x_i) - f_\theta(x_i^+)\|_2^2 - \|f_\theta(x_i) - f_\theta(x_j)\|_2^2\big)\Big), \qquad (11.29)$$
so that the loss is small only when the anchor-positive distance is smaller than every anchor-negative distance.
Another recent improvement in deep metric learning involves contrastive learning-based methods,
where positive and negative pairs are dynamically generated using a memory bank or momentum
encoder. The InfoNCE loss, a contrastive loss derived from mutual information maximization, is
" #
exp(fθ (xi ) · fθ (x+i )/τ )
LInfoNCE = −E log P − , (11.31)
j exp(fθ (xi ) · fθ (xj )/τ )
Finally, the quality of the learned embeddings can be evaluated using retrieval-based metrics such
as mean average precision (mAP), recall at k (R@k), and normalized mutual information (NMI),
ensuring that the learned distance metric effectively captures semantic similarity.
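A sketch of the triplet loss (Eq. 11.26) together with a simple hardest-negative selection rule follows. The batch-wise mining strategy and the toy two-class embeddings shown are one common variant chosen for illustration, not the only option.

```python
import numpy as np

def triplet_loss(emb_a, emb_p, emb_n, alpha=0.2):
    """Hinge triplet loss (Eq. 11.26): pull anchor-positive, push anchor-negative."""
    d_ap = np.sum((emb_a - emb_p) ** 2, axis=1)
    d_an = np.sum((emb_a - emb_n) ** 2, axis=1)
    return np.maximum(0.0, d_ap - d_an + alpha).mean()

def hardest_negative(emb, labels, anchor_idx):
    """Hard negative mining: the closest embedding with a different label."""
    d = np.sum((emb - emb[anchor_idx]) ** 2, axis=1)
    d[labels == labels[anchor_idx]] = np.inf   # mask out same-class points
    return int(np.argmin(d))

rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 8))
labels = np.repeat([0, 1], 5)
j = hardest_negative(emb, labels, anchor_idx=0)
loss = triplet_loss(emb[[0]], emb[[1]], emb[[j]])   # anchor 0, positive 1, mined negative j
```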
Given a batch of N samples, let each data point xi be augmented to obtain two correlated views,
denoted as x′i and x′′i . These views are encoded into the feature space using a learned function
fθ (·), which is typically a deep neural network. The feature representations of the two augmented
views are given by
zi′ = fθ (x′i ), zi′′ = fθ (x′′i ) (11.34)
where the output embeddings are L2-normalized to lie on a unit hypersphere:
∥zi′ ∥ = 1, ∥zi′′ ∥ = 1. (11.35)
The similarity between two representations is computed using the cosine similarity, which is given
by
$$\mathrm{sim}(z_i, z_j) = \frac{z_i \cdot z_j}{\|z_i\|\,\|z_j\|} = z_i^\top z_j. \qquad (11.36)$$
Since zi and zj are unit vectors, their dot product is directly the cosine of the angle between
them. To define the NT-Xent loss, a temperature scaling parameter τ is introduced to control the
sharpness of the similarity distribution. The temperature-scaled similarity is given by
$$s_{ij} = \frac{z_i^\top z_j}{\tau}. \qquad (11.37)$$
This scaling prevents collapse by adjusting the entropy of the softmax distribution. The probability
of zi′ matching with zi′′ (its positive counterpart) over all possible samples in the batch is computed
using the softmax function
$$p_{i,i''} = \frac{\exp(s_{i,i''})}{\sum_{j \neq i} \exp(s_{i,j})}. \qquad (11.38)$$
The NT-Xent loss for a single positive pair (zi′ , zi′′ ) is then formulated as
$$\ell_i = -\log p_{i,i''} = -\log \frac{\exp(s_{i,i''})}{\sum_{j \neq i} \exp(s_{i,j})}. \qquad (11.39)$$
To compute the full batch-wise NT-Xent loss, we sum over all samples and their corresponding
augmentations, leading to
$$\mathcal{L} = \frac{1}{2N}\sum_{i=1}^{N} (\ell_i + \ell_{i''}) = -\frac{1}{2N}\sum_{i=1}^{N} \left(\log \frac{\exp(s_{i,i''})}{\sum_{j \neq i}\exp(s_{i,j})} + \log \frac{\exp(s_{i'',i})}{\sum_{j \neq i''}\exp(s_{i'',j})}\right). \qquad (11.40)$$
The denominator in the softmax function acts as a contrastive term, where all other embeddings in
the batch contribute as negative samples. Since the numerator contains only one positive pair, the
loss forces these representations to be closer, while the denominator encourages separation from all
other samples. A key insight into NT-Xent is its connection to InfoNCE loss (used in contrastive
predictive coding), but it is fully symmetric, meaning both views of the same instance contribute
equally. This symmetry helps in learning invariant representations by reinforcing consistent feature
extraction across different augmentations. To ensure proper gradient flow, the temperature τ is
crucial. If τ is too high, the softmax distribution becomes too uniform, leading to weak discrimi-
nation. Conversely, if τ is too low, the model learns overly sharp distributions, which can lead to
overfitting. The gradient of NT-Xent loss with respect to the embedding zi can be derived using
the softmax gradient formula
$$\frac{\partial \mathcal{L}}{\partial z_i} = \sum_{j} p_{i,j}\,(z_j - z_i). \qquad (11.41)$$
This update step ensures that positive pairs (zi , zi′′ ) get pulled together while negative pairs are
pushed apart in proportion to their softmax probabilities. In the large batch limit, NT-Xent can be
approximated by Monte Carlo sampling of negative pairs, ensuring computational efficiency in
practical self-supervised learning frameworks such as SimCLR. Thus, the NT-Xent loss serves as
a fundamental tool for learning discriminative, invariant, and well-clustered embeddings
in self-supervised contrastive learning frameworks.
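A batch NT-Xent computation following Eqs. 11.35-11.40 might be sketched as below. Setting self-similarities to −∞ before the softmax is an implementation device for excluding the k = i terms; the random embeddings stand in for encoder outputs fθ(x′), fθ(x″).

```python
import numpy as np

def nt_xent(z1, z2, tau=0.5):
    """NT-Xent loss for a batch of N positive pairs (z1[i], z2[i]), cf. Eq. (11.40)."""
    N = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit hypersphere, Eq. (11.35)
    s = (z @ z.T) / tau                                # temperature-scaled sims, Eq. (11.37)
    np.fill_diagonal(s, -np.inf)                       # exclude self-similarity from softmax
    pos = np.concatenate([np.arange(N, 2 * N), np.arange(N)])  # i pairs with i+N
    log_prob = s[np.arange(2 * N), pos] - np.log(np.exp(s).sum(axis=1))
    return -log_prob.mean()                            # symmetric over both views

rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
loss = nt_xent(z1, z2)
```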
12 Adversarial Learning
Adversarial learning is a fundamental paradigm in machine learning where two models are trained
in opposition to each other, leading to the generation of robust representations and improvements in
generalization. The central concept underlying adversarial learning can be mathematically formu-
lated in terms of a minimax optimization problem, where one model seeks to maximize an objective
function while the other aims to minimize it. Formally, given a function f (θ) parameterized by θ,
adversarial learning can be framed as
$$\min_\theta \max_\delta\; \mathcal{L}\big(f(\theta, x + \delta), y\big), \qquad (12.1)$$
where L is the loss function, x is the input, y is the ground truth label, and δ represents an adver-
sarial perturbation optimized to maximize the loss. The adversarial perturbation δ is constrained
within a perturbation bound ϵ, ensuring that the perturbed example x + δ remains within a small
neighborhood of x according to a given norm constraint:
∥δ∥p ≤ ϵ (12.2)
where ∥ · ∥p represents the p-norm. The most common choice is the ℓ∞ -norm, which results in
perturbations constrained as
∥δ∥∞ ≤ ϵ. (12.3)
A fundamental technique in adversarial learning is the generation of adversarial examples using
gradient-based methods. The Fast Gradient Sign Method (FGSM) computes adversarial perturba-
tions using the sign of the gradient of the loss function with respect to the input:
δ = ϵ · sign(∇x L(f (θ, x), y)). (12.4)
This results in the adversarial example:
x′ = x + δ = x + ϵ · sign(∇x L(f (θ, x), y)). (12.5)
A more refined approach is the Projected Gradient Descent (PGD) method, which iteratively
updates the adversarial example using gradient ascent:
xt+1 = ΠBϵ (x) (xt + α · sign(∇x L(f (θ, xt ), y))) , (12.6)
where ΠBϵ (x) denotes projection onto the ℓ∞ -ball of radius ϵ around x, and α is the step size. The
adversarial training process modifies the learning objective by incorporating adversarially generated
examples into the training procedure, leading to the following robust optimization problem:
$$\min_\theta\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\max_{\|\delta\|_p \le \epsilon} \mathcal{L}\big(f(\theta, x + \delta), y\big)\right], \qquad (12.7)$$
where D represents the data distribution. This formulation ensures that the model is trained
on worst-case perturbations, improving robustness against adversarial attacks. A generative ap-
proach to adversarial learning is encapsulated in Generative Adversarial Networks (GANs), where
a generator G and a discriminator D are trained in a two-player minimax game:
$$\min_G \max_D\; \mathbb{E}_{x\sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))]. \qquad (12.8)$$
Here, the generator G aims to produce samples G(z) that are indistinguishable from real samples,
while the discriminator D seeks to correctly classify real and generated samples. The optimal
discriminator is given by
$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}. \qquad (12.9)$$
The equilibrium of the game is attained when pG (x) = pdata (x), leading to the global minimum of
the Jensen-Shannon divergence between real and generated distributions. Variants of adversarial
learning include Wasserstein GANs (WGANs), which replace the standard GAN loss with the
Wasserstein distance:
$$\min_G \max_{\|D\|_L \le 1}\; \mathbb{E}_{x\sim p_{\text{data}}}[D(x)] - \mathbb{E}_{z\sim p_z}[D(G(z))], \qquad (12.10)$$
where the critic D is constrained to be 1-Lipschitz. Adversarial formulations also extend to reinforcement learning, where a policy π is trained against adversarial perturbations A so as to maximize a worst-case reward R(τ) over trajectories τ.
Mathematically rigorous guarantees for adversarial robustness involve formulating certified de-
fenses, which provide provable bounds on a model’s robustness. For example, randomized smooth-
ing transforms a classifier f (x) into a smoothed classifier
$$g(x) = \arg\max_{c}\; \mathbb{P}_{\eta \sim \mathcal{N}(0, \sigma^2 I)}\big(f(x + \eta) = c\big),$$
whose prediction is provably stable within an ℓ2 ball whose radius depends on σ.
12.1 Fast Gradient Sign Method (FGSM)
The FGSM attack is formulated as an optimization problem where the goal is to maximize the loss function subject to a bounded perturbation constraint:
$$\max_{\|\delta\|_\infty \le \epsilon}\; J(\theta, x + \delta, y).$$
Since computing the exact solution to this optimization problem is computationally expensive,
FGSM employs a first-order approximation using a Taylor series expansion:
$$J(\theta, x + \delta, y) \approx J(\theta, x, y) + \delta^\top \nabla_x J(\theta, x, y).$$
Maximizing this expression with respect to δ under the ℓ∞ -norm constraint ∥δ∥∞ ≤ ϵ leads to the
optimal perturbation:
δ ∗ = ϵ sign(∇x J(θ, x, y)). (12.18)
Substituting this into the perturbed input yields the FGSM formulation:
$$x_{\text{adv}} = x + \epsilon\, \mathrm{sign}\big(\nabla_x J(\theta, x, y)\big).$$
FGSM exploits the local linearity property of neural networks, which can be seen by considering a
first-order approximation of the neural network output near x:
$$f(x + \delta) \approx f(x) + \delta^\top \nabla_x f(x).$$
Given that neural networks often have high-dimensional input spaces, even a small perturbation
aligned with the gradient can induce significant changes in the output, leading to misclassification.
If the network assigns class probabilities using a softmax function
$$P(y \mid x; \theta) = \frac{\exp(f_y(x; \theta))}{\sum_j \exp(f_j(x; \theta))}, \qquad (12.21)$$
then the FGSM attack perturbs x in a way that increases the probability of an incorrect class y ′ ,
where
P (y ′ | xadv ; θ) > P (y | xadv ; θ). (12.22)
The effectiveness of FGSM depends on ϵ. A small ϵ may not cause misclassification, whereas a large
ϵ may introduce perceptible distortions. The impact of FGSM is often analyzed using the decision
boundary properties of neural networks. Consider a linear classifier with decision boundary defined
by
wT x + b = 0. (12.23)
Applying FGSM perturbs x along the gradient direction, which in a linear setting is simply δ = ϵ sign(w). If x is initially correctly classified, then after perturbation, the new decision function evaluation is
$$\mathbf{w}^\top(x + \delta) + b = \mathbf{w}^\top x + b + \epsilon \|\mathbf{w}\|_1.$$
If the perturbation shifts this beyond zero, the classification flips. This explains why FGSM can
be highly effective, even for small ϵ, particularly in high-dimensional spaces where each small
perturbation accumulates across dimensions. A key property of FGSM is its transferability across
different models. Given two classifiers f1 (x; θ 1 ) and f2 (x; θ 2 ), adversarial examples crafted for f1
often remain adversarial for f2 , which can be analyzed using the gradient similarity measure:
$$\cos\big(\nabla_x J_1, \nabla_x J_2\big) = \frac{\nabla_x J_1 \cdot \nabla_x J_2}{\|\nabla_x J_1\|\,\|\nabla_x J_2\|}.$$
This suggests that gradients of different models are often aligned, leading to the observed trans-
ferability phenomenon. FGSM can be countered using adversarial training, where the training
objective is modified to incorporate adversarial examples:
$$\min_\theta\; \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\max_{\|\delta\|_\infty \le \epsilon} J(\theta, x + \delta, y)\right]. \qquad (12.27)$$
This formulation leads to robust classifiers that are less sensitive to adversarial perturbations. How-
ever, adversarial training increases computational costs, as it requires solving an inner maximization
problem during training. Another defense mechanism is gradient masking, where modifications to
the loss function result in non-informative gradients, i.e., ∇x J(θ, x, y) ≈ 0 in a neighborhood of the data.
While this suppresses adversarial perturbations, it does not fundamentally resolve the adversarial
vulnerability, as adaptive attacks can circumvent such defenses by using alternative optimization
techniques.
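For a model whose input gradient is available in closed form, FGSM is a one-liner. The logistic-regression victim below is an assumed stand-in for a deep network, chosen so that the gradient ∇x J is exact.

```python
import numpy as np

def loss_and_grad_x(x, y, w, b):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. the input x."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    loss = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return loss, (p - y) * w                  # dJ/dx for the logistic loss

def fgsm(x, y, w, b, eps=0.1):
    """FGSM: x_adv = x + eps * sign(grad_x J), cf. Eq. (12.5)."""
    _, g = loss_and_grad_x(x, y, w, b)
    return x + eps * np.sign(g)

rng = np.random.default_rng(0)
w, b = rng.normal(size=20), 0.0
x, y = rng.normal(size=20), 1.0
x_adv = fgsm(x, y, w, b)
# In this linear setting the attack shifts w.x by eps * ||w||_1,
# matching the decision-boundary analysis above.
```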
12.2 Projected Gradient Descent (PGD)
The PGD attack seeks a perturbation that maximizes the loss within an allowed set,
$$\max_{\delta \in \mathcal{S}}\; \mathcal{L}(x + \delta, y; \theta),$$
where L(x, y; θ) denotes the loss function (e.g., cross-entropy loss), x is the input sample, y is
the corresponding label, and θ represents the parameters of the model. The perturbation set S is
commonly chosen as the ℓp -ball of radius ϵ, given by
$$\mathcal{S} = \{\delta : \|\delta\|_p \le \epsilon\}.$$
To solve the constrained optimization problem, PGD proceeds iteratively by updating δ in the
direction of the gradient of L with respect to the input, followed by a projection step to enforce
the constraint. Given a step size α, the update rule for PGD at iteration t is
$$\delta^{(t+1)} = \Pi_{\mathcal{S}}\!\left(\delta^{(t)} + \alpha\, \frac{\nabla_x \mathcal{L}(x + \delta^{(t)}, y; \theta)}{\|\nabla_x \mathcal{L}(x + \delta^{(t)}, y; \theta)\|_p}\right), \qquad (12.31)$$
where ΠS (·) denotes the projection operator that ensures δ (t+1) remains within S. The projection
depends on the choice of the ℓp -norm constraint. For the commonly used ℓ∞ -ball, the projection is
simply
ΠS (δ) = max(−ϵ, min(δ, ϵ)), (12.32)
which clips each component of δ to lie within [−ϵ, ϵ]. For the ℓ2 -ball, projection is achieved by
normalizing and scaling the perturbation as
$$\Pi_{\mathcal{S}}(\delta) = \epsilon\, \frac{\delta}{\|\delta\|_2} \quad \text{if } \|\delta\|_2 > \epsilon. \qquad (12.33)$$
The iterative application of this process ensures that the perturbation maximally increases the
loss while adhering to the predefined perturbation budget. In the limit as the step size α → 0
and the number of iterations T → ∞, PGD approximates the optimal solution of the constrained
maximization problem. This is in contrast to the Fast Gradient Sign Method (FGSM), which
performs only a single-step update of the form δ = ϵ sign(∇x L(x, y; θ)).
PGD is therefore considered a stronger attack, as it allows the adversarial perturbation to explore
the loss landscape more effectively by iteratively refining the adversarial example. From a theo-
retical perspective, PGD can be interpreted as an approximate projected gradient ascent method
for solving the constrained maximization problem. Formally, this corresponds to a Lagrangian
formulation where the Karush-Kuhn-Tucker (KKT) conditions dictate the optimal perturbation:
$$\nabla_\delta \mathcal{L}(x + \delta, y; \theta) = \lambda\, \nabla_\delta g(\delta), \qquad \lambda \ge 0,$$
where g(δ) defines the boundary of the perturbation set. The iterative updates in PGD approximate
this equilibrium condition numerically. Moreover, when applied iteratively with different random
initializations of δ (0) , PGD can better explore non-convex loss surfaces, making it an effective
strategy against adversarially trained defenses.
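An ℓ∞ PGD loop can be sketched as below. The sign step is the standard ℓ∞ steepest-ascent specialization of the normalized update in Eq. 12.31, the projection is the clipping of Eq. 12.32, and the zero initialization of δ (random restarts are also common) is a design choice. The logistic loss factory is an assumed stand-in for a network's loss.

```python
import numpy as np

def make_logistic_loss(w, b):
    """Closure returning loss and input-gradient of a toy logistic model."""
    def loss_grad(x, y):
        p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
        loss = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        return loss, (p - y) * w              # gradient w.r.t. the input
    return loss_grad

def pgd_linf(x, y, loss_grad_fn, eps=0.1, alpha=0.02, steps=20):
    """PGD under an l_inf budget: ascent step followed by projection onto the eps-ball."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        _, g = loss_grad_fn(x + delta, y)
        delta = np.clip(delta + alpha * np.sign(g), -eps, eps)  # Pi_S clips per coordinate
    return x + delta

rng = np.random.default_rng(0)
w = rng.normal(size=20)
x_adv = pgd_linf(rng.normal(size=20), 1.0, make_logistic_loss(w, 0.0))
```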
12.3 Generative Approach in Adversarial Learning
Generative adversarial training is posed as the minimax game
$$\min_G \max_D\; \mathbb{E}_{x\sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z\sim p_z}[\log(1 - D(G(z)))], \qquad (12.36)$$
where pdata represents the true data distribution, and pz is the prior distribution from which latent
variables z are drawn. The generator learns a mapping G : z → x, parameterized by θG , such that
the distribution of generated samples pG (x) approximates pdata . The discriminator, parameterized
by θD , learns a function D : x → [0, 1], which estimates the probability that x is drawn from
pdata . The training of the generator and discriminator proceeds iteratively, where D is updated by
maximizing
JD (θD ) = Ex∼pdata [log D(x)] + Ez∼pz [log(1 − D(G(z)))] (12.37)
and G is updated by minimizing
$$J_G(\theta_G) = \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]. \qquad (12.38)$$
A fundamental result in adversarial learning is that, given sufficient capacity, the optimal discrim-
inator is
$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)} \qquad (12.39)$$
Substituting D∗ into the minimax objective leads to the Jensen-Shannon divergence between pdata
and pG , given by
$$\mathrm{JSD}(p_{\text{data}} \,\|\, p_G) = \frac{1}{2} D_{\mathrm{KL}}(p_{\text{data}} \,\|\, M) + \frac{1}{2} D_{\mathrm{KL}}(p_G \,\|\, M) \qquad (12.40)$$
where
$$M = \tfrac{1}{2}\,(p_{\text{data}} + p_G) \qquad (12.41)$$
and DKL (P ∥ Q) denotes the Kullback-Leibler divergence
$$D_{\mathrm{KL}}(P \,\|\, Q) = \int P(x) \log \frac{P(x)}{Q(x)}\, dx \qquad (12.42)$$
In practical settings, this divergence-based formulation leads to issues such as vanishing gradients
when pG and pdata do not overlap significantly. To address this, alternative formulations such as
Wasserstein GANs (WGANs) optimize the Wasserstein distance
$$W(p_{\text{data}}, p_G) = \inf_{\gamma \in \Pi(p_{\text{data}},\, p_G)} \mathbb{E}_{(x, \tilde{x}) \sim \gamma}\big[\|x - \tilde{x}\|\big],$$
where Π(pdata , pG ) denotes the set of joint distributions with marginals pdata and pG . A crucial
extension of adversarial learning is its application to conditional generation, where the generator
learns a conditional distribution pG (x | y). The objective function for a conditional GAN is given
by
$$\min_G \max_D\; \mathbb{E}_{(x,y)\sim p_{\text{data}}}[\log D(x, y)] + \mathbb{E}_{z\sim p_z,\, y\sim p_{\text{data}}}[\log(1 - D(G(z, y), y))] \qquad (12.45)$$
From a probabilistic standpoint, generative adversarial networks can be interpreted within the
framework of energy-based models, where the discriminator implicitly defines an energy function
$$E(x) = -\log \frac{D(x)}{1 - D(x)},$$
which at the optimal discriminator equals the negative log-ratio of Eq. 12.52 below.
Minimizing this energy aligns generated samples with the modes of the true distribution, reinforcing
the generative modeling capability of adversarial learning.
To formally understand this, let pdata (x) be the true data distribution and pG (x) be the gener-
ator’s induced distribution. The discriminator is parameterized by Dθ (x), where Dθ (x) outputs
the probability that x is a real sample. The optimal discriminator in the standard GAN formulation
is derived as:
$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}, \qquad (12.47)$$
which emerges from the optimization of the discriminator objective
$$\max_D\; \mathbb{E}_{x\sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{x\sim p_G}[\log(1 - D(x))]. \qquad (12.48)$$
In the energy-based interpretation, we introduce an energy function E(x) that models the relative
likelihood of a sample. The energy function is linked to probability distributions via the Gibbs
measure:
$$p(x) = \frac{e^{-E(x)}}{Z} \qquad (12.49)$$
$$E_\theta(x) \approx -\log \frac{p_{\text{data}}(x)}{p_G(x)}, \qquad (12.52)$$
which is precisely the log-ratio of the real and generated distributions, making GANs inherently an
instance of energy-based models. From a probabilistic standpoint, the generator learns to minimize
the divergence between pG (x) and pdata (x). Specifically, for a fixed discriminator, the generator’s
objective can be rewritten as minimizing the Jensen-Shannon divergence:
$$D_{\mathrm{JS}}(p_{\text{data}} \,\|\, p_G) = \frac{1}{2} D_{\mathrm{KL}}(p_{\text{data}} \,\|\, M) + \frac{1}{2} D_{\mathrm{KL}}(p_G \,\|\, M), \qquad (12.53)$$
where M = (1/2)(pdata + pG ). This reveals that GANs operate within an information-theoretic frame-
work where they reduce an energy-based measure of dissimilarity between real and generated sam-
ples. The training dynamics can be interpreted through contrastive divergence, which appears
naturally in energy-based models. The generator updates its parameters to produce samples that
have lower energy, thereby implicitly performing a form of Markov Chain Monte Carlo (MCMC)-
like sampling in the energy landscape defined by Eθ (x). The gradient update of the generator
is:
∇θG Ez∼pz [log(1 − D(G(z)))] = −Ez∼pz [∇θG Eθ (G(z))] (12.54)
which aligns with the contrastive divergence updates used in training energy-based models. This
equivalence suggests that GANs effectively perform energy minimization by dynamically adjusting
G such that its samples reduce the energy discrepancy between real and generated distributions.
Furthermore, considering a continuous-time stochastic formulation, the discriminator can be rein-
terpreted using a Langevin-type evolution:
$$\frac{dx}{dt} = -\nabla_x E_\theta(x) + \sqrt{2\beta^{-1}}\, \eta(t), \qquad (12.55)$$
where η(t) is a white noise process and β controls the temperature of the energy model. This
perspective suggests that adversarial training implicitly simulates diffusion in an energy-based po-
tential field, refining the generator’s distribution pG over time. Thus, from a probabilistic viewpoint,
generative adversarial networks fundamentally operate as an energy-based model where the dis-
criminator estimates an energy function that guides the generator toward producing samples that
match the statistics of real data. This establishes GANs as a probabilistic framework akin to en-
ergy minimization methods such as Boltzmann Machines, where adversarial training serves as an
implicit mechanism for defining an energy landscape in the data space.
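The minimax game of Eq. 12.36 can be exercised end to end on a one-dimensional toy problem. The sketch below uses manually derived gradients for a linear generator and a logistic discriminator; the generator follows the commonly used non-saturating objective −E[log D(G(z))] rather than the literal minimax loss, and all distributions and learning rates are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

# 1-D toy GAN: p_data = N(2, 0.5^2); G(z) = a*z + b; D(x) = sigmoid(w*x + c)
a, b, w, c, lr = 1.0, 0.0, 0.1, 0.0, 0.05
for _ in range(2000):
    x = rng.normal(2.0, 0.5, size=64)              # real samples
    z = rng.normal(size=64)
    g = a * z + b                                  # generated samples
    # Discriminator ascent on E[log D(x)] + E[log(1 - D(G(z)))], cf. Eq. (12.37)
    dx, dg = sigmoid(w * x + c), sigmoid(w * g + c)
    w += lr * np.mean((1 - dx) * x - dg * g)
    c += lr * np.mean((1 - dx) - dg)
    # Generator ascent on the non-saturating objective E[log D(G(z))]
    dg = sigmoid(w * (a * z + b) + c)
    grad_out = (1 - dg) * w                        # d/dG log D(G(z))
    a += lr * np.mean(grad_out * z)
    b += lr * np.mean(grad_out)
# At the game's equilibrium, G's samples a*z + b should match p_data = N(2, 0.5^2),
# i.e. p_G = p_data and D converges to the constant 1/2 of Eq. (12.47).
```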
13 Causal Inference in Deep Neural Networks
Causal inference in deep neural networks involves the rigorous mathematical study of cause-and-
effect relationships within high-dimensional representations learned by deep models. Unlike tra-
ditional statistical correlation, causal inference seeks to understand how perturbations in input
variables propagate through the model to affect outcomes, disentangling spurious correlations from
true causation. Given a set of random variables X, Y , and Z, a fundamental question in causal
inference is whether X causally influences Y or whether their observed association is due to a
confounding factor Z. The structural causal model (SCM) framework formalizes this using a set
of structural equations:
Y = fY (X, UY ) (13.1)
X = fX (Z, UX ) (13.2)
where fY and fX are deterministic functions encoding the causal mechanisms, and UY and UX are
exogenous noise variables, assumed to be mutually independent. The central problem in causal
inference is identifying causal effects through interventions, denoted as do(X = x), which replace
the original structural equation of X with a fixed value x, leading to a new distribution:
$$P(Y \mid do(X = x)) = \sum_{z} P(Y \mid X = x, Z = z)\, P(Z = z) \qquad (13.3)$$
This contrasts with traditional conditional probabilities P (Y |X), which do not eliminate con-
founding. In deep neural networks, causal inference is particularly challenging due to the high-
dimensional, entangled nature of representations. Given a deep neural network parameterized by
weights θ, mapping an input X to an output Y through multiple layers
$$H^{(l)} = \sigma\big(W^{(l)} H^{(l-1)} + b^{(l)}\big), \qquad H^{(0)} = X,$$
where H (l) denotes the activations at layer l, W (l) and b(l) are the weight matrices and biases, and
σ is a nonlinear activation function, causal inference seeks to identify whether perturbations in X
affect Y via a causal pathway. One approach to formalizing causality in deep networks is the use
of causal feature selection, which attempts to identify a subset of features S such that:
for all xS in the support of XS , ensuring that selected features S are causally relevant to Y . This
can be achieved using conditional independence tests based on the back-door criterion:
$$P(Y \mid do(X)) = \sum_{Z} P(Y \mid X, Z)\, P(Z), \qquad (13.6)$$
where Z blocks all back-door paths between X and Y . In deep neural networks, this translates to
modifying network architectures to impose structural constraints that enforce causal dependencies.
A key tool in causal inference for deep learning is the concept of counterfactual reasoning. Given
an observed instance X = x leading to output Y = y, a counterfactual query asks what would have
happened had X been different, i.e., the quantity Y_{X←x'} given the evidence X = x, Y = y, evaluated under the SCM with the exogenous variables inferred from the observation. In deep networks this motivates augmenting the predictive loss with a causal consistency penalty,
$$\mathcal{L} = \mathcal{L}_{\text{pred}} + \lambda\, \mathcal{L}_{\text{causal}},$$
where λ controls the strength of the causal penalty. This ensures that deep models learn represen-
tations aligned with causal mechanisms rather than spurious correlations. Another crucial aspect
is the causal transportability of learned models, ensuring that a model trained in one environ-
ment generalizes to another with different distributions P (X). This is characterized by domain
adaptation techniques that minimize the divergence
$$D\big(P(Y \mid do(X)) \,\|\, P'(Y \mid do(X))\big),$$
where P ′ is the distribution in the target domain, and D is a divergence measure such as KL-
divergence:
$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)} \qquad (13.10)$$
In deep networks, this is achieved using domain-invariant representations through adversarial train-
ing or contrastive learning. A critical issue in causal inference is the identifiability of causal effects
from observational data. The fundamental problem of causal inference states that counterfactual
outcomes cannot be directly observed: the potential outcomes Y_{X←x} and Y_{X←x′} are never jointly observed for the same unit
for x ̸= x′ , meaning that estimating P (Y |do(X)) requires additional assumptions such as uncon-
foundedness:
YX←x ⊥⊥ X|Z (13.12)
which allows estimation using propensity score matching:
$$P(Y \mid do(X)) = \sum_{Z} P(Y \mid X, Z)\, P(Z) \qquad (13.13)$$
Thus, deep causal inference seeks to disentangle causal structure within neural networks, ensur-
ing that learned representations encode true causal effects rather than spurious correlations. This
requires integrating structural causal models, counterfactual reasoning, domain adaptation, and in-
variant risk minimization, ensuring that deep models generalize to unseen domains while preserving
causal mechanisms.
13.1 Structural Causal Model (SCM)
Structural causal models formalize causal systems using structural equations and directed acyclic graphs (DAGs) to capture cause-and-effect mechanisms. Given
a set of endogenous variables V = {V1 , V2 , . . . , Vn } and exogenous variables U = {U1 , U2 , . . . , Um },
an SCM is defined as a tuple (V, U, F, P (U)), where F is a set of deterministic structural equations,
and P (U) is a probability distribution over exogenous variables. Each endogenous variable Vi is
determined by a function of its parents in the causal graph and the corresponding exogenous
variables:
Vi = fi (PAi , Ui ) (13.14)
where PAi represents the set of parent variables of Vi in the causal graph. The causal relationships
between variables are represented using a directed acyclic graph G = (V, E), where each edge
Vj → Vi signifies that Vj is a direct cause of Vi . The absence of an edge between Vj and Vi
indicates conditional independence given other parents. The structural equations define the causal
mechanism, and their functional form encodes counterfactual relationships. Given an intervention
do(X = x), where the variable X is forcibly set to x, the new SCM modifies the equation for X
while keeping all other equations unchanged:
SCMdo(X=x) = (V, U, FX=x , P (U)) (13.15)
where FX=x represents the modified set of structural equations with X replaced by the constant x.
Causal effects are quantified using interventional distributions. The post-intervention distribution
of an outcome Y given an intervention on X is computed using the causal effect formula:
$$P(Y \mid do(X = x)) = \sum_{V \setminus \{X, Y\}} \prod_{V_i \in V \setminus \{X\}} P\big(v_i \mid \mathrm{pa}_i\big)\Big|_{X = x}, \qquad (13.16)$$
which integrates out all non-intervened variables. The backdoor criterion provides a criterion for
identifying P (Y | do(X)) when a set of variables Z blocks all backdoor paths from X to Y , leading
to the adjustment formula:
$$P(Y \mid do(X)) = \sum_{Z} P(Y \mid X, Z)\, P(Z) \qquad (13.17)$$
Counterfactuals are evaluated by computing potential outcomes. Given an observed world where
X = x and Y = y, the counterfactual query YX=x′ seeks the value Y would have taken had X been
set to x′ . This is formalized using the three-step counterfactual algorithm:
1. Abduction: Infer the exogenous variables consistent with the observed data, U = f_X^{-1}(x). (13.18)
2. Action: Modify the SCM by replacing the structural equation for X with the intervention do(X = x′).
3. Prediction: Compute Y in the modified SCM using the exogenous variables inferred in the abduction step.
The probability of causation (PC) quantifies the likelihood that X caused Y by comparing the
counterfactual and factual outcomes:
P C = P (YX=1 = 1 | X = 0, Y = 0) (13.21)
which requires structural knowledge of the system. Identification of causal effects depends on
graphical conditions such as the front-door criterion, which allows inference of P (Y | do(X))
even when confounding is unobserved, provided an intermediate variable M satisfies:
$$P(Y \mid do(X)) = \sum_{m} P(Y \mid m)\, P(m \mid do(X)) \qquad (13.22)$$
SCMs generalize probabilistic models by encoding mechanistic relationships rather than mere sta-
tistical associations. The key distinction is that SCMs allow answering counterfactual queries,
distinguishing correlation from causation, and supporting robust decision-making under interven-
tions.
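The backdoor adjustment (Eq. 13.17) can be checked numerically on a simulated SCM. The binary confounded model below and its coefficients are assumptions made purely for illustration; the adjusted estimate recovers the true interventional effect while the naive conditional difference does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an SCM with confounding: Z -> X, Z -> Y, X -> Y (all binary)
n = 200_000
Z = rng.binomial(1, 0.5, n)
X = rng.binomial(1, np.where(Z == 1, 0.8, 0.2))
Y = rng.binomial(1, 0.2 + 0.3 * X + 0.3 * Z)       # true causal effect of X is 0.3

def p_y_do_x(x):
    """Backdoor adjustment: P(Y=1|do(X=x)) = sum_z P(Y=1|X=x,Z=z) P(Z=z), Eq. (13.17)."""
    total = 0.0
    for z in (0, 1):
        pz = np.mean(Z == z)
        total += Y[(X == x) & (Z == z)].mean() * pz
    return total

naive = Y[X == 1].mean() - Y[X == 0].mean()        # confounded association P(Y|X)
causal = p_y_do_x(1) - p_y_do_x(0)                 # adjusted estimate, close to 0.3
```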
13.2 Counterfactual Reasoning in Causal Inference for Deep Neural Networks
A common strategy penalizes an integral probability metric (IPM), L_IPM, between the latent representations Φ(x) of the treated and control groups. By minimizing this term, the network ensures that the latent distributions for treated and control
groups are similar, facilitating counterfactual inference. A common loss function for training the
neural network includes both factual prediction loss and a regularization term for domain alignment:
$$\mathcal{L} = \sum_{i=1}^{N} \ell(Y_i, \hat{Y}_i) + \lambda\, \mathcal{L}_{\text{IPM}}. \qquad (13.28)$$
i=1
In adversarial variants, a discriminator network D(Φ(x)) tries to distinguish between treated and untreated
units in latent space, and its adversarial training ensures treatment group similarity. Thus, coun-
terfactual reasoning in deep neural networks is achieved by leveraging representation learning,
adversarial domain adaptation, and balanced predictive modeling to estimate missing potential
outcomes.
13.3 Causal Domain Adaptation
Consider estimating a causal effect learned on a source domain with distribution PS and transferring it to a target domain with distribution PT. Direct inference of the causal effect in the target domain is hindered by the discrepancy
between the marginal distributions PS (X) and PT (X), a phenomenon known as covariate shift.
This necessitates a reweighting strategy to correct for the distributional shift. A common approach
involves the use of importance weights:
w(X) = P_T(X) / P_S(X), (13.32)
which can be estimated via density ratio estimation techniques such as kernel mean matching or
adversarial learning. Given these weights, the expectation of any function f(X, A, Y) in the target domain can be approximated using observations from the source domain, since E_{P_T}[f(X, A, Y)] = E_{P_S}[w(X) f(X, A, Y)].
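One common density-ratio estimator trains a probabilistic classifier to distinguish source from target samples and converts its output into w(X) via Bayes' rule. The sketch below (synthetic Gaussian domains, scikit-learn logistic regression) is illustrative rather than a prescribed implementation.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    Xs = rng.normal(0.0, 1.0, size=(500, 2))   # source samples ~ P_S
    Xt = rng.normal(0.5, 1.0, size=(500, 2))   # target samples ~ P_T

    # Label source as 0, target as 1, and fit a domain classifier.
    X = np.vstack([Xs, Xt])
    d = np.concatenate([np.zeros(len(Xs)), np.ones(len(Xt))])
    clf = LogisticRegression().fit(X, d)

    # By Bayes' rule, P_T(x)/P_S(x) = [P(d=1|x)/P(d=0|x)] * (n_S / n_T).
    prob = clf.predict_proba(Xs)
    w = (prob[:, 1] / prob[:, 0]) * (len(Xs) / len(Xt))
    print(w.mean())   # importance weights for reweighting the source-domain loss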
A deep neural network-based approach to causal domain adaptation typically parameterizes the
outcome model hθ (X, A) as a neural network with parameters θ, trained to minimize the weighted
empirical loss:
L(θ) = Σ_{i=1}^{n_S} w(X_i) ℓ(h_θ(X_i, A_i), Y_i), (13.34)
where ℓ(·, ·) is a suitable loss function such as mean squared error for continuous outcomes or
cross-entropy loss for binary outcomes. Additionally, domain-adversarial neural networks (DANNs)
introduce a domain discriminator Dϕ (X), parameterized by ϕ, trained to distinguish between source
and target samples while the feature extractor gθ (X) is trained to minimize the discriminator’s
ability to differentiate domains, effectively promoting invariant feature representations:
min_θ max_ϕ Σ_{i=1}^{n_S} log D_ϕ(g_θ(X_i)) + Σ_{j=1}^{n_T} log(1 − D_ϕ(g_θ(X_j))). (13.35)
By integrating the domain-invariant representation gθ (X) with causal inference models, the con-
founding bias induced by distribution shift is mitigated, allowing for unbiased estimation of the
causal effect in the target domain. Furthermore, the potential outcome framework in deep learning-
based causal domain adaptation estimates the counterfactual outcomes Y 0 , Y 1 using a shared fea-
ture encoder fθ (X) and domain-invariant treatment-response functions hθ0 (X) and hθ1 (X):
Ŷ 0 = hθ0 (fθ (X)), Ŷ 1 = hθ1 (fθ (X)). (13.36)
The individualized treatment effect (ITE) is then estimated as:
τ̂ (X) = Ŷ 1 − Ŷ 0 . (13.37)
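The two-head construction (13.36)-(13.37) can be sketched with a fixed random feature map standing in for the shared encoder f_θ and closed-form ridge regressions standing in for the heads h_{θ_0} and h_{θ_1}; the data-generating process below is synthetic and purely illustrative.

    import numpy as np

    rng = np.random.default_rng(2)
    n, d = 1000, 5
    X = rng.normal(size=(n, d))
    T = rng.integers(0, 2, size=n)                             # treatment assignment
    Y = X @ rng.normal(size=d) + 2.0 * T + rng.normal(size=n)  # true ITE = 2

    # Shared "encoder" f_theta: a fixed random nonlinear feature map (illustrative).
    W = rng.normal(size=(d, 16))
    Phi = np.tanh(X @ W)

    def ridge_fit(A, y, lam=1e-2):
        # Closed-form ridge regression: (A^T A + lam I)^{-1} A^T y.
        return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ y)

    theta0 = ridge_fit(Phi[T == 0], Y[T == 0])   # control head h_{theta_0}
    theta1 = ridge_fit(Phi[T == 1], Y[T == 1])   # treated head h_{theta_1}

    tau_hat = Phi @ theta1 - Phi @ theta0        # ITE estimate (13.37)
    print(tau_hat.mean())                        # close to 2 if the heads fit well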
To ensure domain generalization, recent advances incorporate representation learning techniques
such as contrastive learning and information bottlenecks, where the feature extractor gθ (X) is
regularized to minimize the mutual information between domain-specific variations and the learned
representations:
min_θ E_{X∼P_S(X)∪P_T(X)} [ ∥g_θ(X) − g_θ(X̃)∥_2^2 ], (13.38)
where X̃ denotes a perturbed or augmented view of X.
Mathematically, let H denote the hypothesis space, which includes functions h parameterized by
a deep neural network. We define an invariant predictor h that minimizes the worst-case risk over
all observed environments:
min_{h∈H} max_{e∈E} R^e(h), (13.40)
To relax the invariance constraint (that the same classifier be simultaneously optimal in every environment) into a differentiable objective, the IRM penalty is introduced:
L_IRM = Σ_{e∈E} R^e(w ∘ Φ) + λ ∥∇_w R^e(w ∘ Φ)|_{w=1}∥^2. (13.47)
Here, the gradient term ensures that the classifier w remains optimal across all environments by
enforcing that its gradient with respect to w vanishes. The hyperparameter λ controls the strength
of this regularization. Expanding the penalty term,
Σ_{e∈E} ∥∇_w R^e(w ∘ Φ)|_{w=1}∥^2 = Σ_{e∈E} ∥E_{(X,Y)∼P_e}[∇_w ℓ(w Φ(X), Y)]∥^2, (13.48)
ensuring that the gradients of the risks are aligned across environments. A deeper connection to
causality emerges when considering the Structural Equation Model (SEM):
Y = f ∗ (X) + ϵ, X = g(Z, S), (13.49)
where Z are the causal features, S are spurious features, and ϵ is an independent noise term.
An ERM-trained deep network might learn S because it improves empirical risk minimization in
observed environments, while IRM discourages dependence on S by enforcing that the predictor
generalizes across all environments. A theoretical justification for IRM follows from an analysis of
the optimal invariant predictor. Define the optimal classifier for a given representation Φ:
w_e^* = arg min_w R^e(w ∘ Φ). (13.50)
For a truly invariant representation Φ∗ , the classifier should be identical across all environments:
w_e^* = w^* ∀e ∈ E. (13.51)
This leads to the condition:
∇_w R^e(w ∘ Φ) = 0 ∀e ∈ E, (13.52)
which is precisely the IRM penalty. Empirically, training with IRM often results in improved out-of-
distribution generalization compared to standard deep learning approaches. Finally, an alternative
interpretation of IRM emerges from considering a variational bound on the mutual information
between the learned representation and the causal variables:
I(Z; Y ) = H(Y ) − H(Y |Z). (13.53)
IRM implicitly maximizes this information while minimizing the conditional entropy H(Y |Z), lead-
ing to representations that capture causal relationships rather than spurious correlations.
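For a scalar classifier w on top of a one-dimensional representation Φ with squared loss, the gradient in (13.48) has a closed form, so the penalty (13.47) can be sketched without automatic differentiation. The environments below are synthetic and illustrative.

    import numpy as np

    rng = np.random.default_rng(3)

    def environment(n, slope):
        # Synthetic environment: scalar representation phi and target y.
        phi = rng.normal(size=n)
        y = slope * phi + 0.1 * rng.normal(size=n)
        return phi, y

    envs = [environment(500, 1.0), environment(500, 1.5)]

    def risk_and_grad(phi, y, w=1.0):
        # R^e(w . Phi) = mean((w*phi - y)^2); gradient w.r.t. scalar w at w = 1.
        resid = w * phi - y
        return np.mean(resid ** 2), np.mean(2.0 * resid * phi)

    lam = 10.0
    L_irm = 0.0
    for phi, y in envs:
        risk, grad = risk_and_grad(phi, y)
        L_irm += risk + lam * grad ** 2      # per-environment term of (13.47)
    print(L_irm)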
The standard ERM objective minimizes the empirical risk R̂(h) = (1/N) Σ_{i=1}^{N} ℓ(h(X_i), Y_i), where h : R^d → R is the hypothesis function parameterized by a deep neural network and ℓ(·, ·) is a loss function, such as the squared loss ℓ(y, ŷ) = (y − ŷ)^2.
However, in causal inference, the presence of treatment assignment T necessitates a modified ob-
jective that accounts for the potential outcome framework. A common approach is to estimate the
potential outcomes Y (0) and Y (1) using separate deep neural networks, h0 (X; θ0 ) and h1 (X; θ1 ),
parameterized by θ0 , θ1 , leading to the empirical risk formulation
R̂(θ_0, θ_1) = (1/N) Σ_{i=1}^{N} [ T_i ℓ(h_1(X_i; θ_1), Y_i) + (1 − T_i) ℓ(h_0(X_i; θ_0), Y_i) ]. (13.57)
This objective function is a weighted empirical risk based on the observed treatment assignment,
ensuring that only the factual outcome is used in training. The challenge arises due to the imbalance
in treatment assignments, leading to potential covariate shift between the treated and control
groups. To mitigate this, inverse propensity weighting (IPW) is often incorporated, where the
empirical risk is modified as
R̂_IPW(θ_0, θ_1) = (1/N) Σ_{i=1}^{N} [ (T_i / e(X_i)) ℓ(h_1(X_i; θ_1), Y_i) + ((1 − T_i) / (1 − e(X_i))) ℓ(h_0(X_i; θ_0), Y_i) ], (13.58)
where e(X) = P (T = 1 | X) is the propensity score. This adjustment ensures that the empirical
distribution of covariates in both treatment groups better approximates the underlying population
distribution. However, direct inverse weighting can lead to high variance, motivating the use of
balancing representations via domain adaptation methods such as representation learning with deep
neural networks. Specifically, a feature representation function Φ : Rd → Rm is learned such that
the treated and control groups become indistinguishable in the transformed space. The empirical
risk in this case is
R̂_Φ(θ_0, θ_1, ϕ) = (1/N) Σ_{i=1}^{N} [ T_i ℓ(h_1(Φ(X_i; ϕ); θ_1), Y_i) + (1 − T_i) ℓ(h_0(Φ(X_i; ϕ); θ_0), Y_i) ], (13.59)
with an additional discrepancy term penalizing differences between the distributions of Φ(X) under
T = 1 and T = 0, such as the Maximum Mean Discrepancy (MMD),
D(Φ) = ∥ (1/N_T) Σ_{i: T_i = 1} k(Φ(X_i), ·) − (1/N_C) Σ_{i: T_i = 0} k(Φ(X_i), ·) ∥_H, (13.60)
where k(·, ·) is a kernel function and H is the reproducing kernel Hilbert space. The final objective
function then combines empirical risk minimization with regularization for balancing, for instance L(θ_0, θ_1, ϕ) = R̂_Φ(θ_0, θ_1, ϕ) + α D(Φ) for a balancing weight α > 0. Doubly robust variants, which combine the outcome models with the propensity score weights of (13.58), ensure consistency even when either the outcome model or the propensity score
model is misspecified. Ultimately, the optimization of these empirical risk formulations in deep
neural networks involves gradient-based methods, with stochastic gradient descent (SGD) or Adam
optimizing θ0 , θ1 , ϕ, while the propensity score model e(X) is learned via logistic regression or a
separate neural network. The complexity of ERM in causal deep learning thus lies in balancing
factual accuracy, counterfactual generalization, and representation learning to ensure robust ITE
estimation.
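A minimal sketch of the inverse propensity weighted risk (13.58) follows, with a logistic-regression propensity model and linear least-squares stand-ins for the outcome networks h_0 and h_1; the confounded data-generating process is synthetic and illustrative.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(4)
    n, d = 2000, 3
    X = rng.normal(size=(n, d))
    # Confounded treatment: assignment probability depends on X.
    T = (rng.random(n) < 1.0 / (1.0 + np.exp(-X[:, 0]))).astype(int)
    Y = X.sum(axis=1) + 1.5 * T + rng.normal(size=n)

    e_hat = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]  # e(X)

    def sq_loss(pred, y):
        return (pred - y) ** 2

    # Stand-in outcome models h_0, h_1 (simple linear fits, illustrative).
    h1 = np.linalg.lstsq(X[T == 1], Y[T == 1], rcond=None)[0]
    h0 = np.linalg.lstsq(X[T == 0], Y[T == 0], rcond=None)[0]

    # Weighted empirical risk of equation (13.58).
    risk = np.mean(T / e_hat * sq_loss(X @ h1, Y)
                   + (1 - T) / (1 - e_hat) * sq_loss(X @ h0, Y))
    print(risk)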
14 Network Architecture Search (NAS)
in Deep Neural Networks
Network Architecture Search (NAS) in Deep Neural Networks (DNNs) is an optimization problem
that seeks to automatically design the best neural network architecture for a given task, typically
involving a search space A, a search strategy S, and a performance evaluation metric E. Mathe-
matically, the NAS problem can be framed as finding the optimal architecture α∗ that maximizes
a performance function F, which can be formally written as
α^* = arg max_{α∈A} F(α, D), (14.1)
where D represents the dataset used for training and evaluation. The search space A defines the
possible neural network architectures, which can be represented as a directed acyclic graph (DAG)
G = (V, E), where nodes V correspond to layers (such as convolutional, fully connected, or recurrent
layers), and edges E represent connections between layers. The function F is typically defined in
terms of accuracy, loss, computational efficiency, and resource constraints. The search process can
be formulated as a reinforcement learning (RL) problem, where an agent explores the architecture
space and updates its policy using rewards derived from model performance. Let πθ (at | st ) be a
policy parameterized by θ, which selects an action at (modifying the architecture) given state st
(current architecture configuration). The expected reward function is then
J(θ) = E_{τ∼π_θ} [ Σ_{t=0}^{T} γ^t r_t ] (14.2)
where r_t is the reward at step t, γ ∈ (0, 1] is a discount factor, and the discounted sum is the cumulative reward R(τ) of a sampled trajectory τ. Alternatively, evolutionary algorithms model the search as a population-based optimization problem, evolving architectures through
crossover and mutation operators. The fitness function in this context is the evaluation metric E, such as classification accuracy or mean squared error, applied to each trained candidate architecture. Another
approach is gradient-based NAS, where the architecture is parameterized by continuous variables αi
and optimized via gradient descent. If w denotes network weights and L(w, α) is the loss function,
then the architecture optimization problem is
min_α L(w^*(α), α) subject to w^*(α) = arg min_w L(w, α).
The optimization follows a bilevel formulation, where the outer optimization updates α while the inner optimization solves for w. The gradient of L(w^*(α), α) with respect to α is computed using implicit differentiation of the optimality condition ∇_w L(w^*(α), α) = 0, which gives ∂w^*/∂α = −(∇_w^2 L)^{−1} ∇_{w,α}^2 L and hence dL/dα = ∇_α L + (∂w^*/∂α)^T ∇_w L.
A continuous relaxation of discrete architectures is achieved using softmax over candidate operations
oi ,
o_mixed(x) = Σ_i [ exp(α_i) / Σ_j exp(α_j) ] o_i(x). (14.8)
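The continuous relaxation (14.8) and its subsequent discretization can be sketched in a few lines; the three candidate operations below are illustrative stand-ins for a real NAS operation set.

    import numpy as np

    # Candidate operations on an edge of the cell (illustrative choices).
    ops = [
        lambda x: x,                        # identity / skip connection
        lambda x: np.maximum(x, 0.0),       # ReLU
        lambda x: np.zeros_like(x),         # "zero" op (prunes the edge)
    ]

    alpha = np.array([0.2, 1.3, -0.5])      # architecture parameters

    def mixed_op(x, alpha):
        # Equation (14.8): softmax(alpha)-weighted combination of candidate ops.
        w = np.exp(alpha - alpha.max())
        w /= w.sum()
        return sum(wi * op(x) for wi, op in zip(w, ops))

    x = np.array([-1.0, 0.5, 2.0])
    print(mixed_op(x, alpha))
    # Discretization after training keeps the op with the largest alpha_i:
    print("chosen op index:", int(np.argmax(alpha)))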
The optimal architecture is obtained by discretizing αi after training. The computational com-
plexity of NAS is often mitigated by weight sharing across architectures, leading to one-shot NAS
methods where a supernet N encompasses all possible subnets α, and training occurs over a shared
weight space W ,
min_W E_{α∼A} L(W, α). (14.9)
After training, the best subnet is selected by evaluating candidate architectures sampled from
A. The performance of a neural architecture is influenced by hyperparameters λ, which affect
layer depth, width, and activation functions. A common formulation includes multi-objective
optimization,
max_α E[F(α)] − λ · C(α) (14.10)
where C(α) represents computational cost (e.g., FLOPs or latency). The search can be constrained
using Lagrange multipliers,
L(α, λ) = F(α) − λC(α), (14.11)
with the optimal trade-off achieved by solving
∂L/∂α = 0, ∂L/∂λ = 0. (14.12)
By iteratively refining a surrogate model F̂ that approximates F from evaluated candidates, the search converges to architectures with high performance and low
computational cost. The theoretical underpinnings of NAS rely on neural tangent kernels (NTKs)
to approximate training dynamics, where the NTK matrix Θ evolves as
dΘ/dt = −ηΘ^2, (14.13)
where η is the learning rate. A stable architecture keeps the spectrum of Θ bounded and well conditioned under this evolution. Thus, NAS systematically optimizes architectures through structured exploration of the search
space, reinforcement learning, evolutionary algorithms, differentiable search, and surrogate model-
ing, ensuring optimality under computational constraints.
Evolutionary approaches to NAS represent each candidate architecture as a set of hyperparameters and connection structures, encoding them into a genetic representation
that undergoes evolutionary processes such as selection, mutation, and crossover to explore the vast
search space efficiently. Given a population of architectures P = {N1 , N2 , . . . , NN }, where each
Ni represents a candidate neural network, we define the fitness function f : N → R, which eval-
uates the performance of a network based on metrics such as validation accuracy, computational
efficiency, or robustness to perturbations.
Each neural network architecture Ni can be described by a directed acyclic graph (DAG) Gi =
(Vi , Ei ), where Vi represents the set of layers (e.g., convolutional layers, pooling layers, fully con-
nected layers), and Ei represents the connections between them. A typical encoding scheme maps
Gi to a vector representation xi ∈ Rd , where each element represents hyperparameters such as the
number of filters Fl in a convolutional layer l, kernel sizes Kl , and activation functions σl . The
architecture search problem can thus be formulated as an optimization problem:
max_{x_i ∈ X} f(x_i) (14.15)
where X is the feasible set of architectures defined by design constraints (e.g., hardware constraints,
FLOP limitations). The evolutionary process begins with the initialization of a population P (0) ,
where architectures are randomly generated or sampled from prior knowledge. At each generation
t, a selection operator S chooses a subset of architectures P_s^{(t)} ⊂ P^{(t)} based on their fitness scores:
P_s^{(t)} = S(P^{(t)}, f) (14.16)
Common selection methods include tournament selection and rank-based selection. The selected
architectures undergo crossover and mutation operations. Crossover combines two parent architec-
tures Na and Nb to produce an offspring Nc using a recombination function C:
Nc = C(Na , Nb ) (14.17)
For instance, if architectures are encoded as binary strings representing layer connections, one-point
or uniform crossover can be applied:
x_c = (x_a^{(1:k)}, x_b^{(k+1:d)}) (14.18)
where k is a randomly selected crossover point. Mutation introduces random variations in the
offspring architecture to maintain diversity in the population. Given a mutation rate pm , a mutation
operator M perturbs the architecture encoding:
x′i = M (xi , pm ) (14.19)
Common mutation operations include altering the number of filters Fl , modifying kernel sizes Kl ,
or randomly inserting/deleting layers. The updated population P (t+1) for the next generation is
formed by selecting the top-performing architectures:
P^{(t+1)} = Elitism(P_s^{(t)} ∪ P_c^{(t)}, f) (14.20)
where P_c^{(t)} is the set of offspring architectures and elitism ensures that the best architectures persist
across generations. The process repeats until a termination criterion is met, such as a maximum
number of generations T or convergence in fitness scores:
|f(N^{(t)}) − f(N^{(t−1)})| < ϵ (14.21)
where ϵ is a predefined tolerance threshold. The search efficiency of EAs can be enhanced using
surrogate models fˆ that approximate the fitness function:
f̂(x) = Σ_{i=1}^{n} w_i K(x, x_i) (14.22)
where K(x, xi ) is a kernel function and wi are weights learned from evaluated architectures.
Bayesian optimization and neural predictors are often employed to guide the search towards promis-
ing regions of X . To further improve convergence speed, weight inheritance strategies transfer
trained weights from parent networks to offspring, reducing the need for full training cycles. Given
a parent-offspring pair (Np , Nc ), weight inheritance is expressed as:
Wc = T (Wp , Np , Nc ) (14.23)
where Wp and Wc denote the weight matrices of the parent and child networks, and T is a
transformation function that maps weights based on structural similarities. Evolutionary NAS
has demonstrated competitive performance against reinforcement learning-based and gradient-
based approaches, particularly in discovering novel architectures for convolutional neural networks
(CNNs) and transformers. The combination of evolutionary search with differentiable search spaces,
such as in Differentiable Architecture Search (DARTS), leads to hybrid methods where architectures
evolve within a continuous relaxation of the search space:
L(α) = Σ_i w_i σ(α_i) (14.24)
where α represents architecture parameters and σ(·) is a softmax function ensuring differentiabil-
ity. By integrating EAs with differentiable optimization, recent methods achieve state-of-the-art
performance while maintaining exploration capabilities.
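The selection, crossover, mutation, and elitism operators of (14.16)-(14.20) assemble into a compact evolutionary loop. In the sketch below, architectures are toy integer gene vectors and the fitness function is an analytic stand-in for training and validating a network.

    import numpy as np

    rng = np.random.default_rng(5)
    D = 6                 # genes: e.g. filter/kernel-size choices per layer (toy)

    def fitness(x):
        # Stand-in for validation accuracy; real NAS would train and evaluate x.
        return -np.sum((x - 3) ** 2)

    def tournament(pop, fits, k=3):
        idx = rng.choice(len(pop), size=k, replace=False)
        return pop[idx[np.argmax(fits[idx])]]

    def crossover(a, b):
        k = rng.integers(1, D)                         # one-point crossover (14.18)
        return np.concatenate([a[:k], b[k:]])

    def mutate(x, pm=0.2):
        mask = rng.random(D) < pm                      # mutation operator (14.19)
        x = x.copy()
        x[mask] = rng.integers(0, 8, size=mask.sum())
        return x

    pop = rng.integers(0, 8, size=(20, D))
    for t in range(50):
        fits = np.array([fitness(x) for x in pop])
        children = [mutate(crossover(tournament(pop, fits), tournament(pop, fits)))
                    for _ in range(len(pop))]
        # Elitism (14.20): keep the best individual, then fill with offspring.
        pop = np.vstack([pop[np.argmax(fits)], *children])[:len(pop)]
    print(pop[0], fitness(pop[0]))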
To optimize J(θ), policy gradient methods are employed, where the gradient of the expected reward
with respect to policy parameters is given by:
∇_θ J(θ) = E_π [ Σ_{t=0}^{T} ∇_θ log π(a_t | s_t; θ) R_t ]. (14.26)
Using Monte Carlo estimates, the policy parameters are updated via gradient ascent:
θ ← θ + α Σ_{t=0}^{T} ∇_θ log π(a_t | s_t; θ) R_t. (14.27)
A common approach to implementing NAS with RL is the use of a Recurrent Neural Network
(RNN)-based controller that generates candidate architectures sequentially. The hidden state ht of
the RNN encodes information about the architecture decisions made up to step t, and the action a_t at each step is sampled from a softmax distribution over the candidate architectural choices.
The reward signal is obtained by training and evaluating the generated architecture on a validation
set, yielding an accuracy score A, which serves as the reward:
R = A − b, (14.29)
where b is a baseline used for variance reduction. The training of the controller follows the REINFORCE algorithm with baseline subtraction:
θ ← θ + α(R − b) Σ_{t=0}^{T} ∇_θ log π(a_t | s_t; θ). (14.30)
To further stabilize learning, advantage estimation methods such as Generalized Advantage Es-
timation (GAE) can be employed, where the advantage function At is computed using the value
function V (st ):
At = Rt + γV (st+1 ) − V (st ). (14.31)
An alternative approach is to model the NAS problem as a Q-learning task, where the Q-value
represents the expected future reward given a state-action pair:
Q(s_t, a_t) = E [ Σ_{k=0}^{∞} γ^k R_{t+k} | s_t, a_t ]. (14.32)
In practical implementations, Deep Q-Networks (DQN) are utilized, where a neural network ap-
proximates the Q-function Q(s, a; θ), and updates are made using the loss function:
L(θ) = E_{(s,a,r,s′)∼D} [ ( r + γ max_{a′} Q(s′, a′; θ^−) − Q(s, a; θ) )^2 ], (14.34)
where θ− represents target network parameters and D is an experience replay buffer. To address
the instability of Q-learning, Double Q-learning is often employed, where two networks, Q_1 and Q_2, are used to decouple action selection and evaluation, e.g. via targets of the form r + γ Q_2(s′, arg max_{a′} Q_1(s′, a′)); a clipped variant Q_min = min(Q_1, Q_2) selects the minimum Q-value from the two estimators to mitigate overestimation bias. In
actor-critic methods such as Proximal Policy Optimization (PPO), the actor updates the policy
while the critic evaluates its performance using the objective:
L(θ) = E [ min( r_t(θ) A_t, clip(r_t(θ), 1 − ϵ, 1 + ϵ) A_t ) ], r_t(θ) = π(a_t | s_t; θ) / π(a_t | s_t; θ_old). (14.36)
This clipped objective prevents large policy updates, ensuring stability in the RL-based NAS frame-
work. By iteratively refining the architecture using these RL techniques, NAS can effectively dis-
cover high-performing network topologies that outperform manually designed architectures.
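The REINFORCE update with baseline subtraction (14.30) can be sketched on a toy one-step controller with a softmax policy over a handful of discrete architectural actions; the reward table below is a synthetic stand-in for validation accuracy.

    import numpy as np

    rng = np.random.default_rng(6)
    n_actions = 4
    theta = np.zeros(n_actions)                    # logits of a softmax policy
    true_reward = np.array([0.1, 0.9, 0.4, 0.2])   # stand-in validation accuracies

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    baseline, alpha, beta = 0.0, 0.1, 0.9
    for step in range(2000):
        p = softmax(theta)
        a = rng.choice(n_actions, p=p)
        R = true_reward[a] + 0.05 * rng.normal()       # noisy reward signal
        baseline = beta * baseline + (1 - beta) * R    # moving-average baseline b
        # grad of log pi(a) for a softmax policy: one-hot(a) - p
        grad_log_pi = -p
        grad_log_pi[a] += 1.0
        theta += alpha * (R - baseline) * grad_log_pi  # update (14.30), T = 1
    print(softmax(theta))   # mass concentrates on the best action (index 1)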
In the underlying Markov decision process, p(s_{t+1} | s_t, a_t) is the environment transition function, often deterministic in the context of architecture search. To optimize J(θ), the policy gradient theorem provides an unbiased estimator
of the gradient using the likelihood ratio trick:
∇_θ J(θ) = E_{τ∼p_θ(τ)} [ R(τ) Σ_{t=0}^{T} ∇_θ log π_θ(a_t | s_t) ].
In practice, this expectation is estimated via Monte Carlo sampling over a set of architectures,
leading to the update rule:
θ ← θ + α Σ_{τ∼p_θ(τ)} R(τ) Σ_{t=0}^{T} ∇_θ log π_θ(a_t | s_t). (14.42)
Subtracting a state-dependent baseline b(s_t) from the return reduces the variance of this estimator; a common choice for b(s_t) is a learned value function V^π(s_t), leading to an advantage function
A(st , at ) = R(τ ) − V π (st ), which forms the basis for actor-critic methods. The reward function
R(τ ) is typically designed to reflect the generalization ability of the architecture, often incorporating
the validation accuracy of the trained model:
R(τ) = (1/N) Σ_{i=1}^{N} ℓ(y_i, f_A(x_i; w^*)) (14.44)
where A denotes the architecture defined by the trajectory τ , fA represents the network function,
and w∗ are the trained weights obtained by minimizing the training loss
w^* = arg min_w Σ_{i=1}^{N} ℓ(y_i, f_A(x_i; w)). (14.45)
An entropy bonus is often added to the objective, J_ent(θ) = J(θ) + λ H(π_θ), where
H(π_θ) = − Σ_{t=0}^{T} Σ_a π_θ(a | s_t) log π_θ(a | s_t) (14.47)
encourages diverse architecture sampling. In a real implementation, the policy πθ is usually param-
eterized by a recurrent neural network (RNN) such as an LSTM, where the hidden state encodes
past architecture decisions. The architecture parameters θ are updated using stochastic gradient
descent with policy gradient updates, while the sampled architectures are trained using standard
backpropagation. The convergence of policy gradient methods in architecture search relies on
ensuring sufficient exploration and avoiding premature convergence to suboptimal architectures.
Techniques such as proximal policy optimization (PPO), trust region policy optimization (TRPO),
and natural gradient methods are often used to stabilize updates and improve sample efficiency.
These methods modify the standard gradient update by introducing constraints such as
E_s [ D_KL(π_{θ_old}(· | s) ∥ π_θ(· | s)) ] ≤ δ,
where D_KL is the Kullback-Leibler divergence, ensuring that the updated policy does not deviate
too drastically from the previous policy, thereby maintaining stability in the optimization process.
As the width of the network approaches infinity, the NTK remains constant during training under
gradient descent, allowing the network to be effectively modeled as a Gaussian process. This
property enables rapid evaluation of the convergence behavior and generalization ability of a given
architecture by analyzing the eigenvalues of the NTK matrix. The training dynamics of an infinitely
wide neural network can be described using the evolution of the function ft (x) under gradient
descent with a learning rate η, which follows
df_t(x)/dt = −η Σ_{i=1}^{n} Θ(x, x_i)(f_t(x_i) − y_i). (14.50)
Since the NTK remains constant in the infinite-width regime, this equation can be solved explicitly
using
f_t(x) = f_0(x) + Θ(x, X) Θ(X, X)^{−1} (I − e^{−ηΘ(X,X)t})(y − f_0(X)), (14.51)
where X = {x1 , . . . , xn } denotes the training set inputs, and y = {y1 , . . . , yn } are the corresponding
targets. The rate of convergence is determined by the smallest eigenvalue λmin of Θ(X, X), implying
that architectures with larger minimum eigenvalues of their NTKs converge faster and are thus more
trainable. The NTK perspective enables a principled approach to NAS by evaluating candidate
architectures based on their kernel spectra, particularly the condition number κ(Θ) = λ_max/λ_min, which
governs the stability of gradient descent. Generalization in deep learning is often characterized
by the smoothness and complexity of learned functions. The NTK controls the function space
induced by a given architecture, allowing estimation of generalization error through the kernel
ridge regression formula f̂(x) = Θ(x, X)(Θ(X, X) + σ^2 I)^{−1} y,
where σ^2 is the noise variance. Architectures with better generalization properties exhibit smaller
trace norms of the NTK inverse. In the infinite-width limit, NTKs of different architectures can
be computed analytically by considering the recursive propagation of covariance matrices through
network layers. For a fully connected network with ReLU activation, the NTK at depth L is given
by
Θ(L) (x, x′ ) = Θ(L−1) (x, x′ )K̇ (L) (x, x′ ), (14.53)
where K̇ (L) (x, x′ ) is the derivative of the activation function’s covariance kernel. Different archi-
tectures, such as convolutional neural networks (CNNs), modify this recurrence relation by incor-
porating weight-sharing and local connectivity constraints. The NTK for a CNN layer with filter
size k is
Θ_conv^{(L)}(x, x′) = (1/k^2) Σ_{i,j} Θ^{(L−1)}(x_i, x_j′) K̇^{(L)}(x_i, x_j′). (14.54)
This allows NAS to be performed analytically by evaluating how architectural choices influence
the NTK and its spectral properties, bypassing the need for expensive training runs. Recent work
has extended NTK-based NAS to attention-based architectures, such as transformers, where the
self-attention mechanism induces a structured NTK
Θ_attn(x, x′) = E [ softmax( Q(x)K(x′)^T / √d ) V(x)V(x′)^T ], (14.55)
where Q, K, V are the query, key, and value matrices. The spectrum of Θattn determines the expres-
sivity and trainability of transformer-based models, enabling NTK-based architectural optimization
in sequence learning tasks. The NTK framework provides an elegant, mathematically rigorous ap-
proach to NAS by linking architectural choices to trainability and generalization through spectral
analysis of the kernel.
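The closed-form dynamics (14.51), including the (I − e^{−ηΘt}) factor, can be evaluated by eigendecomposition of the kernel Gram matrix. The RBF kernel below is an illustrative stand-in for an architecture's NTK; the same code also reports the condition number κ(Θ) used above as a trainability score.

    import numpy as np

    rng = np.random.default_rng(7)
    n = 30
    X = rng.uniform(-1, 1, size=(n, 1))
    y = np.sin(3 * X[:, 0])

    def kernel(A, B, ell=0.5):
        # Illustrative RBF kernel standing in for the architecture's NTK.
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * ell ** 2))

    K = kernel(X, X)
    lam, U = np.linalg.eigh(K)               # spectral decomposition of Theta(X, X)
    lam = np.clip(lam, 1e-12, None)          # clip tiny eigenvalues for stability

    def f_t(x_test, t, eta=1.0, f0=0.0):
        # f_t = f0 + K(x, X) K^{-1} (I - exp(-eta K t)) (y - f0), per (14.51).
        decay = U @ np.diag((1 - np.exp(-eta * lam * t)) / lam) @ U.T
        return f0 + kernel(x_test, X) @ decay @ (y - f0)

    x_grid = np.linspace(-1, 1, 5).reshape(-1, 1)
    print(f_t(x_grid, t=100.0))              # approaches the interpolant as t grows
    print("condition number:", lam.max() / lam.min())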
15 Learning paradigms
Another major breakthrough in unsupervised learning emerged with the self-organizing maps
(SOMs) developed by Kohonen in 1982 [1002]. These neural-inspired models provide a competitive
learning framework where neurons adjust their weights to form low-dimensional representations
of input data while preserving topological relationships. This biologically motivated approach to
clustering has been instrumental in feature extraction and visualization, especially in applications
involving high-dimensional data. Parallel to SOMs, spectral techniques for manifold learning and
dimensionality reduction were rigorously developed, with Belkin and Niyogi [1003] introducing
Laplacian Eigenmaps in 2003. This method constructs a graph Laplacian to capture local geomet-
ric properties of data manifolds and embed them into a lower-dimensional space while maintaining
neighborhood relationships. This has provided a mathematically principled foundation for spectral
clustering and nonlinear dimensionality reduction.
A more adversarial approach to generative modeling was pioneered by Goodfellow et al. [113] with the introduction of generative adversarial networks (GANs) in 2014. GANs consist of a generator
and discriminator competing in a minimax game, where the generator learns to produce realistic
samples while the discriminator improves its ability to distinguish between real and generated
data. This adversarial framework has profoundly impacted image generation, domain adapta-
tion, and semi-supervised learning by producing high-quality synthetic data. The visualization of
high-dimensional data was further enhanced by van der Maaten and Hinton in 2008 [1007] with
the development of t-distributed stochastic neighbor embedding (t-SNE). This method provides a
probabilistic approach to mapping data points into a lower-dimensional space by minimizing the
Kullback-Leibler divergence between high- and low-dimensional distributions, preserving local sim-
ilarities. t-SNE has become a widely used technique in exploratory data analysis due to its ability
to reveal meaningful structure in complex datasets.
Together, these contributions have shaped the field of unsupervised learning by introducing rigor-
ous mathematical formulations for clustering, probabilistic modeling, and representation learning.
The foundational algorithms such as k-means and EM provide essential tools for unsupervised
data analysis, while information-theoretic and spectral methods offer deeper insights into structure
preservation and feature extraction. The integration of neural networks into unsupervised learning,
through self-organizing maps, deep belief networks, and variational autoencoders, has led to a sig-
nificant expansion in the capabilities of machine learning models. The development of adversarial
training with GANs and the application of information bottleneck principles further illustrate the
growing sophistication of unsupervised learning techniques. These advancements continue to push
the boundaries of artificial intelligence, enabling models to learn rich, structured representations of
data without the need for explicit supervision.
Roweis and Saul (2000) [1008] introduced Locally Linear Embedding (LLE), a nonlinear dimension-
ality reduction algorithm designed to uncover low-dimensional structures within high-dimensional
data. The core idea behind LLE is to preserve the local geometric relationships between neigh-
boring data points while embedding them into a lower-dimensional space. Unlike linear techniques
such as Principal Component Analysis (PCA), which assume that the global structure of the data
can be well represented using a linear subspace, LLE focuses on preserving the local structure by
considering each data point and its nearest neighbors. The algorithm operates in three main steps.
First, for each data point, LLE identifies a set of nearest neighbors based on a distance metric, usu-
ally Euclidean distance. Second, it computes local reconstruction weights by expressing each data
point as a linear combination of its neighbors, minimizing the reconstruction error while enforcing
the constraint that the weights sum to one. This step ensures that the local geometric structure is
encoded in an invariant manner. Third, LLE determines a lower-dimensional embedding by finding
a new set of coordinates that best preserves the reconstruction weights obtained in the previous
step. This is achieved by minimizing a quadratic cost function that ensures the lower-dimensional
representation maintains the same local linear relationships as the original high-dimensional data.
The main advantage of LLE over traditional linear techniques is its ability to recover nonlinear
manifolds, making it particularly effective for data sets where the underlying structure is curved or
highly nonlinear. Unlike methods such as Multidimensional Scaling (MDS), which rely on pairwise
distances between all points and can be computationally expensive, LLE leverages only local neigh-
borhoods, making it more scalable to larger data sets. Additionally, since LLE does not impose
any parametric assumptions on the manifold, it is capable of adapting to a wide variety of data
distributions.
However, LLE also has some limitations. Its reliance on nearest-neighbor selection makes it sensitive
to noise and parameter choices, particularly the number of neighbors. Furthermore, the embeddings
produced by LLE are often unnormalized, meaning that distances in the lower-dimensional space
may not have a straightforward interpretation. Despite these challenges, LLE has had a significant
impact on machine learning and data visualization, providing a powerful tool for uncovering the
intrinsic structure of high-dimensional data in applications ranging from image processing to bioin-
formatics. Its introduction marked a shift toward more flexible, geometry-preserving dimensionality
reduction techniques, influencing the development of subsequent manifold learning methods such
as Isomap and t-SNE.
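For reference, the three-step procedure described above is available as scikit-learn's LocallyLinearEmbedding; a minimal usage sketch on the standard swiss-roll toy dataset follows (parameter choices are illustrative).

    import numpy as np
    from sklearn.datasets import make_swiss_roll
    from sklearn.manifold import LocallyLinearEmbedding

    X, color = make_swiss_roll(n_samples=1500, random_state=0)

    # n_neighbors controls the local patches; n_components is the target dimension.
    lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
    Z = lle.fit_transform(X)         # steps: neighbors -> weights -> embedding
    print(Z.shape, lle.reconstruction_error_)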
Bell and Sejnowski (1995) [1009] introduced Independent Component Analysis (ICA) as an information-
theoretic approach to blind source separation (BSS), which allows for the decomposition of mixed
signals into statistically independent components without prior knowledge of the mixing process.
Their work was motivated by real-world scenarios where multiple signals, such as different voices
in a crowded room or overlapping audio sources in a recording, become mixed together, making
it difficult to recover the original sources. Unlike traditional statistical methods such as Princi-
pal Component Analysis (PCA), which rely on second-order statistics and assume orthogonality
of components, ICA leverages higher-order statistical properties to separate signals that are non-
Gaussian and statistically independent. The fundamental principle behind their method is the
maximization of information transfer in a neural network, also known as the infomax principle.
This principle is based on the idea that a system should be optimized to maximize the mutual in-
formation between its input and output, ensuring that the output representation is as statistically
independent as possible. To achieve this, they formulated a neural learning rule that iteratively
adjusts the network’s weight parameters using a nonlinear function, which enhances the separation
of independent sources. The key insight was that statistical independence implies that the joint
probability distribution of the recovered signals should factorize into the product of their individual
distributions, which is not the case for mixtures of dependent signals. To enforce this independence,
their method used a contrast function derived from information theory, ensuring that the recovered
sources were as non-Gaussian as possible, since mixtures of independent sources tend to become
more Gaussian due to the Central Limit Theorem. This approach enabled the successful separation
of mixed signals in a variety of applications, including audio processing, biomedical signal analysis,
and image processing. One of the most famous demonstrations of ICA, often referred to as the
“cocktail party problem,” involves separating multiple overlapping speech signals recorded by dif-
ferent microphones. Their algorithm was able to recover individual voices from the mixed recording
with remarkable accuracy, highlighting the effectiveness of ICA in practical scenarios. Additionally,
ICA found significant applications in neuroscience, particularly in electroencephalography (EEG)
and functional magnetic resonance imaging (fMRI), where it helped isolate meaningful brain ac-
tivity patterns from background noise.
Despite its power, ICA has limitations, including its sensitivity to the choice of nonlinear func-
tions and the assumption that the number of independent sources does not exceed the number
of observed mixtures. Furthermore, ICA assumes that the sources are statistically independent,
which may not always hold in real-world data. Nonetheless, Bell and Sejnowski’s work laid the
foundation for subsequent advancements in unsupervised learning, influencing modern approaches
in deep learning, signal processing, and latent variable modeling. Their method provided a ro-
bust mathematical framework for data decomposition that has since been expanded and refined in
numerous fields, demonstrating the lasting impact of their contribution to unsupervised learning.
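A compact cocktail-party-style demonstration can be run with scikit-learn's FastICA, a later fixed-point algorithm in the same ICA family as the infomax approach discussed here; the sources and mixing matrix below are synthetic.

    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(8)
    t = np.linspace(0, 8, 2000)
    S = np.c_[np.sin(2 * t),                      # source 1: sinusoid
              np.sign(np.sin(3 * t)),             # source 2: square wave
              rng.laplace(size=t.size)]           # source 3: non-Gaussian noise
    A = np.array([[1.0, 0.5, 0.3],
                  [0.5, 1.0, 0.5],
                  [0.3, 0.5, 1.0]])
    X = S @ A.T                                   # observed mixtures

    ica = FastICA(n_components=3, random_state=0)
    S_hat = ica.fit_transform(X)                  # recovered independent components
    print(S_hat.shape)   # components match the sources up to permutation and scale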
One line of recent work integrates unsupervised and supervised learning to classify network nodes based on trustworthiness, helping identify potentially malicious nodes. Such frameworks are critical in decentralized
environments where security threats are unpredictable and dynamically evolving. The application
of unsupervised learning in this context reduces reliance on predefined attack signatures, improving
network security in real-time.
Structural health monitoring has also benefited from unsupervised learning techniques. Moustakidis
et al. (2025) [1012] proposed a deep learning autoencoder framework for fast Fourier transform
(FFT)-based clustering, designed to analyze acoustic emission data from composite materials. This
method enables automatic detection of structural damage over time, allowing for proactive main-
tenance and risk mitigation in infrastructure management. By employing autoencoders to extract
meaningful features from raw sensor data, the approach outperforms traditional inspection methods
that require extensive manual interpretation. Feature selection is another area where unsupervised
learning has made a significant impact. Liu et al. (2025) [1013] introduced an unsupervised fea-
ture selection algorithm using ℓ_{2,p}-norm feature reconstruction. This method effectively reduces
redundant features while preserving essential data structures, thereby improving the efficiency and
accuracy of clustering algorithms. High-dimensional datasets, such as those used in bioinformatics
and financial modeling, greatly benefit from this technique, as it enables better data representation
and pattern discovery without the need for labeled guidance.
In the medical and healthcare sector, unsupervised learning has been leveraged for disease risk
assessment and biomarker discovery. Zhou et al. (2025) [1014] applied clustering techniques to
analyze metabolic profiles, uncovering hidden subtypes of hypertriglyceridemia associated with
varying disease risks. This research demonstrates the potential of unsupervised learning in per-
sonalized medicine, where identifying metabolic subgroups can help tailor treatments to individual
patients rather than applying a one-size-fits-all approach. Similarly, Lin et al. (2025) [1015] em-
ployed unsupervised learning for risk control in health insurance fund management. By analyzing
claims data and identifying anomalous patterns, this approach enhances fraud detection and risk
assessment, ultimately leading to better resource allocation and cost reduction. The ability to
detect fraud without labeled examples is a significant advantage in an industry where fraudulent
activities are often sophisticated and constantly evolving.
Beyond healthcare, unsupervised learning has shown promise in improving real-world object detec-
tion systems. Huang et al. (2025) [1016] proposed a novel unsupervised domain adaptation tech-
nique to enhance open-world object detection. Unlike traditional supervised models that require
large amounts of labeled data, this approach enables object detection models to generalize across
different environments with minimal supervision. Such advancements are crucial in autonomous
navigation, surveillance, and robotics, where training data may not always be representative of
real-world conditions. In a different engineering application, Wu and Liu (2025) [1017] introduced
a VQ-VAE-2-based algorithm for detecting cracks in concrete structures. This unsupervised ap-
proach automates structural health monitoring, reducing dependence on costly manual inspections
while improving the accuracy and efficiency of damage assessments.
Natural language processing and medical imaging have also seen substantial contributions from
unsupervised learning. Nagelli and Saleena (2025) [1018] developed an aspect-based sentiment
analysis model using self-attention mechanisms. This model enables the automatic extraction of
sentiment-related features from multilingual datasets, allowing businesses to analyze customer feed-
back without requiring labeled sentiment data. Such models are essential for real-time sentiment
analysis in global markets where user-generated content is vast and diverse. Meanwhile, Ekanayake
(2025) [1019] applied deep learning techniques for MRI reconstruction and super-resolution en-
hancement. This research significantly reduces MRI scan times while preserving high image quality,
offering a transformative solution to the challenges of medical imaging. The use of unsupervised
learning in this context enables more efficient data-driven reconstruction techniques, reducing de-
pendency on expensive and time-consuming manually labeled training datasets. These diverse
applications illustrate the expanding role of unsupervised learning in solving complex real-world
problems, demonstrating its adaptability and potential to drive further innovation across multiple
industries.
One of the most rigorous formulations in unsupervised learning arises in the estimation of proba-
bility density functions, where the likelihood function of the observed data is given by
L(θ) = ∏_{i=1}^{N} p_θ(x_i). (15.1)
A canonical example is the mixture model p_θ(x) = Σ_{k=1}^{K} π_k p(x | θ_k), where π_k are the mixture weights such that Σ_{k=1}^{K} π_k = 1, and p(x | θ_k) are component distributions, often taken as Gaussian:
p(x | θ_k) = (2π)^{−d/2} |Σ_k|^{−1/2} exp( −(1/2)(x − µ_k)^T Σ_k^{−1} (x − µ_k) ). (15.6)
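The mixture log-likelihood obtained by combining (15.1) with (15.6) can be evaluated directly; the parameters below are illustrative.

    import numpy as np

    def gaussian_pdf(x, mu, Sigma):
        # Multivariate normal density of equation (15.6).
        d = mu.size
        diff = x - mu
        inv = np.linalg.inv(Sigma)
        norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
        return norm * np.exp(-0.5 * diff @ inv @ diff)

    def mixture_log_likelihood(X, pis, mus, Sigmas):
        # log L(theta) = sum_i log sum_k pi_k p(x_i | theta_k)
        ll = 0.0
        for x in X:
            ll += np.log(sum(p * gaussian_pdf(x, m, S)
                             for p, m, S in zip(pis, mus, Sigmas)))
        return ll

    rng = np.random.default_rng(9)
    X = rng.normal(size=(100, 2))
    pis = [0.6, 0.4]
    mus = [np.zeros(2), np.ones(2)]
    Sigmas = [np.eye(2), 0.5 * np.eye(2)]
    print(mixture_log_likelihood(X, pis, mus, Sigmas))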
Dimensionality reduction techniques such as Principal Component Analysis (PCA) seek to find
a lower-dimensional representation that maximizes variance. Given a centered data matrix X ∈
RN ×d , PCA finds the eigenvectors of the covariance matrix
C = (1/N) X^T X. (15.7)
Cv = λv. (15.8)
Only the top k eigenvectors corresponding to the largest eigenvalues are retained, reducing the data
to a lower-dimensional subspace. The transformation to the new basis is given by
zi = V T xi , (15.9)
where V is the matrix of top eigenvectors. More advanced methods such as autoencoders learn a mapping from input data x to a latent representation z using an encoder function z = f_enc(x); variational autoencoders additionally regularize the latent space with a penalty D_KL(q(z | x) ∥ p(z)), where D_KL is the Kullback-Leibler divergence. Generative models such as Generative Adversarial
Networks (GANs) learn a mapping from a simple latent space z ∼ p(z) to the data distribution
via a generator function G : Rk → Rd , while an adversarial discriminator D learns to distinguish
between real and generated samples. The objective function of a GAN is the minimax game min_G max_D E_{x∼p_data}[log D(x)] + E_{z∼p(z)}[log(1 − D(G(z)))].
One of the key extensions of the IB method was its application to continuous-valued variables,
particularly those following a Gaussian distribution. Chechik, Globerson, Tishby, and Weiss (2005)
[1033] rigorously examined the IB method in the context of jointly Gaussian random variables,
deriving analytical solutions that revealed the intrinsic connections between the IB formulation
and classical techniques such as canonical correlation analysis. Their results demonstrated that the
optimal representation in the Gaussian case can be expressed in terms of principal component anal-
ysis (PCA)-like transformations, thereby bridging the IB method with traditional dimensionality
reduction approaches. This work has had significant implications for signal processing and statis-
tical learning, as it provides a mathematically rigorous foundation for applying the IB principle
to real-world datasets that exhibit Gaussian-like characteristics. In a related extension, Chechik
and Tishby (2002) [1034] introduced a variant of the IB framework that incorporates side infor-
mation, allowing for the extraction of representations that retain information about one target
variable while being uninformative about another. This formulation has proven particularly use-
ful in privacy-preserving machine learning and fairness-aware representations, where one seeks to
ensure that a learned representation encodes task-relevant information while obfuscating sensitive
attributes.
In the context of deep learning, Tishby and Zaslavsky (2015) [1035] proposed that the training
process of deep neural networks (DNNs) can be interpreted through the lens of the IB principle.
They hypothesized that DNNs undergo two distinct phases during training: an initial "fitting"
phase where the mutual information between successive layers and the target variable Y increases,
followed by a "compression" phase where the network progressively reduces the mutual informa-
tion between intermediate layers and the input variable X. This perspective provides a theoretical
justification for why deep networks generalize well despite their over-parameterization, as the com-
pression phase effectively filters out irrelevant variations in the input while preserving task-relevant
features. However, Saxe et al. (2019) [1036] critically examined this hypothesis and argued that
the presence of the compression phase is highly dependent on architectural choices, particularly
the activation functions used in deep networks. Their empirical findings demonstrated that the
IB theory does not universally apply to all neural network architectures, suggesting that while
information-theoretic compression may play a role in some settings, it is not a fundamental prin-
ciple governing the training dynamics of all deep networks.
Further empirical support for the IB perspective in deep learning was provided by Shwartz-Ziv
and Tishby (2017) [1037], who conducted a detailed analysis of information flow in deep neural
networks. Using information plane visualizations, they showed that the training process of DNNs
aligns with the two-phase description proposed in earlier work. Their findings provided strong
evidence for the compression phenomenon in networks trained with stochastic gradient descent
(SGD), reinforcing the argument that deep networks learn compact representations of the input.
However, a major challenge in applying the IB framework to high-dimensional data is the difficulty
of accurately estimating mutual information, especially in deep networks where the latent repre-
sentations are highly non-Gaussian. To address this issue, Noshad et al. (2019) [1038] introduced
a novel mutual information estimator based on dependence graphs, enabling more scalable and
robust estimations of information flow in deep learning models. Their approach has opened new
avenues for applying the IB principle to real-world machine learning problems where traditional
mutual information estimators fail due to the curse of dimensionality.
The theoretical implications of information bottleneck in neural networks were further explored
by Goldfeld et al. (2018) [1039], who developed refined techniques for estimating mutual infor-
mation in deep networks. Their work provided new tools for analyzing the role of compression in
neural representations, offering rigorous mathematical justifications for why certain layers in deep
networks tend to exhibit strong compression effects. More recently, Geiger (2021) [1040] presented
a comprehensive review of information plane analyses in neural network classifiers, evaluating the
strengths and limitations of the IB framework in this setting. His analysis raised critical questions
about the general applicability of IB-based insights in deep learning, highlighting cases where the
information plane approach fails to provide accurate characterizations of training dynamics. Build-
ing on these theoretical developments, Kawaguchi, Deng, Ji, and Huang (2023) [1041] rigorously
analyzed the generalization properties of neural networks under the IB framework, providing a
mathematical link between information compression and generalization error bounds. Their re-
sults establish that controlling the information bottleneck can serve as a regularization mechanism,
leading to improved generalization performance in deep learning models. Collectively, these contri-
butions underscore the profound impact of the IB principle across a wide range of disciplines, from
statistical signal processing to modern artificial intelligence.
Yang et al. (2025) [1045] introduced a cognitive-load-aware activation mechanism for large lan-
guage models (LLMs), significantly improving efficiency by dynamically activating only the neces-
sary model parameters. Their study leveraged IB principles to ensure that LLMs retain only the
most relevant contextual representations while discarding redundant computations, reducing com-
putational overhead without sacrificing accuracy. Similarly, Liu et al. (2025) [1046] incorporated
IB principles in their Vision Mamba network for crack segmentation in infrastructure, designing
a structure-aware model that efficiently filters out redundant spatial information. By applying IB
techniques, they achieved enhanced computational efficiency and superior segmentation accuracy,
which is crucial for real-time applications in structural health monitoring. Stierle and Valtere
(2025) [1047] took a different approach by applying IB theory to medical innovation, examining
how bottlenecks in information access within regulatory and patent frameworks slow down gene
therapy advancements. Their work provided a comprehensive analysis of how information bottle-
necks in medical research and policy impede technological progress, emphasizing the necessity of
optimized regulatory frameworks.
Another significant study by Chen et al. (2025) [1048] applied IB concepts to quantum computing,
particularly in optimizing construction supply chains. Their work demonstrated that quantum
models, when integrated with IB techniques, could efficiently compress relevant data while filtering
out extraneous information, leading to improved decision-making and enhanced scheduling flexibil-
ity. Yuan et al. (2025) [1049] extended IB applications to plant metabolomics, where they proposed
a novel feature selection approach to retain highly informative metabolite interactions while dis-
carding non-essential data. This method improved interpretability in plant metabolic studies,
allowing researchers to better understand metabolite interactions without being overwhelmed by
excessive data complexity. In a related domain, Dey et al. (2025) [1050] utilized IB principles in
spatio-temporal prediction models for NDVI (Normalized Difference Vegetation Index), a critical
measure for rice crop yield forecasting. Their IB-augmented neural network significantly enhanced
prediction performance by filtering out irrelevant environmental variables while maintaining the
most crucial features for accurate yield assessment.
Further expanding IB applications, Li (2025) [1051] employed IB principles in robotic path planning
by developing an optimized method for navigation path extraction in mobile robots. Their approach
utilized IB constraints to eliminate irrelevant environmental noise while preserving the most cru-
cial navigational data, thereby improving the efficiency and reliability of robotic movement. This
research has significant implications for autonomous navigation systems, where maintaining a com-
pact yet informative representation of the surrounding environment is essential. Finally, Krinner et
al. (2025) [1043] extended IB applications to reinforcement learning, emphasizing the importance of
information retention in decision-making models. Their proposed IB-based reinforcement learning
framework demonstrated superior generalization capabilities compared to traditional approaches,
as it effectively retained task-relevant information while discarding unnecessary complexity. This
improvement in learning efficiency has the potential to enhance autonomous systems across various
applications, including robotics, finance, and large-scale industrial automation.
Overall, these studies underscore the broad applicability and impact of the IB method in diverse
research areas. From adversarial robustness and reinforcement learning to plant metabolomics and
robotic navigation, the IB principle continues to provide a powerful framework for optimizing in-
formation processing across various domains. The ability of IB-based models to extract the most
salient features while eliminating redundancies has been instrumental in enhancing computational
efficiency and decision-making accuracy. Future research in IB methodologies is likely to further
refine and extend its applications, unlocking new possibilities in artificial intelligence, scientific
research, and complex system optimization.
The IB principle compresses X into a representation T that remains predictive of Y by minimizing the Lagrangian L_IB = I(X; T) − β I(T; Y) over the encoding p(t|x), where:
• I(X; T ) is the mutual information between the input X and the compressed representation
T , which measures the amount of information retained about X in T .
• I(T ; Y ) is the mutual information between T and the target variable Y , ensuring that the
compressed representation remains useful for predicting Y .
• β is a Lagrange multiplier that controls the trade-off between compression and prediction
accuracy.
These mutual informations are
I(X; T) = Σ_{x,t} p(x, t) log [ p(t | x) / p(t) ] (15.16)
and
I(T; Y) = Σ_{t,y} p(t, y) log [ p(t, y) / (p(t) p(y)) ]. (15.17)
These quantities describe the dependencies between variables, and minimizing I(X; T ) ensures
maximal compression while maximizing I(T ; Y ) preserves relevant predictive information. To find
the optimal encoding distribution p(t|x), we introduce a variational formulation using Lagrange
multipliers, leading to the following self-consistent equations:
p(t | x) = [ p(t) / Z(x, β) ] exp( −β D_KL(p(y | x) ∥ p(y | t)) ), (15.18)
where:
• D_KL(p(y|x) ∥ p(y|t)) is the Kullback-Leibler (KL) divergence between the posterior distributions p(y|x) and p(y|t), ensuring that T retains relevant information about Y.
• Z(x, β) is the partition function normalizing p(t|x).
Combining (15.18) with the marginal self-consistency conditions
p(t) = Σ_x p(x) p(t|x), (15.19)
p(y|t) = Σ_x p(y|x) p(x|t), (15.20)
we can numerically solve for p(t|x) in an iterative fashion until convergence. A crucial aspect of
the IB method is the information plane, where solutions are analyzed in terms of the trade-off
between I(X; T) and I(T; Y). The optimal trade-off curve is derived by maximizing I(T; Y) at each admissible level of compression I(X; T), which provides a Pareto-optimal frontier representing the most efficient trade-offs between compression and predictive power. For multivariate Gaussian variables X and Y, where (X, Y) follows
a joint Gaussian distribution:
(X, Y)^T ∼ N( (µ_X, µ_Y)^T, [ Σ_XX Σ_XY ; Σ_YX Σ_YY ] ), (15.22)
the IB solution can be expressed explicitly in terms of covariance matrices. The optimal bottleneck
variable T satisfies:
Σ_TT = Σ_XX − Σ_XY Σ_YY^{−1} Σ_YX, (15.23)
where the optimal compression ratio is determined by the eigenvalues of the information-preserving
covariance transformation. In modern applications, the IB principle has been extensively applied
to deep neural networks (DNNs), where hidden layer representations T are trained to maximize
information retention about the target Y . The information-theoretic loss function:
is used in variational autoencoders (VAEs) and other deep learning models to enforce minimal
sufficient representations.
In conclusion, the Information Bottleneck method provides a rigorous and mathematically prin-
cipled approach to optimal data compression with maximal information retention. It is deeply
rooted in Shannon’s information theory and has extensive applications in signal processing, neural
networks, and statistical learning. Its iterative updates, self-consistency equations, and variational
derivations make it a powerful tool for understanding fundamental limits in machine learning and
information processing.
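For discrete variables, the self-consistent equations (15.18)-(15.20) can be iterated directly to a fixed point. A minimal sketch on a random joint distribution p(x, y) follows; the cardinalities, β, and iteration count are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(10)
    nx, ny, nt, beta = 8, 4, 3, 5.0
    p_xy = rng.random((nx, ny)); p_xy /= p_xy.sum()
    p_x = p_xy.sum(1)
    p_y_x = p_xy / p_x[:, None]                  # conditional p(y|x)

    p_t_x = rng.random((nx, nt)); p_t_x /= p_t_x.sum(1, keepdims=True)
    for _ in range(200):
        p_t = p_x @ p_t_x                                        # (15.19)
        # p(y|t) = sum_x p(y|x) p(x|t), with p(x|t) = p(t|x) p(x) / p(t)
        p_x_t = (p_t_x * p_x[:, None]) / p_t[None, :]
        p_y_t = p_x_t.T @ p_y_x                                  # (15.20)
        # KL(p(y|x) || p(y|t)) for every (x, t) pair
        kl = (p_y_x[:, None, :] *
              np.log(p_y_x[:, None, :] / p_y_t[None, :, :])).sum(-1)
        p_t_x = p_t[None, :] * np.exp(-beta * kl)                # (15.18)
        p_t_x /= p_t_x.sum(1, keepdims=True)                     # Z(x, beta)
    print(p_t_x.round(3))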
One of the critical challenges in training RBMs is the efficient computation of gradients for weight
updates, given the intractability of exact maximum likelihood estimation. To address this issue,
Carreira-Perpiñán and Hinton (2005)[1053] analyzed the Contrastive Divergence (CD) algorithm, a
popular method for approximating the likelihood gradient. Their study provided a rigorous exam-
ination of the convergence properties of CD and highlighted its strengths and limitations. While
CD allows for fast and efficient training of RBMs, they showed that it does not always lead to
unbiased estimates of the likelihood gradient, which can impact the learned representations. Hin-
ton (2012)[1054] later expanded on these ideas by providing a practical guide to training RBMs,
detailing essential hyperparameter selection strategies, initialization techniques, and empirical best
practices. His work served as an invaluable resource for researchers and practitioners aiming to im-
plement RBMs effectively, covering both theoretical and experimental considerations. In a broader
context, Fischer and Igel (2014)[1055] presented a comprehensive introduction to the training and
theoretical underpinnings of RBMs, consolidating knowledge from previous research into a struc-
tured and accessible form. Their work not only explained the fundamental mechanics of RBMs but
also explored various extensions and applications, making it an essential reference for those seeking
to understand both the theoretical and applied aspects of RBMs.
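A minimal sketch of a single CD-1 update for a Bernoulli-Bernoulli RBM follows; the sizes and learning rate are illustrative, and practical refinements (minibatches, momentum, persistent chains) discussed by Hinton (2012) are omitted.

    import numpy as np

    rng = np.random.default_rng(11)
    nv, nh, lr = 6, 3, 0.1
    W = 0.01 * rng.normal(size=(nv, nh))
    b, c = np.zeros(nv), np.zeros(nh)          # visible / hidden biases

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cd1_update(v0):
        global W, b, c
        # Positive phase: hidden activations given the data.
        ph0 = sigmoid(v0 @ W + c)
        h0 = (rng.random(nh) < ph0).astype(float)
        # Negative phase: one Gibbs step (reconstruct v, then h).
        pv1 = sigmoid(h0 @ W.T + b)
        v1 = (rng.random(nv) < pv1).astype(float)
        ph1 = sigmoid(v1 @ W + c)
        # CD-1 gradient approximation: <v h>_data - <v h>_reconstruction.
        W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
        b += lr * (v0 - v1)
        c += lr * (ph0 - ph1)

    for _ in range(1000):
        cd1_update((rng.random(nv) < 0.5).astype(float))  # random binary "data"
    print(W.round(2))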
Another significant direction in RBM research involved understanding their effectiveness in un-
supervised feature learning. Coates, Lee, and Ng (2011)[1058] conducted an extensive analysis of
single-layer neural networks, including RBMs, to assess their ability to learn meaningful represen-
tations from raw data. Their findings highlighted the potential of RBMs in learning hierarchical
feature representations and provided empirical evidence that RBMs could achieve competitive per-
formance compared to other feature-learning approaches. This study influenced subsequent research
in deep learning by emphasizing the importance of structured feature extraction from data. In a
different vein, Hinton and Salakhutdinov (2009)[1059] proposed the Replicated Softmax model, an
extension of RBMs designed to model word counts in documents. Their work bridged the gap
between RBMs and topic models, enabling RBMs to be applied to natural language processing
tasks. This development demonstrated the flexibility of RBMs in handling different types of data
distributions, further expanding their applicability beyond conventional structured data.
In recent years, RBM research has intersected with advancements in quantum computing, as
explored by Adachi and Henderson (2015)[1060], who investigated the application of quantum
annealing to the training of deep neural networks. Their study explored how quantum hardware
could potentially accelerate the learning process in RBMs by leveraging quantum parallelism. This
work opened new avenues for research at the intersection of quantum computing and machine
learning, suggesting that RBMs could benefit from novel optimization techniques unavailable to
classical computing paradigms. Overall, the contributions of these works collectively demonstrate
the breadth of RBM research, from theoretical foundations and training methodologies to applica-
tions in diverse domains such as recommender systems, natural language processing, and quantum
machine learning. The progression of RBMs from their early formulations to their modern applica-
tions underscores their significance as a foundational tool in the development of deep learning and
probabilistic modeling frameworks.
Building upon the foundational aspects of RBMs, Joudaki (2025) [1062] conducted a compre-
hensive review of their applications in human action recognition, particularly in combination with
Deep Belief Networks (DBNs). This study systematically categorizes existing research and iden-
tifies key challenges, such as overfitting and slow convergence, while also discussing how RBMs
facilitate hierarchical feature extraction. A complementary perspective is provided by Prat Pou,
Romero, Martí, and Mazzanti (2025) [1063], who focus on improving the computational efficiency
of RBMs in evaluating partition functions using annealed importance sampling. Their work is par-
ticularly relevant in statistical physics, where accurate estimation of partition functions is crucial
for modeling spin systems and understanding phase transitions. They demonstrate that their pro-
posed initialization method enhances the robustness of the sampling process, significantly reducing
variance in the estimated probabilities.
Further theoretical advancements in RBMs were made by Decelle, Gómez, and Seoane (2025)
[1064], who investigated the ability of RBMs to infer high-order dependencies in complex systems.
Their work presents a framework for mapping RBMs onto higher-order interactions, particularly
in domains such as protein interaction networks and spin glasses, where conventional machine
learning models struggle to capture intricate relationships. In a related study, Savitha, Kannan,
and Logeswaran (2025) [1065] integrate RBMs within DBNs for cardiovascular disease prediction,
leveraging optimization techniques such as the Harris Hawks Search algorithm. Their research
emphasizes the role of RBMs in medical diagnosis, showing that the extracted features from unsu-
pervised pretraining significantly improve classification accuracy in deep learning pipelines. These
contributions highlight the versatility of RBMs in both theoretical modeling and real-world appli-
cations.
Efforts to enhance the efficiency of RBM training and sampling have been explored by Béreux,
Decelle, and Furtlehner (2025) [1066], who propose a novel training strategy that accelerates con-
vergence without compromising generalization performance. Their method is particularly effective
for large-scale learning problems where traditional contrastive divergence methods are computation-
ally expensive. Thériault et al. (2024) [1067] further examine the structured learning properties
of RBMs in a teacher-student setting, demonstrating that incorporating structured priors enhances
the network’s ability to generalize beyond seen data. These studies collectively address the compu-
tational bottlenecks associated with RBM training, making them more viable for practical machine
learning applications.
Another notable application of RBMs is in feature learning for non-traditional data types. Man-
imurugan, Karthikeyan, and Narmatha (2024) [1068] introduce a hybrid approach that combines
Bi-LSTM networks with RBMs for underwater object detection, demonstrating how RBMs effec-
tively capture spatial dependencies in sonar and optical imagery. Similarly, Hossain, Showkat Ara,
and Han (2025) [1069] benchmark RBMs against classical and deep learning models for human
activity recognition, finding that RBMs provide a unique advantage in extracting latent repre-
sentations. Extending RBM applications to neuromorphic computing, Qin, Peng, Miao, Chen,
Ouyang, and Yang (2025) [1070] integrate RBMs with magnetic tunnel junctions for enhanced
magnetic anomaly detection. This interdisciplinary work bridges the gap between neuromorphic
architectures and probabilistic learning models, demonstrating a path towards more energy-efficient
and compact AI systems.
Collectively, these studies highlight the ongoing evolution of RBMs, from fundamental theoreti-
cal advances to diverse applications in physics, medicine, security, and beyond. The research spans
improvements in training efficiency, inference capabilities, and integration with other deep learn-
ing techniques, demonstrating the continued relevance of RBMs in modern AI research. While
challenges such as mode collapse and slow training persist, the integration of RBMs with quan-
tum computing, neuromorphic architectures, and hybrid deep learning models presents promising
directions for future work.
The energy of a joint visible-hidden configuration is defined as:

E(v, h) = -\sum_i a_i v_i - \sum_j b_j h_j - \sum_{i,j} v_i W_{ij} h_j \quad (15.25)

where v_i represents the state of the i-th visible unit, h_j represents the state of the j-th hidden unit, W_{ij} is the weight connecting visible unit i to hidden unit j, and a_i and b_j are the biases of the visible and hidden units, respectively. The probability distribution of a visible-hidden configuration (v, h) is governed by the Boltzmann distribution, given by:
P(v, h) = \frac{e^{-E(v,h)}}{Z} \quad (15.26)
where Z is the partition function ensuring normalization:
Z = \sum_{v,h} e^{-E(v,h)} \quad (15.27)
Since the hidden units are conditionally independent given the visible units, the condi-
tional probability distribution of a hidden unit given the visible units follows a sigmoid activation
function:

P(h_j = 1 \mid v) = \sigma\!\left(b_j + \sum_i W_{ij} v_i\right) = \frac{1}{1 + e^{-(b_j + \sum_i W_{ij} v_i)}} \quad (15.28)
Similarly, the probability of a visible unit given the hidden units is:
P(v_i = 1 \mid h) = \sigma\!\left(a_i + \sum_j W_{ij} h_j\right) = \frac{1}{1 + e^{-(a_i + \sum_j W_{ij} h_j)}} \quad (15.29)
The marginal probability of a visible vector v is obtained by summing over all possible hidden states:

P(v) = \sum_h P(v, h) = \frac{1}{Z} \sum_h e^{-E(v,h)} \quad (15.30)
By factoring out terms dependent only on v, we define the free energy function as:

F(v) = -\sum_i a_i v_i - \sum_j \log\left(1 + e^{b_j + \sum_i W_{ij} v_i}\right) \quad (15.31)
In terms of the free energy, the log-likelihood of a visible vector is \log P(v) = -F(v) - \log Z (15.32), and for a training set \{v^{(n)}\}_{n=1}^{N} the likelihood function is \mathcal{L} = \sum_{n=1}^{N} \log P(v^{(n)}) (15.33). The gradient of this likelihood function with respect to the weight matrix W_{ij} is given by:

\frac{\partial \mathcal{L}}{\partial W_{ij}} = \mathbb{E}_{\text{data}}[v_i h_j] - \mathbb{E}_{\text{model}}[v_i h_j] \quad (15.34)
where Edata [vi hj ] is the expectation over the training data, Emodel [vi hj ] is the expectation under the
model distribution. The weight updates are then performed using:
\Delta W_{ij} = \eta \left( \mathbb{E}_{\text{data}}[v_i h_j] - \mathbb{E}_{\text{model}}[v_i h_j] \right) \quad (15.35)
RBMs use stochastic gradient descent (SGD) for optimization. The weight updates in SGD
follow:
W_{ij}^{(t+1)} = W_{ij}^{(t)} + \eta \left( v_i^{\text{data}} h_j^{\text{data}} - v_i^{\text{model}} h_j^{\text{model}} \right) \quad (15.36)
RBMs serve as the building blocks for Deep Belief Networks (DBNs), where multiple RBMs
are stacked to form deep architectures. Each layer is pre-trained in an unsupervised manner
using contrastive divergence before fine-tuning using backpropagation.
[Figure: A restricted Boltzmann machine with three visible units and four hidden units; bias units omitted.]
Deep Belief Networks (DBNs) have played a transformative role in deep learning by enabling effi-
cient unsupervised pre-training and hierarchical feature extraction. One of the foundational works
in this area is by Hinton et al. (2006) [828], who introduced a fast learning algorithm for DBNs that
leverages a greedy layer-wise pre-training strategy using Restricted Boltzmann Machines (RBMs).
This work addressed the long-standing vanishing gradient problem in deep networks by initializing
weights in a way that preserves meaningful feature representations before fine-tuning with super-
vised learning. Their contribution established DBNs as a key component of early deep learning
architectures and laid the groundwork for further explorations into deep generative models. Lee et
al. (2009) [1071] extended the standard DBN framework by incorporating convolutional structures,
leading to the development of Convolutional Deep Belief Networks (CDBNs). By introducing local
receptive fields and weight sharing, CDBNs enabled the automatic discovery of spatial hierarchies
in data, making them particularly suitable for image and speech processing applications. This
work demonstrated that DBNs could be adapted to structured data representations, enhancing
their scalability and generalization capability.
In the domain of speech recognition, Mohamed et al. (2012) [1072] pioneered the application
of DBNs for acoustic modeling, demonstrating that these networks could effectively capture com-
plex audio patterns. Their work provided empirical evidence that DBN-based models significantly
outperformed traditional Gaussian Mixture Models (GMMs) when used in conjunction with Hid-
den Markov Models (HMMs) for automatic speech recognition. This breakthrough accelerated the
adoption of deep learning in the field of speech processing and motivated subsequent research into
deep neural network-based acoustic modeling. Similarly, Zhang and Zhao (2017) [1074] explored the
use of DBNs for fault diagnosis in chemical processes, highlighting their ability to model intricate
dependencies within multivariate industrial datasets. Their work demonstrated that DBNs could
learn compact feature representations from noisy sensor data, leading to improved fault detection
accuracy. By applying DBNs to real-world industrial systems, they provided compelling evidence
of their practical utility in process monitoring and control.
Further expanding the use of DBNs in fault diagnostics, Peng et al. (2019) [1073] introduced
a health indicator construction framework based on DBNs for bearing fault diagnosis. Their ap-
proach enabled the automatic extraction of degradation features from vibration signals, allowing
for early detection of mechanical failures. The health indicator developed in their study demon-
strated superior predictive performance compared to traditional statistical methods, underscoring
the power of DBNs in prognostics and health management applications. Zhang et al. (2018) [1076]
extended the use of DBNs to the medical field by integrating them with feature selection and extrac-
tion methods for predicting clinical outcomes in lung cancer patients. Their study illustrated that
DBNs could capture complex interactions between clinical variables, improving the interpretability
and accuracy of predictive models in oncology. By leveraging deep learning for medical prognosis,
this work demonstrated the potential of DBNs to advance personalized healthcare and decision
support systems.
Zhong et al. (2017) [1078] examined the problem of image classification with limited training
data and demonstrated that DBNs could learn meaningful feature representations even in data-
scarce settings. Their work emphasized the ability of DBNs to leverage unsupervised pre-training,
enabling them to generalize well even when labeled data is insufficient. This contribution reinforced
the importance of DBNs in applications where obtaining large-scale labeled datasets is challeng-
ing. Financial time series analysis is another domain where DBNs have demonstrated utility. Liu
(2018) [1075] proposed a hybrid prediction model that combined DBNs with the Autoregressive
Integrated Moving Average (ARIMA) model for stock trend forecasting. Their approach leveraged
the pattern recognition capabilities of DBNs along with the time-series forecasting strengths of
ARIMA, resulting in improved predictive performance for financial markets. This work showcased
the effectiveness of deep learning in modeling complex temporal dependencies and provided insights
into how hybrid models could enhance financial decision-making. Finally, Hoang and Kang (2018)
[1077] developed a novel fault diagnosis framework by integrating DBNs with Dempster–Shafer ev-
idence theory. Their study highlighted how DBNs could serve as a powerful feature extractor while
Dempster–Shafer theory facilitated the fusion of information from multiple sources, enhancing fault
detection accuracy. This interdisciplinary approach demonstrated the potential of combining deep
learning with probabilistic reasoning for more reliable fault diagnosis in engineering systems.
Collectively, these studies illustrate the broad impact of DBNs across multiple domains, including
speech recognition, industrial fault diagnosis, medical prognosis, uncertainty quantification, and fi-
nancial prediction. The ability of DBNs to learn hierarchical representations from high-dimensional
data has established them as a powerful tool for tackling complex machine learning problems. The
advancements made in adapting DBNs to convolutional architectures, hybrid models, and ensem-
ble methods have further expanded their applicability, ensuring their continued relevance in deep
learning research. As interest in deep generative models and representation learning grows, DBNs
continue to serve as a foundational framework for understanding and developing more sophisticated
deep learning architectures.
Recent studies have extended DBNs to domains ranging from human activity recognition and medical diagnosis to cybersecurity and agricultural applications.
Joudaki (2025) [1062] provides an extensive literature review on the theoretical underpinnings of
DBNs and RBMs, emphasizing their role in human action recognition. The study highlights how
DBNs are particularly suited for tasks requiring high-dimensional feature extraction and learn-
ing temporal dependencies, outperforming traditional machine learning models. The hierarchical
representation learned by DBNs allows them to capture complex patterns in human gestures and
postures, making them ideal for applications such as motion tracking and gesture-based interface
design.
Alzughaibi (2025) [1079] presents an innovative application of DBNs in pest detection, where they
are integrated with a modified artificial hummingbird algorithm. The research demonstrates how
deep learning-based pattern recognition models can significantly enhance the accuracy of pest clas-
sification in agricultural settings. The model is trained on large datasets of pest images, leveraging
the hierarchical feature extraction capabilities of DBNs to distinguish between different species
effectively. Similarly, Savitha et al. (2025) [1065] apply DBNs to cardiovascular disease prediction
by integrating them with the Harris Hawks Search optimization algorithm. This approach opti-
mizes feature selection and classification accuracy, showing that DBNs can be successfully adapted
for medical diagnosis by leveraging their deep hierarchical structure. The study underscores the
potential of DBNs in clinical decision support systems, where accurate and timely diagnoses are
crucial.
The intrinsic hierarchical structure of DBNs has also been studied by Tausani et al. (2025) [1080],
who investigate their top-down inference capabilities compared to other deep generative models.
Their research explores how DBNs can simulate human-like cognition and learning, making them
suitable for applications in artificial intelligence and cognitive computing. By analyzing the internal
feature representations of DBNs, the study provides insights into their interpretability and efficiency
in generative tasks. Kumar and Ravi (2025) [1081] further contribute to the field by introducing
XDATE, an explainable deep learning framework that combines DBNs with auto-encoders. The
study addresses the issue of interpretability in deep learning by employing the Garson Algorithm
to improve feature attribution, demonstrating that DBNs can achieve a balance between accuracy
and explainability in classification tasks.
In the field of medical image analysis, Alhajlah (2024) [1082] applies DBNs for automated le-
sion detection in gastrointestinal endoscopic images. The research integrates DBNs with a genetic
algorithm-based segmentation technique, significantly improving diagnostic precision. By lever-
aging the feature extraction capabilities of DBNs, the proposed system outperforms conventional
image processing techniques, reducing false positives and improving lesion classification accuracy.
Hossain et al. (2025) [1069] evaluate the performance of DBNs in human activity recognition,
benchmarking them against various classical and deep learning models. Their study finds that
DBNs, despite being unsupervised in their initial training phase, can achieve competitive accuracy
on small and medium-sized datasets, demonstrating their robustness and adaptability.
The application of DBNs extends to cybersecurity, where Pavithra et al. (2025) [1083] develop
a hybrid RNN-DBN model for detecting IoT attacks. This approach captures temporal depen-
dencies in network traffic, allowing for effective anomaly detection and threat mitigation. The
fusion of DBNs with recurrent networks enables better sequence modeling, making them highly
effective in cybersecurity applications. In a similar vein, Bhadane and Verma (2024) [1084] explore
the role of DBNs in personality trait classification, comparing their performance with CNNs and
RNNs. Their study highlights the advantages of DBNs in handling high-dimensional psychologi-
cal datasets, where deep hierarchical structures enable the capture of complex personality-related
patterns. Lastly, Keivanimehr and Akbari (2025) [1085] examine how DBNs can be applied in
edge computing for cardiovascular disease monitoring. Their research discusses the feasibility of
deploying DBN-based models in TinyML environments, where computational efficiency and real-
time processing are critical.
Collectively, these studies demonstrate the versatility and efficacy of DBNs in various domains,
from healthcare and cybersecurity to agricultural automation and cognitive computing. The hier-
archical feature learning capabilities of DBNs, combined with their ability to leverage unsupervised
pre-training, make them suitable for a wide range of complex tasks. Future research is likely to ex-
plore further hybrid architectures that integrate DBNs with more advanced deep learning models,
enhancing their adaptability and efficiency in real-world applications.
Let x ∈ Rd denote the input data vector, where d is the dimensionality of the input space. A
DBN consists of multiple layers of hidden variables h(l) , with undirected connections between
the top two layers and directed top-down connections in the lower layers. The network defines a joint probability
distribution over the visible units x and hidden units h(1) , h(2) , . . . , h(L) as follows:
P(x, h^{(1)}, h^{(2)}, \ldots, h^{(L)}) = P(x \mid h^{(1)})\, P(h^{(1)} \mid h^{(2)}) \cdots P(h^{(L-1)} \mid h^{(L)})\, P(h^{(L)}) \quad (15.37)
where:
1. P(x | h^{(1)}) represents the conditional distribution of the visible layer given the first hidden layer;
2. P(h^{(l)} | h^{(l+1)}) denotes the top-down conditional distribution between successive hidden layers; and
3. P(h^{(L)}) is the prior over the top hidden layer, modeled by an RBM.
Each RBM consists of a layer of visible units v and a layer of hidden units h, with energy function given by:

E(v, h) = -\sum_i b_i v_i - \sum_j c_j h_j - \sum_{i,j} v_i W_{ij} h_j \quad (15.38)

where v \in \{0,1\}^d are the binary visible units, h \in \{0,1\}^m are the binary hidden units, W_{ij} is the weight matrix connecting visible and hidden units, and b_i and c_j are the bias terms for visible and hidden units, respectively. The probability distribution over (v, h) is given by the Boltzmann distribution:

P(v, h) = \frac{1}{Z} e^{-E(v,h)} \quad (15.39)
[Figure: Diagrammatic representation of a deep belief network; arrows indicate directed connections in the corresponding graphical model.]
Since exact inference in RBMs is intractable due to the exponential summation in Z, we use
Contrastive Divergence (CD) to approximate the gradient:
\frac{\partial \log P(v)}{\partial W_{ij}} \approx \mathbb{E}_{\text{data}}[v_i h_j] - \mathbb{E}_{\text{model}}[v_i h_j] \quad (15.41)
Layer-wise Pretraining of DBNs
DBNs are trained using a layer-wise greedy learning algorithm, which first trains each RBM
independently and then stacks them hierarchically. Given an input dataset \{x_n\}_{n=1}^{N}, the pretraining procedure follows:
1. Train the first RBM with input x to obtain hidden activations:

P(h_j^{(1)} = 1 \mid x) = \sigma\!\left(\sum_i W_{ij}^{(1)} x_i + c_j^{(1)}\right) \quad (15.42)

2. Treat the resulting hidden activations h^{(1)} as the input data for the next RBM, and repeat the procedure layer by layer up to h^{(L)}.
Once pretraining is complete, DBNs can be fine-tuned using backpropagation if labeled data
is available. A cost function such as cross-entropy loss is used:
\mathcal{L} = -\sum_{n=1}^{N} \sum_{k=1}^{K} y_k^{(n)} \log \hat{y}_k^{(n)} \quad (15.44)
Deep Belief Networks (DBNs) serve as powerful generative models capable of capturing hi-
erarchical representations of data through layer-wise pretraining of RBMs.
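The greedy layer-wise procedure can be summarized in a short sketch that stacks RBMs trained with the CD-1 helper from the RBM section above. The function name pretrain_dbn and the layer sizes are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def pretrain_dbn(X, layer_sizes, epochs=10, lr=0.1, rng=np.random.default_rng(0)):
    """Greedy layer-wise pretraining of a DBN as a stack of RBMs (sketch).

    X: (n_samples, d) binary data; layer_sizes: hidden-layer sizes, e.g. [256, 128].
    Returns one (W, a, b) parameter triple per RBM."""
    params, layer_input = [], X
    for m in layer_sizes:
        d = layer_input.shape[1]
        W = 0.01 * rng.standard_normal((d, m))
        a, b = np.zeros(d), np.zeros(m)
        for _ in range(epochs):
            # CD-1 step, reusing the cd1_update sketch from the RBM section
            W, a, b = cd1_update(W, a, b, layer_input, lr=lr)
        params.append((W, a, b))
        # Mean hidden activations become the next layer's "data", as in eq. (15.42)
        layer_input = 1.0 / (1.0 + np.exp(-(layer_input @ W + b)))
    return params
```

After pretraining, the stacked weights would typically initialize a feedforward network that is fine-tuned with backpropagation on the cross-entropy loss (15.44).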
Several studies have analyzed and improved the robustness of t-SNE, particularly concerning its
application to large datasets and biological data visualization. Kobak and Berens (2019) [1086]
addressed the challenges in single-cell transcriptomics by developing “opt-SNE,” an automated
parameter selection framework that fine-tunes the perplexity and learning rate to enhance visual-
ization reliability. Their study demonstrated that proper parameter tuning significantly impacts
the quality of embeddings, thereby reducing the risk of misleading interpretations. Similarly, Belk-
ina et al. (2019) [1087] proposed an automated approach for optimizing t-SNE parameters for large
datasets, emphasizing the importance of reproducibility and interpretability in real-world applica-
tions. Their work provided empirical evidence that parameter selection can drastically affect the
clustering structure in t-SNE visualizations, necessitating automated approaches for better stan-
dardization.
Beyond empirical studies, some researchers have sought to establish a more rigorous theoretical
understanding of t-SNE’s behavior in clustering and manifold learning. Linderman and Steiner-
berger (2019) [1088] provided a mathematical proof that t-SNE effectively recovers well-separated
clusters under certain conditions, offering a theoretical foundation for why t-SNE performs well in
clustering applications despite not being explicitly designed for that purpose. Their work bridged
the gap between empirical success and mathematical justification, thereby enhancing the credibility
of t-SNE in scientific applications. Amorim and Mirkin (2012) [1089] extended this discussion by
exploring the role of distance metrics in clustering and feature weighting, providing an alternative
approach that incorporated Minkowski metrics and anomaly detection within the t-SNE-reduced
space. Their work showed that selecting an appropriate distance function could further refine the quality of the clusters recovered in the embedded space.
Finally, alternative approaches to t-SNE have been explored, with studies comparing its perfor-
mance against other dimensionality reduction techniques. Becht et al. (2019) [1093] introduced
Uniform Manifold Approximation and Projection (UMAP) as an alternative to t-SNE, arguing
that UMAP offers improved scalability and better preservation of global structures. Their study
provided empirical comparisons between UMAP and t-SNE, demonstrating that UMAP performs
comparably in clustering while requiring significantly less computational time. Likewise, Moon
et al. (2019) [1094] proposed PHATE (Potential of Heat-diffusion for Affinity-based Transition
Embedding), which was designed to capture both local and global structures more effectively than
t-SNE. Their study showcased PHATE’s ability to preserve continuous trajectories in biological
data, highlighting its advantages in studying cell differentiation processes. Collectively, these stud-
ies illustrate the ongoing efforts to refine and expand upon t-SNE, both theoretically and practically,
ensuring its continued relevance in high-dimensional data visualization.
A subsequent survey of dimensionality reduction techniques discussed the advantages and limitations of t-SNE
compared to traditional methods like PCA and newer alternatives like UMAP. The study outlined
the scenarios where t-SNE performs optimally, particularly in non-linear data distributions, making
it an essential reference for researchers selecting appropriate dimensionality reduction techniques.
In the domain of defect detection in industrial settings, Chern et al. (2025) [1097] utilized t-
SNE for visualizing metal defect classification in deep learning models, improving interpretability
in YOLO-based defect detection frameworks. Their study showed how t-SNE enhances the under-
standing of feature similarities among defect classes, leading to more refined model improvements.
A similar approach was adopted by Li et al. (2025) [1098] in the field of food safety, where t-SNE
was applied to olfactory sensor data to analyze aflatoxin B1 contamination in wheat. By employing
t-SNE for dimensionality reduction, the researchers effectively visualized sensor response variations,
demonstrating the method’s utility in complex chemical data analysis. In another domain, Singh
and Singh (2025) [1099] developed a hybrid approach for medical image retrieval by integrating
deep learning features with t-SNE. Their method showed that t-SNE enhances feature clustering,
thereby improving the accuracy and efficiency of gastric image retrieval systems, which is particu-
larly beneficial in computer-aided diagnosis applications.
Sun et al. (2025) [1100] explored the application of t-SNE in biomechanics, specifically in detecting
muscle fatigue during lower limb isometric contraction tasks. Their study employed t-SNE to re-
duce the dimensionality of electromyography (EMG) data before applying machine learning-based
classification, leading to improved performance in fatigue detection models. Meanwhile, Su et al.
(2025) [1101] incorporated t-SNE into seismic fragility analysis for earth-rock dams, integrating it
within a deep residual shrinkage network to enhance predictive accuracy. Their research illustrated
how t-SNE, when combined with deep learning techniques, can refine the interpretation of het-
erogeneous material behavior under seismic loading conditions. In the biomedical domain, Yousif
and Al-Sarray (2025) [1102] integrated t-SNE with spectral clustering via convex optimization to
enhance breast cancer gene classification. Their work established that this combination yields
superior clustering performance compared to conventional methods, contributing significantly to
genomics research and precision medicine.
Park et al. (2025) [1103] assessed the clinical applicability of t-SNE in flow cytometry, specifically for
hematologic malignancies. Their study compared t-SNE with UMAP in reducing high-dimensional
cytometric data and concluded that t-SNE provides superior local structure preservation, which is
crucial for identifying rare cell populations in clinical diagnostics. Qiao et al. (2025) [1104] used
t-SNE to analyze cancer-associated fibroblasts (CAFs) in pancreatic ductal adenocarcinoma pa-
tients, identifying subclusters that exhibited distinct inflammatory gene expression patterns. This
study underscored the value of t-SNE in uncovering hidden patterns in complex biological datasets,
aiding in the stratification of cancer patients based on gene expression profiles. Furthermore, t-
SNE has been utilized in aerospace engineering, as demonstrated by Su et al. (2025) [1101], who
employed the technique in damage quantification for aircraft structures. Their study integrated
t-SNE into an end-to-end deep learning framework to enhance defect localization, demonstrating
how the method can improve structural health monitoring systems.
Overall, these studies collectively emphasize the versatility of t-SNE across diverse fields, ranging
from biomedical sciences and geophysics to industrial defect detection and aerospace engineering.
While t-SNE’s strength lies in its ability to preserve local structures in high-dimensional data, these
studies also highlight its limitations, such as high computational cost and sensitivity to parameter
tuning. Future research should focus on optimizing t-SNE’s efficiency while maintaining its ro-
bust visualization capabilities. By integrating t-SNE with deep learning architectures and hybrid
clustering techniques, researchers can further expand its applications in fields requiring advanced
data analysis and interpretation. The increasing adoption of t-SNE across disciplines highlights
its ongoing relevance, making it a crucial tool for researchers handling complex, high-dimensional
datasets.
t-SNE operates by defining pairwise similarities between data points in both high-dimensional and low-dimensional spaces and minimizing the discrepancy between these
similarities. The method builds upon Stochastic Neighbor Embedding (SNE) by incorporating a
Student’s t-distribution with a single degree of freedom (i.e., a Cauchy distribution)
as the low-dimensional similarity function, significantly mitigating the crowding problem of SNE.
The mathematical formulation of t-SNE is highly intricate and involves defining probability distri-
butions over pairwise relationships, constructing a cost function based on Kullback-Leibler (KL)
divergence, and employing gradient-based optimization methods such as gradient descent to find
an embedding that best preserves local structures.
In the high-dimensional space, the similarity of point x_j to point x_i is modeled as the conditional probability

p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)} \quad (15.46)

where \sigma_i is the perplexity-dependent bandwidth parameter for point x_i, which controls how much influence distant points have on similarity measures. The perplexity, denoted as \mathcal{P}, is defined as:

\mathcal{P}(i) = 2^{H(P_i)} \quad (15.47)

[Figure 15.3: t-SNE representation of word embeddings derived from 19th-century literary texts.]
where H(Pi ) is the Shannon entropy of the probability distribution:
H(P_i) = -\sum_{j \neq i} p_{j|i} \log_2 p_{j|i} \quad (15.48)
We then symmetrize these conditional probabilities to obtain a symmetric joint probability distri-
bution:

p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N} \quad (15.49)
This ensures that pij = pji , making optimization easier and avoiding directed relationships between
points.
In the low-dimensional space, the similarity between embedded points y_i and y_j is measured with a Student's t-kernel:

q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}} \quad (15.50)

This choice of the t-distribution helps mitigate the crowding problem, where points that are moderately distant in high-dimensional space tend to collapse together in low-dimensional space when Gaussian similarities are used.
To ensure that the probability distributions pij and qij match as closely as possible, we mini-
mize the Kullback-Leibler (KL) divergence, which measures how much information is lost when
using qij to approximate pij :
C = \mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \quad (15.51)
This function is minimized using gradient descent, where the gradient with respect to each low-
dimensional point yi is given by:
\frac{\partial C}{\partial y_i} = 4 \sum_{j \neq i} (p_{ij} - q_{ij}) \left(1 + \|y_i - y_j\|^2\right)^{-1} (y_i - y_j) \quad (15.52)
In summary, t-SNE is a powerful, highly nonlinear technique for dimensionality reduction that
excels at visualizing high-dimensional data while preserving local structures. Its formulation relies
on constructing two probability distributions, using a t-distribution kernel, and optimizing the KL
divergence cost function through gradient descent. However, due to its computational com-
plexity, improvements such as Barnes-Hut t-SNE and FIt-SNE have been developed for large
datasets.
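To make the optimization concrete, the following minimal NumPy sketch performs one gradient-descent step of (15.52), assuming the symmetric affinities P of (15.49) have already been computed with perplexity-calibrated bandwidths. The function tsne_step and its learning rate are illustrative.

```python
import numpy as np

def tsne_step(Y, P, lr=100.0):
    """One gradient-descent step of t-SNE (sketch).

    Y: (N, 2) current embedding; P: (N, N) symmetric affinity matrix."""
    # Pairwise Student-t similarities in the embedding, eq. (15.50)
    diff = Y[:, None, :] - Y[None, :, :]             # (N, N, 2) differences y_i - y_j
    inv_dist = 1.0 / (1.0 + np.sum(diff**2, axis=2)) # (1 + ||y_i - y_j||^2)^{-1}
    np.fill_diagonal(inv_dist, 0.0)
    Q = inv_dist / inv_dist.sum()

    # Gradient of the KL divergence, eq. (15.52)
    PQ = (P - Q) * inv_dist                          # (N, N)
    grad = 4.0 * np.einsum('ij,ijk->ik', PQ, diff)   # sum over j for each i
    return Y - lr * grad
```

In practice this plain step would be wrapped with momentum and early exaggeration, and accelerated variants such as Barnes-Hut t-SNE replace the exact pairwise sums with tree-based approximations.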
Following the introduction of LLE, Polito and Perona (2001) [1106] extended its application to
the problem of clustering and dimensionality reduction, demonstrating that LLE could naturally
group data points based on their intrinsic geometric properties. Their study highlighted the algo-
rithm’s ability to perform soft clustering, where different regions of the embedded space correspond
to distinct clusters in the high-dimensional space. This property is particularly useful in vision tasks
where the underlying data often exhibit complex nonlinear structures. In a related line of work, Zhang and Zha (2004) [1107] proposed Local Tangent Space Alignment (LTSA), which builds a global embedding by aligning local tangent-space approximations of the data manifold.
Further refinements to LLE were made by Donoho and Grimes (2003) [1108], who introduced
Hessian Eigenmaps as a variation of LLE that utilized Hessian-based quadratic forms to better
capture local curvature information. Their work showed that Hessian Eigenmaps could outperform
standard LLE in cases where the data exhibited significant variations in local density, thereby
reducing distortion in the learned embeddings. Another modification, introduced by Zhang and
Wang (2006) [1109], was the development of Modified Locally Linear Embedding (MLLE), which
incorporated multiple weights in each neighborhood to address issues related to the conditioning
of the local weight matrix. The authors demonstrated that by introducing multiple weight con-
straints, MLLE produced embeddings that were less prone to numerical instability, thus leading to
improved robustness and consistency in the low-dimensional representations. These advancements
collectively enhanced the stability and general applicability of LLE-based methods, reinforcing their
utility in real-world machine learning tasks.
Beyond theoretical refinements, LLE found applications in natural language processing, where
Liang (2005) [1110] explored its role in semi-supervised learning. By leveraging the geometric
structure of unlabeled data, LLE facilitated the discovery of meaningful feature representations,
which proved useful in learning linguistic patterns with limited labeled samples. Coates and Ng
(2012) [1111] provided a broader perspective on feature learning by comparing LLE with other
unsupervised learning techniques such as K-means clustering. Their study examined the strengths
and weaknesses of LLE in the context of automatic feature extraction, highlighting its capacity
to learn meaningful data representations without explicit supervision. These contributions under-
scored the versatility of LLE and its relevance across different domains, including computer vision,
speech processing, and text analysis.
In the broader context of feature extraction and representation learning, Hyvärinen and Oja (2000)
[1112] explored Independent Component Analysis (ICA) as an alternative method for learning
structured representations of high-dimensional data. While ICA seeks to identify statistically in-
dependent components, LLE preserves local geometric relationships, making them complementary
approaches to dimensionality reduction. Lee et al. (2006) [1113] further advanced the study of
feature learning by developing efficient sparse coding algorithms, which provided insights into the
underlying structures of data representations. Their work discussed the differences between sparse
coding techniques and manifold learning approaches such as LLE, emphasizing the advantages of
sparsity constraints in generating interpretable features. Collectively, these contributions illustrate
the ongoing evolution of nonlinear dimensionality reduction techniques, with LLE serving as a
foundational method that continues to inspire research in machine learning and data science.
In the applied literature, recent work has proposed a hybrid model that integrates LLE with a Transformer and LightGBM to predict the thermal conductivity of
natural rock materials, a crucial parameter for geothermal energy applications. In their study, LLE
was used to perform dimensionality reduction, allowing the Transformer model to better capture
latent structural patterns in the data. The combination of these methods led to improved predic-
tion accuracy, highlighting the role of LLE in optimizing feature selection for complex geophysical
modeling. Jin et al. (2025) [1116] proposed an improved version of LLE called Neighbor-Adapted
LLE (NALLE) for synthetic aperture radar (SAR) image processing. Their model effectively cap-
tured the structural properties of SAR images, facilitating zero-shot learning, where models are
trained without requiring large labeled datasets. By adapting LLE to work efficiently in image
processing tasks, they demonstrated its applicability in remote sensing, particularly for classifying
maritime objects.
In the field of optimization and algorithmic advancements, Li et al. (2024) [1117] proposed a novel
variation of LLE that modifies its original L2 norm-based distance metric using the Ali Baba and
The Forty Thieves Algorithm, an optimization technique inspired by metaheuristics. This mod-
ification improved LLE’s computational efficiency while maintaining accuracy in capturing data
manifold structures. Similarly, Jafari et al. (2025) [1118] provided an extensive review of LLE and
its variants in machine learning, emphasizing their role in feature extraction, non-linear dimension-
ality reduction, and data visualization. Their work serves as a valuable resource for understanding
the theoretical and practical developments of LLE, particularly in the context of biological data
analysis and big data applications. Additionally, Zhou et al. (2025) [1119] demonstrated the prac-
tical application of LLE in nondestructive testing (NDT) of thermal barrier coatings. Their study
showed that LLE could extract useful features from terahertz imaging data, significantly improving
the accuracy of stress detection in high-temperature material coatings.
Beyond scientific and industrial applications, LLE has been successfully employed in engineering
diagnostics and predictive maintenance. Dou et al. (2024) [1120] proposed an LLE-based method
for detecting faults in high-speed train traction systems. By transforming high-dimensional sensor
data into a lower-dimensional representation, LLE allowed for the early detection of system fail-
ures, ensuring the reliability and safety of train operations. Similarly, Bagherzadeh et al. (2021)
[1121] combined LLE with K-means clustering to optimize test case prioritization in software test-
ing. Their method improved the efficiency of defect detection in software systems, reducing testing
costs and accelerating the debugging process. Liu et al. (2025) [1122] proposed an intelligent
recognition algorithm for analyzing substation secondary wiring diagrams using a denoised variant
of LLE (D-LLE). Their approach enhanced the accuracy of connection identification in complex
electrical circuits, improving power system automation and maintenance.
These studies collectively illustrate the versatility and adaptability of LLE in a wide range of
fields, from fundamental algorithmic research to practical industrial applications. The continu-
ous improvements and novel adaptations of LLE, such as the development of Neighbor-Adapted
LLE for SAR images, its integration with Transformer models for geothermal applications, and
its optimization using heuristic algorithms, demonstrate its evolving role in modern data science.
The effectiveness of LLE in dimensionality reduction, feature extraction, and manifold learning
underscores its significance in handling high-dimensional data across diverse disciplines. Whether
in tourism, energy modeling, material testing, fault detection, or machine learning, LLE contin-
ues to be a powerful tool for solving complex, non-linear problems, paving the way for future
advancements in data-driven decision-making.
Given a dataset X = {x1 , x2 , . . . , xN }, where each data point xi belongs to RD (i.e., the data
is embedded in a high-dimensional space of dimension D), the first step in LLE is to identify the
K-nearest neighbors of each point. Mathematically, for each point xi , we find a set of K nearest
neighbors N (i), such that:
N(i) = \{x_{j_1}, x_{j_2}, \ldots, x_{j_K}\} \quad (15.54)
where the neighbors are selected based on the Euclidean distance:

d(x_i, x_j) = \|x_i - x_j\|_2 \quad (15.55)
Once the neighbors of each point are found, the next step is to compute reconstruction weights
that best express each data point as a linear combination of its nearest neighbors:
x_i \approx \sum_{j \in N(i)} w_{ij} x_j \quad (15.56)
The goal is to determine the weights wij that minimize the reconstruction error:
E(W) = \sum_{i=1}^{N} \Big\| x_i - \sum_{j \in N(i)} w_{ij} x_j \Big\|^2 \quad (15.57)
The optimal weights wij are obtained by solving the quadratic programming problem:
\min_{w_{ij}} \; \sum_{i=1}^{N} \sum_{j,k \in N(i)} w_{ij}\, C^{(i)}_{jk}\, w_{ik}, \quad \text{subject to} \quad \sum_{j \in N(i)} w_{ij} = 1 \quad (15.60)

where C^{(i)}_{jk} = (x_i - x_j)^{\top}(x_i - x_k) is the local Gram (covariance) matrix of the neighborhood of x_i.
After computing the weights w_{ij}, we find low-dimensional coordinates Y = \{y_1, y_2, \ldots, y_N\} that preserve the same local linear relationships:

y_i = \sum_{j \in N(i)} w_{ij} y_j \quad (15.62)
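The full pipeline — neighbor search, constrained local weights, and the spectral embedding step — fits in a compact NumPy/SciPy sketch. The function lle, the regularization constant, and the use of cKDTree are illustrative choices under the usual conditioning convention for the local Gram matrix.

```python
import numpy as np
from scipy.spatial import cKDTree

def lle(X, K=10, d=2, reg=1e-3):
    """Locally Linear Embedding (minimal sketch).

    X: (N, D) data; K: number of neighbors; d: target dimension;
    reg: ridge term stabilizing the local Gram matrix when K > D."""
    N = X.shape[0]
    _, nbrs = cKDTree(X).query(X, k=K + 1)
    nbrs = nbrs[:, 1:]                        # drop each point itself

    W = np.zeros((N, N))
    for i in range(N):
        Z = X[nbrs[i]] - X[i]                 # centered neighborhood
        C = Z @ Z.T                           # local Gram matrix C^(i), eq. (15.60)
        C += reg * np.trace(C) * np.eye(K)    # regularization for conditioning
        w = np.linalg.solve(C, np.ones(K))
        W[i, nbrs[i]] = w / w.sum()           # enforce the sum-to-one constraint

    # Embedding: bottom eigenvectors of M = (I - W)^T (I - W), cf. eq. (15.62)
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    _, vecs = np.linalg.eigh(M)
    return vecs[:, 1:d + 1]                   # skip the constant eigenvector
```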
A major breakthrough came with the development of computationally efficient ICA algorithms,
notably the FastICA method by Hyvärinen and Oja (1997) [1125]. They formulated an elegant
fixed-point approach that significantly improved the convergence speed of ICA. Instead of using slow
gradient-based updates, their method maximized non-Gaussianity using a negentropy approxima-
tion, which enabled rapid convergence to independent components. This algorithm incorporated a
preprocessing step involving PCA-based whitening, which decorrelated the observed signals before
ICA extraction, further enhancing stability and efficiency. In parallel, Bell and Sejnowski (1995)
[1009] introduced an information-theoretic approach to ICA, deriving the Infomax algorithm based
on maximum likelihood estimation (MLE). Their work established that maximizing the entropy
of the output neurons under a nonlinear activation function naturally leads to ICA, providing a
fundamental link between ICA and unsupervised neural learning. This insight positioned ICA as
a biologically plausible mechanism for sensory processing, particularly in neural coding and vision
research.
Several alternative approaches to ICA were also developed, focusing on different statistical prin-
ciples. Cardoso and Souloumiac (1993) [1126] introduced the JADE algorithm, which employed
joint approximate diagonalization of eigenmatrices to separate independent sources. By exploit-
ing fourth-order cumulants, their method avoided gradient descent altogether, leading to a robust
and efficient separation procedure, especially for overdetermined mixtures. Meanwhile, Amari et
al. (1995) [1127] introduced an ICA algorithm based on natural gradient learning, drawing on
concepts from information geometry. They demonstrated that a Riemannian metric on the pa-
rameter space of ICA solutions enables more efficient optimization, leading to faster convergence
without suffering from the plateaus and slowdowns of traditional gradient descent methods. This
information-geometric perspective provided deeper insights into the structure of the ICA optimiza-
tion landscape and inspired further advances in adaptive ICA algorithms.
A significant refinement of ICA occurred with the extension of Infomax to handle both sub-Gaussian
and super-Gaussian source distributions by Lee et al. (1999) [1128]. Their method introduced a
nonlinear adaptive function that dynamically adjusted based on the statistical properties of the
data, making ICA applicable to a broader range of real-world signals. This extension proved partic-
ularly useful in biomedical applications, where sources often exhibit mixed statistical distributions.
Around the same time, Pham and Garat (1997) [1129] presented a quasi-maximum likelihood es-
timation (QMLE) approach to ICA, rigorously formulating the problem as an optimization task
within the framework of statistical estimation theory. Their work provided a more systematic the-
oretical foundation for ICA and improved its robustness in practical scenarios, particularly when
dealing with noisy or low-sample data environments.
More recent developments in ICA have explored probabilistic and Bayesian approaches. Højen-
Sørensen et al. (2002) [1130] introduced a variational Bayesian framework for ICA, using mean-field
approximations to estimate the posterior distribution of independent components. This approach
allowed ICA to be extended into a probabilistic setting, making it more robust in the presence
of noise and missing data. Their work was particularly influential in applications requiring un-
certainty quantification, such as brain imaging and financial modeling. In addition to these al-
gorithmic advancements, Stone (2004) [1131] provided a comprehensive textbook that rigorously
detailed the mathematical principles underlying ICA. His work systematically explored the rela-
tionship between ICA, mutual information, higher-order statistics, and practical signal processing
applications, serving as an authoritative reference for both theoretical research and practical im-
plementation. Together, these contributions have solidified ICA as a fundamental tool in signal
processing, machine learning, and neuroscience, with ongoing developments continuing to refine its
theoretical underpinnings and expand its applications.
Independent Component Analysis (ICA) has found diverse applications across multiple domains,
ranging from neuroscience and medical imaging to financial modeling and power systems. Be-
hzadfar et al. (2025) [1132] introduced a novel multi-frequency ICA-based approach to process
functional MRI (fMRI) data. Their study focused on extracting meaningful components while ef-
fectively removing non-gray matter signals, thereby refining the accuracy of frequency-based brain
imaging. The method demonstrated improved signal clarity and more precise voxel-wise frequency
difference estimation, making it a valuable tool in neuroimaging research. Similarly, Eierud et al.
(2025) [1133] developed the NeuroMark PET ICA framework, which employs ICA to decompose
whole-brain PET signals into distinct networks, aiding in the construction of multivariate molecu-
lar imaging brain atlases. This technique allows researchers to analyze complex brain connectivity
patterns with greater precision, enhancing the study of neurodegenerative disorders and other neu-
rological conditions.
Expanding ICA’s application in hydrology, Wang et al. (2025) [1134] leveraged the technique
to analyze terrestrial water storage anomaly (TWSA) trends in the Yangtze River Basin. Their re-
search identified statistically independent spatial trends within hydrological data, offering insights
into climate change and its impact on water resource management. By distinguishing significant
patterns from background noise, ICA facilitated a more accurate assessment of water distribution
and hydrological cycles over time. Similarly, Heurtebise et al. (2025) [1135] used ICA to stabilize
estimators in hydrological dataset analysis, improving the reliability of multivariate mutual infor-
mation measurements. These advancements underscore ICA’s utility in environmental sciences,
particularly in large-scale water resource monitoring.
In the field of computational neuroscience, Ouyang and Li (2025) [1136] developed a protocol
that integrates ICA with Principal Component Analysis (PCA) for semi-automated EEG prepro-
cessing. Their approach streamlines the removal of artifacts and enhances signal quality, making it
easier for researchers to conduct large-scale EEG studies with minimal manual intervention. Zhang
and Luck (2025) [1137] further explored ICA’s impact on brain-computer interfaces by assessing
the effect of artifact correction on the performance of support vector machine (SVM)-based EEG
decoding. Their study demonstrated that ICA-based correction significantly improved classifica-
tion accuracy, reinforcing its role as a crucial preprocessing tool for neurophysiological data.
ICA has also shown promise in financial modeling and power system monitoring. Kirsten and
Süssmuth (2025) [1138] applied ICA to financial time-series data, demonstrating its ability to filter
noise and identify independent market-driving factors in cryptocurrency price movements. Their
findings indicated that ICA, combined with ARIMA modeling, improved predictive accuracy in
highly volatile market environments. Meanwhile, Jung et al. (2025) [1139] developed a hybrid
fault detection system that integrates ICA with auto-associative kernel regression (AAKR) for
power plant monitoring. Their model effectively isolated independent fault components, enabling
more accurate anomaly detection and predictive maintenance strategies.
In signal processing and acoustic applications, Wang et al. (2025) [1140] implemented ICA for
noise filtering in passive acoustic localization, particularly for underwater object detection. Their
method significantly enhanced the clarity of spatial positioning signals, reducing the impact of
environmental noise. Luo et al. (2025) [1141] extended ICA’s utility to brain-computer interfaces
(BCIs), where they used the technique to eliminate electrical noise and enhance the transmission of
neural signals. Their research demonstrated ICA’s ability to improve the reliability of noninvasive
BCIs, paving the way for more accurate and responsive brain-controlled devices.
These studies collectively illustrate ICA’s versatility across disciplines, from improving medical
imaging and neurophysiological data analysis to enhancing hydrological assessments, financial fore-
casting, and fault detection. As ICA continues to evolve, its integration with machine learning
techniques and its application in multi-sensor data fusion are likely to expand, offering even greater
potential for scientific and industrial advancements. Future research should focus on optimizing
ICA algorithms to handle increasingly complex datasets while maintaining computational efficiency.
[Table 15.14: Summary of recent contributions in the Independent Component Analysis (ICA) literature.]

Mathematically, we assume that we are given an n-dimensional observation vector x, which is related to an
unknown set of source signals s through an unknown mixing matrix A as follows:
x = As (15.67)
where x ∈ Rn is the observed signal vector, s ∈ Rn is the original source vector, and A ∈ Rn×n is the
mixing matrix that describes how the sources are combined. The objective of ICA is to estimate
a demixing matrix W such that the transformed signals ŝ are as statistically independent as
possible:
ŝ = Wx (15.68)
where ideally W ≈ A−1 , leading to an approximate recovery of the independent source signals.
The fundamental assumption that allows ICA to work is that the source signals s1 , s2 , ..., sn are
statistically independent, meaning their joint probability density function (PDF) factorizes:

p(s_1, s_2, \ldots, s_n) = \prod_{i=1}^{n} p_i(s_i) \quad (15.69)
This assumption of statistical independence is crucial because it allows the identification of the
source signals based on higher-order statistical properties, such as kurtosis and negentropy.
Furthermore, ICA relies on the non-Gaussianity of the source signals, as the central limit
theorem (CLT) states that the sum of independent random variables tends toward a Gaussian
distribution. Thus, if the sources were Gaussian, they could not be separated from their mixtures
because Gaussian distributions are completely characterized by second-order statistics (mean and
variance), and higher-order statistics would provide no additional information. To estimate the
demixing matrix W, we must find a transformation that maximizes the statistical independence
of the recovered signals ŝ. Several measures of statistical independence exist, including kurtosis,
negentropy, and mutual information, which lead to different optimization formulations of
ICA. One approach is to use kurtosis, which is defined as the fourth central moment of a random
variable:
\mathrm{Kurt}(y) = \mathbb{E}[y^4] - 3\left(\mathbb{E}[y^2]\right)^2 \quad (15.70)
A Gaussian distribution has a kurtosis of zero, while non-Gaussian distributions have nonzero
kurtosis. Since ICA relies on the assumption that the sources are non-Gaussian, we can maximize
the absolute value of kurtosis to obtain independent components. Another widely used measure is
negentropy, which is derived from information theory and defined as:

J(y) = H(y_{\text{gauss}}) - H(y) \quad (15.71)

where y_{\text{gauss}} is a Gaussian random variable with the same covariance as y. Since entropy measures randomness, negentropy quantifies how far a given distribution is from
Gaussianity. The higher the negentropy, the more non-Gaussian the distribution, making it a
useful objective function for ICA. An alternative approach is to use mutual information, which
measures the statistical dependence between variables:
I(y_1, y_2, \ldots, y_n) = \sum_{i=1}^{n} H(y_i) - H(y) \quad (15.73)
By minimizing mutual information, we can maximize the statistical independence of the estimated
sources. In practical ICA implementations, preprocessing is often necessary to improve perfor-
mance. A common preprocessing step is whitening, which ensures that the observed signals are
uncorrelated and have unit variance. Whitening is achieved by first computing the covariance
matrix of x:
C = \mathbb{E}[x x^{\top}] \quad (15.74)
Performing an eigenvalue decomposition (EVD):
C = E D E^{\top} \quad (15.75)
where E is the matrix of eigenvectors and D is the diagonal matrix of eigenvalues. The whitened
signals are then computed as:
x' = D^{-1/2} E^{\top} x \quad (15.76)
This transformation ensures that the covariance of x' is the identity matrix:

\mathbb{E}[x' x'^{\top}] = I \quad (15.77)
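A one-unit FastICA-style iteration illustrates how a row of the demixing matrix W can be estimated after whitening as in (15.76). This sketch uses the tanh nonlinearity as a robust negentropy approximation; the function name and parameters are illustrative assumptions.

```python
import numpy as np

def fastica_one_unit(X_white, max_iter=200, tol=1e-6, rng=np.random.default_rng(0)):
    """Extract one independent component from whitened data (FastICA-style sketch).

    X_white: (n, T) whitened observations, T samples of n mixed signals."""
    n = X_white.shape[0]
    w = rng.standard_normal(n)
    w /= np.linalg.norm(w)
    for _ in range(max_iter):
        wx = w @ X_white                        # projections onto w, shape (T,)
        g, g_prime = np.tanh(wx), 1.0 - np.tanh(wx) ** 2
        # Fixed-point update: w <- E[x g(w^T x)] - E[g'(w^T x)] w
        w_new = (X_white * g).mean(axis=1) - g_prime.mean() * w
        w_new /= np.linalg.norm(w_new)
        if abs(abs(w_new @ w) - 1.0) < tol:     # converged up to sign flip
            w = w_new
            break
        w = w_new
    return w                                     # one row of the demixing matrix W
```

Multiple components would be extracted by repeating this update with a decorrelation (deflation) step between units.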
ICA has broad applications in signal processing, including EEG and fMRI signal separation,
speech and audio processing, financial time series analysis, and image processing. It is
particularly useful in solving the cocktail party problem, where multiple speakers’ voices are
separated from mixed audio recordings. The mathematical foundation of ICA makes it one of the
most robust techniques for blind signal separation, and its various optimization formulations
offer flexibility depending on the application.
The field of ensemble learning has also seen significant advancements through rigorous mathe-
matical formulations. Breiman (2001) [730] introduced the Random Forest algorithm, an ensemble
method based on decision trees, and demonstrated how bootstrap aggregation (bagging) reduces
variance in predictive models. The paper provided a theoretical justification for the effectiveness of
decorrelating individual decision trees using random feature selection, showing that this mechanism
enhances generalization. Moreover, Breiman formally defined the out-of-bag (OOB) error estima-
tion method, an internal validation technique that provides an unbiased estimate of the model’s
performance without requiring a separate validation set. Around the same time, Friedman, Hastie,
and Tibshirani (2000) [1021] offered a rigorous statistical interpretation of boosting algorithms,
particularly AdaBoost. Their work demonstrated that boosting can be understood as a stagewise
optimization process that minimizes an exponential loss function, providing a function-space per-
spective on gradient boosting. By linking boosting to numerical optimization techniques, they laid
the foundation for gradient boosting machines (GBMs), which remain one of the most effective
supervised learning techniques today. Further theoretical contributions to boosting were made by
Schapire (1990) [1023], who formally proved that weak classifiers, which perform only marginally
better than random guessing, can be transformed into arbitrarily strong classifiers through the
process of boosting. The paper established a precise mathematical framework for analyzing how
iterative reweighting of training samples reduces error bounds, reinforcing the statistical robustness
of boosting methods.
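As a brief illustration of bagging with out-of-bag validation, the following example uses scikit-learn's RandomForestClassifier with oob_score=True, which estimates generalization error from the bootstrap samples each tree never saw; the synthetic dataset is purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Each tree is fit on a bootstrap sample; the OOB score is computed from the
# held-out samples per tree, giving an internal estimate of generalization
# accuracy without a separate validation set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
model = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
model.fit(X, y)
print(f"OOB accuracy estimate: {model.oob_score_:.3f}")
```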
Deep learning, a subset of supervised learning, has also benefited from rigorous theoretical ad-
vancements. LeCun et al. (1998) [1020] introduced convolutional neural networks (CNNs), demon-
strating their effectiveness in document recognition, particularly for handwritten digit classification
in the MNIST dataset. The authors provided a mathematically rigorous derivation of the backprop-
agation algorithm and the gradient-based optimization techniques used in training deep networks.
One of their major contributions was the concept of weight sharing, which significantly reduces the
number of learnable parameters in CNNs by enforcing translational invariance in feature detection.
This architecture laid the groundwork for modern deep learning-based vision systems, which have
since expanded to complex image recognition tasks. More recently, Srivastava et al. (2014) [132]
introduced dropout as a regularization technique for deep neural networks. The authors provided
a probabilistic interpretation of dropout as an approximation to model averaging over an exponen-
tially large ensemble of subnetworks, rigorously deriving its impact on reducing overfitting. The
paper also presented empirical results demonstrating that dropout significantly improves general-
ization performance across a wide range of supervised learning tasks.
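The mechanics of dropout can be shown in a few lines. The sketch below implements the common "inverted dropout" formulation, which rescales activations during training so that no rescaling is needed at test time; this is an equivalent reformulation of the scheme described by Srivastava et al., and the function name is illustrative.

```python
import numpy as np

def dropout_forward(activations, p_drop=0.5, train=True, rng=np.random.default_rng(0)):
    """Inverted dropout (sketch): zero out units at random during training and
    rescale by 1/(1 - p_drop) so expected activations match test time."""
    if not train or p_drop == 0.0:
        return activations                    # test time: identity map
    mask = (rng.random(activations.shape) >= p_drop) / (1.0 - p_drop)
    return activations * mask
```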
One of the earliest contributions to supervised learning, which laid the foundation for neural net-
works, was made by Rosenblatt (1958) [1022] through the introduction of the perceptron algorithm.
This work rigorously proved the perceptron convergence theorem, which guarantees that if a dataset
is linearly separable, the perceptron learning rule will converge to a separating hyperplane in a fi-
nite number of iterations. Although the perceptron was later shown to be limited in expressive
power, particularly in its inability to solve non-linearly separable problems such as the XOR prob-
lem, it inspired the development of more advanced architectures, including multi-layer perceptrons
(MLPs) and deep neural networks. Hastie, Tibshirani, and Friedman (2009) [130] provided one
of the most mathematically rigorous treatments of supervised learning, covering a broad range of
techniques including linear models, kernel methods, decision trees, ensemble learning, and deep
learning. Their work emphasized the mathematical foundations of machine learning, particularly
the bias-variance tradeoff, regularization methods such as ridge regression and Lasso, and kernel-
ized learning methods. The authors also provided in-depth theoretical analyses of the convergence
properties of various supervised learning algorithms, making their text a fundamental resource for
statistical learning theory.
The optimization of deep learning models has been another area of rigorous mathematical re-
search, with Kingma and Ba (2014) [166] introducing the Adam optimization algorithm. Their
work provided a detailed mathematical derivation of Adam’s update rules, which rely on the com-
putation of exponentially decaying moving averages of past gradients and squared gradients. The
authors rigorously demonstrated how Adam effectively combines the benefits of Adagrad, which
adapts learning rates based on the historical sum of squared gradients, and RMSProp, which nor-
malizes updates using an exponentially weighted average of past squared gradients. The method
has since become the default optimization algorithm for training deep neural networks, particularly
due to its robustness in handling sparse gradients and noisy objective functions. Collectively, these
works have profoundly shaped the landscape of supervised learning by introducing mathematically
rigorous frameworks for classification, regression, deep learning, and optimization. Their theoreti-
cal contributions continue to inform modern advancements, ensuring that machine learning models
remain both statistically sound and computationally efficient.
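To make these update rules concrete, the following is a minimal NumPy sketch of the Adam iteration described above, applied to a toy quadratic objective; the step size, decay constants, and objective are illustrative choices, not prescriptions from the original paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponentially decaying averages of past gradients
    (first moment m) and squared gradients (second moment v), with bias
    correction for their zero initialization."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective f(theta) = ||theta||^2, whose gradient is 2*theta.
theta = np.array([5.0, -3.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.05)
print(theta)  # close to the minimizer [0, 0]
```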
In the field of remote sensing and agriculture, Pei et al. (2025) [1025] introduced a weakly su-
pervised learning approach for segmenting vegetation from UAV images. Traditional supervised
models often struggle with limited labeled datasets in complex outdoor environments, but their
model incorporates spectral reconstruction techniques to improve classification accuracy in field
conditions. This advancement significantly enhances agricultural monitoring and precision farm-
ing. Likewise, Efendi et al. (2025) [1026] designed an IoT-based health monitoring system where
supervised learning algorithms were used to classify and predict health anomalies in elderly pa-
tients. Their study highlights how cloud computing and machine learning can be combined to
provide real-time health insights, ultimately enabling better remote patient care.
Another critical challenge in supervised learning is the scarcity of labeled data, which was ad-
dressed by Pang et al. (2025) [1027] in their research on protein transition pathway prediction.
They introduced DeepPath, a framework that integrates active learning with supervised deep learn-
ing, allowing models to efficiently identify uncertain data points for labeling. This method enhances
model accuracy while reducing the dependency on extensive labeled datasets, making it particu-
larly useful for complex biological simulations. In a different application, Curry et al. (2025) [1028]
explored supervised classification techniques in geoscience, comparing their effectiveness with un-
supervised clustering methods for analyzing ignimbrite flare-up patterns. Their work provides
insights into how machine learning models can be optimized for geological pattern detection, im-
proving hazard assessment and geological forecasting.
In the realm of computational drug discovery, Li et al. (2025) [1029] developed a deep learning-based
framework called π-PhenoDrug, which employs supervised learning for phenotypic drug screening.
The model utilizes transfer learning strategies and neural networks to classify drug interactions, sig-
nificantly accelerating the identification of promising drug candidates. Similarly, Liu et al. (2025)
[1030] integrated supervised learning with molecular docking and dynamic simulations to identify
ASGR1 and HMGCR dual-target inhibitors. By combining computational chemistry with machine
learning, they were able to streamline the drug discovery process, demonstrating the efficiency of
supervised models in pharmaceutical research. These contributions highlight the transformative role of supervised learning in accelerating computational drug discovery.
Beyond healthcare and science, supervised learning is also playing a crucial role in business and
engineering. Dutta and Karmakar (2025) [1031] investigated the application of the Random Forest
algorithm in business analytics, demonstrating its superior accuracy in predictive modeling com-
pared to other machine learning approaches. Their findings illustrate how machine learning can
optimize decision-making processes in organizational contexts. Finally, Ekanayake (2025) [1019]
explored the use of supervised deep learning models for enhancing Magnetic Resonance Imaging
(MRI) reconstruction and super-resolution. Their study addresses the challenges of lengthy MRI
scan times by leveraging artificial intelligence to improve image quality, thereby advancing medical
diagnostics and reducing patient discomfort.
These studies collectively demonstrate the broad impact of supervised learning across various dis-
ciplines, from security and healthcare to remote sensing, drug discovery, and business analytics.
The integration of machine learning with domain-specific challenges has led to significant break-
throughs, highlighting the adaptability and efficiency of supervised learning models. As researchers
continue to refine these techniques, the future of artificial intelligence-driven decision-making ap-
pears increasingly promising. The ability of supervised learning to extract meaningful patterns from
structured data is proving invaluable in solving real-world problems, paving the way for further
advancements in machine learning research and its practical applications.
where F denotes the hypothesis space of functions and ℓ(y, ŷ) is a loss function quantifying the error
between the true and predicted values. In empirical risk minimization (ERM), the expectation is
approximated using the training set, leading to
\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, f(x_i)). \quad (15.80)
For regression tasks, common choices for ℓ(y, ŷ) include the squared loss,
ℓ(y, ŷ) = (y − ŷ)2 , (15.81)
which leads to the minimization of the mean squared error (MSE),
\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^{N} (y_i - f(x_i))^2. \quad (15.82)
For classification, a frequently used loss function is the cross-entropy loss for a model outputting
class probabilities pk (x),
\ell(y, \hat{y}) = -\sum_{k=1}^{K} \mathbb{1}(y = k) \log p_k(x), \quad (15.83)
where K is the number of classes and \mathbb{1}(y = k) is the indicator function that is 1 if y = k and 0
otherwise. The classifier aims to minimize the empirical risk,
\hat{f} = \arg\min_{f \in \mathcal{F}} \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} -\mathbb{1}(y_i = k) \log p_k(x_i). \quad (15.84)
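As a concrete check of these definitions, the sketch below evaluates the empirical risks (15.82) and (15.84) on hand-made toy arrays; the labels and probability table are hypothetical.

```python
import numpy as np

def empirical_mse(y, y_hat):
    """Empirical risk under the squared loss (15.82)."""
    return np.mean((y - y_hat) ** 2)

def empirical_cross_entropy(y, p):
    """Empirical risk (15.84): the indicator picks out the probability
    assigned to the true class, so only -log p_{y_i}(x_i) survives."""
    n = len(y)
    return -np.mean(np.log(p[np.arange(n), y]))

y_cls = np.array([0, 2, 1])                   # integer class labels
p = np.array([[0.7, 0.2, 0.1],                # rows of class probabilities
              [0.1, 0.2, 0.7],
              [0.2, 0.6, 0.2]])
print(empirical_cross_entropy(y_cls, p))      # about 0.408

y_reg = np.array([1.0, 2.0, 3.0])
print(empirical_mse(y_reg, np.array([1.1, 1.8, 3.2])))  # 0.03
```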
To optimize \hat{f}, gradient-based methods are commonly used, particularly in deep learning where f is parameterized by a neural network with weights θ = \{W_l, b_l\},

f(x; θ) = σ(W_L σ(W_{L-1} \cdots σ(W_1 x + b_1) \cdots + b_{L-1}) + b_L), \quad (15.85)

where W_l and b_l are the weight matrix and bias vector at layer l, and σ(·) is an activation function. The optimization process follows stochastic gradient descent (SGD), updating parameters iteratively as follows,
θ^{(t+1)} = θ^{(t)} - η ∇_θ L(θ), \quad (15.86)
where η is the learning rate and L(θ) is the loss function. The gradient ∇θ L(θ) is computed via
backpropagation,
\frac{\partial L}{\partial θ} = \sum_{i=1}^{N} \frac{\partial \ell(y_i, f(x_i; θ))}{\partial θ}. \quad (15.87)
Regularization techniques, such as L2 regularization (ridge regression),
L(θ) = \frac{1}{N} \sum_{i=1}^{N} (y_i - f(x_i; θ))^2 + λ\|θ\|_2^2, \quad (15.88)
are used to prevent overfitting by penalizing large weights. In logistic regression, the probability of
a binary label y ∈ {0, 1} is modeled using the sigmoid function,
p(y = 1 \mid x; θ) = \frac{1}{1 + e^{-x^T θ}}. \quad (15.89)
The loss function in this case is the binary cross-entropy loss,
L(θ) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log p(y_i) + (1 - y_i) \log(1 - p(y_i)) \right]. \quad (15.90)
Thus, supervised learning encompasses various frameworks, each employing rigorous mathematical
formulations to optimize predictive performance.
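As an illustration of how these pieces fit together, the sketch below runs the SGD update (15.86) on the ridge-regularized squared loss (15.88); the synthetic data, batch size, and learning rate are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
theta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ theta_true + 0.1 * rng.normal(size=200)

theta = np.zeros(5)
eta, lam, batch = 0.05, 1e-3, 32
for epoch in range(200):
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch):
        b = idx[start:start + batch]
        resid = X[b] @ theta - y[b]
        # Minibatch gradient of (1/N) sum (y_i - x_i.theta)^2 + lam * ||theta||^2
        grad = 2 * X[b].T @ resid / len(b) + 2 * lam * theta
        theta -= eta * grad          # the SGD step of (15.86)
print(theta)                          # approximately recovers theta_true
```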
In the context of deep learning, Neal et al. (2018) [854] revisited the bias-variance tradeoff, rig-
orously analyzing why larger neural networks often achieve superior generalization despite their
over-parameterization. They argued that implicit regularization, induced by optimization algo-
rithms such as stochastic gradient descent, plays a crucial role in preventing excessive variance
growth. Rocks and Mehta (2020) [855] further investigated over-parameterized models using sta-
tistical physics methods, deriving explicit analytical expressions for bias and variance in settings
such as linear regression and shallow neural networks. Their work provided a deeper theoretical
understanding of how interpolation affects generalization, revealing a phase transition where the
training error collapses to zero while test error exhibits non-trivial behavior due to the complex
interaction of bias and variance. Meanwhile, Doroudi and Rastegar (2023) [856] extended the
bias-variance framework beyond machine learning, applying it to cognitive science models. Their
work emphasized that cognitive processes also operate under similar tradeoffs, where overly rigid or
overly flexible models fail to accurately capture human learning and reasoning, thereby providing
an interdisciplinary perspective on the problem.
The practical implications of the bias-variance tradeoff extend to domains such as label noise,
knowledge distillation, and ensemble learning. Almeida et al. (2020) [857] focused on mitigating
class boundary label uncertainty, proposing an approach that estimates label noise and adjusts
training sample weights to simultaneously reduce both bias and variance. Their method provided
empirical evidence that accounting for uncertainty near decision boundaries leads to improved gen-
eralization performance. Zhou et al. (2021) [858] explored the role of soft labels in knowledge
distillation, demonstrating that these labels act as a form of implicit regularization that influences
the bias-variance tradeoff in student-teacher learning paradigms. They proposed a novel method
of weighting soft labels to dynamically balance bias and variance at the sample level. Gupta et al.
(2022) [859] examined ensemble methods from a bias-variance perspective, showing that ensem-
bling reduces variance without significantly increasing bias. Their theoretical and empirical results
highlighted why ensemble-based approaches often outperform single models in practice. Finally,
Ranglani (2024) [860] conducted an extensive empirical analysis across different machine learning
algorithms, quantifying the bias-variance decomposition and providing insights into the tradeoff
dynamics in regression and classification tasks. Their study offered practical guidelines for opti-
mizing machine learning models by rigorously understanding how bias and variance manifest across
various architectures.
Collectively, these works illustrate the evolution of the bias-variance tradeoff from its classical
roots to its modern reinterpretation in deep learning and beyond. They provide rigorous the-
oretical insights and empirical validations that have fundamentally shaped the field of machine
learning. The contributions span diverse perspectives, from statistical learning theory to interdis-
ciplinary applications, and challenge traditional notions of model complexity and generalization.
The continued exploration of this tradeoff remains a central theme in machine learning research,
influencing advancements in algorithm design, optimization techniques, and real-world deployment
strategies.
Reference and contribution:
Geman, Bienenstock, and Doursat (1992): Introduced the bias-variance decomposition in neural networks, demonstrating how model complexity influences predictive error. They formulated a theoretical framework explaining the tradeoff between underfitting (high bias) and overfitting (high variance).
In the context of statistical modeling, Polson and Sokolov (2024) rigorously analyze hierarchical
linear models, showcasing how ridge and lasso regression act as regularization techniques to mini-
mize variance while maintaining sufficient flexibility to prevent high bias. This is further explored
by Jogo (2025), who provides a broader statistical perspective, linking the bias-variance tradeoff to
support vector machines, unsupervised learning, and computational efficiency in high-dimensional
spaces. Additionally, Du et al. (2025) introduce a mathematical framework that incorporates
margin theory and optimization-based strategies to manage bias-variance tradeoffs in ensemble
learning, highlighting computational trade-offs associated with different model architectures.
A particularly interesting development in deep learning is the challenge to the traditional U-shaped
bias-variance curve, as presented by Wang and Pope (2025). Their work on the double descent
phenomenon suggests that increasing model complexity beyond a certain threshold can actually
improve generalization, contradicting classical theory. In a related vein, Chen et al. (2024) investi-
gate the role of graph convolutional networks in regression tasks, providing a theoretical framework
to understand how different convolutional layers impact the bias-variance tradeoff. Meanwhile,
Obster et al. (2024) take a more interpretability-focused approach, proposing a scoring system to
balance predictive accuracy and model transparency while managing the bias-variance relationship.
E\left[(Y - \hat{f}(X))^2\right] = \left(E[\hat{f}(X)] - f(X)\right)^2 + E\left[(\hat{f}(X) - E[\hat{f}(X)])^2\right] + σ^2, \quad (15.91)

where Y is the true response variable, X is the input feature, f(X) is the true underlying function, and \hat{f}(X) is the estimated function learned by the model. The decomposition highlights three key sources of error:
1. Bias: This term quantifies how far the expected prediction E[fˆ(X)] is from the true function
f(X). Formally, it is defined as

\text{Bias}(\hat{f}(X)) = E[\hat{f}(X)] - f(X). \quad (15.92)
A high-bias model makes systematic errors because it fails to capture the complexity of the
data. This often occurs in underfitting, where the model is too simple to represent the
underlying structure of the data.
2. Variance: This term quantifies the variability of the model’s predictions around its expected
value, given by

\text{Var}(\hat{f}(X)) = E\left[(\hat{f}(X) - E[\hat{f}(X)])^2\right]. \quad (15.93)
A high-variance model is highly sensitive to fluctuations in the training data and does not
generalize well to unseen data. This typically occurs in overfitting, where the model captures
noise instead of the true signal.
3. Irreducible Error: The term σ^2 = E[(Y - f(X))^2] represents noise inherent in the data that no model can eliminate.
Since the total error is the sum of these components, an increase in one term often leads to a
decrease in another. For instance, using a more flexible model (e.g., a high-degree polynomial)
can reduce bias but at the cost of increased variance. Conversely, using a simple model (e.g.,
linear regression) reduces variance but increases bias. This tradeoff is captured by minimizing the
expected loss function

\min_{\hat{f}} E\left[(Y - \hat{f}(X))^2\right], \quad (15.94)
where the goal is to balance the bias and variance terms to achieve optimal generalization. To
analyze this tradeoff further, consider a simple linear model where the target function is quadratic:
Y = X^2 + ϵ, \quad \text{where } ϵ \sim \mathcal{N}(0, σ^2). \quad (15.95)
If we fit a linear regression model \hat{f}(X) = aX + b, then the squared bias is

\text{Bias}^2 = \left(E[aX + b] - X^2\right)^2. \quad (15.96)
For a high-degree polynomial model \hat{f}(X) = \sum_{i=0}^{d} c_i X^i, the variance term increases with d, leading to

\text{Var}(\hat{f}(X)) = \sum_{i=0}^{d} \text{Var}(c_i) X^{2i}. \quad (15.97)
Minimizing the expected generalization error requires selecting a model complexity d such that

\frac{\partial}{\partial d}\left( \text{Bias}^2 + \text{Var} + σ^2 \right) = 0. \quad (15.98)
To further illustrate, consider a dataset of size n and a model parameterized by θ. The variance of the estimator \hat{θ} is given by

\text{Var}(\hat{θ}) = E\left[(\hat{θ} - E[\hat{θ}])^2\right], \quad (15.99)

whereas the squared bias is

\text{Bias}^2(\hat{θ}) = \left(E[\hat{θ}] - θ_{\text{true}}\right)^2. \quad (15.100)

The total mean squared error (MSE) of the estimator is then

\text{MSE}(\hat{θ}) = \text{Bias}^2(\hat{θ}) + \text{Var}(\hat{θ}). \quad (15.101)
As the complexity of the model increases, the bias decreases, but variance increases. This behavior
is described by the function
\text{Error}(d) = A d^{-p} + B d^{q} + C, \quad (15.102)
where A, B, C, p, q are constants depending on the dataset and learning algorithm, and d represents
model complexity. The optimal complexity d∗ satisfies
\frac{\partial \text{Error}(d)}{\partial d} = 0. \quad (15.103)
For neural networks, increasing the number of layers and neurons often leads to decreased bias
but significantly increased variance due to overparameterization. The expected loss for a neural
network with weights W trained on data (X, Y ) is given by
E[L(W)] = E\left[(Y - f_W(X))^2\right] = \text{Bias}^2 + \text{Var} + σ^2. \quad (15.104)
Regularization techniques such as ridge regression introduce a penalty term λ\|W\|^2, leading to the optimization problem

\min_{W} \sum_{i=1}^{n} (y_i - f_W(x_i))^2 + λ\|W\|^2. \quad (15.105)
This reduces variance at the cost of increasing bias, effectively controlling model complexity to
achieve an optimal bias-variance tradeoff.
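A short simulation makes the tradeoff concrete: repeatedly refitting polynomials of different degrees to noisy draws from Y = X^2 + ε, as in (15.95), and estimating the empirical bias squared and variance of the resulting predictors. The sample sizes, noise level, and degrees are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.5
x_test = np.linspace(-1, 1, 50)
f_test = x_test ** 2                      # true function f(X) = X^2

for degree in (1, 2, 9):
    preds = []
    for _ in range(500):                  # many independent training sets
        x = rng.uniform(-1, 1, 30)
        y = x ** 2 + sigma * rng.normal(size=30)
        preds.append(np.polyval(np.polyfit(x, y, degree), x_test))
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - f_test) ** 2)   # (15.96) analogue
    var = np.mean(preds.var(axis=0))                      # (15.97) analogue
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {var:.4f}")
# Degree 1 shows high bias and low variance; degree 9 shows the reverse.
```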
Support Vector Machines (SVMs) are rooted in Vladimir Vapnik's "The Nature of Statistical Learning Theory" (1995) [134], which introduced the fundamental principle of structural risk minimization (SRM). Unlike empirical risk minimization, which solely minimizes training error, SRM aims to balance the complexity of the model and its ability to generalize to unseen data, thereby preventing overfitting. This theoretical framework laid the foundation for the development of SVMs as a powerful tool for classification and regression problems. Schölkopf and Smola's "Learning with Kernels" (2002) [798] provided a mathematically rigorous treatment of kernel methods, which form the core of SVMs by enabling them to operate in high-dimensional feature spaces without explicit transformation via the kernel trick. This book formalized the mathematical underpinnings of reproducing kernel Hilbert spaces (RKHS), Mercer's theorem, and various kernel functions such as Gaussian and polynomial kernels. Cristianini and Shawe-Taylor's "An Introduction to Support Vector Machines" (2000) [799] played a crucial role in making these complex mathematical ideas accessible. It not only explained the theoretical aspects of SVMs but also provided insights into their practical implementation, making it a key reference for researchers and practitioners alike.
The statistical properties of SVMs were rigorously analyzed in Christmann and Steinwart's "Support Vector Machines" (2008) [800], which focused on their consistency, robustness, and learning rates. This work addressed fundamental questions regarding the asymptotic behavior of SVM classifiers, including their convergence properties under different loss functions. Furthermore, the edited volume by Schölkopf, Burges, and Smola, "Advances in Kernel Methods" (1999) [801], compiled several significant advancements in SVM research, including extensions of SVMs to regression problems (Support Vector Regression, SVR), one-class SVMs for anomaly detection, and novel kernel functions suited for different applications. The book highlighted both theoretical developments and practical applications, illustrating the versatility of SVMs across various domains such as image recognition, bioinformatics, and financial modeling.
The extension of SVMs beyond binary classification was a crucial milestone. Drucker et al.'s "Support Vector Regression Machines" (1997) [802] formalized Support Vector Regression (SVR), which adapted the SVM framework to predict continuous-valued outputs by introducing an ϵ-insensitive loss function. This work demonstrated how SVMs could be effectively used for time series prediction and function approximation. Joachims' "Transductive Inference for Text Classification" (1999) [803] introduced Transductive SVMs (TSVMs), which exploit both labeled and unlabeled data to enhance classification performance, particularly in text classification problems where labeled data is limited. By incorporating unlabeled data, TSVMs leverage the structure of the input distribution, making them particularly useful in semi-supervised learning.
Another fundamental contribution came from Schölkopf, Smola, and Müller's "Nonlinear Component Analysis as a Kernel Eigenvalue Problem" (1998) [804], which extended Principal Component Analysis (PCA) to nonlinear settings using kernels (Kernel PCA). This demonstrated how kernel-based methods, including SVMs, could be applied beyond classification and regression to dimensionality reduction and feature extraction, which are critical for handling high-dimensional data. Burges' "A Tutorial on Support Vector Machines for Pattern Recognition" (1998) [805] provided an intuitive yet mathematically detailed explanation of SVMs, covering Lagrange duality, convex optimization, and margin maximization. This tutorial remains an essential resource for researchers seeking to understand both the theoretical foundations and practical aspects of implementing SVMs.
Schölkopf et al. (2001) introduced one-class SVMs for anomaly detection, where the goal is to find a decision boundary that
encloses the majority of data points while rejecting outliers. This formulation is particularly useful
in fraud detection, network security, and fault diagnosis, where anomalies are rare but significant.
Collectively, these contributions have shaped the development of SVMs as a mathematically rigor-
ous and practically powerful machine learning technique. The blend of theoretical depth, statistical
robustness, and practical versatility has cemented SVMs as one of the most influential algorithms
in the history of machine learning.
Reference and contribution:
Vladimir N. Vapnik (1995): Introduced Structural Risk Minimization (SRM), the foundational principle behind SVMs. Established the theoretical basis for SVMs within statistical learning theory.
Bernhard Schölkopf and Alexander J. Smola (2002): Provided a rigorous mathematical treatment of kernel methods, including reproducing kernel Hilbert spaces (RKHS) and Mercer's theorem, formalizing the kernel trick.
Nello Cristianini and John Shawe-Taylor (2000): Made SVM theory accessible by providing a practical introduction to the mathematical foundations and applications of SVMs.
Ingo Steinwart and Andreas Christmann (2008): Offered a statistical analysis of SVMs, covering consistency, robustness, and learning rates, providing insight into their asymptotic properties.
Bernhard Schölkopf, Christopher J.C. Burges, and Alexander J. Smola (1999): Compiled major advances in kernel methods, including extensions of SVMs to regression (SVR) and novel kernel functions. Showcased applications in image processing and bioinformatics.
Harris Drucker et al. (1997): Developed Support Vector Regression (SVR), adapting SVMs for continuous-valued predictions. Introduced the ϵ-insensitive loss function.
Thorsten Joachims (1999): Introduced Transductive SVMs (TSVMs), which leverage both labeled and unlabeled data for improved classification, particularly in text mining applications.
Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller (1998): Developed Kernel Principal Component Analysis (Kernel PCA), extending SVM-related methods to nonlinear dimensionality reduction and feature extraction.
Christopher J.C. Burges (1998): Provided an accessible and mathematically detailed tutorial on SVMs, explaining concepts such as margin maximization, convex optimization, and duality theory.
Bernhard Schölkopf et al. (2001): Introduced One-Class SVMs for anomaly detection, providing a method to estimate the support of a high-dimensional distribution. Applied in fraud detection and network security.
data, assessing stroke rehabilitation effects in brain tumor patients. Their study emphasized SVM’s
capability in clinical decision-making and biomedical imaging. Similarly, Diao et al. (2025) [843]
investigated lung cancer detection by optimizing Bi-LSTM networks with hand-crafted features.
Their findings revealed that SVMs achieved the highest classification accuracy when extracting
Gray-Level Co-Occurrence Matrix (GLCM) features, reinforcing their robustness in medical image
classification. Another pivotal study by Lin et al. (2025) [844] combined deep transfer learning
with SVM-based radiomics for sinonasal malignancy detection in MRI scans, achieving an impres-
sive 92.6 percent accuracy, underscoring SVM’s potential in computer-aided diagnosis and medical
image processing. Çetintaş (2025) [845] further extended SVM applications in healthcare by em-
ploying an optimized SVM model via Grid Search for monkeypox detection, significantly improving
classification performance on imbalanced datasets.
In the realm of natural language processing and text analytics, Wang and Zhao (2025) [846]
compared sentiment lexicon and machine learning methods for citation sentiment identification,
demonstrating SVM’s superior generalization ability in text classification tasks. Muralinath et al.
(2025) [847] explored multichannel EEG classification using spectral graph kernels, where SVMs
played a crucial role in robust epilepsy detection. Additionally, Hu et al. (2025) [848] addressed the
class imbalance problem in sarcasm detection by leveraging an ensemble-based oversampling tech-
nique, proving SVM’s efficacy in handling skewed datasets. These studies confirm SVM’s strength
in high-dimensional, sparse data environments, making it a go-to method in various text processing
applications.
Engineering and applied sciences have also benefited greatly from SVM-based models. Wang et al.
(2025) [849] demonstrated how Support Vector Regression (SVR) effectively predicts the tensile
properties of automotive steels, validating its use in materials science and mechanical engineering.
Similarly, Husain et al. (2025) [850] employed SVR to model shear thickening fluid behavior, show-
casing its ability to forecast nonlinear behaviors in physics and engineering applications. Iqbal and
Siddiqi (2025) [851] integrated SVM in a hybrid deep learning model to enhance seasonal stream-
flow prediction, proving SVM’s utility in hydrological and environmental modeling. These studies
collectively highlight the adaptability of SVM-based regression techniques in predicting complex,
nonlinear systems across scientific disciplines.
From biomedical diagnostics to predictive analytics in engineering, these studies illustrate how Sup-
port Vector Machines remain a fundamental tool in machine learning research. The ability of SVMs
to handle high-dimensional spaces, work with limited data, and maintain strong generalization ca-
pabilities makes them highly relevant across numerous fields. Whether used for medical imaging,
fraud detection, text mining, or physical modeling, SVMs continue to provide state-of-the-art so-
lutions that rival deep learning techniques while maintaining interpretability and computational
efficiency.
The distance of a point x_i from the separating hyperplane is

\frac{|w^\top x_i + b|}{\|w\|}. \quad (15.108)

The margin, which is the distance between the two parallel supporting hyperplanes that pass through the closest data points of each class, is given by

\frac{2}{\|w\|}. \quad (15.109)

The optimal separating hyperplane maximizes this margin, leading to the following convex optimization problem

\min_{w, b} \; \frac{1}{2}\|w\|^2 \quad (15.110)
subject to
yi (w⊤ xi + b) ≥ 1, ∀i ∈ {1, 2, . . . , N } (15.111)
To solve this constrained optimization problem, the Lagrangian is introduced:

L(w, b, α) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} α_i \left[ y_i (w^\top x_i + b) - 1 \right]. \quad (15.112)

Setting the derivative of L with respect to w to zero gives the stationarity condition

w = \sum_{i=1}^{N} α_i y_i x_i, \quad (15.113)

and the remaining Karush-Kuhn-Tucker conditions are

\sum_{i=1}^{N} α_i y_i = 0, \quad (15.114)

α_i \left[ y_i (w^\top x_i + b) - 1 \right] = 0, \quad \forall i \in \{1, 2, \ldots, N\}. \quad (15.115)
Substituting (15.113) back into the Lagrangian yields the dual problem

\max_{α} \sum_{i=1}^{N} α_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} α_i α_j y_i y_j x_i^\top x_j \quad (15.116)

subject to

\sum_{i=1}^{N} α_i y_i = 0, \quad α_i \geq 0, \quad \forall i \in \{1, 2, \ldots, N\}. \quad (15.117)

The support vectors are the points for which α_i > 0. The bias term b is computed using any support vector x_s,

b = y_s - w^\top x_s. \quad (15.118)

(Figure: An SVM trained on two-class data, showing the maximum-margin hyperplane and its margins; the data points that rest on the margin are the support vectors.)
For non-linearly separable data, the SVM allows some misclassification using slack variables ξ_i \geq 0, leading to the soft-margin SVM formulation

\min_{w, b, ξ} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} ξ_i \quad (15.119)
subject to
yi (w⊤ xi + b) ≥ 1 − ξi , ξi ≥ 0, ∀i ∈ {1, 2, . . . , N } (15.120)
where C > 0 is a regularization parameter. The corresponding dual problem remains
\max_{α} \sum_{i=1}^{N} α_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} α_i α_j y_i y_j x_i^\top x_j \quad (15.121)
subject to
\sum_{i=1}^{N} α_i y_i = 0, \quad 0 \leq α_i \leq C, \quad \forall i \in \{1, 2, \ldots, N\}. \quad (15.122)
Thus, SVM constructs an optimal decision boundary by solving a quadratic optimization problem,
ensuring a maximum-margin separator, and leveraging kernel functions to extend the model to
non-linearly separable cases.
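As a sketch of how such a classifier can be fit in practice, the following minimizes the soft-margin objective (15.119)-(15.120) in its equivalent unconstrained hinge-loss form, (1/2)\|w\|^2 + C \sum_i \max(0, 1 - y_i(w^\top x_i + b)), by subgradient descent rather than by solving the dual quadratic program (15.121); the synthetic two-blob data, step size, and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
# Two roughly separable Gaussian blobs labeled -1 and +1.
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([-1.0] * 50 + [1.0] * 50)

w, b, C, eta = np.zeros(2), 0.0, 1.0, 0.01
for _ in range(2000):
    margins = y * (X @ w + b)
    viol = margins < 1                    # points with nonzero hinge loss
    # Subgradient of (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))
    grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
    grad_b = -C * y[viol].sum()
    w -= eta * grad_w
    b -= eta * grad_b

# Points lying near the margin play the role of support vectors.
on_margin = np.isclose(y * (X @ w + b), 1.0, atol=0.1)
print(w, b, on_margin.sum())
```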
The geometric interpretation of regression was further expanded by Pearson (1901) [809], who
developed principal component analysis (PCA) as a method for dimensionality reduction and data
compression. Pearson’s work highlighted the connection between linear regression and orthogonal
projections, providing insights into how regression techniques relate to eigenvalue decomposition
and singular value decomposition (SVD). Fisher (1922) [810] played a crucial role in rigorously
formulating the statistical inference framework for linear regression, developing methods for es-
timating standard errors, constructing confidence intervals, and conducting hypothesis tests on
regression coefficients. Fisher’s approach introduced key concepts such as the sampling distribu-
tion of regression estimators and the use of likelihood-based inference, which remain central to
modern regression analysis.
Building upon these foundational principles, Koopmans (1937) [811] extended linear regression
techniques to time series analysis, addressing challenges such as serial correlation, heteroskedastic-
ity, and multicollinearity in economic forecasting. This laid the groundwork for the generalized least
squares (GLS) method, which was later formalized by Goldberger (1991) [812]. Goldberger pro-
vided a rigorous treatment of violations of ordinary least squares (OLS) assumptions, particularly
when dealing with correlated error structures and non-constant variance in regression models. Rao
(1973) [813] further expanded the theory of regression by introducing ridge regression, a technique
that stabilizes coefficient estimates in the presence of multicollinearity by introducing a penalty
term to control variance. His work unified linear regression within the broader framework of mul-
tivariate statistical inference, contributing to the development of generalized regression models.
Huber (1992) [814] introduced a significant advancement by developing robust regression tech-
niques, which addressed the sensitivity of classical least squares estimation to outliers and depar-
tures from normality. His work on M-estimation provided a framework for obtaining regression
estimates that remain stable under heavy-tailed distributions and heteroskedastic errors. Finally,
modern developments in regression analysis were driven by Hastie, Tibshirani, and Friedman (2009)
[130], who introduced regularized regression techniques such as ridge regression and LASSO (Least
Absolute Shrinkage and Selection Operator). These methods added penalization terms to the least
squares objective function, effectively preventing overfitting and improving model generalization in
high-dimensional settings.
Collectively, these contributions transformed linear regression from a simple numerical fitting
method into a rigorous statistical and mathematical framework with broad applicability across
various scientific and engineering disciplines. The integration of inferential statistics, robustness
techniques, and regularization strategies has ensured that linear regression remains a cornerstone
of modern statistical analysis and predictive modeling.
Reference and contribution:
Legendre (1805): Introduced the least squares method for estimating parameters by minimizing the sum of squared residuals, initially applied to astronomical data.
Gauss (1809, 1821): Formally justified the least squares method under normal error assumptions, proving its best linear unbiased estimation (BLUE) property and laying the foundation for the Gauss-Markov theorem. Developed statistical inference for regression.
Pearson (1901): Developed principal component analysis (PCA), establishing the geometric relationship between regression and orthogonal projections, which later connected to singular value decomposition (SVD).
Fisher (1922): Formalized the statistical inference framework for regression, deriving the sampling distributions of regression coefficients, hypothesis testing procedures, and maximum likelihood estimation (MLE) methods.
Koopmans (1937): Extended regression to time series analysis, addressing issues such as serial correlation, heteroskedasticity, and multicollinearity, which are critical in econometric models.
Goldberger (1964): Developed generalized least squares (GLS) to handle correlated errors and non-constant variance, providing a rigorous treatment of assumption violations in ordinary least squares (OLS).
Healthcare research has also significantly benefited from linear regression models, particularly in
understanding human behavior and institutional efficiency. Zhong et al. (2025) [817] employ mul-
tiple linear regression to analyze factors influencing nurses’ attitudes and practices in postural
management for premature infants. Their study highlights how regression can quantify behavioral
determinants, aiding in the design of targeted medical training programs. Likewise, Liu et al.
(2025) [818] use regression analysis to assess the research capabilities of pediatric clinical nurses,
identifying key skill gaps and the variables influencing scientific output in medical institutions.
These findings emphasize the importance of regression-based methodologies in human resource op-
timization and professional development within healthcare. A notable addition to the healthcare research stream is the study by Dietze et al. (2025) [820]. Their research examines the impact of the UNODC/WHO SOS (Stop Overdose Safely) training program
on opioid overdose knowledge and attitudes. Using multivariable linear regression, they analyze the
relationship between demographic characteristics and training effectiveness. Their findings reveal
how linear regression can provide deep insights into behavioral interventions, policy evaluations,
and addiction treatment effectiveness.
In the economic and policy domain, regression techniques serve as powerful tools for evaluating
regulatory frameworks and socioeconomic disparities. Ming-jun and Jian-ya (2025) [819] construct
multiple linear regression models to examine the Porter hypothesis in environmental taxation, re-
vealing complex interdependencies between tax policies and corporate behavior. This application
of regression provides empirical support for policy decisions aimed at balancing environmental sus-
tainability with economic growth. Hasan and Ghosal (2025) [821] extend this approach to public
health, using regression to quantify inequities in healthcare access across different regions in West
Bengal, India. Their findings contribute to the formulation of equitable healthcare policies by iden-
tifying key determinants of accessibility, such as affordability and geographical distribution.
Agricultural and biomedical research have also leveraged linear regression for predictive modeling
and diagnostic improvements. Zeng et al. (2025) [822] employ LASSO (Least Absolute Shrinkage
and Selection Operator) regression to enhance maize yield predictions under stress conditions. By
integrating regression with remote sensing data, they improve crop forecasting models, demonstrat-
ing the method’s importance in precision agriculture. In orthopedic research, Baird et al. (2025)
[823] use regression to analyze long-term trends in surgical procedures, particularly in anterior cru-
ciate ligament reconstruction and hip arthroscopy. Their work demonstrates how regression-driven
AI analysis can refine medical decision-making by identifying treatment patterns and patient out-
comes over time.
Finally, the application of regression in veterinary and agronomic sciences further highlights its
interdisciplinary utility. Overton and Eicker (2025) [824] analyze milk production and fertility
trends in Holstein dairy cows using a combination of linear and logistic regression models. Their
study provides critical insights into dairy farm efficiency, showing how regression techniques can
optimize breeding programs and livestock management strategies. Collectively, these studies illus-
trate the remarkable adaptability of linear regression, as it continues to evolve through integration
with machine learning and modern statistical techniques, further solidifying its relevance in both
academic research and practical applications.
(Figure: In linear regression, an underlying relationship exists between the dependent variable y and the independent variable x, with the observed data points resulting from random variations around this trend.)
Linear regression is a fundamental statistical method used to model the relationship between a
dependent variable y and one or more independent variables x1 , x2 , . . . , xn . The goal of linear
regression is to determine the best-fitting linear function that describes this relationship by min-
imizing the error between the predicted values and the actual observations. In its most general form, the model is expressed as

y = β_0 + β_1 x_1 + β_2 x_2 + \cdots + β_n x_n + ϵ, \quad (15.124)

where y is the dependent variable, x_i are the independent variables, β_0 is the intercept term, β_i are the regression coefficients corresponding to each independent variable, and ϵ represents the error term, which accounts for the variability in y that is not explained by the independent variables. If there is only one independent variable, the model reduces to simple linear regression:
y = β0 + β1 x + ϵ. (15.125)
This is an example of simple linear regression, characterized by having only one independent
variable.
In matrix notation, stacking the m observations into a design matrix X and response vector y, the model can be written compactly as

y = Xβ + ϵ. \quad (15.127)
The fundamental objective of linear regression is to estimate the coefficient vector β in such a way
that the sum of squared errors (SSE) is minimized. The sum of squared errors is given by
S(β) = \sum_{i=1}^{m} ϵ_i^2 = \sum_{i=1}^{m} (y_i - x_i^T β)^2. \quad (15.128)
To find the optimal β, we take the derivative of S(β) with respect to β and set it equal to zero:

\frac{\partial S(β)}{\partial β} = -2X^T(y - Xβ) = 0, \quad (15.130)

which yields the normal equations

X^T X β = X^T y. \quad (15.131)

Provided X^T X is invertible, the least squares estimator is \hat{β} = (X^T X)^{-1} X^T y.
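In code, the normal equations (15.131) amount to a single linear solve; the synthetic design matrix below is an assumption for illustration, and in practice a QR- or SVD-based least-squares routine is numerically preferable to forming X^T X explicitly.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 100, 3
X = np.column_stack([np.ones(m), rng.normal(size=(m, n))])  # intercept column
beta_true = np.array([2.0, 1.0, -0.5, 0.3])
y = X @ beta_true + 0.1 * rng.normal(size=m)

# Solve the normal equations X^T X beta = X^T y (15.131).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Better-conditioned alternative: least squares via SVD.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
print(beta_lstsq)   # both are close to beta_true
```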
(Figure: An example of cubic polynomial regression, which falls under the category of linear regression. Although it models data with a curved function, it is considered statistically linear because the regression function E(y|x) depends linearly on the unknown parameters being estimated; polynomial regression is therefore a specific form of multiple linear regression.)
This formulation also allows for Bayesian interpretations, where a prior distribution is placed on β and updated using observed data to obtain a posterior distribution, leading to Bayesian linear regression.
Subsequent research focused on model assessment and diagnostic methods to ensure the reliability
of logistic regression results. Haberman (1990) [782] rigorously developed deviance-based tests,
allowing analysts to assess goodness-of-fit and compare nested models using likelihood ratio tests.
This work introduced the notion of model adequacy beyond traditional R-squared metrics used in
linear regression. Similarly, Hosmer and Lemeshow (1980) [783] introduced the Hosmer-Lemeshow
test, which rigorously quantified calibration errors by partitioning observations into quantile groups
and comparing predicted and observed probabilities. Their methodology provided a practical yet
statistically rigorous tool for validating logistic regression models in real-world applications. The
rigorous mathematical structure of logistic regression was further explored by McCullagh (2019)
[784], who formalized the theory of iteratively reweighted least squares (IRLS) for estimating model
parameters and proved asymptotic properties of logistic regression coefficients. Their contributions
solidified logistic regression as a theoretically sound and computationally efficient method in sta-
tistical modeling.
With the increasing complexity of datasets and computational advancements, logistic regression
was further extended into machine learning and regularization frameworks. Hastie, Tibshirani,
and Friedman (2001) [130] rigorously analyzed logistic regression as a classification tool, compar-
ing it with support vector machines (SVMs) and decision trees. They introduced regularization
techniques, such as ridge regression (L2 penalty) and LASSO (L1 penalty), to combat issues of over-
fitting and multicollinearity. In parallel, Green and Silverman (1994) expanded logistic regression
to nonparametric settings, developing methods for spline smoothing and kernel-based regression,
allowing the model to capture nonlinear relationships without relying on rigid parametric assump-
tions. Another significant computational advancement came from Firth (1993) [785], who proposed
a bias reduction method using penalized maximum likelihood estimation (PMLE). This technique,
known as Firth’s logistic regression, rigorously mitigates the small-sample bias that arises in rare
event classification by modifying the score equations of MLE, thereby preventing infinite estimates
in perfectly separated data.
Finally, extensions of logistic regression into hierarchical and Bayesian settings have been rigor-
ously explored to account for structured data and uncertainty in parameter estimation. King and
Zeng (2001) [786] addressed a fundamental issue in rare event data, where traditional logistic regres-
sion tends to underestimate event probabilities due to data imbalance. Their corrected probability
estimation approach adjusted the likelihood function to provide more accurate parameter estimates
in imbalanced datasets, such as those found in epidemiology and fraud detection. Gelman and Hill
(2007) [787] extended logistic regression into Bayesian hierarchical models, introducing random
effects logistic regression to handle grouped data structures. They rigorously developed Bayesian
priors for logistic regression coefficients, allowing for shrinkage estimation and improved param-
eter stability in small-sample settings. These contributions have collectively transformed logistic
regression from a simple binary classification method into a statistically rigorous, computationally
efficient, and widely applicable tool used in fields ranging from biomedical sciences to artificial
intelligence.
Reference and contribution:
Cox (1958): Formulated logistic regression using the logit function and MLE.
Nelder & Wedderburn (1972): Introduced GLMs, formalizing logistic regression in a unified statistical framework.
Haberman (1973): Developed deviance tests for logistic regression goodness-of-fit.
Hosmer & Lemeshow (1980): Introduced the Hosmer-Lemeshow test for model calibration.
McCullagh & Nelder (1983): Provided a rigorous theoretical foundation for GLMs, including logistic regression.
Hastie, Tibshirani & Friedman (2001): Discussed regularization methods and logistic regression in the context of statistical learning.
Green & Silverman (1994): Extended logistic regression to nonparametric settings.
Firth (1993): Proposed bias reduction techniques for small-sample logistic regression (Firth's correction).
King & Zeng (2001): Addressed logistic regression's biases in rare event data.
Gelman & Hill (2007): Developed Bayesian and hierarchical logistic regression models.
In medical research, logistic regression plays a crucial role in identifying risk factors for diseases
and treatment outcomes. Waller et al. (2025) [790] utilized logistic regression to examine the
association between maternal diarrhea during the periconceptional period and birth defects, iden-
tifying key risk factors for fetal abnormalities. Similarly, Beyeler et al. (2025) [792] investigated
the susceptibility vessel sign (SVS) in stroke patients undergoing thrombectomy, demonstrating
that certain SVS characteristics significantly influence treatment efficacy. Another notable study
by Yedavalli et al. (2025) [793] applied logistic regression to assess the relationship between hy-
poperfusion intensity and stroke recovery, revealing that a hypoperfusion intensity ratio below 0.4
correlates with favorable patient outcomes. These findings are critical in guiding clinical decisions
for ischemic stroke treatment. Additionally, Yang et al. (2025) [795] explored the link between
left ventricular function and cerebral small vessel disease, showing that cardiac health can serve
as a biomarker for neurological risks. This research advances our understanding of cardiovascular-
neurological interactions and their implications for stroke prevention.
In conclusion, the recent literature showcases the far-reaching applications of logistic regression,
from public health and medical research to environmental science and behavioral studies. Whether
identifying predictors of contraceptive use, evaluating ecological selection mechanisms, or assessing
treatment outcomes in stroke patients, logistic regression remains a powerful statistical tool. Its
ability to model complex relationships and quantify risk factors makes it indispensable in research
and decision-making across disciplines. As studies continue to refine logistic regression methodolo-
gies and incorporate advanced modeling techniques, its utility in predictive analytics and policy
formulation will only expand. These advancements underscore the necessity of rigorous statistical
approaches in tackling real-world challenges, ultimately driving more effective and data-informed
solutions.
Study and main contribution:
Sani et al. (2025): Logistic regression used to study sociodemographic predictors of contraception adoption.
Dorsey et al. (2025): Assesses how visual exposure influences shorebird nesting behavior.
Slawny et al. (2025): Examined how language dominance affects bilingual family communication.
Waller et al. (2025): Identifies birth defect risk factors using logistic regression analysis.
Beyeler et al. (2025): Evaluated the impact of vessel characteristics on stroke interventions.
Yedavalli et al. (2025): Uses logistic regression to assess hypoperfusion intensity ratio and stroke outcomes.
(Figure: A logistic regression curve fitted to data, depicting how the estimated probability of passing an exam, a binary dependent variable, varies with the number of hours spent studying, a single independent variable.)
P(y = 1 \mid x) = σ(w^\top x + b), \quad (15.137)
where w = (w_1, w_2, \ldots, w_n) is the weight vector, b is the bias term, and σ(z) is the sigmoid function given by:

σ(z) = \frac{1}{1 + e^{-z}}. \quad (15.138)
P (y = 0 | x) = 1 − P (y = 1 | x) = 1 − σ(w⊤ x + b) (15.139)
The logistic regression model is trained by maximizing the likelihood function, which represents the
probability of observing the given set of labeled data points (xi , yi ) for i = 1, . . . , m. Assuming that
the training examples are independently and identically distributed (i.i.d.), the likelihood function
is:

L(w, b) = \prod_{i=1}^{m} P(y_i \mid x_i). \quad (15.140)

Expanding this using the probabilities defined above, the likelihood function is:

L(w, b) = \prod_{i=1}^{m} σ(w^\top x_i + b)^{y_i} \left(1 - σ(w^\top x_i + b)\right)^{1 - y_i}. \quad (15.141)
Instead of maximizing the likelihood, it is more convenient to maximize the log-likelihood function:

ℓ(w, b) = \sum_{i=1}^{m} \left[ y_i \log σ(w^\top x_i + b) + (1 - y_i) \log\left(1 - σ(w^\top x_i + b)\right) \right]. \quad (15.142)
To find the optimal values of w and b, we compute the gradient of the log-likelihood function. The derivative of the sigmoid function is:

σ'(z) = σ(z)(1 - σ(z)). \quad (15.143)

Using this identity, the partial derivatives of the log-likelihood are

\frac{\partial ℓ}{\partial w_j} = \sum_{i=1}^{m} \left( y_i - σ(w^\top x_i + b) \right) x_{ij}, \quad (15.144)

\frac{\partial ℓ}{\partial b} = \sum_{i=1}^{m} \left( y_i - σ(w^\top x_i + b) \right). \quad (15.145)

To optimize w and b, we use gradient ascent, updating the parameters iteratively as follows:

w_j^{(t+1)} = w_j^{(t)} + α \sum_{i=1}^{m} \left( y_i - σ(w^{(t)\top} x_i + b^{(t)}) \right) x_{ij}, \quad (15.146)

b^{(t+1)} = b^{(t)} + α \sum_{i=1}^{m} \left( y_i - σ(w^{(t)\top} x_i + b^{(t)}) \right). \quad (15.147)
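A minimal sketch of the gradient-ascent updates (15.146)-(15.147) on synthetic data follows; the data-generating parameters are illustrative, and the 1/m scaling of the gradient sums is simply folded into the step size α.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
w_true, b_true = np.array([2.0, -1.0]), 0.5
y = (rng.uniform(size=200) < sigmoid(X @ w_true + b_true)).astype(float)

w, b, alpha = np.zeros(2), 0.0, 0.5
for _ in range(2000):
    err = y - sigmoid(X @ w + b)        # the factor y_i - sigma(w.x_i + b)
    w += alpha * (X.T @ err) / len(y)   # update (15.146), scaled by 1/m
    b += alpha * err.mean()             # update (15.147), scaled by 1/m
print(w, b)   # approaches (w_true, b_true) up to sampling noise
```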
For multi-class classification, logistic regression is extended using the softmax function:

P(y = k \mid x) = \frac{e^{w_k^\top x + b_k}}{\sum_{j=1}^{K} e^{w_j^\top x + b_j}}. \quad (15.149)
Beyond statistical and machine learning applications, LDA has played a critical role in computer
vision and nonlinear classification problems. A significant application was introduced by Bel-
humeur et al. in 1997 [765], who developed the Fisherfaces method, leveraging LDA for robust
face recognition that outperformed Principal Component Analysis (PCA)-based methods under
varying lighting conditions and facial expressions. The scope of LDA was further expanded by
Mika et al. in 1999 [766] with the introduction of Kernel Fisher Discriminant Analysis (KFDA),
which mapped input data into a high-dimensional feature space via kernel functions before ap-
plying Fisher’s criterion, making LDA suitable for nonlinear classification problems. More recent
advancements, such as those by Ye and Yu in 2005 [767] and Sugiyama in 2007 [768], addressed
challenges in high-dimensional, low-sample-size problems and multimodal data distributions, re-
spectively. Ye’s Generalized Discriminant Analysis (GDA) provided theoretical solutions to issues
arising when within-class covariance matrices are singular, while Sugiyama’s Local Fisher Dis-
criminant Analysis (LFDA) introduced a localized version of LDA that effectively preserves both
global and local structures in complex datasets. These refinements further strengthened LDA’s the-
oretical robustness and adaptability to real-world, high-dimensional, and structured data scenarios.
Thus, the trajectory of LDA research has evolved from Fisher’s statistical classification framework
to high-dimensional machine learning applications, bridging the gap between classical multivari-
ate statistics and modern computational intelligence. The method has been rigorously analyzed,
generalized, and extended across various domains, ensuring its continued relevance in statistical
learning, pattern recognition, and artificial intelligence.
One recent study applied LDA to examine how perceived discrimination affects blood pressure in Black mothers, showcasing its application
in behavioral sciences. On the other hand, Singh et al. (2025) [773] employed LDA as a dimen-
sionality reduction technique in facial expression recognition using CNN-BiLSTM networks. Their
findings highlighted LDA’s ability to enhance deep learning performance by filtering discrimina-
tive features while reducing computational complexity. Akter et al. (2025) [774] further explored
LDA’s capabilities in food quality assessment, comparing it with SVM and PLS-DA for detecting
fruit surface defects via hyperspectral imaging. The study concluded that LDA, although effective,
may be outperformed by more complex models when dealing with high-dimensional spectral data.
Beyond human-centered applications, Feng et al. (2025) [775] demonstrated LDA's utility in on-
cology, classifying causes of death in colorectal and lung cancer patients, reinforcing its relevance
in medical prognosis and predictive analytics. Similarly, Chick et al. (2025) [777] utilized LDA
in microbiome analysis, identifying bacterial strains associated with gut inflammation in broiler
chickens, showcasing its efficacy in bioinformatics and microbial classification. Meanwhile, Miao et
al. (2025) [778] introduced an LDA-PCA hybrid model for breast cancer molecular subtyping, illus-
trating its role in cancer diagnostics when combined with spectral imaging data. Finally, Rohan et
al. (2025) [779] compared LDA with ensemble AI techniques for heart disease prediction, revealing
that while LDA remains a strong statistical classifier, modern ensemble models often outperform it
in complex, non-linear data environments. In summary, these studies reinforce LDA’s adaptability
across disciplines, from psychophysiology and medicine to machine learning and geophysics. While
LDA remains a powerful tool for classification and dimensionality reduction, recent advancements
suggest that it performs best when integrated with more sophisticated models like CNNs, ran-
dom forests, and ensemble learning techniques. Its continued application in high-impact research
areas highlights its relevance in both traditional statistical analysis and contemporary AI-driven
methodologies.
Figure 15.11: Linear discriminant analysis on a two-dimensional space with two classes. The true data-generating parameters define the Bayes boundary, whereas the realized data points are used to estimate the boundary.
To achieve this, LDA constructs two scatter matrices: the within-class scatter matrix S_W and the between-class scatter matrix S_B. The within-class scatter matrix is defined as the sum of the covariance matrices of each class:

S_W = \sum_{c=1}^{C} \sum_{x_i \in X_c} (x_i - μ_c)(x_i - μ_c)^T, \quad (15.152)

where

μ_c = \frac{1}{N_c} \sum_{x_i \in X_c} x_i \quad (15.153)

is the mean of class c and N_c is the number of samples in class c. The between-class scatter matrix is defined as

S_B = \sum_{c=1}^{C} N_c (μ_c - μ)(μ_c - μ)^T. \quad (15.154)
The objective of LDA is to maximize the ratio of the determinant of S_B to the determinant of S_W in the projected space, which leads to the following optimization problem:

W^* = \arg\max_{W} \frac{\det(W^T S_B W)}{\det(W^T S_W W)}. \quad (15.156)

This is solved by finding the eigenvectors w_i corresponding to the largest eigenvalues λ_i of the generalized eigenvalue problem:

S_B w = λ S_W w. \quad (15.157)

Since S_W is symmetric and positive definite under typical conditions, it is invertible, leading to the equivalent formulation:

S_W^{-1} S_B w = λ w. \quad (15.158)
The eigenvectors corresponding to the largest C − 1 eigenvalues form the columns of W, giving
the optimal projection that maximizes class separability in the lower-dimensional space. The
transformed feature vectors are then given by
yi = WT xi . (15.159)
The projection preserves the information necessary for classification while reducing dimensionality.
Given a new sample x, classification can be performed using a simple distance metric such as the
Mahalanobis distance:

d_c(x) = \left(W^T x - W^T μ_c\right)^T Σ_c^{-1} \left(W^T x - W^T μ_c\right), \quad (15.160)

where Σ_c is the covariance matrix of the projected class distribution. The decision rule is then:

\hat{y} = \arg\min_{c} d_c(x). \quad (15.161)
When the class distributions are assumed to be Gaussian with identical covariances, LDA can be
interpreted as finding the optimal decision boundary that minimizes classification error under the
Bayes decision framework. In the two-class case, LDA reduces to Fisher's Linear Discriminant, where the optimal direction w is obtained as

w = S_W^{-1}(μ_1 - μ_2), \quad (15.162)

and classification is performed by thresholding the projection against a scalar b:

w^T x = b. \quad (15.163)
The prior probability of class c is denoted as P(y = c) = \frac{N_c}{N}, where N_c is the number of samples in class c and N is the total number of samples. The global mean vector of the dataset is given by

μ = \frac{1}{N} \sum_{i=1}^{N} x_i. \quad (15.166)
For the purpose of classification, the goal of LDA is to maximize the separability between different classes while minimizing the variance within each class. This is achieved by defining two scatter matrices: the within-class scatter matrix and the between-class scatter matrix. The within-class scatter matrix is given by

S_W = \sum_{c=1}^{C} \sum_{x_i \in C_c} (x_i - μ_c)(x_i - μ_c)^T. \quad (15.167)
The total scatter matrix decomposes as

S_T = S_W + S_B. \quad (15.171)

To find the optimal transformation, LDA seeks to maximize the following objective function:

J(W) = \operatorname{tr}\left( (W^T S_W W)^{-1} (W^T S_B W) \right).

The solution to this maximization problem involves solving the generalized eigenvalue problem

S_B w = λ S_W w, \quad (15.174)

or equivalently

S_W^{-1} S_B w = λ w. \quad (15.175)
Since the rank of S_B is at most C - 1, there are at most C - 1 nonzero eigenvalues, implying that the optimal dimensionality of the transformed space is at most C - 1. To extract the projection matrix W, we sort the eigenvectors of S_W^{-1} S_B by decreasing eigenvalue. The optimal projection matrix is then

W = [w_1, w_2, \ldots, w_{C-1}]. \quad (15.176)
Once the transformation is applied, the projected data points are given by

z = W^T x. \quad (15.177)

(Image credits: Amélia Oliveira Freitas da Silva, own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=104693008 and https://commons.wikimedia.org/w/index.php?curid=104693007)
The classification rule in the transformed space follows from Bayes' theorem, assigning a projected sample z to the class with the largest posterior probability. When the class covariances are equal, this results in a linear decision boundary, given by

\left( W^T (μ_c - μ_{c'}) \right)^T z - \frac{1}{2} \left( μ_c^T W W^T μ_c - μ_{c'}^T W W^T μ_{c'} \right) = 0. \quad (15.181)

Alternatively, when the class covariances differ, the boundary is quadratic, requiring solving

(z - μ_c)^T Σ_c^{-1} (z - μ_c) - (z - μ_{c'})^T Σ_{c'}^{-1} (z - μ_{c'}) = 0, \quad (15.182)

where μ_c and Σ_c here denote the mean and covariance of class c in the projected space.
The relationship between LDA and Fisher's Discriminant Analysis is seen by considering the case C = 2, where maximizing the ratio

J(w) = \frac{w^T S_B w}{w^T S_W w} \quad (15.183)

yields an equivalent generalized eigenvalue problem. This follows from the trace expansion

J(W) = \sum_{i=1}^{C-1} λ_i. \quad (15.184)
In regularized variants, $S_W$ is replaced by $S_W + \alpha I$, where α is a small positive constant. The
overall asymptotic properties of LDA follow from the concentration of measure in high-dimensional
settings, where the expected misclassification rate follows a chi-squared distribution under
Gaussianity.
For large sample sizes, the classification error converges to the Bayes error rate, which can be
expressed in terms of the Mahalanobis distance between class means:
$P_{\mathrm{error}} \approx \frac{1}{2} Q\left(\sqrt{\sum_{c=1}^{C} \sum_{c' \neq c} \frac{N_c N_{c'}}{N_c + N_{c'}} (\mu_c - \mu_{c'})^T \Sigma^{-1} (\mu_c - \mu_{c'})}\right)$, (15.187)
where $Q(\cdot)$ denotes the Gaussian tail function.
Consider a finite-sample perturbation $S_B + E$ of the between-class scatter matrix, where E
represents a small perturbation in $S_B$ due to finite-sample effects. Using first-order perturbation
theory, the perturbed eigenvector satisfies
$w_i^* = w_i + \sum_{j \neq i} \frac{w_j^T E w_i}{\lambda_i - \lambda_j} w_j + O(\epsilon^2)$. (15.191)
The principal angle Θ between the estimated and true discriminant subspaces vanishes
asymptotically; thus, the empirical discriminant directions converge to the true optimal directions
up to a normalization factor. The generalization error scales as
$\mathcal{E} \approx \frac{C - 1}{N} \sum_{i=1}^{C-1} \frac{1}{\lambda_i}$. (15.201)
This indicates that when the smallest discriminative eigenvalue λmin is small, LDA exhibits poor
generalization.
Exact Asymptotic Bounds for the Misclassification Probability: The Bayes error rate for
multiclass classification is characterized by the probability that a sample from class c is assigned to
a different class c′ . In LDA, this error is asymptotically governed by the generalized Mahalanobis
distance between class means, given by
$D_{cc'}^2 = (\mu_c - \mu_{c'})^T \Sigma^{-1} (\mu_c - \mu_{c'})$.
For asymptotically optimal classifiers, the misclassification probability can be approximated using
the Chernoff bound:
$P_{\mathrm{error}} \leq \sum_{c \neq c'} \exp\left(-\frac{1}{8} D_{cc'}^2\right)$.
In the high-dimensional limit where d ≫ N , we incorporate the Marčenko-Pastur correction for
the empirical covariance matrix SW , leading to the refined bound
$P_{\mathrm{error}} \leq \sum_{c \neq c'} \exp\left(-\frac{1}{8} \cdot \frac{D_{cc'}^2}{1 + \gamma}\right)$,
where γ = d/N is the dimensionality ratio. As γ → 1, the error rate approaches that of random
guessing due to the eigenvalue collapse of SW .
Further Spectral Properties: To fully characterize LDA, we analyze the spectral structure of
the Fisher matrix
$F = S_W^{-1} S_B$.
Since SB has at most rank C − 1, its eigenvalue spectrum consists of C − 1 nonzero eigenvalues,
denoted as λ1 , . . . , λC−1 , and a bulk of zero eigenvalues. The total discriminative variance captured
by LDA is given by the sum
$\sum_{i=1}^{C-1} \lambda_i$.
Using random matrix perturbation theory, we quantify the stability of the eigenvalues under sam-
pling noise. For small perturbations E in SB , the first-order correction to each eigenvalue satisfies
$\lambda_i^* = \lambda_i + w_i^T E w_i + O(\|E\|^2)$.
The condition number of the Fisher matrix is
$\kappa(F) = \frac{\lambda_{\max}(S_W^{-1} S_B)}{\lambda_{\min}(S_W^{-1} S_B)}$.
For high-dimensional data, the smallest eigenvalue λmin obeys the Tracy-Widom law, leading to an
expected conditioning bound
$\kappa(F) \approx \frac{(1 + \sqrt{\gamma})^2}{(1 - \sqrt{\gamma})^2}$.
Thus, LDA becomes ill-conditioned as d/N → 1, reducing its effectiveness.
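This conditioning collapse is easy to observe numerically. The following sketch (a hedged illustration assuming isotropic Gaussian data with σ = 1; sample sizes and the number of trials are arbitrary choices) compares the empirical eigenvalue spread of a sample covariance with the Marčenko-Pastur edge prediction $(1 + \sqrt{\gamma})^2 / (1 - \sqrt{\gamma})^2$.

```python
import numpy as np

def mean_condition_number(d, N, trials=20, seed=0):
    """Average lambda_max / lambda_min of a sample covariance (population cov = I)."""
    rng = np.random.default_rng(seed)
    ratios = []
    for _ in range(trials):
        X = rng.normal(size=(N, d))
        S = X.T @ X / N                      # empirical covariance estimate
        ev = np.linalg.eigvalsh(S)           # eigenvalues in ascending order
        ratios.append(ev[-1] / ev[0])
    return float(np.mean(ratios))

N = 400
for gamma in (0.1, 0.5, 0.9):
    d = int(gamma * N)
    mp_prediction = (1 + np.sqrt(gamma)) ** 2 / (1 - np.sqrt(gamma)) ** 2
    print(f"gamma={gamma}: empirical {mean_condition_number(d, N):9.1f}  "
          f"MP edges {mp_prediction:9.1f}")
```

As γ grows toward 1, both the empirical ratio and the Marčenko-Pastur prediction blow up, matching the ill-conditioning described above.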
Random Matrix Theory Perspective: From a random matrix theory (RMT) viewpoint, we
analyze the spectrum of $S_W^{-1} S_B$. Assuming that the entries of X are Gaussian-distributed, the
eigenvalue distribution of $S_W$ follows the Marčenko-Pastur law:
$\rho(\lambda) = \frac{1}{2\pi \sigma^2 \gamma \lambda} \sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}, \quad \lambda \in [\lambda_-, \lambda_+]$,
where $\lambda_\pm = \sigma^2 (1 \pm \sqrt{\gamma})^2$.
For the generalized eigenvalue problem
$S_B w = \lambda S_W w$,
the spectral properties of $S_W^{-1} S_B$ follow from the spiked covariance model:
$\lambda_i = \begin{cases} \dfrac{(\beta_i + 1)(\beta_i + \gamma)}{\beta_i}, & \text{if } \beta_i > \sqrt{\gamma}, \\ 1 + O(N^{-1/2}), & \text{otherwise}, \end{cases}$
where $\beta_i = \frac{1}{d} w_i^T S_B w_i$ represents the population discriminability. Thus, when $\beta_i$ is small, LDA fails
to extract meaningful directions.
Exact Rate of Eigenvalue Concentration: For a given sample covariance matrix $S_W$, we
analyze the eigenvalue concentration of $S_W^{-1} S_B$ by considering the extreme eigenvalues in the
high-dimensional regime $d, N \to \infty$ with $\gamma = d/N$ fixed. Using random matrix theory (RMT),
the empirical eigenvalues $\lambda_i$ of $S_W^{-1} S_B$ satisfy the Baik-Ben Arous-Péché (BBP) phase transition:
$\lambda_i = \begin{cases} \theta_i + \dfrac{\gamma \sigma^2}{\theta_i}, & \text{if } \theta_i > \sqrt{\gamma}, \\ (1 + \sqrt{\gamma})^2 + O(N^{-2/3}), & \text{otherwise}. \end{cases}$
Here $F_1(t)$ denotes the Tracy-Widom distribution of order 1, which governs the fluctuations of these
extreme eigenvalues. This implies that the eigenvalues are tightly concentrated around their expected
values, with deviations of order $N^{-2/3}$, ensuring that the Fisher discriminant ratio remains stable
in sufficiently large dimensions.
For high-dimensional LDA, the error is governed by the effective rank of $S_W$, defined as
$r_{\mathrm{eff}} = \frac{\mathrm{Tr}(S_W)}{\lambda_{\max}(S_W)}$.
If $r_{\mathrm{eff}} \ll d$, then the generalization error degrades. Using local Rademacher complexities, we obtain
the bound
$R(h) \leq \hat{R}(h) + O\left(\sqrt{\frac{d \log C}{N}}\right)$,
showing that LDA's performance deteriorates if $d \gg N$, unless regularization is applied.
The Naı̈ve Bayes classifier has undergone significant theoretical and empirical development since
its early applications in probabilistic reasoning and information retrieval. Maron (1961) [741] first
introduced Bayesian probability in automatic indexing, demonstrating how probabilistic inference
could effectively classify documents. Around the same time, Minsky (1961) [742] explored proba-
bilistic models for artificial intelligence, discussing how Bayesian approaches, including Naı̈ve Bayes,
could be leveraged in pattern recognition tasks. Mosteller and Wallace (1963) [743] provided one
of the earliest large-scale applications of Bayesian methods to text classification, where they used
Bayesian inference to determine the authorship of the Federalist Papers, setting a precedent for
later advancements in text analytics and authorship verification. These foundational studies firmly
established the utility of Naı̈ve Bayes in probabilistic reasoning, even before modern machine learn-
ing frameworks popularized it.
Subsequent studies rigorously investigated the theoretical underpinnings of the classifier, partic-
ularly concerning its surprising effectiveness despite the strong feature independence assumption.
Domingos and Pazzani (1997) [744] mathematically analyzed the performance of Naı̈ve Bayes under
zero-one loss and proved that the classifier could be optimal even when attributes exhibit strong
dependencies. Their work provided theoretical justification for why Naı̈ve Bayes performs well in
many real-world settings. Hand and Yu (2001) [745] further argued that, despite its simplicity,
Naı̈ve Bayes remains highly competitive in classification tasks. They analyzed its robustness and
derived theoretical explanations for its empirical success, showing that when attributes are con-
ditionally independent given the class, Naı̈ve Bayes achieves optimal classification performance.
Rish (2001) [746] expanded upon this by conducting an extensive empirical study, identifying the
specific conditions under which Naı̈ve Bayes fails and where it excels. This research solidified the
classifier’s status as a baseline method that often performs remarkably well in practical applications.
Further refinements and comparative studies explored the limits of the independence assump-
tion and its effect on classification accuracy. Ng and Jordan (2002) [747] rigorously compared
Naı̈ve Bayes with logistic regression, demonstrating that while Naı̈ve Bayes converges more rapidly
with fewer training samples, logistic regression achieves superior asymptotic performance. This
highlighted the fundamental trade-off between generative and discriminative classifiers. Webb,
Boughton, and Wang (2005) [748] proposed the Averaged One-Dependence Estimators (AODE),
which relaxes the independence assumption by allowing each attribute to depend on one other
attribute. Their study provided a pathway for enhancing Naı̈ve Bayes through partial dependency
modeling, improving classification accuracy while retaining computational efficiency. Similarly,
Boulle (2007) [749] introduced a compression-based Bayesian regularization technique that selects
the most probable subset of features while adhering to the Naı̈ve Bayes assumption, mitigating
overfitting and improving generalization performance. Larsen and Aone (1999) [750] also extended
the classifier’s capabilities by introducing a refined document clustering framework, reducing bias
in class probability estimation.
Collectively, these contributions illustrate the extensive theoretical and empirical development of
the Naı̈ve Bayes classifier, from its foundational probabilistic framework to sophisticated refine-
ments addressing its assumptions. The classifier’s resilience, despite its simplifying assumptions,
has been well-documented, and various studies have sought to either explain or mitigate its limi-
tations. These advancements have cemented Naı̈ve Bayes as a cornerstone of statistical learning,
particularly in text classification, medical diagnosis, spam filtering, and other probabilistic reason-
ing applications. While modern machine learning methods have evolved beyond Naı̈ve Bayes in
many domains, its efficiency, interpretability, and theoretical elegance ensure its continued relevance.
Usman et al. (2025) [734] investigated the use of Naı̈ve Bayes in retail sales prediction but found
that the classifier performed poorly due to its inability to handle complex dependencies among prod-
uct features. Similarly, Shannaq (2025) [751] explored its effectiveness in Arabic text classification,
showing that while Naı̈ve Bayes achieves 85-90% accuracy on large datasets, its performance de-
grades significantly for smaller datasets. This limitation is attributed to the classifier’s reliance on
feature independence assumptions, which often do not hold in real-world scenarios. Goldstein
et al. (2025) [752] further investigated the classifier’s application in geotechnical characterization,
concluding that Naı̈ve Bayes is unreliable in high-uncertainty environments. This study high-
lights the model’s inadequacy for engineering applications where probabilistic reasoning must be
robust.
Beyond traditional applications, researchers have evaluated Naı̈ve Bayes in transportation modeling,
facial recognition, and fault detection. Ntamwiza and Bwire (2025) [753] compared it
to ensemble models for predicting biking preferences, concluding that Naı̈ve Bayes underperformed
due to its inability to capture complex, non-linear relationships. In the domain of facial
recognition, El Fadel (2025) [754] found that Naı̈ve Bayes struggles with high-dimensional im-
age data, making it significantly less effective than deep learning-based classifiers. Meanwhile,
RaviKumar et al. (2025) [755] examined its performance in fault diagnosis for electric vehicles
(EVs), demonstrating that while Naı̈ve Bayes showed moderate success, it was inferior to deep
learning methods due to its oversimplified assumptions.
Naı̈ve Bayes has been widely employed in text and sentiment analysis, but with mixed results.
Kavitha et al. (2025) [756] applied it to fake review detection and found that while computation-
ally efficient, the model is susceptible to misclassification in the presence of sarcasm or linguis-
tic subtleties. Nusantara (2025) [757] explored its use in Twitter sentiment analysis for banking
services, concluding that it works well for binary classification but struggles with multi-class
sentiment analysis. Ahmadi et al. (2025) [758] tested Naı̈ve Bayes for SMS spam detection in
cybersecurity and found that while it provides a strong baseline, it is outperformed by deep
learning models that can capture semantic meaning.
The classifier’s application in medical and security domains further highlights its strengths and
weaknesses. Takaki et al. (2025) [759] tested it in an AI-assisted respiratory classification system
for chest X-rays, concluding that while it is computationally efficient, its accuracy is signifi-
cantly lower than deep learning methods such as EfficientNet and GoogleNet. Similarly, Abdullahi
et al. (2025) [739] examined its effectiveness in IoT attack detection, showing that its performance
was near random chance (AUC ≈ 0.50). These findings reinforce that while Naı̈ve Bayes remains
a viable choice for baseline comparisons and simple applications, it is increasingly outclassed
by advanced machine learning methods.
While the Naı̈ve Bayes classifier is a valuable tool due to its computational efficiency, inter-
pretability, and ease of implementation, its reliance on independence assumptions makes
it unsuitable for complex, high-dimensional data. Across various applications, including text classi-
fication, fault diagnosis, sentiment analysis, and medical imaging, Naı̈ve Bayes has been shown to
be effective primarily in structured, low-dimensional datasets. However, as data complexity grows,
more sophisticated models such as deep learning and ensemble classifiers significantly outperform
it. Future research should focus on hybrid approaches that integrate Naı̈ve Bayes with deep
learning models to improve its robustness in complex settings.
Since the denominator P (x) is independent of y, the classifier assigns x to the class maximizing
the numerator, i.e.,
$\hat{y} = \arg\max_{y \in \{C_1, C_2, \ldots, C_k\}} P(x \mid y) P(y)$. (15.204)
The Naı̈ve Bayes assumption posits that the features $x_i$ are conditionally independent given y, such
that
$P(x \mid y) = \prod_{i=1}^{n} P(x_i \mid y)$. (15.205)
Thus, the decision rule simplifies to
$\hat{y} = \arg\max_{y \in \{C_1, C_2, \ldots, C_k\}} P(y) \prod_{i=1}^{n} P(x_i \mid y)$. (15.206)
The estimation of P (y) and P (xi | y) depends on the type of data. For categorical data, the
probabilities are estimated using the frequency counts:
$P(y) = \frac{\mathrm{count}(y)}{\text{total samples}}$, (15.208)
$P(x_i \mid y) = \frac{\mathrm{count}(x_i, y)}{\mathrm{count}(y)}$. (15.209)
For continuous data, a common approach is to model P (xi | y) as a Gaussian distribution with
parameters estimated from the training data:
$P(x_i \mid y) = \frac{1}{\sqrt{2\pi \sigma_{i,y}^2}} \exp\left(-\frac{(x_i - \mu_{i,y})^2}{2\sigma_{i,y}^2}\right)$, (15.210)
where $\mu_{i,y}$ and $\sigma_{i,y}^2$ are the mean and variance of feature $x_i$ for class y, computed as
$\mu_{i,y} = \frac{1}{N_y} \sum_{j=1}^{N_y} x_{i,j}$, (15.211)
$\sigma_{i,y}^2 = \frac{1}{N_y} \sum_{j=1}^{N_y} (x_{i,j} - \mu_{i,y})^2$. (15.212)
For robustness, Laplace smoothing is often applied in the categorical case, modifying the estimates
as
$P(x_i \mid y) = \frac{\mathrm{count}(x_i, y) + \alpha}{\mathrm{count}(y) + \alpha |V_i|}$, (15.213)
where α > 0 is the smoothing parameter and |Vi | is the number of possible values of xi . The
model is trained by computing these probabilities from the dataset and classifies new instances by
evaluating the logarithmic sum of probabilities across all classes.
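As an illustration of equations (15.206) and (15.210)-(15.212), here is a minimal Gaussian Naı̈ve Bayes sketch (a hypothetical NumPy implementation; the class name and variables are illustrative, not a reference library), which classifies by the logarithmic sum of probabilities.

```python
import numpy as np

class GaussianNaiveBayes:
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.priors, self.means, self.vars = {}, {}, {}
        for c in self.classes:
            Xc = X[y == c]
            self.priors[c] = len(Xc) / len(X)      # P(y) = N_c / N
            self.means[c] = Xc.mean(axis=0)        # mu_{i,y}     (15.211)
            self.vars[c] = Xc.var(axis=0) + 1e-9   # sigma^2_{i,y} (15.212)
        return self

    def predict(self, X):
        # log P(y) + sum_i log P(x_i | y), with Gaussian P(x_i | y) (15.210)
        scores = np.stack([
            np.log(self.priors[c])
            - 0.5 * np.sum(np.log(2 * np.pi * self.vars[c]))
            - 0.5 * np.sum((X - self.means[c]) ** 2 / self.vars[c], axis=1)
            for c in self.classes
        ], axis=1)
        return self.classes[np.argmax(scores, axis=1)]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 3)), rng.normal(3, 1, (100, 3))])
y = np.array([0] * 100 + [1] * 100)
model = GaussianNaiveBayes().fit(X, y)
print((model.predict(X) == y).mean())   # training accuracy on well-separated classes
```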
Beyond core decision tree algorithms, significant research has focused on improving their per-
formance through feature selection, ensemble learning, and data stream adaptation. Kohavi and
John’s Wrapper Method for Feature Selection (1997) [727] rigorously analyzed how feature selection
impacts decision tree accuracy, demonstrating that an optimal feature subset can significantly im-
prove predictive performance while reducing computational costs. Breiman’s Bagging (1996) [728]
further refined decision tree stability by introducing bootstrap aggregation, an ensemble method
that constructs multiple trees on bootstrapped samples and aggregates their predictions to mit-
igate variance. Freund and Schapire’s AdaBoost (1997) [729] revolutionized ensemble methods
by proposing an adaptive boosting strategy, where successive weak decision trees are trained on
reweighted datasets that emphasize misclassified instances. This approach led to strong general-
ization properties and became a cornerstone of ensemble-based decision tree methods.
Further methodological advancements have enabled decision trees to handle large-scale and stream-
ing data. Breiman’s Random Forests (2001) [730] combined the principles of bagging with random-
ized feature selection, ensuring diversity among individual trees and improving robustness against
overfitting. Domingos and Hulten’s Very Fast Decision Tree (VFDT) (2000) [731] algorithm ad-
dressed real-time data stream mining by using Hoeffding bounds to construct trees incrementally
while maintaining computational efficiency. Additionally, Freund and Mason’s Alternating De-
cision Tree (ADTree) (1999) [732] integrated decision trees with boosting techniques, producing
more compact and interpretable models that outperform traditional tree-based classifiers. Quin-
lan’s Oblique Decision Trees (1993) [733] expanded decision tree expressiveness by introducing
linear combination-based decision boundaries, overcoming the axis-aligned partitioning limitations
of classical decision trees.
Collectively, these contributions have rigorously enhanced decision tree learning by addressing
fundamental challenges related to feature selection, overfitting, scalability, and decision boundary
flexibility. The evolution from simple, greedy tree induction algorithms to sophisticated ensemble
and streaming methodologies has solidified decision trees as a powerful tool for machine learning.
The theoretical underpinnings of these developments have not only improved predictive accuracy
but also deepened the mathematical understanding of tree-based models, ensuring their continued
relevance in modern data-driven applications.
In medical and clinical applications, decision tree learning has shown substantial promise. Eili
et al. (2025) [737] developed a machine learning framework integrating decision trees with Markov
models to predict patient treatment pathways for traumatic brain injuries (TBI). This application
underscores the utility of decision trees in dynamic decision-making environments such as health-
care. Furthermore, Yin et al. (2025) [738] conducted a comparative analysis between decision trees
and logistic regression for predicting cancer treatment response, revealing that tree-based models
more effectively capture nonlinear relationships between biomarkers and treatment outcomes. Liu
et al. (2025) [707] extended this approach to the dental field, using decision trees to analyze bond
strength in lithium disilicate-reinforced ceramics, reinforcing the value of tree-based models in pre-
cision material engineering.
Beyond clinical settings, decision trees have been instrumental in environmental and agricultural
research. Barghouthi et al. (2025) [708] fused decision trees with K-nearest neighbors and extreme
gradient boosting to create a multi-channel predictive model for pressure injuries in hospitalized
patients, demonstrating the robustness of tree-based models in handling high-dimensional data. In
agronomy, Jewan (2025) [709] applied decision tree classifiers to predict crop yields in Bambara
groundnut and grapevines, effectively processing remote sensing data to model environmental in-
fluences on agricultural productivity. Similarly, Abdullahi et al. (2025) [739] explored the use of
decision trees in sound analysis for indoor localization, presenting a novel approach for integrating
hierarchical classification with feature extraction, proving its adaptability beyond traditional use
cases.
Lastly, Mokan et al. (2025) [740] illustrated the power of decision tree classifiers in medical imag-
ing by developing a model capable of segmenting retinal vasculature into arteries and veins. This
application demonstrates how decision trees can effectively handle pixel-wise classification tasks,
enhancing diagnostic precision in ophthalmology. Collectively, these studies showcase the wide
applicability and adaptability of decision tree learning, reinforcing its status as a versatile tool
across disciplines. The method’s ability to structure complex decision boundaries, interpret data
hierarchically, and integrate seamlessly with ensemble learning approaches ensures its continued
relevance in contemporary machine learning applications.
The diagram depicts passenger survival on the Titanic, with "sibsp" referring to the count of
spouses or siblings aboard. Beneath each leaf, the figures represent the probability of survival and
the percentage of passengers in that category. In essence, survival chances were higher for passengers
who were either (i) female or (ii) male, aged at most 9.5 years, and traveling with fewer than three
siblings.
A decision tree is constructed by greedily selecting the most informative feature at each step based
on a splitting criterion. The goal of a decision tree is to create a model that predicts the value of
a target variable by learning simple decision rules inferred
from data features. Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$, where each $x_i \in \mathbb{R}^d$ is a feature vector and
$y_i$ is the corresponding output (either categorical for classification or continuous for regression), a
decision tree recursively partitions $\mathbb{R}^d$ into disjoint regions $R_m$, where each region corresponds to a
leaf node containing a prediction for y. The function learned by the decision tree can be expressed
as a piecewise constant function:
$f(x) = \sum_{m=1}^{M} c_m \mathbb{1}(x \in R_m)$, (15.214)
where M is the total number of leaf nodes, $c_m$ is the prediction assigned to region $R_m$, and
$\mathbb{1}(x \in R_m)$ is an indicator function that is 1 if $x \in R_m$ and 0 otherwise. The splitting criterion for a
decision tree involves selecting the feature j and threshold s that best separate the data at each
step. For classification, the impurity of a node is measured using a criterion such as the Gini
impurity, defined as
$G(R) = \sum_{k=1}^{K} p_k (1 - p_k)$, (15.215)
where pk is the proportion of samples in region R belonging to class k, and K is the total number
of classes. Another common impurity measure is entropy, given by
$H(R) = -\sum_{k=1}^{K} p_k \log p_k$. (15.216)
For regression, the variance reduction is typically used, and the impurity at a node is given by the
mean squared error (MSE):
$\mathrm{MSE}(R) = \frac{1}{|R|} \sum_{x_i \in R} (y_i - \bar{y}_R)^2$, (15.217)
where ȳR is the mean of the target values in region R. The optimal split is found by maximizing
the information gain, which is computed as
$\Delta I = I(R) - \left(\frac{|R_L|}{|R|} I(R_L) + \frac{|R_R|}{|R|} I(R_R)\right)$, (15.218)
where RL and RR are the left and right child nodes obtained after splitting R, and I(R) is the
impurity measure (Gini, entropy, or MSE). The algorithm selects the feature j ∗ and threshold s∗
that maximize ∆I:
$(j^*, s^*) = \arg\max_{j,s} \Delta I$. (15.219)
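A direct, if naive, implementation of this split search using the Gini impurity (15.215) and the gain (15.218)-(15.219) can be sketched as follows (NumPy; the exhaustive search over unique feature values is an assumption made for brevity, not the only strategy).

```python
import numpy as np

def gini(y):
    """Gini impurity G(R) = sum_k p_k (1 - p_k)  (15.215)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return np.sum(p * (1 - p))

def best_split(X, y):
    """Exhaustive search for (j*, s*) maximizing Delta I  (15.218)-(15.219)."""
    n, d = X.shape
    parent = gini(y)
    best = (None, None, -np.inf)                   # (feature, threshold, gain)
    for j in range(d):
        for s in np.unique(X[:, j])[:-1]:          # candidate thresholds
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            gain = parent - (len(left) / n) * gini(left) \
                          - (len(right) / n) * gini(right)
            if gain > best[2]:
                best = (j, s, gain)
    return best

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0.3).astype(int)                    # ground-truth split on feature 0
print(best_split(X, y))                            # recovers j* = 0, s* near 0.3
```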
The recursive splitting process continues until a stopping criterion is met, such as a maximum
depth D, a minimum number of samples per leaf nmin , or an impurity threshold ϵ. The depth of
the tree, denoted D, determines the complexity of the model, and the number of terminal nodes
M satisfies
$M \leq 2^D$. (15.220)
Pruning is performed to prevent overfitting, and one common approach is cost complexity pruning,
which minimizes the function
$C(T) = \sum_{m=1}^{M} |R_m| H(R_m) + \alpha M$, (15.221)
where α is a regularization parameter controlling the trade-off between tree complexity and accu-
racy. The optimal subtree $T^*$ is obtained by
$T^* = \arg\min_{T} C(T)$. (15.222)
Greedy tree induction typically runs in $O(d N \log N)$ time, where N is the number of samples and d is
the number of features. Despite its simplicity, decision trees can suffer from high variance, making
ensemble methods such as random forests and gradient boosting necessary for improving
generalization.
Mathematically, the distance function d(xq , xi ) is commonly defined as the Euclidean distance:
$d(x_q, x_i) = \|x_q - x_i\|_2 = \sqrt{\sum_{j=1}^{d} (x_{qj} - x_{ij})^2}$ (15.224)
Alternatively, the Minkowski distance of order p generalizes the Euclidean and Manhattan distances:
$d(x_q, x_i) = \left(\sum_{j=1}^{d} |x_{qj} - x_{ij}|^p\right)^{1/p}$ (15.225)
The green dot represents the test sample, which needs to be classified as either a blue square or a
red triangle. If the classification is based on k = 3 (solid-line circle), the sample is assigned to the
red triangles because there are more triangles (2) than squares (1) within this region. However,
with k = 5 (dashed-line circle), the sample is classified as a blue square, as the outer circle
contains three squares and only two triangles.
The majority-vote decision rule over the k nearest neighbors $N_k(x_q)$ is
$\hat{y}_q = \arg\max_{c \in C} \sum_{i \in N_k(x_q)} \mathbb{1}(y_i = c)$, (15.226)
where $\mathbb{1}(\cdot)$ is the indicator function, and C represents the set of all possible class labels. If weighted
voting is used, the contribution of each neighbor can be weighted by the inverse distance:
$w_i = \frac{1}{d(x_q, x_i) + \epsilon}$ (15.227)
where ϵ is a small positive number to prevent division by zero. The weighted class probability
estimate is then given by:
$P(y_q = c) = \frac{\sum_{i \in N_k(x_q)} w_i \mathbb{1}(y_i = c)}{\sum_{i \in N_k(x_q)} w_i}$ (15.228)
and the final classification decision is:
$\hat{y}_q = \arg\max_{c \in C} P(y_q = c)$ (15.229)
For regression, the predicted value ŷq is typically the mean of the k nearest neighbors’ target values:
$\hat{y}_q = \frac{1}{k} \sum_{i \in N_k(x_q)} y_i$ (15.230)
Computational complexity is a critical aspect of KNN. A naive search for the nearest neighbors
requires computing distances from xq to all N data points, leading to a time complexity of O(N d).
If a spatial data structure such as a k-d tree or a ball tree is used, the query time can be reduced to
O(log N ) in low-dimensional spaces. However, for high-dimensional data, the curse of dimensional-
ity makes these structures less effective, often reverting to the brute-force O(N d) complexity. The
choice of k significantly affects KNN performance. A small k can lead to high variance, whereas a
large k smooths the decision boundary but may introduce bias. The optimal k is often determined
via cross-validation, where the classification error is minimized as:
$\hat{k} = \arg\min_{k} \sum_{i=1}^{N} \mathbb{1}\left(\hat{y}_i^{(k)} \neq y_i\right)$ (15.232)
The theoretical foundation of KNN can be analyzed in terms of consistency. Under mild assump-
tions, as $N \to \infty$ and $k \to \infty$ while $k/N \to 0$, the KNN classification error converges to the Bayes
error rate:
$\lim_{N \to \infty} \mathbb{E}[\mathbb{1}(\hat{y}_q \neq y_q)] = R^*$ (15.233)
For regression, the expected squared error converges to the Bayes-optimal risk under similar
asymptotic conditions.
In high-dimensional spaces, the effectiveness of KNN diminishes due to the concentration of dis-
tances:
$\lim_{d \to \infty} \frac{\max_i d(x_q, x_i) - \min_i d(x_q, x_i)}{\min_i d(x_q, x_i)} = 0$ (15.235)
The KNN decision boundary is a piecewise linear approximation of the true decision surface. For
k = 1, KNN partitions the input space into the Voronoi tessellation of the training points; for k > 1,
decision regions are obtained by aggregating Voronoi cells. Given a sample $x_q$, the
probability of class c is estimated as:
$P(y_q = c \mid x_q) = \frac{k_c}{k}$ (15.237)
Thus, KNN is a flexible, non-parametric method that relies on local structure for decision-making.
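The full pipeline, from the Euclidean distance (15.224) through inverse-distance weighting (15.227)-(15.229), fits in a short brute-force sketch (NumPy; function names and the toy data are illustrative).

```python
import numpy as np

def knn_predict(X_train, y_train, x_q, k=5, eps=1e-9):
    """Brute-force KNN with inverse-distance weighting (15.227)-(15.229)."""
    dists = np.linalg.norm(X_train - x_q, axis=1)   # Euclidean distances (15.224)
    nn = np.argsort(dists)[:k]                      # indices of k nearest neighbors
    w = 1.0 / (dists[nn] + eps)                     # weights w_i (15.227)
    classes = np.unique(y_train)
    # weighted class probabilities (15.228)
    probs = np.array([w[y_train[nn] == c].sum() for c in classes]) / w.sum()
    return classes[np.argmax(probs)], probs         # decision rule (15.229)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
label, probs = knn_predict(X, y, np.array([2.5, 2.5]), k=5)
print(label, probs)
```

The O(N d) cost of the distance computation is exactly the brute-force complexity discussed above; tree-based indices would replace the `np.argsort` scan in low dimensions.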
Chen et al. (2009) compared different approaches for converting similarities into kernel functions. Their study systematically
explored the mathematical properties of similarity measures and their effect on classification per-
formance, offering insights into optimal ways to utilize nearest neighbor weights. Chechik et al.
(2010) [716] expanded upon this by introducing OASIS, an online learning algorithm designed to
handle large-scale image similarity tasks efficiently. By leveraging a bilinear similarity function and
optimizing with a large-margin criterion, OASIS achieved state-of-the-art performance in ranking
tasks while maintaining computational feasibility. Similarly, Huang et al. (2013) [717] extended
similarity learning into the domain of content-based image retrieval (CBIR) by integrating relative
comparisons, an approach grounded in ranking theory. This refinement aligned image retrieval
more closely with real-world applications, ensuring that retrieved images were ordered based on
their true visual similarities rather than absolute feature distances.
The theoretical underpinnings of similarity-based learning were further explored by Kar and Jain
(2011) [720], who proposed a framework for mapping similarity functions to data-driven embed-
dings. Their work addressed the crucial question of how similarity functions can be interpreted
in terms of data separability, providing a bridge between metric learning and traditional feature-
based classifiers. In a different but related direction, Xiao et al. (2011) [719] tackled the problem of
positive and unlabeled learning (PU learning), where similarity functions were used to weight am-
biguous examples and improve the classification of data points with uncertain labels. This approach
demonstrated how similarity-based techniques could enhance learning in scenarios where labeled
data is scarce or incomplete, making it particularly relevant for applications such as anomaly de-
tection and biomedical classification.
With the advent of deep learning, similarity learning has expanded into more complex and high-
dimensional data spaces. Yang et al. (2024) [718] conducted an extensive survey on deep learning
approaches for similarity computation, covering applications ranging from sequence matching to
graph-based similarity models. Their review provided a critical examination of various neural net-
work architectures designed to capture similarity relationships in data, highlighting key challenges
such as overfitting, interpretability, and computational efficiency. Additionally, contributions from
Wikipedia contributors [722] have documented the broader theoretical framework of semantic sim-
ilarity, including traditional node-based and edge-based methods for quantifying textual similarity.
Co-citation proximity analysis, as explored in recent research, introduced an innovative way to
determine document similarity by leveraging citation networks, demonstrating how similarity mea-
sures can be extended beyond direct content analysis to relational data structures.
From a practical implementation perspective, PingCAP (2024) [721] evaluated a range of tools for
computing semantic similarity in natural language processing, including transformer-based mod-
els like BERT and classical vector representations such as Word2Vec. Their analysis underscored
the trade-offs between model complexity, computational cost, and accuracy in different NLP ap-
plications. Finally, Choi (2022) [723] applied similarity scoring techniques to document retrieval
and clustering, demonstrating how fine-grained textual similarity assessments can enhance infor-
mation retrieval systems. By incorporating deep learning models, they showcased improvements
in contextual understanding, making similarity-based approaches increasingly vital in modern AI
applications. Collectively, these works highlight the evolution of similarity training, demonstrat-
ing its growing importance across disciplines while underscoring the interplay between theoretical
advancements and real-world applications.
Reference | Contribution
Chen et al. (2009) | Established a mathematical foundation for similarity-based classification by systematically converting similarity measures into kernels and evaluating their performance in various learning scenarios.
Chechik et al. (2010) | Developed OASIS, an efficient online algorithm that learns a bilinear similarity function for large-scale image ranking, optimizing a margin-based criterion to enhance ranking performance.
Wang et al. (2013) | Proposed a similarity learning framework for content-based image retrieval (CBIR) that incorporates relative comparisons, aligning retrieval results with human perception of image similarity.
Kar and Jain (2011) | Introduced a similarity embedding framework that maps similarity functions into data-driven feature spaces, bridging the gap between similarity learning and traditional classification approaches.
Liu et al. (2011) | Addressed positive and unlabeled (PU) learning by leveraging similarity-based weighting of uncertain examples, improving classification performance when labeled data is scarce.
Zhang et al. (2024) | Conducted an extensive survey on deep learning techniques for similarity learning, covering applications in sequence modeling, graph-based learning, and high-dimensional data similarity computation.
Wikipedia Contributors | Documented theoretical aspects of semantic similarity, including classical methods such as node-based and edge-based similarity computations, and their applications in knowledge representation.
PingCAP (2024) | Explored NLP tools like Word2Vec and BERT for semantic similarity computation, analyzing trade-offs between computational complexity and accuracy in various language processing applications.
ResearchGate Contributors (2023) | Demonstrated the application of similarity-based scoring techniques in document retrieval and clustering, showcasing improvements in contextual understanding using deep learning models.
Co-citation Proximity Analysis | Investigated citation-based similarity measures, introducing co-citation proximity analysis to quantify the relationship between academic articles based on their citation network structures.
A representative study applies similarity-based training to classify crops in UAV-captured images. It
employs RGB and vegetation indices (VARI) for training, optimizing performance by balancing
training and testing datasets with an 80/20 split. The model improves classification accuracy by
using similarity metrics to refine feature extraction. Bakaev et al. (2025) [686] explored
similarity-based training methods to enhance synthetic text generation by Large Language Models
(LLMs). Using cosine similarity and Mahalanobis distance, the authors analyze how closely generated
text matches human-authored content, showcasing the effectiveness of similarity-based methods in
controlling model outputs. Ahn et al. (2025) [687] employed similarity training using the Dice
Similarity Coefficient to fine-tune a deep learning model for medical image segmentation. The
proposed method significantly improves precision in identifying standard imaging planes,
demonstrating how similarity metrics enhance training effectiveness. Peng et al. (2025) [688]
introduced similarity label supervision to refine visual place recognition by improving re-ranking
mechanisms. By integrating descriptor similarity into the training process, the model enhances
recognition performance even in challenging environmental conditions. Zhao et al. (2025) [689]
employed similarity-based global distance measures to cluster data while preserving privacy in
federated learning. The authors integrate Generative Adversarial Networks (GANs) to refine the
training process, ensuring robust clustering in distributed environments. Wang et al. (2025) [690]
developed a similarity matrix (W matrix) to predict genomic traits in maize hybrids. By
incorporating similarity measures into training, the model improves yield and moisture content
predictions across different environmental conditions. Xu et al. (2025) [691] introduced a similarity
loss function in medical image registration; by optimizing image similarity during training, the
authors achieve enhanced alignment accuracy for medical imaging applications. Sun et al. (2025)
[692] investigated how similarity training affects text generation in LLMs, evaluating ROUGE-1
similarity scores to measure how synthetic text diverges from human-generated content and to guide
training strategies for better alignment with human language patterns. Liang et al. (2025) [693]
presented a similarity-based training approach to improve lip-to-speech synthesis; the model learns
to generate natural-sounding speech without explicit speaker embeddings, achieving high speaker
similarity through deep learning methods.
Image Credit:
https://developers.google.com/machine-learning/clustering/dnn-clustering/overview
A central approach in similarity learning is metric learning, where the goal is to learn a distance
metric M that parameterizes a Mahalanobis-like distance
$D_M(x_i, x_j) = \sqrt{(x_i - x_j)^T M (x_i - x_j)}$ (15.241)
where M is a positive semi-definite (PSD) matrix (M ⪰ 0), ensuring that DM satisfies the properties
of a metric:
$D_M(x_i, x_j) \geq 0, \quad D_M(x_i, x_j) = 0 \iff x_i = x_j$ (15.242)
$D_M(x_i, x_j) = D_M(x_j, x_i), \quad D_M(x_i, x_k) \leq D_M(x_i, x_j) + D_M(x_j, x_k)$. (15.243)
Metric learning can be formulated as an optimization problem where we minimize a loss function
that enforces similarity constraints. One such loss function is the contrastive loss
$L = \sum_{i,j} \left[ y_{ij} D_M(x_i, x_j)^2 + (1 - y_{ij}) \max(0, \alpha - D_M(x_i, x_j))^2 \right]$ (15.244)
where α is a margin parameter. This loss ensures that similar samples are pulled together while
dissimilar samples are pushed apart beyond a margin. Another widely used loss function is the
triplet loss, which operates on triplets (xa , xp , xn ), where xa is an anchor, xp is a positive example,
and xn is a negative example. The triplet loss is defined as
$L = \sum_{(a,p,n)} \max\left(0, D(f(x_a), f(x_p)) - D(f(x_a), f(x_n)) + \alpha\right)$. (15.245)
Here, the objective is to ensure that the distance between the anchor and the positive is smaller
than the distance between the anchor and the negative by at least a margin α. If we represent the
embedding function f (x) as a deep neural network parameterized by θ, then similarity learning
becomes a deep learning problem, where the network parameters are optimized via stochastic
gradient descent to minimize one of the loss functions described above. A particularly effective
approach in similarity learning is the use of a Siamese network, where two identical neural networks
fθ share weights and process two input vectors xi and xj . The network learns a representation such
that the Euclidean distance
is small for similar pairs and large for dissimilar pairs. The optimization is driven by minimizing
a loss function such as the contrastive or triplet loss. Another approach is the use of graph-based
similarity learning, where a similarity graph G = (V, E) is constructed over the dataset, and
embeddings are learned by enforcing that connected nodes are closer together in the embedding
space. This can be formulated as
$L = \sum_{(i,j) \in E} w_{ij} D(f(x_i), f(x_j))^2$ (15.247)
where wij is a weight representing the strength of similarity between xi and xj . Thus, similarity
learning encompasses a vast array of methods, from classical metric learning with Mahalanobis
distances to deep learning approaches with Siamese networks, triplet loss, and self-supervised con-
trastive learning, all of which fundamentally rely on defining and optimizing a notion of similarity
in a mathematically rigorous manner.
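As a concrete instance, the following sketch (a minimal NumPy illustration, not a production metric-learning library; it assumes squared distances and a linear embedding $f(x) = Ax$ so that the gradient has a closed form) performs plain SGD on the triplet loss (15.245).

```python
import numpy as np

def triplet_loss_and_grad(A, xa, xp, xn, alpha=1.0):
    """Triplet loss (15.245) with squared distances and linear embedding f(x) = A x.

    Returns max(0, D(a,p) - D(a,n) + alpha) and its gradient with respect to A.
    """
    dp = A @ (xa - xp)                           # embedded anchor-positive difference
    dn = A @ (xa - xn)                           # embedded anchor-negative difference
    loss = dp @ dp - dn @ dn + alpha
    if loss <= 0.0:
        return 0.0, np.zeros_like(A)             # margin satisfied: no update
    grad = 2.0 * (np.outer(dp, xa - xp) - np.outer(dn, xa - xn))
    return loss, grad

rng = np.random.default_rng(4)
A = rng.normal(scale=0.1, size=(2, 4))           # embedding matrix to be learned
for _ in range(200):                             # plain SGD over random triplets
    xa = rng.normal(size=4)
    xp = xa + 0.1 * rng.normal(size=4)           # positive: a nearby point
    xn = rng.normal(size=4) + 3.0                # negative: a far-away point
    _, grad = triplet_loss_and_grad(A, xa, xp, xn)
    A -= 0.01 * grad

# after training, anchor-positive distances should be smaller than anchor-negative
print(np.linalg.norm(A @ (xa - xp)), np.linalg.norm(A @ (xa - xn)))
```

Replacing the linear map with a shared neural network $f_\theta$ recovers the Siamese/triplet setting described above.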
In the context of hierarchical self-learning, Hinton et al. (2006) [828] proposed Deep Be-
lief Networks (DBNs), pioneering layer-wise unsupervised pretraining, which allows deep
networks to learn robust feature hierarchies. This method was pivotal in self-learning hierar-
chical representations and directly influenced architectures like autoencoders, GANs, and
modern deep networks. Similarly, meta-learning, or “learning how to learn,” was formalized
by Finn et al. (2017) [829] through Model-Agnostic Meta-Learning (MAML). Their frame-
work demonstrated that deep networks could be optimized to rapidly adapt to new tasks using
only a few gradient steps, showcasing self-learning’s potential in few-shot learning and robotics.
Additionally, Jaderberg et al. (2017) [830] introduced auxiliary self-learning objectives in RL,
where an agent trains on unsupervised auxiliary tasks, such as predicting future states, to
enhance representation learning and sample efficiency. This method demonstrated that rein-
forcement learning agents could self-discover useful intermediate goals, thereby accelerating
policy convergence and improving generalization across multiple tasks.
Finally, the transformer revolution has brought self-learning capabilities to vision tasks, as
demonstrated by Dosovitskiy et al. (2020) [831] with Vision Transformers (ViTs). They showed
that self-attention mechanisms—originally developed for natural language processing—could re-
place convolutional architectures, allowing deep networks to self-learn image representations
without handcrafted priors. This introduced a paradigm shift where models no longer rely on
spatial inductive biases, instead learning global dependencies in a self-supervised manner.
Collectively, these contributions have rigorously established self-learning as a fundamental princi-
ple in artificial intelligence, spanning multiple disciplines and application areas. From curiosity-
driven exploration and reinforcement-based self-play to self-supervised feature learning and meta-
learning, self-learning methodologies now underpin state-of-the-art AI systems, enabling them
to autonomously acquire knowledge, optimize representations, and generalize across
domains with minimal or no human intervention.
Table 15.35: Summary of Contributions in Self-Learning
Reference | Contribution
Schmidhuber (1991) | Introduced curiosity-driven learning and intrinsic motivation for self-learning agents.
Sutton and Barto (1998) | Formalized reinforcement learning through MDPs, TD learning, and actor-critic architectures.
Silver et al. (2017) | Demonstrated self-play in AlphaZero, mastering games without human input using MCTS.
Bengio et al. (2009) | Introduced curriculum learning, simulating human cognitive progression in deep learning.
He et al. (2020) | Developed contrastive learning through Momentum Contrast (MoCo) for self-supervised learning.
Grill et al. (2020) | Introduced BYOL, proving self-supervised learning without negative pairs.
Hinton et al. (2006) | Pioneered deep belief networks (DBNs) for hierarchical self-learning representations.
Finn et al. (2017) | Formalized model-agnostic meta-learning (MAML) for few-shot learning.
Jaderberg et al. (2017) | Proposed auxiliary self-learning objectives in RL to enhance policy learning.
Dosovitskiy et al. (2020) | Developed Vision Transformers (ViTs), enabling self-learning representations in vision tasks.
The role of self-learning AI extends beyond the medical field into education and cognitive sciences.
Hou (2025) [836] investigates the integration of AI-supported learning environments with students’
psychological factors such as self-esteem and mindfulness. Their findings suggest that AI-driven
adaptive learning platforms foster personalized education, optimizing student performance through
tailored feedback mechanisms. A related study by Li et al. (2025) [839] explored generative AI’s
role in real-time adaptive scaffolding for self-regulated learning. By analyzing student behaviors in
real time, AI dynamically generates learning resources that adapt to individual needs, significantly
enhancing personalized education strategies. Furthermore, Bjerregaard et al. (2025) [833] examine
how self-supervised learning can be applied to structural biology, particularly in protein folding
studies. Their research provides a foundation for AI-driven drug discovery and molecular engineer-
ing by leveraging AI models that can learn protein structures from vast amounts of unlabeled data.
These studies highlight the increasing influence of self-learning AI in shaping education, scientific
research, and personalized cognitive support.
Paper | Contribution
Mousavi (2025) | Examines the ethical implications of self-aware AGI and whether it possesses moral standing. Utilizes fuzzy logic to assess AI consciousness, influencing AI governance debates.
Bjerregaard et al. (2025) | Demonstrates the application of self-supervised learning to structural biology, improving molecular structure prediction and advancing computational drug discovery.
Cui et al. (2025) | Develops a dual-level self-supervised learning model to enhance generalization in physics-based AI, particularly in interatomic potential modeling.
Jia et al. (2025) | Introduces a graph-based self-supervised learning model for molecular property prediction, utilizing retrosynthetic fragmentation to improve AI-driven drug design.
Hou (2025) | Investigates the psychological effects of AI-driven adaptive learning on self-esteem and academic mindfulness, highlighting AI's role in personalized education.
Liu et al. (2025) | Proposes a reinforcement learning-based scheduling system for elective surgeries, optimizing hospital efficiency and reducing patient wait times.
Song et al. (2025) | Develops a deep self-supervised learning framework for anomaly detection in time-series data, improving AI applications in finance and industry.
Li et al. (2025) | Explores generative AI's ability to provide real-time adaptive scaffolding for personalized self-regulated learning, enhancing online education strategies.
Formally, a self-learning system minimizes an empirical risk of the form
$J(\hat{f}) = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, \hat{f}(x_i)\big)$, (15.249)
where L is a loss function that quantifies the discrepancy between the actual outcome $y_i$ and the
predicted outcome $\hat{f}(x_i)$. In an idealized self-learning system, the function $\hat{f}$ evolves dynamically
as additional information is acquired, which can be described by an update rule:
$\hat{f}_{t+1} = \hat{f}_t - \eta \nabla J(\hat{f}_t)$, (15.250)
where η is a learning rate and $\nabla J(\hat{f})$ denotes the gradient of the loss function with respect to the
model parameters. A crucial aspect of self-learning is the incorporation of feedback mechanisms,
whereby an entity refines its internal model based on past errors. This can be framed in terms of
stochastic gradient descent (SGD):
$\theta_{t+1} = \theta_t - \eta \nabla_\theta L\big(f_\theta(x_i), y_i\big)$, (15.251)
where θ represents the parameters of the function approximator fθ . If the system has access to
reinforcement signals rather than explicit input-output pairs, self-learning aligns with reinforcement
learning, where the goal is to maximize an expected cumulative reward:
"∞ #
X
J(π) = E γ t rt | π (15.252)
t=0
where π is a policy mapping states to actions, rt is a reward function, and γ is a discount factor.
The optimal policy satisfies the Bellman equation:
$Q^*(s, a) = \mathbb{E}\left[r + \gamma \max_{a'} Q^*(s', a') \,\middle|\, s, a\right]$ (15.253)
which recursively expresses the expected return of a state-action pair. In an unsupervised setting,
self-learning often manifests through clustering or density estimation, where the objective is to
model the underlying distribution P (X). One way to formalize this is through maximum likelihood
estimation:
$\max_{\theta} \sum_{i=1}^{N} \log P(x_i \mid \theta)$ (15.254)
which seeks to find model parameters θ that best explain the observed data. A more flexible
formulation involves variational inference, where we approximate the posterior distribution using
a tractable function $q_\phi(Z \mid X)$ and minimize the Kullback-Leibler (KL) divergence
$\min_\phi \, \mathrm{KL}\big(q_\phi(Z \mid X) \,\|\, P(Z \mid X)\big)$, (15.255)
which ensures that $q_\phi$ is close to the true posterior $P(Z \mid X)$. Self-learning in neural systems can be
framed using Hebbian learning principles, where synaptic weights $w_{ij}$ evolve according to activity
correlations:
$\Delta w_{ij} = \eta x_i x_j$ (15.256)
In spike-timing-dependent variants, the update additionally depends on a temporal offset τ between
presynaptic and postsynaptic activations. In deep learning, self-learning often involves
self-supervised contrastive learning, which optimizes an objective of the form:
$\min_\theta \sum_{(x, x^+)} -\log \frac{\exp\big(\mathrm{sim}(f_\theta(x), f_\theta(x^+))/\tau\big)}{\sum_{x^-} \exp\big(\mathrm{sim}(f_\theta(x), f_\theta(x^-))/\tau\big)}$ (15.258)
where $x^+$ is a positive sample, $x^-$ represents negative samples, and $\mathrm{sim}(\cdot, \cdot)$ measures similarity
in representation space. This encourages self-learned representations to be invariant under
transformations.
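A numerically stable evaluation of this contrastive objective for a single anchor can be sketched as follows (NumPy; the cosine similarity and the temperature value are illustrative choices, not prescribed by the equation).

```python
import numpy as np

def info_nce(z, z_pos, z_neg, tau=0.1):
    """Contrastive objective (15.258) for a single anchor embedding z.

    z_pos is the positive embedding, z_neg an array of negative embeddings;
    sim is cosine similarity and tau the temperature.
    """
    def sim(a, B):
        return (B @ a) / (np.linalg.norm(a) * np.linalg.norm(B, axis=-1))
    logits = np.concatenate(([sim(z, z_pos)], sim(z, z_neg))) / tau
    logits -= logits.max()                        # subtract max for stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(5)
z = rng.normal(size=8)
z_pos = z + 0.05 * rng.normal(size=8)             # augmented view of the same input
z_neg = rng.normal(size=(16, 8))                  # embeddings of other inputs
print(info_nce(z, z_pos, z_neg))                  # small when the positive aligns
```

Summing this quantity over anchor-positive pairs and differentiating through $f_\theta$ recovers the training objective in (15.258).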
Mathematically, self-learning is deeply connected to information-theoretic principles,
particularly in maximizing the mutual information between learned representations Z and the data
X:
$I(Z; X) = H(Z) - H(Z \mid X)$ (15.259)
which captures the reduction in uncertainty about Z given knowledge of X. A self-learning system
thus seeks to construct representations that maximize I(Z; X) while discarding task-irrelevant
noise. More generally, if self-learning operates within a Bayesian framework, it continuously updates
a posterior belief over hypotheses H using Bayes’ theorem:
$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$ (15.260)
where D represents the accumulated data. Finally, in the context of dynamical systems, self-
learning can be modeled as a time-dependent process governed by differential equations, such as
the Riccati equation in adaptive control:
dP
= AT P + P A − P BR−1 B T P + Q (15.261)
dt
A major milestone in RL came with Mnih et al. (2015) [279], who introduced Deep Q Net-
works (DQN), combining deep neural networks with Q-learning to solve high-dimensional control
problems. Their method incorporated experience replay to break correlation in training data and
target networks for stable Q-value updates, allowing for the first superhuman performance on
Atari games using raw pixel inputs. The introduction of deep RL opened the door to applications
in robotics, game AI, and autonomous systems. The subsequent work of Silver et al. (2016) [862]
extended deep RL into combinatorial search problems, with AlphaGo integrating Monte Carlo Tree
Search (MCTS) with deep policy and value networks. This demonstrated the potential of RL in
solving long-horizon planning problems and led to AlphaZero, which generalizes the method to
various board games, achieving superhuman performance in Go, chess, and shogi without human
supervision. Parallel to these advances, Konda and Tsitsiklis (2000) [280] introduced actor-critic
algorithms, which separate the value function estimator (critic) from the policy update mechanism
(actor), leading to more stable policy learning. This framework influenced modern policy optimiza-
tion techniques such as Proximal Policy Optimization (PPO), developed by Schulman et al. (2017)
[864], which uses a clipped surrogate objective to ensure stable updates without excessive deviation
from the current policy.
In the realm of continuous control, Lillicrap et al. (2016) [865] introduced Deep Deterministic
Policy Gradient (DDPG), an off-policy actor-critic method that enables RL in high-dimensional
continuous action spaces. Unlike DQN, which operates in discrete action spaces, DDPG employs a
deterministic policy with target networks and batch normalization for stable learning. Building on
these ideas, Haarnoja et al. (2018) [278] proposed Soft Actor-Critic (SAC), incorporating entropy
maximization to encourage exploration and improve sample efficiency. SAC’s stochastic policy for-
mulation and automatic entropy adjustment mechanism set a new standard for continuous control
tasks in RL. Meanwhile, Levine et al. (2016) [281] demonstrated the feasibility of end-to-end learn-
ing for robotics, directly mapping visual inputs to control actions using deep reinforcement learning.
Their guided policy search method combined RL with supervised learning to improve sample effi-
ciency, paving the way for RL-based autonomous robotic systems. These collective advancements,
spanning from fundamental RL principles to modern deep learning integration, have established
reinforcement learning as a powerful tool for solving complex decision-making and control problems
in diverse applications.
Reference | Contribution
Bellman (1957) | Introduced dynamic programming, laying the mathematical foundation for reinforcement learning. Developed the Bellman equation, which enables recursive computation of optimal policies in Markov Decision Processes (MDPs).
Sutton and Barto (1998, 2018) | Formalized reinforcement learning as a computational framework. Introduced key concepts such as temporal difference (TD) learning, actor-critic methods, and policy evaluation. Their textbook serves as the primary resource for both theoretical and applied RL.
Watkins and Dayan (1992) | Developed Q-learning, a model-free off-policy algorithm that enables agents to learn optimal policies without requiring an explicit model of the environment. Provided proof of Q-learning's convergence under certain conditions.
Mnih et al. (2015) | Introduced Deep Q Networks (DQN), combining deep learning with Q-learning to handle high-dimensional state spaces. Innovations include experience replay and target networks, leading to stable and sample-efficient training. Achieved superhuman performance on Atari games.
Silver et al. (2016) | Developed AlphaGo, which integrated Monte Carlo Tree Search (MCTS) with deep reinforcement learning. Demonstrated the ability to learn complex planning tasks with deep policy and value networks. Paved the way for AlphaZero, which generalized the approach to chess and shogi.
Konda and Tsitsiklis (2000) | Proposed actor-critic methods, separating policy learning (actor) from value estimation (critic). Their framework improved policy stability and inspired modern policy gradient methods such as Proximal Policy Optimization (PPO).
Schulman et al. (2017) | Developed Proximal Policy Optimization (PPO), a policy gradient method using a clipped objective function to stabilize training. PPO is widely used due to its balance between sample efficiency and simplicity.
Lillicrap et al. (2016) | Introduced Deep Deterministic Policy Gradient (DDPG), extending reinforcement learning to continuous action spaces. Utilized deterministic policies, target networks, and batch normalization to improve stability.
Haarnoja et al. (2018) | Developed Soft Actor-Critic (SAC), which incorporates entropy maximization for improved exploration and stability. SAC's stochastic policy formulation and automatic entropy adjustment enhanced sample efficiency.
Levine et al. (2016) | Applied reinforcement learning to robotics using guided policy search. Combined RL with supervised learning to improve sample efficiency, demonstrating end-to-end learning from raw sensory inputs to control outputs.
Beyond planning and security, RL is making strides in multi-agent decision-making systems. Hengzhi
et al. (2025) [869] leverage multi-agent reinforcement learning (MARL) for UAV relay covert com-
munication, optimizing real-time transmission strategies for secure aerial data networks. This work
highlights the potential of RL in decentralized, high-stakes environments where autonomous agents
must coordinate under uncertainty. The concept of multi-task learning is further expanded by Pan
et al. (2025) [870], who propose a Markov Decision Process (MDP)-based task grouping approach
that enables RL models to learn multiple tasks efficiently. This innovation addresses a fundamental
limitation of RL: its need for extensive task-specific training, making it more scalable across diverse
problem domains. Similarly, Liu et al. (2025) [871] explore multi-hop knowledge graph reasoning,
applying RL to enhance automated reasoning processes in knowledge systems, significantly improv-
ing the efficiency of complex decision chains.
RL is also proving instrumental in energy and infrastructure resilience. Chen et al. (2025) [872]
propose an interpretable RL framework for building energy management, ensuring that deep RL
models remain transparent while optimizing energy consumption in smart building systems. Their
work addresses a key limitation of black-box AI models by extracting human-readable decision
rules, bridging the gap between efficiency and interpretability. Anwar and Akber (2025) [873] ex-
tend RL applications to structural resilience, using multi-agent deep RL to enhance the durability
of buildings under extreme environmental conditions. This work is pivotal in modern urban plan-
ning, demonstrating how RL can optimize interconnected physical infrastructures for enhanced
resilience. Zhao et al. (2025) [874] apply RL-enhanced long short-term memory (LSTM) networks
to predictive maintenance, improving industrial fault detection by optimizing data-driven health
assessments of rolling bearings. By integrating RL with deep learning models, they achieve superior
early fault detection, reducing operational risks in industrial automation.
Finally, RL is shaping mental health support and empathetic AI. Soman et al. (2025) [875] in-
troduce reinforcement learning-enhanced retrieval-augmented generation (RAG) models for mental
health AI agents, where RL fine-tunes AI-generated responses to provide more personalized and em-
pathetic support. This integration is crucial in human-AI interactions, ensuring that mental health
support systems align with human emotional needs. Overall, these studies collectively demonstrate
the growing versatility of reinforcement learning across disciplines, tackling challenges in security,
multi-agent coordination, AI transparency, and intelligent automation. As RL techniques continue
to evolve, they are poised to redefine decision-making, infrastructure resilience, and human-centric
AI interactions in unprecedented ways.
The value function of a policy π, denoted as V π (s), represents the expected return starting from
state s while following policy π,
"∞ #
X
V π (s) = Eπ γ k R(st+k , at+k ) | st = s . (15.263)
k=0
Similarly, the action-value function Qπ (s, a) gives the expected return for taking action a in state
s and subsequently following policy π,
"∞ #
X
Qπ (s, a) = Eπ γ k R(st+k , at+k ) | st = s, at = a . (15.264)
k=0
The optimal state-value function satisfies the Bellman optimality equation,
$V^*(s) = \max_{a \in A} \sum_{s' \in S} P(s' \mid s, a) \left[R(s, a) + \gamma V^*(s')\right]$. (15.268)
The policy iteration algorithm alternates between policy evaluation, which computes $V^\pi(s)$, and
policy improvement, which updates the policy as
$\pi'(s) = \arg\max_{a \in A} \sum_{s' \in S} P(s' \mid s, a) \left[R(s, a) + \gamma V^\pi(s')\right]$. (15.270)
In contrast, value iteration directly updates V(s) using the Bellman optimality equation iteratively,
V_{k+1}(s) = \max_{a \in A} \sum_{s' \in S} P(s' \mid s, a)\left[R(s, a) + \gamma V_{k}(s')\right]. (15.271)
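As a concrete illustration of the backup in (15.271), the following minimal Python sketch runs value iteration on a small synthetic tabular MDP; the randomly generated transition tensor P, reward matrix R, and the convergence tolerance are illustrative assumptions rather than quantities from the text.

```python
# A minimal value-iteration sketch for a small tabular MDP (illustrative data).
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.standard_normal((n_states, n_actions))                    # R(s, a)
gamma = 0.9

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality backup: V_{k+1}(s) = max_a sum_{s'} P(s'|s,a)[R(s,a) + gamma V_k(s')]
    Q = R + gamma * np.einsum("sap,p->sa", P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:  # stop once the fixed point is reached
        break
    V = V_new
greedy_policy = Q.argmax(axis=1)  # greedy policy extracted from the converged values
```

Because the Bellman optimality operator is a γ-contraction, the loop converges to the unique fixed point V^{*} regardless of initialization.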
Temporal Difference (TD) learning updates value estimates based on observed transitions, using the TD(0) update
V(s_t) \leftarrow V(s_t) + \alpha\left[R(s_t, a_t) + \gamma V(s_{t+1}) - V(s_t)\right]. (15.272)
The Q-learning algorithm updates the action-value function using the rule
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[R(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\right], (15.273)
where α is the learning rate. In deep reinforcement learning, function approximators such as neural networks parameterized by θ approximate Q, and the parameters follow the semi-gradient update
\theta \leftarrow \theta + \alpha\left[R(s_t, a_t) + \gamma \max_{a'} Q_{\theta}(s_{t+1}, a') - Q_{\theta}(s_t, a_t)\right]\nabla_{\theta} Q_{\theta}(s_t, a_t). (15.274)
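The tabular form of update (15.273) can be made concrete with a short sketch; the toy chain environment, the ε-greedy exploration rate, and the episode counts below are illustrative assumptions.

```python
# A tabular Q-learning sketch following update (15.273) on a toy chain
# environment: states 0..4, actions left/right, reward at the right end.
import numpy as np

n_states, n_actions, gamma, alpha = 5, 2, 0.95, 0.1
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

def step(s, a):
    # deterministic chain dynamics (an illustrative assumption)
    s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward

for episode in range(500):
    s = 0
    for t in range(50):
        # epsilon-greedy behavior policy
        a = rng.integers(n_actions) if rng.random() < 0.1 else int(Q[s].argmax())
        s_next, r = step(s, a)
        # Q(s,a) <- Q(s,a) + alpha [r + gamma max_a' Q(s',a') - Q(s,a)]
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
```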
The policy gradient method optimizes a stochastic policy πθ (a | s) by following the gradient of the
expected return, given by
"∞ #
X
∇θ J(θ) = E ∇θ log πθ (at | st )Gt . (15.275)
t=0
Building upon these theoretical foundations, Schulman et al. (2015) [878] proposed Trust Region
Policy Optimization (TRPO), which ensures monotonic improvement in policy performance by con-
straining the step size via the Kullback-Leibler (KL) divergence between successive policies. This
theoretically motivated approach prevents drastic policy updates that could lead to performance
degradation. TRPO was further refined by Schulman et al. (2017) [864] through Proximal Policy
Optimization (PPO), which simplifies the optimization process by employing a clipped surrogate
objective function. PPO strikes a balance between sample efficiency and computational feasibility,
making it one of the most widely adopted algorithms in deep reinforcement learning. In parallel,
Agarwal et al. (2021) [879] rigorously analyzed the optimality and sample complexity of policy
gradient methods, particularly investigating how function approximation errors and distribution
shifts affect their theoretical performance guarantees. Their work provided valuable insights into
the conditions under which policy gradient methods converge to optimal or near-optimal policies.
Recent theoretical advancements, such as the work by Liu et al. (2024) [880], have further deep-
ened our understanding of policy optimization by analyzing projected policy gradients and natural
policy gradients in the context of discounted Markov Decision Processes (MDPs). By deriving
novel convergence bounds, they provided an elementary yet rigorous framework that elucidates the
dynamics of policy updates in large-scale reinforcement learning tasks. Meanwhile, Lorberbom et
al. (2020) [881] proposed Direct Policy Gradients (DirPG), an alternative approach specifically
designed for discrete action spaces. Their method optimizes policies by directly maximizing ex-
pected return-to-go trajectories, offering a novel perspective that integrates domain knowledge into
policy optimization. Complementary to these efforts, McCracken et al. (2020) [882] studied policy
gradient methods in exactly solvable Partially Observable Markov Decision Processes (POMDPs),
deriving analytical results that characterize their probabilistic convergence behavior. Their findings
are significant as they provide a rigorous theoretical foundation for understanding policy gradients
in partially observable settings.
Lastly, practical considerations and comparative studies have played a crucial role in refining pol-
icy gradient methodologies. A definitive guide by Lehmann (2024) [883] systematically explores
on-policy policy gradient methods in deep reinforcement learning, presenting a rigorous discussion
of entropy regularization, KL divergence constraints, and their impact on stability. Furthermore,
comparative analyses of policy-gradient algorithms by Sutton et al. (2000) [885] provide
both theoretical and empirical evaluations of their efficiency and convergence properties, guiding
practitioners in selecting the most suitable methods for various reinforcement learning applica-
tions. Collectively, these contributions have significantly advanced the theoretical underpinnings
and practical implementations of policy gradient methods, shaping their role as a fundamental tool
in modern reinforcement learning.
Reference | Contribution
Sutton et al. (1999) | Introduced policy gradient methods with function approximation, deriving an unbiased gradient estimator for direct policy optimization. Established the foundation for parameterized policies independent of value functions.
Kakade (2001) | Developed Natural Policy Gradient (NPG), which utilizes the Fisher information matrix to normalize gradient updates, making learning invariant to policy parameterization. Improved stability and efficiency in policy optimization.
In the realm of robotic and autonomous control, policy gradient methods have enabled significant
progress in task-specific learning. Raei et al. (2025) [889] introduce a DDPG-based reinforcement
learning framework for robotic nonprehensile manipulation, particularly focusing on object sliding.
Their study presents an efficient policy optimization technique that enhances robotic dexterity
in handling objects without grasping them. Extending this concept to UAV (Unmanned Aerial
Vehicle) applications, Ting-Ting et al. (2025) [890] propose a Multi-Agent Deep Deterministic
Policy Gradient (MADDPG) approach, which allows multiple UAVs to make collaborative, au-
tonomous decisions while operating under strict communication constraints. Similarly, Zhang et
al. (2025) [891] integrate Deep Deterministic Policy Gradient (DDPG) with neuromorphic comput-
ing to develop a novel Hybrid Deep Deterministic Policy Gradient (Neuro-HDDPG) algorithm. This
approach enhances obstacle avoidance for spherical underwater robots by mimicking the decision-
making process of biological neural networks, significantly improving their autonomy in complex
underwater environments.
Policy gradient methods have also found profound applications in reinforcement learning for natural
language processing (NLP) and constrained decision-making problems. Nguyen et al. (2025) [892]
explore REINFORCE and RELAX policy gradient algorithms to fine-tune text-to-SQL models,
effectively improving structured query generation accuracy by optimizing reward-based learning.
Additionally, Chathuranga Brahmanage et al. (2025) [893] extend policy gradient techniques to
action-constrained reinforcement learning by leveraging constraint violation signals, allowing agents
to optimize their decision-making while adhering to predefined operational constraints. These stud-
ies highlight how policy gradient approaches can be adapted to complex, structured learning tasks
that require precision and rule adherence.
Another major advancement in policy gradient methods is in federated learning and network op-
timization. Huang et al. (2025) [894] introduce the Knowledge Collaboration Actor-Critic Policy
Gradient (KCACPG) algorithm, which facilitates knowledge transfer in cooperative traffic schedul-
ing for intelligent transportation networks, ensuring seamless traffic management with minimal
delays. Finally, Li et al. (2025) [895] propose FedDDPG, a federated learning variant of Deep
Deterministic Policy Gradient, tailored for vehicle trajectory prediction in decentralized learning
environments. This framework enables autonomous vehicles to learn optimal path planning strate-
gies without centralizing sensitive data, ensuring enhanced privacy and efficiency in large-scale
intelligent transportation systems. Together, these contributions showcase the robustness of pol-
icy gradient methods across diverse applications, reinforcing their critical role in advancing deep
reinforcement learning frameworks for real-world challenges.
The policy gradient objective can be written as
J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[R(\tau)\right], (15.276)
where τ = (s_0, a_0, s_1, a_1, \ldots) represents a trajectory sampled from the environment following policy π_θ, and p_θ(τ) is the probability distribution over trajectories under policy π_θ, given by the product of the initial state distribution p(s_0), the policy probability, and the state transition dynamics:
p_{\theta}(\tau) = p(s_0) \prod_{t=0}^{T-1} \pi_{\theta}(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t). (15.277)
The reward function R(τ ) is typically defined as the sum of discounted rewards:
R(\tau) = \sum_{t=0}^{T-1} \gamma^{t} r(s_t, a_t), (15.278)
where r(st , at ) is the reward received at time step t and γ ∈ (0, 1] is the discount factor. The
gradient of J(θ) with respect to θ is computed using the likelihood ratio trick: since \nabla_{\theta} p_{\theta}(\tau) = p_{\theta}(\tau) \nabla_{\theta} \log p_{\theta}(\tau) and the transition dynamics do not depend on θ, we obtain
\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim p_{\theta}(\tau)}\left[R(\tau) \sum_{t=0}^{T-1} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\right]. (15.281)
In practice, this expectation is approximated by a Monte Carlo average over N sampled trajectories τ_i, and the parameters are updated by gradient ascent, \theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta), where α > 0 is the learning rate. Instead of using raw returns R(τ), it is common to replace them with an advantage function A(s_t, a_t) to reduce variance; a commonly used baseline function b(s_t), often taken as V(s_t), is subtracted to further reduce variance without introducing bias:
\nabla_{\theta} J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1} \left(Q(s_t^{i}, a_t^{i}) - V(s_t^{i})\right) \nabla_{\theta} \log \pi_{\theta}(a_t^{i} \mid s_t^{i}). (15.288)
This formulation forms the basis of policy gradient methods, including REINFORCE, actor-critic
algorithms, and proximal policy optimization (PPO), which introduce additional constraints and
refinements to improve stability and efficiency.
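As a concrete illustration of estimator (15.288), the following minimal Python sketch computes a baseline-subtracted policy gradient for a tabular softmax policy; the toy batch of (state, action, return) triples and all dimensions are illustrative assumptions standing in for N sampled trajectories.

```python
# A sketch of the baseline-subtracted policy-gradient estimator (15.288)
# for a softmax policy over discrete actions with a tabular baseline.
import numpy as np

n_states, n_actions = 4, 3
theta = np.zeros((n_states, n_actions))   # policy logits, one row per state
V = np.zeros(n_states)                    # baseline b(s) = V(s)

def grad_log_pi(s, a):
    # gradient of log softmax w.r.t. logits: one-hot(a) - pi(.|s), in row s
    pi = np.exp(theta[s]) / np.exp(theta[s]).sum()
    g = np.zeros_like(theta)
    g[s] = -pi
    g[s, a] += 1.0
    return g

# toy batch of (s_t, a_t, G_t) transitions standing in for sampled trajectories
batch = [(0, 1, 1.0), (2, 0, 0.5), (1, 2, -0.3)]
grad = np.zeros_like(theta)
for s, a, G in batch:
    advantage = G - V[s]                  # subtract baseline without adding bias
    grad += advantage * grad_log_pi(s, a)
grad /= len(batch)
theta += 0.1 * grad                       # gradient-ascent step on J(theta)
```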
Lillicrap et al. (2015) [865] introduced Deep Deterministic Policy Gradient (DDPG), an actor-critic method designed for continuous control problems. DDPG extended Q-
learning to continuous actions by utilizing an off-policy approach with target smoothing, improving
convergence in high-dimensional environments. Meanwhile, Schulman et al. (2015) [878], Schul-
man et al. (2017) [864] introduced Trust Region Policy Optimization (TRPO) and Proximal Policy
Optimization (PPO), which formulated policy optimization with explicit constraints on divergence
from previous policies. PPO’s clipped objective made it computationally efficient and less sensitive
to hyperparameters, establishing it as a standard for training deep RL agents.
Building upon these foundations, Haarnoja et al. (2018) [278] proposed Soft Actor-Critic (SAC),
which incorporated entropy maximization to encourage exploration and robustness in policy learn-
ing. Unlike traditional policy gradient methods, SAC optimized a trade-off between reward ac-
cumulation and policy stochasticity, leading to superior performance in complex robotic tasks.
Simultaneously, Hessel et al. (2018) [896] developed Rainbow DQN, a unification of multiple en-
hancements to DQN, including Double DQN, Prioritized Experience Replay, Dueling Networks,
Multi-step Learning, Noisy Networks, and Distributional RL. This integration provided a compre-
hensive framework for improving sample efficiency and generalization. Reinforcement learning also
made breakthroughs in strategic decision-making, as demonstrated by Silver et al. (2016) [862]
with AlphaGo, which leveraged Monte Carlo Tree Search (MCTS) with deep policy and value net-
works to achieve superhuman performance in the game of Go. AlphaGo’s success was a milestone
in planning-based RL, where deep neural networks provided heuristic guidance for complex search-
based decision-making.
Another significant area of advancement in DRL was its application to robotic control and perception-
driven decision-making. Levine et al. (2016) [897] pioneered end-to-end deep visuomotor policies,
allowing robots to learn manipulation skills directly from raw visual inputs. Their work integrated
guided policy search with deep learning, enabling robots to generalize across unseen scenarios with
improved sample efficiency. Parallel to these practical advancements, theoretical contributions
were made by Bellemare et al. (2017) [898] in Distributional RL, which proposed predicting a
distribution of future rewards rather than a single expected return. This perspective significantly
improved stability and sample efficiency, leading to the development of quantile-based methods like
QR-DQN and Implicit Quantile Networks (IQN). Additionally, the foundational work by Sutton
(2018) [273] provided a rigorous mathematical exposition of value iteration, policy iteration, and
temporal difference learning, forming the theoretical bedrock of modern RL research. Their book
continues to be the definitive reference for both theoretical exploration and practical implementa-
tion of reinforcement learning algorithms.
Through these fundamental breakthroughs, DRL has evolved into a robust field that spans value-
based methods, policy gradient approaches, entropy-regularized reinforcement learning, and dis-
tributional perspectives. The fusion of deep learning with reinforcement learning has enabled un-
precedented progress in areas such as robotics, strategic decision-making, and autonomous systems.
The field continues to expand, with ongoing research exploring meta-reinforcement learning, multi-
agent RL, and offline reinforcement learning to push the boundaries of intelligent decision-making
in complex and uncertain environments.
Reference | Contribution
Mnih et al. (2015) | Introduced Deep Q-Learning for high-dimensional state spaces using deep neural networks. Proposed experience replay to break correlation among samples and target networks for stable Q-value updates. Enabled deep reinforcement learning for discrete action spaces.
Lillicrap et al. (2016) | Extended Q-learning to continuous action spaces using an actor-critic framework. Leveraged off-policy learning with target smoothing to improve convergence and stability in high-dimensional environments.
Schulman et al. (2015) | Developed a stable policy gradient method with constraints on policy divergence using a trust region approach. Ensured monotonic policy improvement and was particularly effective in robotic control tasks.
Schulman et al. (2017) | Simplified TRPO by introducing a clipped surrogate objective for stable and efficient policy updates. Achieved state-of-the-art performance while being computationally efficient and less sensitive to hyperparameters.
Haarnoja et al. (2018) | Introduced entropy-regularized reinforcement learning to encourage exploration. Optimized a stochastic policy while balancing reward accumulation and entropy maximization, improving sample efficiency and robustness.
Hessel et al. (2018) | Unified multiple improvements to DQN, including Double DQN, Prioritized Experience Replay, Dueling Networks, Multi-step Learning, Noisy Networks, and Distributional RL. Provided a comprehensive framework with enhanced sample efficiency and stability.
Silver et al. (2016) | Combined Monte Carlo Tree Search (MCTS) with deep policy and value networks to achieve superhuman performance in the game of Go. Demonstrated the power of planning-based reinforcement learning.
Levine et al. (2016) | Integrated deep learning with guided policy search for end-to-end robotic manipulation. Enabled robots to learn complex tasks directly from raw visual inputs.
Bellemare et al. (2017) | Proposed predicting the distribution of future rewards instead of the expected value. Led to improved stability and sample efficiency, forming the foundation for quantile-based RL methods such as QR-DQN and IQN.
Sutton and Barto (2018) | Provided a rigorous mathematical foundation for reinforcement learning. Covered fundamental topics such as value iteration, policy iteration, and temporal difference learning. Served as a definitive reference for both theoretical and practical RL research.
A growing body of work applies DRL to optimize UAV-assisted MEC systems; this line of research is complemented by Amodu et al. (2025) [900], who present a comprehensive review of DRL applications for UAV-based optimization in MEC and IoT applications. Their work outlines the core challenges of DRL, including sample inefficiency, high computational costs, and model generalizability, while also suggesting future directions such as transfer learning and meta-reinforcement learning to improve adaptability in real-world systems.
Another critical area where DRL has proven useful is autonomous energy management, as demon-
strated in the study by Sunder et al. (2025) [901], who introduce the SmartAPM framework. This
system leverages DRL to optimize power consumption in wearable devices, allowing for extended
battery life through adaptive power control strategies based on environmental and user activity
data.
Beyond network optimization and energy management, DRL is also making significant strides in
autonomous control systems and cyber defense. For instance, Sarigül and Bayezit (2025) [902] ap-
ply DRL to fixed-wing aircraft heading control, showing how deep policy optimization can enhance
maneuverability and operational autonomy compared to classical control techniques. Similarly,
Mustafa et al. (2025) [886] focus on vehicular networks, proposing a Proximal Policy Optimiza-
tion (PPO)-based DRL framework for computation offloading. Their findings highlight improved
latency reduction and energy efficiency, making real-time decision-making more feasible for in-
telligent transport systems. Meanwhile, Mukhamadiarov (2025) [903] explores the application of
DRL in stochastic dynamical systems, demonstrating how reinforcement learning can be effectively
integrated with traditional control theory to enhance stability in non-linear environments. In cyber-
security, Ali and Wallace (2025) [904] introduce a DRL-powered autonomous cyber defense system
for Security Operations Centers (SOCs), where self-learning agents continuously refine threat de-
tection and mitigation strategies in adversarial settings. This research underscores the increasing
role of DRL in adaptive cyber resilience, as it allows security systems to proactively identify and
neutralize cyber threats in real-time.
Another promising area of DRL research lies in cross-domain transfer learning and real-world de-
ployment, particularly in fluid dynamics and energy optimization. The study by Yan et al. (2025)
[905] applies a mutual information-based transfer learning approach in active flow control for three-
dimensional bluff body flows, significantly reducing training costs by enabling efficient knowledge
transfer between different aerodynamic configurations. In a different sector, Silvestri et al. (2025)
[906] investigated the deployment of DRL in real-world building control systems, where they in-
corporate imitation learning to address the sample inefficiency problem that plagues traditional
DRL methods. Their approach enhances energy efficiency in buildings while maintaining occupant
comfort, demonstrating a viable path for scaling DRL beyond theoretical applications. Similarly,
Alajaji et al. (2025) [907] provide a scoping review of DRL in medical imaging, particularly in
infrared spectroscopy-based cancer diagnosis. Their findings reveal that DRL can significantly im-
prove the sensitivity and specificity of early cancer detection by automatically identifying abnormal
patterns in spectroscopy data.
Collectively, these studies showcase the remarkable versatility and potential of DRL in solving
real-world challenges. Whether in network optimization, cyber defense, autonomous control, med-
ical imaging, or fluid dynamics, DRL continues to redefine the boundaries of machine learning by
offering adaptive, scalable, and self-improving solutions. However, challenges such as high sample
complexity, lack of explainability, and computational overhead remain significant hurdles that re-
searchers are actively working to overcome. Future research in DRL is likely to focus on hybrid
learning approaches, transfer learning techniques, and improved generalization methods to ensure
that DRL models can perform effectively across diverse, unpredictable environments. The contin-
ued evolution of DRL in these domains not only paves the way for more robust and intelligent
autonomous systems but also holds promise for breakthroughs in fields ranging from aerospace to medical imaging.
Formally, the control problem is modeled as a Markov Decision Process (MDP) defined by the tuple (S, A, P, R, γ), where S is the state space, A is the action space, P(s' | s, a) represents the transition probability
function defining the probability of transitioning to state s′ given the current state s and action
a, R(s, a) is the reward function specifying the immediate reward received after taking action a in
state s, and γ ∈ [0, 1] is the discount factor that determines the present value of future rewards.
The objective of DRL is to find an optimal policy π ∗ (a|s) that maximizes the expected cumulative
reward, given by the return function:
G_t = \sum_{k=0}^{\infty} \gamma^{k} R(s_{t+k}, a_{t+k}). (15.290)
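Computing the return (15.290) for a finite episode reduces to the backward recursion G_t = r_t + γG_{t+1}; the following short sketch, with an illustrative reward sequence, makes this explicit.

```python
# A short sketch computing the discounted return of (15.290) for a finite
# reward sequence; the rewards below are illustrative.
def discounted_return(rewards, gamma=0.99):
    G = 0.0
    for r in reversed(rewards):  # G_t = r_t + gamma * G_{t+1}
        G = r + gamma * G
    return G

print(discounted_return([1.0, 0.0, 2.0]))  # 1 + 0.99**2 * 2 = 2.9602
```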
The optimal policy is derived by maximizing the state-value function V π (s) or the action-value
function Qπ (s, a), which are respectively defined as:
"∞ #
X
V π (s) = Eπ γ t R(st , at ) s0 = s , (15.291)
t=0
"∞ #
X
Qπ (s, a) = Eπ γ t R(st , at ) s0 = s, a0 = a . (15.292)
t=0
In Deep Q-Networks (DQN), a deep neural network parameterized by θ is used to approximate the
optimal Q-function:
Q(s, a; θ) ≈ Q∗ (s, a). (15.293)
The network parameters are updated by minimizing the loss function derived from the Bellman equation,
L(\theta) = \mathbb{E}\left[\left(y - Q(s, a; \theta)\right)^{2}\right], (15.294)
where y is the target value computed using the Bellman optimality equation:
y = r + \gamma \max_{a'} Q(s', a'; \theta^{-}), (15.295)
where θ^{-} represents the parameters of a target network that is updated periodically to stabilize training.
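The target (15.295) and the squared TD loss (15.294) can be sketched with plain arrays standing in for the online and target network outputs; the batch size, random values, and the terminal-state mask below are illustrative assumptions, not a real network.

```python
# A sketch of the DQN target and TD loss of (15.294)-(15.295); numpy arrays
# stand in for batched Q-value predictions of the online and target networks.
import numpy as np

batch, n_actions, gamma = 32, 4, 0.99
rng = np.random.default_rng(0)
q_online = rng.standard_normal((batch, n_actions))       # Q(s, .; theta)
q_next_target = rng.standard_normal((batch, n_actions))  # Q(s', .; theta^-)
actions = rng.integers(n_actions, size=batch)
rewards = rng.standard_normal(batch)
done = rng.random(batch) < 0.1

# y = r + gamma * max_a' Q(s', a'; theta^-), with bootstrapping cut at terminals
y = rewards + gamma * (~done) * q_next_target.max(axis=1)
td_error = y - q_online[np.arange(batch), actions]
loss = np.mean(td_error ** 2)   # L(theta) = E[(y - Q(s,a;theta))^2]
```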
Policy-based methods directly optimize the policy π_θ(a | s) without relying on a Q-function.
The objective function for policy gradient methods is given by:
"∞ #
X
J(θ) = Eπ γ t R(st , at ) . (15.296)
t=0
The gradient of J(θ) with respect to the policy parameters is computed using the policy gradient
theorem:
∇θ J(θ) = Eπ [∇θ log πθ (a|s)Qπ (s, a)] . (15.297)
The REINFORCE algorithm estimates this gradient using Monte Carlo sampling:
\nabla_{\theta} J(\theta) \approx \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t)\, G_t. (15.298)
The policy gradient can also be computed using an advantage function A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s) in place of Q^{\pi}(s, a), which reduces variance without biasing the estimate.
Deep Deterministic Policy Gradient (DDPG) extends policy gradients to continuous action spaces
by parameterizing the policy as a deterministic function:
a = µθ (s). (15.302)
Proximal Policy Optimization (PPO) improves stability by enforcing a trust region constraint on policy updates. The clipped surrogate objective is
L^{CLIP}(\theta) = \mathbb{E}_{t}\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right],
where \hat{A}_t is an estimate of the advantage function and r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)} is the probability ratio between the new and old policies. Soft Actor-Critic (SAC) introduces an entropy regularization term to encourage exploration:
"∞ #
X
J(θ) = E γ t (R(st , at ) + αH(π(·|st ))) , (15.305)
t=0
P
where H(π(·|st )) = − a π(a|s) log π(a|s) represents the entropy of the policy, and α is a temper-
ature parameter.
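For a discrete action space, the entropy bonus in (15.305) is a one-line computation; the following minimal sketch augments a scalar reward with αH(π(·|s)), with the probability vector and temperature chosen purely for illustration.

```python
# A sketch of the entropy bonus in the SAC objective (15.305) for a discrete
# policy: the reward is augmented with alpha * H(pi(.|s)).
import numpy as np

def entropy(pi):
    # H(pi(.|s)) = -sum_a pi(a|s) log pi(a|s)
    pi = np.asarray(pi)
    return -np.sum(pi * np.log(pi + 1e-12))  # small epsilon guards log(0)

alpha = 0.2                       # temperature parameter (illustrative)
pi_s = np.array([0.7, 0.2, 0.1])  # pi(.|s), an illustrative distribution
reward = 1.0
augmented_reward = reward + alpha * entropy(pi_s)
```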
The Advantage Actor-Critic (A2C) algorithm has been extensively studied and applied in various
domains, contributing to both theoretical advancements and practical implementations. Mnih et
al. (2016) [918] laid the foundational work for A2C through their development of the Asynchronous
Advantage Actor-Critic (A3C) method, which utilized multi-threaded parallelization to stabilize
policy gradient methods. A2C emerged as a synchronous variant that improved sample efficiency
while retaining the stability benefits of A3C. Wang et al. (2022) [919] further refined A2C by
introducing Recursive Least Squares (RLS) methods to improve sample efficiency and convergence
stability. Their research focused on optimizing the learning process using adaptive step sizes, effec-
tively reducing training time and enhancing the robustness of deep reinforcement learning models.
The adaptability of A2C has enabled its application across diverse fields, including financial market
trading and multi-agent reinforcement learning. Rubell Marion Lincy et al. (2023) [920] explored
the integration of A2C with technical analysis indicators to improve decision-making in stock trad-
ing. Their approach demonstrated that A2C could effectively learn optimal buy and sell strategies
while adapting to market fluctuations. Meanwhile, Paczolay et al. (2020) [921] extended A2C to
multi-agent environments, addressing key challenges in coordination and competition. They devel-
oped an improved training mechanism to stabilize interactions between multiple learning agents,
significantly enhancing performance in decentralized decision-making settings. Zhang et al. (2024)
[922] leveraged A2C for industrial applications, specifically in disassembly line balancing, where
they optimized cycle time minimization. By formulating the optimization problem as a reinforce-
ment learning task, they showcased the potential of A2C to streamline complex industrial processes.
Beyond conventional applications, researchers have also explored novel enhancements to A2C, such
as quantum computing and adversarial training. Kölle et al. (2024) [923] introduced Quantum
Advantage Actor-Critic (QA2C) and Hybrid Advantage Actor-Critic (HA2C), benchmarking them
against classical A2C implementations. Their findings indicated that quantum models could im-
prove computational efficiency, making reinforcement learning more scalable for high-dimensional
problems. Benhamou (2019) [924] focused on addressing one of the fundamental challenges in
policy gradient methods—variance reduction. By formulating a rigorous mathematical framework,
they optimized control variate estimators, effectively minimizing gradient variance and improving
learning stability. Similarly, Peng et al. (2018) [925] introduced Adversarial A2C, where a dis-
criminator was incorporated to enhance exploration strategies in dialogue policy learning. Their
approach demonstrated significant improvements in reinforcement learning-based conversational
agents, particularly in human-like task completion.
Finally, A2C has proven to be a powerful tool in robotics and adaptive control systems. Van
Veldhuizen (2022) [926] explored its application in tuning Proportional-Integral-Derivative (PID)
controllers for robotic systems. Their research demonstrated that A2C could autonomously opti-
mize PID parameters, significantly improving the precision of control tasks such as apple harvesting.
The ability of A2C to generalize across different control environments underscores its flexibility
as a reinforcement learning framework. Collectively, these contributions illustrate the extensive
theoretical advancements and practical innovations enabled by A2C, solidifying its position as a
cornerstone in deep reinforcement learning.
Reference | Contribution
Mnih et al. (2016) | Introduced the Asynchronous Advantage Actor-Critic (A3C) algorithm, which inspired the synchronous A2C algorithm by stabilizing policy gradient methods through multi-threaded training.
Wang et al. (2022) | Developed RLS-based A2C variants that improve sample efficiency and learning stability in deep reinforcement learning environments.
Rubell Marion Lincy et al. (2023) | Applied A2C for stock trading using technical indicators, showing improved decision-making for buy and sell strategies.
Paczolay et al. (2020) | Adapted A2C for multi-agent scenarios, addressing coordination and competition among agents with improved training mechanisms.
Zhang et al. (2024) | Applied A2C for industrial optimization problems, demonstrating enhanced cycle time minimization in disassembly line balancing.
Kölle et al. (2024) | Explored quantum implementations of A2C, benchmarking QA2C and HA2C against classical A2C architectures for efficiency gains.
Benhamou (2019) | Provided a mathematical framework for reducing variance in A2C methods, optimizing control variate estimators for policy gradient algorithms.
Peng et al. (2018) | Introduced Adversarial A2C, incorporating a discriminator to enhance exploration in dialogue policy learning.
van Veldhuizen (2022) | Applied A2C for adaptive control in robotics, optimizing PID tuning for an apple-harvesting robot.
Advantage Actor-Critic (A2C) reinforcement learning has emerged as a powerful tool across diverse
domains, providing superior adaptability and stability in decision-making compared to traditional
reinforcement learning methods. In the financial sector, A2C has been leveraged for optimiz-
ing investment strategies and risk-sensitive portfolio management. Wang and Liu (2025) [908]
introduced a transformer-enhanced A2C model that refines petroleum futures trading by dynam-
ically adjusting trading decisions based on real-time risk assessments. Their study demonstrates
how combining transformers with A2C can enhance financial modeling accuracy, outperforming
static risk-assessment methods. Similarly, Thongkairat and Yamaka (2025) [909] applied A2C in
stock market trading, proving its effectiveness in managing highly volatile stock portfolios. Unlike
Deep Q-Networks (DQN) and other value-based methods, A2C’s policy gradient approach enables
more stable convergence and better long-term decision-making. By leveraging actor-critic opti-
mization, A2C effectively reduces erratic investment behaviors and optimizes portfolio allocations
under stochastic market conditions.
Beyond finance, A2C has been pivotal in cybersecurity and autonomous decision-making. Dey
and Ghosh (2025) [910] demonstrated A2C’s potential in intrusion detection, developing a re-
inforcement learning-based framework to mitigate QUIC-based Denial of Service (DoS) attacks.
Traditional signature-based cybersecurity systems often struggle against evolving attack patterns,
but A2C adapts dynamically by continuously learning from network behavior, thereby improving
real-time threat detection. In the realm of autonomous vehicles and drones, Zhao et al. (2025) [911]
proposed an A2C-based UAV trajectory optimization method, optimizing real-time flight paths and
task allocation based on environmental feedback. Their study underscores A2C’s ability to han-
dle high-dimensional control problems, ensuring efficient drone operations while minimizing energy
consumption. Similarly, Mounesan et al. (2025) [912] introduced Infer-EDGE, an A2C-driven sys-
tem for optimizing deep learning inference in edge-AI applications. Their research highlights the
growing synergy between reinforcement learning and AI-driven computational efficiency, demon-
strating how A2C can dynamically balance trade-offs between computation cost, latency, and model
accuracy in resource-constrained environments.
In energy systems and industrial automation, A2C has proven instrumental in optimizing resource
allocation and performance forecasting. Hou et al. (2025) [913] developed a multi-agent cooperative
A2C framework (MAC-A2C) for fuel cell degradation prediction, significantly improving the lifespan
and efficiency of fuel cells in automotive applications. This research showcases how reinforcement
learning can improve predictive maintenance, reducing operational costs in high-performance en-
ergy systems. In nuclear reactor safety, Radaideh et al. (2025) [914] applied asynchronous A2C
algorithms to optimize neutron flux distribution, enhancing reactor control and minimizing the risk
of critical failures. Their findings suggest that reinforcement learning could revolutionize nuclear
engineering by introducing adaptive and autonomous reactor safety mechanisms. Meanwhile, Li et
al. (2025) [915] compared A2C with Soft Actor-Critic (SAC) for task offloading in edge computing,
demonstrating that A2C’s on-policy approach offers superior learning stability in low-resource en-
vironments. Their work highlights A2C’s effectiveness in mobile cloud computing, where efficient
resource distribution is essential.
Furthermore, A2C has been successfully employed in large-scale combinatorial optimization prob-
lems. Khan et al. (2025) [916] applied A2C for route optimization in dairy product logistics,
reducing delivery costs while ensuring timely and efficient distribution. Unlike heuristic optimiza-
tion approaches, reinforcement learning-based solutions can adapt to real-time traffic conditions
and supply chain fluctuations, making A2C a promising tool for logistics and transportation in-
dustries. Yuan et al. (2025) [917] extended A2C by incorporating transformers (TR-A2C) to
enhance multi-user semantic communication in vehicle networks. Their study demonstrates how
transformer-enhanced reinforcement learning can significantly improve the efficiency of information
exchange in connected vehicles, paving the way for next-generation intelligent transportation sys-
tems. These applications illustrate A2C’s growing impact across multiple disciplines, cementing
its role as a versatile and robust reinforcement learning framework. By enabling real-time adapt-
ability, policy optimization, and multi-agent coordination, A2C continues to drive advancements
in machine learning, industrial automation, and autonomous decision-making.
The Advantage Actor-Critic (A2C) algorithm is a policy gradient method that improves sample
efficiency by utilizing both value-based and policy-based learning techniques. Let st denote the state
of the environment at time step t, at the action taken, and rt the immediate reward received. The
objective of the reinforcement learning agent is to maximize the expected cumulative discounted
reward, given by
J(\theta) = \mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right], (15.306)
where πθ (at | st ) represents the policy parameterized by θ, and γ is the discount factor (0 ≤ γ ≤ 1).
The policy is updated using the policy gradient theorem, which states
"∞ #
X
∇θ J(θ) = Eπθ ∇θ log πθ (at | st )Gt , (15.307)
t=0
where Gt is the return from time step t onwards. Instead of using the full return Gt , which has
high variance, A2C employs the advantage function
Aπ (st , at ) = Qπ (st , at ) − V π (st ), (15.308)
where Qπ (st , at ) is the action-value function,
" ∞
#
X
Qπ (st , at ) = Eπ γ k rt+k | st , at , (15.309)
k=0
A2C estimates V^{\pi}(s_t) using a parameterized critic V_{\phi}(s_t), trained by minimizing the squared temporal difference (TD) error
L(\phi) = \left(r_t + \gamma V_{\phi}(s_{t+1}) - V_{\phi}(s_t)\right)^{2}.
The actor updates the policy parameters θ using the advantage-weighted policy gradient
\theta \leftarrow \theta + \alpha\, A(s_t, a_t)\, \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t),
where α is the learning rate. The critic updates its parameters φ by gradient descent on the TD loss,
\phi \leftarrow \phi - \beta \nabla_{\phi} L(\phi),
where β is the critic's learning rate. The training process is synchronized across multiple agents, ensuring a more stable and deterministic update compared to the asynchronous variant A3C. This synchronous nature of A2C prevents stale gradients and allows for efficient batch updates.
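A single synchronous A2C update can be sketched with tabular actor and critic parameters following the two steps above; the specific transition, learning rates, and softmax parameterization are illustrative assumptions.

```python
# A minimal A2C update sketch: a TD(0) critic step and an advantage-weighted
# actor step for a tabular softmax policy, on one illustrative transition.
import numpy as np

n_states, n_actions = 4, 2
theta = np.zeros((n_states, n_actions))  # actor logits
V = np.zeros(n_states)                   # critic values V_phi(s)
alpha, beta, gamma = 0.05, 0.1, 0.99

s, a, r, s_next = 0, 1, 0.5, 2           # one observed transition
td_target = r + gamma * V[s_next]
advantage = td_target - V[s]             # A(s,a) ~ r + gamma V(s') - V(s)

# critic: gradient step on the squared TD error
V[s] += beta * (td_target - V[s])

# actor: advantage-weighted log-likelihood gradient for a softmax policy
pi = np.exp(theta[s]) / np.exp(theta[s]).sum()
g = -pi
g[a] += 1.0                              # grad of log pi(a|s) w.r.t. row s logits
theta[s] += alpha * advantage * g
```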
Deep Deterministic Policy Gradient (DDPG) has emerged as a fundamental algorithm in reinforce-
ment learning (RL) for continuous action spaces, building upon deterministic policy gradients and
deep Q-learning. The pioneering work by Lillicrap et al. (2015) [865] introduced the DDPG frame-
work, integrating an actor-critic architecture with experience replay and target networks to stabilize
training. This algorithm extends the Deterministic Policy Gradient (DPG) theorem from Silver et
al. (2014) [825], which rigorously established that deterministic policy gradients can be computed
with lower variance than their stochastic counterparts, making them particularly effective in high-
dimensional continuous control tasks. While DDPG demonstrated its efficacy in simulated physics
environments, subsequent research focused on addressing its limitations, particularly in off-policy
learning, stability, and exploration. Cicek et al. (2021) [927] tackled the issue of off-policy correc-
tion by introducing Batch Prioritized Experience Replay, a technique that prioritizes data points
based on their likelihood of being generated by the current policy. This enhancement improved
sample efficiency by reducing policy divergence during updates, a critical challenge in off-policy
deep RL methods.
Building upon these foundational ideas, Han et al. (2021) [928] proposed the Regularly Updated
Deterministic (RUD) Policy Gradient Algorithm, which introduced a structured update mechanism
to mitigate overestimation bias and variance in Q-value estimates. This was further refined by Pan
et al. (2020) [929] in the Softmax Deep Double Deterministic Policy Gradient (SD3) method,
which incorporated a Boltzmann softmax operator to smooth the optimization landscape and com-
bat overestimation errors, leading to more reliable policy convergence. Addressing another major
weakness of DDPG—exploration inefficiency—Luck et al. (2019) [930] proposed an approach that
integrates latent trajectory optimization, where model-based trajectory estimation improves explo-
ration in sparse-reward settings. Their work demonstrated that combining deep RL with latent
variable models leads to improved generalization and efficiency in complex, high-dimensional envi-
ronments. The practical applications of DDPG have also been expanded, with Dong et al. (2023)
[931] developing an enhanced version for robotic arm control by integrating adaptive reward shap-
ing and improved experience replay, yielding superior performance in dexterous manipulation tasks.
Beyond robotics, DDPG has been successfully applied in autonomous navigation and medical
decision-making. Jesus et al. (2019) [932] extended the DDPG framework to mobile robot naviga-
tion, demonstrating its effectiveness in dynamic obstacle-avoidance scenarios without the need for
explicit localization or map-based planning. Their results highlight DDPG’s ability to learn con-
trol policies in real-world environments where stochastic disturbances pose significant challenges.
In a different domain, Lin et al. (2023) [933] employed DDPG for personalized medicine dosing
strategies, illustrating how reinforcement learning can optimize complex decision-making processes
in healthcare. Their approach aimed to replicate expert medical decisions, showcasing the potential
for RL-based models to revolutionize treatment protocols by adapting to patient-specific charac-
teristics. Lastly, the work of Sumalatha et al. (2024) [934] provided a rigorous overview of the
evolution, enhancements, and applications of DDPG, consolidating various algorithmic advance-
ments and practical implementations across multiple fields.
Collectively, these contributions form a rigorous and multifaceted body of research that extends
DDPG’s theoretical foundations, addresses its key limitations, and explores novel applications in
diverse domains. The enhancements in off-policy correction, policy update mechanisms, and ex-
ploration strategies underscore the continuous evolution of deep reinforcement learning methods to
achieve greater efficiency, stability, and real-world applicability. As research in deep RL progresses,
the integration of DDPG with meta-learning, hierarchical RL, and model-based approaches is ex-
pected to further enhance its capabilities, opening new frontiers in both theoretical and applied
reinforcement learning.
Table 15.45: Summary of Contributions in Deep Deterministic Policy Gradient (DDPG) Research
Deep Deterministic Policy Gradient (DDPG) has demonstrated significant advancements in di-
verse domains, particularly in optimizing resource allocation, decision-making, and control systems
across vehicular networks, robotics, and UAV coordination. Recent studies have leveraged DDPG
and its variations, such as Twin Delayed DDPG (TD3) and Multi-Agent Deep Deterministic Policy
Gradient (MADDPG), to enhance performance in complex, real-time environments. For instance,
in vehicular edge computing networks, Yang et al. (2025) [935] proposed a hierarchical optimization
framework integrating DDPG to optimize driving mode selection and resource allocation, signif-
icantly reducing latency and improving efficiency. Similarly, Jamshidiha and Pourahmadi (2025)
incorporated DDPG into a traffic-aware graph neural network for cellular networks, demonstrat-
ing improved user association and network adaptability. Furthermore, Tian et al. (2025) [936]
introduced a Multi-State Iteration DDPG (SIDDPG) for the Internet of Vehicles (IoV), optimiz-
ing offloading strategies to minimize delays and energy consumption. These studies collectively
highlight DDPG’s ability to dynamically optimize decision-making processes in vehicular and com-
munication networks, ensuring enhanced network performance and resource utilization.
In addition to vehicular networks, DDPG has been extensively applied in robotics, UAV coor-
dination, and cloud-edge computing. Raei et al. (2025) [889] introduced a DDPG-based rein-
forcement learning framework for non-prehensile robotic manipulation, demonstrating superior ef-
ficiency in dynamic control environments. In another application, Chen et al. (2025) [937] proposed
a UAV-based computation offloading system utilizing DDPG, which optimizes task scheduling in
cloud-edge collaborative environments, reducing latency and enhancing computational efficiency.
Moreover, Ting-Ting et al. (2025) [890] developed a MADDPG-based approach for UAV clus-
ters, ensuring robust decision-making under communication constraints. These studies illustrate
DDPG’s adaptability to real-time, multi-agent environments, allowing for optimized coordination
and resource management in robotics and aerial systems. By integrating reinforcement learning
with real-world constraints, these approaches have successfully enhanced operational efficiency in
autonomous systems.
Furthermore, advancements in DDPG have facilitated its application in emerging network and
cloud computing paradigms. Deng et al. (2025) [938] expanded DDPG into a multi-agent re-
inforcement learning paradigm for non-terrestrial networks, enabling distributed coordination in
satellite communications. Similarly, Anwar and Akber (2025) [873] incorporated MADDPG into
structural engineering, optimizing building resilience by considering utility interactions under ex-
treme conditions. Additionally, Zhang et al. (2025) [939] employed TD3 for multi-vehicle and
multi-edge resource orchestration, optimizing computational tasks and minimizing system delays.
These studies highlight the robustness of DDPG in optimizing decision-making processes across
distributed networks, structural systems, and multi-agent interactions. By addressing dynamic,
high-dimensional control problems, DDPG and its extensions continue to drive innovations in ar-
tificial intelligence and engineering.
In conclusion, the reviewed studies establish DDPG as a versatile and powerful reinforcement learn-
ing technique capable of handling continuous action-space optimization in diverse domains. From
vehicular edge computing and UAV-based coordination to robotics and cloud-edge resource man-
agement, DDPG has demonstrated remarkable efficiency in optimizing real-time decision-making.
With further advancements in multi-agent learning, federated reinforcement learning, and adaptive
policy optimization, DDPG is poised to remain at the forefront of artificial intelligence research,
revolutionizing complex control and resource allocation problems across various industries.
Study | Contribution
Yang et al. (2025) | Introduced a three-stage hierarchical optimization framework (3SHO) where DDPG optimizes resource allocation, improving efficiency in vehicular networks.
Jamshidiha and Pourahmadi (2025) | Employed DDPG within a traffic-aware graph neural network to optimize user association, enhancing network adaptability in cellular networks.
Tian et al. (2025) | Developed Multi-State Iteration DDPG (SIDDPG) to optimize task partitioning and reduce latency and energy consumption in Internet of Vehicles (IoV).
Saad et al. (2025) | Applied Twin Delayed DDPG (TD3) for edge server selection in 5G-enabled industrial applications, minimizing computation delays.
Deng et al. (2025) | Proposed a Multi-Agent DDPG (MADDPG) for satellite communication networks, improving distributed coordination in non-terrestrial networks.
Raei et al. (2025) | Designed a DDPG-based reinforcement learning framework for robotic manipulation through sliding, improving control efficiency.
The action-value function, or Q-function, is defined as the expected return starting from a state s
and taking an action a, while following policy π:
"∞ #
X
π k
Q (s, a) = E [Rt | st = s, at = a, π] = E γ r(st+k , at+k ) | st = s, at = a, π . (15.317)
k=0
The Bellman equation for the Q-function under an optimal policy satisfies
Q^{*}(s, a) = \mathbb{E}\left[r(s, a) + \gamma \max_{a'} Q^{*}(s', a') \,\middle|\, s, a\right]. (15.318)
Unlike traditional Q-learning approaches, which solve for discrete action spaces by iterating over all
possible actions, DDPG employs an actor-critic architecture where the actor πθ is a deterministic
policy parameterized by θ that maps states directly to actions:
a = πθ (s). (15.319)
The critic function Qϕ (s, a) approximates the true Q-function and is parameterized by ϕ. The
optimal policy parameters are obtained by maximizing the Q-value under the policy:
\theta^{*} = \arg\max_{\theta}\ \mathbb{E}_{s \sim \rho^{\pi}}\left[Q_{\phi}(s, \pi_{\theta}(s))\right]. (15.320)
The deterministic policy gradient theorem states that the gradient of the expected return with
respect to the policy parameters is
\nabla_{\theta} J(\theta) = \mathbb{E}_{s \sim \rho^{\pi}}\left[\nabla_{\theta} \pi_{\theta}(s)\, \nabla_{a} Q_{\phi}(s, a)\big|_{a = \pi_{\theta}(s)}\right]. (15.321)
The Q-function is learned via the Bellman equation, leading to the following loss function for the
critic:
L(\phi) = \mathbb{E}_{(s,a,r,s') \sim D}\left[\left(Q_{\phi}(s, a) - y\right)^{2}\right], (15.322)
where the target y is computed using a target Q-network Q_{\phi'} and a target policy network \pi_{\theta'}:
y = r + \gamma Q_{\phi'}\left(s', \pi_{\theta'}(s')\right). (15.323)
To stabilize training, DDPG employs two techniques: (1) target networks, which are slow-moving
versions of the policy and value networks, updated with soft updates
θ′ ← τ θ + (1 − τ )θ′ , (15.324)
ϕ′ ← τ ϕ + (1 − τ )ϕ′ , (15.325)
where τ ≪ 1; and (2) an experience replay buffer D, where past transitions (s, a, r, s′ ) are stored and
randomly sampled for training, breaking correlations between consecutive updates and improving
sample efficiency.
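Both stabilizers can be sketched compactly: a bootstrapped critic target computed from the target networks, and the soft updates (15.324)-(15.325). The parameter vectors and the scalar stand-ins for Q_{φ'} and π_{θ'} below are illustrative assumptions rather than real networks.

```python
# A sketch of DDPG's stabilizers: the critic target y = r + gamma Q'(s', pi'(s'))
# from a replayed transition, and soft target-parameter updates.
import numpy as np

tau, gamma = 0.005, 0.99
theta = np.ones(8)            # online actor parameters (illustrative)
theta_targ = np.zeros(8)      # target actor parameters
phi = np.ones(8)              # online critic parameters
phi_targ = np.zeros(8)        # target critic parameters

def soft_update(target, online, tau):
    # theta' <- tau * theta + (1 - tau) * theta', with tau << 1
    return tau * online + (1.0 - tau) * target

# critic target for one replayed transition; the scalar stands in for
# Q_{phi'}(s', pi_{theta'}(s')) evaluated by the target networks
r, q_next_target = 0.3, 1.2
y = r + gamma * q_next_target

theta_targ = soft_update(theta_targ, theta, tau)
phi_targ = soft_update(phi_targ, phi, tau)
```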
Proximal Policy Optimization (PPO) has become one of the most widely used policy gradient meth-
ods due to its balance between computational efficiency and stable performance. Schulman et al.
(2017) laid the theoretical foundation for PPO by introducing a first-order optimization technique
that constrains policy updates through a clipped surrogate objective, ensuring stable and reliable
learning without requiring the complex second-order optimization steps of Trust Region Policy
Optimization (TRPO). Following its introduction, researchers identified several critical implemen-
tation details that significantly impact PPO’s performance. Huang and Dossa (2022) compiled a
comprehensive list of 37 key implementation details, highlighting often-overlooked hyperparameter
choices, network architectures, and training strategies necessary for PPO to perform consistently
across different tasks. These insights provided a crucial guideline for practitioners and researchers
to reproduce PPO’s results accurately, preventing performance discrepancies arising from subtle
implementation variations.
Several studies have proposed refinements to PPO’s clipping mechanism to enhance training sta-
bility and sample efficiency. Zhang et al. (2023) addressed the issue of fixed clipping bounds by
introducing a dynamic adjustment mechanism that utilizes task-specific feedback, thereby ensur-
ing that policy updates remain within an optimal range. This adaptation mitigates the issue of
overly aggressive policy updates in early training stages while preventing excessive conservatism in
later stages. Similarly, Zhang et al. (2020) proposed an alternative dynamic clipping strategy to
fine-tune the trade-off between exploration and exploitation, further improving PPO’s adaptabil-
ity. Another notable contribution in this domain is Kobayashi’s (2022) introduction of an adaptive
threshold for PPO using the symmetric relative density ratio, which optimally adjusts the clipping
bound based on the scale of the policy update error. This method improves training stability and
ensures a more theoretically grounded approach to threshold selection. Kobayashi (2020) extended
this idea by formulating PPO using relative Pearson divergence, which provides an intuitive and
principled way to constrain policy updates, thereby achieving smoother and more stable training
dynamics.
Beyond single-agent reinforcement learning, PPO has also been extended to multi-agent systems. Piao and Zhuo (2021) introduced Coordinated Proximal Policy Optimization (CoPPO), which adapts step sizes dynamically for multiple interacting agents, mitigating instability issues caused by the non-stationarity that arises when several policies learn concurrently.
Finally, a broader perspective on PPO’s reliability was provided by Henderson et al. (2018),
who critically examined deep reinforcement learning methodologies and demonstrated that per-
formance metrics in literature are highly sensitive to hyperparameter tuning and implementation
details. Their findings underscored the necessity of rigorous benchmarking and reproducibility
standards to ensure that performance improvements claimed by various PPO enhancements are not
merely the result of uncontrolled hyperparameter tuning. Collectively, these contributions show-
case the continuous evolution of PPO, from its theoretical foundation to practical improvements
in policy update mechanisms, stability guarantees, and applicability in multi-agent reinforcement
learning. They also highlight the importance of principled algorithm design and robust evaluation
methodologies to advance reinforcement learning research in a scientifically rigorous manner.
Reference | Contribution
Schulman et al. (2017) | Introduced Proximal Policy Optimization (PPO), a first-order policy gradient method that balances stability and efficiency. PPO replaces the trust region constraint in TRPO with a clipped surrogate objective, simplifying implementation while maintaining strong empirical performance across various reinforcement learning tasks.
Huang and Dossa (2022) | Provided a comprehensive checklist of 37 crucial implementation details necessary for reproducing PPO's reported performance. This work highlighted the impact of hyperparameter choices, architectural considerations, and training strategies on PPO's consistency and reliability.
Zhang et al. (2023) | Proposed a dynamic clipping strategy where the clipping threshold adapts based on task feedback, ensuring optimal constraint selection for policy updates. This method improves training stability and sample efficiency while addressing PPO's sensitivity to fixed clipping bounds.
Zhang et al. (2020) | Developed an exploration-enhancing variant of PPO that incorporates uncertainty estimation. Their approach improves sample efficiency by encouraging exploration in regions of high uncertainty, particularly in continuous control tasks.
Kobayashi (2022) | Introduced a threshold adaptation mechanism for PPO using the symmetric relative density ratio, leading to a more theoretically grounded method for selecting the clipping bound. This modification enhances stability and improves policy update efficiency.
Kobayashi (2020) | Formulated PPO using relative Pearson divergence, providing a principled policy divergence constraint that ensures smoother and more stable training updates, improving theoretical soundness.
Piao and Zhuo (2021) | Extended PPO to multi-agent reinforcement learning by introducing Coordinated Proximal Policy Optimization (CoPPO), which adapts step sizes dynamically for multiple interacting agents. This approach improves training stability and performance in decentralized multi-agent environments.
Zhang and Wang (2020) | Proposed a dynamic clipping bound method that adjusts the threshold throughout training, further refining the balance between exploration and exploitation in PPO-based learning.
Wang et al. (2019) | Introduced Truly Proximal Policy Optimization (TPPO), which explicitly incorporates a trust region constraint into PPO's objective function, leading to more reliable and theoretically grounded policy updates.
Henderson et al. (2018) | Critically examined the reproducibility of deep reinforcement learning algorithms, including PPO. The study demonstrated the sensitivity of performance metrics to hyperparameter tuning and implementation details, emphasizing the need for rigorous benchmarking and standardization in reinforcement learning research.
Proximal Policy Optimization (PPO) has emerged as a robust reinforcement learning algorithm
with wide-ranging applications across multiple domains, from space exploration to autonomous
robotics and networking optimization. One of its notable implementations is in interplanetary tra-
jectory design, where PPO is leveraged to optimize orbital maneuvers, ensuring efficient fuel man-
agement and trajectory planning under stringent mission constraints (Cuéllar et al., 2024) [940].
This showcases PPO’s ability to handle high-dimensional decision-making problems. Similarly, in
autonomous underwater robotics, PPO is employed for three-dimensional trajectory planning by in-
tegrating fluid dynamics-based motion strategies, allowing underwater vehicles to adapt to dynamic
marine environments and external perturbations (Liu et al., 2025) [941]. The adaptability of PPO
in handling complex physical constraints further extends to humanoid robotics, where it enables
fast multimodal gait learning, allowing bipedal robots to adjust walking patterns across varying
terrains through reinforcement-based self-learning mechanisms (Figueroa et al., 2025) [942]. These
applications underscore the strength of PPO in real-world motion planning and adaptive control.
In networking and communication systems, PPO has demonstrated its efficacy in cognitive ra-
dio networks and edge computing. For instance, a Lyapunov-guided PPO-based resource allocation
model optimizes spectrum sharing and power distribution in dynamic radio environments, reducing
latency and maximizing throughput (Xu et al., 2025) [943]. Similarly, vehicular networks benefit
from PPO-driven computation offloading, where the algorithm efficiently schedules tasks between
cloud and edge nodes to enhance latency-sensitive applications (Mustafa et al., 2025) [886]. PPO
has also been employed in anti-jamming wireless communication to dynamically allocate spectrum
resources in adversarial settings, ensuring robust and resilient connectivity (Li et al., 2025) [944].
These applications illustrate how PPO enhances real-time decision-making in complex, distributed
systems that require efficient task execution and resource utilization.
Beyond robotics and communication, PPO has been instrumental in optimizing industrial processes
and energy systems. In cloud computing, PPO has been integrated with graph neural networks
(GNNs) to optimize dynamic workflow scheduling, reducing both computational overhead and
energy consumption in large-scale cloud environments (Chandrasiri and Meedeniya, 2025) [945].
Moreover, a smart building energy management system leverages PPO to dynamically regulate
heating, ventilation, and air conditioning (HVAC) operations, significantly improving carbon effi-
ciency (Wu and Xie, 2025) [946]. In the manufacturing sector, Guan et al. (2025) [947] proposed
a multi-agent PPO-based approach for real-time production scheduling and AGV selection, ensur-
ing just-in-time logistics and maximizing supply chain efficiency. These implementations highlight
PPO’s impact on autonomous decision-making for sustainability and operational efficiency.
Finally, PPO’s utility in decision-making under uncertainty is evident in financial markets and
transportation systems. In stock portfolio management, PPO is integrated with reinforcement-
based market simulations to optimize trade execution, balancing risk and return in volatile envi-
ronments (Zhang et al., 2025) [948]. Additionally, in autonomous vehicle trajectory planning, PPO
is combined with control barrier function techniques to ensure collision-free navigation in high-
speed driving scenarios, refining behavior planning strategies at complex intersections (Zhang et
al., 2025) [949]. These cases demonstrate PPO’s capability in learning optimal policies in dynamic,
stochastic environments, making it an invaluable tool for real-time AI decision-making. Across
all these domains, PPO continues to drive innovation by balancing exploration and exploitation,
allowing intelligent systems to adapt to ever-changing real-world constraints efficiently.
Proximal Policy Optimization (PPO) is a policy gradient method used in reinforcement learning
that seeks to improve the stability and efficiency of policy updates while maintaining sample effi-
ciency. Given a Markov Decision Process (MDP) defined by the tuple (S, A, P, r, γ), where S is
the state space, A is the action space, P (s′ | s, a) represents the transition probabilities, r(s, a) is
the reward function, and γ is the discount factor, PPO aims to find an optimal policy πθ (a | s) pa-
rameterized by θ, maximizing the expected cumulative discounted reward. The objective function
for policy optimization is given by the expected advantage-weighted likelihood ratio:
J(\theta) = \mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}} \left[ \frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)} \, A^{\pi_{\theta_{\mathrm{old}}}}(s, a) \right] \qquad (15.326)
where Aπθold (s, a) is the advantage function that quantifies how much better an action is compared
to the expected value of the state under the old policy. To prevent overly large updates that can
lead to instability, PPO employs a clipped surrogate objective:
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}} \left[ \min\left( r_\theta(s,a) \, A^{\pi_{\theta_{\mathrm{old}}}}(s,a), \; \mathrm{clip}(r_\theta(s,a), 1-\epsilon, 1+\epsilon) \, A^{\pi_{\theta_{\mathrm{old}}}}(s,a) \right) \right] \qquad (15.327)
where r_\theta(s,a) = \pi_\theta(a \mid s)/\pi_{\theta_{\mathrm{old}}}(a \mid s) denotes the probability ratio.
This formulation ensures that the policy update is constrained within a predefined trust region
without requiring explicit second-order optimization methods, unlike Trust Region Policy Opti-
mization (TRPO). Additionally, PPO often includes a value function term to reduce variance while
maintaining unbiased gradient estimates. The value loss function is given by
L^{\mathrm{VF}}(\theta) = \mathbb{E}_{s \sim \pi_{\theta_{\mathrm{old}}}} \left[ \left( V_\theta(s) - V^{\mathrm{target}}(s) \right)^2 \right] \qquad (15.330)
where Vθ (s) is the learned state-value function and V target (s) is the estimated target value function
computed from bootstrapped returns, typically using Generalized Advantage Estimation (GAE):
V^{\mathrm{target}}(s_t) = R_t + \gamma V_{\theta_{\mathrm{old}}}(s_{t+1}) \qquad (15.331)
where the return is
R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots + \gamma^{T-t} r_T \qquad (15.332)
and the advantage function is estimated using GAE:
A_t = \sum_{l=0}^{T-t} (\gamma\lambda)^l \, \delta_{t+l} \qquad (15.333)
where \delta_t = r_t + \gamma V_{\theta_{\mathrm{old}}}(s_{t+1}) - V_{\theta_{\mathrm{old}}}(s_t) is the temporal-difference residual.
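To make these objectives concrete, the following is a minimal NumPy sketch of the clipped surrogate loss of (15.327) and the GAE recursion of (15.333). The function names, the toy rollout, and the zero bootstrap value at the end of the trajectory are illustrative assumptions, not part of any reference implementation.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    # A_t = sum_l (gamma*lam)^l * delta_{t+l}, computed backwards, Eq. (15.333);
    # values has length T+1 so that delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    # Clipped surrogate objective of Eq. (15.327), to be maximized.
    ratio = np.exp(logp_new - logp_old)              # r_theta(s, a)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))

# toy rollout of length 5 with a zero bootstrap value at the end
rewards = np.array([1.0, 0.5, 0.0, 1.0, 0.2])
values = np.array([0.8, 0.7, 0.6, 0.9, 0.3, 0.0])    # V(s_0), ..., V(s_5)
adv = gae_advantages(rewards, values)
logp_old = np.log(np.full(5, 0.25))
logp_new = np.log(np.full(5, 0.30))
print(ppo_clip_loss(logp_new, logp_old, adv))
```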
The Soft Actor-Critic (SAC) algorithm, originally proposed by Haarnoja et al. (2018), introduced
a novel framework that integrates the maximum entropy principle with off-policy reinforcement
learning, significantly enhancing sample efficiency, stability, and robustness of policy learning. The
key contribution of SAC lies in its entropy-regularized objective, which encourages policies to not
only maximize expected returns but also maintain high entropy, ensuring effective exploration and
better generalization across unseen states. This approach was further refined in Haarnoja et al.
(2019), where an automatic temperature tuning mechanism was introduced to dynamically balance
the trade-off between exploration and exploitation, alleviating the need for exhaustive hyperpa-
rameter tuning. These developments enabled SAC to outperform existing algorithms such as Deep
Deterministic Policy Gradient (DDPG) and Twin Delayed DDPG (TD3) on a variety of continu-
ous control benchmarks, demonstrating superior performance in both sample efficiency and policy
robustness.
Several extensions of SAC have been proposed to address specific challenges in robotics, autonomous
navigation, and uncertainty estimation. Wu et al. (2023) applied SAC to LiDAR-based robot nav-
igation, where the algorithm was adapted to dynamic obstacle avoidance scenarios, showing im-
proved training efficiency and navigation success rates. Similarly, Hossain et al. (2022) introduced
an inhibitory network-based modification to SAC, designed to accelerate the retraining of UAV
controllers, thereby enhancing adaptability in fast-changing environments. Another significant de-
velopment came from Ishfaq et al. (2025), who introduced Langevin Soft Actor-Critic (LSAC),
integrating Thompson sampling with distributional Langevin Monte Carlo updates to improve un-
certainty estimation and exploration efficiency. These contributions reflect ongoing efforts to tailor
SAC for real-world applications where stability and robustness are paramount.
Further theoretical advancements have sought to refine the critic network and improve value esti-
mation in SAC. Verma et al. (2023) proposed the Soft Actor Retrospective Critic (SARC), which
introduced an additional retrospective loss term for the critic network, accelerating convergence
and stabilizing training dynamics. Addressing concerns regarding approximation errors, Tasdighi
et al. (2023) introduced PAC-Bayesian Soft Actor-Critic, which incorporates a PAC-Bayesian ob-
jective into the critic’s learning process, reducing uncertainty and improving sample efficiency. In
a related effort, Duan (2021) proposed Distributional Soft Actor-Critic (DSAC), which learns a
Gaussian distribution over stochastic returns, mitigating value overestimation errors commonly
encountered in standard SAC implementations. These refinements collectively contribute to more
reliable policy learning, especially in high-dimensional continuous control tasks.
Beyond algorithmic improvements, theoretical explanations and comparative studies have solid-
ified SAC’s role as a benchmark reinforcement learning algorithm. The Papers with Code analysis
provides a detailed breakdown of SAC’s fundamental principles, highlighting its superiority over
traditional policy optimization methods due to its ability to capture multi-modal policy distri-
butions. This is complemented by the in-depth analyses of Haarnoja et al., who emphasize the
algorithm’s strengths in handling stochasticity while maintaining computational tractability. Taken
together, these contributions establish SAC as one of the most powerful and widely adopted actor-
critic methods in deep reinforcement learning, continuing to inspire further research in robotics,
autonomous systems, and sample-efficient deep reinforcement learning.
Soft Actor-Critic (SAC) has emerged as a powerful deep reinforcement learning algorithm, demonstrating remarkable adaptability across various domains such as autonomous systems, financial
markets, industrial automation, and transportation. One significant application is in search and
rescue operations, where SAC outperforms traditional reinforcement learning methods due to its
entropy-regularized policy. Ewers et al. (2025) implement SAC alongside Proximal Policy Opti-
mization (PPO) in a recurrent autoencoder-based deep reinforcement learning system, revealing
SAC’s superiority in handling dynamic and uncertain environments. Similarly, in fluid dynamics,
Yan et al. (2025) propose mutual information-based knowledge transfer learning (MIKT-SAC),
enhancing SAC’s ability to generalize across domains for active flow control in bluff body flows,
demonstrating significant improvements in cross-domain transferability. Another critical indus-
trial application is seen in Industry 5.0, where Asmat et al. (2025) develop a Digital Twin (DT)
framework integrated with SAC, enabling intelligent cyber-physical systems for adaptive industrial
transitions. Moreover, SAC is extensively utilized in telecommunications, as highlighted by Chao
and Jiao (2025), who leverage SAC for network spectrum resource allocation, optimizing spectrum
usage in dynamic wireless environments.
Beyond industrial applications, SAC is also making strides in autonomous navigation and adversar-
ial learning. Ma et al. (2025) introduce SIE-SAC, an advanced SAC-based learning mechanism for
UAV navigation under adversarial conditions, specifically GPS/INS-integrated spoofing scenarios.
The study underscores SAC’s capability to adapt and refine deception strategies, a crucial element
in defensive aerospace applications. Additionally, SAC is being explored in financial markets, where
Walia et al. (2025) integrate SAC with causal generative adversarial networks (GANs) and large
language models (LLMs) for liquidity-aware bond yield prediction. This approach significantly
enhances financial forecasting by leveraging reinforcement learning’s adaptability in complex, non-
linear financial datasets. Lalor and Swishchuk (2025) further extend SAC’s financial applications
by applying it to non-Markovian market-making, proving its efficiency in handling long-term de-
pendencies and stochastic pricing models.
SAC’s utility is not limited to theoretical models but also extends into multi-agent cooperative
learning and resource optimization. Zhang et al. (2025) propose a diffusion-based SAC framework
for multi-UAV networks in the Metaverse, optimizing cooperative task allocation and resource dis-
tribution in edge-enabled virtual environments. Similarly, Zhao et al. (2025) apply SAC in energy
management for hybrid storage systems in urban rail transit, enhancing the efficiency of traction
power supply systems. This approach demonstrates SAC’s potential in sustainable energy appli-
cations, where efficient power distribution is paramount. In the automotive sector, Tresca et al.
(2025) utilize SAC to design adaptive energy management strategies for hybrid electric vehicles, op-
timizing fuel efficiency and battery longevity by dynamically adjusting energy consumption based
on real-time driving conditions.
Overall, SAC continues to revolutionize deep reinforcement learning across various domains by
providing sample-efficient, entropy-regularized learning strategies that balance exploration and ex-
ploitation effectively. From autonomous navigation and cybersecurity to industrial optimization
and sustainable energy systems, SAC’s ability to handle high-dimensional, complex decision-making
problems ensures its widespread applicability. Whether applied in robotics, finance, or telecommu-
nications, SAC demonstrates exceptional versatility, pushing the boundaries of intelligent decision-
making in real-world systems.
The Soft Actor-Critic (SAC) algorithm is a model-free, off-policy deep reinforcement learn-
ing method that optimizes policies in continuous action spaces. It is formulated within the
maximum entropy reinforcement learning framework, which aims to maximize both the
expected return and the entropy of the policy. Given a Markov Decision Process (MDP) de-
fined by the tuple M = (S, A, p, r, γ), where S is the state space, A is the action space, p(s′ | s, a)
is the transition probability, r(s, a) is the reward function, and γ ∈ (0, 1] is the discount factor,
SAC introduces a policy optimization objective that includes an entropy term. This results in a
policy that is stochastic and encourages exploration while still optimizing for high rewards. The
objective function for SAC is defined as
J(\pi) = \sum_{t=0}^{\infty} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha \, \mathcal{H}(\pi(\cdot \mid s_t)) \right] \qquad (15.338)
where ρπ (st , at ) represents the state-action distribution under policy π, H(π(· | st )) is the entropy
of the policy at state st , and α is a temperature parameter that controls the trade-off between
exploration and exploitation. The soft Q-function in SAC satisfies the Bellman equation
Q^\pi(s,a) = r(s,a) + \gamma \, \mathbb{E}_{s' \sim p(s' \mid s,a), \, a' \sim \pi} \left[ Q^\pi(s', a') - \alpha \log \pi(a' \mid s') \right] \qquad (15.339)
which differs from the standard Q-function by incorporating the entropy term −α log π(a′ | s′ ),
making the policy more stochastic and less greedy. The policy is optimized by minimizing the
KL divergence between π(a | s) and an exponential Boltzmann distribution induced by Qπ (s, a):
\pi^*(\cdot \mid s) = \arg\min_{\pi} D_{\mathrm{KL}}\!\left( \pi(\cdot \mid s) \,\Big\|\, \frac{\exp(Q(s, \cdot)/\alpha)}{Z(s)} \right) \qquad (15.340)
where Z(s) is the partition function ensuring normalization. This results in a policy of the form
\pi(a \mid s) = \frac{\exp(Q(s,a)/\alpha)}{\int_{\mathcal{A}} \exp(Q(s,a')/\alpha) \, da'} \qquad (15.341)
which highlights the softmax-like behavior of the policy. To stabilize learning, SAC employs
two Q-networks, Qθ1 (s, a) and Qθ2 (s, a), and uses the minimum value in the Bellman update to
reduce overestimation bias:
y = r(s,a) + \gamma \, \mathbb{E}_{s' \sim p, \, a' \sim \pi} \left[ \min\left( Q_{\theta_1}(s', a'), Q_{\theta_2}(s', a') \right) - \alpha \log \pi(a' \mid s') \right] \qquad (15.342)
where the transition (s, a, r, s') is sampled from the replay buffer D. The policy \pi_\phi(a \mid s) is parameterized using a neural network,
and its objective function is derived from the reparameterization trick. If πϕ is a Gaussian policy
where
a = \tanh(\mu_\phi(s) + \sigma_\phi(s) \cdot \epsilon), \qquad \epsilon \sim \mathcal{N}(0, I) \qquad (15.344)
then the policy objective is
J_\pi(\phi) = \mathbb{E}_{s \sim D, \, \epsilon \sim \mathcal{N}(0,I)} \left[ \alpha \log \pi_\phi(a \mid s) - \min\left( Q_{\theta_1}(s,a), Q_{\theta_2}(s,a) \right) \right] \qquad (15.345)
where the first term encourages higher entropy, and the second term ensures policy improve-
ment. The temperature parameter α is automatically adjusted by minimizing
J_\alpha(\alpha) = \mathbb{E}_{s \sim D, \, a \sim \pi_\phi} \left[ -\alpha \left( \log \pi_\phi(a \mid s) + \bar{\mathcal{H}} \right) \right] \qquad (15.346)
where \bar{\mathcal{H}} is a target entropy value, and \alpha is updated by gradient descent on this objective, \alpha \leftarrow \alpha - \lambda \nabla_\alpha J_\alpha(\alpha), where \lambda is the learning rate. The full SAC algorithm follows an off-policy training procedure
where experience tuples (s, a, r, s′ ) are sampled from the replay buffer, and gradients are computed
using stochastic updates. The key components include the Q-function updates, policy up-
dates, and entropy adjustments, ensuring stable convergence and exploratory behavior
throughout training.
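The soft Bellman target of (15.342) and the temperature objective of (15.346) amount to a few lines of array arithmetic. Below is a minimal NumPy sketch under illustrative assumptions; the toy batch, the target entropy of -2.0 (e.g., the negative action dimension), and the helper names are not drawn from any particular implementation.

```python
import numpy as np

def sac_target(r, q1_next, q2_next, logp_next, gamma=0.99, alpha=0.2):
    # Soft Bellman target of Eq. (15.342): twin-Q minimum plus entropy bonus.
    return r + gamma * (np.minimum(q1_next, q2_next) - alpha * logp_next)

def temperature_loss(logp, alpha, target_entropy):
    # J(alpha) = E[-alpha * (log pi(a|s) + H_bar)], Eq. (15.346).
    return np.mean(-alpha * (logp + target_entropy))

# toy batch of two transitions
r = np.array([1.0, 0.0])
q1_next, q2_next = np.array([5.0, 4.0]), np.array([4.5, 4.2])
logp_next = np.array([-1.2, -0.8])
print(sac_target(r, q1_next, q2_next, logp_next))

alpha, target_H = 0.2, -2.0      # target entropy, e.g. -dim(A)
logp = np.array([-1.0, -1.5])
print(temperature_loss(logp, alpha, target_H))
```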
15.5 Neuroevolution
Neuroevolution is a computational framework in artificial intelligence (AI) that utilizes evolu-
tionary algorithms to optimize the architecture and weights of artificial neural networks (ANNs).
Unlike conventional deep learning, which relies on gradient-based optimization techniques such as
stochastic gradient descent (SGD), neuroevolution leverages principles of evolutionary computa-
tion, including mutation, crossover, and selection, to iteratively refine the structure and parameters
of neural networks. Given a population of candidate neural networks N = {N1 , N2 , . . . , Nm }, each
network is evaluated by a fitness function F (N ), which measures its performance on a given task.
The fundamental process can be mathematically described as an iterative search over the space of
possible networks, where the objective is to maximize the fitness function:
N^* = \arg\max_{N \in \mathcal{N}} F(N). \qquad (15.348)
Each neural network in the population is parameterized by a set of weights W and biases B, where
the function represented by the network is given by
y = σ(W x + B), (15.349)
where x is the input vector, y is the output, and σ(·) is the activation function, typically chosen
as a nonlinear function such as the sigmoid, tanh, or ReLU. The evolutionary algorithm modifies
the parameters W and B over successive generations through genetic operators, which can be
expressed as follows. Given a set of parent solutions \{(W_i, B_i)\}_{i=1}^{m}, offspring solutions are generated
via mutation and crossover. Mutation introduces perturbations in the network weights:
W ′ = W + ϵ, B ′ = B + δ, (15.350)
where ϵ ∼ N (0, σ 2 ) and δ ∼ N (0, σ 2 ) are Gaussian perturbations. The crossover operation com-
bines weight matrices from two parent networks (W1 , B1 ) and (W2 , B2 ), producing an offspring
(W ′ , B ′ ):
W ′ = αW1 + (1 − α)W2 , B ′ = αB1 + (1 − α)B2 , (15.351)
where α ∈ [0, 1] is a crossover coefficient that determines the proportion of contribution from each
parent. The fitness of each offspring network is then computed, and a selection mechanism is
applied to determine which networks survive to the next generation. A common selection strategy
is tournament selection, in which k networks are randomly chosen, and the one with the highest
fitness is selected:
N_{\mathrm{next}} = \arg\max_{N \in \mathcal{N}_{\mathrm{tournament}}} F(N). \qquad (15.352)
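The mutation, crossover, and tournament-selection operators of (15.350)-(15.352) compose into a compact evolutionary loop. The following Python sketch evolves flattened parameter vectors against a toy fitness landscape; the population size, mutation scale, and fitness function are illustrative assumptions.

```python
import numpy as np
rng = np.random.default_rng(0)

def fitness(w):
    # Toy fitness: negative squared distance to a hidden optimum at 1.0.
    return -np.sum((w - 1.0) ** 2)

def mutate(w, sigma=0.1):
    return w + rng.normal(0.0, sigma, size=w.shape)      # Eq. (15.350)

def crossover(w1, w2):
    alpha = rng.uniform()                                # Eq. (15.351)
    return alpha * w1 + (1 - alpha) * w2

def tournament(pop, k=3):
    # Eq. (15.352): sample k candidates, keep the fittest.
    contenders = rng.choice(len(pop), size=k, replace=False)
    return pop[max(contenders, key=lambda i: fitness(pop[i]))]

pop = [rng.normal(size=4) for _ in range(20)]
for generation in range(50):
    pop = [mutate(crossover(tournament(pop), tournament(pop))) for _ in range(20)]
print(max(fitness(w) for w in pop))   # approaches 0 as the optimum is found
```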
In addition to optimizing the weights, neuroevolution can be extended to evolve the architecture of
the neural networks, including the number of layers, neurons per layer, and connectivity patterns.
This results in a combinatorial search problem where each network is represented as a graph G =
(V, E), where V is the set of neurons and E is the set of synaptic connections. The evolution of
architectures can be modeled using genetic encoding schemes such as direct encoding, where each
network parameter is explicitly represented in a genome, or indirect encoding, where a compact
representation (e.g., a developmental rule) is used to generate the architecture. If the architecture
is encoded as a vector θ, then the objective function is
θ∗ = arg max F (G(θ)). (15.353)
θ
A powerful variant of neuroevolution is the use of novelty search, in which selection is based not
on performance but on behavioral diversity. The novelty of a network is defined as the average
distance between its behavior and those of its nearest neighbors in the population:
\mathcal{N}(N) = \frac{1}{k} \sum_{i=1}^{k} d(N, N_i), \qquad (15.354)
where d(N, Ni ) is a distance metric (e.g., Euclidean or cosine distance) between network behaviors.
Modern neuroevolution approaches incorporate gradient-based methods to accelerate convergence.
One such method is Evolution Strategies (ES), which approximates the gradient of the expected
fitness with respect to the network parameters via a perturbation-based estimation:
\nabla_\theta \, \mathbb{E}[F(\theta)] \approx \frac{1}{n\sigma} \sum_{i=1}^{n} F(\theta + \sigma \epsilon_i) \, \epsilon_i, \qquad (15.355)
where ϵi ∼ N (0, I) are random perturbations, and σ is the step size. This update rule enables
evolution to operate in high-dimensional parameter spaces more efficiently. Another advanced
neuroevolution technique is the NeuroEvolution of Augmenting Topologies (NEAT), which evolves
both network weights and topologies by introducing genetic operators such as mutation of connec-
tions and nodes. The key idea is to preserve structural innovations via speciation, where networks
are clustered into species based on a similarity metric S(N1 , N2 ), typically computed as a function
of the number of disjoint and excess connections:
S(N1 , N2 ) = c1 E + c2 D + c3 W, (15.356)
where E and D are the number of excess and disjoint genes, respectively, and W is the average
weight difference between matching genes, with c1 , c2 , c3 as weighting coefficients. Through iterative
application of these principles, neuroevolution generates increasingly effective neural networks for
complex tasks such as reinforcement learning, robotics, and generative modeling. By searching the
space of neural architectures and parameters simultaneously, it provides an alternative to traditional
deep learning methods, enabling the discovery of novel architectures without the need for explicit
human-designed features. The optimization process can be formally represented as
" T #
X
min Eπθ γ t rt , (15.357)
θ
t=0
where πθ is a policy parameterized by a neural network, rt is the reward at time step t, and γ is a
discount factor in reinforcement learning settings.
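The perturbation-based estimator of (15.355) is straightforward to implement. The sketch below averages over n perturbations and applies gradient ascent on a toy fitness; the sample count, step sizes, and fitness function are illustrative assumptions.

```python
import numpy as np
rng = np.random.default_rng(1)

def es_gradient(F, theta, sigma=0.1, n=100):
    # Monte Carlo estimate of grad E[F(theta)] per Eq. (15.355),
    # averaged over n Gaussian perturbations.
    eps = rng.normal(size=(n, theta.size))
    fitness = np.array([F(theta + sigma * e) for e in eps])
    return (eps * fitness[:, None]).mean(axis=0) / sigma

F = lambda w: -np.sum(w ** 2)            # toy fitness, maximum at the origin
theta = np.array([1.0, -2.0])
for _ in range(200):
    theta = theta + 0.05 * es_gradient(F, theta)   # gradient ascent on fitness
print(theta)                             # should approach [0, 0]
```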
As deep learning gained prominence, researchers sought ways to automate the design of neural
architectures. Miikkulainen et al. (2024) [957] introduced CoDeepNEAT, an extension of NEAT
that optimizes deep learning architectures by evolving topology, hyperparameters, and compo-
nents, yielding results comparable to human-designed networks for object recognition. Similarly,
LEAF (Learning Evolutionary AI Framework) by Liang et al. (2019) [958] applied evolu-
tionary algorithms to optimize both the architecture and hyperparameters of deep neural networks,
demonstrating state-of-the-art performance in real-world tasks such as medical imaging and natural
language processing. Meanwhile, Vargas and Murata (2016) [959] proposed Spectrum-Diverse
Neuroevolution, which introduces a diversity-preserving mechanism to enhance the robustness of
evolved networks by maintaining variation at the behavioral level. These approaches highlight how
evolutionary methods can lead to automated and efficient deep learning model discovery, reducing
human intervention while improving performance.
Related work has applied evolutionary strategies to automate model selection and improve computational efficiency, indicating that genetic-based hyperparameter optimization can outperform traditional grid and random search by at least 40 percent in training speed and accuracy.
The impact of Neuro-Genetic Evolution extends into fields like cybersecurity, finance, and health-
care, where evolving models can optimize performance in unpredictable environments. Kannan
et al. (2024) [970] presented a neuro-genetic deep learning framework for IoT security, effectively
detecting RPL attacks through an adaptive, self-learning anomaly detection system. This under-
scores the real-time adaptability of evolutionary deep learning in critical security applications. In
financial forecasting, Zeng et al. (2022) [971] developed a hybrid genetic-deep learning model for
predicting market fluctuations, demonstrating higher accuracy and robustness in volatile financial
trends. Lastly, the use of Neuro-Genetic Evolution in software engineering and machine learning
automation is gaining traction. S KV and Swamy (2024) [972] explored ensemble-based neuro-
genetic models to improve software quality by refining feature selection and defect prediction. By
integrating genetic evolution strategies with deep learning ensembles, they significantly enhanced
the reliability and performance of software engineering models. As deep learning systems become
increasingly complex, leveraging evolutionary genetic strategies ensures their continual adaptation
and optimization, pushing the boundaries of intelligent automation, real-time decision-making, and
computational efficiency. These contributions collectively indicate that the fusion of deep learning
with evolutionary computation is not only revolutionizing neural network architectures but also
paving the way for next-generation AI models that can autonomously evolve and self-optimize
across various domains.
Table 15.52: Summary of Recent Contributions in Neuro-Genetic Evolution Research
A candidate network computes
f(x; W, b) = \sigma(Wx + b), \qquad (15.358)
where x \in \mathbb{R}^n is the input vector, W is the weight matrix, b is the bias vector, and \sigma(\cdot) is an activation function. The objective is to optimize W and b using genetic evolution. A population of neural networks is represented as
P = \{(W_1, b_1), (W_2, b_2), \ldots, (W_N, b_N)\}, \qquad (15.359)
where N is the population size. The fitness function F : P \to \mathbb{R} evaluates the performance of each network on a given task:
F(W_i, b_i) = -\sum_{j=1}^{M} \left\| f(x_j; W_i, b_i) - y_j \right\|^2 \qquad (15.360)
where M is the number of training samples, (xj , yj ) are training pairs, and the objective is to
maximize F . Selection is performed using a probabilistic function based on fitness:
P_i = \frac{F(W_i, b_i)}{\sum_{k=1}^{N} F(W_k, b_k)} \qquad (15.361)
where Pi is the probability of selecting the i-th network for reproduction. Higher-fitness networks
are more likely to be chosen. Crossover combines weights of two parent networks (WA , bA ) and
(W_B, b_B) to create an offspring (W_C, b_C):
W_C = \alpha W_A + (1 - \alpha) W_B, \qquad b_C = \alpha b_A + (1 - \alpha) b_B \qquad (15.362)
where α ∼ U (0, 1) is a random crossover parameter. Mutation perturbs the weights and biases
with a small random noise:
W' = W + \eta \, \epsilon_W, \qquad b' = b + \eta \, \epsilon_b, \qquad \epsilon_W, \epsilon_b \sim \mathcal{N}(0, \sigma^2) \qquad (15.363)
where \eta is the mutation rate and \mathcal{N}(0, \sigma^2) is Gaussian noise. The evolution process iterates over multiple generations G to refine the population:
P^{(t+1)} = \mathrm{Mutate}\big(\mathrm{Crossover}\big(\mathrm{Select}(P^{(t)})\big)\big), \qquad t = 0, 1, \ldots, G-1, \qquad (15.364)
where t denotes the generation index. Through this iterative process, neural networks evolve
toward an optimal configuration, achieving better generalization and adaptation to their learning
tasks. The convergence of the method is influenced by hyperparameters such as population size
N , mutation rate η, and crossover probability pc , which are optimized to balance exploration and
exploitation in the evolutionary process.
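One full generation of the selection, crossover, and mutation rules in (15.361)-(15.363) can be sketched as follows. Because the fitness values here are negative, they are shifted to be positive before normalization so that the selection probabilities of (15.361) are valid; that shift, along with all constants and the toy regression target, is an illustrative assumption.

```python
import numpy as np
rng = np.random.default_rng(2)

def generation_step(pop, fitnesses, eta=0.05, sigma=0.1):
    # Fitness-proportional selection (15.361), arithmetic crossover (15.362),
    # and Gaussian mutation (15.363). Fitnesses are shifted to be positive
    # so the selection probabilities are well defined.
    f = fitnesses - fitnesses.min() + 1e-8
    p = f / f.sum()
    new_pop = []
    for _ in range(len(pop)):
        a_idx, b_idx = rng.choice(len(pop), size=2, p=p)
        alpha = rng.uniform()
        child = alpha * pop[a_idx] + (1 - alpha) * pop[b_idx]
        child += eta * rng.normal(0.0, sigma, size=child.shape)
        new_pop.append(child)
    return new_pop

pop = [rng.normal(size=6) for _ in range(30)]         # flattened (W, b) vectors
for _ in range(40):
    fit = np.array([-np.sum((w - 0.5) ** 2) for w in pop])   # toy form of (15.360)
    pop = generation_step(pop, fit)
print(max(-np.sum((w - 0.5) ** 2) for w in pop))
```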
A critical examination of the structural capacity of CE was conducted by Gutierrez et al. (2004)
[976], who investigated its ability to generate a diverse range of feedforward neural network topolo-
gies. Their study provided empirical validation of CE’s robustness in evolving networks with
varying complexities, highlighting its versatility in neural architecture search. Meanwhile, Zhang
and Muhlenbein (1993) [977] examined the role of Occam’s Razor in CE-based neural evolution,
revealing how CE can balance model complexity and performance, leading to optimized network
topologies that avoid unnecessary overfitting. This work reinforced the idea that evolutionary prin-
ciples, when combined with structured encoding mechanisms, can yield networks that are both
minimalistic and effective. On a related note, Kitano (1990) [978] introduced a graph generation
system for designing neural networks using genetic algorithms, which, while predating formal CE
methodologies, provided foundational insights into evolutionary network design principles. Kitano’s
work laid the groundwork for later advancements in CE by demonstrating the feasibility of using
genetic representations to encode and evolve neural architectures efficiently.
A significant parallel development came with Miller and Turner (2015) [979] and Miller (2020) [980], whose work on Cartesian Genetic Programming (CGP), although not directly related to CE, introduced an alternative method of encoding neural architectures using graph-based
representations. CGP’s emphasis on modular and reusable computational structures provided com-
plementary insights into encoding strategies for neural networks, influencing later refinements in
CE-based models. Similarly, Stanley and Miikkulainen (2002) [950] introduced the NEAT (Neu-
roEvolution of Augmenting Topologies) algorithm, which evolved neural networks by progressively
augmenting topologies. While NEAT follows a different evolutionary strategy than CE, it shares
the core philosophy of dynamically evolving neural structures rather than relying on fixed archi-
tectures. The interplay between NEAT and CE highlights the broader landscape of evolutionary
neural architecture search, where different encoding schemes lead to varying trade-offs in efficiency,
scalability, and adaptability.
In more recent advancements, Hernandez Ruiz et al. (2021) [981] extended the CE paradigm to
neural cellular automata (NCA), proposing a manifold representation capable of generating diverse
images through dynamic convolutional mechanisms. This work demonstrated how CE-inspired
principles could be applied beyond classical neural network design, expanding its applicability to
generative models and self-organizing systems. Furthermore, Hajij, Istvan, and Zamzmi (2020) [982]
introduced Cell Complex Neural Networks (CXNs), which generalized message-passing schemes to
higher-dimensional structures, offering a novel perspective on encoding neural computations. Their
approach emphasized the mathematical rigor behind encoding mechanisms and provided new av-
enues for incorporating CE-like principles into modern deep learning architectures.
Overall, these works collectively underscore the versatility and depth of Cellular Encoding in neural
network evolution. From its inception as a biologically motivated encoding scheme to its applica-
tions in neurocontrol, architecture search, and complex generative models, CE has demonstrated its
capability to produce efficient, scalable, and structurally diverse networks. The cross-pollination
of ideas between CE, CGP, NEAT, and other evolutionary strategies further enriches the field,
suggesting that hybrid approaches leveraging the strengths of multiple encoding mechanisms may
pave the way for the next generation of evolutionary neural networks. Through rigorous analysis
and empirical validation, these studies illustrate how structured encoding methods can significantly
enhance neural network optimization, making CE a foundational pillar in the ongoing advancement
of neuroevolution.
Table 15.53: Summary of Contributions in Cellular Encoding Research
Cellular Encoding (CE) in neural networks has emerged as a pivotal concept bridging computa-
tional neuroscience, bioinformatics, and artificial intelligence. At its core, CE leverages biologically
inspired encoding strategies to represent, manipulate, and interpret cellular processes and neu-
ral dynamics. Sun et al. (2025) [983] explored this phenomenon by investigating how learning
transforms hippocampal neural representations into an orthogonalized state machine. Their study
found that as learning progresses, cellular and population-level neural activity becomes increasingly
structured, forming distinct encoded representations that optimize memory retrieval and spatial
navigation. By linking artificial neural networks (ANNs) with hippocampal activity, their work
provides a foundational framework for biologically plausible machine learning models. Similarly,
Hu et al. (2025) [848] extended this idea to genomics by designing an ensemble deep learning
framework for long non-coding RNA (lncRNA) subcellular localization, demonstrating how CE
can aid in deciphering complex gene regulation patterns.
Advancing the scope of CE, Guan et al. (2025) [984] developed a graph neural structure en-
coding approach for semantic segmentation of nuclei in pathological tissues. Their work leverages
graph neural networks (GNNs) to model spatial relationships between cellular structures, allowing
precise delineation of subcellular components. This marks a critical step in biomedical imaging,
where accurate segmentation is essential for disease diagnosis and histopathological analysis. On
the computational front, Ghosh et al. (2025) [985] tackled regulatory network encoding by designing
a deep learning framework for transcription factor binding site prediction. Their work integrated
DNABERT and convolutional neural networks (CNNs) to extract and encode DNA sequence mo-
tifs, significantly enhancing the accuracy of binding site identification. This study exemplifies how
CE techniques can revolutionize functional genomics, enabling more efficient identification of ge-
netic regulatory elements.
Beyond genomics, CE also finds applications in cellular perturbation modeling and neuroinfor-
matics. Sun et al. (2025) [986] introduced a perturbation proteomics-based virtual cell model that
encodes dynamic protein interaction networks. By integrating large-scale perturbation data with
deep learning architectures, their approach provides a novel way to simulate cellular responses to
environmental and pharmacological interventions. Grosjean et al. (2025) [987] contributed to self-
supervised learning in neuroscience by developing a network-aware encoding strategy for detecting
genetic modifiers of neuronal activity. Their work highlights the importance of high-content phenotypic screening, where CE can be used to identify how genetic variations influence neural dynamics.
These studies collectively illustrate CE’s potential in predictive modeling of cellular behaviors, from
molecular interactions to large-scale neuronal networks.
Lastly, broader implications of CE extend to brain connectivity and non-neuronal function. Sprecher
(2025) [990] investigated how neural networks encode and coordinate brain-wide activity, providing
a computational model of synaptic connectivity. Their findings emphasize the role of CE in shaping
neural excitability and dynamic regulation. Li et al. (2025) [991] expanded this perspective by
examining non-neuronal contributions to neural encoding. Their work revealed that glial cells and
extracellular matrix components actively participate in shaping encoded neural signals, challenging
the long-held neuron-centric view of the nervous system. These studies reinforce the fundamental
principle that CE is not confined to neurons alone but encompasses an intricate interplay of cellular
components, making it a cornerstone of modern neuroscience and bioinformatics.
Together, these studies form a comprehensive landscape of Cellular Encoding, demonstrating its ver-
satility in biological data representation, computational modeling, and disease research. Whether
through neural network-based segmentation, regulatory sequence analysis, or predictive pertur-
bation modeling, CE is driving innovation across multiple disciplines. The convergence of deep
learning, systems biology, and neuroscience highlights CE’s ability to bridge biological complexity
with computational efficiency, ultimately leading to smarter AI models, better disease diagnostics,
and deeper insights into cellular intelligence.
Mathematically, let G be the genotype, which consists of a set of developmental instructions {gi },
where each gi is a rule that governs a cellular transformation. The phenotype P , which is the neural
network, is generated through a function D that applies these instructions iteratively:
P = D(G) (15.365)
where D is a mapping function that executes a sequence of operations to form the final neural
network. Each cell in this developmental process can be represented by a state S, which consists
of attributes such as its position x, its type t, and its connectivity C:
Si = (xi , ti , Ci ) (15.366)
where xi ∈ Rn represents the spatial coordinates of the cell, ti denotes the type of the neuron
(such as input, hidden, or output), and Ci represents the set of synaptic connections formed during
development. The developmental process begins with a single root cell at an initial state S0 . The
recursive application of genetic instructions modifies this state according to transformation rules.
Each transformation is mathematically represented by a function T , such that at step k:
S_i^{(k+1)} = T\big(S_i^{(k)}, g_k\big) \qquad (15.367)
where gk specifies an operation such as cell division, differentiation, or connection formation. Cell
division is a key operation in CE and can be modeled as a function that creates two daughter cells
SA and SB from a parent cell SP :
SA , SB = Φ(SP , gd ) (15.368)
where gd is a division instruction. The transformation function Φ ensures that spatial attributes x
and connectivity C are updated to reflect the new structure:
xA = xP + δx, xB = xP − δx (15.369)
Connection formation is governed by a rule g_c that specifies how weights w_{ij} are assigned based on distance and developmental constraints:
wij = γ · e−α∥xi −xj ∥ (15.373)
where γ is a scaling factor and α controls the decay of connection strength with distance. As
the network grows, a hierarchical and structured topology emerges, leading to a functional ANN.
The final output network is expressed as a weighted graph N = (V, E, W ), where V is the set of
neurons, E is the set of synaptic connections, and W represents the weight matrix.
The learning process in CE can also involve adaptive modifications, where the weights evolve
according to a rule based on Hebbian learning:
∆wij = η · xi xj (15.375)
where η is the learning rate, ensuring that connectivity patterns are refined through evolutionary
pressure. Additionally, mutation and crossover operators in genetic algorithms can modify G,
leading to different developmental trajectories:
G' = \mathcal{M}(G), \qquad G'' = \mathcal{C}(G_1, G_2), \qquad (15.376)
where M represents mutation and C represents crossover. Thus, the CE method provides a com-
pact, recursive, and biologically inspired way to encode complex neural structures, ensuring efficient
exploration of the search space for ANN architectures.
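The distance-decaying connection rule of (15.373) can be exercised directly. The sketch below assigns weights among a handful of cells placed by a toy developmental process; the coordinates and the values of gamma and alpha are illustrative assumptions.

```python
import numpy as np

def ce_weights(coords, gamma=1.0, alpha=0.5):
    # w_ij = gamma * exp(-alpha * ||x_i - x_j||), Eq. (15.373).
    n = len(coords)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                W[i, j] = gamma * np.exp(-alpha * np.linalg.norm(coords[i] - coords[j]))
    return W

# four cells placed by a toy developmental process
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
print(np.round(ce_weights(coords), 3))
```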
Rather than training recurrent networks by backpropagation through time, GNARL (GeNeralized Acquisition of Recurrent Links) employs an evolutionary strategy that allows for the dynamic adaptation of
network structures. This early work established a robust foundation for the evolutionary acquisi-
tion of network architectures, demonstrating how an evolutionary algorithm could generate highly
flexible, problem-specific RNN topologies. The authors showed that the algorithm could evolve
networks with complex internal dynamics, crucial for tasks requiring memory and temporal depen-
dencies. Expanding on these ideas, Angeline, Saunders, and Pollack (1994) [993] further refined
the approach by presenting an extensive comparison between GNARL and traditional methods,
emphasizing the limitations of genetic algorithms in network evolution and advocating for the ef-
fectiveness of GNARL’s approach in evolving RNNs more efficiently.
A critical review by Yao (1999) [995] provided a comprehensive survey of artificial neural net-
work evolution, situating GNARL within the broader landscape of evolutionary algorithms for
neural networks. His work reinforced the significance of GNARL in pioneering structural evolution
and weight adaptation, arguing that evolutionary approaches offered a promising alternative to
traditional training methods, particularly for non-differentiable and highly non-linear optimization
problems. Further expanding on GNARL’s contributions, Floreano, Dürr, and Mattiussi (2008)
[996] examined its role in the historical development of neuroevolution, emphasizing how its struc-
tural adaptability enabled more efficient learning processes compared to fixed-topology networks.
Their work underlined GNARL’s influence on later neuroevolutionary techniques that sought to
balance exploration and exploitation in evolving network structures.
The broader implications of GNARL’s methodology can also be seen in Moriarty and Miikku-
lainen’s (1996) [998] exploration of reinforcement learning through symbiotic evolution. Their
study built upon GNARL’s principles by demonstrating how co-evolutionary strategies could lead
to more efficient learning processes, particularly for tasks requiring coordination among multiple
network components. Furthermore, Gomez and Miikkulainen (1997) [999] extended these ideas by
investigating the incremental evolution of complex behaviors, a concept deeply rooted in GNARL’s
evolutionary framework. Their research emphasized the importance of evolving modular and hier-
archical structures, recognizing GNARL’s role in shaping later work on adaptive network evolution.
Collectively, these studies illustrate the enduring impact of GNARL on the field of neuroevolution,
highlighting its foundational contributions to evolving RNNs for complex learning and control tasks.
To represent an artificial neural network undergoing GNARL-based evolution, let N be the number
of neurons in the network. The connectivity of the network at any given time step t is given by an
adjacency matrix C(t), where each entry Cij (t) is defined as
C_{ij}(t) = \begin{cases} 1, & \text{if a connection exists from neuron } j \text{ to neuron } i \text{ at time } t, \\ 0, & \text{otherwise.} \end{cases} \qquad (15.377)
Each connection has an associated weight matrix W (t), whose elements Wij (t) evolve based on
mutation and crossover mechanisms, following a genetic algorithm paradigm. The update rule for
weights follows
Wij (t + 1) = Wij (t) + ∆Wij (t), (15.378)
where the weight perturbation ∆Wij (t) is given by
\Delta W_{ij}(t) = \eta \frac{\partial F}{\partial W_{ij}} + \xi_{ij}, \qquad (15.379)
where η is a learning rate, F is a fitness function evaluating network performance, and ξij is a ran-
dom mutation term typically drawn from a zero-mean Gaussian distribution N (0, σ 2 ) to introduce
stochasticity in the evolution process. The recurrent dynamics of the network are governed by the
activation function ϕ, leading to neuron state updates given by
x_i(t+1) = \phi\!\left( \sum_{j=1}^{N} C_{ij}(t) \, W_{ij}(t) \, x_j(t) + b_i(t) \right), \qquad (15.380)
where xi (t) is the activation of neuron i at time t, and bi (t) represents the neuron bias, which can
also be evolved over time via
bi (t + 1) = bi (t) + ∆bi (t), (15.381)
where
\Delta b_i(t) = \eta \frac{\partial F}{\partial b_i} + \xi_i. \qquad (15.382)
The evolution of connectivity is achieved by a probabilistic mechanism where links are added or
removed based on fitness evaluations. If Padd and Pdel are the probabilities of adding and deleting
a link, respectively, then
P_{\mathrm{add}} = \frac{1}{1 + e^{-\alpha (F - F_{\mathrm{thresh}})}}, \qquad P_{\mathrm{del}} = 1 - P_{\mathrm{add}}, \qquad (15.383)
where α is a sensitivity parameter and Fthresh is a threshold fitness value that governs network
complexity. The addition of a new connection follows
C_{ij}(t+1) = \mathbb{I}\,(U < P_{\mathrm{add}}), \qquad (15.384)
where U \sim \mathcal{U}(0, 1) is a uniformly distributed random variable and \mathbb{I} is the indicator function. Similarly, link deletion follows
C_{ij}(t+1) = C_{ij}(t) \, \mathbb{I}\,(U \geq P_{\mathrm{del}}). \qquad (15.385)
The fitness function F is domain-specific and often includes terms for error minimization, stability,
and computational efficiency. A common formulation in supervised learning scenarios is
F = -\sum_{k=1}^{M} (y_k - \hat{y}_k)^2 - \lambda \sum_{i,j} C_{ij}(t), \qquad (15.386)
where M is the number of training samples, yk and ŷk are the target and predicted outputs, re-
spectively, and λ is a regularization parameter that penalizes excessive network complexity. The
evolutionary process iterates through selection, mutation, and crossover operations. Selection fol-
lows a fitness-proportional rule given by
P_i = \frac{e^{\beta F_i}}{\sum_j e^{\beta F_j}}, \qquad (15.387)
where β is a selection pressure parameter controlling the preference for higher-fitness individuals.
Mutation is applied independently to weights and biases via
W_{ij} \leftarrow W_{ij} + \xi_{ij}, \qquad b_i \leftarrow b_i + \xi_i, \qquad \xi \sim \mathcal{N}(0, \sigma^2). \qquad (15.388)
Crossover occurs between two parent networks A and B to produce offspring O, where
W_{ij}^{O} = \begin{cases} W_{ij}^{A}, & \text{if } U < 0.5, \\ W_{ij}^{B}, & \text{otherwise.} \end{cases} \qquad (15.389)
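The probabilistic link addition and deletion mechanism of (15.383)-(15.385) admits the following compact realization. This is one plausible reading of those update rules, not the original GNARL implementation; the fitness value, matrix size, and the choice to leave self-loops untouched are illustrative assumptions.

```python
import numpy as np
rng = np.random.default_rng(3)

def p_add(F, F_thresh=0.0, alpha=1.0):
    # Link-addition probability of Eq. (15.383).
    return 1.0 / (1.0 + np.exp(-alpha * (F - F_thresh)))

def structural_mutation(C, F):
    # Add absent links with probability P_add and delete existing links with
    # probability P_del = 1 - P_add, per Eqs. (15.384)-(15.385).
    pa = p_add(F)
    pd = 1.0 - pa
    U = rng.uniform(size=C.shape)
    C_new = C.copy()
    C_new[(C == 0) & (U < pa)] = 1
    C_new[(C == 1) & (U < pd)] = 0
    np.fill_diagonal(C_new, C.diagonal())   # leave self-loops untouched here
    return C_new

C = rng.integers(0, 2, size=(5, 5))
print(structural_mutation(C, F=0.8))
```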
Recurrent links play a crucial role in GNARL’s ability to discover temporal dependencies in sequen-
tial data. The recurrent connections enable information to persist across time steps, making the
method particularly suited for problems such as time-series prediction and reinforcement learning.
The recurrent update equations are formulated as
h_i(t+1) = \phi\!\left( \sum_j C_{ij}^{(R)}(t) \, W_{ij}^{(R)}(t) \, h_j(t) + \sum_j C_{ij}^{(I)}(t) \, W_{ij}^{(I)}(t) \, x_j(t) \right), \qquad (15.390)
where h_i(t) is the hidden state of neuron i and the superscripts (R) and (I) distinguish recurrent from input connectivity and weights.
In NEAT, a network is likewise represented by a set of neurons together with the connections between these neurons. Each connection has an associated weight w_{ij} such that the activation a_j of neuron j is given by
a_j = f\!\left( \sum_{i \in \mathcal{P}(j)} w_{ij} \, a_i + b_j \right) \qquad (15.392)
where P(j) is the set of neurons feeding into neuron j, bj is the bias term, and f (·) is the activation
function, which is often chosen to be a non-linear function such as the sigmoid
f(x) = \frac{1}{1 + e^{-x}} \qquad (15.393)
or a rectified linear unit (ReLU)
f (x) = max(0, x). (15.394)
In NEAT, the evolutionary process begins with a population of simple neural networks (often with
only input and output layers) and progressively adds new connections and new neurons via mutation
operations. The weight evolution follows a traditional genetic algorithm, where mutations modify
the connection strengths:
w_{ij}^{(t+1)} = w_{ij}^{(t)} + \eta \cdot \mathcal{N}(0, \sigma^2) \qquad (15.395)
where η is the learning rate and N (0, σ 2 ) is a Gaussian noise term. Structural mutations can add
a new connection between two previously unconnected nodes with a probability pc , or a new node
can be inserted into an existing connection with probability pn . The insertion of a new node vk
replaces an edge (i, j) with two edges (i, k) and (k, j), and their weights are initialized as
w_{ik}^{(0)} = w_{ij}, \qquad w_{kj}^{(0)} = 1. \qquad (15.396)
One of the central innovations of NEAT is the use of historical markings to track the lineage of
genes (connections). Each connection gene is assigned an innovation number upon creation, which
remains unchanged throughout evolution. The similarity d between two genomes G1 and G2 is
computed using the genetic distance metric
d = \frac{c_1 E}{N} + \frac{c_2 D}{N} + c_3 W \qquad (15.397)
where E is the number of excess genes, D is the number of disjoint genes, W is the average
weight difference of matching genes, and c1 , c2 , c3 are scaling coefficients. The normalization factor
N accounts for genome length to prevent excessive penalties for larger networks. This metric is
crucial for implementing speciation, where similar networks are grouped into species to encourage
innovation without being prematurely eliminated by competition. The fitness of each network G
is evaluated using a task-dependent function F (G), and species-level fitness sharing is applied to
maintain diversity:
F_i' = \frac{F_i}{\sum_{j \in \mathrm{species}} s(d_{ij})} \qquad (15.398)
where s(dij ) is a step function that determines whether two genomes belong to the same species:
s(d_{ij}) = \begin{cases} 1, & \text{if } d_{ij} < \delta \\ 0, & \text{otherwise} \end{cases} \qquad (15.399)
with δ being a speciation threshold. Networks within a species compete primarily amongst them-
selves, fostering the survival of innovative structures. The crossover operation between two parent
genomes G1 and G2 is performed by inheriting matching genes randomly and retaining excess and
disjoint genes from the more fit parent:
G_{\mathrm{offspring}} = \mathrm{Matching}(G_1, G_2) \cup \mathrm{ExcessDisjoint}(G_{\mathrm{fitter}}), \qquad (15.400)
where G_{\mathrm{fitter}} is the genome with the higher fitness. If a connection gene in one parent is disabled
due to mutation, it is inherited in a disabled state unless reactivated by future mutations. Another
fundamental aspect of NEAT is complexification, wherein networks start minimally and gradually
increase in complexity by adding nodes and connections. Unlike fixed-topology methods, NEAT
allows evolution to discover increasingly sophisticated representations, leading to efficient problem-
solving. The network complexity at generation t is measured by
\mathcal{C}(t) = |V(t)| + |E(t)|, \qquad (15.401)
where |V| and |E| denote the number of neurons and connections, respectively. Over time, the
structural complexity increases as new mutations introduce novel architectures. In practice, the
evolutionary process iterates until convergence, defined by an optimal fitness threshold Fopt such
that
\max_{G \in \mathrm{population}} F(G) \geq F_{\mathrm{opt}}. \qquad (15.402)
The mutation rates pc and pn are typically annealed to balance exploration and exploitation, gov-
erned by
p_c(t) = p_c(0) \, e^{-\lambda t}, \qquad p_n(t) = p_n(0) \, e^{-\mu t} \qquad (15.403)
where λ, µ are decay rates. In summary, NEAT dynamically evolves both weights and network
structures while preserving historical innovations and maintaining diversity through speciation. The
key principles—incremental complexification, historical markings, and speciation—enable NEAT
to efficiently discover topologies that traditional methods struggle to optimize. The continuous
balance between exploration and exploitation, governed by well-defined evolutionary operators,
results in a robust and self-adaptive neuroevolution framework.
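The compatibility metric of (15.397) and the speciation test of (15.399) can be sketched over genomes stored as dictionaries from innovation number to connection weight. The dictionary representation, the coefficient values, and the threshold delta = 3.0 are illustrative assumptions.

```python
import numpy as np

def compatibility(genome1, genome2, c1=1.0, c2=1.0, c3=0.4):
    # Genetic distance of Eq. (15.397); genomes map innovation number -> weight.
    innovs1, innovs2 = set(genome1), set(genome2)
    matching = innovs1 & innovs2
    cutoff = min(max(innovs1), max(innovs2))
    non_matching = innovs1 ^ innovs2
    E = sum(1 for i in non_matching if i > cutoff)        # excess genes
    D = len(non_matching) - E                             # disjoint genes
    W = np.mean([abs(genome1[i] - genome2[i]) for i in matching]) if matching else 0.0
    N = max(len(genome1), len(genome2))
    return c1 * E / N + c2 * D / N + c3 * W

g1 = {1: 0.5, 2: -0.3, 4: 1.2}
g2 = {1: 0.4, 3: 0.9, 4: 1.0, 6: -0.7}
d = compatibility(g1, g2)
print(d, "same species" if d < 3.0 else "different species")   # Eq. (15.399)
```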
Mathematically, let the neural substrate be defined as a set of nodes embedded in a D-dimensional
Euclidean space, where each node is assigned a spatial coordinate x ∈ RD . The weight ma-
trix W of the neural network is not directly evolved but rather generated dynamically using a
function f : R2D → R that maps the spatial positions of node pairs to connection weights. This
function f is instantiated by a Compositional Pattern Producing Network (CPPN), which
is a fully connected feedforward neural network with activation functions that can include
sigmoidal, Gaussian, sine, and identity functions, among others. The connectivity weight
between two neurons located at coordinates x_i and x_j is given by:
w_{ij} = f(x_i, x_j), \qquad (15.404)
where the function f(x_i, x_j) is parameterized by the weights and activation functions of the
CPPN, which itself undergoes neuroevolution using the NEAT algorithm. The CPPN represents
a compressed encoding of the connectivity pattern, which allows the network to exploit spatial
regularities in a way that would be infeasible with a direct encoding approach. The NEAT
algorithm evolves the CPPN by mutating and recombining genomes, where each genome G
represents a set of neurons and connections forming the CPPN. A genome G in the NEAT
representation consists of a set of node genes and a set of connection genes, where each
connection gene is represented as a tuple:
g = (i, j, w_{ij}, \text{enabled}), \qquad (15.405)
where i and j are node indices, w_{ij} is the connection weight, and enabled is a binary flag indicating
whether the connection is active. The fitness function F (G) that guides the evolution is determined
by the performance of the decoded ANN on the given task. The evolution proceeds using
mutation operators such as:
wij ← wij + N (0, σ 2 ) (15.406)
where N (0, σ 2 ) is a Gaussian perturbation, and structural mutations that add nodes and
connections, preserving the historical marking mechanism of NEAT. Once the CPPN is evolved,
it is queried at every pair of spatial coordinates (xi , xj ) in the neural substrate to generate the
weight matrix:
W = {f (xi , xj ) | ∀(xi , xj ) ∈ S × S} (15.407)
where S is the set of nodes in the substrate. This procedure ensures that the evolved ANN inherits
the geometric properties encoded by the CPPN, leading to topological coherence and the
ability to scale up efficiently. The substrate structure can be fixed or evolved, leading to a variety
of architectures including multi-layer perceptrons, convolutional-like structures, and even
more complex topologies. The implicit encoding provided by the CPPN enables patterned
connectivity with symmetries, repetitions, and other motifs that are biologically plau-
sible and computationally advantageous. HyperNEAT’s search space is thus fundamentally
different from traditional neuroevolution approaches, as it searches over functions that encode
networks rather than over networks themselves. This means that small changes in the CPPN
can lead to large, structured modifications in the ANN’s connectivity, a property known as
indirect encoding. Formally, the space of networks N encoded by a CPPN is given by:
N = {f (xi , xj ) | xi , xj ∈ RD } (15.408)
where f is constrained by the activation functions and weight structure of the CPPN. The expres-
siveness of the encoding is controlled by the activation functions in the CPPN, which determine
the nature of the patterns it can represent. A significant advantage of HyperNEAT is its ability
to exploit domain regularities by leveraging geometric relationships within the substrate.
For example, in a vision-based task, neurons representing nearby pixels should have stronger
connections, which can be captured by a distance-sensitive function:
f(x_i, x_j) = \alpha \, e^{-\beta \|x_i - x_j\|}, \qquad (15.409)
where \alpha and \beta are evolved parameters that control the strength and scale of connectivity. This
equation ensures that neurons form meaningful topologies based on spatial structure. Hyper-
NEAT optimizes the encoding function rather than individual weights, allowing it to generalize
connectivity patterns across different domains and tasks. This ability to evolve topological
regularities has been successfully applied in robot control, game playing, function approx-
imation, and large-scale pattern recognition.
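Querying a CPPN at every pair of substrate coordinates, as in (15.404) and (15.407), can be illustrated with a tiny hand-rolled network. In HyperNEAT the CPPN would itself be evolved with NEAT; here its weights are fixed random values, and the layer sizes, activation mix, and 3x3 grid substrate are illustrative assumptions.

```python
import numpy as np
rng = np.random.default_rng(4)

def cppn(xi, xj, params):
    # A tiny fixed CPPN: one hidden layer mixing Gaussian and sine activations
    # over the concatenated coordinates (x_i, x_j); returns w_ij = f(x_i, x_j).
    z = np.concatenate([xi, xj])
    h_gauss = np.exp(-np.square(params["W1"] @ z))
    h_sine = np.sin(params["W2"] @ z)
    return float(params["w_out"] @ np.concatenate([h_gauss, h_sine]))

params = {"W1": rng.normal(size=(3, 4)), "W2": rng.normal(size=(3, 4)),
          "w_out": rng.normal(size=6)}

# 2-D substrate: a 3x3 grid of node coordinates
grid = np.array([[x, y] for x in (-1, 0, 1) for y in (-1, 0, 1)])
W = np.array([[cppn(xi, xj, params) for xj in grid] for xi in grid])  # Eq. (15.407)
print(W.shape)     # (9, 9) weight matrix queried from the CPPN
```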
A fundamental property of this mapping is substrate symmetry, which is captured by the fact
that weight magnitudes for symmetric inputs in the domain follow similar functional properties:
|f(x_i, x_j)| = |f(-x_i, -x_j)|. \qquad (15.410)
This symmetry allows for computational efficiency, as only a fraction of the connections need to
be explicitly evaluated. In ES-HyperNEAT, unlike conventional HyperNEAT, the adaptive
growth mechanism is introduced by defining an adaptive threshold function τ (x) over the
substrate, which dynamically determines whether a connection should be expressed. The connection
viability is determined by:
|wij | > τ (xi , xj ), (15.413)
where τ is a learned function that varies over the substrate space and is itself evolved during the
learning process. The introduction of adaptive thresholds enables ES-HyperNEAT to selectively
prune redundant connections while preserving the essential connectivity for task-relevant compu-
tation. The neural network instantiated from the CPPN-generated substrate follows a typical
activation dynamics governed by the weighted sum and activation function:
a_i = \sigma\!\left( \sum_j w_{ij} \, a_j + b_i \right), \qquad (15.414)
where ai is the activation of node i, wij is the synaptic weight from node j to node i, bi is a bias
term, and σ is a nonlinear activation function (e.g., sigmoid, tanh, or ReLU). The ES-HyperNEAT
approach further refines the weight refinement strategy by incorporating a local function-
based weight refinement, denoted as W ′ , which modifies the initial weights W using a learned
adaptive function g(xi , xj ):
w_{ij}' = W(x_i, x_j) + g(x_i, x_j). \qquad (15.415)
This refinement process iteratively enhances connectivity by reinforcing important regions
in the substrate while eliminating weak connections through thresholding and weight decay.
The effect is a progressively optimized connectivity pattern that dynamically adapts to the
learning environment. From a mathematical perspective, the evolution of the CPPN itself follows
genetic mutation and crossover dynamics, where the genetic encoding of the CPPN, denoted
by ΘCPPN , undergoes variation through mutation functions M and crossover functions C:
\Theta_{\mathrm{CPPN}}^{(t+1)} = \mathcal{M}\big(\mathcal{C}\big(\Theta_{\mathrm{CPPN}}^{(t)}, \Theta_{\mathrm{CPPN}}^{(t')}\big)\big). \qquad (15.416)
Here, t represents the generation index, and \Theta_{\mathrm{CPPN}}^{(t)} and \Theta_{\mathrm{CPPN}}^{(t')} are the CPPN parameters of parent
solutions selected based on fitness evaluations. The fitness function F evaluates the network’s
performance on a given task and is defined as:
F(\Theta_{\mathrm{CPPN}}) = -\sum_k \mathcal{L}(\hat{y}_k, y_k), \qquad (15.417)
where L is a loss function comparing predicted outputs ŷk to true labels yk . The evolutionary
process seeks to maximize F , leading to increasingly optimized neural architectures. Overall, the
ES-HyperNEAT approach represents a mathematically rigorous framework for evolving com-
plex neural network topologies through adaptive connectivity growth, thresholded weight
refinement, and neurogenesis-driven substrate expansion. The resulting architectures are
not only topologically optimized but also computationally scalable, making ES-HyperNEAT
a powerful tool for the evolution of high-dimensional neural representations in artificial intelligence.
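The thresholded expression test of (15.413) reduces to a masking operation. In the sketch below a constant threshold stands in for the evolved function tau(x_i, x_j); the matrix size and cutoff value are illustrative assumptions.

```python
import numpy as np
rng = np.random.default_rng(5)

def express_connections(W, tau):
    # Eq. (15.413): realize connection (i, j) only where |w_ij| exceeds tau_ij.
    return np.where(np.abs(W) > tau, W, 0.0)

W = rng.normal(size=(6, 6))        # CPPN-generated candidate weights
tau = 0.8 * np.ones_like(W)        # constant stand-in for the evolved tau(x_i, x_j)
expressed = express_connections(W, tau)
print(np.count_nonzero(expressed), "of", W.size, "connections expressed")
```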
The Evolutionary Acquisition of Neural Topologies (EANT) and its extension EANT2 are advanced
methods for evolving artificial neural networks (ANNs) through an evolutionary process that si-
multaneously optimizes both the topology and the weights of the network. The primary principle
underlying EANT/EANT2 is the combination of evolutionary algorithms (EAs) and neural net-
work training, allowing for the automatic discovery of optimal network architectures without the
need for manual design. The method begins with an initial minimal structure and progressively
complexifies through mutations and recombinations, ensuring efficient exploration of the search
space while avoiding unnecessary complexity. A neural network in the context of EANT/EANT2
can be defined as a directed graph G = (V, E), where the set of vertices V represents neurons,
and the set of edges E represents synaptic connections. Each edge eij ∈ E carries an associated
weight wij , which modulates the signal transmission from neuron i to neuron j. Mathematically,
the activation aj of neuron j at time step t is given by:
a_j(t) = f\!\left( \sum_{i \in \mathrm{Pre}(j)} w_{ij} \, a_i(t-1) + b_j \right), \qquad (15.418)
where Pre(j) denotes the set of presynaptic neurons to j, wij are the connection weights, bj is the
bias term, and f (·) is the activation function, typically chosen as the sigmoid function
f(x) = \frac{1}{1 + e^{-\lambda x}} \qquad (15.419)
For weight adaptation, EANT2 employs a Covariance Matrix Adaptation (CMA) strategy in which weight perturbations are sampled from \mathcal{N}(0, \sigma^2 C), with the covariance initialized as C^{(0)} = I, where I is the identity matrix. The covariance matrix is updated using an evolution path formula-
tion:
C^{(t+1)} = (1 - c_c) \, C^{(t)} + c_c \, p_c p_c^{\top}, \qquad (15.424)
where cc is a learning rate, and pc is the evolution path. Throughout the evolutionary process,
EANT/EANT2 ensures that only structurally beneficial changes are retained, leading to an ef-
ficient exploration-exploitation tradeoff. The incremental growth of topology in EANT can be
mathematically formalized by defining the probability of adding a new neuron vk as
P_{\mathrm{add}}(v_k) = \frac{1}{1 + e^{-\gamma \Delta F}}, \qquad (15.425)
where ∆F represents the fitness improvement from the structural modification, and γ is a control
parameter. At each iteration, the population of neural networks undergoes selection, mutation, and
recombination, leading to the formation of an improved generation. The optimization objective is
to maximize a performance metric J(w, T ), where T represents the training dataset. The overall
evolutionary update rule can be expressed as
w^{(t+1)} = w^{(t)} + \alpha \nabla_w J(w^{(t)}, T) + \sigma \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \qquad (15.426)
where \alpha is a learning rate and the second term of the update, the stochastic perturbation \sigma\epsilon, represents stochastic exploration.
of EANT/EANT2 is governed by the stability of the weight adaptation dynamics. If the eigenvalues
of the Jacobian matrix
J = \frac{\partial f(w)}{\partial w} \qquad (15.427)
lie within the unit circle, the evolutionary process stabilizes, ensuring convergence. Otherwise,
further structural modifications are required to regularize the topology. Thus, EANT/EANT2
provides a mathematically rigorous framework for evolving neural networks by simultaneously op-
timizing topology and weights, leveraging evolutionary principles, and incorporating efficient weight
adaptation techniques such as CMA. The fundamental strength of the method lies in its ability to
construct minimal yet powerful architectures that efficiently learn complex functions.
Neuro-evolution optimizes the weights and structures of artificial neural networks using evolution-
ary algorithms. Consider a neural network parameterized by a weight vector w = (w1 , w2 , . . . , wn ),
where n is the number of trainable parameters. The loss function of the network, denoted as L(w),
defines the objective landscape in which evolutionary search occurs. The ICONE method introduces
interactive constraints that enforce adherence to specific functional, structural, or performance-
based criteria, which we define mathematically as
Ci (w) ≤ 0, i = 1, 2, . . . , m, (15.428)
where Ci (w) represents the ith constraint function, ensuring the evolved network satisfies predefined
conditions. The evolution process follows a mutation-selection cycle under constrained optimization. Given an initial population P^{(t)} = \{w_1^{(t)}, w_2^{(t)}, \ldots, w_N^{(t)}\} at generation t, candidate solutions undergo mutation:
w_j^{(t+1)} = w_j^{(t)} + \eta_j, \qquad \eta_j \sim \mathcal{N}(0, \sigma^2 I), \qquad (15.429)
where \eta_j is a perturbation vector sampled from an isotropic Gaussian distribution with variance \sigma^2. The feasibility of each mutated candidate is determined by evaluating the constraints C_i(w_j^{(t+1)}). If a candidate violates any constraints, a projection step enforces feasibility by solving the constrained optimization problem:
w_j^{(t+1)} = \arg\min_{w} \| w - w_j^{(t+1)} \|^2 \quad \text{s.t.} \quad C_i(w) \leq 0, \; i = 1, 2, \ldots, m. \qquad (15.430)
This ensures that every evolved individual remains within the feasible search space. Selection in
ICONE is governed by fitness evaluation and constraint satisfaction. The fitness function F (w)
incorporates both performance metrics (e.g., classification accuracy, regression error) and constraint
penalties. A penalty-based fitness function is formulated as:
F(w) = \mathcal{L}(w) + \lambda \sum_{i=1}^{m} \max(0, C_i(w))^2, \qquad (15.431)
where \lambda is a penalty coefficient enforcing constraint satisfaction. Since F(w) combines the loss with constraint penalties, candidates with smaller penalized objective values (i.e., higher fitness) are selected for reproduction, forming the next generation P^{(t+1)}. A key characteristic of ICONE
is interactive constraint adaptation, where constraints evolve dynamically based on intermediate
feedback. If the optimization process trends toward infeasible solutions, adaptive constraints are
imposed by modifying the constraint functions:
$$C_i^{(t+1)}(w) = C_i^{(t)}(w) + \gamma \cdot \nabla_w C_i(w), \tag{15.432}$$
where γ is an adaptation rate controlling the magnitude of constraint adjustment. This adaptation
mechanism ensures the evolutionary process remains both stable and effective, guiding solutions
toward desirable regions in the search space. The convergence of ICONE relies on satisfying the
Karush-Kuhn-Tucker (KKT) conditions, which characterize optimality in constrained optimization.
At convergence, optimal solutions satisfy the Lagrangian formulation:
$$\mathcal{L}(w, \mu) = F(w) + \sum_{i=1}^{m} \mu_i C_i(w), \tag{15.433}$$
where $\mu_i \ge 0$ are the Lagrange multipliers. The necessary (KKT) conditions for optimality are:
$$\nabla_w \mathcal{L}(w, \mu) = 0, \qquad C_i(w) \le 0, \qquad \mu_i \ge 0, \qquad \mu_i C_i(w) = 0, \quad i = 1, \ldots, m. \tag{15.434}$$
These conditions ensure the final neural network configuration is both performance-optimized and
constraint-compliant.
Ultimately, the ICONE method provides a mathematically rigorous framework for constrained
neuro-evolution, ensuring neural networks evolve under explicit, dynamically adaptable, and in-
teractive constraints while maintaining optimal performance in their designated tasks. Through a
constraint-driven evolutionary search, ICONE guarantees that evolved models satisfy both func-
tional and theoretical constraints, leading to robust and interpretable artificial intelligence systems.
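As an illustration of the mutate-project-penalize cycle of Eqs. (15.429)-(15.431), here is a hedged Python sketch. It assumes a single box constraint (whose Euclidean projection is a simple clip) and a toy quadratic loss; general constraints would require a numerical solver for the projection problem of Eq. (15.430).

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w):                  # stand-in loss L(w), assumed quadratic
    return float(np.sum((w - 0.5) ** 2))

def constraint(w):            # C(w) <= 0 encodes feasibility (box)
    return float(np.max(np.abs(w)) - 1.0)

def project(w):               # Eq. (15.430) for the box constraint
    return np.clip(w, -1.0, 1.0)

def fitness(w, lam=10.0):     # penalty-based fitness, Eq. (15.431)
    return loss(w) + lam * max(0.0, constraint(w)) ** 2

population = [rng.normal(size=4) for _ in range(20)]
sigma = 0.1
for generation in range(50):
    # Gaussian mutation, Eq. (15.429), followed by projection.
    offspring = [project(w + rng.normal(scale=sigma, size=w.shape))
                 for w in population]
    # Keep the candidates with the lowest penalised objective
    # (fitness here is a cost to be minimised).
    population = sorted(population + offspring, key=fitness)[:20]

print("best penalised fitness:", fitness(population[0]))
```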
In traditional neural network training, the architecture is often fixed, and only the weights are
optimized using algorithms like backpropagation. In contrast, DXNN simultaneously evolves the
topology and weights of the network. This is achieved through a combination of evolutionary strate-
gies and local search methods, which iteratively refine the network’s performance. The memetic
algorithm aspect of DXNN incorporates local optimization techniques within the evolutionary pro-
cess, enhancing convergence rates and solution quality. Mathematically, the DXNN method can be
described as follows:
1. Initialization: A population of neural networks is initialized with random topologies and
weights. Each network’s topology can be represented as a graph G = (V, E), where V denotes
neurons and E denotes synaptic connections.
2. Fitness Evaluation: The fitness $F_i$ of each network is computed on the training data as
$$F_i = \frac{1}{N} \sum_{j=1}^{N} L(y_j, \hat{y}_j), \tag{15.435}$$
where $N$ is the number of samples, $y_j$ is the true output, $\hat{y}_j$ is the network's output, and $L$ is a loss function, such as the mean squared error
$$L(y_j, \hat{y}_j) = (y_j - \hat{y}_j)^2. \tag{15.436}$$
3. Selection: Networks are selected for reproduction based on their fitness scores. A common
selection method is tournament selection, where a subset of networks is chosen, and the one
with the highest fitness is selected for reproduction.
4. Crossover: Offspring weights are formed by recombining two parents $W_A$ and $W_B$:
$$W_O = \alpha W_A + (1 - \alpha) W_B. \tag{15.437}$$
5. Mutation: Offspring weights are perturbed by additive Gaussian noise:
$$W' = W + \Delta W, \tag{15.438}$$
$$\Delta W \sim \mathcal{N}(0, \sigma^2). \tag{15.439}$$
For topology mutation, connections can be added or removed based on probabilities $p_{\text{add}}$ and $p_{\text{remove}}$, respectively.
6. Local Optimization (Memetic Component): After mutation, local search methods, such
as gradient-based optimization, are applied to fine-tune the weights of the offspring networks.
This involves minimizing the loss function L with respect to the weights W :
W ← W − η∇W L (15.440)
where η is the learning rate, and ∇W L is the gradient of the loss function with respect to the
weights.
7. Replacement: The new generation of networks replaces the old population, and the pro-
cess repeats from the fitness evaluation step until a termination criterion is met, such as a
predefined number of generations or a satisfactory fitness level.
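A schematic single generation of the loop above, restricted to weight vectors (topology mutation omitted), might look as follows in Python; the toy loss, tournament size, and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def loss(w):                    # stand-in for the fitness of Eq. (15.435)
    return float(np.sum(w ** 2))

def grad(w):                    # gradient of the toy quadratic loss
    return 2.0 * w

def tournament(pop, k=3):       # step 3: tournament selection
    contestants = rng.choice(len(pop), size=k, replace=False)
    return pop[min(contestants, key=lambda i: loss(pop[i]))]

pop = [rng.normal(size=8) for _ in range(30)]
alpha, sigma, eta = 0.5, 0.05, 0.1
new_pop = []
for _ in range(30):
    wa, wb = tournament(pop), tournament(pop)
    w = alpha * wa + (1 - alpha) * wb                 # crossover, Eq. (15.437)
    w = w + rng.normal(scale=sigma, size=w.shape)     # mutation, Eqs. (15.438)-(15.439)
    w = w - eta * grad(w)                             # memetic local step, Eq. (15.440)
    new_pop.append(w)
pop = new_pop                                         # step 7: replacement
print("best loss in new generation:", min(loss(w) for w in pop))
```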
The DXNN method has been applied in various domains, including financial markets. For instance,
in automated currency trading, DXNN has been used to evolve neural networks that process Forex
chart images as inputs, enabling the detection of patterns and trends for trading decisions. This
approach contrasts with traditional methods that rely on fixed technical indicators, offering a more
dynamic and adaptive trading strategy.
In summary, the Deus Ex Neural Network method provides a comprehensive framework for evolving
both the architecture and parameters of neural networks. By integrating evolutionary algorithms
with local optimization techniques, DXNN facilitates the development of adaptable and efficient
models capable of tackling complex tasks across various domains.
Central to SUNA is the Unified Neuron Model, which encapsulates various neuron types and acti-
vation functions within a single, cohesive representation. Each neuron i is characterized by a set
of parameters θi , governing its specific function. The output yi of neuron i can be mathematically
expressed as:
$$y_i = f_i\!\left(\sum_{j \in \text{pre}(i)} w_{ij} x_j + b_i\right) \tag{15.441}$$
Here, fi denotes the activation function, wij represents the synaptic weight between neurons j and
i, xj signifies the input from neuron j, and bi is the bias term associated with neuron i. This formu-
lation allows for the seamless integration of diverse neuronal behaviors within a unified framework.
The evolutionary process in SUNA involves the optimization of both the neural network’s architec-
ture and its synaptic weights. A population of candidate solutions, each encoded as a chromosome
C, undergoes iterative selection, crossover, and mutation operations. The fitness F (C) of each
chromosome is evaluated based on its performance on the target task. The mutation operators are
designed to modify the network’s topology and weights, introducing variations that enhance the
search for optimal solutions. To maintain a rich diversity of solutions, SUNA employs the spectrum
diversity mechanism. This approach constructs a spectrum S(C) for each chromosome, capturing
its unique characteristics. The spectrum is defined as:
$$S(C) = (s_1(C), s_2(C), \ldots, s_n(C)), \tag{15.442}$$
where $s_k(C)$ represents the $k$-th feature of the chromosome $C$. The distance $D$ between two spectra $S(C_1)$ and $S(C_2)$ is computed using a suitable metric, such as the Euclidean distance:
$$D(S(C_1), S(C_2)) = \sqrt{\sum_{k=1}^{n} (s_k(C_1) - s_k(C_2))^2} \tag{15.443}$$
This distance metric informs the niching mechanism, ensuring that the evolutionary process explores
a diverse set of solutions by promoting chromosomes with unique spectra. The fitness evaluation in
SUNA is augmented by a novelty score N (C), which quantifies the distinctiveness of a chromosome
relative to the current population. The novelty score is calculated as:
$$N(C) = \frac{1}{k} \sum_{i=1}^{k} D(S(C), S(C_i)) \tag{15.444}$$
where Ci denotes the i-th nearest neighbor to C in the spectrum space, and k is a predefined
constant. This scoring system encourages the exploration of novel solutions, thereby enhancing the
algorithm’s ability to escape local optima. The overall selection probability P (C) of a chromosome
is influenced by both its fitness and novelty scores, and can be expressed as:
$$P(C) = \frac{\alpha F(C) + \beta N(C)}{\sum_{C'} \left(\alpha F(C') + \beta N(C')\right)} \tag{15.445}$$
Here, α and β are weighting factors that balance the contributions of fitness and novelty, respec-
tively. This probabilistic selection mechanism ensures a harmonious trade-off between exploiting
high-performing solutions and exploring innovative ones.
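The spectrum-diversity bookkeeping of Eqs. (15.443)-(15.445) can be sketched in a few lines of Python; the random spectra, fitness values, and weighting factors below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
spectra = rng.normal(size=(10, 5))        # S(C) for 10 chromosomes
fit = rng.uniform(0.1, 1.0, size=10)      # F(C), higher is better

def novelty(i, k=3):
    """Mean distance to the k nearest spectra, Eq. (15.444)."""
    d = np.linalg.norm(spectra - spectra[i], axis=1)  # Eq. (15.443)
    return float(np.sort(d)[1:k + 1].mean())          # skip self (d = 0)

alpha, beta = 1.0, 0.5
scores = np.array([alpha * fit[i] + beta * novelty(i)
                   for i in range(len(spectra))])
p_select = scores / scores.sum()          # Eq. (15.445)
parent = rng.choice(len(spectra), p=p_select)
print("selection probabilities:", np.round(p_select, 3))
print("sampled parent index:", parent)
```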
Through the integration of the Unified Neuron Model and spectrum diversity, SUNA adeptly nav-
igates the complex search space inherent in neuroevolution. This comprehensive approach enables
the discovery of neural network configurations that are both diverse and well-suited to a multitude
of tasks, thereby advancing the field of artificial intelligence.
16 Training Neural Networks
For each training example $(x^{(i)}, y^{(i)})$ with network prediction $\hat{y}^{(i)}$, the per-sample squared error loss is
$$L(\hat{y}^{(i)}, y^{(i)}) = \frac{1}{2} \|\hat{y}^{(i)} - y^{(i)}\|_2^2, \tag{16.1}$$
where ∥ · ∥2 represents the Euclidean norm. The total loss J(θ) for the entire dataset is the average
of the individual losses:
$$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} L(\hat{y}^{(i)}, y^{(i)}), \tag{16.2}$$
where N is the number of training samples. For squared error loss, we can write:
$$J(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \|\hat{y}^{(i)} - y^{(i)}\|_2^2. \tag{16.3}$$
The forward pass through the network consists of computing the activations at each layer. For the
l-th layer, the pre-activation $z^{(l)}$ is calculated as:
$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \tag{16.4}$$
where $a^{(l-1)}$ is the activation from the previous layer and $W^{(l)}$ is the weight matrix connecting the $(l-1)$-th layer to the $l$-th layer. The output of the layer, i.e., the activation $a^{(l)}$, is computed by applying the activation function $\sigma^{(l)}$ element-wise to $z^{(l)}$:
$$a^{(l)} = \sigma^{(l)}(z^{(l)}). \tag{16.5}$$
The final output of the network is given by the activation a(L) at the last layer, which is the
predicted output ŷ (i) :
ŷ (i) = a(L) . (16.6)
The backpropagation algorithm computes the gradient of the loss function J(θ) with respect to
each parameter (weights and biases). First, we compute the error at the output layer. Let δ (L)
represent the error at layer L. This is computed by taking the derivative of the loss function with
respect to the activations at the output layer:
$$\delta^{(L)} = \frac{\partial L}{\partial a^{(L)}} \odot \sigma^{(L)\prime}(z^{(L)}), \tag{16.7}$$
where $\odot$ denotes element-wise multiplication, and $\sigma^{(L)\prime}(z^{(L)})$ is the derivative of the activation
function applied element-wise to z (L) . For squared error loss, the derivative with respect to the
activations is:
$$\frac{\partial L}{\partial a^{(L)}} = \hat{y}^{(i)} - y^{(i)}, \tag{16.8}$$
so the error term at the output layer is:
$$\delta^{(L)} = (\hat{y}^{(i)} - y^{(i)}) \odot \sigma^{(L)\prime}(z^{(L)}). \tag{16.9}$$
To propagate the error backward through the network, we compute the errors at the hidden layers.
For each hidden layer l = L − 1, L − 2, . . . , 1, the error δ (l) is calculated by the chain rule:
$$\delta^{(l)} = \left(W^{(l+1)T} \delta^{(l+1)}\right) \odot \sigma^{(l)\prime}(z^{(l)}), \tag{16.10}$$
where $W^{(l+1)T} \in \mathbb{R}^{n_l \times n_{l+1}}$ is the transpose of the weight matrix connecting layer $l$ to layer $l+1$. This
equation uses the fact that the error at layer l depends on the error at the next layer, modulated
by the weights, and the derivative of the activation function at layer l. Once the errors δ (l) are
computed for all layers, we can compute the gradients of the loss function with respect to the
parameters (weights and biases). The gradient of the loss with respect to the weights W(l) is:
$$\frac{\partial J(\theta)}{\partial W^{(l)}} = \frac{1}{N} \sum_{i=1}^{N} \delta^{(l)} (a^{(l-1)})^{T}. \tag{16.11}$$
The gradient of the loss with respect to the biases b(l) is:
$$\frac{\partial J(\theta)}{\partial b^{(l)}} = \frac{1}{N} \sum_{i=1}^{N} \delta^{(l)}. \tag{16.12}$$
After computing these gradients, we update the parameters using an optimization algorithm such
as gradient descent. The weight update rule is:
$$W^{(l)} \leftarrow W^{(l)} - \eta \frac{\partial J(\theta)}{\partial W^{(l)}}, \tag{16.13}$$
and the bias update rule is:
$$b^{(l)} \leftarrow b^{(l)} - \eta \frac{\partial J(\theta)}{\partial b^{(l)}}, \tag{16.14}$$
where $\eta$ is the learning rate controlling the step size in the gradient descent update. This process is repeated over multiple epochs, each consisting of a forward pass, a backward pass, and a parameter update, until the network converges to a local minimum of the loss function.
At each step of backpropagation, the chain rule is applied recursively to propagate the error
backward through the network, adjusting each weight and bias to minimize the total loss. The
derivative of the activation function $\sigma^{(l)\prime}(z^{(l)})$ is critical, as it dictates how the error is modulated
at each layer. Depending on the choice of activation function (e.g., ReLU, sigmoid, or tanh), the
derivative will take different forms, and this choice has a direct impact on the learning dynamics
and convergence rate of the network. Thus, backpropagation serves as the computational back-
bone of neural network training. By calculating the gradients of the loss function with respect to
the network parameters through efficient error propagation, backpropagation allows the network
to adjust its parameters iteratively, gradually minimizing the error and improving its performance
across tasks. This process is mathematically rigorous, utilizing fundamental principles of calculus
and optimization, ensuring that the neural network learns effectively from its training data.
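For concreteness, the following self-contained NumPy sketch runs the full forward-backward-update cycle of Eqs. (16.4)-(16.14) for a two-layer network with sigmoid hidden units and a linear output; the layer sizes, random data, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))   # inputs, N x n0
Y = rng.normal(size=(100, 2))   # targets, N x nL

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
W1, b1 = rng.normal(scale=0.1, size=(8, 4)), np.zeros((8, 1))
W2, b2 = rng.normal(scale=0.1, size=(2, 8)), np.zeros((2, 1))
eta, N = 0.1, X.shape[0]

for epoch in range(200):
    # Forward pass, Eqs. (16.4)-(16.6); columns are samples.
    A0 = X.T
    Z1 = W1 @ A0 + b1; A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2; A2 = Z2                 # linear output layer
    # Output error, Eq. (16.9) (sigma' = 1 for the linear layer).
    D2 = A2 - Y.T
    # Hidden-layer error, Eq. (16.10), with sigma'(z) = s(z)(1 - s(z)).
    D1 = (W2.T @ D2) * A1 * (1.0 - A1)
    # Gradients averaged over the batch, Eqs. (16.11)-(16.12).
    dW2, db2 = D2 @ A1.T / N, D2.mean(axis=1, keepdims=True)
    dW1, db1 = D1 @ A0.T / N, D1.mean(axis=1, keepdims=True)
    # Parameter updates, Eqs. (16.13)-(16.14).
    W2 -= eta * dW2; b2 -= eta * db2
    W1 -= eta * dW1; b1 -= eta * db1

print("final loss:", 0.5 * np.mean(np.sum((A2 - Y.T) ** 2, axis=0)))
```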
Training minimizes the empirical loss
$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(\theta; x_i, y_i), \tag{16.15}$$
where $(x_i, y_i)$ are the input-output pairs in the training dataset of size $N$, and $\ell(\theta; x_i, y_i)$ is the sample-specific loss. The minimization problem is solved iteratively, starting from an initial guess $\theta^{(0)}$ and updating according to the rule
$$\theta^{(k+1)} = \theta^{(k)} - \eta \nabla_\theta L(\theta^{(k)}), \tag{16.16}$$
where $\eta > 0$ is the learning rate, and $\nabla_\theta L(\theta)$ is the gradient of the loss with respect to $\theta$. The gradient, computed via backpropagation, follows the chain rule and propagates through the network's
layers to adjust weights and biases optimally. In a feedforward neural network with L layers, the
computations proceed as follows. The input to layer $l$ is
$$z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \tag{16.17}$$
where $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ and $b^{(l)} \in \mathbb{R}^{n_l}$ are the weight matrix and bias vector for the layer, respectively, and $a^{(l-1)}$ is the activation vector from the previous layer. The output is then
$$a^{(l)} = f^{(l)}(z^{(l)}), \tag{16.18}$$
where f (l) is the activation function. Backpropagation begins with the computation of the error at
the output layer,
$$\delta^{(L)} = \frac{\partial \ell}{\partial a^{(L)}} \odot f'^{(L)}(z^{(L)}), \tag{16.19}$$
where f ′(L) (·) is the derivative of the activation function. For hidden layers, the error propagates
recursively as
δ (l) = (W (l+1) )⊤ δ (l+1) ⊙ f ′(l) (z (l) ). (16.20)
The gradients for weight and bias updates are then computed as
$$\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^{\top} \tag{16.21}$$
and
$$\frac{\partial L}{\partial b^{(l)}} = \delta^{(l)}, \tag{16.22}$$
respectively. The dynamics of gradient descent are deeply influenced by the curvature of the loss
surface, encapsulated by the Hessian matrix
H(θ) = ∇2θ L(θ). (16.23)
For a small step size η, the change in the loss function can be approximated as
$$\Delta L \approx -\eta \|\nabla_\theta L(\theta)\|^2 + \frac{\eta^2}{2} (\nabla_\theta L(\theta))^{\top} H(\theta) \nabla_\theta L(\theta). \tag{16.24}$$
This reveals that convergence is determined not only by the gradient magnitude but also by the
curvature of the loss surface along the gradient direction. The eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_d$ of $H(\theta)$ dictate the local geometry, with large condition numbers $\kappa = \lambda_{\max}/\lambda_{\min}$ slowing convergence due to
ill-conditioning. Stochastic gradient descent (SGD) modifies the standard gradient descent by
computing updates based on a single data sample (xi , yi ), leading to
θ (k+1) = θ (k) − η∇θ ℓ(θ; xi , yi ). (16.25)
While SGD introduces variance into the updates, this stochasticity helps escape saddle points
characterized by zero gradient but mixed curvature. To balance computational efficiency and
stability, mini-batch SGD computes gradients over a randomly selected subset B ⊂ {1, . . . , N } of
size |B|, yielding
$$\nabla_\theta L_B(\theta) = \frac{1}{|B|} \sum_{i \in B} \nabla_\theta \ell(\theta; x_i, y_i). \tag{16.26}$$
Momentum methods enhance convergence by incorporating a memory of past gradients. The
velocity term
v (k+1) = γv (k) + η∇θ L(θ) (16.27)
accumulates gradient information, and the parameter update is
θ (k+1) = θ (k) − v (k+1) . (16.28)
Analyzing momentum in the eigenspace of H(θ), with H = QΛQ⊤ , reveals that the effective step
size in each eigendirection is
$$\eta_{\text{eff},i} = \frac{\eta}{1 - \gamma \lambda_i}, \tag{16.29}$$
showing that momentum accelerates convergence in low-curvature directions while damping oscil-
lations in high-curvature directions. Adaptive gradient methods, such as AdaGrad, RMSProp, and
Adam, refine learning rates for individual parameters. In AdaGrad, the adaptive learning rate is
$$\eta_i^{(k+1)} = \frac{\eta}{\sqrt{G_{ii}^{(k+1)} + \epsilon}}, \tag{16.30}$$
where
$$G_{ii}^{(k+1)} = G_{ii}^{(k)} + (\nabla_{\theta_i} L(\theta))^2. \tag{16.31}$$
RMSProp modifies this with an exponentially weighted average
$$G_{ii}^{(k+1)} = \beta G_{ii}^{(k)} + (1 - \beta)(\nabla_{\theta_i} L(\theta))^2. \tag{16.32}$$
Adam combines RMSProp with momentum, where the first and second moments are
$$m^{(k+1)} = \beta_1 m^{(k)} + (1 - \beta_1) \nabla_\theta L(\theta) \tag{16.33}$$
and
$$v^{(k+1)} = \beta_2 v^{(k)} + (1 - \beta_2)(\nabla_\theta L(\theta))^2. \tag{16.34}$$
Bias corrections yield
$$\hat{m}^{(k+1)} = \frac{m^{(k+1)}}{1 - \beta_1^k}, \qquad \hat{v}^{(k+1)} = \frac{v^{(k+1)}}{1 - \beta_2^k}. \tag{16.35}$$
The final parameter update is
$$\theta^{(k+1)} = \theta^{(k)} - \eta \frac{\hat{m}^{(k+1)}}{\sqrt{\hat{v}^{(k+1)}} + \epsilon}. \tag{16.36}$$
In conclusion, gradient descent and its variants provide a rich framework for optimizing neural
network parameters. While standard gradient descent offers a basic approach, advanced methods
like momentum and adaptive gradients significantly enhance convergence by tailoring updates to
the landscape of the loss surface and the dynamics of training.
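The curvature effects discussed above are easy to observe numerically. The sketch below runs the momentum iterations of Eqs. (16.27)-(16.28) on an ill-conditioned quadratic; the Hessian, step size, and momentum coefficient are illustrative choices.

```python
import numpy as np

# f(theta) = 0.5 * theta^T H theta with condition number 100.
H = np.diag([100.0, 1.0])
theta = np.array([1.0, 1.0])
v = np.zeros(2)
eta, gamma = 0.009, 0.9

for k in range(300):
    grad = H @ theta             # gradient of the quadratic
    v = gamma * v + eta * grad   # velocity accumulation, Eq. (16.27)
    theta = theta - v            # parameter update, Eq. (16.28)

print("theta after 300 momentum steps:", theta)
```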
Consider the finite-sum optimization problem
$$\min_{w \in \mathbb{R}^d} f(w), \tag{16.37}$$
where
$$f(w) = \frac{1}{N} \sum_{i=1}^{N} \ell(w; x_i, y_i) \tag{16.38}$$
represents the empirical risk, constructed from a dataset $\{(x_i, y_i)\}_{i=1}^{N}$. Here, $\ell(w; x_i, y_i)$ denotes the loss function, $w \in \mathbb{R}^d$ is the parameter vector, $N$ is the dataset size, and $f(w)$ approximates the true population risk
$$\mathbb{E}_{x,y}[\ell(w; x, y)]. \tag{16.39}$$
Standard gradient descent involves the update rule
$$w^{(t+1)} = w^{(t)} - \eta \nabla f(w^{(t)}), \tag{16.40}$$
where
$$\nabla f(w) = \frac{1}{N} \sum_{i=1}^{N} \nabla \ell(w; x_i, y_i) \tag{16.41}$$
is the full gradient. However, for large-scale datasets, the computation of $\nabla f(w)$ becomes computationally prohibitive, motivating the adoption of stochastic approximations. The stochastic approximation relies on the idea of estimating the gradient $\nabla f(w)$ using a single data point or a small batch of data points. Denoting the random index sampled at iteration $t$ as $i_t$, the stochastic gradient can be written as
$$\widehat{\nabla f}(w^{(t)}) = \nabla \ell(w^{(t)}; x_{i_t}, y_{i_t}). \tag{16.42}$$
Consequently, the update rule becomes
$$w^{(t+1)} = w^{(t)} - \eta \widehat{\nabla f}(w^{(t)}). \tag{16.43}$$
More generally, for a mini-batch $B_t$ of size $m$,
$$\widehat{\nabla f}(w^{(t)}) = \frac{1}{m} \sum_{i \in B_t} \nabla \ell(w^{(t)}; x_i, y_i). \tag{16.44}$$
An important property of $\widehat{\nabla f}(w)$ is its unbiasedness:
$$\mathbb{E}[\widehat{\nabla f}(w)] = \nabla f(w). \tag{16.45}$$
Its fluctuation is controlled by
$$\mathbb{E}\big[\|\widehat{\nabla f}(w) - \nabla f(w)\|^2\big] \le \sigma^2, \tag{16.46}$$
where $\sigma^2$ is the variance of the gradients. To analyze the convergence properties of SGD, we assume $f(w)$
to be L-smooth, meaning
∥∇f (w1 ) − ∇f (w2 )∥ ≤ L∥w1 − w2 ∥, (16.48)
and $f(w)$ to be bounded below by $f^* = \inf_w f(w)$. Using a Taylor expansion, we can write
$$\mathbb{E}[f(w^{(t+1)})] \le \mathbb{E}[f(w^{(t)})] - \frac{\eta}{2} \mathbb{E}[\|\nabla f(w^{(t)})\|^2] + \frac{\eta^2 L}{2} \sigma^2, \tag{16.50}$$
showing that the convergence rate depends on the interplay between the learning rate $\eta$, the smoothness constant $L$, and the gradient variance $\sigma^2$. For $\eta$ small enough, the dominant term is $-\frac{\eta}{2}\mathbb{E}[\|\nabla f(w^{(t)})\|^2]$, leading to a monotonic decrease in $f(w^{(t)})$. In the strongly convex case,
where f (w) satisfies
$$f(w_1) \ge f(w_2) + \nabla f(w_2)^{\top}(w_1 - w_2) + \frac{\mu}{2} \|w_1 - w_2\|^2 \tag{16.51}$$
for µ > 0, SGD achieves linear convergence. Specifically,
$$\mathbb{E}[\|w^{(t)} - w^*\|^2] \le (1 - \eta\mu)^t \|w^{(0)} - w^*\|^2 + \frac{\eta\sigma^2}{2\mu}. \tag{16.52}$$
For non-convex functions, where ∇2 f (w) can have both positive and negative eigenvalues, SGD
may converge to a local minimizer or saddle point. Stochasticity plays a pivotal role in escaping
strict saddle points ws where ∇f (ws ) = 0 but λmin (∇2 f (ws )) < 0.
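A minimal mini-batch SGD loop implementing Eqs. (16.43)-(16.44) on least-squares regression is sketched below; the synthetic data, batch size, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 1000, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=N)

w = np.zeros(d)
eta, m = 0.05, 32
for t in range(2000):
    batch = rng.choice(N, size=m, replace=False)   # mini-batch B_t
    residual = X[batch] @ w - y[batch]
    g_hat = X[batch].T @ residual / m              # Eq. (16.44)
    w = w - eta * g_hat                            # Eq. (16.43)

print("parameter error:", np.linalg.norm(w - w_true))
```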
Despite the theoretical consensus on NAG's superiority in convex optimization, Hermant et al.
(2024) [425] present an unexpected empirical and theoretical challenge to this assumption. Their
study systematically compares deterministic NAG with Stochastic Gradient Descent (SGD) under
convex function interpolation, revealing cases where SGD exhibits superior practical performance
despite lacking formal acceleration guarantees. Their findings raise fundamental questions about
the practical advantages of momentum-based methods in data-driven scenarios, particularly when
stochastic noise interacts with interpolation dynamics. Applying NAG beyond classical convex
optimization, Alavala and Gorthi (2024) [426] integrate it into medical imaging reconstruction,
specifically for Cone Beam Computed Tomography (CBCT). They develop a NAG-accelerated least
squares solver (NAG-LS), demonstrating substantial improvements in computational efficiency and
image reconstruction quality. Their results indicate that NAG’s ability to mitigate error propaga-
tion in iterative reconstruction algorithms makes it particularly well-suited for inverse problems in
medical imaging. From a generalization perspective, Li (2024) [427] formulates a unified momen-
tum framework encompassing NAG, Polyak’s Heavy Ball method, and other stochastic momentum
algorithms. By introducing a generalized momentum differential equation, he rigorously dissects
the trade-off between stability, acceleration, and variance control in momentum-based optimization.
His framework provides a cohesive theoretical structure for understanding how momentum-based
techniques interact with gradient noise, particularly in high-dimensional stochastic settings.
Beyond convexity, Gupta and Wojtowytsch (2024) [428] rigorously analyze NAG’s performance
in non-convex optimization landscapes, a setting where standard acceleration techniques are often
assumed ineffective. Their research establishes conditions under which NAG retains acceleration
benefits even in the absence of strong convexity, highlighting how NAG’s momentum interacts with
saddle points, sharp local minima, and benign non-convex structures. Their work provides a cru-
cial extension of NAG beyond convex functions, opening new avenues for its application in deep
learning and high-dimensional optimization. Meanwhile, Razzouki et al. (2024) [429] compile
a comprehensive survey of gradient-based optimization methods, systematically comparing NAG,
Adam, RMSprop, and other modern optimizers. Their analysis delves into theoretical convergence
guarantees, empirical performance benchmarks, and practical tuning considerations, emphasizing
how NAG’s momentum-driven updates compare against adaptive learning rate strategies. Their
survey serves as an authoritative reference for researchers seeking to navigate the landscape of
momentum-based optimization algorithms. Shifting towards hardware implementations, Wang et
al. (2025) [430] apply NAG to digital background calibration in Analog-to-Digital Converters
(ADCs). Their study demonstrates how NAG accelerates error correction algorithms in high-speed
ADC architectures, particularly in mitigating nonlinear distortions and improving signal-to-noise
ratios (SNRs). Their results provide compelling evidence that momentum-based optimization tran-
scends software applications, finding practical utility in high-performance electronic circuit design.
To further explore empirical performance trade-offs, Naeem et al. (2024) [431] conduct an ex-
haustive empirical evaluation of NAG, Adam, and Gradient Descent across various convex and
non-convex loss functions. Their results highlight that while NAG accelerates convergence in many
cases, it can induce oscillatory behavior in certain settings, necessitating adaptive momentum tun-
ing to prevent divergence. Their findings offer practical insights into optimizer selection strategies,
particularly in deep learning architectures where gradient curvature varies dynamically. Finally,
Campos et al. (2024) [432] extend NAG to optimization on Lie groups, a fundamental class of
non-Euclidean geometries. By adapting momentum-based gradient descent methods to Lie alge-
bra structures, they establish new convergence guarantees for optimization problems on curved
manifolds, an area crucial to robotics, physics, and differential geometry applications. Their work
signifies a major extension of NAG’s applicability, proving its efficacy beyond Euclidean space.
This method achieves a sublinear convergence rate of $O(1/t)$ in the convex case. In Momentum-Based Gradient Descent, the momentum-based update rule is:
$$v_t = \mu v_{t-1} - \eta \nabla f(\theta_t), \tag{16.58}$$
$$\theta_{t+1} = \theta_t + v_t, \tag{16.59}$$
where $v_t$ is a velocity-like term accumulating past gradients and $\mu$ is the momentum coefficient. Momentum damps oscillations and accelerates convergence, but can still oscillate excessively in ill-conditioned problems. The Nesterov Accelerated Gradient (NAG) adds a look-ahead strategy: instead of computing the gradient at $\theta_t$, NAG applies the momentum step first,
$$\tilde{\theta}_t = \theta_t + \mu v_{t-1}, \tag{16.60}$$
$$v_t = \mu v_{t-1} - \eta \nabla f(\tilde{\theta}_t), \tag{16.61}$$
$$\theta_{t+1} = \theta_t + v_t. \tag{16.62}$$
• Adaptive Step Size: The effective step size is modified dynamically, stabilizing the trajec-
tory.
To obtain the variational formulation of NAG, we derive NAG from an auxiliary optimization problem that minimizes an upper bound on $f(\theta)$. Define a quadratic approximation at the look-ahead iterate $\tilde{\theta}_t$:
$$\theta_{t+1} = \arg\min_{\theta} \left\{ f(\tilde{\theta}_t) + \nabla f(\tilde{\theta}_t)^{T} (\theta - \tilde{\theta}_t) + \frac{1}{2\eta} \|\theta - \tilde{\theta}_t\|^2 \right\}. \tag{16.63}$$
Solving for θt+1 :
θt+1 = θ̃t − η∇f (θ̃t ). (16.64)
This derivation justifies why NAG achieves adaptive step-size behavior. We now analyze convergence properties and optimality rates under convexity assumptions, beginning with gradient descent (GD). For gradient descent:
$$f(\theta_t) - f(\theta^*) = O\!\left(\frac{1}{t}\right). \tag{16.65}$$
This is suboptimal in large-scale settings. Regarding the NAG convergence rate, for smooth convex $f(\theta)$:
$$f(\theta_t) - f(\theta^*) = O\!\left(\frac{1}{t^2}\right). \tag{16.66}$$
This improvement is due to the momentum-enhanced look-ahead updates. For stability, we carry out a Lyapunov analysis. Define the Lyapunov function:
$$V_t = f(\theta_t) - f(\theta^*) + \frac{\gamma}{2} \|\theta_t - \theta^*\|^2 + \frac{\delta}{2} \|v_t\|^2. \tag{16.67}$$
Here, γ, δ > 0 are parameters chosen to ensure Vt is non-increasing. We analyze Vt+1 − Vt to show
it is non-positive. Expanding Vt+1 :
$$V_{t+1} = f(\theta_{t+1}) - f(\theta^*) + \frac{\gamma}{2} \|\theta_{t+1} - \theta^*\|^2 + \frac{\delta}{2} \|v_{t+1}\|^2. \tag{16.68}$$
Using $L$-smoothness:
$$f(\theta_{t+1}) \le f(\theta_t) + \nabla f(\theta_t)^{T} (\theta_{t+1} - \theta_t) + \frac{L}{2} \|\theta_{t+1} - \theta_t\|^2. \tag{16.69}$$
Since θt+1 = θt + vt , we substitute:
$$f(\theta_{t+1}) \le f(\theta_t) + \nabla f(\theta_t)^{T} v_t + \frac{L}{2} \|v_t\|^2. \tag{16.70}$$
Now, using $v_t = \mu v_{t-1} - \eta \nabla f(\tilde{\theta}_t)$, we analyze the term
$$\|\theta_{t+1} - \theta^*\|^2 = \|\theta_t + v_t - \theta^*\|^2. \tag{16.71}$$
Expanding:
$$\|\theta_{t+1} - \theta^*\|^2 = \|\theta_t - \theta^*\|^2 + 2(\theta_t - \theta^*)^{T} v_t + \|v_t\|^2. \tag{16.72}$$
Similarly, we expand
$$\|v_{t+1}\|^2 = \|\mu v_t - \eta \nabla f(\tilde{\theta}_{t+1})\|^2. \tag{16.73}$$
Expanding:
$$\|v_{t+1}\|^2 = \mu^2 \|v_t\|^2 - 2\mu\eta\, v_t^{T} \nabla f(\tilde{\theta}_{t+1}) + \eta^2 \|\nabla f(\tilde{\theta}_{t+1})\|^2. \tag{16.74}$$
We now choose $\gamma$ and $\delta$ to ensure descent. To ensure $V_{t+1} \le V_t$, we require:
Vt+1 − Vt ≤ 0. (16.75)
After substituting the above expansions and simplifying, we obtain a sufficient condition:
$$\gamma \ge \frac{L}{\eta}, \qquad \delta \ge \frac{1}{\eta}. \tag{16.76}$$
Choosing γ, δ appropriately, we conclude:
Vt+1 ≤ Vt (16.77)
which proves the global stability of NAG. In conclusion, since Vt is non-increasing and lower-
bounded (by 0), it converges, which implies that θt → θ∗ and the NAG iterates remain bounded.
Hence, we have rigorously proven the global stability of Nesterov’s Accelerated Gradient (NAG).
Practical considerations include:
• Choice of µ: Optimal momentum is µ = 1 − O(1/t).
• Adaptive Learning Rate: Choosing η = O(1/L) ensures convergence.
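The look-ahead structure of Eqs. (16.60)-(16.62) translates directly into code; the quadratic test problem and the values of $\mu$ and $\eta$ below are illustrative assumptions.

```python
import numpy as np

H = np.diag([100.0, 1.0])        # ill-conditioned quadratic Hessian
grad = lambda th: H @ th
theta = np.array([1.0, 1.0])
v = np.zeros(2)
eta, mu = 0.009, 0.9

for t in range(300):
    theta_look = theta + mu * v          # look-ahead point, Eq. (16.60)
    v = mu * v - eta * grad(theta_look)  # velocity update, Eq. (16.61)
    theta = theta + v                    # parameter update, Eq. (16.62)

print("theta after 300 NAG steps:", theta)
```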
We aim to minimize a stochastic objective function $f(\theta)$, where $\theta \in \mathbb{R}^d$ represents the parameters of the model. The optimization problem is:
$$\min_{\theta \in \mathbb{R}^d} \mathbb{E}_\xi[f(\theta; \xi)], \tag{16.78}$$
where $\xi$ is a random variable representing the stochasticity (e.g., mini-batch sampling in deep learning). The Adam optimizer maintains a first moment estimate (exponentially decaying average of gradients) given by:
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \tag{16.79}$$
where $g_t = \nabla_\theta f(\theta_{t-1}; \xi_t)$ is the stochastic gradient at time $t$, and $\beta_1 \in [0, 1)$ is the decay rate. The second moment estimate (exponentially decaying average of squared gradients) is given by:
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, \tag{16.80}$$
where $\beta_2 \in [0, 1)$ is the decay rate, and $g_t^2$ denotes element-wise squaring. The bias-corrected estimates are:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}. \tag{16.81}$$
The parameter update rule is:
$$\theta_t = \theta_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}, \tag{16.82}$$
where η is the learning rate, and ϵ > 0 is a small constant for numerical stability. To rigorously
analyze Adam, we impose the following assumptions. The gradient ∇θ f (θ) is Lipschitz continuous
with constant L:
∥∇θ f (θ1 ) − ∇θ f (θ2 )∥ ≤ L∥θ1 − θ2 ∥ (16.83)
The stochastic gradients gt are bounded almost surely:
∥gt ∥∞ ≤ G (16.84)
E[∥gt ∥2 ] ≤ σ 2 (16.85)
The decay rates β1 and β2 satisfy 0 ≤ β1 , β2 < 1, and β1 < β2 . We analyze Adam in the online
optimization framework, where the loss function ft (θ) is revealed sequentially. The goal is to bound
the regret:
$$R(T) = \sum_{t=1}^{T} f_t(\theta_t) - \min_{\theta} \sum_{t=1}^{T} f_t(\theta). \tag{16.87}$$
Regarding the boundedness of $\hat{m}_t$ and $\hat{v}_t$, using the boundedness of $g_t$ we can show:
$$\|\hat{m}_t\|_\infty \le \frac{G}{1 - \beta_1}, \qquad \|\hat{v}_t\|_\infty \le \frac{G^2}{1 - \beta_2}. \tag{16.89}$$
The bias-corrected estimates satisfy, under stationary gradient statistics,
$$\mathbb{E}[\hat{m}_t] \approx \mathbb{E}[g_t], \qquad \mathbb{E}[\hat{v}_t] \approx \mathbb{E}[g_t^2]. \tag{16.90}$$
The update rule scales the gradient by $\frac{1}{\sqrt{\hat{v}_t} + \epsilon}$, which adapts to the curvature of the loss function.
Under the assumptions, the regret of Adam can be bounded as:
$$R(T) \le \frac{D^2 \sqrt{T}}{2\eta(1 - \beta_1)} + \frac{\eta(1 + \beta_1) G^2}{(1 - \beta_1)(1 - \beta_2)(1 - \gamma)^2}, \tag{16.91}$$
where $\gamma = \beta_1/\beta_2$. This bound is $O(\sqrt{T})$, which is optimal for online convex optimization. Regarding
convergence in non-convex settings, one analyzes the convergence of Adam to a stationary point, showing that the smallest expected squared gradient norm, $\min_{t \le T} \mathbb{E}[\|\nabla f(\theta_t)\|^2]$, vanishes as $T \to \infty$.
Mathematically, at each iteration t, the Adam optimizer updates the parameter vector θt ∈ Rn ,
where n is the number of parameters of the model, based on the gradient gt , which is the gradient
of the objective function with respect to θt , i.e., gt = ∇θ f (θt ). In its essence, Adam computes
two distinct quantities: the first moment estimate mt and the second moment estimate vt , which
are recursive moving averages of the gradients and the squared gradients, respectively. The first
moment estimate $m_t$ is given by
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \tag{16.96}$$
where $\beta_1 \in [0, 1)$ is the decay rate for the first moment. This recurrence equation represents a
weighted moving average of the gradients, which is intended to capture the directional momentum
in the optimization process. By incorporating the first moment, Adam accumulates information
about the historical gradients, which helps mitigate oscillations and stabilizes the convergence
direction. The term (1 − β1 ) ensures that the most recent gradient gt receives a more significant
weight in the computation of mt . Similarly, the second moment estimate vt , which represents the
exponentially decaying average of the squared gradients, is updated as
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, \tag{16.97}$$
where $\beta_2 \in [0, 1)$ is the decay rate for the second moment. This moving average of squared gradi-
ents captures the variance of the gradient at each iteration. The second moment vt thus acts as an
estimate of the curvature of the objective function, which allows the optimizer to adjust the step
size for each parameter accordingly. Specifically, large values of vt correspond to parameters that
experience high gradient variance, signaling a need for smaller updates to prevent overshooting,
while smaller values of vt correspond to parameters with low gradient variance, where larger up-
dates are appropriate. This mechanism is akin to automatically tuning the learning rate for each
parameter based on the local geometry of the loss function. At initialization, both mt and vt are
typically set to zero. This initialization introduces a bias toward zero, particularly at the initial
time steps, causing the estimates of the moments to be somewhat underrepresented in the early
iterations. To correct for this bias, bias correction terms are introduced. The bias-corrected first
moment $\hat{m}_t$ is given by
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \tag{16.98}$$
and, analogously, the bias-corrected second moment is
$$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}. \tag{16.99}$$
The learning rate adjustment in Adam is dynamic in nature, as it is controlled by the second
moment estimate v̂t , which means that Adam has a per-parameter learning rate for each param-
eter. For each parameter, the learning rate is inversely proportional to the square root of its
corresponding second moment estimate v̂t , leading to adaptive learning rates. This is what enables
Adam to operate effectively in highly non-convex optimization landscapes, as it reduces the learn-
ing rate in directions where the gradient exhibits high variance, thus stabilizing the updates, and
increases the learning rate where the gradient variance is low, speeding up convergence. In the
case where Adam is applied to convex objective functions, convergence can be analyzed mathemat-
ically. Under standard assumptions, such as bounded gradients and a decreasing learning rate, the
convergence of Adam can be shown by proving that
∞
X ∞
X
ηt2 < ∞ and ηt = ∞, (16.101)
t=1 t=1
where ηt is the learning rate at time step t. The first condition ensures that the learning rate decays
sufficiently rapidly to guarantee convergence, while the second ensures that the learning rate does
not decay too quickly, allowing for continual updates as the algorithm progresses. However, Adam is
not without its limitations. One notable issue arises from the fact that the second moment estimate
vt may decay too quickly, causing overly aggressive updates in regions where the gradient variance
is relatively low. To address this, the AMSGrad variant was introduced. AMSGrad modifies the
second moment update rule by replacing $\hat{v}_t$ with
$$\hat{v}_t = \max(\hat{v}_{t-1}, v_t), \tag{16.102}$$
thereby ensuring that $\hat{v}_t$ never decreases, which helps prevent the optimizer from making overly
large updates in situations where the second moment estimate may be miscalculated. By forcing v̂t
to increase or remain constant, AMSGrad reduces the chance of large, destabilizing parameter up-
dates, thereby improving the stability and convergence of the optimizer, particularly in difficult or
ill-conditioned optimization problems. Additionally, further extensions of Adam, such as AdaBelief,
introduce additional modifications to the second moment estimate by introducing a belief-based
mechanism to correct the moment estimates. Specifically, AdaBelief estimates the second moment
v̂t in a way that adjusts based on the belief in the direction of the gradient, offering further stability
in cases where gradients may be sparse or noisy. These innovations underscore the flexibility of
Adam and its variants in optimizing complex loss functions across a range of machine learning tasks.
Ultimately, the Adam optimizer stands as a highly sophisticated, mathematically rigorous opti-
mization algorithm, effectively combining momentum and adaptive learning rates. By using both
the first and second moments of the gradient, Adam dynamically adjusts the parameter updates,
providing a robust and efficient optimization framework for non-convex, high-dimensional objective
functions. The use of bias correction, coupled with the adaptive nature of the optimizer, allows it
to operate effectively across a wide range of problem settings, making it a go-to method for many
machine learning and deep learning applications. The mathematical rigor behind Adam ensures
that it remains a highly stable and efficient optimization technique, capable of overcoming many
of the challenges posed by large-scale and noisy gradient information in machine learning models.
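The recursions of Eqs. (16.79)-(16.82), together with the AMSGrad maximum of Eq. (16.102), can be transcribed directly; the noisy quadratic objective and hyperparameters below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
# Noisy gradient of a quadratic with minimizer at [3, 3].
grad = lambda th: 2.0 * (th - 3.0) + 0.1 * rng.normal(size=th.shape)

theta = np.zeros(2)
m = np.zeros(2); v = np.zeros(2); v_max = np.zeros(2)
eta, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8
amsgrad = False                              # toggle Eq. (16.102)

for t in range(1, 2001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g          # first moment, Eq. (16.79)
    v = beta2 * v + (1 - beta2) * g ** 2     # second moment, Eq. (16.80)
    m_hat = m / (1 - beta1 ** t)             # bias corrections, Eq. (16.81)
    v_hat = v / (1 - beta2 ** t)
    if amsgrad:                              # non-decreasing v_hat, Eq. (16.102)
        v_max = np.maximum(v_max, v_hat)
        v_hat = v_max
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)  # Eq. (16.82)

print("theta (should approach [3, 3]):", theta)
```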
RMSProp addresses a weakness of the standard gradient descent optimization problem. The fundamental issue with standard gradient descent lies in the constant
learning rate η, which fails to account for the varying magnitudes of the gradients in different di-
rections of the parameter space. This lack of adaptation can cause inefficient optimization, where
large gradients may lead to overshooting and small gradients lead to slow convergence. RMSProp
addresses this problem by dynamically adjusting the learning rate based on the historical gradient
magnitudes, offering a more tailored and efficient approach. Consider the objective function f (θ),
where θ ∈ Rn is the vector of parameters that we aim to optimize. Let ∇f (θ) denote the gradient
of f (θ) with respect to θ, which is a vector of partial derivatives:
$$\nabla f(\theta) = \left( \frac{\partial f(\theta)}{\partial \theta_1}, \frac{\partial f(\theta)}{\partial \theta_2}, \ldots, \frac{\partial f(\theta)}{\partial \theta_n} \right)^{T}. \tag{16.103}$$
In traditional gradient descent, the update rule for θ is:
θt+1 = θt − η∇f (θt ), (16.104)
where η is the learning rate, a scalar constant. However, this approach does not account for the fact
that the gradient magnitudes may differ significantly along different directions in the parameter
space, especially in high-dimensional, non-convex functions. The RMSProp optimizer introduces a
solution by adapting the learning rate for each parameter in proportion to the magnitude of the
historical gradients. The key modification in RMSProp is the introduction of a running average
of the squared gradients for each parameter θi , denoted as E[g 2 ]i,t , which captures the cumulative
magnitude of the gradients over time. The update rule for E[g 2 ]i,t is given by the exponential
moving average formula:
$$E[g^2]_{i,t} = \beta E[g^2]_{i,t-1} + (1 - \beta) g_{i,t}^2, \tag{16.105}$$
where $g_{i,t} = \frac{\partial f(\theta_t)}{\partial \theta_i}$ is the gradient of the objective function with respect to the parameter $\theta_i$ at
time step t, and β is the decay factor, typically set close to 1 (e.g., β = 0.9). This recurrence
relation allows the gradient history to influence the current update while exponentially forgetting
older gradient information. The value of β determines the memory of the squared gradients, where
higher values of β give more weight to past gradients. The update for θi in RMSProp is then given
by:
$$\theta_{i,t+1} = \theta_{i,t} - \frac{\eta}{\sqrt{E[g^2]_{i,t} + \epsilon}} \, g_{i,t}, \tag{16.106}$$
where $\epsilon$ is a small positive constant (typically $\epsilon = 10^{-8}$) introduced to avoid division by zero and ensure numerical stability. The term $\frac{1}{\sqrt{E[g^2]_{i,t} + \epsilon}}$ dynamically adjusts the learning rate for each parameter based on the magnitude of the squared gradient history. This adjustment allows RMSProp
to take larger steps in directions where gradients have historically been small, and smaller steps
in directions where gradients have been large, leading to a more stable and efficient optimization
process. RMSprop (Root Mean Square Propagation) is an adaptive learning rate optimization
algorithm that incorporates the following recursive update for the mean squared gradient:
vt = βvt−1 + (1 − β)gt2 . (16.107)
where vt represents the exponentially weighted moving average of squared gradients at time t,
β ∈ (0, 1) is the decay rate that determines how much past gradients contribute, gt = ∇θ f (θt ) is
the stochastic gradient of the loss function f , gt2 represents the element-wise squared gradient.
The step update for parameters θ is given by:
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \, g_t, \tag{16.108}$$
where η is the learning rate, and ϵ is a small positive constant for numerical stability. The key term
of interest is the mean squared gradient estimate vt , and its mathematical properties will now
be studied with full rigor. Note that the recurrence equation is
$$v_t = \beta v_{t-1} + (1 - \beta) g_t^2 \tag{16.109}$$
$$= \beta\left(\beta v_{t-2} + (1 - \beta) g_{t-1}^2\right) + (1 - \beta) g_t^2 \tag{16.110}$$
$$= \beta^2 v_{t-2} + (1 - \beta)\beta g_{t-1}^2 + (1 - \beta) g_t^2. \tag{16.111}$$
Continuing this expansion:
$$v_t = \beta^t v_0 + (1 - \beta) \sum_{k=0}^{t-1} \beta^k g_{t-k}^2, \tag{16.112}$$
which represents an exponentially weighted moving average of past squared gradients. To analyze
the expectation, we formally introduce a probability space (Ω, F, P) where Ω is the sample space,
F is the sigma-algebra of measurable events, P is the probability measure governing the stochastic
process $g_t$. The stochastic gradients $g_t$ are assumed to be random variables
$$g_t : \Omega \to \mathbb{R}^d \tag{16.114}$$
with stationary second moment $\mathbb{E}[g_t^2] = \sigma_g^2$ for all $t$. Thus:
$$\mathbb{E}[v_t] = (1 - \beta)\sigma_g^2 \sum_{k=0}^{t-1} \beta^k. \tag{16.118}$$
Using the geometric series $\sum_{k=0}^{t-1} \beta^k = \frac{1 - \beta^t}{1 - \beta}$, we obtain:
E[vt ] = σg2 (1 − β t ). (16.120)
To find the asymptotic limit, we take $t \to \infty$:
$$\lim_{t \to \infty} \mathbb{E}[v_t] = \sigma_g^2. \tag{16.121}$$
Thus, the mean square estimate converges to the true second moment of the gradient. To
establish almost sure convergence, consider:
$$v_t - \sigma_g^2 = (1 - \beta) \sum_{k=0}^{t-1} \beta^k (g_{t-k}^2 - \sigma_g^2). \tag{16.122}$$
By the strong law of large numbers, for a sufficiently large number of iterations:
$$\sum_{k=0}^{t-1} \beta^k (g_{t-k}^2 - \sigma_g^2) \to 0 \quad \text{a.s.}, \tag{16.123}$$
which implies:
vt → σg2 a.s. (16.124)
In conclusion, we summarize the properties of the mean square estimate. Expanding the recursion in expectation:
$$\mathbb{E}[v_t] = \beta^t v_0 + (1 - \beta) \sum_{k=0}^{t-1} \beta^k \mathbb{E}[g_{t-k}^2]. \tag{16.129}$$
With $v_0 = 0$ and stationary gradients this gives $\mathbb{E}[v_t] = \sigma_g^2(1 - \beta^t)$, so the bias-adjusted estimate $\hat{v}_t = v_t/(1 - \beta^t)$ satisfies $\mathbb{E}[\hat{v}_t] = \sigma_g^2$. This confirms that bias-adjusted RMSprop provides an unbiased estimate of the second moment. We now turn to the almost sure convergence analysis, considering the difference:
$$v_t - \sigma_g^2 = (1 - \beta) \sum_{k=0}^{t-1} \beta^k (g_{t-k}^2 - \sigma_g^2). \tag{16.133}$$
Thus,
vt → σg2 a.s., v̂t → σg2 a.s. (16.135)
confirming that bias-adjusted RMSprop provides an asymptotically unbiased estimate of $\sigma_g^2$. We next analyze the stability of the learning rate. The effective learning rate in RMSprop is:
$$\eta_{\text{eff}} = \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}. \tag{16.136}$$
Therefore we have:
1. Without Bias Correction: Since $\beta^t$ is close to 1 in early iterations,
$$v_t \approx (1 - \beta) g_t^2. \tag{16.137}$$
Since $(1 - \beta) g_t^2 \ll \sigma_g^2$, the denominator in $\eta_{\text{eff}}$ is too small, leading to excessively large steps, causing instability.

2. With Bias Correction: Dividing by $(1 - \beta^t)$ rescales the estimate so that $\hat{v}_t \approx \sigma_g^2$ even for small $t$, keeping $\eta_{\text{eff}}$ well-behaved from the first iterations; a numerical check of this effect is sketched below.
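The following Monte Carlo sketch checks the bias statement numerically: without correction the estimate concentrates around $\sigma_g^2(1 - \beta^t)$, and dividing by $(1 - \beta^t)$ recovers $\sigma_g^2$. Gaussian gradients with a known second moment are an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(6)
beta, sigma_g, t_max, runs = 0.9, 2.0, 20, 100_000

v = np.zeros(runs)
for t in range(1, t_max + 1):
    g = rng.normal(scale=sigma_g, size=runs)
    v = beta * v + (1 - beta) * g ** 2       # recursion of Eq. (16.109)

print("E[v_t] (empirical)     :", v.mean())
print("sigma_g^2 * (1 - b^t)  :", sigma_g ** 2 * (1 - beta ** t_max))
print("bias-corrected estimate:", v.mean() / (1 - beta ** t_max))
```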
Mathematically, the key advantage of RMSProp over traditional gradient descent lies in its ability to adapt the learning rate according to the local geometry of the objective function. In regions where the objective function is steep (large gradients), RMSProp reduces the effective learning rate by dividing by $\sqrt{E[g^2]_{i,t}}$, mitigating the risk of overshooting. Conversely, in flatter regions
with smaller gradients, RMSProp increases the learning rate, allowing for faster convergence. This
self-adjusting mechanism is crucial in high-dimensional optimization tasks, where the gradients
along different directions can vary greatly in magnitude, as is often the case in deep learning tasks
involving neural networks. The exponential moving average of squared gradients used in RMSProp
is analogous to a form of local normalization, where each parameter is scaled by the inverse of the
running average of its gradient squared. This normalization ensures that the optimizer does not
become overly sensitive to gradients in any particular direction, thus stabilizing the optimization
process. In more formal terms, if the objective function f (θ) exhibits sharp curvatures along cer-
tain directions, RMSProp mitigates the effects of such curvatures by scaling down the step size
along those directions. This scaling behavior can be interpreted as a form of gradient re-weighting,
where the influence of each parameter’s gradient is modulated by its historical behavior, making
the optimizer more robust to ill-conditioned optimization problems. The introduction of ϵ ensures
that the denominator never becomes zero, even in the case where the squared gradient history for
a parameter θi becomes extremely small. This is crucial for maintaining the numerical stability of
the algorithm, particularly in scenarios where gradients may vanish or grow exceedingly small over
many iterations, as seen in certain deep learning applications, such as training very deep neural
networks. By providing a small non-zero lower bound to the learning rate, ϵ ensures that the up-
dates remain smooth and predictable.
RMSProp’s performance is heavily influenced by the choice of β, which controls the trade-off
between long-term history and recent gradient information. When β is close to 1, the optimizer
relies more heavily on the historical gradients, which is useful for capturing long-term trends in the
optimization landscape. On the other hand, smaller values of β allow the optimizer to be more re-
sponsive to recent gradient changes, which can be beneficial in highly non-stationary environments
or rapidly changing optimization landscapes. In the context of deep learning, RMSProp is particu-
larly effective for optimizing objective functions with complex, high-dimensional parameter spaces,
such as those encountered in training deep neural networks. The non-convexity of such objective
functions often leads to a gradient that can vary significantly in magnitude across different layers of
the network. RMSProp helps to balance the updates across these layers by adjusting the learning
rate based on the historical gradients, ensuring that all layers receive appropriate updates without
being dominated by large gradients from any single layer. This adaptability helps in preventing
gradient explosions or vanishing gradients, which are common issues in deep learning optimiza-
tion. In summary, RMSProp provides a robust and efficient optimization technique by adapting
the learning rate based on the historical squared gradients of each parameter. The exponential
decay of the squared gradient history allows RMSProp to strike a balance between stability and
adaptability, preventing overshooting and promoting faster convergence in non-convex optimization
problems. The introduction of ϵ ensures numerical stability, and the parameter β offers flexibility
in controlling the influence of past gradients. This makes RMSProp particularly well-suited for
high-dimensional optimization tasks, especially in deep learning applications, where the parameter
space is vast, and gradient magnitudes can differ significantly across dimensions. By effectively
normalizing the gradients and dynamically adjusting the learning rates, RMSProp significantly
enhances the efficiency and stability of gradient-based optimization methods.
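A per-parameter RMSProp loop implementing Eqs. (16.105)-(16.106) is sketched below on a two-dimensional quadratic whose coordinates have very different curvatures; the curvatures and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
curv = np.array([100.0, 1.0])       # very different curvatures per axis
grad = lambda th: curv * th + 0.01 * rng.normal(size=2)

theta = np.array([1.0, 1.0])
Eg2 = np.zeros(2)
eta, beta, eps = 0.01, 0.9, 1e-8

for t in range(500):
    g = grad(theta)
    Eg2 = beta * Eg2 + (1 - beta) * g ** 2           # Eq. (16.105)
    theta = theta - eta * g / (np.sqrt(Eg2) + eps)   # Eq. (16.106)

print("theta after 500 RMSProp steps:", theta)
```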
Srivastava et al. (2014) [132] introduced Dropout, a widely used regularization technique in deep learning. The au-
thors show how randomly dropping units during training reduces co-adaptation of neurons, thereby
enhancing model generalization. This technique remains a key part of modern neural network train-
ing pipelines. Zou and Hastie (2005) [133] introduced Elastic Net, a combination of L1 (Lasso) and
L2 (Ridge) regularization, which addresses the limitations of Lasso in handling correlated features.
It is particularly useful for high-dimensional data, where feature selection and regularization are
crucial. Vapnik (1995) [134] introduced Statistical Learning Theory and the VC-dimension,
which quantifies model complexity. It provides the mathematical framework explaining why overfit-
ting occurs and how regularization constraints reduce generalization error. It forms the theoretical
basis of Support Vector Machines (SVMs) and Structural Risk Minimization. Ng (2004) [135] com-
pares L1 (Lasso) and L2 (Ridge) regularization, demonstrating their impact on feature selection
and model stability. It shows that L1 regularization is more effective for sparse models, whereas L2
preserves information better in highly correlated feature spaces. This work is essential for choosing
the right regularization technique for specific datasets. Li (2025) [136] explored regularization tech-
niques in high-dimensional clinical trial data using ensemble methods, Bayesian optimization, and
deep learning regularization techniques. It highlights the practical application of regularization to
prevent overfitting in medical AI. Yasuda (2025) [137] focused on regularization in hybrid machine
learning models, specifically Gaussian–Discrete RBMs. It extends L1/L2 penalties and dropout
strategies to improve the generalization of deep generative models. It’s valuable for those working
on deep learning architectures and unsupervised learning.
Consider a training dataset $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^d$ represents the input feature vector for each data point, and $y_i \in \mathbb{R}$ represents the corresponding target value. The goal is to fit a neural network model $f(x; w)$ parameterized by weights $w \in \mathbb{R}^M$,
where M denotes the number of parameters in the model. The model’s objective is to minimize the
empirical risk, given by the mean squared error between the predicted values and the true target
values:
$$\hat{R}(w) = \frac{1}{N} \sum_{i=1}^{N} L(f(x_i; w), y_i), \tag{16.139}$$
where L denotes the loss function, typically the squared error L(ŷi , yi ) = (ŷi − yi )2 . In this frame-
work, the neural network tries to minimize the empirical risk on the training set. However, the
true goal is to minimize the expected risk R(w), which reflects the model’s performance on the
true distribution $P(x, y)$ of the data. This expected risk is given by:
$$R(w) = \mathbb{E}_{(x, y) \sim P}[L(f(x; w), y)]. \tag{16.140}$$
Overfitting occurs when the model minimizes $\hat{R}(w)$ to an excessively small value, but $R(w)$ remains
large, indicating that the model has fit the noise in the training data, rather than capturing the true
data distribution. This discrepancy arises from an overly complex model that learns to memorize
the training data rather than generalizing across different inputs. A fundamental insight into the
overfitting phenomenon comes from the bias-variance decomposition of the generalization error.
The total error in a model’s prediction fˆ(x) of the true target function g(x) can be decomposed
as:
$$\mathcal{E} = \mathbb{E}[(g(x) - \hat{f}(x))^2] = \text{Bias}^2(\hat{f}(x)) + \text{Var}(\hat{f}(x)) + \sigma^2, \tag{16.141}$$
where Bias2 (fˆ(x)) represents the squared difference between the expected model prediction and
the true function, Var(fˆ(x)) is the variance of the model’s predictions across different training sets,
and σ 2 is the irreducible error due to the intrinsic noise in the data. In the context of overfitting,
the model typically exhibits low bias (as it fits the training data very well) but high variance (as
it is highly sensitive to the fluctuations in the training data). Therefore, regularization techniques
aim to reduce the variance of the model while maintaining its ability to capture the true underlying
relationships in the data, thereby improving generalization. One of the most popular methods to
mitigate overfitting is L2 regularization (also known as weight decay), which adds a penalty term
to the loss function based on the squared magnitude of the weights. The regularized loss function
is given by:
$$\hat{R}_{\text{reg}}(w) = \hat{R}(w) + \lambda \|w\|_2^2 = \hat{R}(w) + \lambda \sum_{j=1}^{M} w_j^2, \tag{16.142}$$
where $\lambda$ is a positive constant controlling the strength of the regularization. The gradient of the regularized loss function with respect to the weights is:
$$\nabla_w \hat{R}_{\text{reg}}(w) = \nabla_w \hat{R}(w) + 2\lambda w. \tag{16.143}$$
The term $2\lambda w$ introduces weight shrinkage, which discourages the model from fitting excessively
large weights, thus preventing overfitting by reducing the model’s complexity. This regularization
approach is a direct way to control the model’s capacity by penalizing large weight values, leading
to a simpler model that generalizes better. In contrast, L1 regularization adds a penalty based
on the absolute values of the weights:
$$\hat{R}_{\text{reg}}(w) = \hat{R}(w) + \lambda \|w\|_1 = \hat{R}(w) + \lambda \sum_{j=1}^{M} |w_j|, \tag{16.144}$$
with gradient (where defined)
$$\nabla_w \hat{R}_{\text{reg}}(w) = \nabla_w \hat{R}(w) + \lambda\, \text{sgn}(w), \tag{16.145}$$
where $\text{sgn}(w)$ denotes the element-wise sign function. L1 regularization has the unique property of
inducing sparsity in the weights, meaning it drives many of the weights to exactly zero, effectively
selecting a subset of the most important features. This feature selection mechanism is particularly
useful in high-dimensional settings, where many input features may be irrelevant. A more advanced
regularization technique is dropout, which randomly deactivates a fraction of neurons during
training. Let hi represent the activation of the i-th neuron in a given layer. During training,
dropout produces a binary mask mi sampled from a Bernoulli distribution with success probability
p, i.e., mi ∼ Bernoulli(p), such that:
$$h_i^{\text{drop}} = \frac{1}{p} \, m_i \odot h_i, \tag{16.146}$$
where ⊙ denotes element-wise multiplication. The factor 1/p ensures that the expected value
of the activations remains unchanged during training. Dropout effectively forces the network to
learn redundant representations, reducing its reliance on specific neurons and promoting better
generalization. By training an ensemble of subnetworks with shared weights, dropout helps to
prevent the network from memorizing the training data, thus reducing overfitting. Early stopping
is another technique to prevent overfitting, which involves halting the training process when the
validation error starts to increase. The model is trained on the training set, but its performance is
evaluated on a separate validation set. If the validation error Rval (t) increases after several epochs,
training is stopped to prevent further overfitting. Mathematically, the stopping criterion is:
$$t^* = \arg\min_t R_{\text{val}}(t), \tag{16.147}$$
where $t^*$ represents the epoch at which the validation error reaches its minimum. This technique
avoids the risk of continuing to fit the training data beyond the point where the model starts to lose
its ability to generalize. Data augmentation artificially enlarges the training dataset by applying
transformations to the original data. Let $\mathcal{T} = \{T_1, T_2, \ldots, T_K\}$ represent a set of transformations (such as rotations, scaling, and translations). For each training example $(x_i, y_i)$, the augmented dataset $D'$ consists of $K$ new examples:
$$D' = \{(T_k(x_i), y_i) : k = 1, \ldots, K\}. \tag{16.148}$$
These transformations create new, varied examples, which help the model generalize better by
preventing it from fitting too closely to the original, potentially noisy data. Data augmentation is
particularly beneficial in domains like image processing, where transformations like rotations and
flips do not change the underlying label but provide additional examples to learn from. Batch
normalization normalizes the activations of each mini-batch to reduce internal covariate shift and
stabilize the learning process. Given a mini-batch $B = \{h_i\}_{i=1}^{m}$ with activations $h_i$, the mean and variance of the activations across the mini-batch are computed as:
$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} h_i, \qquad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (h_i - \mu_B)^2, \tag{16.149}$$
and the activations are normalized as
$$\hat{h}_i = \frac{h_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \tag{16.150}$$
where $\epsilon$ is a small constant for numerical stability. Batch normalization helps to smooth the
optimization landscape, allowing for faster convergence and mitigating the risk of overfitting by
preventing the model from getting stuck in sharp, narrow minima in the loss landscape.
In conclusion, overfitting is a significant challenge in training neural networks, and its prevention
requires a combination of techniques aimed at controlling model complexity, improving generaliza-
tion, and reducing sensitivity to noise in the training data. Regularization methods such as L2
and L1 regularization, dropout, and early stopping, combined with strategies like data augmenta-
tion and batch normalization, are fundamental to improving the performance of neural networks
on unseen data and ensuring that they do not overfit the training set. The mathematical formu-
lations and optimization strategies outlined here provide a detailed and rigorous framework for
understanding and mitigating overfitting in machine learning models.
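The regularizers above reduce to a few lines each; the sketch below computes the gradient contributions of Eqs. (16.143) and (16.145) and applies an inverted-dropout mask as in Eq. (16.146). The shapes, the penalty $\lambda$, and the retention probability are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(8)
w = rng.normal(size=10)
lam = 1e-2

grad_l2 = 2 * lam * w          # weight-decay term of Eq. (16.143)
grad_l1 = lam * np.sign(w)     # sparsity-inducing term of Eq. (16.145)

p = 0.8                        # retention probability
h = rng.normal(size=(32, 64))  # activations of one layer
mask = rng.binomial(1, p, size=h.shape)
h_drop = mask * h / p          # Eq. (16.146): expectation preserved

print("mean(h) vs mean(h_drop):", h.mean(), h_drop.mean())
```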
16.4.3 Dropout
16.4.3.1 Literature Review of Dropout
Srivastava et al. (2014) [132] introduced dropout as a regularization technique. The authors
demonstrated that randomly dropping units (along with their connections) during training pre-
vents overfitting by reducing co-adaptation among neurons. They provided theoretical insights
and empirical evidence showing that dropout improves generalization in deep neural networks.
Goodfellow et al. (2016) [112] wrote a comprehensive textbook that covers dropout in the context of regularization and overfitting. It explains dropout as an approximate Bayesian inference method
and discusses its relationship to ensemble learning and noise injection. The book also provides
a broader perspective on regularization techniques in deep learning. Srivastava et al. (2013) [547], in a technical report, expand on the dropout technique, providing additional insights into
its implementation and effectiveness. It discusses the impact of dropout on different architectures
and datasets, emphasizing its role in reducing overfitting and improving model robustness. Baldi
and Sadowski (2013) [548] provided a theoretical analysis of dropout, explaining why it works as
a regularization technique. The authors show that dropout can be interpreted as an adaptive
regularization method that penalizes large weights, leading to better generalization. While not
specifically about dropout, this paper by Zou and Hastie (2005) [133] introduced the Elastic Net,
a regularization technique that combines L1 and L2 penalties. It provides foundational insights
into regularization methods, which are conceptually related to dropout in their goal of preventing
overfitting. Gal and Ghahramani (2016) [549] established a theoretical connection between dropout
and Bayesian inference. The authors show that dropout can be interpreted as a variational ap-
proximation to a Bayesian neural network, providing a probabilistic framework for understanding
its regularization effects. Hastie et al. (2009) [130] provided a thorough grounding in statistical
learning, including regularization techniques. While it predates dropout, it offers essential back-
ground on overfitting, bias-variance tradeoff, and regularization methods like ridge regression and
Lasso, which are foundational to understanding dropout. Gal et al. (2016) [550] introduced an improved version of dropout called "Concrete Dropout", which automatically tunes the dropout
rate during training. This innovation addresses the challenge of manually selecting dropout rates
and enhances the regularization capabilities of dropout. Gal et al. (2016) [551] provided a rigorous
theoretical analysis of dropout in deep networks. It explores how dropout affects the optimization
landscape and the dynamics of training, offering insights into why dropout is effective in preventing
overfitting. Friedman et al. (2010) [552] focused on regularization paths for generalized linear
models, emphasizing the importance of regularization in preventing overfitting. While not specific
to dropout, it provides a strong foundation for understanding the broader context of regularization
techniques in machine learning.
Formally, training minimizes the empirical loss
$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} L(f_\theta(x_i), y_i), \tag{16.151}$$
where $L(f_\theta(x_i), y_i)$ is the loss function (e.g., cross-entropy or mean squared error), and $N$ is the
number of data samples. A model trained to minimize this loss function without regularization will
likely overfit to the training data, capturing the noise rather than the underlying distribution of
the data. Dropout addresses this by randomly “dropping out” a fraction of the network’s neurons
during each training iteration, which is mathematically represented by modifying the activations
of neurons.
Let us consider a feedforward neural network with a set of activations ai for the neurons in the i-th
layer, which is computed as ai = f (W xi + bi ), where W represents the weight matrix, xi the input
to the neuron, and bi the bias. During training with dropout, for each neuron, a random Bernoulli
variable ri is introduced, where:
ri ∼ Bernoulli(p) (16.152)
with probability p representing the retention probability (i.e., the probability that a neuron is kept
active), and 1 − p representing the probability that a neuron is “dropped” (set to zero). The
activation of the i-th neuron is then modified as follows:
a′_i = r_i · a_i = r_i · f(W x_i + b_i)    (16.153)
where ri is a random binary mask for each neuron. During each forward pass, different neurons are
randomly dropped out, and the network is effectively training on a different subnetwork, forcing
the network to learn a more robust set of features that do not depend on any particular neuron.
In this way, dropout acts as a form of ensemble learning, as each forward pass corresponds to a
different realization of the network.
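To make the masking of Eqs. (16.152)–(16.153) concrete, here is a minimal NumPy sketch of a dropout forward pass; the function name and the test-time scaling by p are illustrative conventions chosen for this example, not prescribed by the text.

```python
import numpy as np

def dropout_forward(a, p, training=True, rng=None):
    """Dropout on a vector of activations with retention probability p."""
    rng = rng if rng is not None else np.random.default_rng(0)
    if training:
        r = rng.binomial(1, p, size=a.shape)  # r_i ~ Bernoulli(p), Eq. (16.152)
        return r * a                          # a'_i = r_i * a_i,   Eq. (16.153)
    return p * a  # at test time, scale by p so expected activations match

a = np.array([1.0, -2.0, 0.5, 3.0])
print(dropout_forward(a, p=0.8))                   # one random subnetwork
print(dropout_forward(a, p=0.8, training=False))   # deterministic test-time pass
```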
The mathematical expectation of the loss function with respect to the dropout mask r can be
written as:
E_r[L_dropout(θ, r)] = (1/N) \sum_{i=1}^{N} L(f_θ(x_i, r), y_i)    (16.154)
where fθ (xi , r) is the output of the network with the dropout mask r. Since the dropout mask
is random, the loss is an expectation over all possible configurations of dropout masks. This
randomness induces an implicit ensemble effect, where the model is trained not just on a single set of
parameters θ, but effectively on a distribution of models, each corresponding to a different dropout
configuration. The model is, therefore, regularized because the network is forced to generalize
across these different subnetworks, and overfitting to the training data is prevented. One way to
gain deeper insight into dropout is to consider its connection with Bayesian inference. In the context
of deep learning, dropout can be viewed as an approximation to Bayesian posterior inference. In
Bayesian terms, we seek the posterior distribution of the network’s parameters θ, given the data
D, which can be written as:
p(θ|D) = p(D|θ) p(θ) / p(D)    (16.155)
where p(D|θ) is the likelihood of the data given the parameters, p(θ) is the prior distribution over the
parameters, and p(D) is the marginal likelihood of the data. Dropout approximates this posterior
by averaging over the outputs of many different subnetworks, each corresponding to a different
dropout configuration. This interpretation is formalized by observing that each forward pass with
a different dropout mask corresponds to a different realization of the model, and averaging over
all dropout masks gives an approximation to the Bayesian posterior. Thus, the expected output of
the network, given the data x, under dropout is:
E_r[f_θ(x)] = (1/M) \sum_{i=1}^{M} f_θ(x, r_i)    (16.156)
where ri is a dropout mask drawn from the Bernoulli distribution and M is the number of Monte
Carlo samples of dropout configurations. This expectation can be interpreted as a form of ensemble
averaging, where each individual forward pass corresponds to a different model sampled from the
posterior.
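The Monte Carlo average of Eq. (16.156) can be sketched as follows; the toy two-unit "network" and the choice of masking only the input layer are simplifying assumptions made for illustration.

```python
import numpy as np

def mc_dropout_predict(forward, x, p, M=1000, rng=None):
    """Monte Carlo estimate of E_r[f_theta(x)] (Eq. 16.156): average M
    stochastic forward passes, each with an independent Bernoulli mask."""
    rng = rng if rng is not None else np.random.default_rng(0)
    outputs = [forward(rng.binomial(1, p, size=x.shape) * x) for _ in range(M)]
    return np.mean(outputs, axis=0)

# toy "network": a fixed linear map followed by a ReLU
W = np.array([[0.5, -1.0], [2.0, 0.3]])
f = lambda z: np.maximum(W @ z, 0.0)
x = np.array([1.0, 2.0])
print(mc_dropout_predict(f, x, p=0.8))  # ensemble-averaged prediction
```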
Dropout is also highly effective because it controls the bias-variance tradeoff. The bias-variance
tradeoff is a fundamental concept in statistical learning, where increasing model complexity reduces
bias but increases variance, and vice versa. A highly complex model tends to have low bias but
high variance, meaning it fits the training data very well but fails to generalize to new data. Regu-
larization techniques, such as dropout, seek to reduce variance without increasing bias excessively.
Dropout achieves this by introducing stochasticity in the learning process. By randomly deacti-
vating neurons during training, the model is forced to learn robust features that do not depend on
the presence of specific neurons. In mathematical terms, the variance of the model’s output can be
expressed as:
Var(f_θ(x)) = E_r[(f_θ(x))²] − (E_r[f_θ(x)])²    (16.157)
By averaging over multiple dropout configurations, the variance is reduced, leading to better gener-
alization performance. Although dropout introduces some bias by reducing the network’s capacity
(since fewer neurons are available at each step), the variance reduction outweighs the bias increase,
resulting in improved generalization. Another key mathematical aspect of dropout is its relation-
ship with stochastic gradient descent (SGD). In the standard SGD framework, the parameters θ are
updated using the gradient of the loss with respect to the parameters. In the case of dropout, the
gradient is computed based on a stochastic subnetwork at each training iteration, which introduces
an element of randomness into the optimization process. The parameter update rule with dropout
can be written as:
θ_{t+1} = θ_t − η ∇_θ E_r[L_dropout(θ, r)]    (16.158)
where η is the learning rate, and ∇θ is the gradient of the loss with respect to the model parame-
ters. The expectation is taken over all possible dropout configurations, which means that at each
step, the gradient update is based on a different realization of the model. This stochasticity helps
the optimization process by preventing the model from getting stuck in local minima, improving
convergence towards global minima, and enhancing generalization. Finally, it is important to note
that dropout has a close connection with low-rank approximations. During each forward pass with
dropout, certain neurons are effectively removed, which reduces the rank of the weight matrix,
as some rows or columns of the matrix are set to zero. This stochastic reduction in rank forces
the network to learn lower-dimensional representations of the data, effectively performing low-rank
regularization. This aspect of dropout can be formalized by observing that each dropout mask
corresponds to a sparse matrix, and the network is effectively learning a low-rank approximation
of the data distribution. By doing so, dropout prevents the network from learning overly complex
representations that could overfit the data, leading to improved generalization.
Subsequent work introduced an efficient algorithm for computing the regularization path for L1-regularized general-
ized linear models (GLMs). It provides a practical framework for implementing L1 regularization in
various statistical models, including logistic regression and Poisson regression. Meinshausen (2007)
[554] explored the use of L1 regularization for sparse regression and its connection to marginal
testing. The authors rigorously analyze the consistency of L1 regularization in high-dimensional
settings and provide theoretical guarantees for variable selection. Carvalho et al. (2009) [555]
extended L1 regularization to Bayesian frameworks, introducing adaptive sparsity-inducing priors.
It provides a rigorous Bayesian interpretation of L1 regularization and demonstrates its application
in genomics, where overfitting is a significant concern.
A model that fits its training data too closely, capturing noise and spurious structure, exhibits poor generalization to unseen examples. Overfitting is especially prevalent in models with a large
number of features, where the model becomes overly flexible and may capture spurious correlations
between the features and the target variable. This often results in a model with high variance,
where small fluctuations in the data cause significant changes in the model predictions. To combat
this, regularization techniques are employed, which introduce a penalty term into the objective
function, discouraging overly complex models that fit noise.
Given a set of n observations {(xi , yi )}ni=1 , where each xi ∈ Rp is a feature vector and yi ∈ R
is the corresponding target value, the task is to find a parameter vector θ ∈ Rp that minimizes
the loss function. In standard linear regression, the objective is to minimize the mean squared
error (MSE), defined as:
L(θ) = (1/n) \sum_{i=1}^{n} (y_i − x_i^T θ)² = (1/n) ∥Xθ − y∥²    (16.159)
where X ∈ Rn×p is the design matrix, with rows xTi , and y ∈ Rn is the vector of target values.
The solution to this problem, without any regularization, is given by the ordinary least squares
(OLS) solution:
θ̂_OLS = (X^T X)^{−1} X^T y    (16.160)
This formulation, however, can lead to overfitting when p is large or when XT X is nearly singular.
In such cases, regularization is used to modify the loss function, adding a penalty term R(θ) to
the objective function that discourages large values for the parameters θi . The regularized loss
function is given by:
Lregularized (θ) = L(θ) + λR(θ) (16.161)
where λ is a regularization parameter that controls the strength of the penalty. The term
R(θ) penalizes the complexity of the model by imposing constraints on the magnitude of the
coefficients. Let us explore two widely used forms of regularization: L1 regularization (Lasso)
and L2 regularization (Ridge). L1 regularization involves adding the ℓ1 -norm of the parameter
vector θ as the penalty term:
R_L1(θ) = \sum_{i=1}^{p} |θ_i|    (16.162)
This formulation promotes sparsity in the parameter vector θ, causing many coefficients to be-
come exactly zero, effectively performing feature selection. In high-dimensional settings where
many features are irrelevant, L1 regularization helps reduce the model complexity by forcing ir-
relevant features to be excluded from the model. The effect of the L1 penalty can be understood
geometrically by noting that the constraint region defined by the ℓ1 -norm is a diamond-shaped
region in p-dimensional space. When solving this optimization problem, the coefficients often lie
on the boundary of this diamond, leading to a sparse solution with many coefficients being exactly
zero. Mathematically, the soft-thresholding solution that arises from solving the L1 regularized
optimization problem is given by:
This soft-thresholding property drives coefficients to zero when their magnitude is less than λ,
resulting in a sparse solution. L2 regularization, on the other hand, uses the ℓ2 -norm of the
390 CHAPTER 16. TRAINING NEURAL NETWORKS
This penalty term does not force any coefficients to be exactly zero but rather shrinks the coeffi-
cients towards zero, effectively reducing their magnitudes. The L2 regularization helps stabilize the
solution when there is multicollinearity in the features by reducing the impact of highly correlated
features. The optimization problem with L2 regularization leads to a ridge regression solution,
which is given by the following expression:
θ̂_ridge = (X^T X + λI)^{−1} X^T y    (16.167)
where I is the identity matrix. The L2 penalty introduces a circular or spherical constraint in
the parameter space, resulting in a solution where all coefficients are reduced in magnitude, but
none are eliminated. The Elastic Net regularization is a hybrid technique that combines both L1
and L2 regularization. The regularized loss function for Elastic Net is given by:
L_ElasticNet(θ) = (1/n) ∥Xθ − y∥² + λ₁ \sum_{i=1}^{p} |θ_i| + λ₂ \sum_{i=1}^{p} θ_i²    (16.168)
In this case, λ1 and λ2 control the strength of the L1 and L2 penalties, respectively. The Elas-
tic Net regularization is particularly useful when dealing with datasets where many features are
correlated, as it combines the sparsity-inducing property of L1 regularization with the stability-
enhancing property of L2 regularization. The Elastic Net has been shown to outperform L1 and
L2 regularization in some cases, particularly when there are groups of correlated features. The
optimization problem can be solved using coordinate descent or proximal gradient methods,
which efficiently handle the mixed penalties. The choice of regularization parameter λ is critical
in controlling the bias-variance tradeoff. A small value of λ leads to a low-penalty model that
is more prone to overfitting, while a large value of λ forces the coefficients to shrink towards zero,
potentially leading to underfitting. Thus, it is important to select an optimal value for λ to strike
a balance between bias and variance. This can be achieved by using cross-validation techniques,
where the model is trained on a subset of the data, and the performance is evaluated on the re-
maining data.
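As a hedged illustration of the closed-form ridge estimator of Eq. (16.167) and of selecting λ on held-out data, consider the following sketch; the synthetic data and the simple hold-out split are assumptions of the example.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution (Eq. 16.167): (X^T X + lam I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
theta_true = np.zeros(10); theta_true[:3] = [2.0, -1.0, 0.5]
y = X @ theta_true + 0.1 * rng.normal(size=50)

# hold-out selection of lambda on a validation split
Xtr, Xval, ytr, yval = X[:40], X[40:], y[:40], y[40:]
lams = np.logspace(-4, 2, 20)
errs = [np.mean((Xval @ ridge(Xtr, ytr, l) - yval) ** 2) for l in lams]
print("best lambda:", lams[int(np.argmin(errs))])
```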
Lasso can behave erratically when predictors are highly correlated, while Ridge shrinks coefficients but does not perform variable selection. Elastic Net balances these by encouraging group selection of corre-
lated variables and improving prediction accuracy, especially when the number of predictors exceeds
the number of observations. Hastie et. al. (2010) [130] provided a comprehensive overview of sta-
tistical learning methods, including detailed discussions on overfitting, regularization techniques,
and the Elastic Net. It explains the theoretical foundations of regularization, the bias-variance
tradeoff, and practical implementations of Elastic Net in high-dimensional data settings. Tibshi-
rani (1996) [553] introduced the Lasso (L1 regularization), which is a key component of Elastic Net.
Lasso performs both variable selection and regularization by shrinking some coefficients to zero.
The paper laid the groundwork for understanding how L1 regularization can prevent overfitting
in high-dimensional datasets. Hoerl and Kennard (1970) [556] introduced Ridge Regression (L2
regularization), which addresses multicollinearity and overfitting by shrinking coefficients toward
zero without setting them to zero. Ridge Regression is the other key component of Elastic Net, and
this paper provides the theoretical basis for its use in regularization. Bühlmann and van de Geer
(2011) [561] provided a rigorous treatment of high-dimensional statistics, including regularization
techniques like Elastic Net. It discusses the theoretical properties of Elastic Net, such as its ability
to handle correlated predictors and its consistency in variable selection. Friedman et. al. (2010)
[552] presented efficient algorithms for computing regularization paths for Lasso, Ridge, and Elastic
Net in generalized linear models. The authors introduce coordinate descent, a computationally effi-
cient method for fitting Elastic Net models, making it practical for large-scale datasets. Gareth et.
al. (2013) [562] provided an accessible introduction to regularization techniques, including Elastic
Net. It explains the intuition behind overfitting, the bias-variance tradeoff, and how Elastic Net
combines L1 and L2 penalties to improve model performance. Efron et. al. (2004) [563] introduced
the Least Angle Regression (LARS) algorithm, which is closely related to Lasso and Elastic Net.
LARS provides a computationally efficient way to compute the regularization path for Lasso and
Elastic Net, making it easier to understand the behavior of these methods. Fan and Li (2001) [564]
discussed the theoretical properties of variable selection methods, including Lasso and Elastic Net.
It introduces the concept of oracle properties, which ensure that the selected model performs as
well as if the true underlying model were known. The paper provides insights into why Elastic
Net is effective in high-dimensional settings. Meinshausen and Bühlmann (2006) [565] explored the
use of Lasso and related methods (including Elastic Net) in high-dimensional settings. It provides
theoretical guarantees for variable selection consistency and discusses the challenges of overfitting
in high-dimensional data. The insights from this paper are directly applicable to understanding
the performance of Elastic Net.
Overfitting is a critical issue in machine learning and statistical modeling, where a model learns the
training data too well, capturing not only the underlying patterns but also the noise and outliers,
leading to poor generalization performance on unseen data. Mathematically, overfitting can be
characterized by a significant discrepancy between the training error Etrain (θ) and the test error
Etest (θ), where θ represents the model parameters. Specifically, Etrain (θ) is minimized during train-
ing, but Etest (θ) remains high, indicating that the model has failed to generalize. This typically
occurs when the model complexity, quantified by the number of parameters or the degrees of free-
dom, is excessively high relative to the amount of training data available. To mitigate overfitting,
regularization techniques are employed, and among these, Elastic Net regularization stands out as
a particularly effective method due to its ability to combine the strengths of both L1 (Lasso) and
L2 (Ridge) regularization. Elastic Net regularization addresses overfitting by introducing a penalty
term to the loss function that constrains the magnitude of the model parameters θ. The general
form of the regularized loss function is given by

L_regularized(θ) = L(θ) + λ · Penalty(θ)    (16.169)

where λ is the regularization parameter controlling the strength of the penalty, and Penalty(θ) is a function that penalizes large or complex parameter values. In Elastic Net, the penalty term is a convex combination of the L1 and L2 norms of the parameter vector θ, expressed as

Penalty(θ) = α ∥θ∥₁ + (1 − α) ∥θ∥₂²    (16.170)

Here,
∥θ∥₁ = \sum_{i=1}^{n} |θ_i|    (16.171)
is the L1 norm, which encourages sparsity by driving some parameters to exactly zero, and
∥θ∥₂² = \sum_{i=1}^{n} θ_i²    (16.172)
is the squared L2 norm, which discourages large parameter values and promotes smoothness. The
mixing parameter α ∈ [0, 1] controls the balance between the L1 and L2 penalties, with α = 1
corresponding to pure Lasso regularization and α = 0 corresponding to pure Ridge regularization.
For a linear regression model, the Elastic Net loss function takes the form
L(θ) = (1/2m) \sum_{i=1}^{m} (y_i − θ^T x_i)² + λ (α∥θ∥₁ + (1 − α)∥θ∥₂²)    (16.173)
where m is the number of training examples, yi is the target value for the i-th example, xi is the
feature vector for the i-th example, and θ is the vector of model parameters. The first term in the
loss function,
(1/2m) \sum_{i=1}^{m} (y_i − θ^T x_i)²    (16.174)
represents the mean squared error (MSE) of the model predictions, while the second term,
λ (α∥θ∥₁ + (1 − α)∥θ∥₂²)    (16.175)
represents the Elastic Net penalty. The regularization parameter λ controls the overall strength of
the penalty, with larger values of λ resulting in stronger regularization and simpler models. The
optimization problem for Elastic Net regularization is formulated as
min_θ { (1/2m) \sum_{i=1}^{m} (y_i − θ^T x_i)² + λ (α∥θ∥₁ + (1 − α)∥θ∥₂²) }    (16.176)
This is a convex optimization problem, and its solution can be obtained using iterative algorithms
such as coordinate descent or proximal gradient methods. The coordinate descent algorithm up-
dates one parameter at a time while holding the others fixed, and the update rule for the j-th
parameter θ_j is given by

θ_j ← S( \sum_{i=1}^{m} x_{ij} (y_i − ỹ_i^{(−j)}), λα ) / (1 + λ(1 − α))    (16.177)

where ỹ_i^{(−j)} is the prediction excluding the contribution of feature j, and S(z, γ) is the soft-thresholding operator defined as

S(z, γ) = sign(z) · max(|z| − γ, 0)    (16.178)

The Elastic Net penalty has several important consequences. First, the L1 component (α∥θ∥₁) promotes sparsity and performs variable selection by setting some coefficients to zero. This is especially useful in high-dimensional settings
where the number of features n is much larger than the number of training examples m. Second, the
L2 component ((1 − α)∥θ∥22 ) encourages a grouping effect, where correlated features tend to have
similar coefficients. Third, the mixing parameter α provides flexibility in balancing the sparsity-
inducing effect of L1 regularization with the smoothness-promoting effect of L2 regularization. In
practice, the hyperparameters λ and α must be carefully tuned to achieve optimal performance.
This is typically done using cross-validation. The Elastic Net regularization path, which describes
how the coefficients θ change as λ varies, can be computed efficiently using algorithms such as least
angle regression (LARS) with Elastic Net modifications.
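A minimal sketch of the coordinate-descent update of Eqs. (16.177)–(16.178) follows; it assumes standardized columns of X and folds a 1/m normalization into the soft-thresholding argument, so it is an illustrative variant rather than a production solver.

```python
import numpy as np

def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0), Eq. (16.178)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def elastic_net_cd(X, y, lam, alpha, n_iter=200):
    """Coordinate descent for the Elastic Net objective (Eq. 16.176),
    cycling through the per-coordinate update of Eq. (16.177)."""
    m, p = X.shape
    theta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding feature j
            r_j = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ r_j / m
            theta[j] = soft_threshold(rho, lam * alpha) / (1.0 + lam * (1 - alpha))
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8)); X /= X.std(axis=0)   # standardized features
theta_true = np.array([3.0, -2.0, 0, 0, 1.5, 0, 0, 0])
y = X @ theta_true + 0.1 * rng.normal(size=100)
print(np.round(elastic_net_cd(X, y, lam=0.1, alpha=0.9), 2))  # sparse recovery
```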
In conclusion, Elastic Net regularization is a mathematically rigorous and scientifically sound tech-
nique for controlling overfitting in machine learning models. By combining the sparsity-inducing
properties of L1 regularization with the smoothness-promoting properties of L2 regularization,
Elastic Net provides a flexible and effective framework for handling high-dimensional data, multi-
collinearity, and feature selection.
Goodfellow et. al. (2016) [112] provided a comprehensive overview of deep learning, including
detailed discussions on overfitting and regularization techniques. It explains early stopping as a form
of regularization that prevents overfitting by halting training when validation performance plateaus.
The book rigorously connects early stopping to other regularization methods like weight decay and
dropout, emphasizing its role in controlling model complexity. Montavon et. al. (2012) [566]
compiled practical techniques for training neural networks, including early stopping. It highlights
how early stopping acts as an implicit regularizer by limiting the effective capacity of the model.
The authors provide empirical evidence and theoretical insights into why early stopping works,
comparing it to explicit regularization methods like L2 regularization. Bishop (2006) [116] provided
a rigorous mathematical treatment of overfitting and regularization. It discusses early stopping in
the context of gradient-based optimization, showing how it prevents overfitting by controlling the
effective number of parameters. The book also connects early stopping to Bayesian inference,
framing it as a way to balance model complexity and data fit. Prechelt (1998) [567] provided a
systematic analysis of early stopping criteria, such as generalization loss and progress measures.
He introduces quantitative metrics to determine the optimal stopping point and demonstrates its
effectiveness in preventing overfitting across various datasets and architectures. Zhang et. al.
(2021) [446] challenged traditional views on generalization in deep learning. It shows that deep
neural networks can fit random labels, highlighting the importance of regularization techniques like
early stopping. The authors argue that early stopping is crucial for ensuring models generalize well,
even in the presence of high capacity. Friedman et. al. (2010) [552] introduced coordinate descent
algorithms for regularized linear models, including L1 and L2 regularization. While not exclusively
about early stopping, it provides a theoretical framework for understanding how regularization
techniques, including early stopping, control model complexity and prevent overfitting. Hastie
et. al. (2010) [130] discussed early stopping as a regularization method in the context of gradient
boosting and neural networks. The authors explain how early stopping reduces variance by limiting
the number of iterations, thereby improving generalization performance. While primarily focused
on dropout, Srivastava et. al. (2014) [132] compared dropout to other regularization techniques,
including early stopping. It highlights how early stopping complements dropout by preventing
overfitting during training. The authors provide empirical results showing the combined benefits
of these methods.
The training loss is defined as

L_train(θ) = (1/N) \sum_{i=1}^{N} ℓ(f(x_i; θ), y_i)    (16.179)

where ℓ(·) is a loss function quantifying the discrepancy between the predicted output f(x_i; θ) and
the true label yi . Overfitting occurs when the model achieves a very low training loss Ltrain (θ) but
a significantly higher generalization loss Ltest (θ), evaluated on an independent test dataset Dtest .
This discrepancy arises because the model has effectively memorized the training data, including
its noise, rather than learning the true underlying patterns.
Early stopping is a regularization technique that mitigates overfitting by dynamically halting the
training process before the model fully converges to a minimum of the training loss. This is achieved
by monitoring the model’s performance on a separate validation dataset D_val = {(x_j, y_j)}_{j=1}^{M}, which
is distinct from both the training and test datasets. The validation loss Lval (θ) is computed as:
L_val(θ) = (1/M) \sum_{j=1}^{M} ℓ(f(x_j; θ), y_j)    (16.180)
During training, the model parameters θ are updated iteratively using an optimization algorithm
such as gradient descent, which follows the update rule:

θ_{t+1} = θ_t − η ∇_θ L_train(θ_t)    (16.181)

where η is the learning rate and ∇_θ L_train(θ_t) is the gradient of the training loss with respect
to the parameters θ at iteration t. Early stopping intervenes in this process by evaluating the
validation loss Lval (θt ) at each iteration t and terminating training when Lval (θt ) ceases to decrease
or begins to increase. This point of termination is determined by a patience parameter P , which
specifies the number of iterations to wait after the last improvement in Lval (θt ) before stopping.
The effectiveness of early stopping as a regularization mechanism can be understood through its
implicit control over the model’s complexity. By limiting the number of training iterations T , early
stopping restricts the model’s capacity to fit the training data perfectly, thereby preventing it from
overfitting. This can be formalized by considering the relationship between the number of iterations
T and the effective complexity of the model. Specifically, early stopping imposes an implicit
constraint on the optimization process, preventing the model from reaching a sharp minimum of
the training loss Ltrain (θ), which is often associated with poor generalization. Instead, early stopping
encourages convergence to a flatter minimum, which is more robust to perturbations in the data.
The regularization effect of early stopping can be further analyzed through its connection to explicit
regularization techniques. It has been shown that early stopping is mathematically equivalent to
imposing an implicit L2 regularization penalty on the model parameters θ. This equivalence arises
because early stopping effectively restricts the norm of the parameter updates ∥θt − θ0 ∥, where θ0
is the initial parameter vector. The strength of this implicit regularization is inversely proportional
to the number of iterations T , as fewer iterations result in smaller updates to θ. Formally, this can
be expressed as:
∥θ_T − θ_0∥ ≤ C(T)    (16.182)
where C(T ) is a function that decreases with T . This constraint on the parameter updates is
analogous to the explicit L2 regularization penalty λ∥θ∥22 , where λ controls the strength of the
regularization. Thus, early stopping can be viewed as a form of adaptive regularization, where the
regularization strength is determined by the number of iterations T . The theoretical foundation
of early stopping is further supported by its connection to the bias-variance tradeoff in statistical
learning. By limiting the number of iterations T , early stopping reduces the variance of the model,
as it prevents the model from fitting the noise in the training data. At the same time, it introduces
a small amount of bias, as the model may not fully capture the underlying data-generating distribu-
tion. This tradeoff is optimized by selecting the stopping point T that minimizes the generalization
error, which can be estimated using cross-validation or a held-out validation set.
In summary, early stopping is a powerful and theoretically grounded technique for controlling
overfitting in machine learning models. By dynamically halting the training process based on the
validation loss, it imposes an implicit regularization constraint on the model parameters, preventing
them from growing too large and overfitting the training data. This regularization effect is math-
ematically equivalent to an implicit L2 penalty, and it is rooted in the principles of optimization
theory and statistical learning. Through its connection to the bias-variance tradeoff, early stopping
provides a principled approach to balancing model complexity and generalization performance,
making it an essential tool in the machine learning practitioner’s toolkit.
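The patience-based stopping rule described above can be sketched generically; the helper names (step, val_loss) and the noisy quadratic test problem are assumptions of this example, not part of the text.

```python
import numpy as np

def train_with_early_stopping(step, val_loss, theta0, patience=10, max_iter=1000):
    """Run iterative updates, monitor validation loss, and stop after
    `patience` consecutive iterations without improvement; return the
    best parameters observed (cf. the patience parameter P above)."""
    theta = theta0
    best_theta, best, wait = theta0.copy(), np.inf, 0
    for _ in range(max_iter):
        theta = step(theta)
        lv = val_loss(theta)
        if lv < best:
            best, best_theta, wait = lv, theta.copy(), 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_theta, best

# toy problem: quadratic training loss with a noisy validation surrogate
rng = np.random.default_rng(0)
A = np.diag([1.0, 10.0])
step = lambda th: th - 0.05 * (A @ th)                  # gradient descent update
val_loss = lambda th: float(th @ A @ th) + 0.01 * abs(rng.normal())
theta, best = train_with_early_stopping(step, val_loss, np.array([5.0, 5.0]))
print(theta, best)
```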
provides insights into how regularization techniques can be combined with data augmentation to
control model complexity and prevent overfitting. The paper is particularly useful for understand-
ing the theoretical underpinnings of regularization. Zhang et. al. (2017) [569] introduced Mixup, a
data augmentation technique that creates new training examples by linearly interpolating between
pairs of inputs and their labels. Mixup acts as a form of regularization by encouraging the model
to behave linearly between training examples, thereby reducing overfitting. The paper provides
theoretical and empirical evidence of its effectiveness. Cubuk et al. (2019) [572] proposed Au-
toAugment, a method for automatically learning optimal data augmentation policies from data.
By tailoring augmentation strategies to the specific dataset, AutoAugment acts as a powerful reg-
ularization technique, significantly reducing overfitting and improving model performance. The
paper demonstrates the effectiveness of this approach on multiple benchmarks. Perez (2017) [571]
provided a detailed empirical study of how data augmentation reduces overfitting in deep neural
networks. It compares various augmentation techniques and their impact on model generalization.
The authors also discuss the relationship between augmentation and other regularization methods,
providing insights into how they can be combined for optimal performance.
The fitted model is the empirical risk minimizer f̂ = arg min_{f∈H} (1/|D_train|) \sum_{(x_i,y_i)∈D_train} L(f(x_i), y_i), where L is the loss function measuring the error between the model’s predictions f̂(x) and the true
labels y, and fˆ is the model that minimizes the empirical risk on Dtrain . The primary cause of
overfitting is the model’s excessive capacity to fit the training data, which is often a consequence
of high model complexity relative to the size and diversity of Dtrain . Data augmentation addresses
overfitting by artificially expanding the training dataset Dtrain through the application of a set of
transformations T to the existing data points. These transformations are designed to preserve
the semantic content of the data while introducing variability that reflects plausible real-world
variations. Formally, let T : X → X be a transformation function that maps an input x ∈ X to a
transformed input T(x). The augmented dataset D_aug is then constructed as:

D_aug = D_train ∪ {(T(x_i), y_i) : (x_i, y_i) ∈ D_train, T ∈ T}    (16.184)

The model is subsequently trained on D_aug instead of D_train, which effectively increases the size of
the training dataset and introduces additional diversity. This process can be viewed as implicitly
defining a new empirical risk minimization problem:
f̂ = arg min_{f∈H} (1/|D_aug|) \sum_{(x_i,y_i)∈D_aug} L(f(x_i), y_i).    (16.185)
By training on Daug , the model is exposed to a broader range of data variations, which encourages
it to learn more robust and generalizable features. This reduces the risk of overfitting by preventing
the model from over-relying on specific patterns or noise present in the original training data. The
effectiveness of data augmentation can be analyzed through the lens of the bias-variance trade-
off. Without data augmentation, the model may exhibit high variance due to its ability to fit the
limited training data too closely. Data augmentation reduces this variance by effectively increasing
the size of the training dataset, thereby constraining the model’s capacity to fit noise. At the
same time, it introduces a controlled form of bias by encouraging the model to learn features that
are invariant to the applied transformations. This trade-off can be formalized by considering the
expected generalization error Egen of the model, which decomposes into bias and variance terms:
E_gen = E_{(x,y)∼P}[(f̂(x) − y)²] = Bias(f̂)² + Var(f̂) + σ²,    (16.186)
where σ 2 represents the irreducible noise in the data. Data augmentation reduces Var(fˆ) by increas-
ing the effective sample size, while the bias term Bias(fˆ) may increase slightly due to the constraints
imposed by the invariance requirements. The choice of transformations T is critical to the success
of data augmentation. For instance, in image classification tasks, common transformations include
rotations, translations, scaling, flipping, and color jittering. Each transformation T ∈ T can be
represented as a function T : Rd → Rd , where d is the dimensionality of the input space. The set
T should be designed such that the transformed data points T (x) remain semantically consistent
with the original labels y. Mathematically, this can be expressed as:
P (y | T (x)) ≈ P (y | x) ∀T ∈ T . (16.187)
This ensures that the augmented data points are valid representatives of the underlying data distri-
bution P . In addition to reducing overfitting, data augmentation also has the effect of smoothing
the loss landscape of the optimization problem. The loss function L evaluated on the augmented
dataset Daug can be viewed as a regularized version of the original loss function:
L_aug(f) = (1/|D_aug|) \sum_{(x_i,y_i)∈D_aug} L(f(x_i), y_i).    (16.188)
This augmented loss function typically exhibits a more convex and smoother optimization land-
scape, which facilitates convergence during training. The smoothness of the loss landscape can be
quantified using the Lipschitz constant L of the gradient ∇L_aug, which satisfies:

∥∇L_aug(θ₁) − ∇L_aug(θ₂)∥ ≤ L ∥θ₁ − θ₂∥    (16.189)

A smaller Lipschitz constant L indicates a smoother loss landscape, which is beneficial for opti-
mization algorithms such as gradient descent.
In conclusion, data augmentation is a powerful and mathematically grounded technique for control-
ling overfitting in machine learning models. By artificially expanding the training dataset through
the application of semantically preserving transformations, data augmentation reduces the model’s
reliance on specific patterns and noise in the original training data. This leads to improved gen-
eralization performance by balancing the bias-variance trade-off and smoothing the optimization
landscape. The rigorous formulation of data augmentation as a form of implicit regularization
provides a solid theoretical foundation for its widespread use in practice.
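A minimal sketch of constructing D_aug from label-preserving transforms, in the sense of Eqs. (16.184)–(16.185), is given below; the jitter and rescaling transforms are illustrative, approximately label-preserving stand-ins for task-appropriate augmentations.

```python
import numpy as np

def augment(X, y, transforms):
    """Construct D_aug: keep the original data and append one transformed
    copy of every example per (approximately) label-preserving transform T."""
    X_parts, y_parts = [X], [y]
    for T in transforms:
        X_parts.append(np.array([T(x) for x in X]))
        y_parts.append(y)
    return np.concatenate(X_parts), np.concatenate(y_parts)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(float)          # label depends on the sign of x_0
transforms = [
    lambda x: x + rng.normal(scale=0.05, size=x.shape),  # small jitter
    lambda x: x * (1.0 + 0.1 * abs(rng.normal())),       # mild positive rescaling
]
X_aug, y_aug = augment(X, y, transforms)
print(X_aug.shape, y_aug.shape)  # three times the original sample count
```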
16.4.8 Cross-Validation
16.4.8.1 Literature Review of Cross-Validation
Hastie et. al. (2010) [130] provided a comprehensive overview of statistical learning methods,
including detailed discussions on overfitting, bias-variance tradeoff, and regularization techniques
(e.g., Ridge Regression, Lasso). It also covers cross-validation as a tool for model selection and
evaluation. The book rigorously explains how regularization mitigates overfitting by introducing
penalty terms to the loss function, and how cross-validation helps in tuning hyperparameters. Tib-
shirani (1996) [553] introduced the Lasso (Least Absolute Shrinkage and Selection Operator), a
regularization technique that performs both variable selection and shrinkage to prevent overfitting.
The paper demonstrates how Lasso’s L1 penalty encourages sparsity in model coefficients, making
it particularly useful for high-dimensional data. It also discusses cross-validation for selecting the
regularization parameter. Bishop (2006) [116] provided a deep dive into probabilis-
tic models and regularization techniques, including Bayesian regularization and weight decay. It
explains how regularization controls model complexity and prevents overfitting by penalizing large
weights. The book also discusses cross-validation as a method for assessing model performance
and selecting hyperparameters. Hoerl and Kennard (1970) [556] introduced Ridge Regression, an
L2 regularization technique that addresses multicollinearity and overfitting in linear models. The
authors demonstrate how adding a penalty term to the least squares objective function shrinks
coefficients, reducing variance at the cost of introducing bias. Cross-validation is highlighted as
a method for choosing the optimal regularization parameter. Domingos (2012) [573] provided
practical insights into machine learning, including the importance of avoiding overfitting and the
role of regularization. He emphasized the tradeoff between model complexity and generalization,
and how techniques like cross-validation help in selecting models that generalize well to unseen
data. Goodfellow et. al. (2016) [112] covered regularization techniques specific to deep learning,
such as dropout, weight decay, and early stopping. It explains how these methods prevent over-
fitting in neural networks and discusses cross-validation as a tool for hyperparameter tuning. The
book also explores the theoretical underpinnings of regularization in the context of deep models.
Srivastava et. al. (2014) [132] introduced dropout, a regularization technique for neural networks
that randomly deactivates neurons during training. The authors demonstrate that dropout reduces
overfitting by preventing co-adaptation of neurons and effectively ensembles multiple sub-networks.
Cross-validation is used to evaluate the performance of dropout-regularized models. Gareth et. al.
(2013) [562] provided an accessible introduction to key concepts in statistical learning, including
overfitting, regularization, and cross-validation. It explains how techniques like Ridge Regression
and Lasso improve model generalization and how cross-validation helps in selecting the best model.
The book includes practical examples and R code for implementation. Stone (1974) [574] formalized
the concept of cross-validation as a method for assessing predictive performance and preventing
overfitting. Stone discusses how cross-validation provides an unbiased estimate of model perfor-
mance by partitioning data into training and validation sets. The paper lays the groundwork for
using cross-validation in conjunction with regularization techniques. Friedman et. al. (2010) [552]
presented efficient algorithms for computing regularization paths for generalized linear models, in-
cluding Lasso and Elastic Net. The authors demonstrate how these techniques balance bias and
variance to prevent overfitting. The paper also discusses the use of cross-validation for selecting
the optimal regularization parameters.
A model is said to overfit if there exists another model g such that R̂(f ) < R̂(g) but R(f ) >
R(g). This discrepancy is analytically understood through the bias-variance decomposition of the generalization error:

E[(f(x) − y)²] = (E[f(x)] − f*(x))² + V[f(x)] + σ²

Overfitting corresponds to the regime where V[f(x)] is significantly large while (E[f(x)] − f*(x))²
remains small, meaning that the model is highly sensitive to variations in the training set. Cross-
validation provides a principled mechanism for estimating R(f ) and preventing overfitting by sim-
ulating model performance on unseen data. The most rigorous formulation of cross-validation is
k-fold cross-validation, where the dataset D is partitioned into k disjoint subsets D1 , D2 , . . . , Dk ,
each containing approximately N/k samples. For each j ∈ {1, 2, . . . , k}, we train the model on the
dataset
Dtrain (j) = D \ Dj (16.193)
and evaluate it on the validation set Dj , computing the validation error:
R̂_j(f) = (1/|D_j|) \sum_{(x_i,y_i)∈D_j} L(y_i, f(x_i))    (16.194)

The cross-validation estimate of the risk is the average over the k folds:

R̂_CV(f) = (1/k) \sum_{j=1}^{k} R̂_j(f)    (16.195)
This estimation introduces a tradeoff between bias and variance depending on the choice of k.
A small k, such as k = 2, results in high bias due to insufficient training data per fold, while
large k, such as k = N (leave-one-out cross-validation, LOOCV), results in high variance due to
the extreme sensitivity of the validation error to single observations. The variance of the cross-
validation estimator itself is approximated by:
Var(R̂_CV) = (1/k) \sum_{j=1}^{k} Var(R̂_j)    (16.196)
For k = N (LOOCV), the estimator takes the form R̂_LOOCV(f) = (1/N) \sum_{i=1}^{N} L(y_i, f_{−i}(x_i)), where f_{−i} is the model trained on D_{−i} = D \ {(x_i, y_i)}. The key advantage of LOOCV is its nearly unbiased nature,
but its computational cost scales as O(N ) times the cost of training the model, making it infeasible
for large datasets. Another important mathematical consequence of cross-validation is its role
in hyperparameter selection. Suppose a model fλ is parameterized by λ (e.g., the regularization
parameter in Ridge regression). Cross-validation allows us to find

λ* = arg min_λ R̂_CV(f_λ)

This optimization ensures that the selected hyperparameter minimizes generalization error rather
than just empirical risk. In practical applications, hyperparameter tuning via cross-validation is
often performed over a logarithmic grid {λ1 , λ2 , . . . , λm }, and the optimal λ∗ is obtained via
λ* = arg min_{λ_j} (1/k) \sum_{j=1}^{k} R̂_j(f_{λ_j})    (16.200)
This selection mechanism rigorously prevents overfitting by ensuring that the model complexity
is chosen based on its generalization capacity rather than its fit to the training data. A deeper
understanding of the bias-variance tradeoff in cross-validation is achieved through its impact on
model complexity. If fd (x) denotes a model of complexity d, its cross-validation risk behaves as:
R_CV(f_d) = R(f_d) + O(d/N)    (16.201)
This formulation makes explicit that increasing model complexity d leads to lower empirical risk
but higher variance, necessitating cross-validation as a control mechanism to balance these compet-
ing factors. Finally, an advanced theoretical justification for cross-validation arises from stability
theory. The stability of a learning algorithm quantifies how small perturbations in the training set
affect its output. Formally, a learning algorithm is γ-stable if, for two datasets D and D′ differing
by a single point
sup_x |f_D(x) − f_{D′}(x)| ≤ γ    (16.202)
Cross-validation is most effective for stable algorithms, where γ-stability ensures that the cross-validation estimate concentrates around the true risk, |R̂_CV(f) − R(f)| ≤ γ + O(1/√N).
For highly unstable algorithms (e.g., deep neural networks with small datasets), cross-validation
estimates exhibit significant variance, making regularization even more critical.
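The k-fold procedure of Eqs. (16.193)–(16.195), together with its use for hyperparameter selection, can be sketched as follows; the ridge fitter and the candidate λ grid are assumptions of the example.

```python
import numpy as np

def k_fold_cv(X, y, fit, loss, k=5, rng=None):
    """k-fold cross-validation estimate of the risk (Eqs. 16.193-16.195):
    average the validation error over k held-out folds."""
    rng = rng if rng is not None else np.random.default_rng(0)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for j in range(k):
        val = folds[j]
        tr = np.concatenate([folds[i] for i in range(k) if i != j])
        model = fit(X[tr], y[tr])          # train on D \ D_j
        errs.append(loss(model, X[val], y[val]))  # evaluate on D_j
    return float(np.mean(errs))

# example: choosing a ridge lambda by 5-fold CV
def make_fit(lam):
    return lambda X, y: np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

mse = lambda th, X, y: float(np.mean((X @ th - y) ** 2))
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
y = X @ np.array([1, 0, 2, 0, 0.0]) + 0.1 * rng.normal(size=60)
scores = {lam: k_fold_cv(X, y, make_fit(lam), mse) for lam in [0.01, 0.1, 1.0, 10.0]}
print("selected lambda:", min(scores, key=scores.get))
```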
16.4.9 Pruning
16.4.9.1 Literature Review of Pruning
LeCun et. al. (1990) [575] introduced the concept of pruning in neural networks. They proposed
the "optimal brain damage" (OBD) and "optimal brain surgeon" (OBS) algorithms, which prune weights based on their contribution to the loss function. These techniques reduce overfitting by simplifying the model architecture. They showed that pruning based on second-order derivatives
(Hessian matrix) is more effective than random pruning, as it preserves critical weights. Li et.
al. (2016) [576] focused on pruning convolutional neural networks (CNNs) by removing entire
filters rather than individual weights. It demonstrates that filter pruning significantly reduces
computational cost while maintaining accuracy, effectively addressing overfitting in large CNNs.
Pruning filters based on their L1-norm magnitude is a simple yet effective regularization technique. Frankle and Carbin (2018) [577] introduced the "lottery ticket hypothesis," which states that dense neural networks contain smaller subnetworks ("winning tickets") that, when trained in isolation, achieve comparable performance to the original network. Pruning helps identify these subnetworks, reducing overfitting by focusing on essential parameters. The authors proposed that iterative pruning and retraining can uncover sparse, highly generalizable models. Han et. al.
(2015) [578] proposed a pruning technique that removes redundant connections and retrains the
network to recover accuracy. It introduces a systematic approach to pruning and demonstrates
its effectiveness in reducing overfitting while compressing models. The authors proposed that pruning followed by retraining preserves model performance and reduces overfitting by eliminating
unnecessary complexity. Liu et. al. (2018) [579] challenged the conventional wisdom that pruning is
primarily for model compression. It shows that pruning can also serve as a regularization technique,
improving generalization by removing redundant parameters. The authors proposed that pruning
can be viewed as a form of architecture search, leading to models that generalize better. Cheng
et. al. (2017) [580] provided a comprehensive overview of model compression techniques, including
pruning, quantization, and knowledge distillation. It highlights how pruning reduces overfitting
by simplifying models and removing redundant parameters. The authors proposed that pruning
is a key component of a broader strategy to improve model efficiency and generalization. Frankle
et. al. (2020) [581] investigated the limitations of pruning neural networks at initialization (before
training). It highlights the challenges of identifying important weights early and suggests that
iterative pruning during training is more effective for regularization. The authors proposed that
pruning is most effective when combined with training, as it allows the model to adapt to the
reduced architecture.
If W has too many parameters, the model can memorize training data, leading to an excessive gap
between the empirical and true risk:
R(W) = R̂(W) + O( \sqrt{d_VC / N} )    (16.206)
where dVC is the Vapnik-Chervonenkis (VC) dimension, a fundamental measure of model com-
plexity. Overfitting occurs when dVC is excessively large relative to N , leading to high variance.
Pruning aims to reduce dVC while preserving network functionality, thereby controlling
complexity and improving generalization. The Mathematical Formulation of Pruning is of a Con-
strained Optimization Problem. Pruning can be rigorously formulated as a constrained empirical
risk minimization problem. The objective is to minimize the empirical risk while enforcing a
constraint on the number of nonzero weights. Mathematically, this is expressed as:
min_W R̂(W)  subject to  ∥W∥₀ ≤ k    (16.207)
where ∥W ∥0 is the L0 norm, counting the number of nonzero parameters, and k is the sparsity
constraint. Since direct L0 minimization is computationally intractable (NP-hard), practical ap-
proaches approximate this problem using continuous relaxations such as L1 regularization or
thresholding heuristics.
Let’s now discuss some theoretical Justifications of different Types of Pruning. For Weight Pruning
we start with eliminating Redundant Parameters. Weight pruning removes individual weights that
contribute negligibly to the network’s predictions. Given a weight matrix W , the simplest form of
pruning is threshold-based removal:
W′ = {w_j ∈ W : |w_j| > τ}    (16.208)
A continuous relaxation of this hard threshold instead penalizes the ℓ1-norm of the weights, min_W R̂(W) + λ∥W∥₁. L1 pruning results in a soft-thresholding effect, where small weights decay towards zero, reducing
model complexity in a continuous and differentiable manner. Neuron Pruning is defined as the
removing of entire neurons based on activation strength. Beyond individual weights, entire neurons
can be pruned based on their average activation magnitude. Given a neuron hi (x) in layer l with
weight vector Wi , we define its mean absolute activation over the dataset as:
A_i = (1/N) \sum_{j=1}^{N} |h_i(x_j)|.    (16.211)

Neurons whose mean activation A_i falls below a chosen threshold are removed.
Neuron pruning leads to a direct reduction in layer width, modifying the function class and
affecting expressivity. The effective VC dimension of a fully connected network of depth L with
layer sizes {n1 , n2 , . . . , nL } satisfies:
d_VC ≈ \sum_{l=1}^{L} n_l².    (16.213)
Filter pruning extends this idea to convolutional networks, scoring each filter F_i by its Frobenius norm ∥F_i∥_F. Filters with norms below a threshold τ are removed, solving the optimization problem:
" m
#
X
F̂ = arg min R̂(F ) + λ ∥Fi ∥F (16.216)
F
i=1
Pruning filters leads to significant reductions in computational cost, directly improving in-
ference speed while maintaining accuracy. There are some Generalization Bounds for Pruned Net-
works: PAC Learning and VC Dimension Reduction. A pruned neural network exhibits a reduced
function class complexity, leading to stronger generalization guarantees. The PAC (Proba-
bly Approximately Correct) bound states that for any confidence level δ, the probability of
excessive generalization error is bounded by:
P( |R(W′) − R̂(W′)| > ϵ ) ≤ 2 exp( −2Nϵ² / d′_VC )    (16.217)
Since pruning reduces d′_VC, it results in a tighter PAC bound, enhancing model robustness. In conclusion, pruning is a principled approach to overfitting control, rooted in optimization theory, PAC learning, VC-dimension reduction, and empirical risk minimization. By removing redundant weights, neurons, or filters, pruning improves generalization, tightens complexity bounds, and enhances computational efficiency.
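Threshold-based magnitude pruning in the sense of Eq. (16.208) can be sketched in a few lines; choosing the threshold τ as a quantile of |W| is one simple convention for this example, not prescribed by the text.

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Zero out the fraction `sparsity` of weights with smallest magnitude,
    i.e., keep W' = {w_j : |w_j| > tau} with tau a quantile of |W| (Eq. 16.208)."""
    tau = np.quantile(np.abs(W), sparsity)  # threshold tau
    mask = np.abs(W) > tau
    return W * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))
W_pruned, mask = magnitude_prune(W, sparsity=0.5)
print("nonzero weights kept:", int(mask.sum()), "of", W.size)
```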
ensembles. It highlights how optimizing regularization parameters (e.g., learning rate, subsampling
rate) can mitigate overfitting and improve ensemble performance.
Overfitting occurs when the empirical risk is minimized at the cost of a large true risk, i.e., R̂(f) ≪ R(f), which leads to poor generalization. This phenomenon can be rigorously analyzed using the bias-
variance decomposition, which states that the expected squared error of a learned function f
satisfies
E[(f (x) − y)2 ] = (E[f (x)] − f ∗ (x))2 + V[f (x)] + σ 2 . (16.221)
The first term represents the bias, which measures systematic deviation from the true function.
The second term represents the variance, which quantifies the sensitivity of f to fluctuations in
the training data. The third term, σ 2 , represents irreducible noise inherent in the data. Overfitting
occurs when the variance term dominates, which is particularly problematic in ensemble methods
when base learners are highly complex. To understand overfitting in boosting, consider a sequence
of models f1 , f2 , . . . , fT iteratively trained to correct errors of previous models. The boosting
procedure constructs a final model as a weighted sum:
F_T(x) = \sum_{t=1}^{T} α_t f_t(x).    (16.222)
For AdaBoost, the weights αt are chosen to minimize the exponential loss,
L(F_T) = \sum_{i=1}^{N} exp(−y_i F_T(x_i)),    (16.223)
i=1
which shows that boosting places exponentially increasing emphasis on misclassified points,
leading to overfitting when noise is present in the data. For bagging, which constructs multiple
base models fm trained on bootstrap samples and aggregates their predictions as
F(x) = (1/M) \sum_{m=1}^{M} f_m(x),    (16.225)
we analyze variance reduction. If fm are independent with variance σ 2 , then the ensemble variance
satisfies
V[F(x)] = (1/M) σ².    (16.226)
However, in practice, base models are correlated, introducing a term ρ such that
V[F(x)] = (1/M) σ² + (1 − 1/M) ρσ².    (16.227)
As M → ∞, variance reduction is limited by ρ, which is exacerbated when deep decision trees are
used, leading to overfitting. To combat overfitting, regularization techniques are employed. One
approach is pruning in decision trees, where complexity is controlled by minimizing
L(T) = \sum_{i=1}^{N} ℓ(f_T(x_i), y_i) + λ|T|,    (16.228)
where |T | is the number of terminal nodes, and λ penalizes complexity. Another approach is
shrinkage in boosting, where the update rule is modified to

F_t(x) = F_{t−1}(x) + η α_t f_t(x)    (16.229)

where η is a step size satisfying 0 < η < 1. Theoretical analysis shows that small η ensures the
ensemble function sequence remains in a Lipschitz-continuous function space, preventing overfitting.
Finally, in random forests, overfitting is mitigated by decorrelating base models through feature
subsampling. Given a feature set F of dimension d, each base tree is trained on a randomly
selected subset Fm ⊂ F of size k ≪ d, ensuring models remain diverse. Theoretical analysis shows
that feature selection reduces expected correlation ρ between base models, thereby decreasing
ensemble variance:
V[F(x)] = (1/M) σ² + (1 − 1/M) (k/d) σ².    (16.230)
Thus, by rigorously analyzing bias-variance tradeoffs, deriving variance-reduction formulas, and
proving shrinkage effectiveness, we ensure ensemble methods generalize effectively.
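The correlated-ensemble variance formula (16.227) can be checked empirically with a small simulation; modeling each base model's error as a shared Gaussian component plus an independent one is an assumption of this sketch, chosen so that Var(e_m) = σ² and the pairwise correlation equals ρ.

```python
import numpy as np

# Empirical check of Eq. (16.227): each base error is e_m = c + u_m, with a
# shared component c of variance rho*sigma^2 and independent components u_m
# of variance (1 - rho)*sigma^2.
rng = np.random.default_rng(0)
M, rho, sigma2, n_trials = 20, 0.3, 1.0, 50_000
common = rng.normal(scale=np.sqrt(rho * sigma2), size=n_trials)
indiv = rng.normal(scale=np.sqrt((1 - rho) * sigma2), size=(n_trials, M))
ensemble_err = (common[:, None] + indiv).mean(axis=1)   # bagged average error
print("empirical variance:", round(float(ensemble_err.var()), 4))
print("formula (1/M)s2 + (1 - 1/M)rho*s2:",
      round(sigma2 / M + (1 - 1 / M) * rho * sigma2, 4))
```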
controls overfitting. It provides rigorous theoretical justifications and empirical studies supporting
the role of noise-based regularization in deep learning. Pei et. al. (2025) [594] explored the ap-
plication of noise injection techniques in convolutional neural networks (CNNs) for electric vehicle
load forecasting. It investigates the impact of different regularization methods, including L1/L2
penalties, dropout, and Gaussian noise injection, on reducing overfitting. The study highlights
how controlled noise perturbations can enhance generalization performance in time-series forecast-
ing tasks. Chen (2024) [595] demonstrated how noise injection, combined with data augmentation
techniques like rotation and shifting, serves as an implicit regularization technique in deep learning
models. The study finds that while noise injection marginally improves AUC scores, its effect varies
depending on the complexity of the dataset, making it a viable yet context-dependent method for
controlling overfitting. An et. al. (2024) [596] introduced a noise-based regularized cross-entropy
(RCE) loss function for robust brain tumor segmentation. It argues that controlled noise injection
during training prevents overfitting by making models less sensitive to small variations in input
data. The study provides empirical evidence that noise-assisted learning improves segmentation
performance by enhancing feature robustness. Song and Liu (2024) [597] presented a novel ad-
versarial training technique integrating label noise as a form of regularization. It investigates the
theoretical underpinnings of noise injection in preventing catastrophic overfitting in adversarial
settings and provides a comparative analysis with traditional dropout and weight decay methods.
Overfitting arises when a model fˆ(x; θ), parameterized by θ ∈ Θ, learns not only the true underlying
function f (x) = E[Y |X = x] but also the noise ϵ = Y − f (X) present in the training data
D = {(xi , yi )}ni=1 . Formally, the generalization error Egen (θ) and training error Etrain (θ) are defined
as:

E_gen(θ) = E_{(X,Y)∼P}[ L(Y, f̂(X; θ)) ],    (16.231)

E_train(θ) = (1/n) \sum_{i=1}^{n} L(y_i, f̂(x_i; θ))    (16.232)
where L is a loss function. Overfitting occurs when Egen (θ) ≫ Etrain (θ), indicating that the model
has high variance and poor generalization. This phenomenon is exacerbated when the hypoth-
esis class Θ has excessive capacity, as measured by its Vapnik-Chervonenkis (VC) dimension or
Rademacher complexity. Regularization addresses overfitting by introducing a penalty term R(θ)
to the empirical risk minimization problem:
θ̂ = arg min_{θ∈Θ} ( (1/n) \sum_{i=1}^{n} L(y_i, f̂(x_i; θ)) + λ · R(θ) )    (16.233)
where λ > 0 is a hyperparameter controlling the trade-off between fitting the data and minimizing
the penalty. Common choices for R(θ) include the ℓ2 -norm ∥θ∥22 (ridge regression) and the ℓ1 -norm
∥θ∥1 (lasso). These penalties constrain the model’s capacity, favoring solutions with smaller norms
and reducing variance. Noise injection is a stochastic regularization technique that introduces
randomness into the training process to improve generalization. For input noise injection, let
η ∼ Q be a random noise vector sampled from a distribution Q (e.g., Gaussian N (0, σ 2 I)). The
perturbed input is x̃i = xi + ηi , and the modified training objective becomes:
θ̂ = arg min_{θ∈Θ} (1/n) \sum_{i=1}^{n} E_{η_i∼Q}[ L(y_i, f̂(x̃_i; θ)) ].    (16.234)
This expectation can be approximated using Monte Carlo sampling or analyzed using a second-
order Taylor expansion:
E_η[ L(y_i, f̂(x_i + η; θ)) ] ≈ L(y_i, f̂(x_i; θ)) + (σ²/2) Tr( ∇²_x L(y_i, f̂(x_i; θ)) ),    (16.235)
where ∇2x L is the Hessian matrix of the loss with respect to the input. The second term acts
as an implicit regularizer, penalizing the curvature of the loss function and encouraging smoother
solutions. For weight noise injection, noise is added directly to the model parameters: θ̃ = θ + η,
where η ∼ Q. The training objective becomes:
θ̂ = arg min_{θ∈Θ} (1/n) \sum_{i=1}^{n} E_{η∼Q}[ L(y_i, f̂(x_i; θ̃)) ].    (16.236)
This formulation encourages the model to converge to flatter minima in the loss landscape, which
are associated with better generalization. The flatness of a minimum can be quantified using the
eigenvalues of the Hessian matrix ∇2θ L. Output noise injection introduces randomness into the
target labels: ỹi = yi + ϵi , where ϵi ∼ Q. The training objective becomes:
θ̂ = arg min_{θ∈Θ} (1/n) \sum_{i=1}^{n} E_{ϵ_i∼Q}[ L(ỹ_i, f̂(x_i; θ)) ].    (16.237)
This prevents the model from fitting the training labels too closely, reducing overfitting and im-
proving robustness. Theoretical guarantees for noise injection can be derived using tools from
statistical learning theory. The Rademacher complexity of the hypothesis class Θ is reduced by
noise injection, leading to tighter generalization bounds. The empirical Rademacher complexity is
defined as:

R̂_n(Θ) = E_σ[ sup_{θ∈Θ} (1/n) \sum_{i=1}^{n} σ_i f̂(x_i; θ) ],    (16.238)
where σi are Rademacher random variables. Noise injection effectively reduces R̂n (Θ), as the model
is forced to learn robust features that are invariant to small perturbations. From a PAC-Bayesian
perspective, noise injection can be interpreted as a form of distributional robustness. It ensures
that the model performs well not only on the training distribution but also on perturbed versions
of it. The PAC-Bayesian bound takes the form:
E_{θ∼Q}[E_gen(θ)] ≤ E_{θ∼Q}[E_train(θ)] + (1/(2n)) ( KL(Q∥P) + log(n/δ) ),    (16.239)
where Q is the posterior distribution over parameters induced by noise injection, P is a prior
distribution, and KL(Q∥P ) is the Kullback-Leibler divergence. In the continuous-time limit, noise
injection can be modeled as a stochastic differential equation (SDE):
dθ_t = −∇_θ L(θ_t) dt + σ dW_t,    (16.240)
where Wt is a Wiener process. This SDE converges to a stationary distribution that favors flat
minima, which generalize better. The stationary distribution p(θ) satisfies the Fokker-Planck equa-
tion:
∇_θ · ( p(θ) ∇_θ L(θ) ) + (σ²/2) ∇²_θ p(θ) = 0.    (16.241)
The flatness of the minima can be quantified using the eigenvalues of the Hessian matrix ∇2θ L.
From an information-theoretic perspective, noise injection increases the entropy of the model’s
predictions, reducing overconfidence and improving calibration. The mutual information I(θ; D)
between the parameters and the data is reduced, leading to better generalization. The information
bottleneck principle formalizes this intuition:
\[
\min_{\theta}\; I(\theta; D) \quad \text{subject to} \quad \mathbb{E}_{(X,Y)\sim P}\big[ L(Y, \hat{f}(X; \theta)) \big] \le \epsilon, \qquad (16.242)
\]
where
\[
\mathcal{R}_N(\mathcal{H}) = \mathbb{E}_{D,\sigma}\left[ \sup_{f \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^{N} \sigma_i f(x_i) \right], \qquad (16.247)
\]
and σi are Rademacher random variables. Overfitting occurs when the model complexity, as mea-
sured by RN (H), is too large relative to the sample size N , leading to a high generalization gap.
Regularization addresses overfitting by introducing a penalty term Ω(θ) into the empirical risk
minimization framework, yielding the regularized loss function:
where λ controls the strength of regularization. Common choices for Ω(θ) include the L2 -norm
\[
\Omega(\theta) = \|\theta\|_2^2 = \sum_{j=1}^{p} \theta_j^2. \qquad (16.249)
\]
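As a concrete illustration of these penalties and of the Monte Carlo approximation of (16.234), here is a minimal NumPy sketch; the linear model, the squared loss, and all parameter values are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(x, theta):
    # Hypothetical linear model; any differentiable predictor works here.
    return x @ theta

def squared_loss(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def regularized_noisy_risk(theta, X, y, lam=1e-2, sigma=0.1, n_mc=8):
    # Monte Carlo estimate of E_eta[L(y, f(x + eta))] (eq. 16.234) ...
    risk = np.mean([
        squared_loss(y, model(X + sigma * rng.standard_normal(X.shape), theta))
        for _ in range(n_mc)
    ])
    # ... plus the L2 penalty lam * ||theta||_2^2 (eq. 16.249).
    return risk + lam * np.sum(theta ** 2)

X, y = rng.standard_normal((100, 5)), rng.standard_normal(100)
print(regularized_noisy_risk(np.zeros(5), X, y))
```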
Batch normalization (BN) introduces an additional layer of complexity to this framework by nor-
malizing the activations of a neural network within each mini-batch B = {x1 , x2 , . . . , xm }. For a
given activation x ∈ Rd , BN computes the normalized output x̂ as:
\[
\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad (16.253)
\]
where \mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i is the mini-batch mean, \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2 is the mini-batch variance,
and ϵ is a small constant for numerical stability. The normalized output is then scaled and shifted
using learnable parameters γ and β, yielding the final output
y = γ x̂ + β. (16.254)
This transformation ensures that the activations have zero mean and unit variance during training,
reducing internal covariate shift and stabilizing the optimization process. The regularization effect
of BN arises from its stochastic nature and its impact on the optimization dynamics. During
training, the use of mini-batch statistics introduces noise into the gradient updates, which can be
modeled as:
g̃(θ) = g(θ) + η, (16.255)
where g(θ) = ∇θ L(θ) is the true gradient, g̃(θ) is the stochastic gradient computed using BN,
and η is a zero-mean random variable with covariance Σ. This noise acts as a form of stochastic
regularization, biasing the optimization trajectory toward flatter minima, which are associated with
better generalization. The regularization effect can be further analyzed using the continuous-time
limit of stochastic gradient descent (SGD), described by the stochastic differential equation (SDE):
\[
d\theta_t = -\nabla_\theta L(\theta_t)\, dt + \eta\, \Sigma\, dW_t, \qquad (16.256)
\]
where Wt is a Wiener process. The noise term ηΣdWt induces an implicit regularization effect, as
it biases the trajectory of θt toward regions of the parameter space with smaller curvature. From
a theoretical perspective, the regularization effect of BN can be formalized using the PAC-Bayes
framework. Let Q(θ) be a posterior distribution over the parameters induced by BN, and let P (θ)
be a prior distribution. The PAC-Bayes bound states:
\[
\mathbb{E}_{\theta \sim Q}[R(\theta)] \le \mathbb{E}_{\theta \sim Q}[\hat{R}(\theta)] + \frac{1}{2N}\left( \mathrm{KL}(Q \,\|\, P) + \log\frac{1}{\delta} \right), \qquad (16.257)
\]
Empirical studies have demonstrated that BN reduces the need for explicit regularization tech-
niques, such as dropout and weight decay, by introducing an implicit regularization effect that is
both data-dependent and adaptive. However, the exact form of this implicit regularization remains
an open question, and further theoretical analysis is required to fully understand the interaction
between BN and other regularization techniques. In conclusion, batch normalization is a powerful
tool that not only stabilizes and accelerates training but also introduces a sophisticated form of im-
plicit regularization, which can be rigorously analyzed using tools from statistical learning theory,
optimization, and stochastic processes.
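The training-time transformation (16.253)-(16.254) is straightforward to express directly; the following is a minimal NumPy sketch of the forward pass (running statistics and backpropagation are omitted).

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Mini-batch normalization (eqs. 16.253-16.254); x has shape (m, d)."""
    mu = x.mean(axis=0)                      # mini-batch mean
    var = x.var(axis=0)                      # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize to zero mean, unit variance
    return gamma * x_hat + beta              # learnable scale and shift

x = np.random.default_rng(0).standard_normal((32, 4))
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))
```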
weight decay with knowledge distillation significantly improves model generalization. Ba et al. (2024) [615] investigated the interplay between data diversity and weight decay regularization in neural networks, introducing a theoretical framework that links weight decay with dataset variability and exploring its impact on the weight landscape. Li et al. (2024) [616] integrated L2 regularization (weight decay) with hybrid data augmentation strategies for audio signal processing, demonstrating its effectiveness in preventing overfitting in deep neural networks. Zang and Yan (2024) [617] presented a new attenuation-based weight decay regularization method for improving network robustness in high-dimensional data scenarios, introducing novel kernel-learning techniques combined with weight decay for enhanced performance.
where P is the true data-generating distribution. The discrepancy between Ltrain (θ) and Ltest (θ)
is a consequence of the model’s excessive capacity to fit noise in the training data, which can be
formalized using the Rademacher complexity RN (H) of the hypothesis space H. Specifically, the
Rademacher complexity is defined as
" N
#
1 X
RN (H) = ED,σ sup σi f (xi ; θ) (16.260)
f ∈H N i=1
where σi are Rademacher random variables. Overfitting occurs when RN (H) is large relative to
the sample size N , leading to a generalization gap
that grows with the complexity of H. Regularization addresses overfitting by introducing a penalty
term Ω(θ) into the empirical risk minimization framework, yielding the regularized objective
where λ > 0 is the regularization parameter. Weight decay, a specific form of regularization,
corresponds to the choice
\[
\Omega(\theta) = \frac{1}{2}\|\theta\|_2^2, \qquad (16.263)
\]
which imposes an L2 penalty on the model parameters. This penalty can be interpreted as a constraint on the parameter space, restricting the solution to a Euclidean ball whose radius C depends inversely on λ, as dictated by the Lagrange multiplier theorem. The regularized objective thus becomes
\[
L_{\mathrm{regularized}}(\theta) = L_{\mathrm{train}}(\theta) + \frac{\lambda}{2}\|\theta\|_2^2, \qquad (16.264)
\]
which is strongly convex if Ltrain (θ) is convex, ensuring a unique global minimum θ∗ . The optimiza-
tion dynamics of weight decay can be analyzed through the lens of gradient descent. The update
rule for gradient descent with learning rate η and weight decay is given by
\[
\theta_{t+1} = \theta_t - \eta\big( \nabla_\theta L_{\mathrm{train}}(\theta_t) + \lambda \theta_t \big) = (1 - \eta\lambda)\,\theta_t - \eta\, \nabla_\theta L_{\mathrm{train}}(\theta_t).
\]
Weight decay increases the bias by shrinking θ∗ toward zero but reduces the variance by constraining
the parameter space. This tradeoff can be quantified using the ridge regression estimator in the
linear model setting, where
θ∗ = (X ⊤ X + λI)−1 X ⊤ y (16.270)
The bias and variance of this estimator can be explicitly computed as
\[
\mathrm{Bias}(\theta^{*}) = \mathbb{E}[\theta^{*}] - \theta = -\lambda\, (X^{\top}X + \lambda I)^{-1} \theta \qquad (16.271)
\]
and
\[
\mathrm{Var}(\theta^{*}) = \sigma^2\, \operatorname{Tr}\!\big[ (X^{\top}X + \lambda I)^{-2} X^{\top}X \big], \qquad (16.272)
\]
where σ 2 is the noise variance. The theoretical foundations of weight decay can also be explored
through the lens of reproducing kernel Hilbert spaces (RKHS). In this framework, the regularization
term \frac{\lambda}{2}\|\theta\|_2^2 corresponds to the squared norm in an RKHS H, and the regularized solution is the minimizer of
\[
L_{\mathrm{regularized}}(f) = L_{\mathrm{train}}(f) + \frac{\lambda}{2}\|f\|_{\mathcal{H}}^2, \qquad (16.273)
\]
where ∥f ∥H is the norm in H. This connection reveals that weight decay is equivalent to Tikhonov
regularization in the RKHS setting, providing a unifying theoretical framework for understanding
regularization in both parametric and non-parametric models. In conclusion, weight decay is a
mathematically principled regularization technique that addresses overfitting by constraining the
hypothesis space and reducing the Rademacher complexity of the model. Its optimization dy-
namics, statistical properties, and connections to RKHS theory provide a rigorous foundation for
understanding its role in improving generalization performance. By carefully tuning the regular-
ization parameter λ, we can achieve an optimal balance between bias and variance, ensuring robust
and reliable model performance on unseen data.
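For the linear-model setting above, the estimator (16.270) and its bias and variance (16.271)-(16.272) can be computed directly; a minimal NumPy sketch with synthetic data follows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam, sigma2 = 200, 5, 1.0, 0.25
X = rng.standard_normal((n, p))
theta_true = rng.standard_normal(p)
y = X @ theta_true + np.sqrt(sigma2) * rng.standard_normal(n)

A = np.linalg.inv(X.T @ X + lam * np.eye(p))
theta_star = A @ X.T @ y                           # ridge estimator (eq. 16.270)
bias = -lam * A @ theta_true                       # E[theta*] - theta (eq. 16.271)
var_trace = sigma2 * np.trace(A @ A @ (X.T @ X))   # total variance (eq. 16.272)
print(theta_star.round(2), bias.round(3), var_trace.round(4))
```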
use of max-norm constraints with dropout, demonstrating that this combination prevents excessive weight growth and stabilizes training in deep neural networks. Moradi et al. (2020) [618] provided a comprehensive survey of regularization techniques, including max-norm constraints; the authors explore different forms of norm-based constraints (L1, L2, and max-norm), discussing their effects on weight magnitude, sparsity, and overfitting reduction, and compare these techniques across multiple neural network architectures. Rodríguez et al. (2016) [619] introduced a novel regularization technique that constrains local weight correlations in CNNs, reducing overfitting without sacrificing learning capacity; they demonstrate that max-norm constraints help prevent weights from growing too large, thus maintaining stability in deep convolutional networks. Tian and Zhang (2022) [620] surveyed different regularization strategies, with a special focus on norm constraints, extensively discussing the effectiveness of max-norm constraints in preventing overfitting in deep learning models and comparing them with weight decay and L1/L2 regularization. Cong et al. (2017) [621] developed a hybrid approach combining max-norm and low-rank constraints to handle overfitting in similarity learning tasks, proposing an online learning method that reduces model complexity while maintaining generalization performance. Salman and Liu (2019) [622] conducted an empirical study of how overfitting manifests in deep neural networks and proposed max-norm constraints as a key strategy to mitigate it; their results suggest that max-norm regularization improves generalization by limiting weight magnitudes. Wang et al. (2021) [623] explored benign overfitting, where models achieve perfect training accuracy but still generalize well; the authors investigate max-norm constraints as a form of implicit regularization and show that they help avoid harmful overfitting in high-dimensional settings. Poggio et al. (2017) [624] presented a theoretical framework explaining why deep networks often avoid overfitting despite having more parameters than data points, highlighting the role of max-norm constraints in controlling model complexity. Oyedotun et al. (2017) [625] discussed the consequences of overfitting in deep networks and compared various norm-based constraints (L1, L2, max-norm), advocating for max-norm regularization due to its computational efficiency and robustness in high-dimensional spaces. Luo et al. (2016) [626] proposed an improved extreme learning machine (ELM) model that integrates L1, L2, and max-norm constraints to enhance generalization performance, showing that max-norm regularization effectively prevents overfitting while maintaining model interpretability.
and it increases when the model complexity is too high relative to the number of training samples.
In statistical learning theory, this gap can be upper-bounded using the Vapnik-Chervonenkis (VC)
dimension VC(H) of the hypothesis class H, yielding the bound
\[
\mathbb{E}[L(w)] \le L_{\mathrm{empirical}}(w) + O\!\left( \sqrt{\frac{\mathrm{VC}(\mathcal{H})}{N}} \right). \qquad (16.278)
\]
This inequality suggests that models with high VC dimension have larger generalization gaps,
leading to overfitting. Another theoretical measure of complexity is the Rademacher complexity,
which quantifies the ability of a function class to fit random noise. If H has high Rademacher
complexity R(H), the generalization bound
\[
\mathbb{E}[L(w)] \le L_{\mathrm{empirical}}(w) + 2\,\mathcal{R}(\mathcal{H}) + O\!\left( \sqrt{\frac{\log(1/\delta)}{N}} \right) \qquad (16.279)
\]
indicates poor generalization. Regularization techniques aim to reduce the effective hypothesis
space, thereby improving generalization by controlling model complexity. One effective approach
to mitigating overfitting is the incorporation of a regularization term in the objective function. A
general regularized loss function takes the form
\[
L_{\mathrm{reg}}(w) = L(w) + \lambda\, \Omega(w). \qquad (16.280)
\]
A common choice is L2 regularization (ridge regression),
\[
\Omega(w) = \|w\|_2^2 = \sum_{j=1}^{d} w_j^2, \qquad (16.281)
\]
which shrinks large weight values but does not impose an explicit bound on their magnitude.
Similarly, L1 regularization (lasso regression),
\[
\Omega(w) = \|w\|_1 = \sum_{j=1}^{d} |w_j|, \qquad (16.282)
\]
promotes sparsity but does not constrain the overall norm. Max-norm regularization is a stricter
form of regularization that directly enforces an upper bound on the norm of the weight vector.
Specifically, it constrains the weight norm to satisfy
∥w∥2 ≤ c (16.283)
for some constant c. This constraint prevents the optimizer from selecting solutions where the
weight magnitudes grow excessively, thereby controlling model complexity more effectively than
L2 regularization. Instead of adding a penalty term to the loss function, max-norm regularization
enforces the constraint during optimization by projecting the weight vector onto the feasible set
whenever it exceeds the bound. Mathematically, this projection step is given by
\[
w \leftarrow \frac{c}{\max(\|w\|_2,\, c)}\; w. \qquad (16.284)
\]
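The projection (16.284) is a one-line operation; the following minimal sketch applies it to a weight vector that violates the constraint and to one that satisfies it.

```python
import numpy as np

def max_norm_project(w, c):
    """Project w back onto the ball ||w||_2 <= c (eq. 16.284)."""
    norm = np.linalg.norm(w)
    return w * (c / max(norm, c))   # identity whenever ||w||_2 <= c

w = np.array([3.0, 4.0])            # ||w||_2 = 5
print(max_norm_project(w, c=2.0))   # rescaled to norm 2
print(max_norm_project(w, c=10.0))  # unchanged
```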
From a geometric perspective, max-norm regularization restricts the hypothesis space to a Eu-
clidean ball of radius c centered at the origin. The restricted hypothesis space leads to a lower
VC dimension and reduced Rademacher complexity, improving generalization. The constrained
optimization problem can be reformulated using Lagrange multipliers, leading to the constrained
optimization problem
\[
\min_{w}\; L(w) \quad \text{subject to} \quad \|w\|_2 \le c. \qquad (16.285)
\]
two probability distributions: the source distribution Psource (x, y) and the target distribution
Ptarget (x, y), which govern the input-output relationship. The goal of transfer learning is to approx-
imate the optimal target hypothesis function f ∗ (x) by leveraging knowledge from the source
model fs (x), while minimizing the expected risk over the target distribution:
\[
R_{\mathrm{target}}(f) = \mathbb{E}_{(x,y)\sim P_{\mathrm{target}}}\big[ L(f(x), y) \big]. \qquad (16.289)
\]
Since P_target is unknown, we approximate R_target(f) using the empirical risk computed over a finite dataset D_target = \{(x_i, y_i)\}_{i=1}^{N}:
\[
\hat{R}_{\mathrm{target}}(f) = \frac{1}{N} \sum_{i=1}^{N} L\big(f(x_i), y_i\big). \qquad (16.290)
\]
A model that perfectly minimizes R̂target (f ) may lead to overfitting, wherein the function
f (x) aligns with noise in the training set instead of generalizing well to new data. The degree of
overfitting is measured by the generalization gap:
where λ is a hyperparameter balancing empirical risk minimization and model complexity. From
the perspective of functional analysis, we interpret regularization as imposing constraints on the
function space where f is chosen. In many cases, f is assumed to belong to a Reproducing
Kernel Hilbert Space (RKHS) HK associated with a kernel function K(x, x′ ). The RKHS
norm,
\[
\|f\|_{\mathcal{H}_K}^2 = \sum_{i,j} \alpha_i \alpha_j K(x_i, x_j), \qquad (16.294)
\]
acts as a smoothness regularizer that prevents excessive function oscillations. Alternatively, in the
Sobolev space W m,p (X ), regularization can take the form:
\[
\Omega(f) = \int_{\mathcal{X}} |D^m f(x)|^p \, dx, \qquad (16.295)
\]
where Dm f represents the mth weak derivative of f . The choice of m and p dictates the smoothness
constraints imposed on f , directly influencing its generalization ability. One of the most widely used
regularization techniques is L2 regularization or Tikhonov regularization, which penalizes the
Euclidean norm of the model parameters:
\[
\Omega(f) = \|\theta\|_2^2 = \sum_{i} \theta_i^2. \qquad (16.296)
\]
To understand the effect of L2 regularization, consider the Hessian matrix H = ∇2θ L, which
captures the local curvature of the loss landscape. The largest eigenvalue λmax determines the
sharpness of the loss minimum:
∥H∥2 = sup ∥Hv∥2 . (16.297)
∥v∥2 =1
\[
h^{*} = \arg\min_{h \in \mathcal{H}} L_{\mathrm{val}}\big(\theta^{*}(h); h\big), \quad \text{where} \quad \theta^{*}(h) = \arg\min_{\theta} L_{\mathrm{train}}(\theta; h). \qquad (16.298)
\]
Here, H denotes the hyperparameter space, which is often high-dimensional, non-convex, and
computationally expensive to traverse. The training loss function Ltrain (θ; h) is typically represented
as an empirical risk computed over the training dataset \{(x_i, y_i)\}_{i=1}^{N}:
\[
L_{\mathrm{train}}(\theta; h) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f(x_i; \theta, h), y_i\big), \qquad (16.299)
\]
where f (xi ; θ, h) is the neural network output given the input xi , parameters θ, and hyperparameters
h, and ℓ(a, b) is the loss function quantifying the discrepancy between prediction a and ground truth
b. For classification tasks, ℓ often takes the form of cross-entropy loss:
\[
\ell(a, b) = -\sum_{k=1}^{C} b_k \log a_k, \qquad (16.300)
\]
where C is the number of classes, and ak and bk are the predicted and true probabilities for the
k-th class, respectively. Central to the training process is the optimization of θ via gradient-based
methods such as stochastic gradient descent (SGD). The parameter updates are governed by:
\[
\theta_{t+1} = \theta_t - \eta\, \nabla_\theta L_{\mathrm{train}}(\theta_t; h), \qquad (16.301)
\]
where η > 0 is the learning rate, a critical hyperparameter controlling the step size. The stability
and convergence of SGD depend on η, which must satisfy:
\[
0 < \eta < \frac{2}{\lambda_{\max}(H)}, \qquad (16.302)
\]
where λmax (H) is the largest eigenvalue of the Hessian matrix H = ∇2θ Ltrain (θ; h). This condition
ensures that the gradient descent steps do not overshoot the minimum. To analyze convergence
behavior, the loss function Ltrain (θ; h) near a critical point θ∗ can be approximated via a second-
order Taylor expansion:
\[
L_{\mathrm{train}}(\theta; h) \approx L_{\mathrm{train}}(\theta^{*}; h) + \frac{1}{2}(\theta - \theta^{*})^{\top} H\, (\theta - \theta^{*}), \qquad (16.303)
\]
where H is the Hessian matrix of second derivatives. The eigenvalues of H reveal the local cur-
vature of the loss surface, with positive eigenvalues indicating directions of convexity and negative
eigenvalues corresponding to saddle points. Regularization is often introduced to improve gener-
alization by penalizing large parameter values. For L2 regularization, the modified training loss
is:
\[
L^{\mathrm{reg}}_{\mathrm{train}}(\theta; h) = L_{\mathrm{train}}(\theta; h) + \frac{\lambda}{2}\|\theta\|_2^2, \qquad (16.304)
\]
where λ > 0 is the regularization coefficient. The gradient of the regularized loss becomes:
\[
\nabla_\theta L^{\mathrm{reg}}_{\mathrm{train}}(\theta; h) = \nabla_\theta L_{\mathrm{train}}(\theta; h) + \lambda\theta. \qquad (16.305)
\]
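Combining the SGD update (16.301) with the regularized gradient (16.305) yields the familiar weight-decay step; a minimal sketch on a hypothetical quadratic loss follows, where the minimizer is shrunk toward the origin by a factor 1/(1 + λ).

```python
import numpy as np

def sgd_weight_decay_step(theta, grad_loss, eta=0.1, lam=1e-2):
    """One step of eq. (16.301) applied to the regularized gradient (16.305)."""
    return theta - eta * (grad_loss(theta) + lam * theta)

# Hypothetical quadratic training loss L(theta) = 0.5 * ||theta - target||^2.
target = np.array([1.0, -2.0])
grad = lambda th: th - target
theta = np.zeros(2)
for _ in range(200):
    theta = sgd_weight_decay_step(theta, grad)
print(theta)  # converges to target / (1 + lam), shrunk toward the origin
```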
Another key hyperparameter is the weight initialization strategy, which affects the scale of activa-
tions and gradients throughout the network. For a layer with nin inputs, He initialization samples
weights from:
\[
w_{ij} \sim \mathcal{N}\!\left( 0, \frac{2}{n_{\mathrm{in}}} \right), \qquad (16.306)
\]
to ensure that the variance of activations remains stable as data propagate through layers. The
activation function g(z) also plays a crucial role. The Rectified Linear Unit (ReLU), defined as
g(z) = max(0, z), introduces sparsity and mitigates vanishing gradients. However, it suffers from
the "dying neuron" problem, as its derivative g′(z) is zero for z ≤ 0. The search for optimal hyper-
parameters can be approached using grid search, random search, or more advanced methods like
Bayesian optimization. In Bayesian optimization, a surrogate model p(Lval (h)), often a Gaussian
Process (GP), is constructed to approximate the validation loss. The acquisition function a(h), such
as Expected Improvement (EI), guides the exploration of H by balancing exploitation of regions
with low predicted loss and exploration of uncertain regions:
\[
a_{\mathrm{EI}}(h) = \mathbb{E}\big[ \max\big( 0,\; L_{\mathrm{val,min}} - L_{\mathrm{val}}(h) \big) \big], \qquad (16.307)
\]
where L_val,min is the best observed validation loss. Hyperparameter tuning is computationally
intensive due to the high dimensionality of H and the nested nature of the optimization problem.
Early stopping, a widely used strategy, halts training when the improvement in validation loss falls
below a threshold:
\[
\frac{\big| L_{\mathrm{val}}^{(t+1)} - L_{\mathrm{val}}^{(t)} \big|}{L_{\mathrm{val}}^{(t)}} < \epsilon, \qquad (16.308)
\]
where ϵ > 0 is a small constant. Advanced techniques like Hyperband leverage multi-fidelity op-
timization, allocating resources dynamically to promising hyperparameter configurations based on
partial training evaluations.
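The stopping criterion (16.308) amounts to a relative-improvement test on the validation-loss history; a minimal sketch follows, with the threshold ϵ a hypothetical value.

```python
def should_stop(val_losses, eps=1e-3):
    """Relative-improvement criterion of eq. (16.308)."""
    if len(val_losses) < 2:
        return False
    prev, curr = val_losses[-2], val_losses[-1]
    return abs(curr - prev) / prev < eps

print(should_stop([0.50, 0.40]))       # large improvement: keep training
print(should_stop([0.401, 0.4008]))    # stagnation: stop
```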
Extreme Gradient Boosting (XGBoost) models in predicting post-surgical complications. The results suggest that hyperparameter tuning significantly improves predictive performance, with Grid Search leading to the best model stability and interpretability. Lázaro et al. (2025) [403] implemented Grid Search and Bayesian Optimization to optimize K-Nearest Neighbors (KNN) and Decision Trees for incident classification in aviation safety; the research underscores how different hyperparameter tuning methods affect the generalization of machine learning models in NLP-based accident reports. Li et al. (2025) [404] proposed RAINER, an ensemble learning model that integrates Grid Search for optimal hyperparameter tuning, demonstrating how parameter optimization enhances the predictive capabilities of rainfall models and making Grid Search an essential step in climate modeling. Khurshid et al. (2025) [405] compared Bayesian Optimization with Grid Search for hyperparameter tuning in diabetes prediction models, finding that while Bayesian methods are computationally faster, Grid Search delivers more precise hyperparameter selection, especially for models with structured medical data. Kanwar et al. (2025) [406] applied Grid Search for tuning Random Forest classifiers in landslide susceptibility mapping, demonstrating that fine-tuned models improve the identification of high-risk zones and reduce false positives in predictive landslide models. Fadil et al. (2025) [407] evaluated the role of Grid Search and Random Search in hyperparameter tuning for XGBoost regression models in corrosion prediction, finding that Grid Search-based models achieve higher R² scores, making them well suited for complex chemical modeling applications.
S = H1 × H2 × · · · × Hp . (16.309)
For example, if we have two hyperparameters h1 and h2 with 3 possible values each, the total
number of combinations to explore is 9. This search space grows exponentially as the number
of hyperparameters increases, posing a significant computational challenge. Grid search involves
iterating over all configurations in S, evaluating the model’s performance for each configuration.
Let us define the performance metric M (⃗h, Dtrain , Dval ), which quantifies the model’s performance
for a given hyperparameter configuration ⃗h, where Dtrain and Dval are the training and validation
datasets, respectively. This metric might represent accuracy, error rate, F1-score, or any other
relevant criterion, depending on the problem at hand. The hyperparameters are then tuned by
maximizing or minimizing M across the search space:
\[
\vec{h}^{*} = \arg\max_{\vec{h} \in S} M\big(\vec{h}, D_{\mathrm{train}}, D_{\mathrm{val}}\big), \qquad (16.311)
\]
For each hyperparameter combination, the model is trained on Dtrain and evaluated on Dval . The
process requires the repeated evaluation of the model over all |S| configurations, each yielding a
performance metric. To mitigate overfitting and ensure the reliability of the performance metric,
cross-validation is frequently used. In k-fold cross-validation, the dataset D_train is partitioned into k disjoint subsets D_1, D_2, \ldots, D_k. The model is trained on D_{\mathrm{train}}^{(j)} = \bigcup_{i \neq j} D_i and validated on D_j. For each fold j, we compute the performance metric:
\[
M_j(\vec{h}) = M\big(\vec{h}, D_{\mathrm{train}}^{(j)}, D_j\big). \qquad (16.313)
\]
The overall cross-validation performance for a hyperparameter configuration ⃗h is the average of the
k individual fold performances:
\[
\bar{M}(\vec{h}) = \frac{1}{k} \sum_{j=1}^{k} M_j(\vec{h}). \qquad (16.314)
\]
Thus, the grid search with cross-validation aims to find the optimal hyperparameters by maximizing
or minimizing the average performance across all folds. The computational complexity of grid search
is a key consideration. If we denote C as the cost of training and evaluating the model for a single
configuration, the total cost for grid search is:
\[
O\!\left( \prod_{i=1}^{p} m_i \cdot k \cdot C \right), \qquad (16.315)
\]
where k represents the number of folds in cross-validation. This results in an exponential increase in
the total computation time as the number of hyperparameters p and the number of candidate values
mi increase. For large search spaces, grid search can become computationally expensive, making it
infeasible for high-dimensional hyperparameter optimization problems. To illustrate with a specific
example, consider two hyperparameters h1 and h2 with the following sets of candidate values:
\[
\mathcal{H}_1 = \{0.01,\, 0.1,\, 1.0\}, \qquad \mathcal{H}_2 = \{0.1,\, 1.0,\, 10.0\}, \qquad (16.316)
\]
so that
\[
S = \mathcal{H}_1 \times \mathcal{H}_2 = \{(0.01, 0.1),\, (0.01, 1.0),\, (0.01, 10.0),\, (0.1, 0.1),\, \ldots,\, (1.0, 10.0)\}. \qquad (16.317)
\]
There are 9 configurations to evaluate. For each configuration, assume we perform 3-fold cross-validation; for the configuration (h1, h2) = (0.1, 1.0), the per-fold performance metrics are:
\[
M_1(0.1, 1.0) = 0.85, \quad M_2(0.1, 1.0) = 0.87, \quad M_3(0.1, 1.0) = 0.86, \qquad (16.318)
\]
giving the cross-validation average \bar{M}(0.1, 1.0) = (0.85 + 0.87 + 0.86)/3 = 0.86.
This process is repeated for all 9 combinations of h1 and h2 . Grid search, while exhaustive and
deterministic, can fail to efficiently explore the hyperparameter space, especially when the num-
ber of hyperparameters is large. The search is confined to a discrete grid and cannot interpolate
between points to capture optimal configurations that may lie between grid values. Furthermore,
because grid search evaluates each configuration independently, it can be computationally expen-
sive for high-dimensional spaces, as the number of configurations grows exponentially with p and mi .
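The full procedure — enumerate S = H1 × H2, score each configuration by the k-fold average (16.314), and keep the argmax — fits in a few lines; the following sketch uses the candidate sets of the example above, with a hypothetical evaluate function standing in for model training.

```python
import itertools
import numpy as np

H1 = [0.01, 0.1, 1.0]        # candidate values for h1 (e.g., learning rate)
H2 = [0.1, 1.0, 10.0]        # candidate values for h2 (e.g., regularization)

def evaluate(h, train, val):
    # Placeholder metric; in practice, fit a model on `train` and score it on `val`.
    h1, h2 = h
    return -((np.log10(h1) + 1) ** 2 + np.log10(h2) ** 2)  # peaks at h = (0.1, 1.0)

def cv_score(h, folds):
    """Average of the per-fold metrics M_j (eq. 16.314)."""
    return np.mean([evaluate(h, train, val) for train, val in folds])

folds = [(None, None)] * 3   # 3-fold split; data handling elided in this sketch
best = max(itertools.product(H1, H2), key=lambda h: cv_score(h, folds))
print(best)                  # (0.1, 1.0)
```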
respective candidate values, which can limit its applicability for large-scale problems. As a re-
sult, more advanced techniques such as random search, Bayesian optimization, or evolutionary
algorithms are often used for hyperparameter tuning when the computational budget is limited.
Despite these challenges, grid search remains a powerful tool for demonstrating the principles of
hyperparameter tuning and is well-suited for problems with relatively small search spaces. The pros of grid search are its simplicity, its exhaustive and deterministic coverage of the search space, and its reproducibility.
where L(h) is the loss function that quantifies how well the model generalizes to unseen data. The
minimization of this function is often subject to constraints on the range or type of values that
each hi can take, forming a constrained optimization problem:
where H represents the feasible hyperparameter space. Hyperparameter tuning is typically carried
out by selecting a search method that explores this space efficiently, with the goal of finding the
global or local optimum of the loss function.
One such search method is random search, which is a straightforward yet effective approach
to exploring the hyperparameter space. Instead of exhaustively searching over a grid of val-
ues for each hyperparameter (as in grid search), random search samples hyperparameters ht =
(ht,1 , ht,2 , . . . , ht,d ) from a predefined distribution for each hyperparameter hi . For each iteration
t, the hyperparameters are independently sampled from probability distributions Di associated
with each hyperparameter hi , where the probability distribution might be continuous or discrete.
Specifically, for continuous hyperparameters, ht,i is drawn from a uniform or normal distribution
over an interval Hi = [ai , bi ]:
ht,i ∼ U(ai , bi ), ht,i ∈ Hi , (16.322)
where U(ai , bi ) denotes the uniform distribution between ai and bi . For discrete hyperparameters,
ht,i is sampled from a discrete set of values Hi = {hi1, hi2, . . . , hiNi} with each value equally probable:
\[
h_{t,i} \sim \mathcal{D}_i, \qquad P\big( h_{t,i} = h_{ij} \big) = \frac{1}{N_i}, \quad j = 1, \ldots, N_i, \qquad (16.323)
\]
where Di denotes the discrete distribution over the set {hi1, hi2, . . . , hiNi}. Thus, each hyperpa-
rameter is selected independently from its corresponding distribution. After selecting a new set of
hyperparameters ht , the model is trained with this configuration, and its performance is evaluated
by computing the loss function L(ht ). The process is repeated for T iterations, generating a se-
quence of hyperparameter configurations h1 , h2 , . . . , hT , and for each configuration, the associated
loss function values L(h1 ), L(h2 ), . . . , L(hT ) are computed. The optimal set of hyperparameters h∗
is then selected as the one that minimizes the loss:
\[
h^{*} = \arg\min_{t \in \{1, \ldots, T\}} L(h_t). \qquad (16.324)
\]
Thus, random search performs an approximate optimization of the hyperparameter space, where
the computational cost per iteration is C (the time to evaluate the model’s performance for a
given set of hyperparameters), and the total computational cost is O(T · C). This makes random
search a computationally feasible approach, especially when T is moderate. The computational
efficiency of random search can be compared to that of grid search, which exhaustively searches the
hyperparameter space by discretizing each hyperparameter hi into a set of values hi1 , hi2 , . . . , hini ,
where ni is the number of values for the i-th hyperparameter. The total number of grid search
configurations is given by:
\[
N_{\mathrm{grid}} = \prod_{i=1}^{d} n_i, \qquad (16.325)
\]
and the computational cost of grid search is O(Ngrid ·C), which grows exponentially with the number
of hyperparameters d. In this sense, grid search can become prohibitively expensive when the
dimensionality d of the hyperparameter space is large. Random search, on the other hand, requires
only T evaluations, and since each evaluation is independent of the others, the computational
cost grows linearly with T , making it more efficient when d is large. The probabilistic nature of
random search further enhances its efficiency. Suppose that only a subset of hyperparameters,
say k, significantly influences the model’s performance. Let S be the subspace of H consisting of
hyperparameter configurations that produce low loss values, and let the complementary space H\S
correspond to configurations that are unlikely to achieve low loss. In this case, the task becomes one
of searching within the subspace S, rather than the entire space H. The random search method is
well-suited to such problems, as it can probabilistically focus on the relevant subspace by drawing
hyperparameter values from distributions Di that prioritize areas of the hyperparameter space
with low loss. More formally, the probability of selecting a hyperparameter set ht from the relevant
subspace S is given by:
\[
P(h_t \in S) = \prod_{i=1}^{d} P\big( h_{t,i} \in S_i \big), \qquad (16.326)
\]
where Si is the relevant region for the i-th hyperparameter, and P (ht,i ∈ Si ) is the probability that
the i-th hyperparameter lies within the relevant region. As the number of iterations T increases, the
probability that random search selects at least one hyperparameter set ht ∈ S increases as well, approaching 1 as T → ∞:
\[
P\big( \exists\, t \le T : h_t \in S \big) = 1 - (1 - P_0)^{T}, \qquad (16.327)
\]
where P0 is the probability of sampling a hyperparameter set from the relevant subspace in one
iteration. Thus, random search tends to explore the subspace of low-loss configurations, improving
the chances of finding an optimal or near-optimal configuration as T increases.
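A minimal sketch of the procedure (16.322)-(16.324) follows: hyperparameters are drawn independently from their distributions, and the best of T configurations is retained; the loss function here is a hypothetical stand-in for training and validating a model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_config():
    # Continuous hyperparameter: log-uniform learning rate on [1e-4, 1e-1] (eq. 16.322).
    lr = 10 ** rng.uniform(-4, -1)
    # Discrete hyperparameter: batch size, each value equally probable (eq. 16.323).
    batch = rng.choice([16, 32, 64, 128])
    return {"lr": lr, "batch": batch}

def loss(h):
    # Hypothetical validation loss; in practice this trains and evaluates a model.
    return (np.log10(h["lr"]) + 2) ** 2 + 0.01 * abs(h["batch"] - 64)

T = 50
trials = [sample_config() for _ in range(T)]
best = min(trials, key=loss)   # h* = argmin_t L(h_t)  (eq. 16.324)
print(best)
```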
The exploration behavior of random search contrasts with that of grid search, which, despite its
systematic nature, may fail to efficiently explore sparsely populated regions of the hyperparam-
eter space. When the hyperparameter space is high-dimensional, the grid search must evaluate
exponentially many configurations, regardless of the relevance of the hyperparameters. This leads
to inefficiencies when only a small fraction of hyperparameters significantly contribute to the loss
function. Random search, by sampling independently and uniformly across the entire space, is
not subject to this curse of dimensionality and can more effectively locate regions that matter for
model performance. Mathematically, random search has an additional advantage when the hyper-
parameters exhibit smooth or continuous relationships with the loss function. In this case, random
search can probe the space probabilistically, discovering gradients of loss that grid search, due to
its fixed grid structure, may miss. Furthermore, random search is capable of finding the optimum
even when the loss function is non-convex, provided that the space is explored adequately. This
becomes particularly relevant in the presence of highly irregular loss surfaces, as random search
has the potential to escape local minima more effectively than grid search, which is constrained by
its fixed sampling grid.
In conclusion, random search is a highly efficient and scalable approach for hyperparameter op-
timization in machine learning. By sampling hyperparameters from predefined probability dis-
tributions and evaluating the associated loss function, random search provides a computationally
feasible method for high-dimensional hyperparameter spaces, outperforming grid search in many
cases. Its probabilistic nature allows it to focus on relevant regions of the hyperparameter space.
1. More efficient than grid search, especially when some hyperparameters are less important.
Chang et al. (2025) [415] applied Bayesian Optimization (BO) for hyperparameter tuning in machine learning models used for predicting landslide displacement, exploring the impact of BO in optimizing Support Vector Machines (SVM), Long Short-Term Memory (LSTM), and Gated Recurrent Units (GRU) and demonstrating how Bayesian techniques improve model accuracy and convergence rates. Cihan (2025) [416] used Bayesian Optimization to fine-tune XGBoost, LightGBM, Elastic Net, and Adaptive Boosting models for predicting biomass gasification output, finding that Bayesian Optimization outperforms Grid and Random Search in reducing computational overhead while improving predictive accuracy. Makomere et al. (2025) [417] integrated Bayesian Optimization for hyperparameter tuning in deep learning-based industrial process modeling, providing insights into how BO improves model generalization and reduces prediction errors in chemical process monitoring. Bakır (2025) [418] introduced TuneDroid, an automated Bayesian Optimization-based framework for hyperparameter tuning of Convolutional Neural Networks (CNNs) used in cybersecurity; the results suggest that Bayesian Optimization accelerates model training while improving malware detection accuracy. Khurshid et al. (2025) [405] compared Bayesian Optimization and Random Search for tuning hyperparameters in XGBoost-based diabetes prediction models, concluding that Bayesian Optimization provides a superior trade-off between speed and accuracy compared to traditional search methods. Liu et al. (2025) [419] explored Bayesian Optimization's ability to fine-tune deep learning models for predicting acoustic performance in engineering systems, demonstrating how Bayesian methods improve prediction accuracy while reducing computational costs. Balcan et al. (2025) [412] provided a rigorous analysis of the sample complexity required for Bayesian Optimization in deep learning, showing that Bayesian Optimization requires fewer samples to converge to optimal solutions than other hyperparameter tuning techniques. Ma et al. (2025) [420] integrated Bayesian Optimization with Support Vector Machines (SVMs) for anomaly detection in high-speed machining, finding that Bayesian Optimization allows more effective exploration of hyperparameter spaces and leads to improved model reliability. Bouzaidi et al. (2025) [421] explored the impact of Bayesian Optimization on CNN-based models for image classification, demonstrating how Bayesian techniques outperform traditional methods like Grid Search in transfer learning scenarios. Mustapha et al. (2025) [422] integrated Bayesian Optimization for tuning a hybrid deep learning framework combining Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for pneumonia detection; the results confirm that Bayesian Optimization enhances the efficiency of multi-model architectures in medical imaging.
where m = [m(x1 ), m(x2 ), . . . , m(xn )]⊤ is the mean vector and K is the covariance matrix whose
entries are defined by a covariance (or kernel) function k(x, x′ ), which encodes assumptions about
the smoothness and periodicity of the objective function. The kernel function plays a crucial role
in determining the properties of the Gaussian Process. A commonly used kernel is the Squared
Exponential (SE) kernel, which is defined as:
\[
k(x, x') = \sigma_f^2 \exp\!\left( -\frac{\|x - x'\|^2}{2\ell^2} \right), \qquad (16.330)
\]
where σf2 is the variance, which scales the function values, and ℓ is the length scale, which controls
the smoothness of the function by dictating how quickly the function values can change with respect
to the inputs. Once the Gaussian Process has been specified, Bayesian Optimization proceeds by
updating the posterior distribution over the objective function after each new evaluation. Given a
set of n observed pairs {(xi , yi )}ni=1 , where yi = f (xi )+ϵi and ϵi ∼ N (0, σ 2 ) represents observational
noise, we update the posterior of the GP to reflect the observed data. The posterior mean µ(x∗ )
and variance σ 2 (x∗ ) at a new point x∗ are given by the following equations:
\[
\mu(x_*) = \mathbf{k}_*^{\top} (K + \sigma^2 I)^{-1} \mathbf{y}, \qquad (16.331)
\]
\[
\sigma^2(x_*) = k(x_*, x_*) - \mathbf{k}_*^{\top} (K + \sigma^2 I)^{-1} \mathbf{k}_*, \qquad (16.332)
\]
where k∗ is the vector of covariances between the test point x∗ and the observed points x1 , x2 , . . . , xn ,
and K is the covariance matrix of the observed points. The updated mean µ(x∗ ) provides the
model’s best guess for the value of the function at x∗ , and σ 2 (x∗ ) quantifies the uncertainty asso-
ciated with this estimate.
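A minimal NumPy sketch of these updates follows, implementing the SE kernel (16.330) and the noisy-observation versions of (16.331)-(16.332); the data points are hypothetical.

```python
import numpy as np

def se_kernel(A, B, sigma_f=1.0, ell=0.5):
    """Squared Exponential kernel (eq. 16.330) between row-sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sigma_f ** 2 * np.exp(-d2 / (2 * ell ** 2))

def gp_posterior(X, y, X_star, noise=1e-2):
    """Posterior mean and variance (eqs. 16.331-16.332) with observational noise."""
    K = se_kernel(X, X) + noise * np.eye(len(X))
    k_star = se_kernel(X, X_star)
    mu = k_star.T @ np.linalg.solve(K, y)
    var = np.diag(se_kernel(X_star, X_star)) - np.einsum(
        "ij,ij->j", k_star, np.linalg.solve(K, k_star))
    return mu, var

X = np.array([[0.1], [0.4], [0.9]]); y = np.array([0.8, 0.2, 0.5])
mu, var = gp_posterior(X, y, np.array([[0.5]]))
print(mu, var)   # best guess and uncertainty at x* = 0.5
```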
In Bayesian Optimization, the central objective is to select the next hyperparameter setting x∗
to evaluate in such a way that the number of function evaluations is minimized while still making
progress toward the global optimum. This is achieved by optimizing an acquisition function. The
acquisition function α(x) represents a trade-off between exploiting regions of the input space where
the objective function is expected to be low and exploring regions where the model’s uncertainty is
high. Several acquisition functions have been proposed, including Expected Improvement (EI),
Probability of Improvement (PI), and Upper Confidence Bound (UCB). The Expected
Improvement (EI) acquisition function is one of the most widely used and is defined as:
\[
\mathrm{EI}(x) = \big( f_{\mathrm{best}} - \mu(x) \big)\, \Phi\!\left( \frac{f_{\mathrm{best}} - \mu(x)}{\sigma(x)} \right) + \sigma(x)\, \phi\!\left( \frac{f_{\mathrm{best}} - \mu(x)}{\sigma(x)} \right), \qquad (16.333)
\]
where fbest is the best observed value of the objective function, Φ(·) and ϕ(·) are the cumulative
distribution and probability density functions of the standard normal distribution, respectively,
and σ(x) is the standard deviation at x. The first term measures the potential for improvement,
weighted by the probability of achieving that improvement, and the second term reflects the uncer-
tainty at x, encouraging exploration in uncertain regions. The acquisition function is maximized
at each iteration to select the next point x∗:
\[
x_* = \arg\max_{x} \alpha(x), \quad \text{e.g., with} \quad \alpha_{\mathrm{UCB}}(x) = \mu(x) + \kappa\, \sigma(x),
\]
where κ is a hyperparameter that controls the trade-off between exploration (κ large) and exploita-
tion (κ small). After selecting x∗ , the function is evaluated at this point, and the observed value
y∗ = f (x∗ ) is used to update the posterior distribution of the Gaussian Process. This process is
repeated iteratively, and each new observation refines the model’s understanding of the objective
function, guiding the search for the optimal x∗ . One of the primary advantages of Bayesian Opti-
mization is its ability to efficiently optimize expensive-to-evaluate functions by focusing the search
on the most promising regions of the input space. However, as the number of observations in-
creases, the computational complexity of maintaining the Gaussian Process model grows cubically
with respect to the number of points, due to the need to invert the covariance matrix K. This
cubic complexity, O(n3 ), can be prohibitive for large datasets. To mitigate this, techniques such as
sparse Gaussian Processes have been developed, which approximate the full covariance matrix
by using a smaller set of inducing points, thus reducing the computational cost while maintaining
the flexibility of the Gaussian Process model.
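For a minimization problem, the EI formula (16.333) can be evaluated with nothing more than the standard normal CDF and PDF; a minimal sketch follows, using math.erf rather than an external statistics library.

```python
import math

def expected_improvement(mu, sigma, f_best):
    """EI for minimization (eq. 16.333); mu and sigma come from the GP posterior."""
    if sigma <= 0.0:
        return 0.0
    z = (f_best - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))         # standard normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # standard normal PDF
    return (f_best - mu) * Phi + sigma * phi

print(expected_improvement(mu=0.30, sigma=0.10, f_best=0.35))  # improvement likely
print(expected_improvement(mu=0.50, sigma=0.01, f_best=0.35))  # nearly zero
```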
2. Balances exploration (trying new regions) and exploitation (focusing on promising regions).
where f : Λ → R is an objective function, typically the validation loss of a machine learning model.
This function is often non-convex, non-differentiable, high-dimensional, and stochastic,
which makes conventional gradient-based methods inapplicable. Moreover, the search space Λ may
consist of both continuous and discrete hyperparameters, further complicating the problem. Given
the computational complexity of exhaustive search methods such as grid search and the ineffi-
ciency of purely random search methods, Genetic Algorithms (GAs) provide a heuristic but
powerful optimization framework inspired by principles of natural evolution.
\[
P(\lambda_i) = \frac{e^{-\beta F_i}}{\sum_{j=1}^{N} e^{-\beta F_j}}. \qquad (16.341)
\]
Here, β > 0 controls the selection pressure, determining how much preference is given to high-
performing individuals. If β is too high, selection is overly greedy and can lead to premature
convergence; if too low, selection becomes nearly random, reducing the convergence rate. This
selection process ensures that better-performing hyperparameter configurations have a higher prob-
ability of propagating to the next generation while still allowing some stochastic diversity.
After selection, the next step is Crossover, also known as recombination, which involves combin-
ing the genetic information of two parents to produce offspring. Mathematically, given two parent
hyperparameter vectors λA and λB, a child λC is generated via a convex combination:
\[
\lambda_C = \alpha\, \lambda_A + (1 - \alpha)\, \lambda_B, \qquad \alpha \sim \mathcal{U}(0, 1). \qquad (16.342)
\]
This is known as blend crossover, which ensures a smooth interpolation between parent solu-
tions. Other crossover techniques include one-point crossover, where a random split point k is
chosen and the first k components come from one parent while the remaining components come
from the other parent. The use of crossover ensures that useful information is inherited from
multiple parents, promoting efficient exploration of the search space. To maintain diversity and
prevent premature convergence, Mutation is applied, introducing small random perturbations to
the offspring. Mathematically, this can be expressed as
\[
\lambda_j^{\mathrm{new}} = \lambda_j + \delta, \qquad \delta \sim \mathcal{N}(0, \sigma^2), \qquad (16.343)
\]
where σ controls the mutation step size. In adaptive genetic algorithms, σ decreases over time:
σt = σ0 e−γt (16.344)
for some decay rate γ > 0, implementing annealing-based exploration, which helps refine
solutions as the algorithm progresses. The convergence behavior of Genetic Algorithms can be
analyzed through the expected fitness improvement formula:
where η is a learning rate influenced by the mutation rate µ. This follows a Lyapunov stabil-
ity argument, implying eventual convergence under bounded variance conditions. Additionally,
Genetic Algorithms operate as a Markov Chain, satisfying:
\[
P\big( \mathcal{P}(t+1) \mid \mathcal{P}(t), \mathcal{P}(t-1), \ldots, \mathcal{P}(0) \big) = P\big( \mathcal{P}(t+1) \mid \mathcal{P}(t) \big), \qquad (16.346)
\]
where \mathcal{P}(t) denotes the population at generation t. Thus, GAs approximate a randomized hill-climbing process with enforced diversity, ensuring
a good tradeoff between exploration and exploitation. Genetic Algorithms offer significant
advantages over traditional hyperparameter tuning methods. Grid Search, which evaluates all
combinations exhaustively, suffers from exponential complexity O(k^n) for n hyperparameters with
k values each. Random Search, though more efficient, lacks any adaptation to previous eval-
uations. GAs, in contrast, leverage historical information and evolutionary dynamics to
efficiently search the space while maintaining diversity.
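A compact sketch of the full loop — Boltzmann selection (16.341), blend crossover (16.342), and annealed Gaussian mutation (16.343)-(16.344) — follows; the two-dimensional fitness function is a hypothetical stand-in for a validation loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(lam):
    # Hypothetical validation loss over two continuous hyperparameters.
    return ((lam - np.array([0.3, 0.7])) ** 2).sum()

N, beta, sigma0, gamma = 20, 5.0, 0.2, 0.05
pop = rng.uniform(0, 1, size=(N, 2))
for t in range(100):
    F = np.array([fitness(ind) for ind in pop])
    # Boltzmann selection probabilities (eq. 16.341); lower loss => higher weight.
    p = np.exp(-beta * F); p /= p.sum()
    parents = pop[rng.choice(N, size=(N, 2), p=p)]
    alpha = rng.uniform(0, 1, size=(N, 1))
    children = alpha * parents[:, 0] + (1 - alpha) * parents[:, 1]  # blend crossover (16.342)
    sigma_t = sigma0 * np.exp(-gamma * t)                           # annealed step size (16.344)
    pop = children + rng.normal(0, sigma_t, size=children.shape)    # mutation (16.343)

print(pop[np.argmin([fitness(ind) for ind in pop])])  # near (0.3, 0.7)
```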
16.5.7 Hyperband
16.5.7.1 Literature Review of Hyperband
Li et al. (2018) [487] introduced the HyperBand algorithm, providing a theoretical foundation for HyperBand and demonstrating its efficiency in hyperparameter optimization by dynamically allocating resources to promising configurations; the authors rigorously analyze its performance against traditional methods like random search and Bayesian optimization, proving its superiority in terms of speed and scalability. Falkner et al. (2018) [488] combined Bayesian Optimization (BO) with HyperBand (HB) to create BOHB, a hybrid method that leverages the strengths of both approaches; it introduces a robust and scalable framework for hyperparameter tuning, particularly effective for large-scale machine learning tasks, and the authors provide extensive empirical evaluations demonstrating BOHB's efficiency and robustness. Li et al. (2020) [489] extended HyperBand to a distributed computing environment, enabling massively parallel hyperparameter tuning; the authors introduce a system architecture that scales HyperBand to thousands of workers, making it practical for large-scale industrial applications, and provide insights into the trade-offs between resource allocation and optimization performance. While not exclusively about HyperBand, the paper by Snoek et al. (2012) [490] laid the groundwork for understanding Bayesian optimization, which is often compared to HyperBand; it provides a comprehensive framework for hyperparameter tuning, useful for understanding the context in which HyperBand operates and its advantages over Bayesian methods. Slivkins et al. (2024) [491] provided a thorough theoretical foundation for multi-armed bandit algorithms, which are the basis for HyperBand, explaining the principles of resource allocation and exploration-exploitation trade-offs and offering a deeper understanding of how HyperBand achieves efficient hyperparameter optimization. Hazan et al. (2018) [492] explored spectral methods for hyperparameter optimization, providing a theoretical perspective that complements HyperBand's empirical approach and discussing the limitations of traditional methods as well as the advantages of bandit-based approaches like HyperBand. Domhan et al. (2015) [493] introduced the concept of learning curve extrapolation, a key component of HyperBand's success, demonstrating how early stopping and resource allocation can be optimized by predicting the performance of hyperparameter configurations, a technique that HyperBand later formalizes and extends. Agrawal (2021) [494] provided a comprehensive overview of hyperparameter optimization techniques, including a detailed chapter on HyperBand that explains the algorithm's mechanics, its advantages over other methods, and practical implementation tips; the book is particularly useful for practitioners looking to apply HyperBand in real-world scenarios. Shekhar et al. (2021) [495] compared various hyperparameter optimization tools, including HyperBand, Bayesian optimization, and random search, providing empirical evidence of HyperBand's efficiency and scalability, particularly for large datasets and complex models, and discussing the trade-offs between different methods. Bergstra et al. (2011) [496] discussed the challenges of hyperparameter optimization in neural networks and introduced early methods for addressing them; while it predates HyperBand, it provides valuable context for understanding the evolution of hyperparameter optimization techniques and the need for more efficient methods like HyperBand.
• Evaluating L(λ) with a budget b (e.g., number of epochs, dataset size) yields an approximation
L(λ, b), where L(λ, b) → L(λ) as b → R, and R is the maximum budget.
HyperBand relies on the following assumptions for its theoretical guarantees: For any λ, L(λ, b) is
non-increasing in b. That is, increasing the budget improves performance:
\[
L(\lambda, b') \le L(\lambda, b) \quad \text{for all } b' \ge b. \qquad (16.347)
\]
The maximum budget R is finite, and L(λ, R) = L(λ). There exists a unique optimal configuration
λ∗ ∈ Λ such that:
L(λ∗ ) ≤ L(λ), ∀λ ∈ Λ. (16.348)
Successive Halving is the building block of the HyperBand method: HyperBand generalizes the Successive Halving (SH) algorithm. SH operates as follows:
1. Sample n configurations uniformly at random and evaluate each with the initial budget b.
2. Keep the top ⌊n/η⌋ configurations and discard the rest.
3. Increase the budget by a factor of η and repeat until one configuration remains.
Here, smax = ⌊logη (R)⌋ is the number of brackets. We have to run the Inner Loop (Successive
Halving) with n configurations and initial budget b. For Inner Loop (Successive Halving), we have
to first randomly sample n configurations λ1, . . . , λn. For each round i ∈ {0, 1, . . . , s}, we evaluate the n_i = ⌊n/η^i⌋ surviving configurations with budget b_i = b · η^i and keep the top ⌊n_i/η⌋ of them. Return the best configuration from the final round. HyperBand's efficiency stems from its ability
to explore multiple resource allocation strategies. Below, we analyze its properties rigorously. The
total cost of HyperBand is the sum of costs across all brackets:
\[
C_{\mathrm{HB}} = \sum_{s=0}^{s_{\max}} C_{\mathrm{SH}}(s),
\]
where C_SH(s) is the cost of Successive Halving in bracket s. HyperBand balances exploration and
exploitation by varying s: in bracket s, it runs SH with
\[
n = \left\lceil \frac{s_{\max} + 1}{s + 1}\, \eta^{s} \right\rceil \quad \text{configurations and initial budget} \quad b = \frac{R}{\eta^{s}}.
\]
This ensures that HyperBand does not prematurely discard potentially optimal configurations.
Under the assumptions of monotonicity and finite budget, HyperBand achieves the following:
• Logarithmic Scaling: The total cost CHB scales logarithmically with the number of con-
figurations.
We sketch a proof of HyperBand’s efficiency under the given assumptions. By monotonicity, the
ranking of configurations improves as the budget increases. Thus, the top configurations in early
rounds are likely to include λ∗ . The cost of each bracket s is:
\[
C_{\mathrm{SH}}(s) = \sum_{i=0}^{s} n_i \cdot b_i = \sum_{i=0}^{s} \frac{n}{\eta^{i}} \cdot b\, \eta^{i} = n \cdot b \cdot (s + 1). \qquad (16.352)
\]
Since smax = ⌊logη (R)⌋, the cost scales logarithmically with R. There are some impressive practical
implications of the HyperBand method: its theoretical guarantees make it highly effective for large-scale tuning problems in which individual evaluations are expensive and early, low-budget measurements are informative.
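A minimal sketch of the Successive Halving inner loop follows, with a hypothetical noisy evaluation whose accuracy improves with budget, matching the monotonicity assumption above; with η = 3 and n = 27 it reproduces the geometric schedule n_i = n/η^i, b_i = bη^i used in (16.352).

```python
import math
import numpy as np

rng = np.random.default_rng(0)

def eval_config(lam, budget):
    # Hypothetical L(lambda, b): approaches the true loss as the budget grows.
    return (lam - 0.6) ** 2 + rng.normal(0, 0.1 / math.sqrt(budget))

def successive_halving(n=27, b=1, eta=3, rounds=3):
    configs = list(rng.uniform(0, 1, size=n))
    for i in range(rounds):
        budget = b * eta ** i                       # b_i = b * eta^i
        scores = [eval_config(lam, budget) for lam in configs]
        keep = max(1, len(configs) // eta)          # keep the top 1/eta fraction
        order = np.argsort(scores)
        configs = [configs[j] for j in order[:keep]]
    return configs[0]

print(successive_halving())   # close to 0.6
```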
We optimize:
\[
\theta^{*}(\lambda) = \arg\min_{\theta \in \Theta} L_{\mathrm{train}}(\theta, \lambda). \qquad (16.356)
\]
The optimization problem is now posed in a functional setting. Using the variational formulation of hyperparameter optimization, instead of solving a constrained minimization we express the problem through the Euler-Lagrange equation. The hyperparameter tuning problem
is:
\[
\lambda^{*} = \arg\min_{\lambda} \mathcal{F}(\lambda), \qquad (16.358)
\]
where:
F(λ) = E(x,y)∼Dval [Lval (θ∗ (λ), λ)] (16.359)
Since θ∗ (λ) is the minimizer of Ltrain , it satisfies the Euler-Lagrange equation:
\[
\frac{\delta L_{\mathrm{train}}}{\delta \theta}\big( \theta^{*}(\lambda), \lambda \big) = 0. \qquad (16.360)
\]
To differentiate F(λ), apply the chain rule in variational calculus:
\[
\frac{d}{d\lambda}\mathcal{F}(\lambda) = \frac{\partial L_{\mathrm{val}}}{\partial \lambda} + \left\langle \frac{\delta L_{\mathrm{val}}}{\delta \theta},\ \frac{d\theta^{*}}{d\lambda} \right\rangle_{\mathcal{H}}, \qquad (16.361)
\]
We now turn to higher-order sensitivity analysis. Beyond first and second derivatives, we
analyze third-order terms using Taylor expansions in Banach spaces:
\[
\theta^{*}(\lambda + \Delta\lambda) = \theta^{*}(\lambda) + \frac{d\theta^{*}}{d\lambda}\,\Delta\lambda + \frac{1}{2}\frac{d^2\theta^{*}}{d\lambda^2}\,(\Delta\lambda)^2 + O(\|\Delta\lambda\|^3). \qquad (16.364)
\]
The second-order sensitivity term is:
\[
\frac{d^2\theta^{*}}{d\lambda^2} = -\left( \frac{\delta^2 L_{\mathrm{train}}}{\delta\theta^2} \right)^{-1} \left( \frac{\delta^3 L_{\mathrm{train}}}{\delta\lambda\, \delta\theta^2}\, \frac{d\theta^{*}}{d\lambda} + \frac{\delta^3 L_{\mathrm{train}}}{\delta\lambda^2\, \delta\theta} \right). \qquad (16.365)
\]
If λmin > 0, H is positive definite, ensuring local convexity; if λmin = 0, H is singular, requiring pseudo-inversion. Using Tikhonov regularization, we modify H to H_ε = H + εI with ε > 0, which restores invertibility.
PBT with other evolutionary algorithms and highlights its advantages in terms of computational efficiency and adaptability; the authors also discuss practical considerations for implementing PBT in large-scale training scenarios. Co-Reyes et al. [506] explored the use of PBT for meta-optimization, specifically for evolving reinforcement learning algorithms, demonstrating how PBT can discover novel RL algorithms by optimizing both hyperparameters and algorithmic components; the work shows PBT's versatility beyond standard hyperparameter tuning. Song et al. (2024) [507] applied PBT to Neural Architecture Search (NAS), showing how PBT can efficiently explore and exploit architectures and hyperparameters simultaneously, and providing insights into how PBT can reduce the computational cost of NAS while maintaining competitive performance. Wan et al. (2022) [508] bridged the gap between Bayesian Optimization (BO) and PBT by proposing a hybrid approach that uses BO to guide the initial hyperparameter search and PBT to refine hyperparameters dynamically during training, demonstrating improved performance over standalone PBT or BO. Garcia-Valdez et al. (2023) [509] addressed the scalability of PBT in distributed computing environments, introducing an asynchronous variant of PBT that reduces idle time and improves resource utilization; the work is particularly relevant for large-scale machine-learning applications.
• θi ∈ Rd represents the model parameters, with d being the dimensionality of the model
parameter space.
• hi ∈ H ⊂ Rm represents the hyperparameters of the i-th model, with m being the dimen-
sionality of the hyperparameter space H. The set H is a bounded subset of the positive real
numbers, such as learning rates, batch sizes, or regularization factors.
We now treat the loss function as a metric. The objective function L(θ, h) is a mapping
from the space of model parameters and hyperparameters to a scalar loss value. This loss function
is a non-convex, potentially non-differentiable function in high-dimensional spaces, particularly in
the context of deep neural networks.
where Ltrain (θ, h) is the training loss, and Lval (θ, h) is the validation loss. Here, Lval serves as
the fitness function upon which the hyperparameter optimization process is based. Using the
Exploitation-Exploration Framework, the central mechanism of PBT revolves around two pro-
cesses: exploitation (model selection) and exploration (hyperparameter mutation). We will
delve into these components through the lens of Markov Decision Processes (MDPs), opti-
mization theory, and stochastic calculus. Regarding the Selection Mechanism (Exploitation), the
models in the population are ranked based on their validation fitness Mi (t) at each time step t:
In terms of selection, the worst-performing models are replaced by the best-performing models.
We now formally express the selection step in terms of the updating mechanism. Given a population
of models P(t), at time step t, a new model (θi(t+1), hi(t+1)) inherits its hyperparameters and model parameters from the best-performing model, denoted by i∗. Thus, the update rule for the next iteration is:
\[
h_i(t+1) = h_{i^{*}}(t), \qquad \theta_i(t+1) = \theta_{i^{*}}(t). \qquad (16.374)
\]
This corresponds to the exploitation phase, where we take the best-performing hyperparameters
from the current generation to seed the next. Regarding the Mutation Mechanism (Exploration),
the mutation process injects randomness into the hyperparameters to encourage exploration of the
search space. To formally describe this process, we use a stochastic perturbation model. Let hi (t) be
the hyperparameters at time t. Mutation introduces a random perturbation to the hyperparameters
as:
hi (t + 1) = hi (t) · (1 + ϵi (t)) (16.375)
where ϵi (t) ∼ U(−α, α) represents a random perturbation drawn from a uniform distribution with
parameter α. This random perturbation ensures that the hyperparameters can adaptively escape
local minima, promoting a more global search in the hyperparameter space. The mutative process
can be seen as:
hi (t + 1) = hi (t) · 10U (−α,α) (16.376)
This mutation process is a continuous stochastic process with a bounded magnitude, facilitating a
fine balance between exploitation and exploration. We now interpret PBT as a non-stationary,
stochastic optimization problem with dynamic model parameter and hyperparameter updates.
In optimization terms, PBT involves iteratively optimizing a non-convex function L(θ, h) with
respect to the hyperparameters h, and the model parameters θ. The stochastic update for hi (t)
can be modeled as:
\[
h_i(t+1) = h_i(t) - \nabla_h L\big( \theta_i(t), h_i(t) \big) + \sigma \cdot \mathcal{N}(0, I), \qquad (16.377)
\]
where ∇h L(θi (t), hi (t)) is the gradient of the loss function with respect to the hyperparameters hi ,
representing the exploitation mechanism (steepest descent direction), N (0, I) is a noise term with
zero mean and identity covariance matrix, modeling the exploration mechanism, σ is a hyperpa-
rameter that controls the magnitude of the noise, thus influencing the exploration rate. We shall
now turn to convergence analysis via Lyapunov stability. To rigorously analyze the convergence
of PBT, we leverage Lyapunov’s stability theory, which provides insight into whether the sys-
tem of updates stabilizes or diverges. Define the Lyapunov function V (t), which represents the
deviation from the optimal solution h∗ in terms of squared Euclidean distance:
\[
V(t) = \sum_{i=1}^{N} \|h_i(t) - h^{*}\|^2. \qquad (16.378)
\]
The evolution of V (t) over time gives us information about the behavior of the hyperparameters as
the population evolves. If the system converges to a local optimum, we expect that E[V (t + 1)] <
V (t). Using the update rule for hi (t), we can compute the expected rate of change of the Lyapunov
function:
E[V (t + 1) − V (t)] = −δV (t) (16.379)
where δ > 0 is a constant that guarantees exponential convergence towards the optimal hyperpa-
rameter configuration. This exponential decay implies that the population of models is moving
toward a global optimum at a rate proportional to the current deviation from the optimal solu-
tion. Regarding the Generalized Stochastic Optimization Framework, PBT can be viewed as an
instance of stochastic optimization under non-stationary conditions. The optimization pro-
cess evolves by sequentially adjusting the hyperparameters and parameters according to a noisy
gradient update:
\[
h_i(t+1) = h_i(t) - \eta(t) \cdot \big( \nabla_h L(\theta_i(t), h_i(t)) + \epsilon_i(t) \big), \qquad (16.380)
\]
Here η(t) is a learning rate that decays over time, ensuring that the updates become smaller as the
optimization progresses. The term ϵi (t) introduces noise for exploration, and the gradient term
∇h L ensures that the system exploits the current state of the model for refinement. Regarding
the theoretical convergence guarantees, under appropriate conditions PBT ensures that the
models will converge to an optimal or near-optimal hyperparameter configuration. By applying
perturbation theory and large deviation principles, we can demonstrate that the population
converges to a near-optimal region of the hyperparameter space with high probability. Furthermore,
as N → ∞, the convergence rate improves, which underscores the efficiency of the population-
based approach in exploring high-dimensional hyperparameter spaces. Regarding Computational
Efficiency and Parallelism in PBT, one of the key advantages of PBT is its parallelizability.
Since each model in the population is trained independently, the process is well-suited to mod-
ern distributed computing environments, such as multi-GPU or multi-TPU setups. The time
complexity of the population-based optimization process can be analyzed as follows: with a population of N models trained for T steps at per-step cost C, the total work is O(N · T · C).
Since each model is evaluated independently, this process can be easily parallelized, allowing for
significant speedup in hyperparameter optimization, particularly when the number of models in
the population is large.
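To make the exploit-explore update (16.377) concrete, here is a minimal Python sketch (assuming NumPy) that evolves a small population of hyperparameter vectors. The function grad_loss_h is a hypothetical stand-in for the hyperparameter gradient; practical PBT implementations typically replace this term by copying the configuration of a better-performing population member.

import numpy as np

def pbt_step(population, grad_loss_h, lr=0.05, sigma=0.05, rng=None):
    """One update of Eq. (16.377): gradient-based exploitation plus
    Gaussian exploration. grad_loss_h is a hypothetical callable
    h -> grad_h L(theta, h)."""
    if rng is None:
        rng = np.random.default_rng()
    new_pop = np.empty_like(population)
    for i, h in enumerate(population):
        descent = -lr * grad_loss_h(h)                 # exploitation: steepest descent
        noise = sigma * rng.standard_normal(h.shape)   # exploration: sigma * N(0, I)
        new_pop[i] = h + descent + noise
    return new_pop

# Toy usage: quadratic loss L(h) = ||h - 1||^2, so grad_h L = 2(h - 1).
rng = np.random.default_rng(0)
pop = np.zeros((8, 3))
for t in range(200):
    pop = pbt_step(pop, lambda h: 2.0 * (h - 1.0), rng=rng)
print(np.round(pop.mean(axis=0), 2))  # the population mean drifts toward h* = [1, 1, 1]

The persistent noise keeps the population spread around h*, which is the intended exploration behavior rather than a defect of the method.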
16.5.10 Optuna
16.5.10.1 Literature Review of Optuna
Akiba et. al. (2019) [510] wrote the foundational paper introducing Optuna. It describes the frame-
work’s design principles, including its define-by-run API, efficient sampling algorithms, and pruning
mechanisms. The paper highlights Optuna’s scalability and flexibility compared to other hyperpa-
rameter optimization tools like Hyperopt and Bayesian Optimization. Kadhim et. al. (2022) [512]
provided a comprehensive overview of hyperparameter optimization techniques, including Bayesian
optimization, evolutionary algorithms, and bandit-based methods. It contextualizes Optuna within
the broader landscape of hyperparameter tuning tools and methodologies. Bergstra et. al. (2011)
[496] introduced the concept of sequential model-based optimization (SMBO) and tree-structured
Parzen estimators (TPE), which are foundational to Optuna’s sampling algorithms. It provides
theoretical insights into efficient hyperparameter search strategies. Snoek et. al. (2012) [490] in-
troduced Bayesian optimization using Gaussian processes (GPs) for hyperparameter tuning. While
Optuna primarily uses TPE, this work is critical for understanding the theoretical underpinnings
of probabilistic modeling in hyperparameter optimization. Akiba et. al. (2025) [511] expanded
on the original Optuna paper, providing deeper insights into its define-by-run paradigm, which
allows users to dynamically construct search spaces. It also discusses advanced features like multi-
objective optimization and distributed computing. Yang and Shami (2020) [514] provided a practical guide to hyperparameter tuning, with examples using Optuna. It emphasizes
the importance of tuning in deep learning and provides hands-on code snippets for integrating
Optuna with Keras and TensorFlow. Wang (2024) [515] explained Optuna’s support for multi-
objective optimization, which is crucial for tasks like balancing model accuracy and computational
cost. It provides practical examples and benchmarks. Frazier (2018) [516] provided a thorough
introduction to Bayesian optimization, which is closely related to Optuna’s TPE algorithm. It
covers acquisition functions, Gaussian processes, and practical considerations for implementation.
Jeba (2021) [513] wrote a collection of case studies that demonstrated Optuna’s application in real-
world scenarios, including hyperparameter tuning for deep learning, reinforcement learning, and
time-series forecasting. It highlights Optuna's efficiency and ease of use. Hutter et. al. (2019) [517] edited a comprehensive reference on automated machine learning (AutoML), which situates hyperparameter optimization frameworks such as Optuna within the broader AutoML pipeline.
To formalize the problem, the hyperparameter search space is the Cartesian product
H = H_1 × H_2 × · · · × H_n    (16.381)
where each H_i represents the domain of the i-th hyperparameter. The objective is to identify the optimal hyperparameter configuration h* ∈ H that minimizes (or maximizes) a predefined objective function
f : H → R    (16.382)
which quantifies the performance of a machine learning model, such as validation loss or accuracy. Mathematically, this is expressed as
h* = arg min_{h ∈ H} f(h)    (16.383)
The function f (h) is often expensive to evaluate, as it requires training and validating a model, and
is typically non-convex, noisy, and lacks an analytical gradient, rendering traditional optimization
methods ineffective.
Optuna addresses this challenge by employing a Bayesian optimization framework, which itera-
tively constructs a probabilistic surrogate model of the objective function f (h) and uses it to guide
the search for h∗ . Specifically, Optuna utilizes a Tree-structured Parzen Estimator (TPE) as its
surrogate model, which is a non-parametric density estimator that models the distribution of hyper-
parameters conditioned on the observed values of f (h). The TPE approach partitions the observed
trials into two subsets: G_good, containing hyperparameter configurations associated with the best observed values of f(h), and G_bad, containing the remaining configurations. It then estimates two probability density functions,
p(h | G_good) and p(h | G_bad)    (16.384)
which represent the likelihood of hyperparameters given good and bad performance, respectively. The acquisition function a(h), defined as the ratio
a(h) = p(h | G_good) / p(h | G_bad)    (16.385)
is maximized to select the next hyperparameter configuration hnext , thereby balancing exploration
and exploitation in the search process. The optimization process begins with an initial phase of
random sampling to build a preliminary model of f (h), after which the TPE algorithm refines
its probabilistic model and focuses on regions of H that are more likely to contain h∗ . This
adaptive sampling strategy ensures that the search is both efficient and effective, particularly in
high-dimensional spaces where the curse of dimensionality would otherwise render exhaustive search
methods intractable. Additionally, Optuna incorporates pruning mechanisms to further enhance
computational efficiency. Pruning involves terminating trials that are unlikely to yield improve-
ments in f (h) based on intermediate evaluations, thereby reducing the computational cost associ-
ated with unpromising configurations. This is achieved by comparing the performance of a trial at
a given step to the performance of other trials at the same step and applying a statistical criterion
to decide whether to continue or halt the trial. The convergence properties of Optuna’s optimiza-
tion process are grounded in the theoretical foundations of Bayesian optimization and TPE. Under
mild assumptions, such as the smoothness of f (h) and the proper calibration of the acquisition
function, the algorithm is guaranteed to converge to the global optimum h∗ as the number of trials
N approaches infinity. However, in practice, the rate of convergence depends on the dimensionality
of H, the noise level of f (h), and the efficiency of the surrogate model in capturing the underlying
structure of the objective function. Optuna’s implementation also supports advanced features such
as conditional hyperparameter spaces, where the domain of one hyperparameter may depend on the
value of another, and parallelization, which enables distributed evaluation of trials across multiple
computational nodes.
In summary, Optuna provides a rigorous and mathematically sound framework for hyperparam-
eter tuning by leveraging Bayesian optimization, TPE, and pruning mechanisms. Its ability to
efficiently navigate complex and high-dimensional hyperparameter spaces, combined with its theo-
retical guarantees of convergence, makes it a powerful tool for optimizing machine learning models.
The framework’s flexibility, scalability, and integration with modern machine learning pipelines
further enhance its utility in both research and practical applications. By formalizing hyperpa-
rameter tuning as a probabilistic optimization problem and employing advanced sampling and
pruning strategies, Optuna achieves a balance between computational efficiency and optimization
performance, ensuring that the identified hyperparameter configuration h∗ is both optimal and
robust.
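As a concrete illustration of this workflow, the following minimal sketch (assuming the optuna package is installed) tunes two hyperparameters of a toy objective with the TPE sampler and a median pruner; the quadratic expression is an illustrative stand-in for the expensive train-and-validate loop f(h).

import optuna

def objective(trial):
    # Define-by-run search space: domains are declared inside the objective.
    lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
    momentum = trial.suggest_float("momentum", 0.0, 0.99)
    # Stand-in for an expensive training-and-validation loop f(h).
    value = (lr - 0.01) ** 2 + (momentum - 0.9) ** 2
    # Report intermediate values so the pruner can halt unpromising trials.
    for step in range(10):
        trial.report(value * (10 - step) / 10.0, step)
        if trial.should_prune():
            raise optuna.TrialPruned()
    return value

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=0),
                            pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=50)
print(study.best_params)  # near lr = 0.01, momentum = 0.9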
(2024) [546] proposed FlexHB that extended SHA by introducing GloSH, an improved version of
Successive Halving that dynamically adjusts resource allocation. The study highlights its advan-
tages in reducing wasted computational resources while maintaining high-quality hyperparameter
selection.
The hyperparameter optimization problem is to find
λ* = arg min_{λ ∈ Λ} L(M(λ), D_val)    (16.386)
where M(λ) is the machine learning model trained using hyperparameters λ, and L(·) represents
a loss function such as cross-entropy loss, mean squared error, or negative log-likelihood. Due to
the large cardinality of Λ and the computational expense of evaluating L(M (λ), Dval ), exhaustive
evaluation of all configurations is infeasible. To mitigate this computational burden, Successive
Halving (SH) is employed as a multi-fidelity optimization technique that dynamically allocates
computational resources to promising candidates while progressively eliminating inferior configu-
rations in a statistically justified manner.
The Successive Halving algorithm proceeds in a sequence of K iterative stages, where each stage
consists of training, evaluation, ranking, and pruning of hyperparameter configurations. Let N
denote the initial number of hyperparameter candidates sampled from Λ, and let B denote the
total computational budget. The algorithm initializes each configuration with a budget of B0 such
that the sum of allocated budgets across all iterations remains bounded by B. Specifically, defining
the reduction factor η > 1, the number of surviving configurations at each iteration is recur-
sively defined as N_k = N/η^k, while the budget allocated to each surviving configuration follows the exponential growth pattern B_k = η B_{k−1}. The number of iterations required to reduce the search space to a single surviving configuration is given by K = log_η N. Thus, the total computational cost incurred by the algorithm satisfies
C_SH = ∑_{k=0}^{K} N_k B_k = ∑_{k=0}^{K} (N/η^k) · η^k B_0 = O(B log_η N)    (16.387)
Compared to brute-force grid search, which incurs an evaluation cost of Cgrid = N B, this result
demonstrates that SH achieves an exponential reduction in computational complexity while
maintaining high fidelity in identifying near-optimal hyperparameter configurations. A key proba-
bilistic aspect of SH is its ability to retain at least one optimal configuration with high probability.
Let λ∗ denote an optimal configuration in Λ, and let fk (λ) represent the performance metric (e.g.,
validation accuracy) evaluated at iteration k. Assuming fk (λ) follows a sub-Gaussian distribution,
the probability that λ* survives elimination at iteration k satisfies
P(λ* survives iteration k) ≥ 1 − 1/η^k    (16.388)
Applying Chernoff bounds, the probability of discarding λ* at any given iteration is at most 1/η^k, leading to a final retention probability of
P_final = 1 − 1/η^{log_η N}    (16.389)
As N → ∞, the term 1/η^{log_η N} = 1/N asymptotically vanishes, ensuring that SH converges to an optimal configuration with probability approaching unity. The asymptotic convergence rate of SH is given by
O(log N / N)    (16.390)
which significantly outperforms naive random search while being slightly suboptimal compared to
adaptive bandit-based methods such as Hyperband. Hyperband extends SH by employing multiple
independent SH runs with varying initial budget allocations, thereby balancing exploration (many
configurations trained briefly) and exploitation (few configurations trained extensively). The
expected number of evaluations required by Hyperband satisfies
E[evaluations] = O(B log N / log η)    (16.391)
which achieves sublinear dependence on N and further enhances computational efficiency. Com-
pared to traditional SH, Hyperband is more robust to hyperparameter configurations with delayed
performance gains, making it particularly effective for deep learning applications. Despite its com-
putational advantages, SH has several practical limitations. The choice of the reduction factor η
influences the algorithm’s efficiency; larger values accelerate pruning but increase the risk of dis-
carding promising configurations prematurely. Additionally, SH assumes that partial evaluations
of configurations provide an unbiased estimate of their final performance, which may not hold for
all machine learning models, particularly those with complex training dynamics. Finally, for small
computational budgets B, SH may allocate insufficient resources to any configuration, leading to
suboptimal tuning outcomes.
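The following self-contained Python sketch implements the SH schedule described above (N_k = N/η^k survivors, B_k = η B_{k−1} budgets); the evaluate function and its noise model are illustrative assumptions rather than part of the algorithm.

import math
import random

def successive_halving(configs, evaluate, B0=1.0, eta=3):
    """Keep the top 1/eta of configurations at each rung (N_k = N / eta^k)
    while multiplying the per-configuration budget by eta (B_k = eta * B_{k-1})."""
    budget = B0
    while len(configs) > 1:
        scores = sorted((evaluate(c, budget), c) for c in configs)  # lower loss is better
        keep = max(1, len(configs) // eta)
        configs = [c for _, c in scores[:keep]]
        budget *= eta
    return configs[0]

# Toy usage: configurations are scalars whose true quality is |c - 0.5|;
# larger budgets reduce evaluation noise (an illustrative fidelity model).
random.seed(0)
def evaluate(c, budget):
    return abs(c - 0.5) + random.gauss(0.0, 0.1 / math.sqrt(budget))

candidates = [random.random() for _ in range(27)]
print(round(successive_halving(candidates, evaluate), 3))  # close to 0.5

With 27 candidates and η = 3, the schedule runs exactly K = log_3 27 = 3 rungs: 27 → 9 → 3 → 1 survivors.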
framework where the RL agent learns to generate and evaluate hyperparameter configurations.
The paper demonstrates the scalability of RL-based methods for large-scale hyperparameter opti-
mization. Afshar and Zhang (2022) [523] introduced a practical RL framework for hyperparameter
tuning in machine learning pipelines. It uses a tree-structured Parzen estimator (TPE) to guide
the RL agent, enabling efficient exploration of the hyperparameter space. The authors provide
empirical evidence of the method’s superiority over traditional approaches. Wu et. al. (2020) [524]
proposed a model-based RL approach for hyperparameter tuning, where a surrogate model is used
to approximate the performance of different hyperparameter configurations. The method reduces
the number of evaluations required to find optimal hyperparameters, making it highly efficient for
large-scale applications. Iranfar et. al. (2021) [525] focused on using deep RL algorithms, such as
Deep Q-Networks (DQN), to optimize hyperparameters in neural networks. The authors demon-
strate how deep RL can handle high-dimensional hyperparameter spaces and achieve competitive
results on tasks like image classification and natural language processing. While not exclusively
about RL, this survey by He et al. (2021) [526] provides a comprehensive overview of automated
machine learning (AutoML) techniques, including RL-based hyperparameter tuning. It discusses
the strengths and limitations of RL in the context of AutoML and provides a roadmap for future
research in the field.
The tuning problem is formulated as
θ* = arg max_{θ ∈ Θ} E_{D_val ∼ D}[P(M(θ); D_val)]    (16.393)
where θ ∈ Θ is the vector of hyperparameters, with Θ being the feasible hyperparameter space,
M (θ) is the machine learning model parameterized by θ, Dval is the validation dataset, drawn
from a data distribution D, P (M (θ); Dval ) is the performance metric (e.g., validation accuracy,
negative loss) of the model M (θ) on Dval . This formulation emphasizes that the goal is to optimize
the expected performance of the model over the distribution of validation datasets. The problem is cast as a Markov Decision Process (MDP), defined by the tuple (S, A, P, R, γ):
• State Space (S): The state s_t ∈ S encodes the current hyperparameter configuration θ_t, the history of performance metrics, and any other relevant information (e.g., computational resources used).
• Action Space (A): An action a_t ∈ A modifies the current configuration, e.g., perturbing, resampling, or selecting the next hyperparameter vector θ_{t+1}.
• Transition Dynamics (P): P(s_{t+1} | s_t, a_t) describes the (generally stochastic) outcome of training and evaluating the model under the updated configuration.
• Reward Function (R): The reward rt = R(st , at , st+1 ) quantifies the improvement in model
performance, e.g.,
rt = P (M (θt+1 ); Dval ) − P (M (θt ); Dval ). (16.394)
• Discount Factor (γ): The discount factor γ ∈ [0, 1] balances immediate and future rewards.
The objective is to find a policy π : S → A that maximizes the expected discounted return:
"∞ #
X
J(π) = Eπ γ t rt . (16.395)
t=0
We next perform policy optimization via stochastic gradient ascent: the policy π_ϕ is parameterized by ϕ, and the gradient of the expected return J(π_ϕ) with respect to ϕ is given by the policy gradient theorem:
∇_ϕ J(π_ϕ) = E_{π_ϕ}[∇_ϕ log π_ϕ(a_t | s_t) Q^π(s_t, a_t)]    (16.396)
where Qπ (st , at ) is the action-value function, representing the expected return of taking action at
in state st and following policy π thereafter:
"∞ #
X
Qπ (st , at ) = Eπ γ k−t rk | st , at . (16.397)
k=t
Here ∇_ϕ log π_ϕ(a_t | s_t) is the score function, which measures the sensitivity of the policy to changes in ϕ. To estimate Q^π(s_t, a_t), a parameterized value function Q_w(s_t, a_t) is used, where w are the parameters. The value function is optimized by minimizing the mean squared Bellman error:
L(w) = E_{π_ϕ}[ (Q_w(s_t, a_t) − (r_t + γ Q_w(s_{t+1}, a_{t+1})))² ].    (16.398)
This is typically solved using stochastic gradient descent:
w ← w − αw ∇w L(w) (16.399)
where α_w is the learning rate. Exploration can be encouraged via entropy regularization: an entropy term is added to the policy objective,
J_reg(π_ϕ) = J(π_ϕ) + λ H(π_ϕ),    (16.400)
where H(π_ϕ) is the entropy of the policy:
H(π_ϕ) = E_{s∼d^π, a∼π}[−log π_ϕ(a | s)].    (16.401)
The entropy term ensures that the policy remains stochastic, thereby facilitating better exploration
of the hyperparameter space. Modern RL algorithms for hyperparameter tuning often use advanced policy optimization techniques, such as Proximal Policy Optimization (PPO), whose clipped surrogate objective is
L^CLIP(ϕ) = E_t[ min( r_t(ϕ) A_t, clip(r_t(ϕ), 1 − ϵ, 1 + ϵ) A_t ) ],  where r_t(ϕ) = π_ϕ(a_t|s_t) / π_{ϕ_old}(a_t|s_t)    (16.402)
where the advantage function is defined as:
At = Qw (st , at ) − Vw (st ). (16.403)
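The clipped surrogate (16.402) is easy to compute once the probability ratios and advantages are in hand; the sketch below (assuming NumPy) uses illustrative placeholder arrays rather than the output of a real policy network.

import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Clipped surrogate of Eq. (16.402), returned as a loss to minimize
    (the negative of the objective to maximize)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -np.mean(np.minimum(unclipped, clipped))

# Illustrative ratios pi_phi / pi_phi_old and advantage estimates A_t.
ratio = np.array([0.8, 1.0, 1.3, 0.7])
advantage = np.array([1.0, -0.5, 2.0, 0.3])
print(ppo_clip_loss(ratio, advantage))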
Trust Region Policy Optimization (TRPO) solves
max_ϕ E_t[ (π_ϕ(a_t|s_t) / π_{ϕ_old}(a_t|s_t)) A_t ]    (16.404)
subject to E_t[KL(π_{ϕ_old}(·|s_t) ∥ π_ϕ(·|s_t))] ≤ δ,    (16.405)
where KL is the Kullback-Leibler divergence. There are also theoretical convergence guarantees: under certain conditions, RL-based hyperparameter tuning algorithms converge to the optimal policy π*. Key assumptions include the following. The MDP satisfies the Bellman optimality principle:
Q*(s_t, a_t) = E[ r_t + γ max_{a_{t+1}} Q*(s_{t+1}, a_{t+1}) | s_t, a_t ].    (16.406)
The policy and value function are Lipschitz continuous with respect to their parameters. The learning rates α_ϕ and α_w satisfy the Robbins-Monro conditions:
∑_{t=0}^{∞} α_t = ∞,  ∑_{t=0}^{∞} α_t² < ∞.    (16.407)
There are also practical implementation and scalability issues. To scale RL-based hyperparameter tuning to high-dimensional spaces, techniques such as the following are employed:
• Early Stopping: Use techniques like Hyperband to terminate poorly performing configurations early.
The exploration-exploitation tradeoff can be analyzed rigorously using multi-armed bandit theory and regret minimization. The cumulative regret R(T) after T steps is defined as:
R(T) = ∑_{t=1}^{T} ( P(M(θ*); D_val) − P(M(θ_t); D_val) ).    (16.408)
Algorithms like Upper Confidence Bound (UCB) and Thompson Sampling provide theoretical guarantees on the regret, e.g.,
R(T) = O(√T).    (16.409)
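A minimal sketch of UCB1 applied to a finite set of candidate configurations illustrates this regret behavior; the Bernoulli reward model and the hidden mean accuracies below are illustrative assumptions.

import math
import random

def ucb1(n_arms, pull, T):
    """UCB1: play each arm once, then pick the arm maximizing
    mean + sqrt(2 ln t / n_i); attains O(sqrt(T)) regret up to log factors."""
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    pulls = []
    for t in range(1, T + 1):
        if t <= n_arms:                  # initialization: play each arm once
            a = t - 1
        else:
            a = max(range(n_arms),
                    key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2.0 * math.log(t) / counts[i]))
        r = pull(a)
        counts[a] += 1
        sums[a] += r
        pulls.append(a)
    return pulls

# Toy usage: three configurations with hidden mean validation accuracies.
random.seed(0)
means = [0.60, 0.75, 0.70]
pulls = ucb1(3, lambda a: float(random.random() < means[a]), T=2000)
print([pulls.count(a) for a in range(3)])  # the best arm receives most pulls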
In summary, hyperparameter tuning using reinforcement learning is a mathematically rigorous process: the problem is first formulated as a stochastic optimization problem within an MDP framework; the policy is then optimized using advanced gradient-based methods and value function approximation; exploration and exploitation are balanced through entropy regularization and regret minimization; and theoretical convergence and scalability are ensured through careful algorithm design and analysis.
16.5.13 Meta-Learning
16.5.13.1 Literature Review of Meta-Learning
Gomaa et. al. (2024) [527] introduced SML-AutoML, a novel meta-learning-based automated
machine learning (AutoML) framework. It addresses the challenge of model selection and hy-
perparameter optimization by learning from past experiences. The framework leverages meta-
learning to dynamically select the best model architecture and hyperparameters based on histor-
ical performance. This research is significant in making AutoML more efficient and adaptable
to different datasets. Khan et. al. (2025) [528] explored federated learning where multiple de-
centralized models collaborate. It proposes a consensus-driven hyperparameter tuning approach
using meta-learning to optimize models across nodes. This study is crucial for ensuring model
convergence in non-IID (non-independent and identically distributed) data environments, where
traditional hyperparameter optimization methods often fail. Morrison and Ma (2025) [529] focused
on meta-optimization for improving machine learning optimizers. The study evaluates various op-
timization algorithms, demonstrating that meta-learning can fine-tune optimizer hyperparameters
to improve model efficiency, particularly in nanophotonic inverse design tasks. This approach is
applicable in physics-driven AI models that require precise parameter tuning. Berdyshev et. al.
(2025) [530] presented EEG-Reptile, a meta-learning framework for brain-computer interfaces (BCI)
that tunes hyperparameters dynamically during learning. The study introduces a Reptile-based
meta-learning approach that enables fast adaptation of models to individual brain signal patterns,
making AI-powered BCI systems more personalized and efficient. Pratellesi (2025) [531] applied
meta-learning to biomedical classification problems, specifically in flow cytometry cell analysis.
The paper demonstrates that meta-learning can optimize hyperparameter selection for imbalanced
biomedical datasets, improving classification accuracy while reducing computational costs. Gar-
cia et. al. (2022) [532] introduced a meta-learned Bayesian hyperparameter search technique for
metabolite annotation. It highlights how meta-learning can improve molecular property prediction
by selecting optimal descriptors and hyperparameters for chemical space exploration. Deng et.
al. (2024) [533] introduced a surrogate modeling approach that leverages meta-learning for effi-
cient hyperparameter search. The proposed method significantly reduces the computational cost
of hyperparameter tuning while maintaining high performance. The study is particularly useful
for computationally expensive AI models like deep neural networks. Jae et. al. (2024) [534] inte-
grated reinforcement learning with meta-learning to optimize hyperparameters for quantum state
learning. It demonstrates how reinforcement learning agents can dynamically adjust hyperparam-
eters, improving black-box optimization methods for quantum computing applications. Upadhyay
et. al. (2025) [535] investigated meta-learning-based sparsity optimization in multi-task networks.
By learning the optimal sparsity structure and hyperparameters, this approach enhances memory
efficiency and computational scalability for large-scale deep learning applications. Paul et. al.
(2025) [536] provided a comprehensive theoretical and practical overview of meta-learning for neu-
ral network design. It discusses how meta-learning can automate hyperparameter tuning, improve
transfer learning strategies, and enhance architecture search.
The hyperparameter tuning problem can be written as
θ* = arg min_{θ ∈ Θ} L(f_θ, D)    (16.410)
Here, D denotes the dataset, and L is the loss function used to measure the quality of the model.
The challenge arises because the hyperparameters θ are fixed before training begins, unlike the
model parameters that are learned via optimization techniques such as gradient descent. This
problem becomes computationally intractable when θ is high-dimensional or when traditional grid
and random search methods are employed. Meta-learning, often referred to as "learning to learn,"
provides a sophisticated framework to address hyperparameter tuning. The key objective in meta-
learning is to develop a meta-model that can efficiently adapt to new tasks with minimal data.
Mathematically, consider a set of tasks T = {T1 , T2 , . . . , TN }, where each task Ti consists of a
dataset Di and a corresponding loss function Li . The meta-learning framework aims to find meta-
parameters ϕ that minimize the expected loss across tasks:
ϕ* = arg min_ϕ E_{T_i ∼ p(T)}[ L_{T_i}(f_{h(D_{T_i}, ϕ)}, D_{T_i}) ]    (16.411)
Meanwhile, the outer optimization problem concerns learning ϕ, the meta-parameters, from multi-
ple tasks:
ϕ* = arg min_ϕ ∑_{T_i ∈ T} L_{T_i}(f_{h(D_{T_i}, ϕ)}, D_{T_i})    (16.413)
This nested optimization structure, wherein the inner optimization problem is task-specific and the
outer optimization problem is meta-specific, requires careful treatment via gradient-based methods
and implicit differentiation. The meta-learning process can be understood as a bi-level optimization
problem. To analyze this, we first consider the inner optimization, which optimizes the task-specific hyperparameters θ for each task T_i. This is given by:
θ_i* = arg min_θ L_i(f_θ, D_i)    (16.414)
For each task, the hyperparameter θ is chosen to minimize the corresponding task-specific loss. The
outer optimization then aims to find the optimal meta-parameters ϕ across tasks. The outer
objective can be written as:
ϕ* = arg min_ϕ ∑_{i=1}^{N} L_i(f_{h(D_i, ϕ)}, D_i)    (16.415)
Since the task-specific loss Li depends on θi∗ , which in turn depends on ϕ, we require the application
of implicit differentiation. By applying the chain rule, we obtain the gradient of the outer
objective with respect to ϕ:
∇_ϕ L_i(f_{θ_i*}, D_i) = ∇_{θ_i*} L_i · (∂θ_i*/∂ϕ)    (16.416)
The term ∂θ_i*/∂ϕ involves the inverse of the Hessian matrix of the loss function with respect to θ, leading to a computationally expensive second-order update rule:
∂θ_i*/∂ϕ ≈ −(∇²_{θ_i} L_i)^{−1} ∇_{θ_i} h(D_i, ϕ)    (16.417)
This analysis demonstrates the intricate dependencies between the task-specific hyperparame-
ters and the meta-parameters, requiring sophisticated optimization strategies for practical use.
Gradient-Based Meta-Learning (e.g., Model-Agnostic Meta-Learning or MAML) seeks to find an
optimal initialization θ0 for the hyperparameters that can be adapted to new tasks with a small
number of gradient steps. For a single task T_i, the hyperparameters are adapted as follows:
θ_i′ = θ_0 − α ∇_{θ_0} L_i(f_{θ_0}, D_i)    (16.418)
Here, α is the learning rate for task-specific updates. The goal is to optimize θ_0 such that, after a few gradient steps, the model performs well on any task T_i. The meta-objective is given by:
min_{θ_0} ∑_{i=1}^{N} L_i(f_{θ_i′}, D_i)    (16.419)
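The inner update (16.418) and meta-objective (16.419) can be illustrated with a short first-order sketch on toy quadratic tasks; dropping the Hessian term of (16.417) gives the first-order MAML approximation used below, with the quadratic losses as an illustrative assumption.

import numpy as np

def maml_first_order(theta0, tasks, alpha=0.1, beta=0.05, meta_steps=500):
    """First-order MAML on toy tasks L_i(theta) = ||theta - c_i||^2.
    Inner step: theta_i' = theta0 - alpha * grad L_i(theta0)   (Eq. 16.418)
    Outer step: theta0 <- theta0 - beta * mean_i grad L_i(theta_i'),
    with the gradient evaluated at the adapted parameters (Hessian term dropped)."""
    for _ in range(meta_steps):
        meta_grad = np.zeros_like(theta0)
        for c in tasks:
            grad_inner = 2.0 * (theta0 - c)          # grad of L_i at theta0
            theta_adapted = theta0 - alpha * grad_inner
            meta_grad += 2.0 * (theta_adapted - c)   # grad of L_i at theta_i'
        theta0 = theta0 - beta * meta_grad / len(tasks)
    return theta0

# Tasks whose optima c_i surround a common center: the learned initialization
# settles near their centroid, from which one gradient step adapts quickly.
tasks = [np.array([1.0, 0.0]), np.array([-1.0, 0.0]), np.array([0.0, 1.0])]
print(np.round(maml_first_order(np.array([5.0, 5.0]), tasks), 2))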
An alternative, Bayesian route to hyperparameter tuning within the meta-learning framework places a Gaussian Process prior over the hyperparameters:
θ ∼ GP(µ, K)    (16.422)
where K(x, x′) = exp(−∥x − x′∥² / (2l²)) is the RBF kernel, and l is the length scale parameter. The posterior distribution over θ given the observed data is:
p(θ|D) = p(D|θ) p(θ) / p(D)    (16.423)
Using this posterior, we define an acquisition function such as Expected Improvement (EI):
EI(θ) = E[ max(0, f(θ_best) − f(θ)) ]    (16.424)
which helps guide the optimization of θ by balancing exploration and exploitation. The com-
putational challenges in this approach are mitigated by using sparse Gaussian Processes or
variational inference methods, which approximate the posterior more efficiently. In conclusion, meta-learning offers a mathematically rigorous framework for hyperparameter tuning, leveraging
advanced optimization techniques and probabilistic models to adapt to new tasks efficiently. The
bi-level optimization problem, second-order derivatives, and Bayesian frameworks provide both
theoretical depth and practical utility. These sophisticated methods represent a powerful toolkit
for hyperparameter optimization in complex machine learning systems.
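A compact sketch of this Bayesian loop, under the RBF kernel of (16.422) and the EI acquisition (16.424), and assuming NumPy and SciPy are available, optimizes a one-dimensional toy hyperparameter:

import numpy as np
from scipy.stats import norm

def rbf(X1, X2, length=0.3):
    """RBF kernel K(x, x') = exp(-||x - x'||^2 / (2 l^2)), Eq. (16.422)."""
    d2 = (X1[:, None] - X2[None, :]) ** 2
    return np.exp(-d2 / (2.0 * length ** 2))

def gp_posterior(X, y, Xs, noise=1e-6):
    """Posterior mean and pointwise std of a zero-mean GP at test points Xs."""
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    mu = Ks.T @ np.linalg.solve(K, y)
    cov = rbf(Xs, Xs) - Ks.T @ np.linalg.solve(K, Ks)
    return mu, np.sqrt(np.maximum(np.diag(cov), 1e-12))

def expected_improvement(mu, sigma, best):
    """EI for minimization: E[max(0, f_best - f(theta))], Eq. (16.424)."""
    z = (best - mu) / sigma
    return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# Toy 1-D hyperparameter: minimize f(theta) = (theta - 0.3)^2.
f = lambda t: (t - 0.3) ** 2
X = np.array([0.0, 0.5, 1.0]); y = f(X)
grid = np.linspace(0.0, 1.0, 201)
for _ in range(10):
    mu, sigma = gp_posterior(X, y, grid)
    theta_next = grid[np.argmax(expected_improvement(mu, sigma, y.min()))]
    X = np.append(X, theta_next); y = np.append(y, f(theta_next))
print(round(float(X[np.argmin(y)]), 3))  # close to 0.3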
17 Convolutional Neural Networks
Literature Review: Goodfellow et. al. (2016) [112] wrote one of the most foundational text-
books on deep learning, covering CNNs in depth. It introduces theoretical principles, including
convolutions, backpropagation, and optimization methods. The book also discusses applications of
CNNs in image processing and beyond. LeCun et. al. (2015) [118] provides a historical overview
of CNNs and deep learning. LeCun, one of the inventors of CNNs, explains why convolutions help
in image recognition and discusses their applications in vision, speech, and reinforcement learning.
Krizhevsky et. al. (2012) [147] and Krizhevsky et. al. (2017) [148] introduced AlexNet, the first
modern deep CNN, which won the 2012 ImageNet Challenge. It demonstrated that deep CNNs can
achieve unprecedented accuracy in image classification tasks, paving the way for deep learning’s
dominance. Simonyan and Zisserman (2015) [149] introduced VGGNet, which demonstrated that
increasing network depth using small 3x3 convolutions can improve performance. It also provided
insights into layer design choices and their effects on accuracy. He et. al. (2016) [150] introduced
ResNet, which solved the vanishing gradient problem in deep networks by using skip connections.
This revolutionized CNN design by allowing models as deep as 1000 layers to be trained efficiently.
Cohen and Welling (2016) [151] extended CNNs using group theory, enabling equivariant feature
learning. This improved CNN robustness to rotations and translations, making them more efficient
in symmetry-based tasks. Zeiler and Fergus (2014) [152] introduced deconvolution techniques to
visualize CNN feature maps, making it easier to interpret and debug CNNs. It showed how different
layers detect patterns, textures, and objects. Liu et.al. (2021) [153] introduced Vision Transform-
ers (ViTs) that outperform CNNs in some vision tasks. This paper discusses the limitations of
CNNs and how transformers can be hybridized with CNN architectures. Lin et.al. (2013) [154]
introduced the 1x1 convolution, which improved feature learning efficiency. This concept became
a key component of modern CNN architectures such as ResNet and MobileNet. Rumelhart et.
al. (1986) [155] formalized backpropagation, the training method used for CNNs. Without this
discovery, CNNs and deep learning would not exist today.
A Convolutional Neural Network (CNN) is a deep learning model primarily used for analyzing
grid-like data, such as images, video, and time-series data with spatial or temporal dependencies.
The fundamental operation of CNNs is the convolution operation, which is employed to extract
local patterns from the input data. The input to a CNN is generally represented as a tensor
I ∈ R^{H×W×C}, where H is the height, W is the width, and C is the number of channels (for RGB
images, C = 3).
At the core of a CNN is the convolutional layer, where the input image I is convolved with a
set of filters or kernels K ∈ R^{f_h×f_w×C}, where f_h and f_w are the height and width of the filter,
respectively. The filter K slides across the input image I, and the result of this convolution is a
set of feature maps that are indicative of certain local patterns in the image. The element-wise convolution at location (i, j) of the output feature map is
F_{i,j} = ∑_{p=1}^{f_h} ∑_{q=1}^{f_w} ∑_{r=1}^{C} I_{i+p−1, j+q−1, r} · K_{p,q,r}    (17.1)
where I_{i+p−1, j+q−1, r} denotes the value of the r-th channel of the input image at position (i + p − 1, j + q − 1), and K_{p,q,r} is the corresponding filter value at (p, q, r). This operation is done for each location (i, j) of the output feature map. The resulting feature map F has spatial dimensions
H′ × W′, where:
H′ = (H + 2p − f_h)/s + 1,  W′ = (W + 2p − f_w)/s + 1    (17.2)
where p is the padding, and s is the stride of the filter during its sliding motion. The convolution
operation provides a translation-invariant representation of the input image, as each filter detects
patterns across the entire image. After this convolution, a non-linear activation function, typically
the Rectified Linear Unit (ReLU), is applied to introduce non-linearity into the network and
ensure it can model complex patterns. The ReLU activation function operates element-wise and is
given by:
ReLU(x) = max(0, x) (17.3)
Thus, for each feature map F, the output after ReLU is:
F^+_{i,j,k} = max(0, F_{i,j,k})    (17.4)
This ensures that negative values in the feature map are discarded, which helps with the sparse
representation of activations, mitigating the vanishing gradient problem in deeper layers. In CNNs,
pooling operations follow the convolution and activation layers. Pooling serves to reduce the spa-
tial dimensions of the feature maps, thus decreasing computational complexity and making the
representation more invariant to translations. Max pooling, which is the most common form,
selects the maximum value within a specified window size p_h × p_w. Given an input feature map F ∈ R^{H′ × W′ × K}, max pooling operates as follows:
P_{i,j,k} = max_{0 ≤ u < p_h, 0 ≤ v < p_w} F_{i·p_h + u, j·p_w + v, k}    (17.5)
where P is the pooled feature map. This pooling operation effectively reduces the spatial dimensions of each feature map, resulting in an output P ∈ R^{H″ × W″ × K}, where:
H″ = H′/p_h,  W″ = W′/p_w    (17.6)
Max pooling introduces an element of robustness by capturing only the strongest features within
the local regions, discarding irrelevant information, and ensuring that the network is invariant to
small translations. The CNN architecture typically contains multiple convolutional layers followed
by pooling layers. After these operations, the feature maps are flattened into a one-dimensional
vector and passed into one or more fully connected (dense) layers. A fully connected layer
computes a linear transformation of the form:
z^{(l)} = W^{(l)} a^{(l−1)} + b^{(l)}    (17.7)
where a^{(l−1)} is the input to the layer, W^{(l)} is the weight matrix, and b^{(l)} is the bias vector. The
output of this linear transformation is then passed through a non-linear activation function, such as
ReLU or softmax for classification tasks. For classification, the softmax function is often applied
to convert the output into a probability distribution:
y_i = exp(z_i) / ∑_{j=1}^{C} exp(z_j)    (17.8)
where C is the number of output classes, and yi is the probability of the i-th class. The softmax
function ensures that the output probabilities sum to 1, providing a valid classification output.
The CNN is trained using backpropagation, which computes the gradients of the loss function
L with respect to the network’s parameters (i.e., weights and biases). Backpropagation uses the
chain rule to propagate the error gradients through each layer. The gradients with respect to the
convolutional filters K are computed by:
∂L/∂K = (∂L/∂F) ∗ I    (17.9)
where ∗ denotes the convolution operation. Similarly, the gradients for the fully connected layers
are computed by:
∂L/∂W^{(l)} = (∂L/∂z^{(l)}) · (a^{(l−1)})^T    (17.10)
Once the gradients are computed, the weights are updated using an optimization algorithm like
gradient descent:
W^{(l)} ← W^{(l)} − η ∂L/∂W^{(l)}    (17.11)
where η is the learning rate. This optimization ensures that the network’s parameters are adjusted
in the direction of the negative gradient, minimizing the loss function and thereby improving the
performance of the CNN. Regularization techniques are commonly applied to avoid overfitting.
Dropout, for instance, randomly deactivates a subset of neurons during training, preventing the
network from becoming too reliant on any specific feature and promoting better generalization.
The dropout operation at a given layer l with dropout rate p is defined as:
ã^{(l)} = (1/(1−p)) · m ⊙ a^{(l)},  m_j ∼ Bernoulli(1 − p)    (17.12)
where the activations a^{(l)} are randomly set to zero with probability p, and the remaining activations are scaled by 1/(1 − p). Another regularization technique is batch normalization, which normalizes
the inputs of each layer to have zero mean and unit variance, thus improving training speed and
stability. Mathematically, batch normalization is defined as:
x̂ = (x − µ_B)/σ_B,  y = γ x̂ + β    (17.13)
where µB and σB are the mean and standard deviation of the batch, and γ and β are learned
scaling and shifting parameters.
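The operations in (17.1)-(17.6) can be collected into a short NumPy sketch; the single-filter convolution, toy image, and random kernel are illustrative simplifications of a full multi-filter layer.

import numpy as np

def conv2d(I, K, stride=1, pad=0):
    """Single-filter 2-D convolution of Eq. (17.1); output size per Eq. (17.2)."""
    if pad > 0:
        I = np.pad(I, ((pad, pad), (pad, pad), (0, 0)))
    H, W, _ = I.shape
    fh, fw, _ = K.shape
    Hp = (H - fh) // stride + 1
    Wp = (W - fw) // stride + 1
    F = np.empty((Hp, Wp))
    for i in range(Hp):
        for j in range(Wp):
            patch = I[i*stride:i*stride+fh, j*stride:j*stride+fw, :]
            F[i, j] = np.sum(patch * K)   # weighted sum over the receptive field
    return F

def relu(x):
    return np.maximum(0, x)               # Eq. (17.3)

def max_pool(F, ph=2, pw=2):
    """Non-overlapping max pooling, Eq. (17.5); output dims per Eq. (17.6)."""
    Hpp, Wpp = F.shape[0] // ph, F.shape[1] // pw
    return F[:Hpp*ph, :Wpp*pw].reshape(Hpp, ph, Wpp, pw).max(axis=(1, 3))

# Toy usage: 6x6 RGB image, 3x3 filter; padding 1 preserves the spatial size.
rng = np.random.default_rng(0)
I = rng.random((6, 6, 3))
K = rng.standard_normal((3, 3, 3))
print(max_pool(relu(conv2d(I, K, pad=1))).shape)  # (3, 3)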
The proposed model outperforms traditional CNNs by providing human-interpretable insights into
medical image classification. The research highlights how CNNs can be effectively applied to med-
ical imaging with enhanced transparency. Ramos-Briceño et. al. (2025) [187] demonstrated the
superior classification accuracy of CNNs in malaria parasite detection. The research uses deep
CNNs to classify malaria species in blood samples and achieves state-of-the-art performance. The
paper provides valuable insights into CNN-based image classification for biomedical applications.
Espino-Salinas et. al. (2025) [188] applied CNNs to mental health diagnostics by classifying motion
activity patterns as images. The paper explores the novel application of CNNs beyond traditional
image classification by transforming time-series data into visual representations and utilizing CNNs
to detect psychiatric disorders. Ran et. al. (2025) [189] introduced a CNN-based hyperspectral
imaging method for early diagnosis of pancreatic neuroendocrine tumors. The paper highlights
CNNs’ ability to process multispectral data for complex medical imaging tasks, further expanding
their utility in pathology and cancer detection. Araujo et. al. (2025) [190] demonstrated how CNNs
can be employed in industrial monitoring and predictive maintenance. The research introduces an
innovative CNN-based approach for detecting faults in ZnO surge arresters using thermal imaging,
proving CNNs’ robustness in non-destructive testing applications. Sari et. al. (2025) [191] applied
CNNs to cultural heritage preservation, specifically Batik pattern classification. The study show-
cases CNNs’ adaptability in fine-grained image classification and highlights the importance of deep
learning in automated textile pattern recognition. Wang et. al. (2025) [192] proposed CF-WIAD,
a novel semi-supervised learning method that leverages CNNs for skin lesion classification. The
research demonstrates how CNNs can be used to effectively classify dermatological images, partic-
ularly in low-data environments, which is a key challenge in medical AI. Cai et. al. (2025) [193]
introduced DFNet, a CNN-based residual network that improves feature extraction by incorporat-
ing differential features. The study highlights CNNs’ role in advanced feature engineering, which
is crucial for applications such as facial recognition and object classification. Vishwakarma and
Deshmukh (2025) [194] presented CNNM-FDI, a CNN-based fire detection model that enhances
real-time safety monitoring. The study explores CNNs’ application in environmental monitoring,
emphasizing fast-response classification models for early disaster prevention. Ranjan et. al. (2025)
[195] merged CNNs, Autoencoders, GANs, and Zero-Shot Learning to improve hyperspectral image
classification. The research underscores how CNNs can be augmented with generative models to
enhance classification in limited-label datasets, a crucial area in remote sensing applications.
The process of image classification in Convolutional Neural Networks (CNNs) involves a sophisti-
cated interplay of linear algebra, calculus, probability theory, and optimization. The primary goal
is to map a high-dimensional input image to a specific class label. Let I ∈ R^{H×W×C} represent the
input image, where H, W , and C are the height, width, and number of channels (usually 3 for
RGB images) of the image, respectively. Each pixel of the image can be represented as I(i, j, c),
which denotes the intensity of the c-th channel at pixel position (i, j). The objective of the CNN
is to transform this raw input image into a label, typically one of M classes, using a hierarchical
feature extraction process that includes convolutions, nonlinearities, pooling, and fully connected
layers.
The convolution operation is central to CNNs and forms the basis for the feature extraction process.
Let K ∈ R^{k×k×C} be a filter (or kernel) with spatial dimensions k × k and C channels, where k is typically a small odd integer, such as 3 or 5. The filter K is convolved with the input image I to produce a feature map S ∈ R^{(H−k+1)×(W−k+1)×F}, where F is the number of filters used in the
convolution. For a given spatial position (i, j) in the feature map, the convolution operation is
defined as:
S_{i,j,f} = ∑_{m=0}^{k−1} ∑_{n=0}^{k−1} ∑_{c=0}^{C−1} I(i + m, j + n, c) · K_{m,n,c,f}    (17.14)
where Si,j,f represents the value at position (i, j) in the feature map corresponding to the f -th filter.
This operation computes a weighted sum of pixel values in the receptive field of size k × k × C
around pixel (i, j), where the weights are given by the filter values. The result is a new feature map
that captures local patterns such as edges or textures in the image. This local feature extraction
is performed for each position (i, j) across the entire image, producing a set of feature maps for
each filter. To introduce non-linearity into the network and allow it to model complex functions,
the feature map S is passed through a non-linear activation function, typically the Rectified Linear
Unit (ReLU), which is defined element-wise as:
This activation function outputs 0 for negative values and passes positive values unchanged, en-
suring that the network can learn complex, non-linear relationships. The output of the activation
function for the feature map is denoted as S+ , where each element of S+ is computed as:
S^+_{i,j,f} = max(0, S_{i,j,f})    (17.16)
This element-wise operation enhances the network’s ability to capture and represent complex pat-
terns, thereby aiding in the learning process. After the convolution and activation, the feature map
is downsampled using a pooling operation. The most common form of pooling is max pooling,
which selects the maximum value in a local region of the feature map. Given a pooling window of
size p × p and stride s, the max pooling operation for the feature map S+ is given by:
P_{i,j,f} = max_{(u,v) ∈ p×p} S^+_{i+u, j+v, f}    (17.17)
where P represents the pooled feature map. This operation reduces the spatial dimensions of the
feature map by a factor of p, while preserving the most important features in each region. Pooling
serves several purposes, including dimensionality reduction, translation invariance, and noise re-
duction. It also helps prevent overfitting by limiting the number of parameters and computations
in the network.
Once the feature maps are obtained through convolution, activation, and pooling, they are flat-
tened into a one-dimensional vector F ∈ R^N, where N is the total number of elements in the
pooled feature map. The flattened vector F is then fed into one or more fully connected layers.
These layers perform linear transformations of the input, which are essentially weighted sums of
the inputs, followed by the addition of a bias term. The output of a fully connected layer can be
expressed as:
O=W·F+b (17.18)
where W ∈ R^{M×N} is the weight matrix, b ∈ R^M is the bias vector, and O ∈ R^M is the raw output
or logit for each of the M classes. The fully connected layer computes a set of logits for the classes
based on the learned features from the convolutional and pooling layers. To convert the logits into
class probabilities, a softmax function is applied. The softmax function is a generalization of the
logistic function to multiple classes and transforms the logits into a probability distribution. The
probability of class k is given by:
P(y = k | O) = e^{O_k} / ∑_{j=1}^{M} e^{O_j}    (17.19)
where Ok is the logit corresponding to class k, and the denominator ensures that the sum of
probabilities across all classes equals 1. The class label with the highest probability is selected as
the final prediction:
y = arg max P (y = k | O) (17.20)
k
The prediction is made based on the computed class probabilities, and the network aims to minimize
the discrepancy between the predicted probabilities and the true labels during training. To optimize
the network’s parameters, we minimize a loss function that measures the difference between
the predicted probabilities and the actual labels. The cross-entropy loss is commonly used in
classification tasks and is defined as:
L = −∑_{k=1}^{M} y_k log P(y = k | O)    (17.21)
where yk is the true label in one-hot encoding, and P (y = k | O) is the predicted probability for
class k. The goal of training is to minimize this loss function, which corresponds to maximizing
the likelihood of the correct class under the predicted probability distribution.
The optimization of the network parameters is performed using gradient descent and its variants,
such as stochastic gradient descent (SGD), which iteratively updates the parameters based on the
gradients of the loss function. The gradients are computed using backpropagation, a method
that applies the chain rule of calculus to compute the partial derivatives of the loss with respect
to each parameter. For a fully connected layer, the gradient of the loss with respect to the weights
W is given by:
∇_W L = (∂L/∂O) · (∂O/∂W) = δ · F^T    (17.22)
where δ = ∂L/∂O is the error term (also known as the delta) for the logits, and F^T is the transpose of the flattened feature vector. The parameters are updated using the following rule:
W ← W − η∇W L (17.23)
where η is the learning rate, controlling the step size of the updates. This process is repeated
for each batch of training data until the network converges to a set of parameters that minimize
the loss function. Through this complex and iterative process, CNNs are able to learn to classify
images by automatically extracting hierarchical features from raw input data. The combination of
convolution, activation, pooling, and fully connected layers enables the network to learn increasingly
abstract and high-level representations of the input image, ultimately achieving high accuracy in
image classification tasks.
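The forward pass (17.18)-(17.19), the loss (17.21), and one gradient update (17.22)-(17.23) fit into a few lines of NumPy; the random feature vector and tiny layer sizes are illustrative assumptions, and the standard identity δ = p − y for softmax combined with cross-entropy supplies the logit gradient.

import numpy as np

def softmax(o):
    e = np.exp(o - o.max())              # numerically stable Eq. (17.19)
    return e / e.sum()

def cross_entropy(p, y_onehot):
    return -np.sum(y_onehot * np.log(p + 1e-12))   # Eq. (17.21)

rng = np.random.default_rng(0)
F = rng.random(8)                        # flattened feature vector
W = 0.1 * rng.standard_normal((3, 8))    # M = 3 classes
b = np.zeros(3)
y = np.array([0.0, 1.0, 0.0])            # one-hot true label

O = W @ F + b                            # logits, Eq. (17.18)
loss_before = cross_entropy(softmax(O), y)

delta = softmax(O) - y                   # dL/dO for softmax + cross-entropy
W -= 0.1 * np.outer(delta, F)            # Eqs. (17.22)-(17.23): grad = delta . F^T
b -= 0.1 * delta
loss_after = cross_entropy(softmax(W @ F + b), y)
print(loss_before > loss_after)          # True: one gradient step reduces the loss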
reduce false-positive detections. The paper also explores the impact of feature pyramid networks
(FPNs) in hierarchical feature extraction, demonstrating their effectiveness in improving the detec-
tion of fine-grained details. The authors conduct an extensive empirical evaluation, comparing their
improved Faster R-CNN model against existing object detection architectures, proving its superior
performance in terms of precision and recall, particularly for applications involving customized icon
generation and user interface design. Ramana et. al. (2025) [198] introduced a Deep Convolutional
Graph Neural Network (DCGNN) that integrates Spectral Pyramid Pooling (SPP) and fused key-
point generation to significantly improve 3D object detection performance. The study employs
ResNet-50 as the backbone CNN architecture and enhances its feature extraction capability by
introducing multi-scale spectral feature aggregation. Through the integration of graph neural net-
works (GNNs), the model can effectively capture spatial relationships between object components,
leading to highly accurate 3D bounding box predictions. The proposed methodology is rigorously
evaluated on multiple benchmark datasets, demonstrating its superior ability to handle occlusion,
scale variation, and viewpoint changes. Additionally, the paper presents a novel fusion strategy
that combines keypoint-based object representation with spectral domain feature embeddings, al-
lowing the model to achieve unparalleled robustness in automated 3D object detection tasks. Shin
et. al. (2025) [199] explored the application of deep learning-based object detection in the field
of microfluidics and droplet-based bioengineering. The authors utilize YOLOv10n, an advanced
CNN-based object detection framework, to develop an automated system for tracking and catego-
rizing double emulsion droplets in high-throughput experimental setups. By fine-tuning the YOLO
architecture, the study achieves remarkable improvements in detection sensitivity and classification
accuracy, enabling real-time identification of droplet morphology, phase separation dynamics, and
stability characteristics. The researchers further introduce an adaptive feature refinement strategy,
wherein the CNN model continuously learns from real-time experimental variations, allowing for
automated calibration and correction of droplet misclassification. The paper also demonstrates
the practical implications of this AI-driven approach in drug delivery systems, encapsulation tech-
nologies, and synthetic biology applications. Taca et. al. (2025) [200] provided a comprehensive
comparative analysis of multiple CNN-based object detection architectures applied to aphid clas-
sification in large-scale agricultural datasets. The researchers evaluate the performance of YOLO,
SSD, Faster R-CNN, and EfficientDet, analyzing their trade-offs in terms of accuracy, inference
speed, and computational efficiency. Through an extensive experimental setup involving 48,000
annotated images, the authors demonstrate that certain CNN models excel in specific detection
scenarios, such as YOLO for real-time aphid localization and Faster R-CNN for high-precision clas-
sification. Furthermore, the paper introduces an innovative hybrid ensemble strategy, combining
the strengths of multiple CNN architectures to achieve optimal detection performance. The au-
thors validate their findings on real-world agricultural environments, emphasizing the importance
of deep learning-driven pest detection in sustainable farming practices. Ulaş et. al. (2025) [201]
explored the application of CNN-based object detection in the domain of astronomical time-series
analysis, specifically targeting oscillation-like patterns in eclipsing binary light curves. The study
systematically evaluates multiple state-of-the-art object detection models, including YOLO, Faster
R-CNN, and SSD, to determine their effectiveness in identifying transient light fluctuations that
indicate oscillatory behavior in celestial bodies. One of the key contributions of this paper is the
introduction of a customized pre-processing pipeline that optimizes raw observational data by re-
moving noise and enhancing feature visibility using wavelet-based signal decomposition techniques.
The researchers further implement a hybrid detection mechanism, integrating CNN-based spatial
feature extraction with recurrent neural networks (RNNs) to capture both spatial and temporal
dependencies within light curve datasets. Extensive validation on large-scale astronomical datasets
demonstrates that this approach significantly outperforms traditional statistical methods in detect-
ing oscillatory behavior, paving the way for AI-driven automation in astrophysical event classifica-
tion. Valensi et. al. (2025) [202] presented an advanced semi-supervised deep learning framework for
pleural line detection and segmentation in lung ultrasound (LUS) imaging, leveraging the power of
foundation models and CNN-based object detection architectures. The study highlights the short-
comings of conventional fully supervised learning in medical imaging, where annotated datasets
are limited and labor-intensive to create. To overcome this challenge, the researchers incorporate a
semi-supervised learning strategy, utilizing self-training techniques combined with pseudo-labeling
to improve model generalization. The framework employs YOLOv8-based object detection, specif-
ically optimized for medical feature localization, which significantly enhances detection accuracy
in cases of low-contrast and high-noise ultrasound images. Furthermore, the study integrates a
multi-scale feature extraction strategy, combining convolutional layers with attention mechanisms
to ensure precise identification of pleural lines across different imaging conditions. Experimen-
tal results demonstrate that this hybrid approach achieves a substantial increase in segmentation
accuracy, particularly in detecting subtle abnormalities linked to pneumothorax and pleural effu-
sion, making it a highly valuable tool in clinical diagnostic applications. Arulalan et. al. (2025)
[203] proposed an optimized object detection pipeline that integrates a novel convolutional neural
network (CNN) architecture, BS2ResNet, with bidirectional LSTM (LTK-Bi-LSTM) for improved
spatiotemporal object recognition. Unlike conventional CNN-based object detectors, which focus
solely on static spatial features, this study introduces a hybrid deep learning framework that cap-
tures both spatial and temporal dependencies. The proposed BS2ResNet model enhances feature
extraction by utilizing bottleneck squeeze-and-excitation blocks, which selectively emphasize impor-
tant spatial information while suppressing redundant feature maps. Additionally, the integration of
LTK-Bi-LSTM layers allows the model to effectively capture temporal correlations, making it highly
robust for detecting moving objects in dynamic environments. This approach is validated on mul-
tiple benchmark datasets, including autonomous driving and video surveillance datasets, where it
demonstrates superior performance in handling occlusions, rapid motion, and low-light conditions.
The findings indicate that combining deep convolutional networks with sequence-based modeling
significantly improves object detection accuracy in complex real-world scenarios, offering critical
advancements for applications in intelligent transportation, security, and real-time monitoring. Zhu
et. al. (2025) [204] investigated a novel adversarial attack strategy targeting CNN-based object
detection models, with a specific focus on binary image segmentation tasks such as salient object
detection and camouflage object detection. The paper introduces a high-transferability adversarial
attack framework, which generates adversarial perturbations capable of fooling a wide range of
deep learning models, including YOLO, Mask R-CNN, and U-Net-based segmentation networks.
The researchers employ adversarial example augmentation, where synthetic adversarial patterns
are iteratively refined through gradient-based optimization techniques, ensuring that the adversar-
ial attacks remain effective across different architectures and datasets. A particularly important
contribution is the introduction of a dual-stage attack pipeline, wherein the model first learns to
generate localized, high-impact adversarial noise and then optimizes for cross-model generalization.
Extensive experiments demonstrate that this approach significantly degrades detection performance
across multiple state-of-the-art models, revealing critical vulnerabilities in current CNN-based ob-
ject detectors. This research provides valuable insights into deep learning security and underscores
the urgent need for robust adversarial defense mechanisms in high-stakes applications such as au-
tonomous systems, medical imaging, and biometric security. Guo et. al. (2025) [205] introduced
a deep learning-based agricultural monitoring system, utilizing CNNs for agronomic entity detec-
tion and attribute extraction. The research highlights the limitations of traditional rule-based and
manual annotation systems in agricultural monitoring, which are prone to errors and inefficiencies.
By leveraging CNN-based object detection models, the proposed system enables real-time crop
analysis, accurately identifying key agronomic attributes such as plant height, leaf structure, and
disease symptoms. A significant innovation in this study is the incorporation of inter-layer feature
fusion, wherein multi-scale convolutional features are integrated across different network depths to
improve detection robustness under varying lighting and environmental conditions. Additionally,
the authors employ a hybrid feature selection mechanism, combining spatial attention networks
with spectral domain feature extraction, which enhances the model’s ability to distinguish between
healthy and diseased crops with high precision. The research is validated through rigorous field
trials, demonstrating that CNN-based agronomic monitoring can significantly enhance crop yield
predictions, reduce human labor in precision agriculture, and optimize resource allocation in farm-
ing operations.
In the mathematical framework, let the input image be represented by a matrix I ∈ R^{H×W×C},
where H, W , and C are the height, width, and number of channels (typically 3 for RGB images).
Convolution operations in a CNN serve as the fundamental building blocks to extract spatial hier-
archies of features. The convolution operation involves the application of a kernel K ∈ R^{m×n×C} to
the input image, where m and n are the spatial dimensions of the kernel, and C is the number of
input channels. The convolution operation is performed by sliding the kernel over the image and
computing the element-wise multiplication between the kernel and the image patch, yielding the
following equation for the feature map O(x, y):
m−1
XX n−1 C−1
X
O(x, y) = I(x + i, y + j, c) · K(i, j, c) (17.24)
i=0 j=0 c=0
Here, O(x, y) represents the feature map at the location (x, y), which is generated by applying the
kernel K. The sum is taken over the spatial extent of the kernel as it slides over the image. This
convolutional operation helps the network capture local patterns in the input image, such as edges,
corners, and textures, which are crucial for identifying objects. Once the convolution is performed,
a non-linear activation function such as the Rectified Linear Unit (ReLU) is applied to introduce
non-linearity into the system. The ReLU activation function is given by:
ReLU(x) = max(0, x)    (17.25)
This activation function helps the network model complex non-linear relationships between fea-
tures and is computationally efficient. The application of ReLU ensures that the network can learn
complex decision boundaries that are necessary for tasks like object detection.
In CNN-based object detection, the goal is to predict the class of an object and localize its position
via a bounding box. The bounding box is parametrized by four coordinates: (x, y) for the center
of the box, and w, h for the width and height. The task can be viewed as a twofold problem: (1)
classify the object and (2) predict the bounding box that best encodes the object’s spatial posi-
tion. Mathematically, this requires the network to output both class probabilities and bounding
box coordinates for each object within the image. The classification task is typically performed
using a softmax function, which converts the network’s raw output logits zi for each class i into
probabilities P (yi |r). The softmax function is defined as:
P(y_i | r) = exp(z_i) / ∑_{j=1}^{k} exp(z_j)    (17.26)
where k is the number of possible classes, zi is the raw score for class i, and P (yi |r) is the probability
that the region r belongs to class yi . This function ensures that the predicted scores are valid
probabilities that sum to one, which allows the network to make a probabilistic decision regarding
the class of the object in each region. Simultaneously, the network must also predict the four
parameters of the bounding box for each object. The network’s predicted bounding box parameters
are typically denoted as B̂ = (x̂, ŷ, ŵ, ĥ), while the ground truth bounding box is denoted by
B = (x, y, w, h). The error between the predicted and true bounding boxes is quantified using a
loss function, with the smooth L1 loss being a commonly used metric for bounding box regression.
The smooth L1 loss for each parameter of the bounding box is defined as:
Lbbox = ∑_{i=1}^{4} SmoothL1(Bi − B̂i) (17.27)
This loss function is used to reduce the impact of large errors, thereby making the training process
more robust. The goal is to minimize this loss during the training phase to improve the network’s
ability to predict both the class and the bounding box of objects.
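A minimal sketch of how these pieces combine, assuming a single region with a one-hot class target; the weighting coefficient lam and the smooth-L1 threshold beta are illustrative choices rather than values from the text:

import numpy as np

def softmax(z):
    # Eq. (17.26), shifted by max(z) for numerical stability (the shift cancels).
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

def smooth_l1(x, beta=1.0):
    # Quadratic for small residuals, linear for large ones, as used in Eq. (17.27).
    x = np.abs(x)
    return np.where(x < beta, 0.5 * x ** 2 / beta, x - 0.5 * beta)

def detection_loss(logits, true_class, box_pred, box_true, lam=1.0):
    # Combined objective in the spirit of Eqs. (17.26)-(17.30).
    p = softmax(logits)
    l_cls = -np.log(p[true_class])                   # cross-entropy, one-hot target
    l_bbox = np.sum(smooth_l1(box_true - box_pred))  # sum over (x, y, w, h)
    return l_cls + lam * l_bbox

loss = detection_loss(np.array([2.0, 0.5, -1.0]), 0,
                      np.array([0.50, 0.50, 0.20, 0.30]),
                      np.array([0.55, 0.45, 0.25, 0.30]))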
For training, a combined objective incorporating both the classification loss and the bounding box regression loss is used; with a weighting coefficient λ on the regression term, the total loss function can be written as:

L = Lcls + λ Lbbox (17.28)
where Lcls is the classification loss, typically computed using the cross-entropy between the pre-
dicted probabilities and the ground truth labels. The cross-entropy loss for classification is given
by:
Lcls = − ∑_{i=1}^{k} yi log(ŷi) (17.30)
where yi is the true label, and ŷi is the predicted probability for class i. The total objective
function for training is therefore a weighted sum of the classification and bounding box regression
losses, and the network is optimized to minimize this combined loss function. Object detection
architectures like Region-based CNNs (R-CNNs) take a two-stage approach where the task is broken
into generating region proposals and classifying these regions. Region Proposal Networks (RPNs)
are employed to generate candidate regions r1 , r2 , . . . , rn , which are then passed through the network
to obtain their feature representations. The bounding box refinement and classification for each
proposal are then performed by a fully connected layer. The loss function for R-CNNs combines
both classification and bounding box regression losses for each proposal, and the objective is to
minimize:
LR-CNN = Lcls + Lbbox (17.31)
Another popular architecture, YOLO (You Only Look Once), frames object detection as a single
regression task. The image is divided into a grid of S × S cells, where each cell predicts the class
probabilities and bounding box parameters. The output vector for each cell consists of:

(x, y, w, h, c, P1, P2, . . . , Pk) (17.32)

where (x, y) are the coordinates of the bounding box center, w and h are the dimensions of the box,
c is the confidence score, and P1 , P2 , . . . , Pk are the class probabilities. The total loss for YOLO
combines the classification loss, bounding box regression loss, and confidence loss, which can be
written as:
LYOLO = Lcls + Lbbox + Lconf (17.33)
where Lcls is the classification loss, Lbbox is the bounding box regression loss, and Lconf is the
confidence loss, which penalizes predictions with low confidence. This approach allows YOLO to
make object detection predictions in a single pass through the network, enabling faster inference.
The Single Shot Multibox Detector (SSD) improves on YOLO by generating bounding boxes at
multiple feature scales, which allows for detecting objects of varying sizes. The loss function for
SSD is similar to that of YOLO, comprising the classification loss and bounding box localization
loss, given by:
LSSD = Lcls + Lloc (17.34)
where Lcls is the classification loss, and Lloc is the smooth L1 loss for bounding box regression.
This multi-scale approach enhances the network’s ability to detect objects at different levels of
resolution, improving its robustness to objects of different sizes.
Convolutional Neural Networks (CNNs) have become an indispensable tool in the field of med-
ical imaging, driven by their ability to automatically learn spatial hierarchies of features directly
from image data without the need for handcrafted feature extraction. The convolutional layers in
CNNs are designed to exploit the spatial structure of the input data, making them particularly
well-suited for tasks in medical imaging, where spatial relationships in images often encode critical
diagnostic information. The fundamental building block of CNNs, the convolution operation, is
mathematically expressed as
S(i, j) = ∑_{m=−k}^{k} ∑_{n=−k}^{k} I(i + m, j + n) · K(m, n), (17.35)
where S(i, j) represents the value of the output feature map at position (i, j), I(i, j) is the input
image, K(m, n) is the convolutional kernel (a learnable weight matrix), and k denotes the kernel
radius (for example, k = 1 for a 3 × 3 kernel). This equation fundamentally captures how local
patterns, such as edges, textures, and more complex features, are extracted by sliding the kernel
across the image. The convolution operation is performed for each channel of a multi-channel input
(e.g., RGB images or multi-modal medical images), and the results are summed across channels,
leading to multi-dimensional feature maps. For a 3D input tensor, the convolution extends to
include depth:
S(i, j, d′) = ∑_{d=1}^{D} ∑_{m=−k}^{k} ∑_{n=−k}^{k} I(i + m, j + n, d) · K(m, n, d), (17.36)
where D is the depth of the input tensor, and d′ is the depth index of the output feature map. CNNs
incorporate nonlinear activation functions after convolutional layers to introduce nonlinearity into
the model, allowing it to learn complex mappings. A commonly used activation function is the
Rectified Linear Unit (ReLU), mathematically defined as

f(x) = max(0, x). (17.37)

This function ensures sparsity in the activations, which is advantageous for computational efficiency
and generalization. More advanced activation functions, such as parametric ReLU (PReLU), extend
this concept by allowing learnable parameters for the negative slope:
f(x) = x if x > 0, and f(x) = ax if x ≤ 0, (17.38)
where a is a learnable parameter. Pooling layers are employed in CNNs to downsample the spatial
dimensions of feature maps, thereby reducing computational complexity and the risk of overfitting.
Max pooling is defined mathematically as

P(i, j) = max_{(m,n)∈R} S(i + m, j + n), (17.39)

where R is the pooling region (e.g., 2 × 2). Average pooling computes the mean value instead:

P(i, j) = (1/|R|) ∑_{(m,n)∈R} S(i + m, j + n). (17.40)
In medical imaging, CNNs are widely used for image classification tasks such as detecting abnor-
malities (e.g., tumors, fractures, or lesions). Consider a classification problem where the input is a
mammogram image, and the output is a binary label y ∈ {0, 1}, indicating benign or malignant.
The CNN model outputs a probability score ŷ, computed as
ŷ = σ(z) = 1 / (1 + e^{−z}), (17.41)
where z is the output of the final layer before the sigmoid activation. The binary cross-entropy loss
function is then used to train the model:
L = −(1/N) ∑_{i=1}^{N} [ yi log(ŷi) + (1 − yi) log(1 − ŷi) ]. (17.42)
For image segmentation tasks, where the goal is to assign a label to each pixel, architectures such
as U-Net are commonly used. U-Net employs an encoder-decoder structure, where the encoder ex-
tracts features through a series of convolutional and pooling layers, and the decoder reconstructs the
image through upsampling and concatenation operations. The objective function for segmentation
is often the Dice coefficient loss, defined as
LDice = 1 − (2 ∑_i pi gi) / (∑_i pi + ∑_i gi), (17.43)
where pi and gi are the predicted and ground truth values for pixel i, respectively. In the context of
image reconstruction, such as in magnetic resonance imaging (MRI), CNNs are used to reconstruct
high-quality images from undersampled k-space data. The reconstruction problem is formulated as
minimizing the difference between the reconstructed image Ipred and the ground truth Itrue , often
using the ℓ2 -norm:
Lreconstruction = ∥Ipred − Itrue ∥22 . (17.44)
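A minimal sketch of the two objectives in Eqs. (17.43) and (17.44), assuming the predictions and ground truths are given as NumPy arrays; the eps guard against an empty mask is an implementation assumption, not part of the equations:

import numpy as np

def dice_loss(p, g, eps=1e-7):
    # Eq. (17.43): one minus the Dice overlap between prediction p and mask g.
    return 1.0 - (2.0 * np.sum(p * g)) / (np.sum(p) + np.sum(g) + eps)

def reconstruction_loss(I_pred, I_true):
    # Eq. (17.44): squared l2 norm of the residual image.
    return np.sum((I_pred - I_true) ** 2)

pred = np.array([[0.9, 0.1], [0.8, 0.2]])
gt   = np.array([[1.0, 0.0], [1.0, 0.0]])
print(dice_loss(pred, gt), reconstruction_loss(pred, gt))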
Generative adversarial networks (GANs) have also been applied to medical imaging, particularly
for enhancing image resolution or synthesizing realistic images from noisy inputs. A GAN consists
of a generator G and a discriminator D, where G learns to generate images G(z) from latent noise
z, and D distinguishes between real and fake images. The loss functions for G and D are given by
LD = −E[log D(x)] − E[log(1 − D(G(z)))], (17.45)
LG = −E[log D(G(z))]. (17.46)
Multi-modal imaging, where data from different modalities (e.g., MRI and PET) are combined,
further highlights the utility of CNNs. For instance, feature maps from MRI and PET images are
concatenated at intermediate layers to exploit complementary information, improving diagnostic
accuracy. Attention mechanisms are often incorporated to focus on the most relevant regions of
the image. For example, a spatial attention map As can be computed as
As = σ(W2 · ReLU(W1 · F + b1 ) + b2 ), (17.47)
where F is the input feature map, W1 and W2 are learnable weight matrices, and b1 and b2 are
biases. Despite their success, CNNs in medical imaging face challenges, including data scarcity
and interpretability. Transfer learning addresses data scarcity by fine-tuning pre-trained models
on small medical datasets. Techniques such as Grad-CAM provide interpretability by visualiz-
ing regions that influence the network’s predictions. Mathematically, Grad-CAM computes the
importance of a feature map Ak for a class c as
αk^c = (1/Z) ∑_{i,j} ∂y^c / ∂A^k_{i,j}, (17.48)
where y^c is the score for class c and Z is a normalization constant. The class activation map is then obtained as

L^c_{Grad-CAM} = ReLU( ∑_k αk^c A^k ). (17.49)
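A sketch of Eqs. (17.48)-(17.49), assuming the last-layer feature maps A^k and the gradients ∂y^c/∂A^k have already been extracted from the network as K x H x W arrays (how they are extracted depends on the framework):

import numpy as np

def grad_cam(feature_maps, gradients):
    # feature_maps, gradients: arrays of shape (K, H, W).
    K, H, W = feature_maps.shape
    alphas = gradients.reshape(K, -1).mean(axis=1)    # Eq. (17.48), Z = H * W
    cam = np.tensordot(alphas, feature_maps, axes=1)  # sum_k alpha_k^c * A^k
    return np.maximum(cam, 0.0)                       # final ReLU of Eq. (17.49)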
In summary, CNNs have transformed medical imaging by enabling automated and highly accu-
rate analysis of complex medical images. Their applications span disease detection, segmentation,
reconstruction, and multi-modal imaging, with continued advancements addressing challenges in
data efficiency and interpretability. Their mathematical foundations and computational frameworks
provide a robust basis for future innovations in this critical field.
Literature Review: Ojala and Zhou (2024) [324] proposed a CNN-based approach for detecting
and estimating object distances from thermal images in autonomous driving. They developed a
deep convolutional model for distance estimation using a single thermal camera and introduced the-
oretical formulations for thermal imaging data preprocessing within CNN pipelines. Popordanoska
and Blaschko (2025) [325] investigated the mathematical underpinnings of CNN calibration in high-
risk domains, including autonomous vehicles. They analyzed the confidence calibration problem in
CNNs used for self-driving perception and developed a Bayesian-inspired regularization approach to
improve CNN decision reliability in autonomous driving. Alfieri et. al. (2024) [326] explored deep
reinforcement learning (DRL) methods with CNNs for optimizing route planning in autonomous
vehicles. They bridged CNN-based vision models with Deep Q-Learning, enabling adaptive path
optimization in real-world driving conditions and established a novel theoretical connection be-
tween Q-learning and CNN-based object detection for autonomous navigation. Zanardelli (2025)
[327] examined decision-making frameworks using CNNs in autonomous vehicle systems. He devel-
oped a statistical model integrating CNNs with reinforcement learning to improve self-driving car
decision-making and provided a rigorous probabilistic analysis of how CNNs handle uncertainty
in real-world driving environments. Norouzi et. al. (2025) [328] analyzed the role of transfer
learning in CNN models for autonomous vehicle perception. They introduced pre-trained CNNs
for vehicle object detection using multi-sensor data fusion and provided a rigorous theoretical jus-
tification for integrating Kalman filtering and Dempster-Shafer theory with CNNs. Wang et. al.
(2024) [329] investigated the mathematics of uncertainty quantification in CNN-based perception
models for self-driving cars. They used Bayesian CNNs to model uncertainty in semantic segmen-
tation for autonomous driving and proposed a Dempster-Shafer theory-based fusion mechanism
for combining multiple CNN outputs. Xia et. al. [330] integrated CNN-based perception models
with reinforcement learning (RL) to improve autonomous vehicle trajectory tracking. They used CNNs for lane detection and integrated them into an RL-based path planner. They also established a theoretical framework linking CNN-based scene recognition to control theory. Liu et. al.
(2024) [331] introduced a CNN-based multi-view feature extraction framework for spatial-temporal
analysis in self-driving cars. They developed a hybrid CNN-graph attention model to extract
temporal driving patterns. They also made theoretical advancements in multi-view learning and
feature fusion for CNNs in autonomous vehicle decision-making. Chakraborty and Deka (2025)
[332] applied CNN-based multimodal sensor fusion to autonomous vehicles and UAVs for real-time
navigation. They provided a theoretical analysis of CNN feature fusion mechanisms for real-time per-
ception and developed mask region-based CNNs (Mask-RCNNs) for enhanced object recognition
in autonomous navigation. Mirindi et. al. (2025) [333] investigated the role of CNNs and AI in
smart autonomous transportation. They provided a theoretical discussion of the Unified Theory of AI
Adoption in autonomous driving and introduced hybrid Recurrent Neural Networks (RNNs) and
CNN architectures for vehicle trajectory prediction.
At the core of these CNN-based perception systems is the convolution operation, defined in the continuous domain as

s(t) = ∫_{−∞}^{∞} x(τ) w(t − τ) dτ, (17.50)
where x(τ ) represents the input, w(t − τ ) is the filter or kernel, and s(t) is the output feature. In
the discrete domain, especially for image processing, this operation can be written as:
S(i, j) = ∑_{m=−k}^{k} ∑_{n=−k}^{k} X(i + m, j + n) · W(m, n), (17.51)
where X(i, j) denotes the pixel intensity at coordinate (i, j) of the input image, and W (m, n)
represents the convolutional kernel values. This operation enables the detection of local patterns
such as edges, corners, or textures, which are then aggregated across layers to recognize complex
features like shapes and objects. In the context of autonomous vehicles, CNNs process sensor data
from cameras, LiDAR, and radar to identify critical features such as other vehicles, pedestrians,
road signs, and lane boundaries. For object detection, CNN-based architectures such as YOLO
(You Only Look Once) and Faster R-CNN employ a backbone network like ResNet, which uses
successive convolutional layers to extract hierarchical features from the input image. The object
detection task involves two primary outputs: bounding box coordinates and object class probabil-
ities. Mathematically, bounding box regression is modeled as a multi-task learning problem. The
loss function for bounding box regression is often formulated as:
Lreg = ∑_{i=1}^{N} ∑_{j∈{x,y,w,h}} SmoothL1(tij − t̂ij), (17.52)
where tij and t̂ij are the ground-truth and predicted bounding box parameters (e.g., center coordi-
nates x, y and dimensions w, h). Simultaneously, the classification loss, typically cross-entropy, is
computed as:
Lcls = − ∑_{i=1}^{N} ∑_{c=1}^{C} yi,c log(pi,c), (17.53)
where yi,c is a binary indicator for whether the object at index i belongs to class c, and pi,c is the
predicted probability. The total loss function is a weighted combination:
Ltotal = αLreg + βLcls . (17.54)
Semantic segmentation, another critical task, requires pixel-level classification to assign a label
(e.g., road, vehicle, pedestrian) to each pixel in an image. Fully Convolutional Networks (FCNs)
or U-Net architectures are commonly used for this purpose. These architectures utilize an encoder-
decoder structure where the encoder extracts spatial features, and the decoder reconstructs the
spatial resolution to generate pixel-wise predictions. The loss function for semantic segmentation
is a sum over all pixels and classes, given as:
L = − ∑_{i=1}^{N} ∑_{c=1}^{C} yi,c log(pi,c), (17.55)
where yi,c is the ground-truth binary label for pixel i and class c, and pi,c is the predicted prob-
ability. Advanced architectures also employ skip connections to preserve high-resolution spatial
information, enabling sharper segmentation boundaries.
Depth estimation is essential for autonomous vehicles to understand the 3D structure of their
surroundings. CNNs are used to predict depth maps from monocular images or stereo pairs. The
depth estimation process is modeled as a regression problem, where the loss function is designed to
minimize the difference between the predicted depth dˆi and the ground-truth depth di . A commonly
used loss function for this task is the scale-invariant loss:
Lscale-inv = (1/n) ∑_{i=1}^{n} (log di − log d̂i)² − (1/n²) ( ∑_{i=1}^{n} (log di − log d̂i) )². (17.56)
This loss ensures that the relative depth differences are minimized, which is critical for accurate
3D reconstruction. Lane detection, another critical application, uses CNNs to detect road lanes
and boundaries. The task often involves predicting the lane markings as polynomial curves. CNNs
process the input image to extract lane features, and post-processing involves fitting a curve, such
as:
y = ax2 + bx + c, (17.57)
where a, b, c are the coefficients predicted by the network. The fitting process minimizes an error
function, typically the sum of squared differences between the detected lane points and the curve:
E = ∑_{i=1}^{N} ( yi − (a xi² + b xi + c) )². (17.58)
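Because minimizing E in Eq. (17.58) is an ordinary least-squares problem, the curve fit reduces to a call to np.polyfit; the lane points below are hypothetical:

import numpy as np

xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])     # lane points extracted by the CNN
ys = np.array([0.1, 1.2, 4.1, 8.9, 16.2])

a, b, c = np.polyfit(xs, ys, deg=2)          # coefficients of y = a x^2 + b x + c
residual = np.sum((ys - (a * xs ** 2 + b * xs + c)) ** 2)  # the value of E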
In autonomous vehicles, these CNN tasks are integrated into an end-to-end pipeline. The input
data from cameras, LiDAR, and radar is first processed using CNNs to extract features relevant to
the vehicle’s perception. The outputs, including object detections, semantic maps, depth maps, and
lane boundaries, are then passed to the planning module, which computes the vehicle’s trajectory.
For instance, detected objects provide information about obstacles, while lane boundaries guide
path planning algorithms. The planning process involves solving optimization problems where
the objective function incorporates constraints from the CNN outputs. For example, a trajectory
optimization problem may minimize a cost function:
J = ∫_{0}^{T} ( w1 ẋ² + w2 ẏ² + w3 c(t) ) dt, (17.59)
where ẋ and ẏ are the lateral and longitudinal velocities, and c(t) is a collision penalty based on
object detections.
In conclusion, CNNs provide the computational framework for perception tasks in autonomous
vehicles, enabling real-time interpretation of complex sensory data. By leveraging mathematical
principles of convolution, loss optimization, and hierarchical feature extraction, CNNs transform
raw sensor data into actionable insights, paving the way for safe and efficient autonomous naviga-
tion.
Literature Review: The original ResNet work of He et al. established the theoretical backbone of identity mappings for deep optimization. Si-
monyan and Zisserman (2014) [149] presented the VGG architecture, which demonstrates how
depth improvement enhances feature extraction. They developed the theoretical formulation of
increasing CNN depth and its impact on feature hierarchies and provided an analytical framework
for receptive field expansion in deep CNNs. Krizhevsky et. al. (2012) [338] introduced AlexNet,
the first CNN model to achieve state-of-the-art performance in ImageNet classification. They intro-
duced ReLU activation as a breakthrough in CNN training and established dropout regularization
theory, preventing overfitting in deep networks. Sultana et. al. (2019) [339] compared the feature
extraction strategies of AlexNet, VGG, and ResNet for object recognition. They gave theoretical
explanation of hierarchical feature learning in CNN architectures and examined VGG’s use of small
convolutional filters and how it impacts feature map depth. Sattler et. al. (2019) [340] investi-
gated the fundamental limitations of CNN architectures such as AlexNet, VGG, and ResNet. They
established formal constraints on convolutional filters in CNNs and developed a theoretical model
for CNN generalization error in classification tasks.
17.4.1 AlexNet
The AlexNet Convolutional Neural Network (CNN) is a deep learning model that operates
on raw pixel values to perform image classification. Given an input image, represented as a 3D
tensor I0 ∈ RH×W ×C , where H is the height, W is the width, and C represents the number of input
channels (typically C = 3 for RGB images), the network performs a series of operations, such as
convolutions, activation functions, pooling, and fully connected layers, to transform this input into
a final output vector y ∈ RK , where K is the number of output classes. The objective of AlexNet
is to minimize a loss function that measures the discrepancy between the predicted output and the
true label, typically using the cross-entropy loss function.
At the heart of AlexNet’s architecture are the convolutional layers, which are designed to
learn local patterns in the image by convolving a set of filters over the input image. Specifi-
cally, the first convolutional layer performs a convolution of the input image I0 with a set of filters
W_1^{(k)} ∈ R^{F1×F1×C}, where F1 is the size of the filter and C is the number of channels in the input. The convolution operation for a given filter W_1^{(k)} and input image I0 at position (i, j) is defined as:

Y_1^{(k)}(i, j) = ∑_{u=1}^{F1} ∑_{v=1}^{F1} ∑_{c=1}^{C} W_1^{(k)}(u, v, c) · I0(i + u − 1, j + v − 1, c) + b_1^{(k)} (17.60)

where b_1^{(k)} is the bias term for the k-th filter, and the output of this convolution is a feature map Y_1^{(k)}(i, j) that captures the response of the filter at each spatial location (i, j). The result of this convolution operation is a set of feature maps Y_1^{(k)} ∈ R^{H′×W′}, where the dimensions of the output are H′ = H − F1 + 1 and W′ = W − F1 + 1 if no padding is applied. Subsequent to the convolutional operation, the output feature maps Y_1^{(k)} are passed through a ReLU (Rectified Linear Unit) activation function, which introduces non-linearity into the network. The ReLU function is defined as:
ReLU(z) = max(0, z) (17.61)
This function transforms negative values in the feature map Y_1^{(k)} into zero, while leaving positive values unchanged, thus allowing the network to model complex, non-linear patterns in the data. The output of the ReLU activation function is denoted by A_1^{(k)}(i, j) = ReLU(Y_1^{(k)}(i, j)). Following the activation function, a max-pooling operation is performed to downsample the feature maps and reduce their spatial dimensions. Given a pooling window of size P × P, the max-pooling operation computes the maximum value in each window, which is mathematically expressed as:

Y_1^{pool}(i, j) = max { A_1^{(k)}(i′, j′) : (i′, j′) ∈ pooling window } (17.62)
where A_1^{(k)} is the feature map after ReLU, and the resulting pooled output Y_1^{pool}(i, j) has reduced spatial dimensions, typically H″ = H′/P and W″ = W′/P. This operation helps retain the most
important features while discarding irrelevant spatial details, which makes the network more robust
to small translations in the input image. The convolutional and pooling operations are repeated
across multiple layers, with each layer learning progressively more complex patterns from the input
data. In the second convolutional layer, for example, we convolve the feature maps from the first
layer, A_1, with a new set of filters W_2^{(k)} ∈ R^{F2×F2×K1}, where K1 is the number of feature maps produced by the first convolutional layer. The convolution for the second layer is expressed as:

Y_2^{(k)}(i, j) = ∑_{u=1}^{F2} ∑_{v=1}^{F2} ∑_{c=1}^{K1} W_2^{(k)}(u, v, c) · A_1^{(c)}(i + u − 1, j + v − 1) + b_2^{(k)} (17.63)
This process is iterated for each subsequent convolutional layer, where each new set of filters learns
higher-level features, such as edges, textures, and object parts. The activation maps produced
by each convolutional layer are passed through the ReLU activation function, and max-pooling is
applied after each convolutional layer to reduce the spatial dimensions.
After the last convolutional layer, the feature maps are flattened into a 1D vector af ∈ RN , where
N is the total number of activations across all channels and spatial dimensions. This flattened
vector is then passed to fully connected (FC) layers for classification. Each fully connected
layer performs a linear transformation, followed by a non-linear activation. The output of the i-th
neuron in the fully connected layer is given by:
zi = ∑_{j=1}^{N} Wij · af(j) + bi (17.64)
j=1
where Wij is the weight connecting neuron j in the previous layer to neuron i in the current layer,
and bi is the bias term. The output of the fully connected layer is a vector of class scores z ∈ RK ,
which represents the unnormalized log-probabilities of the input image belonging to each class. To
convert these scores into a valid probability distribution, the softmax function is applied:
σ(zi) = e^{zi} / ∑_{j=1}^{K} e^{zj} (17.65)
The softmax function ensures that the output values are in the range [0, 1] and sum to 1, thus
representing a probability distribution over the K classes. The final output of the network is a
probability vector ŷ ∈ RK , where each element ŷi corresponds to the predicted probability that
the input image belongs to class i. To train the AlexNet model, the network minimizes the cross-
entropy loss function between the predicted probabilities ŷ and the true labels y, which is given
by:
L = − ∑_{i=1}^{K} yi log(ŷi) (17.66)
where yi is the true label (1 if the image belongs to class i, 0 otherwise), and ŷi is the predicted
probability for class i. The goal of training is to adjust the weights W and biases b in the network
to minimize this loss. The parameters of the network are updated using gradient descent. To
compute the gradients, the backpropagation algorithm is used. The gradient of the loss with
respect to the weights W in a fully connected layer is given by:
∂L/∂W = (∂L/∂z) · (∂z/∂W) (17.67)

where ∂L/∂z is the gradient of the loss with respect to the output of the layer, and ∂z/∂W is the gradient of the output with respect to the weights. These gradients are then used to update the weights via gradient descent:

W ← W − η ∇_W L (17.68)

where η is the learning rate.
Regularization techniques such as dropout are often applied to prevent overfitting during training.
Dropout involves randomly setting a fraction of the activations to zero during each training step,
which helps prevent the network from relying too heavily on any one feature and encourages the
model to learn more robust features. Once trained, the AlexNet model can be used to classify
new images by passing them through the network and selecting the class with the highest proba-
bility. The combination of convolutional layers, ReLU activations, pooling, fully connected layers,
and softmax activation makes AlexNet a powerful and efficient architecture for image classification
tasks.
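A minimal sketch of dropout in the inverted-dropout convention; the 1/(1 − p) rescaling of the survivors, which keeps the expected activation unchanged between training and inference, is a common implementation choice rather than something specified in the text:

import numpy as np

def dropout(a, p=0.5, training=True, rng=None):
    # Zero each activation with probability p during training only.
    if not training:
        return a
    rng = np.random.default_rng() if rng is None else rng
    mask = (rng.random(a.shape) >= p).astype(a.dtype)
    return a * mask / (1.0 - p)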
17.4.2 ResNet
At the heart of the ResNet architecture lies the notion of residual learning, where instead of learning
the direct transformation y = f (x; W), the network learns the residual function F(x, W), i.e., the
difference between the output and the input. The network output y can therefore be expressed as:
y = F(x; W) + x (17.69)
This formulation represents the core difference from traditional neural networks where the model
learns a mapping directly from the input x to the output y. The introduction of the identity short-
cut connection x introduces a powerful mechanism by which the network can learn the residual,
and if the optimal underlying transformation is the identity function, the network can simply drive the residual toward zero, yielding y = x and easing optimization. This reduces the challenge of training deeper networks, where deep
layers often lead to vanishing gradients, because the gradient can propagate directly through these
shortcuts, bypassing intermediate layers.
Let’s formalize this residual learning. Let the input to the residual block be xl and the output
yl . In a conventional neural network, the transformation from input to output at the l-th layer
would be:
yl = F(xl ; Wl ) (17.70)
where F represents the function learned by the layer, parameterized by Wl . In contrast, for ResNet,
the output is the sum of the learned residual function F(xl ; Wl ) and the input xl itself, yielding:
yl = F(xl ; Wl ) + xl (17.71)
This addition of the identity shortcut connection enables the network to bypass layers if needed,
facilitating the learning process and addressing the vanishing gradient issue. To formalize the
optimization problem, we define the residual learning objective as the minimization of the loss
function L with respect to the parameters Wl :
L = ∑_{i=1}^{N} Li(yi, ti) (17.72)
i=1
where N is the number of training samples, ti are the target outputs, and Li is the loss for the i-th
sample. The training process involves adjusting the parameters Wl via gradient descent, which
in turn requires the gradients of the loss function with respect to the network parameters. The
gradient of L with respect to Wl can be expressed as:
∂L/∂Wl = ∑_{i=1}^{N} (∂Li/∂yi) · (∂yi/∂Wl) (17.73)
Since the residual block adds the input directly to the output, the derivative of the output with
respect to the weights Wl is given by:
∂yl/∂Wl = ∂F(xl; Wl)/∂Wl (17.74)
Now, let’s explore how this addition of the residual connection directly influences the backpropa-
gation process. In a traditional feedforward network, the backpropagated gradients for each layer
depend solely on the output of the preceding layer. However, in a residual network, the gradient
flow is enhanced because the identity mapping xl is directly passed to the subsequent layer. This
ensures that the gradients will not be lost as the network deepens, a phenomenon that becomes
critical in very deep networks. The gradient with respect to the loss L at layer l is:
∂L/∂xl = (∂L/∂yl) · (∂yl/∂xl) (17.75)
Since yl = F(xl ; Wl ) + xl , the derivative of yl with respect to xl is:
∂yl/∂xl = I + ∂F(xl; Wl)/∂xl (17.76)

where I is the identity matrix. This ensures that the gradient ∂L/∂xl can propagate more easily
through the network, as it is now augmented by the identity matrix term. Thus, this term helps
preserve the gradient’s magnitude during backpropagation, solving the vanishing gradient problem
that typically arises in deep networks. Furthermore, to ensure that the dimensions of the input
and output of a residual block match, especially when the number of channels changes, ResNet
introduces projection shortcuts. These are used when the dimensionality of xl and yl do not align,
typically through a 1 × 1 convolution. The projection shortcut modifies the residual block’s output
to be:
yl = F(xl ; Wl ) + Wx · xl (17.77)
where Wx is a convolutional filter, and F(xl ; Wl ) is the residual transformation. The introduction
of the 1×1 convolution ensures that the input xl is mapped to the appropriate dimensionality, while
still benefiting from the residual learning framework. The ResNet architecture can be extended by
stacking multiple residual blocks. For a network with L layers, the output after passing through
the entire network can be written recursively as:
y^(L) = y^(L−1) + F(y^(L−1); WL) (17.78)
where y(L−1) is the output after L − 1 layers. The recursive nature of this formula ensures that
the network’s output is built layer by layer, with each layer contributing a transformation rela-
tive to the input passed to it. Mathematically, the gradient of the loss function with respect to
the parameters in deep residual networks can be expressed recursively, where each layer’s gradient
involves contributions from the identity shortcut connection. This facilitates the training of very
deep networks by maintaining a stable and consistent flow of gradients during the backpropagation
process.
Thus, the Residual Neural Network (ResNet) significantly improves the trainability of deep neural
networks by introducing residual learning, allowing the network to focus on learning the difference
between the input and output rather than the entire transformation. This approach, combined
with identity shortcut connections and projection shortcuts for dimensionality matching, ensures
that gradients flow effectively through the network, even in very deep architectures. The resulting
ResNet architecture has been proven to enable the training of networks with hundreds of layers,
yielding impressive performance on a wide range of tasks, from image classification to semantic
segmentation, while mitigating issues such as vanishing gradients. Through its recursive structure
and rigorous mathematical formulation, ResNet has become a foundational architecture in modern
deep learning.
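A toy sketch of a residual block in the sense of Eqs. (17.71) and (17.77), simplified to fully connected weights so that the identity and projection shortcuts are explicit; the two-layer residual branch is an illustrative stand-in for the convolutional blocks described above:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def residual_block(x, W1, W2, W_proj=None):
    # y = F(x; W) + x (Eq. 17.71); W_proj plays the role of the 1x1
    # projection shortcut of Eq. (17.77) when dimensions differ.
    f = W2 @ relu(W1 @ x)                        # residual branch F(x; W)
    shortcut = x if W_proj is None else W_proj @ x
    return relu(f + shortcut)

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
W1 = 0.1 * rng.standard_normal((16, 16))
W2 = 0.1 * rng.standard_normal((16, 16))
y = residual_block(x, W1, W2)                    # identity shortcut, same dims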
17.4.3 VGG
The Visual Geometry Group (VGG) Convolutional Neural Network (CNN), introduced
by Simonyan and Zisserman in 2014, presents a detailed exploration of the effect of depth on the
performance of deep neural networks, specifically within the context of computer vision tasks such
as image classification. The VGG architecture is grounded in the hypothesis that deeper networks,
when constructed with small, consistent convolutional kernels, are more capable of capturing hi-
erarchical patterns in data, particularly in the domain of visual recognition. In contrast to other
CNN architectures, VGG prioritizes the usage of small 3 × 3 convolution filters (with a stride of 1)
stacked in increasing depth, rather than relying on larger filters (e.g., 5 × 5 or 7 × 7), thus offering
computational benefits without sacrificing representational power. This design choice inherently
encourages sparse local receptive fields, which ensures a richer learning capacity when extended
across deeper layers.
Let I ∈ RH×W ×C represent an input image of height H, width W , and C channels, where the
channels correspond to different color representations (e.g., RGB for C = 3). For the convolution
operation applied at a particular layer k, the output feature map O(k) can be computed by con-
volving the input I with a set of kernels K (k) corresponding to the k-th layer. The convolution for
each spatial location i, j can be described as:
O_{i,j}^{(k)} = ∑_{u=1}^{kh} ∑_{v=1}^{kw} ∑_{c′=1}^{Cin} K_{u,v,c′,c}^{(k)} I_{i+u, j+v, c′} + b_c^{(k)} (17.79)
where O_{i,j}^{(k)} is the output value at location (i, j) of the feature map for the k-th filter, K_{u,v,c′,c}^{(k)} is the u, v-th spatial element of the c′-to-c filter in layer k, and b_c^{(k)} represents the bias term for the output channel c. The convolutional layer’s kernel K^{(k)} is typically initialized with small values
and learned during training, while the bias b(k) is added to shift the activation of the neuron. A key
aspect of the VGG architecture is that these convolution layers are consistently followed by non-
linear ReLU (Rectified Linear Unit) activation functions, which introduce local non-linearity
to the model. The ReLU function is mathematically defined as:

ReLU(x) = max(0, x) (17.80)

This transformation is applied element-wise, ensuring that negative values are mapped to zero,
which, as an effect, activates only positive feature responses. The non-linearity introduced by
ReLU aids the network in learning complex patterns and overcoming issues such as vanishing
gradients that often arise in deeper networks. In VGG, the network is constructed by stacking
these convolutional layers with ReLU activations. Each convolution layer is followed by max-
pooling operations, typically with 2 × 2 filters and a stride of 2. Max-pooling reduces the spatial
dimensions of the feature maps and extracts the most significant features from each region of the
image. The max-pooling operation is mathematically expressed as:

O_{i,j} = max_{(u,v)∈P} I_{i+u, j+v} (17.81)

where P is the pooling window, and O_{i,j} is the pooled value at position (i, j). The pooling oper-
ation performs downsampling, ensuring translation invariance while retaining the most prominent
features. The effect of this pooling operation is to reduce computational complexity, lower the
number of parameters, and make the network invariant to small translations and distortions in
the input image. The architecture of VGG typically culminates in a series of fully connected
(FC) layers after several convolutional and pooling layers have extracted relevant features from
the input image. Let the output of the final convolutional layer, after flattening, be denoted as
X ∈ Rd , where d represents the dimensionality of the feature vector obtained by flattening the last
convolutional feature map. The fully connected layers then transform this vector into the output,
as expressed by:
O = WX + b (17.82)
where W ∈ R^{d′×d} is the weight matrix of the fully connected layer, b ∈ R^{d′} is the bias vector, and O ∈ R^{d′} is the output vector. The output vector O represents the unnormalized scores for each
of the d′ possible classes in a classification task. This is typically followed by the application of a
softmax function to convert these raw scores into a probability distribution:
σ(oi) = e^{oi} / ∑_{j=1}^{d′} e^{oj} (17.83)
where oi is the score for class i, and the softmax function ensures that the outputs are positive
and sum to one, facilitating their interpretation as class probabilities. This softmax function is
a crucial step in multi-class classification tasks as it normalizes the output into a probabilistic
format. During the training phase, the model minimizes the cross-entropy loss between the
predicted probabilities and the actual class labels, often represented as one-hot encoded vectors.
The cross-entropy loss is given by:
L = − ∑_{i=1}^{d′} yi log(pi) (17.84)
where yi is the true label for class i in one-hot encoded form, and pi is the predicted probability
for class i. This loss function is the appropriate objective for classification tasks, as it measures
the difference between the true and predicted probability distributions. The optimization of the
parameters in the VGG network is carried out using stochastic gradient descent (SGD) or its
variants. The weight update rule in gradient descent is:
W ← W − η∇W L (17.85)
where η is the learning rate, and ∇W L is the gradient of the loss with respect to the weights.
The gradient is computed through backpropagation, applying the chain rule of derivatives to
propagate errors backward through the network, updating the weights at each layer based on the
contribution of each parameter to the final output error.
A key advantage of the VGG architecture lies in its use of smaller, deeper layers compared to
previous networks like AlexNet, which used larger convolution filters. By using multiple small
kernels (such as 3 × 3), the VGG network can create richer representations without exponentially
increasing the number of parameters. The depth of the network, achieved by stacking these small
convolution filters, enables the model to extract increasingly abstract and hierarchical features from
the raw pixel data. Despite its success, VGG’s computational demands are relatively high due to
the large number of parameters, especially in the fully connected layers. The fully connected lay-
ers, which connect every neuron in one layer to every neuron in the next, account for a significant
portion of the model’s total parameters. To mitigate this limitation, later architectures, such as
ResNet, introduced skip connections, which allow gradients to flow more efficiently through the
network, thus enabling even deeper architectures without incurring the same computational costs.
Nevertheless, the VGG network set an important precedent in the design of deep convolutional net-
works, demonstrating the power of deep architectures and the effectiveness of small convolutional
filters. The model’s simplicity and straightforward design have influenced subsequent architectures,
reinforcing the notion that deeper models, when carefully constructed, can achieve exceptional per-
formance on complex tasks like image classification, despite the challenges posed by computational
cost and model complexity.
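The parameter saving from stacking small filters can be checked directly. The worked comparison below (the channel count 256 is an arbitrary illustrative choice) contrasts a single 5 x 5 or 7 x 7 convolution with the stacks of 3 x 3 convolutions that cover the same receptive field:

def conv_params(k, c_in, c_out):
    # Weights plus biases of one k x k convolution layer.
    return k * k * c_in * c_out + c_out

c = 256
print(conv_params(5, c, c), 2 * conv_params(3, c, c))  # 1638656 vs 1180160
print(conv_params(7, c, c), 3 * conv_params(3, c, c))  # 3211520 vs 1770240

Two stacked 3 x 3 layers cover a 5 x 5 region with roughly 28% fewer parameters, and three stacked layers cover a 7 x 7 region with roughly 45% fewer, while also interposing additional ReLU non-linearities.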
18 Recurrent Neural Networks (RNNs)
In a recurrent neural network, the hidden state ht ∈ Rm at time step t evolves as a function of the current input xt ∈ Rn and the previous hidden state ht−1. This evolution is
governed by the recurrence relation:

ht = fh (Wxh xt + Whh ht−1 + bh), (18.1)

where Wxh ∈ Rm×n is the input-to-hidden weight matrix, Whh ∈ Rm×m is the hidden-to-hidden weight matrix, bh ∈ Rm is the bias vector, and fh is a non-linear activation function, typically

tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}) (18.2)
or the rectified linear unit ReLU(x) = max(0, x). The recursive nature of this update equation
ensures that ht inherently encodes information about the sequence {x1 , x2 , . . . , xt }, allowing the
network to maintain a dynamic representation of past inputs. The output yt ∈ Ro at time t is
computed as:
yt = fy (Why ht + by ) , (18.3)
where Why ∈ Ro×m is the hidden-to-output weight matrix, by ∈ Ro is the output bias, and fy is
an activation function such as the softmax function:
fy(z)i = e^{zi} / ∑_{j=1}^{o} e^{zj} (18.4)
for classification tasks. Expanding the recurrence relation iteratively, the hidden state at time t can be expressed as:

ht = fh (Wxh xt + Whh fh (Wxh xt−1 + Whh fh (· · ·) + bh) + bh), (18.5)

which makes explicit that ht is a nested function of the entire input history. The gradient of the training loss L = ∑_{t=1}^{T} ℓt, where ℓt measures the prediction error at time step t, is computed through backpropagation through time (BPTT). The gradient of L with respect to Whh, for instance, is given by:

∂L/∂Whh = ∑_{t=1}^{T} ∑_{k=1}^{t} (∂ℓt/∂ht) ( ∏_{j=k+1}^{t} ∂hj/∂hj−1 ) (∂hk/∂Whh), (18.8)
where ∏_{j=k+1}^{t} ∂hj/∂hj−1 represents the chain of derivatives from time step k to t. Unlike feedforward
neural networks, where each input is processed independently, RNNs maintain a hidden state ht
that acts as a dynamic memory, evolving recursively as the input sequence progresses. Formally,
given an input sequence {x1 , x2 , . . . , xT }, where xt ∈ Rn represents the input vector at time t, the
hidden state ht ∈ Rm is updated via the recurrence relation:

ht = fh (Wxh xt + Whh ht−1 + bh), (18.9)

where Wxh ∈ Rm×n, Whh ∈ Rm×m, and bh ∈ Rm are learnable parameters, and fh is a nonlinear
activation function such as tanh or ReLU. The recursive structure inherently allows the hidden
state ht to encode the entire history of the sequence up to time t. The output yt ∈ Ro at each time
step is computed as:
yt = fy (Why ht + by ), (18.10)
where Why ∈ Ro×m and by ∈ Ro are additional learnable parameters, and fy is an optional output
activation function, such as the softmax function for classification. To elucidate the recursive
dynamics, we can expand ht explicitly in terms of the initial hidden state h0 and all previous
inputs {x1, . . . , xt}:

ht = fh (Wxh xt + Whh fh (Wxh xt−1 + Whh fh (· · ·) + bh) + bh). (18.11)

This nested structure highlights the temporal dependencies and the potential challenges in training,
such as the vanishing and exploding gradient problems. During training, the loss function L, which
aggregates the discrepancies between the predicted outputs yt and the ground truth yt^true, is typically defined as:

L = ∑_{t=1}^{T} ℓ(yt, yt^true), (18.12)
where ℓ is a task-specific loss function, such as the mean squared error (MSE)
ℓ(y, y^true) = (1/2) ∥y − y^true∥² (18.13)
for regression or the cross-entropy loss for classification. To optimize L, gradient-based methods
are employed, requiring the computation of derivatives of L with respect to all parameters, such as
Wxh , Whh , and bh . Using backpropagation through time (BPTT), the gradient of L with respect
to Whh is expressed as:
∂L/∂Whh = ∑_{t=1}^{T} ∑_{k=1}^{t} (∂ℓt/∂ht) ( ∏_{j=k+1}^{t} ∂hj/∂hj−1 ) (∂hk/∂Whh). (18.14)
Here,

∏_{j=k+1}^{t} ∂hj/∂hj−1 (18.15)

is the product of Jacobian matrices that encode the influence of hk on ht. The Jacobian ∂hj/∂hj−1 itself is given by:

∂hj/∂hj−1 = Whh ⊙ fh′(aj), (18.16)
where
aj = Wxh xj + Whh hj−1 + bh , (18.17)
and fh′ (aj ) denotes the elementwise derivative of the activation function. The repeated multipli-
cation of these Jacobians can lead to exponential growth or decay of the gradients, depending on
the spectral radius ρ(Whh ). Specifically, if ρ(Whh ) > 1, gradients explode, whereas if ρ(Whh ) < 1,
gradients vanish, severely hampering the training process for long sequences. To address these
issues, modifications such as Long Short-Term Memory (LSTM) networks and Gated Recurrent
Units (GRUs) introduce gating mechanisms that explicitly regulate the flow of information. In
LSTMs, the cell state ct , governed by additive dynamics, prevents vanishing gradients. The cell
state is updated as:
ct = ft ⊙ ct−1 + it ⊙ tanh(Wc xt + Uc ht−1 + bc ), (18.18)
where ft is the forget gate, it is the input gate, and Uc , Wc , and bc are learnable parameters.
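A single LSTM step implementing the cell-state update of Eq. (18.18) can be sketched as follows; the gate parameterization is the standard one, and packing the weights into a dictionary P is an illustrative convention:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, P):
    # P holds input weights W*, recurrent weights U*, and biases b* for the
    # forget (f), input (i), output (o) gates and the cell candidate (c).
    f = sigmoid(P['Wf'] @ x_t + P['Uf'] @ h_prev + P['bf'])   # forget gate
    i = sigmoid(P['Wi'] @ x_t + P['Ui'] @ h_prev + P['bi'])   # input gate
    o = sigmoid(P['Wo'] @ x_t + P['Uo'] @ h_prev + P['bo'])   # output gate
    c_tilde = np.tanh(P['Wc'] @ x_t + P['Uc'] @ h_prev + P['bc'])
    c = f * c_prev + i * c_tilde                               # Eq. (18.18)
    h = o * np.tanh(c)                                         # new hidden state
    return h, c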
Sequence modeling in Recurrent Neural Networks (RNNs) represents a powerful framework for
capturing temporal dependencies in sequential data, enabling the learning of both short-term and
long-term patterns. The primary characteristic of RNNs lies in their recurrent architecture, where
the hidden state ht at time step t is updated as a function of both the current input xt and the
hidden state at the previous time step ht−1 . Mathematically, this recurrent relationship can be
expressed as:
ht = f (Wh ht−1 + Wx xt + bh ) (18.19)
Here, Wh and Wx are weight matrices corresponding to the previous hidden state ht−1 and the
current input xt , respectively, while bh is a bias term. The function f (·) is a non-linear activation
function, typically chosen as the hyperbolic tangent tanh or rectified linear unit (ReLU). The output
yt at each time step is derived from the hidden state ht through a linear transformation, followed
by a non-linear activation, yielding:
yt = g(Wy ht + by ) (18.20)
where Wy is the weight matrix connecting the hidden state to the output space, and by is the
associated bias term. The function g(·) is generally a softmax activation for classification tasks or
a linear activation for regression problems. The key feature of this structure is the interdependence
of the hidden state across time steps, allowing the model to capture the history of past inputs and
produce predictions that incorporate temporal context. Training an RNN involves minimizing a
loss function L, which quantifies the discrepancy between the predicted outputs yt and the true
outputs yttrue across all time steps. A common loss function used in classification tasks is the cross-
entropy loss, while regression tasks often utilize mean squared error. To optimize the parameters
of the network, the model employs Backpropagation Through Time (BPTT), a variant of the
standard backpropagation algorithm adapted for sequential data. The primary challenge in BPTT
arises from the recurrent nature of the network, where the hidden state at each time step depends
on the previous hidden state. The gradient of the loss function with respect to the hidden state at
time step t is computed recursively, reflecting the temporal structure of the model. The chain rule
is applied to compute the gradient of the loss with respect to the hidden state:
∂L/∂ht = (∂L/∂yt) · (∂yt/∂ht) + ∑_{t′=t+1}^{T} (∂L/∂ht′) · (∂ht′/∂ht) (18.21)

Here, ∂L/∂yt is the gradient of the loss with respect to the output, and ∂yt/∂ht represents the Jacobian of
the output with respect to the hidden state. The second term in this expression corresponds to the
accumulated gradients propagated from future time steps, incorporating the temporal dependencies
across the entire sequence. This recursive gradient calculation allows for updating the weights and
biases of the RNN, adjusting them to minimize the total error across the sequence. The gradients
of the loss function with respect to the parameters of the network, such as Wh , Wx , and Wy , are
computed using the chain rule. For example, the gradient of the loss with respect to Wx is:
∂L/∂Wx = ∑_{t=1}^{T} (∂L/∂ht) · xt⊤ (18.22)
This captures the contribution of each input to the overall error at all time steps, ensuring that the
model learns the correct relationships between inputs and hidden states. Similarly, the gradients
with respect to Wh and bh account for the recurrence in the hidden state, enabling the model
to adjust its internal parameters in response to the sequential nature of the data. Despite their
theoretical elegance, RNNs face significant practical challenges during training, primarily due to
the vanishing gradients problem. This issue arises when the gradients propagate through
many time steps, causing them to decay exponentially, especially when using activation functions
like tanh. As a result, the influence of distant time steps diminishes, making it difficult for the
network to learn long-term dependencies. The mathematical manifestation of this problem is seen
in the norm of the Jacobian matrices associated with the hidden state updates. If the spectral
radius of the weight matrices Wh is close to or greater than 1, the gradients can either vanish or
explode, leading to unstable training dynamics. To mitigate this issue, several solutions have been
proposed, including the use of Long Short-Term Memory (LSTM) networks and Gated Recurrent
Units (GRUs), which introduce gating mechanisms to better control the flow of information through
the network. LSTMs, for example, incorporate a memory cell Ct , which allows the network to store
information over long periods of time. The update rules for the LSTM are governed by three gates:
the forget gate ft , the input gate it , and the output gate ot , which control how much of the previous
memory and new information to retain. In the standard formulation, the equations governing the LSTM are:

ft = σ(Wf xt + Uf ht−1 + bf) (18.23)
it = σ(Wi xt + Ui ht−1 + bi) (18.24)
ot = σ(Wo xt + Uo ht−1 + bo) (18.25)
Ct = ft ⊙ Ct−1 + it ⊙ tanh(Wc xt + Uc ht−1 + bc) (18.26)
ht = ot ⊙ tanh(Ct) (18.27)

where σ is the logistic sigmoid, ⊙ denotes elementwise multiplication, and the W, U, and b terms are learnable parameters.
In summary, sequence modeling in RNNs involves a series of recurrent updates to the hidden
state, driven by both the current input and the previous hidden state, and is trained via backprop-
agation through time. The introduction of specialized gating mechanisms in LSTMs and GRUs
alleviates issues such as vanishing gradients, enabling the networks to learn and maintain long-term
dependencies. Through these advanced architectures, RNNs can effectively model complex tem-
poral relationships, making them powerful tools for tasks such as time-series prediction, natural
language processing, and sequence generation.
Literature Review: One recent study examines RNNs in healthcare applications. It highlights how RNNs can analyze patient records and medical reports
using NLP techniques. The study shows that RNN-based NLP models enhance medical diagnos-
tics and decision-making by extracting meaningful insights from unstructured text data. Petrov
et. al. (2025) [381] discussed the role of RNNs in emotion classification from textual data, an
essential NLP task. The paper evaluates various RNN-based architectures, including BiLSTMs, to
enhance the accuracy of emotion recognition in social media texts and chatbot responses. Liang
(2025) [382] focused on the application of RNNs in educational settings, specifically for automated
grading and feedback generation. The study presents an RNN-based NLP system capable of ana-
lyzing student responses, providing real-time assessments, and generating contextual feedback. Jin
(2025) [383] explored how RNNs optimize text generation tasks related to pharmaceutical edu-
cation. It demonstrates how NLP-powered RNN models generate high-quality textual summaries
from medical literature, ensuring accurate knowledge dissemination in the pharmaceutical indus-
try. McNicholas et. al. (2025) [384] investigated how RNNs facilitate clinical decision-making in
critical care by extracting insights from unstructured medical text. The research highlights how
RNN-based NLP models enhance patient care by predicting outcomes based on clinical notes and
physician reports. Abbas and Khammas (2024) [385] introduced an RNN-based NLP technique
for detecting malware in IoT networks. The study illustrates how RNN classifiers process logs
and textual patterns to identify malicious software, making RNNs crucial in cybersecurity appli-
cations. Kalonia and Upadhyay (2025) [386] applied RNNs to software fault prediction using NLP
techniques. It shows how recurrent networks analyze bug reports and software documentation to
predict potential failures in software applications, aiding developers in proactive debugging. Han
et. al. (2025) [387] discussed RNN applications in conversational AI, focusing on chatbots and
virtual assistants. The study presents an RNN-driven NLP model for improving dialogue manage-
ment and user interactions, significantly enhancing the responsiveness of AI-powered chat systems.
Recurrent Neural Networks (RNNs) are deep learning architectures that are explicitly designed
to handle sequential data, a key feature that makes them indispensable for applications in Natural
Language Processing (NLP). The mathematical foundation of RNNs lies in their ability to process
sequences of inputs, x1 , x2 , . . . , xT , where T denotes the length of the sequence. At each time step
t, the network updates its hidden state, ht , using both the current input xt and the previous hidden
state ht−1 . This recursive relationship is represented mathematically by the following equation:
ht = σ(Wh ht−1 + Wx xt + b) (18.33)
Here, σ is a nonlinear activation function such as the hyperbolic tangent (tanh) or the rectified
linear unit (ReLU), Wh is the weight matrix associated with the previous hidden state ht−1 , Wx
is the weight matrix associated with the current input xt , and b is a bias term. The nonlinearity
introduced by σ allows the network to learn complex relationships between the input and the
output. The output yt at each time step is computed as:
yt = Wy ht + c (18.34)
where Wy is the weight matrix corresponding to the output and c is the bias term for the output.
The output yt is then used to compute the predicted probability distribution over possible outputs
at each time step, typically through a softmax function for classification tasks:
P (yt |ht ) = softmax(Wy ht + c) (18.35)
In NLP tasks such as language modeling, the objective is to predict the next word in a sequence
given all previous words. The RNN is trained to estimate the conditional probability distribution
P (wt |w1 , w2 , . . . , wt−1 ) of the next word wt based on the previous words. The full likelihood of the
sequence w1 , w2 , . . . , wT can be written as:
P(w1, w2, . . . , wT) = ∏_{t=1}^{T} P(wt | w1, w2, . . . , wt−1) (18.36)
For an RNN, this conditional probability is modeled by recursively updating the hidden state and
generating a probability distribution for each word. At each time step, the probability of the next
word is computed as:
P (wt |ht−1 ) = softmax(Wy ht + c) (18.37)
The network is trained by minimizing the negative log-likelihood of the true word sequence:
L = − ∑_{t=1}^{T} log P(wt | ht−1) (18.38)
This loss function guides the optimization of the weight matrices Wh , Wx , and Wy to maximize the
likelihood of the correct word sequences. As the network learns from large datasets, it develops the
ability to predict words based on the context provided by previous words in the sequence. A key
extension of RNNs in NLP is machine translation, where one sequence of words in one language
is mapped to another sequence in a target language. This is typically modeled using sequence-to-
sequence (Seq2Seq) architectures, which consist of two RNNs: the encoder and the decoder. The
encoder RNN processes the input sequence x1 , x2 , . . . , xT , updating its hidden state at each time
step:
h_t^enc = σ(W_h^enc h_{t−1}^enc + W_x^enc xt + b^enc) (18.39)
The final hidden state h_T^enc of the encoder is passed to the decoder as its initial hidden state. The
decoder RNN generates the target sequence y1 , y2 , . . . , yT by updating its hidden state at each time
step, using both the previous hidden state h_{t−1}^dec and the previous output yt−1:

h_t^dec = σ(W_h^dec h_{t−1}^dec + W_x^dec yt−1 + b^dec) (18.40)
The decoder produces a probability distribution over the target vocabulary at each time step:

P(yt | h_t^dec) = softmax(Wy h_t^dec + c) (18.41)

The training of the Seq2Seq model is based on minimizing the cross-entropy loss function:

L = − ∑_{t=1}^{T} log P(yt | h_t^dec) (18.42)
t=1
This ensures that the network learns to map input sequences to output sequences. By training
on a large corpus of paired sentences, the Seq2Seq model learns to translate sentences from one
language to another, with the encoder capturing the context of the input sentence and the decoder
generating the translated sentence.
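A minimal sketch of the encoder-decoder recurrences of Eqs. (18.39)-(18.41); greedy argmax emission and one-hot feedback of the previous prediction are simplifying assumptions standing in for learned embeddings and beam search:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def encode(xs, Wh, Wx, b):
    # Encoder recurrence of Eq. (18.39); the final h summarizes the source.
    h = np.zeros(Wh.shape[0])
    for x in xs:
        h = np.tanh(Wh @ h + Wx @ x + b)
    return h

def decode(h, Wh, Wfeed, b, Wy, c, y0, steps):
    # Decoder recurrence of Eq. (18.40) with emission per Eq. (18.41).
    y, out = y0, []
    for _ in range(steps):
        h = np.tanh(Wh @ h + Wfeed @ y + b)
        p = softmax(Wy @ h + c)              # distribution over the vocabulary
        out.append(int(p.argmax()))          # greedy choice
        y = np.eye(Wy.shape[0])[out[-1]]     # one-hot feedback of the prediction
    return out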
RNNs are also effective in sentiment analysis, a task where the goal is to classify the sentiment
of a sentence (positive, negative, or neutral). Given a sequence of words x1 , x2 , . . . , xT , the RNN
processes each word sequentially, updating its hidden state:

ht = σ(Wh ht−1 + Wx xt + b) (18.43)

After processing the entire sentence, the final hidden state hT is used to classify the sentiment. The
output is obtained by applying a softmax function to the final hidden state:
y = softmax(Wy hT + c) (18.44)
where Wy is the weight matrix associated with the output layer. The network is trained to minimize
the cross-entropy loss:
L = − log P(y | hT) (18.45)
This allows the RNN to classify the overall sentiment of the sentence by learning the relationships
between words and sentiment labels. Sentiment analysis is useful for applications such as customer
feedback analysis, social media monitoring, and opinion mining. In Named Entity Recognition
(NER), RNNs are used to identify and classify named entities, such as people, locations, and
organizations, in a text. The RNN processes each word xt in the sequence, updating its hidden
state at each time step:
h_t = \sigma(W_h h_{t-1} + W_x x_t + b) \quad (18.46)
The output at each time step is a probability distribution over possible entity labels:
P(y_t \mid h_t) = \mathrm{softmax}(W_y h_t + c) \quad (18.47)
By learning to classify each word with the appropriate entity label, the RNN can perform infor-
mation extraction tasks, such as identifying people, organizations, and locations in text. This is
crucial for applications such as document categorization, knowledge graph construction, and ques-
tion answering. In speech recognition, RNNs are used to transcribe spoken language into text.
The input to the RNN consists of a sequence of acoustic features, such as Mel-frequency cepstral
coefficients (MFCCs), which are extracted from the audio signal. At each time step t, the RNN
updates its hidden state:
h_t = \sigma(W_h h_{t-1} + W_x x_t + b) \quad (18.49)
The output at each time step is a probability distribution over phonemes or words:
P(y_t \mid h_t) = \mathrm{softmax}(W_y h_t + c) \quad (18.50)
By learning the mapping between acoustic features and corresponding words or phonemes, the
RNN can transcribe speech into text, which is fundamental for applications such as voice assis-
tants, transcription services, and speech-to-text systems.
In summary, RNNs are powerful tools for processing sequential data in NLP tasks such as ma-
chine translation, sentiment analysis, named entity recognition, and speech recognition. Their
ability to process input sequences in a time-dependent manner allows them to capture long-range
dependencies, making them well-suited for complex tasks in NLP and beyond. However, chal-
lenges such as the vanishing and exploding gradient problems necessitate the use of more advanced
architectures, like LSTMs and GRUs, to enhance their performance in real-world applications.
The Collatz map is defined by T(n) = n/2 if n is even and T(n) = 3n + 1 if n is odd, for any positive integer n. The conjecture asserts that for all n \in \mathbb{N}, repeated iteration of T(n)
eventually results in 1, after which the sequence enters the cycle 1 → 4 → 2 → 1. Despite its simple
formulation, proving this statement for all n has remained elusive. Several approaches attempt to
model the behavior of the Collatz sequence using transformations and modular arithmetic. Notably,
Rahn et al. (2021) [884] proposed an algorithm that linearizes Collatz convergence by introducing
arithmetic rules that govern the entire dynamical system. This approach provides a structured way
to analyze sequence behavior, which can be beneficial in training deep learning models for discrete
dynamical problems. Cox et al. (2021) [698] extended the analysis of the classic Collatz conjecture
to generalized sequences defined by the recurrence relation 3n + c, where c is an odd integer not
divisible by 3. The authors build upon techniques introduced by Halbeisen and Hungerbühler to
analyze rational Collatz cycles, applying these methods to the more general 3n + c sequences. In
these generalized sequences, cycles are categorized based on their length and the number of odd
elements they contain. The study introduces the concept of parity vectors—sequences of 0s and
1s representing even and odd elements, respectively—to determine the smallest and largest odd
elements within these cycles. The authors observe that these smallest and largest odd elements
often belong to the same cycle, suggesting that the parity vector generated by the floor function
can be rotated to match that generated by the ceiling function. This rotation involves two linear
congruences, and the natural numbers produced by one of these congruences appear to be uniformly
distributed after sorting, exhibiting properties similar to the zeros of the Riemann zeta function.
The paper provides a detailed mathematical framework for analyzing these cycles, offering insights
into the behavior of generalized 3n + c sequences and their relation to classical problems in number
theory.
Since the transformation is entirely deterministic, a neural network would not necessarily need
to approximate probabilities but rather learn the structural properties of the sequence. Deep learn-
ing techniques such as recurrent neural networks (RNNs), transformers, and graph neural networks
(GNNs) provide a potential approach to uncovering hidden patterns that could offer insights into
this problem. From a machine learning perspective, the Collatz sequence can be framed as a se-
quence prediction problem, where the goal is to predict the next number in the sequence given
the previous values. Given an initial number n0 , the sequence
(n_0, n_1, n_2, \ldots, n_k) \quad (18.53)
can be viewed as a time-series dataset. A Recurrent Neural Network (RNN), such as a Long
Short-Term Memory (LSTM) network or a Transformer model, can be trained on known
Collatz sequences to generate sequences for new unseen numbers. The function
\hat{T}_\theta(n) = f_\theta(n), \quad (18.54)
where fθ is a neural network parameterized by θ, can be trained to approximate T (n). The loss
function for training would be:
L(\theta) = \sum_{n=1}^{N} |f_\theta(n) - T(n)|^2. \quad (18.55)
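A minimal sketch of this sequence-prediction framing (18.54)-(18.55) follows, assuming PyTorch; the binary-digit encoding of n (so the parity branch is visible to the model), the network width, and the training range are illustrative choices, not prescriptions of the text.

```python
import torch
import torch.nn as nn

def collatz_T(n: int) -> int:
    """One Collatz step: n/2 if n is even, 3n+1 if n is odd."""
    return n // 2 if n % 2 == 0 else 3 * n + 1

def bits(n: int, width: int = 16) -> torch.Tensor:
    """Encode n by its binary digits so the branch condition (parity) is a feature."""
    return torch.tensor([(n >> i) & 1 for i in range(width)], dtype=torch.float32)

N = 2000
X = torch.stack([bits(n) for n in range(1, N + 1)])
y = torch.tensor([float(collatz_T(n)) for n in range(1, N + 1)])

f_theta = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(f_theta.parameters(), lr=1e-3)
for step in range(500):
    opt.zero_grad()
    loss = ((f_theta(X).squeeze(-1) - y) ** 2).mean()  # squared-error loss of eq. (18.55)
    loss.backward()
    opt.step()
```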
Another perspective on the Collatz problem is to consider it as a graph learning task. The
transformation T (n) induces a directed graph G(V, E), where each node V represents an integer
n and each directed edge represents the transformation n → T (n). A Graph Neural Network
(GNN) can be trained on this graph structure to embed numerical relationships into a low-
dimensional vector space, revealing hidden clustering patterns. The message-passing function
in a GNN is given by:
h_n^{(t+1)} = \sigma\Big( W \sum_{m \in \mathcal{N}(n)} h_m^{(t)} \Big), \quad (18.56)
where h_n^{(t)} is the embedding of node n at iteration t, W is a trainable weight matrix, and \mathcal{N}(n)
represents the set of neighboring nodes. Even though the Collatz function is deterministic, we can
model it as a stochastic process by introducing a probabilistic relaxation. Consider a Markov process X_t defined on \mathbb{N} where the transition probabilities take the mixture form
P(X_{t+1} = m \mid X_t = n) = (1 - \epsilon)\, \delta(m - T(n)) + \epsilon\, p(m), \quad (18.57)
where \delta(\cdot) is the Dirac delta function, \epsilon > 0 is small, and p(m) is a small noise term. A reinforcement learning
(RL) agent can be defined with a policy π(n) that selects transitions based on a learned strategy.
The goal is to minimize the cost function:
"∞ #
X
J(π) = Eπ γ t C(nt ) , (18.58)
t=0
where C(nt ) is a cost function that represents the number of steps needed to reach 1. RL-based
approaches can be used to estimate statistical properties of the stopping time function S(n), po-
tentially revealing asymptotic behaviors in large n. Deep learning can serve as a heuristic tool to
analyze large-scale computational data, offering numerical insights that might guide future the-
oretical approaches. If certain statistical properties—such as mean stopping times or clustering
of trajectory lengths—can be detected via deep learning, they could provide new conjectures or
constraints that mathematicians can then attempt to prove rigorously.
The intersection of deep learning and the Collatz conjecture is an exciting and emerging field.
While deep learning cannot replace traditional mathematical proofs, it provides a powerful experi-
mental tool to analyze the statistical properties of the Collatz sequence, detect hidden structures in
its graph representation, and study the problem through reinforcement learning and Markov mod-
els. By leveraging graph neural networks, recurrent models, and probabilistic relaxations, we can
explore new perspectives on this classic problem. However, the deterministic nature of the Collatz
function suggests that deep learning alone will not yield a proof, but rather assist in formulating
new hypotheses and guiding analytical techniques.
The asymptotic growth of the Mertens function M(n) = \sum_{k \le n} \mu(k), where \mu is the Möbius function, has a deep connection to the Riemann Hypothesis (RH). The classical Mertens conjecture proposed
|M(n)| \le \sqrt{n} \quad \forall n \ge 1, \quad (18.61)
which, if true, would have implied RH. However, it was disproven by Odlyzko and te Riele (1985).
The best known upper bound under RH is
M(n) = O\big(n^{1/2 + \epsilon}\big) \quad \text{for any } \epsilon > 0. \quad (18.62)
A fundamental connection between M(n) and the zeros of the Riemann zeta function arises via
\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s}, \quad \mathrm{Re}(s) > 1. \quad (18.64)
By analytic continuation, \zeta(s) satisfies the functional equation
\zeta(s) = 2^s \pi^{s-1} \sin\big(\tfrac{\pi s}{2}\big)\, \Gamma(1 - s)\, \zeta(1 - s). \quad (18.65)
The explicit formula connecting M(n) to the nontrivial zeros \rho of \zeta(s) is
M(n) = \sum_{\rho} \frac{n^{\rho}}{\rho} + O(n). \quad (18.66)
Since the distribution of ρ determines the oscillatory behavior of M (n), deep learning approaches
can be employed to analyze these fluctuations. Consider a sequence prediction model for M(n) based on a recurrent neural network (RNN) where the state equation of the hidden layer is
h_t = \sigma(W h_{t-1} + U x_t + b), \quad (18.67)
where h_t is the hidden state at time step t, x_t is the input (previous values of M(n)), and W, U, b are learnable parameters. The loss function to optimize is
L = \sum_{t=1}^{N} \big( M(n_t) - \hat{M}(n_t) \big)^2. \quad (18.68)
Since M (n) exhibits chaotic fluctuations, it is beneficial to employ Fourier analysis in training.
The Fourier transform of M (n) is
\hat{M}(f) = \int_{-\infty}^{\infty} M(n)\, e^{-2\pi i f n}\, dn. \quad (18.69)
A wavelet transform decomposition can be applied using basis functions \psi_{a,b}(t) given by
\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}}\, \psi\Big(\frac{t - b}{a}\Big), \quad (18.70)
which allows us to extract localized frequency behavior in M (n). Since M (n) is driven by the prime
factorization structure of n, a graph neural network (GNN) can be constructed where nodes
correspond to prime factors and edges encode multiplicative relationships. The adjacency matrix
A(n) is defined such that
M (n) = f (A(n)), (18.71)
where f is a deep function approximator trained to estimate M (n). The connection to the Rie-
mann Hypothesis suggests that deep reinforcement learning (DRL) can be used to estimate
bounds on M (n). Define a reinforcement learning setup where
\text{State} = \{M(n_1), M(n_2), \ldots, M(n_k)\}, \quad \text{Action} = \{B(n) \text{ adjustments}\}, \quad \text{Reward} = -\big(|M(n)| - B(n)\big)^2. \quad (18.72)
Using a wavelet-based neural network, the spectral properties of M(n) can be analyzed via the scaling function \phi(t) satisfying
\phi(t) = \sum_{k} h_k\, \phi(2t - k). \quad (18.73)
A deep convolutional network can extract self-similar patterns in M(n) by training on feature maps defined as
F_l(x) = \sigma\Big( \sum_{i} w_i^{(l)} F_{l-1}(x - i) + b_l \Big), \quad (18.74)
where F_l(x) represents the activation at layer l. The integration of deep learning with classical number theory enables empirical conjecture generation, providing insights into bounds on M(n) and prime
number distributions.
Here are some experiments that could be conducted using deep learning to analyze the Mertens
function M (n). These experiments leverage different machine learning architectures to identify po-
tential patterns, estimate new bounds, or even gain insights into its connection with the Riemann
hypothesis.
The summatory Möbius function, also known as the Mertens function, is given by:
M(n) = \sum_{k=1}^{n} \mu(k). \quad (18.76)
This setup transforms the problem into a sequence prediction task where the goal is to map past values to the next value in the sequence. As a simple baseline, the probability distribution of \mu(n) is modeled as uniform,
P(\mu(n) = -1) = P(\mu(n) = 0) = P(\mu(n) = 1) = \frac{1}{3}, \quad (18.78)
though deviations occur due to arithmetic structure: asymptotically, \mu(n) = 0 has density 1 - 6/\pi^2 \approx 0.392, while \mu(n) = \pm 1 each have density 3/\pi^2 \approx 0.304. Neural architectures such as recurrent
neural networks (RNNs), long short-term memory (LSTM) networks, and transformer-based models
are employed to capture long-range dependencies and potential hidden structures in µ(n). The
optimization function is chosen as:
L = \sum_{i=1}^{N} \big( \hat{\mu}(n_i) - \mu(n_i) \big)^2, \quad (18.79)
where µ̂(ni ) is the predicted value, minimizing the squared error loss. A Fourier spectral analysis
is performed by considering the transformed Möbius sequence:
F(\omega) = \sum_{n=1}^{N} \mu(n)\, e^{-2\pi i \omega n}, \quad (18.80)
which provides insight into hidden periodicities. If the trained model achieves prediction accu-
racy significantly exceeding random guessing, this would suggest undiscovered regularities in µ(n).
Conversely, failure would reinforce the conjecture that µ(n) behaves as a pseudo-random function.
This experiment probes the learnability of deep arithmetic properties through machine learning
techniques and may inspire further investigations into the relationship between deep learning and
number theory.
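The spectral probe (18.80) can be run directly on exact Möbius values. The following is a minimal sketch, assuming SymPy for µ(n) and NumPy's FFT; the sequence length and the flatness statistic are illustrative choices.

```python
import numpy as np
from sympy import mobius  # exact Moebius values mu(n)

N = 4096
mu = np.array([int(mobius(n)) for n in range(1, N + 1)], dtype=float)

# Discrete analogue of F(omega) = sum_n mu(n) exp(-2*pi*i*omega*n), eq. (18.80)
F = np.fft.rfft(mu)
power = np.abs(F) ** 2

# Under the pseudo-randomness heuristic the spectrum should be roughly flat;
# pronounced peaks would hint at hidden periodicities in mu(n).
flatness = power.max() / power.mean()
print(f"max/mean spectral power: {flatness:.1f}")
```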
Since M(n) = \sum_{k=1}^{n} \mu(k), where \mu(k) is the Möbius function, a natural question arises regarding the asymptotic behavior and potential bounding of M(n) with respect to known conjectures and heuristic approximations. A key observation is that the classical Mertens conjecture, which suggested that
|M(n)| \le \sqrt{n} \quad \forall n \ge 1,
was disproved by Odlyzko and te Riele, so any empirical bound must accommodate eventual excursions beyond \sqrt{n}.
The goal of this experiment is to determine whether deep learning-based regression models can
capture such behavior or even suggest refined empirical bounds beyond known results. To achieve
this, we construct a dataset where the input is n and the output is M (n). Given the irregular
oscillatory nature of M (n), standard regression models may struggle to converge to meaningful
patterns. We therefore consider transformations such as:
Y(n) = \log |M(n)|, \qquad Y(n) = \frac{M(n)}{n^{1/2}}, \qquad Y(n) = \frac{M(n)}{n^{1/2 + \epsilon}},
and evaluate whether deep learning models can effectively approximate these relationships. The
models employed include:
1. Fully connected deep neural networks (DNNs), among other regression architectures.
Each model f_\theta is parameterized by \theta and fit to the transformed targets. If \theta^* represents the optimal parameters, we analyze whether
|M(n)| \le f_{\theta^*}(n) \quad \text{for all } n,
and compare this bound with existing theoretical upper bounds. Furthermore, spectral analysis
methods such as Fourier and wavelet transforms are applied to M (n) to decompose its oscillatory
structure into dominant frequency components, revealing potential periodic trends. If the neural
network can identify new bounding relationships through empirical training, this would suggest av-
enues for improving theoretical results. The comparison of deep learning-predicted bounds against
classical results, such as M(n) = O(n^{1/2+\epsilon}) under RH, serves as a benchmark to assess the efficacy of the deep learning approach. Ultimately, the study
investigates whether a neural network can establish improved empirical bounds and whether these
findings align with rigorous analytic number theory expectations. If successful, this approach might
hint at refined hypotheses regarding the asymptotics of M (n), potentially shedding light on deeper
connections with the Riemann Hypothesis and related conjectures.
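A minimal sketch of the data-construction step for such a study follows, assuming SymPy; it builds the (n, M(n)) dataset and the scaled targets Y(n) described above, leaving the choice of regression model open.

```python
import numpy as np
from sympy import mobius

N = 10_000
mu = np.array([int(mobius(k)) for k in range(1, N + 1)])
M = np.cumsum(mu)                      # Mertens function M(n) = sum_{k<=n} mu(k)
n = np.arange(1, N + 1, dtype=float)

# Candidate normalized regression targets (the transformations in the text)
Y_sqrt = M / np.sqrt(n)                # M(n) / n^{1/2}
eps = 0.05                             # illustrative epsilon
Y_eps = M / n ** (0.5 + eps)           # M(n) / n^{1/2+eps}

# Empirical analogue of the bound comparison: how close does |M(n)| come to sqrt(n)?
print("max |M(n)|/sqrt(n) on [1, N]:", np.abs(Y_sqrt).max())
```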
The first step in this approach is data representation. We construct a dataset consisting of the known non-trivial zeros \rho_k = \frac{1}{2} + i\gamma_k of \zeta(s), where \gamma_k denotes the imaginary part. We then generate the corresponding sequences of values M(n) and \mu(n) to explore the statistical relationships between these number-theoretic functions and the spectral properties of \zeta(s). We also explore alternative data representations, such as Fourier-transformed Möbius sequences or wavelet coefficients of M(n), for improved model interpretability.
The second step is the choice of neural network architectures. We implement Fourier Transform Neural Networks (FTNNs) to leverage spectral properties and identify periodicities in \mu(n) and M(n); we utilize graph neural networks (GNNs) to model number-theoretic dependencies, representing prime factorization structures as graph embeddings; and we experiment with deep recurrent architectures (LSTMs, Transformers) to analyze long-range dependencies in sequences of Möbius function values. The third step is training and evaluation. We train the model to map between the zero distribution of \zeta(s) and features of M(n), such as its upper bounds, and validate whether the learned embeddings capture existing theoretical results, such as the classical bound
M(n) = O\big(n^{1/2+\epsilon}\big) \quad \text{for any } \epsilon > 0, \ \text{assuming the Riemann Hypothesis}. \quad (18.81)
Finally, we assess model generalization using synthetic test cases and real-world prime-counting functions.
Some potential insights are:
• Numerical Evidence for Spectral Connections: If the model can learn meaningful
patterns in the distribution of zeros, this might provide further empirical support for the
spectral interpretation of prime number distributions.
• Predictive Utility: A model capable of estimating new zero locations could refine our under-
standing of the error terms in prime number theorems and potentially guide new conjectures
in analytic number theory.
• Deep Learning as a Theoretical Tool: While deep learning does not offer rigorous math-
ematical proofs, its ability to approximate highly nonlinear functions could lead to novel
heuristic insights, paving the way for new analytical techniques to study the Mertens func-
tion and related zeta function properties.
The Mertens function aggregates the Möbius function over an initial segment of the integers, and understanding its asymptotic behavior is crucial for problems in analytic number theory. The
primary objective is to encode the prime factorization of an integer n as a graph and leverage GNNs
to predict values of µ(n) and analyze statistical fluctuations in M (n). Since the properties of µ(n)
are determined entirely by the prime factorization of n, we hypothesize that a well-trained GNN
could uncover latent structural patterns in these arithmetic functions.
Each number n is represented as a graph Gn = (V, E), where nodes correspond to prime factors,
and edges represent multiplicative relationships. Node features such as prime identity, exponent,
and relative position within the factorization tree are encoded alongside the adjacency matrix. The eigenvalues of these adjacency matrices may provide insights into the underlying arithmetic structure. We employ Message Passing Neural Networks (MPNNs), Graph Convolutional Networks (GCNs), and Graph Attention Networks
(GATs) to process these number-theoretic structures. The update rule for each node feature representation h_v^{(t)} in an MPNN follows:
h_v^{(t+1)} = \sigma\Big( W h_v^{(t)} + W' \sum_{u \in \mathcal{N}(v)} h_u^{(t)} \Big), \quad (18.85)
where N (v) denotes the neighborhood of v, and σ is a non-linearity. The model is trained to classify
integers according to their Möbius function values and predict bounds on M(n). A key quantity of interest is the supremum bound
\sup_{n \le N} \frac{|M(n)|}{\sqrt{n}}, \quad (18.86)
which is compared against the model’s empirical predictions. By analyzing learned representations
of µ(n), we may uncover new heuristics for understanding the randomness of arithmetic functions.
The spectral properties of factorization graphs could reveal deep connections to the Riemann Hy-
pothesis, given the existing conjecture:
\sum_{n \le x} \mu(n) = O\big(x^{1/2+\epsilon}\big), \quad (18.87)
and how machine learning approaches refine these estimates. Furthermore, extensions of this frame-
work could be applied to other multiplicative functions such as Euler’s totient function ϕ(n) and
divisor functions d(n), contributing to the study of number-theoretic graph structures.
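A minimal sketch of the graph encoding and one message-passing update of the form (18.85) follows, assuming SymPy and NumPy; representing co-occurring prime factors as a complete graph and choosing (prime, exponent) as node features are illustrative assumptions.

```python
import numpy as np
from sympy import factorint

def factor_graph(n: int):
    """Build the factorization graph G_n: nodes are prime factors of n,
    edges connect primes that co-occur in the factorization."""
    fac = factorint(n)                     # {prime: exponent}
    primes = sorted(fac)
    k = len(primes)
    A = np.ones((k, k)) - np.eye(k)        # complete graph on the prime factors
    X = np.array([[p, fac[p]] for p in primes], dtype=float)  # features: (prime, exponent)
    return A, X

def mpnn_layer(A, H, W, W_prime):
    """One update of eq. (18.85): h_v <- sigma(W h_v + W' * sum_{u in N(v)} h_u)."""
    return np.tanh(H @ W.T + (A @ H) @ W_prime.T)

rng = np.random.default_rng(0)
A, X = factor_graph(360)                   # 360 = 2^3 * 3^2 * 5
W_in = rng.normal(size=(4, 2))
W_prime = rng.normal(size=(4, 4))
H = np.tanh(X @ W_in.T)                    # initial embeddings
H = mpnn_layer(A, H, np.eye(4), W_prime)   # one message-passing step
print(H.shape)                             # (3, 4): one embedding per prime factor
```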
A complementary experiment trains an autoencoder, with encoder E and decoder D, on windows of Möbius values to analyze whether there exist long-range dependencies within the compressed representation. In
order to quantify the reconstruction error and randomness evaluation, we compute the variance of
reconstruction errors to test if µ(n) is learnable within a low-dimensional manifold:
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} \big( \mu(n_i) - D(E(\mu(n_i))) \big)^2. \quad (18.90)
This variance is compared with the expected behavior under random models of µ(n) to evaluate
whether compression introduces meaningful structure or preserves apparent randomness.
If the autoencoder extracts meaningful lower-dimensional features, it may indicate the presence of latent structure in \mu(n) beyond apparent randomness.
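A minimal sketch of this compression test, culminating in the empirical analogue of (18.90), follows, assuming PyTorch; the window length, bottleneck width, and training budget are illustrative assumptions.

```python
import torch
import torch.nn as nn
from sympy import mobius  # exact Moebius values

# Windows of consecutive Moebius values as autoencoder inputs
W, N = 32, 5000
seq = torch.tensor([float(mobius(k)) for k in range(1, N + 1)])
X = torch.stack([seq[i:i + W] for i in range(N - W)])

E = nn.Sequential(nn.Linear(W, 8), nn.Tanh())   # encoder E: compress to 8 dimensions
D = nn.Sequential(nn.Linear(8, W))              # decoder D: reconstruct the window
opt = torch.optim.Adam(list(E.parameters()) + list(D.parameters()), lr=1e-3)

for step in range(300):
    opt.zero_grad()
    loss = ((D(E(X)) - X) ** 2).mean()          # reconstruction error
    loss.backward()
    opt.step()

# Empirical analogue of the error variance sigma^2 in eq. (18.90)
with torch.no_grad():
    sigma2 = ((D(E(X)) - X) ** 2).mean().item()
print("mean squared reconstruction error:", sigma2)
```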
19 Advanced Architectures
The Transformer model is an advanced neural network architecture fundamentally defined by the
self-attention mechanism, which enables global context-aware computations on sequential data.
The model processes an input sequence represented by
X \in \mathbb{R}^{n \times d_{\text{model}}}, \quad (19.1)
where n denotes the sequence length and dmodel the embedding dimensionality. Each token in
this sequence is projected into three learned spaces—queries Q, keys K, and values V—using the
trainable matrices W_Q, W_K, and W_V, such that
Q = X W_Q, \quad K = X W_K, \quad V = X W_V, \quad (19.2)
where W_Q, W_K, W_V \in \mathbb{R}^{d_{\text{model}} \times d_k}, with d_k being the dimensionality of queries and keys. The pairwise similarity between tokens is determined by the dot product Q K^\top, scaled by the factor \frac{1}{\sqrt{d_k}} to ensure numerical stability, yielding the raw attention scores:
S = \frac{Q K^\top}{\sqrt{d_k}}, \quad (19.3)
where S ∈ Rn×n . These scores are normalized using the softmax function, producing the attention
weights A, where
A_{ij} = \frac{\exp(S_{ij})}{\sum_{k=1}^{n} \exp(S_{ik})}, \quad (19.4)
ensuring \sum_{j=1}^{n} A_{ij} = 1. The output of the attention mechanism is computed as a weighted sum of the values:
Z = AV, (19.5)
where Z ∈ Rn×dv , with dv being the dimensionality of the value vectors. This process can be
expressed compactly as
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\Big( \frac{Q K^\top}{\sqrt{d_k}} \Big) V. \quad (19.6)
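A minimal NumPy sketch of equations (19.2)-(19.6) follows; the sequence length, dimensionalities, and random weights are illustrative.

```python
import numpy as np

def softmax(S):
    e = np.exp(S - S.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

n, d_model, d_k = 5, 16, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_model))                  # input sequence of n tokens
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V                # projections, eq. (19.2)
S = Q @ K.T / np.sqrt(d_k)                         # scaled scores, eq. (19.3)
A = softmax(S)                                     # attention weights, eq. (19.4)
Z = A @ V                                          # weighted values, eqs. (19.5)-(19.6)
assert np.allclose(A.sum(axis=1), 1.0)             # rows of A sum to 1
```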
Multi-head attention extends this mechanism by splitting Q, K, V into h distinct heads, each
operating in its own subspace. For the i-th head,
\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i), \quad (19.7)
where Q_i = X W_i^Q, K_i = X W_i^K, V_i = X W_i^V. The outputs of all heads are concatenated and linearly transformed:
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O, \quad (19.8)
where W^O \in \mathbb{R}^{h d_v \times d_{\text{model}}}. This architecture enables the model to capture multiple types of rela-
tionships simultaneously. Positional encodings are added to the input embeddings X to preserve
sequence order. These encodings P ∈ Rn×dmodel are defined as:
P_{(pos, 2i)} = \sin\Big( \frac{pos}{10000^{2i/d_{\text{model}}}} \Big), \quad P_{(pos, 2i+1)} = \cos\Big( \frac{pos}{10000^{2i/d_{\text{model}}}} \Big), \quad (19.9)
ensuring unique representations for each position pos and dimension index i. The feedforward
network (FFN) applies two dense layers with an intermediate ReLU activation:
\mathrm{FFN}(x) = \max(0, x W_1 + b_1)\, W_2 + b_2, \quad (19.10)
where W_1 \in \mathbb{R}^{d_{\text{model}} \times d_{\text{ff}}}, W_2 \in \mathbb{R}^{d_{\text{ff}} \times d_{\text{model}}}, and d_{\text{ff}} > d_{\text{model}}. Residual connections and layer normalization are applied throughout to stabilize training, with the output of each sublayer given by
\mathrm{LayerNorm}\big(x + \mathrm{Sublayer}(x)\big). \quad (19.11)
For autoregressive generation, the output sequence is factorized as \prod_t P(y_t \mid y_{<t}, x), where P(y_t \mid y_{<t}, x) is modeled using the softmax over the logits z_t W_{\text{out}} + b_{\text{out}}, with parameters W_{\text{out}}, b_{\text{out}}. The Transformer achieves a complexity of O(n^2 d_k) per attention layer due to the
computation of QK⊤ , yet its parallelization capabilities render it more efficient than recurrent net-
works. This mathematical formalism, coupled with innovations like sparse attention and dynamic
programming, has solidified the Transformer as the cornerstone of modern sequence modeling tasks.
While this quadratic complexity poses challenges for very long sequences, it allows for greater par-
allelization compared to RNNs, which require O(n) sequential steps. Furthermore, the memory
complexity of O(n2 ) for storing attention weights can be mitigated using sparse approximations or
hierarchical attention structures. The Transformer architecture’s flexibility and effectiveness stem
from its ability to handle diverse tasks by appropriately modifying its components. For example,
in Vision Transformers (ViTs), the input sequence is formed by flattening image patches, and the
positional encodings capture spatial relationships. In contrast, in sequence-to-sequence tasks like
translation, the cross-attention mechanism enables the decoder to focus on relevant parts of the
encoder’s output.
In conclusion, the Transformer represents a paradigm shift in neural network design, replacing
recurrence with attention and enabling unprecedented scalability and performance. The rigorous
mathematical foundation of attention mechanisms, combined with the architectural innovations
of multi-head attention, positional encoding, and feedforward layers, underpins its success across
domains.
One line of work applies attention-enhanced GANs to robust image segmentation, providing a mathematical formulation for adversarial segmentation loss functions. The classical GAN is trained through the two-player minimax game
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))], \quad (19.13)
where E denotes the expectation operator. This objective seeks to maximize the discriminator’s
ability to distinguish between real and generated samples while simultaneously minimizing the
generator’s ability to produce samples distinguishable from real data. For a fixed generator G, the
optimal discriminator D∗ is obtained by maximizing V (D, G), yielding
D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}. \quad (19.14)
Substituting D^* into V(D, G) shows that, up to an additive constant -\log 4, training the generator minimizes the Jensen-Shannon (JS) divergence between p_{\text{data}} and p_g, defined as
\mathrm{JS}(p_{\text{data}} \,\|\, p_g) = \frac{1}{2} \mathrm{KL}(p_{\text{data}} \,\|\, M) + \frac{1}{2} \mathrm{KL}(p_g \,\|\, M), \quad (19.16)
where M = \frac{1}{2}(p_{\text{data}} + p_g) and \mathrm{KL}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx is the Kullback-Leibler divergence. At the Nash equilibrium, p_g = p_{\text{data}}, the JS divergence vanishes, and D^*(x) = \frac{1}{2} for all x. The gradient updates during training are derived using stochastic gradient descent. For the discriminator, the gradients are given by
\nabla_{\theta_D} V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}}[\nabla_{\theta_D} \log D(x)] + \mathbb{E}_{z \sim p_z}[\nabla_{\theta_D} \log(1 - D(G(z)))]. \quad (19.17)
Training Generative Adversarial Networks (GANs) involves iterative updates to the parameters
θD of the discriminator and θG of the generator. The discriminator’s parameters are updated
via gradient ascent to maximize the value function V (D, G), while the generator’s parameters are
updated via gradient descent to minimize the same value function. Denoting the gradients of D
and G with respect to their parameters as \nabla_{\theta_D} and \nabla_{\theta_G}, the updates are given by:
\theta_D \leftarrow \theta_D + \eta_D \nabla_{\theta_D} \big[ \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \big] \quad (19.18)
and
\theta_G \leftarrow \theta_G - \eta_G \nabla_{\theta_G} \big[ \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \big], \quad (19.19)
where \eta_D and \eta_G are the respective learning rates.
In practice, to address issues of vanishing gradients, an alternative loss function for the generator
is often used, defined as:
−Ez∼pz [log D(G(z))] . (19.20)
This modification ensures stronger gradient signals when the discriminator is performing well,
effectively improving the generator’s training dynamics. For the generator, the gradients in the
original formulation are expressed as
\nabla_{\theta_G}\, \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))], \quad (19.21)
but due to vanishing gradients when D(G(z)) is near 0, the non-saturating generator loss is pre-
ferred:
LG = −Ez∼pz [log D(G(z))]. (19.22)
The convergence of GANs is inherently linked to the properties of D∗ (x) and the alignment of
pg with pdata . However, mode collapse and training instability are frequently observed due to the
non-convex nature of the objective functions. Wasserstein GANs (WGANs) address these issues
by replacing the JS divergence with the Wasserstein-1 distance, defined as
W_1(p_{\text{data}}, p_g) = \inf_{\gamma \in \Pi(p_{\text{data}}, p_g)} \mathbb{E}_{(x, y) \sim \gamma}\big[ \|x - y\| \big], \quad (19.23)
where \Pi(p_{\text{data}}, p_g) is the set of all couplings of p_{\text{data}} and p_g. Using Kantorovich-Rubinstein duality, the Wasserstein distance is reformulated as
W_1(p_{\text{data}}, p_g) = \sup_{\|f\|_L \le 1} \big( \mathbb{E}_{x \sim p_{\text{data}}}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)] \big), \quad (19.24)
where f is a 1-Lipschitz function. To enforce the Lipschitz constraint, gradient penalties are ap-
plied, ensuring that ∥∇f (x)∥ ≤ 1.
The mathematical framework of GANs integrates elements from game theory, optimization, and
information geometry. Their training involves solving a high-dimensional non-convex game, where
theoretical guarantees for convergence are challenging due to saddle points and complex interac-
tions between G and D. Nevertheless, GANs represent a mathematically elegant paradigm for
generative modeling, with ongoing research extending their theoretical and practical capabilities.
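A minimal sketch of the alternating updates (19.18)-(19.22) on a one-dimensional toy target distribution follows, assuming PyTorch; the network sizes, the Gaussian "real" data, and the use of the non-saturating generator loss (19.22) in place of (19.19) are illustrative assumptions.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))   # generator
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))   # discriminator (logits)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    x = torch.randn(64, 1) * 0.5 + 2.0      # "real" samples from N(2, 0.25)
    z = torch.randn(64, 4)
    # Discriminator ascent on V(D, G): maximize log D(x) + log(1 - D(G(z)))
    opt_D.zero_grad()
    loss_D = bce(D(x), torch.ones(64, 1)) + bce(D(G(z).detach()), torch.zeros(64, 1))
    loss_D.backward()
    opt_D.step()
    # Non-saturating generator loss  L_G = -E[log D(G(z))], eq. (19.22)
    opt_G.zero_grad()
    loss_G = bce(D(G(z)), torch.ones(64, 1))
    loss_G.backward()
    opt_G.step()
```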
One study applied conditional variational autoencoders (CVAEs) to material design, introducing new loss functions for conditional generative modeling and showing theoretically how VAEs can optimize material selection. Kim et al. [311] use Transformer-based Autoencoders (AEs) for video anomaly detection. They established theoretical improvements of AEs for time-series anomaly detection and used spatio-temporal Autoencoder embeddings to capture anomalies in videos. Albert et al. (2024) [312] compared Kernel Learning Embeddings (KLE) and Variational Autoencoders for dimensionality reduction. They introduced VAE-based models for atmospheric modeling and established a mathematical comparison between VAEs and kernel-based models. Sharma et al. (2024) [313] explored practical applications of
Autoencoders in network intrusion detection. They established Autoencoders as robust feature
extractors for anomaly detection and provided a formal study of Autoencoder latent space repre-
sentations.
The deterministic autoencoder is trained to minimize a reconstruction loss; this formulation drives the encoder-decoder architecture towards learning a latent representation
that preserves key features of the input data, allowing it to be efficiently reconstructed. The solution
to this problem is typically pursued via stochastic gradient descent (SGD), where gradients
of the loss with respect to the model parameters are computed and backpropagated through the
network. In contrast to the deterministic autoencoder, the Variational Autoencoder (VAE)
introduces a probabilistic model to better capture the distribution of the latent variables. A VAE
models the data generation process using a latent variable z ∈ Rl , and aims to maximize the
likelihood of observing the data x by integrating over all possible latent variables. Specifically, we
have the joint distribution:
p(x, z) = p(x|z)p(z), (19.27)
where p(x|z) is the likelihood of the data given the latent variables, and p(z) is the prior distribution
of the latent variables, typically chosen to be a standard Gaussian N (z; 0, I). The prior assumption
that p(z) = N (0, I) simplifies the modeling, as it imposes no particular structure on the latent
space, which allows for flexible modeling of the data distribution. The encoder in a VAE outputs
a distribution q_{\theta_e}(z|x) over the latent variables, typically modeled as a multivariate Gaussian with mean \mu_{\theta_e}(x) and variance \sigma_{\theta_e}^2(x), i.e. q_{\theta_e}(z|x) = \mathcal{N}\big(z; \mu_{\theta_e}(x), \sigma_{\theta_e}^2(x) I\big). The decoder generates the
likelihood of the data x given the latent variable z, expressed as pθd (x|z), which typically takes the
form of a Gaussian distribution for continuous data. A central challenge in VAE training is the
marginal likelihood p(x), which represents the probability of the observed data. This marginal
likelihood is intractable due to the integral over the latent variables:
p(x) = \int p_{\theta_d}(x|z)\, p(z)\, dz. \quad (19.28)
To address this, VAE training employs variational inference, which approximates the true pos-
terior p(z|x) with a variational distribution qθe (z|x). The goal is to optimize the Evidence Lower
Bound (ELBO), which is a lower bound on the log-likelihood log p(x). The ELBO is derived
using Jensen’s inequality:
log p(x) ≥ Eqθe (z|x) [log pθd (x|z)] − KL (qθe (z|x) || p(z)) , (19.29)
where the first term is the expected log-likelihood of the data given the latent variables, and
the second term is the Kullback-Leibler (KL) divergence between the approximate posterior
qθe (z|x) and the prior p(z). The KL divergence acts as a regularizer, penalizing deviations from
the prior distribution. The ELBO can then be written as:
LVAE (x) = Eqθe (z|x) [log pθd (x|z)] − KL (qθe (z|x) || p(z)) . (19.30)
This formulation balances two competing objectives: maximizing the likelihood of reconstructing
x from z, and minimizing the divergence between the posterior qθe (z|x) and the prior p(z). In
order to perform optimization, we need to compute the gradient of the ELBO with respect to the
parameters θe and θd . However, since sampling from the distribution qθe (z|x) is non-differentiable,
the reparameterization trick is applied. This trick allows us to reparameterize the latent variable
z as:
z = µθe (x) + σθe (x) · ϵ, (19.31)
where ϵ ∼ N (0, I) is a standard Gaussian noise vector. This enables the backpropagation of
gradients through the latent space and allows the optimization process to proceed via stochastic
gradient descent. In practice, the Monte Carlo method is used to estimate the expectation
in the ELBO. This involves drawing K samples zk from the variational posterior qθe (z|x) and
approximating the expectation as:
\hat{\mathcal{L}}_{\text{VAE}}(x) = \frac{1}{K} \sum_{k=1}^{K} \log p_{\theta_d}(x|z_k) - \frac{1}{K} \sum_{k=1}^{K} \log q_{\theta_e}(z_k|x). \quad (19.32)
This approximation allows for efficient optimization, even when the latent space is high-dimensional
and the exact expectation is computationally prohibitive. Thus, the training process of a VAE
involves the following steps: first, the encoder produces a distribution qθe (z|x) for each input x;
then, latent variables z are sampled from this distribution; finally, the decoder reconstructs the
data x̂ from the latent variable z. The network is trained to maximize the ELBO, which effectively
balances the reconstruction loss and the KL divergence term.
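A minimal sketch of optimizing the negative ELBO (19.30) with the reparameterization trick (19.31) follows, assuming PyTorch; the Gaussian decoder is reduced to a squared-error reconstruction term up to constants, and the data and all dimensions are illustrative stand-ins.

```python
import torch
import torch.nn as nn

d, l = 20, 2                                    # data and latent dimensions (illustrative)
enc = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 2 * l))  # outputs (mu, log var)
dec = nn.Sequential(nn.Linear(l, 64), nn.ReLU(), nn.Linear(64, d))
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x = torch.randn(128, d)                         # stand-in data batch
for step in range(200):
    mu, logvar = enc(x).chunk(2, dim=-1)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization, eq. (19.31)
    recon = ((dec(z) - x) ** 2).sum(-1).mean()                # -E_q[log p(x|z)] up to constants
    # Closed-form KL( N(mu, sigma^2 I) || N(0, I) )
    kl = 0.5 * (mu ** 2 + logvar.exp() - 1.0 - logvar).sum(-1).mean()
    loss = recon + kl                                         # negative ELBO, eq. (19.30)
    opt.zero_grad(); loss.backward(); opt.step()
```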
In this rigorous exploration, we have presented the mathematical foundations of both autoencoders
and variational autoencoders. The core distinction between the two lies in the introduction of a
probabilistic framework in the VAE, which leverages variational inference to optimize a tractable
lower bound on the marginal likelihood. Through this process, the VAE learns to generate data by
sampling from the latent space and reconstructing the input, while maintaining a well-structured
latent distribution through regularization by the KL divergence term. The optimization framework
for VAEs is grounded in variational inference and the reparameterization trick, enabling gradient-
based optimization techniques to efficiently train deep generative models.
Graph Neural Networks (GNNs) are a profound and mathematically intricate class of deep learn-
ing models specifically designed to handle and process data that is naturally structured as graphs.
Unlike traditional neural networks that operate on Euclidean data structures such as vectors, se-
quences, or grids, GNNs generalize deep learning to non-Euclidean spaces by directly leveraging the
underlying graph topology. The mathematical foundation of GNNs is deeply rooted in algebraic
graph theory, spectral graph theory, and the principles of geometric deep learning, all of which
contribute to the rigorous understanding of how neural networks can be extended to structured
relational data. At the core of any graph-based machine learning model lies the mathematical
representation of a graph. Formally, a graph G is defined as an ordered pair G = (V, E), where V
represents the set of nodes (or vertices), and E ⊆ V × V represents the set of edges that define the
relationships between nodes. The total number of nodes in the graph is denoted by |V | = N , while
the number of edges is given by |E|. The connectivity of the graph is encoded in the adjacency
matrix A ∈ RN ×N , where Aij is nonzero if and only if there exists an edge between nodes i and j.
The adjacency matrix can be either binary (indicating the mere presence or absence of an edge)
or weighted, in which case Aij encodes the strength or affinity of the connection. In addition to
graph connectivity, each node i is often associated with a feature vector xi ∈ Rd , and collecting
these feature vectors across all nodes forms the node feature matrix X ∈ RN ×d , where d is the
dimensionality of the feature space.
One of the fundamental challenges in extending neural networks to graph domains is the lack
of a consistent node ordering, which makes standard operations such as convolutions, pooling,
and fully connected layers non-trivial. Unlike images where a fixed spatial structure allows for
well-defined convolutional kernels, graphs exhibit arbitrary structure and permutation invariance,
meaning that the labels of nodes can be permuted without altering the intrinsic properties of
the graph. This necessitates the development of graph-specific neural network architectures that
respect the graph topology while maintaining permutation invariance. To facilitate learning on
graphs, GNNs employ a neighborhood aggregation or message-passing scheme, wherein each node
iteratively gathers information from its neighbors to update its representation. This process can
be formulated mathematically using recursive feature propagation rules. Let H (l) ∈ RN ×dl denote
(l)
the node feature matrix at layer l, where each row Hi represents the embedding of node i at that
layer. The most fundamental form of feature propagation follows the update equation:
H^{(l+1)} = \sigma\Big( \tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)} \Big), \quad (19.33)
where W (l) ∈ Rdl ×dl+1 is a learnable weight matrix that transforms the feature representation, σ(·)
is a nonlinear activation function such as ReLU, and à = A + I is the adjacency matrix augmented
with self-loops to ensure that each node includes its own features
in the aggregation process. The diagonal matrix \tilde{D} is the degree matrix of \tilde{A}, defined as \tilde{D}_{ii} = \sum_j \tilde{A}_{ij}, which normalizes the feature
propagation to avoid scale distortion. The initial node features are represented as H (0) = X, where
X is the matrix of initial node features. The weight matrix at layer l is denoted as W (l) ∈ Rdl ×dl+1 ,
where dl and dl+1 are the dimensions of the input and output feature spaces at layer l, respectively.
The weight matrix is trainable. The adjacency matrix A is augmented with self-loops, denoted as
à = A+I, where I is the identity matrix. The degree matrix D̃ is the diagonal matrix corresponding
to the adjacency matrix with self-loops, defined as:
\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}, \quad (19.34)
where Ãij is the entry in the augmented adjacency matrix. The function σ(·) is a nonlinear
activation function (such as ReLU) applied element-wise to the node features. This operation
ensures that each node aggregates information from its local neighborhood, facilitating feature
propagation across the graph. More generally, GNNs can be defined using a message passing
scheme, which consists of two key steps. Each node i receives messages from its neighbors j ∈ N (i).
The aggregated message at node i at layer l is computed as:
m_i^{(l)} = \sum_{j \in \mathcal{N}(i)} f_m\big( H_j^{(l)}, H_i^{(l)}, A_{ij} \big), \quad (19.35)
where f_m is a learnable function that determines how information is aggregated. The node embedding is updated using the function f_u, which takes the current node embedding H_i^{(l)} and the aggregated message m_i^{(l)}. The updated embedding for node i at layer l + 1 is given by:
H_i^{(l+1)} = f_u\big( H_i^{(l)}, m_i^{(l)} \big), \quad (19.36)
where fu is another learnable function. A popular choice for the functions fm and fu is:
H_i^{(l+1)} = \sigma\Big( W^{(l)} \sum_{j \in \mathcal{N}(i)} \tilde{A}_{ij} \tilde{D}_{ii}^{-1} H_j^{(l)} \Big), \quad (19.37)
where W (l) is the trainable weight matrix, Ãij are the entries of the augmented adjacency matrix,
and \tilde{D}_{ii}^{-1} is the inverse of the degree matrix. This formulation is permutation invariant, ensuring
that the node embeddings do not depend on the order in which neighbors are processed.
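A minimal NumPy sketch of one propagation step (19.33) follows; the 4-cycle graph, feature dimensions, and random weights are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)       # adjacency of a 4-cycle
A_tilde = A + np.eye(4)                         # add self-loops: A~ = A + I
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))  # D~^{-1/2}

H = rng.normal(size=(4, 8))                     # initial node features H^(0) = X
W = rng.normal(size=(8, 16))                    # trainable weights W^(0)

# One GCN layer, eq. (19.33): H^(1) = ReLU( D~^{-1/2} A~ D~^{-1/2} H^(0) W^(0) )
H1 = np.maximum(0.0, D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W)
print(H1.shape)  # (4, 16)
```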
Spectral approaches to graph learning are built on the graph Laplacian
L = D - A, \quad (19.38)
which possesses an orthonormal eigenbasis. The eigenvalues of the Laplacian encode fundamental
properties of the graph, such as connectivity and diffusion characteristics. Spectral methods define
graph convolutions in the Fourier domain using the symmetric normalized Laplacian L_{\text{sym}} = D^{-1/2} L D^{-1/2} and its eigen-decomposition
L_{\text{sym}} = U \Lambda U^\top, \quad (19.40)
where U is the matrix of eigenvectors and Λ is the diagonal matrix of eigenvalues. The graph
Fourier transform of a signal x is then given by
\hat{x} = U^\top x, \quad (19.41)
and a spectral convolution with filter g_\theta is applied as
g_\theta * x = U\, g_\theta(\Lambda)\, U^\top x. \quad (19.42)
Moving from spectral filtering to attention-based aggregation, Graph Attention Networks (GATs) compute data-dependent neighbor weights
\alpha_{ij} = \frac{\exp\big( \mathrm{LeakyReLU}(a^\top [W h_i \,\|\, W h_j]) \big)}{\sum_{k \in \mathcal{N}(i)} \exp\big( \mathrm{LeakyReLU}(a^\top [W h_i \,\|\, W h_k]) \big)}, \quad (19.43)
where ∥ denotes concatenation, a is a learnable parameter vector, and attention scores αij de-
termine the importance of neighbors in updating node features. Another variant, GraphSAGE,
employs different aggregation functions (mean, LSTM-based, or pooling-based) to sample and ag-
gregate information from local neighborhoods, ensuring scalability to large graphs. The theoretical
expressivity of GNNs is an active area of research, particularly in the context of the Weisfeiler-
Lehman graph isomorphism test. The Graph Isomorphism Network (GIN) is designed to match
the expressiveness of the 1-dimensional Weisfeiler-Lehman test, using an aggregation function of
the form:
H_i^{(l+1)} = \mathrm{MLP}\Big( (1 + \epsilon) H_i^{(l)} + \sum_{j \in \mathcal{N}(i)} H_j^{(l)} \Big), \quad (19.44)
where MLP(·) is a multi-layer perceptron, and ϵ is a learnable parameter that controls the contri-
bution of self-information. This formulation has been shown to be more powerful in distinguishing
graph structures compared to traditional GCNs. Applications of GNNs span multiple domains,
ranging from molecular property prediction in chemistry and biology, where molecules are repre-
sented as graphs with atoms as nodes and chemical bonds as edges, to recommendation systems
that model users and items as bipartite graphs. Other applications include knowledge graph rea-
soning, social network analysis, and combinatorial optimization problems.
In summary, Graph Neural Networks represent a mathematically rich extension of deep learn-
ing to structured relational data. Their foundation in spectral graph theory, algebraic topology,
and geometric deep learning provides a rigorous framework for understanding their function and
capabilities. Despite their success, open research challenges remain in improving their expressiv-
ity, generalization, and computational efficiency, making them an active and evolving field within
modern machine learning.
Physics Informed Neural Networks (PINNs) are a class of neural networks explicitly designed to
incorporate partial differential equations (PDEs) and boundary/initial conditions into their training
process. The goal is to find approximate solutions to the PDEs governing physical systems using
neural networks, while directly embedding the governing physical laws (described by PDEs) into
the training mechanism. A typical PDE problem is represented as:
\mathcal{L}\, u(x) = f(x), \quad x \in \Omega, \quad (19.45)
where:
• L is a differential operator, for instance, the Laplace operator ∇2 , or the Navier-Stokes oper-
ator for fluid dynamics.
• f (x) is a known source term, which could represent external forces or other sources in the
system.
• Ω is the domain in which the equation is valid, such as a bounded region in Rn (e.g., Ω ⊆ R3 ).
The solution u(x) is approximated by a neural network û(x, θ), where θ denotes the parameters
(weights and biases) of the neural network. A neural network approximates a function û(x) as a
composition of nonlinear mappings, typically as:
\hat{u}(x, \theta) = W_L\, \sigma\big( W_{L-1} \cdots \sigma(W_1 x + b_1) \cdots + b_{L-1} \big) + b_L, \quad (19.46)
where:
• W_i and b_i are the weight matrices and bias vectors of the i-th layer.
• \sigma(\cdot) is a nonlinear activation function applied element-wise.
Thus, the neural network learns a representation û(x, θ) that approximates the physical solution to
the PDE. The accuracy of this approximation depends on the choice of the network architecture,
activation function, and the training process. To enforce that the neural network approximates a
solution to the PDE, we introduce a physics-informed loss function. This loss function typically
consists of two parts:
1. Data-driven loss term: This term enforces the agreement between the model predictions and
any available data points (boundary or initial conditions).
2. Physics-driven loss term: This term enforces the satisfaction of the governing PDE at collo-
cation points within the domain Ω.
The data-driven component aims to minimize the discrepancy between the predicted solution and
the observed values at certain data points. For a set of training data {xi , ui }, the data-driven loss
is given by:
L_{\text{data}} = \sum_{i=1}^{N} |\hat{u}(x_i, \theta) - u_i|^2 \quad (19.47)
where û(xi , θ) is the predicted value of u(x) at xi , and ui is the observed value.
The physics-driven term ensures that the predicted solution satisfies the PDE. Let r(xi ) repre-
sent the PDE residual evaluated at collocation points xi ∈ Ω. The residual r(xi ) is defined as the
difference between the left-hand side and the right-hand side of the PDE:
r(x_i) = \mathcal{L}\hat{u}(x_i, \theta) - f(x_i). \quad (19.48)
Here, \mathcal{L}\hat{u}(x_i, \theta) is the differential operator acting on the neural network approximation \hat{u}(x_i, \theta). By
applying automatic differentiation (AD), we can compute the required derivatives of û(xi , θ) with
respect to x. For instance, in the case of a second-order differential equation, AD will compute:
\frac{\partial^2 \hat{u}(x)}{\partial x^2} \quad (19.49)
The physics-driven loss is then defined as:
L_{\text{physics}} = \sum_{i=1}^{M} r(x_i)^2 \quad (19.50)
where r(xi ) represents the residuals at the collocation points xi distributed throughout the do-
main Ω. The number of these points M can vary depending on the problem’s complexity and
dimensionality. The total loss function is a weighted sum of the data-driven and physics-driven
terms:
Ltotal = Ldata + λLphysics (19.51)
where λ is a hyperparameter controlling the balance between the two loss terms. Minimizing this
loss function during training ensures that the neural network learns to approximate the solution
u(x) that satisfies both the data and the governing physical laws. A key feature of PINNs is the
use of automatic differentiation (AD), which allows us to compute the derivatives of the neural
network approximation û(x, θ) with respect to its inputs (i.e., the spatial coordinates in the PDE).
The chain rule of differentiation is applied recursively through the layers of the neural network.
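A minimal sketch of computing the residual (19.48) by automatic differentiation follows, for the one-dimensional model problem u'' = f, assuming PyTorch; the source term (chosen so that u(x) = sin(πx) solves the problem) is illustrative.

```python
import torch
import torch.nn as nn

u_hat = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))  # network u_hat(x; theta)

x = torch.rand(100, 1, requires_grad=True)       # collocation points in Omega = (0, 1)
u = u_hat(x)
du = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]    # du/dx
d2u = torch.autograd.grad(du, x, torch.ones_like(du), create_graph=True)[0] # d^2u/dx^2

f = -(torch.pi ** 2) * torch.sin(torch.pi * x)   # illustrative source term
r = d2u - f                                      # residual r(x_i) = L u_hat(x_i) - f(x_i)
L_physics = (r ** 2).mean()                      # physics-driven loss, eq. (19.50)
L_physics.backward()                             # gradients w.r.t. theta via backprop
```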
Automatic differentiation in this way yields the first- and higher-order derivatives required by the PDE. For example, for the Laplace operator in a 1D setting:
\frac{\partial^2 \hat{u}(x)}{\partial x^2} = \sum_{j=1}^{k} W_j \frac{\partial^2 \sigma(\cdot)}{\partial x^2} \quad (19.53)
This automatic differentiation procedure ensures that the PDE residual r(xi ) = Lû(xi , θ) − f (xi )
is computed efficiently and accurately. The formulation of PINNs extends naturally to higher-
dimensional PDEs. In the case of a system of partial differential equations, the operator L may
involve higher-order derivatives in multiple dimensions. For instance, in fluid dynamics, the gov-
erning equations might involve the Navier-Stokes equations, which require computing derivatives
in 3D space:
\frac{\partial u}{\partial t} + (u \cdot \nabla) u = -\nabla p + \nu \nabla^2 u + f \quad (19.54)
Here, u(x, t) is the velocity field, p is the pressure field, and ν is the kinematic viscosity. The neural
network architecture in PINNs can be extended to multi-output networks that solve for vector fields,
ensuring that all components of the velocity and pressure fields satisfy the corresponding PDEs.
For inverse problems, where we aim to infer unknown parameters of the system (e.g., material
properties, boundary conditions), PINNs provide a natural framework. The inverse problem is
framed as the optimization of the loss function with respect to both the neural network parameters
θ and the unknown physical parameters α:
(\theta^*, \alpha^*) = \arg\min_{\theta, \alpha} \mathcal{L}_{\text{total}}(\theta, \alpha). \quad (19.55)
Multi-fidelity PINNs involve using data at multiple levels of fidelity (e.g., coarse vs. fine simula-
tions, experimental data vs. high-fidelity models) to improve training efficiency and accuracy.
Physics-Informed Neural Networks (PINNs) provide an elegant and powerful approach to solv-
ing PDEs by embedding physical laws directly into the training process. The use of automatic
differentiation allows for efficient computation of residuals, and the combined loss function enforces
both data-driven and physics-driven constraints. With applications spanning across many domains
in engineering, physics, and biology, PINNs represent a significant advancement in the integration
of machine learning with scientific computing.
Consider the general problem
\mathcal{L}(u(x)) = f(x), \quad x \in \Omega, \quad (19.56)
where \mathcal{L} is a differential operator (either linear or nonlinear), u(x) is the unknown solution, and
f (x) is a given source or forcing term. The domain Ω ⊂ Rd represents the spatial region where the
solution u(x) is sought. In the case of nonlinear PDEs, L may involve both u(x) and its derivatives
in a nonlinear fashion. Additionally, boundary conditions are specified as:
u(x) = g(x), \quad x \in \partial\Omega, \quad (19.57)
where g(x) is a prescribed function on the boundary \partial\Omega of the domain. The weak formulation of
the PDE is obtained by multiplying both sides of the differential equation by a test function v(x)
and integrating over the domain Ω:
\int_{\Omega} v(x)\, \mathcal{L}(u(x))\, dx = \int_{\Omega} v(x)\, f(x)\, dx \quad (19.58)
This weak formulation is valid in spaces of functions that satisfy appropriate regularity conditions,
such as Sobolev spaces. The weak formulation transforms the problem into an integral form,
making it easier to handle numerically. The Deep Galerkin Method (DGM) is a deep learning-
based method for approximating the solution of PDEs. The fundamental idea is to construct a
neural network-based approximation û(x; θ) for the unknown function u(x), such that the residual
of the PDE (the error in satisfying the equation) is minimized in a Galerkin sense. This means that
the neural network is trained to minimize the weak form of the PDE’s residuals over the domain.
In the case of DGM using Physics-Informed Neural Networks (PINNs), the solution is embedded
in the architecture of a neural network, and the physics of the problem is enforced through the
loss function. The PINN aims to minimize the residual of the weak formulation of the PDE,
incorporating both the differential equation and boundary conditions. The neural network used to
approximate the solution û(x; θ) is typically a feedforward neural network with an input x ∈ Rd
(where d is the dimension of the domain) and output û(x; θ), which represents the predicted solution
at x. The parameters θ represent the weights and biases of the network, and the architecture is
chosen to be deep enough to capture the complexity of the solution. The neural network can be
expressed as:
û(x; θ) = NN(x; θ) (19.59)
Here, NN(x; θ) denotes the neural network function that maps the input x to an output û(x; θ).
The network layers can include nonlinear activation functions such as ReLU or tanh to capture
complex behavior. The PINN minimizes a loss function that combines the residual of the PDE and
the boundary condition enforcement. Let the loss function be:
\mathcal{L}(\theta) = \mathcal{L}_{\text{PDE}}(\theta) + \mathcal{L}_{\text{BC}}(\theta), \quad (19.60)
where LPDE (θ) represents the loss due to the PDE residual, and LBC (θ) enforces the boundary
conditions. The PDE residual LPDE (θ) is defined as the error in satisfying the PDE at a set of
collocation points \{x_i\}_{i=1}^{N_{\text{coll}}} in the domain \Omega, where N_{\text{coll}} is the number of collocation points. The residual at a point x_i is given by the difference between the differential operator applied to the predicted solution \hat{u}(x_i; \theta) and the forcing term f(x_i):
\mathcal{L}_{\text{PDE}}(\theta) = \frac{1}{N_{\text{coll}}} \sum_{i=1}^{N_{\text{coll}}} \big( \mathcal{L}(\hat{u}(x_i; \theta)) - f(x_i) \big)^2 \quad (19.61)
Here, L (û(xi ; θ)) is the result of applying the differential operator to the output of the neural
network at the collocation point xi . For nonlinear PDEs, the operator L might involve both u(x)
and its derivatives, and the derivatives of û(x; θ) are computed using automatic differentiation. The
boundary condition loss LBC (θ) ensures that the neural network’s predictions at boundary points
\{x_i^b\}_{i=1}^{N_{\text{BC}}} satisfy the boundary condition u(x) = g(x). This loss is computed as:
\mathcal{L}_{\text{BC}}(\theta) = \frac{1}{N_{\text{BC}}} \sum_{i=1}^{N_{\text{BC}}} \big( \hat{u}(x_i^b; \theta) - g(x_i^b) \big)^2 \quad (19.62)
where x_i^b are points on the boundary \partial\Omega, and g(x_i^b) is the prescribed boundary value. To train the neural network, the objective is to minimize the total loss function:
\theta^* = \arg\min_{\theta} \big( \mathcal{L}_{\text{PDE}}(\theta) + \mathcal{L}_{\text{BC}}(\theta) \big) \quad (19.63)
The neural network is designed to minimize the residual R(x) in the weak sense, over the points
where the loss is computed.
For nonlinear PDEs, such as the Navier-Stokes equations or nonlinear Schrödinger equations, the
neural network’s ability to approximate complex functions is key. The operator L(û(x)) may in-
volve terms like û(x)∇û(x) (nonlinear convection terms), and the neural network can model these
nonlinearities by introducing appropriate activation functions in the layers (e.g., ReLU, sigmoid,
or tanh). For a nonlinear PDE such as the incompressible Navier-Stokes equations:
\frac{\partial \hat{u}}{\partial t} + \hat{u} \cdot \nabla \hat{u} = -\nabla p + \nu \nabla^2 \hat{u} + f \quad (19.66)
where û is the velocity field, p is the pressure, ν is the kinematic viscosity, and f is the external
forcing, the network learns the solution û(x; θ) and p̂(x; θ), such that:
L(û(x; θ), p̂(x; θ)) = f (x) (19.67)
This requires the network to compute the derivatives of û and p̂ and use them in the residual
computation. Collocation points xi are typically sampled using Monte Carlo methods or Latin
hypercube sampling. This allows for efficient exploration of the domain Ω, especially in high-
dimensional spaces. Boundary points xbi are selected to enforce boundary conditions accurately.
The training process uses an iterative optimization procedure (e.g., Adam optimizer) to update
the neural network parameters θ. The gradients of the loss function are computed using automatic
differentiation in deep learning frameworks, ensuring accurate and efficient computation of the
derivatives of û(x). Convergence is determined by monitoring the reduction in the total loss L(θ),
which should approach zero as the solution is refined. Residuals are monitored for both the PDE
and boundary conditions, ensuring that the solution satisfies the PDE and boundary conditions to
a high degree of accuracy.
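A minimal sketch of this full training loop, minimizing (19.60)-(19.63) for the same one-dimensional model problem, follows, assuming PyTorch; the Adam optimizer, the point counts, and the homogeneous boundary data g = 0 are illustrative assumptions.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    # Monte Carlo collocation points in Omega = (0, 1) and fixed boundary points
    x = torch.rand(128, 1, requires_grad=True)
    xb = torch.tensor([[0.0], [1.0]])
    u = net(x)
    du = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    d2u = torch.autograd.grad(du, x, torch.ones_like(du), create_graph=True)[0]
    f = -(torch.pi ** 2) * torch.sin(torch.pi * x)  # source so that u = sin(pi x)
    L_pde = ((d2u - f) ** 2).mean()                 # eq. (19.61)
    L_bc = (net(xb) ** 2).mean()                    # eq. (19.62) with g = 0
    loss = L_pde + L_bc                             # eq. (19.60); minimized as in (19.63)
    opt.zero_grad(); loss.backward(); opt.step()
```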
In the Deep Galerkin Method (DGM) using Physics-Informed Neural Networks (PINNs), we con-
struct a neural network to approximate the solution of a PDE in the weak form. The loss function
enforces both the PDE residual and boundary conditions, and the network is trained to minimize
this loss using gradient-based optimization. The method is highly flexible and can handle both
linear and nonlinear PDEs, leveraging the power of neural networks to solve complex differential
equations in a scientifically and mathematically rigorous manner. This rigorous framework can be
applied to a wide variety of differential equations, from simple linear cases to complex nonlinear
systems, and serves as a powerful tool for solving high-dimensional and difficult-to-solve PDEs.
20 Deep Kolmogorov Methods
Literature Review: Han and Jentzen (2017) [480] introduced the Deep BSDE (Backward Stochas-
tic Differential Equation) solver, a foundational framework for solving high-dimensional PDEs using
deep learning. It demonstrates how neural networks can approximate solutions to parabolic PDEs
by reformulating them as stochastic control problems. The authors rigorously prove the conver-
gence of the method and provide numerical experiments for high-dimensional problems, such as the
Hamilton-Jacobi-Bellman equation. Beck et al. (2021) [481] extended the Deep BSDE method to
solve Kolmogorov PDEs, which describe the evolution of probability densities for stochastic pro-
cesses. The authors provide a theoretical analysis of the approximation capabilities of deep neural
networks for these equations and demonstrate the method’s effectiveness in high-dimensional set-
tings. While not exclusively focused on Kolmogorov methods, the paper by Raissi et al. (2019)
[457] introduces Physics-Informed Neural Networks (PINNs), which have become a cornerstone
in deep learning for PDEs. PINNs incorporate physical laws (e.g., PDEs) directly into the loss
function, enabling the solution of forward and inverse problems. The framework is applicable to
high-dimensional PDEs and has inspired many subsequent works. Han and Jentzen (2018) [482]
provided a comprehensive theoretical and empirical analysis of the Deep BSDE method. It high-
lights the method’s ability to overcome the curse of dimensionality and demonstrates its application
to high-dimensional nonlinear PDEs, including those arising in finance and physics. Sirignano and
Spiliopoulos (2018) [460] proposed the Deep Galerkin Method (DGM), which uses deep neural net-
works to approximate solutions to PDEs without requiring a mesh. The method is particularly
effective for high-dimensional problems and is shown to outperform traditional numerical methods
in certain settings. Yu (2018) [484] introduced the Deep Ritz Method, which uses deep learning
to solve variational problems associated with elliptic PDEs. The method is closely related to Kol-
mogorov methods and provides a powerful alternative for high-dimensional problems. Zhang et al. (2020) [463] extended PINNs to solve time-dependent stochastic PDEs, including Kolmogorov-
type equations. The authors propose a modal decomposition approach to improve the efficiency
and accuracy of the method in high dimensions. Jentzen et al. (2021) [483] provided a rigor-
ous mathematical foundation for deep learning-based methods for nonlinear parabolic PDEs. It
includes error estimates and convergence proofs, making it a key reference for understanding the
theoretical underpinnings of Deep Kolmogorov Methods. Khoo et al. (2021) [485] explored the
use of neural networks to solve parametric PDEs, which are closely related to Kolmogorov equa-
tions. The authors provide a unified framework for handling high-dimensional parameter spaces and
demonstrate the method’s effectiveness in various applications. While not strictly a deep learning
method, the paper by Hutzenthaler et al. (2020) [486] introduced the Multilevel Picard method,
which has inspired many deep learning approaches for high-dimensional PDEs. It provides a theo-
retical framework for approximating solutions to semilinear parabolic PDEs, including Kolmogorov
equations.
The Deep Kolmogorov Method (DKM) is a deep learning-based approach to solving high-
dimensional partial differential equations (PDEs), particularly those arising from stochastic pro-
cesses governed by Itô diffusions. The rigorous foundation of DKM is built upon stochastic
analysis, functional analysis, variational principles, and neural network approximation
theory. To fully understand the method, one must rigorously derive the Kolmogorov backward
equation, justify its probabilistic representation via Feynman-Kac theory, and establish the error
bounds for deep learning approximations within appropriate function spaces. Let us explore these
aspects in their maximal mathematical depth.
Consider an Itô diffusion dX_t = \mu(X_t)\, dt + \sigma(X_t)\, dW_t with X_0 = x, whose coefficients satisfy the global Lipschitz conditions
\|\mu(x) - \mu(y)\| \le C \|x - y\|, \quad \|\sigma(x) - \sigma(y)\| \le C \|x - y\|, \quad \forall x, y \in \mathbb{R}^d. \quad (20.2)
These conditions guarantee the existence of a unique strong solution $X_t$ to the SDE, satisfying $\mathbb{E}[\sup_{0 \le t \le T} \|X_t\|^2] < \infty$. The Kolmogorov backward equation describes the evolution of a function $u(t, x)$, which is defined as the expected value of a terminal function $g(X_T)$ and an integral source term $f(t, X_t)$:
$$u(t, x) = \mathbb{E}\left[\, g(X_T) + \int_t^T f(s, X_s)\,ds \;\middle|\; X_t = x \right]. \tag{20.3}$$
This equation is well-posed in function spaces such as Sobolev spaces H k (Rd ), Hölder spaces
C k,α (Rd ), or Bochner spaces Lp (Ω; H k (Rd )) under standard parabolic regularity assumptions.
Taking expectations and noting that the stochastic integral has zero mean, we conclude that Mt
is a martingale, which establishes the Feynman-Kac representation:
$$u(t, x) = \mathbb{E}\left[\, g(X_T) + \int_t^T f(s, X_s)\,ds \;\middle|\; X_t = x \right]. \tag{20.9}$$
To prove the above equation, we assume that $X_t$ is a diffusion process satisfying the stochastic differential equation (SDE)
$$dX_t = \mu(X_t, t)\,dt + \sigma(X_t, t)\,dW_t. \tag{20.10}$$
Here $W_t$ is a standard Brownian motion on a filtered probability space $(\Omega, \mathcal{F}, (\mathcal{F}_t)_{t \ge 0}, \mathbb{P})$. The
drift µ(x, t) and diffusion σ(x, t) are assumed to be Lipschitz continuous in x and measurable
in t, ensuring existence and uniqueness of a strong solution to the SDE. The filtration (Ft ) is the
natural filtration of Wt , satisfying the usual conditions (right-continuity and completeness).
We consider the backward parabolic partial differential equation (PDE):
$$\frac{\partial u}{\partial t} + \mu(x, t)\frac{\partial u}{\partial x} + \frac{1}{2}\sigma^2(x, t)\frac{\partial^2 u}{\partial x^2} = V(x, t)\,u + f(x, t), \tag{20.11}$$
with final condition:
u(x, T ) = g(x). (20.12)
The Feynman-Kac representation states that:
$$u(x, t) = \mathbb{E}\left[\, \int_t^T e^{-\int_t^s V(X_r, r)\,dr} f(X_s, s)\,ds + e^{-\int_t^T V(X_r, r)\,dr}\, g(X_T) \;\middle|\; X_t = x \right]. \tag{20.13}$$
This provides a probabilistic representation of the solution to the PDE. Let’s now revisit some
prerequisites from Stochastic Calculus and Functional Analysis. For that, we first discuss the
existence of the Stochastic Process Xt . The existence of Xt follows from the standard existence
and uniqueness theorem for SDEs when µ(x, t) and σ(x, t) satisfy the Lipschitz continuity
condition:
|µ(x, t) − µ(y, t)| + |σ(x, t) − σ(y, t)| ≤ L|x − y|. (20.14)
Under these conditions, there exists a unique strong solution $X_t$ that is adapted to $\mathcal{F}_t$. We next invoke Itô's lemma: for a sufficiently smooth function $\phi(X_t, t)$,
$$d\phi(X_t, t) = \left( \frac{\partial \phi}{\partial t} + \mu \frac{\partial \phi}{\partial x} + \frac{1}{2}\sigma^2 \frac{\partial^2 \phi}{\partial x^2} \right) dt + \sigma \frac{\partial \phi}{\partial x}\, dW_t. \tag{20.15}$$
This will be crucial in proving the Feynman-Kac formula. Now let us prove the Feynman-Kac
Formula. The first step is to define the Stochastic Process Ys . Define:
$$Y_s = e^{-\int_t^s V(X_r, r)\,dr}\, u(X_s, s). \tag{20.16}$$
Applying Itô's lemma to $Y_s$ and using the PDE (20.11) yields
$$dY_s = e^{-\int_t^s V(X_r, r)\,dr} \left( \frac{\partial u}{\partial t} + \mu \frac{\partial u}{\partial x} + \frac{1}{2}\sigma^2 \frac{\partial^2 u}{\partial x^2} - V u \right) ds + e^{-\int_t^s V(X_r, r)\,dr}\, \sigma \frac{\partial u}{\partial x}\, dW_s. \tag{20.21}$$
The second step is to take expectations and use the martingale property. Define the process
$$M_s = \int_t^s e^{-\int_t^r V(X_q, q)\,dq}\, \sigma \frac{\partial u}{\partial x}\, dW_r. \tag{20.22}$$
Since $M_s$ is a stochastic integral, it is a martingale with expectation zero, $\mathbb{E}[M_s] = 0$. Taking expectations in (20.21) and using the terminal condition $u(x, T) = g(x)$ then yields the Feynman-Kac representation (20.13), completing the proof.
We now turn to the neural network approximation. The network parameters are trained by gradient descent with learning rate $\eta$, and by the universal approximation theorem a sufficiently deep ReLU network attains an approximation error controlled by the network depth $L$ and width $W$. Let $u : [0, T] \times \mathbb{R}^d \to \mathbb{R}$ be the exact solution to the Kolmogorov backward equation:
$$\frac{\partial u}{\partial t} + \mathcal{L}u = f, \qquad (t, x) \in [0, T] \times \mathbb{R}^d, \tag{20.29}$$
where $\mathcal{L}$ is the differential operator
$$\mathcal{L}u = \sum_{i=1}^d b_i(x) \frac{\partial u}{\partial x_i} + \frac{1}{2} \sum_{i,j=1}^d a_{ij}(x) \frac{\partial^2 u}{\partial x_i \partial x_j}, \tag{20.30}$$
with $b_i(x)$ and $a_{ij}(x)$ satisfying smoothness and uniform ellipticity conditions:
$$\exists\, \lambda, \Lambda > 0 \text{ such that } \lambda |\xi|^2 \le \sum_{i,j=1}^d a_{ij}(x)\, \xi_i \xi_j \le \Lambda |\xi|^2 \quad \forall \xi \in \mathbb{R}^d. \tag{20.31}$$
The quantity to be bounded is the Sobolev approximation error
$$\|u - u_\theta\|_{H^s}. \tag{20.33}$$
We now use Sobolev space approximation by deep neural networks. Assume $u \in H^s(\mathbb{R}^d)$ with $s > d/2$; the Sobolev embedding theorem then gives $H^s(\mathbb{R}^d) \hookrightarrow C^{0,\alpha}(\mathbb{R}^d)$ for some $\alpha > 0$, so $u$ is Hölder continuous, which is crucial for pointwise approximation. By universal approximation in Sobolev norms, via the Barron space theorem and error estimates in Sobolev norms, there exists a neural network $u_\theta$ approximating $u$ in the $H^s$ norm, where $W$ is the network width, $L$ is the depth, and the constant in the bound depends on the smoothness of $u$. This error bound refines the classical universal approximation theorem by considering derivatives up to order $s$.
To quantify the neural network approximation error for the Kolmogorov equation, we examine the residual:
$$R_\theta(t, x) = \frac{\partial u_\theta}{\partial t} + \mathcal{L}u_\theta - f. \tag{20.36}$$
From Sobolev estimates and the regularity theory of parabolic PDEs, the residual $R_\theta$ is controlled by the $H^s$ approximation error, which leads to the scaling relation between width and depth
$$W \sim L^{d/s}. \tag{20.41}$$
We have established that the approximation error for deep neural networks solving the Kolmogorov backward equation obeys a rigorous bound that follows from Sobolev theory, parabolic PDE regularity, and universal approximation in higher-order norms.
Recall the probabilistic representation of $u(t, x)$ as an expectation over paths of a standard Brownian motion $W_s$. We approximate this expectation using Monte Carlo sampling. Given $N$ independent samples $X_T^{(i)} \sim p(x, T)$, the empirical Monte Carlo estimator is
$$u_N(t, x) = \frac{1}{N} \sum_{i=1}^N \left[ g(X_T^{(i)}) + \int_t^T f(s, X_s^{(i)})\,ds \right]. \tag{20.48}$$
where $\Omega$ is the sample space of Brownian paths, $\mathcal{F}$ is the filtration generated by $W_s$, and $\mathbb{P}$ is the Wiener measure. The random variable $E_N = u_N(t, x) - u(t, x)$ is thus defined over this probability space. By the Law of Large Numbers (LLN), we have
$$\mathbb{P}\left( \lim_{N \to \infty} E_N = 0 \right) = 1. \tag{20.51}$$
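As a concrete illustration of the estimator (20.48) and the $O_p(1/\sqrt{N})$ behaviour established below, the following minimal sketch (Python/NumPy) simulates the diffusion with an Euler-Maruyama scheme and averages the pathwise payoffs. The drift, diffusion, terminal, and source functions `mu`, `sigma`, `g`, `f` passed in the example are hypothetical placeholders, not quantities fixed by the text.

```python
import numpy as np

def monte_carlo_u(x0, t, T, mu, sigma, g, f, n_paths=100_000, n_steps=200, seed=0):
    r"""Estimate u(t, x0) = E[g(X_T) + \int_t^T f(s, X_s) ds | X_t = x0] for a
    scalar Ito diffusion dX = mu(X) dt + sigma(X) dW via Euler-Maruyama."""
    rng = np.random.default_rng(seed)
    dt = (T - t) / n_steps
    x = np.full(n_paths, x0, dtype=float)
    integral = np.zeros(n_paths)              # running quadrature of the source term
    s = t
    for _ in range(n_steps):
        integral += f(s, x) * dt              # left-endpoint rule for the time integral
        x += mu(x) * dt + sigma(x) * rng.normal(0.0, np.sqrt(dt), n_paths)
        s += dt
    samples = g(x) + integral
    # Standard error of the mean decays like 1/sqrt(N), cf. (20.55).
    return samples.mean(), samples.std(ddof=1) / np.sqrt(n_paths)

# Illustrative Ornstein-Uhlenbeck example with f = 0 and g(x) = x^2.
u_hat, se = monte_carlo_u(1.0, 0.0, 1.0,
                          mu=lambda x: -x,
                          sigma=lambda x: 0.3 * np.ones_like(x),
                          g=lambda x: x ** 2,
                          f=lambda s, x: 0.0 * x)
```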
However, for finite $N$, we quantify the error using advanced probability bounds. Regarding the asymptotic analysis of the Monte Carlo error, the expectation of the squared error is
$$\mathbb{E}[E_N^2] = \frac{1}{N} \operatorname{Var}\left( g(X_T) + \int_t^T f(s, X_s)\,ds \right). \tag{20.52}$$
Applying the Central Limit Theorem (CLT), we obtain the asymptotic distribution
$$\sqrt{N}\, E_N \xrightarrow{d} \mathcal{N}(0, \sigma^2), \tag{20.53}$$
where
$$\sigma^2 = \operatorname{Var}\left( g(X_T) + \int_t^T f(s, X_s)\,ds \right). \tag{20.54}$$
Thus, the Monte Carlo error satisfies
$$E_N = O_p\!\left( \frac{1}{\sqrt{N}} \right). \tag{20.55}$$
We next refine these error bounds via concentration inequalities. To rigorously bound the error, we employ Hoeffding's inequality:
$$\mathbb{P}(|E_N| \ge \epsilon) \le 2 \exp\!\left( -\frac{2N\epsilon^2}{\sigma^2} \right). \tag{20.56}$$
For a higher-order bound, we use the Berry-Esseen theorem:
$$\sup_x \left| \mathbb{P}\!\left( \frac{\sqrt{N}\, E_N}{\sigma} \le x \right) - \Phi(x) \right| \le \frac{C}{\sqrt{N}}, \tag{20.57}$$
where $C$ depends on the third moment
$$\mathbb{E}\left[ \left| g(X_T) + \int_t^T f(s, X_s)\,ds - u(t, x) \right|^3 \right]. \tag{20.58}$$
From a functional analysis perspective, we derive operator norm bounds. Define the Monte Carlo estimator as a linear operator
$$M_N : L^2(\Omega) \to \mathbb{R}, \tag{20.59}$$
such that
$$M_N \phi = \frac{1}{N} \sum_{i=1}^N \phi(X_T^{(i)}). \tag{20.60}$$
The error is then the operator norm deviation
$$\|M_N - \mathbb{E}\|_{L^2} = O\!\left( \frac{1}{\sqrt{N}} \right). \tag{20.61}$$
By the spectral decomposition of the covariance operator, the error satisfies
$$\|E_N\|_{L^2} \le \frac{\lambda_{\max}^{1/2}}{\sqrt{N}}, \tag{20.62}$$
where $\lambda_{\max}$ is the largest eigenvalue of the covariance matrix. For a more precise error characterization, we use the Edgeworth series for a higher-order expansion:
$$\mathbb{P}\!\left( \frac{\sqrt{N}\, E_N}{\sigma} \le x \right) = \Phi(x) + \frac{\rho_3}{6\sqrt{N}} (1 - x^2)\,\phi(x) + O\!\left( \frac{1}{N} \right), \tag{20.63}$$
where $\rho_3$ is the skewness of $g(X_T) + \int_t^T f(s, X_s)\,ds$, and $\phi(x)$ is the standard normal density. We have now
mathematically rigorously proved that the Monte Carlo sampling error in the Deep Kolmogorov method satisfies
$$E_N = O_p\!\left( \frac{1}{\sqrt{N}} \right), \tag{20.64}$$
with precise higher-order refinements via the Berry-Esseen theorem (finite-sample error), Hoeffding's inequality (concentration bound), functional norm bounds (operator analysis), and the Edgeworth expansion (higher-order moment corrections). Thus, the optimal error decay rate remains $1/\sqrt{N}$, but the prefactors depend on problem-specific variance and moment conditions.
The Deep Kolmogorov Method (DKM) provides a framework for solving high-dimensional
PDEs using deep learning, with rigorous theoretical justification from stochastic calculus, func-
tional analysis, and neural network theory.
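To make this framework concrete, here is a minimal, assumption-laden sketch (PyTorch) of the regression form of the Deep Kolmogorov Method in the source-free case $f = 0$: initial points are sampled from a box, terminal values $g(X_T)$ are simulated by Euler-Maruyama, and a network $u_\theta$ is fit by least squares so that $u_\theta(x) \approx \mathbb{E}[g(X_T) \mid X_0 = x]$. The drift, diffusion, domain, and architecture are illustrative choices, not prescriptions from the text.

```python
import torch

d, T, n_steps = 10, 1.0, 50
dt = T / n_steps
mu = lambda x: -x                          # assumed linear drift (illustrative)
sigma_const = 0.4                          # assumed constant diagonal diffusion
g = lambda x: (x ** 2).sum(dim=1, keepdim=True)

u_theta = torch.nn.Sequential(             # network approximating x -> u(0, x)
    torch.nn.Linear(d, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 1),
)
opt = torch.optim.Adam(u_theta.parameters(), lr=1e-3)

for step in range(2000):
    x0 = torch.rand(512, d) * 2 - 1        # sample initial points in [-1, 1]^d
    x = x0.clone()
    for _ in range(n_steps):               # Euler-Maruyama simulation of X_t
        x = x + mu(x) * dt + sigma_const * dt ** 0.5 * torch.randn_like(x)
    loss = ((u_theta(x0) - g(x)) ** 2).mean()   # L2 regression onto g(X_T)
    opt.zero_grad(); loss.backward(); opt.step()
```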
21 Reinforcement Learning
Literature Review: Sutton and Barto (2018, 2021) [273] [274] wrote the definitive textbook on
reinforcement learning. It covers the fundamental concepts, including Markov decision processes
(MDPs), temporal difference learning, policy gradient methods, and function approximation. The
second edition expands on deep reinforcement learning, covering advanced algorithms like DDPG,
A3C, and PPO. Bertsekas and Tsitsiklis (1996) [275] laid the theoretical foundation for reinforce-
ment learning by introducing neuro-dynamic programming, an extension of dynamic programming
methods for decision-making under uncertainty. It rigorously covers approximate dynamic pro-
gramming, policy iteration, and value function approximation. Kakade (2003) [276] in his thesis
formalized the sample complexity of RL, providing theoretical guarantees for how much data is
required for an agent to learn optimal policies. It introduces the PAC-RL (Probably Approxi-
mately Correct RL) framework, which has significantly influenced how RL algorithms are evaluated.
Szepesvári (2010) [277] presented a rigorous yet concise overview of reinforcement learning algo-
rithms, including value iteration, Q-learning, SARSA, function approximation, and policy gradient
methods. It provides deep theoretical insights into convergence proofs and performance bounds.
Haarnoja et al. (2018) [278] introduced Soft Actor-Critic (SAC), an off-policy deep reinforce-
ment learning algorithm that maximizes expected reward and entropy simultaneously. It provides
a strong theoretical framework for handling exploration-exploitation trade-offs in high-dimensional
continuous action spaces. Mnih et al. (2015) [279] introduced Deep Q-Networks (DQN), demon-
strating how deep learning can be combined with Q-learning to achieve human-level performance
in Atari games. The authors address key challenges in reinforcement learning, including function
approximation and stability improvements. Konda and Tsitsiklis (2003) [280] provided a rigorous
theoretical analysis of Actor-Critic methods, which combine policy-based and value-based learning.
It formally establishes convergence proofs for actor-critic algorithms and introduces the natural
gradient method for policy improvement. Levine (2018) [281] introduced a probabilistic inference
framework for reinforcement learning, linking RL to Bayesian inference. It provides a theoreti-
cal foundation for maximum entropy reinforcement learning, explaining why entropy-regularized
objectives lead to better exploration and stability. Mannor et al. (2022) [282] gave one of the most rigorous mathematical treatments of reinforcement learning theory, covering PAC guarantees for RL algorithms, complexity bounds for exploration, connections between RL and control theory, and convergence rates of popular RL methods. Borkar (2008) [283] rigorously an-
alyzed stochastic approximation methods, which form the theoretical backbone of RL algorithms
like TD-learning, Q-learning, and policy gradient methods. Borkar provides a dynamical systems
perspective to convergence analysis, offering deep mathematical insights.
In reinforcement learning, an agent interacts with the environment by taking actions based on the current state of the environ-
ment. The goal of the agent is to maximize the expected cumulative reward over time. A policy
π is a mapping from states to probability distributions over actions. Formally, the policy π can be
written as:
π : S → P(A), (21.1)
where S is the state space, A is the action space, and P(A) is the set of probability distributions
over the actions. The policy can be either deterministic:
$$\pi(a_t \mid s_t) = \begin{cases} 1 & \text{if } a_t = \pi(s_t), \\ 0 & \text{otherwise}, \end{cases} \tag{21.2}$$
where π(st ) is the action chosen in state st , or stochastic, in which case the policy assigns a
probability distribution over actions for each state st . The goal of reinforcement learning is to find
an optimal policy π ∗ (st ), which maximizes the expected return (cumulative reward) from any initial
state. The optimal policy is defined as:
"∞ #
X
π ∗ (st ) = arg max E γ k rt+k | st , (21.3)
π
k=0
where γ is the discount factor that determines the weight of future rewards, and E[·] represents
the expectation under the policy π. The optimal policy can be derived from the optimal action-
value function Q∗ (st , at ), which we define in the next section. The state st ∈ S describes the
current situation of the agent at time t, encapsulating all relevant information that influences the
agent’s decision-making process. The state space S may be either discrete or continuous. The state
transitions are governed by a probability distribution P (st+1 |st , at ), which represents the probability
of moving from state st to state st+1 given action at . These transitions satisfy the Markov property,
meaning the future state depends only on the current state and action, not the history of previous
states or actions:
$$P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t) \quad \forall s_t, s_{t+1} \in S,\; a_t \in A. \tag{21.4}$$
Additionally, the transition probabilities satisfy the normalization condition:
$$\sum_{s_{t+1} \in S} P(s_{t+1} \mid s_t, a_t) = 1 \quad \forall s_t, a_t. \tag{21.5}$$
The state distribution ρt (st ) represents the probability of the agent being in state st at time t. The
state distribution evolves over time according to the transition probabilities:
$$\rho_{t+k}(s_{t+k}) = \sum_{s_t \in S} P(s_{t+k} \mid s_t, a_t)\, \rho_t(s_t), \tag{21.6}$$
where ρt (st ) is the initial distribution at time t, and ρt+k (st+k ) is the distribution at time t + k. An
action at taken at time t by the agent in state st leads to a transition to state st+1 and results in a
reward rt . The agent aims to select actions that maximize its long-term reward. The action-value
function Q(st , at ) quantifies the expected cumulative reward from taking action at in state st and
following the optimal policy thereafter. It is defined as:
"∞ #
X
k
Q(st , at ) = E γ rt+k | st , at . (21.7)
k=0
The optimal action-value function Q∗ (st , at ) satisfies the Bellman Optimality Equation:
$$Q^*(s_t, a_t) = R(s_t, a_t) + \gamma \sum_{s_{t+1} \in S} P(s_{t+1} \mid s_t, a_t) \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}). \tag{21.8}$$
This recursive equation provides the foundation for dynamic programming methods such as value
iteration and policy iteration. The optimal policy $\pi^*(s_t)$ is derived by choosing the action that maximizes the action-value function:
$$\pi^*(s_t) = \arg\max_{a_t \in A} Q^*(s_t, a_t). \tag{21.9}$$
The optimal value function $V^*(s_t)$, representing the expected return from state $s_t$ under the optimal policy, is given by:
$$V^*(s_t) = \max_{a_t \in A} Q^*(s_t, a_t). \tag{21.10}$$
The reward rt at time t is a scalar value that represents the immediate benefit (or cost) the agent
receives after taking action at in state st . It is a function R(st , at ) mapping state-action pairs to
real numbers:
rt = R(st , at ). (21.12)
The agent’s objective is to maximize the cumulative reward, which is given by the total return from
time $t$:
$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}. \tag{21.13}$$
The agent seeks to find a policy π that maximizes the expected return. The Bellman equation for
the expected return is:
$$V^\pi(s_t) = R(s_t, \pi(s_t)) + \gamma \sum_{s_{t+1} \in S} P(s_{t+1} \mid s_t, \pi(s_t))\, V^\pi(s_{t+1}). \tag{21.14}$$
This recursive relation helps in solving for the optimal value function. An RL problem is typically modeled as a Markov Decision Process (MDP), which is defined as the tuple
$$\mathcal{M} = (S, A, P, R, \gamma), \tag{21.15}$$
where $S$ is the state space, $A$ is the action space, $P$ is the transition kernel, $R$ is the reward function, and $\gamma$ is the discount factor. The agent's goal is to solve the MDP by finding the optimal policy $\pi^*(s_t)$ that maximizes the
cumulative expected reward. Reinforcement Learning provides a powerful framework for decision-
making in uncertain environments, where the agent seeks to maximize cumulative rewards over
time. The core concepts—agents, states, actions, rewards—are formalized mathematically within
the structure of a Markov Decision Process, enabling the application of optimization techniques
such as dynamic programming, Q-learning, and policy gradient methods to solve complex decision-
making problems.
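The dynamic programming principle above translates directly into value iteration. The sketch below assumes a finite MDP supplied as explicit arrays with the (hypothetical) layout `P[s, a, s']` and `R[s, a]`:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate the Bellman optimality backup (21.8) to a fixed point.
    P: (S, A, S) transition tensor; R: (S, A) reward matrix."""
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    while True:
        V = Q.max(axis=1)                      # V*(s) = max_a Q*(s, a), cf. (21.10)
        Q_new = R + gamma * (P @ V)            # one Bellman backup, cf. (21.8)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new, Q_new.argmax(axis=1) # optimal Q and greedy policy (21.9)
        Q = Q_new
```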
Deep Q-Learning (DQL) is an advanced reinforcement learning (RL) technique where the goal
is to approximate the optimal action-value function Q∗ (s, a) through the use of deep neural net-
works. In traditional Q-learning, the action-value function Q(s, a) maps a state-action pair to the
expected return or cumulative discounted reward from that state-action pair, under the assumption
of following an optimal policy. Formally, the Q-function is defined as:
"∞ #
X
Q(s, a) = E γ t rt | s0 = s, a0 = a (21.16)
t=0
where γ ∈ [0, 1] is the discount factor, which determines the weight of future rewards relative to
immediate rewards, and rt is the reward received at time step t. The optimal Q-function Q∗ (s, a)
satisfies the Bellman optimality equation:
$$Q^*(s, a) = \mathbb{E}\left[ r_t + \gamma \max_{a'} Q^*(s_{t+1}, a') \,\middle|\, s_0 = s, a_0 = a \right], \tag{21.17}$$
where st+1 is the next state after taking action a in state s, and the maximization term represents
the optimal future expected reward. This equation represents the recursive structure of the optimal
action-value function, where each Q-value is updated based on the reward obtained in the current
step and the maximum future reward expected from the next state. The goal is to learn the optimal
Q-function through iterative updates, typically using the Temporal Difference (TD) method. In
Deep Q-Learning, the Q-function is approximated by a deep neural network, as directly storing
Q-values for every state-action pair is computationally infeasible for large state and action spaces.
Let the approximated Q-function be Qθ (s, a), where θ denotes the parameters (weights and biases)
of the neural network that approximates the action-value function. The deep Q-network (DQN)
aims to learn Qθ (s, a) such that it closely approximates Q∗ (s, a) over time. The update of the
Q-function follows the TD error principle, where the goal is to minimize the difference between the
current Q-values and the target Q-values derived from the Bellman equation. The loss function for
training the DQN is given by:
$$L(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}} \left[ \left( y_t - Q_\theta(s_t, a_t) \right)^2 \right], \tag{21.18}$$
where D denotes the experience replay buffer containing previous transitions (st , at , rt , st+1 ). The
target yt for the Q-values is defined as:
$$y_t = r_t + \gamma \max_{a'} Q_{\theta^-}(s_{t+1}, a'). \tag{21.19}$$
Here, θ− represents the parameters of the target network, which is a slowly updated copy of the
online network parameters θ. The target network Qθ− is used to generate stable targets for the
Q-value updates, and its parameters are updated periodically by copying the parameters from the
online network θ after every T steps. The idea behind this is to stabilize the training by preventing
rapid changes in the Q-values due to feedback loops from the Q-network’s predictions. The update
rule for the network parameters θ follows the gradient descent method and is expressed as:
$$\nabla_\theta L(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}} \left[ \left( y_t - Q_\theta(s_t, a_t) \right) \nabla_\theta Q_\theta(s_t, a_t) \right], \tag{21.20}$$
where ∇θ Qθ (st , at ) is the gradient of the Q-function with respect to the parameters θ, which is
computed using backpropagation through the neural network. This gradient is used to update the
parameters of the Q-network to minimize the loss function. In reinforcement learning, the agent
must balance exploration (trying new actions) and exploitation (selecting actions that maximize
the reward). This is often handled by using an epsilon-greedy policy, where the agent selects a
random action with probability ϵ and the action with the highest Q-value with probability 1 − ϵ.
The epsilon value is decayed over time to ensure that, as the agent learns, it shifts from exploration
to more exploitation. The epsilon-greedy action selection rule is given by:
$$a_t = \begin{cases} \text{random action}, & \text{with probability } \epsilon, \\ \arg\max_a Q_\theta(s_t, a), & \text{with probability } 1 - \epsilon. \end{cases} \tag{21.21}$$
This policy encourages the agent to explore different actions at the beginning of training and
gradually exploit the learned Q-values as training progresses. The decay of ϵ typically follows an
annealing schedule to balance exploration and exploitation effectively. A critical component in
stabilizing training in Deep Q-Learning is the use of experience replay. In standard Q-learning,
the updates are based on consecutive transitions, which can lead to high correlations between
consecutive data points. This correlation can slow down learning or even lead to instability. Ex-
perience replay addresses this issue by storing a buffer of past experiences and sampling random
mini-batches from this buffer during training. This breaks the correlation between consecutive
samples and results in more stable and efficient updates. Mathematically, the loss function for
training the network involves random sampling of transitions (st , at , rt , st+1 ) from the experience
replay buffer D, and the update to the Q-values is computed using the Bellman error based on the
sampled experiences:
$$L(\theta) = \mathbb{E}_{(s_t, a_t, r_t, s_{t+1}) \sim \mathcal{D}} \left[ \left( r_t + \gamma \max_{a'} Q_{\theta^-}(s_{t+1}, a') - Q_\theta(s_t, a_t) \right)^2 \right]. \tag{21.22}$$
This method ensures that the Q-values are updated in a way that is less sensitive to the order in
which experiences are observed, promoting more stable learning dynamics.
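To make the preceding training machinery concrete, the following sketch combines uniform replay sampling, epsilon-greedy selection (21.21), and the TD target (21.19). The `q_target` callable, the `(s, a, r, s', done)` tuple layout, and the terminal-state masking via `done` are illustrative implementation assumptions rather than details fixed by the text.

```python
import random
import numpy as np

def epsilon_greedy(q_values, eps, rng):
    """Epsilon-greedy action selection, cf. (21.21)."""
    if rng.random() < eps:
        return int(rng.integers(len(q_values)))    # explore
    return int(np.argmax(q_values))                # exploit

def sample_minibatch(replay_buffer, batch_size=32):
    """Uniformly sample decorrelated transitions (s, a, r, s', done) from D."""
    batch = random.sample(replay_buffer, batch_size)
    return tuple(np.array(x) for x in zip(*batch))

def dqn_targets(batch, q_target, gamma=0.99):
    """TD targets y_t = r_t + gamma * max_a' Q_{theta^-}(s_{t+1}, a'), cf. (21.19)."""
    s, a, r, s_next, done = batch
    q_next = q_target(s_next).max(axis=1)          # frozen target network
    return r + gamma * (1.0 - done) * q_next       # no bootstrapping past terminals
```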
Despite its success, the DQL algorithm can still suffer from certain issues such as overestimation
bias and instability due to the maximization step in the Bellman equation. Overestimation bias
occurs because the maximization operation maxa′ Qθ− (st+1 , a′ ) tends to overestimate the true value,
as the Q-values are updated based on the same Q-network. To address this, Double Q-learning was
introduced, which uses two separate Q-networks for action selection and value estimation, reducing
overestimation bias. In Double Q-learning, the target Q-value is computed using the following
equation:
$$y_t = r_t + \gamma\, Q_{\theta^-}\!\left( s_{t+1},\; \arg\max_{a'} Q_\theta(s_{t+1}, a') \right). \tag{21.23}$$
This approach helps to mitigate the overestimation problem by decoupling the action selection
from the Q-value estimation process. The value of arg max is taken from the online network Qθ ,
while the Q-value for the next state is estimated using the target network Qθ− . Another extension
to improve the DQL framework is Dueling Q-Learning, which decomposes the Q-function into two
separate components: the state value function Vθ (s) and the advantage function Aθ (s, a). The
Q-function is then expressed as:
$$Q_\theta(s, a) = V_\theta(s) + \left( A_\theta(s, a) - \frac{1}{|A|} \sum_{a'} A_\theta(s, a') \right). \tag{21.24}$$
This decomposition allows the agent to learn the value of a state Vθ (s) independently of the specific
actions, thus reducing the number of parameters needed for learning. This is particularly beneficial
in environments where many actions have similar expected rewards, as it enables the agent to focus
on identifying the value of states rather than overfitting to individual actions.
In conclusion, Deep Q-Learning is an advanced reinforcement learning method that utilizes deep
neural networks to approximate the optimal Q-function, enabling agents to handle large state and
action spaces. The mathematical formulation of DQL involves minimizing the loss function based
on the temporal difference error, utilizing experience replay to stabilize learning, and using target
networks to prevent instability. Extensions such as Double Q-learning and Dueling Q-learning
further improve the performance and stability of the algorithm. Despite its remarkable successes,
Deep Q-Learning still faces challenges such as overestimation bias and instability, which have been
addressed with innovative modifications to the original algorithm.
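A short sketch contrasting the two targets makes the decoupling in (21.23) explicit; `q_online` and `q_target` are hypothetical stand-ins for the two networks:

```python
import numpy as np

def double_dqn_targets(batch, q_online, q_target, gamma=0.99):
    """Double Q-learning target (21.23): the online network selects the action,
    the target network evaluates it, mitigating overestimation bias."""
    s, a, r, s_next, done = batch
    a_star = q_online(s_next).argmax(axis=1)                   # action selection
    q_eval = q_target(s_next)[np.arange(len(a_star)), a_star]  # value estimation
    return r + gamma * (1.0 - done) * q_eval
```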
(e.g., lighting, texture variations). They used deep learning and image-based metrics to measure
differences between simulated and real-world training environments. Kobanda et al. (2024) [374]
introduced a hierarchical approach to offline reinforcement learning (ORL) for robotic control and
gaming AI. The study proposes policy subspaces that allow RL models to transfer knowledge across
different tasks and demonstrated its effectiveness in goal-conditioned RL for adaptive video game
AI. Shefin et al. (2024) [368] focused on safety-critical RL applications in games and robotic ma-
nipulation. They introduced a framework for explainable reinforcement learning (XRL), making
AI decisions more interpretable and applied to robotic grasping tasks, ensuring safe and reliable
interactions. Xu et al. (2025) [375] developed UPEGSim, a Gazebo-based simulation framework
for underwater robotic games. They used reinforcement learning to optimize evasion strategies
in underwater drone combat and highlighted RL applications in military and search-and-rescue
robotics. Patadiya et al. (2024) [376] used Deep RL to create autonomous players in racing
games (Forza Horizon 5). They combined AlexNet with DRL for vision-based self-driving agents
in gaming. The model learns optimal driving strategies through self-play. Janjua et al. (2024)
[377] explored RL scalability challenges in robotics and open-world games. They studied RL’s
adaptability in dynamic, open-ended environments (e.g., procedural game worlds) and discussed
generalization techniques for RL agents, improving their performance in unpredictable scenarios.
Reinforcement Learning (RL) is a subfield of machine learning where an agent learns to make
decisions by interacting with an environment. The goal of the agent is to maximize a cumulative
reward signal over time by taking actions that affect its environment. The RL framework is for-
mally represented by a Markov Decision Process (MDP), which is defined by a 5-tuple (S, A, P, r, γ),
where:
• S is the state space, which represents all possible states the agent can be in.
• A is the action space, which represents all possible actions the agent can take.
• P (s′ |s, a) is the state transition probability, which defines the probability of transitioning
from state s to state s′ under action a.
• r(s, a) is the reward function, which defines the immediate reward received after taking action
a in state s.
• γ ∈ [0, 1) is the discount factor, which determines the importance of future rewards.
The objective in RL is for the agent to learn a policy π : S → A that maximizes its expected return
(the cumulative discounted reward), which is mathematically expressed as:
"∞ #
X
J(π) = Eπ γ t r(st , at ) , (21.25)
t=0
where st denotes the state at time t, and at = π(st ) is the action taken according to the policy π.
The expectation is taken over the agent’s interaction with the environment, under the policy π.
The agent seeks to maximize this expected return by choosing actions that yield the most reward
over time. The optimal value function V ∗ (s) is defined as the maximum expected return that can
be obtained starting from state s, and is governed by the Bellman optimality equation:
$$V^*(s) = \max_a \mathbb{E}\left[ r(s, a) + \gamma V^*(s') \right], \tag{21.26}$$
where s′ is the next state, and the expectation is taken with respect to the transition dynamics
P (s′ |s, a). The action-value function Q∗ (s, a) represents the maximum expected return from taking
action a in state s, and then following the optimal policy. It satisfies the Bellman optimality
equation for $Q^*(s, a)$:
$$Q^*(s, a) = \mathbb{E}\left[ r(s, a) + \gamma \max_{a'} Q^*(s', a') \right], \tag{21.27}$$
where a′ is the next action to be taken, and the expectation is again over the state transition
probabilities. These Bellman equations form the basis of many RL algorithms, which iteratively
approximate the value functions to learn an optimal policy. To solve these equations, one of
the most widely used methods is Q-learning, an off-policy, model-free RL algorithm. Q-learning
iteratively updates the action-value function Q(s, a) according to the following rule:
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right], \tag{21.28}$$
where α is the learning rate that controls the step size of updates, and γ is the discount factor.
The key idea behind Q-learning is that the agent learns the optimal action-value function Q∗ (s, a)
without needing a model of the environment. The agent improves its action-value estimates over
time by interacting with the environment and receiving feedback (rewards). The iterative nature of
this update ensures convergence to the optimal Q∗ , under the condition that all state-action pairs
are visited infinitely often and α is decayed appropriately. Policy Gradient methods, in contrast,
directly optimize the policy πθ , which is parameterized by a vector θ. These methods are useful
in high-dimensional or continuous action spaces where action-value methods may struggle. The
objective in policy gradient methods is to maximize the expected return, J(πθ ), which is given by:
"∞ #
X
J(πθ ) = Est ,at ∼πθ γ t r(st , at ) . (21.29)
t=0
The policy is updated using the gradient ascent method, and the gradient of the expected return
with respect to θ is computed as:
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{s_t, a_t \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, Q(s_t, a_t) \right], \tag{21.30}$$
where Q(st , at ) is the action-value function, and ∇θ log πθ (at |st ) is the score function, representing
the sensitivity of the policy’s likelihood to the policy parameters. By following this gradient, the
policy parameters θ are updated to improve the agent’s performance. This method, known as
REINFORCE, is particularly effective when the action space is large or continuous, and the policy
needs to be parameterized with complex models, such as deep neural networks. In both Q-learning
and policy gradient methods, exploration and exploitation are essential concepts. Exploration refers
to trying new actions that have not been sufficiently tested, whereas exploitation involves choosing
actions that are known to yield high rewards. The epsilon-greedy strategy is a common way to
balance exploration and exploitation, where with probability ϵ, the agent chooses a random action,
and with probability 1 − ϵ, it chooses the action with the highest expected reward. As the agent
learns, ϵ is typically decayed over time to reduce exploration and focus more on exploiting the
learned policy. In more complex environments, Boltzmann exploration or entropy regularization
techniques are used to maintain a controlled amount of randomness in the policy to encourage
exploration. In multi-agent games, RL takes on additional complexity. When multiple agents
interact, the environment is no longer static, as each agent’s actions affect the others. In this
context, RL can be used to find optimal strategies through game theory. A fundamental concept
here is the Nash equilibrium, where no agent can improve its payoff by changing its strategy,
assuming all other agents’ strategies remain fixed. In mathematical terms, for two agents i and j,
a Nash equilibrium $(\pi_i^*, \pi_j^*)$ satisfies:
$$r_i(\pi_i^*, \pi_j^*) \ge r_i(\pi_i, \pi_j^*) \quad \forall \pi_i, \tag{21.31}$$
where $r_i(\pi_i, \pi_j)$ is the payoff function of agent $i$ when playing policy $\pi_i$ against agent $j$'s policy
πj . Finding Nash equilibria in multi-agent RL is a complex and computationally challenging task,
requiring the agents to learn in a non-stationary environment where the other agents’ strategies
are also changing over time. In the context of robotics, RL is used to solve high-dimensional
control tasks, such as motion planning and trajectory optimization. The robot’s state space is often
represented by vectors of its position, velocity, and other physical parameters, while the action space
consists of control inputs, such as joint torques or linear velocities. In this setting, RL algorithms
learn to map states to actions that optimize the robot’s performance in a task-specific way, such as
minimizing energy consumption or completing a task in the least time. The dynamics of the robot
are often modeled by differential equations:
$$\dot{x}(t) = f(x(t), u(t)), \tag{21.32}$$
where $x(t)$ is the state vector at time $t$, and $u(t)$ is the control input. Through RL, the robot
learns to optimize the control policy u(t) to maximize a reward function, typically involving a
combination of task success and efficiency. Deep RL, specifically, allows for the representation of
highly complex control policies using neural networks, enabling robots to tackle tasks that require
high-dimensional sensory input and decision-making, such as object manipulation or autonomous
navigation.
In games, RL has revolutionized the field by enabling agents to learn complex strategies in en-
vironments where hand-crafted features or simple tabular representations are insufficient. A key
challenge in Deep Reinforcement Learning (DRL) is stabilizing the training process, as neural
networks are prone to issues such as overfitting, exploding gradients, and vanishing gradients. Tech-
niques such as experience replay and target networks are used to mitigate these challenges, ensuring
stable and efficient learning. Thus, Reinforcement Learning, with its theoretical underpinnings in
MDPs, Bellman equations, and policy optimization methods, provides a mathematically rich and
deeply rigorous approach to solving sequential decision-making problems. Its application to fields
such as games and robotics not only illustrates its versatility but also pushes the boundaries of
machine learning into real-world, high-complexity scenarios.
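As an illustration of the score-function estimator (21.30) discussed above, the sketch below accumulates $\nabla_\theta \log \pi_\theta(a_t|s_t)\, G_t$ over sampled episodes, substituting the Monte Carlo return $G_t$ for $Q(s_t, a_t)$ as in REINFORCE. The `policy_logits_grad` callable is a hypothetical hook returning the score function for one state-action pair.

```python
import numpy as np

def reinforce_gradient(episodes, policy_logits_grad, gamma=0.99):
    """Monte Carlo estimate of (21.30): grad J ~ mean of grad log pi(a_t|s_t) * G_t."""
    grad = None
    for states, actions, rewards in episodes:
        T = len(rewards)
        returns, G = np.zeros(T), 0.0
        for t in reversed(range(T)):          # backward recursion G_t = r_t + gamma G_{t+1}
            G = rewards[t] + gamma * G
            returns[t] = G
        for t in range(T):
            g = policy_logits_grad(states[t], actions[t])  # grad_theta log pi(a_t|s_t)
            grad = g * returns[t] if grad is None else grad + g * returns[t]
    return grad / len(episodes)
```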
22 Federated Learning
Federated Learning (FL) is a decentralized machine learning paradigm that enables model train-
ing across multiple devices or edge nodes without transferring raw data to a central server. This
approach enhances data privacy and reduces communication costs, making it suitable for vari-
ous applications, including healthcare, finance, and edge computing. This literature review in
the following section provides an overview of key developments, challenges, and advancements in
Federated Learning.
– Clustered FL: Sattler et al. (2020) [1149] introduced methods to group clients with
similar data distributions for better model convergence.
– Meta-Learning in FL: Fallah et al. (2020) applied meta-learning techniques in FL to
improve model generalization across diverse clients.
• Scalability: Efficient handling of a large number of clients remains an open research problem.
• Fairness and Bias: Addressing biases in FL models due to non-representative data distri-
butions is a critical issue.
2. Federated Learning for IoT and Smart Systems: Ferretti et al. (2025) [1157] propose
a blockchain-based federated learning system for resilient and decentralized coordination, im-
proving reliability and traceability in edge AI environments. Their approach ensures secure
and tamper-proof federated model updates. Chen et al. (2025) [1158] introduce Federated
Hyperdimensional Computing (FHC) for quality monitoring in smart manufacturing, which
leverages hierarchical learning strategies to improve anomaly detection and predictive main-
tenance in industrial settings. Mei et al. (2025) [1159] explore semi-asynchronous FL control
strategies in satellite networks, focusing on optimizing communication efficiency and reducing
training latency in federated AI for space applications.
3. Advances in Federated Learning for Edge AI and Security: Rawas and Samala (2025)
[1160] introduce Edge-Assisted Federated Learning (EAFL) for real-time disease prediction,
integrating FL with Edge AI to enhance processing efficiency while preserving patient data
privacy. Becker et al. (2025) [1161] examine combined reconstruction and poisoning attacks
on FL systems, assessing vulnerabilities and proposing mitigation strategies such as federated
adversarial learning and model verification techniques.
Federated learning continues to evolve, addressing critical challenges such as communication overhead, security vulnerabilities, and model personalization. Future research is expected to focus on open problems such as the scalability and fairness issues noted above.
This review highlights the latest advancements in FL, offering insights into its promising applica-
tions and the challenges that remain.
Federated Learning seeks to solve
$$\min_{w \in \mathbb{R}^d} F(w) = \mathbb{E}_{k \sim \mathcal{P}}\left[ F_k(w) \right], \tag{22.1}$$
where $w \in \mathbb{R}^d$ represents the global model parameters, $F(w)$ is the global objective function, $\mathcal{P}$ is a probability distribution over clients, reflecting the heterogeneity of data distributions, and $F_k(w)$ is the local objective function for client $k$, defined as:
$$F_k(w) = \mathbb{E}_{(x,y) \sim \mathcal{D}_k}\left[ \ell(w; x, y) \right], \tag{22.2}$$
where $\mathcal{D}_k$ is the local data distribution for client $k$, and $\ell(w; x, y)$ is the loss function. Let $M$ denote
the global model, parameterized by w ∈ Rd , where d is the dimensionality of the model parameters.
The goal is to minimize a global objective function F (w), which is typically the average of local
objective functions Fk (w) computed over the data Dk held by each client k:
$$F(w) = \frac{1}{K} \sum_{k=1}^K F_k(w), \tag{22.3}$$
$$F_k(w) = \frac{1}{|\mathcal{D}_k|} \sum_{(x_i, y_i) \in \mathcal{D}_k} \ell(w; x_i, y_i). \tag{22.4}$$
Here, ℓ(w; xi , yi ) is the loss function (e.g., cross-entropy or mean squared error) evaluated on the
data point (xi , yi ), and |Dk | is the number of data points on client k.
22.4.1 Clients
In Federated Learning, clients are the distributed entities (e.g., mobile devices, IoT devices, or
institutional servers) that hold local datasets Dk , where k ∈ {1, 2, . . . , K} indexes the clients. Each
client k participates in the collaborative training of a global model M , parameterized by w ∈ Rd ,
without sharing its raw data Dk . Each client k maintains a local dataset Dk and computes updates
to the global model parameters w based on its local data. Each client k is associated with a local
objective function $F_k(w)$, which represents the expected loss over its local data distribution $\mathcal{D}_k$:
$$F_k(w) = \mathbb{E}_{(x,y) \sim \mathcal{D}_k}\left[ \ell(w; x, y) \right], \tag{22.5}$$
where:
• ℓ(w; x, y) is the loss function (e.g., cross-entropy, mean squared error) evaluated on a data
point (x, y).
• Dk is the local dataset of client k, which may differ significantly from other clients’ datasets
(non-IID data).
Clients are responsible for computing updates to the global model parameters w based on their
local data. This process typically involves local stochastic gradient descent (SGD) or its variants.
The local update rule at communication round $t$ is:
$$w_k(t+1) = w_k(t) - \eta_t \nabla F_k(w_k(t); \xi_k(t)), \tag{22.6}$$
where ηt is the learning rate at round t and ∇Fk (wk (t); ξk (t)) is the stochastic gradient computed
on a mini-batch ξk (t) ⊆ Dk . Clients can also train personalized models wk that adapt to their local
data distribution Dk . This can be achieved by adding a regularization term to the local objective
function:
$$F_k(w) = \mathbb{E}_{(x,y) \sim \mathcal{D}_k}\left[ \ell(w; x, y) \right] + \frac{\lambda}{2} \|w - w(t)\|^2, \tag{22.7}$$
where λ controls the degree of personalization. Clients can ensure privacy by adding noise to their
local updates:
$$w_k(t+1) = w_k(t) - \eta_t \nabla F_k(w_k(t); \xi_k(t)) + \mathcal{N}(0, \sigma_{\mathrm{DP}}^2), \tag{22.8}$$
where σDP is calibrated to achieve (ϵ, δ)-differential privacy. To reduce communication costs, clients
can transmit only a subset of the model parameters:
$$w_k(t+1) = \mathrm{Sparsify}\!\left( w_k(t) - \eta_t \nabla F_k(w_k(t); \xi_k(t)) \right). \tag{22.9}$$
• Resource Constraints: Clients often have limited computational resources (e.g., CPU,
memory) and communication bandwidth. This necessitates efficient algorithms for local train-
ing and model compression.
• Privacy and Security: Clients must ensure that their local data Dk is not exposed during
training. Techniques such as differential privacy and secure multi-party computation (SMPC)
are employed to protect client data.
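A compact sketch tying together the client-side mechanisms (22.6)-(22.9): $\tau$ local SGD steps with an optional proximal personalization term, Gaussian noise for differential privacy, and top-$k$ sparsification of the transmitted update. The parameter names and the `grad_fk` gradient oracle are illustrative assumptions.

```python
import numpy as np

def client_update(w_global, grad_fk, tau=5, eta=0.05, lam=0.0,
                  sigma_dp=0.0, k_sparse=None, seed=0):
    """One client's contribution, combining (22.6)-(22.9)."""
    rng = np.random.default_rng(seed)
    w = w_global.copy()
    for _ in range(tau):
        g = grad_fk(w) + lam * (w - w_global)  # gradient of F_k + (lam/2)||w - w(t)||^2
        w -= eta * g                           # local SGD step, cf. (22.6)
    if sigma_dp > 0:                           # Gaussian mechanism for DP, cf. (22.8)
        w += rng.normal(0.0, sigma_dp, w.shape)
    if k_sparse is not None:                   # transmit only the k largest deltas (22.9)
        delta = w - w_global
        keep = np.argsort(np.abs(delta))[-k_sparse:]
        w = w_global.copy()
        w[keep] += delta[keep]
    return w
```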
22.4.2 Server
A central server orchestrates the training process by aggregating the local updates from the clients
and updating the global model. The most common aggregation method is Federated Averaging
(FedAvg), which computes a weighted average of the local model parameters:
$$w(t+1) = \sum_{k=1}^K \frac{|\mathcal{D}_k|}{|\mathcal{D}|}\, w_k(t+1), \tag{22.11}$$
where
$$|\mathcal{D}| = \sum_{k=1}^K |\mathcal{D}_k| \tag{22.12}$$
is the total number of data points across all clients. The server can use adaptive aggregation
methods to account for heterogeneous client updates. For example, weighted aggregation assigns
higher weights to clients with larger datasets or more reliable updates:
$$w(t+1) = \sum_{k \in S(t)} p_k\, w_k(t+1), \tag{22.13}$$
where $p_k$ is the weight assigned to client $k$. To ensure privacy, the server can use secure multi-party computation (SMPC) to aggregate client updates without revealing individual contributions. The server can also add noise to the global model updates to ensure differential privacy:
$$w(t+1) = w(t) + \mathcal{N}(0, \sigma_{\mathrm{DP}}^2). \tag{22.15}$$
• Non-IID Data: The local data distributions Dk may differ significantly across clients, lead-
ing to statistical heterogeneity. This can cause client drift and slow convergence. The server
must account for this by using robust aggregation methods.
• Privacy and Security: The server must ensure that the global model updates do not leak
sensitive information about the clients’ local data. Techniques such as secure aggregation and
differential privacy are employed to protect client privacy.
Here $p_k$ denotes the weight assigned to client $k$, typically proportional to the size of its local dataset $|\mathcal{D}_k|$.
• Client Selection: At the start of each communication round t, the server selects a subset of
clients S(t) to participate. The selection may be random or based on criteria such as client
availability, computational resources, or data distribution. The probability of selecting client
k is denoted by pk , where
$$\sum_{k=1}^K p_k = 1. \tag{22.18}$$
• Model Distribution: The server sends the current global model parameters w(t) to the
selected clients S(t).
• Local Training: Each selected client k ∈ S(t) performs τ steps of local stochastic gradient
descent (SGD) on its dataset Dk to compute updated parameters wk (t + 1). The local update
rule at local step $s$ is:
$$w_k(t, s+1) = w_k(t, s) - \eta_t \nabla F_k(w_k(t, s); \xi_k(t, s)), \tag{22.19}$$
where ηt is the learning rate at round t, ∇Fk (wk (t, s); ξk (t, s)) is the stochastic gradient com-
puted on a mini-batch $\xi_k(t, s) \subseteq \mathcal{D}_k$. After $\tau$ steps, the client sends the updated parameters
$$w_k(t+1) = w_k(t, \tau) \tag{22.20}$$
back to the server.
• Model Aggregation: The server aggregates the local updates $w_k(t+1)$ from the selected clients using Federated Averaging (FedAvg):
$$w(t+1) = \sum_{k \in S(t)} \frac{|\mathcal{D}_k|}{|\mathcal{D}_{S(t)}|}\, w_k(t+1), \tag{22.21}$$
where $|\mathcal{D}_k|$ is the number of data points on client $k$ and $|\mathcal{D}_{S(t)}| = \sum_{k \in S(t)} |\mathcal{D}_k|$ is the total number of data points across the selected clients (a minimal code sketch of one full round follows this list).
• Non-IID Data: The local data distributions Dk may differ significantly across clients, lead-
ing to statistical heterogeneity. This can cause client drift and slow convergence.
• Communication Bottlenecks: The communication between the server and clients can be
a bottleneck, especially in large-scale FL systems with millions of clients. Techniques such as
model compression and sparse updates are used to reduce communication costs.
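As promised above, here is a minimal sketch of one full FedAvg communication round: client selection, $\tau$ local SGD steps per selected client, and dataset-size-weighted aggregation (22.21). The `clients[k]` stochastic-gradient oracles and the uniform sampling scheme are hypothetical simplifications.

```python
import numpy as np

def fedavg_round(w, clients, n_k, frac=0.2, tau=5, eta=0.05, seed=0):
    """One communication round of FedAvg, cf. (22.19)-(22.21)."""
    rng = np.random.default_rng(seed)
    K = len(clients)
    selected = rng.choice(K, size=max(1, int(frac * K)), replace=False)
    updates, weights = [], []
    for k in selected:
        w_k = w.copy()
        for _ in range(tau):                 # tau local SGD steps on client k
            w_k -= eta * clients[k](w_k)     # clients[k] returns a stochastic gradient
        updates.append(w_k)
        weights.append(n_k[k])               # |D_k|
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()                 # |D_k| / |D_{S(t)}|, cf. (22.21)
    return sum(p * u for p, u in zip(weights, updates))
```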
The global objective function is
$$F(w) = \frac{1}{K} \sum_{k=1}^K F_k(w), \tag{22.22}$$
where $w \in \mathbb{R}^d$ represents the global model parameters, and $F_k(w)$ is the local objective function for client $k$, defined as:
$$F_k(w) = \mathbb{E}_{(x,y) \sim \mathcal{D}_k}\left[ \ell(w; x, y) \right]. \tag{22.23}$$
Here, ℓ(w; x, y) is the loss function evaluated on a data point (x, y), and Dk is the local dataset
of client k. Heterogeneous client participation arises when the participation probabilities pk vary
significantly across clients, and the local data distributions Dk are non-IID. The global model
update at round t is given by:
$$w^{(t+1)} = \sum_{k \in S^{(t)}} \frac{|\mathcal{D}_k|}{|\mathcal{D}_{S^{(t)}}|}\, w_k^{(t+1)}, \tag{22.24}$$
where $|\mathcal{D}| = \sum_{k=1}^K |\mathcal{D}_k|$ is the total number of data points across all clients. If $p_k$ is not uniform, the expected update is biased toward clients with higher $p_k$. This bias can be quantified as:
$$\mathrm{Bias} = \mathbb{E}[w^{(t+1)}] - \frac{1}{K} \sum_{k=1}^K \frac{|\mathcal{D}_k|}{|\mathcal{D}|}\, w_k^{(t+1)}. \tag{22.26}$$
Heterogeneous client participation slows down convergence due to the increased variance in model
updates and the reduced effective learning rate. The variance of the global model updates increases
because clients contribute unevenly to the training process. Specifically, the variance of the global
update is given by:
$$\operatorname{Var}[w^{(t+1)}] = \mathbb{E}\left[ \left\| w^{(t+1)} - \mathbb{E}[w^{(t+1)}] \right\|^2 \right]. \tag{22.27}$$
When clients participate heterogeneously, the variance term Var[w(t+1) ] becomes larger, which
slows down convergence. The effective learning rate is reduced because the global model update is
dominated by a subset of clients. This can be seen by analyzing the expected update direction:
K
X
(t+1) (t)
E[w ]=w − ηt pk ∇Fk (w(t) ). (22.28)
k=1
If pk is not uniform, the update direction is biased toward the gradients of frequently participating
clients, reducing the effective step size. Heterogeneous client participation also introduces bias
in the model updates due to non-IID data distributions. When clients have non-IID data, the
local objective functions Fk (w) differ significantly. If clients participate heterogeneously, the global
model update is biased toward the local objectives of more frequently participating clients. This
bias can be quantified as:
$$\mathrm{Bias} = \left\| \nabla F(w) - \sum_{k=1}^K p_k \nabla F_k(w) \right\|. \tag{22.29}$$
The bias introduced by heterogeneous participation degrades model performance because the global
model overfits to the data of frequently participating clients and underfits to the data of rarely par-
ticipating clients. To rigorously analyze the impact of heterogeneous participation on convergence,
we consider the expected suboptimality gap ∆(t) = E[F (w(t) ) − F (w∗ )], where w∗ is the optimal
solution. The expected suboptimality gap evolves according to the following recurrence relation:
$$\Delta^{(t+1)} \le (1 - 2\mu\eta_t)\,\Delta^{(t)} + \frac{L\eta_t^2}{2}\left( \sigma^2 + \mathrm{Bias}^2 \right), \tag{22.30}$$
where µ is the strong convexity parameter, L is the smoothness parameter, σ 2 is the variance of
the stochastic gradients, and $\mathrm{Bias}^2$ is the squared bias due to heterogeneous participation. For a constant learning rate $\eta_t = \eta$, the recurrence relation can be solved as:
$$\Delta^{(T)} \le (1 - 2\mu\eta)^T \Delta^{(0)} + \frac{L\eta}{4\mu}\left( \sigma^2 + \mathrm{Bias}^2 \right). \tag{22.31}$$
This shows that the number of communication rounds $T$ increases due to the bias term $\mathrm{Bias}^2$. The
bias in gradient estimation arises because the global gradient ∇F (w) is estimated as:
$$\nabla F(w) \approx \sum_{k=1}^K p_k \nabla F_k(w). \tag{22.33}$$
If pk is not uniform, the gradient estimate is biased toward the gradients of frequently participating
clients. This bias can be quantified as:
$$\mathrm{Bias} = \left\| \nabla F(w) - \sum_{k=1}^K p_k \nabla F_k(w) \right\|. \tag{22.34}$$
The biased gradient estimate slows down convergence because the global model update direction is
no longer aligned with the true gradient ∇F (w), and the effective step size is reduced due to the
bias. In conclusion, heterogeneous client participation slows down the training process and intro-
duces bias in the model updates due to the uneven contribution of clients to the global model update
and the non-IID data distributions across clients. These effects are rigorously quantified using tools
from optimization theory and statistical learning. By addressing heterogeneous participation, we
can improve the convergence and performance of Federated Learning.
The global objective function is
$$F(w) = \frac{1}{K} \sum_{k=1}^K F_k(w), \tag{22.35}$$
where $w \in \mathbb{R}^d$ represents the global model parameters, and $F_k(w)$ is the local objective function for client $k$, defined as:
$$F_k(w) = \mathbb{E}_{(x,y) \sim \mathcal{D}_k}\left[ \ell(w; x, y) \right]. \tag{22.36}$$
Here, ℓ(w; x, y) is the loss function evaluated on a data point (x, y), and Dk is the local dataset of
client k. Communication bottlenecks affect both the downlink (server to clients) and uplink (clients
to server) transmissions. The time required for downlink communication is given by:
$$T_{\mathrm{down}} = \frac{d \cdot b}{B_{\mathrm{down}}}, \tag{22.37}$$
where $d$ is the dimensionality of the model parameters, $b$ is the number of bits used to represent each parameter, and $B_{\mathrm{down}}$ is the downlink bandwidth. The time required for uplink communication is given by:
$$T_{\mathrm{up}} = \frac{d \cdot b}{B_{\mathrm{up}}}, \tag{22.38}$$
where Bup is the uplink bandwidth. The total time required for each communication round t is
given by:
Tround = Tdown + Tup + Tcomp , (22.39)
where Tcomp is the time required for local computation on the clients. Communication bottlenecks
increase Tdown and Tup , thereby increasing Tround and slowing down the overall training process.
The effective learning rate ηeff is reduced because the global model is updated less frequently. The
effective learning rate is given by:
$$\eta_{\mathrm{eff}} = \frac{\eta}{T_{\mathrm{round}}}, \tag{22.40}$$
where η is the nominal learning rate. A smaller ηeff slows down convergence. Communication
bottlenecks also increase the variance of the global model updates because clients may not be able
to transmit their updates in a timely manner. This variance can be quantified as:
$$\operatorname{Var}[w^{(t+1)}] = \mathbb{E}\left[ \left\| w^{(t+1)} - \mathbb{E}[w^{(t+1)}] \right\|^2 \right]. \tag{22.41}$$
This shows that the number of communication rounds T increases due to the reduced effective
learning rate ηeff and the increased variance Var[w(t+1) ]. To mitigate the impact of communication
bottlenecks, several techniques can be employed. Model compression techniques, such as quanti-
zation and sparsification, reduce the number of bits b required to represent each parameter. This
reduces the communication time Tdown and Tup . Clients can perform multiple local updates before
communicating with the server. This reduces the frequency of communication and increases the
effective learning rate ηeff . Asynchronous updates allow clients to communicate with the server
at different times, reducing the impact of high latency. In conclusion, communication bottlenecks
slow down the training process and increase the overall time required to achieve convergence in
Federated Learning. These effects are rigorously quantified using tools from optimization theory,
information theory, and statistical learning. By addressing communication bottlenecks, we can
improve the efficiency and performance of Federated Learning.
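A small worked example of the timing model (22.37)-(22.40) makes the bottleneck tangible; the model size, precision, bandwidths, and compute time below are assumed values, not figures from the text.

```python
# Assumed: a 10^7-parameter model at 32 bits/parameter over a 100 Mb/s downlink
# and a 20 Mb/s uplink, with 5 s of local computation per round.
d, b = 10_000_000, 32
B_down, B_up = 100e6, 20e6                 # bandwidths in bits per second
T_down = d * b / B_down                    # 3.2 s, cf. (22.37)
T_up = d * b / B_up                        # 16.0 s, cf. (22.38)
T_comp = 5.0
T_round = T_down + T_up + T_comp           # 24.2 s per round, cf. (22.39)
eta_eff = 0.1 / T_round                    # effective learning rate, cf. (22.40)
# Quantizing to b = 8 bits cuts the transfer terms by 4x: T_round drops to ~9.8 s.
```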
The global objective function is
$$F(w) = \frac{1}{K} \sum_{k=1}^K F_k(w), \tag{22.45}$$
where $w \in \mathbb{R}^d$ represents the global model parameters, and $F_k(w)$ is the local objective function for client $k$, defined as:
$$F_k(w) = \mathbb{E}_{(x,y) \sim \mathcal{D}_k}\left[ \ell(w; x, y) \right]. \tag{22.46}$$
Here, ℓ(w; x, y) is the loss function evaluated on a data point (x, y), and Dk is the local dataset of
client k. Statistical heterogeneity arises when the local data distributions Dk differ significantly from
the global data distribution D, which can be quantified using measures such as the total variation
distance or the Kullback-Leibler (KL) divergence. For example, the total variation distance between
Dk and D is given by:
$$\mathrm{TV}(\mathcal{D}_k, \mathcal{D}) = \frac{1}{2} \sum_{(x,y)} \left| \mathcal{D}_k(x, y) - \mathcal{D}(x, y) \right|. \tag{22.47}$$
When TV(Dk , D) is large, the local data distribution Dk is significantly different from the global
data distribution D. Statistical heterogeneity affects the local objective functions Fk (w), which are
defined as:
Fk (w) = E(x,y)∼Dk [ℓ(w; x, y)] . (22.48)
When Dk differs significantly from D, the local objective functions Fk (w) differ significantly from
the global objective function F (w), which is defined as:
$$F(w) = \frac{1}{K} \sum_{k=1}^K F_k(w). \tag{22.49}$$
The resulting gradient divergence is bounded as
$$\|\nabla F_k(w) - \nabla F(w)\| \le \delta, \tag{22.50}$$
where $\delta$ measures the degree of statistical heterogeneity. Client drift occurs when the local models
trained on different clients diverge from the global model due to statistical heterogeneity. This can
be rigorously analyzed using the local update rule:
$$w_k^{(t+1)} = w_k^{(t)} - \eta_t \nabla F_k(w_k^{(t)}; \xi_k^{(t)}), \tag{22.51}$$
where $\eta_t$ is the learning rate at round $t$, and $\nabla F_k(w_k^{(t)}; \xi_k^{(t)})$ is the stochastic gradient computed on a mini-batch $\xi_k^{(t)} \subseteq \mathcal{D}_k$. When $\mathcal{D}_k$ differs significantly from $\mathcal{D}$, the local updates $w_k^{(t+1)}$ diverge from the global update direction $\nabla F(w)$. This divergence can be quantified as:
$$\|w_k^{(t+1)} - w^{(t)}\| \le \eta_t \|\nabla F_k(w_k^{(t)}) - \nabla F(w^{(t)})\|. \tag{22.52}$$
When $\delta$ is large, the divergence $\|w_k^{(t+1)} - w^{(t)}\|$ becomes large, leading to client drift. Statistical
heterogeneity slows down convergence because the global model update direction is no longer aligned
with the true gradient ∇F (w). This can be rigorously analyzed using the expected suboptimality
gap ∆(t) = E[F (w(t) ) − F (w∗ )], where w∗ is the optimal solution. The expected suboptimality gap
evolves according to the following recurrence relation:
$$\Delta^{(t+1)} \le (1 - 2\mu\eta_t)\,\Delta^{(t)} + \frac{L\eta_t^2}{2}\left( \sigma^2 + \delta^2 \right), \tag{22.53}$$
where µ is the strong convexity parameter, L is the smoothness parameter, σ 2 is the variance of
the stochastic gradients, and δ is the gradient divergence due to statistical heterogeneity. For a
constant learning rate ηt = η, the recurrence relation can be solved as:
$$\Delta^{(T)} \le (1 - 2\mu\eta)^T \Delta^{(0)} + \frac{L\eta}{4\mu}\left( \sigma^2 + \delta^2 \right). \tag{22.54}$$
To achieve $\Delta^{(T)} \le \epsilon$, we require:
$$T \ge \frac{1}{2\mu\eta} \log\!\left( \frac{2\Delta^{(0)}}{\epsilon} \right) + \frac{L(\sigma^2 + \delta^2)}{4\mu^2 \epsilon}. \tag{22.55}$$
This shows that the number of communication rounds T increases due to the gradient divergence
δ. To mitigate the impact of statistical heterogeneity, several techniques can be employed. Regu-
larization can be added to the local objective function to reduce the divergence between local and
global models. Personalized Federated Learning can be used to train personalized models for each
client that adapt to their local data distributions. Data augmentation can be used to augment
the local data to make the local data distributions more similar to the global data distribution.
In conclusion, statistical heterogeneity, where the local data distributions Dk differ significantly
across clients, leads to client drift and slow convergence in Federated Learning. These effects are
rigorously quantified using tools from optimization theory, probability theory, and statistical learn-
ing. By addressing statistical heterogeneity, we can improve the convergence and performance of
Federated Learning.
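The heterogeneity measure (22.47) is straightforward to evaluate on empirical label distributions; the sketch and example distributions below are illustrative.

```python
import numpy as np

def tv_distance(p_k, p_global):
    """Total variation distance (22.47) between two probability vectors."""
    return 0.5 * np.abs(np.asarray(p_k) - np.asarray(p_global)).sum()

# A client that only observes 2 of 4 classes vs. a uniform global mixture.
print(tv_distance([0.5, 0.5, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]))  # 0.5
```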
• Adaptive Client Selection: The server can use adaptive client selection strategies to
prioritize clients with higher data quality or computational resources. For example, clients
with larger datasets |Dk | or lower gradient divergence δ may be selected more frequently.
• Local Step Adaptation: The number of local steps τ can be adapted dynamically based
on the client’s computational resources and data distribution. For example, clients with more
data may perform more local steps to reduce communication frequency.
• Secure Aggregation: To ensure privacy, the server can use secure multi-party computation
(SMPC) to aggregate client updates without revealing individual contributions.
Communication rounds are the backbone of Federated Learning, enabling the collaborative training
of a global model across distributed clients. Each round involves client selection, model distribu-
tion, local training, and model aggregation, governed by rigorous mathematical principles. By
addressing challenges such as heterogeneous client participation, non-IID data, and communication
bottlenecks, communication rounds ensure the efficient and privacy-preserving convergence of the
global model.
The global objective function is
$$F(w) = \frac{1}{K} \sum_{k=1}^K F_k(w), \tag{22.56}$$
where $w \in \mathbb{R}^d$ represents the global model parameters, and $F_k(w)$ is the local objective function for client $k$, defined as:
$$F_k(w) = \mathbb{E}_{(x,y) \sim \mathcal{D}_k}\left[ \ell(w; x, y) \right]. \tag{22.57}$$
Here, $\ell(w; x, y)$ is the loss function evaluated on a data point $(x, y)$, and $\mathcal{D}_k$ is the local dataset of client $k$. Adaptive client selection dynamically adjusts the probability $p_k^{(t)}$ that client $k$ is selected in round $t$ based on criteria such as client availability, computational resources, data quality, and network conditions. The selection of clients is governed by a probability distribution $p_k^{(t)}$, where $p_k^{(t)}$ is the probability that client $k$ is selected in round $t$. Adaptive client selection affects the
communication rounds in several ways. By selecting clients with better network connectivity,
adaptive client selection reduces the time required for communication. The time required for
downlink and uplink communication is given by:
$$T_{\mathrm{down}}^{(t)} = \frac{d \cdot b}{B_{\mathrm{down},k}}, \tag{22.58}$$
$$T_{\mathrm{up}}^{(t)} = \frac{d \cdot b}{B_{\mathrm{up},k}}, \tag{22.59}$$
where $B_{\mathrm{down},k}$ and $B_{\mathrm{up},k}$ are the downlink and uplink bandwidths of client $k$, respectively. By selecting clients with higher $B_{\mathrm{down},k}$ and $B_{\mathrm{up},k}$, adaptive client selection reduces $T_{\mathrm{down}}^{(t)}$ and $T_{\mathrm{up}}^{(t)}$, thereby reducing the total communication time $T_{\mathrm{round}}^{(t)}$, which is given by:
$$T_{\mathrm{round}}^{(t)} = T_{\mathrm{down}}^{(t)} + T_{\mathrm{up}}^{(t)} + T_{\mathrm{comp}}^{(t)}, \tag{22.60}$$
where $T_{\mathrm{comp}}^{(t)}$ is the time required for local computation on the clients. By selecting clients with better network connectivity, adaptive client selection reduces $T_{\mathrm{round}}^{(t)}$ and improves communication
efficiency. Adaptive client selection also improves convergence by selecting clients with higher-
quality data. The global model update at round t is given by:
$$w^{(t+1)} = \sum_{k \in S^{(t)}} \frac{|\mathcal{D}_k|}{|\mathcal{D}_{S^{(t)}}|}\, w_k^{(t+1)}, \tag{22.61}$$
where $|\mathcal{D}_k|$ is the number of data points on client $k$, and $|\mathcal{D}_{S^{(t)}}| = \sum_{k \in S^{(t)}} |\mathcal{D}_k|$ is the total number of data points across the selected clients. By selecting clients with higher-quality data, adaptive client selection reduces the bias in the global model updates, improving convergence. Adaptive client selection also mitigates the impact of statistical heterogeneity by selecting clients with more representative data. The gradient divergence due to statistical heterogeneity is given by:
$$\|\nabla F_k(w) - \nabla F(w)\| \le \delta. \tag{22.62}$$
By selecting clients with smaller δ, adaptive client selection reduces the gradient divergence, im-
proving convergence. To rigorously analyze the impact of adaptive client selection on convergence,
we consider the expected suboptimality gap ∆(t) = E[F (w(t) ) − F (w∗ )], where w∗ is the optimal
solution. The expected suboptimality gap evolves according to the following recurrence relation:
$$\Delta^{(t+1)} \le (1 - 2\mu\eta_t)\,\Delta^{(t)} + \frac{L\eta_t^2}{2}\left( \sigma^2 + \delta^2 \right), \tag{22.63}$$
where µ is the strong convexity parameter, L is the smoothness parameter, σ 2 is the variance of
the stochastic gradients, and δ is the gradient divergence due to statistical heterogeneity. For a
constant learning rate ηt = η, the recurrence relation can be solved as:
$$\Delta^{(T)} \le (1 - 2\mu\eta)^T \Delta^{(0)} + \frac{L\eta}{4\mu}\left( \sigma^2 + \delta^2 \right). \tag{22.64}$$
To achieve $\Delta^{(T)} \le \epsilon$, we require:
$$T \ge \frac{1}{2\mu\eta} \log\!\left( \frac{2\Delta^{(0)}}{\epsilon} \right) + \frac{L(\sigma^2 + \delta^2)}{4\mu^2 \epsilon}. \tag{22.65}$$
By selecting clients with smaller δ, adaptive client selection reduces the number of communica-
tion rounds T required to achieve ∆(T ) ≤ ϵ. In conclusion, adaptive client selection addresses the
challenges in communication rounds by reducing communication bottlenecks, improving conver-
gence, and mitigating statistical heterogeneity. These effects are rigorously quantified using tools
from optimization theory, probability theory, and statistical learning. By employing adaptive client
selection, we can improve the efficiency and performance of Federated Learning.
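To make the bound concrete, the recurrence (22.63) and its solution (22.64) can be checked numerically. The following sketch uses assumed values for $\mu$, $L$, $\sigma^2$, $\delta^2$, and $\eta$ (illustrative only, not taken from any specific federated system) and iterates the gap recursion until it settles near the noise floor $\frac{L\eta}{4\mu}(\sigma^2 + \delta^2)$:

```python
import numpy as np

# Illustrative parameters (assumed values, not from the text):
mu, L = 0.5, 4.0            # strong convexity and smoothness constants
sigma2, delta2 = 1.0, 0.25  # gradient-noise variance and heterogeneity term
eta = 0.05                  # constant learning rate (with eta < 1/(2*mu))
gap = 10.0                  # initial suboptimality gap Delta^(0)

# Iterate Delta^(t+1) <= (1 - 2*mu*eta) Delta^(t) + (L*eta^2/2)(sigma^2 + delta^2)
for t in range(200):
    gap = (1 - 2 * mu * eta) * gap + (L * eta**2 / 2) * (sigma2 + delta2)

# Fixed point of the recursion: (L*eta/(4*mu)) * (sigma^2 + delta^2)
print(gap, (L * eta / (4 * mu)) * (sigma2 + delta2))
```

After a few hundred rounds the iterated gap matches the analytic noise floor, illustrating why a smaller heterogeneity term $\delta^2$ lowers both the floor and the rounds needed to reach a target $\epsilon$.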
To analyze these effects, we begin by defining the global objective function F (w), which is the
average of the local objective functions Fk (w) computed over the data Dk held by each client k.
The global objective function is given by:
$$F(w) = \frac{1}{K}\sum_{k=1}^{K} F_k(w), \qquad (22.66)$$
where w ∈ Rd represents the global model parameters, and Fk (w) is the local objective function
for client k, defined as:
$$F_k(w) = \mathbb{E}_{(x,y)\sim D_k}[\ell(w; x, y)]. \qquad (22.67)$$
Here, ℓ(w; x, y) is the loss function evaluated on a data point (x, y), and Dk is the local dataset of
client $k$. Local step adaptation dynamically adjusts the number of local updates $\tau_k(t)$ performed by client $k$ in communication round $t$ based on criteria such as client computational resources, data distribution, and network conditions. The local update rule for client $k$ in round $t$ is given by:
$$w_k(t, s+1) = w_k(t, s) - \eta_t \nabla F_k(w_k(t, s); \xi_k(t, s)), \qquad (22.68)$$
where $s = 0, 1, \ldots, \tau_k(t) - 1$ indexes the local updates, $\eta_t$ is the learning rate at round $t$, and $\nabla F_k(w_k(t, s); \xi_k(t, s))$ is the stochastic gradient computed on a mini-batch $\xi_k(t, s) \subseteq D_k$. Local
step adaptation affects the communication rounds in several ways. By increasing the number of local updates $\tau_k(t)$, local step adaptation reduces the frequency of communication between clients and the server. This reduces the total communication time $T_{\text{round}}(t)$, which is given by:
$$T_{\text{round}}(t) = T_{\text{down}}(t) + T_{\text{up}}(t) + T_{\text{comp}}(t), \qquad (22.69)$$
where $T_{\text{down}}(t)$ and $T_{\text{up}}(t)$ are the downlink and uplink communication times, and $T_{\text{comp}}(t)$ is the time required for local computation on the clients. The global model update at round $t$ is given by:
$$w(t+1) = \sum_{k \in S(t)} \frac{|D_k|}{|D_{S(t)}|}\, w_k(t+1), \qquad (22.70)$$
where $|D_k|$ is the number of data points on client $k$, and $|D_{S(t)}| = \sum_{k \in S(t)} |D_k|$ is the total number
of data points across the selected clients. The gradient divergence due to statistical heterogeneity
is given by:
$$\|\nabla F_k(w) - \nabla F(w)\| \leq \delta. \qquad (22.71)$$
To rigorously analyze the impact of local step adaptation on convergence, we consider the expected suboptimality gap $\Delta(t) = \mathbb{E}[F(w(t)) - F(w^*)]$, where $w^*$ is the optimal solution. The expected suboptimality gap evolves according to the following recurrence relation:
$$\Delta(t+1) \leq (1 - 2\mu\eta_t)\Delta(t) + \frac{L\eta_t^2}{2}(\sigma^2 + \delta^2), \qquad (22.72)$$
where µ is the strong convexity parameter, L is the smoothness parameter, and σ 2 is the variance
of the stochastic gradients.
In Secure Aggregation, the server must compute the aggregate $\sum_{i=1}^{N} w_i$ without revealing any individual $w_i$ to the server or any other client. The primary challenge in federated learning is the communication overhead, which in a naive implementation requires each client to transmit $O(d)$ bits per round, leading to an overall communication cost of $O(Nd)$. Secure Aggregation addresses this challenge by employing cryptographic techniques such as additive masking, secret sharing, and homomorphic encryption. Each client $C_i$ generates a pairwise random mask $r_{ij}$ for every other client $C_j$, where the masks satisfy
$$r_{ij} = -r_{ji}, \qquad (22.74)$$
to ensure perfect cancellation upon aggregation. The client then transmits a masked update
$$w_i' = w_i + \sum_{j \neq i} r_{ij} \qquad (22.75)$$
to the central server. When all clients submit their masked updates, the server computes the aggregate sum
$$\sum_{i=1}^{N} w_i' = \sum_{i=1}^{N} w_i + \sum_{i=1}^{N} \sum_{j \neq i} r_{ij}. \qquad (22.76)$$
This ensures that the server recovers the desired global model update while preserving client privacy.
To further enhance security, an additively homomorphic encryption scheme can be introduced. Each client encrypts its update using a public key $pk$, yielding
$$c_i = \text{Enc}_{pk}(w_i), \qquad (22.77)$$
and the additive homomorphism allows the server to compute $\text{Enc}_{pk}\!\left(\sum_i w_i\right)$ directly from the ciphertexts, ensuring that individual $w_i$ values remain unknown. This method, however, introduces computational overhead due to encryption and decryption costs. To address client dropouts, polynomial-based secret sharing is employed. Each client constructs a polynomial
$$P(x) = S + \sum_{k=1}^{t} a_k x^k, \qquad (22.82)$$
where $a_k$ are random coefficients and $S$ is the secret. Each client $C_i$ receives a share $P(i)$. The server reconstructs $P(0)$ using Lagrange interpolation:
$$P(0) = \sum_{j=1}^{t+1} P(i_j) \prod_{\substack{m=1 \\ m \neq j}}^{t+1} \frac{-i_m}{i_j - i_m}. \qquad (22.83)$$
This guarantees that even with t client dropouts, the aggregate update remains recoverable. The
communication complexity is reduced to
$$O\!\left(\frac{d}{N}\right), \qquad (22.84)$$
compared to the naive cost of O(d), demonstrating efficiency. The adversarial advantage in recov-
ering wi is bounded by
$$P[\text{Server learns } w_i \mid S] = O(2^{-128}), \qquad (22.85)$$
ensuring security. The Secure Aggregation protocol thus satisfies
$$\sum_{i=1}^{N} w_i' = \sum_{i=1}^{N} w_i, \qquad (22.86)$$
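The additive-masking mechanism of (22.74)–(22.76) can be illustrated with a short simulation. This is a minimal sketch (plain floating-point masks rather than finite-field cryptographic ones, and no dropout handling) showing that the pairwise masks cancel exactly in the server's sum:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 8                   # hypothetical: 5 clients, model dimension 8
w = rng.normal(size=(N, d))   # each client's local update w_i

# Pairwise masks with r_ij = -r_ji so that all masks cancel in the sum
r = np.zeros((N, N, d))
for i in range(N):
    for j in range(i + 1, N):
        r[i, j] = rng.normal(size=d)
        r[j, i] = -r[i, j]

# Each client sends the masked update w_i' = w_i + sum_{j != i} r_ij
w_masked = w + r.sum(axis=1)

# The server's sum of masked updates equals the true sum of updates,
# even though every individual w_masked[i] looks like random noise.
assert np.allclose(w_masked.sum(axis=0), w.sum(axis=0))
```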
Consider now the global objective written as a weighted sum of local functions,
$$F(x) = \sum_{i=1}^{m} p_i f_i(x), \qquad (22.87)$$
where $p_i$ are the weights associated with each local function $f_i$, and these weights are typically chosen such that they sum to one, i.e.,
$$\sum_{i=1}^{m} p_i = 1. \qquad (22.88)$$
Each local function $f_i(x)$ is assumed to be $L_i$-smooth, which means that for all $x, y \in \mathbb{R}^d$, the gradient of $f_i$ satisfies the Lipschitz continuity condition
$$\|\nabla f_i(x) - \nabla f_i(y)\| \leq L_i\|x - y\|. \qquad (22.89)$$
This is equivalent to the quadratic upper bound condition, which states that for all $x, y \in \mathbb{R}^d$,
$$f_i(y) \leq f_i(x) + \langle \nabla f_i(x), y - x \rangle + \frac{L_i}{2}\|y - x\|^2. \qquad (22.90)$$
Since the global function $F(x)$ is a weighted sum of the local functions, its gradient is given by
$$\nabla F(x) = \sum_{i=1}^{m} p_i \nabla f_i(x). \qquad (22.91)$$
Applying the triangle inequality and the Lipschitz property of each $\nabla f_i$, this leads to
$$\|\nabla F(x) - \nabla F(y)\| \leq \sum_{i=1}^{m} p_i L_i \|x - y\|. \qquad (22.95)$$
This proves that the function $F(x)$ is smooth with smoothness constant $L$ given by
$$L = \sum_{i=1}^{m} p_i L_i. \qquad (22.96)$$
Thus, we have established that $F(x)$ satisfies the Lipschitz continuity of the gradient,
$$\|\nabla F(x) - \nabla F(y)\| \leq L\|x - y\|. \qquad (22.97)$$
This confirms that if each local function has the same smoothness parameter L0 , then the global
function also has the same smoothness parameter, reinforcing the correctness of our result. Fur-
thermore, let us consider an extreme case where one of the local functions dominates, meaning
$p_j \approx 1$ for some $j$, while the remaining $p_i$ are close to zero. In this scenario,
$$L = \sum_{i=1}^{m} p_i L_i \approx L_j, \qquad (22.100)$$
which means that the global function inherits the smoothness property primarily from the dominant
local function. This further confirms the validity of our proof. Additionally, let us consider a case
where the $p_i$ are uniformly distributed, meaning $p_i = \frac{1}{m}$ for all $i$. In this case,
$$L = \sum_{i=1}^{m} \frac{1}{m} L_i = \frac{1}{m}\sum_{i=1}^{m} L_i, \qquad (22.101)$$
which shows that the smoothness constant of the global function is simply the arithmetic mean of
the individual smoothness constants. This is an intuitive result, as the aggregation of gradients in
Federated Learning naturally smooths out local variations. Finally, to further validate our result,
let us consider a simple case where we have two local functions with smoothness constants L1 and
L2 , and weights p1 and p2 such that p1 + p2 = 1. In this case,
L = p1 L1 + p2 L2 . (22.102)
This confirms that our result holds numerically as well. Therefore, the smoothness property of the
global function in Federated Learning is rigorously established, and our derivation is complete.
This condition implies that the function f (x) has a unique minimizer and that the function grows
at least quadratically away from the minimizer. The function is not only convex but also exhibits
a curvature characterized by µ, which ensures faster convergence rates in optimization algorithms.
Given this fundamental property, we now analyze the strong convexity of the global objective
function in Federated Learning. Federated Learning involves multiple clients, indexed by k, each
optimizing its own local loss function Fk (x). The global objective function is given by a weighted
combination of the individual loss functions:
$$F(x) = \sum_{k=1}^{K} p_k F_k(x). \qquad (22.105)$$
The coefficients $p_k$ are non-negative weights assigned to each client, which sum to one, i.e.,
$$\sum_{k=1}^{K} p_k = 1. \qquad (22.106)$$
A common choice for these weights in Federated Learning is $p_k = \frac{n_k}{n}$, where $n_k$ is the number of data points held by client $k$, and $n = \sum_{k=1}^{K} n_k$ is the total number of data points across all clients.
The goal is to establish that $F(x)$ is $\mu$-strongly convex, provided that each $F_k(x)$ is $\mu$-strongly convex. Since each local function $F_k(x)$ is given to be $\mu$-strongly convex, by definition, we have
$$F_k(y) \geq F_k(x) + \nabla F_k(x)^T(y - x) + \frac{\mu}{2}\|y - x\|^2, \quad \forall x, y \in \mathbb{R}^d. \qquad (22.107)$$
Multiplying both sides of this inequality by $p_k$, we obtain
$$p_k F_k(y) \geq p_k F_k(x) + p_k \nabla F_k(x)^T(y - x) + p_k \frac{\mu}{2}\|y - x\|^2. \qquad (22.108)$$
Summing over all clients $k$ from $1$ to $K$, we get
$$\sum_{k=1}^{K} p_k F_k(y) \geq \sum_{k=1}^{K} p_k \left[ F_k(x) + \nabla F_k(x)^T(y - x) + \frac{\mu}{2}\|y - x\|^2 \right]. \qquad (22.109)$$
Since $\sum_{k} p_k = 1$, the right-hand side equals $F(x) + \nabla F(x)^T(y - x) + \frac{\mu}{2}\|y - x\|^2$, establishing that $F(x)$ is $\mu$-strongly convex.
Each client i computes a stochastic update based on its local mini-batch of data, and the central
server aggregates the stochastic gradients to obtain an overall update direction. The aggregated
stochastic gradient is given by
$$g(w) = \frac{1}{N}\sum_{i=1}^{N} g_i(w). \qquad (22.117)$$
Since the expectation operator is linear, we take the expectation of both sides to obtain
$$\mathbb{E}[g(w) \mid w] = \mathbb{E}\!\left[\frac{1}{N}\sum_{i=1}^{N} g_i(w) \,\Big|\, w\right] = \frac{1}{N}\sum_{i=1}^{N} \mathbb{E}[g_i(w) \mid w] = \frac{1}{N}\sum_{i=1}^{N} \nabla f_i(w) = \nabla f(w). \qquad (22.118)$$
This confirms that $g(w)$ is an unbiased estimator of the true gradient $\nabla f(w)$. Next, we analyze the variance of $g(w)$, which is defined as
$$\mathbb{V}[g(w) \mid w] = \mathbb{E}\left[\|g(w) - \nabla f(w)\|^2 \mid w\right]. \qquad (22.120)$$
Rewriting in terms of the deviation from the individual client gradients, we express the variance as
$$\mathbb{V}[g(w) \mid w] = \mathbb{E}\left[\left\|\frac{1}{N}\sum_{i=1}^{N}\left(g_i(w) - \nabla f_i(w)\right)\right\|^2\right]. \qquad (22.121)$$
Using the property of variance for independent random variables, which states that if $X_1, X_2, \ldots, X_N$ are independent, then
$$\mathbb{V}\!\left[\sum_{i=1}^{N} X_i\right] = \sum_{i=1}^{N} \mathbb{V}[X_i], \qquad (22.122)$$
we obtain
$$\mathbb{V}[g(w) \mid w] = \frac{1}{N^2}\,\mathbb{E}\!\left[\sum_{i=1}^{N} \|g_i(w) - \nabla f_i(w)\|^2\right]. \qquad (22.123)$$
Since expectation is linear, we can factor out the sum:
$$\mathbb{V}[g(w) \mid w] = \frac{1}{N^2}\sum_{i=1}^{N} \mathbb{E}\left[\|g_i(w) - \nabla f_i(w)\|^2\right]. \qquad (22.124)$$
If we assume that each client's stochastic gradient has a bounded variance, i.e., there exists a finite constant $\sigma_i^2$ such that
$$\mathbb{E}[\|g_i(w) - \nabla f_i(w)\|^2] \leq \sigma_i^2, \qquad (22.125)$$
then we obtain
$$\mathbb{V}[g(w) \mid w] \leq \frac{1}{N^2}\sum_{i=1}^{N} \sigma_i^2. \qquad (22.126)$$
Defining $\sigma_{\max}^2 = \max_i \sigma_i^2$, we further obtain
$$\mathbb{V}[g(w) \mid w] \leq \frac{\sigma_{\max}^2}{N}. \qquad (22.127)$$
Since $\sigma_{\max}^2$ is finite, and $N$ is the number of participating clients, the variance of the aggregated stochastic gradient is bounded above by $\sigma_{\max}^2/N$, which ensures that the stochastic gradient variance remains bounded in Federated Learning. As $N \to \infty$, the variance of the stochastic gradient approaches zero, meaning that the aggregated stochastic gradient becomes an increasingly accurate estimator of the true gradient. This proves that the stochastic gradient variance is always bounded and decreases as more clients participate in the Federated Learning process.
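A small Monte Carlo experiment makes the $\sigma_{\max}^2/N$ scaling of (22.127) visible. The sketch below uses synthetic per-client gradients with i.i.d. Gaussian gradient noise (both assumptions made purely for illustration) and estimates the variance of the averaged stochastic gradient for increasing $N$:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10
true_grads = rng.normal(size=(100, d))  # stand-in per-client gradients grad f_i(w)

def aggregated_variance(N, trials=2000, sigma=1.0):
    """Empirical E||g(w) - grad f(w)||^2 for the average of N noisy client gradients."""
    grads = true_grads[:N]
    mean_grad = grads.mean(axis=0)
    errs = []
    for _ in range(trials):
        noisy = grads + sigma * rng.normal(size=grads.shape)  # g_i = grad f_i + noise
        errs.append(np.sum((noisy.mean(axis=0) - mean_grad) ** 2))
    return np.mean(errs)

for N in (1, 10, 100):
    print(N, aggregated_variance(N))  # decreases roughly like sigma^2 * d / N
```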
22.6.3 Heterogeneity
The degree of non-IIDness is quantified by the gradient divergence:
$$\|\nabla F_k(w) - \nabla F(w)\| \leq \delta. \qquad (22.128)$$
The degree of non-IIDness in Federated Learning is fundamentally captured by the deviation of the local gradient $\nabla F_k(w)$ from the global gradient $\nabla F(w)$. To formally establish the bound mentioned
in the above equation, we begin by considering the definitions of these gradient terms. The global objective function in Federated Learning is defined as the weighted sum of local objective functions:
$$F(w) = \sum_{k=1}^{K} p_k F_k(w), \qquad (22.129)$$
where the local objective is $F_k(w) = \mathbb{E}_{\xi_k \sim D_k}[f(w; \xi_k)]$, with $\xi_k$ a data sample drawn from the local data distribution $D_k$ of client $k$. Taking the gradient on both sides, we obtain:
$$\nabla F_k(w) = \mathbb{E}_{\xi_k \sim D_k}[\nabla f(w; \xi_k)]. \qquad (22.132)$$
Similarly, the global gradient can be expressed as:
$$\nabla F(w) = \sum_{k=1}^{K} p_k \nabla F_k(w). \qquad (22.133)$$
By adding and subtracting the expectation $\mathbb{E}_k[\nabla F_k(w)]$, we can rewrite the deviation $\nabla F_k(w) - \nabla F(w)$ as:
$$\nabla F_k(w) - \nabla F(w) = \left(\nabla F_k(w) - \mathbb{E}_k[\nabla F_k(w)]\right) + \left(\mathbb{E}_k[\nabla F_k(w)] - \sum_{j=1}^{K} p_j \nabla F_j(w)\right). \qquad (22.136)$$
The first term represents the deviation of the local gradient from its expectation within the local
dataset, while the second term quantifies the discrepancy between the expectation over the local
data distributions and the global objective function. Using Jensen's inequality and properties of expectations, each of these terms can be bounded. We define
δk = ∥∇Fk (w) − ∇F (w)∥, (22.138)
and establish the bound:
Ek [δk ] ≤ δ. (22.139)
Thus, we obtain the result:
∥∇Fk (w) − ∇F (w)∥ ≤ δ. (22.140)
$$\mathbb{E}[F(w^{(T)}) - F(w^*)] \leq O\!\left(\frac{1}{\mu T} + \frac{\sigma^2}{\mu T} + \frac{\delta^2}{\mu^2 T}\right), \qquad (22.141)$$
where $w^*$ is the optimal solution. To prove the convergence of the global model, we make the following assumptions. Each local objective function $F_k(w)$ is $L$-smooth, meaning:
$$\|\nabla F_k(x) - \nabla F_k(y)\| \leq L\|x - y\|, \quad \forall x, y \in \mathbb{R}^d.$$
The FedAvg algorithm proceeds in communication rounds $t = 1, 2, \ldots, T$. In each round, the server selects a subset of clients $S^{(t)}$; each selected client $k \in S^{(t)}$ then performs $\tau$ steps of local SGD to compute updated parameters $w_k^{(t+1)}$, after which the server aggregates the local updates to compute the new global model parameters $w^{(t+1)}$.
We analyze the expected suboptimality gap $\mathbb{E}[F(w^{(T)}) - F(w^*)]$, where $w^*$ is the optimal solution. Each client $k$ performs $\tau$ steps of local SGD:
$$w_k^{(t,s+1)} = w_k^{(t,s)} - \eta_t \nabla F_k(w_k^{(t,s)}; \xi_k^{(t,s)}), \qquad (22.146)$$
where $\eta_t$ is the learning rate at round $t$. After $\tau$ local steps, the server aggregates the updates:
$$w^{(t+1)} = \sum_{k \in S^{(t)}} \frac{|D_k|}{|D_{S^{(t)}}|}\, w_k^{(t+1)}. \qquad (22.147)$$
We now derive the expected suboptimality gap after $T$ communication rounds. Using the smoothness and strong convexity assumptions, we can write:
$$F(w^{(t+1)}) \leq F(w^{(t)}) - \eta_t \langle \nabla F(w^{(t)}), g^{(t)} \rangle + \frac{L\eta_t^2}{2}\|g^{(t)}\|^2, \qquad (22.148)$$
where $g^{(t)} = \sum_{k \in S^{(t)}} \frac{|D_k|}{|D_{S^{(t)}}|} \nabla F_k(w_k^{(t+1)})$. Taking the expectation over the stochastic gradients, we get:
$$\mathbb{E}[F(w^{(t+1)})] \leq F(w^{(t)}) - \eta_t \|\nabla F(w^{(t)})\|^2 + \frac{L\eta_t^2}{2}\left(\sigma^2 + \delta^2\right). \qquad (22.149)$$
Using the strong convexity of $F(w)$, which gives $\|\nabla F(w^{(t)})\|^2 \geq 2\mu\left(F(w^{(t)}) - F(w^*)\right)$, this yields the recursion
$$\Delta^{(t+1)} \leq (1 - 2\mu\eta_t)\Delta^{(t)} + \frac{L\eta_t^2}{2}\left(\sigma^2 + \delta^2\right). \qquad (22.152)$$
For a constant learning rate $\eta_t = \eta$, the recursion can be solved as:
$$\Delta^{(T)} \leq (1 - 2\mu\eta)^T \Delta^{(0)} + \frac{L\eta}{4\mu}\left(\sigma^2 + \delta^2\right). \qquad (22.153)$$
Choosing $\eta = \frac{1}{2\mu T}$, we get:
$$\Delta^{(T)} \leq O\!\left(\frac{1}{\mu T} + \frac{\sigma^2}{\mu T} + \frac{\delta^2}{\mu^2 T}\right). \qquad (22.154)$$
The expected suboptimality gap after $T$ communication rounds is:
$$\mathbb{E}[F(w^{(T)}) - F(w^*)] \leq O\!\left(\frac{1}{\mu T} + \frac{\sigma^2}{\mu T} + \frac{\delta^2}{\mu^2 T}\right). \qquad (22.155)$$
This result rigorously proves the convergence rate of the global model in Federated Learning under
the given assumptions.
$$T = O\!\left(\frac{L}{\mu}\log\frac{1}{\epsilon} + \frac{\sigma^2}{\mu\epsilon} + \frac{\delta^2}{\mu^2\epsilon}\right). \qquad (22.156)$$
We now prove the above formula for the number of communication rounds T required to achieve
an ϵ-approximate solution. Using the smoothness and strong convexity assumptions, we can write:
$$F(w^{(t+1)}) \leq F(w^{(t)}) - \eta_t \langle \nabla F(w^{(t)}), g^{(t)} \rangle + \frac{L\eta_t^2}{2}\|g^{(t)}\|^2, \qquad (22.157)$$
where $g^{(t)} = \sum_{k \in S^{(t)}} \frac{|D_k|}{|D_{S^{(t)}}|} \nabla F_k(w_k^{(t+1)})$. Taking the expectation over the stochastic gradients, we get:
$$\mathbb{E}[F(w^{(t+1)})] \leq F(w^{(t)}) - \eta_t \|\nabla F(w^{(t)})\|^2 + \frac{L\eta_t^2}{2}\left(\sigma^2 + \delta^2\right). \qquad (22.158)$$
Using the strong convexity of $F(w)$, we have:
$$\Delta^{(t+1)} \leq (1 - 2\mu\eta_t)\Delta^{(t)} + \frac{L\eta_t^2}{2}\left(\sigma^2 + \delta^2\right). \qquad (22.161)$$
For a constant learning rate $\eta_t = \eta$, the recursion can be solved as:
$$\Delta^{(T)} \leq (1 - 2\mu\eta)^T \Delta^{(0)} + \frac{L\eta}{4\mu}\left(\sigma^2 + \delta^2\right). \qquad (22.162)$$
To achieve $\Delta^{(T)} \leq \epsilon$, it suffices that
$$(1 - 2\mu\eta)^T \Delta^{(0)} \leq \frac{\epsilon}{2} \quad \text{and} \quad \frac{L\eta}{4\mu}\left(\sigma^2 + \delta^2\right) \leq \frac{\epsilon}{2}. \qquad (22.163)$$
Let $\eta = \frac{1}{2\mu T}$. Substituting this into the second inequality, we get:
$$\frac{L}{8\mu^2 T}\left(\sigma^2 + \delta^2\right) \leq \frac{\epsilon}{2}. \qquad (22.164)$$
Let’s analyze the convergence of Federated Adam (FedAdam) and Federated Yogi (FedYogi) by
deeply investigating the moment estimation dynamics, per-client learning rates, and their effects on
convergence bounds in a stochastic optimization framework. We will present the complete deriva-
tion of moment estimates, the role of adaptive learning rates, and the theoretical implications for
federated learning settings.
We start with the federated optimization problem where the goal is to minimize the global ob-
jective function:
$$F(w) = \sum_{k=1}^{K} p_k F_k(w), \qquad (22.168)$$
where $p_k = \frac{n_k}{n}$ is the weight of client $k$, and $F_k(w)$ is the local loss function at client $k$, with $n_k$ being the number of data points on client $k$ and $n = \sum_{k=1}^{K} n_k$ being the total number of data points across all clients. The global gradient estimate at round $t$ is obtained from selected clients $S^{(t)}$:
$$g^{(t)} = \sum_{k \in S^{(t)}} p_k g_k^{(t)}, \qquad (22.169)$$
where $g_k^{(t)}$ is the local gradient estimate for client $k$ at round $t$. The key challenge in federated learning is that each client has a different distribution of data, leading to gradient heterogeneity. FedAdam and FedYogi both maintain per-client moment estimates. The first and second moment estimates for client $k$ are given by:
$$m_k^{(t+1)} = \beta_1 m_k^{(t)} + (1 - \beta_1)\, g_k^{(t)}, \qquad (22.170)$$
$$v_k^{(t+1)} = \beta_2 v_k^{(t)} + (1 - \beta_2)\left(g_k^{(t)}\right)^2. \qquad (22.171)$$
These moment estimates are then aggregated across participating clients. The bias-corrected moment estimates for FedAdam are:
$$\hat{m}_k^{(t+1)} = \frac{m_k^{(t+1)}}{1 - \beta_1^{t+1}}, \qquad \hat{v}_k^{(t+1)} = \frac{v_k^{(t+1)}}{1 - \beta_2^{t+1}}. \qquad (22.172)$$
The aggregated moments are
$$\hat{m}^{(t+1)} = \sum_{k \in S^{(t)}} p_k \hat{m}_k^{(t+1)}, \qquad \hat{v}^{(t+1)} = \sum_{k \in S^{(t)}} p_k \hat{v}_k^{(t+1)}, \qquad (22.174)$$
and the global update is
$$w^{(t+1)} = w^{(t)} - \eta\, \frac{\hat{m}^{(t+1)}}{\sqrt{\hat{v}^{(t+1)}} + \epsilon}. \qquad (22.175)$$
FedYogi modifies the second-moment estimate using a sign-based update:
$$v_k^{(t+1)} = v_k^{(t)} - (1 - \beta_2)\left(g_k^{(t)}\right)^2 \operatorname{sgn}\!\left(v_k^{(t)} - \left(g_k^{(t)}\right)^2\right). \qquad (22.176)$$
This prevents $v_k^{(t+1)}$ from growing too large, stabilizing the learning process. We define the per-client effective learning rate as:
$$\eta_k^{(t)} = \frac{\eta}{\sqrt{\hat{v}_k^{(t)}} + \epsilon}. \qquad (22.177)$$
The key theoretical insight is that FedAdam and FedYogi assign different learning rates to each
client based on the local gradient variance:
• If a client has high gradient variance, $\hat{v}_k^{(t)}$ is large, leading to a small $\eta_k^{(t)}$.
• If a client has low gradient variance, $\hat{v}_k^{(t)}$ is small, leading to a large $\eta_k^{(t)}$.
This adaptivity ensures stability in non-i.i.d. settings, preventing clients with high variance from
dominating the update. We assume that F (w) is L-smooth and µ-strongly convex, satisfying:
$$F(w^{(t+1)}) \leq F(w^{(t)}) + \langle \nabla F(w^{(t)}), w^{(t+1)} - w^{(t)} \rangle + \frac{L}{2}\|w^{(t+1)} - w^{(t)}\|^2. \qquad (22.178)$$
Substituting the update rule
$$w^{(t+1)} = w^{(t)} - \eta\, \frac{\hat{m}^{(t+1)}}{\sqrt{\hat{v}^{(t+1)}} + \epsilon}, \qquad (22.179)$$
we obtain:
$$\mathbb{E}[F(w^{(t+1)})] \leq \mathbb{E}[F(w^{(t)})] - \eta\, \mathbb{E}\!\left[\frac{\|\nabla F(w^{(t)})\|^2}{\sqrt{\hat{v}^{(t+1)}} + \epsilon}\right] + O(\sigma^2). \qquad (22.180)$$
For optimally chosen $\eta$, $\beta_1$, and $\beta_2$, FedAdam and FedYogi achieve a convergence rate of
$$\mathbb{E}\!\left[\frac{1}{T}\sum_{t=1}^{T} \|\nabla F(w^{(t)})\|^2\right] \leq O\!\left(\frac{1}{\sqrt{T}}\right) + O(\sigma^2), \qquad (22.181)$$
which matches the convergence rate of centralized adaptive methods. This demonstrates that FedAdam and FedYogi effectively mitigate the heterogeneity of client data while maintaining robust per-client learning rates that accelerate convergence. By dynamically adjusting the effective learning rate based on the client-specific variance of gradients, both methods allow clients with stable gradients to take larger steps, while preventing unstable clients from destabilizing global updates. This is especially crucial in federated learning, where data distributions differ across clients and traditional methods like FedAvg may suffer from divergence due to client drift. Further, the client drift can be defined as the deviation from the global gradient,
$$d_k^{(t)} = g_k^{(t)} - g^{(t)}. \qquad (22.182)$$
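The moment dynamics of (22.170)–(22.177) can be sketched in a few lines. The following is a simplified server-side variant (an assumption for brevity: moments are maintained on the aggregated gradient rather than per client, and bias correction is omitted), showing the Yogi-style sign-based second-moment update and the resulting adaptive step:

```python
import numpy as np

def fedyogi_step(w, m, v, grads, weights, eta=0.1, beta1=0.9, beta2=0.99, eps=1e-3):
    """One server-side FedYogi-style step from weighted client gradient estimates.

    grads:   array of shape (K, d) with per-client gradient estimates g_k
    weights: array of shape (K,) with aggregation weights p_k summing to one
    """
    g = weights @ grads                               # aggregated gradient g^(t)
    m = beta1 * m + (1 - beta1) * g                   # first-moment estimate
    v = v - (1 - beta2) * g**2 * np.sign(v - g**2)    # Yogi sign-based second moment
    w = w - eta * m / (np.sqrt(v) + eps)              # adaptive parameter update
    return w, m, v

d, K = 4, 3
w, m, v = np.zeros(d), np.zeros(d), np.full(d, 1e-6)
rng = np.random.default_rng(2)
for t in range(100):
    grads = rng.normal(size=(K, d)) + np.array([1.0, -1.0, 0.5, 0.0])
    w, m, v = fedyogi_step(w, m, v, grads, np.full(K, 1 / K))
```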
In differentially private federated learning, Gaussian noise drawn from $\mathcal{N}(0, \sigma_{\text{DP}}^2 I)$ is added to each clipped gradient, where $\sigma_{\text{DP}}$ is calibrated to achieve $(\epsilon, \delta)$-differential privacy. To ensure privacy, noise is introduced through the Gaussian mechanism, which requires properly calibrating $\sigma_{\text{DP}}$ such that the updates satisfy $(\epsilon, \delta)$-differential privacy. The definition of differential privacy states that a mechanism $\mathcal{M}$ satisfies $(\epsilon, \delta)$-DP if for any two adjacent datasets $D$ and $D'$ differing by at most one data point, and for all measurable sets $S$,
$$P[\mathcal{M}(D) \in S] \leq e^{\epsilon}\, P[\mathcal{M}(D') \in S] + \delta.$$
The sensitivity $\Delta$ of the gradient update measures how much a single individual's data can change the computed gradient:
$$\Delta = \sup_{D, D'} \left\|\nabla F_k(w_k^{(t)}; \xi_k^{(t)}) - \nabla F_k'(w_k^{(t)}; \xi_k^{(t)})\right\|. \qquad (22.187)$$
To bound the sensitivity, we apply gradient clipping, which ensures that individual contributions
to the update do not exceed a predefined bound C:
$$g_k^{(t)} = \operatorname{Clip}\!\left(\nabla F_k(w_k^{(t)}; \xi_k^{(t)}),\, C\right) = \nabla F_k(w_k^{(t)}; \xi_k^{(t)}) \cdot \min\!\left(1, \frac{C}{\|\nabla F_k(w_k^{(t)}; \xi_k^{(t)})\|}\right). \qquad (22.188)$$
$$\sigma_{\text{DP}}^2 = \frac{T q^2 C^2}{\epsilon^2} \cdot 2\ln(1.25/\delta). \qquad (22.193)$$
The differential privacy guarantee is further strengthened due to subsampling amplification, where
only a subset of data points is used at each iteration. Given that the privacy loss accumulates
with composition, we apply the advanced composition theorem, which states that for T sequential
applications of a mechanism with (ϵ, δ)-DP guarantees, the total privacy guarantee is
$$\epsilon_T = \sqrt{2T\ln(1/\delta)} \cdot \epsilon + T\epsilon(e^{\epsilon} - 1). \qquad (22.194)$$
If we use moment accountant analysis, we can achieve a tighter bound on the privacy cost. Instead
of direct composition, the Gaussian mechanism with privacy amplification by subsampling provides
a refined bound for σDP , which ensures that the total privacy guarantee satisfies
$$\epsilon \approx \frac{qC}{\sigma_{\text{DP}}}\sqrt{2T\ln(1/\delta)}. \qquad (22.195)$$
The complete privacy-preserving update is then performed as follows. First, the gradient is com-
puted:
$$g_k^{(t)} = \nabla F_k(w_k^{(t)}; \xi_k^{(t)}). \qquad (22.196)$$
Next, gradient clipping is applied:
$$g_k^{(t)} \leftarrow \operatorname{Clip}(g_k^{(t)}, C), \qquad (22.197)$$
and Gaussian noise is added to produce the privatized gradient $\tilde{g}_k^{(t)} = g_k^{(t)} + \mathcal{N}(0, \sigma_{\text{DP}}^2 I)$.
Finally, the model parameters are updated using the noisy gradient:
$$w_k^{(t+1)} = w_k^{(t)} - \eta_t\, \tilde{g}_k^{(t)}. \qquad (22.200)$$
Thus, by properly calibrating σDP using the privacy accountant, we guarantee that the model
update remains differentially private while still allowing learning to take place.
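The clip-then-noise pipeline of (22.188)–(22.200) is straightforward to express in code. The sketch below follows the calibration formula (22.193) with illustrative values of $T$, $q$, $C$, $\epsilon$, and $\delta$ (all assumed, not prescriptive), and should not be taken as a production-grade DP implementation:

```python
import numpy as np

rng = np.random.default_rng(3)

def dp_gradient(grad, C, sigma_dp):
    """Clip a gradient to norm at most C, then add Gaussian noise N(0, sigma_dp^2 I)."""
    clipped = grad * min(1.0, C / (np.linalg.norm(grad) + 1e-12))
    return clipped + sigma_dp * rng.normal(size=grad.shape)

# Hypothetical calibration following sigma_DP^2 = (T q^2 C^2 / eps^2) * 2 ln(1.25/delta)
T, q, C, eps, delta = 100, 0.01, 1.0, 2.0, 1e-5
sigma_dp = np.sqrt(T * q**2 * C**2 / eps**2 * 2 * np.log(1.25 / delta))

g = rng.normal(size=10)            # stand-in per-client gradient
g_private = dp_gradient(g, C, sigma_dp)
```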
In random sparsification, each coordinate of the transmitted update is retained according to an independent Bernoulli mask:
$$m_i \sim \text{Bernoulli}(p), \qquad (22.204)$$
where $p$ is the probability of retaining each coordinate independently. The expected number of retained parameters is $p \cdot d$, leading to an expected communication cost of
Csparse = p · d · b (22.205)
where b is the number of bits required to represent each transmitted parameter. The error intro-
duced by sparsification is given by
$$e_k^{(t+1)} = e_k^{(t)} + \left(w_k^{(t)} - \eta_t \nabla F_k(w_k^{(t)}; \xi_k^{(t)})\right) - S\!\left(w_k^{(t)} - \eta_t \nabla F_k(w_k^{(t)}; \xi_k^{(t)})\right). \qquad (22.206)$$
This error accumulation mechanism, often referred to as error feedback or residual accumulation,
ensures that information lost due to sparsification is compensated for in subsequent updates, re-
ducing the bias introduced by sparsification. In the case where $F_k(w)$ is a convex and $L$-smooth function, the gradient update without sparsification satisfies the standard SGD convergence bound. With sparsification, the transmitted update is $\hat{g} = S(g)$ instead of the full gradient $g = \nabla F_k(w)$, which introduces an additional variance term $\mathbb{E}[\|\hat{g} - g\|^2]$ into that bound and slows the convergence accordingly.
The optimal choice of the sparsification strategy depends on balancing communication efficiency
and convergence speed. For example, top-k sparsification minimizes the introduced variance by
selecting the most important updates, but it requires additional computation to determine the top-k
elements. On the other hand, random sparsification is computationally inexpensive but introduces
additional noise into the optimization process, which may slow down convergence. Theoretical
analysis of the convergence properties under sparsification involves bounding the sparsification-
induced error in terms of the gradient norm. Suppose $w^*$ is the optimal parameter vector and $g$ is the full gradient update; one then bounds the expected squared distance to the optimum after one sparsified update. To ensure convergence, the learning rate $\eta_t$ must be chosen such that the expected decrease in function value per iteration is sufficient to counteract the sparsification-induced variance. Analyzing the expected reduction in loss function per step, we obtain
$$\mathbb{E}[F_k(w^{(t+1)})] - F_k(w^*) \leq \left(1 - \frac{2\eta_t L}{1 + \eta_t L}\right)\left(\mathbb{E}[F_k(w^{(t)})] - F_k(w^*)\right) + \frac{\eta_t^2 \sigma^2}{2(1 + \eta_t L)}. \qquad (22.214)$$
This equation highlights the trade-off between convergence speed and sparsification-induced er-
ror. If sparsification is too aggressive (i.e., p is too small or k is too low), the variance term σ 2
dominates, and convergence slows down. Conversely, if communication cost is not a concern and
no sparsification is applied, the variance term disappears, but communication cost per iteration
remains high. Thus, an optimal sparsification strategy must balance these competing factors by
ensuring that the retained updates provide sufficient information for gradient-based optimization
while minimizing unnecessary communication overhead.
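A minimal sketch of top-$k$ sparsification with the error-feedback mechanism of (22.206) is given below; the gradient stream is synthetic and all hyperparameters are illustrative assumptions:

```python
import numpy as np

def topk_sparsify(u, k):
    """Keep the k largest-magnitude entries of u and zero out the rest."""
    s = np.zeros_like(u)
    idx = np.argsort(np.abs(u))[-k:]
    s[idx] = u[idx]
    return s

rng = np.random.default_rng(4)
d, k, eta = 100, 10, 0.1
w, e = np.zeros(d), np.zeros(d)   # parameters and error-feedback residual
for t in range(50):
    g = rng.normal(size=d) + 0.5  # stand-in stochastic gradient
    u = e + eta * g               # add back the residual before compressing
    s = topk_sparsify(u, k)       # transmitted sparse update
    e = u - s                     # accumulate what was dropped this round
    w = w - s                     # apply only the sparse update
```

The residual `e` is exactly the error-accumulation term: information dropped by the compressor is re-injected in later rounds, which is what keeps the sparsified scheme from drifting away from plain SGD.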
Here, $|D|$ denotes the total number of data points across all clients. The generalization error of the
global model in Federated Learning (FL) can be analyzed using Rademacher complexity. In FL,
the model is trained in a distributed manner where each client optimizes its local objective, and
the global model aggregates these updates. The objective function in FL is the expected risk
$$F(w) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(w; z)],$$
where $w \in \mathbb{R}^d$ represents the model parameters, $\ell(w; z)$ denotes the loss function, and $\mathcal{D}$ is the underlying data distribution. In empirical risk minimization (ERM), we approximate the expected risk using the empirical risk:
$$\hat{F}(w) = \frac{1}{|D|}\sum_{i=1}^{|D|} \ell(w; z_i). \qquad (22.217)$$
However, due to the federated setting, the empirical risk is computed separately at different clients,
leading to the following global empirical risk:
$$\hat{F}_{\text{FL}}(w) = \sum_{k=1}^{K} \frac{|D_k|}{|D|}\, \hat{F}_k(w), \qquad (22.218)$$
where the local empirical risk at the $k$th client is given by:
$$\hat{F}_k(w) = \frac{1}{|D_k|}\sum_{z \in D_k} \ell(w; z). \qquad (22.219)$$
To achieve this, we employ Rademacher complexity, which measures the expressivity of the hy-
pothesis class. The empirical Rademacher complexity of a hypothesis class H with respect to a
dataset S = {z1 , . . . , zn } is defined as:
" n
#
1 X
R̂S (H) = Eσ sup σi h(zi ) , (22.221)
h∈H n i=1
where σi are independent Rademacher random variables such that P(σi = ±1) = 1/2. For a class of
Lipschitz-bounded loss functions, standard statistical learning theory results yield the generalization
bound:
$$\mathbb{E}[F(w) - \hat{F}(w)] \leq 2\mathcal{R}_n(\mathcal{H}) + O\!\left(\frac{1}{\sqrt{n}}\right), \qquad (22.222)$$
where $n = |D|$ is the total number of data points. In the federated learning setting, where each client contributes a fraction $\frac{|D_k|}{|D|}$ of the data, the Rademacher complexity for FL can be estimated as:
$$\mathcal{R}_{|D|}(\mathcal{H}) = O\!\left(\sqrt{\frac{d}{|D|}}\right). \qquad (22.223)$$
Federated Learning optimizes the loss function via stochastic gradient descent (SGD) over T rounds.
The update rule in SGD follows
$$w^{(t+1)} = w^{(t)} - \eta \nabla \ell(w^{(t)}; z_t),$$
where $\eta$ is the step size. The convergence of SGD depends on smoothness and strong convexity assumptions, in particular that the loss function $\ell(w; z)$ is $L$-smooth.
• Theoretical Limits: Deriving tight lower bounds on communication complexity and con-
vergence rates.
22.11 Conclusion
Federated Learning is a mathematically rigorous framework that combines distributed optimiza-
tion, statistical learning, and privacy-preserving techniques. Its theoretical foundations and prac-
tical applications make it a cornerstone of modern machine learning in decentralized settings. By
addressing the challenges of non-IID data, communication efficiency, and privacy, FL enables col-
laborative learning across diverse and distributed datasets.
23 Kernel Regression
Literature Review: Fan et al. (2025) [673] explored kernel regression techniques in causal inference, particularly in the presence of interference among observations. The authors propose an innovative nonparametric estimator that integrates kernel regression with trimming methods, improving robustness in observational studies. Atanasov et al. (2025) [674] generalized kernel regression by linking it to high-dimensional linear models and stochastic gradient dynamics. The authors present new asymptotics that extend classical results in nonparametric regression and random feature models. Mishra et al. (2025) [676] applied Gaussian kernel-based regression to image classification and feature extraction. The authors demonstrate how kernel selection significantly impacts model performance in plant leaf detection tasks. Elsayed and Nazier (2025) [677] combined kernel smoothing regression with decomposition analysis to study labor market trends. It highlights the application of kernel-based regression techniques in socio-economic modeling. Kong et al. (2025) [678] applied Bayesian Kernel Machine Regression (BKMR) to analyze complex relationships between heavy metal exposure and health indicators. It extends kernel regression to toxicology and epidemiological studies. Bracale et al. (2025) [679] explored antitonic regression methods, establishing new concentration inequalities for regression problems. It highlights kernel methods' superiority over traditional parametric approaches in pricing models. Köhne et al. (2025) [680] provided a theoretical foundation for kernel regression within Hilbert spaces, focusing on error bounds for kernel approximations in dynamical systems. Sadeghi and Beyeler (2025) [681] applied Gaussian Process Regression (GPR) with a Matérn kernel to estimate perceptual thresholds in retinal implants, showcasing kernel-based regression in biomedical engineering. Naresh et al. (2025) [682] in a book chapter discussed logistic regression and kernel methods in network security. It illustrates how kernelized models can enhance cybersecurity measures in firewalls. Zhao et al. (2025) [683] proposed Deep Additive Kernel (DAK) models, which unify kernel methods with deep learning. This approach enhances Bayesian neural networks' interpretability and robustness.
where K(x, x′ ) is a positive definite kernel function, ensuring that the Gram matrix
K = [K(xi , xj )]ni,j=1 (23.3)
is symmetric positive semi-definite (PSD). The spectral properties of K are crucial for under-
standing kernel regression’s behavior, particularly in the context of regularization, overfitting, and
generalization error analysis. To rigorously analyze kernel regression, we consider the Reproducing
Kernel Hilbert Space (RKHS) HK induced by K(x, x′ ), where functions satisfy:
$$f(x) = \sum_{i=1}^{\infty} \alpha_i \varphi_i(x), \qquad (23.4)$$
where $\varphi_i(x)$ are the eigenfunctions of the integral operator associated with $K(x, x')$:
$$(\mathcal{K}f)(x) = \int_\Omega K(x, x') f(x')\, d\mu(x'). \qquad (23.5)$$
$$\mathcal{K}\varphi_i = \lambda_i \varphi_i, \quad i = 1, 2, \ldots \qquad (23.6)$$
where
$$\lambda_1 \geq \lambda_2 \geq \cdots \geq 0 \qquad (23.7)$$
are the eigenvalues of K. These eigenvalues and eigenfunctions determine the approximation ca-
pacity of kernel regression and its regularization properties.
$$K_h(x) = \frac{1}{h^d} K\!\left(\frac{x}{h}\right), \qquad (23.9)$$
where h is the bandwidth parameter that determines the smoothing level. A crucial property of
kernel functions is their normalization condition,
$$\int_{\mathbb{R}^d} K(x)\, dx = 1. \qquad (23.10)$$
Let us now do the Bias-Variance Decomposition and Overfitting in Kernel Regression. The performance of kernel regression is governed by the bias-variance tradeoff:
$$\mathbb{E}[(\hat{f}(x) - f(x))^2] = \text{Bias}^2(\hat{f}(x)) + \text{Var}(\hat{f}(x)),$$
where
$$\text{Bias}(\hat{f}(x)) = \mathbb{E}[\hat{f}(x)] - f(x), \qquad (23.13)$$
and
$$\text{Var}(\hat{f}(x)) = \mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]. \qquad (23.14)$$
Expanding $f(x)$ via Taylor series, we obtain
$$f(x_i) \approx f(x) + (x_i - x)^T \nabla f(x) + \frac{1}{2}(x_i - x)^T H_f(x)(x_i - x). \qquad (23.15)$$
The expectation of the kernel estimate gives
$$\mathbb{E}[\hat{f}(x)] = f(x) + \frac{h^2}{2}\sum_{j=1}^{d} \frac{\int u_j^2 K(u)\, du}{\int K(u)\, du}\, \frac{\partial^2 f}{\partial x_j^2} + O(h^4), \qquad (23.16)$$
$$\text{Var}(\hat{f}(x)) \approx \frac{\sigma^2}{n h^d f_X(x)} \int K^2(u)\, du. \qquad (23.17)$$
Thus, variance scales as $O((nh^d)^{-1})$, leading to the optimal bandwidth selection
$$h^* \propto n^{-\frac{1}{d+4}}. \qquad (23.18)$$
Kernel ridge regression (KRR) is a standard regularization technique for preventing overfitting. To control overfitting, we introduce Tikhonov regularization in kernel space. Define the Gram matrix $K$ with entries
$$K_{ij} = K_h(x_i - x_j). \qquad (23.20)$$
We solve the regularized least squares problem $\min_\alpha \|y - K\alpha\|^2 + \lambda \alpha^T K \alpha$, whose closed-form solution is
$$\alpha = (K + \lambda I)^{-1} y. \qquad (23.21)$$
For small $\lambda$, inverse eigenvalues $\sigma_i^{-1}$ amplify noise, whereas for large $\lambda$, the regularization term suppresses high-frequency components. In the spectral decomposition of $K$, we write
$$K = \sum_i \sigma_i v_i v_i^T, \qquad (23.23)$$
where σ1 ≥ σ2 ≥ · · · ≥ σn ≥ 0 are the eigenvalues of the kernel matrix K and vi are the orthonormal
eigenvectors, i.e.,
viT vj = δij (23.24)
where δij is the Kronecker delta. The rank of K is equal to the number of nonzero eigenvalues
σi . The eigenvalues of K encode the spectrum of feature space correlations. If the kernel function
$K(x, x')$ is smooth, the eigenvalues decay rapidly, e.g.
$$\sigma_i \approx O(i^{-\tau}) \qquad (23.25)$$
for some decay exponent τ > 0. The spectral decay controls the effective degrees of freedom of
kernel regression. Applying regularization, the solution is
$$\hat{f}(x) = \sum_{i=1}^{n} \frac{\sigma_i}{\sigma_i + \lambda}\left(v_i^T y\right) v_i(x). \qquad (23.26)$$
The regularization smoothly filters the high-frequency components of f (x), preventing overfitting.
For controlling model complexity in spectral filtering, we have to note that large $\sigma_i$ corresponds to low-frequency components retained in the solution, while small $\sigma_i$ are high-frequency components, attenuated by regularization. The cutoff occurs around $\sigma_i \approx \lambda$, defining the effective model complexity. For very small $\lambda$,
$$\frac{\sigma_i}{\sigma_i + \lambda} \approx 1, \qquad (23.27)$$
causing high variance. For large $\lambda$,
$$\frac{\sigma_i}{\sigma_i + \lambda} \approx \frac{\sigma_i}{\lambda}, \qquad (23.28)$$
which heavily suppresses small eigenvalues, leading to underfitting. The optimal $\lambda$ is selected via cross-validation, minimizing
$$CV(\lambda) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{f}^{-i}(x_i)\right)^2, \qquad (23.29)$$
where $\hat{f}^{-i}$ denotes the estimator fitted without the $i$th observation.
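The spectral filter $\sigma_i/(\sigma_i + \lambda)$ of (23.26) can be checked directly against the closed-form kernel ridge solution (23.21). The sketch below, with an assumed Gaussian kernel, bandwidth, and ridge parameter, verifies that the eigendecomposition route reproduces $K\alpha$:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 60
x = np.sort(rng.uniform(-3, 3, n))
y = np.sin(x) + 0.2 * rng.normal(size=n)  # noisy samples of a smooth target

h, lam = 0.5, 1e-2                        # bandwidth and ridge parameter (assumed)
K = np.exp(-(x[:, None] - x[None, :])**2 / (2 * h**2))  # Gaussian Gram matrix

# Closed-form KRR coefficients alpha = (K + lam I)^{-1} y
alpha = np.linalg.solve(K + lam * np.eye(n), y)

# Spectral view: each eigenmode of K is shrunk by the filter sigma_i / (sigma_i + lam)
sig, V = np.linalg.eigh(K)
fit_spectral = V @ (sig / (sig + lam) * (V.T @ y))
assert np.allclose(K @ alpha, fit_spectral)
```

The assertion holds because $K(K + \lambda I)^{-1} y = V\,\mathrm{diag}\!\big(\sigma_i/(\sigma_i+\lambda)\big)\,V^T y$, which is exactly the filtered expansion in (23.26).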
where $\alpha$ now depends on $K$ and $L$. In conclusion, kernel regression is powerful but prone to overfitting when $h$ is too small, leading to high variance. Regularization techniques such as kernel ridge regression, Tikhonov regularization, and smoothing splines mitigate overfitting by shrinking the high-frequency spectral components of the fit.
There is a Bias-Variance Tradeoff in Spectral Terms. The expected bias measures the deviation of
fˆ(x) from f (x):
$$\text{Bias}^2 = \sum_{i=1}^{n}\left(1 - g(\sigma_i)\right)^2 c_i^2, \qquad (23.33)$$
where
$$g(\sigma_i) = \frac{\sigma_i}{\sigma_i + \lambda}. \qquad (23.34)$$
Large λ shrinks eigenmodes, increasing bias. The variance measures sensitivity to noise:
$$\text{Var} = \sigma^2 \sum_{i=1}^{n} g(\sigma_i)^2. \qquad (23.35)$$
For small λ, the model overfits, leading to high variance. The expected generalization bound error
in Spectral Terms is given by:
$$\mathbb{E}[(\hat{f}(x) - f(x))^2] = \sum_i \left(1 - g(\sigma_i)\right)^2 c_i^2 + \sigma^2 \sum_i g(\sigma_i)^2. \qquad (23.36)$$
In conclusion, spectral properties play an important role in kernel regression: they determine its ability to generalize and avoid overfitting.
By leveraging spectral decomposition, we gain a deep understanding of how kernel regression in-
terpolates data while controlling complexity. The optimal choice of λ and h ensures an optimal
tradeoff between bias and variance, leading to a robust kernel regression model.
generalized maximum likelihood estimation. It rigorously analyzes the estimator’s weighting prop-
erties and neighborhood selection criteria. Jennen-Steinmetz and Gasser (1988) [653] provided a
comparative framework between the Priestley-Chao estimator and other kernel regression estima-
tors. They explore its mathematical properties, convergence rates, and advantages over alternative
methods such as Nadaraya-Watson. Mack and Müller (1988) [659] evaluated the Priestley-Chao
estimator’s error behavior in nonparametric regression. The paper highlights how convolution-
type adjustments can improve estimation accuracy under random design conditions. Jones et al. (2024) [660] categorized various kernel regression estimators, including the Priestley-Chao estima-
tor. It critically evaluates its statistical efficiency and variance properties in comparison to other
kernel methods. Ghosh (2015) [661] introduced a variance estimation technique specifically for
the Priestley-Chao kernel estimator. The paper presents a method to avoid nuisance parameter
estimation, improving computational efficiency. Liu and Luor (2023) [662] integrated fractal in-
terpolants with the Priestley-Chao estimator to handle complex regression problems. It explores
modifications to kernel functions that enhance estimation in high-dimensional datasets. Gasser
and Müller (1979) [645] wrote a foundational work that revisits the Priestley-Chao estimator in the
context of kernel regression. The authors propose two alternative definitions for kernel estimation,
aiming to refine the estimator’s application in empirical data analysis.
in some suitable sense, such as pointwise convergence, mean squared error (MSE) consistency, or
uniform convergence over compact subsets. In kernel density estimation (KDE), a common
approach is to define
$$\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right), \qquad (23.39)$$
where K(·) is a kernel function satisfying
$$\int_{-\infty}^{\infty} K(u)\, du = 1, \qquad (23.40)$$
and h is a bandwidth parameter controlling the level of smoothing. However, KDE relies on a
fixed bandwidth h, which can lead to oversmoothing in regions of high density and un-
dersmoothing in regions of low density. The Priestley–Chao estimator improves upon
this by adapting the bandwidth locally, based on the spacings between consecutive order
statistics.
There is an important role of Order Statistics and Spacings. Let us define the order statistics of
the sample as
X(1) ≤ X(2) ≤ · · · ≤ X(n) (23.41)
The fundamental insight behind the Priestley–Chao estimator is that the spacings between
order statistics contain direct information about the local density. Define the spacing
between two consecutive order statistics as
$$D_i = X_{(i+1)} - X_{(i)}. \qquad (23.42)$$
Using results from order statistics theory, we obtain the key approximation
$$\mathbb{E}[D_i] \approx \frac{1}{n f(X_{(i)})}, \qquad (23.43)$$
which follows from the fact that the probability of observing a sample in a small interval around X(i)
is approximately given by the density f (X(i) ) times the width of the interval. Thus, rearranging,
we obtain the fundamental estimator
$$\hat{f}(X_{(i)}) \approx \frac{1}{n D_i}. \qquad (23.44)$$
This provides a direct data-driven way to estimate the density without choosing a fixed band-
width h, as in classical KDE methods. Let’s now state the formal Definition of the Priestley–Chao
Estimator. The Priestley–Chao kernel estimator is defined as
$$\hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n-1} \frac{1}{D_i} K\!\left(\frac{x - X_{(i)}}{D_i}\right). \qquad (23.45)$$
Unlike fixed-bandwidth KDE, here the bandwidth Di varies with location, allowing better
adaptation to the underlying density structure. To understand the performance of the estimator,
we analyze its bias and variance. Using a first-order Taylor expansion of Di around its
expectation, we write
$$D_i = \frac{1}{n f(X_{(i)})} + \epsilon_i, \qquad (23.47)$$
where $\epsilon_i$ represents the stochastic deviation from the expected value. Substituting this into the estimator,
$$\hat{f}(X_{(i)}) = \frac{1}{n}\sum_{i=1}^{n-1}\left(f(X_{(i)}) + n f(X_{(i)})^2 \epsilon_i\right) K\!\left(\frac{x - X_{(i)}}{D_i}\right). \qquad (23.48)$$
Taking expectations, we obtain the leading-order bias term
$$\mathbb{E}[\hat{f}(x)] = f(x) + \frac{1}{2} h^2 f''(x) + O(n^{-2/5}), \qquad (23.49)$$
where $h = D_i$ represents the local bandwidth. The variance of the estimator follows from the variance of the spacings; combining this with the bias term above shows that the Priestley–Chao estimator achieves the optimal nonparametric rate of convergence. The kernel function $K(\cdot)$ plays a crucial role in smoothing the estimate. Common
choices include:
1. Uniform kernel:
$$K(u) = \frac{1}{2}\,\mathbb{1}(|u| \leq 1). \qquad (23.53)$$
2. Epanechnikov kernel (optimal in MSE sense):
$$K(u) = \frac{3}{4}(1 - u^2)\,\mathbb{1}(|u| \leq 1). \qquad (23.54)$$
3. Gaussian kernel:
$$K(u) = \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}. \qquad (23.55)$$
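A direct transcription of the Priestley–Chao estimator (23.45) with the Gaussian kernel (23.55) is sketched below; the sample is synthetic, and no smoothing of the raw spacings is attempted, so the estimate can be locally rough:

```python
import numpy as np

def priestley_chao(x_grid, sample):
    """Spacing-based Priestley-Chao density estimate on a grid (Gaussian kernel)."""
    xs = np.sort(sample)
    D = np.diff(xs)                # spacings D_i = X_(i+1) - X_(i)
    n = len(xs)
    # f_hat(x) = (1/n) * sum_i (1/D_i) K((x - X_(i)) / D_i), per (23.45)
    u = (x_grid[:, None] - xs[:-1][None, :]) / D[None, :]
    K = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
    return (K / D[None, :]).sum(axis=1) / n

rng = np.random.default_rng(6)
sample = rng.normal(size=500)
grid = np.linspace(-3, 3, 101)
f_hat = priestley_chao(grid, sample)  # compare against the standard normal density
```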
Let’s now describe the Mathematical Framework of the Gasser-Müller kernel estimator. Let
X1 , X2 , . . . , Xn represent a set of n independent and identically distributed (i.i.d.) random variables
drawn from an unknown distribution with a probability density function f (x). The goal of kernel
density estimation is to estimate this unknown density f (x) based on the observed data. For the
Gasser-Müller kernel estimator fˆh (x), the core idea is to place the kernel function at the midpoint
between two consecutive data points, Xi and Xi+1 , as follows:
$$\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^{n-1} K_h\!\left(x - \frac{X_i + X_{i+1}}{2}\right), \qquad (23.57)$$
where $\xi_i = \frac{X_i + X_{i+1}}{2}$ is the midpoint between consecutive data points, often referred to as the "midpoint shift", $K_h(x) = \frac{1}{h} K\!\left(\frac{x}{h}\right)$ is the scaled kernel function with bandwidth $h$, and $K(x)$ is the kernel function, typically chosen to be a symmetric probability density, such as the Gaussian kernel:
$$K(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}. \qquad (23.58)$$
The key difference between the Gasser-Müller estimator and the traditional kernel estimator is the
use of midpoints ξi instead of the individual data points. The kernel function Kh (x) is applied
to the midpoint shift, effectively smoothing the data and addressing boundary bias by utilizing
information from adjacent points.
The bias of the estimator can be derived by expanding fˆh (x) in a Taylor series around the true
density f (x). To compute the expected value of fˆh (x), we first express the expected kernel evalua-
tion:
$$\mathbb{E}[\hat{f}_h(x)] = \frac{1}{n}\sum_{i=1}^{n-1} \mathbb{E}[K_h(x - \xi_i)]. \qquad (23.59)$$
Since $\xi_i$ is the midpoint of adjacent points $X_i$ and $X_{i+1}$, we perform a Taylor expansion around the true density $f(x)$, resulting in:
$$\mathbb{E}[\hat{f}_h(x)] = f(x) + \frac{h^2}{2} f''(x) \int_{-\infty}^{\infty} u^2 K(u)\, du + O(h^4), \qquad (23.60)$$
where $\int_{-\infty}^{\infty} u^2 K(u)\, du$ is the second moment of the kernel function, denoted $\sigma_K^2$. The term $\frac{h^2}{2} f''(x)\sigma_K^2$ represents the bias of the estimator, which is quadratic in $h$. Thus, the bias decreases as $h$ becomes smaller, and for sufficiently smooth densities, this bias is small. The main advantage of the Gasser-Müller method is that it leads to a smaller bias compared to standard kernel density estimators, especially at the boundaries. The variance of $\hat{f}_h(x)$ represents the fluctuation of the estimator across different samples. The variance is given by:
$$\text{Var}[\hat{f}_h(x)] \approx \frac{1}{nh}\, f(x) \int_{-\infty}^{\infty} K^2(u)\, du, \qquad (23.61)$$
where $\int_{-\infty}^{\infty} K^2(u)\, du$ is the roughness of the kernel function $K(x)$. The variance decreases
as the sample size n increases, but it also depends on the bandwidth h. For a fixed sample size,
the variance is inversely proportional to both h and n, i.e.,
$$\text{Var}[\hat{f}_h(x)] \propto \frac{1}{nh}. \qquad (23.62)$$
Thus, larger sample sizes and smaller bandwidths lead to smaller variance, but the optimal band-
width must balance the trade-off between bias and variance. The mean squared error (MSE)
combines both the bias and the variance to evaluate the overall performance of the estimator. The
MSE is given by:
MSE[fˆh (x)] = Bias2 + Var (23.63)
Substituting the expressions for bias and variance, we obtain:
$$\text{MSE}[\hat{f}_h(x)] = \left(\frac{h^2}{2} f''(x)\sigma_K^2\right)^2 + \frac{1}{nh}\, f(x) \int_{-\infty}^{\infty} K^2(u)\, du. \qquad (23.64)$$
To minimize the MSE, we select an optimal bandwidth hopt . By differentiating the MSE with
respect to h and setting the derivative to zero, we obtain the optimal bandwidth that balances the
bias and variance:
hopt ∝ n−1/5 . (23.65)
Thus, the optimal bandwidth decreases as the sample size increases, and this scaling behavior is a
fundamental characteristic of kernel density estimation.
The Gasser-Müller estimator performs exceptionally well when compared to other kernel density
estimators, such as the Parzen-Rosenblatt estimator. The Parzen-Rosenblatt method places
kernels directly at the data points $X_i$, whereas the Gasser-Müller method places kernels at the midpoints $\xi_i = \frac{X_i + X_{i+1}}{2}$. This simple modification significantly reduces boundary bias and results
in smoother and more accurate estimates, especially at the boundaries of the sample. Boundary
bias occurs in standard KDE methods because kernels at the boundaries have fewer data points to
influence them, which leads to a less accurate estimate of the density. Moreover, the Gasser-Müller
estimator excels in derivative estimation. When estimating the first or second derivatives of the
density function, the Gasser-Müller method provides more accurate estimates with lower variance
compared to traditional methods. The use of midpoints ensures that the kernel function is better
centered relative to the data, reducing boundary effects that are particularly problematic when
estimating derivatives. Regarding the asymptotic properties, the Gasser-Müller kernel estimator
exhibits asymptotic efficiency. As the sample size n approaches infinity, the estimator achieves
the optimal convergence rate of O(n−1/5 ) for the optimal bandwidth hopt . This convergence rate is
the same as that for other kernel density estimators, indicating that the Gasser-Müller estimator
is asymptotically efficient. In the limit, the Gasser-Müller estimator is asymptotically unbiased
and asymptotically efficient, meaning that as the sample size increases, the estimator approaches
the true density f (x) without bias and with minimal variance. The estimator becomes more accu-
rate as the sample size grows, and the optimal choice of bandwidth ensures that the bias-variance
trade-off is well balanced.
In summary, the Gasser-Müller kernel estimator offers several distinct advantages over other
nonparametric density estimators. Its primary strength lies in its ability to reduce boundary bias
by placing kernels at midpoints between adjacent data points. This leads to smoother and more
accurate density estimates, especially near the sample boundaries. The optimal choice of band-
width, which scales as n−1/5 , balances the bias and variance of the estimator, minimizing the mean
squared error. The Gasser-Müller estimator is particularly useful in applications involving density
estimation and derivative estimation, where boundary effects and accuracy are crucial. It is a highly
effective tool for nonparametric statistical analysis and provides accurate, unbiased estimates even
in challenging settings.
and spatial statistics. Slaoui (2018) [641] introduced a bias reduction technique for KDE, provid-
ing theoretical results and practical improvements over the standard Parzen-Rosenblatt estimator.
The modifications significantly enhance density estimation in small-sample scenarios. Michalski
(2016) [642] used KDE in hydrology to estimate groundwater level distributions. It demonstrates
how KDE outperforms parametric methods in environmental science applications. Gramacki and
Gramacki (2018) [643] covered KDE fundamentals, implementation details, and computational op-
timizations. It is an excellent resource for both theoretical insights and practical applications.
Desobry et al. (2007) [644] extended KDE to unordered sets, exploring its use in kernel-based sig-
nal processing. It bridges the gap between statistical estimation and machine learning applications.
where K(·) is the kernel function, and h > 0 is the bandwidth parameter. The kernel
function K(x) serves as a local weighting function that smooths the empirical distribution, while
the bandwidth parameter h determines the scale over which the data points contribute to the
density estimate. The fundamental goal of KDE is to ensure that fˆh (x) provides an asymptotically
consistent, unbiased, and efficient estimator of f (x), all of which require rigorous mathematical
conditions to be satisfied. There are some important Properties of the Kernel Function. To ensure
the validity of fˆh (x) as a probability density function estimator, the kernel function K(x) must
satisfy the following conditions:
1. Normalization Condition:
$$\int_{-\infty}^{\infty} K(x)\, dx = 1. \qquad (23.67)$$
This ensures that the kernel behaves like a proper probability density function and does not
introduce artificial bias into the estimation.
2. Symmetry Condition:
K(x) = K(−x), ∀x ∈ R (23.68)
Symmetry guarantees that the kernel function does not introduce directional bias in the
estimation of f (x).
3. Non-negativity:
K(x) ≥ 0, ∀x ∈ R (23.69)
While not strictly necessary, this property ensures that fˆh (x) remains a valid probability
density estimate in a practical sense.
4. Finite Second Moment:
$$\int_{-\infty}^{\infty} x^2 K(x)\, dx < \infty.$$
This ensures that the kernel function does not assign an excessive amount of probability mass far from the origin, preserving local smoothness properties.
5. Zero First Moment:
$$\int_{-\infty}^{\infty} x K(x)\, dx = 0.$$
This ensures that the kernel function does not introduce artificial shifts in the density estimate.
Let’s discuss the choice of Kernel Function and Examples. Several kernel functions satisfy the
above mathematical constraints and are commonly used in KDE:
• Gaussian Kernel:
$$K(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}. \qquad (23.72)$$
This kernel has the advantage of being infinitely differentiable and providing smooth density
estimates.
• Epanechnikov Kernel:
$$K(x) = \frac{3}{4}(1 - x^2)\,\mathbb{1}_{|x| \leq 1}. \qquad (23.73)$$
This kernel is optimal in the mean integrated squared error (MISE) sense, meaning
that it minimizes the variance of fˆh (x) while preserving local smoothness properties.
• Uniform Kernel:
$$K(x) = \frac{1}{2}\,\mathbb{1}_{|x| \leq 1}. \qquad (23.74)$$
This kernel is simple but suffers from discontinuities, making it less desirable for smooth
density estimation.
Regarding the Asymptotic Properties of the KDE, The bias of the KDE can be rigorously derived
using a second-order Taylor expansion of f (x) around a given evaluation point. Specifically, if f (x)
is twice continuously differentiable, we obtain
$$\mathbb{E}[\hat{f}_h(x)] - f(x) = \frac{h^2}{2} f''(x)\mu_2(K) + O(h^4), \qquad (23.75)$$
where $\mu_2(K) = \int x^2 K(x)\, dx$ is the second moment of the kernel. The leading term in this expansion
shows that the bias is proportional to h2 , implying that a smaller h reduces bias, though at the
expense of increasing variance. The variance of the KDE is given by
$$\text{Var}[\hat{f}_h(x)] = \frac{1}{nh}\, f(x) R(K) + O\!\left(\frac{1}{n}\right), \qquad (23.76)$$
where $R(K) = \int K^2(x)\, dx$ measures the roughness of the kernel function. The key observation
here is that variance scales as O(1/(nh)), implying that a larger h reduces variance but increases
bias. To minimize the mean integrated squared error (MISE), one must choose an optimal
bandwidth hopt that balances bias and variance. The optimal bandwidth is given by
$$h_{\text{opt}} = \left(\frac{4\hat{\sigma}^5}{3n}\right)^{1/5}, \qquad (23.77)$$
where σ̂ is the sample standard deviation. This scaling rule, known as Silverman’s rule of
thumb, follows from an asymptotic minimization of
$$\mathbb{E}\!\left[\int_{-\infty}^{\infty} \left(\hat{f}_h(x) - f(x)\right)^2 dx\right]. \qquad (23.78)$$
In conclusion, the Parzen-Rosenblatt method provides a highly flexible, consistent, and asymp-
totically optimal approach to density estimation. The choice of kernel function and bandwidth
selection is critical, as they directly impact the bias-variance tradeoff. Future refinements, such
as adaptive bandwidth selection and higher-order kernel corrections, further enhance its
performance.
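A compact sketch of the Parzen–Rosenblatt estimator with the Gaussian kernel (23.72) and Silverman's rule of thumb (23.77) follows; the data are synthetic and serve only to illustrate the formulas:

```python
import numpy as np

def parzen_rosenblatt(x_grid, sample):
    """Fixed-bandwidth KDE with a Gaussian kernel and Silverman's rule of thumb."""
    n = len(sample)
    h = (4 * np.std(sample, ddof=1)**5 / (3 * n)) ** 0.2  # h_opt = (4 sigma^5 / 3n)^(1/5)
    u = (x_grid[:, None] - sample[None, :]) / h
    K = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
    return K.sum(axis=1) / (n * h)

rng = np.random.default_rng(7)
sample = rng.normal(loc=1.0, scale=2.0, size=1000)
grid = np.linspace(-6, 8, 200)
f_hat = parzen_rosenblatt(grid, sample)  # smooth estimate of the N(1, 4) density
```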
24 Natural Language Processing (NLP)
Literature Review: Jurafsky and Martin 2023 [226] wrote a book that is a cornerstone of NLP the-
ory, covering fundamental concepts like syntax, semantics, and discourse analysis, alongside deep
learning approaches to NLP. The book integrates linguistic theory with probabilistic and neural
methodologies, making it an essential resource for students and researchers alike. It thoroughly
explains sequence labeling, parsing, transformers, and BERT models. Manning and Schütze 1999
[227] wrote a foundational text in NLP, particularly for probabilistic models. It covers hidden
Markov models (HMMs), n-gram language models, and expectation-maximization (EM), concepts
that still underpin modern transformer-based NLP models. It also introduces latent semantic
analysis (LSA), a precursor to modern word embeddings. Liu and Zhang (2018) [228] presented
a detailed exploration of deep learning-based NLP, including word embeddings, recurrent neural
networks (RNNs), LSTMs, GRUs, and transformers. It introduces the mathematical foundations
of neural networks, making it a bridge between classical NLP and deep learning. Allen (1994)
[229] wrote a seminal book in NLP, focusing on symbolic and rule-based approaches. It provides
detailed coverage of semantic parsing, discourse modeling, and knowledge representation. While it
predates deep learning, it forms a strong theoretical foundation for logical and linguistic approaches
to NLP. Koehn (2009) [232] wrote a definitive work on statistical NLP, particularly machine
translation techniques like phrase-based translation, alignment models, and decoder algorithms. It
remains relevant even as neural translation models (e.g., Transformer-based systems) dominate.
We now mention some of the recent works in Natural Language Processing (NLP). Hempelmann
[231] explored how linguistic theories of humor can be incorporated into Large Language Models
(LLMs). It discusses the integration of formal humor theories into neural models and whether
LLMs can be used to test linguistic hypotheses. Eisenstein (2020) [233] wrote a modern NLP text-
book that bridges theory and practice. It covers both probabilistic and deep learning approaches,
including dependency parsing, sequence-to-sequence models, and attention mechanisms. Unlike
many texts, it also discusses ethics and bias in NLP models. Otter et al. (2018) [234] provides
a comprehensive review of neural architectures in NLP, covering CNNs, RNNs, attention mecha-
nisms, and reinforcement learning for NLP. It discusses both theoretical implications and empirical
advancements, making it an essential reference for deep learning in language tasks. The Oxford
Handbook of Computational Linguistics (2022) [235] provides a comprehensive collection of essays
covering the entire field of NLP and computational linguistics, including morphology, syntax, se-
mantics, discourse processing, and deep learning applications. It presents theoretical debates and
practical applications across different NLP domains. Li et al. (2025) [230] introduced an advanced
multi-head attention mechanism that combines explorative factor analysis with NLP models. It
enhances our understanding of how transformers encode syntactic and semantic relationships.
Çekik (2025) [237] introduced a rough set-based approach for text classification, highlighting how
term weighting strategies impact classification accuracy. It explores feature reduction and entropy-
based selection methods to enhance text classifiers. Zhu et al. (2025) [238] presented a novel
entropy-based prefix tuning method for hierarchical text classification. It demonstrates how en-
tropy regularization can enhance transformer-based classifiers like BERT and GPT for multi-label
and hierarchical categorization. Matrane et al. (2024) [239] investigated dialectal text classifi-
cation challenges in Arabic NLP. It proposes preprocessing optimizations for low-resource dialects
and demonstrates how transfer learning improves classification accuracy. Moqbel and Jain (2025)
[240] applies text classification to detect deception in online product reviews. It integrates cognitive
appraisal theory and NLP-based text mining to distinguish fake vs. genuine reviews. Kumar et al.
(2025) [241] focused on medical text classification, demonstrating how NLP techniques can be ap-
plied to diagnose diseases using electronic health records (EHRs) and patient symptoms extracted
from text data. Yin (2024) [242] provided a deep dive into aspect-based sentiment analysis (ABSA),
discussing challenges in fine-grained text classification. It introduces new BERT-based techniques
to improve aspect-level sentiment classification accuracy. Raghavan (2024) [243] examines personal-
ity classification using text data. It evaluates the performance of NLP-based personality prediction
models and compares lexicon-based, deep learning, and transformer-based approaches. Semeraro
et al. (2025) [244] introduced EmoAtlas, a tool that merges psychological lexicons, artificial in-
telligence, and network science to perform emotion classification in textual data. It compares its
accuracy with BERT and ChatGPT. Cai and Liu (2024) [245] provides a practical approach to text
classification in discourse analysis. It explores Python-based techniques for analyzing therapy talk
and sentiment classification in conversational texts.
Text classification is a fundamental problem in machine learning and natural language process-
ing (NLP), where the goal is to assign predefined categories to a given text based on its content.
This process involves several steps, including text preprocessing, feature extraction, model train-
ing, and evaluation. In this section, we will explore these steps with a focus on the underlying
mathematical principles and models used in text classification. The first step in text classification
is preprocessing the raw text data. This typically involves the following operations:
• Tokenization: Breaking the text into words or tokens.
• Stopword Removal: Removing common words (such as "and", "the", etc.) that do not carry significant meaning.
• Stemming and Lemmatization: Reducing words to their base or root form, e.g., "running" becomes "run".
• Lowercasing: Converting all words to lowercase to ensure consistency.
• Punctuation Removal: Removing punctuation marks.
These operations result in a cleaned and standardized text, ready for feature extraction. Once the
text is preprocessed, the next step is to convert the text into numerical representations that can
be fed into machine learning models. The most common methods for feature extraction include:
1. Bag-of-Words (BoW) model
2. Term Frequency-Inverse Document Frequency (TF-IDF)
In the first method (Bag-of-Words (BoW) model), each document is represented as a vector
where each dimension corresponds to a unique word in the corpus. The value of each dimension is
the frequency of the word in the document. If we have a corpus of N documents and a vocabulary
of M words, the document i can be represented as a vector xi ∈ RM , where:
xi = [f (w1 , di ), f (w2 , di ), . . . , f (wM , di )] (24.1)
where f (wj , di ) is the frequency of the word wj in the document di . The BoW model captures only
the frequency of terms within the document and disregards their order. While simple and com-
putationally efficient, this model does not capture the syntactic or semantic relationships between
words in the document.
A more sophisticated and improved representation can be obtained through Term Frequency-
Inverse Document Frequency (TF-IDF), which scales the raw frequency of words by their
relative importance in the corpus. TF-IDF is a more advanced technique that aims to weight
words based on their importance. It considers both the frequency of a word in a document and the
rarity of the word across all documents. The term frequency (TF) of a word w in document d is
defined as:
$\mathrm{TF}(w, d) = \frac{\mathrm{count}(w, d)}{\text{total number of words in } d}$ (24.2)
The inverse document frequency (IDF) is given by:
$\mathrm{IDF}(w) = \log \frac{N}{\mathrm{DF}(w)}$ (24.3)
where N is the total number of documents and DF(w) is the number of documents containing the
word w. The TF-IDF score is the product of these two: $\text{TF-IDF}(w, d) = \mathrm{TF}(w, d) \cdot \mathrm{IDF}(w)$ (24.4). Words that are frequent in a document but rare across the corpus therefore receive the largest weights.
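These feature extraction steps can be reproduced in a few lines. The following is a minimal sketch, assuming scikit-learn is available; the toy corpus is purely illustrative, and scikit-learn applies a smoothed variant of the IDF in Eq. (24.3).

```python
# A minimal sketch of BoW and TF-IDF feature extraction on a hypothetical toy corpus;
# assumes scikit-learn is installed. Shapes follow Eqs. (24.1)-(24.4).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

bow = CountVectorizer()                # Bag-of-Words: raw term counts f(w_j, d_i)
X_bow = bow.fit_transform(corpus)      # N x M sparse matrix, one row per document
tfidf = TfidfVectorizer()              # counts reweighted by (smoothed) inverse document frequency
X_tfidf = tfidf.fit_transform(corpus)

print(bow.get_feature_names_out())     # vocabulary w_1, ..., w_M
print(X_bow.toarray())                 # each row is the vector x_i of Eq. (24.1)
print(X_tfidf.toarray().round(2))
```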
There are several machine learning models that can be used for text classification, ranging from
simpler models to more complex ones. A common approach to text classification is to use a linear
model such as logistic regression or linear support vector machines (SVM). Given a feature vector
xi for document i, the prediction of the class label yi can be made as $\hat{y}_i = \sigma(\mathbf{w}^T x_i + b)$ (24.5), where σ is the sigmoid function for binary classification, and w and b are the weight vector and bias
term, respectively. The model parameters w and b are learned by minimizing a loss function, such
as the binary cross-entropy loss. More complex models, such as Neural Networks (NN), involve
deeper mathematical formulations. In a typical feedforward neural network, the goal is to learn
a set of parameters that map an input vector xi to an output label yi . The network consists of
multiple layers of interconnected neurons, each of which applies a non-linear transformation to the
input. Given an input vector xi , the output of the network is computed as:
$h_i^{(l)} = \sigma(W^{(l)} h_i^{(l-1)} + b^{(l)})$ (24.6)
where $h_i^{(l)}$ is the activation of layer l, σ is the activation function (e.g., ReLU, sigmoid, or tanh),
W(l) is the weight matrix, and b(l) is the bias term for layer l. The input to the network is passed
through several hidden layers before producing the final classification output. The output layer
typically applies a softmax function to obtain a probability distribution over the possible classes:
$P(y_c \mid x_i) = \frac{\exp(W_c^T h_i + b_c)}{\sum_{c'} \exp(W_{c'}^T h_i + b_{c'})}$ (24.7)
where Wc and bc are the weights and bias for class c, and hi is the output of the last hidden layer.
The network is trained by minimizing a cross-entropy loss function:
$L(W, b) = -\sum_{c=1}^{C} y_{i,c} \log P(y_c \mid x_i)$ (24.8)
where yi,c is the one-hot encoded label for class c, and the goal is to minimize the difference be-
tween the predicted probability distribution and the true class distribution. Throughout the entire
process, optimization plays a crucial role in fine-tuning model parameters to minimize classification
errors. Common optimization techniques include stochastic gradient descent (SGD) and its variants,
such as Adam and RMSProp, which update model parameters iteratively based on the gradient of
the loss function with respect to the parameters. Given the loss function L(θ) parameterized by θ,
the gradient of the loss with respect to a parameter θi is computed as:
$\frac{\partial L(\theta)}{\partial \theta_i}$ (24.9)
$\theta_i \leftarrow \theta_i - \eta \frac{\partial L(\theta)}{\partial \theta_i}$ (24.10)
where η is the learning rate. For each iteration, this update rule adjusts the model parameters in
the direction of the negative gradient, ultimately converging to a set of parameters that minimizes
the classification error.
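To make Eqs. (24.6)-(24.10) concrete, the following is a minimal NumPy sketch of a softmax classifier trained by gradient descent; the single linear layer and the random data are illustrative simplifications, not a full deep network.

```python
# A minimal NumPy sketch of a softmax text classifier (Eqs. (24.7)-(24.10));
# the data here is random and purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
N, M, C = 100, 50, 3                     # documents, vocabulary size, classes
X = rng.random((N, M))                   # feature vectors x_i (e.g., TF-IDF rows)
y = rng.integers(0, C, size=N)           # class labels
Y = np.eye(C)[y]                         # one-hot encoding y_{i,c}

W = np.zeros((M, C)); b = np.zeros(C)
eta = 0.5                                # learning rate
for _ in range(200):
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)                       # numerical stability
    P = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)    # softmax, Eq. (24.7)
    loss = -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))            # cross-entropy, Eq. (24.8)
    G = (P - Y) / N                      # gradient of the mean loss w.r.t. the logits
    W -= eta * X.T @ G                   # parameter update, Eq. (24.10)
    b -= eta * G.sum(axis=0)
print("final loss:", round(loss, 4))
```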
In summary, text classification is an advanced and multifaceted problem that requires a deep
understanding of various mathematical principles, including linear algebra, probability theory, opti-
mization, and functional analysis. The entire process, from text preprocessing to feature extraction,
model training, and evaluation, involves the application of rigorous mathematical techniques that
enable the effective classification of text into meaningful categories. Each of these steps, whether
simple or complex, plays an integral role in transforming raw text data into actionable insights
using mathematically sophisticated models and algorithms.
24.2 Machine Translation
Recent evaluations have benchmarked general-purpose LLMs against eTranslation (the EU Commission's MT model), showing how such LLMs can rival dedicated NMT systems but struggle with domain-specific translations. Yang (2025) [254] introduced
error-detection models for NMT output, using transformer-based classifiers to detect syntactic and
semantic errors in machine-generated translations.
Machine Translation (MT) in Natural Language Processing (NLP) is a highly intricate compu-
tational task that requires converting text from one language (source language) to another (target
language) by using statistical, rule-based, and deep learning models, often underpinned by proba-
bilistic and neural network-based frameworks. The goal is to determine the most probable target
sequence T = {t1 , t2 , . . . , tN } from the given source sequence S = {s1 , s2 , . . . , sT }, by modeling the
conditional probability P(T | S). The optimal translation is typically defined by $T^* = \arg\max_{T} P(T \mid S)$ (24.11). This involves estimating the probability of T given S, with the assumption that the translation
can be described probabilistically. In the most fundamental form of statistical machine translation
(SMT), this probability is often modeled through a series of translation models that decompose the
translation process into manageable components. The conditional probability P (T | S) in SMT
can be factorized using Bayes’ theorem:
$P(T \mid S) = \frac{P(S, T)}{P(S)} = \frac{P(S \mid T)\, P(T)}{P(S)}$ (24.12)
Given this decomposition, the core of early SMT models, such as IBM models, sought to model the
joint probability P (S, T ) over source and target language pairs. Specifically, in word-based models
like IBM Model 1, the task reduces to estimating the probability of translating each word in the
source language S to its corresponding word in the target language T . The joint probability can
be written as:
$P(S, T) = \prod_{i=1}^{T} \prod_{j=1}^{N} t(s_i \mid t_j)$ (24.13)
where t(si | tj ) is the probability of translating word si in the source sentence to word tj in the
target sentence. The estimation of these probabilities, t(si | tj ), is typically achieved by analyzing
parallel corpora through various techniques such as Expectation-Maximization (EM), which allows
the unsupervised learning of these translation probabilities from large amounts of bilingual text
data. The EM algorithm iterates between computing the expected alignments of words in the source
and target languages and refining the model parameters accordingly. The word-based translation
models, however, do not take into account the structure of the language, which often leads to
suboptimal translations, especially in languages with significantly different syntactic structures.
The challenges stem from the word order differences and idiomatic expressions that cannot be
captured through a simple word-to-word mapping. To overcome these limitations, IBM Model 2
introduced the concept of word alignments, where an additional hidden variable A is introduced,
representing a possible alignment between words in the source and target sentences. This can be
expressed as:
$P(S, T, A) = \prod_{i=1}^{T} \prod_{j=1}^{N} t(s_i \mid t_j)\, a(s_i \mid t_j)$ (24.14)
where a(si | tj ) denotes the alignment probability between word si in the source language and word
tj in the target language. By optimizing these alignment probabilities, SMT systems improve the
quality of translations by better modeling the relationship between the source and target sentences.
Estimating a(si | tj ), however, requires computationally expensive algorithms, which can be han-
dled by methods like EM for iterative refinement.
A more sophisticated approach was introduced with sequence-to-sequence (Seq2Seq) models, which
significantly improved the translation process by leveraging deep learning techniques. The core of
Seq2Seq is the encoder-decoder framework, where an encoder processes the entire source sentence
and encodes it into a context vector, and a decoder generates the target sequence. In this approach,
the translation probability is formulated as:
$P(T \mid S) = P(t_1 \mid S) \prod_{i=2}^{N} P(t_i \mid t_{<i}, S)$ (24.15)
where t<i denotes the previously generated target words, capturing the sequential nature of trans-
lation. The key advantage of the Seq2Seq model is its ability to model entire sentences at once,
providing a richer, more flexible representation of both the source and target sequences compared to
word-based models. The encoder, typically implemented using Recurrent Neural Networks (RNNs)
or more advanced variants such as Long Short-Term Memory (LSTM) or Gated Recurrent Unit
(GRU) networks, encodes the source sequence S into hidden states. The hidden state at time step
t is computed recursively, based on the input xt (the source word representation at time step t)
and the previous hidden state ht−1 :
ht = f (ht−1 , xt ) (24.16)
where f represents the update function, which is often parameterized as a non-linear function, such
as a sigmoid or tanh. This recursion generates a sequence of hidden states {h1 , h2 , . . . , hT }, each
encoding the relevant information of the source sentence. In this model, the decoder generates the
target sequence one token at a time by conditioning on the previous tokens t<i and the context
vector c, which is typically the last hidden state from the encoder. The conditional probability of
generating the next target word is given by $P(t_i \mid t_{<i}, S) = \operatorname{softmax}(W h_t)$ (24.17), where W is a learned weight matrix, and $h_t$ is the hidden state of the decoder at time step t.
The softmax function converts the output of the network into a probability distribution over the
vocabulary, and the word with the highest probability is chosen as the next target word.
A significant improvement to Seq2Seq was introduced through the attention mechanism. This
allows the decoder to dynamically focus on different parts of the source sentence during transla-
tion, instead of relying on a single fixed-length context vector. The attention mechanism computes
a set of attention weights αt,i for each source word, which are used to compute a weighted sum of
the encoder’s hidden states to form a dynamic context vector ct . The attention weight αt,i for time
step t in the decoder and source word i is calculated as:
$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{T} \exp(e_{t,k})}$ (24.18)
where $e_{t,i} = \operatorname{score}(h_t, h_i)$ is a learned scoring function, which can be modeled, for instance, as the additive (Bahdanau-style) score $e_{t,i} = v^T \tanh(W_1 h_t + W_2 h_i)$ (24.19). A numerical sketch of these attention weights follows.
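The following is a minimal NumPy sketch of the attention weights in Eq. (24.18), substituting a simple dot-product score for the learned score function; all tensors are toy values, not trained parameters.

```python
# A minimal NumPy sketch of attention weights (Eq. (24.18)) with a dot-product
# score in place of the learned score(h_t, h_i); values are illustrative.
import numpy as np

rng = np.random.default_rng(1)
T_src, d = 6, 8                           # source length, hidden size
H_enc = rng.standard_normal((T_src, d))   # encoder hidden states h_1, ..., h_T
h_dec = rng.standard_normal(d)            # current decoder state h_t

e = H_enc @ h_dec                         # scores e_{t,i} (dot-product variant)
alpha = np.exp(e - e.max()) / np.exp(e - e.max()).sum()   # softmax, Eq. (24.18)
c_t = alpha @ H_enc                       # dynamic context vector c_t
print(alpha.round(3), alpha.sum())        # weights are nonnegative and sum to 1
```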
This attention mechanism allows the model to adaptively focus on relevant parts of the source
sentence while generating each word in the target sentence, thus overcoming the limitations of fixed-
length context vectors in long sentences. Training a machine translation model typically involves
optimizing a loss function that quantifies the difference between the predicted target sequence and
the true target sequence. The most common loss function is the negative log-likelihood:
$L(\theta) = -\sum_{i=1}^{N} \log P(t_i \mid t_{<i}, S; \theta)$ (24.20)
where θ represents the parameters of the model. The parameters of the neural network are up-
dated using gradient-based optimization techniques, such as stochastic gradient descent (SGD) or
Adam, with the gradient of the loss function with respect to each parameter being computed via
backpropagation. In backpropagation, the gradient is computed by recursively applying the chain
rule through the layers of the network. For a parameter θ, the gradient is given by:
$\frac{\partial L(\theta)}{\partial \theta} = \frac{\partial L(\theta)}{\partial y} \frac{\partial y}{\partial \theta}$ (24.21)
Translation quality is commonly summarized by the BLEU score, $\mathrm{BLEU} = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n(T, R) \right)$ (24.22), where $p_n(T, R)$ is the precision of n-grams between the target translation T and reference R, $w_n$ is the weight assigned to each n-gram length, and BP is a brevity penalty. Despite advancements, machine translation still
faces challenges, such as handling rare or out-of-vocabulary words, idiomatic expressions, and the
alignment of complex syntactic structures across languages. Approaches such as transfer learning,
unsupervised learning, and domain adaptation are being explored to address these issues and
improve the robustness and accuracy of MT systems.
24.3 Chatbots and Conversational AI
Chatbots and Conversational AI have evolved as some of the most sophisticated applications of
Natural Language Processing (NLP), a subfield of artificial intelligence that strives to enable ma-
chines to understand, generate, and interact in human language. At the core of conversational
AI is the ability to generate meaningful, contextually appropriate responses in a coherent and flu-
ent manner. This challenge is deeply rooted in both the complexities of natural language itself
and the mathematical models that attempt to approximate human understanding. This intricate
task involves processing language at different levels: syntactic (structure), semantic (meaning),
and pragmatic (context). These systems employ probabilistic and algebraic techniques to handle
language complexities and employ statistical models, deep neural networks, and optimization algo-
rithms to generate, understand, and respond to language.
At the core lies the language-modeling factorization $P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \ldots, w_{i-1})$ (24.23), where $P(w_i \mid w_1, w_2, \ldots, w_{i-1})$ models the conditional probability of the word $w_i$ given all the preceding words. This is a central concept in language generation tasks. In traditional n-gram models,
this conditional probability is estimated by considering only a fixed number of previous words. The
bigram model, for instance, assumes that the probability of a word depends only on the previous
word, leading to:
P (wi |w1 , w2 , . . . , wi−1 ) ≈ P (wi |wi−1 ) (24.24)
However, more advanced conversational AI systems, such as those based on recurrent neural net-
works (RNNs), attempt to model dependencies over much longer sequences. RNNs, in particular,
process the input sequence w1 , w2 , . . . , wn recursively by maintaining a hidden state ht that captures
the context up to time t. The hidden state is computed by $h_t = \sigma(W_h h_{t-1} + W_x x_t + b)$ (24.25),
where σ is a non-linear activation function (e.g., tanh or sigmoid ), Wh , Wx are weight matrices,
and b is a bias term. While RNNs provide a mechanism to capture sequential dependencies, they
suffer from the vanishing gradient problem, particularly for long sequences. To address this issue,
Long Short-Term Memory (LSTM) units and Gated Recurrent Units (GRUs) were introduced, with
special gating mechanisms that help mitigate the loss of information over long time horizons. These
networks introduce memory cells and gates, which regulate the flow of information in the network.
For instance, the LSTM memory cell is governed by the following (standard) gate equations:
$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), \quad i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$ (24.26)
$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c [h_{t-1}, x_t] + b_c), \quad h_t = o_t \odot \tanh(c_t)$ (24.27)
where $f_t$, $i_t$, and $o_t$ are the forget, input, and output gates and $c_t$ is the memory cell state. More recently, transformer architectures have replaced recurrence altogether with self-attention, computed as
$\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left( \frac{Q K^T}{\sqrt{d_k}} \right) V$ (24.28)
where dk is the dimension of the key vectors. This operation allows the model to attend to all parts
of the input sequence simultaneously, enabling better handling of long-range dependencies and
improving computational efficiency by processing sequences in parallel. Unlike RNNs, transformers
do not process tokens in a fixed order but instead utilize positional encoding to inject sequence
order information. The positional encoding for position i and dimension 2k is given by:
$PE(i, 2k) = \sin\!\left( \frac{i}{10000^{2k/d}} \right), \qquad PE(i, 2k+1) = \cos\!\left( \frac{i}{10000^{2k/d}} \right)$ (24.29)
where d is the embedding dimension and k is the index for the dimension of the positional encoding.
This approach allows transformers to handle longer sequences more efficiently than RNNs and
LSTMs, and is the basis for models like BERT, GPT, and other state-of-the-art conversational
models. A short numerical sketch of this positional encoding is given below.
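The following is a minimal NumPy sketch of Eq. (24.29); the sequence length and embedding dimension are illustrative choices.

```python
# A minimal NumPy sketch of sinusoidal positional encoding (Eq. (24.29));
# max_len and d are illustrative values.
import numpy as np

def positional_encoding(max_len: int, d: int) -> np.ndarray:
    pos = np.arange(max_len)[:, None]           # positions i
    k = np.arange(d // 2)[None, :]              # dimension index k
    angles = pos / (10000.0 ** (2 * k / d))
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)                # PE(i, 2k)
    pe[:, 1::2] = np.cos(angles)                # PE(i, 2k+1)
    return pe

print(positional_encoding(4, 8).round(3))       # one encoding row per position
```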
Semantic understanding in conversational AI involves translating sentences into formal representations that can be manipulated by the system. A well-known approach for capturing
meaning is compositional semantics, which treats the meaning of a sentence as a function of the
meanings of its parts. For this, lambda calculus is often employed to represent the meaning of
sentences as functions that operate on their arguments. For example, the sentence "John saw the car" can be represented as a lambda expression: $(\lambda x.\, \mathrm{see}(x, \mathrm{car}))(\mathrm{John})$ (24.30),
where see(x, y) is a predicate representing the action of seeing, and λx quantifies over the sub-
ject of the action. This allows for the compositional building of complex meanings from simpler
components. Dialogue management is another critical aspect of conversational AI systems. This
is the process of maintaining coherence and context over the course of a conversation. It involves
understanding the user’s input in light of prior dialogue history and generating a response that is
contextually relevant. To model the dialogue state, Markov Decision Processes (MDPs) are com-
monly employed. In this context, the dialogue state is represented as a set of possible states, with
actions being transitions between these states. The goal is to select actions (responses) that maxi-
mize cumulative rewards, which, in this case, corresponds to maintaining a coherent and engaging
conversation. The value function V(s) at state s can be computed using the Bellman equation:
$V(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \right]$ (24.31)
where R(s, a) is the immediate reward for taking action a from state s, γ is the discount factor, and
P (s′ |s, a) represents the transition probability to the next state s′ given action a. By solving this
equation, the system can determine the optimal policy for responding to user inputs in a way that
maximizes long-term conversational quality; a minimal value-iteration sketch under toy dynamics is given below.
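The following is a minimal value-iteration sketch for Eq. (24.31); the states, rewards, and transition probabilities form an invented toy MDP, not a real dialogue system.

```python
# A minimal value-iteration sketch for the Bellman equation (24.31) on a toy
# dialogue MDP; rewards R and transitions P are random, purely for illustration.
import numpy as np

S, A, gamma = 3, 2, 0.9
rng = np.random.default_rng(2)
R = rng.random((S, A))                    # R(s, a), hypothetical immediate rewards
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)         # P(s' | s, a), rows normalized to 1

V = np.zeros(S)
for _ in range(500):
    Q = R + gamma * P @ V                 # Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
    V_new = Q.max(axis=1)                 # Bellman optimality backup
    if np.max(np.abs(V_new - V)) < 1e-8:  # stop once the fixed point is reached
        break
    V = V_new
policy = Q.argmax(axis=1)                 # greedy policy from the final Q-values
print(V.round(3), policy)
```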
Once the dialogue state is updated, the next step in conversational AI is to generate a response. This is typically achieved using sequence-to-sequence
models, in which the input sequence (e.g., the user’s query) is processed by an encoder to produce a
fixed-size context vector, and a decoder generates the output sequence (e.g., the chatbot’s response).
The basic structure of these models can be expressed as:
yt = Decoder(yt−1 , ht ) (24.32)
where yt represents the token generated at time t, and ht is the hidden state passed from the
encoder. Attention mechanisms are incorporated into this framework to allow the decoder to
focus on different parts of the input sequence at each step, improving the quality of the generated
responses. The model is trained by minimizing the negative log-likelihood $L(\theta) = -\sum_{i=1}^{N} \log \hat{y}_i$ (24.33), where $\hat{y}_i$ is the predicted probability for the correct token $y_i$, and N is the length of the sequence.
The parameters θ are updated iteratively through gradient descent, adjusting the weights to mini-
mize the error.
In summary, chatbots and conversational AI systems are grounded in a rich mathematical frame-
work involving statistics, linear algebra, optimization, and neural networks. Each step, from lan-
guage modeling to dialogue management, relies on carefully constructed mathematical foundations
that drive the ability of machines to interact intelligently and meaningfully with humans. Through
advancements in deep learning and optimization techniques, conversational AI continues to push
the boundaries of what machines can understand and generate in natural language, leading to more
sophisticated, human-like interactions.
25 Deep Learning Frameworks
25.1 TensorFlow
Literature Review: Takhsha et al. (2025) [284] introduced a TensorFlow-based framework for
medical deep learning applications. The authors propose a novel deep learning diagnostic system
that integrates Choquet integral theory with TensorFlow-based models, improving the explain-
ability of deep learning decisions in medical imaging. Singh and Raman (2025) [285] extended
TensorFlow to Graph Neural Networks (GNNs), discussing how TensorFlow’s computational graph
structure aligns with graph theory. It provides a rigorous mathematical foundation for applying
deep learning to non-Euclidean data structures. Yao et al. (2024) [286] critically analyzed TensorFlow’s vulnerabilities to adversarial attacks and introduced a robust deep learning ensemble framework. The authors explored autoencoder-based anomaly detection using TensorFlow to enhance
cybersecurity defenses. Chen et al. (2024) [287] provided an extensive comparison of TensorFlow pretrained models for various big data applications, discussing techniques like transfer learning, fine-tuning, and self-supervised learning, and emphasizing how TensorFlow automates hyperparameter tuning. Dumić (2024) [288] wrote a rigorous educational resource, guiding learners through
neural network construction using TensorFlow. It bridges the gap between deep learning theory
and TensorFlow’s practical implementation, emphasizing gradient descent, backpropagation, and
weight initialization. Bajaj et al. (2024) [289] implemented CNNs for handwritten digit recognition using TensorFlow and provided a rigorous mathematical breakdown of convolution operations, activation functions, and optimization techniques, highlighting TensorFlow’s computational efficiency in large-scale character recognition tasks. Abbass and Fyath (2024) [290] introduced a
TensorFlow-based framework for optical fiber communication modeling. It explores how deep
learning can optimize fiber optic transmission efficiency by using TensorFlow for predictive analyt-
ics and channel equalization. Prabha et al. (2024) [291] rigorously analyzed TensorFlow’s role in
precision agriculture, focusing on time-series analysis, computer vision, and reinforcement learning
for crop monitoring. It delves into TensorFlow’s API optimizations for handling sensor data and
remote sensing images. Abdelmadjid and Abdeldjallil (2024) [292] examined TensorFlow Lite for
edge computing, rigorously testing optimized CNN architectures on low-power devices. It provides
a theoretical comparison of computational efficiency, energy consumption, and model accuracy
in resource-constrained environments. Mlambo (2024) [293] bridged Bayesian inference and deep
learning, providing a rigorous derivation of Bayesian Neural Networks (BNNs) implemented in Ten-
sorFlow. It explores how TensorFlow integrates probabilistic models with deep learning frameworks.
TensorFlow operates primarily on tensors, which are multi-dimensional arrays generalizing scalars,
vectors, and matrices. For instance, a scalar is a rank-0 tensor, a vector is a rank-1 tensor, a matrix
is a rank-2 tensor, and tensors of higher ranks represent multi-dimensional arrays. These tensors
can be written mathematically as:
T ∈ Rd1 ×d2 ×···×dn (25.1)
where d1 , d2 , . . . , dn represent the dimensions of the tensor. TensorFlow leverages efficient tensor
operations that allow the manipulation of large-scale data in a computationally optimized manner.
These operations are the foundation of all the transformations and calculations within TensorFlow
models. For example, the dot product of two vectors ⃗a and ⃗b is a scalar:
$\vec{a} \cdot \vec{b} = \sum_{i=1}^{n} a_i b_i$ (25.2)
Similarly, for matrices, operations like matrix multiplication A · B are highly optimized, taking
advantage of batch processing and parallelism on devices such as GPUs and TPUs. TensorFlow’s
underlying libraries, such as Eigen, employ these parallel strategies to optimize memory usage and
reduce computation time. The heart of TensorFlow’s efficiency lies in its computation graph, which
represents the relationships between different operations. The computation graph is a directed
acyclic graph (DAG) where nodes represent computational operations, and the edges represent the
flow of data (tensors). Each operation in the graph is a function, f , that maps a set of inputs to
an output tensor:
y = f (x1 , x2 , . . . , xn ) (25.3)
The graph is built by users or automatically by TensorFlow, where the nodes represent operations
such as addition, multiplication, or more complex transformations. Once the computation graph is
defined, TensorFlow optimizes the graph by reordering computations, applying algebraic transfor-
mations, or parallelizing independent subgraphs. The graph is executed either in a dynamic manner
(eager execution) or after optimization (static graph execution), depending on the user’s preference.
Automatic differentiation is another key feature of TensorFlow, and it relies on the chain rule of
differentiation to compute gradients. The gradient of a scalar-valued function f (x1 , x2 , . . . , xn ) with
respect to an input tensor xi is computed as:
$\frac{\partial f}{\partial x_i} = \sum_{j=1}^{n} \frac{\partial f}{\partial y_j} \frac{\partial y_j}{\partial x_i}$ (25.4)
where yj represents intermediate variables computed during the forward pass of the network. In
the context of a neural network, this chain rule is used to propagate errors backward from the
output to the input layers during the backpropagation process, where the objective is to update
the network’s weights to minimize the loss function L. Consider a neural network with a simple
architecture, consisting of an input layer, one hidden layer, and an output layer. Let X represent
the input tensor, W1 and b1 the weights and biases of the hidden layer, and W2 and b2 the weights
and biases of the output layer. The forward pass can be written as:
h = σ(W1 X + b1 ) (25.5)
ŷ = W2 h + b2 (25.6)
where σ is the activation function, such as the ReLU function σ(x) = max(0, x), and ŷ is the
predicted output. The objective in training a model is to minimize a loss function L(ŷ, y), where
y represents the true labels. The loss function can take different forms, such as the mean squared
error for regression tasks:
$L(\hat{y}, y) = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$ (25.7)
or the cross-entropy loss for classification tasks:
$L(\hat{y}, y) = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)$ (25.8)
where C is the number of classes, and ŷi is the predicted probability of class i under the softmax
function. The optimization of this loss function requires the computation of the gradients of L with
respect to the model parameters W1 , b1 , W2 , b2 . This is achieved through backpropagation, which
applies the chain rule iteratively through the layers of the network. To perform optimization,
TensorFlow employs algorithms like Gradient Descent (GD). The basic gradient descent update
rule for parameters θ is:
θt+1 = θt − η∇θ L(θ) (25.9)
where η is the learning rate, and ∇θ L(θ) represents the gradient of the loss function with respect
to the model parameters θ. Variants of gradient descent, such as Stochastic Gradient Descent
(SGD), update the parameters using a subset (mini-batch) of the training data rather than the
entire dataset:
$\theta_{t+1} = \theta_t - \eta \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta L(\theta, x_i, y_i)$ (25.10)
where m is the batch size, and (xi , yi ) are the data points in the mini-batch. More sophisticated
optimizers like Adam (Adaptive Moment Estimation) use both momentum (first moment) and
scaling (second moment) to adapt the learning rate for each parameter. The update rule for Adam
is:
$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta L(\theta)$ (25.11)
$v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla_\theta L(\theta))^2$ (25.12)
$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$ (25.13)
$\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$ (25.14)
where β1 and β2 are the exponential decay rates, and ϵ is a small constant to prevent division by
zero. The inclusion of both the first and second moments allows Adam to adaptively adjust the
learning rate, speeding up convergence. A minimal training-step sketch using these pieces follows.
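The following is a minimal TensorFlow sketch of Eqs. (25.5)-(25.14): a one-hidden-layer network whose gradients are obtained with tf.GradientTape and applied by the built-in Adam optimizer; the data and layer sizes are illustrative choices.

```python
# A minimal TensorFlow sketch of a one-hidden-layer network (Eqs. (25.5)-(25.7))
# trained with Adam (Eqs. (25.11)-(25.14)); data and sizes are illustrative.
import tensorflow as tf

X = tf.random.normal((64, 10))            # input batch
y = tf.random.normal((64, 1))             # regression targets

W1 = tf.Variable(tf.random.normal((10, 32)) * 0.1); b1 = tf.Variable(tf.zeros(32))
W2 = tf.Variable(tf.random.normal((32, 1)) * 0.1);  b2 = tf.Variable(tf.zeros(1))
opt = tf.keras.optimizers.Adam(learning_rate=1e-2)

for step in range(100):
    with tf.GradientTape() as tape:
        h = tf.nn.relu(X @ W1 + b1)       # hidden layer, Eq. (25.5)
        y_hat = h @ W2 + b2               # output layer, Eq. (25.6)
        loss = tf.reduce_mean((y - y_hat) ** 2)       # MSE, Eq. (25.7)
    grads = tape.gradient(loss, [W1, b1, W2, b2])     # reverse-mode autodiff, Eq. (25.4)
    opt.apply_gradients(zip(grads, [W1, b1, W2, b2])) # Adam update
print(float(loss))
```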
In addition to standard optimization methods, TensorFlow supports distributed computing, enabling model training across multiple devices, such as GPUs and
TPUs. In a distributed setting, the model’s parameters are split across different workers, each
handling a portion of the data. The gradients computed by each worker are averaged, and the
global parameters are updated:
$\theta_{t+1} = \theta_t - \eta \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta L_i(\theta)$ (25.15)
where Li (θ) is the loss computed on the i-th device, and N is the total number of devices. Tensor-
Flow’s efficient parallelism ensures that large-scale data processing tasks can be carried out with
high computational throughput, thus speeding up model training on large datasets.
TensorFlow also facilitates model deployment on different platforms. TensorFlow Lite enables
model inference on mobile devices by converting trained models into optimized, smaller formats.
This process involves quantization, which reduces the precision of the weights and activations,
thereby reducing memory consumption and computation time. The conversion process aims to
balance model accuracy and performance, ensuring that deep learning models can run efficiently
on resource-constrained devices like smartphones and IoT devices. For web applications, TensorFlow
offers TensorFlow.js, which allows users to run machine learning models directly in the browser,
leveraging the computational power of the client-side GPU or CPU. This is particularly useful for
real-time interactions where low-latency predictions are required without sending data to a server.
Moreover, TensorFlow provides an ecosystem that extends beyond basic machine learning tasks.
For instance, TensorFlow Extended (TFX) supports the deployment of machine learning models
in production environments, automating the steps from model training to deployment. Tensor-
Flow Probability supports probabilistic modeling and uncertainty estimation, which are critical in
domains such as reinforcement learning and Bayesian inference.
25.2 PyTorch
Literature Review: Galaxy Yanshi Team of Beihang University [294] examined the use of Py-
Torch as a deep learning framework for real-time astronaut facial recognition in space stations.
It explores the Bayesian coding theory within PyTorch models and its significance in optimizing
neural network architectures. It provides a theoretical exploration of probability distributions in
PyTorch models, demonstrating how deep learning can be used in constrained computational envi-
ronments. Tabel (2024) [295] extended PyTorch to Spiking Neural Networks (SNNs), a biologically
inspired neural network type. It details a new theoretical approach for learning spike timings us-
ing PyTorch’s computational graph. The paper bridges neuromorphic computing and PyTorch’s
automatic differentiation, expanding the theory behind temporal deep learning. Naderi et al.
(2024) [296] introduced a hybrid physics-based deep learning framework that integrates discrete
element modeling (DEM) with PyTorch-based networks. It demonstrates how physical simula-
tion problems can be formulated as deep learning models in PyTorch, providing new insights into
neural solvers for scientific computing. Polaka (2024) [297] evaluated reinforcement learning (RL)
theories within PyTorch, exploring the mathematical rigor of RL frameworks in safe AI applica-
tions. The author provided a strong theoretical foundation for understanding deep reinforcement
learning (DeepRL) in PyTorch, emphasizing how state-of-the-art RL theories are embedded in
the framework. Erdogan et al. (2024) [298] explored the theoretical framework for reducing
stochastic communication overheads in large-scale recommendation systems built using PyTorch.
It introduced an optimized gradient synchronization method that can enhance PyTorch-based deep
learning models for distributed computing. Liao et al. (2024) [299] extended the Iterative Partial
Diffusion Model (IPDM) framework, implemented in PyTorch, for medical image processing and
advanced the theory of deep generative models in PyTorch, specifically in diffusion-based learning
techniques. Sekhavat et al. (2024) [300] examined the theoretical intersection between deep learn-
ing in PyTorch and artificial intelligence creativity, referencing Nietzschean philosophical concepts.
The author also explored how PyTorch enables neural creativity and provides a rigorous theoretical
model for computational aesthetics. Cai et al. (2025) [301] developed a new theoretical framework
for explainability in neural networks using Shapley values, implemented in PyTorch and enhanced
the mathematical rigor of explainable AI (XAI) using PyTorch’s autograd system to analyze feature
importance. Na (2024) [302] proposed a novel ensemble learning theory using PyTorch, specifically
in weakly supervised learning (WSL). The paper extends Bayesian learning models in PyTorch for
handling sparse labeled data, addressing critical gaps in WSL. Khajah (2024) [303] combined item
response theory (IRT) and Bayesian knowledge tracing (BKT) using PyTorch to model generaliz-
able skill discovery. This study presents a rigorous statistical theory for adaptive learning systems
using PyTorch’s probabilistic programming capabilities.
The dynamic computation graph in PyTorch forms the core of its ability to perform efficient
and flexible machine learning tasks, especially deep learning models. To understand the underly-
ing mathematical and computational principles, we must explore how the graph operates, what it
represents, and how it changes during the execution of a machine learning program. Unlike the
static computation graphs employed in frameworks like TensorFlow (pre-Eager execution mode),
PyTorch constructs the computation graph dynamically, as the operations are performed in the
forward pass. This allows PyTorch to adapt to various input sizes, model structures, and control
flows that can change during execution. This adaptability is essential in enabling PyTorch to han-
dle models like recurrent neural networks (RNNs), which operate on sequences of varying lengths,
or models that incorporate conditionals in their computation steps.
The computation graph itself can be mathematically represented as a directed acyclic graph
(DAG), where the nodes represent operations and intermediate results, while the edges represent
the flow of data between these nodes. Each operation (e.g., addition, multiplication, or non-linear
activation) is applied to tensors, and the outputs of these operations are used as inputs for subse-
quent operations. The central feature of PyTorch’s dynamic computation graph is its construction
at runtime. For instance, when a tensor A is created, it might be involved in a series of operations
that eventually lead to the calculation of a loss function L. As each operation is executed, PyTorch
constructs an edge from the node representing the input tensor A to the node representing the
output tensor B. Mathematically, the transformation between these tensors can be described by:
B = f (A; θ) (25.16)
where f represents the transformation function (which could be a linear or nonlinear operation),
and θ represents the parameters involved in this transformation (e.g., weights or biases in the
case of neural networks). The construction of the dynamic graph allows PyTorch to deal with
variable-length sequences, which are common in tasks such as time-series prediction, nat-
ural language processing (NLP), and speech recognition. The length of the sequence can
change depending on the input data, and thus, the number of iterations or layers required in the
computation will also vary. In a recurrent neural network (RNN), for example, the hidden
state ht at each time step t is a function of the previous hidden state ht−1 and the input at the
current time step $x_t$. This can be described mathematically as $h_t = f(W_h h_{t-1} + W_x x_t + b)$ (25.17), where f is typically a non-linear activation function (e.g., a hyperbolic tangent or a sigmoid), and
Wh , Wx , b represent the weight matrices and bias vector, respectively. This equation encapsulates
the recursive nature of RNNs, where each output depends on the previous output and the current
input. In a static computation graph, the number of operations for each sequence would need to
be predefined, leading to inefficiency when sequences of different lengths are processed. However,
in PyTorch, the computation graph is created dynamically for each sequence, which allows for the
efficient handling of varying-length sequences and avoids redundant computation.
The key to PyTorch’s efficiency lies in automatic differentiation, which is managed by its au-
tograd system. When a tensor A has the property requires_grad=True, PyTorch starts tracking
all operations performed on it. Suppose that the tensor A is involved in a sequence of operations
to compute a scalar loss L. For example, if the loss is a function of Y, the output tensor, which is
computed through multiple layers, the objective is to find the gradient of L with respect to A. This
requires the computation of the Jacobian matrix, which represents the gradient of each component
of Y with respect to each component of A. Using the chain rule of differentiation, the gradient of
the loss with respect to A is given by:
$\frac{\partial L}{\partial \mathbf{A}} = \sum_i \frac{\partial L}{\partial \mathbf{Y}_i} \cdot \frac{\partial \mathbf{Y}_i}{\partial \mathbf{A}}$ (25.18)
This is an application of the multivariable chain rule, where $\frac{\partial L}{\partial \mathbf{Y}_i}$ represents the gradient of the loss with respect to the output tensor at the i-th component, and $\frac{\partial \mathbf{Y}_i}{\partial \mathbf{A}}$ is the Jacobian matrix for the
transformation from A to Y. This computation is achieved by backpropagating the gradients
through the computation graph that PyTorch builds dynamically. Every operation node in the
graph has an associated gradient, which is propagated backward through the graph as we move
from the loss back to the input parameters. For example, if Y = A · B, the gradient of the loss
with respect to A would be:
$\frac{\partial L}{\partial \mathbf{A}} = \frac{\partial L}{\partial \mathbf{Y}} \cdot \mathbf{B}^T$ (25.19)
Similarly, the gradient with respect to B would be:
$\frac{\partial L}{\partial \mathbf{B}} = \mathbf{A}^T \cdot \frac{\partial L}{\partial \mathbf{Y}}$ (25.20)
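The following is a minimal PyTorch sketch of Eqs. (25.18)-(25.20); the tensor shapes are arbitrary, and the sum loss is chosen so that ∂L/∂Y is a matrix of ones, which makes the two identities easy to verify.

```python
# A minimal PyTorch sketch of autograd through Y = A @ B (Eqs. (25.18)-(25.20));
# shapes are illustrative.
import torch

A = torch.randn(3, 4, requires_grad=True)
B = torch.randn(4, 2, requires_grad=True)

Y = A @ B                 # this node is added to the dynamic graph as it executes
L = Y.sum()               # scalar loss, so dL/dY is a matrix of ones
L.backward()              # reverse-mode pass through the recorded graph

ones = torch.ones_like(Y)
print(torch.allclose(A.grad, ones @ B.T))   # Eq. (25.19): dL/dA = (dL/dY) B^T
print(torch.allclose(B.grad, A.T @ ones))   # Eq. (25.20): dL/dB = A^T (dL/dY)
```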
This shows how the gradients are passed backward through the computation graph, utilizing the
stored operations at each node to calculate the required derivatives. The advantage of this dy-
namic construction of the graph is that it does not require the entire graph to be constructed
beforehand, as in the static graph approach. Instead, the graph is dynamically updated as op-
erations are executed, making it both more memory-efficient and computationally efficient.
An important feature of PyTorch’s dynamic graph is its ability to handle conditionals within the
computation. Consider a case where we have different branches in the computation based on a
conditional statement. In a static graph, such conditionals would require the entire graph to be
predetermined, including all possible branches. In contrast, PyTorch constructs the relevant part
of the graph depending on the input data, effectively enabling a branching computation. For
instance, suppose that we have a decision-making process in a neural network model, where the
output depends on whether an input tensor exceeds a threshold xi > t:
$y_i = \begin{cases} A \cdot x_i + b & \text{if } x_i > t \\ C \cdot x_i + d & \text{otherwise} \end{cases}$ (25.21)
In a static graph, we would have to design two separate branches and potentially deal with the
computational cost of unused branches. In PyTorch’s dynamic graph, only the relevant branch is
executed, and the graph is updated accordingly to reflect the necessary operations. The mem-
ory efficiency in PyTorch’s dynamic graph construction is particularly evident when handling
large models and training on large datasets. When building models like deep neural networks
(DNNs), the operations performed on each tensor during both the forward and backward passes
are recorded in the computation graph. This allows for efficient reuse of intermediate results, and
only the necessary memory is allocated for each tensor during the graph’s construction. This stands
in contrast to static computation graphs, where the full graph needs to be defined and memory
allocated up front, potentially leading to unnecessary memory consumption.
To summarize, the dynamic computation graph in PyTorch is a powerful tool that allows
for flexible model building and efficient computation. By constructing the graph incrementally
during the execution of the forward pass, PyTorch is able to dynamically adjust to the input size,
control flow, and variable-length sequences, leading to more efficient use of memory and computa-
tional resources. The autograd system enables automatic differentiation, applying the chain
rule of calculus to compute gradients with respect to all model parameters. This flexibility is a key
reason why PyTorch has gained popularity for deep learning research and production, as it com-
bines high performance with flexibility and transparency, allowing researchers and engineers
to experiment with dynamic architectures and complex control flows without sacrificing efficiency.
25.3 JAX
Literature Review: Li et al. (2024) [314] introduced JAX-based differentiable density func-
tional theory (DFT), enabling end-to-end differentiability in materials science simulations. This
paper extends machine learning theory into quantum chemistry by leveraging JAX’s automatic dif-
ferentiation and parallelization capabilities for efficient optimization of density functional models.
Bieberich and Li (2024) [315] explored quantum machine learning (QML) using JAX and Diffrax
to solve neural differential equations efficiently. They developed a new theoretical model for quan-
tum neural ODEs and discussed how JAX facilitates efficient GPU-based quantum simulations.
Dagréou et al. (2024) [316] analyzed the efficiency of Hessian-vector product (HVP) computation
in JAX and PyTorch for deep learning. They established a mathematical foundation for computing
second-order derivatives in deep learning and optimization, showcasing JAX’s superior automatic
differentiation. Lohoff and Neftci (2025) [317] developed a deep reinforcement learning (DRL)
model that optimizes JAX’s autograd engine for scientific computing. They demonstrated how
JAX’s automatic differentiation is central to its ability to compute gradients, Jacobians, Hes-
sians, and other derivatives efficiently. For many applications, the function of interest involves
computing gradients with respect to model parameters in optimization and machine learning tasks.
Automatic differentiation allows for the efficient computation of these gradients using the reverse-
mode differentiation technique. Let us consider a function f : Rn → Rm , and suppose we wish to
compute the gradient of the scalar-valued output with respect to each input variable. The gradient
of f , denoted as ∇x f , is a vector of partial derivatives:
$\nabla_x f(x) = \left( \frac{\partial f_1}{\partial x_1}, \frac{\partial f_1}{\partial x_2}, \ldots, \frac{\partial f_1}{\partial x_n}, \ldots, \frac{\partial f_m}{\partial x_n} \right)$ (25.22)
This recursive application of the chain rule ensures that each intermediate gradient computation is
propagated backward through the function’s layers, reducing the number of required passes com-
pared to forward-mode differentiation. This technique becomes particularly beneficial for functions
where the number of outputs m is much smaller than the number of inputs n, as it minimizes the
computational complexity. In the context of JAX, automatic differentiation is utilized through func-
tions like jax.grad, which can be applied to scalar-valued functions to return their gradients with
respect to vector-valued inputs. To compute higher-order derivatives, such as the Hessian matrix,
JAX allows for the computation of second- and higher-order derivatives using similar principles.
The Hessian matrix H of a scalar function f (x) is given by the matrix of second derivatives:
$H_{ij} = \frac{\partial^2 f}{\partial x_i\, \partial x_j},$ (25.24)
which is computed by applying the chain rule once again. The second-order derivatives can be
computed efficiently by differentiating the gradient once more, and this process can be extended
to higher-order derivatives by continuing the recursive application of the chain rule. A central
concept in JAX’s approach to high-performance computing is JIT (just-in-time) compilation,
which provides substantial performance gains by compiling Python functions into optimized ma-
chine code tailored to the underlying hardware architecture. JIT compilation in JAX is built on the
foundation of the XLA (Accelerated Linear Algebra) compiler. XLA optimizes the execution
of tensor operations by fusing multiple operations into a single kernel, thereby reducing the overhead
associated with launching individual computation kernels. This technique is particularly effective
for matrix multiplications, convolutions, and other tensor operations commonly found in machine
learning tasks. For example, consider a simple sequence of operations f = Op1 (Op2 (. . . (Opn (x)))),
where Opi represents different mathematical operations applied to the input tensor x. Without
optimization, each operation would typically be executed separately, introducing significant over-
head. JAX’s JIT compiler, however, recognizes this sequence and applies a fusion transformation,
resulting in a single composite operation $f = \mathrm{Fused\_Op}(x)$ (25.25), where Fused_Op represents a highly optimized version of the original sequence of operations. This
optimization minimizes the number of kernel launches and reduces memory access overhead, which
in turn accelerates the computation. The JIT compiler analyzes the computational graph of the
function and identifies opportunities to combine operations into a more efficient form, ultimately
speeding up the computation on hardware accelerators such as GPUs or TPUs.
The vectorization capability provided by JAX through the jax.vmap operator is another essen-
tial optimization for high-performance computing. This feature automatically vectorizes functions
across batches of data, allowing the same operation to be applied in parallel across multiple data
points. Mathematically, for a function f : Rn → Rm and a batch of inputs X ∈ RB×n , the vectorized
function can be expressed as:
Y = vmap(f )(X), (25.26)
where B is the batch size and Y is the matrix in RB×m , containing the results of applying f to
each row of X. The mathematical operation applied by JAX is the same as applying f to each
individual row Xi , but with the benefit that the entire batch is processed in parallel, exploiting
the available hardware resources efficiently. A compact sketch combining these primitives is given below.
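The following is a minimal sketch combining jax.grad, jax.jit, and jax.vmap (cf. Eq. (25.26)); the quadratic loss is an illustrative stand-in for a real model.

```python
# A minimal JAX sketch: reverse-mode gradient (jax.grad), XLA compilation
# (jax.jit), and batch vectorization (jax.vmap). The loss is illustrative.
import jax
import jax.numpy as jnp

def loss(w, x):
    return jnp.sum((x @ w) ** 2)          # scalar-valued, so jax.grad applies

grad_loss = jax.jit(jax.grad(loss))       # gradient w.r.t. w, then JIT-compiled

w = jnp.ones(4)
X = jnp.arange(12.0).reshape(3, 4)        # a batch of three inputs

per_example = jax.vmap(lambda x: grad_loss(w, x))(X)  # vectorized over the batch
print(per_example.shape)                  # (3, 4): one gradient per row of X
```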
The ability to parallelize computations across multiple devices is one of JAX’s strongest features, and it is enabled through the jax.pmap
operator. This operator allows for the parallel execution of functions across different devices, such
as multiple GPUs or TPUs. Suppose we have a function f : Rn → Rm and a batch of inputs
X = (X1 , X2 , . . . , Xp ), distributed across p devices. The parallelized execution of the function can
be written as:
Y = pmap(f )(X), (25.27)
where each device independently computes its portion of the computation f (Xi ), and the results
are gathered into the final output Y. This capability is essential for large-scale distributed training
of machine learning models, where the model’s parameters and data must be distributed across
multiple devices to ensure efficient training. The parallelization effectively reduces computation
time, as each device operates on a distinct subset of the data and model parameters. GPU/TPU
acceleration is another crucial aspect of JAX’s performance, and it is facilitated by libraries like
cuBLAS for GPUs, which are specifically designed to optimize matrix operations. The primary
operation used in many numerical computing tasks is matrix multiplication, and JAX optimizes
this by leveraging hardware-accelerated implementations of these operations. Consider the matrix
multiplication of two matrices A and B, where A ∈ Rn×m and B ∈ Rm×p , resulting in a matrix
C ∈ Rn×p :
C = A × B. (25.28)
Using cuBLAS or a similar library, JAX can execute this operation on a GPU, utilizing the massive
parallel processing power of the hardware to perform the multiplication efficiently. This operation
can be further optimized by considering the specific memory hierarchies of GPUs, where large
matrix multiplications are broken down into smaller tiles that fit into the GPU’s high-speed mem-
ory. This technique minimizes memory bandwidth constraints, accelerating the computation. In
addition to these core operations, JAX allows for the definition of custom gradients using the
jax.custom_jvp decorator, which enables users to specify the Jacobian-vector products (JVPs)
manually for more efficient gradient computation. This feature is especially useful in machine
learning applications, where certain operations might have custom gradients that cannot be com-
puted automatically. For instance, in a non-trivial activation function such as the softmax, the
custom gradient function might be provided explicitly for efficiency:
$\frac{\partial\, \operatorname{softmax}(x)}{\partial x} = \operatorname{diag}(\operatorname{softmax}(x)) - \operatorname{softmax}(x)\, \operatorname{softmax}(x)^T.$ (25.29)
Thus, JAX allows for both flexibility and performance, enabling scientific computing applications
that require both efficiency and the ability to define complex, custom derivatives.
26 Appendix
26.1 Linear Algebra Essentials
• Matrix Multiplication: If $A \in F^{m \times p}$ and $B \in F^{p \times n}$, then the product C = AB is given by $C_{ij} = \sum_{k=1}^{p} A_{ik} B_{kj}$ (26.4). This is only defined when the number of columns of A equals the number of rows of B.
• Transpose: The transpose of A, denoted $A^T$, satisfies $(A^T)_{ij} = A_{ji}$ (26.5).
• Determinant: For a square matrix $A \in F^{n \times n}$, the determinant can be computed by cofactor expansion along the first row, $\det(A) = \sum_{j=1}^{n} (-1)^{1+j} A_{1j} \det(A_{1j})$ (26.6), where $A_{1j}$ is the $(n-1) \times (n-1)$ submatrix obtained by removing the first row and j-th column.
• Inverse: A square matrix A is invertible if there exists A−1 such that:
AA−1 = A−1 A = I (26.7)
where I is the identity matrix.
The dimension of V , denoted dim(V ), is the number of basis vectors. Linear Transformations:
A function T : V → W is linear if:
T (αv + βw) = αT (v) + βT (w) (26.10)
The matrix representation of T is the matrix A such that:
T (x) = Ax (26.11)
Discrete Probability Distributions: For a discrete random variable X, which takes values
from a countable set, the probability mass function (PMF) is defined as $p(x) = P(X = x)$. For example, the binomial distribution has PMF $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$, where n is the number of trials, p is the probability of success on each trial, and k is the number of successes.
Let A and B be two events. Then, Bayes’ theorem gives the conditional probability of A given B:
$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$ (26.23)
where P (A|B) is the posterior probability of A given B, P (B|A) is the likelihood, the probability
of observing B given A, P (A) is the prior probability of A, P (B) is the marginal likelihood of B,
computed as:
$P(B) = \sum_i P(B \mid A_i)\, P(A_i)$ (26.24)
This allows one to update beliefs about a hypothesis A based on observed evidence B. Let us
consider a diagnostic test for a disease. Let A be the event that a person has the disease and B
be the event that the test is positive. We are interested in the probability that a person has the
disease given that the test is positive, i.e., P (A|B). By Bayes’ theorem, we have:
$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$ (26.26)
where P (B|A) is the probability of a positive test result given that the person has the disease
(sensitivity), P (A) is the prior probability of having the disease, P (B) is the total probability of a
positive test result.
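As a small numeric illustration, the sensitivity, prevalence, and false-positive rate below are hypothetical values chosen only to show the computation:

```python
# A numeric illustration of Eqs. (26.24) and (26.26); all rates are hypothetical.
sensitivity = 0.99          # P(B | A)
prevalence  = 0.01          # P(A)
false_pos   = 0.05          # P(B | not A)

p_b = sensitivity * prevalence + false_pos * (1 - prevalence)   # Eq. (26.24)
posterior = sensitivity * prevalence / p_b                      # Eq. (26.26)
print(round(posterior, 3))  # ~0.167: even a positive test leaves the disease unlikely
```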
Each of these measures provides different insights into the characteristics of a dataset or a prob-
ability distribution. There are several Measures of Central Tendency. Given a probability space
(Ω, F, P ) and a random variable X : Ω → R, the expectation (or mean) is defined as:
$E[X] = \int_\Omega X(\omega)\, dP(\omega)$ (26.27)
If Q1 and Q3 denote the first and third quartiles of a dataset (where Q1 is the 25th percentile and
Q3 is the 75th percentile), then the interquartile range is:
IQR = Q3 − Q1 (26.36)
The covariance of two random variables X and Y is defined as $\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$. Expanding:
$\mathrm{Cov}(X, Y) = E[XY] - E[X]\,E[Y]$ (26.40)
The Pearson correlation coefficient is defined as
$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$ (26.41)
where $\sigma_X$ and $\sigma_Y$ are the standard deviations of X and Y, respectively. A central information-theoretic measure is entropy: the entropy of a discrete probability distribution p(x) is given by
$H(X) = -\sum_x p(x) \log p(x)$ (26.42)
For continuous distributions with density f (x), the differential entropy is:
$h(X) = -\int_{-\infty}^{\infty} f(x) \log f(x)\, dx$ (26.43)
The mutual information $I(X; Y) = H(Y) - H(Y \mid X)$ (26.44) measures how much knowing X reduces uncertainty about Y. Statistical measures satisfy linearity and invariance properties, i.e.,
• Expectation is linear:
E[aX + bY ] = aE[X] + bE[Y ] (26.45)
Regarding convergence and asymptotic behavior, the law of large numbers ensures that empirical
means converge to the expected value, while the central limit theorem states that sums of i.i.d.
random variables converge in distribution to a normal distribution.
The mean or expected value of a random variable X, denoted by E[X], represents the aver-
age value of X. For a discrete random variable:
$E[X] = \sum_{x_i \in S} x_i\, p(x_i)$ (26.47)
The variance of a random variable X, denoted by Var(X), measures the spread or dispersion of
the distribution. For a discrete random variable, $\mathrm{Var}(X) = E[(X - E[X])^2] = \sum_{x_i \in S} (x_i - E[X])^2\, p(x_i)$.
The standard deviation is the square root of the variance and provides a measure of the spread
of the distribution in the same units as the random variable:
$\mathrm{SD}(X) = \sqrt{\mathrm{Var}(X)}$ (26.51)
The skewness of a random variable X quantifies the asymmetry of the probability distribution.
It is defined as:
$\mathrm{Skew}(X) = \frac{E[(X - E[X])^3]}{(\mathrm{Var}(X))^{3/2}}$ (26.52)
A positive skew indicates that the distribution has a long tail on the right, while a negative skew
indicates a long tail on the left. The kurtosis of a random variable X measures the "tailedness"
of the distribution, i.e., how much of the probability mass is concentrated in the tails. It is defined
as:
$\mathrm{Kurt}(X) = \frac{E[(X - E[X])^4]}{(\mathrm{Var}(X))^2}$ (26.53)
A distribution with high kurtosis has heavy tails, and one with low kurtosis has light tails compared
to a normal distribution.
26.3 Optimization Techniques
To analyze the convergence of gradient descent, we assume f is convex and differentiable with
a Lipschitz continuous gradient. That is, there exists a constant L > 0 such that $\|\nabla f(x) - \nabla f(y)\| \le L\, \|x - y\|$ for all x, y. This property ensures the gradient of f does not change too rapidly, which allows us to bound the
convergence rate. The following is an upper bound on the decrease in the function value at each
iteration:
f (xk+1 ) − f (x∗ ) ≤ (1 − ηL)(f (xk ) − f (x∗ )), (26.56)
where $x^*$ is the global minimum. Applying this bound repeatedly yields the convergence rate $f(x_k) - f(x^*) \le (1 - \eta L)^k\, (f(x_0) - f(x^*))$ (26.57). For this to converge, we require $\eta L < 1$. Hence, the step size η must be chosen carefully to ensure convergence.
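The following is a minimal sketch of this analysis on a toy quadratic, for which the gradient is 2x and hence L = 2, so any η < 1/2 satisfies ηL < 1:

```python
# A minimal gradient-descent sketch on f(x) = ||x||^2 (Lipschitz constant L = 2);
# the objective and step size are illustrative choices.
import numpy as np

f = lambda x: np.sum(x ** 2)              # f(x) = ||x||^2, minimized at x* = 0
grad = lambda x: 2 * x                    # gradient of f, Lipschitz with L = 2
eta = 0.4                                 # eta * L = 0.8 < 1, so GD converges

x = np.array([3.0, -2.0])
for k in range(25):
    x = x - eta * grad(x)                 # x_{k+1} = x_k - eta * grad f(x_k)
print(f(x))                               # decays geometrically toward f(x*) = 0
```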
Let the objective function be the sum of individual functions fi (x) corresponding to each data
point:
$f(x) = \frac{1}{m} \sum_{i=1}^{m} f_i(x),$ (26.58)
where m is the number of data points. In Stochastic Gradient Descent, the update rule becomes $x_{k+1} = x_k - \eta\, \nabla f_{i_k}(x_k)$, where $i_k$ is a randomly chosen index at the k-th iteration, and $\nabla f_{i_k}(x)$ is the gradient of the function $f_{i_k}(x)$ corresponding to that randomly selected data point. The stochastic gradient is an unbiased estimator of the full gradient, $E[\nabla f_{i_k}(x)] = \nabla f(x)$.
Given that the gradient is stochastic, the convergence analysis of SGD is more complex. Assuming
that each $f_i$ is convex and differentiable, and using the strong convexity assumption (i.e., there exists a constant m > 0 such that $f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{m}{2}\|y - x\|^2$ for all x, y), one can show that with a suitably decaying step size the expected suboptimality typically decreases at a sublinear rate, $E[f(x_k)] - f(x^*) = O(1/k)$.
Second-order methods typically have faster convergence rates compared to gradient descent, par-
ticularly when the function f has well-conditioned curvature. However, computing the Hessian is
computationally expensive, which limits the scalability of these methods. Newton’s method is a
widely used second-order optimization technique that uses both the gradient and the Hessian. The
update rule is given by:
xk+1 = xk − ηH−1 (xk )∇f (xk ). (26.64)
Newton’s method converges quadratically near the optimal point under the assumption that the
objective function is twice continuously differentiable and the Hessian is positive definite. More
formally, if xk is sufficiently close to the optimal point x∗ , then the error ∥xk − x∗ ∥ decreases
quadratically:
∥xk+1 − x∗ ∥ ≤ C∥xk − x∗ ∥2 , (26.65)
where C is a constant depending on the condition number of the Hessian.
Since directly computing the Hessian is expensive, quasi-Newton methods aim to approximate
the inverse Hessian at each iteration. One of the most popular quasi-Newton methods is the Broy-
den–Fletcher–Goldfarb–Shanno (BFGS) method, which maintains an approximation to the
inverse Hessian, updating it at each iteration. The Summary of what we discussed above are as
follows:
• Gradient Descent (GD): An optimization algorithm that updates the parameter vector in
the direction opposite to the gradient of the objective function. Convergence is guaranteed
under convexity assumptions with an appropriately chosen step size.
• Stochastic Gradient Descent (SGD): A variant of GD that uses a random subset of
the data to estimate the gradient at each iteration. While faster and less computationally
intensive, its convergence is slower and more noisy, requiring variance reduction techniques
for efficient training.
• Second-Order Methods: These methods use the Hessian (second derivatives of the ob-
jective function) to accelerate convergence, often exhibiting quadratic convergence near the
optimum. However, the computational cost of calculating the Hessian restricts their practical
use. Quasi-Newton methods, such as BFGS, approximate the Hessian to improve efficiency.
Each of these methods has its advantages and trade-offs, with gradient-based methods being widely
used due to their simplicity and efficiency, and second-order methods providing faster convergence
but at higher computational costs.
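The following is a minimal sketch of a quasi-Newton solve, using SciPy's BFGS implementation on the Rosenbrock function; the test function and starting point are illustrative, not prescribed by the discussion above.

```python
# A minimal quasi-Newton (BFGS) sketch via SciPy on the Rosenbrock function;
# BFGS approximates the inverse Hessian instead of computing it exactly.
import numpy as np
from scipy.optimize import minimize

def rosen(x):
    return 100.0 * (x[1] - x[0] ** 2) ** 2 + (1.0 - x[0]) ** 2

def rosen_grad(x):
    return np.array([
        -400.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1.0 - x[0]),
        200.0 * (x[1] - x[0] ** 2),
    ])

res = minimize(rosen, x0=np.array([-1.2, 1.0]), jac=rosen_grad, method="BFGS")
print(res.x)   # converges to the minimizer (1, 1) without forming the exact Hessian
```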
26.4 Matrix Calculus
To compute the derivative of $\|A\|_F$ with respect to A, we first apply the chain rule:
$\frac{\partial \|A\|_F}{\partial a_{ij}} = \frac{2 a_{ij}}{2 \|A\|_F} = \frac{a_{ij}}{\|A\|_F}$ (26.68)
Thus, the gradient of the Frobenius norm is the matrix $\frac{A}{\|A\|_F}$. The matrix derivatives of common functions are as follows (a numerical check appears after this list):
• Matrix trace: For a matrix A, the derivative of the trace Tr(A) with respect to A is the
identity matrix:
$\frac{\partial\, \mathrm{Tr}(A)}{\partial A} = I$ (26.69)
• Matrix product: Let A and B be matrices, and consider the product f (A) = AB. The
derivative of this product with respect to A is:
$\frac{\partial (AB)}{\partial A} = B^T$ (26.70)
• Matrix inverse: The derivative of the inverse of A with respect to A is:
∂(A^{−1})/∂A = −A^{−1} (∂A/∂A) A^{−1}, (26.71)
equivalently, in differential form, d(A^{−1}) = −A^{−1} (dA) A^{−1}.
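These identities are easy to sanity-check numerically. The snippet below (random matrices and the perturbation size are arbitrary choices made for this illustration) verifies the Frobenius-norm gradient A/∥A∥_F and the first-order expansion of the matrix inverse:

import numpy as np

# Numerical check of grad ||A||_F = A / ||A||_F and of the first-order
# expansion (A + dA)^{-1} - A^{-1} ~ -A^{-1} dA A^{-1} (illustrative only).
rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
dA = 1e-6 * rng.normal(size=(4, 4))             # small perturbation

# Frobenius norm: first-order change vs. <A/||A||_F, dA>.
lhs = np.linalg.norm(A + dA) - np.linalg.norm(A)
rhs = np.sum((A / np.linalg.norm(A)) * dA)
print(abs(lhs - rhs))                            # O(||dA||^2), i.e. tiny

# Inverse: (A + dA)^{-1} - A^{-1} vs. -A^{-1} dA A^{-1}.
Ainv = np.linalg.inv(A)
lhs = np.linalg.inv(A + dA) - Ainv
rhs = -Ainv @ dA @ Ainv
print(np.linalg.norm(lhs - rhs))                 # O(||dA||^2), i.e. tiny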
Let T be a tensor, represented by the array of components T_{i_1,i_2,...,i_k}, where the indices i_1, i_2, ..., i_k range over the dimensions of the tensor. Let f(T) be a scalar-valued function that depends on the tensor T. The derivative of this function with respect to the tensor components T_{i_1,...,i_k} is given by:
∂f(T)/∂T_{i_1,...,i_k} = Jacobian of f(T) with respect to T_{i_1,...,i_k}. (26.72)
For example, consider a function of a second-order tensor, f (T), where T is a matrix. The dif-
ferentiation rule follows similar principles as matrix differentiation. The Jacobian is computed for
each tensor component in the same fashion, based on the partial derivatives with respect to the
individual tensor components.
Consider a second-order tensor T, and let us compute the derivative of the Frobenius norm of T:
∥T∥_F = √( Σ_{i_1,i_2,...,i_k} T_{i_1,...,i_k}² ), (26.73)
for which, exactly as in the matrix case, ∂∥T∥_F/∂T_{i_1,...,i_k} = T_{i_1,...,i_k}/∥T∥_F.
This rule extends to higher-order tensors, where differentiation follows tensor contraction rules. The process of differentiating matrices and tensors extends the rules of differentiation to multi-dimensional data structures, with careful application of the chain and product rules and an understanding of the Jacobians of the functions involved. For matrices, the derivative is a matrix of partial derivatives, while for tensors, the derivative is typically expressed as a tensor indexed by multi-index components. In higher-order tensor differentiation, we apply these principles recursively, using multi-index notation and respecting the tensor contraction rules that define how the components interact.
We start with the differentiation of scalar-valued functions with matrix arguments. Let f : R^{m×n} → R be a scalar function of a matrix X. The function f is differentiable at X with differential df if
f(X + H) − f(X) = df(H) + o(∥H∥) as ∥H∥ → 0, (26.77)
where H is an infinitesimal perturbation. The total derivative of f is given by:
df = tr( (∂f/∂X)^T dX ). (26.78)
Definition of the Matrix Gradient: The gradient D_X f (or Jacobian) is the unique matrix satisfying:
df = tr( (D_X f)^T dX ). (26.79)
This ensures that differentiation is dual to the Frobenius inner product ⟨A, B⟩ = tr(A^T B), giving a Hilbert space structure. Let us start with the example of quadratic form differentiation. Let f(X) = tr(X^T A X). Expanding in a small perturbation H:
f(X + H) = tr(X^T A X) + tr(H^T A X) + tr(X^T A H) + O(∥H∥²),
so that df = tr( H^T (A + A^T) X ) and hence D_X f = (A + A^T) X. For matrix- and tensor-valued maps F, the differential takes the analogous form
dF = D_X F : dX. (26.85)
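A finite-difference check of the quadratic-form gradient just derived is immediate (the matrices below are random choices made only for the illustration):

import numpy as np

# Check that for f(X) = tr(X^T A X) the matrix gradient is (A + A^T) X.
rng = np.random.default_rng(2)
A = rng.normal(size=(3, 3))
X = rng.normal(size=(3, 3))
H = 1e-6 * rng.normal(size=(3, 3))              # small perturbation dX

f = lambda M: np.trace(M.T @ A @ M)
lhs = f(X + H) - f(X)                            # first-order change of f
rhs = np.sum(((A + A.T) @ X) * H)                # tr(G^T dX) with G = (A + A^T) X
print(abs(lhs - rhs))                            # O(||H||^2), i.e. tiny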
Regarding the differentiation of the matrix inverse, for F(X) = X^{−1} we use the identity X X^{−1} = I, whose differentiation yields dF = −X^{−1} (dX) X^{−1}, or in vectorized form d vec(F) = −(X^{−T} ⊗ X^{−1}) d vec(X), where ⊗ denotes the Kronecker product. We turn to the differentiation of tensor-valued functions. For a differentiable tensor function F : R^{m×n×p} → R^{a×b×c}, the Fréchet derivative is a higher-order tensor D_X F satisfying:
dF = D_X F : dX. (26.89)
In particular, ∂(C : X)/∂X = C. (26.91)
Differentiation can also be carried out in non-Euclidean spaces. For a manifold M, differentiation is defined via tangent spaces T_X M, with the covariant derivative ∇_X of the Levi-Civita connection characterized by:
∇_X Y = lim_{ϵ→0} [ Proj_{T_{X+ϵH} M}( Y(X + ϵH) ) − Y(X) ] / ϵ. (26.92)
Differentiation can also be performed using variational principles. If f(X) is an energy functional, the Gateaux derivative is:
δf = lim_{ϵ→0} [ f(X + ϵH) − f(X) ] / ϵ. (26.93)
For functionals, stationarity is expressed through Euler-Lagrange equations:
(d/dt) ∫_Ω L(X, ∇X) dV = 0. (26.94)
Formally, Information Theory is deeply intertwined with probability theory, measure the-
ory, functional analysis, and ergodic theory, and it finds applications in diverse fields such
as statistical mechanics, coding theory, artificial intelligence, and even quantum information.
The Shannon entropy H(X) is defined rigorously as the expected value of the logarithm of the inverse probability:
H(X) = E[−log p(X)] = −Σ_{x∈X} p(x) log p(x), (26.96)
where the logarithm is taken in base 2 (bits) or natural base e (nats). Shannon's entropy satisfies a set of fundamental properties that can be derived axiomatically from Khinchin's postulates.
The Fundamental Theorem of Information Measures: Given a probability space (Ω, F, P), the Shannon entropy satisfies the variational principle
H(X) = inf_Q E_P[−log q(X)],
where the infimum is taken over all probability measures Q on X and is attained at Q = P, since E_P[−log q(X)] = H(X) + D_KL(P∥Q), with D_KL(P∥Q) the Kullback-Leibler divergence:
D_KL(P∥Q) = Σ_x p(x) log( p(x)/q(x) ). (26.101)
Thus, entropy can be interpreted as the minimum information divergence from uniformity.
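This interpretation can be verified numerically. In the snippet below (the distribution is an arbitrary choice for the illustration), H(X) coincides with log|X| − D_KL(P ∥ U) for the uniform distribution U:

import numpy as np

# Numerical illustration of H(X) = log|X| - D_KL(P || U), U uniform.
p = np.array([0.6, 0.2, 0.1, 0.1])
u = np.full(4, 0.25)
H = -np.sum(p * np.log2(p))
KL = np.sum(p * np.log2(p / u))
print(H, np.log2(4) - KL)                        # both equal ~1.571 bits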
Let (Ω, F, P) be a probability space, where Ω is the sample space, F is the σ-algebra of events, and P is the probability measure. A discrete random variable X is a measurable function X : Ω → X, where X is a countable set. The probability mass function (PMF) of X is given by:
p_X(x) = P(X = x), (26.102)
and the entropy of X is H(X) = −Σ_{x∈X} p_X(x) log p_X(x), where 0 log 0 ≡ 0 by convention, and the logarithm is typically base 2 (bits) or base e (nats). For
two random variables X and Y the joint entropy is:
H(X, Y) = −Σ_{x∈X, y∈Y} p_{X,Y}(x, y) log p_{X,Y}(x, y). (26.104)
Regarding the non-negativity of entropy, H(X) ≥ 0, with equality if and only if X is deterministic. To prove this, note that since p_X(x) ∈ [0, 1], we have −log p_X(x) ≥ 0 for all x ∈ X. Thus:
H(X) = −Σ_{x∈X} p_X(x) log p_X(x) ≥ 0. (26.107)
Equality holds if and only if p_X(x) = 1 for some x and p_X(x′) = 0 for all x′ ≠ x, meaning X
is deterministic. To get an upper bound on entropy, for a discrete random variable X with |X| possible outcomes:
H(X) ≤ log |X|, (26.108)
with equality if and only if X is uniformly distributed. To prove this, use Gibbs' inequality: for any probability distributions p_X(x) and q_X(x),
−Σ_{x∈X} p_X(x) log p_X(x) ≤ −Σ_{x∈X} p_X(x) log q_X(x). (26.109)
Let q_X(x) = 1/|X| (the uniform distribution). Then:
H(X) ≤ −Σ_{x∈X} p_X(x) log(1/|X|) = log |X|. (26.110)
Equality holds if and only if p_X(x) = q_X(x) = 1/|X| for all x, meaning X is uniformly distributed.
By the chain rule for joint entropy, for two random variables X and Y, the joint entropy satisfies H(X, Y) = H(X) + H(Y|X). By definition:
H(X, Y) = −Σ_{x∈X, y∈Y} p_{X,Y}(x, y) log p_{X,Y}(x, y). (26.112)
Using the chain rule of probability, p_{X,Y}(x, y) = p_X(x) p_{Y|X}(y|x), we rewrite:
H(X, Y) = −Σ_{x,y} p_{X,Y}(x, y) log[ p_X(x) p_{Y|X}(y|x) ]. (26.113)
The first term simplifies to H(X), and the second term simplifies to H(Y|X), giving H(X, Y) = H(X) + H(Y|X). By definition:
I(X; Y) = Σ_{x∈X, y∈Y} p_{X,Y}(x, y) log( p_{X,Y}(x, y) / (p_X(x) p_Y(y)) ). (26.117)
We now discuss Mutual Information and Fundamental Theorems of Dependence. The mutual
information between two random variables X and Y quantifies the reduction in uncertainty of X
given knowledge of Y :
I(X; Y ) = H(X) − H(X|Y ). (26.119)
Equivalently, it is given by the relative entropy between the joint distribution p(x, y) and the product of the marginals:
I(X; Y) = D_KL( p(x, y) ∥ p(x) p(y) ) ≥ 0.
The non-negativity follows directly from Jensen's inequality and the convexity of relative entropy. Shannon's Source Coding Theorem bounds the expected length L of any uniquely decodable lossless code by
L ≥ H(X). (26.122)
Moreover, the Asymptotic Equipartition Property (AEP) states that for large n, the probability of a typical sequence X_1, X_2, ..., X_n satisfies P(X_1, ..., X_n) ≈ 2^{−nH(X)}. The theorem consists of two parts:
1. Achievability: There exist codes whose average code length per symbol is arbitrarily close to H(X).
2. Converse: No source code can achieve an average code length per symbol smaller than H(X) without increasing the error probability to 1.
To prove the Shannon Source Coding Theorem, we assume that X is a discrete random variable with probability mass function (PMF) P_X(x). The entropy of X is defined as:
H(X) = −Σ_{x∈X} P_X(x) log P_X(x). (26.124)
We will use the Asymptotic Equipartition Property (AEP), which states that for large n, the sequence X^n belongs to a typical set T_ϵ^{(n)} with high probability. The first step concerns the AEP and the size of the typical set. The strong law of large numbers implies that for any ϵ > 0, the set
T_ϵ^{(n)} = { x^n ∈ X^n : | −(1/n) log P_X(x^n) − H(X) | < ϵ } (26.126)
has probability approaching 1 as n → ∞.
Since these sequences occur with high probability, we can restrict our coding efforts to them. The second step is to encode the typical sequences. If we assign a unique binary code to each sequence in T_ϵ^{(n)}, we need at most log |T_ϵ^{(n)}| ≤ n(H(X) + ϵ) bits per sequence, giving an encoding length of at most n(H(X) + ϵ) + 1 bits. Thus, the expected code length per symbol is at most H(X) + ϵ. The third step is to analyze the converse (optimality of the entropy rate). Consider any uniquely decodable prefix-free code with average code length L. By Kraft's inequality:
Σ_{x^n} 2^{−L(x^n)} ≤ 1. (26.129)
Thus, no lossless source code can achieve a rate below H(X) bits per symbol. We have
rigorously proven both the achievability and the converse of the Shannon Source Coding Theo-
rem, showing that the fundamental limit of lossless compression is given by the entropy of the source.
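As a simple numerical companion to the theorem (the source distribution below is invented for the example), assigning Shannon code lengths l(x) = ⌈−log₂ P_X(x)⌉ satisfies Kraft's inequality and yields an average length within one bit of the entropy:

import numpy as np

# Shannon-code sketch: lengths l(x) = ceil(-log2 p(x)) satisfy Kraft's
# inequality and give H(X) <= E[l] < H(X) + 1 (illustrative distribution).
p = np.array([0.4, 0.3, 0.2, 0.1])
lengths = np.ceil(-np.log2(p))
print(np.sum(2.0 ** -lengths) <= 1.0)            # Kraft's inequality holds
H = -np.sum(p * np.log2(p))
print(H, p @ lengths)                            # entropy ~1.846 vs. E[l] = 2.4

Coding blocks of n symbols jointly, as in the proof above, closes the remaining gap and drives the per-symbol rate down to H(X) + ϵ.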
Now let X^n = (X_1, X_2, ..., X_n) be a sequence of i.i.d. random variables, where P_X is the probability mass function (PMF) of each X_i. The entropy of X, denoted H(X), is defined as:
H(X) = −Σ_{x∈X} P_X(x) log P_X(x), (26.132)
where the logarithm is taken base 2, and H(X) quantifies the expected information content of X. For a given ϵ > 0 and sequence length n, the typical set A_ϵ^{(n)} is defined as:
A_ϵ^{(n)} = { x^n ∈ X^n : | −(1/n) log P_{X^n}(x^n) − H(X) | < ϵ }. (26.133)
This set consists of sequences x^n whose empirical entropy −(1/n) log P_{X^n}(x^n) is close to the true entropy H(X). The AEP states that, as n → ∞, the probability of the typical set approaches 1:
P( X^n ∈ A_ϵ^{(n)} ) → 1 as n → ∞.
This is a direct consequence of the weak law of large numbers (WLLN) applied to the random variable −log P_X(X_i), which has finite mean H(X) and finite variance (by the finiteness of X). The cardinality of the typical set satisfies:
(1 − ϵ) 2^{n(H(X)−ϵ)} ≤ |A_ϵ^{(n)}| ≤ 2^{n(H(X)+ϵ)} for sufficiently large n.
This follows from the definition of the typical set and the fact that P_{X^n}(x^n) ≈ 2^{−nH(X)} for x^n ∈ A_ϵ^{(n)}. By the equipartition property, for all x^n ∈ A_ϵ^{(n)}, the probability of x^n satisfies:
2^{−n(H(X)+ϵ)} ≤ P_{X^n}(x^n) ≤ 2^{−n(H(X)−ϵ)}. (26.136)
This implies that all sequences in the typical set are approximately equiprobable. The AEP is a
consequence of the weak law of large numbers (WLLN) and the Chernoff bound. Here, we provide
a rigorous proof. Define the random variable:
Yi = − log PX (Xi ). (26.137)
Since {X_i} are i.i.d., {Y_i} are also i.i.d., with mean E[Y_i] = H(X) and variance σ² = Var(Y_i). By the Weak Law of Large Numbers, we can write:
(1/n) Σ_{i=1}^n Y_i →_p H(X) as n → ∞. (26.138)
For stationary ergodic processes, the typical set is defined analogously, and the probability concentration result holds under the ergodic theorem. For continuous random variables, the differential entropy h(X) replaces H(X), and the typical set is defined in terms of probability density functions:
A_ϵ^{(n)} = { x^n ∈ R^n : | −(1/n) log f_{X^n}(x^n) − h(X) | < ϵ }, (26.142)
where fXn is the joint probability density function. For Markov chains and other non-i.i.d. pro-
cesses, the AEP holds under appropriate mixing conditions, with the entropy rate adjusted to
account for dependencies. The AEP underpins Shannon’s source coding theorem, which states
that the optimal compression rate for a source is given by its entropy rate.
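A quick simulation (the source distribution is chosen arbitrarily for the illustration) shows the concentration asserted by the AEP:

import numpy as np

# The AEP in action: -(1/n) log2 P(X^n) concentrates around H(X).
rng = np.random.default_rng(3)
p = np.array([0.5, 0.25, 0.125, 0.125])          # source PMF on 4 symbols
H = -np.sum(p * np.log2(p))                      # H(X) = 1.75 bits

for n in (10, 100, 10_000):
    x = rng.choice(len(p), size=n, p=p)          # draw X_1, ..., X_n i.i.d.
    emp = -np.log2(p[x]).mean()                  # -(1/n) log2 P_{X^n}(x^n)
    print(n, emp)                                 # approaches H = 1.75 as n grows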
Shannon's Noisy Channel Coding Theorem asserts that for any transmission rate R < C reliable communication is possible (the error probability can be driven to zero), whereas for R > C it is not. The channel capacity is
C = max_{P_X} I(X; Y),
where I(X; Y) is the mutual information between X and Y, and the maximization is over all input distributions P_X.
Fix a rate R < C and a small ϵ > 0. For the random coding argument, consider a block code of length n with M = 2^{nR} codewords. Each codeword x_m = (x_{m1}, x_{m2}, ..., x_{mn}) is generated independently and identically according to the input distribution P_X that achieves capacity. Encoding: to transmit message m, send the codeword x_m. Decoding: upon receiving y, the decoder uses joint typicality decoding, declaring m̂ if (x_{m̂}, y) is jointly typical and no other codeword is jointly typical with y; if no such m̂ exists, or several exist, an error is declared. Regarding joint typicality, the set of jointly typical sequences A_ϵ^{(n)} is defined as:
A_ϵ^{(n)} = { (x, y) ∈ X^n × Y^n : | −(1/n) log P_{X^n,Y^n}(x, y) − H(X, Y) | < ϵ }, (26.146)
where H(X, Y) is the joint entropy of X and Y. By the Joint Asymptotic Equipartition Property (AEP), for sufficiently large n:
P_{X^n,Y^n}( A_ϵ^{(n)} ) ≥ 1 − ϵ. (26.147)
For the error probability analysis, the error probability P_e is decomposed into two events:
• E_1: (x_m, y) ∉ A_ϵ^{(n)}.
• E_2: some other codeword x_{m′} (with m′ ≠ m) satisfies (x_{m′}, y) ∈ A_ϵ^{(n)}.
Bounding P(E_1): by the Joint AEP,
P(E_1) = P( (x_m, y) ∉ A_ϵ^{(n)} ) ≤ ϵ. (26.148)
Bounding P(E_2): for a fixed incorrect codeword x_{m′}, the probability that (x_{m′}, y) ∈ A_ϵ^{(n)} is approximately 2^{−nI(X;Y)}. Since there are M − 1 ≈ 2^{nR} incorrect codewords, the union bound gives:
P(E_2) ≤ 2^{nR} · 2^{−nI(X;Y)} = 2^{−n(I(X;Y)−R)}.
Since R < C = I(X; Y), P(E_2) → 0 exponentially as n → ∞. Combining the bounds gives the total error probability:
P_e ≤ P(E_1) + P(E_2) ≤ ϵ + 2^{−n(I(X;Y)−R)}. (26.150)
For sufficiently large n, Pe ≤ 2ϵ. The converse part shows that reliable communication is impossible
for R > C. The key steps are:
• Use Fano’s inequality to relate the error probability Pe to the conditional entropy H(M |M̂ ).
• Apply the data processing inequality to bound the mutual information I(M ; M̂ ).
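For a concrete instance of the capacity formula (the binary symmetric channel and its crossover probability are chosen only for this illustration), the following Python sketch evaluates I(X; Y) at the uniform input and compares it with the closed form C = 1 − H_b(ϵ):

import numpy as np

# Mutual information of a binary symmetric channel at the uniform input;
# the capacity is 1 - H_b(eps), attained by P_X = (1/2, 1/2).
def mutual_information(p_x, channel):
    p_xy = p_x[:, None] * channel                # joint p(x, y)
    p_y = p_xy.sum(axis=0)                       # output marginal p(y)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x[:, None] * p_y)[mask]))

eps = 0.1
channel = np.array([[1 - eps, eps], [eps, 1 - eps]])   # transition matrix P(y|x)
I = mutual_information(np.array([0.5, 0.5]), channel)
Hb = -eps * np.log2(eps) - (1 - eps) * np.log2(1 - eps)
print(I, 1 - Hb)                                 # both ~0.531 bits per channel use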
Let X be a random variable representing the source data, with probability distribution pX (x)
defined over a finite alphabet X . The compressed representation of X is denoted by X̂, which
takes values in a finite alphabet X̂ . The distortion between X and X̂ is quantified by a distortion
measure d : X × X̂ → R≥0 , which is assumed to be non-negative and bounded. The Rate-Distortion
Function R(D) is defined as:
R(D) = inf_{p_{X̂|X}} { I(X; X̂) : E[d(X, X̂)] ≤ D }, (26.152)
where p_{X̂|X} is the conditional distribution of X̂ given X, I(X; X̂) is the mutual information between X and X̂, and E[d(X, X̂)] is the expected distortion. The infimum is taken over all conditional distributions p_{X̂|X} that satisfy the distortion constraint E[d(X, X̂)] ≤ D. The mutual information I(X; X̂) is defined as:
I(X; X̂) = Σ_{x∈X} Σ_{x̂∈X̂} p_X(x) p_{X̂|X}(x̂|x) log( p_{X̂|X}(x̂|x) / p_{X̂}(x̂) ), (26.153)
where p_{X̂}(x̂) = Σ_{x∈X} p_X(x) p_{X̂|X}(x̂|x) is the marginal distribution of X̂. The expected distortion is given by:
E[d(X, X̂)] = Σ_{x∈X} Σ_{x̂∈X̂} p_X(x) p_{X̂|X}(x̂|x) d(x, x̂). (26.154)
Note that the distortion constraint E[d(X, X̂)] ≤ D is a linear (and thus convex) constraint in p_{X̂|X}, while the mutual information I(X; X̂) is a convex function of p_{X̂|X}.
We now give the proof of the properties of the Rate-Distortion Function. To prove the convexity of R(D), consider two distortion levels D_1 and D_2, and let p_1 and p_2 be the corresponding optimal conditional distributions achieving R(D_1) and R(D_2), respectively. For any λ ∈ [0, 1], define:
p_λ = λ p_1 + (1 − λ) p_2, D_λ = λ D_1 + (1 − λ) D_2.
The mixture p_λ satisfies the distortion constraint E[d(X, X̂)] ≤ D_λ, and by convexity of mutual information in the conditional distribution, I_{p_λ}(X; X̂) ≤ λ I_{p_1}(X; X̂) + (1 − λ) I_{p_2}(X; X̂). Thus:
R(D_λ) ≤ λ R(D_1) + (1 − λ) R(D_2), (26.158)
proving the convexity of R(D). Regarding the monotonicity of R(D), the Rate-Distortion Function R(D) is non-increasing in D. Formally, if D_1 ≤ D_2, then R(D_1) ≥ R(D_2).
This follows because the set of conditional distributions p_{X̂|X} satisfying E[d(X, X̂)] ≤ D_2 includes all distributions satisfying E[d(X, X̂)] ≤ D_1.
The achievability of R(D) is proven using the random coding argument. For a given D, gener-
ate a codebook of 2nR codewords, each drawn independently according to the marginal distribution
pX̂ (x̂). For each source sequence xn , find the codeword x̂n that minimizes the distortion d(xn , x̂n ).
Using the law of large numbers and the typicality of sequences, it can be shown that the expected
distortion approaches D as the block length n → ∞, provided R ≥ R(D). The converse is proven using the data processing inequality and the properties of mutual information. Suppose there exists a code with rate R < R(D) and distortion E[d(X, X̂)] ≤ D. Then
nR ≥ I(X^n; X̂^n) ≥ Σ_{i=1}^n I(X_i; X̂_i) ≥ n R(D),
where the last step uses the convexity and monotonicity of R(D); hence R ≥ R(D), which is a contradiction. Thus, R(D) is the fundamental limit. The optimization problem can be reformulated using the Lagrangian:
L(p_{X̂|X}, λ) = I(X; X̂) + λ( E[d(X, X̂)] − D ), (26.161)
where λ ≥ 0 is the Lagrange multiplier. The optimal solution satisfies the Karush-Kuhn-Tucker
(KKT) conditions:
1. Stationarity:
∇pX̂|X L = 0. (26.162)
2. Primal Feasibility:
E[d(X, X̂)] ≤ D. (26.163)
3. Dual Feasibility:
λ ≥ 0. (26.164)
4. Complementary Slackness:
λ( E[d(X, X̂)] − D ) = 0. (26.165)
The Blahut-Arimoto algorithm is an iterative method for numerically computing R(D). It al-
ternates between updating the conditional distribution pX̂|X and the Lagrange multiplier λ to
converge to the optimal solution. For a Gaussian source X ∼ N (0, σ 2 ) and squared-error distortion
d(x, x̂) = (x − x̂)2 , the Rate-Distortion Function is:
R(D) = (1/2) log( σ²/D ) for 0 ≤ D ≤ σ², and R(D) = 0 for D > σ². (26.166)
This result is derived using the properties of Gaussian distributions and mutual information, and
it illustrates the trade-off between rate and distortion. The Rate-Distortion Function R(D) is a
cornerstone of information theory, rigorously characterizing the fundamental limits of lossy data
compression. This deep theoretical framework underpins modern data compression techniques and
has broad applications in communication, signal processing, and machine learning.
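The alternating structure of the Blahut-Arimoto iteration is compact enough to sketch directly. In the Python snippet below (the binary source, the Hamming distortion, the fixed multiplier beta, and the iteration count are assumptions made for this illustration), each pass updates the conditional p(x̂|x) and the marginal q(x̂) in turn; varying beta traces out the R(D) curve:

import numpy as np

# Blahut-Arimoto sketch for R(D): alternate between the optimal
# conditional p(xhat|x) for a fixed marginal q and the induced marginal.
def blahut_arimoto(p_x, dist, beta, iters=500):
    n_xhat = dist.shape[1]
    q = np.full(n_xhat, 1.0 / n_xhat)              # init q(xhat) uniform
    for _ in range(iters):
        w = q * np.exp(-beta * dist)               # q(xhat) * exp(-beta d(x, xhat))
        p_cond = w / w.sum(axis=1, keepdims=True)  # normalized p(xhat|x)
        q = p_x @ p_cond                           # updated marginal q(xhat)
    D = np.sum(p_x[:, None] * p_cond * dist)       # expected distortion
    R = np.sum(p_x[:, None] * p_cond * np.log2(p_cond / q))  # I(X; Xhat)
    return R, D

p_x = np.array([0.5, 0.5])                         # fair binary source
dist = 1.0 - np.eye(2)                             # Hamming distortion
R, D = blahut_arimoto(p_x, dist, beta=3.0)
print(R, D)   # the pair lies on R(D) = 1 - H_b(D) for this source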
Error-Correcting Codes: Reed-Solomon, Turbo, and LDPC codes achieve rates near capac-
ity. The channel capacity C is the supremum of all achievable rates R for which there exists a
coding scheme with a vanishing probability of error Pe → 0 as the block length n → ∞. For a
discrete memoryless channel (DMC) with transition probabilities P(y|x), the capacity is given by:
C = sup_{P_X} I(X; Y), (26.167)
where I(X; Y) is the mutual information between the input X and output Y, and the supremum is taken over all input distributions P_X. For the additive white Gaussian noise (AWGN) channel with power constraint P and noise variance σ², the capacity is:
C = (1/2) log₂( 1 + P/σ² ) [bits per channel use]. (26.168)
The converse of Shannon’s theorem establishes that no coding scheme can achieve R > C with
Pe → 0. Let’s now discuss the Fundamental Limits and Large Deviation Theory of Error-Correcting
Codes. An error-correcting code C of block length n and rate R = k/n maps k information bits to
n coded bits. The error exponent E(R) characterizes the exponential decay of Pe with n for rates
R < C:
Pe ∼ e−nE(R) . (26.169)
The Gallager exponent provides a lower bound on E(R):
E(R) ≥ max_{0≤ρ≤1} max_{P_X} [ E_0(ρ, P_X) − ρR ].
For the AWGN channel, E0 (ρ) can be expressed in terms of the signal-to-noise ratio (SNR). Let’s
discuss the algebraic geometry and finite fields underlying Reed-Solomon codes. Reed-Solomon codes are evaluation codes defined over finite fields F_q, where q = 2^m. They are constructed by evaluating message polynomials of degree less than k at n distinct points of F_q.
Regarding Turbo Codes: Iterative Decoding and Statistical Mechanics. Turbo codes are con-
structed using two recursive systematic convolutional (RSC) encoders separated by an interleaver.
The iterative decoding process can be analyzed using tools from statistical mechanics.
EXIT charts: the extrinsic information transfer (EXIT) chart is a tool for analyzing the convergence of iterative decoding. The area theorem relates the area under the EXIT curve to the gap to capacity.
The performance of Turbo codes is characterized by the waterfall region and the error floor, which
can be analyzed using large deviation theory and random matrix theory. LDPC codes are defined by
a sparse parity-check matrix H ∈ F_2^{m×n}, where each row represents a parity-check constraint. The
Tanner graph of the code is a bipartite graph with variable nodes (corresponding to codeword bits)
and check nodes (corresponding to parity constraints). Regarding the Message-Passing Decoding,
The sum-product algorithm (SPA) or min-sum algorithm (MSA) is used for iterative decoding. The
messages passed between nodes are log-likelihood ratios (LLRs). Regarding the Density Evolution,
This is a theoretical tool to analyze the asymptotic performance of LDPC codes. It tracks the
probability density function (PDF) of the LLRs as a function of the iteration number. The threshold
of the code is the maximum noise level for which Pe → 0 as n → ∞. The degree distributions of
the variable and check nodes, denoted by λ(x) and ρ(x), respectively, are optimized to maximize
the threshold. The optimization problem can be formulated as:
max_{λ,ρ} Threshold(λ, ρ) subject to ∫_0^1 λ(x) dx = ∫_0^1 ρ(x) dx = 1. (26.173)
The near-capacity performance of Turbo and LDPC codes is a consequence of their ability to ex-
ploit the channel’s soft information and their iterative decoding algorithms. The turbo principle
states that the exchange of extrinsic information between decoders improves the reliability of the
estimates.
Machine Learning: KL-divergence and mutual information are used in variational inference.
We begin by placing the problem in a measure-theoretic framework. Let (Ω, F, P ) be a probability
space, where Ω is the sample space, F is a σ-algebra, and P is a probability measure. The observed
variables x and latent variables z are random variables defined on this space, with
x : Ω → X, z : Ω → Z, (26.174)
where X and Z are measurable spaces. The joint distribution p(x, z) is a probability measure on
X × Z, and the posterior p(z | x) is a conditional probability measure. Variational inference seeks
to approximate p(z | x) using a variational measure q(z; ϕ), where ϕ parameterizes the variational
family Q. The Kullback-Leibler (KL) divergence between two probability measures Q and P on (Z, G) is defined as:
D_KL(Q ∥ P) = ∫_Z log( dQ/dP ) dQ, (26.175)
where dQ/dP is the Radon-Nikodym derivative of Q with respect to P. The KL divergence is finite only if Q is absolutely continuous with respect to P (denoted Q ≪ P), and it satisfies:
D_KL(Q ∥ P) ≥ 0, (26.176)
with equality if and only if Q = P.
The Evidence Lower Bound (ELBO) is derived using measure-theoretic expectations. Starting from the log-marginal likelihood:
log p(x) = log ∫_Z p(x, z) dz, (26.181)
an application of Jensen's inequality with respect to q(z; ϕ) yields
log p(x) ≥ E_{q(z;ϕ)}[ log p(x, z) ] + H[q(z; ϕ)] =: ELBO(ϕ),
where
H[q(z; ϕ)] = −E_{q(z;ϕ)}[ log q(z; ϕ) ] (26.184)
is the entropy of q(z; ϕ). The mutual information between x and z is defined as:
I(x; z) = D_KL( p(x, z) ∥ p(x) ⊗ p(z) ),
where p(x) ⊗ p(z) is the product measure of the marginals. In VI, the variational mutual information is:
I_q(x; z) = E_{p(x)}[ D_KL( q(z | x) ∥ q(z) ) ], (26.186)
where
q(z) = ∫_X q(z | x) p(x) dx (26.187)
is the aggregated posterior. Using measure-theoretic expectations, the ELBO can be decomposed as:
ELBO(ϕ) = E_{p(x)} E_{q(z|x)}[ log p(x | z) ] − I_q(x; z) − D_KL( q(z) ∥ p(z) ). (26.188)
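To make these identities concrete, the following Python sketch (the one-dimensional model p(z) = N(0, 1), p(x|z) = N(z, 1) and the Gaussian variational family q(z) = N(μ, s²) are choices made for this illustration, not part of the text) verifies numerically that log p(x) = ELBO(q) + D_KL(q ∥ p(z|x)):

import numpy as np
from scipy.stats import norm

# For p(z) = N(0,1), p(x|z) = N(z,1): p(x) = N(0,2) and p(z|x) = N(x/2, 1/2).
x = 1.3
mu, s = 0.2, 0.8                                  # variational q(z) = N(mu, s^2)
rng = np.random.default_rng(4)
z = rng.normal(mu, s, size=200_000)               # Monte Carlo samples from q

elbo = np.mean(norm.logpdf(x, z, 1) + norm.logpdf(z, 0, 1)
               - norm.logpdf(z, mu, s))           # E_q[log p(x,z) - log q(z)]
kl = (np.log(np.sqrt(0.5) / s)
      + (s**2 + (mu - x / 2) ** 2) / (2 * 0.5) - 0.5)  # KL(q || N(x/2, 1/2))
print(elbo + kl, norm.logpdf(x, 0, np.sqrt(2)))   # agree up to Monte Carlo error

Maximizing the ELBO over (μ, s) is therefore the same as minimizing D_KL(q ∥ p(z|x)), which is the defining objective of variational inference.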
Quantum Information: von Neumann entropy generalizes Shannon entropy for quantum states.
In quantum mechanics, the state of a quantum system is described by a density operator ρ, which
is a positive semi-definite, Hermitian operator acting on a Hilbert space H, with unit trace:
ρ ≥ 0, ρ = ρ† , Tr(ρ) = 1. (26.189)
For a pure state |ψ⟩ ∈ H, the density operator is given by:
ρ = |ψ⟩ ⟨ψ| . (26.190)
For a mixed state, which is a statistical ensemble of pure states {|ψ_i⟩} with probabilities {p_i}, the density operator is:
ρ = Σ_i p_i |ψ_i⟩⟨ψ_i|. (26.191)
The spectral theorem guarantees that any density operator ρ can be diagonalized in terms of its eigenvalues {λ_i} and eigenstates {|ϕ_i⟩}:
ρ = Σ_i λ_i |ϕ_i⟩⟨ϕ_i|, (26.192)
where λ_i ≥ 0, Σ_i λ_i = 1, and {|ϕ_i⟩} forms an orthonormal basis for H. We first give the definition
and functional calculus of Von Neumann Entropy. The von Neumann entropy S(ρ) of a quantum
state ρ is defined as:
S(ρ) = −Tr(ρ log ρ). (26.193)
Since ρ is a positive semi-definite operator, the logarithm of ρ is defined via its spectral decomposition. If
ρ = Σ_i λ_i |ϕ_i⟩⟨ϕ_i|, (26.194)
then:
log ρ = Σ_i (log λ_i) |ϕ_i⟩⟨ϕ_i|. (26.195)
Here, log λi is well-defined for λi > 0. By convention,
0 log 0 = 0, (26.196)
which is consistent with the limit lim_{x→0+} x log x = 0. The trace operation is linear and invariant under cyclic permutations. Using the spectral decomposition of ρ, we have:
S(ρ) = −Tr( (Σ_i λ_i |ϕ_i⟩⟨ϕ_i|) (Σ_j log λ_j |ϕ_j⟩⟨ϕ_j|) ) = −Σ_i λ_i log λ_i. (26.197)
This is the quantum analog of the Shannon entropy, where the eigenvalues {λ_i} of ρ play the role
of classical probabilities. There are many Mathematical Properties of Von Neumann Entropy. The
first of them is Non-negativity:
S(ρ) ≥ 0, (26.199)
with equality if and only if ρ is a pure state (i.e., ρ = |ψ⟩⟨ψ| for some |ψ⟩). For a d-dimensional Hilbert space H, the von Neumann entropy is maximized by the maximally mixed state ρ = I/d, where I is the identity operator on H. The maximum entropy is:
S(I/d) = log d. (26.200)
The von Neumann entropy is concave in ρ. For any set of density operators {ρ_i} and probabilities {p_i}, we have:
S( Σ_i p_i ρ_i ) ≥ Σ_i p_i S(ρ_i). (26.201)
This reflects the fact that mixing quantum states increases uncertainty. For a composite system
described by a product state ρAB = ρA ⊗ ρB , the entropy is additive:
S(ρAB ) = S(ρA ) + S(ρB ). (26.202)
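A short numerical illustration of the definition and of additivity (the density matrices below are made up for the example):

import numpy as np

# S(rho) = -Tr(rho log rho) computed through the spectral decomposition,
# checked for additivity on a product state rho_A (x) rho_B.
def von_neumann_entropy(rho):
    lam = np.linalg.eigvalsh(rho)                # eigenvalues of rho
    lam = lam[lam > 1e-12]                       # convention 0 log 0 = 0
    return -np.sum(lam * np.log2(lam))

rho_A = np.array([[0.75, 0.0], [0.0, 0.25]])
rho_B = np.array([[0.5, 0.25], [0.25, 0.5]])
rho_AB = np.kron(rho_A, rho_B)                   # product state rho_A (x) rho_B
print(von_neumann_entropy(rho_AB),
      von_neumann_entropy(rho_A) + von_neumann_entropy(rho_B))  # equal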
Physics: Maximum entropy methods are foundational in statistical mechanics. The maximum
entropy principle is a variational principle that selects the probability distribution {pi } over mi-
crostates i of a system by maximizing the Shannon entropy functional S[p], subject to a set of
constraints that encode known macroscopic information about the system. Regarding the Shannon
Entropy Functional, for a discrete probability distribution {p_i}, the Shannon entropy is defined as:
S[p] = −k_B Σ_{i∈M} p_i ln p_i, (26.203)
where M is the set of all microstates of the system, k_B is the Boltzmann constant, which ensures dimensional consistency with thermodynamic entropy, and p_i is the probability of the system being in microstate i, satisfying p_i ≥ 0 and Σ_i p_i = 1. For a continuous probability distribution p(x) over
a state space X, the entropy is defined as:
S[p] = −k_B ∫_X p(x) ln p(x) dx, (26.204)
where p(x) is a probability density function (PDF) satisfying p(x) ≥ 0 and ∫_X p(x) dx = 1. Regarding constraints and macroscopic observables, the system is subject to a set of m macroscopic constraints, expressed as expectation values of observables {A_k}_{k=1}^m. These constraints take the form:
⟨A_k⟩ = Σ_{i∈M} p_i A_k(i) = a_k, k = 1, 2, ..., m. (26.205)
To enforce normalization and the macroscopic constraints, we form the Lagrangian
L[p] = S[p] − λ_0 ( Σ_{i∈M} p_i − 1 ) − Σ_{k=1}^m λ_k ( Σ_{i∈M} p_i A_k(i) − a_k ),
where λ_0 is the Lagrange multiplier for the normalization constraint and λ_k are the Lagrange multipliers
for the macroscopic constraints. Regarding the functional derivative and stationarity condition, to find the extremum of L, we take the functional derivative of L with respect to p_i and set it to zero:
δL/δp_i = −k_B (ln p_i + 1) − λ_0 − Σ_{k=1}^m λ_k A_k(i) = 0. (26.207)
Solving for p_i:
ln p_i = −1 − λ_0/k_B − Σ_{k=1}^m (λ_k/k_B) A_k(i), (26.208)
and exponentiating gives
p_i = (1/Z) exp( −Σ_{k=1}^m (λ_k/k_B) A_k(i) ), (26.210)
where the partition function Z = Σ_{i∈M} exp( −Σ_{k=1}^m (λ_k/k_B) A_k(i) ) enforces normalization.
Regarding the identification of the Lagrange multipliers, the multipliers {λ_k} are determined by enforcing the constraints. For example, if A_1(i) = E_i (the energy of microstate i), then λ_1 = β = 1/(k_B T), where T is the temperature; and if
A_2(i) = N_i (26.211)
(the particle number of microstate i), the corresponding multiplier is λ_2 = −βμ, where μ is the chemical potential, recovering the grand canonical ensemble.
Since S[p] is strictly concave and the constraints are affine, the maximum entropy distribution is the unique global maximizer of S[p] subject to the constraints.
The maximum entropy principle is deeply connected to thermodynamics through the following relationships. The partition function Z is given by:
Z = Σ_i exp( −βE_i + βμN_i ), (26.214)
and the corresponding free energy is
F = −k_B T ln Z. (26.215)
• Gibbs’ Inequality: The Shannon entropy is maximized by the uniform distribution when
no constraints are imposed.
• Convex Duality: The Lagrange multipliers {λk } are dual variables that encode the sensi-
tivity of the entropy to changes in the constraints.
There are many applications of the maximum entropy principle in statistical mechanics; it is used to derive the equilibrium ensembles, including the microcanonical, canonical, and grand canonical distributions.
In summary, the maximum entropy methods in statistical mechanics are a rigorous and foundational
framework for inferring probability distributions based on limited information. They are deeply
rooted in information theory, convex optimization, and statistical physics, and they provide a
profound connection between microscopic dynamics and macroscopic thermodynamics.
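As a small worked illustration of the principle (the energy levels and the target mean energy below are invented for the example), the following Python sketch solves for the multiplier β so that the resulting Boltzmann distribution meets a prescribed mean-energy constraint:

import numpy as np
from scipy.optimize import brentq

# Maximum entropy under <E> = a_1 gives p_i proportional to exp(-beta * E_i);
# we solve for beta so that the constraint holds.
E = np.array([0.0, 1.0, 2.0, 3.0])               # microstate energies (assumed)
target = 1.2                                      # constraint <E> = a_1 (assumed)

def mean_energy(beta):
    w = np.exp(-beta * E)
    p = w / w.sum()                               # Boltzmann distribution
    return p @ E

beta = brentq(lambda b: mean_energy(b) - target, -50.0, 50.0)
w = np.exp(-beta * E)
p = w / w.sum()
print(beta, p, p @ E)                             # p maximizes S[p] s.t. <E> = 1.2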
The authors acknowledge the contributions of researchers whose foundational work has shaped our
understanding of Deep Learning.
Bibliography
[1] Rao, N., Farid, M., and Raiz, M. (2024). Symmetric Properties of λ-Szász Operators Coupled
with Generalized Beta Functions and Approximation Theory. Symmetry, 16(12), 1703.
[2] Mukhopadhyay, S.N., Ray, S. (2025). Function Spaces. In: Measure and Integration. University
Texts in the Mathematical Sciences. Springer, Singapore.
[3] Szoldra, T. (2024). Ergodicity breaking in quantum systems: from exact time evolution to
machine learning (Doctoral dissertation).
[4] SONG, W. X., CHEN, H., CUI, C., LIU, Y. F., TONG, D., GUO, F., ... and XIAO, C.
W. (2025). Theoretical, methodological, and implementation considerations for establishing a
sustainable urban renewal model. JOURNAL OF NATURAL RESOURCES, 40(1), 20-38.
[5] El Mennaoui, O., Kharou, Y., and Laasri, H. (2025). Evolution families in the framework of
maximal regularity. Evolution Equations and Control Theory, 0-0.
[6] Pedroza, G. (2024). On the Conditions for Domain Stability for Machine Learning: a Mathe-
matical Approach. arXiv preprint arXiv:2412.00464.
[7] Cerreia-Vioglio, S., and Ok, E. A. (2024). Abstract integration of set-valued functions. Journal
of Mathematical Analysis and Applications, 129169.
[8] Averin, A. (2024). Formulation and Proof of the Gravitational Entropy Bound. arXiv preprint
arXiv:2412.02470.
[9] Potter, T. (2025). Subspaces of L2 (Rn ) Invariant Under Crystallographic Shifts. arXiv e-prints,
arXiv-2501.
[11] Wang, R., Cai, L., Wu, Q., and Niyato, D. (2025). Service Function Chain Deployment with
Intrinsic Dynamic Defense Capability. IEEE Transactions on Mobile Computing.
[12] Duim, J. L., and Mesquita, D. P. (2025). Artificial Intelligence Value Alignment via Inverse
Reinforcement Learning. Proceeding Series of the Brazilian Society of Computational and
Applied Mathematics, 11(1), 1-2.
[13] Khayat, M., Barka, E., Serhani, M. A., Sallabi, F., Shuaib, K., and Khater, H. M. (2025).
Empowering Security Operation Center with Artificial Intelligence and Machine Learning–A
Systematic Literature Review. IEEE Access.
[14] Agrawal, R. (2025). 46 Detection of melanoma using DenseNet-based adaptive weighted loss
function. Emerging Trends in Computer Science and Its Application, 283.
[15] Hailemichael, H., and Ayalew, B. Adaptive and Safe Fast Charging of Lithium-Ion Batteries
Via Hybrid Model Learning and Control Barrier Functions. Available at SSRN 5110597.
614
BIBLIOGRAPHY 615
[16] Nguyen, E., Xiao, J., Fan, Z., and Ruan, D. Contrast-free Full Intracranial Vessel Geometry
Estimation from MRI with Metric Learning based Inference. In Medical Imaging with Deep
Learning.
[17] Luo, Z., Bi, Y., Yang, X., Li, Y., Wang, S., and Ye, Q. A Novel Machine Vision-Based Collision
Risk Warning Method for Unsignalized Intersections on Arterial Roads. Frontiers in Physics,
13, 1527956.
[18] Bousquet, N., Thomassé, S. (2015). VC-dimension and Erdős–Pósa property. Discrete Math-
ematics, 338(12), 2302-2317.
[19] Asian, O., Yildiz, O. T., Alpaydin, E. (2009, September). Calculating the VC-dimension of
decision trees. In 2009 24th International Symposium on Computer and Information Sciences
(pp. 193-198). IEEE.
[20] Zhang, C., Bian, W., Tao, D., Lin, W. (2012). Discretized-Vapnik-Chervonenkis dimension
for analyzing complexity of real function classes. IEEE transactions on neural networks and
learning systems, 23(9), 1461-1472.
[21] Riondato, M., Akdere, M., Çetintemel, U., Zdonik, S. B., Upfal, E. (2011). The VC-dimension
of SQL queries and selectivity estimation through sampling. In Machine Learning and Knowl-
edge Discovery in Databases: European Conference, ECML PKDD 2011, Athens, Greece,
September 5-9, 2011, Proceedings, Part II 22 (pp. 661-676). Springer Berlin Heidelberg.
[22] Bane, M., Riggle, J., Sonderegger, M. (2010). The VC dimension of constraint-based grammars.
Lingua, 120(5), 1194-1208.
[23] Anderson, A. (2023). Fuzzy VC Combinatorics and Distality in Continuous Logic. arXiv
preprint arXiv:2310.04393.
[24] Fox, J., Pach, J., Suk, A. (2021). Bounded VC-dimension implies the Schur-Erdős conjecture.
Combinatorica, 41(6), 803-813.
[25] Johnson, H. R. (2021). Binary strings of finite VC dimension. arXiv preprint arXiv:2101.06490.
[26] Janzing, D. (2018). Merging joint distributions via causal model classes with low VC dimension.
arXiv preprint arXiv:1804.03206.
[27] Hüllermeier, E., Fallah Tehrani, A. (2012, July). On the vc-dimension of the choquet integral.
In International Conference on Information Processing and Management of Uncertainty in
Knowledge-Based Systems (pp. 42-50). Berlin, Heidelberg: Springer Berlin Heidelberg.
[28] Mohri, M. (2018). Foundations of machine learning.
[29] Cucker, F., Zhou, D. X. (2007). Learning theory: an approximation theory viewpoint (Vol.
24). Cambridge University Press.
[30] Shalev-Shwartz, S., Ben-David, S. (2014). Understanding machine learning: From theory to
algorithms. Cambridge university press.
[31] Truong, L. V. (2022). On rademacher complexity-based generalization bounds for deep learn-
ing. arXiv preprint arXiv:2208.04284.
[32] Gnecco, G., and Sanguineti, M. (2008). Approximation error bounds via Rademacher com-
plexity. Applied Mathematical Sciences, 2, 153-176.
[33] Astashkin, S. V. (2010). Rademacher functions in symmetric spaces. Journal of Mathematical
Sciences, 169(6), 725-886.
616 BIBLIOGRAPHY
[34] Ying and Campbell (2010). Rademacher chaos complexities for learning the kernel problem.
Neural computation, 22(11), 2858-2886.
[35] Zhu, J., Gibson, B., and Rogers, T. T. (2009). Human rademacher complexity. Advances in
neural information processing systems, 22.
[36] Astashkin, S. V., Astashkin, S. V., and Mazlum. (2020). The Rademacher system in function
spaces. Basel: Birkhäuser.
[37] Sachs, S., van Erven, T., Hodgkinson, L., Khanna, R., and Şimşekli, U. (2023, July). Gener-
alization Guarantees via Algorithm-dependent Rademacher Complexity. In The Thirty Sixth
Annual Conference on Learning Theory (pp. 4863-4880). PMLR.
[38] Ma and Wang (2020). Rademacher complexity and the generalization error of residual net-
works. Communications in Mathematical Sciences, 18(6), 1755-1774.
[39] Bartlett, P. L., and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk
bounds and structural results. Journal of Machine Learning Research, 3(Nov), 463-482.
[40] Bartlett, P. L., and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk
bounds and structural results. Journal of Machine Learning Research, 3(Nov), 463-482.
[41] McDonald, D. J., and Shalizi, C. R. (2011). Rademacher complexity of stationary sequences.
arXiv preprint arXiv:1106.0730.
[43] Giang, T. H., Tri, N. M., and Tuan, D. A. (2024). On some Sobolev and Pólya-Sezgö type
inequalities with weights and applications. arXiv preprint arXiv:2412.15490.
[44] Ruiz, P. A., and Fragkiadaki, V. (2024). Fractional Sobolev embeddings and algebra property:
A dyadic view. arXiv preprint arXiv:2412.12051.
[45] Bilalov, B., Mamedov, E., Sezer, Y., and Nasibova, N. (2025). Compactness in Banach function
spaces: Poincaré and Friedrichs inequalities. Rendiconti del Circolo Matematico di Palermo
Series 2, 74(1), 68.
[46] Cheng, M., and Shao, K. (2025). Ground states of the inhomogeneous nonlinear fractional
Schrödinger-Poisson equations. Complex Variables and Elliptic Equations, 1-17.
[47] Wei, J., and Zhang, L. (2025). Ground State Solutions of Nehari-Pohozaev Type for
Schrödinger-Poisson Equation with Zero-Mass and Weighted Hardy Sobolev Subcritical Ex-
ponent. The Journal of Geometric Analysis, 35(2), 48.
[48] Zhang, X., and Qi, W. (2025). Multiplicity result on a class of nonhomogeneous quasilinear
elliptic system with small perturbations in RN . arXiv preprint arXiv:2501.01602.
[49] Xiao, J., and Yue, C. (2025). A Trace Principle for Fractional Laplacian with an Application
to Image Processing. La Matematica, 1-26.
[50] Pesce, A., and Portaro, S. (2025). Fractional Sobolev spaces related to an ultraparabolic op-
erator. arXiv preprint arXiv:2501.05898.
[52] Chen, H., Chen, H. G., and Li, J. N. (2024). Sharp embedding results and geometric inequalities
for Hö rmander vector fields. arXiv preprint arXiv:2404.19393.
[54] Brezis, H., and Brézis, H. (2011). Functional analysis, Sobolev spaces and partial differential
equations (Vol. 2, No. 3, p. 5). New York: Springer.
[55] Evans, L. C. (2022). Partial differential equations (Vol. 19). American Mathematical Society.
[56] Maz’â, V. G. (2011). Sobolev Spaces: With Applications to Elliptic Partial Differential Equa-
tions. Springer.
[57] Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are
universal approximators. Neural networks, 2(5), 359-366.
[59] Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal func-
tion. IEEE Transactions on Information theory, 39(3), 930-945.
[60] Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta numerica,
8, 143-195.
[61] Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. (2017). The expressive power of neural
networks: A view from the width. Advances in neural information processing systems, 30.
[62] Hanin, B., and Sellke, M. (2017). Approximating continuous functions by relu nets of minimal
width. arXiv preprint arXiv:1710.11278.
[63] Garcıa-Cervera, C. J., Kessler, M., Pedregal, P., and Periago, F. Universal approximation of
set-valued maps and DeepONet approximation of the controllability map.
[64] Majee, S., Abhishek, A., Strauss, T., and Khan, T. (2024). MCMC-Net: Accelerating
Markov Chain Monte Carlo with Neural Networks for Inverse Problems. arXiv preprint
arXiv:2412.16883.
[65] Toscano, J. D., Wang, L. L., and Karniadakis, G. E. (2024). KKANs: Kurkova-Kolmogorov-
Arnold Networks and Their Learning Dynamics. arXiv preprint arXiv:2412.16738.
[67] Rudin, W. (1964). Principles of mathematical analysis (Vol. 3). New York: McGraw-hill.
[68] Stein, E. M., and Shakarchi, R. (2009). Real analysis: measure theory, integration, and Hilbert
spaces. Princeton University Press.
[71] Folland, G. B. (1999). Real analysis: modern techniques and their applications (Vol. 40). John
Wiley and Sons.
[72] Sugiura, S. (2024). On the Universality of Reservoir Computing for Uniform Approximation.
618 BIBLIOGRAPHY
[73] LIU, Y., LIU, S., HUANG, Z., and ZHOU, P. NORMED MODULES AND THE CATEGORI-
FICATION OF INTEGRATIONS, SERIES EXPANSIONS, AND DIFFERENTIATIONS.
[75] Chang, S. Y., and Wei, Y. (2024). Generalized Choi–Davis–Jensen’s Operator Inequalities and
Their Applications. Symmetry, 16(9), 1176.
[76] Caballer, M., Dantas, S., and Rodrı́guez-Vidanes, D. L. (2024). Searching for linear structures
in the failure of the Stone-Weierstrass theorem. arXiv preprint arXiv:2405.06453.
[77] Chen, D. (2024). The Machado–Bishop theorem in the uniform topology. Journal of Approxi-
mation Theory, 304, 106085.
[78] Rafiei, H., and Akbarzadeh-T, M. R. (2024). Hedge-embedded Linguistic Fuzzy Neural Net-
works for Systems Identification and Control. IEEE Transactions on Artificial Intelligence.
[81] Lorentz, G. G. (1966). Approximation of functions, athena series. Selected Topics in Mathe-
matics.
[82] Guilhoto, L. F., and Perdikaris, P. (2024). Deep learning alternatives of the Kolmogorov su-
perposition theorem. arXiv preprint arXiv:2410.01990.
[83] Alhafiz, M. R., Zakaria, K., Dung, D. V., Palar, P. S., Dwianto, Y. B., and Zuhal, L. R. (2025).
Kolmogorov-Arnold Networks for Data-Driven Turbulence Modeling. In AIAA SCITECH 2025
Forum (p. 2047).
[84] Lorencin, I., Mrzljak, V., Poljak, I., and Etinger, D. (2024, September). Prediction of CODLAG
Propulsion System Parameters Using Kolmogorov-Arnold Network. In 2024 IEEE 22nd Jubilee
International Symposium on Intelligent Systems and Informatics (SISY) (pp. 173-178). IEEE.
[85] Trevisan, D., Cassara, P., Agazzi, A., and Scardera, S. NTK Analysis of Knowledge Distilla-
tion.
[86] Bonfanti, A., Bruno, G., and Cipriani, C. (2024). The Challenges of the Nonlinear Regime for
Physics-Informed Neural Networks. arXiv preprint arXiv:2402.03864.
[87] Jacot, A., Gabriel, F., and Hongler, C. (2018). Neural tangent kernel: Convergence and gen-
eralization in neural networks. Advances in neural information processing systems, 31.
[88] Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington,
J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent.
Advances in neural information processing systems, 32.
[89] Yang, G., and Hu, E. J. (2020). Feature learning in infinite-width neural networks. arXiv
preprint arXiv:2011.14522.
[90] Xiang, L., Dudziak, L., Abdelfattah, M. S., Chau, T., Lane, N. D., and Wen, H. (2021). Zero-
Cost Operation Scoring in Differentiable Architecture Search. arXiv preprint arXiv:2106.06799.
BIBLIOGRAPHY 619
[91] Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington,
J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent.
Advances in neural information processing systems, 32.
[92] McAllester, D. A. (1999, July). PAC-Bayesian model averaging. In Proceedings of the twelfth
annual conference on Computational learning theory (pp. 164-170).
[93] Catoni, O. (2007). PAC-Bayesian supervised classification: the thermodynamics of statistical
learning. arXiv preprint arXiv:0712.0248.
[94] Germain, P., Lacasse, A., Laviolette, F., and Marchand, M. (2009, June). PAC-Bayesian
learning of linear classifiers. In Proceedings of the 26th Annual International Conference on
Machine Learning (pp. 353-360).
[95] Seeger, M. (2002). PAC-Bayesian generalisation error bounds for Gaussian process classifica-
tion. Journal of machine learning research, 3(Oct), 233-269.
[96] Alquier, P., Ridgway, J., and Chopin, N. (2016). On the properties of variational approxima-
tions of Gibbs posteriors. Journal of Machine Learning Research, 17(236), 1-41.
[97] Dziugaite, G. K., and Roy, D. M. (2017). Computing nonvacuous generalization bounds for
deep (stochastic) neural networks with many more parameters than training data. arXiv
preprint arXiv:1703.11008.
[98] Rivasplata, O., Kuzborskij, I., Szepesvári, C., and Shawe-Taylor, J. (2020). PAC-Bayes analysis
beyond the usual bounds. Advances in Neural Information Processing Systems, 33, 16833-
16845.
[99] Lever, G., Laviolette, F., and Shawe-Taylor, J. (2013). Tighter PAC-Bayes bounds through
distribution-dependent priors. Theoretical Computer Science, 473, 4-28.
[100] Rivasplata, O., Parrado-Hernández, E., Shawe-Taylor, J. S., Sun, S., and Szepesvári, C.
(2018). PAC-Bayes bounds for stable algorithms with instance-dependent priors. Advances in
Neural Information Processing Systems, 31.
[101] Lindemann, L., Zhao, Y., Yu, X., Pappas, G. J., and Deshmukh, J. V. (2024). Formal
verification and control with conformal prediction. arXiv preprint arXiv:2409.00536.
[102] Jin, G., Wu, S., Liu, J., Huang, T., and Mu, R. (2025). Enhancing Robust Fairness via
Confusional Spectral Regularization. arXiv preprint arXiv:2501.13273.
[103] Ye, F., Xiao, J., Ma, W., Jin, S., and Yang, Y. (2025). Detecting small clusters in the
stochastic block model. Statistical Papers, 66(2), 37.
[104] Bhattacharjee, A., and Bharadwaj, P. (2025). Coherent Spectral Feature Extraction Using
Symmetric Autoencoders. IEEE Journal of Selected Topics in Applied Earth Observations and
Remote Sensing.
[105] Wu, Q., Hu, B., Liu, C. et al. (2025). Velocity Analysis Using High-resolution Hyperbolic
Radon Transform with Lq1 − Lq2 Regularization. Pure Appl. Geophys.
[106] Ortega, I., Hannigan, J. W., Baier, B. C., McKain, K., and Smale, D. (2025). Advancing CH
4 and N 2 O retrieval strategies for NDACC/IRWG high-resolution direct-sun FTIR Observa-
tions. EGUsphere, 2025, 1-32.
[107] Kazmi, S. H. A., Hassan, R., Qamar, F., Nisar, K., and Al-Betar, M. A. (2025). Federated
Conditional Variational Auto Encoders for Cyber Threat Intelligence: Tackling Non-IID Data
in SDN Environments. IEEE Access.
620 BIBLIOGRAPHY
[108] Zhao, Y., Bi, Z., Zhu, P., Yuan, A., and Li, X. (2025). Deep Spectral Clustering with Projected
Adaptive Feature Selection. IEEE Transactions on Geoscience and Remote Sensing.
[109] Saranya, S., and Menaka, R. (2025). A Quantum-Based Machine Learning Approach for
Autism Detection using Common Spatial Patterns of EEG Signals. IEEE Access.
[110] Dhalbisoi, S., Mohapatra, A., and Rout, A. (2024, March). Design of Cell-Free Massive
MIMO for Beyond 5G Systems with MMSE and RZF Processing. In International Conference
on Machine Learning, IoT and Big Data (pp. 263-273). Singapore: Springer Nature Singapore.
[111] Wei, C., Li, Z., Hu, T., Zhao, M., Sun, Z., Jia, K., ... and Jiang, S. (2025). Model-based
convolution neural network for 3D Near-infrared spectral tomography. IEEE Transactions on
Medical Imaging.
[113] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... and
Bengio, Y. (2020). Generative adversarial networks. Communications of the ACM, 63(11),
139-144.
[114] Haykin, S. (2009). Neural networks and learning machines, 3/E. Pearson Education India.
[116] Bishop, C. M., and Nasrabadi, N. M. (2006). Pattern recognition and machine learning (Vol.
4, No. 4, p. 738). New York: springer.
[117] Poggio, T., and Smale, S. (2003). The mathematics of learning: Dealing with data. Notices
of the AMS, 50(5), 537-544.
[118] LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. nature, 521(7553), 436-444.
[119] Tishby, N., and Zaslavsky, N. (2015, April). Deep learning and the information bottleneck
principle. In 2015 ieee information theory workshop (itw) (pp. 1-5). IEEE.
[120] Sorrenson, P. (2025). Free-Form Flows: Generative Models for Scientific Applications (Doc-
toral dissertation).
[121] Liu, W., and Shi, X. (2025). An Enhanced Neural Network Forecasting System for the July
Precipitation over the Middle-Lower Reaches of the Yangtze River.
[122] Das, P., Mondal, D., Islam, M. A., Al Mohotadi, M. A., and Roy, P. C. (2025). Analyti-
cal Finite-Integral-Transform and Gradient-Enhanced Machine Learning Approach for Ther-
moelastic Analysis of FGM Spherical Structures with Arbitrary Properties. Theoretical and
Applied Mechanics Letters, 100576.
[123] Zhang, R. (2025). Physics-informed Parallel Neural Networks for the Identification of Con-
tinuous Structural Systems.
[124] Ali, S., and Hussain, A. (2025). A neuro-intelligent heuristic approach for performance pre-
diction of triangular fuzzy flow system. Proceedings of the Institution of Mechanical Engineers,
Part N: Journal of Nanomaterials, Nanoengineering and Nanosystems, 23977914241310569.
[125] Li, S. (2025). Scalable, generalizable, and offline methods for imperfect-information extensive-
form games.
[126] Hu, T., Jin, B., and Wang, F. (2025). An Iterative Deep Ritz Method for Monotone Elliptic
Problems. Journal of Computational Physics, 113791.
BIBLIOGRAPHY 621
[127] Chen, P., Zhang, A., Zhang, S., Dong, T., Zeng, X., Chen, S., ... and Zhou, Q. (2025).
Maritime near-miss prediction framework and model interpretation analysis method based on
Transformer neural network model with multi-task classification variables. Reliability Engi-
neering and System Safety, 110845.
[128] Sun, G., Liu, Z., Gan, L., Su, H., Li, T., Zhao, W., and Sun, B. (2025). SpikeNAS-Bench:
Benchmarking NAS Algorithms for Spiking Neural Network Architecture. IEEE Transactions
on Artificial Intelligence.
[129] Zhang, Z., Wang, X., Shen, J., Zhang, M., Yang, S., Zhao, W., ... and Wang, J. (2025).
Unfixed Bias Iterator: A New Iterative Format. IEEE Access.
[130] Rosa, G. J. (2010). The Elements of Statistical Learning: Data Mining, Inference, and Pre-
diction by HASTIE, T., TIBSHIRANI, R., and FRIEDMAN, J.
[132] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014).
Dropout: a simple way to prevent neural networks from overfitting. The journal of machine
learning research, 15(1), 1929-1958.
[133] Zou, H., and Hastie, T. (2005). Regularization and variable selection via the elastic net.
Journal of the Royal Statistical Society Series B: Statistical Methodology, 67(2), 301-320.
[134] Vapnik, V. (2013). The nature of statistical learning theory. Springer science and business
media.
[135] Ng, A. Y. (2004, July). Feature selection, L 1 vs. L 2 regularization, and rotational invariance.
In Proceedings of the twenty-first international conference on Machine learning (p. 78).
[136] Li, T. (2025). Optimization of Clinical Trial Strategies for Anti-HER2 Drugs Based on
Bayesian Optimization and Deep Learning.
[137] Yasuda, M., and Sekimoto, K. (2024). Gaussian-discrete restricted Boltzmann machine with
sparse-regularized hidden layer. Behaviormetrika, 1-19.
[138] Xiaodong Luo, William C. Cruz, Xin-Lei Zhang, Heng Xiao, (2023), Hyper-parameter op-
timization for improving the performance of localization in an iterative ensemble smoother,
Geoenergy Science and Engineering, Volume 231, Part B, 212404
[139] Alrayes, F.S., Maray, M., Alshuhail, A. et al. (2025) Privacy-preserving approach for IoT
networks using statistical learning with optimization algorithm on high-dimensional big data
environment. Sci Rep 15, 3338. https://doi.org/10.1038/s41598-025-87454-1
[140] Cho, H., Kim, Y., Lee, E., Choi, D., Lee, Y., and Rhee, W. (2020). Basic enhancement
strategies when using Bayesian optimization for hyperparameter tuning of deep neural net-
works. IEEE access, 8, 52588-52608.
[142] Abdel-salam, M., Elhoseny, M. and El-hasnony, I.M. Intelligent and Secure Evolved Frame-
work for Vaccine Supply Chain Management Using Machine Learning and Blockchain. SN
COMPUT. SCI. 6, 121 (2025). https://doi.org/10.1007/s42979-024-03609-3
622 BIBLIOGRAPHY
[143] Vali, M. H. (2025). Vector quantization in deep neural networks for speech and image pro-
cessing.
[145] Razavi-Termeh, S. V., Sadeghi-Niaraki, A., Ali, F., and Choi, S. M. (2025). Improving flood-
prone areas mapping using geospatial artificial intelligence (GeoAI): A non-parametric algo-
rithm enhanced by math-based metaheuristic algorithms. Journal of Environmental Manage-
ment, 375, 124238.
[146] Kiran, M., and Ozyildirim, M. (2022). Hyperparameter tuning for deep reinforcement learning
applications. arXiv preprint arXiv:2201.11182.
[147] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep
convolutional neural networks. Advances in neural information processing systems, 25.
[148] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2017). ImageNet classification with deep
convolutional neural networks. Communications of the ACM, 60(6), 84-90.
[149] Simonyan, K., and Zisserman, A. (2014). Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556.
[150] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-
778).
[151] Cohen, T., and Welling, M. (2016, June). Group equivariant convolutional networks. In In-
ternational conference on machine learning (pp. 2990-2999). PMLR.
[152] Zeiler, M. D., and Fergus, R. (2014). Visualizing and understanding convolutional networks.
In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September
6-12, 2014, Proceedings, Part I 13 (pp. 818-833). Springer International Publishing.
[153] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., ... and Guo, B. (2021). Swin transformer:
Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF in-
ternational conference on computer vision (pp. 10012-10022).
[155] Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by
back-propagating errors. nature, 323(6088), 533-536.
[156] Bensaid, B., Poëtte, G., and Turpault, R. (2024). Convergence of the Iterates for Mo-
mentum and RMSProp for Local Smooth Functions: Adaptation is the Key. arXiv preprint
arXiv:2407.15471.
[157] Liu, Q., and Ma, W. (2024). The Epochal Sawtooth Effect: Unveiling Training Loss Oscilla-
tions in Adam and Other Optimizers. arXiv preprint arXiv:2410.10056.
[158] Li, H. (2024). Smoothness and Adaptivity in Nonlinear Optimization for Machine Learning
Applications (Doctoral dissertation, Massachusetts Institute of Technology).
[159] Heredia, C. (2024). Modeling AdaGrad, RMSProp, and Adam with Integro-Differential Equa-
tions. arXiv preprint arXiv:2411.09734.
BIBLIOGRAPHY 623
[160] Ye, Q. (2024). Preconditioning for Accelerated Gradient Descent Optimization and Regular-
ization. arXiv preprint arXiv:2410.00232.
[161] Compagnoni, E. M., Liu, T., Islamov, R., Proske, F. N., Orvieto, A., and Lucchi, A. (2024).
Adaptive Methods through the Lens of SDEs: Theoretical Insights on the Role of Noise. arXiv
preprint arXiv:2411.15958.
[162] Yao, B., Zhang, Q., Feng, R., and Wang, X. (2024). System response curve based first-order
optimization algorithms for cyber-physical-social intelligence. Concurrency and Computation:
Practice and Experience, 36(21), e8197.
[163] Wen, X., and Lei, Y. (2024, June). A Fast ADMM Framework for Training Deep Neural
Networks Without Gradients. In 2024 International Joint Conference on Neural Networks
(IJCNN) (pp. 1-8). IEEE.
[164] Hannibal, S., Jentzen, A., and Thang, D. M. (2024). Non-convergence to global minimizers
in data driven supervised deep learning: Adam and stochastic gradient descent optimization
provably fail to converge to global minimizers in the training of deep neural networks with
ReLU activation. arXiv preprint arXiv:2410.10533.
[165] Yang, Z. (2025). Adaptive Biased Stochastic Optimization. IEEE Transactions on Pattern
Analysis and Machine Intelligence.
[166] Kingma, D. P., and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980.
[167] Reddi, S. J., Kale, S., and Kumar, S. (2019). On the convergence of adam and beyond. arXiv
preprint arXiv:1904.09237.
[168] Jin, L., Nong, H., Chen, L., and Su, Z. (2024). A Method for Enhancing Generalization of
Adam by Multiple Integrations. arXiv preprint arXiv:2412.12473.
[169] Adly, A. M. (2024). EXAdam: The Power of Adaptive Cross-Moments. arXiv preprint
arXiv:2412.20302.
[170] Liu, Y., Cao, Y., and Lin, J. Convergence Analysis of the ADAM Algorithm for Linear Inverse
Problems.
[171] Yang, Z. (2025). Adaptive Biased Stochastic Optimization. IEEE Transactions on Pattern
Analysis and Machine Intelligence.
[172] Park, K., and Lee, S. (2024). SMMF: Square-Matricized Momentum Factorization for
Memory-Efficient Optimization. arXiv preprint arXiv:2412.08894.
[173] Mahjoubi, M. A., Lamrani, D., Saleh, S., Moutaouakil, W., Ouhmida, A., Hamida, S., ... and
Raihani, A. (2025). Optimizing ResNet50 Performance Using Stochastic Gradient Descent on
MRI Images for Alzheimer’s Disease Classification. Intelligence-Based Medicine, 100219.
[174] Seini, A. B., and Adam, I. O. (2024). HUMAN-AI COLLABORATION FOR ADAPTIVE
WORKING AND LEARNING OUTCOMES: AN ACTIVITY THEORY PERSPECTIVE.
[176] Lauand, C. K., and Meyn, S. (2025). Markovian Foundations for Quasi-Stochastic Approxi-
mation. SIAM Journal on Control and Optimization, 63(1), 402-430.
624 BIBLIOGRAPHY
[177] Maranjyan, A., Tyurin, A., and Richtárik, P. (2025). Ringmaster ASGD: The First Asyn-
chronous SGD with Optimal Time Complexity. arXiv preprint arXiv:2501.16168.
[178] Gao, Z., and Gündüz, D. (2025). Graph Neural Networks over the Air for Decentralized Tasks
in Wireless Networks. IEEE Transactions on Signal Processing.
[179] Yoon, T., Choudhury, S., and Loizou, N. (2025). Multiplayer Federated Learning: Reaching
Equilibrium with Less Communication. arXiv preprint arXiv:2501.08263.
[180] Verma, K., and Maiti, A. (2025). Sine and cosine based learning rate for gradient descent
method. Applied Intelligence, 55(5), 352.
[181] Borowski, M., and Miasojedow, B. (2025). Convergence of projected stochastic approximation
algorithm. arXiv e-prints, arXiv-2501.
[182] Dong, K., Chen, S., Dan, Y., Zhang, L., Li, X., Liang, W., ... and Sun, Y. (2025). A new
perspective on brain stimulation interventions: Optimal stochastic tracking control of brain
network dynamics. arXiv preprint arXiv:2501.08567.
[183] Jiang, Y., Kang, H., Liu, J., and Xu, D. (2025). On the Convergence of Decentralized Stochas-
tic Gradient Descent with Biased Gradients. IEEE Transactions on Signal Processing.
[184] Sonobe, N., Momozaki, T., and Nakagawa, T. (2025). Sampling from Density power
divergence-based Generalized posterior distribution via Stochastic optimization. arXiv preprint
arXiv:2501.07790.
[185] Zhang, X., and Jia, G. (2025). Convergence of Policy Gradient for Stochastic Linear Quadratic
Optimal Control Problems in Infinite Horizon. Journal of Mathematical Analysis and Appli-
cations, 129264.
[186] Thiriveedhi, A., Ghanta, S., Biswas, S., and Pradhan, A. K. (2025). ALL-Net: integrating
CNN and explainable-AI for enhanced diagnosis and interpretation of acute lymphoblastic
leukemia. PeerJ Computer Science, 11, e2600.
[187] Ramos-Briceño, D. A., Flammia-D’Aleo, A., Fernández-López, G., Carrión-Nessi, F. S., and
Forero-Peña, D. A. (2025). Deep learning-based malaria parasite detection: convolutional
neural networks model for accurate species identification of Plasmodium falciparum and Plas-
modium vivax. Scientific Reports, 15(1), 3746.
[188] Espino-Salinas, C. H., Luna-Garcı́a, H., Cepeda-Argüelles, A., Trejo-Vázquez, K., Flores-
Chaires, L. A., Mercado Reyna, J., ... and Villalba-Condori, K. O. (2025). Convolutional
Neural Network for Depression and Schizophrenia Detection. Diagnostics, 15(3), 319.
[189] Ran, T., Huang, W., Qin, X., Xie, X., Deng, Y., Pan, Y., ... and Zou, D. (2025). Liquid-
based cytological diagnosis of pancreatic neuroendocrine tumors using hyperspectral imaging
and deep learning. EngMedicine, 2(1), 100059.
[190] Araujo, B. V. S., Rodrigues, G. A., de Oliveira, J. H. P., Xavier, G. V. R., Lebre, U., Cordeiro,
C., ... and Ferreira, T. V. (2025). Monitoring ZnO surge arresters using convolutional neural
networks and image processing techniques combined with signal alignment. Measurement,
116889.
[191] Sari, I. P., Elvitaria, L., and Rudiansyah, R. (2025). Data-driven approach for batik pattern
classification using convolutional neural network (CNN). Jurnal Mandiri IT, 13(3), 323-331.
[192] Wang, D., An, K., Mo, Y., Zhang, H., Guo, W., and Wang, B. Cf-Wiad: Consistency Fusion
with Weighted Instance and Adaptive Distribution for Enhanced Semi-Supervised Skin Lesion
Classification. Available at SSRN 5109182.
[193] Cai, P., Zhang, Y., He, H., Lei, Z., and Gao, S. (2025). DFNet: A Differential Feature-
Incorporated Residual Network for Image Recognition. Journal of Bionic Engineering, 1-14.
[194] Vishwakarma, A. K., and Deshmukh, M. (2025). CNNM-FDI: Novel Convolutional Neural
Network Model for Fire Detection in Images. IETE Journal of Research, 1-14.
[195] Ranjan, P., Kaushal, A., Girdhar, A., and Kumar, R. (2025). Revolutionizing hyperspec-
tral image classification for limited labeled data: unifying autoencoder-enhanced GANs with
convolutional neural networks and zero-shot learning. Earth Science Informatics, 18(2), 1-26.
[196] Naseer, A., and Jalal, A. Multimodal Deep Learning Framework for Enhanced Semantic
Scene Classification Using RGB-D Images.
[197] Wang, Z., and Wang, J. (2025). Personalized Icon Design Model Based on Improved Faster-
RCNN. Systems and Soft Computing, 200193.
[198] Ramana, R., Vasudevan, V., and Murugan, B. S. (2025). Spectral Pyramid Pooling and
Fused Keypoint Generation in ResNet-50 for Robust 3D Object Detection. IETE Journal of
Research, 1-13.
[199] Shin, S., Land, O., Seider, W., Lee, J., and Lee, D. (2025). Artificial Intelligence-Empowered
Automated Double Emulsion Droplet Library Generation.
[200] Taca, B. S., Lau, D., and Rieder, R. (2025). A comparative study between deep learning
approaches for aphid classification. IEEE Latin America Transactions, 23(3), 198-204.
[201] Ulaş, B., Szklenár, T., and Szabó, R. (2025). Detection of Oscillation-like Patterns in Eclipsing
Binary Light Curves using Neural Network-based Object Detection Algorithms. arXiv preprint
arXiv:2501.17538.
[202] Valensi, D., Lupu, L., Adam, D., and Topilsky, Y. Semi-Supervised Learning, Foundation Models and Image Processing for Pleural Line Detection and Segmentation in Lung Ultrasound.
[203] V, A., V, P., and Kumar, D. (2025). An effective object detection via BS2ResNet and LTK-Bi-LSTM. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-024-20433-2
[204] Zhu, X., Chen, W., and Jiang, Q. (2025). High-transferability black-box attack of binary
image segmentation via adversarial example augmentation. Displays, 102957.
[205] Guo, X., Zhu, Y., Li, S., Wu, S., and Liu, S. (2025). Research and Implementation of Agro-
nomic Entity and Attribute Extraction Based on Target Localization. Agronomy, 15(2), 354.
[206] Yousif, M., Jassam, N. M., Salim, A., Bardan, H. A., Mutlak, A. F., Sallibi, A. D., and
Ataalla, A. F. Melanoma Skin Cancer Detection Using Deep Learning Methods and Binary
GWO Algorithm.
[207] Rahman, S. I. U., Abbas, N., Ali, S., Salman, M., Alkhayat, A., Khan, J., ... and Gu,
Y. H. (2025). Deep Learning and Artificial Intelligence-Driven Advanced Methods for Acute
Lymphoblastic Leukemia Identification and Classification: A Systematic Review. Comput
Model Eng Sci, 142(2).
[208] Pratap Joshi, K., Gowda, V. B., Bidare Divakarachari, P., Siddappa Parameshwarappa, P.,
and Patra, R. K. (2025). VSA-GCNN: Attention Guided Graph Neural Networks for Brain
Tumor Segmentation and Classification. Big Data and Cognitive Computing, 9(2), 29.
[209] Ng, B., Eyre, K., and Chetrit, M. (2025). Prediction of ischemic cardiomyopathy using a
deep neural network with non-contrast cine cardiac magnetic resonance images. Journal of
Cardiovascular Magnetic Resonance, 27.
[210] Nguyen, H. T., Lam, T. B., Truong, T. T. N., Duong, T. D., and Dinh, V. Q. Mv-Trams:
An Efficient Tumor Region-Adapted Mammography Synthesis Under Multi-View Diagnosis.
Available at SSRN 5109180.
[211] Chen, W., Xu, T., and Zhou, W. (2025). Task-based Regularization in Penalized Least-
Squares for Binary Signal Detection Tasks in Medical Image Denoising. arXiv preprint
arXiv:2501.18418.
[212] Pradhan, P. D., Talmale, G., and Wazalwar, S. Deep dive into precision (DDiP): Unleashing
advanced deep learning approaches in diabetic retinopathy research for enhanced detection
and classification of retinal abnormalities. In Recent Advances in Sciences, Engineering, Infor-
mation Technology and Management (pp. 518-530). CRC Press.
[213] Örenç, S., Acar, E., Özerdem, M. S., Şahin, S., and Kaya, A. (2025). Automatic Identifica-
tion of Adenoid Hypertrophy via Ensemble Deep Learning Models Employing X-ray Adenoid
Images. Journal of Imaging Informatics in Medicine, 1-15.
[214] Jiang, M., Wang, S., Chan, K. H., Sun, Y., Xu, Y., Zhang, Z., ... and Tan, T. (2025).
Multimodal Cross Global Learnable Attention Network for MR images denoising with arbitrary
modal missing. Computerized Medical Imaging and Graphics, 102497.
[215] Al-Haidri, W., Levchuk, A., Zotov, N., Belousova, K., Ryzhkov, A., Fokin, V., ... and Brui,
E. (2025). Quantitative analysis of myocardial fibrosis using a deep learning-based framework
applied to the 17-Segment model. Biomedical Signal Processing and Control, 105, 107555.
[216] Osorio, S. L. J., Ruiz, M. A. R., Mendez-Vazquez, A., and Rodriguez-Tello, E. (2024). Fourier
Series Guided Design of Quantum Convolutional Neural Networks for Enhanced Time Series
Forecasting. arXiv preprint arXiv:2404.15377.
[217] Umeano, C., and Kyriienko, O. (2024). Ground state-based quantum feature maps. arXiv
preprint arXiv:2404.07174.
[218] Liu, N., He, X., Laurent, T., Di Giovanni, F., Bronstein, M. M., and Bresson, X. (2024).
Advancing Graph Convolutional Networks via General Spectral Wavelets. arXiv preprint
arXiv:2405.13806.
[219] Vlasic, A. (2024). Quantum Circuits, Feature Maps, and Expanded Pseudo-Entropy: A Cat-
egorical Theoretic Analysis of Encoding Real-World Data into a Quantum Computer. arXiv
preprint arXiv:2410.22084.
[220] Kim, M., Hioka, Y., and Witbrock, M. (2024). Neural Fourier Modelling: A Highly Compact
Approach to Time-Series Analysis. arXiv preprint arXiv:2410.04703.
[221] Xie, Y., Daigavane, A., Kotak, M., and Smidt, T. (2024). The price of freedom: Exploring
tradeoffs between expressivity and computational efficiency in equivariant tensor products.
In ICML 2024 Workshop on Geometry-grounded Representation Learning and Generative
Modeling.
[222] Liu, G., Wei, Z., Zhang, H., Wang, R., Yuan, A., Liu, C., ... and Cao, G. (2024, April).
Extending Implicit Neural Representations for Text-to-Image Generation. In ICASSP 2024-
2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
(pp. 3650-3654). IEEE.
[223] Zhang, M. (2024). Lock-in spectrum: a tool for representing long-term evolution of bearing
fault in the time–frequency domain using vibration signal. Sensor Review, 44(5), 598-610.
[224] Hamed, M., and Lachiri, Z. (2024, July). Expressivity Transfer In Transformer-Based Text-
To-Speech Synthesis. In 2024 IEEE 7th International Conference on Advanced Technologies,
Signal and Image Processing (ATSIP) (Vol. 1, pp. 443-448). IEEE.
[225] Lehmann, F., Gatti, F., Bertin, M., Grenié, D., and Clouteau, D. (2024). Uncertainty prop-
agation from crustal geologies to rock-site ground motion with a Fourier Neural Operator.
European Journal of Environmental and Civil Engineering, 28(13), 3088-3105.
[227] Manning, C., and Schütze, H. (1999). Foundations of statistical natural language processing. MIT Press.
[228] Liu, Y., and Zhang, M. (2018). Neural network methods for natural language processing.
[229] Allen, J. (1988). Natural language understanding. Benjamin-Cummings Publishing Co., Inc..
[230] Li, Z., Zhao, Y., Zhang, X., Han, H., and Huang, C. (2025). Word embedding factor based
multi-head attention. Artificial Intelligence Review, 58(4), 1-21.
[231] Hempelmann, C. F., Rayz, J., Dong, T., and Miller, T. (2025, January). Proceedings of
the 1st Workshop on Computational Humor (CHum). In Proceedings of the 1st Workshop on
Computational Humor (CHum).
[233] Eisenstein, J. (2019). Introduction to natural language processing. The MIT Press.
[234] Otter, D. W., Medina, J. R., and Kalita, J. K. (2020). A survey of the usages of deep learning
for natural language processing. IEEE transactions on neural networks and learning systems,
32(2), 604-624.
[235] Mitkov, R. (Ed.). (2022). The Oxford handbook of computational linguistics. Oxford univer-
sity press.
[236] Liu, X., Tao, Z., Jiang, T., Chang, H., Ma, Y., and Huang, X. (2024). ToDA: Target-oriented
Diffusion Attacker against Recommendation System. arXiv preprint arXiv:2401.12578.
[237] Çekik, R. (2025). Effective Text Classification Through Supervised Rough Set-Based Term
Weighting. Symmetry, 17(1), 90.
[238] Zhu, H., Xia, J., Liu, R., and Deng, B. (2025). SPIRIT: Structural Entropy Guided Prefix
Tuning for Hierarchical Text Classification. Entropy, 27(2), 128.
[239] Matrane, Y., Benabbou, F., and Ellaky, Z. (2024). Enhancing Moroccan Dialect Sentiment
Analysis through Optimized Preprocessing and transfer learning Techniques. IEEE Access.
[240] Moqbel, M., and Jain, A. (2025). Mining the truth: A text mining approach to understand-
ing perceived deceptive counterfeits and online ratings. Journal of Retailing and Consumer
Services, 84, 104149.
[241] Kumar, V., Iqbal, M. I., and Rathore, R. (2025). Natural Language Processing (NLP) in
Disease Detection—A Discussion of How NLP Techniques Can Be Used to Analyze and Clas-
sify Medical Text Data for Disease Diagnosis. AI in Disease Detection: Advancements and
Applications, 53-75.
[242] Yin, S. (2024). The Current State and Challenges of Aspect-Based Sentiment Analysis. Ap-
plied and Computational Engineering, 114, 25-31.
[243] Raghavan, M. (2024). Are you who AI says you are? Exploring the role of Natural Language
Processing algorithms for “predicting” personality traits from text (Doctoral dissertation, Uni-
versity of South Florida).
[244] Semeraro, A., Vilella, S., Improta, R., De Duro, E. S., Mohammad, S. M., Ruffo, G., and
Stella, M. (2025). EmoAtlas: An emotional network analyzer of texts that merges psychological
lexicons, artificial intelligence, and network science. Behavior Research Methods, 57(2), 77.
[245] Cai, F., and Liu, X. Data Analytics for Discourse Analysis with Python: The Case of Therapy Talk, by Dennis Tay. New York: Routledge, 2024. ISBN: 9781032419015 (HB: USD 41.24), xiii + 182 pages. Natural Language Processing, 1-4.
[246] Wu, Y. (2016). Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
[247] Hettiarachchi, H., Ranasinghe, T., Rayson, P., Mitkov, R., Gaber, M., Premasiri, D., ... and
Uyangodage, L. (2024). Overview of the First Workshop on Language Models for Low-Resource
Languages (LoResLM 2025). arXiv preprint arXiv:2412.16365.
[248] Das, B. R., and Sahoo, R. (2024). Word Alignment in Statistical Machine Translation: Issues
and Challenges. Nov Joun of Appl Sci Res, 1 (6), 01-03.
[250] Uçkan, T., and Kurt, E. Word Embeddings in NLP. Pioneer and Innovative Studies in Computer Sciences and Engineering, 58.
[251] Pastor, G. C., Monti, J., Mitkov, R., and Hidalgo-Ternero, C. M. (2024). Recent Advances in Multiword Units in Machine Translation and Translation Technology.
[254] Yang, M. (2025). Adaptive Recognition of English Translation Errors Based on Improved
Machine Learning Methods. International Journal of High Speed Electronics and Systems,
2540236.
[255] Linnemann, G. A., and Reimann, L. E. (2024). Artificial Intelligence as a New Field of
Activity for Applied Social Psychology–A Reasoning for Broadening the Scope.
[256] Merkel, S., and Schorr, S. OPP: Application Fields and Innovative Technologies.
[257] Kushwaha, N. S., and Singh, P. (2022). Artificial Intelligence based Chatbot: A Case Study.
Journal of Management and Service Science (JMSS), 2(1), 1-13.
[258] Macedo, P., Madeira, R. N., Santos, P. A., Mota, P., Alves, B., and Pereira, C. M. (2024). A
Conversational Agent for Empowering People with Parkinson’s Disease in Exercising Through
Motivation and Support. Applied Sciences, 15(1), 223.
[259] Gupta, R., Nair, K., Mishra, M., Ibrahim, B., and Bhardwaj, S. (2024). Adoption and impacts
of generative artificial intelligence: Theoretical underpinnings and research agenda. Interna-
tional Journal of Information Management Data Insights, 4(1), 100232.
[260] Foroughi, B., Iranmanesh, M., Yadegaridehkordi, E., Wen, J., Ghobakhloo, M., Senali, M.
G., and Annamalai, N. (2025). Factors Affecting the Use of ChatGPT for Obtaining Shopping
Information. International Journal of Consumer Studies, 49(1), e70008.
[262] Pavlović, N., and Savić, M. (2024). The Impact of the ChatGPT Platform on Consumer
Experience in Digital Marketing and User Satisfaction. Theoretical and Practical Research in
Economic Fields, 15(3), 636-646.
[263] Mannava, V., Mitrevski, A., and Plöger, P. G. (2024, August). Exploring the Suitability of
Conversational AI for Child-Robot Interaction. In 2024 33rd IEEE International Conference
on Robot and Human Interactive Communication (ROMAN) (pp. 1821-1827). IEEE.
[264] Sherstinova, T., Mikhaylovskiy, N., Kolpashchikova, E., and Kruglikova, V. (2024, April).
Bridging Gaps in Russian Language Processing: AI and Everyday Conversations. In 2024
35th Conference of Open Innovations Association (FRUCT) (pp. 665-674). IEEE.
[265] Lipton, Z. C. (2015). A Critical Review of Recurrent Neural Networks for Sequence Learning. arXiv preprint arXiv:1506.00019.
[266] Pascanu, R. (2013). On the difficulty of training recurrent neural networks. arXiv preprint
arXiv:1211.5063.
[267] Jaeger, H. (2001). The “echo state” approach to analysing and training recurrent neural
networks-with an erratum note. Bonn, Germany: German National Research Center for Infor-
mation Technology GMD Technical Report, 148(34), 13.
[269] Kawakami, K. (2008). Supervised sequence labelling with recurrent neural networks (Doctoral dissertation).
[270] Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with gra-
dient descent is difficult. IEEE transactions on neural networks, 5(2), 157-166.
[271] Bhattamishra, S., Patel, A., and Goyal, N. (2020). On the computational power of trans-
formers and its implications in sequence modeling. arXiv preprint arXiv:2006.09286.
[291] Prabha, D., Subramanian, R. S., Dinesh, M. G., and Girija, P. (2024). Sustainable Farming
Through AI-Enabled Precision Agriculture. In Artificial Intelligence for Precision Agriculture
(pp. 159-182). Auerbach Publications.
[294] Team, G. Y. Bifang: A New Free-Flying Cubic Robot for Space Station.
[296] Naderi, S., Chen, B., Yang, T., Xiang, J., Heaney, C. E., Latham, J. P., ... and Pain, C.
C. (2024). A discrete element solution method embedded within a Neural Network. Powder
Technology, 448, 120258.
[297] Polaka, S. K. R. (2024). Verifica delle reti neurali per l’apprendimento rinforzato sicuro [Verification of neural networks for safe reinforcement learning].
[298] Erdogan, L. E., Kanakagiri, V. A. R., Keutzer, K., and Dong, Z. (2024). Stochastic Commu-
nication Avoidance for Recommendation Systems. arXiv preprint arXiv:2411.01611.
[299] Liao, F., Tang, Y., Du, Q., Wang, J., Li, M., and Zheng, J. (2024). Domain Progressive
Low-dose CT Imaging using Iterative Partial Diffusion Model. IEEE Transactions on Medical
Imaging.
[300] Sekhavat, Y. (2024). Looking for creative basis of artificial intelligence art in the midst of
order and chaos based on Nietzsche’s theories. Theoretical Principles of Visual Arts.
[301] Cai, H., Yang, Y., Tang, Y., Sun, Z., and Zhang, W. (2025). Shapley value-based class
activation mapping for improved explainability in neural networks. The Visual Computer,
1-19.
[302] Na, W. (2024). Rach-Space: Novel Ensemble Learning Method With Applications in Weakly
Supervised Learning (Master’s thesis, Tufts University).
[303] Khajah, M. M. (2024). Supercharging BKT with Multidimensional Generalizable IRT and
Skill Discovery. Journal of Educational Data Mining, 16(1), 233-278.
[304] Zhang, Y., Duan, Z., Huang, Y., and Zhu, F. (2024). Theoretical Bound-Guided Hierarchical
VAE for Neural Image Codecs. arXiv preprint arXiv:2403.18535.
[305] Wang, L., and Huang, W. (2025). On the convergence analysis of over-parameterized varia-
tional autoencoders: a neural tangent kernel perspective. Machine Learning, 114(1), 15.
[306] Li, C. N., Liang, H. P., Zhao, B. Q., Wei, S. H., and Zhang, X. (2024). Machine learning
assisted crystal structure prediction made simple. Journal of Materials Informatics, 4(3).
[307] Huang, Y. (2024). Research Advanced in Image Generation Based on Diffusion Probability
Model. Highlights in Science, Engineering and Technology, 85, 452-456.
[308] Chenebuah, E. T. (2024). Artificial Intelligence Simulation and Design of Energy Materials
with Targeted Properties (Doctoral dissertation, Université d’Ottawa / University of Ottawa).
[309] Furth, N., Imel, A., and Zawodzinski, T. A. (2024, November). Graph Encoders for Redox
Potentials and Solubility Predictions. In Electrochemical Society Meeting Abstracts prime2024
(No. 3, pp. 344-344). The Electrochemical Society, Inc..
[310] Gong, J., Deng, Z., Xie, H., Qiu, Z., Zhao, Z., and Tang, B. Z. (2025). Deciphering Design
of Aggregation-Induced Emission Materials by Data Interpretation. Advanced Science, 12(3),
2411345.
[311] Kim, H., Lee, C. H., and Hong, C. (2024, July). VATMAN: Video Anomaly Transformer for
Monitoring Accidents and Nefariousness. In 2024 IEEE International Conference on Advanced
Video and Signal Based Surveillance (AVSS) (pp. 1-7). IEEE.
[312] Albert, S. W., Doostan, A., and Schaub, H. (2024). Dimensionality Reduction for Onboard
Modeling of Uncertain Atmospheres. Journal of Spacecraft and Rockets, 1-13.
[313] Sharma, D. K., Hota, H. S., and Rababaah, A. R. (2024). Machine Learning for Real World
Applications (Doctoral dissertation, Department of Computer Science and Engineering, Indian
Institute of Technology Patna).
[314] Li, T., Shi, Z., Dale, S. G., Vignale, G., and Lin, M. Jrystal: A JAX-based Differentiable
Density Functional Theory Framework for Materials.
[315] Bieberich, S., Li, P., Ngai, J., Patel, K., Vogt, R., Ranade, P., ... and Stafford, S. (2024).
Conducting Quantum Machine Learning Through The Lens of Solving Neural Differential
Equations On A Theoretical Fault Tolerant Quantum Computer: Calibration and Bench-
marking.
[316] Dagréou, M., Ablin, P., Vaiter, S., and Moreau, T. (2024). How to compute Hessian-vector products? In The Third Blogpost Track at ICLR 2024.
[317] Lohoff, J., and Neftci, E. (2024). Optimizing Automatic Differentiation with Deep Reinforce-
ment Learning. arXiv preprint arXiv:2406.05027.
[318] Legrand, N., Weber, L., Waade, P. T., Daugaard, A. H. M., Khodadadi, M., Mikuš, N.,
and Mathys, C. (2024). pyhgf: A neural network library for predictive coding. arXiv preprint
arXiv:2410.09206.
[319] Alzás, P. B., and Radev, R. (2024). Differentiable nuclear deexcitation simulation for low
energy neutrino physics. arXiv preprint arXiv:2404.00180.
[320] Edenhofer, G., Frank, P., Roth, J., Leike, R. H., Guerdi, M., Scheel-Platz, L. I., ... and
Enßlin, T. A. (2024). Re-envisioning numerical information field theory (NIFTy.re): A library
for Gaussian processes and variational inference. arXiv preprint arXiv:2402.16683.
[321] Chan, S., Kulkarni, P., Paul, H. Y., and Parekh, V. S. (2024, September). Expanding the
Horizon: Enabling Hybrid Quantum Transfer Learning for Long-Tailed Chest X-Ray Clas-
sification. In 2024 IEEE International Conference on Quantum Computing and Engineering
(QCE) (Vol. 1, pp. 572-582). IEEE.
[322] Ye, H., Hu, Z., Yin, R., Boyko, T. D., Liu, Y., Li, Y., ... and Li, Y. (2025). Electron transfer
at birnessite/organic compound interfaces: mechanism, regulation, and two-stage kinetic dis-
crepancy in structural rearrangement and decomposition. Geochimica et Cosmochimica Acta,
388, 253-267.
[323] Khan, M., Ludl, A. A., Bankier, S., Björkegren, J. L., and Michoel, T. (2024). Prediction
of causal genes at GWAS loci with pleiotropic gene regulatory effects using sets of correlated
instrumental variables. PLoS genetics, 20(11), e1011473.
[324] Ojala, K., and Zhou, C. (2024). Determination of outdoor object distances from monocular
thermal images.
[325] Popordanoska, T., and Blaschko, M. (2024). Advancing Calibration in Deep Learning: The-
ory, Methods, and Applications.
[326] Alfieri, A., Cortes, J. M. P., Pastore, E., Castiglione, C., and Rey, G. M. Z. A Deep Q-Network
Approach to Job Shop Scheduling with Transport Resources.
[327] Zanardelli, R. (2025). Statistical learning methods for decision-making, with applications in
Industry 4.0.
[328] Norouzi, M., Hosseini, S. H., Khoshnevisan, M., and Moshiri, B. (2025). Applications of pre-
trained CNN models and data fusion techniques in Unity3D for connected vehicles. Applied
Intelligence, 55(6), 390.
[329] Wang, R., Yang, T., Liang, C., Wang, M., and Ci, Y. (2025). Reliable Autonomous Driving
Environment Perception: Uncertainty Quantification of Semantic Segmentation. Journal of
Transportation Engineering, Part A: Systems, 151(3), 04024117.
[330] Xia, Q., Chen, P., Xu, G., Sun, H., Li, L., and Yu, G. (2024). Adaptive Path-Tracking Con-
troller Embedded With Reinforcement Learning and Preview Model for Autonomous Driving.
IEEE Transactions on Vehicular Technology.
[331] Liu, Q., Tang, Y., Li, X., Yang, F., Wang, K., and Li, Z. (2024). MV-STGHAT: Multi-View
Spatial-Temporal Graph Hybrid Attention Network for Decision-Making of Connected and
Autonomous Vehicles. IEEE Transactions on Vehicular Technology.
[332] Chakraborty, D., and Deka, B. (2025). Deep Learning-based Selective Feature Fusion for
Litchi Fruit Detection using Multimodal UAV Sensor Measurements. IEEE Transactions on
Artificial Intelligence.
[333] Mirindi, D., Khang, A., and Mirindi, F. (2025). Artificial Intelligence (AI) and Automation
for Driving Green Transportation Systems: A Comprehensive Review. Driving Green Trans-
portation System Through Artificial Intelligence and Automation: Approaches, Technologies
and Applications, 1-19.
[334] Choudhury, B., Rajakumar, K., Badhale, A. A., Roy, A., Sahoo, R., and Margret, I. N. (2024,
June). Comparative Analysis of Advanced Models for Satellite-Based Aircraft Identification.
In 2024 International Conference on Smart Systems for Electrical, Electronics, Communication
and Computer Engineering (ICSSEECC) (pp. 483-488). IEEE.
[335] Almubarok, W., Rosiani, U. D., and Asmara, R. A. (2024, November). MobileNetV2 Pruning
for Improved Efficiency in Catfish Classification on Resource-Limited Devices. In 2024 IEEE
10th Information Technology International Seminar (ITIS) (pp. 271-277). IEEE.
[336] Ding, Q. (2024, February). Classification Techniques of Tongue Manifestation Based on Deep
Learning. In 2024 IEEE 3rd International Conference on Electrical Engineering, Big Data and
Algorithms (EEBDA) (pp. 802-810). IEEE.
[337] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition.
In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-
778).
[338] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep
convolutional neural networks. Advances in neural information processing systems, 25.
[339] Sultana, F., Sufian, A., and Dutta, P. (2018, November). Advancements in image classification
using convolutional neural network. In 2018 Fourth International Conference on Research in
Computational Intelligence and Communication Networks (ICRCICN) (pp. 122-129). IEEE.
[340] Sattler, T., Zhou, Q., Pollefeys, M., and Leal-Taixe, L. (2019). Understanding the limitations
of cnn-based absolute camera pose regression. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition (pp. 3302-3312).
[341] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... and Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
[342] Nannepagu, M., Babu, D. B., and Madhuri, C. B. Leveraging Hybrid AI Models: DQN,
Prophet, BERT, ART-NN, and Transformer-Based Approaches for Advanced Stock Market
Forecasting.
[343] De Rose, L., Andresini, G., Appice, A., and Malerba, D. (2024). VINCENT: Cyber-threat
detection through vision transformers and knowledge distillation. Computers and Security,
103926.
[344] Buehler, M. J. (2025). Graph-Aware Isomorphic Attention for Adaptive Dynamics in Trans-
formers. arXiv preprint arXiv:2501.02393.
[345] Tabibpour, S. A., and Madanizadeh, S. A. (2024). Solving High-Dimensional Dynamic Pro-
gramming Using Set Transformer. Available at SSRN 5040295.
[346] Li, S., and Dong, P. (2024, October). Mixed Attention Transformer Enhanced Channel Esti-
mation for Extremely Large-Scale MIMO Systems. In 2024 16th International Conference on
Wireless Communications and Signal Processing (WCSP) (pp. 394-399). IEEE.
[347] Asefa, S. H., and Assabie, Y. (2024). Transformer-Based Amharic-to-English Machine Trans-
lation with Character Embedding and Combined Regularization Techniques. IEEE Access.
[348] Liao, M., and Chen, M. (2024, November). A new deepfake detection method by vision
transformers. In International Conference on Algorithms, High Performance Computing, and
Artificial Intelligence (AHPCAI 2024) (Vol. 13403, pp. 953-957). SPIE.
[349] Jiang, L., Cui, J., Xu, Y., Deng, X., Wu, X., Zhou, J., and Wang, Y. (2024, August). SC-
Former: Spatial and Channel-wise Transformer with Contrastive Learning for High-Quality
PET Image Reconstruction. In 2024 IEEE International Conference on Cybernetics and In-
telligent Systems (CIS) and IEEE International Conference on Robotics, Automation and
Mechatronics (RAM) (pp. 26-31). IEEE.
[350] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., ... and
Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing
systems, 27.
[351] Chappidi, J., and Sundaram, D. M. (2024). Dual Q-Learning with Graph Neural Networks: A Novel Approach to Animal Detection in Challenging Ecosystems. Journal of Theoretical and Applied Information Technology, 102(23).
[352] Joni, R. (2024). Delving into Deep Learning: Illuminating Techniques and Visual Clarity for
Image Analysis (No. 12808). EasyChair.
[353] Kalaiarasi, G., Sudharani, B., Jonnalagadda, S. C., Battula, H. V., and Sanagala, B. (2024,
July). A Comprehensive Survey of Image Steganography. In 2024 2nd International Conference
on Sustainable Computing and Smart Systems (ICSCSS) (pp. 1225-1230). IEEE.
[354] Arjmandi-Tash, A. M., Mansourian, A., Rahsepar, F. R., and Abdi, Y. (2024). Predicting
Photodetector Responsivity through Machine Learning. Advanced Theory and Simulations,
2301219.
[355] Gao, Y. (2024). Neural networks meet applied mathematics: GANs, PINNs, and transformers.
HKU Theses Online (HKUTO).
[356] Hisama, K., Ishikawa, A., Aspera, S. M., and Koyama, M. (2024). Theoretical Catalyst
Screening of Multielement Alloy Catalysts for Ammonia Synthesis Using Machine Learning
Potential and Generative Artificial Intelligence. The Journal of Physical Chemistry C, 128(44),
18750-18758.
[357] Wang, M., and Zhang, Y. (2024). Image Segmentation in Complex Backgrounds using an Im-
proved Generative Adversarial Network. International Journal of Advanced Computer Science
and Applications, 15(5).
[358] Alonso, N. I., and Arias, F. (2025). The Mathematics of Q-Learning and the Hamilton-Jacobi-Bellman Equation. Available at SSRN (January 05, 2025).
[359] Lu, C., Shi, L., Chen, Z., Wu, C., and Wierman, A. (2024). Overcoming the Curse of Di-
mensionality in Reinforcement Learning Through Approximate Factorization. arXiv preprint
arXiv:2411.07591.
[360] Humayoo, M. (2024). Time-Scale Separation in Q-Learning: Extending TD(Δ) for Action-Value Function Decomposition. arXiv preprint arXiv:2411.14019.
[361] Jia, L., Qi, N., Su, Z., Chu, F., Fang, S., Wong, K. K., and Chae, C. B. (2024). Game theory
and reinforcement learning for anti-jamming defense in wireless communications: Current
research, challenges, and solutions. IEEE Communications Surveys and Tutorials.
[362] Chai, J., Chen, E., and Fan, J. (2025). Deep Transfer Q-Learning for Offline Non-Stationary
Reinforcement Learning. arXiv preprint arXiv:2501.04870.
[363] Yao, J., and Gong, X. (2024, October). Communication-Efficient and Resilient Distributed
Deep Reinforcement Learning for Multi-Agent Systems. In 2024 IEEE International Conference
on Unmanned Systems (ICUS) (pp. 1521-1526). IEEE.
[364] Liu, Y., Yang, T., Tian, L., and Pei, J. (2025). SGD-TripleQNet: An Integrated Deep Rein-
forcement Learning Model for Vehicle Lane-Change Decision. Mathematics, 13(2), 235.
[365] Masood, F., Ahmad, J., Al Mazroa, A., Alasbali, N., Alazeb, A., and Alshehri, M. S. (2025).
Multi IRS-Aided Low-Carbon Power Management for Green Communication in 6G Smart
Agriculture Using Deep Game Theory. Computational Intelligence, 41(1), e70022.
[367] El Mimouni, I., and Avrachenkov, K. (2025, January). Deep Q-Learning with Whittle Index
for Contextual Restless Bandits: Application to Email Recommender Systems. In Northern
Lights Deep Learning Conference 2025.
[368] Shefin, R. S., Rahman, M. A., Le, T., and Alqahtani, S. (2024). xSRL: Safety-Aware
Explainable Reinforcement Learning–Safety as a Product of Explainability. arXiv preprint
arXiv:2412.19311.
[369] Khlifi, A., Othmani, M., and Kherallah, M. (2025). A Novel Approach to Autonomous Driving
Using DDQN-Based Deep Reinforcement Learning.
[370] Kuczkowski, D. (2024). Energy efficient multi-objective reinforcement learning algorithm for
traffic simulation.
[371] Krauss, R., Zielasko, J., and Drechsler, R. Large-Scale Evolutionary Optimization of Artificial
Neural Networks Using Adaptive Mutations.
[372] Ahamed, M. S., Pey, J. J. J., Samarakoon, S. B. P., Muthugala, M. V. J., and Elara, M. R.
(2025). Reinforcement Learning for Reconfigurable Robotic Soccer. IEEE Access.
[373] Elmquist, A., Serban, R., and Negrut, D. (2024). A methodology to quantify simulation-vs-
reality differences in images for autonomous robots. IEEE Sensors Journal.
[374] Kobanda, A., Portelas, R., Maillard, O. A., and Denoyer, L. (2024). Hierarchical Subspaces
of Policies for Continual Offline Reinforcement Learning. arXiv preprint arXiv:2412.14865.
[375] Xu, J., Xie, G., Zhang, Z., Hou, X., Zhang, S., Ren, Y., and Niyato, D. (2025). UPEGSim:
An RL-Enabled Simulator for Unmanned Underwater Vehicles Dedicated in the Underwater
Pursuit-Evasion Game. IEEE Internet of Things Journal, 12(3), 2334-2346.
[376] Patadiya, K., Jain, R., Moteriya, J., Palaniappan, D., Kumar, P., and Premavathi, T. (2024,
December). Application of Deep Learning to Generate Auto Player Mode in Car Based Game.
In 2024 IEEE 16th International Conference on Computational Intelligence and Communica-
tion Networks (CICN) (pp. 233-237). IEEE.
[377] Janjua, J. I., Kousar, S., Khan, A., Ihsan, A., Abbas, T., and Saeed, A. Q. (2024, Decem-
ber). Enhancing Scalability in Reinforcement Learning for Open Spaces. In 2024 International
Conference on Decision Aid Sciences and Applications (DASA) (pp. 1-8). IEEE.
[378] Yang, L., Li, Y., Wang, J., and Sherratt, R. S. (2020). Sentiment analysis for E-commerce
product reviews in Chinese based on sentiment lexicon and deep learning. IEEE access, 8,
23522-23530.
[379] Manikandan, C., Kumar, P. S., Nikitha, N., Sanjana, P. G., and Dileep, Y. Filtering Emails
Using Natural Language Processing.
[380] Isiaka, S. O., Babatunde, R. S., and Isiaka, R. M. Exploring Artificial Intelligence (AI) Technologies in Predictive Medicine: A Systematic Review.
[381] Petrov, A., Zhao, D., Smith, J., Volkov, S., Wang, J., and Ivanov, D. Deep Learning Ap-
proaches for Emotional State Classification in Textual Data.
[382] Liang, M. (2025). Leveraging natural language processing for automated assessment and feed-
back production in virtual education settings. Journal of Computational Methods in Sciences
and Engineering, 14727978251314556.
[383] Jin, L. (2025). Research on Optimization Strategies of Artificial Intelligence Algorithms for
the Integration and Dissemination of Pharmaceutical Science Popularization Knowledge. Sci-
entific Journal of Technology, 7(1), 45-55.
[384] McNicholas, B. A., Madden, M. G., and Laffey, J. G. (2025). Natural language processing in
critical care: opportunities, challenges, and future directions. Intensive Care Medicine, 1-5.
[385] Abd Al Abbas, M., and Khammas, B. M. (2024). Efficient IoT Malware Detection Technique
Using Recurrent Neural Network. Iraqi Journal of Information and Communication Technol-
ogy, 7(3), 29-42.
[386] Kalonia, S., and Upadhyay, A. (2025). Deep learning-based approach to predict software
faults. In Artificial Intelligence and Machine Learning Applications for Sustainable Develop-
ment (pp. 326-348). CRC Press.
[387] Han, S. C., Weld, H., Li, Y., Lee, J., and Poon, J. Natural Language Understanding in
Conversational AI with Deep Learning.
[388] Potter, K., and Egon, A. Recurrent Neural Networks (RNNs) for Time Series Forecasting.
[389] Yatkin, M. A., Kõrgesaar, M., and Işlak, Ü. (2025). A Topological Approach to Enhancing
Consistency in Machine Learning via Recurrent Neural Networks. Applied Sciences, 15(2),
933.
[390] Saifullah, S. (2024). Comparative Analysis of LSTM and GRU Models for Chicken Egg Fer-
tility Classification using Deep Learning.
[392] Tu, Z., Jeffries, S. D., Morse, J., and Hemmerling, T. M. (2024). Comparison of time-series
models for predicting physiological metrics under sedation. Journal of Clinical Monitoring and
Computing, 1-11.
[393] Zuo, Y., Jiang, J., and Yada, K. (2025). Application of hybrid gate recurrent unit for in-store
trajectory prediction based on indoor location system. Scientific Reports, 15(1), 1055.
[394] Lima, R., Scardua, L. A., and De Almeida, G. M. (2024). Predicting Temperatures Inside a
Steel Slab Reheating Furnace Using Neural Networks. Authorea Preprints.
[395] Khan, S., Muhammad, Y., Jadoon, I., Awan, S. E., and Raja, M. A. Z. (2025). Leveraging
LSTM-SMI and ARIMA architecture for robust wind power plant forecasting. Applied Soft
Computing, 112765.
[396] Guo, Z., and Feng, L. (2024). Multi-step prediction of greenhouse temperature and humidity
based on temporal position attention LSTM. Stochastic Environmental Research and Risk
Assessment, 1-28.
[397] Abdelhamid, N. M., Khechekhouche, A., Mostefa, K., Brahim, L., and Talal, G. (2024).
Deep-RNN based model for short-time forecasting photovoltaic power generation using IoT.
Studies in Engineering and Exact Sciences, 5(2), e11461-e11461.
[398] Rohman, F. N., and Farikhin, B. S. Hyperparameter Tuning of Random Forest Algorithm
for Diabetes Classification.
[399] Rahman, M. Utilizing Machine Learning Techniques for Early Brain Tumor Detection.
[400] Nandi, A., Singh, H., Majumdar, A., Shaw, A., and Maiti, A. Optimizing Baby Sound
Recognition using Deep Learning through Class Balancing and Model Tuning.
[401] Sianga, B. E., Mbago, M. C., and Msengwa, A. S. (2025). Predicting the Prevalence of Cardiovascular Diseases Using Machine Learning Algorithms. Intelligence-Based Medicine, 100199.
[402] Li, L., Hu, Y., Yang, Z., Luo, Z., Wang, J., Wang, W., ... and Zhang, Z. (2025). Exploring the
assessment of post-cardiac valve surgery pulmonary complication risks through the integration
of wearable continuous physiological and clinical data. BMC Medical Informatics and Decision
Making, 25(1), 1-11.
[403] Lázaro, F. L., Madeira, T., Melicio, R., Valério, D., and Santos, L. F. (2025). Identifying Hu-
man Factors in Aviation Accidents with Natural Language Processing and Machine Learning
Models. Aerospace, 12(2), 106.
[404] Li, Z., Zhong, J., Wang, H., Xu, J., Li, Y., You, J., ... and Dev, S. (2025). RAINER: A
Robust Ensemble Learning Grid Search-Tuned Framework for Rainfall Patterns Prediction.
arXiv preprint arXiv:2501.16900.
[405] Khurshid, M. R., Manzoor, S., Sadiq, T., Hussain, L., Khan, M. S., and Dutta, A. K. (2025).
Unveiling diabetes onset: Optimized XGBoost with Bayesian optimization for enhanced pre-
diction. PloS one, 20(1), e0310218.
[406] Kanwar, M., Pokharel, B., and Lim, S. (2025). A new random forest method for landslide
susceptibility mapping using hyperparameter optimization and grid search techniques. Inter-
national Journal of Environmental Science and Technology, 1-16.
[407] Fadil, M., Akrom, M., and Herowati, W. (2025). Utilization of Machine Learning for Pre-
dicting Corrosion Inhibition by Quinoxaline Compounds. Journal of Applied Informatics and
Computing, 9(1), 173-177.
[408] Emmanuel, J., Isewon, I., and Oyelade, J. (2025). An Optimized Deep-Forest Algorithm
Using a Modified Differential Evolution Optimization Algorithm: A Case of Host-Pathogen
Protein-Protein Interaction Prediction. Computational and Structural Biotechnology Journal.
[409] Gaurav, A., Gupta, B. B., Attar, R. W., Alhomoud, A., Arya, V., and Chui, K. T. (2025).
Driver identification in advanced transportation systems using osprey and salp swarm opti-
mized random forest model. Scientific Reports, 15(1), 2453.
[410] Ning, C., Ouyang, H., Xiao, J., Wu, D., Sun, Z., Liu, B., ... and Huang, G. (2025). Develop-
ment and validation of an explainable machine learning model for mortality prediction among
patients with infected pancreatic necrosis. eClinicalMedicine, 80.
[411] Muñoz, V., Ballester, C., Copaci, D., Moreno, L., and Blanco, D. (2025). Accelerating hy-
perparameter optimization with a secretary. Neurocomputing, 129455.
[412] Balcan, M. F., Nguyen, A. T., and Sharma, D. (2025). Sample complexity of data-driven
tuning of model hyperparameters in neural networks with structured parameter-dependent
dual function. arXiv preprint arXiv:2501.13734.
[413] Azimi, H., Kalhor, E. G., Nabavi, S. R., Behbahani, M., and Vardini, M. T. (2025). Data-
based modeling for prediction of supercapacitor capacity: Integrated machine learning and
metaheuristic algorithms. Journal of the Taiwan Institute of Chemical Engineers, 170, 105996.
[414] Shibina, V., and Thasleema, T. M. (2025). Voice feature-based diagnosis of Parkinson’s dis-
ease using nature inspired squirrel search algorithm with ensemble learning classifiers. Iran
Journal of Computer Science, 1-25.
[415] Chang, F., Dong, S., Yin, H., Ye, X., Wu, Z., Zhang, W., and Zhu, H. (2025). 3D displacement
time series prediction of a north-facing reservoir landslide powered by InSAR and machine
learning. Journal of Rock Mechanics and Geotechnical Engineering.
[416] Cihan, P. (2025). Bayesian Hyperparameter Optimization of Machine Learning Models for
Predicting Biomass Gasification Gases. Applied Sciences, 15(3), 1018.
[417] Makomere, R., Rutto, H., Alugongo, A., Koech, L., Suter, E., and Kohitlhetse, I. (2025).
Enhanced dry SO2 capture estimation using Python-driven computational frameworks with
hyperparameter tuning and data augmentation. Unconventional Resources, 100145.
[418] Bakır, H. (2025). A new method for tuning the CNN pre-trained models as a feature extractor
for malware detection. Pattern Analysis and Applications, 28(1), 26.
[419] Liu, Y., Yin, H., and Li, Q. (2025). Sound absorption performance prediction of multi-
dimensional Helmholtz resonators based on deep learning and hyperparameter optimization.
Physica Scripta.
[420] Ma, Z., Zhao, M., Dai, X., and Chen, Y. (2025). Anomaly detection for high-speed machining
using hybrid regularized support vector data description. Robotics and Computer-Integrated
Manufacturing, 94, 102962.
[421] El-Bouzaidi, Y. E. I., Hibbi, F. Z., and Abdoun, O. (2025). Optimizing Convolutional Neural
Network Impact of Hyperparameter Tuning and Transfer Learning. In Innovations in Opti-
mization and Machine Learning (pp. 301-326). IGI Global Scientific Publishing.
[422] Mustapha, B., Zhou, Y., Shan, C., and Xiao, Z. (2025). Enhanced Pneumonia Detection in
Chest X-Rays Using Hybrid Convolutional and Vision Transformer Networks. Current Medical
Imaging, e15734056326685.
[423] Adly, S., and Attouch, H. (2024). Complexity Analysis Based on Tuning the Viscosity Param-
eter of the Su-Boyd-Candès Inertial Gradient Dynamics. Set-Valued and Variational Analysis,
32(2), 17.
[424] Wang, Z., and Peypouquet, J. G. Nesterov’s Accelerated Gradient Method for Strongly Con-
vex Functions: From Inertial Dynamics to Iterative Algorithms.
[425] Hermant, J., Renaud, M., Aujol, J. F., and Rondepierre, C. D. A. (2024). Nesterov momentum for convex functions with interpolation: is it faster than stochastic gradient descent? Book of abstracts, PGMO DAYS 2024, 68.
[426] Alavala, S., and Gorthi, S. (2024). 3D CBCT Challenge 2024: Improved Cone Beam
CT Reconstruction using SwinIR-Based Sinogram and Image Enhancement. arXiv preprint
arXiv:2406.08048.
[427] Li, C. J. (2024). Unified Momentum Dynamics in Stochastic Gradient Optimization. Available
at SSRN 4981009.
[428] Gupta, K., and Wojtowytsch, S. (2024). Nesterov acceleration in benignly non-convex land-
scapes. arXiv preprint arXiv:2410.08395.
[429] Razzouki, O. F., Charroud, A., El Allali, Z., Chetouani, A., and Aslimani, N. (2024, Decem-
ber). A Survey of Advanced Gradient Methods in Machine Learning. In 2024 7th International
Conference on Advanced Communication Technologies and Networking (CommNet) (pp. 1-7).
IEEE.
[430] Wang, J., Du, B., Su, Z., Hu, K., Yu, J., Cao, C., ... and Guo, H. (2025). A fast LMS-based
digital background calibration technique for 16-bit SAR ADC with modified shuffling scheme.
Microelectronics Journal, 156, 106547.
[431] Naeem, K., Bukhari, A., Daud, A., Alsahfi, T., Alshemaimri, B., and Alhajlah, M. (2024).
Machine Learning and Deep Learning Optimization Algorithms for Unconstrained Convex
Optimization Problem. IEEE Access.
[432] Campos, C. M., de Diego, D. M., and Torrente, J. (2024). Momentum-based gradient descent
methods for Lie groups. arXiv preprint arXiv:2404.09363.
[433] Li, J., Chen, H., Othman, M. S., Salim, N., Yusuf, L. M., and Kumaran, S. R. (2025). NFIoT-GATE-DTL IDS: Genetic algorithm-tuned ensemble of deep transfer learning for NetFlow-based intrusion detection system for internet of things. Engineering Applications of Artificial Intelligence, 143, 110046. https://doi.org/10.1016/j.engappai.2025.110046
[434] Gül, M. F., and Bakır, H. (2025). GA-ML: enhancing the prediction of water electrical conductivity through genetic algorithm-based end-to-end hyperparameter tuning. Earth Science Informatics, 18, 191. https://doi.org/10.1007/s12145-024-01610-1
[435] Sen, A., Sen, U., Paul, M., Padhy, A. P., Sai, S., Mallik, A., and Mallick, C. (2025).
QGAPHEnsemble: Combining Hybrid QLSTM Network Ensemble via Adaptive Weighting
for Short Term Weather Forecasting. arXiv preprint arXiv:2501.10866.
[436] Roy, A., Sen, A., Gupta, S., Haldar, S., Deb, S., Vankala, T. N., and Das, A. (2025). Deep-
EyeNet: Adaptive Genetic Bayesian Algorithm Based Hybrid ConvNeXtTiny Framework For
Multi-Feature Glaucoma Eye Diagnosis. arXiv preprint arXiv:2501.11168.
[437] Jiang, T., Lu, W., Lu, L., Xu, L., Xi, W., Liu, J., and Zhu, Y. (2025). Inlet Passage Hydraulic
Performance Optimization of Coastal Drainage Pump System Based on Machine Learning
Algorithms. Journal of Marine Science and Engineering, 13(2), 274.
[438] Borah, J., and Chandrasekaran, M. (2025). Application of Machine Learning-Based Approach
to Predict and Optimize Mechanical Properties of Additively Manufactured Polyether Ether
Ketone Biopolymer Using Fused Deposition Modeling. Journal of Materials Engineering and
Performance, 1-17.
[439] Tan, Q., He, D., Sun, Z., Yao, Z., Zhou, J. X., and Chen, T. (2025). A deep reinforcement
learning based metro train operation control optimization considering energy conservation and
passenger comfort. Engineering Research Express.
[440] García-Galindo, A., López-De-Castro, M., and Armañanzas, R. (2025). Fair prediction sets
through multi-objective hyperparameter optimization. Machine Learning, 114(1), 27.
[441] Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. (2014). On the number of linear
regions of deep neural networks. Advances in neural information processing systems, 27.
[442] Schmidt-Hieber, J. (2020). Nonparametric regression using deep neural networks with ReLU activation function. The Annals of Statistics, 48(4), 1875-1897.
[443] Yarotsky, D. (2017). Error bounds for approximations with deep ReLU networks. Neural
networks, 94, 103-114.
[444] Telgarsky, M. (2016, June). Benefits of depth in neural networks. In Conference on learning
theory (pp. 1517-1539). PMLR.
[445] Lu, Z., Pu, H., Wang, F., Hu, Z., and Wang, L. (2017). The expressive power of neural
networks: A view from the width. Advances in neural information processing systems, 30.
[446] Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2021). Understanding deep
learning (still) requires rethinking generalization. Communications of the ACM, 64(3), 107-
115.
[447] Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. (2008). The graph
neural network model. IEEE transactions on neural networks, 20(1), 61-80.
[448] Kipf, T. N., and Welling, M. (2016). Semi-supervised classification with graph convolutional
networks. arXiv preprint arXiv:1609.02907.
[449] Hamilton, W., Ying, Z., and Leskovec, J. (2017). Inductive representation learning on large
graphs. Advances in neural information processing systems, 30.
[450] Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., and Bengio, Y. (2017). Graph
attention networks. arXiv preprint arXiv:1710.10903.
[451] Xu, K., Hu, W., Leskovec, J., and Jegelka, S. (2018). How powerful are graph neural networks? arXiv preprint arXiv:1810.00826.
[452] Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. (2017, July). Neural
message passing for quantum chemistry. In International conference on machine learning (pp.
1263-1272). PMLR.
[453] Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski,
M., ... and Pascanu, R. (2018). Relational inductive biases, deep learning, and graph networks.
arXiv preprint arXiv:1806.01261.
[454] Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. (2013). Spectral networks and locally
connected networks on graphs. arXiv preprint arXiv:1312.6203.
[455] Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W. L., and Leskovec, J. (2018, July).
Graph convolutional neural networks for web-scale recommender systems. In Proceedings of
the 24th ACM SIGKDD international conference on knowledge discovery and data mining
(pp. 974-983).
[456] Zhou, J., Cui, G., Hu, S., Zhang, Z., Yang, C., Liu, Z., ... and Sun, M. (2020). Graph neural
networks: A review of methods and applications. AI open, 1, 57-81.
[457] Raissi, M., Perdikaris, P., and Karniadakis, G. E. (2019). Physics-informed neural networks:
A deep learning framework for solving forward and inverse problems involving nonlinear partial
differential equations. Journal of Computational physics, 378, 686-707.
[458] Karniadakis, G. E., Kevrekidis, I. G., Lu, L., Perdikaris, P., Wang, S., and Yang, L. (2021).
Physics-informed machine learning. Nature Reviews Physics, 3(6), 422-440.
[459] Lu, L., Meng, X., Mao, Z., and Karniadakis, G. E. (2021). DeepXDE: A deep learning library
for solving differential equations. SIAM review, 63(1), 208-228.
[460] Sirignano, J., and Spiliopoulos, K. (2018). DGM: A deep learning algorithm for solving partial
differential equations. Journal of computational physics, 375, 1339-1364.
[461] Wang, S., Teng, Y., and Perdikaris, P. (2021). Understanding and mitigating gradient flow
pathologies in physics-informed neural networks. SIAM Journal on Scientific Computing, 43(5),
A3055-A3081.
[462] Mishra, S., and Molinaro, R. (2023). Estimates on the generalization error of physics-informed
neural networks for approximating PDEs. IMA Journal of Numerical Analysis, 43(1), 1-43.
[463] Zhang, D., Guo, L., and Karniadakis, G. E. (2020). Learning in modal space: Solving time-
dependent stochastic PDEs using physics-informed neural networks. SIAM Journal on Scien-
tific Computing, 42(2), A639-A665.
[464] Jin, X., Cai, S., Li, H., and Karniadakis, G. E. (2021). NSFnets (Navier-Stokes flow nets):
Physics-informed neural networks for the incompressible Navier-Stokes equations. Journal of
Computational Physics, 426, 109951.
[465] Chen, Y., Lu, L., Karniadakis, G. E., and Dal Negro, L. (2020). Physics-informed neural
networks for inverse problems in nano-optics and metamaterials. Optics express, 28(8), 11618-
11633.
[466] Psichogios, D. C., and Ungar, L. H. (1992). A hybrid neural network-first principles approach
to process modeling. AIChE Journal, 38(10), 1499-1511.
[467] Chizat, L., and Bach, F. (2018). On the global convergence of gradient descent for over-
parameterized models using optimal transport. Advances in neural information processing
systems, 31.
[468] Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. (2019, May). Gradient descent finds global
minima of deep neural networks. In International conference on machine learning (pp. 1675-
1685). PMLR.
[469] Arora, S., Du, S., Hu, W., Li, Z., and Wang, R. (2019, May). Fine-grained analysis of opti-
mization and generalization for overparameterized two-layer neural networks. In International
Conference on Machine Learning (pp. 322-332). PMLR.
[470] Allen-Zhu, Z., Li, Y., and Song, Z. (2019, May). A convergence theory for deep learning via
over-parameterization. In International conference on machine learning (pp. 242-252). PMLR.
[471] Cao, Y., and Gu, Q. (2019). Generalization bounds of stochastic gradient descent for wide
and deep neural networks. Advances in neural information processing systems, 32.
[472] Yang, G. (2019). Scaling limits of wide neural networks with weight sharing: Gaussian pro-
cess behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint
arXiv:1902.04760.
[473] Huang, J., and Yau, H. T. (2020, November). Dynamics of deep neural networks and neural
tangent hierarchy. In International conference on machine learning (pp. 4542-4551). PMLR.
[474] Belkin, M., Ma, S., and Mandal, S. (2018, July). To understand deep learning we need to
understand kernel learning. In International Conference on Machine Learning (pp. 541-549).
PMLR.
[475] Sra, S., Nowozin, S., and Wright, S. J. (Eds.). (2011). Optimization for machine learning.
MIT Press.
[476] Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2015, February).
The loss surfaces of multilayer networks. In Artificial intelligence and statistics (pp. 192-204).
PMLR.
[477] Arora, S., Cohen, N., and Hazan, E. (2018, July). On the optimization of deep networks:
Implicit acceleration by overparameterization. In International conference on machine learning
(pp. 244-253). PMLR.
[478] Baratin, A., George, T., Laurent, C., Hjelm, R. D., Lajoie, G., Vincent, P., and Lacoste-
Julien, S. (2020). Implicit regularization in deep learning: A view from function space. arXiv
preprint arXiv:2008.00938.
[479] Balduzzi, D., Racaniere, S., Martens, J., Foerster, J., Tuyls, K., and Graepel, T. (2018,
July). The mechanics of n-player differentiable games. In International Conference on Machine
Learning (pp. 354-363). PMLR.
[480] Han, J., and Jentzen, A. (2017). Deep learning-based numerical methods for high-dimensional
parabolic partial differential equations and backward stochastic differential equations. Com-
munications in mathematics and statistics, 5(4), 349-380.
[481] Beck, C., Becker, S., Grohs, P., Jaafari, N., and Jentzen, A. (2021). Solving the Kolmogorov
PDE by means of deep learning. Journal of Scientific Computing, 88, 1-28.
[482] Han, J., Jentzen, A., and E, W. (2018). Solving high-dimensional partial differential equations
using deep learning. Proceedings of the National Academy of Sciences, 115(34), 8505-8510.
[483] Jentzen, A., Salimova, D., and Welti, T. (2018). A proof that deep artificial neural networks
overcome the curse of dimensionality in the numerical approximation of Kolmogorov partial
differential equations with constant diffusion and nonlinear drift coefficients. arXiv preprint
arXiv:1809.07321.
[484] Yu, B. (2018). The deep Ritz method: a deep learning-based numerical algorithm for solving
variational problems. Communications in Mathematics and Statistics, 6(1), 1-12.
[485] Khoo, Y., Lu, J., and Ying, L. (2021). Solving parametric PDE problems with artificial neural
networks. European Journal of Applied Mathematics, 32(3), 421-435.
[486] Hutzenthaler, M., and Kruse, T. (2020). Multilevel Picard approximations of high-
dimensional semilinear parabolic differential equations with gradient-dependent nonlinearities.
SIAM Journal on Numerical Analysis, 58(2), 929-961.
[487] Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. (2018). Hyperband:
A novel bandit-based approach to hyperparameter optimization. Journal of Machine Learning
Research, 18(185), 1-52.
[488] Falkner, S., Klein, A., and Hutter, F. (2018, July). BOHB: Robust and efficient hyperparam-
eter optimization at scale. In International conference on machine learning (pp. 1437-1446).
PMLR.
[489] Li, L., Jamieson, K., Rostamizadeh, A., Gonina, E., Ben-Tzur, J., Hardt, M., ... and Tal-
walkar, A. (2020). A system for massively parallel hyperparameter tuning. Proceedings of
Machine Learning and Systems, 2, 230-246.
[490] Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical bayesian optimization of ma-
chine learning algorithms. Advances in neural information processing systems, 25.
[491] Slivkins, A., Zhou, X., Sankararaman, K. A., and Foster, D. J. (2024). Contextual Bandits
with Packing and Covering Constraints: A Modular Lagrangian Approach via Regression.
Journal of Machine Learning Research, 25(394), 1-37.
[492] Hazan, E., Klivans, A., and Yuan, Y. (2017). Hyperparameter optimization: A spectral
approach. arXiv preprint arXiv:1706.00764.
[493] Domhan, T., Springenberg, J. T., and Hutter, F. (2015, June). Speeding up automatic hy-
perparameter optimization of deep neural networks by extrapolation of learning curves. In
Twenty-fourth international joint conference on artificial intelligence.
[494] Agrawal, T. (2021). Hyperparameter optimization in machine learning: make your machine learning and deep learning models more efficient (pp. 109-129). New York, NY: Apress.
[495] Shekhar, S., Bansode, A., and Salim, A. (2021, December). A comparative study of hyper-
parameter optimization tools. In 2021 IEEE Asia-Pacific Conference on Computer Science and
Data Engineering (CSDE) (pp. 1-6). IEEE.
[496] Bergstra, J., Bardenet, R., Bengio, Y., and Kégl, B. (2011). Algorithms for hyper-parameter
optimization. Advances in neural information processing systems, 24.
[497] Zoph, B., and Le, Q. V. (2016). Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.
[498] Maclaurin, D., Duvenaud, D., and Adams, R. (2015, June). Gradient-based hyperparameter
optimization through reversible learning. In International conference on machine learning (pp.
2113-2122). PMLR.
[500] Franceschi, L., Frasconi, P., Salzo, S., Grazzi, R., and Pontil, M. (2018, July). Bilevel pro-
gramming for hyperparameter optimization and meta-learning. In International conference on
machine learning (pp. 1568-1577). PMLR.
[501] Franceschi, L., Donini, M., Frasconi, P., and Pontil, M. (2017, July). Forward and reverse
gradient-based hyperparameter optimization. In International Conference on Machine Learning
(pp. 1165-1173). PMLR.
[502] Liu, H., Simonyan, K., and Yang, Y. (2018). Darts: Differentiable architecture search. arXiv
preprint arXiv:1806.09055.
[503] Lorraine, J., Vicol, P., and Duvenaud, D. (2020, June). Optimizing millions of hyperpa-
rameters by implicit differentiation. In International conference on artificial intelligence and
statistics (pp. 1540-1552). PMLR.
[504] Liang, J., Gonzalez, S., Shahrzad, H., and Miikkulainen, R. (2021, June). Regularized evolu-
tionary population-based training. In Proceedings of the Genetic and Evolutionary Computa-
tion Conference (pp. 323-331).
[505] Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W. M., Donahue, J., Razavi, A., ...
and Kavukcuoglu, K. (2017). Population based training of neural networks. arXiv preprint
arXiv:1711.09846.
[506] Co-Reyes, J. D., Miao, Y., Peng, D., Real, E., Levine, S., Le, Q. V., ... and Faust, A. (2021).
Evolving reinforcement learning algorithms. arXiv preprint arXiv:2101.03958.
[507] Song, C., Ma, Y., Xu, Y., and Chen, H. (2024). Multi-population evolutionary neural archi-
tecture search with stacked generalization. Neurocomputing, 587, 127664.
[508] Wan, X., Lu, C., Parker-Holder, J., Ball, P. J., Nguyen, V., Ru, B., and Osborne, M. (2022,
September). Bayesian generational population-based training. In International conference on
automated machine learning (pp. 14-1). PMLR.
[509] García-Valdez, M., Mancilla, A., Castillo, O., and Merelo-Guervós, J. J. (2023). Distributed
and asynchronous population-based optimization applied to the optimal design of fuzzy con-
trollers. Symmetry, 15(2), 467.
[510] Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. (2019, July). Optuna: A next-
generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD
international conference on knowledge discovery and data mining (pp. 2623-2631).
[511] Akiba, T., Shing, M., Tang, Y., Sun, Q., and Ha, D. (2025). Evolutionary optimization of
model merging recipes. Nature Machine Intelligence, 1-10.
[512] Kadhim, Z. S., Abdullah, H. S., and Ghathwan, K. I. (2022). Artificial Neural Network
Hyperparameters Optimization: A Survey. International Journal of Online and Biomedical
Engineering, 18(15).
[513] Jeba, J. A. (2021). Case study of Hyperparameter optimization framework Optuna on a Multi-
column Convolutional Neural Network (Doctoral dissertation, University of Saskatchewan).
[514] Yang, L., and Shami, A. (2020). On hyperparameter optimization of machine learning algo-
rithms: Theory and practice. Neurocomputing, 415, 295-316.
[515] Wang, T. (2024). Multi-objective hyperparameter optimisation for edge machine learning.
[517] Hutter, F., Kotthoff, L., and Vanschoren, J. (2019). Automated machine learning: methods,
systems, challenges (p. 219). Springer Nature.
[518] Jamieson, K., and Talwalkar, A. (2016, May). Non-stochastic best arm identification and
hyperparameter optimization. In Artificial intelligence and statistics (pp. 240-248). PMLR.
[519] Schmucker, R., Donini, M., Zafar, M. B., Salinas, D., and Archambeau, C. (2021). Multi-
objective asynchronous successive halving. arXiv preprint arXiv:2106.12639.
[520] Dong, X., Shen, J., Wang, W., Shao, L., Ling, H., and Porikli, F. (2019). Dynamical hy-
perparameter optimization via deep reinforcement learning in tracking. IEEE transactions on
pattern analysis and machine intelligence, 43(5), 1515-1529.
[521] Rijsdijk, J., Wu, L., Perin, G., and Picek, S. (2021). Reinforcement learning for hyperparam-
eter tuning in deep learning-based side-channel analysis. IACR Transactions on Cryptographic
Hardware and Embedded Systems, 2021(3), 677-707.
[522] Jaafra, Y., Laurent, J. L., Deruyver, A., and Naceur, M. S. (2019). Reinforcement learning
for neural architecture search: A review. Image and Vision Computing, 89, 57-66.
[523] Afshar, R. R., Zhang, Y., Vanschoren, J., and Kaymak, U. (2022). Automated reinforcement
learning: An overview. arXiv preprint arXiv:2201.05000.
[524] Wu, J., Chen, S., and Liu, X. (2020). Efficient hyperparameter optimization through model-
based reinforcement learning. Neurocomputing, 409, 381-393.
[525] Iranfar, A., Zapater, M., and Atienza, D. (2021). Multiagent reinforcement learning for hy-
perparameter optimization of convolutional neural networks. IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, 41(4), 1034-1047.
[526] He, X., Zhao, K., and Chu, X. (2021). AutoML: A survey of the state-of-the-art. Knowledge-
based systems, 212, 106622.
[527] Gomaa, I., Zidane, A., Mokhtar, H. M., and El-Tazi, N. (2022). SML-AutoML: A Smart
Meta-Learning Automated Machine Learning Framework.
[528] Khan, A. N., Khan, Q. W., Rizwan, A., Ahmad, R., and Kim, D. H. (2025). Consensus-Driven
Hyperparameter Optimization for Accelerated Model Convergence in Decentralized Federated
Learning. Internet of Things, 30, 101476.
[529] Morrison, N., and Ma, E. Y. (2025). Efficiency of machine learning optimizers and meta-
optimization for nanophotonic inverse design tasks. APL Machine Learning, 3(1).
[530] Berdyshev, D. A., Grachev, A. M., Shishkin, S. L., and Kozyrskiy, B. L. (2024). EEG-
Reptile: An Automatized Reptile-Based Meta-Learning Library for BCIs. arXiv preprint
arXiv:2412.19725.
[531] Pratellesi, C. (2025). Meta Learning for Flow Cytometry Cell Classification (Doctoral disser-
tation, Technische Universität Wien).
[532] García, C. A., Gil-de-la-Fuente, A., Barbas, C., and Otero, A. (2022). Probabilistic metabolite
annotation using retention time prediction and meta-learned projections. Journal of Chemin-
formatics, 14(1), 33.
[533] Deng, L., Raissi, M., and Xiao, M. (2024). Meta-Learning-Based Surrogate Models for Effi-
cient Hyperparameter Optimization. Authorea Preprints.
[534] Jae, J., Hong, J., Choo, J., and Kwon, Y. D. (2024). Reinforcement learning to learn quantum
states for Heisenberg scaling accuracy. arXiv preprint arXiv:2412.02334.
[535] Upadhyay, R., Phlypo, R., Saini, R., and Liwicki, M. (2025). Meta-Sparsity: Learning
Optimal Sparse Structures in Multi-task Networks through Meta-learning. arXiv preprint
arXiv:2501.12115.
[536] Paul, S., Ghosh, S., Das, D., and Sarkar, S. K. (2025). Advanced Methodologies for Opti-
mal Neural Network Design and Performance Enhancement. In Nature-Inspired Optimization
Algorithms for Cyber-Physical Systems (pp. 403-422). IGI Global Scientific Publishing.
[537] Egele, R., Mohr, F., Viering, T., and Balaprakash, P. (2024). The unreasonable effectiveness
of early discarding after one epoch in neural network hyperparameter optimization. Neuro-
computing, 127964.
[538] Wojciuk, M., Swiderska-Chadaj, Z., Siwek, K., and Gertych, A. (2024). Improving classifi-
cation accuracy of fine-tuned CNN models: Impact of hyperparameter optimization. Heliyon,
10(5).
[539] Geissler, D., Zhou, B., Suh, S., and Lukowicz, P. (2024). Spend More to Save More (SM2):
An Energy-Aware Implementation of Successive Halving for Sustainable Hyperparameter Op-
timization. arXiv preprint arXiv:2412.08526.
[540] Hosseini Sarcheshmeh, A., Etemadfard, H., Najmoddin, A., and Ghalehnovi, M. (2024).
Hyperparameters’ role in machine learning algorithm for modeling of compressive strength of
recycled aggregate concrete. Innovative Infrastructure Solutions, 9(6), 212.
[541] Sankar, S. U., Dhinakaran, D., Selvaraj, R., Verma, S. K., Natarajasivam, R., and Kishore, P.
P. (2024). Optimizing diabetic retinopathy disease prediction using PNAS, ASHA, and transfer
learning. In Advances in Networks, Intelligence and Computing (pp. 62-71). CRC Press.
[542] Zhang, X., and Duh, K. (2024, September). Best Practices of Successive Halving on Neural
Machine Translation and Large Language Models. In Proceedings of the 16th Conference of
the Association for Machine Translation in the Americas (Volume 1: Research Track) (pp.
130-139).
[543] Aach, M., Sarma, R., Neukirchen, H., Riedel, M., and Lintermann, A. (2024). Resource-
Adaptive Successive Doubling for Hyperparameter Optimization with Large Datasets on High-
Performance Computing Systems. arXiv preprint arXiv:2412.02729.
[544] Jang, D., Yoon, H., Jung, K., and Chung, Y. D. (2024). QHB+: Accelerated Configuration
Optimization for Automated Performance Tuning of Spark SQL Applications. IEEE Access.
[545] Chen, Y., Wen, Z., Chen, J., and Huang, J. (2024, May). Enhancing the Performance of
Bandit-based Hyperparameter Optimization. In 2024 IEEE 40th International Conference on
Data Engineering (ICDE) (pp. 967-980). IEEE.
[546] Zhang, Y., Wu, H., and Yang, Y. (2024). FlexHB: a More Efficient and Flexible Framework
for Hyperparameter Optimization. arXiv preprint arXiv:2402.13641.
[547] Srivastava, N. (2013). Improving neural networks with dropout. University of Toronto,
182(566), 7.
[548] Baldi, P., and Sadowski, P. J. (2013). Understanding dropout. Advances in neural information
processing systems, 26.
[549] Gal, Y., and Ghahramani, Z. (2016, June). Dropout as a bayesian approximation: Repre-
senting model uncertainty in deep learning. In international conference on machine learning
(pp. 1050-1059). PMLR.
[550] Gal, Y., Hron, J., and Kendall, A. (2017). Concrete dropout. Advances in neural information
processing systems, 30.
[551] Gal, Y., and Ghahramani, Z. (2016). A theoretically grounded application of dropout in
recurrent neural networks. Advances in neural information processing systems, 29.
[552] Friedman, J. H., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized
linear models via coordinate descent. Journal of statistical software, 33, 1-22.
[553] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society Series B: Statistical Methodology, 58(1), 267-288.
[554] Meinshausen, N. (2007). Relaxed lasso. Computational Statistics and Data Analysis, 52(1),
374-393.
[555] Carvalho, C. M., Polson, N. G., and Scott, J. G. (2009, April). Handling sparsity via the
horseshoe. In Artificial intelligence and statistics (pp. 73-80). PMLR.
[556] Hoerl, A. E., and Kennard, R. W. (1970). Ridge regression: Biased estimation for nonorthog-
onal problems. Technometrics, 12(1), 55-67.
[557] Cesa-Bianchi, N., Conconi, A., and Gentile, C. (2004). On the generalization ability of on-line
learning algorithms. IEEE Transactions on Information Theory, 50(9), 2050-2057.
[558] Devroye, L., Györfi, L., and Lugosi, G. (2013). A probabilistic theory of pattern recognition
(Vol. 31). Springer Science and Business Media.
[559] Abu-Mostafa, Y. S., Magdon-Ismail, M., and Lin, H. T. (2012). Learning from data. New
York: AMLBook.
[560] Shalev-Shwartz, S., and Ben-David, S. (2014). Understanding machine learning: From theory
to algorithms. Cambridge university press.
[561] Bühlmann, P., and Van De Geer, S. (2011). Statistics for high-dimensional data: methods,
theory and applications. Springer Science and Business Media.
[562] James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An introduction to statistical
learning: with applications in R. Springer.
[563] Efron, B., Hastie, T., Johnstone, I., and Tibshirani, R. (2004). Least angle regression. The
Annals of Statistics, 32(2), 407-499.
[564] Fan, J., and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its
oracle properties. Journal of the American statistical Association, 96(456), 1348-1360.
[565] Meinshausen, N., and Bühlmann, P. (2006). High-dimensional graphs and variable selection
with the lasso. The Annals of Statistics, 34(3), 1436-1462.
[566] Montavon, G., Orr, G., and Müller, K. R. (Eds.). (2012). Neural networks: tricks of the trade
(Vol. 7700). Springer.
[567] Prechelt, L. (2002). Early stopping-but when?. In Neural Networks: Tricks of the trade (pp.
55-69). Berlin, Heidelberg: Springer Berlin Heidelberg.
[568] Brownlee, J. (2019). Develop deep learning models on Theano and TensorFlow using Keras.
Machine Learning Mastery.
[569] Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. (2017). mixup: Beyond empirical
risk minimization. arXiv preprint arXiv:1710.09412.
[570] Shorten, C., and Khoshgoftaar, T. M. (2019). A survey on image data augmentation for deep
learning. Journal of big data, 6(1), 1-48.
[571] Perez, L. (2017). The effectiveness of data augmentation in image classification using deep
learning. arXiv preprint arXiv:1712.04621.
[572] Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q. V. (2018). Autoaugment:
Learning augmentation policies from data. arXiv preprint arXiv:1805.09501.
[573] Domingos, P. (2012). A few useful things to know about machine learning. Communications
of the ACM, 55(10), 78-87.
[574] Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal
of the royal statistical society: Series B (Methodological), 36(2), 111-133.
[575] LeCun, Y., Denker, J., and Solla, S. (1989). Optimal brain damage. Advances in neural
information processing systems, 2.
[576] Li, H., Kadav, A., Durdanovic, I., Samet, H., and Graf, H. P. (2016). Pruning filters for
efficient convnets. arXiv preprint arXiv:1608.08710.
[577] Frankle, J., and Carbin, M. (2018). The lottery ticket hypothesis: Finding sparse, trainable
neural networks. arXiv preprint arXiv:1803.03635.
[578] Han, S., Pool, J., Tran, J., and Dally, W. (2015). Learning both weights and connections for
efficient neural network. Advances in neural information processing systems, 28.
[579] Liu, Z., Sun, M., Zhou, T., Huang, G., and Darrell, T. (2018). Rethinking the value of network
pruning. arXiv preprint arXiv:1810.05270.
[580] Cheng, Y., Wang, D., Zhou, P., and Zhang, T. (2017). A survey of model compression and
acceleration for deep neural networks. arXiv preprint arXiv:1710.09282.
[581] Frankle, J., Dziugaite, G. K., Roy, D. M., and Carbin, M. (2020). Pruning neural networks
at initialization: Why are we missing the mark?. arXiv preprint arXiv:2009.08576.
[584] Freund, Y., and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning
and an application to boosting. Journal of computer and system sciences, 55(1), 119-139.
[585] Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals
of statistics, 1189-1232.
[586] Zhou, Z. H. (2025). Ensemble methods: foundations and algorithms. CRC press.
[588] Chen, T., and Guestrin, C. (2016, August). XGBoost: A scalable tree boosting system. In
Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and
data mining (pp. 785-794).
[589] Bühlmann, P., and Yu, B. (2003). Boosting with the L2 loss: regression and classification.
Journal of the American Statistical Association, 98(462), 324-339.
[590] Hinton, G. E., and Van Camp, D. (1993, August). Keeping the neural networks simple by
minimizing the description length of the weights. In Proceedings of the sixth annual conference
on Computational learning theory (pp. 5-13).
[591] Bishop, C. M. (1995). Training with noise is equivalent to Tikhonov regularization. Neural
computation, 7(1), 108-116.
[592] Grandvalet, Y., and Bengio, Y. (2004). Semi-supervised learning by entropy minimization.
Advances in neural information processing systems, 17.
[593] Wager, S., Wang, S., and Liang, P. S. (2013). Dropout training as adaptive regularization.
Advances in neural information processing systems, 26.
[594] Pei, Z., Zhang, Z., Chen, J., Liu, W., Chen, B., Huang, Y., ... and Lu, Y. (2025). KAN–CNN:
A Novel Framework for Electric Vehicle Load Forecasting with Enhanced Engineering Appli-
cability and Simplified Neural Network Tuning. Electronics, 14(3), 414.
[595] Chen, H. (2024). Augmenting image data using noise, rotation and shifting.
[596] An, D., Liu, P., Feng, Y., Ding, P., Zhou, W., and Yu, B. (2024). Dynamic weighted knowledge
distillation for brain tumor segmentation. Pattern Recognition, 155, 110731.
[597] Song, Y. F., and Liu, Y. (2024). Fast adversarial training method based on data augmen-
tation and label noise. Journal of Computer Applications.
[598] Hosseini, S. A., Servaes, S., Rahmouni, N., Therriault, J., Tissot, C., Macedo, A. C., ... and
Rosa-Neto, P. (2024). Leveraging T1 MRI Images for Amyloid Status Prediction in Diverse
Cognitive Conditions Using Advanced Deep Learning Models. Alzheimer’s and Dementia, 20,
e094153.
[599] Cakmakci, U. B. Deep Learning Approaches for Pediatric Bone Age Prediction from Hand
Radiographs.
[600] Surana, A. V., Pawar, S. E., Raha, S., Mali, N., and Mukherjee, T. (2024). Ensemble fine-
tuned multi layer perceptron for predictive analysis of weather patterns and rainfall forecasting
from satellite data. ICTACT Journal on Soft Computing, 15(2).
[602] Zaitoon, R., Mohanty, S. N., Godavarthi, D., and Ramesh, J. V. N. (2024). SPBTGNS:
Design of an Efficient Model for Survival Prediction in Brain Tumour Patients using Generative
Adversarial Network with Neural Architectural Search Operations. IEEE Access.
[603] Bansal, A., Sharma, D. R., and Kathuria, D. M. Bayesian-Optimized Ensemble Approach for
Fall Detection: Integrating Pose Estimation with Temporal Convolutional and Graph Neural
Networks. Available at SSRN 4974349.
[604] Kusumaningtyas, E. M., Ramadijanti, N., and Rijal, I. H. K. (2024, August). Convolutional
Neural Network Implementation with MobileNetV2 Architecture for Indonesian Herbal Plants
Classification in Mobile App. In 2024 International Electronics Symposium (IES) (pp. 521-
527). IEEE.
[605] Yadav, A. C., Alam, Z., and Mufeed, M. (2024, August). U-Net-Driven Advancements in
Breast Cancer Detection and Segmentation. In 2024 International Conference on Electrical
Electronics and Computing Technologies (ICEECT) (Vol. 1, pp. 1-6). IEEE.
[606] Alshamrani, A. F. A., and Alshomran, F. (2024). Optimizing Breast Cancer Mammogram
Classification through a Dual Approach: A Deep Learning Framework Combining ResNet50,
SMOTE, and Fully Connected Layers for Balanced and Imbalanced Data. IEEE Access.
[607] Zamindar, N. (2024). Using Artificial Intelligence for Thermographic Image Analysis: Appli-
cations to the Arc Welding Process (Doctoral dissertation, Politecnico di Torino).
[608] Xu, M., Yin, H., and Zhong, S. (2024, July). Enhancing Generalization and Convergence in
Neural Networks through a Dual-Phase Regularization Approach with Excitatory-Inhibitory
Transition. In 2024 International Conference on Electrical, Computer and Energy Technologies
(ICECET) (pp. 1-4). IEEE.
[609] Elshamy, R., Abu-Elnasr, O., Elhoseny, M., and Elmougy, S. (2024). Enhancing colorectal
cancer histology diagnosis using modified deep neural networks optimizer. Scientific Reports,
14(1), 19534.
[610] Vinay, K., Kodipalli, A., Swetha, P., and Kumaraswamy, S. (2024, May). Analysis of pre-
diction of pneumonia from chest X-ray images using CNN and transfer learning. In 2024 5th
International Conference for Emerging Technology (INCET) (pp. 1-6). IEEE.
[611] Gai, S., and Huang, X. (2024). Regularization method for reduced biquaternion neural net-
work. Applied Soft Computing, 166, 112206.
[612] Xu, Y. (2025). Deep regularization techniques for improving robustness in noisy record linkage
task. Advances in Engineering Innovation, 15, 9-13.
[613] Liao, Z., Li, S., Zhou, P., and Zhang, C. (2025). Decay regularized stochastic configura-
tion networks with multi-level data processing for UAV battery RUL prediction. Information
Sciences, 701, 121840.
[614] Dong, Z., Yang, C., Li, Y., Huang, L., An, Z., and Xu, Y. (2024, May). Class-wise Image Mix-
ture Guided Self-Knowledge Distillation for Image Classification. In 2024 27th International
Conference on Computer Supported Cooperative Work in Design (CSCWD) (pp. 310-315).
IEEE.
[615] Ba, Y., Mancenido, M. V., and Pan, R. (2024). How Does Data Diversity Shape the Weight
Landscape of Neural Networks?. arXiv preprint arXiv:2410.14602.
[616] Li, Z., Zhang, Y., and Li, W. (2024, September). Fusion of L2 Regularisation and Hybrid
Sampling Methods for Multi-Scale SincNet Audio Recognition. In 2024 IEEE 7th Information
Technology, Networking, Electronic and Automation Control Conference (ITNEC) (Vol. 7, pp.
1556-1560). IEEE.
[617] Zang, X., and Yan, A. (2024, May). A Stochastic Configuration Network with Attenuation
Regularization and Multi-kernel Learning and Its Application. In 2024 36th Chinese Control
and Decision Conference (CCDC) (pp. 2385-2390). IEEE.
[618] Moradi, R., Berangi, R., and Minaei, B. (2020). A survey of regularization strategies for deep
models. Artificial Intelligence Review, 53(6), 3947-3986.
[619] Rodríguez, P., Gonzalez, J., Cucurull, G., Gonfaus, J. M., and Roca, X. (2016). Regularizing
CNNs with locally constrained decorrelations. arXiv preprint arXiv:1611.01967.
[620] Tian, Y., and Zhang, Y. (2022). A comprehensive survey on regularization strategies in
machine learning. Information Fusion, 80, 146-166.
[621] Cong, Y., Liu, J., Fan, B., Zeng, P., Yu, H., and Luo, J. (2017). Online similarity learning
for big data with overfitting. IEEE Transactions on Big Data, 4(1), 78-89.
[622] Salman, S., and Liu, X. (2019). Overfitting mechanism and avoidance in deep neural networks.
arXiv preprint arXiv:1901.06566.
[623] Wang, K., Muthukumar, V., and Thrampoulidis, C. (2021). Benign overfitting in multiclass
classification: All roads lead to interpolation. Advances in Neural Information Processing
Systems, 34, 24164-24179.
[624] Poggio, T., Kawaguchi, K., Liao, Q., Miranda, B., Rosasco, L., Boix, X., ... and Mhaskar,
H. (2017). Theory of deep learning III: explaining the non-overfitting puzzle. arXiv preprint
arXiv:1801.00173.
[625] Oyedotun, O. K., Olaniyi, E. O., and Khashman, A. (2017). A simple and practical review of
over-fitting in neural network learning. International Journal of Applied Pattern Recognition,
4(4), 307-328.
[626] Luo, X., Chang, X., and Ban, X. (2016). Regression and classification using extreme learning
machine based on L1-norm and L2-norm. Neurocomputing, 174, 179-186.
[627] Zhou, Y., Yang, Y., Wang, D., Zhai, Y., Li, H., and Xu, Y. (2024). Innovative Ghost Channel
Spatial Attention Network with Adaptive Activation for Efficient Rice Disease Identification.
Agronomy, 14(12), 2869.
[628] Omole, O. J., Rosa, R. L., Saadi, M., and Rodriguez, D. Z. (2024). AgriNAS: Neural Architec-
ture Search with Adaptive Convolution and Spatial–Time Augmentation Method for Soybean
Diseases. AI, 5(4), 2945-2966.
[629] Tripathi, L., Dubey, P., Kalidoss, D., Prasad, S., Sharma, G., and Dubey, P. (2024, Decem-
ber). Deep Learning Approaches for Brain Tumour Detection Using VGG-16 Architecture. In
2024 IEEE 16th International Conference on Computational Intelligence and Communication
Networks (CICN) (pp. 256-261). IEEE.
[630] Singla, S., and Gupta, R. (2024, December). Pneumonia Detection from Chest X-Ray Images
Using Transfer Learning with EfficientNetB1. In 2024 International Conference on IoT Based
Control Networks and Intelligent Systems (ICICNIS) (pp. 894-899). IEEE.
[631] Al-Adhaileh, M. H., Alsharbi, B. M., Aldhyani, T., Ahmad, S., Almaiah, M., Ahmed, Z.
A., ... and Singh, S. DLAAD-Deep Learning Algorithms Assisted Diagnosis of Chest Disease
Using Radiographic Medical Images. Frontiers in Medicine, 11, 1511389.
[632] Harvey, E., Petrov, M., and Hughes, M. C. (2025). Learning Hyperparameters via a Data-
Emphasized Variational Objective. arXiv preprint arXiv:2502.01861.
[633] Mahmood, T., Saba, T., Al-Otaibi, S., Ayesha, N., and Almasoud, A. S. (2025). AI-Driven
Microscopy: Cutting-Edge Approach for Breast Tissue Prognosis Using Microscopic Images.
Microscopy Research and Technique.
[634] Shen, Q. (2025). Predicting the value of football players: machine learning techniques and
sensitivity analysis based on FIFA and real-world statistical datasets. Applied Intelligence,
55(4), 265.
[635] Guo, X., Wang, M., Xiang, Y., Yang, Y., Ye, C., Wang, H., and Ma, T. (2025). Uncer-
tainty Driven Adaptive Self-Knowledge Distillation for Medical Image Segmentation. IEEE
Transactions on Emerging Topics in Computational Intelligence.
[636] Zambom, A. Z., and Dias, R. (2013). A review of kernel density estimation with applications
to econometrics. International Econometric Review, 5(1), 20-42.
[637] Reyes, M., Francisco-Fernández, M., and Cao, R. (2016). Nonparametric kernel density esti-
mation for general grouped data. Journal of Nonparametric Statistics, 28(2), 235-249.
[638] Tenreiro, C. (2024). A Parzen–Rosenblatt type density estimator for circular data: exact and
asymptotic optimal bandwidths. Communications in Statistics-Theory and Methods, 53(20),
7436-7452.
[639] Devroye, L., and Penrod, C. S. (1984). The consistency of automatic kernel density estimates.
The Annals of Statistics, 1231-1249.
[641] Slaoui, Y. (2018). Bias reduction in kernel density estimation. Journal of Nonparametric
Statistics, 30(2), 505-522.
[642] Michalski, A. (2016). The use of kernel estimators to determine the distribution of ground-
water level. Meteorology Hydrology and Water Management. Research and Operational Ap-
plications, 4(1), 41-46.
[643] Gramacki, A., and Gramacki, A. (2018). Kernel density estimation. Nonparametric Kernel
Density Estimation and Its Computational Aspects, 25-62.
[644] Desobry, F., Davy, M., and Fitzgerald, W. J. (2007, April). Density kernels on unordered
sets for kernel-based signal processing. In 2007 IEEE International Conference on Acoustics,
Speech and Signal Processing-ICASSP’07 (Vol. 2, pp. II-417). IEEE.
[645] Gasser, T., and Müller, H. G. (1979). Kernel estimation of regression functions. In Smoothing
Techniques for Curve Estimation: Proceedings of a Workshop held in Heidelberg, April 2–4,
1979 (pp. 23-68). Springer Berlin Heidelberg.
[646] Gasser, T., and Müller, H. G. (1984). Estimating regression functions and their derivatives
by the kernel method. Scandinavian journal of statistics, 171-185.
[647] Härdle, W., and Gasser, T. (1985). On robust kernel estimation of derivatives of regression
functions. Scandinavian journal of statistics, 233-240.
[648] Müller, H. G. (1987). Weighted local regression and kernel methods for nonparametric curve
fitting. Journal of the American Statistical Association, 82(397), 231-238.
[649] Chu, C. K. (1993). A new version of the Gasser-Mueller estimator. Journal of Nonparametric
Statistics, 3(2), 187-193.
[650] Peristera, P., and Kostaki, A. (2005). An evaluation of the performance of kernel estimators
for graduating mortality data. Journal of Population Research, 22, 185-197.
[651] Müller, H. G. (1991). Smooth optimum kernel estimators near endpoints. Biometrika, 78(3),
521-530.
[652] Gasser, T., Gervini, D., Molinari, L., Hauspie, R. C., and Cameron, N. (2004). Kernel es-
timation, shape-invariant modelling and structural analysis. Cambridge Studies in Biological
and Evolutionary Anthropology, 179-204.
[653] Jennen-Steinmetz, C., and Gasser, T. (1988). A unifying approach to nonparametric regres-
sion estimation. Journal of the American Statistical Association, 83(404), 1084-1089.
[654] Müller, H. G. (1997). Density adjusted kernel smoothers for random design nonparametric
regression. Statistics and probability letters, 36(2), 161-172.
[656] Steland, A. The average run length of kernel control charts for dependent time series.
[657] Makkulau, A. T. A., Baharuddin, M., and Agusrawati, A. T. P. M. (2023, December). Multi-
variable Semiparametric Regression Used Priestley-Chao Estimators. In Proceedings of the 5th
International Conference on Statistics, Mathematics, Teaching, and Research 2023 (ICSMTR
2023) (Vol. 109, p. 118). Springer Nature.
[659] Mack, Y. P., and Müller, H. G. (1988). Convolution type estimators for nonparametric re-
gression. Statistics and probability letters, 7(3), 229-239.
[660] Jones, M. C., Davies, S. J., and Park, B. U. (1994). Versions of kernel-type regression esti-
mators. Journal of the American Statistical Association, 89(427), 825-832.
[661] Ghosh, S. (2015). Surface estimation under local stationarity. Journal of Nonparametric
Statistics, 27(2), 229-240.
[662] Liu, C. W., and Luor, D. C. (2023). Applications of fractal interpolants in kernel regression
estimations. Chaos, Solitons and Fractals, 175, 113913.
[663] Agua, B. M., and Bouzebda, S. (2024). Single index regression for locally stationary functional
time series. AIMS Mathematics, 9, 36202-36258.
[664] Bouzebda, S., Nezzal, A., and Elhattab, I. (2024). Limit theorems for nonparametric condi-
tional U-statistics smoothed by asymmetric kernels. AIMS Mathematics, 9(9), 26195-26282.
[665] Zhao, H., Qian, Y., and Qu, Y. (2025). Mechanical performance degradation modelling and
prognosis method of high-voltage circuit breakers considering censored data. IET Science,
Measurement and Technology, 19(1), e12235.
[666] Patil, M. D., Kannaiyan, S., and Sarate, G. G. (2024). Signal denoising based on bias-variance
of intersection of confidence interval. Signal, Image and Video Processing, 18(11), 8089-8103.
[667] Kakani, K., and Radhika, T. S. L. (2024). Nonparametric and nonlinear approaches for
medical data analysis. International Journal of Data Science and Analytics, 1-19.
[668] Kato, M. (2024). Debiased Regression for Root-N-Consistent Conditional Mean Estimation.
arXiv preprint arXiv:2411.11748.
[669] Sadek, A. M., and Mohammed, L. A. (2024). Evaluation of the Performance of Kernel Non-
parametric Regression and Ordinary Least Squares Regression. JOIV: International Journal
on Informatics Visualization, 8(3), 1352-1360.
[670] Gong, A., Choi, K., and Dwivedi, R. (2024). Supervised Kernel Thinning. arXiv preprint
arXiv:2410.13749.
[671] Zavatone-Veth, J. A., and Pehlevan, C. (2025). Nadaraya–Watson kernel smoothing as a
random energy model. Journal of Statistical Mechanics: Theory and Experiment, 2025(1),
013404.
[672] Ferrigno, S. (2024, December). Nonparametric estimation of reference curves. In CMStatistics
2024.
[673] Fan, X., Leng, C., and Wu, W. (2025). Causal Inference under Interference: Regression
Adjustment and Optimality. arXiv preprint arXiv:2502.06008.
[674] Atanasov, A., Bordelon, B., Zavatone-Veth, J. A., Paquette, C., and Pehlevan, C. (2025).
Two-Point Deterministic Equivalence for Stochastic Gradient Dynamics in Linear Models.
arXiv preprint arXiv:2502.05074.
[675] Ghosh, S. (2020). The Basel Problem. arXiv preprint arXiv:2010.03953.
[676] Mishra, U., Gupta, D., Sarkar, A., and Hazarika, B. B. (2025). A hybrid approach for plant
leaf detection using ResNet50-intuitionistic fuzzy RVFL (ResNet50-IFRVFLC) classifier. Com-
puters and Electrical Engineering, 123, 110135.
[677] Elsayed, M. M., and Nazier, H. (2025). Technology and evolution of occupational employment
in Egypt (1998–2018): a task-based framework. Review of Economics and Political Science.
[678] Kong, X., Li, C., and Pan, Y. (2025). Association Between Heavy Metals Mixtures and Life’s
Essential 8 Score in General US Adults. Cardiovascular Toxicology, 1-12.
[679] Bracale, D., Banerjee, M., Sun, Y., Stoll, K., and Turki, S. (2025). Dynamic Pricing in the
Linear Valuation Model using Shape Constraints. arXiv preprint arXiv:2502.05776.
[680] Köhne, F., Philipp, F. M., Schaller, M., Schiela, A., and Worthmann, K. (2024). L∞-error
bounds for approximations of the Koopman operator by kernel extended dynamic mode de-
composition. arXiv preprint arXiv:2403.18809.
[681] Sadeghi, R., and Beyeler, M. (2025). Efficient Spatial Estimation of Perceptual Thresholds
for Retinal Implants via Gaussian Process Regression. arXiv preprint arXiv:2502.06672.
[682] Naresh, E., Patil, A., and Bhuvan, S. (2025, February). Enhancing network security with
eBPF-based firewall and machine learning. In Data Science and Exploration in Artificial Intel-
ligence: Proceedings of the First International Conference On Data Science and Exploration
in Artificial Intelligence (CODE-AI 2024) Bangalore, India, 3rd-4th July, 2024 (Volume 1) (p.
169). CRC Press.
[683] Zhao, W., Chen, H., Liu, T., Tuo, R., and Tian, C. From Deep Additive Kernel Learn-
ing to Last-Layer Bayesian Neural Networks via Induced Prior Approximation. In The 28th
International Conference on Artificial Intelligence and Statistics.
[684] Nanyonga, A., Wasswa, H., Joiner, K., Turhan, U., and Wild, G. (2025). A Multi-Head
Attention-Based Transformer Model for Predicting Causes in Aviation Incident.
[685] Fan, C. L., and Chung, Y. J. (2025). Integrating Image Processing Technology and Deep
Learning to Identify Crops in UAV Orthoimages.
[686] Bakaev, M., Gorovaia, S., and Mitrofanova, O. (2025). Who Will Author the Synthetic Texts?
Evoking Multiple Personas from Large Language Models to Represent Users’ Associative The-
sauri. Big Data and Cognitive Computing, 9(2), 46.
[687] Ahn, K. S., Choi, J. H., Kwon, H., Lee, S., Cho, Y., and Jang, W. Y. (2025). Deep learning-
based automated guide for defining a standard imaging plane for developmental dysplasia
of the hip screening using ultrasonography: a retrospective imaging analysis. BMC Medical
Informatics and Decision Making, 25(1), 1-8.
[688] Peng, J., Lu, F., Li, B., Huang, Y., Qu, S., and Chen, G. (2025). Range and Bird’s Eye View
Fused Cross-Modal Visual Place Recognition. arXiv preprint arXiv:2502.11742.
[689] Zhao, J., Wang, W., Wang, J., Zhang, S., Fan, Z., and Matwin, S. (2025). Privacy-preserved
federated clustering with Non-IID data via GANs. The Journal of Supercomputing, 81(4),
1-37.
[690] Wang, J., Liu, L., He, K., Gebrewahid, T. W., Gao, S., Tian, Q., ... and Li, H. (2025).
Accurate genomic prediction for grain yield and grain moisture content of maize hybrids using
multi-environment data. Journal of Integrative Plant Biology.
[691] Xu, H., Xue, T., Fan, J., Liu, D., Chen, Y., Zhang, F., ... and Cai, W. (2025). Medical Image
Registration Meets Vision Foundation Model: Prototype Learning and Contour Awareness.
arXiv preprint arXiv:2502.11440.
[692] Sun, M., Yin, Y., Xu, Z., Kolter, J. Z., and Liu, Z. (2025). Idiosyncrasies in Large Language
Models. arXiv preprint arXiv:2502.12150.
[693] Liang, Y., Liu, F., Li, A., Li, X., and Zheng, C. (2025). NaturalL2S: End-to-End High-
quality Multispeaker Lip-to-Speech Synthesis with Differential Digital Signal Processing. arXiv
preprint arXiv:2502.12002.
[694] Fix, E., and Hodges, J. L. (1951). Discriminatory analysis, nonparametric discrimination:
consistency properties. USAF School of Aviation Medicine, Randolph Field, Texas.
[695] Cover, T., and Hart, P. (1967). Nearest neighbor pattern classification. IEEE transactions on
information theory, 13(1), 21-27.
[696] Devroye, L., Györfi, L., and Lugosi, G. (2013). A probabilistic theory of pattern recognition
(Vol. 31). Springer Science and Business Media.
[697] Toussaint, G. (2005). Geometric proximity graphs for improving nearest neighbor methods in
instance-based learning and data mining. International Journal of Computational Geometry
and Applications, 15(02), 101-150.
[698] Cox, D., Ghosh, S., and Sultanow, E. (2021). Collatz Cycles and 3n+c Cycles. arXiv preprint
arXiv:2101.04067.
[699] Arya, S., Mount, D. M., Netanyahu, N. S., Silverman, R., and Wu, A. Y. (1998). An optimal
algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM
(JACM), 45(6), 891-923.
[700] Terrell, G. R., and Scott, D. W. (1992). Variable kernel density estimation. The Annals of
Statistics, 1236-1265.
[702] Bremner, D., Demaine, E., Erickson, J., Iacono, J., Langerman, S., Morin, P., and Toussaint,
G. (2005). Output-sensitive algorithms for computing nearest-neighbour decision boundaries.
Discrete and Computational Geometry, 33, 593-604.
[703] Ramaswamy, S., Rastogi, R., and Shim, K. (2000, May). Efficient algorithms for mining out-
liers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference
on Management of data (pp. 427-438).
[704] Cover, T. M. (1999). Elements of information theory. John Wiley and Sons.
[706] Chen, J. S., Hung, R. W., and Yang, C. Y. (2025). An Efficient Target-to-Area Classification
Strategy with a PIP-Based KNN Algorithm for Epidemic Management. Mathematics, 13(4),
661.
[707] Liu, J., Tu, S., Wang, M., Chen, D., Chen, C., and Xie, H. (2025). The influence of different
factors on the bond strength of lithium disilicate-reinforced glass–ceramics to Resin: a machine
learning analysis. BMC Oral Health, 25(1), 1-12.
[708] Barghouthi, E. A. D., Owda, A. Y., Owda, M., and Asia, M. (2025). A Fused Multi-Channel
Prediction Model of Pressure Injury for Adult Hospitalized Patients—The “EADB” Model.
AI, 6(2), 39.
[709] Jewan, S. Y. Y. Remote sensing technology and machine learning algorithms for crop yield
prediction in Bambara groundnut and grapevines (Doctoral dissertation, University of Not-
tingham).
[710] Moldovanu, S., Munteanu, D., and Sîrbu, C. (2025). Impact on Classification Process Gen-
erated by Corrupted Features. Big Data and Cognitive Computing, 9(2), 45.
[711] HosseinpourFardi, N., and Alizadeh, B. (2025). AILIS: effective hardware accelerator for
incremental learning with intelligent selection in classification. The Journal of Supercomputing,
81(4), 1-30.
[712] Afrin, T., Yodo, N., and Huang, Y. (2025). AI-Driven Framework for Predicting Oil Pipeline
Failure Causes Based on Leak Properties and Financial Impact. Journal of Pipeline Systems
Engineering and Practice, 16(2), 04025009.
[713] Hussain, M. A., Chen, Z., Zhou, Y., Ullah, H., and Ying, M. (2025). Spatial analysis of flood
susceptibility in Coastal area of Pakistan using machine learning models and SAR imagery.
Environmental Earth Sciences, 84(5), 1-23.
[714] Reddy, S. R., and Murthy, G. V. (2025). Cardiovascular Disease Prediction Using Particle
Swarm Optimization and Neural Network Based an Integrated Framework. SN Computer
Science, 6(2), 186.
[715] Chen, Y., Garcia, E. K., Gupta, M. R., Rahimi, A., and Cazzanti, L. (2009). Similarity-based
classification: Concepts and algorithms. Journal of Machine Learning Research, 10(3).
[716] Chechik, G., Sharma, V., Shalit, U., and Bengio, S. (2010). Large scale online learning of
image similarity through ranking. Journal of Machine Learning Research, 11(3).
[717] Huang, W., Zhang, P., and Wan, M. (2013). A novel similarity learning method via relative
comparison for content-based medical image retrieval. Journal of digital imaging, 26, 850-865.
[718] Yang, P., Wang, H., Yang, J., Qian, Z., Zhang, Y., and Lin, X. (2024). Deep learning ap-
proaches for similarity computation: A survey. IEEE Transactions on Knowledge and Data
Engineering.
[719] Xiao, Y., Liu, B., Yin, J., Cao, L., Zhang, C., and Hao, Z. (2011, July). Similarity-based
approach for positive and unlabeled learning. In IJCAI Proceedings-International Joint Con-
ference on Artificial Intelligence (Vol. 22, No. 1, p. 1577).
[720] Kar, P., and Jain, P. (2011). Similarity-based learning via data driven embeddings. Advances
in neural information processing systems, 24.
[721] Top 10 tools for calculating semantic similarity. PingCAP. Retrieved from
https://www.pingcap.com/article/top-10-tools-for-calculating-semantic-similarity/
[722] Co-citation proximity analysis. (n.d.). In Wikipedia. Retrieved February 22, 2025, from
https://en.wikipedia.org/wiki/Co-citation_Proximity_Analysis
[723] Choi, S. (2022). Internet News User Analysis Using Deep Learning and Similarity Compari-
son. Electronics, 11(4), 569.
[726] Breiman, L., Friedman, J., Olshen, R. A., and Stone, C. J. (2017). Classification and regres-
sion trees. Routledge.
[727] Kohavi, R., and John, G. H. (1997). Wrappers for feature subset selection. Artificial intelli-
gence, 97(1-2), 273-324.
[729] Freund, Y., and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning
and an application to boosting. Journal of computer and system sciences, 55(1), 119-139.
[731] Domingos, P., and Hulten, G. (2000, August). Mining high-speed data streams. In Proceed-
ings of the sixth ACM SIGKDD international conference on Knowledge discovery and data
mining (pp. 71-80).
[732] Freund, Y., and Mason, L. (1999, June). The alternating decision tree learning algorithm. In
ICML (Vol. 99, pp. 124-133).
[734] Usman, S. A., Bhattacharjee, M., Alsukhailah, A. A., Shahzad, A. D., Razick, M. S. A., and
Amin, N. (2025). Identifying the Best-Selling Product using Machine Learning Algorithms.
[735] Abbas, J., Yousef, M., Hamoud, K., and Joubran, K. (2025). Low Back Pain Among Health
Sciences Undergraduates: Results Obtained from a Machine-Learning Analysis.
[736] Deng, C., Liu, X., Zhang, J., Mo, Y., Li, P., Liang, X., and Li, N. (2025). Prediction of retail
commodity hot-spots: a machine learning approach. Data Science and Management.
[737] Eili, M. Y., Rezaeenour, J., and Roozbahani, M. H. (2025). Predicting clinical pathways of
traumatic brain injuries (TBIs) through process mining. npj Digital Medicine, 8(1), 1-12.
[738] Yin, Y., Xu, B., Chang, J., Li, Z., Bi, X., Wei, Z., ... and Cai, J. (2025). Gamma-Glutamyl
Transferase Plus Carcinoembryonic Antigen Ratio Index: A Promising Biomarker Associated
with Treatment Response to Neoadjuvant Chemotherapy for Patients with Colorectal Cancer
Liver Metastases. Current Oncology, 32(2), 117.
[739] Abdullahi, N., Akbal, E., Dogan, S., Tuncer, T., and Erman, U. Accurate Indoor Home
Location Classification through Sound Analysis: The 1D-ILQP Approach. Firat University
Journal of Experimental and Computational Engineering, 4(1), 12-29.
[740] Mokan, M., Gabrani, G., and Relan, D. (2025). Pixel-wise classification of the whole retinal
vasculature into arteries and veins using supervised learning. Biomedical Signal Processing
and Control, 106, 107691.
[741] Maron, M. E. (1961). Automatic indexing: an experimental inquiry. Journal of the ACM
(JACM), 8(3), 404-417.
[742] Minsky, M. (1961). Steps toward artificial intelligence. Proceedings of the IRE, 49(1), 8-30.
[743] Mosteller, F., and Wallace, D. L. (1963). Inference in an authorship problem: A comparative
study of discrimination methods applied to the authorship of the disputed Federalist Papers.
Journal of the American Statistical Association, 58(302), 275-309.
[744] Domingos, P., and Pazzani, M. (1997). On the optimality of the simple Bayesian classifier
under zero-one loss. Machine learning, 29, 103-130.
[745] Hand, D. J., and Yu, K. (2001). Idiot’s Bayes—not so stupid after all?. International statistical
review, 69(3), 385-398.
[746] Rish, I. (2001, August). An empirical study of the naive Bayes classifier. In IJCAI 2001
workshop on empirical methods in artificial intelligence (Vol. 3, No. 22, pp. 41-46).
[747] Ng, A., and Jordan, M. (2001). On discriminative vs. generative classifiers: A comparison of
logistic regression and naive bayes. Advances in neural information processing systems, 14.
[748] Webb, G. I., Boughton, J. R., and Wang, Z. (2005). Not so naive Bayes: aggregating one-
dependence estimators. Machine learning, 58, 5-24.
[749] Boullé, M. (2007). Compression-based averaging of selective naive Bayes classifiers. The Jour-
nal of Machine Learning Research, 8, 1659-1685.
[750] Larsen, B., and Aone, C. (1999, August). Fast and effective text mining using linear-time
document clustering. In Proceedings of the fifth ACM SIGKDD international conference on
Knowledge discovery and data mining (pp. 16-22).
[751] Shannaq, B. (2025). Does dataset splitting impact Arabic text classification more than
preprocessing? An empirical analysis in big data analytics. Journal of Theoretical and Applied
Information Technology, 103(3).
[752] Goldstein, D., Aldrich, C., Shao, Q., and O’Connor, L. (2025). A Machine Learning Classifi-
cation Approach to Geotechnical Characterisation Using Measure-While-Drilling Data.
[753] Ntamwiza, J. M. V., and Bwire, H. (2025). Predicting biking preferences in Kigali city: A
comparative study of traditional statistical models and ensemble machine learning models.
Transport Economics and Management.
[754] EL Fadel, N. (2025). Facial Recognition Algorithms: A Systematic Literature Review. Journal
of Imaging, 11(2), 58.
[755] RaviKumar, S., Pandian, C. A., Hameed, S. S., Muralidharan, V., and Ali, M. S. W. (2025).
Application of machine learning for fault diagnosis and operational efficiency in EV motor test
benches using vibration analysis. Engineering Research Express, 7(1), 015355.
[756] Kavitha, D., Srujankumar, G., Akhil, C., and Sumanth, P. Uncovering the Truth: A Machine
Learning Approach to Detect Fake Product Reviews and Analyze Sentiment. Explainable IoT
Applications: A Demystification, 309.
[757] Nusantara, R. M. (2025). Analisis Sentimen Masyarakat terhadap Pelayanan Bank Central
Asia: Text Mining Cuitan Satpam BCA pada Twitter [Public sentiment analysis of Bank Central
Asia services: text mining of Satpam BCA tweets on Twitter]. Co-Value Jurnal Ekonomi Koperasi
dan Kewirausahaan, 15(9).
[758] Ahmadi, M., Khajavi, M., Varmaghani, A., Ala, A., Danesh, K., and Javaheri, D. (2025).
Leveraging Large Language Models for Cybersecurity: Enhancing SMS Spam Detection with
Robust and Context-Aware Text Classification. arXiv preprint arXiv:2502.11014.
[759] Takaki, T., Matsuoka, R., Fujita, Y., and Murakami, S. (2025). Development and clinical
evaluation of an AI-assisted respiratory state classification system for chest X-rays: A BMI-
Specific approach. Computers in Biology and Medicine, 188, 109854.
[760] Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of
eugenics, 7(2), 179-188.
[761] Anderson, T. W. (1958). An introduction to multivariate statistical analysis. New York:
Wiley.
[762] Rao, C. R. (1948). The utilization of multiple measurements in problems of biological classi-
fication. Journal of the Royal Statistical Society. Series B (Methodological), 10(2), 159-203.
[763] Duda, R. O., and Hart, P. E. (2006). Pattern classification. John Wiley and Sons.
[764] McLachlan, G. J. (2005). Discriminant analysis and statistical pattern recognition. John
Wiley and Sons.
[765] Belhumeur, P. N., Hespanha, J. P., and Kriegman, D. J. (1997). Eigenfaces vs. fisherfaces:
Recognition using class specific linear projection. IEEE Transactions on pattern analysis and
machine intelligence, 19(7), 711-720.
[766] Mika, S., Rätsch, G., Weston, J., Schölkopf, B., and Müllers, K. R. (1999, August). Fisher
discriminant analysis with kernels. In Neural networks for signal processing IX: Proceedings
of the 1999 IEEE signal processing society workshop (Cat. No. 98TH8468) (pp. 41-48). IEEE.
[767] Ye, J., and Yu, B. (2005). Characterization of a family of algorithms for generalized discrim-
inant analysis on undersampled problems. Journal of Machine Learning Research, 6(4).
[768] Sugiyama, M. (2007). Dimensionality reduction of multimodal labeled data by local fisher
discriminant analysis. Journal of machine learning research, 8(5).
[769] Hartmann, M., Wolff, W., and Martarelli, C. S. Unpleasant mind, deactivated body: a
distinct somatic signature of boredom through bodily sensation mapping.
[770] Garrido-Tamayo, M. A., Rincón Santamaría, A., Hoyos, F. E., González Vega, T., and Laroze,
D. (2025). Autofluorescence of Red Blood Cells Infected with P. falciparum as a Preliminary
Analysis of Spectral Sweeps to Predict Infection. Biosensors, 15(2), 123.
[771] Li, B., and Jiang, S. (2025). Reservoir Fluid PVT High-Pressure Physical Property Analysis
Based on Graph Convolutional Network Model. Applied Sciences, 15(4), 2209.
[772] Nyembwe, A., Zhao, Y., Caceres, B. A., Hall, K., Prescott, L., Potts-Thompson, S., ...
and Taylor, J. Y. (2025). Moderating effect of coping strategies on the association between
perceived discrimination and blood pressure outcomes among young Black mothers in the
InterGEN study. AIMS Public Health, 12(1), 217-232.
[773] Singh, S. K., Kumar, M., Khan, I. M., Jayanthiladevi, A., and Agarwal, C. (2025). An
Attention-based Model for Recognition of Facial Expressions using CNN-BiLSTM. Polytechnic
Journal, 15(1), 4.
[774] Akter, T., Faqeerzada, M. A., Kim, Y., Pahlawan, M. F. R., Aline, U., Kim, H., ... and Cho,
B. K. (2025). Hyperspectral imaging with multivariate analysis for detection of exterior flaws
for quality evaluation of apples and pears. Postharvest Biology and Technology, 223, 113453.
[775] Feng, C. H., Deng, F., Disis, M. L., Gao, N., and Zhang, L. (2025). Towards machine learning
fairness in classifying multicategory causes of deaths in colorectal or lung cancer patients.
bioRxiv, 2025-02.
[777] Chick, H. M., Williams, L. K., Sparks, N., Khattak, F., Vermeij, P., Frantzen, I., ... and
Wilkinson, T. S. (2025). Campylobacter jejuni ST353 and ST464 cause localized gut inflam-
mation, crypt damage, and extraintestinal spread during large-and small-scale infection in
broiler chickens. Applied and Environmental Microbiology, e01614-24.
[778] Miao, X., Xu, L., Sun, L., Xie, Y., Zhang, J., Xu, X., ... and Lin, J. (2025). Highly Sensitive
Detection and Molecular Subtyping of Breast Cancer Cells Using Machine Learning-assisted
SERS Technology. Nano Biomedicine and Engineering.
[779] Rohan, D., Reddy, G. P., Kumar, Y. P., Prakash, K. P., and Reddy, C. P. (2025). An exten-
sive experimental analysis for heart disease prediction using artificial intelligence techniques.
Scientific Reports, 15(1), 6132.
[780] Cox, D. R. (1958). The regression analysis of binary sequences. Journal of the Royal Statistical
Society Series B: Statistical Methodology, 20(2), 215-232.
[781] Nelder, J. A., and Wedderburn, R. W. (1972). Generalized linear models. Journal of the Royal
Statistical Society Series A: Statistics in Society, 135(3), 370-384.
[782] Haberman, S., and Renshaw, A. E. (1990). Generalised linear models and excess mortality
from peptic ulcers. Insurance: Mathematics and Economics, 9(1), 21-32.
[783] Hosmer, D. W., and Lemesbow, S. (1980). Goodness of fit tests for the multiple logistic
regression model. Communications in statistics-Theory and Methods, 9(10), 1043-1069.
[785] Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80(1), 27-38.
[786] King, G., and Zeng, L. (2001). Logistic regression in rare events data. Political analysis, 9(2),
137-163.
[787] Gelman, A., and Hill, J. (2007). Data analysis using regression and multilevel/hierarchical
models. Cambridge university press.
[788] Sani, J., Oluyomi, A. O., Wali, I. G., Ahmed, M. M., and Halane, S. (2025). Regional dispar-
ities on contraceptive intention and its sociodemographic determinants among reproductive
women in Nigeria. Contraception and Reproductive Medicine, 10(1), 1-10.
[789] Dorsey, S. S., Catlin, D. H., Ritter, S. J., Wails, C. N., Robinson, S. G., Oliver, K. W., ...
and Fraser, J. D. (2025). The importance of viewshed in nest site selection of a ground-nesting
shorebird. PLOS ONE, 20(2), e0319021.
[790] Slawny, C., Libersky, E., and Kaushanskaya, M. (2025). The Roles of Language Ability
and Language Dominance in Bilingual Parent–Child Language Alignment. Journal of Speech,
Language, and Hearing Research, 1-13.
[791] Waller, D. K., Dass, N. L. M., Oluwafemi, O. O., Agopian, A. J., Tark, J. Y., Hoyt, A. T.,
... and Study, N. B. D. P. (2025). Maternal Diarrhea During the Periconceptional Period and
the Risk of Birth Defects, National Birth Defects Prevention Study, 2006-2011. Birth defects
research, 117(2), e2438.
[792] Beyeler, M., Rohner, R., Ijäs, P., Eker, O. F., Cognard, C., Bourcier, R., ... and Kaesmacher,
J. (2025). Susceptibility Vessel Sign and Intravenous Alteplase in Stroke Patients Treated with
Thrombectomy. Clinical Neuroradiology, 1-11.
[793] Yedavalli, V., Salim, H. A., Balar, A., Lakhani, D. A., Mei, J., Lu, H., ... and Heit, J. J.
(2025). Hypoperfusion Intensity Ratio Less Than 0.4 is Associated with Favorable Outcomes
in Unsuccessfully Reperfused Acute Ischemic Stroke with Large-Vessel Occlusion. American
Journal of Neuroradiology.
[794] Aarakit, S. M., Ssennono, F. V., Nalweyiso, G., Murungi, H., and Adaramola, M. S. Do
Social Networks and Neighbourhood Effects Matter in Solar Adoption? Insights from Uganda
National Household Survey.
[795] Yang, Y., Cai, X., Zhou, M., Chen, Y., Pi, J., Zhao, M., ... and Wang, Y. (2025). Associa-
tion of Left Ventricular Function With Cerebral Small Vessel Disease in a Community-Based
Population. CNS neuroscience and therapeutics, 31(2), e70226.
[796] Cortese, S. (2025). Advancing our knowledge on the maternal and neonatal outcomes in
women with ADHD. Evidence-Based Nursing.
[797] Gaspar, P., Mittal, P., Cohen, H., and Isenberg, D. A. (2025). Risk factors for bleeding in
patients with thrombotic antiphospholipid syndrome during antithrombotic therapy. Lupus,
09612033251322927.
[798] Schölkopf, B., and Smola, A. J. (2002). Learning with kernels: support vector machines,
regularization, optimization, and beyond. MIT press.
[799] Cristianini, N., and Shawe-Taylor, J. (2000). An introduction to support vector machines and
other kernel-based learning methods. Cambridge university press.
[801] Schölkopf, B., Burges, C. J., and Smola, A. J. (Eds.). (1999). Advances in kernel methods:
support vector learning. MIT press.
[802] Drucker, H., Burges, C. J., Kaufman, L., Smola, A., and Vapnik, V. (1996). Support vector
regression machines. Advances in neural information processing systems, 9.
[803] Joachims, T. (1999, June). Transductive inference for text classification using support vector
machines. In ICML (Vol. 99, pp. 200-209).
[804] Schölkopf, B., Smola, A., and Müller, K. R. (1998). Nonlinear component analysis as a kernel
eigenvalue problem. Neural computation, 10(5), 1299-1319.
[805] Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data
mining and knowledge discovery, 2(2), 121-167.
[806] Schölkopf, B., Platt, J. C., Shawe-Taylor, J., Smola, A. J., and Williamson, R. C. (2001).
Estimating the support of a high-dimensional distribution. Neural computation, 13(7), 1443-
1471.
[807] Gauss, C. F. (1809). Theoria motus corporum coelestium in sectionibus conicis solem ambi-
entium auctore Carolo Friderico Gauss. sumtibus Frid. Perthes et IH Besser.
[808] Legendre, A. M. (1806). Nouvelles méthodes pour la détermination des orbites des comètes:
avec un supplément contenant divers perfectionnemens de ces méthodes et leur application
aux deux comètes de 1805. Courcier.
[809] Pearson, K. (1901). LIII. On lines and planes of closest fit to systems of points in space. The
London, Edinburgh, and Dublin philosophical magazine and journal of science, 2(11), 559-572.
[810] Fisher, R. A. (1922). The goodness of fit of regression formulae, and the distribution of
regression coefficients. Journal of the Royal Statistical Society, 597-612.
[813] Rao, C. R. (1973). Linear statistical inference and its applications (2nd ed.). New York:
Wiley.
[815] Ramadhan, D. L., and Ali, T. H. (2025). A Multivariate Wavelet Shrinkage in Quantile
Regression Models.
[816] Zhou, F., Chu, J., Lu, F., Ouyang, W., Liu, Q., and Wu, Z. (2025). Real-time monitoring of
methyl orange degradation in non-thermal plasma by integrating Raman spectroscopy with a
hybrid machine learning model. Environmental Technology and Innovation, 104100.
[817] Zhong, X., Cai, S., Wang, H., Wu, L., and Sun, Y. (2025). The knowledge, attitude and
practice of nurses on the posture management of premature infants: status quo and coping
strategies. BMC Health Services Research, 25(1), 288.
[818] Liu, J., Wang, S., Tang, Y., Pan, F., and Xia, J. (2025). Current status and influencing
factors of pediatric clinical nurses’ scientific research ability: a survey. BMC nursing, 24(1),
1-8.
[819] Ming-jun, C., and Jian-ya, Z. (2025). Research on the comprehensive effect of the Porter hy-
pothesis of environmental protection tax regulation in China. Environmental Sciences Europe,
37(1), 28.
[820] Dietze, P., Colledge-Frisby, S., Gerra, G., Poznyak, V., Campello, G., Kashino, W., ... and
Krupchanka, D. (2025). Impact of UNODC/WHO SOS (stop-overdose-safely) training on opi-
oid overdose knowledge and attitudes among people at high or low risk of opioid overdose in
Kazakhstan, Kyrgyzstan, Tajikistan and Ukraine. Harm Reduction Journal, 22, 20.
[821] Hasan, M. S., and Ghosal, S. (2025). Unravelling Inequities in Access to Public Healthcare
Services in West Bengal, India: Multiple Dimensions, Geographic Pattern, and Association
with Health Outcomes. Global Social Welfare, 1-18.
[822] Zeng, S., Hou, X., Luo, X., and Wei, Q. Enhancing Maize Yield Prediction Under Stress
Conditions Using Solar-Induced Chlorophyll Fluorescence and Deep Learning. Available at
SSRN 5146460.
[823] Baird, H. B., Allen, W., Gallegos, M., Ashy, C., Slone, H. S., and Pullen, W. M. (2025).
Artificial Intelligence-Driven Analysis Identifies Anterior Cruciate Ligament Reconstruction,
Hip Arthroscopy and Femoroacetabular Impingement Syndrome, and Shoulder Instability as
the Most Commonly Published Topics in Arthroscopy. Arthroscopy, Sports Medicine, and
Rehabilitation, 101108.
[824] Overton, M. W., and Eicker, S. (2025). Associations between days open and dry period length
versus milk production, replacement, and fertility in the subsequent lactation in Holstein dairy
cows. Journal of Dairy Science.
[825] Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., ... and Hassabis, D.
(2017). Mastering chess and shogi by self-play with a general reinforcement learning algorithm.
arXiv preprint arXiv:1712.01815.
[826] He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. (2020). Momentum contrast for unsuper-
vised visual representation learning. In Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition (pp. 9729-9738).
[827] Grill, J. B., Strub, F., Altché, F., Tallec, C., Richemond, P., Buchatskaya, E., ... and Valko,
M. (2020). Bootstrap your own latent-a new approach to self-supervised learning. Advances in
neural information processing systems, 33, 21271-21284.
[828] Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief
nets. Neural computation, 18(7), 1527-1554.
[829] Finn, C., Abbeel, P., and Levine, S. (2017, July). Model-agnostic meta-learning for fast
adaptation of deep networks. In International conference on machine learning (pp. 1126-1135).
PMLR.
[830] Jaderberg, M., Mnih, V., Czarnecki, W. M., Schaul, T., Leibo, J. Z., Silver, D., and
Kavukcuoglu, K. (2016). Reinforcement learning with unsupervised auxiliary tasks. arXiv
preprint arXiv:1611.05397.
[831] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... and
Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at
scale. arXiv preprint arXiv:2010.11929.
[832] Mousavi, S. M. H. Is Deleting the Dataset of a Self-Aware AGI Ethical? Does It Possess a
Soul by Self-Awareness?
[833] Bjerregaard, A., Groth, P. M., Hauberg, S., Krogh, A., and Boomsma, W. (2025). Foundation
models of protein sequences: A brief overview. Current Opinion in Structural Biology, 91,
103004.
[834] Cui, T., Tang, C., Zhou, D., Li, Y., Gong, X., Ouyang, W., ... and Zhang, S. (2025). Online
test-time adaptation for better generalization of interatomic potentials to out-of-distribution
data. Nature Communications, 16(1), 1891.
[835] Jia, Q., Zhang, Y., Wang, Y., Ruan, T., Yao, M., and Wang, L. (2025). Fragment-level
Feature Fusion Method Using Retrosynthetic Fragmentation Algorithm for molecular property
prediction. Journal of Molecular Graphics and Modelling, 108985.
[836] Hou, L. Unboxing the intersections between self-esteem and academic mindfulness with test
emotions, psychological wellness and academic achievement in artificial intelligence-supported
learning environments: Evidence from English as a foreign language learners. British Educa-
tional Research Journal.
[837] Liu, Y., Huang, Y., Dai, Z., and Gao, Y. (2025). Self-optimized learning algorithm for multi-
specialty multi-stage elective surgery scheduling. Engineering Applications of Artificial Intel-
ligence, 147, 110346.
[838] Song, Q., Li, C., Fu, J., Zeng, Q., and Xie, N. (2025). Self-supervised heterogeneous graph
neural network based on deep and broad neighborhood encoding. Applied Intelligence, 55(6),
467.
[839] Li, T., Nath, D., Cheng, Y., Fan, Y., Li, X., Raković, M., ... and Gašević, D. (2025, March).
Turning Real-Time Analytics into Adaptive Scaffolds for Self-Regulated Learning Using Gen-
erative Artificial Intelligence. In Proceedings of the 15th International Learning Analytics and
Knowledge Conference (pp. 667-679).
[840] Chaudary, E., Khan, S. A., and Mumtaz, W. (2025). EEG-CNN-Souping: Interpretable
emotion recognition from EEG signals using EEG-CNN-souping model and explainable AI.
Computers and Electrical Engineering, 123, 110189.
[841] Tautan, A. M., Andrei, A. G., Smeralda, C. L., Vatti, G., Rossi, S., and Ionescu, B. (2025).
Unsupervised learning from EEG data for epilepsy: A systematic literature review. Artificial
Intelligence in Medicine, 103095.
[842] Guo, X., and Sun, L. (2025). Evaluation of stroke sequelae and rehabilitation effect on brain
tumor by neuroimaging technique: A comparative study. PLOS ONE, 20(2), e0317193.
[843] Diao, S., Wan, Y., Huang, D., Huang, S., Sadiq, T., Khan, M. S., ... and Mazhar, T. (2025).
Optimizing Bi-LSTM networks for improved lung cancer detection accuracy. PLOS ONE,
20(2), e0316136.
[844] Lin, N., Shi, Y., Ye, M., Zhang, Y., and Jia, X. (2025). Deep transfer learning radiomics for
distinguishing sinonasal malignancies: a preliminary MRI study. Future Oncology, 1-8.
[845] Çetintaş, D. (2025). Efficient monkeypox detection using hybrid lightweight CNN architec-
tures and optimized SVM with grid search on imbalanced data. Signal, Image and Video
Processing, 19(4), 1-12.
[846] Wang, X., and Zhao, D. (2025). A comparative experimental study of citation sentiment
identification based on the Athar-Corpus. Data Science and Informetrics.
[847] Muralinath, R. N., Pathak, V., and Mahanti, P. K. (2025). Metastable Substructure Em-
bedding and Robust Classification of Multichannel EEG Data Using Spectral Graph Kernels.
Future Internet, 17(3), 102.
[848] Hu, Y. H., Liu, T. H., Tsai, C. F., and Lin, Y. J. (2025). Handling Class Imbalanced Data in
Sarcasm Detection with Ensemble Oversampling Techniques. Applied Artificial Intelligence,
39(1), 2468534.
[849] Wang, H., Lv, F., Zhan, Z., Zhao, H., Li, J., and Yang, K. (2025). Predicting the Ten-
sile Properties of Automotive Steels at Intermediate Strain Rates via Interpretable Ensemble
Machine Learning. World Electric Vehicle Journal, 16(3), 123.
[850] Husain, M., Aftab, R. A., Zaidi, S., and Rizvi, S. J. A. (2025). Shear thickening fluid: A
multifaceted rheological modeling integrating phenomenology and machine learning approach.
Journal of Molecular Liquids, 127223.
[851] Iqbal, A., and Siddiqi, T. A. (2025). Enhancing seasonal streamflow prediction using mul-
tistage hybrid stochastic data-driven deep learning methodology with deep feature selection.
Environmental and Ecological Statistics, 1-51.
[852] Geman, S., Bienenstock, E., and Doursat, R. (1992). Neural networks and the bias/variance
dilemma. Neural computation, 4(1), 1-58.
[853] Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). Reconciling modern machine-learning
practice and the classical bias–variance trade-off. Proceedings of the National Academy of
Sciences, 116(32), 15849-15854.
[854] Neal, B., Mittal, S., Baratin, A., Tantia, V., Scicluna, M., Lacoste-Julien, S., and Mitliagkas,
I. (2018). A modern take on the bias-variance tradeoff in neural networks. arXiv preprint
arXiv:1810.08591.
[855] Rocks, J. W., and Mehta, P. (2022). Memorizing without overfitting: Bias, variance, and
interpolation in overparameterized models. Physical review research, 4(1), 013201.
[856] Doroudi, S., and Rastegar, S. A. (2023). The bias–variance tradeoff in cognitive science.
Cognitive Science, 47(1), e13241.
[857] Almeida, M., Zhuang, Y., Ding, W., Crouter, S. E., and Chen, P. (2021). Mitigating class-
boundary label uncertainty to reduce both model bias and variance. ACM Transactions on
Knowledge Discovery from Data (TKDD), 15(2), 1-18.
[858] Zhou, H., Song, L., Chen, J., Zhou, Y., Wang, G., Yuan, J., and Zhang, Q. (2021). Rethinking
soft labels for knowledge distillation: A bias-variance tradeoff perspective. arXiv preprint
arXiv:2102.00650.
[859] Gupta, N., Smith, J., Adlam, B., and Mariet, Z. (2022). Ensembling over classifiers: a bias-
variance perspective. arXiv preprint arXiv:2206.10566.
[860] Ranglani, H. (2024). Empirical Analysis of the Bias-Variance Tradeoff Across Machine
Learning Models. Machine Learning and Applications: An International Journal, 11, 01-12.
doi:10.5121/mlaij.2024.11401.
[861] Bellman, R. (1954). The theory of dynamic programming. Bulletin of the American Mathe-
matical Society, 60(6), 503-515.
[862] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., ... and
Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search.
nature, 529(7587), 484-489.
[863] Watkins, C. J., and Dayan, P. (1992). Q-learning. Machine learning, 8, 279-292.
[864] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347.
[865] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., ... and Wierstra, D.
(2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
[866] Shah, H. Towards Safe AI: Ensuring Security in Machine Learning and Reinforcement Learn-
ing Models.
[867] Ajanović, Z., Gros, T., Den Hengst, F., Höller, D., Kokel, H., and Taitler, A. (2025, February).
Bridging the Gap Between AI Planning and Reinforcement Learning. In AAAI Conference on
Artificial Intelligence.
[868] Oliveira, D. R., Moreira, G. J., and Duarte, A. R. (2025). Arbitrarily shaped spatial cluster
detection via reinforcement learning algorithms. Environmental and Ecological Statistics, 1-23.
[869] Bai, H., Wang, H., He, R., Du, J., Li, G., Xu, Y., and Jiao, Y. (2025). Multi-hop UAV relay
covert communication: A multi-agent reinforcement learning approach. Chinese Journal of
Aeronautics, 103440.
[870] Pan, R., Yuan, Q., Luo, G., Chen, B., Liu, Y., and Li, J. TG-MG: Task Grouping Based on
MDP Graph for Multi-Task Reinforcement Learning. Available at SSRN 5149163.
[871] Liu, H., Li, D., Zeng, B., and Xu, Y. (2025). Learning discriminative features for multi-hop
knowledge graph reasoning. Applied Intelligence, 55(6), 1-14.
[872] Chen, H., Guo, W., Bao, W., Cui, M., Wang, X., and Zhao, Q. (2025). A novel interpretable
decision rule extracting method for deep reinforcement learning-based energy management in
building complexes. Energy and Buildings, 115514.
[873] Anwar, G. A., and Akber, M. Z. (2025). Multi-agent deep reinforcement learning for resilience
optimization of building structures considering utility interactions for functionality. Computers
and Structures, 310, 107703.
[874] Zhao, W., Lv, Y., Lee, K. M., and Li, W. (2025). An intelligent data-driven adaptive health
state assessment approach for rolling bearings under single and multiple working conditions.
Computers and Industrial Engineering, 110988.
[875] Soman, G., Judy, M. V., and Abou, A. M. (2025). Human guided empathetic AI agent for
mental health support leveraging reinforcement learning-enhanced retrieval-augmented gener-
ation. Cognitive Systems Research, 101337.
[876] Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (1999). Policy gradient meth-
ods for reinforcement learning with function approximation. Advances in neural information
processing systems, 12.
[877] Kakade, S. M. (2001). A natural policy gradient. Advances in neural information processing
systems, 14.
[878] Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015, June). Trust region
policy optimization. In International conference on machine learning (pp. 1889-1897). PMLR.
[879] Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2021). On the theory of pol-
icy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine
Learning Research, 22(98), 1-76.
[880] Liu, J., Li, W., and Wei, K. (2024). Elementary analysis of policy gradient methods. arXiv
preprint arXiv:2404.03372.
[881] Lorberbom, G., Maddison, C. J., Heess, N., Hazan, T., and Tarlow, D. (2020). Direct pol-
icy gradients: Direct optimization of policies in discrete action spaces. Advances in Neural
Information Processing Systems, 33, 18076-18086.
[882] McCracken, G., Daniels, C., Zhao, R., Brandenberger, A., Panangaden, P., and Precup, D.
(2020). A Study of Policy Gradient on a Class of Exactly Solvable Models. arXiv preprint
arXiv:2011.01859.
[883] Lehmann, M. (2024). The definitive guide to policy gradients in deep reinforcement learning:
Theory, algorithms and implementations. arXiv preprint arXiv:2401.13662.
[884] Rahn, A., Sultanow, E., Henkel, M., Ghosh, S., and Aberkane, I. J. (2021). An algorithm for
linearizing the Collatz convergence. Mathematics, 9(16), 1898.
[885] Sutton, R. S., Singh, S., and McAllester, D. (2000). Comparing policy-gradient algorithms.
IEEE Transactions on Systems, Man, and Cybernetics, 30(4), 467-477.
[886] Mustafa, E., Shuja, J., Rehman, F., Namoun, A., Bilal, M., and Iqbal, A. (2025). Compu-
tation offloading in vehicular communications using PPO-based deep reinforcement learning.
The Journal of Supercomputing, 81(4), 1-24.
[887] Yang, C., Chen, J., Huang, X., Lian, J., Tang, Y., Chen, X., and Xie, S. (2025). Joint
Driving Mode Selection and Resource Management in Vehicular Edge Computing Networks.
IEEE Internet of Things Journal.
[888] Jamshidiha, S., Pourahmadi, V., and Mohammadi, A. (2025). A Traffic-Aware Graph Neural
Network for User Association in Cellular Networks. IEEE Transactions on Mobile Computing.
[889] Raei, H., De Momi, E., and Ajoudani, A. (2025). A Reinforcement Learning Approach to
Non-prehensile Manipulation through Sliding. arXiv preprint arXiv:2502.17221.
[890] Ting-Ting, Z., Yan, C., Ren-zhi, D., Tao, C., Yan, L., Kai-Ge, Z., ... and Yu-Shi, L. (2025).
Autonomous decision-making of UAV cluster with communication constraints based on rein-
forcement learning. Journal of Cloud Computing, 14(1), 12.
[891] Zhang, B., Xing, H., Zhang, Z., and Feng, W. (2025). Autonomous obstacle avoidance decision
method for spherical underwater robot based on brain-inspired spiking neural network. Expert
Systems with Applications, 127021.
[892] Nguyen, X. B., Phan, X. H., and Piccardi, M. (2025). Fine-tuning text-to-SQL models with
reinforcement-learning training objectives. Natural Language Processing Journal, 100135.
[893] Brahmanage, J. C., Ling, J., and Kumar, A. (2025). Leveraging Constraint Violation Signals
For Action-Constrained Reinforcement Learning. arXiv preprint arXiv:2502.10431.
[894] Huang, Z., Dai, W., Zou, Y., Li, D., Cai, J., Gadekallu, T. R., and Wang, W. (2025).
Cooperative Traffic Scheduling in Transportation Network: A Knowledge Transfer Method.
IEEE Transactions on Intelligent Transportation Systems.
[895] Li, J., Li, R., Ma, G., Wang, H., Yang, W., and Gu, Z. FedDDPG: A Reinforcement Learning
Method for Federated Learning-Based Vehicle Trajectory Prediction. Available at SSRN
5148441.
[896] Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Ostrovski, G., Dabney, W., ... and
Silver, D. (2018, April). Rainbow: Combining improvements in deep reinforcement learning.
In Proceedings of the AAAI conference on artificial intelligence (Vol. 32, No. 1).
[897] Levine, S., Finn, C., Darrell, T., and Abbeel, P. (2016). End-to-end training of deep visuo-
motor policies. Journal of Machine Learning Research, 17(39), 1-40.
[898] Bellemare, M. G., Dabney, W., and Munos, R. (2017, July). A distributional perspective on
reinforcement learning. In International conference on machine learning (pp. 449-458). PMLR.
[899] Xue, K., Zhai, L., Li, Y., Lu, Z., and Zhou, W. (2025). Task Offloading and Multi-cache Place-
ment Based on DRL in UAV-assisted MEC Networks. Vehicular Communications, 100900.
[900] Amodu, O. A., Mahmood, R. A. R., Althumali, H., Jarray, C., Adnan, M. H., Bukar, U.
A., ... and Zukarnain, Z. A. (2025). A question-centric review on DRL-based optimization for
UAV-assisted MEC sensor and IoT applications, challenges, and future directions. Vehicular
Communications, 100899.
[901] Silvestri, A., Coraci, D., Brandi, S., Capozzoli, A., and Schlueter, A. (2025). Practical de-
ployment of reinforcement learning for building controls using an imitation learning approach.
Energy and Buildings, 115511.
[902] Sarigul, F. A., and Bayezit, I. Deep Reinforcement Learning Based Autonomous Heading
Control of a Fixed-Wing Aircraft.
[903] Mukhamadiarov, R. (2025). Controlling dynamics of stochastic systems with deep reinforce-
ment learning. arXiv preprint arXiv:2502.18111.
[904] Ali, N., and Wallace, G. (2025). The Future of SOC Operations: Autonomous Cyber Defense
with AI and Machine Learning.
[905] Yan, L., Wang, Q., Hu, G., Chen, W., and Noack, B. R. (2025). Deep reinforcement cross-
domain transfer learning of active flow control for three-dimensional bluff body flow. Journal
of Computational Physics, 113893.
[906] Silvestri, A., Coraci, D., Brandi, S., Capozzoli, A., and Schlueter, A. (2025). Practical de-
ployment of reinforcement learning for building controls using an imitation learning approach.
Energy and Buildings, 115511.
[907] Alajaji, S. A., Sabzian, R., Wang, Y., Sultan, A. S., and Wang, R. (2025). A Scoping Review
of Infrared Spectroscopy and Machine Learning Methods for Head and Neck Precancer and
Cancer Diagnosis and Prognosis. Cancers, 17(5), 796.
[908] Wang, X., and Liu, L. (2025). Risk-Sensitive DRL for Portfolio Optimization in Petroleum
Futures.
[909] Thongkairat, S., and Yamaka, W. (2025). A Combined Algorithm Approach for Optimizing
Portfolio Performance in Automated Trading: A Study of SET50 Stocks. Mathematics, 13(3),
461.
[910] Dey, D., and Ghosh, N. IQUIC: An Intelligent Framework for Defending QUIC Connection
ID-Based DoS Attack Using Advantage Actor-Critic RL. Available at SSRN 5129475.
[911] Zhao, K., Peng, L., and Tak, B. (2025). Joint DRL-Based UAV Trajectory Planning and
TEG-Based Task Offloading. IEEE Transactions on Consumer Electronics.
[912] Mounesan, M., Zhang, X., and Debroy, S. (2025). Infer-EDGE: Dynamic DNN Inference
Optimization in’Just-in-time’Edge-AI Implementations. arXiv preprint arXiv:2501.18842.
[913] Hou, Y., Yin, C., Sheng, X., Xu, D., Chen, J., and Tang, H. (2025). Automotive Fuel Cell
Performance Degradation Prediction Using Multi-Agent Cooperative Advantage Actor-Critic
Model. Energy, 134899.
[914] Radaideh, M. I., Tunkle, L., Price, D., Abdulraheem, K., Lin, L., and Elias, M. (2025). Multi-
step Criticality Search and Power Shaping in Nuclear Microreactors with Deep Reinforcement
Learning. Nuclear Science and Engineering, 1-13.
[915] Li, B., Shen, L., Zhao, C., and Fei, Z. (2025). Robust Resource Optimization in Integrated
Sensing, Communication, and Computing Networks Based on Soft Actor-Critic, 47(3), 1-10.
[916] Khan, N., Ahmad, S., Raza, S., Khan, A., and Younas, M. (2025). Cost Effective Route
Optimization for Dairy Product Delivery. Kashf Journal of Multidisciplinary Research, 2(02),
13-26.
[917] Yuan, Y., Zhang, J., Xu, X., Wang, B., Han, S., Sun, M., and Zhang, P. (2025). Learning-
Based Task-Centric Multi-User Semantic Communication Solution for Vehicle Networks. IEEE
Transactions on Vehicular Technology.
[918] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., ... and Kavukcuoglu,
K. (2016, June). Asynchronous methods for deep reinforcement learning. In International con-
ference on machine learning (pp. 1928-1937). PMLR.
[919] Wang, Y., Zhang, C., Yu, T., and Ma, M. (2022). Recursive Least Squares Advantage Actor-
Critic Algorithms. arXiv preprint arXiv:2201.05918.
[920] Rubell Marion Lincy, G., Sagar, S., Narayanan, V., Binu, D., Selby, N., and Thomas, S. E.
Advantage Actor-Critic Reinforcement Learning with Technical Indicators for Stock Trading
Decisions.
[921] Paczolay, G., and Harmati, I. (2020, October). A new advantage actor-critic algorithm for
multi-agent environments. In 2020 23rd International Symposium on Measurement and Control
in Robotics (ISMCR) (pp. 1-6). IEEE.
[922] Qin, S., Xie, X., Wang, J., Guo, X., Qi, L., Cai, W., ... and Talukder, Q. T. A. (2024).
An Optimized Advantage Actor-Critic Algorithm for Disassembly Line Balancing Problem
Considering Disassembly Tool Degradation. Mathematics, 12(6), 836.
[923] Kölle, M., Hgog, M., Ritz, F., Altmann, P., Zorn, M., Stein, J., and Linnhoff-Popien,
C. (2024). Quantum advantage actor-critic for reinforcement learning. arXiv preprint
arXiv:2401.07043.
[924] Benhamou, E. (2019). Variance reduction in actor critic methods (ACM). arXiv preprint
arXiv:1907.09765.
[925] Peng, B., Li, X., Gao, J., Liu, J., Chen, Y. N., and Wong, K. F. (2018, April). Adversarial
advantage actor-critic model for task-completion dialogue policy learning. In 2018 IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6149-6153).
IEEE.
[926] van Veldhuizen, V. (2022). Autotuning PID control using Actor-Critic Deep Reinforcement
Learning. arXiv preprint arXiv:2212.00013.
[927] Cicek, D. C., Duran, E., Saglam, B., Mutlu, F. B., and Kozat, S. S. (2021, November).
Off-policy correction for deep deterministic policy gradient algorithms via batch prioritized
experience replay. In 2021 IEEE 33rd International Conference on Tools with Artificial Intel-
ligence (ICTAI) (pp. 1255-1262). IEEE.
[928] Han, S., Zhou, W., Lü, S., and Yu, J. (2021). Regularly updated deterministic policy gradient
algorithm. Knowledge-Based Systems, 214, 106736.
[929] Pan, L., Cai, Q., and Huang, L. (2020). Softmax deep double deterministic policy gradients.
Advances in neural information processing systems, 33, 11767-11777.
[930] Luck, K. S., Vecerik, M., Stepputtis, S., Amor, H. B., and Scholz, J. (2019, November).
Improved exploration through latent trajectory optimization in deep deterministic policy gra-
dient. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
(pp. 3704-3711). IEEE.
[931] Dong, R., Du, J., Liu, Y., Heidari, A. A., and Chen, H. (2023). An enhanced deep determin-
istic policy gradient algorithm for intelligent control of robotic arms. Frontiers in Neuroinfor-
matics, 17, 1096053.
[932] Jesus, J. C., Bottega, J. A., Cuadros, M. A., and Gamarra, D. F. (2019, December). Deep
deterministic policy gradient for navigation of mobile robots in simulated environments. In
2019 19th International Conference on Advanced Robotics (ICAR) (pp. 362-367). IEEE.
[933] Lin, T., Zhang, X., Gong, J., Tan, R., Li, W., Wang, L., ... and Gao, J. (2023). A dos-
ing strategy model of deep deterministic policy gradient algorithm for sepsis patients. BMC
Medical Informatics and Decision Making, 23(1), 81.
[934] Sumalatha, V., and Pabboju, S. (2024). Optimal Index Selection using Optimized Deep
Deterministic Policy Gradient for NoSQL Database. Engineering, Technology and Applied
Science Research, 14(6), 18125-18130.
[935] Yang, C., Chen, J., Huang, X., Lian, J., Tang, Y., Chen, X., and Xie, S. (2025). Joint
Driving Mode Selection and Resource Management in Vehicular Edge Computing Networks.
IEEE Internet of Things Journal.
[936] Tian, S., Zhu, X., Feng, B., Zheng, Z., Liu, H., and Li, Z. (2025). Partial Offloading Strategy
Based on Deep Reinforcement Learning in the Internet of Vehicles. IEEE Transactions on
Mobile Computing.
[937] Chen, H., Cui, H., Wang, J., Cao, P., He, Y., and Guizani, M. (2025). Computation Offload-
ing Optimization for UAV-Based Cloud-Edge Collaborative Task Scheduling Strategy. IEEE
Transactions on Cognitive Communications and Networking.
[938] Deng, J., Zhou, H., and Alouini, M. S. (2025). Distributed Coordination for Heterogeneous
Non-Terrestrial Networks. arXiv preprint arXiv:2502.17366.
[939] Zhang, Y., Fan, W., Yu, Y., and Liu, Y. A. (2025). DRL-Based Resource Orchestration for
Vehicular Edge Computing With Multi-Edge and Multi-Vehicle Assistance. IEEE Transactions
on Intelligent Transportation Systems.
[940] Cuéllar, R., Posada, D., Henderson, T., and Karimi, R. R. Orbital Maneuver and
Interplanetary Trajectory Design via Reinforcement Learning.
[941] Liu, L., Sun, M., Zhao, E., and Zhu, K. (2025). Three-Dimensional Dynamic Trajectory
Planning for Autonomous Underwater Robots Under the PPO-IIFDS Framework. Journal of
Marine Science and Engineering, 13(3), 445.
[942] Figueroa, N. F., Tafur, J. C., and Kheddar, A. (2025). Fast Autolearning for Multimodal
Walking in Humanoid Robots with Variability of Experience. IEEE Robotics and Automation
Letters.
[943] Xu, C., Zhang, P., and Yu, H. (2025). Lyapunov-Guided Resource Allocation and Task
Scheduling for Edge Computing Cognitive Radio Networks via Deep Reinforcement Learning.
IEEE Sensors Journal.
[944] Li, L., Jing, X., Liu, H., Lei, H., and Chen, Q. (2025). Adaptive Anti-Jamming Resource
Allocation Scheme in Dynamic Jamming Environment. IEEE Transactions on Vehicular Tech-
nology.
[945] Chandrasiri, S., and Meedeniya, D. (2025). Energy-Efficient Dynamic Workflow Scheduling
in Cloud Environments Using Deep Learning. Sensors, 25(5), 1428.
[946] Wu, Y., and Xie, N. (2025). Design of digital low-carbon system for smart buildings based
on PPO algorithm. Sustainable Energy Research, 12(1), 1-14.
[947] Guan, Q., Cao, H., Jia, L., Yan, D., and Chen, B. (2025). Synergetic attention-driven trans-
former: A deep reinforcement learning approach for vehicle routing problems. Expert Systems
with Applications, 126961.
[948] Zhang, B., Wang, Y., and Dhillon, P. S. (2025). Policy Learning with a Natural Language
Action Space: A Causal Approach. arXiv preprint arXiv:2502.17538.
[949] Zhang, C., Dai, L., Zhang, H., and Wang, Z. (2025). Control Barrier Function-Guided Deep
Reinforcement Learning for Decision-Making of Autonomous Vehicle at On-Ramp Merging.
IEEE Transactions on Intelligent Transportation Systems.
[950] Stanley, K. O., and Miikkulainen, R. (2002). Evolving neural networks through augmenting
topologies. Evolutionary computation, 10(2), 99-127.
[951] Stanley, K. O., Bryant, B. D., and Miikkulainen, R. (2005). Real-time neuroevolution in the
NERO video game. IEEE transactions on evolutionary computation, 9(6), 653-668.
[952] Gauci, J., and Stanley, K. (2007, July). Generating large-scale neural networks through dis-
covering geometric regularities. In Proceedings of the 9th annual conference on Genetic and
evolutionary computation (pp. 997-1004).
[953] Metzen, J. H., Edgington, M., Kassahun, Y., and Kirchner, F. (2007, December). Performance
evaluation of EANT in the robocup keepaway benchmark. In Sixth International Conference
on Machine Learning and Applications (ICMLA 2007) (pp. 342-347). IEEE.
[954] Kassahun, Y., and Sommer, G. (2005, April). Efficient reinforcement learning through Evo-
lutionary Acquisition of Neural Topologies. In ESANN (pp. 259-266).
[955] Siebel, N. T., and Sommer, G. (2007). Evolutionary reinforcement learning of artificial neural
networks. International Journal of Hybrid Intelligent Systems, 4(3), 171-183.
[956] Siebel, N. T., and Sommer, G. (2008, June). Learning defect classifiers for visual inspection
images by neuro-evolution using weakly labelled training data. In 2008 IEEE Congress on
Evolutionary Computation (IEEE World Congress on Computational Intelligence) (pp. 3925-
3931). IEEE.
[957] Miikkulainen, R., Liang, J., Meyerson, E., Rawal, A., Fink, D., Francon, O., ... and Hodjat,
B. (2024). Evolving deep neural networks. In Artificial intelligence in the age of neural networks
and brain computing (pp. 269-287). Academic Press.
[958] Liang, J., Meyerson, E., Hodjat, B., Fink, D., Mutch, K., and Miikkulainen, R. (2019, July).
Evolutionary neural automl for deep learning. In Proceedings of the genetic and evolutionary
computation conference (pp. 401-409).
[959] Vargas, D. V., and Murata, J. (2016). Spectrum-diverse neuroevolution with unified neural
models. IEEE transactions on neural networks and learning systems, 28(8), 1759-1773.
[960] Such, F. P., Madhavan, V., Conti, E., Lehman, J., Stanley, K. O., and Clune, J. (2017). Deep
neuroevolution: Genetic algorithms are a competitive alternative for training deep neural
networks for reinforcement learning. arXiv preprint arXiv:1712.06567.
[961] Assunção, F., Lourenço, N., Ribeiro, B., and Machado, P. (2021). Fast-DENSER: Fast deep
evolutionary network structured representation. SoftwareX, 14, 100694.
[963] Stanley, K. O., Clune, J., Lehman, J., and Miikkulainen, R. (2019). Designing neural networks
through neuroevolution. Nature Machine Intelligence, 1(1), 24-35.
[964] Bertens, P., and Lee, S. W. (2019). Network of evolvable neural units: Evolving to learn at
a synaptic level. arXiv preprint arXiv:1912.07589.
[965] Wang, Z., Zhou, Y., Takagi, T., Song, J., Tian, Y. S., and Shibuya, T. (2023). Genetic
algorithm-based feature selection with manifold learning for cancer classification using mi-
croarray data. BMC bioinformatics, 24(1), 139.
[966] Pagliuca, P., Milano, N., and Nolfi, S. (2020). Efficacy of modern neuro-evolutionary strategies
for continuous control optimization. Frontiers in Robotics and AI, 7, 98.
[967] Behjat, A., Chidambaran, S., and Chowdhury, S. (2019, May). Adaptive genomic evolution of
neural network topologies (agent) for state-to-action mapping in autonomous agents. In 2019
International Conference on Robotics and Automation (ICRA) (pp. 9638-9644). IEEE.
[968] Ahmed, S. F., Alam, M. S. B., Hassan, M., Rozbu, M. R., Ishtiak, T., Rafa, N., ... and
Gandomi, A. H. (2023). Deep learning modelling techniques: current progress, applications,
advantages, and challenges. Artificial Intelligence Review, 56(11), 13521-13617.
[969] Miikkulainen, R. (2023, July). Evolution of neural networks. In Proceedings of the Companion
Conference on Genetic and Evolutionary Computation (pp. 1008-1025).
[970] Kannan, A., Selvi, M., Santhosh Kumar, S. V. N., Thangaramya, K., and Shalini, S. (2024).
Machine Learning Based Intelligent RPL Attack Detection System for IoT Networks. In Ad-
vanced Machine Learning with Evolutionary and Metaheuristic Techniques (pp. 241-256). Sin-
gapore: Springer Nature Singapore.
[971] Zeng, X., Cai, J., Liang, C., and Yuan, C. (2022). A hybrid model integrating long short-
term memory with adaptive genetic algorithm based on individual ranking for stock index
prediction. Plos one, 17(8), e0272637.
[972] KV, S., and Swamy, A. (2024). Enhancing Software Quality with Ensemble Machine Learning
and Evolutionary Approaches.
[973] Gruau, F. (1993, April). Cellular encoding as a graph grammar. In IEE colloquium on gram-
matical inference: Theory, applications and alternatives (pp. 17-1). IET.
[974] Gruau, F., Whitley, D., and Pyeatt, L. (1996, July). A comparison between cellular encoding
and direct encoding for genetic neural networks. In Proceedings of the 1st annual conference
on genetic programming (pp. 81-89).
[975] Gruau, F., and Whitley, D. (1993). Adding learning to the cellular development of neural
networks: Evolution and the Baldwin effect. Evolutionary computation, 1(3), 213-233.
[976] Gutierrez, G., Galvan, I., Molina, J., and Sanchis, A. (2004, July). Studying the capacity of
cellular encoding to generate feedforward neural network topologies. In 2004 IEEE Interna-
tional Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541) (Vol. 1, pp. 211-215).
IEEE.
[977] Zhang, B. T., and Mühlenbein, H. (1993). Evolving optimal neural networks using genetic
algorithms with Occam’s razor. Complex systems, 7(3), 199-220.
[978] Kitano, H. (1990). Designing neural networks using genetic algorithms with graph generation
system. Complex Systems, 4(4), 461-476.
[979] Miller, J., and Turner, A. (2015, July). Cartesian genetic programming. In Proceedings of the
Companion Publication of the 2015 Annual Conference on Genetic and Evolutionary Compu-
tation (pp. 179-198).
[980] Miller, J. F. (2020). Cartesian genetic programming: its status and future. Genetic Program-
ming and Evolvable Machines, 21(1), 129-168.
[981] Hernández Ruiz, A. J., Vilalta Arias, A., and Moreno-Noguer, F. (2021). Neural cellular
automata manifold. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR) (pp. 10015-10023). IEEE Computer Society Conference Publish-
ing Services (CPS).
[982] Hajij, M., Istvan, K., and Zamzmi, G. (2020). Cell complex neural networks. arXiv preprint
arXiv:2010.00743.
[983] Sun, W., Winnubst, J., Natrajan, M., Lai, C., Kajikawa, K., Bast, A., ... and Spruston, N.
(2025). Learning produces an orthogonalized state machine in the hippocampus. Nature, 1-11.
[984] Guan, B., Chu, G., Wang, Z., Li, J., and Yi, B. (2025). Instance-level semantic segmentation
of nuclei based on multimodal structure encoding. BMC bioinformatics, 26(1), 42.
[985] Ghosh, N., Dutta, P., and Santoni, D. (2025). TFBS-Finder: Deep Learning-based Model
with DNABERT and Convolutional Networks to Predict Transcription Factor Binding Sites.
arXiv preprint arXiv:2502.01311.
[986] Sun, R., Qian, L., Li, Y., Cheng, H., Xue, Z., Zhang, X., ... and Guo, T. (2025). A perturba-
tion proteomics-based foundation model for virtual cell construction. bioRxiv, 2025-02.
[987] Grosjean, P., Shevade, K., Nguyen, C., Ancheta, S., Mader, K., Franco, I., ... and Kampmann,
M. (2025). Network-aware self-supervised learning enables high-content phenotypic screening
for genetic modifiers of neuronal activity dynamics. bioRxiv, 2025-02.
[988] Gonzalez, K. C., Noguchi, A., Zakka, G., Yong, H. C., Terada, S., Szoboszlay, M., ... and
Losonczy, A. (2025). Visually guided in vivo single-cell electroporation for monitoring and
manipulating mammalian hippocampal neurons. Nature Protocols, 1-17.
[989] de Carvalho, L. M., Carvalho, V. M., Camargo, A. P., and Papes, F. (2025). Gene network
analysis identifies dysregulated pathways in an autism spectrum disorder caused by mutations
in Transcription Factor 4. Scientific Reports, 15(1), 4993.
[990] Sprecher, S. G. (2025). Disentangling how the brain is wired. Fly, 19(1), 2440950.
[991] Li, S., Cai, Y., and Xia, Z. (2025). Function and regulation of non-neuronal cells in the
nervous system. Frontiers in Cellular Neuroscience, 19, 1550903.
[992] Saunders, G., Angeline, P., and Pollack, J. (1993). Structural and behavioral evolution of
recurrent networks. Advances in Neural Information Processing Systems, 6.
[993] Angeline, P. J., Saunders, G. M., and Pollack, J. B. (1994). An evolutionary algorithm that
constructs recurrent neural networks. IEEE transactions on Neural Networks, 5(1), 54-65.
[994] Schmidhuber, J. (1999). A general method for incremental self-improvement and multi-agent
learning. In Evolutionary Computation: Theory and Applications (pp. 81-123).
[995] Yao, X. (1999). Evolving artificial neural networks. Proceedings of the IEEE, 87(9), 1423-
1447.
[996] Floreano, D., Dürr, P., and Mattiussi, C. (2008). Neuroevolution: from architectures to
learning. Evolutionary intelligence, 1, 47-62.
[997] Gomez, F. J., and Miikkulainen, R. (1999, July). Solving non-Markovian control tasks with
neuroevolution. In IJCAI (Vol. 99, pp. 1356-1361).
[998] Moriarty, D. E., and Miikkulainen, R. (1996). Efficient reinforcement learning through sym-
biotic evolution. Machine learning, 22(1), 11-32.
[999] Gomez, F., and Miikkulainen, R. (1997). Incremental evolution of complex general behavior.
Adaptive Behavior, 5(3-4), 317-342.
[1000] MacQueen, J. (1967, January). Some methods for classification and analysis of multivariate
observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and
Probability, Volume 1: Statistics (Vol. 5, pp. 281-298). University of California Press.
[1001] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incom-
plete data via the EM algorithm. Journal of the royal statistical society: series B (method-
ological), 39(1), 1-22.
[1002] Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biolog-
ical cybernetics, 43(1), 59-69.
[1003] Belkin, M., and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and
data representation. Neural computation, 15(6), 1373-1396.
[1004] Tishby, N., Pereira, F. C., and Bialek, W. (2000). The information bottleneck method. arXiv
preprint physics/0004057.
[1005] Hinton, G. E., and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with
neural networks. science, 313(5786), 504-507.
[1006] Kingma, D. P., and Welling, M. (2013, December). Auto-encoding variational Bayes. arXiv
preprint arXiv:1312.6114.
[1007] Van der Maaten, L., and Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine
learning research, 9(11).
[1008] Roweis, S. T., and Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear
embedding. science, 290(5500), 2323-2326.
[1010] Parmar, T. (2020). Leveraging Unsupervised Learning for Identifying Unknown Defects
in New Semiconductor Products. doi:10.5281/zenodo.14840180.
[1011] Raikwar, T., and Gupta, D. (2025). AI-Driven Trust Management Framework for Secure
Wireless Ad Hoc Networks, 6, 2582-6948.
[1012] Moustakidis, S., Stergiou, K., Gee, M., Roshanmanesh, S., Hayati, F., Karlsson, P., and
Papaelias, M. (2025). Deep Learning Autoencoders for Fast Fourier Transform-Based Cluster-
ing and Temporal Damage Evolution in Acoustic Emission Data from Composite Materials.
Infrastructures, 10(3), 51.
[1013] Liu, W., Ning, Q., Liu, G., Wang, H., Zhu, Y., and Zhong, M. (2025). Unsupervised feature
selection algorithm based on L2,p-norm feature reconstruction. PLoS ONE, 20(3), e0318431.
https://doi.org/10.1371/journal.pone.0318431
[1014] Zhou, M., Sun, T., Yan, Y., Jing, M., Gao, Y., Jiang, B., ... and Zhao, J. (2025). Metabolic
subtypes in hypertriglyceridemia and associations with diseases: insights from population-
based metabolome atlas. Journal of Translational Medicine, 23(1), 1-5.
[1015] Lin, P., Cai, Y., Wu, H., Yin, J., and Luorang, Z. (2025). AI-Driven Risk Control for Health
Insurance Fund Management: A Data-Driven Approach. International Journal of Computers
Communications and Control, 20(2).
[1016] Huang, Y., Hu, J., and Luo, R. (2025). FMDL: Enhancing Open-World Object Detection
with foundation models and dynamic learning. Expert Systems with Applications, 127050.
[1017] Wu, J., and Liu, C. (2025). VQ-VAE-2 Based Unsupervised Algorithm for Detecting Con-
crete Structural Apparent Cracks. Materials Today Communications, 112075.
[1018] Nagelli, A., and Saleena, B. (2025). Aspect-based Sentiment Analysis with Ontology-assisted
Recommender System on Multilingual Data using Optimised Self-attention and Adaptive Deep
Learning Network. Journal of Information and Knowledge Management.
[1019] Ekanayake, M. B. Deep Learning for Magnetic Resonance Image Reconstruction and Super-
resolution (Doctoral dissertation, Monash University).
[1020] LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied
to document recognition. Proceedings of the IEEE, 86(11), 2278-2324.
[1021] Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: a statistical
view of boosting (with discussion and a rejoinder by the authors). The annals of statistics,
28(2), 337-407.
[1022] Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and
organization in the brain. Psychological review, 65(6), 386.
[1023] Schapire, R. E. (1990). The strength of weak learnability. Machine learning, 5, 197-227.
[1024] Rafiei, M., Shojaei, A., and Chau, Y. (2025). Machine learning-assisted design of im-
munomodulatory lipid nanoparticles for delivery of mRNA to repolarize hyperactivated mi-
croglia. Drug Delivery, 32(1), 2465909.
[1025] Pei, Z., Wu, X., Wu, X., Xiao, Y., Yu, P., Gao, Z., ... and Guo, W. (2025). Segmenting
Vegetation from UAV Images via Spectral Reconstruction in Complex Field Environments.
Plant Phenomics, 100021.
[1026] Efendi, A., Ammarullah, M. I., Isa, I. G. T., Sari, M. P., Izza, J. N., Nugroho, Y. S., ...
and Alfian, D. (2025). IoT-Based Elderly Health Monitoring System Using Firebase Cloud
Computing. Health Science Reports, 8(3), e70498.
[1027] Pang, Y. T., Kuo, K. M., Yang, L., and Gumbart, J. C. (2025). DeepPath: Overcoming data
scarcity for protein transition pathway prediction using physics-based deep learning. bioRxiv,
2025-02.
[1028] Curry, A., Singer, M., Musu, A., and Caricchi, L. Supervised and Unsupervised Machine
Learning Applied to an Ignimbrite Flare-Up in the Central San Juan Caldera Cluster, Col-
orado.
[1029] Li, X., Ouyang, Q., Han, M., Liu, X., He, F., Zhu, Y., ... and Ma, J. (2025). π-PhenoDrug: A
Comprehensive Deep Learning-Based Pipeline for Phenotypic Drug Screening in High-Content
Analysis. Advanced Intelligent Systems, 2400635.
[1030] Liu, Y., Deng, L., Ding, F., Zhang, W., Zhang, S., Zeng, B., ... and Wu, L. (2025). Discovery
of ASGR1 and HMGCR dual-target inhibitors based on supervised learning, molecular dock-
ing, molecular dynamic simulations, and biological evaluation. Bioorganic Chemistry, 108326.
[1031] Dutta, R., and Karmakar, S. (2024, March). Ransomware Detection in Healthcare Organi-
zations Using Supervised Learning Models: Random Forest Technique. In International Con-
ference on Emerging Trends and Technologies on Intelligent Systems (pp. 385-395). Singapore:
Springer Nature Singapore.
[1032] Tishby, N., Pereira, F. C., and Bialek, W. (2000). The information bottleneck method. arXiv
preprint physics/0004057.
[1033] Chechik, G., Globerson, A., Tishby, N., and Weiss, Y. (2003). Information bottleneck for
Gaussian variables. Advances in Neural Information Processing Systems, 16.
[1034] Chechik, G., and Tishby, N. (2002). Extracting relevant structures with side information.
Advances in Neural Information Processing Systems, 15.
[1035] Tishby, N., and Zaslavsky, N. (2015, April). Deep learning and the information bottleneck
principle. In 2015 IEEE Information Theory Workshop (ITW) (pp. 1-5). IEEE.
[1036] Saxe, A. M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B. D., and Cox,
D. D. (2019). On the information bottleneck theory of deep learning. Journal of Statistical
Mechanics: Theory and Experiment, 2019(12), 124020.
[1037] Shwartz-Ziv, R., and Tishby, N. (2017). Opening the black box of deep neural networks via
information. arXiv preprint arXiv:1703.00810.
[1038] Noshad, M., Zeng, Y., and Hero, A. O. (2019, May). Scalable mutual information estimation
using dependence graphs. In ICASSP 2019-2019 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP) (pp. 2962-2966). IEEE.
[1039] Goldfeld, Z., Berg, E. V. D., Greenewald, K., Melnyk, I., Nguyen, N., Kingsbury, B., and
Polyanskiy, Y. (2018). Estimating information flow in deep neural networks. arXiv preprint
arXiv:1810.05728.
[1040] Geiger, B. C. (2021). On information plane analyses of neural network classifiers—A review.
IEEE Transactions on Neural Networks and Learning Systems, 33(12), 7039-7051.
[1041] Kawaguchi, K., Deng, Z., Ji, X., and Huang, J. (2023, July). How does information bottle-
neck help deep learning?. In International Conference on Machine Learning (pp. 16049-16096).
PMLR.
[1042] Dardour, O., Aguilar, E., Radeva, P., and Zaied, M. (2025). Inter-separability and intra-
concentration to enhance stochastic neural network adversarial robustness. Pattern Recogni-
tion Letters.
[1043] Krinner, M., Aljalbout, E., Romero, A., and Scaramuzza, D. (2025). Accelerating
Model-Based Reinforcement Learning with State-Space World Models. arXiv preprint
arXiv:2502.20168.
[1044] Yildirim, A. B., Pehlivan, H., and Dundar, A. (2024). Warping the residuals for image
editing with stylegan. International Journal of Computer Vision, 1-16.
[1045] Yang, Y., Wang, Y., Ma, C., Yu, L., Chersoni, E., and Huang, C. R. (2025). Sparse Brains are
Also Adaptive Brains: Cognitive-Load-Aware Dynamic Activation for LLMs. arXiv preprint
arXiv:2502.19078.
[1046] Liu, H., Jia, C., Shi, F., Cheng, X., and Chen, S. (2025). SCSegamba: Lightweight Structure-
Aware Vision Mamba for Crack Segmentation in Structures. arXiv preprint arXiv:2503.01113.
[1047] Stierle, M., and Valtere, L. Addressing the Gene Therapy Bottleneck in the EU: Patent
vs. Regulatory Incentives. Gewerblicher Rechtsschutz und Urheberrecht. Internationaler Teil.
[1048] Chen, Z. S., Tan, Y., Ma, Z., Zhu, Z., and Skibniewski, M. J. (2025). Unlocking the potential
of quantum computing in prefabricated construction supply chains: Current trends, challenges,
and future directions. Information Fusion, 103043.
[1049] Yuan, X., Smith, N. S., and Moghe, G. D. (2025). Analysis of plant metabolomics data using
identification-free approaches. Applications in Plant Sciences, e70001.
[1050] Dey, A., Sarkar, S., Mondal, A., and Mitra, P. (2025). Spatio-Temporal NDVI Prediction
for Rice Crop. SN Computer Science, 6(3), 1-13.
[1051] Li, W. (2025). Navigation path extraction for garden mobile robot based on road median
point. EURASIP Journal on Advances in Signal Processing, 2025(1), 6.
[1053] Carreira-Perpinan, M. A., and Hinton, G. (2005, January). On contrastive divergence learn-
ing. In International workshop on artificial intelligence and statistics (pp. 33-40). PMLR.
[1054] Hinton, G. E. (2012). A practical guide to training restricted Boltzmann machines. In Neural
Networks: Tricks of the Trade: Second Edition (pp. 599-619). Berlin, Heidelberg: Springer
Berlin Heidelberg.
[1055] Fischer, A., and Igel, C. (2014). Training restricted Boltzmann machines: An introduction.
Pattern Recognition, 47(1), 25-39.
[1056] Larochelle, H., and Bengio, Y. (2008, July). Classification using discriminative restricted
Boltzmann machines. In Proceedings of the 25th international conference on Machine learning
(pp. 536-543).
[1057] Salakhutdinov, R., Mnih, A., and Hinton, G. (2007, June). Restricted Boltzmann machines
for collaborative filtering. In Proceedings of the 24th international conference on Machine
learning (pp. 791-798).
[1058] Coates, A., Ng, A., and Lee, H. (2011, June). An analysis of single-layer networks in unsu-
pervised feature learning. In Proceedings of the fourteenth international conference on artificial
intelligence and statistics (pp. 215-223). JMLR Workshop and Conference Proceedings.
[1059] Hinton, G. E., and Salakhutdinov, R. R. (2009). Replicated softmax: an undirected topic
model. Advances in neural information processing systems, 22.
[1060] Adachi, S. H., and Henderson, M. P. (2015). Application of quantum annealing to training
of deep neural networks. arXiv preprint arXiv:1510.06356.
[1061] Salloum, H., Nayal, L., and Mazzara, M. Evaluating the Advantage 2 Quantum Annealer
Prototype: A Comparative Evaluation with Advantage 1 and Hybrid Solver and Classical
Restricted Boltzmann Machines on MNIST Classification.
[1062] Joudaki, M. (2025). A Comprehensive Literature Review on the Use of Restricted Boltzmann
Machines and Deep Belief Networks for Human Action Recognition.
[1063] Prat Pou, A., Romero, E., Martí, J., and Mazzanti, F. (2025). Mean Field Initialization
of the Annealed Importance Sampling Algorithm for an Efficient Evaluation of the Partition
Function Using Restricted Boltzmann Machines. Entropy, 27(2), 171.
[1064] Decelle, A., Gómez, A. D. J. N., and Seoane, B. (2025). Inferring High-Order Couplings
with Neural Networks. arXiv preprint arXiv:2501.06108.
[1065] Savitha, S., Kannan, A. R., and Logeswaran, K. (2025). Augmenting Cardiovascular Dis-
ease Prediction Through CWCF Integration Leveraging Harris Hawks Search in Deep Belief
Networks. Cognitive Computation, 17(1), 52.
[1066] Béreux, N., Decelle, A., Furtlehner, C., Rosset, L., and Seoane, B. (2025, April). Fast
training and sampling of Restricted Boltzmann Machines. In 13th International Conference on
Learning Representations (ICLR 2025).
[1067] Thériault, R., Tosello, F., and Tantari, D. (2024). Modelling structured data learning with
restricted boltzmann machines in the teacher-student setting. arXiv preprint arXiv:2410.16150.
[1068] Manimurugan, S., Karthikeyan, P., Narmatha, C., Aborokbah, M. M., Paul, A., Ganesan,
S., ... and Ammad-Uddin, M. (2024). A hybrid Bi-LSTM and RBM approach for advanced
underwater object detection. PloS one, 19(11), e0313708.
[1069] Hossain, M. M., Han, T. A., Ara, S. S., and Shamszaman, Z. U. (2025). Benchmark-
ing Classical, Deep, and Generative Models for Human Activity Recognition. arXiv preprint
arXiv:2501.08471.
[1070] Qin, Y., Peng, Z., Miao, L., Chen, Z., Ouyang, J., and Yang, X. (2025). Integrating nanode-
vice and neuromorphic computing for enhanced magnetic anomaly detection. Measurement,
244, 116532.
[1071] Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009, June). Convolutional deep belief
networks for scalable unsupervised learning of hierarchical representations. In Proceedings of
the 26th annual international conference on machine learning (pp. 609-616).
[1072] Mohamed, A. R., Dahl, G. E., and Hinton, G. (2011). Acoustic modeling using deep belief
networks. IEEE transactions on audio, speech, and language processing, 20(1), 14-22.
[1073] Peng, K., Jiao, R., Dong, J., and Pi, Y. (2019). A deep belief network based health indicator
construction and remaining useful life prediction using improved particle filter. Neurocomput-
ing, 361, 19-28.
[1074] Zhang, Z., and Zhao, J. (2017). A deep belief network based fault diagnosis model for
complex chemical processes. Computers and chemical engineering, 107, 395-407.
[1075] Liu, H. (2018). Leveraging financial news for stock trend prediction with attention-based
recurrent neural network. arXiv preprint arXiv:1811.06173.
[1076] Zhang, D., Zou, L., Zhou, X., and He, F. (2018). Integrating feature selection and feature
extraction methods with deep learning to predict clinical outcome of breast cancer. Ieee Access,
6, 28936-28944.
[1077] Hoang, D. T., and Kang, H. J. (2018, June). Deep belief network and dempster-shafer
evidence theory for bearing fault diagnosis. In 2018 IEEE 27th international symposium on
industrial electronics (ISIE) (pp. 841-846). IEEE.
[1078] Zhong, P., Gong, Z., Li, S., and Schönlieb, C. B. (2017). Learning to diversify deep belief
networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote
Sensing, 55(6), 3516-3530.
[1079] Alzughaibi, A. (2025). Leveraging Pattern Recognition based Fusion Approach for Pest
Detection using Modified Artificial Hummingbird Algorithm with Deep Learning. Appl. Math,
19(3), 509-518.
[1080] Tausani, L., Testolin, A., and Zorzi, M. (2025). Investigating the intrinsic top-down dynamics
of deep generative models. Scientific Reports, 15(1), 2875.
[1081] Kumar, S., and Ravi, V. (2025, January). XDATE: eXplainable Deep belief network-based
Auto-encoder with exTended Garson Algorithm. In 2025 17th International Conference on
COMmunication Systems and NETworks (COMSNETS) (pp. 108-113). IEEE.
[1083] Pavithra, D., Bharathraj, R., Poovizhi, P., Libitharan, K., and Nivetha, V. (2025). Detection
of IoT Attacks Using Hybrid RNN-DBN Model. Generative Artificial Intelligence: Concepts
and Applications, 209-225.
[1084] Bhadane, S. N., and Verma, P. (2024, November). Review of Machine Learning and Deep
Learning algorithms for Personality traits classification. In 2024 2nd DMIHER International
Conference on Artificial Intelligence in Healthcare, Education and Industry (IDICAIEI) (pp.
1-6). IEEE.
[1085] Keivanimehr, A. R., and Akbari, M. (2025). TinyML and edge intelligence applications in
cardiovascular disease: A survey. Computers in Biology and Medicine, 186, 109653.
[1086] Kobak, D., and Berens, P. (2019). The art of using t-SNE for single-cell transcriptomics.
Nature communications, 10(1), 5416.
[1087] Belkina, A. C., Ciccolella, C. O., Anno, R., Halpert, R., Spidlen, J., and Snyder-Cappione, J.
E. (2019). Automated optimized parameters for T-distributed stochastic neighbor embedding
improve visualization and analysis of large datasets. Nature communications, 10(1), 5415.
[1088] Linderman, G. C., and Steinerberger, S. (2019). Clustering with t-SNE, provably. SIAM
journal on mathematics of data science, 1(2), 313-332.
[1089] De Amorim, R. C., and Mirkin, B. (2012). Minkowski metric, feature weighting and anoma-
lous cluster initializing in K-Means clustering. Pattern Recognition, 45(3), 1061-1075.
[1090] Wattenberg, M., Viégas, F., and Johnson, I. (2016). How to use t-SNE effectively. Distill,
1(10), e2.
[1091] Pezzotti, N., Lelieveldt, B. P., Van Der Maaten, L., Höllt, T., Eisemann, E., and Vilanova,
A. (2016). Approximated and user steerable tSNE for progressive visual analytics. IEEE trans-
actions on visualization and computer graphics, 23(7), 1739-1752.
[1092] Kobak, D., and Linderman, G. C. (2021). Initialization is critical for preserving global data
structure in both t-SNE and UMAP. Nature biotechnology, 39(2), 156-157.
[1093] Becht, E., McInnes, L., Healy, J., Dutertre, C. A., Kwok, I. W., Ng, L. G., ... and Newell,
E. W. (2019). Dimensionality reduction for visualizing single-cell data using UMAP. Nature
biotechnology, 37(1), 38-44.
[1094] Moon, K. R., Van Dijk, D., Wang, Z., Gigante, S., Burkhardt, D. B., Chen, W. S., ... and
Krishnaswamy, S. (2019). Visualizing structure and transitions in high-dimensional biological
data. Nature biotechnology, 37(12), 1482-1492.
[1095] Rivera, G., and Deniega, J. V. Artificial Intelligence-Driven Automation of Flow Cytometry
Gating. Capstone Chronicles, 186.
[1097] Chern, W. C., Gunay, E., Okudan-Kremer, G. E., and Kremer, P. Exploring the Impact
of Defect Attributes and Augmentation Variability on Recent YOLO Variants for Metal Defect
Detection. Available at SSRN 5149346.
[1098] Li, D., Monteiro, D. D. G. N., Jiang, H., and Chen, Q. (2025). Qualitative analysis of wheat
aflatoxin B1 using olfactory visualization technique based on natural anthocyanins. Journal of
Food Composition and Analysis, 107359.
[1099] Singh, M., and Singh, M. K. (2025). Content-Based Gastric Image Retrieval Using Fusion
of Deep Learning Features with Dimensionality Reduction. SN Computer Science, 6(2), 1-12.
[1100] Sun, J. Q., Zhang, C., Liu, G. D., and Zhang, C. Detecting Muscle Fatigue during Lower
Limb Isometric Contractions Tasks: A Machine Learning Approach. Frontiers in Physiology,
16, 1547257.
[1101] Su, Z., Xiao, X., Tong, D., Wang, X., Zhong, Z., Zhao, P., and Yu, J. (2025, March). Seismic
fragility of earth-rock dams with heterogeneous compacted materials using deep learning-aided
intensity measure. In Structures (Vol. 73, p. 108373). Elsevier.
[1102] Yousif, A. Y., and Al-Sarray, B. (2025, March). Integrating t-SNE and spectral clustering
via convex optimization for enhanced breast cancer gene expression data diagnosing. In AIP
Conference Proceedings (Vol. 3264, No. 1). AIP Publishing.
[1103] Park, M. S., Lee, J. K., Kim, B., Ju, H. Y., Yoo, K. H., Jung, C. W., ... and Kim, H. Y.
(2025). Assessing the clinical applicability of dimensionality reduction algorithms in flow cy-
tometry for hematologic malignancies. Clinical Chemistry and Laboratory Medicine (CCLM),
(0).
[1104] Qiao, S., Yang, L., Zhang, G., Lu, A., and Li, F. (2025). Abstract B097: Cancer-
associated fibroblasts in pancreatic ductal adenocarcinoma patients defined by a core inflam-
matory gene network exhibited inflammatory characteristics with high CCN2 expression. Can-
cer Immunology Research, 13(2-Supplement), B097-B097.
[1105] Saul, L. K., and Roweis, S. T. (2000). An introduction to locally linear embedding.
Unpublished. Available at: http://www.cs.toronto.edu/~roweis/lle/publications.html
[1106] Polito, M., and Perona, P. (2001). Grouping and dimensionality reduction by locally linear
embedding. Advances in neural information processing systems, 14.
[1107] Zhang, Z., and Zha, H. (2004). Principal manifolds and nonlinear dimensionality reduction
via tangent space alignment. SIAM journal on scientific computing, 26(1), 313-338.
[1108] Donoho, D. L., and Grimes, C. (2003). Hessian eigenmaps: Locally linear embedding tech-
niques for high-dimensional data. Proceedings of the National Academy of Sciences, 100(10),
5591-5596.
[1109] Zhang, Z., and Wang, J. (2006). MLLE: Modified locally linear embedding using multiple
weights. Advances in neural information processing systems, 19.
[1110] Liang, P. (2005). Semi-supervised learning for natural language (Doctoral dissertation, Mas-
sachusetts Institute of Technology).
[1111] Coates, A., and Ng, A. Y. (2012). Learning feature representations with k-means. In Neural
Networks: Tricks of the Trade: Second Edition (pp. 561-580). Berlin, Heidelberg: Springer
Berlin Heidelberg.
[1112] Hyvärinen, A., and Oja, E. (2000). Independent component analysis: algorithms and appli-
cations. Neural networks, 13(4-5), 411-430.
[1113] Lee, H., Battle, A., Raina, R., and Ng, A. (2006). Efficient sparse coding algorithms. Ad-
vances in neural information processing systems, 19.
[1114] Yang, B., Gu, X., An, S., Song, K., Wang, S., Qiu, X., and Meng, X. (2025). Assessment
of Chinese Cities' International Tourism Competitiveness Using an Integrated Entropy-TOPSIS
and GRA Model.
[1115] Wang, Y., Ma, T., Shen, L., Wang, X., and Luo, R. (2025). Prediction of thermal conduc-
tivity of natural rock materials using LLE-transformer-lightGBM model for geothermal energy
applications. Energy Reports, 13, 2516-2530.
[1116] Jin, X., Li, H., Xu, X., Xu, Z., and Su, F. (2025). Inverse Synthetic Aperture Radar Image
Multi-Modal Zero-Shot Learning Based on the Scattering Center Model and Neighbor-Adapted
Locally Linear Embedding. Remote Sensing, 17(4), 725.
[1117] Li, X., Zhu, Z., Hui, L., Ma, X., Li, D., Yang, Z., and Nai, W. (2024, December). Locally
Linear Embedding Based on Niederreiter Sequence Initialized Ali Baba and The Forty Thieves
Algorithm. In 2024 IEEE 4th International Conference on Information Technology, Big Data
and Artificial Intelligence (ICIBA) (Vol. 4, pp. 1466-1470). IEEE.
[1118] Jafari, P., Espandar, E., Baharifard, F., and Chakraverty, S. (2025). Linear local embedding.
In Dimensionality Reduction in Machine Learning (pp. 129-156). Morgan Kaufmann.
[1119] Zhou, X., Ye, D., Yin, C., Wu, Y., Chen, S., Ge, X., ... and Liu, Q. (2025). Application
of Machine Learning in Terahertz-Based Nondestructive Testing of Thermal Barrier Coatings
with High-Temperature Growth Stresses. Coatings, 15(1), 49.
[1120] Dou, F., Ju, Y., and Cheng, C. (2024, December). Fault detection based on locally linear
embedding for traction systems in high-speed trains. In Fourth International Conference on
Testing Technology and Automation Engineering (TTAE 2024) (Vol. 13439, pp. 314-319).
SPIE.
[1121] Bagherzadeh, M., Kahani, N., and Briand, L. (2021). Reinforcement learning for test case
prioritization. IEEE Transactions on Software Engineering, 48(8), 2836-2856.
[1122] Liu, H., Yang, B., Kang, F., Li, Q., and Zhang, H. (2025). Intelligent recognition algorithm
of connection relation of substation secondary wiring drawing based on D-LLE algorithm.
Discover Applied Sciences, 7(1), 1-12.
[1123] Comon, P. (1994). Independent component analysis, a new concept?. Signal processing,
36(3), 287-314.
[1124] Jutten, C., and Herault, J. (1991). Blind separation of sources, part I: An adaptive algorithm
based on neuromimetic architecture. Signal processing, 24(1), 1-10.
[1125] Hyvärinen, A., and Oja, E. (1997). A fast fixed-point algorithm for independent component
analysis. Neural computation, 9(7), 1483-1492.
[1126] Cardoso, J. F., and Souloumiac, A. (1993, December). Blind beamforming for non-Gaussian
signals. In IEE proceedings F (radar and signal processing) (Vol. 140, No. 6, pp. 362-370).
IEE.
[1127] Amari, S. I., Cichocki, A., and Yang, H. (1995). A new learning algorithm for blind signal
separation. Advances in neural information processing systems, 8.
[1128] Lee, T. W., Girolami, M., and Sejnowski, T. J. (1999). Independent component analysis
using an extended infomax algorithm for mixed subgaussian and supergaussian sources. Neural
computation, 11(2), 417-441.
[1129] Pham, D. T., and Garat, P. (1997). Blind separation of mixture of independent sources
through a quasi-maximum likelihood approach. IEEE transactions on Signal Processing, 45(7),
1712-1725.
[1130] Højen-Sørensen, P. A., Winther, O., and Hansen, L. K. (2002). Mean-field approaches to
independent component analysis. Neural Computation, 14(4), 889-918.
[1132] Behzadfar, N., Mathalon, D., Preda, A., Iraji, A., and Calhoun, V. D. (2025). A multi-
frequency ICA-based approach for estimating voxelwise frequency difference patterns in fMRI
data. Aperture Neuro, 5.
[1133] Eierud, C., Norgaard, M., Bilgel, M., Petropoulos, H., Fu, Z., Iraji, A., ... and Calhoun,
V. (2025). Building Multivariate Molecular Imaging Brain Atlases Using the NeuroMark PET
Independent Component Analysis Framework. bioRxiv, 2025-02.
[1134] Wang, J., Shen, Y., Awange, J., Tangdamrongsub, N., Feng, T., Hu, K., ... and Wang, X.
(2025). Exploring potential drivers of terrestrial water storage anomaly trends in the Yangtze
River Basin (2002–2019). Journal of Hydrology: Regional Studies, 58, 102264.
[1135] Heurtebise, A., Chehab, O., Ablin, P., Gramfort, A., and Hyvärinen, A. (2025). Identifiable
Multi-View Causal Discovery Without Non-Gaussianity. arXiv preprint arXiv:2502.20115.
[1136] Ouyang, G., and Li, Y. (2025). Protocol for semi-automatic EEG preprocessing incorporat-
ing independent component analysis and principal component analysis. STAR Protocols, 6(1),
103682.
[1137] Zhang, G., and Luck, S. (2025). Assessing the impact of artifact correction and artifact
rejection on the performance of SVM-based decoding of EEG signals. bioRxiv, 2025-02.
[1138] Kirsten, O., and Süssmuth, B. (2025). Forecasting the unforecastable: An independent
component analysis for majority game-like global cryptocurrencies. Physica A: Statistical Me-
chanics and its Applications, 130472.
[1139] Jung, S., Kim, J., and Kim, S. (2025). A hybrid fault detection method of independent
component analysis and auto-associative kernel regression for process monitoring in power
plant. IEEE Access.
[1140] Wang, Z., Hu, L., Wang, Y., Li, H., Li, J., Tian, Z., and Zhou, H. (2025, February). A dual
five-element stereo array passive acoustic localization fusion method. In Fourth International
Computational Imaging Conference (CITA 2024) (Vol. 13542, pp. 17-28). SPIE.
[1141] Luo, W., Xiong, S., Li, Y., and Jiang, P. (2025, March). Research on brain signal acquisition
and transmission via noninvasive brain-computer interface. In Third International Conference
on Algorithms, Network, and Communication Technology (ICANCT 2024) (Vol. 13545, pp.
366-374). SPIE.
[1142] McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. (2017, April).
Communication-efficient learning of deep networks from decentralized data. In Artificial intel-
ligence and statistics (pp. 1273-1282). PMLR.
[1143] Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., ... and Zhao,
S. (2021). Advances and open problems in federated learning. Foundations and trends® in
machine learning, 14(1–2), 1-210.
[1144] Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang,
L. (2016, October). Deep learning with differential privacy. In Proceedings of the 2016 ACM
SIGSAC conference on computer and communications security (pp. 308-318).
[1145] Bonawitz, K., Ivanov, V., Kreuter, B., Marcedone, A., McMahan, H. B., Patel, S., ... and
Seth, K. (2017, October). Practical secure aggregation for privacy-preserving machine learning.
In proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications
Security (pp. 1175-1191).
[1146] Zhao, Y., Li, M., Lai, L., Suda, N., Civin, D., and Chandra, V. (2018). Federated learning
with non-iid data. arXiv preprint arXiv:1806.00582.
[1147] Sattler, F., Wiedemann, S., Müller, K. R., and Samek, W. (2019). Robust and
communication-efficient federated learning from non-iid data. IEEE transactions on neural
networks and learning systems, 31(9), 3400-3413.
[1148] Reddi, S., Charles, Z., Zaheer, M., Garrett, Z., Rush, K., Konečný, J., ... and McMahan, H.
B. (2020). Adaptive federated optimization. arXiv preprint arXiv:2003.00295.
[1149] Sattler, F., Marban, A., Rischke, R., and Samek, W. (2020). Communication-efficient fed-
erated distillation. arXiv preprint arXiv:2012.00632.
[1150] Fallah, A., Mokhtari, A., and Ozdaglar, A. (2020). Personalized federated learning with the-
oretical guarantees: A model-agnostic meta-learning approach. Advances in neural information
processing systems, 33, 3557-3568.
[1151] Sheller, M. J., Edwards, B., Reina, G. A., Martin, J., Pati, S., Kotrotsou, A., ... and Bakas, S.
(2020). Federated learning in medicine: facilitating multi-institutional collaborations without
sharing patient data. Scientific reports, 10(1), 12598.
[1152] Byrd, D., and Polychroniadou, A. (2020, October). Differentially private secure multi-party
computation for federated learning in financial applications. In Proceedings of the first ACM
international conference on AI in finance (pp. 1-9).
[1153] Jagatheesaperumal, S. K., Rahouti, M., Ahmad, K., Al-Fuqaha, A., and Guizani, M. (2021).
The duo of artificial intelligence and big data for industry 4.0: Applications, techniques,
challenges, and future research directions. IEEE Internet of Things Journal, 9(15), 12861-
12885.
[1154] Meduri, K., Nadella, G. S., Yadulla, A. R., Kasula, V. K., Maturi, M. H., Brown, S., ...
and Gonaygunta, H. (2024). Leveraging Federated Learning for Privacy-Preserving Analysis of
Multi-Institutional Electronic Health Records in Rare Disease Research. Journal of Economy
and Technology.
[1155] Tzortzis, I. N., Gutierrez-Torre, A., Sykiotis, S., Agulló, F., Bakalos, N., Doulamis, A., ...
and Berral, J. L. (2025). Towards generalizable Federated Learning in Medical Imaging: A
real-world case study on mammography data. Computational and Structural Biotechnology
Journal.
[1156] Szelag, J. K., Chin, J. J., and Yip, S. C. (2025). Adaptive Adversaries in Byzantine-Robust
Federated Learning: A survey. Cryptology ePrint Archive.
[1157] Ferretti, S., Cassano, L., Cialone, G., D’Abramo, J., and Imboccioli, F. (2025). Decentralized
coordination for resilient federated learning: A blockchain-based approach with smart contracts
and decentralized storage. Computer Communications, 108112.
[1158] Chen, Z., Hoang, D., Piran, F. J., Chen, R., and Imani, F. (2025). Federated Hyperdimen-
sional Computing for hierarchical and distributed quality monitoring in smart manufacturing.
Internet of Things, 101568.
[1159] Mei, Q., Huang, R., Li, D., Li, J., Shi, N., Du, M., ... and Tian, C. (2025). Intelligent
hierarchical federated learning system based on semi-asynchronous and scheduled synchronous
control strategies in satellite network. Autonomous Intelligent Systems, 5(1), 9.
[1160] Rawas, S., and Samala, A. D. (2025). EAFL: Edge-Assisted Federated Learning for real-time
disease prediction using privacy-preserving AI. Iran Journal of Computer Science, 1-11.
[1161] Becker, C., Peregrina, J. A., Beccard, F., Mohr, M., and Zirpins, C. (2025). A Study on the
Efficiency of Combined Reconstruction and Poisoning Attacks in Federated Learning. Journal
of Data Science and Intelligent Systems.
[1162] Fu, H., Tian, F., Deng, G., Liang, L., and Zhang, X. (2025). Reads: A Personalized Feder-
ated Learning Framework with Fine-grained Layer Aggregation and Decentralized Clustering.
IEEE Transactions on Mobile Computing.
[1163] Li, Y., Kundu, S. S., Boels, M., Mahmoodi, T., Ourselin, S., Vercauteren, T., ... and
Granados, A. (2025). UltraFlwr–An Efficient Federated Medical and Surgical Object Detection
Framework. arXiv preprint arXiv:2503.15161.
[1164] Shi, C., Li, J., Zhao, H., and Chang, Y. (2025). FedLWS: Federated Learning with Adaptive
Layer-wise Weight Shrinking. arXiv preprint arXiv:2503.15111.
[1166] Zhou, Z., He, Y., Zhang, W., Ding, Z., Wu, B., and Xiao, K. Blockchain-Empowered Cluster
Distillation Federated Learning for Heterogeneous Smart Grids. Available at SSRN 5187086.
[1167] Scheirer, W. J., de Rezende Rocha, A., Sapkota, A., and Boult, T. E. (2013). Toward
open set recognition. IEEE transactions on pattern analysis and machine intelligence, 35(7),
1757-1772.
[1168] Bendale, A., and Boult, T. (2015). Towards open world recognition. In Proceedings of the
IEEE conference on computer vision and pattern recognition (pp. 1893-1902).
[1169] Panareda Busto, P., and Gall, J. (2017). Open set domain adaptation. In Proceedings of the
IEEE international conference on computer vision (pp. 754-763).
[1170] Saito, K., Yamamoto, S., Ushiku, Y., and Harada, T. (2018). Open set domain adaptation
by backpropagation. In Proceedings of the European conference on computer vision (ECCV)
(pp. 153-168).
[1171] Geng, C., Huang, S. J., and Chen, S. (2020). Recent advances in open set recognition: A
survey. IEEE transactions on pattern analysis and machine intelligence, 43(10), 3614-3631.
[1172] Chen, G., Qiao, L., Shi, Y., Peng, P., Li, J., Huang, T., ... and Tian, Y. (2020). Learning
open set network with discriminative reciprocal points. In Computer Vision–ECCV 2020: 16th
European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16 (pp. 507-
522). Springer International Publishing.
[1173] Liu, B., Kang, H., Li, H., Hua, G., and Vasconcelos, N. (2020). Few-shot open-set recognition
using meta-learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (pp. 8798-8807).
[1174] Kong, S., and Ramanan, D. (2021). Opengan: Open-set recognition via open data gen-
eration. In Proceedings of the IEEE/CVF international conference on computer vision (pp.
813-822).
[1175] Fang, Z., Lu, J., Liu, A., Liu, F., and Zhang, G. (2021, July). Learning bounds for open-set
learning. In International conference on machine learning (pp. 3122-3132). PMLR.
[1176] Mandivarapu, J. K., Camp, B., and Estrada, R. (2022). Deep active learning via open-set
recognition. Frontiers in Artificial Intelligence, 5, 737363.
[1177] Engelbrecht, E. R., and Preez, J. A. D. (2020). Open-set learning with augmented categories
by exploiting unlabelled data. arXiv preprint arXiv:2002.01368.
[1178] Shao, J. J., Yang, X. W., and Guo, L. Z. (2024). Open-set learning under covariate shift.
Machine Learning, 113(4), 1643-1659.
[1179] Park, J., Park, H., Jeong, E., and Teoh, A. B. J. (2024). Understanding open-set recognition
by Jacobian norm and inter-class separation. Pattern Recognition, 145, 109942.
[1180] Liu, Y. C., Ma, C. Y., Dai, X., Tian, J., Vajda, P., He, Z., and Kira, Z. (2022, October).
Open-set semi-supervised object detection. In European conference on computer vision (pp.
143-159). Cham: Springer Nature Switzerland.
[1181] Vaze, S., Han, K., Vedaldi, A., and Zisserman, A. (2021). Open-set recognition: A good
closed-set classifier is all you need?
[1182] Barcina-Blanco, M., Lobo, J. L., Garcia-Bringas, P., and Del Ser, J. (2023). Manag-
ing the unknown: a survey on open set recognition and tangential areas. arXiv preprint
arXiv:2312.08785.
[1183] iCGY96. (2023). Awesome Open Set Recognition List. GitHub. Retrieved April 1, 2025,
from https://github.com/iCGY96/awesome_OpenSetRecognition_list
[1184] Wikipedia contributors. (n.d.). Topological deep learning. Wikipedia, The Free Encyclo-
pedia. Retrieved April 1, 2025, from https://en.wikipedia.org/wiki/Topological_deep_
learning
[1185] Zhou, Y., Fang, S., Li, S., Wang, B., and Kung, S. Y. (2024). Contrastive learning based
open-set recognition with unknown score. Knowledge-Based Systems, 296, 111926.
[1186] Abouzaid, S., Jaeschke, T., Kueppers, S., Barowski, J., and Pohl, N. (2023). Deep learning-
based material characterization using FMCW radar with open-set recognition technique. IEEE
Transactions on Microwave Theory and Techniques, 71(11), 4628-4638.
[1187] Cevikalp, H., Uzun, B., Salk, Y., Saribas, H., and Köpüklü, O. (2023). From anomaly
detection to open set recognition: Bridging the gap. Pattern Recognition, 138, 109385.
[1188] Palechor, A., Bhoumik, A., and Günther, M. (2023). Large-scale open-set classification
protocols for imagenet. In Proceedings of the IEEE/CVF Winter Conference on Applications
of Computer Vision (pp. 42-51).
[1189] Cen, J., Luan, D., Zhang, S., Pei, Y., Zhang, Y., Zhao, D., ... and Chen, Q. (2023). The
devil is in the wrongly-classified samples: Towards unified open-set recognition. arXiv preprint
arXiv:2302.04002.
[1190] Huang, H., Wang, Y., Hu, Q., and Cheng, M. M. (2022). Class-specific semantic reconstruc-
tion for open set recognition. IEEE transactions on pattern analysis and machine intelligence,
45(4), 4214-4228.
[1191] Wang, Z., Xu, Q., Yang, Z., He, Y., Cao, X., and Huang, Q. (2022). Openauc: Towards
auc-oriented open-set recognition. Advances in Neural Information Processing Systems, 35,
25033-25045.
[1192] Alliegro, A., Borlino, F. C., and Tommasi, T. (2022). Towards open set 3d learning: A
benchmark on object point clouds. arXiv preprint arXiv:2207.11554.
[1193] Grieggs, S., Shen, B., Rauch, G., Li, P., Ma, J., Chiang, D., ... and Scheirer, W. J. (2021).
Measuring human perception to improve handwritten document transcription. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 44(10), 6594-6601.
[1194] Grcić, M., Bevandić, P., and Šegvić, S. (2022, October). Densehybrid: Hybrid anomaly
detection for dense open-set recognition. In European Conference on Computer Vision (pp.
500-517). Cham: Springer Nature Switzerland.
[1195] Moon, W., Park, J., Seong, H. S., Cho, C. H., and Heo, J. P. (2022, October). Difficulty-
aware simulator for open set recognition. In European conference on computer vision (pp.
365-381). Cham: Springer Nature Switzerland.
[1196] Kuchibhotla, H. C., Malagi, S. S., Chandhok, S., and Balasubramanian, V. N. (2022).
Unseen classes at a later time? no problem. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition (pp. 9245-9254).
[1197] Katsumata, K., Vo, D. M., and Nakayama, H. (2022). Ossgan: Open-set semi-supervised
image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (pp. 11185-11193).
[1198] Bao, W., Yu, Q., and Kong, Y. (2022). Opental: Towards open set temporal action lo-
calization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (pp. 2979-2989).
[1199] Dietterich, T. G., and Guyer, A. (2022). The familiarity hypothesis: Explaining the behavior
of deep open set methods. Pattern Recognition, 132, 108931.
[1200] Cai, J., Wang, Y., Hsu, H. M., Hwang, J. N., Magrane, K., and Rose, C. S. (2022, June).
Luna: Localizing unfamiliarity near acquaintance for open-set long-tailed recognition. In Pro-
ceedings of the AAAI conference on artificial intelligence (Vol. 36, No. 1, pp. 131-139).
[1201] Wang, Q. F., Geng, X., Lin, S. X., Xia, S. Y., Qi, L., and Xu, N. (2022, June). Learngene:
From open-world to your learning task. In Proceedings of the AAAI Conference on Artificial
Intelligence (Vol. 36, No. 8, pp. 8557-8565).
[1202] Zhang, X., Cheng, X., Zhang, D., Bonnington, P., and Ge, Z. (2022, June). Learning network
architecture for open-set recognition. In Proceedings of the AAAI Conference on Artificial
Intelligence (Vol. 36, No. 3, pp. 3362-3370).
[1203] Lu, J., Xu, Y., Li, H., Cheng, Z., and Niu, Y. (2022, June). Pmal: Open set recognition
via robust prototype mining. In Proceedings of the AAAI conference on artificial intelligence
(Vol. 36, No. 2, pp. 1872-1880).
[1204] Xia, Z., Wang, P., Dong, G., and Liu, H. (2023). Adversarial kinetic prototype framework
for open set recognition. IEEE Transactions on Neural Networks and Learning Systems.
[1205] Kong, S., and Ramanan, D. (2021). Opengan: Open-set recognition via open data gen-
eration. In Proceedings of the IEEE/CVF international conference on computer vision (pp.
813-822).
[1206] Huang, J., Fang, C., Chen, W., Chai, Z., Wei, X., Wei, P., ... and Li, G. (2021). Trash
to treasure: Harvesting ood data with cross-modal matching for open-set semi-supervised
learning. In Proceedings of the IEEE/CVF international conference on computer vision (pp.
8310-8319).
[1207] Wang, Y., Li, B., Che, T., Zhou, K., Liu, Z., and Li, D. (2021). Energy-based open-world un-
certainty modeling for confidence calibration. In Proceedings of the IEEE/CVF International
Conference on Computer Vision (pp. 9302-9311).
[1208] Zhang, H., and Ding, H. (2021). Prototypical matching and open set rejection for zero-shot
semantic segmentation. In Proceedings of the IEEE/CVF international conference on computer
vision (pp. 6974-6983).
[1209] Girish, S., Suri, S., Rambhatla, S. S., and Shrivastava, A. (2021). Towards discovery and
attribution of open-world gan generated images. In Proceedings of the IEEE/CVF international
conference on computer vision (pp. 14094-14103).
[1210] Wang, W., Feiszli, M., Wang, H., and Tran, D. (2021). Unidentified video objects: A bench-
mark for dense, open-world segmentation. In Proceedings of the IEEE/CVF international
conference on computer vision (pp. 10776-10785).
[1211] Cen, J., Yun, P., Cai, J., Wang, M. Y., and Liu, M. (2021). Deep metric learning for open
world semantic segmentation. In Proceedings of the IEEE/CVF international conference on
computer vision (pp. 15333-15342).
[1212] Wu, Z. F., Wei, T., Jiang, J., Mao, C., Tang, M., and Li, Y. F. (2021). Ngc: A unified frame-
work for learning with open-world noisy data. In Proceedings of the IEEE/CVF International
Conference on Computer Vision (pp. 62-71).
[1213] Bastan, M., Wu, H. Y., Cao, T., Kota, B., and Tek, M. (2019). Large scale open-set deep
logo detection. arXiv preprint arXiv:1911.07440.
[1214] Saito, K., Kim, D., and Saenko, K. (2021). Openmatch: Open-set consistency regularization
for semi-supervised learning with outliers. arXiv preprint arXiv:2105.14148.
[1215] Esmaeilpour, S., Liu, B., Robertson, E., and Shu, L. (2022, June). Zero-shot out-of-
distribution detection based on the pre-trained model clip. In Proceedings of the AAAI con-
ference on artificial intelligence (Vol. 36, No. 6, pp. 6568-6576).
[1216] Chen, G., Peng, P., Wang, X., and Tian, Y. (2021). Adversarial reciprocal points learning
for open set recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,
44(11), 8065-8081.
[1217] Guo, Y., Camporese, G., Yang, W., Sperduti, A., and Ballan, L. (2021). Conditional varia-
tional capsule network for open set recognition. In Proceedings of the IEEE/CVF International
Conference on Computer Vision (pp. 103-111).
[1218] Bao, W., Yu, Q., and Kong, Y. (2021). Evidential deep learning for open set action recog-
nition. In Proceedings of the IEEE/CVF international conference on computer vision (pp.
13349-13358).
[1219] Sun, X., Ding, H., Zhang, C., Lin, G., and Ling, K. V. (2021). M2iosr: Maximal mutual
information open set recognition. arXiv preprint arXiv:2108.02373.
[1220] Hwang, J., Oh, S. W., Lee, J. Y., and Han, B. (2021). Exemplar-based open-set panoptic
segmentation network. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (pp. 1175-1184).
[1221] Balasubramanian, L., Kruber, F., Botsch, M., and Deng, K. (2021, July). Open-set recogni-
tion based on the combination of deep learning and ensemble method for detecting unknown
traffic scenarios. In 2021 IEEE Intelligent Vehicles Symposium (IV) (pp. 674-681). IEEE.
[1222] Jang, J., and Kim, C. O. (2023). Teacher–explorer–student learning: A novel learning
method for open set recognition. IEEE Transactions on Neural Networks and Learning Sys-
tems.
[1223] Zhou, D. W., Ye, H. J., and Zhan, D. C. (2021). Learning placeholders for open-set recogni-
tion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
(pp. 4401-4410).
[1224] Cevikalp, H., Uzun, B., Köpüklü, O., and Ozturk, G. (2021). Deep compact polyhedral conic
classifier for open and closed set recognition. Pattern Recognition, 119, 108080.
[1225] Yue, Z., Wang, T., Sun, Q., Hua, X. S., and Zhang, H. (2021). Counterfactual zero-shot and
open-set visual recognition. In Proceedings of the IEEE/CVF conference on computer vision
and pattern recognition (pp. 15404-15414).
[1226] Jia, J., and Chan, P. K. (2022, September). Self-supervised detransformation autoencoder
for representation learning in open set recognition. In International Conference on Artificial
Neural Networks (pp. 471-483). Cham: Springer Nature Switzerland.
[1227] Jia, J., and Chan, P. K. (2021). MMF: A loss extension for feature learning in open set
recognition. In Artificial Neural Networks and Machine Learning–ICANN 2021: 30th Interna-
tional Conference on Artificial Neural Networks, Bratislava, Slovakia, September 14–17, 2021,
Proceedings, Part II 30 (pp. 319-331). Springer International Publishing.
[1228] Salomon, G., Britto, A., Vareto, R. H., Schwartz, W. R., and Menotti, D. (2020, July).
Open-set face recognition for small galleries using siamese networks. In 2020 International
conference on systems, signals and image processing (IWSSIP) (pp. 161-166). IEEE.
[1229] Sun, X., Yang, Z., Zhang, C., Ling, K. V., and Peng, G. (2020). Conditional gaussian
distribution learning for open set recognition. In Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition (pp. 13480-13489).
[1230] Perera, P., Morariu, V. I., Jain, R., Manjunatha, V., Wigington, C., Ordonez, V., and Patel,
V. M. (2020). Generative-discriminative feature representations for open-set recognition. In
Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp.
11814-11823).
[1231] Ditria, L., Meyer, B. J., and Drummond, T. (2020). Opengan: Open set generative adver-
sarial networks. In Proceedings of the Asian conference on computer vision.
[1232] Geng, C., and Chen, S. (2020). Collective decision for open set recognition. IEEE Transac-
tions on Knowledge and Data Engineering, 34(1), 192-204.
[1233] Jang, J., and Kim, C. O. (2020). One-vs-rest network-based deep probability model for open
set recognition. arXiv preprint arXiv:2004.08067.
[1234] Zhang, H., Li, A., Guo, J., and Guo, Y. (2020). Hybrid models for open set recognition.
In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28,
2020, Proceedings, Part III 16 (pp. 102-117). Springer International Publishing.
[1235] Shao, R., Perera, P., Yuen, P. C., and Patel, V. M. (2020, August). Open-set adversarial
defense. In European Conference on Computer Vision (pp. 682-698). Cham: Springer Interna-
tional Publishing.
[1236] Yu, Q., Ikami, D., Irie, G., and Aizawa, K. (2020). Multi-task curriculum framework for
open-set semi-supervised learning. In Computer Vision–ECCV 2020: 16th European Confer-
ence, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII 16 (pp. 438-454). Springer
International Publishing.
[1237] Miller, D., Sunderhauf, N., Milford, M., and Dayoub, F. (2021). Class anchor clustering: A
loss for distance-based open set recognition. In Proceedings of the IEEE/CVF Winter Confer-
ence on Applications of Computer Vision (pp. 3570-3578).
[1238] Jia, J., and Chan, P. K. (2021). MMF: A loss extension for feature learning in open set
recognition. In Artificial Neural Networks and Machine Learning–ICANN 2021: 30th Interna-
tional Conference on Artificial Neural Networks, Bratislava, Slovakia, September 14–17, 2021,
Proceedings, Part II 30 (pp. 319-331). Springer International Publishing.
[1239] Oliveira, H., Silva, C., Machado, G. L., Nogueira, K., and Dos Santos, J. A. (2023). Fully
convolutional open set segmentation. Machine Learning, 1-52.
[1240] Yang, Y., Wei, H., Sun, Z. Q., Li, G. Y., Zhou, Y., Xiong, H., and Yang, J. (2021). S2OSC: A
holistic semi-supervised approach for open set classification. ACM Transactions on Knowledge
Discovery from Data (TKDD), 16(2), 1-27.
[1241] Sun, X., Zhang, C., Lin, G., and Ling, K. V. (2020). Open set recognition with conditional
probabilistic generative models. arXiv preprint arXiv:2008.05129.
[1242] Yang, H. M., Zhang, X. Y., Yin, F., Yang, Q., and Liu, C. L. (2020). Convolutional proto-
type network for open set recognition. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 44(5), 2358-2370.
[1243] Dhamija, A., Gunther, M., Ventura, J., and Boult, T. (2020). The overlooked elephant of ob-
ject detection: Open set. In Proceedings of the IEEE/CVF Winter Conference on Applications
of Computer Vision (pp. 1021-1030).
[1244] Meyer, B. J., and Drummond, T. (2019, May). The importance of metric learning for robotic
vision: Open set recognition and active learning. In 2019 International Conference on Robotics
and Automation (ICRA) (pp. 2924-2931). IEEE.
[1245] Oza, P., and Patel, V. M. (2019). Deep CNN-based multi-task learning for open-set recog-
nition. arXiv preprint arXiv:1903.03161.
[1246] Yoshihashi, R., Shao, W., Kawakami, R., You, S., Iida, M., and Naemura, T.
(2019). Classification-reconstruction learning for open-set recognition. In Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition (pp. 4016-4025).
[1247] Malalur, P., and Jaakkola, T. (2019). Alignment based matching networks for one-shot
classification and open-set recognition. arXiv preprint arXiv:1903.06538.
[1248] Schlachter, P., Liao, Y., and Yang, B. (2019, September). Open-set recognition using intra-
class splitting. In 2019 27th European signal processing conference (EUSIPCO) (pp. 1-5).
IEEE.
[1249] Imoscopi, S., Grancharov, V., Sverrisson, S., Karlsson, E., and Pobloth, H. (2019). Exper-
iments on Open-Set Speaker Identification with Discriminatively Trained Neural Networks.
arXiv preprint arXiv:1904.01269.
[1250] Mundt, M., Pliushch, I., Majumder, S., and Ramesh, V. (2019). Open set recognition
through deep neural network uncertainty: Does out-of-distribution detection require genera-
tive classifiers? In Proceedings of the IEEE/CVF international conference on computer vision
workshops.
[1251] Liu, Z., Miao, Z., Zhan, X., Wang, J., Gong, B., and Yu, S. X. (2019). Large-scale long-tailed
recognition in an open world. In Proceedings of the IEEE/CVF conference on computer vision
and pattern recognition (pp. 2537-2546).
[1252] Perera, P., and Patel, V. M. (2019). Deep transfer learning for multiple class novelty de-
tection. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition
(pp. 11544-11552).
[1253] Xiong, H., Lu, H., Liu, C., Liu, L., Cao, Z., and Shen, C. (2019). From open set to closed
set: Counting objects by spatial divide-and-conquer. In Proceedings of the IEEE/CVF inter-
national conference on computer vision (pp. 8362-8371).
[1254] Yang, Y., Hou, C., Lang, Y., Guan, D., Huang, D., and Xu, J. (2019). Open-set human
activity recognition based on micro-Doppler signatures. Pattern Recognition, 85, 60-69.
[1255] Oza, P., and Patel, V. M. (2019). C2ae: Class conditioned auto-encoder for open-set recogni-
tion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
(pp. 2307-2316).
[1256] Liu, S., Garrepalli, R., Dietterich, T., Fern, A., and Hendrycks, D. (2018, July). Open
category detection with PAC guarantees. In International Conference on Machine Learning
(pp. 3169-3178). PMLR.
[1257] Venkataram, V. M. (2018). Open set text classification using neural networks. University of
Colorado Colorado Springs.
[1258] Hassen, M., and Chan, P. K. (2020). Learning a neural-network-based representation for
open set recognition. In Proceedings of the 2020 SIAM International Conference on Data
Mining (pp. 154-162). Society for Industrial and Applied Mathematics.
[1259] Shu, L., Xu, H., and Liu, B. (2018). Unseen class discovery in open-world classification.
arXiv preprint arXiv:1801.05609.
[1260] Dhamija, A. R., Günther, M., and Boult, T. (2018). Reducing network agnostophobia.
Advances in Neural Information Processing Systems, 31.
[1261] Zheng, Z., Zheng, L., Hu, Z., and Yang, Y. (2018). Open set adversarial examples. arXiv
preprint arXiv:1809.02681.
[1262] Neal, L., Olson, M., Fern, X., Wong, W. K., and Li, F. (2018). Open set learning with
counterfactual images. In Proceedings of the European conference on computer vision (ECCV)
(pp. 613-628).
[1263] Rudd, E. M., Jain, L. P., Scheirer, W. J., and Boult, T. E. (2017). The extreme value
machine. IEEE transactions on pattern analysis and machine intelligence, 40(3), 762-768.
[1264] Vignotto, E., and Engelke, S. (2018). Extreme Value Theory for Open Set Classification–
GPD and GEV Classifiers. arXiv preprint arXiv:1808.09902.
[1265] Cardoso, D. O., Gama, J., and França, F. M. (2017). Weightless neural networks for open
set recognition. Machine Learning, 106(9), 1547-1567.
[1266] Rozsa, A., Günther, M., and Boult, T. E. (2017). Adversarial robustness: Softmax versus
openmax. arXiv preprint arXiv:1708.01697.
[1267] Shu, L., Xu, H., and Liu, B. (2017). Doc: Deep open classification of text documents. arXiv
preprint arXiv:1709.08716.
[1268] Ge, Z., Demyanov, S., Chen, Z., and Garnavi, R. (2017). Generative openmax for multi-class
open set classification. arXiv preprint arXiv:1707.07418.
[1269] Yu, Y., Qu, W. Y., Li, N., and Guo, Z. (2017). Open-category classification by adversarial
sample generation. arXiv preprint arXiv:1705.08722.
[1270] Júnior, P. R. M., Boult, T. E., Wainer, J., and Rocha, A. (2016). Specialized support vector
machines for open-set recognition. arXiv preprint arXiv:1606.03802.
[1271] Mendes Júnior, P. R., De Souza, R. M., Werneck, R. D. O., Stein, B. V., Pazinato, D. V., De
Almeida, W. R., ... and Rocha, A. (2017). Nearest neighbors distance ratio open-set classifier.
Machine Learning, 106(3), 359-386.
[1272] Dong, H., Fu, Y., Hwang, S. J., Sigal, L., and Xue, X. (2022). Learning the compositional
domains for generalized zero-shot learning. Computer Vision and Image Understanding, 221,
103454.
[1273] Vareto, R., Silva, S., Costa, F., and Schwartz, W. R. (2017, October). Towards open-set face
recognition using hashing functions. In 2017 IEEE international joint conference on biometrics
(IJCB) (pp. 634-641). IEEE.
[1274] Fei, G., and Liu, B. (2016, June). Breaking the closed world assumption in text classification.
In Proceedings of the 2016 conference of the North American chapter of the association for
computational linguistics: human language technologies (pp. 506-514).
[1275] Neira, M. A. C., Júnior, P. R. M., Rocha, A., and Torres, R. D. S. (2018). Data-fusion
techniques for open-set recognition problems. IEEE access, 6, 21242-21265.
[1276] Scheirer, W. J., Jain, L. P., and Boult, T. E. (2014). Probability models for open set recog-
nition. IEEE transactions on pattern analysis and machine intelligence, 36(11), 2317-2324.
[1277] Jain, L. P., Scheirer, W. J., and Boult, T. E. (2014). Multi-class open set recognition us-
ing probability of inclusion. In Computer Vision–ECCV 2014: 13th European Conference,
Zurich, Switzerland, September 6-12, 2014, Proceedings, Part III 13 (pp. 393-409). Springer
International Publishing.
[1278] Zhang, H., and Patel, V. M. (2016). Sparse representation-based open set recognition. IEEE
transactions on pattern analysis and machine intelligence, 39(8), 1690-1696.
[1279] Cevikalp, H. (2016). Best fitting hyperplanes for classification. IEEE transactions on pattern
analysis and machine intelligence, 39(6), 1076-1088.
[1280] Cevikalp, H., and Serhan Yavuz, H. (2017). Fast and accurate face recognition with image
sets. In Proceedings of the IEEE International Conference on Computer Vision Workshops
(pp. 1564-1572).
[1281] Gal, Y., and Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Represent-
ing model uncertainty in deep learning. Proceedings of the 33rd International Conference on
Machine Learning (ICML), 1050–1059.
[1282] Lakshminarayanan, B., Pritzel, A., and Blundell, C. (2017). Simple and scalable predic-
tive uncertainty estimation using deep ensembles. Advances in Neural Information Processing
Systems (NeurIPS), 30.
[1283] Rudd, E. M., Jain, L. P., Scheirer, W. J., and Boult, T. E. (2017). The extreme value
machine. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40(3),
762–768.
[1284] Malinin, A., and Gales, M. (2018). Predictive uncertainty estimation via prior networks.
Advances in Neural Information Processing Systems (NeurIPS), 31.
[1285] Liu, W., Wang, X., Owens, J., and Li, Y. (2020). Energy-based out-of-distribution detection.
Advances in Neural Information Processing Systems (NeurIPS), 33.
[1286] Chen, G., Peng, P., Ma, L., Li, J., Du, L., and Tian, Y. (2021). Bayesian open-world learning.
International Conference on Learning Representations (ICLR).
[1287] Nandy, J., Hsu, W., and Lee, M. L. (2020). Towards maximizing the representation gap be-
tween in-domain and out-of-distribution examples. Advances in Neural Information Processing
Systems (NeurIPS), 33.
[1288] Mukhoti, J., Kirsch, A., van Amersfoort, J., Torr, P. H., and Gal, Y. (2021). Deterministic
neural networks with inductive biases capture epistemic and aleatoric uncertainty. Proceedings
of the 38th International Conference on Machine Learning (ICML).
[1289] Kristiadi, A., Hein, M., and Hennig, P. (2020). Being Bayesian, even just a bit, fixes over-
confidence in ReLU networks. Proceedings of the 37th International Conference on Machine
Learning (ICML).
[1290] Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., ... and Snoek, J. (2019).
Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset
shift. Advances in Neural Information Processing Systems (NeurIPS), 32.
[1291] Sensoy, M., Kaplan, L., and Kandemir, M. (2018). Evidential deep learning to quantify
classification uncertainty. Advances in Neural Information Processing Systems (NeurIPS), 31.
[1292] Bendale, A., and Boult, T. E. (2016). Towards open set deep networks. Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1563–1572.
[1293] Neal, L., Olson, M., Fern, X., Wong, W. K., and Li, F. (2018). Open set learning with
counterfactual images. Proceedings of the European Conference on Computer Vision (ECCV).
[1294] Zhang, H., Li, A., Guo, J., and Guo, Y. (2020). Hybrid models for open set recognition.
Proceedings of the European Conference on Computer Vision (ECCV).
[1295] Charoenphakdee, N., Lee, J., and Sugiyama, M. (2021). On symmetric losses for learning
from corrupted labels. Proceedings of the 38th International Conference on Machine Learning
(ICML).
[1296] Hendrycks, D., Mazeika, M., and Dietterich, T. (2019). Deep anomaly detection with outlier
exposure. International Conference on Learning Representations (ICLR).
[1297] Vaze, S., Han, K., Vedaldi, A., and Zisserman, A. (2022). Generalized category discov-
ery. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR).
[1298] Liu, J., Lin, Z., Padhy, S., Tran, D., Bedrax Weiss, T., and Lakshminarayanan, B. (2020).
Simple and principled uncertainty estimation with deterministic deep learning via distance
awareness. Advances in Neural Information Processing Systems (NeurIPS), 33.
[1299] Van Amersfoort, J., Smith, L., Teh, Y. W., and Gal, Y. (2020). Uncertainty estimation using
a single deep deterministic neural network. Proceedings of the 37th International Conference
on Machine Learning (ICML).
[1300] Smith, L., and Gal, Y. (2018). Understanding measures of uncertainty for adversarial ex-
ample detection. Conference on Uncertainty in Artificial Intelligence (UAI).
[1301] Fort, S., Hu, H., and Lakshminarayanan, B. (2019). Deep ensembles: A loss landscape
perspective. Advances in Neural Information Processing Systems (NeurIPS), 32.
[1302] Ober, S. W., Rasmussen, C. E., and van der Wilk, M. (2021). The promises and pitfalls of
deep kernel learning. Proceedings of the 38th International Conference on Machine Learning
(ICML).
[1303] Sun, S., Zhang, G., Shi, J., and Grosse, R. (2019). Functional variational Bayesian neural
networks. International Conference on Learning Representations (ICLR).
[1304] Bradshaw, J., Matthews, A. G., and Ghahramani, Z. (2017). Adversarial examples, uncer-
tainty, and transfer testing robustness in Gaussian process hybrid deep networks. Proceedings
of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS).
[1305] Daxberger, E., Kristiadi, A., Immer, A., Eschenhagen, R., Bauer, M., and Hennig, P. (2021).
Laplace redux—effortless Bayesian deep learning. Advances in Neural Information Processing
Systems (NeurIPS), 34.
[1306] Kapoor, S., Benavoli, A., Azzimonti, D., and Póczos, B. (2022). Robust Bayesian inference
for simulator-based models via the MMD posterior bootstrap. Journal of Machine Learning
Research (JMLR), 23(1).
[1307] Pidhorskyi, S., Almohsen, R., and Doretto, G. (2018). Generative probabilistic novelty
detection with adversarial autoencoders. Advances in Neural Information Processing Systems
(NeurIPS), 31.
[1308] Schlegl, T., Seeböck, P., Waldstein, S. M., Schmidt-Erfurth, U., and Langs, G. (2017). Un-
supervised anomaly detection with generative adversarial networks. Medical Image Computing
and Computer-Assisted Intervention (MICCAI).
[1309] Xiao, Z., Yan, Q., and Amit, Y. (2020). Likelihood regret: An out-of-distribution detec-
tion score for variational auto-encoder. Advances in Neural Information Processing Systems
(NeurIPS), 33.
[1310] Nalisnick, E., Matsukawa, A., Teh, Y. W., Gorur, D., and Lakshminarayanan, B. (2019). Do
deep generative models know what they don’t know? International Conference on Learning
Representations (ICLR).
[1311] Choi, H., Jang, E., and Alemi, A. A. (2018). WAIC, but why? Generative ensembles for
robust anomaly detection. Advances in Neural Information Processing Systems (NeurIPS), 31.
[1312] Denouden, T., Salay, R., Czarnecki, K., Abdelzad, V., Phan, B., and Vernekar, S. (2018). Im-
proving reconstruction autoencoder out-of-distribution detection with Mahalanobis distance.
NeurIPS Workshop on Bayesian Deep Learning.
[1313] Kirichenko, P., Izmailov, P., and Wilson, A. G. (2020). Why normalizing flows fail to detect
out-of-distribution data. Advances in Neural Information Processing Systems (NeurIPS), 33.
[1314] Serra, J., Alvarez, D., Gomez, V., Slizovskaia, O., Núñez, J. F., and Luque, J. (2020).
Input complexity and out-of-distribution detection with likelihood-based generative models.
Proceedings of the 37th International Conference on Machine Learning (ICML).
[1315] Morningstar, W., Ham, C., Gallagher, A., Lakshminarayanan, B., Alemi, A., and Dillon, J.
(2021). Density-supervised deep learning for uncertainty quantification. International Confer-
ence on Learning Representations (ICLR).
[1316] Ruff, L., Kauffmann, J. R., Vandermeulen, R. A., Montavon, G., Samek, W., Kloft, M., ...
and Müller, K. R. (2021). A unifying review of deep and shallow anomaly detection. Journal
of Machine Learning Research (JMLR), 22(1).
[1317] Lampert, C. H., Nickisch, H., and Harmeling, S. (2009). Learning to detect unseen object
classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and
Pattern Recognition (pp. 951-958). IEEE.
[1318] Akata, Z., Reed, S., Walter, D., Lee, H., and Schiele, B. (2013). Evaluation of output
embeddings for fine-grained image classification. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (pp. 2927-2936).
[1319] Romera-Paredes, B., and Torr, P. H. (2015). An embarrassingly simple approach to zero-shot
learning. In International Conference on Machine Learning (pp. 2152-2161). PMLR.
[1320] Xian, Y., Lampert, C. H., Schiele, B., and Akata, Z. (2017). Zero-shot learning-a compre-
hensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 41(9), 2251-2265.
[1321] Zhang, L., Xiang, T., and Gong, S. (2017). Learning a deep embedding model for zero-shot
learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
(pp. 2021-2030).
[1322] Fu, Y., Hospedales, T. M., Xiang, T., and Gong, S. (2015). Transductive multi-view zero-
shot learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(11), 2332-
2345.
[1323] Kodirov, E., Xiang, T., and Gong, S. (2017). Semantic autoencoder for zero-shot learning.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp.
3174-3183).
[1324] Changpinyo, S., Chao, W. L., Gong, B., and Sha, F. (2016). Synthesized classifiers for
zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (pp. 5327-5336).
[1325] Kampffmeyer, M., Chen, Y., Liang, X., Wang, H., Zhang, Y., and Xing, E. P. (2019). Re-
thinking knowledge graph propagation for zero-shot learning. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition (pp. 11487-11496).
[1326] Wang, X., Ye, Y., and Gupta, A. (2018). Zero-shot recognition via semantic embeddings and
knowledge graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (pp. 6857-6866).
[1327] Li, J., Jing, M., Lu, K., Ding, Z., Zhu, L., and Huang, Z. (2019). Leveraging the invariant side
of generative zero-shot learning. In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (pp. 7402-7411).
[1328] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., et al. (2021). Learn-
ing transferable visual models from natural language supervision. In International Conference
on Machine Learning (pp. 8748-8763). PMLR.
[1329] Chao, W. L., Changpinyo, S., Gong, B., and Sha, F. (2016). An empirical study and analysis
of generalized zero-shot learning for object recognition in the wild. In European Conference
on Computer Vision (pp. 52-68). Springer.
[1330] Verma, V. K., Rai, P., and Namboodiri, A. (2018). Generalized zero-shot learning via syn-
thesized examples. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (pp. 4281-4289).
[1331] Huynh, D., and Elhamifar, E. (2020). Fine-grained generalized zero-shot learning via dense
attribute-based attention. In Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition (pp. 4483-4493).
[1332] Palatucci, M., Pomerleau, D., Hinton, G. E., and Mitchell, T. M. (2009). Zero-shot learning
with semantic output codes. In Advances in Neural Information Processing Systems (pp. 1410-
1418).
[1333] Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. (2013). Zero-shot learning through
cross-modal transfer. In Advances in Neural Information Processing Systems (pp. 935-943).
[1334] Hariharan, B., and Girshick, R. (2017). Low-shot visual recognition by shrinking and hallu-
cinating features. In Proceedings of the IEEE International Conference on Computer Vision
(pp. 3018-3027).
[1335] Xian, Y., Lorenz, T., Schiele, B., and Akata, Z. (2018). Feature generating networks for
zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition (pp. 5542-5551).
[1336] Scheirer, W. J., Rocha, A., Sapkota, A., and Boult, T. E. (2013). Toward open set recogni-
tion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7), 1757-1772.
[1337] Yang, J., Zhou, K., Li, Y., and Liu, Z. (2021). Generalized out-of-distribution detection: A
survey. arXiv preprint arXiv:2110.11334.